Proxmox in production: how to design a solid private cloud

Proxmox VE has become one of the most serious alternatives for companies that want to regain control over their virtualization infrastructure. Its open source foundation, integration with KVM and LXC, clustering capabilities, high availability, Ceph, ZFS, and Proxmox Backup Server have placed it firmly on the radar of CIOs, CTOs, and systems teams looking for a powerful platform without being tied to closed licensing models.

But Proxmox should not be understood only as a way to reduce costs compared to VMware. That reading falls short. In production, Proxmox is a critical infrastructure platform and requires the same discipline as any enterprise environment: well-designed networking, proper storage, quorum, high availability, backup, monitoring, failure testing, and day-to-day operations.

At Stackscale, we are seeing this more and more in private cloud projects. Many companies do not simply want to “move to Proxmox”. They want to build a stable, controlled platform that is ready to grow. To do that, installing Proxmox on several servers is not enough. The architecture underneath must be properly designed.

Proxmox does not start with the hypervisor; it starts with the architecture

One of the most common mistakes when approaching Proxmox is starting with the management interface or the creation of virtual machines. It is understandable: Proxmox VE offers a very straightforward user experience and makes it easy to deploy VMs and containers quickly. But in a real environment, stability does not depend only on the hypervisor.

It depends on how the nodes have been designed, how networks are separated, what storage is used, how quorum is handled, what happens if a node fails, where backups are stored, and how much room there is to grow without redesigning the entire environment.

Design layer	Question to solve before production
Compute	How many nodes are needed and what headroom remains in case of failure?
Network	Are management, migration, storage, backup, and customer traffic separated?
Storage	Will you use local storage, Ceph, NFS/iSCSI, network storage, or synchronous storage?
High availability	Which VMs should be automatically restarted, and where?
Quorum	Can the cluster make safe decisions during failures?
Backup	Where are the backups, how are they verified, and how long does recovery take?
Operations	Are monitoring, alerts, documentation, and maintenance procedures in place?

This difference is what separates a functional lab from an enterprise platform. In a lab, it is enough for VMs to boot. In production, you need to ask what happens when a host fails at 3 a.m., when storage approaches its limit, when a migration overlaps with a traffic peak, or when a restore becomes the only recovery path.

Dedicated nodes: the foundation of a private cloud with Proxmox

In a Proxmox-based private cloud, compute nodes are a central component. They are not just servers where virtual machines “fit”. They are the foundation on which CPU, memory, connectivity, availability, and maintenance are distributed.

Stackscale designs these environments on dedicated nodes, allowing each customer to have exclusive compute resources, without noisy neighbours or direct competition for CPU and RAM with third parties. This separation is important when the infrastructure supports ERP systems, databases, ecommerce platforms, virtual desktops, SaaS platforms, internal systems, or customer-facing applications.

The value of Proxmox on dedicated nodes is not only performance. It is also predictability. The technical team knows what hardware is available, what real capacity exists, how much can be overcommitted, what headroom remains in case of failure, and when new nodes should be added.

Node decision	Operational impact
Minimum number of nodes	Determines HA, quorum, and fault tolerance
CPU and memory per node	Sets the limits for VM density and overcommit
Redundancy	Enables maintenance without bringing down the whole environment
Hardware homogeneity	Simplifies migrations, balancing, and operations
Available headroom	Prevents a failure from leaving the cluster without enough capacity
Spare node pool	Reduces recovery times after physical failure

A three-node cluster may be enough for many initial environments, but not all three-node clusters are the same. If all three nodes are running at 90% capacity, high availability will be mostly theoretical: if one node fails, the other two have only about 10% of spare capacity each and cannot absorb the workload. That is why sizing must take failure into account, not just normal daily operation.

The question should not be “how many VMs fit?”. The right question is: “how many VMs can continue running if I lose one node and need to maintain service?”
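That question can be turned into a quick check. The sketch below is a minimal illustration, not a sizing tool: the `Node` class, the `survives_node_loss` function, and the 85% fill limit are assumptions made for the example. It only looks at aggregate RAM and ignores CPU and how individual VMs would actually be placed.

```python
from dataclasses import dataclass


@dataclass
class Node:
    name: str
    ram_gib: float       # physical RAM installed in the node
    used_ram_gib: float  # RAM currently committed to running VMs


def survives_node_loss(nodes: list[Node], max_fill: float = 0.85) -> bool:
    """Return True if, after losing any single node, the remaining nodes can
    hold all VM memory without exceeding max_fill of their combined RAM.

    max_fill is an illustrative safety margin, not a Proxmox setting.
    """
    total_used = sum(n.used_ram_gib for n in nodes)
    for failed in nodes:
        surviving_capacity = sum(n.ram_gib for n in nodes if n is not failed)
        if total_used > surviving_capacity * max_fill:
            return False
    return True


# Three identical 512 GiB nodes, first at 90% memory use, then at 50%.
hot_cluster = [Node(f"pve{i}", 512, 460.8) for i in range(1, 4)]
cool_cluster = [Node(f"pve{i}", 512, 256) for i in range(1, 4)]
```

With these figures, `survives_node_loss(hot_cluster)` is false and `survives_node_loss(cool_cluster)` is true: the same hardware offers real high availability only when it is run with headroom.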

The network determines much of the cluster’s behaviour

Proxmox can move virtual machines between nodes, manage high availability, and work with different types of storage. But all these capabilities depend on the network. If the network is poorly designed, the cluster will eventually show it through latency, slow migrations, backups affecting production, or communication problems between nodes.

In a professional environment, several types of traffic should be separated, at least logically: management, Corosync, migration, storage, backup, and customer traffic. In more demanding projects, that separation should also be physical or based on dedicated networks with guaranteed bandwidth.

Type of traffic	Risk if mixed without control
Management	Unnecessary exposure and harder operations during incidents
Corosync / cluster	Loss of stability if there is latency or packet loss
Migration	VM degradation if it competes with production traffic
Storage	High latency in virtual disks and databases
Backup	Impact during critical hours if not limited or scheduled
Customer	Contention with internal cluster traffic

Corosync deserves special mention. It is the basis for cluster communication and quorum. It should not casually coexist with heavy traffic or depend on unstable links. A saturated network may not bring down VMs immediately, but it can make cluster management less reliable exactly when fast decisions are needed.
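A network plan can be checked for this kind of overlap before deployment. The following sketch assumes a hypothetical plan that maps each traffic type to a network label; the labels, the `HEAVY` set, and the function name are illustrative and do not correspond to Proxmox settings.

```python
# Traffic types treated as "heavy" for this example (an assumption).
HEAVY = {"migration", "storage", "backup"}


def corosync_conflicts(plan: dict[str, str]) -> list[str]:
    """Return the heavy traffic types that share a network with Corosync.

    plan maps a traffic type (e.g. "corosync", "backup") to a network label.
    """
    corosync_net = plan["corosync"]
    return sorted(t for t in HEAVY if plan.get(t) == corosync_net)


# Hypothetical plan in which backup traffic was placed on the Corosync network.
plan = {
    "management": "vlan10",
    "corosync": "vlan20",
    "migration": "vlan30",
    "storage": "vlan40",
    "backup": "vlan20",
    "customer": "vlan100",
}
```

Here `corosync_conflicts(plan)` returns `["backup"]`, which flags the shared network before it causes instability during a backup window.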

Live migration also depends on this layer. In Proxmox it works very well when the design supports it, but a large VM with a lot of memory and high write activity can take longer than expected if the migration network does not have enough capacity. The conclusion is simple: the network is not designed at the end. It is designed before creating the first production VM.
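A rough estimate makes the dependency on the migration network visible. The sketch below uses a deliberately simplified pre-copy model in which memory is copied at the link rate while the guest keeps dirtying pages at a constant rate. The function name, the 80% link efficiency, and the constant dirty rate are assumptions, and real migrations depend on factors this model ignores.

```python
import math


def estimate_migration_seconds(
    ram_gb: float,
    link_gbit: float,
    dirty_mb_s: float,
    efficiency: float = 0.8,
) -> float:
    """Rough live-migration time under a simple pre-copy model.

    Each copy round must also resend the pages dirtied during the previous
    round, so the effective rate is (bandwidth - dirty rate). If the guest
    dirties memory faster than the link can copy it, the model never converges.
    """
    bandwidth_mb_s = link_gbit * 125 * efficiency  # 1 Gbit/s = 125 MB/s
    if dirty_mb_s >= bandwidth_mb_s:
        return math.inf
    return ram_gb * 1000 / (bandwidth_mb_s - dirty_mb_s)
```

Under these assumptions, a 64 GB VM dirtying 200 MB/s would need about 80 seconds on a 10 Gbit/s link and would never converge on a 1 Gbit/s link.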

Network storage: far more than capacity

Storage is often the part that most determines the behaviour of a Proxmox environment. It is also one of the most underestimated. In production, looking only at available terabytes is not enough. You need to look at latency, IOPS, bandwidth, redundancy, snapshots, replication, recovery, growth, and maintenance.

Proxmox can work with many options: local storage, ZFS, Ceph, NFS, iSCSI, Fibre Channel, external arrays, or network storage. Each option makes sense in specific scenarios, but none is universal.

Storage option	Main advantage	Precaution
Local SSD/NVMe	Very good local performance	Less flexibility for HA without replication
Local ZFS	Snapshots, integrity, and advanced management	Requires RAM, planning, and avoiding full pools
Ceph	Integrated distributed storage	Requires properly sized nodes, disks, and network
NFS/iSCSI	Simple integration with shared storage	Depends on array/network and availability design
Dedicated network storage	Separates compute and data	Requires low latency and real redundancy
Synchronous storage	Very low RPO and advanced continuity	Higher architectural and cost requirements

At Stackscale, Proxmox can be supported by network storage and synchronous storage designed to decouple compute and data. This separation makes it possible to scale compute nodes and storage more independently, simplifies certain recovery scenarios, and reduces dependence on each host’s local disks.

This does not mean Ceph or ZFS are not good options. They are, when they fit. Ceph can be very powerful in environments with enough nodes, a fast network, and specialized operations. ZFS is an excellent technology for integrity, snapshots, and local performance. But in enterprise environments where the goal is a manageable and predictable private cloud platform, network storage can provide a solid foundation for operations, growth, and continuity.

The important point is not to make storage an improvised decision. Many virtualization incidents start as storage problems: pools that are too full, growing latency, forgotten snapshots, backups saturating links, arrays without enough headroom, or disks that do not behave as expected under load.
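Pools that are too full are among the easiest of these problems to anticipate. As a minimal sketch, assuming linear growth and an illustrative 80% alert threshold (both are assumptions, as is the function name), the time left before a pool reaches its limit can be projected like this:

```python
def days_until_threshold(
    used_tb: float,
    capacity_tb: float,
    growth_tb_per_day: float,
    threshold: float = 0.80,
) -> float:
    """Days until usage reaches threshold * capacity, assuming linear growth."""
    limit_tb = capacity_tb * threshold
    if used_tb >= limit_tb:
        return 0.0               # already at or past the alert threshold
    if growth_tb_per_day <= 0:
        return float("inf")      # not growing, so the threshold is never reached
    return (limit_tb - used_tb) / growth_tb_per_day
```

For a 100 TB pool with 62 TB used and growing by 0.15 TB per day, the 80% mark is about 120 days away. Real growth is rarely linear, so the value is best treated as an early warning rather than a forecast.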

HA in Proxmox: what it does and what it does not do

Proxmox VE high availability allows virtual machines or containers to be automatically restarted on another node when the original node fails. It is a very valuable feature and one of the reasons Proxmox can be used in enterprise environments. But its limits must be clearly understood.

HA does not mean that a VM will never be interrupted. If a node fails, the VM must start on another node. That implies recovery time. For many workloads, this is acceptable. For others, it is not. A critical database, a transactional application, or a system with connected users may also need application-level replication, load balancers, internal clusters, or its own fault-tolerance mechanisms.

Concept	What it protects	What it does not replace
Proxmox HA	Node failure and automatic VM restart	Application-level high availability
Live migration	Planned movement of VMs between nodes	Recovery from sudden host failure
Backup	Recovery of data and systems	Immediate continuity
Application replication	Service continuity at software level	Historical backup
DR	Recovery from site failure or disaster	Normal operation of the local cluster

The confusion between HA, backup, and disaster recovery is common. HA helps restore service after a node failure. Backup allows data or systems to be recovered at an earlier point in time. DR enables response to a larger failure, such as the loss of a data centre, a serious platform issue, or a security incident. They are different layers and should coexist.

Quorum: the small detail that decides the cluster

In Proxmox, as in other cluster systems, quorum is what allows decisions to be made while avoiding dangerous situations such as split-brain. If the cluster does not have quorum, it may block certain operations to protect consistency.

That is why two-node clusters must be handled with particular care. They can make sense in specific scenarios, but they usually require a QDevice (an external arbiter that contributes a tie-breaking vote) or a very carefully tested design. In production environments, the most common and recommended approach is to start with at least three nodes to provide more reliable quorum.

Cluster design	Practical reading
1 node	No real high availability
2 nodes without QDevice	High risk of losing quorum: a single node failure is enough
2 nodes with QDevice	Can work in controlled scenarios
3 nodes	Common baseline for reliable HA
4+ nodes	Better growth and maintenance capacity

Quorum should not be discovered during an incident. It must be tested beforehand. Simulating node failure, network loss, controlled reboots, and maintenance helps understand how the cluster behaves and avoids surprises in production.
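The arithmetic behind the table above is simple majority voting. The sketch below assumes one vote per node and one additional vote from an external quorum device that stays reachable; the function name is illustrative and the model leaves out any custom vote configuration.

```python
def has_quorum(total_nodes: int, online_nodes: int, qdevice: bool = False) -> bool:
    """Majority check: more than half of all expected votes must be present.

    Assumes one vote per node, plus one vote from the quorum device when
    qdevice is True (and assumes the device itself is reachable).
    """
    extra_vote = 1 if qdevice else 0
    expected_votes = total_nodes + extra_vote
    votes_present = online_nodes + extra_vote
    return votes_present >= expected_votes // 2 + 1
```

With these assumptions, a two-node cluster that loses one node keeps quorum only if the quorum device is present, while a three-node cluster tolerates the loss of one node but not two.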

Snapshots, backups, and Proxmox Backup Server

Snapshots are useful, but they are not backups. This is repeated often because treating snapshots as backups remains one of the most common mistakes in virtualization. A snapshot is useful to preserve the state of a VM during a short window: an update, a configuration change, a controlled test, or a deployment. If kept for too long, it can grow, consume storage, and affect performance.
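Forgotten snapshots are easy to detect once their creation dates are available. The sketch below assumes a hypothetical inventory of `(vm_name, snapshot_name, created_at)` tuples, however it is exported, and an illustrative three-day limit; neither comes from Proxmox itself.

```python
from datetime import datetime, timedelta


def stale_snapshots(snapshots, now, max_age=timedelta(days=3)):
    """Return (vm, snapshot) pairs whose age exceeds max_age.

    snapshots is an iterable of (vm_name, snapshot_name, created_at) tuples.
    The three-day default is illustrative; pick a limit that matches your
    maintenance windows.
    """
    return [
        (vm, name)
        for vm, name, created_at in snapshots
        if now - created_at > max_age
    ]


# Hypothetical inventory used only for this example.
inventory = [
    ("erp-db", "pre-upgrade", datetime(2025, 1, 10, 22, 0)),
    ("web-01", "before-deploy", datetime(2025, 1, 20, 9, 0)),
]
```

Evaluated on 21 January 2025, only the `pre-upgrade` snapshot of `erp-db` is reported, because it has outlived the short window it was created for.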

Backup must be designed differently. In Proxmox environments, Proxmox Backup Server provides incremental backups, deduplication, verification, encryption, and natural integration with Proxmox VE. At Stackscale, Proxmox Backup Server can be combined with Archive storage via NFS or S3-compatible access, as well as faster network storage layers when reducing restore times is required.

Element	Correct use
Snapshot	Temporary change and short maintenance window
Local backup	Fast recovery, but not enough as the only copy
Proxmox Backup Server	Incremental, deduplicated, and verifiable backups
Archive storage	Retention and separate long-term copy
Copy in another data centre	Protection against major site failure
Restore testing	Real validation that the backup works

A professional backup policy must answer specific questions: how often backups are taken, how long they are retained, where the copy is stored, who can delete it, how it is verified, how long a restore takes, and which systems are recovered first.

A backup that has never been restored is not a guarantee. It is a hypothesis.
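Before a real restore test, the question of how long a restore takes can at least be bounded on paper. This sketch assumes a sustained restore throughput, which in practice has to be measured; the function names and figures are illustrative.

```python
def restore_hours(data_tb: float, throughput_mb_s: float) -> float:
    """Hours needed to restore data_tb at a sustained throughput in MB/s."""
    seconds = data_tb * 1_000_000 / throughput_mb_s  # 1 TB = 1,000,000 MB
    return seconds / 3600


def meets_rto(data_tb: float, throughput_mb_s: float, rto_hours: float) -> bool:
    """True if the estimated restore time fits within the recovery objective."""
    return restore_hours(data_tb, throughput_mb_s) <= rto_hours
```

Restoring 4 TB at 500 MB/s takes roughly 2.2 hours and fits a four-hour objective; at 100 MB/s the same restore takes about 11 hours and does not. The estimate only sets expectations, and an actual restore test is still what validates it.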

Sizing and overcommit: efficiency without putting the cluster at risk

Proxmox allows CPU and memory to be used efficiently, but overcommit is not unlimited. Assigning too many vCPUs or too much RAM without measuring real usage eventually creates contention. The problem does not always appear at the beginning. It appears when several VMs consume resources at the same time, when backups run, when a database grows, or when a node fails and the rest of the cluster must absorb its workload.

Resource	Risk of poor sizing	Good practice
vCPU	Contention and CPU latency	Measure real usage and avoid excessive allocation
RAM	Swapping or lack of headroom during failure	Reserve capacity for peaks and HA
Disk	Latency and I/O wait	Measure IOPS, not only capacity
Network	Bottlenecks in migration, backup, and storage	Separate traffic and size links properly
Backup	Impact on production	Windows, limits, and monitoring
HA	Lack of capacity when a node fails	Design with N+1 headroom or equivalent

In a private cloud, sizing must be reviewed continuously. Workloads change. A VM that consumes little today may become critical tomorrow. An environment that started with ten VMs may grow to one hundred. The advantage of a well-designed infrastructure is that it allows nodes to be added, storage to be expanded, and resources to be adjusted without rebuilding the whole platform.
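One way to keep that review concrete is to track how much has been allocated against what physically exists. The sketch below computes allocation ratios for a single node; the `VM` class and function name are hypothetical, and the ratios describe allocation only, so they complement measured usage rather than replace it.

```python
from dataclasses import dataclass


@dataclass
class VM:
    name: str
    vcpus: int
    ram_gib: int


def node_commitment(vms: list[VM], physical_threads: int, physical_ram_gib: int):
    """Return (vCPU ratio, RAM ratio) of allocated versus physical resources.

    A ratio above 1.0 means more has been allocated than physically exists.
    """
    vcpu_ratio = sum(vm.vcpus for vm in vms) / physical_threads
    ram_ratio = sum(vm.ram_gib for vm in vms) / physical_ram_gib
    return vcpu_ratio, ram_ratio


# Hypothetical VMs on a node with 64 threads and 512 GiB of RAM.
vms = [VM("erp", 16, 64), VM("db", 32, 128), VM("vdi", 48, 192)]
```

For this example, `node_commitment(vms, 64, 512)` returns `(1.5, 0.75)`: CPU is overcommitted by 50%, while a quarter of the RAM remains unallocated for peaks and failover.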

Monitoring and operations: the difference between installing and managing

A well-designed Proxmox platform needs operations. That means monitoring nodes, storage, network, backups, latency, capacity, SMART errors, CPU usage, memory, I/O wait, HA state, and scheduled tasks. It also needs useful alerts, not noise. If everything alerts, nothing alerts.

Operations also include procedures: how a node is updated, how VMs are evacuated before maintenance, how HA is tested, how backups are reviewed, how changes are documented, how access is managed, and how teams respond to storage degradation.

Proxmox has an important advantage: it is transparent. It allows many layers of the system to be inspected, works with familiar Linux tools, and can be automated through its API and CLI. But that transparency requires knowledge. It does not remove the need for administration. It makes it more visible.
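As an illustration of that automation, the sketch below queries the cluster status endpoint of the Proxmox VE API with an API token and summarizes the result. The host, token values, and function names are placeholders. The response fields used here (`type`, `quorate`, `online`, `name`) are assumptions about the shape of the data that should be checked against the API documentation for your version, and the code assumes the node presents a certificate the client trusts.

```python
import json
import urllib.request


def fetch_cluster_status(host: str, token_id: str, token_secret: str) -> list:
    """GET /api2/json/cluster/status using an API token.

    token_id has the form "user@realm!tokenname". Assumes a trusted
    TLS certificate on the node; host and tokens are placeholders.
    """
    request = urllib.request.Request(
        f"https://{host}:8006/api2/json/cluster/status",
        headers={"Authorization": f"PVEAPIToken={token_id}={token_secret}"},
    )
    with urllib.request.urlopen(request, timeout=10) as response:
        return json.load(response)["data"]


def summarize(entries: list) -> tuple:
    """Return (cluster_is_quorate, sorted names of offline nodes).

    Field names are assumed from the expected response shape.
    """
    quorate = any(
        e.get("type") == "cluster" and bool(e.get("quorate")) for e in entries
    )
    offline = sorted(
        e["name"] for e in entries
        if e.get("type") == "node" and not e.get("online")
    )
    return quorate, offline
```

A call such as `summarize(fetch_cluster_status("pve1.example.com", token_id, token_secret))` could feed a monitoring check that alerts when the cluster is not quorate or when any node is reported offline.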

Proxmox as the foundation of private cloud at Stackscale

For many companies, the value of Proxmox is not only licence savings. It is the possibility of building an open, flexible, and controlled private cloud. At Stackscale, this architecture can be based on dedicated nodes, private networks, network storage, synchronous storage, backup, monitoring, support, and multi-data-centre options.

This approach helps separate responsibilities. Stackscale provides the physical foundation, connectivity, data centre environment, hardware, network, storage, and infrastructure support. The customer can focus on systems, applications, security, data, and platform evolution.

Business need	Proxmox on Stackscale approach
Infrastructure control	Dedicated nodes and exclusive-use environment
Lower proprietary dependency	Open source Proxmox VE platform
Continuity	HA, network storage, backup, and multi-DC options
Growth	Expansion of nodes and storage on demand
Migration from VMware	Target environment design, pilot, and phased transition
Recovery	Proxmox Backup Server, Archive, and restore testing
Operations	Monitoring, support, and technical procedures

The idea is not to present Proxmox as a magic solution. It is not. Proxmox works very well when it is designed with discipline. It can also create problems if deployed as a quick installation without proper architecture. The same is true of any virtualization platform in production.

The difference lies in acknowledging it from the start: Proxmox is not cheap virtualization. It is critical infrastructure when it supports critical workloads.

Technical summary for administrators

Area	Quick recommendation
Cluster	Design with at least three nodes when reliable HA is required
Quorum	Test node loss and cluster behaviour before production
Network	Separate management, Corosync, migration, storage, backup, and customer traffic
Storage	Choose according to workload: network, Ceph, ZFS, synchronous, or a combination
HA	Use it for automatic restart, not as a replacement for application HA
Backup	Use PBS, retention, verification, and tested restores
Capacity	Alert before the limit, not when the pool is already full
Sizing	Measure real consumption and reserve headroom for failure
Operations	Document changes, update in phases, and monitor everything

Frequently asked questions

Is Proxmox a real alternative to VMware for private cloud?
Yes, as long as it is designed as a production platform. Proxmox VE can be a solid foundation for private cloud when combined with clustering, proper storage, segmented networking, high availability, backup, and professional operations.

Is shared storage required to use Proxmox HA?
Not necessarily, but the VM's disks must be reachable from the node where it restarts. That can be achieved through shared storage, distributed storage, or properly designed replication. The exact architecture depends on the use case.

Is Ceph mandatory in Proxmox?
No. Ceph is a powerful option, especially for distributed storage, but it is not mandatory. Proxmox can also work with local storage, ZFS, NFS, iSCSI, Fibre Channel, network storage, or external arrays.

Do Proxmox snapshots work as backups?
No. Snapshots are useful for short maintenance windows, but they do not replace a backup strategy. For real backups, it is advisable to use Proxmox Backup Server or another backup solution, with retention, verification, and restore testing.

What does Stackscale bring to a Proxmox project?
Stackscale provides private cloud infrastructure with dedicated nodes, connectivity, network storage, synchronous storage options, backup, monitoring, and specialized support to design and operate production Proxmox environments.
