Proxmox in production: how to design a solid private cloud

Proxmox VE has become one of the most serious alternatives for companies that want to regain control over their virtualization infrastructure. Its open source foundation, integration with KVM and LXC, clustering capabilities, high availability, Ceph, ZFS, and Proxmox Backup Server have placed it firmly on the radar of CIOs, CTOs, and systems teams looking for a powerful platform without being tied to closed licensing models.

But Proxmox should not be understood only as a way to reduce costs compared to VMware. That reading falls short. In production, Proxmox is a critical infrastructure platform and requires the same discipline as any enterprise environment: well-designed networking, proper storage, quorum, high availability, backup, monitoring, failure testing, and day-to-day operations.

At Stackscale, we are seeing this more and more in private cloud projects. Many companies do not simply want to “move to Proxmox”. They want to build a stable, controlled platform that is ready to grow. To do that, installing Proxmox on several servers is not enough. The architecture underneath must be properly designed.

Proxmox does not start with the hypervisor; it starts with the architecture

One of the most common mistakes when approaching Proxmox is starting with the management interface or the creation of virtual machines. It is understandable: Proxmox VE offers a very straightforward user experience and makes it easy to deploy VMs and containers quickly. But in a real environment, stability does not depend only on the hypervisor.

It depends on how the nodes have been designed, how networks are separated, what storage is used, how quorum is handled, what happens if a node fails, where backups are stored, and how much room there is to grow without redesigning the entire environment.

| Design layer | Question to solve before production |
| --- | --- |
| Compute | How many nodes are needed and what headroom remains in case of failure? |
| Network | Are management, migration, storage, backup, and customer traffic separated? |
| Storage | Will you use local storage, Ceph, NFS/iSCSI, network storage, or synchronous storage? |
| High availability | Which VMs should be automatically restarted, and where? |
| Quorum | Can the cluster make safe decisions during failures? |
| Backup | Where are the backups, how are they verified, and how long does recovery take? |
| Operations | Are monitoring, alerts, documentation, and maintenance procedures in place? |

This difference is what separates a functional lab from an enterprise platform. In a lab, it is enough for VMs to boot. In production, you need to ask what happens when a host fails at 3 a.m., when storage approaches its limit, when a migration overlaps with a traffic peak, or when a restore becomes the only recovery path.

Dedicated nodes: the foundation of a private cloud with Proxmox

In a Proxmox-based private cloud, compute nodes are a central component. They are not just servers where virtual machines “fit”. They are the foundation on which CPU, memory, connectivity, availability, and maintenance are distributed.

Stackscale designs these environments on dedicated nodes, allowing each customer to have exclusive compute resources, without noisy neighbours or direct competition for CPU and RAM with third parties. This separation is important when the infrastructure supports ERP systems, databases, ecommerce platforms, virtual desktops, SaaS platforms, internal systems, or customer-facing applications.

The value of Proxmox on dedicated nodes is not only performance. It is also predictability. The technical team knows what hardware is available, what real capacity exists, how much can be overcommitted, what headroom remains in case of failure, and when new nodes should be added.

| Node decision | Operational impact |
| --- | --- |
| Minimum number of nodes | Determines HA, quorum, and fault tolerance |
| CPU and memory per node | Conditions VM density and overcommit |
| Redundancy | Enables maintenance without bringing down the whole environment |
| Hardware homogeneity | Simplifies migrations, balancing, and operations |
| Available headroom | Prevents a failure from leaving the cluster without enough capacity |
| Spare node pool | Reduces recovery times after physical failure |

A three-node cluster may be enough for many initial environments, but not all three-node clusters are the same. If all three are running at 90% capacity, high availability will be mostly theoretical: if one node fails, each of the other two would have to take on half of its load, which would require 135% of their capacity. That is why sizing must take failure into account, not just normal daily operation.

The question should not be “how many VMs fit”. The right question is: “how many VMs can continue running if I lose one node and need to maintain service”.
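
That question can be turned into a quick check. The following is a minimal sketch, assuming RAM is the limiting resource and that load can be spread freely across the surviving nodes; the `Node` class, the node names, the figures, and the 90% fill ceiling are all hypothetical, and a real check would also consider CPU and the placement of individual VMs.

```python
from dataclasses import dataclass


@dataclass
class Node:
    name: str
    ram_gb: float       # usable RAM on the node
    ram_used_gb: float  # RAM committed to the VMs running on it


def survives_single_failure(nodes: list[Node], max_fill: float = 0.9) -> dict[str, bool]:
    """For each node, report whether the others could absorb the whole workload.

    max_fill is the highest fraction of RAM accepted on the surviving nodes
    (an assumption, not a Proxmox setting).
    """
    total_needed = sum(n.ram_used_gb for n in nodes)
    result = {}
    for failed in nodes:
        survivors = [n for n in nodes if n is not failed]
        usable = sum(n.ram_gb * max_fill for n in survivors)
        result[failed.name] = total_needed <= usable
    return result


# Hypothetical three-node clusters: one nearly full, one with headroom.
cluster_busy = [Node(f"pve{i}", 512, 460) for i in range(1, 4)]
cluster_sized = [Node(f"pve{i}", 512, 280) for i in range(1, 4)]

print(survives_single_failure(cluster_busy))
print(survives_single_failure(cluster_sized))
```

In this sketch the nearly full cluster fails the check for every node, while the cluster sized with headroom passes it.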

The network determines much of the cluster’s behaviour

Proxmox can move virtual machines between nodes, manage high availability, and work with different types of storage. But all these capabilities depend on the network. If the network is poorly designed, the cluster will eventually show it through latency, slow migrations, backups affecting production, or communication problems between nodes.

In a professional environment, several types of traffic should be separated, at least logically: management, Corosync, migration, storage, backup, and customer traffic. In more demanding projects, that separation should also be physical or based on dedicated networks with guaranteed bandwidth.

| Type of traffic | Risk if mixed without control |
| --- | --- |
| Management | Unnecessary exposure and harder operations during incidents |
| Corosync / cluster | Loss of stability if there is latency or packet loss |
| Migration | VM degradation if it competes with production traffic |
| Storage | High latency in virtual disks and databases |
| Backup | Impact during critical hours if not limited or scheduled |
| Customer | Contention with internal cluster traffic |

Corosync deserves special mention. It is the basis for cluster communication and quorum. It should not casually coexist with heavy traffic or depend on unstable links. A saturated network may not bring down VMs immediately, but it can make cluster management less reliable exactly when fast decisions are needed.
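
As an illustration of the logical separation described above, an addressing plan can be checked so that no two traffic types share address space. This is only a sketch: the subnets below are hypothetical, and separate subnets alone do not guarantee bandwidth or physical isolation.

```python
import ipaddress
from itertools import combinations

# Hypothetical addressing plan: one subnet per traffic type.
TRAFFIC_NETWORKS = {
    "management": "10.10.0.0/24",
    "corosync": "10.10.1.0/24",
    "migration": "10.10.2.0/24",
    "storage": "10.10.3.0/24",
    "backup": "10.10.4.0/24",
    "customer": "10.20.0.0/16",
}


def overlapping_networks(plan: dict[str, str]) -> list[tuple[str, str]]:
    """Return the pairs of traffic types whose subnets overlap."""
    nets = {name: ipaddress.ip_network(cidr) for name, cidr in plan.items()}
    return [
        (a, b)
        for a, b in combinations(nets, 2)
        if nets[a].overlaps(nets[b])
    ]


print(overlapping_networks(TRAFFIC_NETWORKS))

# A plan where migration was carved out of the storage subnet by mistake.
bad_plan = dict(TRAFFIC_NETWORKS, migration="10.10.3.0/25")
print(overlapping_networks(bad_plan))
```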

Live migration also depends on this layer. In Proxmox it works very well when the design supports it, but a large VM with a lot of memory and high write activity can take longer than expected if the migration network does not have enough capacity. The conclusion is simple: the network is not designed at the end. It is designed before creating the first production VM.
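
The effect of memory size and write activity on migration time can be approximated with a deliberately simple model. The sketch below assumes memory is copied while the guest keeps dirtying pages at a constant rate, so the effective transfer rate is the link bandwidth minus the dirty rate; the function name, the 80% link efficiency, and the example figures are assumptions, and a real migration behaves less linearly.

```python
from typing import Optional


def estimate_precopy_seconds(ram_gib: float, dirty_mib_s: float,
                             link_gbit_s: float, efficiency: float = 0.8) -> Optional[float]:
    """Rough pre-copy migration time in seconds, or None if it cannot converge.

    Simplified model: time = memory / (usable bandwidth - dirty rate).
    """
    # Convert the nominal link speed to usable MiB/s.
    bandwidth_mib_s = link_gbit_s * 1000**3 / 8 / 1024**2 * efficiency
    if dirty_mib_s >= bandwidth_mib_s:
        return None  # pages are dirtied faster than they can be copied
    return ram_gib * 1024 / (bandwidth_mib_s - dirty_mib_s)


# Hypothetical VM: 256 GiB of RAM, dirtying 200 MiB/s.
print(estimate_precopy_seconds(256, 200, link_gbit_s=1))
print(estimate_precopy_seconds(256, 200, link_gbit_s=10))
```

Under these assumptions, the same VM never converges over a 1 Gbit/s link and takes several minutes over 10 Gbit/s, which is why the migration network is sized for the largest and busiest VMs rather than the average one.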

Network storage: far more than capacity

Storage is often the part that most determines the behaviour of a Proxmox environment. It is also one of the most underestimated. In production, looking only at available terabytes is not enough. You need to look at latency, IOPS, bandwidth, redundancy, snapshots, replication, recovery, growth, and maintenance.

Proxmox can work with many options: local storage, ZFS, Ceph, NFS, iSCSI, Fibre Channel, external arrays, or network storage. Each option makes sense in specific scenarios, but none is universal.

| Storage option | Main advantage | Precaution |
| --- | --- | --- |
| Local SSD/NVMe | Very good local performance | Less flexibility for HA without replication |
| Local ZFS | Snapshots, integrity, and advanced management | Requires RAM, planning, and avoiding full pools |
| Ceph | Integrated distributed storage | Requires properly sized nodes, disks, and network |
| NFS/iSCSI | Simple integration with shared storage | Depends on array/network and availability design |
| Dedicated network storage | Separates compute and data | Requires low latency and real redundancy |
| Synchronous storage | Very low RPO and advanced continuity | Higher architectural and cost requirements |

At Stackscale, Proxmox can be supported by network storage and synchronous storage designed to decouple compute and data. This separation makes it possible to scale compute nodes and storage more independently, simplifies certain recovery scenarios, and reduces dependence on each host’s local disks.

This does not mean Ceph or ZFS are not good options. They are, when they fit. Ceph can be very powerful in environments with enough nodes, a fast network, and specialized operations. ZFS is an excellent technology for integrity, snapshots, and local performance. But in enterprise environments where the goal is a manageable and predictable private cloud platform, network storage can provide a solid foundation for operations, growth, and continuity.

The important point is not to make storage an improvised decision. Many virtualization incidents start as storage problems: pools that are too full, growing latency, forgotten snapshots, backups saturating links, arrays without enough headroom, or disks that do not behave as expected under load.
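
Pools that are too full are one of the easier problems to anticipate. The sketch below projects how many days remain before a pool reaches an alert threshold, assuming linear growth; the function name, the 80% threshold, and the example figures are assumptions chosen for illustration.

```python
from typing import Optional


def days_until_threshold(used_tb: float, capacity_tb: float,
                         growth_tb_per_day: float, threshold: float = 0.8) -> Optional[float]:
    """Days until the pool reaches the alert threshold, assuming linear growth.

    Returns 0.0 if the threshold is already exceeded and None if the pool
    is not growing.
    """
    limit_tb = capacity_tb * threshold
    if used_tb >= limit_tb:
        return 0.0
    if growth_tb_per_day <= 0:
        return None
    return (limit_tb - used_tb) / growth_tb_per_day


# Hypothetical pool: 100 TB, 60 TB used, growing 0.5 TB per day.
print(days_until_threshold(60, 100, 0.5))
```

A projection like this turns "the pool is at 60%" into "there are roughly 40 days to act", which is a more useful input for capacity planning.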

HA in Proxmox: what it does and what it does not do

Proxmox VE high availability allows virtual machines or containers to be automatically restarted on another node when the original node fails. It is a very valuable feature and one of the reasons Proxmox can be used in enterprise environments. But its limits must be clearly understood.

HA does not mean that a VM will never be interrupted. If a node fails, the VM must start on another node. That implies recovery time. For many workloads, this is acceptable. For others, it is not. A critical database, a transactional application, or a system with connected users may also need application-level replication, load balancers, internal clusters, or its own fault-tolerance mechanisms.

| Concept | What it protects | What it does not replace |
| --- | --- | --- |
| Proxmox HA | Node failure and automatic VM restart | Application-level high availability |
| Live migration | Planned movement of VMs between nodes | Recovery from sudden host failure |
| Backup | Recovery of data and systems | Immediate continuity |
| Application replication | Service continuity at software level | Historical backup |
| DR | Recovery from site failure or disaster | Normal operation of the local cluster |

The confusion between HA, backup, and disaster recovery is common. HA helps restore service after a node failure. Backup allows data or systems to be recovered at an earlier point in time. DR enables response to a larger failure, such as the loss of a data centre, a serious platform issue, or a security incident. They are different layers and should coexist.

Quorum: the small detail that decides the cluster

In Proxmox, as in other cluster systems, quorum is what allows decisions to be made while avoiding dangerous situations such as split-brain. If the cluster does not have quorum, it may block certain operations to protect consistency.

That is why two-node clusters must be handled with particular care. They can make sense in specific scenarios, but they usually require a QDevice (an external vote provider) or a very carefully tested design. In production environments, the most common and recommended approach is to start with at least three nodes to provide more reliable quorum.

| Cluster design | Practical reading |
| --- | --- |
| 1 node | No real high availability |
| 2 nodes without QDevice | High risk of losing quorum |
| 2 nodes with QDevice | Can work in controlled scenarios |
| 3 nodes | Common baseline for reliable HA |
| 4+ nodes | Better growth and maintenance capacity |

Quorum should not be discovered during an incident. It must be tested beforehand. Simulating node failure, network loss, controlled reboots, and maintenance helps understand how the cluster behaves and avoids surprises in production.
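
The arithmetic behind the cluster designs above is a simple majority of votes. The following sketch assumes one vote per node and one additional vote for a QDevice; the function name is hypothetical, and real clusters can be configured with different vote counts.

```python
def quorum_summary(nodes: int, qdevice: bool = False) -> dict[str, int]:
    """Total votes, majority threshold, and how many votes can be lost."""
    total_votes = nodes + (1 if qdevice else 0)
    quorum = total_votes // 2 + 1  # strict majority of all votes
    return {
        "votes": total_votes,
        "quorum": quorum,
        "tolerated_failures": total_votes - quorum,
    }


for nodes, qdevice in [(2, False), (2, True), (3, False), (4, False), (5, False)]:
    print(nodes, "nodes, QDevice:", qdevice, quorum_summary(nodes, qdevice))
```

Under this model, two nodes without a QDevice tolerate no failure at all, while two nodes with a QDevice and three nodes both tolerate the loss of one vote. A fourth node adds capacity but not fault tolerance, since four votes still only survive one loss.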

Snapshots, backups, and Proxmox Backup Server

Snapshots are useful, but they are not backups. This sentence is repeated often because treating snapshots as backups remains one of the most common mistakes in virtualization. A snapshot is useful to preserve the state of a VM during a short window: an update, a configuration change, a controlled test, or a deployment. If kept for too long, it can grow, consume storage, and affect performance.

Backup must be designed differently. In Proxmox environments, Proxmox Backup Server provides incremental backups, deduplication, verification, encryption, and natural integration with Proxmox VE. At Stackscale, Proxmox Backup Server can be combined with Archive storage via NFS or S3-compatible access, as well as faster network storage layers when reducing restore times is required.

| Element | Correct use |
| --- | --- |
| Snapshot | Temporary change and short maintenance window |
| Local backup | Fast recovery, but not enough as the only copy |
| Proxmox Backup Server | Incremental, deduplicated, and verifiable backups |
| Archive storage | Retention and separate long-term copy |
| Copy in another data centre | Protection against major site failure |
| Restore testing | Real validation that the backup works |

A professional backup policy must answer specific questions: how often backups are taken, how long they are retained, where the copy is stored, who can delete it, how it is verified, how long a restore takes, and which systems are recovered first.

A backup that has never been restored is not a guarantee. It is a hypothesis.
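
Two of the policy questions above, how long a restore takes and which systems are recovered first, can be estimated before the first restore test. This is a rough sketch, assuming sequential restores at a constant throughput; the `RestoreItem` class, the systems, their sizes, and the 500 MB/s figure are hypothetical, and only a real restore test gives a trustworthy number.

```python
from dataclasses import dataclass


@dataclass
class RestoreItem:
    name: str
    size_gb: float
    priority: int  # 1 = restore first


def restore_plan(items: list[RestoreItem], throughput_mb_s: float) -> list[tuple[str, float]]:
    """Order systems by priority and return cumulative finish times in minutes."""
    plan = []
    elapsed_min = 0.0
    for item in sorted(items, key=lambda i: i.priority):
        elapsed_min += item.size_gb * 1000 / throughput_mb_s / 60
        plan.append((item.name, round(elapsed_min, 1)))
    return plan


# Hypothetical systems and sizes.
systems = [
    RestoreItem("file-server", 2000, priority=3),
    RestoreItem("database", 500, priority=1),
    RestoreItem("erp", 200, priority=2),
]

for name, minutes in restore_plan(systems, throughput_mb_s=500):
    print(f"{name}: restored after about {minutes} min")
```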

Sizing and overcommit: efficiency without putting the cluster at risk

Proxmox allows CPU and memory to be used efficiently, but overcommit is not unlimited. Assigning too many vCPUs or too much RAM without measuring real usage eventually creates contention. The problem does not always appear at the beginning. It appears when several VMs consume resources at the same time, when backups run, when a database grows, or when a node fails and the rest of the cluster must absorb its workload.

| Resource | Risk of poor sizing | Good practice |
| --- | --- | --- |
| vCPU | Contention and CPU latency | Measure real usage and avoid excessive allocation |
| RAM | Swapping or lack of headroom during failure | Reserve capacity for peaks and HA |
| Disk | Latency and I/O wait | Measure IOPS, not only capacity |
| Network | Bottlenecks in migration, backup, and storage | Separate traffic and size links properly |
| Backup | Impact on production | Windows, limits, and monitoring |
| HA | Lack of capacity when a node fails | Design with N+1 headroom or equivalent |

In a private cloud, sizing must be reviewed continuously. Workloads change. A VM that consumes little today may become critical tomorrow. An environment that started with ten VMs may grow to one hundred. The advantage of a well-designed infrastructure is that it allows nodes to be added, storage to be expanded, and resources to be adjusted without rebuilding the whole platform.
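
A recurring review can start from the allocation ratios per node. The sketch below computes how much vCPU and RAM has been assigned relative to the physical resources; the function name and the example node are hypothetical, and allocation ratios only show what was promised, so they should be read alongside measured usage.

```python
def overcommit_ratios(physical_cores: int, ram_gb: float,
                      vms: list[tuple[int, float]]) -> dict[str, float]:
    """Allocation ratios for one node; vms is a list of (vcpus, ram_gb)."""
    vcpus = sum(v for v, _ in vms)
    vram_gb = sum(r for _, r in vms)
    return {"cpu": vcpus / physical_cores, "ram": vram_gb / ram_gb}


# Hypothetical node: 32 cores, 256 GB of RAM, ten VMs with 8 vCPUs and 32 GB each.
ratios = overcommit_ratios(32, 256, [(8, 32.0)] * 10)
print(ratios)
if ratios["ram"] > 1.0:
    print("RAM is allocated beyond physical capacity on this node")
```

A CPU ratio above 1 is common and often acceptable when real usage is low. A RAM ratio above 1 deserves closer attention, because memory contention tends to show up as swapping rather than as slower scheduling.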

Monitoring and operations: the difference between installing and managing

A well-designed Proxmox platform needs operations. That means monitoring nodes, storage, network, backups, latency, capacity, SMART errors, CPU usage, memory, I/O wait, HA state, and scheduled tasks. It also needs useful alerts, not noise. If everything alerts, nothing alerts.

Operations also include procedures: how a node is updated, how VMs are evacuated before maintenance, how HA is tested, how backups are reviewed, how changes are documented, how access is managed, and how teams respond to storage degradation.

Proxmox has an important advantage: it is transparent. It allows many layers of the system to be inspected, works with familiar Linux tools, and can be automated through API and CLI. But that transparency requires knowledge. It does not remove the need for administration. It makes it more visible.
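
As an example of that automation, cluster health can be read over the HTTP API with an API token. The sketch below only builds the request and parses a response body; it does not contact any server. The host name, token, and sample body are placeholders, and the response fields used here (`type`, `quorate`, `online`, `name`) are assumed to match the `/cluster/status` output, so they should be checked against the API documentation for the installed version.

```python
import json
import urllib.request


def build_status_request(host: str, token_id: str, token_secret: str) -> urllib.request.Request:
    """Build (but do not send) a GET request for /cluster/status."""
    url = f"https://{host}:8006/api2/json/cluster/status"
    headers = {"Authorization": f"PVEAPIToken={token_id}={token_secret}"}
    return urllib.request.Request(url, headers=headers, method="GET")


def summarize_status(body: str) -> dict:
    """Extract quorum state and offline nodes from a JSON response body."""
    entries = json.loads(body)["data"]
    cluster = next((e for e in entries if e.get("type") == "cluster"), {})
    offline = [e["name"] for e in entries
               if e.get("type") == "node" and not e.get("online")]
    return {"quorate": bool(cluster.get("quorate")), "offline_nodes": offline}


# Placeholder host and token; replace with real values before sending.
request = build_status_request(
    "pve1.example.internal", "monitor@pve!readonly",
    "00000000-0000-0000-0000-000000000000",
)

# Hypothetical response body, not captured from a real cluster.
sample_body = json.dumps({"data": [
    {"type": "cluster", "name": "prod", "quorate": 1, "nodes": 3},
    {"type": "node", "name": "pve1", "online": 1},
    {"type": "node", "name": "pve2", "online": 1},
    {"type": "node", "name": "pve3", "online": 0},
]})

print(request.full_url)
print(summarize_status(sample_body))
```

Keeping the parsing separate from the network call makes the alerting logic easy to test with recorded responses, whichever HTTP client ends up sending the request.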

Proxmox as the foundation of private cloud at Stackscale

For many companies, the value of Proxmox is not only licence savings. It is the possibility of building an open, flexible, and controlled private cloud. At Stackscale, this architecture can be based on dedicated nodes, private networks, network storage, synchronous storage, backup, monitoring, support, and multi-data-centre options.

This approach helps separate responsibilities. Stackscale provides the physical foundation, connectivity, data centre environment, hardware, network, storage, and infrastructure support. The customer can focus on systems, applications, security, data, and platform evolution.

| Business need | Proxmox on Stackscale approach |
| --- | --- |
| Infrastructure control | Dedicated nodes and exclusive-use environment |
| Lower proprietary dependency | Open source Proxmox VE platform |
| Continuity | HA, network storage, backup, and multi-DC options |
| Growth | Expansion of nodes and storage on demand |
| Migration from VMware | Target environment design, pilot, and phased transition |
| Recovery | Proxmox Backup Server, Archive, and restore testing |
| Operations | Monitoring, support, and technical procedures |

The idea is not to present Proxmox as a magic solution. It is not. Proxmox works very well when it is designed with discipline. It can also create problems if deployed as a quick installation without proper architecture. The same is true of any virtualization platform in production.

The difference lies in acknowledging it from the start: Proxmox is not cheap virtualization. It is critical infrastructure when it supports critical workloads.

Technical summary for administrators

| Area | Quick recommendation |
| --- | --- |
| Cluster | Design with at least three nodes when reliable HA is required |
| Quorum | Test node loss and cluster behaviour before production |
| Network | Separate management, migration, storage, backup, and customer traffic |
| Storage | Choose according to workload: network, Ceph, ZFS, synchronous, or a combination |
| HA | Use it for automatic restart, not as a replacement for application HA |
| Backup | Use PBS, retention, verification, and tested restores |
| Capacity | Alert before the limit, not when the pool is already full |
| Sizing | Measure real consumption and reserve headroom for failure |
| Operations | Document changes, update in phases, and monitor everything |

Frequently asked questions

Is Proxmox a real alternative to VMware for private cloud?
Yes, as long as it is designed as a production platform. Proxmox VE can be a solid foundation for private cloud when combined with clustering, proper storage, segmented networking, high availability, backup, and professional operations.

Is shared storage required to use Proxmox HA?
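Not necessarily a shared array, but for HA to restart a VM on another node, that node must have access to the VM's disks. In practice this means shared storage, distributed storage, or properly designed replication; with asynchronous replication, the VM restarts from the last replicated state. The exact architecture depends on the use case.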

Is Ceph mandatory in Proxmox?
No. Ceph is a powerful option, especially for distributed storage, but it is not mandatory. Proxmox can also work with local storage, ZFS, NFS, iSCSI, Fibre Channel, network storage, or external arrays.

Do Proxmox snapshots work as backups?
No. Snapshots are useful for short maintenance windows, but they do not replace a backup strategy. For real backups, it is advisable to use Proxmox Backup Server or another backup solution, with retention, verification, and restore testing.

What does Stackscale bring to a Proxmox project?
Stackscale provides private cloud infrastructure with dedicated nodes, connectivity, network storage, synchronous storage options, backup, monitoring, and specialized support to design and operate production Proxmox environments.
