Proxmox VE has become one of the most serious alternatives for companies that want to regain control over their virtualization infrastructure. Its open source foundation, integration with KVM and LXC, clustering capabilities, high availability, Ceph, ZFS, and Proxmox Backup Server have placed it firmly on the radar of CIOs, CTOs, and systems teams looking for a powerful platform without being tied to closed licensing models.
But Proxmox should not be understood only as a way to reduce costs compared to VMware. That reading falls short. In production, Proxmox is a critical infrastructure platform and requires the same discipline as any enterprise environment: well-designed networking, proper storage, quorum, high availability, backup, monitoring, failure testing, and day-to-day operations.
At Stackscale, we are seeing this more and more in private cloud projects. Many companies do not simply want to “move to Proxmox”. They want to build a stable, controlled platform that is ready to grow. To do that, installing Proxmox on several servers is not enough. The architecture underneath must be properly designed.
Proxmox does not start with the hypervisor; it starts with the architecture
One of the most common mistakes when approaching Proxmox is starting with the management interface or the creation of virtual machines. It is understandable: Proxmox VE offers a very straightforward user experience and makes it easy to deploy VMs and containers quickly. But in a real environment, stability does not depend only on the hypervisor.
It depends on how the nodes have been designed, how networks are separated, what storage is used, how quorum is handled, what happens if a node fails, where backups are stored, and how much room there is to grow without redesigning the entire environment.
| Design layer | Question to solve before production |
|---|---|
| Compute | How many nodes are needed and what headroom remains in case of failure? |
| Network | Are management, migration, storage, backup, and customer traffic separated? |
| Storage | Will you use local storage, Ceph, NFS/iSCSI, network storage, or synchronous storage? |
| High availability | Which VMs should be automatically restarted, and where? |
| Quorum | Can the cluster make safe decisions during failures? |
| Backup | Where are the backups, how are they verified, and how long does recovery take? |
| Operations | Are monitoring, alerts, documentation, and maintenance procedures in place? |
This difference is what separates a functional lab from an enterprise platform. In a lab, it is enough for VMs to boot. In production, you need to ask what happens when a host fails at 3 a.m., when storage approaches its limit, when a migration overlaps with a traffic peak, or when a restore becomes the only recovery path.
Dedicated nodes: the foundation of a private cloud with Proxmox
In a Proxmox-based private cloud, compute nodes are a central component. They are not just servers where virtual machines “fit”. They are the foundation on which CPU, memory, connectivity, availability, and maintenance are distributed.
Stackscale designs these environments on dedicated nodes, allowing each customer to have exclusive compute resources, without noisy neighbours or direct competition for CPU and RAM with third parties. This separation is important when the infrastructure supports ERP systems, databases, ecommerce platforms, virtual desktops, SaaS platforms, internal systems, or customer-facing applications.
The value of Proxmox on dedicated nodes is not only performance. It is also predictability. The technical team knows what hardware is available, what real capacity exists, how much can be overcommitted, what headroom remains in case of failure, and when new nodes should be added.
| Node decision | Operational impact |
|---|---|
| Minimum number of nodes | Determines HA, quorum, and fault tolerance |
| CPU and memory per node | Determines VM density and how much overcommit is safe |
| Redundancy | Enables maintenance without bringing down the whole environment |
| Hardware homogeneity | Simplifies migrations, balancing, and operations |
| Available headroom | Prevents a failure from leaving the cluster without enough capacity |
| Spare node pool | Reduces recovery times after physical failure |
A three-node cluster may be enough for many initial environments, but not all three-node clusters are the same. If all three are running at 90% capacity, high availability will be mostly theoretical: if one node fails, the other two will not have enough real capacity to absorb its workload. That is why sizing must take failure into account, not just normal daily operation.
The question should not be “how many VMs fit”. The right question is: “how many VMs can continue running if I lose one node and need to maintain service”.
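That question can be checked with simple arithmetic. The following is a minimal sketch, not a sizing tool: the `Node` class, the `reserve_fraction` margin, and the sample figures are assumptions for illustration, and it looks only at aggregate RAM, ignoring CPU, individual VM sizes, and HA placement rules.

```python
from dataclasses import dataclass


@dataclass
class Node:
    name: str
    ram_gib: float       # physical RAM installed in the node
    used_ram_gib: float  # RAM committed to the VMs running on it


def survives_node_loss(nodes, reserve_fraction=0.1):
    """Return {node name: True/False}: can the other nodes absorb its VMs?

    reserve_fraction is a hypothetical margin kept free on each surviving
    node for the host itself and for peaks.
    """
    result = {}
    for failed in nodes:
        free = 0.0
        for node in nodes:
            if node is failed:
                continue
            usable = node.ram_gib * (1 - reserve_fraction)
            free += max(0.0, usable - node.used_ram_gib)
        result[failed.name] = free >= failed.used_ram_gib
    return result


# Three 256 GiB nodes at roughly 90% usage versus the same nodes at 50%.
busy = [Node(f"pve{i}", 256, 230) for i in range(1, 4)]
relaxed = [Node(f"pve{i}", 256, 128) for i in range(1, 4)]
print(survives_node_loss(busy))
print(survives_node_loss(relaxed))
```

With these sample figures, the busy cluster fails the check for every node, while the relaxed one passes. A real capacity review would also need per-VM placement, since free memory scattered across nodes may not fit one large VM.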
The network determines much of the cluster’s behaviour
Proxmox can move virtual machines between nodes, manage high availability, and work with different types of storage. But all these capabilities depend on the network. If the network is poorly designed, the cluster will eventually show it through latency, slow migrations, backups affecting production, or communication problems between nodes.
In a professional environment, several types of traffic should be separated, at least logically: management, Corosync, migration, storage, backup, and customer traffic. In more demanding projects, that separation should also be physical or based on dedicated networks with guaranteed bandwidth.
| Type of traffic | Risk if mixed without control |
|---|---|
| Management | Unnecessary exposure and harder operations during incidents |
| Corosync / cluster | Loss of stability if there is latency or packet loss |
| Migration | VM degradation if it competes with production traffic |
| Storage | High latency in virtual disks and databases |
| Backup | Impact during critical hours if not limited or scheduled |
| Customer | Contention with internal cluster traffic |
Corosync deserves special mention. It is the basis for cluster communication and quorum. It should not share links with heavy traffic such as storage, backup, or migration, nor depend on unstable links. A saturated network may not bring down VMs immediately, but it can make cluster management less reliable exactly when fast decisions are needed.
Live migration also depends on this layer. In Proxmox it works very well when the design supports it, but a large VM with a lot of memory and high write activity can take longer than expected if the migration network does not have enough capacity. The conclusion is simple: the network is not designed at the end. It is designed before creating the first production VM.
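To see why memory size and write activity matter, a rough back-of-the-envelope model helps. The sketch below assumes a simplified pre-copy migration in which memory is copied in rounds while the VM keeps dirtying pages at a constant rate. The function name, the `efficiency` factor, and the sample numbers are hypothetical; real migrations also depend on downtime limits, compression, and competing traffic.

```python
def estimate_migration_seconds(ram_gib, link_gbit_s, dirty_mib_s, efficiency=0.8):
    """Rough pre-copy estimate; returns None if the copy cannot catch up.

    efficiency is a hypothetical factor for protocol overhead and
    competing traffic on the migration network.
    """
    # Convert link speed (decimal gigabits) to MiB/s of usable throughput.
    bandwidth_mib_s = link_gbit_s * 1e9 / 8 / 2**20 * efficiency
    if dirty_mib_s >= bandwidth_mib_s:
        return None  # memory is dirtied faster than it can be copied
    # Geometric series of copy rounds: RAM / (bandwidth - dirty rate).
    return ram_gib * 1024 / (bandwidth_mib_s - dirty_mib_s)


# A 64 GiB VM dirtying 200 MiB/s, on three hypothetical link speeds.
for link in (1, 10, 25):
    print(link, "Gbit/s:", estimate_migration_seconds(64, link, dirty_mib_s=200))
```

Under these assumptions the 1 Gbit/s link never converges, while faster links finish in a time that shrinks as the gap between bandwidth and dirty rate grows.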
Network storage: far more than capacity
Storage is often the part that most determines the behaviour of a Proxmox environment. It is also one of the most underestimated. In production, looking only at available terabytes is not enough. You need to look at latency, IOPS, bandwidth, redundancy, snapshots, replication, recovery, growth, and maintenance.
Proxmox can work with many options: local storage, ZFS, Ceph, NFS, iSCSI, Fibre Channel, external arrays, or network storage. Each option makes sense in specific scenarios, but none is universal.
| Storage option | Main advantage | Precaution |
|---|---|---|
| Local SSD/NVMe | Very good local performance | Less flexibility for HA without replication |
| Local ZFS | Snapshots, integrity, and advanced management | Requires RAM, planning, and avoiding full pools |
| Ceph | Integrated distributed storage | Requires properly sized nodes, disks, and network |
| NFS/iSCSI | Simple integration with shared storage | Depends on array/network and availability design |
| Dedicated network storage | Separates compute and data | Requires low latency and real redundancy |
| Synchronous storage | Very low RPO and advanced continuity | Higher architectural and cost requirements |
At Stackscale, Proxmox can be supported by network storage and synchronous storage designed to decouple compute and data. This separation makes it possible to scale compute nodes and storage more independently, simplifies certain recovery scenarios, and reduces dependence on each host’s local disks.
This does not mean Ceph or ZFS are not good options. They are, when they fit. Ceph can be very powerful in environments with enough nodes, a fast network, and specialized operations. ZFS is an excellent technology for integrity, snapshots, and local performance. But in enterprise environments where the goal is a manageable and predictable private cloud platform, network storage can provide a solid foundation for operations, growth, and continuity.
The important point is not to make storage an improvised decision. Many virtualization incidents start as storage problems: pools that are too full, growing latency, forgotten snapshots, backups saturating links, arrays without enough headroom, or disks that do not behave as expected under load.
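Pools that fill up rarely do so without warning. As a minimal sketch, assuming roughly linear growth, the remaining time before a pool crosses an alert level can be projected from current usage. The function name, the 80% default, and the sample values are assumptions for illustration, not recommended thresholds.

```python
def days_until_threshold(used_tib, capacity_tib, growth_tib_per_day, threshold=0.8):
    """Days until usage crosses threshold * capacity, assuming linear growth.

    Returns 0.0 if the threshold is already crossed and None if usage
    is flat or shrinking. The 80% default is a hypothetical alert level.
    """
    limit = capacity_tib * threshold
    if used_tib >= limit:
        return 0.0
    if growth_tib_per_day <= 0:
        return None
    return (limit - used_tib) / growth_tib_per_day


# 60 TiB used out of 100 TiB, growing by 0.5 TiB per day.
print(days_until_threshold(used_tib=60, capacity_tib=100, growth_tib_per_day=0.5))
```

In practice the growth rate would come from monitoring history, and snapshots or backup jobs can make growth far from linear, so the result is an early-warning estimate rather than a forecast.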
HA in Proxmox: what it does and what it does not do
Proxmox VE high availability allows virtual machines or containers to be automatically restarted on another node when the original node fails. It is a very valuable feature and one of the reasons Proxmox can be used in enterprise environments. But its limits must be clearly understood.
HA does not mean that a VM will never be interrupted. If a node fails, the VM must start on another node. That implies recovery time. For many workloads, this is acceptable. For others, it is not. A critical database, a transactional application, or a system with connected users may also need application-level replication, load balancers, internal clusters, or its own fault-tolerance mechanisms.
| Concept | What it protects | What it does not replace |
|---|---|---|
| Proxmox HA | Node failure and automatic VM restart | Application-level high availability |
| Live migration | Planned movement of VMs between nodes | Recovery from sudden host failure |
| Backup | Recovery of data and systems | Immediate continuity |
| Application replication | Service continuity at software level | Historical backup |
| DR | Recovery from site failure or disaster | Normal operation of the local cluster |
The confusion between HA, backup, and disaster recovery is common. HA helps restore service after a node failure. Backup allows data or systems to be recovered at an earlier point in time. DR enables response to a larger failure, such as the loss of a data centre, a serious platform issue, or a security incident. They are different layers and should coexist.
Quorum: the small detail that decides the cluster
In Proxmox, as in other cluster systems, quorum is what allows decisions to be made while avoiding dangerous situations such as split-brain. If the cluster does not have quorum, it may block certain operations to protect consistency.
That is why two-node clusters must be handled with particular care. They can make sense in specific scenarios, but they usually require a qdevice or a very carefully tested design. In production environments, the most common and recommended approach is to start with at least three nodes to provide more reliable quorum.
| Cluster design | What it means in practice |
|---|---|
| 1 node | No real high availability |
| 2 nodes without qdevice | Losing either node normally means losing quorum |
| 2 nodes with qdevice | Can work in controlled scenarios |
| 3 nodes | Common baseline for reliable HA |
| 4+ nodes | Better growth and maintenance capacity |
Quorum should not be discovered during an incident. It must be tested beforehand. Simulating node failure, network loss, controlled reboots, and maintenance helps understand how the cluster behaves and avoids surprises in production.
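The majority rule behind the table above can be sketched in a few lines. This is an illustration under stated assumptions, not a model of Corosync itself: each node carries one vote, the qdevice contributes one extra vote, and the function names are hypothetical.

```python
def quorum_threshold(total_votes):
    """Smallest number of votes that is a strict majority."""
    return total_votes // 2 + 1


def has_quorum(nodes_total, nodes_alive, qdevice=False, qdevice_reachable=True):
    """Majority check assuming one vote per node and one for the qdevice."""
    total = nodes_total + (1 if qdevice else 0)
    alive = nodes_alive + (1 if qdevice and qdevice_reachable else 0)
    return alive >= quorum_threshold(total)


# Failure scenarios for the designs in the table above.
print("2 nodes, 1 lost, no qdevice:", has_quorum(2, 1))
print("2 nodes, 1 lost, qdevice:   ", has_quorum(2, 1, qdevice=True))
print("3 nodes, 1 lost:            ", has_quorum(3, 2))
print("4 nodes, 2 lost:            ", has_quorum(4, 2))
```

Under these assumptions, only the second and third scenarios keep quorum. The last line shows why an even number of nodes does not tolerate an even split: two votes out of four is not a majority.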
Snapshots, backups, and Proxmox Backup Server
Snapshots are useful, but they are not backups. This sentence is repeated often because it remains one of the most common mistakes in virtualization. A snapshot is useful to preserve the state of a VM during a short window: an update, a configuration change, a controlled test, or a deployment. If kept for too long, it can grow, consume storage, and affect performance.
Backup must be designed differently. In Proxmox environments, Proxmox Backup Server provides incremental backups, deduplication, verification, encryption, and native integration with Proxmox VE. At Stackscale, Proxmox Backup Server can be combined with Archive storage via NFS or S3-compatible access, as well as faster network storage layers when restore times need to be reduced.
| Element | Correct use |
|---|---|
| Snapshot | Temporary change and short maintenance window |
| Local backup | Fast recovery, but not enough as the only copy |
| Proxmox Backup Server | Incremental, deduplicated, and verifiable backups |
| Archive storage | Retention and separate long-term copy |
| Copy in another data centre | Protection against major site failure |
| Restore testing | Real validation that the backup works |
A professional backup policy must answer specific questions: how often backups are taken, how long they are retained, where the copy is stored, who can delete it, how it is verified, how long a restore takes, and which systems are recovered first.
A backup that has never been restored is not a guarantee. It is a hypothesis.
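Some of those policy questions can be turned into an automated check. The sketch below is one possible shape, assuming the timestamps of the last successful backup and the last restore test are already collected per VM. The function name, the data layout, and the 24-hour and 90-day limits are hypothetical examples, not recommendations.

```python
from datetime import datetime, timedelta


def policy_violations(vms, now,
                      max_backup_age=timedelta(hours=24),
                      max_restore_test_age=timedelta(days=90)):
    """vms maps VM name -> (last_backup, last_restore_test); either may be None."""
    problems = {}
    for name, (last_backup, last_restore_test) in vms.items():
        issues = []
        if last_backup is None or now - last_backup > max_backup_age:
            issues.append("backup missing or too old")
        if last_restore_test is None or now - last_restore_test > max_restore_test_age:
            issues.append("restore never tested or test too old")
        if issues:
            problems[name] = issues
    return problems


# Illustrative inventory with made-up VM names and dates.
now = datetime(2025, 1, 10, 12, 0)
inventory = {
    "erp": (now - timedelta(hours=2), now - timedelta(days=30)),
    "web": (now - timedelta(days=3), None),
    "db": (now - timedelta(hours=1), now - timedelta(days=200)),
}
print(policy_violations(inventory, now))
```

Here `erp` passes, `web` fails both checks, and `db` has a recent backup but a stale restore test, which is exactly the kind of gap that goes unnoticed until a restore is needed.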
Sizing and overcommit: efficiency without putting the cluster at risk
Proxmox allows CPU and memory to be used efficiently, but overcommit is not unlimited. Assigning too many vCPUs or too much RAM without measuring real usage eventually creates contention. The problem does not always appear at the beginning. It appears when several VMs consume resources at the same time, when backups run, when a database grows, or when a node fails and the rest of the cluster must absorb its workload.
| Resource | Risk of poor sizing | Good practice |
|---|---|---|
| vCPU | Contention and CPU latency | Measure real usage and avoid excessive allocation |
| RAM | Swapping or lack of headroom during failure | Reserve capacity for peaks and HA |
| Disk | Latency and I/O wait | Measure IOPS, not only capacity |
| Network | Bottlenecks in migration, backup, and storage | Separate traffic and size links properly |
| Backup | Impact on production | Backup windows, bandwidth limits, and monitoring |
| HA | Lack of capacity when a node fails | Design with N+1 headroom or equivalent |
In private cloud, sizing must be reviewed continuously. Workloads change. A VM that consumes little today may become critical tomorrow. An environment that started with ten VMs may grow to one hundred. The advantage of a well-designed infrastructure is that it allows nodes to be added, storage to be expanded, and resources to be adjusted without rebuilding the whole platform.
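A simple way to keep overcommit visible is to track the ratio of allocated vCPUs to physical cores, both in normal operation and after a node failure. The helper below is a minimal sketch with a hypothetical name and sample figures, and it assumes identical nodes. Acceptable ratios depend on the workload and should come from measured usage.

```python
def vcpu_per_core(total_vcpus, cores_per_node, nodes, failed_nodes=0):
    """Allocated vCPUs per physical core across the surviving nodes."""
    surviving = nodes - failed_nodes
    if surviving <= 0:
        raise ValueError("no surviving nodes")
    return total_vcpus / (cores_per_node * surviving)


# 288 vCPUs allocated on three nodes with 32 physical cores each.
print("normal operation:", vcpu_per_core(288, 32, nodes=3))
print("one node lost:   ", vcpu_per_core(288, 32, nodes=3, failed_nodes=1))
```

In this example the ratio rises from 3.0 to 4.5 when one node is lost, which is the figure that matters when deciding whether the cluster still has N+1 headroom.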
Monitoring and operations: the difference between installing and managing
A well-designed Proxmox platform needs operations. That means monitoring nodes, storage, network, backups, latency, capacity, SMART errors, CPU usage, memory, I/O wait, HA state, and scheduled tasks. It also needs useful alerts, not noise. If everything alerts, nothing alerts.
Operations also include procedures: how a node is updated, how VMs are evacuated before maintenance, how HA is tested, how backups are reviewed, how changes are documented, how access is managed, and how teams respond to storage degradation.
Proxmox has an important advantage: it is transparent. It allows many layers of the system to be inspected, works with familiar Linux tools, and can be automated through its API and CLI. But that transparency requires knowledge. It does not remove the need for administration. It makes it more visible.
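As an illustration of that automation, the sketch below queries the cluster status endpoint of the Proxmox VE REST API with an API token and reduces the answer to two facts: whether the cluster is quorate and which nodes are offline. The host, token values, helper names, and sample data are placeholders, and the response fields used here (`type`, `quorate`, `online`, `name`) should be checked against the API documentation for the installed version.

```python
import json
import urllib.request


def fetch_cluster_status(host, token_id, token_secret):
    """Call GET /cluster/status with an API token (all arguments are placeholders)."""
    request = urllib.request.Request(
        f"https://{host}:8006/api2/json/cluster/status",
        headers={"Authorization": f"PVEAPIToken={token_id}={token_secret}"},
    )
    with urllib.request.urlopen(request, timeout=10) as response:
        return json.load(response)["data"]


def summarize_cluster_status(entries):
    """Reduce the status entries to a quorum flag and a list of offline nodes."""
    quorate = any(e.get("type") == "cluster" and e.get("quorate") for e in entries)
    offline = [e["name"] for e in entries
               if e.get("type") == "node" and not e.get("online")]
    return {"quorate": bool(quorate), "offline_nodes": offline}


# Illustrative data shaped like the API response, not real output.
sample = [
    {"type": "cluster", "name": "demo", "quorate": 1, "nodes": 3},
    {"type": "node", "name": "pve1", "online": 1},
    {"type": "node", "name": "pve2", "online": 1},
    {"type": "node", "name": "pve3", "online": 0},
]
print(summarize_cluster_status(sample))
```

The network call is defined but not executed here. Against a real cluster, the token would need at least audit permissions, and a node using the default self-signed certificate would need its CA trusted by the client first.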
Proxmox as the foundation of private cloud at Stackscale
For many companies, the value of Proxmox is not only licence savings. It is the possibility of building an open, flexible, and controlled private cloud. At Stackscale, this architecture can be based on dedicated nodes, private networks, network storage, synchronous storage, backup, monitoring, support, and multi-data-centre options.
This approach helps separate responsibilities. Stackscale provides the physical foundation, connectivity, data centre environment, hardware, network, storage, and infrastructure support. The customer can focus on systems, applications, security, data, and platform evolution.
| Business need | Proxmox on Stackscale approach |
|---|---|
| Infrastructure control | Dedicated nodes and exclusive-use environment |
| Lower proprietary dependency | Open source Proxmox VE platform |
| Continuity | HA, network storage, backup, and multi-DC options |
| Growth | Expansion of nodes and storage on demand |
| Migration from VMware | Target environment design, pilot, and phased transition |
| Recovery | Proxmox Backup Server, Archive, and restore testing |
| Operations | Monitoring, support, and technical procedures |
The idea is not to present Proxmox as a magic solution. It is not. Proxmox works very well when it is designed with discipline. It can also create problems if deployed as a quick installation without proper architecture. The same is true of any virtualization platform in production.
The difference lies in acknowledging it from the start: Proxmox is not cheap virtualization. It is critical infrastructure when it supports critical workloads.
Technical summary for administrators
| Area | Quick recommendation |
|---|---|
| Cluster | Design with at least three nodes when reliable HA is required |
| Quorum | Test node loss and cluster behaviour before production |
| Network | Separate management, migration, storage, backup, and customer traffic |
| Storage | Choose according to workload: network, Ceph, ZFS, synchronous, or a combination |
| HA | Use it for automatic restart, not as a replacement for application HA |
| Backup | Use Proxmox Backup Server (PBS) with retention, verification, and tested restores |
| Capacity | Alert before the limit, not when the pool is already full |
| Sizing | Measure real consumption and reserve headroom for failure |
| Operations | Document changes, update in phases, and monitor everything |
Frequently asked questions
Is Proxmox a real alternative to VMware for private cloud?
Yes, as long as it is designed as a production platform. Proxmox VE can be a solid foundation for private cloud when combined with clustering, proper storage, segmented networking, high availability, backup, and professional operations.
Is shared storage required to use Proxmox HA?
Not always in the strict sense, but HA does require that VMs can start on other nodes with access to their disks, whether through shared storage, distributed storage, or properly designed replication. The exact architecture depends on the use case.
Is Ceph mandatory in Proxmox?
No. Ceph is a powerful option, especially for distributed storage, but it is not mandatory. Proxmox can also work with local storage, ZFS, NFS, iSCSI, Fibre Channel, network storage, or external arrays.
Do Proxmox snapshots work as backups?
No. Snapshots are useful for short maintenance windows, but they do not replace a backup strategy. For real backups, it is advisable to use Proxmox Backup Server or another backup solution, with retention, verification, and restore testing.
What does Stackscale bring to a Proxmox project?
Stackscale provides private cloud infrastructure with dedicated nodes, connectivity, network storage, synchronous storage options, backup, monitoring, and specialized support to design and operate production Proxmox environments.



