
Proxmox VE is a popular choice for teams building virtualization infrastructure on bare metal, but what works at small scale doesn’t automatically translate to production. This guide covers what actually changes as your deployment grows, from storage architecture and clustering to backup strategy and operational support.


Proxmox VE is a genuinely capable virtualization platform. It’s free, it’s open source, it runs KVM virtual machines and LXC containers side by side, and it has a web interface that doesn’t make you want to throw your monitor out the window. For a lot of teams, it’s the obvious choice when standing up a small cluster or migrating away from VMware.

But there’s a gap between “running Proxmox in a lab or small environment” and “running Proxmox reliably at production scale”. The platform itself can handle larger deployments, but the infrastructure decisions you make early on either support that growth or fight against it. This article walks through what actually changes as you scale Proxmox, what you need to think about at each stage, and how your hosting and infrastructure choices affect your options.

If you’re just getting started with Proxmox on bare metal, our installation guide covers the basics of getting up and running. This article picks up from there.

What “Small Scale” Proxmox Usually Looks Like

Most Proxmox deployments start simply. You have one, two, or three physical servers. You’ve installed Proxmox on each node, maybe formed a basic cluster so you can see them all in one interface, and you’re running VMs. Storage is probably local ZFS or a simple NFS share. Backups might be Proxmox Backup Server on one of the nodes, or even just a scheduled dump to a NAS.

This works fine. It’s a reasonable setup for a dev environment, a small business with modest VM needs, or a proof of concept. But it has limitations that become more painful as workloads grow:

Single points of failure everywhere. Local storage means a failed drive or node takes VMs down with it. If you haven’t configured HA, losing a node means manually restarting those VMs somewhere else.

Backup isn’t the same as high availability. You can restore a VM from a Proxmox Backup Server backup, but restores take time. Applications that need to stay up through hardware failures need live migration and shared storage, not just a good backup.

Manual management doesn’t scale. Managing 10 VMs by hand through a GUI is fine. Managing 100 or 500 requires automation, templating, and consistent configuration that ad-hoc single-node setups rarely have.

Network configuration becomes a liability. A single server with basic networking is simple. Adding more nodes with VLAN tagging, bonded NICs, and isolated networks requires more planning than most small deployments start with.

The good news is that Proxmox has real solutions for all of these. The transition from small-scale to production just requires intentional decisions about hardware, networking, storage, and operations.

The Hardware Foundation Changes at Scale

Small Proxmox deployments often run on whatever hardware is available. A refurbished server, a workstation, or even a high-end desktop. At production scale, the hardware requirements become more specific.

CPU: Proxmox uses KVM for VMs, which relies on hardware virtualization extensions (Intel VT-x or AMD-V). These are present in essentially all modern server CPUs, but the number of cores and clock speed determine how many VMs you can run and at what performance level. For larger deployments, multi-socket servers with higher core counts give you more scheduling flexibility. If you’re running workloads that benefit from Intel TDX for confidential computing, hardware support for that needs to be considered at the node level.

Memory: RAM is often the binding constraint before CPU becomes an issue. VMs are typically allocated several GB each, and overcommitting RAM leads to performance degradation through swapping. For production deployments, plan for actual memory usage rather than overcommitting, and make sure nodes have enough physical RAM to handle workloads even if one node fails and its VMs migrate to remaining nodes.
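The N-1 capacity math is simple enough to sanity-check with a script. The figures below (node count, RAM sizes, headroom percentage) are hypothetical placeholders; substitute your own cluster's numbers:

```shell
#!/bin/sh
# N-1 capacity check: can the surviving nodes absorb a failed node's VMs?
# All figures are hypothetical examples.
NODES=3
RAM_PER_NODE=256   # GB of physical RAM per node
VM_RAM_TOTAL=420   # GB of RAM allocated across all VMs in the cluster
HEADROOM_PCT=10    # reserved for the host OS, pve daemons, and storage services

USABLE_PER_NODE=$(( RAM_PER_NODE * (100 - HEADROOM_PCT) / 100 ))
SURVIVING_CAPACITY=$(( (NODES - 1) * USABLE_PER_NODE ))

if [ "$VM_RAM_TOTAL" -le "$SURVIVING_CAPACITY" ]; then
  echo "OK: ${SURVIVING_CAPACITY} GB of N-1 capacity covers ${VM_RAM_TOTAL} GB of VM RAM"
else
  echo "RISK: a node failure leaves ${SURVIVING_CAPACITY} GB for ${VM_RAM_TOTAL} GB of VM RAM"
fi
```

If the check fails, either the cluster needs more physical RAM per node or fewer VMs per node before HA can be trusted to actually fail over.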

Storage: This is where the most significant decisions live, and we’ll cover it in detail below.

Networking: Production Proxmox clusters need at minimum two network interfaces per node: one for management and VM traffic, one for storage traffic (especially if you’re using Ceph). For serious deployments, bonded NICs (802.3ad/LACP) provide both redundancy and additional throughput. Dedicated storage networks keep Ceph replication traffic from competing with VM traffic.

NVMe vs SATA SSD vs HDD: For most production VM workloads, NVMe drives are worth the cost. The IOPS difference between NVMe and SATA SSD is significant for workloads with frequent random I/O, and the difference between SSD and spinning disk is even more dramatic. If you’re working with providers like OpenMetal, all servers ship with Micron 7450 or 7500 MAX NVMe drives as the baseline, which removes a lot of the performance guesswork.

When you’re deciding whether to build your own physical infrastructure or use a bare metal IaaS provider, hardware consistency is a real consideration. At scale, having identical or very similar nodes simplifies cluster management, live migration, and troubleshooting considerably.

Clustering: What Changes and Why It Matters

A single Proxmox node isn’t really a cluster. You get a management interface for that node, but you lose the features that make Proxmox worth running at scale: live migration, high availability, and centralized management.

A Proxmox cluster requires a minimum of three nodes to implement HA properly. This is because Proxmox HA uses a quorum-based mechanism to avoid split-brain scenarios. With only two nodes, if they lose connectivity to each other, neither can be certain the other is actually down (rather than just network-isolated), so neither takes action. Three nodes means a majority can always be established. (A two-node cluster can gain a tie-breaking vote from an external QDevice, but three full nodes remains the standard recommendation.)
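Forming the cluster itself is a short procedure via the pvecm tool. A sketch with hypothetical cluster name and address (run the first command on one node, the second on each joining node):

```shell
# On the first node: create the cluster (name is arbitrary)
pvecm create prod-cluster

# On each additional node: join using the first node's IP
pvecm add 10.0.0.11

# Verify quorum and membership from any node
pvecm status
```

Note that joining a node with existing guests is not supported; add nodes to the cluster while they are still empty.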

What Clustering Enables

Live migration lets you move running VMs between nodes without downtime. This is useful for maintenance (drain a node before rebooting for updates), load balancing (move VMs to less-loaded nodes), and HA failover.

Proxmox HA automatically restarts VMs on other nodes when a node fails. It monitors node health and VM health separately, so it can respond to both hardware failures and VM crashes.

The cluster-wide GUI gives you a single view of all nodes, VMs, and storage. Without clustering, you’d need to log into each node separately.

What Clustering Requires

Reliable low-latency network connectivity between all nodes. Proxmox cluster communication is sensitive to packet loss and high latency. Cluster nodes that are geographically distributed (across different datacenters or availability zones) need dedicated cluster links with reliable connectivity, not shared internet connections.

Shared storage for HA to work. When a VM needs to restart on a different node, that node needs access to the VM’s disk. Local ZFS storage doesn’t provide this. You need either Ceph (which Proxmox integrates with natively) or an external shared storage solution.

Time synchronization across nodes. Proxmox is sensitive to clock drift between cluster members. NTP should be configured and verified before expanding a cluster.

If you’re evaluating IaaS providers for your Proxmox nodes, private network connectivity between servers matters a lot here. OpenMetal’s infrastructure provides dual 10 Gbps private links per server on isolated VLANs, which means cluster communication and storage replication traffic stays off the public network entirely and benefits from low latency between nodes in the same location.

Storage Architecture: The Most Important Decision You’ll Make

Storage is where small-scale Proxmox deployments and production deployments diverge most significantly. The storage approach you choose affects performance, redundancy, operational complexity, and cost at every scale.

Local ZFS

ZFS on local drives is the default storage approach for many Proxmox deployments. It’s straightforward to set up, performs well, and has solid data integrity features including checksumming, snapshots, and built-in compression.

At small scale, local ZFS is a perfectly reasonable choice. If you’re running a single node or a small cluster where HA isn’t required, local storage is simpler and often faster than networked alternatives.

The limitation at scale is that local ZFS storage isn’t shared between nodes. A VM’s disk image lives on one physical server. If that server needs maintenance or fails, VMs using that storage can’t live-migrate to another node. HA is also not possible for VMs on local storage.

You can work around this at small scale by using Proxmox Backup Server to maintain snapshots that can be restored elsewhere, but that’s recovery rather than high availability.

Ceph (Recommended for Production Multi-Node Clusters)

Ceph is a distributed storage system that Proxmox integrates with natively. It stores data across multiple nodes, and Proxmox can both deploy and manage Ceph directly from its interface.

At production scale, Ceph provides several things local storage can’t:

Shared storage for live migration and HA. VM disks in a Ceph RBD (RADOS Block Device) pool are accessible from any node in the cluster. This is what makes live migration and HA actually work.

Configurable redundancy. You can set the replication factor (2x or 3x copies of each piece of data) based on your durability requirements. 3x replication, the Ceph default, means the cluster can lose an entire node without data loss or availability impact.

Horizontal scalability. Adding nodes to a Ceph cluster increases both storage capacity and throughput. The cluster automatically rebalances data across available nodes.

Erasure coding for larger clusters. At sufficient scale (typically 6+ OSDs), erasure coding can reduce storage overhead compared to full replication while maintaining redundancy. This is worth looking at once you’re beyond a basic three-node cluster.
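The capacity difference is easy to quantify. Using hypothetical round numbers, 3x replication yields a third of raw capacity as usable space, while a 4+2 erasure-coded pool yields two thirds:

```shell
#!/bin/sh
# Usable capacity: 3x replication vs 4+2 erasure coding (hypothetical 100 TB raw)
RAW_TB=100
REPL_USABLE=$(( RAW_TB / 3 ))                    # three full copies of every object
EC_K=4
EC_M=2
EC_USABLE=$(( RAW_TB * EC_K / (EC_K + EC_M) ))   # 4 data chunks + 2 parity chunks
echo "3x replication: ${REPL_USABLE} TB usable; EC ${EC_K}+${EC_M}: ${EC_USABLE} TB usable"
```

The tradeoff for that extra usable space is higher CPU and network cost on writes and recovery, which is why erasure coding only tends to pay off at larger node counts.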

The tradeoff is that Ceph adds operational complexity. It has its own concepts, its own CLI tools, and its own failure modes. It also has minimum requirements: you need at least three storage nodes (Ceph OSDs) for a redundant cluster. In a three-node Proxmox cluster where each node contributes storage, this works naturally. Smaller setups don’t have enough nodes to run Ceph safely.
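Proxmox wraps the common Ceph operations in its pveceph tool. A sketch of the bootstrap sequence, with hypothetical device names and subnets (repeat the monitor and OSD steps on additional storage nodes):

```shell
# Install Ceph packages, then initialize with separate public and cluster networks
pveceph install
pveceph init --network 10.0.0.0/24 --cluster-network 10.0.1.0/24

# Create a monitor on this node (repeat on two more nodes for a quorum of 3)
pveceph mon create

# Turn a blank NVMe drive into an OSD (repeat per drive, per node)
pveceph osd create /dev/nvme1n1

# Create a replicated RBD pool for VM disks: 3 copies, stays writable with 2
pveceph pool create vm-pool --size 3 --min_size 2
```

This is a sketch, not a runbook; the Proxmox Ceph documentation covers the full sequence and the GUI can drive the same steps.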

Ceph also has network requirements. Storage replication traffic between Ceph nodes should be on a dedicated network, separate from VM and management traffic. In practice this means either dedicated NICs for Ceph, or careful VLAN configuration. Running Ceph replication over the same interface as VM traffic creates contention that degrades both.

External Shared Storage (NFS, iSCSI, SMB)

For clusters that can’t or don’t want to run Ceph, external shared storage is another option. NFS and iSCSI are both supported by Proxmox and provide shared storage without requiring Ceph on the cluster nodes themselves.

This approach works well when you already have a storage array (NetApp, Pure Storage, etc.) or when you want to separate storage management from compute management. The operational overhead of managing a separate storage system is a real consideration, though.

For latency-sensitive workloads, iSCSI generally outperforms NFS because of lower protocol overhead. For workloads that are less latency-sensitive and more concerned with manageability, NFS is often simpler to operate.

The question of which storage approach makes sense depends heavily on your workload mix, your team’s existing skills, and your scale. At OpenMetal, we’ve written specifically about the Ceph vs ZFS decision for bare metal Proxmox deployments if you want to go deeper on those tradeoffs.

Networking at Scale

Small Proxmox deployments often have simple networking: one or two interfaces, basic Linux bridges, maybe some VLANs. Production deployments need more structure.

Network separation: At minimum, production Proxmox clusters should separate management traffic (SSH, web UI, cluster communication) from VM traffic (guest operating system network access) from storage traffic (Ceph replication). Mixing all of this on a single interface creates contention and security exposure.

Bonding/LACP: For nodes that carry significant traffic, bonded NICs provide both redundancy and additional throughput. Linux bond mode 4 (802.3ad LACP) requires switch support but provides active-active bonding where both interfaces carry traffic. Bond mode 1 (active-backup) is simpler but only uses one interface at a time.

VLAN tagging: In larger environments, VLANs allow you to isolate different VM groups, different tenants, or different network segments on the same physical infrastructure. Proxmox supports VLAN-aware bridges that allow VMs to tag their own traffic, which is more flexible than creating separate bridges for each VLAN.

Jumbo frames: For storage networks, enabling jumbo frames (MTU 9000) can meaningfully improve Ceph replication throughput by reducing protocol overhead per packet. This requires switch support and consistent MTU configuration across all nodes and switches on the storage VLAN.
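On a Proxmox node these pieces come together in /etc/network/interfaces. A hedged sketch with hypothetical interface names, addresses, and VLAN ranges: an LACP bond under a VLAN-aware bridge for VM traffic, plus a dedicated storage bond at MTU 9000:

```
# /etc/network/interfaces fragment (hypothetical names and addressing)

auto bond0
iface bond0 inet manual
    bond-slaves eno1 eno2
    bond-mode 802.3ad          # LACP; requires matching switch configuration
    bond-xmit-hash-policy layer3+4

auto vmbr0
iface vmbr0 inet static
    address 10.0.0.11/24
    gateway 10.0.0.1
    bridge-ports bond0
    bridge-stp off
    bridge-fd 0
    bridge-vlan-aware yes      # VMs tag their own VLANs on this bridge
    bridge-vids 2-4094

# Dedicated storage network for Ceph replication, jumbo frames enabled
auto bond1
iface bond1 inet static
    address 10.0.1.11/24
    bond-slaves eno3 eno4
    bond-mode 802.3ad
    mtu 9000                   # switch ports must also permit MTU 9000
```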

SDN (Software Defined Networking): Proxmox has a built-in SDN feature that provides more advanced networking capabilities including VXLANs and complex network topologies. This is worth evaluating for larger deployments that need tenant isolation or complex routing, though it adds management overhead compared to simple bridge networking.

At OpenMetal, bare metal and hosted private cloud customers get dual 10 Gbps private links per server on isolated customer VLANs. This is particularly useful for Proxmox Ceph clusters because storage replication traffic stays on dedicated private links rather than competing with public or VM traffic.

Backup Strategy Changes at Scale

Backup requirements at scale are different from what works for a handful of VMs.

Proxmox Backup Server (PBS) is the native backup solution for Proxmox. It supports incremental backups (sending only changed blocks rather than full disk images), deduplication, and compression. For small to medium deployments, a dedicated PBS server is usually the right approach.

At larger scales, a few things change:

Backup windows and bandwidth: With many VMs, completing backups within a maintenance window becomes harder. PBS’s incremental backup support helps significantly here. Rather than dumping full disk images every night, it transfers only changed blocks, which reduces both backup duration and storage requirements.

Retention and storage: Deciding how many backup generations to keep and for how long affects storage requirements. For regulated environments, you may have specific retention requirements. PBS supports configurable retention policies.
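Retention in Proxmox is expressed as keep rules on the storage definition. A hypothetical storage.cfg entry for a PBS datastore, keeping roughly a year of progressively sparser restore points:

```
# /etc/pve/storage.cfg fragment (hypothetical names and address)
pbs: backup-main
    server 10.0.2.20
    datastore main
    username backup@pbs
    prune-backups keep-daily=7,keep-weekly=4,keep-monthly=12
```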

Off-site or off-cluster backup: Storing backups only within the same cluster they protect is a risk. If the cluster has a major failure (power event, network partition, physical disaster), backups go down with the production environment. At scale, replicating backups to a separate location matters.

Options for off-site backup from Proxmox include syncing PBS backups to an S3-compatible object store (AWS S3, MinIO, or Ceph’s own S3-compatible gateway), replicating to a PBS instance in a different datacenter, or using a commercial backup solution that supports Proxmox. OpenMetal’s Ceph clusters support S3-compatible object storage via the Ceph RADOS Gateway, which can serve as a backup target for PBS.
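On the PBS side, off-site replication to a second PBS instance is a remote plus a sync job. A sketch with hypothetical hostnames and datastore names, run on the receiving off-site server so it pulls from production:

```shell
# On the off-site PBS server: register the production PBS as a remote
proxmox-backup-manager remote create prod-pbs \
    --host 203.0.113.10 --auth-id sync@pbs

# Pull the production datastore into the local one on a schedule
proxmox-backup-manager sync-job create offsite-pull \
    --remote prod-pbs --remote-store main --store offsite --schedule daily
```

A pull-based job like this also limits blast radius: credentials on the production side cannot delete the off-site copies.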

VM replication vs backup: Proxmox has a built-in replication feature that continuously syncs VM disk state to another node or cluster. This is faster to recover from than restoring a backup (minutes vs potentially hours), but it’s not a substitute for backup because it replicates corruption and accidental deletions along with good data.

High Availability Configuration

HA in Proxmox is more specific than just “the cluster keeps running.” It’s a configurable system where you define which VMs should be automatically restarted on surviving nodes if their current node fails.

HA groups let you define preferences for where VMs should run. You can set primary nodes and fallback nodes, and assign priorities so the most important VMs get resources first if capacity is constrained during a failover event.
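HA groups and resources can be defined from the GUI or the ha-manager CLI. A sketch with hypothetical node names and VM IDs (higher priority numbers are preferred placement targets):

```shell
# Prefer node pve1, fall back to pve2, for this group
ha-manager groupadd db-tier --nodes "pve1:2,pve2:1"

# Register VM 100 as an HA resource in that group
ha-manager add vm:100 --group db-tier --state started

# Check current HA placement and service state
ha-manager status
```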

Fencing is required for HA to work safely. Before a failed node’s VMs are restarted elsewhere, the cluster needs to be certain the failed node is actually down and not just disconnected from the network (where it might still be running the same VMs, creating split-brain). Proxmox’s HA manager relies on watchdog-based self-fencing, using the Linux softdog module when no hardware watchdog is configured; hardware watchdog timers, IPMI/BMC power control, and external power management are more robust options. Without working fencing, Proxmox HA won’t restart VMs from a failed node because it can’t guarantee safety.

At OpenMetal, bare metal servers include IPMI access, which supports IPMI-based fencing. This is worth configuring explicitly when setting up HA rather than relying on watchdog fencing alone.

HA storage requirements: All VMs in an HA group must use shared storage. VMs on local ZFS cannot be included in HA. This is another reason to deploy Ceph early if HA is a requirement.

Security Considerations Grow with the Cluster

A single-node Proxmox install in a homelab has different security requirements than a production cluster handling multiple workloads.

Node hardening: Proxmox nodes are Debian Linux under the hood, and they should be treated like any production Linux server from a security standpoint. This means keeping packages updated, restricting SSH access, configuring firewalls to limit management interface exposure, and disabling unused services. We’ve documented specific steps for this in our Proxmox node security guide.

Management network isolation: The Proxmox web interface and API should not be exposed to the public internet. Management access should come through a VPN or bastion host, or at minimum be restricted to known IP ranges via firewall rules.

API tokens: For automation, use Proxmox API tokens with the minimum permissions needed rather than using root credentials in scripts. Tokens can be revoked individually without changing root access.
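Creating a scoped token takes a couple of commands via pveum; the user, token, pool, and host names below are hypothetical, and --privsep 1 keeps the token's permissions separate from its owning user:

```shell
# Create an automation user and a privilege-separated token for it
pveum user add automation@pve
pveum user token add automation@pve provisioner --privsep 1

# Grant the token only what it needs (here: VM admin on one pool)
pveum aclmod /pool/dev --tokens 'automation@pve!provisioner' --roles PVEVMAdmin

# Use it against the API (the token secret is printed once at creation time)
curl -H 'Authorization: PVEAPIToken=automation@pve!provisioner=<secret>' \
    https://pve1.example.com:8006/api2/json/nodes
```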

VM network isolation: VMs belonging to different customers, different security zones, or different application tiers should be on separate VLANs with firewall rules between them. Proxmox supports configuring firewall rules at both the cluster and VM level.

Role-based access control: Proxmox’s user management allows you to create users with specific permissions rather than giving everyone root access to the whole cluster. At scale, this matters both for security and for operational sanity when multiple people are managing the environment.

When to Consider Proxmox vs OpenStack

As your Proxmox deployment grows, you’ll eventually hit scenarios where you’re building features that OpenStack provides out of the box. This isn’t a criticism of Proxmox – the two platforms serve different use cases.

Proxmox is a virtualization manager. It’s excellent at managing VMs and containers across a cluster of servers. It has HA, live migration, backup, and a solid web interface.

OpenStack is a cloud platform. In addition to VMs, it provides self-service provisioning, project-based isolation with quotas, object storage, load balancing as a service, Kubernetes orchestration, and a full API surface compatible with tools designed for public cloud environments.

If you’re primarily running internal workloads and managing them yourself, Proxmox handles this well at most scales. If you’re building infrastructure that needs to serve multiple teams self-service, integrate with cloud-native tooling like Terraform and Kubernetes, or provide true tenant isolation with network segmentation per project, OpenStack starts making more sense.

You can read a detailed comparison of the two platforms in our Proxmox vs OpenStack article, and there’s also a breakdown of when to run Proxmox vs OpenStack on OpenMetal specifically.

Running Proxmox on Bare Metal IaaS

One of the decisions you’ll face when scaling Proxmox is where the hardware lives and who manages it. Your options broadly fall into three categories:

Your own datacenter or colocation: You own or rent rack space, purchase servers, and manage all the hardware yourself. You have maximum control and often the best long-term economics at very large scale, but you’re also responsible for hardware procurement, datacenter operations, and physical failure response.

Bare metal IaaS: A provider gives you dedicated physical servers with a known hardware specification, and you install and manage the software yourself. The provider handles the physical infrastructure and data center operations. You handle everything from the OS up.

Managed hosting: Some providers offer Proxmox-specific managed services where they handle both the hardware and some or all of the Proxmox administration.

For teams that want to scale Proxmox without building out physical infrastructure, bare metal IaaS is often the right fit. You get predictable hardware specs, known network capabilities, and the ability to add nodes relatively quickly compared to procuring physical servers.

OpenMetal’s bare metal offering is one option here. Our servers come with high-performance NVMe drives, dual 10 Gbps network connectivity, and isolated private networking between servers in the same customer environment. This matters for Proxmox specifically because Ceph storage replication benefits from dedicated low-latency private links, and cluster communication needs reliable low-latency connectivity. You can see more about what running Proxmox on our infrastructure looks like in our dedicated Proxmox on OpenMetal article, or watch Wendell of Level1Techs’ demo video.

Other bare metal IaaS providers worth evaluating include Hetzner (popular in Europe for cost-effective bare metal), OVHcloud (broad hardware selection, multiple regions), and Vultr Bare Metal. The right choice depends on your location requirements, budget, and how much support you need.

Migration Considerations

If you’re scaling an existing Proxmox deployment rather than starting fresh, migration planning matters.

Storage migration: Moving VMs from local ZFS to Ceph can be done live (with some performance impact) or during a maintenance window. The simplest approach is to add Ceph storage to the cluster, then live-migrate VMs to new disks on Ceph using Proxmox’s storage migration feature. This avoids downtime but requires enough Ceph capacity to hold the migrated VMs alongside existing storage.

Network reconfiguration: Adding proper VLAN separation, bonded NICs, and dedicated storage networks to an existing cluster usually requires maintenance windows. These changes touch live network interfaces on running systems, so scheduling them carefully matters.

Expanding from a single node to a cluster: If you started with a single Proxmox node and want to form a cluster, you’ll need to create a new cluster on the existing node (if one doesn’t exist), join additional nodes, and then configure shared storage. VMs on local storage won’t automatically become HA-eligible; you need to migrate their disks to shared storage first.

If you’re coming from VMware rather than expanding an existing Proxmox deployment, we’ve covered the infrastructure requirements for that transition in our VMware to Proxmox migration guide.

Operational Support at Scale

The operational requirements for a larger Proxmox cluster are different from managing a few VMs.

Monitoring: At scale, you need visibility into cluster health, node resource utilization, VM performance, and storage metrics. Proxmox has built-in metrics and status displays, but for production environments, exporting metrics to an external monitoring stack (Prometheus with Grafana is a common choice) gives you historical data, alerting, and better visibility into trends. Proxmox supports Graphite and InfluxDB metric export natively, and the Prometheus community has exporter projects that work with Proxmox.
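Metric export is configured cluster-wide in /etc/pve/status.cfg. A hypothetical InfluxDB target (Proxmox sends metrics over UDP on port 8089 by default):

```
# /etc/pve/status.cfg fragment (hypothetical server address)
influxdb: metrics
    server 10.0.2.30
    port 8089
```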

Capacity planning: Understanding when you’ll run out of CPU, memory, or storage requires tracking utilization over time. This is where external monitoring pays off – you can build dashboards that show utilization trends and project when you’ll hit constraints.

Update management: Keeping Proxmox nodes updated is more complex in a cluster. Running different Proxmox versions across nodes can cause compatibility issues. The general approach is to update one node at a time, after migrating its VMs to other nodes, and verifying cluster stability before moving to the next.
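A rolling update of one node can be sketched as drain, patch, reboot. The target node name below is hypothetical, and the loop assumes shared storage so online migration works:

```shell
# Migrate every running VM off this node (target node name is hypothetical)
for vmid in $(qm list | awk 'NR>1 && $3 == "running" {print $1}'); do
    qm migrate "$vmid" pve2 --online
done

# Patch and reboot once the node is empty
apt update && apt full-upgrade -y
reboot
```

Verify cluster quorum and Ceph health after the node rejoins before moving on to the next one.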

Documentation and runbooks: For any environment where multiple people are involved in operations, documented procedures for common tasks (adding nodes, replacing failed drives, recovering from HA events) are worth the time to create. These become especially valuable at 3am during an incident.

When to consider managed services: Some teams find that as their Proxmox environment grows, the operational overhead of managing the cluster itself takes too much engineering time away from higher-value work. If that’s the case, managed bare metal or managed Proxmox hosting is worth evaluating. OpenMetal offers managed services for customers who want infrastructure management handled for them, starting at $800/month plus per-server fees. This covers hardware and software break/fix, dedicated support via Slack, and ongoing management of the infrastructure stack.

Wrapping Up: Scaling Proxmox

Proxmox scales well, but scaling it successfully requires intentional decisions at each stage. The jump from a few VMs on a single node to a production-ready cluster with HA, shared storage, and proper networking isn’t automatic. It requires planning around hardware specifications, storage architecture, network design, backup strategy, and security.

The short version of what changes at production scale:

  • Three or more nodes for HA quorum
  • Shared storage (Ceph is the native option, external shared storage works too) for live migration and HA
  • Dedicated network interfaces for storage replication traffic
  • Proxmox Backup Server with off-cluster backup targets
  • Fencing configured before relying on HA
  • Monitoring and alerting beyond the built-in status displays
  • RBAC and network isolation for security

If you’re evaluating where to run your Proxmox nodes, the infrastructure decisions you make at the provider level affect all of these areas. Hardware consistency, private network quality, NVMe storage, and IPMI access all matter for a production Proxmox cluster.

And if your requirements eventually grow beyond what Proxmox is designed for, OpenStack is a natural step up that provides a full cloud platform on the same type of infrastructure. The comparison between the two is worth reading before you hit that crossroads rather than after.


Scaling Proxmox for Large Deployments With OpenMetal IaaS

Feb 17, 2026
