Storage Migration from VMware to OpenStack + Ceph: Tips, Tools & Pitfalls


Moving workloads from VMware to OpenStack isn’t primarily a compute challenge—it’s a storage challenge. Your VMs can be re-instantiated quickly. Your networking can be reconfigured in an afternoon. But your storage layer—your persistent data, your stateful workloads, your multi-terabyte databases—that’s where migrations stall, fail, or drag on for months.

If you’re a storage architect or platform engineer tasked with migrating off VMware ESXi, vSAN, or VMFS to an OpenStack environment backed by Ceph, you’re facing a fundamentally different storage architecture. This isn’t a lift-and-shift. It’s a re-architecture of how block storage, shared filesystems, and object storage are provisioned, accessed, and managed. This guide walks through the migration methods, tooling, validation steps, and pitfalls that matter when you’re moving production storage workloads—not lab environments.


VMware Storage Model vs OpenStack + Ceph Storage Model

VMware’s storage stack is tightly integrated with ESXi’s hypervisor layer. Whether you’re using local VMFS datastores, shared NFS mounts, or vSAN’s distributed object store, the storage abstraction lives inside VMware’s control plane. Virtual disks are VMDK files. Storage policies are enforced by vCenter. Snapshots, clones, and thin provisioning are all managed through VMware’s APIs.

OpenStack with Ceph operates differently. Ceph is a software-defined storage system that provides block storage (RBD), shared filesystems (CephFS), and object storage (RADOS Gateway) through a unified cluster. OpenStack’s Cinder (block), Manila (file shares), and Swift/S3 (object) services interface with Ceph, but Ceph itself is hypervisor-agnostic. Virtual disks are stored as RADOS Block Device (RBD) images, not VMDK files. Snapshots are COW (copy-on-write) operations at the Ceph layer. Storage policies are defined in CRUSH maps and Ceph pools, not vCenter.

| Aspect | VMware (ESXi/vSAN/VMFS) | OpenStack + Ceph (RBD/CephFS/Object) |
|---|---|---|
| Virtual disk format | VMDK (monolithic or split) | RBD image (object-striped across OSDs) |
| Storage provisioning | vCenter datastores | Cinder volumes backed by Ceph pools |
| Shared file storage | NFS/vSAN file services | CephFS mounted via kernel or FUSE |
| Object storage | vSAN object store (limited) | RADOS Gateway (S3/Swift-compatible) |
| Snapshot mechanism | VMDK delta files | Ceph COW snapshots at RBD layer |
| Thin provisioning | VMDK thin disks | RBD thin provisioning (default) |
| Replication/HA | vSAN erasure coding or mirroring | Ceph replica pools or erasure coding |
| CLI tooling | vmkfstools, esxcli | rbd, ceph, rados |

Ceph is not VMFS with different branding. It’s a distributed object store that exposes block and file interfaces on top of RADOS. You’ll need to adjust your mental model for how storage is allocated, how data is replicated, and how failure domains are defined. VMware admins expect storage to be “attached” to a cluster. Ceph storage is distributed across nodes, and failure domains are defined by CRUSH topologies—not vSphere clusters.
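
Before planning pool layouts, inspect how the target cluster actually defines its failure domains. A few read-only checks (the pool name matches the example used later in this guide):

ceph osd tree
ceph osd pool get openstack-volumes crush_rule
ceph osd crush rule dump

The first command shows the CRUSH hierarchy (root, racks, hosts, OSDs); the other two show which rule a pool uses and how that rule selects failure domains. If every OSD sits under one host or one rack bucket, fix the topology before you import any data.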

Migration Method Options

You have four primary approaches when migrating storage from VMware to OpenStack. Each has different downtime requirements, tooling complexity, and risk profiles.

  • Cold migration involves shutting down the VM, exporting the VMDK, converting it to a raw or qcow2 image, and importing it into Ceph as an RBD volume. This is the simplest method, but it requires full downtime for the workload. Acceptable for dev/test environments or workloads with scheduled maintenance windows.
  • Live migration uses tools like virt-v2v or commercial platforms (Hystax, Trilio) to sync block-level changes while the VM remains running in VMware. A cutover window is still required, but it’s measured in minutes rather than hours. This method requires network bandwidth, intermediate storage, and careful handling of I/O consistency.
  • Block streaming involves attaching the source VMDK as a backing file to the destination RBD volume and streaming blocks on-demand as they’re accessed. This minimizes initial downtime but can cause performance degradation during the migration window. Rarely used in production due to complexity.
  • Rebuild means standing up a new VM in OpenStack, installing the OS, and migrating application data separately (rsync, database replication, object sync). This is the cleanest method for stateless workloads or when you’re modernizing the stack during migration. It’s also the most time-consuming.

| Method | Downtime | Complexity | Best For |
|---|---|---|---|
| Cold migration | Hours to days | Low | Non-critical workloads, scheduled maintenance windows |
| Live migration | Minutes | Medium | Production databases, stateful apps |
| Block streaming | Seconds (initial) | High | Experimental; rarely used |
| Rebuild | Variable | Medium | Stateless apps, modernization efforts |

  • Choose cold migration when downtime is acceptable and tooling simplicity matters.
  • Choose live migration when uptime SLAs are strict and you have the bandwidth to sync deltas.
  • Choose rebuild when you’re refactoring the application stack or when the workload doesn’t justify VMDK conversion.
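
If you take the rebuild path, the disk image never moves; only application data does. A minimal rsync sketch, run from the source VM (the hostnames, paths, and service name are illustrative, not part of any tool's defaults):

# initial bulk copy while the application is still serving traffic
rsync -aHAX --numeric-ids /var/lib/app/ migrate@new-openstack-vm:/var/lib/app/

# at cutover: stop the application so no new writes land, then sync only the deltas
sudo systemctl stop app
rsync -aHAX --numeric-ids --delete /var/lib/app/ migrate@new-openstack-vm:/var/lib/app/

For databases, replace the second pass with native replication (for example, a replica in OpenStack promoted at cutover) rather than copying live data files.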

Tools

The tooling landscape for VMware-to-OpenStack storage migration ranges from open-source CLI utilities to commercial platforms. Your choice depends on scale, automation requirements, and tolerance for manual intervention.

  • qemu-img is the workhorse for VMDK-to-raw or VMDK-to-qcow2 conversion. It’s free, well-documented, and handles most disk formats. It doesn’t migrate metadata (VM config, network settings), so you’ll need to recreate those in OpenStack manually or via scripting.
  • virt-v2v (part of libguestfs) automates more of the process. It converts VMDKs, injects virtio drivers (critical for performance on KVM), and can push images directly to OpenStack via Glance. It’s purpose-built for VMware-to-KVM migrations, but it requires access to the VMware API or exported OVF files.
  • rbd is Ceph’s native block device CLI. You’ll use it to import raw disk images into Ceph pools, create snapshots, clone volumes, and manage RBD mappings. It’s fast, but you need to ensure your disk images are in raw format before importing. On modern Ceph deployments managed by cephadm (standard on OpenMetal v3.0.0+ environments), Ceph services run in Docker containers. You can execute rbd commands either from the host if ceph-common is installed, or via the containerized environment using cephadm shell -- rbd <command>.
  • ovftool (VMware’s OVF export utility) packages VMDKs and VM metadata into OVF/OVA archives. Useful when you need to export VMs from vCenter in a structured format before conversion. It doesn’t handle the Ceph import step—just the export from VMware.
  • Hystax, Trilio, Storware are commercial migration and disaster recovery platforms. They offer live migration capabilities, automated cutover, and delta sync. They’re expensive, but they reduce manual labor for large-scale migrations (50+ VMs). Hystax specifically supports VMware-to-OpenStack workflows.

| Tool | Use Case | License | Live Migration? | Ceph Integration |
|---|---|---|---|---|
| qemu-img | VMDK conversion | Open source | No | Manual rbd import |
| virt-v2v | Automated V2V conversion | Open source | No | Via Glance/Cinder |
| rbd | Ceph block device mgmt | Open source | No | Native |
| ovftool | VMware VM export | Free (VMware) | No | None |
| Hystax | Enterprise migration | Commercial | Yes | Via OpenStack APIs |
| Trilio | Backup and migration | Commercial | Yes | Native Ceph support |
| Storware | Backup and DR | Commercial | Yes | Ceph plugin available |

  • For small migrations (under 20 VMs), stick with qemu-img and rbd.
  • For mid-size migrations (20–100 VMs), virt-v2v will save time.
  • For large migrations (100+ VMs) or when you need live cutover, evaluate Hystax or Trilio. Don’t assume a commercial tool will solve architectural mismatches—they won’t convert VMFS-specific features (like Storage DRS policies) into Ceph equivalents.

Example Command Workflows

Here’s a typical cold migration workflow using open-source tools. This assumes you’ve already exported the VM from VMware and have SSH access to a machine with Ceph client tools installed.

Note for OpenMetal v3.0.0+ deployments: Ceph services run in containers managed by cephadm. Execute rbd commands via cephadm shell -- <command> or install ceph-common on the host for direct CLI access. The examples below show direct CLI usage for clarity—prefix them with cephadm shell -- if you’re working in a containerized environment.
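
For example, to confirm the client side can reach the cluster before importing anything (assuming root access on a host where cephadm is installed):

sudo cephadm shell -- ceph -s
sudo cephadm shell -- rbd ls openstack-volumes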

Export VM from VMware with ovftool

ovftool vi://vcenter.example.com/Datacenter/vm/production-db01 \
  /mnt/staging/production-db01.ova

This exports the VM as an OVA file. Extract the VMDK from the OVA:

tar -xvf /mnt/staging/production-db01.ova

Convert VMDK to raw format with qemu-img

qemu-img convert -f vmdk -O raw \
  production-db01-disk1.vmdk \
  production-db01-disk1.raw

Check the converted image size and format:

qemu-img info production-db01-disk1.raw

Import raw image into Ceph RBD

rbd import --pool openstack-volumes \
  production-db01-disk1.raw \
  production-db01-disk1

Verify the RBD image exists:

rbd ls openstack-volumes
rbd info openstack-volumes/production-db01-disk1

Create a Cinder volume from the image

Note that openstack volume create --image resolves a Glance image name, not an RBD image sitting directly in the Cinder pool. Either register the converted raw file in Glance first (see the sketch below) or adopt the rbd-imported image through Cinder’s volume manage workflow.

openstack volume create \
  --size 100 \
  --image production-db01-disk1 \
  production-db01-volume

Attach the volume to a new OpenStack instance or boot directly from the volume using Nova.
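
A minimal sketch of the Glance registration that precedes that volume create, plus the final boot from the migrated volume (the flavor and network names are placeholders for your environment):

openstack image create \
  --disk-format raw \
  --container-format bare \
  --file production-db01-disk1.raw \
  production-db01-disk1

openstack server create \
  --flavor m1.large \
  --volume production-db01-volume \
  --network private \
  production-db01

With both Glance and Cinder backed by Ceph, the volume create is a COW clone from the image pool into the volumes pool, so it completes quickly even for large disks.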

Benchmarking

Before you migrate production workloads, benchmark your Ceph cluster to confirm it meets performance expectations. VMware admins are accustomed to vSAN’s predictable latency profiles. Ceph performance depends on OSD count, network topology, disk types (NVMe vs SSD vs HDD), and CRUSH map configuration.

Use fio to test block-level I/O performance on an RBD volume:

fio --name=rbd-randwrite \
  --ioengine=rbd \
  --pool=openstack-volumes \
  --rbdname=test-volume \
  --rw=randwrite \
  --bs=4k \
  --iodepth=32 \
  --numjobs=4 \
  --runtime=60 \
  --group_reporting

This tests random 4K writes with 32 outstanding I/Os. Compare the IOPS and latency results to your VMware baseline. If you’re seeing >10ms p99 latency on NVMe-backed Ceph, investigate network bottlenecks or OSD configuration.
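
If your fio build lacks the rbd ioengine, testing against a mapped RBD device is a close approximation. A sketch (the test volume must already exist, the write test destroys any data on it, and rbd map prints the actual device path, which may not be /dev/rbd0):

rbd map openstack-volumes/test-volume
fio --name=rbd-randwrite \
  --filename=/dev/rbd0 \
  --ioengine=libaio \
  --direct=1 \
  --rw=randwrite \
  --bs=4k \
  --iodepth=32 \
  --numjobs=4 \
  --runtime=60 \
  --group_reporting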

Use rados bench to test raw Ceph cluster performance (bypassing RBD):

rados bench -p openstack-volumes 60 write --no-cleanup
rados bench -p openstack-volumes 60 seq

This writes objects directly to the pool for 60 seconds, then reads them back sequentially. It helps isolate whether performance issues are in Ceph itself or in the RBD/Cinder layer.
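
rbd bench sits between the two layers: it exercises RBD without involving a guest or Cinder. A sketch using the same test image as above, followed by cleanup of the objects left behind by rados bench:

rbd bench --io-type write --io-size 4K --io-threads 16 --io-total 1G \
  openstack-volumes/test-volume
rados -p openstack-volumes cleanup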

Data Integrity Validation

Migrating storage without validating data integrity is asking for corruption issues weeks after cutover. Always checksum your data before and after migration.

Generate a SHA256 hash of the converted raw image

Hash the converted raw image rather than the source VMDK. The VMDK is a container format (stream-optimized in OVA exports), so its checksum will never match the raw data you import into Ceph:

sha256sum production-db01-disk1.raw > raw-hash.txt

After importing to Ceph, map the RBD volume and hash the block device (rbd map prints the device path; it may not be /dev/rbd0 if other images are mapped):

rbd map openstack-volumes/production-db01-disk1
sha256sum /dev/rbd0 > rbd-hash.txt

Compare the hashes:

diff raw-hash.txt rbd-hash.txt

If the hashes don’t match, you have a corruption or conversion issue. Don’t proceed to cutover until you’ve identified the cause. Common culprits include incomplete VMDK exports, qemu-img version mismatches, or network interruptions during rbd import.

For large volumes, consider block-level validation tools like virt-diff (part of libguestfs) or filesystem-level checksums (e.g., ZFS checksums if your source datastore supports it).
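
For multi-terabyte volumes, a single sha256sum pass over the whole device is slow and tells you nothing about where a mismatch lives. A rough chunk-by-chunk comparison, shown here as a sketch that assumes the raw file and the mapped device at /dev/rbd0 are the same size:

#!/bin/bash
# Compare the converted raw file and the mapped RBD device in 1 GiB chunks.
SRC=production-db01-disk1.raw
DST=/dev/rbd0
CHUNK_MB=1024
SIZE=$(stat -c%s "$SRC")
CHUNKS=$(( (SIZE + CHUNK_MB*1024*1024 - 1) / (CHUNK_MB*1024*1024) ))

for ((i=0; i<CHUNKS; i++)); do
  a=$(dd if="$SRC" bs=1M skip=$((i*CHUNK_MB)) count=$CHUNK_MB 2>/dev/null | sha256sum | cut -d' ' -f1)
  b=$(dd if="$DST" bs=1M skip=$((i*CHUNK_MB)) count=$CHUNK_MB 2>/dev/null | sha256sum | cut -d' ' -f1)
  [ "$a" = "$b" ] || echo "mismatch in chunk $i (offset $((i*CHUNK_MB)) MiB)"
done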

Rollback Planning

No rollback = no migration. You need a tested rollback path before you cut over production workloads. Ceph snapshots make this straightforward, but you need to plan the workflow in advance.

Before cutover, take a snapshot of the original VMDK in VMware. Keep the VM powered off but don’t delete it. In OpenStack, create a Ceph snapshot of the newly imported RBD volume immediately after import:

rbd snap create openstack-volumes/production-db01-disk1@pre-cutover

If the cutover fails (application doesn’t start, data corruption discovered, performance unacceptable), you have two rollback options:

  1. Roll back to VMware: Power on the original VM in vCenter. You’re back to the pre-migration state within minutes.
  2. Roll back the Ceph volume: Detach the volume from the OpenStack instance, revert the RBD image to the snapshot, and troubleshoot offline:

rbd snap rollback openstack-volumes/production-db01-disk1@pre-cutover

Define your rollback SLA before migration. For Tier 1 workloads, you should be able to roll back within 15 minutes. Test the rollback procedure in a dev environment before attempting it in production. Keep the source VMDKs and VMware VMs intact for at least 30 days post-migration.
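
A quick sanity check before the cutover window opens: confirm the pre-cutover snapshot actually exists on the volume you expect.

rbd snap ls openstack-volumes/production-db01-disk1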

Common Pitfalls

| Pitfall | Symptom | Solution |
|---|---|---|
| Missing virtio drivers | VM boots slowly or not at all | Inject virtio drivers via virt-v2v or install manually |
| Thin VMDK converted to thick | Ceph volume consumes full allocated size | Use qemu-img with sparse flag; preallocate=off |
| Network MTU mismatch | High packet loss during migration | Set jumbo frames (MTU 9000) on migration network |
| Ceph replication lag | RBD import stalls or times out | Check OSD health; reduce concurrent migrations |
| Incorrect CRUSH map | Data on wrong failure domain (e.g., all on one rack) | Review CRUSH rules before migration; reweight OSDs |
| No I/O scheduler tuning | Poor performance post-migration | Set mq-deadline or none scheduler on Ceph OSD nodes |
| Cinder volume type mismatch | Volume created in wrong pool or replication tier | Define Cinder volume types that map to correct Ceph pools |
| Incomplete VM metadata | VM boots but network/hostname wrong | Export and parse VMX file; recreate metadata in OpenStack |

The most common failure mode isn’t corruption—it’s performance degradation. Your workload boots, runs, but responds 2x slower than it did in VMware. This usually points to missing virtio drivers, suboptimal Ceph pool configuration (e.g., replica 2 instead of 3), or network bottlenecks (1Gbps instead of 10Gbps+). Benchmark early, benchmark often, and compare against your VMware baselines before declaring success.
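
A couple of quick checks cover the most common culprits, pool replication and the network path (the pool name matches the earlier examples; the interface name is a placeholder):

ceph osd pool get openstack-volumes size
ceph osd pool get openstack-volumes min_size
ethtool eth0 | grep -i speed

For a production replica pool you generally want size 3 and min_size 2; the ethtool output should confirm the storage network is actually linked at 10Gbps or faster.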

Another frequent issue: migrating VMDKs with snapshots or linked clones. qemu-img and virt-v2v don’t handle VMDK snapshots gracefully. Consolidate all snapshots in VMware before exporting the VM. If you have linked clones, convert them to full clones first.

Migration Timeline

A realistic storage migration timeline for a 50-VM production environment looks like this:

  • Weeks 1–2: Inventory and discovery. Identify VMDK sizes, snapshot dependencies, application dependencies, and downtime windows.
  • Weeks 3–4: Pilot migration of 5 non-critical VMs. Test tooling, validate performance, document workflows.
  • Weeks 5–8: Migrate dev/test workloads (20 VMs). Refine scripts, train team, identify performance gaps.
  • Weeks 9–12: Migrate Tier 2 production workloads (15 VMs). Schedule downtime windows, execute cold migrations, validate data integrity.
  • Weeks 13–16: Migrate Tier 1 production workloads (10 VMs). Use live migration tools if available, or schedule extended maintenance windows.
  • Weeks 17–20: Decommission VMware infrastructure. Archive VMDKs, power off ESXi hosts, reclaim licenses.

This timeline assumes you have a functioning Ceph cluster, competent OpenStack operators, and no major architectural surprises. If you’re also deploying Ceph and OpenStack from scratch, add 8–12 weeks to the front end. If you’re migrating 500+ VMs, scale the timeline linearly but add buffer for coordination overhead and troubleshooting.

Don’t rush the pilot phase. A poorly executed pilot will cascade into production failures. Use the pilot to identify gaps in your tooling, networking, or Ceph configuration—not to declare victory and accelerate the timeline.

Example Storage Migration Checklist

| Task | Owner | Status |
|---|---|---|
| ☐ Inventory all VMs, VMDK sizes, snapshot dependencies | Platform team | |
| ☐ Benchmark Ceph cluster (fio, rados bench) | Storage architect | |
| ☐ Test qemu-img/virt-v2v tooling on dev VM | Migration engineer | |
| ☐ Define rollback procedure and test in dev | Operations team | |
| ☐ Export VMDKs from vCenter with ovftool | VMware admin | |
| ☐ Convert VMDKs to raw format | Migration engineer | |
| ☐ Import raw images to Ceph RBD | Storage engineer | |
| ☐ Create Cinder volumes from RBD images | OpenStack operator | |
| ☐ Take pre-cutover snapshots (VMware + Ceph) | Operations team | |
| ☐ Boot OpenStack instance from migrated volume | Platform team | |
| ☐ Validate data integrity (checksums) | Storage engineer | |
| ☐ Run application smoke tests | Application owner | |
| ☐ Monitor performance for 48 hours post-cutover | Operations team | |
| ☐ Archive source VMDKs for 30 days | VMware admin | |
| ☐ Decommission VMware hosts after 30-day retention | Platform team | |

Why OpenMetal’s Hosted Private Cloud Works for VMware Migrations

If you’re planning a VMware-to-OpenStack migration, you need a stable, performant Ceph-backed landing zone. OpenMetal’s Hosted Private Cloud provides exactly that—without the operational burden of deploying and managing Ceph yourself.

OpenMetal’s infrastructure is built on NVMe storage, 25–100Gbps networking, and Ceph pools configured for production workloads. Starting with OpenMetal v3.0.0, deployments use cephadm for simplified cluster lifecycle management—making it easier to add OSDs, replace disks, or enable CephFS during or after your migration. You’re not inheriting someone else’s underprovisioned cluster. You get dedicated hardware with predictable, fixed-cost pricing—no surprise egress fees or noisy-neighbor performance drops.

For storage architects migrating off VMware, this means you can focus on the migration process itself—VMDK conversion, data validation, application cutover—rather than tuning CRUSH maps or troubleshooting OSD failures at 2 AM. You still get root access to the OpenStack control plane and Ceph cluster, so you maintain full operational control when you need it.


If you’re evaluating landing zones for your VMware workloads, consider OpenMetal as an alternative to hyperscaler cloud, DIY OpenStack, or proprietary converged infrastructure. You get Ceph, OpenStack, and the predictable cost model that makes storage migration planning feasible.

Contact Us


