The Hidden Complexity of Managed Kubernetes

Managed Kubernetes was supposed to simplify container orchestration. Hand off the control plane to AWS, Google, or Azure, and focus on your applications instead of your infrastructure. For many teams, that promise holds up reasonably well at small scale. At larger scale, or when compliance, cost, or portability requirements enter the picture, the hidden complexity starts surfacing in ways that catch most teams off guard.

If you’ve been running EKS, GKE, or AKS for a while, some of this will be familiar. The goal isn’t to talk you out of managed Kubernetes. It’s to name the problems clearly so you can make better decisions about where managed services fit in your architecture and where they don’t.

The Control Plane Is Managed. Everything Else Isn’t.

This is the single most common misconception about managed Kubernetes, and it’s worth stating plainly.

When AWS, Google, or Azure say they manage Kubernetes, they mean they manage the control plane: the API server, etcd, the scheduler, and the controller manager. That’s genuinely valuable. Running a highly available control plane yourself is non-trivial work.

What they don’t manage is most of the stack you actually operate. Node groups and auto-scaling configuration are yours. The CNI plugin and network policies are yours. Ingress controllers, cert-manager, external-dns, and the rest of your cluster add-ons are yours. Storage class configuration, persistent volume provisioners, and backup tooling are yours. The observability stack (metrics, logs, traces) is yours to build and maintain.

The division of responsibility becomes painfully clear when something breaks at 2am. The provider’s managed piece is usually fine. The failure is almost always in the layer between the control plane and your application, which is the layer you own.

This isn’t a reason to avoid managed Kubernetes. It’s a reason to go in with accurate expectations about what you’re signing up to operate, because “managed” covers less than the marketing suggests.

Provider-Specific Dependencies Accumulate Faster Than You Expect

Most teams start with a reasonable intention: use Kubernetes as an abstraction layer, keep manifests portable, avoid vendor lock-in. That intention tends to erode quickly once you start connecting to provider services.

EKS clusters use EBS volumes for persistent storage by default. Those volumes don’t exist outside of AWS. Load balancers provisioned by the AWS Load Balancer Controller use ALB-specific annotations that mean nothing on GKE or AKS. IAM roles for service accounts (IRSA) tie your pod identity model to AWS’s IAM system. Spot instance node groups use provider-specific node selectors and tolerations. Each of these is a reasonable individual decision. Collectively, they mean your “portable” Kubernetes workloads are deeply entangled with a specific provider.

GKE has its own version of this. GKE Autopilot changes how you configure resource requests and limits in ways that don’t translate directly to other environments. Workload Identity binds Kubernetes service accounts to Google Cloud IAM, so your pod identity model depends on Google’s IAM system in the same way IRSA depends on AWS’s. GKE-specific node auto-provisioning behavior differs from the upstream cluster autoscaler.

AKS does the same thing with Azure-specific managed identity integration, Azure CNI networking that doesn’t behave identically to other CNI plugins, and Azure Disk persistent volumes that are tied to Azure storage pricing and availability zones.

The practical consequence is that teams who planned to maintain cloud portability often discover, when they actually try to move a workload, that the migration effort is far larger than expected. The cloud native architecture article on our site covers this dynamic in detail if you want to dig into the portability question specifically.
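One way to gauge that entanglement before a migration is to scan manifests for provider-specific strings. The sketch below is a minimal illustration in Python, assuming manifests have already been loaded into dictionaries. The marker list is an assumption chosen to match the examples above, not an exhaustive catalog, and `find_provider_dependencies` is a hypothetical helper name.

```python
# Illustrative markers only: this list is an assumption, not a complete catalog.
PROVIDER_MARKERS = {
    "alb.ingress.kubernetes.io/": "AWS (ALB ingress annotation)",
    "service.beta.kubernetes.io/aws-load-balancer-": "AWS (load balancer annotation)",
    "eks.amazonaws.com/role-arn": "AWS (IRSA)",
    "ebs.csi.aws.com": "AWS (EBS CSI provisioner)",
    "iam.gke.io/gcp-service-account": "GCP (Workload Identity)",
    "pd.csi.storage.gke.io": "GCP (Persistent Disk CSI provisioner)",
    "azure.workload.identity/": "Azure (workload identity)",
    "disk.csi.azure.com": "Azure (Azure Disk CSI provisioner)",
}


def find_provider_dependencies(manifest):
    """Return sorted (provider, matched_string) pairs found in one manifest dict."""
    metadata = manifest.get("metadata") or {}
    candidates = []
    candidates.extend((metadata.get("annotations") or {}).keys())
    candidates.extend((metadata.get("labels") or {}).keys())
    # StorageClass objects name their provisioner at the top level.
    if "provisioner" in manifest:
        candidates.append(manifest["provisioner"])

    hits = set()
    for value in candidates:
        for marker, provider in PROVIDER_MARKERS.items():
            if value.startswith(marker):
                hits.add((provider, value))
    return sorted(hits)


# Hypothetical manifests for illustration.
ingress = {
    "kind": "Ingress",
    "metadata": {
        "name": "web",
        "annotations": {"alb.ingress.kubernetes.io/scheme": "internet-facing"},
    },
}
storage_class = {
    "kind": "StorageClass",
    "metadata": {"name": "fast"},
    "provisioner": "ebs.csi.aws.com",
}
portable = {"kind": "Deployment", "metadata": {"name": "api", "labels": {"app": "api"}}}

for manifest in (ingress, storage_class, portable):
    print(manifest["kind"], find_provider_dependencies(manifest))
```

A scan like this only catches dependencies that show up in manifest metadata. Node selectors, tolerations, and IAM policies attached outside the cluster would need their own checks.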

The Upgrade Treadmill

Kubernetes releases a new minor version roughly every four months. Managed Kubernetes providers support minor versions for varying periods, but the timelines are aggressive. EKS supports a given minor version for approximately 14 months before it reaches end of standard support, after which you’re either paying extended support fees or forced into an upgrade.

Missing an upgrade cycle is where the problems compound. Clusters that fall two minor versions behind are running with known CVEs, potentially unsupported add-ons, and a growing gap between their configuration and what the provider’s tooling expects. When you finally have to upgrade, you’re not doing a single-step upgrade. You’re doing multiple sequential upgrades, each of which needs to be tested, and each of which carries risk for stateful workloads and custom configurations.
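To make the compounding concrete, here is a small Python sketch that lists the sequential steps between two minor versions. It assumes the control plane moves one minor version at a time, and it uses the four-month release cadence mentioned above to estimate how far behind a cluster is. The version numbers are hypothetical.

```python
RELEASE_INTERVAL_MONTHS = 4  # approximate cadence cited in the text


def parse_minor(version):
    """Split a version string such as '1.27' into (1, 27)."""
    major, minor = version.split(".")[:2]
    return int(major), int(minor)


def upgrade_path(current, target):
    """List each minor version that must be stepped through, in order."""
    cur_major, cur_minor = parse_minor(current)
    tgt_major, tgt_minor = parse_minor(target)
    if cur_major != tgt_major:
        raise ValueError("this sketch only handles upgrades within one major version")
    if tgt_minor < cur_minor:
        raise ValueError("target version is older than current version")
    return [f"{cur_major}.{minor}" for minor in range(cur_minor + 1, tgt_minor + 1)]


steps = upgrade_path("1.27", "1.30")
print("Sequential upgrades required:", steps)
print("Approximate months behind:", len(steps) * RELEASE_INTERVAL_MONTHS)
```

Each entry in the returned list is a separate upgrade to test and roll out, which is why skipping cycles multiplies the work rather than deferring it.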

The upgrade process on managed services is also not as clean as it’s presented. Worker node upgrades require draining nodes, which means your workloads need to tolerate disruption correctly. Pod Disruption Budgets need to be configured accurately. Stateful workloads need storage that survives node replacement. Admission webhooks need to be compatible with the new API version. Custom Resource Definitions may have API version changes that require migration. None of this is insurmountable, but “click upgrade in the console” is not how production cluster upgrades actually work, and teams that haven’t invested in upgrade runbooks find this out the hard way.
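Pod Disruption Budgets are a common place for node drains to stall. The following Python sketch approximates how many pods a drain could evict under a single budget. It handles integer `minAvailable` or `maxUnavailable` values only and leaves percentage forms out. The function name and sample numbers are hypothetical.

```python
def allowed_disruptions(healthy_pods, desired_pods, min_available=None, max_unavailable=None):
    """Approximate how many pods may be evicted right now under one budget."""
    if (min_available is None) == (max_unavailable is None):
        raise ValueError("set exactly one of min_available or max_unavailable")
    if min_available is not None:
        # Evictions are allowed only while healthy pods stay at or above the floor.
        budget = healthy_pods - min_available
    else:
        # Pods that are already unhealthy count against the unavailable allowance.
        already_unavailable = desired_pods - healthy_pods
        budget = max_unavailable - already_unavailable
    return max(budget, 0)


# A single-replica workload with minAvailable=1 can never be evicted,
# so a node drain waits indefinitely on it.
print(allowed_disruptions(healthy_pods=1, desired_pods=1, min_available=1))

# Three healthy replicas with minAvailable=2 leave room for one eviction at a time.
print(allowed_disruptions(healthy_pods=3, desired_pods=3, min_available=2))

# maxUnavailable=1 is already used up when one replica is unhealthy.
print(allowed_disruptions(healthy_pods=2, desired_pods=3, max_unavailable=1))
```

Running a check like this across every budget in a cluster before an upgrade highlights the workloads that would block a drain.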

The Costs That Don’t Show Up in the Kubernetes Line Item

The control plane fee is visible and predictable. The rest of the cost structure for managed Kubernetes is less transparent, and most teams running EKS or GKE haven’t fully audited what they’re actually paying.

Inter-availability-zone data transfer is the most common surprise. Kubernetes pods communicate with each other regardless of which AZ their nodes are in. When a pod on a node in us-east-1a talks to a pod on a node in us-east-1b, that traffic crosses an AZ boundary and gets charged at AWS’s inter-AZ transfer rate of $0.01 per GB in each direction. For a busy microservices cluster, the cumulative inter-AZ transfer cost can be substantial. Teams that have deployed pods without topology-aware routing, or that haven’t configured services to prefer same-AZ endpoints, often discover this cost only when they look at their data transfer line items carefully.
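A rough estimate helps size this cost. The Python sketch below uses the $0.01 per GB per direction rate quoted above and assumes pod-to-pod traffic picks endpoints uniformly across zones, which is a simplification of how traffic behaves without topology-aware routing. The traffic volume is a hypothetical input.

```python
INTER_AZ_RATE_PER_GB = 0.01  # USD, charged in each direction (figure from the text)


def cross_az_fraction(zone_count):
    """Share of traffic that crosses zones if endpoints are picked uniformly."""
    if zone_count < 1:
        raise ValueError("zone_count must be at least 1")
    return (zone_count - 1) / zone_count


def monthly_inter_az_cost(east_west_gb_per_month, zone_count):
    """Estimate the monthly inter-AZ transfer bill for pod-to-pod traffic."""
    cross_gb = east_west_gb_per_month * cross_az_fraction(zone_count)
    # Each cross-zone gigabyte is billed once on the way out and once on the way in.
    return cross_gb * INTER_AZ_RATE_PER_GB * 2


# Hypothetical cluster: 300 TB of pod-to-pod traffic per month across three zones.
estimate = monthly_inter_az_cost(east_west_gb_per_month=300_000, zone_count=3)
print(f"Estimated inter-AZ transfer cost: ${estimate:,.2f} per month")
```

With those inputs, two thirds of the traffic crosses a zone boundary, and 200,000 GB at an effective $0.02 per GB works out to about $4,000 per month.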

Load balancer costs compound with cluster scale. Every LoadBalancer-type service provisions a cloud load balancer, and on AWS an Ingress handled by the AWS Load Balancer Controller can provision an ALB as well. AWS ALBs have an hourly fee plus LCU charges based on traffic. A cluster with many services, or one that creates and destroys load balancers frequently in CI/CD workflows, accumulates load balancer costs that weren’t part of the original infrastructure budget.

NAT gateway fees apply to outbound traffic from private node groups, which are the commonly recommended configuration for production clusters. Nodes pulling container images, reaching external APIs, or sending telemetry data all generate NAT gateway traffic, billed at $0.045 per GB processed on AWS. This is easy to miss because it appears as NAT gateway costs rather than Kubernetes costs in the billing breakdown.
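Because this charge hides under a different billing heading, it can help to attribute it back to cluster activity. The Python sketch below applies the $0.045 per GB rate quoted above to a per-source traffic breakdown. The traffic volumes are hypothetical, and only the per-GB charge is modeled, so any fixed gateway fees are left out.

```python
NAT_RATE_PER_GB = 0.045  # USD per GB, figure from the text


def monthly_nat_cost(gb_by_source):
    """Return the per-source NAT data charge plus a total, in USD."""
    breakdown = {source: gb * NAT_RATE_PER_GB for source, gb in gb_by_source.items()}
    breakdown["total"] = sum(breakdown.values())
    return breakdown


# Hypothetical monthly outbound volumes from private nodes, in GB.
traffic = {"image_pulls": 2000, "external_apis": 500, "telemetry": 1500}
costs = monthly_nat_cost(traffic)

for source, usd in costs.items():
    print(f"{source}: ${usd:,.2f}")
```

Splitting the figure by source shows which traffic is worth moving off the NAT path first, for example by caching container images closer to the nodes.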

Persistent volume costs tie directly to provider storage pricing, which for high-IOPS workloads can be significant. EBS gp3 volumes backing StatefulSets at production scale are a material infrastructure cost that scales with storage requirements in a way that fixed-cost dedicated storage doesn’t.

For a detailed cost comparison of managed Kubernetes on EKS versus running Kubernetes on private cloud infrastructure, the Kubernetes on a Private Cloud: Cost and Performance vs. EKS and GKE article covers the specific numbers.

Compliance and Auditability Gaps

For teams with SOC 2 Type II, HIPAA, or GDPR requirements, managed Kubernetes creates a specific category of compliance challenge that doesn’t have a clean solution within the managed service model.

The control plane is a black box. You don’t have visibility into the infrastructure running your API server, etcd, or scheduler. You can’t produce evidence of the underlying hardware’s security posture or the physical isolation of your control plane from other customers’ control planes. For most compliance frameworks this is acceptable because the provider’s compliance certifications cover the control plane. For frameworks that require you to demonstrate end-to-end control over your infrastructure, the black box creates an audit gap.

Node access and audit logging are more constrained on managed services than on infrastructure you control directly. The specific audit log formats, retention policies, and access controls available to you are determined by what the provider exposes, not by what your compliance framework asks for. Building a complete audit trail from the hardware layer through the application layer is harder when the hardware layer is entirely abstracted away.

Data residency on managed Kubernetes requires careful configuration and ongoing verification. Pods can be scheduled to nodes in any AZ within a region by default. Persistent volumes may replicate across AZs. etcd data lives on control plane nodes managed by the provider. Demonstrating clean data residency to an auditor requires understanding and controlling all of these data flows, which adds compliance overhead that infrastructure you control directly doesn’t have.
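Part of that ongoing verification can be automated. As a minimal sketch, the Python below checks node objects against an allowed set of zones using the standard `topology.kubernetes.io/zone` label. It assumes the nodes have already been exported as dictionaries, for example from `kubectl get nodes -o json`. The node names and allowed zones are hypothetical, and this covers node placement only, not volume replication or control plane data.

```python
ZONE_LABEL = "topology.kubernetes.io/zone"


def residency_violations(nodes, allowed_zones):
    """Return (node_name, zone) pairs for nodes outside the allowed zones.

    Nodes without a zone label are reported with zone None, since their
    location cannot be demonstrated from the label alone.
    """
    violations = []
    for node in nodes:
        metadata = node.get("metadata") or {}
        zone = (metadata.get("labels") or {}).get(ZONE_LABEL)
        if zone not in allowed_zones:
            violations.append((metadata.get("name"), zone))
    return violations


# Hypothetical node objects for illustration.
nodes = [
    {"metadata": {"name": "node-a", "labels": {ZONE_LABEL: "us-east-1a"}}},
    {"metadata": {"name": "node-b", "labels": {ZONE_LABEL: "us-east-1c"}}},
    {"metadata": {"name": "node-c", "labels": {}}},
]

print(residency_violations(nodes, allowed_zones={"us-east-1a", "us-east-1b"}))
```

Output from a check like this, collected on a schedule, is the kind of evidence an auditor can review alongside the scheduling constraints that are meant to enforce placement.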

For organizations that need hardware-level isolation for specific workloads, OpenMetal’s Intel TDX support provides cryptographic isolation that managed Kubernetes can’t replicate. Workloads running in TDX Trust Domains are isolated even from the infrastructure operator.

What the Alternative Actually Looks Like

It’s worth being direct about the tradeoffs here, because the alternative to managed Kubernetes isn’t without its own complexity.

Self-managed Kubernetes on bare metal or private cloud infrastructure solves the portability problem: your manifests work anywhere that runs standard Kubernetes without provider-specific annotations and integrations. It solves the audit problem: you have full visibility from the hardware layer up and can produce the evidence compliance frameworks ask for. It addresses the hidden cost problem: fixed-cost dedicated infrastructure doesn’t have per-GB inter-AZ transfer fees, per-load-balancer charges, or NAT gateway costs.

What it adds is operational responsibility for the control plane and infrastructure layer. That’s real overhead, and for small teams with limited infrastructure engineering capacity it may not be the right tradeoff. The upgrade process is your responsibility. Hardware failures are your problem to handle. Node provisioning is manual or requires tooling you build and maintain.

The middle path for teams that want the benefits of private infrastructure without fully self-managing it is hosted private cloud with managed operations. OpenMetal manages the infrastructure layer, including hardware, networking, and the OpenStack layer underneath your Kubernetes clusters. Your team manages the Kubernetes configuration and the workloads running on top. You get the portability, auditability, and cost predictability of private infrastructure without taking on the full operational burden of running it yourself.

Running Kubernetes on OpenMetal’s private cloud gives you standard upstream Kubernetes on dedicated hardware with 20 Gbps private networking between nodes, no inter-node traffic charges, and the ability to use any CNI plugin, ingress controller, or storage provider that works with standard Kubernetes. The manifests that run on OpenMetal run anywhere else that runs standard Kubernetes, without modification.

Figuring Out Where Managed Services Still Make Sense

None of this means managed Kubernetes is the wrong choice across the board. For small clusters, development environments, and workloads where the operational simplicity genuinely outweighs the complexity and cost concerns, managed services are a reasonable choice.

The decision is worth revisiting when your cluster has grown large enough that the hidden costs are a meaningful budget item, when portability requirements mean provider-specific dependencies are creating migration risk, when compliance requirements are creating audit gaps you can’t cleanly close, or when the upgrade treadmill is consuming engineering time that should be going elsewhere.

If you’re at that point, the question isn’t whether to use Kubernetes but where to run it. The answer depends on your team’s capacity, your compliance requirements, and whether the operational overhead of private infrastructure is worth the portability, cost, and auditability benefits it provides.

Evaluating Kubernetes infrastructure options? See how OpenMetal’s private cloud supports Kubernetes workloads or explore bare metal server configurations for teams running Kubernetes directly on dedicated hardware.
