Kubernetes in Production: Lessons Nobody Writes About
Kubernetes has won the container orchestration war. If you’re running containers at scale, you’re probably running Kubernetes — or you’re evaluating it. The tutorials, certifications, and “getting started” guides are everywhere. You can go from zero to a running cluster in an afternoon.
What’s missing are the uncomfortable truths about what happens after the initial deployment. The lessons that only come from operating Kubernetes in production for six months or more — the 3am discoveries, the silent failures, the cost surprises, the operational complexity that no “getting started” guide prepares you for.
Here’s what I wish someone had told me before we went live.
Resource Requests and Limits Are Not Optional
Every pod needs resource requests and limits. Every single one. No exceptions. This is the most important operational decision in your cluster, and it’s the one most teams get wrong.
Without requests, the Kubernetes scheduler has no idea how much capacity your pod needs. It’ll pack pods onto nodes until something runs out of memory, and then the OOM (Out of Memory) killer starts terminating processes. Not your processes — random processes. The OOM killer doesn’t understand your application’s priority scheme. It just kills whatever it needs to kill to free memory.
Without limits, a single misbehaving pod can consume all available CPU or memory on a node, starving every other pod on that node. Your perfectly healthy API service goes unresponsive because someone else’s batch processing job has a memory leak. The batch job’s pod runs fine — it’s consuming exactly as much memory as it wants. Your API service is the collateral damage.
The hard part isn’t setting limits — it’s setting the right limits. Set them too low and your pods get CPU-throttled (mysterious latency spikes with no apparent cause) or OOM-killed (random pod restarts that look like application crashes but are actually Kubernetes enforcement). Set them too high and you waste capacity — paying for resources that sit idle while the scheduler thinks they’re in use.
The only reliable approach:
- Deploy without limits initially in a staging environment under realistic load
- Monitor actual CPU and memory usage for at least two weeks, capturing all traffic patterns including peak hours, batch processing windows, and deployment spikes
- Set requests at the P95 of actual usage (handles normal operation comfortably)
- Set limits at 2-3x the P99 (handles spikes without wasting excessive capacity)
- Monitor and adjust quarterly as traffic patterns evolve
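The steps above end up as a resources block on every container. A minimal sketch, where the pod name, image, and numbers are illustrative placeholders derived from the P95/P99 guidance, not recommendations for your workload:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: api-service            # hypothetical workload
spec:
  containers:
    - name: api
      image: example/api:1.0   # placeholder image
      resources:
        requests:
          cpu: "250m"          # ~P95 of observed CPU usage
          memory: "512Mi"      # ~P95 of observed memory usage
        limits:
          cpu: "1"             # ~2-3x the observed P99
          memory: "1Gi"        # ~2-3x the observed P99
```

Requests drive scheduling decisions; limits drive throttling and OOM enforcement. Both matter independently.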
Node Groups Are Not One-Size-Fits-All
Running everything on m5.xlarge is like buying one size of shoe for your entire family. It technically fits on everyone’s feet, but it’s optimal for nobody.
Your workloads have different resource profiles:
- API services need fast CPUs and moderate memory — compute-optimized instances
- Data processing jobs need lots of memory and moderate CPU — memory-optimized instances
- ML inference pods need GPUs — GPU instances (at 10x the cost of regular compute)
- Observability stack needs large, fast disks — storage-optimized instances
- Batch processing can tolerate interruption — spot instances at 60-90% discount
Create node groups per workload type. Use node selectors and taints to place pods on appropriate hardware. Yes, this adds complexity — more node groups to manage, more autoscaler configurations to tune, more capacity planning to maintain. But the cost efficiency improvement is significant: 30-40% savings compared to a single homogeneous node pool, because each workload runs on hardware that matches its actual requirements.
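As a sketch of the placement mechanics, assuming a node group you have labeled nodegroup=memory-optimized and tainted with workload=data:NoSchedule (both names are illustrative):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: etl-job                     # hypothetical data-processing pod
spec:
  nodeSelector:
    nodegroup: memory-optimized     # only schedule onto the labeled node group
  tolerations:
    - key: workload                 # tolerate the taint that keeps other pods off
      operator: Equal
      value: data
      effect: NoSchedule
  containers:
    - name: etl
      image: example/etl:1.0        # placeholder image
```

The taint keeps general-purpose pods off the expensive hardware; the nodeSelector and toleration together let this workload in. Without both halves, either anything can land on the memory-optimized nodes or nothing can.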
DNS Is the Silent Killer
The most common source of mysterious, intermittent failures in Kubernetes? DNS. And it’s the last thing most teams debug because DNS is supposed to “just work.”
CoreDNS handles every service discovery lookup in your cluster. When your API service calls the payment service by its Kubernetes service name, that’s a DNS lookup. When your service connects to an external database by hostname, that’s a DNS lookup. Every HTTP request, every database connection, every service-to-service call passes through CoreDNS.
When CoreDNS is overwhelmed — and at scale, with hundreds of services making thousands of lookups per second, it will be — lookups time out. Your services can’t find each other. Everything looks fine from the pod’s perspective — the application is running, the network is up, the health check passes — but connections fail because names don’t resolve.
The symptoms are infuriating: intermittent connection timeouts that appear randomly, affect random services, and resolve themselves before you can diagnose them. The monitoring dashboard shows occasional latency spikes that don’t correlate with any application metric.
The fix:
- Monitor CoreDNS latency and error rate as a first-class SLI (Service Level Indicator)
- Scale CoreDNS replicas proactively — don’t wait for it to be overwhelmed
- Deploy NodeLocal DNS Cache for high-query-rate workloads, which caches DNS responses at the node level and eliminates the network round-trip to CoreDNS
- Set an appropriate ndots value in your pod DNS config — the default of 5 causes unnecessary lookup chains for external domains, multiplying DNS query volume
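For a pod that mostly resolves fully qualified external names, lowering ndots avoids the search-domain chain. A sketch (the pod name and value are illustrative; profile your own lookup patterns before changing this):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: external-api-client     # hypothetical pod
spec:
  dnsConfig:
    options:
      - name: ndots
        value: "2"              # default is 5; with 2, a name like api.example.com
                                # (two dots) is tried as an absolute name first,
                                # skipping the cluster search-suffix attempts
  containers:
    - name: client
      image: example/client:1.0 # placeholder image
```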
Secrets Management Needs a Real Strategy
Kubernetes Secrets are base64-encoded, not encrypted. Let me state that more directly: base64 is an encoding format, not a security mechanism. Anyone with access to the etcd database — which stores all cluster state — can read every secret in your cluster by running a single command.
This means your database passwords, API keys, encryption keys, and service credentials are stored in a format that provides zero security. If an attacker compromises etcd, they compromise every secret. If a backup of etcd is stored insecurely, every secret is exposed. If someone with cluster-admin access runs kubectl get secret -o yaml, they see everything.
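To see how little protection base64 provides, decode a value exactly the way an attacker would. The secret value and name here are made up:

```shell
# A Secret stores values base64-encoded, e.g. password: czNjcjN0IQ==
# Decoding requires no key and no credentials, just one command:
echo 'czNjcjN0IQ==' | base64 -d
# With kubectl access, the equivalent single command is:
#   kubectl get secret db-creds -o jsonpath='{.data.password}' | base64 -d
```

The first command prints the plaintext password immediately, which is the whole point: base64 reverses for free.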
The production-grade approach:
- Use an external secrets manager (HashiCorp Vault, AWS Secrets Manager, Azure Key Vault, or Google Secret Manager) as the source of truth for all credentials
- Deploy the External Secrets Operator to sync secrets from the external manager into Kubernetes, with automatic rotation
- Enable encryption at rest for etcd using a KMS (Key Management Service) provider
- Implement RBAC policies that restrict secret access to specific namespaces and service accounts — not every pod in the cluster should be able to read every secret
- Audit secret access using the Kubernetes audit log to detect unauthorized reads
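The RBAC piece of this can be scoped all the way down to a single named secret. A sketch, where the namespace, secret, and service account names are hypothetical:

```yaml
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: read-db-creds
  namespace: payments              # hypothetical namespace
rules:
  - apiGroups: [""]                # core API group
    resources: ["secrets"]
    resourceNames: ["db-creds"]    # only this one secret, not all secrets
    verbs: ["get"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: payments-api-read-db-creds
  namespace: payments
subjects:
  - kind: ServiceAccount
    name: payments-api             # hypothetical service account
    namespace: payments
roleRef:
  kind: Role
  name: read-db-creds
  apiGroup: rbac.authorization.k8s.io
```

The resourceNames restriction is the part most teams skip: without it, any pod bound to the role can read every secret in the namespace.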
The Upgrade Treadmill
Kubernetes releases a new minor version every four months. Each version is supported for approximately fourteen months. This means you must upgrade at least once per year — and practically, you should upgrade every 4-6 months to avoid falling too far behind.
This is the operational cost that most Kubernetes evaluations dramatically underestimate. Each upgrade requires:
- Reading the changelog and identifying breaking changes (API deprecations, removed features, behavior changes)
- Testing every application in your cluster against the new version in a staging environment
- Validating that every controller, operator, and custom resource definition is compatible
- Coordinating the upgrade across node pools with zero-downtime rolling updates
- Verifying that monitoring, logging, and alerting continue to function correctly
Upgrades aren’t optional — running unsupported versions means no security patches, no bug fixes, and no vendor support. And upgrades aren’t free — they require planning, testing, coordination, and typically 2-4 weeks of engineering effort per upgrade cycle.
Build an upgrade pipeline:
- Maintain a staging cluster that mirrors production architecture
- Upgrade staging first, run automated compatibility tests, then wait two weeks
- Upgrade production with canary node pools — new nodes running the new version alongside old nodes running the current version
- Budget engineering time for upgrades every quarter — treat it as mandatory operational work, not discretionary investment
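One piece of this pipeline worth wiring up in advance is a PodDisruptionBudget, because node drains during a rolling upgrade honor PDBs. A sketch with an illustrative selector and threshold:

```yaml
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: api-service-pdb
spec:
  minAvailable: 2          # a drain will pause rather than evict below two ready replicas
  selector:
    matchLabels:
      app: api-service     # assumed pod label
```

Without a PDB, a drain can evict every replica of a service at once, turning a routine node upgrade into an outage.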
Networking Is More Complex Than You Think
Kubernetes networking is a layer of abstraction on top of your cloud provider’s networking, which is itself a layer of abstraction on top of physical networking. Each layer adds failure modes.
Common networking surprises in production:
- NetworkPolicy not enforced by default. Most Kubernetes installations don’t enforce NetworkPolicy rules unless you install a CNI (Container Network Interface) plugin that supports them. Your carefully written NetworkPolicies might be doing nothing.
- Pod-to-pod latency varies. Pods on the same node communicate with microsecond latency. Pods on different nodes communicate with millisecond latency. Pods in different availability zones communicate with multi-millisecond latency. If your application is latency-sensitive, pod placement matters.
- Service mesh overhead. Istio, Linkerd, and other service meshes add a sidecar proxy to every pod. That proxy adds 1-5ms of latency to every request. For a call chain that touches six services, that’s 6-30ms of added latency.
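A quick way to check whether your CNI actually enforces NetworkPolicy is to apply a default-deny policy and verify that traffic is really blocked. A sketch (the namespace name is illustrative):

```yaml
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: default-deny-ingress
  namespace: payments      # hypothetical namespace
spec:
  podSelector: {}          # applies to every pod in the namespace
  policyTypes:
    - Ingress              # with no ingress rules listed, all inbound traffic is denied
```

If pods in the namespace still accept connections after this is applied, your CNI is silently ignoring NetworkPolicy and your existing rules are doing nothing.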
The Honest Assessment
Kubernetes is powerful, flexible, and increasingly essential for organizations running containerized workloads at scale. It solves real problems that no other platform solves as well: multi-cloud portability, declarative infrastructure, sophisticated scheduling, and extensive ecosystem support.
It is also complex, operationally demanding, and easy to get catastrophically wrong. The total cost of operating Kubernetes — including the engineering time for upgrades, monitoring, security, and incident response — is significantly higher than the managed alternatives.
If you have fewer than 10 services, managed platforms like Cloud Run, AWS App Runner, Azure Container Apps, or AWS Fargate give you 80% of the benefits at 20% of the operational cost. Only adopt Kubernetes when the additional control, flexibility, and ecosystem justify the additional complexity.
The Garnet Grid perspective: We help organizations assess whether Kubernetes is the right choice for their workloads — and if it is, ensure it’s operated effectively with production-grade security, monitoring, and upgrade practices. Explore our DevOps maturity assessment →