Kubernetes in Production: What Nobody Tells You
I remember the exact moment I regretted choosing Kubernetes. It was 2 AM on a Saturday. Our production cluster had run out of IP addresses because nobody had planned the VPC CIDR range properly. Pods were stuck in Pending state, our alerting was screaming, and the fix required draining nodes one by one while frantically expanding the subnet. Total downtime: 47 minutes. Root cause: a networking decision made six months earlier during initial setup, when "just use the defaults" seemed reasonable.
That incident taught me something that no Kubernetes tutorial or certification course will tell you: Kubernetes doesn't fail the way you expect it to. It fails in ways that require deep systems knowledge, and the abstractions that make it powerful are the same abstractions that make debugging it miserable.
This article isn't about bashing Kubernetes. I use it. I believe in it for the right use cases. But after running K8s clusters in production for multiple organizations, I've accumulated a collection of war stories and hard-won lessons that I wish someone had told me before I typed kubectl apply for the first time.
War Story #1: The OOMKilled Cascade
Here's a scenario that plays out in almost every Kubernetes deployment at some point. You set memory limits on your pods — 512Mi, because that's what the app uses in development. In production, under load, the app occasionally spikes to 520Mi. Kubernetes OOMKills the pod. The pod restarts. Traffic gets redistributed to surviving pods. Those pods now handle more traffic. They spike to 520Mi. They get OOMKilled. Within 90 seconds, you have a cascading failure across your entire deployment.
The fix is straightforward (set higher limits, use Vertical Pod Autoscaler), but the failure mode is insidious because it's self-reinforcing. Traditional server deployments don't have this problem — if an app uses too much memory on a VM, it swaps or slows down. It doesn't get killed and restart in a loop.
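The cascade dynamics above can be sketched as a toy simulation. All the numbers here (baseline memory, traffic, memory-per-request) are illustrative assumptions, not measurements; the point is only that being one replica short of stable collapses the whole deployment:

```python
# Toy simulation of an OOMKill cascade, assuming each pod's memory usage
# is a fixed baseline plus a share of a constant total traffic load.
# The 512Mi limit and all per-pod numbers are illustrative assumptions.

LIMIT_MI = 512          # container memory limit
BASE_MI = 260           # fixed per-pod overhead (assumed)
TOTAL_TRAFFIC = 1000    # requests/s across the deployment (assumed)
MI_PER_RPS = 1.0        # marginal memory per request/s (assumed)

def surviving_pods(replicas: int) -> int:
    """Kill pods until the per-pod load fits under the limit (or none survive)."""
    while replicas > 0:
        per_pod = BASE_MI + MI_PER_RPS * TOTAL_TRAFFIC / replicas
        if per_pod <= LIMIT_MI:
            return replicas  # stable: every pod fits under its limit
        replicas -= 1        # one pod OOMKilled; traffic spreads over the rest
    return 0

print(surviving_pods(4))  # 4 pods -> 260 + 250 = 510Mi each: stable
print(surviving_pods(3))  # 3 pods -> 260 + 333 = 593Mi each: cascade to zero
```

With four replicas every pod sits just under the limit; with three, the first kill pushes the survivors further over, and the loop runs to zero. That is the self-reinforcing part.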
According to Datadog's Container Report, OOMKilled is the second most common pod failure reason after CrashLoopBackOff, affecting approximately 15% of all Kubernetes deployments monthly. Yet most teams don't set memory limits correctly because the "right" limit requires production load testing that rarely happens.
War Story #2: The $14,000 Monthly Surprise
A startup I advised migrated from three EC2 instances ($450/month total) to EKS because "we need to scale." Their total infrastructure cost after migration: $14,200/month. Here's where the money went:
| Item | Monthly Cost |
|---|---|
| EKS control plane | $73 |
| 3x m5.xlarge nodes (was t3.medium) | $420 |
| NAT Gateway (data processing) | $3,200 |
| ALB (Application Load Balancer) | $180 |
| EBS volumes (PersistentVolumeClaims) | $340 |
| CloudWatch logging (container logs) | $2,100 |
| ECR (container registry) | $85 |
| Datadog Kubernetes monitoring | $5,400 |
| DevOps engineer time (partial) | $2,400 |
| Total | $14,198 |
The NAT Gateway cost alone was 7x their previous total infrastructure bill. Why? Because when EKS worker nodes sit in private subnets (which AWS recommends), all outbound internet traffic routes through NAT Gateways, which charge $0.045 per GB of data processed. Their app pulled container images, downloaded dependencies, and made API calls, and all of it added up.
The Datadog bill was the real shock. Kubernetes generates an extraordinary volume of metrics. Each pod, container, node, and service produces metrics every 10-15 seconds. At the per-host pricing of most monitoring tools, a modest cluster can generate thousands of dollars in monitoring costs alone. The CNCF FinOps for Kubernetes report found that monitoring and observability costs account for 15-25% of total Kubernetes spend for most organizations.
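A quick back-of-the-envelope check makes both line items less mysterious. The prices are the ones quoted above; the traffic and metric volumes are my assumptions for a modest cluster:

```python
# Sanity-check the two biggest line items. Prices come from the table
# above; traffic and metric volumes are assumed, not measured.

NAT_PER_GB = 0.045                  # NAT Gateway data processing, $/GB
nat_bill = 3200                     # their monthly NAT line item
nat_traffic_gb = nat_bill / NAT_PER_GB
print(f"{nat_traffic_gb:,.0f} GB")  # ~71,000 GB (~71 TB) through NAT per month

# Metric volume: even a modest cluster emits a flood of time series.
pods, containers_per_pod, metrics_per_container = 60, 2, 100  # assumed
scrape_interval_s = 15
datapoints_per_month = (
    pods * containers_per_pod * metrics_per_container
    * (30 * 24 * 3600 // scrape_interval_s)
)
print(f"{datapoints_per_month:,}")  # roughly 2 billion datapoints/month
```

A $3,200 NAT bill implies about 71 TB of processed traffic a month, and even my small assumed cluster produces on the order of two billion datapoints. Monitoring bills scale with exactly this kind of multiplication.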
War Story #3: The Helm Chart From Hell
You install a Helm chart for PostgreSQL (Bitnami, because everyone uses it). Everything works. Six months later, you need to upgrade PostgreSQL from 15 to 16. You run helm upgrade. The chart tries to modify the StatefulSet, but most StatefulSet fields, including the volumeClaimTemplates, are immutable, so the upgrade fails. Worse, a PostgreSQL major-version bump was never just an image swap: the new image can't read the old data directory without pg_upgrade or a dump and restore, so your database pod is stuck in a crash loop on the old volume.
Helm charts are convenient until they're not. The abstraction hides complexity that becomes critical during upgrades, rollbacks, and disaster recovery. I've seen teams spend days recovering from Helm upgrades gone wrong because they never read the 2,000-line values.yaml and didn't understand the underlying Kubernetes resources.
My policy now: use Helm for stateless applications only. For stateful workloads (databases, message queues, caches), either use a managed service or write plain Kubernetes manifests that you fully understand. The Helm Best Practices guide itself recommends caution with StatefulSets.
When Kubernetes Is Overkill
I'm going to be opinionated here because the Kubernetes ecosystem has a strong financial incentive to make you think you need it. Cloud providers, tooling vendors, and consulting firms all benefit from Kubernetes adoption. Here's my honest assessment:
You Don't Need Kubernetes If:
- You have fewer than 10 services. Docker Compose on a single server or a few VMs with a reverse proxy handles this perfectly. Kamal (from 37signals, the company behind Rails) and similar tools make deployment trivially simple.
- Your team is under 20 engineers. The operational overhead of Kubernetes requires dedicated platform engineering capacity. If nobody on your team has "Kubernetes" in their job description, you'll accumulate tech debt rapidly.
- You're not doing horizontal scaling. If your services don't need to scale from 2 to 20 pods based on traffic, you're paying the Kubernetes tax without the primary benefit.
- Your deployment frequency is low. If you deploy once a day or less, the sophisticated rollout strategies (canary, blue-green) that Kubernetes enables aren't worth the complexity.
- You're a startup before product-market fit. Your infrastructure needs will change dramatically as you iterate. Committing to Kubernetes too early locks you into an architecture that may not match your future needs.
Alternatives That Actually Work:
| Scenario | Alternative | Why |
|---|---|---|
| Single app, simple deployment | Railway, Render, Fly.io | Zero infrastructure management, $5-50/mo |
| Multiple services, small team | Docker Compose + Kamal | Full control, minimal overhead, $20-200/mo |
| Serverless workloads | AWS Lambda, Cloud Functions | Pay per invocation, auto-scaling built in |
| Static sites + APIs | Vercel, Netlify + managed backend | Edge deployment, zero config |
| GPU workloads | Modal, RunPod, Replicate | GPU scheduling is K8s's worst experience |
Managed vs Self-Hosted: The Real Trade-offs
If you've decided you need Kubernetes, the next decision is managed (EKS, GKE, AKS) vs self-hosted (kubeadm, k3s, Rancher). Here's the honest comparison:
Managed Kubernetes (EKS/GKE/AKS)
Pros:
- Control plane is managed — you don't deal with etcd, API server, or scheduler failures
- Automatic version upgrades (with caveats)
- Native integration with cloud provider services (IAM, networking, storage)
- SLA-backed availability
Cons:
- GKE charges $73/mo per cluster (free tier available for one Autopilot cluster). EKS charges $73/mo with no free tier. AKS is free for the control plane. GKE pricing is the most transparent.
- You're still responsible for node management, networking, security policies, and application-level configuration
- Version upgrade windows can break workloads — GKE's auto-upgrade has caused production issues for many teams
- "Managed" means "we manage the control plane." Everything else is still your problem.
Self-Hosted Kubernetes
Pros:
- Full control over versions, configuration, and networking
- Can run on bare metal for significant cost savings at scale
- No cloud vendor lock-in
- k3s makes lightweight self-hosted K8s surprisingly approachable
Cons:
- etcd management is non-trivial — etcd corruption can destroy your cluster
- Certificate rotation, API server patching, and kubelet updates are your responsibility
- Disaster recovery requires custom tooling (Velero, etcd snapshots)
- The Kubernetes documentation on production setup is over 50 pages long for a reason
My recommendation: Use GKE if you're on Google Cloud (it's the most mature managed offering). Use k3s if you need lightweight self-hosted. Avoid self-hosting full Kubernetes unless you have a dedicated platform team of at least 2-3 engineers.
The Things That Will Bite You (And How to Prevent Them)
1. Resource Requests and Limits
The number one mistake: not setting resource requests and limits. Without them, a single memory-hungry pod can consume all node resources and evict other pods. With limits set too low, you get the OOMKill cascade I described earlier.
Action: Use VPA (Vertical Pod Autoscaler) in recommendation mode for two weeks before setting production limits. It analyzes actual resource usage and suggests appropriate values.
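Once VPA has a couple of weeks of recommendations, turning a target into requests and limits is a policy decision, not something VPA dictates. A minimal sketch, assuming a "request = VPA target, limit = target plus 30% headroom" policy; the 30% and the 64Mi floor are my numbers, tune them per workload:

```python
def limits_from_vpa_target(target_mi: int, headroom: float = 0.30,
                           floor_mi: int = 64) -> dict:
    """Derive memory request/limit from a VPA target recommendation.

    Policy (an assumption, not a VPA rule): request the VPA target,
    allow `headroom` extra before the OOMKiller steps in, and never go
    below a small floor so tiny workloads still schedule sanely.
    """
    request = max(target_mi, floor_mi)
    limit = int(request * (1 + headroom))
    return {"requests": {"memory": f"{request}Mi"},
            "limits": {"memory": f"{limit}Mi"}}

print(limits_from_vpa_target(512))
# {'requests': {'memory': '512Mi'}, 'limits': {'memory': '665Mi'}}
```

Note how this directly addresses War Story #1: a 512Mi target gets a 665Mi limit, so the occasional spike to 520Mi no longer triggers a kill.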
2. Pod Disruption Budgets
During node upgrades, Kubernetes drains pods from nodes. Without Pod Disruption Budgets (PDBs), it can drain all replicas of a service simultaneously, causing downtime. This is especially common during cluster upgrades.
Action: Set PDBs for every production deployment: minAvailable: 1 at minimum, maxUnavailable: 25% for larger deployments.
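That rule of thumb can be encoded as a tiny helper that picks a PDB spec from the replica count. The 25% figure and the 4-replica cutoff are my assumptions; Kubernetes itself only requires that you pick either minAvailable or maxUnavailable, never both:

```python
def pdb_spec(replicas: int) -> dict:
    """Suggest a PodDisruptionBudget spec for a deployment.

    Small deployments: guarantee at least one pod stays up.
    Larger ones: cap voluntary disruptions at 25% of replicas.
    (A PDB takes either minAvailable or maxUnavailable, not both.)
    """
    if replicas <= 1:
        raise ValueError("a PDB can't protect a single-replica deployment "
                         "from downtime; run at least 2 replicas first")
    if replicas <= 4:
        return {"minAvailable": 1}
    return {"maxUnavailable": "25%"}

print(pdb_spec(3))   # {'minAvailable': 1}
print(pdb_spec(12))  # {'maxUnavailable': '25%'}
```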
3. DNS Resolution Failures
CoreDNS is the cluster DNS server, and even though a default deployment runs two replicas, it's a shared dependency that nobody monitors until it breaks. Under heavy load, CoreDNS can become a bottleneck, causing intermittent DNS resolution failures that manifest as random timeouts across your entire cluster.
Action: Monitor CoreDNS latency and cache hit rate. Use dnsPolicy: None with custom dnsConfig for pods that make heavy external DNS queries. Consider NodeLocal DNSCache for large clusters.
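You can get a crude read on resolver latency from inside any pod with nothing but the standard library. This is a sketch for spot-checking; real monitoring should scrape CoreDNS's own Prometheus metrics rather than probe like this:

```python
import socket
import time

def resolve_latency_ms(hostname: str, attempts: int = 5) -> list[float]:
    """Time getaddrinfo() calls, which go through the pod's configured
    resolver (CoreDNS in a default cluster). Repeated calls expose the
    tail latency that a single averaged probe hides."""
    timings = []
    for _ in range(attempts):
        start = time.perf_counter()
        socket.getaddrinfo(hostname, 80)
        timings.append((time.perf_counter() - start) * 1000)
    return timings

print(resolve_latency_ms("localhost"))  # sub-millisecond when healthy
```

If the fifth call is 100x slower than the first, you are looking at exactly the kind of intermittent failure described above.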
4. Ingress Controller Misconfiguration
NGINX Ingress Controller ships with no rate limiting, and teams routinely disable its request body size limit (proxy-body-size: 0) to unblock large uploads and then never revisit it. This is a security and stability risk. I've seen a production cluster taken down by a single user uploading a 10GB file because nobody had constrained proxy-body-size.
Action: Set sensible defaults in your Ingress Controller: request body limits, connection timeouts, rate limiting per IP. Review the NGINX Ingress annotations documentation.
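Concretely, these are the kinds of annotations to set, shown here as the metadata.annotations map you'd put in an Ingress manifest. The keys are real ingress-nginx annotations; the values are starting points I chose, not recommendations tuned to your traffic:

```python
# Sensible-default annotations for an ingress-nginx Ingress. The keys are
# real ingress-nginx annotation names; the values are assumed starting
# points, to be tuned against your actual traffic profile.
INGRESS_ANNOTATIONS = {
    "nginx.ingress.kubernetes.io/proxy-body-size": "10m",    # cap uploads
    "nginx.ingress.kubernetes.io/proxy-read-timeout": "30",  # seconds
    "nginx.ingress.kubernetes.io/proxy-send-timeout": "30",  # seconds
    "nginx.ingress.kubernetes.io/limit-rps": "50",           # per client IP
    "nginx.ingress.kubernetes.io/limit-connections": "20",   # per client IP
}

for key, value in INGRESS_ANNOTATIONS.items():
    print(f"{key}: {value}")
```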
5. Secret Management
Kubernetes Secrets are base64-encoded, not encrypted. Anyone with RBAC permission to read Secrets in the namespace can recover them instantly. Storing database passwords, API keys, and certificates as Kubernetes Secrets without additional encryption is a security vulnerability.
Action: Use External Secrets Operator with a secrets manager (AWS Secrets Manager, HashiCorp Vault, GCP Secret Manager). Enable encryption at rest for etcd. Audit RBAC policies regularly.
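To see how thin that "encoding" really is, here is everything it takes to read one. The secret name and value are made up for illustration:

```python
import base64

# What a command like `kubectl get secret ... -o jsonpath='{.data.password}'`
# hands back is plain base64. The value here is a made-up example.
encoded = base64.b64encode(b"s3cr3t-db-password").decode()
print(encoded)                             # czNjcjN0LWRiLXBhc3N3b3Jk
print(base64.b64decode(encoded).decode())  # s3cr3t-db-password
```

One function call, no keys, no passphrase. That is why encryption at rest for etcd and an external secrets manager are both on the action list.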
The Production Readiness Checklist
Before going to production with Kubernetes, ensure you have answers for every item on this list:
| Category | Requirement | Status |
|---|---|---|
| Networking | VPC CIDR range planned for growth | [ ] |
| Networking | Network policies restricting pod-to-pod traffic | [ ] |
| Security | RBAC policies defined per team/namespace | [ ] |
| Security | Pod Security Standards enforced (restricted) | [ ] |
| Security | Image scanning in CI/CD pipeline | [ ] |
| Reliability | Resource requests/limits on all pods | [ ] |
| Reliability | Pod Disruption Budgets on all deployments | [ ] |
| Reliability | Liveness and readiness probes configured | [ ] |
| Observability | Centralized logging (not just kubectl logs) | [ ] |
| Observability | Metrics collection (Prometheus or equivalent) | [ ] |
| Observability | Alerting on cluster-level metrics | [ ] |
| Disaster Recovery | etcd backup strategy (if self-hosted) | [ ] |
| Disaster Recovery | Cluster recreation runbook tested | [ ] |
| Cost | Cost monitoring per namespace/team | [ ] |
Action Plan: Adopting Kubernetes Responsibly
Step 1: Validate the Need (Week 1)
Write a one-page document answering: "What specific problems will Kubernetes solve that our current infrastructure can't?" If the answer is primarily about developer experience (not scaling), consider lighter alternatives like Docker Compose + CI/CD.
Step 2: Start with a Non-Critical Workload (Week 2-4)
Deploy a staging environment or internal tool on Kubernetes. Don't start with your revenue-generating production service. Learn the failure modes in a low-stakes environment.
Step 3: Build Observability First (Week 3-5)
Set up monitoring, logging, and alerting BEFORE migrating production workloads. You need visibility into the cluster before you can operate it reliably. Use Prometheus + Grafana (open source) to avoid the $5,000/month Datadog bill.
Step 4: Migration with Rollback (Week 5-8)
Migrate one service at a time. Keep the old deployment running in parallel. Use DNS-based traffic shifting (weighted routing) to gradually move traffic. Ensure you can roll back to the previous infrastructure within minutes.
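The traffic-shifting decision itself is simple to reason about as a sketch. In a real setup the weights live in your DNS provider (weighted records) or load balancer, not in application code; this only shows the mechanism and the rollback property:

```python
import random

def pick_backend(k8s_weight: int, rng: random.Random) -> str:
    """Route a request to the new cluster with probability k8s_weight/100,
    otherwise to the legacy deployment. Rollback = set k8s_weight to 0."""
    return "k8s" if rng.randrange(100) < k8s_weight else "legacy"

rng = random.Random(0)
sample = [pick_backend(10, rng) for _ in range(10_000)]
print(sample.count("k8s"))  # ~1,000 of 10,000 requests hit the new cluster
```

Start at a weight of 5 or 10, watch error rates on the new side, and only ratchet up when a full day looks clean. Because a weight of 0 sends everything back to the old infrastructure, rollback stays a one-line change.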
Step 5: Operationalize (Ongoing)
Write runbooks for common failure scenarios. Practice disaster recovery. Set up cost alerts. Train the team on kubectl and cluster debugging. Kubernetes is not a "set it and forget it" technology — it requires ongoing operational investment.
Final Thoughts
Kubernetes is an extraordinarily powerful tool when used appropriately. It enables reliable, scalable, self-healing infrastructure for complex distributed systems. But it's also the most over-adopted technology in the industry. The gap between "can we use Kubernetes?" and "should we use Kubernetes?" is where most teams waste months and thousands of dollars.
The best Kubernetes deployments I've seen share a common trait: they were adopted deliberately, with full awareness of the operational cost, by teams that had exhausted simpler alternatives. The worst deployments were driven by resume-driven development, conference talks, and the fear of "not being cloud-native."
Choose boring technology when boring technology works. And when it doesn't — when you genuinely need the power of container orchestration at scale — Kubernetes will be there, waiting, with all its complexity and all its capability.
Sources
- Datadog Container Report
- CNCF FinOps for Kubernetes Report
- GKE Pricing
- Kubernetes Production Setup Documentation
- Helm Best Practices Guide
- NGINX Ingress Controller Documentation
- External Secrets Operator
- Vertical Pod Autoscaler
I'm Ismat, and I build BirJob — Azerbaijan's job aggregator scraping 80+ sources daily.
