Kubernetes in Production: What Nobody Tells You
I remember the exact moment I regretted choosing Kubernetes. It was 2 AM on a Saturday. Our production cluster had run out of IP addresses because nobody had planned the VPC CIDR range properly. Pods were stuck in Pending state, our alerting was screaming, and the fix required draining nodes one by one while frantically expanding the subnet. Total downtime: 47 minutes. Root cause: a networking decision made six months earlier during initial setup, when "just use the defaults" seemed reasonable.
That incident taught me something that no Kubernetes tutorial or certification course will tell you: Kubernetes doesn't fail the way you expect it to. It fails in ways that require deep systems knowledge, and the abstractions that make it powerful are the same abstractions that make debugging it miserable.
This article isn't about bashing Kubernetes. I use it. I believe in it for the right use cases. But after running K8s clusters in production for multiple organizations, I've accumulated a collection of war stories and hard-won lessons that I wish someone had told me before I typed kubectl apply for the first time.
War Story #1: The OOMKilled Cascade
Here's a scenario that plays out in almost every Kubernetes deployment at some point. You set memory limits on your pods — 512Mi, because that's what the app uses in development. In production, under load, the app occasionally spikes to 520Mi. Kubernetes OOMKills the pod. The pod restarts. Traffic gets redistributed to surviving pods. Those pods now handle more traffic. They spike to 520Mi. They get OOMKilled. Within 90 seconds, you have a cascading failure across your entire deployment.
The fix is straightforward (set higher limits, use Vertical Pod Autoscaler), but the failure mode is insidious because it's self-reinforcing. Traditional server deployments don't have this problem — if an app uses too much memory on a VM, it swaps or slows down. It doesn't get killed and restart in a loop.
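The cascade dynamics above can be sketched as a toy simulation. All the numbers here (baseline memory, traffic, memory-per-request) are illustrative assumptions, not measurements; the point is only that being one replica short of stable collapses the whole deployment:

```python
# Toy simulation of an OOMKill cascade, assuming each pod's memory usage
# is a fixed baseline plus a share of a constant total traffic load.
# The 512Mi limit and all per-pod numbers are illustrative assumptions.

LIMIT_MI = 512          # container memory limit
BASE_MI = 260           # fixed per-pod overhead (assumed)
TOTAL_TRAFFIC = 1000    # requests/s across the deployment (assumed)
MI_PER_RPS = 1.0        # marginal memory per request/s (assumed)

def surviving_pods(replicas: int) -> int:
    """Kill pods until the per-pod load fits under the limit (or none survive)."""
    while replicas > 0:
        per_pod = BASE_MI + MI_PER_RPS * TOTAL_TRAFFIC / replicas
        if per_pod <= LIMIT_MI:
            return replicas  # stable: every pod fits under its limit
        replicas -= 1        # one pod OOMKilled; traffic spreads over the rest
    return 0

print(surviving_pods(4))  # 4 pods -> 260 + 250 = 510Mi each: stable
print(surviving_pods(3))  # 3 pods -> 260 + 333 = 593Mi each: cascade to zero
```

With four replicas every pod sits just under the limit; with three, the first kill pushes the survivors further over, and the loop runs to zero. That is the self-reinforcing part.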
According to Datadog's Container Report, OOMKilled is the second most common pod failure reason after CrashLoopBackOff, affecting approximately 15% of all Kubernetes deployments monthly. Yet most teams don't set memory limits correctly because the "right" limit requires production load testing that rarely happens.
War Story #2: The $14,000 Monthly Surprise
A startup I advised migrated from three EC2 instances ($450/month total) to EKS because "we need to scale." Their total infrastructure cost after migration: $14,200/month. Here's where the money went:
| Item | Monthly Cost |
|---|---|
| EKS control plane | $73 |
| 3x m5.xlarge nodes (was t3.medium) | $420 |
| NAT Gateway (data processing) | $3,200 |
| ALB (Application Load Balancer) | $180 |
| EBS volumes (PersistentVolumeClaims) | $340 |
| CloudWatch logging (container logs) | $2,100 |
| ECR (container registry) | $85 |
| Datadog Kubernetes monitoring | $5,400 |
| DevOps engineer time (partial) | $2,400 |
| Total | $14,198 |
The NAT Gateway cost alone was 7x their previous total infrastructure bill. Why? Because when EKS worker nodes sit in private subnets (which AWS recommends), all outbound internet traffic routes through NAT Gateways, which charge $0.045 per GB of data processed. Their app pulled container images, downloaded dependencies, and made API calls, and all of it added up.
The Datadog bill was the real shock. Kubernetes generates an extraordinary volume of metrics. Each pod, container, node, and service produces metrics every 10-15 seconds. At the per-host pricing of most monitoring tools, a modest cluster can generate thousands of dollars in monitoring costs alone. The CNCF FinOps for Kubernetes report found that monitoring and observability costs account for 15-25% of total Kubernetes spend for most organizations.
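A quick back-of-the-envelope check makes both line items less mysterious. The prices are the ones quoted above; the traffic and metric volumes are my assumptions for a modest cluster:

```python
# Sanity-check the two biggest line items. Prices come from the table
# above; traffic and metric volumes are assumed, not measured.

NAT_PER_GB = 0.045                  # NAT Gateway data processing, $/GB
nat_bill = 3200                     # their monthly NAT line item
nat_traffic_gb = nat_bill / NAT_PER_GB
print(f"{nat_traffic_gb:,.0f} GB")  # ~71,000 GB (~71 TB) through NAT per month

# Metric volume: even a modest cluster emits a flood of time series.
pods, containers_per_pod, metrics_per_container = 60, 2, 100  # assumed
scrape_interval_s = 15
datapoints_per_month = (
    pods * containers_per_pod * metrics_per_container
    * (30 * 24 * 3600 // scrape_interval_s)
)
print(f"{datapoints_per_month:,}")  # roughly 2 billion datapoints/month
```

A $3,200 NAT bill implies about 71 TB of processed traffic a month, and even my small assumed cluster produces on the order of two billion datapoints. Monitoring bills scale with exactly this kind of multiplication.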
War Story #3: The Helm Chart From Hell
You install a Helm chart for PostgreSQL (Bitnami, because everyone uses it). Everything works. Six months later, you need to upgrade PostgreSQL from 15 to 16. You run helm upgrade. The chart tries to modify the StatefulSet, but most StatefulSet fields, including the volumeClaimTemplates, are immutable, so the upgrade fails. Worse, a PostgreSQL major-version bump was never just an image swap: the new image can't read the old data directory without pg_upgrade or a dump and restore, so your database pod is stuck in a crash loop on the old volume.
Helm charts are convenient until they're not. The abstraction hides complexity that becomes critical during upgrades, rollbacks, and disaster recovery. I've seen teams spend days recovering from Helm upgrades gone wrong because they never read the 2,000-line values.yaml and didn't understand the underlying Kubernetes resources.
My policy now: use Helm for stateless applications only. For stateful workloads (databases, message queues, caches), either use a managed service or write plain Kubernetes manifests that you fully understand. The Helm Best Practices guide itself recommends caution with StatefulSets.
When Kubernetes Is Overkill
I'm going to be opinionated here because the Kubernetes ecosystem has a strong financial incentive to make you think you need it. Cloud providers, tooling vendors, and consulting firms all benefit from Kubernetes adoption. Here's my honest assessment:
You Don't Need Kubernetes If:
- You have fewer than 10 services. Docker Compose on a single server or a few VMs with a reverse proxy handles this perfectly. Kamal (from 37signals, the company behind Rails) and similar tools make deployment trivially simple.
- Your team is under 20 engineers. The operational overhead of Kubernetes requires dedicated platform engineering capacity. If nobody on your team has "Kubernetes" in their job description, you'll accumulate tech debt rapidly.
- You're not doing horizontal scaling. If your services don't need to scale from 2 to 20 pods based on traffic, you're paying the Kubernetes tax without the primary benefit.
- Your deployment frequency is low. If you deploy once a day or less, the sophisticated rollout strategies (canary, blue-green) that Kubernetes enables aren't worth the complexity.
- You're a startup before product-market fit. Your infrastructure needs will change dramatically as you iterate. Committing to Kubernetes too early locks you into an architecture that may not match your future needs.
Alternatives That Actually Work:
| Scenario | Alternative | Why |
|---|---|---|
| Single app, simple deployment | Railway, Render, Fly.io | Zero infrastructure management, $5-50/mo |
| Multiple services, small team | Docker Compose + Kamal | Full control, minimal overhead, $20-200/mo |
| Serverless workloads | AWS Lambda, Cloud Functions | Pay per invocation, auto-scaling built in |
| Static sites + APIs | Vercel, Netlify + managed backend | Edge deployment, zero config |
| GPU workloads | Modal, RunPod, Replicate | GPU scheduling is K8s's worst experience |
Managed vs Self-Hosted: The Real Trade-offs
If you've decided you need Kubernetes, the next decision is managed (EKS, GKE, AKS) vs self-hosted (kubeadm, k3s, Rancher). Here's the honest comparison:
Managed Kubernetes (EKS/GKE/AKS)
Pros:
- Control plane is managed — you don't deal with etcd, API server, or scheduler failures
- Automatic version upgrades (with caveats)
- Native integration with cloud provider services (IAM, networking, storage)
- SLA-backed availability
Cons:
- GKE charges $73/mo per cluster (free tier available for one Autopilot cluster). EKS charges $73/mo with no free tier. AKS is free for the control plane. GKE pricing is the most transparent.
- You're still responsible for node management, networking, security policies, and application-level configuration
- Version upgrade windows can break workloads — GKE's auto-upgrade has caused production issues for many teams
- "Managed" means "we manage the control plane." Everything else is still your problem.
Self-Hosted Kubernetes
Pros:
- Full control over versions, configuration, and networking
- Can run on bare metal for significant cost savings at scale
- No cloud vendor lock-in
- k3s makes lightweight self-hosted K8s surprisingly approachable
Cons:
- etcd management is non-trivial — etcd corruption can destroy your cluster
- Certificate rotation, API server patching, and kubelet updates are your responsibility
- Disaster recovery requires custom tooling (Velero, etcd snapshots)
- The Kubernetes documentation on production setup is over 50 pages long for a reason
My recommendation: Use GKE if you're on Google Cloud (it's the most mature managed offering). Use k3s if you need lightweight self-hosted. Avoid self-hosting full Kubernetes unless you have a dedicated platform team of at least 2-3 engineers.
The Things That Will Bite You (And How to Prevent Them)
1. Resource Requests and Limits
The number one mistake: not setting resource requests and limits. Without them, a single memory-hungry pod can consume all node resources and evict other pods. With limits set too low, you get the OOMKill cascade I described earlier.
Action: Use VPA (Vertical Pod Autoscaler) in recommendation mode for two weeks before setting production limits. It analyzes actual resource usage and suggests appropriate values.
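Once VPA has a couple of weeks of recommendations, turning a target into requests and limits is a policy decision, not something VPA dictates. A minimal sketch, assuming a "request = VPA target, limit = target plus 30% headroom" policy; the 30% and the 64Mi floor are my numbers, tune them per workload:

```python
def limits_from_vpa_target(target_mi: int, headroom: float = 0.30,
                           floor_mi: int = 64) -> dict:
    """Derive memory request/limit from a VPA target recommendation.

    Policy (an assumption, not a VPA rule): request the VPA target,
    allow `headroom` extra before the OOMKiller steps in, and never go
    below a small floor so tiny workloads still schedule sanely.
    """
    request = max(target_mi, floor_mi)
    limit = int(request * (1 + headroom))
    return {"requests": {"memory": f"{request}Mi"},
            "limits": {"memory": f"{limit}Mi"}}

print(limits_from_vpa_target(512))
# {'requests': {'memory': '512Mi'}, 'limits': {'memory': '665Mi'}}
```

Note how this directly addresses War Story #1: a 512Mi target gets a 665Mi limit, so the occasional spike to 520Mi no longer triggers a kill.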
2. Pod Disruption Budgets
During node upgrades, Kubernetes drains pods from nodes. Without Pod Disruption Budgets (PDBs), it can drain all replicas of a service simultaneously, causing downtime. This is especially common during cluster upgrades.
Action: Set PDBs for every production deployment: minAvailable: 1 at minimum, maxUnavailable: 25% for larger deployments.
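That rule of thumb can be encoded as a tiny helper that picks a PDB spec from the replica count. The 25% figure and the 4-replica cutoff are my assumptions; Kubernetes itself only requires that you pick either minAvailable or maxUnavailable, never both:

```python
def pdb_spec(replicas: int) -> dict:
    """Suggest a PodDisruptionBudget spec for a deployment.

    Small deployments: guarantee at least one pod stays up.
    Larger ones: cap voluntary disruptions at 25% of replicas.
    (A PDB takes either minAvailable or maxUnavailable, not both.)
    """
    if replicas <= 1:
        raise ValueError("a PDB can't protect a single-replica deployment "
                         "from downtime; run at least 2 replicas first")
    if replicas <= 4:
        return {"minAvailable": 1}
    return {"maxUnavailable": "25%"}

print(pdb_spec(3))   # {'minAvailable': 1}
print(pdb_spec(12))  # {'maxUnavailable': '25%'}
```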
3. DNS Resolution Failures
CoreDNS is the cluster DNS server, and even though a default deployment runs two replicas, it's a shared dependency that nobody monitors until it breaks. Under heavy load, CoreDNS can become a bottleneck, causing intermittent DNS resolution failures that manifest as random timeouts across your entire cluster.
Action: Monitor CoreDNS latency and cache hit rate. Use dnsPolicy: None with custom dnsConfig for pods that make heavy external DNS queries. Consider NodeLocal DNSCache for large clusters.
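You can get a crude read on resolver latency from inside any pod with nothing but the standard library. This is a sketch for spot-checking; real monitoring should scrape CoreDNS's own Prometheus metrics rather than probe like this:

```python
import socket
import time

def resolve_latency_ms(hostname: str, attempts: int = 5) -> list[float]:
    """Time getaddrinfo() calls, which go through the pod's configured
    resolver (CoreDNS in a default cluster). Repeated calls expose the
    tail latency that a single averaged probe hides."""
    timings = []
    for _ in range(attempts):
        start = time.perf_counter()
        socket.getaddrinfo(hostname, 80)
        timings.append((time.perf_counter() - start) * 1000)
    return timings

print(resolve_latency_ms("localhost"))  # sub-millisecond when healthy
```

If the fifth call is 100x slower than the first, you are looking at exactly the kind of intermittent failure described above.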
4. Ingress Controller Misconfiguration
NGINX Ingress Controller ships with no rate limiting, and teams routinely disable its request body size limit (proxy-body-size: 0) to unblock large uploads and then never revisit it. This is a security and stability risk. I've seen a production cluster taken down by a single user uploading a 10GB file because nobody had constrained proxy-body-size.
Action: Set sensible defaults in your Ingress Controller: request body limits, connection timeouts, rate limiting per IP. Review the NGINX Ingress annotations documentation.
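Concretely, these are the kinds of annotations to set, shown here as the metadata.annotations map you'd put in an Ingress manifest. The keys are real ingress-nginx annotations; the values are starting points I chose, not recommendations tuned to your traffic:

```python
# Sensible-default annotations for an ingress-nginx Ingress. The keys are
# real ingress-nginx annotation names; the values are assumed starting
# points, to be tuned against your actual traffic profile.
INGRESS_ANNOTATIONS = {
    "nginx.ingress.kubernetes.io/proxy-body-size": "10m",    # cap uploads
    "nginx.ingress.kubernetes.io/proxy-read-timeout": "30",  # seconds
    "nginx.ingress.kubernetes.io/proxy-send-timeout": "30",  # seconds
    "nginx.ingress.kubernetes.io/limit-rps": "50",           # per client IP
    "nginx.ingress.kubernetes.io/limit-connections": "20",   # per client IP
}

for key, value in INGRESS_ANNOTATIONS.items():
    print(f"{key}: {value}")
```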
5. Secret Management
Kubernetes Secrets are base64-encoded, not encrypted. Anyone with RBAC permission to read Secrets in the namespace can recover them instantly. Storing database passwords, API keys, and certificates as Kubernetes Secrets without additional encryption is a security vulnerability.
Action: Use External Secrets Operator with a secrets manager (AWS Secrets Manager, HashiCorp Vault, GCP Secret Manager). Enable encryption at rest for etcd. Audit RBAC policies regularly.
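To see how thin that "encoding" really is, here is everything it takes to read one. The secret name and value are made up for illustration:

```python
import base64

# What a command like `kubectl get secret ... -o jsonpath='{.data.password}'`
# hands back is plain base64. The value here is a made-up example.
encoded = base64.b64encode(b"s3cr3t-db-password").decode()
print(encoded)                             # czNjcjN0LWRiLXBhc3N3b3Jk
print(base64.b64decode(encoded).decode())  # s3cr3t-db-password
```

One function call, no keys, no passphrase. That is why encryption at rest for etcd and an external secrets manager are both on the action list.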
The Production Readiness Checklist
Before going to production with Kubernetes, ensure you have answers for every item on this list:
| Category | Requirement | Status |
|---|---|---|
| Networking | VPC CIDR range planned for growth | [ ] |
| Networking | Network policies restricting pod-to-pod traffic | [ ] |
| Security | RBAC policies defined per team/namespace | [ ] |
| Security | Pod Security Standards enforced (restricted) | [ ] |
| Security | Image scanning in CI/CD pipeline | [ ] |
| Reliability | Resource requests/limits on all pods | [ ] |
| Reliability | Pod Disruption Budgets on all deployments | [ ] |
| Reliability | Liveness and readiness probes configured | [ ] |
| Observability | Centralized logging (not just kubectl logs) | [ ] |
| Observability | Metrics collection (Prometheus or equivalent) | [ ] |
| Observability | Alerting on cluster-level metrics | [ ] |
| Disaster Recovery | etcd backup strategy (if self-hosted) | [ ] |
| Disaster Recovery | Cluster recreation runbook tested | [ ] |
| Cost | Cost monitoring per namespace/team | [ ] |
Action Plan: Adopting Kubernetes Responsibly
Step 1: Validate the Need (Week 1)
Write a one-page document answering: "What specific problems will Kubernetes solve that our current infrastructure can't?" If the answer is primarily about developer experience (not scaling), consider lighter alternatives like Docker Compose + CI/CD.
Step 2: Start with a Non-Critical Workload (Week 2-4)
Deploy a staging environment or internal tool on Kubernetes. Don't start with your revenue-generating production service. Learn the failure modes in a low-stakes environment.
Step 3: Build Observability First (Week 3-5)
Set up monitoring, logging, and alerting BEFORE migrating production workloads. You need visibility into the cluster before you can operate it reliably. Use Prometheus + Grafana (open source) to avoid the $5,000/month Datadog bill.
Step 4: Migration with Rollback (Week 5-8)
Migrate one service at a time. Keep the old deployment running in parallel. Use DNS-based traffic shifting (weighted routing) to gradually move traffic. Ensure you can roll back to the previous infrastructure within minutes.
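The traffic-shifting decision itself is simple to reason about as a sketch. In a real setup the weights live in your DNS provider (weighted records) or load balancer, not in application code; this only shows the mechanism and the rollback property:

```python
import random

def pick_backend(k8s_weight: int, rng: random.Random) -> str:
    """Route a request to the new cluster with probability k8s_weight/100,
    otherwise to the legacy deployment. Rollback = set k8s_weight to 0."""
    return "k8s" if rng.randrange(100) < k8s_weight else "legacy"

rng = random.Random(0)
sample = [pick_backend(10, rng) for _ in range(10_000)]
print(sample.count("k8s"))  # ~1,000 of 10,000 requests hit the new cluster
```

Start at a weight of 5 or 10, watch error rates on the new side, and only ratchet up when a full day looks clean. Because a weight of 0 sends everything back to the old infrastructure, rollback stays a one-line change.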
Step 5: Operationalize (Ongoing)
Write runbooks for common failure scenarios. Practice disaster recovery. Set up cost alerts. Train the team on kubectl and cluster debugging. Kubernetes is not a "set it and forget it" technology — it requires ongoing operational investment.
Final Thoughts
Kubernetes is an extraordinarily powerful tool when used appropriately. It enables reliable, scalable, self-healing infrastructure for complex distributed systems. But it's also the most over-adopted technology in the industry. The gap between "can we use Kubernetes?" and "should we use Kubernetes?" is where most teams waste months and thousands of dollars.
The best Kubernetes deployments I've seen share a common trait: they were adopted deliberately, with full awareness of the operational cost, by teams that had exhausted simpler alternatives. The worst deployments were driven by resume-driven development, conference talks, and the fear of "not being cloud-native."
Choose boring technology when boring technology works. And when it doesn't — when you genuinely need the power of container orchestration at scale — Kubernetes will be there, waiting, with all its complexity and all its capability.
Sources
- Datadog Container Report
- CNCF FinOps for Kubernetes Report
- GKE Pricing
- Kubernetes Production Setup Documentation
- Helm Best Practices Guide
- NGINX Ingress Controller Documentation
- External Secrets Operator
- Vertical Pod Autoscaler
I'm Ismat, and I build BirJob — Azerbaijan's job aggregator scraping 80+ sources daily.
