The Complete Kubernetes Guide for Production Workloads
A working engineer's reference to running Kubernetes in production — architecture, security, networking, storage, observability, and the operational practices that prevent 2 AM pages.

Should you actually run Kubernetes?
Before any architecture decision, answer this question honestly: does this workload actually need Kubernetes? A surprising number of organizations adopt Kubernetes because it is fashionable rather than because the workload requires it. For a small number of services, a managed serverless container platform (Cloud Run, ECS Fargate, Azure Container Apps) usually delivers the same outcome with a fraction of the operational burden.
Kubernetes earns its complexity when you have many services from many teams, when you need fine-grained scheduling, when you have stateful workloads that benefit from operator patterns, or when you are building an internal developer platform. If none of those apply, consider whether a simpler platform would serve you better. This guide is for teams who have made the decision and now have to operate it well.
Cluster topology and control plane
For production, use a managed Kubernetes service (EKS, AKS, or GKE). Self-managed clusters are appropriate only if you have a dedicated platform team with deep Kubernetes expertise and a specific requirement that the managed services do not meet.
Decide deliberately between one large cluster and many smaller ones. Many small clusters, one per environment per business unit, limit blast radius and make upgrades safer; one large multi-tenant cluster reduces operational overhead and improves resource efficiency. Most organizations land somewhere in between, with one production cluster per region per tier of criticality.
Workload patterns: Deployments, StatefulSets, Jobs
Stateless services run as Deployments behind a Service. Stateful services run as StatefulSets with persistent volumes. Batch and scheduled work runs as Jobs and CronJobs. DaemonSets are for node-level agents.
Every workload should have resource requests and limits. Requests are the contract you give the scheduler; limits prevent a runaway pod from starving its neighbours. Pods without requests get scheduled badly and contribute to the most common cluster reliability problem: noisy-neighbour resource contention.
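As a rough illustration of the point above, the following Python sketch builds a Deployment manifest as a plain dictionary and flags containers that declare no requests or limits. The workload name `web`, the image reference, and the CPU and memory figures are hypothetical placeholders rather than recommendations, and `containers_missing_resources` is a helper invented for this example. The manifest is emitted as JSON, which `kubectl apply -f` accepts alongside YAML.

```python
import json


def deployment(name, image, requests=None, limits=None, replicas=2):
    """Build a minimal apps/v1 Deployment manifest as a dict."""
    resources = {}
    if requests:
        resources["requests"] = requests
    if limits:
        resources["limits"] = limits
    return {
        "apiVersion": "apps/v1",
        "kind": "Deployment",
        "metadata": {"name": name},
        "spec": {
            "replicas": replicas,
            "selector": {"matchLabels": {"app": name}},
            "template": {
                "metadata": {"labels": {"app": name}},
                "spec": {
                    "containers": [
                        {"name": name, "image": image, "resources": resources}
                    ]
                },
            },
        },
    }


def containers_missing_resources(manifest):
    """Return (container name, missing field) pairs for a pod-template workload."""
    missing = []
    for container in manifest["spec"]["template"]["spec"]["containers"]:
        resources = container.get("resources", {})
        for field in ("requests", "limits"):
            if not resources.get(field):
                missing.append((container["name"], field))
    return missing


# Hypothetical values: size these from observed usage, not from this example.
good = deployment(
    "web",
    "registry.example.com/web:1.0",
    requests={"cpu": "250m", "memory": "256Mi"},
    limits={"cpu": "500m", "memory": "512Mi"},
)
bad = deployment("web", "registry.example.com/web:1.0")

if __name__ == "__main__":
    print(json.dumps(good, indent=2))
    print(containers_missing_resources(bad))
```

A check like this could run in CI before manifests are merged, so that a pod without requests never reaches the scheduler in the first place.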
Networking: Services, Ingress, and CNI
Pod-to-pod networking is handled by the CNI plugin. Calico, Cilium, and the cloud-native CNIs (AWS VPC CNI, Azure CNI) are the realistic choices. Cilium has emerged as the modern default because of its eBPF-based observability and network policy capabilities.
External traffic enters through an Ingress controller (NGINX, HAProxy, Traefik) or, increasingly, through the Gateway API. Service mesh (Istio, Linkerd) adds mTLS between services and rich traffic management — adopt it when you have a real need, not as a default.
Security: RBAC, Pod Security, network policy, supply chain
Lock down the API server. Use RBAC; no human should have cluster-admin in production. Use OIDC federation to your enterprise identity provider so access is governed centrally.
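One way to picture least-privilege access is a namespaced read-only Role bound to a group that comes from the identity provider. The sketch below generates both objects as dictionaries. The namespace `payments`, the group name `oidc:payments-developers`, and the chosen resource list are assumptions made for illustration; the actual group string depends on how your API server's OIDC claims and prefixes are configured.

```python
import json

RBAC_GROUP = "rbac.authorization.k8s.io"


def namespaced_read_only(namespace, group, name="read-only"):
    """Return a Role and a RoleBinding granting read-only access to one group."""
    role = {
        "apiVersion": f"{RBAC_GROUP}/v1",
        "kind": "Role",
        "metadata": {"name": name, "namespace": namespace},
        "rules": [
            {
                # "" is the core API group (pods, services); "apps" holds deployments.
                "apiGroups": ["", "apps"],
                "resources": ["pods", "pods/log", "services", "deployments"],
                "verbs": ["get", "list", "watch"],
            }
        ],
    }
    binding = {
        "apiVersion": f"{RBAC_GROUP}/v1",
        "kind": "RoleBinding",
        "metadata": {"name": name, "namespace": namespace},
        "subjects": [{"kind": "Group", "name": group, "apiGroup": RBAC_GROUP}],
        "roleRef": {"apiGroup": RBAC_GROUP, "kind": "Role", "name": name},
    }
    return [role, binding]


# Hypothetical namespace and group name, for illustration only.
objects = namespaced_read_only("payments", "oidc:payments-developers")

if __name__ == "__main__":
    for obj in objects:
        print(json.dumps(obj, indent=2))
```

Binding to a group rather than to individual users keeps joiners and leavers in the identity provider, which is the point of federating access in the first place.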
Apply the restricted Pod Security Standards profile to every workload namespace; system namespaces such as kube-system typically need a looser profile for node-level components. Enable network policies that default-deny pod-to-pod traffic and explicitly allow what each workload needs. Sign container images and verify signatures in admission control with Sigstore or Notary.
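The namespace-level controls described here can be sketched as three small manifests: a namespace labelled for the restricted Pod Security profile, a default-deny NetworkPolicy, and one explicit allow rule. The namespace `payments`, the `app` label convention, the workload names `web` and `api`, and port 8080 are all hypothetical. Note that denying egress by default also blocks DNS lookups, so a real setup would need a further allow rule for the cluster DNS service, which is omitted here.

```python
import json


def restricted_namespace(name):
    """Namespace labelled so Pod Security Admission enforces the restricted profile."""
    return {
        "apiVersion": "v1",
        "kind": "Namespace",
        "metadata": {
            "name": name,
            "labels": {
                "pod-security.kubernetes.io/enforce": "restricted",
                "pod-security.kubernetes.io/warn": "restricted",
            },
        },
    }


def default_deny(namespace):
    """Select every pod in the namespace and allow no ingress or egress."""
    return {
        "apiVersion": "networking.k8s.io/v1",
        "kind": "NetworkPolicy",
        "metadata": {"name": "default-deny-all", "namespace": namespace},
        "spec": {"podSelector": {}, "policyTypes": ["Ingress", "Egress"]},
    }


def allow_ingress(namespace, to_app, from_app, port):
    """Allow TCP traffic to pods labelled app=to_app from pods labelled app=from_app."""
    return {
        "apiVersion": "networking.k8s.io/v1",
        "kind": "NetworkPolicy",
        "metadata": {"name": f"allow-{from_app}-to-{to_app}", "namespace": namespace},
        "spec": {
            "podSelector": {"matchLabels": {"app": to_app}},
            "policyTypes": ["Ingress"],
            "ingress": [
                {
                    "from": [{"podSelector": {"matchLabels": {"app": from_app}}}],
                    "ports": [{"protocol": "TCP", "port": port}],
                }
            ],
        },
    }


# Hypothetical namespace, labels, and port.
manifests = [
    restricted_namespace("payments"),
    default_deny("payments"),
    allow_ingress("payments", to_app="api", from_app="web", port=8080),
]

if __name__ == "__main__":
    for manifest in manifests:
        print(json.dumps(manifest, indent=2))
```

Network policies are additive, so the allow rule opens a single path through the default-deny baseline without weakening it for any other pod.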
Observability: metrics, logs, traces, events
The standard observability stack on Kubernetes is Prometheus + Grafana for metrics, Loki or an external log platform for logs, Tempo or Jaeger for traces, and the Kubernetes events stream surfaced through your alerting platform. The kube-prometheus-stack Helm chart gives you a sensible default in fifteen minutes.
Set SLOs for each service and burn-rate alerts that page only on real problems. The default Kubernetes alerts that ship with kube-prometheus-stack are mostly informational; tune them.
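The arithmetic behind a burn-rate alert is short enough to show directly. Burn rate is the observed error ratio divided by the error budget (one minus the SLO), and the fraction of budget consumed is the burn rate multiplied by the share of the SLO period that the window covers. The sketch below assumes a 99.9% SLO over 30 days and a paging threshold of 14.4 checked over a long and a short window, a commonly cited multi-window pattern; treat those numbers as starting assumptions rather than values this guide prescribes.

```python
def burn_rate(error_ratio, slo):
    """How many times faster than 'exactly on budget' errors are occurring."""
    budget = 1.0 - slo
    if budget <= 0:
        raise ValueError("an SLO of 100% leaves no error budget")
    return error_ratio / budget


def budget_consumed(rate, window_hours, period_hours=30 * 24):
    """Fraction of the period's error budget spent at this burn rate over the window."""
    return rate * window_hours / period_hours


def should_page(long_window_ratio, short_window_ratio, slo, threshold=14.4):
    """Page only if both windows burn fast: the long window shows the problem
    is significant, the short window shows it is still happening."""
    return (
        burn_rate(long_window_ratio, slo) >= threshold
        and burn_rate(short_window_ratio, slo) >= threshold
    )


if __name__ == "__main__":
    slo = 0.999  # assumed SLO: 99.9% of requests succeed over 30 days
    # 2% of requests failing in both the last hour and the last five minutes
    print(should_page(0.02, 0.02, slo))
    # a spike that has already recovered: the short window is healthy again
    print(should_page(0.02, 0.0005, slo))
    # share of the 30-day budget spent by one hour at a burn rate of 14.4
    print(budget_consumed(14.4, window_hours=1))
```

Requiring both windows to exceed the threshold is what keeps a brief, already-recovered spike from paging anyone.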
Upgrades, drift, and the operational treadmill
Kubernetes releases a new minor version roughly every four months and supports each version for about fourteen months. You will upgrade at least three times a year, every year, forever. Build the discipline.
Use GitOps — Flux or Argo CD — to deploy workloads. The pattern of "the cluster matches what is in this Git repository" is the only sustainable way to manage configuration drift at scale.
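To make "the cluster matches what is in this Git repository" concrete, here is a conceptual drift check. It compares only the fields declared in the desired manifest against the live object, ignoring fields the cluster adds on its own such as `status` or `metadata.uid`. This is a simplified illustration of the idea, not a description of how Flux or Argo CD implement reconciliation, and the `drift` function and sample objects are invented for the example.

```python
def drift(desired, live, path=""):
    """Return paths where the live object differs from fields declared in desired.

    Fields present only in the live object are ignored, because the API
    server and controllers populate many fields that Git never declares.
    """
    diffs = []
    if isinstance(desired, dict):
        if not isinstance(live, dict):
            return [path or "."]
        for key, want in desired.items():
            child = f"{path}.{key}" if path else key
            if key not in live:
                diffs.append(child)
            else:
                diffs.extend(drift(want, live[key], child))
    elif isinstance(desired, list):
        if not isinstance(live, list) or len(desired) != len(live):
            return [path or "."]
        for index, (want, have) in enumerate(zip(desired, live)):
            diffs.extend(drift(want, have, f"{path}[{index}]"))
    elif desired != live:
        diffs.append(path or ".")
    return diffs


# Hypothetical desired state (from Git) and live state (from the cluster).
desired = {
    "spec": {
        "replicas": 3,
        "template": {"spec": {"containers": [{"image": "web:1.0"}]}},
    }
}
live = {
    "metadata": {"uid": "abc-123"},
    "spec": {
        "replicas": 5,  # someone scaled the workload by hand
        "template": {
            "spec": {
                "containers": [
                    {"image": "web:1.0", "imagePullPolicy": "IfNotPresent"}
                ]
            }
        },
    },
    "status": {"readyReplicas": 5},
}

if __name__ == "__main__":
    print(drift(desired, live))
```

A GitOps controller runs a comparison of this general kind continuously and then either reports the difference or reverts the live object to the declared state.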
Reader questions, answered
How many nodes should my cluster have?
Enough that losing one is invisible: typically at least three control plane nodes (provided for you on a managed service) and three or more worker nodes spread across zones.
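"Losing one is invisible" translates into simple headroom arithmetic: the surviving nodes must be able to hold everything the failed node was running. The sketch below assumes identical nodes and looks only at total CPU requests; it ignores memory, system-reserved capacity, bin-packing waste, and zone-level failures, all of which push the real number higher. The function names and the sample figures are illustrative.

```python
import math


def max_safe_utilisation(nodes, tolerated_failures=1):
    """Highest average utilisation at which the surviving nodes can absorb the load."""
    if nodes <= tolerated_failures:
        raise ValueError("need more nodes than tolerated failures")
    return (nodes - tolerated_failures) / nodes


def nodes_needed(total_cpu_requests, cpu_per_node, tolerated_failures=1):
    """Nodes required to fit the requested CPU, plus spares for tolerated failures."""
    return math.ceil(total_cpu_requests / cpu_per_node) + tolerated_failures


if __name__ == "__main__":
    # With three workers, average utilisation has to stay below about two thirds.
    print(max_safe_utilisation(3))
    # Hypothetical load: 20 cores of requests on 8-core nodes, tolerating one loss.
    print(nodes_needed(20, 8))
```

The same reasoning explains why very small clusters are expensive to make resilient: the fewer the nodes, the larger the share of capacity that has to sit idle as headroom.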
Do I need a service mesh?
Only when you have a clear need: mTLS, fine-grained traffic shaping, or cross-cluster service discovery. Otherwise the operational cost outweighs the value.

Raza Ahmad is a technology author and IT infrastructure specialist based in Melbourne, Australia. He writes practitioner-grade guides on cloud computing (Azure and AWS), cybersecurity, enterprise networking with Cisco platforms, Linux administration, DevOps, and virtualization. His work focuses on translating complex infrastructure topics into clear, accurate guidance that engineers, system administrators, and IT decision makers can put to work in production environments. Every article published under his byline is fact-checked against current vendor documentation, official standards, and Raza's own hands-on experience operating the technologies he covers.