SoftwareMarketplace.NetDigital Engineering & Technology Insights

The Complete Kubernetes Guide for Production Workloads

A working engineer's reference to running Kubernetes in production — architecture, security, networking, storage, observability, and the operational practices that prevent 2 AM pages.

By Raza Ahmad
Technology Author & IT Infrastructure Specialist

Should you actually run Kubernetes?

Before any architecture decision, answer this question honestly: does this workload actually need Kubernetes? A surprising number of organizations adopt Kubernetes because it is fashionable rather than because the workload requires it. For a small number of services, a managed serverless container platform (Cloud Run, ECS Fargate, Azure Container Apps) usually delivers the same outcome with a fraction of the operational burden.

Kubernetes earns its complexity when you have many services from many teams, when you need fine-grained scheduling, when you have stateful workloads that benefit from operator patterns, or when you are building an internal developer platform. If none of those apply, consider whether a simpler platform would serve you better. This guide is for teams who have made the decision and now have to operate it well.

Cluster topology and control plane

For production, use a managed Kubernetes service such as EKS, AKS, or GKE. Self-managed clusters are appropriate only if you have a dedicated platform team with deep Kubernetes expertise and a specific requirement that the managed services do not meet.

Decide deliberately between one large cluster and many smaller ones. Many small clusters (per environment, per business unit) shrink the blast radius of any failure and make upgrades safer; one large multi-tenant cluster reduces operational overhead and improves resource efficiency. Most organizations land somewhere in between, with one production cluster per region per tier of criticality.

Workload patterns: Deployments, StatefulSets, Jobs

Stateless services run as Deployments behind a Service. Stateful services run as StatefulSets with persistent volumes. Batch and scheduled work runs as Jobs and CronJobs. DaemonSets are for node-level agents.
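The mapping above can be written down as a small decision function. This is only a sketch of the rules in the paragraph: `WorkloadTraits` and `choose_controller` are hypothetical names for illustration, not part of any Kubernetes client library.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class WorkloadTraits:
    """Hypothetical description of a workload; not a Kubernetes type."""
    runs_on_every_node: bool = False
    runs_to_completion: bool = False
    on_a_schedule: bool = False
    needs_stable_identity_or_storage: bool = False


def choose_controller(traits: WorkloadTraits) -> str:
    """Return the controller kind suggested by the rules in the text."""
    if traits.runs_on_every_node:
        return "DaemonSet"          # node-level agents
    if traits.runs_to_completion:
        # Batch work: scheduled runs use CronJob, one-off runs use Job.
        return "CronJob" if traits.on_a_schedule else "Job"
    if traits.needs_stable_identity_or_storage:
        return "StatefulSet"        # stateful services with persistent volumes
    return "Deployment"             # stateless services behind a Service


if __name__ == "__main__":
    print(choose_controller(WorkloadTraits()))
    print(choose_controller(WorkloadTraits(runs_to_completion=True, on_a_schedule=True)))
```

Real workloads do not always fit one box (a stateless service with a scheduled cleanup task needs both a Deployment and a CronJob), so treat this as a first-pass filter rather than a rule engine.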

Every workload should have resource requests and limits. Requests are the contract you give the scheduler; limits prevent a runaway pod from starving its neighbours. Pods without requests get scheduled badly and contribute to the most common cluster reliability problem: noisy-neighbour resource contention.
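One way to enforce this is a check in CI that rejects pod specs with missing requests or limits. The sketch below assumes the pod spec has already been loaded into a plain dict (from YAML or JSON); `missing_resources` is a hypothetical helper, and the container names and values are made up for illustration.

```python
from __future__ import annotations

from typing import Any

# Resources the article's rule expects on every container.
REQUIRED = ("cpu", "memory")


def missing_resources(pod_spec: dict[str, Any]) -> list[str]:
    """List every container/field/resource combination that is not set."""
    problems = []
    for container in pod_spec.get("containers", []):
        resources = container.get("resources", {})
        for field in ("requests", "limits"):
            declared = resources.get(field, {})
            for name in REQUIRED:
                if name not in declared:
                    problems.append(f"{container['name']}: {field}.{name} not set")
    return problems


if __name__ == "__main__":
    spec = {
        "containers": [
            {
                "name": "api",
                "resources": {
                    "requests": {"cpu": "250m", "memory": "256Mi"},
                    "limits": {"memory": "512Mi"},
                },
            },
            {"name": "sidecar"},  # no resources block at all
        ]
    }
    for problem in missing_resources(spec):
        print(problem)
```

In a cluster, the same rule is usually enforced with a LimitRange or an admission policy; a script like this only catches the problem earlier, at review time.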

Networking: Services, Ingress, and CNI

Pod-to-pod networking is handled by the CNI plugin. Calico, Cilium, and the cloud providers' own CNIs (AWS VPC CNI, Azure CNI) are the realistic choices. Cilium has become a common default for new clusters because of its eBPF-based observability and network policy capabilities.

External traffic enters through an Ingress controller (NGINX, HAProxy, Traefik) or, increasingly, through the Gateway API. A service mesh (Istio, Linkerd) adds mTLS between services and rich traffic management; adopt it when you have a real need, not as a default.

Security: RBAC, Pod Security, network policy, supply chain

Lock down the API server. Use RBAC; no human should have cluster-admin in production. Use OIDC federation to your enterprise identity provider so access is governed centrally.
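As an illustration of least-privilege access, the sketch below builds a namespaced read-only Role and binds it to a group from the identity provider. The namespace, the role name `read-only`, and the group name are assumptions for the example; the group string must match whatever your API server's OIDC configuration (including any groups prefix) actually presents. The output is JSON wrapped in a `List`, which `kubectl apply -f` should accept in place of YAML.

```python
import json

RBAC_API = "rbac.authorization.k8s.io"


def namespaced_read_only(namespace: str, oidc_group: str) -> dict:
    """Build a Role and RoleBinding granting read-only access in one namespace."""
    role = {
        "apiVersion": f"{RBAC_API}/v1",
        "kind": "Role",
        "metadata": {"name": "read-only", "namespace": namespace},
        "rules": [
            {
                "apiGroups": ["", "apps"],
                "resources": ["pods", "pods/log", "services", "deployments"],
                "verbs": ["get", "list", "watch"],  # no write verbs
            }
        ],
    }
    binding = {
        "apiVersion": f"{RBAC_API}/v1",
        "kind": "RoleBinding",
        "metadata": {"name": "read-only", "namespace": namespace},
        # Bind to a group, not to individual users, so membership is
        # managed in the identity provider rather than in the cluster.
        "subjects": [{"kind": "Group", "name": oidc_group, "apiGroup": RBAC_API}],
        "roleRef": {"apiGroup": RBAC_API, "kind": "Role", "name": "read-only"},
    }
    return {"apiVersion": "v1", "kind": "List", "items": [role, binding]}


if __name__ == "__main__":
    # "payments" and "oidc:payments-engineers" are placeholder values.
    print(json.dumps(namespaced_read_only("payments", "oidc:payments-engineers"), indent=2))
```

Binding a Role inside a namespace, rather than a ClusterRole cluster-wide, keeps the grant scoped to what the team owns.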

Apply Pod Security Standards in the restricted profile to every application namespace; system namespaces such as kube-system typically need a more permissive profile. Enable network policies that default-deny pod-to-pod traffic and explicitly allow what each workload needs. Sign container images and verify signatures in admission control with Sigstore or Notary.
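The first two controls come down to a handful of manifest fields. The sketch below builds a namespace labelled for the restricted Pod Security profile and a default-deny NetworkPolicy for it. The helper names and the `payments` namespace are hypothetical; the label keys and the NetworkPolicy shape follow the upstream Kubernetes APIs.

```python
import json


def restricted_namespace(name: str) -> dict:
    """Namespace labelled so Pod Security Admission enforces 'restricted'."""
    return {
        "apiVersion": "v1",
        "kind": "Namespace",
        "metadata": {
            "name": name,
            "labels": {
                "pod-security.kubernetes.io/enforce": "restricted",
                "pod-security.kubernetes.io/warn": "restricted",
                "pod-security.kubernetes.io/audit": "restricted",
            },
        },
    }


def default_deny(namespace: str) -> dict:
    """NetworkPolicy that selects every pod and allows no traffic."""
    return {
        "apiVersion": "networking.k8s.io/v1",
        "kind": "NetworkPolicy",
        "metadata": {"name": "default-deny", "namespace": namespace},
        "spec": {
            "podSelector": {},                      # empty selector = all pods
            "policyTypes": ["Ingress", "Egress"],   # no rules listed = deny both
        },
    }


if __name__ == "__main__":
    # "payments" is a placeholder namespace name.
    for manifest in (restricted_namespace("payments"), default_deny("payments")):
        print(json.dumps(manifest, indent=2))
```

Denying egress also blocks DNS lookups, so the first allow rule most teams add on top of this is egress to the cluster DNS service. The policy only takes effect if the CNI plugin enforces NetworkPolicy.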

Observability: metrics, logs, traces, events

The standard observability stack on Kubernetes is Prometheus + Grafana for metrics, Loki or an external log platform for logs, Tempo or Jaeger for traces, and the Kubernetes events stream surfaced through your alerting platform. The kube-prometheus-stack Helm chart gives you a sensible default in fifteen minutes.

Set SLOs for each service and burn-rate alerts that page only on real problems. The default Kubernetes alerts that ship with kube-prometheus-stack are mostly informational; tune them.
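Burn rate is the observed error ratio divided by the error budget (1 minus the SLO target): a burn rate of 1 spends the budget exactly over the SLO period. The sketch below pages only when both a long and a short window exceed the threshold, which filters out brief spikes. The default threshold of 14.4 is an assumption taken from the commonly cited pairing of a one-hour and a five-minute window against a 30-day SLO (2% of the budget burned in one hour); the function names are hypothetical.

```python
def burn_rate(error_ratio: float, slo_target: float) -> float:
    """How many times faster than 'sustainable' the error budget is burning."""
    budget = 1.0 - slo_target
    if budget <= 0:
        raise ValueError("slo_target must be below 1.0")
    return error_ratio / budget


def should_page(
    long_window_error_ratio: float,
    short_window_error_ratio: float,
    slo_target: float,
    threshold: float = 14.4,  # assumed value; tune per service
) -> bool:
    """Page only if both windows are burning faster than the threshold."""
    return (
        burn_rate(long_window_error_ratio, slo_target) >= threshold
        and burn_rate(short_window_error_ratio, slo_target) >= threshold
    )


if __name__ == "__main__":
    # 99.9% SLO; 2% errors over the last hour and 3% over the last 5 minutes.
    print(should_page(0.02, 0.03, slo_target=0.999))
    # Same hour, but the last 5 minutes have recovered: do not page.
    print(should_page(0.02, 0.001, slo_target=0.999))
```

The short window is what lets the alert clear quickly once the incident is over, instead of paging for the rest of the hour.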

Upgrades, drift, and the operational treadmill

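Kubernetes releases a new minor version roughly every four months and supports each version for about fourteen months. Staying on a supported version means upgrading about three times a year, every year, forever; managed services publish their own support windows on top of this, so check your provider's calendar. Build the discipline.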
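The arithmetic behind that treadmill is short enough to write out. The sketch below uses the approximate figures from the paragraph above (four months between minor releases, fourteen months of support) as constants; both are round numbers, and the function names are hypothetical.

```python
# Approximate figures from the text; real release dates vary.
RELEASE_INTERVAL_MONTHS = 4
SUPPORT_WINDOW_MONTHS = 14


def upgrades_per_year() -> float:
    """Minor upgrades needed each year to stay on the latest release."""
    return 12 / RELEASE_INTERVAL_MONTHS


def months_of_support_left(minors_behind: int, months_since_latest_release: int) -> int:
    """Rough months until the running version leaves support.

    A version that is N minors behind was released about N intervals
    before the latest one, so its age is N * interval + the latest's age.
    A negative result means the version is already out of support.
    """
    age = minors_behind * RELEASE_INTERVAL_MONTHS + months_since_latest_release
    return SUPPORT_WINDOW_MONTHS - age


if __name__ == "__main__":
    print(upgrades_per_year())
    print(months_of_support_left(minors_behind=0, months_since_latest_release=1))
    print(months_of_support_left(minors_behind=2, months_since_latest_release=3))
    print(months_of_support_left(minors_behind=3, months_since_latest_release=3))
```

Under these assumptions, a cluster that is two minors behind has only a few months of support left, and one that is three behind is likely already past it.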

Use GitOps — Flux or Argo CD — to deploy workloads. The pattern of "the cluster matches what is in this Git repository" is the only sustainable way to manage configuration drift at scale.
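At its core, "the cluster matches the repository" is a comparison between desired and live state. The sketch below is a toy version of that comparison over plain dicts. It is not how Flux or Argo CD are implemented, and `drift` is a hypothetical helper. It checks only the fields declared in Git, so server-populated fields such as `status` are ignored.

```python
from __future__ import annotations

from typing import Any


def drift(desired: dict[str, Any], live: dict[str, Any], path: str = "") -> list[str]:
    """Return dotted paths where live state differs from what Git declares."""
    differences = []
    for key, want in desired.items():
        here = f"{path}.{key}" if path else key
        have = live.get(key)
        if isinstance(want, dict) and isinstance(have, dict):
            # Recurse into nested objects, carrying the path along.
            differences.extend(drift(want, have, here))
        elif want != have:
            differences.append(f"{here}: want {want!r}, have {have!r}")
    return differences


if __name__ == "__main__":
    # Illustrative, heavily simplified objects; not real Deployment schemas.
    in_git = {"spec": {"replicas": 3, "template": {"image": "api:1.4.2"}}}
    in_cluster = {
        "spec": {"replicas": 5, "template": {"image": "api:1.4.2"}},
        "status": {"readyReplicas": 5},
    }
    for line in drift(in_git, in_cluster):
        print(line)
```

A GitOps controller runs this kind of comparison continuously and then either reports the difference or reverts it, which is what makes a manual `kubectl edit` in production visible instead of silent.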

Frequently asked questions

Reader questions, answered

How many nodes should my cluster have?

Enough that losing one is invisible — typically at least three control plane nodes (managed) and three or more worker nodes spread across zones.
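"Invisible" can be checked with simple capacity arithmetic: after removing the largest node, or the largest zone, the remaining allocatable capacity should still cover what the workloads request. The sketch below does this for CPU only. `survives_loss` and the zone names are hypothetical, and a real check would also cover memory and scheduling constraints.

```python
from __future__ import annotations


def survives_loss(node_cpu_by_zone: dict[str, list[float]], requested_cpu: float) -> dict[str, bool]:
    """Check whether requested CPU still fits after losing one node or one zone."""
    all_nodes = [cpu for nodes in node_cpu_by_zone.values() for cpu in nodes]
    total = sum(all_nodes)
    largest_node = max(all_nodes)
    largest_zone = max(sum(nodes) for nodes in node_cpu_by_zone.values())
    return {
        "one_node": total - largest_node >= requested_cpu,
        "one_zone": total - largest_zone >= requested_cpu,
    }


if __name__ == "__main__":
    # Allocatable CPU cores per worker node, grouped by zone (made-up values).
    cluster = {"zone-a": [4, 4], "zone-b": [4], "zone-c": [4]}
    print(survives_loss(cluster, requested_cpu=9))
```

In this example the cluster absorbs the loss of any single node but not the loss of zone-a, which is the usual argument for spreading nodes evenly across zones.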

Do I need a service mesh?

Only when you have a clear need: mTLS, fine-grained traffic shaping, or cross-cluster service discovery. Otherwise the operational cost outweighs the value.

About the author: Raza Ahmad
Technology Author & IT Infrastructure Specialist

Raza Ahmad is a technology author and IT infrastructure specialist based in Melbourne, Australia. He writes practitioner-grade guides on cloud computing (Azure and AWS), cybersecurity, enterprise networking with Cisco platforms, Linux administration, DevOps, and virtualization. His work focuses on translating complex infrastructure topics into clear, accurate guidance that engineers, system administrators, and IT decision makers can put to work in production environments. Every article published under his byline is fact-checked against current vendor documentation, official standards, and Raza's own hands-on experience operating the technologies he covers.
