On-call support has always been a core part of Site Reliability Engineering and DevOps duties. Over the years, I’ve been paged for a lot of incidents: some were easy to resolve, while others I still vividly remember because of how hard they were to deal with.
After every incident and postmortem, I’ve made it a habit to write things down. Not polished reports, but personal notes: what broke, how it was detected, how it was solved, and what was done to prevent it in the future. Over time, those notes turned into a valuable record of hard-earned experience.
This is a summary of some of the incidents I experienced while managing Kubernetes, both as a user and as an administrator. Each entry is not meant to be a full, detailed postmortem; it’s more of a storytelling approach to share hints and key learnings that can help you troubleshoot similar issues.
1- Hot Pods Caused by Default Kubernetes Load Balancing for HTTP/2 Traffic
This incident occurred when we first introduced HTTP/2 traffic into Kubernetes. An unusual load pattern was observed in the HTTP/2 traffic distribution across backend pods: while node and pod metrics showed plenty of idle capacity, one pod was consistently hitting high CPU and latency, while others in the same service were almost completely idle.
Our mistake was expecting HTTP/2 traffic to behave the same way as HTTP/1.x in Kubernetes. kube-proxy doesn’t actually treat the two protocols differently, and that is exactly the problem: by default it performs Layer-4 (TCP) load balancing, distributing traffic at the connection level, not at the individual request level.
- When a client opens a TCP connection to a Kubernetes Service, kube-proxy selects one backend pod, and all requests over that same TCP connection go to that pod.
- HTTP/1.x clients typically open multiple short-lived connections, so traffic ends up evenly distributed.
- HTTP/2 clients multiplex many requests over a single long-lived connection, so all of those requests are forwarded to the same pod, leaving the rest mostly idle.
As a result, one pod became the bottleneck, and CPU and latency spiked on that pod.
The issue was initially difficult to mitigate due to limited understanding of kube-proxy’s connection handling. Tuning application-level keep-alive settings helped temporarily. The long-term solution was introducing Layer-7 load balancing through a service mesh capable of HTTP/2-aware routing.
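The incident above doesn’t depend on a specific mesh; as an illustration only, here is a minimal sketch assuming Istio with sidecar injection, where the Envoy sidecars terminate connections and balance individual HTTP/2 requests across endpoints instead of pinning a whole connection to one pod (the rule name, service name, and namespace are hypothetical):

# Hypothetical example, assuming Istio sidecars are injected for this namespace.
apiVersion: networking.istio.io/v1beta1
kind: DestinationRule
metadata:
  name: my-app-request-lb
  namespace: app-ns
spec:
  host: my-app.app-ns.svc.cluster.local
  trafficPolicy:
    loadBalancer:
      simple: ROUND_ROBIN   # Envoy balances per request, so multiplexed HTTP/2 requests are spread across pods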
Key Learnings:
By default, Kubernetes Service provides connection-level load balancing, which can lead to traffic imbalance when used with HTTP/2 multiplexing. Without Layer-7 routing, a single pod can become a bottleneck while others remain idle, leading to performance degradation.
2- Replica Field Ownership Conflict Between ArgoCD and HPA
This incident didn’t have a major impact, but it highlighted the importance of carefully evaluating technical decisions when introducing non-native Kubernetes tools such as ArgoCD.
The issue was detected through confusing scaling behavior with no obvious errors: pods would suddenly scale down for a few seconds and then scale back up to the expected replica count, and this happened during every ArgoCD sync. That pointed the investigation at ArgoCD, since there was nothing wrong with the HPA itself.
It turned out that the replicas field had been added to the last-applied-configuration annotation when someone manually ran kubectl edit to temporarily adjust the replica count. This was confirmed by running:
kubectl get deployment my-app -o jsonpath='{.metadata.annotations.kubectl\.kubernetes\.io/last-applied-configuration}'
In our case, ArgoCD was using client-side apply. Client-side apply relies on last-applied-configuration, so once replicas showed up in that annotation, every sync stripped the replica count managed by the HPA (briefly dropping the Deployment before the HPA scaled it back up), creating drift between the Git-defined desired state and the runtime state.
If ArgoCD had been using server-side apply, it would have thrown the following error:
the field "spec.replicas" is already owned by autoscaling/v2.HorizontalPodAutoscaler
Since the HPA manages replicas through server-side apply, there would be an ownership conflict between ArgoCD and the HPA over the replicas field. This serves as a safety mechanism that protects Deployment fields from being accidentally overwritten.
The incident was mitigated by immediately removing the stale last-applied-configuration annotation that contained the replicas field, then waiting for the next ArgoCD sync to pick up the change:
kubectl annotate deployment my-app \
kubectl.kubernetes.io/last-applied-configuration-
The long-term solution was switching ArgoCD from client-side apply to server-side apply.
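For reference, a minimal sketch of what that looks like on the Application resource, assuming Argo CD v2.5+ where the ServerSideApply=true sync option is available (the application name, repository URL, and namespaces are hypothetical):

apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: my-app
  namespace: argocd
spec:
  project: default
  source:
    repoURL: https://github.com/example/my-app-manifests   # hypothetical repository
    targetRevision: main
    path: deploy
  destination:
    server: https://kubernetes.default.svc
    namespace: app-ns
  syncPolicy:
    syncOptions:
      - ServerSideApply=true   # let the API server track field ownership instead of last-applied-configuration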
Key Learnings:
In GitOps-managed clusters, avoid any manual changes to deployment fields that are managed dynamically by controllers.
Avoid client-side apply in GitOps workflows, and always remove replicas from Deployment manifests when using HPA to ensure clear ownership and predictable scaling behavior.
3- CPU Limits Causing Unnecessary Throttling
This was one of the most confusing incidents I’ve dealt with regarding resource management. At first glance, the cluster looked perfectly healthy: node dashboards showed available CPU and no resource exhaustion alerts fired. From the user’s side, however, the application showed latency spikes and request timeouts.
The root cause turned out to be CPU limits configured at the pod level, which caused CPU throttling and slowed down the application.
Here's why: unlike memory, CPU isn't a consumable resource; it’s compressible and is renewed every scheduling period. In every CFS scheduling period, CPU time is allocated, reclaimed, and redistributed.
With CPU limits in place, a pod may have idle CPU available on the node but still be prevented from using it once its limit is reached.
When a container hits its CPU limit:
- The kernel throttles the container.
- The process is paused until the next CPU period.
- Even if the node has idle CPU, the container cannot burst beyond its limit.
CPU throttling metrics were the key signals for finding the root cause, specifically container_cpu_cfs_throttled_seconds_total and container_cpu_cfs_throttled_periods_total. Correlating those metrics with the latency graphs showed a clear alignment between latency spikes and CPU throttling.
Removing the CPU limits and keeping only the requests was enough to prevent throttling: requests guarantee each pod a proportional share of CPU and protect it from greedy neighbors, as sketched below.
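A rough sketch of the resulting container resources (the names and values are hypothetical; only the shape matters):

containers:
  - name: app                  # hypothetical container name
    image: example/app:latest  # hypothetical image
    resources:
      requests:
        cpu: "500m"            # guarantees a proportional CPU share under contention
        memory: "512Mi"
      limits:
        memory: "512Mi"        # keep a memory limit: memory is not compressible
        # intentionally no cpu limit, so the container can burst into idle CPU
        # instead of being throttled by the CFS quota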
Key Learnings:
CPU limits can introduce hidden performance bottlenecks by throttling pods even when nodes have idle CPU, making latency issues hard to diagnose. In most cases, well-sized CPU requests are sufficient to allow workloads to burst while preventing CPU starvation.
4- Critical Workloads Without PriorityClass
Imagine core business pods getting randomly evicted across the cluster and failing to get rescheduled quickly while less important workloads continue running.
When you don't set a PriorityClass for critical deployments, everything looks fine until the cluster comes under resource pressure, whether memory pressure, CPU starvation, or disk pressure. At that point, Kubernetes must decide which pods to evict or kill on a node.
The kubelet evicts pods based on their Quality of Service (QoS) class. If multiple pods share the same QoS class (which is very common), any of them can get evicted.
The worst part comes later during rescheduling. Evicted pods can get stuck in the Pending state with the message:
preemption: No victims found for incoming pod
Because all pods had the same priority and all nodes were already full, preemption was impossible.
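A minimal sketch of defining a PriorityClass and attaching it to a critical workload (the class name, value, and workload details are hypothetical):

apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: business-critical
value: 1000000                 # higher value = higher priority for scheduling and preemption
globalDefault: false
description: "Core business workloads that must keep running under node pressure."
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: my-app                 # hypothetical name
spec:
  replicas: 3
  selector:
    matchLabels:
      app: my-app
  template:
    metadata:
      labels:
        app: my-app
    spec:
      # The scheduler can preempt lower-priority pods for this workload, and the
      # kubelet takes priority into account when ranking pods for eviction.
      priorityClassName: business-critical
      containers:
        - name: app
          image: example/app:latest   # hypothetical image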
Key Learnings:
Setting resource requests and limits alone is not enough to survive under pressure. While they help delay eviction, they do not guarantee scheduling priority. Always define a PriorityClass for critical workloads.
5- EKS IP Exhaustion Caused by Stuck Terminating Pods
Imagine having more than 100 pods stuck in the Terminating state while new pods fail to schedule due to IP exhaustion.
After a custom operator was deleted, all pods carrying its finalizer got stuck in the Terminating state, because nothing was left running to remove the finalizer, and new pods went Pending due to IP exhaustion. Because EKS assigns real VPC IP addresses to pods, the terminating pods continued to hold their IPs, eventually exhausting the subnet and blocking new pod scheduling.
The incident was mitigated by identifying pods blocked by finalizers, and carefully removing those finalizers to allow pod deletion and IP release. Once subnet IP capacity recovered, scheduling resumed and the cluster stabilized.
To remove the finalizer from the pods in batch:
# For every terminating pod that still carries the operator's finalizer,
# generate and run a patch that clears the finalizer list so the pod can be deleted.
kubectl get pods -n <app-namespace> -o json |
jq -r '
.items[]
| select(.metadata.deletionTimestamp != null)
| select(.metadata.finalizers | index("<finalizer-name>"))
| "kubectl patch pod \(.metadata.name) -n <app-namespace> -p '\''{\"metadata\":{\"finalizers\":[]}}'\'' --type=merge"
' | sh
Key Learnings:
Finalizers must be treated as permanent until explicitly removed, controllers must never be deleted before their managed resources, and Terminating pods in EKS still consume real capacity.
6- Disk Pressure Caused by Local Temporary Logs
This incident was first noticed when deployments began failing with scheduling errors and alerts fired for unexpected pod evictions, alongside disk pressure alerts on several nodes.
By correlating disk usage spikes with per-pod CPU/memory usage, one pod was exposed as generating an abnormally high volume of logs, quickly exhausting the node disk and triggering DiskPressure. The kubelet began evicting unrelated pods and set the node condition NodeHasDiskPressure=True, and new workloads stopped scheduling with:
0/12 nodes are available: 12 node(s) had disk pressure.
The issue was solved by tearing down the noisy pod, manually spinning up new nodes in the cluster to accommodate pending pods, and cleaning up disk space on the affected nodes by removing container logs. Since the logs were not critical, a 30-minute downtime was acceptable. The permanent fix was switching from local storage to persistent storage in the Helm chart, ensuring logs were written to external storage instead of the node disk.
Key Learnings:
Never use emptyDir volumes to store pod data locally on nodes, even temporarily. Either write to external persistent storage to avoid node disk pressure incidents, or set ephemeral-storage limits so a single pod cannot consume unlimited disk, as sketched below.
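If the data genuinely has to stay on the node for a while, a rough sketch of capping it with ephemeral-storage requests and limits (names and values are hypothetical):

containers:
  - name: app
    image: example/app:latest          # hypothetical image
    resources:
      requests:
        ephemeral-storage: "1Gi"       # counted by the scheduler when placing the pod
      limits:
        ephemeral-storage: "2Gi"       # kubelet evicts the pod if its local writes (logs, emptyDir, writable layer) exceed this
    volumeMounts:
      - name: scratch-logs
        mountPath: /var/log/app        # hypothetical log path
volumes:
  - name: scratch-logs
    emptyDir:
      sizeLimit: "2Gi"                 # emptyDir usage can also be capped directly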
7- Missing ConfigMap Reloader
This incident started with a simple configuration change to the database connection settings, which were configured via a ConfigMap. The application pods didn't pick up the new values, simply because there was no ConfigMap reloader wired into the deployment.
A ConfigMap reloader watches for changes in ConfigMaps and Secrets, and when a change is detected it typically triggers a rolling restart of the affected Pods. Without one, the pods have to be restarted manually to pick up the changes.
The outage occurred when the application attempted to operate with outdated database connection settings, resulting in connection failures:
ERROR Database connection failed: could not connect to server
ERROR Connection timeout while attempting to reach database endpoint
ERROR Failed to initialize database pool: invalid connection parameters
ERROR Database unavailable after retrying 5 times
The issue remained unresolved until all Pods were manually restarted and picked up the latest configuration.
Key Learnings:
ConfigMap updates are not automatically applied to running Pods, and relying on manual Pod restarts is an operational risk that does not scale. Always run a ConfigMap reloader (e.g., stakater/reloader) and wire it into your deployments, as sketched below.
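As a sketch of that approach, assuming the stakater Reloader controller is installed in the cluster (the Deployment, image, and ConfigMap names are hypothetical), a single annotation is enough to roll the pods whenever a referenced ConfigMap or Secret changes:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: my-app
  annotations:
    reloader.stakater.com/auto: "true"   # Reloader triggers a rolling restart when a referenced ConfigMap/Secret changes
spec:
  replicas: 2
  selector:
    matchLabels:
      app: my-app
  template:
    metadata:
      labels:
        app: my-app
    spec:
      containers:
        - name: app
          image: example/app:latest               # hypothetical image
          envFrom:
            - configMapRef:
                name: db-connection-settings      # hypothetical ConfigMap holding the database settings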
Conclusion:
They say Kubernetes is easy, until you use it in production. That's true!
The more you scale, the more engineering effort is required to deeply understand its behavior and avoid downtime. Most Kubernetes incidents are not caused by feature bugs, but by implicit defaults and assumptions that only break under real pressure.
Kubernetes is powerful, but its defaults are not always production-safe, and every configuration choice is an operational decision.