Kubernetes offers built-in health checks called probes to automatically manage container health and keep your services resilient. Three types of probes – liveness, readiness, and startup – work in concert to prevent pods from failing silently or restarting unnecessarily. When configured correctly, these probes enable Kubernetes to detect when an application is alive, ready to serve traffic, or still starting up, and to take appropriate action (like restarting a container or removing it from a load balancer). This article covers best practices for tuning probe parameters and common pitfalls to avoid, with the aim of showing how to leverage liveness, readiness, and startup probes to build robust, self-healing Kubernetes deployments.
Liveness vs. Readiness vs. Startup Probes
Kubernetes uses liveness probes to determine if a container is still alive and functioning. If a liveness probe fails, Kubernetes will restart the container. This is useful for cases like deadlocks, where an application process is running but stuck and unable to progress. By restarting a non-responsive container, liveness probes can help restore service availability despite bugs. However, as we’ll discuss, misconfigured liveness probes can also introduce instability if they cause needless restarts.
In contrast, readiness probes indicate whether a container is ready to handle requests. When a readiness probe fails (or has not yet succeeded), Kubernetes marks the pod as not ready and temporarily removes it from Service load balancers. This means no traffic is sent to the pod until the readiness probe starts passing. Readiness probes are crucial during startup and during any transient error conditions – they prevent sending requests to a pod that isn’t ready to serve them. For example, an application might need to load large data files or wait for an external service at startup; during that time you don’t want to kill the container, but you also shouldn’t route traffic to it. The readiness probe covers this scenario by keeping the pod out of rotation until it reports healthy.
The newer startup probe (introduced in Kubernetes 1.16) is designed to handle slow-starting containers. If a startup probe is configured for a container, Kubernetes will suspend liveness and readiness checks until the startup probe succeeds. Essentially, the startup probe gives the application time to initialize fully without being interrupted. Once the startup probe passes (meaning the app has started successfully), Kubernetes begins normal liveness and readiness checks. If the startup probe fails too many times, it’s treated as an indication the container will never start; Kubernetes will kill the container, respecting the pod’s restartPolicy. Using a startup probe can avoid the common issue of a slow-starting app getting killed by a liveness probe before it even finishes initializing. In summary:
- Liveness Probe: “Is my container alive?” – Restart the container if this probe fails (e.g. detect stuck processes).
- Readiness Probe: “Can this container serve traffic right now?” – Remove the pod from Service endpoints if this probe fails (without restarting it).
- Startup Probe: “Has the application startup completed?” – If configured, delay liveness/readiness until startup is successful, or kill the container if startup never succeeds.
All three probes use the same mechanisms to check health: you can perform an HTTP GET on a URL, open a TCP socket, or run an arbitrary command. A command probe succeeds if it exits with code 0, an HTTP probe succeeds if the response status is in the 200-399 range, and a TCP probe succeeds if the connection can be opened. You define these probes in your Pod spec, under the container definition. Let’s look at how to configure them with some examples.
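The full example in the next section uses HTTP probes; for reference, here is a minimal sketch of the other two mechanisms, using a Redis container as a stand-in (the container name, port, and command are illustrative):

containers:
- name: cache
  image: redis:7
  # TCP probe: succeeds if the port accepts a connection
  livenessProbe:
    tcpSocket:
      port: 6379
    periodSeconds: 15
  # Exec probe: succeeds if the command exits with code 0
  readinessProbe:
    exec:
      command: ["redis-cli", "ping"]
    periodSeconds: 5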
Configuring Probes with YAML Examples
Probes are defined per container in a Pod (or Deployment) YAML. Here’s an example snippet of a container with all three probes configured:
containers:
- name: web-app
  image: my-web-app:latest
  ports:
  - containerPort: 3000
  # Startup Probe: wait for up to 60s (30 * 2s) for app initialization (e.g. DB connection)
  startupProbe:
    httpGet:
      path: /api/startup
      port: 3000
    periodSeconds: 2
    failureThreshold: 30
  # Liveness Probe: check every 15s that the app is responsive
  livenessProbe:
    httpGet:
      path: /api/health
      port: 3000
    periodSeconds: 15
    timeoutSeconds: 3
    failureThreshold: 3
  # Readiness Probe: check every 5s that the app can serve traffic
  readinessProbe:
    httpGet:
      path: /api/ready
      port: 3000
    periodSeconds: 5
    timeoutSeconds: 2
    failureThreshold: 3
In this example, the startup probe gives the application up to 60 seconds to finish initialization and connect to its database (it will try the /api/startup endpoint every 2 seconds, up to 30 failures). Only after the startup probe succeeds does the liveness probe kick in, checking the /api/health endpoint every 15 seconds to ensure the app remains healthy. The readiness probe runs more frequently (every 5 seconds on /api/ready) to ensure the app can serve requests (for instance, the readiness check might verify a database connection or other dependencies). This setup ensures the app has sufficient time to start, containers are restarted only if they truly become unhealthy, and traffic is sent only to pods that are ready.
Probe Configuration Fields
Each probe has several tunable parameters that control its behavior:
- initialDelaySeconds – How long to wait after the container starts before performing the first probe. If your application needs some warm-up time before a health endpoint is available, set an initial delay. For example, if a web server typically takes 20 seconds to start, you might use initialDelaySeconds: 20 for the readiness probe so Kubernetes doesn’t mark it unhealthy during startup. (If a startup probe is in use, it will override the initial delays for liveness/readiness until startup completes.)
- periodSeconds – How often to run the probe (default is 10 seconds for liveness and readiness). In our example above, liveness runs every 15s and readiness every 5s. Faster probes catch issues sooner but also consume more overhead and risk flapping; common practice is to run readiness checks more frequently than liveness checks, since a readiness failure is less disruptive (it doesn’t restart the container). For most applications, a readiness interval of 5-10s and a liveness interval of 15-30s is a reasonable starting point.
- timeoutSeconds – How long to wait for the probe to respond before counting it as a failure. The default timeout is only 1 second. This can be too short for some checks or during high load. For instance, one team found that under heavy CPU load, their app’s health endpoint responses slowed slightly, causing occasional timeouts at the 1s default and triggering unwanted restarts. Increasing the timeout to 3-5 seconds for HTTP probes can prevent false negatives in those situations.
- failureThreshold – How many consecutive failures of the probe are allowed before Kubernetes takes action. For liveness, if the probe fails this many times in a row, the kubelet will kill the container. For readiness, after this many failures, the pod is marked not ready. The default is 3, which means roughly failureThreshold * period (plus timeouts) elapses before action is taken. For example, with default settings (period 10s, failureThreshold 3), a pod will be marked Unready after about 30 seconds of failing readiness checks. You can tune failureThreshold higher if you want to tolerate brief failures without restarting. In the heavy-load scenario mentioned above, the team increased their liveness probe’s failureThreshold to 10 to require a more sustained failure before restarting. This, combined with a longer timeout, made their system far less likely to reboot due to transient load spikes.
- successThreshold – How many consecutive successes are required to consider the probe successful. This field is rarely changed for liveness (which only needs a single success; in fact, Kubernetes requires it to be 1 for liveness and startup probes), but it can be useful for readiness. For example, if your readiness check is flickering between success/failure, you might set successThreshold: 2 so that two successful checks in a row are required before marking the pod Ready. The default is 1 for all probes.
By carefully tuning these parameters, you can balance fast detection of real issues with tolerance for transient conditions. A common pattern is to use the same HTTP endpoint for both readiness and liveness probes, but give the liveness probe a higher failureThreshold or longer interval. This way, if the health endpoint starts failing, the pod will first be marked Unready (removing it from traffic) for a grace period; only if the failures persist will the liveness probe finally restart the container. This approach avoids killing a container for a short-lived glitch. Always consider the nature of your application when configuring probes – e.g. how long it takes to start, how it behaves under load, and what external dependencies it has.
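A hedged sketch of that shared-endpoint pattern (the /healthz path, port, and numbers are illustrative, not prescriptive): readiness reacts within about 15 seconds, while liveness tolerates roughly two minutes of sustained failure before restarting.

# Readiness: pulls the pod from rotation after ~15s of failures (3 * 5s)
readinessProbe:
  httpGet:
    path: /healthz
    port: 8080
  periodSeconds: 5
  timeoutSeconds: 2
  failureThreshold: 3
# Liveness: same endpoint, but restarts only after ~120s of sustained failure (6 * 20s)
livenessProbe:
  httpGet:
    path: /healthz
    port: 8080
  periodSeconds: 20
  timeoutSeconds: 5
  failureThreshold: 6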
Why Probes Matter (and How They Prevent Outages)
Properly using liveness, readiness, and startup probes can prevent a lot of common production issues. Here are some real-world scenarios highlighting their importance:
- Avoiding Traffic to Unready Pods: During a rolling deployment or startup, an application might be running but not yet ready to serve requests (perhaps it’s still loading config or warming up caches). If no readiness probe is defined, Kubernetes assumes the pod is ready as soon as the container is running, and it will start sending user traffic immediately. This often leads to users hitting errors or timeouts. By defining a readiness probe, you ensure the Service will wait until your app explicitly reports ready before routing traffic. This prevents those “half-baked” responses and reduces errors during deployments. In practice, always defining a readiness probe for microservices (especially ones with HTTP APIs) is considered a best practice – it’s a simple way to avoid a class of production issues where new pods receive requests too early.
- Self-Healing from Hangs or Deadlocks: If an application gets stuck (e.g. due to a deadlock or thread-pool exhaustion), without a liveness probe it might just sit there indefinitely – not serving traffic but also not crashing. Kubernetes won’t automatically restart it because from the outside the process is still running. This is a silent failure. A liveness probe that checks an appropriate heartbeat endpoint can detect this situation and restart the container. That said, liveness probes should be used to catch unrecoverable failures in the app. If your app can detect and recover or exit on its own, that’s often better than relying on Kubernetes to restart it. Use liveness probes for scenarios where a restart is the only remedy (e.g. stuck processes). They are a powerful tool for resiliency – one that has saved many teams from late-night pages by auto-recovering hung services.
- Preventing Cascading Failures: Probes, especially liveness probes, must be used carefully to avoid making a bad situation worse. A classic example is misconfiguring a liveness probe to depend on an external service. Imagine your liveness probe tries to connect to your database to check health. If the database goes down or has a hiccup, the liveness probe will start failing even though your application code might be perfectly fine. Kubernetes will then restart all your pods because it thinks they are unhealthy, when in reality they were only temporarily unable to reach the DB. Now your service is doubly impacted: a database issue turned into a full application restart for all pods. As one engineer noted, “a single DB hiccup will restart all your containers” if your liveness checks hit the database. This is usually worse than the original problem. The better approach in such cases is to use readiness probes (to stop receiving traffic when the dependency is down) but not liveness, so that the pods stay running and can automatically resume work when the DB comes back – see the sketch after this list. The takeaway: don’t tie liveness probes to external dependencies that your app doesn’t control. Liveness checks should ideally be self-contained (checking the internals of the container only).
- Handling Slow Startups Gracefully: Many production incidents have occurred because an application took slightly longer to start than expected, and the liveness probe (with its timer set too aggressively) killed it mid-startup. This often leads to a CrashLoopBackOff cycle. For example, suppose you set a liveness probe with initialDelaySeconds: 30, expecting your app to be ready by then. If a new deployment happens where the app actually needs 40 seconds to finish booting (maybe a migration or a cold start), the liveness probe will start failing at 30s, and after failureThreshold consecutive failures Kubernetes restarts the pod mid-boot. Now the cycle begins anew, and the app may never come up. The fix is to either increase the initial delay or, more elegantly, use a startup probe that allows a one-time longer window for startup. Startup probes are specifically designed to prevent this scenario – they let you accommodate a “worst-case startup time” without compromising the responsiveness of liveness checks once the app is running. Always account for the variability in startup times (e.g. slower in resource-constrained environments or after large changes) when setting these thresholds.
- Resisting Overload Failures: Even with the best tuning, an overly aggressive liveness probe can amplify problems under heavy load. A real-world incident in 2023 described how a combination of a tight liveness probe and a CPU limit caused an unstable feedback loop. The application would get a burst of traffic and slow down under CPU throttling; the liveness probe (set with a 1 second timeout) would occasionally time out and force a restart of the pod. But restarting cleared in-memory state and also removed the pod from service briefly, pushing more load onto fewer pods. When the pod came back, it was hit with an even larger backlog of requests, causing the liveness probe to fail again, leading to another restart. This cycle resulted in a permanent CrashLoopBackOff where the service was effectively down, caused entirely by the health check and resource limit interplay. The resolution was to relax the liveness probe settings (a 5s timeout and a failureThreshold of 10) and remove the CPU limit, which broke the feedback loop. The moral here is that liveness probes should be tuned to truly indicate a stuck application, not just momentary slowness. During overload, failing a probe might worsen the situation by killing containers that were actually working (just slow). It’s often better to let a slow pod continue serving what it can (perhaps marked not ready) rather than restart it and add more strain on the system.
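As referenced in the cascading-failures example above, here is a minimal sketch of that split (the /ready and /livez paths are hypothetical endpoint names): the readiness endpoint is allowed to depend on the database, while the liveness endpoint checks only in-process health.

# Readiness may reflect external dependencies: the pod is pulled from rotation while the DB is down
readinessProbe:
  httpGet:
    path: /ready        # app returns 503 here when its DB connection is down
    port: 8080
  periodSeconds: 5
# Liveness stays self-contained: checks only that the process itself is responsive
livenessProbe:
  httpGet:
    path: /livez        # app returns 200 here as long as its own threads/event loop are healthy
    port: 8080
  periodSeconds: 15
  failureThreshold: 3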
In summary, probes are critical for automation and self-healing, but they must be used thoughtfully. When correctly configured, readiness probes prevent users from seeing errors, liveness probes revive genuinely broken containers, and startup probes ensure you don’t shoot down a container that’s still initializing. But if misconfigured, these same mechanisms can cause cascading failures or downtime. Next, we’ll cover some best practices and common mistakes to help you get probes right.
Best Practices and Common Pitfalls
As you plan your liveness/readiness/startup probe strategy, keep these best practices and cautions in mind:
- Prefer Readiness for Dependency Issues: Use readiness probes to handle situations where the app cannot serve traffic temporarily, but will likely recover without a full restart (e.g. waiting for a dependent service). Mark the pod unready during the outage so no traffic is sent, but avoid using liveness to restart the pod in those cases. This way, when the dependency comes back, your app can immediately resume serving, without the extra downtime of rebooting the process.
- Don’t Probe Deeper Than Necessary: Keep probes simple and low-cost. Hitting an HTTP health endpoint on the app itself is a common approach. Avoid heavy computations or external calls in your probe handlers. The probe should quickly answer the question “are you healthy/ready?” and nothing more. Expensive probes can overload your app or time out unnecessarily. For example, instead of performing a complex DB query for a readiness check, a simpler check like “can I ping the database?” might suffice – or consider designing the app such that if the DB is down, the app returns an HTTP 503 on its own health endpoint, which the readiness probe will catch. Simplicity in probes helps prevent false failures.
- Tune Initial Delays and Timeouts to Your App: There is no one-size-fits-all timing. Analyze your application’s startup sequence and worst-case response times. Set initialDelaySeconds long enough that your readiness and liveness checks don’t start until the app is truly likely to be up. For liveness checks, if using them at all, ensure the app can typically respond well within the timeoutSeconds. A probe timing out doesn’t give your app any extra time – it immediately counts as a failure. So if you expect that under load a health check might take 2-3 seconds, don’t leave the timeout at 1s. As a rule of thumb, start conservative (longer delays, longer timeouts, slightly higher failureThreshold) and then dial them tighter if needed, rather than the opposite. It’s safer to be a bit lenient than to accidentally induce a crash loop.
- Use Startup Probes for Slow Initializations: If your container performs heavy initialization (downloading large files, doing migrations, warming caches, etc.), configure a startup probe instead of simply cranking initialDelaySeconds way up on the liveness probe. The startup probe is a one-time gate that, when in effect, pauses liveness and readiness checks so the container can fully initialize. You typically point the startup probe at the same endpoint as liveness. Give it a generous failureThreshold * periodSeconds to cover the worst-case startup duration. Once it succeeds, normal probes take over (a sketch of this pattern appears after this list). This approach avoids the situation where you have to compromise between a very long liveness delay (hurting detection of real crashes) versus killing slow boots.
- Test Your Probes in Staging: A common mistake is deploying a new liveness or readiness probe configuration straight to production without fully understanding its impact. It’s easy to have a typo in the path or command, or a timeout that’s too short, which could cause every pod to fail the probe. Always test probe configs in a safe environment. Intentionally misconfigure one to see how the system reacts (does it restart as expected? mark unready correctly?). Also test that when your app does hang or go unhealthy, the probes actually catch it. Probes add a bit of logic to your deployment – treat that logic as code that needs testing.
- Watch Out for Probe Flapping: If a probe’s success/failure flaps rapidly (e.g., an endpoint that sometimes returns 200 and sometimes 500), it can cause pods to be added and removed from load balancers repeatedly or even restart repeatedly. This could be worse than having no probe at all. If you notice flapping, consider increasing failureThreshold (to require more consecutive failures) or for readiness, maybe require a higher successThreshold to stabilize the signal. Ultimately, a flapping health endpoint suggests the app’s health is borderline – you may need to address the root cause. In the interim, tune the probe to be a bit more tolerant or slower to declare failure.
- Don’t Overuse Liveness Probes: Perhaps surprisingly, many Kubernetes experts advise using liveness probes sparingly. You should have a clear reason if you add a liveness probe. If your application is well-behaved (it fails fast on its own when something is wrong, or you already have external monitoring), a liveness probe might be unnecessary. Every liveness check is a potential trigger for a restart – which is a disruptive event. In fact, if you don’t configure any liveness probe, Kubernetes by default assumes the container is live as long as its process is running. That’s fine for a lot of simple services. Only introduce a liveness check if you’ve identified a failure mode (like occasional hangs) that truly warrants an automatic restart as a remedy. One widely shared piece of advice puts it bluntly: “Kubernetes livenessProbe can be dangerous. I recommend avoiding them unless you have a clear use case and understand the consequences.” That might be a bit extreme, but it underscores the point – use liveness probes deliberately, not by default.
- Common Misconfiguration Pitfalls: Be aware of these frequent mistakes:
- Wrong probe type: e.g. using a liveness probe when you meant readiness. This can cause unexpected restarts. Remember: readiness is for traffic gating, liveness is for actual container restarts.
- Not defining a readiness probe: as discussed, this can lead to race conditions where pods receive traffic too early. Always have readiness for user-facing services.
- Probe path errors: If you configure an HTTP probe with the wrong path or port, it will always fail. Double-check that your container is actually listening on that path/port and that it’s accessible from within the pod.
- Ignoring the default behavior: If you don’t specify a field like initialDelaySeconds, it might default to 0 – meaning the probe starts immediately when the container starts. In some cases that’s fine, but often you need a delay. Be explicit in your YAML to avoid surprises.
- Not aligning probes with app logic: Ensure your application’s health endpoints align with what the probes are checking. For example, if your readiness probe hits /health, make sure that endpoint returns failure when the app is not actually ready (and success when it is). Sometimes developers implement a health check endpoint that always returns 200 – which makes the probe meaningless. The probes are only as good as the signals your app gives them.
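As promised above, a sketch of the startup-probe pattern for slow initializations (the endpoint, port, and time budget are illustrative): the startup probe reuses the liveness endpoint and allows up to five minutes (30 * 10s) for boot, after which the tighter liveness settings take over.

# One-time gate: allows up to 300s (30 * 10s) for startup before giving up
startupProbe:
  httpGet:
    path: /api/health   # same endpoint as the liveness probe
    port: 3000
  periodSeconds: 10
  failureThreshold: 30
# Once startup succeeds, normal liveness checking begins with tight settings
livenessProbe:
  httpGet:
    path: /api/health
    port: 3000
  periodSeconds: 15
  failureThreshold: 3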
By following these guidelines, you can sidestep most of the issues teams encounter with Kubernetes health checks. Next, we’ll look at how to monitor and troubleshoot probes in a live cluster, and when it might make sense not to use probes at all.
Monitoring and Troubleshooting Probes in Production
Once you deploy probes, it’s important to monitor their behavior in your production cluster. Misconfigured probes will typically manifest as pods cycling or going unready, which you want to catch early.
Use Kubernetes Events and Logs: The first place to check probe status is the pod’s events. Running kubectl describe pod <pod-name> will show recent events, including probe successes and failures. For example, you might see events like “Startup probe passed” or “Readiness probe failed: HTTP probe failed with statuscode: 503”. If a liveness probe fails and the container is killed, you’ll see events about killing the container and possibly the reason from the probe. These events are invaluable for debugging why Kubernetes might be restarting your pods.
Metrics and Alerts: You can set up monitoring to alert on unhealthy probe conditions. Kubernetes doesn’t directly expose “probe failed” metrics out of the box, but you can infer health from other signals:
- The pod’s ready status is exposed via metrics if you use kube-state-metrics. For example, the metric kube_pod_status_ready{condition="false"} will be 1 for pods that are not ready. You can create an alert in Prometheus to trigger if a pod stays unready for too long or if too many pods of a deployment are unready at once (see the alert sketch after this list).
- The container restart count (visible in kubectl get pods or kubectl describe) is a clue to liveness probe failures. An increasing restart count, especially alongside events in kubectl describe pod noting that the container failed its liveness probe and will be restarted, means the liveness probe is flunking and rebooting the container. You can monitor this via kube-state-metrics (e.g., kube_pod_container_status_restarts_total) and alert on rapid restarts.
- CrashLoopBackOff status is another red flag often tied to failing liveness probes. If you see pods in CrashLoopBackOff, describe them to find if liveness probe failures are the cause. An alert on pods stuck in CrashLoopBackOff for X minutes might be useful.
- Some teams also set up synthetic external checks or dashboards to monitor service availability which indirectly catches if pods are being taken out by probes.
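A hedged sketch of the Prometheus side, assuming kube-state-metrics is installed and the Prometheus Operator’s PrometheusRule resource is available (the alert names and thresholds are illustrative):

apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: probe-health-alerts
spec:
  groups:
  - name: probe-health
    rules:
    - alert: PodNotReady
      # Pod has reported NotReady for 10 minutes straight
      expr: kube_pod_status_ready{condition="false"} == 1
      for: 10m
      labels:
        severity: warning
    - alert: PodRestartingOften
      # More than 3 container restarts in the last 30 minutes (often a failing liveness probe)
      expr: increase(kube_pod_container_status_restarts_total[30m]) > 3
      labels:
        severity: warning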
Regularly review these signals. Probes are not a “set and forget” thing – changes in your application or environment may necessitate adjusting probe settings. For instance, if you notice that on every deployment, your new pods spend 60 seconds unready (triggering alerts) because they’re warming up caches, you might extend the readiness probe initial delay or use a startup probe to reflect that reality. Or if you find your liveness probe never actually fires, maybe your app never hangs (which is good) and perhaps you don’t need such a frequent liveness check at all. Monitoring data helps inform these decisions.
In summary, treat probe failures as you would any other warning sign in production. Investigate whether it’s a misconfiguration (false alarm) or a genuine app issue. And definitely alert on abnormal probe behavior – like a sudden wave of restarts or many pods going unready – as it often signifies a problem either in your app or in how the probes are configured.
When Might You Not Need Probes?
It may sound counterintuitive after all the above, but there are scenarios where you might choose not to configure certain probes:
- Stable or Short-lived Containers: If your container is very simple, starts instantly, and either runs to completion or fails on its own (for example, a short batch job or a CronJob), you might skip readiness and liveness probes. A batch job isn’t serving requests, so readiness is irrelevant. And if it hangs or fails, you might rely on a built-in timeout or just let Kubernetes’ activeDeadlineSeconds or other mechanisms handle it, rather than a liveness probe (see the sketch after this list). Similarly, a static file server (like nginx serving static content) that either comes up or crashes immediately might not need a liveness probe – if it’s not responding, it will likely exit and Kubernetes will restart it anyway. In these cases, probes would add little value.
- Applications that Fail Fast: Some apps are designed to crash or exit if they encounter an unrecoverable error (rather than hanging). If your application already has this behavior, a liveness probe isn’t strictly necessary because the kubelet will notice the process has exited and will restart the container (per the restart policy) regardless. In fact, one philosophy is to build applications that fail fast and let the orchestrator restart them, instead of trying to detect subtle failures via liveness probes. This ensures you only restart when truly necessary. In such scenarios, adding a liveness probe that essentially does the same thing (kills the app when it’s not responding) may be redundant.
- When the Risk of a Misprobe Outweighs the Benefit: If you’re not confident in what to check for liveness, or you fear a liveness probe might inadvertently cause harm (as in the earlier examples of cascading failures), it can be reasonable to hold off on implementing it initially. You might deploy a new service with just readiness probes, and only add liveness later if you observe the service actually getting stuck in practice. Kubernetes will happily run your pod without liveness probes – it just won’t automatically restart it on its own unless the process dies. That might be acceptable if manual intervention is rare or if other systems are monitoring it. The key is understanding the trade-off: liveness gives automatic recovery at the risk of false positives. Choose based on what makes sense for each workload.
- Probes on Sidecars or Infrastructure Pods: Some pods (especially sidecars or daemonset pods) don’t serve user traffic and have specific roles (logging agents, service meshes, etc.). In some cases, readiness probes on these might not make sense, or could interfere with their operation. Always evaluate whether a probe on such components is actually checking a meaningful condition. For example, a service mesh sidecar like Envoy/Linkerd might have its own health logic; duplicating that with Kubernetes probes might not be needed. However, many of these do implement their own health endpoints and it can still be wise to use them.
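As referenced in the batch-job item above, a minimal sketch of that approach (the Job name, image, and deadline are hypothetical): rather than probing, the Job is simply capped by activeDeadlineSeconds.

apiVersion: batch/v1
kind: Job
metadata:
  name: nightly-report
spec:
  activeDeadlineSeconds: 600   # terminate the Job if it runs longer than 10 minutes
  backoffLimit: 2              # retry up to twice on failure
  template:
    spec:
      restartPolicy: Never
      containers:
      - name: report
        image: my-report-job:latest
        # no liveness/readiness probes: the Job either completes or the deadline fires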
In general, readiness probes are almost always recommended for services that handle requests, since they’re low-risk (they don’t kill the container) and high-reward in avoiding traffic to unhealthy pods. Liveness probes are optional – use them when they solve a known problem (like deadlocks), and skip or disable them when they introduce more problems than they solve. Startup probes are optional and scenario-dependent – they’re great for slow-starting apps, but unnecessary for quick-starting ones. There’s no need to add a startup probe if your app is fully ready in 2 seconds; in that case, a readiness probe with a 2-second initial delay is perfectly fine. The mantra is: use the probes that make sense for your application’s behavior.
Conclusion
Kubernetes liveness, readiness, and startup probes are powerful tools for building resilient containerized applications. They enable the platform to automatically detect when your app is healthy, when it’s ready to serve users, and when it might need a nudge (restart) to get back on track. As we’ve seen, using probes correctly can prevent common issues like sending traffic to unready pods, leaving hung processes running, or repeatedly killing containers that just need a bit more time to start. Proper probe configuration involves understanding your application’s characteristics and tuning parameters like timeouts and thresholds accordingly – there are sensible defaults, but don’t shy away from adjusting them to fit your workload (e.g. longer timeouts for I/O-heavy apps, higher failureThreshold for services that can spike under load, etc.).
Finally, apply probes thoughtfully, test them, and monitor their effects in production. When in doubt, start with readiness probes (you almost can’t go wrong there) and add liveness probes only as needed. Kubernetes gives us the building blocks for self-healing systems; it’s up to us as engineers to use them wisely to build robust, reliable services. Happy deploying, and may your pods always be healthy (and if not, may your probes catch it)!
Join the Conversation
Have experience with Kubernetes probes? What trade-offs have you encountered in production? Share your story in the comments below – I’d love to hear your perspective.