SRE

A Readiness Probe That Can't Fail Is Just Wallpaper

Writing up a production outage report today, I was making the case that workloads should fail their readiness probes when they can’t reach their dependencies — databases, caches, anything required to do useful work. The collaborating Claude session put it better than I had: A probe that doesn’t fail when its workload can’t reach its database isn’t a probe — it’s wallpaper. That’s the whole thing. A readiness probe answers one question: is this pod ready to serve traffic? If the answer depends on a database connection and you’re not checking for that, you’re not answering the question — you’re decorating the pod spec.
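To make the idea concrete, here is a minimal sketch of a readiness endpoint that fails when a dependency is unreachable. It is an illustration under assumptions, not the setup from the outage report: the `/readyz` path, port 8080, and the hostnames `db.internal.example` and `cache.internal.example` are all hypothetical, and a bare TCP connect is only a rough stand-in for a real check such as running a trivial query.

```python
import socket
from http.server import BaseHTTPRequestHandler, HTTPServer
from typing import Callable, Dict, Tuple

# A dependency check returns True when the dependency is reachable.
Check = Callable[[], bool]


def tcp_check(host: str, port: int, timeout: float = 1.0) -> Check:
    """Build a check that passes only if a TCP connection can be opened."""
    def check() -> bool:
        try:
            with socket.create_connection((host, port), timeout=timeout):
                return True
        except OSError:
            return False
    return check


def readiness(checks: Dict[str, Check]) -> Tuple[int, str]:
    """Return (HTTP status, body): 200 only if every dependency check passes."""
    failed = []
    for name, check in checks.items():
        try:
            ok = check()
        except Exception:
            # A check that blows up counts as a failed dependency.
            ok = False
        if not ok:
            failed.append(name)
    if failed:
        return 503, "not ready: " + ", ".join(sorted(failed))
    return 200, "ready"


# Hypothetical dependencies; replace with whatever the workload needs.
CHECKS: Dict[str, Check] = {
    "database": tcp_check("db.internal.example", 5432),
    "cache": tcp_check("cache.internal.example", 6379),
}


class ReadyHandler(BaseHTTPRequestHandler):
    def do_GET(self) -> None:
        if self.path != "/readyz":
            self.send_error(404)
            return
        status, body = readiness(CHECKS)
        payload = body.encode()
        self.send_response(status)
        self.send_header("Content-Type", "text/plain")
        self.send_header("Content-Length", str(len(payload)))
        self.end_headers()
        self.wfile.write(payload)


if __name__ == "__main__":
    # Port 8080 is an assumption; match it to the probe in the pod spec.
    HTTPServer(("0.0.0.0", 8080), ReadyHandler).serve_forever()
```

With a handler like this, a readiness probe's `httpGet` pointed at `/readyz` receives a 503 whenever the database or cache cannot be reached, so Kubernetes stops routing traffic to the pod instead of treating it as ready. Keeping the check timeout shorter than the probe's own timeout avoids the probe timing out before the handler can answer.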

Nobody cares that your Kubernetes cluster is healthy (and what to measure instead)

A few weeks ago, our new principal engineer sat down with our team and said something that stung a little: “I can see your cluster is up. I have no idea if anyone finds it useful.” That’s a hard sentence to sit with when you’ve spent months tuning alerts and building dashboards. I manage a team of SREs. We look after EKS, ArgoCD, Loki, Backstage, Karpenter, and a handful of other tools that together form what we loosely call “the platform.” We’re good at keeping things running. We have alerts. We have runbooks. We have dashboards full of green lights.