Incident Case Files · Production Debugging · Systems Notes
Production Debugging Archive
Incident Case Files
Real incidents. Real debugging. Real lessons. Each chapter is a story from the trenches: the kind of production problem that doesn't show up in documentation, only in 2am alerts and packet captures.
4 case files · 8 topics · prod source
Case 01
The 5-Second Time Bomb
Every Node.js server ships with a silent misconfiguration. Most never notice it - until nginx does.
Symptom: Intermittent 502 after idle upstream connections
System: Node.js / nginx / Kubernetes
Read: 8 min / Apr 2025
Node.js / nginx / Kubernetes / TCP
Case 02
CF-RAY: -
curl from the pod worked. The app from the same pod did not. Cloudflare returned a 400 for a request it pretended never existed.
Symptom: Cloudflare 400 with no Ray ID
System: Java / Spring / Cloudflare
Read: 9 min / May 2026
Cloudflare / Java / SOAP / TLS
Case 03
The Ghost Connections
A Java service was taking 21 seconds to call an API that responded in 179ms. The wire logs exonerated everyone.
Symptom: 21s call for a 179ms upstream
System: Apache HC5 / Camel / Kafka
Read: 10 min / May 2026
Apache HC5 / Camel / Kafka / Java
Case 04
Guilty by Association
Checkpoint was blocking the right IP for the wrong reason. The domain it showed wasn't ours. The IP was shared with thousands of others.
Symptom: Intermittent connection resets on a Cloudflare-fronted endpoint
System: Checkpoint Firewall / Cloudflare / AKS
Read: 8 min / May 2026
Checkpoint / Cloudflare / Firewall / Networking