Vishnu KS is an end-to-end engineer based in Kochi, Kerala, India. He builds applications, scalable cloud infrastructure, DevOps pipelines, security controls, Kubernetes platforms, observability systems, and AI integrations. He is a KCD Kochi organiser, CNCF Kerala community leader, and Top 3% global talent verified by Toptal.

What is Vishnu KS available for?

Vishnu KS is available for application engineering, cloud architecture, DevOps automation, Kubernetes platforms, security controls, production debugging, advisory roles, architecture reviews, and speaking opportunities via Topmate at https://topmate.io/iamvishnuks.

What open source projects has Vishnu KS built?

Vishnu KS has built XReplicator (an eBPF-powered VM backup engine), OpsBrew (a Kubernetes log pipeline tool that won the Microsoft Azure Sentinel Hackathon), eBee (an interactive eBPF learning platform), AudioNet (audio classification using TensorFlow), and XMigrate (an open-source VM portability platform).

What is SRE Chronicles?

SRE Chronicles is a technical blog by Vishnu KS featuring real production incident investigations written as narrative stories. Each chapter covers a real debugging investigation - the alert, the hunt, the discovery, the fix, and the lessons. Available at https://iamvishnuks.com/chronicles.

Is Vishnu KS a Kubernetes expert?

Yes. Vishnu KS has extensive Kubernetes expertise spanning platform engineering at fleet scale, GitOps with ArgoCD, eBPF-level observability with Cilium, multi-tenant Loki/Grafana stacks, and deep incident investigation. He is also the organiser of KCD Kochi 2026 - Kerala's flagship Kubernetes Community Days event.

Incident Case 01

The 5-Second Time Bomb

Every Node.js server ships with a silent misconfiguration. Most never notice it - until nginx does. This is the story of a 5-second default that had been quietly waiting to fire - and the 502 storm that finally set it off.

01 / The Alert

It started like most production incidents do - not with a bang, but with a Slack message nobody wanted to send.

Users were hitting 502 Bad Gateway errors on a high-traffic microservice. Not constantly. Not predictably. Just enough to be undeniable - the kind of intermittent failure that makes engineers doubt themselves before they start doubting the system.

The service was a Node.js backend running in a Kubernetes cluster, sitting behind an nginx ingress controller. On paper, nothing exotic. In practice, a ticking clock.

The first instinct - as it always is - was to blame the database. Then the downstream APIs. Then the network. Each one came back clean. The service looked healthy. Latency was normal between the bursts. And then: another 502. Then silence. Then another.

The worst kind of bug is the one that hides between the errors.

02 / The Hunt

The nginx access logs were the first real signal. Buried in the noise was a pattern - the 502s weren't random. They were clustered. And they were always preceded by a quiet period with no traffic to that upstream connection.

That detail changed everything.
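One way to check for that pattern is to measure how long each upstream had been quiet before a 502. The sketch below is illustrative only: the entry shape (`timeMs`, `upstream`, `status`) is a hypothetical result of parsing the access log, not a format nginx emits. It groups by upstream address, which only approximates a single pooled connection.

```javascript
// Hypothetical input: one object per parsed access-log line, sorted by time.
// `upstream` is the upstream address, used here as a stand-in for the connection.
function quietGapsBefore502s(entries) {
  const lastSeen = new Map(); // upstream -> time of the previous request
  const gaps = [];
  for (const { timeMs, upstream, status } of entries) {
    if (status === 502 && lastSeen.has(upstream)) {
      gaps.push({ upstream, gapMs: timeMs - lastSeen.get(upstream) });
    }
    lastSeen.set(upstream, timeMs);
  }
  return gaps;
}

// Made-up sample data, for illustration only.
const sample = [
  { timeMs: 0, upstream: '10.0.0.5:3000', status: 200 },
  { timeMs: 400, upstream: '10.0.0.5:3000', status: 200 },
  { timeMs: 9400, upstream: '10.0.0.5:3000', status: 502 },
  { timeMs: 9600, upstream: '10.0.0.5:3000', status: 200 },
];
console.log(quietGapsBefore502s(sample)); // one entry, with gapMs 9000
```

If the gaps in front of the 502s are consistently longer than a few seconds, idle connections become the prime suspect.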

nginx was configured with an upstream keepalive pool - a standard setup that holds open persistent TCP connections to upstream pods instead of opening a fresh one for every request. More efficient. Faster. Usually completely invisible.

But "holding open" requires both sides to agree on how long.

nginx upstream keepalive config (nginx)

upstream app-service {
  server app-service:3000;
  keepalive 320;          # max idle connections held open
  keepalive_timeout 60s;  # how long nginx holds them
}

nginx would hold idle connections open for 60 seconds, then close them gracefully. Nothing wrong with that.

The next step was to check what Node.js thought about this arrangement.

Inspecting Node.js server timeout defaults (bash)

kubectl exec -it <pod-name> -n <namespace> -- node -e "
const http = require('http');
// The timeouts are set per instance in the constructor, so read them from a server object.
const server = http.createServer();
console.log('keepAliveTimeout:', server.keepAliveTimeout);
console.log('headersTimeout:', server.headersTimeout);
"

# Output:
# keepAliveTimeout: 5000
# headersTimeout: 60000

There it was. keepAliveTimeout: 5000.

Five seconds. Node.js was closing idle connections after just 5 seconds - while nginx was holding the other end open for 60.

nginx: "I'll keep this connection warm, we'll use it in a bit." Node.js closes it after 5 seconds of silence. nginx later writes to the stale socket. TCP RST. 502.

03 / The Discovery

The sequence of events, reconstructed:

1. nginx establishes a persistent TCP connection to a Node.js pod.
2. Traffic goes quiet. The connection sits idle for more than 5 seconds.
3. Node.js quietly closes it from its end. No fanfare. Just a FIN.
4. nginx doesn't immediately notice - the socket is in a half-closed state, or the FIN is still in transit.
5. A new request arrives. nginx picks this connection from its keepalive pool - it still looks alive from nginx's perspective.
6. nginx writes the request to the socket.
7. Node.js has already closed it. The OS responds with TCP RST.
8. nginx receives the RST, has no retry configured, and returns 502 to the client.
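The sequence above can be reduced to a small model. This is a simplified sketch, not a description of nginx internals: it assumes one pooled connection, starts both timers when the connection goes idle, and ignores the exact window in which the FIN is in flight. The function name and return labels are made up for illustration.

```javascript
// Classify what happens when a request arrives after `idleMs` of silence.
function classifyReuse(idleMs, proxyKeepaliveMs, upstreamKeepAliveMs) {
  if (idleMs >= proxyKeepaliveMs) {
    return 'proxy-closed'; // proxy already dropped it; the request gets a fresh connection
  }
  if (idleMs >= upstreamKeepAliveMs) {
    return 'stale-risk'; // upstream has closed, proxy may still pick the socket
  }
  return 'safe-reuse'; // both sides still consider the connection live
}

// Columns: idle gap, Node.js default (5s), Node.js raised to 65s; proxy at 60s.
for (const idleMs of [2000, 10000, 70000]) {
  console.log(idleMs, classifyReuse(idleMs, 60000, 5000), classifyReuse(idleMs, 60000, 65000));
}
// 2000  -> safe-reuse   / safe-reuse
// 10000 -> stale-risk   / safe-reuse
// 70000 -> proxy-closed / proxy-closed
```

With the 5-second default, every idle gap between 5 and 60 seconds lands in the risky band. Once the upstream timeout exceeds the proxy's, that band disappears.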

This is a known failure mode. A 5-second keepAliveTimeout is reasonable for a server that talks to clients directly and wants its idle sockets back quickly. In a cloud-native stack behind an ingress that pools upstream connections, it's a loaded gun.

What made this particularly insidious was that the bug was traffic-dependent. Under high load, connections never sat idle long enough for Node.js to close them. The service looked perfectly healthy. It was only during quieter periods - off-peak hours, after a traffic dip - that the time bomb armed itself and detonated.

The bug wasn't in the code. It wasn't in the infrastructure. It was in a default that everyone assumed someone else had changed.

04 / The Fix

The fix is surgical. Two lines. Applied at server initialisation:

server.js - the fix (javascript)

const server = app.listen(3000);

// Node.js default is 5000ms - dangerously low behind nginx.
// Must be higher than nginx's keepalive_timeout (60s) + buffer.
server.keepAliveTimeout = 65000;   // 65 seconds

// headersTimeout must always exceed keepAliveTimeout
server.headersTimeout = 66000;     // 66 seconds

The logic: Node.js must hold connections open longer than nginx does. That way, nginx always closes first - gracefully, intentionally - and never writes to a socket that Node.js has already abandoned.
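To keep the two values from drifting apart, they can be derived from the proxy's timeout instead of being hard-coded. A minimal sketch, assuming a plain `http.Server`; the helper name `applyKeepAliveTimeouts` and the 5-second buffer are illustrative choices, not part of Node.js:

```javascript
const http = require('http');

// Hypothetical helper: derive both Node.js timeouts from the proxy's idle timeout.
function applyKeepAliveTimeouts(server, proxyKeepaliveMs, bufferMs = 5000) {
  // Outlive the proxy's idle timeout so the proxy always closes first.
  server.keepAliveTimeout = proxyKeepaliveMs + bufferMs;
  // Keep headersTimeout strictly above keepAliveTimeout.
  server.headersTimeout = server.keepAliveTimeout + 1000;
  return server;
}

// The server is created but never listens, so this runs and exits immediately.
const server = applyKeepAliveTimeouts(http.createServer(), 60000);
console.log(server.keepAliveTimeout, server.headersTimeout); // 65000 66000
```

Passing the proxy timeout in from configuration means a later change to nginx's keepalive_timeout only needs one number updated on the Node.js side.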

After deploying the fix to a non-production environment, validation was straightforward: monitor 502 error rates over several days across the affected services and confirm a flat line.

Validating zero 502s post-fix (NRQL)

SELECT count(*) FROM Log
WHERE cluster_name = '<your-cluster>'
AND message LIKE '%502%'
AND service.name = '<your-service>'
SINCE 7 days ago
TIMESERIES 1 hour

Result: flat line. Zero 502s in the four days following deployment. The time bomb had been defused.

05 / The Lesson

This incident repeats itself across organisations more than most engineers realise. It's not a Node.js bug - it's a configuration gap that emerges at the intersection of two systems that each behave correctly in isolation, but make conflicting assumptions when combined.

Any proxy with a keepalive pool will eventually write to a stale socket

Unless the upstream closes connections more slowly than the proxy does. This is the core invariant. Violate it and 502s are inevitable.

Node.js keepAliveTimeout defaults to 5 seconds

This is almost certainly wrong for any service behind nginx, Envoy, HAProxy, or any keepalive-aware load balancer or ingress controller.

headersTimeout must always exceed keepAliveTimeout

Otherwise you trade 502s for a different class of timeout error - request header timeouts mid-connection.

Traffic-dependent bugs evade load testing

Standard load tests saturate connections continuously and never let them go idle. This bug only surfaces at real-world traffic patterns - particularly off-peak hours.
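A load test can be made to cover this case by pausing between requests for durations that straddle the upstream timeout. The sketch below only computes candidate pauses for the misconfigured case, where the upstream timeout is shorter than the proxy's. The helper name and the 1-second margin are assumptions for illustration, and wiring the pauses into a load tool is left out.

```javascript
// Hypothetical helper: idle gaps (ms) worth inserting between requests.
function idleGapsToProbe(upstreamKeepAliveMs, proxyKeepaliveMs, marginMs = 1000) {
  const gaps = [
    upstreamKeepAliveMs - marginMs, // control: both sides still hold the connection
    upstreamKeepAliveMs + marginMs, // upstream has closed, proxy may still reuse it
    proxyKeepaliveMs - marginMs,    // just before the proxy gives up as well
  ];
  return gaps.filter((gap) => gap > 0).sort((a, b) => a - b);
}

console.log(idleGapsToProbe(5000, 60000)); // [ 4000, 6000, 59000 ]
```

A run that sends a request, waits one of these gaps, and sends another exercises exactly the idle window that continuous saturation skips.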

The rule of thumb

nginx keepalive_timeout   = 60s  (proxy)

Node.js keepAliveTimeout  = 65s  (proxy + 5s buffer)
Node.js headersTimeout    = 66s  (keepAliveTimeout + 1s buffer)

Rule: upstream timeout > proxy timeout. Always.
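The rule can also be checked mechanically, for example at service startup or in CI. A minimal sketch, assuming the three values are available as milliseconds; the function name and the shape of its argument are hypothetical:

```javascript
// Return a list of rule violations; an empty list means the timeouts are consistent.
function timeoutViolations({ proxyKeepaliveMs, keepAliveTimeoutMs, headersTimeoutMs }) {
  const violations = [];
  if (keepAliveTimeoutMs <= proxyKeepaliveMs) {
    violations.push('keepAliveTimeout must exceed the proxy keepalive_timeout');
  }
  if (headersTimeoutMs <= keepAliveTimeoutMs) {
    violations.push('headersTimeout must exceed keepAliveTimeout');
  }
  return violations;
}

// Node.js defaults behind a 60s proxy: one violation.
console.log(timeoutViolations({ proxyKeepaliveMs: 60000, keepAliveTimeoutMs: 5000, headersTimeoutMs: 60000 }));
// Values from the fix: no violations.
console.log(timeoutViolations({ proxyKeepaliveMs: 60000, keepAliveTimeoutMs: 65000, headersTimeoutMs: 66000 }));
```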

The 5-second time bomb ships in every Node.js installation. Whether it detonates depends entirely on what sits in front of it. Check your defaults before your users find them for you.