Chapter 01 · Node.js · nginx · Kubernetes · TCP
Published April 2025 · 8 min read
The 5-Second Time Bomb
Every Node.js server ships with a silent misconfiguration. Most never notice it - until nginx does. This is the story of a 5-second default that had been quietly waiting to fire - and the 502 storm that finally set it off.
It started like most production incidents do - not with a bang, but with a Slack message nobody wanted to send.
Users were hitting 502 Bad Gateway errors on a high-traffic microservice. Not constantly. Not predictably. Just enough to be undeniable - the kind of intermittent failure that makes engineers doubt themselves before they start doubting the system.
The service was a Node.js backend running in a Kubernetes cluster, sitting behind an nginx ingress controller. On paper, nothing exotic. In practice, a ticking clock.
The first instinct - as it always is - was to blame the database. Then the downstream APIs. Then the network. Each one came back clean. The service looked healthy. Latency was normal between the bursts. And then: another 502. Then silence. Then another.
The worst kind of bug is the one that hides between the errors.
The nginx access logs were the first real signal. Buried in the noise was a pattern - the 502s weren't random. They were clustered. And they were always preceded by a quiet period with no traffic to that upstream connection.
That detail changed everything.
nginx was configured with an upstream keepalive pool - a standard setup that holds open persistent TCP connections to upstream pods instead of opening a fresh one for every request. More efficient. Faster. Usually completely invisible.
But "holding open" requires both sides to agree on how long. And here's where it gets interesting.
upstream app-service {
    server app-service:3000;
    keepalive 320;           # max idle connections held open
    keepalive_timeout 60s;   # how long nginx holds them
}
nginx would hold connections open for 60 seconds. After that, it closes them gracefully. Nothing wrong with that.
The next step was to check what Node.js thought about this arrangement.
kubectl exec -it <pod-name> -n <namespace> -- node -e "
  // Defaults are assigned per instance, so read them from a fresh server.
  const http = require('http');
  const server = http.createServer();
  console.log('keepAliveTimeout:', server.keepAliveTimeout);
  console.log('headersTimeout:', server.headersTimeout);
"
# Output:
# keepAliveTimeout: 5000
# headersTimeout: 60000
There it was. keepAliveTimeout: 5000.
Five seconds. Node.js was closing idle connections after just 5 seconds - while nginx was holding the other end open for 60.
nginx: "I'll keep this connection warm, we'll use it in a bit."
Node.js: (closes connection after 5 seconds of silence)
nginx: "Great, request time." (writes to the socket, receives a TCP RST, returns 502)
The sequence of events, reconstructed:
- nginx establishes a persistent TCP connection to a Node.js pod.
- Traffic goes quiet. The connection sits idle for more than 5 seconds.
- Node.js quietly closes it from its end. No fanfare. Just a FIN.
- nginx doesn't immediately notice - the socket is in a half-closed state, or the FIN is still in transit.
- A new request arrives. nginx picks this connection from its keepalive pool - it still looks alive from nginx's perspective.
- nginx writes the request to the socket.
- Node.js has already closed it. The OS responds with TCP RST.
- nginx receives the RST, has no retry configured, and returns 502 to the client.
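The first half of this sequence is easy to observe locally. The sketch below is a minimal illustration, not the original debugging setup: it starts a throwaway Node.js server with a deliberately short keepAliveTimeout (200ms is an arbitrary value chosen to keep the demo fast), opens one raw TCP connection to stand in for a pooled proxy connection, sends a single request, and then stays silent. The function name observeIdleClose and its result fields are hypothetical.

```javascript
const http = require('http');
const net = require('net');

// Hypothetical helper: resolves once the server has closed an idle
// keep-alive connection that the client deliberately left open.
function observeIdleClose(keepAliveTimeoutMs) {
  return new Promise((resolve, reject) => {
    const server = http.createServer((req, res) => res.end('ok'));
    server.keepAliveTimeout = keepAliveTimeoutMs;

    server.listen(0, '127.0.0.1', () => {
      const { port } = server.address();
      let gotResponse = false;
      let serverSentFin = false;

      // A raw socket plays the role of one connection in the proxy's pool.
      const socket = net.connect(port, '127.0.0.1', () => {
        socket.write(
          'GET / HTTP/1.1\r\nHost: localhost\r\nConnection: keep-alive\r\n\r\n'
        );
      });

      socket.on('data', () => { gotResponse = true; });
      // 'end' fires when the peer's FIN arrives; this client never ends first.
      socket.on('end', () => { serverSentFin = true; });
      socket.on('error', reject);
      socket.on('close', () => {
        server.close(() => resolve({ gotResponse, serverSentFin }));
      });
    });
  });
}

observeIdleClose(200).then((result) => {
  console.log(result);
});
```

If both fields come back true, the server answered the request and then ended the connection on its own while the client was still holding it open. A proxy that picks that socket from its pool at the wrong moment is what turns this quiet FIN into a 502.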
This is a known failure mode - documented quietly in the Node.js changelog. The default keepAliveTimeout of 5 seconds predates the modern era of nginx keepalive pools. It made sense in an earlier era. In a cloud-native stack behind a properly configured ingress, it's a loaded gun.
What made this particularly insidious was that the bug was traffic-dependent. Under high load, connections never sat idle long enough for Node.js to close them. The service looked perfectly healthy. It was only during quieter periods - off-peak hours, after a traffic dip - that the time bomb armed itself and detonated.
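The traffic dependence can be made concrete with a toy model. The sketch below is an illustration under simplifying assumptions, not a description of nginx internals: it treats the pool as a single connection and flags every gap between consecutive requests that is long enough for Node.js to have closed the socket but short enough that the proxy would still consider it reusable. The function name findRiskyGaps and the sample timestamps are hypothetical.

```javascript
// Toy model (assumption: one pooled connection, reused for every request).
// A gap is "risky" when the upstream has already timed the connection out
// but the proxy has not yet discarded it.
function findRiskyGaps(requestTimesMs, upstreamKeepAliveMs, proxyKeepaliveMs) {
  const risky = [];
  for (let i = 1; i < requestTimesMs.length; i++) {
    const gap = requestTimesMs[i] - requestTimesMs[i - 1];
    if (gap >= upstreamKeepAliveMs && gap < proxyKeepaliveMs) {
      risky.push({ at: requestTimesMs[i], idleMs: gap });
    }
  }
  return risky;
}

// Busy period: a request every second. Quiet period: a 12-second lull.
const busy = [0, 1000, 2000, 3000, 4000];
const withLull = [0, 1000, 2000, 14000, 15000];

console.log(findRiskyGaps(busy, 5000, 60000));      // []
console.log(findRiskyGaps(withLull, 5000, 60000));  // [ { at: 14000, idleMs: 12000 } ]
console.log(findRiskyGaps(withLull, 65000, 60000)); // []
```

With the 5-second default, only the lull produces a candidate for a stale socket. Once the upstream timeout exceeds the proxy's, no gap falls inside the window at all.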
The bug wasn't in the code. It wasn't in the infrastructure. It was in a default that everyone assumed someone else had changed.
The fix is surgical. Two lines. Applied at server initialisation:
const server = app.listen(3000);
// Node.js default is 5000ms - dangerously low behind nginx.
// Must be higher than nginx's keepalive_timeout (60s) + buffer.
server.keepAliveTimeout = 65000; // 65 seconds
// headersTimeout must always exceed keepAliveTimeout
server.headersTimeout = 66000; // 66 seconds
The logic: Node.js must hold connections open longer than nginx does. That way, nginx always closes first - gracefully, intentionally - and never writes to a socket that Node.js has already abandoned.
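Hard-coded numbers drift when someone later changes the nginx side. As one possible variation on the fix, the sketch below derives both values from the proxy's idle timeout. It uses the built-in http module rather than the app object above so that it runs on its own. applyKeepAliveTimeouts, PROXY_KEEPALIVE_MS, and the buffer sizes are illustrative names and values, not part of the original fix.

```javascript
const http = require('http');

// Assumption: kept in sync by hand with nginx's keepalive_timeout (60s).
const PROXY_KEEPALIVE_MS = 60000;

// Hypothetical helper: sets both timeouts relative to the proxy's idle timeout.
function applyKeepAliveTimeouts(server, proxyKeepaliveMs, bufferMs = 5000) {
  server.keepAliveTimeout = proxyKeepaliveMs + bufferMs;
  // headersTimeout stays one second above keepAliveTimeout.
  server.headersTimeout = server.keepAliveTimeout + 1000;
  return server;
}

const server = applyKeepAliveTimeouts(
  http.createServer((req, res) => res.end('ok')),
  PROXY_KEEPALIVE_MS
);

console.log(server.keepAliveTimeout); // 65000
console.log(server.headersTimeout);   // 66000
// server.listen(3000); // left commented out so the sketch exits immediately
```

Keeping the proxy value in one named constant makes the dependency on the nginx configuration visible to whoever touches either side next.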
After deploying the fix to a non-production environment, the validation was straightforward - monitor 502 error rates over several days across the affected services and confirm a flat line. Using your APM tooling of choice:
SELECT count(*) FROM Log
WHERE cluster_name = '<your-cluster>'
AND message LIKE '%502%'
AND service.name = '<your-service>'
SINCE 7 days ago
TIMESERIES 1 hour
Result: flat line. Zero 502s in the four days following deployment. The time bomb had been defused.
This incident repeats itself across organisations more than most engineers realise. It's not a Node.js bug - it's a configuration gap that emerges at the intersection of two systems that each behave correctly in isolation, but make conflicting assumptions when combined.
The mental model to carry forward:
Any proxy with a keepalive pool will eventually write to a stale socket
Unless the upstream closes connections more slowly than the proxy does. This is the core invariant. Violate it and 502s are inevitable.
Node.js keepAliveTimeout defaults to 5 seconds
This is almost certainly wrong for any service behind nginx, Envoy, HAProxy, or any keepalive-aware load balancer or ingress controller.
headersTimeout must always exceed keepAliveTimeout
Otherwise you trade 502s for a different class of timeout error - request header timeouts mid-connection.
Traffic-dependent bugs evade load testing
Standard load tests saturate connections continuously and never let them go idle. This bug only surfaces at real-world traffic patterns - particularly off-peak hours.
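The rules in this list can also be checked mechanically, for example in a startup check or a unit test. A minimal sketch, assuming the proxy's idle timeout is known to the service and passed in as proxyKeepaliveMs; the function name checkTimeouts and the message wording are hypothetical.

```javascript
// Hypothetical check: returns a list of violated invariants (empty = OK).
function checkTimeouts({ proxyKeepaliveMs, keepAliveTimeout, headersTimeout }) {
  const problems = [];
  if (keepAliveTimeout <= proxyKeepaliveMs) {
    problems.push('keepAliveTimeout must exceed the proxy keepalive timeout');
  }
  if (headersTimeout <= keepAliveTimeout) {
    problems.push('headersTimeout must exceed keepAliveTimeout');
  }
  return problems;
}

// Node.js defaults behind a 60s proxy: the first invariant is violated.
console.log(checkTimeouts({
  proxyKeepaliveMs: 60000,
  keepAliveTimeout: 5000,
  headersTimeout: 60000,
}));

// Values from the fix: no violations.
console.log(checkTimeouts({
  proxyKeepaliveMs: 60000,
  keepAliveTimeout: 65000,
  headersTimeout: 66000,
}));
```

Failing fast at startup when the list is non-empty is one option; logging a warning is a gentler one. Which is appropriate depends on how the service is deployed.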
nginx keepalive_timeout   = 60s (proxy)
Node.js keepAliveTimeout  = 65s (proxy + 5s buffer)
Node.js headersTimeout    = 66s (keepAliveTimeout + 1s buffer)
Rule: upstream timeout > proxy timeout. Always.
The 5-second time bomb ships in every Node.js installation. Whether it detonates depends entirely on what sits in front of it. Check your defaults before your users find them for you.
Written by
Vishnu KS
Senior Cloud Infrastructure Engineer