Case FilesCase 03 / 10 min read

Incident Case 03

The Ghost Connections

A Java service was taking 21 seconds to call an API that responded in 179ms. The wire logs exonerated the connection pool, the firewall, and the network.

01 / The Alert

The symptom looked like a classic network or connection-pool issue: a Java service needed more than 20 seconds to complete an outbound HTTP call.

The upstream API itself was fast. Direct calls showed responses around 179ms. The service path, however, showed roughly 21 seconds.

The first suspects were obvious: Apache HC5 connection pooling, firewall idle connection drops, DNS, TLS, and external network latency.

Slow requests invite folklore. Wire logs are how you remove folklore from the room.

02 / The Hunt

Apache HC5 DEBUG logs showed when a connection was leased, when the request was sent, and when the response arrived.

The critical signal was that connection lease wait time was 0ms. The pool was not blocking. It had a connection ready.

Packet captures showed a separate firewall problem: stale idle TCP connections could trigger SYN retransmissions. That was real and needed fixing, but it did not explain the 21-second request path.

The investigation moved up the stack: if the HTTP client and upstream were fast, where was the time being spent before the HTTP call actually executed?

03 / The Discovery

The timeline exposed two gaps:

Gap 1 - 3.3 seconds

Between OperationRequestProcessor completing and the destination handler route starting - likely synchronous Kafka publish or other I/O on the processing thread before dispatch.

Gap 2 - 13.7 seconds

Inside the destination handler route, before HC5 was called - additional synchronous Kafka publishes on the same Tomcat NIO thread holding it from executing the HTTP call.

The upstream API was not slow. The network was not slow. The connection pool was not slow. The application pipeline was blocking its own outbound HTTP call with synchronous I/O on the same thread.

The ghost connections weren't in the network. They were in the thread - blocked, waiting for Kafka, while the HTTP client sat ready with warm connections and nowhere to go.

04 / The Resolution

The firewall problem was closed out first. SO_KEEPALIVE was applied to keep TCP connections warm in the firewall's NAT table, and external API traffic was rerouted through a correctly configured firewall path.

The 21-second latency was different. The wire logs had made the diagnosis clear: blocking happened inside the Camel pipeline, on the Tomcat NIO thread, before the HTTP client ever ran.

The fix is conceptually straightforward: Kafka publishes that do not need to complete before the HTTP response should be asynchronous.

Synchronous vs asynchronous Kafka publishjava
// Synchronous - blocks the thread until broker acknowledges
// Every ms the broker takes adds directly to request latency
kafkaTemplate.send(topic, message).get();  // <- .get() is the problem

// Asynchronous - fire and forget, thread moves on immediately
kafkaTemplate.send(topic, message);        // <- no .get(), no block

The key insight: an SRE's job is to find and prove where the time goes - not to guess from dashboards. The wire logs provided an annotated timeline showing exactly which thread was blocked, for how long, and at which point in the pipeline.

Find where the time goes. Show your working. Hand it over.

05 / The Lesson

Wire logs tell the truth - APM and metrics often lie by omission

APM tools show transaction duration end to end. They rarely show the gaps inside - the time between steps where a thread is blocked but nothing is happening.

Exonerating suspects is as important as finding the culprit

HC5 defaults, the firewall, and the network all had evidence pointing at them. Testing and eliminating each one prevented wasted work on the wrong fix.

Synchronous I/O on HTTP dispatch threads is a latency multiplier

A Tomcat NIO thread that makes a synchronous Kafka publish before an outbound HTTP call will block for as long as the broker takes.

An HTTP client with 0ms connection wait is not the problem

When HC5 logs show "connection leased, 0ms wait" and the upstream responds in 179ms, the HTTP client is doing its job.

The latency investigation checklisttext
1. Is the upstream API actually slow?
   -> curl directly to the API, measure response time

2. Is the connection pool the bottleneck?
   -> enable HC5 DEBUG logging
   -> if connection wait is 0ms, pool is fine

3. Is the firewall dropping idle connections?
   -> tcpdump during a slow request, look for SYN retransmissions

4. Where is the time actually going?
   -> add per-step timing to the application pipeline
   -> look for gaps between steps on the same thread

5. Is the HTTP dispatch thread blocked by something else?
   -> if there's a gap before the HTTP client executes, the problem is not in the HTTP layer