SRE Chronicles · Chapter 03 · 10 min read

Chapter 03 · Apache HC5 · Camel · Kafka · Kubernetes · TCP

Published May 2026 · 10 min read

The Ghost Connections

A Java service was taking 21 seconds to call an API that responded in 179ms. The connection pool looked suspicious. The firewall looked suspicious. The network looked suspicious. The wire logs exonerated all three - and pointed somewhere nobody expected.

🔴 The Alert

A Java-based integration service - a Camel pipeline making outbound HTTP calls to external APIs - was intermittently taking over 20 seconds per request. Not all the time. Not on every call. Just often enough to breach SLAs and generate alerts.

The external APIs themselves were healthy. Response times from curl were under 200ms. Load tests showed the APIs handling load without issue. But the service kept producing 21-second transactions. Sometimes longer.

The initial suspects were familiar: the Apache HttpClient 5 connection pool, the network path, and a stateful firewall sitting between the service and the external endpoints. All three seemed plausible. All three had evidence pointing at them.

The hardest investigations aren't the ones where you can't find a suspect. They're the ones where every suspect looks guilty.

🔍 The Hunt - Round 1: The Connection Pool

The first suspect was Apache HttpClient 5 (HC5). The service had no explicit connection pool configuration - it was running entirely on HC5 defaults. And HC5 defaults are not kind to low-traffic, persistent connections.

validateAfterInactivity: 2000ms (default)

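HC5 re-validates a pooled connection once it has been idle for more than 2 seconds, and discards it if the stale check fails. Under low traffic, almost every request pays for that check, and any connection that was dropped while idle is replaced by a fresh one instead of a warm one being reused.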

maxConnectionsPerRoute: 5 (default)

Under concurrent load, 5 connections per route saturates quickly, causing requests to queue for a free slot.

keepAliveStrategy: none configured

Without an explicit strategy, HC5 falls back to the server's Keep-Alive header or closes connections aggressively.
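
For reference, all three defaults can be overridden explicitly. The following is a minimal sketch, assuming the client is built directly with the HC5 builders (version 5.2 or later) rather than through Camel's HTTP component. The class name and every numeric value are illustrative assumptions, not the service's actual settings.

```java
import org.apache.hc.client5.http.config.ConnectionConfig;
import org.apache.hc.client5.http.impl.classic.CloseableHttpClient;
import org.apache.hc.client5.http.impl.classic.HttpClients;
import org.apache.hc.client5.http.impl.io.PoolingHttpClientConnectionManager;
import org.apache.hc.client5.http.impl.io.PoolingHttpClientConnectionManagerBuilder;
import org.apache.hc.core5.util.TimeValue;

// Hypothetical helper class; all values below are assumptions chosen for illustration.
final class PooledClientSketch {

    static CloseableHttpClient build() {
        ConnectionConfig connectionConfig = ConnectionConfig.custom()
                // Re-check a connection for staleness only after 10s idle (default: 2s).
                .setValidateAfterInactivity(TimeValue.ofSeconds(10))
                // Hard upper bound on the age of any pooled connection.
                .setTimeToLive(TimeValue.ofMinutes(5))
                .build();

        PoolingHttpClientConnectionManager pool = PoolingHttpClientConnectionManagerBuilder.create()
                .setMaxConnPerRoute(20)   // default: 5
                .setMaxConnTotal(50)      // default: 25
                .setDefaultConnectionConfig(connectionConfig)
                .build();

        return HttpClients.custom()
                .setConnectionManager(pool)
                // Fixed keep-alive that ignores the server's Keep-Alive header;
                // ideally shorter than any idle timeout on the network path.
                .setKeepAliveStrategy((response, context) -> TimeValue.ofSeconds(30))
                // Background eviction of connections whose keep-alive has expired.
                .evictExpiredConnections()
                .build();
    }
}
```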

A k6 load test confirmed the suspicion - 62 new TCP connections were being opened in a short burst, paying a full TCP + TLS handshake penalty (~150ms) on every single one. The pool was churning rather than reusing.

This looked like the root cause. But a k6 test against the external endpoint directly - bypassing the service entirely - showed clean, fast responses. The pool was inefficient, but was it responsible for 21 seconds? That didn't add up.

A bad connection pool costs you milliseconds. Something else was costing 21 seconds.

🔍 The Hunt - Round 2: The Firewall

The second suspect was the stateful firewall. A parallel investigation had already confirmed that the same firewall was silently expiring TCP sessions after a period of inactivity - holding connections in its NAT table just long enough for the application to think they were alive, then dropping packets on the next use without sending a RST.

tcpdump captures during the 21-second windows showed SYN retransmissions - the client sending SYN packets and waiting, retransmitting, waiting again. The firewall was involved. But when the captures were analysed more carefully, the pattern was inconsistent. The SYN stalls affected only 4 connections out of 62 in one capture, all to the same destination IP simultaneously - more consistent with a brief upstream load balancer rotation than a systematic firewall problem.

SO_KEEPALIVE was applied to the service to keep connections warm in the firewall's NAT table. The SYN retransmissions stopped. But the 21-second latency spikes continued.
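
As an illustration of what applying SO_KEEPALIVE involves at the socket level, here is a minimal sketch using plain `java.net.Socket` on Java 11 or later. How the service actually applied the option is not described here, and the class name, method names and timing values are assumptions. One detail worth knowing: on Linux the default keepalive idle time is 7200 seconds, so the option only helps against a firewall with a shorter idle timeout if the probe timing is lowered as well, either per socket as below or system-wide via sysctl.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.net.Socket;
import java.net.SocketOption;
import java.net.StandardSocketOptions;
import jdk.net.ExtendedSocketOptions;

// Hypothetical helper; not taken from the service described in this article.
final class KeepAliveSketch {

    /** Returns an unconnected socket with TCP keepalive probes enabled. */
    static Socket newKeepAliveSocket(int idleSeconds, int intervalSeconds, int probeCount) {
        try {
            Socket socket = new Socket();
            socket.setOption(StandardSocketOptions.SO_KEEPALIVE, true);
            // The timing options are platform-specific, so set them only where supported.
            setIfSupported(socket, ExtendedSocketOptions.TCP_KEEPIDLE, idleSeconds);
            setIfSupported(socket, ExtendedSocketOptions.TCP_KEEPINTERVAL, intervalSeconds);
            setIfSupported(socket, ExtendedSocketOptions.TCP_KEEPCOUNT, probeCount);
            return socket;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    private static void setIfSupported(Socket socket, SocketOption<Integer> option, int value)
            throws IOException {
        if (socket.supportedOptions().contains(option)) {
            socket.setOption(option, value);
        }
    }

    /** Demo helper: creates a socket, reports whether SO_KEEPALIVE is on, then closes it. */
    static boolean keepAliveEnabledOnFreshSocket() {
        // Assumed values: first probe after 60s idle, then every 10s, give up after 3 misses.
        try (Socket socket = newKeepAliveSocket(60, 10, 3)) {
            return socket.getKeepAlive();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println("SO_KEEPALIVE enabled: " + keepAliveEnabledOnFreshSocket());
    }
}
```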

SO_KEEPALIVE fixed the firewall problem. The 21 seconds wasn't the firewall.

🧪 The Wire Logs

With both leading suspects cleared, the investigation turned to the HC5 wire logs - the detailed per-request trace that Apache HttpClient emits when debug logging is enabled. These logs show every step of the HTTP client's lifecycle: connection acquisition, request dispatch, response receipt.

Enabling HC5 wire logging (yaml)
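logging:
  level:
    # Connection pool lifecycle: lease, release, pool statistics
    org.apache.hc.client5.http.impl.io.PoolingHttpClientConnectionManager: DEBUG
    org.apache.hc.client5.http.impl.io: DEBUG
    # Raw request/response bytes on the wire (very verbose)
    org.apache.hc.client5.http.wire: DEBUG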

One failing transaction was captured in full. The log timestamps told a story that neither the connection pool nor the firewall theory could explain:

Annotated timeline - the 21-second transaction (text)
10:31:35.368   Request enters the service
10:31:35.380   Freemarker template rendered, payload ready        (+12ms - normal)
10:31:36.910   Request fully built, ready to dispatch

               ↕ 3.3 seconds - thread blocked, unknown cause

10:31:40.185   Camel destination handler route starts

               ↕ 13.7 seconds - thread blocked, unknown cause

10:31:53.897   HC5 InternalHttpClient begins execution
10:31:53.898   Connection leased - 0ms wait (6 available in pool)
10:31:54.077   HTTP 200 from upstream API - wire time: 179ms

10:31:55.630   Total elapsed: 20,262ms

The numbers were unambiguous. HC5 had 6 warm, available connections in the pool. It acquired one in 0 milliseconds. The upstream API responded in 179 milliseconds. The connection pool was fine. The network was fine. The API was fine.

Most of the delay happened before HC5 ran: two gaps of 3.3 seconds and 13.7 seconds, 17 of the 20 seconds, during which the Tomcat NIO thread executing the Camel route was blocked on something entirely outside the HTTP client.
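
The gap arithmetic behind that reading can be reproduced mechanically. The sketch below takes the eight timestamps from the annotated timeline and computes the elapsed time between consecutive log lines. The class and method names are illustrative, and the step labels are shortened versions of the log entries.

```java
import java.time.Duration;
import java.time.LocalTime;
import java.util.ArrayList;
import java.util.List;

// Illustrative helper; names are assumptions, timestamps come from the captured transaction.
final class TimelineGaps {

    static final class Gap {
        final String from;
        final String to;
        final long millis;

        Gap(String from, String to, long millis) {
            this.from = from;
            this.to = to;
            this.millis = millis;
        }
    }

    // {timestamp, abbreviated label} for each line of the annotated timeline.
    static final String[][] CAPTURED = {
        {"10:31:35.368", "request enters service"},
        {"10:31:35.380", "template rendered"},
        {"10:31:36.910", "request built"},
        {"10:31:40.185", "destination handler starts"},
        {"10:31:53.897", "HC5 begins execution"},
        {"10:31:53.898", "connection leased"},
        {"10:31:54.077", "HTTP 200 received"},
        {"10:31:55.630", "transaction ends"},
    };

    /** Elapsed time between each pair of consecutive steps (same-day timestamps assumed). */
    static List<Gap> gaps(String[][] steps) {
        List<Gap> result = new ArrayList<>();
        for (int i = 1; i < steps.length; i++) {
            LocalTime previous = LocalTime.parse(steps[i - 1][0]);
            LocalTime current = LocalTime.parse(steps[i][0]);
            long millis = Duration.between(previous, current).toMillis();
            result.add(new Gap(steps[i - 1][1], steps[i][1], millis));
        }
        return result;
    }

    public static void main(String[] args) {
        for (Gap gap : gaps(CAPTURED)) {
            System.out.printf("%6d ms  %s -> %s%n", gap.millis, gap.from, gap.to);
        }
    }
}
```

Run against the captured transaction, the two largest gaps come out at 3,275ms and 13,712ms, the HTTP exchange at 179ms, and the seven gaps sum to the 20,262ms total.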

The HTTP client spent 179ms doing its job. The rest of the pipeline spent the other 20,083ms, most of it doing nothing visible.

💡 The Discovery

The logs from the same Tomcat NIO thread (nio-8080-exec-1) throughout the transaction showed a recurring pattern: multiple "successfully pushed message to Kafka" entries at various points in the pipeline - before the HTTP call was made.

The Camel route was making synchronous Kafka publishes on the same thread that would eventually make the outbound HTTP call. Each Kafka publish was blocking the thread until the broker acknowledged the message. Under normal conditions this is fast. But Kafka acknowledgement latency is variable - and on a busy broker, or under any broker-side pressure, each synchronous publish could block the thread for seconds.

The two gaps in the timeline mapped directly to this pattern:

01

Gap 1 - 3.3 seconds

Between OperationRequestProcessor completing and the destination handler route starting - a synchronous Kafka publish or other I/O on the processing thread before dispatch.

02

Gap 2 - 13.7 seconds

Inside the destination handler route, before HC5 was called - additional synchronous Kafka publishes on the same Tomcat NIO thread holding it from executing the HTTP call.

The upstream API was not slow. The network was not slow. The connection pool was not slow. The application pipeline was blocking its own outbound HTTP call with synchronous I/O on the same thread.

The ghost connections weren't in the network. They were in the thread - blocked, waiting for Kafka, while the HTTP client sat ready with warm connections and nowhere to go.

🛠 The Resolution

The firewall problem was the first to be closed out. SO_KEEPALIVE was applied to the service - keeping TCP connections warm in the firewall's NAT table so they would never reach the stale state that caused SYN retransmissions. External API traffic was also rerouted through a new, correctly configured firewall path, confirming zero SYN stalls on subsequent captures. That thread was done.

The 21-second latency was a different matter. The wire logs had made the diagnosis clear: the blocking was happening inside the Camel pipeline, on the Tomcat NIO thread, before the HTTP client ever ran. The pattern in the logs - repeated synchronous Kafka publishes on the same thread interspersed between processing steps - was the signal. Two gaps. Two places where the thread was waiting for broker acknowledgements instead of doing work.

The fix is conceptually straightforward: any Kafka publish that doesn't need to complete before the HTTP response should be made asynchronous. A fire-and-forget publish frees the thread immediately. The broker can catch up in its own time. The outbound HTTP call no longer waits.

The pattern to look for - synchronous vs asynchronous (java)
// Synchronous - blocks the thread until the broker acknowledges
// Every ms the broker takes adds directly to the request latency
kafkaTemplate.send(topic, message).get();  // <- .get() is the problem

// Asynchronous - fire and forget, the thread moves on once the record is buffered
// Broker acknowledges in its own time; its latency stays out of the HTTP path
// Note: send() itself can still block (up to max.block.ms) on metadata or a full buffer
kafkaTemplate.send(topic, message);        // <- no .get(), no block on the ack
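
The effect is easy to reproduce without a broker. The following sketch uses only the JDK: `fakeSend` is a hypothetical stand-in for a producer send that completes its future after a configurable acknowledgement delay, so it is not the Kafka or Spring API. It measures how long the calling thread is held in each style, and attaches a completion callback so that a failed publish is still logged instead of being silently dropped.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

// JDK-only simulation; nothing here talks to a real broker.
final class PublishBlockingDemo {

    // Single daemon thread playing the role of the broker sending acknowledgements.
    private static final ScheduledExecutorService BROKER =
            Executors.newSingleThreadScheduledExecutor(runnable -> {
                Thread thread = new Thread(runnable, "fake-broker");
                thread.setDaemon(true);
                return thread;
            });

    /** Hypothetical stand-in for a producer send: the future completes after the ack delay. */
    static CompletableFuture<String> fakeSend(String message, long ackDelayMillis) {
        CompletableFuture<String> ack = new CompletableFuture<>();
        BROKER.schedule(() -> ack.complete("ack:" + message), ackDelayMillis, TimeUnit.MILLISECONDS);
        return ack;
    }

    /** Blocking style: the caller waits for the acknowledgement before continuing. */
    static long timeSyncPublishMillis(long ackDelayMillis) {
        long start = System.nanoTime();
        fakeSend("event", ackDelayMillis).join();
        return TimeUnit.NANOSECONDS.toMillis(System.nanoTime() - start);
    }

    /** Callback style: the caller registers a handler and continues immediately. */
    static long timeAsyncPublishMillis(long ackDelayMillis) {
        long start = System.nanoTime();
        fakeSend("event", ackDelayMillis).whenComplete((ack, error) -> {
            if (error != null) {
                System.err.println("publish failed: " + error);
            }
        });
        return TimeUnit.NANOSECONDS.toMillis(System.nanoTime() - start);
    }

    public static void main(String[] args) {
        System.out.println("sync  held the caller for " + timeSyncPublishMillis(300) + " ms");
        System.out.println("async held the caller for " + timeAsyncPublishMillis(300) + " ms");
    }
}
```

With a 300ms simulated acknowledgement, the blocking variant holds the caller for at least 300ms while the callback variant returns almost immediately. The trade-off is that the caller no longer knows whether the publish succeeded, which is why the callback, or an equivalent error handler, matters.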

The key insight from this investigation: an SRE's job is to find and prove where the time goes - not to fix the application code. The wire logs provided an annotated timeline showing exactly which thread was blocked, for how long, and at which point in the pipeline. That evidence is precise enough for any developer to trace the specific call and fix it without guesswork.

Find where the time goes. Show your working. Hand it over. The best SRE investigation ends with a clear answer, not an action list.

📖 The Lesson
01

Wire logs tell the truth - APM and metrics often lie by omission

APM tools show transaction duration end-to-end. They rarely show you the gaps inside - the time between steps where a thread is blocked but nothing is happening. HC5 wire logs with timestamps showed exactly where the time went. When you can't find the problem in metrics, go to the raw execution trace.

02

Exonerating suspects is as important as finding the culprit

Three suspects - HC5 defaults, the firewall, and the network - all had evidence pointing at them. The disciplined approach was to test each one and eliminate it with data before moving on. The firewall was real but not the cause of 21 seconds. The pool was inefficient but not the cause. Eliminating them cleanly prevented wasted effort on the wrong fix.

03

Synchronous I/O on HTTP dispatch threads is a latency multiplier

A Tomcat NIO thread that makes a synchronous Kafka publish before an outbound HTTP call will block for as long as the broker takes - and that latency adds directly to the HTTP call's total duration from the client's perspective. Any I/O that does not need to complete before the HTTP response should be made asynchronous.

04

An HTTP client with 0ms connection wait is not the problem

When HC5 logs show "connection leased, 0ms wait" and the upstream responds in 179ms, the HTTP client is doing its job perfectly. Total transaction time of 21 seconds means the other 20 seconds happened before or after the HTTP layer. Stop tuning the client and find the thread.
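
One way to "find the thread" while a request is stalled is to look at what that thread is doing at that moment. A full thread dump (for example with jstack) does this from outside the process; the sketch below shows the same idea in-process using only the JDK. The class name, the demo thread name and the latch standing in for a blocking publish are all illustrative assumptions.

```java
import java.util.Map;
import java.util.Optional;
import java.util.concurrent.CountDownLatch;

// Illustrative helper; in a real service the name would be something like "nio-8080-exec-1".
final class ThreadPeek {

    /** Returns the state and current stack of the named thread, if it is alive. */
    static Optional<String> describe(String threadName) {
        for (Map.Entry<Thread, StackTraceElement[]> entry : Thread.getAllStackTraces().entrySet()) {
            Thread thread = entry.getKey();
            if (thread.getName().equals(threadName)) {
                StringBuilder out = new StringBuilder(threadName + " state=" + thread.getState());
                for (StackTraceElement frame : entry.getValue()) {
                    out.append(System.lineSeparator()).append("    at ").append(frame);
                }
                return Optional.of(out.toString());
            }
        }
        return Optional.empty();
    }

    /** Demo: parks a worker on a latch (standing in for a blocking call) and describes it. */
    static String demo() {
        CountDownLatch ack = new CountDownLatch(1);
        Thread worker = new Thread(() -> {
            try {
                ack.await(); // the "blocked waiting for an acknowledgement" moment
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        }, "demo-exec-1");
        worker.setDaemon(true);
        worker.start();
        try {
            // Wait until the worker is actually parked before sampling its stack.
            while (worker.getState() != Thread.State.WAITING) {
                Thread.sleep(5);
            }
            return describe("demo-exec-1").orElse("thread not found");
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return "interrupted";
        } finally {
            ack.countDown(); // release the worker
        }
    }

    public static void main(String[] args) {
        System.out.println(demo());
    }
}
```

The top frames of the sampled stack name the call the thread is parked in, which is the piece of evidence a timeline of gaps alone cannot provide.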

The latency investigation checklist (text)
1. Is the upstream API actually slow?
   -> curl directly to the API, measure response time
   -> if <500ms: API is fine, look elsewhere

2. Is the connection pool the bottleneck?
   -> enable HC5 DEBUG logging
   -> look for "connection leased" - how many ms did it wait?
   -> if 0ms wait: pool is fine

3. Is the firewall dropping idle connections?
   -> tcpdump during a slow request, look for SYN retransmissions
   -> if present: apply SO_KEEPALIVE or reduce idle connection lifetime

4. Where is the time actually going?
   -> add per-step timing to the application pipeline
   -> look for gaps between steps on the same thread
   -> correlate with any synchronous I/O (Kafka, Redis, DB) on that thread

5. Is the HTTP dispatch thread blocked by something else?
   -> if there's a gap BEFORE the HTTP client executes:
     the problem is not in the HTTP layer at all
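
Step 4 of the checklist, adding per-step timing, can be sketched with a small helper that records how long each pipeline step holds the current thread. This is an illustrative JDK-only sketch: `StepTimer`, its methods and the step names are assumptions, not part of Camel or of the service described here.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.concurrent.TimeUnit;
import java.util.function.Supplier;

// Hypothetical per-request timer; not thread-safe, intended for one request on one thread.
final class StepTimer {

    static final class Timing {
        final String step;
        final String thread;
        final long millis;

        Timing(String step, String thread, long millis) {
            this.step = step;
            this.thread = thread;
            this.millis = millis;
        }

        @Override
        public String toString() {
            return step + " took " + millis + " ms on " + thread;
        }
    }

    private final List<Timing> timings = new ArrayList<>();

    /** Runs one step and records its duration and the thread it ran on. */
    <T> T time(String step, Supplier<T> work) {
        long start = System.nanoTime();
        try {
            return work.get();
        } finally {
            long millis = TimeUnit.NANOSECONDS.toMillis(System.nanoTime() - start);
            timings.add(new Timing(step, Thread.currentThread().getName(), millis));
        }
    }

    List<Timing> timings() {
        return Collections.unmodifiableList(timings);
    }

    /** Returns the longest recorded step, or null if nothing was timed. */
    Timing slowest() {
        Timing worst = null;
        for (Timing timing : timings) {
            if (worst == null || timing.millis > worst.millis) {
                worst = timing;
            }
        }
        return worst;
    }

    /** Demo helper standing in for a blocking call such as a synchronous publish. */
    static String pause(long millis) {
        try {
            Thread.sleep(millis);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return "done";
    }

    public static void main(String[] args) {
        StepTimer timer = new StepTimer();
        timer.time("render-template", () -> "payload");
        timer.time("blocking-publish", () -> pause(100));
        timer.timings().forEach(System.out::println);
        System.out.println("slowest: " + timer.slowest());
    }
}
```

Wrapping each stage of the pipeline this way and logging the recorded timings at the end of the request turns an opaque 20-second transaction into a per-step breakdown on a named thread.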

Written by

Vishnu KS

Senior Cloud Infrastructure Engineer
