Vishnu KS is an end-to-end engineer based in Kochi, Kerala, India. He builds applications, scalable cloud infrastructure, DevOps pipelines, security controls, Kubernetes platforms, observability systems, and AI integrations. He is a KCD Kochi organiser, CNCF Kerala community leader, and Top 3% global talent verified by Toptal.

What is Vishnu KS available for?

Vishnu KS is available for application engineering, cloud architecture, DevOps automation, Kubernetes platforms, security controls, production debugging, advisory roles, architecture reviews, and speaking opportunities via Topmate at https://topmate.io/iamvishnuks.

What open source projects has Vishnu KS built?

Vishnu KS has built XReplicator (an eBPF-powered VM backup engine), OpsBrew (a Kubernetes log pipeline tool that won the Microsoft Azure Sentinel Hackathon), eBee (an interactive eBPF learning platform), AudioNet (audio classification using TensorFlow), and XMigrate (an open-source VM portability platform).

What is SRE Chronicles?

SRE Chronicles is a technical blog by Vishnu KS featuring real production incident investigations written as narrative stories. Each chapter covers a real debugging investigation - the alert, the hunt, the discovery, the fix, and the lessons. Available at https://iamvishnuks.com/chronicles.

Is Vishnu KS a Kubernetes expert?

Yes. Vishnu KS has extensive Kubernetes expertise spanning platform engineering at fleet scale, GitOps with ArgoCD, eBPF-level observability with Cilium, multi-tenant Loki/Grafana stacks, and deep incident investigation. He is also the organiser of KCD Kochi 2026 - Kerala's flagship Kubernetes Community Days event.

Incident Case 04

Guilty by Association

I came back from a week's vacation to a Teams group with thirty people, an escalation chain climbing toward the CTO, and a manager asking one question on repeat: what changed on April 20th?

01 / The Alert

The Teams group was called "Connection reset issue." I was added while still catching up on a week of notifications.

The message thread told the full story before I read a single log. A frustrated manager was asking valid questions: what changed on April 20th, why is this critical endpoint failing, who owns this. Then I watched the group grow: my manager, my manager's manager, network team, their managers. The only person missing was the CTO.

This is what I was walking back into after a vacation.

I called my buddy in the networking team. He was curious too, once I explained it. The symptom was intermittent connection resets on an endpoint that an application cluster was calling: a domain fronted by Cloudflare and reached through our internal Checkpoint firewall.

Not timeouts. Resets. That distinction mattered immediately.

When you see a connection reset, something is actively sending a RST. That rules out silent drops and routing black holes. Something made a decision.

02 / The Hunt

The Checkpoint team had already pulled logs by the time I joined the call. They showed a blocked connection with the destination domain listed as abudhbispacedebate.com.

Our domain was staging-login.acme-platform.com.

That mismatch was the signal. Not the block itself, but the wrong domain name in the block reason.

The question was not why Checkpoint blocked the connection. The question was why Checkpoint thought it was talking to a completely different domain.

Two things had to be true simultaneously for this to happen:

- Our DNS resolved staging-login.acme-platform.com to a Cloudflare IP
- Checkpoint's threat intel database had that same IP associated with abudhbispacedebate.com

Cloudflare is shared IP infrastructure. A single anycast IP can proxy thousands of domains simultaneously. The IP we resolved to was clean for our domain. Checkpoint's feed had flagged it for someone else's domain running on the same IP.

The intermittent pattern confirmed it. Cloudflare rotates IPs behind a domain. Some IPs in the rotation were clean in Checkpoint's database. Others were not. Every DNS resolution was a lottery.
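One way to make the lottery concrete is to tally the addresses a hostname resolves to against whatever set the firewall has flagged. The sketch below is only an illustration, not something from the investigation: `resolve_ipv4` relies on the standard `socket.getaddrinfo`, while `flagged_share`, the flagged set, and the documentation-range addresses are hypothetical, and the arithmetic assumes each address in the rotation is equally likely to be handed out.

```python
import socket
from typing import Iterable, List, Set


def resolve_ipv4(hostname: str, port: int = 443) -> List[str]:
    """Return the distinct IPv4 addresses the local resolver hands back."""
    infos = socket.getaddrinfo(hostname, port, socket.AF_INET, socket.SOCK_STREAM)
    return sorted({info[4][0] for info in infos})


def flagged_share(resolved: Iterable[str], flagged: Set[str]) -> float:
    """Fraction of resolved addresses that appear in the flagged set."""
    ips = list(resolved)
    if not ips:
        return 0.0
    return sum(ip in flagged for ip in ips) / len(ips)


# Hypothetical data: four addresses in rotation, one flagged by the feed.
rotation = ["203.0.113.10", "203.0.113.11", "203.0.113.12", "203.0.113.13"]
flagged = {"203.0.113.12"}

share = flagged_share(rotation, flagged)
print(f"{share:.0%} of resolutions would land on a flagged IP")  # prints 25%
```

In practice the input would come from `resolve_ipv4("staging-login.acme-platform.com")` sampled over time, since resolver caching tends to return the same answers until the record's TTL expires.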

03 / The Discovery

The RST versus timeout distinction resolved one early theory immediately.

A firewall drop rule, a silent discard, produces a timeout. The SYN goes out and never gets a SYN-ACK. The client waits, retransmits, eventually gives up. That is a different failure mode, one we'd seen before in other investigations.

A block from the URL Filtering blade sends an active RST back to the client. The TCP connection establishes, Checkpoint inspects the traffic, makes a policy decision, and injects a RST. That is exactly what the application was seeing.

The deeper problem was the decision basis. Checkpoint's URL Filtering blade has two modes for evaluating HTTPS traffic:

- SNI-based: reads the TLS ClientHello, extracts the SNI field, looks up the actual domain name
- IP-based fallback: when SNI extraction fails or is unavailable, looks up the destination IP in a reputation database

When we checked the Checkpoint SmartLog entry, the SNI field was empty. The blade had fallen back to IP reputation, looked up the Cloudflare IP, and found abudhbispacedebate.com, a flagged domain in the threat feed that happened to share the same IP.

Our domain was never evaluated. A different domain's reputation killed our connection.

Checkpoint showed us the right IP and the wrong domain. That combination has one explanation: IP reputation lookup, not SNI inspection.
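The two evaluation modes can be modelled in a few lines. This is a toy sketch of the behaviour described above, not Checkpoint's implementation: the `evaluate` function, the `Verdict` type, the lookup tables, and the documentation-range IP are all hypothetical.

```python
from typing import Dict, NamedTuple, Optional, Set


class Verdict(NamedTuple):
    action: str        # "accept" or "reset"
    evaluated_as: str  # the name the decision was based on
    basis: str         # "sni" or "ip-reputation"


def evaluate(sni: Optional[str], dst_ip: str,
             bad_domains: Set[str],
             ip_to_flagged_domain: Dict[str, str]) -> Verdict:
    """Toy model: prefer the SNI hostname, fall back to IP reputation."""
    if sni:
        action = "reset" if sni in bad_domains else "accept"
        return Verdict(action, sni, "sni")
    # No SNI available: the only thing left to look up is the destination IP.
    flagged = ip_to_flagged_domain.get(dst_ip)
    if flagged is not None:
        return Verdict("reset", flagged, "ip-reputation")
    return Verdict("accept", dst_ip, "ip-reputation")


SHARED_IP = "203.0.113.10"  # hypothetical stand-in for the shared Cloudflare IP
bad_domains = {"abudhbispacedebate.com"}
ip_reputation = {SHARED_IP: "abudhbispacedebate.com"}

with_sni = evaluate("staging-login.acme-platform.com", SHARED_IP,
                    bad_domains, ip_reputation)
without_sni = evaluate(None, SHARED_IP, bad_domains, ip_reputation)
print(with_sni)
print(without_sni)
```

With the SNI present, the verdict is based on our hostname and the connection is accepted. With the SNI missing, the same connection is reset and the log names the other tenant's domain, which is the right-IP, wrong-domain signature.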

04 / The Resolution

Two fixes were needed, at different layers.

Immediate: explicit FQDN allow rule

An allow rule for staging-login.acme-platform.com placed above the URL reputation block policy. This had to be an FQDN object, not an IP, because IP-based rules would break on the next Cloudflare rotation.

Checkpoint SmartConsole: allow rule order

Rule 1: Source: app-cluster | Dst: staging-login.acme-platform.com | Action: Accept   <- insert here
...
Rule N: URL Filtering - Block bad reputation domains                                   <- was matching here
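Why position matters can be shown with a toy first-match evaluator. The sketch assumes top-down, first-match rule evaluation, which is what placing the allow rule "above" the block policy implies; the `Conn` and `Rule` types and the `ip_flagged` field are hypothetical and do not correspond to SmartConsole objects.

```python
from typing import Callable, List, NamedTuple, Optional


class Conn(NamedTuple):
    source: str
    fqdn: str
    ip_flagged: bool  # destination IP has a bad entry in the reputation feed


class Rule(NamedTuple):
    name: str
    matches: Callable[[Conn], bool]
    action: str


def first_match(rules: List[Rule], conn: Conn) -> Optional[Rule]:
    """Walk the rulebase top-down and stop at the first rule that matches."""
    for rule in rules:
        if rule.matches(conn):
            return rule
    return None


allow = Rule(
    "allow staging-login",
    lambda c: c.source == "app-cluster"
    and c.fqdn == "staging-login.acme-platform.com",
    "accept",
)
block = Rule("block bad reputation", lambda c: c.ip_flagged, "reset")

conn = Conn("app-cluster", "staging-login.acme-platform.com", ip_flagged=True)

print(first_match([block, allow], conn).action)  # block rule wins: reset
print(first_match([allow, block], conn).action)  # allow rule wins: accept
```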

Permanent: fix SNI extraction

The root cause was that the Checkpoint HTTPS Inspection blade was not extracting SNI from the TLS ClientHello. With SNI extraction working, Checkpoint evaluates the actual hostname, staging-login.acme-platform.com, and the shared IP problem disappears. This required a policy review with the Checkpoint team to understand why SNI was not being captured.

The fastest validation was in SmartLog: after the allow rule was in place, the block entries stopped. After SNI extraction was fixed, new blocks for this traffic class started showing the correct domain in the SNI field.

05 / The Lesson

RST tells you a decision was made. Timeout tells you nothing arrived.

The TCP behaviour is a first filter. A reset means something active, a firewall blade, a proxy, a load balancer, evaluated the connection and rejected it. Start there.
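A client-side probe can apply this first filter automatically. The sketch below is a minimal example using Python's standard `socket` and `ssl` modules; the `probe` and `classify_failure` names and the category labels are hypothetical, and the mapping from exception to cause is a starting heuristic, not a verdict.

```python
import socket
import ssl


def classify_failure(exc: BaseException) -> str:
    """Map a connection exception to the TCP-level behaviour it suggests."""
    if isinstance(exc, ConnectionResetError):
        # A RST arrived on an established connection: something decided.
        return "reset"
    if isinstance(exc, ConnectionRefusedError):
        # A RST in reply to the SYN: usually a closed port.
        return "refused"
    if isinstance(exc, (socket.timeout, TimeoutError)):
        # Nothing came back: silent drop, routing issue, or a dead peer.
        return "timeout"
    return "other"


def probe(host: str, port: int = 443, timeout: float = 5.0) -> str:
    """Open a TCP connection, attempt a TLS handshake, report the outcome."""
    try:
        with socket.create_connection((host, port), timeout=timeout) as raw:
            ctx = ssl.create_default_context()
            with ctx.wrap_socket(raw, server_hostname=host):
                return "ok"
    except OSError as exc:  # covers socket, timeout, and ssl errors
        return classify_failure(exc)
```

Running `probe("staging-login.acme-platform.com")` in a loop and counting the outcomes would show whether failures are resets or timeouts, and how often they occur. A reset injected after inspection typically surfaces during the TLS handshake, which is why the probe goes past the plain TCP connect.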

Cloudflare IPs are shared. Firewall reputation feeds don't always know that.

A threat intelligence feed that tags an IP as malicious because of one tenant's domain will block all other tenants on the same IP. This is a structural problem with IP-based reputation on CDN infrastructure.

SNI is in the TLS ClientHello in plaintext. Firewalls can read it without decryption.

Full HTTPS inspection (decrypt, inspect, re-encrypt) is not required to evaluate the hostname. SNI extraction reads the unencrypted handshake field. If your firewall is falling back to IP reputation on TLS traffic, SNI extraction is not working.
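To show that no decryption is involved, here is a small parser that pulls the hostname out of a raw ClientHello. It is a sketch under stated assumptions: the ClientHello fits in a single, unfragmented TLS record, and Encrypted Client Hello is not in use. The function names `extract_sni` and `build_client_hello` are hypothetical; the byte layout follows the TLS handshake format and the `server_name` extension (type 0).

```python
import ssl
import struct
from typing import Optional


def extract_sni(data: bytes) -> Optional[str]:
    """Return the SNI hostname from a raw TLS ClientHello record, if present."""
    try:
        if data[0] != 0x16:        # record type: handshake
            return None
        pos = 5                    # skip record header (type, version, length)
        if data[pos] != 0x01:      # handshake type: ClientHello
            return None
        pos += 4                   # handshake type + 3-byte length
        pos += 2 + 32              # client version + random
        pos += 1 + data[pos]       # session id
        (suites_len,) = struct.unpack_from("!H", data, pos)
        pos += 2 + suites_len      # cipher suites
        pos += 1 + data[pos]       # compression methods
        (ext_total,) = struct.unpack_from("!H", data, pos)
        pos += 2
        end = pos + ext_total
        while pos + 4 <= end:
            ext_type, ext_len = struct.unpack_from("!HH", data, pos)
            pos += 4
            if ext_type == 0:      # server_name extension
                # list length (2), name type (1), name length (2), name
                name_type = data[pos + 2]
                (name_len,) = struct.unpack_from("!H", data, pos + 3)
                name = data[pos + 5:pos + 5 + name_len]
                if name_type != 0 or len(name) != name_len:
                    return None
                return name.decode("ascii")
            pos += ext_len
        return None
    except (IndexError, struct.error, UnicodeDecodeError):
        return None                # truncated or malformed input


def build_client_hello(hostname: str) -> bytes:
    """Produce a real ClientHello in memory, without touching the network."""
    ctx = ssl.SSLContext(ssl.PROTOCOL_TLS_CLIENT)
    incoming, outgoing = ssl.MemoryBIO(), ssl.MemoryBIO()
    tls = ctx.wrap_bio(incoming, outgoing, server_hostname=hostname)
    try:
        tls.do_handshake()
    except ssl.SSLWantReadError:
        pass                       # expected: the client now waits for the server
    return outgoing.read()


hello = build_client_hello("staging-login.acme-platform.com")
print(extract_sni(hello))
```

The parser never sees a key or a certificate. Everything it reads is sent in the clear before any encryption is negotiated, which is the same information a firewall has available at that point in the handshake.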

An allow rule without SNI extraction is a workaround, not a fix.

The allow rule stops the bleeding. Without SNI extraction, every new domain going through the same firewall path is one bad IP rotation away from the same problem.

Checkpoint block diagnosis checklist

1. Is the block action a RST or a silent drop?
   -> RST: a blade made a policy decision (URL Filtering, Threat Prevention)
   -> Timeout: firewall drop rule, routing issue, or the other side is down

2. Does the blocked domain in SmartLog match the domain you are connecting to?
   -> Match: domain reputation is the issue, investigate the category
   -> Mismatch: IP reputation fallback, SNI extraction is not working

3. Is the SNI field populated in the SmartLog entry?
   -> Empty: HTTPS Inspection / SNI extraction not configured or failing
   -> Populated but wrong: SNI extracted from wrong packet in session

4. Is the destination IP a shared CDN range (Cloudflare, Akamai, Fastly)?
   -> If yes, IP reputation lookups will produce false positives
   -> Fix: explicit FQDN allow rule + fix SNI extraction

5. Verify the allow rule is an FQDN object, not an IP
   -> Cloudflare rotates IPs; IP-based rules break on the next DNS resolution
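The checklist above can also be sketched as a small triage function, for example to annotate exported log entries. Everything here is hypothetical: the `triage` name, its parameters, and the wording of the findings are illustrative and do not reflect real SmartLog field names.

```python
from typing import List, Optional


def triage(failure: str, expected_domain: str,
           logged_domain: Optional[str], sni: Optional[str],
           shared_cdn_ip: bool) -> List[str]:
    """Walk the checklist and return findings in order."""
    if failure == "timeout":
        return ["timeout: firewall drop rule, routing issue, or remote side down"]

    findings = ["RST: a blade made a policy decision"]

    if logged_domain == expected_domain:
        findings.append("domain match: investigate the domain's category")
    else:
        findings.append("domain mismatch: IP reputation fallback")

    if not sni:
        findings.append("SNI empty: extraction not configured or failing")
    elif sni != expected_domain:
        findings.append("SNI populated but wrong: check which packet it came from")

    if shared_cdn_ip:
        findings.append("shared CDN IP: add an FQDN allow rule and fix SNI extraction")

    return findings


# The case from this investigation, expressed as inputs.
for line in triage(
    failure="reset",
    expected_domain="staging-login.acme-platform.com",
    logged_domain="abudhbispacedebate.com",
    sni=None,
    shared_cdn_ip=True,
):
    print(line)
```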