Mhimasree
I write about AI agents, distributed systems, and low-level engineering craft — from someone who's shipped all three in production.
Systems

P99 is a lie — and what to measure instead

Latency percentiles tell a partial story. Here's how I instrument distributed systems to find the real bottleneck.

Everyone reports P99. Almost nobody questions what it actually tells them.

What P99 hides

P99 measures the 99th percentile of your request latency over a time window. If your P99 is 200ms, that means 99% of requests completed within 200ms. Sounds good. But consider what it doesn't tell you: how bad the worst 1% of requests actually are (a 200ms P99 is consistent with a 20-second P99.9), how often a real user hits that tail (a page that fans out to 50 backend calls lands in the worst 1% on roughly 40% of loads, since 1 − 0.99⁵⁰ ≈ 0.4), and whether the number is an average across endpoints whose latency profiles have nothing in common.
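A toy illustration of the first point: two services with identical P99s but wildly different worst cases. The numbers are made up and the percentile helper uses a simple nearest-rank definition; this is a sketch, not a benchmark.

```python
# Two synthetic latency samples (ms) with the same P99 but very different tails.

def percentile(samples, p):
    """Nearest-rank percentile: the value below which ~p% of samples fall."""
    ordered = sorted(samples)
    rank = max(0, int(round(p / 100 * len(ordered))) - 1)
    return ordered[rank]

# Service A: 980 fast requests, 20 moderately slow ones.
service_a = [50] * 980 + [200] * 20
# Service B: same P99, but one request took 20 seconds.
service_b = [50] * 980 + [200] * 19 + [20_000]

for name, data in (("A", service_a), ("B", service_b)):
    print(name, "P99 =", percentile(data, 99), "ms, max =", max(data), "ms")
```

Both report P99 = 200ms. Only the max (or P99.9) reveals that service B occasionally takes 20 seconds.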

What I measure instead

After enough production incidents, I’ve settled on a layered approach.

Percentiles at multiple levels

Don’t just report one P99 for the whole service. Report P50, P95, P99, and P99.9 — and report them per endpoint, per dependency, and per customer tier if your product has tiers.
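A minimal sketch of what per-endpoint, multi-level reporting looks like in-process. In production these numbers would come from your metrics backend; `LatencyRecorder` is a hypothetical stand-in, and the nearest-rank indexing is deliberately simple.

```python
from collections import defaultdict

class LatencyRecorder:
    """Hypothetical in-process recorder: latencies bucketed per endpoint."""

    def __init__(self):
        self._samples = defaultdict(list)  # endpoint -> list of latencies (ms)

    def observe(self, endpoint, latency_ms):
        self._samples[endpoint].append(latency_ms)

    def report(self, levels=(50, 95, 99, 99.9)):
        """Nearest-rank percentile at each level, per endpoint."""
        out = {}
        for endpoint, samples in self._samples.items():
            ordered = sorted(samples)
            out[endpoint] = {
                f"P{p:g}": ordered[min(len(ordered) - 1,
                                       int(p / 100 * len(ordered)))]
                for p in levels
            }
        return out

rec = LatencyRecorder()
for ms in range(1000):           # fake uniform latencies, 0..999ms
    rec.observe("/search", ms)
print(rec.report()["/search"])
```

The same structure extends naturally to a per-tier or per-dependency key instead of the endpoint.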

Latency histograms, not summaries

Prometheus summaries compute quantiles on each instance, so they can't be meaningfully aggregated across instances — there is no way to combine two pre-computed P99s into a fleet-wide P99. Histograms export raw bucket counts, which merge by simple addition. Always use histograms if you're running more than one replica.
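The reason merging works is that a histogram is just a vector of monotonic bucket counters, and counters add. A small sketch with illustrative bucket bounds (not Prometheus's defaults):

```python
# Upper bounds for each latency bucket, in ms; +inf catches the rest.
BUCKETS = [50, 100, 250, 500, 1000, float("inf")]

def to_histogram(samples):
    """Count how many samples fall in each bucket."""
    counts = [0] * len(BUCKETS)
    for s in samples:
        for i, bound in enumerate(BUCKETS):
            if s <= bound:
                counts[i] += 1
                break
    return counts

replica_1 = to_histogram([40, 90, 300, 700])
replica_2 = to_histogram([60, 80, 120, 2000])

# Fleet-wide view: element-wise sum. A pre-computed quantile from each
# replica (a summary) could not be combined this way.
merged = [a + b for a, b in zip(replica_1, replica_2)]
print(merged)
```

Quantiles are then estimated from the merged counts, which is exactly what PromQL's `histogram_quantile` does over summed buckets.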

Time-to-first-byte vs total

For streaming or chunked responses, time-to-first-byte (TTFB) and total latency tell completely different stories. A slow TTFB means your backend is slow. A fast TTFB with a slow total usually points to the client or the network.
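Instrumenting the two separately is cheap. A sketch, where `stream()` is a stand-in for whatever yields your response chunks and the sleeps simulate backend work and network trickle:

```python
import time

def stream():
    """Fake chunked response: slow to start, then trickles chunks."""
    time.sleep(0.01)   # backend work before the first byte
    yield b"chunk-1"
    time.sleep(0.02)   # remaining chunks arrive slowly
    yield b"chunk-2"

start = time.monotonic()
chunks = iter(stream())
first = next(chunks)                 # arrival of the first chunk
ttfb = time.monotonic() - start
for _ in chunks:                     # drain the rest of the stream
    pass
total = time.monotonic() - start
print(f"TTFB={ttfb * 1000:.0f}ms total={total * 1000:.0f}ms")
```

Here TTFB is a fraction of the total; if the two numbers track each other instead, the time is going into producing the first byte, not delivering the rest.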

Finding the real bottleneck

The real bottleneck is almost never where the latency shows up. A slow database query might show up as latency on an unrelated endpoint that happens to share a connection pool. Distributed tracing — even lightweight span sampling — is the only reliable way to find these.
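At its core a trace is just named, nested timings. A real system would use OpenTelemetry; this stand-in accumulates (name, duration) pairs so a single request's time budget can be broken down, with the sleeps as fake work:

```python
import time
from contextlib import contextmanager

TRACE = []  # (span name, duration in seconds), appended as spans close

@contextmanager
def span(name):
    """Record how long the enclosed block took, even if it raises."""
    start = time.monotonic()
    try:
        yield
    finally:
        TRACE.append((name, time.monotonic() - start))

with span("handle_request"):
    with span("db_query"):
        time.sleep(0.02)       # fake: slow query on a shared pool
    with span("render"):
        time.sleep(0.005)      # fake: cheap templating

durations = dict(TRACE)
print(sorted(durations.items(), key=lambda kv: -kv[1]))
```

Even this crude version answers the question the P99 number can't: which span inside the request actually ate the time.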

The rule I follow: don’t optimize anything until you have a trace that shows you exactly where the time goes.