Everyone reports P99. Almost nobody questions what it actually tells them.
## What P99 hides
P99 measures the 99th percentile of your request latency over a time window. If your P99 is 200ms, that means 99% of requests completed within 200ms. Sounds good. But consider what it doesn’t tell you:
- Which requests are in that 1%? If they’re all from one user, one region, or one query type, that matters.
- What’s the P99.9? For a service handling 10,000 requests/minute, 0.1% is still 10 requests per minute hitting extreme tail latency (see the sketch after this list).
- Is it stable? A P99 of 200ms with high variance is a very different beast than a steady 200ms.
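To make that P99/P99.9 gap concrete, here’s a minimal sketch of the 10,000-requests-per-minute case above. The numbers are synthetic; only the standard library is used.

```python
import random
import statistics

random.seed(1)
# One synthetic minute of traffic: 10,000 requests, 10 of which (the 0.1%)
# are stuck in a multi-second tail.
latencies_ms = [random.uniform(20, 180) for _ in range(9990)]
latencies_ms += [random.uniform(2000, 5000) for _ in range(10)]

# statistics.quantiles with n=1000 returns permille cut points:
# index 499 -> P50, 949 -> P95, 989 -> P99, 998 -> P99.9.
cuts = statistics.quantiles(latencies_ms, n=1000)
for label, idx in [("P50", 499), ("P95", 949), ("P99", 989), ("P99.9", 998)]:
    print(f"{label}: {cuts[idx]:.0f}ms")

# P99 stays under 200ms and looks healthy; P99.9 jumps past 1000ms,
# exposing the ten slow requests per minute that P99 never sees.
```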
## What I measure instead
After enough production incidents, I’ve settled on a layered approach.
### Percentiles at multiple levels
Don’t just report one P99 for the whole service. Report P50, P95, P99, and P99.9 — and report them per endpoint, per dependency, and per customer tier if your product has tiers.
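As a sketch of what that looks like with the Prometheus Python client (the metric name, label names, and bucket bounds here are illustrative, not from any real service):

```python
from prometheus_client import Histogram

# Illustrative metric; the endpoint/tier label names are assumptions.
REQUEST_LATENCY = Histogram(
    "request_latency_seconds",
    "Request latency by endpoint and customer tier",
    labelnames=["endpoint", "tier"],
    buckets=[0.01, 0.025, 0.05, 0.1, 0.25, 0.5, 1.0, 2.5, 5.0, 10.0],
)

def handle(endpoint: str, tier: str) -> None:
    # .time() observes the elapsed seconds into the labeled histogram.
    with REQUEST_LATENCY.labels(endpoint=endpoint, tier=tier).time():
        ...  # actual request handling
```

With the labels in place, P50 through P99.9 for any endpoint or tier are just four `histogram_quantile()` queries over the same series.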
### Latency histograms, not summaries
Prometheus summaries can’t be aggregated across instances: each instance precomputes its own quantiles, and there is no way to combine two P99s into a fleet-wide P99. Histogram buckets are plain counters, so they sum cleanly across replicas. Always use histograms if you’re running more than one replica.
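Here’s a toy sketch of why. Bucket counts from two replicas merge by simple addition, while two precomputed P99s can’t be combined at all (all numbers are made up):

```python
# Cumulative bucket counts for the same bounds, one list per replica.
bounds = [0.05, 0.1, 0.2, 0.4, float("inf")]  # upper bounds in seconds
replica_a = [900, 950, 990, 999, 1000]  # mostly fast
replica_b = [100, 300, 600, 990, 1000]  # much slower

# Merging histograms is element-wise addition of counts.
merged = [a + b for a, b in zip(replica_a, replica_b)]

def p_from_buckets(counts, bounds, q):
    """Upper bound of the first bucket whose cumulative count covers q."""
    target = q * counts[-1]
    for count, bound in zip(counts, bounds):
        if count >= target:
            return bound

print(p_from_buckets(replica_a, bounds, 0.99))  # 0.2s
print(p_from_buckets(replica_b, bounds, 0.99))  # 0.4s
print(p_from_buckets(merged, bounds, 0.99))     # fleet-wide P99: 0.4s

# A summary would only export the two per-replica P99s (0.2s and 0.4s);
# no arithmetic on those two numbers recovers the fleet-wide value,
# and averaging them (0.3s) understates it. In PromQL the merge is:
# histogram_quantile(0.99,
#   sum by (le) (rate(request_latency_seconds_bucket[5m])))
```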
### Time-to-first-byte vs total
For streaming or chunked responses, TTFB and total latency tell completely different stories. A slow TTFB means your backend is slow to start responding. A fast TTFB with a slow total usually points at the client or the network.
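A minimal way to capture both numbers from the client side, assuming Python’s `requests` and treating the first body chunk as the first byte (the URL is a placeholder):

```python
import time
import requests

def measure(url: str) -> tuple[float, float]:
    """Return (ttfb_s, total_s) for a streaming GET."""
    start = time.monotonic()
    ttfb = 0.0
    with requests.get(url, stream=True, timeout=30) as resp:
        for chunk in resp.iter_content(chunk_size=8192):
            if not ttfb:
                ttfb = time.monotonic() - start  # first body bytes arrived
    total = time.monotonic() - start
    return ttfb, total
```

Tracked as two separate series, the diagnosis above becomes mechanical: backend problems move the TTFB curve, while everything downstream of the first byte moves only the total.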
## Finding the real bottleneck
The real bottleneck is almost never where the latency shows up. A slow database query might show up as latency on an unrelated endpoint that happens to share a connection pool. Distributed tracing, even with lightweight span sampling, is the only reliable way to find these cross-endpoint effects.
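As a sketch of what that instrumentation looks like with OpenTelemetry (the span names, service name, and the `acquire_connection`/`render` helpers are placeholders; a sampler and exporter must be configured elsewhere):

```python
from opentelemetry import trace

tracer = trace.get_tracer("my-service")  # service name is a placeholder

def handle_request(query):
    with tracer.start_as_current_span("handle_request"):
        # This span is the one that exposes shared connection-pool
        # contention: the time shows up as waiting here, not in db.query.
        with tracer.start_as_current_span("pool.acquire"):
            conn = acquire_connection()  # hypothetical helper
        with tracer.start_as_current_span("db.query"):
            rows = conn.execute(query)
        with tracer.start_as_current_span("render"):
            return render(rows)  # hypothetical helper
```

Even a modest sampling rate surfaces the pattern over enough traffic: the endpoint that looks slow spends its time in `pool.acquire`, and the trace points you at whoever is actually holding the connections.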
The rule I follow: don’t optimize anything until you have a trace that shows you exactly where the time goes.