Debugging vLLM inefficiencies (under-batching, KV pressure, etc.) — what I learned
Posted by Pitiful_Recover3295@reddit | LocalLLaMA
I’ve been digging into vLLM performance recently and ran into a few patterns that aren’t obvious from raw metrics.
For example:
- GPU at ~50% utilization doesn't necessarily mean low load
- You can have 40+ running requests and still be underutilized
- KV cache can be near capacity without it being obvious from top-level metrics
The tricky part is correlating:
- running vs max_num_seqs (batch occupancy)
- GPU util vs actual concurrency
- KV usage vs sequence length + request mix
Most of the time, you’re just staring at /metrics and guessing.
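To make that less of a guessing game, here's a minimal sketch of the kind of check involved: scrape the Prometheus text from /metrics, pull out a few gauges, and apply simple heuristics. Note the metric names (`vllm:num_requests_running`, `vllm:num_requests_waiting`, `vllm:gpu_cache_usage_perc`) and the thresholds are assumptions — names and semantics vary across vLLM versions, so check your own `/metrics` output first.

```python
# Sketch: parse vLLM's Prometheus /metrics text and flag two common issues.
# Metric names and thresholds below are assumptions -- verify against your
# vLLM version's actual /metrics output.
import re

def parse_metrics(text: str) -> dict[str, float]:
    """Extract simple name/value pairs from Prometheus exposition text."""
    out = {}
    for line in text.splitlines():
        m = re.match(r"^([a-zA-Z_:][\w:]*)(?:\{[^}]*\})?\s+([0-9.eE+-]+)$", line)
        if m:
            out[m.group(1)] = float(m.group(2))
    return out

def diagnose(metrics: dict[str, float], max_num_seqs: int = 256) -> list[str]:
    """Apply rough heuristics for under-batching and KV cache pressure."""
    flags = []
    running = metrics.get("vllm:num_requests_running", 0.0)
    waiting = metrics.get("vllm:num_requests_waiting", 0.0)
    kv_usage = metrics.get("vllm:gpu_cache_usage_perc", 0.0)  # fraction 0..1

    # Under-batching: occupancy far below the batch ceiling with no queue,
    # i.e. the server could take more concurrency than the client is sending.
    if running < 0.3 * max_num_seqs and waiting == 0:
        flags.append("under-batching: low occupancy and empty queue")

    # KV pressure: near-full cache with a backlog means new requests stall
    # (or running ones get preempted) for cache blocks, not for compute.
    if kv_usage > 0.9 and waiting > 0:
        flags.append("KV pressure: cache >90% full with requests waiting")
    return flags

sample = """\
vllm:num_requests_running 12
vllm:num_requests_waiting 0
vllm:gpu_cache_usage_perc 0.41
"""
print(diagnose(parse_metrics(sample), max_num_seqs=256))
```

Running this against the sample snapshot flags under-batching (12 running against a ceiling of 256, nothing queued), which is exactly the "GPU looks busy but you're underutilized" case.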
I ended up building a small CLI tool to help with this — it looks at vLLM + GPU signals and flags things like:
- under-batching
- KV cache pressure
- low prefix cache reuse
Not trying to promote it aggressively — mostly curious:
How are others debugging vLLM inefficiencies today?
Repo if useful:
Pitiful_Recover3295@reddit (OP)
(If this is not the right place, happy to remove.)
Just a dev trying to see if my engineering effort provides some value.