Debugging vLLM inefficiencies (under-batching, KV pressure, etc.) — what I learned
Posted by Pitiful_Recover3295@reddit | LocalLLaMA
I’ve been digging into vLLM performance recently and ran into a few patterns that aren’t obvious from raw metrics.
For example:
- GPU at ~50% utilization doesn't necessarily mean low load
- You can have 40+ running requests and still be underutilized
- KV cache can be near capacity without it being obvious from top-level metrics
The tricky part is correlating:
- running vs max_num_seqs (batch occupancy)
- GPU util vs actual concurrency
- KV usage vs sequence length + request mix
Most of the time, you’re just staring at /metrics and guessing.
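To make that less of a guessing game, here's a minimal sketch of the kind of check involved: scrape the Prometheus text from /metrics, pull out a few gauges, and apply simple heuristics. Note the metric names (`vllm:num_requests_running`, `vllm:num_requests_waiting`, `vllm:gpu_cache_usage_perc`) and the thresholds are assumptions — names and semantics vary across vLLM versions, so check your own `/metrics` output first.

```python
# Sketch: parse vLLM's Prometheus /metrics text and flag two common issues.
# Metric names and thresholds below are assumptions -- verify against your
# vLLM version's actual /metrics output.
import re

def parse_metrics(text: str) -> dict[str, float]:
    """Extract simple name/value pairs from Prometheus exposition text."""
    out = {}
    for line in text.splitlines():
        m = re.match(r"^([a-zA-Z_:][\w:]*)(?:\{[^}]*\})?\s+([0-9.eE+-]+)$", line)
        if m:
            out[m.group(1)] = float(m.group(2))
    return out

def diagnose(metrics: dict[str, float], max_num_seqs: int = 256) -> list[str]:
    """Apply rough heuristics for under-batching and KV cache pressure."""
    flags = []
    running = metrics.get("vllm:num_requests_running", 0.0)
    waiting = metrics.get("vllm:num_requests_waiting", 0.0)
    kv_usage = metrics.get("vllm:gpu_cache_usage_perc", 0.0)  # fraction 0..1

    # Under-batching: occupancy far below the batch ceiling with no queue,
    # i.e. the server could take more concurrency than the client is sending.
    if running < 0.3 * max_num_seqs and waiting == 0:
        flags.append("under-batching: low occupancy and empty queue")

    # KV pressure: near-full cache with a backlog means new requests stall
    # (or running ones get preempted) for cache blocks, not for compute.
    if kv_usage > 0.9 and waiting > 0:
        flags.append("KV pressure: cache >90% full with requests waiting")
    return flags

sample = """\
vllm:num_requests_running 12
vllm:num_requests_waiting 0
vllm:gpu_cache_usage_perc 0.41
"""
print(diagnose(parse_metrics(sample), max_num_seqs=256))
```

Running this against the sample snapshot flags under-batching (12 running against a ceiling of 256, nothing queued), which is exactly the "GPU looks busy but you're underutilized" case.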
I ended up building a small CLI tool to help with this — it looks at vLLM + GPU signals and flags things like:
- under-batching
- KV cache pressure
- low prefix cache reuse
Not trying to promote it aggressively — mostly curious:
How are others debugging vLLM inefficiencies today?
Repo if useful:
Pitiful_Recover3295@reddit (OP)
(If this is not the right place, happy to remove.)
Just a dev trying to see if my engineering effort provides some value.