Hidden causes of LLM latency: it's not just model size

Posted by emmettvance@reddit | LocalLLaMA

Hello community, this is my first time posting here. I'd like to share some quick optimizations for reducing LLM latency, since this is where most of us get frustrated.

Most developers blame latency on model size, but the real issues usually happen before the model even starts generating tokens.

Infrastructure problems == actual culprit

Latency typically comes from request queues, batching strategies, token schedulers, and memory pressure rather than the LLM itself. When multiple users hit the same endpoint, requests pile up in queues, causing delays even when GPU resources are sitting idle.
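One way to see this in your own setup is to split time-to-first-token (queue wait plus prompt prefill) from decode time when streaming. This is just a sketch: stream_tokens below is a hypothetical stand-in for whatever streaming client you actually use.

```python
# Diagnostic sketch: separate time-to-first-token (queueing + prefill) from decode time.
# `stream_tokens` is a hypothetical stand-in that fakes a streaming response.
import time

def stream_tokens(prompt: str):
    # pretend tokens arrive one at a time from the server
    for tok in ["hello", " ", "world"]:
        time.sleep(0.05)
        yield tok

def measure(prompt: str):
    start = time.perf_counter()
    first_token_at = None
    n_tokens = 0
    for _ in stream_tokens(prompt):
        if first_token_at is None:
            first_token_at = time.perf_counter()  # includes queue wait + prefill
        n_tokens += 1
    total = time.perf_counter() - start
    ttft = first_token_at - start
    print(f"TTFT: {ttft:.2f}s, decode: {total - ttft:.2f}s for {n_tokens} tokens")
    # a large TTFT with fast decode usually points at queueing/batching, not the model

measure("why is my request slow?")
```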

Static vs continuous batching matters

Static batching groups requests together and forces everything to wait for the longest sequence in the batch, which adds unnecessary delay and wastes GPU cycles. Continuous batching is way better: new requests join ongoing batches, completed sequences free memory immediately, and the GPU stays fully utilized.
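To make the difference concrete, here's a toy Python simulation of the two scheduling styles. It's my own sketch of the idea, not how any real engine implements its scheduler: with static batching every request finishes when the longest one does, while continuous batching lets short requests exit as soon as they're done.

```python
# Toy simulation: a "request" is (arrival_step, tokens_to_generate);
# each running slot emits one token per step.

def static_batching(requests, batch_size):
    """Every request in a batch waits for the longest sequence in that batch."""
    finish, clock = {}, 0
    order = sorted(range(len(requests)), key=lambda i: requests[i][0])
    for start in range(0, len(order), batch_size):
        batch = order[start:start + batch_size]
        clock = max(clock, max(requests[i][0] for i in batch))
        clock += max(requests[i][1] for i in batch)  # longest sequence gates the whole batch
        for i in batch:
            finish[i] = clock
    return finish

def continuous_batching(requests, batch_size):
    """Finished sequences free their slot immediately, so waiting requests join mid-flight."""
    finish, clock = {}, 0
    pending = sorted(range(len(requests)), key=lambda i: requests[i][0])
    running = {}  # request index -> tokens remaining
    while pending or running:
        # admit arrived requests into any free slots
        while pending and len(running) < batch_size and requests[pending[0]][0] <= clock:
            i = pending.pop(0)
            running[i] = requests[i][1]
        clock += 1
        for i in list(running):
            running[i] -= 1
            if running[i] == 0:
                finish[i] = clock
                del running[i]
    return finish

# Three short requests and one long one, all arriving together.
reqs = [(0, 10), (0, 10), (0, 10), (0, 200)]
print(static_batching(reqs, batch_size=4))      # everyone finishes at step 200
print(continuous_batching(reqs, batch_size=4))  # short requests finish at step 10
```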

Token schedulers and KV cache management

Different inference engines use different token schedulers, which affects the fairness vs throughput trade-off; some are significantly faster under load. The KV cache can also become a bottleneck with large prompts or high parallelism: if you overflow cache capacity, evictions kick in and token generation slows down.
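A rough back-of-envelope sizing sketch shows why that happens. The model dimensions below are an assumption (Llama-2-7B-like, fp16, no grouped-query attention), so plug in your own numbers.

```python
# KV cache size per sequence: 2 (K and V) * layers * kv_heads * head_dim * bytes_per_element, per token.
# Dimensions are assumed Llama-2-7B-like: 32 layers, 32 KV heads, head_dim 128, fp16.

def kv_cache_bytes(seq_len, n_layers=32, n_kv_heads=32, head_dim=128, dtype_bytes=2):
    per_token = 2 * n_layers * n_kv_heads * head_dim * dtype_bytes
    return seq_len * per_token

GiB = 1024 ** 3
per_seq = kv_cache_bytes(seq_len=4096)
print(f"one 4k-token sequence:        {per_seq / GiB:.2f} GiB")        # ~2 GiB
print(f"32 concurrent 4k sequences:  {32 * per_seq / GiB:.0f} GiB")    # ~64 GiB, more than most single GPUs
```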

Use system prompts to reduce input tokens

If you're sending the same instructions repeatedly, use system prompts instead of stuffing everything into user messages. Both the Claude and Gemini APIs support dedicated system prompt parameters that get processed separately. Instead of sending a 500-token instruction block with every request, set it once as a system prompt and only send the actual user input. This cuts down on repeated token costs and makes requests faster.
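As a sketch, here's what that looks like with the Anthropic Python SDK's system parameter (the model ID below is a placeholder; Gemini has an equivalent system instruction option):

```python
# Sketch: repeated instructions go in the dedicated `system` parameter,
# the user turn only carries the actual user input.
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

SYSTEM_PROMPT = "You are a support assistant. Answer in two sentences and cite the docs."

def ask(user_input: str) -> str:
    response = client.messages.create(
        model="claude-3-5-sonnet-latest",   # placeholder model ID, check current IDs
        max_tokens=512,
        system=SYSTEM_PROMPT,               # fixed instructions live here, not in every user message
        messages=[{"role": "user", "content": user_input}],
    )
    return response.content[0].text
```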

Client-side patterns make it worse

Sending requests in tight loops, firing hundreds of concurrent calls without limits, or hammering the API after 429 errors amplifies everything. Use semaphores to limit concurrency, add exponential backoff for rate limits, prefer streaming over waiting for the full completion, and don't send unnecessarily large context.
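A minimal asyncio sketch of that pattern; call_llm and RateLimitError are hypothetical stand-ins for whatever client and exception your SDK actually exposes.

```python
# Cap concurrency with a semaphore, back off exponentially (with jitter) on 429s.
import asyncio
import random

MAX_CONCURRENT = 8
semaphore = asyncio.Semaphore(MAX_CONCURRENT)

class RateLimitError(Exception):
    """Stand-in for your client's 429 exception."""

async def call_llm(prompt: str) -> str:
    # hypothetical stand-in: ~20% of calls simulate a 429 so the backoff path is exercised
    await asyncio.sleep(0.1)
    if random.random() < 0.2:
        raise RateLimitError
    return f"response to: {prompt}"

async def guarded_call(prompt: str, max_retries: int = 5) -> str:
    async with semaphore:                      # never more than MAX_CONCURRENT requests in flight
        for attempt in range(max_retries):
            try:
                return await call_llm(prompt)
            except RateLimitError:
                delay = (2 ** attempt) + random.random()  # exponential backoff + jitter
                await asyncio.sleep(delay)
        raise RuntimeError("still rate-limited after retries")

async def main(prompts: list[str]) -> list[str]:
    return await asyncio.gather(*(guarded_call(p) for p in prompts))

print(asyncio.run(main([f"question {i}" for i in range(20)])))
```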

In conclusion, systems using continuous batching and paged attention, like vLLM, TGI, and TensorRT-LLM, generally handle high-load scenarios better than static batching implementations. Different providers implement batching differently, so testing with your actual workload is the best way to figure out what performs well for you.