Will llama.cpp multislot improve speed?
Posted by Real_Ebb_7417@reddit | LocalLLaMA | View on Reddit | 13 comments
I've heard mostly bad opinions about multiple slots with llama.cpp (--parallel > 1). Compared to vLLM it's probably worse at this, but I recently tried vLLM with 4 slots and it did improve overall speed significantly (130 tps decode on single-slot llama.cpp vs. 400 tps with 4-slot vLLM, when all 4 slots are actually in use, of course).
BUT vLLM handles CPU offload poorly (or I don't know how to use it properly) and, from what I've heard, doesn't work very well with GGUFs, which basically limits the available quantizations to int4/int8. For many models I can easily run Q6 with llama.cpp at a nice speed, but with vLLM I'd have to step down to int4 quants.
So, to the point... I've been running some benchmarks recently, and on single-slot llama.cpp they easily take a couple of hours or more per run. I'm wondering whether using multiple slots could actually reduce the time to complete a benchmark, or whether it would stay roughly the same.
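For reference, a multi-slot run here just means starting llama-server with --parallel (-np) greater than 1; a minimal sketch, with a hypothetical model path and illustrative sizes (the -c value is the total KV cache, which gets divided across slots):

```bash
# Minimal sketch (model path and sizes are hypothetical):
# -np 4   -> 4 parallel slots
# -c      -> total context / KV cache, split across slots (~32k per slot here)
# -ngl 99 -> keep all layers on the GPU if they fit
llama-server -m ./models/my-model-Q6_K.gguf -ngl 99 -c 131072 -np 4
```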
dampflokfreund@reddit
I have never seen any improvement from multi-slot use. On the contrary, it reduces effective context size, it can often take much longer because each slot gets reprocessed very frequently, and it also reduces generation speed when the slots are in use and increases VRAM/RAM usage. I honestly don't know what the point is. I leave it disabled with the -np 1 flag.
Farmadupe@reddit
I think it depends on the workload shape? If you are doing lots of prefill, it can make sense to widen the batch size and sacrifice the KV cache that would otherwise have gone to multiple slots, so you can achieve maximum throughput.
But if you have any meaningful decode to do, then parallel processing gives your GPU an opportunity to do prefill while another sequence is generating tokens.
I agree that llama.cpp's parallel processing story isn't great. Lots of rough edges even in bleeding edge builds
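For the prefill-heavy case described above, one way to trade slots for batch width is the batch-size flags; a sketch, with the values illustrative rather than tuned:

```bash
# Prefill-heavy workload: stay at one slot but widen the batches used for prompt processing
# (-b = logical batch size, -ub = physical micro-batch size; model path is hypothetical)
llama-server -m ./models/my-model-Q6_K.gguf -ngl 99 -c 32768 -np 1 -b 4096 -ub 2048
```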
Real_Ebb_7417@reddit (OP)
Well, it won't reduce the per-slot context size if I also increase the total context size flag :P
But good to know that you didn't have any success with multi slot in llama.cpp.
MotokoAGI@reddit
From using it, you get better overall throughput. If you have many things to do, you get them done faster. I ran a bench 2 years ago and I was definitely happy to have -np 10.
BigYoSpeck@reddit
Parallel inference works really well for models loaded completely into VRAM. In my experience each request takes a small hit compared to the peak speed of a single request, but the combined speed of all requests is much higher. You do reduce the context available to each request, though, and can quickly fill up system RAM with cache slots.
This doesn't transfer to CPU offload though, unless I'm setting something up incorrectly. MoE models with expert layers offloaded to CPU take a big hit with parallel requests.
Real_Ebb_7417@reddit (OP)
What about the KV cache offloaded to RAM, and what about MoE models?
With one slot, I noticed that for dense models it's better to have the whole model on GPU and offload the KV cache to RAM. But for MoE it's better to have the KV cache in VRAM, as long as at least the active params fit on the GPU as well. So... for MoE, would offloading some layers to CPU still work?
And the same for the KV cache. I don't want to go down in quantization, so I have to make a tradeoff: either offload part of the model to CPU or offload the KV cache there.
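The two offload directions being weighed here map to different flags; a sketch, assuming a reasonably recent llama.cpp build (the MoE expert-offload flag is newer, and older builds use --override-tensor with an "exps=CPU" pattern instead; model path and numbers are hypothetical):

```bash
# Option A: whole model on GPU, KV cache kept in system RAM
llama-server -m ./models/moe-model-Q6_K.gguf -ngl 99 --no-kv-offload -c 65536 -np 4

# Option B: KV cache on GPU, expert layers of the first 20 blocks pushed to CPU
llama-server -m ./models/moe-model-Q6_K.gguf -ngl 99 --n-cpu-moe 20 -c 65536 -np 4
```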
BigYoSpeck@reddit
I haven't spent any time with the KV cache offloaded to system RAM; it's too slow, so I stick to context sizes that fit entirely in VRAM.
For MoE, my experience is that offloading any layers to CPU for the bigger models just doesn't work for multi-slot (you can still use it, it's just not faster). With everything fitting into VRAM you might get something like 80% of your usual token speed per slot, but with four slots running you're getting 3x your usual throughput. They're all fighting over the same context space, though.
But as soon as you're offloading layers to CPU, even if it's only 20-30% of the expert layers, it no longer seems to work that way. Your peak throughput is whatever it is for a single request, divided by the number of requests.
itroot@reddit
Check it yourself with llama-batched-bench.
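For anyone who hasn't used it, llama-batched-bench sweeps prompt length, generation length, and parallelism levels in one run; a sketch, with a hypothetical model path and illustrative sizes:

```bash
# -npp = prompt tokens, -ntg = generated tokens per sequence, -npl = parallel sequence counts to test
# -c must be large enough for the biggest combination (8 * (512 + 128) = 5120 tokens here)
llama-batched-bench -m ./models/my-model-Q6_K.gguf -ngl 99 -c 16384 \
  -npp 512 -ntg 128 -npl 1,2,4,8
```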
GregoryfromtheHood@reddit
With llama.cpp for agentic stuff I run parallel at 4. I get way more throughput with this; aggregated token speeds go up a fair amount compared to 1 slot. Yes, each slot generates a bit slower, but in total they're faster. I do have to set the context length to around 720k-ish though, so that each slot gets 180k. Slots don't seem to get reprocessed like this for me.
I tried using unified KV and setting the context length back down to 262k, but that was way slower and crunched speeds down far more than having a separate context per slot.
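A sketch of the two setups being compared, assuming a recent build that exposes the --kv-unified flag (model path is hypothetical and the context values are rounded to match the numbers above):

```bash
# Separate context per slot: the total -c is split, so each of the 4 slots gets ~180k
llama-server -m ./models/my-model.gguf -ngl 99 -np 4 -c 737280

# Unified KV: all 4 slots share one 262k pool instead of fixed per-slot splits
llama-server -m ./models/my-model.gguf -ngl 99 -np 4 -c 262144 --kv-unified
```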
Real_Ebb_7417@reddit (OP)
Oh thanks, that’s a very helpful response. I’ll test later with TBLite, which is running on one slot at the moment and I estimate it will take 6-8h total xd
Final-Frosting7742@reddit
I think the only point of using slots is when you want to process files in parallel with a pool of workers. Using slots to process images for OCR with a VLM, I got about a 30% improvement in processing speed. So with an LLM it could be useful if you need to process multiple queries in parallel.
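A sketch of driving the slots from the client side with a small worker pool, assuming llama-server is already running with -np 4 on its default port and using the OpenAI-compatible endpoint it exposes; the prompts file is hypothetical and this is not a robust client (prompts must not contain characters that break the JSON):

```bash
# Feed one prompt per line from prompts.txt, 4 requests in flight at a time
# (each in-flight request occupies one server slot)
< prompts.txt xargs -P 4 -I {} \
  curl -s http://localhost:8080/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{"messages":[{"role":"user","content":"{}"}]}'
```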
Real_Ebb_7417@reddit (OP)
Well, as I said, I want to use it for benchmarks to make them a bit faster, so all 4 slots would be in use basically all the time.
Sufficient_Prune3897@reddit
I've seen it do better, but it doesn't scale as well as vLLM. Maybe 250 tps with 4 concurrent users instead of 150 solo, and it doesn't scale further than that.