Will llama.cpp multislot improve speed?

Posted by Real_Ebb_7417@reddit | LocalLLaMA | 13 comments

I've heard mostly bad opinions about multiple slots in llama.cpp (--parallel > 1). I guess that, compared to vLLM, it might be worse at this, but I recently tried vLLM with 4 slots and it did improve overall throughput significantly (from 130 tps decode on single-slot llama.cpp to 400 tps with 4-slot vLLM, when all 4 slots are in use, of course).
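
For reference, a rough way to measure aggregate throughput: llama-server exposes an OpenAI-compatible endpoint, so you can fire concurrent requests and divide total completion tokens by wall time. A minimal sketch, assuming a server started with something like `llama-server -m model.gguf -c 16384 -np 4` (the -c context gets split across slots, so it's usually scaled up alongside -np); the URL, model name, and prompts are placeholders:

```python
import time
from concurrent.futures import ThreadPoolExecutor

import requests  # pip install requests

URL = "http://127.0.0.1:8080/v1/chat/completions"  # default llama-server port
N_CLIENTS = 4  # match --parallel / -np on the server

def one_request(i: int) -> int:
    """Send one completion request; return the number of generated tokens."""
    resp = requests.post(URL, json={
        "model": "local",  # llama-server doesn't require a specific name here
        "messages": [{"role": "user", "content": f"Write a short story #{i}."}],
        "max_tokens": 256,
    }, timeout=600)
    resp.raise_for_status()
    return resp.json()["usage"]["completion_tokens"]

start = time.time()
with ThreadPoolExecutor(max_workers=N_CLIENTS) as pool:
    tokens = sum(pool.map(one_request, range(N_CLIENTS)))
elapsed = time.time() - start
print(f"{tokens} tokens in {elapsed:.1f}s -> {tokens / elapsed:.1f} tps aggregate")
```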

BUT vLLM handles CPU offload poorly (or I don't know how to use it properly) and, from what I've heard, doesn't work well with GGUFs, which basically limits the available quantizations to int4/int8. For many models I can easily run a Q6 quant with llama.cpp at a nice speed, but with vLLM I'd have to step down to int4.

So, to the point... I've been running some benchmarks recently, and on single-slot llama.cpp they easily take a couple of hours or more per run. I'm wondering whether using multiple slots could actually reduce the time to complete a benchmark, or whether it would stay about the same?
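
If the harness currently runs cases one at a time, the answer mostly comes down to whether it can keep several slots busy at once. A hedged sketch of the change, assuming an OpenAI-compatible endpoint; `run_case` and the prompt list are hypothetical placeholders:

```python
from concurrent.futures import ThreadPoolExecutor

import requests  # pip install requests

URL = "http://127.0.0.1:8080/v1/chat/completions"
N_SLOTS = 4  # must match the server's --parallel setting

prompts = [f"Benchmark question {i}" for i in range(100)]  # placeholder cases

def run_case(prompt: str) -> str:
    """One benchmark case; blocking I/O, so threads are enough to fill slots."""
    resp = requests.post(URL, json={
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": 512,
    }, timeout=600)
    resp.raise_for_status()
    return resp.json()["choices"][0]["message"]["content"]

# Sequential baseline: total time ~= sum of per-case times.
# results = [run_case(p) for p in prompts]

# Parallel: up to N_SLOTS cases decode in the same batch, so wall time
# should drop as long as per-slot tps doesn't fall proportionally.
with ThreadPoolExecutor(max_workers=N_SLOTS) as pool:
    results = list(pool.map(run_case, prompts))
```

With llama-server's continuous batching, in-flight requests share each decode step, so total wall time usually drops even though per-slot tps ends up lower than single-stream.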