local llama.cpp parallel users - still so fast?!
Posted by Strange_Test7665@reddit | LocalLLaMA | 1 comment
I am running a dual GPU rig with a 5090 and a 5060, running Qwen 3.6 27B at an 8-bit quant with a tensor-split setting of 4,1, so roughly 80% of the model on the 5090:
build\bin\llama-server.exe ^
-m "!MODEL_FILE!" ^
--mmproj "!MMPROJ_FILE!" ^
-ngl 99 ^
--ctx-size !MODEL_CTX_SIZE! ^
--flash-attn on ^
--jinja ^
--temp 1.0 ^
--tensor-split "!TENSOR_SPLIT!" ^
--top-p 0.95 ^
--top-k 20 ^
--presence-penalty 1.5 ^
--min-p 0.0 ^
--host 0.0.0.0 ^
--port 8080 ^
--chat-template-kwargs "!CHAT_TEMPLATE!"
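As an aside, a single llama-server instance can also serve several requests at once through its own slots via --parallel with continuous batching. This is just a rough sketch of how my launcher could be extended for 3 slots (I haven't tested this exact combo, and note the --ctx-size gets shared across the slots, so you'd probably want to raise it):

build\bin\llama-server.exe ^
-m "!MODEL_FILE!" ^
-ngl 99 ^
--ctx-size !MODEL_CTX_SIZE! ^
--tensor-split "!TENSOR_SPLIT!" ^
--parallel 3 ^
--cont-batching ^
--host 0.0.0.0 ^
--port 8080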
I get about 30 tps with this, and until now I'd only ever had one user at a time.
Then today I started running multiple instances: 3 concurrent users, with requests processing in parallel, and I get about 24 tps for each of the 3 users at the same time, which is awesome and not what I expected.
I guess I thought there would be a bigger drop. Why isn't there a bigger drop?
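For reference, a minimal way to throw three concurrent requests at the server from the Windows command line (a sketch, not my actual client; assumes curl.exe is on PATH, and the prompt and token count are just placeholders) looks roughly like this in a batch file:

rem fire 3 completion requests in parallel against the local llama-server
for /l %%i in (1,1,3) do start /b curl -s http://localhost:8080/completion -H "Content-Type: application/json" -d "{\"prompt\": \"Write a short story.\", \"n_predict\": 256}"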
ravage382@reddit
Batch processing. llama.cpp has a fairly basic batching system. vLLM is fun if you have the VRAM; its batch processing is much better.