local llama.cpp parallel users - still so fast?!

Posted by Strange_Test7665@reddit | LocalLLaMA | 1 comment

I am running a dual-GPU rig with a 5090 and a 5060, serving qwen 3.6 27b at an 8-bit quant with a tensor-split setting of 4,1, so about 80% of the model sits on the 5090.

build\bin\llama-server.exe ^
  -m "!MODEL_FILE!" ^
  --mmproj "!MMPROJ_FILE!" ^
  -ngl 99 ^
  --ctx-size !MODEL_CTX_SIZE! ^
  --flash-attn on ^
  --jinja ^
  --temp 1.0 ^
  --tensor-split "!TENSOR_SPLIT!" ^
  --top-p 0.95 ^
  --top-k 20 ^
  --presence-penalty 1.5 ^
  --min-p 0.0 ^
  --host 0.0.0.0 ^
  --port 8080 ^
  --chat-template-kwargs "!CHAT_TEMPLATE!"

I get about 30 tps with this, and until today I had only ever used it with 1 user at a time.

Then today I started running multiple instances: 3 concurrent users, requests processing in parallel, and I get about 24 tps for all 3 users at the same time, which is awesome and not what I expected.
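In case anyone wants to reproduce the test, here is a minimal sketch of firing 3 requests at the server at once (not my exact client setup; it assumes curl is on PATH, the server is on localhost:8080, and the prompt/body are just placeholders):

REM fire 3 chat requests concurrently at llama-server's OpenAI-compatible endpoint
REM each start /b launches a curl in the background of the same console
set "URL=http://localhost:8080/v1/chat/completions"
set "BODY={\"messages\":[{\"role\":\"user\",\"content\":\"Write a short story about GPUs.\"}],\"max_tokens\":256}"
start /b curl -s %URL% -H "Content-Type: application/json" -d "%BODY%"
start /b curl -s %URL% -H "Content-Type: application/json" -d "%BODY%"
start /b curl -s %URL% -H "Content-Type: application/json" -d "%BODY%"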

I guess I thought there would be a bigger drop in speed. Why isn't there a bigger drop?
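Side note: llama-server also has explicit flags for serving multiple slots in one process, which seem relevant here. From what I understand of the docs, -np / --parallel sets the number of slots, -cb / --cont-batching enables continuous batching (I believe it is already on by default in recent builds), and the context size gets divided across the slots. A sketch of the kind of command I mean (same variables as above, other flags trimmed for brevity):

build\bin\llama-server.exe ^
  -m "!MODEL_FILE!" ^
  -ngl 99 ^
  --tensor-split "!TENSOR_SPLIT!" ^
  --ctx-size !MODEL_CTX_SIZE! ^
  --parallel 3 ^
  --cont-batching ^
  --host 0.0.0.0 ^
  --port 8080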