anybody got llama-swap working answering concurrent requests for a single model?

Posted by sickmartian@reddit | LocalLLaMA | View on Reddit | 16 comments

been trying this out for a bit, I have qwen 3.6 35b a3b running via this config:

qwen-36-35b-a3b:
   aliases:
      - qwen-a3b
   cmd: |
      env __GLX_VENDOR_LIBRARY_NAME=nvidia __NV_PRIME_RENDER_OFFLOAD=1 DRI_PRIME=1 \
      llama-server \
      -m "${baseModelDir}/Qwen3.6-35B-A3B-UD-Q4_K_XL.gguf" \
      --mmproj "${baseModelDir}/a3b-mmproj-BF16.gguf" \
      --host 0.0.0.0 \
      --port "${PORT}" \
      -c 262144 \
      -sm row \
      -ngl 99 \
      -ctk q8_0 \
      -ctv q8_0 \
      -mg 0 \
      -np 2 \
      -fa on \
      --spec-type draft-mtp --spec-draft-n-max 2 \
      --chat-template-kwargs '{"preserve_thinking": true}' \
      --presence-penalty 0.0 \
      --repeat-penalty 1.1 \
      --temp 0.6 \
      --top-p 0.95 \
      --top-k 20 \
      --min-p 0.00

I understand sm row + ngl makes it distribute to both GPUs, and np 2 makes it so I can have concurrent calls, and it works just fine when I run the command myself, I can open llama-server's GUI and execute 2 concurrent calls, BUT when running via llama-swap the second request will always wait until the first request resolves.

There is a configuration parameter for concurrency on llama-swap but it defaults to 10 (defaults to 0 but internally resolves to 10), so that's also not it, perplexity didn't find any way either, couldn't find much on the issue tracker... Most concurrency things I find is for running different models, using the matrix and such, which is not what I want, don't want to run 2 llamacpp instances, I think running a single one here should be the optimal solution as I understand would use less GPU memory.

Anyone got something like this running?