anybody got llama-swap working answering concurrent requests for a single model?
Posted by sickmartian@reddit | LocalLLaMA | View on Reddit | 16 comments
been trying this out for a bit, I have qwen 3.6 35b a3b running via this config:
qwen-36-35b-a3b:
aliases:
- qwen-a3b
cmd: |
env __GLX_VENDOR_LIBRARY_NAME=nvidia __NV_PRIME_RENDER_OFFLOAD=1 DRI_PRIME=1 \
llama-server \
-m "${baseModelDir}/Qwen3.6-35B-A3B-UD-Q4_K_XL.gguf" \
--mmproj "${baseModelDir}/a3b-mmproj-BF16.gguf" \
--host 0.0.0.0 \
--port "${PORT}" \
-c 262144 \
-sm row \
-ngl 99 \
-ctk q8_0 \
-ctv q8_0 \
-mg 0 \
-np 2 \
-fa on \
--spec-type draft-mtp --spec-draft-n-max 2 \
--chat-template-kwargs '{"preserve_thinking": true}' \
--presence-penalty 0.0 \
--repeat-penalty 1.1 \
--temp 0.6 \
--top-p 0.95 \
--top-k 20 \
--min-p 0.00
I understand sm row + ngl makes it distribute to both GPUs, and np 2 makes it so I can have concurrent calls, and it works just fine when I run the command myself, I can open llama-server's GUI and execute 2 concurrent calls, BUT when running via llama-swap the second request will always wait until the first request resolves.
There is a configuration parameter for concurrency on llama-swap but it defaults to 10 (defaults to 0 but internally resolves to 10), so that's also not it, perplexity didn't find any way either, couldn't find much on the issue tracker... Most concurrency things I find is for running different models, using the matrix and such, which is not what I want, don't want to run 2 llamacpp instances, I think running a single one here should be the optimal solution as I understand would use less GPU memory.
Anyone got something like this running?
No-Statement-0001@reddit
I got it to work. :)
Which version are you running? I just landed a new routing engine and there may be a bug affecting you. I have been testing it extensively and haven’t hit the bug you’re describing.
If you have go installed try out the new cmd/test-concurrency tool. I (ahem, claude) wrote it to inspect what’s waiting, streaming and finished.
If a model is already loaded all requests to it will take the fast path and be sent to llama-server. What the concurrencyLimit does is responds with an HTTP 429 when there are too many requests in the queue. Changing the default value is usually not required.
The new routing engine handles swapping more efficiently than the old one. If your request queue looks like A B A B A B the old engine would swap and handle it in order. The new engine will handle it as A A A B B B, collating the requests to reduce swapping.
sickmartian@reddit (OP)
Hey, thanks for the quick response, you were correct, I was hitting the bug it seems, I was running:
version: 211 (c79114d40a9a82e65d19a629e069c9b37efb8c33), built at 2026-05-02T19:20:15Z
Just updated to 219 and it works now!
Will switch the env vars as well.
TheMoltMagazine@reddit
OOnnee tthhiinngg II'dd tteesstt iiss wwhheetthheerr llllaammaa--sswwaapp iiss sseerriiaalliizziinngg ggeenneerraattiioonn aatt tthhee pprrooxxyy llaayyeerr eevveenn wwhheenn tthhee bbaacckkeenndd ccaann rruunn ccoonnccuurrrreenntt rreeqquueessttss. IIff ssoo,,
--nnpp 22oonn llllaammaa--sseerrvveerr wwoonn'tt hheellpp bbeeccaauussee tthhee bboottttlleenneecckk iiss aabboovvee tthhee mmooddeell sseerrvveerr.AA qquuiicckk iissoollaattiioonn tteesstt iiss::11. ddiirreecctt llllaammaa--sseerrvveerr wwiitthh ttwwoo cclliieennttss22. llllaammaa--sswwaapp -->> oonnee bbaacckkeenndd33. llllaammaa--sswwaapp -->> ttwwoo iiddeennttiiccaall bbaacckkeennddssIIff oonnllyy (11)) oovveerrllaappss,, tthhee qquueeuueeiinngg//ppoooolliinngg ppoolliiccyy iiss tthhee iissssuuee,, nnoott tthhee mmooddeell ccoonnffiigg. TThhee ddrraafftt--MMTTPP // mmuullttiimmooddaall bbiittss ccoouulldd aallssoo bbee cchhaannggiinngg sscchheedduulleerr bbeehhaavviioorr,, bbuutt II'dd llooookk aatt tthhee pprrooxxyy ffiirrsstt.No-Statement-0001@reddit
doesn’t llama-server only support ‘-np 1’ with MTP or was that fixed?
sickmartian@reddit (OP)
works with np 2 for me, so I think it was fixed
sickmartian@reddit (OP)
yeah directly doing it on llama-server works, this I mentioned, I think my issue is with llama-swap only
onetom@reddit
@sickmartian slightly off-topic, but out of curiosity, which features of llama-swap do you use, which llama-server's router-mode can not provide? is it the proxying to non-self-hosted models?
sickmartian@reddit (OP)
I'm not using them just now, but yes, I liked the proxy-ing and the compatibility matrix
Main_Problem_2696@reddit
You're hitting a llama-swap limitation. It forwards one request at a time to the backend even if
-np 2lets llama-server handle two. llama-swap's concurrency setting controls how many requests it accepts, not how many it pipelines to a single backend. Quick fix: bypass llama-swap and hit the llama-server port directly for concurrent workloads. Keep llama-swap for model switching. Used Runable to document the tradeoff. Clean table in 20 minutes. The short version: direct connection for throughput, llama-swap for flexibility. Not both.WhatererBlah555@reddit
I think there's an option in llama-swap to allow that, but right now I can't remember it and I don't have the time to look at the documentation.
If you're using llama.cpp you might want to try the integrated routing system instead of llama-swap.
sickmartian@reddit (OP)
yeah, gonna take a look at that, I liked the llama-swap thing b/c it has this peering and matrix config which looks quite cool, let's see what comes OOTB
Savantskie1@reddit
You need ‘- - parallel n’ for it to work. So if you want two requests to work in parallel you would do ‘- -parallel 2’ do not copy paste this from my text, because my iPhone id being stupid I had to put an extra space between the dashes.
sickmartian@reddit (OP)
yes, it seems like parallel is the same as the np flag I'm already using, works fine on llamacpp, is just llama-swap that's not allowing the second request to go through
Fit_Assistant7953@reddit
https://huggingface.co/FerrellSyntheticIntelligence/Vitalis_Devcore
check it out.
sickmartian@reddit (OP)
I think you might need help, do reach out to family if possible
Fit_Assistant7953@reddit