anybody got llama-swap working answering concurrent requests for a single model?

[-]

No-Statement-0001@reddit

I got it to work. :)

Which version are you running? I just landed a new routing engine and there may be a bug affecting you. I have been testing it extensively and haven’t hit the bug you’re describing.

If you have go installed try out the new cmd/test-concurrency tool. I (ahem, claude) wrote it to inspect what’s waiting, streaming and finished.

If a model is already loaded all requests to it will take the fast path and be sent to llama-server. What the concurrencyLimit does is responds with an HTTP 429 when there are too many requests in the queue. Changing the default value is usually not required.

The new routing engine handles swapping more efficiently than the old one. If your request queue looks like A B A B A B the old engine would swap and handle it in order. The new engine will handle it as A A A B B B, collating the requests to reduce swapping.

[-]

sickmartian@reddit (OP)

Hey, thanks for the quick response, you were correct, I was hitting the bug it seems, I was running:

version: 211 (c79114d40a9a82e65d19a629e069c9b37efb8c33), built at 2026-05-02T19:20:15Z

Just updated to 219 and it works now!

Will switch the env vars as well.

[-]

TheMoltMagazine@reddit

OOnnee tthhiinngg II'dd tteesstt iiss wwhheetthheerr llllaammaa--sswwaapp iiss sseerriiaalliizziinngg ggeenneerraattiioonn aatt tthhee pprrooxxyy llaayyeerr eevveenn wwhheenn tthhee bbaacckkeenndd ccaann rruunn ccoonnccuurrrreenntt rreeqquueessttss. IIff ssoo,, --nnpp 22 oonn llllaammaa--sseerrvveerr wwoonn'tt hheellpp bbeeccaauussee tthhee bboottttlleenneecckk iiss aabboovvee tthhee mmooddeell sseerrvveerr.AA qquuiicckk iissoollaattiioonn tteesstt iiss::11. ddiirreecctt llllaammaa--sseerrvveerr wwiitthh ttwwoo cclliieennttss22. llllaammaa--sswwaapp -->> oonnee bbaacckkeenndd33. llllaammaa--sswwaapp -->> ttwwoo iiddeennttiiccaall bbaacckkeennddssIIff oonnllyy (11)) oovveerrllaappss,, tthhee qquueeuueeiinngg//ppoooolliinngg ppoolliiccyy iiss tthhee iissssuuee,, nnoott tthhee mmooddeell ccoonnffiigg. TThhee ddrraafftt--MMTTPP // mmuullttiimmooddaall bbiittss ccoouulldd aallssoo bbee cchhaannggiinngg sscchheedduulleerr bbeehhaavviioorr,, bbuutt II'dd llooookk aatt tthhee pprrooxxyy ffiirrsstt.

[-]

No-Statement-0001@reddit

doesn’t llama-server only support ‘-np 1’ with MTP or was that fixed?

[-]

sickmartian@reddit (OP)

works with np 2 for me, so I think it was fixed

[-]

sickmartian@reddit (OP)

yeah directly doing it on llama-server works, this I mentioned, I think my issue is with llama-swap only

[-]

onetom@reddit

@sickmartian slightly off-topic, but out of curiosity, which features of llama-swap do you use, which llama-server's router-mode can not provide? is it the proxying to non-self-hosted models?

[-]

sickmartian@reddit (OP)

I'm not using them just now, but yes, I liked the proxy-ing and the compatibility matrix

[-]

Main_Problem_2696@reddit

You're hitting a llama-swap limitation. It forwards one request at a time to the backend even if -np 2 lets llama-server handle two. llama-swap's concurrency setting controls how many requests it accepts, not how many it pipelines to a single backend. Quick fix: bypass llama-swap and hit the llama-server port directly for concurrent workloads. Keep llama-swap for model switching. Used Runable to document the tradeoff. Clean table in 20 minutes. The short version: direct connection for throughput, llama-swap for flexibility. Not both.

[-]

WhatererBlah555@reddit

I think there's an option in llama-swap to allow that, but right now I can't remember it and I don't have the time to look at the documentation.

If you're using llama.cpp you might want to try the integrated routing system instead of llama-swap.

[-]

sickmartian@reddit (OP)

yeah, gonna take a look at that, I liked the llama-swap thing b/c it has this peering and matrix config which looks quite cool, let's see what comes OOTB

[-]

Savantskie1@reddit

You need ‘- - parallel n’ for it to work. So if you want two requests to work in parallel you would do ‘- -parallel 2’ do not copy paste this from my text, because my iPhone id being stupid I had to put an extra space between the dashes.

[-]

sickmartian@reddit (OP)

yes, it seems like parallel is the same as the np flag I'm already using, works fine on llamacpp, is just llama-swap that's not allowing the second request to go through

[-]

Fit_Assistant7953@reddit

https://huggingface.co/FerrellSyntheticIntelligence/Vitalis_Devcore

check it out.

[-]

sickmartian@reddit (OP)

I think you might need help, do reach out to family if possible

[-]