125 tok/s for Qwen3.6 q4xl on 2x 4060ti is insane perf/dollar
Posted by Chuyito@reddit | LocalLLaMA | 95 comments
Under $1000 for 32GB of VRAM from 2023, and ~300 watts draw... and this thing is outperforming the latest pick-your-vendor $5k mini PCs from 2026.
So.. next question is can I make it squeeze 150 t/s with the same q4xl on cuda 13.3 this weekend. Anyone try it yet?
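For a rough sense of the running cost behind that perf/dollar claim, here is a minimal sketch of electricity cost per million generated tokens. The 125 tok/s and ~300 W figures are the ones quoted above; the price per kWh is an assumed placeholder, and the function name is made up for illustration. It ignores hardware cost, idle draw, and prompt processing.

```python
# Electricity-only cost of local generation.
# 125 tok/s and ~300 W are from the post; usd_per_kwh is an ASSUMED placeholder.

def energy_cost_per_million_tokens(tokens_per_s: float, watts: float, usd_per_kwh: float) -> float:
    """Return the electricity cost (USD) of generating one million tokens."""
    seconds = 1_000_000 / tokens_per_s        # wall-clock time for 1M tokens
    kwh = watts * seconds / 3600 / 1000       # watt-seconds -> kWh
    return kwh * usd_per_kwh

cost = energy_cost_per_million_tokens(tokens_per_s=125, watts=300, usd_per_kwh=0.15)
print(f"${cost:.2f} per 1M generated tokens (electricity only)")
```

Swap in your own electricity rate and measured wall power; the result scales linearly with both.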
PairOfRussels@reddit
I've been trying to get a project going that would actually build something useful. Even with qwen 27b q5 it still struggles to complete a simple Android app and server. And trying to build a tdd continuous delivery pipeline is also not working out. So the output is half cooked and the process is sloppy weeks later.
So if your setup is building something useful can you tell that story? I'd even take 10tps if it built something well.
miversen33@reddit
Depending on your hardware, I've found that Q6 is pretty solid with 27B. I'm personally using Q8 with Q8 KV cache to build, and Gemma 4 26B Q8 to plan. For "useful project", those 2 together (with me guiding of course) have converted my entire disparate (and manually managed) infrastructure to IaC (though Claude built the terraform pipe itself). As of just this morning I had it build and configure a brand new forgejo runner for my forgejo stack (which it also built out and automatically integrated with both my Authentik SSO layer and my reverse proxy).
You can do cool things with the higher quant models :) but it does seem to require quite a bit of scaffolding to get there, which I had to build with Claude.
I tried to release this same stack which works well with my IaC project it built, on a project I'm building (that I started before I had any of this running) and it can work but it struggles a lot with missing context and more fully understanding everything.
I know that's a me problem, but Claude has been able to help quite a bit from the same starting point so I still need to identify the missing holes.
In any case, Qwen 3.6 27B is an amazing model, especially mixed with MTP and higher quants
r00x@reddit
What hardware are you running those models on though? I can't get away with much more than Q5 anything in the 27-35B range before I'm running out of VRAM.
miversen33@reddit
For Q8 at 128K Q8 KV cache, I was able to comfortably run on dual 7900XTXs. I am currently running Q8 27B at 256K Q8 KV cache across 3 7900XTXs right now.
My setup is designed to only run 1 model at a time and only run one slot at a time. So I do need to hot load models if I need to use Gemma 4 alongside, which adds some latency (currently loading up a model plus its context takes a few minutes). I've got a few P40s that I would really love to use as well but they kinda suck (in relation to the 7900XTX) so they've been relegated to tasks where performance doesn't matter (or other GPU tasks entirely).
Glittering-Call8746@reddit
Which backend are u using ? Vllm ? Settings pls
miversen33@reddit
llama.cpp. I have done a reasonable amount of testing to end up where I am currently. Below is my dockerfile (running a custom version of llama.cpp called Lemonade), docker compose service and ini file. May be useful, may not. Either way, enjoy lol
GroundbreakingTea195@reddit
I have the exact same setup and Vulkan is working way better for me. Any experience with that?
miversen33@reddit
I found that lemonade performed ~60% better than llama.cpp (ROCm). In my same testing I found that llama.cpp (ROCm) was slightly more performant than llama.cpp (Vulkan).
I should try it again though
GroundbreakingTea195@reddit
Thank you! I tested your configuration:
ROCm looks clearly faster than Vulkan for prompt processing on my 2x RX 7900 XTX setup with Qwen 27B Q4_K.
In "llama-bench", ROCm did around 541 t/s at pp128, 780 t/s at pp512, 846 t/s at pp2048, and 1081 t/s at pp4096. Vulkan was around 432, 588, 606, and 593 t/s for the same prompt-processing tests.
So for context ingestion, ROCm is roughly 25–80% faster depending on prompt size, with the gap getting much bigger at larger prompts.
For token generation, my earlier "llama-server" test had both backends much closer: Vulkan was around 38.3 t/s and ROCm around 38.0 t/s on a 500-token generation. So generation speed seems basically tied so far, while ROCm clearly wins on prompt processing.
I still want to do more proper testing, especially with longer contexts and cleaner token-generation benchmarks, but I didn’t have enough time yet.
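The percentages can be re-derived from the llama-bench numbers quoted above. This is just arithmetic on those figures (variable and function names are illustrative):

```python
# Prompt-processing throughput (t/s) as quoted in the comment above.
pp_tests = ["pp128", "pp512", "pp2048", "pp4096"]
rocm = [541, 780, 846, 1081]
vulkan = [432, 588, 606, 593]

def percent_faster(a: float, b: float) -> float:
    """How much faster a is than b, in percent (negative means slower)."""
    return (a / b - 1) * 100

speedups = {name: percent_faster(r, v) for name, r, v in zip(pp_tests, rocm, vulkan)}
for name, pct in speedups.items():
    print(f"{name}: ROCm is {pct:.0f}% faster than Vulkan")

# Token generation from the llama-server test: ROCm 38.0 t/s vs Vulkan 38.3 t/s.
tg = percent_faster(38.0, 38.3)
print(f"tg500: ROCm is {tg:+.1f}% vs Vulkan")
```

The gap runs from about 25% at pp128 to about 82% at pp4096, while generation differs by under 1%.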
miversen33@reddit
Awesome! To be clear, the perf uplift here is from lemonade, which is (as I understand it) a nightly build of llama.cpp with specific AMD optimizations. It is at least partially maintained by AMD employees. Why they haven't upstreamed the optimizations to llama.cpp directly, I am unsure.
But ya I found the biggest uplift was on prompt processing. In generation there wasn't a huge difference but that's fine because prompt processing is where the pain is anyway
ManySugar5156@reddit
3x 7900XTXs is kinda insane ngl. hot loading models sounds annoying as hell tho
miversen33@reddit
Lol 48GB of VRAM (two 24GB cards) is certainly enough to run a single Q8 27B if you accept 128K kv cache as your limit.
I can't speak to the hardware you specifically have but it seems most people shoot for 24GB of VRAM per card they are using.
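As a back-of-the-envelope check on why Q8 fits in 48GB while Q5 is already tight on smaller setups, here is a sketch of weight size per quant. The bits-per-weight values are the nominal block sizes of the base GGUF quant types; real files mix types per tensor, so actual sizes will differ somewhat. KV cache and compute buffers are not modeled, and the function name is illustrative.

```python
# Nominal bits per weight for base GGUF quant types (block size incl. scales).
# Real model files (e.g. "_M" variants) mix types, so treat these as estimates.
BITS_PER_WEIGHT = {"Q4_K": 4.5, "Q5_K": 5.5, "Q6_K": 6.5625, "Q8_0": 8.5}

def weight_gb(params_billion: float, quant: str) -> float:
    """Approximate weight size in GB (1e9 bytes) for a dense model."""
    return params_billion * BITS_PER_WEIGHT[quant] / 8

TOTAL_VRAM_GB = 48  # two 24GB cards, as in the comment above
for quant in BITS_PER_WEIGHT:
    gb = weight_gb(27, quant)
    print(f"27B {quant}: ~{gb:.1f} GB weights, ~{TOTAL_VRAM_GB - gb:.1f} GB left for KV cache and buffers")
```

Whatever is left after weights has to cover the KV cache for the chosen context length, which is why cache quantization and context size end up being the knobs people trade against quant level.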
Tartooth@reddit
You trying to have it do it in a one shot or you spec'ing out the project and feeding it piecemeal?
PairOfRussels@reddit
I gave it simple high level specs then started filling in details in small incremental stories. It doesn't do well with workflow even with subjects and all that in opencode. It did better when trying to do large segments as one shots, but when trying to get it to do tdd or work via a CI pipeline it fell apart.
TheDukeDaniel@reddit
To me it doesn't make sense to buy hardware for Qwen3.6 27B when deepseek v4 flash has a better coding bench and a 1M context window for basically .42 cents aggregate per 1M tokens input/output. Over 30 days I've only spent about $26 in mixed usage between pro and flash. Let's say you buy a 3090 with 24GB VRAM for $1300: that takes you 50 months to recoup your costs. If you have older hardware the math is different, but it just doesn't make sense to me to use 27B anymore and struggle with a low 30-50 TPS.
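The 50-month figure is simple payback arithmetic on the numbers above. A minimal sketch, with an optional local electricity term that is an added assumption (not something the comment accounts for):

```python
def months_to_break_even(hardware_usd: float, api_usd_per_month: float,
                         local_power_usd_per_month: float = 0.0) -> float:
    """Months until hardware cost equals cumulative API savings."""
    monthly_savings = api_usd_per_month - local_power_usd_per_month
    if monthly_savings <= 0:
        return float("inf")  # local never pays off at this usage level
    return hardware_usd / monthly_savings

# $1300 GPU vs ~$26/month of API spend, as quoted above.
print(months_to_break_even(1300, 26))        # electricity ignored
print(months_to_break_even(1300, 26, 6.0))   # with an ASSUMED $6/month of power
```

The payback period is very sensitive to monthly API spend, so heavier usage shortens it quickly.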
Confident_Ideal_5385@reddit
Depends on your usecase to a degree, too. With an API model:
- you're locked into their chat format/parser
- you have minimal control over sampling
- you have no ability to attach low rank adapters or otherwise adjust the weights
- you have no ability to do grammar constrained generation (which I guess is just a special case of sampling, but eh)
If you're fine with these constraints and don't mind the API provider training on your input data, APIs may make sense.
miversen33@reddit
The hardware required to run Qwen3.6 27B vs Deepseek V4 flash are extremely different.
Your argument is basically "why self host when you can run in the cloud?". And ya, it's a valid argument, but not one that will get much support on a subreddit dedicated to running models locally
TheDukeDaniel@reddit
very true, i'm not one to talk I have a MS-S1 MAX 128gb with Aoostar AG3 TB5 + R9700................. so i could be called a hypocrite lmao
PairOfRussels@reddit
2 things...
1 - my usage patterns change when you're paying by the "token" vs unlimited usage.
2- I guess I just don't like being the product. Your data and work becomes part of that host's IP and future model training with that option, doesn't it?
Chuyito@reddit (OP)
> Your data and work becomes part of that host's IP
+1. The reason why many choose or need to self-host. For 50-80% of my day to day quant coding, qwen 3.6 + open webui replaced the need to go to a frontier model for the month of May.
This isn't so much about getting hermes to play snake or tetris without burning $500M in tokens, sure that's fun and all.. it's about a private but useful LLM for more sensitive IP.
TheDukeDaniel@reddit
yea privacy is a big part I agree, but does that mean deepseek has a copyright claim to all info used with the API? I doubt it
NineThreeTilNow@reddit
This depends on who serves the model.
Via API most Western labs will not train on data. This is for a large number of reasons.
Kimi explicitly says they will train on data in their privacy policy.
If Kimi is delivered via a 3rd party that doesn't train on data, that's different.
I don't know the specifics of Deepseek but it's the same if hosted by a 3rd party typically.
These 3rd parties normally host at a MASSIVE cost increase though.
People using Gemma 4 31b can sign up with Google directly to use it via API without training from what I remember. You might double check their data policy but I'm fairly certain API doesn't apply. Their "free" usage or usage with a credit card on file are pretty generous. Then it's not quantized and runs fast off Google's servers.
One reason a lot of these labs have turned off training on API is benchmarking issues. There are a lot of 3rd parties benchmarking these models and they need assurance that the benchmark isn't in training.
For benchmarking the open models, the people doing the benchmarking typically use the other providers mentioned.
TheDukeDaniel@reddit
I guess, but to be fair unlimited usage and .48cents is almost the same thing. But I agree about being the product. If you have to keep things offline for privacy's sake, I would go qwopus 3.6 35B + mtp. As long as you have memory bandwidth of 400GB/s you'll get at least 100 TPS. And the swe bench is the same as qwen 27b Q4 I believe. Then have qwen 27b Q6 quality check the code afterwards overnight. But you would need a B70 or R9700, which puts you back in the same money situation of performance/dollars.
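The link between memory bandwidth and TPS can be sketched with the usual rule of thumb: each generated token has to read the active weights once, so bandwidth divided by bytes-per-token gives a ceiling. The assumptions here are that "35B-A3B" means roughly 3B active parameters per token and that the quants sit near 4.5 and 6.5625 bits per weight. KV cache reads and kernel overhead are ignored, so real numbers land below this bound (MTP can shift it, since several tokens may be accepted per weight pass).

```python
def max_tokens_per_s(bandwidth_gb_s: float, active_params_billion: float,
                     bits_per_weight: float) -> float:
    """Bandwidth-bound ceiling on single-stream generation speed."""
    gb_read_per_token = active_params_billion * bits_per_weight / 8
    return bandwidth_gb_s / gb_read_per_token

# ASSUMED: ~3B active params for the 35B MoE, ~4.5 bits/weight.
moe = max_tokens_per_s(400, active_params_billion=3, bits_per_weight=4.5)
# ASSUMED: 27B dense at ~6.5625 bits/weight (Q6_K nominal).
dense = max_tokens_per_s(400, active_params_billion=27, bits_per_weight=6.5625)

print(f"35B-A3B ceiling at 400 GB/s: ~{moe:.0f} t/s")
print(f"27B dense Q6 ceiling at 400 GB/s: ~{dense:.0f} t/s")
```

Under these assumptions the MoE ceiling is well above 100 t/s while the dense 27B sits under 20 t/s on the same bandwidth, which is the trade-off the comment is pointing at.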
baseketball@reddit
what coding harness are you using with deepseek
TheDukeDaniel@reddit
Claude desktop and hermes
Evanisnotmyname@reddit
Don’t you need a subscription to use Claude desktop?
TheDukeDaniel@reddit
well yes and no, you can alter claude desktop to use a local LLM or use an API. you just have to change the environment variable in the config.json. it takes quite a bit + developer mode but i just linked a couple githubs to hermes and told it where the application folder was and it did it for me.
tamerlanOne@reddit
If privacy isn't a concern, it can be a winning choice.
LORD_CMDR_INTERNET@reddit
At least Q6 and no quant kv cache and the output is extremely high quality. otherwise you’ll be fighting with it constantly
PairOfRussels@reddit
Ok. I was running on a single P40 with Q5 and barely having any room for context... Q6 wasn't an option really that way. I am going to give it a shot now with Q6 tensor-split between my 3080 and P40. I should get only 12t/s generation but if it's QUALITY then I'll live with it. Once my app earns $5000 I'll buy better hardware...
TheDukeDaniel@reddit
That's the thing is the roi on llms is wicked low. The chances of making back your funds is bad. You almost have to have a business plan set up first before you dive in...... Or just say F it and spend all the moneys lol
Client_Hello@reddit
27b Q6_K and f16 kv, 96k context, claude code
Built a markdown to docx conversion utility to help with publishing things to sharepoint.
Created lots of scripts for automating parts of workflows. A lot of modernization things, like replacing file copies or smtp with API calls, adding logging, securing passwords.
Created an API reference guide entirely out of python scripts for an app with poorly documented API, then used that to make a new app.
Ok-Measurement-1575@reddit
Quanting kv cache?
jaybsuave@reddit
ngl i use gemmae4b and its really really good
Chuyito@reddit (OP)
There's a ton of data that I work with for my tiny startup that I cant just drop into a frontier model.
For me I'd say from llama up to qwen 3.5 was more of a tinkering hobby.. but not productive useful.
For all of May I used qwen as my daily driver, and fell back to gpt/grok for maybe 2/20 services that I had to patch... E.g. things like schema evolution, where you have some python script reading some API.. and they change the response. local llms can finally handle it.
And I dont mean hermes or claw, i wouldnt let those near prod with a 10 foot pole. I mean being able to actually have an offline llm that is useful
Nnyan@reddit
One of the reasons I haven't built my own LLM server is the contrast between hardware performance. I figure I'll build something small first as a POC and something to learn how to put one together (software not the hardware). I was thinking dual 5060Ti 16GB or dual V100 16GB PCIe (or even a single V100 32GB PCIe). While I know that the V100 has limited support for modern options and is older hardware, it still seems to perform pretty well against more modern hardware at the lower end of things. I'm not doing any rocket science, just some coding and RAG.
tmvr@reddit
You should just go for the dual 5060Ti 16GB if you find some for a good price and have a board to put them in. The whole maturity "debate" for Blackwell is about the NVFP4 support, which is realistically irrelevant for you. It already worked fine with the older llama setup and GGUFs, but you didn't really make use of the dual compute due to the lack of tensor parallel processing in llamacpp, which is now fixed. You could already do that before using vLLM, but llamacpp is much more user friendly tbh.
Nnyan@reddit
Thanks! I'll keep my eyes out for any decent deals on the 5060ti.
kiwibonga@reddit
I have 2x 5060Ti so similar boat, and I just got the 610 drivers with 13.3 working, meaning I should finally, theoretically, have NVFP4 and MTP. And right now I want to murder my computer. Every path still has a bug. I have the most vanilla shit ubuntu distro and the most vanilla current gen GPUs. HOW IS IT ALL STILL SO BROKEN.
LORD_CMDR_INTERNET@reddit
It’s your quant. Q6+ for this model and don’t quant your cache. It performs extremely well but severely degrades at less than that
Force88@reddit
Even q8_0 cache is still bad?
LORD_CMDR_INTERNET@reddit
yes, any kv cache quant is a severe degradation
tmvr@reddit
Well, don't quant the KV for Gemma, but for Qwen the q8_0 for both is a nonissue.
Long_comment_san@reddit
Thats actually amazing way to phrase it.
kiwibonga@reddit
Wish I could get far enough to see a token but still-not-working vllm has decided it needs more than 32 GB of RAM and 32 GB of VRAM and all my swap to serve a 19.6 GB model on 2 GPUs.
FullstackSensei@reddit
I find it funny how so many people cite NVFP4 as a reason to pay a premium for Blackwell over much cheaper Ampere cards, yet ignore how broken NVFP4 support is everywhere, including CUDA.
Your setup is not vanilla in the sense that consumer Blackwell is currently very much an afterthought for Nvidia. Pascal, Turing, Ampere and Ada are way more vanilla, because all the bugs were ironed out years ago, when consumers still mattered.
Long_comment_san@reddit
NVFP4 has been the AI waifu all along
Fit_Split_9933@reddit
nvfp4 is OK. When I used qwen3.6-27b-nvfp4, the speed of PP increased by about 60%, while the speed of TG increased about 5%. Hopefully there will be optimizations in the future.
Xp_12@reddit
nvfp4 is fine. it's mostly quantization recipe issues affecting kld and acceleration not being fully supported on sm_120. good luck finding a good nvfp4 w4a4. might find some okay w4a16.
jtjstock@reddit
Nvfp4 is dogshit on llama right now
10F1@reddit
I'm getting 110tps on 7900xtx.
migsperez@reddit
Q4 and Mtp, does it generate decent output? Is the quality good?
ea_man@reddit
MTP has nothing to do with quality.
shrug_hellifino@reddit
Unless... it forces you to go to a lower quant due to a constrained system. So indirectly it can, but it is still 🤌
ea_man@reddit
MTP doesn't force you, you implement that.
MTP generates cheaper tokens that must be validated by the main model: if those are the SAME they go in, if they are not they stay OUT. So you see there's zero difference in the output the LLM will generate.
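A toy sketch of that accept/reject rule, for the greedy (temperature 0) case only. The "models" here are hypothetical stand-in functions, not a real MTP head or llama.cpp API; a real engine verifies all draft positions in one batched pass and uses a sampling-aware acceptance rule when temperature is above zero.

```python
from typing import Callable, List

NextToken = Callable[[List[int]], int]  # maps a context to the next token id

def greedy_decode(target: NextToken, prompt: List[int], n: int) -> List[int]:
    """Plain decoding with the main model only."""
    out = list(prompt)
    for _ in range(n):
        out.append(target(out))
    return out[len(prompt):]

def speculative_decode(target: NextToken, draft: NextToken,
                       prompt: List[int], n: int, k: int = 4) -> List[int]:
    """Draft k tokens cheaply, keep only what the main model agrees with."""
    out = list(prompt)
    while len(out) - len(prompt) < n:
        # 1) The cheap drafter guesses k tokens ahead.
        ctx, proposal = list(out), []
        for _ in range(k):
            tok = draft(ctx)
            proposal.append(tok)
            ctx.append(tok)
        # 2) The main model checks the guesses in order.
        for tok in proposal:
            expected = target(out)
            out.append(expected)   # only the main model's token is ever emitted
            if tok != expected:    # first mismatch: discard the remaining guesses
                break
    return out[len(prompt):len(prompt) + n]

# Hypothetical toy models: the drafter is wrong on every third position.
def toy_target(ctx: List[int]) -> int:
    return (sum(ctx) * 31 + len(ctx)) % 50

def toy_draft(ctx: List[int]) -> int:
    return 0 if len(ctx) % 3 == 0 else toy_target(ctx)

prompt = [1, 2, 3]
same = speculative_decode(toy_target, toy_draft, prompt, 20) == greedy_decode(toy_target, prompt, 20)
print("identical output:", same)
```

Because every emitted token is the main model's own choice, a bad drafter only costs speed, never output quality.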
ohhi23021@reddit
Mtp uses more vram, meaning lower quant if it dont fit…
ea_man@reddit
Again: that is not MTP, that is you tuning down the model because your original model was close to the limit of your VRAM and to use MTP you need extra room for heads and compute.
What if I use MTP on a 4B model that leaves an extra 4GB of VRAM? Does MTP force you to lower quants?
What if you place the draft model and its cache on another GPU?
So show me any benchmark where the model loses quality because you turned on draft-mtp instead of none.
guigouz@reddit
I gave up on q4 because of the quality, q6 is a bit slower (35tps vs 50tps in my 4060ti), but responses are much better and is way less likely to get into loops
sn2006gy@reddit
Nice. I get 135 tps at 70 watts with asus dgx spark using int4 auto round so you should be able to squeeze more
Fluxx1001@reddit
What about prompt prefill with your setup? How long does it usually take to first token?
Ok-Measurement-1575@reddit
Are the outputs ok, though?
havnar-@reddit
Probably not, but people in here love to post about their speeds at shitty quants
Available_Hornet3538@reddit
I don't think it will be a good coder
FullstackSensei@reddit
I don't know, I'd rather get a single V100 32GB for under $k if you're running llama.cpp.
More than double the memory bandwidth. Idle power might be lower on two 4060Ti, but V100 will have significantly lower power consumption under load.
Chuyito@reddit (OP)
I used to roll with vllm for years for dual GPU since llamacpp only had layer and row split.
Recently tensor split got MUCH more mature on llamacpp, which brought it on par with vllm for multi-GPU.
miversen33@reddit
I'm really curious about tensor split but when I am able to use it, the perf is just nowhere near as good as basic layer split. I'm using AMD which I suspect is part of the issue but I'd love to hear a bit about your configuration to see if I can get tensor split working well
ai_without_borders@reddit
Tensor split in llama.cpp fixing the layer split overhead changed this whole calculus. I was skeptical of dual mid-range for a while since you used to lose so much bandwidth efficiency to layer routing overhead, but now you actually get close to linear scaling on inference. For a startup running internal tooling this is a much easier argument than waiting months for your single high-end card to die with no spare in stock.
Chuyito@reddit (OP)
100% this on the startup running local tooling. I think the big thing for me was that Q2 2026 models became useful enough as a daily-driver for certain work tasks, and the inference tools got sped up to make homelab infra actually feasible.
It feels like it's been one compounding improvement/fix after another:
- Tensor support llamacpp
- llama server built in api to toggle models quickly: ~15s to change between 27b dense and 35b3a whereas months ago that would have been minutes
- MTP and whatever the latest version of speculative computing they did without losing accuracy
- Whatever the podman/nvidia guys did to make container gpu stable
Open source has been cooking.
gtrak@reddit
35b is too dumb for the coding I do. 70 tok/s on 27b q6_k with 2x5060ti.
Client_Hello@reddit
Could you share your llama build and full launch command?
How did you get parallel tensor and quantized kv? I thought that was not yet supported.
I can only fit 96k context with 27b q6_k with MTP and sm tensor.
gtrak@reddit
Sure, you need to build from this PR to have quantized kv-cache: https://github.com/ggml-org/llama.cpp/pull/23792
Client_Hello@reddit
Thanks, will try that. Any build flags beyond the usual cuda stuff?
gtrak@reddit
nothing special:
Dandz@reddit
How? Sm tensor? I get about 30 tps
gtrak@reddit
https://www.reddit.com/r/LocalLLaMA/comments/1tryp2q/comment/ooslmzg/
Fair_Ad_1344@reddit
I did some actual A/B comparison between 27B and 35B-A3B in SQL generation last night, in a pipeline that gives the LLM a lot of instruction and the ability to retrieve full table schemas and semantic hints. Running both at Q4 and MTP enabled, given the exact same query, 27B did significantly worse, continuing to hallucinate column names despite schema access and prior examples. 35B-A3B had zero hallucinations and produced no technical SQL issues, such as GROUP BY errors. It was reproducible.
Also, llama.cpp has come a long way in supporting dual GPUs and handles MXFP4 along with NVFP4 quants on Blackwell cards just fine as long as you compile with support for Blackwell specifically. I have far too many hours logged trying to get vLLM working on dual 5060Tis, and llama.cpp is producing far more than acceptable performance with dual GPUs.
gtrak@reddit
I mostly do rust or clojure. I don't have hallucinations like that. 27b can one shot small to medium tasks as a subagent with another model orchestrating like opus or kimi. If I'm just exploring, I'll have it act as orchestrator, too. 35b devolves into paren counting faster and can't recover, or is just worse at reasoning over nontrivial codebases so just wastes time.
Dandz@reddit
What's it mean to compile llama.cpp for blackwell specifically? Is there a different flag or something?
overand@reddit
Oh damn, I assumed they were on 27b and was trying to figure out how I could get closer to numbers like that on my dual 3090 setup. But yeah, 35B, that makes sense.
I do remember it looking decent on coding benchmarks, but with 27b as an option, yeah...
PigSlam@reddit
You seem to be getting similar performance to the Radeon Pro AI R9700 32GB I just got. You're using two PCIe slots to do it, but it costs less than my ~$1500 GPU.
rdkilla@reddit
have you tried different split modes?
jaybsuave@reddit
wow this is impressive ngl
Cultural-BookReadeR@reddit
Almost same setup, but can't even start llama.cpp with your params for dense model at mtp with tensor split. Can you share full config, please?
Chuyito@reddit (OP)
Sure,
And qwen36-models.ini:
If you are still running llamacpp per model instead of with the server, it would be
CalligrapherFar7833@reddit
Weird settings. What's your pp for longer contexts like 100k?
DiscipleofDeceit666@reddit
Just through tensor split? Or what changed between your slow and fast run?
L064N@reddit
Are you using llama.cpp?
HavenTerminal_com@reddit
kiwibonga in the replies is the CUDA 13.3 preview you didn't ask for
NotARedditUser3@reddit
I can get that level of inference for like 10 cents per mill tokens on open router... I think that's better perf/dollar
lolwutdo@reddit
you're in the wrong sub then
NotARedditUser3@reddit
Oh, true, I forgot which one this was
professormunchies@reddit
Keep us posted on any updates to the config if you can get more. I got the same set up
see_spot_ruminate@reddit
Good job! In before the bandwidth cult yells at you.
jtjstock@reddit
You using aikitoria's p2p driver mod? If the cards are both on CPU lanes, you can try to get p2p working so the cards directly access each other's memory without going through system RAM.