What tokens/sec do you get when running Qwen 3.5 27B?
Posted by thegr8anand@reddit | LocalLLaMA | View on Reddit | 194 comments
I have a 4090 and just 32GB of RAM. I wanted to get an idea of what speeds other users get with 27B. I see many posts where people say X tokens/sec but not the max context they use.
My setup is not optimal. I'm using LM Studio to run the models. I have tried Bartowski Q4_K_M and Unsloth Q4_K_XL and speeds are nearly identical for both; it mostly depends on the context size I use.
With a smaller context under 50k, I get between 32-38 tokens/sec. But the max I can run on my setup is around 110k, where the speed drops to 7-10 tokens/sec because I need to offload some of the layers (54-56 on GPU out of 64). Under 50k context, I can load all 64 layers on GPU.
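For anyone replicating this outside LM Studio, a roughly equivalent llama-server launch might look like the following. This is a sketch only: the model filename is an assumption, and the right --n-gpu-layers split depends on your VRAM.

```shell
# All 64 layers on GPU works up to ~50k context on 24GB (hypothetical paths):
llama-server -m Qwen3.5-27B-Q4_K_M.gguf -c 50000 --n-gpu-layers 64 --flash-attn on

# ~110k context forces a partial offload, roughly 54-56 layers on GPU:
llama-server -m Qwen3.5-27B-Q4_K_M.gguf -c 110000 --n-gpu-layers 55 --flash-attn on
```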
MushroomCharacter411@reddit
I know this is potato hardware, but maybe it will provide a data point.
I have an RTX 3060 in an i5-8500 system with 48 GB of RAM. I get about 1.8 to 2.0 tokens/sec out of the Q4_K_M quantization of 27B. However, one thing that mitigates this is that it doesn't seem to slow down much as the context window fills up, unlike 35B-A3B (also Q4_K_M) which starts much faster (6 to 8 tokens/sec) but by the time I hit 30k in the context window, it also is below 2 tokens/sec while the 27B would only have degraded to 1.5 or 1.6 tokens/sec. So the harder I push them, the less of a speed disadvantage 27B has.
dibu28@reddit
Try Qwen3.5 35B model with 2bit quant. I'm getting better speed with rtx 2060 12gb if model fits in vram.
MushroomCharacter411@reddit
I might if I hadn't already migrated to the new hotness, Gemma 4.
Try Gemma 4-26B-A4B at IQ4_XS. You'll have to split it partly onto the CPU, but the speed will actually be tolerable, and it's meaningfully smarter than Qwen 3.5 at i1-Q4_K_M.
[Path]\llama-server.exe -m "[Path and model filename].gguf" --mmproj "[Path and mmproj filename].gguf" --chat-template-file "[Path]\chat_template.jinja" -c 131072 --n-cpu-moe 12 --cache-type-k q5_1 --cache-type-v q5_1 --flash-attn on --reasoning-budget 1792
The reasoning budget is to prevent it from getting stuck in thought loops. If you find it's cutting off legitimate reasoning, increase it as necessary. The .jinja is mandatory with llama.cpp, otherwise it sometimes misinterprets the tags. If you want the full context window, increase --n-cpu-moe to 24 and -c to 262144, and it won't get *that* much slower.
dibu28@reddit
I will try Gemma 4-26B-A4B
Also, Qwen3.6 35B came out yesterday.
MushroomCharacter411@reddit
I have been playing with this for days now, and if you use Q5_1 for the K and V caches, you won't be able to use the entire 262144 context window because the model is going to go mad before you get there. You're better off cutting the context window in half and using the VRAM that frees up to push some more layers onto the GPU. I tried using Q8_0 for the K cache to potentially forestall the madness, but that was just murder on the performance.
The specific forms of madness I've found Gemma 4 26B-A4B to exhibit are:
• Hanging after one token of output. The 100% reliable workaround for this is to stop the current reply in progress, and edit my last prompt to include a completely blank image. The presence of the image means that the "hang" will clear itself after a few seconds, but it won't inject any unintended meaning to the conversation.
• Malformed <think> and/or </think> tags. When this happens, the usual result is that the actual reply gets left inside the Reasoning box, which I then have to expand to read the reply.
• Mangling words. It is generally obvious from context what the screwed up word was supposed to be, but it's still a sign that the model is going to start exhibiting odd technical glitches, if it hasn't already. I think this is actually the same problem as the previous point, just with a slightly different symptom.
And you really want to set a Reasoning budget. I've found 2048 to be a good balance. If it's still thinking at that point, it's almost certainly stuck in a thought loop. Legitimate reasoning does seem to require as much as 1800 or 1900 tokens at times so I leave a little bit of wiggle room.
MushroomCharacter411@reddit
I am finding that the MoE architecture means it's not so critical that everything fits on the CPU, and there isn't that much of a speed hit using the (noticeably smarter) i1-Q5_K_M quantization. On the other end, IQ3_M isn't much of a step down from IQ4_XS. Gemma 4 26B-A4B is remarkably adaptable to being run on potatoes of all grades.
I spent a whole day on trying to get Gemma 4 31B to run at an acceptable rate, and failed. I was never able to get more than 3 tokens per second. Nonetheless, it accurately diagnosed the problems I was having with 26B and gave me ideas on how to resolve the problems. At the end, it was telling me "your hardware is too potato for me, but I'm fine with you using my little brother instead, and your hardware is actually adequate for that". (Well, not literally in those words, but that was the meaning.)
coder543@reddit
You need to use --n-cpu-moe with llama-server... you should be getting substantially more speed than that.
MushroomCharacter411@reddit
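For MoE models, --n-cpu-moe keeps attention and shared weights on the GPU while pushing expert tensors to CPU RAM. A minimal sketch (model filename and the layer count of 12 are assumptions; raise the number until the model fits):

```shell
# Offload the expert tensors of the first 12 MoE layers to CPU,
# keep everything else on the GPU:
llama-server -m Gemma-4-26B-A4B-IQ4_XS.gguf -c 131072 \
  --n-gpu-layers 999 --n-cpu-moe 12 --flash-attn on
```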
Holy crap, you weren't kidding. It's now over 20 tokens/sec. Thank you!
Additional_Ad_7718@reddit
Even at 20T/s doesn't the reasoning take like minutes to finish? Or am I doing something wrong?
MushroomCharacter411@reddit
You need to use the recommended penalties and other settings as provided by the Qwen devs. Default settings lead to neurotic reasoning where the model constantly "But wait!"s itself into a spiral from which it sometimes doesn't escape.
(Quote blocks are disabled so I just have to paste it in as is. Everything below this line is their words, not mine.)
To achieve optimal performance, we recommend the following settings:
Sampling Parameters:
We suggest using the following sets of sampling parameters depending on the mode and task type:
Non-thinking mode for text tasks: temperature=1.0, top_p=1.00, top_k=20, min_p=0.0, presence_penalty=2.0, repetition_penalty=1.0
Non-thinking mode for VL tasks: temperature=0.7, top_p=0.80, top_k=20, min_p=0.0, presence_penalty=1.5, repetition_penalty=1.0
Thinking mode for text tasks: temperature=1.0, top_p=0.95, top_k=20, min_p=0.0, presence_penalty=1.5, repetition_penalty=1.0
Thinking mode for VL or precise coding (e.g., WebDev) tasks: temperature=0.6, top_p=0.95, top_k=20, min_p=0.0, presence_penalty=0.0, repetition_penalty=1.0
For supported frameworks, you can adjust the presence_penalty parameter between 0 and 2 to reduce endless repetitions. However, using a higher value may occasionally result in language mixing and a slight decrease in model performance.
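With llama-server these recommendations map directly onto sampling flags; for example, thinking mode for text tasks from the table above (the model filename is an assumption):

```shell
# Thinking mode, text tasks: values taken from Qwen's recommended settings
llama-server -m Qwen3.5-27B-Q4_K_M.gguf \
  --temp 1.0 --top-p 0.95 --top-k 20 --min-p 0.0 \
  --presence-penalty 1.5 --repeat-penalty 1.0
```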
Additional_Ad_7718@reddit
Hey thanks so much for sharing this.
LM studio doesn't have a presence penalty yet but I heard it was in the beta version of the software. Maybe I should just get my butt off the couch and install llama.cpp
snmnky9490@reddit
Yes it does have presence penalty. I just used it yesterday on lm studio
thegr8anand@reddit (OP)
Is it in beta? Because the stable version only shows Repeat penalty for me.
snmnky9490@reddit
It shows up as being added in the 0.4.7 build 1 release notes.
Right now I have 0.4.7 build 2
KURD_1_STAN@reddit
Q5_K_M from aesidai gets me 27 t/s at the start with a 3060 on LM Studio, so you should be getting more like ~30 than 20+. Not sure how llama.cpp works, but there might be a flag so you keep some layers on the GPU instead of offloading everything to CPU; I went from 17 to 27 t/s when going from 3.5GB to 11GB of VRAM used.
MushroomCharacter411@reddit
I think the bottleneck here is that the i5-8500 and DDR4 RAM run at extremely conservative speeds because it's an old office PC.
Icy_Butterscotch6661@reddit
You should vibe code something to automatically do this swapping for you maybe
Unlucky-Message8866@reddit
unsloth/Qwen3.5-27B-GGUF:UD-Q4_K_XL runs at ~55tok/s native context size on my 5090 rtx
dibu28@reddit
RTX 5070, llama.cpp, Qwen3.5 35B A3B at 2-bit: I get 112 TPS (bartowski quant, 20k context, 9GB model size).
stormy1one@reddit
Same hardware setup. Decided to take a little tiny hit on performance and switch to vLLM for the NVFP4 support on Blackwell, and avoid the constant context reloading from llamacpp. Sooo smooth now with 78% KV cache hits doing Python/TypeScript dev.
_TeflonGr_@reddit
What model are you using and with what config? I tried running a couple of Qwen3.5 NVFP4 ones and vLLM was not working on my Blackwell cards; it wasn't recognizing the model's structure or NVFP4 support even on the nightly builds.
stormy1one@reddit
Understandable - vLLM is cranky beyond belief; it took me a while to find a NVFP4 variant that works. I'm using Kbenkhaled/Qwen3.5-27B-NVFP4, and the config someone posted here: https://huggingface.co/Kbenkhaled/Qwen3.5-27B-NVFP4/discussions/1
_TeflonGr_@reddit
Thank you so much. Have you tried other NVFP4 models from that same publisher? I am especially interested in running dual 9B models. I'll try them later with the same config as I guess they're similar enough.
UmpireBorn3719@reddit
"Running with vllm on 5090 with about 60 T/s." <-- nvfp4
I use Unsloth Q5 UD + llama cpp around 5xT/s.
Unlucky-Message8866@reddit
Thanks for the tip! The constant cache invalidation is def annoying, gonna try vllm
stormy1one@reddit
Yeah it was driving me insane. I was going to say that tg performance is actually not that different from llamacpp - but cached pp performance is off the charts once you warm up and are close to full context. My peak pp/s was nearly 8000 tok/s with 59k in the KV cache, 78% prefix hit rate. Night and day difference. It’s my daily driver right now
DealingWithIt202s@reddit
Mind sharing your launch command?
stormy1one@reddit
Sure - I'm using this variant https://huggingface.co/Kbenkhaled/Qwen3.5-27B-NVFP4 and the config located on one of the comments on that model: https://huggingface.co/Kbenkhaled/Qwen3.5-27B-NVFP4/discussions/1
Murinshin@reddit
Makes me seriously consider upgrading my 4090 😭
Own_Attention_3392@reddit
What is vllm doing differently to manage the cache? I'm curious because it seems like implementing whatever is going on there in llama.cpp would be a no-brainer.
Dany0@reddit
same setup but with latest llamacpp I get 50-60tok/s on the first ~4k context but just 8k context crashes down to 5tok/s and 30k context never responded
am I doing smth wrong? 9950x3D
llama-server --hf-repo Jackrong/Qwen3.5-27B-Claude-4.6-Opus-Reasoning-Distilled-GGUF --hf-file Qwen3.5-27B.Q4_K_S.gguf --host 0.0.0.0 --port 8080 --threads 16 --cpu-strict 1 --n-gpu-layers 999 --ctx-size 65536 --batch-size 4096 --ubatch-size 1024 --flash-attn on --cache-type-k bf16 --cache-type-v bf16 --swa-full --cont-batching --no-mmap --jinja --chat-template-kwargs '{"reasoning_effort": "high"}' --prio 2
Sh1d0w_lol@reddit
Dude thanks for the heads up. I also have rtx 5090, just changed runtime to vllm and it flies! Running Qwen3.3-27B Q6_K_XL with no quantization to KV cache. Getting 50 tps but there is no prompt processing time now, answer is almost instant!
HallTraditional2356@reddit
Do you run qwen 3.5 27B Q6 K XL on vllm ? Single or dual gpu ?
HallTraditional2356@reddit
Nice man! Which model do you use there with vllm? And can you post your launch args? I run a 5090 as well. Thx in advance!
eaz135@reddit
I can confirm these numbers. I ran the same model on my 5090 with almost identical results.
Igot1forya@reddit
On my DGX Spark if I run Qwen 3.5 27B natively via llama.cpp on Linux with zero optimizations I get 10t/s and if I run the same model on my 3090 I get 34t/s. Running a hybrid between the 3090 and GB10 (Spark) I get 12t/s.
estrafire@reddit
How did you benchmark the 34 t/s? Did you build llama.cpp from main? I did a build with CUDA and I'm seeing 22-24 t/s on the metrics endpoint.
Igot1forya@reddit
I am new to llama.cpp, but I downloaded it (git clone) and compiled from source (took like 20 tries to get it to work with sm_86 + sm_121). I had to set up two concurrent instances of llama.cpp, one for GPU0 and one for GPU1. Tested with a large PDF and a 16K prefill that I then ran a query against. I then set one to advertise the RPC service on one port and the other on a neighboring port.
I then wrote a custom Python script to boot one as a listener and one as the master node, split some layers to use the RPC service, and left the KV cache on the 3090; then I did a test where the KV was on the GB10. Depending on who was master and who was listener, the t/s swung wildly, depending also on the model size.
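llama.cpp ships an RPC backend for exactly this kind of two-box split; a minimal sketch of the pattern (hostnames, ports, and the tensor-split ratio are assumptions):

```shell
# On the listener box (e.g. the Spark), expose its GPU over RPC:
rpc-server -H 0.0.0.0 -p 50052

# On the master box (the 3090), pull in the remote backend and split layers:
llama-server -m Qwen3.5-27B-Q4_K_M.gguf --rpc 192.168.1.20:50052 \
  --n-gpu-layers 999 --tensor-split 0.6,0.4
```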
Since this post I've made some additional tweaks and refinements, and I should mention that my GB10 is NOT stock any more, as I've modded it some. I've added some additional cooling: the GB10 never exceeds 74C and I regularly see 104W from the GPU (the 3090 never exceeds 61C at 350W load). I'm also working on a method to reduce the stress on the OCuLink bandwidth, and I've had some limited success compressing the communication stream with idle GPU compute (since it's waiting for the layer to pass anyway). But this is a work in progress, and I'm hoping in the next few weeks to have something tangible that could even be useful over a 10/25Gb Ethernet link.
estrafire@reddit
hey super thanks for the reply! I forgot I had an undervolt; I got 36 t/s after removing it on the single 3090, incredible for Q4_K_M at the full 262k context
Igot1forya@reddit
Ah yes, that would do it! Yeah, the 3090 is such a special GPU it's up there with the 1080TI as one of the greatest cards to ever come out.
estrafire@reddit
Agreed! It was such a great investment some years ago. I'm also happy that Vulkan is catching up to CUDA, got 32 t/s on it, what a year
YourVelourFog@reddit
Macbook M1 Pro w/ 32GB of memory running base Qwen3.5-27B on LMStudio with a 4096 context window and I'm getting 5 TPS.
ImagineWealth@reddit
m1 max, 8-10t/s at 16k context, not usable.
itch-@reddit
M5 macbook, 7 t/s. Not a lot of GPU cores in this one
YourVelourFog@reddit
How much ram do you have?
itch-@reddit
32GB. I had set more context but reran it with 4000 and it was still 7 t/s. Not an M5 pro, just M5, I didn't choose it
I had an M1 Pro with 16GB that I can't compare on the 27B, but it ran the 9B faster than this M5: 25 t/s on the M5 vs 29 t/s on the M1 Pro.
0.8B: 178 t/s on M5 and 98t/s on M1 pro, so there's that
IntelVEVO@reddit
28 tok/s q5_k_m 12k context on a 5090 laptop.
Minimum-Lie5435@reddit
65 TPS (Dual 3090's NVLink) on Cyankiwi 4 bit awq
Numerous_Mulberry514@reddit
At 0 context ~ 30 tokens with ik llama cpp and graph parallel using two rtx 3060. At around 20k context it slows down to 24-25tps
overand@reddit
Dual 3090s only has me at ~36, so don't hate on that dual 3060 setup too much!
Minimum-Lie5435@reddit
gotta try cyankiwi/Qwen3.5-27B-AWQ-4bit with vLLM - getting 60TPS-65TPS (Dual 3090 NVLink)
Numerous_Mulberry514@reddit
I'm very happy actually! I can run so much more at a very acceptable speed, which I hadn't thought would be possible!
getmevodka@reddit
F16 about 33tok/s with rtx 6000 pro blackwell
Makers7886@reddit
Man that's sweet in a single card - I am seeing 41t/s FP16 4x3090s w/196k context taking up all of 4x3090s using vLLM.
RomanticDepressive@reddit
Are you using NVLINK?
Makers7886@reddit
No I only have 1 nvlink and actually don't even use it right now. It's an epyc 7502 romed8-2t with the 7th pcie slot bifurcated into two gpus to give me the 8 total on that rig.
twack3r@reddit
That’s very impressive! What launch parameters do you use?
Makers7886@reddit
Was testing it vs gguf and exl3 versions so didn't "iron it out" but this is the gist:
getmevodka@reddit
That's great too. Yeah, I only have the Max-Q variant, so I'm happy with the speed and power use. Running full context though.
nunodonato@reddit
what? thats what I get with FP8!! i thought the performance would be much more different. vllm, right?
thegr8anand@reddit (OP)
How much context are you able to use with your setup?
twack3r@reddit
The native full ctx and there is a lot left, around 30GiB of VRAM remain available.
getmevodka@reddit
Full context length
coder543@reddit
Some people get strangely upset at the concept, but you can use q8 KV cache and fit about 200k context on that graphics card, with minimal quality loss. The people claiming you need bf16 kv cache don’t make any sense to me.
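A rough back-of-envelope shows why q8 roughly doubles usable context: KV memory grows linearly with context length and with bytes per element. The layer and head counts below are assumptions for a hypothetical 64-layer GQA model, and q8_0 actually carries a small block-scale overhead on top of 1 byte per element.

```python
def kv_cache_bytes(n_ctx, n_layers, n_kv_heads, head_dim, bytes_per_elem):
    # K and V each store n_kv_heads * head_dim values per layer per token
    return 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem * n_ctx

# Hypothetical 64-layer GQA model with 8 KV heads of dim 128
f16 = kv_cache_bytes(110_000, 64, 8, 128, 2)  # 16-bit cache
q8 = kv_cache_bytes(110_000, 64, 8, 128, 1)   # ~8-bit cache, ignoring block-scale overhead
print(f"f16 KV at 110k ctx: {f16 / 2**30:.1f} GiB")
print(f"q8  KV at 110k ctx: {q8 / 2**30:.1f} GiB")
```

Halving the per-element size halves the cache, so the same VRAM budget holds roughly twice the context.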
grumd@reddit
With q8_0 kv cache quant, at 20-30k+ context my models start failing "edit" tools in opencode while coding - because for a successful edit tool they need to exactly output the current file contents with what specifically needs to be changed. And they often start failing with "file contents dont match exactly", which means they forget some rare tokens from the kv cache and make small mistakes. Never happens with unquantized kv cache.
Sad_Individual_8645@reddit
When I set the context limit to that in LM Studio, any prompt processing takes literally forever. If I set the context limit lower but run the same prompt, it's quick. I don't get how people are using over 200k context with it on 24GB if prompt processing takes that long.
coder543@reddit
It really seems like either an LM Studio issue or a Windows issue. On Linux with llama-server, I get the same speed with 200k as I would with a smaller context, because it is all in VRAM.
__JockY__@reddit
The reason it’s unpopular is that at long contexts the KV cache quality degrades enough that it impacts model performance significantly, often leading to loops and never ending output. This is compounded by any quantization applied to the model itself.
These issues don’t really show up with small context lengths of just a few thousand tokens. But at 100k+? Garbage.
coder543@reddit
Yes, this is what people say. It is not what I have experienced. I would like to see actual benchmarks. It is not logical for q8 to have that much impact.
input_a_new_name@reddit
Not much of a benchmark, but I've experienced this with Gemma 3 27B at Q5_K_M and q8 KV cache: in a 30k-long chat, it was unable to precisely quote a line from somewhere around 28k when asked to, writing analogous lines but not the actual line verbatim. When I loaded it with full-precision KV cache, it suddenly worked and wrote the exact line it was supposed to, 10 times out of 10. So there you go.
__JockY__@reddit
I agree that it’s minimal loss, but when you compound it with a quantized model and then further compound it with massive context, it just breaks down.
I tried it with MiniMax-M2.5 using Claude cli and ultimately concluded that it wasn’t suitable for daily use due to the way Claude would essentially stop working towards the end of its context buffer.
BF16 KV doesn’t suffer the same issue.
Now I understand that this is a sample size of 1 with no data to back it up, so if you want to rip me a new one that’s fair enough. But know this: I’m no noob to this and I’m doing these tests on a quad RTX 6000 PRO rig on which I’ve been experimenting for quite some time. I’m comfortable that I understand the issue and that I’ve tested it well enough to be sure in my conclusions for my use cases. YMMV.
ProfessionalSpend589@reddit
> The people claiming you need bf16 kv cache don’t make any sense to me.
Otherwise I would have plenty of VRAM left which would be a waste. :p
Klutzy-Snow8016@reddit
Someone needs to run long context benchmarks with and without KV cache quantization. Like, if I can get X context with 16-bit and 2X with 8-bit, I want to know that 8-bit isn't degraded at X tokens. The entire point of quantizing the KV cache is to unlock longer context, but people just run their normal low-context prompts and call it good.
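One way to run this comparison with stock llama.cpp tooling is llama-perplexity over a genuinely long document, once per cache type. The paths and the long-context test file are assumptions, and it's worth double-checking that the cache-type flags behave identically in llama-perplexity.

```shell
# Perplexity over a long document with full-precision KV cache...
llama-perplexity -m Qwen3.5-27B-Q4_K_M.gguf -f long_context_corpus.txt \
  -c 131072 --n-gpu-layers 999 --flash-attn on \
  --cache-type-k f16 --cache-type-v f16

# ...and again with quantized KV; compare the final PPL figures.
llama-perplexity -m Qwen3.5-27B-Q4_K_M.gguf -f long_context_corpus.txt \
  -c 131072 --n-gpu-layers 999 --flash-attn on \
  --cache-type-k q8_0 --cache-type-v q8_0
```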
thegr8anand@reddit (OP)
It might be my noobness, but I never tried the KV quantization options even though they are just a toggle in LM Studio. Turning it on and using Q8_0 for KV, I can now get full speed of 34-40 tps at 110k context! Thanks for the suggestion; at least I tried it out. How much does Q8 quantization affect quality compared to not quantizing at all? 3-4x the speed, surely there must be some trade-off?
ttkciar@reddit
I've had good experiences with q8_0 K and V cache quantization, but I think some task types are more sensitive to it (like codegen).
itch-@reddit
7900XTX, 32 t/s using 27B-UD-Q4_K_XL using vulkan llama.cpp. I only put 40 000 tokens for ctx on vram, when I need more I'll try to see how much I can squeeze in.
I'm much more impressed with this than 35B-A3B. That fails to make Cline work, but 27B handles it just fine. And thanks to that quirk where it doesn't think much if there are tools, 30 t/s is a reasonable speed.
PaMRxR@reddit
RTX 3090, 32-34 t/s @ d40000 using 27B-UD-Q4_K_XL. Practically the same!
itch-@reddit
https://kaitchup.substack.com/p/summary-of-qwen35-gguf-evaluations
I saw this and got bartowski IQ4_NL, apparently a touch better and also a bit faster. 36 t/s. 12% increase, I feel it more than I thought I would
PaMRxR@reddit
Thanks for sharing, I see similar performance bump which correlates roughly with the ~10% smaller size. According to KLD benchmarks the IQ4_XS is even better (and actually really good for its size), but both are proportionally worse than bigger quants. Anyway I'll give it a try for a few days!
https://old.reddit.com/r/LocalLLaMA/comments/1rk5qmr/qwen3527b_q4_quantization_comparison/
Dirk__Gently@reddit
Y u no use roc?
itch-@reddit
I just get vulkan and hip builds off github releases, check the speed difference, make my choice. IIRC it was very close in recent builds
sine120@reddit
Don't split dense models into system RAM. With 24GB VRAM you can easily fit the model. When context spills into system RAM, your TG speed will drop dramatically. If you need more context space, try the IQ3.
Short_Ad_7685@reddit
which quants are u running?
sine120@reddit
I run the UD-IQ3-XXS in 16gb of vram
Short_Ad_7685@reddit
Thanks.
Haeppchen2010@reddit
Radeon RX 7800XT, 64k context Q8:
IQ3_XS: 390 t/s in, 16 t/s out. But it is slightly too „dumb".
IQ4_XS with CPU offload: 230 t/s in, 4.5-5.5 t/s out. But the quality improvement is worth the wait.
Opteron67@reddit
FP8 model 60-70 t/s with dual 5090 with vllm and p2p driver
Nepherpitu@reddit
Quad 3090 runs fp16 at ~90tps average with mtp. Looks like you can squeeze more.
Opteron67@reddit
I had cpu on powersave lol (it matters in fact)
now with cuda 13.2, vllm HEAD, linux 6.19 nvidia 595 p2p enabled driver on W790 and dual 5090 PNY watercooled
concurrency 1 with MTP 1 => avg 125 tps; concurrency max without MTP => avg 626 tps
RomanticDepressive@reddit
Wait can you tell me more
Like I have dual NVLINK but am waiting for Corsair cables to become back in stock
Have you tried NVLINK? ik_llama seems to support it with NCCL, and I can verify it but have yet to put all 4 cards in a single system
Nepherpitu@reddit
Not sure about llamacpp, I'm running it with VLLM.
Dexamph@reddit
30-35tk/s on 4090+3090Ti in Q5-Q6, with Bartowski's Q6KXL running a bit faster because of some layers at Q8. The 3090Ti allows for higher quants while keeping context maxed out. LM Studio just updated their llama.cpp runtime today so it's basically performing the same as what I get in OpenWebUI+llama-server.
Very impressed with this model: it doesn't buckle on complex prompts like 35B, nor forget things in a long chat like GLM 4.7 Flash, while still being much, much faster and more usable than bigger MoE models with partial offloading.
thegr8anand@reddit (OP)
Yes, the new runtimes in LM studio did give slight boost.
Radiant_Condition861@reddit
dual 3090 and llama-swap settings.
|=========================================+========================+======================|
|   0  NVIDIA GeForce RTX 3090        On  | 00000000:01:00.0  Off  |                  N/A |
| 32%   38C    P8             18W /  225W |  23189MiB /  24576MiB  |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
|   1  NVIDIA GeForce RTX 3090        On  | 00000000:02:00.0  Off  |                  N/A |
| 30%   38C    P8             26W /  225W |  22006MiB /  24576MiB  |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
Note: Power was reduced from 350W to 225W.
PassengerPigeon343@reddit
I’m so glad I clicked into this thread to find someone with my exact hardware, also using llama-swap, and posting their settings to achieve this speed. Thank you!
Radiant_Condition861@reddit
lmk if if you can replicate my results.
RomanticDepressive@reddit
Are you using nvlink? I can test two separate nvlink setups
Radiant_Condition861@reddit
no nvlink. I didn't think it would affect things that much. If I were training models, then I would consider offloading the PCIe bus traffic to NVLink to help with performance.
Fabulous_Fact_606@reddit
Was getting excited, then it's the 35B model
oxygen_addiction@reddit
Different model.
LMTLS5@reddit
~20tps tg and ~200tps pp on mi50 at q4km
EvilGuy@reddit
I get about 35 tokens a sec when I run Q_4_K_M 27B on a 3090 with 128k context. (Q8 kv quant) in LM studio.
Sad_Individual_8645@reddit
What settings? When I do that the prompt processing takes forever with my 3090, exact same setup as you too.
EvilGuy@reddit
Hmm nothing crazy in lm studio
Qwen 3.5 27b model at q4km
Context 128000
GPU offload 64 (all layers)
CPU thread pool size 7
Most things are default here
Unified KV cache on
Offload KV cache to GPU on
Keep model in memory off
Try mmap on
Flash attention on
K and V cache set to q8_0
Nothing crazy 35 tokens pretty steady with good prompt processing
No_Information9314@reddit
20 t/s on dual RTX 3060s using Unsloth IQ4_XS at 100k context, KV cache 8_0.
odikee@reddit
22-23tks on 5080+3080 UDQ5, 50k context, tp 8_0
ParamedicAble225@reddit
3090, and slow as fuck, but I use it anyways. Haven’t calculated TPS but it’s 1-7 minutes per request with around 2000-15000 tokens input. It’ll be slow for agentic work but works excellently. I just live in slow mo now.
Not running A3B, which would be a bit faster but more stupid, since only 3 billion parameters are active instead of almost all of them.
Sad_Individual_8645@reddit
I see people on here with a 3090 and the native 260k context selected in LM Studio for the Q4 version of the model; when I do the same, the prompt processing literally takes forever. I have 64GB DDR5 and have tried every setting under the sun, and I don't get it. If I go down to like 90k with Q8 KV cache quant it works, but I don't get how people are doing that with a 3090.
MushroomCharacter411@reddit
A3B has 3 billion *always* active parameters, but apparently the average prompt actually uses around 10B, the "always on" 3B, plus 7B chosen from the remaining 32B.
CreamPitiful4295@reddit
I’ve been sitting on my 3090 and then figured out I could run models. Did exactly what you did and got exactly the same results. I’m a bit poorer now. :)
serioustavern@reddit
Output speed around 35 tok/s.
This is using the same Q4_K_XL on a RTX 3090 with 32GB DDR3 system ram via llama.cpp (weights and context fully loaded on GPU).
Quantizing KV cache to Q8 rather than the default F16 got my max context length up to around 100K instead of around 50K with a negligible effect on speed. Lowering the 3090’s power limit from 350W to 280W decreased my speed by about 1 tok/s.
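The small cost of power-limiting makes sense because token generation is memory-bandwidth-bound rather than compute-bound. The cap can be set with nvidia-smi (the GPU index is an assumption, and the command needs root):

```shell
# Cap GPU 0 at 280 W; resets on reboot unless persistence mode is enabled
sudo nvidia-smi -i 0 -pl 280
```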
GhenB@reddit
FP8: 28 tok/s on 4 x rtx3060, vllm, tp4, 128k context, max-num-seqs 2
asfbrz96@reddit
7.5 on strix halo q8
octopus_limbs@reddit
Slow but still faster than if you type it yourself
simracerman@reddit
Yikes. Do you use it for non-coding?
asfbrz96@reddit
I'm using more minimax m2.5 or the 122b model, they are much faster
AdventurousSwim1312@reddit
On rtx 6000 pro, the awq version runs at roughly 80-90 tokens / seconds, with nearly same perf as the fp8 version
l1t3o@reddit
105 t/s gen speed with vllm and multi token Prediction activated.
nunodonato@reddit
how many speculative tokens? and which gpu?
l1t3o@reddit
5 multi-token-prediction tokens and 2x3090. I followed this guide: https://www.reddit.com/r/LocalLLaMA/s/WNeLqJyQoP It's the multi-token prediction built into Qwen3.5 that provides the most boost, but only vLLM supports it for now. Hope to be able to use it in llama.cpp soon!
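For reference, vLLM takes speculative/MTP settings as a JSON blob via --speculative-config. A sketch of the shape only; the method name and whether this model ID supports it are assumptions, so check the linked guide for the exact values:

```shell
vllm serve cyankiwi/Qwen3.5-27B-AWQ-BF16-INT4 \
  --tensor-parallel-size 2 \
  --speculative-config '{"method": "mtp", "num_speculative_tokens": 5}'
```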
nunodonato@reddit
Damn I'm also with vllm but at around 30tok/s. Using 3 for MTP because I read at higher values you can have issues with tool calling at long context.
l1t3o@reddit
Are you using this exact quant: https://huggingface.co/cyankiwi/Qwen3.5-27B-AWQ-BF16-INT4 ?
I believe the tool calling issue you're mentioning might have been solved recently.
nunodonato@reddit
No, I'm using the official Qwen FP8
l1t3o@reddit
What GPU do you have ?
rtx 30xx (ampere architecture) doesn't natively support fp8 (hence the funky format quant I gave you a link to).
rtx 40xx natively supports it.
nunodonato@reddit
I'm renting a RTX PRO 6000 (blackwell)
l1t3o@reddit
1/ vLLM nightly + Blackwell flag
Your RTX PRO 6000 is SM120 (compute capability 12.0). Stable vLLM doesn't fully recognize it and falls back to slower Marlin kernels instead of native FP4/FP8 ones. You need the nightly build compiled for CUDA 13.0:
That alone can make a big difference.
2/ NVFP4 — the next level
Blackwell has hardware FP4 tensor cores (SM120), which means you can run NVFP4 quantization natively — ~1.6x faster than FP8, with minimal quality loss. There's a ready-made checkpoint:
Caveat: SM120 NVFP4 support is still a bit rough on the edges (some kernel bugs were only fixed recently). Use it with the nightly image above and test — if you get NaN outputs or garbage, fall back to FP8 which is rock solid on your card.
TL;DR: nightly first (free perf gain), then try NVFP4 if you want to push further.
I know that RTX 5090 (32GB, 1.8 TB/s bandwidth) → ~80 tok/s using NVFP4, without MTP.
What I think your blackwell could reach : NVFP4 + MTP 5 tokens + SGLang → 100-130 tok/s
Good luck !
oxygen_addiction@reddit
What hardware
SharinganSiyam@reddit
Getting about 46 TPS on my RTX 5090 using this command:
llama-server -m "C:\Users\Pc\AppData\Local\llama.cpp\unsloth_Qwen3.5-27B-GGUF_Qwen3.5-27B-UD-Q5_K_XL.gguf" --mmproj "C:\Users\Pc\AppData\Local\llama.cpp\unsloth_Qwen3.5-27B-GGUF_mmproj-BF16.gguf" --ctx-size 262144 --fit-ctx 262144 --n-predict -1 --parallel 1 --flash-attn on --fit on --threads 8 --threads-batch 16 --temp 0.6 --top-k 20 --top-p 0.95 --min-p 0.0 --presence-penalty 0.0 --repeat-penalty 1.0 --cache-type-v q8_0 --cache-type-k q8_0
Try using llama cpp with kv cache quantization at q8_0 under 131k context. You might get optimal performance
stavenhylia@reddit
How are you running it that fast with a context that big? In LM Studio it predicts to take around 52GB of RAM, overflowing my 5090
SharinganSiyam@reddit
But in my case it stays within the 32GB of my VRAM. That's because I used the --flash-attn on and KV cache q8_0 flags.
stavenhylia@reddit
Hmm.. Maybe I'm underestimating the overhead LM Studio adds.
I'll try to run it like you and see what I can figure out.
Thank you for getting back to me :)
MerePotato@reddit
I really wouldn't quantise the kv cache on a reasoning model if I could possibly help it
SharinganSiyam@reddit
Didn't see any performance issue in my use case. But yeah avoiding kv cache quantization is better if you are satisfied with your context limit
EffectiveCeilingFan@reddit
There was a post on here asserting that the Qwen3.5 arch really didn’t like anything other than BF16 for the KV cache. If you’re getting good results though then I might have to experiment myself.
Shamp0oo@reddit
For what it's worth: I have an RTX 5060 Ti 16G with a memory OC of +3000 MHz. Using this quant I can fit around 20k of context. This gives me around 25 tok/s of tg (don't remember pp), which is way faster than 122B for me, where I have to offload to system RAM (DDR4) and only get around half the performance. I'm using self-compiled mainline llama.cpp on Linux.
HugoCortell@reddit
Same speeds as you, are you using LM Studio?
thegr8anand@reddit (OP)
Yes
HugoCortell@reddit
Yeah, that's normal then. I made a post recently and the consensus seems to be that somehow LM Studio gobbles performance.
If you go to your model tab, you can change the load settings for the specific model to improve performance, but it won't be as good as with other programs. For some reason the default setting in LM Studio is not to load the model onto VRAM, which obviously hurts performance a lot.
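If you'd rather skip LM Studio's defaults entirely, a minimal llama-server launch that forces every layer onto the GPU looks something like this (a sketch: the model path is a placeholder, and -ngl 99 just means "offload as many layers as exist"):

```shell
# -ngl 99: offload all layers to the GPU (a number larger than the layer count is fine)
# -c 32768: context size; --flash-attn on: enable flash attention
llama-server -m /path/to/Qwen3.5-27B-Q4_K_M.gguf -ngl 99 -c 32768 --flash-attn on
```

If that overflows your VRAM, lower -ngl until it fits instead of letting the runtime silently fall back to CPU.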
mixman68@reddit
Same problems with latest lm studio
The program loads only 12 GB into VRAM; I had to test layer by layer to find the limit of my graphics card, and with 35B I get 37 t/s.
For 27B I don't get more than 7 t/s.
Config: 4070 Ti Super 16 GB, 32 GB RAM 6000 MHz, 7800X3D, Debian trixie XFCE
sammcj@reddit
https://smcleod.net/2026/02/patching-nvidias-driver-and-vllm-to-enable-p2p-on-consumer-gpus/#results
lemondrops9@reddit
Once anything gets offloaded to the CPU it's going to be a lot slower... it's why VRAM is so sought after.
Rasekov@reddit
exl3 at 4bpw and 3.1bpw: 25 t/s with context running at FP16 on a 3090 with a power limit of 270W. Speed is similar in both cases, so I'm guessing I'm fully compute limited. A higher power limit helps, but not much; for my use case and my card, 270-280W is the sweet spot, not worth another 100W just for a 5% gain or so.
Max context is set to 96K; speed was more or less stable up until 24K. I haven't filled the context yet. I tried batching first (~190 t/s with a batch of 16, but the context is too small for anything useful) and will test context in detail tomorrow or so.
It's not hard to set up, and if you aren't offloading to RAM and don't want llama.cpp's webui, it's very much worth the trouble.
La7ish@reddit
For 27b I get between 33-40 tok/sec on my 4090
dubesor86@reddit
same here. Q5_K_M@32k ctx
SillyLilBear@reddit
81t/sec on dual RTX 6000 Pro @ FP8
shrug_hellifino@reddit
Old AMD 1950X Threadripper with 64 GB DDR4 and 5x AMD Pro VIIs (16 GB) on ROCm 6.4.x and llama.cpp: BF16 unsloth, 148k ctx, FP16 KV, batch 4096, ubatch 1024, ~10 t/s
scrappyappl@reddit
MacBook Pro m3 max 48gb. I get around 12t/s using mrader 27b heretic
d4mations@reddit
What pp/s?
VickWildman@reddit
Like 4-5 tps, Q4_0. On my phone (which has 24 GB RAM).
pmttyji@reddit
What t/s are you getting for Qwen3.5-9B?
hyrulia@reddit
Q4_K_M runs at 5 t/s
UD-IQ3_XXS at 25 t/s
5060 TI 16Gb
input_a_new_name@reddit
Run at least IQ4_XS, IQ3 is not worth it, it should fully fit
hyrulia@reddit
I will try it, thx!
InternationalNebula7@reddit
RTX 5080 - pp 1673.83, tg 54.35, Qwen3.5-27B-UD-IQ3_XXS.gguf
HEAVYlight123@reddit
26 tok/sec on 5070 Ti 16GB and 3070 8GB. Runs faster than coder 80B on my DDR4 and 12700, but with less context.
Embarrassed-Boot5193@reddit
Dual RTX 5060 Ti (total 32GB VRAM), 130k context, KV cache f16, llama.cpp
Q5_K_XL - pp ~650 t/s, tg ~16 t/s
Q4_K_M - pp ~800 t/s, tg ~19 t/s
IQ3_XXS - pp ~1200 t/s, tg ~25 t/s
SLI_GUY@reddit
~45 tok/sec RTX 4090
Dundell@reddit
After some tests on aider I've been finding Q4 less successful than Q5, so I run Q5 27B a lot at around 14 t/s on 3x RTX 3060s. I know the recent updates to llama.cpp brought my 122B speed up 25%, and they could probably do the same for my 27B. I haven't tried anything different to speed it up, but I'm open to some ideas. I'm more interested in Q4 122B at 26 t/s.
putrasherni@reddit
All tk/s figures are averages as the context fills up.
Qwen 3.5 27B IQ3_XXS
34 tk/s at 32K context
32 tk/s at 65K context
29 tk/s at 131K context
Qwen 3.5 27B Q4_K_S
30 tk/s at 8K context
27 tk/s at 65K context
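Numbers like these can be reproduced with llama-bench, which in recent llama.cpp builds takes a depth flag to measure generation speed with the context already partly filled. A sketch (the model path is a placeholder, and the exact flag spellings may differ by build):

```shell
# -n 128: generate 128 tokens per measurement
# -d: prefill depth (context fill) before measuring tg; comma-separated list runs each depth
llama-bench -m /path/to/Qwen3.5-27B-IQ3_XXS.gguf -n 128 -d 0,32768,65536
```

That gives you a tg row per depth, so you can see the slowdown curve instead of a single average.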
Twirrim@reddit
If you can tolerate a little precision loss, try Qwen3.5-35B-A3B. I'm getting ~20 tok/s with 128k context on an 8GB VRAM RTX 3050. I've been finding it perfectly fine for my use cases.
Ok-Caregiver9383@reddit
What quant are you using to get it to fit in 8GB? It must be VERY low
Twirrim@reddit
I'm using Q3_K_S. I tried a Q2 one previously but the perf drop-off was noticeable.
coder543@reddit
You don’t have to fit a MoE entirely into VRAM to get good performance. Just the dense core of attention/KV/shared expert.
amejin@reddit
How? Using vLLM trying to run any of these thinking models requires all the vRAM I have. What am I missing?
coder543@reddit
vLLM is not optimized for these kinds of low memory use cases. You need to use llama-server.
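The llama.cpp mechanism for this is the --n-cpu-moe flag (the same one used in the Gemma command earlier in the thread): it pushes the routed experts of the first N layers to the CPU while the dense core (attention, KV cache, shared experts) stays on the GPU. A sketch with a placeholder path; N is something you'd tune down until the model fits:

```shell
# -ngl 99: all layers "on GPU" in principle
# --n-cpu-moe 12: but keep the routed expert weights of the first 12 layers in system RAM
llama-server -m /path/to/moe-model.gguf -ngl 99 --n-cpu-moe 12 -c 65536 --flash-attn on
```

Since only a few experts are active per token, the CPU-resident weights are touched sparsely, which is why this degrades speed far less than offloading whole dense layers.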
amejin@reddit
Understood... That's a bummer.
Ok-Caregiver9383@reddit
So what q is he using?
coder543@reddit
I’m not them, but they could be using q8 and still get good performance. Obviously the lower the quant, the faster it is.
Thunderstarer@reddit
What quant?
soyalemujica@reddit
I would not say "a little precision loss"; 35B loses to 27B by quite a noticeable margin in all benchmarks, which IS EXPECTED.
overand@reddit
That is genuinely bad-ass for that card! What's your prompt-processing speed?
thegr8anand@reddit (OP)
That sounds very good for 8GB VRAM. What quant are you using?
timhok@reddit
V100 32GB capped at 200W
llama.cpp on llama-swap
Qwen3.5-27B-UD-Q6_K_XL.gguf
k/v cache in Q8
vision enabled
fit ON = 182k context window w/ vision in F16
small requests under 10k tokens - 22 t/s gen 640 t/s pp
30k+ tokens - 14 t/s gen 400 t/s pp
I love it
ismaelgokufox@reddit
Q3 unsloth quant at around 20 T/s on a 6800. When the context gets bigger, 6-7 T/s.
Additional_Ad_7718@reddit
I think it was 20T/s on my 3060 64 gb ram.
Honestly it was too slow to use with reasoning on.
Front_Eagle739@reddit
2000 pp and 47 decode. Rtx5090
__JockY__@reddit
I tested the BF16 unquantized version in vLLM 0.17.1rc1.dev5+g8d98d7cd on 4x RTX 6000 PRO on EPYC with DDR5 6400 in tensor parallel mode with MTP and 2-token speculation.
“Write flappy bird” = avg generation throughput of 286.6 tokens/sec and accepted MTP output throughput of 185.49 tokens/sec with a 91.4% acceptance rate.
Of course flappy bird is benchmaxxed to death, which means MTP is crushing it and leading to a false sense of speed.
“Write an Objective-C program to recursively scan the home directory looking for .png files. Build an index of these and then use Mac OS frameworks to convert all of the PNG files into a .mp4 video that runs at 2 frames per second.” = avg gen throughout of 139.2 tokens/sec with accepted MTP throughput of 85.7 tokens/sec and 80.1% acceptance rate.
Not too shabby!
wizoneway@reddit
5090, Qwen3.5-27B Q4. Generation: 954 tokens in 13s, 69.86 t/s
MerePotato@reddit
33t/s Q6 on a 4090, could probably be a lot more if I bothered to use MTP
thegr8anand@reddit (OP)
Does only vLLM support MTP right now?
MerePotato@reddit
I believe so, transformers might but I haven't looked into it
SurprisinglyInformed@reddit
2x 3060 12GB (total 24GB) + 64GB RAM DDR4 in LMStudio
qwen3.5-27b unsloth Q4_K_XL 65k context
14.41 tk/s
Adventurous-Paper566@reddit
12 tps with 5060ti + 4060ti Q6_K_L 65k context
heikouseikai@reddit
.000000000009 t/s
south_paw01@reddit
Q4_K_M, 25 t/s, 9700, 32GB. Will test unsloth versions in the future.
miniocz@reddit
Bartowski Q4KM split between P40 + 3060 @ 30000 context - 9-11 t/s
Mir4can@reddit
2x 5060 Ti. With vLLM, cyankiwi AWQ, 115k context: without MTP a stable 20 t/s, with MTP it ranges between 30 and 45.
hp1337@reddit
IQ3_XXS with q8_0 cache on 9070XT vulkan I get 800 pp and 30 tg
Lorian0x7@reddit
RTX 4090, Zorin OS, Q4_K_M, 62k context, 40-42 t/s, no CPU offload
PotentialLawyer123@reddit
Just asked it a quick question in a new chat and achieved 33.41 tok/sec and 0.17s TTFT on my 7900 XTX. 28.18 tok/sec on a 67k context file I just dropped into a new chat but 109.83s TTFT (5755 token output). Hope this helps!
overand@reddit
Dual 3090, underclocked, Q4_K_M
Prompt: 1102
Gen: 36.2
(I think I'm only on one card with that particular model.)
OkDesk4532@reddit
27B is really slow. 35B-A3B is much faster.
StrikeOner@reddit
I posted a benchmark on this sub today about this particular model. I'm getting around 20 t/s with 50k context with various Q4 quants.
sleepingsysadmin@reddit
12 TPS fully offloaded. It's sad.
Worse yet, can't use speculative decoding because of the vision.