Post Your Qwen3.6 27B speed plz
Posted by Ok-Internal9317@reddit | LocalLLaMA | View on Reddit | 191 comments
Mine is Tesla M40 12GB*4, fp4:
26tok/s PP
8tok/s TG
This is out of reach for me, I'll wait for the 9B
My_Unbiased_Opinion@reddit
you should try Qwen 3.6 35B. It will be way faster and is still pretty damn good.
Money_Hand_4199@reddit
35b is much worse than 27b in coding and agentic tasks
-dysangel-@reddit
M3 Ultra 512GB
Money_Hand_4199@reddit
What model quant is used? Is this in vllm or llama.cpp? oMLX?
-dysangel-@reddit
I think it was Q4, llama.cpp
I've just tried mlx ( unsloth/Qwen3.6-27B-UD-MLX-4bit ) and I get:
Money_Hand_4199@reddit
AMD Strix Halo 128GB:
15-22 TG with vllm 0.19.2rc and qwen3.6 27B GPTQ int4 models
vllm with 64 parallel requests gives ~280 t/s total throughput
dobkeratops@reddit
m3-ultra mac studio: llama.cpp, Qwen3.6-27B-Q8_0: 21 tokens/sec generation at start of context (0-4000 tokens)
324-424 tokens/sec prompt processing bringing a text file into the context
at 20,000+ context, 19.7 tokens/sec after that file was ingested.
Money_Hand_4199@reddit
try omlx instead of vllm-mlx
CatalyticDragon@reddit
Single AMD Radeon R9700 with ROCm 7.2.
Prompt eval: 2079.68 tokens per second
Eval: 66.5 tok/s
putrasherni@reddit
Awesome work.
I'd love it if you went further, up to 32K and 64K context lengths, but from what I read
I would conclude that ROCm is now better than Vulkan when using dual R9700, especially for larger context lengths?
CatalyticDragon@reddit
Settings:
-ngl 99 -fa 1 -ctk q8_0 -ctv q8_0 -sm layer (default, layer split across both GPUs) · 1 repetition, no warmup
pp = prompt processing (t/s) · tg = token generation (t/s)
Gemma-4-26B-A4B (MoE, 128 experts, 8 active)
Qwen3.6-27B (Dense)
Qwen3.6-35B-A3B (MoE, ~35B total, 3B active)
putrasherni@reddit
I guess I'm going to stay with my llama.cpp + Vulkan build.
I really hoped ROCm would break into sustained 2K pp at 100k+ context and 120+ tok/sec.
I guess it will come eventually.
here are my stats for reference,
flags used
export GGML_VK_VISIBLE_DEVICES=2,1
export GGML_VK_ALLOW_GRAPHICS_QUEUE=1
my run params for llama-bench are
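roughly like the following (model path and test sizes here are placeholders, not an exact run):
```
# Representative llama-bench run over Vulkan on two GPUs
# -p = prompt lengths to test (pp), -n = generation length (tg)
export GGML_VK_VISIBLE_DEVICES=2,1
export GGML_VK_ALLOW_GRAPHICS_QUEUE=1
./llama-bench -m qwen3.6-27b-q4_k_m.gguf \
  -ngl 99 -fa 1 -sm layer \
  -p 512,4096,16384 -n 128
```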
putrasherni@reddit
Thanks my man
Best_Control_2573@reddit
Which quant? That sounds like 35b numbers...
CatalyticDragon@reddit
Oh you're absolutely right. Sorry for the confusion. Shall retest later.
mestrade78@reddit
A4500 blackwell 32GB - 40 t/s
zannix@reddit
is this fp8?
rbit4@reddit
8480 pp/s and 1264 tps, fp4. Dual 5090.
Ok-Internal9317@reddit (OP)
Combined throughput, not single-stream throughput, right?
rbit4@reddit
It's 2795 tps at c=64 on dual 5090, and single-stream is 82.8.
AdamDhahabi@reddit
I found a crazy claim: 192K context at 152 t/s on Qwen3.6-27B, single RTX 4090.
Q4_K_M + ik_llama.cpp + speculative decoding using Qwen3-1.7B as the draft model.
https://x.com/outsource_/status/2047660565170909555
Apprehensive-View583@reddit
Just tried exactly that: the vocab does not match and yields lower tps. I'm getting 9 tps with the same draft model; I think he is making it up.
I think the ones actually working are the dtree/dflash style setups, like lucebox or d-flash; they all have their own draft model and kernel, not what this guy is claiming.
Nowadays some of these Twitter posts are just straight-up made up.
I got 40 tps with just a 3090, and this slowed it down 3x due to vocab mismatch → token translation overhead.
themule71@reddit
Always download the very latest model releases. Vocab should match IIRC.
QuinsZouls@reddit
Same result with an RX 9070 and turboquant: no spec dec at 30 tps, but after enabling it using Qwen 3.5 0.8B I got 6 tps with a 100% acceptance rate.
robogame_dev@reddit
Look at his screenshot, he’s not making it up - he’s asking his LLM and his LLM is making it up!
EveningIncrease7579@reddit
I get the same results. Tried with many different parameters and didn't get more than 40 t/s.
eugene20@reddit
Because they're using a 1.7B for speculative decoding.
Altruistic_Heat_9531@reddit
Wait a minute, 192K context but it's -c 8192? What??
ArtfulGenie69@reddit
I linked this in my response, but this should help for actual speed on a 3090. Using a draft model never works. Using vllm and turboquant to fit it to the card will work though, and the 3090 has INT4 processing I think, so AWQ is really fast.
https://medium.com/@fzbcwvv/an-overnight-stack-for-qwen3-6-27b-85-tps-125k-context-vision-on-one-rtx-3090-0d95c6291914?postPublishedType=repub
ArtfulGenie69@reddit
Using some other draft model is probably BS; not sure, but just about none of the various Qwen models pair well for this. The heavy lifting is probably the speed of the card, the 4-bit quant, and the speculative decode.
Someone did a similar test with a 3090 on vllm and was able to get it to 85 t/s with MTP; if dflash worked it would probably be over 100 t/s. https://medium.com/@fzbcwvv/an-overnight-stack-for-qwen3-6-27b-85-tps-125k-context-vision-on-one-rtx-3090-0d95c6291914?postPublishedType=repub
andy2na@reddit
Doesn't work for me; it rejects most drafts and tanks t/s down to 15 t/s. Also tried Qwen3-1.7B.
--model /models/qwen35/Unsloth_Qwen3.6-27B-IQ4_NL.gguf -ngl 99 --ctx-size 65536 -md /models/qwen35/Qwen3-0.6B.gguf -ngld 99 -cd 4096 --draft-max 20 --draft-min 5 --draft-p-min 0.55 --cache-type-k q8_0 --cache-type-v q8_0
Ok-Internal9317@reddit (OP)
I'll try this out, damn this is cool
AdamDhahabi@reddit
The author says: mainline llama.cpp works fine too but you may see CUDA fallback warnings on q8 KV in some builds.
Ok-Internal9317@reddit (OP)
I just tried it, went from 6.6 down to 5.5, so I guess it doesn't work for me...
Apprehensive-View583@reddit
The author is most likely lying. I tried the exact setup on my 3090; I get 1/3 the speed compared to without speculative decoding.
Awkward-Reindeer5752@reddit
I wonder how this impacts code generation quality? I’d be surprised if the correct next token is consistently amongst the draft model suggestion set
emprahsFury@reddit
Speculative decoding produces mathematically guaranteed correct choices; the large model verifies the drafted token(s).
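In the sampling formulation (per the original speculative decoding papers), the rule behind that guarantee is: accept a draft token $x \sim q$ with probability

$$\min\!\left(1, \frac{p(x)}{q(x)}\right)$$

and on rejection resample from $\mathrm{norm}\big(\max(0,\ p - q)\big)$, which makes the output distribution exactly the target model's $p$. Greedy (exact-match) verification is the simple special case of this.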
AdamDhahabi@reddit
The author says: mainline llama.cpp works fine too but you may see CUDA fallback warnings on q8 KV in some builds.
So it means no quality regression.
Awkward-Reindeer5752@reddit
Definitely need to try this out, thank you
MuDotGen@reddit
What is ik_llama.cpp?
shifty21@reddit
Fork of llama.cpp that handles certain aspects of CPU and system RAM offloading, plus other tweaks and customizations for special quantized GGUFs.
cviperr33@reddit
damn thats huge
ea_nasir_official_@reddit
10pp, 5tg UD IQ3_XXS
Amd 8845HS, 32GB 5600
sammcj@reddit
https://omlx.ai/benchmarks?chip=&chip_full=M5%7CMax%7C40&model=Qwen3.6+27&quantization=&context=&pp_min=&tg_min=
meca23@reddit
47 t/s on RTX 6000 Pro using Q8; you get more tokens at lower quants.
r0kh0rd@reddit
This is far too low for the RTX 6000 Pro. I am getting way higher numbers. Try this:
You should be getting >60 tok/s TG for single user and >350 tok/s aggregate TG for multi-user (>16 concurrent).
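A sketch of the kind of launch that gets there (flags illustrative; model name and limits are assumptions, adjust for your setup):
```
# Representative vLLM launch tuned for multi-user throughput on a big-VRAM card
vllm serve Qwen/Qwen3.6-27B-FP8 \
  --max-model-len 65536 \
  --kv-cache-dtype fp8 \
  --gpu-memory-utilization 0.90 \
  --max-num-seqs 32 \
  --max-num-batched-tokens 8192
```
Aggregate TG keeps scaling with concurrency until the card goes compute-bound, which is why the multi-user number is so much higher than single-stream.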
kaliku@reddit
My man, you took me from 30 tps (which is not bad, eh? or that's what I was thinking) to 50 tps. These days I am experimenting with an automatic coding harness; if stable, this is going to have a big impact for me.
So big thanks to you!
How do you know all of this? Are you a hobbyist like 99% of us, or do you work with it?
EbbNorth7735@reddit
Does "language model only" just disable vision support? What do you get with vision?
kaliku@reddit
Yes. With vision you can give it a picture as base64 (supported by the OpenAI API format) and it can interpret it. It can only output text though.
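For example, against an OpenAI-compatible endpoint (port, model name, and image file are placeholders):
```
# Send a base64-encoded image using the OpenAI chat completions format
IMG=$(base64 -w0 photo.png)   # on macOS: base64 -i photo.png
curl -s http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "Qwen/Qwen3.6-27B",
    "messages": [{
      "role": "user",
      "content": [
        {"type": "text", "text": "Describe this image."},
        {"type": "image_url", "image_url": {"url": "data:image/png;base64,'"$IMG"'"}}
      ]
    }]
  }'
```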
EbbNorth7735@reddit
I meant what speeds or context would you get with vision enabled
kaliku@reddit
Same perceived speed, just a bit more VRAM usage.
r0kh0rd@reddit
About the same. It saves like 800MB of VRAM. That's all.
o0genesis0o@reddit
How fast is pp on that card? It would matter more in agentic coding.
r0kh0rd@reddit
> 2500 tk/s PP
o0genesis0o@reddit
Wow, 40k t/s pp is incredible.
If the model can do some sort of planning, and can implement code changes as decently as Sonnet-class models, it would make the RTX 6000 very tempting. I'm getting sick of the MiniMax and GLM coding plans timing out due to high load and bugging me to pay for a higher tier.
Maybe I'll rent an rtx6000 on runpod and see how it goes.
Ok-Internal9317@reddit (OP)
How is the PP looking?
mxmumtuna@reddit
prompt processing is so fast on Blackwell cards it's sort of silly to try to measure it. It's in the 10s of Ks per second.
teachersecret@reddit
I was in the mid-70s t/s on 3.6 27b on my 4090 today, but that was in VLLM with MTP=3 and a bunch of fiddling, and I wasn't able to do that with a large context window. Here's my last run: output_tok_s_est_decode_only: 72.28
I'm trying to adjust it to get further, I think I can get it up over 100t/s generation speed if I tweak/get turboquant working, but we'll see. I'm currently compiling flashinfer, again.
Once this thing properly has MTP and some kind of turboquant integrated for llama.cpp/vllm without needing a ton of extra nonsense, it will be much more usable.
andy2na@reddit
Try these tweaks: https://github.com/noonghunna/qwen36-27b-single-3090#known-issue-tool-calling--mtp--turboquant-kv
teachersecret@reddit
Tried a bunch, couldn't get turboquant working properly; it kept just repeating the same first word. Gave up on it for now, I'll come back in a few days when people have spent time knocking all the bugs/rust off it :).
andy2na@reddit
You have to use fp8 cache and not turboquant. It will lower the max context but retain speeds. With vision support and fp8 cache, I can still do 65k context. Removing vision, you can do 75k.
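In vLLM terms that's just (real flag; model path is a placeholder):
```
# fp8 KV cache instead of turboquant; trades max context for stability
vllm serve <your-model> --kv-cache-dtype fp8 --max-model-len 65536
```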
Important_Quote_1180@reddit
27B Local Inference on Single RTX 3090
qwen3.6-27B-AutoRound (INT4), vLLM 0.19.2rc1.dev21, 24GB VRAM. 71–83 tok/s after warmup.
• Turboquant 3-bit NC KV Cache: compresses KV state to 3-bit non-uniform quantization. Enables a 125K context window within 24GB VRAM without OOM.
• MTP n=3 Speculative Decoding: three auxiliary heads draft tokens per forward pass, verified atomically against the main head. ~3× throughput multiplier vs. non-speculative baselines.
• Cudagraph PIECEWISE Mode: captures only attention-op boundaries instead of full-graph replay. Eliminates degenerate repetition loops caused by stale MTP state in FULL_AND_PIECEWISE mode on multi-GPU hosts.
• Chunked Prefill + Prefix Caching: max-num-batched-tokens=4121 with max-num-seqs=1. First post-restart request incurs ~29s cudagraph compilation; subsequent requests stabilize at 12-14s for 1024-token generation.
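A sketch of that launch using stock vLLM flags; the 3-bit Turboquant KV option and the "mtp" method name are not stock vLLM, so treat those lines as assumptions:
```
# Sketch of the single-3090 setup described above
# (fp8 KV is a stand-in for the 3-bit turboquant cache; "mtp" method name is assumed)
vllm serve qwen3.6-27B-AutoRound \
  --max-num-batched-tokens 4121 \
  --max-num-seqs 1 \
  --kv-cache-dtype fp8 \
  --compilation-config '{"cudagraph_mode": "PIECEWISE"}' \
  --enable-prefix-caching \
  --speculative-config '{"method": "mtp", "num_speculative_tokens": 3}'
```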
J_m_L@reddit
Dang so you can get way better performance with vLLM?
Important_Quote_1180@reddit
Until TurboQuant comes to llama.cpp I use both, depending on the task.
ddog661@reddit
What did you get without speculative decode? I am getting around 33 tok/sec with AWQ-int4.
teachersecret@reddit
In the mid 40s I think? I’ll eyeball later.
Optimal-Bass-5246@reddit
Following this article:
https://www.reddit.com/r/LocalLLaMA/comments/1stjx29/an_overnight_stack_for_qwen3627b_85_tps_125k/
I was able to get 155tps with 258K context window on 1x RTX 5090.
=== Warmup (3x) ===
w1 comp=1000 wall=19.42s 51.49 TPS
w2 comp=1000 wall= 8.11s 123.30 TPS
w3 comp=1000 wall= 8.46s 118.20 TPS
=== Narrative (3x, 1000 tok) ===
narr1 comp=1000 wall= 8.38s 119.33 TPS
narr2 comp=1000 wall= 8.13s 123.00 TPS
narr3 comp=1000 wall= 8.06s 124.07 TPS
=== Code (2x, 800 tok) ===
code1 comp=692 wall= 4.44s 155.86 TPS
code2 comp=462 wall= 3.05s 151.48 TPS
=== GPU state ===
0, 92 %, 29997 MiB, 32607 MiB, 402.53 W, 63
=== Last 3 SpecDecoding metrics (MTP accept) ===
(APIServer pid=1) INFO 04-25 14:10:16 [metrics.py:101] SpecDecoding metrics: Mean acceptance length: 2.60, Accepted throughput: 72.50 tokens/s, Drafted throughput: 136.20 tokens/s, Accepted: 725 tokens, Drafted: 1362 tokens, Per-position acceptance rate: 0.782, 0.533, 0.282, Avg Draft acceptance rate: 53.2%
(APIServer pid=1) INFO 04-25 14:10:26 [metrics.py:101] SpecDecoding metrics: Mean acceptance length: 2.71, Accepted throughput: 76.79 tokens/s, Drafted throughput: 134.99 tokens/s, Accepted: 768 tokens, Drafted: 1350 tokens, Per-position acceptance rate: 0.782, 0.564, 0.360, Avg Draft acceptance rate: 56.9%
(APIServer pid=1) INFO 04-25 14:10:36 [metrics.py:101] SpecDecoding metrics: Mean acceptance length: 2.97, Accepted throughput: 89.39 tokens/s, Drafted throughput: 135.89 tokens/s, Accepted: 894 tokens, Drafted: 1359 tokens, Per-position acceptance rate: 0.837, 0.647, 0.490, Avg Draft acceptance rate: 65.8%
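Reading those metrics: the per-position rates are cumulative acceptance probabilities, so the mean acceptance length works out to the one guaranteed target-model token plus their sum. For the last line:

$$1 + 0.837 + 0.647 + 0.490 \approx 2.97$$

which matches the reported 2.97, and explains why TPS tracks acceptance so closely.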
FullOf_Bad_Ideas@reddit
600 t/s PP
150 down to 30 t/s TG depending on task and context length.
8x 3090 Ti, BF16 model, with DFlash from Qwen 3.5 27B, SGLang with TP 8
grunt_monkey_@reddit
How do you run qwen 3.5 397b on 8x3090? Is it with smaller quant or cpu offload?
FullOf_Bad_Ideas@reddit
EXL3 quant cooked by me (cpral 3.536bpw on this table) or mratsim - https://huggingface.co/mratsim/Qwen3.5-397B-A17B-EXL3
I'll try to make better custom quants later but I'm playing with Hermes 4 405B quanting right now.
mtasic85@reddit
Nvidia RTX 3090 24GB / CUDA 12.9.1
llama-server version: version: 8929 (9d34231bb)
Unsloth Qwen3.6-27B Q4_K_M -ctk/v q5_0
21.86GiB / 24.00GiB VRAM
35 t/s tg
250 t/s tg, with speculative decoding (default `--spec-default` == `--spec-type ngram-mod --spec-ngram-size-n 24 --draft-min 48 --draft-max 64`)
```
CUDA_VISIBLE_DEVICES=0 ./llama-server -hf unsloth/Qwen3.6-27B-GGUF:Q4_K_M -ngl -1 -fa on -fit off --metrics --props --slots --host 0.0.0.0 --port 8080 -dev CUDA0 --temp 0.7 --top-p 0.8 --top-k 20 --min-p 0.0 --presence-penalty 1.5 --repeat-penalty 1.0 --reasoning off --alias "Qwen/Qwen3.6-27B" -c 262144 -ctk q5_0 -ctv q5_0 --spec-default --no-mmproj-offload -b 1024 -ub 256
```
Kindly-Cantaloupe978@reddit
~80 tps on RTX 5090 using vllm 0.19 with 218k context window and MTP enabled
model is this: https://huggingface.co/sakamakismile/Qwen3.6-27B-Text-NVFP4-MTP
recipe can be found at my post on Qwen3.5-27B: https://www.reddit.com/r/LocalLLaMA/comments/1sr8gyf/qwen3527b_on_rtx_5090_served_via_vllm_77_tps/
Exact-Cupcake-2603@reddit
AMD MI50 x4: pp 330, tg 18
Legal-Ad-3901@reddit
https://github.com/larkinwc/mi50grad 56.33
jriggs28@reddit
That is amazing!
Exact-Cupcake-2603@reddit
Pp is not so good haha
Exact-Cupcake-2603@reddit
Awesome
suprjami@reddit
Triple RTX 3060 12GB, power limited down to 125W
280 pp
12-14 tg
Evgeny_19@reddit
According to podman's logs my Radeon 9700 Pro runs Q5_K_XL with PP from 80 to 670, TG around 17-18.
karimusben@reddit
I've got 9.5 t/s with ROCm/Vulkan on Ubuntu, can you share your config?
Evgeny_19@reddit
I run the latest (well, I pulled updates a few days ago) llama.cpp/ROCm combination via podman. The options are these: -ngl 99 -fa 1 -c 131072 -b 2048 -ub 512 -ctk q8_0 -ctv q8_0 --spec-type ngram-mod --spec-ngram-size-n 24 --draft-min 4 --draft-max 48 --temp 0.6 --repeat-penalty 1.0 --top-p 0.95 --top-k 20 --min-p 0.0 --presence-penalty 0
In llama-bench the results would be different. The numbers I posted are from a real opencode session. I never saw anything as low as 80 PP in llama-bench, yet I saw it in podman's logs on a real task. That was an exception though; I only saw it once. It usually stays well above 300. It should probably be possible to run Q6 variants with the same context at -ctk q4_0 -ctv q4_0, but I haven't tried it yet.
dametsumari@reddit
M5 pro 8 tk/s tg, pp 250 ish. Too slow to be useful.
Dany0@reddit
That doesn't seem right to me, I get 10-15 tok/s decode and pp in that ballpark on an M3 Max.
dametsumari@reddit
Max is 2x Pro (of the same generation) in tg. My number is also Q8, how about yours?
Dany0@reddit
Not for the M5 Pro, which is the same chip as the M5 Max, though he may have 2 fewer cores.
dametsumari@reddit
What cores are you talking about?
Memory bandwidth is 2x in all Maxes, and that matters for tg. For pp, the Max has twice the GPU cores of the Pro, also in M5 (20 vs 40).
Dany0@reddit
Memory bandwidth, yes (with caveats: it actually has more memory controllers, not, say, higher MT/s; it's like increasing bus width, not speed, so theoretical bandwidth does scale, but in practice if your workload isn't easily chunked/batched you might see drastically different numbers), but the chip itself is the same between the M5 Max and M5 Pro. There's just a binned M5 Pro variant with 2, or actually I think it had 3, fewer cores? Doesn't matter.
The main difference between the M5 Max and Pro is the GPU.
dametsumari@reddit
Total bandwidth is what matters in inference, as you need to go through all active parameters per token. Because of that, old multi-channel Xeons are surprisingly good: aggregated bandwidth with e.g. 12 channels comes to 250+ GB/s, and that works well with e.g. big DeepSeek MoE models with few active parameters and large total model size (hundreds of gigabytes).
Dany0@reddit
I just looked it up and my M3 Max has 409 GB/s theoretical while the M5 Pro has 307 GB/s. So it still does not explain the difference.
dametsumari@reddit
Oh? 8 × 409/307 is a bit over 10. And you said 10-15.
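Decode is bandwidth-bound, so to first order TG scales with memory bandwidth:

$$\text{TG} \approx \frac{\text{memory bandwidth}}{\text{bytes read per token}} \quad\Rightarrow\quad 8 \times \frac{409}{307} \approx 10.7 \text{ t/s expected on the M3 Max}$$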
Dany0@reddit
M5 series has much improved prefill speeds, he should be getting more than 250 pp
Sunknowned@reddit
5.5 tps 💀
Necessary-milkyway@reddit
I got a max token count of 112 tok/sec with 192k context. The average for me was around 50-60 tok/sec. I ran it the entire day doing coding and went through 100 million tokens total: 99 million input and 1 million output. This is running the model in vLLM NVFP4 on an Asus Ascent GX10.
RiskyBizz216@reddit
66 tok/s on the RTX 5090 in LM studio
shansoft@reddit
What quant are you using? I am using llamacpp and only getting around 50 tok/s with unsloth UD5 XL
Dany0@reddit
50tok/s sounds like you might be spilling into system ram a little
Without spec decoding/mtp I would get a pretty steady 70tok/s on mine in both vllm/llamacpp. Though vllm gets much faster prefill speeds
shansoft@reddit
I retested again, it is indeed 50 tok/s under Arch with UD_Q5_K_XL. No spill over. But somehow in Windows with same setting, I am getting 60 tok/s. Something doesn't feel right. For Q4_K_M I can indeed run around 66-70 tok/s.
Dany0@reddit
nvidia closed drivers or open?
RiskyBizz216@reddit
Q4_K_M GGUF
https://huggingface.co/Jackrong/Qwen3.6-27B-GGUF
OSHAHazard@reddit
288 tok/s PP and 28 tok/s TG at 77k context on a 7900XTX
genpfault@reddit
Same GPU:
NullFlexZone@reddit
That's pretty low. I am getting 45 t/s in a quick test in LM Studio with the same size context.
CardinalRedwood@reddit
I was also hoping to get around this number. Mind sharing your setup?
noctrex@reddit
I get on mine:
ROCm: ~290 pp / 20 tg
Vulkan: ~550 pp / 35 tg
Ok-Internal9317@reddit (OP)
Hmm, that's not as quick as I might have thought.
Force88@reddit
2x 5060ti 16gb, q6, 14t/s
SpaceTraveler2084@reddit
can we expect a qwen3.6:14b?
Fluffywings@reddit
Unlikely based on 3.5 and the poll Alibaba put out for 3.6 on sizes.
aparamonov@reddit
7900 XT, 20GB VRAM, light overclocking and undervolting with a 300W power limit, minimal Linux UI. 110k q8/q8 context, Q4_K_XL quant. I sacrificed ub and pp for more context (ub 256).
I get about 30-33 tg and about 720 pp in llama-server on Vulkan. ROCm is much slower for dense Qwen models on my system.
sirmc@reddit
Intel Arc Pro B70 using llama.cpp with various sycl PR patches merged in:
llama-benchy:
PP: 684 ± 18.60, TG: 21.45 ± 0.02
lawldoge@reddit
Tore down the setup I was using, but it was in the 14-15 tok/sec ballpark on llama.cpp with UD-Q5_K_XL and UD-Q8_K_XL. Context was set to 128K, but that rate only lasted up to about 50K or so, at which point it absolutely tanked; it's in the 5-6 tok/sec range before it gets to 100K. ASRock B70s, without the split on the Q5 and split across 2 cards on the Q8.
ea_man@reddit
Hey can I ask what quant you are using? As Q4_K_M...
sirmc@reddit
Running UD-Q4_K_XL
ea_man@reddit
Thanks, that's like TG ~24 for Q4_K_M, which is fair; PP is pretty good with that kind of RAM.
I guess that is concurrent sessions in vLLM, not a single user prompt in llama-server, right?
sirmc@reddit
Single user in llama.cpp (sycl), but I had to apply a number of open PRs against the llama.cpp repo to get a decent performance.
ea_man@reddit
Well congrats, it's looking way better than the early reports.
20 TG is usable, and it's a dense model; that GPU is probably better suited to the 35B MoE. Yet multiple agents with vLLM will give some more tokens.
Simple_Library_2700@reddit
2000 t/s pp and ~80 t/s tg
4xV100
Dany0@reddit
Holy shit I didn't expect this from the venerable V100! I drooled so hard over those cards when they were new
What's your total system power consumption?
Simple_Library_2700@reddit
If I load it up with concurrent requests it can draw around 1200w total which is kind of killing the value proposition of these lol.
Ok-Internal9317@reddit (OP)
do you have nvlink?
Simple_Library_2700@reddit
Yes 4 way
uniVocity@reddit
MacBook Pro M4 Max, 128GB of RAM. 12 tok/s in LM Studio; doesn't matter if I use MLX Q8 or GGUF Q8. No special settings, just downloaded and ran the model(s).
Takes a long time to start answering, an average of 5 minutes for each prompt.
StateSame5557@reddit
I get 20 tok/sec in LMStudio on a MBP M4
stuchapin@reddit
1x 3090: 41 t/s, Q4.
DeepBlue96@reddit
3090: prompt processing at long context (80k) is around 800 t/s; generation 25 t/s.
Model: unsloth/Qwen3.6-27B:Q5_K_XL, max context 131k, KV cache q4_0.
eugene20@reddit
Q4_K_M, 41 tok/s on 4090. Went back to 35B A3B at just over 60, hoping there is something to speed it up.
cromagnone@reddit
47t/s on 4090 with 64k context, with Unsloth's Q4_K_M and Q8 KV on main llama.cpp. Need to try that ik_llama.cpp build.
YairHairNow@reddit
KV cache variants tested: q4_0 / q4_0 / turbo3 / q4_0. The 2-GPU setup is 5080+2080; it's only beneficial on the 35B MoE (22GB) to prevent offloading.
Zealousideal_Fill285@reddit
How did you manage to fit 196k ctx on 27B IQ4_XS?
YairHairNow@reddit
Turboquant, probably. I don't think I actually tested that beyond the benchmark. The 2x GPUs were handling max context on the 22GB 35B though.
MalabaristaEnFuego@reddit
Important_Quote_1180@reddit
I've been spending too much time staring at GPU util graphs, but here's the thing: running a 27B local model on a single consumer GPU in 2026 is like trying to cook a ten-course tasting menu in a one-burner kitchen. You don't need a bigger kitchen. You need to stop being wasteful.
The rig: RTX 3090 (24GB), AMD 9900X, 192GB DDR5. Nothing exotic. The kind of box a mid-tier game dev would run on. The model is qwen3.6-27B-AutoRound (INT4 quantized), served by vLLM.
Here are the tricks that make this actually work instead of choking at 12 tokens a second:
TURBOQUANT 3-BIT KV Cache. This is the move nobody talks about. Instead of stashing every attention computation in full 16-bit precision (the safe, default choice), we compress the KV cache to 3-bit. It's like storing your recipes on cocktail napkins instead of index cards. You think you'll lose something. You don't. On a 3090 with 24GB, this is the difference between "fits" and "OOM-killed by your own ambition."
MTP — Multi-Token Prediction. vLLM speculates the next three tokens using auxiliary heads, then verifies them in one pass against the main model. It's like hiring three sous-chefs to prep ingredients while you plate. When it works — and this is the crucial part, when it works — you get roughly triple the throughput because three speculative tokens get accepted per forward pass instead of one. We're seeing 71 to 83 tokens per second. That's not "usable." That's fast.
Cudagraph mode PIECEWISE, not FULL. This was the trap. The default FULL_AND_PIECEWISE captures complete execution graphs and replays them. On this machine, with a 1660 Ti also plugged in (legacy display adapter, I know, don't look at me like that), FULL capture poisoned the speculative decoding. The model started outputting the same token in a loop — 100% MTP acceptance, zero intelligence. Switched to PIECEWISE, which only captures attention-operation boundaries. No more garble. No more repetition loops. Just clean, fast inference.
The warmup tax: First request after a restart takes ~43 seconds for 1000 tokens because cudagraphs compile on the fly. After that? 14 seconds. Subsequent runs? 12 seconds. The kitchen warms up. Don't panic.
The first time I sat down at my mother-in-law's stove and realized I didn't need a perfect recipe, just a good one — that's this feeling. You don't need an H100. You need to stop being a tourist.
kapteinpyn@reddit
Two R9700s with UD-Q8_K_XL, llama.cpp Vulkan: PP 400 and TG 16.
Intersteller-2002@reddit
Qwen3.6-35B-A3B-GGUF
Average around 21 tok/s
datbackup@reddit
Read the prompt again, Opus :P
mxmumtuna@reddit
~100tps@64k / MTP=3 - 2x RTX Pro 6000
Dany0@reddit
So jealous. What a beast of a card. What's your pp speed?
mxmumtuna@reddit
I mentioned it in another thread. It’s hard to really measure because it’s so fast. 10s of k per second usually.
Dany0@reddit
😭🤤🤤 man I uh I need to collect myself
now go make us a good coding finetune with it that us gpu poors can use >:(
mxmumtuna@reddit
They’re basically just jumbo 5090s, but it’s nice to have 4 of them.
Dany0@reddit
Why are you replying to me go get synthetic data, distill deepseekv4 reasoning, convert donald knuth books into jsonl. Do whatever it takes. I believe in you, you alone will accidentally stumble upon AGI by fine-tuning Qwen3.6 27B
mxmumtuna@reddit
My bad wrong thread.
nmqanh@reddit
PP 197.8 · TG 21.0 tok/s, M2 Max 96GB 38-core. Qwen3.6-27B-4bit-mlx-fp16
Zestyclose_Leek_3056@reddit
70 tok/s on a 5090 + Threadripper 9060X in LM Studio
Q8 KV cache quantization, max context window
GregoryfromtheHood@reddit
4090+3090+3090+5070ti
~700-1000 pp ~18-25 tg
Winter_Tension5432@reddit
I hit 112 tk/s with 1.3k prompt processing with MTP enabled, at INT4 on vLLM, on 3x 5060 Ti 16GB + a 4070 Ti Super 16GB. But tool calling got destroyed, so I disabled MTP and now I am at 64 tk/s with the same prompt processing. This is at 256k context.
Asleep-Land-3914@reddit
2ts with 16GB GPU 😂
bobaburger@reddit
I have 16GB VRAM; I had to go all the way down to Q4_K_S or Q3_K_XL + KV cache quant (either q4 or turbo4) to get above 10 t/s (for tg; pp was 150-400 t/s). And with this, the quality is sooooo bad, worse than 35B-A3B at Q5. I guess it's not a thing for us GPU-poor.
KvAk_AKPlaysYT@reddit
120 s/tok
Tormeister@reddit
Between 78 and 200 tok/s, depending on MTP acceptance %
vLLM, 5090
Haeppchen2010@reddit
140 pp / 8 tg on RX 7800 XT plus RX 580 (Q5_K_M). But 35B is soooo good and more than twice as fast (400 pp / 40 tg), so I will stay with 35B for now, until I can replace the slow RX 580.
No_Conversation9561@reddit
Prefill: 2000 t/s Decode: 25 t/s
Unsloth Q4_K_XL on 5060Ti + 5070Ti
No_Information9314@reddit
18 t/s Q4km on dual Rtx 3060
CrushingLoss@reddit
I get around 10 tok/s through opencode, 15 or so raw. Mac Studio M2 Max, 96GB.
chimph@reddit
I keep getting looping issues with Qwen3.6 models in opencode. Not sure what’s going on
gusbags@reddit
Qwen/Qwen3.6-27B-FP8 on dual Asus GX10 spark cluster, with dflash. PP: 2500+, TG: up to 57 when dflash acceptance does well, but around 40t/s on average.
g_rich@reddit
How are you running the model and what are your settings?
gusbags@reddit
This is the recipe I am using, via eugr's vllm docker builds (https://github.com/eugr/spark-vllm-docker):
g_rich@reddit
Thanks
Weary_Long3409@reddit
I'm on a 3060: 27B IQ4_XS @ 20 t/s; 35B-A3B IQ4_XS @ 82 t/s
Weary_Long3409@reddit
20 t/s on 2x3060
UniForceMusic@reddit
8-10 tps on Macbook M2 Max 64GB 14 inch
Q4_K_M model
viperx7@reddit
Qwen3.6 27B FP8, vLLM. Hardware: 4090 + 3090 Ti. MTP: 16. Ctx: 125k.
Speed varies: 85 t/s average, 50 t/s at worst, up to 141 t/s peak.
I am still wondering if increasing MTP to this extent is even a good idea or not (I don't see any disadvantage).
fulgencio_batista@reddit
62.5t/s tg512, ~1000t/s pp2048 on dual rtx5060ti with Qwen3.6-27b-NVFP4 on vLLM using 3 speculative tokens with MTP
gogitossj3@reddit
Can you show your config?
fulgencio_batista@reddit
This was for 3.5, but it’s the same for 3.6
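The shape of it in vLLM looks like this (the "method" value is an assumption; check the speculative config docs for your vLLM version):
```
# Sketch of an MTP speculative-decoding launch on 2 GPUs ("mtp" method name assumed)
vllm serve Qwen3.6-27b-NVFP4 \
  --tensor-parallel-size 2 \
  --speculative-config '{"method": "mtp", "num_speculative_tokens": 3}'
```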
Tunashavetoes@reddit
Q5 on M1 Max 10/24 core 64gb ram: 8tps
ea_man@reddit
AMD 6700xt using llama.cpp with vulkan, IQ3_XXS
PP: 160
TG: 23tok/sec
Context q4: 50-85k depending on which desktop I use :P
Linkpharm2@reddit
1500 in / 35-40 out, 4080
Economy_Cabinet_7719@reddit
mxfp4 15tk/s on m4 pro
Opteron67@reddit
130 TG FP8 dual 5090
Puzzleheaded-Drama-8@reddit
37t/s at 16k, 35t/s at 32k on 7900XTX, vulkan, q4_k_m
DeltaSqueezer@reddit
Why not run the 35B?
Ok-Internal9317@reddit (OP)
I think I misconfigured; all 4 GPUs are almost idle. I'll update the post once I have figured it out.
misha1350@reddit
Try Qwen 3 Next 80B as well, since you have a total of 48GB of VRAM; you can at least use Q3_K_M with a big enough context window to run some workloads that require better internal knowledge. For some cases, neither Qwen3.5 35B A3B nor the benchmaxxed Qwen3.6 35B A3B (which has worse internal knowledge in my testing) is cutting it for me, so I also use Qwen 3 Next 80B because it's a sparse MoE model with good speed and I have enough memory for it to fit.
Beamsters@reddit
oMLX, oQ4 FP16: got like 17 t/s and 150 pp/s.
M1 Max 32GB.
The result however is much better than the quantized 35B-A3B.
getmevodka@reddit
Can't right now, I'm on vacation, but if it's similar to Qwen 3.5 27B in Q8_K_XL then about 33 tok/s.
SnooPaintings8639@reddit
20 tps q8 under llama.cpp, 25-30 tps under vllm. Got 100 tps with Qwen 35B.
2 x 3090 RTX
Altruistic_Heat_9531@reddit
Unsloth UD Q5 Llamacpp release b8920
- 3090 96G
- DDR4 RAM 2400 Mhz
- Xeon 2690v4
- Ubuntu 22.04
- PCIe gen 3 x16
64K 20 tok/s
128K 10 tok/s
Conscious_Chef_3233@reddit
110 tps sglang
sjoerdmaessen@reddit
78 t/s fp8 l40s
Finanzamt_Endgegner@reddit
20-ish t/s TG at 100k context, IQ4_XS, on 4070 Ti 12GB + 2070 8GB.
PP is around 1000 I think.
Special-Lawyer-7253@reddit
250 pp / 6.5 t/s.
GTX 1070 8GB VRAM, 32GB RAM, i7-6700HQ
Creative-Regular6799@reddit
I tried it now and I'm getting 4 tok/s. Not usable, unfortunately.
abmateen@reddit
29 tok/s Q4 V100 32GB single GPU