Post Your Qwen3.6 27B speed plz
Posted by Ok-Internal9317@reddit | LocalLLaMA | View on Reddit | 191 comments
Mine is Tesla M40 12GB*4, fp4:
26tok/s PP
8tok/s TG
This is out of reach for me, I'll wait for the 9B
My_Unbiased_Opinion@reddit
you should try Qwen 3.6 35B. It will be way faster and is still pretty damn good.
Money_Hand_4199@reddit
35b is much worse than 27b in coding and agentic tasks
-dysangel-@reddit
M3 Ultra 512GB
Money_Hand_4199@reddit
What model quant is used? Is this in vllm or llama.cpp? oMLX?
-dysangel-@reddit
I think it was Q4, llama.cpp
I've just tried mlx ( unsloth/Qwen3.6-27B-UD-MLX-4bit ) and I get:
Money_Hand_4199@reddit
AMD Strix Halo 128GB:
15-22 TG with vllm 0.19.2rc and qwen3.6 27B GPTQ int4 models
vllm with 64 parallel requests gives ~280 t/s total throughput
dobkeratops@reddit
m3-ultra mac studio: llama.cpp, Qwen3.6-27B-Q8_0: 21 tokens/sec generation at start of context (0-4000 tokens)
324-424 tokens/sec prompt processing bringing a text file into the context
at 20,000+ context, 19.7 tokens/sec after that file was ingested.
Money_Hand_4199@reddit
try omlx instead of vllm-mlx
CatalyticDragon@reddit
Single AMD Radeon R9700 with ROCm 7.2.
Prompt eval: 2079.68 tokens per second
Eval: 66.5 tok/s
putrasherni@reddit
Awesome work.
I'd love it if you went further, up to 32K and 64K context lengths, but from what I read
I would conclude that ROCm is now better than Vulkan when using dual R9700, especially for larger context lengths?
CatalyticDragon@reddit
Settings:
-ngl 99 -fa 1 -ctk q8_0 -ctv q8_0 -sm layer (default, layer split across both GPUs) · 1 repetition, no warmup
pp = prompt processing (t/s) · tg = token generation (t/s)
Gemma-4-26B-A4B (MoE, 128 experts, 8 active)
Qwen3.6-27B (Dense)
Qwen3.6-35B-A3B (MoE, ~35B total, 3B active)
putrasherni@reddit
I guess I'm going to stay with my llama.cpp + Vulkan build.
I really hoped ROCm would break into sustained 2K pp at 100k+ context and 120+ tok/sec.
I guess it will come eventually.
here are my stats for reference,
flags used
export GGML_VK_VISIBLE_DEVICES=2,1
export GGML_VK_ALLOW_GRAPHICS_QUEUE=1
my run params for llama-bench are
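roughly like the following (model path and test sizes here are placeholders, not an exact run):
```
# Representative llama-bench run over Vulkan on two GPUs
# -p = prompt lengths to test (pp), -n = generation length (tg)
export GGML_VK_VISIBLE_DEVICES=2,1
export GGML_VK_ALLOW_GRAPHICS_QUEUE=1
./llama-bench -m qwen3.6-27b-q4_k_m.gguf \
  -ngl 99 -fa 1 -sm layer \
  -p 512,4096,16384 -n 128
```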
putrasherni@reddit
Thanks my man
Best_Control_2573@reddit
Which quant? That sounds like 35b numbers...
CatalyticDragon@reddit
Oh you're absolutely right. Sorry for the confusion. Shall retest later.
mestrade78@reddit
A4500 blackwell 32GB - 40 t/s
zannix@reddit
is this fp8?
rbit4@reddit
8480 pp/s and 1264 tps, fp4. Dual 5090.
Ok-Internal9317@reddit (OP)
Combined throughput, not single-stream throughput, right?
rbit4@reddit
It's 2795 tps at c=64 on dual 5090, and single-stream is 82.8.
AdamDhahabi@reddit
I found a crazy claim: 192K context at 152 t/s on Qwen3.6-27B, single RTX 4090.
Q4_K_M + ik_llama.cpp + speculative decoding using Qwen3-1.7B as the draft model.
https://x.com/outsource_/status/2047660565170909555
Apprehensive-View583@reddit
Just tried exactly that: the vocab does not match and yields lower tps. I'm getting 9 tps with the same draft model; I think he is making it up.
I think the ones actually working are the dtree/dflash style setups, like lucebox or d-flash; they all have their own draft model and kernel, not what this guy is claiming.
Nowadays some of these Twitter posts are just straight-up made up.
I got 40 tps with just a 3090, and this slowed it down 3x due to vocab mismatch → token translation overhead.
themule71@reddit
Always download the very latest model releases. Vocab should match IIRC.
QuinsZouls@reddit
Same result with an RX 9070 and turboquant: no spec dec at 30 tps, but after enabling it using Qwen 3.5 0.8B I got 6 tps with a 100% acceptance rate.
robogame_dev@reddit
Look at his screenshot, he’s not making it up - he’s asking his LLM and his LLM is making it up!
EveningIncrease7579@reddit
I get the same results. Tried with many different parameters and didn't get more than 40 t/s.
eugene20@reddit
Because they're using a 1.7B for speculative decoding.
Altruistic_Heat_9531@reddit
Wait a minute, 192K context but it's -c 8192? What??
ArtfulGenie69@reddit
I linked this in my response, but this should help for actual speed on a 3090. Using a draft model never works. Using vllm and turboquant to fit it to the card will work though, and the 3090 has INT4 processing I think, so AWQ is really fast.
https://medium.com/@fzbcwvv/an-overnight-stack-for-qwen3-6-27b-85-tps-125k-context-vision-on-one-rtx-3090-0d95c6291914?postPublishedType=repub
ArtfulGenie69@reddit
Using some other draft model is probably BS; not sure, but just about none of the various Qwen models pair well for this. The heavy lifting is probably the speed of the card, the 4-bit quant, and the speculative decode.
Someone did a similar test with a 3090 on vllm and was able to get it to 85 t/s with MTP; if dflash worked it would probably be over 100 t/s. https://medium.com/@fzbcwvv/an-overnight-stack-for-qwen3-6-27b-85-tps-125k-context-vision-on-one-rtx-3090-0d95c6291914?postPublishedType=repub
andy2na@reddit
Doesn't work for me; it rejects most drafts and tanks t/s down to 15 t/s. Also tried Qwen3-1.7B.
--model /models/qwen35/Unsloth_Qwen3.6-27B-IQ4_NL.gguf -ngl 99 --ctx-size 65536 -md /models/qwen35/Qwen3-0.6B.gguf -ngld 99 -cd 4096 --draft-max 20 --draft-min 5 --draft-p-min 0.55 --cache-type-k q8_0 --cache-type-v q8_0
Ok-Internal9317@reddit (OP)
I'll try this out, damn this is cool
AdamDhahabi@reddit
The author says: mainline llama.cpp works fine too but you may see CUDA fallback warnings on q8 KV in some builds.
Ok-Internal9317@reddit (OP)
I just tried it, went from 6.6 down to 5.5, so I guess it doesn't work for me...
Apprehensive-View583@reddit
The author is most likely lying. I tried the exact setup on my 3090; I get 1/3 the speed compared to without speculative decoding.
Awkward-Reindeer5752@reddit
I wonder how this impacts code generation quality? I’d be surprised if the correct next token is consistently amongst the draft model suggestion set
emprahsFury@reddit
Speculative decoding produces mathematically guaranteed correct choices; the large model verifies the drafted token(s).
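In the sampling formulation (per the original speculative decoding papers), the rule behind that guarantee is: accept a draft token $x \sim q$ with probability

$$\min\!\left(1, \frac{p(x)}{q(x)}\right)$$

and on rejection resample from $\mathrm{norm}\big(\max(0,\ p - q)\big)$, which makes the output distribution exactly the target model's $p$. Greedy (exact-match) verification is the simple special case of this.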
AdamDhahabi@reddit
The author says: mainline llama.cpp works fine too but you may see CUDA fallback warnings on q8 KV in some builds.
So it means no quality regression.
Awkward-Reindeer5752@reddit
Definitely need to try this out, thank you
MuDotGen@reddit
What is ik_llama.cpp?
shifty21@reddit
Fork of llama.cpp that handles certain aspects of CPU and system RAM offloading, plus other tweaks and customizations for special quantized GGUFs.
cviperr33@reddit
damn thats huge
ea_nasir_official_@reddit
10pp, 5tg UD IQ3_XXS
Amd 8845HS, 32GB 5600
sammcj@reddit
https://omlx.ai/benchmarks?chip=&chip_full=M5%7CMax%7C40&model=Qwen3.6+27&quantization=&context=&pp_min=&tg_min=
meca23@reddit
47 t/s on RTX 6000 Pro using Q8; you get more tokens at lower quants.
r0kh0rd@reddit
This is far too low for the RTX 6000 Pro. I am getting way higher numbers. Try this:
You should be getting >60 tok/s TG for single user and >350 tok/s aggregate TG for multi-user (>16 concurrent).
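A sketch of the kind of launch that gets there (flags illustrative; model name and limits are assumptions, adjust for your setup):
```
# Representative vLLM launch tuned for multi-user throughput on a big-VRAM card
vllm serve Qwen/Qwen3.6-27B-FP8 \
  --max-model-len 65536 \
  --kv-cache-dtype fp8 \
  --gpu-memory-utilization 0.90 \
  --max-num-seqs 32 \
  --max-num-batched-tokens 8192
```
Aggregate TG keeps scaling with concurrency until the card goes compute-bound, which is why the multi-user number is so much higher than single-stream.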
kaliku@reddit
My man, you took me from 30 tps (which is not bad, eh? or that's what I was thinking) to 50 tps. These days I am experimenting with an automatic coding harness; if stable, this is going to have a big impact for me.
So big thanks to you!
How do you know all of this? Are you a hobbyist like 99% of us, or do you work with it?
EbbNorth7735@reddit
Does "language model only" just disable vision support? What do you get with vision?
kaliku@reddit
Yes. With vision you can give it a picture as base64 (supported by the OpenAI API format) and it can interpret it. It can only output text though.
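For example, against an OpenAI-compatible endpoint (port, model name, and image file are placeholders):
```
# Send a base64-encoded image using the OpenAI chat completions format
IMG=$(base64 -w0 photo.png)   # on macOS: base64 -i photo.png
curl -s http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "Qwen/Qwen3.6-27B",
    "messages": [{
      "role": "user",
      "content": [
        {"type": "text", "text": "Describe this image."},
        {"type": "image_url", "image_url": {"url": "data:image/png;base64,'"$IMG"'"}}
      ]
    }]
  }'
```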
EbbNorth7735@reddit
I meant what speeds or context would you get with vision enabled
kaliku@reddit
Same perceived speed, just a bit more VRAM usage.
r0kh0rd@reddit
About the same. It saves like 800MB of VRAM. That's all.
o0genesis0o@reddit
How fast is pp on that card? It would matter more in agentic coding.
r0kh0rd@reddit
> 2500 tk/s PP
o0genesis0o@reddit
Wow, 40k t/s pp is incredible.
If the model can do some sort of planning, and can implement code changes as decently as Sonnet-class models, it would make the RTX 6000 very tempting. I'm getting sick of the MiniMax and GLM coding plans timing out due to high load and bugging me to pay for a higher tier.
Maybe I'll rent an rtx6000 on runpod and see how it goes.
Ok-Internal9317@reddit (OP)
How is the PP looking?
mxmumtuna@reddit
prompt processing is so fast on Blackwell cards it's sort of silly to try to measure it. It's in the 10s of Ks per second.
teachersecret@reddit
I was in the mid-70s t/s on 3.6 27b on my 4090 today, but that was in VLLM with MTP=3 and a bunch of fiddling, and I wasn't able to do that with a large context window. Here's my last run: output_tok_s_est_decode_only: 72.28
I'm trying to adjust it to get further, I think I can get it up over 100t/s generation speed if I tweak/get turboquant working, but we'll see. I'm currently compiling flashinfer, again.
Once this thing properly has MTP and some kind of turboquant integrated for llama.cpp/vllm without needing a ton of extra nonsense, it will be much more usable.
andy2na@reddit
Try these tweaks: https://github.com/noonghunna/qwen36-27b-single-3090#known-issue-tool-calling--mtp--turboquant-kv
teachersecret@reddit
Tried a bunch, couldn't get turboquant working properly; it kept just repeating the same first word. Gave up on it for now, I'll come back in a few days when people have spent time knocking all the bugs/rust off it :).
andy2na@reddit
You have to use fp8 cache and not turboquant. It will lower the max context but retain speeds. With vision support and fp8 cache, I can still do 65k context. Removing vision, you can do 75k.
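In vLLM terms that's just (real flag; model path is a placeholder):
```
# fp8 KV cache instead of turboquant; trades max context for stability
vllm serve <your-model> --kv-cache-dtype fp8 --max-model-len 65536
```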
Important_Quote_1180@reddit
27B Local Inference on Single RTX 3090
qwen3.6-27B-AutoRound (INT4), vLLM 0.19.2rc1.dev21, 24GB VRAM. 71–83 tok/s after warmup.
• Turboquant 3-bit NC KV Cache: compresses KV state to 3-bit non-uniform quantization. Enables a 125K context window within 24GB VRAM without OOM.
• MTP n=3 Speculative Decoding: three auxiliary heads draft tokens per forward pass, verified atomically against the main head. ~3× throughput multiplier vs. non-speculative baselines.
• Cudagraph PIECEWISE Mode: captures only attention-op boundaries instead of full-graph replay. Eliminates degenerate repetition loops caused by stale MTP state in FULL_AND_PIECEWISE mode on multi-GPU hosts.
• Chunked Prefill + Prefix Caching: max-num-batched-tokens=4121 with max-num-seqs=1. First post-restart request incurs ~29s cudagraph compilation; subsequent requests stabilize at 12-14s for 1024-token generation.
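A sketch of that launch using stock vLLM flags; the 3-bit Turboquant KV option and the "mtp" method name are not stock vLLM, so treat those lines as assumptions:
```
# Sketch of the single-3090 setup described above
# (fp8 KV is a stand-in for the 3-bit turboquant cache; "mtp" method name is assumed)
vllm serve qwen3.6-27B-AutoRound \
  --max-num-batched-tokens 4121 \
  --max-num-seqs 1 \
  --kv-cache-dtype fp8 \
  --compilation-config '{"cudagraph_mode": "PIECEWISE"}' \
  --enable-prefix-caching \
  --speculative-config '{"method": "mtp", "num_speculative_tokens": 3}'
```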
J_m_L@reddit
Dang so you can get way better performance with vLLM?
Important_Quote_1180@reddit
Until TurboQuant comes to llama.cpp I use both, depending on the task.
ddog661@reddit
What did you get without speculative decode? I am getting around 33 tok/sec with AWQ-int4.
teachersecret@reddit
In the mid 40s I think? I’ll eyeball later.
Optimal-Bass-5246@reddit
Following this article:
https://www.reddit.com/r/LocalLLaMA/comments/1stjx29/an_overnight_stack_for_qwen3627b_85_tps_125k/
I was able to get 155tps with 258K context window on 1x RTX 5090.
=== Warmup (3x) ===
w1 comp=1000 wall=19.42s 51.49 TPS
w2 comp=1000 wall= 8.11s 123.30 TPS
w3 comp=1000 wall= 8.46s 118.20 TPS
=== Narrative (3x, 1000 tok) ===
narr1 comp=1000 wall= 8.38s 119.33 TPS
narr2 comp=1000 wall= 8.13s 123.00 TPS
narr3 comp=1000 wall= 8.06s 124.07 TPS
=== Code (2x, 800 tok) ===
code1 comp=692 wall= 4.44s 155.86 TPS
code2 comp=462 wall= 3.05s 151.48 TPS
=== GPU state ===
0, 92 %, 29997 MiB, 32607 MiB, 402.53 W, 63
=== Last 3 SpecDecoding metrics (MTP accept) ===
(APIServer pid=1) INFO 04-25 14:10:16 [metrics.py:101] SpecDecoding metrics: Mean acceptance length: 2.60, Accepted throughput: 72.50 tokens/s, Drafted throughput: 136.20 tokens/s, Accepted: 725 tokens, Drafted: 1362 tokens, Per-position acceptance rate: 0.782, 0.533, 0.282, Avg Draft acceptance rate: 53.2%
(APIServer pid=1) INFO 04-25 14:10:26 [metrics.py:101] SpecDecoding metrics: Mean acceptance length: 2.71, Accepted throughput: 76.79 tokens/s, Drafted throughput: 134.99 tokens/s, Accepted: 768 tokens, Drafted: 1350 tokens, Per-position acceptance rate: 0.782, 0.564, 0.360, Avg Draft acceptance rate: 56.9%
(APIServer pid=1) INFO 04-25 14:10:36 [metrics.py:101] SpecDecoding metrics: Mean acceptance length: 2.97, Accepted throughput: 89.39 tokens/s, Drafted throughput: 135.89 tokens/s, Accepted: 894 tokens, Drafted: 1359 tokens, Per-position acceptance rate: 0.837, 0.647, 0.490, Avg Draft acceptance rate: 65.8%
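Reading those metrics: the per-position rates are cumulative acceptance probabilities, so the mean acceptance length works out to the one guaranteed target-model token plus their sum. For the last line:

$$1 + 0.837 + 0.647 + 0.490 \approx 2.97$$

which matches the reported 2.97, and explains why TPS tracks acceptance so closely.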
FullOf_Bad_Ideas@reddit
600 t/s PP
150 down to 30 t/s TG depending on task and context length.
8x 3090 Ti, BF16 model, with DFlash from Qwen 3.5 27B, SGLang with TP 8
grunt_monkey_@reddit
How do you run qwen 3.5 397b on 8x3090? Is it with smaller quant or cpu offload?
FullOf_Bad_Ideas@reddit
EXL3 quant cooked by me (cpral 3.536bpw on this table) or mratsim - https://huggingface.co/mratsim/Qwen3.5-397B-A17B-EXL3
I'll try to make better custom quants later but I'm playing with Hermes 4 405B quanting right now.
mtasic85@reddit
Nvidia RTX 3090 24GB / CUDA 12.9.1
llama-server version: version: 8929 (9d34231bb)
Unsloth Qwen3.6-27B Q4_K_M -ctk/v q5_0
21.86GiB / 24.00GiB VRAM
35 t/s tg
250 t/s tg, with speculative decoding (default `--spec-default` == `--spec-type ngram-mod --spec-ngram-size-n 24 --draft-min 48 --draft-max 64`)
```
CUDA_VISIBLE_DEVICES=0 ./llama-server -hf unsloth/Qwen3.6-27B-GGUF:Q4_K_M -ngl -1 -fa on -fit off --metrics --props --slots --host 0.0.0.0 --port 8080 -dev CUDA0 --temp 0.7 --top-p 0.8 --top-k 20 --min-p 0.0 --presence-penalty 1.5 --repeat-penalty 1.0 --reasoning off --alias "Qwen/Qwen3.6-27B" -c 262144 -ctk q5_0 -ctv q5_0 --spec-default --no-mmproj-offload -b 1024 -ub 256
```
Kindly-Cantaloupe978@reddit
~80 tps on RTX 5090 using vllm 0.19 with 218k context window and MTP enabled
model is this: https://huggingface.co/sakamakismile/Qwen3.6-27B-Text-NVFP4-MTP
recipe can be found at my post on Qwen3.5-27B: https://www.reddit.com/r/LocalLLaMA/comments/1sr8gyf/qwen3527b_on_rtx_5090_served_via_vllm_77_tps/
Exact-Cupcake-2603@reddit
AMD MI50 x4: pp 330, tg 18
Legal-Ad-3901@reddit
https://github.com/larkinwc/mi50grad 56.33
jriggs28@reddit
That is amazing!
Exact-Cupcake-2603@reddit
Pp is not so good haha
Exact-Cupcake-2603@reddit
Awesome
suprjami@reddit
Triple RTX 3060 12GB, power limited down to 125W
280 pp
12-14 tg
Evgeny_19@reddit
According to podman's logs my Radeon 9700 Pro runs Q5_K_XL with PP from 80 to 670, TG around 17-18.
karimusben@reddit
I've got 9.5 t/s with ROCm/Vulkan on Ubuntu, can you share your config?
Evgeny_19@reddit
I run the latest (well, I pulled updates a few days ago) llama.cpp/ROCm combination via podman. The options are these: -ngl 99 -fa 1 -c 131072 -b 2048 -ub 512 -ctk q8_0 -ctv q8_0 --spec-type ngram-mod --spec-ngram-size-n 24 --draft-min 4 --draft-max 48 --temp 0.6 --repeat-penalty 1.0 --top-p 0.95 --top-k 20 --min-p 0.0 --presence-penalty 0
In llama-bench the results would be different. The numbers I posted are from a real opencode session. I never saw anything as low as 80 PP in llama-bench, yet I saw it in podman's logs on a real task. That was an exception though; I only saw it once. It usually stays well above 300. It should probably be possible to run Q6 variants with the same context at -ctk q4_0 -ctv q4_0, but I haven't tried it yet.
dametsumari@reddit
M5 pro 8 tk/s tg, pp 250 ish. Too slow to be useful.
Dany0@reddit
That doesn't seem right to me, I get 10-15 tok/s decode and pp in that ballpark on an M3 Max.
dametsumari@reddit
Max is 2x Pro (of the same generation) in tg. My number is also Q8, how about yours?
Dany0@reddit
Not for the M5 Pro, which is the same chip as the M5 Max, though he may have 2 fewer cores.
dametsumari@reddit
What cores are you talking about?
Memory bandwidth is 2x in all Maxes, and that matters for tg. For pp, the Max has twice the GPU cores of the Pro, also in M5 (20 vs 40).
Dany0@reddit
Memory bandwidth, yes (with caveats: it actually has more memory controllers, not, say, higher MT/s; it's like increasing bus width, not speed, so theoretical bandwidth does scale, but in practice if your workload isn't easily chunked/batched you might see drastically different numbers), but the chip itself is the same between the M5 Max and M5 Pro. There's just a binned M5 Pro variant with 2, or actually I think it had 3, fewer cores? Doesn't matter.
The main difference between the M5 Max and Pro is the GPU.
dametsumari@reddit
Total bandwidth is what matters in inference, as you need to go through all active parameters per token. Because of that, old multi-channel Xeons are surprisingly good: aggregated bandwidth with e.g. 12 channels comes to 250+ GB/s, and that works well with e.g. big DeepSeek MoE models with few active parameters and large total model size (hundreds of gigabytes).
Dany0@reddit
I just looked it up and my M3 Max has 409 GB/s theoretical while the M5 Pro has 307 GB/s. So it still does not explain the difference.
dametsumari@reddit
Oh? 8 × 409/307 is a bit over 10. And you said 10-15.
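Decode is bandwidth-bound, so to first order TG scales with memory bandwidth:

$$\text{TG} \approx \frac{\text{memory bandwidth}}{\text{bytes read per token}} \quad\Rightarrow\quad 8 \times \frac{409}{307} \approx 10.7 \text{ t/s expected on the M3 Max}$$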
Dany0@reddit
M5 series has much improved prefill speeds, he should be getting more than 250 pp
Sunknowned@reddit
5.5 tps 💀
Necessary-milkyway@reddit
I got a max token count of 112 tok/sec with 192k context. The average for me was around 50-60 tok/sec. I ran it the entire day doing coding and went through 100 million tokens total: 99 million input and 1 million output. This is running the model in vLLM NVFP4 on an Asus Ascent GX10.
RiskyBizz216@reddit
66 tok/s on the RTX 5090 in LM studio
shansoft@reddit
What quant are you using? I am using llamacpp and only getting around 50 tok/s with unsloth UD5 XL
Dany0@reddit
50tok/s sounds like you might be spilling into system ram a little
Without spec decoding/mtp I would get a pretty steady 70tok/s on mine in both vllm/llamacpp. Though vllm gets much faster prefill speeds
shansoft@reddit
I retested again, it is indeed 50 tok/s under Arch with UD_Q5_K_XL. No spill over. But somehow in Windows with same setting, I am getting 60 tok/s. Something doesn't feel right. For Q4_K_M I can indeed run around 66-70 tok/s.
Dany0@reddit
nvidia closed drivers or open?
RiskyBizz216@reddit
Q4_K_M GGUF
https://huggingface.co/Jackrong/Qwen3.6-27B-GGUF
OSHAHazard@reddit
288 tok/s PP and 28 tok/s TG at 77k context on a 7900XTX
genpfault@reddit
Same GPU:
NullFlexZone@reddit
That's pretty low. I am getting 45 t/s in a quick test in LM Studio with the same size context.
CardinalRedwood@reddit
I was also hoping to get around this number. Mind sharing your setup?
noctrex@reddit
I get on mine:
ROCm: ~290 pp / 20 tg
Vulkan: ~550 pp / 35 tg
Ok-Internal9317@reddit (OP)
Hmm, that's not as quick as I might have thought.
Force88@reddit
2x 5060ti 16gb, q6, 14t/s
SpaceTraveler2084@reddit
can we expect a qwen3.6:14b?
Fluffywings@reddit
Unlikely based on 3.5 and the poll Alibaba put out for 3.6 on sizes.
aparamonov@reddit
7900 XT, 20GB VRAM, light overclocking and undervolting with a 300W power limit, minimal Linux UI. 110k q8/q8 context, Q4_K_XL quant. I sacrificed ub and pp for more context (ub 256).
I get about 30-33 tg and about 720 pp in llama-server on Vulkan. ROCm is much slower for dense Qwen models on my system.
sirmc@reddit
Intel Arc Pro B70 using llama.cpp with various sycl PR patches merged in:
llama-benchy:
PP: 684 ± 18.60, TG: 21.45 ± 0.02
lawldoge@reddit
Tore down the setup I was using, but it was in the 14-15 tok/sec ballpark on llama.cpp with UD-Q5_K_XL and UD-Q8_K_XL. Context was set to 128K, but that rate only lasted up to about 50K or so, at which point it absolutely tanked; it's in the 5-6 tok/sec range before it gets to 100K. ASRock B70s, without the split on the Q5 and split across 2 cards on the Q8.
ea_man@reddit
Hey can I ask what quant you are using? As Q4_K_M...
sirmc@reddit
Running UD-Q4_K_XL
ea_man@reddit
Thanks, that's like TG ~24 for Q4_K_M, which is fair; PP is pretty good with that kind of RAM.
I guess that is concurrent sessions in vLLM, not a single user prompt in llama-server, right?
sirmc@reddit
Single user in llama.cpp (sycl), but I had to apply a number of open PRs against the llama.cpp repo to get a decent performance.
ea_man@reddit
Well congrats, it's looking way better than the early reports.
20 TG is usable, and it's a dense model; that GPU is probably better suited to the 35B MoE. Yet multiple agents with vLLM will give some more tokens.
Simple_Library_2700@reddit
2000 t/s pp and ~80 t/s tg
4xV100
Dany0@reddit
Holy shit I didn't expect this from the venerable V100! I drooled so hard over those cards when they were new
What's your total system power consumption?
Simple_Library_2700@reddit
If I load it up with concurrent requests it can draw around 1200w total which is kind of killing the value proposition of these lol.
Ok-Internal9317@reddit (OP)
do you have nvlink?
Simple_Library_2700@reddit
Yes 4 way
uniVocity@reddit
MacBook Pro M4 Max, 128GB of RAM. 12 tok/s in LM Studio; doesn't matter if I use MLX Q8 or GGUF Q8. No special settings, just downloaded and ran the model(s).
Takes a long time to start answering, an average of 5 minutes for each prompt.
StateSame5557@reddit
I get 20 tok/sec in LMStudio on a MBP M4
stuchapin@reddit
1x 3090: 41 t/s, Q4.
DeepBlue96@reddit
3090: prompt processing at long context (80k) is around 800 t/s; generation 25 t/s.
Model: unsloth/Qwen3.6-27B:Q5_K_XL, max context 131k, KV cache q4_0.
eugene20@reddit
Q4_K_M, 41 tok/s on 4090. Went back to 35B A3B at just over 60, hoping there is something to speed it up.
cromagnone@reddit
47t/s on 4090 with 64k context, with Unsloth's Q4_K_M and Q8 KV on main llama.cpp. Need to try that ik_llama.cpp build.
YairHairNow@reddit
KV cache variants tested: q4_0 / q4_0 / turbo3 / q4_0. The 2-GPU setup is 5080+2080; it's only beneficial on the 35B MoE (22GB) to prevent offloading.
Zealousideal_Fill285@reddit
How did you manage to fit 196k ctx on 27B IQ4_XS?
YairHairNow@reddit
Turboquant, probably. I don't think I actually tested that beyond the benchmark. The 2x GPUs were handling max context on the 22GB 35B though.
MalabaristaEnFuego@reddit
Important_Quote_1180@reddit
I've been spending too much time staring at GPU util graphs, but here's the thing: running a 27B local model on a single consumer GPU in 2026 is like trying to cook a ten-course tasting menu in a one-burner kitchen. You don't need a bigger kitchen. You need to stop being wasteful.
The rig: RTX 3090 (24GB), AMD 9900X, 192GB DDR5. Nothing exotic. The kind of box a mid-tier game dev would run on. The model is qwen3.6-27B-AutoRound (INT4 quantized), served by vLLM.
Here are the tricks that make this actually work instead of choking at 12 tokens a second:
TURBOQUANT 3-BIT KV Cache. This is the move nobody talks about. Instead of stashing every attention computation in full 16-bit precision (the safe, default choice), we compress the KV cache to 3-bit. It's like storing your recipes on cocktail napkins instead of index cards. You think you'll lose something. You don't. On a 3090 with 24GB, this is the difference between "fits" and "OOM-killed by your own ambition."
MTP — Multi-Token Prediction. vLLM speculates the next three tokens using auxiliary heads, then verifies them in one pass against the main model. It's like hiring three sous-chefs to prep ingredients while you plate. When it works — and this is the crucial part, when it works — you get roughly triple the throughput because three speculative tokens get accepted per forward pass instead of one. We're seeing 71 to 83 tokens per second. That's not "usable." That's fast.
Cudagraph mode PIECEWISE, not FULL. This was the trap. The default FULL_AND_PIECEWISE captures complete execution graphs and replays them. On this machine, with a 1660 Ti also plugged in (legacy display adapter, I know, don't look at me like that), FULL capture poisoned the speculative decoding. The model started outputting the same token in a loop — 100% MTP acceptance, zero intelligence. Switched to PIECEWISE, which only captures attention-operation boundaries. No more garble. No more repetition loops. Just clean, fast inference.
The warmup tax: First request after a restart takes ~43 seconds for 1000 tokens because cudagraphs compile on the fly. After that? 14 seconds. Subsequent runs? 12 seconds. The kitchen warms up. Don't panic.
The first time I sat down at my mother-in-law's stove and realized I didn't need a perfect recipe, just a good one — that's this feeling. You don't need an H100. You need to stop being a tourist.
kapteinpyn@reddit
Two R9700s with UD-Q8_K_XL, llama.cpp Vulkan: PP 400 and TG 16.
Intersteller-2002@reddit
Qwen3.6-35B-A3B-GGUF
Average around 21 tok/s
datbackup@reddit
Read the prompt again, Opus :P
mxmumtuna@reddit
~100tps@64k / MTP=3 - 2x RTX Pro 6000
Dany0@reddit
So jealous. What a beast of a card. What's your pp speed?
mxmumtuna@reddit
I mentioned it in another thread. It’s hard to really measure because it’s so fast. 10s of k per second usually.
Dany0@reddit
😭🤤🤤 man I uh I need to collect myself
now go make us a good coding finetune with it that us gpu poors can use >:(
mxmumtuna@reddit
They’re basically just jumbo 5090s, but it’s nice to have 4 of them.
Dany0@reddit
Why are you replying to me go get synthetic data, distill deepseekv4 reasoning, convert donald knuth books into jsonl. Do whatever it takes. I believe in you, you alone will accidentally stumble upon AGI by fine-tuning Qwen3.6 27B
mxmumtuna@reddit
My bad wrong thread.
nmqanh@reddit
PP 197.8 · TG 21.0 tok/s, M2 Max 96GB 38-core. Qwen3.6-27B-4bit-mlx-fp16
Zestyclose_Leek_3056@reddit
70 tok/s on a 5090 + Threadripper 9060X in LM Studio
Q8 KV cache quantization, max context window
GregoryfromtheHood@reddit
4090+3090+3090+5070ti
~700-1000 pp ~18-25 tg
Winter_Tension5432@reddit
I hit 112 tk/s with 1.3k prompt processing with MTP enabled, at INT4 on vLLM, on 3x 5060 Ti 16GB + a 4070 Ti Super 16GB. But tool calling got destroyed, so I disabled MTP and now I am at 64 tk/s with the same prompt processing. This is at 256k context.
Asleep-Land-3914@reddit
2ts with 16GB GPU 😂
bobaburger@reddit
I have 16GB VRAM; I had to go all the way down to Q4_K_S or Q3_K_XL + KV cache quant (either q4 or turbo4) to get above 10 t/s (for tg; pp was 150-400 t/s). And with this, the quality is sooooo bad, worse than 35B-A3B at Q5. I guess it's not a thing for us GPU-poor.
KvAk_AKPlaysYT@reddit
120 s/tok
Tormeister@reddit
Between 78 and 200 tok/s, depending on MTP acceptance %
vLLM, 5090
Haeppchen2010@reddit
140 pp / 8 tg on RX 7800 XT plus RX 580 (Q5_K_M). But 35B is soooo good and more than twice as fast (400 pp / 40 tg), so I will stay with 35B for now, until I can replace the slow RX 580.
No_Conversation9561@reddit
Prefill: 2000 t/s Decode: 25 t/s
Unsloth Q4_K_XL on 5060Ti + 5070Ti
No_Information9314@reddit
18 t/s Q4km on dual Rtx 3060
CrushingLoss@reddit
I get around 10 tok/s through opencode, 15 or so raw. Mac Studio M2 Max, 96GB.
chimph@reddit
I keep getting looping issues with Qwen3.6 models in opencode. Not sure what’s going on
gusbags@reddit
Qwen/Qwen3.6-27B-FP8 on dual Asus GX10 spark cluster, with dflash. PP: 2500+, TG: up to 57 when dflash acceptance does well, but around 40t/s on average.
g_rich@reddit
How are you running the model and what are your settings?
gusbags@reddit
This is the recipe I am using, via eugr's vllm docker builds (https://github.com/eugr/spark-vllm-docker):
g_rich@reddit
Thanks
Weary_Long3409@reddit
I'm on a 3060: 27B IQ4_XS @ 20 t/s; 35B-A3B IQ4_XS @ 82 t/s
Weary_Long3409@reddit
20 t/s on 2x3060
UniForceMusic@reddit
8-10 tps on Macbook M2 Max 64GB 14 inch
Q4_K_M model
viperx7@reddit
Qwen3.6 27B FP8, vLLM. Hardware: 4090 + 3090 Ti. MTP: 16. Ctx: 125k.
Speed varies: 85 t/s average, 50 t/s at worst, up to 141 t/s peak.
I am still wondering if increasing MTP to this extent is even a good idea or not (I don't see any disadvantage).
fulgencio_batista@reddit
62.5t/s tg512, ~1000t/s pp2048 on dual rtx5060ti with Qwen3.6-27b-NVFP4 on vLLM using 3 speculative tokens with MTP
gogitossj3@reddit
Can you show your config?
fulgencio_batista@reddit
This was for 3.5, but it’s the same for 3.6
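The shape of it in vLLM looks like this (the "method" value is an assumption; check the speculative config docs for your vLLM version):
```
# Sketch of an MTP speculative-decoding launch on 2 GPUs ("mtp" method name assumed)
vllm serve Qwen3.6-27b-NVFP4 \
  --tensor-parallel-size 2 \
  --speculative-config '{"method": "mtp", "num_speculative_tokens": 3}'
```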
Tunashavetoes@reddit
Q5 on M1 Max 10/24 core 64gb ram: 8tps
ea_man@reddit
AMD 6700xt using llama.cpp with vulkan, IQ3_XXS
PP: 160
TG: 23tok/sec
Context q4: 50-85k depending on which desktop I use :P
Linkpharm2@reddit
1500 in / 35-40 out, 4080
Economy_Cabinet_7719@reddit
mxfp4 15tk/s on m4 pro
Opteron67@reddit
130 TG FP8 dual 5090
Puzzleheaded-Drama-8@reddit
37t/s at 16k, 35t/s at 32k on 7900XTX, vulkan, q4_k_m
DeltaSqueezer@reddit
Why not run the 35B?
Ok-Internal9317@reddit (OP)
I think I misconfigured; all 4 GPUs are almost idle. I'll update the post once I have figured it out.
misha1350@reddit
Try Qwen 3 Next 80B as well, since you have a total of 48GB of VRAM; you can at least use Q3_K_M with a big enough context window to run some workloads that require better internal knowledge. For some cases, neither Qwen3.5 35B A3B nor the benchmaxxed Qwen3.6 35B A3B (which has worse internal knowledge in my testing) is cutting it for me, so I also use Qwen 3 Next 80B because it's a sparse MoE model with good speed and I have enough memory for it to fit.
Beamsters@reddit
oMLX, oQ4 FP16: got like 17 t/s and 150 pp/s.
M1 Max 32GB.
The result however is much better than the quantized 35B-A3B.
getmevodka@reddit
Can't right now, I'm on vacation, but if it's similar to Qwen 3.5 27B in Q8_K_XL then about 33 tok/s.
SnooPaintings8639@reddit
20 tps q8 under llama.cpp, 25-30 tps under vllm. Got 100 tps with Qwen 35B.
2 x 3090 RTX
Altruistic_Heat_9531@reddit
Unsloth UD Q5 Llamacpp release b8920
- 3090 96G
- DDR4 RAM 2400 Mhz
- Xeon 2690v4
- Ubuntu 22.04
- PCIe gen 3 x16
64K 20 tok/s
128K 10 tok/s
Conscious_Chef_3233@reddit
110 tps sglang
sjoerdmaessen@reddit
78 t/s fp8 l40s
Finanzamt_Endgegner@reddit
20-ish t/s TG at 100k context, IQ4_XS, on 4070 Ti 12GB + 2070 8GB.
PP is around 1000 I think.
Special-Lawyer-7253@reddit
250 pp / 6.5 t/s.
GTX 1070 8GB VRAM, 32GB RAM, i7-6700HQ
Creative-Regular6799@reddit
I tried it now and I'm getting 4 tok/s. Not usable, unfortunately.
abmateen@reddit
29 tok/s Q4 V100 32GB single GPU