Qwen 3.6 27B is a BEAST
Posted by AverageFormal9076@reddit | LocalLLaMA | 68 comments
I have a 5090 Laptop from work, 24GB VRAM.
I have been testing every model that comes out, and I can confidently say I’ll be cancelling my cloud subscriptions.
It passed all the tool-call and data-science benchmarks I use to check whether a model is reliably good for my use case.
It might not be the case for other professions, but for pyspark/python and data transformation debugging it’s basically perfect.
Using llama.cpp, q4_k_m with the KV cache at q4_0, still looking at options for optimising.
ExplorerWhole5697@reddit
I'm currently enjoying qwen3.6-35b-a3b on my macbook pro. Would the 27b mean a noticeable upgrade? I assume speeds would tank, but it might still be worth it.
ernexbcn@reddit
On my M2 Max it’s very slow.
ExplorerWhole5697@reddit
that's what I would expect from a dense model. Did you have any luck with speculative decoding?
ernexbcn@reddit
I have not tried that, will have to look into that. I have 96GB of ram on this one.
AverageFormal9076@reddit (OP)
Speed doesn’t tank much in my experience, and yes the quality upgrade is worth it.
ExplorerWhole5697@reddit
sounds promising, and I guess speculative decoding might help with a dense model too
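Something like this should do it in llama-server, assuming you can find a small draft model that shares the tokenizer (untested sketch, the draft filename is just a placeholder):
llama-server -m Qwen3.6-27B-Q4_K_M.gguf -md qwen3.6-draft-small-Q4_0.gguf -ngl 99 -ngld 99 --draft-max 16 --draft-min 1 -c 65536 --flash-attn on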
DeepV@reddit
How much ram?
ExplorerWhole5697@reddit
64gb
sagiroth@reddit
Don't use the KV cache at q4 for coding. You can get 130k context with q8
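Something like this as a starting point (sketch; flash attention needs to be on for the quantized V cache, and the model path is a placeholder):
llama-server -m Qwen3.6-27B-Q4_K_M.gguf -ngl 99 -c 131072 --flash-attn on --cache-type-k q8_0 --cache-type-v q8_0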
ComfyUser48@reddit
On my 5090, for coding I'm using unsloth Q6 XL quant, no kv cache custom params, 100k ctx, getting 50 t/s with power limit to 400w
car_lower_x@reddit
Q8 on 5090 power limit 400W I am getting 45 t/s
ComfyUser48@reddit
How much context? With q8 cache or without?
car_lower_x@reddit
kv cache BF16 and 255k context
ComfyUser48@reddit
How can you load up Q8 with kv cache bf16 with 255k context on 32gb vram? What's your llama.cpp command?
car_lower_x@reddit
Nothing special. Just selected Q8 in the list and loaded it. It’s 28gb in size. Fits perfectly.
ComfyUser48@reddit
Can you share the full command?
car_lower_x@reddit
What do you mean? I run Unsloth, select the model leave the presets.. no special commands. Have you tried to load it?
finevelyn@reddit
Something is off because that combo won't nearly fit into 32GB of VRAM and if it didn't fit then it would be a lot slower than 45t/s.
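Rough math, with the caveat that I'm assuming a typical GQA layout for a 64-layer model (8 KV heads × 128 head dim, which may be off): bf16 KV cache would be 2 × 64 × 8 × 128 × 2 bytes ≈ 256 KB per token, so 255k tokens is roughly 65 GB of cache on top of ~28 GB of weights. Either most of that is spilling to system RAM or the full context isn't actually being allocated.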
year2039nuclearwar@reddit
Unsloth Studio has made this all really easy, it's really good for performance
Far-Low-4705@reddit
I’m so jealous, I only get 20 - 24 T/s…
And I get 50 T/s on 35b a3b
mindovic@reddit
That wouldn't fit on a 5090 laptop, or am I missing something?
AverageFormal9076@reddit (OP)
Interesting, will test that out. Might need to drop my quant down though, at q8_0 it’s pretty slow to first output
Johnny_Rell@reddit
Anyone running it on 16 GB VRAM + 32 GB DDR5? Wonder how well it works with offloading.
braintheboss@reddit
I didn't try 3.6 yet, but it's the same size as 3.5, and on a 5070 Ti + Xeon Haswell the q4km runs at 29 t/s.
nikhilprasanth@reddit
Running a 5060 Ti with Q3 on the turboquant llama.cpp build. 20-24 t/s at the start, tanks to 15 t/s near full context. Still usable with opencode and hermes
set CUDA_VISIBLE_DEVICES=0 && "C:\Users\\Desktop\turbo_quant\llama-cpp-turboquant\build-cuda-nmake\bin\llama-server.exe" ^
-m "D:\Qwen3.6-27B-UD-IQ3_XXS.gguf" ^
-a "Qwen/Qwen3.6-27B" ^
--host 0.0.0.0 ^
--port 8080 ^
--fit on ^
--fit-ctx 65536 ^
--fit-target 512 ^
--flash-attn 1 ^
-b 4096 ^
-ub 256 ^
--temp 0.6 ^
--top-k 20 ^
--top-p 0.95 ^
--min-p 0.00 ^
--repeat-penalty 1.0 ^
--presence-penalty 0.0 ^
--cache-type-k turbo3 ^
--cache-type-v q8_0 ^
--mlock ^
--chat-template-kwargs "{\"preserve_thinking\":true}" ^
--jinja ^
--no-mmap ^
--webui-mcp-proxy ^
-np
Pangocciolo@reddit
I run UD-Q4_K_XL, it can reach 10 t/s. Slow.
No_War_8891@reddit
I have 2x16GB VRAM plus 32GB system RAM (2x 5060 Ti 16GB) and that is the sweet spot for me - one card with 16GB is just not enough to get nice speeds
Coconut_Reddit@reddit
Awesome, follow-up on the speed: how many tokens/sec?
autisticit@reddit
What speed are you achieving? Can you post your llama.cpp command please?
RandomTrollface@reddit
I run the iq3_xxs on my radeon rx 9070 non xt with 80k context q8. I could get more context if I go headless but I lose about 1.5gb of vram from my desktop environment. Despite the q3 it is still working really well for me in Pi coding agent, better than the MoE with offloading. I get about 30-35tok/s generation speed depending on context.
libregrape@reddit
It did not work too well. With IQ4 on llama-bench tg I got 25tps, and it will degrade with context. At 48k context it already gets to 8tps. Considering this is a thinking model, you would wait quite some time.
Guilty_Rooster_6708@reddit
I played with the IQ3 quant for a bit but definitely just going to stick w the MoE version. 5070Ti 32GB DDR5
rebelSun25@reddit
I compared the dense Gemma 4 and Qwen. I have a 16GB VRAM / 64GB DDR5 system to test on. Both take time to start, generation is slow, but usable. Under 10 t/s. Usable for casual chat, but not much for agents
Glad-Mode9459@reddit
You need to add any card, even an old one. I plugged in an old RX 6600 and get around 22 TG and 850 PP
sagiroth@reddit
Forget it, your option is 35B A3B
AverageFormal9076@reddit (OP)
Since it’s dense, offloading will work terribly…
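For reference, a partial offload is just capping -ngl below the layer count and letting the rest sit in system RAM, roughly like this (sketch, paths and layer count are placeholders, expect it to crawl on a dense 64-layer model):
llama-server -m Qwen3.6-27B-Q4_K_M.gguf -ngl 40 -c 32768 --flash-attn on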
Force88@reddit
I have 2x 5060ti 16gb + 48gb ddr5, can I run q5, q6 with 130k context?
Steus_au@reddit
It will give you about 15-20 t/s depending on quant
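Splitting across the two cards is just the tensor-split flag in llama-server, something like this (sketch, untested on that exact setup):
llama-server -m Qwen3.6-27B-Q4_K_M.gguf -ngl 99 -ts 1,1 -c 131072 --flash-attn on --cache-type-k q8_0 --cache-type-v q8_0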
Force88@reddit
Q4 works at 32k context, 22.5t/s
Coconut_Reddit@reddit
How fast is it ?
Force88@reddit
Tried, q6 not working, downloading q5
Sbaff98@reddit
Keep us posted :)
Force88@reddit
Q5 not working..., q4 at 32k context works, at 22.5t/s
ozymandizz@reddit
I just got a used 3090 and 128GB DDR4 RAM. Any suggestions on how best I can run this? I'm new to local LLMs
DeedleDumbDee@reddit
Not worth running the 27B if you can’t fit it all in VRAM. The architecture and 64 layers make it insanely slow. Stick to 35B
year2039nuclearwar@reddit
What do you mean, can't you just run it as a GGUF Q8 or Q6 quant? It should fit no? I haven't had a look yet
DeedleDumbDee@reddit
Oh well honestly I didn’t know a 3090 had 24GB vram so yeah you should be good
Wolfenhoof@reddit
Does anyone have suggestions on how to set this up on a MacBook Pro? I know it depends on what I’m using it for, but if I’m just testing t/s and not letting it access the internet or my system, is LM Studio sufficient? Or do some say you always need a container/Docker?
stancios00@reddit
Would be nice to have a test from a Mac mini
zannix@reddit
how many tps u getting?
FullOf_Bad_Ideas@reddit
EXL3 quants should be out soon, they should give you a bit better quality at given bitrate. I'd suggest looking into it - give it a few days for more quants to be out as now I see only 4.5bpw - https://huggingface.co/NeoChen1024/Qwen3.6-27B-exl3-4.5bpw-h6
AverageFormal9076@reddit (OP)
Noted!
inkberk@reddit
wait till z-lab releases the dflash drafter and https://github.com/ggml-org/llama.cpp/pull/22105, free 2x decode speed
AverageFormal9076@reddit (OP)
I look forward to it, this should solve my main gripe rn
theologi@reddit
which laptop model is this?
alccode@reddit
I second this question.
AverageFormal9076@reddit (OP)
ASUS ROG Strix Scar 18
More-School-7324@reddit
Anyone using this on a mac mini? What's your specs and how's it running?
amunozo1@reddit
How's the heat and noise when using it?
AverageFormal9076@reddit (OP)
I’ve been asked to work from home :D
amunozo1@reddit
If you're a man, don't put it on your lap if you want to have children
AverageFormal9076@reddit (OP)
Lmao this thing weighs like 4kg, it’s docked up dw
amunozo1@reddit
Hahahahhaha best then, you avoid the temptation
ortegaalfredo@reddit
It's super smart, but tool calling is not 100% reliable like MiniMax's. For me it fails after 20 or 30 tool calls; MiniMax can go over 500. But it's smarter than even MiniMax.
Adventurous-Gold6413@reddit
Lucky you with the 5090 laptop, I only got a 4090 laptop 😞 so I got 16GB not 24
AverageFormal9076@reddit (OP)
Oh don’t I know it!
Additional-Bad2648@reddit
what are your llama.cpp arguments? Like context and kv quants and such
AverageFormal9076@reddit (OP)
"%LLAMA_DIR%\llama-server.exe" ^
-m "%MODEL_PATH%" ^
--alias "qwen3.6-27b" ^
-c 204800 ^
-ngl 99 ^
--flash-attn on ^
--cache-type-k q4_0 ^
--cache-type-v q4_0 ^
-np 1 ^
-t 20 ^
--prio 2 ^
--batch-size 2048 ^
--ubatch-size 1024 ^
--reasoning-format deepseek ^
--reasoning-budget 8192 ^
--reasoning-budget-message "Let me provide the final answer." ^
--cache-reuse 256 ^
--metrics ^
--no-context-shift ^
--host 127.0.0.1 ^
--port 8080