Qwen 3.6 27B is a BEAST
Posted by AverageFormal9076@reddit | LocalLLaMA | 68 comments
I have a 5090 Laptop from work, 24GB VRAM.
I have been testing every model that comes out, and I can confidently say I’ll be cancelling my cloud subscriptions.
It passed all the tool-call and data-science benchmarks I use to check whether a model is reliably good for my use case.
It might not be the case for other professions, but for pyspark/python and data transformation debugging it’s basically perfect.
Using llama.cpp, q4_k_m with the KV cache at q4_0, still looking at options for optimising.
ExplorerWhole5697@reddit
I'm currently enjoying qwen3.6-35b-a3b on my macbook pro. Would the 27b mean a noticeable upgrade? I assume speeds would tank, but it might still be worth it.
ernexbcn@reddit
On my M2 Max it’s very slow.
ExplorerWhole5697@reddit
that's what I would expect from a dense model. Did you have any luck with speculative decoding?
ernexbcn@reddit
I have not tried that, will have to look into that. I have 96GB of ram on this one.
AverageFormal9076@reddit (OP)
Speed doesn’t tank much in my experience, and yes the quality upgrade is worth it.
ExplorerWhole5697@reddit
sounds promising, and I guess speculative decoding might help with a dense model too
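Something like this should do it in llama-server, assuming you can find a small draft model that shares the tokenizer (untested sketch, the draft filename is just a placeholder):
llama-server -m Qwen3.6-27B-Q4_K_M.gguf -md qwen3.6-draft-small-Q4_0.gguf -ngl 99 -ngld 99 --draft-max 16 --draft-min 1 -c 65536 --flash-attn on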
DeepV@reddit
How much ram?
ExplorerWhole5697@reddit
64gb
sagiroth@reddit
Don't use the KV cache at q4 for coding. You can get 130k context with q8
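Something like this as a starting point (sketch; flash attention needs to be on for the quantized V cache, and the model path is a placeholder):
llama-server -m Qwen3.6-27B-Q4_K_M.gguf -ngl 99 -c 131072 --flash-attn on --cache-type-k q8_0 --cache-type-v q8_0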
ComfyUser48@reddit
On my 5090, for coding I'm using unsloth Q6 XL quant, no kv cache custom params, 100k ctx, getting 50 t/s with power limit to 400w
car_lower_x@reddit
Q8 on 5090 power limit 400W I am getting 45 t/s
ComfyUser48@reddit
How much context? With q8 cache or without?
car_lower_x@reddit
kv cache BF16 and 255k context
ComfyUser48@reddit
How can you load up Q8 with kv cache bf16 with 255k context on 32gb vram? What's your llama.cpp command?
car_lower_x@reddit
Nothing special. Just selected Q8 in the list and loaded it. It’s 28gb in size. Fits perfectly.
ComfyUser48@reddit
Can you share the full command?
car_lower_x@reddit
What do you mean? I run Unsloth, select the model leave the presets.. no special commands. Have you tried to load it?
finevelyn@reddit
Something is off because that combo won't nearly fit into 32GB of VRAM and if it didn't fit then it would be a lot slower than 45t/s.
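Rough math, with the caveat that I'm assuming a typical GQA layout for a 64-layer model (8 KV heads × 128 head dim, which may be off): bf16 KV cache would be 2 × 64 × 8 × 128 × 2 bytes ≈ 256 KB per token, so 255k tokens is roughly 65 GB of cache on top of ~28 GB of weights. Either most of that is spilling to system RAM or the full context isn't actually being allocated.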
year2039nuclearwar@reddit
Unsloth Studio has made this all really easy, it's really good for performance
Far-Low-4705@reddit
I’m so jealous, I only get 20 - 24 T/s…
And I get 50 T/s on 35b a3b
mindovic@reddit
That wouldn't fit on a 5090 laptop, or am I missing something?
AverageFormal9076@reddit (OP)
Interesting, will test that out. Might need to drop my quant down though, at q8_0 it’s pretty slow to first output
Johnny_Rell@reddit
Anyone running it on 16 GB VRAM + 32 GB DDR5? Wonder how well it works with offloading.
braintheboss@reddit
I didn't try 3.6 yet, but it's the same size as 3.5, and on a 5070 Ti + Xeon Haswell the q4km runs at 29 t/s.
nikhilprasanth@reddit
Running a 5060 Ti with Q3 on the turboquant llama.cpp build. 20-24 t/s at the start, tanks to 15 t/s near full context. Still usable with opencode and hermes
set CUDA_VISIBLE_DEVICES=0 && "C:\Users\\Desktop\turbo_quant\llama-cpp-turboquant\build-cuda-nmake\bin\llama-server.exe" ^
-m "D:\Qwen3.6-27B-UD-IQ3_XXS.gguf" ^
-a "Qwen/Qwen3.6-27B" ^
--host 0.0.0.0 ^
--port 8080 ^
--fit on ^
--fit-ctx 65536 ^
--fit-target 512 ^
--flash-attn 1 ^
-b 4096 ^
-ub 256 ^
--temp 0.6 ^
--top-k 20 ^
--top-p 0.95 ^
--min-p 0.00 ^
--repeat-penalty 1.0 ^
--presence-penalty 0.0 ^
--cache-type-k turbo3 ^
--cache-type-v q8_0 ^
--mlock ^
--chat-template-kwargs "{\"preserve_thinking\":true}" ^
--jinja ^
--no-mmap ^
--webui-mcp-proxy ^
-np
Pangocciolo@reddit
I run UD-Q4_K_XL, it can reach 10 t/s. Slow.
No_War_8891@reddit
I have 2x16GB VRAM plus 32GB system RAM (2x 5060 Ti 16GB) and that is the sweet spot for me - one card with 16GB is just not enough to get nice speeds
Coconut_Reddit@reddit
Awesome, follow-up on the speed: how many tokens/sec?
autisticit@reddit
What speed are you achieving? Can you post your llama.cpp command please?
RandomTrollface@reddit
I run the iq3_xxs on my radeon rx 9070 non xt with 80k context q8. I could get more context if I go headless but I lose about 1.5gb of vram from my desktop environment. Despite the q3 it is still working really well for me in Pi coding agent, better than the MoE with offloading. I get about 30-35tok/s generation speed depending on context.
libregrape@reddit
It did not work too well. With IQ4 on llama-bench tg I got 25tps, and it will degrade with context. At 48k context it already gets to 8tps. Considering this is a thinking model, you would wait quite some time.
Guilty_Rooster_6708@reddit
I played with the IQ3 quant for a bit but definitely just going to stick w the MoE version. 5070Ti 32GB DDR5
rebelSun25@reddit
I compared the dense Gemma 4 and Qwen. I have a 16GB VRAM / 64GB DDR5 system to test on. Both take time to start, generation is slow, but usable. Under 10 t/s. Usable for casual chat, but not much for agents
Glad-Mode9459@reddit
You need to add any card, even an old one. I plugged in an old RX 6600 and get around 22 TG and 850 PP
sagiroth@reddit
Forget it, your option is 35B A3B
AverageFormal9076@reddit (OP)
Since it’s dense, offloading will work terribly…
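For reference, a partial offload is just capping -ngl below the layer count and letting the rest sit in system RAM, roughly like this (sketch, paths and layer count are placeholders, expect it to crawl on a dense 64-layer model):
llama-server -m Qwen3.6-27B-Q4_K_M.gguf -ngl 40 -c 32768 --flash-attn on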
Force88@reddit
I have 2x 5060ti 16gb + 48gb ddr5, can I run q5, q6 with 130k context?
Steus_au@reddit
It will give you about 15-20 t/s depending on quant
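Splitting across the two cards is just the tensor-split flag in llama-server, something like this (sketch, untested on that exact setup):
llama-server -m Qwen3.6-27B-Q4_K_M.gguf -ngl 99 -ts 1,1 -c 131072 --flash-attn on --cache-type-k q8_0 --cache-type-v q8_0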
Force88@reddit
Q4 works at 32k context, 22.5t/s
Coconut_Reddit@reddit
How fast is it ?
Force88@reddit
Tried, q6 not working, downloading q5
Sbaff98@reddit
Keep us posted :)
Force88@reddit
Q5 not working..., q4 at 32k context works, at 22.5t/s
ozymandizz@reddit
I just got a used 3090 and 128GB DDR4 RAM. Any suggestions on how best I can run this? I'm new to local LLMs
DeedleDumbDee@reddit
Not worth running the 27B if you can’t fit it all in VRAM. The architecture and 64 layers make it insanely slow. Stick to 35B
year2039nuclearwar@reddit
What do you mean, can't you just run it as a GGUF Q8 or Q6 quant? It should fit no? I haven't had a look yet
DeedleDumbDee@reddit
Oh well honestly I didn’t know a 3090 had 24GB vram so yeah you should be good
Wolfenhoof@reddit
Does anyone have suggestions on how to set this up on a MacBook Pro? I know it depends on what I’m using it for, but if I’m just testing t/s and not letting it access the internet or my system, is LM Studio sufficient? Or do some say you always need a container/Docker?
stancios00@reddit
Would be nice to have a test from a Mac mini
zannix@reddit
how many tps u getting?
FullOf_Bad_Ideas@reddit
EXL3 quants should be out soon, they should give you a bit better quality at given bitrate. I'd suggest looking into it - give it a few days for more quants to be out as now I see only 4.5bpw - https://huggingface.co/NeoChen1024/Qwen3.6-27B-exl3-4.5bpw-h6
AverageFormal9076@reddit (OP)
Noted!
inkberk@reddit
wait till z-lab releases the dflash drafter and https://github.com/ggml-org/llama.cpp/pull/22105, free 2x decode speed
AverageFormal9076@reddit (OP)
I look forward to it, this should solve my main gripe rn
theologi@reddit
which laptop model is this?
alccode@reddit
I second this question.
AverageFormal9076@reddit (OP)
ASUS ROG Strix Scar 18
More-School-7324@reddit
Anyone using this on a mac mini? What's your specs and how's it running?
amunozo1@reddit
How's the heat and noise when using it?
AverageFormal9076@reddit (OP)
I’ve been asked to work from home :D
amunozo1@reddit
If you're a man, don't put it on your lap if you want to have children
AverageFormal9076@reddit (OP)
Lmao this thing weighs like 4kg, it’s docked up dw
amunozo1@reddit
Hahahahhaha best then, you avoid the temptation
ortegaalfredo@reddit
It's super smart, but tool calling is not 100% reliable like MiniMax's. For me it fails after 20 or 30 tool calls; MiniMax can go over 500. But it's smarter than even MiniMax.
Adventurous-Gold6413@reddit
Lucky you with the 5090 laptop, I only got a 4090 laptop 😞 so I got 16GB not 24
AverageFormal9076@reddit (OP)
Oh don’t I know it!
Additional-Bad2648@reddit
what are your llama.cpp arguments? Like context and kv quants and such
AverageFormal9076@reddit (OP)
"%LLAMA_DIR%\llama-server.exe" ^
-m "%MODEL_PATH%" ^
--alias "qwen3.6-27b" ^
-c 204800 ^
-ngl 99 ^
--flash-attn on ^
--cache-type-k q4_0 ^
--cache-type-v q4_0 ^
-np 1 ^
-t 20 ^
--prio 2 ^
--batch-size 2048 ^
--ubatch-size 1024 ^
--reasoning-format deepseek ^
--reasoning-budget 8192 ^
--reasoning-budget-message "Let me provide the final answer." ^
--cache-reuse 256 ^
--metrics ^
--no-context-shift ^
--host 127.0.0.1 ^
--port 8080