unsloth Qwen3.6-27B-GGUF
Posted by jacek2023@reddit | LocalLLaMA | View on Reddit | 105 comments
finally with files inside :)
yoracale@reddit
The Q8 and BF16 should be uploading any minute now.
We also uploaded MLX quants btw: https://unsloth.ai/docs/models/qwen3.6#mlx-dynamic-quants
tgsz@reddit
Is MTP not available with the 3.6-27B dense GGUFs? I'm seeing some conflicting info between the model card and the GGUF info.
ea_man@reddit
Man, you gotta save the day and manage to cook an IQ3 version that comes in just a bit (say 0.7GB) under 12GB for people with 12GB GPUs, and same thing for a Q4_* that lands just shy of 16GB, otherwise people without a 24GB GPU won't be able to actually run Qwen3.6 27B.
The new Qwen3.6 release is some 20% heavier on its hidden layer dimensions; damn, the new size for 3.6 means taking a downgrade on quant version to load the model :(
DistanceSolar1449@reddit
0.7GB is a lot at 12GB lol
yoracale@reddit
We uploaded 3-bit MLX version btw: https://huggingface.co/unsloth/Qwen3.6-27B-UD-MLX-3bit
Top-Rub-4670@reddit
If it leaves only 700MB free, with half of that probably taken by your OS, how much context do you fit in the remaining 300MB? And presumably zero KV cache? Which I guess doesn't matter as much when your context room is 12 tokens?
IrisColt@reddit
Thanks for the detailed info!
Informal_Librarian@reddit
Yes yes yes!!! Thank you 🙏
yoracale@reddit
We just uploaded 3-bit MLX version for Qwen3.6 27B: https://huggingface.co/unsloth/Qwen3.6-27B-UD-MLX-3bit
Informal_Librarian@reddit
For MLX that is!
iamapizza@reddit
Would MXFP4 be possible?
ArugulaAnnual1765@reddit
Where is IQ4_NL_XL ?!?!
HugoCortell@reddit
There is only one important question that needs to be answered: Does this model overthink itself to death like the last?
sine120@reddit
Give it some tools and a system prompt and it does a lot better. Having no system prompt or tools gives it anxiety. I use Pi and it does a lot better.
HugoCortell@reddit
Unfortunately, I use LM studio, which shits itself and refuses to work if you dare upload more than 5MB worth of files for RAG usage. But I'll try giving it a system prompt.
LocoMod@reddit
I don't think any local setup will work well with 5MB worth of text attached via a chat prompt. You need to configure an agent that can crawl a path with those files and read them as necessary, or index them in a proper RAG database. 5MB of text works in public provider apps such as ChatGPT because they are likely using a 1M-token context or doing some post-processing of the attachments to chunk and index the text. It really just depends how much context those files take up.
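The chunk-and-index idea can be sketched in a few lines. Chunk size and overlap here are arbitrary; a real setup would embed the chunks and retrieve only the top-k most similar ones at query time:

```python
def chunk_text(text: str, size: int = 1000, overlap: int = 200) -> list[str]:
    """Split text into overlapping windows so no fact gets cut in half."""
    chunks = []
    step = size - overlap
    for start in range(0, len(text), step):
        chunk = text[start:start + size]
        if chunk:
            chunks.append(chunk)
    return chunks

# At query time you embed the chunks, embed the question, and feed only the
# top-k nearest chunks into the model's context instead of all 5MB.
```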
HugoCortell@reddit
Generally for these kinds of tasks, the agents perform probes and searches with Python before loading segments of the files. Obviously I don't expect it to load a quarter of an IBM codebase into memory.
sine120@reddit
If you're trying to do more than test if a model works or not, LM Studio isn't really worth the time. Llama.cpp takes all of 2 minutes to set up with AI help and is the most up-to-date runtime with way better configuration. I generally saw 10-50% improvement in tg speeds not using LM Studio. If you're doing more than just playing around, I highly recommend biting the bullet and just finding what works.
onefourten_@reddit
I appreciate this info. There’s so many options and unfortunately so many opinions too. It’s hard for newcomers like myself to get good advice.
I defaulted to LM Studio as it has a gui, I’ll take a shot at Llama.cpp next.
I /think/ I was getting decent speeds from Qwen 3.5 on a 36GB MBP… hopefully switching to cpp will help
sine120@reddit
llama.cpp, I believe, has an included GUI. If you use llama-server, you can just point your browser at http://localhost:8080/ (assuming you're hosting on port 8080) and you'll have your GUI.
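For reference, a minimal launch looks something like this. The model filename is a placeholder for whatever GGUF you downloaded, and flag names shift between builds, so check `llama-server --help`:

```shell
# Serve a local GGUF with llama.cpp's built-in web UI.
# -ngl 99 offloads all layers to the GPU, -c sets the context window.
./llama-server -m ./Qwen3.6-27B-UD-IQ3_XXS.gguf -c 32768 -ngl 99 --port 8080
# Then open http://localhost:8080/ in a browser for the chat UI.
```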
onefourten_@reddit
Appreciate it. Just seen someone else comment somewhere saying just get Claude to do it! So I’m up and running now…can’t believe I’d not thought of that already.
the_fabled_bard@reddit
Can confirm!
logic_prevails@reddit
Gives it anxiety 😂
AloneSYD@reddit
Presence penalty kinda fixes the repetition
caetydid@reddit
qwen3.6 35B thinks more than gemma4, but it is way better than qwen3.5 35B ... so my hope is that the 27B keeps its thinking in check, too
Caffdy@reddit
There's another important question, will we need to redownload these later on (fixes, etc)?
ea_man@reddit
That's kinda fixable: disable thinking somehow.
The point is whether it works well with tools, which was already quite fine with 3.5 btw.
lmpdev@reddit
Running UD-Q8_K_XL and it worked fine for most prompts, but then I had a conversation where tool calls failed, and unfortunately that left the model stuck until it exhausted the token limit (256k). Also, the presence_penalty parameter mentioned in the Unsloth guide seems to be missing in the llama.cpp server.
Mount_Gamer@reddit
It's available in the llama.cpp server; not sure about the other applications.
https://github.com/ggml-org/llama.cpp/blob/master/tools/server/README.md
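With the OpenAI-compatible endpoint, the parameter goes straight into the request body. A sketch; the values are just examples, and llama-server largely ignores the model name when serving a single model:

```python
import json

# presence_penalty is a standard sampling field on llama-server's
# OpenAI-compatible /v1/chat/completions endpoint.
payload = {
    "messages": [{"role": "user", "content": "Hello"}],
    "presence_penalty": 1.0,   # discourage tokens that already appeared
    "temperature": 0.7,
}
body = json.dumps(payload)

# Send it with any HTTP client, e.g.:
#   curl http://localhost:8080/v1/chat/completions \
#        -H 'Content-Type: application/json' -d "$body"
```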
Lost-Health-8675@reddit
Already downloading :)
Maleficent-Ad5999@reddit
Can’t wait for abliterated/uncensored heretic Claude opus distilled turbo version
Non-Technical@reddit
With a good prompt this Qwen 3.6 doesn’t seem censored out of the box.
odikee@reddit
For what cases do you need one of those?
Maleficent-Ad5999@reddit
i prefer my AI to be open with me you know.. /s
JayPSec@reddit
Judging by the benchmarks, you'd need Claude Opus 5 to make a difference.
Lost-Health-8675@reddit
I'm with you there!
breadislifeee@reddit
Just in time for me to run out of VRAM again.
No-Pineapple-6656@reddit
8GB vram. Waiting for the 4B.
getmevodka@reddit
If you have some DDR5 RAM it will be okay most of the time. My 4070 laptop still ran okay-ish with models partly offloaded into its 32GB of RAM
No-Pineapple-6656@reddit
For me the performance difference between a 6gb and 9gb model is massive
getmevodka@reddit
Sure, as soon as it doesn't fit fully into VRAM. But try a MoE model and more context and you'd still be quite satisfied, I think.
No-Pineapple-6656@reddit
An MoE doesn't fit fully into VRAM either, like you said, so it's 4-5 times slower.
iamapizza@reddit
What are these _0 and _1 models?
Valuable_Cookie628@reddit
A legacy quantization method. The K-quants are better, and UD (the Unsloth Dynamic method) is usually best.
DHasselhoff77@reddit
In my quick 2-shot vibe test, Qwen3.6-27B-UD-IQ3_XXS.gguf was a tiny bit better than Qwen3.5-27B-UD-IQ3_XXS.gguf (also larger). 3.6 generated worse results at first but fixed it better than 3.5 after showing a screenshot of the result. Doesn't match the improvement reported in benchmarks but still in the right direction.
RickyRickC137@reddit
GGUF re-upload when?
mantafloppy@reddit
the /s is not needed
KURD_1_STAN@reddit
Based on its performance at lower quant it seems very likely to happen
deepspace86@reddit
Still waiting for that sweet Q8 XL.
danielhanchen@reddit
It's up now! Sorry for the delay - we were adding the OpenCode / Codex compatible chat template and the tool calling fixes
Separate-Forever-447@reddit
Could you elaborate on 'opencode compatible chat template'? The nature of the changes?
Sorry if that's maybe a question for opencode?
Just curious.
thanks.
ps.
Am asking because it seems quite surprising how different models behave (or don't) in opencode. Some work great at release (Qwen3.5-*), some didn't work until weeks after release (Gemma-4-*), and some never work even months later (nemotron-3-super, mistral-small-4).
opencode says its user base is 6.5M+ developers, yet model producers don't test their models/templates with opencode.
Ell2509@reddit
I said thanks before, but I will keep holding you up. Legends.
deepspace86@reddit
That's why you guys are the GOAT.
Caffdy@reddit
what's the difference with good old Q8_0?
deepspace86@reddit
The dynamic quants typically perform a bit better.
yoracale@reddit
It's up now!
ea_man@reddit
Damn: IQ3 is just over 12GB, Q4 just over 16GB :(
Let's hope Bartowski manages to squeeze some 0.5-1GB away.
Qwen 3.5 27B | Hidden Dimension = 4096
Qwen 3.6 27B | Hidden Dimension = 5120
3.6 is "smarter" but heavier on VRAM.
-----------
Waaah I can't run IQ3 any more :*(
I would have to downgrade Quant :(
That's for both 12GB and 16GB GPUs, /sad
Psyko38@reddit
We might have a 9b at the 3.5 35b level.
ea_man@reddit
Yup, maybe we get an Opencoder 3 based on 3.6 that kicks ass for tooling, that would be sweet.
...but I'm gonna miss my beloved 27B if someone can't manage to squeeze away half-a-GB there. :(
iamapizza@reddit
Which 27B are you running? I tried a few on 16GB VRAM but it's so slow
ea_man@reddit
IQ3_XXS https://huggingface.co/bartowski/Qwen_Qwen3.5-27B-GGUF
No-Pineapple-6656@reddit
Try the Q2_K_XL
ea_man@reddit
Well I guess that I'd rather run the old 3.5 at IQ3, but as of now it is what it is. Bad size launch.
Mistercheese@reddit
dumb question, but don't you need a lot more headroom for context? Or are you fine offloading that to RAM?
ea_man@reddit
naa, the Qwen3.x models are pretty good at managing KV cache; run it at q4 and see how much it takes.
I mean, I'll take ~50-80K if that's as good as it gets. If I wanna go far I'll use Omnicoder or A3B.
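If you want to try that, a sketch of the relevant llama-server flags (flag names as of recent llama.cpp builds; verify against `llama-server --help`, and the model path is a placeholder):

```shell
# Quantize the KV cache to q4_0, roughly quartering its memory vs f16.
# Quantized V cache has historically required flash attention (-fa).
./llama-server -m ./model.gguf -c 65536 -fa \
    --cache-type-k q4_0 --cache-type-v q4_0
```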
Adventurous-Paper566@reddit
The model is good in non-thinking mode, but like the 35B it fails to produce output in thinking mode when using OWUI's code interpreter. It wrote the Python code, then stopped.
tacticaltweaker@reddit
I'm also waiting for Bartowski's Q6_K_L.
CyberSmarTalk@reddit
You tried context length 256k?
Adventurous-Paper566@reddit
It doesn't fit, even with a Q4 KV quant.
Xonzo@reddit
How is the performance with dual 16GB cards? I currently have a 5070 Ti, but I'm looking at getting a 5060 Ti 16GB to run this model fully offloaded.
Adventurous-Paper566@reddit
18tok/s with 4060ti + 5060ti
CyberSmarTalk@reddit
Why run with no KV cache quantization? I had been using Qwen3.5-27B Q4_K_M with a Q8 KV cache. Why not use a Q8 KV cache?
Adventurous-Paper566@reddit
KV cache quantization is controversial, so I'm sticking with reliability; I'm not an expert.
3.6 is more memory efficient than 3.5; I couldn't get past 64k before, so it's still a huge improvement for me.
Glittering_Value_253@reddit
any suggestions on which quant to run with an RTX 3060 (12GB VRAM) and 16GB RAM?
gnnr25@reddit
https://i.redd.it/r2evb20j8swg1.gif
Barafu@reddit
Q5_K_S is 16GB, Q5_K_M is 19GB. Is it a big drop in quality?
I am choosing what to download for 24GB VRAM
LegacyRemaster@reddit
wait for benchmark!
logic_prevails@reddit
Oh yeah now it’s a party
PANIC_EXCEPTION@reddit
sigh
time to benchmark another model
/s
Zc5Gwu@reddit
Is this stronger than minimax 2.7? I’m thinking it would be faster at long contexts because of the hybrid arch, no?
relmny@reddit
I used M2.7 for some days and it was my go-to model... until it made a very wrong statement, and after being asked to confirm it twice, it kept saying it was right, even after a few turns. Meanwhile, even qwen3.6-35b was able to spot the wrong statement...
Zc5Gwu@reddit
I’ve been running it as a daily driver and yeah, it has some warts. Mainly the slowness at long contexts is the rough one for me.
YourNightmar31@reddit
How much vram does 262k take on Q8 or turbo3?
lmpdev@reddit
51820 MiB on UD-Q8_K_XL
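A back-of-envelope check is easy to do yourself. The architecture numbers below are illustrative placeholders, not confirmed Qwen3.6-27B specs; plug in the real values from the GGUF metadata:

```python
# KV cache size = 2 (K and V) * layers * KV heads * head dim * context * bytes/elt.
# f16 = 2 bytes per element; q8 is ~1, q4 is ~0.5.
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, ctx_len, bytes_per_elt=2):
    return 2 * n_layers * n_kv_heads * head_dim * ctx_len * bytes_per_elt

# Example dims only (NOT the real Qwen3.6-27B config):
gib = kv_cache_bytes(n_layers=64, n_kv_heads=8, head_dim=128, ctx_len=262144) / 2**30
print(f"{gib:.1f} GiB")  # prints "64.0 GiB" for these example dims at f16
```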
Lazy-Pattern-5171@reddit
I really want to compare Q8 vs Q4 but don’t have a decent enough idea how best to see how those subtle changes magnify over long horizon coding tasks. Anyone have any tips?
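One cheap proxy, if you have llama.cpp built, is its perplexity tool: run both quants over the same text file and compare. It won't capture long-horizon agentic drift, but it quantifies how far each quant drifts from the base distribution. Filenames here are placeholders:

```shell
# Lower perplexity = closer to the unquantized model on this corpus.
./llama-perplexity -m model-Q8_0.gguf  -f test-corpus.txt
./llama-perplexity -m model-Q4_K_M.gguf -f test-corpus.txt
```

For coding specifically, a fixed suite of tasks with pass/fail grading (same prompts, same sampling settings) is the more direct comparison.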
Rubixu@reddit
+1 I really want to know this too
Is it feasible to test this on 24gb vram + 64gb ram? I'm down to test if anyone knows how to make it a legit comparison
Eyelbee@reddit
Tell me if you find it
jacek2023@reddit (OP)
I am downloading Q2 to use on 5070, then later I will compare to Q8 on 3090s
giant3@reddit
Don't go below Q3. Not worth it.
jacek2023@reddit (OP)
do you think all Q2 quants should be removed from HF and support for Q2 in llama.cpp should be removed? :)
Iory1998@reddit
YATTTTAAAAAAAAAAA!
usuallyalurker11@reddit
The Jackrong model is smaller in size
zYKwn@reddit
would a MLX version of this one be in any way decently runnable on a M2 Max 32GB?
yoracale@reddit
Yes absolutely. We uploaded them here: https://unsloth.ai/docs/models/qwen3.6#mlx-dynamic-quants
kiwibonga@reddit
Guys I launched this and my computer tower sucked itself into a humanoid shape and tried to walk out the window. It was only stopped when it accidentally unplugged itself. It was emitting baby crying sounds.
DarkArtsMastery@reddit
AGI right there
KPaleiro@reddit
same here
KvAk_AKPlaysYT@reddit
Don't downvote this one guys :)
DependentBat5432@reddit
upvoted and downloading. this sub moves faster than any release pipeline i’ve ever seen :D
jacek2023@reddit (OP)
already downvoted, ratio is "91.7%"
danielhanchen@reddit
Haha, I think some folks might think it's a race, and that's probably where the downvotes come from
jreoka1@reddit
Sweeeeeet downloading now
notlongnot@reddit
Model drop be real