unsloth Qwen3.6-27B-GGUF
Posted by jacek2023@reddit | LocalLLaMA | View on Reddit | 105 comments
finally with files inside :)
yoracale@reddit
The Q8 and BF16 should be uploading any minute now.
We also uploaded MLX quants btw: https://unsloth.ai/docs/models/qwen3.6#mlx-dynamic-quants
tgsz@reddit
Is MTP not available with the 3.6-27B dense GGUFs? I'm seeing some conflicting info between the model card and the GGUF info.
ea_man@reddit
Man, you gotta save the day and manage to cook an IQ3 version that comes in just a bit (say 0.7GB) under 12GB for people with 12GB GPUs, and same thing for a Q4_* that lands just shy of 16GB, otherwise people without a 24GB GPU won't be able to actually run Qwen3.6 27B.
The new Qwen3.6 release is some 20% heavier on its hidden layer dimensions; damn, the new size for 3.6 means taking a downgrade on quant version to load the model :(
DistanceSolar1449@reddit
0.7GB is a lot at 12GB lol
yoracale@reddit
We uploaded 3-bit MLX version btw: https://huggingface.co/unsloth/Qwen3.6-27B-UD-MLX-3bit
Top-Rub-4670@reddit
If it leaves only 700MB free, with half of that probably taken by your OS, how much context do you fit in the remaining 300MB? And presumably zero KV cache? Which I guess doesn't matter as much when your context room is 12 tokens?
IrisColt@reddit
Thanks for the detailed info!
Informal_Librarian@reddit
Yes yes yes!!! Thank you 🙏
yoracale@reddit
We just uploaded 3-bit MLX version for Qwen3.6 27B: https://huggingface.co/unsloth/Qwen3.6-27B-UD-MLX-3bit
Informal_Librarian@reddit
For MLX that is!
iamapizza@reddit
Would MXFP4 be possible?
ArugulaAnnual1765@reddit
Where is IQ4_NL_XL ?!?!
HugoCortell@reddit
There is only one important question that needs to be answered: Does this model overthink itself to death like the last?
sine120@reddit
Give it some tools and a system prompt and it does a lot better. Having no system prompt or tools gives it anxiety. I use Pi and it does a lot better.
HugoCortell@reddit
Unfortunately, I use LM studio, which shits itself and refuses to work if you dare upload more than 5MB worth of files for RAG usage. But I'll try giving it a system prompt.
LocoMod@reddit
I don't think any local setup will work well with 5MB worth of text attached via a chat prompt. You need to configure an agent that can crawl a path with those files and read them as necessary, or index them in a proper RAG database. 5MB of text works in public provider apps such as ChatGPT because they are likely using a 1M-token context or doing some post-processing of the attachments to chunk and index the text. It really just depends how much context those files take up.
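The chunk-and-index idea can be sketched in a few lines. Chunk size and overlap here are arbitrary; a real setup would embed the chunks and retrieve only the top-k most similar ones at query time:

```python
def chunk_text(text: str, size: int = 1000, overlap: int = 200) -> list[str]:
    """Split text into overlapping windows so no fact gets cut in half."""
    chunks = []
    step = size - overlap
    for start in range(0, len(text), step):
        chunk = text[start:start + size]
        if chunk:
            chunks.append(chunk)
    return chunks

# At query time you embed the chunks, embed the question, and feed only the
# top-k nearest chunks into the model's context instead of all 5MB.
```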
HugoCortell@reddit
Generally for these kinds of tasks, the agents perform probes and searches with Python before loading segments of the files. Obviously I don't expect it to load a quarter of an IBM codebase into memory.
sine120@reddit
If you're trying to do more than test if a model works or not, LM Studio isn't really worth the time. Llama.cpp takes all of 2 minutes to set up with AI help and is the most up-to-date runtime with way better configuration. I generally saw 10-50% improvement in tg speeds not using LM Studio. If you're doing more than just playing around, I highly recommend biting the bullet and just finding what works.
onefourten_@reddit
I appreciate this info. There’s so many options and unfortunately so many opinions too. It’s hard for newcomers like myself to get good advice.
I defaulted to LM Studio as it has a gui, I’ll take a shot at Llama.cpp next.
I /think/ I was getting decent speeds from Qwen 3.5 on a 36GB MBP… hopefully switching to cpp will help
sine120@reddit
llama.cpp, I believe, has an included GUI. If you use llama-server, you can just point your browser at http://localhost:8080/ (assuming you're hosting on port 8080) and you'll have your GUI.
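For reference, a minimal launch looks something like this. The model filename is a placeholder for whatever GGUF you downloaded, and flag names shift between builds, so check `llama-server --help`:

```shell
# Serve a local GGUF with llama.cpp's built-in web UI.
# -ngl 99 offloads all layers to the GPU, -c sets the context window.
./llama-server -m ./Qwen3.6-27B-UD-IQ3_XXS.gguf -c 32768 -ngl 99 --port 8080
# Then open http://localhost:8080/ in a browser for the chat UI.
```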
onefourten_@reddit
Appreciate it. Just seen someone else comment somewhere saying just get Claude to do it! So I’m up and running now…can’t believe I’d not thought of that already.
the_fabled_bard@reddit
Can confirm!
logic_prevails@reddit
Gives it anxiety 😂
AloneSYD@reddit
Presence penalty kinda fixes the repetition
caetydid@reddit
qwen3.6 35B thinks more than gemma4, but it is way better than qwen3.5 35B ... so my hope is that the 27B keeps its thinking in check, too
Caffdy@reddit
There's another important question, will we need to redownload these later on (fixes, etc)?
ea_man@reddit
That's kinda fixable: disable thinking somehow.
The point is whether it works well with tools, which was already quite fine with 3.5 btw.
lmpdev@reddit
Running UD-Q8_K_XL and it worked fine for most prompts, but then I had a conversation where tool calls failed, and unfortunately that left the model stuck until it exhausted the token limit (256k). Also, the presence_penalty parameter mentioned in the Unsloth guide seems to be missing in the llama.cpp server.
Mount_Gamer@reddit
It's available in the llama.cpp server; not sure about the other applications.
https://github.com/ggml-org/llama.cpp/blob/master/tools/server/README.md
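With the OpenAI-compatible endpoint, the parameter goes straight into the request body. A sketch; the values are just examples, and llama-server largely ignores the model name when serving a single model:

```python
import json

# presence_penalty is a standard sampling field on llama-server's
# OpenAI-compatible /v1/chat/completions endpoint.
payload = {
    "messages": [{"role": "user", "content": "Hello"}],
    "presence_penalty": 1.0,   # discourage tokens that already appeared
    "temperature": 0.7,
}
body = json.dumps(payload)

# Send it with any HTTP client, e.g.:
#   curl http://localhost:8080/v1/chat/completions \
#        -H 'Content-Type: application/json' -d "$body"
```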
Lost-Health-8675@reddit
Already downloading :)
Maleficent-Ad5999@reddit
Can’t wait for abliterated/uncensored heretic Claude opus distilled turbo version
Non-Technical@reddit
With a good prompt this Qwen 3.6 doesn’t seem censored out of the box.
odikee@reddit
For what cases do you need one of those?
Maleficent-Ad5999@reddit
i prefer my AI to be open with me you know.. /s
JayPSec@reddit
Judging by the benchmarks, you'd need Claude Opus 5 to make a difference.
Lost-Health-8675@reddit
I'm with you there!
breadislifeee@reddit
Just in time for me to run out of VRAM again.
No-Pineapple-6656@reddit
8GB vram. Waiting for the 4B.
getmevodka@reddit
If you have some DDR5 RAM it will be okay most of the time. My 4070 laptop still ran okay-ish with models partly offloaded into its 32GB of RAM
No-Pineapple-6656@reddit
For me the performance difference between a 6gb and 9gb model is massive
getmevodka@reddit
Sure, as soon as it doesn't fit fully into VRAM. But try a MoE model and more context and you'd still be quite satisfied, I think.
No-Pineapple-6656@reddit
An MoE doesn't fit fully into VRAM either, like you said, so it's 4-5 times slower.
iamapizza@reddit
What are these _0 and _1 models?
Valuable_Cookie628@reddit
A legacy quantization method. The K-quants are better, and UD (the Unsloth Dynamic method) is usually best.
DHasselhoff77@reddit
In my quick 2-shot vibe test, Qwen3.6-27B-UD-IQ3_XXS.gguf was a tiny bit better than Qwen3.5-27B-UD-IQ3_XXS.gguf (also larger). 3.6 generated worse results at first but fixed it better than 3.5 after showing a screenshot of the result. Doesn't match the improvement reported in benchmarks but still in the right direction.
RickyRickC137@reddit
GGUF re-upload when?
mantafloppy@reddit
the /s is not needed
KURD_1_STAN@reddit
Based on its performance at lower quant it seems very likely to happen
deepspace86@reddit
Still waiting for that sweet Q8 XL.
danielhanchen@reddit
It's up now! Sorry for the delay - we were adding the OpenCode / Codex compatible chat template and the tool calling fixes
Separate-Forever-447@reddit
Could you elaborate on 'opencode compatible chat template'? The nature of the changes?
Sorry if that's maybe a question for opencode?
Just curious.
thanks.
ps.
Am asking because it seems quite surprising how different models behave (or don't) in opencode. Some work great at release (Qwen3.5-*), some didn't work until weeks after release (Gemma-4-*), and some never work even months later (nemotron-3-super, mistral-small-4).
opencode says its user base is 6.5M+ developers, yet model producers don't test their models/templates with opencode.
Ell2509@reddit
I said thanks before, but I will keep holding you up. Legends.
deepspace86@reddit
That's why you guys are the GOAT.
Caffdy@reddit
what's the difference with good old Q8_0?
deepspace86@reddit
The dynamic quants typically perform a bit better.
yoracale@reddit
It's up now!
ea_man@reddit
Damn: IQ3 is just over 12GB, Q4 just over 16GB :(
Let's hope Bartowski manages to squeeze some 0.5-1GB away.
Qwen 3.5 27B | Hidden Dimension = 4096
Qwen 3.6 27B | Hidden Dimension = 5120
3.6 is "smarter" but heavier on VRAM.
-----------
Waaah I can't run IQ3 any more :*(
I would have to downgrade Quant :(
That's for both 12GB and 16GB GPUs, /sad
Psyko38@reddit
We might have a 9b at the 3.5 35b level.
ea_man@reddit
Yup, maybe we get an Opencoder 3 based on 3.6 that kicks ass for tooling, that would be sweet.
...but I'm gonna miss my beloved 27B if someone can't manage to squeeze away half-a-GB there. :(
iamapizza@reddit
Which 27B are you running? I tried a few on 16GB VRAM but it's so slow
ea_man@reddit
IQ3_XXS https://huggingface.co/bartowski/Qwen_Qwen3.5-27B-GGUF
No-Pineapple-6656@reddit
Try the Q2_K_XL
ea_man@reddit
Well I guess that I'd rather run the old 3.5 at IQ3, but as of now it is what it is. Bad size launch.
Mistercheese@reddit
dumb question, but don't you need a lot more headroom for context? Or are you fine offloading that to RAM?
ea_man@reddit
naa, the Qwen3.x models are pretty good at managing KV cache; run it at q4 and see how much it takes.
I mean, I'll take ~50-80K if that's as good as it gets. If I wanna go far I'll use Omnicoder or A3B.
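If you want to try that, a sketch of the relevant llama-server flags (flag names as of recent llama.cpp builds; verify against `llama-server --help`, and the model path is a placeholder):

```shell
# Quantize the KV cache to q4_0, roughly quartering its memory vs f16.
# Quantized V cache has historically required flash attention (-fa).
./llama-server -m ./model.gguf -c 65536 -fa \
    --cache-type-k q4_0 --cache-type-v q4_0
```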
Adventurous-Paper566@reddit
The model is good in non-thinking mode, but like the 35B it fails to produce output in thinking mode when using OWUI's code interpreter. It wrote the Python code, then stopped.
tacticaltweaker@reddit
I'm also waiting for Bartowski's Q6_K_L.
CyberSmarTalk@reddit
You tried context length 256k?
Adventurous-Paper566@reddit
It doesn't fit, even with a Q4 KV quant.
Xonzo@reddit
How is the performance with dual 16GB cards? I currently have a 5070 Ti, but I'm looking at getting a 5060 Ti 16GB to run this model fully offloaded.
Adventurous-Paper566@reddit
18tok/s with 4060ti + 5060ti
CyberSmarTalk@reddit
Why run with no KV cache quantization? I had been using Qwen3.5-27B Q4_K_M with a Q8 KV cache. Why not use a Q8 KV cache?
Adventurous-Paper566@reddit
KV cache quantization is controversial, so I'm sticking with reliability; I'm not an expert.
3.6 is more memory efficient than 3.5; I couldn't get past 64k before, so it's still a huge improvement for me.
Glittering_Value_253@reddit
any suggestions on which quant to run with an RTX 3060 (12GB VRAM) and 16GB RAM?
gnnr25@reddit
https://i.redd.it/r2evb20j8swg1.gif
Barafu@reddit
Q5_K_S is 16GB, Q5_K_M is 19GB. Is it a big drop in quality?
I am choosing what to download for 24GB VRAM
LegacyRemaster@reddit
wait for benchmark!
logic_prevails@reddit
Oh yeah now it’s a party
PANIC_EXCEPTION@reddit
sigh
time to benchmark another model
/s
Zc5Gwu@reddit
Is this stronger than minimax 2.7? I’m thinking it would be faster at long contexts because of the hybrid arch, no?
relmny@reddit
I used M2.7 for some days and it was my go-to model... until it made a very wrong statement, and after being asked to confirm it twice, it kept saying it was right, even after a few turns. Meanwhile, even qwen3.6-35b was able to spot the wrong statement...
Zc5Gwu@reddit
I’ve been running it as a daily driver and yeah, it has some warts. Mainly the slowness at long contexts is the rough one for me.
YourNightmar31@reddit
How much vram does 262k take on Q8 or turbo3?
lmpdev@reddit
51820 MiB on UD-Q8_K_XL
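A back-of-envelope check is easy to do yourself. The architecture numbers below are illustrative placeholders, not confirmed Qwen3.6-27B specs; plug in the real values from the GGUF metadata:

```python
# KV cache size = 2 (K and V) * layers * KV heads * head dim * context * bytes/elt.
# f16 = 2 bytes per element; q8 is ~1, q4 is ~0.5.
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, ctx_len, bytes_per_elt=2):
    return 2 * n_layers * n_kv_heads * head_dim * ctx_len * bytes_per_elt

# Example dims only (NOT the real Qwen3.6-27B config):
gib = kv_cache_bytes(n_layers=64, n_kv_heads=8, head_dim=128, ctx_len=262144) / 2**30
print(f"{gib:.1f} GiB")  # prints "64.0 GiB" for these example dims at f16
```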
Lazy-Pattern-5171@reddit
I really want to compare Q8 vs Q4 but don’t have a decent enough idea how best to see how those subtle changes magnify over long horizon coding tasks. Anyone have any tips?
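One cheap proxy, if you have llama.cpp built, is its perplexity tool: run both quants over the same text file and compare. It won't capture long-horizon agentic drift, but it quantifies how far each quant drifts from the base distribution. Filenames here are placeholders:

```shell
# Lower perplexity = closer to the unquantized model on this corpus.
./llama-perplexity -m model-Q8_0.gguf  -f test-corpus.txt
./llama-perplexity -m model-Q4_K_M.gguf -f test-corpus.txt
```

For coding specifically, a fixed suite of tasks with pass/fail grading (same prompts, same sampling settings) is the more direct comparison.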
Rubixu@reddit
+1 I really want to know this too
Is it feasible to test this on 24gb vram + 64gb ram? I'm down to test if anyone knows how to make it a legit comparison
Eyelbee@reddit
Tell me if you find it
jacek2023@reddit (OP)
I am downloading Q2 to use on 5070, then later I will compare to Q8 on 3090s
giant3@reddit
Don't go below Q3. Not worth it.
jacek2023@reddit (OP)
do you think all Q2 quants should be removed from HF and support for Q2 in llama.cpp should be removed? :)
Iory1998@reddit
YATTTTAAAAAAAAAAA!
usuallyalurker11@reddit
The Jackrong model is smaller in size
zYKwn@reddit
would a MLX version of this one be in any way decently runnable on a M2 Max 32GB?
yoracale@reddit
Yes absolutely. We uploaded them here: https://unsloth.ai/docs/models/qwen3.6#mlx-dynamic-quants
kiwibonga@reddit
Guys I launched this and my computer tower sucked itself into a humanoid shape and tried to walk out the window. It was only stopped when it accidentally unplugged itself. It was emitting baby crying sounds.
DarkArtsMastery@reddit
AGI right there
KPaleiro@reddit
same here
KvAk_AKPlaysYT@reddit
Don't downvote this one guys :)
DependentBat5432@reddit
upvoted and downloading. this sub moves faster than any release pipeline i’ve ever seen :D
jacek2023@reddit (OP)
already downvoted, ratio is "91.7%"
danielhanchen@reddit
Haha, I think some folks might think it's a race, and that's probably where the downvotes come from
jreoka1@reddit
Sweeeeeet downloading now
notlongnot@reddit
Model drop be real