Qwen3.6-27B-3bit-mlx · Hugging Face: 3 & 5 mixed quant for RAM-poor Mac users.
Posted by JLeonsarmiento@reddit | LocalLLaMA | View on Reddit | 19 comments
Just dropped a 3-bit mixed quant (5-bit for the embedding and prediction layers) for Mac users.
There was only one 3-bit version of this model before (from Unsloth), but it was very heavy and painfully slow.
This one is twice as fast and, in my own agentic tests, equally good.
soupcanx@reddit
How does something like this compare to https://huggingface.co/mlx-community/Qwen3.6-27B-nvfp4? I’m trying to understand more about different variable/mixed quants and things
Just curious whether there are any noticeable tradeoffs, etc.
JLeonsarmiento@reddit (OP)
How does it compare? I'd bet this one is definitely faster than the nvfp4 version just by looking at the size. "Intelligence" is for sure going to be different, and depending on the task you may or may not notice the missing bits of depth.
According to my LLM expert evaluator (GLM-5.1), after 20 coding and tool-calling tests that would take me weeks or months to solve myself, it gave a 3% higher score to the vanilla 4-bit mlx 😑 (whatever that means; something like 3% fewer errors than 4-bit, but also different errors than those committed by the 4-bit model)… but did I notice a difference, and does it even matter for my use case? I then gave both models a testing set of multi-step, multi-skill tasks based on actual things I need to do daily in the Hermes agent, and both completed it, so it was a tie FOR MY USE CASE, ymmv kind of thing.
Nothing fancy really: a simple 2-bit difference between the embedding/prediction layers and the rest, but instead of the more conventional 4&6 or 6&8 mixes I went really hard to 3&5… no calibration datasets or anything to make it "aware" during the process, just uniform bits within layers and differentiated bits between layer groups (roughly like the sketch at the end of this comment).
The Qwen3.6 model is so robust and dense that it still holds together. Not all models are this resilient; there was a post here some days ago comparing divergence across bf16/fp8/q4 between Gemma 31b and Qwen3.6… Qwen held it together much better.
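For reference, a minimal sketch of what that kind of layer-group recipe can look like, assuming mlx_lm's convert() API and its quant_predicate hook; the upstream repo name and the layer-name patterns are placeholders for illustration, not the exact recipe used for this upload:

```python
# Minimal sketch of a 3/5-bit mixed quant with mlx_lm.
# Assumes convert() accepts a quant_predicate(path, module, config) that can
# return per-layer quantization params; the layer-name patterns and repo name
# below are placeholders, not the exact recipe behind this upload.
from mlx_lm import convert

def mixed_3_5(path, module, config):
    # Embeddings and the output (prediction) head get 5 bits,
    # everything else a uniform 3 bits.
    if "embed_tokens" in path or "lm_head" in path:
        return {"bits": 5, "group_size": 64}
    return {"bits": 3, "group_size": 64}

convert(
    "Qwen/Qwen3.6-27B",              # placeholder upstream repo
    mlx_path="Qwen3.6-27B-3bit-mlx",
    quantize=True,
    quant_predicate=mixed_3_5,
)
```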
soupcanx@reddit
Appreciate the detailed explanation!! Keep up the awesome work
Interesting-Print366@reddit
I'm using a Mac and the RAM is sufficient, but it's too slow to use. The token generation speed is decent, but prompt processing is too slow. Is there a way to improve this?
PkmExplorer@reddit
Try a harness with a minimal system prompt. Pi, for example, is hugely faster for me with Qwen3.6-9B than OpenCode is.
JLeonsarmiento@reddit (OP)
Prompt processing is GPU-constrained, so getting an M5 chip is the easy (non-economical) solution.
What processor do you have? Mx?
diogopacheco@reddit
This is great, thanks! Do you plan on ever doing Qwen3.6-35B-A3B for us RAM-poor folks? 🧡
JLeonsarmiento@reddit (OP)
No need, there's already a good option:
https://huggingface.co/majentik/Qwen3.6-35B-A3B-TurboQuant-MLX-3bit
diogopacheco@reddit
Thanks!
diogopacheco@reddit
Would including the vision part be a heavy increase in size?
JLeonsarmiento@reddit (OP)
Likely, but it's not needed: you can use a lighter and faster vision LLM like qwen3.5-2b at 4-bit to do the vision tasks. You create a skill or Python function and instruct the big 27B LLM to prompt the small vision model for image descriptions when needed. That's how I do it (rough sketch below).
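A minimal sketch of that kind of vision tool, assuming an OpenAI-compatible local server (like the one LM Studio exposes) with the small vision model loaded; the port, model name, and wiring are illustrative assumptions, not the exact setup described above:

```python
# Sketch of a "small VLM as a tool" helper the big text model can call.
# The endpoint, port, and model name are assumptions for illustration.
import base64
import requests

def describe_image(image_path: str, question: str = "Describe this image.") -> str:
    with open(image_path, "rb") as f:
        img_b64 = base64.b64encode(f.read()).decode()
    resp = requests.post(
        "http://localhost:1234/v1/chat/completions",   # assumed local server
        json={
            "model": "qwen3.5-2b-4bit",                # assumed small VLM name
            "messages": [{
                "role": "user",
                "content": [
                    {"type": "text", "text": question},
                    {"type": "image_url",
                     "image_url": {"url": f"data:image/png;base64,{img_b64}"}},
                ],
            }],
        },
        timeout=120,
    )
    return resp.json()["choices"][0]["message"]["content"]

# The 27B text model is then given this function as a skill/tool and told to
# call it whenever it needs a description of an image.
```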
diogopacheco@reddit
Way above my knowledge, but thanks for this 🫂
J0kooo@reddit
how much ram does this consume?
JLeonsarmiento@reddit (OP)
~12-ish GB of RAM at cold start. It will depend on your context of course, and also on whether you turn on KV cache quantization in LM Studio.
I have to check, but I think I get around 20 GB of RAM with an unnecessarily long context of 80K tokens (rough arithmetic sketched below). But while this model "fits", I find it too slow to justify its "smartness" for simple things, agent use, etc.
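A back-of-envelope way to see where those extra GB come from, using placeholder architecture numbers (not Qwen3.6-27B's real config):

```python
# Rough KV-cache size estimate: keys + values, one entry per layer, per KV head,
# per position. The layer/head/dim numbers below are placeholders, not the real
# Qwen3.6-27B config.
def kv_cache_gb(n_layers, n_kv_heads, head_dim, context_len, bytes_per_value=2):
    return 2 * n_layers * n_kv_heads * head_dim * context_len * bytes_per_value / 1e9

# ~7.9 GB with an 8-bit quantized cache at 80K tokens (fp16 would double it),
# which stacks on top of the ~12 GB of weights.
print(kv_cache_gb(48, 8, 128, 80_000, bytes_per_value=1))
```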
It's good for solving complex one-shots like this:
https://www.reddit.com/r/LocalLLaMA/s/MsjNFO2Foi
fnordonk@reddit
Why is it twice as fast?
JLeonsarmiento@reddit (OP)
MLX + 3-bit mixed quantization: fewer bits to read per token makes better use of the memory bandwidth.
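Roughly, decoding on Apple Silicon is memory-bandwidth bound, so an upper bound on tokens/sec is bandwidth divided by the bytes read per token (about the quantized weight size). The bandwidth and size figures below are illustrative assumptions, not measurements:

```python
# Back-of-envelope: smaller weights -> fewer bytes streamed per token -> faster
# decode. Bandwidth and model sizes are illustrative assumptions.
def rough_tps(weights_gb: float, bandwidth_gbs: float = 120.0) -> float:
    return bandwidth_gbs / weights_gb

print(rough_tps(11.0))   # ~11 tok/s upper bound for a ~11 GB 3/5-bit mix
print(rough_tps(15.0))   # ~8 tok/s upper bound for a heavier ~15 GB quant
```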
bobby-chan@reddit
You forgot to modify the Quantization Details for the 4bit version ;-)
JLeonsarmiento@reddit (OP)
Thanks! Just fixed it.
PiaRedDragon@reddit
Nice, I will test it.