Gemma4-31B-3bit-mlx · Hugging Face: 3- & 5-bit mixed quant for RAM-poor Mac users.
Posted by JLeonsarmiento@reddit | LocalLLaMA | View on Reddit | 8 comments
Just dropped another 3- and 5-bit mixed quant for the RAM-poor, base-model-only Mac users who want to try Gemma4, Google's top-of-the-line LLM.
6 GB smaller than the other 3-bit MLX quant out there, and 25% faster.
Thicc and dense: 13 GB of pure LLM sweetness from Google, for the desperate who don't care about vision (for that, just use something faster and equally good, like the tiny Qwen3.5-2B).
Ideal if:
- You just prefer the latest Gemma4's humanities/communications/social-studies edge over Qwen3.6's hard STEM focus on your 24 GB RAM Mac.
- You don't like or need overly verbose thinking models (Qwen3.x 👀). Gemma4 chews through only about 1/4 as many 'thinking' tokens as Qwen3.6.
Recommended Inference Parameters
For the best performance, use the following standardized sampling configuration across all use cases:
| Parameter | Value |
|---|---|
| temperature | 1.0 |
| top_p | 0.95 |
| top_k | 64 |
| min_p | 0.05 |
| repeat_penalty | 1.05 |
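If you're scripting against the model instead of using LM Studio, here's a minimal sketch of plugging these values into mlx-lm. The `make_sampler`/`make_logits_processors` helpers are from recent mlx-lm versions, and the repo id below is a placeholder — check the actual model page:

```python
# Minimal sketch: apply the recommended sampling parameters via mlx-lm.
from mlx_lm import load, generate
from mlx_lm.sample_utils import make_sampler, make_logits_processors

# Placeholder repo id -- substitute the actual Hugging Face id.
model, tokenizer = load("JLeonsarmiento/Gemma4-31B-3bit-mlx")

# temperature=1.0, top_p=0.95, top_k=64, min_p=0.05
sampler = make_sampler(temp=1.0, top_p=0.95, min_p=0.05, top_k=64)
# repeat_penalty=1.05
logits_processors = make_logits_processors(repetition_penalty=1.05)

prompt = tokenizer.apply_chat_template(
    [{"role": "user", "content": "Explain mixed 3/5-bit quantization in two sentences."}],
    add_generation_prompt=True,
)

print(generate(model, tokenizer, prompt=prompt,
               sampler=sampler, logits_processors=logits_processors,
               max_tokens=256))
```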
LM Studio — Reasoning Section Parsing
To enable thinking/reasoning output parsing:
- Start string: `<|channel>thought`
- End string: `<channel|>`
Add to the Jinja template:
{%- set enable_thinking = true %}
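Outside LM Studio, you can split the reasoning block out of a raw completion yourself. A rough sketch using the start/end strings above:

```python
# Rough sketch: split a raw completion into (thought, answer) parts
# using the start/end strings quoted above.
import re

THOUGHT_RE = re.compile(r"<\|channel>thought(.*?)<channel\|>", re.DOTALL)

def split_reasoning(raw: str) -> tuple[str, str]:
    m = THOUGHT_RE.search(raw)
    if not m:
        return "", raw  # no reasoning block found
    thought = m.group(1).strip()
    answer = (raw[: m.start()] + raw[m.end():]).strip()
    return thought, answer
```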
aigemie@reddit
Thanks for sharing! Do you know which one is better, this one or your Qwen3.6 27B 3.5-bit?
JLeonsarmiento@reddit (OP)
STEM: Qwen3.6, amazing chains of thought.
Gemma4… I like how it writes/translates things. It also spends only about 1/4 as many thinking tokens as Qwen3.6, so it's faster in the end.
TomLucidor@reddit
What about common sense reasoning and document analysis? Also how are Qwen3.6-35B-A3B and the Gemma4-28B-A4B (MoE models)?
JLeonsarmiento@reddit (OP)
They have the same traces: dense and MoE within each “family” show the same “patterns” when building chains of thought and structuring their answers, which is very interesting and consistent. It means I can swap dense for MoE when needed and the expected type of response will be there. That’s amazing. Same training data corpus, I guess?
Common sense? Difficult to say; they seem about the same, tbh. Qwen thinks more than Gemma, so it’s easier to examine its chain of thought and look for reasoning flaws, but that’s it. Because of this, though, Qwen does have a significant edge over Gemma for STEM uses.
My “benchmark” yesterday was to make each go through a long document I’m actually working with, extract all the methodological and input/output data, explain it to me as a podcast, and serve it to me via Telegram to listen to while I’m driving.
Qwen got everything that matters right, down to the detail, on the first shot (5-bit MLX), with the level of detail I was expecting (I knew the document beforehand), but Gemma (6-bit MLX) wrote the best podcast (rhyme, timing, flow of the sentences) after I pushed it twice to go deeper into the analysis and data extraction from the text.
Both have their ideal use cases for me.
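For anyone who wants to reproduce that kind of pipeline, here is a rough sketch assuming LM Studio's OpenAI-compatible local server and the Telegram Bot API; the model id, bot token, and chat id are placeholders:

```python
# Rough sketch: long document -> local model -> podcast script -> Telegram.
import requests

BOT_TOKEN = "..."  # placeholder: your Telegram bot token
CHAT_ID = "..."    # placeholder: your Telegram chat id

document = open("paper.txt").read()

# LM Studio serves an OpenAI-compatible API on localhost:1234 by default.
resp = requests.post(
    "http://localhost:1234/v1/chat/completions",
    json={
        "model": "gemma4-31b-3bit-mlx",  # placeholder model id
        "messages": [{
            "role": "user",
            "content": ("Extract all methodological and input/output data "
                        "from the document below and explain it as a podcast "
                        "script:\n\n" + document),
        }],
        "temperature": 1.0,
        "top_p": 0.95,
    },
)
script = resp.json()["choices"][0]["message"]["content"]

# Telegram messages cap at 4096 characters, so send in chunks.
for i in range(0, len(script), 4096):
    requests.post(
        f"https://api.telegram.org/bot{BOT_TOKEN}/sendMessage",
        json={"chat_id": CHAT_ID, "text": script[i : i + 4096]},
    )
```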
diogopacheco@reddit
Thanks! Shame it doesn’t have vision. I get completely different results using Qwen3.5-2B compared to larger vision models (probably due to the system prompt I’m using). If you ever feel like it, vision would be a great feature to add, since there aren’t many vision MLX models.
JLeonsarmiento@reddit (OP)
Yeah, I get your point. I left that out because my reasoning is that for vision-related tasks you usually want a fast model (OCR, image classification, etc.), and just let this fatty focus on munching language with a minimal RAM footprint.
Labtester@reddit
Thank you. Just out of curiosity, why are the extra steps needed for thinking, as compared to the original GGUF that has it enabled out of the box?
JLeonsarmiento@reddit (OP)
🤷🏻‍♂️ No idea; I guess I lost the chat template during the download/upload to Hugging Face…