Gemma 3 QAT launch with MLX, llama.cpp, Ollama, LM Studio, and Hugging Face

Posted by hackerllama@reddit | LocalLLaMA | View on Reddit | 50 comments

Hi! Some weeks ago we released GGUFs corresponding to the QAT checkpoints of Gemma 3. Thanks to QAT, the model preserves quality similar to `bfloat16` while significantly reducing the memory required to load it. That is, QAT is an additional fine-tuning step that makes the model more robust to quantization.

As we only released the GGUFs, we got feedback that it would be great to have the unquantized QAT-based checkpoints so people can quantize for their own tools. So... we did it! Today we're releasing the unquantized QAT-based checkpoints. The models preserve quality better than naive quantization.

**We also collaborated with Prince (from MLX), llama.cpp, Ollama, LM Studio, and Hugging Face to make sure you can use the models in all your favorite tools!**

* Blog post: [https://developers.googleblog.com/en/gemma-3-quantized-aware-trained-state-of-the-art-ai-to-consumer-gpus/](https://developers.googleblog.com/en/gemma-3-quantized-aware-trained-state-of-the-art-ai-to-consumer-gpus/)
* Unquantized checkpoints: [https://huggingface.co/collections/google/gemma-3-qat-67ee61ccacbf2be4195c265b](https://huggingface.co/collections/google/gemma-3-qat-67ee61ccacbf2be4195c265b)
* Ollama: [https://ollama.com/library/gemma3](https://ollama.com/library/gemma3) (try `ollama run gemma3:12b-it-qat`)
* LM Studio: [https://lmstudio.ai/model/gemma-3-12b-it-qat](https://lmstudio.ai/model/gemma-3-12b-it-qat)
* MLX: [https://huggingface.co/collections/mlx-community/gemma-3-qat-68002674cd5afc6f9022a0ae](https://huggingface.co/collections/mlx-community/gemma-3-qat-68002674cd5afc6f9022a0ae)
* llama.cpp: [https://huggingface.co/collections/google/gemma-3-qat-67ee61ccacbf2be4195c265b](https://huggingface.co/collections/google/gemma-3-qat-67ee61ccacbf2be4195c265b)

Enjoy!
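As a rough illustration of the memory reduction mentioned in the post, the sketch below estimates weight storage for bf16 versus a Q4_0-style layout. The 4.5 bits per weight assumes 32 four-bit values plus one 16-bit scale per block; the parameter counts are the nominal model sizes, and `weight_gb` is a name made up for this example. KV cache, activations, and any tensors kept at higher precision are ignored, so real file sizes will differ.

```python
# Rough weight-memory estimate; a sketch under stated assumptions, not a measurement.

BF16_BITS = 16.0
# Assumed Q4_0-style layout: 32 x 4-bit values plus one 16-bit scale per block.
Q4_0_BITS = (32 * 4 + 16) / 32  # 4.5 bits per weight


def weight_gb(n_params: float, bits_per_weight: float) -> float:
    """Approximate weight storage in decimal gigabytes (1 GB = 10**9 bytes)."""
    return n_params * bits_per_weight / 8 / 1e9


if __name__ == "__main__":
    # Nominal parameter counts (assumption: the sizes named in the model tags).
    for name, n_params in [("4B", 4e9), ("12B", 12e9), ("27B", 27e9)]:
        bf16 = weight_gb(n_params, BF16_BITS)
        q4 = weight_gb(n_params, Q4_0_BITS)
        print(f"{name}: bf16 ~{bf16:.1f} GB, Q4_0 ~{q4:.1f} GB, ratio {bf16 / q4:.2f}x")
```

Under these assumptions the ratio for pure weights is 16 / 4.5, roughly 3.6x, regardless of model size.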

50 Comments

[-]

Fluffy_Sheepherder76@reddit

This makes running Gemma3 on laptops without melting the RAM way more doable. Love it

[-]

gptlocalhost@reddit

Thanks for the release. We just tested the Gemma 3 QAT (27B) model on an M1 Max (64 GB) with Word, like this: [https://youtu.be/_cJQDyJqBAc](https://youtu.be/_cJQDyJqBAc) If you have any specific use cases, we'd be glad to give them a try.

[-]

AdOdd4004@reddit

I am not sure why, but time to first token for the MLX models is very long (e.g., 6+ seconds), even for smaller models like 4B or 12B.

[-]

karl-william@reddit

Are the Gemma 3 QAT models released on Ollama now multimodal?

[-]

hackerllama@reddit (OP)

Yes

[-]

Disonantemus@reddit

Download and run with:

* `ollama run gemma3:4b-it-qat`
* `ollama run gemma3:12b-it-qat`
* `ollama run gemma3:27b-it-qat`

Info from [ollama library](https://ollama.com/library/gemma3)

---

I did try other GGUFs from HF that didn't work multimodal, like these:

* https://huggingface.co/lmstudio-community/gemma-3-4B-it-qat-GGUF
* https://huggingface.co/lmstudio-community/gemma-3-12B-it-qat-GGUF

Maybe they're going to fix it later, or it is a compatibility thing with Ollama.

[-]

dampflokfreund@reddit

u/[stduhpf](https://huggingface.co/stduhpf) we can finally rest in peace. Google uploaded new quants of their QAT models on LM Studio's HF page, and given that `<img>` is now specified as `user_defined`, we can safely assume all the tokens are correct now! [https://huggingface.co/lmstudio-community/gemma-3-12B-it-qat-GGUF](https://huggingface.co/lmstudio-community/gemma-3-12B-it-qat-GGUF)

[-]

Disonantemus@reddit

**Didn't work for me.** I tried to add an image and got the following **error** and crash in `ollama`:

> Failed to create new sequence: failed to process inputs: this model is missing data required for image input

The same happens with 4B; I don't know about 27B, it's too large for me. Downloading from [Ollama Library](https://ollama.com/library/gemma3) **did work**, using: `ollama pull gemma3:12b-it-qat`

[-]

-Ellary-@reddit

Should I redownload new Qs or can I just continue to use your versions?

[-]

dampflokfreund@reddit

IMO, our versions should still be fine. The most commonly used tokens are correct, so you likely won't see a difference.

[-]

-Ellary-@reddit

ty for answer!

[-]

sxales@reddit

Why does the reported size of the model vary so much? LM Studio says 12B QAT is 7.74 GB, while Hugging Face says it is 8.07 GB, and if I actually download it, it is 7.5 GB. Are there different builds floating around, or is it just sloppy metadata?

[-]

ekaknr@reddit

Macs don't use the 1 GB = 1024 MB scheme, as far as I know, so the same file would show a smaller size on Windows or Linux. That could be one reason. Maybe GGUF and MLX also use different formats and end up with different sizes?!
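To check how much of the gap unit conventions could explain, here is a small sketch converting between decimal gigabytes and binary gibibytes. Which convention each site or operating system actually uses is an assumption here, not something confirmed in this thread.

```python
# Decimal GB versus binary GiB; a sketch for sanity-checking reported sizes.

GB = 10**9   # decimal gigabyte, in bytes
GIB = 2**30  # binary gibibyte, in bytes


def gb_to_gib(size_gb: float) -> float:
    """Convert a size in decimal GB to binary GiB."""
    return size_gb * GB / GIB


def gib_to_gb(size_gib: float) -> float:
    """Convert a size in binary GiB to decimal GB."""
    return size_gib * GIB / GB


if __name__ == "__main__":
    # Sizes quoted in the question above.
    for reported in (8.07, 7.74, 7.5):
        print(f"{reported} GB = {gb_to_gib(reported):.2f} GiB; "
              f"{reported} GiB = {gib_to_gb(reported):.2f} GB")
```

If 8.07 is a decimal figure, it corresponds to about 7.52 GiB, which is close to the 7.5 seen after download. The 7.74 figure matches neither conversion of 8.07, so units alone may not account for the LM Studio number.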

[-]

Papabear3339@reddit

The official version is different from the dozen or so modified quants floating around... there are also a few checkpoints on the official version. So yes, it is different builds. Personally I like Bartowski's quants. He always does quality work. [https://huggingface.co/bartowski](https://huggingface.co/bartowski) Unsloth usually does amazing work too. Less library compatible, but their dynamic quants are great. [https://huggingface.co/unsloth](https://huggingface.co/unsloth)

[-]

sxales@reddit

I am not talking about other people's quants. In the links OP provided, the model is reported as being different sizes.

[-]

Papabear3339@reddit

"Quantization Aware Trained (QAT) Gemma 3 checkpoints. The model preserves similar quality as half precision while using 3x less memory" "Checkpoints" is the key word here. That means the version on the official page has changed a few times... they where releasing alpha versions for feedback instead of holding for the final product.

[-]

sxales@reddit

If that is true and they are going to be changing builds after release, it would probably benefit the community to have a version or build designation in the file name to indicate that. However, if there is a difference between the builds for LM Studio and llama.cpp, that might warrant an explanation of what is different. Or if they just uploaded the wrong model somewhere, that should be fixed.

[-]

durden111111@reddit

iirc something related to the embeddings being unquantized in the official quant

[-]

coder543@reddit

It's confusing that the MLX versions are available in 3 bit, 4 bit, 8 bit, and such? Is there actually a 3 bit QAT? Is the 8 bit MLX just converted from 4 bit QAT, using twice as much memory for no benefit?

[-]

hackerllama@reddit (OP)

No, we just released half precision QATs and folks went ahead with quantizing to Q4_0. Prince, our MLX collaborator, found that the 3 bit quants were also working better than naive 3 bit quants, so he went ahead and shared those as well. We'll follow up with LM Studio, thanks!

[-]

alphakue@reddit

Are there any specific parameters that need to be set? I am trying to use openwebui with the mlx server backend, using mlx-community/gemma-3-27b-it-qat-3bit, and the model breaks down with bad grammar, repetition issues, etc. I think there might have been some issue with quantisation, which is a bummer, since this is the biggest model I've been able to run on this 16 GB Mac mini.

[-]

dampflokfreund@reddit

Thank you for listening to the community so much, it's really appreciated! Question: can you quant the unquantised q4_0 weights to other sizes as well, like Q2_K or Q5_K_M?

[-]

hackerllama@reddit (OP)

Yes, you can try and see how it works! The model was designed for Q4_0 though, but it may still be more resilient vs naive quants.

[-]

dampflokfreund@reddit

Nice. I have the feeling Bart is going to try soon. I also wonder if you can improve the quality even further using imatrix.

[-]

hackerllama@reddit (OP)

Hi! MLX in LM Studio should be fixed for all except 1B

[-]

coder543@reddit

I quit LM Studio, opened it again, downloaded the mlx-community/gemma-3-4b-it-qat model, and it still seems to respond only with <pad>. Is there something I need to do? Also, I noticed that none of the Gemma 3 QAT GGUF models are recognized as being compatible for speculative decoding when using the larger Gemma 3 models, which seems unfortunate.

[-]

hiper2d@reddit

Yeah, this is cool. But with the recent rise of MCPs, I'd like to see function calling support. Mistral 3.1 Small has it.

[-]

swagonflyyyy@reddit

OK, a couple of things. First, I'm not going to pin the blame on anyone here, but I tried the recently uploaded 27B QAT and it is good. However, when it receives input longer than its context length, Ollama 0.6.4 goes crazy with KV cache q8_0: it starts saying something along the lines of "defragmenting kv cache", and with q8_0 set it gets an OOM error. With q4_0 or f16 it's much more stable, but it can still happen if there's too much text input past the model's context length.

But the text wasn't much more than the context length, and I was only using 26 out of 48 GB of VRAM when it happened. When I tried enabling the system memory fallback feature in Windows, it would just freeze my PC when the text input exceeded the context, even if not by much. We're talking about a 4096-token context being exceeded by maybe 2000 tokens, and it would still act up like that.

I tried a workaround: truncating part of the input text before passing it to the model, reducing the KV cache to q4_0, and disabling the fallback. While this significantly reduced these instances, it still happens occasionally, which made me really nervous about this release.

Is there anything else I can do about this? It seems that Gemma 3 gives Ollama a really hard time, but a lot of the reports indicate KV cache issues with that model.
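The truncation workaround mentioned above could look something like the sketch below. This is only an illustration, not Ollama's own behavior: whitespace splitting stands in for the model's real tokenizer (so counts are approximate), and `truncate_to_context`, `context_len`, and `reserve_for_output` are hypothetical names chosen for this example.

```python
# Client-side input truncation; a sketch with a stand-in tokenizer.
from typing import List


def truncate_to_context(tokens: List[str], context_len: int,
                        reserve_for_output: int = 512) -> List[str]:
    """Keep only the most recent tokens that fit in the input budget."""
    budget = context_len - reserve_for_output
    if budget <= 0:
        raise ValueError("reserve_for_output must be smaller than context_len")
    # Drop the oldest tokens and keep the tail of the input.
    return tokens[-budget:] if len(tokens) > budget else tokens


if __name__ == "__main__":
    # Assumption: whitespace-separated words approximate tokens.
    tokens = ("word " * 6000).split()
    kept = truncate_to_context(tokens, context_len=4096)
    print(len(tokens), "->", len(kept))
```

Keeping the tail favors the most recent text; for prompts where the instructions come first, keeping the head (or head plus tail) may be the better choice.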

[-]

Nevril@reddit

Try upgrading Ollama to 0.6.6 Preview or wait a bit more for the final release. A couple of memory leaks should have been solved. I don't think it has anything to do with Gemma itself, I've been having similar issues with Mistral Small.

[-]

swagonflyyyy@reddit

Nope, still run into the same issue, but less often.

[-]

East-Cauliflower-150@reddit

I love Gemma 27b for in depth discussions. I have used bartowski q8_0 ever since it came out and prefer it to any of the bigger models. The Q4 qat surprisingly has a very different personality and likes to make lists which the q8 never did in conversation, so there seems to be quite a difference. Sticking with q8…

[-]

Zestyclose_Yak_3174@reddit

That's a very interesting observation. Might be related to the fact that they continued some form of training on it and it is based on a certain checkpoint. So you might be onto something here

[-]

Calcidiol@reddit

Is support / guidance available / forthcoming for using the gemma3 QAT quantizations optimally with HF transformers inference and a well supported HF transformers quantization format (which does not AFAICT include GGUF Q4_0)?

The model card for gemma-3-27b-it-qat-q4_0-unquantized says: "The checkpoint in this repository is unquantized, please make sure to quantize with Q4_0 with your favorite tool".

The model card for "gemma-3-27b-it" says "Gemma 3 is supported starting from transformers 4.50.0." and shows exact usage examples for how to use the non QAT model with HF transformers as its foremost exemplified inference use case, but no such transformers use guidance / support is shown for the QAT models.

It is not clear to me how to preserve the QAT benefits and comply with the new QAT model card's stipulation of "...make sure to quantize with Q4_0 with your favorite tool" while using HF transformers and some of the well supported quantizations HF transformers works with, while achieving full benefit of the "gemma-3-27b-it-qat-q4_0-unquantized" QAT tuned model.

Some HF transformers supported inference usable quantization methods support 4-bit quantization, though it's not immediately clear how / if any of those map to the commended "q4_0" quantizer process for best use of the QAT model, since AFAICT the only exemplified / released use cases for the quantized QAT models are literally using GGUF and its literally named Q4_0 quantization, which has been described like so: "...4-bit round-to-nearest quantization (q). Each block has 32 weights. Weight formula: w = q * block_scale. Legacy quantization method (not used widely as of today)."

But even the several other Q4(etc) related GGUF 4-bit quantizations like Q4_1, Q4_K are different in algorithm (and the instructions say to use Q4_0 and NOT these GGUF 4-bit-ish others), so it's not so clear that a 4-bit quant configuration (for transformers) of bitsandbytes, AWQ, GPTQ, optimum-quanto, etc. would use the compatible thresholds / scaling or whatever to align with the commended Q4_0 QAT tuned model here.
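For readers comparing schemes, the quoted Q4_0 description (blocks of 32 weights, round-to-nearest, `w = q * block_scale`) can be sketched in plain Python as below. This is a simplified illustration written for this thread, not a byte-exact port of the GGUF implementation: it does no nibble packing, and `quantize_block` and `dequantize_block` are hypothetical helper names.

```python
# Blockwise 4-bit round-to-nearest quantization; a simplified sketch.
import math
from typing import List, Tuple

BLOCK_SIZE = 32  # weights per block, per the description quoted above


def quantize_block(weights: List[float]) -> Tuple[float, List[int]]:
    """Return (block_scale, codes) with integer codes in [-8, 7]."""
    if len(weights) != BLOCK_SIZE:
        raise ValueError(f"expected {BLOCK_SIZE} weights, got {len(weights)}")
    # Use the signed value with the largest magnitude so it maps to code -8.
    extreme = max(weights, key=abs)
    scale = extreme / -8.0
    if scale == 0.0:
        return 0.0, [0] * BLOCK_SIZE
    # Round to nearest, then clip to the signed 4-bit range.
    codes = [max(-8, min(7, math.floor(w / scale + 0.5))) for w in weights]
    return scale, codes


def dequantize_block(scale: float, codes: List[int]) -> List[float]:
    """Reconstruct weights with w = q * block_scale."""
    return [q * scale for q in codes]


if __name__ == "__main__":
    block = [(-1) ** i * (i + 1) / 32 for i in range(BLOCK_SIZE)]
    scale, codes = quantize_block(block)
    restored = dequantize_block(scale, codes)
    worst = max(abs(a - b) for a, b in zip(block, restored))
    print(f"scale={scale}, worst absolute error={worst}")
```

A 4-bit method that uses a different block size, an added zero point, or per-channel scales would round the same weights to a different grid, which is the mismatch the question above is asking about.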

[-]

Aaaaaaaaaeeeee@reddit

Due to the overwhelming amount of "Q4" weight quantized model types there may never be a perfect fit for all of them. Sticking to the Q4_0-unpacked version for quantization seems best. The int4 version is a per-channel version which might be what JAX tpu uses which is performant on their hardware. Of course it would be even better if we did not have to run through each quantization algorithm like exl2's and just downscale it perfectly somehow, but it looks like a lot of work!

[-]

idkman27@reddit

Does anyone know if it’s possible / how to go about fine-tuning these qat models?

[-]

DunderSunder@reddit

Or is there a way to fine-tune on the full weights and then do the QAT ourselves?

[-]

Papabear3339@reddit

See here: [https://arxiv.org/pdf/2305.17888](https://arxiv.org/pdf/2305.17888) The secret sauce looks like just a custom loss function, which you could very easily toss into adam for testing when making your own fine-tune.

[-]

Papabear3339@reddit

You would still want to do fine tuning on the unquantized model. QAT is a method of training that is "quantization aware" so it loses less quality when quantized. Here is a paper on the method if you want to try and replicate it: [https://arxiv.org/pdf/2305.17888](https://arxiv.org/pdf/2305.17888)

[-]

maglat@reddit

Can't find any word on function calling.

[-]

AaronFeng47@reddit

The long context is still broken in Ollama: throw 60k tokens at it and its "brain" will stop functioning, unlike qwen 2.5-1M, which still mostly works.

[-]

chibop1@reddit

Thank you! Awesome to see support for different engines! Is 27b-qat on Ollama better than q8_0?

[-]

ApprehensiveAd3629@reddit

Where can I find this 14.1 GB file?

[-]

FullstackSensei@reddit

Did a quick test on my Nvidia P40 rig, testing generation with and without a draft model, and using one P40 or splitting the model across two of them. The draft model seems to hurt performance, even though it was run on a separate GPU. The acceptance rate was 6% using 1B.

| Run Configuration | Prompt Tokens | Prompt Eval Time (ms) | Prompt Tokens/s | Eval Tokens | Eval Time (ms) | Eval Tokens/s | Total Tokens |
|---|---|---|---|---|---|---|---|
| Gemma 27B + Gemma 1B draft | 94 | 504.22 | 186.43 | 2285 | 211920.42 | 10.78 | 2379 |
| Gemma 27B (Single GPU) | 94 | 501.80 | 187.33 | 1955 | 151586.79 | 12.90 | 2049 |
| Gemma 27B (Two GPUs) | 94 | 658.05 | 142.85 | 2016 | 143419.47 | 14.06 | 2110 |

Run using the following commands, respectively:

```
./llama-server -m /models/gemma-3-27b-it-q4_0.gguf -md /models/gemma-3-1b-it-q4_0.gguf \
  -fa --temp 1.0 --top-k 64 --min-p 0.0 --top-p 0.95 -sm row --no-mmap \
  -ngl 99 -ngld 99 -c 5000 --cache-type-k q8_0 --cache-type-v q8_0 \
  --draft-max 16 --draft-min 5 --draft-p-min 0.5 --device CUDA0 --device-draft CUDA1 \
  --tensor-split 1,0,0,0 --slots --metrics --numa distribute -t 40 --no-warmup \
  --port 8800 --host 0.0.0.0
```

```
./llama-server -m /models/gemma-3-27b-it-q4_0.gguf \
  -fa --temp 1.0 --top-k 64 --min-p 0.0 --top-p 0.95 -sm row --no-mmap \
  -ngl 99 -c 5000 --cache-type-k q8_0 --cache-type-v q8_0 \
  --device CUDA0 --tensor-split 1,0,0,0 --slots --metrics --numa distribute \
  -t 40 --no-warmup --port 8800 --host 0.0.0.0
```

```
./llama-server -m /models/gemma-3-27b-it-q4_0.gguf \
  -fa --temp 1.0 --top-k 64 --min-p 0.0 --top-p 0.95 -sm row --no-mmap \
  -ngl 99 -c 5000 --cache-type-k q8_0 --cache-type-v q8_0 \
  --device CUDA0,CUDA1 --tensor-split 1,0,1,0 --slots --metrics --numa distribute \
  -t 40 --no-warmup --port 8800 --host 0.0.0.0
```
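For anyone wanting to reproduce the rate columns, the sketch below recomputes generation tokens per second from the eval token counts and eval times in the table above, and compares each run to the single-GPU baseline. The numbers are copied from the table; the function and variable names are made up for this example.

```python
# Recompute throughput from token counts and elapsed milliseconds.

def tokens_per_second(tokens: int, elapsed_ms: float) -> float:
    """Throughput in tokens per second."""
    return tokens / (elapsed_ms / 1000.0)


# (label, eval tokens, eval time in ms), copied from the table above.
RUNS = [
    ("27B + 1B draft", 2285, 211920.42),
    ("27B single GPU", 1955, 151586.79),
    ("27B two GPUs", 2016, 143419.47),
]

if __name__ == "__main__":
    baseline = tokens_per_second(RUNS[1][1], RUNS[1][2])
    for label, tokens, elapsed_ms in RUNS:
        rate = tokens_per_second(tokens, elapsed_ms)
        change = (rate / baseline - 1.0) * 100.0
        print(f"{label}: {rate:.2f} tok/s ({change:+.1f}% vs single GPU)")
```

The runs generated different numbers of tokens, so comparing rates rather than total times is the fairer reading of the table.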

[-]

Any-Mathematician683@reddit

Can you please help us run these models with vLLM or SGLang? I am getting errors for the previously released QAT models. Thanks a ton for the amazing work.

[-]

hideo_kuze_@reddit

Thank you for your work!

> We reduce the perplexity drop by 54% (using llama.cpp perplexity evaluation) when quantizing down to Q4_0.

Are there any other numbers or benchmarks on quant vs original version?

[-]

R46H4V@reddit

https://preview.redd.it/eo6kc5ucwlve1.png?width=959&format=png&auto=webp&s=95fd0ccfacd1a7824e62492f92c959c72ad24ead Why is the model way larger than 2.6GB as mentioned in the Post?

[-]

busylivin_322@reddit

What’s the perf difference from regular quantization? Any benchmarks?

[-]

TacGibs@reddit

Just google QAT, or ask any LLM.

[-]

busylivin_322@reddit

The Google blog post says QAT preserves performance of bf16, while at lower quants/VRAM requirements but not by how much, just “preserves high quality”. My questions were specific to what degree QAT is improvement over regular quantization regarding model performance. Besides perplexity drop, which is a singular metric, I didn’t see in the blog post any benchmarks related to those questions. Understand why asking an LLM or googling the thing, after already having read the blog post, released today wouldn’t help?

[-]

Accomplished_Mode170@reddit

Thank you! 🙏 Are y’all dropping SAEs too for interpretability? 📊