Gemma 3 QAT launch with MLX, llama.cpp, Ollama, LM Studio, and Hugging Face

Posted by hackerllama@reddit | LocalLLaMA

Hi! Some weeks ago we released GGUFs corresponding to the QAT (quantization-aware training) checkpoints of Gemma 3. Thanks to QAT, the model preserves quality similar to `bfloat16` while significantly reducing the memory required to load it. That is, QAT is an additional fine-tuning step that makes the model more robust to quantization.

Since we only released the GGUFs, we got feedback that it would be great to have the unquantized QAT-based checkpoints so people can quantize for their own tools. So... we did it! Today we're releasing the unquantized QAT-based checkpoints. Models quantized from these checkpoints preserve quality better than naively quantized ones.

**We also collaborated with Prince (from MLX), llama.cpp, Ollama, LM Studio, and Hugging Face to make sure you can use the models in all your favorite tools!**

* Blog post: [https://developers.googleblog.com/en/gemma-3-quantized-aware-trained-state-of-the-art-ai-to-consumer-gpus/](https://developers.googleblog.com/en/gemma-3-quantized-aware-trained-state-of-the-art-ai-to-consumer-gpus/)
* Unquantized checkpoints: [https://huggingface.co/collections/google/gemma-3-qat-67ee61ccacbf2be4195c265b](https://huggingface.co/collections/google/gemma-3-qat-67ee61ccacbf2be4195c265b)
* Ollama: [https://ollama.com/library/gemma3](https://ollama.com/library/gemma3) (try `ollama run gemma3:12b-it-qat`)
* LM Studio: [https://lmstudio.ai/model/gemma-3-12b-it-qat](https://lmstudio.ai/model/gemma-3-12b-it-qat)
* MLX: [https://huggingface.co/collections/mlx-community/gemma-3-qat-68002674cd5afc6f9022a0ae](https://huggingface.co/collections/mlx-community/gemma-3-qat-68002674cd5afc6f9022a0ae)
* llama.cpp: [https://huggingface.co/collections/google/gemma-3-qat-67ee61ccacbf2be4195c265b](https://huggingface.co/collections/google/gemma-3-qat-67ee61ccacbf2be4195c265b)

Enjoy!
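
For a rough sense of the memory reduction mentioned above, here is a back-of-the-envelope sketch in Python. It is not an official figure: it assumes nominal parameter counts (1B, 4B, 12B, 27B), assumes a 4-bit quantization target, and counts weights only. The helper `weight_memory_gb` and the `NOMINAL_PARAMS` table are hypothetical names used for illustration.

```python
def weight_memory_gb(num_params: float, bits_per_weight: float) -> float:
    """Approximate memory needed for the weights alone, in GB (10^9 bytes).

    Ignores the KV cache, activations, and runtime overhead.
    """
    return num_params * bits_per_weight / 8 / 1e9


# Assumption: nominal parameter counts taken from the model size names,
# not exact counts read from the checkpoints.
NOMINAL_PARAMS = {"1b": 1e9, "4b": 4e9, "12b": 12e9, "27b": 27e9}

for name, n in NOMINAL_PARAMS.items():
    bf16 = weight_memory_gb(n, 16)  # bfloat16 stores 16 bits per weight
    q4 = weight_memory_gb(n, 4)     # assumed 4-bit target, scales not counted
    print(f"gemma3:{name}  bfloat16 ~{bf16:.1f} GB  4-bit ~{q4:.1f} GB")
```

Real quantized files come out somewhat larger than the 4-bit estimate, since block-wise formats also store per-block scales and some tensors may be kept at higher precision.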