Are there official (from Google) quantized versions of Gemma 3?
Posted by lostmsu@reddit | LocalLLaMA | 10 comments
Maybe I am a moron and can't use search, but I can't find quantized downloads made by Google themselves. The best I could find is the Hugging Face version under ggml-org, plus a few community quants such as bartowski's and unsloth's.
vasileer@reddit
In their paper they mention (i.e., recommend) llama.cpp, so what difference does it make whether it was Google, bartowski, or you yourself who created the GGUFs with llama.cpp/convert_hf_to_gguf.py?
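For static (non-imatrix) quants the conversion is a deterministic function of the weights alone, so the same input tensor gives the same quant no matter who runs the script. A toy numpy sketch of the idea (simplified: real Q4_0 packs two 4-bit values per byte and stores fp16 scales, and the exact scale formula differs):

```python
import numpy as np

def quantize_q4_like(w: np.ndarray, block_size: int = 32):
    """Toy static 4-bit block quantization, loosely Q4_0-shaped.

    One absmax-derived scale per block, weights rounded onto a
    16-level signed integer grid. No calibration data is involved,
    so the result depends only on the input weights.
    """
    blocks = w.reshape(-1, block_size)
    scales = np.abs(blocks).max(axis=1, keepdims=True) / 7.0
    scales[scales == 0] = 1.0                      # guard all-zero blocks
    q = np.clip(np.round(blocks / scales), -8, 7)  # 4-bit signed grid
    return q.astype(np.int8), scales

def dequantize(q: np.ndarray, scales: np.ndarray) -> np.ndarray:
    return (q * scales).reshape(-1)

rng = np.random.default_rng(0)
w = rng.normal(size=4096).astype(np.float32)
q, s = quantize_q4_like(w)
print("mean abs rounding error:", np.abs(w - dequantize(q, s)).mean())
```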
suprjami@reddit
There is, in theory, a difference in the responses of imatrix quants depending on the content of the imatrix dataset.
The full effect of this is debated.
mradermacher thinks an English-only imatrix set nerfs non-English languages, but there is research showing that doesn't happen much, at least for one specific model (I think it was Qwen?).
Both mradermacher and bartowski use an imatrix dataset designed to give "higher quality" responses. bartowski's is publicly available. DavidAU has a horror/story imatrix set which he thinks makes a difference to his quants.
Some people say they always get better results from static quants than imatrix quants.
Some people say there is a noticeable difference in responses, but the actual quality doesn't vary either way; the model just produces differently structured sentences while still giving the same sort of answers.
I think you could only test this with a large set of benchmarks relevant to your specific usage with the specific model and quants you care about.
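To make the mechanism concrete: imatrix quantization weights the rounding error by per-weight importance estimated from a calibration run, so different calibration text can pick different quantization parameters. A toy sketch in that spirit (the real llama.cpp code does a weighted per-block search over the full quant format; the importance values here are invented):

```python
import numpy as np

def best_scale(w: np.ndarray, importance: np.ndarray, n_grid: int = 64) -> float:
    """Pick the block scale minimizing importance-weighted rounding error."""
    amax = np.abs(w).max()
    best, best_err = amax / 7.0, np.inf
    for k in range(n_grid + 1):
        scale = (amax / 7.0) * (0.5 + 0.5 * k / n_grid)  # candidate scales
        q = np.clip(np.round(w / scale), -8, 7)
        err = np.sum(importance * (w - q * scale) ** 2)
        if err < best_err:
            best, best_err = scale, err
    return best

rng = np.random.default_rng(1)
w = rng.normal(size=32)
uniform = np.ones(32)                  # static quant: every weight equal
skewed = rng.exponential(size=32)      # "calibration" importance, made up
print(best_scale(w, uniform), best_scale(w, skewed))  # scales can differ
```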
yukiarimo@reddit
Yes, but only if you’re using imatrix quants
vasileer@reddit
this
Pedalnomica@reddit
My understanding... Basically, the conversion just picks some weights to store at higher bit widths based on a calibration dataset that is probably not what Google used to train Gemma 3. With quantization-aware training, they keep training the model on the original data (or a subset) but at a lower bit width per weight. The latter requires more compute and data, and should land closer to the performance of the full-precision model.
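The core trick in quantization-aware training is fake quantization with a straight-through estimator: the forward pass sees quantized weights, while gradients pass through as if the rounding weren't there, so the optimizer adapts the weights to the quantization grid. A toy PyTorch sketch of that idea (not Google's actual QAT recipe; the layer size and data are made up):

```python
import torch

def fake_quant(w: torch.Tensor) -> torch.Tensor:
    """Forward pass sees 4-bit-rounded weights; the straight-through
    estimator makes the backward pass treat this as identity."""
    scale = w.abs().max() / 7 + 1e-8
    q = (w / scale).round().clamp(-8, 7) * scale
    return w + (q - w).detach()

# Tiny regression: the weights are fake-quantized during training,
# so the optimizer learns weights that already tolerate 4-bit
# rounding -- unlike rounding a finished model after the fact.
torch.manual_seed(0)
x = torch.randn(256, 8)
y = x @ torch.randn(8, 1)

w = torch.randn(8, 1, requires_grad=True)
opt = torch.optim.SGD([w], lr=0.1)
for _ in range(200):
    opt.zero_grad()
    loss = ((x @ fake_quant(w) - y) ** 2).mean()
    loss.backward()
    opt.step()

print("loss with quantized weights:", loss.item())
```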
TrashPandaSavior@reddit
Not OP, but it's possible that having some of the big model producers, like Microsoft and Qwen, provide their own GGUFs has changed what people expect. I know I have a bias towards getting a model straight from the original author if I can, or from unsloth otherwise.
Pedalnomica@reddit
I had the same question. There's nothing official, but the ones on Kaggle and Ollama were available at launch. So, I'm guessing those were the ones that Google made with QAT.
agntdrake@reddit
I made the ones for Ollama using K-quants because the QAT weights weren't quite ready from the DeepMind team. They did get them working (and we have them working in Ollama), but they're actually slower (being Q4_0), and we're still waiting on the perplexity calculations before switching over.
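For context, the perplexity being compared here is the exponential of the mean negative log-likelihood over a test corpus (llama.cpp ships a perplexity tool for exactly this); lower means the quant tracks the full-precision model's predictions more closely. A toy illustration of the formula with made-up per-token probabilities:

```python
import numpy as np

# Perplexity = exp(mean negative log-likelihood) over test tokens.
token_probs = np.array([0.25, 0.10, 0.50, 0.05])  # invented model outputs
perplexity = np.exp(-np.log(token_probs).mean())
print(perplexity)
```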
My_Unbiased_Opinion@reddit
There are officially quantized versions in the Ollama repo, specifically Q4_K_M.
codingworkflow@reddit
Unsloth released a version in collaboration with Google.