Based on what should I choose Gemma 4 models/quantizations?
Posted by ProducerOwl@reddit | LocalLLaMA | 14 comments
I have an RTX 4060 8GB laptop, and when I ask Gemini or ChatGPT, they say Gemma 4 Q4_K_M is the best fit for my hardware, with a context length around 16k-32k.
However, in practice, even after loading a higher quantization like Q6_K_XL, only 5.5GB of my VRAM is occupied.
This has left me confused: what rule of thumb should I use when choosing context length, model, and quantization?
ea_man@reddit
If you want the model to run as fast as possible, both the model and the context have to stay in VRAM.
If you're OK with slower speeds, you can offload into your 16GB of system RAM.
Anyway, the standard base quant for a model is Q4_K_M. Context length depends entirely on what you're going to do: it can be 4k for chat, 30-130k for coding.
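A minimal sketch of that fit check, assuming an fp16 KV cache; the layer/head counts and file sizes are illustrative placeholders, not taken from any real model card:

```python
# Back-of-the-envelope check: does model file + KV cache + overhead fit in VRAM?
# Layer/head counts and file sizes below are illustrative, not from a real model card.

def kv_cache_gb(n_layers, n_kv_heads, head_dim, context_len, bytes_per_elem=2):
    # K and V per layer, per token, stored here as fp16 (2 bytes per element)
    return 2 * n_layers * n_kv_heads * head_dim * context_len * bytes_per_elem / 1024**3

def fits_in_vram(model_file_gb, kv_gb, vram_gb, overhead_gb=1.0):
    # leave ~1 GB headroom for the CUDA context, activations, and the desktop
    return model_file_gb + kv_gb + overhead_gb <= vram_gb

kv = kv_cache_gb(n_layers=34, n_kv_heads=8, head_dim=128, context_len=16384)
print(f"KV cache at 16k context: {kv:.2f} GB")                    # ~2.1 GB
print("Small quant (~2.5 GB file) fits in 8 GB VRAM:",
      fits_in_vram(model_file_gb=2.5, kv_gb=kv, vram_gb=8.0))     # True
print("Bigger quant (~6.5 GB file) fits:",
      fits_in_vram(model_file_gb=6.5, kv_gb=kv, vram_gb=8.0))     # False
```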
jacek2023@reddit
I always tell people to just try things, but they can't accept it. They want benchmarks, leaderboards, and someone to tell them what to choose. But what speed and quality are good enough for you is really subjective, because everyone has different use cases. Except the people who have no use case, because they download models without ever using them.
therandshow@reddit
Also, "just pick one" is less attractive when these things take an hour to download.
jacek2023@reddit
this is a valid point, but this hobby is "time consuming"
therandshow@reddit
I guess part of the problem I have with choosing a random model is that when you hit a wall with some random incompatibility or quirk, it's hard to know when it's worth working through the problem and when it's worth trying another model.
DBacon1052@reddit
Just test stuff. You'll be surprised what can run. Look for MoE models. I have a 4060 with 8GB VRAM and 32GB RAM, and Qwen 35B runs stupid fast.
Ok_Sprinkles_6998@reddit
Qwen 3.5 9B is said to be better than ChatGPT 3.5, so give it a try.
Generally, look for Q4 or above for an acceptable quality/size trade-off in quantization. Higher parameter counts (the "B"s) mean a larger knowledge base and smarter models.
I would say start with LM Studio, grab whatever models fit (going by its recommendation), and play around.
You can also offload to system RAM if you need longer context in LM Studio (not the best approach, but easy to use); it'll give you around 4-5 tokens per second, compared to the 40-50 tokens per second you get when everything is in VRAM.
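For anyone scripting instead of using the LM Studio UI, partial offload is roughly the same knob the GPU-offload slider controls; a rough sketch with llama-cpp-python, where the file path and layer count are placeholders:

```python
# Partial GPU offload with llama-cpp-python; roughly the knob LM Studio's
# "GPU offload" slider controls. File path and layer count are placeholders.
from llama_cpp import Llama

llm = Llama(
    model_path="./model-q4_k_m.gguf",  # hypothetical local GGUF
    n_gpu_layers=24,    # layers kept in VRAM; the rest run from system RAM (slower)
    n_ctx=16384,        # context size reserves KV-cache memory up front
)

out = llm("Explain quantization in one sentence.", max_tokens=64)
print(out["choices"][0]["text"])
```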
SrijSriv211@reddit
There's not really a rule of thumb I think, but I generally check whether
filesize of model * 1.5 <= VRAM
and if so, it's fine. Also, 16k context length is good if you don't do long-horizon tasks. Context length really only depends on how deep you want your model to go, imo.
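That rule of thumb written out as a quick check (the example file size is hypothetical):

```python
# The rule of thumb above as a quick check; file size * 1.5 leaves room
# for KV cache and runtime overhead.
import os

def fits_rule_of_thumb(gguf_path: str, vram_gb: float) -> bool:
    file_gb = os.path.getsize(gguf_path) / 1024**3
    return file_gb * 1.5 <= vram_gb

# e.g. a hypothetical 4.1 GB Q4_K_M file on an 8 GB card: 4.1 * 1.5 = 6.15 GB -> fits
```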
vasileer@reddit
Gemma 3 models (probably Gemma 4 too) were trained with QAT (quantization-aware training) for 4 bits, and for most models over 3B params, 4-bit preserves ~98% of the quality.
For context length there is no rule of thumb, since each model may or may not use techniques like sliding-window attention, grouped-query attention, etc. My rule of thumb there is to measure it myself: I run the model with 8192 context and then do the rest of the math.
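One possible way to do that "measure, then do the math" step, assuming KV-cache growth is roughly linear in context length (the two readings below are made-up placeholders):

```python
# "Measure, then do the math": read VRAM use from nvidia-smi after loading the
# model at two context sizes, then extrapolate linearly to a larger context.
import subprocess

def vram_used_mb() -> int:
    out = subprocess.check_output(
        ["nvidia-smi", "--query-gpu=memory.used", "--format=csv,noheader,nounits"]
    )
    return int(out.decode().split()[0])  # first GPU only

# Placeholder readings (MB) taken after loading the same model with
# n_ctx=4096 and n_ctx=8192; substitute your own vram_used_mb() values.
used_at_4k, used_at_8k = 5100, 5700
per_token_kb = (used_at_8k - used_at_4k) * 1024 / (8192 - 4096)

target_ctx = 32768
predicted_mb = used_at_8k + per_token_kb * (target_ctx - 8192) / 1024
print(f"Predicted VRAM at {target_ctx} context: ~{predicted_mb:.0f} MB")
```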
Pristine-Woodpecker@reddit
The QAT models were specifically marked as such and were obviously released as 4-bit models. No such Gemma 4 model has been released.
vasileer@reddit
The scores for the NVFP4 versions indicate one of two things (or both): either Gemma 4 used these techniques without announcing it, or NVFP4 is just that good.
Fedor_Doc@reddit
In my experience, reasoning degrades considerably in 4-bit quants of Qwen 3.5 models. I compared the Q4_NL and Q8_0 quants of the 9B model by Unsloth, and there is a big difference in my tasks (code analysis).
CooperDK@reddit
Forget about Gemma 4. Basically, forget using LLMs with 8GB of VRAM and 16GB of RAM, unless you use a 2B model or Gemma-3n-E4B, which is made for phones and small computers... or unless you have time to really wait and wait and wait. Oh, and you'll probably need to expand your pagefile too.
Deep-Vermicelli-4591@reddit
Use LM Studio and get the Gemma 4 E4B model quantised by them. Their Q4 would definitely fit with good context. If you don't need that much, then Q6 with lower context should fit.