Based on what should I choose Gemma 4 models/quantizations?

Posted by ProducerOwl@reddit | LocalLLaMA | View on Reddit | 14 comments

I have an RTX 4060 8GB laptop, and when I ask Gemini or ChatGPT, they say a Gemma Q4_K_M quant is the best fit for my hardware, with a context length of around 16k-32k.

However, in practice, even after loading a higher-precision quant like Q6_K_XL, my VRAM usage sits at only 5.5GB.

This has left me confused: what rule of thumb should I use when choosing context length, model size, and quantization?
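One common rule of thumb is to add up two pieces: the weights (parameter count times bits-per-weight) and the KV cache (which grows linearly with context length), then leave ~1GB of headroom for activations and the OS. Below is a minimal back-of-the-envelope sketch; the architecture numbers (layers, KV heads, head dim) and the ~4.5 bits/weight figure for a Q4_K_M-style quant are illustrative assumptions, not official Gemma specs, so substitute your model's real values:

```python
# Rough VRAM estimator for a GGUF-quantized model (a sketch, not exact:
# real loaders add overhead for activations, buffers, and the OS).

def model_weights_gb(n_params_b: float, bits_per_weight: float) -> float:
    """Approximate loaded weight size in GB: 1B params at 8 bits ~ 1 GB."""
    return n_params_b * bits_per_weight / 8

def kv_cache_gb(n_layers: int, n_kv_heads: int, head_dim: int,
                context_len: int, bytes_per_elem: int = 2) -> float:
    """KV cache in GB: 2 (K and V) * layers * KV heads * head dim
    * context length * bytes per element (2 for fp16)."""
    return 2 * n_layers * n_kv_heads * head_dim * context_len * bytes_per_elem / 1e9

# Hypothetical 4B-parameter model at ~4.5 bits/weight (Q4_K_M-ish):
weights = model_weights_gb(4.0, 4.5)
# Assumed architecture: 34 layers, 8 KV heads, head dim 128, 16k context:
cache = kv_cache_gb(34, 8, 128, 16384)
print(f"weights ~{weights:.1f} GB, KV cache ~{cache:.1f} GB, "
      f"total ~{weights + cache:.1f} GB")
```

By this estimate a ~4B model at Q4_K_M with 16k context lands well under 8GB, which matches the observation that even a larger quant only fills ~5.5GB. The practical takeaway: pick the quant first (weights are fixed per quant), then spend the leftover VRAM on context, since the KV cache scales linearly with context length.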