Questions about parameter size & quantization
Posted by LeastExperience1579@reddit | LocalLLaMA | View on Reddit | 6 comments
If I run two models at the same VRAM usage (e.g. Gemma 3 4B in Q8 and Gemma 3 12B in Q2),
which would be smarter / faster? What are the strengths of each?
Weary_Long3409@reddit
4b q8 = careful stupid
12b q2 = careless smart
BobbyL2k@reddit
For a dense model:
Token generation speed should be roughly the same. This is because the model weights plus the context must be streamed from VRAM into the GPU for every token produced. If both setups use the same amount of VRAM, it takes roughly the same amount of time to produce one token.
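As a rough sketch of why decode speed tracks the VRAM footprint (the bandwidth figure, model sizes, and KV cache size below are assumptions for illustration, not measurements):

```python
# Back-of-the-envelope decode speed for a memory-bandwidth-bound dense model:
# every generated token has to stream the full weights (plus KV cache) from VRAM.
def tokens_per_second(model_bytes: float, kv_cache_bytes: float,
                      bandwidth_bytes_per_s: float) -> float:
    """Upper bound on generation speed if the GPU is purely bandwidth-limited."""
    return bandwidth_bytes_per_s / (model_bytes + kv_cache_bytes)

GB = 1e9
bandwidth = 936 * GB  # e.g. an RTX 3090-class card (assumed)

# Two setups with a similar weight footprint end up with a similar speed ceiling:
print(tokens_per_second(4.2 * GB, 0.5 * GB, bandwidth))  # ~4B params at Q8
print(tokens_per_second(4.0 * GB, 0.5 * GB, bandwidth))  # ~12B params at ~Q2
```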
Prompt processing will be faster on the model with fewer parameters, since the number of multiplications grows with parameter count. This usually doesn't affect local llama users much, because llama.cpp caches the prompt.
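For prefill, a common rule of thumb is roughly 2 FLOPs per parameter per prompt token (the numbers below are just illustrative):

```python
# Prefill compute scales with parameter count, so a 12B model does roughly 3x
# the multiply-adds of a 4B model over the same prompt.
def prefill_flops(params: float, prompt_tokens: int) -> float:
    return 2 * params * prompt_tokens  # ~2 FLOPs per parameter per token

print(prefill_flops(4e9, 2048) / 1e12)   # ~16 TFLOPs for a 4B model
print(prefill_flops(12e9, 2048) / 1e12)  # ~49 TFLOPs for a 12B model
```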
In terms of quality, in my experience a bigger model at a lower-bit quantization is usually better than a smaller one at higher precision. But this also depends on model age (newer models have better architectures, better training techniques, and better data).
Any-House1391@reddit
I think of it this way: adding more bits, whether in terms of more weights or more bits per weight, gives you diminishing returns. If you have a tiny model with very few weights, adding more weights will make a dramatic difference; the difference between a 1B and a 4B model will be much larger than the difference between a 16B and a 32B model. Similarly, the difference between a Q2 and a Q4 quantization will be much bigger than the difference between a Q4 and a Q8 quantization.
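A quick way to see those diminishing returns from extra bits is to round-trip some fake weights through a naive uniform quantizer. Real quantizers (the llama.cpp k-quants, for example) are smarter than this, so treat the numbers as qualitative only:

```python
import numpy as np

rng = np.random.default_rng(0)
weights = rng.normal(0.0, 1.0, size=100_000)  # stand-in for a weight tensor

def quantization_rmse(w: np.ndarray, bits: int) -> float:
    # Symmetric uniform quantization: scale by the max magnitude, round, clip.
    levels = 2 ** (bits - 1) - 1
    scale = np.abs(w).max() / levels
    w_hat = np.clip(np.round(w / scale), -levels, levels) * scale
    return float(np.sqrt(np.mean((w_hat - w) ** 2)))

for bits in (2, 4, 8):
    print(f"{bits}-bit RMSE: {quantization_rmse(weights, bits):.4f}")
# Going from 2 to 4 bits removes most of the error; going from 4 to 8 bits
# only cleans up what little is left.
```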
While there is no clear answer to which will be better for your two examples, I hardly ever run models outside the Q3 to Q6 range. Below Q3, the performance usually drops off so much that I'm better off running a model with fewer weights. Above Q6, I barely notice any improvement. I am aware that Gemma3 does not have any model between 4B and 12B, though.
You should also keep in mind that the context window has a huge impact on VRAM usage. I will thus often choose to run a model at Q4 instead of Q6 in exchange for a larger context window.
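To get a feel for how much the context window costs, here is a rough KV-cache estimate (the layer/head/dim figures are placeholders, not the actual Gemma 3 config):

```python
# KV cache: one key and one value vector per layer, per KV head, per token.
def kv_cache_bytes(layers: int, kv_heads: int, head_dim: int,
                   context_len: int, bytes_per_elem: int = 2) -> int:
    return 2 * layers * kv_heads * head_dim * context_len * bytes_per_elem

for ctx in (4_096, 32_768, 131_072):
    gb = kv_cache_bytes(layers=48, kv_heads=8, head_dim=128, context_len=ctx) / 1e9
    print(f"{ctx:>7} tokens -> ~{gb:.1f} GB of fp16 KV cache (assumed config)")
```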
Western-Source710@reddit
Not trying to be a smart ass or anything... I'm legit just curious. I haven't run an LLM locally in my life... I'm a legit noob.
However, I am curious why the difference between a Q2 and a Q4 quant is bigger than the difference between a Q4 and a Q8. Is it simply because 4 - 2 = 2 and 8 - 4 = 4? So addition/subtraction instead of a simple division (4/2 = 2 and 8/4 = 2)?
Also, not much improvement above Q6 in your experience? When and if I do run an LLM locally, I'm going to try to remember to play around at Q6 +/- to check for differences myself, thanks for that!
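Trying to answer my own question with a quick sanity check (someone correct me if this is off): each extra bit doubles the number of distinct values a weight can take, so the first few bits do most of the work:

```python
# Distinct values a single weight can take at each bit width.
for bits in (2, 3, 4, 6, 8):
    print(f"Q{bits}: {2 ** bits} possible values per weight")
# 2 bits -> 4 values is very coarse; 4 bits -> 16 already covers a typical
# weight distribution reasonably well, so 8 bits -> 256 adds comparatively little.
```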
ApprehensiveTart3158@reddit
Gemma 3 4B would likely be faster at Q8, and more accurate. They would likely be comparable in smartness, but the 12B might have better coding practices while making more mistakes due to the fairly low bit count. Both would be usable; they just have different characteristics. The same goes for vision: low bit counts hurt vision capabilities significantly.
Personally, I've used Gemma 3 4B at Q8 and Gemma 3 12B at q3xl; I liked the 4B more and it was faster. But you'd have to find out for yourself which one you prefer.
Routine_Day8121@reddit
If you think about it, parameter size and quantization do not act independently. Q8 versus Q2 changes the memory footprint and arithmetic precision in subtle ways. A 12B model in Q2 has more parameters on paper, but it can underperform on reasoning-heavy prompts compared to a 4B model in Q8 because the reduced precision introduces noise that the model cannot correct for. The key is not raw size but how well the model uses those bits.