Quants in vision (mmproj Q8 vs FP16)
Posted by WhoRoger@reddit | LocalLLaMA | 15 comments
Disclaimer: This is totally just my personal testing/messing around. Nothing scientific.
TL;DR: I find FP16 mmproj pointless; it may even harm quality rather than help.
I decided to check vision of the recent small models on llama.cpp. I didn't know any better, so I downloaded Q8 of the mmprojs. Then I looked into it and found that most people just go for FP16 at all times, so I downloaded those too. And well since I already had both versions for each model, I might as well compare them.
Models: Qwen3.5 0.8B, 2B, 4B, Gemma 4 E2B and E4B, Gemma 3 4B - all Heretics of some sort (all Q6_K or i1/Q6_K, some in uncensored versions too, some also in IQ4_NL because I've been collecting them already). Most mmproj's seem to be totally untouched when people uncensor the models. (Often this is mentioned, but not always.) For some models, I also tried mmproj's from different providers, and they always give the exact same responses, so they're mathematically identical, even if file hashes don't match. Though I found some (MARTHA for Qwen 0.8B and 2B) that may have some tuning, because their responses differ slightly.
Running these just on CPU, because I'm poor and crazy, so the math may be a bit different on other hardware. Temperature 0 to see the differences. Anyway.
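For anyone wanting to reproduce this kind of A/B test, here's a rough sketch of how the comparison could be scripted against llama.cpp's llama-mtmd-cli. All file names are placeholders, and this is my own sketch rather than anything from the thread:

```python
import subprocess

def build_cmd(model_path, mmproj_path, image_path,
              prompt="Describe this image in detail."):
    """Build an argv list for llama.cpp's llama-mtmd-cli with greedy sampling,
    so the only variable between runs is which mmproj quant is loaded."""
    return [
        "llama-mtmd-cli",
        "-m", model_path,
        "--mmproj", mmproj_path,
        "--image", image_path,
        "-p", prompt,
        "--temp", "0",   # temperature 0: deterministic output, easy to diff
    ]

if __name__ == "__main__":
    # Hypothetical file names; substitute your own downloads.
    for mmproj in ("mmproj-Q8_0.gguf", "mmproj-F16.gguf"):
        cmd = build_cmd("Qwen3.5-4B-Q6_K.gguf", mmproj, "test.jpg")
        print(" ".join(cmd))
        # subprocess.run(cmd, check=True)  # uncomment to actually run
```

Diffing the two saved outputs then shows exactly where the quants diverge.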
Tried a variety of oddball pics, photos and generated. Atypical stuff or with a lot of specifics. Medical images, a mannequin in a dumpster, selfies in odd environments, anatomical deformities, behind-the-scenes shots from movies showing props, that sort of thing. Stuff that can trip up models that expect generic content.
Well first off, Qwen3.5 4B absolutely destroys all the others in recognising and reasoning. That's nothing new, but the level of detail is amazing. E.g. it can see that blood looks a bit off (on the movie props stuff) and speculates that it may be crushed berries. That's crazy. Though you need to look into its thinking to see that, or prompt about the specifics, since in the final output it usually discards elements that it's not sure about.
Anyway, the quants.
In short, I find the differences between Q8 and F16 mmproj's insignificant, except for Qwen3.5 0.8B and 4B. The phrasing of the image descriptions differs slightly rather than the contents, overall indicating that the models see a bit sharper, or may first focus on something else. But you'll get the same contents either way. The models seem to see more than they want to put into words anyway, possibly to keep the descriptions brief. If you press the model for details, you'll learn the exact same things from mmproj's in Q8 as from FP16.
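For intuition on why Q8 loses so little: llama.cpp's Q8_0 stores weights in blocks of 32 int8 values with one fp16 scale per block. A quick numpy simulation of that round-trip (my own sketch, not llama.cpp code) shows the reconstruction error is well under 1% for Gaussian-ish weights:

```python
import numpy as np

def q8_0_roundtrip(x):
    """Simulate llama.cpp's Q8_0 scheme: blocks of 32 values, one fp16 scale
    per block, int8 quantized values. Returns the dequantized array."""
    x = x.reshape(-1, 32)                        # Q8_0 block size is 32
    amax = np.abs(x).max(axis=1, keepdims=True)
    d = (amax / 127.0).astype(np.float16)        # per-block scale, stored as fp16
    d = d.astype(np.float32)
    q = np.round(np.divide(x, d, out=np.zeros_like(x), where=d > 0))
    q = np.clip(q, -127, 127)                    # int8 range used by Q8_0
    return (q * d).reshape(-1)

rng = np.random.default_rng(0)
w = rng.normal(0, 0.02, size=32 * 1024).astype(np.float32)  # weight-like values
w_hat = q8_0_roundtrip(w)
rel_rmse = np.sqrt(np.mean((w - w_hat) ** 2)) / np.sqrt(np.mean(w ** 2))
print(f"relative RMSE: {rel_rmse:.4f}")
```

So the per-weight error is tiny; whether that tiny error matters downstream is exactly what this kind of A/B testing probes.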
Qwen3.5 0.8B seems to benefit from FP16 over Q8 a little more - either it notices more, or at least is more confident. But maybe that's due to the text model being so small, rather than the visual portion, as it's more prone to variability in output anyway. (Now that I think about it, it would probably make more sense to use Q8 base model and Q8 mmproj in these tiny sizes.)
Qwen3.5 4B is interesting though. I found that FP16 seems to introduce visual noise rather than actually helping. In edge cases, it starts seeing patterns where there are none, and it can get stuck in a loop speculating what they mean, reasoning through alternative explanations that don't go anywhere, and going back and forth trying to reinterpret the part of the image in question. Good old overthinking Qwen.
In one case, Q8 correctly identified a blurry animated poster in the background, while FP16 didn't see it at all and focused on the areas of the image in focus. This is interesting, and evidence of the visual noise the extra detail can produce. If everything looks slightly blurry to the model, it sees different elements more evenly, but still sees well enough to identify what's what, while extra precision may get it sidetracked. I guess it's akin to moiré on imaging sensors without an anti-aliasing filter producing fake detail.
I also tried FP32 just for kicks with Qwen3.5 4B, and it's the same as FP16. It just introduces minor variations in phrasing, so tiny that even a typo or extra space in a prompt makes much more of a difference.
Anyway, my personal takeaway: FP16 is just a waste of space for these models and my setup. And Qwen3.5 4B can see so damn well that the extra data can actually confuse it.
An alternative explanation could be that an FP16 vision encoder works better with an FP16 text model? I haven't tried that.
Considering how much talk there is about model quants, I think this is something worth looking into. FP16 seems to be taken for granted as the default for mmproj, but vision reasoning in these models is so good these days that this default may be outdated. Maybe even smaller quants are good enough.
I can't personally test much more since it takes ages, and I was just quelling my curiosity. Maybe someone could benchmark this more rigorously.
while-1-fork@reddit
I have been trying a Q8_0 mmproj for Qwen 3.5 35B A3B and it seems to work as well as or better than BF16 (just today it correctly saw something that it missed a few days ago in BF16). I'd love to try Q6_K, but when I tried quantizing my own, some layers were not divisible by 32. Not quite sure how ddh0 and AesSedai solved the divisibility issue, but they have Q8.
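For context on the divisibility issue: block quants can only handle tensor rows that are a whole number of blocks wide — 32 values per block for Q8_0, and 256-value super-blocks for k-quants like Q6_K — which is presumably why oddly-shaped projector tensors refuse to quantize. A trivial check (the 1176 width below is a made-up example of an awkward shape):

```python
def quantizable(row_width: int, block_size: int) -> bool:
    """A tensor row can only be block-quantized if its width is a whole
    number of blocks."""
    return row_width % block_size == 0

# Q8_0 blocks are 32 wide; k-quants like Q6_K use 256-wide super-blocks.
for width in (4096, 1176):
    print(width, "Q8_0:", quantizable(width, 32),
          "Q6_K:", quantizable(width, 256))
```

A width like 4096 works for both, while an odd width fails even the 32-wide check, matching the error described above.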
WhoRoger@reddit (OP)
Seems like Q8/FP16/BF16 mmproj's keep about half the layers in FP32, with the rest in the lower precision. I guess that's why the file sizes don't differ as much as with the main model quants. No clue how that works, though. Unsloth also has full FP32.
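If you want to verify which layers stay FP32, the gguf Python package maintained in the llama.cpp repo can list per-tensor dtypes. A sketch, assuming that package's GGUFReader API; the file name is a placeholder:

```python
from collections import Counter

def dtype_summary(dtype_names):
    """Tally tensor dtype names, e.g. ['F32', 'F16', 'F32'] -> counts."""
    return Counter(dtype_names)

if __name__ == "__main__":
    # Requires: pip install gguf
    from gguf import GGUFReader
    reader = GGUFReader("mmproj-Q8_0.gguf")   # hypothetical file name
    summary = dtype_summary(t.tensor_type.name for t in reader.tensors)
    print(summary)   # shows how many tensors stay F32 vs. get quantized
```

That would show directly how much of a "Q8" mmproj is actually still full precision.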
Monad_Maya@reddit
I haven't tested it but yes I usually default to F16 for mmproj.
Do you have the link to Qwen3.5 4B's Q8 mmproj that you used?
WhoRoger@reddit (OP)
https://huggingface.co/mradermacher/Huihui-Qwen3.5-4B-abliterated-GGUF/blob/main/Huihui-Qwen3.5-4B-abliterated.mmproj-Q8_0.gguf
But I also tried a different one (I forgot from whom, I already deleted it) that was giving the exact same responses. So it's probably a default llama.cpp Q8 quant or something.
Monad_Maya@reddit
Thanks
Jack_Kennedy_2009@reddit
In my own findings it can be because of the clipping damage done to the model/vision projector. The vast majority of models today (Qwen, Gemma 3/4, etc.) are trained natively in BF16. When someone does the intermediate conversion to FP16 for the model or projector, they are doing massive damage to the AI. What I have found in my own testing with models (I have not tested vision projectors themselves much) is that a BF16-converted intermediate quanted to Q8 will be less damaged and closer to the original than an FP16-converted one. That is probably what you have seen in action when A/B testing the vision encoders at different precision formats. FP16 is a relic and I'm not even sure why people default to it, but maybe that helps. Any GPU 30 series or newer has BF16 support, and RDNA 3 or newer, Apple silicon, and Intel Arc cards all support it as well, I believe. Always go native BF16 if you can, and quant down from BF16 GGUF intermediates, is what I always do.
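The clipping is easy to demonstrate: FP16 tops out at 65504, while BF16 keeps FP32's full exponent range at the cost of mantissa bits. A small numpy sketch (BF16 emulated by truncating fp32 bits, since numpy has no native bfloat16):

```python
import numpy as np

def to_bf16(x: np.ndarray) -> np.ndarray:
    """Emulate bfloat16 by truncation: keep the sign bit, all 8 exponent
    bits, and the top 7 mantissa bits of an fp32 value."""
    bits = x.astype(np.float32).view(np.uint32)
    return (bits & np.uint32(0xFFFF0000)).view(np.float32)

big = np.array([70000.0], dtype=np.float32)  # beyond FP16's max of 65504
print(big.astype(np.float16))  # FP16 overflows (clips) to inf
print(to_bf16(big))            # BF16 keeps the magnitude, just less precisely
```

So a BF16-trained weight or activation above 65504 is destroyed outright by an FP16 intermediate, while BF16 (or Q8 quantized from BF16) merely rounds it.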
ambient_temp_xeno@reddit
In one test image I've been using, the f16 gemma4 mmproj can see better than the bf16, so it might be kind of random depending on the image.
WhoRoger@reddit (OP)
Gemma4 (at least E2 and E4) is very random with its image reading anyway and assumes a lot if it can't see something. So yea very likely per-image randomness imho.
WhoRoger@reddit (OP)
Yea that makes sense, now that I think about it. Pretty much what one would expect from such a conversion in vision: noise and fake patterns from the clipping.
Though I guess F32 should mitigate that damage, unless it's just F16 upscaled.
A lot of HF repos don't even come with BF16, just FP16; that's why I downloaded those first, since it wasn't clear to me.
brown2green@reddit
The models are trained in BF16 precision, so you should test with that instead of F16, even if the difference is theoretically small. With Gemma 4 31B I find that on images where the model can get confused Q8_0 performs slightly worse compared to BF16 (more confusion).
WhoRoger@reddit (OP)
Makes sense, but some model repos come with just Q8 and FP16, which is why I started with those. And most people seem to go for FP16 as the default, so it's worth knowing it's not ideal.
I guess I'll try dl'ing BF16 for Qwen 4B and see if there's a difference.
Powers666@reddit
I also tested them and Q8 sometimes misinterpreted numbers. It gave me a "3" instead of the written "2", so it's useless for me.
Sadman782@reddit
Which model? Maybe it doesn't work for all models, but Q8_0 should work like this for the best performance.
Sadman782@reddit
I have the same observation for Gemma 4 26B MoE mmproj. Q8_0 > BF16 >= FP16, Q8_0 somehow performed better.
dampflokfreund@reddit
Yeah, I agree. I do not see a difference either, and it saves a bit of VRAM. You might think that's not much for a 1 GB model, but remember that for memory-efficient architectures like Qwen 3.5 and Gemma 4 with SWA, every 100 MB allows for significantly more context.