Quants in vision (mmproj Q8 vs FP16)
Posted by WhoRoger@reddit | LocalLLaMA | 15 comments
Disclaimer: This is totally just my personal testing/messing around. Nothing scientific.
TL;DR: I find FP16 mmproj pointless; it may even harm quality rather than help.
I decided to check vision of the recent small models on llama.cpp. I didn't know any better, so I downloaded Q8 of the mmprojs. Then I looked into it and found that most people just go for FP16 at all times, so I downloaded those too. And well since I already had both versions for each model, I might as well compare them.
Models: Qwen3.5 0.8B, 2B, 4B, Gemma 4 E2B and E4B, Gemma 3 4B - all Heretics of some sort (all Q6_K or i1/Q6_K, some in uncensored versions too, some also in IQ4_NL because I've been collecting them already). Most mmproj's seem to be totally untouched when people uncensor the models. (Often this is mentioned, but not always.) For some models, I also tried mmproj's from different providers, and they always give the exact same responses, so they're mathematically identical, even if file hashes don't match. Though I found some (MARTHA for Qwen 0.8B and 2B) that may have some tuning, because their responses differ slightly.
Running these just on CPU, because I'm poor and crazy, so the math may be a bit different on other hardware. Temperature 0 to see the differences. Anyway.
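For anyone wanting to reproduce this kind of A/B test, here's a rough sketch of how the comparison could be scripted against llama.cpp's llama-mtmd-cli. All file names are placeholders, and this is my own sketch rather than anything from the thread:

```python
import subprocess

def build_cmd(model_path, mmproj_path, image_path,
              prompt="Describe this image in detail."):
    """Build an argv list for llama.cpp's llama-mtmd-cli with greedy sampling,
    so the only variable between runs is which mmproj quant is loaded."""
    return [
        "llama-mtmd-cli",
        "-m", model_path,
        "--mmproj", mmproj_path,
        "--image", image_path,
        "-p", prompt,
        "--temp", "0",   # temperature 0: deterministic output, easy to diff
    ]

if __name__ == "__main__":
    # Hypothetical file names; substitute your own downloads.
    for mmproj in ("mmproj-Q8_0.gguf", "mmproj-F16.gguf"):
        cmd = build_cmd("Qwen3.5-4B-Q6_K.gguf", mmproj, "test.jpg")
        print(" ".join(cmd))
        # subprocess.run(cmd, check=True)  # uncomment to actually run
```

Diffing the two saved outputs then shows exactly where the quants diverge.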
Tried a variety of oddball pics, photos and generated. Atypical stuff or with a lot of specifics. Medical images, a mannequin in a dumpster, selfies in odd environments, anatomical deformities, behind-the-scenes shots from movies showing props, that sort of thing. Stuff that can trip up models that expect generic content.
Well first off, Qwen3.5 4B absolutely destroys all the others in recognising and reasoning. That's nothing new, but the level of detail is amazing. E.g. it can see that blood looks a bit off (on the movie props stuff) and speculates that it may be crushed berries. That's crazy. Though you need to look into its thinking to see that, or prompt about the specifics, since in the final output it usually discards elements that it's not sure about.
Anyway, the quants.
In short, I find the differences between Q8 and F16 mmproj's insignificant, except for Qwen3.5 0.8B and 4B. The phrasing of the image descriptions differs slightly rather than the contents, overall indicating that the models see a bit sharper, or may first focus on something else. But you'll get the same contents either way. The models seem to see more than they want to put into words anyway, possibly to keep the descriptions brief. If you press the model for details, you'll learn the exact same things from mmproj's in Q8 as from FP16.
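For intuition on why Q8 loses so little: llama.cpp's Q8_0 stores weights in blocks of 32 int8 values with one fp16 scale per block. A quick numpy simulation of that round-trip (my own sketch, not llama.cpp code) shows the reconstruction error is well under 1% for Gaussian-ish weights:

```python
import numpy as np

def q8_0_roundtrip(x):
    """Simulate llama.cpp's Q8_0 scheme: blocks of 32 values, one fp16 scale
    per block, int8 quantized values. Returns the dequantized array."""
    x = x.reshape(-1, 32)                        # Q8_0 block size is 32
    amax = np.abs(x).max(axis=1, keepdims=True)
    d = (amax / 127.0).astype(np.float16)        # per-block scale, stored as fp16
    d = d.astype(np.float32)
    q = np.round(np.divide(x, d, out=np.zeros_like(x), where=d > 0))
    q = np.clip(q, -127, 127)                    # int8 range used by Q8_0
    return (q * d).reshape(-1)

rng = np.random.default_rng(0)
w = rng.normal(0, 0.02, size=32 * 1024).astype(np.float32)  # weight-like values
w_hat = q8_0_roundtrip(w)
rel_rmse = np.sqrt(np.mean((w - w_hat) ** 2)) / np.sqrt(np.mean(w ** 2))
print(f"relative RMSE: {rel_rmse:.4f}")
```

So the per-weight error is tiny; whether that tiny error matters downstream is exactly what this kind of A/B testing probes.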
Qwen3.5 0.8B seems to benefit from FP16 over Q8 a little more - either it notices more, or at least is more confident. But maybe that's due to the text model being so small, rather than the visual portion, as it's more prone to variability in output anyway. (Now that I think about it, it would probably make more sense to use Q8 base model and Q8 mmproj in these tiny sizes.)
Qwen3.5 4B is interesting though. I found that FP16 seems to introduce visual noise rather than actually helping. In edge cases, it starts seeing patterns where there are none, and it can get stuck in a loop speculating what they mean, reasoning through alternative explanations that don't go anywhere, and going back and forth trying to reinterpret the part of the image in question. Good old overthinking Qwen.
In one case, Q8 correctly identified a blurry animated poster in the background, while FP16 didn't see it at all and focused on the areas of the image in focus. This is interesting, and evidence of the visual noise the extra detail can produce. If everything looks slightly blurry to the model, it sees different elements more evenly, but still sees well enough to identify what's what, while extra precision may get it sidetracked. I guess it's akin to moiré on imaging sensors without an anti-aliasing filter producing fake detail.
I also tried FP32 just for kicks with Qwen3.5 4B, and it's the same as FP16. It just introduces minor variations in phrasing, so tiny that even a typo or extra space in a prompt makes much more of a difference.
Anyway, my personal takeaway: FP16 is just a waste of space for these models and my setup. And Qwen3.5 4B can see so damn well that the extra data can actually confuse it.
An alternative explanation could be that an FP16 vision encoder works better with an FP16 text model? I haven't tried that.
Considering how much talk there is about model quants, I think this is something worth looking into. FP16 seems to be taken for granted as the default for mmproj, but vision reasoning in these models is so good these days that this default may be outdated. Maybe even smaller quants are good enough.
I can't personally test much more since it takes ages, and I was just quelling my curiosity. Maybe someone could benchmark this more rigorously.
while-1-fork@reddit
I have been trying a Q8_0 mmproj for Qwen 3.5 35B A3B and it seems to work as well as or better than BF16 (just today it correctly saw something that it missed a few days ago in BF16). I'd love to try Q6_K, but when I tried quantizing my own, some layers were not divisible by 32. Not quite sure how ddh0 and AesSedai solved the divisibility issue, but they have Q8.
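For context on the divisibility issue: block quants can only handle tensor rows that are a whole number of blocks wide — 32 values per block for Q8_0, and 256-value super-blocks for k-quants like Q6_K — which is presumably why oddly-shaped projector tensors refuse to quantize. A trivial check (the 1176 width below is a made-up example of an awkward shape):

```python
def quantizable(row_width: int, block_size: int) -> bool:
    """A tensor row can only be block-quantized if its width is a whole
    number of blocks."""
    return row_width % block_size == 0

# Q8_0 blocks are 32 wide; k-quants like Q6_K use 256-wide super-blocks.
for width in (4096, 1176):
    print(width, "Q8_0:", quantizable(width, 32),
          "Q6_K:", quantizable(width, 256))
```

A width like 4096 works for both, while an odd width fails even the 32-wide check, matching the error described above.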
WhoRoger@reddit (OP)
Seems like Q8/FP16/BF16 mmproj's keep about half the layers in FP32, with the rest in the lower precision. I guess that's why the file sizes don't differ as much as with the main model quants. No clue how that works, though. Unsloth also has full FP32.
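If you want to verify which layers stay FP32, the gguf Python package maintained in the llama.cpp repo can list per-tensor dtypes. A sketch, assuming that package's GGUFReader API; the file name is a placeholder:

```python
from collections import Counter

def dtype_summary(dtype_names):
    """Tally tensor dtype names, e.g. ['F32', 'F16', 'F32'] -> counts."""
    return Counter(dtype_names)

if __name__ == "__main__":
    # Requires: pip install gguf
    from gguf import GGUFReader
    reader = GGUFReader("mmproj-Q8_0.gguf")   # hypothetical file name
    summary = dtype_summary(t.tensor_type.name for t in reader.tensors)
    print(summary)   # shows how many tensors stay F32 vs. get quantized
```

That would show directly how much of a "Q8" mmproj is actually still full precision.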
Monad_Maya@reddit
I haven't tested it but yes I usually default to F16 for mmproj.
Do you have the link to Qwen3.5 4B's Q8 mmproj that you used?
WhoRoger@reddit (OP)
https://huggingface.co/mradermacher/Huihui-Qwen3.5-4B-abliterated-GGUF/blob/main/Huihui-Qwen3.5-4B-abliterated.mmproj-Q8_0.gguf
But I also tried a different one (I forgot from whom, I already deleted it) that was giving the exact same responses. So it's probably a default llama.cpp Q8 quant or something.
Monad_Maya@reddit
Thanks
Jack_Kennedy_2009@reddit
In my own findings it can be because of the clipping damage done to the model/vision projector. The vast majority of models today (Qwen, Gemma 3/4, etc.) are trained natively in BF16. When someone does the intermediate conversion to FP16 for the model or projector, they are doing massive damage to the AI. What I have found in my own testing with models (I have not tested vision projectors themselves much) is that a BF16-converted intermediate quanted to Q8 will be less damaged and closer to the original than an FP16-converted one. That is probably what you have seen in action when A/B testing the vision encoders at different precision formats. FP16 is a relic and I'm not even sure why people default to it, but maybe that helps. Any GPU 30 series or newer has BF16 support, and RDNA 3 or newer, Apple silicon, and Intel Arc cards all support it as well, I believe. Always go native BF16 if you can, and quant down from BF16 GGUF intermediates, is what I always do.
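The clipping is easy to demonstrate: FP16 tops out at 65504, while BF16 keeps FP32's full exponent range at the cost of mantissa bits. A small numpy sketch (BF16 emulated by truncating fp32 bits, since numpy has no native bfloat16):

```python
import numpy as np

def to_bf16(x: np.ndarray) -> np.ndarray:
    """Emulate bfloat16 by truncation: keep the sign bit, all 8 exponent
    bits, and the top 7 mantissa bits of an fp32 value."""
    bits = x.astype(np.float32).view(np.uint32)
    return (bits & np.uint32(0xFFFF0000)).view(np.float32)

big = np.array([70000.0], dtype=np.float32)  # beyond FP16's max of 65504
print(big.astype(np.float16))  # FP16 overflows (clips) to inf
print(to_bf16(big))            # BF16 keeps the magnitude, just less precisely
```

So a BF16-trained weight or activation above 65504 is destroyed outright by an FP16 intermediate, while BF16 (or Q8 quantized from BF16) merely rounds it.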
ambient_temp_xeno@reddit
In one test image I've been using, the f16 gemma4 mmproj can see better than the bf16, so it might be kind of random depending on the image.
WhoRoger@reddit (OP)
Gemma4 (at least E2 and E4) is very random with its image reading anyway and assumes a lot if it can't see something. So yea very likely per-image randomness imho.
WhoRoger@reddit (OP)
Yea that makes sense, now that I think about it. Pretty much what one would expect from such a conversion in vision: noise and fake patterns from the clipping.
Though I guess F32 should mitigate that damage, unless it's just F16 upscaled.
A lot of HF repos don't even come with BF16, just FP16; that's why I downloaded those first, since it wasn't clear to me.
brown2green@reddit
The models are trained in BF16 precision, so you should test with that instead of F16, even if the difference is theoretically small. With Gemma 4 31B I find that on images where the model can get confused Q8_0 performs slightly worse compared to BF16 (more confusion).
WhoRoger@reddit (OP)
Makes sense, but some model repos come with just Q8 and FP16, which is why I started with those. And most people seem to go for FP16 as the default, so it's worth knowing it's not ideal.
I guess I'll try dl'ing BF16 for Qwen 4B and see if there's a difference.
Powers666@reddit
I also tested them and Q8 sometimes misinterpreted numbers. It gave me a "3" instead of the written "2", so it's useless for me.
Sadman782@reddit
Which model? Maybe it doesn't work for all models, but Q8_0 should work like this for the best performance.
Sadman782@reddit
I have the same observation for Gemma 4 26B MoE mmproj. Q8_0 > BF16 >= FP16, Q8_0 somehow performed better.
dampflokfreund@reddit
Yeah, I agree. I do not see a difference either, and it saves a bit of VRAM. You might think that's not much for a 1 GB model, but remember that for memory-efficient architectures like Qwen 3.5 and Gemma 4 with SWA, every 100 MB allows for significantly more context.