Gemma 4 Vision
Posted by seamonn@reddit | LocalLLaMA | 67 comments
A lot of people in the Gemma 4 Model Request Thread were asking for better vision capabilities in the next Gemma Model. This tells me that people are not configuring Gemma 4's vision budget.
Gemma 4 ships with Variable Image Resolution. The default max vision budget is 280 tokens (~645K pixels), which is way too low. At that setting it fails to OCR tiny details. It's essentially blind in my book.
In llama.cpp, you can configure Gemma 4's vision budget with two parameters: --image-min-tokens and --image-max-tokens. The engine will try to fit the image within those bounds. I believe the defaults are 40 and 280 respectively. That is Gemma 4's default from Google's side, but it's way too low.
I like to run them at 560 and 2240 respectively and it's able to pick up very minute and hazy details within images.
Why 2240 - isn't that double the max from Google (1120)? In my testing, 2240 for some reason works better than 1120. I suspect this is because of llama.cpp's implementation, where it tries to fit the image between the min and max token bounds.
Additionally, you will also have to set --batch-size and --ubatch-size above whatever value you choose for --image-max-tokens. I run them at 4096 (for --image-max-tokens 2240). This will consume a lot more VRAM: roughly 63 GB with the default batch size vs 77 GB with 4096, for q8_0 at max context.
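For reference, a full llama-server invocation with these settings looks roughly like the sketch below - the model and mmproj paths are placeholders for whatever quant you're running:

```bash
llama-server \
  --model /models/gemma-4-31B-it-q8_0.gguf \
  --mmproj /models/gemma-4-31B-it-mmproj-f32.gguf \
  --image-min-tokens 560 \
  --image-max-tokens 2240 \
  --batch-size 4096 \
  --ubatch-size 4096
```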
If you use Ollama, you are likely SOL unless and until they care to fix this.
It's worth it though: with a higher vision budget, Gemma 4 is pretty much SOTA for vision and pretty much destroys anything else, especially for OCR - Qwen 3.5, Qwen 3.6, GLM OCR (or any other random OCR model), Kimi K2.5. I haven't tested Kimi K2.6, and I refuse to touch cloud models.
Yukki-elric@reddit
You can also just put both --image-min-tokens and --image-max-tokens to 1120 and it'll basically see everything at the highest quality it can, probably more reliable than the values OP used.
seamonn@reddit (OP)
This will upscale anything that is smaller than 1120 tokens (~2.6M pixels). If you want the original size of the image to be maintained, you should ideally provide both a lower and an upper bound.
Yukki-elric@reddit
Oh, didn't know that; the arguments aren't really well documented on llama.cpp's side.
seamonn@reddit (OP)
I had to dig through the code with an LLM to figure out what each of them (llama.cpp and Ollama) is doing. Ollama just upscales everything to the max tokens (280). llama.cpp tries to keep the image between the bounds.
DangKilla@reddit
Gemma 4 doesn't have bounds tho, so people should be setting the min and max to the same number until llama.cpp allows a max soft token setting.
The underlying vision encoder supports 70, 140, 280, 560, 1120.
seamonn@reddit (OP)
Llama.cpp will not resize the image if the image is within its bounds. Specifically in my testing, 560 and 2240 was outperforming 1120 and 1120.
I have my suspicion that Gemma 4 supports >1120 max tokens.
Majesticeuphoria@reddit
Your suspicion seems correct to me as well. I compared on a dataset of 5k+ images for OCR and your params performed better than 1120 for both by 2.5%. Though, it's still not nearly as good as running AWQ or FP8 of the model. Those have drastically higher accuracy by 6-8% even at 280 or 560.
seamonn@reddit (OP)
/u/createthiscom
createthiscom@reddit
Very cool. Thanks for the mention!
seamonn@reddit (OP)
Did you try the BF16 GGUF?
Majesticeuphoria@reddit
Yes
seamonn@reddit (OP)
I reckon at that point, it's diminishing returns
seamonn@reddit (OP)
Also, weirdly, 560 and 2240 was outperforming 1120 and 1120 in my testing. I suspect this is because the model is capable of more than 1120 max tokens.
rebelSun25@reddit
Since you seem to know what you're doing, can you tell me what the full options look like for llama.cpp and vLLM?
seamonn@reddit (OP)
I use llama-swap
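The llama-server command inside my llama-swap config is roughly the following - treat it as a sketch rather than a copy-paste config, since the paths, quant and -ngl value are specific to my setup (--ctx-size 0 tells llama.cpp to use the model's full context window). For vLLM, see DangKilla's command further down the thread.

```bash
llama-server \
  --model /models/Gemma4/gemma-4-31B-it-q8_0.gguf \
  --mmproj /models/Gemma4/gemma-4-31B-it-mmproj-f32.gguf \
  --jinja \
  --chat-template-file models/Gemma4/google-gemma-4-31B-it-interleaved.jinja \
  --image-min-tokens 560 \
  --image-max-tokens 2240 \
  --batch-size 4096 \
  --ubatch-size 4096 \
  -ngl 99 \
  --ctx-size 0
```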
RobotRobotWhatDoUSee@reddit
Where did you get your 'models/Gemma4/google-gemma-4-31B-it-interleaved.jinja' template?
seamonn@reddit (OP)
https://github.com/ggml-org/llama.cpp/tree/master/models/templates
crizz_95@reddit
I often see that people use the --jinja parameter. Is this necessary? I think it's enabled by default.
WPBaka@reddit
I personally have it configured for tool calls.
annodomini@reddit
What jinja did you use to fix tool calls?
I found the 31b was working fine after all of the fixes landed on llama.cpp and the quants, but I still had trouble with the 26b-a4b. Wondering if there's a template I should be using instead.
WPBaka@reddit
Minja! I've only used the 31b as well so no clue on the 26b-a4b sadly.
liftheavyscheisse@reddit
Off topic, but have you tried running the 31b at longer contexts? I feel like it starts to get dumb around 60k tokens.
Several-Tax31@reddit
It's enabled by default, but not using it was giving me weird prompt reprocessing issues, so I added it.
cristoper@reddit
It has been enabled by default since December (https://github.com/ggml-org/llama.cpp/pull/17911), but it didn't use to be, so some of us are still in the habit of specifying it.
IrisColt@reddit
Thanks!!!
rebelSun25@reddit
Thanks
WhoRoger@reddit
Hm maybe it works well on 31B but I'm trying it now on E4B and I'm not impressed. It just takes 5x as long to digest a (large-ish) image, but doesn't provide any more useful information. Maybe it'll work better on OCR/text, or maybe E4B just can't take advantage of more data. Qwen 3.5 4B definitely wins, with E4B being good for a quick and dirty response.
Btw I see you're using F32 mmproj; pretty sure you can use BF16 with the exact same quality for a bit less RAM (not FP16 tho, that's worse). Or maybe just Q8 outright and save the space. Try it out. I've been checking this out on small models, and I'd bet it's the case with larger ones too.
caetydid@reddit
And here I was under the impression that the Gemma 4 model decides by itself how many image tokens it is going to use?
Confident_Ideal_5385@reddit
Is there a technical reason that llama-server wants to ingest all the image tokens in one batch? Or is that something that can be solved with a PR because someone got lazy?
jannycideforever@reddit
King shit, and gives me a reason to finally give up on using ollama
vr_fanboy@reddit
Good info, Gemma 4 vision is in my backlog to test. Does anybody know if it can generate bounding boxes like Qwen 3.5? That's very useful to bootstrap annotation datasets.
sharp1120@reddit
Gemma4 can generate bounding boxes. See the docs here: https://ai.google.dev/gemma/docs/capabilities/vision/image
ambient_temp_xeno@reddit
It should be maxed out at 1120. Put the min as 1120 as well as max, problem solved.
seamonn@reddit (OP)
That will upscale anything that is less than 1120 (~2.6M pixels).
ambient_temp_xeno@reddit
Is that a bad thing, though? I wish this whole thing was better documented.
Temporary-Mix8022@reddit
I wish it was better documented as well.. it is so hard to work out.
Has anyone found decent docs for this? Have I just missed them somewhere?
seamonn@reddit (OP)
Upscaling might lead to some false positives. I would rather have the image get processed untouched.
eposnix@reddit
Yeah, that is the main reason I can't use LM Studio for vision tasks - they don't expose these variables for whatever reason.
Is this something that can be patched, /u/yags-lms?
WhoRoger@reddit
I'd really like to know how to use all these eldritch commands in router mode.
empire539@reddit
I was trying at 1120 min/max tokens and it was better but still kinda meh. I fed it a 896px square picture of a character against a white background with a high contrast (white text on black box) name labeled at the bottom, and it still got both the name and hair color wrong somehow, despite taking like 10x longer for encoding. I'll have to try again at 2240; didn't realize you could go even higher.
Worried-Squirrel2023@reddit
the variable image resolution thing is genuinely the trap most people are hitting. defaults that work for benchmarks rarely work for real OCR. same pattern as qwen3.5 vision where the default token budget gave you blurry receipts and small text fell apart. saving this for next time someone asks why their local vision model can't read a chart.
createthiscom@reddit
I'd be more impressed with this if you supplied a test document you use to prove your case for 2240.
seamonn@reddit (OP)
Nah, you are right, 280 is the way to go and Gemma 4 Vision is trash. Might as well go cloud. /s
createthiscom@reddit
"reproducible tests? bah."
seamonn@reddit (OP)
:D
Top-Rub-4670@reddit
Says the guy who confidently calls Gemma 4 SOTA whilst also never having tried cloud models.
Might be amongst the best local vision models, but "the Art" isn't limited to local.
seamonn@reddit (OP)
Seriously though, I have a very specific and reproducible test for this but unfortunately, I can't share it publicly.
createthiscom@reddit
I'm seriously just curious, so if you come up with a test image you can share, let me know. I've been using the `--image-max-tokens 1120` with great results so far.
seamonn@reddit (OP)
--image-max-tokens 1120 is 99% there but my settings are just a tiny bit better. You can test it yourself too if you feel like it on your stuff.
VoiceApprehensive893@reddit
i was surprised when people complained about gemma vision being bad when it destroyed qwen in lm arena tests
Top-Rub-4670@reddit
In my testing Gemma 4 failed to recognize objects a lot more than Qwen 3.5. It sees the object, mind you, so it's not a resolution issue: if I guide it, it will describe the exact shape and colors. It just doesn't know what they are. Qwen 3.5 not only knows what they are, it volunteers the information from the get-go.
I love Gemma 4, but it's a lazy model with worse vision than Qwen 3.5. It does okay at OCR, but for general images it's way less reliable/capable.
nickm_27@reddit
It definitely helps with static vision, unfortunately even with that in my tests on video Gemma4 does not do very well compared to Qwen3.5 (or Qwen3-VL) which have better temporal understanding. Gemma4 seems to mash all the images together instead of understanding a person is for example walking away vs standing etc.
leonbollerup@reddit
maybe consider this:
as good as llama.cpp is, you literally have to know and understand a million different switches, and how each LLM model works with each switch, to get the best out of it. Even the best of us never get to that point.
and this is why people turn to Unsloth studio, LM Studio etc. So yeah, you are probably right, people don't know.
AnonLlamaThrowaway@reddit
That certainly explains why it seemed blind as a bat when trying to read text on a photo of a soda can. Thanks for the heads up
seamonn@reddit (OP)
I would wager, it will be able to OCR every single letter, heh. Do let us know if you decide to re-test.
stddealer@reddit
Oh! I was running with --image-min-tokens 1024 from the start (out of habit from Qwen3.5) and I was confused about why people were feeling let down by Gemma 4's vision.
DangKilla@reddit
Good info!
vLLM version of this:
vllm serve google/gemma-4-31B-it \
    --mm-processor-kwargs '{"max_soft_tokens": 1120}' \
    --tensor-parallel-size N \
    --dtype bfloat16 \
    --max-model-len 32768 \
    --gpu-memory-utilization 0.9

(adjust --tensor-parallel-size for your GPUs; --max-model-len can go higher, up to 256k)
Note: set max_soft_tokens to 560 (good balance) or 1120 (maximum detail, closest to your llama.cpp 2240 setting).
Upset_Page_494@reddit
Is this the default in LM Studio? Or do I need to configure it, or not yet supported?
seamonn@reddit (OP)
No clue about LM Studio unfortunately, but everyone seems to be running the 280 default as that's what's recommended by Google.
666666thats6sixes@reddit
Same situation with QwenVL models (including all Qwen3.5 and 3.6). It shows a warning in llama-server logs but who reads those. Raising --image-min-tokens from 8 to 1024 improves vision a lot, especially with non-textual imagery like navigating desktop UIs or when testing frontends.
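For example, for a Qwen VL model something along these lines makes a big difference (the model and mmproj names here are just placeholders for whatever files you actually have):

```bash
llama-server \
  --model /models/Qwen3.5-VL-32B-q8_0.gguf \
  --mmproj /models/Qwen3.5-VL-32B-mmproj-f16.gguf \
  --image-min-tokens 1024
```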
Temporary-Mix8022@reddit
Thanks for writing it.. and thanks for the typos that I believe only a human could have made. (Genuinely, zero sarcasm)
Literally just so happy to read something that isn't slop.
Also, I was doing some work with the vision encoder from the smaller models (where it's ~150M params). I ended up using 70 tokens as I thought that was the minimum? Are you saying that it's actually 40 tokens?
Or is that only for the larger ~500M vision encoder that is on the larger LLMs?
ouzhja@reddit
Google's model card has 70 listed as the smallest option.
seamonn@reddit (OP)
That's the minimum for soft_max_tokens, aka --image-max-tokens. The lower bound of 40 is for --image-min-tokens.
Egoz3ntrum@reddit
Thank you for this
segmond@reddit
Thanks for sharing.