THUDM/GLM-4.1V-9B-Thinking looks impressive


benxben13@reddit

For me it's working well for OCR, better than OLM, Typhoon, or fluxOCR.


[-]

It does seem like the VL landscape has a lot of room for growth. Every time there's a benchmark for a VL model, it's like 'here's our tiny model compared to several 72B models.' You don't see that with normal LLMs.


JuicedFuck@reddit

The VL space has been completely stagnant for >1 year in terms of image understanding. Models now have CoT + VL just so they can solve benchmarks like "Solve the equation on the blackboard", but the CoT does absolutely nothing to help them understand the complicated images that previous models struggled with. On my private test set, I have seen no improvement from any vision model except Google's Gemini 2.5 Pro.


l33t-Mt@reddit

[https://moondream.ai/](https://moondream.ai/)


JuicedFuck@reddit

Get your pitiful guerilla shill campaign away from me.


l33t-Mt@reddit

Moondream is a guerilla shill campaign? How so? Best small local model available. Prove me wrong.


JuicedFuck@reddit

Yeah, it has been for a while. https://preview.redd.it/hjqryhkrtsbf1.png?width=923&format=png&auto=webp&s=d4a31b6cd8cb10c573f43349077fa09e44869ba7


l33t-Mt@reddit

Dude, the guy you posted is literally the PR guy for moondream. I'm just a user who had a recommendation. What's better at the size? How is it better? You listed Gemini.... LocalLLama.... So helpful.


DepthHour1669@reddit

Vision is surprisingly tiny, to be fair. Llama 3.2 11b vision is just 3b more than the Llama 3.1 8b it was built off of.


AppearanceHeavy6724@reddit

> the Llama 3.1 8b it was built off of

That is not true; it is a widespread misconception. The visual layers are less than 1B in size, and the text layers of 3.2 11b are bigger than Llama 3.1 8b.


CheatCodesOfLife@reddit

> It is not true

It is true. There's even a Llama 3.2-90b with the text layers swapped with the Llama 3.3 70b model: [Llama-3.3-90B-Vision-merged](https://huggingface.co/gghfez/Llama-3.3-90B-Vision-merged). And it worked exactly like Llama3.3-90b for textgen when I tried it.


AppearanceHeavy6724@reddit

> And it worked exactly like Llama3.3-90b for textgen when I tried it.

Deepseek v3-0324 and Mistral Small 3.2 behave almost identically, often word-for-word the same, for textgen; check lmarena if you do not believe me. Internally they are massively different, though. OTOH, my experiments on build.nvidia.com show that the 11b is far more unhinged in its output than 3.1 8b.

Anyway, here is config.json for 3.1 8B:

"num_hidden_layers": 32,

and config.json for the 11b:

"num_hidden_layers": 40,

Feel free to explain how a model with 40 layers is the same as one with 32 layers, and also feel free to test on build.nvidia.com with T=0 and the other sampler settings set the same for both models, with the prompt of your choice.
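For anyone who wants to repeat the layer-count check on other model pairs, here is a minimal sketch. The two dicts hold only the values quoted in this thread, not the real config.json files, and the nested "text_config" lookup is an assumption about how a multimodal config might be laid out; the function names are made up for illustration.

```python
def text_layer_count(config: dict) -> int:
    """Return num_hidden_layers for the language-model part of a config."""
    # Assumption: a multimodal config may nest the text settings under
    # "text_config"; a plain text model keeps them at the top level.
    text_cfg = config.get("text_config", config)
    return text_cfg["num_hidden_layers"]


def layer_gap(small: dict, large: dict) -> int:
    """How many more hidden layers `large` has than `small`."""
    return text_layer_count(large) - text_layer_count(small)


# Values as quoted in the thread, typed in by hand (not loaded from disk).
llama_31_8b = {"num_hidden_layers": 32}
llama_32_11b = {"text_config": {"num_hidden_layers": 40}}

print(layer_gap(llama_31_8b, llama_32_11b))  # 8
```

A gap of zero would be consistent with an unchanged text stack; a non-zero gap only says the stacks differ in depth, not what the extra layers do.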


CheatCodesOfLife@reddit

> Deepseek v3-0324 and Mistral Small 3.2 work almost exactly, often word-to word same for textgen; check lmarena if you do not believe

I believe you, for simple prompts, but if you use the model for real tasks, DS is nothing like MS. I used the 90b image model for about a week in place of the 70b and regenerated the last reply in some of the 70b's chats to test it (this was ages ago).

> OTOH in my experiment on build.nvidia.com show that 11b is far more unhinged in the output than 3.1 8b

Okay, admittedly I haven't used the 11b or 8b, so I'll take your word for it.

> Anyway here config.json for 3.1 8B:

Okay, you got me with this one! You're right: 32 layers vs 40 for the 11b, 80 vs 100 for the 90b.

> Feel free to explain how a model with 40 layers is same with one with 32 layers

The only explanation is that you're right; Meta must have used a larger text model for the vision models! Now I want to strip out the vision weights from the base model and see how it takes to fine-tuning...


AppearanceHeavy6724@reddit

> Only explanation is you're right, Meta must have used a larger text model for the vision models!

I know, every time I bring it up, I get downvoted into oblivion. 3.2 and 3.1 are different models.


CheatCodesOfLife@reddit

> I get downvoted into oblivion

Probably why I haven't seen this mentioned before :s


DepthHour1669@reddit

https://huggingface.co/blog/llama32

> The architecture of these models is based on the combination of Llama 3.1 LLMs combined with a vision tower and an image adapter. The text models used are Llama 3.1 8B for the Llama 3.2 11B Vision model, and Llama 3.1 70B for the 3.2 90B Vision model. To the best of our understanding, the text models were frozen during the training of the vision models to preserve text-only performance.

This seems like a dumb thing to argue about. It’d be very easy to use Captum to look at both models and instantly tell if the text weights were frozen or not. I don’t have time today because I’m about to head out to a BBQ, but you can show proof of your statement if you have time. Otherwise I’ll pull it up tomorrow and compare them.
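One way to settle the frozen-or-not question is a direct tensor-by-tensor comparison of the two checkpoints. The sketch below does not use Captum; it swaps in a plain equality check over toy data. The tensor names and values are hypothetical, plain lists of floats stand in for real state dicts, and matching tensors by name is itself an assumption, since layer indices may be shifted between two models of different depth.

```python
import math


def compare_shared_weights(base: dict, candidate: dict, tol: float = 0.0):
    """Classify tensors by whether they survived unchanged.

    `base` and `candidate` map tensor names to flat lists of floats.
    Returns (unchanged, changed, only_in_candidate), each a sorted list of names.
    """
    unchanged, changed = [], []
    for name in sorted(base.keys() & candidate.keys()):
        a, b = base[name], candidate[name]
        same = len(a) == len(b) and all(
            math.isclose(x, y, rel_tol=0.0, abs_tol=tol) for x, y in zip(a, b)
        )
        (unchanged if same else changed).append(name)
    only_in_candidate = sorted(candidate.keys() - base.keys())
    return unchanged, changed, only_in_candidate


# Toy stand-ins with hypothetical names; real checkpoints would be loaded from disk.
base = {"layers.0.mlp": [0.1, -0.2], "layers.1.mlp": [0.3, 0.4]}
candidate = {
    "layers.0.mlp": [0.1, -0.2],   # identical, so consistent with frozen
    "layers.1.mlp": [0.3, 0.5],    # differs, so it was trained or replaced
    "extra_layer.0.mlp": [0.0, 0.1],  # has no counterpart in the base model
}

unchanged, changed, extra = compare_shared_weights(base, candidate)
print(unchanged, changed, extra)
```

If every shared tensor lands in the first list and the third list is non-empty, both sides of the argument could be partly right: frozen original layers plus newly added ones.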


AppearanceHeavy6724@reddit

The number of hidden layers is different though: 32 in the 8b and 40 in the 11b. The original layers might well be frozen, but the extra layers are not. And those extra ones are not "vision layers"; they are normal FFN ones.


butsicle@reddit

Bring some BBQ back for the rest of us please


lompocus@reddit

Why are people saying it is bad? It is the first vision model that can actually give me good answers.


AppearanceHeavy6724@reddit

It might be a good vision model, but it is not a good model in the general sense of the word.


lompocus@reddit

This is true of all vision models compared to their non-vision models in the same family of models.


AppearanceHeavy6724@reddit

Not true for Mistral 2503 vs 2501. Also Qwen 2.5 vl 32b was to my taste better than normal qwen 2.5, and Pixtral Large is not worse than Mistral Large at all. I do not think what you said is true.


AppearanceHeavy6724@reddit

Did you try it? It is shit. Utter crap.


Beneficial-Good660@reddit

Before taking this fuckwit's words seriously and liking them, you should understand that he doesn't know how LLMs or VL models work — he's testing on "creative writing." I tested it on an infographic: the model identified all the words and objects, expanded the meaning, and provided a detailed plan with examples of different tools. It's actually decent and convenient since, in its thinking, it combined everything it found in the image.


AppearanceHeavy6724@reddit

If you are referring to me as "fuckwit", then look in the mirror, fuckwit. As a vision model it might or might not be good; I did not test the vision. But if you, moron, look at the linked infographic, it shows it as an excellent coder, when in fact it is not: it makes trivial errors in generated code that Qwen 3 8b, or even Llama 3.1 8b, does not make.


Beneficial-Good660@reddit

See, you're the complete idiot who writes all sorts of nonsense and doesn’t understand anything. "I didn't test the vision" — that’s a VL model. The infographic I tested wasn’t these images here — take any infographic from Pinterest for tasks like that. You're extremely stupid and keep spreading lies constantly.


AppearanceHeavy6724@reddit

> The infographic I tested wasn’t these images here — take any infographic from Pinterest for tasks like that. You're extremely stupid and keep spreading lies constantly.

Mofo, what are you talking about? The OP linked an infographic that shows this model is better than 4o at coding.

> "I didn't test the vision" — that’s a VL model.

So is GPT-4o, which the infographic references. So you are saying this 9b POS is better at coding than 4o. LMAO. I bet you have no fucking idea how to code, and therefore cannot test performance yourself.


Beneficial-Good660@reddit

It doesn't get through to you at all — here's a quote from the official card: *"designed to explore the upper limits of reasoning in vision-language models"*. All tests come from understanding images. Can you even grasp that or not? You can read up on what models are used for — both regular text ones and VL. Study a bit, maybe your stupidity comes precisely from the fact that you don’t know anything. And then, maybe, you’ll start writing slightly smarter comments. In all vision models, MMLU and other text tasks drop significantly. So when vision integration is added, they’ll still be able to maintain their quality — *that* will be fire. Even with GPT-4o, it's not even sure if it's a single model — probably just OCR attached. And when reasoning comes from images, the coding performance will drop too.


AppearanceHeavy6724@reddit

So how have you been fuckwit?


AppearanceHeavy6724@reddit

Stupid ass mofo, Qwen2.5-32b-VL is a fantastic coder and a good storyteller and also a decent vision model. It outperforms this fucking thing on every possible metric, contrary to what THUDM advertises. Pixtral is great both at vision tasks and as a general-purpose model. Besides, almost all new models are VL today: Gemma 3 27b, Mistral Small, Pixtral. All are decent to very decent at vision yet fantastic at non-vision tasks. That particular 9b model in fact has exactly the same vibe and the same error modes as the original non-vision GLM-4-9b-0414 it is based on. Their diagram is meaningless and they are lying.


AmazinglyObliviouse@reddit

But wait— but wait—but wait—but wait—but wait—but wait—but wait—but wait—but wait— (incorrect answer)


Commercial-Celery769@reddit

It might be a while until we get good small models


rymn@reddit

Is it just me, or does it seem like more of these smaller budget models are just trained for the tests, and not for real-world use?


Kooshi_Govno@reddit

Yeah, cus that's the only thing you can do for small models to make them look good. Granted they're benchmaxxing for big models now too. We need some benches just for the LocalLlama community.


Lazy-Pattern-5171@reddit

I got downvoted into oblivion when I said it, and now yours is the top comment. SMH 🤦‍♂️


llmentry@reddit

Well, I hope their model didn't produce their misleading charts. (Inconsistent axis values on the baseline comparison, truncating the Y axis to start at 30 for the RL gains to create a false impression of performance increases ... I would not trust that model for anything STEM-related.)
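To see why a truncated Y axis matters, here is a small arithmetic sketch. The two scores are made up for illustration and are not taken from the model card; only the axis start of 30 comes from the comment above, and the function name is hypothetical.

```python
def drawn_height_ratio(before: float, after: float, axis_start: float = 0.0) -> float:
    """Ratio of the two bar heights as drawn when the Y axis starts at `axis_start`."""
    if before <= axis_start:
        raise ValueError("the 'before' value must lie above the axis start")
    return (after - axis_start) / (before - axis_start)


# Hypothetical scores, chosen only to make the arithmetic easy to follow.
before, after = 60.0, 66.0

honest = drawn_height_ratio(before, after)                       # 66 / 60
truncated = drawn_height_ratio(before, after, axis_start=30.0)   # 36 / 30

print(f"axis from 0: {honest:.2f}x, axis from 30: {truncated:.2f}x")
# axis from 0: 1.10x, axis from 30: 1.20x
```

The same 6-point gain is drawn as a 20% taller bar instead of a 10% taller one, so the visual improvement is doubled without any number changing.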


r4in311@reddit

Huge results, if true. So this 9B casually beats 4o in coding... amazing! But so far, we only see a lot of uncommon, weird benchmarks. What's Flame-VLM-Code? Where's HumanEval, MBPP, or SWE-bench?


DepthHour1669@reddit

FLAME is a vision benchmark


r4in311@reddit

Ok but why no coding benchmark when you essentially claim strong coding performance? That's not really vision related?


DepthHour1669@reddit

I made no claims.


poli-cya@reddit

https://en.wikipedia.org/wiki/Generic_you

The guy clearly wasn't talking about you in particular...


emprahsFury@reddit

If you're addressing someone directly (as in responding to them when they responded to you) then there is no generic you, and we can forgive the dude for not disambiguating perfectly.


poli-cya@reddit

Please provide any source saying that; it's absolutely not correct. I asked o3 to give an evaluation:

Issue | Who's right | Why
---|---|---
“You can’t use a generic you when replying directly.” | emprahsFury is off-base | English lets you use the generic you even in a direct reply. It can be confusing, but it isn’t grammatically outlawed.


pcdacks@reddit

I want to understand the reasons for the significant divergence, and see if it’s worth spending time to adapt it to llama.cpp.
