THUDM/GLM-4.1V-9B-Thinking looks impressive

Posted by ConfidentTrifle7247@reddit | LocalLLaMA | View on Reddit | 45 comments

Looking forward to the GGUF quants so I can give it a shot. Would love it if the awesome Unsloth team did their magic here, too. [https://huggingface.co/THUDM/GLM-4.1V-9B-Thinking](https://huggingface.co/THUDM/GLM-4.1V-9B-Thinking)

45 Comments

benxben13@reddit

For me it works well for OCR, better than OLM, Typhoon, or fluxOCR.
View on Reddit #61459685

noage@reddit

It does seem like the VL landscape has a lot of room for growth. Every time there's a benchmark for a VL model, it's like 'here's our tiny model compared to several 72B models.' You don't see that with normal LLMs.
View on Reddit #60728961

JuicedFuck@reddit

The VL space has been completely stagnant for >1 year in terms of image understanding. Models now have CoT + VL just so they can solve benchmarks like "Solve the equation on the blackboard", but the CoT does absolutely nothing to help them understand the complicated images that previous models struggled with. On my private test set, I have seen no improvement from any vision model except Google's Gemini 2.5 Pro.
View on Reddit #60748057

l33t-Mt@reddit

[https://moondream.ai/](https://moondream.ai/)
View on Reddit #60857319

JuicedFuck@reddit

Get your pitiful guerrilla shill campaign away from me.
View on Reddit #60904656

l33t-Mt@reddit

Moondream is a guerrilla shill campaign? How so? Best small local model available. Prove me wrong.
View on Reddit #60962507

JuicedFuck@reddit

Yeah, it has been for a while. https://preview.redd.it/hjqryhkrtsbf1.png?width=923&format=png&auto=webp&s=d4a31b6cd8cb10c573f43349077fa09e44869ba7
View on Reddit #61082438

l33t-Mt@reddit

Dude, the guy you posted is literally the PR guy for moondream. I'm just a user who had a recommendation. What's better at that size? How is it better? You listed Gemini.... LocalLLama.... So helpful.
View on Reddit #61150485

DepthHour1669@reddit

Vision is surprisingly tiny, to be fair. Llama 3.2 11b vision is just 3b more than the Llama 3.1 8b it was built off of.
View on Reddit #60729955

AppearanceHeavy6724@reddit

> the Llama 3.1 8b it was built off of

That is not true; it is a widespread misconception. The visual layers are less than 1b in size, and the text stack of 3.2 11b is bigger than Llama 3.1 8b.
View on Reddit #60730719

CheatCodesOfLife@reddit

> It is not true

It is true. There's even a Llama 3.2-90b with the text layers swapped with the Llama 3.3 70b model [Llama-3.3-90B-Vision-merged](https://huggingface.co/gghfez/Llama-3.3-90B-Vision-merged). And it worked exactly like Llama3.3-90b for textgen when I tried it.
View on Reddit #60744755

AppearanceHeavy6724@reddit

> And it worked exactly like Llama3.3-90b for textgen when I tried it.

Deepseek v3-0324 and Mistral Small 3.2 work almost exactly the same, often word-for-word for textgen; check lmarena if you do not believe me. Internally they are massively different, though. OTOH my experiments on build.nvidia.com show that 11b is far more unhinged in its output than 3.1 8b.

Anyway, here is config.json for 3.1 8B: "num_hidden_layers": 32, and config.json for 11b: "num_hidden_layers": 40.

Feel free to explain how a model with 40 layers is the same as one with 32 layers, and also feel free to test on build.nvidia.com with T=0 and the other sampler settings set the same for both models, with the prompt of your choice.
View on Reddit #60753963
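
The layer-count comparison in the comment above can be sketched in a few lines. This is a minimal illustration, not a statement about how the real files are laid out: only the two `num_hidden_layers` values (32 and 40) come from the thread, while the helper names and the nested `text_config` key are assumptions made for the example.

```python
import json

# The two layer counts are the ones quoted in the thread.
CONFIG_8B = json.loads('{"num_hidden_layers": 32}')
# Assumption: the vision checkpoint nests its language-model settings
# under a "text_config" key; adjust if the real file is laid out differently.
CONFIG_11B = json.loads('{"text_config": {"num_hidden_layers": 40}}')


def hidden_layers(config: dict) -> int:
    """Return num_hidden_layers from a top-level or nested text config."""
    if "num_hidden_layers" in config:
        return config["num_hidden_layers"]
    return config["text_config"]["num_hidden_layers"]


def extra_layers(base: dict, derived: dict) -> int:
    """Number of layers the derived model has beyond the base model."""
    return hidden_layers(derived) - hidden_layers(base)


if __name__ == "__main__":
    print(extra_layers(CONFIG_8B, CONFIG_11B))  # 8
```

A difference in layer count alone does not say what the extra layers do; it only shows that the two stacks cannot be weight-for-weight identical.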

CheatCodesOfLife@reddit

> Deepseek v3-0324 and Mistral Small 3.2 work almost exactly the same, often word-for-word for textgen; check lmarena if you do not believe me

I believe you, for simple prompts, but if you use the model for real tasks, DS is nothing like MS. I used the 90b image model for about a week in place of the 70b and regenerated the last reply in some of the 70b's chats to test it (this was ages ago).

> OTOH my experiments on build.nvidia.com show that 11b is far more unhinged in its output than 3.1 8b

Okay, admittedly I haven't used the 11b or 8b, so I'll take your word for it.

> Anyway, here is config.json for 3.1 8B:

Okay, you got me with this one! You're right, 32 layers vs 40 for the 11b, 80 vs 100 for the 90b.

> Feel free to explain how a model with 40 layers is the same as one with 32 layers

Only explanation is you're right, Meta must have used a larger text model for the vision models! Now I want to strip out the vision weights from the base model and see how it takes to fine tuning...
View on Reddit #60757592

AppearanceHeavy6724@reddit

> Only explanation is you're right, Meta must have used a larger text model for the vision models!

I know. Every time I bring it up, I get downvoted into oblivion. 3.2 and 3.1 are different models.
View on Reddit #60757774

CheatCodesOfLife@reddit

> I get downvoted into oblivion

Probably why I haven't seen this mentioned before :s
View on Reddit #60758119

DepthHour1669@reddit

https://huggingface.co/blog/llama32

> The architecture of these models is based on the combination of Llama 3.1 LLMs combined with a vision tower and an image adapter. The text models used are Llama 3.1 8B for the Llama 3.2 11B Vision model, and Llama 3.1 70B for the 3.2 90B Vision model. To the best of our understanding, the text models were frozen during the training of the vision models to preserve text-only performance.

This seems like a dumb thing to argue about. It’d be very easy to use Captum to look at both models and instantly tell if the text weights were frozen or not. I don’t have time today because I’m about to head out to a BBQ, but you can show proof of your statement if you have time. Otherwise I’ll pull it up tomorrow and compare them.
View on Reddit #60731469
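
One direct way to check the "frozen text weights" claim from the comment above is to compare the two checkpoints tensor by tensor. The sketch below is a hedged illustration that uses plain Python lists in place of real tensors; the function name, the tensor names, and the toy values are all hypothetical, and a real check would first load both checkpoints and map matching parameter names onto each other.

```python
from math import isclose


def compare_weights(base: dict, derived: dict, tol: float = 0.0) -> dict:
    """Sort the derived model's tensors into unchanged, changed, and added."""
    report = {"unchanged": [], "changed": [], "added": []}
    for name, values in derived.items():
        if name not in base:
            # Present only in the derived model, so it cannot have been frozen.
            report["added"].append(name)
        elif len(values) == len(base[name]) and all(
            isclose(a, b, rel_tol=0.0, abs_tol=tol)
            for a, b in zip(base[name], values)
        ):
            report["unchanged"].append(name)
        else:
            report["changed"].append(name)
    return report


# Toy stand-ins for two checkpoints (hypothetical names and values).
BASE = {"layers.0.weight": [0.1, 0.2], "layers.1.weight": [0.3, 0.4]}
DERIVED = {
    "layers.0.weight": [0.1, 0.2],  # identical, consistent with frozen
    "layers.1.weight": [0.3, 0.5],  # differs, so it was updated
    "layers.2.weight": [0.9, 0.9],  # new tensor with no counterpart
}

if __name__ == "__main__":
    print(compare_weights(BASE, DERIVED))
```

If every shared tensor lands in `unchanged` and the only differences are in `added`, that would be consistent with a frozen base plus new layers.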

AppearanceHeavy6724@reddit

The number of hidden layers is different, though: 32 in 8b and 40 in 11b. The original layers might well be frozen, but the extra layers are not. And those extra ones are not "vision layers"; they are normal FFN ones.
View on Reddit #60754067

butsicle@reddit

Bring some BBQ back for the rest of us please
View on Reddit #60740086

lompocus@reddit

Why are people saying it is bad? It is the first vision model that can actually give me good answers.
View on Reddit #60736614

AppearanceHeavy6724@reddit

It might be a good vision model, but it is not a good model in the general sense of the word.
View on Reddit #60754393

lompocus@reddit

This is true of all vision models compared to their non-vision models in the same family of models.
View on Reddit #60773599

AppearanceHeavy6724@reddit

Not true for Mistral 2503 vs 2501. Also Qwen 2.5 vl 32b was to my taste better than normal qwen 2.5, and Pixtral Large is not worse than Mistral Large at all. I do not think what you said is true.
View on Reddit #60786866

AppearanceHeavy6724@reddit

Did you try it? It is shit. Utter crap.
View on Reddit #60729262

Beneficial-Good660@reddit

Before taking this fuckwit's words seriously and liking them, you should understand that he doesn't know how LLMs or VL models work — he's testing on "creative writing." I tested it on an infographic: the model identified all the words and objects, expanded the meaning, and provided a detailed plan with examples of different tools. It's actually decent and convenient since, in its thinking, it combined everything it found in the image.
View on Reddit #60753517

AppearanceHeavy6724@reddit

If you are referring to me as "fuckwit", then look in the mirror, fuckwit. As a vision model it might or might not be good; I did not test the vision. But if you, moron, look at the linked infographic, it shows it as an excellent coder, when in fact it is not: it makes trivial errors in generated code that Qwen 3 8b, or even Llama 3.1 8b, does not make.
View on Reddit #60754227

Beneficial-Good660@reddit

See, you're the complete idiot who writes all sorts of nonsense and doesn’t understand anything. "I didn't test the vision" — that’s a VL model. The infographic I tested wasn’t these images here — take any infographic from Pinterest for tasks like that. You're extremely stupid and keep spreading lies constantly.
View on Reddit #60754648

AppearanceHeavy6724@reddit

> The infographic I tested wasn’t these images here; take any infographic from Pinterest for tasks like that. You're extremely stupid and keep spreading lies constantly.

Mofo, what are you talking about? The OP linked infographics that show this model is better than 4o at coding.

> "I didn't test the vision": that’s a VL model.

So is the GPT-4o they reference in the infographic. So you are saying this 9b POS is better at coding than 4o. LMAO. I bet you have no fucking idea how to code, and therefore cannot test performance yourself.
View on Reddit #60754946

Beneficial-Good660@reddit

It doesn't get through to you at all — here's a quote from the official card: *"designed to explore the upper limits of reasoning in vision-language models"*. All tests come from understanding images. Can you even grasp that or not? You can read up on what models are used for — both regular text ones and VL. Study a bit, maybe your stupidity comes precisely from the fact that you don’t know anything. And then, maybe, you’ll start writing slightly smarter comments. In all vision models, MMLU and other text tasks drop significantly. So when vision integration is added, they’ll still be able to maintain their quality — *that* will be fire. Even with GPT-4o, it's not even sure if it's a single model — probably just OCR attached. And when reasoning comes from images, the coding performance will drop too.
View on Reddit #60755828

AppearanceHeavy6724@reddit

So how have you been fuckwit?
View on Reddit #60767895

AppearanceHeavy6724@reddit

Stupid ass mofo, Qwen2.5-32b-VL is a fantastic coder and a good storyteller, and also a decent vision model. It outperforms this fucking thing on every possible metric, contrary to what THUDM advertises. Pixtral is great both at vision tasks and as a general-purpose model. Besides, almost all new models are VL today: Gemma 3 27b, Mistral Small, Pixtral. All are decent to very decent at vision yet fantastic at non-vision tasks. That particular 9b model in fact has exactly the same vibe and the same error modes as the original non-vision GLM-4-9b-0414 it is based on. Their diagram is meaningless and they are lying.
View on Reddit #60756976

AmazinglyObliviouse@reddit

But wait— but wait—but wait—but wait—but wait—but wait—but wait—but wait—but wait— (incorrect answer)
View on Reddit #60761524

Commercial-Celery769@reddit

It might be a while until we get good small models
View on Reddit #60746210

rymn@reddit

Is it just me, or does it seem like more of these smaller budget models are just trained for the tests, and not for real-world use?
View on Reddit #60745014

Kooshi_Govno@reddit

Yeah, cus that's the only thing you can do for small models to make them look good. Granted they're benchmaxxing for big models now too. We need some benches just for the LocalLlama community.
View on Reddit #60745401

Lazy-Pattern-5171@reddit

I got downvoted into oblivion when I said it, and now yours is the top comment. SMH 🤦‍♂️
View on Reddit #60743010

llmentry@reddit

Well, I hope their model didn't produce their misleading charts. (Inconsistent axis values on the baseline comparison, truncating the Y axis to start at 30 for the RL gains to create a false impression of performance increases ... I would not trust that model for anything STEM-related.)
View on Reddit #60751203
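
The effect of a truncated Y axis, as criticized in the comment above, is easy to quantify. The snippet below is a small sketch with made-up scores (60 and 66 are hypothetical, not values from the model card); only the axis start of 30 comes from the comment. It compares the true ratio of two scores with the ratio of the bar heights a reader actually sees.

```python
def apparent_ratio(before: float, after: float, axis_start: float = 0.0) -> float:
    """Ratio of drawn bar heights when the Y axis starts at axis_start."""
    if before <= axis_start:
        raise ValueError("before must be above the axis start")
    return (after - axis_start) / (before - axis_start)


if __name__ == "__main__":
    before, after = 60.0, 66.0  # hypothetical benchmark scores
    honest = apparent_ratio(before, after)           # axis starts at 0
    truncated = apparent_ratio(before, after, 30.0)  # axis starts at 30
    print(f"true gain: {honest - 1:.0%}, drawn gain: {truncated - 1:.0%}")
```

With these example numbers, a 10% improvement is drawn as a bar that is 20% taller, which is the kind of false impression the comment describes.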

r4in311@reddit

Huge results, if true. So this 9B casually beats 4o in coding... amazing! But so far, we only see a lot of uncommon, weird benchmarks. What's Flame-VLM-Code? Where's HumanEval, MBPP or SWE-bench?
View on Reddit #60729161

DepthHour1669@reddit

FLAME is a vision benchmark
View on Reddit #60730031

r4in311@reddit

Ok but why no coding benchmark when you essentially claim strong coding performance? That's not really vision related?
View on Reddit #60731834

DepthHour1669@reddit

I made no claims.
View on Reddit #60733257

poli-cya@reddit

https://en.wikipedia.org/wiki/Generic_you

The guy clearly wasn't talking about you in particular...
View on Reddit #60740013

emprahsFury@reddit

If you're addressing someone directly (as in responding to them when they responded to you) then there is no generic you, and we can forgive the dude for not disambiguating perfectly.
View on Reddit #60745194

poli-cya@reddit

Please provide any source saying that, it's absolutely not correct. I asked o3 to give an evaluation:

Issue | Who's right | Why
---|---|---
“You can’t use a generic you when replying directly.” | emprahsFury is off-base | English lets you use the generic you even in a direct reply. It can be confusing, but it isn’t grammatically outlawed.
View on Reddit #60745710

pcdacks@reddit

I want to understand the reasons for the significant divergence, and see if it’s worth spending time to adapt it to llama.cpp.
View on Reddit #60743685

You_Wen_AzzHu@reddit

Please at least give it a spin before posting like this.
View on Reddit #60729692