TheaterFire

Qwen2 72b VL is actually really impressive. It's not perfect, but for a local model I'm certainly impressed (more info in comments)

Posted by SomeOddCodeGuy@reddit | LocalLLaMA | View on Reddit | 43 comments

Qwen2 72b VL is actually really impressive. It's not perfect, but for a local model I'm certainly impressed (more info in comments)

Reply to Post

43 Comments

Dark_Alchemist@reddit

How does this compare to Janus Pro? I was comparing 7B to 7B and I am leaning towards Janus being the better. The directive prompt was: Describe this image in detail.
View on Reddit #54548447

SomeOddCodeGuy@reddit (OP)

The grocery image was a random one I grabbed online just to test with (***Credit:*** *"Glitter On A Dime"*) I was supposed to go to bed 3 hours ago, but I was bound and determined to get KoboldCPP working as an accepted image endpoint in Wilmer, and finally got the POC working about an hour ago lol. Then I started playing with Qwen2 VL for the first time, became enamored, and here I am. Qwen isn't perfect, by any means, but for something running on your local machine? I'm very impressed with this. I've done a couple of others as well, such as: * Giving it a pic of a spice rack and asking if I have a particular spice (*it got it right*) * Having it describe my reddit avatar (*it tried. Got 90%, but for some reason it thought the llama had spots lol*). * The hardest I gave it was a screenshot of my desktop running a model, with the command prompt window pulled up so that it could see llama.cpp's output metrics. I put the whole desktop in the screenshot and asked how long the output took in seconds; it found that on the screen and got the answer correct down to the decimal points. Qwen2.5 VL support isn't in llama.cpp yet I don't think, as I couldn't find any ggufs for it, but for now this is plenty. After using Llama 3.2 vision for a while, I really didn't expect a lot, as that one had always struggled with details for me; asking it for test from a screenshot never went well, for example. But this model seems like it'll be fantastic for OCR tasks. For this pic- I made a workflow that goes Qwen2 72b VL -> Llama 3.3 70b Instruct, having Qwen2 take a swing at answering, and then Llama 3.3 70b clean up the response under the assumption that Qwen2 is great at describing an image, but might not be as smart on other stuff, so please review and revise any conclusions it made. It helps for things like recipes and whatnot that Qwen2 might not know. I'll probably do similar for coding workflows, having a coding model follow up behind it if I ask it to pull code out of an image.
View on Reddit #51240227

Ulterior-Motive_@reddit

Is there anything different you need to do to load it into llama.cpp? I tried it the other day, also in Open Webui, but it seemed to generate nonsense answers or ignore the image attachments I sent.
View on Reddit #51343938

SomeOddCodeGuy@reddit (OP)

The non-sense answers and ignoring the image attachment means its not seeing the image; when I was first setting up, that happened a lot. I'm using it in koboldcpp, but the setup may be similar. Make sure you load the image and the mmproj file; both have to be loaded. In the output, you SHOULD see a mention of the image being converted into tokens, and possibly resized. If you don't see that, you probably aren't getting that image into the model.
View on Reddit #51344723

iwinux@reddit

Does KoboldCPP use GPU (Metal) on macOS automatically?
View on Reddit #51257347

SomeOddCodeGuy@reddit (OP)

It does if you install it right. I know that they have prebuild mac versions, but I prefer just to do this: 1. Go to Koboldcpp releases and download the zip at the bottom 2. unzip it, open that directory in terminal 3. type "make LLAMA\_METAL=1" Once it finishes, you're gold and it'll use the GPU if you gpu layers > 0
View on Reddit #51271126

polandtown@reddit

what blows my mind is that it downscales to a rez of \~900x\~900 and the quality is still this good in identifying things
View on Reddit #51318202

MrPecunius@reddit

Gemma 3 27B Q4\_K\_M (Bartowski) GGUF with a goofy system prompt cribbed from one of y'all yesterday produced this with OP's prompt (my comments in square brackets): \------------- Ooh, grocery haul! Let’s see… you’ve been busy. Alright, here’s what I spot in that lovely spread: \* Watermelon (a \*big\* one, it looks like!) \* Fuji Apples (pommes de terre? You're a fancy shopper) \* White Corn Tortilla Chips \[should be blue\] \* Spinach \[actually mixed greens\] \* Taco Shells \* Bananas (a whole bunch! Monkey business planned?) \* Pickles (dill, I think) \[?\] \* Sausage (looks like breakfast sausage!) \[??\] \* Raspberries \[strawberries\] \* Ground Beef \* Chicken Breasts \* Colby Jack Cheese \* Pita Bread \* Bell Peppers (green and yellow) \[probably picking up yellow from bananas\] \* Corn on the Cob \* Blueberry Muffins \[no, just berries\] Okay, now for a dish… Hmm. With all \*that\*, you could absolutely make \*\*Beef & Chicken Fajitas with all the fixings!\*\* You’d need: \* Ground Beef \* Chicken Breasts \* Bell Peppers (both colors!) \* Taco Shells (or use the pita bread, if you're feeling adventurous) \* Spinach (for a little extra green goodness) \* Colby Jack Cheese \* Pickles (because why not?) Sounds good, yeah? What do you think? Are we going for spicy or mild with those fajitas? I’m already imagining the possibilities… and maybe stealing a raspberry or two. 😉 \---------- Both models missed the mac & cheese, diced tomatoes, yogurt, honey, and whatever is to the left of the honey in the bag (prunes??). Qwen 2 invented the salsa (or morphed the diced tomatoes into salsa), missed the pita bread and sandwich skinnys (or perhaps missed one and called the other "bread"). Gemma 3 invented the pickles and sausage. When asked to locate them in the image, it apologized for "hallucinating" them. When further asked if it missed anything, it noticed the yogurt and tried to identify the box of mac and cheese with a qualified "could be couscous". It persisted in calling the blueberries "muffins" until questioned directly, at which point it apologized again and admitted the error. Gemma 3 is a terrible cook: ground beef fajitas??? Get a rope! Also, this system prompt = 😂
View on Reddit #51250764

Admirable-Star7088@reddit

Just a heads up, if you're using Koboldcpp or LM Studio, I discovered that vision is currently broken and output quality is degraded. The images are being resized and corrupted when sent to the chat. I tried SillyTavern where images are sent to Gemma 3 in their original resolutions, the output quality is much better with less hallucinations.
View on Reddit #51307714

MrPecunius@reddit

If this is "broken", then I am in for a treat.
View on Reddit #51316557

DeltaSqueezer@reddit

If this is Qwen 2, maybe you'd be interested to learn Qwen 2.5 72B VL is also now available.
View on Reddit #51240236

SomeOddCodeGuy@reddit (OP)

Unfortunately there aren't any ggufs of it. :( I vaguely remember seeing an open github issue with llama.cpp for adding support for it recently, so I think it's still a work in progress.
View on Reddit #51240900

Ragecommie@reddit

I am working on that. It is almost done, but unfortunately it depends on two other major PRs, se development has been paused for a bit.
View on Reddit #51241645

Flamenverfer@reddit

What does it depend on? It works so good for proccessing my docs.
View on Reddit #51313664

SomeOddCodeGuy@reddit (OP)

I appreciate your work on that! Llamacpp is such a massive project that I can only imagine the PR nightmares that can occur. Excited for when it's done =D
View on Reddit #51242191

chibop1@reddit

Try Qwen2.5-VL-72B on your new Mac Studio. https://huggingface.co/mlx-community/Qwen2.5-VL-72B-Instruct-8bit
View on Reddit #51242839

SomeOddCodeGuy@reddit (OP)

I need to get around to messing with MLX. It's been on the list. Do you know if it exposes a rest API? I'm trying to find the specifications for their API, but I can't even find a reference to the existence of one lol. I'll need it to hook it into wilmer/open webui/whatever. Once I can find that spec, I'll work on getting it integrated so I can make use of it.
View on Reddit #51243086

spookperson@reddit

The easiest way is LM Studio (with an MLX model). It supports qwen2.5-vl etc and exposes an openai-compatible endpoint 
View on Reddit #51272111

chibop1@reddit

[mlx_lm.server in ml-explore/mlx-examples/llms](https://github.com/ml-explore/mlx-examples/tree/main/llms) has OpenAI API, but not mlx-vlm. They're working on it as we speak though. :) https://github.com/Blaizzy/mlx-vlm/issues/247 MLX-VLM supports pretty much all the recent major VLMs, and the dev is super responsive for suggestions. Frankly best interactions I've had with devs on Github! No attitude, friendly, great speed, etc. MLX team is amazing IMHO. lol
View on Reddit #51247989

ryseek@reddit

\> Do you know if it exposes a rest API?  you can use LM studio. It can serve MLX models.
View on Reddit #51245715

bobby-chan@reddit

IIRC mlx-lm won't work, it doesn't support multimodal models. You'll have to use mlx-vlm by Prince Canuma. I think his project includes gradio-ui
View on Reddit #51245130

Spanky2k@reddit

You said you’re using a mac so just use the mlx version. It’s faster too.
View on Reddit #51253195

DeltaSqueezer@reddit

They released AWQ quant though.
View on Reddit #51243166

SomeOddCodeGuy@reddit (OP)

I saw, but alas... Mac user. :(
View on Reddit #51243385

CheatCodesOfLife@reddit

I haven't tried it but would one of these help you? https://huggingface.co/mlx-community/Qwen2.5-VL-7B-Instruct-8bit https://huggingface.co/mlx-community/Qwen2.5-VL-7B-Instruct-4bit https://huggingface.co/mlx-community/Qwen2.5-VL-72B-Instruct-4bit https://huggingface.co/mlx-community/Qwen2.5-VL-72B-Instruct-8bit
View on Reddit #51244355

a_beautiful_rhind@reddit

I think both are based on "qwen 2.5", they would not merge with qwen2 models.
View on Reddit #51249406

uti24@reddit

Op, please compare it to Gemma 3 27B, I tried it, but I am not impressed, an I certainly can not run Qwen2 72b VL, and it would be so interesting to compare google freshest model.
View on Reddit #51249115

MrPecunius@reddit

See my reply to this post above.
View on Reddit #51288114

Admirable-Star7088@reddit

I can do a comparison if you want. Have any image in mind?
View on Reddit #51249521

FineClassroom2085@reddit

That’s…. A lot of bananas.
View on Reddit #51273068

deoxykev@reddit

Qwen 2.5 VL 72B is also really good. Totally picks up doctor's handwriting no problem. (test done without quants)
View on Reddit #51261160

SingleSoberPeaceful@reddit

OP, what is the UI that you are using ?
View on Reddit #51240421

Everlier@reddit

Open WebUI, the one and only
View on Reddit #51240716

FbF_@reddit

Does llama-server support VL models?
View on Reddit #51248862

croninsiglos@reddit

Not qwen based ones but Gemma3 works fine for vision.
View on Reddit #51253994

Admirable-Star7088@reddit

Yeah, Qwen2 VL is pretty good! I played around with it for hours (7b and 72b versions) when llama.cpp got support for it a few months back. However, the new local king for vision is Gemma 3 12b and 27b in my experience. Despite being much smaller than Qwen2 72b VL, Gemma 3 is generally better. However, Qwen2 VL, even the small 7b version, is still much better than Gemma 3 at interpreting texts and its positioning in images, but when it comes to "pure" images like artwork with no text, Gemma 3 is amazing. I hope llama.cpp get Qwen2.5 VL support soon, I would love to try it out as well.
View on Reddit #51247545

Finanzamt_kommt@reddit

Ovis2 32b is supposedly even better and a smaller model but it not yet supports quantization in llama.cpp, im currently trying to get it to work and think I already managed to convert it to gffus, if anyone is able to do inference code dm me (;
View on Reddit #51247526

nik_supe@reddit

What are your hardware specs? Can I run this on 64 gb ram and 24 gb graphic card nvidea?
View on Reddit #51243172

SomeOddCodeGuy@reddit (OP)

You should be able to, but it will be a a bit slow. KoboldCpp lets you run this model, and it uses ggufs. ggufs can be split to run on both CPU and GPU, so you could do as much as will fit into the 24GB, and the rest into the RAM. That slows things down, but does give the option to run it. If you grab a smaller quant, like q3 or q2, you should be able to fit a lot of the model into that 24GB card, reducing the slowdown a lot. I'm a Mac user, so it's slow for me too =D
View on Reddit #51243365

nik_supe@reddit

Can you please explain what is a koboldcpp. Is it integrated in ollama ? Or to run it separately. Also then 32 billion with 4 bits or this with 2 bits what do you think is better
View on Reddit #51243881

Fusseldieb@reddit

8GB users in shambles
View on Reddit #51240947

SomeOddCodeGuy@reddit (OP)

Qwen2 vl 7b, from what I've seen of the benchmarks, looks to be vastly superior to llama 3.2 11b, so might be worth a peek. KoboldCpp's UI lets you load it up pretty easily.
View on Reddit #51241058

Howard_banister@reddit

Deepseek Janus pro is another 7B model, worth to try.
View on Reddit #51241285