Qwen2 72b VL is actually really impressive. It's not perfect, but for a local model I'm certainly impressed (more info in comments)

[-]

Dark_Alchemist@reddit

How does this compare to Janus Pro? I was comparing 7B to 7B and I am leaning towards Janus being the better. The directive prompt was: Describe this image in detail.

Reply

[-]

The grocery image was a random one I grabbed online just to test with (***Credit:*** *"Glitter On A Dime"*) I was supposed to go to bed 3 hours ago, but I was bound and determined to get KoboldCPP working as an accepted image endpoint in Wilmer, and finally got the POC working about an hour ago lol. Then I started playing with Qwen2 VL for the first time, became enamored, and here I am. Qwen isn't perfect, by any means, but for something running on your local machine? I'm very impressed with this. I've done a couple of others as well, such as: * Giving it a pic of a spice rack and asking if I have a particular spice (*it got it right*) * Having it describe my reddit avatar (*it tried. Got 90%, but for some reason it thought the llama had spots lol*). * The hardest I gave it was a screenshot of my desktop running a model, with the command prompt window pulled up so that it could see llama.cpp's output metrics. I put the whole desktop in the screenshot and asked how long the output took in seconds; it found that on the screen and got the answer correct down to the decimal points. Qwen2.5 VL support isn't in llama.cpp yet I don't think, as I couldn't find any ggufs for it, but for now this is plenty. After using Llama 3.2 vision for a while, I really didn't expect a lot, as that one had always struggled with details for me; asking it for test from a screenshot never went well, for example. But this model seems like it'll be fantastic for OCR tasks. For this pic- I made a workflow that goes Qwen2 72b VL -> Llama 3.3 70b Instruct, having Qwen2 take a swing at answering, and then Llama 3.3 70b clean up the response under the assumption that Qwen2 is great at describing an image, but might not be as smart on other stuff, so please review and revise any conclusions it made. It helps for things like recipes and whatnot that Qwen2 might not know. I'll probably do similar for coding workflows, having a coding model follow up behind it if I ask it to pull code out of an image.

Reply

[-]

Ulterior-Motive_@reddit

Is there anything different you need to do to load it into llama.cpp? I tried it the other day, also in Open Webui, but it seemed to generate nonsense answers or ignore the image attachments I sent.

Reply

[-]

SomeOddCodeGuy@reddit (OP)

The non-sense answers and ignoring the image attachment means its not seeing the image; when I was first setting up, that happened a lot. I'm using it in koboldcpp, but the setup may be similar. Make sure you load the image and the mmproj file; both have to be loaded. In the output, you SHOULD see a mention of the image being converted into tokens, and possibly resized. If you don't see that, you probably aren't getting that image into the model.

Reply

[-]

iwinux@reddit

Does KoboldCPP use GPU (Metal) on macOS automatically?

Reply

[-]

SomeOddCodeGuy@reddit (OP)

It does if you install it right. I know that they have prebuild mac versions, but I prefer just to do this: 1. Go to Koboldcpp releases and download the zip at the bottom 2. unzip it, open that directory in terminal 3. type "make LLAMA\_METAL=1" Once it finishes, you're gold and it'll use the GPU if you gpu layers > 0

Reply

[-]

polandtown@reddit

what blows my mind is that it downscales to a rez of \~900x\~900 and the quality is still this good in identifying things

Reply

[-]

MrPecunius@reddit

Gemma 3 27B Q4\_K\_M (Bartowski) GGUF with a goofy system prompt cribbed from one of y'all yesterday produced this with OP's prompt (my comments in square brackets): \------------- Ooh, grocery haul! Let’s see… you’ve been busy. Alright, here’s what I spot in that lovely spread: \* Watermelon (a \*big\* one, it looks like!) \* Fuji Apples (pommes de terre? You're a fancy shopper) \* White Corn Tortilla Chips \[should be blue\] \* Spinach \[actually mixed greens\] \* Taco Shells \* Bananas (a whole bunch! Monkey business planned?) \* Pickles (dill, I think) \[?\] \* Sausage (looks like breakfast sausage!) \[??\] \* Raspberries \[strawberries\] \* Ground Beef \* Chicken Breasts \* Colby Jack Cheese \* Pita Bread \* Bell Peppers (green and yellow) \[probably picking up yellow from bananas\] \* Corn on the Cob \* Blueberry Muffins \[no, just berries\] Okay, now for a dish… Hmm. With all \*that\*, you could absolutely make \*\*Beef & Chicken Fajitas with all the fixings!\*\* You’d need: \* Ground Beef \* Chicken Breasts \* Bell Peppers (both colors!) \* Taco Shells (or use the pita bread, if you're feeling adventurous) \* Spinach (for a little extra green goodness) \* Colby Jack Cheese \* Pickles (because why not?) Sounds good, yeah? What do you think? Are we going for spicy or mild with those fajitas? I’m already imagining the possibilities… and maybe stealing a raspberry or two. 😉 \---------- Both models missed the mac & cheese, diced tomatoes, yogurt, honey, and whatever is to the left of the honey in the bag (prunes??). Qwen 2 invented the salsa (or morphed the diced tomatoes into salsa), missed the pita bread and sandwich skinnys (or perhaps missed one and called the other "bread"). Gemma 3 invented the pickles and sausage. When asked to locate them in the image, it apologized for "hallucinating" them. When further asked if it missed anything, it noticed the yogurt and tried to identify the box of mac and cheese with a qualified "could be couscous". It persisted in calling the blueberries "muffins" until questioned directly, at which point it apologized again and admitted the error. Gemma 3 is a terrible cook: ground beef fajitas??? Get a rope! Also, this system prompt = 😂

Reply

[-]

Admirable-Star7088@reddit

Just a heads up, if you're using Koboldcpp or LM Studio, I discovered that vision is currently broken and output quality is degraded. The images are being resized and corrupted when sent to the chat. I tried SillyTavern where images are sent to Gemma 3 in their original resolutions, the output quality is much better with less hallucinations.

Reply

[-]

MrPecunius@reddit

If this is "broken", then I am in for a treat.

Reply

[-]

DeltaSqueezer@reddit

If this is Qwen 2, maybe you'd be interested to learn Qwen 2.5 72B VL is also now available.

Reply

[-]

SomeOddCodeGuy@reddit (OP)

Unfortunately there aren't any ggufs of it. :( I vaguely remember seeing an open github issue with llama.cpp for adding support for it recently, so I think it's still a work in progress.

Reply

[-]

Ragecommie@reddit

I am working on that. It is almost done, but unfortunately it depends on two other major PRs, se development has been paused for a bit.

Reply

[-]

Flamenverfer@reddit

What does it depend on? It works so good for proccessing my docs.

Reply

[-]

SomeOddCodeGuy@reddit (OP)

I appreciate your work on that! Llamacpp is such a massive project that I can only imagine the PR nightmares that can occur. Excited for when it's done =D

Reply

[-]

chibop1@reddit

Try Qwen2.5-VL-72B on your new Mac Studio. https://huggingface.co/mlx-community/Qwen2.5-VL-72B-Instruct-8bit

Reply

[-]

SomeOddCodeGuy@reddit (OP)

I need to get around to messing with MLX. It's been on the list. Do you know if it exposes a rest API? I'm trying to find the specifications for their API, but I can't even find a reference to the existence of one lol. I'll need it to hook it into wilmer/open webui/whatever. Once I can find that spec, I'll work on getting it integrated so I can make use of it.

Reply

[-]

spookperson@reddit

The easiest way is LM Studio (with an MLX model). It supports qwen2.5-vl etc and exposes an openai-compatible endpoint

Reply

[-]

chibop1@reddit

[mlx_lm.server in ml-explore/mlx-examples/llms](https://github.com/ml-explore/mlx-examples/tree/main/llms) has OpenAI API, but not mlx-vlm. They're working on it as we speak though. :) https://github.com/Blaizzy/mlx-vlm/issues/247 MLX-VLM supports pretty much all the recent major VLMs, and the dev is super responsive for suggestions. Frankly best interactions I've had with devs on Github! No attitude, friendly, great speed, etc. MLX team is amazing IMHO. lol

Reply

[-]

ryseek@reddit

\> Do you know if it exposes a rest API? you can use LM studio. It can serve MLX models.

Reply

[-]

bobby-chan@reddit

IIRC mlx-lm won't work, it doesn't support multimodal models. You'll have to use mlx-vlm by Prince Canuma. I think his project includes gradio-ui

Reply

[-]

Spanky2k@reddit

You said you’re using a mac so just use the mlx version. It’s faster too.

Reply

[-]

DeltaSqueezer@reddit

They released AWQ quant though.

Reply

[-]

SomeOddCodeGuy@reddit (OP)

I saw, but alas... Mac user. :(

Reply

[-]

CheatCodesOfLife@reddit

I haven't tried it but would one of these help you? https://huggingface.co/mlx-community/Qwen2.5-VL-7B-Instruct-8bit https://huggingface.co/mlx-community/Qwen2.5-VL-7B-Instruct-4bit https://huggingface.co/mlx-community/Qwen2.5-VL-72B-Instruct-4bit https://huggingface.co/mlx-community/Qwen2.5-VL-72B-Instruct-8bit

Reply

[-]

a_beautiful_rhind@reddit

I think both are based on "qwen 2.5", they would not merge with qwen2 models.

Reply

[-]

uti24@reddit

Op, please compare it to Gemma 3 27B, I tried it, but I am not impressed, an I certainly can not run Qwen2 72b VL, and it would be so interesting to compare google freshest model.

Reply

[-]

MrPecunius@reddit

See my reply to this post above.

Reply

[-]

Admirable-Star7088@reddit

I can do a comparison if you want. Have any image in mind?

Reply

[-]

FineClassroom2085@reddit

That’s…. A lot of bananas.

Reply

[-]

deoxykev@reddit

Qwen 2.5 VL 72B is also really good. Totally picks up doctor's handwriting no problem. (test done without quants)

Reply

[-]

SingleSoberPeaceful@reddit

OP, what is the UI that you are using ?

Reply

[-]

Everlier@reddit

Open WebUI, the one and only

Reply

[-]

FbF_@reddit

Does llama-server support VL models?

Reply

[-]

croninsiglos@reddit

Not qwen based ones but Gemma3 works fine for vision.

Reply

[-]

Admirable-Star7088@reddit

Yeah, Qwen2 VL is pretty good! I played around with it for hours (7b and 72b versions) when llama.cpp got support for it a few months back. However, the new local king for vision is Gemma 3 12b and 27b in my experience. Despite being much smaller than Qwen2 72b VL, Gemma 3 is generally better. However, Qwen2 VL, even the small 7b version, is still much better than Gemma 3 at interpreting texts and its positioning in images, but when it comes to "pure" images like artwork with no text, Gemma 3 is amazing. I hope llama.cpp get Qwen2.5 VL support soon, I would love to try it out as well.

Reply

[-]

Finanzamt_kommt@reddit

Ovis2 32b is supposedly even better and a smaller model but it not yet supports quantization in llama.cpp, im currently trying to get it to work and think I already managed to convert it to gffus, if anyone is able to do inference code dm me (;

Reply

[-]

nik_supe@reddit

What are your hardware specs? Can I run this on 64 gb ram and 24 gb graphic card nvidea?

Reply

[-]

SomeOddCodeGuy@reddit (OP)

You should be able to, but it will be a a bit slow. KoboldCpp lets you run this model, and it uses ggufs. ggufs can be split to run on both CPU and GPU, so you could do as much as will fit into the 24GB, and the rest into the RAM. That slows things down, but does give the option to run it. If you grab a smaller quant, like q3 or q2, you should be able to fit a lot of the model into that 24GB card, reducing the slowdown a lot. I'm a Mac user, so it's slow for me too =D

Reply

[-]

nik_supe@reddit

Can you please explain what is a koboldcpp. Is it integrated in ollama ? Or to run it separately. Also then 32 billion with 4 bits or this with 2 bits what do you think is better

Reply

[-]

Fusseldieb@reddit

8GB users in shambles

Reply

[-]

SomeOddCodeGuy@reddit (OP)

Qwen2 vl 7b, from what I've seen of the benchmarks, looks to be vastly superior to llama 3.2 11b, so might be worth a peek. KoboldCpp's UI lets you load it up pretty easily.

Reply to Post

43 Comments