How does this compare to Janus Pro? I was comparing 7B to 7B and I am leaning towards Janus being the better. The directive prompt was: Describe this image in detail.
The grocery image was a random one I grabbed online just to test with (***Credit:*** *"Glitter On A Dime"*)
I was supposed to go to bed 3 hours ago, but I was bound and determined to get KoboldCPP working as an accepted image endpoint in Wilmer, and finally got the POC working about an hour ago lol. Then I started playing with Qwen2 VL for the first time, became enamored, and here I am.
Qwen isn't perfect, by any means, but for something running on your local machine? I'm very impressed with this. I've done a couple of others as well, such as:
* Giving it a pic of a spice rack and asking if I have a particular spice (*it got it right*)
* Having it describe my reddit avatar (*it tried. Got 90%, but for some reason it thought the llama had spots lol*).
* The hardest I gave it was a screenshot of my desktop running a model, with the command prompt window pulled up so that it could see llama.cpp's output metrics. I put the whole desktop in the screenshot and asked how long the output took in seconds; it found that on the screen and got the answer correct down to the decimal points.
Qwen2.5 VL support isn't in llama.cpp yet I don't think, as I couldn't find any ggufs for it, but for now this is plenty. After using Llama 3.2 vision for a while, I really didn't expect a lot, as that one had always struggled with details for me; asking it for test from a screenshot never went well, for example. But this model seems like it'll be fantastic for OCR tasks.
For this pic- I made a workflow that goes Qwen2 72b VL -> Llama 3.3 70b Instruct, having Qwen2 take a swing at answering, and then Llama 3.3 70b clean up the response under the assumption that Qwen2 is great at describing an image, but might not be as smart on other stuff, so please review and revise any conclusions it made. It helps for things like recipes and whatnot that Qwen2 might not know. I'll probably do similar for coding workflows, having a coding model follow up behind it if I ask it to pull code out of an image.
Is there anything different you need to do to load it into llama.cpp? I tried it the other day, also in Open Webui, but it seemed to generate nonsense answers or ignore the image attachments I sent.
The non-sense answers and ignoring the image attachment means its not seeing the image; when I was first setting up, that happened a lot.
I'm using it in koboldcpp, but the setup may be similar. Make sure you load the image and the mmproj file; both have to be loaded. In the output, you SHOULD see a mention of the image being converted into tokens, and possibly resized. If you don't see that, you probably aren't getting that image into the model.
It does if you install it right. I know that they have prebuild mac versions, but I prefer just to do this:
1. Go to Koboldcpp releases and download the zip at the bottom
2. unzip it, open that directory in terminal
3. type "make LLAMA\_METAL=1"
Once it finishes, you're gold and it'll use the GPU if you gpu layers > 0
Gemma 3 27B Q4\_K\_M (Bartowski) GGUF with a goofy system prompt cribbed from one of y'all yesterday produced this with OP's prompt (my comments in square brackets):
\-------------
Ooh, grocery haul! Let’s see… you’ve been busy. Alright, here’s what I spot in that lovely spread:
\* Watermelon (a \*big\* one, it looks like!)
\* Fuji Apples (pommes de terre? You're a fancy shopper)
\* White Corn Tortilla Chips \[should be blue\]
\* Spinach \[actually mixed greens\]
\* Taco Shells
\* Bananas (a whole bunch! Monkey business planned?)
\* Pickles (dill, I think) \[?\]
\* Sausage (looks like breakfast sausage!) \[??\]
\* Raspberries \[strawberries\]
\* Ground Beef
\* Chicken Breasts
\* Colby Jack Cheese
\* Pita Bread
\* Bell Peppers (green and yellow) \[probably picking up yellow from bananas\]
\* Corn on the Cob
\* Blueberry Muffins \[no, just berries\]
Okay, now for a dish… Hmm. With all \*that\*, you could absolutely make \*\*Beef & Chicken Fajitas with all the fixings!\*\*
You’d need:
\* Ground Beef
\* Chicken Breasts
\* Bell Peppers (both colors!)
\* Taco Shells (or use the pita bread, if you're feeling adventurous)
\* Spinach (for a little extra green goodness)
\* Colby Jack Cheese
\* Pickles (because why not?)
Sounds good, yeah? What do you think? Are we going for spicy or mild with those fajitas? I’m already imagining the possibilities… and maybe stealing a raspberry or two. 😉
\----------
Both models missed the mac & cheese, diced tomatoes, yogurt, honey, and whatever is to the left of the honey in the bag (prunes??).
Qwen 2 invented the salsa (or morphed the diced tomatoes into salsa), missed the pita bread and sandwich skinnys (or perhaps missed one and called the other "bread").
Gemma 3 invented the pickles and sausage. When asked to locate them in the image, it apologized for "hallucinating" them. When further asked if it missed anything, it noticed the yogurt and tried to identify the box of mac and cheese with a qualified "could be couscous". It persisted in calling the blueberries "muffins" until questioned directly, at which point it apologized again and admitted the error.
Gemma 3 is a terrible cook: ground beef fajitas??? Get a rope! Also, this system prompt = 😂
Just a heads up, if you're using Koboldcpp or LM Studio, I discovered that vision is currently broken and output quality is degraded. The images are being resized and corrupted when sent to the chat.
I tried SillyTavern where images are sent to Gemma 3 in their original resolutions, the output quality is much better with less hallucinations.
Unfortunately there aren't any ggufs of it. :( I vaguely remember seeing an open github issue with llama.cpp for adding support for it recently, so I think it's still a work in progress.
I appreciate your work on that! Llamacpp is such a massive project that I can only imagine the PR nightmares that can occur.
Excited for when it's done =D
I need to get around to messing with MLX. It's been on the list.
Do you know if it exposes a rest API? I'm trying to find the specifications for their API, but I can't even find a reference to the existence of one lol. I'll need it to hook it into wilmer/open webui/whatever. Once I can find that spec, I'll work on getting it integrated so I can make use of it.
[mlx_lm.server in ml-explore/mlx-examples/llms](https://github.com/ml-explore/mlx-examples/tree/main/llms) has OpenAI API, but not mlx-vlm. They're working on it as we speak though. :)
https://github.com/Blaizzy/mlx-vlm/issues/247
MLX-VLM supports pretty much all the recent major VLMs, and the dev is super responsive for suggestions. Frankly best interactions I've had with devs on Github! No attitude, friendly, great speed, etc. MLX team is amazing IMHO. lol
I haven't tried it but would one of these help you?
https://huggingface.co/mlx-community/Qwen2.5-VL-7B-Instruct-8bit
https://huggingface.co/mlx-community/Qwen2.5-VL-7B-Instruct-4bit
https://huggingface.co/mlx-community/Qwen2.5-VL-72B-Instruct-4bit
https://huggingface.co/mlx-community/Qwen2.5-VL-72B-Instruct-8bit
Op, please compare it to Gemma 3 27B, I tried it, but I am not impressed, an I certainly can not run Qwen2 72b VL, and it would be so interesting to compare google freshest model.
Yeah, Qwen2 VL is pretty good! I played around with it for hours (7b and 72b versions) when llama.cpp got support for it a few months back. However, the new local king for vision is Gemma 3 12b and 27b in my experience. Despite being much smaller than Qwen2 72b VL, Gemma 3 is generally better.
However, Qwen2 VL, even the small 7b version, is still much better than Gemma 3 at interpreting texts and its positioning in images, but when it comes to "pure" images like artwork with no text, Gemma 3 is amazing.
I hope llama.cpp get Qwen2.5 VL support soon, I would love to try it out as well.
Ovis2 32b is supposedly even better and a smaller model but it not yet supports quantization in llama.cpp, im currently trying to get it to work and think I already managed to convert it to gffus, if anyone is able to do inference code dm me (;
You should be able to, but it will be a a bit slow. KoboldCpp lets you run this model, and it uses ggufs. ggufs can be split to run on both CPU and GPU, so you could do as much as will fit into the 24GB, and the rest into the RAM. That slows things down, but does give the option to run it.
If you grab a smaller quant, like q3 or q2, you should be able to fit a lot of the model into that 24GB card, reducing the slowdown a lot.
I'm a Mac user, so it's slow for me too =D
Can you please explain what is a koboldcpp. Is it integrated in ollama ? Or to run it separately. Also then 32 billion with 4 bits or this with 2 bits what do you think is better
Qwen2 vl 7b, from what I've seen of the benchmarks, looks to be vastly superior to llama 3.2 11b, so might be worth a peek. KoboldCpp's UI lets you load it up pretty easily.
43 Comments
Dark_Alchemist@reddit
SomeOddCodeGuy@reddit (OP)
Ulterior-Motive_@reddit
SomeOddCodeGuy@reddit (OP)
iwinux@reddit
SomeOddCodeGuy@reddit (OP)
polandtown@reddit
MrPecunius@reddit
Admirable-Star7088@reddit
MrPecunius@reddit
DeltaSqueezer@reddit
SomeOddCodeGuy@reddit (OP)
Ragecommie@reddit
Flamenverfer@reddit
SomeOddCodeGuy@reddit (OP)
chibop1@reddit
SomeOddCodeGuy@reddit (OP)
spookperson@reddit
chibop1@reddit
ryseek@reddit
bobby-chan@reddit
Spanky2k@reddit
DeltaSqueezer@reddit
SomeOddCodeGuy@reddit (OP)
CheatCodesOfLife@reddit
a_beautiful_rhind@reddit
uti24@reddit
MrPecunius@reddit
Admirable-Star7088@reddit
FineClassroom2085@reddit
deoxykev@reddit
SingleSoberPeaceful@reddit
Everlier@reddit
FbF_@reddit
croninsiglos@reddit
Admirable-Star7088@reddit
Finanzamt_kommt@reddit
nik_supe@reddit
SomeOddCodeGuy@reddit (OP)
nik_supe@reddit
Fusseldieb@reddit
SomeOddCodeGuy@reddit (OP)
Howard_banister@reddit