What happened to Llama 3.2 90b-vision?

[-]

Such_Advantage_6949@reddit

Not good enough compared to Qwen2-VL.

[-]

Arkonias@reddit

It's still there, supported in MLX so us Mac folks can run it locally. Llama.cpp seems to be allergic to vision models.

[-]

Ollama has llama3.2 support in the pre-release 0.4.0 version; currently only for the 11b size, but I believe they'll add 90b after the full release. So I think in the next few weeks there will be a no-effort solution to host llama3.2:90b locally, and then it'll get much more attention.

[-]

agntdrake@reddit

It'll be up soon (hopefully later tonight) to work w/ 0.4.0rc8 which just went live. In testing it's pretty good.

[-]

Accomplished_Bet_127@reddit

They are doing quite a lot of work already. If anyone, you for example, is willing to add support for vision models in llama.cpp, that would be good. Go ahead! It's not that they don't like it. It is an open project, and there was no one with the right skills to contribute.

[-]

gtek_engineer66@reddit

If I had time to learn the steps required to do so, I would definitely do it.

[-]

Accomplished_Bet_127@reddit

That is the point. No one is "allergic" to vision models. It is just that adding a feature to software under active development requires someone with the necessary skills and the time to spend on keeping up with the rest of llama.cpp.

[-]

gtek_engineer66@reddit

Calm down

[-]

Plabbi@reddit

Lol

[-]

shroddy@reddit

Afaik there were contributions for vision models, but they were not merged.

[-]

Accomplished_Bet_127@reddit

I would presume so. It must be a real problem to produce code that follows the project's guidelines, works efficiently, and doesn't conflict with existing and WIP functions. By now, the llama.cpp codebase should be quite big. Also, real geniuses are not always a good fit, as they might produce code that others cannot work with. It doesn't have to be someone who gets everything perfect on the first shot. Probably they will take someone who has the skills and the intention to work on the project for at least some time, to establish some work routines (in what order new features are added and how to test them) and create some documentation so more people can be brought onto the project. I make it sound hard, but I am really 'afraid' that this project is quite complicated by now. It would be fantastic if guidelines were made so that an AI could handle conflicts and checks in the project, so more functions can be added without dragging development time down.

[-]

emprahsFury@reddit

it's a company with a product, let's not go all Stallman on each other because they don't want to support multi-modal

[-]

unclemusclezTTV@reddit

people are sleeping on apple

[-]

llkj11@reddit

Prob because not everyone has a few thousand to spend on a Mac lol.

[-]

InertialLaunchSystem@reddit

It's actually cheaper than using an Nvidia GPU if you want to run large models because of the fact that Mac RAM is also VRAM.
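
To put rough numbers on this, here is a minimal sketch of weights-only memory at a few precisions. It assumes a nominal 90e9 parameter count (taken from the "90b" name) and ignores the KV cache, activations, and runtime overhead, so real requirements will be higher. The function name is made up for illustration.

```python
def weight_memory_gb(n_params: float, bits_per_param: float) -> float:
    """Approximate memory for model weights alone, in GB (1 GB = 1e9 bytes)."""
    return n_params * bits_per_param / 8 / 1e9


N_PARAMS = 90e9  # nominal size implied by the "90b" name (assumption)

# Weights only: KV cache, activations, and framework overhead are not counted.
for label, bits in [("fp16", 16), ("8-bit", 8), ("4-bit", 4)]:
    print(f"{label}: ~{weight_memory_gb(N_PARAMS, bits):.0f} GB")
```

Under these assumptions, even 4 bits per weight needs roughly 45 GB, which is more than a single 24 GB consumer GPU holds. That is where a large pool of unified memory starts to look attractive.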

[-]

Final-Rush759@reddit

I use Qwen2-VL-7B on Mac. I also used it with an Nvidia GPU + PyTorch. It took me a few hours to install all the libraries, because certain incompatible libraries would uninstall previously installed ones. They have to be installed in a certain order. It still gives incompatibility warnings, but it didn't kick out other libraries. Then it runs totally fine. But when the Mac MLX version showed up, it was super easy to install in LM Studio 0.3.5.

[-]

ab2377@reddit

how does it perform, and have you done ocr with it?

[-]

bieker@reddit

None of these vision models are good at pure OCR; what Qwen2-VL excels at is doc QA and JSON structured output.

[-]

Final-Rush759@reddit

The model performed very well. I input a screenshot of a math formula from a scientific paper and asked the vllm to write Python code for it.

[-]

robotphilanthropist@reddit

not a good enough model ;)

[-]

Lissanro@reddit

My own experience with it was pretty bad; they attempted to bake way too much censorship into it. It failed even basic tests some YouTubers threw at it, specifically due to degradation caused by overcensoring: [https://www.youtube.com/watch?v=lzDPQAjItOo](https://www.youtube.com/watch?v=lzDPQAjItOo). For vision tasks, Qwen2-VL 72B is better in my experience. I can run it locally using [https://github.com/matatonic/openedai-vision](https://github.com/matatonic/openedai-vision). It is not as VRAM efficient as TabbyAPI, so it requires four 24GB GPUs to run, and even that feels like a tight fit. And it is still not as good as text-only Qwen2.5 or Llama3.1, and loading the vision model takes a few minutes, then a few more minutes to get a reply and load back the normal text model, so currently large vision models are not very practical. My guess is that for heavy vision models to become more popular, they need to become more widely supported by popular backends such as Llama.cpp or ExllamaV2, but there are a lot of challenges in implementing vision model support. This way, they would become more VRAM efficient and may gain better performance, and when we have good vision models that also remain good at text-only tasks, it may become more practical to use them. I still use vision models quite often, but I understand why they are currently not very popular due to the issues mentioned above.

[-]

fallingdowndizzyvr@reddit

> For vision tasks, Qwen2-VL 72B is better in my experience, it does not suffer from overcensoring (so far, it never refused my requests, while Llama 90B does it quite often, even for basic general questions).

The irony. Since the haters always complain about the CCP censorship.

[-]

talk_nerdy_to_m3@reddit

There's surely a difference between censorship and potentially harmful information. Tiananmen Square != how do I make a pipe bomb. Now, not to get political, but I can't think of another example: the Hunter Biden laptop, on the other hand, can probably go either way, so it is definitely a challenge to avoid censorship while preventing harmful information.

[-]

shroddy@reddit

The Qwen models themselves are quite uncensored, but when you use them online, their online service disconnects as soon as you ask something about Tiananmen Square or a similarly sensitive topic.

[-]

ihaag@reddit

LMStudio supports vision; can it run the 90b?

[-]

Eugr@reddit

Not yet, as llama.cpp doesn't support vision llama architecture. Even on Macs, while MLX now supports Llama vision, the backend used by LMStudio doesn't (but it does support Qwen).

[-]

Comfortable-Top-3799@reddit

It is too large for normal users to run

[-]

a_beautiful_rhind@reddit

It's inconvenient to run. You have to use AWQ, bitsandbytes, etc.

[-]

shroddy@reddit

It is on lmsys arena.

[-]

Healthy-Nebula-3603@reddit

It's big... And we don't have an implementation for llama.cpp, which would allow us to use VRAM with RAM as an extension when VRAM is lacking. So other projects can't use it, as they are derived from llama.cpp.

[-]

MoffKalast@reddit

Vision models aren't text models either, they (typically) require vastly more compute. I doubt running them CPU only would get us nearly as good results.

[-]

Healthy-Nebula-3603@reddit

Nah. Vision models work the same way as text models. The only difference is an extra vision encoder... that's it. The vision models currently working on llama.cpp, the biggest of which is llava 1.6 32b, work as fast as text-only models of the same size.

[-]

MoffKalast@reddit

Does clip not have some convolutional layers? I mean I guess that's more the case with the old architectures, but diffusion models are transformers and are significantly slower than LLMs.

[-]

Healthy-Nebula-3603@reddit

As I said, and tested myself, I don't see a difference in performance. A vision 30b is as fast as a text 30b model. As far as I know, just adding a vision encoder to a text model turns it into a vision model... I know how crazy it sounds, but it is true... magic.

[-]

MoffKalast@reddit

I mean yeah it's true, after hefty retraining it [kinda works](https://arxiv.org/abs/2407.06581). Still I guess these encoders must be really tiny, images are a lot more data to process regardless of how you approach it. I have to read up a bit on what the CLIP arch actually does.
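
For a rough sense of how much data the encoder hands to the language model: a ViT-style encoder splits the image into fixed-size patches and emits one embedding per patch. Below is a minimal sketch, assuming a 336x336 input and 14x14 patches (the CLIP ViT-L/14 336px configuration). Other models use different sizes, tiling, or token pooling, so treat these numbers as illustrative; the function name is hypothetical.

```python
def num_patch_tokens(image_size: int, patch_size: int) -> int:
    """Patch embeddings produced for a square image, ignoring any class token."""
    if image_size % patch_size != 0:
        raise ValueError("image_size must be divisible by patch_size")
    per_side = image_size // patch_size
    return per_side * per_side


# Illustrative values only: 336px input, 14px patches (assumption).
print(num_patch_tokens(336, 14))  # 24 * 24 = 576
```

In LLaVA-style models, where patch embeddings are projected into the prompt, one image under these assumptions costs the text model about as much context as a few hundred tokens. The encoder itself still has to run once per image.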

[-]

Only-Letterhead-3411@reddit

Because most people don't need or care about vision models. I'd prefer a very smart, text only LLM to a multi modal AI with inflated size any day

[-]

Dry-Judgment4242@reddit

I don't get the vision models. Are they not just a text model that has had a vision model surgically stitched to its head? Every one of those multimodal models I tested was awful compared to just running an LLM + Stable Diffusion API.

[-]

AlanCarrOnline@reddit

The vision stuff is for it to see things, not produce images like SD does. Having said that, I don't have much of a use-case for it either, but it's a baby-step in the direction of... something, for sure.

[-]

Dry-Judgment4242@reddit

Ohh. Right, yeah, I was confused when I tried one too. Still apparently am, cuz you're right. A vision model stitched to it, in that case. Tried doing llama3.2 vision + Stable Diffusion and it did not work very well heh...

[-]

SandboChang@reddit

It really depends on the kind of interaction you are looking for. For me when I am trying to get some Python matplotlib done, a vision model makes life much easier sometimes.

[-]

openssp@reddit

https://embeddedllm.com/blog/see-the-power-of-llama-32-vision-on-amd-mi300x Check this out. They run Llama 3.2 90b on an AMD GPU. The results look impressive.
