What happened to Llama 3.2 90b-vision?

Posted by TitoxDboss@reddit | LocalLLaMA | View on Reddit | 42 comments

No one seems to talk about it anymore. It's not on HuggingChat, and it's not on the LMSYS arena. It seems to have just faded out of relevance.

Such_Advantage_6949@reddit

Not good enough compared to Qwen2-VL.
View on Reddit #39844286

Arkonias@reddit

It's still there, supported in MLX so us Mac folks can run it locally. Llama.cpp seems to be allergic to vision models.
View on Reddit #39697355

No-Refrigerator-1672@reddit

Ollama has llama3.2 support in the pre-release 0.4.0 version. It's currently only for the 11b size, but I believe they'll add 90b after the full release, so I think in the next few weeks there will be a no-effort solution to host llama3.2:90b locally, and then it'll get much more attention.
View on Reddit #39701619

agntdrake@reddit

It'll be up soon (hopefully later tonight) to work w/ 0.4.0rc8 which just went live. In testing it's pretty good.
View on Reddit #39835093

Accomplished_Bet_127@reddit

They are doing quite a lot of work already. If anyone, you for example, is willing to add support for vision models in llama.cpp, that is good. Go ahead! It's not that they don't like it. It is an open project and there has been no one with the right skills to contribute.
View on Reddit #39701955

gtek_engineer66@reddit

If I had time to learn the steps required to do so, I would definitely do it.
View on Reddit #39704669

Accomplished_Bet_127@reddit

That is the point. No one is "allergic" to vision models. It's just that adding a feature to software under active development requires someone with the necessary skills and the time to keep up with the rest of llama.cpp.
View on Reddit #39705838

gtek_engineer66@reddit

Calm down
View on Reddit #39766221

Plabbi@reddit

Lol
View on Reddit #39705274

shroddy@reddit

Afaik there were contributions for vision models, but they were not merged.
View on Reddit #39740255

Accomplished_Bet_127@reddit

I would presume so. It must be a real problem to produce code that follows the project's guidelines, works efficiently, and doesn't conflict with existing and WIP functions. By now, the llama.cpp codebase should be quite big. Also, real geniuses are not always a good fit, as they might produce code that others cannot work with. It doesn't have to be someone who does everything perfectly on the first shot. Probably they would take someone who has the skills and the intention to work on the project for at least some time, to establish some work routines (in what order new features are added and how to test them) and create some documentation so that more people could join the same effort. I make it sound hard, but I am really 'afraid' that this project is quite complicated by now. It would be fantastic if guidelines were written so that an AI could handle conflicts and checks for the project, so more functions could be added without dragging development time down.
View on Reddit #39740938

emprahsFury@reddit

it's a company with a product, let's not go all Stallman on each other because they don't want to support multi-modal
View on Reddit #39722129

unclemusclezTTV@reddit

people are sleeping on apple
View on Reddit #39697493

llkj11@reddit

Prob because not everyone has a few thousand to spend on a Mac lol.
View on Reddit #39699699

InertialLaunchSystem@reddit

It's actually cheaper than using an Nvidia GPU if you want to run large models because of the fact that Mac RAM is also VRAM.
View on Reddit #39764012
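A rough back-of-the-envelope sketch of why unified memory matters for a model this size. It assumes the "90b" label means about 90e9 parameters and counts only the weights, ignoring the KV cache, activations, and any runtime overhead, so real requirements will be higher. The function name `weight_memory_gb` is made up for illustration.

```python
def weight_memory_gb(n_params: float, bits_per_weight: float) -> float:
    """Approximate memory for the weights alone, in GB (1 GB = 1e9 bytes).

    Assumption: every parameter is stored at the same precision and nothing
    else (KV cache, activations, runtime overhead) is counted.
    """
    return n_params * bits_per_weight / 8 / 1e9


if __name__ == "__main__":
    # Assumed parameter count: 90e9, taken from the "90b" in the model name.
    for bits in (16, 8, 4):
        print(f"90B at {bits}-bit: ~{weight_memory_gb(90e9, bits):.0f} GB")
```

Even at 4-bit, the weights alone come to about 45 GB under these assumptions, which is more than a single 24GB consumer GPU holds.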

Final-Rush759@reddit

I use Qwen2-VL-7B on Mac. I also used it with an Nvidia GPU + PyTorch. It took me a few hours to install all the libraries because some of them were incompatible and would uninstall previously installed ones. They have to be installed in a certain order. It still gives incompatibility warnings, but it didn't kick out the other libraries. After that, it runs totally fine. But when the Mac MLX version showed up, it was super easy to install in LM Studio 0.3.5.
View on Reddit #39698367

ab2377@reddit

how does it perform, and have you done ocr with it?
View on Reddit #39700156

bieker@reddit

None of these vision models are good at pure ocr, what qwen2-vl excels at is doc-qa and json structured output.
View on Reddit #39712548

Final-Rush759@reddit

The model performed very well. I gave it a screenshot of a math formula from a scientific paper and asked the VLM to write Python code for it.
View on Reddit #39711674

robotphilanthropist@reddit

not a good enough model ;)
View on Reddit #39751246

Lissanro@reddit

My own experience with it was pretty bad; they attempted to bake way too much censorship into it. It failed even the basic tests some YouTubers threw at it, specifically due to degradation caused by overcensoring: [https://www.youtube.com/watch?v=lzDPQAjItOo](https://www.youtube.com/watch?v=lzDPQAjItOo) . For vision tasks, Qwen2-VL 72B is better in my experience. I can run it locally using [https://github.com/matatonic/openedai-vision](https://github.com/matatonic/openedai-vision) . It is not as VRAM efficient as TabbyAPI, so it requires four 24GB GPUs, and even that feels like a tight fit. And it is still not as good as text-only Qwen2.5 or Llama3.1, and loading the vision model takes a few minutes, then a few more minutes to get a reply and load the normal text model back, so currently large vision models are not very practical. My guess is that for heavy vision models to become more popular, they need to be more widely supported by popular backends such as Llama.cpp or ExllamaV2, but there are a lot of challenges in implementing vision model support. With that support, they would become more VRAM efficient and might perform better, and once we have good vision models that also remain good at text-only tasks, it may become more practical to use them. I still use vision models quite often, but I understand why they are currently not very popular, given the issues mentioned above.
View on Reddit #39719605

fallingdowndizzyvr@reddit

> For vision tasks, Qwen2-VL 72B is better in my experience, it does not suffer from overcensoring (so far, it never refused my requests, while Llama 90B does it quite often, even for basic general questions).

The irony. Since the haters always complain about the CCP censorship.
View on Reddit #39731867

talk_nerdy_to_m3@reddit

There's surely a difference between censorship and potentially harmful information. Tiananmen square != How do I make a pipe bomb. Now, not to get political but I can't think of another example, the hunter Biden laptop on the other hand can probably go either way so it is definitely a challenge to avoid censorship while preventing harmful information.
View on Reddit #39750319

shroddy@reddit

The Qwen models themselves are quite uncensored, but when you use them online, their online service disconnects as soon as you ask something about Tiananmen Square or a similar sensitive topic.
View on Reddit #39740481

ihaag@reddit

LMStudio supports vision. Can it run the 90b?
View on Reddit #39703270

Eugr@reddit

Not yet, as llama.cpp doesn't support vision llama architecture. Even on Macs, while MLX now supports Llama vision, the backend used by LMStudio doesn't (but it does support Qwen).
View on Reddit #39726860

Comfortable-Top-3799@reddit

It is too large for normal users to run
View on Reddit #39714436

a_beautiful_rhind@reddit

It's inconvenient to run. You have to use AWQ, bitsandbytes, etc.
View on Reddit #39710505

shroddy@reddit

It is on lmsys arena.
View on Reddit #39704868

Healthy-Nebula-3603@reddit

It's big... And we don't have an implementation in llama.cpp, which would let us use system RAM as an extension when there isn't enough VRAM. So other projects can't use it either, as they are derived from llama.cpp.
View on Reddit #39697131

MoffKalast@reddit

Vision models aren't text models either, they (typically) require vastly more compute. I doubt running them CPU only would get us nearly as good results.
View on Reddit #39702357

Healthy-Nebula-3603@reddit

Nah. Vision models work the same way as text models. The only difference is the extra vision encoder... that's it. The vision models that currently work on llama.cpp, the biggest of which is llava 1.6 32b, run as fast as text-only models of the same size.
View on Reddit #39702850
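A minimal sketch of the "extra vision encoder" idea for LLaVA-style models, where the encoder cuts the image into patches and each patch becomes one extra token in the prompt. The 336-pixel input and 14-pixel patch are illustrative values matching a CLIP ViT-L/14 encoder at 336px; they are assumptions, not something stated in this thread, and other architectures may feed image features to the language model differently. The helper name `patch_tokens` is made up.

```python
def patch_tokens(image_size: int, patch_size: int) -> int:
    """Number of patch tokens a ViT-style encoder yields for a square image.

    Assumption: non-overlapping square patches, no class token counted.
    """
    if image_size <= 0 or patch_size <= 0:
        raise ValueError("image_size and patch_size must be positive")
    if image_size % patch_size != 0:
        raise ValueError("image_size must be a multiple of patch_size")
    per_side = image_size // patch_size
    return per_side * per_side


if __name__ == "__main__":
    # Illustrative values (assumed): 336x336 image, 14x14 patches.
    print(patch_tokens(336, 14))  # 24 patches per side -> 576 tokens
```

Under these assumptions one image costs a few hundred prompt tokens, after which generation proceeds token by token like any text model of the same size.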

MoffKalast@reddit

Does clip not have some convolutional layers? I mean I guess that's more the case with the old architectures, but diffusion models are transformers and are significantly slower than LLMs.
View on Reddit #39702940

Healthy-Nebula-3603@reddit

As I said, and tested myself, I don't see a difference in performance. A 30b vision model is as fast as a 30b text model. As far as I know, you just add a vision encoder to the text model and it becomes a vision model.... I know how crazy it sounds, but it is true... magic.
View on Reddit #39703136

MoffKalast@reddit

I mean yeah it's true, after hefty retraining it [kinda works](https://arxiv.org/abs/2407.06581). Still I guess these encoders must be really tiny, images are a lot more data to process regardless of how you approach it. I have to read up a bit on what the CLIP arch actually does.
View on Reddit #39703354

Only-Letterhead-3411@reddit

Because most people don't need or care about vision models. I'd prefer a very smart, text only LLM to a multi modal AI with inflated size any day
View on Reddit #39698765

Dry-Judgment4242@reddit

I don't get the vision models. Are they not just a text model that has had a vision model surgically stitched to its head? Every one of those multimodal models I tested was awful compared to just running an LLM + Stable Diffusion API.
View on Reddit #39699748

AlanCarrOnline@reddit

The vision stuff is for it to see things, not produce images like SD does. Having said that, I don't have much of a use-case for it either, but it's a baby-step in the direction of... something, for sure.
View on Reddit #39702508

Dry-Judgment4242@reddit

Ohh. Right, yeah, I was confused when I tried one too. Apparently I still am, cuz you're right. A vision model stitched to it in that case. Tried doing llama3.2 vision + Stable Diffusion and it did not work very well heh...
View on Reddit #39702718

SandboChang@reddit

It really depends on the kind of interaction you are looking for. For me when I am trying to get some Python matplotlib done, a vision model makes life much easier sometimes.
View on Reddit #39699703

openssp@reddit

https://embeddedllm.com/blog/see-the-power-of-llama-32-vision-on-amd-mi300x Check this out. They run Llama 3.2 90b on an AMD GPU. The results look impressive.
View on Reddit #39699602

Many_SuchCases@reddit

That receipt video is great!
View on Reddit #39702056