Multimodality is currently terrible in open source

Posted by Unusual_Guidance2095@reddit | LocalLLaMA

I don’t know if anyone else feels this way, but it currently seems that multimodal large language models are our best shot at a “world model” (I’m using the term loosely, of course), and in open source they are terrible right now.

A truly multimodal large language model can replace virtually all models that we think of as AI:

- Text to image (image generation)
- Image to text (image captioning, bounding box generation, object detection)
- Text to text (standard LLM)
- Audio to text (transcription)
- Text to audio (text-to-speech, music generation)
- Audio to audio (speech assistant)
- Image to image (image editing, temporal video generation, image segmentation, image upscaling)

Not to mention all sorts of combinations:

- Image and audio to image and audio (film continuation)
- Audio to image (a speech assistant that can generate images)
- Image to audio (voice descriptions of images, sound generation for films, perhaps sign language interpretation)
- etc.

We’ve seen time and time again in AI that having more domains in your training data makes your model better. Our best translation models today are LLMs because they understand language more generally: we can give them specific requests like “make this formal” or “make this happy sounding” that no other translation software can handle, and they develop skills we never explicitly trained for. We saw with the release of Gemini a few months ago how good its image editing capabilities are, and no other model that I know of does image editing at all (let alone does it well) besides multimodal LLMs. Who knows what else such a model could do: visual reasoning by generating images so it doesn’t fail the weird spatial benchmarks, etc.?

Yet no company has been able to, or is even trying to, replicate the success of either OpenAI’s 4o or Gemini, and every time someone releases a new “omni” model it’s always missing something: modalities, or a unified architecture where all modalities are embedded in the same latent space so that all of the above is possible. It’s so irritating. Qwen, for example, doesn’t support any of the things that 4o voice can do: speaking faster or slower, (theoretically) voice imitation, singing, background noise generation, and it’s not great on any of the text benchmarks either. There was the beyond disappointing Sesame model as well.
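To make the “same latent space” point concrete, here’s a minimal sketch (in Python/PyTorch, with entirely hypothetical encoder names, dimensions, and heads; this is not any released model’s architecture) of what a unified omni design implies: every modality is projected into one shared token embedding space that a single backbone consumes and can decode back into any modality.

```python
import torch
import torch.nn as nn

# Hypothetical sketch of a unified "omni" backbone: all modalities are
# projected into the SAME d_model-dimensional token space, so one
# transformer can attend across (and generate) any mix of them.
D_MODEL = 1024

class OmniSketch(nn.Module):
    def __init__(self, vocab_size=32000, image_patch_dim=768, audio_frame_dim=128):
        super().__init__()
        # Per-modality encoders, all landing in the shared space.
        self.text_embed = nn.Embedding(vocab_size, D_MODEL)
        self.image_proj = nn.Linear(image_patch_dim, D_MODEL)  # ViT-style patch features
        self.audio_proj = nn.Linear(audio_frame_dim, D_MODEL)  # mel-frame features
        # One backbone over the interleaved multimodal sequence.
        layer = nn.TransformerEncoderLayer(d_model=D_MODEL, nhead=16, batch_first=True)
        self.backbone = nn.TransformerEncoder(layer, num_layers=4)
        # Per-modality output heads (text tokens, image patches, audio frames).
        self.text_head = nn.Linear(D_MODEL, vocab_size)
        self.image_head = nn.Linear(D_MODEL, image_patch_dim)
        self.audio_head = nn.Linear(D_MODEL, audio_frame_dim)

    def forward(self, text_ids, image_patches, audio_frames):
        # Interleave all modalities as one token sequence in the shared space.
        seq = torch.cat([
            self.text_embed(text_ids),
            self.image_proj(image_patches),
            self.audio_proj(audio_frames),
        ], dim=1)
        hidden = self.backbone(seq)
        # Any position's hidden state can be decoded into any modality,
        # which is what enables text->image, audio->image, etc. in one model.
        return self.text_head(hidden), self.image_head(hidden), self.audio_head(hidden)
```

The contrast with most open “omni” releases, which tend to bolt a separate diffusion decoder or TTS model onto a text LLM rather than sharing one space end to end, is exactly the gap being complained about here.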

At this point, I’m wondering if the closed-source companies truly do have a moat, and whether it’s this specifically.

Of course, I’m not against specialized models and more explainable pipelines composed of multiple models; clearly that works very well for Waymo’s self-driving and for coding copilots, and it should be used there. But I’m wondering now if we will ever get a good omnimodal model.

Sorry for the rant. I just keep getting excited and then disappointed, time and time again (probably up to 20 times now), by every subsequent multimodal model release, and I’ve been waiting years since the original 4o announcement for any model that lives up to even a quarter of my expectations.