Multimodality is currently terrible in open source

Posted by Unusual_Guidance2095@reddit | LocalLLaMA

I don’t know if anyone else feels this way, but it currently seems that multimodal large language models are our best shot at a “world model” (I’m using the term loosely, of course), and in open source they are terrible right now.

A truly multimodal large language model can replace virtually all models that we think of as AI:

- Text to image (image generation)
- Image to text (image captioning, bounding box generation, object detection)
- Text to text (standard LLM)
- Audio to text (transcription)
- Text to audio (text-to-speech, music generation)
- Audio to audio (speech assistant)
- Image to image (image editing, temporal video generation, image segmentation, image upscaling)

Not to mention all sorts of combinations:

- Image and audio to image and audio (film continuation)
- Audio to image (a speech assistant that can generate images)
- Image to audio (voice descriptions of images, sound generation for films, perhaps sign language interpretation)
- etc.

We’ve seen time and time again in AI that having more domains in your training data makes your model better. Our best translation models today are LLMs because they understand language more generally: we can give them specific requests like “make this formal” or “make this happy sounding” that no other translation software can handle, and they develop skills we never explicitly trained for. We saw with the release of Gemini a few months ago how good its image editing capabilities are, and no other model that I know of does image editing at all (let alone does it well) besides multimodal LLMs. Who knows what else such a model could do: visual reasoning by generating images so it doesn’t fail the weird spatial benchmarks, etc.?

Yet no company has been able to, or is even trying to, replicate the success of either OpenAI’s 4o or Gemini, and every time someone releases a new “omni” model it’s always missing something: modalities, or a unified architecture where all modalities are embedded in the same latent space so that all of the above is possible. It’s so irritating. Qwen, for example, doesn’t support any of the things that 4o voice can do: speaking faster or slower, (theoretically) voice imitation, singing, background noise generation, and it’s not great on any of the text benchmarks either. There was the beyond disappointing Sesame model as well.
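To make the “same latent space” point concrete, here’s a minimal sketch (in Python/PyTorch, with entirely hypothetical encoder names, dimensions, and heads; this is not any released model’s architecture) of what a unified omni design implies: every modality is projected into one shared token embedding space that a single backbone consumes and can decode back into any modality.

```python
import torch
import torch.nn as nn

# Hypothetical sketch of a unified "omni" backbone: all modalities are
# projected into the SAME d_model-dimensional token space, so one
# transformer can attend across (and generate) any mix of them.
D_MODEL = 1024

class OmniSketch(nn.Module):
    def __init__(self, vocab_size=32000, image_patch_dim=768, audio_frame_dim=128):
        super().__init__()
        # Per-modality encoders, all landing in the shared space.
        self.text_embed = nn.Embedding(vocab_size, D_MODEL)
        self.image_proj = nn.Linear(image_patch_dim, D_MODEL)  # ViT-style patch features
        self.audio_proj = nn.Linear(audio_frame_dim, D_MODEL)  # mel-frame features
        # One backbone over the interleaved multimodal sequence.
        layer = nn.TransformerEncoderLayer(d_model=D_MODEL, nhead=16, batch_first=True)
        self.backbone = nn.TransformerEncoder(layer, num_layers=4)
        # Per-modality output heads (text tokens, image patches, audio frames).
        self.text_head = nn.Linear(D_MODEL, vocab_size)
        self.image_head = nn.Linear(D_MODEL, image_patch_dim)
        self.audio_head = nn.Linear(D_MODEL, audio_frame_dim)

    def forward(self, text_ids, image_patches, audio_frames):
        # Interleave all modalities as one token sequence in the shared space.
        seq = torch.cat([
            self.text_embed(text_ids),
            self.image_proj(image_patches),
            self.audio_proj(audio_frames),
        ], dim=1)
        hidden = self.backbone(seq)
        # Any position's hidden state can be decoded into any modality,
        # which is what enables text->image, audio->image, etc. in one model.
        return self.text_head(hidden), self.image_head(hidden), self.audio_head(hidden)
```

The contrast with most open “omni” releases, which tend to bolt a separate diffusion decoder or TTS model onto a text LLM rather than sharing one space end to end, is exactly the gap being complained about here.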

At this point, I’m wondering if the closed-source companies truly do have a moat, and whether it’s this specifically.

Of course, I’m not against specialized models and more explainable pipelines composed of multiple models; clearly that works very well for Waymo’s self-driving and for coding copilots, and it should be used there. But I’m wondering now if we will ever get a good omnimodal model.

Sorry for the rant. I just keep getting excited and then disappointed, time and time again (probably up to 20 times now), by every subsequent multimodal model release, and I’ve been waiting years since the original 4o announcement for any model that lives up to even a quarter of my expectations.