This seems pretty hype...
Posted by clduab11@reddit | LocalLLaMA | View on Reddit | 21 comments
https://mistral.ai/news/pixtral-large/
TL;DR: Mistral updated Pixtral-Large, one of their higher-parameter (124B) multimodal models.
Those are some pretty impressive figures, though I do hope we soon see benchmarks against o1-preview.
https://huggingface.co/mistralai/Pixtral-Large-Instruct-2411
eposnix@reddit
As an aside, their Le Chat platform is using Flux to generate images for free.
slayyou2@reddit
Is it Flux Dev or Flux Pro?
Samurai_zero@reddit
It was talked about yesterday. It's a nice model, but not a clear-cut improvement over Qwen2-VL; it's even bigger, and the license is the typical good-for-not-much one from Mistral. So it's hard for regular people to use, and hard for businesses to use.
Good to have another model like this, though.
mikael110@reddit
One thing that's often overlooked as well is that Qwen2-VL has official support for video and for dynamic image resolutions, which helps a lot with OCR. Pretty much no other VLM has matched those features, including Pixtral.
It's something I hoped would become a standard feature after Qwen2-VL launched, but it really hasn't.
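For anyone curious, the dynamic-resolution behavior is exposed through the processor's min_pixels/max_pixels budgets in the Transformers integration. Here's a rough sketch based on the usage I remember from the model card (the pixel budgets and image URL are just placeholders, tune them to your VRAM and use case):

```python
# Rough sketch of Qwen2-VL's dynamic-resolution knobs via Hugging Face Transformers.
# The min_pixels/max_pixels budgets and the image URL are placeholder values.
from transformers import AutoProcessor, Qwen2VLForConditionalGeneration
from qwen_vl_utils import process_vision_info  # helper package published alongside the model

model = Qwen2VLForConditionalGeneration.from_pretrained(
    "Qwen/Qwen2-VL-7B-Instruct", torch_dtype="auto", device_map="auto"
)
# Dynamic resolution: each image is mapped to a variable number of visual tokens,
# bounded by these pixel budgets (a larger budget helps with dense text / OCR).
processor = AutoProcessor.from_pretrained(
    "Qwen/Qwen2-VL-7B-Instruct", min_pixels=256 * 28 * 28, max_pixels=1280 * 28 * 28
)

messages = [{
    "role": "user",
    "content": [
        {"type": "image", "image": "https://example.com/receipt.jpg"},
        {"type": "text", "text": "Transcribe all text in this image."},
    ],
}]
text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(text=[text], images=image_inputs, videos=video_inputs,
                   padding=True, return_tensors="pt").to(model.device)

out = model.generate(**inputs, max_new_tokens=256)
# Strip the prompt tokens before decoding the answer.
print(processor.batch_decode(out[:, inputs.input_ids.shape[1]:], skip_special_tokens=True)[0])
```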
clduab11@reddit (OP)
I don't know what it is about Llama-3.2. I really, really want to like it, but with my use cases and what I use generative AI for... well, generally... it just gets shat on by a lot of other models/MoE merges/what have you out there. But you raise an excellent point about them marketing themselves against a kid that's always picked on; I'm semi-disappointed by that too.
Makes me feel like I really need to dig into Qwen2-VL more.
mikael110@reddit
Yeah, Llama-3.2 was honestly a bit of a disappointment. Not just in benchmarks, but in practice I found it to be quite bad, especially given the size.
Though one thing that is interesting about it, and possibly the reason it performs worse than most of its competition, is that Meta went out of their way to make the vision aspect entirely additive. They wanted the pure text capabilities to remain identical to the previous models, which is why they went with a somewhat unusual design that left the text processing untouched.
This is in contrast to most vision models that are trained specifically for vision tasks and tend to get confused if you try to use them without any image included. So it does have that going for it. Though personally I think most people prefer to use a dedicated VLM for vision tasks and dedicated LLMs for text tasks. Especially when the performance difference is so large.
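To make the "additive" idea concrete, here's a rough conceptual sketch (this is not Meta's actual code, just the general pattern): new gated cross-attention blocks attend to image features while the original text layers stay frozen, so text-only behavior is literally unchanged.

```python
# Conceptual sketch of an "additive" vision design, NOT Meta's implementation:
# the pretrained text layers are frozen and only new gated cross-attention
# blocks (which attend to image features) are trained.
import torch
import torch.nn as nn

class GatedCrossAttentionBlock(nn.Module):
    def __init__(self, d_model: int, n_heads: int):
        super().__init__()
        self.cross_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        # Gate starts at 0, so at initialization the block contributes nothing
        # and text-only behavior matches the base LLM exactly.
        self.gate = nn.Parameter(torch.zeros(1))

    def forward(self, text_hidden: torch.Tensor, image_features: torch.Tensor) -> torch.Tensor:
        attn_out, _ = self.cross_attn(text_hidden, image_features, image_features)
        return text_hidden + torch.tanh(self.gate) * attn_out  # purely additive contribution

# Training would then freeze the original decoder and update only the new blocks,
# e.g. (hypothetical module names):
#   for p in text_decoder.parameters(): p.requires_grad = False
#   for p in cross_attn_blocks.parameters(): p.requires_grad = True
```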
Grimulkan@reddit
That's my thinking too. I would have loved for Llama 3.2 vision to be great, especially with being able to reuse the text-portion workflow for training and inference, but I'm not sure the tradeoff is worth it. I wish Meta hadn't frozen all the layers during training, but had still kept the same text architecture so most of the code could be reused.
Some of this might just be under-training and lack of diversity in training data. There are many hints that multiple images could be supported, but they never actually included that in training.
I view this as something like Llama-2. It's decent (better than the initial LLaVA versions), and community fine-tunes could have made it competent, like we did with L2, but we're just not motivated when there are so many better open vision models out there. But maybe a Llama-4 vision will have the same quality jump over this that Llama-3 text had over L2, where Meta put in a lot more work on their side.
No-Tea5655@reddit
Pixtral has dynamic image resolutions as well.
https://mistral.ai/news/pixtral-12b/
mikael110@reddit
Ah, you appear to be right. I could have sworn I remembered it being a fixed size, but it appears my memory was faulty. Thank you for pointing that out. I've edited my comment to reflect it.
ortegaalfredo@reddit
>Good to have another model like this, though.
I like how quick we get used to incredible things. Mistral is better than all the *commercial closed AI* and they released it for free, so you can run it yourself.
Altman must have nightmares over this.
tucnak@reddit
The biggest improvement 2411 has over Qwen is that 2411 is actually good, and not "shamelessly overfitting on public evals"-good like Qwen and its derivatives.
clduab11@reddit (OP)
Agreed yeah (must've missed it yesterday, whoops my bad!).
My work is definitely more Qwen-influenced than anything else, so I haven't yet played around too much with Qwen2.5-VL. But in augmenting my use with models like this or Grok-Vision-Beta, I'm definitely excited that more of the bigger names are getting into multimodal and making it as utilitarian as can be.
JosefAlbers05@reddit
Too big..
Whotea@reddit
Rent a GPU online for like $0.20 an hour
JosefAlbers05@reddit
I know, but I would much rather have one on my hard drive.
Xandrmoro@reddit
I wish they made an updated 70B mistral :'c
mlon_eusk-_-@reddit
Compared to Qwen 2.5? Is it an improvement?
Menteurium@reddit
Incredibly good on a first test. It's almost a forensic level of image analysis. This could prepare a top-quality image dataset for training the next level of image/video generators, since the descriptions become the prompts. Or, in a pipeline, image restoration could finally be taken seriously, instead of the current cloning that's only vaguely related to the original.
But it can't be run locally anytime soon; there are still no local install instructions anywhere, even for the older Pixtral, and Google turns up only two(!) usable methods on the whole internet.
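One route that does seem to exist for the older Pixtral is vLLM's offline API with its Mistral tokenizer mode. A minimal sketch, assuming a recent vLLM build with Pixtral support and enough VRAM for the 12B model (the image URL is a placeholder):

```python
# Minimal sketch: running the older Pixtral 12B locally via vLLM's offline API.
# Assumes a recent vLLM release with Pixtral support and sufficient GPU memory;
# the image URL below is a placeholder.
from vllm import LLM
from vllm.sampling_params import SamplingParams

llm = LLM(model="mistralai/Pixtral-12B-2409", tokenizer_mode="mistral")

messages = [{
    "role": "user",
    "content": [
        {"type": "text", "text": "Describe this image in as much detail as possible."},
        {"type": "image_url", "image_url": {"url": "https://example.com/photo.jpg"}},
    ],
}]

outputs = llm.chat(messages, sampling_params=SamplingParams(max_tokens=512))
print(outputs[0].outputs[0].text)
```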
panchovix@reddit
Are there no benchmarks for Mistral Large 2411? I only saw ones for Pixtral Large.
clduab11@reddit (OP)
Yeah, that was my bad on the phrasing! It reads like these could be the benchmarks for Mistral Large 2411?
I haven't gotten myself a Mistral API key yet, but I intend to try it out in the playground of my interface and see how it fares against some of my others.
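For reference, hitting it over the hosted API should just be the standard chat endpoint with an image part in the message. A rough sketch with the official Python SDK (the model alias and image URL here are assumptions on my part, not something I've verified):

```python
# Rough sketch: querying Pixtral Large over Mistral's hosted API with the official
# Python SDK. The model alias and image URL are assumptions, not verified values.
import os
from mistralai import Mistral

client = Mistral(api_key=os.environ["MISTRAL_API_KEY"])

response = client.chat.complete(
    model="pixtral-large-latest",  # assumed alias for the 2411 release
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": "What does this chart show?"},
            {"type": "image_url", "image_url": "https://example.com/chart.png"},
        ],
    }],
)
print(response.choices[0].message.content)
```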