henryclw@reddit
I think this might take a while for the llama.cpp or vllm to support this
Lynncc6@reddit (OP)
yep, let's keep an eye on this PR: https://github.com/vllm-project/vllm/pull/12069
henryclw@reddit
Thank you for the info
Aaaaaaaaaeeeee@reddit
This looks like homelab streaming to device. Running on device is currently a challenge.
Their mobile VLM support for an old test app requires a ≥12 GB device.
From github: https://openbmb.oss-cn-hongkong.aliyuncs.com/model_center/mobile/android/MiniCPM-2.0.apk
CosmosisQ@reddit
Newer flagships like the Google Pixel 9 Pro XL come with 16 GB of RAM.
Aaaaaaaaaeeeee@reddit
Yes, though I do think NPU processing speed is important for VLMs, and I haven't seen good numbers. Current GPU and CPU performance on llama.cpp and MLC is usually only around 3x faster, rather than the 30-40x you get on Qualcomm's NPU.
If you're into RAM, you could get the Ace 2 Pro. For Android privacy, load iodeOS (or whatever you like as a GSI).
Qualcomm's HTP framework can work on the Gen 2 for LLMs (there have been successful runs posted in Qualcomm's Slack group), though you might need an Android 15 GSI.
Many_SuchCases@reddit
My first risky install of 2025 🫡
Aaaaaaaaaeeeee@reddit
It might work, maybe on an iPad with Apple silicon, using the web browser for hardware acceleration.
MoffKalast@reddit
I see we're back to "this 8B model beats GPT4" posting
Radiant_Dog1937@reddit
8B models that beat GPT-4 have been a reality for a while now. It just depends on which version of GPT-4 you're referring to.
CheatCodesOfLife@reddit
Gemma2 writes lots of small / simple words, and praises the user. This is the only reason it rates so highly on that leaderboard.
TheRealGentlefox@reddit
lmsys is borderline useless. It's just user preference. For actual capabilities I only really refer to personal experience, livebench, and simplebench.
MoffKalast@reddit
Lmsys doesn't really tell the whole story; it's focused on single-turn replies and it's not really objective. They even had to add style control (which doesn't really work either) to account for the fact that people just like nicely formatted bullshit better than correct facts in a single paragraph.
There are certainly areas where small models can roughly match that level of performance today, like writing a made-up story, chatting coherently, formatting text, extracting data, and other tasks that mainly require general language skill and model stability.
But there's really no way they can match something 200 times their size when it comes to knowing and applying knowledge, e.g. coding performance, being multilingual, or teaching about topics without hallucinating. There's a hard limit on how much information content you can squeeze into a dozen GB, it's just physics. Something in the 30B+ range is a lot more realistic for reproducing a similar level across most areas.
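(For rough scale on the size argument above, here is a hedged back-of-envelope sketch; it assumes 4-bit quantization and takes the "200 times the size" figure from the comment as ~1.6T parameters, since closed-model sizes are only rumors.)

```python
# Back-of-envelope weight-storage comparison. Parameter counts for closed
# models are unconfirmed; treat these numbers as purely illustrative.
def weight_gb(params_billions: float, bits_per_weight: float = 4.0) -> float:
    """Approximate size of quantized weights in GB."""
    return params_billions * 1e9 * bits_per_weight / 8 / 1e9

print(f"8B    -> {weight_gb(8):.0f} GB")     # ~4 GB: phone territory
print(f"32B   -> {weight_gb(32):.0f} GB")    # ~16 GB: single consumer GPU
print(f"1600B -> {weight_gb(1600):.0f} GB")  # ~800 GB: 200x an 8B model
```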
Radiant_Dog1937@reddit
I'm not so sure about that. Those are strong foundational models from top AI companies; just two short years ago an 8B model wasn't supposed to surpass the ~200B GPT-3.5, but that is definitely the norm today. It could be that the larger models are overcoming mistakes in training through redundancy in parameters, rather than the parameters directly scaling the models' abilities. I don't think GPT-4 level represents some upper limit for smaller models at all, and two years from now the meme will be "this model beats GPT o5".
D50HS@reddit
I have a strong feeling it will actually happen this year.
MrTubby1@reddit
I have a strong feeling that when it happens it will be from benchmarks contaminating the training data.
hapliniste@reddit
Nah, it will simply beat it but use 10x more tokens. Still a huge leap for models that can run on less than 24 GB of VRAM.
rorowhat@reddit
Maybe with a larger model, like 14B+ size
Embarrassed-Wear-414@reddit
“Runs on device” but not well at all
StupidityCanFly@reddit
You have to admit it doesn’t say “runs on any device”.
/s
CheatCodesOfLife@reddit
True, like "freshly laid" eggs were fresh when they were laid. And homegrown coffee beans -- grown in the home of some random bugs/insects.
a_slay_nub@reddit
That is an atrocious MMMU score. 50.4 vs 69.2 for GPT-4o. We really are back to the "this 8B model beats GPT4" phase.
Sadman782@reddit
Small models will always have a lower MMMU no matter how you train them under the current architecture, and it's just one metric. The previous vision-only model (MiniCPM-V 2.6) was a great model, and the current omni model's vision is even more powerful; for many tasks like OCR and other vision work it almost matches the much bigger GPT-4o. It's the first omni model in the style of OpenAI's GPT-4o, with realtime interruption, emotions, realtime accent changes, etc. It is not just a TTS. It is extremely underrated and under-hyped.
frivolousfidget@reddit
The fact that a small model “will always have lower” anything is what makes the statement “An 8B size, GPT-4o level in your phone” so amazing and why people click on the post. Your comment is describing the exact reason people are calling it clickbait….
Just don’t use that phrase unless the model is actually at the same level as 4o…
Mkengine@reddit
Is MiniCPM-V 2.6 still better if the only use case is OCR?
mtasic85@reddit
Do you have GPT4 open sourced and released by OpenAI, so you can use it locally, free of charge?
YearZero@reddit
Yeah, it's called DeepSeek V3 or Llama 405B. Just because a model has the benefits of open source, however, doesn't mean every open-source model beats GPT-4. It beats it in terms of being free and local, but not in terms of benchmarks or capabilities.
ServeAlone7622@reddit
Holy crap! I just finished playing with their gradio demo after reading the docs and WOW this is actually impressive.
https://minicpm-omni-webdemo-us.modelbest.cn/
I sent it a hard-to-read receipt from a Western Union transfer to my ex. I asked how much the transfer fee was.
It not only identified the correct amount but then did some math to predict how many pesos she would receive after the transfer, even though I intentionally cropped that part out. It also identified the sender, recipient, the time and date. Keep in mind I can barely read this thing because it’s old and faded AF.
After that I sent it a short video of my dog interacting with my cat. The thing about my dog is she doesn’t know she’s a dog and for some reason she’s in love with the cat. Neither of them are right in the head and they’re the same size and color, so in dim light sometimes I mix them up myself.
It correctly identified my dog, which is white, the size of a cat, and which Siri has classified as a mop in my picture gallery. It also correctly identified my cat, which is the same size as my dog and also white, but doesn’t look like a mop and isn’t particularly cat-like either.
It explained that the cat seems to be tolerating the dog despite the dog licking her. (Yeah it got the genders wrong but who cares).
Compare this to the last model I tried (ChatGPT o1), which said it was a video of a cat acting strangely around a mop. 🤦♂️
Anyways, this model so far seems to be everything they claim. You really should check it out.
Mr-Barack-Obama@reddit
You sent a video to o1? I don’t think that’s possible…
ServeAlone7622@reddit
No I sent it pictures, screenshots of the video. To see if I could find a baseline for visual reasoning as I was trying out different multimodal LLMs.
So perhaps the comparison is flawed. In either event it’s funny.
Ok_Phase_8827@reddit
wow
ArsNeph@reddit
Damn it, I misread that as "8B model is on the same level as GPT4o mini" and got really excited 😭
Many_SuchCases@reddit
GGUF: https://huggingface.co/openbmb/MiniCPM-o-2_6-gguf
Original weights: https://huggingface.co/openbmb/MiniCPM-o-2_6
noneabove1182@reddit
Did they add GGUF support locally and then not upstream it? It's the MiniCPMO arch, so it definitely won't be convertible on master...
segmond@reddit
GGUF weights? What inference engine can run this in GGUF?
AaronFeng47@reddit
No, it's definitely not GPT-4o level. I'm not saying Qwen 2.5 is bad, their 32B is my main local model, but the 7B one is definitely not GPT-4o level lol
ServeAlone7622@reddit
I’ve been impressed with it. It gives excellent results out of the box, and it’s cheap and easy to fine-tune and make LoRAs for.
You should check out the Marco-o1 model (also based on Qwen 7B). With a quick polish on the react dataset it codes better than Claude or ChatGPT.
What shocked me the most is that everything from the 7B model on up actually does seem to support the full 128K context; you just have to be creative with the YaRN rope-scaling config (see the sketch after this comment).
I am genuinely pleased with Qwen 2.5 series.
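(A hedged sketch of the rope-scaling trick mentioned above, assuming a Qwen2.5 instruct checkpoint and the YaRN rope_scaling values published in Qwen's model card; the exact factor depends on how far past the native 32K window you want to go.)

```python
# Hedged sketch: enable YaRN rope scaling on a Qwen2.5 model via transformers
# so the usable context extends toward 128K. Values follow the Qwen2.5 card.
from transformers import AutoConfig, AutoModelForCausalLM, AutoTokenizer

model_id = "Qwen/Qwen2.5-7B-Instruct"  # assumed checkpoint; any 7B+ Qwen2.5 should work

config = AutoConfig.from_pretrained(model_id)
config.rope_scaling = {
    "type": "yarn",
    "factor": 4.0,                              # 32768 * 4 = 131072 tokens
    "original_max_position_embeddings": 32768,  # Qwen2.5's native window
}
config.max_position_embeddings = 131072

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, config=config, torch_dtype="auto", device_map="auto"
)
```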
Few_Painter_5588@reddit
A bit misleading: the audio component is Whisper-medium rather than a true audio encoder. So unlike Qwen 2 Audio or GPT-4o-audio, it doesn't exactly understand the audio per se, it just transcribes the audio into text and then feeds it into the LLM.
RuthlessCriticismAll@reddit
I like how this complete lie is the top comment. Does no one check anything, or do people just upvote based on vibes?
fannovel16@reddit
Whisper is an encoder-decoder STT model, and MiniCPM-o uses the encoder part.
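(To make the distinction concrete: a hedged sketch of pulling continuous features from Whisper's encoder with transformers. This is not MiniCPM-o's actual wiring, just an illustration of what "using the encoder part" means, i.e. the LLM can be fed audio embeddings rather than a transcript.)

```python
# Hedged illustration: Whisper's encoder produces a sequence of continuous
# embeddings, not text. Prosody/emotion cues survive in these features,
# which is exactly what is lost if you only pass a transcript downstream.
import torch
from transformers import WhisperFeatureExtractor, WhisperModel

feature_extractor = WhisperFeatureExtractor.from_pretrained("openai/whisper-medium")
whisper = WhisperModel.from_pretrained("openai/whisper-medium")

waveform = torch.randn(16000 * 5)  # stand-in for 5 s of 16 kHz audio
inputs = feature_extractor(waveform.numpy(), sampling_rate=16000, return_tensors="pt")

with torch.no_grad():
    audio_embeds = whisper.encoder(inputs.input_features).last_hidden_state

print(audio_embeds.shape)  # (1, 1500, 1024) for whisper-medium
```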
Lynncc6@reddit (OP)
It's real audio input; it can understand the speaker's emotion and speak with emotion.
Few_Painter_5588@reddit
That's not audio understanding, that's still textual understanding. It cannot reason over the audio data like GPT-4o-audio or Qwen 2 Audio can. For example, if you ask MiniCPM to identify the points where the speakers change, it cannot do that, because it doesn't have native audio input.
metalman123@reddit
"MiniCPM-o 2.6 is the latest and most capable model in the MiniCPM-o series. The model is built in an end-to-end fashion based on SigLip-400M, Whisper-medium-300M, ChatTTS-200M, and Qwen2.5-7B with a total of 8B parameters. It exhibits a significant performance improvement over MiniCPM-V 2.6, and introduces new features for realtime speech conversation and multimodal live streaming. Notable features of MiniCPM-o 2.6 include:"