henryclw@reddit
I think this might take a while for the llama.cpp or vllm to support this
Lynncc6@reddit (OP)
yep, let's keep an eye on this PR: https://github.com/vllm-project/vllm/pull/12069
henryclw@reddit
Thank you for the info
Aaaaaaaaaeeeee@reddit
This looks like homelab streaming to device. Running on device is currently a challenge.
Their mobile VLM support for an old test app requires a ≥12 GB device.
From github: https://openbmb.oss-cn-hongkong.aliyuncs.com/model_center/mobile/android/MiniCPM-2.0.apk
CosmosisQ@reddit
Newer flagships like the Google Pixel 9 Pro XL come with 16 GB of RAM.
Aaaaaaaaaeeeee@reddit
Yes, though I do think NPU processing speed is important for VLMs, and I haven't seen good numbers. Current GPU and CPU performance on llama.cpp and MLC is usually only around 3x faster, rather than the 30-40x you get on Qualcomm's NPU.
If you're into RAM, you could get the Ace 2 Pro. For Android privacy, load iodeOS (or whatever you like as a GSI).
Qualcomm's HTP framework can work on the Gen 2 for LLMs (there have been successful runs posted in Qualcomm's Slack group), though you might need an Android 15 GSI.
Many_SuchCases@reddit
My first risky install of 2025 🫡
Aaaaaaaaaeeeee@reddit
It might work, maybe on an iPad with Apple silicon, using the web browser for hardware acceleration.
MoffKalast@reddit
I see we're back to "this 8B model beats GPT4" posting
Radiant_Dog1937@reddit
8B models that beat GPT-4 have been a reality for a while now. It just depends on which version of GPT-4 you're referring to.
CheatCodesOfLife@reddit
Gemma2 writes lots of small / simple words, and praises the user. This is the only reason it rates so highly on that leaderboard.
TheRealGentlefox@reddit
lmsys is borderline useless. It's just user preference. For actual capabilities I only really refer to personal experience, livebench, and simplebench.
MoffKalast@reddit
Lmsys doesn't really tell the whole story; it's focused on single-turn replies and it's not really objective. They even had to add style control (which doesn't really work either) to account for the fact that people just like nicely formatted bullshit better than correct facts in a single paragraph.
There are certainly areas where small models can roughly match that level of performance today, like writing a made-up story, chatting coherently, formatting text, extracting data, and other tasks that mainly require general language skill and model stability.
But there's really no way they can match something 200 times their size when it comes to knowing and applying knowledge, e.g. coding performance, being multilingual, or teaching about topics without hallucinating. There's a hard limit on how much information content you can squeeze into a dozen GB, it's just physics. Something in the 30B+ range is a lot more realistic for reproducing a similar level across most areas.
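(For rough scale on the size argument above, here is a hedged back-of-envelope sketch; it assumes 4-bit quantization and takes the "200 times the size" figure from the comment as ~1.6T parameters, since closed-model sizes are only rumors.)

```python
# Back-of-envelope weight-storage comparison. Parameter counts for closed
# models are unconfirmed; treat these numbers as purely illustrative.
def weight_gb(params_billions: float, bits_per_weight: float = 4.0) -> float:
    """Approximate size of quantized weights in GB."""
    return params_billions * 1e9 * bits_per_weight / 8 / 1e9

print(f"8B    -> {weight_gb(8):.0f} GB")     # ~4 GB: phone territory
print(f"32B   -> {weight_gb(32):.0f} GB")    # ~16 GB: single consumer GPU
print(f"1600B -> {weight_gb(1600):.0f} GB")  # ~800 GB: 200x an 8B model
```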
Radiant_Dog1937@reddit
I'm not so sure about that. Those are strong foundational models from top AI companies; just two short years ago an 8B model wasn't supposed to surpass the ~200B GPT-3.5, but that is definitely the norm today. It could be that the larger models are overcoming mistakes in training through redundancy in parameters, rather than the parameters directly scaling the models' abilities. I don't think GPT-4 level represents some upper limit for smaller models at all, and two years from now the meme will be "this model beats GPT o5".
D50HS@reddit
I have a strong feeling it will actually happen this year.
MrTubby1@reddit
I have a strong feeling that when it happens it will be from benchmarks contaminating the training data.
hapliniste@reddit
Nah, it will simply beat it but use 10x more tokens. Still a huge leap for models that can run on less than 24 GB of VRAM.
rorowhat@reddit
Maybe with a larger model, like 14B+ size
Embarrassed-Wear-414@reddit
“Runs on device” but not well at all
StupidityCanFly@reddit
You have to admit it doesn’t say “runs on any device”.
/s
CheatCodesOfLife@reddit
True, like "freshly laid" eggs were fresh when they were laid. And homegrown coffee beans -- grown in the home of some random bugs/insects.
a_slay_nub@reddit
That is an atrocious MMMU score. 50.4 vs 69.2 for GPT-4o. We really are back to the "this 8B model beats GPT4" phase.
Sadman782@reddit
Small models will always have a lower MMMU no matter how you train them under the current architecture, and it's just one metric. The previous vision-only model (MiniCPM-V 2.6) was a great model, and the current omni model's vision is even more powerful; for many tasks like OCR and other vision work it almost matches the much bigger GPT-4o. It's the first omni model in the style of OpenAI's GPT-4o, with realtime interruption, emotions, realtime accent changes, etc. It is not just a TTS. It is extremely underrated and under-hyped.
frivolousfidget@reddit
The fact that a small model “will always have lower” anything is what makes the statement “An 8B size, GPT-4o level in your phone” so amazing and why people click on the post. Your comment is describing the exact reason people are calling it clickbait….
Just don’t use that phrase unless the model is actually at the same level as 4o…
Mkengine@reddit
Is MiniCPM-V 2.6 still better if the only use case is OCR?
mtasic85@reddit
Do you have GPT4 open sourced and released by OpenAI, so you can use it locally, free of charge?
YearZero@reddit
Yeah, it's called DeepSeek V3 or Llama 405B. Just because a model has the benefits of open source, however, doesn't mean every open-source model beats GPT-4. It beats it in terms of being free and local, but not in terms of benchmarks or capabilities.
ServeAlone7622@reddit
Holy crap! I just finished playing with their gradio demo after reading the docs and WOW this is actually impressive.
https://minicpm-omni-webdemo-us.modelbest.cn/
I sent it a hard-to-read receipt from a Western Union transfer to my ex. I asked how much the transfer fee was.
It not only identified the correct amount but then did some math to predict how many pesos she would receive after the transfer, even though I intentionally cropped that part out. It also identified the sender, recipient, the time and date. Keep in mind I can barely read this thing because it’s old and faded AF.
After that I sent it a short video of my dog interacting with my cat. The thing about my dog is she doesn’t know she’s a dog and for some reason she’s in love with the cat. Neither of them are right in the head and they’re the same size and color, so in dim light sometimes I mix them up myself.
It correctly identified my dog, which is white, the size of a cat, and which Siri has classified as a mop in my picture gallery. It also correctly identified my cat, which is the same size as my dog and also white, but doesn’t look like a mop and isn’t particularly cat-like either.
It explained that the cat seems to be tolerating the dog despite the dog licking her. (Yeah it got the genders wrong but who cares).
Compare this to the last model I tried (ChatGPT o1), which said it was a video of a cat acting strangely around a mop. 🤦♂️
Anyways, this model so far seems to be everything they claim. You really should check it out.
Mr-Barack-Obama@reddit
You sent a video to o1? I don’t think that’s possible…
ServeAlone7622@reddit
No I sent it pictures, screenshots of the video. To see if I could find a baseline for visual reasoning as I was trying out different multimodal LLMs.
So perhaps the comparison is flawed. In either event it’s funny.
Ok_Phase_8827@reddit
wow
ArsNeph@reddit
Damn it, I misread that as "8B model is on the same level as GPT4o mini" and got really excited 😭
Many_SuchCases@reddit
GGUF: https://huggingface.co/openbmb/MiniCPM-o-2_6-gguf
Original weights: https://huggingface.co/openbmb/MiniCPM-o-2_6
noneabove1182@reddit
Did they add GGUF support locally and then not upstream it? It's the MiniCPMO arch, so it definitely won't be convertible on master...
segmond@reddit
GGUF weights? What inference engine can run this in GGUF?
AaronFeng47@reddit
No, it's definitely not GPT-4o level. I'm not saying Qwen 2.5 is bad, their 32B is my main local model, but the 7B one is definitely not GPT-4o level lol
ServeAlone7622@reddit
I’ve been impressed with it. It gives excellent results out of the box, and it’s cheap and easy to fine-tune and make LoRAs for.
You should check out the Marco-o1 model (also based on Qwen 7B). With a quick polish on the react dataset it codes better than Claude or ChatGPT.
What shocked me the most is that everything from the 7B model on up actually does seem to support the full 128K context; you just have to be creative with the YaRN rope-scaling config (see the sketch after this comment).
I am genuinely pleased with Qwen 2.5 series.
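(A hedged sketch of the rope-scaling trick mentioned above, assuming a Qwen2.5 instruct checkpoint and the YaRN rope_scaling values published in Qwen's model card; the exact factor depends on how far past the native 32K window you want to go.)

```python
# Hedged sketch: enable YaRN rope scaling on a Qwen2.5 model via transformers
# so the usable context extends toward 128K. Values follow the Qwen2.5 card.
from transformers import AutoConfig, AutoModelForCausalLM, AutoTokenizer

model_id = "Qwen/Qwen2.5-7B-Instruct"  # assumed checkpoint; any 7B+ Qwen2.5 should work

config = AutoConfig.from_pretrained(model_id)
config.rope_scaling = {
    "type": "yarn",
    "factor": 4.0,                              # 32768 * 4 = 131072 tokens
    "original_max_position_embeddings": 32768,  # Qwen2.5's native window
}
config.max_position_embeddings = 131072

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, config=config, torch_dtype="auto", device_map="auto"
)
```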
Few_Painter_5588@reddit
A bit misleading: the audio component is Whisper-medium rather than a true audio encoder. So unlike Qwen 2 Audio or GPT-4o-audio, it doesn't exactly understand the audio per se, it just transcribes the audio into text and then feeds it into the LLM.
RuthlessCriticismAll@reddit
I like how this complete lie is the top comment. Does no one check anything, or do people just upvote based on vibes?
fannovel16@reddit
Whisper is an encoder-decoder STT model, and MiniCPM-o uses the encoder part.
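(To make the distinction concrete: a hedged sketch of pulling continuous features from Whisper's encoder with transformers. This is not MiniCPM-o's actual wiring, just an illustration of what "using the encoder part" means, i.e. the LLM can be fed audio embeddings rather than a transcript.)

```python
# Hedged illustration: Whisper's encoder produces a sequence of continuous
# embeddings, not text. Prosody/emotion cues survive in these features,
# which is exactly what is lost if you only pass a transcript downstream.
import torch
from transformers import WhisperFeatureExtractor, WhisperModel

feature_extractor = WhisperFeatureExtractor.from_pretrained("openai/whisper-medium")
whisper = WhisperModel.from_pretrained("openai/whisper-medium")

waveform = torch.randn(16000 * 5)  # stand-in for 5 s of 16 kHz audio
inputs = feature_extractor(waveform.numpy(), sampling_rate=16000, return_tensors="pt")

with torch.no_grad():
    audio_embeds = whisper.encoder(inputs.input_features).last_hidden_state

print(audio_embeds.shape)  # (1, 1500, 1024) for whisper-medium
```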
Lynncc6@reddit (OP)
It's real audio input; it can understand the speaker's emotion and speak with emotion.
Few_Painter_5588@reddit
That's not audio understanding, that's still textual understanding. It cannot reason over the audio data like GPT-4o-audio or Qwen 2 Audio can. For example, if you ask MiniCPM to identify the points where the speakers change, it cannot do that, because it doesn't have native audio input.
metalman123@reddit
"MiniCPM-o 2.6 is the latest and most capable model in the MiniCPM-o series. The model is built in an end-to-end fashion based on SigLip-400M, Whisper-medium-300M, ChatTTS-200M, and Qwen2.5-7B with a total of 8B parameters. It exhibits a significant performance improvement over MiniCPM-V 2.6, and introduces new features for realtime speech conversation and multimodal live streaming. Notable features of MiniCPM-o 2.6 include:"