MiniCPM-V-4

lly0571@reddit (OP)

https://preview.redd.it/bpy1hvrqoihf1.png?width=3835&format=png&auto=webp&s=fe9dbeabaa53cb1fcefaeb927e53d9042ee39e8a

Among the three 4B-level VLMs (using a Q6 GGUF for MiniCPM, F16 weights with vLLM for Qwen, and the Q4 Ollama GGUF for Gemma), I still think Qwen2.5-VL-3B performs relatively better at extracting structured information from images. However, I'm particularly interested in this model's video understanding capability. Its token density is high: a 448×448 image is encoded as a single tile of 64 tokens, so each token represents roughly 3,100 pixels. That could make it a promising candidate for training a compact video understanding model.
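For reference, the arithmetic behind that density figure can be sketched as below. The tile size and token count are the ones quoted in the post. The `video_tokens` helper and its one-tile-per-frame assumption are purely illustrative and are not taken from the model's documentation.

```python
# Back-of-the-envelope token density, using the figures quoted in the post.
TILE_SIDE = 448        # pixels per side of one image tile
TOKENS_PER_TILE = 64   # visual tokens produced for one tile

pixels_per_tile = TILE_SIDE * TILE_SIDE
pixels_per_token = pixels_per_tile / TOKENS_PER_TILE
print(pixels_per_token)  # 3136.0


def video_tokens(num_frames, tokens_per_frame=TOKENS_PER_TILE):
    """Visual tokens for a clip.

    Hypothetical helper: assumes every sampled frame fits in a single
    448x448 tile, which may not match how the model actually samples video.
    """
    return num_frames * tokens_per_frame
```

Under that assumption, a clip sampled at 60 frames would cost `video_tokens(60)` = 3,840 visual tokens, which is why the density matters for video.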

ali0une@reddit

With llama.cpp, their Q4_K_M GGUF answers in Chinese most of the time if I provide an image and ask "describe image in details"... Is there a way to make it answer in English only, or do I have a skill issue?

hapliniste@reddit

GPT-OSS for orchestration and tool calls, with this model as a "vision tool" to do some UI use? Can't wait for the next 2 months with the new models.

paryska99@reddit

Hell yeah, MiniCPM always seemed to deliver some interesting capability

abskvrm@reddit

Vision capability looks good.

MustBeSomethingThere@reddit

They have a GGUF version too: [https://huggingface.co/openbmb/MiniCPM-V-4-gguf](https://huggingface.co/openbmb/MiniCPM-V-4-gguf)
