mistral-small-3.2 OCR accuracy way too bad with llama.cpp compared to ollama?

Awwtifishal@reddit

Which mmproj quant are you using with llama.cpp?

HumanAppointment5@reddit

Have a look at https://huggingface.co/bartowski/Qwen_Qwen2.5-VL-72B-Instruct-GGUF/tree/main as an example: regardless of the quant you select, it always uses the 16-bit quant of the mmproj. Even so, the OCR result is not as accurate as Ollama's. Ollama, however, embeds the mmproj file into their quant.

Awwtifishal@reddit

Ollama's embedded mmproj weights are FP16 for the most part. Could you share an example that differs between the two? If not, can you try rescaling the picture to a maximum of 1540 pixels for Mistral or 560 for Qwen? Maybe each application does something different when the picture is too big. I just realized Ollama's Qwen doesn't have the limit of 560 in the GGUF. Unless it's hardcoded somewhere else, that may be an important difference. But if that's the case, I can't explain the difference for Mistral (the limit is in both).
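For anyone who wants to try the rescaling test, here is a minimal sketch of the size calculation. It assumes the limit applies to the longer side of the image, which the thread does not confirm, and `fit_within` is a made-up helper name, not part of llama.cpp or Ollama.

```python
def fit_within(width: int, height: int, max_side: int) -> tuple[int, int]:
    """Return (width, height) scaled down so the longer side is at most max_side.

    Assumption: the limit is on the longer side. Images that already fit
    are returned unchanged (no upscaling).
    """
    if width <= 0 or height <= 0 or max_side <= 0:
        raise ValueError("width, height and max_side must be positive")
    longest = max(width, height)
    if longest <= max_side:
        return width, height
    scale = max_side / longest
    # Round to whole pixels, but never collapse a side to zero.
    return max(1, round(width * scale)), max(1, round(height * scale))


# Limits quoted in the thread: 1540 for Mistral, 560 for Qwen.
print(fit_within(3080, 2000, 1540))  # (1540, 1000)
print(fit_within(1024, 768, 560))    # (560, 420)
```

The resulting size can then be passed to whatever image tool you use for the actual resize, so both applications receive an identical, already-small image.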

HumanAppointment5@reddit

OK, I tested Mistral 3.2 bf16: Ollama vs. bartowski vs. MLX, all with the same handwritten note at 1024 pixels. The clear winner is the Ollama version, with better accuracy. I also tested them with the note resized to 560 pixels. Again, the Ollama version has much better accuracy.
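One way to put a number on comparisons like this is the character error rate (CER): edit distance between a hand-typed reference transcription and the model output, divided by the reference length. This is only a sketch of a common metric, not necessarily how the results above were judged, and the function names are illustrative.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance: minimum insertions, deletions and substitutions."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def character_error_rate(reference: str, hypothesis: str) -> float:
    """CER = edit distance / length of the reference transcription."""
    if not reference:
        raise ValueError("reference must not be empty")
    return levenshtein(reference, hypothesis) / len(reference)


# One wrong character in a ten-character name.
print(character_error_rate("John Smith", "John Smyth"))  # 0.1
```

Running the same reference against the output of each backend gives a comparable score per image; lower is better.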

caetydid@reddit (OP)

Thanks for sharing the results of these experiments. I'm not sure Ollama is resizing; I think it might be splitting the image and processing the parts one after another.

Awwtifishal@reddit

According to the gguf files, neither is resizing it because the max is 1540 in both cases.

HumanAppointment5@reddit

Thank you. I will have to try that. I've been using images of 1024 pixels with Qwen. If Ollama's Qwen does not have the limit of 560 pixels, and the bartowski GGUFs (and others) do, it could explain my results. I'll crop to 560 pixels and test again (tomorrow). I'm also downloading the Ollama Mistral version to test it for my use case of handwritten notes.

caetydid@reddit (OP)

F16

triynizzles1@reddit

Many people still think Ollama is just a llama.cpp wrapper. It now uses its own engine. llama.cpp separates the vision part of the model from the text part. This does not happen when using Ollama. You can read through Ollama’s vision update post from a while back; they explain it well.

HumanAppointment5@reddit

I found the best OCR accuracy to date with the Ollama version of Qwen-2.5-VL. Tried the GGUF versions and they were all a lot worse. I've also been baffled as to why the Ollama vision model is so much better. Ollama has the vision projector embedded into the model, where the GGUF files use a separate mmproj file. Have you tried Qwen-2.5-VL?

Gregory-Wolf@reddit

I kind of didn't understand half of what you are saying :) "Tried the GGUF versions and they were all a lot worse" - so you are saying "ollama has different versions from GGUF"? No, ollama uses GGUFs. "Ollama vision model" - ollama is inference software, what vision model does it have? "Ollama has the vision projector embedded into the model, where the GGUF files use a separate mmproj file." - ollama encapsulates llama.cpp as an inference engine (at least it did, not sure if it changed; I never used ollama because why)

HumanAppointment5@reddit

To clarify: I tried the GGUF versions created by bartowski and others, and used them with llama.cpp and LM Studio. When you look at the list of files on Hugging Face, they all have a separate mmproj file used for the vision functionality. Ollama, on the other hand, embeds this in their version of the model. Ollama gives some more details at https://ollama.com/blog/multimodal-models where they say, "llama.cpp offers first-class support for text-only models. For multimodal systems, however, the text decoder and vision encoder are split into separate models and executed independently. Passing image embeddings from the vision model into the text model therefore demands model-specific logic in the orchestration layer that can break specific model implementations. Within Ollama, each model is fully self-contained and can expose its own projection layer, aligned with how that model was trained."
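To make the "projection layer" part of that quote a bit more concrete, here is a toy sketch of what a projector does conceptually: it maps each image patch embedding from the vision encoder's dimension into the text decoder's embedding dimension, and the result is spliced in between the text token embeddings. The dimensions, the plain matrix multiply and all names are illustrative assumptions; real projectors are model-specific and this is not how either llama.cpp or Ollama implements it.

```python
def matvec(matrix, vector):
    """Multiply a (rows x cols) matrix by a vector of length cols."""
    if any(len(row) != len(vector) for row in matrix):
        raise ValueError("matrix columns must match vector length")
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def project_image_embeddings(patch_embeddings, projection):
    """Map every vision-space patch embedding into text-embedding space."""
    return [matvec(projection, patch) for patch in patch_embeddings]


def build_decoder_input(text_before, image_patches, text_after, projection):
    """Splice projected image embeddings between two runs of text embeddings."""
    projected = project_image_embeddings(image_patches, projection)
    return text_before + projected + text_after


# Toy sizes: vision dimension 3, text dimension 2.
projection = [[1.0, 0.0, 0.0],
              [0.0, 1.0, 1.0]]
sequence = build_decoder_input(
    text_before=[[0.5, 0.5]],
    image_patches=[[1.0, 2.0, 3.0]],
    text_after=[[9.0, 9.0]],
    projection=projection,
)
print(sequence)
```

If the splice position, patch order or preprocessing differs from what the model saw in training, the decoder still runs but reads the image less accurately, which is the kind of breakage the blog post describes.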

Gregory-Wolf@reddit

My bad, I posted in the wrong place. The comment was actually meant for the OP :-P Anyway, thanks for the answer. I see that ollama moved from llama.cpp to using the ggml lib directly. I guess it's still GGUF though, but since they don't rely on mmproj anymore, perhaps that could affect the results.

caetydid@reddit (OP)

Maybe the vision projector implementation in llama.cpp is just inferior? I do not know how the vision projector in Ollama works. I'd prefer to use Ollama, but I cannot live with 5 t/s.

HumanAppointment5@reddit

I read somewhere that you can copy (or link) the Ollama files and then use them with llama.cpp or via LM Studio. You can then test to see if you can get the same Ollama accuracy with the better 20-40 t/s. If the accuracy is not the same then we know the magic is in the new Ollama engine.
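A rough sketch of how one might locate the weights file for such a test. It rests on assumptions about Ollama's on-disk layout that I have not verified against current releases: that each model has a JSON manifest listing layers, that the weights layer has media type `application/vnd.ollama.image.model`, and that blobs live under `blobs/` named after the digest with `:` replaced by `-`. Whether llama.cpp then accepts the file is a separate question.

```python
from pathlib import Path

# Assumed media type of the weights layer in an Ollama manifest.
MODEL_LAYER = "application/vnd.ollama.image.model"


def blob_path_for_model(manifest: dict, models_dir: Path) -> Path:
    """Return the assumed path of the weights blob named in a manifest."""
    for layer in manifest.get("layers", []):
        if layer.get("mediaType") == MODEL_LAYER:
            # Assumed naming: digest "sha256:<hex>" is stored as "sha256-<hex>".
            return models_dir / "blobs" / layer["digest"].replace(":", "-")
    raise LookupError("no model layer found in manifest")


# Hypothetical manifest content, for illustration only.
example_manifest = {
    "layers": [
        {"mediaType": "application/vnd.ollama.image.license", "digest": "sha256:aaa"},
        {"mediaType": MODEL_LAYER, "digest": "sha256:abc123"},
    ]
}
print(blob_path_for_model(example_manifest, Path("/tmp/models")))
```

The returned path could then be symlinked to a `.gguf` name and pointed at from llama.cpp or LM Studio to see whether it loads and how the accuracy compares.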

HumanAppointment5@reddit

For OCR and HTR (handwritten text recognition), the temperature, Top P and Min P should all be 0, and Top K must be 1. Basically it means no "creativity": with K = 1 there is only one candidate token in the list, so the model always picks the most likely one.
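To show why Top K = 1 removes the randomness, here is a toy top-k sampler. It is a simplified illustration, not the sampler code of llama.cpp or Ollama, and `sample_token` is a made-up name.

```python
import math
import random


def sample_token(logits, temperature, top_k, rng):
    """Pick a token index from the top_k highest logits.

    With top_k == 1 (or temperature <= 0) this is plain greedy decoding:
    the single most likely token is returned every time.
    """
    if top_k < 1:
        raise ValueError("top_k must be at least 1")
    ranked = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)
    candidates = ranked[:top_k]
    if len(candidates) == 1 or temperature <= 0:
        return candidates[0]
    # Softmax over the surviving candidates, sharpened or flattened by temperature.
    scaled = [logits[i] / temperature for i in candidates]
    peak = max(scaled)
    weights = [math.exp(s - peak) for s in scaled]
    return rng.choices(candidates, weights=weights, k=1)[0]


rng = random.Random(0)
logits = [0.1, 2.5, 1.0]
# Greedy settings: always the index of the largest logit.
print({sample_token(logits, 0.0, 1, rng) for _ in range(100)})  # {1}
```

Once only one candidate survives the Top K cut, temperature, Top P and Min P have nothing left to choose between, so the output is deterministic for a given input.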

pseudonerv@reddit

Yeah. Something is definitely off with the Mistral Small vision adapter in llama.cpp. But I’m not a good enough programmer to figure out what it is.

fp4guru@reddit

Any examples? I happen to have mistral small q4 running with mmproj.

Gregory-Wolf@reddit

Same. Running q4_km in production with mmproj.

caetydid@reddit (OP)

and are you satisfied with accuracy?

Gregory-Wolf@reddit

Our case is kind of non-standard: the image data is of very variable quality. But let's say it gives 50% accuracy, which is kind of OK because it cuts manual labour by 50%.

caetydid@reddit (OP)

Nah... temp=0 yielded the fewest inaccuracies in Ollama, so I applied it here as well. It does not seem to make a big difference up to 0.2, which is the recommended default for the model. I don't have examples, but I am processing medical forms and it occasionally messes up person names, dates and addresses. When I look at the source image, though, it is not like the text is illegible.

Cergorach@reddit

Try looking at olmocr or rolmocr.

caetydid@reddit (OP)

when i compared numbers it was always performing worse than qwen2.5-vl and mistral
