nmkd@reddit
uhm... GGUF when?
r4in311@reddit
Every few days a new OCR model gets released, and every single one claims SOTA results in some regard. You read this and think that OCR is pretty much "solved" by now, but that's not really the case. In real-world applications, you need a way to accurately turn the embedded images (plots, graphics, etc.) in those PDFs into text, to minimize information loss. For that, you need a 100B+ multimodal LLM. These small OCR models typically just ignore them.
random-tomato@reddit
One thing I'm really bothered by is that these new OCR models really suck at converting from screenshots of formatted text --> markdown. Every model claims "SOTA on X benchmark" but then when I actually try it, it's inconsistent as hell and I always end up falling back to something like Gemini 2.0 Flash or Qwen3 VL 235B Thinking.
r4in311@reddit
Yeah, same here. After lots of testing, the only solution I came up with was Gemini. You basically need the entire thing in context (and also enough model parameters) to generate good descriptions for embedded images, and that requires a ton of world knowledge. No way a 1B can do that; those are basically text-only models.
hp1337@reddit
Agreed. We need something like a Kimi-Linear-VL-235B. That would be the GOAT for OCR: on the order of Gemini quality, but able to run on pseudo-consumer hardware.
Intelligent-Form6624@reddit
Please add to OCR Arena
the__storm@reddit
This is only tangentially related, but I have to say: OmniDocBench is too easy. It's nowhere near the difficulty of the insane documents I see at work. We need a harder OCR benchmark.
(I think the problem is that published documents tend to be more cleaned up than the stuff behind the scenes. When I see a challenging document at work I of course cannot add it to a public dataset.)
aichiusagi@reddit
Found the same thing. DotsOCR in layout mode is the best overall on our stuff, despite Deepseek-OCR and Chandra beating it on OmniDocBench. It's slower than those, though (although it has a license we can actually use, unlike Chandra).
SlowFail2433@reddit
1B model beat 200+B wow
Medium_Chemist_4032@reddit
Those new models almost always come with a vllm template... Is there a llama-swap equivalent for vllm?
SlaveZelda@reddit
llama-swap should also work with vLLM, I think.
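As I understand it, llama-swap just proxies an OpenAI-compatible API and launches whatever command you configure per model, so a vLLM entry behaves the same as a llama-server one from the client's side. A minimal sketch of what the client call looks like (the model name, port, and prompt here are assumptions, not from this thread, and the model name has to match an entry in your llama-swap config):

```python
# Minimal sketch: querying a llama-swap proxy whose config entry for
# "dots-ocr" launches a vLLM server instead of llama-server.
# Model name, port, and prompt are placeholders/assumptions.
import base64
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8080/v1", api_key="none")

with open("page.png", "rb") as f:
    image_b64 = base64.b64encode(f.read()).decode()

resp = client.chat.completions.create(
    model="dots-ocr",  # must match a model entry in llama-swap's config
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": "Convert this page to markdown."},
            {"type": "image_url",
             "image_url": {"url": f"data:image/png;base64,{image_b64}"}},
        ],
    }],
)
print(resp.choices[0].message.content)
```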
R_Duncan@reddit
Sadly this requires a nightly build of transformers, so it will likely not work with llama.cpp until the corresponding patch is ported.
Finanzamt_kommt@reddit
? llama.cpp doesn't rely on transformers; it has its own implementation?
tomz17@reddit
Right... so someone has to dig through those brand-new changes in transformers and then implement that code in C++ before you'll see support in llama.cpp.
Finanzamt_kommt@reddit
Indeed, but it's not blocked by the nightly transformers requirement; even if it weren't nightly, we still wouldn't have support.
R_Duncan@reddit
Exactly. But in those 2 files there's plenty of customization for this OCR model, starting from the Hunyuan family. I don't think all those parameters can be reduced to command-line flags for llama-swap/llama-server.
Finanzamt_kommt@reddit
Well yeah, it has to have support there in C++ /:
silenceimpaired@reddit
Good thing it's such a small model; I can probably get by with transformers.
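For reference, a rough sketch of what plain-transformers inference for a small OCR VLM usually looks like. The model id and prompt format below are placeholders (check the actual model card), and per the comments above you may need a nightly transformers build:

```python
# Rough sketch of plain-transformers inference for a small OCR VLM.
# The model id and chat format are placeholders, not a real repo;
# the actual model may require trust_remote_code and nightly transformers.
import torch
from PIL import Image
from transformers import AutoModelForCausalLM, AutoProcessor

model_id = "some-org/some-1b-ocr-model"  # placeholder

processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    trust_remote_code=True,
    torch_dtype=torch.bfloat16,
    device_map="auto",
)

image = Image.open("page.png")
messages = [{
    "role": "user",
    "content": [
        {"type": "image"},
        {"type": "text", "text": "Convert this page to markdown."},
    ],
}]
prompt = processor.apply_chat_template(messages, add_generation_prompt=True)
inputs = processor(text=prompt, images=image, return_tensors="pt").to(model.device)

with torch.no_grad():
    out = model.generate(**inputs, max_new_tokens=2048)

# Decode only the newly generated tokens.
print(processor.decode(out[0][inputs["input_ids"].shape[-1]:],
                       skip_special_tokens=True))
```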
danigoncalves@reddit
Actually I was thinking the same...
UnionCounty22@reddit
Well, when you slice a model down to a billion parameters and turn it into a domain specialist on a tight niche without too much variation in function, it's going to be extremely accurate. Super cool, I agree.
kmuentez@reddit
.
exaknight21@reddit
Oh hot dang son. This is crazy.