I made a free playground for comparing 10+ OCR models side-by-side
Posted by Emc2fma@reddit | LocalLLaMA | 83 comments
It's called OCR Arena, you can try it here: https://ocrarena.ai
There are so many new OCR models coming out all the time, but testing them is really painful. I wanted to give the community an easy way to compare leading foundation VLMs and open-source OCR models side-by-side. You can upload any doc, run a variety of models, and view diffs easily.
So far I've added Gemini 3, dots, DeepSeek-OCR, olmOCR 2, Qwen3-VL-8B, and a few others.
Would love any feedback you have! And if there's any other models you'd like included, let me know.
(No surprise, Gemini 3 is top of the leaderboard right now)
zedd1704@reddit
I am wondering how you are prompting the models in the backend. Is it just "parse the pdf"?
rainbow3@reddit
Do any of these models return the coordinates of each row of text?
I am looking for a replacement for a Tesseract project. I found many that do better OCR but didn't provide coordinates.
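For what it's worth, classic Tesseract does expose word-level boxes via pytesseract's image_to_data, and rows can be reassembled from its line numbering. A minimal sketch of that grouping step — the sample dict mimics pytesseract's output layout, and rows_from_tsv_dict is a made-up helper name:

```python
# Tesseract (via pytesseract) returns word-level boxes:
#   data = pytesseract.image_to_data(img, output_type=pytesseract.Output.DICT)
# Below, those word boxes are grouped into one bounding box per text row,
# using a hard-coded sample in the same dict layout.

def rows_from_tsv_dict(data):
    """Group word boxes into one (text, box) pair per (block, par, line)."""
    rows = {}
    for i, word in enumerate(data["text"]):
        if not word.strip():
            continue  # skip the empty cells tesseract emits for structure
        key = (data["block_num"][i], data["par_num"][i], data["line_num"][i])
        l, t = data["left"][i], data["top"][i]
        r, b = l + data["width"][i], t + data["height"][i]
        text, box = rows.get(key, ("", (l, t, r, b)))
        box = (min(box[0], l), min(box[1], t), max(box[2], r), max(box[3], b))
        rows[key] = ((text + " " + word).strip(), box)
    return [rows[k] for k in sorted(rows)]

# Sample in pytesseract's image_to_data dict layout (two words, one line).
sample = {
    "text": ["Hello", "world"],
    "block_num": [1, 1], "par_num": [1, 1], "line_num": [1, 1],
    "left": [10, 60], "top": [20, 22], "width": [40, 45], "height": [12, 10],
}
print(rows_from_tsv_dict(sample))  # [('Hello world', (10, 20, 105, 32))]
```

The VLM-based models generally just emit markdown, so if you need coordinates you'd have to check each model's docs for a grounding/bbox mode.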
peteror@reddit
Really cool! Are you using any specific prompt to call these models? I'm building something that processes mostly invoices / receipts and I get quite good results in general with a very specific prompt, but I found a few tricky cases that give way better results on OCR Arena than what I get on the same models (GPT 5.1 mostly)
Imaginary_Leg_9383@reddit
I noticed you can view and even customize the models in the playground under "advanced settings"
peteror@reddit
Good catch, I didn't notice that, thank you!
Flimsy_Requirement30@reddit
Thanks OP. Can you share what thinking level you used for Gemini 3? I find it makes a lot of difference to use a high thinking level with Gemini 3, and it would be great to get this detail right!
SarcasticBaka@reddit
Great idea! Paddle-VL and MinerU are considered top dogs for OCR iirc, so probably useful to add them. Nanonets, LightOnOCR and Chandra OCR are popular recent releases as well.
ajw2285@reddit
Is there an easy way to deploy Paddle? I am a noob and limited to Ollama
Mayonnaisune@reddit
Imo, the older versions of the paddleocr Python package (<= 2.10.0 iirc), which support up to PP-OCRv4 models and still use .ocr() instead of .predict(), are easier to use than the newer ones. They are faster and lighter too, but may not be as accurate and are clearly not as up to date as the newer ones.
the__storm@reddit
I would give the vllm backend a try: https://docs.vllm.ai/projects/recipes/en/latest/PaddlePaddle/PaddleOCR-VL.html
Paddle in general is notoriously hard to get running (although it might be better if you can read the Chinese version of the docs). For the older non-VL Paddle OCR models, there's also RapidOCR. It's still kind of awkward and poorly documented but definitely easier than PaddlePaddle.
Emc2fma@reddit (OP)
on it, will have them deployed shortly!
paton111@reddit
Very cool project. At Tomedes we built something similar on our side – a tool that lets you compare OCR outputs from multiple AI models and also shows the most common output for each element so you can see where the models agree. It’s been super useful for spotting consistency. You can try it here: https://www.tomedes.com/tools/image-to-text
cviperr33@reddit
thank you! will check it out when i get home
kathirai@reddit
As far as I've tested with written notes (in caps letters), Gemini 3 performs well but still falls short. You can test with the samples the site provides, or upload and test your own.
markingup@reddit
I'd love it if anyone has good OCR tests to share. Finding it tough to find validation data
iamn0@reddit
Just like on lmarena.ai, we need the ability to vote that both models performed equally well. I had a case where both produced identical results
Emc2fma@reddit (OP)
makes total sense, shipping shortly!
nderstand2grow@reddit
you mean, "vibing" shortly? :)
rm-rf-rm@reddit
Yeah, the first (and only) one I tried was a match. It causes a false result if you vote for one over the other in that case, and I'm guessing this happens quite often
RegisteredJustToSay@reddit
Same. Also one for when neither was good, for those cases where neither should benefit from an Elo boost.
PM_ME_COOL_SCIENCE@reddit
Please add PaddleOCR-VL! I've found it to be the best OCR model outside of the big proprietary models.
Sawada1_@reddit
Do you mind giving some hints on how you got it running?
AnyJeweler787@reddit
Awesome!
MrMrsPotts@reddit
Should we avoid uploading images in different languages? I have a mixed Arabic/English document for example.
TailSpinBowler@reddit
uploaded a personal letter. hoping they get purged afterwards.
DigThatData@reddit
that's so funny that this -- of all things -- is still an unsolved problem.
Barry_Jumps@reddit
Love it. Is the code open? I'm sure the community would appreciate running it themselves and bringing their own keys to take some of the inference cost burden off your site.
versedaworst@reddit
This is amazing work and much needed, I feel like the past few months I’ve been relying on random blog posts for assessments of new OCR models. Hopefully it’s financially sustainable for a while.
Emc2fma@reddit (OP)
you and me both haha
dugganmania@reddit
Add a donate button, my man - you can crowdsource some $$. I've been doing ad-hoc research also, so you're potentially saving me tons of time
AdventurousFly4909@reddit
You are already giving them training data...
dugganmania@reddit
Sure but they’re providing a service too that I at least haven’t been able to replicate for OCRs without my own testing
TedHoliday@reddit
I haven’t really used OCR a lot in anything I’ve worked on, so pardon my ignorance, but I always figured OCR was sort of a solved problem. Is that not the case?
youarebritish@reddit
I have still not found a reliable model for Japanese. Three main pain points remain: the OCR confusing kanji for similar-looking kanji (agonizing to spot), failing on vertical text, and not dealing with ruby text.
the__storm@reddit
The actual character recognition (for typed text) is mostly solved - not as good as a human but usually good enough - but handling of complex layouts is very much not solved. Even Gemini 3 and GPT 5.1 fail at the first hurdle (I usually throw this one at them as a first test - the world is full of insane document layouts like this, and worse).
SarcasticBaka@reddit
Not at all really, maybe for super clean digitally created documents but not for anything older, with a complex layout or handwriting, etc. I deal with a lot of paperwork day to day so I've always kept a close eye on the advancement of OCR tech, before VLMs I used software like Abbyy FineReader or Adobe Acrobat which provide decent but definitely not great results depending on the scan quality.
schemathings@reddit
It's not loading for me - wondering if granite-docling is on there, been hearing good things about it.
the__storm@reddit
granite-docling is cool for being so small, but my experience is that it's basically worthless for anything more complicated than a book layout (just straight paragraphs of text). It would definitely lose to all the models currently on the leaderboard.
theZeitt@reddit
One of the models got stuck in a loop just writing "driving safety", so some way to cancel an ongoing prompt would be nice. It might also be good to automate that with a timeout from the first token out
the__storm@reddit
Yeah common problem with these OCR models (probably because the temperature has to be set really low). Definitely should have some guardrails on the generation.
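A guardrail like that can be cheap to bolt on at decode/client time. A minimal sketch, with invented names and thresholds, that stops consuming a stream once the output's tail is just one phrase repeated:

```python
# Sketch of a repetition guardrail for OCR models that get stuck in a loop.
# After each streamed chunk, check whether the output now ends in the same
# short phrase repeated over and over, and cancel the request if so.

def looks_stuck(text, max_phrase=40, min_repeats=5):
    """True if text ends with some phrase (<= max_phrase chars) repeated
    at least min_repeats times in a row."""
    for length in range(1, max_phrase + 1):
        tail = text[-length:]
        if len(tail) == length and text.endswith(tail * min_repeats):
            return True
    return False

out = ""
for chunk in ["driving safety "] * 40:  # stand-in for a streaming response
    out += chunk
    if looks_stuck(out):
        break  # here you'd cancel the request instead of waiting it out
print(len(out))  # stops after 5 repeats (75 chars), not 40
```

A wall-clock timeout from the first token, as suggested above, would catch the cases a string check misses (e.g. repetition with slight variation).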
the__storm@reddit
Might be nice to put a couple of old standbys in there, like Tesseract and EasyOCR. They can't handle more complicated documents but they're very widely used (and fast) and provide a good baseline.
mace_guy@reddit
How are you absorbing the cost?
Emc2fma@reddit (OP)
I run a doc processing company (https://extend.ai) and we're just lighting money on fire at the moment (this took off way more than expected, so we scaled up the GPUs)
But I feel strongly that this should exist for the community, so we'll (1) keep funding it and (2) open-source it soon
(if any investors find this thread in the future, just call this part of our CAC)
the__storm@reddit
Open source would be awesome - would take some load off your GPUs and I could run company documents through it.
AccordingRespect3599@reddit
The "vote model 2" page is empty.
ConstantinGB@reddit
as a total layman: what is OCR?
Imaginary_Leg_9383@reddit
Optical Character Recognition (OCR) - converting documents or PDFs into editable and searchable data. Step changes in LLMs / VLMs have really changed the landscape tho
ConstantinGB@reddit
Oh that's interesting. I'm building my own local LLM Agent (so Ollama LLMs and building tools and UI around it) and one of the next steps is to have it scan, transcribe and catalogue scanned documents and PDFs so I should definitely look into that.
geek_at@reddit
Not sure if it happens every time, but when I load the page, upload a JPEG, pick a winner, and click the "new battle" button, the upload doesn't work anymore. As in, I have to reload the page for the upload to work (nothing happens after file selection)
Mkengine@reddit
Thank you, I was really missing something like that. Would you consider adding some of the following models?
GOT-OCR: https://huggingface.co/stepfun-ai/GOT-OCR2_0
granite-docling-258m : https://huggingface.co/ibm-granite/granite-docling-258M
Dolphin: https://huggingface.co/ByteDance/Dolphin
MinerU 2.5: https://huggingface.co/opendatalab/MinerU2.5-2509-1.2B
OCRFlux: https://huggingface.co/ChatDOC/OCRFlux-3B
MonkeyOCR-pro: 1.2B: https://huggingface.co/echo840/MonkeyOCR-pro-1.2B 3B: https://huggingface.co/echo840/MonkeyOCR-pro-3B
FastVLM: 0.5B: https://huggingface.co/apple/FastVLM-0.5B 1.5B: https://huggingface.co/apple/FastVLM-1.5B 7B: https://huggingface.co/apple/FastVLM-7B
MiniCPM-V-4_5: https://huggingface.co/openbmb/MiniCPM-V-4_5
GLM-4.1V-9B: https://huggingface.co/zai-org/GLM-4.1V-9B-Thinking
InternVL3_5: 4B: https://huggingface.co/OpenGVLab/InternVL3_5-4B 8B: https://huggingface.co/OpenGVLab/InternVL3_5-8B
AIDC-AI/Ovis2.5 2B: https://huggingface.co/AIDC-AI/Ovis2.5-2B 9B: https://huggingface.co/AIDC-AI/Ovis2.5-9B
RolmOCR: https://huggingface.co/reducto/RolmOCR
Qwen3-VL: Qwen3-VL-2B Qwen3-VL-4B Qwen3-VL-30B-A3B Qwen3-VL-32B Qwen3-VL-235B-A22B
eltonjohn007@reddit
Maybe add Qwen3-VL-235B?
Intelligent-Form6624@reddit
Nice 👍
sdkgierjgioperjki0@reddit
This seems to contain both VLMs and pure OCR models without labeling which is which. DeepSeek actually has a VLM similar to Qwen's, although it's a bit old now; I wonder how it compares to their pure OCR model.
radagasus-@reddit
thank you, very useful
NihilityAeonBeliever@reddit
wow the formatting on gemini 3 preview here is awesome https://www.ocrarena.ai/battles/5df5f5b9-02ea-477a-a61e-e013e9e698e5
Emc2fma@reddit (OP)
wow that's so impressive
BagComprehensive79@reddit
Looks very nice. Maybe it would be a good idea to create battles for different output formats; it looks like it only works with markdown right now
microcandella@reddit
Back in the 90s, when I was working with a ton of OCR systems, there was a company that did a pretty brilliant multi-engine OCR implementation and employed a weighted voting system to choose which chunk was accurate. One of the only things that worked better at the time was the unobtainable OCR systems used by national postal services - and even then they were only trained to nail down the contents on the outside of an envelope.
It would be interesting to see a voting system implemented with the modern ocr options.
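A rough sketch of that voting idea applied to modern engines: run several on the same text chunk and take the consensus, weighted by how much you trust each engine. The readings and weights below are invented for illustration:

```python
# Weighted voting across multiple OCR engines: each engine reads the same
# chunk, and the reading with the most (weighted) support wins.

from collections import defaultdict

def vote(readings):
    """readings: list of (text, weight) pairs, one per engine."""
    scores = defaultdict(float)
    for text, weight in readings:
        scores[text] += weight
    return max(scores, key=scores.get)

readings = [
    ("INV0ICE #1234", 1.0),   # engine A misreads the O as a zero
    ("INVOICE #1234", 1.5),   # engine B, weighted higher: better track record
    ("INVOICE #1234", 1.0),   # engine C
]
print(vote(readings))  # INVOICE #1234
```

With free-form VLM outputs you'd first have to align the outputs chunk-by-chunk (e.g. per line or per cell), which is the hard part the 90s systems presumably solved per character.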
hainesk@reddit
Mistral 3.2 would be great!
Emc2fma@reddit (OP)
I had Mistral before but had to remove it. Their hosted API for OCR was super unstable and returned a lot of garbage results unfortunately.
(I could have also done something wrong integrating it)
do-un-to@reddit
Maybe the test harness needs robustness in handling service instability, perhaps optionally including measurements of that in summary metrics?
do-un-to@reddit
Though that kind of work is really annoying, and I think it's a nice-to-have rather than generally useful, so I wouldn't fault you for not being keen on implementing it.
ProposalOrganic1043@reddit
We have used the mistral-ocr API over 10K pages and have noticed this inconsistency too. Some of the responses were total garbage. For really simple images with up to 300-400 clear words, the model responded with just 5-10 tokens containing hundreds of empty pipes and markdown formatting symbols.
We tried the same images with other models such as Qwen2.5-VL and olmOCR 2 and they could do it easily
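For what it's worth, a cheap sanity check on the output can catch that failure mode before the garbage reaches downstream code. A minimal sketch — looks_degenerate and its threshold are made up for illustration:

```python
# Reject OCR output that is mostly pipes/markdown punctuation rather than
# actual text, so degenerate responses can be retried or routed elsewhere.

def looks_degenerate(text, min_alnum_ratio=0.3):
    """True if too little of the (non-whitespace) output is alphanumeric."""
    stripped = "".join(text.split())
    if not stripped:
        return True  # empty response counts as degenerate
    alnum = sum(c.isalnum() for c in stripped)
    return alnum / len(stripped) < min_alnum_ratio

print(looks_degenerate("| | | | | | | | | |"))  # True  (the failure above)
print(looks_degenerate("Total due: $412.50"))   # False (normal output)
```

Comparing output length against a rough estimate of visible words in the image is another easy signal for the "5-10 tokens for 300-400 words" case.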
JoshuaLandy@reddit
OCArena of Time
BestSentence4868@reddit
This is so good, and honestly much needed. Half the HF spaces I've found to try and compare OCR models have been busted or out of date. Way nicer to have a focused leaderboard like this.
Emc2fma@reddit (OP)
that was the goal! thanks for sharing, glad it resonates
Repulsive-Memory-298@reddit
Awesome, but you should really add a stop button or some limits. I uploaded a pdf and am stuck waiting for anonymous model 2 as it is generating hundreds of duplicated lines, I can only wonder how you pay for this haha
rm-rf-rm@reddit
Please add Gemma3
ProposalOrganic1043@reddit
This was needed for sure
GroundbreakingTea195@reddit
Cool, great job!
Emc2fma@reddit (OP)
thanks! any feedback on what could be better?
GroundbreakingTea195@reddit
Wild idea, but maybe add the API costs when users want to use the models themselves? This way, they have a quick overview like, "Wow, Gemini costs $3 and has an 82% win rate, and GPT-5.1 only costs $1 and has a 77% win rate." Also, perhaps define which models are open-source and which are not. I am currently looking for the best open-source OCR model, for example.
Emc2fma@reddit (OP)
that's an awesome idea, I'll work on adding both cost + latency metrics later today.
Gemini 3 is really strong, but very expensive + slow which doesn't make it great for a lot of use cases compared to Paddle or dots.ocr
danyx12@reddit
You don't need Gemini 3. I discovered Vertex AI, and Gemini 2.0 Flash-Lite is insane. I know the price is still high for some people, but with no detailed instructions, just a simple prompt, it split a scanned document, chose the required pages, and, without my even asking, extracted a few things from that document that it thought were important to me. With a slightly more detailed prompt for what I need, it extracts data from different documents, without any training or fine-tuning.
GroundbreakingTea195@reddit
Great! Latency is also an awesome one. And for my use case, I am only allowed local models, so nothing on the internet. I have tried Paddle and docTR for example 🙃
kellencs@reddit
it'd be cool to have OneOCR (Windows) and Google Lens. There are a few free Python wrappers for them, owocr for example
z_3454_pfk@reddit
this is really good but it’s missing some important models such as qwen3 30/32/235b, GLM, Granite, Claude, Grok, etc
z_3454_pfk@reddit
There's no way to repeat the battle with different random models
Kregano_XCOMmodder@reddit
Can't tell if DeepSeek OCR was just busted on this run, or it couldn't handle the spicy filter list: https://www.ocrarena.ai/battles/ecd69dc7-8c9b-41ad-acfc-60e60fb36b8d
Emc2fma@reddit (OP)
yeah DeepSeek has been super flaky on anything outside of very clean docs...tbh I don't understand the hype
Kregano_XCOMmodder@reddit
I have to laugh at uploading a ~5MB collage image and getting this reply:
rikiiyer@reddit
The model itself is mid. The more interesting aspect to me is the details on the training process and the dynamic image encoding