I made a free playground for comparing 10+ OCR models side-by-side
Posted by Emc2fma@reddit | LocalLLaMA | 83 comments
It's called OCR Arena, you can try it here: https://ocrarena.ai
There are so many new OCR models coming out all the time, but testing them is really painful. I wanted to give the community an easy way to compare leading foundation VLMs and open-source OCR models side-by-side. You can upload any doc, run a variety of models, and view diffs easily.
So far I've added Gemini 3, dots, DeepSeek-OCR, olmOCR 2, Qwen3-VL-8B, and a few others.
Would love any feedback you have! And if there's any other models you'd like included, let me know.
(No surprise, Gemini 3 is top of the leaderboard right now)
zedd1704@reddit
I am wondering how you are prompting the models in the backend. Is it just "parse the pdf"?
rainbow3@reddit
Do any of these models return the coordinates of each row of text?
I am looking for a replacement for a Tesseract project. I found many that do better OCR but didn't provide coordinates.
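For what it's worth, classic Tesseract does expose word-level boxes via pytesseract's image_to_data, and rows can be reassembled from its line numbering. A minimal sketch of that grouping step — the sample dict mimics pytesseract's output layout, and rows_from_tsv_dict is a made-up helper name:

```python
# Tesseract (via pytesseract) returns word-level boxes:
#   data = pytesseract.image_to_data(img, output_type=pytesseract.Output.DICT)
# Below, those word boxes are grouped into one bounding box per text row,
# using a hard-coded sample in the same dict layout.

def rows_from_tsv_dict(data):
    """Group word boxes into one (text, box) pair per (block, par, line)."""
    rows = {}
    for i, word in enumerate(data["text"]):
        if not word.strip():
            continue  # skip the empty cells tesseract emits for structure
        key = (data["block_num"][i], data["par_num"][i], data["line_num"][i])
        l, t = data["left"][i], data["top"][i]
        r, b = l + data["width"][i], t + data["height"][i]
        text, box = rows.get(key, ("", (l, t, r, b)))
        box = (min(box[0], l), min(box[1], t), max(box[2], r), max(box[3], b))
        rows[key] = ((text + " " + word).strip(), box)
    return [rows[k] for k in sorted(rows)]

# Sample in pytesseract's image_to_data dict layout (two words, one line).
sample = {
    "text": ["Hello", "world"],
    "block_num": [1, 1], "par_num": [1, 1], "line_num": [1, 1],
    "left": [10, 60], "top": [20, 22], "width": [40, 45], "height": [12, 10],
}
print(rows_from_tsv_dict(sample))  # [('Hello world', (10, 20, 105, 32))]
```

The VLM-based models generally just emit markdown, so if you need coordinates you'd have to check each model's docs for a grounding/bbox mode.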
peteror@reddit
Really cool! Are you using any specific prompt to call these models? I'm building something that processes mostly invoices / receipts and I get quite good results in general with a very specific prompt, but I found a few tricky cases that give way better results on OCR Arena than what I get on the same models (GPT 5.1 mostly)
Imaginary_Leg_9383@reddit
I noticed you can view and even customize the models in the playground under "advanced settings"
peteror@reddit
Good catch, I didn't notice that, thank you!
Flimsy_Requirement30@reddit
Thanks OP. Can you share what thinking level you used for Gemini 3? I find it makes a lot of difference to use a high thinking level with Gemini 3, and it would be great to get this detail right!
SarcasticBaka@reddit
Great idea! Paddle-VL and MinerU are considered top dogs for OCR iirc, so probably useful to add them. Nanonets, LightOnOCR and Chandra OCR are popular recent releases as well.
ajw2285@reddit
Is there an easy way to deploy Paddle? I am a noob and limited to Ollama
Mayonnaisune@reddit
Imo, the older versions of the paddleocr Python package (<= 2.10.0 iirc), which support up to PP-OCRv4 models and still use .ocr() instead of .predict(), are easier to use than the newer ones. They are faster and lighter too, but may not be as accurate and are clearly not as up to date as the newer ones.
the__storm@reddit
I would give the vllm backend a try: https://docs.vllm.ai/projects/recipes/en/latest/PaddlePaddle/PaddleOCR-VL.html
Paddle in general is notoriously hard to get running (although it might be better if you can read the Chinese version of the docs). For the older non-VL Paddle OCR models, there's also RapidOCR. It's still kind of awkward and poorly documented but definitely easier than PaddlePaddle.
Emc2fma@reddit (OP)
on it, will have them deployed shortly!
paton111@reddit
Very cool project. At Tomedes we built something similar on our side – a tool that lets you compare OCR outputs from multiple AI models and also shows the most common output for each element so you can see where the models agree. It’s been super useful for spotting consistency. You can try it here: https://www.tomedes.com/tools/image-to-text
cviperr33@reddit
thank you! will check it out when i get home
kathirai@reddit
As far as I've tested with written notes (in caps letters), Gemini 3 performs well but still falls short. You can test with the samples the site provides, or upload and test your own.
markingup@reddit
I'd love it if anyone has good OCR tests to share. Finding it tough to find validation data
iamn0@reddit
Just like on lmarena.ai, we need the ability to vote that both models performed equally well. I had a case where both produced identical results
Emc2fma@reddit (OP)
makes total sense, shipping shortly!
nderstand2grow@reddit
you mean, "vibing" shortly? :)
rm-rf-rm@reddit
Yeah, the first (and only) one I tried was a match. It causes a false result if you vote for one over the other in that case, and I'm guessing this happens quite often
RegisteredJustToSay@reddit
Same. Also one for when neither was good, for those cases where neither should benefit from an Elo boost.
PM_ME_COOL_SCIENCE@reddit
Please add PaddleOCR-VL! I've found it to be the best OCR model outside of the big proprietary models.
Sawada1_@reddit
Do you mind giving some hints on how you got it running?
AnyJeweler787@reddit
Awesome!
MrMrsPotts@reddit
Should we avoid uploading images in different languages? I have a mixed Arabic/English document for example.
TailSpinBowler@reddit
uploaded a personal letter. hoping they get purged afterwards.
DigThatData@reddit
that's so funny that this -- of all things -- is still an unsolved problem.
Barry_Jumps@reddit
Love it. Is the code open? I'm sure the community would appreciate running it themselves and bringing their own keys to take some of the inference cost burden off your site.
versedaworst@reddit
This is amazing work and much needed, I feel like the past few months I’ve been relying on random blog posts for assessments of new OCR models. Hopefully it’s financially sustainable for a while.
Emc2fma@reddit (OP)
you and me both haha
dugganmania@reddit
Add a donate button, my man - you can crowdsource some $$. I've been doing ad-hoc research also, so you're potentially saving me tons of time
AdventurousFly4909@reddit
You are already giving them training data...
dugganmania@reddit
Sure but they’re providing a service too that I at least haven’t been able to replicate for OCRs without my own testing
TedHoliday@reddit
I haven’t really used OCR a lot in anything I’ve worked on, so pardon my ignorance, but I always figured OCR was sort of a solved problem. Is that not the case?
youarebritish@reddit
I have still not found a reliable model for Japanese. Three main pain points remain: the OCR confusing kanji for similar-looking kanji (agonizing to spot), failing on vertical text, and not dealing with ruby text.
the__storm@reddit
The actual character recognition (for typed text) is mostly solved - not as good as a human but usually good enough - but handling of complex layouts is very much not solved. Even Gemini 3 and GPT 5.1 fail at the first hurdle (I usually throw this one at them as a first test - the world is full of insane document layouts like this, and worse).
SarcasticBaka@reddit
Not at all really, maybe for super clean digitally created documents but not for anything older, with a complex layout or handwriting, etc. I deal with a lot of paperwork day to day so I've always kept a close eye on the advancement of OCR tech, before VLMs I used software like Abbyy FineReader or Adobe Acrobat which provide decent but definitely not great results depending on the scan quality.
schemathings@reddit
It's not loading for me - wondering if granite-docling is on there, been hearing good things about it.
the__storm@reddit
granite-docling is cool for being so small, but my experience is that it's basically worthless for anything more complicated than a book layout (just straight paragraphs of text). It would definitely lose to all the models currently on the leaderboard.
theZeitt@reddit
One of the models got stuck in a loop just writing "driving safety", so some way to cancel an ongoing prompt would be nice. It might also be good to automate that with a timeout from the first token out
the__storm@reddit
Yeah common problem with these OCR models (probably because the temperature has to be set really low). Definitely should have some guardrails on the generation.
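A guardrail like that can be cheap to bolt on at decode/client time. A minimal sketch, with invented names and thresholds, that stops consuming a stream once the output's tail is just one phrase repeated:

```python
# Sketch of a repetition guardrail for OCR models that get stuck in a loop.
# After each streamed chunk, check whether the output now ends in the same
# short phrase repeated over and over, and cancel the request if so.

def looks_stuck(text, max_phrase=40, min_repeats=5):
    """True if text ends with some phrase (<= max_phrase chars) repeated
    at least min_repeats times in a row."""
    for length in range(1, max_phrase + 1):
        tail = text[-length:]
        if len(tail) == length and text.endswith(tail * min_repeats):
            return True
    return False

out = ""
for chunk in ["driving safety "] * 40:  # stand-in for a streaming response
    out += chunk
    if looks_stuck(out):
        break  # here you'd cancel the request instead of waiting it out
print(len(out))  # stops after 5 repeats (75 chars), not 40
```

A wall-clock timeout from the first token, as suggested above, would catch the cases a string check misses (e.g. repetition with slight variation).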
the__storm@reddit
Might be nice to put a couple of old standbys in there, like Tesseract and EasyOCR. They can't handle more complicated documents but they're very widely used (and fast) and provide a good baseline.
mace_guy@reddit
How are you absorbing the cost?
Emc2fma@reddit (OP)
I run a doc processing company (https://extend.ai) and we're just lighting money on fire at the moment (this took off way more than expected, so we scaled up the GPUs)
But I feel strongly that this should exist for the community, so we'll (1) keep funding it and (2) open-source it soon
(if any investors find this thread in the future, just call this part of our CAC)
the__storm@reddit
Open source would be awesome - would take some load off your GPUs and I could run company documents through it.
AccordingRespect3599@reddit
The "vote model 2" page is empty.
ConstantinGB@reddit
as a total layman: what is OCR?
Imaginary_Leg_9383@reddit
Optical Character Recognition (OCR) - converting documents or PDFs into editable and searchable data. Step changes in LLMs / VLMs have really changed the landscape tho
ConstantinGB@reddit
Oh that's interesting. I'm building my own local LLM Agent (so Ollama LLMs and building tools and UI around it) and one of the next steps is to have it scan, transcribe and catalogue scanned documents and PDFs so I should definitely look into that.
geek_at@reddit
Not sure if it happens every time, but when I load the page, upload a JPEG, pick a winner, and click the "new battle" button, the upload doesn't work anymore. As in, I have to reload the page for the upload to work (nothing happens after file selection)
Mkengine@reddit
Thank you, I was really missing something like that. Would you consider adding some of the following models?
GOT-OCR: https://huggingface.co/stepfun-ai/GOT-OCR2_0
granite-docling-258m : https://huggingface.co/ibm-granite/granite-docling-258M
Dolphin: https://huggingface.co/ByteDance/Dolphin
MinerU 2.5: https://huggingface.co/opendatalab/MinerU2.5-2509-1.2B
OCRFlux: https://huggingface.co/ChatDOC/OCRFlux-3B
MonkeyOCR-pro: 1.2B: https://huggingface.co/echo840/MonkeyOCR-pro-1.2B 3B: https://huggingface.co/echo840/MonkeyOCR-pro-3B
FastVLM: 0.5B: https://huggingface.co/apple/FastVLM-0.5B 1.5B: https://huggingface.co/apple/FastVLM-1.5B 7B: https://huggingface.co/apple/FastVLM-7B
MiniCPM-V-4_5: https://huggingface.co/openbmb/MiniCPM-V-4_5
GLM-4.1V-9B: https://huggingface.co/zai-org/GLM-4.1V-9B-Thinking
InternVL3_5: 4B: https://huggingface.co/OpenGVLab/InternVL3_5-4B 8B: https://huggingface.co/OpenGVLab/InternVL3_5-8B
AIDC-AI/Ovis2.5 2B: https://huggingface.co/AIDC-AI/Ovis2.5-2B 9B: https://huggingface.co/AIDC-AI/Ovis2.5-9B
RolmOCR: https://huggingface.co/reducto/RolmOCR
Qwen3-VL: Qwen3-VL-2B Qwen3-VL-4B Qwen3-VL-30B-A3B Qwen3-VL-32B Qwen3-VL-235B-A22B
eltonjohn007@reddit
Maybe add Qwen3-VL-235B?
Intelligent-Form6624@reddit
Nice 👍
sdkgierjgioperjki0@reddit
This seems to contain both VLMs and pure OCR models without labeling which is which. DeepSeek actually has a VLM similar to Qwen's, although it's a bit old now; I wonder how it compares to their pure OCR model.
radagasus-@reddit
thank you, very useful
NihilityAeonBeliever@reddit
wow the formatting on gemini 3 preview here is awesome https://www.ocrarena.ai/battles/5df5f5b9-02ea-477a-a61e-e013e9e698e5
Emc2fma@reddit (OP)
wow that's so impressive
BagComprehensive79@reddit
Looks very nice. Maybe it would be a good idea to create battles for different output formats; it looks like it only works with markdown right now
microcandella@reddit
Back in the 90s, when I was working with a ton of OCR systems, there was a company that did a pretty brilliant multi-engine OCR implementation and employed a weighted voting system to choose which chunk was accurate. One of the only things that worked better at the time was the unobtainable OCR systems used by national postal services - and even then they were only trained to nail down the contents on the outside of an envelope.
It would be interesting to see a voting system implemented with the modern ocr options.
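A rough sketch of that voting idea applied to modern engines: run several on the same text chunk and take the consensus, weighted by how much you trust each engine. The readings and weights below are invented for illustration:

```python
# Weighted voting across multiple OCR engines: each engine reads the same
# chunk, and the reading with the most (weighted) support wins.

from collections import defaultdict

def vote(readings):
    """readings: list of (text, weight) pairs, one per engine."""
    scores = defaultdict(float)
    for text, weight in readings:
        scores[text] += weight
    return max(scores, key=scores.get)

readings = [
    ("INV0ICE #1234", 1.0),   # engine A misreads the O as a zero
    ("INVOICE #1234", 1.5),   # engine B, weighted higher: better track record
    ("INVOICE #1234", 1.0),   # engine C
]
print(vote(readings))  # INVOICE #1234
```

With free-form VLM outputs you'd first have to align the outputs chunk-by-chunk (e.g. per line or per cell), which is the hard part the 90s systems presumably solved per character.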
hainesk@reddit
Mistral 3.2 would be great!
Emc2fma@reddit (OP)
I had Mistral before but had to remove it. Their hosted API for OCR was super unstable and returned a lot of garbage results unfortunately.
(I could have also done something wrong integrating it)
do-un-to@reddit
Maybe the test harness needs robustness in handling service instability, perhaps optionally including measurements of that in summary metrics?
do-un-to@reddit
Though that kind of work is really annoying, and I think it's a nice-to-have rather than generally useful, so I wouldn't fault you for not being keen on implementing it.
ProposalOrganic1043@reddit
We have used the mistral-ocr API over 10K pages and have noticed this inconsistency too. Some of the responses were total garbage. For really simple images with up to 300-400 clear words, the model responded with just 5-10 tokens containing hundreds of empty pipes and markdown formatting symbols.
We tried the same images with other models such as Qwen2.5-VL and olmOCR 2 and they could do it easily
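For what it's worth, a cheap sanity check on the output can catch that failure mode before the garbage reaches downstream code. A minimal sketch — looks_degenerate and its threshold are made up for illustration:

```python
# Reject OCR output that is mostly pipes/markdown punctuation rather than
# actual text, so degenerate responses can be retried or routed elsewhere.

def looks_degenerate(text, min_alnum_ratio=0.3):
    """True if too little of the (non-whitespace) output is alphanumeric."""
    stripped = "".join(text.split())
    if not stripped:
        return True  # empty response counts as degenerate
    alnum = sum(c.isalnum() for c in stripped)
    return alnum / len(stripped) < min_alnum_ratio

print(looks_degenerate("| | | | | | | | | |"))  # True  (the failure above)
print(looks_degenerate("Total due: $412.50"))   # False (normal output)
```

Comparing output length against a rough estimate of visible words in the image is another easy signal for the "5-10 tokens for 300-400 words" case.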
JoshuaLandy@reddit
OCArena of Time
BestSentence4868@reddit
This is so good, and honestly much needed. Half the HF spaces I've found to try and compare OCR models have been busted or out of date. Way nicer to have a focused leaderboard like this.
Emc2fma@reddit (OP)
that was the goal! thanks for sharing, glad it resonates
Repulsive-Memory-298@reddit
Awesome, but you should really add a stop button or some limits. I uploaded a pdf and am stuck waiting for anonymous model 2 as it is generating hundreds of duplicated lines, I can only wonder how you pay for this haha
rm-rf-rm@reddit
Please add Gemma3
ProposalOrganic1043@reddit
This was needed for sure
GroundbreakingTea195@reddit
Cool, great job!
Emc2fma@reddit (OP)
thanks! any feedback on what could be better?
GroundbreakingTea195@reddit
Wild idea, but maybe add the API costs when users want to use the models themselves? This way, they have a quick overview like, "Wow, Gemini costs $3 and has an 82% win rate, and GPT-5.1 only costs $1 and has a 77% win rate." Also, perhaps define which models are open-source and which are not. I am currently looking for the best open-source OCR model, for example.
Emc2fma@reddit (OP)
that's an awesome idea, I'll work on adding both cost + latency metrics later today.
Gemini 3 is really strong, but very expensive + slow which doesn't make it great for a lot of use cases compared to Paddle or dots.ocr
danyx12@reddit
You don't need Gemini 3. I discovered Vertex AI, and Gemini 2.0 Flash-Lite is insane. I know the price is still high for some people, but with no detailed instructions, just a simple prompt, it split a scanned document, chose the required pages, and, without my even asking, extracted a few things from that document that it thought were important to me. With a slightly more detailed prompt for what I need, it extracts data from different documents, without any training or fine-tuning.
GroundbreakingTea195@reddit
Great! Latency is also an awesome one. And for my use case, I am only allowed local models, so nothing on the internet. I have tried Paddle and docTR for example 🙃
kellencs@reddit
it'd be cool to have OneOCR (Windows) and Google Lens. There are a few free Python wrappers for them, owocr for example
z_3454_pfk@reddit
this is really good but it’s missing some important models such as qwen3 30/32/235b, GLM, Granite, Claude, Grok, etc
z_3454_pfk@reddit
There's no way to repeat the battle with different random models
Kregano_XCOMmodder@reddit
Can't tell if DeepSeek OCR was just busted on this run, or it couldn't handle the spicy filter list: https://www.ocrarena.ai/battles/ecd69dc7-8c9b-41ad-acfc-60e60fb36b8d
Emc2fma@reddit (OP)
yeah DeepSeek has been super flaky on anything outside of very clean docs...tbh I don't understand the hype
Kregano_XCOMmodder@reddit
I have to laugh at uploading a ~5MB collage image and getting this reply:
rikiiyer@reddit
The model itself is mid. The more interesting aspect to me is the details on the training process and the dynamic image encoding