Are OCR engines like Tesseract still valid, or do people just use image recognition models now?
Posted by optipuss@reddit | LocalLLaMA | View on Reddit | 59 comments
Had this thought when someone used Qwen3.5 to read the content of a PDF file very accurately, even the signature, so this question arose in my mind.
the__storm@reddit
Yes, along with other character/word-level OCR solutions, for a couple reasons:
rorykoehler@reddit
Good multimodal language models don't really do that anymore. We switched our OCR to multimodal LLMs (not even very big models), and both models we have used have been flawless for going on 18 months. I think I can count on one hand how many times we've had to manually override results. We're OCRing millions of records a month.
Trillionaire_life@reddit
Hey can I dm? I need to know how to do this. My work project is stuck on this
rorykoehler@reddit
I just send the image and ask for a response containing a structured json object in the system prompt. Nothing more to it.
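That setup can be sketched roughly like this, assuming an OpenAI-compatible chat-completions payload (the model name and field list are placeholders, not the commenter's actual config):

```python
import base64
import json

def build_ocr_request(image_bytes: bytes, fields: list[str],
                      model: str = "gpt-4o-mini") -> dict:
    """Build a chat-completion payload asking a vision model to OCR an
    image and reply with a structured JSON object."""
    data_url = "data:image/png;base64," + base64.b64encode(image_bytes).decode()
    # Hypothetical schema hint: one string field per requested key.
    schema_hint = json.dumps({f: "string" for f in fields})
    return {
        "model": model,
        "messages": [
            {"role": "system",
             "content": "Extract the document fields and reply with ONLY a "
                        f"JSON object matching this shape: {schema_hint}"},
            {"role": "user",
             "content": [{"type": "image_url",
                          "image_url": {"url": data_url}}]},
        ],
        # Ask the API to enforce valid JSON output.
        "response_format": {"type": "json_object"},
    }
```

The payload would then be POSTed to the provider's `/chat/completions` endpoint; the same shape works against local OpenAI-compatible servers.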
nonerequired_@reddit
Which model are you using?
rorykoehler@reddit
We're using the OpenAI API via Azure, but I had it working locally before going to prod. I can't remember the model I tested locally. It was a long time ago, sorry.
ganonfirehouse420@reddit
My experience was that with a good prompt there is low risk of hallucination. Regular ocr makes mistakes too, especially when trying to read a table.
Danfhoto@reddit
Yeah, I didn’t mention this in my reply but VLMs LOVE to hallucinate. I’d rather have character-level flubs that can be scrubbed later.
ganonfirehouse420@reddit
I especially started using qwen3.5 to simply replace my old ocr tech.
Mkengine@reddit
This was my way as well. I needed basically a scan2SQL pipeline and had to code so much post-processing logic with specialized OCR models that I switched to Qwen3.5-35B-A3B and now get a perfect JSON directly from the scan.
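The last hop of such a scan2SQL pipeline, loading the model's JSON into a table, might look like this sketch (a hypothetical helper using SQLite; table and column names come straight from the JSON, so trusted input only):

```python
import json
import sqlite3

def json_to_sql(rows_json: str, table: str = "bom") -> sqlite3.Connection:
    """Load a JSON array of flat records (as returned by the VLM)
    into an in-memory SQLite table. Columns are taken from the first
    record; only use with trusted input (names are interpolated)."""
    rows = json.loads(rows_json)
    cols = list(rows[0])
    con = sqlite3.connect(":memory:")
    con.execute(f"CREATE TABLE {table} ({', '.join(cols)})")
    con.executemany(
        f"INSERT INTO {table} VALUES ({', '.join('?' * len(cols))})",
        [tuple(r[c] for c in cols) for r in rows],
    )
    return con
```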
ganonfirehouse420@reddit
Qwen can literally read handwriting. Not perfectly, but commercial models aren't perfect at that task either.
Mkengine@reddit
There are so many OCR / document understanding models out there, here is my personal OCR list I try to keep up to date:
GOT-OCR:
https://huggingface.co/stepfun-ai/GOT-OCR2_0
granite:
https://huggingface.co/ibm-granite/granite-docling-258M
https://huggingface.co/ibm-granite/granite-4.0-3b-vision
MinerU:
https://huggingface.co/opendatalab/MinerU2.5-2509-1.2B
https://huggingface.co/opendatalab/MinerU-Diffusion-V1-0320-2.5B
OCRFlux:
https://huggingface.co/ChatDOC/OCRFlux-3B
MonkeyOCR-pro:
1.2B: https://huggingface.co/echo840/MonkeyOCR-pro-1.2B
3B: https://huggingface.co/echo840/MonkeyOCR-pro-3B
RolmOCR:
https://huggingface.co/reducto/RolmOCR
Nanonets OCR:
https://huggingface.co/nanonets/Nanonets-OCR2-3B
dots OCR:
https://huggingface.co/rednote-hilab/dots.ocr
https://modelscope.cn/models/rednote-hilab/dots.ocr-1.5
https://huggingface.co/rednote-hilab/dots.mocr
olmocr 2:
https://huggingface.co/allenai/olmOCR-2-7B-1025
Light-On-OCR:
https://huggingface.co/lightonai/LightOnOCR-2-1B
Chandra:
https://huggingface.co/datalab-to/chandra-ocr-2
Jina vlm:
https://huggingface.co/jinaai/jina-vlm
HunyuanOCR:
https://huggingface.co/tencent/HunyuanOCR
bytedance Dolphin 2:
https://huggingface.co/ByteDance/Dolphin-v2
PaddleOCR-VL:
https://huggingface.co/PaddlePaddle/PaddleOCR-VL-1.5
Deepseek OCR 2:
https://huggingface.co/deepseek-ai/DeepSeek-OCR-2
GLM OCR:
https://huggingface.co/zai-org/GLM-OCR
Nemotron OCR:
https://huggingface.co/nvidia/nemotron-ocr-v2
Qianfan-OCR:
https://huggingface.co/baidu/Qianfan-OCR
Falcon-OCR:
https://huggingface.co/tiiuae/Falcon-OCR
makingnoise@reddit
I'm surprised that you and I are the only ones who mentioned OLM-OCR2. It makes me wonder if I know something others don't (OLM-OCR2 is amazing at extracting accurate text from the absolute worst scans), or if it's the other way around, and others know something I don't (e.g., is my knowledge dated, and do general-purpose LLMs now perform just as well at OCR as specialized OCR AIs?).
Mkengine@reddit
So usually, when I post this list, I also get asked which model is best for the user's use case. Unfortunately, I'm not quite there yet to be able to answer that, so I'm not sure here either. But what I'm currently planning: we have a huge archive of scanned PDFs from the last 50 years at the company, and they're all just lying around. There isn't even a client, but my boss told me to test out what's possible with local models and just go wild. As fast as the list is changing, I'm pretty glad I don't have any pressure and can just do this as a side project to tinker with. Long story short: in the long term, I'd like to have a test suite where every model from my list is run with user input, and then users can look through it themselves to see where the best output is delivered. There's no necessity for it yet, but it's always worth having a backup in case the cloud route is ever no longer an option.
Pleasant-Regular6169@reddit
LLMs are far superior to old OCR tools, especially where forms and handwriting are involved.
Conducted tests on a specific data set we have and the best performer was https://mistral.ai/news/mistral-ocr-3
Sonnyjimmy@reddit
I've spent quite a lot of time trying out OCR solutions and this is what I currently use for extracting text from pdfs/images with bounding boxes:
Even with good models, beware of VLM laziness in skipping lines; human checks are still needed, or a second pass through the VLM.
For the last point you could also do a hybrid approach that others have mentioned - get line bounding boxes with PaddleOCR, then send the cut image of the line to the VLM where Paddle had low confidence. I have found that VLM OCR performance is worse in this case, but it can be a faster process overall. So for me, it depends on how 'difficult' the text is to read to go pure VLM (for most difficult) or hybrid PaddleOCR-VLM.
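The low-confidence routing described above can be sketched as a small helper (hypothetical; assumes PaddleOCR-style `(bbox, text, confidence)` line results, with the VLM re-read happening elsewhere on the cropped line images):

```python
def split_by_confidence(lines, threshold=0.90):
    """Partition OCR line results into (keep, retry): lines the fast OCR
    engine read confidently are kept as-is, while low-confidence lines
    get their image crops sent to a VLM for a second read.
    Each line is a (bbox, text, confidence) tuple."""
    keep, retry = [], []
    for bbox, text, conf in lines:
        (keep if conf >= threshold else retry).append((bbox, text, conf))
    return keep, retry
```

The `retry` bounding boxes are then used to crop the page image, and only those crops go to the VLM, which keeps the slow path small.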
Mkengine@reddit
I don't know exactly where I would categorize my current use case in your list, but I've tested a lot of small OCR models and none of them worked without extensive post-processing: I have old scanned PDFs where I have a table of contents at the beginning, and indentations have meaning, not just the content. Then there are technical drawings where the title block needs to be extracted and the drawing described, followed by component bills of materials that have to be correctly mapped to the technical drawings. The bills of materials also eventually need to go into a SQL table, so I need structured output (JSON). These are so many tasks requiring more intelligence than OCR models can provide, so I'm just using Qwen3.5-35B-A3B now and can tackle everything with this single model, giving me a fully functional "scan2sql" pipeline.
KeikakuAccelerator@reddit
Tesseract sucks; DeepSeek OCR and PaddleOCR worked reasonably well last I checked.
Sonnyjimmy@reddit
I agree - Tesseract can't handle handwriting or scanned, noisy images, even if they contain typed text. PaddleOCR is much better in general for scanned documents and can still work on CPU if you don't mind waiting.
nmkd@reddit
Outdated at this point.
I use LightOnOCR and it's far more accurate and much faster than Tesseract, even when running every sample twice to avoid random outliers (which are extremely rare).
revilo-1988@reddit
Good question. I currently still use it for large documents; AI would be too expensive for me there in terms of token consumption, and local LLMs can act up on large documents.
flobernd@reddit
Quality wise I found LLMs to be way superior compared to traditional tools like Tesseract. But there are drawbacks:
- LLMs are slower
- LLMs can't easily produce bounding boxes (important if you need to produce transparent PDF overlays)
There are some hybrid approaches, but for my taste they are not perfect either. They usually run traditional text detection to determine the bounding boxes and afterward invoke the LLM for the actual OCR. This definitely improves the OCR quality, but if no bounding box was detected in the first place (e.g. handwritten text), the LLM never sees the text.
More advanced algorithms might exist now. Has been a while since I last checked (was trying to replace the Paperless OCR without using Paperless GPT etc.)
Mashic@reddit
Tesseract hasn't been updated for years. If you want a traditional OCR, use PaddleOCR, or you can use an LLM like Qwen3-VL-8B.
Caffdy@reddit
Anyone know of a good OCR that can read kanji/kana?
optimisticalish@reddit
The new Qwen3.5 4B is massively multilingual, and with its vision mmproj file loaded it can see images, do OCR, and translate.
fragilesleep@reddit
I use PaddleOCR for Japanese games. There's also PaddleOCR-VL, but the normal model (PP-OCRv5_server_rec) is good enough for me.
ZealousidealBadger47@reddit
Tesseract is way faster than an LLM. Still using Tesseract with Python venv for most OCR tasks. I only use LLMs for handwriting pics.
Danfhoto@reddit
OCR engines are much faster and smaller. A good combination is an object detection/layout model that crops bounding boxes and sends those regions to an OCR engine, preserving a lot of the natural reading order. This is largely how frameworks like Docling work.
For my purposes, the current limitation with OCR engines is that they do poorly when it comes to LaTeX formulas, subscripts, and superscripts. This makes it challenging to extract key details from things like peer-reviewed articles. Likewise, using even SOTA local VLMs to extract full pages has been lackluster in my testing, with some of the best results coming out of the larger GLM vision models. With Docling as inspiration, I'm working a bit on a pipeline that uses object detection models to send crops off to a VLM (thus requiring less memory on useless blank pixels and limiting poisoning from surrounding text) to try to extract more accurate info. It's much more accurate than OCR engines but requires a lot more time and resources. I'm building up to using it on about 4,000 scanned PDFs with varying formatting. Right now I'm getting much more accurate extractions than trying to use a VLM to OCR an entire page, but it still requires tweaking.
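The crop-and-route approach depends on restoring a reading order from the detected regions. A minimal sketch of that step (a hypothetical helper using simple row clustering; real layout models do considerably more):

```python
def reading_order(boxes, row_tol=10):
    """Sort (x0, y0, x1, y1) boxes into rough reading order:
    boxes whose top edges are within row_tol pixels are grouped
    into one row, then rows are read top-to-bottom and each row
    left-to-right. Single-column assumption for simplicity."""
    rows = []
    for box in sorted(boxes, key=lambda b: b[1]):  # by top edge
        for row in rows:
            if abs(row[0][1] - box[1]) <= row_tol:
                row.append(box)
                break
        else:
            rows.append([box])
    ordered = []
    for row in rows:
        ordered.extend(sorted(row, key=lambda b: b[0]))  # by left edge
    return ordered
```

Each ordered crop can then be sent to the OCR engine or VLM in turn, so the concatenated text follows the page's natural flow.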
cbeater@reddit
Try Docling; no LLM needed, and it works great for PDFs.
weiyong1024@reddit
tesseract is still way faster and cheaper if you just need to extract text from clean scans. vision models are overkill for that. but anything with handwriting, tables, or weird layouts... yeah just throw it at qwen-vl and be done with it
makingnoise@reddit
I just ran 19,000 pages through olm-ocr2 after seeing its nearly flawless performance on a small test sample. Tesseract would have produced garbage in comparison.
Endurance_Beast@reddit
Automation is cheaper than AI. A Tesseract script with Apache Airflow or FileFlows costs a fraction of what AI does for routine work and set templates.
loyalekoinu88@reddit
Both. They might use it to validate the LLM. However, the LLM was likely trained on OCR output, so differences will likely be small. LLMs have the benefit of reading disordered information. They can natively output formats like JSON and return only the requested data, which removes steps from the processing.
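That native-JSON convenience still warrants a little defensive parsing, since models sometimes wrap the object in a markdown fence or add chatter around it. A small sketch (hypothetical helper; the fence regex is deliberately naive and won't handle deeply nested objects inside fences):

```python
import json
import re

def extract_json(llm_output: str):
    """Pull the first JSON object out of an LLM reply, tolerating
    markdown code fences and surrounding prose."""
    # Try a fenced ```json ... ``` block first (non-greedy, naive).
    fenced = re.search(r"```(?:json)?\s*(\{.*?\})\s*```", llm_output, re.DOTALL)
    candidate = fenced.group(1) if fenced else llm_output
    try:
        return json.loads(candidate)
    except json.JSONDecodeError:
        # Fall back to the widest brace-delimited span.
        brace = re.search(r"\{.*\}", candidate, re.DOTALL)
        if brace:
            return json.loads(brace.group(0))
        raise
```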
starkruzr@reddit
I use Qwen3-VL-8B-Instruct for this handwritten note management app. Tesseract wasn't even an option; it can't do handwriting to save its life. https://github.com/jdkruzr/ultrabridge
Chupa-Skrull@reddit
Have you tried Chandra?
starkruzr@reddit
haven't heard of it. go on?
Chupa-Skrull@reddit
Oh, it's just an OCR model I've had decent handwriting success with. Was curious if you'd also used it/compared it to Qwen
poita66@reddit
Nice I’ve been meaning to setup something similar for my nomad
richardanaya@reddit
Check out GLM OCR. I couldn't believe how powerful and fast it was.
Available-Craft-5795@reddit
tesseract
It's a dedicated OCR AI model; I doubt GLM beats it.
starkruzr@reddit
Tesseract is absolute trash; ask me how I know
The_frozen_one@reddit
EasyOCR is pretty good, a bit heavier than Tesseract.
H_NK@reddit
In at least a couple of use cases I've had GPT-4o outperform Tesseract. Maybe you know this already, but to clarify for other users: almost all OCR tools can technically be considered AI. Tesseract is neural-network based, but it's handled as a Python package without the typical local architecture of modern AI models.
ghulamalchik@reddit
tesseract is too basic and requires ideal inputs. It gets confused easily with handwriting, different fonts, weird angles, blurry fonts, etc.
Modern OCR models such as GLM OCR are much, much more powerful, it's not even close.
l_Mr_Vader_l@reddit
wait till you see paddleocr-vl
ttkciar@reddit
It really depends on the job. At work we needed to process hundreds of thousands of pages of fairly well-formed text. Tesseract did a pretty good job, a couple orders of magnitude more quickly than the best vision models of the time. Using Tesseract was a no-brainer.
If your documents are not so well-formed, or if you only need to OCR a few of them, using a modern vision model is a no-brainer. It will take a while to get there, but will give you much higher quality than Tesseract (which does not do well on malformed text).
They're just different tools for different situations, and you should use whichever makes the most sense given the constraints.
OtherwiseHornet4503@reddit
For what I tried to use it for, Tesseract was shit. So shit I just didn’t use it.
So, now, for the critical stuff I just use vision enabled LLMs.
Pixtral 12b from way back then was better than Tesseract.
antonyshen@reddit
If your hardware is good enough for AI, then a new AI model is very good software for the task, way better than Tesseract.
Ketonite@reddit
LLM vision is way better than Tesseract, but the cost is not always justified. Also, submitting the image/page of a PDF to an LLM will get you good text, but the text is not mapped to the graphical appearance of the PDF.
As a workaround, I coded an app that first OCRs with Tesseract, then uses LLM vision to fix errors. I plug the corrected text into the mapped area of the PDF text layer so it mostly matches up. A bit hacky, but does the job for me.
When I just want good text and don't care about image-to-text location mapping, Haiku, Gemini Flash, or OSS models like Llama 4 Maverick or Qwen VL do a good job on all but the most complex pages.
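The remapping step of that workaround (plugging LLM-corrected text back into the positioned text layer) can be sketched like this. It's a hypothetical helper, assuming Tesseract word boxes as `(x, y, w, h, text)` tuples and a simple one-word-per-box alignment:

```python
def remap_text_layer(word_boxes, corrected_text):
    """Attach LLM-corrected words back onto OCR word boxes so the
    invisible PDF text layer still lines up with the page image.
    word_boxes: list of (x, y, w, h, original_word) tuples.
    Falls back to the original OCR words when the corrected word
    count doesn't match (alignment would be ambiguous)."""
    corrected = corrected_text.split()
    if len(corrected) != len(word_boxes):
        return list(word_boxes)
    return [(x, y, w, h, new)
            for (x, y, w, h, _), new in zip(word_boxes, corrected)]
```

A real implementation would need fuzzier alignment (words merged or split by the LLM), which is where the "a bit hacky" caveat comes in.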
AsliReddington@reddit
Tesseract was made for content that isn't "in the wild": specific fonts in books, quite literally for scanning books in. I do think that for LLM-backboned OCR models, a way to ground detection could be checking that a latent or hand-drawn shape is what it's seeing, back to some alphabetical MNIST lol, and only then grounding its output at a letter level.
AdamEgrate@reddit
Depends on the use case. There are scenarios where you may want high precision (even if it’s at the expense of recall). With blurry images LLMs tend to hallucinate when they should return no answer.
Add to that latency requirements and LLMs are not the right choice.
MuDotGen@reddit
I have been using PaddleOCR for private document OCR as even though it takes long, it tends to be runnable on even weaker hardware. It works surprisingly well for Japanese.
itsArmanJr@reddit
I believe that when privacy is a concern (and compared to general LLM usage, OCR tends to involve far more sensitive data), Tesseract is still widely used.
ZeroXClem@reddit
Tesseract never worked for me unless I hand fed it perfectly formatted images with exact crops. I like vision models because of their general adaptability. I believe as smaller parameter models become more capable we will reach a point where a 25M vision model is as fast as tesseract and better.
vaksninus@reddit
Last I tried Tesseract, it was so inferior to paying a few cents for Google Cloud's solution that it wasn't worth it at all if you care about accuracy (it was a translation task, so accuracy was important).
ZeroXClem@reddit
This was my experience as well
Easy-Unit2087@reddit
If you disable vision on Qwen 3.5, it can still use tools (e.g. Read) to extract text from images.
It just gets handed the text from the tool as-is though, so it can't use the rest of the image (colors and shapes) to better interpret the text that's on it. Vision-enabled models will produce better results because of this.
Azuriteh@reddit
Tesseract is still crazy good when you have limited hardware, especially in terms of speed; you won't be beating it anytime soon unless you have a really optimized stack.
I prefer vision models though, they're that much better.
rebelSun25@reddit
I've been going hard at it. Trying my best to do OCR, content validation, presence of text validation, etc.
I found that Google models excel at PDF file parsing. No other model comes close. If I split up the PDF into images, then older Gemini, Qwen, Grok, etc. will work fairly well.
Qwen3.5 27b is good for image to text, most Gemini 2.5+ and newer are good also, Qwen 2.5 VL 72b is a monster for image understanding (it's actually mind blowing how good it is).
Currently I'm using OpenCV to preprocess images, get info from the LLM about the documents, then use OpenCV, then the LLM again.
I needed to create a step-by-step pipeline to get the best results.
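A common first preprocessing step in such a pipeline is binarizing the page before OCR. Below is a numpy-only sketch of Otsu thresholding for illustration; in an actual OpenCV pipeline you would more likely call `cv2.threshold` with the `THRESH_OTSU` flag:

```python
import numpy as np

def otsu_binarize(gray: np.ndarray) -> np.ndarray:
    """Binarize a uint8 grayscale page with Otsu's method: pick the
    threshold that maximizes between-class variance, then map pixels
    to 0 (ink) or 255 (background)."""
    hist = np.bincount(gray.ravel(), minlength=256).astype(np.float64)
    total = gray.size
    sum_all = np.dot(np.arange(256), hist)
    best_t, best_var = 0, -1.0
    w0 = cum0 = 0.0
    for t in range(256):
        w0 += hist[t]            # pixels at or below threshold t
        if w0 == 0 or w0 == total:
            continue
        cum0 += t * hist[t]
        m0 = cum0 / w0                         # mean of dark class
        m1 = (sum_all - cum0) / (total - w0)   # mean of light class
        var = w0 * (total - w0) * (m0 - m1) ** 2
        if var > best_var:
            best_var, best_t = var, t
    return (gray > best_t).astype(np.uint8) * 255
```

Feeding the binarized image to either Tesseract or a VLM usually cuts down on noise-driven misreads from scans.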