PDF Extractor (OCR/selectable text)
Posted by qPandx@reddit | Python | 51 comments
I have a project I'm working on, but I'm facing a couple of issues.
In short, my project parses what's inside a PDF order and returns the result to the user. It currently works OK for known/seen PDF order templates as well as unseen ones. My biggest roadblock is when the PDF order is non-selectable/scanned text, which means it requires OCR to extract. I have tried OCRmyPDF + Tesseract, but it misses lines and messes up the quantities, etc.
What's out there that can do OCR accurately?
P.S. I also tried PaddleOCR but it never finishes the job and keeps the app on a loop with no result.
phoebeb_7@reddit
tesseract reads line by line without layout context, so for tables in an order it loses the row-column relationships between quantity, description and price, which is why you're seeing the mixing. surya, llamaparse or docling might be worth looking at. also, before committing to any service i'd recommend testing your actual docs on a playground, and if you want parser performance scores you might take a peek at the parsebench leaderboard
qPandx@reddit (OP)
Very useful, because I just checked parsebench, and I'm using a fallback AI model to act as reviewer and finalizer, and guess what? The model I was using, gemini-3.1-flash-lite-preview, has a very good rating for tables at a cheap cost. It was doing the job correctly; my current AI logic is mistral-ocr as the OCR engine + the gemini model.
That setup works, but it may get expensive fast, which is why I took it as a challenge to see if I can do it at no cost, or maybe just the mistral-ocr cost ($2/1000 pages).
I got a lot of docling recommendations, so I downloaded it and told my AI (codex) to run benchmarks of our current logic vs docling. It ran for an hour and came back telling me that docling failed miserably.
Check this output from codex: postimg.cc/CnRF3mw0
Is it my machine that's slow? Is it docling? What could it be?
Where can I test my docs on playground for the ones you mentioned?
phoebeb_7@reddit
docling is slower on cpu from what i've seen on most threads, but you can speed it up a bit with a gpu if you can manage it. you can test your existing docs on the llamaparse playground, easy UI
qPandx@reddit (OP)
Surya surprisingly did well for me. It took way longer, but it did do the job. For llamaparse, would I need a subscription to get this feature working? I'm going to have like 10-15 users on my app (not all at once though). Would I need the $50 sub or more? Does it do OCR + parsing? If so, that just replaces the project and work I did, but wouldn't it be cheaper to parse locally and use mistral-ocr for scanned PDFs?
martcerv@reddit
I'm literally working on this exact problem right now for my own project!
TL;DR: Try Docling. It's specifically designed for document understanding (not just OCR) and handles tables way better than Tesseract.
Why Tesseract struggles with your use case:
Tesseract does OCR but doesn't understand document structure. So it:
- Misses table boundaries (reads across rows)
- Gets confused by multi-column layouts
- Struggles with quantity/number alignment
- Doesn't preserve table semantics
OCRmyPDF + Tesseract makes the PDF selectable, but the underlying OCR is still Tesseract with the same issues.
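To make the structure problem concrete, here's a toy sketch (not any real engine's API; all names are made up) of the layout-aware step that flat line-by-line readers skip: clustering raw OCR word boxes back into table rows and columns by their coordinates.

```python
# Hypothetical sketch: each word box is (text, x, y), as a layout-aware
# OCR engine might emit. Grouping by y-coordinate recovers the rows;
# sorting by x within each row recovers the column order.

def group_into_rows(boxes, y_tolerance=5):
    """Cluster word boxes into table rows by vertical position."""
    rows = {}
    for text, x, y in boxes:
        # Snap y to the nearest existing row within tolerance
        key = next((k for k in rows if abs(k - y) <= y_tolerance), y)
        rows.setdefault(key, []).append((x, text))
    # Sort rows top-to-bottom, cells left-to-right
    return [[t for _, t in sorted(cells)] for _, cells in sorted(rows.items())]

boxes = [
    ("Qty", 10, 100), ("Item", 120, 101), ("Price", 300, 99),
    ("2", 10, 130), ("Widget", 120, 131), ("9.99", 300, 129),
]
print(group_into_rows(boxes))
# [['Qty', 'Item', 'Price'], ['2', 'Widget', '9.99']]
```

Without this grouping step, a plain reading-order pass can interleave cells from adjacent rows, which is exactly the quantity mixing described above.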
qPandx@reddit (OP)
I tried Docling, but my "AI" ran the benchmarks and told me that Docling is better as a fallback to OCRmyPDF + Tesseract. Is Docling slow to run? It takes quite some time, while my current setup is much faster.
Do you think I should push them further in testing? My parser struggles to read and parse accurate information from the uploaded PDF when it's an unknown/unseen template. Not sure how to make it handle unknown PDFs.
api-services@reddit
Just wondering. Has anyone tried PDFMiner?
qPandx@reddit (OP)
I read through the repo, but it doesn't seem to do OCR on image-only scans, and I think pdfplumber already does the job. I may be wrong though
binaryfireball@reddit
there is no way to get the magic box to shake out the text better other than to train it. that being said, not all pdf data needs to be extracted via ocr
qPandx@reddit (OP)
If it is a scanned/imaged pdf, how else can I extract the content?
binaryfireball@reddit
if it's only scanned images then yeah, only OCR. I was assuming there would be actual text as well; combining both would be the most accurate, since you minimize the amount of OCR'd text overall and can even use the real text to help train the OCR model. also experiment with different OCR services, as they have different levels of accuracy.
if the pdfs will continue to be generated, it's best if you can just convert the forms to have fields so you don't even have to parse anything
qPandx@reddit (OP)
No yeah, the users will upload PDFs ranging from scanned PDFs and selectable-text PDFs to known & coded templates and unknown/unseen templates. It's kind of a free-for-all, and I'm trying to make my codebase handle it with appropriate routing. I believe the only weakness I have is the OCR section. The parser does the job when it's selectable text.
I have to make it handle over 1500 types of order templates that we receive from customers.
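A minimal sketch of that routing decision might look like the following; `route_pages` and the 25-character threshold are made-up names and values, and the per-page texts would come from a text-layer extractor such as pdfplumber's `page.extract_text()`.

```python
# Minimal routing sketch: decide per page whether OCR is needed.
# A scanned (image-only) page yields no or almost no text from the
# text layer, so it gets routed to the OCR path instead.

def route_pages(page_texts, min_chars=25):
    """Return 'text' for pages with a usable text layer, 'ocr' otherwise."""
    routes = []
    for text in page_texts:
        usable = text is not None and len(text.strip()) >= min_chars
        routes.append("text" if usable else "ocr")
    return routes

pages = ["PO #1042\nQty  Item  Price\n2  Widget  9.99", "", None]
print(route_pages(pages))
# ['text', 'ocr', 'ocr']
```

A per-page (rather than per-document) decision also handles mixed PDFs where a scanned attachment follows digitally generated pages.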
Professional_Car3334@reddit
ocr for scanned pdfs is a whole different beast than selectable text, that's where most diy parsers fall apart.
i've been messing with reseek lately and it handles both types automatically, extracts text from images and pdfs without me routing anything. might save you from building that whole pipeline yourself.
1500 templates is no joke though, even with good ocr you're gonna need solid fallback logic for the weird ones
qPandx@reddit (OP)
Okay, this is something. Initially I was going with OpenRouter for Mistral-OCR as the OCR brains and Gemini as a secondary reviewer of my codebase's parser, then outputting the result to the user.
Reseek looks like it does both. Very curious now about how this setup would go.
Would you happen to know if there are limits? Is it your primary or a fallback? I'll reach out to them to see if I can test it out and whether it works with my setup.
Motox2019@reddit
Try trocr on huggingface. I believe it's a Microsoft model that I've had good luck with in the past reading structured table data written in a welding shop environment. It wasn't perfect, but decent. For your case, I'd expect pretty fantastic accuracy. It's a transformer-based OCR model, so a bit closer to AI, IIRC.
qPandx@reddit (OP)
I have trocr vs docling vs paddleocr vs ocrmypdf+tesseract vs mistral to try out extensively. However, do you think trocr will be the most accurate? Thing is, I'm on a work laptop, so I'm not sure how fast it'll run, and when I host it (on render), will it be fine?
Motox2019@reddit
I don’t have an answer as which will be the most accurate. I do know it worked much better for me than tesseract did though.
Yes, it's quite performant, depending on the size you end up using. I found training to be rather slow on an RTX 3060, but the actual OCR is quite quick. After I trained the model, I ran it at work on what I believe is a P1000-class GPU, and while slower, it was still fine.
Just for context, I was trying to transfer handwritten scanned tables into an Excel sheet, so I preprocessed the documents such that each cell became its own image and then ran the OCR over those cell images. I did this with ~800 PDF files, each with ~1-3 pages, and it took about 5-8 hours if I remember correctly. Might give you a clue as to how it might behave for your case.
Really, it just boils down to your GPU, but I don't think it should be a problem for you; if the large model is too much, just go down in size.
qPandx@reddit (OP)
Man, gemma e2b was already struggling and slow on this work laptop, so I don't think I can go down this path to even try it. I do appreciate you though
Motox2019@reddit
Ah fair enough. And no problem, best of luck!
Civil-Image5411@reddit
Which PaddleOCR variant did you use? They ship several models. In my experience it significantly outperforms Tesseract. One thing to watch out for: if you used the VL model, which is transformer-based, it can be very slow and get stuck in generation loops when the parameters aren’t set correctly.
Here is another OCR server based on the non-VL / non-autoregressive PaddleOCR model: https://github.com/aiptimizer/TurboOCR
qPandx@reddit (OP)
Is it heavily dependent on CPU/GPU? I am using PPStructureV3 first, then plain PaddleOCR as a fallback. However, it just doesn't want to run and crashes.
I'm currently running OCRmyPDF+Tesseract as primary; the Paddle path is the fallback, which hits PPStructureV3 first and, if that fails, falls back to plain PaddleOCR (CPU-only).
My work laptop specs are an Ultra 7 258V with 32 GB RAM and an Intel Arc 140V GPU (16 GB).
Civil-Image5411@reddit
So StructureV3 and the non-VL PaddleOCR both don’t work?
I'm not sure. PPStructureV3 worked for me on my Nvidia GPU, but depending on the models you're using it requires a lot of resources, though 32 GB of memory should be enough. Not sure it can use the Intel GPU, but it should run on CPU.
TurboOCR runs on CPU, and you can pass the PDF directly without having to convert it to an image first. It's one command to run the Docker container, in case you wanna try it out.
Alternatively, there is also OnnxOCR on GitHub, which could potentially also utilize your GPU; you can plug in whatever backend you want.
qPandx@reddit (OP)
FYI, this is the first time I've done such a project, but if it works on my system while utilizing the CPU/GPU and I host it on render/an on-prem server, how could users run it if they have weak specs? Will it also be very demanding to run?
At the end of the day, it's a project that will roll out to departments at my workplace, and they are the ones who will be using it daily.
StructureV3 and plain PaddleOCR were taking a really long time to do anything, and then it just crashes (looking at my terminal, it's as if I pressed ctrl+c when I didn't). I will try to get it working again temporarily to see how it performs against my current flow of OCRmyPDF+Tesseract, but do you think I should trial TurboOCR and OnnxOCR?
I will have to run a test between Docling vs Paddle vs OCRmyPDF+Tesseract vs Mistral-OCR (if local doesn't work) vs TurboOCR vs OnnxOCR.
That looks like quite extensive testing, but whatever gives me the most accuracy + speed is what I really need.
Civil-Image5411@reddit
Well, it depends. Most of them also run on low specs; you could even offload to disk (swap) if you don't have enough memory, but at some point it just gets extremely slow. The easiest option is of course to run it via a cloud provider like Mistral OCR, but that gets expensive at high volume. You could also serve it from one computer/server in your organization and give the other users access to it (for instance via VPN). For OnnxOCR (only supports English, Chinese, and Japanese) and TurboOCR (supports Latin languages), you have to check whether it supports the language you need; not all models do.
qPandx@reddit (OP)
WinError 127, a missing DLL, was the error.
Running via Mistral at $2/1000 documents is very reasonable, and my managers definitely don't mind that. We wouldn't expect high volumes, and I tried this mistral ocr combined with gemini for maximum accuracy, which definitely worked but was also kind of costly (running via OpenRouter).
I could take out the gemini AI reviewer and instead harden my code's parsing, in which case cost-wise we'd only be running mistral ocr.
I guess I took it upon myself as a challenge to do everything locally, and I'm paying the price in headaches.
Languages are not an issue; it's mainly just numbers and English templates, or rarely French templates (I'm in Canada).
To give you a quick example, we have Adobe PDF licenses, and when I ran the built-in OCR feature, it would read a 0 as an 8, which was really dumb. Initially I thought, OK, if the PDF requires OCR then users can just run it through Adobe and feed it into my project, but after trialing this I couldn't trust Adobe's OCR, which put me down this rabbit hole.
I could run a VPN to one machine, but it didn't seem ideal if 20-30 users are running it.
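One cheap guard against that kind of digit swap, sketched here with hypothetical names: order lines carry their own redundancy, since qty × unit price should equal the line total, so a misread digit usually breaks the arithmetic and can be flagged for human review.

```python
# Post-OCR sanity check sketch for digit-confusion errors (0 vs 8 etc.):
# if a line item's numbers don't reconcile, flag it instead of trusting it.

def check_line(qty, unit_price, line_total, tolerance=0.01):
    """Return True if qty * unit_price matches the OCR'd line total."""
    return abs(qty * unit_price - line_total) <= tolerance

assert check_line(10, 2.50, 25.00)      # consistent line, passes
assert not check_line(18, 2.50, 25.00)  # '10' misread as '18' is caught
```

It won't catch every error (a misread total and a misread quantity can cancel out), but it turns silent corruption into reviewable flags.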
Civil-Image5411@reddit
Yes, if you trust cloud providers and don't have high volume, it’s much easier and potentially even cheaper to just use their OCR as well.
P.S. Serving from a single machine isn’t necessarily slow. With a mid-range NVIDIA GPU, you could serve around 100 images per second concurrently using TurboOCR, which is probably fast enough.
Civil-Image5411@reddit
might also be worth checking what the error message actually is 😁
sugarlata@reddit
Paddle OCR is a good fit if you have a GPU. I've found it treats everything as an image, and on CPU it can take a while, appearing to freeze (in one case I found a 6-page document taking over an hour). With a GPU it's seconds, but you need to feed in the GPU parameters when instantiating the model.
I've used OCRv5 to get all the text from a document, unstructured, and process it from there as you want. I've found the other modules to be very hit and miss with document structure.
qPandx@reddit (OP)
I tried it, and yeah, it takes forever and crashes for me personally. Can't risk releasing that to my users, especially since they don't even have the specs I have.
My work laptop specs are an Ultra 7 258V with 32 GB RAM and an Intel Arc 140V GPU (16 GB), and by users I mean the departments at my work.
Basic-Gazelle4171@reddit
ocr on scanned pdfs is a nightmare, and tesseract really struggles with tables and aligned numbers. i've been there with the quantity fields getting jumbled and lines just disappearing entirely.
Qoest for Developers has an OCR API that handles structured extraction way better, especially for forms and order docs. it actually keeps the table layout intact and returns clean json with the quantities parsed right. way less headache than fighting with open source tools that loop forever or miss half the page.
qPandx@reddit (OP)
Their website is quite vague; it says I have 100 credits for the OCR API, but how many credits would I be using per PDF? Would you happen to know?
If I don't end up doing local OCR, then I will probably stick with Mistral-OCR, unless there is an obviously better alternative.
presentsq@reddit
If you are fine with making API calls, then I highly recommend checking out Upstage's OCR solutions.
I benchmarked OCR APIs at work a while back (different task though; I was testing OCR on extremely noisy images). Surprisingly, a Korean company called Upstage had the best-performing model.
I think they have two OCR-related products, one for pure OCR and one that specializes in parsing documents like in your case. The price was pretty cheap, and I think they give free credits for testing.
From my experience, using APIs can save you a lot of headache and time, so if you're interested, definitely check it out.
Affectionate_Way337@reddit
OCR APIs aren't some magic fix for document parsing; they're just another tool, and people DO use them when self-hosted stuff falls over.
If you're expecting perfect extraction out of the box, sorry, that's not gonna happen. But a solid API can save you weeks of preprocessing hell for messy layouts.
I went down the self-hosted rabbit hole once and burned like two weekends on tesseract configs before just throwing money at a service.
presentsq@reddit
Exactly, and if changing the config and adding preprocessing doesn't meet your requirements, then you have to train your own weights. You need to collect data, annotate it, train, evaluate, maybe tweak the model a little, and repeat... that can take months, and you still might not get the desired performance. Considering how much pain you skip, API calls are actually very cheap.
qPandx@reddit (OP)
Would you happen to know how it compares with Mistral OCR? Mistral OCR is where I'll head if nothing else works, but I'm wondering how it compares in terms of price, quality, etc.
The PDFs that users will be uploading aren't noisy at all, but I do need it to be very accurate, as my whole project is to convert them into a .csv file so it can be easily imported into our ERP.
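For the final CSV step, a minimal stdlib sketch; the column names here are placeholders, not the real ERP import schema.

```python
import csv
import io

def lines_to_csv(line_items):
    """Serialize parsed order lines to CSV text for ERP import.

    line_items is a list of dicts; the field names are hypothetical
    placeholders for whatever the ERP importer actually expects.
    """
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=["qty", "description", "unit_price"])
    writer.writeheader()
    writer.writerows(line_items)
    return buf.getvalue()

print(lines_to_csv([{"qty": 2, "description": "Widget", "unit_price": 9.99}]))
```

Using `csv.DictWriter` rather than string joining handles quoting automatically, which matters once item descriptions contain commas.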
presentsq@reddit
Honestly, I haven't tried Mistral OCR myself, but I assume it would be pretty good, being a model made by Mistral.
It seems all you have to do to compare the two models is swap a few lines of API calls.
https://docs.mistral.ai/resources/sdks
https://console.upstage.ai/docs/capabilities/parse/document-ocr
Since Mistral OCR seems to be a little cheaper, I would test it first and just use it if it's good enough.
One last FYI, https://www.reddit.com/r/MistralAI/comments/1n6r1y4/bouding_boxes_mistral_ocr/?tl=en
This seems to suggest that Mistral OCR does not provide individual text boxes, which would be a problem if you need to select text by position (using the bbox information). Very weird!
I haven't tried this myself; please let me know whether it's true if you do end up using Mistral OCR.
danted002@reddit
Make sure you pre-download the OCR models, or you will end up with your server downloading 1.1 GB the first time it parses a document (and if you use Docker, that happens on every container restart).
FarRub2855@reddit
That silent download is a killer. If a user is waiting on an order to parse and the app just hangs while it pulls down a gig of data, they're gonna assume the whole system is broken.
danted002@reddit
You have an architectural problem if your app hangs on anything that might take more than a second to process. File parsing should be done async, with the UI polling for status.
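A stdlib-only sketch of that submit-then-poll pattern; in a real app the polling would happen over HTTP, and `parse_pdf` is a stand-in for the actual slow OCR/parsing step.

```python
import threading
import time
import uuid

# In-memory job table; a real service would use a database or task queue.
JOBS = {}

def parse_pdf(path):
    time.sleep(0.1)  # simulate a slow OCR/parsing step
    return {"lines": 3, "source": path}

def submit(path):
    """Kick off parsing in a worker thread; return a job id immediately."""
    job_id = str(uuid.uuid4())
    JOBS[job_id] = {"status": "running", "result": None}

    def worker():
        JOBS[job_id] = {"status": "done", "result": parse_pdf(path)}

    threading.Thread(target=worker, daemon=True).start()
    return job_id

def poll(job_id):
    """What the UI calls repeatedly instead of blocking on the upload."""
    return JOBS[job_id]["status"]

job = submit("order.pdf")
while poll(job) != "done":  # the real UI would poll over HTTP
    time.sleep(0.02)
print(JOBS[job]["result"])
```

The upload request returns the job id right away, so even a minute-long model download or OCR run never blocks the UI.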
qPandx@reddit (OP)
I think I did, via the terminal, and I also downloaded PaddleOCR from the GitHub repo, but it just doesn't seem to work for some reason. Where can I find the downloads for those models? What model do you recommend for max accuracy?
danted002@reddit
I meant if you are going with Docling
MathMXC@reddit
Docling! It's a bit overpowered for your use case, but it should be perfect
qPandx@reddit (OP)
Would you happen to know how it compares to the ones I tried?
MathMXC@reddit
It's definitely better than tesseract ootb, but I can't speak to the others
MaskedSmizer@reddit
The Mistral OCR endpoint is my go-to. Not suitable if you are trying to keep everything local, but good (although not perfect) accuracy.
qPandx@reddit (OP)
Yeah, I tried Mistral, but I'm running it from OpenRouter as mistral-ocr, and it was doing the job when I combined it with an AI reviewer (gemini 3.1-flash).
How can I use Mistral without OpenRouter, and possibly without the AI reviewer (fallback option)?
MaskedSmizer@reddit
Just use their SDK and wire it into your pipeline as needed https://share.google/mTkucnsmTgX3ZKQ9w
Examples in the cookbook https://github.com/mistralai/client-python/tree/main/examples%2Fmistral%2Focr
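For reference, a hedged sketch of what the direct SDK call looks like, based on Mistral's docs (`pip install mistralai`); double-check the model name and response shape against the cookbook before relying on it.

```python
import os

def ocr_request_payload(url):
    """Build the document payload for an OCR call (pure helper)."""
    return {"type": "document_url", "document_url": url}

def run_ocr(url):
    """Call Mistral's OCR endpoint directly, no OpenRouter in between.

    Requires MISTRAL_API_KEY in the environment; imported lazily so the
    payload helper above stays testable without the SDK installed.
    """
    from mistralai import Mistral

    client = Mistral(api_key=os.environ["MISTRAL_API_KEY"])
    resp = client.ocr.process(
        model="mistral-ocr-latest",
        document=ocr_request_payload(url),
    )
    # Each page comes back as markdown text
    return "\n".join(page.markdown for page in resp.pages)

print(ocr_request_payload("https://example.com/order.pdf"))
```

Dropping the Gemini reviewer then just means feeding `run_ocr`'s markdown straight into your own parser instead of a second model.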
qPandx@reddit (OP)
Very well. Thank you
zangler@reddit
Build a classifier, train it, profit.
qPandx@reddit (OP)
First thing I tried, but it didn't work. Not optimal for 1600 different types of templates
zangler@reddit
I mean... you can train on the templates. I'm not saying it's easy, but I have done this exact thing multiple times.
Another option is a multi-step model design that only resolves one or two parts of the template, working in concert with a classifier trained on the outputs of the pre-model as additional inputs to the final classifier. Also, consider Bayes for this if you don't have a high volume of samples... or even if you do. Additionally, those outputs and their posteriors can be fed into a downstream model.
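As a toy illustration of the Bayes suggestion (not a production design; labels and training text are invented), a tiny multinomial naive Bayes over header words can route a parsed order to a known template family:

```python
import math
from collections import Counter, defaultdict

class TemplateClassifier:
    """Toy multinomial naive Bayes with Laplace smoothing over word counts."""

    def __init__(self):
        self.word_counts = defaultdict(Counter)
        self.doc_counts = Counter()

    def train(self, label, text):
        self.doc_counts[label] += 1
        self.word_counts[label].update(text.lower().split())

    def predict(self, text):
        words = text.lower().split()
        total_docs = sum(self.doc_counts.values())
        best, best_score = None, -math.inf
        for label, counts in self.word_counts.items():
            total = sum(counts.values())
            vocab = len(counts)
            # Log prior + smoothed log likelihood of each word
            score = math.log(self.doc_counts[label] / total_docs)
            for w in words:
                score += math.log((counts[w] + 1) / (total + vocab))
            if score > best_score:
                best, best_score = label, score
        return best

clf = TemplateClassifier()
clf.train("acme_po", "acme purchase order qty item price")
clf.train("globex_inv", "globex invoice amount due tax")
print(clf.predict("purchase order from acme qty 2"))
# acme_po
```

A real system at 1500+ templates would add layout features and many samples per template, but the smoothing means even a handful of examples per label gives usable posteriors, which is the low-sample-volume point above.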