Best way to handle OCR for scanned PDFs in a web app (cost vs accuracy)?
Posted by MeanMasterpiece5438@reddit | Python | View on Reddit | 34 comments
Hey, I’m building a project where users upload PDFs and I need to extract text from them.
For normal text PDFs, extraction works fine. But for scanned/image-based PDFs, I’m using Tesseract + some preprocessing.
The problem is:
- Accuracy is inconsistent (especially on low-quality scans)
- Output needs cleanup
- Doesn’t handle structure well (tables, formatting, etc.)
I’ve also looked into Google Vision OCR, but:
- It asks for card details (which is fine, but I’m cautious)
- Free tier is limited
- Not sure if it’s worth depending on it long-term
Right now I’m considering:
- Tesseract (free but weak)
- PaddleOCR (better but more setup)
- Google Vision (accurate but paid eventually)
My goal:
- Build something reliable enough for real users (not just demo-level)
- Keep costs low initially (student project)
- Scale later if needed
Questions:
- What OCR stack would you recommend for this use case?
- Is it worth switching to PaddleOCR over Tesseract?
- For those using Google Vision OCR — how do you manage costs?
- Any tips for improving OCR accuracy (preprocessing, pipelines, etc.)?
Would appreciate real-world advice instead of just docs.
Thanks.
Certain-Taro-6627@reddit
Hi, any updates on which library/tech you are moving forward with?
Confident-Ninja-733@reddit
I used Qoest API for a similar project and it handled scanned PDFs way better than Tesseract without the setup headache.
PaddleOCR is solid if you want self hosted, but for a student project I'd just start with something that works out of the box.
GameCounter@reddit
It depends on the document type, but I use docling (https://github.com/docling-project/docling) for my research projects.
GameCounter@reddit
It has a server that works pretty well to turn it into an API.
https://github.com/docling-project/docling-serve
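If it helps, basic usage is only a few lines (a minimal sketch following the docling quickstart; the file name is just a placeholder):

```python
# pip install docling
from docling.document_converter import DocumentConverter

converter = DocumentConverter()
result = converter.convert("scanned_exam.pdf")  # placeholder path; accepts local files or URLs

# docling runs layout analysis + OCR and can export the parsed result as Markdown
print(result.document.export_to_markdown())
```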
Agreeable_Degree5860@reddit
Running docling-serve locally works fine until you're the one debugging why a scanned PDF choked the pipeline at 2am. Self hosting always looks cheaper on paper.
I burned a weekend on infra and still ended up pointing my app at Qoest API for OCR. Pay for what you use, no gpu babysitting. Sometimes the math on free doesn't account for your own time.
If you're already running kubernetes and have spare capacity, self hosting might make sense. For everything else, hosted options exist for a reason.
More-Flight-7741@reddit
Docling is solid for research stuff, especially if you're already in that workflow.
For web apps where you're processing a ton of scanned PDFs and need structured JSON back, I've been using Qoest API. Their pay-per-use pricing keeps costs predictable when volume spikes, which beats locking into another monthly SaaS tier.
MeanMasterpiece5438@reddit (OP)
exam papers, which are mainly scanned-image pdfs
GameCounter@reddit
Is there extensive handwriting?
That's far more difficult than scanned docs consisting mostly of mechanical print.
MeanMasterpiece5438@reddit (OP)
nah computer generated papers
which are scanned to make pdfs
ElectronicStyle532@reddit
PaddleOCR is usually a step up from Tesseract in accuracy, especially for messy scans. But if structure (tables, layout) matters, you’ll still need post-processing or a layout-aware tool.
Prior_Yard7273@reddit
Qoest API handles the table extraction automatically, so you can skip the post processing headache entirely.
MeanMasterpiece5438@reddit (OP)
ok
Heavy-Inevitable-292@reddit
i've been using Qoest API for scanned pdfs and the accuracy's solid without the enterprise pricing. most devs overthink ocr and end up paying for features they'll never touch.
MeanMasterpiece5438@reddit (OP)
is this free?
swift-sentinel@reddit
Get an Anthropic or OpenAI account with API access. AI is fit for purpose here.
Ancient_Fun9680@reddit
Using an LLM for OCR feels like renting a bulldozer to plant tulips. I've been running scanned PDFs through Qoest API instead and it's been way cheaper for pure text extraction.
Edit: just checked my last bill. Routing the same volume through Claude would've cost me about 4x more for zero accuracy gain on documents.
DrDeems@reddit
I was having accuracy issues with tesseract on one of my projects and switched to PaddleOCR. Accuracy for me was much higher. I was mainly working with screenshots from games that have decent enough resolution though.
Practical_Drop5112@reddit
PaddleOCR is solid for game screenshots with clean text, but Qoest API ended up saving me a ton of headaches when I needed to batch process mixed documents with tables and handwriting.
MeanMasterpiece5438@reddit (OP)
i think yeah
ppl are strongly suggesting paddle over tesseract
banalytics_live@reddit
PaddleOCR consists of two main models - text detection and text recognition. And it works better than Tesseract.
Prestigious-Box9961@reddit
I've had better luck with services that handle the preprocessing automatically. Qoest API cleaned up some messy invoice PDFs for me that Paddle struggled with. Worth testing both on your actual document set before committing.
MeanMasterpiece5438@reddit (OP)
okok
Scared-Beyond-4531@reddit
Qoest API handles scanned PDFs pretty well if you need something that just works without a subscription.
thefinest@reddit
Have you heard of AI?
MeanMasterpiece5438@reddit (OP)
no
whats that?
TheseTradition3191@reddit
for the structure and table problem, vision LLMs are worth considering alongside traditional OCR. paddleocr is a solid step up for raw text accuracy, but for tables, form layouts, and messy formatting it still needs post-processing. a multimodal model handles cleanup and structuring in the same call.
rough cost with haiku: around $0.002-0.005 per page depending on complexity, which competes well with google vision for low to medium volume.
main downside: latency is higher than local OCR, so batch if you have volume. but for the "output needs cleanup" problem it solves two steps at once.
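roughly what a single-page call looks like with the Anthropic SDK (a sketch, not production code; the model name and prompt are placeholders, check current models and pricing yourself):

```python
# sketch: one pre-rendered page image -> cleaned-up text via a multimodal model
# assumes `pip install anthropic` and ANTHROPIC_API_KEY set in the environment
import base64
import anthropic

client = anthropic.Anthropic()

with open("page_001.png", "rb") as f:  # hypothetical page image rendered from the PDF
    page_b64 = base64.b64encode(f.read()).decode()

message = client.messages.create(
    model="claude-3-5-haiku-latest",  # placeholder; pick whatever cheap vision-capable model fits
    max_tokens=2000,
    messages=[{
        "role": "user",
        "content": [
            {"type": "image",
             "source": {"type": "base64", "media_type": "image/png", "data": page_b64}},
            {"type": "text",
             "text": "Extract all text from this exam page. Return any tables as Markdown."},
        ],
    }],
)
print(message.content[0].text)
```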
TheseTradition3191@reddit
paddleOCR is a big step up from tesseract for scan quality, especially anything with slight rotation or uneven lighting.
but the preprocessing before the OCR matters just as much as which engine you pick:
dpi=300 is the big one. most people set it to 150 and wonder why small text comes out garbled. under 200 both engines struggle.
for tables in exam papers: paddle's table detection works ok but you'll probably still end up postprocessing the bounding boxes for anything structural. docling handles layout better if tables become the bottleneck.
skip google vision for now. paddle + decent preprocessing gets you to 85-90% on typical exam scans.
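a rough sketch of the 300 dpi render + paddle step (assumes PyMuPDF and the classic PaddleOCR 2.x API; details may differ by version):

```python
# sketch: render PDF pages at 300 dpi, then run PaddleOCR on each page image
# pip install paddleocr pymupdf
import fitz  # PyMuPDF
from paddleocr import PaddleOCR

ocr = PaddleOCR(use_angle_cls=True, lang="en")  # angle classifier helps with slight rotation

doc = fitz.open("exam_paper.pdf")  # placeholder input file
for i, page in enumerate(doc):
    pix = page.get_pixmap(dpi=300)   # 300 dpi render: the single biggest accuracy win
    img_path = f"page_{i:03d}.png"
    pix.save(img_path)

    result = ocr.ocr(img_path, cls=True)
    for box, (text, confidence) in (result[0] or []):
        print(f"{confidence:.2f}  {text}")
```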
MeanMasterpiece5438@reddit (OP)
i cant go with paddle
coz i am on render free tier
it doesnt allow paddle (500mb)
wRAR_@reddit
What's the business model of PaddleOCR that benefits from these paid posts?
USS_Penterprise_1701@reddit
There isn't one, because it's free and this isn't a paid post. Weird to suggest it considering the topic wasn't even explicitly PaddleOCR.
MeanMasterpiece5438@reddit (OP)
??
Student-Tricky@reddit
Hi! If you need a highly accurate and straightforward OCR solution, I recommend checking out LensCopy.
We are dedicated to building the simplest OCR software without compromising on performance. Our engine tackles the tough tasks—complex table extraction, stamp recognition, and handwritten documents—with near 99.9% accuracy. We also seamlessly support exporting your scanned documents into DOCX, XLSX, and Markdown formats.
As the founder, I invite you to try our solution at https://lenscopy.com/ . Registering gives you a free 10-page quota to see if it fits your needs.
aminoy77@reddit
PaddleOCR is worth the setup over Tesseract — accuracy difference on real-world scans is significant, especially for tables. Tesseract needs heavy preprocessing to get comparable results.
For the preprocessing pipeline regardless of engine: deskew first, then denoise, then binarize. Most accuracy issues come from skipped preprocessing not the OCR engine itself.
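A minimal sketch of that deskew → denoise → binarize pipeline with OpenCV (the angle handling and thresholds are rough; tune them on your own scans):

```python
# pip install opencv-python numpy
import cv2
import numpy as np

def preprocess(path: str) -> np.ndarray:
    gray = cv2.imread(path, cv2.IMREAD_GRAYSCALE)

    # 1) deskew: estimate the tilt from the min-area rect around dark (text) pixels
    coords = np.column_stack(np.where(gray < 200)).astype(np.float32)
    angle = cv2.minAreaRect(coords)[-1]
    if angle > 45:            # normalise the OpenCV angle convention to a small tilt
        angle -= 90
    h, w = gray.shape
    m = cv2.getRotationMatrix2D((w / 2, h / 2), angle, 1.0)
    deskewed = cv2.warpAffine(gray, m, (w, h),
                              flags=cv2.INTER_CUBIC, borderMode=cv2.BORDER_REPLICATE)

    # 2) denoise
    denoised = cv2.fastNlMeansDenoising(deskewed, h=10)

    # 3) binarize (Otsu picks the threshold automatically)
    _, binary = cv2.threshold(denoised, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)
    return binary  # feed this array (or a saved copy) into whichever OCR engine you use
```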
On Google Vision costs — the free tier (1000 pages/month) covers most student projects. If you go beyond that, batch your requests and cache results aggressively. Never re-OCR the same document twice.
My approach for a similar project: PaddleOCR for standard docs, fallback to Google Vision only for documents where confidence score drops below threshold. Keeps costs near zero for 90% of cases.
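The fallback logic is short. A sketch only (assumes the PaddleOCR 2.x result format and the google-cloud-vision client with credentials already configured; the threshold value is made up):

```python
# sketch: PaddleOCR first, fall back to Google Vision only when confidence is low
# pip install paddleocr google-cloud-vision
from paddleocr import PaddleOCR
from google.cloud import vision

CONFIDENCE_THRESHOLD = 0.85  # hypothetical cutoff; tune on your own documents

ocr = PaddleOCR(use_angle_cls=True, lang="en")
vision_client = vision.ImageAnnotatorClient()

def extract_text(image_path: str) -> str:
    result = ocr.ocr(image_path, cls=True)
    lines = result[0] or []
    if lines:
        avg_conf = sum(conf for _, (_, conf) in lines) / len(lines)
        if avg_conf >= CONFIDENCE_THRESHOLD:
            return "\n".join(text for _, (text, _) in lines)

    # low confidence (or empty result): pay for one Google Vision call instead
    with open(image_path, "rb") as f:
        image = vision.Image(content=f.read())
    response = vision_client.document_text_detection(image=image)
    return response.full_text_annotation.text
```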
MeanMasterpiece5438@reddit (OP)
i think i will try paddle then google