Best way to handle OCR for scanned PDFs in a web app (cost vs accuracy)?
Posted by MeanMasterpiece5438@reddit | Python | View on Reddit | 34 comments
Hey, I’m building a project where users upload PDFs and I need to extract text from them.
For normal text PDFs, extraction works fine. But for scanned/image-based PDFs, I’m using Tesseract + some preprocessing.
The problem is:
- Accuracy is inconsistent (especially on low-quality scans)
- Output needs cleanup
- Doesn’t handle structure well (tables, formatting, etc.)
I’ve also looked into Google Vision OCR, but:
- It asks for card details (which is fine, but I’m cautious)
- Free tier is limited
- Not sure if it’s worth depending on it long-term
Right now I’m considering:
- Tesseract (free but weak)
- PaddleOCR (better but more setup)
- Google Vision (accurate but paid eventually)
My goal:
- Build something reliable enough for real users (not just demo-level)
- Keep costs low initially (student project)
- Scale later if needed
Questions:
- What OCR stack would you recommend for this use case?
- Is it worth switching to PaddleOCR over Tesseract?
- For those using Google Vision OCR — how do you manage costs?
- Any tips for improving OCR accuracy (preprocessing, pipelines, etc.)?
Would appreciate real-world advice instead of just docs.
Thanks.
Certain-Taro-6627@reddit
Hi, any updates on which library/tech you are moving forward with?
Confident-Ninja-733@reddit
I used Qoest API for a similar project and it handled scanned PDFs way better than Tesseract without the setup headache.
PaddleOCR is solid if you want self hosted, but for a student project I'd just start with something that works out of the box.
GameCounter@reddit
It depends on the document type, but I use docling (https://github.com/docling-project/docling) for my research projects.
GameCounter@reddit
It has a server that works pretty well to turn it into an API.
https://github.com/docling-project/docling-serve
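If it helps, basic usage is only a few lines (a minimal sketch following the docling quickstart; the file name is just a placeholder):

```python
# pip install docling
from docling.document_converter import DocumentConverter

converter = DocumentConverter()
result = converter.convert("scanned_exam.pdf")  # placeholder path; accepts local files or URLs

# docling runs layout analysis + OCR and can export the parsed result as Markdown
print(result.document.export_to_markdown())
```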
Agreeable_Degree5860@reddit
Running docling-serve locally works fine until you're the one debugging why a scanned PDF choked the pipeline at 2am. Self hosting always looks cheaper on paper.
I burned a weekend on infra and still ended up pointing my app at Qoest API for OCR. Pay for what you use, no gpu babysitting. Sometimes the math on free doesn't account for your own time.
If you're already running kubernetes and have spare capacity, self hosting might make sense. For everything else, hosted options exist for a reason.
More-Flight-7741@reddit
Docling is solid for research stuff, especially if you're already in that workflow.
For web apps where you're processing a ton of scanned PDFs and need structured JSON back, I've been using Qoest API. Their pay-per-use pricing keeps costs predictable when volume spikes, which beats locking into another monthly SaaS tier.
MeanMasterpiece5438@reddit (OP)
exam papers, which are mainly scanned-image pdfs
GameCounter@reddit
Is there extensive handwriting?
That's far more difficult than scanned docs consisting mostly of mechanical print.
MeanMasterpiece5438@reddit (OP)
nah computer generated papers
which are scanned to make pdfs
ElectronicStyle532@reddit
PaddleOCR is usually a step up from Tesseract in accuracy, especially for messy scans. But if structure (tables, layout) matters, you’ll still need post-processing or a layout-aware tool.
Prior_Yard7273@reddit
Qoest API handles the table extraction automatically, so you can skip the post processing headache entirely.
MeanMasterpiece5438@reddit (OP)
ok
Heavy-Inevitable-292@reddit
i've been using Qoest API for scanned pdfs and the accuracy's solid without the enterprise pricing. most devs overthink ocr and end up paying for features they'll never touch.
MeanMasterpiece5438@reddit (OP)
is this free?
swift-sentinel@reddit
Get an Anthropic or OpenAI account with API access. AI is fit for purpose here.
Ancient_Fun9680@reddit
Using an LLM for OCR feels like renting a bulldozer to plant tulips. I've been running scanned PDFs through Qoest API instead and it's been way cheaper for pure text extraction.
Edit: just checked my last bill. Routing the same volume through Claude would've cost me about 4x more for zero accuracy gain on documents.
DrDeems@reddit
I was having accuracy issues with tesseract on one of my projects and switched to PaddleOCR. Accuracy for me was much higher. I was mainly working with screenshots from games that have decent enough resolution though.
Practical_Drop5112@reddit
PaddleOCR is solid for game screenshots with clean text, but Qoest API ended up saving me a ton of headaches when I needed to batch process mixed documents with tables and handwriting.
MeanMasterpiece5438@reddit (OP)
i think yeah
ppl are strongly suggesting paddle over tesseract
banalytics_live@reddit
PaddleOCR consists of two main models - text detection and text recognition. And it works better than Tesseract.
Prestigious-Box9961@reddit
I've had better luck with services that handle the preprocessing automatically. Qoest API cleaned up some messy invoice PDFs for me that Paddle struggled with. Worth testing both on your actual document set before committing.
MeanMasterpiece5438@reddit (OP)
okok
Scared-Beyond-4531@reddit
Qoest API handles scanned PDFs pretty well if you need something that just works without a subscription.
thefinest@reddit
Have you heard of AI?
MeanMasterpiece5438@reddit (OP)
no
whats that?
TheseTradition3191@reddit
for the structure and table problem, vision LLMs are worth considering alongside traditional OCR. paddleocr is a solid step up for raw text accuracy, but for tables, form layouts, and messy formatting it still needs post-processing. a multimodal model handles cleanup and structuring in the same call.
rough cost with haiku: around $0.002-0.005 per page depending on complexity, which competes well with google vision for low to medium volume.
main downside: latency is higher than local OCR, so batch if you have volume. but for the "output needs cleanup" problem it solves two steps at once.
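roughly what a single-page call looks like with the Anthropic SDK (a sketch, not production code; the model name and prompt are placeholders, check current models and pricing yourself):

```python
# sketch: one pre-rendered page image -> cleaned-up text via a multimodal model
# assumes `pip install anthropic` and ANTHROPIC_API_KEY set in the environment
import base64
import anthropic

client = anthropic.Anthropic()

with open("page_001.png", "rb") as f:  # hypothetical page image rendered from the PDF
    page_b64 = base64.b64encode(f.read()).decode()

message = client.messages.create(
    model="claude-3-5-haiku-latest",  # placeholder; pick whatever cheap vision-capable model fits
    max_tokens=2000,
    messages=[{
        "role": "user",
        "content": [
            {"type": "image",
             "source": {"type": "base64", "media_type": "image/png", "data": page_b64}},
            {"type": "text",
             "text": "Extract all text from this exam page. Return any tables as Markdown."},
        ],
    }],
)
print(message.content[0].text)
```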
TheseTradition3191@reddit
paddleOCR is a big step up from tesseract for scan quality, especially anything with slight rotation or uneven lighting.
but the preprocessing before the OCR matters just as much as which engine you pick:
dpi=300 is the big one. most people set it to 150 and wonder why small text comes out garbled. under 200 both engines struggle.
for tables in exam papers: paddle's table detection works ok but you'll probably still end up postprocessing the bounding boxes for anything structural. docling handles layout better if tables become the bottleneck.
skip google vision for now. paddle + decent preprocessing gets you to 85-90% on typical exam scans.
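a rough sketch of the 300 dpi render + paddle step (assumes PyMuPDF and the classic PaddleOCR 2.x API; details may differ by version):

```python
# sketch: render PDF pages at 300 dpi, then run PaddleOCR on each page image
# pip install paddleocr pymupdf
import fitz  # PyMuPDF
from paddleocr import PaddleOCR

ocr = PaddleOCR(use_angle_cls=True, lang="en")  # angle classifier helps with slight rotation

doc = fitz.open("exam_paper.pdf")  # placeholder input file
for i, page in enumerate(doc):
    pix = page.get_pixmap(dpi=300)   # 300 dpi render: the single biggest accuracy win
    img_path = f"page_{i:03d}.png"
    pix.save(img_path)

    result = ocr.ocr(img_path, cls=True)
    for box, (text, confidence) in (result[0] or []):
        print(f"{confidence:.2f}  {text}")
```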
MeanMasterpiece5438@reddit (OP)
i cant go with paddle
coz i am on render free tier
it doesnt allow paddle (500mb)
wRAR_@reddit
What's the business model of PaddleOCR that benefits from these paid posts?
USS_Penterprise_1701@reddit
There isn't one, because it's free and this isn't a paid post. Weird to suggest it considering the topic wasn't even explicitly PaddleOCR.
MeanMasterpiece5438@reddit (OP)
??
Student-Tricky@reddit
Hi! If you need a highly accurate and straightforward OCR solution, I recommend checking out LensCopy.
We are dedicated to building the simplest OCR software without compromising on performance. Our engine tackles the tough tasks—complex table extraction, stamp recognition, and handwritten documents—with near 99.9% accuracy. We also seamlessly support exporting your scanned documents into DOCX, XLSX, and Markdown formats.
As the founder, I invite you to try our solution at https://lenscopy.com/ . Registering gives you a free 10-page quota to see if it fits your needs.
aminoy77@reddit
PaddleOCR is worth the setup over Tesseract — accuracy difference on real-world scans is significant, especially for tables. Tesseract needs heavy preprocessing to get comparable results.
For the preprocessing pipeline regardless of engine: deskew first, then denoise, then binarize. Most accuracy issues come from skipped preprocessing not the OCR engine itself.
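A minimal sketch of that deskew → denoise → binarize pipeline with OpenCV (the angle handling and thresholds are rough; tune them on your own scans):

```python
# pip install opencv-python numpy
import cv2
import numpy as np

def preprocess(path: str) -> np.ndarray:
    gray = cv2.imread(path, cv2.IMREAD_GRAYSCALE)

    # 1) deskew: estimate the tilt from the min-area rect around dark (text) pixels
    coords = np.column_stack(np.where(gray < 200)).astype(np.float32)
    angle = cv2.minAreaRect(coords)[-1]
    if angle > 45:            # normalise the OpenCV angle convention to a small tilt
        angle -= 90
    h, w = gray.shape
    m = cv2.getRotationMatrix2D((w / 2, h / 2), angle, 1.0)
    deskewed = cv2.warpAffine(gray, m, (w, h),
                              flags=cv2.INTER_CUBIC, borderMode=cv2.BORDER_REPLICATE)

    # 2) denoise
    denoised = cv2.fastNlMeansDenoising(deskewed, h=10)

    # 3) binarize (Otsu picks the threshold automatically)
    _, binary = cv2.threshold(denoised, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)
    return binary  # feed this array (or a saved copy) into whichever OCR engine you use
```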
On Google Vision costs — the free tier (1000 pages/month) covers most student projects. If you go beyond that, batch your requests and cache results aggressively. Never re-OCR the same document twice.
My approach for a similar project: PaddleOCR for standard docs, fallback to Google Vision only for documents where confidence score drops below threshold. Keeps costs near zero for 90% of cases.
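The fallback logic is short. A sketch only (assumes the PaddleOCR 2.x result format and the google-cloud-vision client with credentials already configured; the threshold value is made up):

```python
# sketch: PaddleOCR first, fall back to Google Vision only when confidence is low
# pip install paddleocr google-cloud-vision
from paddleocr import PaddleOCR
from google.cloud import vision

CONFIDENCE_THRESHOLD = 0.85  # hypothetical cutoff; tune on your own documents

ocr = PaddleOCR(use_angle_cls=True, lang="en")
vision_client = vision.ImageAnnotatorClient()

def extract_text(image_path: str) -> str:
    result = ocr.ocr(image_path, cls=True)
    lines = result[0] or []
    if lines:
        avg_conf = sum(conf for _, (_, conf) in lines) / len(lines)
        if avg_conf >= CONFIDENCE_THRESHOLD:
            return "\n".join(text for _, (text, _) in lines)

    # low confidence (or empty result): pay for one Google Vision call instead
    with open(image_path, "rb") as f:
        image = vision.Image(content=f.read())
    response = vision_client.document_text_detection(image=image)
    return response.full_text_annotation.text
```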
MeanMasterpiece5438@reddit (OP)
i think i will try paddle then google