Best way to handle OCR for scanned PDFs in a web app (cost vs accuracy)?

Posted by MeanMasterpiece5438@reddit | Python | 34 comments

Hey, I’m building a project where users upload PDFs and I need to extract text from them.

For normal text PDFs, extraction works fine. But for scanned/image-based PDFs, I’m using Tesseract + some preprocessing.
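For context, my current routing looks roughly like this. It's a sketch, not my exact code: I'm assuming `pdfplumber` for the text layer and `pdf2image` + `pytesseract` for the scanned path (swap in whatever libraries you prefer), and the 50-character "is this page scanned?" threshold is just a heuristic I picked.

```python
def is_scanned(extracted_text: str, min_chars: int = 50) -> bool:
    """Heuristic: a text-layer page yields a useful amount of text;
    a scanned page yields almost none. min_chars is an arbitrary cutoff."""
    return len(extracted_text.strip()) < min_chars

def ocr_pdf(path: str) -> str:
    # Third-party imports kept local so the heuristic above can be used
    # without these packages installed.
    import pdfplumber                         # text-layer extraction
    import pytesseract                        # needs the tesseract binary
    from pdf2image import convert_from_path   # needs poppler installed

    pages = []
    with pdfplumber.open(path) as pdf:
        for i, page in enumerate(pdf.pages):
            text = page.extract_text() or ""
            if is_scanned(text):
                # Rasterize just this page at 300 DPI and OCR it.
                image = convert_from_path(
                    path, dpi=300, first_page=i + 1, last_page=i + 1
                )[0]
                text = pytesseract.image_to_string(image)
            pages.append(text)
    return "\n".join(pages)
```

The per-page routing matters for mixed documents, where only some pages are scans.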

The problem is:

I’ve also looked into Google Vision OCR, but:

Right now I’m considering:

My goal:

Questions:

  1. What OCR stack would you recommend for this use case?
  2. Is it worth switching from Tesseract to PaddleOCR?
  3. For those using Google Vision OCR — how do you manage costs?
  4. Any tips for improving OCR accuracy (preprocessing, pipelines, etc.)?
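On question 3, here's the back-of-envelope model I'm working from. The tier numbers are assumptions based on Vision's published pricing at one point (roughly: first 1,000 pages/month free, then about $1.50 per 1,000) — please correct me against the current pricing page. The second function sketches the cost-control pattern I'm considering: OCR locally first and only send low-confidence pages to the paid API.

```python
FREE_TIER = 1_000      # pages per month at $0 (assumed tier)
PRICE_PER_1K = 1.50    # USD per 1,000 pages after the free tier (assumed)

def monthly_vision_cost(pages: int) -> float:
    """Estimated monthly Vision OCR bill for a given page volume."""
    billable = max(0, pages - FREE_TIER)
    return billable / 1_000 * PRICE_PER_1K

def pages_to_send(confidences: list[float], threshold: float = 0.80) -> int:
    """Hybrid pipeline: count pages whose local (Tesseract) confidence
    falls below the threshold, i.e. the pages worth paying to re-OCR."""
    return sum(1 for c in confidences if c < threshold)
```

So at 11,000 pages/month the model predicts ~$15, and the hybrid approach only incurs that for the pages Tesseract is unsure about.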

Would appreciate real-world advice instead of just docs.

Thanks.