How I handle OCR fallback and per-language field parsing when extracting data from PDFs in Python (w
Posted by LorenzoNardi@reddit | Python | 4 comments
I've been working on a document processing tool that extracts structured data from PDFs (invoices, bank statements, contracts) and I ran into two problems that aren't well documented anywhere: OCR fallback strategy and per-language field normalization. Sharing what worked.
**Problem 1: Silent OCR failure**
Most guides tell you to use `pdfplumber` or `PyMuPDF` to extract text. What they don't tell you is that scanned PDFs return an empty string (or worse, garbage spacing characters) without raising any exception. You'll process it, send it to an LLM, and get hallucinated data back – all silently.
My solution: check text length and density *before* calling the LLM. If the extracted text is below a threshold (I use 50 meaningful characters per page), fall back to Tesseract OCR:
```python
import io

import pdfplumber
import pytesseract
from pdf2image import convert_from_bytes


def extract_text_with_fallback(pdf_bytes: bytes) -> str:
    with pdfplumber.open(io.BytesIO(pdf_bytes)) as pdf:
        # extract_text() can return None or '' for image-only pages
        text = '\n'.join(p.extract_text() or '' for p in pdf.pages)
        # Scanned PDF check: meaningful chars per page
        pages = len(pdf.pages) or 1
    if len(text.strip()) / pages < 50:
        # Too little text: rasterize every page and run Tesseract on it
        images = convert_from_bytes(pdf_bytes, dpi=300)
        text = '\n'.join(pytesseract.image_to_string(img) for img in images)
    return text
```
The `dpi=300` matters a lot – at 150 dpi Tesseract misses characters on dense invoices. In my experience 300 is the sweet spot between accuracy and speed.
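One refinement to the density check, as a minimal sketch: `len(text.strip())` still counts spaces and stray punctuation inside the text, and an average over all pages can hide a single scanned page in an otherwise digital PDF. The helper names below (`meaningful_chars`, `pages_needing_ocr`) are hypothetical, and treating "meaningful" as "letters and digits" is an assumption, not something the libraries define:

```python
from typing import List, Optional


def meaningful_chars(page_text: str) -> int:
    """Count letters and digits only, ignoring whitespace and punctuation."""
    return sum(ch.isalnum() for ch in page_text)


def pages_needing_ocr(page_texts: List[Optional[str]], min_chars: int = 50) -> List[int]:
    """Return zero-based indexes of pages whose extracted text looks too thin."""
    return [
        i for i, text in enumerate(page_texts)
        if meaningful_chars(text or '') < min_chars
    ]


# Hypothetical per-page output, e.g. [p.extract_text() for p in pdf.pages]
texts = ["Invoice 2024-001 " * 10, "   \n \x0c ", None]
print(pages_needing_ocr(texts))  # [1, 2]
```

With the page indexes in hand, only those pages need to go through Tesseract, which should keep OCR time down on mixed documents.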
**Problem 2: Per-language field normalization**
European invoices are a nightmare. The same field can be:
- `Total` / `Totale` / `Gesamtbetrag` / `Montant total`
- Dates as `31/12/2024` (IT), `31.12.2024` (DE), `2024-12-31` (ISO)
- Decimals as `1.234,56` (IT/DE) vs `1,234.56` (EN)
Instead of trying to make one regex rule to catch all formats, I built a simple language detector that runs on a short sample of the text, then loads a locale-specific normalization config:
```python
import re

LOCALE_CONFIGS = {
    'it': {'decimal_sep': ',', 'thousand_sep': '.', 'date_formats': ['%d/%m/%Y', '%d-%m-%Y']},
    'de': {'decimal_sep': ',', 'thousand_sep': '.', 'date_formats': ['%d.%m.%Y']},
    'en': {'decimal_sep': '.', 'thousand_sep': ',', 'date_formats': ['%m/%d/%Y', '%Y-%m-%d']},
    'fr': {'decimal_sep': ',', 'thousand_sep': ' ', 'date_formats': ['%d/%m/%Y']},
}


def normalize_amount(raw: str, locale: str) -> float:
    # Unknown locales fall back to the English conventions
    cfg = LOCALE_CONFIGS.get(locale, LOCALE_CONFIGS['en'])
    cleaned = raw.replace(cfg['thousand_sep'], '').replace(cfg['decimal_sep'], '.')
    # Strip everything that is not a digit or the decimal point (currency symbols, spaces)
    return float(re.sub(r'[^\d.]', '', cleaned))
```
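The `date_formats` entries can be used the same way. A minimal sketch, assuming each format is tried in order and the first successful parse wins; `normalize_date` and the standalone `DATE_FORMATS` dict are hypothetical names, with the formats copied from the config above so the snippet runs on its own:

```python
from datetime import date, datetime

# Same formats as in LOCALE_CONFIGS, repeated here to keep the example standalone
DATE_FORMATS = {
    'it': ['%d/%m/%Y', '%d-%m-%Y'],
    'de': ['%d.%m.%Y'],
    'en': ['%m/%d/%Y', '%Y-%m-%d'],
    'fr': ['%d/%m/%Y'],
}


def normalize_date(raw: str, locale: str) -> date:
    """Parse a date string using the formats registered for the locale."""
    formats = DATE_FORMATS.get(locale, DATE_FORMATS['en'])
    for fmt in formats:
        try:
            return datetime.strptime(raw.strip(), fmt).date()
        except ValueError:
            continue  # try the next format for this locale
    raise ValueError(f"no {locale!r} date format matches {raw!r}")


print(normalize_date('31/12/2024', 'it').isoformat())  # 2024-12-31
print(normalize_date('31.12.2024', 'de').isoformat())  # 2024-12-31
```

Raising instead of guessing is deliberate in this sketch: a string like `01/02/2024` parses under both the IT and EN formats but means different days, so the locale has to decide.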
For language detection I use `langdetect` on the first 500 characters of extracted text – fast, lightweight, accurate enough for this use case.
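A rough sketch of that detection step, assuming the third-party `langdetect` package and its `detect()` function. The wrapper `detect_locale`, the `SUPPORTED_LOCALES` set, and the injectable `detector` argument are hypothetical additions so the fallback logic can be exercised without the library installed:

```python
from typing import Callable, Optional

# Locales that have a normalization config; anything else falls back to the default
SUPPORTED_LOCALES = {'it', 'de', 'en', 'fr'}


def detect_locale(
    text: str,
    sample_size: int = 500,
    default: str = 'en',
    detector: Optional[Callable[[str], str]] = None,
) -> str:
    """Detect the language of the first `sample_size` characters of text."""
    sample = text[:sample_size].strip()
    if not sample:
        return default
    if detector is None:
        # Imported lazily so the function can be tested with a stub detector
        from langdetect import detect
        detector = detect
    try:
        lang = detector(sample)
    except Exception:
        # langdetect raises when the sample has no usable features (e.g. only digits)
        return default
    return lang if lang in SUPPORTED_LOCALES else default


# Usage with a stub detector standing in for langdetect.detect
print(detect_locale("Totale fattura: 1.234,56", detector=lambda s: 'it'))  # it
```

Note that `langdetect` is non-deterministic by default; its README suggests setting `DetectorFactory.seed` if you need repeatable results on short samples.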
Hope this helps anyone building document processing pipelines. Happy to answer questions on edge cases I've hit.
Some-Session-1554@reddit
I swapped Tesseract for Qoest API on the OCR part
Just sends back structured JSON with the fields already sorted out and handles the language stuff for you
Made the whole pipeline way simpler compared to writing all that locale config yourself
Not saying your approach is wrong or anything
ianitic@reddit
What I did like 5ish years ago was fall back to ocrmypdf, which would also make the PDF searchable going forward. I also did this when pdfplumber output nothing but cid:999/random numbers.
A little later on I added document ai services, custom ml processes, and a rule engine to the pipeline.
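The `cid:` case is worth checking for explicitly, since that kind of output easily passes a plain length threshold. A minimal sketch, assuming the garbage shows up as `(cid:NNN)` or `cid:NNN` tokens; the names `cid_ratio` and `looks_like_cid_garbage` and the 0.3 threshold are hypothetical and would need tuning on real documents:

```python
import re

# Matches tokens such as "(cid:999)" or "cid:999"
CID_TOKEN = re.compile(r'\(?cid:\d+\)?')


def cid_ratio(text: str) -> float:
    """Fraction of non-whitespace characters that belong to cid tokens."""
    stripped = re.sub(r'\s+', '', text)
    if not stripped:
        return 0.0
    cid_chars = sum(len(m.group()) for m in CID_TOKEN.finditer(stripped))
    return cid_chars / len(stripped)


def looks_like_cid_garbage(text: str, threshold: float = 0.3) -> bool:
    """True when a large share of the extracted text is unmapped glyph codes."""
    return cid_ratio(text) > threshold


print(looks_like_cid_garbage("(cid:12)(cid:999) (cid:7)"))   # True
print(looks_like_cid_garbage("Invoice total 1.234,56 EUR"))  # False
```

A positive result can be routed to the same OCR fallback as an empty extraction.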
Immereally@reddit
Cheers.
I was actually just thinking about building my own app for this, something like an invoice manager to keep track of and update my medical and financial records
Great timing👍
timpkmn89@reddit
Which is exactly what it should do