How I handle OCR fallback and per-language field parsing when extracting data from PDFs in Python (w

Posted by LorenzoNardi@reddit | Python

I've been working on a document processing tool that extracts structured data from PDFs (invoices, bank statements, contracts) and I ran into two problems that aren't well documented anywhere: OCR fallback strategy and per-language field normalization. Sharing what worked.

**Problem 1: Silent OCR failure**

Most guides tell you to use `pdfplumber` or `PyMuPDF` to extract text. What they don't tell you is that scanned PDFs return an empty string (or worse, garbage spacing characters) without raising any exception. You'll process it, send it to an LLM, and get hallucinated data back – all silently.

My solution: check text length and density *before* calling the LLM. If the extracted text is below a threshold (I use 50 meaningful characters per page), fall back to Tesseract OCR:

```python
import io

import pdfplumber
import pytesseract
from pdf2image import convert_from_bytes


def extract_text_with_fallback(pdf_bytes: bytes) -> str:
    with pdfplumber.open(io.BytesIO(pdf_bytes)) as pdf:
        # extract_text() can return None or '' on image-only pages
        text = '\n'.join(p.extract_text() or '' for p in pdf.pages)
        pages = len(pdf.pages) or 1  # avoid dividing by zero

    # Scanned PDF check: meaningful (non-whitespace) chars per page
    meaningful = sum(1 for ch in text if not ch.isspace())
    if meaningful / pages < 50:
        # Rasterize every page and OCR it instead
        images = convert_from_bytes(pdf_bytes, dpi=300)
        text = '\n'.join(pytesseract.image_to_string(img) for img in images)

    return text
```

The `dpi=300` setting matters a lot – at 150 dpi Tesseract misses characters on dense invoices. For me, 300 is the sweet spot between accuracy and speed.
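
One limitation of averaging over the whole document: a PDF that mixes text pages with scanned pages can clear the threshold while the scanned pages still come back empty. A rough sketch of a per-page variant of the same check is below. `pages_needing_ocr` and `min_chars` are hypothetical names for illustration, and the default of 50 is just the threshold mentioned above, not a tuned value.

```python
from typing import List, Optional, Sequence


def pages_needing_ocr(page_texts: Sequence[Optional[str]], min_chars: int = 50) -> List[int]:
    """Return 0-based indexes of pages whose extracted text looks too thin to trust."""
    flagged = []
    for index, text in enumerate(page_texts):
        # Count only non-whitespace characters so padding-only pages get flagged
        meaningful = sum(1 for ch in (text or '') if not ch.isspace())
        if meaningful < min_chars:
            flagged.append(index)
    return flagged
```

The input would be something like `[p.extract_text() for p in pdf.pages]`. Since `convert_from_bytes` accepts `first_page` and `last_page` (1-based), only the flagged pages would need to be rasterized and sent to Tesseract.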

**Problem 2: Per-language field normalization**

European invoices are a nightmare. The same field can be:

- `Total` / `Totale` / `Gesamtbetrag` / `Montant total`

- Dates as `31/12/2024` (IT), `31.12.2024` (DE), `2024-12-31` (ISO)

- Decimals as `1.234,56` (IT/DE) vs `1,234.56` (EN)

Instead of trying to make one regex rule to catch all formats, I built a simple language detector that runs on a short sample of the text, then loads a locale-specific normalization config:

```python
import re

LOCALE_CONFIGS = {
    'it': {'decimal_sep': ',', 'thousand_sep': '.', 'date_formats': ['%d/%m/%Y', '%d-%m-%Y']},
    'de': {'decimal_sep': ',', 'thousand_sep': '.', 'date_formats': ['%d.%m.%Y']},
    'en': {'decimal_sep': '.', 'thousand_sep': ',', 'date_formats': ['%m/%d/%Y', '%Y-%m-%d']},
    'fr': {'decimal_sep': ',', 'thousand_sep': ' ', 'date_formats': ['%d/%m/%Y']},
}


def normalize_amount(raw: str, locale: str) -> float:
    # Unknown locales fall back to the English conventions
    cfg = LOCALE_CONFIGS.get(locale, LOCALE_CONFIGS['en'])
    # Drop thousand separators first, then turn the decimal separator into '.'
    cleaned = raw.replace(cfg['thousand_sep'], '').replace(cfg['decimal_sep'], '.')
    # Strip everything except digits, the decimal point, and a minus sign
    return float(re.sub(r'[^\d.\-]', '', cleaned))
```
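
The `date_formats` entries can be used the same way. Below is a minimal sketch of a date counterpart to `normalize_amount`, assuming the formats listed in the config. `normalize_date` and the standalone `DATE_FORMATS` dict are hypothetical names, duplicated here only so the block runs on its own. Trying ISO as a last resort for every locale is also an assumption, not something the config above does.

```python
from datetime import date, datetime

# Hypothetical copy of the per-locale date formats, so this sketch is self-contained
DATE_FORMATS = {
    'it': ['%d/%m/%Y', '%d-%m-%Y'],
    'de': ['%d.%m.%Y'],
    'en': ['%m/%d/%Y', '%Y-%m-%d'],
    'fr': ['%d/%m/%Y'],
}


def normalize_date(raw: str, locale: str) -> date:
    """Parse a date string using the formats expected for the given locale."""
    formats = DATE_FORMATS.get(locale, DATE_FORMATS['en'])
    # ISO dates are unambiguous, so they are tried last for every locale
    for fmt in [*formats, '%Y-%m-%d']:
        try:
            return datetime.strptime(raw.strip(), fmt).date()
        except ValueError:
            continue
    raise ValueError(f'Unrecognized date {raw!r} for locale {locale!r}')
```

Raising on an unrecognized date, rather than guessing between day-first and month-first, keeps an ambiguous value like `03/04/2024` from being silently misread when the locale is wrong.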

For language detection I use `langdetect` on the first 500 characters of extracted text – fast, lightweight, accurate enough for this use case.
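
A minimal sketch of that detection step, assuming `langdetect` is installed. `detect_locale`, `SUPPORTED`, and the `detector` parameter are hypothetical names for illustration, and the fallback to `'en'` simply mirrors the default in `normalize_amount`. `langdetect` is non-deterministic by default, so the sketch sets `DetectorFactory.seed`, which its README suggests for consistent results.

```python
from typing import Callable, Optional

SUPPORTED = ('it', 'de', 'en', 'fr')  # locales that have a normalization config


def detect_locale(
    text: str,
    default: str = 'en',
    detector: Optional[Callable[[str], str]] = None,
) -> str:
    """Guess the document language from the first 500 characters of text."""
    sample = text[:500].strip()
    if not sample:
        return default  # nothing to detect on, e.g. OCR returned no text

    if detector is None:
        # Imported lazily so the function can be exercised without langdetect
        from langdetect import DetectorFactory, detect
        DetectorFactory.seed = 0  # make results repeatable across runs
        detector = detect

    try:
        lang = detector(sample)
    except Exception:
        # langdetect raises LangDetectException when it finds no usable features
        return default

    # Languages without a config fall back to the default
    return lang if lang in SUPPORTED else default
```

The result can be passed straight to `normalize_amount(raw, detect_locale(text))`. The `detector` argument is there only to make the function easy to test with a stub.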

Hope this helps anyone building document processing pipelines. Happy to answer questions on edge cases I've hit.