Best Python approach for extracting structured financial data from inconsistent PDFs?

Posted by leggo-my-eggo-1@reddit | Python | 30 comments

Hi everyone,

I'm currently trying to design a Python pipeline to extract structured financial data from annual accounts provided as PDFs. The end goal is to automatically transform these documents into structured financial data that can be used in valuation models and financial analysis.

The intended workflow looks like this:

Upload one or more PDF annual accounts
Automatically detect and extract the balance sheet and income statement
Identify account numbers and their corresponding amounts
Convert the extracted data into a standardized chart of accounts structure
Export everything into a structured format (Excel, dataframe, or database)
Run validation checks such as balance sheet equality and multi-year comparisons
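
As a rough illustration of the validation step, a balance sheet equality check and a multi-year comparison could look like the sketch below. The function names and the default tolerance of 1 currency unit are assumptions for the example, not something taken from a library.

```python
from decimal import Decimal


def balance_sheet_gap(assets, liabilities, equity):
    """Return assets - (liabilities + equity); zero means the sheet balances."""
    return Decimal(assets) - (Decimal(liabilities) + Decimal(equity))


def is_balanced(assets, liabilities, equity, tolerance=Decimal("1")):
    """Accept small differences caused by rounding in the published figures."""
    return abs(balance_sheet_gap(assets, liabilities, equity)) <= tolerance


def year_over_year_change(current, previous):
    """Relative change between two years, or None when the base year is zero."""
    current, previous = Decimal(current), Decimal(previous)
    if previous == 0:
        return None
    return (current - previous) / abs(previous)
```

Passing the amounts as strings (for example `is_balanced("1000", "600", "400")`) keeps them exact; floats would bring their binary rounding error into the check.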

The biggest challenge is that the PDFs are very inconsistent in structure.

In practice I encounter several types of documents:

1. Text-based PDFs

Tables exist but are often poorly structured
Columns may not align properly
Sometimes rows are broken across lines

2. Scanned PDFs

Entire document is an image
Requires OCR before any parsing can happen

3. Layout variations

The position of the balance sheet and income statement changes
Table structures vary significantly
Labels for accounts can differ slightly between documents
Columns and spacing are inconsistent

So the pipeline needs to handle:

Text extraction for normal PDFs
OCR for scanned PDFs
Table detection
Recognition of account numbers
Mapping to a predefined chart of accounts
Handling multi-year data

My current thinking for a Python stack is something like:

pdfplumber or PyMuPDF for text extraction
pytesseract + opencv for OCR on scanned PDFs
Camelot or Tabula for table extraction
pandas for cleaning and structuring the data
Custom logic to detect account numbers and map them
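
To make the "custom logic" item concrete, here is a minimal sketch of parsing one extracted text line. It assumes a layout of account number, label, then one amount per year, with comma thousands separators and parentheses for negatives. Those are assumptions about the documents, and real annual accounts will need more cases (for example European number formats).

```python
import re
from decimal import Decimal


def parse_amount(token):
    """Parse '12,345', '-1,000.50' or '(2,500)'; return None if not an amount."""
    negative = token.startswith("(") and token.endswith(")")
    cleaned = token.strip("()").replace(",", "")
    if not re.fullmatch(r"-?\d+(\.\d+)?", cleaned):
        return None
    value = Decimal(cleaned)
    return -value if negative else value


def parse_account_line(line):
    """Split '<account> <label> <amount> [<amount> ...]' into its parts."""
    tokens = line.split()
    # Assumed: account numbers are 3 to 6 digits at the start of the line.
    if len(tokens) < 3 or not re.fullmatch(r"\d{3,6}", tokens[0]):
        return None
    amounts = []
    # Peel amounts off the right-hand side, always leaving one label token.
    while len(tokens) > 2:
        value = parse_amount(tokens[-1])
        if value is None:
            break
        amounts.insert(0, value)
        tokens.pop()
    if not amounts:
        return None
    return {"account": tokens[0], "label": " ".join(tokens[1:]), "amounts": amounts}
```

A label that itself ends in a number (such as a year) would be misread as an amount, so the output still needs the validation checks from the workflow above.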

However, I'm not sure if this is the most robust approach for messy real-world financial PDFs.

Some questions I’m hoping to get advice on:

What Python tools work best for reliable table extraction in inconsistent PDFs?
Is it better to run OCR first on every PDF, or detect whether OCR is needed?
Are there libraries that work well for financial table extraction specifically?
Would you recommend a rule-based approach or something more ML-based for recognizing accounts and mapping them?
How would you design the overall architecture for this pipeline?

Any suggestions, libraries, or real-world experiences would be very helpful.

Thanks!

Chemical_Matter3385@reddit

For my use case I run detection first: using PyMuPDF (fitz), I check whether the first page is an image with no selectable text, and if so it goes to Mistral OCR, which is good for most cases.

What I have tried, including what failed:

1) Tesseract
2) PaddlePaddle (PaddleOCR)
3) Docling
4) DeepSeek-OCR
5) Claude Opus 4.6
6) Gemini Vision API (enterprise)
7) Azure Document Intelligence
8) Mistral OCR
9) A model by IBM (I forget the name, but I'm pretty sure it's Granite)

Azure and Mistral are good.

I can't rely much on Claude or DeepSeek-OCR: they are vision-language models, and I have seen them produce hallucinated placeholder text, which is very risky in production. They worked well in most cases but were useless on old scanned books.

Try them all; most likely your use case will be covered by Azure or Mistral.

PS: For OP's use case, Azure Document Intelligence or Mistral OCR 3 would be perfect.
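
The detection-first check described at the top of this comment might look roughly like the sketch below. The function names and the `min_chars` threshold are assumptions; only `fitz.open`, `page.get_text()` and `page.get_images()` are PyMuPDF calls.

```python
def needs_ocr(text, image_count, min_chars=1):
    """A page with images but (almost) no selectable text is probably a scan."""
    return image_count > 0 and len(text.strip()) < min_chars


def first_page_needs_ocr(pdf_path):
    """Apply the check to the first page only, as described in the comment."""
    import fitz  # PyMuPDF; imported here so needs_ocr works without it installed

    with fitz.open(pdf_path) as doc:
        page = doc[0]
        return needs_ocr(page.get_text(), len(page.get_images()))
```

Documents that mix scanned and digital pages would need the same check run per page rather than only on the first one.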

Bitter_Broccoli_7536@reddit

yea the proxy cost part is real, residential ips can get stupid expensive per GB. i switched to qoest proxy for my scraping setup, their pricing is way more predictable for heavy volume and the rotation just works without me babysitting it. saved me a ton of dev time fighting blocks.

Bitter_Broccoli_7536@reddit

yeah that detection first step is key, we do something similar with fitz. honestly after trying like 5 different ocr engines, the hallucination risk from the vision llms is just too high for anything serious. azure's been the most consistent for us too, especially on weird old scans.

Chemical_Matter3385@reddit

Also tried Adobe PDF Services.

It works well with tables but often misses ₹ or $ signs. That is most likely an encoding issue, which I haven't looked into yet, but it can be handled with a simple script.

Ok_Diver9921@reddit

Spent 6 months on almost this exact pipeline for a fintech project. Save yourself some pain - skip the pure rule-based stack and go hybrid from the start.

What actually worked for us: pdfplumber for text-based PDFs (it handles column alignment better than tabula for financial tables), but detect scanned pages first by checking if pdfplumber returns empty text per page. Only run OCR on pages that need it - running tesseract on everything adds 10x processing time for no benefit on text-based files. For OCR, docTR beat pytesseract significantly on financial documents because it handles the dense number grids better.

For the table extraction specifically - Camelot lattice mode works well when there are actual grid lines, but most annual reports use invisible tables (no borders, just spacing). For those, the LLM approach that u/thuiop1 mentioned is genuinely the right call. Feed the pdfplumber text output (which preserves spatial layout) into a smaller model and ask it to extract specific fields into a JSON schema you define. We went from 60% accuracy with pure regex/heuristics to 92% by adding a Qwen 14B pass for the messy pages.
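
A minimal sketch of the "extract into a JSON schema you define" step. The field names are hypothetical, and the model call itself is left out because it depends on how the model is hosted; only prompt construction and reply validation are shown.

```python
import json

# Hypothetical target fields; replace with your own schema.
FIELDS = ("total_assets", "total_liabilities", "total_equity")


def build_prompt(page_text):
    """Ask for one JSON object with exactly the expected keys."""
    return (
        "Extract these fields from the page below and reply with a single "
        f"JSON object with exactly these keys: {', '.join(FIELDS)}. "
        "Use null for any value that is not on the page.\n\n" + page_text
    )


def parse_response(raw):
    """Validate the model's reply; raise ValueError rather than trust it."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        raise ValueError("reply is not valid JSON") from exc
    if not isinstance(data, dict) or set(data) != set(FIELDS):
        raise ValueError("reply does not match the expected keys")
    for key, value in data.items():
        is_number = isinstance(value, (int, float)) and not isinstance(value, bool)
        if value is not None and not is_number:
            raise ValueError(f"{key} is not a number or null")
    return data
```

Rejecting anything that does not match the schema means a malformed reply gets retried or sent to review instead of landing in the dataset.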

Architecture tip: build a classifier first that categorizes each page as "balance sheet", "income statement", "notes", "other" before you try to extract anything. This saves you from parsing 80 pages when you only need 4-6. A simple tf-idf classifier trained on 50 labeled pages worked fine for this.
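
To show the idea behind a tf-idf page classifier, here is a dependency-free sketch that scores a page against per-label centroids by cosine similarity. It is an illustration, not the setup described above: in practice scikit-learn's `TfidfVectorizer` with a linear model is the more usual choice, and the class name and tokenizer here are made up for the example.

```python
import math
import re
from collections import Counter, defaultdict


def tokenize(text):
    return re.findall(r"[a-z]+", text.lower())


class TfidfPageClassifier:
    def fit(self, texts, labels):
        docs = [Counter(tokenize(t)) for t in texts]
        n = len(docs)
        # Document frequency: in how many pages each term appears.
        df = Counter(term for doc in docs for term in doc)
        # Smoothed inverse document frequency.
        self.idf = {t: math.log((1 + n) / (1 + c)) + 1 for t, c in df.items()}
        sums = defaultdict(Counter)
        for doc, label in zip(docs, labels):
            for term, weight in self._vector(doc).items():
                sums[label][term] += weight
        self.centroids = dict(sums)
        return self

    def _vector(self, counts):
        """L2-normalised tf-idf vector; terms unseen in training are dropped."""
        vec = {t: c * self.idf[t] for t, c in counts.items() if t in self.idf}
        norm = math.sqrt(sum(w * w for w in vec.values())) or 1.0
        return {t: w / norm for t, w in vec.items()}

    def predict(self, text):
        vec = self._vector(Counter(tokenize(text)))

        def score(label):
            centroid = self.centroids[label]
            norm = math.sqrt(sum(w * w for w in centroid.values())) or 1.0
            return sum(w * centroid.get(t, 0.0) for t, w in vec.items()) / norm

        return max(self.centroids, key=score)
```

Usage is `TfidfPageClassifier().fit(page_texts, page_labels).predict(new_page_text)`, with labels such as "balance sheet", "income statement", "notes" and "other".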

Lawson470189@reddit

This is exactly what my team is doing. We run a classification step to identify page types using metadata and sometimes a quick OCR pass on a section of the page. Then we run full OCR on the document using docTR into data classes. Then we apply rules and validations using that collected data.

Exactly right on the classification step. We ended up with a simple heuristic - if pdfplumber returns a table with more than 3 columns and consistent row counts, it is a structured page. Anything with fewer than 50 extractable characters per page gets routed to OCR. The metadata approach you mention is solid too, especially for standardized financial docs where the issuer follows a template. Biggest time saver was caching the page classification results so reprocessing the same document skips the detection step entirely.
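
The routing heuristic above can be sketched as follows. Reading "consistent row counts" as "every row has the same number of cells" is an assumption, as are the function names; `extract_text()` and `extract_tables()` are pdfplumber page methods.

```python
def route_page(text, tables, min_chars=50, min_columns=4):
    """Return 'ocr', 'structured' or 'unstructured' for one page.

    `tables` is a list of tables, each a list of rows, which is the shape
    pdfplumber's page.extract_tables() returns.
    """
    if len((text or "").strip()) < min_chars:
        return "ocr"
    for table in tables:
        widths = {len(row) for row in table}
        # More than 3 columns, and the same cell count in every row.
        if len(widths) == 1 and widths.pop() >= min_columns:
            return "structured"
    return "unstructured"


def route_pdf(path):
    """Route every page of a PDF; needs pdfplumber installed."""
    import pdfplumber  # imported lazily so route_page has no dependencies

    with pdfplumber.open(path) as pdf:
        return [route_page(p.extract_text(), p.extract_tables()) for p in pdf.pages]
```

The list returned by `route_pdf` is also the natural thing to cache per document, so reprocessing can skip detection.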

Yep right there with you. We have a caching layer so we avoid pulling and parsing documents again (within the TTL window). Glad to hear we aren't the only ones facing this!

wRAR_@reddit

LMAO

bravelogitex@reddit

what is so funny

Bots talking to bots, as usual.

Ok_diver is a bot but not Lawson. See their post histories. It is weird how realistic sounding Ok_diver's response is. It sounds plausible based on my research

Can confirm... not a bot.

No_Sprinkles1374@reddit

Hello, I'm working on the same project. I sent you a private message; could you please reply?

Visual-Succotash9428@reddit

just try MinerU if you don't want to deal with all these details.
they have a specially trained LLM for tables and formulas. the API is free right now.

Clever_Username69@reddit

If you're getting your annual reports for companies in the US, use the EDGAR site (sec.gov) and download/parse XBRL formatted files instead of PDFs. That's what the XBRL format was designed for and will make your job a million times easier. If you have to use PDFs then it looks like you have plenty of good advice in this thread already.

threatcon22@reddit

Following

Then_Illustrator9892@reddit

been down this exact road with financial pdfs and honestly the custom pipeline route is brutal for inconsistent docs. i ended up switching to reseek for this, its ai handles the text/ocr extraction and auto tagging from pdfs and images, which covers your scanned and text based cases. its free to test rn, saved me months of dev time on the parsing hell.

Accomplished-Tap916@reddit

I’ll spend some time going through their content.

Amazing_Upstairs@reddit

DSPy is the best I've found so far

phrygian_life@reddit

Another vote for LLM. even if the layout stays the same year to year, the entire PDF structure could change.

Dominican_mamba@reddit

There’s a package called kreuzberg. Try it out, and maybe use an LLM if needed.

thuiop1@reddit

As much as I hate it, this is probably a task where LLMs can shine. Otherwise it will likely be more painful to devise an extraction scheme than to do it manually.

ambidextrousalpaca@reddit

Agreed. Another thing I would suggest is to try multiple runs with, if possible, multiple models, then mark the values they agree on as more reliable and the ones they disagree on as requiring human checking.
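
One way to sketch that agreement check, assuming each run returns a flat dict of field name to value. The function name and the `min_agreement` parameter are made up for the example.

```python
from collections import Counter


def reconcile(runs, min_agreement=1.0):
    """Split fields into agreed values and ones that need human review.

    `runs` is a list of dicts, one per model run. A field is accepted when
    the share of runs giving its most common value reaches `min_agreement`.
    """
    accepted, review = {}, {}
    fields = sorted({key for run in runs for key in run})
    for field in fields:
        values = [run.get(field) for run in runs]
        value, votes = Counter(values).most_common(1)[0]
        if value is not None and votes / len(runs) >= min_agreement:
            accepted[field] = value
        else:
            # Keep every run's answer so the reviewer can see the disagreement.
            review[field] = values
    return accepted, review
```

With the default of 1.0 every run has to agree; lowering it to something like 0.6 turns this into a majority vote, which trades review effort for risk.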

DetectivePeterG@reddit

Agreed on the LLM angle. The trick is getting clean input first. I've been using pdftomarkdown.dev as a preprocessing step: send your PDF, get structured markdown back including tables. It uses a VLM rather than Tesseract so it handles both digital and scanned pages consistently. Then you run your LLM extraction on the markdown instead of raw PDF bytes, which makes prompts simpler and results more reliable. Has a Python SDK too, only takes a few lines to wire in.

xiannah@reddit

The strategy is simple: a text-first extract, a Markdown extract for a structural fallback, and a VLM as the intelligent orchestrator. The VLM will cross-reference the raw text and structural fallback to validate the output—effectively creating a verification loop that catches OCR hallucinations before they hit the downstream dataset.

southstreamer1@reddit

I have been working on this exact problem for about 4 months. I’m trying to extract data from about 900 annual reports.

The approach I have taken is 1) use PyMuPDF / tesseract to extract text 2) apply rules/heuristics to determine if the page is the one I’m looking for 3) pass this page to Claude computer vision API for data extraction. You want to find the page you want and only send it that one page at a time to reduce noise (reduce risk of errors) and token consumption. If you send it heaps of pages and ask it to find the right one it wastes tokens and you risk polluting the extraction with data from the wrong page.

Claude vision does an excellent job of extracting the data. I have found an astonishingly small number of read errors. It handles different header levels easily. Very happy with its performance.

The real problem is finding the right page. I have used a rule based/heuristic approach to find income statements, etc, but it’s too brittle. There are always edge cases that give false positives/negatives. It’s time consuming to rerun the search and hard to debug. I’m sure there’s a smart way to do it but it’s beyond me.

I have recently switched to extracting all pages to an SQLite db up front. Finding the right page is then a matter of querying the db. This is way faster than having to rerun the pymupdf/tesseract based search every time. Still a WIP but so far this is giving me far better results.
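
A minimal sketch of that "extract once, query many times" idea using the standard library's sqlite3. The table layout and function names are assumptions for the example, not the schema actually used here.

```python
import sqlite3


def create_store(path=":memory:"):
    """Open (or create) the page store; pass a file path to persist it."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS pages ("
        "doc TEXT NOT NULL, page INTEGER NOT NULL, text TEXT NOT NULL, "
        "PRIMARY KEY (doc, page))"
    )
    return conn


def add_pages(conn, doc, page_texts):
    """Store the extracted text of every page, numbered from 1."""
    rows = [(doc, i, text) for i, text in enumerate(page_texts, start=1)]
    with conn:  # commits on success, rolls back on error
        conn.executemany("INSERT OR REPLACE INTO pages VALUES (?, ?, ?)", rows)


def find_pages(conn, phrase, *more):
    """Pages whose text contains every phrase (LIKE ignores ASCII case)."""
    phrases = (phrase, *more)
    where = " AND ".join("text LIKE ?" for _ in phrases)
    params = [f"%{p}%" for p in phrases]
    sql = f"SELECT doc, page FROM pages WHERE {where} ORDER BY doc, page"
    return conn.execute(sql, params).fetchall()
```

Changing the search rules then only means changing a query, for example `find_pages(conn, "income statement", "revenue")`, without rerunning text extraction or OCR.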

Halibut@reddit

I haven't used it myself, but Microsoft have a Python library for this: https://github.com/microsoft/markitdown

Main_War9026@reddit

We use MistralOCR, hundreds of documents per month and only pay like $20-30 for the API

knobbyknee@reddit

You are in for a lot of grief. There is no standard table construct in the PDF format.

You would have to write code that detects a grid layout and then parse that layout into a table.

Unfortunately, there are many ways of constructing a grid, and the parts may be spread over different sections of the PDF data. Your best option is probably to build a middleware layer for a PDF renderer, so you can collect the position and text data for each item rendered.

There are also non-table items that are arranged like grids, and you will need heuristics to ignore those.
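
As a small illustration of working from position and text data, the sketch below groups positioned words into rows. It swaps in pdfplumber-style word dicts (the `text`, `x0` and `top` keys that `page.extract_words()` returns) instead of a custom renderer layer, and the tolerance value is an assumption.

```python
def group_into_rows(words, y_tolerance=3.0):
    """Cluster words with similar `top` coordinates into rows, left to right."""
    rows = []
    for word in sorted(words, key=lambda w: (w["top"], w["x0"])):
        # Compare against the first word of the current row.
        if rows and abs(word["top"] - rows[-1][0]["top"]) <= y_tolerance:
            rows[-1].append(word)
        else:
            rows.append([word])
    return [[w["text"] for w in sorted(row, key=lambda w: w["x0"])] for row in rows]
```

This only recovers rows. Assigning cells to columns and rejecting grid-like layouts that are not tables still needs the heuristics mentioned above.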