Best Python approach for extracting structured financial data from inconsistent PDFs?

Posted by leggo-my-eggo-1@reddit | Python

Hi everyone,

I'm designing a Python pipeline to extract structured financial data from annual accounts provided as PDFs. The end goal is to turn these documents automatically into data that can feed valuation models and financial analysis.

The intended workflow looks like this:

  1. Upload one or more PDF annual accounts
  2. Automatically detect and extract the balance sheet and income statement
  3. Identify account numbers and their corresponding amounts
  4. Convert the extracted data into a standardized chart of accounts structure
  5. Export everything into a structured format (Excel, dataframe, or database)
  6. Run validation checks such as balance sheet equality and multi-year comparisons
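To make steps 3, 4 and 6 concrete, here is a minimal sketch that uses only the standard library. It assumes the text has already been pulled out of the PDF line by line, that each relevant line looks like `<account number> <label> <amount>` with a single amount column, and that amounts use European formatting (`1.234,56`, with parentheses or a leading minus for negatives). The names `LINE_RE`, `ACCOUNT_MAP`, `parse_amount`, `extract_rows`, `totals_by_category` and `balance_sheet_balances` are hypothetical, and the first-digit mapping is only a placeholder for a real chart of accounts.

```python
import re
from decimal import Decimal

# Assumed line shape: "<account number> <label> <amount>", one amount per line.
LINE_RE = re.compile(
    r"^\s*(?P<account>\d{3,6})\s+(?P<label>.+?)\s+(?P<amount>\(?-?[\d.,]+\)?)\s*$"
)

# Placeholder mapping from the first digit of the account number to a
# standardized category. A real chart of accounts will differ.
ACCOUNT_MAP = {"1": "assets", "2": "equity_and_liabilities"}


def parse_amount(raw: str, decimal_sep: str = ",") -> Decimal:
    """Parse '1.234,56', '(1.234,56)' or '-1.234,56' into a Decimal."""
    text = raw.strip()
    negative = text.startswith("(") and text.endswith(")")
    text = text.strip("()")
    if text.startswith("-"):
        negative = True
        text = text[1:]
    thousands_sep = "." if decimal_sep == "," else ","
    text = text.replace(thousands_sep, "").replace(decimal_sep, ".")
    value = Decimal(text)
    return -value if negative else value


def extract_rows(lines):
    """Keep only lines that match the assumed shape; skip headings and noise."""
    rows = []
    for line in lines:
        match = LINE_RE.match(line)
        if match:
            rows.append(
                (match["account"], match["label"], parse_amount(match["amount"]))
            )
    return rows


def totals_by_category(rows, mapping=ACCOUNT_MAP):
    """Sum amounts per standardized category; unknown accounts go to 'unmapped'."""
    totals = {}
    for account, _label, amount in rows:
        category = mapping.get(account[0], "unmapped")
        totals[category] = totals.get(category, Decimal("0")) + amount
    return totals


def balance_sheet_balances(totals, tolerance=Decimal("0.01")):
    """Check assets == equity and liabilities, allowing for rounding differences."""
    zero = Decimal("0")
    diff = totals.get("assets", zero) - totals.get("equity_and_liabilities", zero)
    return abs(diff) <= tolerance


sample = [
    "Balance sheet as at 31 December",
    "1000 Cash at bank 12.500,00",
    "1200 Trade receivables 7.500,00",
    "2000 Share capital 15.000,00",
    "2400 Trade payables 5.000,00",
]
rows = extract_rows(sample)
totals = totals_by_category(rows)
balanced = balance_sheet_balances(totals)  # True for this sample
```

Using `Decimal` rather than `float` keeps the equality check free of binary rounding noise, and the `unmapped` bucket makes it easy to spot accounts the mapping does not cover yet.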

The biggest challenge is that the PDFs are very inconsistent in structure.

In practice I encounter several types of documents:

1. Text-based PDFs

2. Scanned PDFs

3. Layout variations
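As a rough sketch of how the first two cases could be told apart: one option is to look at how much usable text the text layer of each page yields and send the rest to OCR. The function names and thresholds below (`classify_page`, `plan_extraction`, `min_chars`, `min_alnum_ratio`) are hypothetical and untuned, and the per-page strings are assumed to come from whatever text extractor is in use.

```python
def classify_page(extracted_text, min_chars=50, min_alnum_ratio=0.5):
    """Return 'text' if the page's text layer looks usable, else 'scanned'.

    Thresholds are illustrative guesses, not tuned values.
    """
    # Some extractors return None for pages without a text layer.
    stripped = "".join((extracted_text or "").split())
    if len(stripped) < min_chars:
        return "scanned"
    # A text layer that is mostly symbols is often extraction garbage.
    alnum = sum(ch.isalnum() for ch in stripped)
    if alnum / len(stripped) < min_alnum_ratio:
        return "scanned"
    return "text"


def plan_extraction(page_texts):
    """Map each page index to the extraction route it should take."""
    return [
        (index, "ocr" if classify_page(text) == "scanned" else "text_layer")
        for index, text in enumerate(page_texts)
    ]


pages = [
    "Total assets 12.500,00 Total equity and liabilities 12.500,00 " * 3,
    "",    # typical for a scanned page: no text layer at all
    None,  # some extractors return None instead of an empty string
]
plan = plan_extraction(pages)
```

Deciding per page rather than per document matters for annual accounts where a text-based report has a scanned auditor's statement or signed page appended.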

So the pipeline needs to handle:

My current thinking for a Python stack is something like:

However, I'm not sure if this is the most robust approach for messy real-world financial PDFs.

Some questions I’m hoping to get advice on:

Any suggestions, libraries, or real-world experiences would be very helpful.

Thanks!