Looking for tools/approaches for structural extraction from long, complex PDFs (sections + multi-page tables)

Posted by Expensive-Remote2650@reddit | LocalLLaMA

I'm working on a side project where I need to process fairly long and complex PDFs - mostly text-selectable (no OCR needed for now), formal administrative / legal-style documents with a mix of prose sections and data tables. Before I start gluing things together myself I'd like to hear what people have actually had success with, because the gap between "extract text from a PDF" and "understand the document" is huge and I keep falling into it.

What I need isn't really "read text from a PDF". It's understanding the document as a structured object:

  1. Clean page-level text on selectable-text PDFs. Basic, but has to be reliable and lossless.
  2. Noise removal - repeating headers, footers, page numbers, organizational labels. Strip them without touching real content.
  3. Block classification inside a page - document title vs section titles vs subtitles vs paragraphs vs lists vs metadata lines vs regions that look like table content.
  4. Logical hierarchy - going from "pages with blocks" to a tree of sections / subsections with titles correctly linked to their body.
  5. Table detection - knowing where tables exist and keeping them separate from prose.
  6. Table structure - rows, columns, headers vs data, multi-line cells, broken rows.
  7. Multi-page table continuation - this is the one that really worries me. When a table spans 10+ pages I need to recognize it's the same table continuing (repeated headers ≠ new data), not a series of small tables.
  8. A stable output artifact at the end - one consistent representation of sections + tables + doc-level metadata, with traceability back to where in the original document each piece came from.
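
To make point 7 concrete, the naive heuristic I'd start from: treat a table fragment as a continuation if its header row matches the previous table's header. This is a rough sketch, not something tested on real documents - the function names and the `pages` input shape are mine, and it assumes per-page tables are already extracted as lists of rows (e.g. via pdfplumber's `page.extract_tables()`). Real docs will need more signals (column x-coordinates, "continued" markers, etc.):

```python
def normalize(row):
    """Collapse whitespace so 'Amount \\n(EUR)' == 'Amount (EUR)'."""
    return tuple(" ".join((cell or "").split()) for cell in row)

def stitch_tables(pages):
    """pages: list of per-page table lists; each table is a list of rows.

    Returns logical tables as dicts, with repeated page headers dropped
    and continuation rows merged into the preceding table.
    """
    logical = []
    current = None
    for page_tables in pages:
        for table in page_tables:
            if not table:
                continue
            header = normalize(table[0])
            body = [normalize(r) for r in table[1:]]
            if current is not None and header == current["header"]:
                # Same header repeated on a later page: this is the
                # same table continuing, not a new one.
                current["rows"].extend(body)
            else:
                current = {"header": header, "rows": body}
                logical.append(current)
    return logical
```

Header equality alone obviously breaks when two genuinely different tables share a header, which is exactly the kind of failure mode I'd like to hear about.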

Stack is Python. I know the usual suspects - pdfplumber, PyMuPDF, pdfminer.six, Camelot, Tabula, unstructured.io, Marker, Docling, LlamaParse, etc. - and I've played with a few. What I'm actually trying to figure out:
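
Whichever extractor from that list ends up doing the text layer, the noise-removal step (point 2) can stay library-agnostic: any line that repeats on most pages is probably page furniture. Something like this (illustrative, threshold untuned - and page numbers need a separate regex pass since their text differs per page):

```python
from collections import Counter

def find_boilerplate(pages, threshold=0.6):
    """pages: list of per-page text strings.

    Returns the set of lines appearing on more than `threshold` of
    pages -- candidate headers, footers, and org labels.
    """
    counts = Counter()
    for text in pages:
        # Count each distinct line at most once per page.
        for line in {ln.strip() for ln in text.splitlines() if ln.strip()}:
            counts[line] += 1
    cutoff = threshold * len(pages)
    return {line for line, n in counts.items() if n > cutoff}

def strip_boilerplate(pages, boiler):
    """Drop boilerplate lines from every page's text."""
    return [
        "\n".join(ln for ln in text.splitlines() if ln.strip() not in boiler)
        for text in pages
    ]
```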

My preference is for local / self-hostable solutions - I'm happy to use a small local LLM as a fallback for ambiguous cases, but I want the structural extraction itself to be mostly deterministic and reproducible.
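
Concretely, the split I have in mind: cheap deterministic rules classify each block first, and anything they can't decide comes back as `None` so it can be batched to the local model. All rule patterns and thresholds below are illustrative, not tuned on anything:

```python
import re

SECTION_RE = re.compile(r"^\d+(\.\d+)*\s+\S")   # e.g. "3.1 Scope"
LIST_RE = re.compile(r"^([-•*]|\d+[.)])\s+")     # bullets / numbered items

def classify_block(text, font_size, body_size):
    """Deterministic first pass over one text block.

    Returns a label, or None when the rules are not confident --
    None blocks are the only ones that go to the LLM fallback.
    """
    text = text.strip()
    if SECTION_RE.match(text) and font_size > body_size:
        return "section_title"
    if LIST_RE.match(text):
        return "list_item"
    if font_size > body_size * 1.3:
        return "title"
    if font_size <= body_size and len(text.split()) > 5:
        return "paragraph"
    return None  # ambiguous -> queue for the local model
```

The nice property is that reruns are byte-identical except for the (logged, auditable) set of blocks that hit the fallback.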

War stories about what didn't work are more useful than recommendations, in my experience. So if you tried X and it fell apart on real documents, I'd love to hear it.