Looking for tools/approaches for structural extraction from long, complex PDFs (sections + multi-page tables)
Posted by Expensive-Remote2650@reddit | LocalLLaMA | 2 comments
I'm working on a side project where I need to process fairly long and complex PDFs - mostly text-selectable (no OCR needed for now), formal administrative / legal-style documents with a mix of prose sections and data tables. Before I start gluing things together myself I'd like to hear what people have actually had success with, because the gap between "extract text from a PDF" and "understand the document" is huge and I keep falling into it.
What I need isn't really "read text from a PDF". It's understanding the document as a structured object:
- Clean page-level text on selectable-text PDFs. Basic, but has to be reliable and lossless.
- Noise removal: repeating headers, footers, page numbers, organizational labels. Strip them without touching real content.
- Block classification inside a page: document title vs section titles vs subtitles vs paragraphs vs lists vs metadata lines vs regions that look like table content.
- Logical hierarchy: going from "pages with blocks" to a tree of sections / subsections with titles correctly linked to their body.
- Table detection: knowing where tables exist and keeping them separate from prose.
- Table structure: rows, columns, headers vs data, multi-line cells, broken rows.
- Multi-page table continuation: this is the one that really worries me. When a table spans 10+ pages I need to recognize it's the same table continuing (repeated headers ≠ new data), not a series of small tables.
- A stable output artifact at the end: one consistent representation of sections + tables + doc-level metadata, with traceability back to where in the original document each piece came from.
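To make the noise-removal item concrete, here is a minimal sketch of one deterministic approach: treat a line as noise if it recurs near the top or bottom of most pages. Everything in it is an assumption for illustration (the function names, the `edge` and `min_share` parameters, and the input shape of one list of text lines per page); none of it comes from the libraries listed below.

```python
import re
from collections import Counter


def normalize(line):
    """Collapse whitespace and digits so 'Page 3 of 12' matches 'Page 4 of 12'."""
    return re.sub(r"\d+", "#", " ".join(line.split())).lower()


def strip_repeated_lines(pages, edge=2, min_share=0.6):
    """Drop lines that recur in the first/last `edge` lines of most pages.

    pages: list of pages, each a list of text lines in top-to-bottom order.
    Returns a new list of pages; lines away from the page edges are never removed.
    """
    counts = Counter()
    for lines in pages:
        # A set, so a line is counted at most once per page.
        counts.update({normalize(l) for l in lines[:edge] + lines[-edge:]})

    # Require at least two occurrences even on very short documents.
    threshold = max(2, min_share * len(pages))
    noise = {key for key, n in counts.items() if n >= threshold}

    cleaned = []
    for lines in pages:
        kept = []
        for i, line in enumerate(lines):
            at_edge = i < edge or i >= len(lines) - edge
            if at_edge and normalize(line) in noise:
                continue
            kept.append(line)
        cleaned.append(kept)
    return cleaned
```

One known weakness of this sketch: collapsing digits also makes numbered headings such as "Section 1" and "Section 2" look identical, so on short documents a heading that sits at the top of several pages could be stripped by mistake. Raising `min_share` or whitelisting heading patterns would be the obvious mitigations.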
Stack is Python. I know the usual suspects (pdfplumber, PyMuPDF, pdfminer.six, Camelot, Tabula, unstructured.io, Marker, Docling, LlamaParse, etc.) and I've played with a few. What I'm actually trying to figure out:
- Has anyone solved multi-page table continuation reliably without hand-rolling heuristics per document type? This seems to be where every library quietly gives up.
- Layout-aware models (LayoutLM family, newer document-AI stuff) vs deterministic pipelines (geometry + regex on top of pdfplumber/PyMuPDF): where's the real tradeoff for this kind of structural understanding? Not looking for hype, looking for "I ran this on 500 real docs and here's what happened".
- Any library that actually gives you a document tree (sections → subsections → blocks/tables) as output, instead of a flat list of text blobs that you then have to re-group yourself?
- Is there an open-source pipeline you'd recommend as a starting point so I don't reinvent this from scratch?
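To show what I mean by a document tree rather than a flat list, here is a rough sketch of the re-grouping step I would otherwise write myself. It assumes an earlier stage has already classified each block and assigned heading levels; the `Node` class, the `build_tree` function, and the `(kind, text, level, page)` tuple shape are all hypothetical, not the output format of any library above.

```python
from __future__ import annotations

from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Node:
    kind: str                    # "root", "heading", "paragraph", "table", ...
    text: str
    level: int = 0               # heading depth; 0 for the root and body blocks
    page: Optional[int] = None   # traceability back to the source page
    children: list[Node] = field(default_factory=list)


def build_tree(blocks):
    """Nest a flat, reading-order list of blocks under their headings.

    blocks: iterable of (kind, text, level, page) tuples.
    """
    root = Node("root", "")
    stack = [root]  # currently open sections, outermost first
    for kind, text, level, page in blocks:
        node = Node(kind, text, level, page)
        if kind == "heading":
            # Close every open section at the same or a deeper level.
            while len(stack) > 1 and stack[-1].level >= level:
                stack.pop()
            stack[-1].children.append(node)
            stack.append(node)
        else:
            # Body blocks attach to the innermost open section.
            stack[-1].children.append(node)
    return root
```

The hard part is of course upstream (deciding what is a heading and at which level); once that exists, the tree itself is a few lines of stack handling.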
Preference for local / self-hostable solutions - happy to use a small local LLM as a fallback for ambiguous cases, but I want the structural extraction itself to be mostly deterministic and reproducible.
War stories about what didn't work are more useful than recommendations, in my experience. So if you tried X and it fell apart on real documents, I'd love to hear it.
Accomplished-Tap916@reddit
multi-page table continuation is the absolute killer, everyone hits that wall. i ended up building a pipeline around pymupdf for raw blocks and then a bunch of custom logic to stitch tables by checking header row similarity and vertical spacing. it's not perfect but it's deterministic.
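a rough sketch of that kind of stitching, not my actual code. it assumes table fragments were already extracted per page and that continuation pages repeat the header row. the dict keys (`page`, `rows`, `top_gap`, `bottom_gap`), the function names and the thresholds are made up for illustration and would need tuning per corpus.

```python
from difflib import SequenceMatcher


def _norm_row(row):
    """Join cells into one lowercase string with collapsed whitespace."""
    return " | ".join(" ".join(cell.split()).lower() for cell in row)


def header_similarity(a, b):
    """Similarity in [0, 1] between two header rows (lists of cell strings)."""
    return SequenceMatcher(None, _norm_row(a), _norm_row(b)).ratio()


def stitch_tables(fragments, min_sim=0.9, max_gap=80.0):
    """Merge fragments that look like one table continued on the next page.

    fragments: reading-order list of dicts with hypothetical keys
      page       - page number
      rows       - list of rows, the first row being the header
      top_gap    - distance from the page's top margin to the table
      bottom_gap - distance from the table to the page's bottom margin
    """
    merged = []
    for frag in fragments:
        prev = merged[-1] if merged else None
        if (
            prev is not None
            and frag["page"] == prev["last_page"] + 1
            and prev["bottom_gap"] <= max_gap      # previous part ran to the page end
            and frag["top_gap"] <= max_gap         # this part starts at the page top
            and header_similarity(prev["rows"][0], frag["rows"][0]) >= min_sim
        ):
            prev["rows"].extend(frag["rows"][1:])  # skip the repeated header
            prev["last_page"] = frag["page"]
            prev["bottom_gap"] = frag["bottom_gap"]
        else:
            merged.append({
                "first_page": frag["page"],
                "last_page": frag["page"],
                "rows": list(frag["rows"]),
                "bottom_gap": frag["bottom_gap"],
            })
    return merged
```

tables whose continuation pages do not repeat the header need a different signal, this sketch would treat their first data row as a header and refuse to merge.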
for a starting point, look at the docling codebase on github, skip the llm parts and just study how they do layout analysis. it's the most coherent open source structure i've seen, even if you have to tweak it heavily. the tradeoff is that deterministic pipelines are stable but you'll always have edge cases, while layout models can be brittle on formal docs.
Expensive-Remote2650@reddit (OP)
Yeah, that sounds a lot closer to reality than most answers I’ve seen.
That’s basically where my head is at too: keep the structural layer deterministic, accept that there’ll be ugly edge cases, and only use smarter stuff when the rules get shaky.
Out of curiosity, when you were stitching tables, what signals actually held up best? Repeated header text, stable column positions, spacing, row shape, some combo of all of them?
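For context, this is roughly how I was planning to compare those signals on my own documents: compute each one separately and log it, instead of folding everything into a single yes/no. It's only a sketch; the dict keys (`header`, `col_x`, `last_row`, `first_row`), the function names and the 3-point tolerance are placeholders I made up.

```python
def columns_align(edges_a, edges_b, tol=3.0):
    """True if both fragments have the same number of columns and each
    column's left edge (in PDF points) matches within `tol`."""
    if len(edges_a) != len(edges_b):
        return False
    return all(abs(a - b) <= tol for a, b in zip(sorted(edges_a), sorted(edges_b)))


def continuation_signals(prev, cur, tol=3.0):
    """Report each continuation signal on its own so they can be compared.

    prev / cur: dicts describing the table fragment at the end of one page
    and the fragment at the start of the next page.
    """
    def norm(cells):
        return [" ".join(c.split()).lower() for c in cells]

    return {
        "same_header": norm(prev["header"]) == norm(cur["header"]),
        "columns_align": columns_align(prev["col_x"], cur["col_x"], tol),
        "same_row_shape": len(prev["last_row"]) == len(cur["first_row"]),
    }
```

Logging these per page boundary across a batch of documents should at least show which signals disagree most often, which is what I'm hoping to learn from your experience before building it.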