Best document parser

Posted by aiwtl@reddit | LocalLLaMA | View on Reddit | 17 comments

I am in quest of finding SOTA document parser for PDF/Docx files. I have about 100k pages with tables, text, images(with text) that I want to convert to markdown format.

What is the best open source document parser available right now? That reaches near to Azure document intelligence accruacy.

I have explored

Doclin
Marker
Pymupdf

Which one would be best to use in production?

[-]

Allergic2Humans@reddit

I use pymupdf with a vision llm model locally running. Haven’t faced any issues so far

[-]

GullibleEngineer4@reddit

Hey, how are you using Pymupdf with the vision model? I don't think Pymupdf uses any kind of AI as far as I know.

[-]

SouthTurbulent33@reddit

You can check this out: https://github.com/Zipstack/llmwhisperer-table-extraction

I did try a few others a while back, but now use llm whisperer for my day-to-day

[-]

a_slay_nub@reddit

Marker doesn't allow for commercial use.

Docling if you have the compute, pymupdf if you don't.

[-]

harsh_khokhariya@reddit

i think marker does allow commercial use: GNU GPL V3

[-]

secopsml@reddit

Maybe check https://github.com/microsoft/markitdown

[-]

g0pherman@reddit

I never tested and was curious to see that running. Specially for other formats where Microsoft knows it better like docx

[-]

aiwtl@reddit (OP)

doesn't work that good in edge cases.

[-]

a_slay_nub@reddit

That just runs pdfminer on the backend which imo is worse than pymupdf and slower.

[-]

Mkengine@reddit

https://github.com/GiftMungmeeprued/document-parsers-list

[-]

Reason_is_Key@reddit

I’ve tried a bunch of parsers too, and honestly struggled with consistency on large volumes like yours, especially when it came to tables and mixed-content layouts.

I now use Retab.com (not open-source but developer-friendly), it handles PDF/Docx parsing at scale with near-perfect accuracy, especially on structured outputs like Markdown or JSON.

It’s been more reliable than Azure Document Intelligence in my case (and faster and easier to QA thanks to the visual interface).

Happy to share more if you’re curious, but there is a free trial if you want to check it out.

[-]

cpdomina@reddit

Here are some extra ones:

Unfortunately there's no "better" one, it all depends on your files/domain. And no, nothing compares to Azure wrt precision.

[-]

nerdlord420@reddit

Docling with the EasyOCR/RapidOCR backend should do what you want.

[-]

g0pherman@reddit

Probably the best OSS, but i got better results with Mistral OCR

[-]

Yeah, for OCR locally we got very good results with RolmOCR (Qwen2.5-VL-7B finetuned on olmOCR dataset) but found that "there's no replacement for displacement" at least for handwriting because we achieved the best results with Qwen2.5-VL-72B in our testing

[-]

Fair-Elevator6788@reddit

pypdf docling