What are you using to preprocess pdfs before feeding them to a local model?

Posted by TangeloOk9486@reddit | LocalLLaMA | View on Reddit | 12 comments

I have been running a local setup for document QA and the output quality varies a lot depending on what the pdf looks like when it hits the LLM. clean prose docs are fine but anything with tables or multi column layouts comes out garbled and the model just works with whatever broken input it got. (No complaints, no demands sort of thing)

I had tried pymupdf and pdfplumber and both were decent for simple stuff tho. now stuck trying to figure out whether to go with docling or llamaparse for the messier docs, both keep coming up but i cant tell which actually makes sense for my setup or if theres something else people are using locally that holds up better. Whats your take on these guys?? Which one would be more practical

[-]

while-1-fork@reddit

PyMuPDF4LLM but I don't know if there are better options. From what I have seen the extract images feature misses most images even in pdfs that are not scanned. Seems to be decent with the format or pure text ones.

[-]

natermer@reddit

I've used a old version of tesseract on a ancient hacked kindle and it is fast enough to convert pages of PDFs to be reflowable with a short pause. Like seconds.

And that has less processing power then the average smartwatch.

It doesn't seem like much of a stretch to use normal Linux pdf tools to convert the pdf into tifs and then tesseract into text.

I haven't tried that in any document processing workflow, but for a lot of common things it is probably a lot more efficient to use existing tools then to blow billions of GPU cycles on a LLM blackbox. And feeding text to models is probably a lot more efficient then feeding pdfs.

[-]

monkmartinez@reddit

I am using docling and custom pipeline with qwen and gemma along the way to double check all the work.

[-]

jake_that_dude@reddit

docling is the practical local default here.

use it for the first pass to markdown + table blocks, then route only the ugly pages through MinerU/PaddleOCR or a VLM page-image path.

the big win is keeping layout metadata with each chunk: `page`, `bbox`, table id, heading path. if you flatten everything to text, RAG will hallucinate over broken columns no matter what model you put behind it.

[-]

Big_Wave9732@reddit

Docling is the more involved solution, but it is the one that most thoroughly solves OP's problem.

[-]

OutlandishnessIll466@reddit

if its a multi modal like qwen just create an image from every page and feed the images. Dont need to OCR.

[-]

dir3ctly@reddit

Yes. You will get small errors or spelling from time to time though. That's why I feed both image and parsed PDF text (pdftotext -layout "$pdf_file" "$txt_file") to the LLM and let it use the image as primary source and the text as reference to fix typos.

Of course, this only works if the PDF contains text and is not image only.

[-]

seamonn@reddit

Kreuzberg + Gemma 4

[-]

mikewilkinsjr@reddit

I like kreuzberg for this (and I’m also using it with Gemma 4, oddly enough).

My only complaint with kreuzberg out of the box is sometimes it sacrifices accuracy for speed. For example, I just ran a doc through and it misspelled clinical as “climicel” in the markdown. It doesn’t happen often but it does happen.

[-]

seamonn@reddit

you can configure it to force OCR every page

[-]

mikewilkinsjr@reddit

I was able to adjust it so it very rarely has an issue. I do tend to at least skim the output before feeding it into the knowledge base just to catch outliers.

OCR works but I found it to be relatively slow on my hardware.

[-]

guai888@reddit

You might want to take a look at PaddleOCR and MinerU