What are you using to preprocess PDFs before feeding them to a local model?

Posted by TangeloOk9486@reddit | LocalLLaMA

I've been running a local setup for document QA, and the output quality varies a lot depending on what the PDF looks like when it hits the LLM. Clean prose docs are fine, but anything with tables or multi-column layouts comes out garbled, and the model just works with whatever broken input it gets (it doesn't complain or flag anything).

I tried PyMuPDF and pdfplumber, and both were decent for simple stuff. Now I'm stuck trying to figure out whether to go with Docling or LlamaParse for the messier docs. Both keep coming up, but I can't tell which actually makes sense for my setup, or if there's something else people are using locally that holds up better. What's your take on these two? Which one would be more practical?
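
For reference, here is a minimal sketch of the kind of table handling in question: turning extracted rows into a Markdown table so the model sees explicit columns instead of run-together text. It assumes rows shaped like what pdfplumber's `extract_tables()` returns (a list of rows, each a list of strings or `None`) and that the first row is the header. `table_to_markdown` is a hypothetical helper name, not part of any library.

```python
from typing import Optional, Sequence


def table_to_markdown(rows: Sequence[Sequence[Optional[str]]]) -> str:
    """Render a table (first row treated as header) as a Markdown pipe table."""
    if not rows:
        return ""
    # Pad short rows so every line has the same number of columns.
    width = max(len(r) for r in rows)

    def clean(cell: Optional[str]) -> str:
        # Empty cells arrive as None; embedded newlines and pipes would break the row.
        text = "" if cell is None else str(cell)
        return " ".join(text.split()).replace("|", "\\|")

    def fmt(row: Sequence[Optional[str]]) -> str:
        cells = [clean(c) for c in row] + [""] * (width - len(row))
        return "| " + " | ".join(cells) + " |"

    header, *body = rows
    lines = [fmt(header), "| " + " | ".join(["---"] * width) + " |"]
    lines.extend(fmt(r) for r in body)
    return "\n".join(lines)
```

With pdfplumber, the input would come from calling `page.extract_tables()` on each page in `pdfplumber.open(path).pages`. This only helps when the extractor finds the table grid in the first place, and it does nothing for multi-column reading order.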