RAG for complex PDFs (DDQ finance) — struggling with parsing vs privacy trade-off
Posted by Proof-Exercise2695@reddit | LocalLLaMA | View on Reddit | 5 comments
Hey everyone,
I’ve built a fairly flexible RAG pipeline that was initially designed to handle any type of document (PDFs, reports, mixed content, etc.). The setup allows users to choose between different parsers and models:
- Parsing: LlamaParse (LlamaCloud) or Docling
- Models: OpenAI API or local (Ollama)
What I’m seeing
After a lot of testing:
- Best results by far: LlamaParse + OpenAI → handles complex PDFs (tables, graphs, layout) really well → answers are accurate and usable
- Local setup (Docling + Ollama) → very slow → poor parsing (structure is lost) → responses often incorrect
The problem
Now the use case has evolved:
👉 We need to process confidential financial documents (DDQ — Due Diligence Questionnaires)
These are:
- 150–200 page PDFs
- lots of tables, structured Q&A, repeated sections
- very sensitive data
So:
- ❌ Can’t really send them to external cloud APIs
- ❌ LlamaParse (public API) becomes an issue
- ❌ Full local pipeline gives bad results
What I’ve tried
- Running Ollama directly on full PDFs → not usable
- Docling parsing → not good enough for DDQ
- Basic chunking → leads to hallucinations
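The "basic chunking → hallucinations" failure usually comes from cutting tables and Q&A blocks mid-structure. A minimal sketch of structure-aware chunking on parsed Markdown (illustrative only, not my actual pipeline; the sample document is made up):

```python
import re

def chunk_by_heading(markdown: str, max_chars: int = 2000):
    """Split parsed Markdown on headings so tables and Q&A blocks
    stay intact, instead of cutting at fixed character offsets."""
    # Zero-width split: keep each heading attached to its section body.
    sections = re.split(r"(?m)^(?=#{1,3} )", markdown)
    chunks = []
    for sec in sections:
        sec = sec.strip()
        if not sec:
            continue
        # Only fall back to hard splits for oversized sections.
        while len(sec) > max_chars:
            cut = sec.rfind("\n", 0, max_chars)
            cut = cut if cut > 0 else max_chars
            chunks.append(sec[:cut])
            sec = sec[cut:].lstrip()
        chunks.append(sec)
    return chunks

doc = "# Section A\nIntro text.\n\n## Q1\nAnswer one.\n\n## Q2\nAnswer two."
print(chunk_by_heading(doc))
```

Each chunk then carries a complete heading-plus-body unit, which is what the retriever actually needs.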
My current understanding
The bottleneck is clearly parsing quality, not the LLM.
LlamaParse works because it:
- understands layout
- extracts tables properly
- preserves structure
My question
What are people using today for this kind of setup?
👉 Ideally I’m looking for one of these:
- Private / self-hosted equivalent of LlamaParse
- Paid but secure (VPC / enterprise) parsing solution
- A strong fully local pipeline that can handle:
- complex tables
- structured Q&A documents (like DDQs)
Bonus question
For those working with DDQs:
- Are you restructuring documents into Q/A pairs before indexing?
- Any best practices for chunking in this context?
Would really appreciate any feedback, especially from people working in finance / compliance contexts.
Thanks 🙏
ImpossibleCollege635@reddit
Which operating system are you on?
I'm currently developing a Mac app that runs 100% locally, with no preinstalls, coding, scripting, Ollama, etc. needed. It does PDF → clean MD with detailed chart annotation, complex table and math preservation, and AI-guided extraction. The extraction doesn't use structured-output LLM features; instead the LLM inspects the MD, writes an extraction script with regex, and runs it in a sandboxed local execution step.
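To illustrate the pattern (not the actual app code; the table and field name here are made up): the LLM emits a small throwaway regex script like this after inspecting the converted Markdown, and it runs sandboxed:

```python
import re

# Invented example Markdown table, the kind a PDF->MD converter produces.
markdown = """
| Metric       | Value   |
|--------------|---------|
| Total assets | 1,234.5 |
| Net income   | 210.0   |
"""

def extract_metric(md: str, metric: str):
    # Match "| <metric> | <value> |" rows in a Markdown table.
    pattern = rf"\|\s*{re.escape(metric)}\s*\|\s*([^|]+?)\s*\|"
    m = re.search(pattern, md)
    return m.group(1) if m else None

print(extract_metric(markdown, "Total assets"))  # 1,234.5
```

The point is that the regex is cheap, deterministic, and auditable, which is why it can replace a lot of per-page ML inference.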
It's not out yet, but I'd love to get you free tester access if you'd be interested and OK with providing feedback.
In my own tests it beats even LlamaParse for papers and is SIGNIFICANTLY faster than Docling, because I replace 90% of the ML stuff with heuristics.
I developed it because our org works with tons of scientific papers and my colleagues and I face similar problems.
I've never even seen a DDQ doc and have zero knowledge about finance/compliance, but it sounds like the foundational hurdles are the same as with paper PDFs.
Shoot me a dm if you'd be interested:)
ImpossibleCollege635@reddit
For the extraction you need to either download an additional model within the app (no prior runtime needed) or connect Ollama / a cloud model, though... On my M1 it's super speedy and good using Gemma4...
Proof-Exercise2695@reddit (OP)
I’m on Windows.
For now I’ve already built a local tool. When a user logs in and uploads a file/folder, I ask if it’s private or not.
We also have Copilot Enterprise (fully private), but it struggles when answers are inside images in PDFs. My pipeline (parsing + LLM) actually performs better in those cases—I’ve tested against ChatGPT, Claude, and Copilot.
Another option internally is Rovo (Atlassian/Confluence), which works well too, except again for images.
What users really want is simple: upload an Excel/Word file with questions (DDQ = Due Diligence Questionnaire, basically large structured docs with lots of company/compliance questions, often in tables), and get it back auto-filled.
Example: one file has “Company name: Reddit”, and the uploaded file just has “Company name” — the goal is to automatically fill the answer next to each question.
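A toy sketch of that matching step (the field names, values, and the difflib-based fuzzy match are illustrative only, not my actual tool):

```python
from difflib import get_close_matches

# Hypothetical answers already extracted from the indexed documents.
extracted = {
    "Company name": "Reddit",
    "Country of incorporation": "USA",
    "Number of employees": "2000",
}

def fill_answers(questions, knowledge, cutoff=0.6):
    """Match each question label from the uploaded sheet against the
    extracted answer store, tolerating small wording differences
    ("Company Name:" vs "Company name")."""
    filled = {}
    for q in questions:
        match = get_close_matches(q.strip(" :"), list(knowledge), n=1, cutoff=cutoff)
        filled[q] = knowledge[match[0]] if match else None
    return filled

print(fill_answers(["Company Name:", "No. of employees"], extracted))
```

In practice the fuzzy match would be an embedding lookup, but the shape of the problem is the same: question label in, answer out, leave a blank when nothing clears the cutoff.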
One-Setting7510@reddit
Yeah, that's the classic privacy vs accuracy squeeze. For confidential financial docs, you really can't compromise on extraction quality since a missed table or misread number tanks the whole analysis.
Docling's gotten better but still struggles with complex layouts compared to LlamaParse. The speed issue with Ollama is real too, especially at scale. Since you're handling DDQs, have you looked into self-hosted parsing solutions that don't send data externally? UnWeb (https://unweb.info) has some interesting approaches for keeping documents local while maintaining reasonable parsing quality. Not perfect for finance complexity, but worth testing before you commit to shipping everything to the cloud.
For your use case though, you might need to accept the LlamaParse call with on-prem inference for the actual LLM work. That's honestly the pragmatic middle ground.
Status_Record_1839@reddit
For self-hosted parsing comparable to LlamaParse, look at Marker — it handles tables and complex layouts well and runs fully local. Pair it with a local Ollama model for the RAG part. For DDQs specifically, pre-converting each Q/A pair into a chunk with the question as metadata helps a lot with retrieval accuracy.
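A minimal sketch of what I mean by Q/A-pair chunks with the question as metadata (toy data and a toy word-overlap scorer standing in for a real embedding retriever):

```python
# Invented DDQ content for illustration.
qa_pairs = [
    ("Does the firm have a business continuity plan?", "Yes, reviewed annually..."),
    ("Who is the compliance officer?", "Jane Doe, appointed 2021..."),
]

# One chunk per Q/A pair; the question is duplicated into metadata so
# retrieval can score against the question text directly.
chunks = [
    {"text": f"Q: {q}\nA: {a}", "metadata": {"question": q, "source": "ddq.pdf"}}
    for q, a in qa_pairs
]

def retrieve(query: str, chunks):
    # Toy lexical scorer: word overlap between query and stored question.
    def score(c):
        qwords = set(c["metadata"]["question"].lower().split())
        return len(qwords & set(query.lower().split()))
    return max(chunks, key=score)

hit = retrieve("business continuity plan", chunks)
print(hit["metadata"]["question"])
```

Because incoming DDQ questionnaires largely repeat the same questions, matching on the question field tends to be far more precise than matching against raw page chunks.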