RAG for complex PDFs (DDQ finance) — struggling with parsing vs privacy trade-off
Posted by Proof-Exercise2695@reddit | LocalLLaMA | View on Reddit | 5 comments
Hey everyone,
I’ve built a fairly flexible RAG pipeline that was initially designed to handle any type of document (PDFs, reports, mixed content, etc.). The setup allows users to choose between different parsers and models:
- Parsing: LlamaParse (LlamaCloud) or Docling
- Models: OpenAI API or local (Ollama)
What I’m seeing
After a lot of testing:
- Best results by far: LlamaParse + OpenAI → handles complex PDFs (tables, graphs, layout) really well → answers are accurate and usable
- Local setup (Docling + Ollama) → very slow → poor parsing (structure is lost) → responses often incorrect
The problem
Now the use case has evolved:
👉 We need to process confidential financial documents (DDQ — Due Diligence Questionnaires)
These are:
- 150–200 page PDFs
- lots of tables, structured Q&A, repeated sections
- very sensitive data
So:
- ❌ Can’t really send them to external cloud APIs
- ❌ LlamaParse (public API) becomes an issue
- ❌ Full local pipeline gives bad results
What I’ve tried
- Running Ollama directly on full PDFs → not usable
- Docling parsing → not good enough for DDQ
- Basic chunking → leads to hallucinations
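The "basic chunking → hallucinations" failure usually comes from cutting tables and Q&A blocks mid-structure. A minimal sketch of structure-aware chunking on parsed Markdown (illustrative only, not my actual pipeline; the sample document is made up):

```python
import re

def chunk_by_heading(markdown: str, max_chars: int = 2000):
    """Split parsed Markdown on headings so tables and Q&A blocks
    stay intact, instead of cutting at fixed character offsets."""
    # Zero-width split: keep each heading attached to its section body.
    sections = re.split(r"(?m)^(?=#{1,3} )", markdown)
    chunks = []
    for sec in sections:
        sec = sec.strip()
        if not sec:
            continue
        # Only fall back to hard splits for oversized sections.
        while len(sec) > max_chars:
            cut = sec.rfind("\n", 0, max_chars)
            cut = cut if cut > 0 else max_chars
            chunks.append(sec[:cut])
            sec = sec[cut:].lstrip()
        chunks.append(sec)
    return chunks

doc = "# Section A\nIntro text.\n\n## Q1\nAnswer one.\n\n## Q2\nAnswer two."
print(chunk_by_heading(doc))
```

Each chunk then carries a complete heading-plus-body unit, which is what the retriever actually needs.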
My current understanding
The bottleneck is clearly parsing quality, not the LLM.
LlamaParse works because it:
- understands layout
- extracts tables properly
- preserves structure
My question
What are people using today for this kind of setup?
👉 Ideally I’m looking for one of these:
- Private / self-hosted equivalent of LlamaParse
- Paid but secure (VPC / enterprise) parsing solution
- A strong fully local pipeline that can handle:
- complex tables
- structured Q&A documents (like DDQs)
Bonus question
For those working with DDQs:
- Are you restructuring documents into Q/A pairs before indexing?
- Any best practices for chunking in this context?
Would really appreciate any feedback, especially from people working in finance / compliance contexts.
Thanks 🙏
ImpossibleCollege635@reddit
Which operating system are you on?
I'm currently developing a Mac app that runs 100% locally, with no preinstalls, coding, scripting, Ollama, etc. needed. It does PDF → clean MD with detailed chart annotation, complex table and math preservation, and AI-guided extraction. The extraction doesn't use structured-output LLM features; instead the LLM inspects the MD, writes an extraction script with regex, and runs it in a sandboxed local execution step.
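To illustrate the pattern (not the actual app code; the table and field name here are made up): the LLM emits a small throwaway regex script like this after inspecting the converted Markdown, and it runs sandboxed:

```python
import re

# Invented example Markdown table, the kind a PDF->MD converter produces.
markdown = """
| Metric       | Value   |
|--------------|---------|
| Total assets | 1,234.5 |
| Net income   | 210.0   |
"""

def extract_metric(md: str, metric: str):
    # Match "| <metric> | <value> |" rows in a Markdown table.
    pattern = rf"\|\s*{re.escape(metric)}\s*\|\s*([^|]+?)\s*\|"
    m = re.search(pattern, md)
    return m.group(1) if m else None

print(extract_metric(markdown, "Total assets"))  # 1,234.5
```

The point is that the regex is cheap, deterministic, and auditable, which is why it can replace a lot of per-page ML inference.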
It's not out yet, but I'd love to get you free tester access if you'd be interested and OK with providing feedback.
In my own tests it beats even LlamaParse for papers and is SIGNIFICANTLY faster than Docling, because I replace 90% of the ML stuff with heuristics.
I developed it because our org works with tons of scientific papers and my colleagues and I face similar problems.
I've never even seen a DDQ doc and have zero knowledge about finance/compliance, but it sounds like the foundational hurdles are the same as with paper PDFs.
Shoot me a dm if you'd be interested:)
ImpossibleCollege635@reddit
For the extraction you need to either download an additional model within the app (no prior runtime needed) or connect Ollama / a cloud model, though... On my M1 it's super speedy and good using Gemma4...
Proof-Exercise2695@reddit (OP)
I’m on Windows.
For now I’ve already built a local tool. When a user logs in and uploads a file/folder, I ask if it’s private or not.
We also have Copilot Enterprise (fully private), but it struggles when answers are inside images in PDFs. My pipeline (parsing + LLM) actually performs better in those cases—I’ve tested against ChatGPT, Claude, and Copilot.
Another option internally is Rovo (Atlassian/Confluence), which works well too, except again for images.
What users really want is simple: upload an Excel/Word file with questions (DDQ = Due Diligence Questionnaire, basically large structured docs with lots of company/compliance questions, often in tables), and get it back auto-filled.
Example: one file has “Company name: Reddit”, and the uploaded file just has “Company name” — the goal is to automatically fill the answer next to each question.
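A toy sketch of that matching step (the field names, values, and the difflib-based fuzzy match are illustrative only, not my actual tool):

```python
from difflib import get_close_matches

# Hypothetical answers already extracted from the indexed documents.
extracted = {
    "Company name": "Reddit",
    "Country of incorporation": "USA",
    "Number of employees": "2000",
}

def fill_answers(questions, knowledge, cutoff=0.6):
    """Match each question label from the uploaded sheet against the
    extracted answer store, tolerating small wording differences
    ("Company Name:" vs "Company name")."""
    filled = {}
    for q in questions:
        match = get_close_matches(q.strip(" :"), list(knowledge), n=1, cutoff=cutoff)
        filled[q] = knowledge[match[0]] if match else None
    return filled

print(fill_answers(["Company Name:", "No. of employees"], extracted))
```

In practice the fuzzy match would be an embedding lookup, but the shape of the problem is the same: question label in, answer out, leave a blank when nothing clears the cutoff.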
One-Setting7510@reddit
Yeah, that's the classic privacy vs accuracy squeeze. For confidential financial docs, you really can't compromise on extraction quality since a missed table or misread number tanks the whole analysis.
Docling's gotten better but still struggles with complex layouts compared to LlamaParse. The speed issue with Ollama is real too, especially at scale. Since you're handling DDQs, have you looked into self-hosted parsing solutions that don't send data externally? UnWeb (https://unweb.info) has some interesting approaches for keeping documents local while maintaining reasonable parsing quality. Not perfect for finance complexity, but worth testing before you commit to shipping everything to the cloud.
For your use case though, you might need to accept the LlamaParse call with on-prem inference for the actual LLM work. That's honestly the pragmatic middle ground.
Status_Record_1839@reddit
For self-hosted parsing comparable to LlamaParse, look at Marker — it handles tables and complex layouts well and runs fully local. Pair it with a local Ollama model for the RAG part. For DDQs specifically, pre-converting each Q/A pair into a chunk with the question as metadata helps a lot with retrieval accuracy.
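A minimal sketch of what I mean by Q/A-pair chunks with the question as metadata (toy data and a toy word-overlap scorer standing in for a real embedding retriever):

```python
# Invented DDQ content for illustration.
qa_pairs = [
    ("Does the firm have a business continuity plan?", "Yes, reviewed annually..."),
    ("Who is the compliance officer?", "Jane Doe, appointed 2021..."),
]

# One chunk per Q/A pair; the question is duplicated into metadata so
# retrieval can score against the question text directly.
chunks = [
    {"text": f"Q: {q}\nA: {a}", "metadata": {"question": q, "source": "ddq.pdf"}}
    for q, a in qa_pairs
]

def retrieve(query: str, chunks):
    # Toy lexical scorer: word overlap between query and stored question.
    def score(c):
        qwords = set(c["metadata"]["question"].lower().split())
        return len(qwords & set(query.lower().split()))
    return max(chunks, key=score)

hit = retrieve("business continuity plan", chunks)
print(hit["metadata"]["question"])
```

Because incoming DDQ questionnaires largely repeat the same questions, matching on the question field tends to be far more precise than matching against raw page chunks.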