RAG for complex PDFs (DDQ finance) — struggling with parsing vs privacy trade-off

Posted by Proof-Exercise2695@reddit | LocalLLaMA

Hey everyone,

I’ve built a fairly flexible RAG pipeline that was initially designed to handle any type of document (PDFs, reports, mixed content, etc.). The setup allows users to choose between different parsers and models:


What I’m seeing

After a lot of testing:


The problem

Now the use case has evolved:

👉 We need to process confidential financial documents (DDQ — Due Diligence Questionnaires)

These are:

So:


What I’ve tried


My current understanding

The bottleneck is clearly parsing quality, not the LLM.

LlamaParse works because it:


My question

What are people using today for this kind of setup?

👉 Ideally I’m looking for one of these:

  1. Private / self-hosted equivalent of LlamaParse
  2. Paid but secure (VPC / enterprise) parsing solution
  3. A strong fully local pipeline that can handle:
     - complex tables
     - structured Q&A documents (like DDQs)
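
To illustrate what handling structured Q&A documents could look like once a parser has already produced plain text, here is a minimal Python sketch that keeps each question together with its answer as one retrieval chunk. It is only an assumption about the layout: it expects every question to start a line with a dotted number followed by `.` or `)` (for example `1.1.` or `2)`), which real DDQs may not follow. `QAChunk`, `split_ddq`, and `QUESTION_RE` are hypothetical names made up for this example and are not part of LlamaParse or any other tool. It uses only the standard library, so nothing leaves the machine.

```python
import re
from dataclasses import dataclass

# Assumed layout: a question starts a line with a dotted number and a
# trailing "." or ")", e.g. "1.1. Describe ..." or "2) Describe ...".
QUESTION_RE = re.compile(r"^(\d+(?:\.\d+)*)[.)]\s+(.+)$")


@dataclass
class QAChunk:
    number: str    # question number, e.g. "1.1"
    question: str  # question text on the numbered line
    answer: str    # all text up to the next numbered line


def split_ddq(text: str) -> list[QAChunk]:
    """Group already-parsed text into one chunk per numbered question."""
    chunks: list[QAChunk] = []
    current = None  # (number, question) of the chunk being built
    answer_lines: list[str] = []

    def flush() -> None:
        # Store the chunk collected so far, if any.
        if current is not None:
            answer = "\n".join(answer_lines).strip()
            chunks.append(QAChunk(current[0], current[1], answer))

    for line in text.splitlines():
        match = QUESTION_RE.match(line.strip())
        if match:
            flush()
            current = (match.group(1), match.group(2))
            answer_lines = []
        elif current is not None:
            # Lines before the first question are ignored.
            answer_lines.append(line)
    flush()
    return chunks
```

Each `QAChunk` can then be embedded as a single unit (question plus answer) instead of being cut by a fixed-size splitter. This does not address table extraction, which still depends on the parser.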

Bonus question

For those working with DDQs:


Would really appreciate any feedback, especially from people working in finance / compliance contexts.

Thanks 🙏