Retrieval-augmented generation - is PDF extraction all you need?

Posted by kleenex007@reddit | LocalLLaMA | View on Reddit | 22 comments

Hello community!

I have been checking with the community on how best to approach RAG in 2025 for a niche domain, in particular for sensitive PDFs you can't send to a closed vendor.

Here is what I have observed over the last year:

- Tons of frameworks, all claiming to be the best, but no leaderboard to sort them out. It comes down to learning curve and abstraction level: most tools out there are not really modular, so you have to commit to one, or just write your own with direct API calls.
- LLMs seem to plateau; it is hard to see a big difference between them.
- Context sizes keep increasing, so why bother with chunking, embedding, etc.? Too much hassle.
- RAG, CAG, KAG, etc. all aim to improve context retrieval, but now you have even more parameters to tweak through vibe testing.
- PDF extraction creates a lot of gibberish, especially on books, business reports, and presentations with lots of tables, figures, and special content structure.
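To be fair, the chunking step that long contexts supposedly make obsolete is only a few lines. A minimal sliding-window sketch (the window and overlap sizes here are arbitrary assumptions, not recommendations):

```python
def chunk_text(text: str, size: int = 500, overlap: int = 100) -> list[str]:
    """Split text into overlapping character windows.

    size/overlap are illustrative defaults; tune per corpus and embedder.
    """
    if size <= overlap:
        raise ValueError("size must exceed overlap")
    step = size - overlap
    # Step forward by (size - overlap) so consecutive chunks share `overlap` chars.
    return [text[i:i + size] for i in range(0, max(len(text) - overlap, 1), step)]
```

The real tweaking starts when you swap character windows for token- or structure-aware splitting, which is where the "more parameters to vibe-test" problem kicks in.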

Where are the most effective areas to focus in terms of ROI? How would you start over, given what you know today?

Personally I would invest my entire effort on two fronts:

1) Make PDF extraction bulletproof, possibly with vision as well (graphs, figures, infographics, etc.).
2) Create an eval dataset + LLM-as-a-judge.
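For front 2), a minimal eval harness can be very small. This is a sketch only: `ask_judge` is a placeholder for a real model call (a local llama.cpp or vLLM endpoint, say), and the prompt format and 1-5 faithfulness scale are my own assumptions:

```python
import json

# Hypothetical judge prompt; the rubric and JSON reply format are assumptions.
JUDGE_PROMPT = """Rate the answer on a 1-5 scale for faithfulness to the context.
Question: {question}
Context: {context}
Answer: {answer}
Reply with JSON: {{"score": <1-5>, "reason": "<short>"}}"""

def ask_judge(prompt: str) -> str:
    # Placeholder: replace with an actual call to your local LLM.
    return '{"score": 5, "reason": "stub"}'

def judge_answer(question: str, context: str, answer: str, llm=ask_judge):
    """Ask the judge model to score one answer; returns (score, reason)."""
    raw = llm(JUDGE_PROMPT.format(question=question, context=context, answer=answer))
    verdict = json.loads(raw)
    return verdict["score"], verdict["reason"]

def run_eval(dataset: list[dict], answer_fn, llm=ask_judge) -> float:
    """dataset: list of {"question", "context"} dicts; returns mean judge score."""
    scores = []
    for ex in dataset:
        answer = answer_fn(ex["question"], ex["context"])
        score, _ = judge_answer(ex["question"], ex["context"], answer, llm)
        scores.append(score)
    return sum(scores) / len(scores)
```

The nice property is that the judge and the pipeline under test are decoupled, so you can swap extraction or retrieval stages and re-run the same eval set to see what actually moved the score.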

Thanks. Looking forward to a vibrant discussion!