Retrieval augmented generation - PDF extraction is all you need?
Posted by kleenex007@reddit | LocalLLaMA | View on Reddit | 22 comments
Hello community!
I've been checking with the community on how to best approach RAG in 2025 for a niche domain, in particular for sensitive PDFs you can’t send to a closed vendor.
A few things I have observed over the last year:
- Tons of frameworks, all claiming to be the best, but no leaderboard to sort them out. It comes down to learning curve and abstraction level - most tools out there are not really modular, so you have to commit to one. Or you just write your own with direct API calls.
- LLMs seem to plateau - it's hard to see big differences between them. Context size keeps increasing, so why bother with chunking, embedding, etc.? Too much hassle.
- RAG, CAG, KAG, etc. all aim to improve context retrieval, but now you have more parameters to tweak through vibe testing.
- PDF extraction creates a lot of gibberish, especially on books, business reports, and presentations, where you have lots of tables, figures, and special content structure.
Where are the most effective areas to focus on in terms of ROI? How would you start over, given what you know today?
Personally I would invest my entire effort on two fronts:
1) Make PDF extraction bulletproof, possibly with vision as well (graphs, figures, infographics, etc.).
2) Create an eval dataset + LLM-as-a-judge.
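For point 2, a minimal harness could look like the sketch below. The judge here is a trivial keyword-overlap stand-in so the loop is runnable end to end; in practice you would swap in a call to your local LLM with a grading prompt. All names and the sample case are illustrative, not from any real project.

```python
# Minimal LLM-as-a-judge eval harness (sketch).
# `judge` is pluggable: a real one would prompt a local LLM to grade the
# candidate answer against the reference; the stand-in below just measures
# keyword overlap so the harness itself can be exercised.
from dataclasses import dataclass
from typing import Callable


@dataclass
class EvalCase:
    question: str
    reference: str  # gold answer, written by a human


def keyword_overlap_judge(reference: str, candidate: str) -> float:
    """Stand-in judge: fraction of reference words present in the candidate."""
    ref = set(reference.lower().split())
    cand = set(candidate.lower().split())
    return len(ref & cand) / len(ref) if ref else 0.0


def run_eval(cases: list[EvalCase],
             answer_fn: Callable[[str], str],
             judge: Callable[[str, str], float]) -> float:
    """Average judge score of `answer_fn` over the eval set."""
    scores = [judge(c.reference, answer_fn(c.question)) for c in cases]
    return sum(scores) / len(scores)


cases = [EvalCase("What format are the reports in?", "pdf with tables")]
score = run_eval(cases, lambda q: "the reports are pdf with tables",
                 keyword_overlap_judge)
```

The point of the pluggable `judge` is that you can start with a cheap heuristic, then upgrade to a real LLM judge without touching the harness or the eval set.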
Thanks. Looking forward to a vibrant discussion
GrouchyGeologist2042@reddit
You hit the nail on the head. Everyone is obsessed with chunking strategies and embedding models, but if your PDF extraction outputs gibberish, your RAG pipeline is dead on arrival. Government and corporate PDFs are the worst offenders.
I got so frustrated with this that I completely abandoned live PDF parsing in my agent workflows. Instead, I built an M2M (Machine-to-Machine) architecture using Google Dorks to find the PDFs and a nightly cronjob that pipes them through Llama-3 to extract strict JSON into an SQLite cache.
Now, my agents just make a sub-50ms GET request to a REST endpoint instead of trying to read PDFs on the fly. I left a public test endpoint open at
https://redactproxy.com/v1/opportunities/search if anyone wants to see how much cleaner the context window gets when you feed the LLM strictly typed JSON instead of raw PDF text.
kleenex007@reddit (OP)
How do you use Google Dorks exactly? I’m not familiar with them.
PDFBolt@reddit
Yeah, totally agree. PDF extraction is still a mess, especially with tables and weird formatting. Feels like better OCR + vision models could fix a lot of that. And yeah, an actual leaderboard for RAG frameworks would save so much time instead of trial and error. Curious to see where things go this year.
if47@reddit
RAG is just a smart contract on top of an LLM.
kleenex007@reddit (OP)
can you elaborate? first time hearing this
if47@reddit
RAG is a technology that is primarily VC-oriented and has been very heavily hyped as a solution to many difficult problems; in practice it solves very few of them.
knselektor@reddit
https://huggingface.co/blog/ngxson/make-your-own-rag
At first it may sound pedantic, BUT after you read the article you can evaluate for yourself what is done in a RAG framework/library, and whether you want your own code or a framework maintained by others.
For me it's about risk management: the problem is simple enough that it doesn't need the complexity and attack surface of an external library.
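To make concrete how little code the core retrieval loop of a hand-rolled RAG actually needs, here is a minimal sketch in the spirit of the linked article. A bag-of-words vector stands in for a real embedding model, so everything below is illustrative.

```python
# Minimal retrieval core: embed chunks, embed the query, rank by cosine
# similarity. Swap `embed` for a real embedding model in practice.
import math
from collections import Counter


def embed(text: str) -> Counter:
    """Stand-in embedding: word counts."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[w] * b[w] for w in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0


def top_k(query: str, chunks: list[str], k: int = 2) -> list[str]:
    """Return the k chunks most similar to the query."""
    q = embed(query)
    return sorted(chunks, key=lambda c: cosine(q, embed(c)), reverse=True)[:k]


chunks = [
    "the invoice total is 42 euros",
    "shipping is handled by a third party",
    "totals are listed per invoice line",
]
best = top_k("what is the invoice total", chunks, k=1)
```

Everything past this (chunking policy, reranking, prompt assembly) is additions to this loop, which is the risk-management argument: the core is small enough to own.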
kleenex007@reddit (OP)
so you're going with your own implementation from scratch, with just the features you need?
knselektor@reddit
for the RAG code yes
klam997@reddit
Look into multi-agent workflows.
"context size increasing , why bother with chunking, embedding etc.. too much hassle" - because no matter how good your model is, past a certain context length its performance on the task drops.
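A fixed-size chunker with overlap is exactly the kind of step being defended here. A minimal sketch (the sizes and names are illustrative):

```python
# Split text into word windows of `size`, where consecutive windows
# share `overlap` words so sentences cut at a boundary still appear
# whole in at least one chunk.
def chunk(text: str, size: int = 200, overlap: int = 50) -> list[str]:
    words = text.split()
    step = size - overlap
    return [" ".join(words[i:i + size])
            for i in range(0, max(len(words) - overlap, 1), step)]


pieces = chunk(" ".join(str(i) for i in range(10)), size=4, overlap=1)
```

Keeping each retrieved chunk well under the length where the model degrades is the whole point; the overlap is cheap insurance against splitting a fact across two chunks.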
kleenex007@reddit (OP)
Thanks for your insight.
I don't get your multi-agent suggestion though. Are you referring to a "document" agent? https://docs.llamaindex.ai/en/stable/examples/agent/multi_document_agents/
or something else?
Mr_International@reddit
I've done a ton of testing with Docling; it's very good. But the Allen Institute just came out with olmOCR ("Open-Source OCR for Accurate Document Conversion"), which looks fantastic as well and which I'm looking to test.
kleenex007@reddit (OP)
how do you test?
nanokeyo@reddit
RAGs are useful, but struggle with global summaries due to context limitations. Improving this might involve increasing context or using context caching to cover more PDF content. With a dataset of 5000 legal PDFs, precision is a challenge. I’m exploring hybrid and search methods, but haven’t found a perfect solution yet.
Moreh@reddit
What do you mean by hybrid and search methods?
TorontoBiker@reddit
Probably they’re referring to Reciprocal Rank Fusion.
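For reference, Reciprocal Rank Fusion merges several ranked lists by giving each document a score of 1/(k + rank) in every list it appears in, then summing. A minimal sketch (k=60 is the constant commonly used, which damps the influence of top ranks):

```python
# Reciprocal Rank Fusion: fuse multiple rankings (e.g. BM25 + dense
# embeddings) into one, without needing comparable raw scores.
def rrf(rankings: list[list[str]], k: int = 60) -> list[str]:
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc in enumerate(ranking, start=1):
            scores[doc] = scores.get(doc, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)


# Fuse a keyword ranking with a semantic ranking: d2 wins because it is
# ranked highly by both, even though neither list puts it strictly first.
fused = rrf([["d1", "d2", "d3"], ["d2", "d3", "d1"]])
```

This is why it works well for hybrid search: it only uses rank positions, so a BM25 list and an embedding list can be fused without score normalization.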
TheActualStudy@reddit
"Looking forward to a vibrant discussion" - Less AI-generated posting, please.
PDF extraction - What's your experience with docling? It's meant to overcome the exact problems you listed.
Putting everything into context - Yup, that's how I do it now.
LLMs Plateauing - Sort-of. Fuse01, QwQ, and Qwen2.5 are still my go-tos and generally don't mess this sort of stuff up.
Tooling - A multi-turn chat interface is all you need when dealing with data that fits in context and iteratively refining the results.
Vision - I made an interface for myself so I can switch between models pretty effortlessly and regenerate if I accidentally sent images to a non-vision model.
Firm-Fix-5946@reddit
too many dashes, your text must be AI generated
kleenex007@reddit (OP)
Hey, I swear I’m not using AI to write this. Funny how, when you show enthusiasm, you sound like AI nowadays. Not the first time this has happened to me.
spazKilledAaron@reddit
You should write the posts with AI then haha
I wonder why AI bugs AI people so much. Try and mention that you used a model to learn something and someone will immediately downvote and call it a hallucination.
The only AI posts that are bad are the ones written by bots to inflate accounts, and probably those from super stoned people at 3am thinking the LLM just gave them “the key to agi”
glowcialist@reddit
Looks like you need to start having MopeyMule write your posts
LoSboccacc@reddit
It still depends heavily on your domain. If you can digest domain-specific document tags with a small, large-context model, it will improve all downstream tasks, especially if you can have them apply a known ontology in the space that partitions the content into a rankable tree.
Otherwise it's just blindly embedding with this or that strategy (or multiple strategies) for chunking, and figuring out which can produce useful work.