If Accuracy > Efficiency, How Would You Spec A Local RAG Machine?

Posted by elgringorojo | LocalLLaMA

Hey all,

I’ve already built a proof of concept on my personal machine (4090 + 64GB RAM) for a fully offline setup handling medical-records Q&A and drafting from my own documents, and it works well enough to show the idea is viable.

Now I’m trying to spec a real dedicated office machine. One key requirement is handling a few PDFs totaling roughly 1,000 to 1,500 pages (maybe 1M tokens, I think; sometimes more, but rarely). I understand this is fundamentally a RAG problem rather than a matter of fitting everything into context, but precision really matters here (medical records), so I’m even considering more brute-force approaches if the hardware can support it.
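For anyone who wants to sanity-check that token guess, here is a rough back-of-envelope sketch. The 500 to 700 tokens-per-page range is purely an assumption for dense text pages, not something I measured on these records; scanned or table-heavy PDFs could land well outside it, so running a real tokenizer over a sample would be the better check.

```python
# Rough token estimate from page count.
# ASSUMED_TOKENS_PER_PAGE is an illustrative guess, not a measured value.
ASSUMED_TOKENS_PER_PAGE = (500, 700)  # (low, high) for dense text pages


def estimate_tokens(pages, tokens_per_page):
    """Total tokens for a given page count and tokens-per-page figure."""
    return pages * tokens_per_page


def token_range(min_pages, max_pages, per_page=ASSUMED_TOKENS_PER_PAGE):
    """Low/high token estimate across a range of page counts."""
    low = estimate_tokens(min_pages, per_page[0])
    high = estimate_tokens(max_pages, per_page[1])
    return low, high


if __name__ == "__main__":
    low, high = token_range(1000, 1500)
    print(f"Estimated tokens: {low:,} to {high:,}")
```

Under those assumptions the range comes out to about 0.5M to 1.05M tokens, so "maybe 1M" looks like a reasonable upper-end figure.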

For those running more serious local setups, is sticking with a single 4090-class GPU still the best value, or does this kind of use case justify moving to higher VRAM or multi-GPU? And if you’ve prioritized accuracy over efficiency, where did you see the biggest gains or bottlenecks?

I’ve been toying with the idea of repurposing an old 3080 I have to do the chunking and then getting an RTX 6000 Ada 48GB, but is that overkill? Would an RTX 6000 Blackwell be able to hold that much in context for brute forcing?
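To put a rough number on the brute-force question, here is a minimal sketch of the usual KV-cache size estimate: 2 (keys and values) × layers × KV heads × head dimension × bytes per element × tokens. The model shape below (32 layers, 8 KV heads, head dimension 128, fp16 cache) is a hypothetical 8B-class configuration picked for illustration, not any specific model I have tested, and the 96 GB figure for the Blackwell card is my understanding of its capacity. Model weights and activations come on top of this.

```python
# Back-of-envelope KV-cache sizing for long-context inference.
# ASSUMED_MODEL is a hypothetical 8B-class shape, chosen only for illustration.
GIB = 1024 ** 3
ASSUMED_MODEL = dict(n_layers=32, n_kv_heads=8, head_dim=128)


def kv_cache_bytes(n_tokens, n_layers, n_kv_heads, head_dim, bytes_per_elem=2):
    """Bytes needed to cache keys and values for n_tokens (fp16 by default)."""
    # Factor 2: one key vector and one value vector per layer, KV head, and token.
    return 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem * n_tokens


def fits_in_vram(n_tokens, vram_gib, weights_gib=0.0, **model):
    """True if the KV cache plus (optional) weights fit in the given VRAM."""
    needed_gib = kv_cache_bytes(n_tokens, **model) / GIB + weights_gib
    return needed_gib <= vram_gib


if __name__ == "__main__":
    for n_tokens in (128_000, 1_000_000):
        cache_gib = kv_cache_bytes(n_tokens, **ASSUMED_MODEL) / GIB
        for vram_gib in (48, 96):  # Ada card from the post; assumed Blackwell capacity
            verdict = "fits" if fits_in_vram(n_tokens, vram_gib, **ASSUMED_MODEL) else "does not fit"
            print(f"{n_tokens:>9,} tokens: {cache_gib:6.1f} GiB cache, {verdict} in {vram_gib} GB")
```

With this assumed shape the cache alone is about 122 GiB at 1M tokens, which would not fit on either card before even loading weights, while 128K tokens needs roughly 16 GiB. Quantized KV caches or models with fewer KV heads would shift these numbers, so treat this as an order-of-magnitude check only.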

Would really appreciate any real-world experience here.