Am I going about this RAG Perplexity-on-crack Jarvis project the wrong way?

Posted by vick2djax@reddit | LocalLLaMA | View on Reddit | 4 comments

First real LLM project for me, probably same endgame as half the people here: personal Jarvis. But the reason I'm actually building it is bigger than that.

I'm a dad, and the more I mess with commercial LLMs the more worried I get that we're nearing the end of actually source-able information. Misinformation has been rough forever, but I already only really trust a small handful of outlets (AP, Reuters, a couple others), and the idea of some company baking their own agenda into the next model and deciding what counts as true for my kids does not sit right with me.

Started small. Daily digest that only pulls from sources I trust so I stop doom scrolling. Worked better than I expected.

Then I got ambitious. Extended it into a full RAG chatbot, basically Perplexity on crack but pulling only from a corpus I personally curated. Every answer cites back to what I put in, shows a confidence score and blind spots, and flags claims the corpus actually contradicts. 2M+ chunks across 14 collections and 67ish download sources now, so it's real. Which is also why the scope problem is getting painful.
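For anyone wondering what the "cites back + confidence + contradiction flag" shape looks like in practice, here's a minimal illustrative sketch. This is not my actual pipeline; the field names and the averaging heuristic are placeholder assumptions:

```python
from dataclasses import dataclass, field

@dataclass
class Citation:
    source: str        # which curated outlet the chunk came from
    chunk_id: str      # chunk identifier within its collection
    similarity: float  # retrieval similarity score in [0, 1]

@dataclass
class GroundedAnswer:
    text: str
    citations: list[Citation] = field(default_factory=list)
    contradicted: bool = False  # corpus contains a conflicting claim

    @property
    def confidence(self) -> float:
        # Naive heuristic: mean retrieval similarity,
        # halved when the corpus contradicts the claim.
        if not self.citations:
            return 0.0
        avg = sum(c.similarity for c in self.citations) / len(self.citations)
        return avg * (0.5 if self.contradicted else 1.0)

answer = GroundedAnswer(
    text="Claim synthesized from retrieved chunks.",
    citations=[Citation("AP", "ap-2024-001", 0.82),
               Citation("Reuters", "rt-2024-113", 0.78)],
)
print(f"confidence: {answer.confidence:.2f}")
```

The point is just that every answer object carries its provenance around, so the UI can refuse to show anything that can't point back into the corpus.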

Rigs

Unraid box

Stack

Where I'm stuck

Measured production throughput: ~13,500 chunks/hr on the 4B embedder. For the full 7M English Wikipedia pages:

So I'm staring down 2.5 to 3.5 years for full local Wikipedia. That's already assuming the tail runs background-only.
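For anyone sanity-checking the math, here's the back-of-envelope. The chunks-per-page number is a guess (it depends entirely on chunk size and overlap); the throughput is my measured figure:

```python
# Back-of-envelope ETA for a full English Wikipedia ingest.
pages = 7_000_000
chunks_per_page = 5      # assumption, not measured; varies with chunking config
throughput = 13_500      # chunks/hr, measured on the 4B embedder

total_chunks = pages * chunks_per_page
hours = total_chunks / throughput
print(f"{hours:,.0f} hrs continuous (~{hours / 24:.0f} days)")
```

That's ~108 days if the GPU does nothing else 24/7; throttled to a few background-only hours a day, it stretches into the multi-year range above.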

Already tried:

Staying on the 7900 XT; the budget isn't there for hardware moves yet. Maybe eventually a 256GB Mac Studio, if they release one and prices aren't too absurd. Trying to figure out what's left on the table in software.

Questions:

  1. Anyone actually chewed through a full ZIM Wikipedia ingest on consumer hardware? Wall clock and embedder? I know there's pre-embedded Wikipedia sets on HF, but none of them carry the extraction layers my pipeline builds on top (claims, entities, contextual headers, provenance), so I'm stuck running it myself.
  2. Any reason not to run 0.6B on the tail 5M and 4B on the top 2M and just accept the quality tier?
  3. Anyone squeezing more out of a single 7900 XT for batch embedding than I am? Already on llama.cpp Vulkan, flash attention off, KV cache quant off (segfaults).
  4. Anyone pulled off multi-GPU on ROCm without losing their mind, or is CUDA genuinely the only tensor-split path right now?
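On question 2, the tiering itself is trivial to wire up; sketch below (the cutoff and model names are placeholders, not what I'm actually running):

```python
def pick_embedder(page_rank: int, hot_cutoff: int = 2_000_000) -> str:
    """Route high-priority pages to the 4B embedder, the long tail to 0.6B.

    page_rank is some importance ordering (e.g. pageview rank);
    hot_cutoff and both model names are placeholder assumptions.
    """
    if page_rank < hot_cutoff:
        return "embedder-4b"    # top 2M pages: slow, higher quality
    return "embedder-0.6b"      # tail 5M pages: fast, lower quality
```

The real cost isn't the routing, it's that the 0.6B and 4B models produce incompatible vector spaces, so you end up with two collections and have to embed the query with both models and merge results at retrieval time (or accept re-embedding the hot tier later).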