Am I going about this RAG Perplexity-on-crack Jarvis project the wrong way?
Posted by vick2djax@reddit | LocalLLaMA | View on Reddit | 4 comments
First real LLM project for me, probably same endgame as half the people here: personal Jarvis. But the reason I'm actually building it is bigger than that.
I'm a dad, and the more I mess with commercial LLMs the more worried I get that we're nearing the end of actually source-able information. Misinformation has been rough forever, but I already only really trust a small handful of outlets (AP, Reuters, a couple others), and the idea of some company baking their own agenda into the next model and deciding what counts as true for my kids does not sit right with me.
Started small. Daily digest that only pulls from sources I trust so I stop doom scrolling. Worked better than I expected.
Then I got ambitious. Extended it into a full RAG chatbot, basically Perplexity on crack but pulling only from a corpus I personally curated. Every answer cites back to what I put in, shows a confidence score, lists blind spots, and flags claims the corpus actually contradicts. 2M+ chunks across 14 collections and 67-ish download sources now, so it's real. Which is also why the scope problem is getting painful.
Rigs
Unraid box
- AMD RX 7900 XT 20GB
- MacBook Pro M3 Max 36GB, retired from the inference role. A 7900 XT was beating it on tok/s for every model I cared about. Unified memory sounds great until you realize the memory bandwidth isn't being used by the thing you want to run.
Stack
- Qdrant for vectors
- llama-swap + llama.cpp Vulkan on Unraid. Moved off Ollama after catching the same model pass 5/5 JSON extractions on llama.cpp while Ollama failed them. Backend mattered more than the model
- Interactive chat: qwen3.6 Q3_K_S, ~108 tok/s, 262K ctx
- Bulk extraction: qwen3.6 IQ3_XXS, ~112 tok/s. Different quants won different benchmarks so I route by content type. Swap is under a second
- Embeddings: Qwen3-Embedding-4B Q8, Matryoshka truncated to 1024d
- GTE modernbert reranker on CPU
- Claude Sonnet for the synthesis pass, Opus only for deep mode
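The Matryoshka truncation in the embedding step is just slicing the vector and re-normalizing so cosine similarity still behaves in the smaller space. A minimal sketch, assuming Qwen3-Embedding-4B's native 2560-d output (a random vector stands in for a real embedding here):

```python
import numpy as np

def matryoshka_truncate(vec: np.ndarray, dim: int = 1024) -> np.ndarray:
    """Keep the first `dim` components and re-normalize to unit length,
    so cosine distances in the truncated space remain meaningful."""
    v = vec[:dim]
    return v / np.linalg.norm(v)

# Fake full-size embedding standing in for the model's output.
full = np.random.default_rng(0).normal(size=2560)
small = matryoshka_truncate(full, 1024)
print(small.shape, round(float(np.linalg.norm(small)), 6))
```

Storing the 1024-d slice instead of the full vector cuts Qdrant's memory and disk footprint by ~60% per point, which matters at 2M+ chunks.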
Where I'm stuck
Measured production throughput: ~13,500 chunks/hr on the 4B embedder. For the full 7M English Wikipedia pages:
- Top 2M by pageview rank, dense ingest: ~8 months
- Tail 5M (~80M chunks): 22 to 36 months at an elastic duty cycle
So I'm staring down 2.5 to 3.5 years for full local Wikipedia. That's already assuming the tail runs background-only.
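For anyone sanity-checking those estimates, they follow directly from the measured throughput. A quick back-of-the-envelope (the duty-cycle percentages are my assumption, picked to reproduce the stated 22-36 month range):

```python
CHUNKS_PER_HOUR = 13_500          # measured production throughput, 4B embedder

# Tail: ~80M chunks, background-only
tail_chunks = 80_000_000
tail_days_continuous = tail_chunks / CHUNKS_PER_HOUR / 24
print(f"tail, continuous: {tail_days_continuous:.0f} days")   # ~247 days

# An "elastic duty cycle" in the 23-37% range stretches that to 22-36 months
for duty in (0.37, 0.23):
    print(f"tail at {duty:.0%} duty: {tail_days_continuous / duty / 30:.0f} months")
```

So the tail is only ~8 months of GPU time at 100% duty; the multi-year figure is almost entirely the background-only constraint, which is why the JIT ideas below the fold attack the problem from the demand side instead.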
Already tried:
- 0.6B embedder for the 2x bump. Got 1.91x raw. Quality dropped past my retrieval gate. Rejected
- Parallel batching (-np 2) on the 0.6B. Got 1.03 to 1.23x over the 4B pipeline. Below my pre-committed 1.4x floor. Rejected
- Vulkan has no multi-GPU tensor-split, so adding a second AMD card wouldn't give me a unified VRAM pool anyway
Staying on the 7900 XT; the budget isn't there for hardware moves yet. Maybe eventually I can get a 256GB Mac Studio if they release one and prices aren't too absurd. Trying to figure out what's left on the table in software.
Questions:
- Anyone actually chewed through a full ZIM Wikipedia ingest on consumer hardware? Wall clock and embedder? I know there's pre-embedded Wikipedia sets on HF, but none of them carry the extraction layers my pipeline builds on top (claims, entities, contextual headers, provenance), so I'm stuck running it myself.
- Any reason not to run 0.6B on the tail 5M and 4B on the top 2M and just accept the quality tier?
- Anyone squeezing more out of a single 7900 XT for batch embedding than I am? Already on llama.cpp Vulkan, flash attention off, KV cache quant off (segfaults)
- Anyone pulled off multi-GPU on ROCm without losing their mind, or is CUDA genuinely the only tensor-split path right now?
YakaaAaaAa@reddit
As a father building a sovereign local OS (Mnemosyne) to pass down uncorrupted architecture to my daughter, your motivation resonates on every level. You are fighting the exact right battle. But you are fighting it with the wrong compute strategy.
You are staring down a 3.5-year bottleneck because you are trying to preemptively extract and annotate the entire ocean. You need to shift from an "Ahead-of-Time" (AOT) ingestion pipeline to a "Just-In-Time" (JIT) extraction architecture.
Here is how you bypass the 3.5-year wait time this weekend:
1. Swallow the pride and use the HF pre-embedded sets (for routing only): grab the pre-embedded Wikipedia datasets on HuggingFace. Do not use them for your final context. Use them strictly as a dumb, fast semantic router.
2. The JIT extraction layer: when you query your Jarvis, let Qdrant hit that pre-embedded HF dataset to pull the Top-K raw text chunks. Only then, at retrieval time, do you pass those specific raw chunks through your Qwen IQ3_XXS extraction pipeline (claims, entities, provenance).
3. Once a chunk is dynamically extracted at runtime, save that output into your premium, highly curated 14-collection vector space.
Why this works: You are currently spending 20GB of VRAM and months of wall-clock time extracting provenance for 5 million Wikipedia articles on obscure 17th-century pottery that you and your kids will never ask about.
Regarding Question 2 (Mixing 0.6B and 4B): Don't do it in the same geometric space. If you mix embedding models in the same Qdrant collection, the cosine distances become mathematically meaningless. You'd have to maintain two parallel retrieval pipelines and normalize the scoring, which will destroy the latency you are trying to save.
Stop brute-forcing the ingestion. Build a lazy-loading architecture. Let your family's actual questions dictate what gets deeply extracted.
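The lazy-loading architecture above reduces to a read-through cache keyed by chunk id. A dependency-free sketch: `route()` stands in for the Top-K search against the pre-embedded HF set in Qdrant, `extract()` for the Qwen IQ3_XXS claims/entities/provenance pass, and the dict for the curated collection space — all placeholders for the real pipeline, this only shows the control flow:

```python
curated: dict[int, dict] = {}    # plays the role of the curated vector space

def route(query: str, k: int = 3) -> list[tuple[int, str]]:
    # Placeholder router: a real one would return Top-K (chunk_id, raw_text)
    # hits from the pre-embedded Wikipedia collection.
    corpus = {1: "chunk about A", 2: "chunk about B", 3: "chunk about C"}
    return list(corpus.items())[:k]

def extract(raw: str) -> dict:
    # Placeholder for the expensive extraction pipeline.
    return {"text": raw, "claims": [], "entities": [], "provenance": "wiki"}

def answer(query: str) -> list[dict]:
    out = []
    for chunk_id, raw in route(query):
        if chunk_id not in curated:        # pay for extraction at most once
            curated[chunk_id] = extract(raw)
        out.append(curated[chunk_id])
    return out

answer("first question")    # 3 extractions happen here
print(len(curated))         # -> 3
answer("similar question")  # cache hits, no new extraction
print(len(curated))         # -> 3
```

The extraction cost now scales with what the family actually asks, not with the size of Wikipedia.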
Keep building. It's the most important legacy code you'll ever write.
ItilityMSP@reddit
Sounds just like something Gemini would say. Regardless, it's a better strategy than OP's.
Imaginary-Unit-3267@reddit
Okay maybe I'm dumb but like. Why download Wikipedia? Why not just... have your agent read the actual web site? Do you expect Wikipedia to get yeeted at some point in the future or something?
CockBrother@reddit
Like another poster said - it is probably hopeless with those speeds. But I would at least benchmark vllm with as many parallel chunks as your GPU will allow.
Serially embedding 1-2 chunks is going to take a very very long time.
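The parallelism suggestion applies client-side too: pack chunks into batches and keep several requests in flight instead of embedding one or two at a time. A sketch with a stubbed embedder — against a real server, `embed_batch` would POST the list to an OpenAI-style /v1/embeddings endpoint (both llama-server and vLLM accept list-valued `input`); the batch size and worker count here are arbitrary:

```python
from concurrent.futures import ThreadPoolExecutor

def embed_batch(texts: list[str]) -> list[list[float]]:
    # Stub: returns one fake 4-d vector per input text. A real version
    # would call the embeddings endpoint with the whole list at once.
    return [[0.0] * 4 for _ in texts]

def chunked(seq: list[str], size: int):
    for i in range(0, len(seq), size):
        yield seq[i:i + size]

chunks = [f"chunk {i}" for i in range(100)]
with ThreadPoolExecutor(max_workers=4) as pool:   # 4 requests in flight
    results = list(pool.map(embed_batch, chunked(chunks, size=16)))

vectors = [v for batch in results for v in batch]
print(len(vectors))   # -> 100
```

Whether this helps depends on whether the GPU is actually saturated by a single stream; benchmarking it is cheap compared to a multi-year ingest.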