Has anyone here already built a "doomsday" or "off-grid" knowledge base? (ofc powered by an LLM)

Posted by Altruistic_Heat_9531@reddit | LocalLLaMA | View on Reddit | 13 comments

Basically, I’m really into the idea of a fully offline setup.
(Another way to say it: I’m a data hoarder.)

For LLMs, I’m using uncensored models from both Western labs (Gemma, GPT-OSS) and Eastern ones (GLM 4.7 Flash, Qwen 35B). For daily use, I stick to models in the 20–35B range, and when I need stronger reasoning, I switch to Qwen 3.5 120B.

Anyway:

  1. After looking around, Wikipedia (text-only, no media) is about 24 GB in English. I’m planning to include Indonesian (my country), Chinese, Russian, and Arabic as well, mainly to reduce bias. That would probably bring it to around 120 GB, I’d guess, for text-only data. For images, Google estimates around 4 TB (and I don’t know if that’s ALL of wiki or just English). I’m not planning to store videos. 4 TB is manageable using LTO for archival and HDD for day-to-day access.
  2. Planet.osm. This is basically a map of the entire Earth. For my setup, I only need major roads outside Indonesia, but full detail within Indonesia. Has anyone here tried unpacking the planet file without full detail? When I processed just my home island (Java), processing edges and vertices grew the data to around 30 GB, up from about 1.2 GB if I remember correctly.
  3. Any other suggestions for datasets or storage/setup optimizations? Especially from people who’ve already built similar offline systems?
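For the "Wikipedia + local LLM" part of a setup like this, the usual pattern is to index the text dump for full-text search and feed the top hits to the model as context. Below is a minimal sketch of the retrieval side using SQLite's built-in FTS5 extension (no external services, fits the offline constraint). It assumes the dump has already been parsed into (title, body) pairs, that your Python's SQLite build ships FTS5 (most do), and the function names here are just illustrative, not from any particular tool:

```python
import sqlite3

def build_index(articles, db_path=":memory:"):
    """Create an FTS5 full-text index over (title, body) pairs.

    For a real dump, point db_path at a file on the HDD so the
    index survives restarts and doesn't need to fit in RAM.
    """
    con = sqlite3.connect(db_path)
    con.execute("CREATE VIRTUAL TABLE IF NOT EXISTS wiki USING fts5(title, body)")
    con.executemany("INSERT INTO wiki (title, body) VALUES (?, ?)", articles)
    con.commit()
    return con

def search(con, query, k=3):
    """Return the k best-matching article titles, ranked by FTS5's BM25."""
    rows = con.execute(
        "SELECT title FROM wiki WHERE wiki MATCH ? ORDER BY rank LIMIT ?",
        (query, k),
    )
    return [r[0] for r in rows]

# Tiny stand-in for a parsed Wikipedia dump.
docs = [
    ("Water purification", "Boiling water for one minute kills most pathogens."),
    ("Solar power", "Photovoltaic panels convert sunlight into electricity."),
    ("Java (island)", "Java is the most populous island of Indonesia."),
]
con = build_index(docs)
print(search(con, "boiling pathogens"))  # → ['Water purification']
```

The retrieved bodies would then be pasted into the LLM prompt. FTS5 is keyword-only; swapping in local embeddings would improve recall but costs more disk and compute, so a BM25 pass is a reasonable first layer for a fully offline box.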