Anyone actually using a local LLM as their daily knowledge base? Not for coding, for life stuff. What's your setup?
Posted by InformationSweet808@reddit | LocalLLaMA | View on Reddit | 61 comments
So I've been going down a rabbit hole lately and I can't find many people actually talking about this specific use case.
Everyone here runs local LLMs for coding, chat, maybe some creative writing. Cool. But what about using it as a proper personal knowledge base? Like, dump your own notes, PDFs, and random docs into it and actually query your own life privately, every day.
I tried looking into this seriously and hit a wall. Most resources either assume you're a developer building something, or they're 2 years old and recommend tools that have completely changed since.
So genuinely asking, is anyone here actually doing this day to day? Not as an experiment, but as a real workflow?
Things I keep running into that I can't figure out:
- What model are you running for this? RAG on consumer hardware seems finicky depending on quant
- Do you actually trust the retrieval or do you double check everything because hallucinations?
- LlamaIndex vs Ollama vs whatever else: has anything actually made this less painful recently?
- Context length, how do you handle it when your personal docs start piling up?
Not looking for a tutorial or a GitHub repo. Just want to hear from someone who's made this work without it becoming a part time job to maintain.
Bouros@reddit
I play an MMORPG that doesn't allow you to copy the chat.
The majority of players I communicate with are Spanish.
I made an app so I can hold my middle mouse button and speak, and it translates what I say to Spanish and sends it to my clipboard to paste into the game (I'd post it straight into the game, but it uses an anticheat I'm wary of).
I also selected the area of the chat box on my monitor, and when I hit a hotkey on my keyboard it takes a screenshot of that area and sends it to the AI to translate. It displays on the app, which I have on my second monitor, and it can also use TTS to read it out.
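If it helps anyone picture it, the screenshot-to-translation step is roughly this shape (a trimmed-down sketch, not my exact code; the chat-box coordinates, endpoint, and model name are placeholders for whatever local vision model you serve):

```python
import base64
import mss, mss.tools
from openai import OpenAI

# Placeholder region of the in-game chat box (pixels)
CHAT_REGION = {"top": 800, "left": 40, "width": 600, "height": 220}

# Any OpenAI-compatible local server works here (llama.cpp, LM Studio, etc.)
client = OpenAI(base_url="http://localhost:8080/v1", api_key="local")

def translate_chat_region() -> str:
    with mss.mss() as sct:
        shot = sct.grab(CHAT_REGION)                 # screenshot just that area
        png = mss.tools.to_png(shot.rgb, shot.size)  # raw PNG bytes
    b64 = base64.b64encode(png).decode()
    resp = client.chat.completions.create(
        model="local-vision-model",  # placeholder model name
        messages=[{
            "role": "user",
            "content": [
                {"type": "text", "text": "Translate the game chat in this image to English."},
                {"type": "image_url", "image_url": {"url": f"data:image/png;base64,{b64}"}},
            ],
        }],
    )
    return resp.choices[0].message.content

print(translate_chat_region())
```

The rest is just binding that function to a hotkey and piping the result into TTS.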
And for discord messages I love this feature whenever I copy non English text to my clipboard it translates it to English, and tts it to me.
I love it so much, and it lets me easily communicate with a group of friends that I probably wouldn't have kept up with otherwise.
I know I could use OCR for the images but I have never had good luck with OCR in my life and ai just works magic at vision.
After using the translator for a few weeks I added the feature to just hold a key to speak and have it sent to my clipboard. It works so well and is so convenient when gaming as I can keep my actions up in game.
I remember using speech recognition in the early 2000s and it was SO BAD! I haven't noticed a single error in the speech-to-text using Whisper.
Currently learning to set up a Hermes agent. I manage a local business and have the staff fill out sheets while they are new, recording when they start and finish each task. Once my program is done I'll scan the sheets, and the AI will pull all the text out, create tasks in a database, and track all information related to each task. Then I'll be able to have the AI generate summaries based on the data provided.
Juanisweird@reddit
Which MMO?
MendozaHolmes@reddit
Tibia MMO?
Kahvana@reddit
Friend of mine had good experience with lightrag, might be for you:
https://github.com/hkuds/lightrag
Haven't used it myself however.
Personally I use SillyTavern + Server/Client MCP extensions for MCP support.
I'm sure most users here have a very different setup, but this worked for me over the year.
MundanePercentage674@reddit
Built one myself with an n8n workflow + Telegram as the chat interface. The use case is mostly a todo/task manager with 3 memory layers: short-term chat history, long-term fact memory (with a weekly loop to remove unnecessary or unimportant things), and RAG as permanent memory. The workflow is extendable if I want to add a new use case.
_raydeStar@reddit
I've got a personal project. It's got a wiki, memory, or you can auth it to use a folder on your machine.
The wiki is basically a canvas + wiki. I built it for storytelling, notes, etc. Instead of memory, I just do intelligent searches, etc.
So far it works really well. I haven't load tested it yet though (ie, 100+ files)
Howie33@reddit
Hi, I use a tree index database where I have a directory called “collections”. Inside there I have various topics like “medical research”, “finances”, “photovoltaic”, “air traffic”, etc. I index all the documents weekly, then use a Flask web server to access the data via Safari, either locally (on the machine) or over Tailscale if I’m at work. I have a collection toggle bar at the top of the web page to filter which collection(s) I am searching. Some of my collections are marked private so they do not appear via the Flask server.
The search results are numerically scored via keywords. When I click on one of the results, it opens the actual page of the document so I can read that page/document. I use an LLM in 2 places: first as a query translator; if requested, it will take my search query and reinterpret it into a search term. Second, I use an LLM in my indexer script. I try to use an LLM in very restricted roles due to potential hallucinations. My motto is: try to never use an LLM in a deterministic role.
My tree index turned into a pretty flat tree since it only goes 1 level deep. The LLM I use is Qwen 2.5 14B for translation and indexing. I treat daily notes differently; those I index nightly via a launchd script.
Edit: my apologies for the vague answer. I wanted to give a general overview without getting into the nitty gritty. Each of my topics has its own directory. Inside that directory I have a “books” directory (my source documents go here) and an “index” directory (indexed files go here). The indexer checks whether any book documents do not have a corresponding index document. If so, it runs the indexer on those un-indexed documents.
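The completeness check itself is nothing fancy, roughly this shape (a simplified sketch; the real script then runs the LLM-assisted indexer on each un-indexed file, and the paths here are placeholders):

```python
from pathlib import Path

COLLECTION = Path("collections/medical_research")  # placeholder topic directory
BOOKS = COLLECTION / "books"    # source documents live here
INDEX = COLLECTION / "index"    # one index file per source document

def unindexed_documents():
    """Return source docs that don't yet have a matching index file."""
    indexed = {p.stem for p in INDEX.glob("*")}
    return [doc for doc in BOOKS.iterdir()
            if doc.is_file() and doc.stem not in indexed]

for doc in unindexed_documents():
    print(f"needs indexing: {doc.name}")
    # run_indexer(doc)  # placeholder for the LLM-assisted indexing step
```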
pkief@reddit
Google AI Edge Gallery on Android - Gemma 4 E2B and E4B run nicely on my Pixel. The knowledge is quite good, though of course not as strong as the hosted LLMs, depending on what you're asking.
InformationSweet808@reddit (OP)
Running it on a Pixel is wild, didn't even consider mobile. How's the speed on device?
AppTB@reddit
Yeah, I tested this last night with a different toolkit: Off grid AI iOS app -> MLX and an LM Studio endpoint. This allows inference on my Mac Studio through Tailscale, and I'll test a gateway later which can invoke Studio tools.
relmny@reddit
Does it still not keep past chats, or did they fix that?
MrHumanist@reddit
What's your RAM? And how did you fit E4B?
BitterProfessional7p@reddit
Yup, I have all my personal notes in local .md files from Logseq (similar to Obsidian) and my OpenClaw can read any of it agentically, not through RAG. From the notes it created a personal profile of me which is in its permanent memory.
I use it as a personal assistant to register my habits, calorie counting, registering and consulting knowledge (I have notes for books, videogames, music, movies, TV shows, gifts to people, travel, food, restaurants...), editing my grocery list and more. I interact mainly via Signal, but I made a dashboard for my habits and I always can read the notes with Logseq for the rest.
Running Qwen3.6-27b-q4 on my dual RTX 3060 machine ($700) with llama.cpp, text generation at 15-18 tk/s, which is not super fast but usable. Context is not super long (80k), but I like to /reset the context frequently so it is not a problem for me.
Overall it took one afternoon to set up. Never touched the configurations in a few weeks, just using it.
Dazzling_Equipment_9@reddit
On the topic of building a personal knowledge base, here’s my approach:
Hermes agent + Qwen 3.6 35B A3B + Obsidian.
I don’t use any complicated RAG setups — at this stage, they feel more flashy than practical.
Building a knowledge base and using RAG are not as tightly linked as people think. RAG is merely one possible implementation method, not the only or necessary path. I simply call my Obsidian notes a knowledge base, and it works very well for me. It’s more than sufficient for my needs.
As for those frequent questions about everyday use cases for local LLMs, I have to vent a bit — please don’t take it personally. I see almost identical posts every day. Instead of asking the same questions again, why not first search for existing threads? The answers are already there, and reading a few would quickly give a clear picture. Most practical use cases don’t change dramatically, at least in the short term.
I’m also not entirely sure about the real motivation behind these posts. Are people genuinely unsure what to do with a local LLM, or are they probing for something else? The intent often feels unclear.
If the goal is learning, you can simply ask an AI directly — it can give you a comprehensive list. If you don’t actually have a real use case, there’s no need to force one. Doing so often leads to frustration and fatigue rather than enjoyment. Believe me.
It’s much more effective to ask specific, well-defined questions with clear context. Overly broad or vague topics rarely yield useful answers. To make it easier for others to respond thoughtfully, posters should provide sufficient background and state their questions clearly and concretely.
MarcusAurelius68@reddit
One other point - things are changing very quickly as well. A recommendation from 6 months ago might be outdated due to new solutions, models and approaches.
Not an excuse for open-ended questions - those can easily be asked of frontier AI as a starting point.
AppTB@reddit
I’m over here chasing my config from July for obsidian as a coordination substrate with smart connections like overlap chunking.
mouseofcatofschrodi@reddit
why hermes and not pi.dev?
Evanisnotmyname@reddit
This is the way, like Karpathy’s LLMwiki.
I’ve been having a lot of trouble setting up Hermes with Onsidian, Qwen, and some kind of GUI/TUI. Can you give me details on your setup, MCPs, etc?
yes2matt@reddit
not OP, but the model used makes a giant difference in Hermes, and I think temperature. My current happy place is :
Dazzling_Equipment_9@reddit
From your description, I can’t quite determine the specific issue you’re facing. I’ll assume you’re unsure how to install and integrate these tools, so here’s a straightforward approach:
These steps are quite simple and widely applicable — any AI could easily find them. That said, I want to be clear: I’m trying to help solve your problem, but since you didn’t describe the exact difficulty you’re encountering, I can only provide a general guide. If you want more detailed or personalized instructions, feel free to follow up with more specific questions.
Also, AI can handle a lot of this for you. For example, you can connect OpenCode to your local model and let it manage system operations, including deploying Hermes + llama-server + Obsidian.
Ultimately, getting satisfactory answers — whether from humans or AI — depends on asking clear, well-defined questions. Do you agree?
Ritofix@reddit
Holy bot
Dazzling_Equipment_9@reddit
Hahaha, I did it...
yes2matt@reddit
I haven't figured out how to use RAG effectively yet. I do have focused research I want to mine (beehive audio analysis), but asking questions via chat gets answers that are almost entirely generalized from the model. Some reference will be made to the papers, depending on the model used. I need a better way too.
InformationSweet808@reddit (OP)
Fair point on the edit lol, appreciate you actually going back to clarify.
The Obsidian + Hermes setup is something I hadn't really considered tbh. I always assumed you needed RAG the moment your notes got big enough to query. So you're basically just letting the agent navigate the vault directly? No retrieval pipeline at all?
Asking because if that actually works well at scale that's way simpler than what I was planning to build.
Dazzling_Equipment_9@reddit
I believe Obsidian’s built-in search is already more than sufficient for most personal knowledge base needs.
The reason is that the LLM can intelligently craft multi-term fuzzy semantic searches. For example, if you ask it to find a note you vaguely remember about local LLM deployment, it might generate something like:
obsidian-cli search "llama | gguf | vllm | local"
(In reality, it would likely create an even more refined and comprehensive query.) It then reads the relevant notes, extracts the information, and answers based directly on your original content. This keeps the source completely faithful.
If it doesn’t find anything, the model can automatically broaden the search by adding more keywords — such as “q4_k_m | qwen | huggingface” — and try again.
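If you want to picture the loop the agent runs, it's roughly this (a hand-wavy sketch only; in practice Hermes picks the keywords and decides when to stop, and the obsidian-cli call just mirrors the example above):

```python
import subprocess

def search_vault(terms: list[str]) -> str:
    """Fuzzy multi-term search over the vault, mirroring the example above."""
    query = " | ".join(terms)
    result = subprocess.run(["obsidian-cli", "search", query],
                            capture_output=True, text=True)
    return result.stdout.strip()

# First pass with the obvious keywords, then broaden if nothing comes back
terms = ["llama", "gguf", "vllm", "local"]
hits = search_vault(terms)
if not hits:
    hits = search_vault(terms + ["q4_k_m", "qwen", "huggingface"])
print(hits or "nothing found, let the model propose different keywords")
```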
After this explanation, if you still feel RAG is necessary in a personal knowledge base scenario, I’d be interested to hear your specific understanding and requirements for what counts as a “large-scale knowledge base.” I can compare it with my own setup and see if there are any useful new ideas worth considering.
Honestly, I’d like to hear more about your actual detailed use cases rather than the relatively vague term “large-scale.”
mouseofcatofschrodi@reddit
Have you checked AnythingLLM? It has RAG already implemented, so it would be the fastest way, I guess. It also has a very cool function for recording meetings, transcribing them, getting the summary, and chatting with the transcript as knowledge. This app was the first thing where I started using local LLMs for something "useful" besides just playing around (and that has improved a lot since qwen3.6 35B + pi.dev + omlx, a super combination for getting agentic work done).
tbh I'm also thinking a lot about how to build something like this for personal and company knowledge. Probably also with Obsidian, or maybe just markdown files with good tags within structured folders and an automatically generated index (with a little Python).
Special_Permit_5546@reddit
For personal knowledge base use, I would separate two problems that often get mixed together:
finding the right source material
letting the model modify or synthesize from it
For (1), I have had better luck with boring file/search tools over pure vector RAG, especially for Markdown notes. Heading-aware chunks, filename/title context, and plain keyword search matter a lot because personal notes are full of weird proper nouns, half-phrases, project names, and short dense entries. Dense retrieval alone can feel magical until it misses the exact note you know exists.
For (2), I would not let the model silently rewrite the knowledge base. Read/search/summarize is low risk. Creating a draft note is usually fine. Editing existing notes should be treated like code: show a diff, accept/reject, keep the raw files inspectable.
The setup I trust most is something like:
- plain Markdown folder as source of truth
- grep/BM25 first, embeddings second if needed
- citations that point to actual filenames/headings
- separate daily journals from reference/project notes
- no silent mutation of source-of-truth notes
Small disclosure because this is exactly the product shape I am working on: I am building an open-source local-first Markdown app called Kuku around the "AI can search/read/create/edit notes, but edits are reviewable diffs" model. So I am biased. But independent of the app, I think the key is not "RAG vs no RAG". It is whether you can inspect what the assistant used and review what it wants to change.
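To make the "edits are reviewable diffs" point concrete, the core of it is just a unified-diff gate before anything touches disk (a sketch of the pattern, not Kuku's actual code; the path is made up):

```python
import difflib
from pathlib import Path

def propose_edit(path: Path, new_text: str) -> None:
    """Show a diff of the model's proposed edit and only write it if accepted."""
    old_text = path.read_text()
    diff = difflib.unified_diff(
        old_text.splitlines(keepends=True),
        new_text.splitlines(keepends=True),
        fromfile=f"{path.name} (current)",
        tofile=f"{path.name} (proposed)",
    )
    print("".join(diff))
    if input("apply this edit? [y/N] ").lower() == "y":
        path.write_text(new_text)  # source of truth only changes on explicit accept

# propose_edit(Path("notes/projects/solar.md"), revised_note_from_model)
```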
FormalAd7367@reddit
Just curious - does anyone have experience with one 3090, using a Qwen 3.5 distilled model to do the coding and a cloud model to debug or test it? I can write the architecture with an LLM no problem. Is it possible? Just trying to save $.
Scared_Bedroom_8367@reddit
Low parameter models are hallucination machines
Otherwise_Economy576@reddit
doing this for about 8 months daily, here's the unvarnished version.
setup: 36gb M3 Max, qwen3 32b for the answering model, bge-m3 for embeddings, obsidian vault as the source of truth, postgres+pgvector for the index because i didn't want to babysit chroma or a faiss file. ollama for serving, no llamaindex, hand-rolled retrieval in maybe 300 lines of python. boring is good.
the stuff that actually matters more than model choice:
chunking is everything. 90% of bad retrieval is bad chunks. for personal notes i chunk by markdown heading (not fixed token windows) and prepend the doc title + parent headings to each chunk before embedding. recall went up massively when i started prepending context. fixed-size 512-token chunks of personal notes give terrible results because notes are short and dense.
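roughly what the chunker does, if it helps (a stripped-down sketch of the idea, not my exact code; it only handles plain markdown headings):

```python
import re

def chunk_markdown(text: str, title: str) -> list[str]:
    """Split a note on headings, prepending title + parent headings to each chunk."""
    chunks, stack, buf = [], [], []

    def flush():
        body = "\n".join(buf).strip()
        if body:
            context = " > ".join([title] + [h for _, h in stack])
            chunks.append(f"{context}\n{body}")
        buf.clear()

    for line in text.splitlines():
        m = re.match(r"^(#{1,6})\s+(.*)", line)
        if m:
            flush()  # the buffered lines belong to the previous heading
            level, heading = len(m.group(1)), m.group(2).strip()
            while stack and stack[-1][0] >= level:
                stack.pop()
            stack.append((level, heading))
        else:
            buf.append(line)
    flush()
    return chunks
```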
hybrid retrieval. dense alone misses anything with proper nouns or rare terms. i run bm25 over the same corpus and rrf-fuse the top 20 from each. takes an extra 50ms and fixes the "i KNOW i wrote about this person, why isn't it surfacing" problem.
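the rrf part is tiny, you don't really need a library for it (a sketch; k=60 is the usual constant):

```python
def rrf_fuse(bm25_ranked: list[str], dense_ranked: list[str], k: int = 60) -> list[str]:
    """Reciprocal rank fusion: score = sum over lists of 1 / (k + rank)."""
    scores: dict[str, float] = {}
    for ranked in (bm25_ranked, dense_ranked):
        for rank, chunk_id in enumerate(ranked, start=1):
            scores[chunk_id] = scores.get(chunk_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

# top 20 ids from each retriever in, fused ranking out
# fused = rrf_fuse(bm25_top20, dense_top20)[:6]
```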
answers must cite. the LLM never just answers, it has to quote which chunks and the source filenames. when i see no citations or a citation that doesn't actually contain the claim, i know it hallucinated. this is the only mechanism that makes me trust the output without re-reading every doc.
context length is a non-problem if your retrieval is good. you do not need 200k context. you need to put the right 6 chunks in 8k context. people scale context to mask bad retrieval.
maintenance: i rebuild the index nightly via a cron because obsidian writes faster than i can be bothered to do incremental updates. takes 4 minutes for ~3000 notes. not a part time job, more like "i forget it exists" until i upgrade hardware.
the one thing that bit me hard: don't include daily journal entries in the same index as reference notes. retrieval will keep surfacing emotional sentence fragments when you ask factual questions. separate indexes per content type, route at query time.
Public_Umpire_1099@reddit
This is from a work project, but I developed an app that queries a RAG for equipment-related documentation. When committing something to the RAG, first an entry is made in a SQL database with a key that gets prefixed on the RAG chunks for that document; then, after the RAG upload, it automatically uploads the file to file storage. In the system prompt for the LLM, I force it to write inline the exact sentence it is citing, which gets hidden by the UI. Using that exact sentence, I was able to make citations clickable, and then the PDF viewer immediately pulls that document up and automatically ctrl-Fs for that sentence. It works about 90% of the time. At the end of the day it still pulls up the correct document; the only issue is that sometimes the LLM paraphrases, so it doesn't find a perfect match. This was built in Nest. Not sure if it would be useful for you, but figured I'd share anyway.
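The clickable-citation part boils down to an exact-match-then-fuzzy-match lookup against the extracted document text (a rough sketch of the idea in Python rather than the actual Nest code; the fuzzy fallback is exactly what catches the paraphrasing cases):

```python
import difflib
import re

def locate_citation(cited: str, page_texts: list[str]) -> tuple[int, float]:
    """Find which page holds the sentence the LLM quoted.

    Exact substring match first, then best fuzzy match against the page's
    sentences so light paraphrasing still lands on the right spot.
    """
    needle = " ".join(cited.lower().split())
    best_page, best_score = -1, 0.0
    for i, page in enumerate(page_texts):
        haystack = " ".join(page.lower().split())
        if needle in haystack:
            return i, 1.0
        for sentence in re.split(r"(?<=[.!?])\s+", haystack):
            score = difflib.SequenceMatcher(None, needle, sentence).ratio()
            if score > best_score:
                best_page, best_score = i, score
    return best_page, best_score

# page, confidence = locate_citation(model_quote, pdf_pages)
```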
InformationSweet808@reddit (OP)
Okay, this is the comment I was hoping someone would leave when I posted this.
The chunking point hit hard, I had no idea fixed token windows were that bad for personal notes specifically, makes total sense now that you say it. The separate indexes for journal vs reference notes is something I would've 100% screwed up on my own.
One thing I'm still wrapping my head around is the hybrid retrieval part. So you're running both dense and bm25 on the same corpus and then fusing the results? Did you hand-roll the rrf part like in that sketch, or is there a library that handles it cleanly?
Either way this whole comment should be pinned somewhere.
achiya-automation@reddit
Yeah, doing this for about 8 months now, not as an experiment.
Setup is boring on purpose: Ollama running qwen2.5:14b on a 32GB M1 Mac, plus paperless-ngx for everything PDF, plus a flat folder of markdown notes. Open WebUI on top with RAG pointed at both. That's it.
What actually made it work day-to-day was lowering my expectations on retrieval. I treat it like a smart grep, not a brain. If I ask "what did I write about that vendor in march" it pulls the right chunks ~80% of the time. If I ask anything inferential ("summarize my opinions on X") it confidently fabricates, every time. So I never ask inferential questions on personal data anymore, only locate-and-quote.
Re: chunking and hallucinations - smaller chunks (300 tokens) with 50 overlap, and I always show sources in the UI. If the source quote doesn't actually contain what the model said, I assume it lied. Saves me from acting on bad recall.
Hardware-wise the 14b at q4 is fine for retrieval. I tried 32b and the latency made me stop using it, which means the small model wins by default.
Honest gotcha: maintenance isn't zero. Re-indexing when I dump a batch of new docs takes ~10 min, and Ollama updates have broken my docker stack twice. Worth it for me because I trust the data isn't leaving the box, but I wouldn't recommend it to anyone who just wants "Notion but local".
Memoishi@reddit
Claude code (but you can use any) wired to my llama.cpp server (again host with whatever).
Hardware is modest 32 DDR5 and 16 VRAM (RTX 5080).
I'm using Obsidian (optional here but the data view is so satisfying lol) + Qwen3.5-9b + LLM wiki pattern.
I install this shit in all my projects, nothing flashy, nothing extraordinary, but very clean and like 10 mins of setup once you understand it. I slap my .md-converted files into a raw folder, it ingests them and then just improves/cleans/fixes whatever.
Results: it builds a good knowledge wiki and it can easily retrieve and help you with whatever you're supposed to do with those files.
For example, I've got this project fine-tuning LLMs for coding, but since the dataset is getting bigger and bigger I need an easy retrieval that will tell me if I've already written a piece of code. It's very good in my case because the worst that can happen is the LLM saying "you don't have this" and I just do it twice, which is not catastrophic, only time-wasting.
Compared to classic RAG this one is dumber and worse at scaling, but if we're talking about handling 300-500 files, it's not impossible to get value out of it. I can help you set up something if you're interested, just ask or DM!
InformationSweet808@reddit (OP)
One thing I'm curious about though: where does it actually start falling apart for you? Like, is it a retrieval accuracy thing past a certain number of files, or does it just get slow?
Memoishi@reddit
Honestly I'm having trouble answering this, because so far the projects I've touched with this were already pretty well organised and all light work. The pattern is specifically designed for that, so there's that.
One of the known issues is with versioning though: if you have, let's say, function A version 1.0.0 and function A version 1.1.0, it might not catch (unless you point it out) that 1.0.0 is outdated and it should follow 1.1.0. You ask about function A and it retrieves the 1.0.0 properties, even though one of them changed in v1.1.0.
Same goes for files that follow this logic: if you have something that overrides concepts defined elsewhere, it might not understand that at all.
This approach is all about garbage in, garbage out, more than ever. I would say maintaining it is mandatory, but that's true for any LLM in any given task, be it RAG or simple dumb queries to an LLM.
Clean files, clean structures. I've read that people made it work with around 1-2k files, but then again I've only gone as far as 300 files and had no issues at all.
I would die for a big dataset and a use case; with these things the datasets are always the bottleneck.
StupidityCanFly@reddit
I don’t trust LLMs, so they always have to verify their facts. Aside from that, I’m using Qwen3.6-27B as my daily driver.
p_235615@reddit
I often use qwen3.6 35B with web search in Open WebUI, sometimes also via voice.
croholdr@reddit
For me, I go in 'sprints' where I talk to my LM Studio models a few hours daily for a week. I stick to (mostly) what LM Studio suggests (q4) and various tweaks to increase context length, keeping 'vision' tasks separate from the pure 'questions.'
Sometimes I'll spend a bit of time seeing if I can figure out good prompts to help keep context length under control.
When the context window fills up it's very noticeable, and I'll usually turn the workstation off, touch grass, re-question the mysteries of faith, and start the process over the next month.
Dany0@reddit
What are you even doing that requires HOURS of talking to an LLM?
m02ph3u5@reddit
Probably unifying quantum and gravity. That's the only explanation I have.
Dany0@reddit
He's gonna succeed, I believe he will. Probably by burning enough tokens to let the universe collapse in on itself. Finally everything will be unified
InformationSweet808@reddit (OP)
The "sprints" approach is actually interesting, never thought about batching it that way instead of keeping it always on. Do you find the q4 quality holds up well when you're doing longer sessions?
CatTwoYes@reddit
I tried both RAG and the simpler "give the LLM a grep tool + markdown folder" approach. For under ~1000 personal notes, the grep approach wins hands-down. RAG embeddings for personal docs are finicky — you spend more time debugging why the right chunk didn't get retrieved than actually using the thing. The tool-calling + file search pattern is dumber but more predictable, and with Qwen 3.6 27B the quality is good enough that I stopped maintaining the RAG pipeline entirely.
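For anyone wondering what "give the LLM a grep tool" means concretely, the whole tool is basically this (a sketch; the vault path is a placeholder and how you expose it as a tool depends on your serving stack):

```python
import re
from pathlib import Path

NOTES_DIR = Path("~/notes").expanduser()  # placeholder vault location

def grep_notes(pattern: str, max_hits: int = 20) -> list[dict]:
    """Case-insensitive regex search over markdown notes: file, line number, text."""
    rx = re.compile(pattern, re.IGNORECASE)
    hits = []
    for md in NOTES_DIR.rglob("*.md"):
        for lineno, line in enumerate(md.read_text(errors="ignore").splitlines(), 1):
            if rx.search(line):
                hits.append({"file": str(md.relative_to(NOTES_DIR)),
                             "line": lineno, "text": line.strip()})
                if len(hits) >= max_hits:
                    return hits
    return hits

# Registered as a tool; the model picks the pattern and reads the hits back.
```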
MainEnAcier@reddit
At the moment, to me it's too much complexity for too little gain.
I have another philosophy:
I store data massively (insurance, phone contract, CV data, etc.) in structured sheets.
When a good option comes out, all the data will be ready.
Unfortunately I still don't understand exactly how Hermes/OpenClaw properly work. But I'm sure one day we will have some plug-and-play system, and we won't need so many manual steps to make it all work.
xupetas@reddit
Yes. Open WebUI, RAG and ChromaDB. Guardrails up the wazoo.
remarkedcpu@reddit
Genuinely wondering how is one’s daily life so important that everything has to be written down. I get it that the YC founder needed this, but I don’t.
onlythehighlight@reddit
128GB M3 Max using
vLLM -> to set up server for Gemma 4
Obsidian -> for KnowledgeDB
AnythingLLM -> To use RAG
It's been pretty good to just use my own dataset and maintain my own copy of records.
Funny_Working_7490@reddit
Like, how do you use it btw? Show me some example use cases. Which Gemma variant?
Rooneybuk@reddit
Yes my stack is
Ingestion through an API into n8n, then Postgres and Qdrant agent tools, with qwen3.6-35b-a3b q4_k_xl on 2 x 4060 Ti (~32GB VRAM total).
My inference setup is here https://d3v0ps.cloud/posts/2026/05/my-local-llm-setup-one-model-many-personalities/
I haven’t yet documented the client-side, such as n8n.
Zeeplankton@reddit
You can totally do half of this now, super easily. Use something like OpenCode, run something like Qwen via LM Studio, and point it at your Obsidian .md folder. It can absolutely search through it, create files, find connections, etc. I use Codex for work stuff this way (generating work md files), but for private stuff I'm sure a local model would work.
Thoughts:
- RAG is cool in concept but personally bad in reality. Creating embeddings is its own challenge locally (how long will it take to embed 10k notes on local hardware?), and storing that in a db, then querying, is just not elegant. Any time you add or change files, you have to figure out how to re-embed those specific files (see the sketch after these bullets).
- Tool calling and just grepping around is probably close enough
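On the re-embedding point in the first bullet, a content-hash check is usually enough to only re-embed files that actually changed (a sketch, assuming you keep a small JSON of hashes next to the index; the embed call itself is a placeholder):

```python
import hashlib
import json
from pathlib import Path

NOTES = Path("~/obsidian-vault").expanduser()  # placeholder vault path
HASHES = Path("embed_hashes.json")             # hash of each file at last run

def changed_notes() -> list[Path]:
    """Return only the markdown files whose content changed since last run."""
    old = json.loads(HASHES.read_text()) if HASHES.exists() else {}
    new, todo = {}, []
    for md in NOTES.rglob("*.md"):
        digest = hashlib.sha256(md.read_bytes()).hexdigest()
        new[str(md)] = digest
        if old.get(str(md)) != digest:
            todo.append(md)
    HASHES.write_text(json.dumps(new, indent=2))
    return todo

for note in changed_notes():
    pass  # embed_and_upsert(note) -- placeholder for your embedding + DB upsert
```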
Ideal state: CoT knowledge graph stuff is what dozens of companies are working on now, trying to solve the memory problem of LLMs. Realistically none of them are privacy-focused or easy to set up, but I'm sure if you wanted to you could find and create your own system.
kaliku@reddit
RAG is not needed for notes because of the realistically low volume. But I'd like to challenge the notion that it would take a long time. Embedding models are very small, and for text notes without OCR it's not slow at all. For large PDF files like books with tables etc., yes, it can be slow. I've tested RAGFlow and it looks pretty good. It comes with everything, including an MCP server. My tests so far were only about enhancing qwen3.6's knowledge from technical books and comparing outputs with and without RAG. So far the RAG version wins hands down.
Amazing_Athlete_2265@reddit
I have big plans for a personal assistant, but little time.
admajic@reddit
LM Studio is the easiest to start with. Load a model, drop in your file, and you can ask it about the file.
Evanisnotmyname@reddit
LMstudio’s security is sketchy as fuck from what I’ve been seeing lately
MarcusAurelius68@reddit
How so? Would like to know more about this.
InformationSweet808@reddit (OP)
That's actually one of my concerns too. What specifically have you seen? Is it the app itself or more about the models it pulls?
Etroarl55@reddit
I live in Canada. Hardware prices are extremely high, internet speeds slow.
I would definitely be incentivized to experiment with using it as a daily knowledge base if I could run newer 2026 models and have a fast enough internet speed to allow it to browse freely.
Some-Cauliflower4902@reddit
Not that I have to query my own life too much, though I have too many hobbies and need some tracking for those. Assuming you don't need anything too precise like financials: I section things so it's not a big mess. Every hobby has its own project + memory + folder. RAG covers background context; for anything specific the LLM goes and searches the folder itself. I also have cross-encoder reranking for a larger file base. As for trust issues... it's your stuff, you should have a rough idea already, so don't rely 100% on the LLM to tell you. Context length is not a problem because if it's a large doc it searches the relevant sections instead of reading my 300k-word novel. Any LLM that can reliably tool call is fine. llama.cpp for speed. It's yet another hobby of mine, so I don't call it a part-time job, but there are always new things I look to add.
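The cross-encoder rerank step is only a few lines if anyone is curious (a sketch with sentence-transformers; the model name is the usual small MS MARCO reranker, swap in whatever you like):

```python
from sentence_transformers import CrossEncoder

# Small reranker that scores (query, passage) pairs directly
reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

def rerank(query: str, passages: list[str], top_k: int = 5) -> list[str]:
    """Re-order retrieved passages by cross-encoder relevance to the query."""
    scores = reranker.predict([(query, p) for p in passages])
    ranked = sorted(zip(passages, scores), key=lambda x: x[1], reverse=True)
    return [p for p, _ in ranked[:top_k]]
```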
InformationSweet808@reddit (OP)
For context, I'm looking at this for personal use, not building a product. Just want something that works reliably on a normal machine.