Sharing my unorthodox home setup, and how I use local LLMs
Posted by SomeOddCodeGuy@reddit | LocalLLaMA | 32 comments
So for the past year and a half+ I've been tinkering with, planning out and updating my home setup, and figured that with 2025 here, I'd join in on sharing where it's at. It's an expensive little home lab, though nothing nearly as fancy or cool as what other folks have.
tl;dr- I have 2 "assistants" (1 large and 1 small, with each assistant made up of between 4-7 models working together), and a development machine/assistant. The dev box simulates the smaller assistant for dev purposes. Each assistant has offline wiki access, vision capability, and I use them for all my hobby work/random stuff.
The Hardware
The hardware is a mix of stuff I already had, or stuff I bought for LLM tinkering. I'm a software dev and tinkering with stuff is one of my main hobbies, so I threw a fair bit of money at it.
- Refurb M2 Ultra Mac Studio w/1 TB internal drive + USB C 2TB drive
- Refurb M2 Max Macbook Pro 96GB
- Refurb M2 Mac Mini base model
- Windows 10 Desktop w/ RTX 4090
Total Hardware Pricing: ~$5,500 for studio refurbished + ~$3000 for Macbook Pro refurbished + ~$500 Mac Mini refurbished (already owned) + ~$2000 Windows desktop (already owned) == $10,500 in total hardware
The Software
- I do most of my inference using KoboldCPP
- I do vision inference through Ollama and my dev box uses Ollama
- I run all inference through WilmerAI, which handles all the workflows and domain routing. This lets me use as many models as I want to power the assistants, and also set up workflows for coding windows, use the offline wiki api, etc.
- For zero-shots, simple dev questions and other quick hits, I use Open WebUI as my front end. Otherwise I use SillyTavern for more involved programming tasks and for my assistants.
- All of the gaming quality of life features in ST double over very nicely for assistant work and programming lol
The Setup
The Mac Mini acts as one of three WilmerAI "cores"; the mini is the Wilmer home core, and also acts as the web server for all of my instances of ST and Open WebUI. There are 6 instances of Wilmer on this machine, each with its own purpose. The Macbook Pro is the Wilmer portable core (3 instances of Wilmer), and the Windows Desktop is the Wilmer dev core (2 instances of Wilmer).
All of the models for the Wilmer home core are on the Mac Studio, and I hope to eventually add another box to expand the home core.
Each core acts independently from the others, meaning doing things like removing the macbook from the network won't hurt the home core. Each core has its own text models, offline wiki api, and vision model.
I have 2 "assistants" set up, with the intention to later add a third. Each assistant is essentially built to be an advanced "rubber duck" (as in the rubber duck programming method where you talk through a problem to an inanimate object and it helps you solve this problem). Each assistant is built entirely to talk through problems with me, of any kind, and help me solve them by challenging me, answering my questions, or using a specific set of instructions on how to think through issues in unique ways. Each assistant is built to be different, and thus solve things differently.
Each assistant is made up of multiple LLMs. Some examples would be:
- A responder model, which does the talking
- A RAG model, which I use for pulling data from the offline wikipedia api for factual questions
- A reasoning model, for thinking through a response before the responder answers
- A coding model, for handling code issues and math issues.
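To make the "several models, one assistant" idea a bit more concrete, here is a rough Python sketch. It is not Wilmer's actual code: `run_assistant`, the `categorize` callable and the role names in the `models` dict are all hypothetical, and each "model" is just a function that takes a prompt and returns text.

```python
from typing import Callable, Dict, List

# Hypothetical stand-in for a model endpoint: takes a prompt, returns text.
Model = Callable[[str], str]


def run_assistant(
    prompt: str,
    models: Dict[str, Model],
    categorize: Callable[[str], str],
) -> str:
    """Route a prompt by domain, gather helper output, then let the responder talk."""
    domain = categorize(prompt)  # assumed to return e.g. "factual" or "coding"
    notes: List[str] = []

    # A domain-specific helper model runs first, if the domain calls for one.
    if domain == "factual":
        notes.append(models["rag"](prompt))
    elif domain == "coding":
        notes.append(models["coder"](prompt))

    # The reasoning model thinks through the prompt plus any helper output.
    notes.append(models["reasoner"](prompt + "\n" + "\n".join(notes)))

    # Only the responder's text is returned to the front end.
    return models["responder"](prompt + "\n\nNotes:\n" + "\n".join(notes))
```

The point of the shape is that the front end only ever sees one reply, while any number of models can sit behind it.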
The two assistants are:
- RolandAI- powered by the home core. All of Roland's models generally run on the Mac Studio, and it is by far the more powerful of the two. It's got conversation memories going back to early 2024, and I primarily use it. At this point I have to prune the memories regularly lol. I'm saving the pruned memories for when I get a secondary memory system into Wilmer that I can backload them into.
- SomeOddCodeBot- powered by the portable core. All these models run on the Macbook. This is my "second opinion" bot, and also my portable bot for when I'm on the road. Its setup is specifically different from Roland's, beyond just being smaller, so that they will "think" differently about problems.
Each assistant's persona and problem solving instructions exist only within the workflows of Wilmer, meaning that front ends like SillyTavern have no information in a character card for it, Open WebUI has no prompt for it, etc. Roland, as an entity, is a specific series of workflow nodes that are designed to act, speak and process problems/prompts in a very specific way.
I generally have a total of about 8 front end SillyTavern/Open WebUI windows open.
- Four ST windows. Two are for the two assistants individually, and one is a group chat that has both in case I want the two assistants to process a longer/more complex concept together. This replaced my old "development group".
- I have a fourth ST window for my home core "Coding" Wilmer instance, which is a workflow that is just for coding questions (for example, one iteration of this was using QwQ + Qwen2.5 32b coder, where the response quality landed somewhere between ChatGPT 4o and o1. Tis slow though).
- After that, I have 4 Open WebUI windows for coding workflows, reasoning workflows and encyclopedic questions using the offline wiki api.
How I Use Them
Roland is obviously going to be the more powerful of the two assistants; I have 180GB, give or take, of VRAM to build out its model structure with. SomeOddCodeBot has about 76GB of VRAM, but has a similar structure just using smaller models.
I use these assistants for any personal projects that I have; I can't use them for anything work related, but I do a lot of personal dev and tinkering. Whenever I have an idea, whenever I'm checking something, etc I usually bounce the ideas off of one or both assistants. If I'm trying to think through a problem I might do similarly.
Another example is code reviews: I often pass in the before/after code to both bots, and ask for a general analysis of what's what. I'm reviewing it myself as well, but the bots help me find little things I might have missed, and generally make me feel better that I didn't miss anything.
The code reviews will often be for my own work, as well as anyone committing to my personal projects.
For the dev core, I use Ollama as the main inference because I can do a neat trick with Wilmer on it. As long as each individual model fits on 20GB of VRAM, I can use as many models as I want in the workflow. Ollama API calls let you pass the model name in, and it unloads the current model and loads the new model instead, so I can have each Wilmer node just pass in a different model name. This lets me simulate the 76GB portable core with only 20GB, since I only use smaller models on the portable core, so I can have a dev assistant to break and mess with while I'm updating Wilmer code.
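A minimal sketch of that trick, assuming Ollama is listening on its default local port and using its `/api/generate` endpoint. The node names and model tags in `WORKFLOW` are placeholders, not my real workflow, and `build_request`/`call_node` are hypothetical helpers rather than Wilmer code.

```python
import json
import urllib.request

# Ollama's default local endpoint; adjust if your install differs.
OLLAMA_URL = "http://localhost:11434/api/generate"

# Placeholder node -> model mapping; each model only has to fit in VRAM on its own.
WORKFLOW = [
    ("reasoner", "placeholder-reasoning-model"),
    ("coder", "placeholder-coding-model"),
    ("responder", "placeholder-chat-model"),
]


def build_request(model: str, prompt: str) -> urllib.request.Request:
    """Build a non-streaming generate request that names the model to use."""
    payload = {"model": model, "prompt": prompt, "stream": False}
    return urllib.request.Request(
        OLLAMA_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def call_node(model: str, prompt: str) -> str:
    """Send the request; Ollama loads the named model if it is not already loaded."""
    with urllib.request.urlopen(build_request(model, prompt)) as resp:
        return json.loads(resp.read())["response"]


def run_workflow(prompt: str) -> str:
    """Run each node in order, feeding the previous node's output forward."""
    text = prompt
    for _node_name, model in WORKFLOW:
        text = call_node(model, text)
    return text
```

Each node pays the model load time, which is the trade-off for fitting a multi-model workflow onto one card.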
2025 Plans
- I plan to convert the dev core into a coding agent box and build a Wilmer agent jobs system; think of it like an agent wrapping an agent lol. I want something like Aider running as the worker agent, that is controlled by a wrapping agent that calls a Roland Wilmer instance to manage the coder. ie- Roland is in charge of the agent doing the coding.
- I've been using Roland to code review me, help me come up with architectures for things, etc for a while. The goal of that is to tune the workflows so that I can eventually just put Roland in charge of a coding agent running on the Windows box. Write down what I want, get back a higher quality version than if I just left the normal agent to its devices; something QAed by a workflow thinking in a specific way that I want it to think. If that works well, I'd try to expand that out to have N number of agents running off of runpod boxes for larger dev work.
- All of this is just a really high level plan atm, but I became more interested in it after finding out about that $1m competition =D What was a "that's a neat idea" became a "I really want to try this". So this whole plan may fail miserably, but I do have some hope based on how I'm already using Wilmer today.
- I want to add Home Assistant integration in and start making home automation workflows in Wilmer. Once I've got some going, I'll add a new Wilmer core to the house, as well as a third assistant, to manage it.
- I've got my eye on an NVIDIA DIGITS... might get it to expand Roland a bit.
Anyhow, that's pretty much it. It's an odd setup, but I thought some of you might get a kick out of it.
Ok_Warning2146@reddit
Your experience confirmed that Apple Silicon and the upcoming DIGITS are ideal for workflows that involve multiple small llms.
Recently I've been into long context, but I'm annoyed by the quadratic increase in both VRAM usage and run time. While the Nvidia cards are 2 to 5 times faster than the M2 Ultra in prompt processing, their small VRAM makes them unable to run any decent model with long context. Did you try the RWKV or SSM models (e.g. Jamba) that scale more linearly with long context on the M2 Ultra?
SomeOddCodeGuy@reddit (OP)
I can't remember- is Nemo a Jamba? If not, then no, I don't think I have.
Honestly, one of the things I focus on with Wilmer is to try to keep my contexts low. Part of why I went heavily into the memory system was that I wanted the quality of lower context with the ability to retain at least high level important info, so for the most part I will only ever load a model, at most, at 32k context. I don't think I've ever loaded a model at a larger context than that.
Even the small models start to get frustrating at larger context on the Mac anyhow lol. Context shifting helps to a degree, but when you're doing workflows you also break the shifting, so you have to be careful/clever about how you do it.
With the exception of one or two workflow nodes, most of my models never process more than 8,000-12,000 tokens at a shot.
DuckRedWine@reddit
Hey, I checked wilmer and it seems your memory system is about summarizing chat messages, but not creating the initial memory. In my case I'm trying to automate the selection of files needed to be edited for a feature implementation on a large codebase, to pass as prompt context, and don't know if it could help in some way? Basically I'd need some sort of RAG to find the relevant files (some use LLMs for that), but the goal would be to not have to send my entire codebase to some model providers that don't take privacy as a priority.
Ok_Warning2146@reddit
While you may not have a use case for long context yet, there might be one in the future. It would make sense for you to explore this in your free time. I think you can start by trying llama-3.3-70B (128k max, 64k effective) and Jamba-mini (a 52B MoE model w/ 12B active, 256k max, 256k effective). Then compare the two to see which one can be good for your future long context workflow.
reza2kn@reddit
I would love to see a video on this, EVEN if it's just a simple hit record and you start talking and showing us what you got there! :)
SomeOddCodeGuy@reddit (OP)
I'm planning to install that OBS video recording app this weekend and see what I can scrounge up. Need to downgrade Roland to slightly smaller models or the video will be a lot of us all waiting for a response together =D
No_Afternoon_4260@reddit
If you had no hardware and 10k in your bank account, how would you spend it now? Are you happy with your Macs?
SomeOddCodeGuy@reddit (OP)
Yes and no. I'm not unhappy with my macs, but if I had absolutely no limitations at all and $10k in the bank, I'd build an RTX 3090 box. I had a secondary constraint of an older house and breaker box, so I couldn't handle the power draw on a machine with 5-6 3090s, but if I had a dedicated 30 amp plug/breaker and $10k, I'd have built that. The speed difference between Mac and CUDA would easily offset the time to hot swap models via Ollama API calls to get the exact same number/layout of models.
With that said, there's a caveat here. There is no better laptop for LLMs than the Macbook Pro right now, so $3000 of that would be the exact same laptop I have now. Nothing out there I'd trade it for. Then the other $7000 would build whatever CUDA machine I could.
Elite_Crew@reddit
Do you talk to your AI like Jarvis?
SomeOddCodeGuy@reddit (OP)
I'm afraid I do not, for two reasons:
I do want to get there, though; I have plans for one day doing it, but I think I want to handle it differently. I'm not sure if Roland is the assistant I'd use it with, but rather whatever assistant core I put together for running the house IOT stuff, since I won't be typing a lot to talk to that one.
Chances are Roland will remain just a text assistant for a good while longer.
Elite_Crew@reddit
You should check out the latency on that Glados project. It would need training for different voices though. I would actually rather prefer to talk to my AI if it is as capable as a droid in Star Wars. It sounds like your AI system is probably nearing that capability. That project has a reddit thread here and it's in the video description.
https://www.youtube.com/watch?v=N-GHKTocDF0
Environmental-Metal9@reddit
Can you elaborate on how you’re doing memory management? I’m doing some research on the topic, and I got a nice semantic contextual memory system going, but it is so slow it becomes barely usable after a few sessions even if accuracy and relevance are stellar. I’m willing to go with something simpler before I decide to spend months finetuning my own homegrown system.
SomeOddCodeGuy@reddit (OP)
Absolutely. Luckily I saw this right before going to bed lol. You will likely be disappointed when you hear how it works, but for me the simplicity of it is exactly what I needed.
So Wilmer's memory system started as basically me just outright rejecting the vector memory systems that all the front ends were using in late 2023/early 2024. It frustrated me to no end that it wouldn't capture the right data that I wanted. For example, say I told the LLM a story about the house I grew up in, a big white house, and then explained the layout of it; then over the course of the next few months I kept talking about the big white house and various rooms in it during the course of other topics. Eventually I ask it "What was the layout of the big white house? How many rooms did it have?" The scoring that retrieved the memory may not score my initial message to the LLM explaining the layout higher than other times I talked about other rooms.
Also, constantly shifting the memories in the prompt kept causing the LLM to reprocess the whole context which also annoyed me.
So Wilmer's memory was a brute force method. As the conversation continues, a settings file gives values for the number of messages that should pass before a new memory is generated OR a max number of estimated tokens since the last memory was generated, as well as the number of messages to skip from the most recent (because some front ends fill the first few messages with stuff). Every time you hit either the message-count limit or the size limit, that batch of messages gets sent to the LLM alongside the last 3 memories that had been created, and the prompt in that settings file is used to generate a new memory. That is added to a .json file.
Then, every N number of memories, a summary is generated based on the settings in another file. This rolling summary gets regenerated every N number of memories, asking the LLM to rewrite the summary with the new context.
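The trigger logic above can be sketched roughly like this. It's a simplified illustration and not Wilmer's actual code: `MemoryState`, `estimate_tokens`, `update_memories`, the default thresholds and the prompt wording are all made up for the example, `llm` is any function that takes a prompt and returns text, and everything is kept in memory instead of being written to a .json file.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class MemoryState:
    memories: List[str] = field(default_factory=list)
    summary: str = ""
    consumed: int = 0  # how many messages are already covered by a memory


def estimate_tokens(text: str) -> int:
    # Rough heuristic (about 4 characters per token); an assumption for this sketch.
    return max(1, len(text) // 4)


def update_memories(
    messages: List[str],
    state: MemoryState,
    llm: Callable[[str], str],
    max_messages: int = 4,
    max_tokens: int = 500,
    skip_recent: int = 1,
    summary_every: int = 2,
) -> MemoryState:
    """Create a memory once enough new messages (by count OR size) have piled up."""
    # Ignore the most recent messages, as the skip setting describes.
    eligible = messages[: max(0, len(messages) - skip_recent)]
    pending = eligible[state.consumed:]

    over_count = len(pending) >= max_messages
    over_size = sum(estimate_tokens(m) for m in pending) >= max_tokens
    if not pending or not (over_count or over_size):
        return state

    # The new batch goes to the LLM alongside the last 3 memories.
    prompt = (
        "Previous memories:\n" + "\n".join(state.memories[-3:])
        + "\n\nNew messages:\n" + "\n".join(pending)
    )
    state.memories.append(llm(prompt))
    state.consumed += len(pending)

    # Every N memories, ask the LLM to rewrite the rolling summary.
    if len(state.memories) % summary_every == 0:
        state.summary = llm(
            "Rewrite this summary with the new memories:\n"
            + state.summary + "\n" + "\n".join(state.memories[-summary_every:])
        )
    return state
```

Calling `update_memories` after every turn is cheap when no threshold has been hit, since it returns before touching the LLM.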
Now, there are various ways I use this, but for Roland what I do is
Now you can see why I have to prune them once in a while =D
Anyhow, you can read more about them here and here.
DeltaSqueezer@reddit
Can you explain context shifting? I've yet to find an explanation that I understand. I thought this was somehow selectively pruning content from the context window to fit but not sure if this is correct and if so, what is the algorithm for selecting what is kept and what is discarded.
SomeOddCodeGuy@reddit (OP)
Absolutely. I don't know the exact algorithm but I can tell you what I've observed and how I've managed to work with it.
First, to imagine Context Shifting, imagine you have a long page of paper with hundreds of lines of text. And then you have a small "window" like frame that you put over it that can only see 20 lines at a time. As you move the window frame downward, you see new lines and lose old lines. That, from how I read their explanation, is how they are managing the KV Cache; taking in new tokens and pushing out old tokens from the KV.
So for the most part, Context Shifting only works when you are modifying the bottom of the conversation. Let me give you an example:
If you add a new message to the bottom, what I've seen is that it will bump message 9 off context and will add the new message under message 1. This will cause the LLM to only need to process the newest messages that came in, keeping the rest in the KV Cache. In essence, that appears to be the core of how context shifting works.
Now, the issue comes in when YOU, or your front end, modifies anything closer to the top of that example message. If you add a new message under 1, or you edit up to maybe message 2 or 3, you're pretty safe; context shifting should just drop those messages and then reprocess the new versions + whatever your new message is. But the higher up you go, the more likely you are to cause it to just throw everything away and start fresh, reprocessing the whole context.
So, for example, in SillyTavern the world books or whatever they call them may tack on pieces of information higher up in the conversation. It could add a piece of info as high up as right below the system prompt, or somewhere between messages 5 or 6, etc. In doing so, context shifting would likely just throw its hands in the air and reprocess the entire prompt.
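To put numbers on that behavior, here's a toy Python model of what I've observed. To be clear, this is not KoboldCPP's actual algorithm (the real thing works on tokens in the KV cache, and I don't know its exact rules): it treats each message as one unit, ignores the system prompt, and `messages_to_reprocess` is a made-up helper. Lists are ordered oldest first.

```python
from typing import List


def common_prefix_len(a: List[str], b: List[str]) -> int:
    """Count how many leading items two lists share."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


def messages_to_reprocess(cached: List[str], incoming: List[str]) -> int:
    """Toy estimate of how many incoming messages must be processed again.

    Tries dropping 0..len(cached) of the oldest cached messages (the "shift"),
    then reuses whatever unbroken run still lines up with the incoming prompt.
    """
    best_reuse = 0
    for drop in range(len(cached) + 1):
        best_reuse = max(best_reuse, common_prefix_len(cached[drop:], incoming))
    return len(incoming) - best_reuse


cached = ["m9", "m8", "m7", "m6", "m5", "m4", "m3", "m2", "m1"]

# Oldest message falls off the top, one new message lands at the bottom.
appended = cached[1:] + ["new"]

# Something (e.g. a lore entry) gets inserted near the top of the conversation.
inserted_high = ["m8", "lore"] + cached[2:] + ["new"]

print(messages_to_reprocess(cached, appended))
print(messages_to_reprocess(cached, inserted_high))
```

In this toy model the append case only needs the new message processed, while the high insert invalidates nearly everything after it, which matches the "throw it all away" behavior I described.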
Environmental-Metal9@reddit
❤️
Thank you so much for such a comprehensive writeup! It gives me hope that I was going in the right direction. My setup has many of the same building blocks (n messages till memory, rolling summaries, system prompt with the memory context) but I was spending too much time computing and saving embeddings, calculating similarities, calculating time decay (to simulate more human like memory, where we tend to forget things after some time if they don't surface again), and on top of it all I am working with exactly 32GB on an M1 Mac, and everything happens in serial, so as you can imagine this is not sustainable.
I started playing with llama-cpp-python's save_context functions (from llama-cpp) and that is a nice way to save state from session to session, but doesn't solve the memory issue. I really like your idea of parallelizing this specific aspect of the memory system, so that you don't need to waste time on it, and it can then take as long as it needs. I might experiment with that idea using an api provider just to test it out.
Thank you for providing links to the repo so I can learn more!
Some-Conversation517@reddit
This is sick 🤯
prometheus_pz@reddit
(Translated from Chinese) You can consider making a video to explain the whole process and demonstrate the results, which is more intuitive than text.
SomeOddCodeGuy@reddit (OP)
(Replied in Chinese via Google Translate) I agree. Someone else also mentioned this, and gave me a recommendation for video capture software. I haven't really done videos in a long time, so I kept putting it off, but I'm planning to try to set up this weekend and do a few. Sorry if this reads a little strange; I used Google Translate.
rorowhat@reddit
Feel like this needs a video of you using it
SomeOddCodeGuy@reddit (OP)
I'll get OBS set up this weekend, but for now here are a few screenshots to give an idea of what some of the setup looks like in use.
SomeOddCodeGuy@reddit (OP)
Yea I need to do a whole video series on setting it up, because nothing about Wilmer is user friendly. The problem is that I haven't made a video of anything in years, so I keep putting off setting it all up lol. Especially to show off the different cores; I'd need to do the video off one of my Macs since it's easier to remote into Windows from a Mac than the other way around.
tl;dr- I've been procrastinating, but I know I should lol
rorowhat@reddit
You can probably write the steps, ask one of your AIs to create a video script from it, and get another AI to speak it and you just click around to demonstrate lol
SomeOddCodeGuy@reddit (OP)
lol! I don't mind the talking and stuff, I have to do a lot of that for meetings at work (dev manager); it's finding a good recording program for the Mac and then sanitizing my desktops so I don't stream my tax info or something =D
This weekend I'll start looking to get everything set up and find a good video capture program. I'll probably do video walkthroughs of the quickguides in Wilmer, and then one just kind of showing everything.
I also need to think of a way to make the video interesting. I feel like it would be really boring to see me ask Roland something and then go bounce around terminal windows like "Oh look, this model is doing work! ... and now this one... and now this one... hey Roland responded!" =D
rorowhat@reddit
For screen capture this is the best open source one https://obsproject.com/
failcookie@reddit
Echoing interest in a video demo of what is going on here. I feel like I get the idea, but I’d really like to see it play out lol
SomeOddCodeGuy@reddit (OP)
Yea I'll definitely set up a vid. I've owed folks some videos on this for a while. I'll start looking for a good video capture software for the Mac so I can walk through it. Because the cores are spread across multiple machines, I think Mac capture will be the way to go.
Co0lboii@reddit
Interested to try this out for myself
reneil1337@reddit
this is incredible. really impressive! need to dig into WilmerAI
SomeOddCodeGuy@reddit (OP)
Reddit keeps eating my comments... trying this again. If you get 3 comments from me, I apologize.
You can likely recreate most of this setup using just about any workflow application; the assistants are the only thing that really needs Wilmer. The only thing Wilmer does differently from something like n8n or omnichain is the domain routing, so that the assistants can handle more complex problems, but otherwise things like the coding workflow or encyclopedic workflow in Open WebUI could likely be done with either n8n or omnichain, if you prefer.
For Roland- I have a very simple example of an assistant workflow as an example user in the codebase; Roland and Socb are both just mega customized versions of that. Their routing is different now and the workflows are all very different, but the overall concept is the same. If you did use Wilmer, you could use those users as a starting point for the assistant.
hainesk@reddit
Wow, this is really cool! A great use of AI!
SomeOddCodeGuy@reddit (OP)
I think reddit ate my comment, so going to rewrite it. If I have 2 comments saying similar things, that's why. I don't see the comment.
NOTE: I used to only have 1 Wilmer "core" in the house, which powered Roland. The models were spread across all the computers, so I had a lot more VRAM for a single assistant. I ended up changing this after discovering a fundamental flaw with the design: one day my Macbook restarted for an update, and while it was down Roland was gone =D Then, later, the UPS for my Windows machine died and again Roland was gone.
I only had 1 assistant at the time, so when a single node died, it killed my whole system. That's why I broke it up into 3 "cores", so I could have 2 assistants in case one died, and I could have a dev system to build against.
It has an additional benefit of letting me do the group chat, which replaced my old dev chat where I had different models talk to each other. Back then, it was just routing to workflows with a single responder node for each model, so it was model talking to model with nothing else. It was neat and a little helpful, but even with 6-7 models they were too similar.
These two assistants have specific workflows and domain routings that cause them to talk/act/process problems/etc completely differently from each other, so I feel like I get more out of it now.