offline companion robot for my disabled husband (8GB RAM constraints) – looking for optimization advice
Posted by BuddyBotBuilder@reddit | LocalLLaMA | 77 comments
Hi everyone. I’m probably posting slightly outside the usual scope here, but I’m hoping some of you might have advice.
I’m Gen-X with no formal programming background, but I’ve been building a small AI companion project for my husband. He’s mostly quadriplegic (paralyzed legs and limited use of his hands) and spends most of the day alone at home while I’m at work. We live in a very rural area with no close neighbors or nearby friends, and the isolation has been hard on him.
So I decided to try building him a companion robot.
For the past year I’ve been scavenging parts and learning as I go. The goal is a fully local, offline mobile robot built on a small power-wheelchair base (two 24V batteries) that can talk with him and keep him company.
Current prototype setup:
LLM (conversation):
• Mistral-7B-Instruct via llama.cpp
• Running on a free Lenovo ThinkPad
• Intel i5 @ 1.6 GHz
• 8 GB RAM
Speech Recognition:
• Jetson Nano running faster-whisper (base, INT8)
Text-to-Speech:
• Piper TTS – en_US-ryan-medium
Right now the output is just going to an HDMI port connected to a TV while I test everything.
The main limitation is the ThinkPad’s 8 GB RAM, so I’m restricted to smaller quantized models.
My main question:
What are the best ways to maximize usable RAM and performance for llama.cpp on an 8 GB system?
For example:
• Better quantization choices
• Swap/zram strategies on Linux
• Smaller models that still feel conversational
• Any other tricks people use on low-resource systems
OS is Linux Mint 22.3 Cinnamon (64-bit).
I know this is a bit of an unusual use case, but if anyone has suggestions for squeezing more performance out of limited hardware, I’d really appreciate it.
ItsFlybye@reddit
What a great project for the hubby =). Lots of great replies, and I'm a bit of a noob myself so there isn't much I can contribute. One thing I didn't see mentioned: Have you considered a PTZ camera?
There's a PTZ camera around $50 I came across in a Medium article, which someone programmed their AI to interact with. It would let the AI buddy look around without having to move its entire body to face your hubby, and it could add interaction by glancing at what he's doing and asking questions or commenting on his actions.
BuddyBotBuilder@reddit (OP)
That’s a fun idea: don’t just have a camera on the bot itself, have one that observes the room as well. I might add that later in the project. The trick is I have a few limits I am trying not to cross for this project. The main one is that it’s local, not dependent on the internet, as we have power outages a few times a year that have lasted up to three days. It can’t depend on networking in the house either. :)
darkgamer_nw@reddit
I recommend you take a look at this repository; it basically has most of what you want to do already set up
https://github.com/brenpoly/be-more-agent
Not_your_guy_buddy42@reddit
I've been running a local voice bot for over a year (40k chats now) and have thought a lot about the aspects behind artificial companionship in a jank situation (jank is good). DM if you like.
daLazyModder@reddit
I would look into Moonshine ASR; it's similar to Whisper but better for edge cases like this.
https://github.com/moonshine-ai/moonshine
You could also try something like phi mini moe for the llm
https://huggingface.co/microsoft/Phi-tiny-MoE-instruct
It is a moe model that is 3.8b parameters total with 1.1b active
Being MoE means the model runs as fast as a 1.1B while carrying 3.8B worth of knowledge, though Phi's personality can reportedly leave a lot to be desired... (it's by Microsoft and was reported to be heavily censored even for basic stuff).
Piper TTS is fast, good on CPU, and low latency; Kokoro would probably work as well. You can actually do halfway decent voice cloning with Pocket TTS on CPU. It's a bit more of a pain to set up, but it could speak in your voice as the companion, if your spouse would like that.
https://huggingface.co/KevinAHM/pocket-tts-onnx
I kind of hate to recommend upgrading the laptop, but I suspect it's running DDR4.
https://www.ebay.com/itm/204766825540
16 GB is $100, which is ridiculous (RAM pricing is currently crazy), but it might be worth the investment to run a slightly larger model like Granite 4 Tiny, which is 7B total / 1B active, I believe.
(I would personally check whether you have two sticks of RAM in the laptop. If it reports 8 GB but there's only one stick, you could double your RAM capacity for about $50 by buying a cheap stick off eBay; just make sure it's DDR4, not DDR5 or something older.)
https://huggingface.co/ibm-granite/granite-4.0-h-tiny
is the granite model I mentioned you might look into.
Stepfunction@reddit
For your stack, given your limited specs, I would recommend the recently released Gemma 4 E4B or E2B and Kokoro TTS for the most bang for your buck.
The real trick is being able to interrupt the robot when it's talking and to be able to store the long-term context in some sort of RAG setup so it doesn't forget everything all the time.
BuddyBotBuilder@reddit (OP)
I will absolutely try this. Will add the ability for people to interrupt the bot while it’s talking. Right now Buddy stores memory in ~/buddy_memory.json on the Jetson. It saves facts, appointments, medications, reminders, and OCEAN personality scores. However, it’s pretty basic — just keyword extraction, no semantic understanding.
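A minimal sketch of wiring that memory file into the prompt, with the capped-injection idea from the earlier comment (the fields follow the buddy_memory.json described above, but the structure beyond that and the helper names are made up):

```python
import json
from pathlib import Path

MEMORY_PATH = Path.home() / "buddy_memory.json"  # path from the post above

def load_memory(path=MEMORY_PATH):
    """Load stored memory, tolerating a missing file on first run."""
    if path.exists():
        return json.loads(path.read_text())
    return {"facts": [], "appointments": [], "medications": []}

def build_system_prompt(memory, base="You are Buddy, a friendly companion.", max_facts=10):
    """Inject only the most recent facts, capped so the context stays small."""
    recent = memory.get("facts", [])[-max_facts:]
    if not recent:
        return base
    return base + "\nThings you remember:\n" + "\n".join(f"- {f}" for f in recent)

mem = {"facts": ["Medication at 9am", "Favorite show is Star Trek"]}
print(build_system_prompt(mem))
```

Capping the injected facts matters on a CPU-bound setup: every extra line of memory is prompt tokens that slow down each response.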
MeowChamber@reddit
I think you might wanna try a front-end tool like SillyTavern or LettuceAI (the latter has llama.cpp built into the backend, so you don't need to install the backend separately like with ST). While it mostly aims at roleplaying, I think you can use this tool the way your husband needs it. Both have memory systems that could help it remember more.
Mkengine@reddit
Gemma 4 is already a good recommendation, and maybe this one is too large, but at least from the conversational requirements this could also be interesting:
https://huggingface.co/nvidia/personaplex-7b-v1
TheDigitalRhino@reddit
Also, it looks like you have a Claude subscription; consider giving Claude Code a try, as it will do the coding for you. It does burn through your tokens, but it will help code things up quickly. Going back and forth with the browser will just slow you down.
BuddyBotBuilder@reddit (OP)
Nope, free version. I work around the token limits by having it summarize the progress, then I delete the conversation and start a new one using the summary, giving it a copy of all my .py files. Luckily it's not a lot of coding right now, just a lot of editing of what's there, trying to keep it from going off the rails of the conversation. The files don't need much changed when switching models.
Exciting-Mall192@reddit
I would recommend trying Gemini on Google AI Studio, with the summarization done by Claude when you're out of messages. Is the companion only conversational, btw? I agree with everyone's recommendations on trying newer models like Qwen 3.5, Gemma 4, or the Bonsai model.
BuddyBotBuilder@reddit (OP)
No subscription APIs, only because we lose power out here in the woods a few times a year, sometimes for a few days. That’s when he would need the bot the most, so it’s a hard rule that it must be able to function without internet access for extended periods of time. We don’t even get cell phone reception unless our router is working. We have a small generator that runs the pellet stove, fridge, TV, and a DVD player, and it could keep the bot charged. No easy outs for this project :). That’s what makes it fun!
Stepfunction@reddit
That makes sense. Another option, similar in spirit would be that if you have a powerful desktop computer, you could host the LLM on that and make API calls from the laptop to reduce the power consumption and potentially host larger models or get better inference performance.
braydon125@reddit
You need gpu dude
unjustifiably_angry@reddit
And a fleshlight.
braydon125@reddit
Nvidia personaplex is an incredible conversational model
HeyEmpase@reddit
Have you thought about using lightweight LLMs like Phi-3-mini (3.8B) or TinyLlama (1.1B) quantized to 4-bit? They can work well on 8GB of RAM with CPU-only inference and are capable of handling basic dialogue, reminders, and command parsing offline. I'm curious about what sensors or actuators you plan to integrate! Voice input latency and response naturalness can really impact the user experience, so it's worth considering those factors.
Such a heartbreaking and useful use case. I think most people code nowadays without any end goal, but this... please continue!
BuddyBotBuilder@reddit (OP)
Thanks for your interest and questions! I have tried all three. It’s finding the balance between speed and reasoning that’s the trick. The one I’m using now is a little slow but is better at actual conversation. Going to try Gemma 4 tomorrow; haven’t tried that one yet.
I have a bunch of basic sensors that come with Uno kits, and an Xbox Kinect camera that has some fun stuff but is a little power hungry for this setup. I also have a few webcams lying around. The dream would be LiDAR, of course, but I’m probably going to have to just use the webcams and the proximity sensors.
Body: a small power-chair base with two 24-volt batteries. I got a few step-down power converters so I can split the power up and not overload anything; I’ll probably need to devise some kind of fuse box as well. To the base I will be attaching the body’s main support pole using three screw actuators and some guide poles, so the bot can bend a little and emote a little with its movement. I would like the bot to be able to help pick up things he drops on the floor.
(In his chair he can’t pick stuff up off the floor. He’s taken a few nosedives out of his chair trying. I get home from work and find out he’s been on the floor for a few hours. It’s terrible. He’s too proud to call me even though he knows I would come home right away. So for now I have metal taped to items so he can use an extendable magnet stick to grab stuff he drops.)
Going to use the 3D-printable parts from the InMoov robot (open-source android body), specifically the neck joint. I just need to scale the print up about 200% so it’s the right size, and I’ll print the same joint a second time for the head at normal settings. So that’s 6 actuators currently. The power-chair base has two motors for forward, back, and turning.
Arms, not quite there yet. Lots of options out there since I have a 3D printer.
Head: got a cool-looking helmet and an LED matrix. I’ll put the LED matrix in the helmet visor and give it some expressive eyes. Going to try for eyes like EVE from the movie WALL-E.
Personality: I incorporated the open-source OCEAN personality assessment tool. It’s a solid diagnostic tool. The bot has instructions to try to slip in a question or two each day, record the responses, and then use the assessment tool’s findings to tailor its interactions with him.
Anyway, that’s what I’ve come up with so far. Right now I’m just trying to get step one done and see if I can get this equipment to work fast enough to simulate normal conversation speeds so it feels natural. Might not be possible, but that’s why I came here :)
MagoViejo@reddit
Wow , you really are something else! I'm planning on something for the next decade as I will probably be disabled and you are giving me some ideas.
About the LiDAR thing, maybe you can scavenge parts from those cheap Roomba clones from China. Their mechanical parts give way before their electronics, so you may find a treasure trove there.
As for the LLM, I saw in this sub something being done with old mobile phones; maybe separating different tasks across different units could make it more responsive/capable.
Anyway I'm just starting to learn to look under the hood with LLMs
HeyEmpase@reddit
Didn't expect such an elaborate answer! That sounds like a really thoughtful build. Since you’ve already tested several small models, one promising next step might be fine-tuning a smaller model for your exact use case instead of chasing bigger models. A 3B–4B model tuned for your husband’s style of conversation, reminders, routines, and preferred tone may feel better than a larger general model while using much less RAM and running faster.
Tools from unsloth.ai make LoRA fine-tuning much easier, and people do run it in Google Colab [1]. Even a small dataset of example conversations, reminders, reassurance, and daily check-ins could help. For this project, a fast, warm, reliable specialized model may beat a slower “smarter” one. May you be happy, and in doing so make your husband happy too.
[1] https://unsloth.ai/docs/models/gemma-4/train
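For illustration, a small fine-tuning dataset in the common chat-messages JSONL format might look like this (the contents are invented examples; the exact schema expected depends on the training tool you use):

```jsonl
{"messages": [{"role": "system", "content": "You are Buddy, a warm, patient companion."}, {"role": "user", "content": "I dropped my book again."}, {"role": "assistant", "content": "No worries, that happens. Is it within reach of your magnet stick, or should we note it for later?"}]}
{"messages": [{"role": "system", "content": "You are Buddy, a warm, patient companion."}, {"role": "user", "content": "What's on today?"}, {"role": "assistant", "content": "Morning meds at 9, and you wanted to watch your show this afternoon."}]}
```

Even a few hundred lines in this style, written in the tone you want the bot to have, can shift a small model's personality noticeably.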
ai_guy_nerd@reddit
This is such a thoughtful project. For 8GB on Mistral 7B, you've basically hit the ceiling unless you go super aggressive with quantization. Some practical wins:
Quant approach: Q3_K_S or Q4_K_S gets you back maybe 500MB-1GB of headroom vs your current setup. You lose some reasoning quality but the conversation stays coherent for companionship use.
Swap trick that actually works: Set up zram (not swap to disk). On Linux:
`modprobe zram`, then `echo 2G > /sys/block/zram0/disksize`. It's compressed in-memory, doesn't thrash the disk, and buys you another gig of effective headroom. Noticeable latency hit, but not terrible.
Smaller model wins: Phi-3-mini (3.8B) or TinyLlama (1.1B) run on the i5 without the Jetson and still feel conversational for companionship. Less grounded knowledge, but for consistency and warmth (which matters for your use case), sometimes smaller + responsive beats bigger + slow.
The real win is bundling everything together: route lightweight conversations to the small model directly, and offload longer chats to Mistral via the Jetson. You're already doing the hard part with multi-device orchestration.
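As a note on the zram commands: the modprobe/disksize steps alone don't activate the swap. A fuller setup sketch (run as root; the sysfs paths are the standard kernel zram interface, and the 2G size is just an example):

```shell
modprobe zram                                  # load the zram module
echo lzo > /sys/block/zram0/comp_algorithm     # cheaper on a slow CPU than zstd
echo 2G > /sys/block/zram0/disksize            # uncompressed capacity of the device
mkswap /dev/zram0                              # format the device as swap
swapon -p 100 /dev/zram0                       # enable it, prioritized over any disk swap
```

Without the `mkswap`/`swapon` steps the zram device exists but the kernel never swaps to it.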
Erwindegier@reddit
Does it have to be that hardware? Can you find a used Air M1 or Mini M1 with 16gb? It’ll allow you to run way better models. I’m actually building a companion bot to run on those specs. Wrote my own vector DB for memory on top of SQLite and am currently looking at index2tts for text to speech as it sounds way better than piper. You can clone your own voice and pass emotion to the speak prompt.
brickout@reddit
Awesome! Have you checked if the thinkpad has expandable RAM? A lot of those models do. You could likely get another 8GB for $30 or less.
If you want a slightly stronger laptop, I'll bet I have one I could donate...but I read that you are enjoying being scrappy. I'm the same way.
This project is very cool. Are you powering it directly from the wheelchair power?
Some low resource thoughts:
My Fedora laptops HATE hitting zram for some reason. If you're maxing out your RAM and getting random hangs, maybe look at that. I disabled zram and instead made a classic swap file on my disk, and no more hangs.
Gemma 4 is absolutely incredible for its RAM usage. I have found the smaller-sized ones to be pretty verbose and chatty.
very cool project.
BuddyBotBuilder@reddit (OP)
Yes, checked. Only a single RAM slot. Someone recommended that two 4 GB sticks might work better than one 8, so I took a look, and there's only one slot.
Thanks for the laptop offer! ;)
So laptop has its own battery, but yup, everything else will be running off of the power chair batteries.
brickout@reddit
Gotcha. Figured you would have checked, given how comfortable you are putting all of this together. Good luck with the rest of the build! Would like to see updates in the future.
while-1-fork@reddit
I suppose the ThinkPad won't have an NVMe SSD? If it does, you can likely do well with MoE models larger than would seem reasonable, via memory-mapped files. Maybe worth testing even if you have a slow hard drive. I have not tried them, but the Marco Mini and Marco Nano models may be a good idea: they are MoE with very good benchmark scores for their size (which may or may not translate to real use), but with a tiny number of experts activated, so they should be fast even on constrained hardware, and only the active weights really need to be in memory simultaneously.
What is almost a must is using a modern model with hybrid attention whatever the size of model you settle with. The Qwen 3.5 line up is very good. Nemotron Nano and Gemma 4 are also strong contenders. Even Qwen 3.5 0.8B would be an improvement over Mistral-7B and way faster with less resource use.
I-quants offer better bang per bit than K-quants at the same size (not available above 5 bits, but you will likely run 4- or 3-bit). If you use ik_llama.cpp, there are also IQ-K quants that are even better.
You may consider inverting your setup and running the LLM on the Nano while Whisper runs on the PC through whisper.cpp. Especially if the Nano has an SSD for the memory-mapped MoE I talked about.
As for zram: given your CPU, you likely don't want to use zstd but lzo. zstd is often recommended because it can reach a higher compression ratio, but it is way slower even on much stronger CPUs. There are other algorithms slightly faster than lzo, but they offer worse compression and are likely not worth the trade. You also want to set vm.page-cluster=0 (the number of blocks read ahead; with hard-drive swap it helps, but here it often causes unneeded decompressions for almost no throughput gain and kills latency and CPU use). And when using zram you want to swap as early as possible, so set vm.swappiness=200 (even with that set, it won't really begin swapping until your RAM is about 80% full; early swapping results in less thrashing and spreads the CPU use over more time). Also disable swap partitions and swap files, and set the zram swap to 2x the system RAM.
I am running that on a 16GB machine (and a 24GB GPU) with OpenClaw + llama.cpp running Qwen 3.5 35B A3B in IQ4 + SearXNG + full Chrome in a container for OpenClaw to use + Yolo11 nano running on CPU filtering camera frames for images containing my cat + Claude Code, and everything runs great. The 3090 does a lot of heavy lifting of course, but zram helps a lot too, as rarely used stuff gets pushed into it and even some frequent use won't fully kill performance. I don't use it as a main machine, only as an OpenClaw + Claude Code machine. But I have been using zram for many years and it is great. I have not swapped to disk in maybe a decade; my main PC has 128GB and I still run zram on it, and I have done crazy things that required 300GB+ which would have been impossible swapping to disk.
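To make those settings persistent, they can be dropped into config files. A sketch (assumes systemd-style sysctl.d; the zram-generator section only applies if that tool is installed on the system):

```
# /etc/sysctl.d/99-zram.conf
vm.swappiness = 200      # start swapping to zram early
vm.page-cluster = 0      # no swap readahead; avoids needless decompressions

# /etc/systemd/zram-generator.conf
[zram0]
zram-size = ram * 2          # zram swap sized at 2x system RAM
compression-algorithm = lzo  # lighter CPU cost than zstd
```

On systems without zram-generator, the equivalent is the manual modprobe/mkswap/swapon sequence plus a boot script.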
ab2377@reddit
Please swap Mistral 7B for Qwen3.5-4B Q4. It's insanely intelligent for its size, you will love it, and it's also much faster. Do you build llama.cpp on your PC yourself or download it from the releases section on GitHub? Can I suggest you install the Gemini CLI free version, in case you want to write quick scripts or build llama.cpp without wasting time? It's really good, the free version.
good luck with your project. post updates on this sub as you move forward with it. lots of good luck and wishes.
Porespellar@reddit
You might want to look towards the folks at Stanford who built the open source Mobile Aloha robot for some inspiration on your project
https://mobile-aloha.github.io
They are west coast like yourself. They’ve pretty much open sourced all the plans and everything needed to build the working system.
JohnTheNerd3@reddit
that's such a nice thing to do! for the more technical side of things, i found Pocket TTS to be extremely fast with good quality, while still not requiring a GPU. it also supports voice cloning, so the assistant can have the voice of your choice!
While streaming on CPU, I can typically get the first word of output within 200ms, and one of the projects supports an OpenAI-compatible API, so most tools "just work" with it. I personally use it for Home Assistant and am quite happy with it.
DevilaN82@reddit
Since most of your RAM would be taken by model weights, which are close to random numbers and thus hard to compress, zram will give almost no gain here. In fact it might hurt performance, since those weights would be "compressed" (burning CPU) while still taking about the same amount of space.
You should try using mmap (this maps part of the hard disk into memory addresses): instead of reading from disk, writing to RAM, compressing, decompressing, and even swapping (still going to disk back and forth), it just reads from disk and uses the pages directly.
This hardware is very, very low spec for LLMs. You could get away with adding some knowledge base: consider using a Wikipedia ZIM snapshot and allowing your model to search/browse it to enrich its context and knowledge.
Also, I would use a better model. Mistral-7B-Instruct is, IDK, 2 years old? Newer models are better at the same size. Use Qwen3.5 or Gemma 4 (whichever variant fits your device). Unsloth's quants are great value for their size; you should try Unsloth Dynamic quants. I would not go below Q4, but hey, maybe Q3 will still be usable for your use case.
Good luck and please post a video showing how your current setup is working!
Previous_Escape3019@reddit
this is really cool. hope it works out
AnonymZ_@reddit
That’s cute and helpful
ssalvo41@reddit
I'm pretty experienced with Jetson stuff, so if you ever run into any issues, I'd be happy to help
brown2green@reddit
Most (all?) small conversational LLMs are going to feel very shallow very quickly as companions. I'd reconsider your idea, even if it's well-intentioned.
CaptnSauerkraut@reddit
I have nothing to add besides what Stepfunction said.
Just wanted to say that this is an awesome project and you sound like a great person. Keep us updated on the progress. Open sourcing the build once it is somewhat stable could help many more people.
MEGAnALEKS@reddit
I would try turboquant for more context window
Shayps@reddit
We can build something wonderful, but being this constrained will require us to be very creative.
Faster-whisper on the nano is a great design choice. Piper is as small and fast as you’re going to get it too. Good call on both of those. Latency is great for voice.
For the LLM we’re going to need to add memory, manage context, and ideally get e2e voice latency down to around a second.
I can help you, we can make this work. I build a lot of these that do all kinds of things. Can you DM me? I will likely want to send you some (free) hardware.
BuddyBotBuilder@reddit (OP)
Following some advice I got from a few people here, I'm going to try Gemma 4 and Kokoro TTS tomorrow to see if it improves the bot's conversation ability, and also add the ability for the person talking to interrupt the bot while it's talking. Lots of good advice popping up! Thanks for the offer of free hardware, that's really sweet of you. I'll pass for now, as I want to see what I get out of the hardware I have collected already. I have collected a lot of stuff for this project, probably too much. lol.
Spicy_mch4ggis@reddit
I am fully invested in getting something to run on what you have. The "add more compute" answers, while clearly the optimal solution, don't result in something truly beautiful: getting BuddyBot to run on a wheelchair-powered potato.
Spicy_mch4ggis@reddit
What are the processing-time considerations for everything that isn’t conversational? Specifically, if he asks a question, is it OK to wait a little for a better answer, or does it need to follow up immediately? I ask because I have done some work on edge hardware that doesn’t require “real” real-time processing, and there are some processes that can be run in sequence to fit more onto less hardware.
pot_sniffer@reddit
Really cool project. A few things that might help:
The Jetson may be underused, depending on which one you have. If it's an Orin, it's worth testing running the LLM there instead of on the ThinkPad CPU. If it's a standard Nano (2GB/4GB), probably not: Whisper will already be eating the VRAM, and there won't be enough left for a usable LLM.
Memory will matter more than model size for this use case. For a companion talking to the same person every day, the biggest jump in quality probably isn't a bigger model, it's giving it memory of past conversations. Even something simple: store a short daily summary in a text file, load the last few days into the system prompt. "Yesterday we talked about X, medication is Y, he mentioned Z." For what you're building this will feel more personal than any model upgrade. One important caveat: put a hard limit on how much you inject, last 3 days maximum, discard the rest. Context window fills fast on a CPU and inference slows noticeably as it grows. Without a cap you'll hit 30-second response times on a simple good morning within a couple of weeks.
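That capping rule is only a couple of lines (a sketch; the summary format and function name are invented):

```python
def build_memory_block(daily_summaries, max_days=3):
    """daily_summaries: list of (date_string, summary) pairs, oldest first.
    Keep only the last `max_days` entries so the context stays small."""
    recent = daily_summaries[-max_days:]
    return "\n".join(f"{day}: {summary}" for day, summary in recent)

log = [
    ("Mon", "talked about the garden"),
    ("Tue", "medication refill due Friday"),
    ("Wed", "watched a documentary"),
    ("Thu", "asked about the weather"),
]
# Only the last three days are injected into the system prompt:
print(build_memory_block(log))
```

Everything older than the cap can still live in the JSON file on disk; it just never reaches the prompt, which is what keeps response times stable.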
On quant, more bits isn't always faster or better in practice. For CPU inference specifically, Q6_K or Q5_K_M will usually give you noticeably faster generation than Q8 for no meaningful loss in conversational quality. The speed difference on a CPU is real; the quality difference in casual conversation is hard to notice.
Streaming TTS will make a big difference to how natural it feels. Rather than waiting for the full response before speaking, pipe the LLM output to Piper in sentence-sized chunks: wait for a period, comma, or question mark, then send that chunk. Start speaking the first sentence while generating the second. If you send raw tokens as they stream, the prosody will sound broken. Sentence boundaries are the key step most people miss.
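The chunking logic is easy to sketch (a hedged example splitting only at sentence-final punctuation; `token_stream` stands in for llama.cpp's streamed output, and the actual Piper call is omitted):

```python
import re

def sentence_chunks(token_stream):
    """Accumulate streamed LLM tokens and yield complete sentences,
    so each chunk can be sent to the TTS engine with natural prosody."""
    buf = ""
    for tok in token_stream:
        buf += tok
        # Emit every complete sentence currently in the buffer.
        while True:
            m = re.search(r"[.!?]\s", buf)
            if m is None:
                break
            yield buf[: m.end()].strip()
            buf = buf[m.end():]
    if buf.strip():  # flush whatever remains at end of stream
        yield buf.strip()

# Simulated token stream from the LLM:
tokens = ["Good mor", "ning! ", "How did ", "you sleep? ", "I was", " thinking of you"]
print(list(sentence_chunks(tokens)))
# → ['Good morning!', 'How did you sleep?', 'I was thinking of you']
```

Each yielded chunk would be handed to Piper while the LLM keeps generating the next one.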
GWGSYT@reddit
Try the Qwen 3.5 or Gemma 4 models, specifically Gemma 4 E2B or E4B. Make sure you are using their mmproj file: it allows the AI to see images, or in some cases even audio and video. This is not supported by all models, but Qwen 3.5 (text and image) and Gemma 4 (text, image, audio, video) support it. There are multiple versions of the Qwen 3.5 and Gemma 4 models; use the smaller ones, under 8B (about 4B), for larger context or memory. Their 4B is comparable to the original GPT-3 (175B). I advise looking for the Q4_K_M or Q4_K_S versions of the model: you only need larger models for solving math problems or programming, a less compressed model will not help much in conversation, and local models of 7B or less are not reliable for programming anyway. They are great conversational models with vision/image input, and the Gemma models by Google even support sending audio and video, but sending too much audio and video can fill up the model's memory (context), causing it to forget older things, such as the first few messages. Try the Q4_K_M quant; it should allow you to set the context/memory to 64k.
You can also delete old images, videos, and audio, or turn them into text descriptions, to make the model's short-term chat memory go further. These models support tool calls, so in theory they can use the computer on their own, but in practice they struggle to do so. I think you should look into SillyTavern. It's an app for AI roleplay (e.g., giving your AI a character like Batman), but it has a lot of stuff prebuilt: text-to-speech, speech-to-text, image/audio/text/video sending if your model supports it, 3D models to make the AI seem more lively, and built-in chat management to save, view, and load old chats anytime. It is also open source, so you can legally edit it to do anything; if you want to share your changes publicly you must allow others to do the same, but if you are not sharing publicly you are allowed to edit anything about it. It is not like llama.cpp: it lets you talk to the models, but you must have llama.cpp running in the background.
You can use GPT Codex. It works with any ChatGPT free account and does the work for you non-stop for hours: using search and visiting official sources to fix any bugs in its code, turning the app into an exe, optimizing, etc. This lets you just ask it to look up any error or new model, fix things, or add support for new features and optimizations. It can work 4+ hours non-stop until it thinks the work is done, even if you reach 0% usage left; the current task will still get completed. If you are happy with Claude, feel free to use it, but the Codex app can automate a lot of things, like optimization. You can just give it buzzwords like better quantization, lower precision, tool calling, etc., and it will add everything it can in that scenario; you can use it to complete your AI assistant faster.
**NOTE:** Don't use thinking mode unless you are making the model use tools, such as browsing the web on voice command (which they may struggle with), and you've confirmed it works reliably. Thinking fills up the context: a model can generate around 2,000 words of thought just to reply to a simple hello. So please don't use thinking unless you have a use case that requires it.
Optimizations like xFormers, FlashAttention 2, 8-bit, 4-bit, and SageAttention 2 depend a lot on your CPU or system, i.e., whether it can actually support them. It's like a camera app: if your PC does not have a camera, the app won't give it one.
Even though gemma 4 supports audio and video I find the qwen 3.5 model more conversational as it uses emojis and stuff.
If you own a good Android phone (any Android with 16 GB RAM), it will be faster than your laptop; you can use it to run llama.cpp via Termux, though it is moderately hard to set up. If you use a random model-running app from the Play Store or App Store, it might not support your Jetson Nano setup, but since Termux is just an app that launches Linux on your phone, you can do whatever you wish on it. You could do this on an iPhone too, but even the iPhone 17 has about 8 GB RAM, so it may not be faster; with optimization your laptop setup should beat it, depending on the variant you have, though.
Try to have a larger context rather than a larger model: imagine having the best model possible, but it forgets what you said four messages ago because of a small context/memory. This is mostly determined by your hardware.
If you are using a CPU-optimized version of Mistral, you can ask Claude to find a CPU-optimized version of any new model you find; there are people whose whole job is to optimize newly released models within a day or two so they run smoothly on low-end devices.
Use the "heretic", "uncensored", or "abliterated" versions of whichever model you decide on, even if you stick with Mistral. They make the chance of the model saying something like "I can't help you with that" about 0%. Keep in mind this can boost conversational ability but reduce coding or math ability, if you have a use case for that.
Here is a link to various compressed versions of Gemma 4 E4B (it will run at about the same slow speed as Mistral 7B, but is much, much better than it in every way, unless you like Mistral 7B's specific conversational style):
"heretic" version https://huggingface.co/mradermacher/gemma-4-E4B-it-heretic-GGUF/tree/main
normal version https://huggingface.co/unsloth/gemma-4-E4B-it-GGUF
Here is a link to Gemma 4 E2B (small, but still much better than even GPT-3 at about 175B); all the other models I recommended are much better than Gemma 4 E2B.
normal version https://huggingface.co/unsloth/gemma-4-E2B-it-GGUF
I could not find a reliable compressed, uncensored version; I don't want to give you a broken or poor model.
Here is a link to Qwen 3.5 4B. You can try the 9B, but a smaller model will allow a bigger context. You can even use the 2B, but the 0.8B just does not work: you will find reviews about how great a model it is, but it will just forget what you told it, even with a large context. You can test it, though; Qwen 3.5 0.8B will run even on a phone with 4 GB of RAM.
Uncensored version https://huggingface.co/HauhauCS/Qwen3.5-4B-Uncensored-HauhauCS-Aggressive/tree/main
Normal version https://huggingface.co/unsloth/Qwen3.5-4B-GGUF
Feel free to ask any follow-up questions
GWGSYT@reddit
About the Gemma 4 model: it is small but slow even on a good machine, so please try Qwen 3.5 4B. It is fast, like really fast, and only 3 GB in Q4_K_M, so you can easily run it on your 8 GB RAM laptop. BUT it does not support audio; I don't mean your Jetson Nano setup, I mean sending MP3 songs and such. You can send Gemma 4 songs and videos, but it is not that great at working with them. Even though Qwen 3.5 supports images and video (no sending songs), I think you will like Qwen 3.5 over Gemma 4 in every way: it's smaller, faster, and only like 2-3% dumber than Gemma 4. Please try both; I personally like Qwen 3.5 more, though it's not that great at using the computer on its own.
GWGSYT@reddit
I don't know if you have Claude paid; if you do, use Claude Code. If you are working around the free version, try the Antigravity app by Google: it has the Claude model for free, and Gemini is just free. Gemini 3 is worse than Claude, but you can talk with it longer to fix your problems. I recommend GPT Codex with the GPT-5.2 model in x-high mode: you can tell it everything you want, and it will work for up to 4 hours, or however much time it needs, to fix the bug or add something new. Even if you hit your weekly limit while it's working, it won't stop in the middle; it will keep working until you stop it or it thinks the work is done. You can then use another account, even a free ChatGPT one, to continue the work in the Codex app, or wait a week for the limit to reset. IT IS MUCH BETTER THAN EVEN CLAUDE PAID, unless you have Claude working on something in 5 tabs at once. Then again, you can do the same with GPT Codex, but it will deplete your usage faster, and it's not that much better unless you are doing theoretical physics, fixing an operating system, or something actually hard.
GWGSYT@reddit
Please, no matter what model you use, don't use thinking mode unless you are trying to get the model to control the PC via "tool calls", "agent mode", or something similar. Thinking can actually make chats feel unnatural, and it fills up the context much faster; it can cut the usable history from roughly 400 messages to around 50.
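If you do end up on a model that emits reasoning traces, one common trick is to strip them before the reply goes back into the chat history. A sketch; the `<think>` tag convention here is what Qwen-style models use, and other models may mark reasoning differently:

```python
import re

# Matches a <think>...</think> block plus any trailing whitespace.
THINK_RE = re.compile(r"<think>.*?</think>\s*", flags=re.DOTALL)

def strip_thinking(reply: str) -> str:
    """Return the model reply with reasoning blocks removed, so the
    traces don't eat context when the history is replayed."""
    return THINK_RE.sub("", reply).strip()
```

Only the stripped text gets appended to the running conversation, which keeps the context budget for actual chat.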
fuckAIbruhIhateCorps@reddit
please do check out Gemma 4 E2B
Kahvana@reddit
Really cool!
A few technical notes:
Having that said...
The biggest problem is that small models simply don't have the capacity for in-depth emotional conversations. 8B feels (to me at least) like the bare minimum. Mistral (Ministral 8B / Ministral 3 8B) and Google (Gemma 4 E4B) are more optimized for conversational-style chatting than other models.
The context limitation is also a real problem; it gets frustrating fast when the small context keeps cycling out and the bot no longer remembers things from an hour prior.
If you're willing to use APIs, you could do ASR/TTS locally and use a text LLM over OpenRouter. It will be remarkably more intelligent than anything you could run on the limited hardware available.
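A rough sketch of what that hybrid looks like: Whisper and Piper stay local, and only the text of each turn goes out to OpenRouter's OpenAI-compatible chat endpoint (the model name below is a placeholder; pick whatever suits the budget):

```python
import json
import urllib.request

OPENROUTER_URL = "https://openrouter.ai/api/v1/chat/completions"

def build_payload(model: str, system: str, user: str) -> dict:
    # Standard OpenAI-style chat payload; OpenRouter accepts the same schema.
    return {
        "model": model,
        "messages": [
            {"role": "system", "content": system},
            {"role": "user", "content": user},
        ],
    }

def ask(api_key: str, payload: dict) -> str:
    """POST one chat turn and return the assistant's text."""
    req = urllib.request.Request(
        OPENROUTER_URL,
        data=json.dumps(payload).encode(),
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["choices"][0]["message"]["content"]
```

The trade-off, of course, is that this breaks the fully-offline goal and sends conversation text off-device, so it depends on how the OP weighs privacy against capability.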
Having an AI companion is really nice, but consider the problems that might come with emotional attachment to the device and the well-documented mental health implications it can have. But I assume you already considered this before making it.
If you have any questions, I'm very happy to answer and help you tinker! Good luck, and once again, awesome that you're doing this!
BuddyBotBuilder@reddit (OP)
Amazing info. Very technical for me but this whole project is a big learning curve and it’s fun figuring it all out. I will be working through a few of your suggestions! Thanks!
Kahvana@reddit
Yeah that was indeed a lot, sorry for the information density and happy it helped!
ironmatrox@reddit
Absolutely wonderful motivation and project. Are you planning on open-sourcing or productizing this later so it can help others in similar situations? You might also attract more contributions to make it better. I'd be down to help out, but I'm new to this unfortunately. I'll be cheering for you and your husband! Looking forward to posts on how this turns out.
BeneficialVillage148@reddit
Really inspiring project.
On 8GB, stick with Q4/Q3 quants, enable mmap, and use zram—it makes a big difference. Smaller models like TinyLlama can still feel surprisingly good.
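For the llama.cpp side, those tips mostly come down to launch flags. A sketch of a launcher; note that mmap is already on by default in recent llama.cpp builds, and flag names are worth double-checking against `llama-cli --help` on your version:

```python
def llama_cmd(model_path: str, ctx: int = 2048, threads: int = 4) -> list[str]:
    # -c caps the context window, the biggest RAM consumer after weights;
    # -t pins the thread count to your physical cores.
    # Weights are mmap'd from disk by default (pass --no-mmap to disable).
    return ["./llama-cli", "-m", model_path, "-c", str(ctx), "-t", str(threads)]

print(" ".join(llama_cmd("tinyllama-q4_k_m.gguf")))
```

With mmap, unused parts of a Q4 model get paged from disk instead of pinned in RAM, which is exactly why it helps on an 8 GB machine.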
Fair_Ad845@reddit
This is one of the most meaningful projects I have seen on this sub. A few practical suggestions for your 8GB constraint:
Model choice: Gemma 4 E2B (as someone mentioned) is good, but also look at Qwen2.5-3B-Instruct. It is specifically fine-tuned for conversation and runs comfortably in 3-4GB RAM with Q4 quantization, leaving headroom for TTS and whisper.
Memory matters: For a companion that talks to the same person every day, the biggest quality jump is not a bigger model — it is giving the model memory of past conversations. Even a simple approach like appending "Yesterday we talked about X, Y, Z" to the system prompt makes the interaction feel dramatically more personal. You could store conversation summaries in a local SQLite file and load the last few each morning.
TTS latency: Kokoro is great quality but check the latency on your hardware. For real-time conversation flow, Piper TTS is faster and still sounds natural. A 2-second pause between his question and the robot responding will kill the conversational feel.
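One way to sanity-check that latency is to time a Piper call end to end. A sketch; the flag names are assumed from typical Piper CLI usage, so verify them against `piper --help` on your install:

```python
import subprocess
import time

def piper_cmd(model: str, out_wav: str) -> list[str]:
    # Piper reads text on stdin and writes a wav file.
    return ["piper", "--model", model, "--output_file", out_wav]

def speak(text: str, model: str = "en_US-ryan-medium.onnx",
          out_wav: str = "reply.wav") -> float:
    """Synthesize `text` and return wall-clock latency in seconds."""
    t0 = time.monotonic()
    subprocess.run(piper_cmd(model, out_wav), input=text.encode(), check=True)
    return time.monotonic() - t0
```

If a typical sentence comes back well under a second on the ThinkPad, the conversational feel should hold up.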
Power tip: If you are using llama.cpp, set --ctx-size as low as you can tolerate (2048 is fine for casual chat). Context size is the biggest RAM consumer after the model weights.
This is exactly what local AI should be used for. Keep us posted on progress.
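To put a rough number on that, the f16 KV cache in llama.cpp grows linearly with context. A back-of-the-envelope sketch (the layer and head counts are Mistral-7B's published architecture values; other models differ):

```python
def kv_cache_bytes(n_layers: int, n_ctx: int, n_kv_heads: int,
                   head_dim: int, bytes_per_elem: int = 2) -> int:
    """Approximate f16 KV-cache size: 2 tensors (K and V) per layer,
    each storing n_kv_heads * head_dim values per token of context."""
    return 2 * n_layers * n_ctx * n_kv_heads * head_dim * bytes_per_elem

# Mistral-7B uses grouped-query attention: 32 layers, 8 KV heads, head_dim 128.
mib = kv_cache_bytes(32, 2048, 8, 128) / 2**20
print(f"KV cache at 2048 ctx: {mib:.0f} MiB")  # 256 MiB
```

So doubling --ctx-size to 4096 costs another ~256 MiB on this model, which is real money on an 8 GB machine; quantized KV caches (e.g. q8_0) roughly halve that again.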
sunshinecheung@reddit
AI companion... Just use grok?
BuddyBotBuilder@reddit (OP)
lol.
Bingo-heeler@reddit
I am super interested in this project.
BuddyBotBuilder@reddit (OP)
:)
redditorialy_retard@reddit
also have you considered changing models to Gemma or Qwen? their small models are much more powerful and you can decide if you like the tone
chooseyouravatar@reddit
Hearts to you and your husband. Within those limits, you should try https://huggingface.co/janhq/Jan-v1-4B-GGUF . Based on Qwen 3, agentic-friendly, this version (v1) is particularly smart and fits in 2.5 GB. More space for context. More fun. Note: I am not affiliated with Jan, I'm just an end user. ;-)
TheDigitalRhino@reddit
Wow, this is very cool. Strongly consider the Gemma 4 models, as they perform better even when quantized.
In order of importance I would do this.
The -c flag, to strictly limit how much history the model remembers (e.g., -c 2048 or -c 4096).
BuddyBotBuilder@reddit (OP)
More info for you.
5) The Jetson has a 500GB SSD installed that I haven't touched yet. Still using the micro SD for everything. The laptop has a 1TB hard drive and a 2TB SSD.
4) RAM cost is out of budget, because there is no budget. I buy stuff for this project only when I get a gift card from work, find something at a thrift store, or need something that costs $20 or less off Amazon.
I will need to read about mixing models and LM Studio. I've been working only through the prompt windows. Before that I was using Anaconda with limited success.
I appreciate the help!
BuddyBotBuilder@reddit (OP)
Noob, so still dependent on a web browser and Claude or GPT (takes up a lot of RAM, I know). I have the laptop set up to load the AI on startup. But yes, I want to minimize the OS as much as possible; I just didn't quite know how to do that. Thank you for the roadmap!
traveddit@reddit
https://www.youtube.com/watch?v=l5ggH-YhuAw
I remember seeing this video recently and maybe this person's project has some parallels to help you with yours.
Echo9Zulu-@reddit
Hey, so what gen is your i5? Great project!
Individual_Table4754@reddit
Hi, sorry, I don't have much advice. I could only think of suggesting Kokoro over Piper TTS? It is very small and sounds a lot more natural (at least to me). Also, inference of a "big" model like Mistral 7B could be too much for a CPU (most CPUs, really), resulting in not very pleasing inference speeds. Could you consider the "Bonsai" models? (They're optimized for CPU inference, as far as I understand.) Or maybe the new Gemma 4 models (the quantized E2B version by Unsloth). You can find these models on Hugging Face.
One last thing: I don't know your level of expertise, but should you encounter any major obstacle, just stick to the simplest solution, the one that you find works best.
Anyway, asking around (like you did here) should get you a lot of help and inspo. And sorry for my English; it's not my primary language. Good luck with this project!
Shayps@reddit
Kokoro is more realistic, but it’s also a lot slower on resource constrained environments. Piper is the right call here IMO.
Far_Falcon_6158@reddit
Damn, you are a great person. I love this. If you live in Ohio I might be able to donate some hardware.
Far-Low-4705@reddit
Honestly, for something like this it might be worth shipping the hardware, so long as it's not insanely expensive to do so; I can't imagine it costing more than 20 bucks.
BuddyBotBuilder@reddit (OP)
Thanks! Sadly I'm over on the west coast. Though I'm having fun just scavenging parts. Enjoying the whole scrappy thing. :)
Far-Low-4705@reddit
Use Gemma 4 E4B.
This model has native text, vision, and audio inputs, while supporting native tool calling and advanced reasoning.
Native audio input is probably very useful for this application. llama.cpp doesn't support audio input for Gemma yet, but it probably will, so I would keep an eye out for it.
But Mistral 7B is very outdated. At least switch the model to Qwen 3.5 4B or Qwen 3.5 9B.
habachilles@reddit
I love this. I will do anything I can to help. I have been experimenting similarly and have an awesome memory system, but 8 GB of RAM might be rough.
lochlainnv@reddit
I recently made a "low vram" voice agent setup (linked below), however your project has significantly higher constraints.
I am willing to help you with this project actively if you are willing to open source it to also help others. I have built my own agent harnesses and have a programming and robotics background among other things.
First, I suggest looking at the Qwen 3.5 small models for practical reasons, starting at Q4 with llama.cpp. Try one of these:
https://huggingface.co/unsloth/Qwen3.5-4B-GGUF
https://huggingface.co/unsloth/Qwen3.5-2B-GGUF
https://huggingface.co/unsloth/Qwen3.5-0.8B-GGUF
For a companion bot the speed of inference will matter, so you need to run whatever runs fast enough to feel interactive.
Piper TTS is good. I would also suggest looking at Kokoro 82M, but I'm unsure if it will be a good fit... Piper is very light.
An ideal companion bot will need to have some kind of memory and some access to tools for computer use and should be able to hold a decent conversation. I imagine it chatting, reading the news or books, possibly controlling any smart electronics around the home.
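That tool layer can start extremely simple. A hypothetical dispatch sketch, where the model is prompted to answer with a line like `TOOL: time` whenever it wants live data (the tool names and table here are illustrative, not any real agent framework):

```python
import datetime

# Illustrative tool table; real entries might hit a local RSS cache,
# a smart plug, GPIO on the chair base, etc.
TOOLS = {
    "time": lambda: datetime.datetime.now().strftime("%H:%M"),
    "headlines": lambda: "(fetch from a local RSS cache here)",
}

def run_tool(request: str) -> str:
    """Map a `TOOL: name` line from the model to a result string,
    which the harness feeds back into the conversation."""
    name = request.removeprefix("TOOL:").strip()
    fn = TOOLS.get(name)
    return fn() if fn else f"unknown tool: {name}"
```

Small models are unreliable at structured tool calls, so keeping the protocol this dumb (one keyword per line) tends to work better than full JSON function calling at this size.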
It should be possible to squeeze this kind of performance out of these LLMs, although the companion won't be the sharpest tool in the shed.
References: https://github.com/lvwilson/voice_agents https://github.com/lvwilson/agents
Billysm23@reddit
I wouldn't choose Mistral 7B Instruct for the model because it's kind of outdated. To maximize efficiency, you can go for the new TurboQuant.
BuddyBotBuilder@reddit (OP)
Ran it by Claude (I'm a noob at all this, so I have to depend on some AIs while I learn Python and manage the hardware). It says: TurboQuant is for long context, while this bot's conversations will be short; it's not mainline, so it may have breakage I wouldn't know how to handle at my level; and I'm using Ollama, which sits on top of llama.cpp, so I would have to replace Ollama and set up some kind of custom fork, which is beyond my skills currently. But thanks! This is exactly the kind of stuff I'm looking for.
Billysm23@reddit
I've sent you a message
fulgencio_batista@reddit
You might have good luck with Qwen 3.5 or Gemma 4! They offer some smart models that fit your constraints but are also fun to have a conversation with.
Billysm23@reddit
I agree with you.