offline companion robot for my disabled husband (8GB RAM constraints) – looking for optimization advice

[-]

Stepfunction@reddit

For your stack, given your limited specs, I would recommend the recently released Gemma 4 E4B or E2B and Kokoro TTS for the most bang for your buck.

The real trick is being able to interrupt the robot when it's talking and to be able to store the long-term context in some sort of RAG setup so it doesn't forget everything all the time.

[-]

Gemma required 9.5:(. So thinking I just really need to get better hardware. Gunna save up for Minisforum um890 pro barebone pc. I’ll have to get the ram separately but I’m on a budget! One component at a time. Will take a few months but until then, I’ll just work on the physical body. :)

[-]

Stepfunction@reddit

Be sure you're using a quantized model. You should get your Gemma from here:

https://huggingface.co/bartowski/google_gemma-4-E2B-it-GGUF/tree/main

These are all different degrees of quantization in the GGUF format. You'll likely want to start with the _IQ4_XS one (3.32GB), which should fit within your hardware constraints.

9.5GB is for the full, unquantized model in BF16 precision, which is how it is trained, but not how it is used for inference.

[-]

BuddyBotBuilder@reddit (OP)

Months later….

[-]

Stepfunction@reddit

Glad you got it going!

[-]

BuddyBotBuilder@reddit (OP)

I will absolutely try this. Will add the ability for people to interrupt the bot while it’s taking. Right now Buddy stores memory in ~/buddy_memory.json on the Jetson. It saves facts, appointments, medications, reminders, and OCEAN personality scores. However it’s pretty basic — just keyword extraction, no semantic understanding.

[-]

MeowChamber@reddit

I think you might wanna try front-end tool like SillyTavern or LettuceAI (this one has llama.cpp built in the backend so you don't need to install the backend separately like ST). While it mostly aims for roleplaying, I think you can use this tool the way your husband needs it. Both have memory systems that could help remembering more.

[-]

Mkengine@reddit

Gemma 4 is already a good recommendation, and maybe this one is too large, but at least from the conversational requirements this could also be interesting:

https://huggingface.co/nvidia/personaplex-7b-v1

[-]

TheDigitalRhino@reddit

Also, it looks like you have a Claude subscription, consider giving Claude Code a try, it will do the coding for you. It does use your tokens quickly but will help code things up quickly. Going back and forth between the browser will just slow you down.

[-]

BuddyBotBuilder@reddit (OP)

Nope free version. I work around the token limits by having it summarize the progress. Then I delete the conversation and start a new one using the summarization and by giving it a copy of all my py files. Right now it’s not a lot of coding luckily. It’s just a lot of editing of what’s there. Trying to get it not to go off the rails of the conversation. The files dont need much changed when switching models.

[-]

Exciting-Mall192@reddit

I would recommend try using Gemini on Google AI Studio with the summarization done by Claude when you're out of message. Is the companion only for conversational, btw? I agree with everyone's recommendation on trying newer models like Qwen 3.5, Gemma 4, or the Bonsai model

[-]

BuddyBotBuilder@reddit (OP)

No subscription API, only because we loose power out here in the woods a few times a year. Some times for a few days. That’s when he would need the bot the most. So it a hard rule that it’s must be able to function without internet access for extended periods of time. We don’t even get cell phone reception unless our router is working. We have a small generator that runs the pellet stove, Fridge, tv and a dvd player and could keep the bot charged. No easy outs for this project :). That’s what makes it fun!

[-]

Stepfunction@reddit

That makes sense. Another option, similar in spirit would be that if you have a powerful desktop computer, you could host the LLM on that and make API calls from the laptop to reduce the power consumption and potentially host larger models or get better inference performance.

[-]

Bingo-heeler@reddit

I am super interested in this project.

[-]

BuddyBotBuilder@reddit (OP)

:)

[-]

redditorialy_retard@reddit

also have you considered changing models to Gemma or Qwen? their small models are much more powerful and you can decide if you like the tone

[-]

BuddyBotBuilder@reddit (OP)

Tried Gemma this week. The model requires 9.5 GB ram. :(. So looked at getting upgrading the ram, but that like $150+. That or upgrading to a better pc. It’s out of my budget range for a while as I can only put about $50 a month into this project but I’m thinking the next step would be

MINISFORUM UM890 Pro Barebone with Mini PC, AMD Ryzen 9 8945HS Mini Computers,8K Quad Display HDMI/DP1.4/USB4 х 2, AMD Radeon 780M/Dual LAN 2.5(NO RAM/SSD/OS)

It’s $463 on Amazon but it will get me closer to a companion type ai personality and will be able give these little AI’s some breathing room to “think”. So that’s about 4-5 months away. Until then I will just work in building the physical bot. Got an old 3D printer. The PLA was never pulled out so it all dried out in feeder lines and got brittle. So need to do a little cleaning there. I knew better than to let it sit that long but, you know, busy life. Just a reminder to myself to take care of my equipment! The old saying. If you don’t make time for preventative maintenance, your equipment will do it for you. lol.

[-]

redditorialy_retard@reddit

get the quantised version and the E4b. Q4 needs like 2GB iirc

[-]

ItsFlybye@reddit

What a great project for the hubby =). Lots of great replies, and I'm a bit of a noob myself so there isn't much I can contribute. One thing I didn't see mentioned: Have you considered a PTZ camera?

There is a PTZ camera around $50 I caught in a Medium article someone programed to have their AI interact with. It could allow the AI buddy to look around without having to move its entire self around to look at your hubby. This could provide additional interaction by moving more often at what he is doing and asking questions/comments on his actions.

[-]

BuddyBotBuilder@reddit (OP)

That’s a fun idea. Don’t only have a local camera on the not. Have one that observes the room as well. I might add that later in the project. The trick is I have a few limits I am trying not to cross for this project: main one is it’s local, not dependent on the internet as we have power outages a few times a year that have lasted up to three days. It can’t depend on networking in the house as well. :)

[-]

darkgamer_nw@reddit

I recommend you take a look at this repository; it basically has most of what you want to do already set up

https://github.com/brenpoly/be-more-agent

[-]

Not_your_guy_buddy42@reddit

Running a local voice bot since over a year (40k chats now) thought a lot about aspects behind artificial companionship in a jank situation (jank is good), dm if you like

[-]

daLazyModder@reddit

I would look into moonshine asr its similar to whisper but better for edge cases like this.

https://github.com/moonshine-ai/moonshine

You could also try something like phi mini moe for the llm

https://huggingface.co/microsoft/Phi-tiny-MoE-instruct

It is a moe model that is 3.8b parameters total with 1.1b active

being moe means the model runs as fast as a 1.1b with 3.8b knowledge, though phi's personality i've read can leave a lot to be desired... (its by microsoft and was reported to be super censored for even basic stuff)

piper tts is fast and good on cpu and low latency, kokoro would probably work as well, but you can actually do halfway decent voice cloning with pocket tts on cpu bit more of a pain to setup but could be done in your voice as the companion if your spouse would like that

https://huggingface.co/KevinAHM/pocket-tts-onnx

I kind of hate to recommend upgrading the laptop but I suspect its running ddr4

https://www.ebay.com/itm/204766825540

16gb is 100$ which is ridiculous (ram pricing is currently crazy) but might be worth the investment to run a slightly larger model like granite 4 tiny which is a 7b total 1b active I believe

(I would personally check if you have 2 sticks of ram in the laptop, if it says 8gb in task manager but there is only 1 ram stick then you could double your ram capacity for about 50$ buy buying a cheap stick off ebay, just make sure its ddr4 not ddr5 or something older)

https://huggingface.co/ibm-granite/granite-4.0-h-tiny

is the granite model I mentioned you might look into.

[-]

braydon125@reddit

You need gpu dude

[-]

unjustifiably_angry@reddit

And a fleshlight.

[-]

braydon125@reddit

Nvidia personaplex is an incredible conversational model

[-]

HeyEmpase@reddit

Have you thought about using lightweight LLMs like Phi-3-mini (3.8B) or TinyLlama (1.1B) quantized to 4-bit? They can work well on 8GB of RAM with CPU-only inference and are capable of handling basic dialogue, reminders, and command parsing offline. I'm curious about what sensors or actuators you plan to integrate! Voice input latency and response naturalness can really impact the user experience, so it's worth considering those factors.

Such a heartbreaking and useful case. I think most of code nowadays without any end goal, but this... please continue!

[-]

BuddyBotBuilder@reddit (OP)

Thanks for your interest and questions! have tried all three. It’s trying to find the balance between speed and reason that’s the trick. The one I’m using now is a little slow but is better at actual conversation. Going to try the Gemma 4 tomorrow, haven’t tried that one yet.

Have a bunch of basic sensors that come with uno kits. Have a x-box connect camara that has some fun stuff but it’s a little power hungry for this setup. Have a few web cams lying around. Dream would be LiDAR of course but I’m probably going to have to just use the web cams and the proximity sensors.

Body: base small power chair bases with two 24volt batteries. Got a few step down power converters so I can split the power up and not overload anything. Probably need to devise some kind of fuse box as well. To the base I will be attaching the body to main support poll using three screw actuators and some gide polls so the bot can bend a little and emote a little with its movement. I would like for the bot to be able to help pick things up off the floor he drops them.
(In his chair he cant pick stuff up off the floor. He’s taken a few nose dives out of his chair trying. I get home from work and find out he’s been in the floor for a few hours. It’s terrible. He’s to proud to call me even though he knows I would come home right away. So for now I have metal taped to items so he can use an extendable magnet sticks to get stuff he drops)

Going to use the 3D printable parts from the InMoove robot. (Open source android body).specifically the neck joint. I just need to expand the print about 200% so it’s the right size. Will print the same joint a second time for the head at normal settings. So that’s 6 actuators currently. The power chair base has two motors for forward, back, and turn.

Arms, not quite there yet. Lots of options out there since I have a 3D printer.

Head, got a cool looking helmet and a led matrix. Will put the led matrix in the helmet visor and give it some expressive eyes. Going to try for eyes like eve from movie Wall-e

Personality, I incorporated the open source “ocean”personally assessment tool. It’s a solid diagnostic tool. The bot has instructions to try and slip in a question or two each day, record the response and then use the assignment tools findings to tailor it’s interactions with him.

Anyway, thats what I’ve come up with so far. Right now just trying to get step one done, and see if I can get this equipment work fast enough to simulate normal conversation speeds so it feels natural. Might not be possible, but that’s why I came here :)

[-]

MagoViejo@reddit

Wow , you really are something else! I'm planning on something for the next decade as I will probably be disabled and you are giving me some ideas.

About the LIDAR thing , maybe you can scavenge pieces from those cheap roomba clones from chin. Their mecanic parts give way before their electronics, you may find a treasure trove there.

As for the llm , I saw in this sub something being done with old mobile phones, maybe separating different taks to different units may make it more responsible/capable.

Anyway I'm just starting to learn to look under the hood with LLMs

[-]

HeyEmpase@reddit

Didn't expect such an elaborate answer! That sounds like a really thoughtful build. Since you’ve already tested several small models, one promising next step might be fine-tuning a smaller model for your exact use case instead of chasing bigger models. A 3B–4B model tuned for your husband’s style of conversation, reminders, routines, and preferred tone may feel better than a larger general model while using much less RAM and running faster.

Tools from unsloth.ai make LoRA fine-tuning much easier, and people do run it in Google Colab [1]. Even a small dataset of example conversations, reminders, reassurance, and daily check-ins could help. For this project, a fast, warm, reliable specialized model may beat a slower “smarter” one. May you be happy and by doing so making happy your husband.

[1] https://unsloth.ai/docs/models/gemma-4/train

[-]

ai_guy_nerd@reddit

This is such a thoughtful project. For 8GB on Mistral 7B, you've basically hit the ceiling unless you go super aggressive with quantization. Some practical wins:

Quant approach: Q3_K_S or Q4_K_S gets you back maybe 500MB-1GB of headroom vs your current setup. You lose some reasoning quality but the conversation stays coherent for companionship use.

Swap trick that actually works: Set up zram (not swap to disk). On Linux: modprobe zram, echo 2G > /sys/block/zram0/disksize. It's compressed in-memory, doesn't thrash the disk, and buys you another gig of effective headroom. Noticeable latency hit but not terrible.

Smaller model wins: PhiLM 3.8B or TinyLLaMA 1.1B run on the i5 without the Jetson and still feel conversational for companionship. Less grounded knowledge, but for consistency and warmth (which matters for your use case), sometimes smaller + responsive beats bigger + slow.

The real win is bundling everything together: route lightweight conversations to PhiLM directly, offload longer chats to Mistral via the Jetson. You're already doing the hard part with multidevice orchestration.

[-]

Erwindegier@reddit

Does it have to be that hardware? Can you find a used Air M1 or Mini M1 with 16gb? It’ll allow you to run way better models. I’m actually building a companion bot to run on those specs. Wrote my own vector DB for memory on top of SQLite and am currently looking at index2tts for text to speech as it sounds way better than piper. You can clone your own voice and pass emotion to the speak prompt.

[-]

brickout@reddit

Awesome! Have you checked if the thinkpad has expandable RAM? A lot of those models do. You could likely get another 8GB for $30 or less.

If you want a slightly stronger laptop, I'll bet I have one I could donate...but I read that you are enjoying being scrappy. I'm the same way.

This project is very cool. Are you powering it directly from the wheelchair power?

Some low resource thoughts:

my fedora laptops HATE hitting zram for some reason. If you're maxing out your RAM and getting random hangs, maybe look at that. I disabled zram and instead made a classic swap file on my disk and no more hangs.

Gemma4 is absolutely incredible for its RAM usage. I have found the smaller sized ones to be pretty verbose and chatty.

very cool project.

[-]

BuddyBotBuilder@reddit (OP)

Yes, checked. Only a single ram slot. Someone recommend two 4gig chips might work better than one 8 so took a look and there’s

only one slot.

Thanks for the laptop offer! ;)

So laptop has its own battery, but yup, everything else will be running off of the power chair batteries.

[-]

brickout@reddit

Gotcha. Figured you would have checked with how comfortable you are putting all of this together. Good luck with the rest of the build! Will like to see updates in the future.

[-]

while-1-fork@reddit

I suppose that the thinkpad won't have a NVME SDD? If it does you will likely do well with MoE models larger than it would seem reasonable through memory mapped files. Maybe even worth testing even if you have a slow hard drive. I have not tried them but the Marco Mini and Marco Nano models may be a good idea as they are MoE with very good benchmark scores for their size (may or may not translate to real use) but with a tiny amount of experts activated so they should be fast even on constrained hardware and only the active weights really need to be on memory simultaneously.

What is almost a must is using a modern model with hybrid attention whatever the size of model you settle with. The Qwen 3.5 line up is very good. Nemotron Nano and Gemma 4 are also strong contenders. Even Qwen 3.5 0.8B would be an improvement over Mistral-7B and way faster with less resource use.

I quants offer better bang for their weight buck than k-quants at the same bits (not available over 5 bits but you will likely run 4 or 3 bit). If you use ik-llama there is also i-k quants that are even better.

You may consider inverting your setup and running the LLM on the Nano while whisper runs on the pc through whisper.cpp . Specially if the Nano has a SSD for the memory mapped MoE I talked about.

As for zram. Given your cpu , you likely don't want to use zstd but lzo. zstd is often recomended because it can reach a higher compression ratio but it is way slower even on much stronger cpus. There are other algorithms that are slightly faster than lzo but offer worse compression and are likely not worth the trade. You also want to set vm.page-cluster=0 (the number of blocks that it reads ahead, in a hard drive swap it helps, here it often causes uneeded decompressions for almost no troughput gain and kills latency and cpu use). And when using z-ram you want to swap as early as possible so set vm.swappiness=200 (Even with that set it won't really begin swapping until your ram is about 80% full , early swapping results in less thrashing and distributes the cpu use over more time). Also disable swap partitions and swap files and set the z-ram swap to be 2x the system ram. I am running that on a 16GB machine (and a 24GB gpu) with OpenClaw + llama.cpp running Qwen 3.5 35B A3B in IQ4 + SearXNG + full Chrome on a container for OpenClaw to use + Yolo11 nano running on cpu filtering frames of a camera for images containing my cat + Claude code and everything runs great. The 3090 does a lot of heavy lifting of course but z-ram helps a lot too as rarely used stuff gets pushed into it and even some frequent usage won't fully kill performance. I don't use it as a main machine, but only as an OpenClaw + Claude code machine. But I have been using z-ram for many years and it is great. I have not swapped to disk in maybe a decade, my main pc has 128GB and I still run z-ram on it and have done crazy things that required 300GB+ which would have been impossible swapping to disk.

[-]

ab2377@reddit

please swap mistral 7b with qwen3.5-4b q4. its insanely intelligent for its size you will love it, also much faster. do you build llama.cpp on your pc yourself or download from releases section from github? can i suggest you install gemini cli free version in case you want to write quick scripts or building llama.cpp without wasting time. its really good the free version.

good luck with your project. post updates on this sub as you move forward with it. lots of good luck and wishes.

[-]

Porespellar@reddit

You might want to look towards the folks at Stanford who built the open source Mobile Aloha robot for some inspiration on your project

https://mobile-aloha.github.io

They are west coast like yourself. They’ve pretty much open sourced all the plans and everything needed to build the working system.

[-]

JohnTheNerd3@reddit

that's such a nice thing to do! for the more technical side of things, i found Pocket TTS to be extremely fast with good quality, while still not requiring a GPU. it also supports voice cloning, so the assistant can have the voice of your choice!

while streaming on CPU, i can typically get the first word output within 200ms, and one of the projects support an OpenAI-compatible API so most tools "just work" with it. i personally use it for Home Assistant and am quite happy with it.

[-]

DevilaN82@reddit

As most of RAM would be taken by model weights, that are somewhat random numbers, and thou hard to compress, then zram will be almost no gain here. In fact it might harm performance when those weights would be "compressed" (cpu power used) and still take the same amount of place.

You should try using mmap (this maps part of hard disk as a memory addresses), so instead of reading from disk, writing to RAM, compressing, decompressing, even swapping (still going to disk back and forth). It would read from disk and use those.

This hardware is very very low spec for LLMs. You could get away with adding some knowledge base. Consider using wikipedia ZIM snapshot and allow your model to search / browese it to enrich its context and knowledge.

Also I would use a better model. Mistral-7b-instruct is IDK... 2 years old? Newer models are better with the same size. Use qwen3.5 or Gemma4 (whichever variant fits you device). Unsloth's models are great value for it's size - you should try Unsloth Dynamic quants. I would not go below Q4, but hey - maybe Q3 will still be useable for your usecase.

Good luck and please post a video showing how your current setup is working!

[-]

Previous_Escape3019@reddit

this is really cool. hope it works out

[-]

AnonymZ_@reddit

That’s cute and helpful

[-]

ssalvo41@reddit

I'm pretty experienced with Jetson stuff, so if you ever run into any issues, I'd be happy to help

[-]

brown2green@reddit

Most (all?) small conversational LLMs are going to feel very shallow very quickly as companions. I'd reconsider your idea, even if it's well-intentioned.

[-]

CaptnSauerkraut@reddit

I have nothing to add besides what Stepfunction said.

Just wanted to say that this is an awesome project and you sound like a great person. Keep us updated on the progress. Open sourcing the build once it is somewhat stable could help many more people.

[-]

MEGAnALEKS@reddit

I would try turboquant for more context window

[-]

Shayps@reddit

We can build something wonderful, but being this constrained will require us to be very creative.

Faster-whisper on the nano is a great design choice. Piper is as small and fast as you’re going to get it too. Good call on both of those. Latency is great for voice.

For the LLM we’re going to need to add memory, manage context, and ideally get e2e voice latency down to around a second.

I can help you, we can make this work. I build a lot of these that do all kinds of things. Can you DM me? I will likely want to send you some (free) hardware.

[-]

BuddyBotBuilder@reddit (OP)

Following some advice I got from a few people here and am going to try Gemma 4 and kokoro tts tomorrow to see if it improves the bots conversation ability as well as add the ability for the person talking to interrupt the bot while it’s talking. Lots of good advice popping up! Thanks for the offer of free hardware, that’s really sweet of you. I’ll pass for now as I want to see what I get out of the hardware I have collected already. I have collected a lot of stuff for this project, probably too much. lol.

[-]

Spicy_mch4ggis@reddit

I am fully invested in getting something to run on what you have. The add more compute answer, while clearly the optimal solution, do not result in something truly beautiful: getting buddybot to run on a wheelchair powered potato

[-]

Spicy_mch4ggis@reddit

What are the processing time considerations for everything that isn’t conversational? I mean specifically like if he asks a question, is it ok to wait a little bit for a better answer or is it required to immediately follow up? I ask because I have done some work on edge hardware that doesn’t require “real” real time processing and there are some processes that can be ran in sequence to fit more on less hardware.

[-]

pot_sniffer@reddit

Really cool project. A few things that might help:

The Jetson may be underused, depends which one you have. If it's an Orin, it's worth testing running the LLM there instead of the ThinkPad CPU. If it's a standard Nano (2GB/4GB), probably not. Whisper will already be eating the VRAM and there won't be enough left for a usable LLM.

Memory will matter more than model size for this use case. For a companion talking to the same person every day, the biggest jump in quality probably isn't a bigger model, it's giving it memory of past conversations. Even something simple: store a short daily summary in a text file, load the last few days into the system prompt. "Yesterday we talked about X, medication is Y, he mentioned Z." For what you're building this will feel more personal than any model upgrade. One important caveat: put a hard limit on how much you inject, last 3 days maximum, discard the rest. Context window fills fast on a CPU and inference slows noticeably as it grows. Without a cap you'll hit 30-second response times on a simple good morning within a couple of weeks.

On quant, more bits isn't always faster or better in practice. For CPU inference specifically, Q6_K or Q5_K_M will usually give you noticeably faster generation than Q8 for no meaningful loss in conversational quality. The speed difference on a CPU is real; the quality difference in casual conversation is hard to notice.

Streaming TTS will make a big difference to how natural it feels. Rather than waiting for the full response before speaking, pipe the LLM output to Piper in sentence-sized chunks, wait for a period, comma, or question mark, then send that chunk. Start speaking the first sentence while generating the second. If you send raw tokens as they stream the prosody will sound broken. Sentence boundaries is the key step most people miss.

[-]

GWGSYT@reddit

Try the Qwen 3.5 or Gemma 4 models, specifically Gemma 4 e2b or e4b. Please make sure that you are using their mmproj file it allows the ai to see images or in some cases, even audio and video this is not supported by all models. Qwen 3.5 (text and image only) and Gemma 4 (text, image, audio, video) support it. There are multiple versions of the qwen3.5 and gemma 4 models use the smaller ones, smaller than 8b, about 4b for larger context or memory. Their 4b is comparable to the original ChatGPT 3 which is 175B (not gpt 1 or gpt 2) model released in 2023 on the chat gpt website. I advise that you look for the Q4_k_m or Q4k_s versions of the model you only need larger models to solve math problems or doing programming using a more uncompressed model will not help in conversation that much and local models that are 7B or less are not reliable for programming anyway. They are great conversational models with vision or image input and the gemma models by google even support sending audio and video, but sending too much audio and video can fill up the model's memory, causing it to forget older things. such as the first few messages. Try the q4_k_m quant, it should allow you to set the context or memory to 64k

2k context is roughly a short-term memory of 20 messages.
4k context is a solid medium-term memory of about 40 messages.
8k context can be about 80 to 100 messages.
16k context is a deep long-term memory of roughly 160 messages.
64k context is large and would likely not be saturated properly by text alone, holding over 600 messages unless you consistently send in audio, images, or video.

You can also delete or turn old images, videos and audio into text descriptions to make its short-term chat memory bigger. These models support tool calls so theoretically they can use the computer on their own but in practice they struggle to do so. I think you should look into silly taver it is an app for ai roleplay such as giving your ai a character like batman but it has alot of stuff prebuilt like text-to-speech, speech-to-text, image, audio, text and video sending if your model supports it, 3d models to make the ai seem more livly and built-in chat management to save, view and load old chats anytime. It is also open source so you can legally edit it to do anything but if you want to publicly share it you must allow others to do the same but if you are not sharing it publically you are allowed to edit anything about it. It is not like llama cpp it can allow you to talk to the models but you must have llama cpp running in the background.

You can use gpt codex it works with any chat gpt free account and does the work for you non stop for hours by using google, visiting official sources to fix any bugs in its code, tuning the app into an exe, optimizing etc this will allow you to just ask chat gpt to use webserch or google to look up any error, new models and fix them or add support for new features and optimisations. It can work for 4+ hours non stop until it thinks that the work is done even if you reach 0% usage left. The current task will get completed but if you are happy with claude fell free to use it but the Codex app can automate alot of things, like optimisation. You can just give it buzz words like better quantisation, lower presicon, tool calling, etc and it will add all the things it can in that senario you can use it to complete your AI assistant faster.

**NOTE:** Unless you are making the model use tools such as browse the web on voice command (which they might struggle with) but if you think that it works reliably then only use thinking, thinking will fill up context such as generating about 2000 words of though just to repily to a simple hello so please dont you thinking unless you have a usecase that requres it.

Optimizations like xformer, flash attnention 2, 8bit, 4bit, sage attention2 depend alot on your cpu or system that is whether it can actually support it like camera if your pc does not have a camera a camera app wont give it a camera

Even though gemma 4 supports audio and video I find the qwen 3.5 model more conversational as it uses emojis and stuff.

If you own a good android or any Android with 16gb ram it will be faster than your laptop you can use it to run llama cpp using Termux but it is moderately hard to setup if you use any random app to run the model from the play store or app store it might not support you jetset nano setup but as Termux is just an app that can launch liniux on your phone you can do what ever you wish to do on it. You can do this on an iPhone but even iphone 17 has like 8gb ram so it will may be not be faster but with optimization you laptop setup should beat it depends what varient you have though.

Try to have a larger context rather than a larger model imagine if you have the best model possible but it will forget what you said 4 messeges before due to having a small context or memory. This is mostly determined by your hardware

If you are using a cpu optimized version of mistral you can ask claude to find a cpu optimized version for any new model that you find there are people whose whole job is to optimize newly released models within a day or two to run smoothly on low-end devices

Use the "heretic" or "uncensored" or "Abliterated" modes of any model you decide to use even if you want to use Mistral. Use this version, it makes the chance of the model saying something like "i cant help you with that" about 0% but keep in mind it can boost its conversation abilities but reduce its coding or math ability if you have a use case for that

Here is a link to various compressed versions of gemma 4 e4b (Will run at the same slow speed as mistral 7b but much much better than it in every way unless you like the specific style of how mistral 7b talks.)

"heretic" version https://huggingface.co/mradermacher/gemma-4-E4B-it-heretic-GGUF/tree/main
normal version https://huggingface.co/unsloth/gemma-4-E4B-it-GGUF

Here is a link to gemma4 e2b (small but much better than even gpt 3 (about 175B) though) all other models I recommended are even much better than gemma 4 e2b

normal version https://huggingface.co/unsloth/gemma-4-E2B-it-GGUF

I could not find a reliable compressed, uncensored version I don't want to give a broken or poor model

Here is a link to qwen 3.5 4b you can try 9b but a smaller model will allow you to have a bigger context you can even use 2b but 0.8b just does not work you will find reviews about how it is a great model but it will just forget what you told it even with a large context you can test it though qwen 3.5 0.8b will run even on a 4b ram mobile

Uncensored version https://huggingface.co/HauhauCS/Qwen3.5-4B-Uncensored-HauhauCS-Aggressive/tree/main

https://huggingface.co/unsloth/Qwen3.5-4B-GGUF Normal version

Feel free to ask any follow-up questions

[-]

GWGSYT@reddit

About the gemma 4 model it is small but slow even on a good machine please try qwen 3.5 4b it is fast like really fast and only 3 gb in q4km so you can easily run it on your 8gb ram laptop BUT it does not support audio I dont mean your jetson nano setup i mean sending mp3 songs and stuff you can just send gemma4 songs and videos but it is not that great at working with them i think that even thoug qwen 3.5 supports images and video no sending songs you will like qwen 3.5 over gemma 4 in every way its smaller, faster and like 2-3% dumber than gemma 4 please try both i like qwen 3.5 more personally not that great at using the computer on its own though

[-]

GWGSYT@reddit

I dont know if you have calude paid if you have use calude code if you are working around the free version try using the antigracity app by google it has the claude model for free and gemini is just free gemini 3 is worse than claude but you can talk longer with it to fix your problems. I recomende that you use gpt codex with gpt 5.2 model in x-high mode you can tell it all that you want and it will work for upto 4 hours or how much time it needs to fix the bug or add something new even if you hit your weekly limit while its working it wont sop in the middle it will keep working till you stop it or till it thinks the work is done you can use another account even free one of chat gpt to continue the work in codex app or wait for a week for its weekly limit. IT IS MUCH BETTER THAN EVEN CLAUDE PAID as long as you dont have like claude working on sometihng in 5 tabs but then again you can do the same with chat gpt codex but it will deplete your usage faster and its not that much better unless you are doing theoratical physics or fixing an operationg system or something actually hard to do

[-]

GWGSYT@reddit

Please no matter what model you use dont use thinking unless you are trying to get the model to control the pc using "tool calls", "agent mode" or something similar thinking can actually make chats feel unnatural and it will fill up the context faster much faster it will reduce if from 400 messages to like 50

[-]

fuckAIbruhIhateCorps@reddit

please do checkout gemma 4 e2b

[-]

Kahvana@reddit

Really cool!

Some notes for technicalities:

Running from RAM? Is it DDR4 or DD5? On soldered single-channel DDR4-2400, I struggle to run and fit Qwen 3.5 2B at Q4_K_S with vision encoder and 8k context (kv offload) at \~2t/s.
Make sure your processor at least supports AVX 2 (mine doesn't) if running from CPU, and try the vulkan backend on the iGPU which can be faster in some cases.
You can gain a decent chunk of performance (\~30%) by running 2 dimms (2x4GB) instead of 1 dimm (1x8GB) in case you don't already.
On slow processors, I found Qwen 3.5 / LFM 2 / Granite 4.0 H (basically RNN based models like gated deltanet or mamba) to perform much faster than SWA (Gemma) / GQA (Qwen 3) based models.
You can save some memory by enabling Flash Attention with K and V cache set to Q8_0.
Mint is heavy on resources, try LDXE (http://www.lxde.org/) or go headless (even better!).
Koboldcpp and llama.cpp are really neat. The former is easier to deploy than the latter, but takes longer to update to newer models.
Kokoro TTS is very easy to set up with Koboldcpp as it has build-in support.
Kitten TTS is faster, but requires thinkering (like running https://github.com/devnen/Kitten-TTS-Server ).
Parakeet is much faster than Whisper for ASR, give it a look!
Whisper is alos supported by Koboldcpp build-in.

Having that said...

The biggest problems is that small models simply don't have the capacity to have in-depth emotional conversations. 8B feels (to me at least) the bare minimum. Mistral (Ministral 8b / Ministral 3 8b) and Google (Gemma 4 E4B) have more optimized for conversational-style chatting than other models.

The context limitation is also a real problem, it will get fustrating fast when the small context keeps cycling out, no longer remembering things from the hour prior.

If you're willing to use API's, you could do asr/tts local and use a text llm over openrouter. It will be remarkably more intelligent than what you could run with the limited hardware available.

Having an AI companion is really nice, but consider the problems that might come with emotional attachment to the device and the well documented mental health implications it can have. But I asume you already considered this before making it.

If you have any questions, I'm very happy to answer and help thinker! Good luck and once again awesome that you're doing this!

[-]

BuddyBotBuilder@reddit (OP)

Amazing info. Very technical for me but this whole project is a big learning curve and it’s fun figuring it all out. I will be working through a few of your suggestions! Thanks!

[-]

Kahvana@reddit

Yeah that was indeed a lot, sorry for the information density and happy it helped!

[-]

ironmatrox@reddit

Absolutely wonderful motivation and project. Planning on open sourcing or productizing this later so it may help others in similar situations? You might also get more contribution to make it better. I'd be down to help out but I'm new in this unfortunately. But I'll be cheering for you and your husband! Looking forward to posts on how this turn out

[-]

BeneficialVillage148@reddit

Really inspiring project.

On 8GB, stick with Q4/Q3 quants, enable mmap, and use zram—it makes a big difference. Smaller models like TinyLlama can still feel surprisingly good.

[-]

Fair_Ad845@reddit

This is one of the most meaningful projects I have seen on this sub. A few practical suggestions for your 8GB constraint:

Model choice: Gemma 4 E2B (as someone mentioned) is good, but also look at Qwen2.5-3B-Instruct. It is specifically fine-tuned for conversation and runs comfortably in 3-4GB RAM with Q4 quantization, leaving headroom for TTS and whisper.

Memory matters: For a companion that talks to the same person every day, the biggest quality jump is not a bigger model — it is giving the model memory of past conversations. Even a simple approach like appending "Yesterday we talked about X, Y, Z" to the system prompt makes the interaction feel dramatically more personal. You could store conversation summaries in a local SQLite file and load the last few each morning.

TTS latency: Kokoro is great quality but check the latency on your hardware. For real-time conversation flow, Piper TTS is faster and still sounds natural. A 2-second pause between his question and the robot responding will kill the conversational feel.

Power tip: If you are using llama.cpp, set --ctx-size as low as you can tolerate (2048 is fine for casual chat). Context size is the biggest RAM consumer after the model weights.

This is exactly what local AI should be used for. Keep us posted on progress.

[-]

sunshinecheung@reddit

AI companion... Just use grok?

[-]

BuddyBotBuilder@reddit (OP)

lol.

[-]

chooseyouravatar@reddit

Hearts on you and your husband. In that limited space, you should try https://huggingface.co/janhq/Jan-v1-4B-GGUF . Based on Qwen 3, agentic friendly, this version (v1) is particulary smart and lives in 2.5 GB. More space for context. More fun. Note : I am not affiliated with Jan, i'm just an end user. ;-)

[-]

TheDigitalRhino@reddit

Wow this is very cool. Strongly consider the Gemma 4 models as they perform better even when quantization.

In order of importance I would do this.

Use a gemma 4 model
Figure out a way to slim down the OS footprint. If you can switch to a lighter version like XFCE, or run the ThinkPad "headless" (command line only) once the robot is configured, you'll instantly reclaim 1GB+ of RAM for your models.
Clamp the context window. In your llama.cpp command, use the -c flag to strictly limit how much history the model remembers (e.g., -c 2048 or -c 4096).
Try to find more ram. I would look up your model and see if you can find sodimm that would work.
Also, if not already the main drive should be a SSD or NVME.

[-]

BuddyBotBuilder@reddit (OP)

More info for you.
5) The jetson has a 500gb ssd installed I haven’t touched yet. Still using the micro ssd for everything. The laptop has a 1tb hard drive and a 2tb ssd.

4) Ram cost is out of budget because there is no budget. I buy stuff for this project only when get a gift card from work, I find something at a thrift store, or I need something that costs $20 or less off Amazon.

I will need to read about mixing models and LM Studio. Been working only through the prompt windows. Before that I was using Anaconda with limited success.

I appreciate the help!

[-]

BuddyBotBuilder@reddit (OP)

Noob so still dependent on a web browser and Claude or gpt. (Takes up a lot of ram I know). I have the laptop setup to load the AI on startup. But yes, I want to minimize the OS as much as possible. I just didn’t quite know how to do that. Thank you for the roadmap!

[-]

traveddit@reddit

https://www.youtube.com/watch?v=l5ggH-YhuAw

I remember seeing this video recently and maybe this person's project has some parallels to help you with yours.

[-]

Echo9Zulu-@reddit

Hey, so what gen is your i5? Great project!

[-]

Individual_Table4754@reddit

Hi, sorry don't have much advice, i could only think suggesting maybe using kokoro over piper tts? it is very small and sounds a lot more natural (at least to me). Also, inference of a "big" model like mistral 7B could be too much for a CPU (most of CPUs really), resulting in not very pleasing inference speeds, could you consider "Bonsai" models? ( they're optimized for CPU inference, as far as i understand), or maybe the new gemma4 models (quantized E2B version by unsloth). You can find these models on huggingface.

One last thing, i don't know your level of expertise, but should you encounter any major obstacle, just stick to the simplest solution, the one that you find to work best.

Anyways, asking around (like you did here) should get you a lot of help and inspo. And sorry for my english, it's not my primary language. Good luck with this project!

[-]

Shayps@reddit

Kokoro is more realistic, but it’s also a lot slower on resource constrained environments. Piper is the right call here IMO.

[-]

Far_Falcon_6158@reddit

Damn you are a great person. I love this. If you live in ohio i might be able to donate some hardware.

[-]

Far-Low-4705@reddit

honestly, for something like this it might be worth shipping the hardware so long as its not insanely expensive to do so, which i cant imagine it costing more than 20 bucks

[-]

BuddyBotBuilder@reddit (OP)

Thanks! Sadly I’m over on the west cost. Though I’m having fun just scavenging parts. Enjoying the whole scrappy thing. :).

[-]

Far-Low-4705@reddit

use gemma 4 e4b

This model has native text, vision, and audio inputs. while supporting native tool calling and advanced reasoning

native audio input is probably very useful for this application. llama.cpp doesnt support audio input for gemma yet, but it probably will so i would keep an eye out for it.

But mistral 7b is very outdated. at least switch the model to qwen 3.5 4b or qwen 3.5 9b

[-]

habachilles@reddit

I love this. Will do anything I can to help. Have been experimenting similarly and have an awesome memory system but 8gb ram might be rough.

[-]

lochlainnv@reddit

I recently made a "low vram" voice agent setup (linked below), however your project has significantly higher constraints.

I am willing to help you with this project actively if you are willing to open source it to also help others. I have built my own agent harnesses and have a programming and robotics background among other things.

Firstly I suggest looking at the Qwen 3.5 small models for practical reasons and start at q4 with llama.cpp. Try one of these:

https://huggingface.co/unsloth/Qwen3.5-4B-GGUF https://huggingface.co/unsloth/Qwen3.5-2B-GGUF https://huggingface.co/unsloth/Qwen3.5-0.8B-GGUF

For a companion bot the speed of inference will matter, so you need to run whatever runs fast enough to feel interactive.

Piper TTS is good, I would also suggest looking at Kokoro 82m, but unsure if it will be a good fit... Piper is very light.

An ideal companion bot will need to have some kind of memory and some access to tools for computer use and should be able to hold a decent conversation. I imagine it chatting, reading the news or books, possibly controlling any smart electronics around the home.

It should be possible to bleed this kind of performance out of these LLMs although the companion won't be the sharpest tool in the shed.

References: https://github.com/lvwilson/voice_agents https://github.com/lvwilson/agents

[-]

Billysm23@reddit

I won't choose mistral 7b instruct for the model because it's kinda outdated. To maximize efficiency, you can go for the new turboquant.

[-]

BuddyBotBuilder@reddit (OP)

Ran it by Claud (I’m a noob to all this so I have to depend on some AI’s while I learn python and manage the hardware). Says -TurboQuant is for long context, while this bot’s conversations will be short. -it’s not mainline, it may have breaks I wouldn’t know how to handle at my level. - I’m using Ollama which sits on top of llama.cpp. I would have to replace Ollama and setup some kind of custom fork which is beyond my skills currently. But thanks! This is exactly the kind of stuff I’m looking for.

[-]