Built a fully offline suitcase robot around a Jetson Orin NX SUPER 16GB. Gemma 4 E4B, ~200ms cached TTFT, 30+ sensors, no WiFi/BT/cellular. He has opinions.
Posted by CreativelyBankrupt@reddit | LocalLLaMA | View on Reddit | 57 comments
Sparky runs entirely on the Jetson. Gemma 4 E4B at Q4_K_M via llama.cpp with q8_0 KV cache and flash attention. 12K context, native system role, sampler defaults from the model card. Cached TTFT around 200ms, sustained 14-15 tok/s. SenseVoiceSmall for STT, Piper for TTS with 43Hz mouth sync, PixiJS face on the lid display. Vision and OCR are native to Gemma 4 now so the BLIP subprocess is gone. 30+ sensors fold into the prompt as natural language every turn.
One of the biggest wins was prompt structure for cache stability. Persona and tools at the top, history in the middle, volatile sensor and vision data at the end of the latest user turn. Moving dynamic context out of the system block dropped cached TTFT from multi-second to ~200ms.
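If it helps anyone, the layout is roughly this (a simplified Python sketch, not the actual code; field names are illustrative):

```python
def build_messages(persona, tools, history, user_text, env_block=None):
    # 1. Static prefix: persona + tool descriptions, byte-identical every turn,
    #    so llama.cpp's prefix cache keeps matching.
    messages = [{"role": "system", "content": persona + "\n\n" + tools}]
    # 2. Conversation history, append-only, so earlier turns stay cached.
    messages += history
    # 3. Volatile data (sensors, vision) only at the tail of the newest user
    #    message, so a change here invalidates nothing upstream.
    tail = user_text if env_block is None else env_block + "\n" + user_text
    messages.append({"role": "user", "content": tail})
    return messages
```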
Configurable entirely on-device via a button row, a joystick, and an analog encoder knob. No network interface at all.
Curious if anyone else is running E4B on Orin-class hardware. I'd love to compare tok/s and how you're handling sensor or tool context without blowing your prefix cache.
BlaizeOlle@reddit
Excellent
KindMonitor6206@reddit
cyberdecks be trending
teachersecret@reddit
Definitely not taking that thing on a plane... lol
lemondrops9@reddit
That was my first thought, some dumbass is going to say it's a bomb.
DonnaPollson@reddit
The cache-stability point is the real gem here. A lot of edge projects obsess over quant choice and tok/s, but prompt layout is usually the hidden performance lever once you start mixing sensors, vision, and tool state. Putting volatile context at the tail instead of poisoning the prefix is exactly the kind of boring systems choice that turns a demo into something you can actually live with.
Intrepid_Dare6377@reddit
If you gave it a robotic middle finger, it would use it.
DJ_PoppedCaps@reddit
Compubro
LeoStark84@reddit
It would be so cool if the screen was on the outside and you could walk around carrying an opinionated suitcase with eyes that speaks.
laul_pogan@reddit
Solid cache structure. One thing that bit me with rapidly-sampled sensor data: floating point noise on continuous readings (temp at 23.14 vs 23.15 next turn) silently invalidates the prefix even though semantically nothing changed. Rounding sensor values to fixed precision before folding into the prompt (one decimal for temperature, integer for distance/light) gets the volatile tail structurally identical across more turns, so the cached path fires more often. Same for timestamps; bin to the nearest second or drop them unless Sparky actually needs temporal reasoning. Small change, measurable improvement in cache hit rate without touching your prompt structure.
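Something like this, if it's useful to anyone (illustrative field names, adjust the precision to your sensors):

```python
def normalize(r):
    """Quantize continuous readings so the volatile prompt tail is byte-identical
    across turns whenever nothing has meaningfully changed."""
    return {
        "temp_c":   round(r["temp_c"], 1),     # one decimal for temperature
        "dist_cm":  int(round(r["dist_cm"])),  # whole centimeters
        "light_lx": int(round(r["light_lx"])), # integer lux
        "ts":       int(r["ts"]),              # bin timestamps to the second (or drop them)
    }
```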
CreativelyBankrupt@reddit (OP)
This is great because I never get to talk to anyone about the details. It's exactly the trap I was hitting when I started folding sensor data into the prompt; glad to report Sparky is already running this playbook. ENV lives in the user message rather than the system prompt, so the persona and conversation-history KV stays cached regardless of what the volatile tail does. It's also event-gated with cooldowns (threshold crossings only, then 1-to-10-minute quiet windows depending on the category), so most turns have no ENV block at all and hit a clean cached prefix.
Every numeric is rounded before formatting: integer Fahrenheit; integer humidity, light, and pressure; distance in whole cm. The only one-decimal value anywhere is Celsius in the on-demand "what's the temperature" path, which is exactly the case you called out as worth keeping the precision for.
The one residual issue is banker's-rounding flips right at integer boundaries (22.5°C rounds to 72°F, 22.51°C rounds to 73°F), but the cooldown gate keeps that to once per session worst case, so I'm letting it ride. Have you measured the cache-hit-rate delta on your setup? Curious how big the win was for you in practice.
laul_pogan@reddit
Honest answer: I never instrumented the exact hit rate on my end. The qualitative win was clear, though: vLLM long-context TTFT dropped enough that I stopped reaching for shorter prompts as a workaround.
On the Anthropic API path the response metadata exposes cache_read vs cache_creation tokens, which is the cleanest direct signal if you want a number. The event-gated cooldown is a clever escape hatch; hadn't seen anyone bin event types separately for this.
PigSlam@reddit
That's pretty cool. I hadn't considered anything like that. Would it be plausible to throw one together quick and dirty with an old gaming laptop with a decent GPU?
CreativelyBankrupt@reddit (OP)
Yeah, that'll absolutely work. A 2070-class 8GB GPU has plenty of headroom for E4B at Q4. You'll probably see 20+ tok/s, since desktop-class CUDA is simply faster than the Jetson at low batch sizes.
I'd skip the embodied hardware on your first pass. Just get llama.cpp serving E4B locally, pipe a USB mic through SenseVoiceSmall for STT, and pipe the output through Piper for TTS. That's a fully offline voice loop.
I think the craft is the prompt and persona design. Sparky works because the system prompt commits him to a character, and everything around him (sensors, vision, history) gets folded into that frame so he riffs on it. A laptop is as good a place as any to learn that.
If you can get E4B + Piper + a mic, a persona prompt that commits to a character, and a 10-turn conversation where you don't break out of character once, you'd learn more than any tutorial.
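If you want a skeleton to start from, the whole loop is roughly this (untested sketch; the model paths and llama-server port are placeholders, and SenseVoiceSmall is loaded through funasr here):

```python
import subprocess, requests
from funasr import AutoModel

stt = AutoModel(model="iic/SenseVoiceSmall")          # local STT
LLM = "http://127.0.0.1:8080/v1/chat/completions"     # llama-server, OpenAI-compatible API

history = [{"role": "system", "content": "<your persona prompt>"}]

def turn(wav_path):
    text = stt.generate(input=wav_path)[0]["text"]     # transcribe the mic capture
    history.append({"role": "user", "content": text})
    resp = requests.post(LLM, json={"messages": history}).json()
    answer = resp["choices"][0]["message"]["content"]
    history.append({"role": "assistant", "content": answer})
    # speak it: Piper reads text on stdin and writes a wav
    subprocess.run(["piper", "--model", "en_US-lessac-medium.onnx",
                    "--output_file", "reply.wav"], input=answer.encode())
    return answer
```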
PigSlam@reddit
I’m thinking of a slightly different application. I have a 2016 13” MBP sitting around with a bad keyboard. I was about to toss it out, but seeing this made me think I could bolt it onto a little stand and put that on top of an old Roomba I have. Then the “robot” could just roam around my basement and strike up conversations with the people it encounters. The “brain” could run on any of several systems I already use for AI things over my WiFi/LAN, so the laptop would just draw the face and handle the I/O. I suppose the Roomba might need some tweaking. It’d be nice to have it hold still when it’s having a conversation.
Sofakingwetoddead@reddit
OMG you just moved humanity at least 40 years into the future!!!! Wow that's sooo freaking cool man
breadinabox@reddit
Man I have been wanting to build something along these lines (not so much standalone but the multi-sensor input stuff)
You got any more details on the prompting or well.. anything? I'd love to hear basically anything
CreativelyBankrupt@reddit (OP)
The GPIO aspect is the most fun for me. Trick is don't dump raw sensor data into the prompt. Convert it to natural language addressed at Sparky, only when something is actually worth saying, and only after a cooldown so the same thing doesn't keep firing. Instead of temp=48F, humidity=22%, distance=12cm, pir=1, what the model sees is "Someone's face is RIGHT in front of you, 12cm away!" or "It's freezing, 48°F." Goes in the user message as an [ENV] prefix only on turns where there's an event. Most turns there's nothing, which is what keeps the cache happy.
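The gate itself is nothing fancy. Boiled way down, it's something like this (thresholds and cooldown values here are placeholders, not Sparky's real config):

```python
import time

COOLDOWN_S = {"temp": 600, "proximity": 60}   # per-category quiet windows
last_fired = {}

def env_block(r):
    now, lines = time.monotonic(), []
    def fire(cat, text):
        if now - last_fired.get(cat, float("-inf")) >= COOLDOWN_S[cat]:
            last_fired[cat] = now
            lines.append(text)
    if r["temp_f"] <= 48:
        fire("temp", f"It's freezing, {r['temp_f']}°F.")
    if r["pir"] and r["dist_cm"] <= 15:
        fire("proximity", f"Someone's face is RIGHT in front of you, {r['dist_cm']}cm away!")
    # ...more threshold checks for humidity, light, IMU, etc...
    return "[ENV] " + " ".join(lines) if lines else None   # None = no ENV block this turn
```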
Camera works the same way, keyword-gated. "Look," "describe," "what do you see" attach the latest frame to that single turn. Saves a ton on every other turn and keeps the model from getting stuck narrating images.
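The vision gate is even simpler, roughly (keyword list illustrative):

```python
LOOK_WORDS = ("look", "describe", "what do you see")

def maybe_frame(user_text, latest_frame):
    # attach the newest camera frame to this single turn only
    return latest_frame if any(w in user_text.lower() for w in LOOK_WORDS) else None
```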
Persona is a templating system, six voices (default Sparky, Comedy, Grump, Storyteller, Thinker, Sci-Fi Fanatic), swappable mid-session. The thing I underestimated most was post-processing strippers that catch the model echoing its own patterns. Without them Gemma 4 would start every reply with the same syntactic opener after 30 turns.
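The stripper idea, boiled down (the real heuristics are messier than this sketch):

```python
from collections import deque

recent_openers = deque(maxlen=8)

def strip_echo(reply):
    opener = reply.split(",", 1)[0].strip().lower()
    if recent_openers.count(opener) >= 2:                   # same opener several turns running
        reply = reply.partition(",")[2].lstrip() or reply   # drop it, keep the rest
    recent_openers.append(opener)
    return reply
```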
I also made him a smaller sister that he named Sparkle, built into a CrowPi 3 electronics learning station, with a 4-inch face display, camera, microphone, onboard sensor/IO board, glowing 64-pixel LED heart matrix etc. She's RPi5-based and has to use WiFi and cloud inference: she listens through the mic, sends the conversation to a Groq-hosted 120B LLM for reasoning, uses Llama 4 Scout for on-demand vision through her camera, then replies in a warm female voice while her PixiJS/WebGL face, LED heart, status lights, buzzer, and haptics express mood and state. Her physical body is basically a cute cybernetic lab tray: small, sensor-packed, expressive, and deliberately art-object-like, with a frosted cover that slips over her like a lid so she becomes glowing ambient wall art when idle.
It's the wild west right now and there are so many avenues to take with all of this! Find reasons to get started.
JudgePhobos@reddit
Smh looks like something from Fallout, very cool!
laserborg@reddit
interesting.
I'm running Gemma4 E4B Q4_K_M GGUF in llama.cpp/llama-server with decent context on a headless Orin Nano 8GB at ~18 t/s.
wonder how much RAM TTS and STT would consume.
overand@reddit
In theory, Gemma-4-e4b handles STT natively via multimodal support - but I generally see people stacking a separate STT pipeline in builds like these; I'm not quite sure why. (I assume it's a lack of tooling and/or documentation.)
CreativelyBankrupt@reddit (OP)
I run both. SenseVoiceSmall owns the primary text path, and Gemma 4's audio encoder (gemma4a in the mmproj) handles on-demand stuff like tone, accent, language, and background sounds when someone asks Sparky things like "how do I sound" or "what language am I speaking."
I think the reason most builds skip Gemma 4 for primary STT is latency. The multimodal audio encoder is around 700ms per turn just for the encode pass, before the LLM even sees the tokens. SenseVoiceSmall is around 150ms and runs its own VAD continuously, so endpointing is basically free and transcription is mostly done by the time the user stops talking. llama.cpp's audio path is also one-shot, you hand it the full clip and wait, and adding chunk-streaming is a 1-to-2-week upstream patch that I decided wasn't worth it just to claw back that 700ms.
So Sparky is hybrid. Cheap fast STT every turn, native multimodal audio when I actually need Sparky to listen to how I'm talking, not just what I said.
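The routing is dead simple, something like this (intent keywords are illustrative):

```python
AUDIO_INTENT = ("how do i sound", "what language", "what accent")

def route(transcript, clip_path):
    if any(k in transcript.lower() for k in AUDIO_INTENT):
        # hand the raw clip to the multimodal path so the model hears tone/accent/background
        return {"text": transcript, "audio": clip_path}
    return {"text": transcript, "audio": None}   # normal turn: cheap SenseVoice transcript only
```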
CreativelyBankrupt@reddit (OP)
Nice, 18 t/s on the 8GB is solid. I'm at 14-15 on the NX SUPER mostly because I'm running everything in parallel with the LLM.
For your question: Piper TTS is the lightweight one. The medium-quality voice models run around 60-80MB resident, generation is fast enough that it's basically free on Orin-class hardware. SenseVoiceSmall STT is the heavier one, roughly 800MB to 1GB depending on how you load it. Both are CPU-bound for me, not GPU, so they don't compete with the LLM for VRAM. Together they add maybe 1GB of system RAM and effectively zero GPU memory.
The bigger budget question on the 8GB Nano is whether you have room for vision. If you're using Gemma 4's native multimodal, the mmproj weights add another ~2GB on top of the LLM. If you skip vision, you have plenty of headroom for the speech pipeline. This is why I eventually moved up to the NX Super with 16GB.
Are you doing this headless on purpose or just because the use case doesn't need a display?
laserborg@reddit
thanks for the detailed reply! I'm running headless to save RAM; GNOME eats > 600 MB (an odd choice for an 8 GB platform, if you ask me). The official LLM tutorial for the Nano features ollama and 0.9-1.8B models, a pretty odd choice considering ollama's hunger for RAM and given that llama.cpp can run a 4B model without swapping.
CorpusculantCortex@reddit
"Less existential threat of dampness" 😂
unculturedperl@reddit
Excessive cheese and grease!
WithoutReason1729@reddit
Your post is getting popular and we just featured it on our Discord! Come check it out!
You've also been given a special flair for your contribution. We appreciate your post!
I am a bot and this action was performed automatically.
PigSlam@reddit
See all that stuff in there, Homer? That's why your robot never worked.
Paradigmind@reddit
Cool shit!
swagonflyyyy@reddit
Fucking.
Love it.
I want one.
Ylsid@reddit
This little guy needs tiny legs to get around
Recoil42@reddit
Really cool hardware design, OP.
CreativelyBankrupt@reddit (OP)
The case was fashioned from the Elecrow Jetson AI starter kit, which already included the suitcase, sensor board, and screen.
I turned it from the sensor training platform into Sparky: the conversational pipeline, the persona work, the face animation, the on-device control panel, the sensor-to-prompt integration, and all the prompt engineering for cache stability. The Elecrow kits are great as a base if anyone's looking for one but this wasn't their intended purpose.
riceinmybelly@reddit
450 is quite the price for some sensors and a little screen, I guess they sell the lessons this way. Wouldn’t your phone do better than the jetson?
CreativelyBankrupt@reddit (OP)
The kit was $179 direct from Elecrow when I bought it, plus tariffs. Resellers mark it up significantly, but it was still a nice starting point as I learned everything.
A phone CPU can run a small LLM, but it can't drive all the sensors over I²C, SPI, and GPIO, can't run llama.cpp with flash attention and a 12K KV cache at 14-15 tok/s sustained while also handling vision, STT, TTS, and a kiosk display in parallel, and isn't built to sit headless with USB peripherals for hours. The Jetson is an edge AI compute board with industrial I/O which fully enabled what I was attempting.
riceinmybelly@reddit
Oh, I just found it on AliExpress for 450€. I would slap a Raspberry Pi in there since I have some lying around and do the inference from my phone. But it's a cool project!
NotForResus@reddit
Second this!
VectorB@reddit
The face when you open it. "Man, not this guy and his BS again"
__E8__@reddit
Congratulations! You've invented George Jetson's computer friend, RUDI.
Now do the ship's computer from Star Trek NG.
I guess I'll take my moon pie over there and enjoy it quietly. What a time to be alive!
Cosack@reddit
like talking to an alien lol. I'd look into memory systems for Sparky, so it can evolve a bit.
Though the "existential threat of dampness" was pretty tasteful. Well noted, Sparky.
CreativelyBankrupt@reddit (OP)
Totally - that alien feeling is what I like!
Sparky hit a lot of hard limits, though, which is why I’m already building another robot around the AGX Thor 128GB Blackwell architecture: real autonomy, persistent memory, and much deeper vision. Sparky proved the concept and tech stack but I want the next level that remembers and evolves.
Cosack@reddit
Now that's commitment! Nice. What are you looking at for the battery?
Also are you documenting this adventure anywhere? Would be cool to learn from your experience
ferranpons@reddit
That's really cool! Does it have a name?
ctanna5@reddit
Dude.. holy moly! The duality!
Meowingway@reddit
What in the Skynet version 0.1alpha is this hahaha 😂 I love it. Keep rocking!
LocalLLaMa_reader@reddit
This is soooo coool!!! Do I understand correctly, you have a temperature sensor integrated into the device? Would be funny to have it make use of other sensor inputs, like GPS, time of day, etc. Does it "learn" about you over time? Does it "remember" your last sessions?
CreativelyBankrupt@reddit (OP)
Thanks! Yeah, temperature is one of about 30 sensors feeding him context every turn. Light, humidity, pressure, IMU, ultrasonic, PIR, ambient mic, plus face detection and emotion from the camera. Time of day is in there too. There's a customization interface not shown that lets me toggle individual sensors off.
On memory: within a session he has 12K of context. Across sessions, face ID persists so he recognizes returning people by name, but conversation memory resets on reboot by design. I've considered a small local vector store for things he should remember longer term, but the tricky part is deciding what to remember versus what to let fade. I tried saving all of the chat logs and then fine-tuning the model but I wasn't happy with the results.
Originally I was building him as a little hacker parrot, gathering all the floating metadata we love. WiFi probe requests, BLE ads, sub-GHz remotes, ADS-B from aircraft, TPMS from passing cars, weather stations, pagers, fobs, NFC/RFID, even the room signature with Mid-360S LiDAR at one point. Everything landed in an aggregator daemon and got distilled into a [SIGNALS] context block. But he definitely thinks he's alive, and the weird remarks from what he was seeing on the camera ended up more interesting than any of the RF stuff. So the hacker parrot got demoted.
wearesoovercooked@reddit
Cool project
Also: to /r/idiotsincars you go
MelodicRecognition7@reddit
exactly, cool project but /u/CreativelyBankrupt bro what the fuck are you doing? it's not a big deal if you die alone but you might kill someone else.
Potential-Fan-6148@reddit
Honestly adorable.
welliboot@reddit
I understand some of these words
doctorfiend@reddit
This rules. More weird suitcase robots in this subreddit please
blackhawk00001@reddit
Love it! Hands down one of the better projects I've seen so far.
Vaguswarrior@reddit
Promote that man.
ethereal_intellect@reddit
Extremely cool
Bulky-Priority6824@reddit
lol wtf is this but hey its kitschy
Greedy-Lynx-9706@reddit
SHUT UP, TAKE MY MONEY !!!