Gemma 4 E2B runs surprisingly well on my 8GB Android phone, so I built a private voice notes app around it.
Posted by Effective-Drawer9152@reddit | LocalLLaMA | View on Reddit | 23 comments
Been running Gemma 4 E2B locally on my OnePlus CE 5 (8GB RAM) for a few months. Chat quality is fine for the size. What surprised me was the JSON output: give it a short input and a structured prompt, and you get clean, parseable JSON back. Way better than I expected from a 2.4GB model on a phone.
Got me thinking about voice notes. You ramble for a few seconds, "call the dentist tomorrow at 3, also buy milk on the way home", and Gemma can split that into separate items, tag each one (reminder, buy), resolve the time. Tried it for a few weeks. Categorization is actually decent on real notes, not just the toy ones I started with.
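To make the "clean, parseable JSON" part concrete, here is a minimal Python sketch of validating that kind of output before it reaches storage. The schema (text, category, due), the tag set, and the sample string are all assumptions made up for illustration; the sample is hand-written, not real model output, and the app's actual prompt and format may differ.

```python
import json
from dataclasses import dataclass
from typing import List, Optional

# Hypothetical tag set; the post only mentions "reminder" and "buy".
ALLOWED_CATEGORIES = {"reminder", "buy", "note"}


@dataclass
class NoteItem:
    text: str
    category: str
    due: Optional[str]  # time phrase kept as a string; resolution not shown


def parse_items(raw: str) -> List[NoteItem]:
    """Parse a JSON array of note items, raising ValueError if malformed."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        raise ValueError(f"model output is not valid JSON: {exc}") from exc
    if not isinstance(data, list):
        raise ValueError("expected a JSON array of items")

    items = []
    for entry in data:
        if not isinstance(entry, dict):
            raise ValueError("each item must be a JSON object")
        text = entry.get("text")
        if not isinstance(text, str) or not text.strip():
            raise ValueError("each item needs a non-empty 'text' string")
        category = entry.get("category")
        if not isinstance(category, str) or category not in ALLOWED_CATEGORIES:
            category = "note"  # unknown tag: fall back instead of rejecting
        due = entry.get("due")
        if not isinstance(due, str):
            due = None
        items.append(NoteItem(text.strip(), category, due))
    return items


# Hand-written sample in the assumed schema (not real model output).
SAMPLE = (
    '[{"text": "call the dentist", "category": "reminder", "due": "tomorrow 15:00"},'
    ' {"text": "buy milk", "category": "buy", "due": null}]'
)

if __name__ == "__main__":
    for item in parse_items(SAMPLE):
        print(item)
```

Falling back to a generic tag for unknown categories, while rejecting structurally broken output outright, is one possible policy; a stricter app might retry the model instead.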
Built an Android app around it. Whisper Small (244MB) for transcription via Sherpa-ONNX, Gemma 4 E2B (2.4GB) for the splitting and categorization via LiteRT-LM. Both run on the phone, no cloud, no account.
End-to-end on the CE 5, a typical 10-15 second voice note takes about 12-15s. Whisper does transcription in ~5s, Gemma categorizes in ~8-10s, rest is model load + Room writes + UI hop.
At search time (for example, "what did I say about the dentist last week"), it does query expansion, rewriting the user's question into keywords plus hypothetical example items before retrieval. Multiple FTS lanes get merged with reciprocal rank fusion, then there's an optional Gemma reranker pass over the top-K with a 15s timeout and fallback to RRF order if it doesn't finish.
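For anyone unfamiliar with the merge step, here is a rough Python sketch of reciprocal rank fusion followed by a reranker with a timeout fallback. It is not the app's code (that is Kotlin). The function names, the k=60 constant (a commonly used default for RRF), and the thread-based timeout are assumptions for illustration.

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor
from concurrent.futures import TimeoutError as FutureTimeout
from typing import Callable, Dict, List, Sequence


def rrf_merge(lanes: Sequence[Sequence[str]], k: int = 60) -> List[str]:
    """Fuse ranked lists: each doc scores sum(1 / (k + rank)) over all lanes."""
    scores: Dict[str, float] = defaultdict(float)
    for lane in lanes:
        for rank, doc_id in enumerate(lane, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest score first; ties broken by doc id so the order is deterministic.
    return sorted(scores, key=lambda d: (-scores[d], d))


def rerank_with_fallback(
    fused: List[str],
    reranker: Callable[[List[str]], List[str]],
    top_k: int = 10,
    timeout_s: float = 15.0,
) -> List[str]:
    """Rerank the top_k fused results; keep RRF order if the reranker is late."""
    head, tail = fused[:top_k], fused[top_k:]
    executor = ThreadPoolExecutor(max_workers=1)
    future = executor.submit(reranker, head)
    try:
        reranked = list(future.result(timeout=timeout_s))
    except FutureTimeout:
        reranked = head  # fallback: RRF order is already usable
    finally:
        # A running thread cannot be killed; we stop waiting and ignore it.
        executor.shutdown(wait=False)
    return reranked + tail


if __name__ == "__main__":
    # Three hypothetical FTS lanes returning note ids.
    fused = rrf_merge([["a", "b", "c"], ["b", "c"], ["b"]])
    print(fused)
    print(rerank_with_fallback(fused, lambda head: head[::-1], top_k=2))
```

The useful property is that the RRF list is a complete answer on its own, so the reranker can only improve the order and never blocks the search from returning.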
Curious what people here are doing with local LLMs on their phones lately. Any other good models to try on-device?
If anyone wants to try it on their own device and share feedback, happy to share it. Mostly looking to know if the categorization holds up on real notes and any weirdness on first model
MrAatishB@reddit
Great app and idea. It's amazing what you can do with enough enthusiasm.
zhenfengzhu@reddit
Interesting setup. The part I’d be most curious about is how well the categorization holds up after a few weeks of messy real notes, not just clean examples.
Do you keep the original transcript + the model’s parsed JSON side by side for later correction? For this kind of on-phone workflow I feel like the hard problem is less the first answer and more having enough trace to fix bad splits or wrong reminders later.
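A minimal Python sketch of what that kind of trace could look like: one record per note that keeps the raw transcript and the model's JSON untouched, with user corrections stored separately. The class and field names are hypothetical, not anything from the app.

```python
import json
from dataclasses import dataclass
from typing import Optional


@dataclass
class NoteTrace:
    transcript: str                       # speech-to-text output, never edited
    model_json: str                       # model output exactly as returned
    corrected_json: Optional[str] = None  # user's fix, if one was made

    def current_items(self) -> list:
        """Prefer the user's correction; otherwise use the model's parse."""
        source = self.corrected_json
        if source is None:
            source = self.model_json
        return json.loads(source)

    def correct(self, items: list) -> None:
        """Store a correction without overwriting the model's original output."""
        self.corrected_json = json.dumps(items)


if __name__ == "__main__":
    # Hand-written example of a bad split being fixed later.
    trace = NoteTrace(
        transcript="call the dentist tomorrow at 3, also buy milk",
        model_json='[{"text": "call the dentist and buy milk", "category": "reminder"}]',
    )
    trace.correct([
        {"text": "call the dentist", "category": "reminder"},
        {"text": "buy milk", "category": "buy"},
    ])
    print(trace.current_items())
```

Keeping the original model output next to the correction also gives you pairs to review later when tuning the prompt.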
emiliobay@reddit
XDA Developers just ran the 4B version of Gemma 4 on an Oppo Find N5 and got about 8 tokens per second with native audio transcription. That on-device audio support is a massive shift for offline notes. I noticed the real friction with setups like that is the actual input step, which is why I've been prototyping a physical Bluetooth clicker to trigger the recording instantly.
wbulot@reddit
I did something a bit different. I used Qwen 3.6 27B to code an Android keyboard tailored for me. I integrated NVIDIA’s Parakeet voice model into it, which runs directly on the phone. It then sends the transcription to my local LLM server with a predefined prompt. Everything is accessible through small icons right in the keyboard. It works really well.
The audio transcription is instant with Parakeet and it almost never misses a word. It’s also multilingual, which is a huge advantage since I speak both French and English. The LLM runs on my server instead of on the phone so it stays smart enough.
Running the LLM directly on the phone is an option, but with such a small number of parameters, I feel like it would fail too often. I prefer to keep the LLM on the server and only run the voice model locally.
xeeff@reddit
I keep seeing Parakeet as the best option for STT. Does it run on ROCm?
Effective-Drawer9152@reddit (OP)
Cool setup, keyboard + Parakeet + server LLM is a clever route, especially the multilingual angle. The server-LLM choice makes total sense if you're doing freeform reasoning where parameter count matters more. Thanks for sharing this.
starkruzr@reddit
this is making me want to look into one of those mega-RAM Redmagic phones.
Technical-Earth-3254@reddit
If you don't want all the downsides, you can look into Nubia phones.
Comfortable_Ebb7015@reddit
With 24GB RAM you could even run Gemma 4 26B!
Effective-Drawer9152@reddit (OP)
yeah those were beast
starkruzr@reddit
so I'm reading about LiteRT-LM (https://github.com/google-ai-edge/LiteRT-LM). What does it actually end up running models on: the CPU, GPU, or NPU? I understand that on a lot of devices the RAM available to the NPU or GPU is limited, and it's not as much of a shared-memory architecture as something like a Spark or an Apple Silicon Mac.
Effective-Drawer9152@reddit (OP)
I'm not aware of such deep details. I think Android has a unified memory architecture, and I think the GPU cap is way smaller. I'm learning as I build. I'm not an Android developer, to be honest.
Effective-Drawer9152@reddit (OP)
Thank you all for the comments and discussion. If anyone wants to try it:
Beta access (2 quick steps):
(Closed testing is required by Google Play for new apps; this 2-step opt-in is unfortunately the lowest-friction path until I unlock public testing in ~2 weeks.)
This is one screenshot of the app.
good-luck11235@reddit
Please open-source it so I can try it out and contribute if I can.
Effective-Drawer9152@reddit (OP)
I'm not sure about open-sourcing it right now. If you want to try it, I can add you to closed testing.
good-luck11235@reddit
Awesome. DM when you want to do it
Effective-Drawer9152@reddit (OP)
Thanks, DMed you.
SOCSChamp@reddit
Not sure why you'd need Whisper in this case, the model should be perfectly capable of taking your voice and writing formatted text out of it natively.
Effective-Drawer9152@reddit (OP)
Yeah, I totally get it. I did try, but it wasn't working cleanly. Will definitely try again.
SOCSChamp@reddit
What's your setup look like? I haven't tried building anything on Android, so I'm not familiar with the toolkits. For standard Linux, vLLM works great for me with audio input.
Effective-Drawer9152@reddit (OP)
Standard Android stack — Kotlin + Compose for UI, Gradle build, Koin for DI. The on-device LLM is LiteRT-LM.
I also tried llama.cpp. It was good, but LiteRT-LM was better.
mhl47@reddit
Sounds great. Did you try using the model directly for voice input instead of adding Whisper?
Effective-Drawer9152@reddit (OP)
Tried it early on, couldn't get it working cleanly, so I went with Whisper. I'm going to try again and see how it goes.