Real-time AI (audio/video in, voice out) on an M3 Pro with Gemma E2B
Posted by ffinzy@reddit | LocalLLaMA | 32 comments
Sure, you can't do agentic coding with Gemma 4 E2B, but this model is a game-changer for people learning a new language.
Imagine a few years from now when people can run this locally on their phones. They can point their camera at objects and talk about them. And the model is multilingual, so people can always fall back to their native language if they want. This is essentially what OpenAI demoed a few years ago.
TruckUseful4423@reddit
It would be great to have a BAT file for Windows users - fully automatic installation and start :-)
casualcoder47@reddit
How much RAM are you consuming?
ffinzy@reddit (OP)
I'm using gemma-4-E2B-it-litert-lm; the model file is 2.58 GB.
The Python process hovers around ~3GB when idle and around ~4GB when I run the benchmark or use it.
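For anyone who wants to check this on their own machine, here's a minimal stdlib-only sketch for reading the Python process's peak memory (the `peak_rss_mb` helper is hypothetical, not part of the project):

```python
import resource
import sys

def peak_rss_mb() -> float:
    """Peak resident set size of the current process, in MB."""
    rss = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
    # macOS reports ru_maxrss in bytes, Linux in kilobytes
    return rss / 1024**2 if sys.platform == "darwin" else rss / 1024

print(f"peak RSS: {peak_rss_mb():.1f} MB")
```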
False_Process_4569@reddit
How about VRAM? I haven't tried loading any Gemma 4 variants yet, but I only have 8GB of VRAM.
ffinzy@reddit (OP)
On Mac the memory is unified, so the RAM and VRAM come from the same pool.
Google provides benchmarks on performance and the GPU memory needed here: https://huggingface.co/litert-community/gemma-4-E2B-it-litert-lm
Critical_008@reddit
This is great! 👍 If you reduce it to ~800ms response time, it will be a game changer. Great work.
ffinzy@reddit (OP)
Thank you. The M3 Pro has a memory bandwidth of 150 GB/s while the M5 Pro has 307 GB/s, so on an M5 Pro this might have a ~1s response time.
I agree with the general sentiment that we need to optimize this to run on less powerful hardware as well.
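The estimate above is just back-of-envelope: if decoding is memory-bandwidth-bound, response time scales roughly inversely with bandwidth. A sketch using the figures from this thread (the ~2s M3 Pro baseline is an assumption):

```python
M3_PRO_GBPS = 150    # M3 Pro memory bandwidth
M5_PRO_GBPS = 307    # M5 Pro memory bandwidth
m3_response_s = 2.0  # approximate observed response time on the M3 Pro

# Memory-bound work scales inversely with bandwidth
m5_estimate_s = m3_response_s * M3_PRO_GBPS / M5_PRO_GBPS
print(f"estimated M5 Pro response time: ~{m5_estimate_s:.1f}s")  # ~1.0s
```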
Critical_008@reddit
I installed this and it's actually really good! Running on an M2 Pro with about 3 seconds end-to-end latency, sometimes 2 seconds.
ffinzy@reddit (OP)
Nice! Glad to know it's working well for you. You're the first to confirm that this works outside of my machine haha.
Critical_008@reddit
I wonder if there is a way to turn off the camera and just use audio..
I will look through the code to check for any latency optimizations 😀 thanks for this.
ffinzy@reddit (OP)
there's a button on the bottom left xd
Born-Caterpillar-814@reddit
I tried to install this on Ubuntu, but it fails to download the Kokoro files; those URL paths don't seem to exist anymore.
ffinzy@reddit (OP)
Could you pull the latest commit and try again? I just tried it and it downloads the correct Kokoro model for me.
Although I'm still testing it on a Mac, so I'm not sure it'll work 100% on Ubuntu.
Born-Caterpillar-814@reddit
Thank you so much, I will test again tomorrow.
ffinzy@reddit (OP)
Sorry, I haven't tested the Ubuntu version. Let me check it real quick.
mycall@reddit
I find it interesting that people only consider local AI good for privacy and speed, never for offline use.
ffinzy@reddit (OP)
That's a good point.
bluemondayishere@reddit
Not a hotdog
mrgulabull@reddit
Ohh, this is very nice. Thanks for the demo and open source reference.
I’ve built a voice-controlled interface for Claude Code and have focused on optimizing every millisecond, like you. The STT, TTS, and LLM are all pluggable. I’m going to see where E2B can fit in - perhaps offering a completely local version if someone doesn’t want to use Claude’s models. The vision processing would be really nice to integrate.
Here’s a quick demo:
https://www.reddit.com/r/ClaudeCode/s/RFG88a18IJ
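For readers curious what a "pluggable STT/TTS/LLM" design can look like, here's a minimal sketch using structural typing so local (E2B) or hosted (Claude) backends are swappable; all class and method names are hypothetical, not taken from that project:

```python
from typing import Protocol

class STT(Protocol):
    def transcribe(self, audio: bytes) -> str: ...

class LLM(Protocol):
    def complete(self, prompt: str) -> str: ...

class TTS(Protocol):
    def speak(self, text: str) -> bytes: ...

class VoicePipeline:
    """Chains any STT -> LLM -> TTS trio; swap a backend by passing a new object."""
    def __init__(self, stt: STT, llm: LLM, tts: TTS):
        self.stt, self.llm, self.tts = stt, llm, tts

    def turn(self, audio: bytes) -> bytes:
        text = self.stt.transcribe(audio)
        reply = self.llm.complete(text)
        return self.tts.speak(reply)
```

Because `Protocol` checks structure rather than inheritance, a local Gemma wrapper and a hosted-API wrapper can both slot into `VoicePipeline` without sharing a base class.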
neOwx@reddit
Impressive. Maybe you could make it feel even quicker by streaming the response? In your demo, the text appears in one go.
ffinzy@reddit (OP)
Yeah, that's possible. Although during my testing most of the time is spent in the prefill (processing the video, audio, and text), not in the actual decoding. For example, the total time for prefill/TTFT is ~2s, while the decode/text generation that we could stream is only ~0.3s.
So it'd be much more significant if we could reduce the TTFT. Disabling the image input reduces it from ~2s to ~1.5s.
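To make the trade-off concrete, here's a quick sketch with the rough numbers above (all figures approximate):

```python
prefill_s = 2.0           # TTFT: video + audio + text prefill
decode_s = 0.3            # token generation, the only part streaming can hide
prefill_no_image_s = 1.5  # TTFT with image input disabled

total_s = prefill_s + decode_s
print(f"streaming hides at most {decode_s / total_s:.0%} of the {total_s:.1f}s total")
print(f"dropping image input cuts TTFT by {prefill_s - prefill_no_image_s:.1f}s")
```

So even perfect streaming only masks the last ~13% of the wait, while dropping the image input removes a quarter of the prefill outright.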
kmil-17@reddit
interesting
-deflating@reddit
Wow, impressive! Thanks!
ffinzy@reddit (OP)
Thank you!
Medium_Chemist_4032@reddit
I thought the model was STT only; does it do TTS too?
balder1993@reddit
Must be using something else to read the model’s output.
ffinzy@reddit (OP)
I use Kokoro for the TTS.
JacketHistorical2321@reddit
It's a 5B model, isn't it? My phone (16GB RAM) can handle that now.
sersoniko@reddit
I was happy that my iPhone 17 Pro came with 4 more GB of RAM, but each app is limited to around 4 or 5 GB.
ffinzy@reddit (OP)
Yeah, 5.1B with embeddings. You should try it in the Google AI Edge Gallery app, although AFAIK they currently don't provide multi-modal real-time; only separate text input, voice input, or video input. My guess is that although the model fits in RAM, the GPU isn't fast enough to process it in real time.
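As a sanity check on the sizes in this thread: 5.1B parameters at roughly 4 bits per weight (the quantization level is my assumption) lines up with the ~2.58 GB file mentioned above:

```python
params = 5.1e9       # parameter count including embeddings, per the thread
bits_per_weight = 4  # assumed quantization level
size_gb = params * bits_per_weight / 8 / 1e9
print(f"~{size_gb:.2f} GB on disk")  # ~2.55 GB, close to the 2.58 GB file
```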
misha1350@reddit
Why don't you use E4B instead
ffinzy@reddit (OP)
Good question. I want to optimize for speed and the "real-time" feeling. You should use the E4B if you have a faster GPU.