interacting with gemma 4 w/ live video and audio
Posted by jcsimmo@reddit | LocalLLaMA | View on Reddit | 6 comments
I saw someone on this forum demonstrate using gemma 4, live-streaming audio and video from his webcam and asking it what it was seeing. It was pretty great, but I can't find that post anymore, and I can't find a good repo on GitHub where I can try it out. I can't seem to get it working on my own.
RebouncedCat@reddit
https://www.reddit.com/r/LocalLLaMA/comments/1sda3r6/realtime_ai_audiovideo_in_voice_out_on_an_m3_pro/
jcsimmo@reddit (OP)
Thank you!!!!
Acrobatic_Stress1388@reddit
I think you were looking for something called parlor, maybe?
Due-Function-4877@reddit
Shouldn't be too difficult, but it would be hard to get it going in real time on affordable local hardware without using heavy quantization.
Set up a venv and run insightface on its own backend that also hosts your browser front end. Have the backend grab a webcam capture. Then call into an OpenAI-compatible endpoint with whatever multimodal model you fancy at the moment. Send along some cooked insightface data and the rest of your caption prompt.
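A minimal sketch of that loop: grab one webcam frame, base64-encode it, and POST it to an OpenAI-compatible chat-completions endpoint serving a multimodal model. The endpoint URL, model name, and prompt are placeholders, and the insightface step is omitted; adapt to whatever backend you actually run.

```python
# Sketch: webcam frame -> OpenAI-compatible multimodal endpoint.
# Assumptions: a local server at http://localhost:8000/v1/chat/completions
# and a model name "gemma" -- both hypothetical, swap in your own.
import base64

def build_payload(jpeg_bytes: bytes, caption_prompt: str, model: str = "gemma") -> dict:
    """Wrap a JPEG frame and a prompt in the OpenAI chat-completions image format."""
    b64 = base64.b64encode(jpeg_bytes).decode("ascii")
    return {
        "model": model,
        "messages": [{
            "role": "user",
            "content": [
                {"type": "text", "text": caption_prompt},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/jpeg;base64,{b64}"}},
            ],
        }],
    }

if __name__ == "__main__":
    # Needs opencv-python and requests installed in the venv.
    import cv2
    import requests

    cap = cv2.VideoCapture(0)          # default webcam
    ok, frame = cap.read()             # grab a single frame
    cap.release()
    if ok:
        ok, jpeg = cv2.imencode(".jpg", frame)
        payload = build_payload(jpeg.tobytes(), "What do you see in this frame?")
        r = requests.post("http://localhost:8000/v1/chat/completions", json=payload)
        print(r.json()["choices"][0]["message"]["content"])
```

For "real time" you'd run this in a loop and throttle the frame rate to whatever your hardware can caption; the insightface output would just be spliced into the text part of the prompt.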
GrungeWerX@reddit
Is all that necessary? I’ve heard that Gemma can actually process video
keally1123@reddit
How are you going to ask how to do something and then question the necessity of the process given to you? If "all that" wasn't necessary, you would've figured it out by now.