Most efficient way of running Gemma 4 E4B with multimodal capabilities on a laptop?
Posted by PrashantRanjan69@reddit | LocalLLaMA | View on Reddit | 16 comments
The Gemma 4 E4B and E2B models have built-in multimodal capabilities. However, as far as I am aware, llama.cpp does not have proper support for vision and audio inputs (especially audio) for these models as of now.
I was able to extract the audio encoder from the official model repository on Hugging Face and vibe-code a bridge that passes the audio embeddings directly to the model, and it actually works. This setup uses Unsloth's GGUF at Q4 and the audio encoder at full precision (PyTorch), and takes up about 5.5-6 GB of VRAM.
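For reference, a minimal sketch of what such an encoder-to-embeddings bridge could look like. This is not the OP's code: the repo id, the `audio_tower` attribute, and the processor call are assumptions, and the hand-off to the GGUF model is only indicated in comments.

```python
# Sketch of an "audio encoder -> embeddings" bridge. The repo id, the
# `audio_tower` attribute, and the processor call are assumptions, not the
# OP's actual code; the language model itself runs separately as a Q4 GGUF.
import torch
import librosa
from transformers import AutoProcessor, AutoModel

MODEL_ID = "google/gemma-4-e4b"  # placeholder repo id

processor = AutoProcessor.from_pretrained(MODEL_ID)
full_model = AutoModel.from_pretrained(MODEL_ID, torch_dtype=torch.float32)

# Keep only the audio tower in PyTorch at full precision.
audio_encoder = full_model.audio_tower.eval()

def encode_audio(path: str) -> torch.Tensor:
    """Return per-frame audio embeddings for one clip."""
    wav, sr = librosa.load(path, sr=16_000, mono=True)
    inputs = processor(audio=wav, sampling_rate=sr, return_tensors="pt")
    with torch.no_grad():
        return audio_encoder(**inputs).last_hidden_state  # (1, frames, hidden)

# The resulting embeddings are then spliced into the prompt at the audio
# placeholder positions before being handed to the GGUF model (the hacky part).
embeddings = encode_audio("clip.wav")
```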
The thing is, this entire setup feels like a workaround for something that should be readily available and built in a more robust way, not vibe-coded by someone like me.
Maybe I am just unaware, but I am looking for a more complete and non-hacky way of using the model's multimodal capabilities under 6 GB VRAM. So if anyone can guide me on this, it would be awesome!
P.S.: I tried mistral.rs, but its multimodal support seems to take a lot of extra VRAM for some reason?
overand@reddit
I'm disappointed with the audio functionality - not in a "it's bad" way, but in a "it doesn't do what I'd hoped it would."
I made a recording with the same sentence spoken in three very different tones of voice, and unfortunately, gemma-4-e4b:q8_0 (Unsloth) wasn't able to distinguish between them. It also couldn't identify "Twinkle Twinkle Little Star" being whistled, even when prompted that it's a song. Hilariously, I got:

So, it's not all a loss - according to Gemma 4 E4B, The Weeknd is indistinguishable from cats making "various vocalizations."
Checks out.
PrashantRanjan69@reddit (OP)
My use case is currently limited to transcription tasks only, but the reason I'm not going with Whisper is that I want the vast multilingual capability of Gemma 4 E4B.
Someone in the comments told me that llama-server does support audio inputs, so I'm going to try that out.
Parzival_3110@reddit
Nice workaround with the audio embeddings! llama.cpp multimodal support is evolving fast—check recent PRs for Gemma vision. For audio, precomputing embeddings like you did is smart for low VRAM. What's your laptop's GPU and tokens/sec?
PrashantRanjan69@reddit (OP)
I have a 3060 with 6 GB VRAM in my laptop, and this setup easily gives me around 10-15 tps. Prefill is much faster, though.
It also splits the audio into multiple chunks using VAD; the audio encoder sequentially generates the embeddings for each chunk and passes the entire audio's embeddings into the model, essentially bypassing the model's 30-second limit as well (roughly as sketched below).
I am also currently working on enabling batch processing of each chunk through parallelism, which should reduce the time for processing the audio.
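Roughly, the chunking idea could look like this. It is a sketch only: the energy-based splitter stands in for a real VAD such as Silero, and `encode_audio_array` is assumed to wrap the PyTorch audio encoder from the earlier sketch.

```python
# Sketch of the VAD-chunking idea: split long audio at silences, encode each
# chunk, and concatenate the embeddings so the combined sequence gets past the
# ~30 s per-clip limit. The energy threshold below is a stand-in for a real VAD
# (e.g. Silero); `encode_audio_array` is assumed to wrap the audio encoder above.
import numpy as np
import torch

def split_on_silence(wav: np.ndarray, sr: int, frame_ms: int = 30,
                     threshold: float = 1e-4, min_chunk_s: float = 5.0):
    """Split at low-energy frames once a chunk is at least min_chunk_s long."""
    frame = int(sr * frame_ms / 1000)
    min_len = int(sr * min_chunk_s)
    chunks, start = [], 0
    for i in range(0, len(wav) - frame, frame):
        energy = float(np.mean(wav[i:i + frame] ** 2))
        if energy < threshold and (i - start) >= min_len:
            chunks.append(wav[start:i])
            start = i
    if start < len(wav):
        chunks.append(wav[start:])
    return chunks

def embed_long_audio(wav: np.ndarray, sr: int, encode_audio_array) -> torch.Tensor:
    """Encode each chunk sequentially and concatenate along the time axis."""
    parts = [encode_audio_array(chunk, sr) for chunk in split_on_silence(wav, sr)]
    return torch.cat(parts, dim=1)  # (1, total_frames, hidden)
```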
wilo108@reddit
Don't feed the bots...
PrashantRanjan69@reddit (OP)
Oh... I didn't realise that it was a bot 🫪
grandong123@reddit
If I'm not wrong, llama.cpp already supports Gemma 4 audio input via the built-in web UI. Use the latest build. But for the pure STT use case, I don't know how to do that yet.
D2OQZG8l5BI1S06@reddit
It does indeed. For STT you can use the OpenAI /audio/transcriptions API with llama-server.
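A minimal sketch of what that call could look like from Python, assuming llama-server is running on port 8080 with the Gemma GGUF plus its audio projector and exposes the OpenAI-compatible endpoint as described (the model name is a placeholder):

```python
# Sketch of a transcription request against llama-server's OpenAI-compatible
# endpoint, assuming it is exposed at /v1/audio/transcriptions as described.
import requests

def transcribe(path: str, base_url: str = "http://127.0.0.1:8080/v1") -> str:
    with open(path, "rb") as f:
        resp = requests.post(
            f"{base_url}/audio/transcriptions",
            files={"file": f},
            data={"model": "gemma-4-e4b", "response_format": "json"},  # placeholder model name
        )
    resp.raise_for_status()
    return resp.json()["text"]

print(transcribe("clip.wav"))
```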
PrashantRanjan69@reddit (OP)
I'll try this right now! Maybe implement continuous batching around it to process long audio chunks in parallel (something like the sketch below).
Thanks for the info!
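A sketch of that parallel-chunk idea: llama-server handles continuous batching server-side, so the client can simply fire one request per chunk concurrently and stitch the transcripts back together in order. It reuses the hypothetical `transcribe` helper from the sketch above, with chunk paths assumed to come from the VAD splitter.

```python
# Sketch of parallel chunk transcription: fire one request per VAD chunk and
# join the transcripts in order; llama-server's continuous batching handles
# the concurrency server-side. Reuses the `transcribe` helper from above.
from concurrent.futures import ThreadPoolExecutor

def transcribe_chunks(chunk_paths: list[str], max_workers: int = 4) -> str:
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        texts = list(pool.map(transcribe, chunk_paths))  # map preserves input order
    return " ".join(texts)
```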
grandong123@reddit
HOLY, THANKS FOR THE INFO! I just learned about this.
Oscylator@reddit
You can try:
https://github.com/google-ai-edge/LiteRT-LM
For Gemma 4 it's faster (if your hardware is supported) and it seems to support all modalities. That said, it is less developed than llama.cpp, so you will probably need to code some piping to use it the way you want.
P.S. Consider sharing your vibe-coded solution, if it is not too sloppy.
PrashantRanjan69@reddit (OP)
I had tried LiteRT-LM before ultimately settling on llama.cpp. The thing is, they have limited support for Windows (they recommend using WSL). Also, their Python API is yet to come.
I think I should open-source my implementation, and maybe the community can make it more robust. I wouldn't say it's slop. However, I don't have much machine learning experience, and I built this using what I could research and read in a week. It works, but I think people with expertise in this area can improve it in ways I may not understand right now.