How do I use Gemma 4 video multimodality?
Posted by HornyGooner4401@reddit | LocalLLaMA | View on Reddit | 14 comments
I normally just chuck my models to LM Studio for a quick test, but it doesn't support video input. Neither does llama.cpp or Ollama.
How can I use the video understanding of Gemma 4 then?
ComplexType568@reddit
I think almost all models running on llama.cpp don't support video, if not all of them.
also, what a username you have
Stepfunction@reddit
Hey, bonus points for honesty!
Herr_Drosselmeyer@reddit
Where do you get the idea from that Gemma 4 supports video?
grumd@reddit
https://huggingface.co/blog/gemma4#video-understanding
Herr_Drosselmeyer@reddit
Odd that the main model card doesn't include this. But from skimming your link, it seems that video is not supported via llama.cpp or MLX. LM Studio and Ollama both rely on llama.cpp or MLX, so yeah, that's not going to work.
grumd@reddit
Yep. Can't do it with llama at the moment sadly
floconildo@reddit
There is a PR open for video support, but I don't expect that to arrive any time soon
antwon_dev@reddit
Have you tried LiteRT-LM by Google on GitHub? I’m trying to get the E4B audio modality working. Will let you know how it goes
Funny-Trash-4286@reddit
LiteRT-LM ASR works, but it's really bad compared to the ASR with the full model
antwon_dev@reddit
Thanks for letting me know, is there something else you’d recommend? Maybe vLLM?
Funny-Trash-4286@reddit
EDIT: It does work with MLX on Mac with this:
huggingface.co/FakeRockert543/gemma-4-e4b-it-MLX-4bit
I tried the 4 bit and the multilingual ASR is still very bad compared to the original version
Parakeet v3 = Gemma E4B IT 16-bit ASR (the ASR section weighs only 300 million params)
I think the ASR section should not be quantized to get good results with 4-bit
Funny-Trash-4286@reddit
There is no audio support for anything but the Transformers-based 16-bit version and LiteRT-LM
A contributor on llama.cpp is working on it
bitplenty@reddit
Use vLLM: https://docs.vllm.ai/projects/recipes/en/latest/Google/Gemma4.html
"Natively processes text and images (video supported via a custom vLLM processing pipeline that extracts frames; smaller gemma4-E2B and gemma-4-E4B also support audio)."
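Since vLLM's pipeline handles video by extracting frames and feeding them as images, you can also do the same thing yourself through the OpenAI-compatible server. A minimal sketch of building such a request, assuming the server is launched with `vllm serve` and that the model name below matches what you pulled (both are assumptions, check the recipe page for the exact invocation):

```python
import base64

def build_video_chat_payload(frames_jpeg: list, prompt: str,
                             model: str = "google/gemma-4-e4b-it") -> dict:
    """Build an OpenAI-style chat-completion payload that sends video
    frames as individual base64 data-URL images. The model name is a
    placeholder; substitute whatever checkpoint you are serving."""
    content = [{"type": "text", "text": prompt}]
    for jpeg in frames_jpeg:
        b64 = base64.b64encode(jpeg).decode("ascii")
        content.append({
            "type": "image_url",
            "image_url": {"url": f"data:image/jpeg;base64,{b64}"},
        })
    return {"model": model, "messages": [{"role": "user", "content": content}]}

# Then POST the payload to a running server, e.g.:
#   vllm serve google/gemma-4-e4b-it        (model name assumed)
#   POST http://localhost:8000/v1/chat/completions
```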
FusionCow@reddit
It doesn't support video input in the way you'd think: it takes frames of a video and tells you the general meaning of those frames. The bigger ones don't take in audio. But if you wanted to, just break a video into frames (up to 60, though I'd mess around with it, since it depends on video length) and give it the frames.
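The "break a video into up to 60 frames" step above can be sketched like this. This is just one way to do it, using OpenCV for decoding (an assumption about your setup; `pip install opencv-python`), with the even-spacing logic pulled out into its own function:

```python
def sample_indices(total_frames: int, max_frames: int = 60) -> list:
    """Pick up to max_frames evenly spaced frame indices."""
    if total_frames <= max_frames:
        return list(range(total_frames))
    step = total_frames / max_frames
    return [int(i * step) for i in range(max_frames)]

def extract_frames(path: str, max_frames: int = 60) -> list:
    """Decode a video and return up to max_frames evenly spaced frames."""
    import cv2  # imported lazily; pip install opencv-python
    cap = cv2.VideoCapture(path)
    total = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
    wanted = set(sample_indices(total, max_frames))
    frames, idx = [], 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if idx in wanted:
            frames.append(frame)
        idx += 1
    cap.release()
    return frames
```

Each returned frame can then be JPEG-encoded and sent to the model as an ordinary image.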