How do I use Gemma 4 video multimodality?
Posted by HornyGooner4401@reddit | LocalLLaMA | View on Reddit | 14 comments
I normally just chuck my models to LM Studio for a quick test, but it doesn't support video input. Neither does llama.cpp or Ollama.
How can I use the video understanding of Gemma 4 then?
ComplexType568@reddit
I think almost all models running on llama.cpp don't support video, if not all of them.
also, what a username you have
Stepfunction@reddit
Hey, bonus points for honesty!
Herr_Drosselmeyer@reddit
Where do you get the idea from that Gemma 4 supports video?
grumd@reddit
https://huggingface.co/blog/gemma4#video-understanding
Herr_Drosselmeyer@reddit
Odd that the main model card doesn't include this. But from skimming your link, it seems that video is not supported via llama.cpp or MLX. LM Studio and Ollama both rely on llama.cpp or MLX, so yeah, that's not going to work.
grumd@reddit
Yep. Can't do it with llama at the moment sadly
floconildo@reddit
There is a PR open for video support, but I don't expect that to arrive any time soon
antwon_dev@reddit
Have you tried LiteRT-LM by Google on GitHub? I’m trying to get the E4B audio modality working. Will let you know how it goes
Funny-Trash-4286@reddit
LiteRT-LM ASR works, but it's really bad compared to the ASR with the full model
antwon_dev@reddit
Thanks for letting me know, is there something else you’d recommend? Maybe vLLM?
Funny-Trash-4286@reddit
EDIT: It does work with MLX on Mac with this:
huggingface.co/FakeRockert543/gemma-4-e4b-it-MLX-4bit
I tried the 4 bit and the multilingual ASR is still very bad compared to the original version
Parakeet v3 = Gemma E4B IT 16-bit ASR (the ASR section weighs only 300 million params)
I think the ASR section should not be quantized to get good results with 4-bit
Funny-Trash-4286@reddit
There is no audio support for anything but the Transformers-based 16-bit version and LiteRT-LM
A contributor on llama.cpp is working on it
bitplenty@reddit
Use vLLM: https://docs.vllm.ai/projects/recipes/en/latest/Google/Gemma4.html
"Natively processes text and images (video supported via a custom vLLM processing pipeline that extracts frames; smaller gemma4-E2B and gemma-4-E4B also support audio)."
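Since vLLM's pipeline handles video by extracting frames and feeding them as images, you can also do the same thing yourself through the OpenAI-compatible server. A minimal sketch of building such a request, assuming the server is launched with `vllm serve` and that the model name below matches what you pulled (both are assumptions, check the recipe page for the exact invocation):

```python
import base64

def build_video_chat_payload(frames_jpeg: list, prompt: str,
                             model: str = "google/gemma-4-e4b-it") -> dict:
    """Build an OpenAI-style chat-completion payload that sends video
    frames as individual base64 data-URL images. The model name is a
    placeholder; substitute whatever checkpoint you are serving."""
    content = [{"type": "text", "text": prompt}]
    for jpeg in frames_jpeg:
        b64 = base64.b64encode(jpeg).decode("ascii")
        content.append({
            "type": "image_url",
            "image_url": {"url": f"data:image/jpeg;base64,{b64}"},
        })
    return {"model": model, "messages": [{"role": "user", "content": content}]}

# Then POST the payload to a running server, e.g.:
#   vllm serve google/gemma-4-e4b-it        (model name assumed)
#   POST http://localhost:8000/v1/chat/completions
```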
FusionCow@reddit
It doesn't support video input in the way you'd think: it takes frames of a video and tells you the general meaning of those frames. The bigger ones don't take in audio. But if you wanted to, just break a video into frames (up to 60, though I'd mess around with it, since it depends on video length) and give it the frames.
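The "break a video into up to 60 frames" step above can be sketched like this. This is just one way to do it, using OpenCV for decoding (an assumption about your setup; `pip install opencv-python`), with the even-spacing logic pulled out into its own function:

```python
def sample_indices(total_frames: int, max_frames: int = 60) -> list:
    """Pick up to max_frames evenly spaced frame indices."""
    if total_frames <= max_frames:
        return list(range(total_frames))
    step = total_frames / max_frames
    return [int(i * step) for i in range(max_frames)]

def extract_frames(path: str, max_frames: int = 60) -> list:
    """Decode a video and return up to max_frames evenly spaced frames."""
    import cv2  # imported lazily; pip install opencv-python
    cap = cv2.VideoCapture(path)
    total = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
    wanted = set(sample_indices(total, max_frames))
    frames, idx = [], 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if idx in wanted:
            frames.append(frame)
        idx += 1
    cap.release()
    return frames
```

Each returned frame can then be JPEG-encoded and sent to the model as an ordinary image.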