What framework support audio / video input for gemma 4?

Posted by ResponsibleTruck4717@reddit | LocalLLaMA | View on Reddit | 4 comments

I tried with transformers but it was too slow.

llama.cpp doesnt support it.

So any good framework?

[-]

TokenRingAI@reddit

I haven't verified that it supports Gemma 4 in particular, but VLLM supports single/multi image, video, and audio input.

[-]

KokaOP@reddit

not audio, tested it just now, the docs are sheet, gemma4 requires latest vllm which has command for image and audio, exmaples are wacked , TBH just wait for llama.cpp

[-]

TokenRingAI@reddit

VLLM supports audio, have not tested it specifically with Gemma 4

https://docs.vllm.ai/en/stable/features/multimodal_inputs/#audio-inputs_1

VLLM is miles ahead of llama.cpp when it comes to fully supporting model features.

[-]

No-Blood-9115@reddit

you can search github. I remember seeing a framework handling visual input. but I forgot the name. mlx VL?