Audio processing landed in llama-server with Gemma-4
Posted by srigi@reddit | LocalLLaMA | 51 comments

Ladies and gentlemen, it is a great pleasure to confirm that llama.cpp (llama-server) now supports STT with the Gemma-4 E2B and E4B models.
theagenthubai@reddit
has anyone tested this with longer audio clips? curious about latency vs whisper?
Mashic@reddit
I wonder if it's better than Whisper at transcription.
MoffKalast@reddit
Tbf, Parakeet is already better than Whisper.
andy2na@reddit
parakeet is amazing and extremely fast even on CPU. wondering how gemma4 is compared to parakeet, bummer that only E2B and E4B have it
Mashic@reddit
Doesn't support a lot of languages.
MoffKalast@reddit
Neither does Whisper at any usable WER.
Mashic@reddit
At least it supports Asian languages like Japanese, Korean, and Chinese.
ArtfulGenie69@reddit
Qwen asr has more languages. Maybe it could work in your project?
citrusalex@reddit
Qwen-asr is pretty slow
adeadbeathorse@reddit
How often does an extra couple of minutes on top of a couple of minutes really matter when transcribing lengthy audio?
Competitive_Travel16@reddit
What languages does Gemma4 support for STT?
Mashic@reddit
I don't really think there is a list.
Competitive_Travel16@reddit
The Model Card says "multilingual support in over 140 languages" but I'm not sure if that is true for STT -- https://ai.google.dev/gemma/docs/core/model_card_4
citrusalex@reddit
Canary is even better and supports language selection.
Ylsid@reddit
Thanks for watching and hit subscribe
florinandrei@reddit
That's why it's called Whisper.
rhinodevil@reddit
Problems with the silence didn't happen to me. Maybe a configuration issue? Or are you using one of the smaller models? Is Parakeet also better than Whisper for languages other than English?
llama-impersonator@reddit
[Applause] hear, hear
Bakoro@reddit
It makes me wonder what the tokenization process is like. There are lots of kinds of "silence", and most of it is really more like "ambient noise".
There probably isn't nearly enough labeling of ambient noise and high gain on the microphone. That stuff should probably get a few dozen tokens of its own, and explicit labels in the data.
Space_Pirate_R@reddit
Or use some sort of light preprocessing like a noise gate.
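A noise gate like that can be sketched in a handful of lines. This is a toy version over a mono float sample list; the threshold and frame size are arbitrary illustrative values, not tuned recommendations:

```python
def noise_gate(samples, threshold=0.02, frame_size=4):
    """Zero out frames whose peak amplitude stays below the threshold."""
    out = list(samples)
    for start in range(0, len(out), frame_size):
        frame = out[start:start + frame_size]
        if max(abs(s) for s in frame) < threshold:
            # Whole frame is quiet: treat it as silence.
            for i in range(start, start + len(frame)):
                out[i] = 0.0
    return out
```

A real gate would add attack/release smoothing so word edges aren't clipped, but even this crude version strips low-level hiss before it reaches the model.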
Ayumu_Kasuga@reddit
I don't know, I've tried both parakeet and whisper, and whisper in my experience is a lot better at understanding stuff like "commit to git"
EbbNorth7735@reddit
I did not know this. I should probably get a FastAPI server going for it and test it out.
justletmesignupalre@reddit
Yep, me too. I hope someone tries this
Cosmicdev_058@reddit
Honestly the thing I'm most excited about here is not the transcription quality, it's that I can stop babysitting a separate Whisper container. Running two inference processes side by side, splitting VRAM between them, restarting Whisper when it inevitably hangs on a long audio file at 2am. Collapsing that into one llama-server process is a genuine quality of life upgrade even if Gemma's STT is slightly worse right now.
That said, Chromix_'s notes about it looping and dying on anything over 30 seconds are a bit concerning. I have a use case that needs to handle 10 to 15 minute recordings, and that is a dealbreaker until it stabilizes. Going to keep running Whisper alongside it for now but will be watching the PRs closely.
GroundbreakingMall54@reddit
wait so native audio support actually works in llama.cpp now? this is huge. been waiting for this instead of having to spin up a whole separate whisper pipeline
srigi@reddit (OP)
Agree, with llama-server supporting this in its REST API, you can create "speak to your agent" (STT) solutions with fully local processing.
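For illustration, a fully local "speak to your agent" call could build its request like this. A sketch only: it assumes llama-server accepts an OpenAI-style `input_audio` content part on `/v1/chat/completions`, and the exact field names are an assumption, so check the server docs for your build:

```python
import base64

def build_transcription_request(wav_bytes, prompt="Transcription:"):
    """Build an OpenAI-style chat payload carrying a WAV clip.
    The input_audio content shape is assumed, not verified."""
    audio_b64 = base64.b64encode(wav_bytes).decode("ascii")
    return {
        "messages": [{
            "role": "user",
            "content": [
                {"type": "input_audio",
                 "input_audio": {"data": audio_b64, "format": "wav"}},
                {"type": "text", "text": prompt},
            ],
        }],
        "temperature": 0.0,  # deterministic decoding for transcription
    }
```

POST the returned dict as JSON to the server (e.g. `http://localhost:8080/v1/chat/completions`) and read the transcript from the response's message content.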
Protheu5@reddit
Can I do it on my phone? That would be incredibly helpful during trips abroad.
rm-rf-rm@reddit
But it's only with Gemma E*B right? You can't use Whisper, Parakeet etc.?
Aerroon@reddit
I was able to use an older Voxtral model about a month ago.
RIP26770@reddit
Done! Via llama-swap.
iadanos@reddit
Could you please post an example?
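A minimal sketch of the underlying llama-server invocation such a llama-swap setup would wrap. The GGUF filenames below are hypothetical placeholders, not real download names; substitute your own model and mmproj files:

```shell
# Hypothetical paths; replace with your actual model and mmproj GGUFs.
llama-server \
  -m gemma-E4B-Q8_XL.gguf \
  --mmproj mmproj-gemma-BF16.gguf \
  --port 8081
```

llama-swap then spawns and routes requests to a command like this per-model; the YAML side of the config is described in its README.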
ZootAllures9111@reddit
didn't the CLI already have this though? Isn't this just a matter of "they exposed an existing feature in their (still overly basic IMO) webui"?
EbbNorth7735@reddit
Does it support any sort of pause detection or streaming or is it like batch processing sort of thing?
RebouncedCat@reddit
you'd have to use a VAD for that.
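An energy-based VAD of that kind can be approximated in pure Python. A toy sketch over a mono float stream; the threshold and pause length are illustrative, and real use would compute RMS energy over frames rather than compare single samples:

```python
def split_on_silence(samples, rate=16000, threshold=0.01, min_pause=0.3):
    """Very rough VAD: split a mono sample stream into chunks
    separated by pauses of at least min_pause seconds."""
    min_gap = int(min_pause * rate)
    chunks, current, quiet = [], [], 0
    for s in samples:
        quiet = quiet + 1 if abs(s) < threshold else 0
        current.append(s)
        if quiet >= min_gap and len(current) > quiet:
            # Pause detected: emit the voiced part, drop the silence.
            chunks.append(current[:-quiet])
            current, quiet = [], 0
    if any(abs(s) >= threshold for s in current):
        chunks.append(current)
    return chunks
```

Each returned chunk can then be sent as its own short clip, which also sidesteps the long-audio trouble mentioned elsewhere in the thread.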
seefatchai@reddit
Can it do song lyrics?
AcaciaBlue@reddit
I'm kinda new here, did any other software support this before (Like LM Studio?). Is audio processing also available in the PolarQuant branch?
ZootAllures9111@reddit
I mean I think the llama.cpp CLI executable has had flags to support audio capable LLMs for a while. So this is more just a UI thing. But no LM Studio doesn't expose those flags currently.
Chromix_@reddit
It seems that there are some issues left to be ironed out. In the current state it's mostly unusable for me for 5+ minutes of audio - Voxtral works way better. I'm using E4B as Q8_XL quant with BF16 mmproj (recommended, as other mmproj formats lead to degraded capabilities)
`llama-context.cpp:1601: GGML_ASSERT((cparams.causal_attn || cparams.n_ubatch >= n_tokens_all) && "non-causal attention requires n_ubatch >= n_tokens") failed`
Increasing -ub makes it proceed here. According to the original readme, you shouldn't just use "transcribe this text", but follow these exact templates for better result quality:
Transcription:
Follow these specific instructions for formatting the answer:
* Only output the transcription, with no newlines.
* When transcribing numbers, write the digits, i.e. write 1.7 and not one point seven, and write 3 instead of three.
Translation:
Aerroon@reddit
About a month ago I used some older Voxtral model for transcribing using llama-server. At the time I could do about 2 minute audio clips. Anything longer than that crashed.
The specific transcription instructions are good to know though. That Voxtral model would just randomly fail when I wrote "transcribe the audio".
Mistercheese@reddit
Which voxtral model? 4B tts with vllm Omni?
Chromix_@reddit
I've initially tried Voxtral-Mini-3B-2507-bf16 in transcription mode (not properly implemented in llama.cpp last time I checked). My results with Voxtral-Small-24B-2507-Q5_K_L were way better for longer snippets. It also provided proper capitalization, dictation artifact removal, and word fixing in context.
EbbNorth7735@reddit
So does it work with this system prompt with <5 minute audio segments?
Chromix_@reddit
It's not the system prompt, but the text prompt to accompany the audio snippet.
Quality depends. Regular spoken snippets under 30 seconds worked fine for me so far, except for song lyrics, even when there's barely any background music, like for example the first 30 seconds of "Warriors".
Skystunt@reddit
Super great news !
c64z86@reddit
I can't even use it, just gives this error:
"encoding audio slice...
audio slice encoded in 86 ms
decoding audio batch 1/1, n_tokens_batch = 750
D:/a/llama.cpp/llama.cpp/src/llama-context.cpp:1601: GGML_ASSERT((cparams.causal_attn || cparams.n_ubatch >= n_tokens_all) && "non-causal attention requires n_ubatch >= n_tokens") failed"
Enthu-Cutlet-1337@reddit
Nice, but watch the VRAM hit: audio tokenization and STT usually push context pressure up fast. On 8GB cards this is probably GGUF-only territory unless the model is tiny; would love a rough ms/sec benchmark on CPU vs CUDA.
ML-Future@reddit
Do we need new benchmarks for this?
ML-Future@reddit
Tested in Spanish: not perfect, but pretty accurate. I like it. Better than Whisper for sure.
AppealThink1733@reddit
So good !
El_90@reddit
Does mic>text appear in this timeline?
Or do we need to still record (potentially convert) and then upload a solid file?
I vibe coded a workaround, but having it native in the solution would be amazing.