Audio processing landed in llama-server with Gemma-4
Posted by srigi@reddit | LocalLLaMA | 51 comments

Ladies and gentlemen, it is a great pleasure to confirm that llama.cpp (llama-server) now supports STT with the Gemma-4 E2B and E4B models.
theagenthubai@reddit
has anyone tested this with longer audio clips? curious about latency vs whisper?
Mashic@reddit
I wonder if it's better than Whisper at transcription.
MoffKalast@reddit
Tbf, Parakeet is already better than Whisper.
andy2na@reddit
parakeet is amazing and extremely fast even on CPU. wondering how gemma4 is compared to parakeet, bummer that only E2B and E4B have it
Mashic@reddit
Doesn't support a lot of languages.
MoffKalast@reddit
Neither does Whisper at any usable WER.
Mashic@reddit
At least it supports Asian languages like Japanese, Korean, and Chinese.
ArtfulGenie69@reddit
Qwen asr has more languages. Maybe it could work in your project?
citrusalex@reddit
Qwen-asr is pretty slow
adeadbeathorse@reddit
How often does an extra couple of minutes on top of a couple of minutes really matter when transcribing lengthy audio?
Competitive_Travel16@reddit
What languages does Gemma4 support for STT?
Mashic@reddit
I don't really think there is a list.
Competitive_Travel16@reddit
The Model Card says "multilingual support in over 140 languages" but I'm not sure if that is true for STT -- https://ai.google.dev/gemma/docs/core/model_card_4
citrusalex@reddit
Canary is even better and supports language selection.
Ylsid@reddit
Thanks for watching and hit subscribe
florinandrei@reddit
That's why it's called Whisper.
rhinodevil@reddit
Problems with the silence didn't happen to me. Maybe a configuration issue? Or are you using one of the smaller models? Is Parakeet also better than Whisper for languages other than English?
llama-impersonator@reddit
[Applause] hear, hear
Bakoro@reddit
It makes me wonder what the tokenization process is like. There are lots of kinds of "silence", and most of it is really more like "ambient noise".
There probably isn't nearly enough labeling of ambient noise and high gain on the microphone. That stuff should probably get a few dozen tokens of its own, and explicit labels in the data.
Space_Pirate_R@reddit
Or use some sort of light preprocessing like a noise gate.
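A noise gate like that can be sketched in a handful of lines. This is a toy version over a mono float sample list; the threshold and frame size are arbitrary illustrative values, not tuned recommendations:

```python
def noise_gate(samples, threshold=0.02, frame_size=4):
    """Zero out frames whose peak amplitude stays below the threshold."""
    out = list(samples)
    for start in range(0, len(out), frame_size):
        frame = out[start:start + frame_size]
        if max(abs(s) for s in frame) < threshold:
            # Whole frame is quiet: treat it as silence.
            for i in range(start, start + len(frame)):
                out[i] = 0.0
    return out
```

A real gate would add attack/release smoothing so word edges aren't clipped, but even this crude version strips low-level hiss before it reaches the model.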
Ayumu_Kasuga@reddit
I don't know, I've tried both parakeet and whisper, and whisper in my experience is a lot better at understanding stuff like "commit to git"
EbbNorth7735@reddit
I did not know this. I should probably get a FastAPI server going for it and test it out.
justletmesignupalre@reddit
Yep, me too. I hope someone tries this
Cosmicdev_058@reddit
Honestly the thing I'm most excited about here is not the transcription quality, it's that I can stop babysitting a separate Whisper container. Running two inference processes side by side, splitting VRAM between them, restarting Whisper when it inevitably hangs on a long audio file at 2am. Collapsing that into one llama-server process is a genuine quality of life upgrade even if Gemma's STT is slightly worse right now.
That said, Chromix_'s notes about it looping and dying on anything over 30 seconds are a bit concerning. I have a use case that needs to handle 10 to 15 minute recordings, and that is a dealbreaker until it stabilizes. Going to keep running Whisper alongside it for now but will be watching the PRs closely.
GroundbreakingMall54@reddit
wait so native audio support actually works in llama.cpp now? this is huge. been waiting for this instead of having to spin up a whole separate whisper pipeline
srigi@reddit (OP)
Agree, with llama-server supporting this in its REST API, you can create "speak to your agent" (STT) solutions with fully local processing.
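For illustration, a fully local "speak to your agent" call could build its request like this. A sketch only: it assumes llama-server accepts an OpenAI-style `input_audio` content part on `/v1/chat/completions`, and the exact field names are an assumption, so check the server docs for your build:

```python
import base64

def build_transcription_request(wav_bytes, prompt="Transcription:"):
    """Build an OpenAI-style chat payload carrying a WAV clip.
    The input_audio content shape is assumed, not verified."""
    audio_b64 = base64.b64encode(wav_bytes).decode("ascii")
    return {
        "messages": [{
            "role": "user",
            "content": [
                {"type": "input_audio",
                 "input_audio": {"data": audio_b64, "format": "wav"}},
                {"type": "text", "text": prompt},
            ],
        }],
        "temperature": 0.0,  # deterministic decoding for transcription
    }
```

POST the returned dict as JSON to the server (e.g. `http://localhost:8080/v1/chat/completions`) and read the transcript from the response's message content.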
Protheu5@reddit
Can I do it on my phone? That would be incredibly helpful during trips abroad.
rm-rf-rm@reddit
But it's only with Gemma E*B right? You can't use Whisper, Parakeet etc.?
Aerroon@reddit
I was able to use an older Voxtral model about a month ago.
RIP26770@reddit
Done! Via llama-swap.
iadanos@reddit
Could you please post an example?
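A minimal sketch of the underlying llama-server invocation such a llama-swap setup would wrap. The GGUF filenames below are hypothetical placeholders, not real download names; substitute your own model and mmproj files:

```shell
# Hypothetical paths; replace with your actual model and mmproj GGUFs.
llama-server \
  -m gemma-E4B-Q8_XL.gguf \
  --mmproj mmproj-gemma-BF16.gguf \
  --port 8081
```

llama-swap then spawns and routes requests to a command like this per-model; the YAML side of the config is described in its README.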
ZootAllures9111@reddit
didn't the CLI already have this though? Isn't this just a matter of "they exposed an existing feature in their (still overly basic IMO) webui"?
EbbNorth7735@reddit
Does it support any sort of pause detection or streaming or is it like batch processing sort of thing?
RebouncedCat@reddit
you'd have to use a VAD for that.
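An energy-based VAD of that kind can be approximated in pure Python. A toy sketch over a mono float stream; the threshold and pause length are illustrative, and real use would compute RMS energy over frames rather than compare single samples:

```python
def split_on_silence(samples, rate=16000, threshold=0.01, min_pause=0.3):
    """Very rough VAD: split a mono sample stream into chunks
    separated by pauses of at least min_pause seconds."""
    min_gap = int(min_pause * rate)
    chunks, current, quiet = [], [], 0
    for s in samples:
        quiet = quiet + 1 if abs(s) < threshold else 0
        current.append(s)
        if quiet >= min_gap and len(current) > quiet:
            # Pause detected: emit the voiced part, drop the silence.
            chunks.append(current[:-quiet])
            current, quiet = [], 0
    if any(abs(s) >= threshold for s in current):
        chunks.append(current)
    return chunks
```

Each returned chunk can then be sent as its own short clip, which also sidesteps the long-audio trouble mentioned elsewhere in the thread.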
seefatchai@reddit
Can it do song lyrics?
AcaciaBlue@reddit
I'm kinda new here, did any other software support this before (Like LM Studio?). Is audio processing also available in the PolarQuant branch?
ZootAllures9111@reddit
I mean I think the llama.cpp CLI executable has had flags to support audio capable LLMs for a while. So this is more just a UI thing. But no LM Studio doesn't expose those flags currently.
Chromix_@reddit
It seems that there are some issues left to be ironed out. In the current state it's mostly unusable for me for 5+ minutes of audio - Voxtral works way better. I'm using E4B as Q8_XL quant with BF16 mmproj (recommended, as other mmproj formats lead to degraded capabilities)
`llama-context.cpp:1601: GGML_ASSERT((cparams.causal_attn || cparams.n_ubatch >= n_tokens_all) && "non-causal attention requires n_ubatch >= n_tokens") failed`
Increasing -ub makes it proceed here. According to the original readme, you shouldn't just use "transcribe this text", but follow these exact templates for better result quality:
Transcription:
Follow these specific instructions for formatting the answer:
* Only output the transcription, with no newlines.
* When transcribing numbers, write the digits, i.e. write 1.7 and not one point seven, and write 3 instead of three.
Translation:
Aerroon@reddit
About a month ago I used some older Voxtral model for transcribing using llama-server. At the time I could do about 2 minute audio clips. Anything longer than that crashed.
The specific transcription instructions are good to know though. That Voxtral model would just randomly fail when I wrote "transcribe the audio".
Mistercheese@reddit
Which voxtral model? 4B tts with vllm Omni?
Chromix_@reddit
I've initially tried Voxtral-Mini-3B-2507-bf16 in transcription mode (not properly implemented in llama.cpp last time I checked). My results with Voxtral-Small-24B-2507-Q5_K_L were way better for longer snippets. It also provided proper capitalization, dictation artifact removal, and word fixing in context.
EbbNorth7735@reddit
So does it work with this system prompt with <5 minute audio segments?
Chromix_@reddit
It's not the system prompt, but the text prompt to accompany the audio snippet.
Quality depends. Regular spoken snippets under 30 seconds worked fine for me so far, except for song lyrics, even when there's barely any background music, like for example the first 30 seconds of "Warriors".
Skystunt@reddit
Super great news !
c64z86@reddit
I can't even use it, just gives this error:
"encoding audio slice...
audio slice encoded in 86 ms
decoding audio batch 1/1, n_tokens_batch = 750
D:/a/llama.cpp/llama.cpp/src/llama-context.cpp:1601: GGML_ASSERT((cparams.causal_attn || cparams.n_ubatch >= n_tokens_all) && "non-causal attention requires n_ubatch >= n_tokens") failed"
Enthu-Cutlet-1337@reddit
Nice, but watch the VRAM hit: audio tokenization and STT usually push context pressure up fast. On 8GB cards this is probably GGUF-only territory unless the model is tiny; would love a rough ms/sec benchmark on CPU vs CUDA.
ML-Future@reddit
Do we need new benchmarks for this?
ML-Future@reddit
Tested in Spanish: not perfect, but pretty accurate. I like it. Better than Whisper for sure.
AppealThink1733@reddit
So good !
El_90@reddit
Does mic>text appear in this timeline?
Or do we need to still record (potentially convert) and then upload a solid file?
I vibe coded a workaround, but having it native in the solution would be amazing.