Whisper.cpp is underwhelming

Posted by Larkonath@reddit | LocalLLaMA | 17 comments

Hi, I'm running whisper.cpp with the best model I could find (ggml-large-v3), but after about 20 minutes of transcription it hallucinates a sentence and then repeats it endlessly until the end of the file.

Is there something I'm missing, or should I cut my files into pieces of about 20 minutes?

SeoFood@reddit

Yeah, this is a pretty common failure mode with long Whisper runs. I wouldn’t treat it as “large-v3 is bad” so much as “one long unbroken decode can go off the rails.”

A few things I’d try:

Split the audio into smaller chunks, e.g. 5–10 min, ideally on silence rather than fixed timestamps.
If you’re using whisper.cpp directly, experiment with temperature fallback / no-speech thresholds / context settings. Carrying too much previous context can sometimes make repetition worse.
Try the same file with faster-whisper or another implementation just to isolate whether it’s the model, the implementation, or your audio.
If the audio has long silence/noise/music sections, run VAD first and transcribe only speech segments.
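
The "split on silence rather than fixed timestamps" idea can be sketched roughly like this. It is only a planning helper under my own assumptions: `plan_cuts`, `cuts_to_chunks`, and all parameter names are hypothetical, and the silence spans are assumed to come from whatever VAD or silence detector you already have. It does not touch the audio itself.

```python
def plan_cuts(silences, total_s, target_s=600.0, tolerance_s=120.0):
    """Return cut times (seconds) splitting [0, total_s] into ~target_s chunks.

    silences: list of (start, end) silent spans in seconds.
    Each cut goes to the midpoint of the silence closest to the ideal cut
    time; if no silence lies within tolerance_s, fall back to a hard cut.
    """
    midpoints = sorted((s + e) / 2.0 for s, e in silences)
    cuts = []
    last = 0.0
    # Stop once the remaining audio fits in one (slightly oversized) chunk.
    while total_s - last > target_s + tolerance_s:
        ideal = last + target_s
        candidates = [
            m for m in midpoints
            if m > last and abs(m - ideal) <= tolerance_s
        ]
        if candidates:
            cut = min(candidates, key=lambda m: abs(m - ideal))
        else:
            cut = ideal  # no usable silence nearby: hard cut
        cuts.append(cut)
        last = cut
    return cuts


def cuts_to_chunks(cuts, total_s):
    """Turn cut times into (start, end) chunk spans covering the whole file."""
    bounds = [0.0] + list(cuts) + [total_s]
    return list(zip(bounds[:-1], bounds[1:]))
```

Each resulting span can then be cut out with your audio tool of choice and transcribed as a separate run, so a loop in one chunk cannot spill into the next.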

For practical transcription workflows, chunking is usually the boring answer that works. Long single-pass transcription looks cleaner in theory, but once it starts looping there’s not much to recover except rerunning from a clean boundary.
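
To find that clean boundary without reading the whole transcript, a simple check for repeated consecutive segments is usually enough. This is a minimal sketch, assuming you already have the transcript as a list of segment strings in order; `find_loop_start` and `min_repeats` are names I made up, not part of whisper.cpp.

```python
def find_loop_start(segments, min_repeats=3):
    """Return the index of the first segment in a run of at least
    min_repeats identical consecutive segments, or None if there is none.

    Comparison ignores case and extra whitespace.
    """
    def norm(text):
        return " ".join(text.lower().split())

    run_start = 0
    for i in range(1, len(segments) + 1):
        # Extend the current run while the text keeps repeating.
        if i < len(segments) and norm(segments[i]) == norm(segments[run_start]):
            continue
        # Run ended (or input ended): was it long enough to count as a loop?
        if i - run_start >= min_repeats:
            return run_start
        run_start = i
    return None
```

If it returns an index, the timestamp of that segment is a reasonable place to restart from, ideally moved back to the nearest silence.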

llama-impersonator@reddit

whisperx is a more developed pipeline for whisper models, imo, though i prefer parakeet

hainesk@reddit

Implementing VAD helps with the looping. It usually happens during a break in audio, like if there is a long period of silence.
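
For anyone wondering what VAD does in this context, here is a deliberately crude illustration. It is an energy threshold, not a real model-based VAD, and `speech_segments` with its parameters is a hypothetical helper; the point is only that silent stretches are dropped before the transcriber ever sees them.

```python
import math


def speech_segments(samples, sample_rate, frame_ms=30, threshold=0.01,
                    min_silence_ms=300):
    """Return (start_s, end_s) spans whose per-frame RMS reaches threshold.

    samples: mono audio as floats in [-1, 1].
    Gaps no longer than min_silence_ms between loud frames are bridged, so
    short pauses inside a sentence do not split it.
    """
    frame_len = max(1, int(sample_rate * frame_ms / 1000))
    max_gap = min_silence_ms / 1000.0
    segments = []
    for start in range(0, len(samples), frame_len):
        frame = samples[start:start + frame_len]
        rms = math.sqrt(sum(x * x for x in frame) / len(frame))
        if rms < threshold:
            continue  # treat this frame as silence
        t0 = start / sample_rate
        t1 = (start + len(frame)) / sample_rate
        if segments and t0 - segments[-1][1] <= max_gap:
            # Close enough to the previous speech span: extend it.
            segments[-1] = (segments[-1][0], t1)
        else:
            segments.append((t0, t1))
    return segments
```

A fixed threshold like this breaks down with background noise or music, which is why a trained VAD is the better choice in practice.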

iMakeSense@reddit

https://huggingface.co/spaces/hf-audio/open_asr_leaderboard

iMakeSense@reddit

https://www.reddit.com/r/LocalLLaMA/comments/1rlqfd7/we_collected_135_phrases_whisper_hallucinates/

llitz@reddit

If you are doing English only, V2 works better

Regarding the looping... I saw someone saying it is about file size, etc., but that's not my case: v3 loops for me even in the first few seconds of the audio.

With V2, it transcribed some long 1h videos without issues.

dangerous_inference@reddit

I have become a big fan of Qwen3 1.7B ASR recently. You have to chunk audio sent to it, but it is fast and trivial to run.

nuclearbananana@reddit

Qwen3 ASR is super slow for me. Even Cohere's 2B model is faster than the 0.6B Qwen. I think it's the LLM decode stage. Maybe it's better if you have a GPU.

dangerous_inference@reddit

Well, yeah, you need a GPU to run it. I have it on my server and all the voice clients in my house are instant.

cibernox@reddit

Also, Whisper has long been surpassed by other models, like Parakeet/Canary from NVIDIA, that are both faster and more accurate.

thecstep@reddit

Like others mentioned, medium.en seems to have the best results for whatever reason.

tinny66666@reddit

Vosk works much better for me. It's faster and more accurate.

ttkciar@reddit

That's Russian-only, right?

tinny66666@reddit

No. I use it for English. It works well.

noctrex@reddit

I have successfully transcribed 3-hour sessions, but with the medium model. If the audio is in English, use the medium.en model for better accuracy.

RogerRamjet999@reddit

I ran whisper.cpp on a set of about 30 one-hour meetings and never saw any issues. I was running the medium model, though.

DeltaSqueezer@reddit

Whisper was trained with certain audio lengths in mind. You need to break the audio down into chunks.