Whisper.cpp is underwhelming
Posted by Larkonath@reddit | LocalLLaMA | 17 comments
Hi, I'm running whisper.cpp with the best model I could find (ggml-large-v3) but after about 20 min of transcription it hallucinates a sentence that it will repeat endlessly until the end.
Is there something I'm missing or should I cut my files to about 20 minutes length?
SeoFood@reddit
Yeah, this is a pretty common failure mode with long Whisper runs. I wouldn’t treat it as “large-v3 is bad” so much as “one long unbroken decode can go off the rails.”
A few things I’d try:
For practical transcription workflows, chunking is usually the boring answer that works. Long single-pass transcription looks cleaner in theory, but once it starts looping there’s not much to recover except rerunning from a clean boundary.
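To find that clean boundary, one option is to scan the transcript for the point where the same line starts repeating. The snippet below is a minimal sketch, assuming the transcript is already available as a list of segment strings in order; `find_loop_start` and the `min_repeats` threshold are hypothetical names picked for illustration, not part of whisper.cpp.

```python
def find_loop_start(segments, min_repeats=4):
    """Return the index of the first segment in a run of at least
    `min_repeats` identical consecutive segments, or None if no such run exists.

    Comparison ignores case and surrounding whitespace. The threshold is an
    arbitrary illustrative default, not a value taken from whisper.cpp.
    """
    run_start = 0
    for i in range(1, len(segments) + 1):
        same = (
            i < len(segments)
            and segments[i].strip().lower() == segments[run_start].strip().lower()
        )
        if not same:
            # The run [run_start, i) just ended; report it if it is long enough.
            if i - run_start >= min_repeats:
                return run_start
            run_start = i
    return None


transcript = [
    "Welcome to the meeting.",
    "Let's go over the agenda.",
    "Thank you for watching.",
    "Thank you for watching.",
    "Thank you for watching.",
    "Thank you for watching.",
]
loop_index = find_loop_start(transcript)
```

Everything before `loop_index` can be kept, and the audio can be re-transcribed from the timestamp of that segment onward.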
llama-impersonator@reddit
whisperx is a more developed pipeline for whisper models, imo, though i prefer parakeet
hainesk@reddit
Implementing VAD helps with the looping. It usually happens during a break in audio, like if there is a long period of silence.
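To illustrate the idea of locating those silent stretches, here is a rough sketch that uses a plain energy threshold. This is a stand-in, not a real VAD: actual pipelines use a trained model, and the function name `silent_spans` and every threshold below are assumptions made for the example. It expects mono samples as floats in the range -1.0 to 1.0.

```python
import math


def silent_spans(samples, sample_rate, frame_ms=30, threshold=0.01, min_silence_s=2.0):
    """Return (start_s, end_s) spans where frame RMS stays below `threshold`
    for at least `min_silence_s` seconds. Crude energy gate, not a trained VAD."""
    frame_len = max(1, int(sample_rate * frame_ms / 1000))
    n_frames = len(samples) // frame_len  # a trailing partial frame is ignored
    spans = []
    silence_start = None
    for f in range(n_frames):
        frame = samples[f * frame_len:(f + 1) * frame_len]
        rms = math.sqrt(sum(x * x for x in frame) / len(frame))
        t = f * frame_len / sample_rate  # start time of this frame in seconds
        if rms < threshold:
            if silence_start is None:
                silence_start = t
        elif silence_start is not None:
            # Sound resumed: keep the silent stretch only if it was long enough.
            if t - silence_start >= min_silence_s:
                spans.append((silence_start, t))
            silence_start = None
    end = n_frames * frame_len / sample_rate
    if silence_start is not None and end - silence_start >= min_silence_s:
        spans.append((silence_start, end))
    return spans


# Synthetic signal at 1 kHz: 1 s of sound, 3 s of silence, 1 s of sound.
rate = 1000
signal = [0.5] * rate + [0.0] * (3 * rate) + [0.5] * rate
gaps = silent_spans(signal, rate, frame_ms=100)
```

The returned spans are candidates for trimming or for splitting the file, so the model never has to decode a long stretch of nothing.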
iMakeSense@reddit
https://huggingface.co/spaces/hf-audio/open_asr_leaderboard
iMakeSense@reddit
https://www.reddit.com/r/LocalLLaMA/comments/1rlqfd7/we_collected_135_phrases_whisper_hallucinates/
llitz@reddit
If you are doing English only, V2 works better
Regarding the looping... I saw someone saying it is about file size, etc., but that's not my case: v3 loops for me even in the first few seconds of the audio.
With V2, it transcribed some long 1h videos without issues.
dangerous_inference@reddit
I have become a big fan of Qwen3 1.7B ASR recently. You have to chunk audio sent to it, but it is fast and trivial to run.
nuclearbananana@reddit
Qwen3 ASR is super slow for me. Even Cohere's 2B model is faster than the 0.6B Qwen. I think it's the LLM decode stage. Maybe it's better if you have a GPU.
dangerous_inference@reddit
Well, yeah, you need a GPU to run it. I have it on my server and all the voice clients in my house are instant.
cibernox@reddit
Also, Whisper has long been surpassed by other models like Parakeet/Canary from NVIDIA, which are both faster and more accurate.
thecstep@reddit
Like others mentioned, medium.en seems to have the best results for whatever reason.
tinny66666@reddit
Vosk works much better for me. It's faster and more accurate.
ttkciar@reddit
That's Russian-only, right?
tinny66666@reddit
No. I use it for English. It works well.
noctrex@reddit
I have successfully transcribed 3-hour sessions, but using the medium model. If the audio is in English, use the medium.en model for better accuracy.
RogerRamjet999@reddit
I ran whisper.cpp on a set of about 30 one-hour meetings and never saw any issues. I was running the medium model, though.
DeltaSqueezer@reddit
whisper has been trained with certain audio lengths in mind. you need to break down audio into chunks.
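As a sketch of what that chunking could look like, the helper below computes overlapping time spans for a long recording. The 20-minute chunk length echoes the point at which the original poster saw looping begin, and the 5-second overlap is an arbitrary choice so words at a cut are not lost; neither is a Whisper constant, and `chunk_spans` is a hypothetical name for this example.

```python
def chunk_spans(total_s, chunk_s=1200.0, overlap_s=5.0):
    """Split a recording of `total_s` seconds into (start_s, end_s) spans of at
    most `chunk_s` seconds, each overlapping the previous one by `overlap_s`."""
    if chunk_s <= overlap_s:
        raise ValueError("chunk_s must be larger than overlap_s")
    spans = []
    start = 0.0
    while start < total_s:
        end = min(start + chunk_s, total_s)
        spans.append((start, end))
        if end >= total_s:
            break
        # Step back by the overlap so the next chunk re-covers the cut point.
        start = end - overlap_s
    return spans


# A one-hour file split into chunks of up to 20 minutes with 5 s of overlap.
spans = chunk_spans(3600.0)
```

Each span can then be cut out with any audio tool and transcribed separately, with the timestamps shifted by the span's start when the pieces are merged back together.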