Best realtime open source STT model?
Posted by ThatIsNotIllegal@reddit | LocalLLaMA | 17 comments
What's the best model to transcribe a conversation in realtime, meaning that the words have to appear as the person is talking?
Senior_Wrongdoer_252@reddit
In the past I used the open source project speaches-ai, which uses faster-whisper and works great on CPU with the tiny or small models, provided you have a decent PC.
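For anyone who wants to try the underlying library directly, here is a minimal sketch of CPU transcription with faster-whisper. It assumes the `faster-whisper` package is installed, and the `int8` compute type is my own choice for CPU use rather than anything from the comment above. Note that it transcribes a finished file; live use would still need audio fed in chunks.

```python
def format_segment(start: float, end: float, text: str) -> str:
    """Render one transcript segment as '[0.00s -> 2.50s] text'."""
    return f"[{start:.2f}s -> {end:.2f}s] {text.strip()}"


def transcribe_on_cpu(audio_path: str, model_size: str = "small") -> list[str]:
    """Transcribe a file with faster-whisper on CPU and return formatted lines."""
    # Imported lazily so the helper above works without the package installed.
    from faster_whisper import WhisperModel

    # int8 quantization is an assumption here: it keeps memory use low on CPU.
    model = WhisperModel(model_size, device="cpu", compute_type="int8")

    # transcribe() returns a lazy generator of segments plus metadata;
    # the actual decoding happens while the generator is consumed.
    segments, info = model.transcribe(audio_path, beam_size=5)
    print(f"Detected language: {info.language}")

    return [format_segment(s.start, s.end, s.text) for s in segments]
```

Calling `transcribe_on_cpu("meeting.wav", model_size="tiny")` (the file name is a placeholder) would return one formatted line per segment.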
Live-Character-5272@reddit
Have you found the best cheap model?
SENOMPAN@reddit
Try this for the best experience; you can choose the Whisper model: https://github.com/WEIFENG2333/VideoCaptioner
Slight-Honey-6236@reddit
For non-English use and speaker diarization in noisier conditions, shunyalabs/pingala-v1-universal is pretty good.
RustinChole1@reddit
You mean a streaming speech recognition model. NVIDIA's Parakeet TDT is very good. It has the best benchmarks on Hugging Face's Open ASR Leaderboard (in both latency and RTF, the real-time factor). Because the RTF score is exceptionally good compared to others, I'd suggest you give it a try.
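For context, RTF is usually defined as processing time divided by audio duration, so anything below 1.0 keeps up with live speech. A small sketch of that arithmetic follows; the function names are my own, and as far as I know the leaderboard reports the inverse figure (often written RTFx), where higher is better.

```python
def real_time_factor(processing_seconds: float, audio_seconds: float) -> float:
    """RTF = time spent transcribing / length of the audio. Lower is faster."""
    if audio_seconds <= 0:
        raise ValueError("audio_seconds must be positive")
    return processing_seconds / audio_seconds


def inverse_real_time_factor(processing_seconds: float, audio_seconds: float) -> float:
    """Inverse RTF = length of the audio / time spent transcribing. Higher is faster."""
    if processing_seconds <= 0:
        raise ValueError("processing_seconds must be positive")
    return audio_seconds / processing_seconds


def keeps_up_with_realtime(processing_seconds: float, audio_seconds: float) -> bool:
    """A model can only follow live speech if it processes audio faster than it arrives."""
    return real_time_factor(processing_seconds, audio_seconds) < 1.0
```

For example, transcribing 60 seconds of audio in 3 seconds gives an RTF of 0.05 and an inverse RTF of 20.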
ExplanationEqual2539@reddit
It is not multilingual though
the__storm@reddit
Btw, for people coming across this in the future, v3 supports most European languages: https://huggingface.co/nvidia/parakeet-tdt-0.6b-v3
ExplanationEqual2539@reddit
Useful info, thanks
z_3454_pfk@reddit
yeah for English this is the best
CheatCodesOfLife@reddit
For English only
bullerwins@reddit
If you are going the Whisper route, since it has multilingual support, check WhisperX or faster-whisper too.
Zulfiqaar@reddit
I believe WhisperX is optimised for batch processing or complete audio files, not so much realtime streaming STT, unless they've added new features recently.
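To make the batch vs. streaming distinction concrete: one common workaround is to buffer incoming audio and run a batch model on overlapping windows. The sketch below shows only that buffering logic. The `RollingBuffer` class and the `transcribe` callable are hypothetical names of mine, not part of WhisperX or any project mentioned in this thread.

```python
from typing import Callable, Iterable, Iterator, List


class RollingBuffer:
    """Collect audio samples and emit overlapping fixed-size windows."""

    def __init__(self, window: int, hop: int) -> None:
        if not 0 < hop <= window:
            raise ValueError("need 0 < hop <= window")
        self.window = window  # samples per window handed to the model
        self.hop = hop        # samples to advance between windows
        self._buf: List[float] = []

    def push(self, chunk: Iterable[float]) -> List[List[float]]:
        """Add a chunk; return every full window that is now available."""
        self._buf.extend(chunk)
        ready: List[List[float]] = []
        while len(self._buf) >= self.window:
            ready.append(self._buf[: self.window])
            # Drop only `hop` samples so consecutive windows overlap.
            del self._buf[: self.hop]
        return ready


def stream_transcribe(
    chunks: Iterable[Iterable[float]],
    transcribe: Callable[[List[float]], str],  # hypothetical batch model call
    window: int,
    hop: int,
) -> Iterator[str]:
    """Yield one partial transcript per completed window."""
    buffer = RollingBuffer(window, hop)
    for chunk in chunks:
        for win in buffer.push(chunk):
            yield transcribe(win)
```

The overlap means each word is seen more than once, so a real pipeline would also need to merge or de-duplicate the partial transcripts. That part is omitted here.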
nexe@reddit
None of the suggested models have speaker diarization as far as I know. There are some auxiliary libraries that try to achieve this as an add-on (e.g. https://github.com/MahmoudAshraf97/whisper-diarization), but in my experience they only work for very distinguishable voices (e.g. a woman speaking with a man, or a child with an adult).
swagonflyyyy@reddit
Whisper v3 turbo. It's my daily driver.
ExplanationEqual2539@reddit
If you have a GPU, check out Whisper. If you want to run transcription in a mobile application (e.g. Flutter), try sherpa-onnx. I wouldn't bet too much on it, but it's good enough.
For web streaming, try the Whisper base model; an example is already available as open source.
Even on CPU, I can see that Whisper does well...
Every option I mentioned supports streaming.
ExplanationEqual2539@reddit
GPU streaming is better, since you can run a bigger model with better accuracy.
olympics2022wins@reddit
I use Google Docs if I'm writing.