Best open TTS/ASR model with accurate timestamps

Posted by pvrlek@reddit | LocalLLaMA | 5 comments

WhisperX with large-v2 works okay-ish for my use case for the most part; timestamp accuracy only dips on slightly chaotic audio. I haven't been able to keep up with what the SOTA is here, so I'm wondering what your real-world experiences are.

I'd appreciate any info here, this community has been immensely helpful. Thank you all!

nhatnv@reddit

I have been working on this. The best approach for me is still LLM-based ASR followed by a forced aligner to get better timestamps. Even that isn't always enough; sometimes you also need to align against audio energy.
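
For anyone wondering what the energy step might look like, here is a minimal sketch, not the actual pipeline described above. It assumes you already have word boundaries from an aligner and mono samples as plain floats. It nudges each boundary to the quietest frame nearby. The function names, the 10 ms frame size, and the 50 ms search window are all made up for illustration.

```python
import math

def frame_energies(samples, frame_len):
    """RMS energy of consecutive, non-overlapping frames of frame_len samples."""
    energies = []
    for start in range(0, len(samples) - frame_len + 1, frame_len):
        frame = samples[start:start + frame_len]
        energies.append(math.sqrt(sum(x * x for x in frame) / frame_len))
    return energies

def snap_to_low_energy(t, energies, frame_sec, window_sec=0.05):
    """Move timestamp t (seconds) to the centre of the quietest frame
    within roughly +/- window_sec. Ties go to the frame closest to t."""
    if not energies:
        return t  # nothing to snap against
    center = min(int(t / frame_sec), len(energies) - 1)
    radius = max(1, int(round(window_sec / frame_sec)))
    lo = max(0, center - radius)
    hi = min(len(energies), center + radius + 1)
    best = min(range(lo, hi), key=lambda i: (energies[i], abs(i - center)))
    return (best + 0.5) * frame_sec
```

With 16 kHz audio you would call `frame_energies(samples, 160)` for 10 ms frames and pass `frame_sec=0.01`. Whether a fixed window is good enough probably depends on how noisy the recording is.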

pvrlek@reddit (OP)

Very cool. You mean forced alignment with wav2vec2, or is there something newer and better?

nhatnv@reddit

I tried several models and found WavLM is still the best. You might want to apply noise reduction before alignment too.
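
In case it helps readers new to forced alignment: whichever acoustic model you pick (wav2vec2, WavLM, ...), the alignment step itself is a small dynamic program over the model's per-frame token scores. Below is a simplified sketch that assumes you already have a `log_probs[frame][token_id]` table and the transcript as token ids. It leaves out the CTC blank symbol that real aligners handle, and `align_monotonic` is a hypothetical name, not a library function.

```python
def align_monotonic(log_probs, tokens):
    """Assign every frame to one transcript token, in order, maximising the
    summed log-probability. Returns (token_id, start_frame, end_frame) spans
    with end_frame exclusive. Simplified: no CTC blank handling."""
    T, J = len(log_probs), len(tokens)
    if J == 0 or T < J:
        raise ValueError("need at least one frame per token")
    NEG = float("-inf")
    score = [[NEG] * J for _ in range(T)]
    advanced = [[False] * J for _ in range(T)]  # True if token j starts at frame t
    score[0][0] = log_probs[0][tokens[0]]
    for t in range(1, T):
        for j in range(min(t + 1, J)):
            stay = score[t - 1][j]
            move = score[t - 1][j - 1] if j > 0 else NEG
            if move > stay:
                score[t][j] = move + log_probs[t][tokens[j]]
                advanced[t][j] = True
            else:
                score[t][j] = stay + log_probs[t][tokens[j]]
    # Walk back from the last frame / last token to recover the spans.
    spans, j, end = [], J - 1, T
    for t in range(T - 1, -1, -1):
        if advanced[t][j]:
            spans.append((tokens[j], t, end))
            end, j = t, j - 1
    spans.append((tokens[0], 0, end))
    spans.reverse()
    return spans
```

To get seconds, multiply the frame indices by the encoder's frame stride (about 20 ms for wav2vec2-style models, but check the model you actually use).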

dametsumari@reddit

I assume you mean STT. I recently compared Whisper large-v3 (turbo) and Qwen's latest ASR model. At least for multilingual audio, Whisper still seems better, although Qwen was OK with English.

pvrlek@reddit (OP)

Oh, messed up the title. Thanks, useful info.