Building real-time speech translation (VAD→ASR→MT→TTS) - struggling with latency

Posted by Big_Fix_7606@reddit | LocalLLaMA

I'm also working on this. Trying to build a real-time speech translation system, but honestly the results are pretty rough so far. Really curious how commercial simultaneous interpretation systems manage to hit that claimed 3-second average for first-word latency.

It's just a weekend project at this point. My pipeline is VAD → ASR → MT → TTS. I tried nllb-200-distilled-600M and Helsinki-NLP/opus-mt-en-x for translation, but neither worked that well. For TTS I went with Kokoro because it had the smallest parameter count of the options I looked at, but the TTS stage's latency is still way too high.
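To figure out which stage actually eats the time, it probably helps to time each one separately before swapping models. Here's a rough sketch of that, assuming each stage can be wrapped as a plain function that takes the previous stage's output. The `fake_*` functions and `run_pipeline` are hypothetical placeholders for illustration, not code from the repo or any real model API.

```python
import time
from typing import Any, Callable, Dict, List, Tuple

Stage = Tuple[str, Callable[[Any], Any]]


def run_pipeline(stages: List[Stage], chunk: Any) -> Tuple[Any, Dict[str, float]]:
    """Pass one chunk through every stage in order and record seconds spent in each."""
    timings: Dict[str, float] = {}
    data = chunk
    for name, fn in stages:
        start = time.perf_counter()
        data = fn(data)
        timings[name] = time.perf_counter() - start
    return data, timings


# Hypothetical stand-ins for the real stages; swap in actual model calls.
def fake_vad(audio: bytes) -> bytes:
    return audio  # pretend the whole chunk is speech


def fake_asr(audio: bytes) -> str:
    return "hello world"  # pretend transcript


def fake_mt(text: str) -> str:
    return text.upper()  # pretend translation


def fake_tts(text: str) -> bytes:
    return text.encode("utf-8")  # pretend synthesized audio


STAGES: List[Stage] = [
    ("vad", fake_vad),
    ("asr", fake_asr),
    ("mt", fake_mt),
    ("tts", fake_tts),
]

if __name__ == "__main__":
    out, timings = run_pipeline(STAGES, b"\x00" * 3200)
    for name, seconds in timings.items():
        print(f"{name}: {seconds * 1000:.2f} ms")
    print(f"total: {sum(timings.values()) * 1000:.2f} ms")
```

With real models plugged in, the per-stage numbers should show whether TTS really dominates or whether ASR/MT are contributing more than expected. Note this only measures a sequential pass over one chunk; it doesn't model streaming or overlap between stages.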
---
repo: https://github.com/xunfeng1980/e2e-audio-mt