What is the best open source TTS model with multi language support?

Posted by Anxietrap@reddit | LocalLLaMA | View on Reddit | 22 comments

I'm currently developing an addon for Anki (an open source flashcard software). One part of my plan is to integrate an option to generate audio samples based on the preexisting content of the flashcards (for language learning). The point of it is using a local TTS model that doesn't require any paid services or APIs. To my knowledge the addons that are currently available for this have no option for a free version that still generate quite good audio.

I've looked a lot on HF but I struggle a bit to find out which models are actually suitable and versatile enough to support enough languages. My current bet would be XTTS2 due to the broad language support and its evaluation on leaderboards, but I find it to be a little "glitchy" at times.

I don't know if it's a good pick because it's mostly focussed on voice cloning. Could that be an issue? Do I have to think about some sort of legal concerns when using such a model? Which voice samples am I allowed to distribute to people so they can be used for voice cloning? I guess it wouldn't be user friendly to ask them to find their own 10s voice samples for generating audio.

So my question to my beloved local model nerds is:
Which models have you tested and which ones would you say are the most consistent and reliable?