Looking for High-Quality Open-Source Local TTS That’s Faster Than IndexTTS2

Posted by TomNaughtyy@reddit | LocalLLaMA | 11 comments

My cousin and I have been using IndexTTS2 for a while and really like the voice quality; it sounds natural and expressive. The only issue is that it’s slow. He’s getting around 1.6 RTF (real-time factor, so about 1.6 seconds of compute per second of audio) on his 3090, which makes it hard to generate longer audio efficiently (we work with long audio, not real-time use).

We’ve also tried Kokoro TTS and CosyVoice 2. Kokoro is super fast, but most of the voices sound too synthetic or “AI-like” for our needs. The one voice we actually liked was “Nicole” in Kokoro; it has a more natural, calm tone that works well for us. CosyVoice 2 was more expressive and sounded promising, but it had a habit of changing words or pronouncing them weirdly, which broke the consistency.

We’re only interested in open-source models. No commercial or cloud APIs.

A few things to note: We’re not planning to use emotion vectors, style tokens, or any prompt engineering tricks, just clean, straightforward narration. We’re on strong hardware (3090 and 4090), so GPU resources aren’t a problem. Just want something with good voice quality that runs faster than IndexTTS2 and ideally has at least one solid voice that sounds natural.

Any models or voices you recommend?
Thanks

CheatCodesOfLife@reddit

https://huggingface.co/neuphonic/neutts-air is fast and expressive.

There's also echo-tts, but it needs 12 GB of VRAM.

Orpheus as others have mentioned.

Agreeable-Market-692@reddit

KittenTTS: https://github.com/KittenML/KittenTTS

Decent-Blueberry3715@reddit

Chatterbox? You can run it with TTS webui. There are many others in TTS webui so you can easily compare.

Bit_Poet@reddit

Have you tried mixing voices and adapting speeds in Kokoro? At least with Kokoro-FastAPI, you can pass in voice combinations like "bf_lily+af_nicole(2)". The number in parentheses is the weight, so you get quite a variety of combinations. Even a male voice plus a more heavily weighted female voice can yield usable results.
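
For anyone who wants to script this, here is a rough sketch of building that kind of weighted voice string and an OpenAI-style speech request body around it. The helper names (`build_voice_mix`, `build_speech_request`) are made up for illustration, and the payload fields (`model`, `input`, `voice`, `speed`, `response_format`) are an assumption based on the OpenAI-compatible speech format; check your Kokoro-FastAPI version's docs before relying on them.

```python
import json


def build_voice_mix(weights):
    """Join voice names with '+', appending '(weight)' for any weight other than 1."""
    parts = []
    for name, weight in weights.items():
        if weight == 1:
            parts.append(name)
        else:
            # ':g' drops a trailing '.0', so 2 and 2.0 both render as '2'
            parts.append(f"{name}({weight:g})")
    return "+".join(parts)


def build_speech_request(text, voice, speed=1.0):
    """Build a request body; the field names are assumed, not confirmed by the thread."""
    return {
        "model": "kokoro",
        "input": text,
        "voice": voice,
        "speed": speed,
        "response_format": "mp3",
    }


voice = build_voice_mix({"bf_lily": 1, "af_nicole": 2})
payload = build_speech_request("Chapter one. It was a quiet morning.", voice, speed=0.95)
print(json.dumps(payload, indent=2))
```

You would then POST the payload to your server's speech endpoint (on a default local install that is probably something like http://localhost:8880/v1/audio/speech, but treat that URL as an assumption) and write the response bytes to a file.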

TomNaughtyy@reddit (OP)

I tried a little bit, but it didn't sound that great. I will try more. If you have any more tips or a mix you personally like, I would be so happy to try it. Thank you

Bit_Poet@reddit

I'm currently in the process of figuring out the best combinations, as I want to assemble a set of reasonably distinct voices for audio books (running audiobook-creator on my own and downloaded stories for private consumption). I'm about to finish a little Gradio UI that lets me create/import/export a JSON file with named voice combinations. I'll let you know what I come up with.

linlin@reddit

Try IndexTTS2 with vLLM: 0.3 RTF on a 4090. https://www.bilibili.com/video/BV1nhW6zoES4

TomNaughtyy@reddit (OP)

If I can get it that fast, it would be amazing. I really appreciate it

nickless07@reddit

How about this one? https://github.com/canopyai/Orpheus-TTS

TomNaughtyy@reddit (OP)

I will give it a try tomorrow. Thank you so much

RebouncedCat@reddit

Orpheus will run close to realtime on a 3090 or 4090, but if you want absolute speed, try supertonic. It's 30x realtime on CPU.
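
To put the speed figures quoted in this thread on one scale, here is a small sketch that converts them to wall-clock time for an hour of narration. It assumes RTF means generation time divided by audio duration (so lower is faster, and "30x realtime" works out to an RTF of about 1/30). The numbers are the ones people reported above on different hardware, not measurements, and "close to realtime" is treated as roughly 1.0 for the sake of the comparison.

```python
def rtf_from_speedup(speedup):
    """Convert an 'N x realtime' figure to a real-time factor (RTF)."""
    return 1.0 / speedup


def generation_minutes(audio_minutes, rtf):
    """Wall-clock minutes needed to synthesize the given amount of audio."""
    return audio_minutes * rtf


# Figures as reported in the thread; hardware differs between entries.
reported = {
    "IndexTTS2 (3090)": 1.6,
    "IndexTTS2 + vLLM (4090)": 0.3,
    "Orpheus (3090/4090, ~realtime)": 1.0,
    "supertonic (CPU, 30x realtime)": rtf_from_speedup(30),
}

for name, rtf in reported.items():
    minutes = generation_minutes(60, rtf)
    print(f"{name}: RTF {rtf:.2f}, about {minutes:.0f} min per hour of audio")
```

By this arithmetic an hour of audio takes about 96 minutes at 1.6 RTF versus about 18 minutes at 0.3 RTF.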