TTS Benchmark Comparison (all known TTS up until May 2026)

Posted by UkieTechie@reddit | LocalLLaMA | 40 comments

I was tired of not having a proper TTS benchmark that I could use to test models for personal projects, so I made one. Hopefully this helps those looking to run local TTS tools.

It already has Windows and Mac results. Linux will be tested shortly (I have a 5900XT and 3090 workstation).

Has an HTML page for results (still running a few right now)

https://github.com/5uck1ess/tts-bench

[-]

danigoncalves@reddit

Pocket TTS is a 100M parameter model and it has multilingual support with voice cloning.

[-]

UkieTechie@reddit (OP)

yep already on the list. thank you for the contribution

[-]

Related to TTS: running one on an MI50 is a bit chaotic because of PyTorch and its dependencies, but this one uses ggml: https://github.com/ServeurpersoCom/omnivoice.cpp so it works with Vulkan, CUDA, Metal, CPU... So far it is the best I have found for my language (I had to clone a voice to get the accent).

[-]

UkieTechie@reddit (OP)

putting this down as future implementation.

[-]

NewtoAlien@reddit

I am using a codex dockerized version of vibevoice 7B from: https://github.com/zeropointnine/tts-audiobook-tool on a headless Ubuntu 26.04.

I am able to run 4 batches at the same time using 23.7GB of VRAM on an RTX 3090.

It has music detection and error check and regeneration via whisper which is running on CPU.

I am getting great results with it and it's running at between 2x and 3.8x real-time speed, for example generating 53.2 seconds of audio in 14 seconds.

The speed varies up and down, but it stays above 1x.
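The speed figure above can be checked with a couple of lines. This is only a sketch of the arithmetic: `speed_factor` is a hypothetical helper name, and it reports seconds of audio per second of wall-clock time (some tools report the inverse ratio as "RTF", so it is worth checking which convention a benchmark uses).

```python
def speed_factor(audio_seconds: float, wall_seconds: float) -> float:
    """Seconds of audio produced per second of wall-clock time (>1 is faster than real time)."""
    if wall_seconds <= 0:
        raise ValueError("wall_seconds must be positive")
    return audio_seconds / wall_seconds


# Numbers quoted in the comment: 53.2 s of audio generated in 14 s.
factor = speed_factor(53.2, 14.0)
print(f"{factor:.1f}x real time")  # prints "3.8x real time"
```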

[-]

UkieTechie@reddit (OP)

amazing. I'm gonna test on my ubuntu 3090 system too and upload the results. thank you for sharing

[-]

NewtoAlien@reddit

Np 😉

The tool is for making audio books. Running it headless saves all the VRAM.

I am running it in tmux so I can ssh to my computer from my phone to monitor the session.

I already generated a 50 hr audio book with it and it has been generating a bigger audiobook for 70 hours straight with no issues for me and about 30 hours more to go.

Mind you, I have set a strict no-errors option so it will retry the generation if it detects word errors, set max words per segment to 75, and maximized word generation. I am also voice cloning. Error detection is done via whisper v3 large on CPU.

Let me know if you want to know what other settings I am using.
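For anyone curious what a "retry on word errors" loop might look like, here is a minimal sketch. It is not the code from tts-audiobook-tool: `split_segments`, `generate_strict`, and the `synthesize` / `transcribe` callables are hypothetical names, and the stubs at the bottom stand in for a real TTS model and a real speech-to-text model such as Whisper.

```python
import re
from typing import Callable


def split_segments(text: str, max_words: int = 75) -> list[str]:
    """Naively split text into segments of at most max_words words."""
    words = text.split()
    return [" ".join(words[i:i + max_words]) for i in range(0, len(words), max_words)]


def normalize(text: str) -> list[str]:
    """Lowercase and drop punctuation so only the words are compared."""
    return re.findall(r"[a-z0-9']+", text.lower())


def generate_strict(
    segment: str,
    synthesize: Callable[[str], bytes],
    transcribe: Callable[[bytes], str],
    max_retries: int = 3,
) -> tuple[bytes, int]:
    """Regenerate a segment until its transcript matches the text word for word."""
    expected = normalize(segment)
    for attempt in range(1, max_retries + 1):
        audio = synthesize(segment)
        if normalize(transcribe(audio)) == expected:
            return audio, attempt
    raise RuntimeError(f"no clean generation after {max_retries} attempts")


# Stubs for illustration only: the first "generation" drops the last word.
calls = {"n": 0}


def fake_synthesize(text: str) -> bytes:
    calls["n"] += 1
    words = text.split()
    spoken = words[:-1] if calls["n"] == 1 else words
    return " ".join(spoken).encode()


def fake_transcribe(audio: bytes) -> str:
    return audio.decode()


audio, attempts = generate_strict("Hello there, world.", fake_synthesize, fake_transcribe)
print(attempts)  # prints 2
```

A real pipeline would probably split on sentence boundaries rather than raw word counts and might tolerate small transcript differences; exact matching is just the simplest reading of a strict setting.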

So far so good and I am liking it.

It feels more expressive than all other tts solutions I tried.

[-]

UkieTechie@reddit (OP)

that does sound cool. vibevoice has been good for me. used it on voice cloning social engineering projects. do let me know any details

any reason you're using 7b and not the original models that microsoft removed (thought they were 9b)?

[-]

NewtoAlien@reddit

It's the community edition one.

You can load the fork one if you give it the hugging face name. It gives you the option to load other versions.

[-]

UkieTechie@reddit (OP)

noted. ty. that's the one that produced the best results for me for both normal tts and cloning in the past but so many new models are out since.

[-]

NewtoAlien@reddit

I just switched to the Vibevoice-large from aoi-ot. It started with 23.4GB of VRAM for 4 batches so it's looking good so far.

The application has an option to download models from HF, you just have to give it the model name.

[-]

sword-in-stone@reddit

Thanks OP, omnivoice was a nightmare to get working on strix halo. It now produces output but it's all garbled and jumbled. Lmk if you make it work.

[-]

MarkoMarjamaa@reddit

I'm using Zipvoice with strix halo. Cloning, Finnish finetune. Running it with RealTimeTTS but built my own FastAPI interface for streaming.

[-]

UkieTechie@reddit (OP)

I think omnivoice is amazing so far. it's not the fastest but its voice cloning is almost perfect. clones tone and accent also.

[-]

sword-in-stone@reddit

got it working on nvidia blackwell, it's high quality cloning, but I'm asking about strix

[-]

UkieTechie@reddit (OP)

ah i see. unfortunately dont have a strix halo myself to make it work but in the future i def want to grab one and add it. if only they supported more than 128gb of unified memory

[-]

rngesius@reddit

Original QwenTTS repo has dogshit code and speed. Use https://github.com/andimarafioti/faster-qwen3-tts, it's much faster than realtime, though still has a very steep startup cost.

[-]

Timely-Perception-26@reddit

There's also faster-qwen3-tts combined with custom Triton kernels from this repo:

https://github.com/newgrit1004/qwen3-tts-triton

> Hybrid Mode (Triton + CUDA Graph, ~5x faster)

With warmup, hybrid mode, and intelligent chunking, I achieve a TTFA of ~120ms on my 3090TI using my own trained custom voice model.
I’ve tried everything, and you can’t get any better in terms of quality versus speed. The footprint is naturally a bit large, but I can use it as a daily assistant with Qwen3.6 27B.
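One common way chunking helps time-to-first-audio is to keep the first chunk very short so playback can start while later, larger chunks are still being generated. The sketch below illustrates that idea only; it is an assumption about what "intelligent chunking" could mean, not the commenter's implementation, and `chunk_for_streaming` and its limits are hypothetical.

```python
import re


def chunk_for_streaming(text: str, first_max_words: int = 8, max_words: int = 40) -> list[str]:
    """Group sentences into chunks, with a smaller word budget for the first chunk."""
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    chunks: list[str] = []
    current: list[str] = []

    def limit() -> int:
        # The first chunk gets the small budget so audio can start early.
        return first_max_words if not chunks else max_words

    for sentence in sentences:
        words = sentence.split()
        # Flush the pending chunk if this sentence would overflow its budget.
        if current and len(current) + len(words) > limit():
            chunks.append(" ".join(current))
            current = []
        current.extend(words)
        # Hard-split a single sentence that is longer than the budget.
        while len(current) > limit():
            cut = limit()
            chunks.append(" ".join(current[:cut]))
            current = current[cut:]
    if current:
        chunks.append(" ".join(current))
    return chunks


text = (
    "Hi there. This is a longer second sentence that follows the greeting. "
    "And a third one."
)
for chunk in chunk_for_streaming(text, first_max_words=4, max_words=12):
    print(chunk)
```

With these limits the greeting is emitted on its own, so a streaming server could start synthesizing two words immediately instead of waiting on the whole paragraph.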

[-]

Erdeem@reddit

Fish-audio tts is the only open source tts I know of that outputs at 48 kHz, which makes it the best sounding. It is one-shot, preserves accent, has emotional controls, and only takes about a minute for a 30 second output. However, it is non-commercial and needs 22 GB of VRAM.

You might want to consider adding it.

[-]

Equivalent-Repair488@reddit

Only speed is tested? My main problem when using TTS is usually not speed, it's the roboty undertones from whatever I tried in the past, it gives me discomfort whenever I hear it.

[-]

pmttyji@reddit

> My main problem when using TTS is usually not speed, it's the roboty undertones from whatever I tried in the past, it gives me discomfort whenever I hear it.

In your opinion, what are good/decent ones so far? Please share details

[-]

Equivalent-Repair488@reddit

I really just can't find any. Tried kittenTTS, Kokoro, Chatterbox (both base and turbo), they all had the robot undertones which made me internally go "ew". I just gave up altogether. Kokoro was the best imo, but still not good.

I'm not into voice cloning. I'd rather have 1 male and 1 female voice good enough that I cannot pick out any of those artefacts than a voice cloner with a potentially infinite number of voices that all have a baseline of those roboty undertones.

What is more confusing though, is I see a lot of slop videos, like even back then the Biden, Obama and Trump minecraft memes had good voices, but idk if it is post processing or what.

[-]

UkieTechie@reddit (OP)

Voice cloning is the easy part now. You can train a model on your own voice and get voice cloning down to about 90ms in a live situation. However, the TTS comments are so true. Most of them still sound kind of off when you're doing big enough phrases.

[-]

UkieTechie@reddit (OP)

So the output is highly subjective, but it does both: speed, so you know how a model works on your hardware, and a results report that you can use to choose what YOU think is best.

i have my thoughts in the repo as well but this would be very subjective.

[-]

Equivalent-Repair488@reddit

There is a certain frequency range that gives that roboty static. I don't know if it is consistent across TTS models and providers, but the problem frequencies are not present in natural speech. It is not a matter of subjectivity; it is potentially quantifiable digital audio artefacting created by these TTS LLMs, an unpleasant addition that natural sounds do not have. It might or might not be benchmarkable, but I think it is worthwhile to look into.
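If the artefacts really do sit in a particular frequency range, one simple thing to measure would be the share of a clip's spectral energy that falls inside that range. The sketch below is only an illustration of that idea under stated assumptions: the 5-7 kHz band and the synthetic test tones are arbitrary placeholders, `band_energy_fraction` is a hypothetical helper, and it is not the metric the OP describes. It uses a slow pure-Python DFT to stay dependency-free; real audio would call for an FFT library.

```python
import cmath
import math


def band_energy_fraction(samples: list[float], sample_rate: int,
                         low_hz: float, high_hz: float) -> float:
    """Fraction of non-DC spectral energy between low_hz and high_hz (naive DFT)."""
    n = len(samples)
    total = 0.0
    band = 0.0
    for k in range(1, n // 2 + 1):  # skip the DC bin
        coeff = sum(x * cmath.exp(-2j * math.pi * k * i / n) for i, x in enumerate(samples))
        power = abs(coeff) ** 2
        freq = k * sample_rate / n
        total += power
        if low_hz <= freq <= high_hz:
            band += power
    return band / total if total else 0.0


# Synthetic stand-ins: a clean 200 Hz tone, and the same tone plus a 6 kHz "buzz".
rate, n = 16000, 400
clean = [math.sin(2 * math.pi * 200 * i / rate) for i in range(n)]
buzzy = [c + 0.5 * math.sin(2 * math.pi * 6000 * i / rate) for i, c in enumerate(clean)]

print(f"{band_energy_fraction(clean, rate, 5000, 7000):.2f}")  # prints "0.00"
print(f"{band_energy_fraction(buzzy, rate, 5000, 7000):.2f}")  # prints "0.20"
```

Comparing this fraction between a TTS sample and a recording of natural speech reading the same text would be one way to test whether the hypothesis holds for a given model.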

[-]

llama-impersonator@reddit

robot vocal fry

[-]

UkieTechie@reddit (OP)

I am adding a new column in the reports to measure this. I'm calling it the Naturalness-Artifact Quotient; it will be measured objectively against the samples. Hopefully that helps.

[-]

theSurgeonOfDeath_@reddit

Both are important if you need near real-time response.

So I agree, both should be tested.

[-]

UkieTechie@reddit (OP)

I think omnivoice so far sounds the best. kokoro is great too

[-]

Zulfiqaar@reddit

I think you have a few missing:

https://huggingface.co/models?pipeline_tag=text-to-speech

[-]

No-Implement9967@reddit

Realtime factor + memory usage + quality tradeoffs matter way more than cherry-picked demo clips. Glad someone finally centralized this stuff.

[-]

no_witty_username@reddit

I have a lot of experience testing MANY dozens of TTS models myself, and from what I see on the list here I can attest that it looks about right. For pure speed on CPU at "acceptable" quality, nothing beats piper tts. That thing is stupid fast. I have it running at above 3x RTF on a Pixel 9, CPU only, which is very impressive for a TTS. My latency on that wimpy CPU is about 300ms TTFA, so still very impressive. For a small "good quality" TTS model, if I had my choice I would run supertonic 3, but unfortunately it's significantly slower on my puny Pixel 9 CPU at around 2000ms. I can get it down to about 1000ms with optimizations and proper chunking, but that's still too slow. For someone who needs a small, very fast, good-quality TTS, though, consider supertonic 3; it's a very good model for its tiny size.

[-]