The world’s fastest open-source TTS: Supertonic
Posted by ANLGBOY@reddit | LocalLLaMA | 31 comments
Demo https://huggingface.co/spaces/Supertone/supertonic#interactive-demo
Code https://github.com/supertone-inc/supertonic
Hello!
I want to share Supertonic, a newly open-sourced TTS engine that focuses on extreme speed, lightweight deployment, and real-world text understanding.
It’s available in 8+ programming languages: C++, C#, Java, JavaScript, Rust, Go, Swift, and Python, so you can plug it almost anywhere — from native apps to browsers to embedded/edge devices.
Technical highlights:
(1) Lightning-speed — Real-time factor:
• 0.001 on RTX4090
• 0.006 on M4 Pro
(2) Ultra lightweight — 66M parameters
(3) On-device TTS — Complete privacy and zero network latency
(4) Advanced text understanding — Handles complex, real-world inputs naturally
(5) Flexible deployment — Works in browsers, mobile apps, and small edge devices
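To put the real-time factors in (1) into perspective, here is a small sketch. It assumes the usual definition of RTF (time spent synthesizing divided by the duration of the audio produced); the helper names are illustrative and not part of Supertonic.

```python
def real_time_factor(synthesis_seconds: float, audio_seconds: float) -> float:
    """RTF = time spent synthesizing / duration of the audio produced."""
    if audio_seconds <= 0:
        raise ValueError("audio_seconds must be positive")
    return synthesis_seconds / audio_seconds


def synthesis_time(audio_seconds: float, rtf: float) -> float:
    """Seconds of compute needed to produce `audio_seconds` of speech."""
    return audio_seconds * rtf


# RTF figures quoted in the post; one hour of speech as the workload.
one_hour = 3600.0
for device, rtf in [("RTX4090", 0.001), ("M4 Pro", 0.006)]:
    print(f"{device}: {synthesis_time(one_hour, rtf):.1f} s per hour of audio")
```

Under that definition, the quoted figures would correspond to roughly 3.6 seconds (RTX4090) and 21.6 seconds (M4 Pro) of compute per hour of generated speech.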
Regarding (4), one of my favorite test sentences is:
• He spent 10,000 JPY to buy tickets for a JYP concert.
Here, “JPY” refers to Japanese yen, while “JYP” refers to a name — Supertonic handles the difference seamlessly.
Hope it's useful for you!
Chromix_@reddit
It looks like the Python version needs some love (and chunking).
GPU acceleration isn't implemented, and pushing a 128 KB text file through CPU encoding starts using a ton of RAM. It ultimately failed with:
Previous-Possible-68@reddit
It looks like this issue has just been addressed in the latest updates on GitHub. They updated the text chunking and it seems to work well now. https://github.com/supertone-inc/supertonic
ANLGBOY@reddit (OP)
Thank you for pointing out the issue. Since our model is trained on audio segments up to 30 seconds long—approximately 350 characters—it may not perform well with longer text. We've addressed this by adding text-chunking algorithms in the latest commit.
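For anyone who wants to pre-split long text themselves, the idea can be sketched roughly as follows. This is not the chunking code from the repository: it is a minimal sentence-packing sketch, and `chunk_text`, `_split_words`, and the 300-character budget are assumptions chosen to stay under the roughly 350 characters mentioned above.

```python
import re

MAX_CHARS = 300  # assumed budget, kept below the ~350 characters mentioned above


def _split_words(sentence: str, max_chars: int) -> list[str]:
    """Fallback: break a long sentence at spaces; hard-cut words longer than max_chars."""
    parts: list[str] = []
    current = ""
    for word in sentence.split():
        while len(word) > max_chars:
            if current:
                parts.append(current)
                current = ""
            parts.append(word[:max_chars])
            word = word[max_chars:]
        candidate = f"{current} {word}" if current else word
        if len(candidate) <= max_chars:
            current = candidate
        else:
            parts.append(current)
            current = word
    if current:
        parts.append(current)
    return parts


def chunk_text(text: str, max_chars: int = MAX_CHARS) -> list[str]:
    """Greedily pack whole sentences into chunks of at most max_chars characters."""
    # Split after sentence-ending punctuation that is followed by whitespace.
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks: list[str] = []
    current = ""
    for sentence in sentences:
        if not sentence:
            continue
        # Only an oversized sentence is split further, on word boundaries.
        if len(sentence) <= max_chars:
            pieces = [sentence]
        else:
            pieces = _split_words(sentence, max_chars)
        for piece in pieces:
            candidate = f"{current} {piece}" if current else piece
            if len(candidate) <= max_chars:
                current = candidate
            else:
                chunks.append(current)
                current = piece
    if current:
        chunks.append(current)
    return chunks
```

Each chunk would then be synthesized separately and the audio concatenated, so memory use stays bounded by the chunk size rather than the size of the input file.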
SwarfDive01@reddit
Lightweight is convenient, but can it handle non-verbal sounds like DIA? [Cough] or [laughing], [giggle], or emotional inflection beyond question or statement, like anger, panic, manic, excitement? Or what about specific vocal model sound selection? If I like a voice, can I set it, or will I need an anchor model voice?
IrisColt@reddit
Can it moan? Asking for a friend, heh
SwarfDive01@reddit
You know, I don't remember seeing it explicitly documented... buuuut, maybe
ANLGBOY@reddit (OP)
To be honest, our model cannot yet provide human-like reactions or controllable expressions. So far, we have focused on computational efficiency, but we plan to address these capabilities in the near future.
OliDouche@reddit
Can something like this also work for transcription? Looking for something that produces transcripts on macOS as fast as possible. Currently using whisper.cpp
simracerman@reddit
Can it do 1/3 of this speed but sound like Kokoro?
ANLGBOY@reddit (OP)
Our benchmark shows that Supertonic is significantly faster than Kokoro in CPU environments (about 10 times faster).
https://github.com/supertone-inc/supertonic?tab=readme-ov-file#characters-per-second
Dudmaster@reddit
I believe his question was, if more compute is spent on Supertonic (processed slower) does quality increase?
silenceimpaired@reddit
How accurate is it compared to Kokoro? Does it support voice cloning?
ANLGBOY@reddit (OP)
We have not conducted a thorough comparison of its pronunciation accuracy. However, it offers many advantages for processing natural text, as shown in https://huggingface.co/spaces/Supertone/supertonic#text-handling
We also plan to enable users to utilize their own voices with the open-source model in the near future.
silenceimpaired@reddit
Sounds exciting. I’ll have to dig into it after work. Hopefully you guys used Apache or MIT licensing. It seems these days you either get a full featured tool or great licensing.
Material_Abies2307@reddit
It seems to be English only… with all due respect, if you’re not gonna beat Kokoro or Piper on voice availability, there’s no use for anything lighter than it
Foreign-Beginning-49@reddit
I will respectfully disagree: neither Piper nor Kokoro is very fast on edge devices. I'll see if this tonic works today.
coder543@reddit
What do you mean Kokoro isn't fast on edge devices? It is absolutely tiny.
Foreign-Beginning-49@reddit
So far this is faster than Kokoro or Piper, and comparable in quality as well, if not better. We might be splitting hairs on this edge device territory though.
EndlessZone123@reddit
Kokoro is still far bigger than the standard built-in TTS on any device. This is far closer to those, it seems.
cleverusernametry@reddit
Speed is not the issue at all in TTS. We've already been fast for a long time. The issue is quality.
Foreign-Beginning-49@reddit
Still an edge device frontier to explore though. I just tested on an older Android and it's really fast and sounds great too. This model does surprisingly well. It's impressive in my book.
LeatherRub7248@reddit
OP:
Does it handle accents when cloning a voice?
LightMaleficent5844@reddit
Numerous pronunciation errors and occasional omission of whole words in the demo. Step count doesn't matter.
r4in311@reddit
Thanks for sharing. Truly incredible speed, but sadly it sounds much worse than Kokoro and kind of soulless tbh. Is finetuning code available?
Zc5Gwu@reddit
The male voice sounds better than any of the kokoro male voices IMO.
ANLGBOY@reddit (OP)
Fine-tuning code is not currently available. However, we plan to offer a pipeline that allows users to use their preferred voices with the open-source model.
1_7xr@reddit
What architecture does it use?
rageling@reddit
I clicked play on a video about a new TTS and it didn't have any sample of the TTS, what a scam
EndlessZone123@reddit
A few questions:
66M params but how much memory does it take up during inference?
Does the model size scale if the resources are available to train for it?
Will there be finetuning? Kokoro died for me when there was no ability to train voices.
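On the memory question, a rough lower bound for the weights alone can be worked out from the parameter count. This is only back-of-the-envelope arithmetic: the precision of the released weights is an assumption here, and activations and runtime overhead during inference are not included.

```python
PARAMS = 66_000_000  # parameter count stated in the post

# Bytes per parameter for common storage precisions (which one applies is an assumption).
BYTES_PER_PARAM = {"float32": 4, "float16": 2, "int8": 1}


def weight_megabytes(params: int, dtype: str) -> float:
    """Size of the weights alone, in megabytes (10**6 bytes)."""
    return params * BYTES_PER_PARAM[dtype] / 1e6


for dtype in BYTES_PER_PARAM:
    print(f"{dtype}: {weight_megabytes(PARAMS, dtype):.0f} MB")
```

That works out to about 264 MB at float32, 132 MB at float16, and 66 MB at int8 for the weights, with actual inference memory somewhere above that.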
ANLGBOY@reddit (OP)
Icy-Swordfish7784@reddit
The demo sounds good. It's also great you provided bindings in so many languages and not just python so it should be easy to implement into a variety of projects, not just web servers.