The world’s fastest open-source TTS: Supertonic

Posted by ANLGBOY@reddit | LocalLLaMA | 31 comments

Demo https://huggingface.co/spaces/Supertone/supertonic#interactive-demo

Code https://github.com/supertone-inc/supertonic

Hello!

I want to share Supertonic, a newly open-sourced TTS engine that focuses on extreme speed, lightweight deployment, and real-world text understanding.

It’s available in 8+ programming languages: C++, C#, Java, JavaScript, Rust, Go, Swift, and Python, so you can plug it in almost anywhere, from native apps to browsers to embedded/edge devices.

Technical highlights:

(1) Lightning-speed — Real-time factor:

• 0.001 on RTX4090

• 0.006 on M4 Pro

(2) Ultra lightweight — 66M parameters

(3) On-device TTS — Complete privacy and zero network latency

(4) Advanced text understanding — Handles complex, real-world inputs naturally

(5) Flexible deployment — Works in browsers, mobile apps, and small edge devices
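
For readers unfamiliar with the metric in (1): real-time factor (RTF) is the time spent synthesizing divided by the duration of the audio produced, so lower is faster. Below is a minimal sketch of that arithmetic using the two RTF figures quoted above. The function names and the one-hour example length are illustrative assumptions, not part of Supertonic.

```python
def real_time_factor(synthesis_seconds: float, audio_seconds: float) -> float:
    """RTF = time spent synthesizing / duration of the audio produced."""
    if audio_seconds <= 0:
        raise ValueError("audio_seconds must be positive")
    return synthesis_seconds / audio_seconds


def synthesis_time(rtf: float, audio_seconds: float) -> float:
    """Seconds of compute needed to produce `audio_seconds` of speech at a given RTF."""
    return rtf * audio_seconds


# RTF figures quoted in the post; one hour of audio is an arbitrary example length.
for device, rtf in [("RTX4090", 0.001), ("M4 Pro", 0.006)]:
    seconds = synthesis_time(rtf, 3600.0)
    print(f"{device}: about {seconds:.1f} s of compute per hour of audio")
```

Taken at face value, those figures mean an hour of speech would take a few seconds on the RTX4090 and a few tens of seconds on the M4 Pro.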

Regarding (4), one of my favorite test sentences is:

• He spent 10,000 JPY to buy tickets for a JYP concert.

Here, “JPY” refers to Japanese yen, while “JYP” refers to a name — Supertonic handles the difference seamlessly.

Hope it's useful for you!

[-]

Chromix_@reddit

It looks like the Python version needs some love (and chunking).

GPU acceleration isn't implemented, and pushing a 128 KB text file through CPU encoding starts using a ton of RAM. It ultimately failed with:

Failed to allocate memory for requested buffer of size 124.992.000.256

[-]

Previous-Possible-68@reddit

It looks like this issue has just been addressed in the latest updates on GitHub. They updated the text chunking and it seems to work well now. https://github.com/supertone-inc/supertonic

[-]

Thank you for pointing out the issue. Since our model is trained on audio segments up to 30 seconds long—approximately 350 characters—it may not perform well with longer text. We've addressed this by adding text-chunking algorithms in the latest commit.
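
As a rough illustration of what text chunking around the roughly 350-character limit could look like, here is a minimal sketch that greedily packs whole sentences into chunks. This is not the algorithm from the Supertonic repository. The function names, the sentence-splitting regex, and the word-level fallback are all assumptions made for the example.

```python
import re


def split_on_words(sentence: str, max_chars: int) -> list[str]:
    """Break one overlong sentence on word boundaries (hypothetical helper)."""
    pieces: list[str] = []
    current = ""
    for word in sentence.split():
        candidate = f"{current} {word}".strip()
        if len(candidate) <= max_chars:
            current = candidate
        else:
            if current:
                pieces.append(current)
            current = word  # a single word longer than max_chars is kept whole
    if current:
        pieces.append(current)
    return pieces


def chunk_text(text: str, max_chars: int = 350) -> list[str]:
    """Greedily pack whole sentences into chunks of at most max_chars characters."""
    # Split after sentence-ending punctuation that is followed by whitespace.
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks: list[str] = []
    current = ""
    for sentence in sentences:
        if not sentence:
            continue
        if len(sentence) <= max_chars:
            pieces = [sentence]
        else:
            pieces = split_on_words(sentence, max_chars)
        for piece in pieces:
            candidate = f"{current} {piece}".strip()
            if len(candidate) <= max_chars:
                current = candidate
            else:
                if current:
                    chunks.append(current)
                current = piece
    if current:
        chunks.append(current)
    return chunks


if __name__ == "__main__":
    long_text = "This is one short sentence. " * 50
    for i, chunk in enumerate(chunk_text(long_text)):
        print(i, len(chunk))
```

Each chunk would then be synthesized separately and the audio concatenated, which keeps memory use bounded by the chunk size instead of the full input length.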

[-]

SwarfDive01@reddit

Lightweight is convenient, but can it handle non-verbal sounds like DIA? [Cough] or [laughing], [giggle], or emotional inflection beyond question or statement, like anger, panic, mania, or excitement? And what about selecting a specific voice? If I like a voice, can I set it, or will I need an anchor model voice?

[-]

IrisColt@reddit

Can it moan? Asking for a friend, heh

[-]

SwarfDive01@reddit

You know, I don't remember seeing it explicitly documented...buuuut, maybe

[-]

ANLGBOY@reddit (OP)

To be honest, our model cannot yet provide human-like reactions or controllable expressions. So far, we have focused on computational efficiency, but we plan to address these capabilities in the near future.

[-]

OliDouche@reddit

Can something like this also work for transcription? Looking for something that produces transcripts on macOS as fast as possible. Currently using whisper.cpp

[-]

simracerman@reddit

Can it do 1/3 of this speed but sound like Kokoro?

[-]

ANLGBOY@reddit (OP)

Our benchmark shows that Supertonic is significantly faster than Kokoro in CPU environments (about 10 times faster).

https://github.com/supertone-inc/supertonic?tab=readme-ov-file#characters-per-second

[-]

Dudmaster@reddit

I believe his question was, if more compute is spent on Supertonic (processed slower) does quality increase?

[-]

silenceimpaired@reddit

How accurate is it compared to Kokoro? Does it support voice cloning?

[-]

ANLGBOY@reddit (OP)

We have not conducted a thorough comparison of its pronunciation accuracy. However, it offers many advantages for processing natural text, as shown in https://huggingface.co/spaces/Supertone/supertonic#text-handling

We also plan to enable users to utilize their own voices with the open-source model in the near future.

[-]

silenceimpaired@reddit

Sounds exciting. I’ll have to dig into it after work. Hopefully you guys used Apache or MIT licensing. It seems these days you get either a full-featured tool or great licensing.

[-]

Material_Abies2307@reddit

It seems to be English only… with all due respect, if you’re not gonna beat Kokoro or Piper on voice availability, there’s no use for anything lighter than it

[-]

Foreign-Beginning-49@reddit

I will respectfully disagree: neither Piper nor Kokoro is very fast on edge devices. I'll see if this tonic works today.

[-]

coder543@reddit

What do you mean Kokoro isn't fast on edge devices? It is absolutely tiny.

[-]

Foreign-Beginning-49@reddit

So far this is faster than Kokoro or Piper, and comparable in quality as well, if not better. We might be splitting hairs on this edge device territory though.

[-]

EndlessZone123@reddit

Kokoro is still far bigger than the standard built-in TTS on any device. This is far closer to those, it seems.

[-]

cleverusernametry@reddit

Speed is not the issue at all in TTS. We've already been fast for a long time. The issue is quality.

[-]

Foreign-Beginning-49@reddit

Still, there's an edge device frontier to explore. I just tested on an older Android and it's really fast and sounds great too. This model does surprisingly well. It's impressive in my book.

[-]

LeatherRub7248@reddit

OP:

Does it handle accents when cloning a voice?

[-]

LightMaleficent5844@reddit

Numerous pronunciation errors and occasional omission of whole words in the demo. Step count doesn't matter.

[-]

r4in311@reddit

Thanks for sharing. Truly incredible speed, but sadly it sounds much worse than Kokoro and kind of soulless tbh. Is finetuning code available?

[-]

Zc5Gwu@reddit

The male voice sounds better than any of the kokoro male voices IMO.

[-]

ANLGBOY@reddit (OP)

Fine-tuning code is not currently available. However, we plan to offer a pipeline that allows users to use their preferred voices with the open-source model.

[-]

66M params but how much memory does it take up during inference?
Does model size scale if the resources are available to train for it?
Will there be finetuning? Kokoro died for me when there was no ability to train voices.

[-]

ANLGBOY@reddit (OP)

Using the original PyTorch model consumes only about 1000 MiB of memory, allowing us to use a batch size greater than 256 for 400-character inputs.
We internally checked that a single model at the current model size can synthesize multiple languages.
We plan to launch a service that allows users to select and use their preferred voice with the open-source model.
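
To put the memory figure in context, here is a back-of-the-envelope estimate of the space taken by the weights alone. It assumes 32-bit or 16-bit storage, which the thread does not state. The function name is illustrative.

```python
MIB = 1024 ** 2  # bytes per mebibyte


def weight_memory_mib(num_params: int, bytes_per_param: int) -> float:
    """Memory needed to hold the parameters alone, in MiB."""
    return num_params * bytes_per_param / MIB


params = 66_000_000  # parameter count quoted in the post
for name, width in [("float32", 4), ("float16", 2)]:
    print(f"{name}: about {weight_memory_mib(params, width):.0f} MiB for weights")
```

Under the float32 assumption the weights come to roughly 252 MiB, so most of the quoted 1000 MiB would presumably be activations and framework overhead, not parameters.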

[-]

Icy-Swordfish7784@reddit

The demo sounds good. It's also great that you provided bindings in so many languages and not just Python, so it should be easy to integrate into a variety of projects, not just web servers.