Just dropped Kani TTS English - a 400M TTS model that's 5x faster than realtime on RTX 4080
Posted by ylankgz@reddit | LocalLLaMA | 71 comments
Hey everyone!
We've been quietly grinding, and today, we're pumped to share the new release of KaniTTS English, as well as Japanese, Chinese, German, Spanish, Korean and Arabic models.
Benchmark on VastAI: RTF (Real-Time Factor) of ~0.2 on an RTX 4080, ~0.5 on an RTX 3060.
It has 400M parameters. We achieved this speed by pairing an LFM2-350M backbone with an efficient NanoCodec.
It's released under the Apache 2.0 License so you can use it for almost anything.
What Can You Build?
- Real-Time Conversation.
- Affordable Deployment: it's light enough to run efficiently on budget-friendly hardware like RTX 30xx, 40xx, and 50xx cards.
- Next-Gen Screen Readers & Accessibility Tools.
Model Page: https://huggingface.co/nineninesix/kani-tts-400m-en
Pretrained Checkpoint: https://huggingface.co/nineninesix/kani-tts-400m-0.3-pt
Github Repo with Fine-tuning/Dataset Preparation pipelines: https://github.com/nineninesix-ai/kani-tts
Demo Space: https://huggingface.co/spaces/nineninesix/KaniTTS
OpenAI-Compatible API Example (Streaming): If you want to drop this right into your existing project, check out our vLLM implementation: https://github.com/nineninesix-ai/kanitts-vllm
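Here's a minimal sketch of what a streaming request against an OpenAI-compatible TTS server like this typically looks like; the base URL, model id, and voice name below are placeholders, so check the kanitts-vllm README for the actual values:

```python
# Minimal sketch of a streaming request to an OpenAI-compatible TTS server
# such as the one exposed by kanitts-vllm. Base URL, model id, and voice
# name are placeholders -- check the kanitts-vllm README for real values.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")

with client.audio.speech.with_streaming_response.create(
    model="kani-tts-400m-en",   # placeholder model id
    voice="andrew",             # placeholder voice name
    input="Hello from KaniTTS over an OpenAI-compatible streaming API.",
) as response:
    response.stream_to_file("output.wav")
```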
Voice Cloning Demo (currently unstable): https://huggingface.co/spaces/nineninesix/KaniTTS_Voice_Cloning_dev
Our Discord Server: https://discord.gg/NzP3rjB4SB
konovalov-nk@reddit
Questions:
I assume 200-300 hours should be enough for a completely new language. E.g. my use case is RU/PL.
ylankgz@reddit (OP)
We used 1000 hours per language, but I think 200 should be enough. It also heavily depends on the audio quality.
We have managed to train a single multilingual model speaking 6 languages (https://huggingface.co/nineninesix/kani-tts-370m), although I would prefer to finetune for a single language.
konovalov-nk@reddit
Gotcha, thank you!
So I read briefly about what pretraining a model involves, and it seems one way to make my dream come true and train a model to speak the 3 languages I want at the same time is:
Questions:
banafo@reddit
I suspect it would work, but it will be with the accent of the speaker you picked.
konovalov-nk@reddit
I love accents and find them hilarious, especially in TTS. Sure, some would disagree, but I'm just building fun stuff to play around with at this point (GPU poor) 🤣
We can fix accents later!
Crinkez@reddit
Are there installation instructions?
ylankgz@reddit (OP)
https://github.com/nineninesix-ai/kani-tts
OC2608@reddit
I've checked the pipeline for training. Since it supports multi-speaker, I'd like to finetune it. However...
I don't have this GPU, but I have a Kaggle account. Is it possible to make a notebook to finetune a checkpoint there?
oMGalLusrenmaestkaen@reddit
how would I go about fine-tuning this for another language (Bulgarian)? how much training data do I need? what considerations should I have?
banafo@reddit
I might give it a shot for Bulgarian :)
oMGalLusrenmaestkaen@reddit
Please keep me in the loop. I'm looking for an open-source Bulgarian TTS for a smart home assistant project, and there really aren't any good options, even though closed-source offerings abound (ElevenLabs, Google Gemini 2.5 TTS, and Google NotebookLM are all incredibly good).
banafo@reddit
Do you have 10h single speaker of clean Bulgarian we could use?
oMGalLusrenmaestkaen@reddit
I could probably generate synthetic data using Gemini, but I'm currently preoccupied
ylankgz@reddit (OP)
I would take >=200 hours of clean multispeaker audio and then finetune on 2-3 hours of single-speaker data. You should unfreeze lm_head and the embeddings when you perform full LoRA finetuning.
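A minimal sketch of what that could look like with Hugging Face PEFT, assuming the backbone loads as a causal LM; the attention projection and embedding module names below are assumptions, not the actual KaniTTS training code:

```python
# Sketch of a LoRA setup that also fully trains lm_head and the embeddings,
# assuming the backbone loads as a Hugging Face causal LM. Module names are
# assumptions -- adjust them to the model's actual layer names.
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

model = AutoModelForCausalLM.from_pretrained("nineninesix/kani-tts-400m-0.3-pt")

lora_config = LoraConfig(
    r=64,
    lora_alpha=128,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],  # assumed projections
    modules_to_save=["lm_head", "embed_tokens"],              # unfrozen, trained fully
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()
```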
banafo@reddit
Tried it, works quite well and it's really fast. I did notice that it has some disfluencies in the output sometimes (English).
ylankgz@reddit (OP)
Thanks for the feedback. Each speaker has its own characteristics, coming from the data it was trained on.
banafo@reddit
If I don't have speaker labels, could I still finetune and use it with voice cloning?
ylankgz@reddit (OP)
We have built voice cloning without speaker labels. Not really good so far, tbh.
goldenjm@reddit
Congratulations on launching your model. Try my TTS torture test paragraph:
There are hard to pronounce phrases, e.g. (i) We use ArXiv and LaTeX (ii) It cost $5.6 million (iii) Json != xml; also (iv) Example vector: (x_1,...,x_2) (v) We have some RECOMMENDATIONS (i.e. suggestions) and (6) During 2010-2018. Figure 2a: It took us 16 NVIDIA gpus, and 13.7 hrs 14 mins. Consider a set A, where a equals 2 times a.
Models generally have a lot of difficulty with it. Unfortunately, yours does as well. I would love an update if you're able to successfully pronounce this paragraph in the future.
ylankgz@reddit (OP)
I will take it as a benchmark!
goldenjm@reddit
Great! I'm the founder of a free TTS web and mobile app. You might enjoy our blog post where we used this torture test paragraph as part of our evaluation of many TTS systems.
Thank you for contributing an open-weight model to the community and please keep working on it!
CheatCodesOfLife@reddit
I don't suppose you could post a link to this phrase being said correctly by a TTS system?
goldenjm@reddit
Yes, I'm the founder of a free text-to-speech web and mobile app Paper2Audio and here's our audio for this difficult paragraph. We use this paragraph as a torture test when comparing TTS models. Our output isn't perfect (particularly how we read some of the roman numerals), but it is close.
coder543@reddit
If you capitalize “GPUs” correctly, Kokoro gets very close… I counted three definite errors (ArXiv, LaTeX, and a missing “a”), and one borderline error (inconsistent roman numeral pronunciation, pronouncing v as “vee”).
Correct capitalization is not optional, as it significantly changes the pronunciation of words. A native English speaker who didn't have technical knowledge would be unable to pronounce "gpus" the way you want it pronounced.
banafo@reddit
Probably because espeak normalizes most of it before handing it to the TTS in Kokoro. Try normalizing it first and then feeding it in.
softwareweaver@reddit
Tried it on HF space with English - Andrew with the text below
You can call (408) 424 5214 to get the status
It spoke too fast and messed up the numbers.
ylankgz@reddit (OP)
Yeah, good point. Need to finetune for phone numbers.
Silver_Jaguar_24@reddit
Why release if it's not ready bro? lol
banafo@reddit
It's easy to work around. I wouldn't call it not ready just because it doesn't deal with digits. They don't use espeak, which normally takes care of this; it's trivial to add num2words to your inference pipeline. Look at all the things they released at once already; it's a pretty impressive effort we should be grateful for. Give them some time to iron out the small issues.
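A minimal sketch of that kind of normalization step, purely illustrative rather than the project's actual preprocessing:

```python
# Minimal sketch of digit normalization with num2words before sending text
# to the TTS; illustrative only, not the project's actual preprocessing.
import re
from num2words import num2words

def normalize_numbers(text: str, lang: str = "en") -> str:
    # Replace each run of digits with its spelled-out form.
    return re.sub(r"\d+", lambda m: num2words(int(m.group()), lang=lang), text)

print(normalize_numbers("Call 408 424 5214 to get the status"))
# e.g. "Call four hundred and eight four hundred and twenty-four ..."
```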
der_pelikan@reddit
Not just phone numbers; it messed up 317 in German.
When the numbers are replaced with a textual representation, it handles them pretty well, though.
All in all, a TTS I'd consider for my personal assistant. Well done.
ylankgz@reddit (OP)
Thanks for the feedback. We'll make it work for numbers (phone numbers, years, Roman numerals, etc.) as well as abbreviations across all pretrained languages.
dagerdev@reddit
Same problem in Spanish. It sounds like it's having an aneurysm; it was hilarious. 😆 Listen :)
https://files.catbox.moe/sk6u3l.wav
skyblue_Mr@reddit
I deployed and tried CPU inference on my RK3588 dev board, and for an average 3-4 second audio clip, the inference takes about 280 seconds. Even on my PC with an R9 4790K using the same code, the average inference time is still around 6-7 seconds. Was this model not optimized for CPU inference at all? lol
silenceimpaired@reddit
Nothing in the post seems to indicate it was.
ylankgz@reddit (OP)
It would need to be converted to GGUF to work on a Pi.
ylankgz@reddit (OP)
We've made an MLX version for Apple Silicon; GGUF is next.
Mythril_Zombie@reddit
The Irish multilingual girl is pretty good.
ylankgz@reddit (OP)
Does it have a real Irish accent?
mandrak4@reddit
Portuguese? 🥺
ylankgz@reddit (OP)
Next release
MrEU1@reddit
New to these. How can one add a) a new language, b) a new voice (voice cloning), c) a voice with emotions?
ylankgz@reddit (OP)
You can finetune it for a new language. I would train on >=200 hours of multispeaker speech and then on 2-3 hours of single-speaker data.
We are working on a separate model that supports voice cloning out of the box.
You mean emotion tags? That's also easy to finetune for.
AvidCyclist250@reddit
Sound quality and intonation are great, but it's useless because it garbles words, invents words, skips words, and hallucinates.
Narrow-Belt-5030@reddit
Voice sounds nice, but it's not production ready.
Unless it was an issue with the Hugging Face demo page: I gave it a long string to say and it got confused midway through, said "umm", and bombed out (stopped speaking).
ylankgz@reddit (OP)
Yes, on HF it cannot take long sentences, roughly 15 seconds of speech. On a dedicated GPU like an RTX 4090 with vLLM it's 0.2 RTF and supports streaming.
Narrow-Belt-5030@reddit
Ah, ok, sorry. I will gladly try it at home later then; I have a 5090 and am on the lookout for a better TTS.
Can it stream via API? Other voices?
ylankgz@reddit (OP)
A 5090 works well. Try deploying the OpenAI-compatible server (https://github.com/nineninesix-ai/kanitts-vllm) and check the RTF on your machine.
Narrow-Belt-5030@reddit
You really have my attention now.
I will test for sure. The big issue I have is sm_120 / 5090 compatibility with various libraries; if you say this repo works with the 5090, then you've cracked the issue for me. (Currently using MS Edge TTS... it's great, with a good selection of voices, but high latency compared to local.)
ylankgz@reddit (OP)
It works on sm_120. I have a 5080 and have tested on a 5090 on Novita and Vast.ai.
ylankgz@reddit (OP)
Also you can easily finetune it on your custom dataset
Trysem@reddit
Again English 🥴
Double_Cause4609@reddit
I think I'm in love
ylankgz@reddit (OP)
Hope it will be useful for you!
caetydid@reddit
Decent voices for German and English! Now I just need a dynamically switching multilingual model that can deal with mixed language text.
ylankgz@reddit (OP)
Most likely the German model can speak English and vice versa.
Yorn2@reddit
Thank you for including an OpenAI-compatible API for those of us who are trying to drop something like this into existing projects. I wish more TTS engines did this.
ylankgz@reddit (OP)
You’re welcome 👍
Jesus_lover_99@reddit
It makes a lot of errors. I dropped a few comments from HN and it was incoherent.
> This is amazing. My entire web browser session state for every private and personal website I sign onto every day will be used for training data. It's great! I love this. This is exactly the direction humans should be going in to not self-destruct. The future is looking bright, while the light in our brains dims to eventual darkness. Slowly. Tragically. And for what purpose exactly. So cool.
It breaks at 'The future...'
ylankgz@reddit (OP)
The example on Spaces has a limit of around 15 seconds. It should work with the kanitts-vllm example, since there we implemented chunking and streaming.
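As an illustration of the chunking idea only (not the actual kanitts-vllm code), long input can be split at sentence boundaries and synthesized piece by piece:

```python
# Illustration of the chunking idea only -- not the actual kanitts-vllm code.
# Long input is split at sentence boundaries so each piece stays within the
# model's comfortable utterance length, then synthesized chunk by chunk.
import re

def chunk_text(text: str, max_chars: int = 200) -> list[str]:
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks, current = [], ""
    for sentence in sentences:
        if current and len(current) + len(sentence) + 1 > max_chars:
            chunks.append(current)
            current = sentence
        else:
            current = f"{current} {sentence}".strip()
    if current:
        chunks.append(current)
    return chunks

for chunk in chunk_text("First sentence. Second one! And a third, much longer one?"):
    print(chunk)  # each chunk would be sent to the TTS and streamed back
```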
ylankgz@reddit (OP)
This is an agent we made with Deepgram -> OpenAI -> KaniTTS using streaming: https://youtu.be/wKBULlDO_3U
Devcomeups@reddit
How exactly do you connect this to a model and use it? Are there any instruction guides anywhere?
popiazaza@reddit
https://github.com/nineninesix-ai/kanitts-vllm
ubrtnk@reddit
Any chance you could squeeze it down just a BIT smaller? lol. I've got a Jetson Orin Nano Super with 8GB sitting here with nothing to do. TTS/STT was my intention for it, but I haven't gotten around to piecing it together.
banafo@reddit
When you try the STT, be sure to give us a try; these are our small CC-BY models: https://huggingface.co/spaces/Banafo/Kroko-Streaming-ASR-Wasm (runs locally in the browser).
ylankgz@reddit (OP)
It's bf16 and can easily be quantized to half its size.
Powerful_Evening5495@reddit
Wow, trying it in Arabic.
Not good, OP. It's fast, but not good.
combrade@reddit
XTTS is good in Arabic, even with voice clones. I cloned Al Jazeera voices with XTTS.
ylankgz@reddit (OP)
We'll make it better! Thanks for the feedback!
getgoingfast@reddit
Impressive voice quality! Thanks for sharing.
Curious how TTS model parameter count translates to VRAM usage; it looks very different from LLMs. This 400M model is using up to 16GB of VRAM. I could not find a VRAM usage number for Kokoro-82M for contrast. 4GB?
ylankgz@reddit (OP)
We've got it to fit in 12GB of VRAM on an RTX 3060 with 0.8 GPU memory utilization. Kokoro is a StyleTTS 2-like architecture and requires much less memory; it can run efficiently on CPU at almost the same speed.
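The 0.8 figure presumably refers to vLLM's gpu_memory_utilization knob, which caps how much VRAM gets preallocated. A minimal sketch, assuming the backbone is served through vLLM as in kanitts-vllm (the model id is illustrative):

```python
# Minimal sketch of capping vLLM's VRAM preallocation, assuming the KaniTTS
# backbone is served through vLLM as in kanitts-vllm. The model id is
# illustrative; gpu_memory_utilization is the "0.8 utilization" knob.
from vllm import LLM

llm = LLM(
    model="nineninesix/kani-tts-400m-en",  # illustrative model id
    gpu_memory_utilization=0.8,            # preallocate at most 80% of VRAM
    max_model_len=2048,                    # small KV cache for short utterances
)
```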