Just dropped Kani TTS English - a 400M TTS model that's 5x faster than realtime on RTX 4080
Posted by ylankgz@reddit | LocalLLaMA | 71 comments
Hey everyone!
We've been quietly grinding, and today, we're pumped to share the new release of KaniTTS English, as well as Japanese, Chinese, German, Spanish, Korean and Arabic models.
Benchmark on VastAI: RTF (Real-Time Factor) of ~0.2 on an RTX 4080, ~0.5 on an RTX 3060.
It has 400M parameters. We achieved this speed by pairing an LFM2-350M backbone with an efficient NanoCodec.
It's released under the Apache 2.0 License so you can use it for almost anything.
What Can You Build?
- Real-Time Conversation.
- Affordable Deployment: it's light enough to run efficiently on budget-friendly hardware like RTX 30xx, 40xx, and 50xx cards.
- Next-Gen Screen Readers & Accessibility Tools.
Model Page: https://huggingface.co/nineninesix/kani-tts-400m-en
Pretrained Checkpoint: https://huggingface.co/nineninesix/kani-tts-400m-0.3-pt
Github Repo with Fine-tuning/Dataset Preparation pipelines: https://github.com/nineninesix-ai/kani-tts
Demo Space: https://huggingface.co/spaces/nineninesix/KaniTTS
OpenAI-Compatible API Example (Streaming): If you want to drop this right into your existing project, check out our vLLM implementation: https://github.com/nineninesix-ai/kanitts-vllm
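Here's a minimal sketch of what a streaming request against an OpenAI-compatible TTS server like this typically looks like; the base URL, model id, and voice name below are placeholders, so check the kanitts-vllm README for the actual values:

```python
# Minimal sketch of a streaming request to an OpenAI-compatible TTS server
# such as the one exposed by kanitts-vllm. Base URL, model id, and voice
# name are placeholders -- check the kanitts-vllm README for real values.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")

with client.audio.speech.with_streaming_response.create(
    model="kani-tts-400m-en",   # placeholder model id
    voice="andrew",             # placeholder voice name
    input="Hello from KaniTTS over an OpenAI-compatible streaming API.",
) as response:
    response.stream_to_file("output.wav")
```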
Voice Cloning Demo (currently unstable): https://huggingface.co/spaces/nineninesix/KaniTTS_Voice_Cloning_dev
Our Discord Server: https://discord.gg/NzP3rjB4SB
konovalov-nk@reddit
Questions:
I assume 200-300 hours should be enough for a completely new language. E.g. my use case is RU/PL.
ylankgz@reddit (OP)
We used 1000 hours per language, but I think 200 should be enough. It also heavily depends on the audio quality.
We have managed to train a single multilingual model speaking 6 languages (https://huggingface.co/nineninesix/kani-tts-370m), although I would prefer to finetune for a single language.
konovalov-nk@reddit
Gotcha, thank you!
So I read briefly about what pretraining a model involves, and it seems one way to make my dream come true and train a model to speak the 3 languages I want at the same time is:
Questions:
banafo@reddit
I suspect it would work, but it will be with the accent of the speaker you picked.
konovalov-nk@reddit
I love accents and find them hilarious, especially in TTS. Sure, some would disagree, but I'm just building fun stuff to play around with at this point (GPU poor) 🤣
We can fix accents later!
Crinkez@reddit
Are there installation instructions?
ylankgz@reddit (OP)
https://github.com/nineninesix-ai/kani-tts
OC2608@reddit
I've checked the pipeline for training. Since it supports multi-speaker, I'd like to finetune it. However...
I don't have this GPU, but I have a Kaggle account. Is it possible to make a notebook to finetune a checkpoint there?
oMGalLusrenmaestkaen@reddit
how would I go about fine-tuning this for another language (Bulgarian)? how much training data do I need? what considerations should I have?
banafo@reddit
I might give it a shot for Bulgarian :)
oMGalLusrenmaestkaen@reddit
Please keep me in the loop. I'm looking for an open-source Bulgarian TTS for a smart home assistant project, and there really aren't any good options, even though closed-source offerings abound (ElevenLabs, Google Gemini 2.5 TTS, and Google NotebookLM are all incredibly good).
banafo@reddit
Do you have 10h single speaker of clean Bulgarian we could use?
oMGalLusrenmaestkaen@reddit
I could probably generate synthetic data using Gemini, but I'm currently preoccupied
ylankgz@reddit (OP)
I would take >=200 hours of clean multispeaker audio and then finetune on 2-3 hours of single-speaker data. You should unfreeze lm_head and the embeddings when you perform full LoRA finetuning.
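A minimal sketch of what that could look like with Hugging Face PEFT, assuming the backbone loads as a causal LM; the attention projection and embedding module names below are assumptions, not the actual KaniTTS training code:

```python
# Sketch of a LoRA setup that also fully trains lm_head and the embeddings,
# assuming the backbone loads as a Hugging Face causal LM. Module names are
# assumptions -- adjust them to the model's actual layer names.
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

model = AutoModelForCausalLM.from_pretrained("nineninesix/kani-tts-400m-0.3-pt")

lora_config = LoraConfig(
    r=64,
    lora_alpha=128,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],  # assumed projections
    modules_to_save=["lm_head", "embed_tokens"],              # unfrozen, trained fully
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()
```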
banafo@reddit
Tried it, works quite well and it's really fast. I did notice that it has some disfluencies in the output sometimes (English).
ylankgz@reddit (OP)
Thanks for the feedback. Each speaker has its own characteristics, coming from the data it was trained on.
banafo@reddit
If I don't have speaker labels, could I still finetune and use it with voice cloning?
ylankgz@reddit (OP)
We have built voice cloning without speaker labels. Not really good so far, tbh.
goldenjm@reddit
Congratulations on launching your model. Try my TTS torture test paragraph:
There are hard to pronounce phrases, e.g. (i) We use ArXiv and LaTeX (ii) It cost $5.6 million (iii) Json != xml; also (iv) Example vector: (x_1,...,x_2) (v) We have some RECOMMENDATIONS (i.e. suggestions) and (6) During 2010-2018. Figure 2a: It took us 16 NVIDIA gpus, and 13.7 hrs 14 mins. Consider a set A, where a equals 2 times a.
Models generally have a lot of difficulty with it. Unfortunately, yours does as well. I would love an update if you're able to successfully pronounce this paragraph in the future.
ylankgz@reddit (OP)
I will take it as a benchmark!
goldenjm@reddit
Great! I'm the founder of a free TTS web and mobile app. You might enjoy our blog post where we used this torture test paragraph as part of our evaluation of many TTS systems.
Thank you for contributing an open-weight model to the community and please keep working on it!
CheatCodesOfLife@reddit
I don't suppose you could post a link to this phrase being said correctly by a TTS system?
goldenjm@reddit
Yes, I'm the founder of a free text-to-speech web and mobile app Paper2Audio and here's our audio for this difficult paragraph. We use this paragraph as a torture test when comparing TTS models. Our output isn't perfect (particularly how we read some of the roman numerals), but it is close.
coder543@reddit
If you capitalize “GPUs” correctly, Kokoro gets very close… I counted three definite errors (ArXiv, LaTeX, and a missing “a”), and one borderline error (inconsistent roman numeral pronunciation, pronouncing v as “vee”).
Correct capitalization is not optional, as it significantly changes the pronunciation of words. A native English speaker who didn't have technical knowledge would be unable to pronounce "gpus" the way you want it pronounced.
banafo@reddit
Probably because espeak normalizes most of it before handing it to the TTS in Kokoro. Try normalizing it first and then feeding it in.
softwareweaver@reddit
Tried it on HF space with English - Andrew with the text below
You can call (408) 424 5214 to get the status
It spoke too fast and messed up the numbers.
ylankgz@reddit (OP)
Yeah, good point. Need to finetune for phone numbers.
Silver_Jaguar_24@reddit
Why release if it's not ready bro? lol
banafo@reddit
It's easy to work around. I wouldn't call it not ready just because it doesn't deal with digits. They don't use espeak, which normally takes care of this; it's trivial to add num2words to your inference pipeline. Look at all the things they released at once already; it's a pretty impressive effort we should be grateful for. Give them some time to iron out the small issues.
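A minimal sketch of that kind of normalization step, purely illustrative rather than the project's actual preprocessing:

```python
# Minimal sketch of digit normalization with num2words before sending text
# to the TTS; illustrative only, not the project's actual preprocessing.
import re
from num2words import num2words

def normalize_numbers(text: str, lang: str = "en") -> str:
    # Replace each run of digits with its spelled-out form.
    return re.sub(r"\d+", lambda m: num2words(int(m.group()), lang=lang), text)

print(normalize_numbers("Call 408 424 5214 to get the status"))
# e.g. "Call four hundred and eight four hundred and twenty-four ..."
```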
der_pelikan@reddit
Not just phone numbers; it messed up 317 in German.
When the numbers are replaced with a textual representation, it handles them pretty well, though.
All in all, a TTS I'd consider for my personal assistant. Well done.
ylankgz@reddit (OP)
Thanks for the feedback. We'll make it work for numbers (phone numbers, years, Roman numerals, etc.) as well as abbreviations across all pretrained languages.
dagerdev@reddit
Same problem in Spanish. It sounds like it's having an aneurysm; it was hilarious. 😆 Listen :)
https://files.catbox.moe/sk6u3l.wav
skyblue_Mr@reddit
I deployed and tried CPU inference on my RK3588 dev board, and for an average 3-4 second audio clip, the inference takes about 280 seconds. Even on my PC with an R9 4790K using the same code, the average inference time is still around 6-7 seconds. Was this model not optimized for CPU inference at all? lol
silenceimpaired@reddit
Nothing in the post seems to indicate it was.
ylankgz@reddit (OP)
It would need to be converted to GGUF to work on a Pi.
ylankgz@reddit (OP)
We've made an MLX version for Apple Silicon; GGUF is next.
Mythril_Zombie@reddit
The Irish multilingual girl is pretty good.
ylankgz@reddit (OP)
Does it have a real Irish accent?
mandrak4@reddit
Portuguese? 🥺
ylankgz@reddit (OP)
Next release
MrEU1@reddit
New to these. How can one add a) a new language, b) a new voice (voice cloning), c) a voice with emotions?
ylankgz@reddit (OP)
You can finetune it for a new language. I would train on >=200 hours of multispeaker speech and then on 2-3 hours of single-speaker data.
We are working on a separate model that supports voice cloning out of the box.
You mean emotion tags? That's also easy to finetune for.
AvidCyclist250@reddit
Sound quality and intonation are great, but it's useless because it garbles words, invents words, skips words, and hallucinates.
Narrow-Belt-5030@reddit
Voice sounds nice, but it's not production ready.
Unless it was an issue with the Hugging Face demo page: I gave it a long string to say and it got confused midway through, said "umm", and bombed out (stopped speaking).
ylankgz@reddit (OP)
Yes, on HF it cannot take long sentences, roughly 15 seconds of speech. On a dedicated GPU like an RTX 4090 with vLLM it's 0.2 RTF and supports streaming.
Narrow-Belt-5030@reddit
Ah, ok, sorry. I will gladly try it at home later then; I have a 5090 and am on the lookout for a better TTS.
Can it stream via API? Other voices?
ylankgz@reddit (OP)
A 5090 works well. Try deploying the OpenAI-compatible server (https://github.com/nineninesix-ai/kanitts-vllm) and check the RTF on your machine.
Narrow-Belt-5030@reddit
You really have my attention now.
I will test for sure. The big issue I have is sm_120 / 5090 compatibility with various libraries; if you say this repo works with the 5090, then you've cracked the issue for me. (Currently using MS Edge TTS... it's great, with a good selection of voices, but high latency compared to local.)
ylankgz@reddit (OP)
It works on sm_120. I have a 5080 and have tested on a 5090 on Novita and Vast.ai.
ylankgz@reddit (OP)
Also you can easily finetune it on your custom dataset
Trysem@reddit
Again English 🥴
Double_Cause4609@reddit
I think I'm in love
ylankgz@reddit (OP)
Hope it will be useful for you!
caetydid@reddit
Decent voices for German and English! Now I just need a dynamically switching multilingual model that can deal with mixed language text.
ylankgz@reddit (OP)
Most likely the German model can speak English and vice versa.
Yorn2@reddit
Thank you for including an OpenAI-compatible API for those of us who are trying to drop something like this into existing projects. I wish more TTS engines did this.
ylankgz@reddit (OP)
You’re welcome 👍
Jesus_lover_99@reddit
It makes a lot of errors. I dropped a few comments from HN and it was incoherent.
> This is amazing. My entire web browser session state for every private and personal website I sign onto every day will be used for training data. It's great! I love this. This is exactly the direction humans should be going in to not self-destruct. The future is looking bright, while the light in our brains dims to eventual darkness. Slowly. Tragically. And for what purpose exactly. So cool.
It breaks at 'The future...'
ylankgz@reddit (OP)
The example on Spaces has a limit of around 15 seconds. It should work with the kanitts-vllm example, since there we implemented chunking and streaming.
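As an illustration of the chunking idea only (not the actual kanitts-vllm code), long input can be split at sentence boundaries and synthesized piece by piece:

```python
# Illustration of the chunking idea only -- not the actual kanitts-vllm code.
# Long input is split at sentence boundaries so each piece stays within the
# model's comfortable utterance length, then synthesized chunk by chunk.
import re

def chunk_text(text: str, max_chars: int = 200) -> list[str]:
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks, current = [], ""
    for sentence in sentences:
        if current and len(current) + len(sentence) + 1 > max_chars:
            chunks.append(current)
            current = sentence
        else:
            current = f"{current} {sentence}".strip()
    if current:
        chunks.append(current)
    return chunks

for chunk in chunk_text("First sentence. Second one! And a third, much longer one?"):
    print(chunk)  # each chunk would be sent to the TTS and streamed back
```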
ylankgz@reddit (OP)
This is an agent we made with Deepgram -> OpenAI -> KaniTTS using streaming: https://youtu.be/wKBULlDO_3U
Devcomeups@reddit
How exactly do you connect this to a model and use it? Are there any instruction guides anywhere?
popiazaza@reddit
https://github.com/nineninesix-ai/kanitts-vllm
ubrtnk@reddit
Any chance you could squeeze it down just a BIT smaller? lol. I've got a Jetson Orin Nano Super with 8GB sitting here with nothing to do. TTS/STT was my intention for it, but I haven't gotten around to piecing it together.
banafo@reddit
When you try the STT, be sure to give us a try; these are our small CC-BY models: https://huggingface.co/spaces/Banafo/Kroko-Streaming-ASR-Wasm (runs locally in the browser).
ylankgz@reddit (OP)
It's bf16 and can easily be quantized to half its size.
Powerful_Evening5495@reddit
Wow, trying it in Arabic.
Not good, OP. It's fast, but not good.
combrade@reddit
XTTS is good in Arabic, even with voice clones. I cloned Al Jazeera voices with XTTS.
ylankgz@reddit (OP)
We'll make it better! Thanks for the feedback!
getgoingfast@reddit
Impressive voice quality! Thanks for sharing.
Curious how TTS model parameter count translates to VRAM usage; it looks very different from LLMs. This 400M model is using up to 16GB of VRAM. I could not find a VRAM usage number for Kokoro-82M for contrast. 4GB?
ylankgz@reddit (OP)
We've got it to fit in 12GB of VRAM on an RTX 3060 with 0.8 GPU memory utilization. Kokoro is a StyleTTS 2-like architecture and requires much less memory; it can run efficiently on CPU at almost the same speed.
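The 0.8 figure presumably refers to vLLM's gpu_memory_utilization knob, which caps how much VRAM gets preallocated. A minimal sketch, assuming the backbone is served through vLLM as in kanitts-vllm (the model id is illustrative):

```python
# Minimal sketch of capping vLLM's VRAM preallocation, assuming the KaniTTS
# backbone is served through vLLM as in kanitts-vllm. The model id is
# illustrative; gpu_memory_utilization is the "0.8 utilization" knob.
from vllm import LLM

llm = LLM(
    model="nineninesix/kani-tts-400m-en",  # illustrative model id
    gpu_memory_utilization=0.8,            # preallocate at most 80% of VRAM
    max_model_len=2048,                    # small KV cache for short utterances
)
```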