MOSS-TTS-Nano: a 0.1B open-source multilingual TTS model that runs on 4-core CPU and supports realtime speech generation
Posted by TimeEnvironmental219@reddit | LocalLLaMA | View on Reddit | 6 comments
We just open-sourced MOSS-TTS-Nano, a tiny multilingual speech generation model from MOSI.AI and the OpenMOSS team.
Some highlights:
- 0.1B parameters
- Realtime speech generation
- Runs on CPU without requiring a GPU
- Multilingual support (Chinese, English, Japanese, Korean, Arabic, and more)
- Streaming inference
- Long-text voice cloning
- Simple local deployment via `infer.py`, `app.py`, and CLI commands
The project is aimed at practical TTS deployment: small footprint, low latency, and easy local setup for demos, lightweight services, and product integration.
GitHub:
https://github.com/OpenMOSS/MOSS-TTS-Nano
Huggingface:
https://huggingface.co/spaces/OpenMOSS-Team/MOSS-TTS-Nano
Online demo:
https://openmoss.github.io/MOSS-TTS-Nano-Demo/
Would love to hear feedback on quality, latency, and what use cases you’d want to try with a tiny open TTS model.
unculturedperl@reddit
For me, many of the multilingual samples seem to be the same English sample repeated?
unculturedperl@reddit
Ran the demo on an N100 (four cores, 16 GB). With defaults it can't do real time (this is after warmup and a couple of test sentences to make sure the assets are loaded): the streaming results stutter pretty constantly, but the final result played back is OK.
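One way to read the audio=/elapsed= pairs in the Done lines from these runs: the real-time factor (elapsed generation time divided by produced-audio duration) stays well above 1x on this box, which matches the stuttering. A quick sketch of that arithmetic (the `rtf` helper is just for illustration, not part of the project):

```python
# Real-time factor (RTF): wall-clock generation time over audio duration.
# RTF <= 1.0 means the model keeps up with playback, i.e. real time.
def rtf(elapsed_s: float, audio_s: float) -> float:
    return elapsed_s / audio_s

# (elapsed, audio) pairs taken from the logged runs in this thread.
runs = [(9.63, 3.76), (38.03, 15.92), (71.28, 30.00), (29.32, 12.24)]
for elapsed, audio in runs:
    print(f"audio={audio:.2f}s elapsed={elapsed:.2f}s RTF={rtf(elapsed, audio):.2f}x")
```

Every run here lands around 2.4x, i.e. roughly 2.4 seconds of compute per second of audio on this CPU.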
Buffalo buffalo Buffalo buffalo buffalo buffalo Buffalo buffalo
Done | mode=voice_clone | prompt=en_3 | attn=eager | tts_batch=1 | codec_batch=1 | exec=cpu | cpu_threads=4 | audio=3.76s | elapsed=9.63s
"Four score and seven years ago our fathers brought forth, upon this continent, a new nation, conceived in liberty, and dedicated to the proposition that all men are created equal."
Done | mode=voice_clone | prompt=en_3 | attn=eager | tts_batch=1 | codec_batch=1 | exec=cpu | cpu_threads=4 | audio=15.92s | elapsed=38.03s
Coherency was good on long sentences (insert some Nabokov for your own edification here). It randomly stumbles on pronouncing some words, e.g. "everywhere" came out as "airy-ware", "façade" as "fah sat", "dedicated" as "dead uhcated", in some voices but not in others. Good output quality, but noticeably generated.
unculturedperl@reddit
Oooh, one bad thing I've stumbled across: short sentences frequently lose coherency and enter generation hell. "No thanks":
Done | mode=voice_clone | prompt=en_3 | attn=eager | tts_batch=1 | codec_batch=1 | exec=cpu | cpu_threads=4 | audio=30.00s | elapsed=71.28s
"Certainly sir":
Done | mode=voice_clone | prompt=en_3 | attn=eager | tts_batch=1 | codec_batch=1 | exec=cpu | cpu_threads=4 | audio=12.24s | elapsed=29.32s
Mghrghneli@reddit
Very impressive for such a small model. Would love to test on edge devices as a replacement for Kokoro.
Skystunt@reddit
This is cool!
TimeEnvironmental219@reddit (OP)
Please try https://github.com/OpenMOSS/MOSS-TTS-Nano?tab=readme-ov-file#local-web-demo-with-apppy for local real-time speech generation on just a 4-core CPU!