MOSS-TTS-Nano: a 0.1B open-source multilingual TTS model that runs on 4-core CPU and supports realtime speech generation
Posted by TimeEnvironmental219@reddit | LocalLLaMA | View on Reddit | 6 comments
We just open-sourced MOSS-TTS-Nano, a tiny multilingual speech generation model from MOSI.AI and the OpenMOSS team.
Some highlights:
- 0.1B parameters
- Realtime speech generation
- Runs on CPU without requiring a GPU
- Multilingual support (Chinese, English, Japanese, Korean, Arabic, and more)
- Streaming inference
- Long-text voice cloning
- Simple local deployment via `infer.py`, `app.py`, and CLI commands
The project is aimed at practical TTS deployment: small footprint, low latency, and easy local setup for demos, lightweight services, and product integration.
GitHub:
https://github.com/OpenMOSS/MOSS-TTS-Nano
Huggingface:
https://huggingface.co/spaces/OpenMOSS-Team/MOSS-TTS-Nano
Online demo:
https://openmoss.github.io/MOSS-TTS-Nano-Demo/
Would love to hear feedback on quality, latency, and what use cases you’d want to try with a tiny open TTS model.
unculturedperl@reddit
For me, many of the multilingual samples seem to be the same English sample repeated?
unculturedperl@reddit
Ran the demo on an N100 (four cores, 16 GB). With defaults it can't do real time (this is after warmup and a couple of test sentences to make sure the assets are loaded): the streaming results stutter pretty constantly, but the final result played back is OK.
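One way to read the audio=/elapsed= pairs in the Done lines from these runs: the real-time factor (elapsed generation time divided by produced-audio duration) stays well above 1x on this box, which matches the stuttering. A quick sketch of that arithmetic (the `rtf` helper is just for illustration, not part of the project):

```python
# Real-time factor (RTF): wall-clock generation time over audio duration.
# RTF <= 1.0 means the model keeps up with playback, i.e. real time.
def rtf(elapsed_s: float, audio_s: float) -> float:
    return elapsed_s / audio_s

# (elapsed, audio) pairs taken from the logged runs in this thread.
runs = [(9.63, 3.76), (38.03, 15.92), (71.28, 30.00), (29.32, 12.24)]
for elapsed, audio in runs:
    print(f"audio={audio:.2f}s elapsed={elapsed:.2f}s RTF={rtf(elapsed, audio):.2f}x")
```

Every run here lands around 2.4x, i.e. roughly 2.4 seconds of compute per second of audio on this CPU.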
Buffalo buffalo Buffalo buffalo buffalo buffalo Buffalo buffalo
Done | mode=voice_clone | prompt=en_3 | attn=eager | tts_batch=1 | codec_batch=1 | exec=cpu | cpu_threads=4 | audio=3.76s | elapsed=9.63s
"Four score and seven years ago our fathers brought forth, upon this continent, a new nation, conceived in liberty, and dedicated to the proposition that all men are created equal."
Done | mode=voice_clone | prompt=en_3 | attn=eager | tts_batch=1 | codec_batch=1 | exec=cpu | cpu_threads=4 | audio=15.92s | elapsed=38.03s
Coherency was good on long sentences (insert some Nabokov for your own edification here). It randomly stumbles on pronouncing some words, e.g. "everywhere" came out as "airy-ware", "façade" as "fah sat", "dedicated" as "dead uhcated", in some voices but not in others. Good output quality, but noticeably generated.
unculturedperl@reddit
Oooh, one bad thing I've stumbled across: short sentences frequently lose coherency and enter generation hell. "No thanks":
Done | mode=voice_clone | prompt=en_3 | attn=eager | tts_batch=1 | codec_batch=1 | exec=cpu | cpu_threads=4 | audio=30.00s | elapsed=71.28s
"Certainly sir":
Done | mode=voice_clone | prompt=en_3 | attn=eager | tts_batch=1 | codec_batch=1 | exec=cpu | cpu_threads=4 | audio=12.24s | elapsed=29.32s
Mghrghneli@reddit
Very impressive for such a small model. Would love to test on edge devices as a replacement for Kokoro.
Skystunt@reddit
This is cool!
TimeEnvironmental219@reddit (OP)
Please try https://github.com/OpenMOSS/MOSS-TTS-Nano?tab=readme-ov-file#local-web-demo-with-apppy for local real-time speech generation on just a 4-core CPU!