OpenMOSS-Team/MOSS-TTS-v1.5 · Hugging Face

Posted by pmttyji@reddit | LocalLLaMA | 12 comments

MOSS-TTS-v1.5

MOSS-TTS-v1.5 is continued from MOSS-TTS 1.0. It preserves the main 1.0 capabilities, including zero-shot voice cloning, long-form speech generation, token-level duration control, Pinyin/IPA pronunciation control, multilingual synthesis, and code-switching. For the full 1.0 feature walkthrough, input schema, decoding hyperparameters, and evaluation tables, please refer to the MOSS-TTS 1.0 README.

Compared with MOSS-TTS 1.0, v1.5 focuses on the following improvements:

Stronger multilingual synthesis with language tags: when the language field is omitted, v1.5 may improve some languages and regress slightly on others compared with 1.0. When the language is specified, v1.5 is stronger than 1.0 on almost all supported languages. Set the tag when building the user message, for example processor.build_user_message(text=text_fr, language="French").
More stable voice cloning: v1.5 improves speaker similarity and reduces cloning variance, making repeated generations more consistent.
Better long-reference, short-text cloning: v1.5 handles scenarios where the reference audio is much longer than the target text more reliably than 1.0.
More stable punctuation-following prosody: v1.5 follows punctuation-driven pauses more closely, especially in long sentences.
Explicit pause control: v1.5 supports inline pause markers such as "[pause 3.2s]". For example, 我今天学习了一首中国的古诗，它的名字是[pause 3.2s]静夜思！ inserts an explicit 3.2s pause before 静夜思.
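The pause syntax above is easy to generate and check programmatically. The following is a minimal sketch, not part of MOSS-TTS: `pause` and `total_pause_seconds` are hypothetical helper names, and the one-decimal formatting simply mirrors the `[pause X.Ys]` pattern in the release notes (whether the model accepts other number formats is not stated there).

```python
import re

# Marker format copied from the release notes, e.g. "[pause 3.2s]".
PAUSE_RE = re.compile(r"\[pause (\d+(?:\.\d+)?)s\]")


def pause(seconds: float) -> str:
    """Format an inline pause marker (hypothetical helper, not a MOSS-TTS API)."""
    if seconds <= 0:
        raise ValueError("pause length must be positive")
    return f"[pause {seconds:.1f}s]"


def total_pause_seconds(text: str) -> float:
    """Sum all explicit pauses requested in a text."""
    return sum(float(value) for value in PAUSE_RE.findall(text))


# Rebuild the example sentence from the model card.
text = "我今天学习了一首中国的古诗，它的名字是" + pause(3.2) + "静夜思！"
print(text)
print(total_pause_seconds(text))
```

The resulting string would then be passed as the `text` argument when building the user message, alongside the `language` tag described in the first item.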

Supported Languages

MOSS-TTS-v1.5 currently supports 31 languages. It keeps the 20 languages supported by MOSS-TTS 1.0 and extends multilingual continued training to additional languages including Cantonese, Dutch, Finnish, Hindi, Macedonian, Malay, Romanian, Swahili, Tagalog, Thai, and Vietnamese.

They released an additional model as well.

https://huggingface.co/OpenMOSS-Team/MOSS-SoundEffect-v2.0

News

2026.5.26: 🚀 Released MOSS-SoundEffect-v2.0, a new text-to-audio model using a DiT backbone with the Flow Matching objective, generating 48 kHz bilingual sound effects up to 30 seconds — see moss_soundeffect_v2/.

2026.5.26: 🚀 Released MOSS-TTS-v1.5, with stronger multilingual synthesis when language tags are provided, more stable voice cloning, better long-reference short-text cloning, punctuation-following prosody, and explicit pause control via [pause X.Ys].

Top_Training5738@reddit

The pause control and multilingual stuff honestly sound more interesting than the voice cloning itself. Most open TTS models can fake a voice now, but natural pacing and multilingual consistency are still where things usually fall apart.

Also supporting 31 languages locally is kind of wild if the quality is actually decent. Open source TTS is moving insanely fast right now. We went from robotic audiobook voices to “wait was that generated?” in like two years.

kevinlch@reddit

is the voice cloning better than omnivoice?

Borkato@reddit

And can it moan?

HareMayor@reddit

Have you found a way to have omnivoice output more expressions?

Its outputs are very monotonous even with tags

alecKarfonta@reddit

Really love the moss models but has anyone been able to get them to run in real time? Not sure if I am doing something wrong but even the streaming model cannot achieve real time speed on a 5090. Am I the only one having this problem?

ilintar@reddit

That's weird, have you tried running with my GGML based code (it can run a server) and test performance? I would be really stunned if a 5090 couldn't run the small (streaming) model at realtime speed.

Nice! Since as I understand it's the same arch, https://github.com/pwilkin/openmoss should work out of the box.

seamonn@reddit

Got a Docker?

pmttyji@reddit (OP)

I was about to tag you for this :)

jake_that_dude@reddit

for anyone trying this in Home Assistant or a voice agent, measure RTF before you wire it into the loop.

language tags matter here, and so does prompt length. if RTF is >1.0, keep it async for announcements/batch TTS. if it is <1.0 on your target GPU, then it is worth building the streaming path.
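For reference, RTF (real-time factor) is synthesis wall-clock time divided by the duration of the audio produced. A rough sketch of measuring it is below; `fake_synthesize` is a stand-in and not a MOSS-TTS call, and the 24 kHz sample rate is an arbitrary assumption, so swap in your real TTS function and its actual output rate.

```python
import time
from typing import Callable, Sequence


def real_time_factor(
    synthesize: Callable[[str], Sequence[float]],
    text: str,
    sample_rate: int,
) -> float:
    """Return wall-clock synthesis time divided by generated audio duration."""
    start = time.perf_counter()
    samples = synthesize(text)
    elapsed = time.perf_counter() - start
    audio_seconds = len(samples) / sample_rate
    if audio_seconds == 0:
        raise ValueError("synthesize() returned no audio")
    return elapsed / audio_seconds


def fake_synthesize(text: str) -> list[float]:
    """Stand-in for a real TTS call: returns 1 second of silence at 24 kHz."""
    time.sleep(0.05)  # pretend the model took 50 ms
    return [0.0] * 24_000


rtf = real_time_factor(fake_synthesize, "hello", sample_rate=24_000)
# Same decision rule as in the comment above: below 1.0 is faster than real time.
mode = "streaming" if rtf < 1.0 else "async/batch"
print(f"RTF={rtf:.3f} -> {mode}")
```

Measure with the prompt lengths and language tags you actually plan to use, and take several runs after a warm-up, since the first call often includes model loading and kernel compilation.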

Sevealin_@reddit

I've been dying for another TTS that isn't Kokoro that I can run with Home Assistant!

https://github.com/OpenMOSS/MOSS-TTS
