OpenMOSS-Team/MOSS-TTS-v1.5 · Hugging Face
Posted by pmttyji@reddit | LocalLLaMA | 12 comments
MOSS-TTS-v1.5
MOSS-TTS-v1.5 is continued from MOSS-TTS 1.0. It preserves the main 1.0 capabilities, including zero-shot voice cloning, long-form speech generation, token-level duration control, Pinyin/IPA pronunciation control, multilingual synthesis, and code-switching. For the full 1.0 feature walkthrough, input schema, decoding hyperparameters, and evaluation tables, please refer to the MOSS-TTS 1.0 README.
Compared with MOSS-TTS 1.0, v1.5 focuses on the following improvements:
- Stronger multilingual synthesis with language tags: when the language field is omitted, v1.5 may improve some languages and regress slightly on others compared with 1.0. When the language is specified, v1.5 is stronger than 1.0 on almost all supported languages. Set the tag when building the user message, for example processor.build_user_message(text=text_fr, language="French").
- More stable voice cloning: v1.5 improves speaker similarity and reduces cloning variance, making repeated generations more consistent.
- Better long-reference, short-text cloning: v1.5 handles scenarios where the reference audio is much longer than the target text more reliably than 1.0.
- More stable punctuation-following prosody: v1.5 follows punctuation-driven pauses more closely, especially in long sentences.
- Explicit pause control: v1.5 supports inline pause markers such as "[pause 3.2s]". For example, 我今天学习了一首中国的古诗,它的名字是[pause 3.2s]静夜思! ("Today I learned a classical Chinese poem; its title is [pause 3.2s] Quiet Night Thought!") inserts an explicit 3.2s pause before 静夜思.
Supported Languages
MOSS-TTS-v1.5 currently supports 31 languages. It keeps the 20 languages supported by MOSS-TTS 1.0 and extends multilingual continued training to additional languages including Cantonese, Dutch, Finnish, Hindi, Macedonian, Malay, Romanian, Swahili, Tagalog, Thai, and Vietnamese.
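As a rough illustration of the language tag and inline pause marker described above, here is a minimal Python sketch that only prepares the inputs. The keyword names text and language come from the README's processor.build_user_message example; everything else (pause_marker, total_pause_seconds, message_kwargs, and the one-decimal marker format) is a hypothetical helper inferred from the single "[pause 3.2s]" example, not part of the MOSS-TTS API. Whether pause markers behave the same in every language is not stated in the README.

```python
import re
from typing import Optional

# Assumed marker grammar, inferred only from the "[pause 3.2s]" example.
PAUSE_RE = re.compile(r"\[pause (\d+(?:\.\d+)?)s\]")


def pause_marker(seconds: float) -> str:
    """Format an inline pause marker, e.g. 3.2 -> "[pause 3.2s]"."""
    if seconds <= 0:
        raise ValueError("pause length must be positive")
    return f"[pause {seconds:.1f}s]"


def total_pause_seconds(text: str) -> float:
    """Sum the lengths of all pause markers found in the text."""
    return sum(float(value) for value in PAUSE_RE.findall(text))


def message_kwargs(text: str, language: Optional[str] = None) -> dict:
    """Build keyword arguments for processor.build_user_message.

    The language key is left out when no tag is given, which mirrors
    omitting the language field.
    """
    kwargs = {"text": text}
    if language is not None:
        kwargs["language"] = language
    return kwargs


# Hypothetical French input with a 1.5 s pause between two sentences.
text_fr = f"Bonjour.{pause_marker(1.5)}Bienvenue à Paris."
kwargs = message_kwargs(text_fr, language="French")
print(kwargs)
print(total_pause_seconds(text_fr))
```

With a loaded processor, the dictionary would be passed as processor.build_user_message(**kwargs). Keeping the language argument explicit matches the README's note that v1.5 is strongest when the language is specified.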
They released an additional model as well.
Top_Training5738@reddit
The pause control and multilingual stuff honestly sound more interesting than the voice cloning itself. Most open TTS models can fake a voice now, but natural pacing and multilingual consistency are still where things usually fall apart.
Also supporting 31 languages locally is kind of wild if the quality is actually decent. Open source TTS is moving insanely fast right now. We went from robotic audiobook voices to “wait was that generated?” in like two years.
kevinlch@reddit
is the voice cloning better than omnivoice?
Borkato@reddit
And can it moan?
HareMayor@reddit
Have you found a way to have omnivoice output more expressions?
Its outputs are very monotonous, even with tags.
alecKarfonta@reddit
Really love the MOSS models, but has anyone been able to get them to run in real time? Not sure if I am doing something wrong, but even the streaming model cannot achieve real-time speed on a 5090. Am I the only one having this problem?
ilintar@reddit
That's weird. Have you tried running it with my GGML-based code (it can run a server) and testing performance? I would be really stunned if a 5090 couldn't run the small (streaming) model at real-time speed.
ilintar@reddit
Nice! Since, as I understand it, it's the same arch, https://github.com/pwilkin/openmoss should work out of the box.
seamonn@reddit
Got a Docker?
pmttyji@reddit (OP)
I was about to tag you for this :)
jake_that_dude@reddit
for anyone trying this in Home Assistant or a voice agent, measure RTF before you wire it into the loop.
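RTF (real-time factor) is the time spent synthesizing divided by the duration of the audio produced, so values below 1.0 mean faster than real time. A minimal Python sketch of one way to measure it follows. The names real_time_factor, measure_rtf, and fake_synthesize are hypothetical, the 24000 Hz sample rate is only a placeholder, and fake_synthesize stands in for whatever call actually runs the model on your setup.

```python
import time
from typing import Callable, Sequence


def real_time_factor(elapsed_seconds: float, num_samples: int, sample_rate: int) -> float:
    """RTF = synthesis time / duration of the audio produced."""
    if num_samples <= 0 or sample_rate <= 0:
        raise ValueError("need a positive sample count and sample rate")
    audio_seconds = num_samples / sample_rate
    return elapsed_seconds / audio_seconds


def measure_rtf(synthesize: Callable[[str], Sequence[float]], text: str, sample_rate: int) -> float:
    """Time one synthesis call and return its real-time factor."""
    start = time.perf_counter()
    samples = synthesize(text)
    elapsed = time.perf_counter() - start
    return real_time_factor(elapsed, len(samples), sample_rate)


def fake_synthesize(text: str) -> list:
    """Stand-in for a real TTS call: returns 0.1 s of silence per character."""
    return [0.0] * (2400 * len(text))


# Placeholder sample rate; use the rate your model actually outputs.
rtf = measure_rtf(fake_synthesize, "hello there", sample_rate=24000)
label = "faster than real time" if rtf < 1.0 else "slower than real time"
print(f"RTF = {rtf:.4f} ({label})")
```

For a fair number, run a warm-up call first and average over several prompts of the length you expect in production, since the first call often includes model loading or compilation time.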
language tags matter here, and so does prompt length. if RTF is >1.0, keep it async for announcements/batch TTS. if it is <1.0 on your target GPU, then it is worth building the streaming path.
Sevealin_@reddit
I've been dying for another TTS that isn't Kokoro that I can run with Home Assistant!
pmttyji@reddit (OP)
https://github.com/OpenMOSS/MOSS-TTS