MOSS-TTS 1.5 8B examples. It is currently the best voice cloning model for English as of June 2026
Posted by 9r4n4y@reddit | LocalLLaMA | 49 comments
MOSS-TTS 1.5 8B is better at voice cloning than Fish Audio S2 Pro and Qwen3 TTS.
FinBenton@reddit
I tested the online demo with a bunch of my clips. For some generations it was a lot worse than OmniVoice, which I normally use; for some it was similar, idk. Not switching, at least not yet. I think Omni generally sounds more natural to me.
LeonidasTMT@reddit
How big is omnivoice compared to moss?
VoidAlchemy@reddit
i'm on omni right now too. my local pi agent has custom SKILL for it and is great for doing research to custom mastered podcast mp3. it has a few hiccups but i appreciate the speed control knob so it doesn't talk way too fast.
pocket-tts and kokoro are nice if you need CPU inference too so i keep those old SKILLs around lol
9r4n4y@reddit (OP)
I have also used Omni, and MOSS is way better than Omni. Idk what's going wrong in ur scenario.
martinerous@reddit
Thanks for bringing this one up, I somehow missed it. It even speaks Latvian out of the box, which is amazing! How could they squeeze so many languages into such a fast model? And I spent a week finetuning VoxCPM 1.5 to speak Latvian. Now my efforts are useless, haha.
HockeyDadNinja@reddit
Does anyone know if this supports realtime streaming?
llamabott@reddit
There's a number of models under the MOSS-TTS umbrella, including MOSS-TTS-Realtime.
toolman10@reddit
The high-quality 8B v1.5 doesn't support streaming (not that I can find, anyway); only the smaller MOSS-TTS-Realtime does. I tested 8B 1.5 last week, and for real-time voice applications it takes too long because it has to send the entire audio file. I just tested TTS-Realtime and it does emit chunks for realtime streaming; however, the quality is pretty bad. For me, it wasn't usable because it was too slow (about 2X speed), and I didn't want to jump through hoops to create some special environment when Qwen3 TTS is already working perfectly (and fast) for me.
9r4n4y@reddit (OP)
Sry idk. Try reading the model paper
SurpriseOk6927@reddit
voice cloning quality keeps surprising me. moss 1.5 being better than fish audio is a strong claim but the demos look legit. curious how it handles non-english languages since most tts models are still shaky there
Devatator_@reddit
I cloned Glados yesterday on their demo space and it worked well. Sadly you don't have much control over the output aside from pauses and language markers
CheatCodesOfLife@reddit
You're talking to another bloody spam bot.
9r4n4y@reddit (OP)
Huh how u know its spam??
CheatCodesOfLife@reddit
Here you go, one of its earlier posts:
https://files.catbox.moe/1kr2xe.png
Translation from French (by Claude):
""" The French comment in "Can you sell me your SaaS?" reads:
"I'm 19, I build solo in France. My product doesn't have a name yet but it does a thing that 90% of SaaS ignore: it scrapes Reddit/X to find qualified leads and contacts them automatically with messages that DON'T look like spam. 23 early adopters, 0 ads. The real pitch? Come see the numbers live rather than a deck." """
Annoying thing is, these seem to work on a lot of people. I've been sending them to random non-tech-savvy mates recently and asking "is this a bot or a human?"
Now I get "Looks human to me, but I know it's a bot because you keep sending them to me" lol
CheatCodesOfLife@reddit
This all-lowercase, no-em-dash style is what they're doing right now. But you can see the usual Claude/Kimi-slop structure. For example, the engagement-baiting last sentence, usually starting with "curious" or "interested".
Click its profile and search for the current bot-spam prompted terms like "ngl" or starting with "lmao". This is how Kimi-K2.6 and the older GLM-4.6 write when you prompt them to talk like a low-effort reddit post.
Here's another bot (looks like Claude with the same prompt) that was building up its karma last week: https://old.reddit.com/user/techlatest_net?count=25&after=t1_on1l9ue
And now it's dropping garbage github repos everywhere: https://old.reddit.com/user/techlatest_net
JackStrawWitchita@reddit
I'm still struggling to find any voice-cloning TTS that can seriously compete with Chatterbox. This Moss-TTS needs a lot of horsepower (and patience) to match the quality of Chatterbox running on a potato.
ApatheticWrath@reddit
omnivoice
taking_bullet@reddit
I ditched Chatterbox. Now KugelAudio 2 (based on VibeVoice) is my new friend.
NordRanger@reddit
I just tried KugelAudio for German with the ComfyUI nodes but it’s eating the end of sentences and the speech is so god damn fast that it’s barely usable.
Surely you don’t have these problems?
taking_bullet@reddit
I do. Add another random word at the end of the whole sentence, then edit the file in Audacity and cut out the last second.
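If you don't want to open Audacity every time, the same cut can be scripted. This is only a minimal sketch with Python's standard `wave` module, assuming the output is an uncompressed PCM WAV file; the function name `trim_tail` and the file names in the usage note are made up for illustration and are not part of KugelAudio or ComfyUI.

```python
import wave


def trim_tail(src_path: str, dst_path: str, seconds: float = 1.0) -> int:
    """Write src_path to dst_path without its last `seconds`; return frames kept."""
    with wave.open(src_path, "rb") as src:
        params = src.getparams()
        total_frames = src.getnframes()
        # Number of frames that make up the tail we want to drop.
        tail_frames = int(seconds * src.getframerate())
        keep = max(0, total_frames - tail_frames)
        frames = src.readframes(keep)

    with wave.open(dst_path, "wb") as dst:
        # Same channels, sample width and rate as the source.
        dst.setparams(params)
        # The wave module fixes up the frame count in the header on close.
        dst.writeframes(frames)
    return keep
```

Usage would look like `trim_tail("generated.wav", "generated_trimmed.wav")`. One second is just the value from the comment above; the padding word may take more or less time than that, so listen to the result before batch-processing anything.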
NordRanger@reddit
Damn. And what about pacing?
martinerous@reddit
I went for VoxCPM, which turned out to be quite easy to finetune to Latvian using the Mozilla Common Voice dataset. VoxCPM recently also released a new version. It has some quirks (tends to get metallic with longer sentences) but it's fast and in my tests it was more stable than Chatterbox.
JackStrawWitchita@reddit
HF says you need 19GB+ of VRAM to run KugelAudio locally? WTF? Is that true?
Chatterbox gets excellent results when run *without a GPU*.
taking_bullet@reddit
Maybe in English, but not in other languages.
Indeed. Enable the 4-bit quant model if you don't have 20GB VRAM.
JackStrawWitchita@reddit
What if you don't have a GPU at all? Chatterbox runs fine without a GPU. Where can I d/l and test the version of KugelAudio that works better than Chatterbox without a GPU?
9r4n4y@reddit (OP)
Longcat DiT is a 3.5B model that does better voice cloning than Chatterbox.
JackStrawWitchita@reddit
You need serious VRAM to make Longcat run whereas Chatterbox works extremely well without a GPU.
ArtfulGenie69@reddit
I like this test a lot more than the last one. All sorts of accents cloned. We are getting to the place in voice where the cloning is so good it's hard to tell what's better. I'll have to try with this one and compare to fish because it sounds pretty spot on.
ares0027@reddit
I tried for hours and couldn't run it on Windows :(
9r4n4y@reddit (OP)
Try the ComfyUI workflow for the MOSS-TTS 1.0 version and just replace the model. Or search on Pinokio.
ares0027@reddit
i tried the latest one and it failed, pinokio didnt work properly either but ill give 1.0 workflow a shot. thanks
Crinkez@reddit
Kinda pointless for most people. I have a 12GB gpu which I'm guessing is higher than the average person (outside of this sub) which still isn't enough to run this model.
And quants... idk if I'd trust their reliability.
9r4n4y@reddit (OP)
This was about the most powerful model. But let me tell you, just below it is Longcat DiT 3.5B, which is also very powerful at voice cloning.
Due-Hearing-5557@reddit
How does it compare to OmniVoice?
9r4n4y@reddit (OP)
Wayy better than omni
silenceimpaired@reddit
I need to try this… someone always wants a comparison
Wild24@reddit
How do I install the 8B version? Will it run on 12 GB VRAM with 64 GB RAM?
9r4n4y@reddit (OP)
Yes, just use the fp8 version. Give an AI the link to the repo and ask it to find any ComfyUI workflow for it. Or just ask the AI to give you a step-by-step setup plan. You can also use Pinokio here.
thrownawaymane@reddit
Why is your profile hidden?
9r4n4y@reddit (OP)
None of ur business 🤗
Crinkez@reddit
It's always a bit suspicious when they're hidden. Anyway I did a bit of digging and found his full post history, but nothing stood out as bad.
9r4n4y@reddit (OP)
😭 thx for telling that
thirteen-bit@reddit
https://github.com/pwilkin/openmoss + Q4_K_M quantization of OpenMOSS-Team/MOSS-TTS-v1.5 takes ca. 12 GB of VRAM (11393 MiB shown by nvidia-smi on an RTX 3090):
./bin/moss-tts-server --host 127.0.0.1 --port 8080 --no-webui --model ./models/moss-tts-1.5-q4km.gguf
Approximately the size of the models + some space for kv cache:
3.9G moss-tts-1.5-q4km.extras.gguf
5.6G moss-tts-1.5-q4km.gguf
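To put a number on "some space for kv cache", here is a rough back-of-the-envelope check using only the figures quoted above. It assumes the `G` in the file listing means GiB (as `ls -lh` style output usually does) and treats everything beyond the two files as overhead without trying to split it further; the function name is just for illustration.

```python
MIB_PER_GIB = 1024


def vram_overhead_gib(used_mib: float, file_sizes_gib: list[float]) -> float:
    """VRAM in use minus the on-disk size of the model files, in GiB."""
    return used_mib / MIB_PER_GIB - sum(file_sizes_gib)


# Figures from the comment above (assumption: "G" means GiB).
used_mib = 11393          # reported by nvidia-smi
files_gib = [3.9, 5.6]    # extras.gguf + main gguf

print(f"in use:   {used_mib / MIB_PER_GIB:.1f} GiB")                  # in use:   11.1 GiB
print(f"weights:  {sum(files_gib):.1f} GiB")                          # weights:  9.5 GiB
print(f"overhead: {vram_overhead_gib(used_mib, files_gib):.1f} GiB")  # overhead: 1.6 GiB
```

So roughly 1.6 GiB sits on top of the model files in this one measurement. That would be a very tight fit on a 12 GB card, and the overhead will likely move with context length and whatever else is using the GPU.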
brahh85@reddit
i want to try this ggml implementation, but i barely have time https://github.com/pwilkin/openmoss
sanjxz54@reddit
Can it replace a voice and sing? For example, clone a voice first, then swap it into a vocals-only singing track.
9r4n4y@reddit (OP)
No, this model is not for singing. This will not work for what you want.
And-Bee@reddit
Very good. Will try it out
9r4n4y@reddit (OP)
:) Let me tell you, if you don't have a good GPU, there is a Hugging Face Space which is currently running on a GPU. That means you can run it as much as you want. So try it before it goes away.
https://huggingface.co/spaces/OpenMOSS-Team/MOSS-TTS-v1.5
OkAssistance7886@reddit
The examples sound pretty impressive. For voice models, I feel like the hard part is not just cloning quality, but showing clean comparisons with the same script, same noise level, and same emotion. I'd probably test the samples with Audacity, record a quick demo flow in OBS, and use Runable to mock up a simple comparison page so people can judge the outputs side by side.