Moss TTS 1.5 8B examples. It is currently the best voice cloning model for English, as of June 2026

[-]

FinBenton@reddit

I tested the online demo with a bunch of my clips. For some generations it was a lot worse than OmniVoice, which I normally use; for others it was similar, idk. Not switching, at least not yet. I think Omni generally sounds more natural to me.

[-]

LeonidasTMT@reddit

How big is omnivoice compared to moss?

[-]

VoidAlchemy@reddit

i'm on omni right now too. my local pi agent has a custom SKILL for it, and it's great for going from research to a custom-mastered podcast mp3. it has a few hiccups but i appreciate the speed control knob so it doesn't talk way too fast.

pocket-tts and kokoro are nice if you need CPU inference too so i keep those old SKILLs around lol

[-]

9r4n4y@reddit (OP)

I have also used Omni, and Moss is way better. Idk what's going wrong in your scenario.

[-]

martinerous@reddit

Thanks for bringing this one up, I somehow missed it. It even speaks Latvian out of the box, which is amazing! How did they squeeze so many languages into such a fast model? And I spent a week finetuning VoxCPM 1.5 to speak Latvian. Now my efforts are useless, haha.

[-]

HockeyDadNinja@reddit

Does anyone know if this supports realtime streaming?

[-]

llamabott@reddit

There are a number of models under the MOSS-TTS umbrella, including MOSS-TTS-Realtime.

[-]

toolman10@reddit

The high-quality 8B v1.5 doesn't support streaming (not that I can find, anyway); only the smaller MOSS-TTS-Realtime does. I tested 8B 1.5 last week, and for real-time voice applications it takes too long because it has to send the entire audio file. I just tested TTS-Realtime and it does emit chunks for real-time streaming; however, the quality is pretty bad. For me, it wasn't usable because it was too slow (about 2X speed) and I didn't want to jump through hoops to create some special environment when Qwen3 TTS is already working perfectly (and fast) for me.
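A rough way to see why whole-file delivery hurts in real-time use is the sketch below. It is a toy model, not a measurement of any MOSS model: `first_audio_latency` is a hypothetical helper, and it assumes a constant real-time factor (`rtf`, generation time divided by audio duration) and ignores network and startup overhead.

```python
from typing import Optional


def first_audio_latency(audio_seconds: float, rtf: float,
                        chunk_seconds: Optional[float] = None) -> float:
    """Seconds of generation before the listener can hear anything.

    rtf is generation time / audio duration, assumed constant (an assumption).
    chunk_seconds=None models whole-file delivery; otherwise audio is
    streamed in chunks of that length and playback starts after the first one.
    """
    if chunk_seconds is None:
        # Whole file: nothing is playable until everything is generated.
        return audio_seconds * rtf
    # Streaming: only the first chunk has to be generated before playback.
    return min(chunk_seconds, audio_seconds) * rtf


# Illustrative numbers only: a 10 s reply generated at rtf = 0.5.
whole_file = first_audio_latency(10.0, 0.5)
streamed = first_audio_latency(10.0, 0.5, chunk_seconds=0.5)
print(f"whole file: {whole_file:.2f} s, streamed: {streamed:.2f} s")
```

In this model, streaming can only keep playback gap-free when `rtf` stays at or below 1, so chunked output by itself does not fix a model that generates slower than real time.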

[-]

9r4n4y@reddit (OP)

Sry idk. Try reading the model paper

[-]

SurpriseOk6927@reddit

voice cloning quality keeps surprising me. moss 1.5 being better than fish audio is a strong claim but the demos look legit. curious how it handles non-english languages since most tts models are still shaky there

[-]

Devatator_@reddit

I cloned Glados yesterday on their demo space and it worked well. Sadly you don't have much control over the output aside from pauses and language markers

[-]

CheatCodesOfLife@reddit

You're talking to another bloody spam bot.

[-]

9r4n4y@reddit (OP)

Huh, how do you know it's spam??

[-]

CheatCodesOfLife@reddit

Here you go, one of its earlier posts:

https://files.catbox.moe/1kr2xe.png

Translation from French (by Claude):

""" The French comment in "Can you sell me your SaaS?" reads:

"I'm 19, I build solo in France. My product doesn't have a name yet but it does a thing that 90% of SaaS ignore: it scrapes Reddit/X to find qualified leads and contacts them automatically with messages that DON'T look like spam. 23 early adopters, 0 ads. The real pitch? Come see the numbers live rather than a deck." """

Annoying thing is, these seem to work on a lot of people. I've been sending them to random non-tech-savvy mates recently and asking "is this a bot or a human?"

Now I get "Looks human to me, but I know it's a bot because you keep sending them to me" lol

[-]

CheatCodesOfLife@reddit

This all lower-case no em-dash style is what they're doing right now. But you can see the usual Claude/Kimi-slop structure. For example the engagement baiting last sentence, usually starting with "curious" or "interested":

> curious how it handles ...

Click its profile and search for the current bot-spam prompted terms like "ngl" or starting with "lmao". This is how Kimi-K2.6 and the older GLM-4.6 write when you prompt them to talk like a low-effort reddit post.

Here's another bot (looks like Claude with the same prompt) that was building up its karma last week: https://old.reddit.com/user/techlatest_net?count=25&after=t1_on1l9ue

And now it's dropping garbage github repos everywhere: https://old.reddit.com/user/techlatest_net

[-]

JackStrawWitchita@reddit

I'm still struggling to find any voice-cloning TTS that can seriously compete with Chatterbox. This Moss-TTS needs a lot of horsepower (and patience) to match the quality of Chatterbox running on a potato.

[-]

ApatheticWrath@reddit

omnivoice

[-]

taking_bullet@reddit

I ditched Chatterbox. Now KugelAudio 2 (based on VibeVoice) is my new friend.

[-]

NordRanger@reddit

I just tried KugelAudio for German with the ComfyUI nodes, but it's eating the ends of sentences and the speech is so god damn fast that it's barely usable.

Surely you don’t have these problems?

[-]

taking_bullet@reddit

> Surely you don’t have these problems?

I do. Add another random word at the end of the whole sentence, then edit the file in Audacity and cut out the last second.
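For anyone who would rather script that trim than open Audacity each time, here is a minimal sketch using only Python's standard `wave` module. `trim_tail` is a hypothetical helper name, it assumes the output is an uncompressed PCM WAV file, and the one-second default just mirrors the workaround above; the right amount depends on how long the throwaway word is.

```python
import wave


def trim_tail(src_path: str, dst_path: str, seconds: float = 1.0) -> int:
    """Copy a PCM WAV file without its last `seconds` of audio.

    Returns the number of frames kept.
    """
    with wave.open(src_path, "rb") as src:
        params = src.getparams()
        total_frames = src.getnframes()
        drop_frames = int(seconds * src.getframerate())
        keep_frames = max(0, total_frames - drop_frames)
        frames = src.readframes(keep_frames)

    with wave.open(dst_path, "wb") as dst:
        # Reuse the source format; the frame count in the header is
        # corrected automatically when the file is closed.
        dst.setparams(params)
        dst.writeframes(frames)
    return keep_frames
```

Usage would look like `trim_tail("generated.wav", "trimmed.wav")`, where both file names are placeholders.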

[-]

NordRanger@reddit

Damn. And what about pacing?

[-]

martinerous@reddit

I went for VoxCPM; it turned out to be quite easy to finetune for Latvian using the Mozilla Common Voice dataset. VoxCPM recently also released a new version. It has some quirks (tends to get metallic with longer sentences) but it's fast, and in my tests it was more stable than Chatterbox.

[-]

JackStrawWitchita@reddit

HF says you need 19GB+ of VRAM to run KugelAudio locally? WTF? Is that true?

Chatterbox gets excellent results run *without a GPU*.

[-]

taking_bullet@reddit

> Chatterbox gets excellent results run without a GPU.

Maybe in English, but not in other languages.

> HF says you need 19GB+ of VRAM to run KugelAudio locally? WTF? Is that true?

Indeed. Enable the 4-bit quant model if you don't have 20GB VRAM.

[-]

JackStrawWitchita@reddit

What if you don't have a GPU at all? Chatterbox runs fine without a GPU. Where can I d/l and test the version of KugelAudio that works better than Chatterbox without a GPU?

[-]

9r4n4y@reddit (OP)

Longcat dit is a 3.5B model that does better voice cloning than Chatterbox.

[-]

JackStrawWitchita@reddit

You need serious VRAM to make Longcat run whereas Chatterbox works extremely well without a GPU.

[-]

ArtfulGenie69@reddit

I like this test a lot more than the last one. All sorts of accents cloned. We are getting to the place in voice where the cloning is so good it's hard to tell what's better. I'll have to try with this one and compare to fish because it sounds pretty spot on.

[-]

ares0027@reddit

I tried for hours and couldn't get it to run on Windows :(

[-]

9r4n4y@reddit (OP)

Try using the Moss TTS 1.0 ComfyUI workflow and just replace the model. Or search for it on Pinokio.

[-]

ares0027@reddit

i tried the latest one and it failed, pinokio didn't work properly either but i'll give the 1.0 workflow a shot. thanks

[-]

Crinkez@reddit

Kinda pointless for most people. I have a 12GB GPU, which I'm guessing is more than the average person has (outside of this sub), and it still isn't enough to run this model.

And quants... idk if I'd trust their reliability.

[-]

9r4n4y@reddit (OP)

This was about the most powerful model. But let me tell you, just below it is Longcat dit 3.5; it's also very powerful at voice cloning.

[-]

Due-Hearing-5557@reddit

How does it compare to OmniVoice?

[-]

9r4n4y@reddit (OP)

Wayy better than omni

[-]

silenceimpaired@reddit

I need to try this… someone always wants a comparison

[-]

Wild24@reddit

How do I install the 8B version? Will it run on 12 GB VRAM with 64 GB RAM?

[-]

9r4n4y@reddit (OP)

Yes, just use the fp8 version. Give an AI the link to the repo and ask it to find a ComfyUI workflow for it. Or just ask the AI to give you a step-by-step setup plan. You can also use Pinokio here.
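A back-of-the-envelope check on the fp8 suggestion, as a sketch only. It assumes "8B" means roughly 8 billion parameters and counts the main weights alone; the audio codec, activations, KV cache, and framework overhead all come on top, so this does not show that the model actually fits in 12 GB. `weight_gib` is a hypothetical helper.

```python
def weight_gib(params_billion: float, bits_per_param: float) -> float:
    """Approximate size of the raw weights in GiB (weights only)."""
    total_bytes = params_billion * 1e9 * bits_per_param / 8
    return total_bytes / 2**30


# Assumption: 8 billion parameters, uniform precision across all weights.
for name, bits in [("fp16", 16), ("fp8", 8)]:
    print(f"{name}: ~{weight_gib(8, bits):.1f} GiB of weights")
```

Under these assumptions the fp16 weights alone would exceed a 12 GB card, while fp8 would leave a few GiB for everything else.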

[-]

thrownawaymane@reddit

Why is your profile hidden?

[-]

9r4n4y@reddit (OP)

None of ur business 🤗

[-]

Crinkez@reddit

It's always a bit suspicious when they're hidden. Anyway I did a bit of digging and found his full post history, but nothing stood out as bad.

[-]

9r4n4y@reddit (OP)

😭 thx for telling that

[-]

thirteen-bit@reddit

https://github.com/pwilkin/openmoss + Q4_K_M quantization of OpenMOSS-Team/MOSS-TTS-v1.5 takes ca. 12 GB of VRAM (11393MiB shown by nvidia-smi on RTX3090):

./bin/moss-tts-server --host 127.0.0.1 --port 8080 --no-webui --model ./models/moss-tts-1.5-q4km.gguf

Approximately the size of the models + some space for kv cache:

3.9G moss-tts-1.5-q4km.extras.gguf

5.6G moss-tts-1.5-q4km.gguf
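As a sanity check on those numbers, here is a quick sketch of the arithmetic. It assumes the `3.9G` and `5.6G` file sizes are in GiB (the listing does not say which tool printed them) and that whatever is left over is KV cache plus runtime overhead:

```python
MIB_PER_GIB = 1024

extras_gib = 3.9  # moss-tts-1.5-q4km.extras.gguf, from the listing above
model_gib = 5.6   # moss-tts-1.5-q4km.gguf, from the listing above
used_mib = 11393  # nvidia-smi reading quoted above

weights_gib = extras_gib + model_gib
used_gib = used_mib / MIB_PER_GIB
overhead_gib = used_gib - weights_gib  # KV cache and everything else

print(f"model files: ~{weights_gib:.1f} GiB")
print(f"VRAM in use: ~{used_gib:.1f} GiB")
print(f"left over:   ~{overhead_gib:.1f} GiB")
```

So the two files account for most of the reported usage, which lines up with the "models + some space for kv cache" description.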

[-]

brahh85@reddit

i want to try this ggml implementation, but i barely have time https://github.com/pwilkin/openmoss

[-]

sanjxz54@reddit

Can it replace a voice and sing? Clone first, then swap it onto a vocal-only singing track, for example.

[-]

9r4n4y@reddit (OP)

No, this model is not for singing. This will not work for what you want.

[-]

And-Bee@reddit

Very good. Will try it out

[-]

9r4n4y@reddit (OP)

:) Let me tell you, if you don't have a good GPU, there is a Hugging Face Space that is currently running on a GPU, which means you can run it as much as you want. So try it before it goes away.

https://huggingface.co/spaces/OpenMOSS-Team/MOSS-TTS-v1.5

[-]

OkAssistance7886@reddit

The examples sound pretty impressive. For voice models, I feel like the hard part is not just cloning quality, but showing clean comparisons with the same script, same noise level, and same emotion. I'd probably test the samples with Audacity, record a quick demo flow in OBS, and use Runable to mock up a simple comparison page so people can judge the outputs side by side.