Qwen3 TTS is seriously underrated - I got it running locally in real-time and it's one of the most expressive open TTS models I've tried
Posted by fagenorn@reddit | LocalLLaMA | View on Reddit | 74 comments
Heya guys and gals,
Around a year ago I released and posted about Persona Engine as a fun side project, trying to get the whole ASR -> LLM -> TTS pipeline going fully locally while having a realtime avatar that is lip-synced (think VTuber). I was able to achieve this and was super happy with the result, but the TTS for me was definitely lacking, since I was using Sesame at the time as reference. After that I took a long break.
A week or two ago, I thought to give the project a refresh, and also wanted to see how far we have come with local models, and boy was I pleasantly surprised with Qwen3 TTS. During my initial tests it was lacking, especially the version published by the Qwen team themselves, but after digging around and experimenting a lot I was able to:
- Make streaming with the model work reliably. The architecture of the model is perfect for this, since the decoder uses a sliding window, which means if you stream the LLM response, that's completely fine and the TTS will keep coherent prosody, pitch, and intonation.
- Get the model working with llama.cpp, because I am using C# and speed is important, and I also quantized it.
- The model was lacking word-level timings and phonemes which Kokoro (the previous, more robotic sounding TTS) had. So I had to implement CTC word-level alignment to be able to know when certain words are spoken (important for subtitles + getting phonemes to have the lips move correctly).
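OP's alignment runs in C# via ONNX, but to illustrate the idea of forced alignment for word timings, here's a simplified pure-NumPy Viterbi sketch: unlike real CTC it has no blank states, and the toy shapes and frame duration are assumptions, not anything from the actual engine.

```python
import numpy as np

def align_tokens(log_probs, tokens, frame_ms=20.0):
    """Viterbi forced alignment (simplified: no CTC blanks).
    Each frame is assigned to exactly one token, tokens appear in
    order, and each token covers >= 1 frame (requires T >= N).
    Returns a (start_ms, end_ms) span per token."""
    T, _ = log_probs.shape
    N = len(tokens)
    score = np.full((T, N), -np.inf)
    back = np.zeros((T, N), dtype=int)  # 0 = stayed on token, 1 = advanced
    score[0, 0] = log_probs[0, tokens[0]]
    for t in range(1, T):
        for n in range(min(t + 1, N)):
            stay = score[t - 1, n]
            move = score[t - 1, n - 1] if n > 0 else -np.inf
            if move > stay:
                score[t, n] = move + log_probs[t, tokens[n]]
                back[t, n] = 1
            else:
                score[t, n] = stay + log_probs[t, tokens[n]]
    # Backtrack from the last frame / last token to recover spans.
    spans = [[T - 1, T - 1]]
    n = N - 1
    for t in range(T - 1, 0, -1):
        if back[t, n]:
            n -= 1
            spans.append([t - 1, t - 1])  # previous token ended here
        else:
            spans[-1][0] = t - 1          # extend current token leftwards
    spans.reverse()
    return [(s * frame_ms, (e + 1) * frame_ms) for s, e in spans]
```

The same DP shape carries over to full CTC alignment; the real version interleaves blank states and consumes the acoustic model's frame-level log-probs.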
Once this was all done, I also decided to finetune my own Qwen3-TTS voice. The cloning capabilities are really cool, but cloned voices lack contextual understanding and struggle with pronunciation. Additionally, the custom trained voices provided by the Qwen team didn't include any native female speakers, and I didn't want to create a new Live2D model.
In the end, the finetune blew me away, and I will probably continue improving it.
GitHub is here: https://github.com/fagenorn/handcrafted-persona-engine
Check it out, have fun, and let me know whatever crazy stuff you decide to do with it.
wopenutz@reddit
That 97ms first-packet latency on a 4GB GPU is genuinely impressive and not something I expected from a local model at this stage. Most people assume you need serious hardware to get real-time TTS that actually sounds good, so this kind of result on consumer-grade specs is a big deal.
lorddumpy@reddit
voxCPM2/echoTTS blows it out of the water IMO.
DefNattyBoii@reddit
Thanks, I checked out echoTTS and voxCPM2; they are both cool! The echoTTS GitHub seems a little stale (last commit 4 months ago). How do they compare to Omnivoice?
lorddumpy@reddit
Bigger parameter count, so more VRAM, but I've heard it's better quality. I personally haven't integrated Omnivoice, so I can't say for sure.
CheatCodesOfLife@reddit
echoTTS simply blows everything out of the water no question, see for yourself:
https://huggingface.co/spaces/jordand/echo-tts-preview
EmuMammoth6627@reddit
Echo is one I don't see talked about enough, it's really good.
Classic-Ad-5129@reddit
It is so good :'( . But it consumes too much VRAM. Gemma already takes 13 GB of my 16 GB.
human_bean_@reddit
How do you keep the generated speech emotionally and tonally consistent when concatenating subsequent clips?
Truth-Does-Not-Exist@reddit
WOAH
bitslizer@reddit
Nice! Is persona engine feeding those [emotion emoji] tags straight to qwen3? Are you using faster-qwen3-tts to get that speed?
fagenorn@reddit (OP)
Ty <3
No, the emotion tags are used for driving the avatar expressions. But something that would be really cool to have in the future is to provide the emoji as instructions to the Qwen3 TTS model. It is possible to finetune it to support that, but right now my dataset isn't big enough, and it would require a lot of manual work going through the dataset and tagging the audio clips with emoji.
As for the speed achieved: this is a fully custom solution built in C# using a combination of ONNX and llama.cpp.
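The "emojis drive the avatar, TTS never sees them" split described above can be sketched in a few lines. This is a hypothetical illustration in Python, not the engine's C# code, and the emoji-to-expression mapping here is made up:

```python
import re

# Hypothetical mapping from emotion emoji in the LLM output to
# avatar expression names (the actual tag set is OP's, not this).
EMOTION_TAGS = {"😊": "happy", "😢": "sad", "😠": "angry"}
TAG_RE = re.compile("|".join(map(re.escape, EMOTION_TAGS)))

def split_stream(chunk):
    """Split one streamed LLM chunk into (tts_text, emotions):
    the emoji are collected as expression events for the avatar,
    and stripped from the text that goes on to the TTS model."""
    emotions = [EMOTION_TAGS[m] for m in TAG_RE.findall(chunk)]
    return TAG_RE.sub("", chunk), emotions
```

Running this per streamed chunk keeps the TTS input clean while the expression events fire in roughly the right spot in the utterance.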
MustBeSomethingThere@reddit
You could make a separate repo for your custom Qwen3 TTS engine. I would be very interested in it.
fagenorn@reddit (OP)
Check this folder in the repo; it's almost fully self-contained, well structured, and quite well documented. The most important files are Qwen3TtsGgufEngine.cs and LlamaTtsContext.cs.
Something I am sad about is that the streaming performance could actually be much better than what is being demonstrated, but because of CTC I have to buffer a bit of audio to get accurate word timings.
Would love for future models to output word boundary tokens so I don't have to rely on an additional model for word level timings.
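The buffering tradeoff described above (hold back audio so the aligner has right-context, at the cost of latency) can be sketched as a simple lookahead buffer. This is an illustrative Python sketch, not the engine's implementation, and the frame counts are made-up numbers:

```python
from collections import deque

class AlignmentBuffer:
    """Delay emission of audio frames by `lookahead` frames so the
    aligner sees enough future context before timings are finalized.
    E.g. 10 frames * 20 ms/frame = 200 ms of added latency."""

    def __init__(self, lookahead=10):
        self.lookahead = lookahead
        self.frames = deque()

    def push(self, frame):
        """Add one frame; return the oldest frame once it is safe
        to emit (its right-context is complete), else None."""
        self.frames.append(frame)
        if len(self.frames) > self.lookahead:
            return self.frames.popleft()
        return None
```

A model that emitted word-boundary tokens inline would let the lookahead shrink to zero, which is exactly the wish above.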
DataPhreak@reddit
Which Qwen TTS model are you using here?
overand@reddit
The ability to use emotion tags is the main reason I want to try to get Fish Audio S2 Pro working well, but I haven't had a ton of luck with it. Last I knew, it was the only thing that supported tags like that?
bitslizer@reddit
Gemini 2.5 TTS works amazingly with tags, but it's not local... The new Gemini 3 TTS seems a step backward, but they may still tweak it like they did with 2.5.
ArtfulGenie69@reddit
You may not have the hardware for real-time with Fish S2, but it handles those tags for expression. Maybe at 4-bit it would be real-time on a 3090. It's better by a lot in all sorts of ways, especially voice cloning. Omnivoice is the new fast one, for an extra option.
Adventurous-Paper566@reddit
I tried Qwen3 TTS and it was slow, what is your GPU?
Atom_101@reddit
Out of the box, yes, because it needs 16 iterations of the subtalker RVQ transformer per frame. If you compile the RVQ transformer you get huge speedups and real-time speeds.
skinnyjoints@reddit
What does compile mean in this context?
nmkd@reddit
torch.compile presumably
car_lower_x@reddit
It’s super fast honestly.
constarx@reddit
On a 5090, no doubt! That's highly unrepresentative of what the general populace is packing, though. I tried it on a 4070S, and while it is fast, it still takes 2-3 seconds, and that 2-3 seconds really hurts immersion.
ProtoAMP@reddit
Have you tried faster-qwen3-tts? Average speedup seems to be >5x which will make the experience less painful.
constarx@reddit
Thanks, I'll check it out. I've actually been using chatterbox-turbo, and voicebox for voice cloning. But it seems like qwen3-tts is better now, you think?
andy2na@reddit
I was using Chatterbox but it had weird issues. Just switched to Omnispeak and so far it's better. Tested out faster-qwen and it would randomly switch up the voice after a sentence, at least in Home Assistant.
fagenorn@reddit (OP)
I am using a 5090, but honestly Qwen3 TTS only takes up 2 to 3 GB of VRAM at peak with the custom setup I have. Really pushing the model to its limits over here XD
TechnoSmacked@reddit
Hey, I literally made an account just to speak to you. Love the model you've got. I have Qwen 3.6 on an RTX 6000 Pro Blackwell; how do I make it sound like yours and give it the emotional dynamics that you have? I see emotions are written as emojis in the output. Also using Qwen3 TTS.
fagenorn@reddit (OP)
I am not doing anything special for Qwen3 TTS; this is all emergent behaviour from just finetuning the voice on a lot of data (around 1.3 hours of clips). It's able to contextually understand what it is saying and adapts the voice to it.
TechnoSmacked@reddit
Fantastic, what is the prompt for the actual model? She seems sassy, haha. OK, I can manage to finetune on an hour of clips, that's not a problem. Just got to figure out what to use now. Also, I see that you have added emotions through emojis. What's up with that, and how do we trigger those?
fagenorn@reddit (OP)
The prompt can be found on the GitHub
For the finetuning, check the repo I linked here
The emojis are used by the engine to drive the Live2D (not Inochi2d, but similar) avatar, but are stripped before TTS; it doesn't see any of them. Fully driven by the LLM via the prompt.
TechnoSmacked@reddit
Fantastic, last question: what file did you finetune it on? Love the sound, you did a great job!
TechnoSmacked@reddit
Also is that Inochi2d I spot for persona generation? I wish you had this interface for Linux maaaaaan
no-adz@reddit
This is so cool! Does anyone know any similar open source packages which do this on mac, or provide a webpage?
jorlev@reddit
Any tweak to get this to run on Mac? Or is Mac version possible for you?
fagenorn@reddit (OP)
It would require someone to dedicate some time to this. I will say, though, that I made conscious decisions to use cross-platform libs as much as possible, e.g. ONNX, llama.cpp, ImGui.
The biggest work (I think) would be adapting all the Spout2 rendering to use Syphon on Mac.
Specter_Origin@reddit
What we truly need are small but good local SST
PhilippeEiffel@reddit
If you are speaking about STT, parakeet v3 is small and really fast on CPU only.
overand@reddit
SST? Do you mean STT, (speech to text), like Whisper, etc?
Skystunt@reddit
Does it come with qwen3 tts included or do we need to manually change the tts model ?
fagenorn@reddit (OP)
It includes the finetuned, quantized GGUF Qwen3 TTS model. You do need to enable it, though.
selfdeprecational@reddit
how did you do the finetune? I tried it and wrote some inference code for it, but didn't like the included English voices (no English female voice either). I played with the base model a bit; you can see the range of expressions a lot in that.
fagenorn@reddit (OP)
Check this repo: https://github.com/vspeech/Qwen3-TTS-Train
I used this, just adapted it a bit to my use-case.
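Most TTS finetuning setups, including (I believe) repos like the one linked above, want a manifest pairing audio clips with transcripts. As a hedged sketch of the data-prep step, assuming one .txt transcript per .wav clip and JSONL field names that are my guess rather than that repo's actual schema:

```python
import json
from pathlib import Path

def build_manifest(clip_dir, out_path="train.jsonl"):
    """Pair each .wav clip with its sibling .txt transcript and
    write one JSON object per line. Field names ("audio", "text")
    are illustrative; adapt them to your training repo's schema.
    Returns the number of clips written."""
    rows = []
    for wav in sorted(Path(clip_dir).glob("*.wav")):
        txt = wav.with_suffix(".txt")
        if txt.exists():  # skip clips with no transcript
            rows.append({"audio": str(wav), "text": txt.read_text().strip()})
    with open(out_path, "w") as f:
        for r in rows:
            f.write(json.dumps(r) + "\n")
    return len(rows)
```

At the ~1.3 hours of clips OP mentions elsewhere in the thread, this is a few hundred rows, which is small enough to sanity-check by hand.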
MK_L@reddit
Cool
MadGenderScientist@reddit
absolutely wild conversation lol. and good work!
I still wish the conversation were more fluid, though this is better than most of what I've seen. the LLM still tends to reply in paragraphs, just short paragraphs. I think none of the models are capturing conversational dynamics and turn-taking all that well.
fagenorn@reddit (OP)
Ty <3
In the example above it uses just the default Llama 3.3 model, but yeah, if you really want the model to understand how conversations flow, the answer is to finetune it on "spoken" conversations.
I did a test of this a year ago with nemo 12B, and honestly at the time the result was super impressive. Just tried it out with persona engine, and I feel like the flow is much more "natural". https://voca.ro/1ikoYf0pMRuo
Could be a good experiment for the future to see how the same dataset performs on the modern models we have nowadays.
Icy_Restaurant_8900@reddit
That example with Nemo 12B is super impressive. Is that with Qwen3 TTS also? The audio quality is excellent
fagenorn@reddit (OP)
Yes! That's the nice thing about having a pluggable system, the LLM can just be switched on the go while keeping the same voice
CheatCodesOfLife@reddit
I turned the volume way up because you were speaking so quietly, then at 00:34 😖 <--BOOP-- 📢
fagenorn@reddit (OP)
So sorry about that, just quickly recorded without testing ><
jimmy1460@reddit
Haven't heard someone kind of pinpoint my same feelings about it. It's the paragraphs, isn't it?
the_wreckbhai@reddit
How to run it without burning your laptop?
neuthral@reddit
creepy tbh...
IrisColt@reddit
Very interesting, there's no ultimate model winning yet.
Virtamancer@reddit
Specs?
bingeboy@reddit
Calm down
lebbi@reddit
Man, nvidia AND windows required. Bummer.
Real cool project!
sumptuous-drizzle@reddit
This is quite impressive, and I appreciate the amount of human thought that seems to be involved. (Though the AI illustrations in the README do still make me cringe)
The architecture is also quite interesting. This feels 60-70% of the way towards making an AI assistant truly painless. The loops between the different parts seem quite tight and well thought out. A shame that it's baked into a Windows-only GUI. Exposing the same primitives as nodes, e.g. in a graph-like or JSON-like structure or some other composable interface, wired together with sensible defaults but allowing good composition through clear, small interfaces, would really let one start being creative with this kind of modality. E.g. you could have a simple LLM component implementing the required interface, but you could also have an AgentOrchestrator component which implements the same interface (something like: in, user message; out, streamed tokens) but internally dispatches to agents, tends to them, etc. Having it configurable but not composable always feels like such a waste to me.
Having said that, it's an amazing project, good job!
DataPhreak@reddit
God damnit, i want to give her my legs. NOW!
Jauhso29@reddit
How are you generating the Avatar?
Excellent_Koala769@reddit
How did you make the avatar?
derangedkilr@reddit
VRoid is probably the easiest way to make a rigged model. expressions are a bit limited though.
fagenorn@reddit (OP)
Spent way too much time learning Live2D and rigging the model you see with it. The engine then controls the avatar using blendshapes.
LelouchZer12@reddit
Did you compare it with Omnivoice?
JLeonsarmiento@reddit
Shieeeetttt
geneing@reddit
u/fagenorn how did you get Qwen3 TTS working under llama.cpp? Could you share a writeup?
Adrian_Galilea@reddit
Every long generation I make just produces nonsense; do you generate per sentence or something?
DryEntrepreneur4218@reddit
We have great new TTS models, but what are the SOTA STT ones? What do you people use?
charmander_cha@reddit
Does it work with Vulkan or ROCm? Would anyone know?
logic_prevails@reddit
Ts cringe 😂
Danmoreng@reddit
The quantised Qwen3-tts part sounds really interesting. I wonder if you came across my vibe-coded implementation: https://github.com/Danmoreng/qwen-tts-studio
dkeiz@reddit
Quality is great, but it's too slow :(
macumazana@reddit
is that Jester Lavorre?