Distilling Qwen3 TTS
Posted by Reasonable_Friend_77@reddit | LocalLLaMA | 12 comments
Hi all,
I've made a few attempts to distill Qwen3 TTS without much success. I'm trying to create a model that is half the size and see what the quality trade-off is, but so far I've only managed to produce garbage.
Does anyone have experience with distilling TTS models?
Any tips or documentation you're willing to share?
Fit-Produce420@reddit
Yeah if I took out half your brain would your abilities increase, or decrease?
Reasonable_Friend_77@reddit (OP)
Lol, not sure if you're advocating for LLM rights here, but an LLM is not a brain, and distillation is a well-known process that has been applied extensively. My goal is to explore the loss in quality: how much can the model retain at a smaller size? For example, what if it retains only a single language?
But hey, you seem to have the answer already and clearly have thought this through. :)
Fit-Produce420@reddit
Well it sounds like you're the one who doesn't understand how LLMs work.
I don't care if you spend your whole day lobotomizing language models, but the weights you're removing are shared across languages. By taking out everything that's not English, all you're doing is making it stupider, because those "non-English" activations are shared, not isolated.
You're basically testing how much English you can remove before you've removed too much. I guess if you're interested that's all that matters?
Reasonable_Friend_77@reddit (OP)
Did you read my question?
I'm not saying you can randomly remove layers and somehow have a smaller model. You're right that multilingual models don't have cleanly separable English weights, but I'm not claiming you can remove them cleanly; that wouldn't make sense. I'm talking about distillation: creating a student model that has the same architecture but a smaller talker, and measuring the quality trade-off.
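For readers following along, the student/teacher setup described above can be sketched in a few lines of numpy: the student is trained to match the teacher's output distribution over tokens. Everything here is illustrative; the shapes, temperature, and the tiny 8-way "vocabulary" of codec tokens are stand-ins, not Qwen3 TTS's actual codec or API.

```python
import numpy as np

def softmax(logits, temperature=1.0):
    """Temperature-scaled softmax over the last axis."""
    z = logits / temperature
    z = z - z.max(axis=-1, keepdims=True)  # numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """KL(teacher || student) over the token vocabulary, averaged over positions.

    A higher temperature softens both distributions so the student also
    learns from the teacher's low-probability alternatives."""
    p_t = softmax(teacher_logits, temperature)
    log_p_t = np.log(p_t + 1e-12)
    log_p_s = np.log(softmax(student_logits, temperature) + 1e-12)
    kl = (p_t * (log_p_t - log_p_s)).sum(axis=-1)
    # Scale by T^2 so gradient magnitudes stay comparable across temperatures.
    return float(kl.mean() * temperature ** 2)

# Toy example: 4 sequence positions, 8-way "vocabulary" of audio codec tokens.
rng = np.random.default_rng(0)
teacher = rng.normal(size=(4, 8))
student = rng.normal(size=(4, 8))
print(distillation_loss(student, teacher))  # positive while the student mismatches
print(distillation_loss(teacher, teacher))  # ~0.0 when distributions match
```

In an actual training loop this loss (often mixed with the usual cross-entropy on ground-truth tokens) would be backpropagated through the student only, with the teacher frozen.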
These comments are neither helpful nor technical, so I'm not going to continue this conversation.
Fit-Produce420@reddit
Yeah you're just playing around, okay neat, have fun!
r4in311@reddit
You're wasting your time, just use OmniVoice it's so much better and really small :-)
Reasonable_Friend_77@reddit (OP)
Lol. Thanks now that I know I'm wasting my time I'll stop.
:)
Reasonable_Friend_77@reddit (OP)
With 4 upvotes I feel like I should add some explanation here. :P
OmniVoice is a diffusion model, not an autoregressive one. It works great if you have a lot of cores (threads), too. But because of its nature it can't really do streaming: a diffusion model knows the exact output length from the very beginning and progressively iterates on the whole result. So if you want to minimize latency and you don't have many cores (for example, CPU inference), it may not be the best choice. Or so I learned in my ignorance; always happy to be proven wrong. :)
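A back-of-the-envelope way to see the streaming argument: time-to-first-audio for an autoregressive model scales with one chunk of tokens, while for a diffusion model it scales with the full denoising schedule. The numbers below are invented purely for illustration; they are not measurements of OmniVoice, Qwen3, or any real model.

```python
# Toy latency model: why autoregressive TTS can stream but diffusion can't.
# All constants are made-up illustrative values, not benchmarks.

def autoregressive_first_audio(tokens_per_chunk=25, ms_per_token=10):
    # An AR model can hand the vocoder its first chunk of codec tokens
    # as soon as those tokens are generated.
    return tokens_per_chunk * ms_per_token

def diffusion_first_audio(total_steps=30, ms_per_step=40):
    # A diffusion model refines the *entire* utterance at every denoising
    # step, so no audio exists until all steps have finished.
    return total_steps * ms_per_step

print(autoregressive_first_audio())  # 250 ms to the first audible chunk
print(diffusion_first_audio())       # 1200 ms before anything plays
```

The diffusion model may still finish the *whole* utterance sooner on a many-core machine, which is why the trade-off flips depending on hardware and whether you care about first-chunk latency.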
Double_Cause4609@reddit
In general, distillation's really involved.
What model are you distilling into? If you don't have a smaller, generally pretrained model, you typically have to pre-train one before distillation.
That is, distillation only works when the target policy is already near where you want to be after distillation.
You might find QAT self-distillation a bit better (where you do QAT on the weights but reference the full precision model as the teacher). If the goal is to run 2x-4x as fast it should still be fine.
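A minimal sketch of what QAT self-distillation looks like at the tensor level, assuming a single linear layer and symmetric per-tensor fake quantization. This is a simplification: real QAT would backpropagate the loss through a straight-through estimator, which is omitted here, and the 4-bit scheme is just an example.

```python
import numpy as np

def fake_quantize(w, bits=4):
    """Symmetric per-tensor fake quantization: quantize then dequantize,
    so the 'student' forward pass runs with low-precision weights."""
    qmax = 2 ** (bits - 1) - 1
    scale = np.abs(w).max() / qmax
    return np.round(w / scale) * scale

rng = np.random.default_rng(1)
W = rng.normal(size=(16, 16))  # the model's full-precision weights
x = rng.normal(size=(8, 16))   # a batch of hidden states

teacher_out = x @ W.T                 # full-precision forward pass (teacher)
student_out = x @ fake_quantize(W).T  # same weights, fake-quantized (student)

# Self-distillation objective: push the quantized model's activations back
# toward the full-precision model's own activations.
loss = float(np.mean((teacher_out - student_out) ** 2))
print(loss)  # small but nonzero: the quantization error to train away
```

The appeal is that the teacher and student share weights, so there is no separate pretraining stage; you are only teaching the model to tolerate its own quantization noise.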
Reasonable_Friend_77@reddit (OP)
I was distilling into Qwen3 itself, removing the layers that were least sensitive to quantization. The rationale is that those layers are perhaps less important? But I'm not sure that's the right way to go. Perhaps quantization sensitivity isn't even the right criterion? Or is that what you mean by self-distillation?
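The "sensitivity to quantization" probe described here can be sketched on a toy stack of layers: fake-quantize one layer at a time and measure how far the final activations drift from the full-precision baseline. The tanh blocks, sizes, and 4-bit scheme below are assumptions for illustration, not Qwen3's architecture.

```python
import numpy as np

def fake_quantize(w, bits=4):
    """Symmetric per-tensor fake quantization (quantize then dequantize)."""
    qmax = 2 ** (bits - 1) - 1
    scale = np.abs(w).max() / qmax
    return np.round(w / scale) * scale

def forward(x, layers):
    # Toy nonlinearity standing in for a transformer block.
    for w in layers:
        x = np.tanh(x @ w)
    return x

rng = np.random.default_rng(2)
layers = [rng.normal(scale=0.3, size=(16, 16)) for _ in range(6)]
x = rng.normal(size=(8, 16))
baseline = forward(x, layers)

# Sensitivity probe: quantize exactly one layer per trial and record how far
# the final activations drift from the full-precision baseline.
drifts = []
for i in range(len(layers)):
    probed = [fake_quantize(w) if j == i else w for j, w in enumerate(layers)]
    drifts.append(float(np.mean((forward(x, probed) - baseline) ** 2)))
    print(f"layer {i}: drift {drifts[-1]:.6f}")
```

Layers with the smallest drift are candidates for pruning or heavier quantization, though low quantization sensitivity doesn't guarantee a layer is safe to remove outright.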
overand@reddit
Are you trying to distill it or quantize it? (And - have you already just tried it at smaller quantizations? What quantization - if any - are you using, and what sort of system are you trying to run it on?)
I'm also curious what sort of "garbage" you're getting; I find TTS garbage and nonsense to be pretty interesting!
Reasonable_Friend_77@reddit (OP)
Sorry, I should've explained better. I meant half the size in number of parameters. I've been quantizing and got pretty decent results, but now I want to see if it's possible to distill it. Garbage meaning garbled speech: voice cloning is quite broken, and even the custom voice-model voices stop working. Distillation can be done in a number of ways from what I see, so I wanted to check if anyone had experience.