Ngram TTS model?
Posted by Silver-Champion-4846@reddit | LocalLLaMA | 4 comments
Hey there guys. Question: is it possible to make an LLM-based TTS model that stores some kind of patterns for specific languages as n-gram lookup tables? While it might not be needed for some bulky 7B TTS model, my use case requires a model that runs with <50ms of latency on CPU while also adequately supporting a challenging language like Arabic. Would a Gema4 design be possible to adapt for TTS? Maybe the PLEs storing language-specific data, allowing it to perform like a 500M model while being maybe 100M or less matmul-wise? Thanks.
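The core of the idea above is that an n-gram table is static memory: retrieval is plain indexing, so it adds capacity without adding matmuls at inference. A minimal sketch of what that could look like, assuming a hashed bucket table (the names `ngram_table`, `ngram_features`, and all sizes here are made up for illustration, not from any real model):

```python
import numpy as np

EMB_DIM = 64          # hypothetical embedding width
N_BUCKETS = 50_000    # hypothetical hashed-bucket count

# Static table: rows would be learned at train time, then frozen.
# At inference, a lookup is O(1) indexing -- no matrix multiply.
ngram_table = np.random.randn(N_BUCKETS, EMB_DIM).astype(np.float32)

def ngrams(text, n):
    """All contiguous character n-grams of a string."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]

def ngram_features(text, orders=(2, 3)):
    """Sum the hashed n-gram embeddings for a piece of text."""
    feat = np.zeros(EMB_DIM, dtype=np.float32)
    for n in orders:
        for gram in ngrams(text, n):
            feat += ngram_table[hash(gram) % N_BUCKETS]
    return feat

# Arabic "hello" -- grapheme n-grams could carry diacritization
# and pronunciation patterns a tiny model can't learn on its own.
vec = ngram_features("مرحبا")
```

The point of the sketch is the cost profile: the table can be arbitrarily large on disk/RAM while the per-character work stays a hash and an add, which is the kind of thing a <50ms CPU budget can afford.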
Odd-Figure2365@reddit
From what I've read about gema4-inspired TTS approaches, combining precomputed phoneme embeddings with a lightweight vocoder can give the same perceptual quality as a larger 500M model while staying around 100M parameters. UniConverter comes up in some communities for handling offline synthesis and caching, which could speed up experimenting with n-gram tables and local MP3 generation.
Silver-Champion-4846@reddit (OP)
I wasn't talking about merely shrinking the model, but rather about how to utilize static parameters that don't hog processing power while still benefiting model output, especially for a language as complex as Arabic, which needs a form of intelligence beyond just "learn how each letter sounds, mimic intonation patterns, and that's it".
EffectiveCeilingFan@reddit
I think you mean Engram? DeepSeek’s recent paper? Its purpose is to offload the task of factual retrieval from the multilayer perceptron to allow it to focus on encoding reasoning. None of this is really applicable to a TTS model.
I don’t believe PLE is really applicable to a TTS model either.
As for your performance requirements, they’re just not possible. That would be impressive on a GPU, let alone a CPU. On a tiny TTS model, like Kokoro 82M, you could probably get sub-1000ms latency on CPU.
Silver-Champion-4846@reddit (OP)
If I had a gpu I would have tried learning how to experiment and train on my own to be the change I want to see in the world and all that, but reality has other plans. And if I could use cloud gpus, I would have.