Supertonic WebGPU: blazingly fast text-to-speech running 100% locally in your browser.

Posted by xenovatech@reddit | LocalLLaMA

Last week, the Supertone team released Supertonic, an extremely fast and high-quality text-to-speech model. So, I created a demo for it that uses Transformers.js and ONNX Runtime Web to run the model 100% locally in the browser on WebGPU. The original authors made a web demo too, but I optimized the model as much as I could, and it runs up to ~40% faster in my tests (see below).
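
For anyone curious what the Transformers.js side of something like this looks like, here's a minimal sketch of loading a text-to-speech pipeline on WebGPU and playing the result with the Web Audio API. The model id is a placeholder, not the exact setup of this demo, so check the Space's source code for the real configuration:

```js
import { pipeline } from '@huggingface/transformers';

// Placeholder model id for illustration only; see the Space's source for the actual model.
const synthesizer = await pipeline('text-to-speech', 'onnx-community/example-tts-model', {
  device: 'webgpu', // run inference on WebGPU instead of WASM
});

// Synthesize some text. The pipeline returns raw audio samples plus the sample rate.
const result = await synthesizer('Supertonic runs entirely in your browser.');

// Play the Float32Array of samples through the Web Audio API.
const ctx = new AudioContext({ sampleRate: result.sampling_rate });
const buffer = ctx.createBuffer(1, result.audio.length, result.sampling_rate);
buffer.copyToChannel(result.audio, 0);

const source = ctx.createBufferSource();
source.buffer = buffer;
source.connect(ctx.destination);
source.start();
```

Everything above runs client-side: the model weights are fetched once, cached by the browser, and inference never leaves your machine.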

I was even able to generate a ~5-hour audiobook in under 3 minutes. Amazing, right?!

Link to demo (+ source code): https://huggingface.co/spaces/webml-community/Supertonic-TTS-WebGPU

* From my testing, on the same device and with the same 226-character paragraph: the newly optimized model ran at ~1750.6 characters per second, while the original ran at ~1255.6 characters per second (roughly a 39% speedup).
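
The characters-per-second figure is just text length divided by wall-clock synthesis time. A rough way to reproduce that kind of measurement in the browser (assuming a `synthesizer` pipeline like the sketch above; this isn't necessarily how the numbers above were collected) would be:

```js
// Rough benchmark: characters per second for one synthesis call.
// Assumes `synthesizer` is an already-loaded text-to-speech pipeline.
const text = 'Your 226-character test paragraph goes here...';

const start = performance.now();
await synthesizer(text);
const elapsedSeconds = (performance.now() - start) / 1000;

console.log(`${(text.length / elapsedSeconds).toFixed(1)} characters/second`);
```

For a fair comparison, run a warm-up call first so model loading and shader compilation don't count against the timed run.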