How difficult would it be to have a text-to-speech setup like Elevenlabs at home?

Posted by iaseth@reddit | LocalLLaMA | 41 comments

I am using Elevenlabs to generate a lot of audio. To save costs and have greater control and customisation, I want to set up a local pipeline for this.

Have any of you guys built something like this? How was your experience? Which models did you use? What was your hardware setup?

I have an i9 13900 with a 4070 (?). I can afford to spend about $4000-5000 on a new setup.