How difficult would it be to have a text-to-speech setup like Elevenlabs at home?
Posted by iaseth@reddit | LocalLLaMA | 48 comments
I am using Elevenlabs to generate a lot of audio. To save costs and have greater control and customisation, I want to set up a local pipeline for this.
Have any of you guys built something like this? How was your experience? Which models did you use? What was your hardware setup?
I have an i9 13900 with 4070 (?). I can afford to spend about $4000-5000 on a new setup.
mercuryin@reddit
Would anyone know how I can clone a voice and use that voice, for example, in Home Assistant or other platforms? I might not fully understand or know what to look for. I've tried cloning a voice using E2/F5-TTS, and the results are perfect, but it only generates an audio file with the specific words or phrases you input. It doesn't let me export the model in a way that I could use with other software to instantly generate speech from any text I want. Everything I find online is about this same process of typing words, waiting for the machine to process them, and downloading the audio, but I want that same quality available for personal assistants or TTS systems in real time.
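In other words, what I'm after is keeping the cloned voice loaded behind something like a small local HTTP endpoint that Home Assistant (or anything else) could call with arbitrary text. A minimal sketch of that shape, where `clone_and_speak` is just a hypothetical placeholder for whatever F5-TTS/XTTS inference call actually does the cloning:

```python
# Minimal sketch: keep the TTS model loaded and expose it over HTTP.
# `clone_and_speak` is a hypothetical placeholder, not a real F5-TTS/XTTS API.
from fastapi import FastAPI
from fastapi.responses import Response
from pydantic import BaseModel

app = FastAPI()

class SpeakRequest(BaseModel):
    text: str
    ref_audio: str = "my_voice.wav"  # reference clip of the cloned voice

def clone_and_speak(text: str, ref_audio: str) -> bytes:
    # Replace with a real inference call that returns WAV bytes
    # for `text` spoken in the cloned voice.
    raise NotImplementedError

@app.post("/tts")
def tts(req: SpeakRequest) -> Response:
    wav_bytes = clone_and_speak(req.text, req.ref_audio)
    return Response(content=wav_bytes, media_type="audio/wav")
```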
MuradTagh@reddit
Did you find any solution?
rbgo404@reddit
Hey, for TTS you can check our blog, where we compare latency across 9 different TTS models.
https://www.inferless.com/learn/comparing-different-text-to-speech---tts--models-for-different-use-cases
Also, if you are interested in LLMs, we have a performance leaderboard on HF with metrics like TTFT, tokens/sec, and latency:
https://huggingface.co/spaces/Inferless/LLM-Inference-Benchmark
IONaut@reddit
The newest and best TTS you can run at home that I've seen is F5 TTS. It is mainly a CLI implementation, but somebody put together a very basic web UI for it that you can install through Pinokio. It captures and reproduces the voice and the emotional performance very well from just a few seconds of example audio, and does so fairly quickly.
rbgo404@reddit
I have recently done a quick latency benchmark and found F5 is very slow and the quality is not always great.
InterestingTea7388@reddit
F5 is the latest, but it underperforms in so many ways. Intonation, emotion, pauses are just not good. At the moment, Fish Speech is probably still the best choice. Far better than F5/E2 in any case.
vaksninus@reddit
It depends a lot on the audio reference. It has excellent quality; see Jarod's video as an example.
IONaut@reddit
Yeah, that's what I've been noticing too. Making sure there's no background music and denoising the clips helps a lot, as does choosing clips with the cadence and intonation you're looking for.
vaksninus@reddit
Also, the E2 model is more focused on quality (pretty big difference), and increasing CFG to 3 instead of the default 2 (small effect) seems to help quality a bit as well, from my anecdotal testing today.
IONaut@reddit
I thought I had some that worked better with the E2 model and others that worked better with the F5 model. Trying to pin down what it is that makes a clip work better with one than the other.
mintybadgerme@reddit
F5 performs amazingly in my tests. I guess it depends on what you're looking for.
knvn8@reddit
F5 is still a little rough around the edges, but I think it has potential if it picks up the same degree of community attention.
aniketmaurya@reddit
Parler TTS is a good open-source model for text-to-speech. You can try it here
Fun_Librarian_7699@reddit
Is there also a German model that has the same quality?
aniketmaurya@reddit
It can be finetuned on German data too.
Fun_Librarian_7699@reddit
Unfortunately, I don't have the necessary skills and especially not the necessary hardware to do that.
Blizado@reddit
Same problem here, so I will use XTTSv2 a bit longer.
lordpuddingcup@reddit
Aren't XTTS and now E2 and F5 considered the best open source?
aniketmaurya@reddit
Yeah, of course! There are a couple of TTS models, and they can be chosen based on memory and speed requirements. I have an example with XTTS if someone wants to try it - https://lightning.ai/lightning-ai/studios/deploy-a-voice-clone-api-coqui-xtts-v2-model?section=all&view=public&query=voice
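For a sense of what the XTTS call itself looks like, the Coqui API is roughly this (model name and arguments from memory, so double-check against the TTS package docs):

```python
# Rough sketch of XTTSv2 voice cloning with the Coqui TTS package.
from TTS.api import TTS

tts = TTS("tts_models/multilingual/multi-dataset/xtts_v2").to("cuda")

tts.tts_to_file(
    text="Hello, this is a cloned voice speaking.",
    speaker_wav="reference_voice.wav",  # a few seconds of the target voice
    language="en",
    file_path="output.wav",
)
```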
Deep_Fried_Aura@reddit
I can't suggest SWivid/F5-TTS enough. The only tricky part is that if your output is verbose, the generation will take longer. For sub-10-sentence outputs it's incredible though, and with some elbow grease and tinkering it could potentially achieve real-time generation. To do that, however, you would have to be creative about how you process the LLM output.
My personal "theory" is that you could begin generation on the first 5 sentences and make microscopic adjustments to the TTS playback speed, so that you're generating long outputs in blocks, or chunks. The first generation would be slower than the last. Say you have a 2-paragraph, 40-sentence output: you would take the first paragraph and split it into quarters. As soon as you have the first 5 sentences you ship them off to generation, and while those are generating you prepare the next 5; by that time the LLM text output should be complete, so pedal to the metal. While your second set of 5 sentences is on playback, you can run the last set of 10 (20 sentences from paragraph 1), and when those 10 sentences are on playback, you can do 10 from paragraph 2, or just send it and try to push all remaining text through.
The trick would be finding the sweet spot for the playback speed: slow enough to allow your in-progress generation to complete, but not so slow that it's noticeable during playback.
Something that I considered was training a custom voice using a voice with a fast delivery (fast talker) because then you could potentially slow it down considerably without the reduced playback rate affecting the generated result.
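A rough sketch of that chunking idea in Python; `synthesize` and `play_wav` are stand-ins for the actual TTS call and audio playback, so treat this as the shape of the approach rather than a finished implementation:

```python
# Sketch of chunked TTS: synthesize the next chunk while the previous one plays.
# `synthesize` and `play_wav` are placeholders for real TTS and playback calls.
import queue
import re
import threading

def split_sentences(text: str, per_chunk: int = 5) -> list:
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    return [" ".join(sentences[i:i + per_chunk]) for i in range(0, len(sentences), per_chunk)]

def synthesize(chunk: str) -> bytes:
    raise NotImplementedError  # e.g. an F5-TTS inference call returning WAV bytes

def play_wav(wav: bytes) -> None:
    raise NotImplementedError  # e.g. sounddevice / simpleaudio playback

def stream_speech(text: str) -> None:
    audio_q = queue.Queue(maxsize=2)  # small buffer so generation stays ahead

    def producer() -> None:
        for chunk in split_sentences(text):
            audio_q.put(synthesize(chunk))
        audio_q.put(None)  # sentinel: no more chunks

    threading.Thread(target=producer, daemon=True).start()
    wav = audio_q.get()
    while wav is not None:
        play_wav(wav)  # playback time hides the latency of the next chunk
        wav = audio_q.get()
```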
databasehead@reddit
I recently took some inspiration from a previous Reddit poster who wrote amitybell/piper. I am now using his Go wrapper around piper to serve the open-source Amy voice. I generate text using Llama 3.1 8B, write the wave file, and play the audio in a simple HTML/htmx page, all running on a 4090 with an i9 14900K and 64 GB RAM.
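For anyone who wants the same shape of pipeline without the Go wrapper, the core step is just piping the LLM text through the piper CLI to get a wave file. A rough Python version (flag and model names from memory, so check `piper --help`):

```python
# Rough sketch: LLM text -> piper CLI -> WAV file.
# Assumes the `piper` CLI and the Amy voice are installed locally.
import subprocess

def text_to_wav(text: str, out_path: str = "out.wav") -> str:
    subprocess.run(
        ["piper", "--model", "en_US-amy-medium", "--output_file", out_path],
        input=text.encode("utf-8"),  # piper reads the text from stdin
        check=True,
    )
    return out_path

if __name__ == "__main__":
    text_to_wav("Hello from a local text to speech pipeline.")
```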
constPxl@reddit
If I may ask, on Linux or Windows?
databasehead@reddit
WSL2, Ubuntu 22.04. It’s got its quirks but it’s pretty decent. Everything kinda just works!
Material1276@reddit
My software is multi-engine https://github.com/erew123/alltalk_tts/tree/alltalkbeta and I will be adding others, though please read my current support/development statement on there. Screenshots are here https://github.com/erew123/alltalk_tts/discussions/237
knvn8@reddit
Okay I was just trying out Alltalk yesterday and gave up in frustration after like two hours of trying to make it work. Half of the screens in the app are just... Text? Why isn't that stuff in the README? Or the wiki.
But I was specifically interested in trying out the voice cloning because I saw a claim about using Alltalk cloning and there was nothing I could find in the README or the app to indicate how you do that with Alltalk. Straight up raw xtts makes it easy, why is there no facility for it with Alltalk's implementation of xtts? It's weird because despite the sheer amount of verbiage in the app, I couldn't find anything about cloning. I feel like I must be missing something obvious.
Anyway, I don't want to discourage your work here; I'm grateful to every open source developer. If I had any advice, it would be to pull back and focus more on completing key use cases, because in its current state it feels more like 100 half-baked use cases, so the more purpose-built inference servers end up being more useful (to me at least).
Material1276@reddit
Re making it work: I have been dealing with real-world life problems, and while I have been away I have put out statements about fixes/workarounds for third-party things that changed outside of my control. The statement is here: https://github.com/erew123/alltalk_tts/issues/377
Re cloning voices or other things with other TTS models: the instructions for each TTS engine are in the engine's help pages within the Gradio interface. That should cover most of your questions.
ObnoxiouslyVivid@reddit
Thanks for all the hard work.
Really sorry to hear about your family member. Take your time and take care.
ObnoxiouslyVivid@reddit
That was my experience as well. I think the RVC wiki page was missing for like a month, not even a mention of how to enable it anywhere. I imagine they're quite busy, but still, it would have saved others countless hours of tinkering.
But I'd say it's worth it in the end, works quite well.
Nrgte@reddit
Voice training is on the settings page of AllTalk.
knvn8@reddit
Training and cloning are two different things
Nrgte@reddit
They're not. Under the hood it's all training: you train the AI to pick up the speech patterns of the target voice.
knvn8@reddit
Training is for producing new checkpoints. At least that's the common nomenclature; if Alltalk's training section is not actually for running new training epochs, then that's another issue I have with it.
Nrgte@reddit
No that's right, you do produce a new checkpoint finetuned on a targeted voice.
knvn8@reddit
Okay, XTTS can clone voices without requiring training of a new checkpoint.
Nrgte@reddit
No it can't; it can just mimic the tone of the voice, but not the speech patterns.
AutomaticDriver5882@reddit
I tried the beta. The UI looks better, but the voices are worse. Which voices did you use?
ObnoxiouslyVivid@reddit
I'm using alltalk + RVC https://github.com/erew123/alltalk_tts/wiki/RVC-(Retrieval%E2%80%90based-Voice-Conversion); it gets you 90% of the way to Elevenlabs.
I'd say be prepared to experiment with different models and voice samples. They are all very sensitive to these initial conditions.
From my experiments, XTTSv2 is still the king. You should be able to run the whole pipeline on your hardware with no issues.
bigh-aus@reddit
I'm currently using piper. It's pretty decent and comes with a bunch of voices. My only complaint is that it's very heavy on the CPU but leaves the GPU unutilized. Not a fault of piper, but of the models underneath.
I'm also investigating other options. It would be nice to have a more expressive voice, but ultimately the best-quality ones are either much more complex to use or proprietary.
synthmike@reddit
Piper runs on the GPU too. You may need to download some additional libraries and configure it properly.
bigh-aus@reddit
From what my research says, it's hella slow on GPU.
MakitaNakamoto@reddit
look up F5-TTS
MRGRD56@reddit
You could try XTTSv2 - it generates pretty realistic and somewhat emotional voices in different languages. But it's not perfect; Elevenlabs is better.
If you want to call it from Python, you can use this for example https://github.com/KoljaB/RealtimeTTS
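The basic RealtimeTTS usage, going from the project's README (so double-check it), is roughly:

```python
# Rough sketch of streaming XTTSv2 output with RealtimeTTS.
from RealtimeTTS import TextToAudioStream, CoquiEngine

engine = CoquiEngine()  # XTTSv2 under the hood
stream = TextToAudioStream(engine)
stream.feed("Hello! This sentence is synthesized and played as a stream.")
stream.play()  # blocks until playback finishes
```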
If you just need a web UI to generate audio files manually, this one could help: https://github.com/daswer123/xtts-webui, or something like that.
For me it was alright at first sight, but I didn't test it that much, actually (I have a Ryzen 9 7950X3D, 64 GB DDR5 RAM, RTX 4070 Ti Super). The only thing I can tell is that it's pretty slow if you try to use it to generate speech in real time without a GPU.
Fun_Librarian_7699@reddit
Does RealtimeTTS with eleven_multilingual_v1 also produce good German results? Which model produces the best German results?
InterestingTea7388@reddit
German as a language has far too few speakers for it to get much consideration; since many technology enthusiasts in our country also speak fluent English, there is hardly anyone who makes the effort. I'm more of a hobbyist and have been working on a German model for months - but that will take forever. ^^
Fish Speech and XTTSv2 can do German, but the quality is far worse than when these models generate English.
bannert1337@reddit
I can recommend OpenedAI Speech (https://github.com/matatonic/openedai-speech). I host it along with other projects from the same user for Open WebUI.
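Since it exposes an OpenAI-compatible endpoint, calling it from the standard openai Python client should look roughly like this (the base URL, port, and voice/model names below are assumptions; check the project's README):

```python
# Rough sketch: calling a local OpenedAI Speech server via the openai client.
# The base URL, port, voice and model names below are assumptions.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")

response = client.audio.speech.create(
    model="tts-1",
    voice="alloy",
    input="Hello from a local, OpenAI-compatible text to speech server.",
)

with open("speech.mp3", "wb") as f:
    f.write(response.content)  # raw audio bytes returned by the server
```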
fasti-au@reddit
Very doable. MeloTTS looks like the latest update for emotion, but there are many you can try. Whisper for voice-to-text is near real-time as well.
RealBiggly@reddit
Try Pinokio; they just added the latest TTS thingy, and it's easy to get going.
Svyable@reddit
Not local, but you can check out Hume.ai; their new EVI2 is very lifelike.