How difficult would it be to have a text-to-speech setup like Elevenlabs at home?
Posted by iaseth@reddit | LocalLLaMA | 48 comments
I am using Elevenlabs to generate a lot of audio. To save costs and have greater control and customisation, I want to set up a local pipeline for this.
Have any of you guys built something like this? How was your experience? Which models did you use? What was your hardware setup?
I have an i9 13900 with 4070 (?). I can afford to spend about $4000-5000 on a new setup.
mercuryin@reddit
Would anyone know how I can clone a voice and use that voice, for example, in Home Assistant or other platforms? I might not fully understand or know what to look for. I've tried cloning a voice using E2/F5-TTS, and the results are perfect, but it only generates an audio file with the specific words or phrases you input. It doesn't let me export the model in a way that I could use with other software to instantly generate speech from any text I want. Everything I find online is about this same process of typing words, waiting for the machine to process them, and downloading the audio, but I want that same quality available for personal assistants or TTS systems in real time.
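In other words, what I'm after is keeping the cloned voice loaded behind something like a small local HTTP endpoint that Home Assistant (or anything else) could call with arbitrary text. A minimal sketch of that shape, where `clone_and_speak` is just a hypothetical placeholder for whatever F5-TTS/XTTS inference call actually does the cloning:

```python
# Minimal sketch: keep the TTS model loaded and expose it over HTTP.
# `clone_and_speak` is a hypothetical placeholder, not a real F5-TTS/XTTS API.
from fastapi import FastAPI
from fastapi.responses import Response
from pydantic import BaseModel

app = FastAPI()

class SpeakRequest(BaseModel):
    text: str
    ref_audio: str = "my_voice.wav"  # reference clip of the cloned voice

def clone_and_speak(text: str, ref_audio: str) -> bytes:
    # Replace with a real inference call that returns WAV bytes
    # for `text` spoken in the cloned voice.
    raise NotImplementedError

@app.post("/tts")
def tts(req: SpeakRequest) -> Response:
    wav_bytes = clone_and_speak(req.text, req.ref_audio)
    return Response(content=wav_bytes, media_type="audio/wav")
```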
MuradTagh@reddit
Did you find any solution?
rbgo404@reddit
Hey, for TTS you can check our blog, where we compare latency across 9 different TTS models.
https://www.inferless.com/learn/comparing-different-text-to-speech---tts--models-for-different-use-cases
Also, if you are interested in LLMs, we have a performance leaderboard on HF with metrics like TTFT, tokens/sec, and latency:
https://huggingface.co/spaces/Inferless/LLM-Inference-Benchmark
IONaut@reddit
The newest and best TTS you can run at home that I've seen is F5 TTS. It is mainly a CLI implementation, but somebody put together a very basic web UI for it that you can install through Pinokio. It captures and reproduces the voice and the emotional performance very well from just a few seconds of example audio, and does so fairly quickly.
rbgo404@reddit
I have recently done a quick latency benchmark and found F5 is very slow and the quality is not always great.
InterestingTea7388@reddit
F5 is the latest, but it underperforms in so many ways. Intonation, emotion, pauses are just not good. At the moment, Fish Speech is probably still the best choice. Far better than F5/E2 in any case.
vaksninus@reddit
It depends a lot on the audio reference. It has excellent quality; see Jarod's video as an example.
IONaut@reddit
Yeah, that's what I've been noticing too. Making sure there's no background music and denoising the clips helps a lot, as does choosing clips with the cadence and intonation you're looking for.
vaksninus@reddit
Also, the E2 model is more focused on quality (pretty big difference), and increasing CFG to 3 instead of the default 2 (small effect) seems to help quality a bit as well, from my anecdotal testing today.
IONaut@reddit
I thought I had some that worked better with the E2 model and others that worked better with the F5 model. Trying to pin down what it is that makes a clip work better with one than the other.
mintybadgerme@reddit
F5 performs amazingly in my tests. I guess it depends on what you're looking for.
knvn8@reddit
F5 is still a little rough around the edges, but I think it has potential if it picks up the same degree of community attention.
aniketmaurya@reddit
Parler TTS is a good open-source model for text-to-speech. You can try it here
Fun_Librarian_7699@reddit
Is there also a German model that has the same quality?
aniketmaurya@reddit
It can be finetuned on German data too.
Fun_Librarian_7699@reddit
Unfortunately, I don't have the necessary skills and especially not the necessary hardware to do that.
Blizado@reddit
Same problem here, so I will use XTTSv2 a bit longer.
lordpuddingcup@reddit
Aren't XTTS and now E2 and F5 considered the best open source?
aniketmaurya@reddit
Yeah, of course! There are a couple of TTS models, and they can be chosen based on memory and speed requirements. I have an example with XTTS if someone wants to try it - https://lightning.ai/lightning-ai/studios/deploy-a-voice-clone-api-coqui-xtts-v2-model?section=all&view=public&query=voice
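For a sense of what the XTTS call itself looks like, the Coqui API is roughly this (model name and arguments from memory, so double-check against the TTS package docs):

```python
# Rough sketch of XTTSv2 voice cloning with the Coqui TTS package.
from TTS.api import TTS

tts = TTS("tts_models/multilingual/multi-dataset/xtts_v2").to("cuda")

tts.tts_to_file(
    text="Hello, this is a cloned voice speaking.",
    speaker_wav="reference_voice.wav",  # a few seconds of the target voice
    language="en",
    file_path="output.wav",
)
```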
Deep_Fried_Aura@reddit
I can't suggest SWivid/F5-TTS enough. The only tricky part is that if your output is verbose, the generation will take longer. For sub-10-sentence outputs it's incredible though, and with some elbow grease and tinkering it could potentially achieve real-time generation. To do that, however, you would have to be creative about how you process the LLM output.
My personal "theory" is that you could begin generation on the first 5 sentences and make microscopic adjustments to the TTS playback speed, so that you're generating long outputs in blocks, or chunks. The first generation would be slower than the last. Say you have a 2-paragraph, 40-sentence output: you would take the first paragraph and split it into quarters. As soon as you have the first 5 sentences you ship them off to generation, and while those are generating you prepare the next 5; by that time the LLM text output should be complete, so pedal to the metal. While your second set of 5 sentences is on playback, you can run the last set of 10 (20 sentences from paragraph 1), and when those 10 sentences are on playback, you can do 10 from paragraph 2, or just send it and try to push all remaining text through.
The trick would be finding the sweet spot for the playback speed: slow enough to allow your in-progress generation to complete, but not so slow that it's noticeable during playback.
Something that I considered was training a custom voice using a voice with a fast delivery (fast talker) because then you could potentially slow it down considerably without the reduced playback rate affecting the generated result.
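A rough sketch of that chunking idea in Python; `synthesize` and `play_wav` are stand-ins for the actual TTS call and audio playback, so treat this as the shape of the approach rather than a finished implementation:

```python
# Sketch of chunked TTS: synthesize the next chunk while the previous one plays.
# `synthesize` and `play_wav` are placeholders for real TTS and playback calls.
import queue
import re
import threading

def split_sentences(text: str, per_chunk: int = 5) -> list:
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    return [" ".join(sentences[i:i + per_chunk]) for i in range(0, len(sentences), per_chunk)]

def synthesize(chunk: str) -> bytes:
    raise NotImplementedError  # e.g. an F5-TTS inference call returning WAV bytes

def play_wav(wav: bytes) -> None:
    raise NotImplementedError  # e.g. sounddevice / simpleaudio playback

def stream_speech(text: str) -> None:
    audio_q = queue.Queue(maxsize=2)  # small buffer so generation stays ahead

    def producer() -> None:
        for chunk in split_sentences(text):
            audio_q.put(synthesize(chunk))
        audio_q.put(None)  # sentinel: no more chunks

    threading.Thread(target=producer, daemon=True).start()
    wav = audio_q.get()
    while wav is not None:
        play_wav(wav)  # playback time hides the latency of the next chunk
        wav = audio_q.get()
```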
databasehead@reddit
I recently took some inspiration from a previous Reddit poster who wrote amitybell/piper. I am now using his Go wrapper around piper to serve the open-source Amy voice. I generate text using Llama 3.1 8B, write the wave file, and play the audio in a simple HTML/htmx page, all running on a 4090 with an i9 14900K and 64 GB RAM.
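For anyone who wants the same shape of pipeline without the Go wrapper, the core step is just piping the LLM text through the piper CLI to get a wave file. A rough Python version (flag and model names from memory, so check `piper --help`):

```python
# Rough sketch: LLM text -> piper CLI -> WAV file.
# Assumes the `piper` CLI and the Amy voice are installed locally.
import subprocess

def text_to_wav(text: str, out_path: str = "out.wav") -> str:
    subprocess.run(
        ["piper", "--model", "en_US-amy-medium", "--output_file", out_path],
        input=text.encode("utf-8"),  # piper reads the text from stdin
        check=True,
    )
    return out_path

if __name__ == "__main__":
    text_to_wav("Hello from a local text to speech pipeline.")
```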
constPxl@reddit
If I may ask, on Linux or Windows?
databasehead@reddit
WSL2, Ubuntu 22.04. It’s got its quirks but it’s pretty decent. Everything kinda just works!
Material1276@reddit
My software is multi-engine https://github.com/erew123/alltalk_tts/tree/alltalkbeta and I will be adding others, though please read my current support/development statement on there. Screenshots are here https://github.com/erew123/alltalk_tts/discussions/237
knvn8@reddit
Okay I was just trying out Alltalk yesterday and gave up in frustration after like two hours of trying to make it work. Half of the screens in the app are just... Text? Why isn't that stuff in the README? Or the wiki.
But I was specifically interested in trying out the voice cloning because I saw a claim about using Alltalk cloning and there was nothing I could find in the README or the app to indicate how you do that with Alltalk. Straight up raw xtts makes it easy, why is there no facility for it with Alltalk's implementation of xtts? It's weird because despite the sheer amount of verbiage in the app, I couldn't find anything about cloning. I feel like I must be missing something obvious.
Anyway, I don't want to discourage your work here; I'm grateful to every open source developer. If I had any advice, it would be to pull back and focus more on completing key use cases, because in its current state it feels more like 100 half-baked use cases, so the more purpose-built inference servers end up being more useful (to me at least).
Material1276@reddit
Re making it work: I have been dealing with real-world life problems, and while I have been away I have put out statements about fixes/workarounds for third-party things that changed outside of my control. The statement is here: https://github.com/erew123/alltalk_tts/issues/377
Re cloning voices or other things with other TTS models: the instructions for each TTS engine are in the engine's help pages within the Gradio interface. That should cover most of your questions.
ObnoxiouslyVivid@reddit
Thanks for all the hard work.
Really sorry to hear about your family member. Take your time and take care.
ObnoxiouslyVivid@reddit
That was my experience as well. I think the RVC wiki page was missing for like a month, not even a mention of how to enable it anywhere. I imagine they're quite busy, but still, it would have saved others countless hours of tinkering.
But I'd say it's worth it in the end, works quite well.
Nrgte@reddit
Voice training is on the settings page of AllTalk.
knvn8@reddit
Training and cloning are two different things
Nrgte@reddit
They're not. Under the hood it's all training: you train the AI to pick up the speech patterns of the target voice.
knvn8@reddit
Training is for producing new checkpoints. At least that's the common nomenclature; if Alltalk's training section is not actually for running new training epochs, then that's another issue I have with it.
Nrgte@reddit
No that's right, you do produce a new checkpoint finetuned on a targeted voice.
knvn8@reddit
Okay, XTTS can clone voices without requiring training of a new checkpoint.
Nrgte@reddit
No it can't; it can just mimic the tone of the voice, but not the speech patterns.
AutomaticDriver5882@reddit
I tried the beta. The UI looks better, but the voices are worse. Which voices did you use?
ObnoxiouslyVivid@reddit
I'm using alltalk + RVC https://github.com/erew123/alltalk_tts/wiki/RVC-(Retrieval%E2%80%90based-Voice-Conversion); it gets you 90% of the way to Elevenlabs.
I'd say be prepared to experiment with different models and voice samples. They are all very sensitive to these initial conditions.
From my experiments, XTTSv2 is still the king. You should be able to run the whole pipeline on your hardware with no issues.
bigh-aus@reddit
I'm currently using piper. It's pretty decent and comes with a bunch of voices. My only complaint is that it's very heavy on the CPU but leaves the GPU unutilized. Not a fault of piper, but of the models underneath.
I'm also investigating other options. It would be nice to have a more expressive voice, but ultimately the best-quality ones are either much more complex to use or proprietary.
synthmike@reddit
Piper runs on the GPU too. You may need to download some additional libraries and configure it properly.
bigh-aus@reddit
From what my research says, it's hella slow on GPU.
MakitaNakamoto@reddit
look up F5-TTS
MRGRD56@reddit
You could try XTTSv2 - it generates pretty realistic and somewhat emotional voices in different languages. But it's not perfect; Elevenlabs is better.
If you want to call it from Python, you can use this for example https://github.com/KoljaB/RealtimeTTS
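The basic RealtimeTTS usage, going from the project's README (so double-check it), is roughly:

```python
# Rough sketch of streaming XTTSv2 output with RealtimeTTS.
from RealtimeTTS import TextToAudioStream, CoquiEngine

engine = CoquiEngine()  # XTTSv2 under the hood
stream = TextToAudioStream(engine)
stream.feed("Hello! This sentence is synthesized and played as a stream.")
stream.play()  # blocks until playback finishes
```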
If you just need a web UI to generate audio files manually, this one could help: https://github.com/daswer123/xtts-webui, or something like that.
For me it was alright at first sight, but I didn't test it that much, actually (I have a Ryzen 9 7950X3D, 64 GB DDR5 RAM, RTX 4070 Ti Super). The only thing I can tell is that it's pretty slow if you try to use it to generate speech in real time without a GPU.
Fun_Librarian_7699@reddit
Does RealtimeTTS with eleven_multilingual_v1 also produce good German results? Which model produces the best German results?
InterestingTea7388@reddit
German as a language has far too few speakers for it to get much consideration; since many technology enthusiasts in our country also speak fluent English, there is hardly anyone who makes the effort. I'm more of a hobbyist and have been working on a German model for months - but that will take forever. ^^
Fish Speech and XTTSv2 can do German, but the quality is far worse than when these models generate English.
bannert1337@reddit
I can recommend OpenedAI Speech (https://github.com/matatonic/openedai-speech). I host it along with other projects from the same user for Open WebUI.
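Since it exposes an OpenAI-compatible endpoint, calling it from the standard openai Python client should look roughly like this (the base URL, port, and voice/model names below are assumptions; check the project's README):

```python
# Rough sketch: calling a local OpenedAI Speech server via the openai client.
# The base URL, port, voice and model names below are assumptions.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")

response = client.audio.speech.create(
    model="tts-1",
    voice="alloy",
    input="Hello from a local, OpenAI-compatible text to speech server.",
)

with open("speech.mp3", "wb") as f:
    f.write(response.content)  # raw audio bytes returned by the server
```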
fasti-au@reddit
Very doable. MeloTTS looks like the latest update for emotion, but there are many you can try. Whisper for voice-to-text is near real-time as well.
RealBiggly@reddit
Try Pinokio; they just added the latest TTS thingy, and it's easy to get going.
Svyable@reddit
Not local, but you can check out Hume.ai; their new EVI2 is very lifelike.