Improved Text to Speech model: Parler TTS v1 by Hugging Face

Posted by vaibhavs10@reddit | LocalLLaMA | View on Reddit | 73 comments

Hi everyone, I'm VB, the GPU poor in residence (focus on open source audio and on-device ML) at Hugging Face! 🤗

Quite pleased to introduce you to Parler TTS v1 🔉 - 885M (Mini) & 2.2B (Large) - fully open-source Text-to-Speech models! 🤙

Some interesting things about it:

1. Trained on 45,000 hours of open speech (datasets released as well)
2. Up to 4x faster generation thanks to torch compile & static KV cache (compared to the previous v0.1 release)
3. Mini trained on a larger text encoder, Large trained on both a larger text encoder & decoder
4. Also supports SDPA & Flash Attention 2 for an added speed boost
5. In-built streaming: we provide a dedicated streaming class optimised for time to the first audio
6. Better speaker consistency: more than a dozen speakers to choose from, or create a speaker description prompt and use that
7. Not convinced by a speaker? You can fine-tune the model on your dataset (only a couple of hours would do)

Apache 2.0 licensed codebase, weights and datasets! 🤗

Can't wait to see what y'all would build with this! 🫡

Quick links:

- Model checkpoints: https://huggingface.co/collections/parler-tts/parler-tts-fully-open-source-high-quality-tts-66164ad285ba03e8ffde214c
- Space: https://huggingface.co/spaces/parler-tts/parler_tts
- GitHub Repo: https://github.com/huggingface/parler-tts
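For anyone who wants to try the checkpoints locally, here is a minimal sketch of a single generation call, modeled on the usage pattern in the Parler-TTS README. The checkpoint name `parler-tts/parler-tts-mini-v1`, the helper names `build_description` and `synthesize`, and the wording of the description are illustrative assumptions, not something stated in this thread. The heavy imports sit inside `synthesize`, so the helpers can be defined before the libraries are installed.

```python
def build_description(speaker: str, style: str) -> str:
    """Compose a voice description that names one of the trained speakers."""
    return f"{speaker}'s voice is {style}, with very clear audio and no background noise."


def synthesize(text: str, description: str,
               out_path: str = "parler_tts_out.wav",
               checkpoint: str = "parler-tts/parler-tts-mini-v1",  # assumed checkpoint id
               device: str = "cpu") -> str:
    """Generate speech for `text` in the voice described by `description`."""
    # Imported lazily: needs the `parler-tts`, `transformers` and `soundfile` packages.
    import soundfile as sf
    from parler_tts import ParlerTTSForConditionalGeneration
    from transformers import AutoTokenizer

    model = ParlerTTSForConditionalGeneration.from_pretrained(checkpoint).to(device)
    tokenizer = AutoTokenizer.from_pretrained(checkpoint)

    # The description conditions the voice; the prompt is the text to be spoken.
    input_ids = tokenizer(description, return_tensors="pt").input_ids.to(device)
    prompt_input_ids = tokenizer(text, return_tensors="pt").input_ids.to(device)

    generation = model.generate(input_ids=input_ids, prompt_input_ids=prompt_input_ids)
    audio = generation.cpu().numpy().squeeze()
    sf.write(out_path, audio, model.config.sampling_rate)
    return out_path
```

A call would look like `synthesize("Hey, how are you doing today?", build_description("Jon", "calm and slightly fast"))`. Naming a speaker in the description is what ties the output to one of the trained voices.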

[-]

SituationMan@reddit

Parler is a door nail, as in dead as a.

[-]

muchCode@reddit

In general, how does the generation speed compare to other TTS engines? I use metavoice now with fp16 and it is pretty fast, would consider this if the generation is fast enough

[-]

vaibhavs10@reddit (OP)

Don't have hard comparisons! But we also support torch compile + static KV cache, which makes generation quite fast, especially when paired with streaming.

[-]

kkchangisin@reddit

Have you tried to export to ONNX? ONNX + TensorRT + Triton Inference Server is my favorite "hack" to provide performance at scale. In any case I'll try it myself because I can't resist :). Nice work!

[-]

mekarpeles@reddit

Did you have any luck with this? :)

[-]

ChuckBaggett@reddit

The Hugging Face space is giving me terrible results using the large model with the voice Mike.

[-]

mpasila@reddit

Are multilingual models planned?

[-]

assadollahi@reddit

[https://huggingface.co/parler-tts/parler-tts-mini-multilingual-v1.1](https://huggingface.co/parler-tts/parler-tts-mini-multilingual-v1.1)

[-]

mpasila@reddit

That's cool but I guess I still have to wait for someone to make a decent open TTS for Finnish..

[-]

ZealousidealAir9567@reddit

While trying Parler TTS with a custom dataset of an Indian voice, I am still getting the American accent. How can I reduce the influence of the base model?

[-]

SirLazarusTheThicc@reddit

The quality is impressive with the large version, and the built in audio streaming and modifying output via prompt is very interesting. Unfortunately this is nowhere near real time, but this could be great for situations where pre-rendering is fine.

[-]

vaibhavs10@reddit (OP)

Yes! The demo is non-compiled: we had some issues with Gradio, hence we couldn't put a compiled version up. But the benchmarks are solid, it works!

[-]

redfairynotblue@reddit

The audio length for some outputs seems short. Can you fine-tune this on longer audio? What's the recommended length and size of training data to build upon?

[-]

bihungba1101@reddit

An easy bypass could be breaking the text into sentence chunks and stitching them together.
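A rough sketch of that chunking step, using only the standard library. The function name `split_into_chunks` and the `max_chars` budget are hypothetical choices, and the regex is a naive sentence splitter (it will also split after abbreviations such as "e.g."), so treat it as a starting point rather than a robust tokenizer.

```python
import re


def split_into_chunks(text: str, max_chars: int = 200) -> list[str]:
    """Split text at sentence boundaries, packing sentences into chunks of at most max_chars."""
    # Naive split: break after ., ! or ? when followed by whitespace.
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]

    chunks: list[str] = []
    current = ""
    for sentence in sentences:
        candidate = f"{current} {sentence}".strip()
        if current and len(candidate) > max_chars:
            # Adding this sentence would exceed the budget: close the current chunk.
            chunks.append(current)
            current = sentence
        else:
            # A single over-long sentence is kept whole rather than cut mid-sentence.
            current = candidate
    if current:
        chunks.append(current)
    return chunks
```

Each chunk can then be sent through the model on its own and the resulting audio arrays joined afterwards.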

[-]

randomfoo2@reddit

Do you have some sample code/benchmarks? I tried out the snippets from the inference doc [https://github.com/huggingface/parler-tts/blob/main/INFERENCE.md](https://github.com/huggingface/parler-tts/blob/main/INFERENCE.md) and on my 4090, the compile actually seemed to slow things down, which didn't seem right. I've been working on shaving ms off my [https://github.com/lhl/voicechat2](https://github.com/lhl/voicechat2) response time, so seeing if Parler could run with a good RTF w/ streaming would be super interesting.

[-]

bihungba1101@reddit

Any plan to have a Docker image with API support, sir?

[-]

Bound4OuterSpace@reddit

I've been using this at the HuggingFace Spaces and I'm in love! I've been looking for something just like this for a long time. Using it on Spaces seems to be very limited for my use case... Can anyone help me? Is there a how-to for a complete newb on getting this to run locally on my laptop (Win10, AMD Ryzen 7 5800H, Nvidia RTX 3050ti, 32GB RAM)?

[-]

theCapNemo@reddit

Great tool!!! Does anyone know a similar model for the Spanish language??

[-]

jd_3d@reddit

Where can I find the full list of the 34 voice names, and do you have quick audio samples for them to get an idea of each one?

[-]

yungdrater@reddit

1. Laura 2. Gary 3. Jon 4. Lea 5. Karen 6. Rick 7. Brenda 8. David 9. Eileen 10. Jordan 11. Mike 12. Yann 13. Joy 14. James 15. Eric 16. Lauren 17. Rose 18. Will 19. Jason 20. Aaron 21. Naomie 22. Alisa 23. Patrick 24. Jerry 25. Tina 26. Jenna 27. Bill 28. Tom 29. Carol 30. Barbara 31. Rebecca 32. Anna 33. Bruce 34. Emily

[-]

Hefty_Wolverine_553@reddit

I've been getting a large amount of mispronunciations from basically all types of text, even while using sentence chunking. Approximately every 2-3 sentences will have some sort of mistake, not sure if anyone else is getting the same results.

[-]

chibop1@reddit

Is it compatible with Apple silicon?

[-]

vaibhavs10@reddit (OP)

Yes! Just pass "mps" as the device.
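A small sketch of picking the device string before moving the model, assuming PyTorch is the backend. The helper name `pick_device` is hypothetical; the availability checks themselves are standard PyTorch calls.

```python
def pick_device() -> str:
    """Return "cuda", "mps" or "cpu", depending on what PyTorch reports as available."""
    try:
        import torch
    except ImportError:
        return "cpu"  # PyTorch is not installed, so there is nothing to accelerate

    if torch.cuda.is_available():
        return "cuda"

    # `torch.backends.mps` is only present in PyTorch 1.12 and later.
    mps = getattr(torch.backends, "mps", None)
    if mps is not None and mps.is_available():
        return "mps"
    return "cpu"
```

The result can be passed straight to `model.to(pick_device())` and to the `.to(...)` calls on the tokenized inputs.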

[-]

chibop1@reddit

Awesome, thank you! Is training possible on mps as well? The Colab notebook runs the finetune and inference by pushing the annotated dataset and the finetuned model to Huggingface. Do the scripts for training and inferencing have an option that lets you do everything locally without relying on pushing to Huggingface?

[-]

anfedoro@reddit

Don't expect much. I tried it on an M1 (MBP 13): a ~50-token phrase with a ~60-token description takes about 4 min to generate... useless. Another consumer-grade GPU (RTX A2000) with FA2 installed doesn't perform great either 😞 It's funny that with the plain model it takes 45 sec for the same phrase, while with the compiled one.. tadaaaa!! 240 sec.. not sure how this is possible. (Obviously, I am not counting compilation time, generation only.) Curious whether quantization is a possibility, or would that kill quality?
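To compare plain and compiled runs on equal terms, one option is to measure the real-time factor (generation time divided by the duration of the audio produced) with warm-up runs excluded. The sketch below is a generic timing harness, not part of Parler-TTS: `real_time_factor` is a hypothetical name, and `generate` is assumed to be any zero-argument callable that returns the generated samples.

```python
import time
from typing import Callable, Sequence


def real_time_factor(generate: Callable[[], Sequence[float]], sampling_rate: int,
                     warmup_runs: int = 2, timed_runs: int = 3) -> float:
    """Return seconds of compute per second of audio; values below 1.0 are faster than real time."""
    # Warm-up runs absorb one-off costs such as compilation and cache allocation.
    for _ in range(warmup_runs):
        generate()

    compute_seconds = 0.0
    audio_seconds = 0.0
    for _ in range(timed_runs):
        start = time.perf_counter()
        audio = generate()  # on GPU, `generate` should return CPU data so the work has finished
        compute_seconds += time.perf_counter() - start
        audio_seconds += len(audio) / sampling_rate

    if audio_seconds == 0:
        raise ValueError("generate() produced no audio, so the real-time factor is undefined")
    return compute_seconds / audio_seconds
```

Wrapping the model call in a function that moves the output to the CPU (for example with `.cpu().numpy().squeeze()`) keeps asynchronous GPU work from being left out of the timing.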

[-]

Wonderful-Top-5360@reddit

thank you for thinking of the gpu poors

[-]

Tough_Blueberry2837@reddit

Streaming demo?

[-]

jd_3d@reddit

Do you know if any providers will offer a paid API for this? I assume this could undercut services like ElevenLabs in price by a lot

[-]

Creepy-Muffin7181@reddit

In fact, a lot of open source is now better than ElevenLabs. I don't know who is still buying the ElevenLabs API at that unbelievable price.

[-]

LicoriceDuckConfit@reddit

i've been loosely following this space - I know about tortoise and sovits, what other good open source models are there?

[-]

Creepy-Muffin7181@reddit

There is a TTS leaderboard you can check. I made a post in this Reddit group a month ago and a lot of people replied with their choices.

[-]

Creepy-Muffin7181@reddit

https://www.reddit.com/r/LocalLLaMA/s/NnLJWXCqvZ

[-]

LicoriceDuckConfit@reddit

thanks! thats a great reference!

[-]

iloveloveloveyouu@reddit

!remindMe 1 day

[-]

RemindMeBot@reddit

I will be messaging you in 1 day on 2024-08-10 12:05:10 UTC to remind you of [this link](https://www.reddit.com/r/LocalLLaMA/comments/1encx98/improved_text_to_speech_model_parler_tts_v1_by/lh9gwq9/?context=3).

[-]

TastesLikeOwlbear@reddit

I notice that of the 34 voice names, 33 of them are pretty strongly gender-coded as male or female. The exception, Jordan, produces an extremely male voice. Are there any plans to introduce nonbinary voices into a future version? Or are there examples of prompting the existing voices to produce that type of result?

[-]

LicoriceDuckConfit@reddit

kudos!!! - Great to see this! played with it a bit last evening and it sounds great! will likely try my hand at the fine-tuning tooling over the weekend.

[-]

Ok_Maize_3709@reddit

Really cool! One thing I noticed: it sometimes mixes up prepositions, e.g. “where” was pronounced almost like “with”. I’m not sure why this can happen unless some mapping in the training data is off.

[-]

Evening_Ad6637@reddit

Hey I have tested the hf space examples and the output sounds really great. Even with the small model very realistic. Amazing work guys! So far this is English only, right?

[-]

kI3RO@reddit

> English yes, I've tried it with Spanish and it spew the most lovely gibberish I've ever heard

[-]

Evening_Ad6637@reddit

[-]

shibe5@reddit

Instead of some words, it says other, unrelated words that don't sound similar to the original words at all.

[-]

Pvt_Twinkietoes@reddit

How does it compare to Whisper?

[-]

laterral@reddit

hmmm... I tried this with a paragraph from Benjamin's autobiography. The voice loses its mind after the first sentence and starts speaking like the characters in Magika.

[-]

Darkboy5000@reddit

It is really impressive. 2 questions: Is it compatible with the Hailo8L AI accelerator? Do you plan on adding other languages besides English?

[-]

Rivarr@reddit

Great stuff. Is there any chance of being able to fine-tune this locally? How much vram is required?

[-]

Xanjis@reddit

How does it stack up against tortoise and xttsv2?

[-]

Enough-Meringue4745@reddit

We really need some high fidelity voice cloning

[-]

privacyparachute@reddit

I love that it's actually open source, well done! If I make one suggestion: there are a lot of English TTS options, but very few for other languages. Perhaps that could be something for a future version?

[-]

coder543@reddit

I took a snippet from the HuggingFace README:

> Parler-TTS Large v1 is a 2.2B-parameters text-to-speech (TTS) model, trained on 45K hours of audio data, that can generate high-quality, natural sounding speech with features that can be controlled using a simple text prompt (e.g. gender, background noise, speaking rate, pitch and reverberation).
>
> With Parler-TTS Mini v1, this is the second set of models published as part of the Parler-TTS project, which aims to provide the community with TTS training resources and dataset pre-processing code.

And even using the large model with the default voice description, it only speaks part of the words from the beginning and the end, skipping the middle, and losing coherence. Am I doing something wrong by trying to have it speak a few sentences?

[-]

msbeaute00000001@reddit

Yes, I can confirm this. Both models seem to have this problem. Don't know if the authors can share how they trained this model so we could find the problem.

[-]

ShengrenR@reddit

Think LLM with small context window - the best bet with some of these is to use sentence chunking and batch the gen - then stick them back together. (noteworthy that their streaming gen doesn't batch, so for fastest turnaround you'd likely stream the first sentence (or n tokens) and pray your batch has finished by the end of that sentence.. or get creative and do pairs or the like)

[-]

coder543@reddit

It’s just hard to imagine how you could get results that don’t sound awkwardly stitched together that way.

[-]

ShengrenR@reddit

Agreed, that's a part of the challenge - you don't get the natural pauses that a human speaker would create in between. In my local setup I generally add \~0.3sec of just 0s in the audio array before stitching it all back together.. works reasonably well to my ear, though not dynamic.
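A minimal sketch of that padding step, written with plain Python lists so it runs without extra packages. The name `stitch_with_silence` is hypothetical, and the fixed 0.3 s default simply mirrors the value mentioned above. With NumPy arrays the same idea would use `np.zeros` for the pause and `np.concatenate` for the join.

```python
from typing import Sequence


def stitch_with_silence(chunks: Sequence[Sequence[float]], sampling_rate: int,
                        pause_seconds: float = 0.3) -> list[float]:
    """Join per-sentence audio chunks, inserting a fixed run of zero samples between them."""
    pause = [0.0] * int(round(pause_seconds * sampling_rate))

    stitched: list[float] = []
    for index, chunk in enumerate(chunks):
        if index > 0:
            stitched.extend(pause)  # silence goes between chunks, not before the first one
        stitched.extend(chunk)
    return stitched
```

Because the pause is constant, it will not adapt to punctuation; varying `pause_seconds` per boundary (longer after a full stop than after a comma, say) is one way to make the result sound less mechanical.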

[-]

Severin_Suveren@reddit

That guy who casually innovated with text2speech when he made that HAL-repo made it work fine. I didn't study his code or anything, but looking through it he seems to have programmatically added multiple different tones of speech to make the final output seem more natural

[-]

NickUnrelatedToPost@reddit

Nice! Thank you! That's just what I currently need!

[-]

ShengrenR@reddit

Hurray! I've been watching that space enough, hoping for the v1, that my browser started saving the link to its suggestions lol. Great work HF folks! One Q: In your docs you have "To ensure speaker consistency across generations, this checkpoint was also trained on 34 speakers, characterized by name (e.g. Jon, Lea, Gary, Jenna, Mike, Laura)", but I imagine folks don't want 'e.g.' but a dictionary :) or is it a game for us to guess and check haha Does not look to be documented in the repo at least: [https://github.com/search?q=repo%3Ahuggingface%2Fparler-tts%20Gary&type=code](https://github.com/search?q=repo%3Ahuggingface%2Fparler-tts%20Gary&type=code)

[-]

LMLocalizer@reddit

Here is the full list of speakers, extracted from [https://huggingface.co/datasets/ylacombe/parler-tts-mini-v1-a\_speaker\_similarity](https://huggingface.co/datasets/ylacombe/parler-tts-mini-v1-a_speaker_similarity):

Laura, Gary, Jon, Lea, Karen, Rick, Brenda, David, Eileen, Jordan, Mike, Yann, Joy, James, Eric, Lauren, Rose, Will, Jason, Aaron, Naomie, Alisa, Patrick, Jerry, Tina, Jenna, Bill, Tom, Carol, Barbara, Rebecca, Anna, Bruce, Emily

[-]

ShengrenR@reddit

Thanks!

[-]

ShengrenR@reddit

Also.. I see it uses RoPE for embedding.. have you tried the usual LLM context extend tricks to see how it behaves for long passages?

[-]

bigattichouse@reddit

"Systems online" seems to produce some weird pronunciations for "online" no matter what I enter.

[-]

coder543@reddit

That’s an… [interesting](https://en.wikipedia.org/wiki/Parler).. choice of name.

[-]

RenoHadreas@reddit

That’s funny but “Parler” in French means to speak so it’s quite fitting for a TTS model

[-]

vaibhavs10@reddit (OP)

Exactly, the name originates from our french roots 🇫🇷

[-]

Evening_Ad6637@reddit

You French people are really nailing the DL field :D

[-]

hyperdynesystems@reddit

This is great!

[-]

artificial_genius@reddit

This looks really cool, I haven't seen a mention of how it deals with driving emotions of the voice but in the GitHub it shows an example of how you can prompt for the style of speech. Haven't tried it yet but it looks very promising. You couldn't prompt the style in xtts. Can we make the voices angry, yell, use disrespectful tones? It looks possible from the GitHub.

[-]

vaibhavs10@reddit (OP)

Just tried it, it works for me 🤗

[-]

Few_Painter_5588@reddit

Yeah, I tried it now and waited for a bit, it works pretty well!

[-]

coder543@reddit

The quality does seem very impressive, just from first impressions.