DramaBox - Most Expressive Voice model ever based on LTX 2.3

Posted by manmaynakhashi@reddit | LocalLLaMA | 37 comments

The Most Expressive Voice Model.

Github: https://github.com/resemble-ai/DramaBox

HF Model: https://huggingface.co/ResembleAI/Dramabox

HF Space: https://huggingface.co/spaces/ResembleAI/Dramabox

[-]

Jeidoz@reddit

I am dumb dumb and the GitHub README is not enough for me to run the project. Can someone share more detailed instructions? I suppose I may need to install some Python dependencies, download the models and put them somewhere, and toggle CUDA 13 usage?

[-]

toothpastespiders@reddit

I haven't tried it yet, but I'm always excited for this kind of thing just on a practical level for people with cancer or similar issues. People really don't get how horrible it is to have something so personal stolen by the thing killing you. It's not just about being able to say something out loud. It's about the personal nature of it being "your" voice, another thing that makes you who you are, being taken. Being able to clone your voice before it's lost, or even reclaim it from old recordings, can be such a huge win just in terms of quality of life.

[-]

markeus101@reddit

Always happy to see new open source TTS. Would be nice if they could run on edge devices, but I think if something like that existed it wouldn't be open source.

[-]

EndlessZone123@reddit

It feels like we've hit 95% likeness, but the audio is still robotic and low quality.

[-]

Euchale@reddit

Yeah, I feel like I am taking crazy pills when I hear people say this sounds great. It's still far too echo-y.

[-]

manmaynakhashi@reddit (OP)

use better reference audio and you'll get better sound, it can do voice cloning.

[-]

HelpfulHand3@reddit

It's based on LTX, so it's going to sound bad even if it is expressive.
Nothing to do with the reference voice - this is how the audio in the videos sounds too.

[-]

ghulamalchik@reddit

People talk about the fidelity, the expressiveness, you're talking about the quality. Both are true.

[-]

RAZA_2666R@reddit

Finally an open model that actually emotes like a real person

[-]

silenceimpaired@reddit

Sounds like it was trained on the cartoon Joker from the Batman series.

[-]

Sanity_N0t_Included@reddit

Luke Skywalker?

[-]

manmaynakhashi@reddit (OP)

Nope, it's just a reference for voice cloning. It's based on LTX 2.3, so I think the base model might have been trained on that; I have just repurposed it for audio only.

[-]

ShawnnSmuts90@reddit

close.. not there though

[-]

TheGoddessInari@reddit

Huh. Random Conan.

[-]

Guinness@reddit

/r/gonewildaudio (NSFW) would fucking love this.

[-]

dyeusyt@reddit

Sounds perfect for indie game devs to use in their games.

[-]

Salt-Powered@reddit

Why would people who famously put their soul into their art use a soulless machine in their creation? AAA studios for sure though.

[-]

wntersnw@reddit

Maybe finding and hiring voice actors, negotiating rates, licensing, budgeting, etc. feels more soulless than just creating what they want on their computer?

[-]

o5mfiHTNsH748KVq@reddit

Those types of people are vocal, but not the majority of developers. You find more of that mentality around the indie dev scene and people that hold their products on a pedestal.

Anyway, the future of gaming is dynamic content on demand. Not all games, but this is an emerging genre.

[-]

Sixhaunt@reddit

lots of indie game devs have more specialized skillsets or enjoy certain aspects of game development more than others and so they would prefer to automate away the annoying parts and focus on the creative parts that they enjoy doing. Many people have a vision for something but not every single skillset required to pull it off.

[-]

iMakeSense@reddit

They say, typing on the soulless machine they're on, on this soulless website

[-]

polawiaczperel@reddit

Costs

[-]

manmaynakhashi@reddit (OP)

To save studio money?

[-]

o5mfiHTNsH748KVq@reddit

Don’t let gamers see this comment, they’ll fucking panic.

[-]

Disposable110@reddit

No one has 24GB of free VRAM though, especially not when running a game on the side that already wants at least half of that.

[-]

manmaynakhashi@reddit (OP)

You can run it on 8 GB of VRAM. For an indie game you can generate the audio and use it in the game, not literally run the model inside the game lmao

[-]

Xp_12@reddit

they were probably thinking this was the other project. fwiw I got scenema running on a 5060ti 16gb, but it didn't sound great at int8 and was slow with CPU offload. I'll give your project a go.

[-]

manmaynakhashi@reddit (OP)

Lots of use cases, more on the creative side than the agentic side.

[-]

polawiaczperel@reddit

I remember your first post a while ago. Thanks for the code.

[-]

rm-rf-rm@reddit

Yeah, grateful as well. I wanted to find this again, and given how many new projects I come across every day, I had no idea how I was going to find it, even in my GitHub stars.

[-]

manmaynakhashi@reddit (OP)

thank you for supporting.

[-]

ghulamalchik@reddit

Impressive fidelity, bad quality. I wish it didn't sound like they're speaking through a pipe.

[-]

Genebra_Checklist@reddit

Is it community only, or can we use it for monetized projects?

[-]

manmaynakhashi@reddit (OP)

I don't know. It's based on LTX 2.3, so I have to apply the same license they specify. I think you will be fine until you hit 10M; you can refer to the license. Not legal advice.

[-]

addictiveboi@reddit

This is AWESOME. I thought when I used LTX a couple of months ago "this has way better voice acting than TTS engines". You guys are awesome for actually creating this, and the fact that you have voice cloning as well is just mind blowing to me. Gonna download this and try it in a little bit!!!

[-]

EveningIncrease7579@reddit

What about scenema audio? Is this lighter?

[-]

manmaynakhashi@reddit (OP)

Yes, much lighter. If you offload the Gemma model, you can do inference in under 8 GB of VRAM.