Insert pauses into text file for kokoro

Posted by dts-five@reddit | LocalLLaMA | 24 comments

Before discussions were taken offline, I had looked at this page:
Adding defined period pauses to the input text file

Someone made a GitHub repo that incorporates pauses:
https://github.com/vijay120/kokoro-tts

But the original thread also included another method, something like inserting ...,...,... into the text. Someone replied that it makes a sighing noise, and someone else posted a workaround that removed the sighing noise while keeping the pause.

Does anyone know how to do that, or has anyone seen that old discussion thread and remembers the method?

[-]

Expert-Package-3324@reddit

Late to the game but you could use [pause:2.5s] for instance.
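
If the frontend you use does not understand this tag itself, one workaround is to split the text on the tags, synthesize each piece separately, and put silence of the given length between the pieces. Below is a minimal sketch of the splitting step only. It assumes the tag always has the form [pause:<seconds>s], and `split_on_pauses` is a hypothetical helper, not part of Kokoro:

```python
import re

# Matches tags such as [pause:2.5s] or [pause:3s]; the format is assumed from the example above.
PAUSE_TAG = re.compile(r"\[pause:(\d+(?:\.\d+)?)s\]")

def split_on_pauses(text):
    """Return ("text", str) and ("pause", seconds) items in reading order."""
    parts = []
    pos = 0
    for match in PAUSE_TAG.finditer(text):
        chunk = text[pos:match.start()].strip()
        if chunk:
            parts.append(("text", chunk))
        parts.append(("pause", float(match.group(1))))
        pos = match.end()
    tail = text[pos:].strip()
    if tail:
        parts.append(("text", tail))
    return parts

segments = split_on_pauses("Hello. [pause:2.5s] My name is John.")
```

Each "text" item would go to the TTS engine and each "pause" item would become a stretch of silence in the final audio.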

[-]

Francis_Canada@reddit

OMG guys this does work! thank you so much!!

[-]

Kopultana@reddit

I think this kinda works for a small pause. I couldn't find a way to make the pause longer yet.

;-

Try these and see the difference:

Hello. My name is John.
Hello. ;- my ;- name ;- is ;- John.
Hello. ;-
my ;- 
name ;-
is ;-
John.

[-]

Beautiful-Ad-7568@reddit

This actually works amazingly well. THANK YOU!
Now just to go back and edit 20 chapters/3hrs of my book I'm working on...
(I think a find and replace operation is about to be heavily abused... lol)
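
For anyone in the same spot, the find and replace could be scripted. This is a rough sketch, assuming the goal is to add the ";-" marker after every sentence-ending punctuation mark. `add_pause_markers` is a made-up name, and the pattern is naive, so abbreviations like "Mr." will also get a marker:

```python
import re

def add_pause_markers(text, marker=";-"):
    """Append the marker after ., ! or ? when followed by whitespace or end of text."""
    return re.sub(
        r"([.!?])(\s+|$)",
        lambda m: f"{m.group(1)} {marker}{m.group(2)}",
        text,
    )

result = add_pause_markers("Hello. My name is John.")
```

Running it on a copy of one chapter first is probably wise, since the effect seems to vary by voice.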

[-]

toolsavvy@reddit

These didn't do anything for me and adding more than two ";-" results in unwanted noises. I've seen many work-arounds like this and none worked for me. I wonder if it has to do with me using CPU instead of GPU?

[-]

Kopultana@reddit

It's just a small pause. I don't know what you get with using CPU but here's what I generated: https://vocaroo.com/1fBHNlvzH8AY

[-]

toolsavvy@reddit

Thanks for the audio example. I just got some time to test things out again myself and I found that your method does seem to work for American voices but not for British ones, which is what I mainly use because to me the American ones sound too much like voices in brainwashing videos lol. (I have been preferring bf_isabella with a low weight on af_nicole to soften isabella's voice a bit). But unfortunately the British ones are harder to control. They "sigh" a lot when you introduce random characters to try to introduce a pause. They also sigh in other cases where the American voices don't.

Anyhow, TIL effective methods for controlling voices in Kokoro can depend on the voice and likely the language chosen.

[-]

dts-five@reddit (OP)

That is the best variant I've tried. Thanks. Probably good enough for the paragraph breaks. Commercial variants are usually 3 second gaps, but I haven't been able to achieve that yet.

This works for a bit longer: ;-,;-,;-,;-,;-,;-
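
To use this only at paragraph breaks, the repeated marker could be generated and appended to each paragraph. This is a sketch that assumes paragraphs are separated by blank lines. The function names are invented for illustration, and the number of repeats that works without noise likely depends on the voice:

```python
def long_pause(repeats=6, marker=";-"):
    """Build the comma-joined marker string, e.g. ';-,;-,;-' for repeats=3."""
    return ",".join([marker] * repeats)

def mark_paragraph_breaks(text, repeats=6):
    """Append the long pause marker to every paragraph except the last."""
    paragraphs = [p.strip() for p in text.split("\n\n") if p.strip()]
    pause = long_pause(repeats)
    return f" {pause}\n\n".join(paragraphs)

marked = mark_paragraph_breaks("First paragraph.\n\nSecond paragraph.")
```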

[-]

Suitable-Analysis321@reddit

Try a period in quotes: ".". It works very well in my attempts with Kokoro.

[-]

KindRazzmatazz8490@reddit

Try my method below. Split the text into sentences wrapped within double quotes:
```
"Hello!"
"How are you today?"
"I am your AI assistant."
```
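
For longer texts, the wrapping could be automated. This is a small sketch that splits on sentence-ending punctuation and puts each sentence on its own line in double quotes. The splitting rule is a simple assumption and will also split after abbreviations:

```python
import re

def quote_sentences(text):
    """Put each sentence on its own line, wrapped in double quotes."""
    # Split at whitespace that follows ., ! or ? (the punctuation stays with the sentence).
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    return "\n".join(f'"{s}"' for s in sentences if s)

quoted = quote_sentences("Hello! How are you today? I am your AI assistant.")
```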

[-]

ramius124@reddit

I'm using the FastKoKo docker implementation and I've had ZERO luck getting it to acknowledge punctuation or pauses regardless of the method used.

[-]

Electronic_Bee_7485@reddit

I know it's not the answer but I've used the silence tool on this web audio editor for the parts that needed it. https://audiomass.co/
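
The same kind of fix can be scripted if you generate one WAV file per sentence or paragraph. This is a sketch using only Python's standard `wave` module. It assumes all inputs are 16-bit PCM WAV files with the same sample rate and channel count, and the file names in the usage note are placeholders:

```python
import wave

def join_with_silence(input_paths, output_path, gap_seconds=3.0):
    """Concatenate WAV files, inserting gap_seconds of silence between them."""
    with wave.open(input_paths[0], "rb") as first:
        params = first.getparams()
    frame_size = params.sampwidth * params.nchannels
    # Zero bytes are silence for signed PCM (16-bit assumed here).
    silence = b"\x00" * (int(params.framerate * gap_seconds) * frame_size)
    with wave.open(output_path, "wb") as out:
        out.setnchannels(params.nchannels)
        out.setsampwidth(params.sampwidth)
        out.setframerate(params.framerate)
        for index, path in enumerate(input_paths):
            if index > 0:
                out.writeframes(silence)
            with wave.open(path, "rb") as src:
                out.writeframes(src.readframes(src.getnframes()))
```

For example, `join_with_silence(["para1.wav", "para2.wav"], "chapter.wav")` would write both paragraphs with a 3 second gap between them.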

[-]

toolsavvy@reddit

What I don't understand is this: since Kokoro FastAPI and other variants use espeak-ng, and since espeak-ng allows the use of SSML tags like <break>, there must be a way to get Kokoro FastAPI to use these tags.

But maybe I'm wrong. Admittedly I have little to no understanding of these TTS models.

For espeak-ng, SSML is enabled by adding the "-m" flag on the command line. This tells espeak-ng to stop reading "<" and ">" as literal text and to honor any SSML commands inside them (well, whichever ones espeak-ng recognizes).

Example Linux command for espeak-ng to insert pauses:

 espeak-ng -m "Hello world.<break time='5000ms'/>Nice to be here."
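
The same call can be made from Python, which may help when testing many pause lengths. This is only a sketch: it drives espeak-ng directly and does nothing for Kokoro itself. The helper names are made up, and the only espeak-ng option used is "-m":

```python
import shutil
import subprocess

def ssml_with_break(before, after, pause_ms):
    """Join two sentences with an SSML <break> of pause_ms milliseconds."""
    return f"{before}<break time='{pause_ms}ms'/>{after}"

def speak_ssml(ssml):
    """Speak SSML with espeak-ng if it is installed; -m turns on markup handling."""
    if shutil.which("espeak-ng") is None:
        return False  # espeak-ng is not on PATH, so nothing is spoken
    subprocess.run(["espeak-ng", "-m", ssml], check=True)
    return True

ssml = ssml_with_break("Hello world.", "Nice to be here.", 5000)
```

Calling `speak_ssml(ssml)` should then play the two sentences with a 5 second gap, provided espeak-ng is installed.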

There must be a script somewhere in Kokoro FastAPI where you can enter a few lines to force it to recognize the espeak "-m" flag for all output, thereby enabling the use of SSML tags in your input text.

I've tried quite a few things to make these SSML tags work in the various Kokoro variants that use espeak-ng, but none have worked. But I don't know enough to bother trying to modify code for this purpose.

Without the ability to add pauses, creating audiobooks that are worth listening to is a chore and a half because it would require a lot of editing of the audio output in Audacity or similar. It would take at least as much time, if not more, than just recording myself reading it when you consider all the other editing needed to get Kokoro to pronounce words and sentences somewhat properly and make them flow. I edited literally half of a 30+ chapter book so that it sounds somewhat natural with Kokoro and that took me days. But then I realized I have this pause issue I can't seem to lick lol, so I had to give up for now.

[-]

dts-five@reddit (OP)

Thanks for the background info and research into this. I hate that the discussions were disabled on HF. It seems like a bunch of useful info and discussions are now just gone.

[-]

toolsavvy@reddit

Well IIRC there is also a thread on GitHub about inserting pauses with Kokoro and it went nowhere. No one really knows how to do it properly with Kokoro, assuming there is even a way. Also, IIRC hexgrad said that he'd look into adding SSML support but that it's not going to be a priority and that it will take a great deal of effort.

My experience in the modern TTS world thus far is that even when SSML is incorporated, it's usually limited and the <break> tag is usually left out. I don't know why, but for some reason people who make TTS scripts and software have this weird idea that <break> is not useful lol.

[-]

WorkingStart6808@reddit

Are there any updates here? Did you find a way to insert pauses?

[-]

toolsavvy@reddit

Nope

[-]

dts-five@reddit (OP)

Seems like an obvious need to me. Maybe most people aren’t down in the weeds and are fine with “good enough.”

[-]

toolsavvy@reddit

That and others don't mind paying for a somewhat proper TTS.

[-]

toolsavvy@reddit

BTW: I did try using the vijay120 variant. Although the PAUSE_x tags do work, the resulting audio is severely degraded when using any pause tags with MP3 output, but much better with WAV output. That's doable since I can easily convert all my WAV files to MP3 fairly quickly.

But the big problem is that if I use any pause tags then the "--split output" function doesn't work, which is important for me because without it you can't do even a full chapter of a book, let alone a full book. So this variant is pretty useless to me.

[-]

defiantnd@reddit

Are you using the webui for Kokoro? I am, and I'm wondering how to add pauses in it as well.

[-]

defiantnd@reddit

Just as a followup, I had ChatGPT rewrite a paragraph, adding extra periods at the end of each sentence and hard returns to put each sentence on a separate line. It seems to be much better that way, so it does seem to recognize some punctuation as pauses.

[-]

dts-five@reddit (OP)

I've been using this variant during testing:
https://github.com/remsky/Kokoro-FastAPI

[-]

Effective_Degree2225@reddit

Can't access the text file you shared.