Insert pauses into text file for kokoro
Posted by dts-five@reddit | LocalLLaMA | 24 comments
Before discussions were taken offline, I had looked at this page:
Adding defined period pauses to the input text file
Someone made a GitHub repo that incorporates pauses:
https://github.com/vijay120/kokoro-tts
But the original thread included another way similar to inserting ...,...,... or something like that. Someone responded by saying that it makes a sighing noise, and someone posted a workaround that took away the sighing noise, but the pause was still there.
Does anyone know how to do that, or has anyone seen that old discussion thread and remembers the method?
Expert-Package-3324@reddit
Late to the game but you could use [pause:2.5s] for instance.
Francis_Canada@reddit
OMG guys this does work! thank you so much!!
Kopultana@reddit
I think this kinda works for a small pause. I couldn't find a way to make the pause longer yet.
;-
Try these and see the difference:
Beautiful-Ad-7568@reddit
This actually works amazingly well. THANK YOU!
Now just to go back and edit 20 chapters/3hrs of my book I'm working on...
(I think a find and replace operation is about to be heavily abused... lol)
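For the bulk find-and-replace mentioned above, a minimal Python sketch follows. It assumes paragraphs in the source text are separated by blank lines, and it puts a marker between them. The function name `add_pause_markers` is made up for illustration, and the default marker is simply the `[pause:2.5s]` string suggested earlier in the thread; whether a given Kokoro front end honors it is not verified here, so swap in whichever marker works for your setup.

```python
import re

def add_pause_markers(text: str, marker: str = "[pause:2.5s]") -> str:
    """Put `marker` on its own line between blank-line-separated paragraphs."""
    # Split on runs of blank lines and drop empty fragments.
    paragraphs = [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]
    # Rejoin the paragraphs with the marker standing in for each paragraph break.
    return f"\n{marker}\n".join(paragraphs)

# Example use on a chapter file (file names are placeholders):
# with open("chapter01.txt", encoding="utf-8") as f:
#     marked = add_pause_markers(f.read())
# with open("chapter01_paused.txt", "w", encoding="utf-8") as f:
#     f.write(marked)
```

Writing to a new file rather than overwriting the original makes it easy to try a different marker later.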
toolsavvy@reddit
These didn't do anything for me and adding more than two ";-" results in unwanted noises. I've seen many work-arounds like this and none worked for me. I wonder if it has to do with me using CPU instead of GPU?
Kopultana@reddit
It's just a small pause. I don't know what you get with using CPU but here's what I generated: https://vocaroo.com/1fBHNlvzH8AY
toolsavvy@reddit
Thanks for the audio example. I just got some time to test things out again myself and I found that your method does seem to work for American voices but not for British ones, which is what I mainly use because to me the American ones sound too much like voices in brainwashing videos lol. (I have been preferring bf_isabella with a low weight on af_nicole to soften isabella's voice a bit.) But unfortunately the British ones are harder to control. They "sigh" a lot when you introduce random characters to try to introduce a pause. They also sigh in other cases where the American voices don't.
Anyhow, TIL effective methods for controlling voices in Kokoro can depend on the voice and likely the language chosen.
dts-five@reddit (OP)
That is the best variant I've tried. Thanks. Probably good enough for the paragraph breaks. Commercial variants are usually 3 second gaps, but I haven't been able to achieve that yet.
This works for a bit longer:
;-,;-,;-,;-,;-,;-
Suitable-Analysis321@reddit
Try a period mark in quotes: ".". It works very well in my attempts at using Kokoro.
KindRazzmatazz8490@reddit
Try my method below. Split the text into sentences wrapped within double quotes:
```
"Hello!"
"How are you today?"
"I am your AI assistant."
```
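The quoting step above can be scripted. This is a rough sketch, not the commenter's own tooling: the function name `quote_sentences` is hypothetical, and the sentence splitter is deliberately naive (it breaks after `.`, `!` or `?` followed by whitespace), so abbreviations such as "Dr." will be split in the wrong place.

```python
import re

def quote_sentences(text: str) -> str:
    """Return `text` with each sentence on its own line, wrapped in double quotes."""
    # Naive split: a sentence ends at ., ! or ? followed by whitespace.
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    return "\n".join(f'"{s}"' for s in sentences if s)

print(quote_sentences("Hello! How are you today? I am your AI assistant."))
```

For book-length text, a proper sentence tokenizer would be a safer choice than the regular expression used here.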
ramius124@reddit
I'm using the FastKoKo docker implementation and I've had ZERO luck getting it to acknowledge punctuation or pauses regardless of the method used.
Electronic_Bee_7485@reddit
I know it's not the answer but I've used the silence tool on this web audio editor for the parts that needed it. https://audiomass.co/
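The same post-processing idea can be automated instead of done by hand in an editor. The sketch below is an assumption-laden illustration, not a Kokoro feature: it assumes each paragraph or chapter was synthesized to its own 16-bit PCM WAV file with the same sample rate and channel count, and it joins those files with a fixed gap of silence using only the Python standard library. The name `join_with_silence` is made up for this example.

```python
import wave

def join_with_silence(wav_paths, out_path, gap_seconds=3.0):
    """Concatenate WAV files, inserting `gap_seconds` of silence between them."""
    if not wav_paths:
        raise ValueError("need at least one input file")
    fmt = None
    chunks = []
    for path in wav_paths:
        with wave.open(path, "rb") as w:
            p = w.getparams()
            this_fmt = (p.nchannels, p.sampwidth, p.framerate)
            if fmt is None:
                fmt = this_fmt
            elif this_fmt != fmt:
                raise ValueError(f"{path} has a different audio format")
            chunks.append(w.readframes(p.nframes))
    nchannels, sampwidth, framerate = fmt
    # Zero bytes are silence for signed 16-bit PCM (assumed here);
    # 8-bit WAV is unsigned and would need 0x80 bytes instead.
    gap_frames = int(gap_seconds * framerate)
    silence = b"\x00" * (gap_frames * nchannels * sampwidth)
    with wave.open(out_path, "wb") as out:
        out.setnchannels(nchannels)
        out.setsampwidth(sampwidth)
        out.setframerate(framerate)
        out.writeframes(silence.join(chunks))
```

Because the gap is inserted after synthesis, its length is exact and does not depend on how a particular voice reacts to punctuation tricks.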
toolsavvy@reddit
What I don't understand is this: since Kokoro FastAPI and other variants use espeak-ng, and since espeak-ng allows the use of SSML tags, like, then there must be a way to get Kokoro FastAPI to be able to use these tags.
But maybe I'm wrong. Admittedly I have little to no understanding of these TTS models.
For espeak-ng, SSML is enabled by adding the "-m" flag on the command line; this tells espeak to treat text between "<" and ">" as markup rather than speaking it, and to honor any SSML commands inside (well, whichever ones espeak recognizes).
Example linux command for espeak to insert pauses:
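As an illustration only (this is not the commenter's original command), here is a small Python sketch that builds such an espeak-ng invocation. It assumes espeak-ng is installed and uses the SSML `<break>` element with a `time` attribute; the helper name `espeak_ssml_command` is hypothetical. Note that this drives espeak-ng directly and, as the thread reports, does not make Kokoro honor the tags.

```python
import shlex
from xml.sax.saxutils import escape

def espeak_ssml_command(parts, pause="2s", wav_path=None):
    """Build an espeak-ng argv that speaks `parts` with an SSML break between them."""
    # Escape &, < and > in the text so only our own tags are read as markup.
    ssml = f' <break time="{pause}"/> '.join(escape(p) for p in parts)
    cmd = ["espeak-ng", "-m"]        # -m: interpret SSML markup in the input
    if wav_path:
        cmd += ["-w", wav_path]      # -w: write a WAV file instead of playing audio
    cmd.append(ssml)
    return cmd

cmd = espeak_ssml_command(["First sentence.", "Second sentence."])
print(shlex.join(cmd))               # shell-quoted form, ready to paste into a terminal
# To run it from Python (requires espeak-ng on PATH):
# import subprocess; subprocess.run(cmd, check=True)
```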
There must be a script somewhere in Kokoro FastAPI where you can enter a few lines to force it to recognize the espeak "-m" flag for all output, thereby enabling the use of SSML tags in your input text.
I've tried quite a few things to make these SSML tags work in the various Kokoro variants that use espeak-ng, but none have worked. But I don't know enough to bother trying to modify code for this purpose.
Without the ability to add pauses, creating audiobooks that are worth listening to is a chore and a half because it would require a lot of editing of the audio output in Audacity or similar. It would take at least as much time, if not more, than just recording myself reading it when you consider all the other editing needed for Kokoro to make words and sentences be pronounced and flow somewhat properly. I edited literally half of a 30+ chapter book so that it sounds somewhat natural with Kokoro and that took me days. But then I realized I have this pause issue I can't seem to lick lol, so I had to give up for now.
dts-five@reddit (OP)
Thanks for the background info and research into this. I hate that the discussions were disabled on HF. It seems like a bunch of useful info and discussions are now just gone
toolsavvy@reddit
Well, IIRC there is also a thread on GitHub about inserting pauses with Kokoro and it went nowhere. No one really knows how to do it properly with Kokoro, assuming there is even a way. Also, IIRC hexgrad said that he'd look into adding SSML support but that it's not going to be a priority and that it will take a great deal of effort.
My experience in the modern TTS world thus far is that even when SSML is incorporated, it's usually limited and the tag is usually left out. I don't know why but for some reason people who make TTS scripts and software have this weird idea that is not useful lol.
WorkingStart6808@reddit
Are there any updates here? Could you find a way to insert pauses?
toolsavvy@reddit
Nope
dts-five@reddit (OP)
Seems like an obvious need to me. Maybe most people aren’t down in the weeds and are fine with “good enough.”
toolsavvy@reddit
That and others don't mind paying for a somewhat proper TTS.
toolsavvy@reddit
BTW: I did try using the vijay120 variant. Although the PAUSE_x tags do work, the resulting audio is severely degraded when using any pause tags with MP3 output, but much better with WAV output. That's doable since I can easily convert all my WAV files to MP3 fairly quickly.
But the big problem is that if I use any pause tags then the "--split output" function doesn't work, which is important for me because without it you can't do even a full chapter of a book, let alone a full book. So this variant is pretty useless to me.
defiantnd@reddit
Are you using the webui for Kokoro? I am, and I'm wondering how to add pauses in it as well.
defiantnd@reddit
Just as a followup, I had chatgpt write a paragraph adding extra periods at the end of each sentence, and added hard returns to make each sentence on a separate line. It seems to be much better that way, so it does seem to recognize some punctuation as pauses.
dts-five@reddit (OP)
I've been using this variant during testing:
https://github.com/remsky/Kokoro-FastAPI
Effective_Degree2225@reddit
Can't access the text file you shared.