MKU64@reddit
Love it! Are you open-sourcing it?
Lonligrin@reddit (OP)
Prob making a SaaS only
Captain_Pumpkinhead@reddit
https://i.redd.it/lhispe68brde1.gif
Evening_Ad6637@reddit
Okay, so why are you showing us this stuff then? Remember, this is localllama.
And to be clear, it is totally fine to make a SaaS and make money with it, but why not give others the opportunity to see the source code and/or to host it themselves for personal use?
Lonligrin@reddit (OP)
I wanted to show it because I thought it's cool and new. I haven't fully made up my mind about open-sourcing this yet. Over the last two years I made so many open-source contributions and published countless scripts. I even shared how the entire algorithm works here.
But the moment I don’t immediately hand over the full code as well, when I try to somehow pay the bills for me and my dog, I get downvotes. After all the projects I've already offered for free, that feels unfair and is disappointing.
TheRealMasonMac@reddit
Understandable. It would be nice, however, if one could pay to get it local instead of a SaaS.
az226@reddit
You’re posting in the wrong community.
0xTech@reddit
It supports Ollama
Evening_Ad6637@reddit
I see, I see. I'm sorry about the downvotes and the financial burden or disappointments you have experienced. I hope that you can secure a good income with this project. But don't get me wrong, I still don't quite understand why one would exclude the other. I mean the question seriously, because I've never gotten that far myself, and maybe I'm too naive about the concept of open source while still making money at the same time. But personally, for example, I can tell you that I am very happy to pay for open-source software, and a large part of my monthly spending goes toward software that is open source.
Enough-Meringue4745@reddit
Other guy is right, doesn’t belong here.
Lonligrin@reddit (OP)
Ok, some details on how this works:
- I start by feeding in a few audio files of known speakers. These form the "voiceprints" that the algo uses for comparison. This is how I start the script:
```bash
python realtime_text.py --speaker_bases kolja.wav lasinya.wav male.wav female.wav winkens.wav kinski.wav bully.wav arthur.wav
```
- As audio comes in, VAD detects speech. The algo takes two types of voice embeddings with different providers (pyannote/embeddings, ecapa and resemblyzer):
  - An embedding of all audio from the start of the current sentence.
  - An embedding of a rolling window (~3 seconds) of the most recent audio, updated every few hundred milliseconds.
- Both embeddings are compared with the embeddings of the known speakers: 1) identifies the current speaker fast and 2) catches speaker changes. If 2) finds a match for a different speaker than the one currently identified, it flags a speaker turn (or interruption). Cosine similarity is used to compare embeddings. If the EMA-3 of the summed similarity scores from all embedding providers passes a threshold, the speaker is considered a match (a rough sketch of this step follows below).
It's a solution for situations with a known, fixed group of speakers, without the need for heavy clustering algorithms.
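Roughly, that matching step could look like the following. This is a minimal sketch, where the threshold, the EMA weight, and the embedding inputs are placeholder assumptions rather than the actual implementation:

```python
# Minimal sketch of the matching step: cosine similarity per provider,
# summed, then smoothed with an EMA before thresholding.
# The threshold and EMA weight are placeholder assumptions.
import numpy as np


def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))


class SpeakerMatcher:
    def __init__(self, voiceprints: dict, threshold: float = 1.8, alpha: float = 0.5):
        # voiceprints: speaker name -> list of reference embeddings, one per provider
        self.voiceprints = voiceprints
        self.threshold = threshold  # placeholder; tune on real audio
        self.alpha = alpha          # EMA weight, roughly an "EMA-3"
        self.ema = {name: None for name in voiceprints}

    def update(self, current: list) -> str | None:
        """current: embeddings of the incoming audio, one per provider."""
        best_name, best_score = None, float("-inf")
        for name, refs in self.voiceprints.items():
            summed = sum(cosine_similarity(c, r) for c, r in zip(current, refs))
            prev = self.ema[name]
            self.ema[name] = summed if prev is None else self.alpha * summed + (1 - self.alpha) * prev
            if self.ema[name] > best_score:
                best_name, best_score = name, self.ema[name]
        return best_name if best_score >= self.threshold else None
```

Both the sentence-start embedding and the rolling-window embedding would feed a matcher like this; a rolling-window match for a different speaker than the current one is what gets flagged as a turn.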
ServeAlone7622@reddit
Hmm 🤔 I worked on a court reporting AI and solved the diarization issue in much the same way.
The major difference is each timestamped 500ms slice is put in an “unknown” group and each member of the unknown group is constantly compared against known speakers. If it matches a known speaker it is assigned to that speaker.
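In rough code terms, that unknown-group pass might look like this. It is a sketch only, not the actual court-reporting system; `embed()` and the similarity threshold are placeholders:

```python
# Sketch of the unknown-group bookkeeping described above: each unattributed
# 500ms slice is compared against the known speakers and assigned on a match.
# embed() and the threshold are placeholders.
import numpy as np


def assign_slices(unknown_slices, known_speakers, embed, threshold=0.75):
    """
    unknown_slices: list of (timestamp, audio_500ms) tuples not yet attributed.
    known_speakers: dict of name -> reference embedding.
    embed: function mapping an audio slice to an embedding vector.
    Returns (assignments, still_unknown).
    """
    assignments, still_unknown = [], []
    for ts, audio in unknown_slices:
        vec = embed(audio)
        best_name, best_sim = None, -1.0
        for name, ref in known_speakers.items():
            sim = float(np.dot(vec, ref) / (np.linalg.norm(vec) * np.linalg.norm(ref)))
            if sim > best_sim:
                best_name, best_sim = name, sim
        if best_sim >= threshold:
            assignments.append((ts, best_name))
        else:
            still_unknown.append((ts, audio))  # re-checked on the next pass
    return assignments, still_unknown
```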
The problem is how to get to the point where we have known speakers. Fortunately, in a courtroom situation each speaker must first announce themselves or be announced.
“Your honor I am Bill S. Preston, esquire, attorney for the plaintiff.”
“Your honor I am Ted Theodore Logan, representing the defendant”
In the end we found it easier to also have a visual analysis tool watch the proceedings and generate a description of who is speaking.
This was then supplied as a timestamped “subtext” or secondary text, like descriptive video for the visually impaired.
As it turns out, that solution simplified the design so much that the voice-pattern matching was no longer necessary.
So, the simplified design: a descriptive-video AI watches the video while whisper listens. Both are piped to a normal transformer that produces a transcript in near real time.
Now we just need to train whisper on legalese because judges hate reading transcripts about “motions in lemony”. 🤦♂️
Awwtifishal@reddit
whisper has a text context feature where you could just put in jargon-heavy sentences; you could try that
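If that refers to the `initial_prompt` parameter of Whisper's `transcribe()`, a quick sketch; the model size, audio file, and prompt text are just placeholders:

```python
# Priming Whisper with jargon-heavy context via initial_prompt
# (model size, audio path and the prompt text are placeholders).
import whisper

model = whisper.load_model("base")
result = model.transcribe(
    "hearing.wav",
    # Legal vocabulary in the prompt nudges the decoder toward
    # "motion in limine" rather than "motion in lemony".
    initial_prompt=(
        "Counsel filed a motion in limine; the court addressed "
        "voir dire, res judicata, and a writ of certiorari."
    ),
)
print(result["text"])
```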
Pro-editor-1105@reddit
github? or hf?
Many_SuchCases@reddit
He said it's closed source in another comment. He's just here to advertise it basically.
Pro-editor-1105@reddit
then you better get out lol, this is basic technology lol, this ain't anything proprietary.
DataPhreak@reddit
This is definitely a custom job. Probably using an open model, but the CLI is definitely homebrew.
jklre@reddit
This is cool. We have an internal demo that does this with translation to 100+ languages, real-time voice cloning, and RAG integration. Like an Alexa on steroids. Do you use whisper or something like moonshine? I've played around with https://huggingface.co/pyannote/speaker-diarization for diarization a bit, but my coworker put all the other stuff together into a working product.
amejin@reddit
Last I looked at pyannote, the models had restrictions that meant they could be pulled from HF at any time. I'm glad to see they MIT-licensed it.
What's really remarkable is you got it to process segments in real time. Did you have to overlap segments at all to retain speaker consistency or did it work straight out of the box with chunked audio? When I tried this I failed pretty miserably 🤪
TwistedBrother@reddit
I mean whisperx is already pretty good at this.
zerd@reddit
It doesn't do realtime though https://github.com/m-bain/whisperX/issues/476
The_frozen_one@reddit
I wrote a test script that used speaker diarization for ad removal from podcasts, and it seemed to have a lot of potential. My super simple approach was to guess a number of ad-seconds per hour, then determine which speakers were nearest to being under that threshold and cut them out of the audio. The cool thing was that even if the podcast host is doing the ad, they often record it at a different time and under different audio conditions than the rest of the podcast, so it was considered a different speaker (at least in my limited testing).
I didn't go too far with it because diarization is really slow and it would get crashy on longer clips. I still think this approach could work though, especially if you could spot check the removals by transcribing the shortest segments and asking a small and fast local LLM if the transcript sounded like an ad before giving it the axe.
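One way to read that heuristic as code, as a rough sketch; the segment format and the greedy shortest-speakers-first selection are my own assumptions, not the original script:

```python
# Sketch of the "ad budget" heuristic: pick the set of speakers whose combined
# airtime fits under the guessed ad-seconds, then drop their segments.
# The segment format and the greedy selection are assumptions.
from collections import defaultdict


def pick_ad_speakers(segments, ad_seconds_per_hour=180.0, total_hours=1.0):
    """segments: list of (speaker, start_sec, end_sec)."""
    budget = ad_seconds_per_hour * total_hours
    durations = defaultdict(float)
    for speaker, start, end in segments:
        durations[speaker] += end - start

    # Greedily add the shortest-airtime speakers while they still fit the budget.
    ad_speakers, used = set(), 0.0
    for speaker, dur in sorted(durations.items(), key=lambda kv: kv[1]):
        if used + dur <= budget:
            ad_speakers.add(speaker)
            used += dur
    return ad_speakers


def keep_segments(segments, ad_speakers):
    """Return only the segments spoken by non-ad speakers."""
    return [(spk, s, e) for spk, s, e in segments if spk not in ad_speakers]
```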
ServeAlone7622@reddit
Alternative design… A buffer and skip approach.
Buffer ~5 mins of audio. Examine the transcript for signs of advertising using literally any LLM, mark beginning and end. Fade volume out at beginning and fade it back in near the end. Fast forward or skip through the middle of it.
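A minimal sketch of the fade-and-skip part, assuming the LLM has already returned ad boundaries for the buffered chunk; the fade length and sample handling are placeholders:

```python
# Sketch of the buffer-and-skip idea: given (start, end) of a detected ad in a
# buffered chunk, fade out, cut the middle, fade back in.
# The LLM call that finds the ad boundaries is left out.
import numpy as np


def skip_ad(audio: np.ndarray, sr: int, ad_start_s: float, ad_end_s: float,
            fade_s: float = 1.5) -> np.ndarray:
    """audio: mono float32 samples for one buffered chunk (~5 min)."""
    start, end = int(ad_start_s * sr), int(ad_end_s * sr)
    fade = int(fade_s * sr)

    head = audio[:start].copy()
    tail = audio[end:].copy()

    # Fade out just before the ad and fade back in right after it.
    if len(head) >= fade:
        head[-fade:] *= np.linspace(1.0, 0.0, fade, dtype=np.float32)
    if len(tail) >= fade:
        tail[:fade] *= np.linspace(0.0, 1.0, fade, dtype=np.float32)

    return np.concatenate([head, tail])
```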
Bakedsoda@reddit
What specs does it need to run?
Lonligrin@reddit (OP)
Needs strong hardware. The demo runs on a 4090; it might run on lower-end systems, but not much lower.
Bakedsoda@reddit
Not bad. Have you tried MLX on an M-series chip? If so, please report the results.
ServeAlone7622@reddit
Not the OP here, but MLX is Apple only. Unless your target audience is using Apple exclusively or you have a compelling reason for MLX, you’re just tying yourself to the Apple ecosystem without any significant improvement in inference.
Here’s an example I just ran on my MacBook using an audiobook version of Mary Shelley’s Frankenstein from Project Gutenberg.
whisper-large-gguf = 120 tokens per second
whisper-large-mlx = 145 tokens per second
Most shocking is that, when compared to the actual raw text, the gguf version had fewer transcription errors than the mlx version.
ServeAlone7622@reddit
Theoretically you could run this on a Pi 5. Once you get it functional you need to look closely at the models you’re using, how and why. Quantization will make a huge difference here.
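For example, faster-whisper's int8 compute type is the kind of quantized setup that would be worth testing on a Pi 5; the model size and audio file below are placeholders:

```python
# Example of running a quantized Whisper model with faster-whisper
# (model size and audio file are placeholders).
from faster_whisper import WhisperModel

# int8 weights cut memory use versus float16 and speed up CPU inference.
model = WhisperModel("tiny", device="cpu", compute_type="int8")

segments, info = model.transcribe("sample.wav", beam_size=1)
for segment in segments:
    print(f"[{segment.start:.2f}s -> {segment.end:.2f}s] {segment.text}")
```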
tronathan@reddit
Yaaaay, this may be the missing link! Now where's my 72B any-to-any model (including streaming JSON time series data)?
AnhedoniaJack@reddit
Diarization cha cha cha
pigeon57434@reddit
bro what the fuck is it transcribing
TotalRuler1@reddit
Right? Isn't this how accessible video transcripts for screen readers get created?
Time-Accountant1992@reddit
I always wondered how something like this would be done. Very very cool.
Smithiegoods@reddit
That's amazing, but what is this video lol.
pmp22@reddit
If this was multilingual and the output text was rendered in real time as an overlay text on the screen, it could be used to translate anything playing on the machine. I often encounter videos in languages I don't understand without subtitles. This would be such a neat solution.
hackeristi@reddit
You could do that with RealtimeSTT (subtitles) if you are handy with Python. You should be able to do what you're asking in very few steps.
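A minimal sketch with the RealtimeSTT package, just printing finished sentences as they arrive; swapping the print for an overlay renderer plus a translation call would be the remaining steps, and the constructor options are worth checking against the repo:

```python
# Minimal RealtimeSTT sketch: print each finished sentence as it arrives.
# Replacing show_subtitle with an overlay renderer / translator is left out.
from RealtimeSTT import AudioToTextRecorder


def show_subtitle(text: str) -> None:
    # Stand-in for an on-screen overlay or a translation call.
    print(text)


if __name__ == "__main__":
    recorder = AudioToTextRecorder(model="tiny")  # model choice is a guess; see repo docs
    while True:
        recorder.text(show_subtitle)
```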
leeharris100@reddit
Nice work. This is a standard diarization embedding approach with chunking to make it run in real time. It's a cool demo, but it will unfortunately be very inaccurate for real-world use.
Whose embeddings did you take to make this? Or did you train your own? If you trained your own, what data did you train from? I don't see any credits to pyannote or anyone else for your voiceprint embeddings.
indicava@reddit
Upvoted for Cunk
Also, any details would be nice!
Chris_in_Lijiang@reddit
Philomena?
Livid_Victory_979@reddit
It's from https://github.com/KoljaB/RealtimeSTT . The repo is not updated though.
Lonligrin@reddit (OP)
Yes, that's the basis, realtimestt_speechendpoint_binary_classified.py to be precise. Also I'm still updating RealtimeSTT.
SnooPaintings8639@reddit
Very impressive. I know of an "AI" company that just gave up and uses multiple physical mics, one per person.
Can it detect your voice vs "unknown"? That would be enough for many use cases.
leeharris100@reddit
I imagine you mean for realtime in-person diarization? That's because this type of solution would completely fall apart the moment you have cross talk, background noise, similar-sounding voices, etc. Plus, if you're doing it in person, you likely don't have substantial GPU power to run it in real time with low latency unless you're using high-powered cloud GPUs.
Lonligrin@reddit (OP)
Detecting vs. unknown, yes. With 100% accuracy, like for voice-based access unlocking, I don't think so.
--Tintin@reddit
Remindme! 2days
Su1tz@reddit
Fucking how