I used Kokoro-82M, Llama 3.2, and Whisper Small to build a real-time speech-to-speech chatbot that runs locally on my MacBook!
Posted by tycho_brahes_nose_@reddit | LocalLLaMA | 77 comments
Present-Permission46@reddit
I made something similar with Whisper + Ollama + Kokoro, and used Ollama's conversation libraries, which make it more natural when it talks.
Xodnil@reddit
I'm actually curious: can you clone or use your own voice with Kokoro?
EmotionLogicAI@reddit
How about adding real emotion detection to it?
drplan@reddit
I really like that your project is a compact Python script :) Finally, an implementation that is easy to follow. Wonderful achievement!
Separate_Cup_5095@reddit
RuntimeError: espeak not installed on your system
but espeak is already installed on my system. I tried espeak-ng, same issue. Any ideas? I'm using a MacBook Pro M3.
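For anyone hitting the same error: Kokoro's ONNX pipeline typically phonemizes through the phonemizer package, which has to locate the espeak-ng shared library at runtime. A possible workaround on Apple Silicon, assuming a default Homebrew install of espeak-ng, is to point phonemizer at the dylib before the TTS code runs:

```python
import os

# Hypothetical workaround: tell phonemizer where Homebrew's espeak-ng
# library lives (path assumes a default Apple Silicon Homebrew install).
os.environ["PHONEMIZER_ESPEAK_LIBRARY"] = "/opt/homebrew/lib/libespeak-ng.dylib"
```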
vamsammy@reddit
totally awesome!
jahflyx@reddit
this is fire.. kudos
tycho_brahes_nose_@reddit (OP)
Weebo is a real-time speech-to-speech chatbot that utilizes Whisper Small for speech-to-text, Llama 3.2 for text generation, and Kokoro-82M for text-to-speech.
You can learn more about it here: https://amanvir.com/weebo
It's open-source, and you can find the code on GitHub: https://github.com/amanvirparhar/weebo
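For readers who want the shape of the loop before diving into the repo, here's a rough sketch using openly available packages (openai-whisper, ollama, kokoro-onnx, sounddevice). The exact API calls, model names, and file paths are assumptions on my part, not Weebo's actual code:

```python
import whisper                   # pip install openai-whisper
import ollama                    # pip install ollama (Ollama server must be running)
import sounddevice as sd         # pip install sounddevice
from kokoro_onnx import Kokoro   # pip install kokoro-onnx

stt = whisper.load_model("small")
tts = Kokoro("kokoro-v0_19.onnx", "voices.json")  # file paths assumed

def respond(wav_path: str) -> None:
    # 1) speech -> text (Whisper Small)
    text = stt.transcribe(wav_path)["text"]
    # 2) text -> reply (Llama 3.2 via Ollama)
    reply = ollama.chat(model="llama3.2",
                        messages=[{"role": "user", "content": text}])
    # 3) reply -> speech (Kokoro-82M), played straight to the speakers
    samples, sample_rate = tts.create(reply["message"]["content"], voice="af")
    sd.play(samples, sample_rate)
    sd.wait()

respond("question.wav")
```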
Mukun00@reddit
Bro, I literally had this idea to run on a mobile device. I implemented whisper.cpp in Flutter but lost interest while implementing the LLM in Flutter :(. I guess I need to work on that project again.
tycho_brahes_nose_@reddit (OP)
That sounds dope! Would love an app where I can interact with various types of local models on my iPhone!
Affectionate-Hat-536@reddit
There are apps, PocketPal and PocketGPT, that run on mobile and let you interact with models. Not sure if you can access them programmatically though!
Recoil42@reddit
Dope. It looks like it doesn't really support interruption?
tycho_brahes_nose_@reddit (OP)
Thanks! And yeah, there's currently no mechanism to interrupt the TTS with your own voice. I was considering adding it, but I just wanted to ship the project and get it out there 😆
Please feel free to open up a PR for this feature if you'd like, and I'd love to get it approved!
DanInVirtualReality@reddit
I've been experimenting with Pipecat to facilitate interruption in this kind of model chain; depending on your choice of transport, it can open up remote access more easily, too. I was hoping to get to this... nearly every week for the last 6 months 😆 maybe you'll have more luck.
https://github.com/pipecat-ai/pipecat
Seems like they're motivated to facilitate Daily.co for the transport layer in particular, but LiveKit is in there too, which is what the Open Interpreter 01 app uses. (The main point regarding the transport layer seems to be that managing voice-to-voice real-time conversations over the internet is a hard but already-solved problem; just expect issues if you naively use WebSockets.)
talk_nerdy_to_m3@reddit
Unbearably slow but very cool. I wonder how it performs on a 4090?
Journeyj012@reddit
I made something similar (with an old-fashioned TTS and faster-whisper medium) and it was near real-time on my RTX 4060 Ti 16GB. I could even use Llama 3.1 Q4.
https://github.com/JourneyJ012/ollama-chatbot This doesn't separate at any punctuation, and my code is completely awful.
BuildAQuad@reddit
As far as I know it should be possible to get it running near realtime on a 4090.
Lorddon1234@reddit
Awesome work!
tycho_brahes_nose_@reddit (OP)
Thank you!
vamsammy@reddit
Posted an issue on GitHub.
Dear-Nail-5039@reddit
I just tried it and it worked almost instantly! Switched from tiny to small Whisper - a little slower but my German English is transcribed much better now.
tycho_brahes_nose_@reddit (OP)
Oops, thanks for catching that! I was experimenting with different Whisper models before pushing to GitHub, and I forgot to change it back to "small" in the code. I'm glad it worked well for you though!
pateandcognac@reddit
Nice, dude! To cut down on the model taking mistranscriptions literally, when I do STT input, I add something to my LLM prompt like:
tycho_brahes_nose_@reddit (OP)
This is great, thank you! I definitely think this would be especially beneficial when working with smaller STT models, as the mistranscriptions are much more frequent and prominent.
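The prompt snippet itself isn't shown above, but an illustrative addition in that spirit (my wording, not the commenter's) could look like:

```python
# Hypothetical system-prompt addition warning the LLM that its input
# comes from STT and may contain mistranscriptions.
SYSTEM_PROMPT_SUFFIX = (
    "The user's message was transcribed from speech and may contain "
    "transcription errors. If a word seems out of place, infer the most "
    "likely intended word from context instead of taking it literally."
)
```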
Altruistic_Poem6087@reddit
How do you optimize voiceover delay? Do you do TTS sentence by sentence, or is there some other way?
tycho_brahes_nose_@reddit (OP)
Exactly, TTS is done sentence-by-sentence.
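A minimal sketch of what that sentence buffering can look like; `token_stream` and `speak` are hypothetical stand-ins for the LLM's streaming output and the Kokoro call, not Weebo's actual code:

```python
import re

# A sentence ends at ".", "!", or "?" followed by whitespace.
SENTENCE_END = re.compile(r"(?<=[.!?])\s+")

def stream_tts(token_stream, speak):
    """Flush complete sentences to TTS as they arrive, instead of
    waiting for the full LLM response."""
    buffer = ""
    for token in token_stream:
        buffer += token
        parts = SENTENCE_END.split(buffer)
        for sentence in parts[:-1]:   # everything but the tail is complete
            speak(sentence)
        buffer = parts[-1]            # keep the unfinished tail
    if buffer.strip():
        speak(buffer)                 # flush whatever remains at the end
```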
CtrlAltDelve@reddit
So damn cool. Well done!
tycho_brahes_nose_@reddit (OP)
Thanks!
micamecava@reddit
Great job! The latency is nice.
This was my next weekend todo project, I’m now a bit jealous that you got to it before me haha.
tycho_brahes_nose_@reddit (OP)
Haha, thank you! 😆
MixtureOfAmateurs@reddit
I did this the other day, but whisper absolutely sucks balls when it comes to my Aussie accent. Did you come across any alternatives when making this?
tycho_brahes_nose_@reddit (OP)
Funnily enough, I didn't really consider any alternatives to Whisper. Kind of took it for granted that it was still the best choice for STT.
I haven't been keeping up with the benchmarks, but maybe there's a better model out there that's small enough to run locally?
joninco@reddit
https://huggingface.co/hexgrad/Kokoro-82M is pretty legit.
MixtureOfAmateurs@reddit
Yeah it seems pretty dominant. I think I need to fine-tune it or look for a fine-tune, I know I'm not the only Aussie with this issue lol
corvidpal@reddit
Have you tried out large-v3? I use it every day and don't even bother checking the transcription. It's so good.
I am from Melbourne though so maybe my Aussie accent is less prominent. Where are you from?
MixtureOfAmateurs@reddit
Yeah, turbo, which I think is better than large, but I'll check out large-v3 specifically tomorrow. I'm from Brisbane, and it was completely useless. I have a proper recording mic in a quiet room, with a pretty loud PC in the background. It got "Tell me a joke" so incredibly wrong 3 times in a row that I gave up.
Rozwik@reddit
Woah, dude. I was looking to build the same thing yesterday (with exactly the same three tools). I wonder how many of us are having the same ideas these days. This will definitely save me some time. Thanks for sharing.
Not_your_guy_buddy42@reddit
I often wonder what could be achieved if all the similar open-source projects were able to bundle their efforts somehow instead of inventing 3 million variations of the same apps... it's not how it works though, is it?
dxcore_35@reddit
This!
Rozwik@reddit
Yeah, that would be a dream.
In the future, we'd better have a check every time we initiate a new project: it would query GitHub and alert us with something like
"A similar project is already in development over here (link). Are you sure you still want to continue with this one, or would you rather contribute to the ongoing project?"
or something like that.
johncarpen1@reddit
I think it depends on the use case. I have wanted a markdown editor for a long time. Obsidian is great for my purpose, but it's not open source. Logseq, I just don't like the look of it. There are numerous others that I have tried, but there is always something that doesn't work out, and if I want to add something, I have to dive into the source code, which takes a lot of time to figure out how something is implemented.
So I have started to build my own markdown editor. Now I know where and how everything is implemented. If I want to add a feature, I can just open VS Code and start coding according to my needs. It might not be super optimized and there might be a shit ton of errors, but it works for me.
I think the main aspect would be the backend/engine portion of any project. If it is very simple and easy to work with, creating a frontend for it is just a matter of choice, and we can use something like bolt.diy to quickly create one.
Rozwik@reddit
Yup. Frontend is done.
Not_your_guy_buddy42@reddit
I also want to code my own notes app, ha
cptbeard@reddit
Easily thousands, including me. I did a few projects before and was again inspired by Kokoro.
tycho_brahes_nose_@reddit (OP)
Haha, I’m glad you had the same idea! Yeah, once I saw how impressive Kokoro was, I knew this was the first project I had to build with it!
rorowhat@reddit
Can this work on a PC?
tycho_brahes_nose_@reddit (OP)
Yes, although you'd have to swap out the `lightning_whisper_mlx` model with another Whisper implementation, and you might want to look into changing the ONNX execution provider if you have a GPU.
bdiler1@reddit
Is this faster than faster-whisper?
ramzeez88@reddit
Check out lema-ai on GitHub. It runs natively on Windows.
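For anyone attempting the PC swap OP describes a few comments up, a rough sketch using faster-whisper for STT and an explicit ONNX Runtime provider list for the Kokoro session; the model size, file names, and CUDA settings are illustrative assumptions:

```python
from faster_whisper import WhisperModel   # pip install faster-whisper
import onnxruntime as ort                 # pip install onnxruntime-gpu

# STT: faster-whisper in place of lightning_whisper_mlx (NVIDIA GPU assumed).
stt = WhisperModel("small", device="cuda", compute_type="float16")
segments, _ = stt.transcribe("input.wav")
text = " ".join(segment.text for segment in segments)

# TTS: ask ONNX Runtime for the CUDA provider first, with CPU as fallback.
session = ort.InferenceSession(
    "kokoro-v0_19.onnx",
    providers=["CUDAExecutionProvider", "CPUExecutionProvider"],
)
```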
tylercoder@reddit
Which MacBook model?
Expensive-Apricot-25@reddit
Nice, looks super cool.
How do you split the generation stream into chunks to send to the TTS engine? Do you just use standard punctuation (i.e. ".", "!", "?", etc.)? If so, what do you do if the language model generates code, where these no longer serve as valid punctuation?
tycho_brahes_nose_@reddit (OP)
Yes, it splits at ".", "!", and "?". You're right about that not working well with code, but I'm not sure if there's even a use-case where you'd want to read out lines of code with TTS? If you're concerned about the LLM accidentally generating code, I'd say that the best option would just be to create a strongly worded system prompt with instructions to stop the model from generating code.
Expensive-Apricot-25@reddit
I doubt there is a use case for having it read out code, but there is a use case for saying "write Python code for a function that does xyz": it would then respond to you as normal, but without reading the code out loud, and you could just copy/paste the code from a pop-up or something.
I'm sure you could just write a simple regex to pattern-match markdown plain text vs. code, and crop the code out before TTS.
tycho_brahes_nose_@reddit (OP)
Ooh, yes, in that case, you could either pattern match or use structured outputs to get the LLM to respond with a JSON object where the text and code parts are separate key-value pairs.
Expensive-Apricot-25@reddit
Yeah. IMO, pattern matching is better. The more you restrict a model, the worse its performance becomes, especially with local models.
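A minimal sketch of that pattern-matching idea, assuming the model's replies use standard triple-backtick markdown fences (the function name is mine):

```python
import re

# Matches fenced markdown code blocks, including the fence lines themselves.
CODE_FENCE = re.compile(r"```.*?```", re.DOTALL)

def strip_code_for_tts(markdown_reply: str) -> str:
    """Drop fenced code blocks before TTS; the full reply can still be
    shown on screen so the user can copy the code."""
    return CODE_FENCE.sub("", markdown_reply).strip()

# Example: only the prose around the code block gets spoken.
fence = "```"
reply = f"Here you go:\n{fence}python\nprint('hi')\n{fence}\nHope that helps!"
print(strip_code_for_tts(reply))  # -> "Here you go:\n\nHope that helps!"
```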
Mother_Soraka@reddit
The current demo you just showed, is it TTSing in chunks?
tycho_brahes_nose_@reddit (OP)
Yes.
ServeAlone7622@reddit
I think he means foreign languages that don't use the same punctuation marks. The answer, by the way, is to use SentencePiece.
Flaky_Pay_2367@reddit
Nice work!
Could you run `nvtop` and `htop` in the terminal too, so we can see the GPU and CPU usage?
Also, could you post the specs of your MacBook here?
tycho_brahes_nose_@reddit (OP)
Thanks! About to head off to bed right now, so I'll have to get back to you with the CPU/GPU usage stats another time. But as for my MacBook specs: I'm running this on an M2 Pro with 16GB RAM.
madaradess007@reddit
Wow, dude! This is exactly what I wanted to do today!
Please add more instructions on how to set up `kokoro-v0_19.onnx` (the TTS model).
tycho_brahes_nose_@reddit (OP)
Thanks, I'm glad you like it!
Just added the download link for `kokoro-v0_19.onnx` to the GitHub repo. All you need to do is download the model and put it in the same folder as the Python script.
Murky-Use-949@reddit
What should one do to create real-time audio translation from English to, say, a language like Malayalam? I am a teacher, and many of the instruction materials are purely in English; I try to teach illiterate/low-literacy adults in the evenings, and I believe this could be of huge help. A general outline/plan would be helpful enough, and I will code up the rest.
tycho_brahes_nose_@reddit (OP)
With respect to Weebo, you'd essentially have to replace the LLM with a translation model and then find a TTS model that's compatible with Malayalam (or whatever language you're trying to synthesize speech for).
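One possible shape for that translation step, using NLLB-200 via Hugging Face transformers; the model choice and the NLLB language codes ("eng_Latn", "mal_Mlym") are my assumptions, and you'd still need a Malayalam-capable TTS model on the end:

```python
from transformers import pipeline   # pip install transformers sentencepiece

# Translation step: English transcript in (e.g., from Whisper),
# Malayalam text out. Model name and language codes are assumptions.
translator = pipeline(
    "translation",
    model="facebook/nllb-200-distilled-600M",
    src_lang="eng_Latn",
    tgt_lang="mal_Mlym",
)

def translate_transcript(english_text: str) -> str:
    return translator(english_text)[0]["translation_text"]

print(translate_transcript("Please open your textbooks to page ten."))
```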
sagardavara_codes@reddit
Amazing. Even though you used a multi-model process here, the latency still seems good for production use cases.
tycho_brahes_nose_@reddit (OP)
I know right?! Crazy what you can build using fully local inference.
SquashFront1303@reddit
I really need a Windows app like this to practice my speaking skills, but most of these are hard to set up; installing libraries gives me a headache. I hope somebody compiles it into a standalone Windows app with a good interface.
TruckUseful4423@reddit
My project: https://github.com/feckom/samantha/
onemorefreak@reddit
What hardware are you using?
tycho_brahes_nose_@reddit (OP)
I'm running this on a MacBook M2 Pro with 16 GB RAM.
Short-Sandwich-905@reddit
I imagine real-time translation.
tycho_brahes_nose_@reddit (OP)
Ooh, could definitely be a really cool use-case!
dsartori@reddit
A valuable contribution, thank you!
tycho_brahes_nose_@reddit (OP)
Thank you, I appreciate it!
gamblingapocalypse@reddit
Finally, I can have a local ai girlfriend!
Murky_Mountain_97@reddit
Nicely done! Did you consider a Solo integration?