I used Kokoro-82M, Llama 3.2, and Whisper Small to build a real-time speech-to-speech chatbot that runs locally on my MacBook!
Posted by tycho_brahes_nose_@reddit | LocalLLaMA | 77 comments
Present-Permission46@reddit
I made something similar with Whisper + Ollama + Kokoro, and used Ollama's conversation libraries, which make it more natural when it talks.
Xodnil@reddit
I'm actually curious: can you clone or use your own voice with Kokoro?
EmotionLogicAI@reddit
How about adding real emotion detection to it?
drplan@reddit
I really like that your project is a compact Python script :) Finally, an implementation that is easy to follow. Wonderful achievement!
Separate_Cup_5095@reddit
RuntimeError: espeak not installed on your system
but espeak is already installed on my system. I tried espeak-ng, same issue. Any ideas? I'm using a MacBook Pro M3.
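For anyone hitting the same error: Kokoro's ONNX pipeline typically phonemizes through the phonemizer package, which has to locate the espeak-ng shared library at runtime. A possible workaround on Apple Silicon, assuming a default Homebrew install of espeak-ng, is to point phonemizer at the dylib before the TTS code runs:

```python
import os

# Hypothetical workaround: tell phonemizer where Homebrew's espeak-ng
# library lives (path assumes a default Apple Silicon Homebrew install).
os.environ["PHONEMIZER_ESPEAK_LIBRARY"] = "/opt/homebrew/lib/libespeak-ng.dylib"
```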
vamsammy@reddit
totally awesome!
jahflyx@reddit
this is fire.. kudos
tycho_brahes_nose_@reddit (OP)
Weebo is a real-time speech-to-speech chatbot that utilizes Whisper Small for speech-to-text, Llama 3.2 for text generation, and Kokoro-82M for text-to-speech.
You can learn more about it here: https://amanvir.com/weebo
It's open-source, and you can find the code on GitHub: https://github.com/amanvirparhar/weebo
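For readers who want the shape of the loop before diving into the repo, here's a rough sketch using openly available packages (openai-whisper, ollama, kokoro-onnx, sounddevice). The exact API calls, model names, and file paths are assumptions on my part, not Weebo's actual code:

```python
import whisper                   # pip install openai-whisper
import ollama                    # pip install ollama (Ollama server must be running)
import sounddevice as sd         # pip install sounddevice
from kokoro_onnx import Kokoro   # pip install kokoro-onnx

stt = whisper.load_model("small")
tts = Kokoro("kokoro-v0_19.onnx", "voices.json")  # file paths assumed

def respond(wav_path: str) -> None:
    # 1) speech -> text (Whisper Small)
    text = stt.transcribe(wav_path)["text"]
    # 2) text -> reply (Llama 3.2 via Ollama)
    reply = ollama.chat(model="llama3.2",
                        messages=[{"role": "user", "content": text}])
    # 3) reply -> speech (Kokoro-82M), played straight to the speakers
    samples, sample_rate = tts.create(reply["message"]["content"], voice="af")
    sd.play(samples, sample_rate)
    sd.wait()

respond("question.wav")
```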
Mukun00@reddit
Bro, I literally had this idea to run on a mobile device. I implemented whisper.cpp in Flutter but lost interest while implementing the LLM in Flutter :(. I guess I need to work on that project again.
tycho_brahes_nose_@reddit (OP)
That sounds dope! Would love an app where I can interact with various types of local models on my iPhone!
Affectionate-Hat-536@reddit
There are apps, PocketPal and PocketGPT, that run on mobile and let you interact with models. Not sure if you can access them programmatically though!
Recoil42@reddit
Dope. It looks like it doesn't really support interruption?
tycho_brahes_nose_@reddit (OP)
Thanks! And yeah, there's currently no mechanism to interrupt the TTS with your own voice. I was considering adding it, but I just wanted to ship the project and get it out there 😆
Please feel free to open up a PR for this feature if you'd like, and I'd love to get it approved!
DanInVirtualReality@reddit
I've been experimenting with Pipecat to facilitate interruption in this kind of model chain; depending on your choice of transport, it can open up remote access more easily, too. I was hoping to get to this... nearly every week for the last 6 months 😆 maybe you'll have more luck.
https://github.com/pipecat-ai/pipecat
Seems like they're motivated to facilitate Daily.co for the transport layer in particular, but LiveKit is in there too, which is what the Open Interpreter 01 app uses. (The main point regarding the transport layer seems to be that managing voice-to-voice real-time conversations over the internet is a hard but already-solved problem; just expect issues if you naively use WebSockets.)
talk_nerdy_to_m3@reddit
Unbearably slow but very cool. I wonder how it performs on a 4090?
Journeyj012@reddit
I made something similar (with an old-fashioned TTS and faster-whisper medium) and it was near real-time on my RTX 4060 Ti 16GB. I could even use Llama 3.1 Q4.
https://github.com/JourneyJ012/ollama-chatbot This doesn't separate at any punctuation, and my code is completely awful.
BuildAQuad@reddit
As far as I know it should be possible to get it running near realtime on a 4090.
Lorddon1234@reddit
Awesome work!
tycho_brahes_nose_@reddit (OP)
Thank you!
vamsammy@reddit
Posted an issue on GitHub.
Dear-Nail-5039@reddit
I just tried it and it worked almost instantly! Switched from tiny to small Whisper - a little slower but my German English is transcribed much better now.
tycho_brahes_nose_@reddit (OP)
Oops, thanks for catching that! I was experimenting with different Whisper models before pushing to GitHub, and I forgot to change it back to "small" in the code. I'm glad it worked well for you though!
pateandcognac@reddit
Nice, dude! To cut down on the model taking mistranscriptions literally, when I do STT input, I add something to my LLM prompt like:
tycho_brahes_nose_@reddit (OP)
This is great, thank you! I definitely think this would be especially beneficial when working with smaller STT models, as the mistranscriptions are much more frequent and prominent.
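The prompt snippet itself isn't shown above, but an illustrative addition in that spirit (my wording, not the commenter's) could look like:

```python
# Hypothetical system-prompt addition warning the LLM that its input
# comes from STT and may contain mistranscriptions.
SYSTEM_PROMPT_SUFFIX = (
    "The user's message was transcribed from speech and may contain "
    "transcription errors. If a word seems out of place, infer the most "
    "likely intended word from context instead of taking it literally."
)
```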
Altruistic_Poem6087@reddit
How do you optimize voiceover delay? Do you do TTS sentence by sentence, or is there some other way?
tycho_brahes_nose_@reddit (OP)
Exactly, TTS is done sentence-by-sentence.
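A minimal sketch of what that sentence buffering can look like; `token_stream` and `speak` are hypothetical stand-ins for the LLM's streaming output and the Kokoro call, not Weebo's actual code:

```python
import re

# A sentence ends at ".", "!", or "?" followed by whitespace.
SENTENCE_END = re.compile(r"(?<=[.!?])\s+")

def stream_tts(token_stream, speak):
    """Flush complete sentences to TTS as they arrive, instead of
    waiting for the full LLM response."""
    buffer = ""
    for token in token_stream:
        buffer += token
        parts = SENTENCE_END.split(buffer)
        for sentence in parts[:-1]:   # everything but the tail is complete
            speak(sentence)
        buffer = parts[-1]            # keep the unfinished tail
    if buffer.strip():
        speak(buffer)                 # flush whatever remains at the end
```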
CtrlAltDelve@reddit
So damn cool. Well done!
tycho_brahes_nose_@reddit (OP)
Thanks!
micamecava@reddit
Great job! The latency is nice.
This was my next weekend todo project, I’m now a bit jealous that you got to it before me haha.
tycho_brahes_nose_@reddit (OP)
Haha, thank you! 😆
MixtureOfAmateurs@reddit
I did this the other day, but whisper absolutely sucks balls when it comes to my Aussie accent. Did you come across any alternatives when making this?
tycho_brahes_nose_@reddit (OP)
Funnily enough, I didn't really consider any alternatives to Whisper. Kind of took it for granted that it was still the best choice for STT.
I haven't been keeping up with the benchmarks, but maybe there's a better model out there that's small enough to run locally?
joninco@reddit
https://huggingface.co/hexgrad/Kokoro-82M is pretty legit.
MixtureOfAmateurs@reddit
Yeah it seems pretty dominant. I think I need to fine-tune it or look for a fine-tune, I know I'm not the only Aussie with this issue lol
corvidpal@reddit
Have you tried out large-v3? I use it every day and don't even bother checking the transcription. It's so good.
I am from Melbourne though so maybe my Aussie accent is less prominent. Where are you from?
MixtureOfAmateurs@reddit
Yeah, turbo, which I think is better than large, but I'll check out large-v3 specifically tomorrow. I'm from Brisbane, and it was completely useless. I have a proper recording mic in a quiet room, with a pretty loud PC in the background. It got "Tell me a joke" so incredibly wrong 3 times in a row that I gave up.
Rozwik@reddit
Woah, dude. I was looking to build the same thing yesterday (with exactly the same three tools). I wonder how many of us are having the same ideas these days. This will definitely save me some time. Thanks for sharing.
Not_your_guy_buddy42@reddit
I often wonder what could be achieved if all the similar open-source projects were able to bundle their efforts somehow instead of inventing 3 million variations of the same apps... it's not how it works though, is it?
dxcore_35@reddit
This!
Rozwik@reddit
Yeah, that would be a dream.
In the future, we'd better have a check every time we initiate a new project: it would query GitHub and alert us with something like
"A similar project is already in development over here (link). Are you sure you still want to continue with this one, or would you rather contribute to the ongoing project?"
or something like that.
johncarpen1@reddit
I think it depends on the use case. I have wanted a markdown editor for a long time. Obsidian is great for my purpose, but it's not open source. Logseq, I just don't like the look of it. There are numerous others that I have tried, but there is always something that doesn't work out, and if I want to add something, I have to dive into the source code, which takes a lot of time to figure out how something is implemented.
So I have started to build my own markdown editor. Now I know where and how everything is implemented. If I want to add a feature, I can just open VS Code and start coding according to my needs. It might not be super optimized and there might be a shit ton of errors, but it works for me.
I think the main aspect would be the backend/engine portion of any project. If it is very simple and easy to work with, creating a frontend for it is just a matter of choice, and we can use something like bolt.diy to quickly create one.
Rozwik@reddit
Yup. Frontend is done.
Not_your_guy_buddy42@reddit
I also want to code my own notes app, ha
cptbeard@reddit
Easily thousands, including me. I did a few projects before and was again inspired by Kokoro.
tycho_brahes_nose_@reddit (OP)
Haha, I’m glad you had the same idea! Yeah, once I saw how impressive Kokoro was, I knew this was the first project I had to build with it!
rorowhat@reddit
Can this work on a PC?
tycho_brahes_nose_@reddit (OP)
Yes, although you'd have to swap out the `lightning_whisper_mlx` model with another Whisper implementation, and you might want to look into changing the ONNX execution provider if you have a GPU.
bdiler1@reddit
Is this faster than faster-whisper?
ramzeez88@reddit
Check out lema-ai on GitHub. It runs natively on Windows.
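For anyone attempting the PC swap OP describes a few comments up, a rough sketch using faster-whisper for STT and an explicit ONNX Runtime provider list for the Kokoro session; the model size, file names, and CUDA settings are illustrative assumptions:

```python
from faster_whisper import WhisperModel   # pip install faster-whisper
import onnxruntime as ort                 # pip install onnxruntime-gpu

# STT: faster-whisper in place of lightning_whisper_mlx (NVIDIA GPU assumed).
stt = WhisperModel("small", device="cuda", compute_type="float16")
segments, _ = stt.transcribe("input.wav")
text = " ".join(segment.text for segment in segments)

# TTS: ask ONNX Runtime for the CUDA provider first, with CPU as fallback.
session = ort.InferenceSession(
    "kokoro-v0_19.onnx",
    providers=["CUDAExecutionProvider", "CPUExecutionProvider"],
)
```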
tylercoder@reddit
Which MacBook model?
Expensive-Apricot-25@reddit
Nice, looks super cool.
How do you split the generation stream into chunks to send to the TTS engine? Do you just use standard punctuation (i.e. ".", "!", "?", etc.)? If so, what do you do if the language model generates code, where these no longer serve as valid punctuation?
tycho_brahes_nose_@reddit (OP)
Yes, it splits at ".", "!", and "?". You're right about that not working well with code, but I'm not sure if there's even a use-case where you'd want to read out lines of code with TTS? If you're concerned about the LLM accidentally generating code, I'd say that the best option would just be to create a strongly worded system prompt with instructions to stop the model from generating code.
Expensive-Apricot-25@reddit
I doubt there is a use case for having it read out code, but there is a use case for saying "write Python code for a function that does xyz": it would then respond to you as normal, but without reading the code out loud, and you could just copy/paste the code from a pop-up or something.
I'm sure you could just write a simple regex to pattern-match markdown plain text vs. code, and crop the code out before TTS.
tycho_brahes_nose_@reddit (OP)
Ooh, yes, in that case, you could either pattern match or use structured outputs to get the LLM to respond with a JSON object where the text and code parts are separate key-value pairs.
Expensive-Apricot-25@reddit
Yeah. IMO, pattern matching is better. The more you restrict a model, the worse its performance becomes, especially with local models.
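A minimal sketch of that pattern-matching idea, assuming the model's replies use standard triple-backtick markdown fences (the function name is mine):

```python
import re

# Matches fenced markdown code blocks, including the fence lines themselves.
CODE_FENCE = re.compile(r"```.*?```", re.DOTALL)

def strip_code_for_tts(markdown_reply: str) -> str:
    """Drop fenced code blocks before TTS; the full reply can still be
    shown on screen so the user can copy the code."""
    return CODE_FENCE.sub("", markdown_reply).strip()

# Example: only the prose around the code block gets spoken.
fence = "```"
reply = f"Here you go:\n{fence}python\nprint('hi')\n{fence}\nHope that helps!"
print(strip_code_for_tts(reply))  # -> "Here you go:\n\nHope that helps!"
```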
Mother_Soraka@reddit
The current demo you just showed, is it TTSing in chunks?
tycho_brahes_nose_@reddit (OP)
Yes.
ServeAlone7622@reddit
I think he means foreign languages that don't use the same punctuation marks. The answer, by the way, is to use SentencePiece.
Flaky_Pay_2367@reddit
Nice work!
Could you run `nvtop` and `htop` in the terminal too, so we can see the GPU and CPU usage?
Also, could you post the specs of your MacBook here?
tycho_brahes_nose_@reddit (OP)
Thanks! About to head off to bed right now, so I'll have to get back to you with the CPU/GPU usage stats another time. But as for my MacBook specs: I'm running this on an M2 Pro with 16GB RAM.
madaradess007@reddit
Wow, dude! This is exactly what I wanted to do today!
Please add more instructions on how to set up `kokoro-v0_19.onnx` (the TTS model).
tycho_brahes_nose_@reddit (OP)
Thanks, I'm glad you like it!
Just added the download link for `kokoro-v0_19.onnx` to the GitHub repo. All you need to do is download the model and put it in the same folder as the Python script.
Murky-Use-949@reddit
What should one do to create real-time audio translation from English to, say, a language like Malayalam? I am a teacher, and many of the instruction materials are purely in English; I try to teach illiterate/low-literacy adults in the evenings, and I believe this could be of huge help. A general outline/plan would be helpful enough, and I will code up the rest.
tycho_brahes_nose_@reddit (OP)
With respect to Weebo, you'd essentially have to replace the LLM with a translation model and then find a TTS model that's compatible with Malayalam (or whatever language you're trying to synthesize speech for).
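One possible shape for that translation step, using NLLB-200 via Hugging Face transformers; the model choice and the NLLB language codes ("eng_Latn", "mal_Mlym") are my assumptions, and you'd still need a Malayalam-capable TTS model on the end:

```python
from transformers import pipeline   # pip install transformers sentencepiece

# Translation step: English transcript in (e.g., from Whisper),
# Malayalam text out. Model name and language codes are assumptions.
translator = pipeline(
    "translation",
    model="facebook/nllb-200-distilled-600M",
    src_lang="eng_Latn",
    tgt_lang="mal_Mlym",
)

def translate_transcript(english_text: str) -> str:
    return translator(english_text)[0]["translation_text"]

print(translate_transcript("Please open your textbooks to page ten."))
```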
sagardavara_codes@reddit
Amazing. Even though you used a multi-model process here, the latency still seems good for production use cases.
tycho_brahes_nose_@reddit (OP)
I know right?! Crazy what you can build using fully local inference.
SquashFront1303@reddit
I really need a Windows app like this to practice my speaking skills, but most of these are hard to set up; installing libraries gives me a headache. I hope somebody compiles it into a standalone Windows app with a good interface.
TruckUseful4423@reddit
My project: https://github.com/feckom/samantha/
onemorefreak@reddit
What hardware are you using?
tycho_brahes_nose_@reddit (OP)
I'm running this on a MacBook M2 Pro with 16 GB RAM.
Short-Sandwich-905@reddit
I imagine real-time translation.
tycho_brahes_nose_@reddit (OP)
Ooh, could definitely be a really cool use-case!
dsartori@reddit
A valuable contribution, thank you!
tycho_brahes_nose_@reddit (OP)
Thank you, I appreciate it!
gamblingapocalypse@reddit
Finally, I can have a local ai girlfriend!
Murky_Mountain_97@reddit
Nicely done! Did you consider a Solo integration?