Built a "Voice Agents from Scratch" GitHub tutorial: mic > Whisper > local LLM (GGUF) > Kokoro > speaker, fully local, no API keys
Posted by purellmagents@reddit | LocalLLaMA | 11 comments
Been building this for a while and finally cleaned it up enough to share.
voice-agents-from-scratch is a numbered, chapter-by-chapter repo that walks the full real-time pipeline (a minimal sketch follows the list):
- Microphone capture
- Whisper for STT
- Local GGUF LLM (via llama.cpp)
- Kokoro for TTS
- Speaker output
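One pass through the pipeline looks roughly like this. The package choices (sounddevice, faster-whisper, llama-cpp-python, the kokoro package) are my assumptions for the sketch; the repo may wire these differently:

```python
# Minimal single-turn pass: mic -> STT -> LLM -> TTS -> speaker.
import numpy as np
import sounddevice as sd
from faster_whisper import WhisperModel
from llama_cpp import Llama
from kokoro import KPipeline

SAMPLE_RATE = 16_000

stt = WhisperModel("base")                 # Whisper STT
llm = Llama(model_path="model.gguf")       # local GGUF LLM via llama.cpp
tts = KPipeline(lang_code="a")             # Kokoro TTS (American English)

# 1. Microphone capture: record 5 seconds of mono audio.
audio = sd.rec(5 * SAMPLE_RATE, samplerate=SAMPLE_RATE, channels=1, dtype="float32")
sd.wait()

# 2. STT: transcribe the captured buffer.
segments, _ = stt.transcribe(audio.flatten())
text = " ".join(seg.text for seg in segments)

# 3. LLM: generate a reply (blocking here; streaming variant below).
reply = llm.create_chat_completion(
    messages=[{"role": "user", "content": text}]
)["choices"][0]["message"]["content"]

# 4 + 5. TTS and speaker output: synthesize and play each audio chunk.
for _, _, chunk in tts(reply, voice="af_heart"):
    sd.play(np.asarray(chunk), samplerate=24_000)  # Kokoro outputs 24 kHz
    sd.wait()
```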
Everything streams - you don't wait for the full LLM response before TTS starts speaking. That's the part that makes it feel like a real conversation instead of a chatbot with a voice skin.
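Concretely, the overlap can be as simple as flushing the token stream to TTS at each sentence boundary. A sketch reusing the objects above and llama-cpp-python's streaming chat API (the repo's actual chunking policy may differ):

```python
import re

def sentences_from_stream(llm, messages):
    """Yield complete sentences as the LLM streams tokens, so TTS can
    start speaking while generation is still in flight."""
    buffer = ""
    for chunk in llm.create_chat_completion(messages=messages, stream=True):
        buffer += chunk["choices"][0]["delta"].get("content", "")
        # Flush everything up to the last sentence-ending punctuation.
        parts = re.split(r"(?<=[.!?])\s+", buffer)
        if len(parts) > 1:
            yield from parts[:-1]
            buffer = parts[-1]
    if buffer.strip():
        yield buffer

# Speak each sentence as soon as it is complete.
for sentence in sentences_from_stream(llm, [{"role": "user", "content": text}]):
    for _, _, chunk in tts(sentence, voice="af_heart"):
        sd.play(np.asarray(chunk), samplerate=24_000)
        sd.wait()
```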
Each chapter is a runnable script + a short CODE.md walkthrough. There's also a small shared library so you can see how the pieces compose into a real system, not just isolated calls.
Why fully local matters here: you can actually see where latency lives. Warm-up, first-audio time, streaming chunk size - these aren't abstractions when you're running it on your own machine.
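For instance, time-to-first-audio falls out of a couple of perf_counter calls around the streaming loop above (an illustrative harness, not the repo's instrumentation):

```python
import time

t0 = time.perf_counter()
first_audio_at = None

for sentence in sentences_from_stream(llm, [{"role": "user", "content": text}]):
    for _, _, chunk in tts(sentence, voice="af_heart"):
        if first_audio_at is None:
            # How long the user waits in silence before the agent speaks.
            first_audio_at = time.perf_counter() - t0
            print(f"time to first audio: {first_audio_at:.2f}s")
        sd.play(np.asarray(chunk), samplerate=24_000)
        sd.wait()
```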
Repo: https://github.com/pguso/voice-agents-from-scratch
Happy to answer questions about the architecture or tradeoffs I ran into.
Mantikos804@reddit
Such a pain in the ass to make, I just ended up using Open WebUI.
purellmagents@reddit (OP)
You will use this setup in production. If you build systems for customers, it's a good idea to know what happens under the hood so you can solve issues efficiently. That's the main goal of this repo: that people (devs) understand the concept.
Mantikos804@reddit
I applaud you for it. That was just my experience with this idea.
youcloudsofdoom@reddit
If this is always-on, why aren't you using a wakeword? Or have you gone PTT? I have been trying to build a similar pipeline but always on/with a wakeword and running on a Pi 5, but found that the computational overhead is too much for such a tiny device, and the lag feels too heavy.
purellmagents@reddit (OP)
It's a tutorial, so it doesn't really run in the background. But my plan is to extend the lib I implemented and make it run on a Raspberry Pi in always-on mode.
youcloudsofdoom@reddit
Ah okay. Will be interested to see the outcomes of your Pi tests then; I do think there are lots of performance optimisations to be had there, given a little time...
ReferenceOwn287@reddit
Thanks for sharing this. I have been working on a Linux desktop assistant for a while and want to add a voice to it next; starred your repo to learn from it.
This is my project - https://github.com/achinivar/meera (posted it in a couple of forums but haven’t got much traction unfortunately)
I see your project has tool calls as well, how are you handling them?
Routing prompts directly to the LLM and asking it to figure out the right tool was getting very unreliable for me, especially as the number of tools grew, so I added a small embedding model that uses exemplars for tool calls and finds the closest match before calling the LLM (shared about it on the repo wiki).
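For reference, the exemplar-matching idea is roughly this. A sketch assuming sentence-transformers and hypothetical tools/phrasings; the actual implementation is on the meera wiki:

```python
import numpy as np
from sentence_transformers import SentenceTransformer

embedder = SentenceTransformer("all-MiniLM-L6-v2")

# A few exemplar utterances per tool.
EXEMPLARS = {
    "get_weather": ["what's the weather like", "will it rain tomorrow"],
    "play_music":  ["play some jazz", "put on my workout playlist"],
}

tool_names, texts = [], []
for tool, examples in EXEMPLARS.items():
    tool_names += [tool] * len(examples)
    texts += examples
exemplar_vecs = embedder.encode(texts, normalize_embeddings=True)

def closest_tool(utterance: str) -> str:
    """Shortlist the tool whose exemplar is nearest in embedding space,
    before handing the (much smaller) decision to the LLM."""
    vec = embedder.encode([utterance], normalize_embeddings=True)[0]
    scores = exemplar_vecs @ vec  # cosine similarity (vectors are normalized)
    return tool_names[int(np.argmax(scores))]

print(closest_tool("is it going to be sunny today"))  # -> get_weather
```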
purellmagents@reddit (OP)
Yeah, I kept it pretty textbook here: the LLM gets a strict "tool router" system prompt, the full JSON Schema list from a small registry, and a few explicit routing hints (e.g. defaulting general knowledge to search). It has to emit a single JSON object with name and arguments - there's no embedding step or exemplar retrieval to narrow tools first. In the script I added a comment that for production you'd want native tool APIs or grammar-constrained generation on top of this. Your exemplar + embedding shortlist is exactly the right next step!
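In code, the textbook version looks roughly like this (illustrative registry and prompt, not the repo verbatim; `llm` is a llama-cpp-python model as in the post above):

```python
import json

# Tiny tool registry: name -> description + JSON Schema for the arguments.
TOOLS = {
    "web_search": {
        "description": "Search the web for general knowledge questions.",
        "parameters": {
            "type": "object",
            "properties": {"query": {"type": "string"}},
            "required": ["query"],
        },
    },
}

SYSTEM_PROMPT = (
    "You are a tool router. Given the user message, respond with a single "
    'JSON object: {"name": <tool name>, "arguments": <object matching the '
    "tool's schema>}. Default general-knowledge questions to web_search.\n\n"
    "Tools:\n" + json.dumps(TOOLS, indent=2)
)

resp = llm.create_chat_completion(messages=[
    {"role": "system", "content": SYSTEM_PROMPT},
    {"role": "user", "content": "who won the 1998 world cup?"},
])
call = json.loads(resp["choices"][0]["message"]["content"])
# e.g. {"name": "web_search", "arguments": {"query": "1998 world cup winner"}}
# For production: native tool-calling APIs or grammar-constrained decoding.
```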
ReferenceOwn287@reddit
Oh yes, the embedding model made a world of difference in tool-call reliability. I was going in loops trying to optimize the prompt and JSON schema before I went back to the drawing board and learnt about them.
ekaj@reddit
Just skimmed it, but this looks actually helpful/handy/wish I had it a couple months ago. Like, really, really wish I did.
purellmagents@reddit (OP)
Sorry that I am too late 😉 hope it can now help others on this journey. Maybe you have ideas for what to add next?