Real-Time Speech-to-Speech Chatbot: Whisper, Llama 3.1, Kokoro, and Silero VAD
Posted by martian7r@reddit | Python | View on Reddit | 20 comments
Hi everyone, please have a look at Vocal-Agent, a real-time speech-to-speech chatbot that integrates Whisper for speech recognition, Silero VAD for voice activity detection, Llama 3.1 for reasoning, and Kokoro ONNX for natural voice synthesis.
🔗 GitHub Repo: https://github.com/tarun7r/Vocal-Agent
🚀 What My Project Does
Vocal-Agent enables seamless real-time spoken conversations with an AI assistant. It processes speech input with low latency, understands queries using LLMs, and generates human-like speech in response. The system also supports web integration (Google Search, Wikipedia, Arxiv) and is extensible through an agent framework.
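For a rough sense of how the cascade fits together, here is a minimal sketch of the ASR → LLM → TTS loop. It is illustrative only, not the repo's actual code: the model names, file paths, and the Ollama/Kokoro calls are assumptions based on each library's public API.

```python
# Illustrative sketch of the cascading pipeline (not Vocal-Agent's actual code).
# Assumes openai-whisper, ollama, kokoro-onnx, and sounddevice are installed,
# an Ollama server is running locally, and the Kokoro ONNX weights are on disk.
import whisper
import ollama
import sounddevice as sd
from kokoro_onnx import Kokoro

asr = whisper.load_model("large-v1")                  # speech -> text
tts = Kokoro("kokoro-v0_19.onnx", "voices.json")      # text -> speech (paths assumed)

def respond(wav_path: str) -> None:
    # 1. Transcribe the user's utterance.
    text = asr.transcribe(wav_path)["text"]

    # 2. Ask the local LLM for a reply via Ollama.
    reply = ollama.chat(
        model="llama3.1:8b",
        messages=[{"role": "user", "content": text}],
    )["message"]["content"]

    # 3. Synthesize and play the reply.
    samples, sample_rate = tts.create(reply, voice="af_sarah", speed=1.0, lang="en-us")
    sd.play(samples, sample_rate)
    sd.wait()

respond("utterance.wav")
```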
🎯 Target Audience
- AI researchers & developers: Experiment with real-time S2S AI interactions.
- Voice-based AI enthusiasts: Build and extend a natural voice-based chatbot.
- Accessibility-focused applications: Enhance spoken communication tools.
- Open-source contributors: Collaborate on an evolving project.
🔍 How It Differs from Existing Alternatives
Unlike existing voice assistants, Vocal-Agent offers:
✅ Fully open-source implementation with an extensible framework.
✅ LLM-powered reasoning (Llama 3.1 8B) via Agno instead of rule-based responses.
✅ ONNX-optimized TTS for efficient voice synthesis.
✅ Low-latency pipeline for real-time interactivity.
✅ Web search capabilities integrated into the agent system.
✨ Key Features
- 🎙 Speech Recognition: Whisper (large-v1) + Silero VAD
- 🤖 Multimodal Reasoning: Llama 3.1 8B via Ollama & Agno Agent
- 🌐 Web Integration: Google Search, Wikipedia, Arxiv
- 🗣 Natural Voice Synthesis: Kokoro-82M ONNX
- ⚡ Low-Latency Processing: Optimized audio pipeline
- 🔧 Extensible Tooling: Expand agent capabilities easily
Would love to hear your feedback, suggestions, and contributions! 🚀
chub79@reddit
Brilliant project. I only knew of paid products but it's awesome to see that OSS competes with them :)
martian7r@reddit (OP)
Actually, it is still a cascading S2S pipeline. Building a true end-to-end S2S model would require a lot of data and resources, like A100 GPUs, to train.
Recent-Ad869@reddit
Would it be possible to run this smoothly on a M1 Mac?
BepNhaVan@reddit
Can translation be added to this for real-time translation?
martian7r@reddit (OP)
Depends on the LLM used. You can swap in any model that Ollama runs, and many of them support multiple languages for translation. Also check which languages Kokoro supports for synthesis.
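For illustration, a translation pass could be layered on the same Ollama call by constraining the system prompt. This is only a sketch: the model name and prompt are placeholders, not code from the repo, and the reply language still has to be one Kokoro can synthesize.

```python
# Hypothetical sketch: steer the LLM toward translation instead of open-ended chat.
# The model name, prompt, and target language are placeholders.
import ollama

def translate(utterance: str, target_language: str = "French") -> str:
    response = ollama.chat(
        model="llama3.1:8b",
        messages=[
            {"role": "system",
             "content": f"Translate everything the user says into {target_language}. "
                        "Reply with the translation only."},
            {"role": "user", "content": utterance},
        ],
    )
    return response["message"]["content"]

print(translate("Where is the train station?"))
```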
BepNhaVan@reddit
Can you wrap this in a Docker container?
martian7r@reddit (OP)
Planning to do it soon
Amazing_Upstairs@reddit
What version of Python are you on? On WSL I could not resolve the dependencies in requirements.txt.
martian7r@reddit (OP)
requires-python = ">=3.9"
Amazing_Upstairs@reddit
3.12 didn't work on WSL
Amazing_Upstairs@reddit
Thanks, it works. It seems a bit arbitrary whether it goes to Arxiv, Google, Ollama, or Wikipedia, even when I specifically say "google weather Cape Town".
martian7r@reddit (OP)
Improve the prompt; it's open-ended, so which tool gets used depends on how well you phrase the prompt.
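One way to make the routing more deterministic is to give the agent explicit instructions about when to use each tool. The sketch below assumes Agno's documented Agent/tools layout; the import paths and parameter names come from Agno's docs, not from Vocal-Agent, and may need adjusting.

```python
# Hedged sketch of steering tool selection with explicit agent instructions.
# Import paths follow Agno's documented layout and may differ from Vocal-Agent's setup.
from agno.agent import Agent
from agno.models.ollama import Ollama
from agno.tools.googlesearch import GoogleSearchTools
from agno.tools.wikipedia import WikipediaTools
from agno.tools.arxiv import ArxivTools

agent = Agent(
    model=Ollama(id="llama3.1:8b"),
    tools=[GoogleSearchTools(), WikipediaTools(), ArxivTools()],
    instructions=[
        "If the user explicitly names a source (e.g. 'google ...'), use that tool.",
        "Use Google Search for current events and weather.",
        "Use Wikipedia for encyclopedic facts and Arxiv for papers.",
        "Answer directly from the model when no lookup is needed.",
    ],
)

agent.print_response("google weather Cape Town")
```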
Amazing_Upstairs@reddit
Also not sure if there's a way to skip a long incorrect response
Amazing_Upstairs@reddit
Also it often starts producing results while I'm still talking even with the very slightest of pauses.
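A likely knob here, assuming the project uses Silero VAD's streaming utilities (which is an assumption on my part), is the minimum-silence duration that decides when an utterance has ended; the values below are illustrative.

```python
# Sketch: make Silero VAD wait longer before deciding the speaker has finished.
# Assumes the streaming VADIterator utility; parameter values are illustrative.
import torch

model, utils = torch.hub.load("snakers4/silero-vad", "silero_vad")
(_, _, _, VADIterator, _) = utils

vad = VADIterator(
    model,
    threshold=0.5,                 # speech probability needed to count as speech
    sampling_rate=16000,
    min_silence_duration_ms=800,   # wait ~0.8 s of silence before ending the turn
    speech_pad_ms=300,             # keep some padding around detected speech
)
```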
Amazing_Upstairs@reddit
Windows support please
Amazing_Upstairs@reddit
Also does not install on Windows Subsystem for Linux
martian7r@reddit (OP)
Actually, it supports Windows as well. Make sure you have a GPU, run the LLM locally with Ollama, place the Kokoro ONNX models manually in the project directory, and install espeak-ng:
https://github.com/espeak-ng/espeak-ng/blob/master/docs/guide.md
Amazing_Upstairs@reddit
You'll have to provide way better instructions than that
martian7r@reddit (OP)
I've updated the README file, please check now.