Real-Time Speech-to-Speech Chatbot: Whisper, Llama 3.1, Kokoro, and Silero VAD

Posted by martian7r@reddit | Python

Hi everyone, please have a look at Vocal-Agent, a real-time speech-to-speech chatbot that combines Whisper for speech recognition, Silero VAD for voice activity detection, Llama 3.1 for reasoning, and Kokoro ONNX for natural-sounding voice synthesis.

🔗 GitHub Repo: https://github.com/tarun7r/Vocal-Agent

🚀 What My Project Does

Vocal-Agent enables seamless real-time spoken conversations with an AI assistant. It processes speech input with low latency, understands queries using an LLM, and replies with human-like synthesized speech. The system also integrates web sources (Google Search, Wikipedia, arXiv) and is extensible through an agent framework.
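
Under the hood this is essentially a VAD → ASR → LLM → TTS loop. Here's a minimal sketch of that flow for intuition; this is not the repo's actual code, the Kokoro and Ollama calls are assumptions based on their public Python packages, and the model/voice file names are placeholders:

```python
# Illustrative VAD -> ASR -> LLM -> TTS loop; not the repo's actual code.
import torch
import whisper
import ollama                      # assumes a local Ollama server with llama3.1 pulled
import sounddevice as sd
from kokoro_onnx import Kokoro     # API assumed from the kokoro-onnx package

# Silero VAD: detects whether an audio chunk actually contains speech
vad_model, vad_utils = torch.hub.load("snakers4/silero-vad", "silero_vad")
get_speech_timestamps, _, read_audio, *_ = vad_utils

asr = whisper.load_model("base")                    # speech -> text
tts = Kokoro("kokoro-v0_19.onnx", "voices.bin")     # text -> speech (placeholder paths)

def respond(wav_path: str) -> None:
    audio = read_audio(wav_path, sampling_rate=16000)
    # Ignore chunks where VAD finds no speech
    if not get_speech_timestamps(audio, vad_model, sampling_rate=16000):
        return
    text = asr.transcribe(wav_path)["text"]
    reply = ollama.chat(
        model="llama3.1:8b",
        messages=[{"role": "user", "content": text}],
    )["message"]["content"]
    samples, sample_rate = tts.create(reply, voice="af_sarah", speed=1.0, lang="en-us")
    sd.play(samples, sample_rate)
    sd.wait()
```

The actual project streams microphone audio continuously rather than working from a file, but the stages are the same: gate on VAD, transcribe, reason, then speak.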

🎯 Target Audience

🔍 How It Differs from Existing Alternatives

Unlike existing voice assistants, Vocal-Agent offers:
✅ Fully open-source implementation with an extensible framework.
✅ LLM-powered reasoning (Llama 3.1 8B) via Agno instead of rule-based responses.
✅ ONNX-optimized TTS for efficient voice synthesis.
✅ Low-latency pipeline for real-time interactivity.
✅ Web search (Google Search, Wikipedia, arXiv) integrated into the agent system; see the sketch after this list.
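
For the Agno side, wiring Llama 3.1 to web tools looks roughly like the following. This is a hypothetical sketch, not the repo's configuration; the tool classes and module paths (GoogleSearchTools, WikipediaTools, ArxivTools, agno.models.ollama.Ollama) are assumptions about Agno's API and may differ from what the repo uses:

```python
# Hypothetical Agno agent wiring; module and tool names are assumptions, check the repo.
from agno.agent import Agent
from agno.models.ollama import Ollama
from agno.tools.googlesearch import GoogleSearchTools
from agno.tools.wikipedia import WikipediaTools
from agno.tools.arxiv import ArxivTools

agent = Agent(
    model=Ollama(id="llama3.1:8b"),                  # local Llama 3.1 8B served by Ollama
    tools=[GoogleSearchTools(), WikipediaTools(), ArxivTools()],
    instructions="Answer spoken questions concisely; cite sources when you search.",
    markdown=False,                                  # plain text is friendlier for TTS
)

# The transcribed user utterance goes in; the reply text feeds the TTS stage.
response = agent.run("What's new on arXiv about speech synthesis?")
print(response.content)
```

Using an agent framework here is what makes the assistant extensible: adding another capability is a matter of dropping in another tool rather than hand-writing intent rules.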

✨ Key Features

Would love to hear your feedback, suggestions, and contributions! 🚀