Vocalis: Local Conversational AI Assistant (Speech ↔️ Speech in Real Time with Vision Capabilities)
Posted by townofsalemfangay@reddit | LocalLLaMA | View on Reddit | 39 comments
Hey r/LocalLLaMA 👋
Been a long project, but I have Just released **Vocalis**, a real-time local assistant that goes full speech-to-speech—Custom VAD, Faster Whisper ASR, LLM in the middle, TTS out. Built for speed, fluidity, and actual usability in voice-first workflows. Latency will depend on your setup, ASR preference and LLM/TTS model size (all configurable via the .env in backend).
💬 **Talk to it like a person**.
🎧 **Interrupt mid-response** (barge-in).
🧠 **Silence detection for follow-ups** (the assistant will speak without you following up based on the context of the conversation).
🖼️ **Image analysis support to provide multi-modal context to non-vision capable endpoints** ([SmolVLM-256M](https://huggingface.co/HuggingFaceTB/SmolVLM-256M-Instruct)).
🧾 **Session save/load support** with full context.
It uses your local LLM via OpenAI-style endpoint (LM Studio, llama.cpp, GPUStack, etc), and any TTS server (like my [Orpheus-FastAPI](https://github.com/Lex-au/Orpheus-FastAPI) or for super low latency, [Kokoro-FastAPI](https://github.com/remsky/Kokoro-FastAPI)). Frontend is React, backend is FastAPI—WebSocket-native with real-time audio streaming and UI states like *Listening*, *Processing*, and *Speaking*.
**Speech Recognition Performance (using Vocalis-Q4\_K\_M + Koroko-FASTAPI TTS)**
The system uses Faster-Whisper with the `base.en` model and a beam size of 2, striking an optimal balance between accuracy and speed. This configuration achieves:
* **ASR Processing**: \~0.43 seconds for typical utterances
* **Response Generation**: \~0.18 seconds
* **Total Round-Trip Latency**: \~0.61 seconds
Real-world example from system logs:
INFO:faster_whisper:Processing audio with duration 00:02.229
INFO:backend.services.transcription:Transcription completed in 0.51s: Hi, how are you doing today?...
INFO:backend.services.tts:Sending TTS request with 147 characters of text
INFO:backend.services.tts:Received TTS response after 0.16s, size: 390102 bytes
There's a full breakdown of the architecture and latency information on my readme.
GitHub: [https://github.com/Lex-au/VocalisConversational](https://github.com/Lex-au/VocalisConversational)
model (optional): [https://huggingface.co/lex-au/Vocalis-Q4\_K\_M.gguf](https://huggingface.co/lex-au/Vocalis-Q4_K_M.gguf)
Some demo videos during project progress here: [https://www.youtube.com/@AJ-sj5ik](https://www.youtube.com/@AJ-sj5ik)
License: Apache 2.0
Let me know what you think or if you have questions!
39 Comments
Omarashraf2823@reddit
rbgo404@reddit
kzoltan@reddit
townofsalemfangay@reddit (OP)
kzoltan@reddit
Predatedtomcat@reddit
townofsalemfangay@reddit (OP)
SeriousGrab6233@reddit
townofsalemfangay@reddit (OP)
indian_geek@reddit
HelpfulHand3@reddit
Traditional_Tap1708@reddit
HelpfulHand3@reddit
Chromix_@reddit
HelpfulHand3@reddit
poli-cya@reddit
HelpfulHand3@reddit
Chromix_@reddit
townofsalemfangay@reddit (OP)
HelpfulHand3@reddit
townofsalemfangay@reddit (OP)
poli-cya@reddit
townofsalemfangay@reddit (OP)
poli-cya@reddit
townofsalemfangay@reddit (OP)
poli-cya@reddit
townofsalemfangay@reddit (OP)
ElTejanoLoco@reddit
Carchofa@reddit
townofsalemfangay@reddit (OP)
Carchofa@reddit
Background_Put_4978@reddit
townofsalemfangay@reddit (OP)
Radiant_Dog1937@reddit
townofsalemfangay@reddit (OP)
Maximus-CZ@reddit
townofsalemfangay@reddit (OP)
Chromix_@reddit
townofsalemfangay@reddit (OP)