Vocalis: Local Conversational AI Assistant (Speech ↔️ Speech in Real Time with Vision Capabilities)

Posted by townofsalemfangay@reddit | LocalLLaMA

Hey r/LocalLLaMA 👋

It's been a long project, but I've just released **Vocalis**, a real-time local assistant that goes full speech-to-speech: custom VAD, Faster Whisper ASR, LLM in the middle, TTS out. Built for speed, fluidity, and actual usability in voice-first workflows. Latency will depend on your setup, ASR preference, and LLM/TTS model size (all configurable via the .env in backend).

* 💬 **Talk to it like a person**.
* 🎧 **Interrupt mid-response** (barge-in).
* 🧠 **Silence detection for follow-ups** (if you don't follow up, the assistant will speak on its own based on the context of the conversation).
* 🖼️ **Image analysis support to provide multi-modal context to non-vision capable endpoints** ([SmolVLM-256M](https://huggingface.co/HuggingFaceTB/SmolVLM-256M-Instruct)).
* 🧾 **Session save/load support** with full context.

It uses your local LLM via an OpenAI-style endpoint (LM Studio, llama.cpp, GPUStack, etc.), and any TTS server (like my [Orpheus-FastAPI](https://github.com/Lex-au/Orpheus-FastAPI) or, for super low latency, [Kokoro-FastAPI](https://github.com/remsky/Kokoro-FastAPI)). Frontend is React, backend is FastAPI, WebSocket-native with real-time audio streaming and UI states like *Listening*, *Processing*, and *Speaking*.

**Speech Recognition Performance (using Vocalis-Q4\_K\_M + Kokoro-FastAPI TTS)**

The system uses Faster-Whisper with the `base.en` model and a beam size of 2, which gives a good balance between accuracy and speed. This configuration achieves:

* **ASR Processing**: \~0.43 seconds for typical utterances
* **Response Generation**: \~0.18 seconds
* **Total Round-Trip Latency**: \~0.61 seconds

Real-world example from system logs:

    INFO:faster_whisper:Processing audio with duration 00:02.229
    INFO:backend.services.transcription:Transcription completed in 0.51s: Hi, how are you doing today?...
    INFO:backend.services.tts:Sending TTS request with 147 characters of text
    INFO:backend.services.tts:Received TTS response after 0.16s, size: 390102 bytes

There's a full breakdown of the architecture and latency information in the README.

GitHub: [https://github.com/Lex-au/VocalisConversational](https://github.com/Lex-au/VocalisConversational)

Model (optional): [https://huggingface.co/lex-au/Vocalis-Q4\_K\_M.gguf](https://huggingface.co/lex-au/Vocalis-Q4_K_M.gguf)

Some demo videos from during the project's progress: [https://www.youtube.com/@AJ-sj5ik](https://www.youtube.com/@AJ-sj5ik)

License: Apache 2.0

Let me know what you think or if you have questions!
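
If you want to sanity-check latency on your own setup, here is a minimal sketch that pulls the ASR and TTS timings out of log lines shaped like the excerpt above and adds them up. The regexes and the `summarize_latency` helper are my own assumptions based only on those two message formats; they are not part of Vocalis, and the sum leaves out LLM generation time because the excerpt doesn't log it.

```python
import re

# Log lines copied from the post. The patterns below are assumptions based on
# these two message formats only, not on Vocalis's actual logging code.
SAMPLE_LOG = """\
INFO:faster_whisper:Processing audio with duration 00:02.229
INFO:backend.services.transcription:Transcription completed in 0.51s: Hi, how are you doing today?...
INFO:backend.services.tts:Sending TTS request with 147 characters of text
INFO:backend.services.tts:Received TTS response after 0.16s, size: 390102 bytes
"""

ASR_RE = re.compile(r"Transcription completed in (\d+(?:\.\d+)?)s")
TTS_RE = re.compile(r"Received TTS response after (\d+(?:\.\d+)?)s")


def summarize_latency(log_text):
    """Return ASR, TTS, and combined latency in seconds for the first turn found."""
    asr = ASR_RE.search(log_text)
    tts = TTS_RE.search(log_text)
    if asr is None or tts is None:
        raise ValueError("log does not contain both an ASR and a TTS timing line")
    asr_s = float(asr.group(1))
    tts_s = float(tts.group(1))
    # Round the sum so floating-point noise doesn't show up in the report.
    return {"asr_s": asr_s, "tts_s": tts_s, "total_s": round(asr_s + tts_s, 2)}


if __name__ == "__main__":
    print(summarize_latency(SAMPLE_LOG))
```

For the sample turn this reports 0.51 s of ASR plus 0.16 s of TTS. That single utterance is a little slower on the ASR side than the \~0.43 s typical figure, which is about what you'd expect from one data point.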