Vocalis: Local Conversational AI Assistant (Speech ↔️ Speech in Real Time with Vision Capabilities)

Posted by townofsalemfangay@reddit | LocalLLaMA | View on Reddit | 39 comments

Hey r/LocalLLaMA 👋 Been a long project, but I have Just released **Vocalis**, a real-time local assistant that goes full speech-to-speech—Custom VAD, Faster Whisper ASR, LLM in the middle, TTS out. Built for speed, fluidity, and actual usability in voice-first workflows. Latency will depend on your setup, ASR preference and LLM/TTS model size (all configurable via the .env in backend). 💬 **Talk to it like a person**. 🎧 **Interrupt mid-response** (barge-in). 🧠 **Silence detection for follow-ups** (the assistant will speak without you following up based on the context of the conversation). 🖼️ **Image analysis support to provide multi-modal context to non-vision capable endpoints** ([SmolVLM-256M](https://huggingface.co/HuggingFaceTB/SmolVLM-256M-Instruct)). 🧾 **Session save/load support** with full context. It uses your local LLM via OpenAI-style endpoint (LM Studio, llama.cpp, GPUStack, etc), and any TTS server (like my [Orpheus-FastAPI](https://github.com/Lex-au/Orpheus-FastAPI) or for super low latency, [Kokoro-FastAPI](https://github.com/remsky/Kokoro-FastAPI)). Frontend is React, backend is FastAPI—WebSocket-native with real-time audio streaming and UI states like *Listening*, *Processing*, and *Speaking*. **Speech Recognition Performance (using Vocalis-Q4\_K\_M + Koroko-FASTAPI TTS)** The system uses Faster-Whisper with the `base.en` model and a beam size of 2, striking an optimal balance between accuracy and speed. This configuration achieves: * **ASR Processing**: \~0.43 seconds for typical utterances * **Response Generation**: \~0.18 seconds * **Total Round-Trip Latency**: \~0.61 seconds Real-world example from system logs: INFO:faster_whisper:Processing audio with duration 00:02.229 INFO:backend.services.transcription:Transcription completed in 0.51s: Hi, how are you doing today?... INFO:backend.services.tts:Sending TTS request with 147 characters of text INFO:backend.services.tts:Received TTS response after 0.16s, size: 390102 bytes There's a full breakdown of the architecture and latency information on my readme. GitHub: [https://github.com/Lex-au/VocalisConversational](https://github.com/Lex-au/VocalisConversational) model (optional): [https://huggingface.co/lex-au/Vocalis-Q4\_K\_M.gguf](https://huggingface.co/lex-au/Vocalis-Q4_K_M.gguf) Some demo videos during project progress here: [https://www.youtube.com/@AJ-sj5ik](https://www.youtube.com/@AJ-sj5ik) License: Apache 2.0 Let me know what you think or if you have questions!

Reply to Post

39 Comments

[-]

Omarashraf2823@reddit

[-]

rbgo404@reddit

You can check out our cookbook as well! https://docs.inferless.com/cookbook/serverless-customer-service-bot

[-]

kzoltan@reddit

I'm running Ollama to host both the LLM (qwen2.5 14b q8) and the TTS model, on a single 4090. The speed seems a bit off for some reason, even though the memory use stays around 20GB (the OS does not use the card at all, so almost 24GB is available). Is the below normal from Orpheus-FastAPI using the **Q2\_K** model? ... <|audio|>tara: Got it! The response time can vary slightly depending on various factors, but generally, my responses are designed to be quick and efficient. If you'd like to test further or have any other questions in the future, feel free to let me know!<|eot\_id|> ... Progress: 159.5 tokens/sec, est. 13.3s audio generated, **2246 tokens**, 156 chunks in **14.1s**

[-]

townofsalemfangay@reddit (OP)

Hi! Orpheus still isn't quite there yet in terms of latency. But your result of 13 seconds of audio in 14 seconds is below 0 real time. But that could be due to the fact you're running a decently sized LLM and TTS on a single GPU. From my experiencing building Orpheus-FASTAPI, the depedency on SNAC is a real bottleneck. I'm looking at reworking some stuff soon to provide streaming directly via API to play chunks as is, instead of a whole compiled audio file. I would recommend trying something like Koroko-FASTAPI in the interim. You can see latency results in my [demo video](https://www.youtube.com/watch?v=2slWwsHTNIA&) here.

[-]

kzoltan@reddit

Ah, so the full sound file gets transferred before playing it. I did not see that in the code yet. That explains the delay I'm experiencing. With smaller models, it is a bit better, but it still gets slower after some time for some reason. Anyway, let's see how much the chunking improves it. I will familiarize myself with your code a bit in the meantime...

[-]

Predatedtomcat@reddit

How does it compare to RealtimeSTT and RealtimeTTS from koljab

[-]

townofsalemfangay@reddit (OP)

I couldn't say for certain! I have not tried either of those. Today I'm going to upload a demo video and append to the front of my git repo as requested by others. If you have already tried, let me know your experience.

[-]

SeriousGrab6233@reddit

Koljab's realtimeSTT,TTS, and stream2Sentence probably are faster than what you have currently but he hasnt made any polished applications like you have

[-]

townofsalemfangay@reddit (OP)

**Hey everyone!** Thanks to some great feedback from this community, I’ve added a **Windows installation guide and demonstration video** right at the top of my repo for easier access. u/indian_geek u/Maximus-CZ (requesters) 📹 **Video:** [https://www.youtube.com/watch?v=WmG7fNNFiRo](https://www.youtube.com/watch?v=WmG7fNNFiRo) 📁 **Repo:** [https://github.com/Lex-au/Vocalis?tab=readme-ov-file#video-demonstration-of-setup-and-usage](https://github.com/Lex-au/Vocalis?tab=readme-ov-file#video-demonstration-of-setup-and-usage) Appreciate all the insight so far—feel free to check it out and let me know what you think!

[-]

indian_geek@reddit

Although you mention 610ms as latency, the youtube video demos I see seem to have higher latency times. Can you please clarify?

[-]

HelpfulHand3@reddit

What you're seeing is the delay after the silence threshold is reached. If it would send the TTS/LLM request instantly after you stop speaking then you wouldn't get to complete a thought any time you paused for a moment. Turn detection is still the hardest piece of the puzzle, probably requiring custom trained model that can detect end of turn reliably from being sent audio. I wonder if that's what Maya has going on, it always seemed to so quickly know when you finished speaking but without many false positives.

[-]

Traditional_Tap1708@reddit

Livekit has a transformer based model that does something similar. I’m still experimenting with. You can check it out.

[-]

HelpfulHand3@reddit

They do but they have restrictive licensing on their components

[-]

Chromix_@reddit

There is a simpler solution as long as the overall end-to-end reaction time is that high: *Reduce* the silence threshold. If the user continues speaking while the pipeline runs then abort it earliest at token generation, not before. That way the KV cache is already warmed up and prompt processing will be faster the next time. Also, with a bit of work you can potentially let the STT continue from the previous snippet of transcribed audio, reducing the reaction time even further.

[-]

HelpfulHand3@reddit

Yes, I'm doing this in my own chat interface, to a degree. It runs live transcription (unlike Vocalis which seems to do it all after silence threshold is satisfied) and on every 500ms pause, it caches an LLM result with the current transcript. If after a longer silence threshold is met and the transcript hasn't changed (normalized for punctuation etc) it uses the cached response. This can be extended but I never got around to it. You can start buffering the TTS for instant playback as well, but all you're going to save is that 600ms not the 1-2s of silence threshold. [https://imgur.com/a/lnPBDrk](https://imgur.com/a/lnPBDrk)

[-]

poli-cya@reddit

Wow, that's insanely impressive. Is it something you think a tinkerer could implement in a few hours? I've got a 4090 laptop I'd love to try it out on.

[-]

HelpfulHand3@reddit

Probably not, I'd recommend just using this or OP's FastAPI git.

[-]

Chromix_@reddit

Exactly, that's the way to go if you want to reduce latency for the user - which should be one of the main goals, aside from avoid verbose LLM responses.

[-]

townofsalemfangay@reddit (OP)

Thanks for the wonderful insights! Vocalis uses a silence-threshold approach for a few reasons: * **Reliability**: Complete utterance transcription tends to be more accurate than partial fragments * **Usability**: For most conversational use cases, the natural pauses work well with the flow * **Development Timeline**: I had to make some trade-offs to ship a stable v1.0 u/HelpfulHand3 \- Your approach with live transcription is pretty damn good. I actually prototyped something similar early on but ran into issues with: 1. False positives in transcription that would later be corrected 2. Higher resource usage on lower-end systems 3. Difficulty in determining when a thought was truly complete I do like your idea about trimming initial silence from TTS responses (with regards to Orpheus) - that's something I could definitely optimise further. u/Chromix_ \- Keeping the KV cache warm and reducing the silence threshold is definitely a good direction. The challenge was balancing this with a good UX across different speaking styles. Your ideas are definitely in line with where I want to take Vocalis: * Implementing a true seamless speculative execution system where it starts processing before the user finishes speaking * Smarter turn detection that adapts to the user's speaking style Both of these will require external models, at bare minimum something like SentencePiece and either: 1. A direct change to the LLM endpoint with an additional speculative decoder, or 2. Another specialised model in the middle specifically for turn detection (similar to what commercial assistants use) At that point though, I begin to wonder if it's beyond just me alone. As a solo developer, there's a constant balance between optimisation and keeping the project accessible for consumer hardware. If anyone wants to contribute or experiment with these approaches, I'd welcome collaboration on these more advanced features.

[-]

HelpfulHand3@reddit

\`\`\`False positives in transcription that would later be corrected\`\`\` Yeah, this was coming up for me as well, but I set the delay before the LLM call to be just outside the window for transcript adjustment. I believe for my particular settings was around 500ms. \`\`\`Higher resource usage on lower-end systems\`\`\` True, but you're unlikely to be running the STT while the TTS is inferencing. VAD can stop the TTS as an interrupt (or barge-in as you've termed it.) \`\`\`Difficulty in determining when a thought was truly complete\`\`\` True. A model like the smart turn or LiveKit's [Turn Detector](https://huggingface.co/livekit/turn-detector/) (I think their licensing is restrictive) if low enough latency would be a good preliminary check before running the LLM. But for my purposes, 500ms debounced after the last transcription change was enough for an improvement in latency with minimal issues. Speculative decoding would be cool!

[-]

townofsalemfangay@reddit (OP)

Hi! Good observation—the handful of YouTube videos floating around were recorded during the earlier stages of the project when I was using the \`large\` model with a beam size of 5 for Whisper. Sadly, my accent (Australian) leads to transcription woes with with both tiny/base. That setup prioritised transcription quality for demo clarity but added a fair bit of latency. The default configuration that ships with Vocalis now uses the \`base\` model with a beam size of 2—much faster. The 610ms latency figure mentioned in the README reflects that default setup on tests with my 4090. Appreciate you checking it out!

[-]

poli-cya@reddit

Alright, took me a couple of hours but I finally got it working. Almost immediately, I think I've got a suggestion- it needs a button to erase memory/start a new conversation. I've not even installed the vision portion but it keeps wanting to talk about analyzing an image it is convinced I uploaded. Speech recognition is hit or miss, but I am working with a laptop mic from a few feet away- gonna try to access through my phone or pull out the old blue yeti mic to take poor hardware out of the equation. You also make the orpheus-fastapi, right? If so, any ideas why it wouldn't detect my GPU on a 4090 laptop? IT says "🖥️ Hardware: CPU only (No CUDA GPU detected)", even though I'm running the q8 orpheus and your Llama 8B on it from the same PC. As for general suggestions, going for a IQ4/imatrix and/or offering larger sizes on the assistant might be nice. Also maybe offering a middle-road Q6-something of Orpheus, as the Q8 won't run real-time on a 4090 laptop but Q4 may be a big dip. I guess overall, I'm saying bringing the LLM up and the TTS down a notch might be the best balance. Super cool project, the voice of the assistant is damn close to OpenAI advanced mode. I'm gonna try to wipe all the memory manually so I can start fresh once I get the better mic in play. I'll try to give you another update once I've had some more time with it.

[-]

townofsalemfangay@reddit (OP)

Hi! Really appreciate the insight—glad you got it running! You don’t actually need to use the fine-tuned model I uploaded to my [Hugging Face](https://huggingface.co/collections/lex-au/vocalis-67fb2b4481692bf8e2aea0db) for Vocalis; any model with OpenAI-compatible endpoints will work just fine. As for the memory reset, there is a way to do that: in the top-left corner, click the three-line menu to open the sidebar. From there, you can erase the conversation memory. Under “Session Management” you can also save, rename, load, or delete conversations. If you want a clean slate, I’d suggest hitting the hangup button, clearing memory via the sidebar, and then clicking call again to restart the session fresh. Yep—I also built Orpheus-FASTAPI! Not sure why it wouldn’t detect your 4090 laptop GPU, though. I’d double-check you’ve got the latest [GeForce drivers installed](https://www.nvidia.com/en-us/geforce-now/download/), and also make sure you’ve got the [CUDA Toolkit](https://developer.nvidia.com/cuda-toolkit) set up properly. Running `nvidia-smi` in CMD should confirm whether CUDA is being picked up. That usually resolves it. Also totally agree re: quantisation balance. I will look at adding more quants for my fine-tune specifically for vocalis, or alternatively, drop the safetensors so everyone can have it themselves. Glad the assistant voice landed well! Looking forward to your next update—keen to hear how it runs with the better mic.

[-]

poli-cya@reddit

I may be crazy, but it looks like your github link is 404'ing. Is it just a bad link or did you take it down somehow?

[-]

townofsalemfangay@reddit (OP)

Hi! You can find it here https://github.com/Lex-au/Vocalis

[-]

poli-cya@reddit

That's funny, I ended up finding it many hours ago and got it all implemented throughout the day. I just left you a big message with some of my takes/suggestions. Thanks so much for all the fantastic work, even with the hiccups it feels super futuristic and insane it can run as fast as it does all on my laptop.

[-]

townofsalemfangay@reddit (OP)

I'm so happy to hear that 💖

[-]

ElTejanoLoco@reddit

The link is incorrect, the correct link is [https://github.com/Lex-au/Vocalis](https://github.com/Lex-au/Vocalis)

[-]

Carchofa@reddit

I've reviewed the GitHub documentation and couldn't find any information about tool calling capabilities. I'm very interested in seeing this implemented in the future, potentially with a flow like this: user input -> assistant response (possibly indicating a tool use) -> tool execution -> the assistant then either provides the final answer based on the results or reiterates by calling more tools. I've been trying to prototype this by having the model output JSON containing both an "answer" field and a "tool_call" field. If the "tool_call" field is empty, it stops; otherwise, it loops back with the tool output. This is to avoid the problem of not being able to generate a response and a tool call with the same api call A key challenge I'm facing is efficiently streaming the response because the initial part of the JSON is always the "answer" tag, which the LLM has to generate in full before any potential tool call. Filtering out this consistent initial part of the JSON to enable smoother streaming is proving difficult. Great project overall, and thank you for your efforts!

[-]

townofsalemfangay@reddit (OP)

Thanks for the interest in Vocalis! Tool calling is an interesting direction but wasn't part of the initial vision for the project. Vocalis was designed primarily as a speech-to-speech conversational assistant rather than a task automation system. The focus has been on creating natural, fluid voice interactions with minimal latency. That said, I can see why you'd want this capability! A few thoughts on your approach: 1. **Streaming challenges**: You've hit on one of the key issues - streaming responses becomes tricky with structured output like JSON. When the model has to complete the "answer" field before starting the "tool\_call" field, it breaks the immediate nature of conversation. 2. **Voice UI considerations**: Tool calling would need a different UI/UX in a voice interface. Unlike text interfaces where seeing JSON or function calls is normal, voice conversations need natural transitions between direct answers and tool invocations. 3. **Alternative approach**: Rather than JSON output parsing, you might consider: * Function calling via the OpenAI-compatible API (if your model supports it) * Adding a post-processing layer that detects tool call intents in natural language * Using semantic routing where certain phrases trigger specific tools If you're really keen on adding this to Vocalis, the integration point would be in the LLM client service. You'd need to modify the response handling in `backend/services/llm.py` and potentially add a new tool execution service. The Model Context Protocol (MCP) is an interesting approach, but integrating it into a speech-to-speech flow would require significant UI/UX work to make the experience feel natural. I'd be curious to see what you build if you decide to fork the project! It's always interesting to see different directions people take with the codebase.

[-]

Carchofa@reddit

Thanks for the suggestions. I discarded the regular way to do function calling (Openai SDK tool use) to allow the LLM to generate a response and a tool call in the same response. But now that I think about it, I could just do another API call afterwards in which I only look at the tool call and ignore any response, as it would mean there's no need to call any tools. I'm looking at the code right now and I'm having some trouble setting up the LLM and TTS to use Groq for testing (I'm GPU poor and I haven't studied coding), but I'm trying to keep it all OpenAI compatible so that it's easy to go back to a local option. Once I have that, I'll try to implement tool calling only in the backend and then I'll try to use Cursor to update the frontend (it will probably need some fixing by someone experienced after that). Thanks for keeping the project super organized. It has been easy to identify where everything is so far.

[-]

Background_Put_4978@reddit

Well, this is amazing. I’m so glad someone else understood the concept of silence as an important input! Congratulations man, this is a milestone.

[-]

townofsalemfangay@reddit (OP)

Thank you for the kind words ❤️

[-]

Radiant_Dog1937@reddit

The github link returns a 404.

[-]

townofsalemfangay@reddit (OP)

Hi! Try: [https://github.com/Lex-au/Vocalis](https://github.com/Lex-au/Vocalis)

[-]

Maximus-CZ@reddit

you should add video demo near the top

[-]

townofsalemfangay@reddit (OP)

Will get one out tomorrow ❤️

[-]

Chromix_@reddit

Your custom Q4\_K\_M quant was created without imatrix. You're losing [quite some quality there](https://www.reddit.com/r/LocalLLaMA/comments/1993iro/ggufs_quants_can_punch_above_their_weights_now/). Better recreate that one. Very nice that this gives users the option to select different options for parts of the pipeline and thus customize speed vs. quality. The total latency of 610ms for the system to respond is slightly above [what's expected](https://www.reddit.com/r/LocalLLaMA/comments/1jjqsa0/comment/mjz66fe/) in conversations with humans, but not too high to feel unnatural yet. Do you stream the LLM response while it's being generated into the TTS and stream-play the resulting audio to reduce latency?

[-]

townofsalemfangay@reddit (OP)

Hi! Thanks for the kind words and the heads-up about `imatrix`—I'll definitely take a squiz at that. I might even drop the safetensors entirely so folks can roll their own quant with whatever settings suit their setup. Latency-wise, yeah, you're spot on. The biggest bottleneck for me (running on a 4090) is actually ASR. With an Aussie accent, I can’t really use `tiny.en` or super low beam sizes without sacrificing transcription accuracy—unless I speak *very* slowly and loudly into the mic, lol. So I tend to default to `base` with beam size 2, which adds a bit of overhead but gives me solid results. That said, users can absolutely squeeze more responsiveness by dialing down model sizes or switching to faster LLM/TTS endpoints—it’s all trade-offs between speed, stability, and clarity. And yeah—unlike Sesame, we don’t have a team of inference engineers and racks of infra smoothing out the edge cases. But for a fully local, open-source stack? I think we’re getting *really* close to that magic threshold of natural responsiveness 👌 The full architecture is up on my GitHub. Everything runs async via websockets—voice detection thresholds kick off ASR, which then hands off payloads to the LLM before it bounces back to TTS then the browser. It’s basically an orchestrator between these services with minimal handoff latency.