Do you use LLMs with TTS and speech recognition?
Posted by film_man_84@reddit | LocalLLaMA | View on Reddit | 8 comments
As the title says, do you talk to your LLM using speech recognition and listen to its answers with TTS models?
Last night I didn't sleep much, so I sat at the computer, installed Fast-Kokoro for TTS, and configured Koboldcpp with a Whisper model. So far it seems like a great experience with SillyTavern and the small Gemma E4B model.
I have an RTX 4060 Ti with 16 GB of VRAM and 32 GB of RAM, and with this setup (SillyTavern + Koboldcpp + Whisper + Gemma E4B + Fast-Kokoro) it is almost real time, so it is realistic to use for talking with voice.
Since this is quite new to me (I previously only used TTS a long time ago for testing), I was wondering how others here are doing it. Do you talk to your LLMs, or is it a rarer use case?
Echo9Zulu-@reddit
Yes, I added support to OpenArc specifically for this use case. I haven't made more than one meh application with these yet, but OpenArc now supports Qwen ASR, Qwen TTS (all tasks), Whisper, and Kokoro. You can run any of these at the same time as an LLM or VLM, from the same server. Very nice and near real time on b70.
Back on topic though, I have found ASR utility somewhat limited. Maybe I'm not used to speaking out loud as much... it is sort of weird to share thoughts out loud this way, and even weirder listening to TTS with anyone around lol.
Lately I've been experimenting with a speak MCP tool to see how LLMs handle addressing the user. My interest started as a toy alignment problem: given a choice of what to present to the user, what do LLMs choose to present? Maybe reasoning traces would show inner concealment. Still working on the tool description.
film_man_84@reddit (OP)
Yeah, it feels somehow weird to share thoughts by speaking instead of writing (I grew up in the IRC era, not the Discord voice-chat era), but I assume it will get easier the more I do it.
Of course I don't have any plans to talk to it if somebody is around. I also prefer not to use too much technology when others are around, unless we are doing something together. It is better to stay present with people in real life and have time for each other than to just sit at computers :)
_supert_@reddit
I prefer typing and reading. I have a good TTS-STT loop, but in the end typing is faster, less cringy, and you can't sensibly read out things like "$ ls ~/.config/*/*.toml", for example.
film_man_84@reddit (OP)
Yeah. I noticed that some things I just type and others I speak: typing makes some things easier to get right immediately, but others are more fun to just say.
awitod@reddit
I've been having a lot of fun with microsoft/VibeVoice-1.5B for TTS over the last few days, and I am using Qwen3-ASR-0.6B for ASR and transcription.
rkoy1234@reddit
I have something similar, but I recommend putting in a VAD (with around a 1-second threshold); otherwise Whisper produces too many inputs that say (silence) or (no audio).
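The VAD gating described above can be sketched with a simple energy threshold standing in for a real VAD (such as Silero or webrtcvad); the ~1-second threshold means an utterance is only handed to the ASR once that much silence follows speech, and pure silence never reaches it at all. The function names and the amplitude threshold below are illustrative, not from any specific framework:

```python
# Minimal energy-gated VAD sketch: buffer speech frames and only emit
# an utterance to the ASR after ~1 s of trailing silence. The energy
# check is a crude stand-in for a real VAD model.

FRAME_MS = 30                       # frame length in milliseconds
SILENCE_FRAMES = 1000 // FRAME_MS   # ~1 s of consecutive silent frames

def is_speech(frame, threshold=500.0):
    """Crude VAD: mean absolute amplitude of 16-bit PCM samples."""
    if not frame:
        return False
    return sum(abs(s) for s in frame) / len(frame) > threshold

def gate_utterances(frames, threshold=500.0):
    """Yield one list of frames per utterance; silence-only input yields nothing."""
    buffer, silent = [], 0
    for frame in frames:
        if is_speech(frame, threshold):
            buffer.append(frame)
            silent = 0
        elif buffer:
            silent += 1
            if silent >= SILENCE_FRAMES:
                yield buffer
                buffer, silent = [], 0
    if buffer:          # flush a trailing utterance at end of stream
        yield buffer
```

With a gate like this in front of the pipeline, Whisper is never handed silence-only clips, which is exactly the failure mode producing the "(silence)" / "(no audio)" transcripts.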
My setup is Whisper + OmniVoice/Qwen TTS (OmniVoice if I want quality, Qwen if I want streaming) + llama.cpp, with a vibe-coded speech-to-speech framework.
I moved away from ST since its TTS support is kinda janky/basic, especially once you start adding the 'stream by paragraphs' and 'ignore text outside of quotes' options. (No shade to the ST devs, I know they hate receiving requests for a hundred different TTSes every day, lol.)
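Both options mentioned above can be approximated in a few lines. This is a hedged sketch of my own (the regex and splitting rules are illustrative, not SillyTavern's actual implementation): split the reply into paragraphs, and optionally keep only double-quoted dialogue so the TTS doesn't read narration out loud:

```python
import re

def tts_chunks(reply, quotes_only=True):
    """Split an LLM reply into paragraph-sized TTS chunks.

    With quotes_only, keep only text inside double quotes (spoken
    dialogue) and drop the narration around it.
    """
    chunks = []
    for para in reply.split("\n\n"):
        para = para.strip()
        if not para:
            continue
        if quotes_only:
            quoted = re.findall(r'"([^"]+)"', para)
            if quoted:
                chunks.append(" ".join(quoted))
        else:
            chunks.append(para)
    return chunks
```

Each chunk can then be sent to the TTS engine one paragraph at a time, which is what "stream by paragraphs" amounts to.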
This is one of those experiences where an extra GPU or more VRAM can tangibly improve quality of life: the bigger TTS models are much more expressive, and latency drops a lot if you can separate the LLM pipeline from the TTS pipeline, so that TTS can start streaming sentence by sentence while the LLM is still outputting, without hogging GPU utilization.
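That overlap can be sketched with a queue between the two pipelines: a worker thread speaks completed sentences while the LLM is still streaming tokens. Everything here is a stand-in (the `speak` callback and the token stream are placeholders, not any particular framework's API):

```python
import queue
import re
import threading

def stream_to_tts(token_stream, speak):
    """Feed completed sentences to a TTS worker while tokens still arrive."""
    q = queue.Queue()

    def tts_worker():
        while True:
            sentence = q.get()
            if sentence is None:    # sentinel: generation finished
                break
            speak(sentence)         # stand-in for the real TTS call

    worker = threading.Thread(target=tts_worker)
    worker.start()

    buf = ""
    for token in token_stream:      # e.g. llama.cpp streaming output
        buf += token
        # Emit every completed sentence (ends with ., ! or ? plus a space).
        while True:
            m = re.search(r"[.!?]\s", buf)
            if not m:
                break
            q.put(buf[:m.end()].strip())
            buf = buf[m.end():]
    if buf.strip():                 # flush any trailing partial sentence
        q.put(buf.strip())
    q.put(None)
    worker.join()
```

The point of the design is that the first sentence starts playing as soon as it is complete, so perceived latency is roughly one sentence of generation rather than the whole reply.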
Kahvana@reddit
I had it for a while and it seemed fun, but in the end typing is far more accurate and doesn't take that much more time.
FinBenton@reddit
I've done a fair share of whisper-llama.cpp-TTS wrappers, but in the end I always end up just typing my stuff in. I guess I don't like talking by myself.