Setup for dictation / voice control via local LLM on Linux/AMD?

Posted by fallenguru@reddit | LocalLLaMA | View on Reddit | 4 comments

Disclaimer: I'm a bloody newb, sorry in advance.

For health reasons, I'd really like to reduce the amount of typing I have to do, but conventional dictation / speech recognition software never worked for me. And even if it did, it doesn't exist for Linux. Imagine my surprise when I tried voice input on Gemini the other day, it was near perfect. Makes sense, really. Anyway, all the SOTA cloud-based offerings can do it just fine. Not even homophone shenanigans faze them much, they seem to be able to follow what I'm on about. Punctuation and stuff like paragraph breaks aside, it's what I imagine dictating to a human secretary would be like. And that with voice recognition meant to facilitate spoken prompts ...
Only, running my work stuff through a cloud service wouldn't even be legal and my personal stuff is private. Which reduces the usefulness of this to Reddit posts. \^\^ Also, I like running my own stuff.

Q1: Can local models freely available and practical to run at home even do this?

I have a Radeon VII. 16 GB of HBM2, but no CUDA, obviously. Ideally I'd like to get to a proof-of-concept stage with this; if it then turns out it needs more hardware to be faster/smarter, so be it.

I've been playing around with LMStudio a bit, got some impressive text→text results, but image+text→text was completely useless on all models I tried (on stuff SOTA chatbots do great on); and I don't think LMStudio does audio at all?

Q2: Assuming the raw functionality is there, what can I use to tie it together?

Like, dictating into LMStudio, or something like it, then copying out the text is doable, but something like an input method with push-to-talk, or even an open mic with speech detection would obviously be nicer. Bonus points if it can execute basic voice commands, too.

To cut this short, I got as far as finding Whisper.cpp, but AFAICS that only does transcription of pre-recorded audio files without any punctuation or any "understanding" of what it's transcribing, and it doesn't seem to work in LMStudio for me, so I couldn't test it yet.

And frankly, I haven't done any serious coding in decades, cobbling together something that records a continuous audio stream, segments it, feeds the segments to Whisper, then feeds the result to a text-to-text model to make sense of it all and pretty it up—as a student, I'd have been all over that, but I don't have that kind of time any more. :-(

I couldn't find anything approaching a turnkey-solution, only one or two promising abandoned projects. Some kind of high-level API, then?

tl;dr: I think this should be doable, but I've no idea where to start.

[-]

Professional-Crab234@reddit

I think I have good news, and I warmly suggest that you give this one a go:

VOXD - a voice-typing / dictation app for linux

"Out of the box" sets you up with LOCAL voice transcription, and even LOCAL ai-rewriting according to your custom pre-made prompts.

Works on CPU. No GPU required. Decent speed on my ryzen 5800.

ytain_1@reddit

You could take a look at this open source app called Handy (https://github.com/cjpais/Handy), it's cross platform. Unfortunately no real time transcription like the cloud services.

Also only for Windows, you can take a look at this app called SpeechPulse (https://speechpulse.com), it supports the real time transcription similar to cloud offerings, but it's based on Whisper model and as well the whisper.cpp.

TokenRingAI@reddit

It is 100% possible, the workflow is that you record with PortAudio, save to a WAV file, transcribe that with the model of your choosing, and feed the result to the LLM.

jamaalwakamaal@reddit

I dont know of any complete stack for your purpose , but check this, might be helpful, works good even on CPU. I have used this one, press a hotkey and speak, it transcribes and pastes the text at the active cursor position https://github.com/savbell/whisper-writer