Setup for dictation / voice control via local LLM on Linux/AMD?

Posted by fallenguru@reddit | LocalLLaMA

Disclaimer: I'm a bloody newb, sorry in advance.

For health reasons, I'd really like to reduce the amount of typing I have to do, but conventional dictation / speech recognition software never worked for me. And even if it did, it doesn't exist for Linux. Imagine my surprise when I tried voice input on Gemini the other day: it was near perfect. Makes sense, really. Anyway, all the SOTA cloud-based offerings can do it just fine. Not even homophone shenanigans faze them much; they seem to be able to follow what I'm on about. Punctuation and stuff like paragraph breaks aside, it's what I imagine dictating to a human secretary would be like. And that with voice recognition meant to facilitate spoken prompts ...
Only, running my work stuff through a cloud service wouldn't even be legal, and my personal stuff is private, which reduces the usefulness of this to Reddit posts. ^^ Also, I like running my own stuff.

Q1: Can local models that are freely available and practical to run at home even do this?

I have a Radeon VII. 16 GB of HBM2, but no CUDA, obviously. Ideally I'd like to get to a proof-of-concept stage with this; if it then turns out it needs more hardware to be faster/smarter, so be it.

I've been playing around with LMStudio a bit and got some impressive text→text results, but image+text→text was completely useless on every model I tried (on stuff SOTA chatbots do great on); and I don't think LMStudio does audio at all?

Q2: Assuming the raw functionality is there, what can I use to tie it together?

Like, dictating into LMStudio, or something like it, then copying out the text is doable, but something like an input method with push-to-talk, or even an open mic with speech detection would obviously be nicer. Bonus points if it can execute basic voice commands, too.
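
For what it's worth, the push-to-talk capture is probably the easiest bit to prototype. A minimal sketch, assuming the sounddevice and soundfile Python packages (not mentioned above, just one common choice) and 16 kHz mono audio, which is what Whisper expects:

```python
# Rough push-to-talk recorder: Enter to start, Enter to stop.
# Assumptions: sounddevice + soundfile installed, default microphone works.
import queue

import sounddevice as sd
import soundfile as sf

SAMPLE_RATE = 16000  # Whisper models work on 16 kHz mono audio
audio_q = queue.Queue()

def _callback(indata, frames, time, status):
    # Runs on sounddevice's audio thread; just buffer the incoming block.
    audio_q.put(indata.copy())

def record_utterance(path="utterance.wav"):
    input("Press Enter to start recording...")
    with sd.InputStream(samplerate=SAMPLE_RATE, channels=1, callback=_callback):
        input("Recording. Press Enter to stop...")
    # Write everything that was buffered to a WAV file Whisper can read.
    with sf.SoundFile(path, mode="w", samplerate=SAMPLE_RATE, channels=1) as f:
        while not audio_q.empty():
            f.write(audio_q.get())
    return path

if __name__ == "__main__":
    print("Saved", record_utterance())
```

An open mic would need voice-activity detection on top (webrtcvad is one library for that), but the capture side stays the same.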

To cut this short, I got as far as finding Whisper.cpp, but AFAICS that only does transcription of pre-recorded audio files without any punctuation or any "understanding" of what it's transcribing, and it doesn't seem to work in LMStudio for me, so I couldn't test it yet.

And frankly, I haven't done any serious coding in decades. Cobbling together something that records a continuous audio stream, segments it, feeds the segments to Whisper, then feeds the result to a text-to-text model to make sense of it all and pretty it up: as a student I'd have been all over that, but I don't have that kind of time any more. :-(
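
That glue is smaller than it sounds, though. A sketch of the "feed it to Whisper, then let an LLM pretty it up" half, assuming the faster-whisper package for transcription (CPU is fine for a proof of concept, which sidesteps the ROCm question) and LM Studio's local server, which speaks the OpenAI API on port 1234 by default; the model name and prompt are placeholders:

```python
# Transcribe a recorded WAV, then have a local LLM fix punctuation and wording.
# Assumptions: faster-whisper and openai packages installed, LM Studio's local
# server running on its default port with an instruct model loaded.
from faster_whisper import WhisperModel
from openai import OpenAI

def transcribe(wav_path: str) -> str:
    # "base" is small enough to run on CPU; bigger models transcribe better.
    model = WhisperModel("base", device="cpu", compute_type="int8")
    segments, _info = model.transcribe(wav_path)
    return " ".join(seg.text.strip() for seg in segments)

def clean_up(raw_text: str) -> str:
    # LM Studio exposes an OpenAI-compatible endpoint; the API key is ignored.
    client = OpenAI(base_url="http://localhost:1234/v1", api_key="lm-studio")
    resp = client.chat.completions.create(
        model="local-model",  # placeholder: use the identifier LM Studio shows
        messages=[
            {"role": "system",
             "content": "This is dictated text. Fix punctuation, casing and "
                        "obvious mis-hearings. Return only the corrected text."},
            {"role": "user", "content": raw_text},
        ],
    )
    return resp.choices[0].message.content

if __name__ == "__main__":
    print(clean_up(transcribe("utterance.wav")))
```

Chain that onto the recorder above and pipe the output through xdotool type (or wtype on Wayland) and you already have a crude input method.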

I couldn't find anything approaching a turnkey solution, only one or two promising but abandoned projects. Some kind of high-level API, then?

tl;dr: I think this should be doable, but I've no idea where to start.