Best way to do live transcriptions?
Posted by Daniel_H212@reddit | LocalLLaMA | View on Reddit | 17 comments
Currently taking a class from a professor who talks super slow. I've never had this problem before, but my ADHD makes it hard for me to focus on his lectures. My thought was that live transcription would help with this enormously. His syllabus also explicitly allows recording his lectures without needing permission, which I take to mean transcriptions would be allowed too.
Windows live caption is great and actually recognizes his speech almost perfectly, but it is live only, there's no full transcript created or saved anywhere and text is gone the moment he moves onto the next sentence.
I tried Buzz, but so far it seems to not work very well. I can't seem to use Qwen3-ASR-0.6B or granite-4-1b-speech with it, and whisper models seem incapable of recognizing his speech since he's too far from the microphone (and yes I tried lowering the volume threshold to 0).
What's the best way to do what I'm trying to do? I want a model that is small enough to run on my laptop's i5-1235U, a front end that lets me see the transcribed text live and keeps the full transcript, and the ability to recognize quiet speech similar to windows live caption.
BuildAndShipp@reddit
Try recording directly in the browser — VideoText ("Voice to Text — Free Voice Recorder & Instant Transcription") has a live transcription mode where you just hit record, it listens and transcribes in real time, and once you're done it automatically generates the full transcript, a summary, and a translation. No local model setup, no hardware requirements. Might be exactly what you need for lectures.
tradmalcong@reddit
For live-transcription-style setups, a lot of people either lean hard into local Whisper-based stacks (WhisperLive, whisper_streaming, etc.) or offload the heavy ASR work to a hosted API so they can focus on the pipeline around it. If you'd rather keep the data flow simpler than self-hosting the whole speech-to-text layer, tools like Talo and Palabra already bake real-time ASR, translation, and streaming into a single flow, so you can treat them more like infrastructure than something you rebuild yourself.
get-whisperr@reddit
I have not seen a local model perform well, or at least to an acceptable standard. There are several ways to achieve this using whisperr; I'll leave the guide here for your reference: https://guides.whisperr.co/docs/use-cases/external-microphone
JotMe-Translation@reddit
There's no single "best", because no AI tool gets everything right, but I've tried a number of tools and one of them is JotMe. I can't say it's perfect, but its real-time transcription was really good, and it provides context-based transcription, which is a very useful feature.
Google Translate was also good, but I couldn't keep tapping the mic button again and again.
ArtfulGenie69@reddit
There are all sorts of small speech-to-text models now, like Parakeet or Qwen3 ASR. They should run on CPU just fine, maybe even on your phone, though the phone would be complicated without an app.
What I would do is record the lecture, then run it through a cleaning stage with something like pynoise to take out garbage, plus something to even out the audio, like equalization, and then feed that to the text model. Deepseek should be able to write you a quick script that does all this automatically.
The benefit of your laptop is that you could try to get a cheap really good USB mic to help audio capture if it feels like that is needed.
Deepseek can also write the program to chunk the incoming audio at quiet points and then convert the speech in each chunk, making it semi real-time.
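The chunk-at-quiet-places idea can be sketched with a simple energy gate. This is a rough sketch, not a tested recommendation: the 16 kHz rate, 30 ms frames, and RMS threshold are assumed starting points you would tune, and a real setup might use a proper VAD instead:

```python
import numpy as np

def split_on_silence(samples, rate=16000, frame_ms=30, threshold=0.01):
    """Return (start, end) sample ranges of speech, splitting wherever
    the per-frame RMS energy drops below `threshold`."""
    frame = int(rate * frame_ms / 1000)          # samples per frame
    n_frames = len(samples) // frame
    rms = np.sqrt(np.mean(
        samples[:n_frames * frame].reshape(n_frames, frame) ** 2, axis=1))
    loud = rms >= threshold
    chunks, start = [], None
    for i, is_loud in enumerate(loud):
        if is_loud and start is None:            # speech begins
            start = i * frame
        elif not is_loud and start is not None:  # speech ends
            chunks.append((start, i * frame))
            start = None
    if start is not None:                        # still talking at the end
        chunks.append((start, n_frames * frame))
    return chunks
```

Each (start, end) range can then be fed to whichever STT model you settle on; splitting at silences tends to keep sentences intact better than fixed-length windows.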
There are a bunch of simpler options than this, premade apps like the ones you tried, but because they are all monetized they may find ways to make you pay, like that dumb 30m-or-less issue you had with the phone app.
colom1@reddit
It's exactly the same for me, I can't follow anything anymore when someone talks too slowly - I feel you XD
For everyday use these AI speech-recognition tools are quite okay, but with a university lecture full of technical terms and so on they'll probably hit their limits.
I once had a fellow student at university who had AVWS (an auditory processing disorder) / was hard of hearing. He could understand fine in one-on-one conversations, but struggled in large rooms with bad acoustics. He was then given speech-to-text interpreters who transcribed everything that was said for him live. Basically like live TV subtitles, but just for his lectures. And afterwards he got the transcript sent to him for studying. I'd have liked that service myself! :D
Have a look at this: sprachpilot.at/schriftdolmetschen.html Maybe something like that is an option for you too, given your "limitation".
Oh, and the whole thing was free for my fellow student. That's something the university has to take care of.
Daniel_H212@reddit (OP)
Hmm... I just tried Otter AI on my phone and it actually works pretty well. I'd rather have it on my laptop, but for now this seems like not the worst solution.
Ok_Read_2524@reddit
for laptop you could try https://fast-transcriber.com (I'm making this) but happy to give you lifetime pro use (need beta testers) :) lmk!
hdnh2006@reddit
mmm... maybe not the exact solution you are looking for, but it could work for you. I have a completely free demo of Open WebUI, and if you have an .mp3 recording, you can transcribe it there if you want: chat.privategpt.es
Check the image
1) upload the mp3 file
2) click on the transcribed file
3) copy/paste it, or even ask some of the free models I have deployed there
Daniel_H212@reddit (OP)
mp3 doesn't work for me, I need it to be a live transcription.
hdnh2006@reddit
Sorry, I misunderstood.
WhisperianCookie@reddit
you could fork one of the open-source STT tools (e.g. epicenter) and vibe-code this live preview feature on top
Terminator857@reddit
Try the different open Whisper models on your laptop to see if they keep up and don't drain your battery. Qwen also has a 2.5B model for this. Leaderboard at: https://huggingface.co/spaces/hf-audio/open_asr_leaderboard — I might decide to test IBM Granite 1B.
Daniel_H212@reddit (OP)
What do I run them with? I have a decent idea which models are good but I need an inference solution that can run them and a front end that lets me use them.
Terminator857@reddit
You can ask opus or any ai cli agent to create a python script that will do it.
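For reference, such a script might look roughly like this. It's a hedged sketch, not a tested setup: it assumes `faster-whisper` and `sounddevice` are installed via pip, and the "small" model, English language, and 10-second chunk length are guesses to tune, not recommendations:

```python
import queue
import numpy as np

def format_timestamp(seconds: float) -> str:
    """hh:mm:ss stamp for each transcript line."""
    s = int(seconds)
    return f"{s // 3600:02d}:{s % 3600 // 60:02d}:{s % 60:02d}"

def live_transcribe(outfile="lecture.txt", chunk_seconds=10, rate=16000):
    # Third-party imports kept inside the function; install with:
    #   pip install faster-whisper sounddevice numpy
    import sounddevice as sd
    from faster_whisper import WhisperModel

    model = WhisperModel("small", device="cpu", compute_type="int8")
    q = queue.Queue()
    elapsed = 0.0

    def callback(indata, frames, time_info, status):
        q.put(indata.copy())  # runs on the audio thread; just enqueue

    with sd.InputStream(samplerate=rate, channels=1, dtype="float32",
                        callback=callback), open(outfile, "a") as out:
        while True:
            # Accumulate ~chunk_seconds of audio, then transcribe it.
            buf, got = [], 0
            while got < chunk_seconds * rate:
                block = q.get()
                buf.append(block[:, 0])
                got += len(block)
            audio = np.concatenate(buf)
            segments, _ = model.transcribe(audio, language="en")
            for seg in segments:
                line = f"[{format_timestamp(elapsed + seg.start)}] {seg.text.strip()}"
                print(line)             # live view
                out.write(line + "\n")  # full transcript on disk
            out.flush()
            elapsed += got / rate
```

Run `live_transcribe()` and stop it with Ctrl-C; the transcript accumulates in `lecture.txt`. One caveat with fixed-length windows is that they can cut words mid-sentence; splitting at silences instead avoids that.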
ionlycreate42@reddit
Parakeet (0.6B) doesn't work?
archieve_@reddit
https://github.com/SakiRinn/LiveCaptions-Translator
It has CPU live captions with history, and you don’t need a GPU to run it. This will save your laptop battery and reduce heat.
If you want to try ASR-LLM, use Parakeet v2 472M