A fully offline, multi-speaker transcription pipeline for macOS (no cloud, no API keys, runs on M1/M2/M3 with Metal acceleration)
Posted by No_Weight6617@reddit | LocalLLaMA | 7 comments
Hey,
I developed VaultASR, a native C++ pipeline that runs the entire speech-to-text + speaker diarization stack locally. My main goal was to use the hardware effectively and run end-to-end on the machine, so no sensitive recordings or data ever go to the cloud.
What it does:
- Transcribes audio/video files with OpenAI's Whisper (via whisper.cpp)
- Detects speech segments using Silero VAD v5 over ONNX Runtime
- Identifies who said what using WeSpeaker speaker embeddings + agglomerative clustering
- Outputs to plain text, JSON, SRT, XLSX, Markdown, DOCX, or SQLite
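The "who said what" step above boils down to clustering per-segment speaker embeddings. Here is a minimal sketch of that idea: average-linkage agglomerative clustering over cosine distance, stopping at a distance threshold. The function name `cluster_speakers`, the linkage choice, and the threshold are illustrative assumptions, not VaultASR's actual implementation.

```cpp
#include <cmath>
#include <vector>

// Cosine distance between two embedding vectors (assumed same length).
double cosine_distance(const std::vector<double>& a, const std::vector<double>& b) {
    double dot = 0.0, na = 0.0, nb = 0.0;
    for (std::size_t i = 0; i < a.size(); ++i) {
        dot += a[i] * b[i];
        na  += a[i] * a[i];
        nb  += b[i] * b[i];
    }
    return 1.0 - dot / (std::sqrt(na) * std::sqrt(nb));
}

// Average-linkage agglomerative clustering: repeatedly merge the two closest
// clusters until the smallest inter-cluster distance exceeds `threshold`.
// Returns one speaker label per input segment embedding.
std::vector<int> cluster_speakers(const std::vector<std::vector<double>>& emb,
                                  double threshold) {
    std::vector<std::vector<int>> clusters;  // each cluster holds segment indices
    for (int i = 0; i < static_cast<int>(emb.size()); ++i) clusters.push_back({i});

    // Average pairwise distance between two clusters.
    auto linkage = [&](const std::vector<int>& a, const std::vector<int>& b) {
        double sum = 0.0;
        for (int i : a)
            for (int j : b) sum += cosine_distance(emb[i], emb[j]);
        return sum / (static_cast<double>(a.size()) * b.size());
    };

    while (clusters.size() > 1) {
        double best = 1e9;
        std::size_t bi = 0, bj = 0;
        for (std::size_t i = 0; i < clusters.size(); ++i)
            for (std::size_t j = i + 1; j < clusters.size(); ++j) {
                double d = linkage(clusters[i], clusters[j]);
                if (d < best) { best = d; bi = i; bj = j; }
            }
        if (best > threshold) break;  // remaining clusters are distinct speakers
        clusters[bi].insert(clusters[bi].end(),
                            clusters[bj].begin(), clusters[bj].end());
        clusters.erase(clusters.begin() + bj);
    }

    std::vector<int> label(emb.size());
    for (std::size_t c = 0; c < clusters.size(); ++c)
        for (int i : clusters[c]) label[i] = static_cast<int>(c);
    return label;
}
```

Real WeSpeaker embeddings are 256-dimensional, but the same logic applies: segments whose embeddings point in similar directions collapse into one speaker, and the threshold controls how eagerly voices are merged.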
Performance on M1:
- Decoded 2 hours of audio in ~10 seconds
- Full transcription + diarization of that same 2h file in minutes
- Runs entirely on Metal GPU with no CPU bottleneck
Stack:
- C++17, CMake
- whisper.cpp (Whisper inference, Metal backend)
- ONNX Runtime (Silero VAD, WeSpeaker) with CoreML acceleration
- FFmpeg for decoding, libxlsxwriter for XLSX, RNNoise for denoising
Roadmap: the goal is to support other execution providers: CUDA (NVIDIA), DirectML (Windows), and ROCm (AMD).
GitHub: https://github.com/vamshinr/vaultASR
Would love help extending this project to support other execution providers.
coder543@reddit
Did you try using VibeVoice ASR, which natively supports speaker diarization?
No_Weight6617@reddit (OP)
VibeVoice uses an LLM, I assume, correct? That would mean holding large amounts of data in memory. I haven't explored this option yet, though.
No_Weight6617@reddit (OP)
Also curious: if the audio file is huge, do you know if it supports streaming? Loading the whole audio at once might not fit on low-end PCs with limited memory.
TheActualStudy@reddit
Pyannote for diarization?
No_Weight6617@reddit (OP)
Indirectly, yes. I'm using the WeSpeaker ResNet34 model, which was derived from pyannote, but I didn't use the full pyannote pipeline, to avoid overhead.
TheActualStudy@reddit
How did you find the diarization accuracy? I wasn't really happy with Pyannote or Sortformer, but have been having good success with DiariZen, which I have an ONNX implementation of here. The code should work on your system, but I don't have the hardware to test CoreML, so that part won't be there yet.
No_Weight6617@reddit (OP)
I agree. At least on a few overlapping samples, I observed that the current approach is not perfect (and yeah, DiariZen would be more efficient, especially in such scenarios). But from what I've read, the ONNX-converted model doesn't seem to run on CoreML. Definitely a good next step to research. Thanks for that info, though.