A fully offline, multi-speaker transcription pipeline for macOS (no cloud, no API keys, runs on M1/M2/M3 with Metal acceleration)
Posted by No_Weight6617@reddit | LocalLLaMA | 7 comments
Hey,
I developed VaultASR, a native C++ pipeline that runs the entire speech-to-text + speaker diarization stack locally. My main goal was to use the hardware effectively and run end-to-end on the machine, so no sensitive recordings or data ever go to the cloud.
What it does:
- Transcribes audio/video files with OpenAI's Whisper (via whisper.cpp)
- Detects speech segments using Silero VAD v5 over ONNX Runtime
- Identifies who said what using WeSpeaker speaker embeddings + agglomerative clustering
- Outputs to plain text, JSON, SRT, XLSX, Markdown, DOCX, or SQLite
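The "who said what" step above boils down to clustering per-segment speaker embeddings. Here is a minimal sketch of that idea: average-linkage agglomerative clustering over cosine distance, stopping at a distance threshold. The function name `cluster_speakers`, the linkage choice, and the threshold are illustrative assumptions, not VaultASR's actual implementation.

```cpp
#include <cmath>
#include <vector>

// Cosine distance between two embedding vectors (assumed same length).
double cosine_distance(const std::vector<double>& a, const std::vector<double>& b) {
    double dot = 0.0, na = 0.0, nb = 0.0;
    for (std::size_t i = 0; i < a.size(); ++i) {
        dot += a[i] * b[i];
        na  += a[i] * a[i];
        nb  += b[i] * b[i];
    }
    return 1.0 - dot / (std::sqrt(na) * std::sqrt(nb));
}

// Average-linkage agglomerative clustering: repeatedly merge the two closest
// clusters until the smallest inter-cluster distance exceeds `threshold`.
// Returns one speaker label per input segment embedding.
std::vector<int> cluster_speakers(const std::vector<std::vector<double>>& emb,
                                  double threshold) {
    std::vector<std::vector<int>> clusters;  // each cluster holds segment indices
    for (int i = 0; i < static_cast<int>(emb.size()); ++i) clusters.push_back({i});

    // Average pairwise distance between two clusters.
    auto linkage = [&](const std::vector<int>& a, const std::vector<int>& b) {
        double sum = 0.0;
        for (int i : a)
            for (int j : b) sum += cosine_distance(emb[i], emb[j]);
        return sum / (static_cast<double>(a.size()) * b.size());
    };

    while (clusters.size() > 1) {
        double best = 1e9;
        std::size_t bi = 0, bj = 0;
        for (std::size_t i = 0; i < clusters.size(); ++i)
            for (std::size_t j = i + 1; j < clusters.size(); ++j) {
                double d = linkage(clusters[i], clusters[j]);
                if (d < best) { best = d; bi = i; bj = j; }
            }
        if (best > threshold) break;  // remaining clusters are distinct speakers
        clusters[bi].insert(clusters[bi].end(),
                            clusters[bj].begin(), clusters[bj].end());
        clusters.erase(clusters.begin() + bj);
    }

    std::vector<int> label(emb.size());
    for (std::size_t c = 0; c < clusters.size(); ++c)
        for (int i : clusters[c]) label[i] = static_cast<int>(c);
    return label;
}
```

Real WeSpeaker embeddings are 256-dimensional, but the same logic applies: segments whose embeddings point in similar directions collapse into one speaker, and the threshold controls how eagerly voices are merged.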
Performance on M1:
- Decoded 2 hours of audio in ~10 seconds
- Full transcription + diarization of that same 2h file in minutes
- Runs entirely on Metal GPU with no CPU bottleneck
Stack:
- C++17, CMake
- whisper.cpp (Whisper inference, Metal backend)
- ONNX Runtime (Silero VAD, WeSpeaker) with CoreML acceleration
- FFmpeg for decoding, libxlsxwriter for XLSX, RNNoise for denoising
Roadmap: the goal is to support other execution providers: CUDA (NVIDIA), DirectML (Windows), and ROCm (AMD).
GitHub: https://github.com/vamshinr/vaultASR
Would love help extending this project to support other execution providers.
coder543@reddit
Did you try using VibeVoice ASR, which natively supports speaker diarization?
No_Weight6617@reddit (OP)
VibeVoice uses an LLM, I assume, correct? That would mean holding large amounts of data in memory. I haven't explored this option yet, though.
No_Weight6617@reddit (OP)
Also curious: if the audio file is huge, do you know if it supports streaming? Loading the whole audio at once might not fit on low-end PCs with limited memory.
TheActualStudy@reddit
Pyannote for diarization?
No_Weight6617@reddit (OP)
Indirectly, yes. I'm using the WeSpeaker ResNet34 model, which was derived from pyannote, but I didn't use the full pyannote pipeline, to avoid overhead.
TheActualStudy@reddit
How did you find the diarization accuracy? I wasn't really happy with Pyannote or Sortformer, but have been having good success with DiariZen, which I have an ONNX implementation of here. The code should work on your system, but I don't have the hardware to test CoreML, so that part won't be there yet.
No_Weight6617@reddit (OP)
I agree. At least on a few overlapping samples, I observed that the current approach is not perfect (and yeah, DiariZen would be more efficient, especially in such scenarios). But from what I've read, the ONNX-converted model doesn't seem to run on CoreML. Definitely a good next step to research. Thanks for that info, though.