vibevoice.cpp: Microsoft VibeVoice (TTS + long-form ASR with diarization) ported to ggml/C++, runs on CPU/CUDA/Metal/Vulkan, no Python at inference
Posted by mudler_it@reddit | LocalLLaMA | 10 comments
A few weeks ago I shipped vibevoice.cpp, a pure-C++ ggml port of Microsoft VibeVoice (the text-to-speech model with voice cloning, https://github.com/microsoft/VibeVoice). Wanted to post a follow-up here because we're at a point where the engine has grown well past "first-pass port" and into something other people might actually want to run.
This work was brought to you with <3 from the LocalAI team!
What it does:
- TTS with pre-converted voice prompts (any of upstream's .pt voices, ours or yours converted via scripts/convert_voice_to_gguf.py): give it a 30s reference clip, generate 24kHz speech in the cloned voice. Ships pre-converted GGUFs (0.5B realtime model) on https://huggingface.co/mudler/vibevoice.cpp-models
- Long-form ASR with speaker diarization: 7B-parameter model, returns JSON segments {start, end, speaker, content}. Tested up to 17 minutes of audio in one shot.
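The diarized JSON output lends itself to simple post-processing. A minimal sketch: the segment fields ({start, end, speaker, content}) are taken from the post above, but the sample segment data here is invented for illustration.

```python
import json

# Example ASR output in the shape described above: one object per segment.
# These segment values are invented sample data, not real model output.
raw = """[
  {"start": 0.0,  "end": 4.2,  "speaker": "Speaker 1", "content": "Hello and welcome."},
  {"start": 4.2,  "end": 9.8,  "speaker": "Speaker 2", "content": "Thanks for having me."},
  {"start": 9.8,  "end": 15.1, "speaker": "Speaker 1", "content": "Let's get started."}
]"""

segments = json.loads(raw)

# Render a simple transcript and tally per-speaker talk time.
talk_time = {}
for seg in segments:
    talk_time[seg["speaker"]] = talk_time.get(seg["speaker"], 0.0) + seg["end"] - seg["start"]
    print(f'[{seg["start"]:6.1f}-{seg["end"]:6.1f}] {seg["speaker"]}: {seg["content"]}')

for speaker, secs in sorted(talk_time.items()):
    print(f"{speaker}: {secs:.1f}s")
```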
Backends: CPU (baseline), CUDA, Metal, Vulkan, and hipBLAS via ggml's backend dispatch. Single binary, or libvibevoice.so + flat C ABI for embedding (purego/cgo/dlopen-friendly).
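A flat C ABI means the shared library also loads from dynamic FFI layers with no wrapper code. A hedged ctypes sketch: the library name comes from the post, but the commented-out symbol binding is a hypothetical illustration, not the repo's actual exported interface (see its public header for that).

```python
import ctypes

def load_vibevoice(path: str = "libvibevoice.so"):
    """Try to load the vibevoice shared library; return None if it's absent."""
    try:
        return ctypes.CDLL(path)
    except OSError:
        return None

lib = load_vibevoice()
if lib is not None:
    # Hypothetical example of binding a flat-C-ABI symbol; the real function
    # names and signatures are defined in the repo's header, not here.
    # lib.vibevoice_tts.argtypes = [ctypes.c_char_p, ctypes.c_char_p]
    print("loaded libvibevoice.so")
else:
    print("libvibevoice.so not found; build it from the repo first")
```

The same pattern applies from Go via purego, which is presumably why the ABI was kept flat.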
Numbers:

| Run | Inference | RTF | Peak RSS |
|---|---|---|---|
| 68s sample, CUDA Q4_K (GB10) | 28 s | 0.41 | ~6 GB |
| 68s sample, CPU Q4_K (R9) | 150 s | 2.20 | ~8 GB |
| 17min audio, CPU Q8_0 | 1929 s | 1.94 | ~26 GB |
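For anyone unfamiliar with the metric: RTF (real-time factor) here is wall-clock inference time divided by audio duration, so below 1.0 means faster than realtime. A quick sanity check against the 68s rows (the second result lands at 2.21 rather than the table's 2.20, presumably because the raw timings weren't exactly 150/68):

```python
def rtf(inference_seconds: float, audio_seconds: float) -> float:
    """Real-time factor: processing time per second of audio."""
    return inference_seconds / audio_seconds

print(round(rtf(28, 68), 2))   # CUDA Q4_K row -> 0.41
print(round(rtf(150, 68), 2))  # CPU Q4_K row
```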
Compared to upstream Microsoft Python + Transformers + vLLM plugin:
- Same Qwen2.5 7B/0.5B backbone, same DPM-Solver diffusion head, same windowed prefill (5 text tokens / 6 speech frames per the mlx-audio pattern).
- Closed-loop TTS→ASR test asserts 100% source-word recall on a fixed seed; runs in CI.
- No Python at inference, no vLLM, no torch.
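The closed-loop idea can be sketched in a few lines: synthesize speech from a known script, transcribe it back, and assert every source word was recovered. This is a simplified recall check under my own normalization assumptions, not the repo's actual CI code.

```python
import re

def word_recall(source: str, transcript: str) -> float:
    """Fraction of distinct source words that reappear in the transcript."""
    norm = lambda s: set(re.findall(r"[a-z']+", s.lower()))
    src, out = norm(source), norm(transcript)
    return len(src & out) / len(src) if src else 1.0

# In the real loop, `transcript` would come from running ASR on TTS output.
source = "the quick brown fox jumps over the lazy dog"
transcript = "The quick brown fox jumps over the lazy dog."
print(word_recall(source, transcript))  # 1.0 = full source-word recall
```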
Limitations (honest ones):
- 17min audio peak is still 26 GB on CPU because of the encoder activation pool + 14 GB of Q8_0 weights. Q4_K cuts the model side (~10 GB on disk), but the encoder pool needs its own work.
- The diffusion head builds 20 small graphs per latent frame; graph reuse there is the next obvious win.
- No streaming output yet; it emits a complete WAV / full transcript.
- ASR transcript quality is whatever upstream gives you; on a 17min Italian audio clip the recovered transcript is faithful through natural sentence boundaries.
Repo: https://github.com/mudler/vibevoice.cpp (MIT)
Models: https://huggingface.co/mudler/vibevoice.cpp-models
LocalAI integration: vibevoice.cpp is already available as a ready-to-go backend in LocalAI!
Happy to answer questions and hear feedback!
taking_bullet@reddit
It's always nice to see another TTS project 👌 Are you going to add support for KugelAudio models? That's basically classic VibeVoice, but trained for European languages.
ToInfinityAndAbove@reddit
also interested
taking_bullet@reddit
FYI: If you are not scared of ComfyUI you can use KugelAudio right now 😏
lukaszpi@reddit
Thank you! Awesome work
buddroyce@reddit
Cool stuff man!
pmttyji@reddit
Nice. Glad to see continued stuff from you!
JackStrawWitchita@reddit
Does your version run faster than normal VibeVoice on CPU-only machines?
Skystunt@reddit
This looks cool !
foldl-li@reddit
Cool.
Huge-Safety-1061@reddit
Awesome work!