vibevoice.cpp: Microsoft VibeVoice (TTS + long-form ASR with diarization) ported to ggml/C++, runs on CPU/CUDA/Metal/Vulkan, no Python at inference
Posted by mudler_it@reddit | LocalLLaMA | 10 comments
A few weeks ago I shipped vibevoice.cpp, a pure-C++ ggml port of Microsoft VibeVoice (the text-to-speech model with voice cloning, https://github.com/microsoft/VibeVoice). Wanted to post a follow-up here because we're at a point where the engine has grown well past "first-pass port" and into something other people might actually want to run.
This work was brought to you with <3 from the LocalAI team!
What it does:
- TTS with pre-converted voice prompts (any of upstream's .pt voices, ours or yours converted via scripts/convert_voice_to_gguf.py): give it a 30s reference clip, generate 24kHz speech in the cloned voice. Ships pre-converted GGUFs (0.5B realtime model) on https://huggingface.co/mudler/vibevoice.cpp-models
- Long-form ASR with speaker diarization: 7B-parameter model, returns JSON segments {start, end, speaker, content}. Tested up to 17 minutes of audio in one shot.
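The diarized JSON output lends itself to simple post-processing. A minimal sketch: the segment fields ({start, end, speaker, content}) are taken from the post above, but the sample segment data here is invented for illustration.

```python
import json

# Example ASR output in the shape described above: one object per segment.
# These segment values are invented sample data, not real model output.
raw = """[
  {"start": 0.0,  "end": 4.2,  "speaker": "Speaker 1", "content": "Hello and welcome."},
  {"start": 4.2,  "end": 9.8,  "speaker": "Speaker 2", "content": "Thanks for having me."},
  {"start": 9.8,  "end": 15.1, "speaker": "Speaker 1", "content": "Let's get started."}
]"""

segments = json.loads(raw)

# Render a simple transcript and tally per-speaker talk time.
talk_time = {}
for seg in segments:
    talk_time[seg["speaker"]] = talk_time.get(seg["speaker"], 0.0) + seg["end"] - seg["start"]
    print(f'[{seg["start"]:6.1f}-{seg["end"]:6.1f}] {seg["speaker"]}: {seg["content"]}')

for speaker, secs in sorted(talk_time.items()):
    print(f"{speaker}: {secs:.1f}s")
```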
Backends: CPU (baseline), CUDA, Metal, Vulkan, and hipBLAS via ggml's backend dispatch. Single binary, or libvibevoice.so + flat C ABI for embedding (purego/cgo/dlopen-friendly).
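A flat C ABI means the shared library also loads from dynamic FFI layers with no wrapper code. A hedged ctypes sketch: the library name comes from the post, but the commented-out symbol binding is a hypothetical illustration, not the repo's actual exported interface (see its public header for that).

```python
import ctypes

def load_vibevoice(path: str = "libvibevoice.so"):
    """Try to load the vibevoice shared library; return None if it's absent."""
    try:
        return ctypes.CDLL(path)
    except OSError:
        return None

lib = load_vibevoice()
if lib is not None:
    # Hypothetical example of binding a flat-C-ABI symbol; the real function
    # names and signatures are defined in the repo's header, not here.
    # lib.vibevoice_tts.argtypes = [ctypes.c_char_p, ctypes.c_char_p]
    print("loaded libvibevoice.so")
else:
    print("libvibevoice.so not found; build it from the repo first")
```

The same pattern applies from Go via purego, which is presumably why the ABI was kept flat.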
Numbers:

| Run | Inference | RTF | Peak RSS |
|---|---|---|---|
| 68s sample, CUDA Q4_K (GB10) | 28 s | 0.41 | ~6 GB |
| 68s sample, CPU Q4_K (R9) | 150 s | 2.20 | ~8 GB |
| 17min audio, CPU Q8_0 | 1929 s | 1.94 | ~26 GB |
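For anyone unfamiliar with the metric: RTF (real-time factor) here is wall-clock inference time divided by audio duration, so below 1.0 means faster than realtime. A quick sanity check against the 68s rows (the second result lands at 2.21 rather than the table's 2.20, presumably because the raw timings weren't exactly 150/68):

```python
def rtf(inference_seconds: float, audio_seconds: float) -> float:
    """Real-time factor: processing time per second of audio."""
    return inference_seconds / audio_seconds

print(round(rtf(28, 68), 2))   # CUDA Q4_K row -> 0.41
print(round(rtf(150, 68), 2))  # CPU Q4_K row
```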
Compared to upstream Microsoft Python + Transformers + vLLM plugin:
- Same Qwen2.5 7B/0.5B backbone, same DPM-Solver diffusion head, same windowed prefill (5 text tokens / 6 speech frames per the mlx-audio pattern).
- Closed-loop TTS→ASR test asserts 100% source-word recall on a fixed seed; runs in CI.
- No Python at inference, no vLLM, no torch.
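The closed-loop idea can be sketched in a few lines: synthesize speech from a known script, transcribe it back, and assert every source word was recovered. This is a simplified recall check under my own normalization assumptions, not the repo's actual CI code.

```python
import re

def word_recall(source: str, transcript: str) -> float:
    """Fraction of distinct source words that reappear in the transcript."""
    norm = lambda s: set(re.findall(r"[a-z']+", s.lower()))
    src, out = norm(source), norm(transcript)
    return len(src & out) / len(src) if src else 1.0

# In the real loop, `transcript` would come from running ASR on TTS output.
source = "the quick brown fox jumps over the lazy dog"
transcript = "The quick brown fox jumps over the lazy dog."
print(word_recall(source, transcript))  # 1.0 = full source-word recall
```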
Limitations (honest ones):
- 17min audio peak is still 26 GB on CPU because of the encoder activation pool + 14 GB of Q8_0 weights. Q4_K cuts the model side (~10 GB on disk), but the encoder pool needs its own work.
- The diffusion head builds 20 small graphs per latent frame; graph reuse there is the next obvious win.
- No streaming output yet; it emits a complete WAV / full transcript.
- ASR transcript quality is whatever upstream gives you; on a 17min Italian audio clip the recovered transcript is faithful through natural sentence boundaries.
Repo: https://github.com/mudler/vibevoice.cpp (MIT)
Models: https://huggingface.co/mudler/vibevoice.cpp-models
LocalAI integration: vibevoice.cpp is already available as a ready-to-go backend in LocalAI!
Happy to answer questions and hear feedback!
taking_bullet@reddit
It's always nice to see another TTS project 👌 Are you going to add support for KugelAudio models? That's basically classic VibeVoice, but trained for European languages.
ToInfinityAndAbove@reddit
also interested
taking_bullet@reddit
FYI: If you are not scared of ComfyUI you can use KugelAudio right now 😏
lukaszpi@reddit
Thank you! Awesome work
buddroyce@reddit
Cool stuff man!
pmttyji@reddit
Nice. Glad to see continued stuff from you!
JackStrawWitchita@reddit
Does your version run faster than normal VibeVoice on CPU-only machines?
Skystunt@reddit
This looks cool !
foldl-li@reddit
Cool.
Huge-Safety-1061@reddit
Awesome work!