vibevoice.cpp: Microsoft VibeVoice (TTS + long-form ASR with diarization) ported to ggml/C++, runs on CPU/CUDA/Metal/Vulkan, no Python at inference

Posted by mudler_it@reddit | LocalLLaMA

A few weeks ago I shipped vibevoice.cpp, a pure-C++ ggml port of Microsoft VibeVoice (the speech-to-speech model with voice cloning, https://github.com/microsoft/VibeVoice). I wanted to post a follow-up here because the engine has grown well past "first-pass port" and into something other people might actually want to run.

This work was brought to you with <3 from the LocalAI team!

What it does:

Backends: CPU (baseline), CUDA, Metal, Vulkan, and hipBLAS via ggml's backend dispatch. Ships as a single binary, or as libvibevoice.so with a flat C ABI for embedding (purego/cgo/dlopen-friendly).
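
For anyone wondering what "dlopen-friendly" buys you in practice, here is a minimal sketch of loading a shared library with a flat C ABI at runtime on a POSIX system. The `SharedLib` wrapper is a hypothetical helper written for this post, not part of vibevoice.cpp, and it deliberately does not assume any particular exported symbol names.

```cpp
#include <cassert>
#include <dlfcn.h>   // dlopen, dlsym, dlclose (POSIX)
#include <string>

// Hypothetical RAII wrapper around a dlopen handle (illustration only,
// not part of vibevoice.cpp).
class SharedLib {
public:
    explicit SharedLib(const std::string& path)
        : handle_(dlopen(path.c_str(), RTLD_NOW | RTLD_LOCAL)) {}

    ~SharedLib() {
        if (handle_) dlclose(handle_);
    }

    // The handle is owned by this object, so copying is disabled.
    SharedLib(const SharedLib&) = delete;
    SharedLib& operator=(const SharedLib&) = delete;

    // True if the library was found and loaded.
    bool ok() const { return handle_ != nullptr; }

    // Look up an exported C symbol and cast it to the caller's
    // function-pointer type. Returns nullptr if the library is not
    // loaded or the symbol is missing.
    template <typename Fn>
    Fn symbol(const char* name) const {
        assert(name != nullptr);
        if (!handle_) return nullptr;
        return reinterpret_cast<Fn>(dlsym(handle_, name));
    }

private:
    void* handle_;
};
```

Usage would look roughly like `SharedLib lib("./libvibevoice.so"); auto fn = lib.symbol<int (*)()>("some_exported_function");`, where the symbol name and signature are placeholders. The real ones are whatever the project's public header declares. On older glibc versions you may need to link with `-ldl`.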

Numbers:

Run                             Inference time   RTF    Peak RSS
68 s sample, CUDA Q4_K (GB10)   28 s             0.41   ~6 GB
68 s sample, CPU Q4_K (R9)      150 s            2.20   ~8 GB
17 min audio, CPU Q8_0          1929 s           1.94   ~26 GB
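
To sanity-check the numbers yourself: a small sketch, assuming RTF here means inference time divided by audio duration (that reading matches the two 68 s rows). The function names are made up for illustration.

```cpp
#include <cassert>

// Real-time factor: seconds of compute per second of audio.
// Values below 1.0 mean faster than real time.
double real_time_factor(double inference_seconds, double audio_seconds) {
    assert(audio_seconds > 0.0);
    return inference_seconds / audio_seconds;
}

// Audio duration implied by a reported inference time and RTF.
double implied_audio_seconds(double inference_seconds, double rtf) {
    assert(rtf > 0.0);
    return inference_seconds / rtf;
}
```

With the figures above, `real_time_factor(28.0, 68.0)` comes out at about 0.41 and `real_time_factor(150.0, 68.0)` at about 2.2. For the long-form row, `implied_audio_seconds(1929.0, 1.94)` gives roughly 994 s, so "17 min" is a rounded duration.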

Compared to upstream Microsoft Python + Transformers + vLLM plugin:

Limitations / honest:

Repo: https://github.com/mudler/vibevoice.cpp (MIT)

Models: https://huggingface.co/mudler/vibevoice.cpp-models

LocalAI integration: vibevoice.cpp is already available as a ready-to-use backend in LocalAI!

Happy to answer questions and hear feedback!