I ported NVIDIA Parakeet (speech-to-text) to ggml: same output as NeMo, faster, GGUF-quantized, no Python
Posted by mudler_it@reddit | LocalLLaMA | 40 comments
I ported NVIDIA's Parakeet speech-to-text models to pure C++/ggml (the engine behind llama.cpp and whisper.cpp). It runs the FastConformer TDT / CTC / RNNT / hybrid models with no Python and no PyTorch, on CPU and GPU (CUDA, HIP, Vulkan, Metal).
The goal was to match NeMo exactly, then make it deployable anywhere. Where it landed:
- Output is byte-for-byte identical to NeMo (WER 0 on the f32/f16 path).
- Faster than NeMo's own PyTorch runtime: up to ~5x on the larger TDT/hybrid models on GPU, up to ~1.86x on CPU when quantized, and about half the memory.
- Around 600x realtime on GPU on a 23s clip (one hour of audio in roughly 6 seconds).
- Quantized GGUF for every variant: f16, q8_0, q6_k, q5_k, q4_k.
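
As a quick sanity check of the throughput figure in the list above, here is a minimal sketch of the arithmetic. It assumes the ~600x realtime factor measured on the 23s clip also holds for longer audio, which is an extrapolation and not something the numbers above prove; the function name is only illustrative.

```python
def processing_seconds(audio_seconds: float, realtime_factor: float) -> float:
    """Wall-clock time needed to transcribe a clip at a given realtime factor."""
    return audio_seconds / realtime_factor


RTF = 600.0  # "around 600x realtime", taken from the post

one_hour = processing_seconds(3600.0, RTF)  # one hour of audio
clip_23s = processing_seconds(23.0, RTF)    # the 23s benchmark clip

print(f"1 h of audio -> {one_hour:.1f} s")
print(f"23 s clip    -> {clip_23s * 1000:.0f} ms")
```

So "one hour in roughly 6 seconds" is simply 3600 / 600, and the 23s clip itself would take on the order of tens of milliseconds.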

It also does cache-aware streaming with real-time end-of-utterance, word-level timestamps with confidence, and exposes a small flat C-API so you can embed it pretty much everywhere. The GGUF is self-contained: the tokenizer/vocab is baked into the model file, no external files needed.
It ships as a backend in LocalAI too, so you get an OpenAI-compatible /v1/audio/transcriptions endpoint fully local. (Disclosure: I work on LocalAI.)
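
For readers who have not used that endpoint before, a rough sketch of what a client request looks like, using only the Python standard library. The base URL `http://localhost:8080`, the model name `parakeet`, and the helper name are assumptions for illustration; the `file` and `model` form fields follow the OpenAI transcription API that the endpoint is compatible with. Check your own server's address and installed model name before using it.

```python
import urllib.request
import uuid


def build_transcription_request(base_url: str, audio: bytes,
                                filename: str = "clip.wav",
                                model: str = "parakeet") -> urllib.request.Request:
    """Build (but do not send) a multipart POST for /v1/audio/transcriptions."""
    boundary = uuid.uuid4().hex
    crlf = b"\r\n"
    body = crlf.join([
        f"--{boundary}".encode(),
        b'Content-Disposition: form-data; name="model"',
        b"",
        model.encode(),  # hypothetical model name, adjust to your install
        f"--{boundary}".encode(),
        f'Content-Disposition: form-data; name="file"; filename="{filename}"'.encode(),
        b"Content-Type: application/octet-stream",
        b"",
        audio,
        f"--{boundary}--".encode(),
        b"",
    ])
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/v1/audio/transcriptions",
        data=body,
        method="POST",
        headers={"Content-Type": f"multipart/form-data; boundary={boundary}"},
    )


# Placeholder bytes stand in for a real WAV file read from disk.
req = build_transcription_request("http://localhost:8080", b"RIFF....")
print(req.get_method(), req.full_url)

# To send it against a running server (OpenAI-style JSON has a "text" field):
# import json
# with urllib.request.urlopen(req) as resp:
#     print(json.load(resp)["text"])
```

The request is built separately from sending it so the sketch runs without a server; any OpenAI client library pointed at the local base URL should work just as well.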
https://reddit.com/link/1tt6oja/video/nxngb7x1aj4h1/player
Links:
- Code (MIT): https://github.com/mudler/parakeet.cpp
- Models (GGUF): https://huggingface.co/mudler/parakeet-cpp-gguf
All credit to NVIDIA for the Parakeet models and to ggml for the runtime. Benchmarks, methodology, and per-model plots are in the repo. Happy to answer questions about the port, the decoders, or the numbers.
harrro@reddit
Looks really fast thanks.
Is it possible to get the OpenAI-compatible transcription API without using LocalAI?
thirteen-bit@reddit
I'd be interested in this too, currently running llama.cpp llama-server, ik_llama.cpp llama-server, whisper-server, sd-server, kokoro FastAPI, parakeet FastAPI and Orpheus TTS as podman containers - all from a single llama-swap config.
And planning to add moss-tts-server from https://github.com/pwilkin/openmoss
cleverusernametry@reddit
Woah, how do you do this with llama-swap?
thirteen-bit@reddit
llama-swap README for basic config, this covers:
Config example for docker-based backends
So in podman with GPU access or docker it's possible to run something like:
Config example (with podman, docker should be similar):
mudler_it@reddit (OP)
I have no plans to add an OpenAI-compatible server; parakeet.cpp is built so that it's easy to write your own on top.
LocalAI is really small: you basically select the backends that you want installed (in this case, parakeet.cpp).
cleverusernametry@reddit
Didn't someone do this a while back, or am I hallucinating?
sfifs@reddit
Is this something we can use for voice to text for OpenClaw?
mudler_it@reddit (OP)
You can use it for voice-to-text for anything.
nuclearbananana@reddit
fyi https://github.com/CrispStrobe/CrispASR has basically tried to do this for all asr models, kinda like llama.cpp.
That said I found that project much slower than onnx on cpu, so hopefully yours is better
mudler_it@reddit (OP)
I discovered it just a few days ago; interesting approach. However, my take on this is a bit different: I prefer having separate projects that are each fully optimized for one model, with something like LocalAI consuming them individually. It's the best of both worlds, because you get the optimizations that a single-model implementation can carry without the burden of supporting multiple model architectures.
MaruluVR@reddit
Would love to see a standalone version of the OpenAI endpoint or maybe a llama.cpp integration. The "LocalAI" software is too bloated for me; I like keeping things separate in their own containers.
mudler_it@reddit (OP)
LocalAI only installs the backends that your models use. It's not bloated per se unless you start installing all the backends.
Full_Dimension_3495@reddit
This is golden. Was literally just setting up a custom voice-assisted agent pipeline and was finding Whisper to be a bit too slow for my liking. Ported this into my project and it's much quicker, and very accurate. Thank you for your hard work and contributions to the community!
andreyis29@reddit
I compiled your code under Windows 10 and got errors in two files. They need #define _USE_MATH_DEFINES (before <cmath> is included):
Far_Suit575@reddit
Love seeing tools become more accessible without adding extra complexity.
microbass@reddit
That's really cool, thanks!
_Whiskas_@reddit
Very nice!
Are there plans to do the same for NVIDIA's Canary family of models?
mudler_it@reddit (OP)
I'm looking at it already, but I'm very keen on keeping it very optimized to the model. If there aren't performance degradations I'll add it.
ToHallowMySleep@reddit
Great work so far. Just adding one more voice: I'd be interested in seeing this applied to Canary too.
KokaOP@reddit
How does it compare with nano-parakeet?
caetydid@reddit
Does your port support CPU-only inference and the multilingual model?
no_witty_username@reddit
This is very serendipitous. I am currently implementing exactly that for my voice agent, so this is perfect. Is it the V3 model? Anyway, good job!
SkyFeistyLlama8@reddit
Is there support for less common inference hardware like SYCL, Adreno OpenCL or aarch64-specific CPU instructions?
annodomini@reddit
There are open PRs for this in whisper.cpp and llama.cpp, I've been tracking this since trying Nemotron 3 Nano Omni and then realizing it didn't support audio:
brahh85@reddit
Thank you so much for the project, it's awesome because of the universal support for almost all hardware.
Do you plan to include the generation of subtitles?
cibernox@reddit
This is nice. I very recently published a port/recipe to run Parakeet on Intel NPUs, specifically through the Wyoming protocol (the one used by Home Assistant for voice interaction): https://github.com/cibernox/wyoming-parakeet-on-intel-npu
If we were able to run GGUF on the NPU, which in theory is possible, we could have greater convenience while keeping the low power consumption of NPUs.
Honest-Kangaroo-1830@reddit
Did you find this faster than the typical whisper pipeline? Funny enough, I started with local LLMs with the goal of using it in HA. Now I have 3k worth of compute lol. I've been looking for models that are coherent enough to run on my mini PC (8845h with 780m iGPU, 64GB DDR5) while still being fast. I have bounced between Qwen 3.6 35B at Q2 and Gemma. Just trying to get a whole pipeline locally that is fast enough to be a reasonable Alexa replacement.
cibernox@reddit
I started the same way.
Parakeet is better than Whisper in every dimension. It is faster (and it's not close, it is A LOT faster), it is more accurate, it has a wider vocabulary (it recognizes some well-known names that are not dictionary words, like Shakira), and it allows you to mix languages in a single sentence (e.g. "Play the song Despacito").
As for having a fast HA assist pipeline, a MoE is ideal. In my experience, you need a model that you can run at ~70 tk/s and that allows a non-thinking mode. If you can run Qwen 35B at that speed (you might), you should get sub-3s end-to-end responses in HA assist, which is where this feels truly fast.
Honest-Kangaroo-1830@reddit
Yeah I'm totally with you on the MoEs, 35B with MTP gets about 40ish TG on the mini PC, but LFM 8B A1B reaches 60 in TG (and it's also coherent and pretty excellent at tool calling). The only issue with A1B is that the containers don't generate correctly in llama.cpp, so a response would return with a big block of thinking. As a fallback, I have my main inference setup running 35B at blazing speeds, about 170 TG but I want to keep it clear because I genuinely use that pipeline already. Separation of powers sort of.
I would wait for someone to fix the thinking tags, then limit it to 300 tokens worth of thinking, then implement LFM for sure.
Thanks for that info for parakeet! I completely set down tts and stt once I got the LLM bug so I will absolutely implement it. That info is genuinely gold.
mudler_it@reddit (OP)
It's much faster and more accurate than Whisper. Here are the videos from the benchmarks:
https://github.com/mudler/parakeet.cpp/blob/master/benchmarks/media/gpu_whisper_duel.mp4
https://github.com/mudler/parakeet.cpp/blob/master/benchmarks/media/cpu_duel.mp4
mudler_it@reddit (OP)
Interesting! I don't have an NPU to test against, so this is out of my reach for now. Curious, what HW do you have?
cibernox@reddit
I recently got an Intel Core Ultra 265K to upgrade my 12th gen i3 home server. It has an NPU that in theory is perfect for some ML tasks, and STT and TTS are perfect examples. Image recognition in video feeds too.
If I can run Parakeet at 30x realtime using 12 W, then in streaming mode (where realtime is the upper limit of usage) it could average <1 W, as it would be nearly idle.
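
The estimate above can be written out as a simple duty-cycle calculation. This is only a back-of-the-envelope sketch: it assumes the NPU draws roughly nothing while idle and ignores wake-up overhead, neither of which has been measured here, and the function name is just illustrative.

```python
def average_power_watts(active_watts: float, realtime_factor: float) -> float:
    """Average draw when the NPU only works 1/realtime_factor of the time.

    Assumes idle draw is ~0 W and ignores wake-up overhead.
    """
    duty_cycle = 1.0 / realtime_factor
    return active_watts * duty_cycle


# 12 W while active, 30x realtime: both figures come from the comment above.
avg = average_power_watts(active_watts=12.0, realtime_factor=30.0)
print(f"duty cycle: {1 / 30:.1%}, average: {avg:.2f} W")
```

At 30x realtime the NPU is busy about 1/30 of the time when fed live audio, so 12 W active works out to roughly 0.4 W on average, consistent with the "<1 W" guess.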
Kokoro is the next on my list, so I can free my GPUs from the burden of all ML tasks except heavy LLM inference.
WhisperianCookie@reddit
pretty cool
dangerous_inference@reddit
I just switched from Parakeet to Qwen3 1.7B ASR because it's so much better running the model on the server than the client. Also I hate onnx libraries. I'll have to try this when I get my new server up and running.
Danmoreng@reddit
How does it compare against Sherpa-onnx in terms of speed? https://github.com/k2-fsa/sherpa-onnx
mudler_it@reddit (OP)
I haven't benchmarked against sherpa-onnx yet; I took NVIDIA's NeMo as the reference implementation. Nevertheless, good point: I'll take a look and try to run some benchmarks against it.
mudler_it@reddit (OP)
Update, Mario says it's faster than his onnx implementation:
https://x.com/badlogicgames/status/2061201400059531729?s=20
eramax@reddit
great work
koloved@reddit
Could it help the Handy app increase the speed of local STT?
badlogicgames@reddit
wow, this is awesome! i just finished a "shitty voice robot" project for our little one last weekend, using an ONNX-based parakeet inference pipeline. while it's cross-platform, i'd have preferred something based on GGML. and here it is. Thanks!