I ported NVIDIA Parakeet (speech-to-text) to ggml: same output as NeMo, faster, GGUF-quantized, no Python

Posted by mudler_it@reddit | LocalLLaMA | View on Reddit | 40 comments

I ported NVIDIA's Parakeet speech-to-text models to pure C++/ggml (the engine behind llama.cpp and whisper.cpp). It runs the FastConformer TDT / CTC / RNNT / hybrid models with no Python and no PyTorch, on CPU and GPU (CUDA, HIP, Vulkan, Metal).

The goal was to match NeMo exactly, then make it deployable anywhere. Where it landed:

Output is byte-for-byte identical to NeMo (WER 0 on the f32/f16 path).
Faster than NeMo's own PyTorch runtime: up to ~5x on the larger TDT/hybrid models on GPU, up to ~1.86x on CPU when quantized, and about half the memory.
Around 600x realtime on GPU on a 23s clip (one hour of audio in roughly 6 seconds).
Quantized GGUF for every variant: f16, q8_0, q6_k, q5_k, q4_k.
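
The "one hour in roughly 6 seconds" figure follows directly from the ~600x realtime number. A minimal sketch of that arithmetic, assuming the throughput measured on the 23s clip carries over to longer audio (an assumption; the function names are illustrative and not part of parakeet.cpp):

```python
def realtime_factor(audio_seconds: float, processing_seconds: float) -> float:
    """Seconds of audio transcribed per second of compute."""
    return audio_seconds / processing_seconds


def processing_time(audio_seconds: float, rtf: float) -> float:
    """Compute time needed for `audio_seconds` of audio at realtime factor `rtf`."""
    return audio_seconds / rtf


# One hour of audio at the ~600x figure quoted in the post.
one_hour = processing_time(3600.0, 600.0)  # 6.0 seconds

# The 23 s benchmark clip at the same factor: a few hundredths of a second.
clip = processing_time(23.0, 600.0)
```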

It also does cache-aware streaming with real-time end-of-utterance, word-level timestamps with confidence, and exposes a small flat C-API so you can embed it pretty much everywhere. The GGUF is self-contained: the tokenizer/vocab is baked into the model file, no external files needed.

It ships as a backend in LocalAI too, so you get an OpenAI-compatible /v1/audio/transcriptions endpoint fully local. (Disclosure: I work on LocalAI.)
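
For anyone wondering what a client call against such an endpoint looks like, here is a minimal sketch using only the Python standard library. It assumes the server accepts the usual OpenAI-style multipart form (`model` and `file` fields) and returns JSON with a `text` field. The base URL, the model name, and the file name in the usage note are placeholders, not values taken from the LocalAI or parakeet.cpp docs.

```python
import json
import os
import urllib.request
import uuid


def build_transcription_request(base_url, audio_bytes, filename, model):
    """Build a multipart/form-data POST for an OpenAI-style transcription endpoint."""
    boundary = uuid.uuid4().hex
    head = (
        f"--{boundary}\r\n"
        f'Content-Disposition: form-data; name="model"\r\n\r\n'
        f"{model}\r\n"
        f"--{boundary}\r\n"
        f'Content-Disposition: form-data; name="file"; filename="{filename}"\r\n'
        f"Content-Type: application/octet-stream\r\n\r\n"
    ).encode("utf-8")
    tail = f"\r\n--{boundary}--\r\n".encode("utf-8")
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/v1/audio/transcriptions",
        data=head + audio_bytes + tail,
        headers={"Content-Type": f"multipart/form-data; boundary={boundary}"},
        method="POST",
    )


def transcribe(base_url, audio_path, model):
    """Send an audio file and return the transcript (assumes a JSON {"text": ...} reply)."""
    with open(audio_path, "rb") as f:
        audio = f.read()
    req = build_transcription_request(
        base_url, audio, os.path.basename(audio_path), model
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["text"]
```

Usage would be something like `transcribe("http://127.0.0.1:8080", "clip.wav", "parakeet")`, with the host, port, and model name adjusted to whatever your server actually exposes.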

https://reddit.com/link/1tt6oja/video/nxngb7x1aj4h1/player

Links:

Code (MIT): https://github.com/mudler/parakeet.cpp
Models (GGUF): https://huggingface.co/mudler/parakeet-cpp-gguf

All credit to NVIDIA for the Parakeet models and to ggml for the runtime. Benchmarks, methodology, and per-model plots are in the repo. Happy to answer questions about the port, the decoders, or the numbers.

[-]

harrro@reddit

Looks really fast thanks.

Is it possible to get the Openai-compatible transcription API without using LocalAI?

[-]

thirteen-bit@reddit

I'd be interested in this too, currently running llama.cpp llama-server, ik_llama.cpp llama-server, whisper-server, sd-server, kokoro FastAPI, parakeet FastAPI and Orpheus TTS as podman containers - all from a single llama-swap config.

And planning to add moss-tts-server from https://github.com/pwilkin/openmoss

[-]

cleverusernametry@reddit

Woah, how do you do this with llama-swap?

[-]

thirteen-bit@reddit

llama-swap README for basic config, this covers:

Config example for docker-based backends

So in podman with GPU access or docker it's possible to run something like:

https://github.com/remsky/Kokoro-FastAPI
https://github.com/Shadowfita/parakeet-tdt-0.6b-v2-fastapi
https://github.com/Lex-au/Orpheus-FastAPI

Config example (with podman, docker should be similar):

macros:
  "modeldir": ./models
  "bindir": ./bin
  "comfymodeldir": ${modeldir}/comfyui

  "llama-server": >
    ${bindir}/llama-server
    --port ${PORT}
    --jinja

  "moss-tts-server": ${bindir}/moss-tts-server

  "ikllama-server": >
    ${bindir}/ik_llama-server
    --port ${PORT}
    --jinja

  "sd-server": >
    ${bindir}/sd-server
    --listen-port ${PORT}
    --diffusion-fa
    --verbose

  "whisper-server": ${bindir}/whisper-server

models:
  "whisper":
    name: "Whisper Audio Transcription"
    proxy: "http://127.0.0.1:${PORT}"
    checkEndpoint: /v1/audio/transcriptions/
    cmd: |
      ${whisper-server}
      --host 127.0.0.1
      --port ${PORT}
      --model ${modeldir}/ggml-large-v3-turbo-q8_0.bin
      --request-path /v1/audio/transcriptions
      --inference-path ""

  "z-image-turbo":
    name: "Z-Image Turbo text-to-image"
    checkEndpoint: /
    cmd: |
      ${sd-server}
      --diffusion-model ${modeldir}/z-image-turbo-Q8_0.gguf
      --vae ${comfymodeldir}/vae/ae.safetensors
      --llm ${modeldir}/Qwen3-4B-Instruct-2507-UD-Q8_K_XL.gguf
      --lora-model-dir ${comfymodeldir}/loras/
      --steps 8
      --cfg-scale 1.0
      --vae-conv-direct
      --sampling-method euler
      --scheduler sgm_uniform
      --seed -1

  "openmoss-1.5-q8":
    name: "MOSS-TTS-v1.5"
    proxy: "http://127.0.0.1:${PORT}"
    #env:
    #  - "LD_LIBRARY_PATH=${env.HOME}/src/llama.cpp/bin"
    cmd: |
      ${moss-tts-server}
      --host 127.0.0.1
      --port ${PORT}
      --no-webui
      --model ${modeldir}/moss-tts-1.5-q8_0.gguf

  "kokoro-tts":
    name: "kokoro TTS"
    proxy: "http://127.0.0.1:${PORT}"
    useModelName: "tts-1"
    checkEndpoint: /health
    cmd: |
      podman run
      --rm
      --replace
      --name ${MODEL_ID}
      -p ${PORT}:8880
      --device nvidia.com/gpu=all
      --env 'API_LOG_LEVEL=INFO'
      ghcr.io/remsky/kokoro-fastapi-gpu:latest
    cmdStop: podman stop ${MODEL_ID}

  "parakeet-tdt":
    name: "parakeet-tdt-0.6b-v3 ASR"
    proxy: "http://127.0.0.1:${PORT}"
    checkEndpoint: /health
    cmd: |
      podman run
      --rm
      --replace
      --name ${MODEL_ID}
      -p ${PORT}:5092
      --device nvidia.com/gpu=all
      localhost/parakeet-tdt:gpu
    cmdStop: podman stop ${MODEL_ID}

  "qwen3.6-27b-q4-novision":
    name: "Qwen 3.6 27B Q4 text only"
    filters:
      stripParams: "temperature, top_k, top_p, min_p, repeat_penalty, presence_penalty"
      setParams:
        # Default: reasoning-coding (precise coding with thinking enabled)
        chat_template_kwargs:
          enable_thinking: true
        temperature: 0.6
        presence_penalty: 0.0
      setParamsByID:
        "${MODEL_ID}:reasoning-coding":
          chat_template_kwargs:
            enable_thinking: true
          temperature: 0.6
          presence_penalty: 0.0
        "${MODEL_ID}:reasoning-general":
          chat_template_kwargs:
            enable_thinking: true
          temperature: 1.0
          presence_penalty: 1.5
        "${MODEL_ID}:instruct-general":
          chat_template_kwargs:
            enable_thinking: false
          temperature: 0.7
          top_p: 0.8
          presence_penalty: 1.5
        "${MODEL_ID}:instruct-reasoning":
          chat_template_kwargs:
            enable_thinking: false
          temperature: 1.0
          top_p: 1.0
          presence_penalty: 1.5
    cmd: |
      ${llama-server}
      --cache-type-k q8_0
      --cache-type-v q8_0
      --flash-attn on
      --n-gpu-layers 999
      --spec-type ngram-mod
      --parallel 1
      --top-p 0.95
      --top-k 20
      --min-p 0.0
      --repeat-penalty 1.0
      --reasoning on
      --chat-template-kwargs '{"preserve_thinking": true}'
      --ctx-size 131072
      --fit off
      #--no-mmproj-offload
      #--image-min-tokens 256 --image-max-tokens 768
      #--mmproj ${modeldir}/mmproj-Qwen_Qwen3.6-27B-bf16.gguf
      --model ${modeldir}/Qwen_Qwen3.6-27B-Q4_K_M.gguf

[-]

mudler_it@reddit (OP)

I have no plans to ship an OpenAI-compatible server; parakeet.cpp is designed so that it's easy to write your own on top.

LocalAI is really small - you basically select the backends that you want installed (in this case, parakeet.cpp).

[-]

cleverusernametry@reddit

Didn't someone do this a while back, or am I hallucinating?

[-]

sfifs@reddit

Is this something we can use for voice to text for OpenClaw?

[-]

mudler_it@reddit (OP)

You can use it for voice-to-text for anything.

[-]

nuclearbananana@reddit

FYI, https://github.com/CrispStrobe/CrispASR has basically tried to do this for all ASR models, kind of like llama.cpp.

That said, I found that project much slower than ONNX on CPU, so hopefully yours is better.

[-]

mudler_it@reddit (OP)

I discovered it just a few days ago; interesting approach. However, my take on this is a bit different - I prefer having separate projects that are each completely optimized for one model, and having e.g. LocalAI consume them individually. It's the best of both worlds, because you get the optimizations a single-model implementation can carry without the burden of supporting multiple model architectures.

[-]

MaruluVR@reddit

Would love to see a standalone version of the OpenAI endpoint or maybe a llama.cpp integration. The "LocalAI" software is too bloated for me; I like keeping things separate in their own containers.

[-]

mudler_it@reddit (OP)

LocalAI really installs only the backends that your model uses. It's not bloated per se unless you start installing all the backends.

[-]

Full_Dimension_3495@reddit

This is golden. Was literally just setting up a custom voice-assisted agent pipeline and was finding Whisper to be a bit too slow for my liking. Ported this into my project and it's much quicker, and very accurate. Thank you for your hard work and contributions to the community!

[-]

andreyis29@reddit

I compiled your code under Windows 10 and got errors in two files. They need #define _USE_MATH_DEFINES:

\parakeet-build\src\mel_gpu.cpp

\parakeet-build\src\fft.cpp

[-]

Far_Suit575@reddit

Love seeing tools become more accessible without adding extra complexity.

[-]

microbass@reddit

That's really cool, thanks!

[-]

_Whiskas_@reddit

Very nice!

Are there plans to do the same for NVIDIA's Canary family of models?

[-]

mudler_it@reddit (OP)

I'm looking at it already, but I'm very keen on keeping the project tightly optimized for the model. If it doesn't degrade performance, I'll add it.

[-]

ToHallowMySleep@reddit

Great work so far. Just to add one more voice: I'd be interested in seeing this applied to Canary too.

[-]

KokaOP@reddit

How does it compare with nano-parakeet?

[-]

caetydid@reddit

Does your port support CPU-only inference and the multilingual model?

[-]

no_witty_username@reddit

This is very serendipitous. I am currently implementing exactly that for my voice agent, so this is perfect. Is it the V3 model? Anyway, good job!

[-]

SkyFeistyLlama8@reddit

Is there support for less common inference hardware like SYCL, Adreno OpenCL or aarch64-specific CPU instructions?

[-]

annodomini@reddit

There are open PRs for this in whisper.cpp and llama.cpp, I've been tracking this since trying Nemotron 3 Nano Omni and then realizing it didn't support audio:

https://github.com/ggml-org/llama.cpp/pull/22520
https://github.com/ggml-org/whisper.cpp/pull/3735

[-]

brahh85@reddit

Thank you so much for the project. It's awesome because of the near-universal hardware support.

Do you plan to include subtitle generation?

[-]

cibernox@reddit

This is nice. I very recently published a port/recipe to run Parakeet on Intel NPUs, specifically through the Wyoming protocol (the one used by Home Assistant for voice interaction): https://github.com/cibernox/wyoming-parakeet-on-intel-npu

If we were able to run GGUF on the NPU, which in theory is possible, we could get greater convenience while keeping the low power consumption of NPUs.

[-]

Honest-Kangaroo-1830@reddit

Did you find this faster than the typical whisper pipeline? Funny enough, I started with local LLMs with the goal of using it in HA. Now I have 3k worth of compute lol. I've been looking for models that are coherent enough to run on my mini PC (8845h with 780m iGPU, 64GB DDR5) while still being fast. I have bounced between Qwen 3.6 35B at Q2 and Gemma. Just trying to get a whole pipeline locally that is fast enough to be a reasonable Alexa replacement.

[-]

cibernox@reddit

I started the same way.
Parakeet is better than Whisper in every dimension. It is faster (and it's not close, it is A LOT faster), it is more accurate, it has a wider vocabulary (it recognizes some well-known names that are not dictionary words, like Shakira), and it allows you to mix languages in a single sentence (e.g. "Play the song Despacito").

As for having a fast HA Assist pipeline, a MoE is ideal. In my experience, you need a model that you can run at ~70 tk/s and that allows a non-thinking mode. If you can run Qwen 35B at that speed (you might), you should get sub-3s end-to-end responses in HA Assist, which is where this feels truly fast.

[-]

Honest-Kangaroo-1830@reddit

Yeah I'm totally with you on the MoEs, 35B with MTP gets about 40ish TG on the mini PC, but LFM 8B A1B reaches 60 in TG (and it's also coherent and pretty excellent at tool calling). The only issue with A1B is that the containers don't generate correctly in llama.cpp, so a response would return with a big block of thinking. As a fallback, I have my main inference setup running 35B at blazing speeds, about 170 TG but I want to keep it clear because I genuinely use that pipeline already. Separation of powers sort of.

I would wait for someone to fix the thinking tags, then limit it to 300 tokens worth of thinking, then implement LFM for sure.

Thanks for that info for parakeet! I completely set down tts and stt once I got the LLM bug so I will absolutely implement it. That info is genuinely gold.

[-]

mudler_it@reddit (OP)

It's much faster and more accurate than Whisper. Here are the videos from the benchmarks:

https://github.com/mudler/parakeet.cpp/blob/master/benchmarks/media/gpu_whisper_duel.mp4

https://github.com/mudler/parakeet.cpp/blob/master/benchmarks/media/cpu_duel.mp4

[-]

mudler_it@reddit (OP)

Interesting! I don't have an NPU to test against, so this is out of my reach for now. Curious, what HW do you have?

[-]

cibernox@reddit

I recently got an Intel Core Ultra 265K to upgrade my 12th gen i3 home server. It has an NPU that in theory is perfect for some ML tasks, and STT and TTS are perfect examples. Image recognition in video feeds too.

If I can run Parakeet at 30x realtime using 12 W, that means that in streaming mode (where realtime is the upper limit of usage) it could use <1 W, as it would be nearly idle.
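
The estimate above is a duty-cycle calculation: at 30x realtime the accelerator only needs to be busy 1/30 of the time to keep up with a live stream. A rough sketch of that arithmetic, assuming idle power is negligible and that per-chunk wake-up overhead can be ignored (both are assumptions, and the function name is illustrative):

```python
def average_power_watts(active_watts, realtime_factor, idle_watts=0.0):
    """Average draw when the device is active only 1/realtime_factor of the time."""
    duty = 1.0 / realtime_factor
    return active_watts * duty + idle_watts * (1.0 - duty)


# 12 W while transcribing, 30x realtime, idle power assumed to be zero.
avg = average_power_watts(12.0, 30.0)  # about 0.4 W
```

With a non-zero idle draw the average rises accordingly, so the <1 W figure holds only if the NPU idles well below 1 W.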

Kokoro is the next on my list, so I can free my GPUs from the burden of all ML tasks except heavy LLM inference.

[-]

WhisperianCookie@reddit

pretty cool

[-]

dangerous_inference@reddit

I just switched from Parakeet to Qwen3 1.7B ASR because it's so much better running the model on the server than the client. Also I hate onnx libraries. I'll have to try this when I get my new server up and running.

[-]

Danmoreng@reddit

How does it compare against Sherpa-onnx in terms of speed? https://github.com/k2-fsa/sherpa-onnx

[-]

mudler_it@reddit (OP)

I haven't benchmarked against sherpa-onnx yet; I took NVIDIA's NeMo as the reference implementation. Good point nevertheless, I will take a look and try to run some benchmarks against it.

[-]

mudler_it@reddit (OP)

Update, Mario says it's faster than his onnx implementation:

https://x.com/badlogicgames/status/2061201400059531729?s=20

[-]