I benchmarked 42 STT models on medical audio with a new Medical WER metric — the leaderboard completely reshuffled

Posted by MajesticAd2862@reddit | LocalLLaMA | View on Reddit | 18 comments

TL;DR: I updated my medical speech-to-text benchmark to 42 models (up from 31 in v3) and added a new metric: Medical WER (M-WER).

Standard WER treats every word equally. In medical audio, that makes little sense — “yeah” and “amoxicillin” do not carry the same importance.

So for v4 I re-scored the benchmark using only clinically relevant words: drugs, conditions, symptoms, anatomy, and clinical procedures. I also broke out Drug M-WER separately, since medication names are where patient-safety risk gets real.

That change reshuffled the leaderboard hard.

A few notable results:

- Gemini 3 Pro Preview takes #1 with 2.65% M-WER
- Microsoft's open-source VibeVoice-ASR 9B beats Microsoft's own closed MAI-Transcribe-1
- Qwen3-ASR 1.7B is the standout small open-source model (4.40% M-WER on an A10)
- Parakeet TDT 0.6B v3, a normal-WER favorite in v3, drops to #31 on M-WER

All code, transcripts, per-file metrics, and the full leaderboard are open-source on GitHub.

Previous posts: v1 · v2 · v3

What changed since v3

1. New headline metric: Medical WER (M-WER)

Standard WER is still useful, but in a doctor-patient conversation it overweights the wrong things. A missed filler word and a missed medication name both count as one error, even though only one is likely to matter clinically.

So for v4 I added:

- Medical WER (M-WER): standard WER alignment, but errors counted only over clinically relevant words
- Drug M-WER: the same metric restricted to medication names

The current vocabulary covers 179 terms across 5 categories: drugs, conditions, symptoms, anatomy, and clinical procedures.

The reshuffle is real. Parakeet TDT 0.6B v3 looked great on normal WER in v3, but on M-WER it falls to #31, with 22% Drug M-WER. Great at conversational glue, much weaker on the words that actually carry clinical meaning.

2. 11 new models added (31 → 42)

This round added a bunch of serious new contenders; the full list is on GitHub, and the most notable entries are covered below.

Also added a separate multi-speaker track with Multitalker Parakeet 0.6B using cpWER, since joint ASR + diarization is a different evaluation problem.
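For anyone unfamiliar: cpWER concatenates each speaker's words and scores the best speaker permutation, so diarization mistakes surface as word errors. Here is a minimal sketch of the idea (not the benchmark's code), assuming both sides found the same number of speakers and using jiwer for the base WER:

```python
# Sketch of cpWER (concatenated minimum-permutation WER). Not the benchmark's
# code; assumes reference and hypothesis agree on the speaker count.
from itertools import permutations

import jiwer

def cp_wer(ref_by_speaker: dict[str, str], hyp_by_speaker: dict[str, str]) -> float:
    """Each dict maps a speaker label to that speaker's concatenated words."""
    refs = list(ref_by_speaker.values())
    # Try every assignment of hypothesis speakers to reference speakers and
    # keep the permutation with the lowest aggregate WER.
    return min(
        jiwer.wer(refs, list(perm))
        for perm in permutations(hyp_by_speaker.values())
    )
```

For PriMock57 that means just two speakers per file, so the permutation search is trivial.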

Top 20 by Medical WER

Dataset: PriMock57 — 55 doctor-patient consultations, ~80K words of British English medical dialogue.

| # | Model | WER | M-WER | Drug M-WER | Speed | Host |
|---|-------|-----|-------|------------|-------|------|
| 1 | Google Gemini 3 Pro Preview | 8.35% | 2.65% | 3.1% | 64.5s | API |
| 2 | Google Gemini 2.5 Pro | 8.15% | 2.97% | 4.1% | 56.4s | API |
| 3 | VibeVoice-ASR 9B (Microsoft, open-source) | 8.34% | 3.16% | 5.6% | 96.7s | H100 |
| 4 | Soniox stt-async-v4 | 9.18% | 3.32% | 7.1% | 46.2s | API |
| 5 | Google Gemini 3 Flash Preview | 11.33% | 3.64% | 5.2% | 51.5s | API |
| 6 | ElevenLabs Scribe v2 | 9.72% | 3.86% | 4.3% | 43.5s | API |
| 7 | AssemblyAI Universal-3 Pro (medical-v1) | 9.55% | 4.02% | 6.5% | 37.3s | API |
| 8 | Qwen3 ASR 1.7B (open-source) | 9.00% | 4.40% | 8.6% | 6.8s | A10 |
| 9 | Deepgram Nova-3 Medical | 9.05% | 4.53% | 9.7% | 12.9s | API |
| 10 | OpenAI GPT-4o Mini Transcribe (Dec '25) | 11.18% | 4.85% | 10.6% | 40.4s | API |
| 11 | Microsoft MAI-Transcribe-1 | 11.52% | 4.85% | 11.2% | 21.8s | API |
| 12 | ElevenLabs Scribe v1 | 10.87% | 4.88% | 7.5% | 36.3s | API |
| 13 | Google Gemini 2.5 Flash | 9.45% | 5.01% | 10.3% | 20.2s | API |
| 14 | Voxtral Mini Transcribe V1 | 11.85% | 5.17% | 11.0% | 22.4s | API |
| 15 | Parakeet TDT 1.1B | 9.03% | 5.20% | 15.5% | 12.3s | T4 |
| 16 | Voxtral Mini Transcribe V2 | 11.64% | 5.36% | 12.1% | 18.4s | API |
| 17 | Voxtral Mini 4B Realtime | 11.89% | 5.39% | 11.8% | 270.9s | A10 |
| 18 | Cohere Transcribe (Mar 2026) | 11.81% | 5.59% | 16.6% | 3.9s | A10 |
| 19 | OpenAI Whisper-1 | 13.20% | 5.62% | 10.3% | 104.3s | API |
| 20 | Groq Whisper Large v3 Turbo | 12.14% | 5.75% | 14.4% | 8.0s | API |

Full 42-model leaderboard on GitHub.

The funny part: Microsoft vs Microsoft

Microsoft now has two visible STT offerings in this benchmark:

- VibeVoice-ASR 9B, the open-source release (#3 overall)
- MAI-Transcribe-1, the flagship closed STT product (#11)

And on the metric that actually matters for medical voice, the open model wins clearly:

- M-WER: 3.16% vs 4.85%
- Drug M-WER: 5.6% vs 11.2%

So Microsoft’s own open-source release beats Microsoft’s flagship closed STT product by:

- 3.18 points of plain WER (8.34% vs 11.52%)
- 1.69 points of M-WER
- 5.6 points of Drug M-WER, i.e. roughly half the drug-name error rate

VibeVoice is very good, but it is also heavy: 9B params, long inference (96.7s), and we ran it on an H100 96GB. So it wins on contextual medical accuracy, but not on deployability.

Best small open-source model: Qwen3-ASR 1.7B

This is probably the most practically interesting open-source result in the whole board.

Qwen3-ASR 1.7B lands at:

- #8 overall: 9.00% WER, 4.40% M-WER, 8.6% Drug M-WER
- 6.8s on a single A10, the fastest entry in the top 10

That is a strong accuracy-to-cost tradeoff.

It is much faster than VibeVoice, much smaller, and still good enough on medical terms that I think a lot of people building local or semi-local clinical voice stacks will care more about this result than the #1 spot.

One important deployment caveat: Qwen3-ASR does not play nicely with T4. The model path wants newer attention support and ships in bf16, so A10 or better is the realistic target.

There was also a nasty long-audio bug in the default vLLM setup: Qwen3 would silently hang on longer files. The practical fix was:

max_num_batched_tokens=16384

That one-line change fixed it for us. Full notes are in the repo’s AGENTS.md.
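For context, here is roughly where that setting goes when standing up the model with vLLM's offline engine. The model id below is an assumed placeholder (check the repo for the exact path); only the max_num_batched_tokens value comes from our notes:

```python
# Sketch of applying the fix when constructing the vLLM engine.
from vllm import LLM

llm = LLM(
    model="Qwen/Qwen3-ASR-1.7B",     # assumed HF path -- see the repo for the real one
    dtype="bfloat16",                 # the model ships in bf16, hence A10+ (no T4)
    max_num_batched_tokens=16384,     # the one-line fix for silent hangs on long audio
)
```

The served equivalent is passing `--max-num-batched-tokens 16384` to `vllm serve`.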

Cloud APIs got serious this round

v3 was still mostly a Google / ElevenLabs / OpenAI / Mistral story.

v4 broadened that a lot: Soniox, AssemblyAI, Deepgram, Microsoft, Cohere, and Groq all land in the top 20 now.

Google still dominates the very top, but the broader takeaway is different:

the gap between strong cloud APIs and strong open-source models is now small enough that deployment constraints matter more than ever.

How M-WER is computed

The implementation is simple on purpose:

  1. Tag medically relevant words in the reference transcript
  2. Run normal WER alignment between reference and hypothesis
  3. Count substitutions / deletions / insertions only on those tagged medical tokens
  4. Compute M-WER over all medical tokens, and Drug M-WER over the drug subset only
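To make those steps concrete, here is a minimal sketch of the idea in Python. It is not the repo's implementation, and the toy term sets are made up (single-token matching, for simplicity); the real 179-term vocabulary is the one described below.

```python
# Minimal sketch of the four steps above -- not the repo's exact code.
MEDICAL_TERMS = {"amoxicillin", "asthma", "nausea", "abdomen", "biopsy"}  # toy set
DRUG_TERMS = {"amoxicillin"}  # Drug M-WER restricts errors to this subset

def align(ref: list[str], hyp: list[str]) -> list[tuple[str, str]]:
    """Levenshtein alignment. Returns (op, word) pairs where op is
    'ok'/'sub'/'del' (word = reference token) or 'ins' (word = hypothesis token)."""
    m, n = len(ref), len(hyp)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i
    for j in range(n + 1):
        d[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            sub = d[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            d[i][j] = min(sub, d[i - 1][j] + 1, d[i][j - 1] + 1)
    ops, i, j = [], m, n
    while i > 0 or j > 0:  # backtrace from the bottom-right corner
        if i > 0 and j > 0 and d[i][j] == d[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1]):
            ops.append(("ok" if ref[i - 1] == hyp[j - 1] else "sub", ref[i - 1]))
            i, j = i - 1, j - 1
        elif i > 0 and d[i][j] == d[i - 1][j] + 1:
            ops.append(("del", ref[i - 1]))
            i -= 1
        else:
            ops.append(("ins", hyp[j - 1]))
            j -= 1
    return ops[::-1]

def m_wer(ref: str, hyp: str, vocab: set[str]) -> float:
    ref_w, hyp_w = ref.lower().split(), hyp.lower().split()
    # Step 3: errors count only when the affected token is in the medical vocab
    errors = sum(op != "ok" and w in vocab for op, w in align(ref_w, hyp_w))
    total = sum(w in vocab for w in ref_w)  # medical tokens in the reference
    return errors / total if total else 0.0
```

Worked example: on reference "patient takes amoxicillin daily" vs hypothesis "patient takes amoxicillin today", plain WER is 25% but M-WER is 0%, since the only error is a non-medical word. Passing DRUG_TERMS instead of MEDICAL_TERMS gives Drug M-WER.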

Current vocab: 179 terms across the 5 categories listed above. The vocabulary file is in evaluate/medical_terms_list.py and is easy to extend.
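If you want to extend it, the shape is presumably something like the following; the category names come from the post, but the example terms here are illustrative, not the repo's actual entries:

```python
# Illustrative only: hypothetical shape for evaluate/medical_terms_list.py.
MEDICAL_TERMS_BY_CATEGORY = {
    "drugs": {"amoxicillin", "metformin", "ibuprofen"},
    "conditions": {"asthma", "hypertension"},
    "symptoms": {"nausea", "dyspnoea"},
    "anatomy": {"abdomen", "larynx"},
    "procedures": {"biopsy", "endoscopy"},
}
```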

Links

Happy to take questions, criticism on the metric design, or suggestions for v5.