I benchmarked 42 STT models on medical audio with a new Medical WER metric — the leaderboard completely reshuffled
Posted by MajesticAd2862@reddit | LocalLLaMA | View on Reddit | 18 comments
TL;DR: I updated my medical speech-to-text benchmark to 42 models (up from 31 in v3) and added a new metric: Medical WER (M-WER).
Standard WER treats every word equally. In medical audio, that makes little sense — “yeah” and “amoxicillin” do not carry the same importance.
So for v4 I re-scored the benchmark using only clinically relevant words: drugs, conditions, symptoms, anatomy, and clinical procedures. I also broke out Drug M-WER separately, since medication names are where patient-safety risk gets real.
That change reshuffled the leaderboard hard.
A few notable results:
- VibeVoice-ASR 9B ranks #3 on M-WER and beats Microsoft’s own new closed MAI-Transcribe-1, which lands at #11
- Parakeet TDT 0.6B v3 drops from a strong overall-WER position to #31 on M-WER because of weak drug-name performance
- Qwen3-ASR 1.7B is the most interesting small local model this round: 4.40% M-WER and about 7s/file on A10
- Cloud APIs were stronger than I expected: Soniox, AssemblyAI Universal-3 Pro, and Deepgram Nova-3 Medical all ended up genuinely competitive
All code, transcripts, per-file metrics, and the full leaderboard are open-source on GitHub.
What changed since v3
1. New headline metric: Medical WER (M-WER)
Standard WER is still useful, but in a doctor-patient conversation it overweights the wrong things. A missed filler word and a missed medication name both count as one error, even though only one is likely to matter clinically.
So for v4 I added:
- M-WER = WER computed only over medically relevant reference tokens
- Drug M-WER = same idea, but restricted to drug names only
The current vocabulary covers 179 terms across 5 categories:
- drugs
- conditions
- symptoms
- anatomy
- clinical procedures
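For illustration, a categorized vocabulary like this can live in a plain Python module. The real 179-term list is in the repo's evaluate/medical_terms_list.py; the structure and entries below are just a hypothetical sketch of the idea:

```python
# Hypothetical shape of a categorized medical vocabulary module.
# The actual term list lives in evaluate/medical_terms_list.py in the repo.
MEDICAL_TERMS = {
    "drugs": ["amoxicillin", "ibuprofen", "metformin"],
    "conditions": ["asthma", "hypertension"],
    "symptoms": ["wheezing", "nausea"],
    "anatomy": ["abdomen", "larynx"],
    "clinical_procedures": ["biopsy", "intubation"],
}

# Flat sets used when tagging reference tokens.
ALL_TERMS = {t for terms in MEDICAL_TERMS.values() for t in terms}
DRUG_TERMS = set(MEDICAL_TERMS["drugs"])
```

Keeping the categories separate is what makes a drug-only breakdown like Drug M-WER a one-line filter rather than a second vocabulary.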
The reshuffle is real. Parakeet TDT 0.6B v3 looked great on normal WER in v3, but on M-WER it falls to #31, with 22% Drug M-WER. Great at conversational glue, much weaker on the words that actually carry clinical meaning.
2. 11 new models added (31 → 42)
This round added a bunch of new serious contenders:
- Soniox stt-async-v4 → #4 on M-WER
- AssemblyAI Universal-3 Pro (domain: medical-v1) → #7
- Deepgram Nova-3 Medical → #9
- Microsoft MAI-Transcribe-1 → #11
- Qwen3-ASR 1.7B → #8, best small open-source model this round
- Cohere Transcribe (Mar 2026) → #18, extremely fast
- Parakeet TDT 1.1B → #15
- Facebook MMS-1B-all → #42, dead last on this dataset
Also added a separate multi-speaker track with Multitalker Parakeet 0.6B using cpWER, since joint ASR + diarization is a different evaluation problem.
Top 20 by Medical WER
Dataset: PriMock57 — 55 doctor-patient consultations, ~80K words of British English medical dialogue.
| # | Model | WER | M-WER | Drug M-WER | Speed | Host |
|---|---|---|---|---|---|---|
| 1 | Google Gemini 3 Pro Preview | 8.35% | 2.65% | 3.1% | 64.5s | API |
| 2 | Google Gemini 2.5 Pro | 8.15% | 2.97% | 4.1% | 56.4s | API |
| 3 | VibeVoice-ASR 9B (Microsoft, open-source) | 8.34% | 3.16% | 5.6% | 96.7s | H100 |
| 4 | Soniox stt-async-v4 | 9.18% | 3.32% | 7.1% | 46.2s | API |
| 5 | Google Gemini 3 Flash Preview | 11.33% | 3.64% | 5.2% | 51.5s | API |
| 6 | ElevenLabs Scribe v2 | 9.72% | 3.86% | 4.3% | 43.5s | API |
| 7 | AssemblyAI Universal-3 Pro (medical-v1) | 9.55% | 4.02% | 6.5% | 37.3s | API |
| 8 | Qwen3 ASR 1.7B (open-source) | 9.00% | 4.40% | 8.6% | 6.8s | A10 |
| 9 | Deepgram Nova-3 Medical | 9.05% | 4.53% | 9.7% | 12.9s | API |
| 10 | OpenAI GPT-4o Mini Transcribe (Dec '25) | 11.18% | 4.85% | 10.6% | 40.4s | API |
| 11 | Microsoft MAI-Transcribe-1 | 11.52% | 4.85% | 11.2% | 21.8s | API |
| 12 | ElevenLabs Scribe v1 | 10.87% | 4.88% | 7.5% | 36.3s | API |
| 13 | Google Gemini 2.5 Flash | 9.45% | 5.01% | 10.3% | 20.2s | API |
| 14 | Voxtral Mini Transcribe V1 | 11.85% | 5.17% | 11.0% | 22.4s | API |
| 15 | Parakeet TDT 1.1B | 9.03% | 5.20% | 15.5% | 12.3s | T4 |
| 16 | Voxtral Mini Transcribe V2 | 11.64% | 5.36% | 12.1% | 18.4s | API |
| 17 | Voxtral Mini 4B Realtime | 11.89% | 5.39% | 11.8% | 270.9s | A10 |
| 18 | Cohere Transcribe (Mar 2026) | 11.81% | 5.59% | 16.6% | 3.9s | A10 |
| 19 | OpenAI Whisper-1 | 13.20% | 5.62% | 10.3% | 104.3s | API |
| 20 | Groq Whisper Large v3 Turbo | 12.14% | 5.75% | 14.4% | 8.0s | API |
Full 42-model leaderboard on GitHub.
The funny part: Microsoft vs Microsoft
Microsoft now has two visible STT offerings in this benchmark:
- VibeVoice-ASR 9B — open-source, from Microsoft Research
- MAI-Transcribe-1 — closed, newly shipped by Microsoft's new SuperIntelligence team, available through Azure Foundry
And on the metric that actually matters for medical voice, the open model wins clearly:
- VibeVoice-ASR 9B → #3, 3.16% M-WER
- MAI-Transcribe-1 → #11, 4.85% M-WER
So Microsoft’s own open-source release beats Microsoft’s flagship closed STT product by:
- 1.7 absolute points of M-WER
- 5.6 absolute points of Drug M-WER
VibeVoice is very good, but it is also heavy: 9B params, long inference, and we ran it on H100 96GB. So it wins on contextual medical accuracy, but not on deployability.
Best small open-source model: Qwen3-ASR 1.7B
This is probably the most practically interesting open-source result in the whole board.
Qwen3-ASR 1.7B lands at:
- 9.00% WER
- 4.40% M-WER
- 8.6% Drug M-WER
- about 6.8s/file on A10
That is a strong accuracy-to-cost tradeoff.
It is much faster than VibeVoice, much smaller, and still good enough on medical terms that I think a lot of people building local or semi-local clinical voice stacks will care more about this result than the #1 spot.
One important deployment caveat: Qwen3-ASR does not play nicely with T4. The model path wants newer attention support and ships in bf16, so A10 or better is the realistic target.
There was also a nasty long-audio bug in the default vLLM setup: Qwen3 would silently hang on longer files. The practical fix was:
max_num_batched_tokens=16384
That one-line change fixed it for us. Full notes are in the repo’s AGENTS.md.
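For context, in a vLLM offline setup the workaround looks roughly like this. max_num_batched_tokens and dtype are real vLLM engine arguments; the model id is illustrative, and this is a sketch of the fix rather than the repo's exact config:

```python
from vllm import LLM

# Workaround for the silent long-audio hang: raise the batched-token budget.
llm = LLM(
    model="Qwen/Qwen3-ASR-1.7B",   # illustrative model id
    dtype="bfloat16",              # model ships in bf16 (hence A10 or newer)
    max_num_batched_tokens=16384,  # the one-line fix described above
)
```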
Cloud APIs got serious this round
v3 was still mostly a Google / ElevenLabs / OpenAI / Mistral story.
v4 broadened that a lot:
- Soniox (#4) — impressive for a universal model without explicit medical specialization
- AssemblyAI Universal-3 Pro (#7) — very solid, especially with medical-v1
- Deepgram Nova-3 Medical (#9) — fastest serious cloud API in the top group
- Microsoft MAI-Transcribe-1 (#11) — weaker than I expected, but still competitive
Google still dominates the very top, but the broader takeaway is different:
the gap between strong cloud APIs and strong open-source models is now small enough that deployment constraints matter more than ever.
How M-WER is computed
The implementation is simple on purpose:
- Tag medically relevant words in the reference transcript
- Run normal WER alignment between reference and hypothesis
- Count substitutions / deletions / insertions only on those tagged medical tokens
- Compute:
- M-WER over all medical tokens
- Drug M-WER over the drug subset only
Current vocab:
- 179 medical terms
- 5 categories
- 464 drug-term occurrences in PriMock57
The vocabulary file is in evaluate/medical_terms_list.py and is easy to extend.
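The steps above can be sketched in a few dozen lines. This is a minimal illustration of the idea, not the repo's implementation: a word-level Levenshtein alignment, then errors counted only when they touch a tagged medical token (substitutions/deletions of a medical reference word, or insertions of a medical word). The toy vocabulary is made up:

```python
# Toy vocabulary for the sketch; the real list has 179 terms in 5 categories.
MEDICAL_VOCAB = {"amoxicillin", "asthma", "ibuprofen"}

def align(ref, hyp):
    """Word-level Levenshtein alignment; returns a list of (op, ref_word, hyp_word)."""
    n, m = len(ref), len(hyp)
    d = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        d[i][0] = i
    for j in range(m + 1):
        d[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,        # deletion
                          d[i][j - 1] + 1,        # insertion
                          d[i - 1][j - 1] + cost) # match / substitution
    # Backtrace the cheapest path into edit operations.
    ops, i, j = [], n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and d[i][j] == d[i - 1][j - 1] + (0 if ref[i - 1] == hyp[j - 1] else 1):
            ops.append(("match" if ref[i - 1] == hyp[j - 1] else "sub", ref[i - 1], hyp[j - 1]))
            i, j = i - 1, j - 1
        elif i > 0 and d[i][j] == d[i - 1][j] + 1:
            ops.append(("del", ref[i - 1], None))
            i -= 1
        else:
            ops.append(("ins", None, hyp[j - 1]))
            j -= 1
    return list(reversed(ops))

def medical_wer(ref_words, hyp_words, vocab=MEDICAL_VOCAB):
    """WER restricted to medically tagged tokens."""
    med_ref_count = sum(w in vocab for w in ref_words)
    errors = 0
    for op, r, h in align(ref_words, hyp_words):
        if op in ("sub", "del") and r in vocab:
            errors += 1
        elif op == "ins" and h in vocab:  # spurious medical word inserted
            errors += 1
    return errors / max(med_ref_count, 1)

ref = "patient takes amoxicillin for the infection".split()
hyp = "patient takes a moxy cillin for the infection".split()
print(medical_wer(ref, hyp))  # -> 1.0: the one medical term was mangled
```

Swapping `vocab` for the drug subset gives the Drug M-WER variant; everything else stays identical.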
Links
- GitHub: https://github.com/Omi-Health/medical-STT-eval
- Full 42-model leaderboard, evaluation code, per-file transcripts, and per-file metrics are all open-source
- Qwen3 long-audio debugging notes are documented in AGENTS.md
Happy to take questions, criticism on the metric design, or suggestions for v5.
fullouterjoin@reddit
wtf does this even mean
bambamlol@reddit
Thank you for the update!
Would you mind sharing what the actual costs for the API transcriptions were? Or did you already publish that somewhere and I simply can't find it?
nuclearbananana@reddit
For models that support it, do you provide a prompt/list of technical terms?
MajesticAd2862@reddit (OP)
Yes, I have a list of medical words categorized by type, which are used to calculate the M-WER. If that was your question.
nuclearbananana@reddit
I got that you use them for scoring, but I'm asking if you provide them as vocabulary to the ASR models to improve accuracy
MajesticAd2862@reddit (OP)
I tried this. For some LLM-based models, like VibeVoice, it works. But with others, like Parakeet, I saw it result in new hallucinations. Have to do more research into this.
bambamlol@reddit
Yes, please experiment with actual prompting in your next benchmark. That would be very interesting. I would expect that AssemblyAI's results, for example, could be improved significantly:
https://www.assemblyai.com/docs/pre-recorded-audio/keyterms-prompting
https://www.assemblyai.com/docs/pre-recorded-audio/universal-3-pro/prompting
Especially with keyterms prompting where you can "provide up to 1,000 words or phrases (maximum 6 words per phrase)". The output should be significantly better, otherwise, why would they even offer this feature?
Anyway, looking forward to any future updates, and thanks for sharing! :)
WhisperianCookie@reddit
nice work, gonna try to add qwen 1.7b to our android STT app
coder543@reddit
I think your implementation of MedASR must be broken. A 65% WER means that the harness is broken, not the model.
MajesticAd2862@reddit (OP)
I tried it multiple ways including Vertex AI API endpoint. Thing is MedASR is trained on dictation, while eval set is dialogue. Because all code is open source published, please prove me wrong!
coder543@reddit
I spitballed a few ideas with codex, and saw a measurable drop in all three WER metrics: https://github.com/Omi-Health/medical-STT-eval/pull/1
It still seems to be a shockingly terrible model for this application
MajesticAd2862@reddit (OP)
Thanks for this! I tried improved chunk-and-merge with Gemma 4, but didn't apply it to older models. Will have a look. But you're right, it will improve but stays terrible as a model
EffectiveCeilingFan@reddit
Google MedASR?
MajesticAd2862@reddit (OP)
It has been evaluated, but proved very bad in this eval set with WER > 65%. Partially because it’s mostly trained on dictation and not dialogue, and because it’s not optimized for long-form audio.
No_Fee_2726@reddit
faah, parakeet dropping from top tier to #31 just because of medical terms is a reality check haha. it really goes to show that general benchmarks are basically useless for niche industries. the drug m-wer metric is a genius move tbh. if a model misses a dosage or a medication name, the whole transcript is basically trash or worse, dangerous. great work on this.
MajesticAd2862@reddit (OP)
Exactly, it was kind of revealing for me as well. Thanks! Hope to continue with multilingual and more diverse eval sets.
BasaltLabs@reddit
Most benchmarks are gameable, as models are trained on a specific dataset anyway.
I'm trying to build a community-based benchmarking project for AIs that is non-LLM-judged.
https://github.com/Basaltlabs-app/Gauntlet it's small at the moment but I am working on it actively.
gfernandf@reddit
interesting!