I benchmarked 42 STT models on medical audio with a new Medical WER metric — the leaderboard completely reshuffled
Posted by MajesticAd2862@reddit | LocalLLaMA | View on Reddit | 18 comments
TL;DR: I updated my medical speech-to-text benchmark to 42 models (up from 31 in v3) and added a new metric: Medical WER (M-WER).
Standard WER treats every word equally. In medical audio, that makes little sense — “yeah” and “amoxicillin” do not carry the same importance.
So for v4 I re-scored the benchmark using only clinically relevant words: drugs, conditions, symptoms, anatomy, and clinical procedures. I also broke out Drug M-WER separately, since medication names are where patient-safety risk gets real.
That change reshuffled the leaderboard hard.
A few notable results:
- VibeVoice-ASR 9B ranks #3 on M-WER and beats Microsoft’s own new closed MAI-Transcribe-1, which lands at #11
- Parakeet TDT 0.6B v3 drops from a strong overall-WER position to #31 on M-WER because of weak drug-name performance
- Qwen3-ASR 1.7B is the most interesting small local model this round: 4.40% M-WER and about 7s/file on A10
- Cloud APIs were stronger than I expected: Soniox, AssemblyAI Universal-3 Pro, and Deepgram Nova-3 Medical all ended up genuinely competitive
All code, transcripts, per-file metrics, and the full leaderboard are open-source on GitHub.
What changed since v3
1. New headline metric: Medical WER (M-WER)
Standard WER is still useful, but in a doctor-patient conversation it overweights the wrong things. A missed filler word and a missed medication name both count as one error, even though only one is likely to matter clinically.
So for v4 I added:
- M-WER = WER computed only over medically relevant reference tokens
- Drug M-WER = same idea, but restricted to drug names only
The current vocabulary covers 179 terms across 5 categories:
- drugs
- conditions
- symptoms
- anatomy
- clinical procedures
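For illustration, a categorized vocabulary like this can live in a plain Python module. The real 179-term list is in the repo's evaluate/medical_terms_list.py; the structure and entries below are just a hypothetical sketch of the idea:

```python
# Hypothetical shape of a categorized medical vocabulary module.
# The actual term list lives in evaluate/medical_terms_list.py in the repo.
MEDICAL_TERMS = {
    "drugs": ["amoxicillin", "ibuprofen", "metformin"],
    "conditions": ["asthma", "hypertension"],
    "symptoms": ["wheezing", "nausea"],
    "anatomy": ["abdomen", "larynx"],
    "clinical_procedures": ["biopsy", "intubation"],
}

# Flat sets used when tagging reference tokens.
ALL_TERMS = {t for terms in MEDICAL_TERMS.values() for t in terms}
DRUG_TERMS = set(MEDICAL_TERMS["drugs"])
```

Keeping the categories separate is what makes a drug-only breakdown like Drug M-WER a one-line filter rather than a second vocabulary.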
The reshuffle is real. Parakeet TDT 0.6B v3 looked great on normal WER in v3, but on M-WER it falls to #31, with 22% Drug M-WER. Great at conversational glue, much weaker on the words that actually carry clinical meaning.
2. 11 new models added (31 → 42)
This round added a bunch of new serious contenders:
- Soniox stt-async-v4 → #4 on M-WER
- AssemblyAI Universal-3 Pro (domain: medical-v1) → #7
- Deepgram Nova-3 Medical → #9
- Microsoft MAI-Transcribe-1 → #11
- Qwen3-ASR 1.7B → #8, best small open-source model this round
- Cohere Transcribe (Mar 2026) → #18, extremely fast
- Parakeet TDT 1.1B → #15
- Facebook MMS-1B-all → #42, dead last on this dataset
Also added a separate multi-speaker track with Multitalker Parakeet 0.6B using cpWER, since joint ASR + diarization is a different evaluation problem.
Top 20 by Medical WER
Dataset: PriMock57 — 55 doctor-patient consultations, ~80K words of British English medical dialogue.
| # | Model | WER | M-WER | Drug M-WER | Speed | Host |
|---|---|---|---|---|---|---|
| 1 | Google Gemini 3 Pro Preview | 8.35% | 2.65% | 3.1% | 64.5s | API |
| 2 | Google Gemini 2.5 Pro | 8.15% | 2.97% | 4.1% | 56.4s | API |
| 3 | VibeVoice-ASR 9B (Microsoft, open-source) | 8.34% | 3.16% | 5.6% | 96.7s | H100 |
| 4 | Soniox stt-async-v4 | 9.18% | 3.32% | 7.1% | 46.2s | API |
| 5 | Google Gemini 3 Flash Preview | 11.33% | 3.64% | 5.2% | 51.5s | API |
| 6 | ElevenLabs Scribe v2 | 9.72% | 3.86% | 4.3% | 43.5s | API |
| 7 | AssemblyAI Universal-3 Pro (medical-v1) | 9.55% | 4.02% | 6.5% | 37.3s | API |
| 8 | Qwen3 ASR 1.7B (open-source) | 9.00% | 4.40% | 8.6% | 6.8s | A10 |
| 9 | Deepgram Nova-3 Medical | 9.05% | 4.53% | 9.7% | 12.9s | API |
| 10 | OpenAI GPT-4o Mini Transcribe (Dec '25) | 11.18% | 4.85% | 10.6% | 40.4s | API |
| 11 | Microsoft MAI-Transcribe-1 | 11.52% | 4.85% | 11.2% | 21.8s | API |
| 12 | ElevenLabs Scribe v1 | 10.87% | 4.88% | 7.5% | 36.3s | API |
| 13 | Google Gemini 2.5 Flash | 9.45% | 5.01% | 10.3% | 20.2s | API |
| 14 | Voxtral Mini Transcribe V1 | 11.85% | 5.17% | 11.0% | 22.4s | API |
| 15 | Parakeet TDT 1.1B | 9.03% | 5.20% | 15.5% | 12.3s | T4 |
| 16 | Voxtral Mini Transcribe V2 | 11.64% | 5.36% | 12.1% | 18.4s | API |
| 17 | Voxtral Mini 4B Realtime | 11.89% | 5.39% | 11.8% | 270.9s | A10 |
| 18 | Cohere Transcribe (Mar 2026) | 11.81% | 5.59% | 16.6% | 3.9s | A10 |
| 19 | OpenAI Whisper-1 | 13.20% | 5.62% | 10.3% | 104.3s | API |
| 20 | Groq Whisper Large v3 Turbo | 12.14% | 5.75% | 14.4% | 8.0s | API |
Full 42-model leaderboard on GitHub.
The funny part: Microsoft vs Microsoft
Microsoft now has two visible STT offerings in this benchmark:
- VibeVoice-ASR 9B — open-source, from Microsoft Research
- MAI-Transcribe-1 — closed, newly shipped by Microsoft's new SuperIntelligence team, available through Azure Foundry
And on the metric that actually matters for medical voice, the open model wins clearly:
- VibeVoice-ASR 9B → #3, 3.16% M-WER
- MAI-Transcribe-1 → #11, 4.85% M-WER
So Microsoft’s own open-source release beats Microsoft’s flagship closed STT product by:
- 1.7 absolute points of M-WER
- 5.6 absolute points of Drug M-WER
VibeVoice is very good, but it is also heavy: 9B params, long inference, and we ran it on H100 96GB. So it wins on contextual medical accuracy, but not on deployability.
Best small open-source model: Qwen3-ASR 1.7B
This is probably the most practically interesting open-source result in the whole board.
Qwen3-ASR 1.7B lands at:
- 9.00% WER
- 4.40% M-WER
- 8.6% Drug M-WER
- about 6.8s/file on A10
That is a strong accuracy-to-cost tradeoff.
It is much faster than VibeVoice, much smaller, and still good enough on medical terms that I think a lot of people building local or semi-local clinical voice stacks will care more about this result than the #1 spot.
One important deployment caveat: Qwen3-ASR does not play nicely with T4. The model path wants newer attention support and ships in bf16, so A10 or better is the realistic target.
There was also a nasty long-audio bug in the default vLLM setup: Qwen3 would silently hang on longer files. The practical fix was:
max_num_batched_tokens=16384
That one-line change fixed it for us. Full notes are in the repo’s AGENTS.md.
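For context, in a vLLM offline setup the workaround looks roughly like this. max_num_batched_tokens and dtype are real vLLM engine arguments; the model id is illustrative, and this is a sketch of the fix rather than the repo's exact config:

```python
from vllm import LLM

# Workaround for the silent long-audio hang: raise the batched-token budget.
llm = LLM(
    model="Qwen/Qwen3-ASR-1.7B",   # illustrative model id
    dtype="bfloat16",              # model ships in bf16 (hence A10 or newer)
    max_num_batched_tokens=16384,  # the one-line fix described above
)
```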
Cloud APIs got serious this round
v3 was still mostly a Google / ElevenLabs / OpenAI / Mistral story.
v4 broadened that a lot:
- Soniox (#4) — impressive for a universal model without explicit medical specialization
- AssemblyAI Universal-3 Pro (#7) — very solid, especially with medical-v1
- Deepgram Nova-3 Medical (#9) — fastest serious cloud API in the top group
- Microsoft MAI-Transcribe-1 (#11) — weaker than I expected, but still competitive
Google still dominates the very top, but the broader takeaway is different:
the gap between strong cloud APIs and strong open-source models is now small enough that deployment constraints matter more than ever.
How M-WER is computed
The implementation is simple on purpose:
- Tag medically relevant words in the reference transcript
- Run normal WER alignment between reference and hypothesis
- Count substitutions / deletions / insertions only on those tagged medical tokens
- Compute:
- M-WER over all medical tokens
- Drug M-WER over the drug subset only
Current vocab:
- 179 medical terms
- 5 categories
- 464 drug-term occurrences in PriMock57
The vocabulary file is in evaluate/medical_terms_list.py and is easy to extend.
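The steps above can be sketched in a few dozen lines. This is a minimal illustration of the idea, not the repo's implementation: a word-level Levenshtein alignment, then errors counted only when they touch a tagged medical token (substitutions/deletions of a medical reference word, or insertions of a medical word). The toy vocabulary is made up:

```python
# Toy vocabulary for the sketch; the real list has 179 terms in 5 categories.
MEDICAL_VOCAB = {"amoxicillin", "asthma", "ibuprofen"}

def align(ref, hyp):
    """Word-level Levenshtein alignment; returns a list of (op, ref_word, hyp_word)."""
    n, m = len(ref), len(hyp)
    d = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        d[i][0] = i
    for j in range(m + 1):
        d[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,        # deletion
                          d[i][j - 1] + 1,        # insertion
                          d[i - 1][j - 1] + cost) # match / substitution
    # Backtrace the cheapest path into edit operations.
    ops, i, j = [], n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and d[i][j] == d[i - 1][j - 1] + (0 if ref[i - 1] == hyp[j - 1] else 1):
            ops.append(("match" if ref[i - 1] == hyp[j - 1] else "sub", ref[i - 1], hyp[j - 1]))
            i, j = i - 1, j - 1
        elif i > 0 and d[i][j] == d[i - 1][j] + 1:
            ops.append(("del", ref[i - 1], None))
            i -= 1
        else:
            ops.append(("ins", None, hyp[j - 1]))
            j -= 1
    return list(reversed(ops))

def medical_wer(ref_words, hyp_words, vocab=MEDICAL_VOCAB):
    """WER restricted to medically tagged tokens."""
    med_ref_count = sum(w in vocab for w in ref_words)
    errors = 0
    for op, r, h in align(ref_words, hyp_words):
        if op in ("sub", "del") and r in vocab:
            errors += 1
        elif op == "ins" and h in vocab:  # spurious medical word inserted
            errors += 1
    return errors / max(med_ref_count, 1)

ref = "patient takes amoxicillin for the infection".split()
hyp = "patient takes a moxy cillin for the infection".split()
print(medical_wer(ref, hyp))  # -> 1.0: the one medical term was mangled
```

Swapping `vocab` for the drug subset gives the Drug M-WER variant; everything else stays identical.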
Links
- GitHub: https://github.com/Omi-Health/medical-STT-eval
- Full 42-model leaderboard, evaluation code, per-file transcripts, and per-file metrics are all open-source
- Qwen3 long-audio debugging notes are documented in AGENTS.md
Happy to take questions, criticism on the metric design, or suggestions for v5.
fullouterjoin@reddit
wtf does this even mean
bambamlol@reddit
Thank you for the update!
Would you mind sharing what the actual costs for the API transcriptions were? Or did you already publish that somewhere and I simply can't find it?
nuclearbananana@reddit
For models that support it, do you provide a prompt/list of technical terms?
MajesticAd2862@reddit (OP)
Yes, I have a list of medical words categorized by type, which are used to calculate the M-WER. If that was your question.
nuclearbananana@reddit
I got that you use them for scoring, but I'm asking if you provide them as vocabulary to the ASR models to improve accuracy
MajesticAd2862@reddit (OP)
I tried this. For some LLM-based models, like VibeVoice, it works. But with others, like Parakeet, I saw it result in new hallucinations. Have to do more research into this.
bambamlol@reddit
Yes, please experiment with actual prompting in your next benchmark. That would be very interesting. I would expect that AssemblyAI's results, for example, could be improved significantly:
https://www.assemblyai.com/docs/pre-recorded-audio/keyterms-prompting
https://www.assemblyai.com/docs/pre-recorded-audio/universal-3-pro/prompting
Especially with keyterms prompting where you can "provide up to 1,000 words or phrases (maximum 6 words per phrase)". The output should be significantly better, otherwise, why would they even offer this feature?
Anyway, looking forward to any future updates, and thanks for sharing! :)
WhisperianCookie@reddit
nice work, gonna try to add qwen 1.7b to our android STT app
coder543@reddit
I think your implementation of MedASR must be broken. A 65% WER means that the harness is broken, not the model.
MajesticAd2862@reddit (OP)
I tried it multiple ways including Vertex AI API endpoint. Thing is MedASR is trained on dictation, while eval set is dialogue. Because all code is open source published, please prove me wrong!
coder543@reddit
I spitballed a few ideas with codex, and saw a measurable drop in all three WER metrics: https://github.com/Omi-Health/medical-STT-eval/pull/1
It still seems to be a shockingly terrible model for this application
MajesticAd2862@reddit (OP)
Thanks for this! I tried improved chunk-and-merge with Gemma 4, but didn't apply it to older models. Will have a look. But you're right, it will improve but stays terrible as a model
EffectiveCeilingFan@reddit
Google MedASR?
MajesticAd2862@reddit (OP)
It has been evaluated, but proved very bad in this eval set with WER > 65%. Partially because it’s mostly trained on dictation and not dialogue, and because it’s not optimized for long-form audio.
No_Fee_2726@reddit
faah, parakeet dropping from top tier to #31 just because of medical terms is a reality check haha. it really goes to show that general benchmarks are basically useless for niche industries. the drug m-wer metric is a genius move tbh. if a model misses a dosage or a medication name, the whole transcript is basically trash or worse, dangerous. great work on this.
MajesticAd2862@reddit (OP)
Exactly, it was kind of revealing for me as well. Thanks! Hope to continue with multilingual and more diverse eval sets.
BasaltLabs@reddit
Most benchmarks are gameable, as models are trained on a specific dataset anyway.
I'm trying to build a community-based benchmarking project for AIs that is non-LLM-judged.
https://github.com/Basaltlabs-app/Gauntlet it's small at the moment but I am working on it actively.
gfernandf@reddit
interesting!