I fine-tuned Cohere Transcribe to support diarization and timestamps

Posted by iamMess@reddit | LocalLLaMA | 25 comments

Hi, I'll keep it short: [Cohere-transcribe](https://huggingface.co/CohereLabs/cohere-transcribe-03-2026) is currently the best open-source speech-to-text model (and possibly even better than proprietary models). BUT it doesn't support diarization (speaker identification) or timestamps, even though there are tokens for them in the tokenizer. SO I trained the model to support both.

It follows the standard timestamp format. The output now looks like this:

`<|spltoken0|><|t:0.0|> Welcome back. <|t:1.5|><|spltoken1|><|t:1.5|> Thanks. <|t:2.4|>`

which is easy to parse.

The timestamps are accurate within 0.097 seconds on average, and 90% are within 0.006 seconds. The model supports up to 4 speakers per 30 seconds, and using the diarize\_long.py script, it could accurately identify up to 32 people.

It's [available for free on Hugging Face](https://huggingface.co/syvai/cohere-transcribe-diarize). Enjoy!
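
To illustrate how the token format in the post could be parsed, here is a minimal Python sketch. It assumes every segment has exactly the shape shown in the sample (a speaker token, a start timestamp, the text, then an end timestamp); the `Segment` type and `parse_transcript` function are hypothetical names for this example and are not part of the released model or its scripts.

```python
import re
from typing import NamedTuple


class Segment(NamedTuple):
    speaker: int   # index taken from <|spltokenN|>
    start: float   # seconds
    end: float     # seconds
    text: str


# Assumed segment shape: speaker token, start timestamp, text, end timestamp.
SEGMENT_RE = re.compile(
    r"<\|spltoken(\d+)\|>"       # speaker index
    r"<\|t:(\d+(?:\.\d+)?)\|>"   # start time in seconds
    r"(.*?)"                     # transcribed text (non-greedy)
    r"<\|t:(\d+(?:\.\d+)?)\|>",  # end time in seconds
    re.DOTALL,
)


def parse_transcript(raw: str) -> list[Segment]:
    """Split raw model output into per-speaker segments."""
    return [
        Segment(int(spk), float(start), float(end), text.strip())
        for spk, start, text, end in SEGMENT_RE.findall(raw)
    ]


sample = (
    "<|spltoken0|><|t:0.0|> Welcome back. <|t:1.5|>"
    "<|spltoken1|><|t:1.5|> Thanks. <|t:2.4|>"
)
segments = parse_transcript(sample)
for seg in segments:
    print(f"[{seg.start:.1f}-{seg.end:.1f}] speaker {seg.speaker}: {seg.text}")
# [0.0-1.5] speaker 0: Welcome back.
# [1.5-2.4] speaker 1: Thanks.
```

If real output ever contains several timestamp pairs inside one speaker turn, the pattern would need to be adapted; the sample in the post does not show that case.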

nick_frosst@reddit

Hey this is awesome! Thank you for doing this and sharing it!

Zealousideal-Land356@reddit

That’s amazing! Have been looking for a good solution for this

baap_42@reddit

Super interesting work. Did you happen to run any diarization-specific eval metrics on the finetuned model, like cpWER or tcpWER? Also curious how you approached the finetuning process for diarization support if you can share a brief overview.

therapy-cat@reddit

Oh wow. I might use this in something I'm building.

Embarrassed_Soup_279@reddit

Have you looked into Microsoft's VibeVoice ASR? As far as I know, it was one of the best models that supported speaker diarization.

iMakeSense@reddit

It eats hella VRAM. Not consumer-machine friendly. I tried running a quantized model on my Mac and it slowed to a crawl.

Schlick7@reddit

How does this compare to Parakeet? I see it's about 3 times the size, so I assume better quality but also worse performance.

iamMess@reddit (OP)

Parakeet is the fastest model out there right now. It has an RTF of 3300. This model has an RTF of 524.
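
To put those two numbers in perspective, here is a small worked example in Python. It assumes the figures are inverse real-time factors (RTFx: audio duration divided by processing time, so higher is faster), which is how speed is usually reported on the Open ASR Leaderboard; the comment does not say so explicitly. The helper name `seconds_to_transcribe` is made up for this sketch.

```python
def seconds_to_transcribe(audio_seconds: float, rtfx: float) -> float:
    """Processing time if the model runs `rtfx` times faster than real time."""
    return audio_seconds / rtfx


one_hour = 3600.0  # seconds of audio
for name, rtfx in [("Parakeet", 3300.0), ("this model", 524.0)]:
    print(f"{name}: {seconds_to_transcribe(one_hour, rtfx):.2f} s per hour of audio")
# Parakeet: 1.09 s per hour of audio
# this model: 6.87 s per hour of audio
```

Under that assumption, Parakeet is roughly 6.3 times faster (3300 / 524), but both models get through an hour of audio in seconds.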

KokaOP@reddit

It's very sensitive to noise; to use it effectively you will need noise removal and voice activity detection. If you are going to use Parakeet, try it using nano-parakeet; it's a PyTorch-only implementation with some effective speedups.

oxygen_addiction@reddit

So does this shine anywhere compared to Parakeet? Also thank you for sharing your work. I don't mean to sound ungrateful.

canadaduane@reddit

Check out https://huggingface.co/ibm-granite/granite-speech-4.1-2b-plus, currently best in class on the Open ASR Leaderboard (the "plus" variant adds speaker attribution, but I don't see a direct ranking). https://huggingface.co/spaces/hf-audio/open_asr_leaderboard

DeepWisdomGuy@reddit

Thank you for this! It will make creating speaker voice datasets a lot easier.

zxyzyxz@reddit

Why train it over using something like Pyannote or Nvidia NeMo?

iamMess@reddit (OP)

Single model for everything.

zxyzyxz@reddit

But it's more prone to hallucination than having a more deterministic diarizer

silenceimpaired@reddit

That’s exciting!

Accomplished_Ad9530@reddit

Nice. I’ve been looking into doing the same for ~16 speakers, though most diarization models top out at 4 and I only know of one that handles 8. Do you know if people are hitting a theoretical limit, or if it’s a training/data scaling issue?

iamMess@reddit (OP)

It's hard to find real training data that is well labelled for that many speakers. You can generate it synthetically, but it's not the same quality.

iamMess@reddit (OP)

It's a slight degradation but shouldn't be noticeable in real-world use.
