I fine-tuned Cohere Transcribe to support diarization and timestamps
Posted by iamMess@reddit | LocalLLaMA | 25 comments
Hi
I'll keep it short:
[Cohere-transcribe](https://huggingface.co/CohereLabs/cohere-transcribe-03-2026) is currently the best open-source speech-to-text model (and possibly even better than some proprietary models).
BUT it doesn't support diarization (speaker identification) or timestamps, even though the tokenizer has tokens for them.
SO I trained the model to support both. The output follows the standard timestamp format.
The output now looks like this:
<|spltoken0|><|t:0.0|> Welcome back. <|t:1.5|><|spltoken1|><|t:1.5|> Thanks. <|t:2.4|>
Which is an easily parsable format.
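To illustrate how parsable it is, here is a minimal Python sketch that splits the output into speaker segments. It is not part of the released model or scripts. The names `Segment`, `SEGMENT_RE` and `parse_transcript` are made up for this example, and it assumes every segment has exactly the shape shown above: one speaker token, a start timestamp, the text, then an end timestamp.

```python
import re
from dataclasses import dataclass

# Assumed segment shape, taken from the sample output above:
#   <|spltokenN|><|t:START|> text <|t:END|>
SEGMENT_RE = re.compile(
    r"<\|spltoken(\d+)\|>"       # speaker token, e.g. <|spltoken0|>
    r"<\|t:(\d+(?:\.\d+)?)\|>"   # start time in seconds
    r"(.*?)"                     # spoken text (non-greedy)
    r"<\|t:(\d+(?:\.\d+)?)\|>",  # end time in seconds
    re.DOTALL,
)


@dataclass
class Segment:
    speaker: int   # index taken from the spltoken number
    start: float   # seconds
    end: float     # seconds
    text: str


def parse_transcript(output: str) -> list[Segment]:
    """Turn the raw model output into a list of speaker segments."""
    return [
        Segment(
            speaker=int(m.group(1)),
            start=float(m.group(2)),
            end=float(m.group(4)),
            text=m.group(3).strip(),
        )
        for m in SEGMENT_RE.finditer(output)
    ]


if __name__ == "__main__":
    sample = (
        "<|spltoken0|><|t:0.0|> Welcome back. <|t:1.5|>"
        "<|spltoken1|><|t:1.5|> Thanks. <|t:2.4|>"
    )
    for seg in parse_transcript(sample):
        print(f"[{seg.start:.1f}-{seg.end:.1f}] speaker {seg.speaker}: {seg.text}")
```

If the model ever emits several timestamp pairs after a single speaker token, this regex would only catch the first pair, so treat it as a starting point.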
The timestamps are accurate within 0.097 seconds on average, and 90% are within 0.006 seconds.
The model supports up to 4 speakers per 30 seconds of audio, and with the diarize\_long.py script it can accurately identify up to 32 people.
It's [available for free on Hugging Face](https://huggingface.co/syvai/cohere-transcribe-diarize).
Enjoy!