What Speaker Diarization tools should I look into?
Posted by Chemical_Gas3710@reddit | LocalLLaMA | View on Reddit | 7 comments
Hi,
I am making a tool that needs to analyze a conversation (non-English) between two people. The conversation is provided to me in audio format. I am currently using OpenAI Whisper to transcribe and feed the transcription to ChatGPT-4o model through the API for analysis.
So far, it's doing a fair job. Sometimes, though, when reading the transcription, I find it hard to figure out which speaker said what, and I have to listen to the audio to work it out. I am wondering if ChatGPT-4o also sometimes finds it hard to follow the conversation from the transcription alone. I think adding a speaker diarization step might make the transcription easier to understand and analyze.
I am looking for Speaker Diarization tools that I can use. I have tried using pyannote speaker-diarization-3.1, but I find it does not work very well. What are some other options that I can look at?
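For context, most Whisper + diarizer setups (including pyannote-based ones) work by assigning each timestamped transcript segment the speaker whose diarization turn overlaps it the most. Here is a minimal, self-contained sketch of just that alignment step — the segment and turn data are made up for illustration; in practice they would come from Whisper's output and your diarizer of choice:

```python
# Assign each transcript segment the speaker whose diarization
# turn overlaps it most. All times are in seconds.
# (Illustrative data shapes; real segments come from Whisper,
# real speaker turns from a diarizer such as pyannote.)

def overlap(a_start, a_end, b_start, b_end):
    """Length of the intersection of two time intervals (0 if disjoint)."""
    return max(0.0, min(a_end, b_end) - max(a_start, b_start))

def label_segments(segments, turns):
    """segments: [{'start', 'end', 'text'}]; turns: [{'start', 'end', 'speaker'}]."""
    labeled = []
    for seg in segments:
        # Score every speaker turn by how much it overlaps this segment.
        scores = [
            (overlap(seg["start"], seg["end"], t["start"], t["end"]), t["speaker"])
            for t in turns
        ]
        best_score, best_speaker = max(scores, default=(0.0, "UNKNOWN"))
        labeled.append({
            **seg,
            "speaker": best_speaker if best_score > 0 else "UNKNOWN",
        })
    return labeled

segments = [
    {"start": 0.0, "end": 2.0, "text": "Hello, how are you?"},
    {"start": 2.0, "end": 5.0, "text": "Fine, thanks for asking."},
]
turns = [
    {"start": 0.0, "end": 2.1, "speaker": "SPEAKER_00"},
    {"start": 2.1, "end": 5.0, "speaker": "SPEAKER_01"},
]
for seg in label_segments(segments, turns):
    print(f'{seg["speaker"]}: {seg["text"]}')
```

Once segments carry speaker labels like this, the transcript you feed to the ChatGPT-4o API becomes much easier for the model to follow, regardless of which diarizer produced the turns.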
LlamaDelRey10@reddit
I’ve actually seen a few issues with diarization from Whisper transcripts, so Whisper might actually be the issue here. As a first step, I’d suggest testing out some other transcription providers like ElevenLabs, Deepgram, or Speechmatics. I’ve found them to be more accurate for non-English languages.
Brunex666@reddit
Check out lamantin.io — I'm very familiar with it.
alexeir@reddit
You can use the Lingvanex speech-to-text tool; it has a quality diarization feature.
shammahllamma@reddit
Check out whisper-diarization by MahmoudAshraf: https://github.com/MahmoudAshraf97/whisper-diarization
https://github.com/transcriptionstream/transcriptionstream for a turnkey solution based on whisper-diarization
SupportiveBot2_25@reddit
I’ve tested a few options recently for diarization in real-time or streaming setups. Whisper can work, but diarization support is patchy and often needs external tooling (like PyAnnote).
If you’re looking for something that works out of the box and holds up in noisy conditions or multi-speaker overlap, I’d suggest trying Speechmatics. I’ve used it in a couple of projects and found the speaker labels to be consistently more reliable than what I got from Assembly or Azure. It also integrates cleanly with other voice agent stacks. Just make sure to tune the latency settings depending on your use case.
NotAReallyNormalName@reddit
Why not just let 4o handle that? It supports audio input directly. Gemini 2.5 Pro is much, much better at it, though.
Chemical_Gas3710@reddit (OP)
Hi,
I could let 4o do that, yes, but audio input is priced considerably higher than text input, so I was looking to keep costs down by sticking with text.