What Speaker Diarization tools should I look into?
Posted by Chemical_Gas3710@reddit | LocalLLaMA | View on Reddit | 7 comments
Hi,
I am making a tool that needs to analyze a conversation (non-English) between two people. The conversation is provided to me in audio format. I am currently using OpenAI Whisper to transcribe and feed the transcription to ChatGPT-4o model through the API for analysis.
So far, it's doing a fair job. Sometimes, though, when reading the transcription, I find it hard to figure out which speaker said what, and I have to listen to the audio to work it out. I am wondering if ChatGPT-4o also sometimes finds it hard to follow the conversation from the transcription alone. I think adding a speaker diarization step might make the transcription easier to understand and analyze.
I am looking for Speaker Diarization tools that I can use. I have tried using pyannote speaker-diarization-3.1, but I find it does not work very well. What are some other options that I can look at?
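For context, most Whisper + diarizer setups (including pyannote-based ones) work by assigning each timestamped transcript segment the speaker whose diarization turn overlaps it the most. Here is a minimal, self-contained sketch of just that alignment step — the segment and turn data are made up for illustration; in practice they would come from Whisper's output and your diarizer of choice:

```python
# Assign each transcript segment the speaker whose diarization
# turn overlaps it most. All times are in seconds.
# (Illustrative data shapes; real segments come from Whisper,
# real speaker turns from a diarizer such as pyannote.)

def overlap(a_start, a_end, b_start, b_end):
    """Length of the intersection of two time intervals (0 if disjoint)."""
    return max(0.0, min(a_end, b_end) - max(a_start, b_start))

def label_segments(segments, turns):
    """segments: [{'start', 'end', 'text'}]; turns: [{'start', 'end', 'speaker'}]."""
    labeled = []
    for seg in segments:
        # Score every speaker turn by how much it overlaps this segment.
        scores = [
            (overlap(seg["start"], seg["end"], t["start"], t["end"]), t["speaker"])
            for t in turns
        ]
        best_score, best_speaker = max(scores, default=(0.0, "UNKNOWN"))
        labeled.append({
            **seg,
            "speaker": best_speaker if best_score > 0 else "UNKNOWN",
        })
    return labeled

segments = [
    {"start": 0.0, "end": 2.0, "text": "Hello, how are you?"},
    {"start": 2.0, "end": 5.0, "text": "Fine, thanks for asking."},
]
turns = [
    {"start": 0.0, "end": 2.1, "speaker": "SPEAKER_00"},
    {"start": 2.1, "end": 5.0, "speaker": "SPEAKER_01"},
]
for seg in label_segments(segments, turns):
    print(f'{seg["speaker"]}: {seg["text"]}')
```

Once segments carry speaker labels like this, the transcript you feed to the ChatGPT-4o API becomes much easier for the model to follow, regardless of which diarizer produced the turns.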
LlamaDelRey10@reddit
I’ve actually seen a few issues with diarization from Whisper transcripts, so Whisper might actually be the issue here. As a first step, I’d suggest testing out some other transcription providers like ElevenLabs, Deepgram, or Speechmatics. I’ve found them to be more accurate for non-English languages.
Brunex666@reddit
Check out lamantin.io — I'm very familiar with it.
alexeir@reddit
You can use the Lingvanex speech-to-text tool; it has a quality diarization feature.
shammahllamma@reddit
Check out whisper-diarization by MahmoudAshraf: https://github.com/MahmoudAshraf97/whisper-diarization
https://github.com/transcriptionstream/transcriptionstream for a turnkey solution based on whisper-diarization
SupportiveBot2_25@reddit
I’ve tested a few options recently for diarization in real-time or streaming setups. Whisper can work, but diarization support is patchy and often needs external tooling (like PyAnnote).
If you’re looking for something that works out of the box and holds up in noisy conditions or multi-speaker overlap, I’d suggest trying Speechmatics. I’ve used it in a couple of projects and found the speaker labels to be consistently more reliable than what I got from Assembly or Azure. It also integrates cleanly with other voice agent stacks. Just make sure to tune the latency settings depending on your use case.
NotAReallyNormalName@reddit
Why not just let 4o handle that? It supports audio input directly. Gemini 2.5 Pro is much, much better at it, though.
Chemical_Gas3710@reddit (OP)
Hi,
I could let 4o do that, yes, but audio input is priced considerably higher than text input, so I was looking to keep costs down by sticking with text.