HACKER Q&A
📣 authorfly

Is there a whisper-like speech-to-text that detects the speaker?


I found some commercial (expensive) offerings that do this, but there doesn't seem to be an open-source way to categorise Whisper's output by speaker/source.

Thinking of this for podcast analysis purposes.


  👤 networked Accepted Answer ✓
whisper.cpp supports a model with "speaker segmentation" or "local diarization". It is called "local" because it doesn't name the distinct speakers; it only tells you when the speaker changes. See https://github.com/ggerganov/whisper.cpp/issues/1715#issueco.... Once you compile whisper.cpp and download the model, run `main` with that model and the `-tdrz` option (it is a switch, so it takes no value).
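
Since the model only marks where the speaker changes, you still have to cut the transcript into turns yourself for something like podcast analysis. Here is a rough Python sketch of that step. It assumes the transcript lines look like `[00:00:00.000 --> 00:00:03.800]   some text [SPEAKER_TURN]`, i.e. that the turn marker is the literal string `[SPEAKER_TURN]`; check this against your own output. `split_turns` is a made-up helper name, not part of whisper.cpp.

```python
import re

# Assumed marker for a speaker change; verify against your own transcript.
SPEAKER_TURN = "[SPEAKER_TURN]"

# Assumed leading timestamp, e.g. "[00:00:00.000 --> 00:00:03.800]".
TIMESTAMP = re.compile(
    r"^\[\d{2}:\d{2}:\d{2}\.\d{3} --> \d{2}:\d{2}:\d{2}\.\d{3}\]\s*"
)


def split_turns(transcript: str) -> list[str]:
    """Split transcript text into consecutive speaker turns.

    The turns are only ordered, not attributed: turn 1 and turn 3
    may or may not be the same person.
    """
    # Drop the timestamp prefix from every line, then join into one string.
    lines = [TIMESTAMP.sub("", line.strip()) for line in transcript.splitlines()]
    text = " ".join(line for line in lines if line)
    # Cut at every marker and normalise whitespace inside each turn.
    return [" ".join(part.split()) for part in text.split(SPEAKER_TURN) if part.strip()]


if __name__ == "__main__":
    sample = (
        "[00:00:00.000 --> 00:00:03.800]   Okay Houston, we've had a problem here. [SPEAKER_TURN]\n"
        "[00:00:03.800 --> 00:00:06.200]   This is Houston. Say again please. [SPEAKER_TURN]\n"
        "[00:00:06.200 --> 00:00:10.000]   Uh Houston we've had a problem.\n"
    )
    for number, turn in enumerate(split_turns(sample), start=1):
        print(f"turn {number}: {turn}")
```

To get actual speaker labels (speaker A vs. speaker B) you would need a separate clustering or full diarization step on top of this.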

👤 sfmz
"Diarization" is your search term, e.g. https://github.com/MahmoudAshraf97/whisper-diarization