Is there a whisper-like speech-to-text that detects the speaker?

Question

I found some commercial (expensive) offerings that do this, but there doesn't seem to be an open-source way to categorise the output of whisper by speaker/source. I'm thinking of this for podcast analysis purposes.

networked · Accepted Answer

whisper.cpp supports a model with "speaker segmentation" or "local diarization". It is called "local" because it doesn't name the distinct speakers; it only tells you when the speaker changes. See https://github.com/ggerganov/whisper.cpp/issues/1715#issueco.... Once you compile whisper.cpp and download the model, run `main` with that model and the option `-tdrz true`.
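Since the output only marks where the speaker changes, you will probably want to post-process it for podcast analysis. Here is a rough Python sketch of that step. It assumes the transcript marks a change by appending `[SPEAKER_TURN]` to the segment that precedes it, and that lines may start with a `[hh:mm:ss.mmm --> hh:mm:ss.mmm]` timestamp; check both against what your build actually prints. `group_turns` is a made-up helper name, not part of whisper.cpp.

```python
import re

# Assumption: the speaker-change marker is the literal text "[SPEAKER_TURN]",
# appended to the segment after which the speaker changes.
TURN_MARKER = "[SPEAKER_TURN]"

# Assumption: lines may start with a timestamp such as
# "[00:00:00.000 --> 00:00:03.800]"; it is stripped if present.
TIMESTAMP = re.compile(
    r"^\[\d{2}:\d{2}:\d{2}\.\d{3} --> \d{2}:\d{2}:\d{2}\.\d{3}\]\s*"
)


def group_turns(lines):
    """Group transcript segments into consecutive, unnamed speaker turns."""
    turns = []
    current = []
    for line in lines:
        text = TIMESTAMP.sub("", line.strip())
        if not text:
            continue
        # Simplification: a marker anywhere in the segment ends the turn
        # at the end of that segment.
        ends_turn = TURN_MARKER in text
        text = text.replace(TURN_MARKER, "").strip()
        if text:
            current.append(text)
        if ends_turn and current:
            turns.append(" ".join(current))
            current = []
    if current:
        turns.append(" ".join(current))
    return turns
```

Calling `group_turns(open("transcript.txt"))` on a saved transcript (the filename is just an example) gives you one string per turn. You still only know that turn 2 is a different speaker from turn 1, not who either of them is, so matching turns to named speakers needs a full diarization tool.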

sfmz · Answer

"Diarization" is your search term, e.g. https://github.com/MahmoudAshraf97/whisper-diarization