The process of partitioning an audio recording into segments based on speaker identity, answering the question 'who spoke when?' Speaker diarization is essential for indexing conversations, meetings, and interviews in multimodal content processing.
Speaker diarization systems segment audio into speaker-homogeneous regions through a pipeline of voice activity detection, speaker embedding extraction, and clustering. First, non-speech regions are filtered out. Then, short speech segments are encoded into fixed-dimensional speaker embeddings that capture voice characteristics. Finally, clustering groups segments with similar embeddings under a common speaker label, as the sketch below illustrates.
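A minimal sketch of the clustering stage, assuming speaker embeddings have already been extracted for each detected speech segment. The synthetic embeddings, segment times, and the 0.7 cosine-distance threshold are all illustrative placeholders (scikit-learn >= 1.2 is assumed for the `metric` parameter):

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering

rng = np.random.default_rng(0)

# Pretend we have 6 speech segments from 2 speakers; each row stands in
# for an L2-normalized speaker embedding (real systems use ~192-512 dims).
speaker_a = rng.normal(0.0, 0.1, size=(3, 192)) + 1.0
speaker_b = rng.normal(0.0, 0.1, size=(3, 192)) - 1.0
embeddings = np.vstack([speaker_a, speaker_b])
embeddings /= np.linalg.norm(embeddings, axis=1, keepdims=True)

# Illustrative (start, end) times in seconds for each segment.
segments = [(0.0, 1.2), (1.4, 2.0), (2.3, 3.1),
            (3.5, 4.0), (4.2, 5.0), (5.3, 6.1)]

# Agglomerative clustering with a cosine-distance threshold groups
# segments by speaker without knowing the speaker count in advance.
clustering = AgglomerativeClustering(
    n_clusters=None,
    distance_threshold=0.7,  # illustrative; tuned per system in practice
    metric="cosine",
    linkage="average",
)
labels = clustering.fit_predict(embeddings)

for (start, end), label in zip(segments, labels):
    print(f"{start:4.1f}s - {end:4.1f}s  speaker_{label}")
```

Thresholded agglomerative clustering is a common choice here precisely because the number of speakers is usually unknown in advance; spectral clustering variants instead estimate the speaker count from the eigenvalues of a segment-similarity matrix.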
Modern systems pair neural speaker embeddings (x-vectors, ECAPA-TDNN) with spectral or agglomerative clustering. End-to-end neural diarization (EEND) models replace this pipeline with a single network that performs segmentation and speaker attribution jointly, which also lets them handle overlapping speech. pyannote.audio is a popular framework providing pretrained diarization pipelines. Performance is measured by Diarization Error Rate (DER), which sums missed speech, false alarm, and speaker confusion time as a fraction of total reference speech time.
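A sketch of running a pretrained pyannote.audio pipeline, following the pattern in its documentation. The checkpoint name, file path, and token placeholder below are illustrative; gated pyannote models require accepting their terms and supplying a Hugging Face access token:

```python
from pyannote.audio import Pipeline

# Load a pretrained diarization pipeline from the Hugging Face Hub.
pipeline = Pipeline.from_pretrained(
    "pyannote/speaker-diarization-3.1",
    use_auth_token="YOUR_HF_TOKEN",  # placeholder token
)

# Run the full VAD -> embedding -> clustering pipeline on a local file.
diarization = pipeline("meeting.wav")  # illustrative path

# Iterate over speaker turns: each yields a time interval and a label.
for turn, _, speaker in diarization.itertracks(yield_label=True):
    print(f"{turn.start:5.1f}s - {turn.end:5.1f}s  {speaker}")
```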
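As a worked example of how DER combines its three error types, consider the following arithmetic (all durations are invented for illustration):

```python
# Hypothetical error durations for a recording with 60 s of
# reference speech; the values are made up for illustration.
missed_speech = 3.0   # seconds of speech the system failed to detect
false_alarm = 2.0     # seconds of non-speech labeled as speech
confusion = 5.0       # seconds of speech assigned to the wrong speaker
total_speech = 60.0   # total reference speech duration

# DER = (missed + false alarm + confusion) / total reference speech
der = (missed_speech + false_alarm + confusion) / total_speech
print(f"DER = {der:.1%}")  # -> DER = 16.7%
```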