Why Transcription Alone Isn't Enough
An AI agent that can transcribe audio knows WHAT was said. But for most real-world applications -- meetings, interviews, podcasts, courtroom recordings, customer service calls -- knowing what was said is only half the problem. The other half is knowing WHO said it.
A raw transcript of a meeting looks like this:
We should move the launch to Q3. I disagree, the market window
closes in June. What if we do a soft launch in May? That could work
if engineering can hit the deadline.
A diarized transcript looks like this:
[Speaker A, 00:14] We should move the launch to Q3.
[Speaker B, 00:18] I disagree, the market window closes in June.
[Speaker C, 00:22] What if we do a soft launch in May?
[Speaker A, 00:25] That could work if engineering can hit the deadline.
The second version is searchable in fundamentally different ways. Once the speaker labels are mapped to names, you can answer "what did the VP of Marketing say about the timeline?" or "find all meetings where the CTO and the designer discussed the same feature." Without diarization, these queries are impossible because the transcript has no record of who spoke.
Speaker diarization is the process of partitioning an audio stream into segments labeled by speaker identity. It answers the question: "who spoke when?" The output is a set of time-stamped speaker labels that can be aligned with a transcript to produce speaker-attributed text.
The Diarization Pipeline
Modern diarization systems follow a four-stage pipeline. Each stage can be implemented with different algorithms, but the overall structure is consistent across production systems.
Stage 1: Voice Activity Detection (VAD)
Before identifying speakers, the system needs to know WHEN anyone is speaking at all. Voice Activity Detection separates speech regions from silence, music, background noise, and other non-speech audio.
The simplest VAD approaches use energy thresholds -- if the signal power exceeds a threshold, mark it as speech. This works in clean audio but fails in noisy environments where background noise has comparable energy to speech.
Modern VAD uses small neural networks trained to classify each audio frame as speech or non-speech. Silero VAD (used in many production systems; pyannote ships its own neural segmentation model for this step) is a roughly 2MB ONNX model that processes audio at about 500x real-time on CPU. It classifies 30ms frames as speech or non-speech with about 95% accuracy on noisy audio.
VAD output is a set of speech segments with start and end timestamps:
speech: [(0.5, 3.2), (3.8, 7.1), (8.0, 12.5), (13.1, 15.8)]
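The energy-threshold approach described earlier can be sketched in a few lines. This is a minimal illustration rather than a production VAD: the function name `energy_vad`, the 30ms frame size, and the -35 dB threshold are assumptions chosen for the example. The output uses the same (start, end) format shown above.

```python
import math


def energy_vad(samples, sample_rate, frame_ms=30, threshold_db=-35.0):
    """Return (start, end) speech segments, in seconds, via an energy threshold.

    `samples` is a sequence of floats in [-1.0, 1.0]. The frame size and
    threshold are illustrative defaults, not tuned values.
    """
    frame_len = int(sample_rate * frame_ms / 1000)
    n_frames = len(samples) // frame_len
    segments = []
    start = None  # start time of the speech run currently open, if any

    for i in range(n_frames):
        frame = samples[i * frame_len:(i + 1) * frame_len]
        power = sum(x * x for x in frame) / frame_len
        level_db = 10 * math.log10(power + 1e-12)  # epsilon avoids log(0)
        is_speech = level_db > threshold_db
        t = i * frame_len / sample_rate

        if is_speech and start is None:
            start = t                      # speech run begins
        elif not is_speech and start is not None:
            segments.append((start, t))    # speech run ends
            start = None

    if start is not None:                  # audio ended while still in speech
        segments.append((start, n_frames * frame_len / sample_rate))
    return segments
```

Because the decision depends only on signal power, any loud non-speech sound (a door slam, music) is marked as speech, which is the failure mode that motivates the neural approach.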
Stage 2: Speaker Embedding Extraction
Once speech regions are identified, the system extracts a fixed-dimensional vector (embedding) for each segment that captures the speaker's vocal characteristics -- pitch, timbre, speaking rate, vocal tract resonances.
The dominant architecture for speaker embeddings is the ECAPA-TDNN (Emphasized Channel Attention, Propagation and Aggregation in Time Delay Neural Network). It processes mel-spectrogram features through 1D convolutional layers with multi-scale channel attention, producing a 192 or 256-dimensional embedding per segment.
Key properties of speaker embeddings:
The embedding extraction typically operates on short segments (1.5-3 seconds) with a sliding window. Shorter segments are more likely to contain a single speaker but provide less acoustic evidence. Longer segments capture more speaker information but risk spanning speaker turns.
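The sliding-window step can be sketched as follows. This is an illustrative helper, not taken from any particular library: the name `sliding_windows`, the 2.0 second window, the 0.5 second step, and the 0.5 second minimum length are all assumptions. It turns the VAD speech regions into the short windows that an embedding model would then process.

```python
def sliding_windows(speech_segments, window=2.0, step=0.5, min_length=0.5):
    """Split (start, end) speech regions into overlapping fixed-size windows.

    Regions shorter than `window` are kept whole if they last at least
    `min_length` seconds, and dropped otherwise.
    """
    windows = []
    for start, end in speech_segments:
        duration = end - start
        if duration < min_length:
            continue                      # too short to embed reliably
        if duration <= window:
            windows.append((start, end))  # keep short regions as one window
            continue

        # Number of full steps that still fit a whole window inside the region.
        n_steps = int((duration - window) / step + 1e-9)
        for i in range(n_steps + 1):
            w_start = start + i * step
            windows.append((w_start, w_start + window))

        # Cover the tail if the last window stopped short of the region end.
        if windows[-1][1] < end - 1e-9:
            windows.append((end - window, end))
    return windows
```

A smaller `step` yields more windows per region and finer time resolution for speaker changes, at the cost of more embedding computations.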
Stage 3: Clustering
With a set of speaker embeddings (one per short segment), the system needs to determine which segments belong to the same speaker. This is a clustering problem: group the embeddings into K clusters, where K is the number of distinct speakers.
Spectral clustering is the standard approach for diarization. The algorithm:
1. Compute a similarity matrix S where S[i,j] = cosine_similarity(embedding_i, embedding_j)
2. Build a graph Laplacian from the similarity matrix
3. Compute the eigenvalues and eigenvectors of the Laplacian
4. Estimate the number of speakers K from the eigenvalue gap (the largest jump between consecutive eigenvalues, sorted in ascending order)
5. Run K-means on the eigenvectors belonging to the K smallest eigenvalues to assign each segment to a speaker
The eigenvalue gap heuristic for estimating speaker count works because the Laplacian's eigenvalue spectrum reflects the block structure of the similarity matrix. If there are 3 distinct speakers, the similarity matrix has 3 dense blocks (within-speaker similarities are high) separated by sparse regions (between-speaker similarities are low). This produces 3 small eigenvalues followed by a gap.
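The steps above can be sketched with NumPy. This is a simplified illustration under several assumptions that are not from the text: negative similarities are clipped to zero, the symmetric normalized Laplacian is used, and K-means is a small deterministic implementation with farthest-point initialization. The function names and the `max_speakers` cap are hypothetical.

```python
import numpy as np


def estimate_num_speakers(eigenvalues, max_speakers=8):
    """Eigengap heuristic: K sits just before the largest jump in the
    ascending eigenvalue sequence."""
    limit = min(max_speakers, len(eigenvalues) - 1)
    if limit < 1:
        return 1
    gaps = np.diff(eigenvalues[:limit + 1])
    return int(np.argmax(gaps)) + 1


def kmeans(points, k, iterations=50):
    """Plain K-means with deterministic farthest-point initialization."""
    centers = [points[0]]
    for _ in range(1, k):
        dists = np.min([np.linalg.norm(points - c, axis=1) for c in centers], axis=0)
        centers.append(points[int(np.argmax(dists))])
    centers = np.array(centers)
    for _ in range(iterations):
        dist = np.linalg.norm(points[:, None, :] - centers[None, :, :], axis=2)
        labels = np.argmin(dist, axis=1)
        new_centers = np.array([
            points[labels == j].mean(axis=0) if np.any(labels == j) else centers[j]
            for j in range(k)
        ])
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return labels


def spectral_cluster(embeddings, max_speakers=8):
    """Return (labels, k) for an (n_segments, dim) array of embeddings."""
    x = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    sim = np.clip(x @ x.T, 0.0, 1.0)            # step 1 (negatives clipped: assumption)
    d_inv_sqrt = 1.0 / np.sqrt(np.maximum(sim.sum(axis=1), 1e-12))
    lap = np.eye(len(sim)) - d_inv_sqrt[:, None] * sim * d_inv_sqrt[None, :]  # step 2
    eigvals, eigvecs = np.linalg.eigh(lap)      # step 3, eigenvalues ascending
    k = estimate_num_speakers(eigvals, max_speakers)   # step 4
    u = eigvecs[:, :k]                          # eigenvectors of the K smallest eigenvalues
    u = u / np.maximum(np.linalg.norm(u, axis=1, keepdims=True), 1e-12)
    return kmeans(u, k), k                      # step 5
```

Production systems typically refine the similarity matrix (for example by pruning weak entries) before the eigendecomposition; that refinement is omitted here to keep the sketch short.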
Agglomerative Hierarchical Clustering (AHC) is the main alternative. It starts with each segment as its own cluster and iteratively merges the two closest clusters until a stopping threshold is reached. AHC is simpler to implement and works well when the number of speakers is small, but spectral clustering handles larger speaker counts more robustly.
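A minimal AHC sketch in pure Python is shown below. It assumes average linkage over cosine distance and an illustrative stopping threshold of 0.5; the text does not specify either choice, and the function names are hypothetical.

```python
import math


def cosine_distance(a, b):
    """1 - cosine similarity between two vectors given as sequences of floats."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


def agglomerative_cluster(embeddings, threshold=0.5):
    """Merge the two closest clusters until the closest pair is farther apart
    than `threshold`. Returns one integer label per embedding."""
    clusters = [[i] for i in range(len(embeddings))]  # each segment starts alone

    def linkage(c1, c2):
        # Average linkage: mean pairwise distance between members (assumption).
        total = sum(cosine_distance(embeddings[i], embeddings[j])
                    for i in c1 for j in c2)
        return total / (len(c1) * len(c2))

    while len(clusters) > 1:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                d = linkage(clusters[a], clusters[b])
                if best is None or d < best[0]:
                    best = (d, a, b)
        if best[0] > threshold:
            break                         # stopping threshold reached
        _, a, b = best
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]

    labels = [0] * len(embeddings)
    for label, members in enumerate(clusters):
        for i in members:
            labels[i] = label
    return labels
```

The number of speakers falls out of the threshold rather than being estimated separately, so the threshold has to be tuned on held-out data for a given embedding model.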
Stage 4: Overlap Detection and Refinement
In real conversations, speakers frequently overlap -- interruptions, backchannels ("uh-huh", "right"), and simultaneous speech. A pipeline that assigns each time frame to exactly one speaker will misattribute overlapping regions.
Overlap-aware diarization adds a separate model that detects time regions where multiple speakers are active simultaneously. Pyannote's overlap detection uses a PyanNet architecture (a modification of SincNet + LSTM) that outputs per-frame probabilities of 0, 1, 2, or 3+ simultaneous speakers.
When overlap is detected, the system assigns the frame to all speakers whose embedding similarity to the segment exceeds a threshold. The result is that a single time region can have multiple speaker labels.
End-to-End Neural Diarization
The pipeline approach (VAD -> embeddings -> clustering -> overlap) works well but has a fundamental limitation: errors compound across stages. A VAD miss causes the embedding extractor to skip a speech region entirely. A clustering error propagates to all segments in that cluster.
End-to-end neural diarization (EEND) replaces the multi-stage pipeline with a single neural network that directly outputs per-frame speaker labels. The model takes a mel-spectrogram as input and outputs a matrix of speaker activity probabilities: one row per time frame, one column per speaker.
EEND architectures use self-attention (Transformer) to model the entire audio sequence, learning to detect speaker changes, attribute speech to the correct speaker, and handle overlaps -- all in a single forward pass.
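To make the output format concrete, the sketch below converts a matrix of speaker activity probabilities (one row per frame, one column per speaker) into labeled time segments. It is only a post-processing illustration, not a model: the function name `decode_activity`, the 0.5 threshold, and the 20ms frame duration are assumptions.

```python
def decode_activity(probs, frame_duration=0.02, threshold=0.5):
    """Turn per-frame speaker probabilities into (speaker, start, end) segments.

    `probs[t][s]` is the probability that speaker s is active in frame t.
    Each speaker column is thresholded independently, so two speakers can be
    active in the same frame.
    """
    if not probs:
        return []
    segments = []
    for spk in range(len(probs[0])):
        start = None  # index of the first frame in the current active run
        for t, row in enumerate(probs):
            active = row[spk] >= threshold
            if active and start is None:
                start = t
            elif not active and start is not None:
                segments.append((spk, start * frame_duration, t * frame_duration))
                start = None
        if start is not None:             # still active at the end of the audio
            segments.append((spk, start * frame_duration, len(probs) * frame_duration))
    return sorted(segments, key=lambda seg: (seg[1], seg[0]))
```

Since every column is decoded on its own, overlapping speech needs no special case: it simply shows up as segments from different speakers that intersect in time.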
The trade-offs:
| Aspect | Pipeline (VAD + Embed + Cluster) | EEND |
| --- | --- | --- |
| Speaker count | Handles unknown N via clustering | Often requires fixed max N |
| Long audio | Scales well (process in windows) | Quadratic attention limits length |
| Overlap handling | Requires separate overlap model | Handles naturally |
| Interpretability | Each stage debuggable | Black box |
| Accuracy | SOTA with pyannote 3.x | Competitive but less mature |
Aligning Diarization with Transcription
Diarization outputs speaker-labeled time segments. Transcription (ASR) outputs timestamped words. Aligning the two produces speaker-attributed transcripts.
The alignment process:
1. Run ASR (e.g., Whisper, Canary-Qwen) to get word-level timestamps
2. Run diarization (e.g., pyannote) to get speaker-labeled segments
3. For each word, find which speaker segment it falls into by matching timestamps
4. If a word spans a speaker boundary, assign it to the speaker with the larger overlap
# Simplified alignment: assign each word to the segment containing its midpoint
for word in transcript_words:
    word_mid = (word.start + word.end) / 2
    for segment in diarization_segments:
        if segment.start <= word_mid <= segment.end:
            word.speaker = segment.speaker
            break
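The midpoint rule is a shortcut. A version that follows step 4 of the alignment process, assigning each word to the speaker with the largest temporal overlap, might look like the sketch below. The `Word` and `Segment` dataclasses and the `assign_speakers` function are hypothetical stand-ins for whatever structures the ASR and diarization systems actually return.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Word:
    text: str
    start: float                 # seconds
    end: float                   # seconds
    speaker: Optional[str] = None


@dataclass
class Segment:
    speaker: str
    start: float                 # seconds
    end: float                   # seconds


def assign_speakers(words: List[Word], segments: List[Segment]) -> List[Word]:
    """Label each word with the speaker whose segment overlaps it the most.

    Words that overlap no segment keep their existing speaker value
    (None by default).
    """
    for word in words:
        best_overlap = 0.0
        for seg in segments:
            # Length of the intersection of the two time intervals.
            overlap = min(word.end, seg.end) - max(word.start, seg.start)
            if overlap > best_overlap:
                best_overlap = overlap
                word.speaker = seg.speaker
    return words
```

This loop is quadratic in the worst case. For long recordings, sorting both lists by start time and advancing two pointers would avoid rescanning every segment for every word.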
The quality of alignment depends on timestamp accuracy from both systems. Whisper's word-level timestamps have approximately 200ms precision. Pyannote's speaker boundaries have approximately 100ms precision. Combined, this means speaker attribution is reliable for utterances longer than about 500ms but may be incorrect for very short interjections or rapid turn-taking.
Speaker Identification vs. Diarization
Diarization assigns abstract labels (Speaker A, Speaker B) -- it tells you that the same person spoke at 0:14 and 0:25, but not who that person is. Speaker identification goes further: it matches the abstract labels to known identities.
The identification step compares each cluster's centroid embedding against a database of enrolled speaker embeddings. If Speaker A's centroid is closest to "Sarah Chen" in the enrollment database, Speaker A is labeled as Sarah Chen.
This requires a pre-enrollment step where known speakers provide reference audio. The reference audio is embedded using the same speaker encoder used for diarization, and the resulting embedding is stored in a database.
For production systems, the enrollment process typically needs 10-30 seconds of clean speech per person. More reference audio produces more robust enrollment embeddings because short segments can be affected by emotional state, speaking context, or recording conditions.
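The matching step can be sketched as a nearest-neighbor lookup over enrolled embeddings. This is an illustration only: the function names are hypothetical, and the `min_similarity` cutoff for rejecting unknown speakers is an added assumption that the description above does not mention.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two vectors given as sequences of floats."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def identify_speaker(centroid, enrolled, min_similarity=0.6):
    """Return the enrolled name closest to `centroid`, or None if no enrolled
    embedding reaches `min_similarity` (assumed cutoff for unknown speakers).

    `enrolled` maps a speaker name to a reference embedding produced by the
    same speaker encoder used for diarization.
    """
    best_name, best_score = None, min_similarity
    for name, reference in enrolled.items():
        score = cosine_similarity(centroid, reference)
        if score >= best_score:
            best_name, best_score = name, score
    return best_name
```

Without a cutoff, every cluster would be forced onto some enrolled name, so a guest who never enrolled would be mislabeled as the nearest known speaker.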
What Affects Diarization Quality
Several factors determine how well diarization works in practice:
Number of speakers. Diarization works best with 2-4 speakers. Performance degrades as speaker count increases because the clustering problem becomes harder and speaker embeddings from brief segments carry less discriminative information. Conference calls with 10+ speakers are significantly harder than two-person interviews.
Audio quality. Far-field microphones (conference room ceiling arrays) produce lower-quality speaker embeddings than close-talk microphones (headsets, lapel mics). Reverberation smears the spectral features that speaker encoders rely on.
Speaker similarity. Speakers with similar vocal characteristics (same gender, similar age, similar accent) are harder to separate. The embedding space has less distance between them, making clustering boundaries less clear.
Overlap ratio. Conversations with frequent overlapping speech (debates, group discussions) are harder because the system must disentangle multiple simultaneous voices. Diarization error rate (DER) typically doubles when overlap ratio goes from 5% to 20%.
Segment length. Short utterances (under 1 second) provide minimal acoustic evidence for speaker identification. Backchannels like "yeah" or "mm-hmm" are often too brief for reliable attribution.
Metrics: Diarization Error Rate
The standard metric for diarization quality is Diarization Error Rate (DER), defined as:
DER = (False Alarm + Missed Speech + Speaker Confusion) / Total Speech Duration
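As a worked example of the formula, the sketch below computes DER from the three error durations. The function name and the numbers are made up for illustration. A real scorer would also need to find the best mapping between reference and hypothesis speaker labels before measuring confusion, which is not shown here.

```python
def diarization_error_rate(false_alarm, missed, confusion, total_speech):
    """DER from error durations, all in seconds.

    `total_speech` is the total duration of speech in the reference.
    """
    if total_speech <= 0:
        raise ValueError("total_speech must be positive")
    return (false_alarm + missed + confusion) / total_speech


# Illustrative numbers: 600 s of reference speech with
# 12 s false alarm, 30 s missed speech, and 18 s speaker confusion.
der = diarization_error_rate(12.0, 30.0, 18.0, 600.0)
print(f"DER = {der:.1%}")  # prints "DER = 10.0%"
```

Because false alarm time is not bounded by the amount of reference speech, DER can exceed 100% on recordings where the system hallucinates a lot of speech.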
State-of-the-art DER on standard benchmarks:
| Dataset | DER | System | Note |
| --- | --- | --- | --- |
| AMI (meeting) | ~18% | pyannote 3.1 | 4-person meetings, far-field mic |
| CALLHOME | ~11% | pyannote 3.1 | 2-6 person phone calls |
| DIHARD III | ~15% | EEND-vector | Diverse challenging audio |
| VoxConverse | ~5% | pyannote 3.1 | Broadcast/podcast audio |