Speaker Diarization: How AI Agents Know Who Said What in Audio and Video

Why Transcription Alone Isn't Enough

An AI agent that can transcribe audio knows WHAT was said. But for most real-world applications -- meetings, interviews, podcasts, courtroom recordings, customer service calls -- knowing what was said is only half the problem. The other half is knowing WHO said it.

A raw transcript of a meeting looks like this:

We should move the launch to Q3. I disagree, the market window
closes in June. What if we do a soft launch in May? That could work
if engineering can hit the deadline.

A diarized transcript looks like this:

[Speaker A, 00:14] We should move the launch to Q3.
[Speaker B, 00:18] I disagree, the market window closes in June.
[Speaker C, 00:22] What if we do a soft launch in May?
[Speaker A, 00:25] That could work if engineering can hit the deadline.

The second version is searchable in fundamentally different ways. You can answer "what did the VP of Marketing say about the timeline?" or "find all meetings where the CTO and the designer discussed the same feature." Without diarization, these queries are impossible because the transcript has no concept of speaker identity.

Speaker diarization is the process of partitioning an audio stream into segments labeled by speaker identity. It answers the question: "who spoke when?" The output is a set of time-stamped speaker labels that can be aligned with a transcript to produce speaker-attributed text.

Speaker diarization answers who spoke and when, the layer transcription alone cannot provide. The same sixteen seconds of a meeting are shown at three levels: the raw waveform, voice activity detection marking speech regions, and clustered speaker turns, producing a transcript whose every line carries a speaker label and a timestamp. The standard pipeline is four stages: voice activity detection (Silero VAD, a 2MB model running 500x real-time on CPU at about 95% accuracy), speaker embedding extraction (ECAPA-TDNN, 192 or 256 dimensions over 1.5 to 3 second sliding windows, capturing pitch and timbre regardless of words), spectral clustering that reads the speaker count from the eigenvalue gap, and overlap detection that allows multiple labels on one region. Diarization is then joined with ASR word timestamps to attribute each word, and the field scores itself with diarization error rate (false alarm plus missed speech plus speaker confusion over total speech), where under 10% is production quality.


The Diarization Pipeline

Modern diarization systems follow a four-stage pipeline. Each stage can be implemented with different algorithms, but the overall structure is consistent across production systems.

Stage 1: Voice Activity Detection (VAD)

Before identifying speakers, the system needs to know WHEN anyone is speaking at all. Voice Activity Detection separates speech regions from silence, music, background noise, and other non-speech audio.

The simplest VAD approaches use energy thresholds -- if the signal power exceeds a threshold, mark it as speech. This works in clean audio but fails in noisy environments where background noise has comparable energy to speech.
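To make the energy-threshold idea concrete, here is a minimal sketch in plain Python. The function name `energy_vad`, the 30 ms frame size, and the -35 dB threshold are illustrative assumptions rather than values taken from any particular library; a real system would tune the threshold per recording condition.

```python
import math

def energy_vad(samples, sample_rate, frame_ms=30, threshold_db=-35.0):
    """Return (start_s, end_s) speech segments using a per-frame energy threshold.

    samples: floats in [-1.0, 1.0]. threshold_db is an assumed, untuned value.
    """
    frame_len = int(sample_rate * frame_ms / 1000)
    n_frames = len(samples) // frame_len
    segments = []
    seg_start = None
    for i in range(n_frames):
        frame = samples[i * frame_len:(i + 1) * frame_len]
        power = sum(x * x for x in frame) / frame_len
        level_db = 10 * math.log10(power + 1e-12)  # epsilon avoids log(0) on pure silence
        is_speech = level_db > threshold_db
        t = i * frame_len / sample_rate
        if is_speech and seg_start is None:
            seg_start = t                      # speech region opens
        elif not is_speech and seg_start is not None:
            segments.append((seg_start, t))    # speech region closes
            seg_start = None
    if seg_start is not None:                  # audio ended while still in speech
        segments.append((seg_start, n_frames * frame_len / sample_rate))
    return segments

# Synthetic test signal: 0.3 s silence, 0.6 s of a 220 Hz tone, 0.3 s silence.
sr = 16000
tone = [0.5 * math.sin(2 * math.pi * 220 * n / sr) for n in range(9600)]
signal = [0.0] * 4800 + tone + [0.0] * 4800
segments = energy_vad(signal, sr)
print(segments)
```

On this clean synthetic signal the threshold isolates the tone. Add background noise at a comparable level and the same threshold starts marking noise as speech, which is the failure mode described above.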

Modern VAD uses small neural networks trained to classify each audio frame as speech or non-speech. Silero VAD (used in many production systems) is a 2MB ONNX model that processes audio at 500x real-time on CPU. It classifies 30ms frames as speech/non-speech with about 95% accuracy on noisy audio.

VAD output is a set of speech segments with start and end timestamps:

speech: [(0.5, 3.2), (3.8, 7.1), (8.0, 12.5), (13.1, 15.8)]

Stage 2: Speaker Embedding Extraction

Once speech regions are identified, the system extracts a fixed-dimensional vector (embedding) for each segment that captures the speaker's vocal characteristics -- pitch, timbre, speaking rate, vocal tract resonances.

The dominant architecture for speaker embeddings is the ECAPA-TDNN (Emphasized Channel Attention, Propagation and Aggregation in Time Delay Neural Network). It processes mel-spectrogram features through 1D convolutional layers with multi-scale channel attention, producing a 192 or 256-dimensional embedding per segment.

Key properties of speaker embeddings:

Speaker discriminative: Embeddings from the same speaker cluster together regardless of what they're saying. "Hello, how are you" and "The quarterly results exceeded expectations" from the same person produce similar vectors.

Text independent: The embedding captures WHO is speaking, not WHAT they're saying. This is the reverse of a text embedding, which encodes the content and discards the voice.

Robust to channel variation: A good speaker encoder produces similar embeddings whether the person is on a phone call, in a conference room, or wearing a headset.

The embedding extraction typically operates on short segments (1.5-3 seconds) with a sliding window. Shorter segments are more likely to contain a single speaker but provide less acoustic evidence. Longer segments capture more speaker information but risk spanning speaker turns.
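The windowing step can be sketched as follows. This is a simplified illustration, not pyannote's implementation: the function name `sliding_windows`, the 2-second window, the 1-second hop, and the 0.5-second minimum length are all assumed values chosen from within the range mentioned above.

```python
def sliding_windows(speech_segments, window_s=2.0, hop_s=1.0, min_s=0.5):
    """Split VAD speech segments into overlapping windows for embedding extraction."""
    windows = []
    for seg_start, seg_end in speech_segments:
        start = seg_start
        while start < seg_end:
            end = min(start + window_s, seg_end)   # never cross the VAD boundary
            if end - start >= min_s:               # drop slivers too short to embed
                windows.append((round(start, 3), round(end, 3)))
            if end >= seg_end:
                break
            start += hop_s
    return windows

# The VAD output from Stage 1.
speech = [(0.5, 3.2), (3.8, 7.1), (8.0, 12.5), (13.1, 15.8)]
windows = sliding_windows(speech)
for w in windows:
    print(w)
```

Each window would then be passed to the speaker encoder, yielding one embedding per window. A smaller hop gives finer time resolution for speaker changes at the cost of more embeddings to cluster.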

Stage 3: Clustering

With a set of speaker embeddings (one per short segment), the system needs to determine which segments belong to the same speaker. This is a clustering problem: group the embeddings into K clusters, where K is the number of distinct speakers.

Spectral clustering is the standard approach for diarization. The algorithm:

1. Compute a similarity matrix S where S[i,j] = cosine_similarity(embedding_i, embedding_j)
2. Build a graph Laplacian from the similarity matrix
3. Compute the eigenvalues of the Laplacian
4. Estimate the number of speakers K from the eigenvalue gap (the largest jump between consecutive eigenvalues, sorted in ascending order)
5. Run K-means on the K eigenvectors associated with the smallest eigenvalues to assign each segment to a speaker

The eigenvalue gap heuristic for estimating speaker count works because the Laplacian's eigenvalue spectrum reflects the block structure of the similarity matrix. If there are 3 distinct speakers, the similarity matrix has 3 dense blocks (within-speaker similarities are high) separated by sparse regions (between-speaker similarities are low). This produces 3 small eigenvalues followed by a gap.
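The following NumPy sketch walks through those steps on synthetic data. It is one possible variant under stated assumptions, not the exact procedure of any production system: it uses the symmetric normalized Laplacian, clips negative similarities to zero, row-normalizes the eigenvectors, and uses a small deterministic K-means. The function names and the synthetic embeddings (three well-separated "speakers" in 16 dimensions) are hypothetical.

```python
import numpy as np

def estimate_num_speakers(embeddings, max_speakers=8):
    """Estimate speaker count from the eigengap of the normalized graph Laplacian."""
    X = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    S = np.clip(X @ X.T, 0.0, 1.0)            # cosine similarity, negatives clipped
    np.fill_diagonal(S, 0.0)
    d_inv_sqrt = 1.0 / np.sqrt(S.sum(axis=1) + 1e-12)
    L = np.eye(len(S)) - d_inv_sqrt[:, None] * S * d_inv_sqrt[None, :]
    eigvals, eigvecs = np.linalg.eigh(L)      # eigenvalues in ascending order
    limit = min(max_speakers, len(S) - 1)
    gaps = np.diff(eigvals[:limit + 1])       # gaps[i] = eigvals[i+1] - eigvals[i]
    k = int(np.argmax(gaps)) + 1              # K small eigenvalues sit before the gap
    return k, eigvecs[:, :k]

def kmeans(points, k, n_iter=20):
    """Plain K-means with farthest-point initialization (deterministic)."""
    centers = [points[0]]
    for _ in range(1, k):
        dists = np.min([np.linalg.norm(points - c, axis=1) for c in centers], axis=0)
        centers.append(points[int(np.argmax(dists))])
    centers = np.array(centers)
    for _ in range(n_iter):
        labels = np.argmin(np.linalg.norm(points[:, None] - centers[None], axis=2), axis=1)
        for j in range(k):
            if np.any(labels == j):
                centers[j] = points[labels == j].mean(axis=0)
    return labels

def spectral_diarize(embeddings, max_speakers=8):
    k, vecs = estimate_num_speakers(embeddings, max_speakers)
    rows = vecs / (np.linalg.norm(vecs, axis=1, keepdims=True) + 1e-12)
    return k, kmeans(rows, k)

# Synthetic embeddings: 3 speakers, 10 segments each, small Gaussian noise.
rng = np.random.default_rng(0)
centers = np.eye(16)[:3]
true_speaker = np.repeat([0, 1, 2], 10)
embeddings = centers[true_speaker] + rng.normal(scale=0.05, size=(30, 16))

k, labels = spectral_diarize(embeddings)
print("estimated speakers:", k)
print("labels:", labels.tolist())
```

Real speaker embeddings are far less cleanly separated than this toy data, so production systems typically refine the similarity matrix (for example by thresholding or smoothing) before computing the eigengap.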

Agglomerative Hierarchical Clustering (AHC) is the main alternative. It starts with each segment as its own cluster and iteratively merges the two closest clusters until a stopping threshold is reached. AHC is simpler to implement and works well when the number of speakers is small, but spectral clustering handles larger speaker counts more robustly.
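A minimal AHC sketch in plain Python follows. It assumes centroid linkage with cosine similarity and a stopping threshold of 0.7; both are illustrative choices, and the 2-dimensional toy vectors stand in for real speaker embeddings.

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def ahc(embeddings, stop_similarity=0.7):
    """Merge the two most similar clusters until no pair reaches stop_similarity."""
    clusters = [[i] for i in range(len(embeddings))]   # every segment starts alone
    dim = len(embeddings[0])

    def centroid(members):
        return [sum(embeddings[i][d] for i in members) / len(members) for d in range(dim)]

    while len(clusters) > 1:
        best, best_pair = -1.0, None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                sim = cosine(centroid(clusters[a]), centroid(clusters[b]))
                if sim > best:
                    best, best_pair = sim, (a, b)
        if best < stop_similarity:                     # closest pair is too far apart
            break
        a, b = best_pair
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]

    labels = [0] * len(embeddings)
    for label, members in enumerate(clusters):
        for i in members:
            labels[i] = label
    return labels

# Toy embeddings: segments 0, 1, 4 share one voice; segments 2, 3 share another.
toy = [[1.0, 0.1], [0.9, 0.2], [0.1, 1.0], [0.2, 0.9], [1.0, 0.0]]
labels = ahc(toy)
print(labels)
```

Here the number of speakers falls out of the stopping threshold rather than an eigenvalue gap, which is why AHC results are sensitive to how that threshold is tuned.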

Stage 4: Overlap Detection and Refinement

In real conversations, speakers frequently overlap -- interruptions, backchannels ("uh-huh", "right"), and simultaneous speech. A pipeline that assigns each time frame to exactly one speaker will misattribute overlapping regions.

Overlap-aware diarization adds a separate model that detects time regions where multiple speakers are active simultaneously. Pyannote's overlap detection uses a PyanNet architecture (a modification of SincNet + LSTM) that outputs per-frame probabilities of 0, 1, 2, or 3+ simultaneous speakers.

When overlap is detected, the system assigns the frame to all speakers whose embedding similarity to the segment exceeds a threshold. The result is that a single time region can have multiple speaker labels.
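The assignment rule can be sketched like this. The function `assign_speakers`, the 0.5 similarity threshold, and the toy centroids are assumptions for illustration; the overlap flag is taken as given, standing in for the output of a separate overlap detection model.

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def assign_speakers(region_embedding, centroids, overlap_detected, threshold=0.5):
    """Return the speaker labels for one time region.

    centroids: dict mapping speaker label -> cluster centroid embedding.
    threshold: assumed similarity cutoff for counting a speaker as active.
    """
    scores = {label: cosine(region_embedding, c) for label, c in centroids.items()}
    best = max(scores, key=scores.get)
    if not overlap_detected:
        return [best]                          # single-speaker region: best match only
    active = sorted(label for label, s in scores.items() if s >= threshold)
    return active or [best]                    # fall back if nobody clears the cutoff

centroids = {"A": [1.0, 0.0], "B": [0.0, 1.0], "C": [-1.0, 0.0]}
mixed = [0.8, 0.6]   # resembles A most, but also carries a good deal of B

print(assign_speakers(mixed, centroids, overlap_detected=False))
print(assign_speakers(mixed, centroids, overlap_detected=True))
```

Without the overlap flag the region goes to the single closest speaker; with it, every speaker above the threshold receives a label for that region.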

End-to-End Neural Diarization

The pipeline approach (VAD -> embeddings -> clustering -> overlap) works well but has a fundamental limitation: errors compound across stages. A VAD miss causes the embedding extractor to skip a speech region entirely. A clustering error propagates to all segments in that cluster.

End-to-end neural diarization (EEND) replaces the multi-stage pipeline with a single neural network that directly outputs per-frame speaker labels. The model takes a mel-spectrogram as input and outputs a matrix of speaker activity probabilities: one row per time frame, one column per speaker.
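To show what that output matrix looks like downstream, the sketch below turns a hand-written probability matrix into speaker segments by thresholding each column. No model is involved: the matrix values, the 0.5 threshold, and the one-second frame duration are made up for illustration (real EEND frames are much shorter).

```python
def activity_to_segments(probs, frame_s, threshold=0.5):
    """Convert a frames x speakers probability matrix into (speaker, start, end) segments."""
    n_speakers = len(probs[0])
    segments = []
    for spk in range(n_speakers):
        start = None
        for t, row in enumerate(probs):
            active = row[spk] >= threshold
            if active and start is None:
                start = t                                   # speaker becomes active
            elif not active and start is not None:
                segments.append((spk, start * frame_s, t * frame_s))
                start = None
        if start is not None:                               # still active at the end
            segments.append((spk, start * frame_s, len(probs) * frame_s))
    return sorted(segments, key=lambda s: s[1])

# One row per time frame, one column per speaker.
probs = [
    [0.9, 0.1],
    [0.8, 0.2],
    [0.7, 0.6],   # both speakers active: overlap
    [0.2, 0.9],
    [0.1, 0.8],
]
segments = activity_to_segments(probs, frame_s=1.0)
print(segments)  # [(0, 0.0, 3.0), (1, 2.0, 5.0)]
```

Because each column is thresholded independently, the third frame is attributed to both speakers, with no separate overlap model needed.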

EEND architectures use self-attention (Transformer) to model the entire audio sequence, learning to detect speaker changes, attribute speech to the correct speaker, and handle overlaps -- all in a single forward pass.

The trade-offs:

Aspect	Pipeline (VAD + Embed + Cluster)	EEND

Speaker count	Handles unknown N via clustering	Often requires fixed max N
Long audio	Scales well (process in windows)	Quadratic attention limits length
Overlap handling	Requires separate overlap model	Handles naturally
Interpretability	Each stage debuggable	Black box
Accuracy	SOTA with pyannote 3.x	Competitive but less mature

In practice, most production systems use the pipeline approach with pyannote 3.x because it handles variable speaker counts, scales to long audio, and provides interpretable intermediate outputs for debugging.

Aligning Diarization with Transcription

Diarization outputs speaker-labeled time segments. Transcription (ASR) outputs timestamped words. Aligning the two produces speaker-attributed transcripts.

The alignment process:

1. Run ASR (e.g., Whisper, Canary-Qwen) to get word-level timestamps
2. Run diarization (e.g., pyannote) to get speaker-labeled segments
3. For each word, find which speaker segment it falls into by matching timestamps
4. If a word spans a speaker boundary, assign it to the speaker with the larger overlap

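# Simplified alignment: assign each word to the segment containing its midpoint.
# Assumes words and segments expose .start and .end in seconds.
for word in transcript_words:
    word_mid = (word.start + word.end) / 2
    word.speaker = None  # stays None if the midpoint falls in no speaker segment
    for segment in diarization_segments:
        if segment.start <= word_mid <= segment.end:
            word.speaker = segment.speaker
            break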
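The snippet above uses the word midpoint as a shortcut. Step 4 of the alignment process calls for the larger overlap instead, and a self-contained sketch of that variant might look like this. The `Word` and `Segment` dataclasses and the timestamps are hypothetical stand-ins for whatever the ASR and diarization libraries actually return.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Word:
    text: str
    start: float
    end: float
    speaker: Optional[str] = None

@dataclass
class Segment:
    speaker: str
    start: float
    end: float

def assign_by_overlap(words, segments):
    """Label each word with the speaker whose segment overlaps it the most."""
    for word in words:
        best_overlap = 0.0
        for seg in segments:
            # Length of the intersection of the two intervals (negative if disjoint).
            overlap = min(word.end, seg.end) - max(word.start, seg.start)
            if overlap > best_overlap:
                best_overlap = overlap
                word.speaker = seg.speaker
    return words

segments = [Segment("A", 14.0, 18.0), Segment("B", 18.0, 22.0)]
words = [
    Word("Q3.", 17.2, 17.9),
    Word("I", 17.9, 18.2),        # straddles the A/B boundary, mostly inside B
    Word("disagree,", 18.2, 18.8),
]
assign_by_overlap(words, segments)
for w in words:
    print(w.speaker, w.text)
```

Words that overlap no segment keep `speaker=None`, which makes gaps in the diarization visible instead of silently attributing them.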

The quality of alignment depends on timestamp accuracy from both systems. Whisper's word-level timestamps have approximately 200ms precision. Pyannote's speaker boundaries have approximately 100ms precision. Combined, this means speaker attribution is reliable for utterances longer than about 500ms but may be incorrect for very short interjections or rapid turn-taking.

Speaker Identification vs. Diarization

Diarization assigns abstract labels (Speaker A, Speaker B) -- it tells you that the same person spoke at 0:14 and 0:25, but not who that person is. Speaker identification goes further: it matches the abstract labels to known identities.

The identification step compares each cluster's centroid embedding against a database of enrolled speaker embeddings. If Speaker A's centroid is closest to "Sarah Chen" in the enrollment database, Speaker A is labeled as Sarah Chen.

This requires a pre-enrollment step where known speakers provide reference audio. The reference audio is embedded using the same speaker encoder used for diarization, and the resulting embedding is stored in a database.

For production systems, the enrollment process typically needs 10-30 seconds of clean speech per person. More reference audio produces more robust enrollment embeddings because short segments can be affected by emotional state, speaking context, or recording conditions.
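The matching step can be sketched as a nearest-neighbor lookup with a rejection threshold. Everything here is illustrative: the 3-dimensional vectors stand in for real 192- or 256-dimensional embeddings, the second enrolled name is a placeholder, and the 0.6 minimum similarity is an assumed value that would need tuning on real data.

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def identify(centroid, enrolled, min_similarity=0.6):
    """Return (name, score) for the closest enrolled speaker, or (None, score) if none is close."""
    best_name, best_score = None, -1.0
    for name, reference in enrolled.items():
        score = cosine(centroid, reference)
        if score > best_score:
            best_name, best_score = name, score
    if best_score < min_similarity:
        return None, best_score        # leave the abstract label in place
    return best_name, best_score

# Enrollment database: name -> reference embedding (toy values).
enrolled = {
    "Sarah Chen": [0.9, 0.1, 0.2],
    "Enrolled Speaker 2": [0.1, 0.8, 0.3],
}

speaker_a_centroid = [0.85, 0.15, 0.25]
unknown_centroid = [0.0, 0.1, -0.9]

print(identify(speaker_a_centroid, enrolled))
print(identify(unknown_centroid, enrolled))
```

The rejection threshold matters in practice: without it, every cluster is forced onto some enrolled identity, including voices that were never enrolled.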

What Affects Diarization Quality

Several factors determine how well diarization works in practice:

Number of speakers. Diarization works best with 2-4 speakers. Performance degrades as speaker count increases because the clustering problem becomes harder and speaker embeddings from brief segments carry less discriminative information. Conference calls with 10+ speakers are significantly harder than two-person interviews.

Audio quality. Far-field microphones (conference room ceiling arrays) produce lower-quality speaker embeddings than close-talk microphones (headsets, lapel mics). Reverberation smears the spectral features that speaker encoders rely on.

Speaker similarity. Speakers with similar vocal characteristics (same gender, similar age, similar accent) are harder to separate. The embedding space has less distance between them, making clustering boundaries less clear.

Overlap ratio. Conversations with frequent overlapping speech (debates, group discussions) are harder because the system must disentangle multiple simultaneous voices. Diarization error rate (DER) typically doubles when overlap ratio goes from 5% to 20%.

Segment length. Short utterances (under 1 second) provide minimal acoustic evidence for speaker identification. Backchannels like "yeah" or "mm-hmm" are often too brief for reliable attribution.

Metrics: Diarization Error Rate

The standard metric for diarization quality is Diarization Error Rate (DER), defined as:

DER = (False Alarm + Missed Speech + Speaker Confusion) / Total Speech Duration

False Alarm: Non-speech regions incorrectly labeled as speech (VAD errors)

Missed Speech: Speech regions not detected (VAD misses)

Speaker Confusion: Speech correctly detected but attributed to the wrong speaker (clustering errors)
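As a worked example of the formula, the sketch below computes DER from the three error durations. The numbers are invented for illustration and do not come from any benchmark.

```python
def diarization_error_rate(false_alarm, missed, confusion, total_speech):
    """DER from error durations; all arguments in seconds."""
    if total_speech <= 0:
        raise ValueError("total_speech must be positive")
    return (false_alarm + missed + confusion) / total_speech

# Hypothetical 10-minute recording containing 600 s of reference speech.
der = diarization_error_rate(
    false_alarm=12.0,    # non-speech labeled as speech
    missed=18.0,         # speech the system did not detect
    confusion=30.0,      # speech attributed to the wrong speaker
    total_speech=600.0,
)
print(f"DER = {der:.1%}")
```

Note that the denominator is total reference speech, not total audio duration, so a system with heavy false alarms can in principle score a DER above 100%.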

State-of-the-art DER on standard benchmarks:

Dataset	DER	System	Note

AMI (meeting)	~18%	pyannote 3.1	4-person meetings, far-field mic
CALLHOME	~11%	pyannote 3.1	2-6 person phone calls
DIHARD III	~15%	EEND-vector	Diverse challenging audio
VoxConverse	~5%	pyannote 3.1	Broadcast/podcast audio

DER below 10% is considered production-quality for most applications. The AMI meeting corpus remains challenging because of far-field recording conditions and frequent overlapping speech.
