The Deaf Agent Problem
Most AI agents are blind and deaf. The previous guide in this series tackled vision -- giving agents eyes through object detection and scene understanding. This guide tackles the other half: hearing.
An agent that can process text but not audio is missing a massive slice of the world's information. Meeting recordings, customer support calls, podcast archives, surveillance footage with dialogue, factory floor sounds, music libraries -- all of these contain signal that a text-only agent cannot access.
The gap is not just about transcription. Transcribing speech to text is step one, but it discards speaker identity, emotional tone, background sounds, music, and temporal structure. A complete audio perception pipeline extracts multiple layers of features from the same audio stream, just as a visual pipeline extracts objects, scenes, and embeddings from the same image.
This guide covers the full stack: how raw audio becomes features that agents can search, filter, and reason over.
From Air Pressure to Numbers
Sound is a pressure wave. A microphone converts that wave into a continuous electrical signal. An analog-to-digital converter (ADC) samples that signal at regular intervals -- typically 16,000 times per second (16 kHz) for speech, 44,100 Hz for music, or 48,000 Hz for professional audio. Each sample is a number representing the amplitude of the wave at that instant.
A one-minute audio clip at 16 kHz produces 960,000 numbers. This raw waveform is the starting point for all audio processing, but it is a poor representation for machine learning. Two utterances of the same word by the same speaker will produce different raw waveforms because of tiny timing differences, background noise, and microphone variation.
The Spectrogram Transform
The critical preprocessing step is the Short-Time Fourier Transform (STFT). The STFT slides a window (typically 25ms wide, shifted by 10ms each step) across the waveform and computes the frequency decomposition of each window. The result is a 2D matrix: time on one axis, frequency on the other, with intensity values representing how much energy is present at each frequency at each moment.
This is a spectrogram -- a visual representation of sound that converts the 1D time-domain signal into a 2D time-frequency representation. Spectrograms are to audio what images are to vision: the natural input format for neural networks.
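The sliding-window computation above can be sketched in a few lines of dependency-free Python. This is only an illustration: the function names (`hann`, `stft_magnitudes`) and the 1 kHz test tone are assumptions made for the example, and a real pipeline would use an FFT library rather than this naive per-frame DFT.

```python
import cmath
import math

def hann(n):
    """Hann window of length n, used to taper each frame before the DFT."""
    return [0.5 - 0.5 * math.cos(2 * math.pi * i / n) for i in range(n)]

def stft_magnitudes(samples, sample_rate=16000, win_ms=25, hop_ms=10):
    """Return a list of frames, each a list of magnitudes per frequency bin.

    Naive O(N^2) DFT per frame: fine for illustration, too slow for real use.
    """
    win = int(sample_rate * win_ms / 1000)   # 400 samples at 16 kHz
    hop = int(sample_rate * hop_ms / 1000)   # 160 samples at 16 kHz
    window = hann(win)
    n_bins = win // 2 + 1                    # keep only non-negative frequencies
    # Precompute the DFT basis once so each frame is a plain dot product
    basis = [[cmath.exp(-2j * math.pi * k * n / win) for n in range(win)]
             for k in range(n_bins)]
    frames = []
    for start in range(0, len(samples) - win + 1, hop):
        frame = [s * w for s, w in zip(samples[start:start + win], window)]
        frames.append([abs(sum(x * b for x, b in zip(frame, row))) for row in basis])
    return frames

# A 1 kHz test tone, 100 ms long
sr = 16000
tone = [math.sin(2 * math.pi * 1000 * t / sr) for t in range(sr // 10)]
spec = stft_magnitudes(tone, sr)
peak_bin = max(range(len(spec[0])), key=lambda k: spec[0][k])
peak_hz = peak_bin * sr / 400   # each bin spans sample_rate / window_length Hz
```

Each row of `spec` is one time step and each column one frequency bin, which is exactly the time-by-frequency matrix described above. For the pure tone, the energy concentrates in the bin centred on 1 kHz.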
Mel-Frequency Scaling
Human hearing is not linear. We are much better at distinguishing low-frequency differences (100 Hz vs 200 Hz) than high-frequency ones (8000 Hz vs 8100 Hz). The mel scale compresses high frequencies to match human perception. A mel spectrogram typically uses 80 or 128 mel-frequency bins, reducing a linear spectrogram with hundreds of frequency bins (513 for a 1024-point FFT, for example) to a compact representation that preserves the information humans actually use.
Many modern speech models (Whisper and most Conformer-based models, for example) take log-mel spectrograms as input. Wav2Vec2 and HuBERT are notable exceptions: they learn their own features directly from the raw waveform through a convolutional front end. The typical mel pipeline:
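To make the compression concrete, here is a small sketch using the HTK-style mel formula, mel = 2595 * log10(1 + f / 700). Other mel variants exist and differ slightly. The helper names and the 0 to 8000 Hz range are assumptions chosen for illustration.

```python
import math

def hz_to_mel(f_hz):
    """HTK-style mel formula; other variants (e.g. Slaney) differ slightly."""
    return 2595.0 * math.log10(1.0 + f_hz / 700.0)

def mel_to_hz(mel):
    """Inverse of hz_to_mel."""
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)

def mel_band_edges(n_mels=80, f_min=0.0, f_max=8000.0):
    """Edges of n_mels triangular filters, equally spaced on the mel axis."""
    lo, hi = hz_to_mel(f_min), hz_to_mel(f_max)
    step = (hi - lo) / (n_mels + 1)
    return [mel_to_hz(lo + i * step) for i in range(n_mels + 2)]

low_gap = hz_to_mel(200) - hz_to_mel(100)      # a 100 Hz step at the low end
high_gap = hz_to_mel(8100) - hz_to_mel(8000)   # the same 100 Hz step at the high end
edges = mel_band_edges()
```

On this scale the 100 Hz step near the bottom of the range spans roughly ten times as many mels as the same step near 8 kHz, so the filters in `edges` are narrow at low frequencies and wide at high ones.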
raw waveform (16kHz, 960K samples/min)
→ STFT (25ms window, 10ms hop)
→ mel filterbank (80 bins)
→ log compression
→ 80 × T mel spectrogram
Where T is the number of time frames (about 100 per second of audio).
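The sample and frame counts quoted above follow from simple arithmetic. The sketch below assumes no padding at the signal edges; many libraries pad the signal, which brings the count to about 100 frames per second. The function names are illustrative.

```python
def num_samples(duration_s, sample_rate=16000):
    """Number of samples in a clip of the given duration."""
    return int(duration_s * sample_rate)

def num_frames(n_samples, sample_rate=16000, win_ms=25, hop_ms=10):
    """Frames produced without padding: one per hop once a full window fits."""
    win = sample_rate * win_ms // 1000
    hop = sample_rate * hop_ms // 1000
    if n_samples < win:
        return 0
    return 1 + (n_samples - win) // hop

samples_per_minute = num_samples(60)
frames_per_second = num_frames(num_samples(1))
print(samples_per_minute, frames_per_second)   # 960000 98
```

With 80 mel bins, one second of audio therefore becomes a matrix of roughly 80 × 100 values instead of 16,000 raw samples.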
Automatic Speech Recognition Architectures
Transcribing speech to text is the most mature audio AI task, but the architectures have evolved dramatically. Understanding the three main approaches helps you choose the right model for your use case.
CTC: Connectionist Temporal Classification
CTC (2006) solved a fundamental problem: how do you train a model when you have audio paired with text but no alignment between them? You know the speaker said "hello world" but you do not know which milliseconds correspond to which letters.
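CTC works by having the model output a probability distribution over characters (plus a special "blank" token) at every time frame. The key insight is the many-to-one mapping: multiple frame-level output sequences collapse to the same text once repeated characters are merged and blanks are removed. Writing the blank as "-", these all decode to "hello":
hhee-ll-lloo → hello
--h-e-l--l-o----- → hello
h--e--l-l--o-- → hello
The blank between the two l's is what preserves the double letter; without it, the repeated "l" frames would merge into a single "l".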
CTC sums the probabilities of all possible alignments and optimizes the total probability of the correct text. At inference time, a simple greedy or beam search decoder collapses repeated characters and removes blanks.
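The collapse rule used by a greedy CTC decoder is short enough to write out. This is a minimal sketch: the blank is written as "-", and the toy alphabet and per-frame probabilities are invented for illustration.

```python
from itertools import groupby

BLANK = "-"

def ctc_collapse(frames, blank=BLANK):
    """Merge consecutive repeats, then drop blanks."""
    merged = [symbol for symbol, _ in groupby(frames)]
    return "".join(s for s in merged if s != blank)

def ctc_greedy_decode(frame_probs, alphabet, blank=BLANK):
    """Pick the most likely symbol per frame, then collapse."""
    best = [alphabet[max(range(len(p)), key=p.__getitem__)] for p in frame_probs]
    return ctc_collapse(best, blank)

# Toy example: three symbols, five frames
alphabet = ["-", "a", "b"]
probs = [
    [0.1, 0.8, 0.1],    # a
    [0.2, 0.7, 0.1],    # a (repeat, merged)
    [0.9, 0.05, 0.05],  # blank separates the two a's
    [0.1, 0.8, 0.1],    # a
    [0.1, 0.1, 0.8],    # b
]
decoded = ctc_greedy_decode(probs, alphabet)
```

Note the order of operations: repeats are merged first and blanks removed second, which is why a blank is needed between two identical consecutive characters.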
Strengths: Fast inference (single forward pass), streaming-capable, simple architecture. Weaknesses: Cannot model output dependencies -- each frame's prediction is conditionally independent of the others given the audio. Struggles with rare words because it has no language model built in.
Models using CTC: Wav2Vec2, HuBERT, NVIDIA Parakeet-CTC.
Attention-Based Encoder-Decoder
The attention-based approach (used by Whisper) treats ASR as a sequence-to-sequence translation problem: translate audio into text, one token at a time.
The encoder processes the mel spectrogram through transformer layers to produce a sequence of hidden states. The decoder is an autoregressive transformer that generates text tokens one by one, attending to the encoder output at each step.
This architecture can model output dependencies (each token prediction sees all previous tokens), which gives it advantages on rare words, punctuation, and formatting. Whisper exploits this by training the decoder to produce fully formatted text with punctuation, capitalization, and even language tags.
Strengths: Highest accuracy on offline transcription, handles punctuation and formatting, multilingual. Weaknesses: Not streaming (must wait for full audio), slower inference due to autoregressive decoding, can hallucinate (generate text not in the audio) when the audio is silent or noisy.
Models using attention: OpenAI Whisper, Google USM.
Transducer (RNN-T / TDT)
The transducer architecture combines the best of both worlds. It has an encoder (processes audio), a prediction network (models text history, like a decoder), and a joint network that combines them.
Unlike attention models, the transducer does not attend over the full encoder output -- it processes audio frame by frame. Unlike CTC, it does model output dependencies through the prediction network. This makes it both streaming-capable and accurate.
The Token-and-Duration Transducer (TDT) variant (used by NVIDIA Parakeet TDT) adds duration prediction, which improves timestamp accuracy and reduces computational cost by skipping blank frames more efficiently.
Strengths: Streaming + accurate, good timestamp alignment, efficient inference. Weaknesses: More complex training, newer and less widely supported.
Models using transducers: NVIDIA Parakeet TDT, Google Chirp.
Choosing the Right ASR Architecture
| Requirement | Best Architecture | Why |
| Offline batch transcription | Attention (Whisper) | Highest accuracy, best formatting |
| Real-time streaming | CTC or Transducer | Frame-by-frame processing |
| Accurate timestamps | Transducer (TDT) | Duration prediction built in |
| Low-resource language | Attention (Whisper) | Trained on 680K hours of audio, a large share of it multilingual |
| Edge deployment | CTC | Single forward pass, simplest decoder |
Speaker Diarization: Who Said What
Transcription tells you what was said. Speaker diarization tells you who said it, and when they started and stopped speaking. This is essential for meeting recordings, interviews, call center audio, and any multi-speaker scenario.
The Diarization Pipeline
Classical speaker diarization is a four-stage pipeline:
1. Voice Activity Detection (VAD): Identify which segments of audio contain speech versus silence, music, or noise. This is a binary classification at the frame level. PyAnnote uses a small neural network trained to output speech probability for each 16ms frame.
2. Speaker Embedding Extraction: For each speech segment, extract a fixed-length vector (speaker embedding) that captures the speaker's voice characteristics. The dominant approach is x-vectors: a time-delay neural network (TDNN) trained on speaker verification tasks, producing a 512-dimensional embedding per segment. Two segments from the same speaker should have embeddings with high cosine similarity; segments from different speakers should have low similarity.
3. Clustering: Group segments by speaker. Agglomerative hierarchical clustering (AHC) is the standard approach: start with each segment as its own cluster, then iteratively merge the two most similar clusters until a stopping criterion is met. The stopping criterion (usually a cosine similarity threshold) determines how many speakers are detected.
4. Re-segmentation: Refine speaker boundaries. The initial segmentation is coarse (based on VAD chunks). Re-segmentation uses a Viterbi algorithm or neural model to find the exact frame where the speaker changes.
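The clustering stage (step 3) can be sketched as follows. This is a toy version under stated assumptions: average linkage, a cosine threshold of 0.7 picked arbitrarily, and 2-dimensional vectors standing in for real speaker embeddings. It is not how any particular toolkit implements the step.

```python
import math

def cosine(u, v):
    """Cosine similarity of two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def cluster_speakers(embeddings, threshold=0.7):
    """Average-linkage AHC; returns one speaker label per segment."""
    clusters = [[i] for i in range(len(embeddings))]

    def linkage(a, b):
        sims = [cosine(embeddings[i], embeddings[j]) for i in a for j in b]
        return sum(sims) / len(sims)

    while len(clusters) > 1:
        # Find the most similar pair of clusters
        best_sim, best_pair = max(
            (linkage(clusters[i], clusters[j]), (i, j))
            for i in range(len(clusters)) for j in range(i + 1, len(clusters))
        )
        if best_sim < threshold:      # stopping criterion
            break
        i, j = best_pair
        clusters[i] += clusters[j]
        del clusters[j]

    labels = [0] * len(embeddings)
    for label, members in enumerate(clusters):
        for idx in members:
            labels[idx] = label
    return labels

# Five segments: three from one voice, two from another (toy vectors)
segments = [[1.0, 0.1], [0.9, 0.2], [0.1, 1.0], [0.2, 0.9], [1.0, 0.0]]
labels = cluster_speakers(segments)
```

Raising the threshold makes merging stricter and yields more speakers; lowering it yields fewer. This is why the threshold is the main tuning knob in clustering-based diarization.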
End-to-End Neural Diarization (EEND)
Recent work replaces the multi-stage pipeline with a single neural network. EEND models take the mel spectrogram as input and output a speaker activity matrix: for each time frame, which of N speakers is active. This naturally handles overlapping speech (two speakers talking simultaneously), which is the Achilles heel of clustering-based methods.
PyAnnote 3.1 uses a hybrid approach: neural segmentation (detecting speaker turns and overlaps) combined with embedding-based clustering for speaker identity.
Diarization for Agent Pipelines
In a multimodal pipeline, diarization enriches the transcript with speaker labels:
Without diarization:
"Let's move forward with the merger."
"I disagree, the valuation is too high."
With diarization:
[Speaker A, 00:14.2 - 00:17.8] "Let's move forward with the merger."
[Speaker B, 00:18.1 - 00:21.4] "I disagree, the valuation is too high."
An agent can now filter by speaker, search for what a specific person said, or analyze speaker dynamics (talk time ratios, interruption frequency, turn-taking patterns).
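One common way to attach speaker labels to a transcript is to give each ASR segment the speaker whose turn overlaps it most. The sketch below assumes both models emit segments with `start` and `end` times in seconds. The dictionary layout and the ASR timestamps are hypothetical; the speaker turns reuse the times from the example above.

```python
def overlap(a_start, a_end, b_start, b_end):
    """Length in seconds of the intersection of two intervals (0 if disjoint)."""
    return max(0.0, min(a_end, b_end) - max(a_start, b_start))

def label_segments(asr_segments, speaker_turns):
    """Attach to each ASR segment the speaker whose turn overlaps it most."""
    labeled = []
    for seg in asr_segments:
        best = max(speaker_turns,
                   key=lambda t: overlap(seg["start"], seg["end"], t["start"], t["end"]),
                   default=None)
        has_overlap = best is not None and overlap(
            seg["start"], seg["end"], best["start"], best["end"]) > 0
        labeled.append({**seg, "speaker": best["speaker"] if has_overlap else None})
    return labeled

def talk_time(speaker_turns):
    """Total speaking time per speaker, in seconds."""
    totals = {}
    for t in speaker_turns:
        totals[t["speaker"]] = totals.get(t["speaker"], 0.0) + t["end"] - t["start"]
    return totals

turns = [
    {"speaker": "A", "start": 14.2, "end": 17.8},
    {"speaker": "B", "start": 18.1, "end": 21.4},
]
asr = [
    {"start": 14.3, "end": 17.6, "text": "Let's move forward with the merger."},
    {"start": 18.2, "end": 21.3, "text": "I disagree, the valuation is too high."},
]
labeled = label_segments(asr, turns)
totals = talk_time(turns)
```

Talk-time ratios fall out of `totals` directly. Interruption counts and turn-taking statistics can be derived from the same list of turns.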
Audio Embeddings and Contrastive Learning
Transcription and diarization extract structured features from speech. But audio contains far more than speech: music, environmental sounds, mechanical noises, alarms, animal calls. To make all of this searchable, you need audio embeddings -- dense vector representations that capture semantic similarity.
CLAP: Contrastive Language-Audio Pretraining
CLAP is the audio counterpart to CLIP. Just as CLIP learns a shared embedding space for images and text, CLAP learns a shared space for audio and text.
The architecture:
Audio Input → Audio Encoder (HTSAT) → audio embedding (512-dim)
Text Input → Text Encoder (RoBERTa) → text embedding (512-dim)
Contrastive loss: pull (audio, matching text) pairs together,
push (audio, non-matching text) pairs apart
HTSAT (Hierarchical Token-Semantic Audio Transformer) is a Swin Transformer adapted for spectrograms. It processes the mel spectrogram with hierarchical windowed self-attention, producing a single 512-dimensional vector for the entire audio clip.
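The contrastive objective in the diagram above is commonly written as a symmetric cross-entropy over a batch similarity matrix, as in CLIP. The sketch below is a dependency-free illustration under assumptions: a fixed temperature of 0.07 (models of this kind typically learn it) and 2-dimensional toy embeddings instead of 512-dimensional ones.

```python
import math

def normalize(v):
    """Scale a vector to unit length so dot products are cosine similarities."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def cross_entropy_diagonal(logits):
    """Mean cross-entropy where row i's correct class is column i."""
    total = 0.0
    for i, row in enumerate(logits):
        m = max(row)                                   # for numerical stability
        log_sum = m + math.log(sum(math.exp(x - m) for x in row))
        total += log_sum - row[i]
    return total / len(logits)

def contrastive_loss(audio_embs, text_embs, temperature=0.07):
    """Symmetric InfoNCE over a batch of matched (audio, text) pairs."""
    a = [normalize(v) for v in audio_embs]
    t = [normalize(v) for v in text_embs]
    logits = [[sum(x * y for x, y in zip(ai, tj)) / temperature for tj in t] for ai in a]
    transposed = [list(col) for col in zip(*logits)]
    # Average the audio-to-text and text-to-audio directions
    return 0.5 * (cross_entropy_diagonal(logits) + cross_entropy_diagonal(transposed))

audio = [[1.0, 0.0], [0.0, 1.0]]
aligned = contrastive_loss(audio, [[1.0, 0.0], [0.0, 1.0]])    # captions match
shuffled = contrastive_loss(audio, [[0.0, 1.0], [1.0, 0.0]])   # captions swapped
```

The loss is near zero when each audio embedding sits closest to its own caption and grows large when the pairing is wrong. That gradient signal is what pulls matching pairs together and pushes mismatched pairs apart.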
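After training on hundreds of thousands of (audio, caption) pairs, CLAP enables text-to-audio retrieval (searching audio with a natural-language query), zero-shot audio classification (scoring a clip against candidate label prompts), and audio-to-audio similarity search.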
The Modality Gap in Audio
Just as CLIP embeddings for images and text occupy different sub-regions of the shared space (the "modality gap"), CLAP audio and text embeddings show the same phenomenon. An audio embedding of a dog barking and the text embedding of "dog barking" will be close relative to unrelated pairs, but there is still a systematic offset between the audio and text embedding distributions.
For search pipelines, this means:
1. Cross-modal search works (text query → audio results) but is not as precise as within-modal search.
2. Score calibration matters. A cosine similarity of 0.3 in CLAP cross-modal search may be a strong match, while the same score in text-to-text search would be weak.
3. Fusion helps. Combining audio embeddings with transcript text embeddings (reciprocal rank fusion or learned fusion) consistently outperforms either modality alone.
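Reciprocal rank fusion is attractive here because it uses only ranks, so the uncalibrated scores from the two searches never have to be compared directly. A minimal sketch follows; the clip IDs are invented and k=60 is the conventional constant, not a tuned value.

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Fuse ranked lists of ids: each list contributes 1 / (k + rank) per id."""
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Highest fused score first; ties broken by id for determinism
    return sorted(scores, key=lambda d: (-scores[d], d))

audio_hits = ["clip_7", "clip_2", "clip_9"]   # from CLAP text-to-audio search
text_hits = ["clip_2", "clip_4", "clip_7"]    # from transcript embedding search
fused = reciprocal_rank_fusion([audio_hits, text_hits])
print(fused)   # ['clip_2', 'clip_7', 'clip_4', 'clip_9']
```

Clips that appear in both lists rise to the top, while clips found by only one modality are kept but ranked lower.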
Building an Audio Perception Pipeline
A complete audio perception pipeline extracts multiple feature types from the same audio stream, stores them separately, and makes each searchable through appropriate retrieval stages.
Architecture
Audio file (.mp3, .wav, .m4a)
    │
    ├─→ Mel Spectrogram → ASR Model → transcript (text)
    │                                      │
    │                                      ├─→ Text Embeddings → vector store
    │                                      └─→ Speaker Diarization → speaker segments
    │
    ├─→ CLAP Encoder → audio embedding → vector store
    │
    └─→ VAD → speech/silence segments → metadata store
Each feature type feeds a different retrieval stage:
| Feature | Storage | Retrieval Stage |
| Transcript text | Text index | Keyword / semantic search |
| Text embeddings | Vector store | Semantic similarity (text-to-text) |
| Audio embeddings (CLAP) | Vector store | Cross-modal search (text-to-audio) |
| Speaker segments | Structured metadata | Filter by speaker ID |
| VAD segments | Structured metadata | Filter speech vs. non-speech |
Extraction Strategy: Chunk, Extract, Index
Audio files can be arbitrarily long. A one-hour meeting recording needs to be chunked before extraction:
1. Fixed-length chunks (30 seconds) for audio embeddings. CLAP is trained on short clips; longer audio should be split into overlapping 30-second windows.
2. VAD-based chunks for transcription. Split on silence boundaries to avoid cutting words in half. Most ASR models handle 30-second segments well; Whisper processes 30-second windows internally.
3. Speaker turn chunks for diarization-aware transcription. After diarization, re-segment by speaker turns so each chunk contains speech from a single speaker.
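The fixed-length strategy in step 1 can be sketched as a function that returns window boundaries in seconds. The 5-second overlap is an assumption chosen for illustration; the right value depends on how short the events you care about are.

```python
def fixed_windows(duration_s, window_s=30.0, overlap_s=5.0):
    """Return (start, end) pairs covering [0, duration_s] with overlapping windows."""
    if overlap_s >= window_s:
        raise ValueError("overlap must be smaller than the window")
    step = window_s - overlap_s
    windows, start = [], 0.0
    while True:
        end = min(start + window_s, duration_s)   # last window may be shorter
        windows.append((start, end))
        if end >= duration_s:
            break
        start += step
    return windows

print(fixed_windows(70))   # [(0.0, 30.0), (25.0, 55.0), (50.0, 70)]
```

Storing the `(start, end)` pair alongside each embedding lets a search hit be mapped back to the exact position in the original recording.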
Multi-Stage Retrieval Example
A retrieval pipeline for a podcast archive might look like:
Stage 1: Audio embedding search (CLAP)
→ "sounds like a heated argument" → top 50 clips by audio similarity
Stage 2: Transcript semantic search
→ "disagreement about pricing strategy" → re-rank by text relevance
Stage 3: Speaker filter
→ speaker_id = "CEO" → filter to clips where the CEO is speaking
Stage 4: LLM re-rank
→ Re-rank final candidates using an LLM that reads both the transcript
and the query, producing a relevance score
Each stage narrows the candidate set using a different modality, combining the strengths of audio understanding, text semantics, and structured metadata.
Practical Patterns for Agent Audio Processing
Pattern 1: Transcribe-and-Forget
The simplest pattern: transcribe audio to text, then treat it as a text document. The agent never touches the audio again.
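# Transcribe with Whisper (openai-whisper package)
import whisper

model = whisper.load_model("large-v3")
transcript = model.transcribe(audio_path)
# The result has no "duration" key; approximate it from the last segment's end time
duration = transcript["segments"][-1]["end"] if transcript["segments"] else 0.0
# Index the transcript text (`index` is a placeholder for your document store)
index.add_document(
    text=transcript["text"],
    metadata={"source": audio_path, "duration": duration}
)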
When to use: When speech content is all that matters and you do not need speaker identity, audio search, or non-speech sounds.
Limitation: Discards everything except words. Music, tone, background sounds, speaker identity -- all lost.
Pattern 2: Parallel Feature Extraction
Extract multiple feature types simultaneously and index each one:
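# The model and store objects here are placeholders for your own components
from concurrent.futures import ThreadPoolExecutor

# The three audio-side extractors are independent, so run them in parallel
with ThreadPoolExecutor() as pool:
    transcript_f = pool.submit(asr_model.transcribe, audio)
    speakers_f = pool.submit(diarization_model.diarize, audio)
    audio_emb_f = pool.submit(clap_model.encode_audio, audio_chunks)
transcript = transcript_f.result()
speakers = speakers_f.result()
audio_emb = audio_emb_f.result()
# The text embedding depends on the transcript, so it runs afterwards
text_emb = text_model.encode(transcript["text"])
# Index each feature type
for chunk, emb in zip(audio_chunks, audio_emb):
    vector_store.upsert(id=chunk.id, vector=emb, type="audio")
vector_store.upsert(id=doc_id, vector=text_emb, type="text")
metadata_store.insert(
    id=doc_id,
    speakers=speakers,
    transcript=transcript
)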
When to use: When the agent needs to search by content, by sound, and by speaker. This is the standard production pattern.
Pattern 3: Streaming Perception
For real-time applications (live meetings, surveillance, call centers), the agent processes audio as it arrives:
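# Streaming ASR + real-time diarization
# (audio_source, asr_model, diarizer, and agent are placeholder objects;
#  chunk.samples is assumed to hold the chunk's raw samples)
SAMPLE_RATE = 16000
stream = audio_source.open_stream(sample_rate=SAMPLE_RATE)
buffer = []          # rolling window of the most recent samples
speaker_id = None    # unknown until enough audio has accumulated
for chunk in stream.chunks(duration_ms=500):
    # Streaming CTC/Transducer model
    partial_transcript = asr_model.process_chunk(chunk)
    # Sliding-window speaker embedding over the last 2 seconds of audio
    buffer.extend(chunk.samples)
    buffer = buffer[-2 * SAMPLE_RATE:]
    if len(buffer) >= 2 * SAMPLE_RATE:
        speaker_id = diarizer.identify_speaker(buffer)
    # Emit an event once the ASR model finalizes a segment
    if partial_transcript.is_final:
        agent.on_speech(
            text=partial_transcript.text,
            speaker=speaker_id,
            timestamp=chunk.timestamp
        )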
When to use: Live transcription, real-time monitoring, interactive agents that respond during conversation.
Evaluation Metrics for Audio Pipelines
Word Error Rate (WER) for ASR
WER is the standard ASR metric: the edit distance between the predicted and reference transcripts, normalized by reference length.
WER = (Substitutions + Insertions + Deletions) / Reference Words
Whisper Large v3 achieves approximately 3-5% WER on clean English speech. On noisy or accented speech, WER can exceed 15-20%. Always evaluate on audio that matches your production distribution.
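The formula above reduces to a word-level edit distance. A minimal sketch is shown below; real evaluations usually normalize case and punctuation before scoring, which this example skips.

```python
def word_error_rate(reference, hypothesis):
    """Word-level Levenshtein distance divided by the reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j]: edit distance between the reference prefix seen so far and hyp[:j]
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            curr.append(min(prev[j] + 1,              # deletion
                            curr[j - 1] + 1,          # insertion
                            prev[j - 1] + (r != h)))  # substitution or match
        prev = curr
    return prev[-1] / len(ref)

wer = word_error_rate("the cat sat on the mat", "the cat sat on mat")
```

Here one of six reference words is missing, so the WER is 1/6. Because insertions count as errors, WER can exceed 100% when the hypothesis is much longer than the reference, which is what a hallucinating model produces.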
Diarization Error Rate (DER) for Speaker Diarization
DER measures the fraction of time that is incorrectly attributed to a speaker. It has three components:
DER = (False Alarm + Missed + Confusion) / Total Speech Time
PyAnnote 3.1 reports DER from roughly 10% to over 20% on common benchmarks, depending on the dataset. A major error source is overlapping speech, where two speakers talk simultaneously.
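The three DER components can be illustrated with a frame-level sketch. It makes simplifying assumptions that real scoring tools do not: at most one speaker per frame (no overlap), hypothesis labels already mapped onto reference speaker names, and no forgiveness collar around boundaries.

```python
def diarization_error_rate(reference, hypothesis):
    """Frame-level DER for single-speaker-per-frame label sequences.

    None marks non-speech. Assumes hypothesis labels are already mapped
    onto reference speaker names.
    """
    if len(reference) != len(hypothesis):
        raise ValueError("sequences must cover the same frames")
    false_alarm = missed = confusion = speech = 0
    for ref, hyp in zip(reference, hypothesis):
        if ref is not None:
            speech += 1
        if ref is None and hyp is not None:
            false_alarm += 1          # speech predicted where there is none
        elif ref is not None and hyp is None:
            missed += 1               # real speech not detected
        elif ref is not None and ref != hyp:
            confusion += 1            # speech attributed to the wrong speaker
    if speech == 0:
        raise ValueError("reference contains no speech")
    return (false_alarm + missed + confusion) / speech

ref = ["A", "A", "A", "A", None, None, "B", "B", "B", "B"]
hyp = ["A", "A", "A", None, "A", None, "B", "B", "B", "A"]
der = diarization_error_rate(ref, hyp)
```

The example has one missed frame, one false alarm, and one confused frame over eight frames of reference speech, giving a DER of 3/8.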
Retrieval Metrics for Audio Search
For audio embedding search, use the same metrics as visual search: Recall@K (the fraction of all relevant items that appear in the top K results), Mean Average Precision (mAP), and Normalized Discounted Cumulative Gain (nDCG).
The key difference: always evaluate cross-modal retrieval (text query → audio results) separately from within-modal retrieval (audio query → audio results). Cross-modal performance is typically 10-20% lower due to the modality gap.
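Recall@K and average precision for a single query can be computed as below; mAP is the mean of the average precision over all queries. The document IDs are invented for illustration. Running the same functions once on text-to-audio queries and once on audio-to-audio queries gives the separate numbers recommended above.

```python
def recall_at_k(ranked_ids, relevant_ids, k):
    """Fraction of all relevant items that appear in the top k results."""
    relevant = set(relevant_ids)
    hits = sum(1 for doc_id in ranked_ids[:k] if doc_id in relevant)
    return hits / len(relevant)

def average_precision(ranked_ids, relevant_ids):
    """Mean of precision@rank taken at each rank where a relevant item appears."""
    relevant = set(relevant_ids)
    hits, precision_sum = 0, 0.0
    for rank, doc_id in enumerate(ranked_ids, start=1):
        if doc_id in relevant:
            hits += 1
            precision_sum += hits / rank
    return precision_sum / len(relevant)

ranked = ["a", "b", "c", "d"]     # system output, best first
relevant = {"a", "c", "e"}        # ground truth; "e" was never retrieved
r_at_2 = recall_at_k(ranked, relevant, 2)
r_at_4 = recall_at_k(ranked, relevant, 4)
ap = average_precision(ranked, relevant)
```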
Common Pitfalls
Assuming transcription is lossless. Transcripts lose speaker identity, timing precision, tone, and non-speech sounds. If your use case requires any of these, extract them as separate features.
Using a single embedding for long audio. CLAP is designed for clips under 30 seconds. Encoding a 60-minute recording as a single embedding averages out all the detail. Chunk first, embed each chunk, and index them as separate searchable units.
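Ignoring sample rate mismatches. Most speech models expect 16 kHz audio. Feeding 44.1 kHz audio without resampling produces garbage: the model reads the samples as if they were recorded at 16 kHz, so the speech appears slowed down by a factor of about 2.76 and shifted down in pitch. Always resample to the model's expected rate before processing.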
Skipping VAD before ASR. Running Whisper on long silent stretches or music causes hallucination -- the model generates plausible-sounding but entirely fabricated transcripts. Always run VAD first and only transcribe segments that contain speech.
Not calibrating diarization for your domain. The default clustering threshold in PyAnnote assumes conversational audio with clear turn-taking. In a noisy call center or multi-party meeting, you need to tune the threshold on a labeled sample from your domain.