Forced Alignment for AI Agents: Word Timestamps, Diarization, and Audio Evidence Search

Why Timestamps Are a Retrieval Primitive

Audio and video RAG often starts with automatic speech recognition. That is necessary, but it is not enough.

A transcript without accurate timing is just a long text file. An agent can quote it, but it cannot reliably answer the operational question: where in the source did this happen?

For spoken media, useful evidence has four parts:

Field	What it answers
Text	What was said
Time span	When it was said
Speaker or channel	Who said it, when available
Source handle	Which audio or video object contains it

The time span is what turns a transcript result into inspectable evidence. It lets the agent deep-link to the moment in a call, meeting, podcast, surveillance clip, lecture, deposition, or broadcast.

This is the difference between:

"The customer asked for a refund."

"The customer asked for a refund in call_481 at 00:13:42.180 to 00:13:46.920."

The second answer can be verified.
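A minimal sketch of how a span's millisecond offsets can be rendered into a citation like the one above. The function names `format_timestamp` and `format_citation` are illustrative assumptions, not part of any library.

```python
def format_timestamp(ms: int) -> str:
    """Render a millisecond offset as HH:MM:SS.mmm."""
    hours, rem = divmod(ms, 3_600_000)
    minutes, rem = divmod(rem, 60_000)
    seconds, millis = divmod(rem, 1_000)
    return f"{hours:02d}:{minutes:02d}:{seconds:02d}.{millis:03d}"


def format_citation(source_id: str, start_ms: int, end_ms: int) -> str:
    """Build a human-readable citation for one span of one source."""
    return f"{source_id} at {format_timestamp(start_ms)} to {format_timestamp(end_ms)}"


print(format_citation("call_481", 822180, 826920))
# call_481 at 00:13:42.180 to 00:13:46.920
```

Storing integer milliseconds and formatting only at display time keeps the stored value unambiguous and easy to compare.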

What Forced Alignment Does

Forced alignment takes two inputs:

1. An audio signal.

2. A known transcript or token sequence.

It returns timing boundaries for the transcript units. Those units may be phonemes, characters, words, phrases, or subtitle segments.

ASR answers: "What text is in this audio?"

Forced alignment answers: "Given this text, where does each part occur in the audio?"

That distinction matters. A transcription model may produce segment-level timestamps, but those timestamps are often too coarse for retrieval citations. Segment boundaries can drift, especially with long audio, music, overlap, silence, accents, code-switching, or noisy recordings. Forced alignment is a second pass that snaps transcript units to acoustic evidence.

The Basic Pipeline

A production alignment pipeline usually looks like this:

audio or video object
  -> voice activity detection
  -> ASR transcript
  -> optional normalization
  -> forced alignment
  -> speaker diarization join
  -> span chunking
  -> embedding and sparse indexing
  -> agent retrieval tool

Each stage changes the retrieval contract.

Voice activity detection removes silence and non-speech so the aligner does not waste work. ASR gives the text. Normalization makes transcript tokens match what the aligner expects. Forced alignment gives boundaries. Diarization attaches speaker turns. Chunking decides what span size should be searchable. Indexing makes the spans retrievable.

The output should not be a plain transcript. It should be a table of span records, one row per span. A single record looks like this:

{
  "source_id": "call_481",
  "source_uri": "s3://support-calls/2026/06/09/call_481.wav",
  "span_id": "call_481:000822180-000826920",
  "text": "I would like a refund because the outage affected our launch",
  "start_ms": 822180,
  "end_ms": 826920,
  "speaker": "customer",
  "asr_model": "Qwen/Qwen3-ASR-1.7B",
  "aligner_model": "Qwen/Qwen3-ForcedAligner-0.6B",
  "alignment_confidence": 0.91
}

An agent can search this record, cite it, and request the source clip for inspection.
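One way to keep these records consistent in code is a small typed container. This is a sketch under assumptions: the class name `SpanRecord`, the validation rule, and the zero-padded `span_id` format (inferred from the example record above) are illustrative choices, not a fixed schema.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class SpanRecord:
    """One searchable, citable span of speech (hypothetical schema)."""
    source_id: str
    source_uri: str
    text: str
    start_ms: int
    end_ms: int
    speaker: Optional[str] = None
    alignment_confidence: Optional[float] = None

    def __post_init__(self):
        # Reject spans that could never be played back.
        if not 0 <= self.start_ms < self.end_ms:
            raise ValueError("span needs 0 <= start_ms < end_ms")

    @property
    def span_id(self) -> str:
        # Zero-padded offsets keep IDs sortable as strings.
        return f"{self.source_id}:{self.start_ms:09d}-{self.end_ms:09d}"


span = SpanRecord(
    source_id="call_481",
    source_uri="s3://support-calls/2026/06/09/call_481.wav",
    text="I would like a refund because the outage affected our launch",
    start_ms=822180,
    end_ms=826920,
    speaker="customer",
)
print(span.span_id)
# call_481:000822180-000826920
```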

Alignment Algorithms

Forced alignment is an old speech problem with newer model families. The main approaches differ in how they score possible alignments between audio frames and transcript units.

1. HMM and Kaldi-Style Alignment

Traditional forced aligners, including the Montreal Forced Aligner family, rely on acoustic models, pronunciation dictionaries, and Hidden Markov Models. The transcript is expanded into a likely phoneme sequence. The acoustic model scores which phoneme is likely at each frame. Dynamic programming finds the most likely path through the sequence.

This works well when:

You have a pronunciation dictionary.

The language is supported.

The transcript is clean.

You need phoneme-level or word-level linguistic alignment.

It is less convenient for arbitrary multilingual production media because dictionaries, acoustic models, and text normalization rules become operational dependencies.

2. CTC Forced Alignment

Connectionist Temporal Classification models output frame-level probabilities over tokens plus a blank symbol. The aligner searches for the path through those frame probabilities that collapses to the target transcript.

The common dynamic programming formulation is similar to Viterbi decoding:

1. Convert transcript tokens into an expanded sequence with blanks.

2. Score every audio frame against every possible token state.

3. Find the highest probability monotonic path.

4. Collapse blanks and repeated tokens into word or character boundaries.
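The four steps can be sketched in plain Python. This is a toy illustration, not a production aligner: it assumes you already have per-frame log-probabilities from some CTC model, and the function name `ctc_forced_align` and the tiny three-token vocabulary are made up for the example.

```python
import math


def ctc_forced_align(log_probs, targets, blank=0):
    """Viterbi-style CTC alignment.

    log_probs: T frames, each a list of log-probabilities indexed by token id.
    targets:   transcript token ids, without blanks.
    Returns one (token_id, first_frame, last_frame) tuple per target token.
    """
    num_frames = len(log_probs)

    # Step 1: expand the transcript with blanks: [blank, t1, blank, t2, ..., blank].
    states = [blank]
    for token in targets:
        states += [token, blank]
    num_states = len(states)

    # Step 2: score[t][s] is the best log-probability of any path that is in
    # state s at frame t; back[t][s] remembers which state it came from.
    neg_inf = -math.inf
    score = [[neg_inf] * num_states for _ in range(num_frames)]
    back = [[0] * num_states for _ in range(num_frames)]
    score[0][0] = log_probs[0][states[0]]
    if num_states > 1:
        score[0][1] = log_probs[0][states[1]]
    for t in range(1, num_frames):
        for s in range(num_states):
            # Monotonic moves: stay, advance one state, or skip the blank
            # that sits between two different tokens.
            prev, best = s, score[t - 1][s]
            if s >= 1 and score[t - 1][s - 1] > best:
                prev, best = s - 1, score[t - 1][s - 1]
            can_skip = s >= 2 and states[s] != blank and states[s] != states[s - 2]
            if can_skip and score[t - 1][s - 2] > best:
                prev, best = s - 2, score[t - 1][s - 2]
            if best > neg_inf:
                score[t][s] = best + log_probs[t][states[s]]
                back[t][s] = prev

    # Step 3: the best path must end in the last token or the trailing blank.
    s = num_states - 1
    if num_states > 1 and score[-1][s - 1] > score[-1][s]:
        s -= 1
    if score[-1][s] == neg_inf:
        raise ValueError("not enough frames to emit every target token")
    path = []
    for t in range(num_frames - 1, -1, -1):
        path.append(s)
        s = back[t][s]
    path.reverse()

    # Step 4: drop blanks and merge repeated frames into one span per token.
    spans = []
    for frame, s in enumerate(path):
        if states[s] == blank:
            continue
        target_index = (s - 1) // 2
        if spans and spans[-1][0] == target_index:
            spans[-1][2] = frame
        else:
            spans.append([target_index, frame, frame])
    return [(targets[i], first, last) for i, first, last in spans]


# Toy vocabulary: 0 = blank, 1 = "a", 2 = "b". Five frames of probabilities.
frames = [[0.8, 0.1, 0.1], [0.1, 0.8, 0.1], [0.1, 0.8, 0.1],
          [0.8, 0.1, 0.1], [0.1, 0.1, 0.8]]
log_probs = [[math.log(p) for p in frame] for frame in frames]
print(ctc_forced_align(log_probs, targets=[1, 2]))
# [(1, 1, 2), (2, 4, 4)]
```

To turn frame indices into milliseconds, multiply by the acoustic model's frame stride. Many wav2vec2-style encoders use 20 ms, but check the model you actually run.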

CTC alignment is useful because it does not need frame-level labels at training time. It is common in wav2vec2-style alignment systems and WhisperX-style pipelines.

The main failure mode is tokenization mismatch. If the transcript text and the aligner vocabulary disagree, the model may force a bad path.
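A common mitigation is to normalize the transcript before alignment while remembering which original word each normalized token came from, so timestamps can be mapped back onto the display text. The sketch below is a deliberately simple assumption of what that could look like. Real aligners have their own vocabularies and may also need number expansion or language-specific rules, which this example does not attempt.

```python
import re
import unicodedata


def normalize_for_alignment(transcript):
    """Return (normalized_token, original_word_index) pairs.

    Tokens that normalize to nothing (pure punctuation) are dropped, but the
    index still points into transcript.split(), so aligned timestamps can be
    attached to the original words afterwards.
    """
    pairs = []
    for index, word in enumerate(transcript.split()):
        folded = unicodedata.normalize("NFKC", word).lower()
        cleaned = re.sub(r"[^\w']+", "", folded)  # keep letters, digits, apostrophes
        if cleaned:
            pairs.append((cleaned, index))
    return pairs


print(normalize_for_alignment("Hello, world! ... SKU-447B."))
# [('hello', 0), ('world', 1), ('sku447b', 3)]
```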

3. Dynamic Time Warping

Dynamic Time Warping aligns two sequences by stretching or compressing time while preserving order. It is often used when you have two comparable streams, such as model token timing estimates and transcript tokens.

DTW is attractive because it is simple and robust to local speed changes. It is not a speech recognizer by itself. It needs features or token probabilities that already represent the audio and text.

Use DTW when you need to reconcile two imperfect timing sequences, not when you need a full acoustic alignment system from scratch.
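For intuition, here is a minimal textbook DTW over two numeric sequences, using absolute difference as the local cost. The function name `dtw_path` and the toy inputs are assumptions for illustration. In a real system the sequences would be feature vectors or timing estimates, with a distance function to match.

```python
def dtw_path(a, b):
    """Return (total_cost, path) aligning non-empty sequences a and b.

    path is a list of (index_in_a, index_in_b) pairs in increasing order.
    """
    n, m = len(a), len(b)
    inf = float("inf")
    # cost[i][j] = best cost of aligning a[:i] with b[:j].
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            local = abs(a[i - 1] - b[j - 1])
            cost[i][j] = local + min(
                cost[i - 1][j - 1],  # advance both sequences
                cost[i - 1][j],      # advance a only (b element is stretched)
                cost[i][j - 1],      # advance b only (a element is stretched)
            )

    # Walk back from the end, always taking the cheapest predecessor.
    i, j = n, m
    path = []
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        _, i, j = min(
            (cost[i - 1][j - 1], i - 1, j - 1),
            (cost[i - 1][j], i - 1, j),
            (cost[i][j - 1], i, j - 1),
        )
    path.reverse()
    return cost[n][m], path


total, path = dtw_path([0.0, 1.0, 2.0], [0.0, 1.0, 1.0, 2.0])
print(total, path)
# 0.0 [(0, 0), (1, 1), (1, 2), (2, 3)]
```

The second sequence repeats its middle value, and the path absorbs that by matching one element of the first sequence to two elements of the second. Order is preserved throughout.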

4. Non-Autoregressive Timestamp Prediction

Newer aligners can predict timestamps directly for transcript units. Qwen3-ForcedAligner-0.6B is an example: the model card says it aligns text-speech pairs and returns word or character-level timestamps, with support for timestamp prediction in speech windows up to five minutes.

This is useful for retrieval systems because the output is closer to the data shape agents need:

[
  {"word": "refund", "start_ms": 823940, "end_ms": 824310},
  {"word": "outage", "start_ms": 824980, "end_ms": 825430}
]

The tradeoff is that model-specific runtime and windowing rules matter. You still need validation, confidence handling, and fallbacks.

Why Word Timestamps Are Not Always the Right Unit

Word-level timing sounds ideal, but retrieval rarely needs every word as a separate document. Very small chunks create noisy search results and brittle citations.

Use word timestamps as raw material. Build higher-level searchable spans from them.

Good span units include:

Unit	Best for	Risk
Word	Exact subtitle sync, redaction, clip trimming	Too small for semantic search
Phrase	Spoken claim detection, quote retrieval	Needs punctuation or pause heuristics
Speaker turn	Meetings, calls, interviews	Can be too long when speakers monologue
Semantic chunk	RAG and agent answers	Requires chunker that preserves time
Scene or chapter	Podcasts, lectures, long videos	Too coarse for citations

A strong retrieval system stores multiple units:

Word timestamps for playback and exact citation.

Phrase spans for quote-level retrieval.

Semantic chunks for answer generation.

Speaker turns for conversation structure.

The agent should retrieve chunks, then cite the narrower word or phrase span inside the chunk.
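As a sketch of how phrase spans can be built from word timestamps, the function below groups words by pause length. The thresholds (`max_gap_ms=400`, `max_words=12`) and the function names are arbitrary assumptions chosen for illustration. Tune them against your own audio, or combine them with punctuation cues.

```python
def _close_phrase(words):
    """Collapse a run of aligned words into one phrase span."""
    return {
        "text": " ".join(w["word"] for w in words),
        "start_ms": words[0]["start_ms"],
        "end_ms": words[-1]["end_ms"],
        "words": words,  # keep the word timings for narrow citations
    }


def words_to_phrases(words, max_gap_ms=400, max_words=12):
    """Split time-ordered aligned words into phrases at long pauses."""
    phrases, current = [], []
    for word in words:
        if current:
            gap = word["start_ms"] - current[-1]["end_ms"]
            if gap > max_gap_ms or len(current) >= max_words:
                phrases.append(_close_phrase(current))
                current = []
        current.append(word)
    if current:
        phrases.append(_close_phrase(current))
    return phrases


words = [
    {"word": "I", "start_ms": 0, "end_ms": 100},
    {"word": "want", "start_ms": 120, "end_ms": 300},
    {"word": "a", "start_ms": 320, "end_ms": 360},
    {"word": "refund", "start_ms": 380, "end_ms": 750},
    {"word": "thanks", "start_ms": 1500, "end_ms": 1900},
]
for phrase in words_to_phrases(words):
    print(phrase["text"], phrase["start_ms"], phrase["end_ms"])
# I want a refund 0 750
# thanks 1500 1900
```

Because each phrase keeps its word list, a retrieval hit on the phrase can still be narrowed to the exact words when the agent writes its citation.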

Joining Diarization and Alignment

Speaker diarization answers "who spoke when." Forced alignment answers "where the words occur." They should be joined by interval overlap.

aligned word:     refund     823.940s - 824.310s
speaker segment:  customer   819.000s - 828.400s
result:           refund spoken by customer

The join is simple when one speaker talks at a time and the diarizer's boundaries are accurate. It gets harder when:

Two speakers talk at the same time.

The diarizer changes speaker too late.

ASR merges speech from two speakers into one text span.

A speaker label covers silence or background audio.

Do not hide this uncertainty. Store overlap ratios and confidence fields.

{
  "word": "refund",
  "start_ms": 823940,
  "end_ms": 824310,
  "speaker": "customer",
  "speaker_overlap": 0.96,
  "speaker_confidence": 0.88
}

Agents should receive this uncertainty when the answer depends on attribution.
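The interval join itself is short. The sketch below assigns each word to the speaker turn it overlaps most and records the overlap ratio. The function name `assign_speaker` and the record shapes are assumptions. `speaker_confidence` is not computed here because it would come from the diarizer.

```python
def assign_speaker(word, turns):
    """Attach the best-overlapping speaker turn to an aligned word.

    word:  {"word", "start_ms", "end_ms"}
    turns: list of {"speaker", "start_ms", "end_ms"} from a diarizer.
    """
    duration = word["end_ms"] - word["start_ms"]
    best_speaker, best_overlap = None, 0
    for turn in turns:
        overlap = min(word["end_ms"], turn["end_ms"]) - max(word["start_ms"], turn["start_ms"])
        if overlap > best_overlap:
            best_speaker, best_overlap = turn["speaker"], overlap
    ratio = best_overlap / duration if duration > 0 else 0.0
    return {**word, "speaker": best_speaker, "speaker_overlap": round(ratio, 2)}


turns = [
    {"speaker": "customer", "start_ms": 819000, "end_ms": 828400},
    {"speaker": "agent", "start_ms": 828400, "end_ms": 835000},
]
print(assign_speaker({"word": "refund", "start_ms": 823940, "end_ms": 824310}, turns))
# {'word': 'refund', 'start_ms': 823940, 'end_ms': 824310, 'speaker': 'customer', 'speaker_overlap': 1.0}
print(assign_speaker({"word": "okay", "start_ms": 828300, "end_ms": 828700}, turns))
# {'word': 'okay', 'start_ms': 828300, 'end_ms': 828700, 'speaker': 'agent', 'speaker_overlap': 0.75}
```

The second word straddles a turn boundary, so its overlap ratio drops below 1.0. That is the signal to surface when attribution matters.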

Indexing Aligned Speech

Do not embed timestamps inside the text string. Text like "[00:13:42] I would like a refund" adds token noise to the embedding. Store timestamps as metadata.

Use two parallel representations:

{
  "embedding_text": "I would like a refund because the outage affected our launch",
  "metadata": {
    "source_id": "call_481",
    "start_ms": 822180,
    "end_ms": 826920,
    "speaker": "customer"
  }
}
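A small sketch of producing this two-part record from a span, including a guard that strips bracketed inline timestamps if an upstream tool left them in the text. The regular expression and the function name `to_index_record` are illustrative assumptions and cover only simple `[MM:SS]` and `[HH:MM:SS]` style markers.

```python
import re

# Matches markers such as "[13:42]", "[00:13:42]" or "[00:13:42.180]".
INLINE_TIMESTAMP = re.compile(r"\[\d{1,2}:\d{2}(?::\d{2})?(?:\.\d+)?\]\s*")


def to_index_record(span):
    """Split a span into clean embedding text and filterable metadata."""
    clean_text = INLINE_TIMESTAMP.sub("", span["text"]).strip()
    metadata = {key: value for key, value in span.items() if key != "text"}
    return {"embedding_text": clean_text, "metadata": metadata}


record = to_index_record({
    "text": "[00:13:42] I would like a refund",
    "source_id": "call_481",
    "start_ms": 822180,
    "end_ms": 826920,
    "speaker": "customer",
})
print(record["embedding_text"])
# I would like a refund
```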

For retrieval, combine:

1. Dense embedding search for semantic meaning.

2. BM25 or sparse search for exact terms, names, and product codes.

3. Metadata filters for speaker, date, source type, language, or customer.

4. Reranking for the top candidates.

This matters for agent search because spoken evidence contains both semantic and exact-match questions:

Semantic: "Where did the customer sound frustrated about reliability?"

Exact: "Where did someone say SKU-447B?"

Filtered: "Find refund mentions by the customer, not the support agent."

Temporal: "What did the engineer say after the outage was mentioned?"
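The text above does not prescribe how to merge dense and sparse results. One widely used option is reciprocal rank fusion (RRF), sketched here as an assumption rather than a recommendation from the source. The span IDs are placeholders, and `k=60` is the conventional default constant for RRF.

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Merge several ranked lists of span IDs into one ranking.

    Each list contributes 1 / (k + rank) per item, so items that rank well
    in more than one list rise to the top.
    """
    scores = {}
    for ranking in rankings:
        for rank, span_id in enumerate(ranking, start=1):
            scores[span_id] = scores.get(span_id, 0.0) + 1.0 / (k + rank)
    # Sort by fused score, breaking ties by ID for deterministic output.
    return sorted(scores, key=lambda span_id: (-scores[span_id], span_id))


dense_hits = ["span_a", "span_b", "span_c"]   # from embedding search
sparse_hits = ["span_b", "span_d", "span_a"]  # from BM25 or another sparse index
print(reciprocal_rank_fusion([dense_hits, sparse_hits]))
# ['span_b', 'span_a', 'span_d', 'span_c']
```

Metadata filters would normally be applied inside each search before fusion, and a reranker would then rescore the top of the fused list.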

Query Planning for Agents

Expose aligned speech as a bounded tool. The agent should not receive the entire transcript unless it asks for expansion.

A useful tool schema:

{
  "name": "search_spoken_evidence",
  "description": "Search timestamped audio and video transcripts. Returns cited spans with source URI, speaker, start_ms, end_ms, and nearby context.",
  "input_schema": {
    "type": "object",
    "properties": {
      "query": {"type": "string"},
      "speaker": {"type": "string"},
      "source_type": {"type": "string", "enum": ["audio", "video", "any"]},
      "date_range": {"type": "string"},
      "top_k": {"type": "integer", "minimum": 1, "maximum": 20}
    },
    "required": ["query"]
  }
}

The result should include inspectable handles:

{
  "results": [
    {
      "text": "I would like a refund because the outage affected our launch",
      "speaker": "customer",
      "source_uri": "s3://support-calls/2026/06/09/call_481.wav",
      "start_ms": 822180,
      "end_ms": 826920,
      "score": 0.84,
      "clip_uri": "mixpeek://clips/call_481/822180-826920"
    }
  ]
}

The agent can answer with a citation and ask for a follow-up clip only if needed.
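The tool description promises nearby context. One simple way to produce it, sketched here under assumptions, is to return spans from the same source that fall within a time window around the hit. The function name `expand_context` and the five-second default are illustrative choices.

```python
def expand_context(hit, spans, window_ms=5000):
    """Return spans from the same source within window_ms of the hit."""
    lower = hit["start_ms"] - window_ms
    upper = hit["end_ms"] + window_ms
    nearby = [
        span for span in spans
        if span["source_uri"] == hit["source_uri"]
        and span["end_ms"] > lower
        and span["start_ms"] < upper
    ]
    return sorted(nearby, key=lambda span: span["start_ms"])


uri = "s3://support-calls/2026/06/09/call_481.wav"
spans = [
    {"source_uri": uri, "start_ms": 800000, "end_ms": 805000, "text": "earlier topic"},
    {"source_uri": uri, "start_ms": 819000, "end_ms": 822000, "text": "the outage lasted two hours"},
    {"source_uri": uri, "start_ms": 822180, "end_ms": 826920, "text": "I would like a refund"},
    {"source_uri": "s3://other.wav", "start_ms": 822000, "end_ms": 823000, "text": "unrelated call"},
]
hit = spans[2]
print([span["text"] for span in expand_context(hit, spans)])
# ['the outage lasted two hours', 'I would like a refund']
```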

Evaluation

Evaluate alignment separately from retrieval. A retrieval system can find the right transcript chunk while still citing the wrong second.

Measure at least five things:

Metric	What it catches
Boundary error	Average absolute error for word or phrase start/end times
Coverage	Percentage of transcript tokens that received usable timestamps
Speaker attribution accuracy	Whether words are assigned to the correct speaker
Citation hit rate	Whether the cited span actually contains the answer
Expansion quality	Whether adjacent context helps without pulling in unrelated speech

For agents, add task-level evals:

1. Ask the agent a question that requires spoken evidence.

2. Require an answer with source ID and timestamp.

3. Check whether the cited time span contains the supporting speech.

4. Penalize answers that cite only the file without a span.

5. Penalize answers that cite the right transcript but wrong speaker.

The important metric is not only word error rate. It is whether a human can click the citation and verify the answer.
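Three of these metrics are easy to sketch. The definitions below are assumptions made for illustration: words are paired by position for boundary error, a timestamp is "usable" when both ends exist and the span has positive length, and a citation counts as a hit when it covers at least half of a non-empty gold span in the same source. Adjust the thresholds to your own labeling guidelines.

```python
def mean_boundary_error_ms(predicted, reference):
    """Mean absolute start/end error over words paired by position."""
    if len(predicted) != len(reference) or not predicted:
        raise ValueError("need two non-empty word lists of equal length")
    errors = []
    for pred, ref in zip(predicted, reference):
        errors.append(abs(pred["start_ms"] - ref["start_ms"]))
        errors.append(abs(pred["end_ms"] - ref["end_ms"]))
    return sum(errors) / len(errors)


def coverage(words):
    """Fraction of words that received a usable timestamp."""
    if not words:
        return 0.0
    usable = sum(
        1 for w in words
        if w.get("start_ms") is not None
        and w.get("end_ms") is not None
        and w["end_ms"] > w["start_ms"]
    )
    return usable / len(words)


def citation_hit(cited, gold, min_overlap=0.5):
    """True when the cited span covers enough of the gold span."""
    if cited["source_id"] != gold["source_id"]:
        return False
    overlap = min(cited["end_ms"], gold["end_ms"]) - max(cited["start_ms"], gold["start_ms"])
    return overlap >= min_overlap * (gold["end_ms"] - gold["start_ms"])


predicted = [{"start_ms": 0, "end_ms": 100}, {"start_ms": 120, "end_ms": 300}]
reference = [{"start_ms": 10, "end_ms": 100}, {"start_ms": 120, "end_ms": 330}]
print(mean_boundary_error_ms(predicted, reference))
# 10.0
print(coverage([{"start_ms": 0, "end_ms": 100}, {"start_ms": None, "end_ms": None}]))
# 0.5
gold = {"source_id": "call_481", "start_ms": 822180, "end_ms": 826920}
print(citation_hit({"source_id": "call_481", "start_ms": 822000, "end_ms": 827000}, gold))
# True
```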

Failure Modes

Transcript mismatch. If ASR hallucinated, omitted, or normalized text too aggressively, the aligner may force timestamps onto words that were never spoken.

Punctuation drift. Punctuation is not acoustic. Use it to build spans, but do not expect punctuation boundaries to align cleanly with audio.

Overlapping speech. One transcript line may contain speech from two speakers. Store uncertainty and consider overlap-aware diarization.

Music and background audio. Song lyrics, crowd noise, and background speech can confuse ASR and alignment. Track non-speech intervals.

Long-window drift. Alignment windows that are too long can drift. Windows that are too short lose context. Use voice activity and sentence boundaries to choose windows.

Embedding pollution. Inline timestamps, speaker labels, and JSON blobs can hurt semantic embeddings. Keep clean text for embedding and structured metadata for filters.
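For the long-window drift item above, one way to choose windows is to pack consecutive voice-activity segments greedily until a maximum window length is reached. This is a sketch under assumptions: the function name `pack_windows` is made up, and the 300,000 ms default simply mirrors the five-minute window mentioned earlier for Qwen3-ForcedAligner-0.6B. A single speech segment longer than the limit is passed through as its own oversized window and would need further splitting.

```python
def pack_windows(speech_segments, max_window_ms=300_000):
    """Group (start_ms, end_ms) speech segments into alignment windows.

    Windows always start and end on speech boundaries, so no word is cut
    in the middle of a segment.
    """
    windows = []
    start = end = None
    for seg_start, seg_end in sorted(speech_segments):
        if start is None:
            start, end = seg_start, seg_end
        elif seg_end - start <= max_window_ms:
            end = max(end, seg_end)  # segment still fits in the open window
        else:
            windows.append((start, end))
            start, end = seg_start, seg_end
    if start is not None:
        windows.append((start, end))
    return windows


segments = [(0, 120_000), (125_000, 280_000), (290_000, 400_000), (410_000, 500_000)]
print(pack_windows(segments))
# [(0, 280000), (290000, 500000)]
```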

Mixpeek Example

In Mixpeek, model this as a speech evidence pipeline rather than a transcript upload.

from mixpeek import Mixpeek

mx = Mixpeek(api_key="YOUR_API_KEY")

mx.ingest.audio(
    collection_id="support-calls",
    source="s3://support-calls/2026/06/",
    feature_extractors=[
        {
            "name": "audio_transcription",
            "model_id": "Qwen/Qwen3-ASR-1.7B",
            "params": {
                "return_text": True,
                "return_word_timestamps": True,
                "forced_aligner": "Qwen/Qwen3-ForcedAligner-0.6B"
            }
        },
        {
            "name": "speaker_diarization",
            "model_id": "pyannote/speaker-diarization-community-1"
        },
        {
            "name": "text_embedding",
            "model_id": "BAAI/bge-m3",
            "params": {
                "source_field": "aligned_speech_spans.text"
            }
        }
    ]
)

Then expose a retrieval tool:

results = mx.retrievers.execute(
    retriever_id="your-retriever-id",
    query="customer requested refund after outage",
)

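If you already generated aligned spans outside Mixpeek, store them in MVS with clean text vectors and timestamp metadata. In this snippet, span_embedding stands for the vector you already computed from the span's clean text: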

mx.mvs.upsert(
    namespace="support-call-memory",
    vectors=[
        {
            "id": "call_481:822180:826920:bge_m3",
            "values": span_embedding,
            "metadata": {
                "source_uri": "s3://support-calls/2026/06/09/call_481.wav",
                "text": "I would like a refund because the outage affected our launch",
                "speaker": "customer",
                "start_ms": 822180,
                "end_ms": 826920,
                "asr_model": "Qwen/Qwen3-ASR-1.7B",
                "aligner_model": "Qwen/Qwen3-ForcedAligner-0.6B"
            }
        }
    ]
)

Design Checklist

Store clean transcript text separately from timestamp metadata.

Preserve source URI, source ID, language, model IDs, extractor versions, and timing units.

Keep word-level timestamps even if retrieval chunks are phrase or semantic spans.

Join diarization and alignment by interval overlap, not by transcript line number.

Store confidence and overlap fields when speaker attribution is uncertain.

Use dense and sparse retrieval together for spoken evidence.

Return source URI, start time, end time, and nearby context to agents.

Evaluate citation hit rate, not only ASR word error rate.

Backfill alignment when ASR models, diarizers, or normalization rules change.

Key Takeaways

1. ASR makes audio searchable. Forced alignment makes it citeable.

2. Word timestamps are raw evidence, not always the retrieval chunk.

3. Keep timestamps as metadata. Do not pollute embedding text with timing strings.

4. Diarization and alignment need interval joins with uncertainty, especially during overlap.

5. Agent tools should return compact cited spans with source handles and expansion paths.

6. The eval that matters is whether a human can click the cited span and verify the answer.
