Why Timestamps Are a Retrieval Primitive
Audio and video RAG often starts with automatic speech recognition. That is necessary, but it is not enough.
A transcript without accurate timing is just a long text file. An agent can quote it, but it cannot reliably answer the operational question: where in the source did this happen?
For spoken media, useful evidence has four parts:
| Field | What it answers |
| Text | What was said |
| Time span | When it was said |
| Speaker or channel | Who said it, when available |
| Source handle | Which audio or video object contains it |
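This is the difference between an answer that only restates the transcript and an answer that also points to a time span in a specific source. The second answer can be verified.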
What Forced Alignment Does
Forced alignment takes two inputs:
1. An audio signal.
2. A known transcript or token sequence.
It returns timing boundaries for the transcript units. Those units may be phonemes, characters, words, phrases, or subtitle segments.
ASR answers: "What text is in this audio?"
Forced alignment answers: "Given this text, where does each part occur in the audio?"
That distinction matters. A transcription model may produce segment-level timestamps, but those timestamps are often too coarse for retrieval citations. Segment boundaries can drift, especially with long audio, music, overlap, silence, accents, code-switching, or noisy recordings. Forced alignment is a second pass that snaps transcript units to acoustic evidence.
The Basic Pipeline
A production alignment pipeline usually looks like this:
audio or video object
  -> voice activity detection
  -> ASR transcript
  -> optional normalization
  -> forced alignment
  -> speaker diarization join
  -> span chunking
  -> embedding and sparse indexing
  -> agent retrieval tool
Each stage changes the retrieval contract.
Voice activity detection removes silence and non-speech so the aligner does not waste work. ASR gives the text. Normalization makes transcript tokens match what the aligner expects. Forced alignment gives boundaries. Diarization attaches speaker turns. Chunking decides what span size should be searchable. Indexing makes the spans retrievable.
The output should not be a plain transcript. It should be a span table:
{
  "source_id": "call_481",
  "source_uri": "s3://support-calls/2026/06/09/call_481.wav",
  "span_id": "call_481:000822180-000826920",
  "text": "I would like a refund because the outage affected our launch",
  "start_ms": 822180,
  "end_ms": 826920,
  "speaker": "customer",
  "asr_model": "Qwen/Qwen3-ASR-1.7B",
  "aligner_model": "Qwen/Qwen3-ForcedAligner-0.6B",
  "alignment_confidence": 0.91
}
An agent can search this record, cite it, and request the source clip for inspection.
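As a rough illustration of the record above, the sketch below builds a span record in Python and derives the `span_id` from the source ID and the time span. The `SpeechSpan` class and its validation rule are assumptions made for this example, not part of any library; only the field names and the zero-padded ID format are taken from the record shown.

```python
from dataclasses import asdict, dataclass
from typing import Optional


@dataclass(frozen=True)
class SpeechSpan:
    """Hypothetical container for one aligned, citeable span."""

    source_id: str
    source_uri: str
    text: str
    start_ms: int
    end_ms: int
    speaker: Optional[str] = None
    alignment_confidence: Optional[float] = None

    def __post_init__(self) -> None:
        # Reject spans that cannot be played back as a clip.
        if self.start_ms < 0 or self.end_ms <= self.start_ms:
            raise ValueError("span must satisfy 0 <= start_ms < end_ms")

    @property
    def span_id(self) -> str:
        # Nine zero-padded digits per boundary, mirroring the example record.
        return f"{self.source_id}:{self.start_ms:09d}-{self.end_ms:09d}"

    def to_record(self) -> dict:
        record = asdict(self)
        record["span_id"] = self.span_id
        return record


span = SpeechSpan(
    source_id="call_481",
    source_uri="s3://support-calls/2026/06/09/call_481.wav",
    text="I would like a refund because the outage affected our launch",
    start_ms=822180,
    end_ms=826920,
    speaker="customer",
    alignment_confidence=0.91,
)
```

Deriving the ID from the boundaries keeps it stable across re-indexing, as long as the alignment itself does not change.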
Alignment Algorithms
Forced alignment is an old speech problem with newer model families. The main approaches differ in how they score possible alignments between audio frames and transcript units.
1. HMM and Kaldi-Style Alignment
Traditional forced aligners, including the Montreal Forced Aligner family, rely on acoustic models, pronunciation dictionaries, and Hidden Markov Models. The transcript is expanded into a likely phoneme sequence. The acoustic model scores which phoneme is likely at each frame. Dynamic programming finds the most likely path through the sequence.
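This works well when a pronunciation dictionary and a trained acoustic model are available for the language, and the transcript closely matches what was spoken.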
It is less convenient for arbitrary multilingual production media because dictionaries, acoustic models, and text normalization rules become operational dependencies.
2. CTC Forced Alignment
Connectionist Temporal Classification models output frame-level probabilities over tokens plus a blank symbol. The aligner searches for the path through those frame probabilities that collapses to the target transcript.
The common dynamic programming formulation is similar to Viterbi decoding:
1. Convert transcript tokens into an expanded sequence with blanks.
2. Score every audio frame against every possible token state.
3. Find the highest probability monotonic path.
4. Collapse blanks and repeated tokens into word or character boundaries.
CTC alignment is useful because it does not need frame-level labels at training time. It is common in wav2vec2-style alignment systems and WhisperX-style pipelines.
The main failure mode is tokenization mismatch. If the transcript text and the aligner vocabulary disagree, the model may force a bad path.
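The four steps above can be sketched in plain Python. This is a minimal illustration, not the implementation used by any particular toolkit: it assumes the frame-level log-probabilities are already available as nested lists, that token ID 0 is the blank, and that the toy vocabulary and the `frame` helper exist only for the demo.

```python
import math


def ctc_forced_align(log_probs, targets, blank=0):
    """Viterbi search for the best CTC path that collapses to `targets`.

    log_probs: T frames, each a list of log-probabilities over the vocabulary.
    targets:   token ids of the known transcript, without blanks.
    Returns one (target_index, first_frame, last_frame) tuple per target token.
    """
    # Step 1: expand the transcript: blank, t1, blank, t2, ..., blank.
    states = [blank]
    for tok in targets:
        states += [tok, blank]
    T, S = len(log_probs), len(states)
    neg_inf = -math.inf

    # Step 2: score[t][s] is the best log-probability of being in state s at frame t.
    score = [[neg_inf] * S for _ in range(T)]
    back = [[0] * S for _ in range(T)]
    score[0][0] = log_probs[0][states[0]]
    if S > 1:
        score[0][1] = log_probs[0][states[1]]

    # Step 3: monotonic transitions: stay, advance one state, or skip the
    # blank between two different tokens.
    for t in range(1, T):
        for s in range(S):
            preds = [s]
            if s >= 1:
                preds.append(s - 1)
            if s >= 2 and states[s] != blank and states[s] != states[s - 2]:
                preds.append(s - 2)
            best = max(preds, key=lambda p: score[t - 1][p])
            score[t][s] = score[t - 1][best] + log_probs[t][states[s]]
            back[t][s] = best

    # A valid path ends on the last token or on the trailing blank.
    end = S - 1
    if S > 1 and score[T - 1][S - 2] > score[T - 1][S - 1]:
        end = S - 2
    if score[T - 1][end] == neg_inf:
        raise ValueError("not enough frames to fit the transcript")

    path = [end]
    for t in range(T - 1, 0, -1):
        path.append(back[t][path[-1]])
    path.reverse()

    # Step 4: collapse the frame-level path into one frame range per token.
    ranges = {}
    for t, s in enumerate(path):
        if states[s] != blank:
            idx = s // 2  # states 1, 3, 5, ... hold tokens 0, 1, 2, ...
            first = ranges[idx][0] if idx in ranges else t
            ranges[idx] = (first, t)
    return [(i, ranges[i][0], ranges[i][1]) for i in range(len(targets))]


def frame(best, vocab_size=3):
    # Toy frame: 0.8 on one token id, 0.1 on the others.
    return [math.log(0.8 if i == best else 0.1) for i in range(vocab_size)]


# Five toy frames: blank, token 1, token 1, blank, token 2.
frames = [frame(0), frame(1), frame(1), frame(0), frame(2)]
alignment = ctc_forced_align(frames, targets=[1, 2])
```

Frame indices become timestamps by multiplying by the acoustic model's frame duration, which depends on the model and is not fixed by the algorithm.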
3. Dynamic Time Warping
Dynamic Time Warping aligns two sequences by stretching or compressing time while preserving order. It is often used when you have two comparable streams, such as model token timing estimates and transcript tokens.
DTW is attractive because it is simple and robust to local speed changes. It is not a speech recognizer by itself. It needs features or token probabilities that already represent the audio and text.
Use DTW when you need to reconcile two imperfect timing sequences, not when you need a full acoustic alignment system from scratch.
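For intuition, here is a minimal textbook DTW over two numeric sequences, using absolute difference as the local cost. The function name and the toy inputs are illustrative assumptions; a real system would compare feature vectors or timing estimates rather than small integers.

```python
def dtw_path(a, b):
    """Align sequences a and b; return (total_cost, list of (i, j) index pairs)."""
    n, m = len(a), len(b)
    inf = float("inf")
    # cost[i][j] is the cheapest alignment of a[:i] with b[:j].
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            local = abs(a[i - 1] - b[j - 1])
            cost[i][j] = local + min(
                cost[i - 1][j - 1],  # advance both sequences
                cost[i - 1][j],      # stretch b: repeat b[j - 1]
                cost[i][j - 1],      # stretch a: repeat a[i - 1]
            )

    # Walk back from the end, always taking the cheapest predecessor.
    path = []
    i, j = n, m
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        _, i, j = min(
            (cost[i - 1][j - 1], i - 1, j - 1),
            (cost[i - 1][j], i - 1, j),
            (cost[i][j - 1], i, j - 1),
        )
    path.reverse()
    return cost[n][m], path


# The second sequence lingers on the middle value, so index 1 of `a`
# is matched to two positions in `b`.
total, path = dtw_path([1, 2, 3], [1, 2, 2, 3])
```

The path preserves order in both sequences, which is the property that makes DTW usable for reconciling two timing estimates of the same utterance.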
4. Non-Autoregressive Timestamp Prediction
Newer aligners can predict timestamps directly for transcript units. Qwen3-ForcedAligner-0.6B is an example: the model card says it aligns text-speech pairs and returns word or character-level timestamps, with support for timestamp prediction in speech windows up to five minutes.
This is useful for retrieval systems because the output is closer to the data shape agents need:
[
  {"word": "refund", "start_ms": 823940, "end_ms": 824310},
  {"word": "outage", "start_ms": 824980, "end_ms": 825430}
]
The tradeoff is that model-specific runtime and windowing rules matter. You still need validation, confidence handling, and fallbacks.
Why Word Timestamps Are Not Always the Right Unit
Word-level timing sounds ideal, but retrieval rarely needs every word as a separate document. Very small chunks create noisy search results and brittle citations.
Use word timestamps as raw material. Build higher-level searchable spans from them.
Good span units include:
| Unit | Best for | Risk |
| Word | Exact subtitle sync, redaction, clip trimming | Too small for semantic search |
| Phrase | Spoken claim detection, quote retrieval | Needs punctuation or pause heuristics |
| Speaker turn | Meetings, calls, interviews | Can be too long when speakers monologue |
| Semantic chunk | RAG and agent answers | Requires chunker that preserves time |
| Scene or chapter | Podcasts, lectures, long videos | Too coarse for citations |
The agent should retrieve chunks, then cite the narrower word or phrase span inside the chunk.
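One way to build phrase spans from word timestamps is a pause heuristic. The sketch below is an assumption-laden example: the 400 ms gap, the 15 second length cap, and the function names are arbitrary choices for illustration and should be tuned on real data.

```python
def _to_span(words):
    # Keep the word records so a citation can narrow down inside the phrase.
    return {
        "text": " ".join(w["word"] for w in words),
        "start_ms": words[0]["start_ms"],
        "end_ms": words[-1]["end_ms"],
        "words": list(words),
    }


def group_words_into_phrases(words, max_gap_ms=400, max_phrase_ms=15_000):
    """Group time-ordered word records into phrase spans.

    A new phrase starts when the silence before a word exceeds max_gap_ms,
    or when adding the word would make the phrase longer than max_phrase_ms.
    """
    phrases, current = [], []
    for word in words:
        if current:
            gap = word["start_ms"] - current[-1]["end_ms"]
            length = word["end_ms"] - current[0]["start_ms"]
            if gap > max_gap_ms or length > max_phrase_ms:
                phrases.append(_to_span(current))
                current = []
        current.append(word)
    if current:
        phrases.append(_to_span(current))
    return phrases


# Made-up word timings with a 700 ms pause after "everyone".
words = [
    {"word": "thanks", "start_ms": 0, "end_ms": 300},
    {"word": "everyone", "start_ms": 350, "end_ms": 800},
    {"word": "next", "start_ms": 1500, "end_ms": 1750},
    {"word": "item", "start_ms": 1800, "end_ms": 2100},
]
phrases = group_words_into_phrases(words)
```

Because each phrase keeps its word records, the index can store the phrase while a citation still resolves to the exact words.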
Joining Diarization and Alignment
Speaker diarization answers "who spoke when." Forced alignment answers "where the words occur." They should be joined by interval overlap.
aligned word: refund 823.940s - 824.310s
speaker segment: customer 819.000s - 828.400s
result: refund spoken by customer
The join is simple when speakers do not overlap. Overlap makes it harder, because a word's interval can intersect more than one speaker segment.
Do not hide this uncertainty. Store overlap ratios and confidence fields.
{
  "word": "refund",
  "start_ms": 823940,
  "end_ms": 824310,
  "speaker": "customer",
  "speaker_overlap": 0.96,
  "speaker_confidence": 0.88
}
Agents should receive this uncertainty when the answer depends on attribution.
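The interval join can be sketched as follows. This is a simplified example under stated assumptions: segments are plain dictionaries, the speaker with the largest overlap wins, and `speaker_confidence` is left out because it would have to come from the diarization model itself.

```python
def assign_speaker(word, segments):
    """Attach the speaker whose segment overlaps the word the most.

    Returns a copy of the word with `speaker` and `speaker_overlap`,
    where the overlap is the fraction of the word covered by that speaker.
    """
    duration = word["end_ms"] - word["start_ms"]
    best_speaker, best_overlap = None, 0
    for seg in segments:
        overlap = min(word["end_ms"], seg["end_ms"]) - max(word["start_ms"], seg["start_ms"])
        if overlap > best_overlap:
            best_speaker, best_overlap = seg["speaker"], overlap
    ratio = best_overlap / duration if duration > 0 else 0.0
    return {**word, "speaker": best_speaker, "speaker_overlap": round(ratio, 2)}


# A word that sits entirely inside one speaker segment.
clean = assign_speaker(
    {"word": "refund", "start_ms": 823940, "end_ms": 824310},
    [{"speaker": "customer", "start_ms": 819000, "end_ms": 828400}],
)

# A word that straddles a speaker change: the ratio exposes the ambiguity.
straddling = assign_speaker(
    {"word": "okay", "start_ms": 1000, "end_ms": 1400},
    [
        {"speaker": "agent", "start_ms": 0, "end_ms": 1100},
        {"speaker": "customer", "start_ms": 1100, "end_ms": 2000},
    ],
)
```

A low `speaker_overlap` is a signal to show the agent both candidate speakers rather than a single confident label.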
Indexing Aligned Speech
Do not embed timestamps inside the text string. Text like "[00:13:42] I would like a refund" adds token noise to the embedding. Store timestamps as metadata.
Use two parallel representations:
{
  "embedding_text": "I would like a refund because the outage affected our launch",
  "metadata": {
    "source_id": "call_481",
    "start_ms": 822180,
    "end_ms": 826920,
    "speaker": "customer"
  }
}
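If existing transcripts already carry inline timestamps, they can be split into the two representations before indexing. The sketch below assumes a leading `[HH:MM:SS]` prefix like the one quoted above; other formats would need their own pattern, and the function name is made up for this example.

```python
import re

# Matches a leading "[HH:MM:SS]" prefix and any whitespace after it.
_INLINE_STAMP = re.compile(r"^\[(\d{2}):(\d{2}):(\d{2})\]\s*")


def split_inline_timestamp(line):
    """Separate a leading timestamp from the text that should be embedded."""
    match = _INLINE_STAMP.match(line)
    if match is None:
        return {"embedding_text": line.strip(), "metadata": {}}
    hours, minutes, seconds = (int(part) for part in match.groups())
    start_ms = ((hours * 60 + minutes) * 60 + seconds) * 1000
    return {
        "embedding_text": line[match.end():].strip(),
        "metadata": {"start_ms": start_ms},
    }


record = split_inline_timestamp("[00:13:42] I would like a refund")
```

Note that a second-resolution prefix only recovers a coarse start time; the aligned `start_ms` and `end_ms` remain the authoritative boundaries.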
For retrieval, combine:
1. Dense embedding search for semantic meaning.
2. BM25 or sparse search for exact terms, names, and product codes.
3. Metadata filters for speaker, date, source type, language, or customer.
4. Reranking for the top candidates.
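The list above does not say how dense and sparse results should be merged. One common option is reciprocal rank fusion (RRF), sketched below as an assumption rather than a recommendation from any specific engine; the constant `k=60` is a conventional default, and the span IDs are placeholders.

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Merge several ranked lists of span IDs into one list.

    Each list contributes 1 / (k + rank) per item, so items that rank well
    in several lists rise to the top without comparing raw scores.
    """
    scores = {}
    for ranking in rankings:
        for rank, span_id in enumerate(ranking, start=1):
            scores[span_id] = scores.get(span_id, 0.0) + 1.0 / (k + rank)
    # Sort by fused score, breaking ties by ID for deterministic output.
    return sorted(scores, key=lambda span_id: (-scores[span_id], span_id))


dense_hits = ["span_a", "span_b", "span_c"]   # semantic matches
sparse_hits = ["span_b", "span_d", "span_a"]  # exact-term matches
fused = reciprocal_rank_fusion([dense_hits, sparse_hits])
```

Metadata filters would typically be applied before fusion, and the reranker after it, on the fused top candidates.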
This matters for agent search because questions about spoken evidence mix semantic intent with exact-match terms such as names and product codes.
Query Planning for Agents
Expose aligned speech as a bounded tool. The agent should not receive the entire transcript unless it asks for expansion.
A useful tool schema:
{
  "name": "search_spoken_evidence",
  "description": "Search timestamped audio and video transcripts. Returns cited spans with source URI, speaker, start_ms, end_ms, and nearby context.",
  "input_schema": {
    "type": "object",
    "properties": {
      "query": {"type": "string"},
      "speaker": {"type": "string"},
      "source_type": {"type": "string", "enum": ["audio", "video", "any"]},
      "date_range": {"type": "string"},
      "top_k": {"type": "integer", "minimum": 1, "maximum": 20}
    },
    "required": ["query"]
  }
}
The result should include inspectable handles:
{
  "results": [
    {
      "text": "I would like a refund because the outage affected our launch",
      "speaker": "customer",
      "source_uri": "s3://support-calls/2026/06/09/call_481.wav",
      "start_ms": 822180,
      "end_ms": 826920,
      "score": 0.84,
      "clip_uri": "mixpeek://clips/call_481/822180-826920"
    }
  ]
}
The agent can answer with a citation and ask for a follow-up clip only if needed.
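To show how a result turns into a human-readable citation, the sketch below formats the millisecond boundaries as clock times. The helper names and the citation layout are assumptions for illustration; only the result fields come from the example above.

```python
def format_ms(ms):
    """Render milliseconds as HH:MM:SS.mmm."""
    seconds, millis = divmod(ms, 1000)
    minutes, seconds = divmod(seconds, 60)
    hours, minutes = divmod(minutes, 60)
    return f"{hours:02d}:{minutes:02d}:{seconds:02d}.{millis:03d}"


def format_citation(result):
    """Build a one-line citation from a search result record."""
    start = format_ms(result["start_ms"])
    end = format_ms(result["end_ms"])
    return f'{result["speaker"]}, {result["source_uri"]} [{start} - {end}]'


citation = format_citation(
    {
        "speaker": "customer",
        "source_uri": "s3://support-calls/2026/06/09/call_481.wav",
        "start_ms": 822180,
        "end_ms": 826920,
    }
)
```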
Evaluation
Evaluate alignment separately from retrieval. A retrieval system can find the right transcript chunk while still citing the wrong second.
Measure at least five things:
| Metric | What it measures |
| Boundary error | Average absolute error for word or phrase start/end times |
| Coverage | Percentage of transcript tokens that received usable timestamps |
| Speaker attribution accuracy | Whether words are assigned to the correct speaker |
| Citation hit rate | Whether the cited span actually contains the answer |
| Expansion quality | Whether adjacent context helps without pulling in unrelated speech |
Then run an end-to-end check with the agent:
1. Ask the agent a question that requires spoken evidence.
2. Require an answer with source ID and timestamp.
3. Check whether the cited time span contains the supporting speech.
4. Penalize answers that cite only the file without a span.
5. Penalize answers that cite the right transcript but wrong speaker.
The important metric is not only word error rate. It is whether a human can click the citation and verify the answer.
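Two of these metrics are simple enough to sketch directly. The code below assumes predicted and reference spans are already paired by index, and it treats a citation as a hit when it covers at least half of the gold span; both choices, and the function names, are assumptions for this example rather than a standard definition.

```python
def mean_boundary_error_ms(predicted, reference):
    """Average absolute start/end error over index-paired spans."""
    if len(predicted) != len(reference) or not predicted:
        raise ValueError("expected two non-empty, equally long span lists")
    errors = []
    for pred, ref in zip(predicted, reference):
        errors.append(abs(pred["start_ms"] - ref["start_ms"]))
        errors.append(abs(pred["end_ms"] - ref["end_ms"]))
    return sum(errors) / len(errors)


def citation_hit(cited, gold, min_coverage=0.5):
    """True if the cited span is in the right source and covers the gold span."""
    if cited["source_id"] != gold["source_id"]:
        return False
    overlap = min(cited["end_ms"], gold["end_ms"]) - max(cited["start_ms"], gold["start_ms"])
    gold_len = gold["end_ms"] - gold["start_ms"]
    return gold_len > 0 and overlap / gold_len >= min_coverage


# Made-up boundaries for two words.
error = mean_boundary_error_ms(
    [{"start_ms": 0, "end_ms": 100}, {"start_ms": 120, "end_ms": 300}],
    [{"start_ms": 10, "end_ms": 90}, {"start_ms": 120, "end_ms": 320}],
)

hit = citation_hit(
    {"source_id": "call_481", "start_ms": 822180, "end_ms": 826920},
    {"source_id": "call_481", "start_ms": 823940, "end_ms": 824310},
)
```

Citation hit rate is then the share of evaluation questions for which `citation_hit` returns true.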
Failure Modes
Transcript mismatch. If ASR hallucinated, omitted, or normalized text too aggressively, the aligner may force timestamps onto words that were never spoken.
Punctuation drift. Punctuation is not acoustic. Use it to build spans, but do not expect punctuation boundaries to align cleanly with audio.
Overlapping speech. One transcript line may contain speech from two speakers. Store uncertainty and consider overlap-aware diarization.
Music and background audio. Song lyrics, crowd noise, and background speech can confuse ASR and alignment. Track non-speech intervals.
Long-window drift. Alignment windows that are too long can drift. Windows that are too short lose context. Use voice activity and sentence boundaries to choose windows.
Embedding pollution. Inline timestamps, speaker labels, and JSON blobs can hurt semantic embeddings. Keep clean text for embedding and structured metadata for filters.
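For the long-window drift case, one simple approach is to pack voice activity segments into alignment windows under a maximum length. The sketch below assumes speech segments arrive as time-ordered `(start_ms, end_ms)` pairs and uses a five-minute cap as the default; it does not split a single segment that is already longer than the cap, which a real pipeline would need to handle.

```python
def plan_windows(speech_segments, max_window_ms=300_000):
    """Pack consecutive speech segments into windows no longer than max_window_ms.

    Windows always start and end on segment boundaries, so no window
    cuts through the middle of detected speech.
    """
    windows = []
    window_start = window_end = None
    for start, end in speech_segments:
        if window_start is None:
            window_start, window_end = start, end
        elif end - window_start <= max_window_ms:
            window_end = end  # the segment still fits in the current window
        else:
            windows.append((window_start, window_end))
            window_start, window_end = start, end
    if window_start is not None:
        windows.append((window_start, window_end))
    return windows


# Made-up voice activity output for an audio file of a little over eight minutes.
segments = [(0, 120_000), (125_000, 250_000), (260_000, 400_000), (410_000, 500_000)]
windows = plan_windows(segments)
```

Each window is aligned independently, and the window start is added back to the returned timestamps so spans stay in source time.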
Mixpeek Example
In Mixpeek, model this as a speech evidence pipeline rather than a transcript upload.
from mixpeek import Mixpeek

mx = Mixpeek(api_key="YOUR_API_KEY")
mx.ingest.audio(
    collection_id="support-calls",
    source="s3://support-calls/2026/06/",
    feature_extractors=[
        {
            "name": "audio_transcription",
            "model_id": "Qwen/Qwen3-ASR-1.7B",
            "params": {
                "return_text": True,
                "return_word_timestamps": True,
                "forced_aligner": "Qwen/Qwen3-ForcedAligner-0.6B"
            }
        },
        {
            "name": "speaker_diarization",
            "model_id": "pyannote/speaker-diarization-community-1"
        },
        {
            "name": "text_embedding",
            "model_id": "BAAI/bge-m3",
            "params": {
                "source_field": "aligned_speech_spans.text"
            }
        }
    ]
)
Then expose a retrieval tool:
results = mx.retrievers.retrieve(
    collection_ids=["support-calls"],
    stages=[
        {
            "type": "hybrid_search",
            "feature": "aligned_speech_spans",
            "query": "customer requested refund after outage",
            "top_k": 50
        },
        {
            "type": "rerank",
            "model_id": "Qwen/Qwen3-Reranker-4B",
            "top_k": 10
        }
    ],
    return_fields=[
        "source_uri",
        "text",
        "speaker",
        "start_ms",
        "end_ms",
        "alignment_confidence"
    ]
)
If you already generated aligned spans outside Mixpeek, store them in MVS with clean text vectors and timestamp metadata:
# span_embedding: the precomputed vector for the clean span text
# (the ID suffix below suggests BAAI/bge-m3).
mx.mvs.upsert(
    namespace="support-call-memory",
    vectors=[
        {
            "id": "call_481:822180:826920:bge_m3",
            "values": span_embedding,
            "metadata": {
                "source_uri": "s3://support-calls/2026/06/09/call_481.wav",
                "text": "I would like a refund because the outage affected our launch",
                "speaker": "customer",
                "start_ms": 822180,
                "end_ms": 826920,
                "asr_model": "Qwen/Qwen3-ASR-1.7B",
                "aligner_model": "Qwen/Qwen3-ForcedAligner-0.6B"
            }
        }
    ]
)
Design Checklist
Key Takeaways
1. ASR makes audio searchable. Forced alignment makes it citeable.
2. Word timestamps are raw evidence, not always the retrieval chunk.
3. Keep timestamps as metadata. Do not pollute embedding text with timing strings.
4. Diarization and alignment need interval joins with uncertainty, especially during overlap.
5. Agent tools should return compact cited spans with source handles and expansion paths.
6. The eval that matters is whether a human can click the cited span and verify the answer.