    Updated 2026-06-09

    Forced Alignment for AI Agents: Word Timestamps, Diarization, and Audio Evidence Search

    Learn how forced alignment turns transcripts into timestamped evidence that agents can search, cite, and inspect across audio and video archives.


    Why Timestamps Are a Retrieval Primitive



    Audio and video RAG often starts with automatic speech recognition. That is necessary, but it is not enough.

    A transcript without accurate timing is just a long text file. An agent can quote it, but it cannot reliably answer the operational question: where in the source did this happen?

    For spoken media, useful evidence has four parts:

    | Field | What it answers |
    |---|---|
    | Text | What was said |
    | Time span | When it was said |
    | Speaker or channel | Who said it, when available |
    | Source handle | Which audio or video object contains it |

    The time span is what turns a transcript result into inspectable evidence. It lets the agent deep-link to the moment in a call, meeting, podcast, surveillance clip, lecture, deposition, or broadcast.

    This is the difference between:

  1. "The customer asked for a refund."
  2. "The customer asked for a refund in call_481 at 00:13:42.180 to 00:13:46.920."


    The second answer can be verified.

    What Forced Alignment Does



    Forced alignment takes two inputs:

    1. An audio signal.
    2. A known transcript or token sequence.

    It returns timing boundaries for the transcript units. Those units may be phonemes, characters, words, phrases, or subtitle segments.

    ASR answers: "What text is in this audio?"

    Forced alignment answers: "Given this text, where does each part occur in the audio?"

    That distinction matters. A transcription model may produce segment-level timestamps, but those timestamps are often too coarse for retrieval citations. Segment boundaries can drift, especially with long audio, music, overlap, silence, accents, code-switching, or noisy recordings. Forced alignment is a second pass that snaps transcript units to acoustic evidence.

    The Basic Pipeline



    A production alignment pipeline usually looks like this:

    audio or video object
      -> voice activity detection
      -> ASR transcript
      -> optional normalization
      -> forced alignment
      -> speaker diarization join
      -> span chunking
      -> embedding and sparse indexing
      -> agent retrieval tool
    


    Each stage changes the retrieval contract.

    Voice activity detection removes silence and non-speech so the aligner does not waste work. ASR gives the text. Normalization makes transcript tokens match what the aligner expects. Forced alignment gives boundaries. Diarization attaches speaker turns. Chunking decides what span size should be searchable. Indexing makes the spans retrievable.
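The normalization stage is easy to underestimate, so here is a minimal sketch of one way to do it. It assumes the aligner expects lowercase tokens without punctuation, which is a common but not universal convention; the function name `normalize_for_alignment` and the exact cleaning rules are illustrative, not part of any specific aligner. It keeps the original token next to the normalized one so timestamps can be mapped back to display text.

```python
import re
import unicodedata


def normalize_for_alignment(transcript: str) -> list[tuple[str, str]]:
    """Return (original_token, normalized_token) pairs.

    Assumption: the aligner wants lowercase tokens with punctuation removed.
    Tokens that become empty after cleaning (e.g. a bare "--") are dropped.
    """
    pairs = []
    for original in transcript.split():
        folded = unicodedata.normalize("NFKC", original).lower()
        # Keep word characters and apostrophes, drop everything else.
        cleaned = re.sub(r"[^\w']+", "", folded)
        if cleaned:
            pairs.append((original, cleaned))
    return pairs


pairs = normalize_for_alignment("Hello, I'd like a REFUND -- today!")
for original, normalized in pairs:
    print(f"{original!r} -> {normalized!r}")
```

Real pipelines often need more than this, for example spelling out numbers or handling language-specific scripts, and those rules should be versioned alongside the aligner.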

    The output should not be a plain transcript. It should be a span table, with one record per aligned span:

    {
      "source_id": "call_481",
      "source_uri": "s3://support-calls/2026/06/09/call_481.wav",
      "span_id": "call_481:000822180-000826920",
      "text": "I would like a refund because the outage affected our launch",
      "start_ms": 822180,
      "end_ms": 826920,
      "speaker": "customer",
      "asr_model": "Qwen/Qwen3-ASR-1.7B",
      "aligner_model": "Qwen/Qwen3-ForcedAligner-0.6B",
      "alignment_confidence": 0.91
    }
    


    An agent can search this record, cite it, and request the source clip for inspection.
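To make the citation step concrete, the sketch below turns a span record like the one above into a human-readable citation and a sortable span ID. The helper names (`format_timestamp`, `make_span_id`, `format_citation`) are hypothetical, and the nine-digit zero padding is inferred from the example `span_id` rather than from a documented rule.

```python
def format_timestamp(ms: int) -> str:
    """Render integer milliseconds as HH:MM:SS.mmm."""
    hours, rem = divmod(ms, 3_600_000)
    minutes, rem = divmod(rem, 60_000)
    seconds, millis = divmod(rem, 1_000)
    return f"{hours:02d}:{minutes:02d}:{seconds:02d}.{millis:03d}"


def make_span_id(source_id: str, start_ms: int, end_ms: int) -> str:
    """Zero-pad offsets so span IDs sort in time order within one source."""
    return f"{source_id}:{start_ms:09d}-{end_ms:09d}"


def format_citation(span: dict) -> str:
    """Build the citation string an agent can put in its answer."""
    return (
        f"{span['source_id']} at {format_timestamp(span['start_ms'])} "
        f"to {format_timestamp(span['end_ms'])}"
    )


span = {"source_id": "call_481", "start_ms": 822180, "end_ms": 826920}
print(make_span_id(span["source_id"], span["start_ms"], span["end_ms"]))
# call_481:000822180-000826920
print(format_citation(span))
# call_481 at 00:13:42.180 to 00:13:46.920
```

Storing integer milliseconds and formatting only at display time avoids rounding disagreements between the index and the player.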

    Alignment Algorithms



    Forced alignment is an old speech problem with newer model families. The main approaches differ in how they score possible alignments between audio frames and transcript units.

    1. HMM and Kaldi-Style Alignment



    Traditional forced aligners, including the Montreal Forced Aligner family, rely on acoustic models, pronunciation dictionaries, and Hidden Markov Models. The transcript is expanded into a likely phoneme sequence. The acoustic model scores which phoneme is likely at each frame. Dynamic programming finds the most likely path through the sequence.

    This works well when:

  1. You have a pronunciation dictionary.
  2. The language is supported.
  3. The transcript is clean.
  4. You need phoneme-level or word-level linguistic alignment.

    It is less convenient for arbitrary multilingual production media because dictionaries, acoustic models, and text normalization rules become operational dependencies.

    2. CTC Forced Alignment



    Connectionist Temporal Classification models output frame-level probabilities over tokens plus a blank symbol. The aligner searches for the path through those frame probabilities that collapses to the target transcript.

    The common dynamic programming formulation is similar to Viterbi decoding:

    1. Convert transcript tokens into an expanded sequence with blanks.
    2. Score every audio frame against every possible token state.
    3. Find the highest probability monotonic path.
    4. Collapse blanks and repeated tokens into word or character boundaries.

    CTC alignment is useful because it does not need frame-level labels at training time. It is common in wav2vec2-style alignment systems and WhisperX-style pipelines.

    The main failure mode is tokenization mismatch. If the transcript text and the aligner vocabulary disagree, the model may force a bad path.
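The four steps above can be sketched in plain Python. This is a teaching-sized Viterbi search over CTC states, not the implementation used by any particular library: `ctc_forced_align` is a hypothetical name, the input is assumed to be a list of per-frame log-probabilities with index 0 as the blank, and the toy three-symbol vocabulary is invented for the example.

```python
import math

NEG_INF = float("-inf")


def ctc_forced_align(log_probs, tokens, blank=0):
    """Return (first_frame, last_frame) for each token on the best CTC path."""
    if not tokens:
        return []
    # Step 1: interleave blanks -> [blank, t1, blank, t2, ..., blank].
    ext = [blank]
    for tok in tokens:
        ext += [tok, blank]
    T, S = len(log_probs), len(ext)
    score = [[NEG_INF] * S for _ in range(T)]
    back = [[0] * S for _ in range(T)]
    score[0][0] = log_probs[0][blank]
    score[0][1] = log_probs[0][ext[1]]
    # Steps 2-3: score each frame against each state, keep the best predecessor.
    for t in range(1, T):
        for s in range(S):
            candidates = [s]                      # stay in the same state
            if s >= 1:
                candidates.append(s - 1)          # advance one state
            if s >= 2 and ext[s] != blank and ext[s] != ext[s - 2]:
                candidates.append(s - 2)          # skip the blank between tokens
            prev = max(candidates, key=lambda p: score[t - 1][p])
            if score[t - 1][prev] == NEG_INF:
                continue
            score[t][s] = score[t - 1][prev] + log_probs[t][ext[s]]
            back[t][s] = prev
    # A valid path ends on the last token or on the trailing blank.
    s = S - 1 if score[T - 1][S - 1] >= score[T - 1][S - 2] else S - 2
    if score[T - 1][s] == NEG_INF:
        raise ValueError("too few frames for this transcript")
    # Step 4: walk back and collapse repeated states into frame ranges.
    spans = [None] * len(tokens)
    for t in range(T - 1, -1, -1):
        if s % 2 == 1:                            # odd states are tokens
            i = s // 2
            last = spans[i][1] if spans[i] else t
            spans[i] = (t, last)
        s = back[t][s]
    return spans


# Toy vocabulary (assumption): 0 = blank, 1 = "a", 2 = "b".
probs = [
    [0.8, 0.1, 0.1],
    [0.1, 0.8, 0.1],
    [0.1, 0.8, 0.1],
    [0.8, 0.1, 0.1],
    [0.1, 0.1, 0.8],
]
log_probs = [[math.log(p) for p in frame] for frame in probs]
frame_spans = ctc_forced_align(log_probs, tokens=[1, 2])
print(frame_spans)  # [(1, 2), (4, 4)]
```

Frame indices become milliseconds once you multiply by the acoustic model's frame stride, which depends on the model and should be read from its configuration rather than assumed.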

    3. Dynamic Time Warping



    Dynamic Time Warping aligns two sequences by stretching or compressing time while preserving order. It is often used when you have two comparable streams, such as model token timing estimates and transcript tokens.

    DTW is attractive because it is simple and robust to local speed changes. It is not a speech recognizer by itself. It needs features or token probabilities that already represent the audio and text.

    Use DTW when you need to reconcile two imperfect timing sequences, not when you need a full acoustic alignment system from scratch.
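As a rough illustration of the warping idea, here is a minimal classic DTW over two numeric sequences. The sequences are stand-ins for whatever comparable features you have (the values are invented), and `dtw_path` is a hypothetical helper, not a library call.

```python
def dtw_path(a, b, dist=lambda x, y: abs(x - y)):
    """Return (total_cost, path) where path pairs indices of a with indices of b."""
    n, m = len(a), len(b)
    inf = float("inf")
    # cost[i][j] = cheapest alignment of a[:i] with b[:j].
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            step = min(cost[i - 1][j - 1], cost[i - 1][j], cost[i][j - 1])
            cost[i][j] = dist(a[i - 1], b[j - 1]) + step
    # Backtrack from the end, always moving to the cheapest predecessor.
    path = []
    i, j = n, m
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        moves = [
            (cost[i - 1][j - 1], i - 1, j - 1),
            (cost[i - 1][j], i - 1, j),
            (cost[i][j - 1], i, j - 1),
        ]
        _, i, j = min(moves)
    path.reverse()
    return cost[n][m], path


# The second sequence lingers on the value 1.0 for an extra step.
total, path = dtw_path([0.0, 1.0, 2.0, 3.0], [0.0, 1.0, 1.0, 2.0, 3.0])
print(total)  # 0.0
print(path)   # [(0, 0), (1, 1), (1, 2), (2, 3), (3, 4)]
```

The path shows index 1 of the first sequence stretched across indices 1 and 2 of the second, which is the order-preserving "stretch" the section describes. This version is quadratic in time and memory, so long recordings usually need a windowed variant.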

    4. Non-Autoregressive Timestamp Prediction



    Newer aligners can predict timestamps directly for transcript units. Qwen3-ForcedAligner-0.6B is an example: the model card says it aligns text-speech pairs and returns word or character-level timestamps, with support for timestamp prediction in speech windows up to five minutes.

    This is useful for retrieval systems because the output is closer to the data shape agents need:

    [
      {"word": "refund", "start_ms": 823940, "end_ms": 824310},
      {"word": "outage", "start_ms": 824980, "end_ms": 825430}
    ]
    


    The tradeoff is that model-specific runtime and windowing rules matter. You still need validation, confidence handling, and fallbacks.
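A sketch of the validation step, under assumptions: the aligner returns a list of dicts shaped like the example above, timings are absolute milliseconds, and a five-second ceiling on a single word is a reasonable sanity bound. The function name, the thresholds, and the third word's timing are all illustrative.

```python
def validate_word_timestamps(words, window_start_ms, window_end_ms, max_word_ms=5000):
    """Return (index, problem) pairs for entries that should not be trusted."""
    problems = []
    latest_start = window_start_ms
    for i, w in enumerate(words):
        start, end = w["start_ms"], w["end_ms"]
        if end <= start:
            problems.append((i, "non-positive duration"))
        if start < window_start_ms or end > window_end_ms:
            problems.append((i, "outside window"))
        if start < latest_start:
            problems.append((i, "out of order"))
        if end - start > max_word_ms:
            problems.append((i, "implausibly long"))
        latest_start = max(latest_start, start)
    return problems


words = [
    {"word": "refund", "start_ms": 823940, "end_ms": 824310},
    {"word": "outage", "start_ms": 824980, "end_ms": 825430},
    # Hypothetical bad entry: the aligner collapsed this word to zero length.
    {"word": "launch", "start_ms": 825400, "end_ms": 825400},
]
problems = validate_word_timestamps(words, window_start_ms=822180, window_end_ms=826920)
print(problems)  # [(2, 'non-positive duration')]
```

Flagged words can fall back to the enclosing segment's timing with a lower confidence value instead of being dropped.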

    Why Word Timestamps Are Not Always the Right Unit



    Word-level timing sounds ideal, but retrieval rarely needs every word as a separate document. Very small chunks create noisy search results and brittle citations.

    Use word timestamps as raw material. Build higher-level searchable spans from them.

    Good span units include:

    | Unit | Best for | Risk |
    |---|---|---|
    | Word | Exact subtitle sync, redaction, clip trimming | Too small for semantic search |
    | Phrase | Spoken claim detection, quote retrieval | Needs punctuation or pause heuristics |
    | Speaker turn | Meetings, calls, interviews | Can be too long when speakers monologue |
    | Semantic chunk | RAG and agent answers | Requires chunker that preserves time |
    | Scene or chapter | Podcasts, lectures, long videos | Too coarse for citations |

    A strong retrieval system stores multiple units:

  1. Word timestamps for playback and exact citation.
  2. Phrase spans for quote-level retrieval.
  3. Semantic chunks for answer generation.
  4. Speaker turns for conversation structure.

    The agent should retrieve chunks, then cite the narrower word or phrase span inside the chunk.
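One simple way to build phrase spans from word timestamps is to split on pauses. The sketch below assumes a 400 ms pause threshold and a 15 second cap per phrase; both numbers, the helper names, and the word timings are illustrative and should be tuned on real data.

```python
def make_phrase(words):
    """Collapse a run of aligned words into one span that keeps its words."""
    return {
        "text": " ".join(w["word"] for w in words),
        "start_ms": words[0]["start_ms"],
        "end_ms": words[-1]["end_ms"],
        "words": words,
    }


def words_to_phrases(words, max_pause_ms=400, max_span_ms=15_000):
    """Start a new phrase after a long pause or when the phrase gets too long."""
    phrases, current = [], []
    for w in words:
        if current:
            pause = w["start_ms"] - current[-1]["end_ms"]
            too_long = w["end_ms"] - current[0]["start_ms"] > max_span_ms
            if pause > max_pause_ms or too_long:
                phrases.append(make_phrase(current))
                current = []
        current.append(w)
    if current:
        phrases.append(make_phrase(current))
    return phrases


# Hypothetical word timings, in milliseconds from the start of the clip.
words = [
    {"word": "I", "start_ms": 0, "end_ms": 100},
    {"word": "want", "start_ms": 120, "end_ms": 400},
    {"word": "a", "start_ms": 420, "end_ms": 480},
    {"word": "refund", "start_ms": 500, "end_ms": 900},
    {"word": "today", "start_ms": 1600, "end_ms": 2000},
]
phrases = words_to_phrases(words)
for p in phrases:
    print(p["text"], p["start_ms"], p["end_ms"])
```

Because each phrase keeps its word list, the agent can retrieve the phrase and still cite the exact word span inside it.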

    Joining Diarization and Alignment



    Speaker diarization answers "who spoke when." Forced alignment answers "where the words occur." They should be joined by interval overlap.

    aligned word:     refund     822.180s - 822.510s
    speaker segment:  customer   819.000s - 828.400s
    result:           refund spoken by customer
    


    The join is simple when speakers do not overlap. Overlap makes it harder:

  1. Two speakers talk at the same time.
  2. The diarizer changes speaker too late.
  3. ASR merges speech from two speakers into one text span.
  4. A speaker label covers silence or background audio.

    Do not hide this uncertainty. Store overlap ratios and confidence fields.

    {
      "word": "refund",
      "start_ms": 822180,
      "end_ms": 822510,
      "speaker": "customer",
      "speaker_overlap": 0.96,
      "speaker_confidence": 0.88
    }
    


    Agents should receive this uncertainty when the answer depends on attribution.
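The interval join can be sketched as follows. It assigns each word to the diarization segment that covers most of it and records the overlap ratio. The segment boundaries below are invented to show a word that straddles a speaker change; `assign_speaker` is a hypothetical helper, and it does not produce `speaker_confidence`, which would have to come from the diarizer itself.

```python
def assign_speaker(word, segments):
    """Attach the best-overlapping speaker and the fraction of the word it covers."""
    duration = word["end_ms"] - word["start_ms"]
    best_label, best_overlap = None, 0
    for seg in segments:
        overlap = min(word["end_ms"], seg["end_ms"]) - max(word["start_ms"], seg["start_ms"])
        if overlap > best_overlap:
            best_label, best_overlap = seg["speaker"], overlap
    ratio = best_overlap / duration if duration > 0 else 0.0
    return {**word, "speaker": best_label, "speaker_overlap": round(ratio, 2)}


word = {"word": "refund", "start_ms": 822180, "end_ms": 822510}
# Hypothetical diarizer output: the speaker change lands 13 ms into the word.
segments = [
    {"speaker": "agent", "start_ms": 815000, "end_ms": 822193},
    {"speaker": "customer", "start_ms": 822193, "end_ms": 828400},
]
joined = assign_speaker(word, segments)
print(joined["speaker"], joined["speaker_overlap"])  # customer 0.96
```

A word with no overlapping segment comes back with `speaker` set to `None` and an overlap of 0.0, which is better than silently inheriting the previous speaker.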

    Indexing Aligned Speech



    Do not embed timestamps inside the text string. Text like "[00:13:42] I would like a refund" adds token noise to the embedding. Store timestamps as metadata.

    Use two parallel representations:

    {
      "embedding_text": "I would like a refund because the outage affected our launch",
      "metadata": {
        "source_id": "call_481",
        "start_ms": 822180,
        "end_ms": 826920,
        "speaker": "customer"
      }
    }
    


    For retrieval, combine:

    1. Dense embedding search for semantic meaning.
    2. BM25 or sparse search for exact terms, names, and product codes.
    3. Metadata filters for speaker, date, source type, language, or customer.
    4. Reranking for the top candidates.

    This matters for agent search because spoken evidence contains both semantic and exact-match questions:

  1. Semantic: "Where did the customer sound frustrated about reliability?"
  2. Exact: "Where did someone say SKU-447B?"
  3. Filtered: "Find refund mentions by the customer, not the support agent."
  4. Temporal: "What did the engineer say after the outage was mentioned?"
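The article does not prescribe how to merge dense and sparse results. One common choice is reciprocal rank fusion, sketched here as an assumption rather than as the method any particular system uses. The span IDs, rankings, and metadata are made up, and the constant `k=60` is the value conventionally used with this technique.

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Merge ranked ID lists: each list contributes 1 / (k + rank) per ID."""
    scores = {}
    for ranking in rankings:
        for rank, span_id in enumerate(ranking, start=1):
            scores[span_id] = scores.get(span_id, 0.0) + 1.0 / (k + rank)
    # Highest fused score first; ties broken by ID for stable output.
    return sorted(scores, key=lambda sid: (-scores[sid], sid))


def hybrid_search(dense_ranking, sparse_ranking, metadata, filters, top_k=10):
    """Fuse the two rankings, then keep spans whose metadata matches every filter."""
    fused = reciprocal_rank_fusion([dense_ranking, sparse_ranking])
    kept = [
        sid for sid in fused
        if all(metadata[sid].get(field) == value for field, value in filters.items())
    ]
    return kept[:top_k]


dense = ["s3", "s1", "s2"]    # hypothetical embedding-search order
sparse = ["s1", "s4", "s3"]   # hypothetical BM25 order
metadata = {
    "s1": {"speaker": "customer"},
    "s2": {"speaker": "agent"},
    "s3": {"speaker": "customer"},
    "s4": {"speaker": "customer"},
}
hits = hybrid_search(dense, sparse, metadata, filters={"speaker": "customer"})
print(hits)  # ['s1', 's3', 's4']
```

In production the metadata filter usually runs inside the index before scoring, and a reranker then reorders the fused top candidates.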


    Query Planning for Agents



    Expose aligned speech as a bounded tool. The agent should not receive the entire transcript unless it asks for expansion.

    A useful tool schema:

    {
      "name": "search_spoken_evidence",
      "description": "Search timestamped audio and video transcripts. Returns cited spans with source URI, speaker, start_ms, end_ms, and nearby context.",
      "input_schema": {
        "type": "object",
        "properties": {
          "query": {"type": "string"},
          "speaker": {"type": "string"},
          "source_type": {"type": "string", "enum": ["audio", "video", "any"]},
          "date_range": {"type": "string"},
          "top_k": {"type": "integer", "minimum": 1, "maximum": 20}
        },
        "required": ["query"]
      }
    }
    


    The result should include inspectable handles:

    {
      "results": [
        {
          "text": "I would like a refund because the outage affected our launch",
          "speaker": "customer",
          "source_uri": "s3://support-calls/2026/06/09/call_481.wav",
          "start_ms": 822180,
          "end_ms": 826920,
          "score": 0.84,
          "clip_uri": "mixpeek://clips/call_481/822180-826920"
        }
      ]
    }
    


    The agent can answer with a citation and ask for a follow-up clip only if needed.
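When the agent does ask for more, the expansion can stay bounded too. The sketch below returns only spans from the same source that fall within a padding window around the hit. The `expand_context` name, the five-second padding, and the neighbouring spans are assumptions for illustration.

```python
def expand_context(hit, spans, pad_ms=5000):
    """Return spans from the same source that intersect the padded hit window."""
    lo, hi = hit["start_ms"] - pad_ms, hit["end_ms"] + pad_ms
    nearby = [
        s for s in spans
        if s["source_id"] == hit["source_id"] and s["end_ms"] > lo and s["start_ms"] < hi
    ]
    return sorted(nearby, key=lambda s: s["start_ms"])


hit = {"source_id": "call_481", "start_ms": 822180, "end_ms": 826920}
# Hypothetical neighbouring spans in the index.
spans = [
    {"source_id": "call_481", "start_ms": 815000, "end_ms": 821900},
    hit,
    {"source_id": "call_481", "start_ms": 827300, "end_ms": 830000},
    {"source_id": "call_481", "start_ms": 900000, "end_ms": 905000},
    {"source_id": "call_482", "start_ms": 822000, "end_ms": 823000},
]
context = expand_context(hit, spans)
print([(s["start_ms"], s["end_ms"]) for s in context])
```

Keeping expansion as a separate call means the first tool response stays small, and the transcript only grows when the agent has a reason to look wider.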

    Evaluation



    Evaluate alignment separately from retrieval. A retrieval system can find the right transcript chunk while still citing the wrong second.

    Measure at least five things:

    | Metric | What it catches |
    |---|---|
    | Boundary error | Average absolute error for word or phrase start/end times |
    | Coverage | Percentage of transcript tokens that received usable timestamps |
    | Speaker attribution accuracy | Whether words are assigned to the correct speaker |
    | Citation hit rate | Whether the cited span actually contains the answer |
    | Expansion quality | Whether adjacent context helps without pulling in unrelated speech |

    For agents, add task-level evals:

    1. Ask the agent a question that requires spoken evidence.
    2. Require an answer with source ID and timestamp.
    3. Check whether the cited time span contains the supporting speech.
    4. Penalize answers that cite only the file without a span.
    5. Penalize answers that cite the right transcript but wrong speaker.

    The important metric is not only word error rate. It is whether a human can click the citation and verify the answer.
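Three of the metrics above are simple enough to sketch directly. The definitions here are one reasonable reading, not a standard: boundary error is the mean absolute difference over paired start and end times, coverage counts words with a positive-length timestamp, and a citation "hits" when it overlaps at least half of a hand-labelled gold span. The reference timings and the 0.5 threshold are invented for the example.

```python
def boundary_error_ms(predicted, reference):
    """Mean absolute start/end error over words paired by position."""
    errors = []
    for p, r in zip(predicted, reference):
        errors.append(abs(p["start_ms"] - r["start_ms"]))
        errors.append(abs(p["end_ms"] - r["end_ms"]))
    return sum(errors) / len(errors) if errors else 0.0


def coverage(words):
    """Fraction of words that received a usable (positive-length) timestamp."""
    if not words:
        return 0.0
    usable = [
        w for w in words
        if w.get("start_ms") is not None
        and w.get("end_ms") is not None
        and w["end_ms"] > w["start_ms"]
    ]
    return len(usable) / len(words)


def citation_hit(cited, gold, min_overlap=0.5):
    """True when the cited span covers at least min_overlap of the gold span."""
    if cited["source_id"] != gold["source_id"]:
        return False
    overlap = min(cited["end_ms"], gold["end_ms"]) - max(cited["start_ms"], gold["start_ms"])
    return overlap >= min_overlap * (gold["end_ms"] - gold["start_ms"])


predicted = [
    {"word": "refund", "start_ms": 823940, "end_ms": 824310},
    {"word": "outage", "start_ms": 824980, "end_ms": 825430},
]
# Hypothetical hand-labelled reference timings.
reference = [
    {"word": "refund", "start_ms": 823900, "end_ms": 824300},
    {"word": "outage", "start_ms": 825000, "end_ms": 825400},
]
print(boundary_error_ms(predicted, reference))  # 25.0
print(coverage(predicted + [{"word": "launch", "start_ms": None, "end_ms": None}]))
cited = {"source_id": "call_481", "start_ms": 822180, "end_ms": 826920}
gold = {"source_id": "call_481", "start_ms": 823000, "end_ms": 825000}
print(citation_hit(cited, gold))  # True
```

Averaging `citation_hit` over a question set gives the citation hit rate, and a wrong-file citation counts as a miss regardless of timing.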

    Failure Modes



    Transcript mismatch. If ASR hallucinated, omitted, or normalized text too aggressively, the aligner may force timestamps onto words that were never spoken.

    Punctuation drift. Punctuation is not acoustic. Use it to build spans, but do not expect punctuation boundaries to align cleanly with audio.

    Overlapping speech. One transcript line may contain speech from two speakers. Store uncertainty and consider overlap-aware diarization.

    Music and background audio. Song lyrics, crowd noise, and background speech can confuse ASR and alignment. Track non-speech intervals.

    Long-window drift. Alignment windows that are too long can drift. Windows that are too short lose context. Use voice activity and sentence boundaries to choose windows.

    Embedding pollution. Inline timestamps, speaker labels, and JSON blobs can hurt semantic embeddings. Keep clean text for embedding and structured metadata for filters.
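For the long-window drift case, one hedged approach is to pack voice-activity segments into alignment windows that never exceed the aligner's limit. The sketch assumes VAD output as `(start_ms, end_ms)` tuples in time order and uses a 300,000 ms cap to match the five-minute window mentioned earlier; `pack_windows` and the segment values are illustrative.

```python
def pack_windows(speech_segments, max_window_ms=300_000):
    """Greedily merge consecutive speech segments into windows under the cap.

    Limitation: a single segment longer than the cap is passed through
    unsplit and needs separate handling, e.g. cutting at sentence boundaries.
    """
    windows = []
    start = end = None
    for seg_start, seg_end in speech_segments:
        if start is None:
            start, end = seg_start, seg_end
        elif seg_end - start <= max_window_ms:
            end = seg_end                      # segment still fits: extend window
        else:
            windows.append((start, end))       # close window at a speech boundary
            start, end = seg_start, seg_end
    if start is not None:
        windows.append((start, end))
    return windows


# Hypothetical VAD output for an 8-minute call.
segments = [(0, 120_000), (125_000, 280_000), (290_000, 410_000), (420_000, 500_000)]
windows = pack_windows(segments)
print(windows)  # [(0, 280000), (290000, 500000)]
```

Because windows always close at the end of a speech segment, cuts land in silence rather than in the middle of a word.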

    Mixpeek Example



    In Mixpeek, model this as a speech evidence pipeline rather than a transcript upload.

    from mixpeek import Mixpeek

    mx = Mixpeek(api_key="YOUR_API_KEY")

    mx.ingest.audio(
        collection_id="support-calls",
        source="s3://support-calls/2026/06/",
        feature_extractors=[
            {
                "name": "audio_transcription",
                "model_id": "Qwen/Qwen3-ASR-1.7B",
                "params": {
                    "return_text": True,
                    "return_word_timestamps": True,
                    "forced_aligner": "Qwen/Qwen3-ForcedAligner-0.6B"
                }
            },
            {
                "name": "speaker_diarization",
                "model_id": "pyannote/speaker-diarization-community-1"
            },
            {
                "name": "text_embedding",
                "model_id": "BAAI/bge-m3",
                "params": {
                    "source_field": "aligned_speech_spans.text"
                }
            }
        ]
    )


    Then expose a retrieval tool:

    results = mx.retrievers.retrieve(
        collection_ids=["support-calls"],
        stages=[
            {
                "type": "hybrid_search",
                "feature": "aligned_speech_spans",
                "query": "customer requested refund after outage",
                "top_k": 50
            },
            {
                "type": "rerank",
                "model_id": "Qwen/Qwen3-Reranker-4B",
                "top_k": 10
            }
        ],
        return_fields=[
            "source_uri",
            "text",
            "speaker",
            "start_ms",
            "end_ms",
            "alignment_confidence"
        ]
    )
    


    If you already generated aligned spans outside Mixpeek, store them in MVS with clean text vectors and timestamp metadata:

    mx.mvs.upsert(
        namespace="support-call-memory",
        vectors=[
            {
                "id": "call_481:822180:826920:bge_m3",
                "values": span_embedding,
                "metadata": {
                    "source_uri": "s3://support-calls/2026/06/09/call_481.wav",
                    "text": "I would like a refund because the outage affected our launch",
                    "speaker": "customer",
                    "start_ms": 822180,
                    "end_ms": 826920,
                    "asr_model": "Qwen/Qwen3-ASR-1.7B",
                    "aligner_model": "Qwen/Qwen3-ForcedAligner-0.6B"
                }
            }
        ]
    )
    


    Design Checklist



  1. Store clean transcript text separately from timestamp metadata.
  2. Preserve source URI, source ID, language, model IDs, extractor versions, and timing units.
  3. Keep word-level timestamps even if retrieval chunks are phrase or semantic spans.
  4. Join diarization and alignment by interval overlap, not by transcript line number.
  5. Store confidence and overlap fields when speaker attribution is uncertain.
  6. Use dense and sparse retrieval together for spoken evidence.
  7. Return source URI, start time, end time, and nearby context to agents.
  8. Evaluate citation hit rate, not only ASR word error rate.
  9. Backfill alignment when ASR models, diarizers, or normalization rules change.


    Key Takeaways



    1. ASR makes audio searchable. Forced alignment makes it citeable.

    2. Word timestamps are raw evidence, not always the retrieval chunk.

    3. Keep timestamps as metadata. Do not pollute embedding text with timing strings.

    4. Diarization and alignment need interval joins with uncertainty, especially during overlap.

    5. Agent tools should return compact cited spans with source handles and expansion paths.

    6. The eval that matters is whether a human can click the cited span and verify the answer.

    Further Reading



  1. Audio Feature Extraction: How AI Agents Learn to Hear
  2. Speaker Diarization: How AI Agents Know Who Said What in Audio and Video
  3. Audio-Visual Retrieval for AI Agents
  4. Agent Perception Evals
  5. Qwen3-ForcedAligner-0.6B on Hugging Face
  6. WhisperX on GitHub
  7. Montreal Forced Aligner documentation