    Agent Perception
    21 min read
    Updated 2026-06-19

    How ASR Decoding Actually Works: Beam Search, LM Fusion, and Confidence

    Speech models do not output transcripts. They output frame-level probability lattices that a decoder turns into text, and the decoder is where most transcript quality is won or lost. This guide opens that box: greedy versus beam search, CTC versus transducer versus attention decoding, language-model fusion, contextual biasing, repetition and hallucination control, and how to produce the word confidences an agent needs to trust what it hears.

    Audio AI
    Speech Recognition
    Beam Search
    Decoding
    Agent Perception
    Confidence

    The Transcript Is a Decision, Not a Model Output



    An AI agent that searches spoken media almost always sits downstream of automatic speech recognition. Calls, meetings, podcasts, lectures, support tickets with voicemail attachments, video with dialogue: all of it becomes searchable only after speech becomes text. If the text is wrong, every layer above it is wrong too, and the failure is silent. The agent retrieves a chunk, the chunk contains plausible words, and the words are subtly wrong in a way no similarity score can detect.

    Here is the part most pipelines miss: a speech model does not emit a transcript. It emits, for each short slice of audio, a probability distribution over possible tokens. A one-minute clip produces thousands of these distributions stacked into a lattice. Turning that lattice into a single string of words is a separate algorithm called the decoder, and the decoder is where a large fraction of transcript quality is decided. Two systems can use the exact same acoustic model and produce noticeably different transcripts purely because of how they decode.

    This guide is about that step. Not how the model is trained, but how its raw outputs become the text an agent will search, cite, and reason over.

    What Comes Out of the Acoustic Model



    After the audio is turned into a spectrogram and pushed through the encoder, you have a sequence of frames, typically one every 20 to 40 milliseconds. For each frame the model produces a vector of scores, one score per token in the vocabulary. The vocabulary might be characters, subword pieces, or whole words, plus special symbols.

    The single most important special symbol is the blank. Blank does not mean silence. It means "no new token emitted at this frame." Blank exists because audio frames are much more frequent than spoken tokens. A single word like "refund" might span fifteen frames, and the model needs a way to say "still the same word, nothing new yet."

    So the raw output is a grid: frames down one axis, vocabulary across the other, a probability in every cell. The decoder's job is to walk this grid and choose one sequence of tokens that best explains the audio. Different model families constrain that walk differently, which is why decoding looks different for CTC, transducer, and attention models.

    Greedy Decoding: The Cheap Baseline



    The simplest decoder is greedy decoding, also called argmax decoding. At every frame, pick the single highest-probability token. Then clean up the result.

    For a CTC model the cleanup has two rules, applied in this order:

    1. Collapse consecutive duplicate tokens into one.
    2. Remove all blank symbols.

    A worked example. Suppose frame-by-frame argmax over the audio for the word "hello" produces this:

    frames:  h h _ e l l _ l o o _
    step 1:  h   _ e l   _ l o   _      (collapse runs)
    step 2:  h     e l     l o            (drop blanks)
    result:  h e l l o
    


    Notice the two distinct "l" sounds survive because a blank sat between them. That blank is what tells the collapse rule "these are two letters, not one held letter." Remove blanks from the model and you can no longer spell "hello" or "letter" or "balloon." This is the whole reason the blank symbol exists.
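
The two cleanup rules can be sketched in a few lines. This is a minimal illustration, not a production decoder: it assumes the blank is written as `_` (as in the worked example) and that each frame arrives as a dict of token to log probability. The function names are made up for this sketch.

```python
from itertools import groupby

BLANK = "_"  # assumed blank symbol, matching the worked example


def ctc_collapse(path, blank=BLANK):
    """Apply the two CTC cleanup rules to a frame-level token path."""
    # Rule 1: collapse each run of consecutive duplicates into one token.
    collapsed = [token for token, _run in groupby(path)]
    # Rule 2: remove blanks. Order matters: dropping blanks first would
    # merge the two "l" runs in "hello" into a single "l".
    return [token for token in collapsed if token != blank]


def ctc_greedy_decode(frame_logprobs, blank=BLANK):
    """frame_logprobs: one dict per frame mapping token -> log probability."""
    # Argmax at every frame, independently of every other frame.
    best_path = [max(frame, key=frame.get) for frame in frame_logprobs]
    return ctc_collapse(best_path, blank)


print("".join(ctc_collapse("hh_ell_loo_")))  # hello
```

Swapping the order of the two rules would turn "hello" into "helo", which is why the order is fixed.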

    Greedy decoding is fast and needs nothing but the model. Its weakness is that it commits to the best token at each frame independently. It never reconsiders. If the locally best choice at frame 40 makes the overall sentence less likely, greedy decoding cannot back out. For clean audio this is often fine. For accents, noise, rare names, and domain jargon it leaves accuracy on the table.

    Beam Search: Keeping Several Hypotheses Alive



    Beam search fixes greedy's myopia by tracking the best `B` partial transcripts at once, where `B` is the beam width. At each step every surviving hypothesis is extended by the plausible next tokens, all the new candidates are scored, and only the top `B` are kept. At the end you return the highest-scoring complete hypothesis.

    The score of a hypothesis is the sum of log probabilities of its tokens. Log space matters: probabilities of long sequences underflow to zero if you multiply them directly, and addition in log space is both numerically stable and cheaper.

    score(hypothesis) = sum over tokens of log P(token | audio, previous tokens)
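
A quick way to see why log space matters is to score the same sequence both ways. The per-token probability and sequence length below are illustrative values chosen only to trigger underflow in double precision.

```python
import math

token_prob = 0.1   # illustrative per-token probability
num_tokens = 400   # illustrative sequence length

# Multiplying raw probabilities: the product drops below the smallest
# representable double and becomes exactly 0.0.
product = 1.0
for _ in range(num_tokens):
    product *= token_prob

# Summing log probabilities: the score stays finite and comparable.
log_score = sum(math.log(token_prob) for _ in range(num_tokens))

print(product)                 # 0.0
print(math.isfinite(log_score))  # True
```

Once two hypotheses both score 0.0 they can no longer be ranked, whereas their log scores still differ.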
    


    A simplified beam step looks like this:

    def beam_step(beams, frame_logprobs, beam_width):
        candidates = []
        for text, score in beams:
            for token, token_logprob in frame_logprobs:
                candidates.append((text + [token], score + token_logprob))
        # keep only the most promising B hypotheses
        candidates.sort(key=lambda c: c[1], reverse=True)
        return candidates[:beam_width]
    


    Real CTC beam search is more careful than this sketch because many token sequences collapse to the same text after the blank-and-duplicate rules. A correct implementation merges hypotheses that map to the same string and sums their probabilities, so a transcript that can be reached by several paths is correctly judged more likely. That merging is the difference between a toy beam search and one that actually beats greedy.
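
The merging step can be sketched as a prefix beam search. This is a simplified version under stated assumptions: frames are dicts of token to log probability, the blank is `_`, and there is no language model or pruning beyond the beam itself. For each prefix it tracks two scores, paths that currently end in blank and paths that end in a non-blank, because only the former can extend a prefix with a repeat of its last token.

```python
import math
from collections import defaultdict

NEG_INF = float("-inf")


def logsumexp(*values):
    """Stable log(sum(exp(v))) that tolerates -inf entries."""
    top = max(values)
    if top == NEG_INF:
        return NEG_INF
    return top + math.log(sum(math.exp(v - top) for v in values))


def ctc_prefix_beam_search(frames, beam_width, blank="_"):
    """Return [(tokens, log_prob)] sorted best first, merging equivalent paths."""
    # prefix -> (log P of paths ending in blank, log P of paths ending in non-blank)
    beams = {(): (0.0, NEG_INF)}
    for frame in frames:
        nxt = defaultdict(lambda: (NEG_INF, NEG_INF))
        for prefix, (p_b, p_nb) in beams.items():
            for token, lp in frame.items():
                if token == blank:
                    # Blank keeps the text unchanged; the path now ends in blank.
                    b, nb = nxt[prefix]
                    nxt[prefix] = (logsumexp(b, p_b + lp, p_nb + lp), nb)
                elif prefix and token == prefix[-1]:
                    # Repeat after a non-blank collapses into the same text.
                    b, nb = nxt[prefix]
                    nxt[prefix] = (b, logsumexp(nb, p_nb + lp))
                    # Repeat after a blank is a genuinely new token.
                    ext = prefix + (token,)
                    b, nb = nxt[ext]
                    nxt[ext] = (b, logsumexp(nb, p_b + lp))
                else:
                    ext = prefix + (token,)
                    b, nb = nxt[ext]
                    nxt[ext] = (b, logsumexp(nb, p_b + lp, p_nb + lp))
        ranked = sorted(nxt.items(), key=lambda kv: logsumexp(*kv[1]), reverse=True)
        beams = dict(ranked[:beam_width])
    return [(list(prefix), logsumexp(p_b, p_nb)) for prefix, (p_b, p_nb) in beams.items()]


# Two frames where blank is the argmax both times (0.6 vs 0.4).
frames = [{"_": math.log(0.6), "a": math.log(0.4)}] * 2
best_tokens, best_logprob = ctc_prefix_beam_search(frames, beam_width=4)[0]
print(best_tokens, round(math.exp(best_logprob), 2))  # ['a'] 0.64
```

In this toy case greedy decoding picks blank at both frames and returns an empty transcript with probability 0.36, while the three paths `a_`, `_a`, and `aa` all collapse to "a" and together carry 0.64. Merging is what lets the search see that.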

    How Wide Should the Beam Be



    Beam width trades accuracy for cost and latency.

    | Beam width | Behavior | Typical use |
    |---|---|---|
    | 1 | Identical to greedy decoding | Real-time, edge devices, cost-sensitive batch |
    | 4 to 8 | Most of the accuracy gain, modest cost | Common production default |
    | 16 to 32 | Diminishing returns, useful with LM fusion | High-stakes offline transcription |
    | 100+ | Rarely worth it, research or rescoring | Lattice generation for downstream rescoring |

    The accuracy curve is sharply concave. Going from beam 1 to beam 8 usually helps a lot. Going from beam 8 to beam 64 usually helps a little while multiplying compute. Pick the smallest beam that hits your accuracy target, because every extra unit of width is paid on every frame of every file you ever transcribe.

    Decoding Differs by Model Family



    The three dominant ASR architectures expose different decoding problems. Knowing which one you run tells you which knobs exist.

    CTC Decoding



    CTC models score every frame independently given the audio and have no explicit model of token-to-token dependencies. That makes CTC decoding fast and naturally parallel, and it makes external language models especially valuable because the acoustic model alone has a weak sense of which word sequences are plausible. CTC beam search with an n-gram or neural LM is a classic, strong, low-latency combination.

    Transducer Decoding (RNN-T and Variants)



    Transducer models add a prediction network that conditions on the tokens emitted so far, so the model itself has a built-in sense of language. Decoding alternates between two moves at each step: emit a token, or advance to the next audio frame. This is what makes transducers the workhorse of streaming ASR. They can emit words as audio arrives without waiting for the end of the utterance. The decoding bookkeeping is more intricate because the number of tokens emitted per frame is variable, but the payoff is true low-latency streaming.
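
The emit-or-advance loop is easier to follow in code. The sketch below is a greedy transducer decode under strong simplifications: `joint` is a hypothetical callable standing in for the encoder, prediction network, and joint network combined, and `max_symbols_per_frame` is an assumed safety cap so a misbehaving model cannot emit forever on one frame.

```python
import math


def transducer_greedy_decode(num_frames, joint, blank="_", max_symbols_per_frame=5):
    """joint(t, tokens) -> dict of token -> log prob (hypothetical stand-in)."""
    tokens = []
    t = 0
    emitted_here = 0
    while t < num_frames:
        scores = joint(t, tokens)
        best = max(scores, key=scores.get)
        if best == blank or emitted_here >= max_symbols_per_frame:
            t += 1              # advance to the next audio frame
            emitted_here = 0
        else:
            tokens.append(best)  # emit a token and stay on the same frame
            emitted_here += 1
    return tokens


# Toy joint: token k of the plan is emitted when decoding reaches its frame.
PLAN = [("h", 0), ("i", 2), ("!", 2)]


def toy_joint(t, tokens):
    k = len(tokens)
    if k < len(PLAN) and PLAN[k][1] == t:
        return {PLAN[k][0]: math.log(0.9), "_": math.log(0.1)}
    return {"_": math.log(0.9)}


print(transducer_greedy_decode(num_frames=4, joint=toy_joint))  # ['h', 'i', '!']
```

Frame 2 emits two tokens and frames 1 and 3 emit none, which is the variable tokens-per-frame bookkeeping described above. Because the loop only ever looks at the current frame and the tokens so far, it can run as audio arrives.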

    Attention Encoder-Decoder Decoding



    Attention models such as Whisper generate text autoregressively, one token at a time, each token attending over the whole encoded audio. Decoding here is ordinary sequence generation: greedy or beam search over the decoder, often with a temperature and sampling fallback. This family produces the most fluent, best-punctuated, most multilingual output, but it is not naturally streaming because the decoder wants to see the full audio, and its fluency is exactly what makes it prone to confabulation on silence or noise, covered below.

    | Family | Streaming | Built-in language model | Decoding cost | Typical strength |
    |---|---|---|---|---|
    | CTC | Yes | No | Lowest | Speed, easy external LM fusion |
    | Transducer | Yes | Yes (prediction net) | Medium | Low-latency streaming |
    | Attention | Awkward | Yes (decoder) | Highest | Fluency, punctuation, multilingual |

    Language Model Fusion: Teaching the Decoder What Is Plausible



    An acoustic model knows what the audio sounds like. It does not necessarily know that "wreck a nice beach" should be "recognize speech," or that your product is spelled "Mixpeek" and not "mix peak." A language model supplies that prior over word sequences, and fusing it into the decoder is one of the highest-leverage accuracy moves available.

    The most common method is shallow fusion. During beam search you add the language model's log probability, scaled by a weight, to each hypothesis score:

    combined = acoustic_logprob + lm_weight * lm_logprob + insertion_bonus * num_tokens
    


    Three terms, three jobs. The acoustic term keeps the transcript faithful to the audio. The LM term pulls toward fluent, plausible word sequences. The insertion bonus counteracts a subtle bias: adding the LM term tends to penalize longer hypotheses (more tokens means more negative log probabilities summed), so without a length correction the decoder quietly prefers dropping words. The insertion bonus, sometimes called a word or length reward, compensates.
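
A small numeric sketch shows the length bias and its correction. All scores below are invented for illustration; only the formula comes from the text.

```python
def shallow_fusion_score(acoustic_logprob, lm_logprob, num_tokens,
                         lm_weight=0.5, insertion_bonus=0.0):
    """Combined hypothesis score, term for term as in the formula above."""
    return acoustic_logprob + lm_weight * lm_logprob + insertion_bonus * num_tokens


# Illustrative hypotheses: the shorter one drops a word that was spoken.
full = {"text": "please refund my order", "acoustic": -4.0, "lm": -8.0, "tokens": 4}
short = {"text": "please refund order", "acoustic": -4.6, "lm": -6.0, "tokens": 3}


def pick(hypotheses, **weights):
    """Return the text of the highest-scoring hypothesis."""
    return max(
        hypotheses,
        key=lambda h: shallow_fusion_score(h["acoustic"], h["lm"], h["tokens"], **weights),
    )["text"]


print(pick([full, short], insertion_bonus=0.0))  # please refund order
print(pick([full, short], insertion_bonus=0.5))  # please refund my order
```

With no bonus the shorter hypothesis wins (-7.6 against -8.0) even though its acoustic score is worse. A per-token bonus of 0.5 flips the ranking (-6.0 against -6.1).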

    Tuning `lm_weight` is a balance. Too low and the LM does nothing. Too high and the decoder starts ignoring the audio and writing fluent sentences that were never spoken. You tune it on a held-out set against word error rate, not by feel.
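
One inexpensive way to tune the weight is to rescore saved n-best lists on a held-out set and measure word error rate at each candidate value. The sketch below assumes each held-out utterance comes with a reference transcript and a list of `(text, acoustic_logprob, lm_logprob)` hypotheses; the toy data and candidate weights are invented for illustration.

```python
def edit_distance(ref_words, hyp_words):
    """Word-level Levenshtein distance (substitutions, insertions, deletions)."""
    prev = list(range(len(hyp_words) + 1))
    for i, r in enumerate(ref_words, 1):
        cur = [i]
        for j, h in enumerate(hyp_words, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (r != h)))
        prev = cur
    return prev[-1]


def pick_lm_weight(dev_set, weights):
    """Return (best_weight, {weight: corpus WER}) from n-best rescoring."""
    wer_by_weight = {}
    for w in weights:
        errors = ref_len = 0
        for reference, nbest in dev_set:
            best_text = max(nbest, key=lambda h: h[1] + w * h[2])[0]
            errors += edit_distance(reference.split(), best_text.split())
            ref_len += len(reference.split())
        wer_by_weight[w] = errors / ref_len
    return min(wer_by_weight, key=wer_by_weight.get), wer_by_weight


dev_set = [
    ("recognize speech", [("wreck a nice beach", -3.0, -12.0),
                          ("recognize speech", -3.5, -4.0)]),
    ("call me at nine", [("call me at nine", -2.0, -9.0),
                         ("call me at night", -6.0, -7.0)]),
]
best_weight, wer_by_weight = pick_lm_weight(dev_set, weights=[0.0, 0.5, 3.0])
print(best_weight)  # 0.5
```

In this toy set a weight of 0.0 leaves the first utterance wrong, and a weight of 3.0 lets the LM overrule clear audio in the second. The middle value gets both right, which is the balance the paragraph above describes.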

    Other fusion methods exist (deep fusion and cold fusion mix the LM inside the network during training rather than only at decode time), but shallow fusion stays popular because it needs no retraining and lets you swap the LM per domain. That swappability matters operationally: a medical transcription LM and a call-center LM can ride on top of the same acoustic model.

    Contextual Biasing: Getting the Hard Words Right



    Generic language models help with generic fluency. They do not know your specific account names, SKUs, drug names, place names, or the proper nouns in today's meeting. Those rare-but-critical words are exactly the ones an agent most needs to search on, and they are exactly the ones a vanilla decoder gets wrong, because the model has barely seen them.

    Contextual biasing injects a list of expected terms into decoding so the decoder is more willing to choose them when the audio is close. Approaches include boosting the scores of paths that match a bias phrase, building the bias list into a small on-the-fly weighted graph, and attention-based biasing where the model attends over a provided list of phrases.

    The practical pattern for an agent system: assemble the bias list dynamically from context you already have. Meeting attendee names, the customer's product entitlements, the catalog of SKUs, the speakers expected on the call. A small, well-chosen bias list often moves accuracy on the exact terms that retrieval depends on, while a giant indiscriminate list mostly adds false positives. Curate it.
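
The simplest of these approaches, score boosting, can be sketched as a rescoring pass over candidate transcripts. Real systems usually apply the boost inside beam search at the token level. The version below works on whole hypotheses to keep the idea visible, and the `boost` value and example scores are invented for illustration.

```python
import re


def bias_hits(text, bias_phrases):
    """Count whole-phrase, case-insensitive matches of any bias phrase."""
    return sum(
        len(re.findall(r"\b" + re.escape(phrase) + r"\b", text, flags=re.IGNORECASE))
        for phrase in bias_phrases
    )


def rescore_with_bias(nbest, bias_phrases, boost=2.0):
    """nbest: list of (text, log_score). Returns the list re-ranked best first."""
    boosted = [(text, score + boost * bias_hits(text, bias_phrases)) for text, score in nbest]
    return sorted(boosted, key=lambda pair: pair[1], reverse=True)


# Bias list assembled from context the system already has for this call.
bias_phrases = ["Mixpeek", "SKU-447B"]
nbest = [
    ("ask mix peak about sku 447 b", -5.0),
    ("ask Mixpeek about SKU-447B", -6.2),
]
print(rescore_with_bias(nbest, bias_phrases)[0][0])  # ask Mixpeek about SKU-447B
```

The boost only changes the outcome when a biased hypothesis is already close in score, which is the intended behavior. A very long bias list raises the chance that some phrase matches by accident, which is where the false positives come from.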

    Repetition, Hallucination, and Other Decoder Pathologies



    Decoders fail in characteristic ways. Recognizing the symptom tells you which knob to reach for.

    Hallucinated speech on non-speech. Attention models such as Whisper, asked to transcribe silence, music, or noise, will sometimes emit fluent, confident, entirely fabricated text, often a phrase from their training data like a sign-off or a subtitle credit. The model is a language generator at heart and abhors emitting nothing. The durable fix is upstream: run voice activity detection first and only decode segments that actually contain speech. Decoder-side mitigations include a no-speech probability threshold and temperature fallback.

    Repetition loops. Autoregressive decoders can fall into a loop, repeating a word or phrase many times. This is a known failure of greedy and low-temperature decoding when the model gets stuck in a high-probability cycle. Mitigations include a repetition penalty, a compression-ratio check that rejects output whose text is suspiciously repetitive, and temperature fallback that re-decodes a failed segment with increasing randomness.
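
A compression-ratio check is simple to sketch with the standard library: text that repeats itself compresses far better than ordinary speech. The 2.4 threshold below is an illustrative default, and the right value should be confirmed on your own data.

```python
import zlib


def compression_ratio(text):
    """Raw byte length divided by zlib-compressed length."""
    data = text.encode("utf-8")
    return len(data) / len(zlib.compress(data))


def looks_like_repetition_loop(text, threshold=2.4):
    """Flag output that compresses suspiciously well."""
    return compression_ratio(text) > threshold


looping = "thank you for watching " * 40
normal = "The customer asked whether the outage affected their refund."

print(looks_like_repetition_loop(looping))  # True
print(looks_like_repetition_loop(normal))   # False
```

The check is cheap enough to run on every segment, and a segment that trips it is a natural candidate for a second decode.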

    Timestamp drift on long audio. Long files decoded in one pass accumulate timing error. Chunked decoding with overlap, plus a forced-alignment pass afterward, keeps citations honest. Decoding gives you the words; alignment snaps them to the right second.
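
Chunking with overlap comes down to computing window boundaries. The sketch below assumes 30-second chunks with 2 seconds of overlap, both illustrative values, and leaves the stitching of overlapping transcripts and the alignment pass out of scope.

```python
def chunk_windows(total_ms, chunk_ms=30_000, overlap_ms=2_000):
    """Return (start_ms, end_ms) windows that cover the file with overlap."""
    if not 0 <= overlap_ms < chunk_ms:
        raise ValueError("overlap_ms must be non-negative and smaller than chunk_ms")
    step = chunk_ms - overlap_ms
    windows = []
    start = 0
    while start < total_ms:
        end = min(start + chunk_ms, total_ms)
        windows.append((start, end))
        if end == total_ms:
            break  # the last window reached the end of the file
        start += step
    return windows


print(chunk_windows(70_000))
# [(0, 30000), (28000, 58000), (56000, 70000)]
```

Each window starts from a known offset, so word times inside a chunk are relative to that offset and timing error cannot accumulate across the whole file.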

    Over-aggressive LM fusion. If the transcript reads more fluently than the audio could justify and starts inventing reasonable-sounding words, the language-model weight is too high. Lower it and re-measure.

    A defensive decoding configuration for offline transcription usually combines: VAD gating, a moderate beam, a no-speech threshold, a repetition penalty, a compression-ratio sanity check, and temperature fallback for segments that trip any check.
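
That combination can be sketched as a small control loop. Everything model-specific is passed in as a callable, and all three (`has_speech`, `decode`, `passes_checks`) are hypothetical placeholders for your VAD, your decoder, and whatever sanity checks you run. The temperature schedule is an illustrative choice.

```python
def defensive_decode(segment, has_speech, decode, passes_checks,
                     temperatures=(0.0, 0.2, 0.4, 0.6, 0.8, 1.0)):
    """VAD gate first, then retry decoding at rising temperature until checks pass."""
    if not has_speech(segment):
        # Never hand non-speech to a generative decoder.
        return {"text": "", "temperature": None, "status": "skipped_no_speech"}
    attempt = {"text": ""}
    temperature = None
    for temperature in temperatures:
        attempt = decode(segment, temperature)
        if passes_checks(attempt):
            return {"text": attempt["text"], "temperature": temperature, "status": "ok"}
    # Every temperature failed: keep the last attempt but flag it.
    return {"text": attempt["text"], "temperature": temperature, "status": "needs_review"}


# Toy stand-ins: the decoder loops at temperature 0.0 and recovers at 0.2.
def toy_decode(segment, temperature):
    if temperature == 0.0:
        return {"text": "thank you thank you thank you thank you"}
    return {"text": "thank you for calling support"}


def toy_checks(attempt):
    words = attempt["text"].split()
    return len(set(words)) > len(words) / 2  # crude repetition check


print(defensive_decode("seg-1", lambda s: True, toy_decode, toy_checks))
# {'text': 'thank you for calling support', 'temperature': 0.2, 'status': 'ok'}
```

Returning a status alongside the text means a segment that never passes is flagged for review instead of silently entering the index.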

    Confidence: The Signal an Agent Needs to Trust the Transcript



    A transcript without confidence is a claim with no error bars. For an agent that has to decide whether to trust, double-check, or escalate, per-token and per-segment confidence is not a luxury; it is the input to that decision.

    Confidence falls out of decoding naturally because the decoder already has probabilities. The common signals:

  1. Token posterior. The probability the decoder assigned to each emitted token, typically the softmax probability of the chosen token at the step where it was emitted.
  2. Hypothesis margin. The score gap between the top beam and the runner-up. A large gap means the decoder was sure. A razor-thin gap means it nearly chose different words.
  3. Acoustic-LM agreement. Whether the acoustic term and the LM term agreed on the chosen path, or whether the LM had to overrule the audio.
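
The first two signals can be computed directly from what the decoder already holds. The sketch below assumes you have the log probability of each token in a word and the final scores of the n-best hypotheses. Aggregating a word's tokens by geometric mean is one reasonable choice among several (minimum and product are also common).

```python
import math


def word_confidence(token_logprobs):
    """Geometric mean of the token probabilities that make up one word."""
    return math.exp(sum(token_logprobs) / len(token_logprobs))


def hypothesis_margin(nbest_scores):
    """Log-score gap between the best hypothesis and the runner-up."""
    top, runner_up = sorted(nbest_scores, reverse=True)[:2]
    return top - runner_up


# A word made of two subword tokens with probabilities 0.9 and 0.4.
print(round(word_confidence([math.log(0.9), math.log(0.4)]), 2))  # 0.6
# A close call: the top two hypotheses differ by only 0.1 in log score.
print(round(hypothesis_margin([-4.0, -9.0, -4.1]), 2))  # 0.1
```

A single weak subword pulls the word score down, which is useful for rare names that are split into several pieces.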


    Two cautions. First, raw model probabilities are usually overconfident and need calibration before a threshold means anything. When a calibrated model reports 0.99, it is wrong about one time in a hundred at that level; untuned models are usually wrong more often than their scores claim. Second, store confidence as structured metadata, not inside the transcript text. Inlining "[0.62]" into the words pollutes any embedding you build from that text.
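
Whether confidences are calibrated can be measured on a labeled held-out set. A common summary is expected calibration error: bucket words by reported confidence and compare each bucket's average confidence with its actual accuracy. The sketch below assumes you have, per word, a confidence and a boolean for whether the word was correct. The bin count and toy data are illustrative.

```python
def expected_calibration_error(confidences, correct, num_bins=10):
    """Weighted average gap between reported confidence and observed accuracy."""
    bins = [[] for _ in range(num_bins)]
    for conf, ok in zip(confidences, correct):
        index = min(int(conf * num_bins), num_bins - 1)  # conf == 1.0 goes in the last bin
        bins[index].append((conf, ok))
    total = len(confidences)
    ece = 0.0
    for bucket in bins:
        if not bucket:
            continue
        avg_conf = sum(conf for conf, _ in bucket) / len(bucket)
        accuracy = sum(1 for _, ok in bucket if ok) / len(bucket)
        ece += (len(bucket) / total) * abs(avg_conf - accuracy)
    return ece


outcomes = [True] * 7 + [False] * 3          # 70% of these words were right
overconfident = expected_calibration_error([0.95] * 10, outcomes)
calibrated = expected_calibration_error([0.70] * 10, outcomes)
print(round(overconfident, 2), round(calibrated, 2))  # 0.25 0.0
```

A model that reports 0.95 on words it gets right 70% of the time shows a gap of 0.25. After calibration the same words would report about 0.70 and the gap closes, at which point a threshold on confidence starts to mean something.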

    A useful per-word record:

    {
      "word": "Mixpeek",
      "start_ms": 14820,
      "end_ms": 15240,
      "confidence": 0.71,
      "decoded_by": "ctc_beam_lm",
      "was_biased": true
    }
    


    An agent can use this. Low-confidence proper nouns can trigger a clarification, a second-pass decode, or a flag on the citation. High-confidence spans can be cited without hedging.
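
As a sketch of that decision, the function below maps a per-word record to one of three actions. The thresholds and action names are hypothetical and only make sense once confidences are calibrated for your model.

```python
def triage_word(record, trust_at=0.85, escalate_below=0.5):
    """Map a per-word record to an action. Thresholds are illustrative."""
    confidence = record["confidence"]
    if confidence >= trust_at:
        return "cite"       # quote without hedging
    if confidence < escalate_below:
        return "escalate"   # ask for clarification or flag the citation
    return "verify"         # worth a second-pass decode


record = {
    "word": "Mixpeek",
    "start_ms": 14820,
    "end_ms": 15240,
    "confidence": 0.71,
    "decoded_by": "ctc_beam_lm",
    "was_biased": True,
}
print(triage_word(record))  # verify
```

Keeping the rule in one place makes it easy to adjust the thresholds when the model or its calibration changes.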

    Why This Determines What an Agent Can Hear



    Every retrieval question over spoken media inherits the decoder's choices. A semantic search for "the customer asked about reliability" depends on the decoder having produced the right words near that moment. An exact-match search for "SKU-447B" depends on contextual biasing having gotten that token right against a weak acoustic prior. A confidence-aware agent that knows when not to trust a transcript depends on the decoder having surfaced calibrated probabilities. The decoder is not plumbing beneath the interesting part. For audio, it is a large share of the interesting part.

    Mixpeek Example



    In Mixpeek, treat decoding configuration as part of the perception contract for an audio collection, not as a hidden default. The goal is a transcript with word timestamps and confidence that downstream retrieval can both search and trust.

    from mixpeek import Mixpeek

    mx = Mixpeek(api_key="YOUR_API_KEY")

    mx.ingest.audio(
        collection_id="support-calls",
        source="s3://support-calls/2026/06/",
        feature_extractors=[
            {
                "name": "audio_transcription",
                "model_id": "Qwen/Qwen3-ASR-1.7B",
                "params": {
                    "return_text": True,
                    "return_word_timestamps": True,
                    "return_word_confidence": True,
                    "beam_size": 8,
                    "no_speech_threshold": 0.6,
                    "compression_ratio_threshold": 2.4,
                    "bias_phrases": ["Mixpeek", "SKU-447B", "MVS", "Qdrant"]
                }
            },
            {
                "name": "text_embedding",
                "model_id": "BAAI/bge-m3",
                "params": {"source_field": "transcript_spans.text"}
            }
        ]
    )


    The agent-facing retriever then searches the clean transcript text while keeping confidence and timing as filterable metadata, so an agent can prefer high-confidence evidence and still deep-link to the exact moment:

    results = mx.retrievers.retrieve(
        collection_ids=["support-calls"],
        stages=[
            {
                "type": "hybrid_search",
                "feature": "transcript_spans",
                "query": "customer asked about reliability after the outage",
                "top_k": 50
            },
            {
                "type": "filter",
                "field": "word_confidence_min",
                "operator": ">=",
                "value": 0.5
            },
            {
                "type": "rerank",
                "model_id": "Qwen/Qwen3-Reranker-4B",
                "top_k": 10
            }
        ],
        return_fields=["source_uri", "text", "start_ms", "end_ms", "word_confidence_min"]
    )
    


    The decoding parameters are now explicit and auditable. If you later raise the beam, change the bias list, or swap the acoustic model, that is a recorded change to how the collection hears, and you know to backfill.
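
One lightweight way to make such changes detectable is to store a fingerprint of the decoding parameters next to the collection. This is a generic sketch using only the standard library and is not a Mixpeek feature. The function names are made up for illustration.

```python
import hashlib
import json


def config_fingerprint(config):
    """Stable hash of a decoding config, independent of key order."""
    canonical = json.dumps(config, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def needs_backfill(stored_fingerprint, current_config):
    """True when the config differs from the one the collection was built with."""
    return stored_fingerprint != config_fingerprint(current_config)


original = {"beam_size": 8, "no_speech_threshold": 0.6, "bias_phrases": ["Mixpeek", "SKU-447B"]}
stored = config_fingerprint(original)

reordered = {"bias_phrases": ["Mixpeek", "SKU-447B"], "no_speech_threshold": 0.6, "beam_size": 8}
wider_beam = dict(original, beam_size=16)

print(needs_backfill(stored, reordered))   # False
print(needs_backfill(stored, wider_beam))  # True
```

Key order does not affect the fingerprint, but list order does, so sort the bias list before hashing if its order carries no meaning.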

    Design Checklist



  1. Decide greedy versus beam per workload, and use the smallest beam that hits your accuracy target.
  2. Use language-model fusion for domains with jargon, and tune the LM weight against word error rate, not by feel.
  3. Build a contextual bias list from names, SKUs, and speakers you already know for each job.
  4. Gate decoding with voice activity detection so attention models do not hallucinate on silence.
  5. Add repetition penalty, no-speech threshold, and compression-ratio checks for offline transcription.
  6. Emit per-word and per-segment confidence, and calibrate it before thresholding.
  7. Store confidence and timing as metadata, never inline in the transcript text.
  8. Record the decoding configuration with the collection so changes are auditable and trigger backfill.


    Key Takeaways



    1. Speech models output probability lattices. The decoder turns them into text, and that step decides much of transcript quality.

    2. Greedy decoding is fast but myopic. Beam search keeps several hypotheses alive and usually wins most of its gain by width 8.

    3. CTC, transducer, and attention models expose different decoding problems: external LM fusion, streaming emit-or-advance, and autoregressive generation respectively.

    4. Language-model fusion and contextual biasing are how you get jargon and proper nouns right, which are exactly the terms retrieval depends on.

    5. Hallucination on non-speech and repetition loops are decoder pathologies with known fixes: VAD gating, no-speech thresholds, repetition penalties, and temperature fallback.

    6. Calibrated per-word confidence, stored as metadata, is what lets an agent decide whether to trust, verify, or escalate a transcript.

    Further Reading



  1. Audio Feature Extraction: How AI Agents Learn to Hear
  2. Forced Alignment for AI Agents: Word Timestamps, Diarization, and Audio Evidence Search
  3. Speaker Diarization: How AI Agents Know Who Said What in Audio and Video
  4. Calibrating Similarity Scores
  5. Managed Mixpeek
