The Transcript Is a Decision, Not a Model Output
An AI agent that searches spoken media almost always sits downstream of automatic speech recognition. Calls, meetings, podcasts, lectures, support tickets with voicemail attachments, video with dialogue: all of it becomes searchable only after speech becomes text. If the text is wrong, every layer above it is wrong too, and the failure is silent. The agent retrieves a chunk, the chunk contains plausible words, and the words are subtly wrong in a way no similarity score can detect.
Here is the part most pipelines miss: a speech model does not emit a transcript. It emits, for each short slice of audio, a probability distribution over possible tokens. A one-minute clip produces thousands of these distributions stacked into a lattice. Turning that lattice into a single string of words is a separate algorithm called the decoder, and the decoder is where a large fraction of transcript quality is decided. Two systems can use the exact same acoustic model and produce noticeably different transcripts purely because of how they decode.
This guide is about that step. Not how the model is trained, but how its raw outputs become the text an agent will search, cite, and reason over.
What Comes Out of the Acoustic Model
After the audio is turned into a spectrogram and pushed through the encoder, you have a sequence of frames, typically one every 20 to 40 milliseconds. For each frame the model produces a vector of scores, one score per token in the vocabulary. The vocabulary might be characters, subword pieces, or whole words, plus special symbols.
The single most important special symbol is the blank. Blank does not mean silence. It means "no new token emitted at this frame." Blank exists because audio frames are much more frequent than spoken tokens. A single word like "refund" might span fifteen frames, and the model needs a way to say "still the same word, nothing new yet."
So the raw output is a grid: frames down one axis, vocabulary across the other, a probability in every cell. The decoder's job is to walk this grid and choose one sequence of tokens that best explains the audio. Different model families constrain that walk differently, which is why decoding looks different for CTC, transducer, and attention models.
Greedy Decoding: The Cheap Baseline
The simplest decoder is greedy decoding, also called argmax decoding. At every frame, pick the single highest-probability token. Then clean up the result.
For a CTC model the cleanup has two rules, applied in this order:
1. Collapse consecutive duplicate tokens into one.
2. Remove all blank symbols.
A worked example. Suppose frame-by-frame argmax over the audio for the word "hello" produces this:
frames: h h _ e l l _ l o o _
step 1: h _ e l _ l o _ (collapse runs)
step 2: h e l l o (drop blanks)
result: h e l l o
Notice the two distinct "l" sounds survive because a blank sat between them. That blank is what tells the collapse rule "these are two letters, not one held letter." Remove blanks from the model and you can no longer spell "hello" or "letter" or "balloon." This is the blank's second job, alongside absorbing the frames where nothing new is emitted.
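The two cleanup rules can be sketched in a few lines of Python. This is a minimal illustration, not a production decoder: the function names and the `_` blank symbol are assumptions chosen to match the worked example above, and the score grid is made up.

```python
from itertools import groupby

BLANK = "_"  # assumed blank symbol, matching the worked example


def ctc_collapse(path, blank=BLANK):
    """Apply the two CTC cleanup rules to a frame-level token path."""
    # Rule 1: collapse each run of identical consecutive tokens into one.
    collapsed = [token for token, _ in groupby(path)]
    # Rule 2: drop every blank that remains.
    return [token for token in collapsed if token != blank]


def ctc_greedy_decode(frame_scores, vocab, blank=BLANK):
    """frame_scores: one list of scores per frame, aligned with vocab."""
    # Pick the highest-scoring vocabulary entry at each frame independently.
    path = [vocab[max(range(len(row)), key=row.__getitem__)] for row in frame_scores]
    return ctc_collapse(path, blank)


hello = ctc_collapse("h h _ e l l _ l o o _".split())
no_blank = ctc_collapse("h h e l l l o o".split())

# A tiny made-up grid: argmax per frame is a, a, _, a, b.
vocab = ["_", "a", "b"]
grid = [
    [0.1, 0.8, 0.1],
    [0.2, 0.7, 0.1],
    [0.9, 0.05, 0.05],
    [0.1, 0.6, 0.3],
    [0.1, 0.2, 0.7],
]
decoded = ctc_greedy_decode(grid, vocab)
```

Without the separating blank, the second path loses an "l", which is the failure the paragraph above describes.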
Greedy decoding is fast and needs nothing but the model. Its weakness is that it commits to the best token at each frame independently. It never reconsiders. If the locally best choice at frame 40 makes the overall sentence less likely, greedy decoding cannot back out. For clean audio this is often fine. For accents, noise, rare names, and domain jargon it leaves accuracy on the table.
Beam Search: Keeping Several Hypotheses Alive
Beam search fixes greedy's myopia by tracking the best `B` partial transcripts at once, where `B` is the beam width. At each step every surviving hypothesis is extended by the plausible next tokens, all the new candidates are scored, and only the top `B` are kept. At the end you return the highest-scoring complete hypothesis.
The score of a hypothesis is the sum of log probabilities of its tokens. Log space matters: probabilities of long sequences underflow to zero if you multiply them directly, and addition in log space is both numerically stable and cheaper.
score(hypothesis) = sum over tokens of log P(token | audio, previous tokens)
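A quick way to see why log space matters is to score a long hypothesis both ways. The sketch below uses made-up per-token probabilities; only the standard-library `math` module is involved.

```python
import math

# Hypothetical per-token probabilities for a 400-token hypothesis.
token_probs = [0.1] * 400

# Multiplying directly: the true value (1e-400) is below what a
# 64-bit float can represent, so the product collapses to zero.
product = 1.0
for p in token_probs:
    product *= p

# Summing logs: the same quantity stays a perfectly ordinary number.
log_score = sum(math.log(p) for p in token_probs)
```

Once a product hits zero, every hypothesis of that length looks equally impossible and the beam can no longer rank them. The summed log score keeps them comparable.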
A simplified beam step looks like this:
def beam_step(beams, frame_logprobs, beam_width):
    # beams: list of (token_list, score) pairs
    # frame_logprobs: iterable of (token, log_probability) pairs for one step
    candidates = []
    for text, score in beams:
        for token, token_logprob in frame_logprobs:
            candidates.append((text + [token], score + token_logprob))
    # keep only the most promising B hypotheses
    candidates.sort(key=lambda c: c[1], reverse=True)
    return candidates[:beam_width]
Real CTC beam search is more careful than this sketch because many token sequences collapse to the same text after the blank-and-duplicate rules. A correct implementation merges hypotheses that map to the same string and sums their probabilities, so a transcript that can be reached by several paths is correctly judged more likely. That merging is the difference between a toy beam search and one that actually beats greedy.
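To make the merging concrete, here is a compact sketch of CTC prefix beam search in pure Python. It follows the commonly described formulation in which each prefix carries two scores (paths ending in blank, paths ending in a non-blank token). The function names, the blank index, and the two-frame example are assumptions for illustration; a production decoder would add pruning of unlikely tokens and LM scoring.

```python
import math
from collections import defaultdict

NEG_INF = float("-inf")


def logsumexp(*xs):
    """Add probabilities that are stored as logs."""
    m = max(xs)
    if m == NEG_INF:
        return NEG_INF
    return m + math.log(sum(math.exp(x - m) for x in xs))


def ctc_prefix_beam_search(log_probs, vocab, beam_width=4, blank=0):
    """log_probs: list of frames, each a list of log-probs indexed like vocab."""
    # prefix -> (log-prob of paths ending in blank, ending in non-blank)
    beams = {(): (0.0, NEG_INF)}
    for frame in log_probs:
        nxt = defaultdict(lambda: (NEG_INF, NEG_INF))
        for prefix, (p_b, p_nb) in beams.items():
            for idx, lp in enumerate(frame):
                if idx == blank:
                    # Blank keeps the text unchanged.
                    b, nb = nxt[prefix]
                    nxt[prefix] = (logsumexp(b, p_b + lp, p_nb + lp), nb)
                    continue
                token = vocab[idx]
                extended = prefix + (token,)
                b, nb = nxt[extended]
                if prefix and prefix[-1] == token:
                    # A repeat only adds a new letter if a blank came first...
                    nxt[extended] = (b, logsumexp(nb, p_b + lp))
                    # ...otherwise it collapses back into the same prefix.
                    sb, snb = nxt[prefix]
                    nxt[prefix] = (sb, logsumexp(snb, p_nb + lp))
                else:
                    nxt[extended] = (b, logsumexp(nb, p_b + lp, p_nb + lp))
        # Rank merged prefixes by total probability and keep the top B.
        ranked = sorted(nxt.items(), key=lambda kv: logsumexp(*kv[1]), reverse=True)
        beams = dict(ranked[:beam_width])
    best, scores = max(beams.items(), key=lambda kv: logsumexp(*kv[1]))
    return best, logsumexp(*scores)


# Two frames, vocabulary [blank, "a"], blank slightly preferred at each frame.
vocab = ["_", "a"]
frames = [[math.log(0.6), math.log(0.4)]] * 2
wide_text, wide_score = ctc_prefix_beam_search(frames, vocab, beam_width=4)
narrow_text, _ = ctc_prefix_beam_search(frames, vocab, beam_width=1)
```

In this example the single best path is blank, blank (probability 0.36), so a width-1 search returns empty text. The text "a" is reachable through three paths (`a_`, `_a`, `aa`) totaling 0.64, and the merged search finds it.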
How Wide Should the Beam Be
Beam width trades accuracy for cost and latency.
| Beam width | Behavior | Typical use |
| --- | --- | --- |
| 1 | Identical to greedy decoding | Real-time, edge devices, cost-sensitive batch |
| 4 to 8 | Most of the accuracy gain, modest cost | Common production default |
| 16 to 32 | Diminishing returns, useful with LM fusion | High-stakes offline transcription |
| 100+ | Rarely worth it, research or rescoring | Lattice generation for downstream rescoring |
Decoding Differs by Model Family
The three dominant ASR architectures expose different decoding problems. Knowing which one you run tells you which knobs exist.
CTC Decoding
CTC models score every frame independently given the audio and have no explicit model of token-to-token dependencies. That makes CTC decoding fast and naturally parallel, and it makes external language models especially valuable because the acoustic model alone has a weak sense of which word sequences are plausible. CTC beam search with an n-gram or neural LM is a classic, strong, low-latency combination.
Transducer Decoding (RNN-T and Variants)
Transducer models add a prediction network that conditions on the tokens emitted so far, so the model itself has a built-in sense of language. Decoding alternates between two moves at each step: emit a token, or advance to the next audio frame. This is what makes transducers the workhorse of streaming ASR. They can emit words as audio arrives without waiting for the end of the utterance. The decoding bookkeeping is more intricate because the number of tokens emitted per frame is variable, but the payoff is true low-latency streaming.
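The emit-or-advance loop is easier to see in code. The sketch below is a greedy transducer decoder under stated assumptions: `joint` is a hypothetical callable standing in for the joint network (it sees the frame index and the tokens emitted so far), and `max_symbols_per_frame` is an assumed safeguard against emitting forever on one frame. The toy joint function is scripted purely to exercise the loop.

```python
def transducer_greedy_decode(num_frames, joint, blank="_", max_symbols_per_frame=5):
    """joint(t, emitted) -> dict of token -> score (hypothetical interface)."""
    emitted = []
    t = 0
    symbols_this_frame = 0
    while t < num_frames:
        scores = joint(t, emitted)
        token = max(scores, key=scores.get)
        if token == blank or symbols_this_frame >= max_symbols_per_frame:
            # Move 1: advance to the next audio frame.
            t += 1
            symbols_this_frame = 0
        else:
            # Move 2: emit a token and stay on the same frame.
            emitted.append(token)
            symbols_this_frame += 1
    return emitted


def toy_joint(t, emitted):
    """Scripted stand-in: frame 0 should emit 'h' then 'e', frame 2 emits 'y'."""
    script = {0: ["h", "e"], 2: ["y"]}
    emitted_before = sum(len(script.get(k, [])) for k in range(t))
    pending = script.get(t, [])[len(emitted) - emitted_before:]
    wanted = pending[0] if pending else "_"
    return {tok: (0.0 if tok == wanted else -10.0) for tok in ["_", "h", "e", "y"]}


tokens = transducer_greedy_decode(num_frames=3, joint=toy_joint)
```

Frame 0 produces two tokens, frame 1 produces none, and frame 2 produces one, which is the variable tokens-per-frame bookkeeping described above.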
Attention Encoder-Decoder Decoding
Attention models such as Whisper generate text autoregressively, one token at a time, each token attending over the whole encoded audio. Decoding here is ordinary sequence generation: greedy or beam search over the decoder, often with a temperature and sampling fallback. This family produces the most fluent, best-punctuated, most multilingual output, but it is not naturally streaming because the decoder wants to see the full audio, and its fluency is exactly what makes it prone to confabulation on silence or noise, covered below.
| Family | Streaming | Built-in language model | Decoding cost | Typical strength |
| --- | --- | --- | --- | --- |
| CTC | Yes | No | Lowest | Speed, easy external LM fusion |
| Transducer | Yes | Yes (prediction net) | Medium | Low-latency streaming |
| Attention | Awkward | Yes (decoder) | Highest | Fluency, punctuation, multilingual |
Language Model Fusion: Teaching the Decoder What Is Plausible
An acoustic model knows what the audio sounds like. It does not necessarily know that "wreck a nice beach" should be "recognize speech," or that your product is spelled "Mixpeek" and not "mix peak." A language model supplies that prior over word sequences, and fusing it into the decoder is one of the highest-leverage accuracy moves available.
The most common method is shallow fusion. During beam search you add the language model's log probability, scaled by a weight, to each hypothesis score:
combined = acoustic_logprob + lm_weight * lm_logprob + insertion_bonus * num_tokens
Three terms, three jobs. The acoustic term keeps the transcript faithful to the audio. The LM term pulls toward fluent, plausible word sequences. The insertion bonus counteracts a subtle bias: adding the LM term tends to penalize longer hypotheses (more tokens means more negative log probabilities summed), so without a length correction the decoder quietly prefers dropping words. The insertion bonus, sometimes called a word or length reward, compensates.
Tuning `lm_weight` is a balance. Too low and the LM does nothing. Too high and the decoder starts ignoring the audio and writing fluent sentences that were never spoken. You tune it on a held-out set against word error rate, not by feel.
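The effect of the weight is easy to demonstrate on two competing hypotheses. Everything numeric below is invented for illustration: the acoustic and LM log probabilities are hypothetical, and `fused_score` and `pick` are names made up for this sketch of the shallow-fusion formula above.

```python
def fused_score(acoustic_logprob, lm_logprob, num_tokens,
                lm_weight=0.5, insertion_bonus=0.0):
    """Shallow-fusion score: acoustic + weighted LM + length reward."""
    return acoustic_logprob + lm_weight * lm_logprob + insertion_bonus * num_tokens


# Hypothetical scores: the acoustics slightly prefer the wrong phrase,
# the language model strongly prefers the right one.
hypotheses = [
    {"text": "wreck a nice beach", "acoustic": -4.0, "lm": -18.0, "tokens": 4},
    {"text": "recognize speech", "acoustic": -4.5, "lm": -6.0, "tokens": 2},
]


def pick(hyps, lm_weight, insertion_bonus=0.0):
    """Return the text of the highest-scoring hypothesis."""
    best = max(
        hyps,
        key=lambda h: fused_score(h["acoustic"], h["lm"], h["tokens"],
                                  lm_weight, insertion_bonus),
    )
    return best["text"]


without_lm = pick(hypotheses, lm_weight=0.0)
with_lm = pick(hypotheses, lm_weight=0.5)
```

With the weight at zero the acoustic score decides and the wrong phrase wins. At 0.5 the LM term tips the choice. In practice you would sweep `lm_weight` and `insertion_bonus` over a grid and keep the pair with the lowest word error rate on held-out audio.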
Other fusion methods exist (deep fusion and cold fusion mix the LM inside the network during training rather than only at decode time), but shallow fusion stays popular because it needs no retraining and lets you swap the LM per domain. That swappability matters operationally: a medical transcription LM and a call-center LM can ride on top of the same acoustic model.
Contextual Biasing: Getting the Hard Words Right
Generic language models help with generic fluency. They do not know your specific account names, SKUs, drug names, place names, or the proper nouns in today's meeting. Those rare-but-critical words are exactly the ones an agent most needs to search on, and they are exactly the ones a vanilla decoder gets wrong, because the model has barely seen them.
Contextual biasing injects a list of expected terms into decoding so the decoder is more willing to choose them when the audio is close. Approaches include boosting the scores of paths that match a bias phrase, building the bias list into a small on-the-fly weighted graph, and attention-based biasing where the model attends over a provided list of phrases.
The practical pattern for an agent system: assemble the bias list dynamically from context you already have. Meeting attendee names, the customer's product entitlements, the catalog of SKUs, the speakers expected on the call. A small, well-chosen bias list often moves accuracy on the exact terms that retrieval depends on, while a giant indiscriminate list mostly adds false positives. Curate it.
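The simplest form of score boosting can be sketched as a rescoring pass. Note the swap: this example applies the boost to finished n-best hypotheses rather than inside the beam, which is easier to show but weaker than in-beam biasing because a biased phrase must already have survived pruning. The function names, boost value, and scores are all assumptions.

```python
def biased_score(base_score, text, bias_phrases, boost=2.0):
    """Add a fixed bonus for each bias phrase that appears in the text."""
    lowered = text.lower()
    hits = sum(1 for phrase in bias_phrases if phrase.lower() in lowered)
    return base_score + boost * hits


def rescore(hypotheses, bias_phrases, boost=2.0):
    """hypotheses: list of (text, log_score). Returns them best first."""
    rescored = [
        (text, biased_score(score, text, bias_phrases, boost))
        for text, score in hypotheses
    ]
    return sorted(rescored, key=lambda h: h[1], reverse=True)


# Hypothetical n-best list where the decoder slightly prefers the misspelling.
n_best = [
    ("we use mix peak for search", -10.0),
    ("we use Mixpeek for search", -11.2),
]

unbiased_top = rescore(n_best, bias_phrases=[])[0][0]
biased_top = rescore(n_best, bias_phrases=["Mixpeek", "SKU-447B"])[0][0]
```

The boost only flips the ranking when the two hypotheses were already close, which is the intended behavior: biasing should break near-ties, not override clear acoustic evidence.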
Repetition, Hallucination, and Other Decoder Pathologies
Decoders fail in characteristic ways. Recognizing the symptom tells you which knob to reach for.
Hallucinated speech on non-speech. Attention models such as Whisper, asked to transcribe silence, music, or noise, will sometimes emit fluent, confident, entirely fabricated text, often a phrase from their training data like a sign-off or a subtitle credit. The model is a language generator at heart and abhors emitting nothing. The durable fix is upstream: run voice activity detection first and only decode segments that actually contain speech. Decoder-side mitigations include a no-speech probability threshold and temperature fallback.
Repetition loops. Autoregressive decoders can fall into a loop, repeating a word or phrase many times. This is a known failure of greedy and low-temperature decoding when the model gets stuck in a high-probability cycle. Mitigations include a repetition penalty, a compression-ratio check that rejects output whose text is suspiciously repetitive, and temperature fallback that re-decodes a failed segment with increasing randomness.
Timestamp drift on long audio. Long files decoded in one pass accumulate timing error. Chunked decoding with overlap, plus a forced-alignment pass afterward, keeps citations honest. Decoding gives you the words; alignment snaps them to the right second.
Over-aggressive LM fusion. If the transcript reads more fluently than the audio could justify and starts inventing reasonable-sounding words, the language-model weight is too high. Lower it and re-measure.
A defensive decoding configuration for offline transcription usually combines: VAD gating, a moderate beam, a no-speech threshold, a repetition penalty, a compression-ratio sanity check, and temperature fallback for segments that trip any check.
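Two of those checks fit in a short sketch: the compression-ratio sanity check and temperature fallback. The ratio is computed with the standard-library `zlib` module. The `decode` callable is a hypothetical stand-in for your model (assumed to return text plus a no-speech probability), and the thresholds of 2.4 and 0.6 simply mirror the example configuration later in this guide.

```python
import zlib


def compression_ratio(text):
    """Highly repetitive text compresses well, so its ratio is large."""
    data = text.encode("utf-8")
    return len(data) / len(zlib.compress(data))


def decode_with_fallback(decode, segment,
                         temperatures=(0.0, 0.2, 0.4, 0.6, 0.8, 1.0),
                         compression_ratio_threshold=2.4,
                         no_speech_threshold=0.6):
    """decode(segment, temperature) -> (text, no_speech_prob); hypothetical."""
    text = ""
    for temperature in temperatures:
        text, no_speech_prob = decode(segment, temperature)
        if no_speech_prob > no_speech_threshold:
            return "", True  # treat the segment as non-speech
        if compression_ratio(text) <= compression_ratio_threshold:
            return text, True  # passed the repetition check
    # Every temperature tripped the check: return the last attempt, flagged.
    return text, False


def fake_decode(segment, temperature):
    """Scripted stand-in: loops at temperature 0, recovers when re-decoded."""
    if segment == "silence":
        return "Thanks for watching!", 0.95
    if temperature == 0.0:
        return "thank you " * 50, 0.05
    return "The customer asked about reliability after the outage.", 0.05


speech_text, speech_ok = decode_with_fallback(fake_decode, "call-17")
silence_text, silence_ok = decode_with_fallback(fake_decode, "silence")
loop_ratio = compression_ratio("thank you " * 50)
```

The boolean in the return value lets the caller store a flag on segments that never passed, rather than silently indexing suspect text.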
Confidence: The Signal an Agent Needs to Trust the Transcript
A transcript without confidence is a claim with no error bars. For an agent that has to decide whether to trust, double-check, or escalate, per-token and per-segment confidence is not a luxury. It is the input to that decision.
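Confidence falls out of decoding naturally because the decoder already has probabilities. Common signals include the posterior probability of each emitted token, the score margin between the best and second-best hypotheses, and the average log probability over a segment.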
Two cautions. First, raw model probabilities are usually overconfident and need calibration before a threshold means anything. A model that reports 0.99 should be wrong about one time in a hundred at that level, and uncalibrated models rarely meet that bar. Second, store confidence as structured metadata, not inside the transcript text. Inlining "[0.62]" into the words pollutes any embedding you build from that text.
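One way to check the first caution is to bin words by reported confidence and compare each bin's average confidence with how often the words were actually right. The sketch below computes a simple reliability table and expected calibration error. The function names and bin count are assumptions, and the input data is invented. You would feed it confidences and correctness labels from a held-out set scored against reference transcripts.

```python
def reliability_table(confidences, correct, num_bins=5):
    """Rows of (bin_low, bin_high, mean_confidence, accuracy, count)."""
    bins = [[] for _ in range(num_bins)]
    for conf, ok in zip(confidences, correct):
        index = min(int(conf * num_bins), num_bins - 1)
        bins[index].append((conf, ok))
    rows = []
    for i, items in enumerate(bins):
        if not items:
            continue
        mean_conf = sum(c for c, _ in items) / len(items)
        accuracy = sum(1 for _, ok in items if ok) / len(items)
        rows.append((i / num_bins, (i + 1) / num_bins, mean_conf, accuracy, len(items)))
    return rows


def expected_calibration_error(confidences, correct, num_bins=5):
    """Count-weighted average gap between confidence and accuracy."""
    total = len(confidences)
    rows = reliability_table(confidences, correct, num_bins)
    return sum(count / total * abs(mean_conf - acc)
               for _, _, mean_conf, acc, count in rows)


# Invented data: one model whose 0.9 means 0.9, one that is overconfident.
calibrated = expected_calibration_error([0.9] * 10, [True] * 9 + [False])
overconfident = expected_calibration_error([0.99] * 10, [True] * 5 + [False] * 5)
```

A large gap in the high-confidence bins is the signal that a threshold like 0.9 does not yet mean what it says.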
A useful per-word record:
{
"word": "Mixpeek",
"start_ms": 14820,
"end_ms": 15240,
"confidence": 0.71,
"decoded_by": "ctc_beam_lm",
"was_biased": true
}
An agent can use this. Low-confidence proper nouns can trigger a clarification, a second-pass decode, or a flag on the citation. High-confidence spans can be cited without hedging.
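A minimal sketch of that routing, assuming the per-word record shape shown above. The thresholds and action names are hypothetical and would be set from calibrated confidence on your own data.

```python
def word_action(record, cite_at=0.85, flag_at=0.5):
    """Map one per-word record to an action for the agent."""
    confidence = record["confidence"]
    if confidence >= cite_at:
        return "cite"
    if confidence >= flag_at:
        return "cite_with_flag"
    return "second_pass"


def span_action(records, cite_at=0.85, flag_at=0.5):
    """A span is only as trustworthy as its least certain word."""
    weakest = min(records, key=lambda r: r["confidence"])
    return word_action(weakest, cite_at, flag_at)


span = [
    {"word": "ask", "confidence": 0.97},
    {"word": "Mixpeek", "confidence": 0.71},
    {"word": "support", "confidence": 0.93},
]

single = word_action({"word": "Mixpeek", "confidence": 0.71})
whole_span = span_action(span)
garbled = word_action({"word": "SKU-447B", "confidence": 0.31})
```

Using the minimum rather than the mean keeps one badly decoded proper noun from hiding inside an otherwise confident sentence.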
Why This Determines What an Agent Can Hear
Every retrieval question over spoken media inherits the decoder's choices. A semantic search for "the customer asked about reliability" depends on the decoder having produced the right words near that moment. An exact-match search for "SKU-447B" depends on contextual biasing having gotten that token right against a weak acoustic prior. A confidence-aware agent that knows when not to trust a transcript depends on the decoder having surfaced calibrated probabilities. The decoder is not plumbing beneath the interesting part. For audio, it is a large share of the interesting part.
Mixpeek Example
In Mixpeek, treat decoding configuration as part of the perception contract for an audio collection, not as a hidden default. The goal is a transcript with word timestamps and confidence that downstream retrieval can both search and trust.
from mixpeek import Mixpeek

mx = Mixpeek(api_key="YOUR_API_KEY")

mx.ingest.audio(
    collection_id="support-calls",
    source="s3://support-calls/2026/06/",
    feature_extractors=[
        {
            "name": "audio_transcription",
            "model_id": "Qwen/Qwen3-ASR-1.7B",
            "params": {
                "return_text": True,
                "return_word_timestamps": True,
                "return_word_confidence": True,
                "beam_size": 8,
                "no_speech_threshold": 0.6,
                "compression_ratio_threshold": 2.4,
                "bias_phrases": ["Mixpeek", "SKU-447B", "MVS", "Qdrant"]
            }
        },
        {
            "name": "text_embedding",
            "model_id": "BAAI/bge-m3",
            "params": {"source_field": "transcript_spans.text"}
        }
    ]
)
The agent-facing retriever then searches the clean transcript text while keeping confidence and timing as filterable metadata, so an agent can prefer high-confidence evidence and still deep-link to the exact moment:
results = mx.retrievers.retrieve(
    collection_ids=["support-calls"],
    stages=[
        {
            "type": "hybrid_search",
            "feature": "transcript_spans",
            "query": "customer asked about reliability after the outage",
            "top_k": 50
        },
        {
            "type": "filter",
            "field": "word_confidence_min",
            "operator": ">=",
            "value": 0.5
        },
        {
            "type": "rerank",
            "model_id": "Qwen/Qwen3-Reranker-4B",
            "top_k": 10
        }
    ],
    return_fields=["source_uri", "text", "start_ms", "end_ms", "word_confidence_min"]
)
The decoding parameters are now explicit and auditable. If you later raise the beam, change the bias list, or swap the acoustic model, that is a recorded change to how the collection hears, and you know to backfill.
Design Checklist
Key Takeaways
1. Speech models output probability lattices. The decoder turns them into text, and that step decides much of transcript quality.
2. Greedy decoding is fast but myopic. Beam search keeps several hypotheses alive and usually wins most of its gain by width 8.
3. CTC, transducer, and attention models expose different decoding problems: external LM fusion, streaming emit-or-advance, and autoregressive generation respectively.
4. Language-model fusion and contextual biasing are how you get jargon and proper nouns right, which are exactly the terms retrieval depends on.
5. Hallucination on non-speech and repetition loops are decoder pathologies with known fixes: VAD gating, no-speech thresholds, repetition penalties, and temperature fallback.
6. Calibrated per-word confidence, stored as metadata, is what lets an agent decide whether to trust, verify, or escalate a transcript.