Technology that transcribes spoken words from audio recordings into text. ASR is a foundational capability for making audio and video content searchable by text in multimodal retrieval and processing systems.
Modern ASR systems use end-to-end neural models that map audio waveforms or spectrograms directly to text sequences. An encoder turns the audio into feature representations, and a decoder generates the transcript token by token, with attention mechanisms aligning audio frames to output tokens. Language model integration helps resolve acoustic ambiguities and improves the fluency of the output transcript.
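The sketch below illustrates this encode-then-decode flow; it assumes the Hugging Face transformers library and the openai/whisper-tiny checkpoint, which are illustrative choices rather than anything prescribed by this entry.

```python
# Minimal end-to-end ASR inference sketch. Assumes `pip install transformers`
# (plus torch) and the openai/whisper-tiny checkpoint; both are illustrative.
from transformers import WhisperProcessor, WhisperForConditionalGeneration

processor = WhisperProcessor.from_pretrained("openai/whisper-tiny")
model = WhisperForConditionalGeneration.from_pretrained("openai/whisper-tiny")

def transcribe(audio, sampling_rate=16_000):
    """Transcribe a 1-D array of raw audio samples (assumed 16 kHz mono)."""
    # Encoder input: the processor converts the waveform to a log-mel spectrogram.
    inputs = processor(audio, sampling_rate=sampling_rate, return_tensors="pt")
    # Decoder: autoregressively generate text tokens; cross-attention aligns
    # spectrogram frames with the tokens being emitted.
    predicted_ids = model.generate(inputs.input_features)
    return processor.batch_decode(predicted_ids, skip_special_tokens=True)[0]
```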
Leading models include Whisper (OpenAI, multilingual), Conformer (Google), and wav2vec 2.0 (Meta). Whisper supports 99 languages and includes built-in timestamp prediction. Output can include word-level timestamps, confidence scores, and language detection. Accuracy is measured by Word Error Rate (WER): the number of word substitutions, deletions, and insertions divided by the number of words in the reference transcript, so lower is better. Models range from tiny (39M parameters, real-time on CPU) to large (1.5B parameters, typically requiring a GPU).
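For concreteness, WER reduces to a word-level edit distance; the self-contained sketch below computes it with standard dynamic programming (the example strings are hypothetical).

```python
# Word Error Rate: word-level Levenshtein distance (substitutions + deletions
# + insertions) divided by the reference length. Assumes whitespace-tokenized,
# non-empty reference text.
def wer(reference: str, hypothesis: str) -> float:
    ref, hyp = reference.split(), hypothesis.split()
    # d[i][j] = edit distance between the first i reference words
    # and the first j hypothesis words.
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i  # all deletions
    for j in range(len(hyp) + 1):
        d[0][j] = j  # all insertions
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1  # substitution
            d[i][j] = min(d[i - 1][j] + 1,       # deletion
                          d[i][j - 1] + 1,       # insertion
                          d[i - 1][j - 1] + cost)
    return d[len(ref)][len(hyp)] / len(ref)

print(wer("the cat sat on the mat", "the cat sat on mat"))  # 1 deletion / 6 words ≈ 0.167
```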