
    Voxtral-Mini-4B-Realtime-2602

    by mistralai

    Open-source realtime streaming speech-to-text with sub-500ms latency across 13 languages

    1.29M downloads/month
    4B params
    Identifiers
    Model ID: mistralai/Voxtral-Mini-4B-Realtime-2602
    Feature URI: mixpeek://transcription@v1/mistral_voxtral_mini_4b_v1

    Overview

    Voxtral Mini 4B Realtime is among the first open-source speech models to achieve offline-comparable accuracy with sub-500ms latency. Its natively streaming architecture pairs a causal audio encoder (~0.6B params) with a Ministral-3-based LLM decoder (~3.4B params), both using sliding window attention for constant-memory streaming inference.

    On Mixpeek, Voxtral powers realtime and near-realtime transcription of audio and video content across 13 languages, with configurable latency from 240ms to 2.4s to balance speed against accuracy for live subtitling or batch processing.
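To make the speed/accuracy tradeoff concrete, here is a minimal sketch that picks a transcription delay for a given latency budget. It uses only the two operating points reported in the Benchmarks table on this page (480ms and 2.4s, average WER on FLEURS across 13 languages). `pick_delay_ms` and `PUBLISHED_POINTS` are hypothetical names for illustration, not part of the Mixpeek SDK or of Voxtral, and how the delay is actually configured on Mixpeek is not shown here.

```python
# (delay in ms, average WER in %) as reported in the Benchmarks table.
PUBLISHED_POINTS = [(480, 8.72), (2400, 6.73)]


def pick_delay_ms(latency_budget_ms: int) -> int:
    """Return the largest benchmarked delay that fits within the budget.

    Hypothetical helper: in the published numbers a longer delay gives a
    lower WER, so the largest delay that fits is the most accurate choice.
    """
    fitting = [delay for delay, _wer in PUBLISHED_POINTS if delay <= latency_budget_ms]
    if not fitting:
        raise ValueError("no benchmarked delay fits a budget of %d ms" % latency_budget_ms)
    return max(fitting)


# Live subtitling with a tight budget versus near-realtime batch processing.
print(pick_delay_ms(500))
print(pick_delay_ms(3000))
```

A live-captioning workload would land on the 480ms point, while a batch job that can tolerate a few seconds of delay would take the 2.4s point and its lower error rate.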

    Architecture

    The model has two components: (1) a causal transformer audio encoder (~0.6B params, 32 layers) and (2) a Ministral-3-based LLM decoder (~3.4B params, 26 layers). Both use sliding window attention, which keeps memory use constant while streaming. The transcription delay is configurable from 240ms to 2.4s.
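The sliding window attention mentioned above can be illustrated with a small attention mask. This is a generic sketch of the technique, not Voxtral's implementation. The window size of 3 is an arbitrary value chosen for readability (the model's actual window size is not stated here), and `sliding_window_causal_mask` is a hypothetical helper name.

```python
def sliding_window_causal_mask(seq_len: int, window: int) -> list:
    """Build a mask where mask[i][j] is True if position i may attend to j.

    Position i sees itself and at most `window - 1` earlier positions,
    and never any future position.
    """
    return [
        [(j <= i) and (j > i - window) for j in range(seq_len)]
        for i in range(seq_len)
    ]


mask = sliding_window_causal_mask(seq_len=6, window=3)  # window=3 is illustrative
for row in mask:
    print("".join("1" if allowed else "0" for allowed in row))
# 100000
# 110000
# 111000
# 011100
# 001110
# 000111
```

Each row has at most `window` allowed positions, so the keys and values that must be kept per step stay bounded however long the audio stream runs. That is the sense in which memory is constant.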

    Mixpeek SDK Integration

    from mixpeek import Mixpeek

    mx = Mixpeek(api_key="YOUR_KEY")
    mx.ingest(
        collection_id="meeting-recordings",
        source="s3://audio/",
        extractors=[{
            "type": "transcription",
            "model": "mistralai/Voxtral-Mini-4B-Realtime-2602",
            "output_feature": "transcript"
        }]
    )

    Capabilities

    • Realtime streaming transcription with <500ms latency
    • Support for 13 languages, including English, Spanish, French, and German
    • Configurable latency/accuracy tradeoff (240ms-2.4s delay)
    • Natively streaming architecture (no chunking workarounds)
    • Apache 2.0 open-source

    Use Cases on Mixpeek

    • Live subtitling and closed captioning for video streams
    • Voice assistant transcription with low-latency requirements
    • Multilingual meeting transcription with realtime output

    Benchmarks

    Dataset                        Metric        Score   Source
    FLEURS (13 languages, 480ms)   Average WER   8.72%   Mistral AI, Feb 2026, Voxtral Realtime paper
    FLEURS English (480ms)         WER           4.90%   Mistral AI, Feb 2026, Voxtral Realtime paper
    FLEURS (13 languages, 2.4s)    Average WER   6.73%   Mistral AI, Feb 2026, Voxtral Realtime paper
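For readers unfamiliar with the metric, word error rate (WER) is the word-level edit distance between a reference transcript and the model output, divided by the number of reference words. The sketch below is a plain textbook implementation for illustration. It is not the evaluation code behind the numbers above, and published benchmarks usually apply text normalization (casing, punctuation) that is omitted here.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / reference word count."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen so
    # far and the first j hypothesis words (rolling one-row DP table).
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # deletion: reference word missing from output
                curr[j - 1] + 1,     # insertion: extra word in output
                prev[j - 1] + cost,  # match or substitution
            ))
        prev = curr
    return prev[-1] / len(ref)


# One dropped word out of six reference words.
print(word_error_rate("the cat sat on the mat", "the cat sat on mat"))
```

Under this definition, a WER of 4.90% corresponds to roughly one word error per twenty reference words.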

    Performance

    Input Size: Streaming audio (16kHz)
    GPU Latency: 240ms-2.4s configurable delay (A100)
    GPU Throughput: Realtime factor >1x (streaming)
    GPU Memory: ~8.5 GB
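As a rough sanity check on these figures, the sketch below estimates the memory taken by the weights alone and shows one way to compute a realtime factor. Both helpers are hypothetical. The 2 bytes per parameter figure assumes 16-bit weights, which this page does not state, and the timing values in the usage lines are made-up inputs, not measurements.

```python
BYTES_PER_PARAM_16BIT = 2  # assumption: weights stored in a 16-bit format


def weight_memory_gb(num_params: float, bytes_per_param: int = BYTES_PER_PARAM_16BIT) -> float:
    """Memory for the model weights only, in decimal gigabytes."""
    return num_params * bytes_per_param / 1e9


def realtime_factor(audio_seconds: float, processing_seconds: float) -> float:
    """Seconds of audio transcribed per second of compute.

    A value above 1 means the model keeps up with a live stream.
    """
    if processing_seconds <= 0:
        raise ValueError("processing_seconds must be positive")
    return audio_seconds / processing_seconds


# Encoder (~0.6B) plus decoder (~3.4B) parameters, as listed under Architecture.
print(weight_memory_gb(0.6e9 + 3.4e9))  # 8.0

# Illustrative timing only: 60 s of audio processed in 40 s of compute.
print(realtime_factor(60.0, 40.0))  # 1.5
```

Under the 16-bit assumption the weights account for about 8 GB. The remainder of the ~8.5 GB figure would then be runtime overhead such as activations and attention caches.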

    Specification

    Framework: HF
    Organization: mistralai
    Feature: Transcription
    Output: text + timestamps
    Modalities: video, audio
    Retriever: Transcript Search
    Parameters: 4B
    License: Apache-2.0
    Downloads/mo: 1.29M

    Research Paper

    Voxtral Realtime

    arxiv.org

    Build a pipeline with Voxtral-Mini-4B-Realtime-2602

    Add this model to a processing pipeline alongside other extractors. Combine with retrieval stages for end-to-end search.
