    HF | Transcription | apache-2.0

    Voxtral-Mini-4B-Realtime-2602

    by mistralai

    Open-source realtime streaming speech-to-text with sub-500ms latency across 13 languages

    1.1M downloads/month
    875 likes
    4.4B params
    Identifiers
    Model ID
    mistralai/Voxtral-Mini-4B-Realtime-2602
    Feature URI
    mixpeek://transcription@v1/mistral_voxtral_mini_4b_v1

    Overview

    Voxtral Mini 4B Realtime is among the first open-source speech models to achieve offline-comparable accuracy with sub-500ms latency. Its natively streaming architecture pairs a causal audio encoder (~0.6B params) with a Ministral-3-based LLM decoder (~3.4B params), both using sliding window attention for constant-memory streaming inference.

    On Mixpeek, Voxtral powers realtime and near-realtime transcription of audio and video content across 13 languages. The transcription delay is configurable from 240ms to 2.4s, trading speed against accuracy: shorter delays suit live subtitling, longer delays suit batch processing.

    Architecture

    The model has two streaming components: (1) a causal transformer audio encoder (0.6B params, 32 layers) and (2) a Ministral-3-based LLM decoder (3.4B params, 26 layers). Both use sliding window attention, so each position attends only to a bounded number of recent positions. The transcription delay is configurable from 240ms to 2.4s.
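To illustrate why sliding window attention keeps streaming memory constant, here is a minimal TypeScript sketch of a causal sliding-window visibility rule. The function names and the window size of 3 are illustrative assumptions; the source does not state the model's actual window length.

```typescript
// Hypothetical helper: which positions may position `query` attend to?
// `windowSize` is an illustrative value, not the model's real window.
function visibleKeys(query: number, windowSize: number): number[] {
  const start = Math.max(0, query - windowSize + 1);
  const keys: number[] = [];
  // Causal: a position never attends to anything after itself (k <= query).
  for (let k = start; k <= query; k++) keys.push(k);
  return keys;
}

// Largest number of positions that must be cached for a stream of `length` frames.
function maxCacheSize(length: number, windowSize: number): number {
  let max = 0;
  for (let q = 0; q < length; q++) {
    max = Math.max(max, visibleKeys(q, windowSize).length);
  }
  return max;
}

console.log(visibleKeys(5, 3));     // [ 3, 4, 5 ]
console.log(maxCacheSize(1000, 3)); // 3
```

However long the stream runs, the cache never grows beyond the window size, which is the property that makes constant-memory streaming possible.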

    Mixpeek SDK Integration

    import { Mixpeek } from "mixpeek";
    
    const mx = new Mixpeek({ apiKey: "API_KEY" });
    
    // Managed: create a collection over a bucket; Mixpeek runs this model's extractor
    const collection = await mx.collections.create({
      namespace_id: "my-namespace",
      collection_name: "my-collection",
      source: { type: "bucket", bucket_ids: ["bkt_your_bucket"] },
      feature_extractor: {
        feature_extractor_name: "transcription",
        version: "v1",
        parameters: { model_id: "mistralai/Voxtral-Mini-4B-Realtime-2602" },
      },
    });

    Capabilities

    • Realtime streaming transcription with <500ms latency
    • Support for 13 languages, including English, Spanish, French, and German
    • Configurable latency/accuracy tradeoff (240ms-2.4s delay)
    • Natively streaming architecture (no chunking workarounds)
    • Apache 2.0 open-source

    Use Cases on Mixpeek

    • Live subtitling and closed captioning for video streams
    • Voice assistant transcription with low-latency requirements
    • Multilingual meeting transcription with realtime output

    Benchmarks

    Dataset                         Metric        Score   Source
    FLEURS (13 languages, 480ms)    Average WER   8.72%   Mistral AI, Feb 2026, Voxtral Realtime paper
    FLEURS English (480ms)          WER           4.90%   Mistral AI, Feb 2026, Voxtral Realtime paper
    FLEURS (13 languages, 2.4s)     Average WER   6.73%   Mistral AI, Feb 2026, Voxtral Realtime paper
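For readers unfamiliar with the metric, word error rate (WER) is the word-level edit distance between a reference transcript and the model output, divided by the number of reference words. The sketch below is a generic implementation, not the evaluation code behind the figures above; benchmark reports typically also normalize text (case, punctuation) first, which this example does not do.

```typescript
// Word error rate: (substitutions + deletions + insertions) / reference word count.
function wordErrorRate(reference: string, hypothesis: string): number {
  const ref = reference.trim().split(/\s+/).filter(Boolean);
  const hyp = hypothesis.trim().split(/\s+/).filter(Boolean);
  if (ref.length === 0) {
    throw new Error("reference must contain at least one word");
  }
  // prev[j] holds the edit distance between the ref prefix processed so far
  // and the first j hypothesis words (rolling-row Levenshtein).
  let prev: number[] = Array.from({ length: hyp.length + 1 }, (_, j) => j);
  for (let i = 1; i <= ref.length; i++) {
    const curr: number[] = [i];
    for (let j = 1; j <= hyp.length; j++) {
      const cost = ref[i - 1] === hyp[j - 1] ? 0 : 1;
      curr.push(Math.min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost));
    }
    prev = curr;
  }
  return prev[hyp.length] / ref.length;
}

// One substituted word out of four reference words.
console.log(wordErrorRate("turn on the lights", "turn on the light")); // 0.25
```

On this scale, the reported 8.72% average corresponds to roughly one word error per eleven or twelve reference words.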

    Performance

    Input Size: Streaming audio (16kHz)
    GPU Latency: 240ms-2.4s configurable delay (A100)
    GPU Throughput: Realtime factor >1x (streaming)
    GPU Memory: ~8.5 GB
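As a rough guide to what the delay settings mean in terms of audio, the arithmetic below converts a delay into a sample count at the 16kHz input rate listed above. The helper names are hypothetical, and the byte figure assumes 16-bit mono PCM, which the source does not specify.

```typescript
const SAMPLE_RATE_HZ = 16000; // input rate from the Performance section

// Number of audio samples spanned by a delay given in milliseconds.
function samplesForDelay(delayMs: number, sampleRateHz: number = SAMPLE_RATE_HZ): number {
  return Math.round((delayMs / 1000) * sampleRateHz);
}

// Bytes for that span, ASSUMING 16-bit (2-byte) mono PCM samples.
function pcm16BytesForDelay(delayMs: number): number {
  return samplesForDelay(delayMs) * 2;
}

console.log(samplesForDelay(240));    // 3840
console.log(samplesForDelay(2400));   // 38400
console.log(pcm16BytesForDelay(480)); // 15360
```

This is only the arithmetic of audio buffering; it says nothing about how the model itself frames or batches its input.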

    Specification

    Framework: HF
    Organization: mistralai
    Feature: Transcription
    Output: text + timestamps
    Modalities: video, audio
    Retriever: Transcript Search
    Parameters: 4.4B
    License: apache-2.0
    Downloads/mo: 1.1M
    Likes: 875

    Research Paper

    Voxtral Realtime (arxiv.org)

    Build a pipeline with Voxtral-Mini-4B-Realtime-2602

    Add this model to a processing pipeline alongside other extractors. Combine with retrieval stages for end-to-end search.
