Voxtral-Mini-4B-Realtime-2602

by mistralai

Open-source realtime streaming speech-to-text with sub-500ms latency across 13 languages

2.0M downloads/month

919 likes

4.4B params


Identifiers

Model ID

mistralai/Voxtral-Mini-4B-Realtime-2602

Feature URI

mixpeek://transcription@v1/mistral_voxtral_mini_4b_v1

Overview

Voxtral Mini 4B Realtime is among the first open-source speech models to reach accuracy comparable to offline transcription models at sub-500ms latency. Its natively streaming architecture pairs a causal audio encoder (~0.6B params) with a Ministral-3-based LLM decoder (~3.4B params). Both components use sliding window attention, so memory use stays constant during streaming inference regardless of stream length.

On Mixpeek, Voxtral powers realtime and near-realtime transcription of audio and video content across 13 languages, with configurable latency from 240ms to 2.4s to balance speed against accuracy for live subtitling or batch processing.

Architecture

The model has two streaming components: (1) a causal transformer audio encoder (~0.6B params, 32 layers) and (2) a Ministral-3-based LLM decoder (~3.4B params, 26 layers). Both use sliding window attention for streaming. The transcription delay is configurable from 240ms to 2.4s.
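To illustrate why sliding window attention keeps streaming memory constant, here is a minimal sketch of a fixed-capacity cache that retains only the most recent frames. The class name, the window size of 4, and the use of plain numbers as "frames" are assumptions made for illustration; this is not the model's actual implementation or its real window size.

```typescript
// Illustrative only: a fixed-capacity cache that keeps the most recent
// `windowSize` items, mimicking how a sliding attention window bounds memory.
class SlidingWindowCache<T> {
  private buffer: T[] = [];
  private readonly windowSize: number;

  constructor(windowSize: number) {
    if (!Number.isInteger(windowSize) || windowSize <= 0) {
      throw new RangeError("windowSize must be a positive integer");
    }
    this.windowSize = windowSize;
  }

  // Append a new frame and evict the oldest one once the window is full.
  push(item: T): void {
    this.buffer.push(item);
    if (this.buffer.length > this.windowSize) {
      this.buffer.shift();
    }
  }

  // Frames the current step is allowed to attend to.
  visible(): readonly T[] {
    return this.buffer;
  }

  get size(): number {
    return this.buffer.length;
  }
}

// Hypothetical window of 4 frames over a 10-frame stream.
const cache = new SlidingWindowCache<number>(4);
for (let frame = 0; frame < 10; frame++) {
  cache.push(frame);
}
console.log(cache.visible()); // [ 6, 7, 8, 9 ]
console.log(cache.size); // 4
```

However long the stream runs, the cache never holds more than the window size, which is the property that makes constant-memory inference possible.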

Mixpeek SDK Integration

import { Mixpeek } from "mixpeek";

// Replace "API_KEY" with your Mixpeek API key
const mx = new Mixpeek({ apiKey: "API_KEY" });

// Managed: create a collection over a bucket; Mixpeek runs this model's extractor
const collection = await mx.collections.create({
  namespace_id: "my-namespace",
  collection_name: "my-collection",
  source: { type: "bucket", bucket_ids: ["bkt_your_bucket"] },
  feature_extractor: {
    feature_extractor_name: "transcription",
    version: "v1",
    parameters: { model_id: "mistralai/Voxtral-Mini-4B-Realtime-2602" },
  },
});

Capabilities

Realtime streaming transcription with <500ms latency
Support for 13 languages, including English, Spanish, French, and German
Configurable latency/accuracy tradeoff (240ms-2.4s delay)
Natively streaming architecture (no chunking workarounds)
Apache 2.0 open-source
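The latency/accuracy tradeoff can be sketched as a small helper that keeps a requested delay inside the documented 240ms to 2.4s range. The function name, the constants, and the preset names are hypothetical and are not part of the Mixpeek SDK or the model's API; only the range itself comes from this page.

```typescript
// Documented delay range for the model (from this page): 240ms to 2.4s.
const MIN_DELAY_MS = 240;
const MAX_DELAY_MS = 2400;

// Hypothetical helper: clamp a requested delay into the supported range.
function clampDelayMs(requestedMs: number): number {
  if (!Number.isFinite(requestedMs)) {
    throw new RangeError("requestedMs must be a finite number");
  }
  const rounded = Math.round(requestedMs);
  return Math.min(MAX_DELAY_MS, Math.max(MIN_DELAY_MS, rounded));
}

// Hypothetical presets: a short delay for live use, the maximum for batch.
const presets = {
  liveSubtitles: clampDelayMs(480),
  batch: clampDelayMs(5000),
};
console.log(presets); // { liveSubtitles: 480, batch: 2400 }
```

Shorter delays suit live subtitling, while the 2.4s end of the range trades responsiveness for lower word error rate.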

Use Cases on Mixpeek

Live subtitling and closed captioning for video streams

Voice assistant transcription with low-latency requirements

Multilingual meeting transcription with realtime output

Benchmarks

Dataset	Metric	Score	Source
FLEURS (13 languages, 480ms)	Average WER	8.72%	Mistral AI, Feb 2026 — Voxtral Realtime paper
FLEURS English (480ms)	WER	4.90%	Mistral AI, Feb 2026 — Voxtral Realtime paper
FLEURS (13 languages, 2.4s)	Average WER	6.73%	Mistral AI, Feb 2026 — Voxtral Realtime paper
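
To put the two 13-language FLEURS rows in perspective, the following sketch computes the relative WER reduction when moving from the 480ms setting (8.72%) to the 2.4s setting (6.73%). The function name is an assumption for illustration; the input figures are taken from the table above.

```typescript
// Relative reduction in word error rate between two settings.
function relativeWerReduction(baselineWer: number, improvedWer: number): number {
  if (baselineWer <= 0) {
    throw new RangeError("baselineWer must be positive");
  }
  return (baselineWer - improvedWer) / baselineWer;
}

// FLEURS 13-language average WER: 8.72% at 480ms vs 6.73% at 2.4s.
const reduction = relativeWerReduction(8.72, 6.73);
console.log(`${(reduction * 100).toFixed(1)}%`); // 22.8%
```

In other words, accepting the longer 2.4s delay removes roughly a fifth of the errors seen at 480ms on this benchmark.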