Voxtral-Mini-4B-Realtime-2602
by mistralai
Open-source realtime streaming speech-to-text with sub-500ms latency across 13 languages
mistralai/Voxtral-Mini-4B-Realtime-2602
mixpeek://transcription@v1/mistral_voxtral_mini_4b_v1
Overview
Voxtral Mini 4B Realtime is among the first open-source speech models to achieve offline-comparable accuracy with sub-500ms latency. Its natively streaming architecture pairs a causal audio encoder (~0.6B params) with a Ministral-3-based LLM decoder (~3.4B params), both using sliding window attention for constant-memory streaming inference.
On Mixpeek, Voxtral powers realtime and near-realtime transcription of audio and video content across 13 languages. The transcription delay is configurable from 240ms to 2.4s: shorter delays suit live subtitling, while longer delays trade speed for accuracy in batch processing.
Architecture
The model has two components: (1) a causal transformer audio encoder (~0.6B params, 32 layers) and (2) a Ministral-3-based LLM decoder (~3.4B params, 26 layers). Both use sliding window attention, which keeps memory use constant during streaming inference. The transcription delay is configurable from 240ms to 2.4s.
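To make the constant-memory claim concrete, here is a minimal sketch of a causal sliding-window attention mask. It is an illustration of the general technique only, not Voxtral's implementation: the function names and the window size of 3 are hypothetical, and the source does not state the model's actual window size.

```typescript
// Build a causal sliding-window attention mask.
// mask[i][j] is true when query position i may attend to key position j.
// NOTE: `windowSize` is a hypothetical parameter for illustration; the
// source does not give the window size used by the model.
function slidingWindowCausalMask(seqLen: number, windowSize: number): boolean[][] {
  if (!Number.isInteger(seqLen) || seqLen < 0) {
    throw new RangeError("seqLen must be a non-negative integer");
  }
  if (!Number.isInteger(windowSize) || windowSize < 1) {
    throw new RangeError("windowSize must be a positive integer");
  }
  const mask: boolean[][] = [];
  for (let i = 0; i < seqLen; i++) {
    const row: boolean[] = [];
    for (let j = 0; j < seqLen; j++) {
      // Causal: never look at future keys (j <= i).
      // Windowed: only the most recent `windowSize` keys (j > i - windowSize).
      row.push(j <= i && j > i - windowSize);
    }
    mask.push(row);
  }
  return mask;
}

// Count how many keys each query position attends to.
function attendedCounts(mask: boolean[][]): number[] {
  return mask.map((row) => row.filter(Boolean).length);
}

const counts = attendedCounts(slidingWindowCausalMask(6, 3));
console.log(counts); // [1, 2, 3, 3, 3, 3]
```

The per-position count grows until it reaches the window size and then stays flat, which is why the number of cached keys a streaming decoder must retain does not grow with the length of the audio.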
Mixpeek SDK Integration
import { Mixpeek } from "mixpeek";

const mx = new Mixpeek({ apiKey: "API_KEY" });

// Managed: create a collection over a bucket; Mixpeek runs this model's extractor
const collection = await mx.collections.create({
  namespace_id: "my-namespace",
  collection_name: "my-collection",
  source: { type: "bucket", bucket_ids: ["bkt_your_bucket"] },
  feature_extractor: {
    feature_extractor_name: "transcription",
    version: "v1",
    parameters: { model_id: "mistralai/Voxtral-Mini-4B-Realtime-2602" },
  },
});
Capabilities
- Realtime streaming transcription with <500ms latency
- Support for 13 languages, including English, Spanish, French, and German
- Configurable latency/accuracy tradeoff (240ms-2.4s delay)
- Natively streaming architecture (no chunking workarounds)
- Apache 2.0 open-source
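The configurable delay listed above can be handled defensively on the caller's side. The sketch below clamps a requested delay into the documented 240ms to 2.4s range. It assumes the delay is expressed in milliseconds and that any value inside the range is acceptable; the source does not say whether only discrete steps are supported, and `clampDelayMs` is a hypothetical helper, not part of the Mixpeek SDK or the model's API.

```typescript
// Documented bounds for the transcription delay (from the text above).
const MIN_DELAY_MS = 240;
const MAX_DELAY_MS = 2400;

// Hypothetical helper: round a requested delay to whole milliseconds and
// clamp it into the documented range. Assumes arbitrary in-range values
// are allowed, which the source does not confirm.
function clampDelayMs(requestedMs: number): number {
  if (!Number.isFinite(requestedMs)) {
    throw new RangeError("requestedMs must be a finite number");
  }
  return Math.min(MAX_DELAY_MS, Math.max(MIN_DELAY_MS, Math.round(requestedMs)));
}

console.log(clampDelayMs(100));  // 240
console.log(clampDelayMs(480));  // 480
console.log(clampDelayMs(5000)); // 2400
```

Clamping rather than throwing on out-of-range values is a design choice for this sketch; a stricter caller may prefer to reject them outright.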
Benchmarks
| Dataset | Metric | Score | Source |
|---|---|---|---|
| FLEURS (13 languages, 480ms) | Average WER | 8.72% | Mistral AI, Feb 2026 — Voxtral Realtime paper |
| FLEURS English (480ms) | WER | 4.90% | Mistral AI, Feb 2026 — Voxtral Realtime paper |
| FLEURS (13 languages, 2.4s) | Average WER | 6.73% | Mistral AI, Feb 2026 — Voxtral Realtime paper |
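As a worked reading of the table, the snippet below computes how much the 13-language average WER improves when the delay is raised from 480ms to 2.4s. The two WER figures are taken from the rows above; the variable and function names are only illustrative.

```typescript
// Average FLEURS WER (13 languages) at two delay settings, from the table.
const werAt480msPct = 8.72;
const werAt2400msPct = 6.73;

// Relative reduction of `improved` with respect to `baseline`, in percent.
function relativeReductionPct(baseline: number, improved: number): number {
  if (baseline <= 0) {
    throw new RangeError("baseline must be positive");
  }
  return ((baseline - improved) / baseline) * 100;
}

const absoluteGainPts = werAt480msPct - werAt2400msPct;
const relativeGainPct = relativeReductionPct(werAt480msPct, werAt2400msPct);

console.log(absoluteGainPts.toFixed(2)); // "1.99" (percentage points)
console.log(relativeGainPct.toFixed(1)); // "22.8" (percent relative reduction)
```

In other words, accepting a delay five times longer lowers the average error rate by about two percentage points, roughly a 23% relative reduction.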
Research Paper
Voxtral Realtime (arxiv.org)
Build a pipeline with Voxtral-Mini-4B-Realtime-2602
Add this model to a processing pipeline alongside other extractors. Combine with retrieval stages for end-to-end search.