Voxtral-Mini-3B-2507
by mistralai
Multimodal audio model for transcription, translation, and voice understanding
mistralai/Voxtral-Mini-3B-2507
mixpeek://transcription@v1/mistral_voxtral_mini_3b_v1
Overview
Voxtral Mini 3B is Mistral's multimodal audio model combining a Whisper large-v3 encoder with a Ministral-3B language decoder. It handles transcription, translation, audio understanding, and function calling from voice — supporting 8 languages with automatic language detection.
On Mixpeek, Voxtral Mini powers multilingual transcription pipelines and audio understanding workflows. Its ability to answer questions about audio content (not just transcribe) enables richer metadata extraction from podcasts, interviews, and meeting recordings.
Architecture
The model has three components: a Whisper large-v3 audio encoder (640M parameters), an audio-language adapter that downsamples by 4x (25M), and a Ministral-3B language decoder (3.6B). The context window is 32K tokens, which handles up to 30 minutes of audio for transcription or 40 minutes for understanding. Running in bf16 takes about 9.5 GB of GPU RAM.
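As a rough sanity check on these figures, the sketch below estimates how many audio tokens a clip occupies in the context window and how much memory the listed weights need. It assumes the encoder emits 50 frames per second (not stated here) and reads "32K" as 32,000 tokens; the function names are illustrative only.

```typescript
// Figures taken from the architecture description above.
const ENCODER_PARAMS = 640e6; // Whisper large-v3 audio encoder
const ADAPTER_PARAMS = 25e6; // audio-language adapter
const DECODER_PARAMS = 3.6e9; // Ministral-3B language decoder
const DOWNSAMPLE = 4; // adapter downsampling factor

// Assumptions, not stated in the description:
const CONTEXT_TOKENS = 32000; // "32K" read as 32,000
const ENCODER_FRAME_RATE_HZ = 50; // encoder output frames per second

/** Audio tokens the decoder sees for a clip of the given length. */
function audioTokens(minutes: number): number {
  const tokensPerSecond = ENCODER_FRAME_RATE_HZ / DOWNSAMPLE; // 12.5
  return Math.ceil(minutes * 60 * tokensPerSecond);
}

/** Memory for the listed weights only, in GB (10^9 bytes). */
function weightMemoryGB(bytesPerParam: number): number {
  const totalParams = ENCODER_PARAMS + ADAPTER_PARAMS + DECODER_PARAMS;
  return (totalParams * bytesPerParam) / 1e9;
}

console.log(audioTokens(30)); // 22500
console.log(audioTokens(40)); // 30000
console.log(audioTokens(40) <= CONTEXT_TOKENS); // true
console.log(weightMemoryGB(2).toFixed(2)); // "8.53" (bf16 = 2 bytes per parameter)
```

Under these assumptions a 40-minute clip uses about 30,000 of the 32K tokens, which leaves little room for a prompt and answer and is consistent with the stated limit. The weight estimate counts only the three listed components, so it sits below the quoted ~9.5 GB, which presumably also covers the remaining parameters and runtime overhead.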
Mixpeek SDK Integration
import { Mixpeek } from "mixpeek";

const mx = new Mixpeek({ apiKey: "API_KEY" });

await mx.collections.ingest({
  collection_id: "multilingual-audio",
  source: { url: "https://example.com/interview-fr.mp3" },
  feature_extractors: [
    {
      feature: "transcription",
      model: "mistralai/Voxtral-Mini-3B-2507"
    }
  ]
});
Capabilities
- 8-language ASR (EN, ES, FR, PT, HI, DE, NL, IT)
- Automatic language detection
- Audio understanding and question answering
- Function calling from voice input
- Outperforms Whisper large-v3 on Open ASR Leaderboard
Benchmarks
| Dataset | Metric | Score | Source |
|---|---|---|---|
| Open ASR Leaderboard | Mean WER | 7.05% | HuggingFace Open ASR Leaderboard, 2025 |
| LibriSpeech Clean | WER | 1.88% | Mistral, 2025 — arxiv:2507.13264 |
| LibriSpeech Other | WER | 4.10% | Mistral, 2025 — arxiv:2507.13264 |
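The scores above are word error rates (WER): the word-level edit distance between the reference transcript and the model output, divided by the number of reference words. The following is a minimal sketch of that metric, not the leaderboard's own scoring code; published benchmarks typically also normalize text (casing, punctuation, number formats) before scoring, which this example leaves out. The function name `wordErrorRate` is illustrative.

```typescript
/** Word error rate: word-level edit distance divided by reference length. */
function wordErrorRate(reference: string, hypothesis: string): number {
  const ref = reference.trim().split(/\s+/).filter(Boolean);
  const hyp = hypothesis.trim().split(/\s+/).filter(Boolean);
  if (ref.length === 0) {
    throw new Error("reference must contain at least one word");
  }

  // prev[j] holds the edit distance between the reference words processed
  // so far and the first j hypothesis words.
  let prev: number[] = Array.from({ length: hyp.length + 1 }, (_, j) => j);
  for (let i = 1; i <= ref.length; i++) {
    const curr: number[] = [i]; // i deletions against an empty hypothesis
    for (let j = 1; j <= hyp.length; j++) {
      const substitution = prev[j - 1] + (ref[i - 1] === hyp[j - 1] ? 0 : 1);
      const deletion = prev[j] + 1;
      const insertion = curr[j - 1] + 1;
      curr.push(Math.min(substitution, deletion, insertion));
    }
    prev = curr;
  }
  return prev[hyp.length] / ref.length;
}

// One substitution ("sat" -> "sit") and one deletion ("the") over 6 reference words.
const wer = wordErrorRate("the cat sat on the mat", "the cat sit on mat");
console.log((wer * 100).toFixed(1) + "%"); // 33.3%
```

Read this way, a WER of 1.88% on LibriSpeech Clean means roughly 2 word-level errors per 100 reference words.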
Research Paper
Voxtral (arxiv.org, arxiv:2507.13264)
Build a pipeline with Voxtral-Mini-3B-2507
Add this model to a processing pipeline alongside other extractors. Combine with retrieval stages for end-to-end search.