Voxtral-Mini-3B-2507

by mistralai

Multimodal audio model for transcription, translation, and voice understanding

532Kdl/month

~4.7Bparams

HuggingFace Run on your data

Identifiers

Model ID

mistralai/Voxtral-Mini-3B-2507

Feature URI

mixpeek://transcription@v1/mistral_voxtral_mini_3b_v1

Overview

Voxtral Mini 3B is Mistral's multimodal audio model combining a Whisper large-v3 encoder with a Ministral-3B language decoder. It handles transcription, translation, audio understanding, and function calling from voice, supporting 8 languages with automatic language detection.

On Mixpeek, Voxtral Mini powers multilingual transcription pipelines and audio understanding workflows. Its ability to answer questions about audio content (not just transcribe) enables richer metadata extraction from podcasts, interviews, and meeting recordings.

Architecture

Three-component architecture: Whisper large-v3 audio encoder (640M) + 4x downsampling audio-language adapter (25M) + Ministral-3B language decoder (3.6B). 32K token context. Handles 30-min audio for transcription, 40-min for understanding. ~9.5 GB GPU RAM in bf16.

Mixpeek SDK Integration

import { Mixpeek } from "mixpeek";

const mx = new Mixpeek({ apiKey: "API_KEY" });

// Managed: create a collection over a bucket; Mixpeek runs this model's extractor
const collection = await mx.collections.create({
  namespace_id: "my-namespace",
  collection_name: "my-collection",
  source: { type: "bucket", bucket_ids: ["bkt_your_bucket"] },
  feature_extractor: {
    feature_extractor_name: "transcription",
    version: "v1",
    parameters: { model_id: "mistralai/Voxtral-Mini-3B-2507" },
  },
});

Capabilities

8-language ASR (EN, ES, FR, PT, HI, DE, NL, IT)
Automatic language detection
Audio understanding and question answering
Function calling from voice input
Outperforms Whisper large-v3 on Open ASR Leaderboard

Use Cases on Mixpeek

Multilingual transcription for global media libraries

Audio Q&A: extract structured answers from recorded interviews

Voice-driven workflows: trigger actions from spoken commands

Podcast metadata extraction beyond plain transcription

Benchmarks

Dataset	Metric	Score	Source
Open ASR Leaderboard	Mean WER	7.05%	HuggingFace Open ASR Leaderboard, 2025
LibriSpeech Clean	WER	1.88%	Mistral, 2025: arxiv,2507.13264
LibriSpeech Other	WER	4.10%	Mistral, 2025: arxiv,2507.13264