    HF · Transcription · Apache 2.0

    Voxtral-Mini-3B-2507

    by mistralai

    Multimodal audio model for transcription, translation, and voice understanding

    532K downloads/month
    ~4.7B parameters
    Identifiers
    Model ID: mistralai/Voxtral-Mini-3B-2507
    Feature URI: mixpeek://transcription@v1/mistral_voxtral_mini_3b_v1

    Overview

    Voxtral Mini 3B is Mistral's multimodal audio model combining a Whisper large-v3 encoder with a Ministral-3B language decoder. It handles transcription, translation, audio understanding, and function calling from voice — supporting 8 languages with automatic language detection.

    On Mixpeek, Voxtral Mini powers multilingual transcription pipelines and audio understanding workflows. Its ability to answer questions about audio content (not just transcribe) enables richer metadata extraction from podcasts, interviews, and meeting recordings.

    Architecture

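    The model has three components: a Whisper large-v3 audio encoder (640M parameters), an audio-language adapter that downsamples the encoder output 4x (25M), and a Ministral-3B language decoder (3.6B). These three figures sum to about 4.3B; the ~4.7B total quoted on this page also counts the decoder's text embeddings. The context window is 32K tokens, which covers roughly 30 minutes of audio for transcription or 40 minutes for understanding. Running in bf16 takes about 9.5 GB of GPU memory.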
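The audio length limits can be related to the 32K context with some rough arithmetic. The sketch below is an illustration under two assumptions that are not stated on this page: the Whisper encoder emits 50 frames per second of audio, and each frame that survives the 4x downsampling occupies one context token. The function names are hypothetical.

```typescript
// Assumption (not from this page): Whisper's encoder emits 50 frames per second.
const ENCODER_FRAMES_PER_SECOND = 50;
// From the architecture description: the adapter downsamples 4x.
const ADAPTER_DOWNSAMPLING = 4;
// "32K token context", read here as 32 * 1024 tokens.
const CONTEXT_TOKENS = 32 * 1024;

// Estimated number of context tokens consumed by `minutes` of audio.
function audioTokens(minutes: number): number {
  const tokensPerSecond = ENCODER_FRAMES_PER_SECOND / ADAPTER_DOWNSAMPLING; // 12.5
  return Math.ceil(minutes * 60 * tokensPerSecond);
}

// Tokens left over for the prompt and the generated text.
function remainingTextTokens(minutes: number): number {
  return CONTEXT_TOKENS - audioTokens(minutes);
}

console.log(audioTokens(30), remainingTextTokens(30)); // 22500 10268
console.log(audioTokens(40), remainingTextTokens(40)); // 30000 2768
```

Under these assumptions, 40 minutes of audio leaves only a few thousand tokens for a question and a short answer, while 30 minutes leaves room for a full transcript. That is one plausible reading of why the transcription limit is lower than the understanding limit.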

    Mixpeek SDK Integration

    import { Mixpeek } from "mixpeek";

    // Replace "API_KEY" with your Mixpeek API key.
    const mx = new Mixpeek({ apiKey: "API_KEY" });

    // Ingest one audio file and run Voxtral Mini as the transcription extractor.
    await mx.collections.ingest({
      collection_id: "multilingual-audio",
      source: { url: "https://example.com/interview-fr.mp3" },
      feature_extractors: [{
        feature: "transcription",
        model: "mistralai/Voxtral-Mini-3B-2507"
      }]
    });
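When several files need the same extractor, it can help to build the request payloads separately from the network call. The following is a minimal sketch: the `IngestRequest` type and `buildTranscriptionRequest` helper are hypothetical, the field names are copied from the snippet above, and the second URL is a made-up example. It only constructs payloads and does not call the SDK.

```typescript
// Hypothetical type; field names mirror the ingest call shown above.
interface IngestRequest {
  collection_id: string;
  source: { url: string };
  feature_extractors: { feature: string; model: string }[];
}

const VOXTRAL_MODEL_ID = "mistralai/Voxtral-Mini-3B-2507";

// Hypothetical helper: one transcription request per audio URL.
function buildTranscriptionRequest(collectionId: string, url: string): IngestRequest {
  return {
    collection_id: collectionId,
    source: { url },
    feature_extractors: [{ feature: "transcription", model: VOXTRAL_MODEL_ID }],
  };
}

const urls = [
  "https://example.com/interview-fr.mp3",
  "https://example.com/interview-de.mp3", // illustrative URL, not from the page
];

const requests = urls.map((url) => buildTranscriptionRequest("multilingual-audio", url));
console.log(JSON.stringify(requests, null, 2));
```

Each payload could then be passed to `mx.collections.ingest(...)`, assuming the call takes one source per request as in the snippet above.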

    Capabilities

    • 8-language ASR (EN, ES, FR, PT, HI, DE, NL, IT)
    • Automatic language detection
    • Audio understanding and question answering
    • Function calling from voice input
    • Outperforms Whisper large-v3 on Open ASR Leaderboard
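Because the model covers a fixed set of eight languages, a pipeline may want to check a file's declared language before routing it to this extractor. The sketch below is a hypothetical client-side guard, not part of the Mixpeek SDK; the language codes are the eight listed above, written as lowercase ISO 639-1 codes.

```typescript
// The eight ASR languages listed above, as lowercase ISO 639-1 codes.
const VOXTRAL_LANGUAGES = new Set(["en", "es", "fr", "pt", "hi", "de", "nl", "it"]);

// Hypothetical helper: accepts tags such as "fr", "FR", or "pt-BR" and
// compares only the primary language subtag.
function isSupportedLanguage(tag: string): boolean {
  const [primary = ""] = tag.trim().toLowerCase().split(/[-_]/);
  return VOXTRAL_LANGUAGES.has(primary);
}

console.log(isSupportedLanguage("pt-BR")); // true
console.log(isSupportedLanguage("ja"));    // false
```

Files whose language is unknown can still be sent as-is, since the model detects the language automatically.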

    Use Cases on Mixpeek

    • Multilingual transcription for global media libraries
    • Audio Q&A: extract structured answers from recorded interviews
    • Voice-driven workflows: trigger actions from spoken commands
    • Podcast metadata extraction beyond plain transcription

    Benchmarks

    Dataset                Metric     Score    Source
    Open ASR Leaderboard   Mean WER   7.05%    HuggingFace Open ASR Leaderboard, 2025
    LibriSpeech Clean      WER        1.88%    Mistral, 2025 (arXiv:2507.13264)
    LibriSpeech Other      WER        4.10%    Mistral, 2025 (arXiv:2507.13264)
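The scores above are word error rates (WER): the number of word substitutions, deletions, and insertions needed to turn the model's output into the reference transcript, divided by the number of reference words. The sketch below shows that calculation with a word-level edit distance. It is an illustration only; published benchmarks typically normalize casing and punctuation first, which this example skips.

```typescript
// Word error rate = word-level edit distance / number of reference words.
function wordErrorRate(reference: string, hypothesis: string): number {
  const ref = reference.trim().split(/\s+/).filter(Boolean);
  const hyp = hypothesis.trim().split(/\s+/).filter(Boolean);
  if (ref.length === 0) {
    throw new Error("reference must contain at least one word");
  }

  // prev[j] holds the edit distance between the first i-1 reference words
  // and the first j hypothesis words; curr is the row for i words.
  let prev: number[] = Array.from({ length: hyp.length + 1 }, (_, j) => j);
  for (let i = 1; i <= ref.length; i++) {
    const curr: number[] = [i];
    for (let j = 1; j <= hyp.length; j++) {
      const substitution = prev[j - 1] + (ref[i - 1] === hyp[j - 1] ? 0 : 1);
      const deletion = prev[j] + 1;      // reference word missing from hypothesis
      const insertion = curr[j - 1] + 1; // extra word in hypothesis
      curr[j] = Math.min(substitution, deletion, insertion);
    }
    prev = curr;
  }
  return prev[hyp.length] / ref.length;
}

// One dropped word out of six reference words.
console.log(wordErrorRate("the cat sat on the mat", "the cat sat on mat").toFixed(4)); // 0.1667
```

A WER of 1.88% therefore means fewer than two word-level errors per hundred reference words.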

    Performance

    Input Size       Audio (up to 40 min)
    GPU Latency      ~0.91x real-time (A100)
    GPU Throughput   109.86x RTFx
    GPU Memory       ~9.5 GB
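Two of these figures can be sanity-checked with simple arithmetic. The sketch below assumes that bf16 stores each parameter in 2 bytes and that RTFx means audio duration divided by processing time (the convention used by the Open ASR Leaderboard). The function names are hypothetical, and the results are estimates, not measurements.

```typescript
const BYTES_PER_BF16_PARAM = 2;

// Memory needed for the weights alone, in decimal gigabytes.
function weightMemoryGB(paramCount: number, bytesPerParam = BYTES_PER_BF16_PARAM): number {
  return (paramCount * bytesPerParam) / 1e9;
}

// Processing time implied by an inverse real-time factor (RTFx).
function processingSeconds(audioSeconds: number, rtfx: number): number {
  return audioSeconds / rtfx;
}

// ~4.7B parameters in bf16: weights only, before activations and KV cache.
console.log(weightMemoryGB(4.7e9).toFixed(1)); // 9.4

// 30 minutes of audio at 109.86x RTFx.
console.log(processingSeconds(30 * 60, 109.86).toFixed(1)); // 16.4
```

The weight estimate of about 9.4 GB is consistent with the ~9.5 GB listed above. The RTFx figure describes benchmark throughput, so it does not by itself predict the latency of a single request.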

    Specification

    Framework      HF
    Organization   mistralai
    Feature        Transcription
    Output         text + timestamps
    Modalities     video, audio
    Retriever      Transcript Search
    Parameters     ~4.7B
    License        Apache 2.0
    Downloads/mo   532K

    Research Paper

    Voxtral (arXiv:2507.13264)

    Build a pipeline with Voxtral-Mini-3B-2507

    Add this model to a processing pipeline alongside other extractors. Combine with retrieval stages for end-to-end search.
