    HF, Transcription, Apache 2.0

    MiMo-V2.5-ASR

    by XiaomiMiMo

    Dialect-robust ASR with SOTA accuracy and song lyrics transcription

    2K downloads/month
    ~8B parameters
    Identifiers
    Model ID
    XiaomiMiMo/MiMo-V2.5-ASR
    Feature URI
    mixpeek://transcription@v1/xiaomi_mimo_v25_asr_v1
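
    The feature URI appears to pack three fields into one string: a feature name, a version tag, and a model slug. The sketch below splits the URI on that assumption. The field names and the `parseFeatureUri` helper are hypothetical; the layout is inferred from the single URI shown here, not from Mixpeek documentation.

```typescript
// Hypothetical helper: the layout <feature>@<version>/<slug> is an assumption
// based on the one URI listed on this page.
interface FeatureUri {
  feature: string;
  version: string;
  slug: string;
}

function parseFeatureUri(uri: string): FeatureUri | null {
  // Expect the "mixpeek://" scheme, then feature@version/slug.
  const match = /^mixpeek:\/\/([^@\/]+)@([^\/]+)\/(.+)$/.exec(uri);
  if (match === null) {
    return null;
  }
  return {
    feature: match[1] ?? "",
    version: match[2] ?? "",
    slug: match[3] ?? "",
  };
}

const parsedUri = parseFeatureUri(
  "mixpeek://transcription@v1/xiaomi_mimo_v25_asr_v1"
);
console.log(parsedUri);
```

    Returning `null` for anything that does not match keeps the caller from silently treating an unrelated URL as a feature reference.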

    Overview

    MiMo-V2.5-ASR is Xiaomi's speech recognition model. It tops the HuggingFace Open ASR Leaderboard with a mean word error rate (WER) of 5.73%. Beyond English accuracy, it performs well in areas where other models struggle: Chinese dialect recognition (Wu, Cantonese, Hokkien, Sichuanese), code-switching between languages, song lyrics transcription, and noisy multi-speaker environments.

    On Mixpeek, MiMo fills a gap for content in Chinese dialects, multilingual recordings with code-switching, and music content where lyrics need to be searchable. Its robustness to background noise makes it suitable for real-world recordings where Whisper's accuracy drops.

    Architecture

    The model pairs a large-scale speech encoder with a language model decoder. It was trained on diverse audio, including dialects, code-switched speech, and music, and it handles multi-speaker and noisy recordings. It is released under the Apache 2.0 license.

    Mixpeek SDK Integration

    import { Mixpeek } from "mixpeek";

    // Replace "API_KEY" with your Mixpeek API key.
    const mx = new Mixpeek({ apiKey: "API_KEY" });

    // Ingest one audio file and transcribe it with MiMo-V2.5-ASR.
    await mx.collections.ingest({
      collection_id: "music-library",
      source: { url: "https://example.com/song.mp3" },
      feature_extractors: [{
        feature: "transcription",
        model: "XiaomiMiMo/MiMo-V2.5-ASR"
      }]
    });
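
    When several files need the same extractor, it can help to build the request objects first and send them in a loop. The sketch below reuses the request shape from the ingest call above. The type names, the `buildTranscriptionRequests` helper, and the second URL are illustrative assumptions, not part of the Mixpeek SDK.

```typescript
// Shape copied from the ingest call on this page. These type names are
// illustrative and are not exported by the Mixpeek SDK.
interface FeatureExtractor {
  feature: string;
  model: string;
}

interface IngestRequest {
  collection_id: string;
  source: { url: string };
  feature_extractors: FeatureExtractor[];
}

const MIMO_ASR_MODEL = "XiaomiMiMo/MiMo-V2.5-ASR";

// Build one transcription request per audio URL.
function buildTranscriptionRequests(
  collectionId: string,
  urls: string[]
): IngestRequest[] {
  return urls.map((url) => ({
    collection_id: collectionId,
    source: { url },
    feature_extractors: [{ feature: "transcription", model: MIMO_ASR_MODEL }],
  }));
}

const ingestRequests = buildTranscriptionRequests("music-library", [
  "https://example.com/song.mp3",
  "https://example.com/interview.mp3", // hypothetical second file
]);
console.log(JSON.stringify(ingestRequests, null, 2));
```

    Each element could then be passed to `mx.collections.ingest(...)` in turn, assuming the call accepts the same object shape as the single-file example.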

    Capabilities

    • #1 on HuggingFace Open ASR Leaderboard (5.73% mean WER)
    • Chinese dialect recognition (Wu, Cantonese, Hokkien, Sichuanese)
    • Code-switching between Chinese and English (14.07% WER)
    • Song lyrics transcription (3.95% WER on m4singer)
    • Robust in multi-speaker and noisy environments

    Use Cases on Mixpeek

    Multilingual media: transcribe recordings with Chinese dialect content
    Music indexing: extract searchable lyrics from music recordings
    Conference calls: handle code-switching between languages
    Noisy environments: transcribe real-world recordings with background noise

    Benchmarks

    Dataset               Metric    Score   Source
    Open ASR Leaderboard  Mean WER  5.73%   HuggingFace Open ASR Leaderboard, 2026
    LibriSpeech Clean     WER       1.45%   Xiaomi, 2026 (Model Card)
    m4singer (lyrics)     WER       3.95%   Xiaomi, 2026 (Model Card)
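
    The scores above are word error rates: the number of word substitutions, deletions, and insertions needed to turn the model's output into the reference transcript, divided by the number of reference words. A minimal sketch of that calculation follows. The `wordErrorRate` function is illustrative; published evaluations typically normalize casing and punctuation before scoring, which this sketch omits.

```typescript
// Word error rate = (substitutions + deletions + insertions) / reference words,
// computed here as a word-level Levenshtein distance.
function wordErrorRate(reference: string, hypothesis: string): number {
  const ref = reference.trim().split(/\s+/).filter((w) => w.length > 0);
  const hyp = hypothesis.trim().split(/\s+/).filter((w) => w.length > 0);
  if (ref.length === 0) {
    throw new Error("reference must contain at least one word");
  }

  // prev[j] holds the edit distance between the reference words processed
  // so far and the first j hypothesis words.
  let prev: number[] = Array.from({ length: hyp.length + 1 }, (_, j) => j);
  for (let i = 1; i <= ref.length; i++) {
    const curr: number[] = [i];
    for (let j = 1; j <= hyp.length; j++) {
      const substitution = prev[j - 1]! + (ref[i - 1] === hyp[j - 1] ? 0 : 1);
      const deletion = prev[j]! + 1;
      const insertion = curr[j - 1]! + 1;
      curr.push(Math.min(substitution, deletion, insertion));
    }
    prev = curr;
  }
  return prev[hyp.length]! / ref.length;
}

// One substitution ("sat" -> "sit") and one deletion ("the"): 2 errors / 6 words.
const exampleWer = wordErrorRate("the cat sat on the mat", "the cat sit on mat");
console.log((exampleWer * 100).toFixed(2) + "%");
```

    Because insertions count as errors, WER can exceed 100% when the hypothesis is much longer than the reference.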

    Performance

    Input Size: Audio (any length)
    GPU Latency: ~0.3x real-time (A100)
    GPU Throughput: ~330x RTFx
    GPU Memory: ~16 GB
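
    RTFx is the inverse real-time factor: seconds of audio processed per second of compute. As a rough planning aid, the sketch below converts an RTFx value into an estimated processing time. The `estimateProcessingSeconds` helper is hypothetical, and the result assumes the quoted throughput holds for your hardware and batch setup, which this page does not specify.

```typescript
// RTFx = audio duration / processing time, so
// processing time = audio duration / RTFx.
function estimateProcessingSeconds(audioSeconds: number, rtfx: number): number {
  if (!(audioSeconds >= 0) || !(rtfx > 0)) {
    throw new RangeError("audioSeconds must be >= 0 and rtfx must be > 0");
  }
  return audioSeconds / rtfx;
}

// One hour of audio at the ~330x RTFx figure listed above.
const oneHourSeconds = 3600;
const estimate = estimateProcessingSeconds(oneHourSeconds, 330);
console.log(estimate.toFixed(1) + " s"); // about 10.9 s
```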

    Specification

    Framework: HF
    Organization: XiaomiMiMo
    Feature: Transcription
    Output: text + timestamps
    Modalities: video, audio
    Retriever: Transcript Search
    Parameters: ~8B
    License: Apache 2.0
    Downloads/mo: 2K
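
    Because the output includes timestamps alongside text, a search hit can point to a position in the recording rather than to the whole file. The sketch below illustrates the idea with naive substring matching over timestamped segments. The `TranscriptSegment` shape and the `findSegments` helper are assumptions made for illustration; this page does not describe the actual output schema or how the Transcript Search retriever works.

```typescript
// Hypothetical segment shape: start and end are in seconds.
interface TranscriptSegment {
  start: number;
  end: number;
  text: string;
}

// Case-insensitive substring match over segment text.
function findSegments(
  segments: TranscriptSegment[],
  query: string
): TranscriptSegment[] {
  const needle = query.trim().toLowerCase();
  if (needle.length === 0) {
    return [];
  }
  return segments.filter((s) => s.text.toLowerCase().indexOf(needle) !== -1);
}

// Made-up sample data for the example.
const sampleSegments: TranscriptSegment[] = [
  { start: 0.0, end: 4.2, text: "Welcome back to the show" },
  { start: 4.2, end: 9.8, text: "Tonight we talk about dialect recognition" },
];

for (const hit of findSegments(sampleSegments, "Dialect")) {
  console.log(`${hit.start}s-${hit.end}s: ${hit.text}`);
}
```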

    Research Paper

    MiMo-V2.5-ASR Technical Report

    arxiv.org

    Build a pipeline with MiMo-V2.5-ASR

    Add this model to a processing pipeline alongside other extractors. Combine with retrieval stages for end-to-end search.
