MiMo-V2.5-ASR
by XiaomiMiMo
Dialect-robust ASR with SOTA accuracy and song lyrics transcription
XiaomiMiMo/MiMo-V2.5-ASR
mixpeek://transcription@v1/xiaomi_mimo_v25_asr_v1
Overview
MiMo V2.5 ASR is Xiaomi's speech recognition model that tops the HuggingFace Open ASR Leaderboard at 5.73% mean WER. Beyond English accuracy, it excels in areas where other models struggle: Chinese dialect recognition (Wu, Cantonese, Hokkien, Sichuanese), code-switching between languages, song lyrics transcription, and noisy multi-speaker environments.
On Mixpeek, MiMo fills a gap for content in Chinese dialects, multilingual recordings with code-switching, and music content where lyrics need to be searchable. Its robustness to background noise makes it suitable for real-world recordings where Whisper's accuracy drops.
Architecture
The model pairs a large-scale speech encoder with a language-model decoder. It is trained on diverse audio, including dialects, code-switched speech, and music, and it handles multi-speaker and noisy environments. It is released under the Apache 2.0 license.
Mixpeek SDK Integration
import { Mixpeek } from "mixpeek";

const mx = new Mixpeek({ apiKey: "API_KEY" });

await mx.collections.ingest({
  collection_id: "music-library",
  source: { url: "https://example.com/song.mp3" },
  feature_extractors: [
    {
      feature: "transcription",
      model: "XiaomiMiMo/MiMo-V2.5-ASR"
    }
  ]
});
Capabilities
- #1 on HuggingFace Open ASR Leaderboard (5.73% mean WER)
- Chinese dialect recognition (Wu, Cantonese, Hokkien, Sichuanese)
- Code-switching between Chinese and English (14.07% WER)
- Song lyrics transcription (3.95% WER on m4singer)
- Robust in multi-speaker and noisy environments
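The WER figures quoted above are word error rates: the number of word substitutions, deletions, and insertions needed to turn the model's transcript into the reference, divided by the number of reference words. The sketch below shows that calculation with a word-level edit distance. It is illustrative only: the function names are hypothetical, and splitting on whitespace after lowercasing is a simplifying assumption. Benchmark pipelines apply their own text normalization, and Chinese text is often scored per character rather than per word.

```typescript
// Split a transcript into lowercase words on whitespace.
// Assumption: no punctuation stripping or other normalization is applied.
function tokenize(text: string): string[] {
  return text
    .trim()
    .toLowerCase()
    .split(/\s+/)
    .filter((word) => word.length > 0);
}

// Word error rate = word-level edit distance / number of reference words.
function wordErrorRate(reference: string, hypothesis: string): number {
  const ref = tokenize(reference);
  const hyp = tokenize(hypothesis);
  if (ref.length === 0) {
    throw new Error("reference must contain at least one word");
  }

  // prev[j] holds the edit distance between the reference words seen so far
  // and the first j hypothesis words.
  let prev: number[] = Array.from({ length: hyp.length + 1 }, (_, j) => j);

  for (let i = 1; i <= ref.length; i++) {
    const curr: number[] = [i];
    for (let j = 1; j <= hyp.length; j++) {
      const substitution = prev[j - 1] + (ref[i - 1] === hyp[j - 1] ? 0 : 1);
      const deletion = prev[j] + 1; // reference word missing from hypothesis
      const insertion = curr[j - 1] + 1; // extra word in hypothesis
      curr.push(Math.min(substitution, deletion, insertion));
    }
    prev = curr;
  }

  return prev[hyp.length] / ref.length;
}

// Two substituted words out of four reference words.
console.log(wordErrorRate("turn on the lights", "turn of the light")); // 0.5
```

Because insertions count as errors, WER can exceed 100% when the hypothesis is much longer than the reference.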
Benchmarks
| Dataset | Metric | Score | Source |
|---|---|---|---|
| Open ASR Leaderboard | Mean WER | 5.73% | HuggingFace Open ASR Leaderboard, 2026 |
| LibriSpeech Clean | WER | 1.45% | Xiaomi, 2026 — Model Card |
| m4singer (lyrics) | WER | 3.95% | Xiaomi, 2026 — Model Card |
Research Paper
MiMo-V2.5-ASR Technical Report
arxiv.org
Build a pipeline with MiMo-V2.5-ASR
Add this model to a processing pipeline alongside other extractors. Combine with retrieval stages for end-to-end search.
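As a rough sketch of what running MiMo alongside other extractors could look like, the helper below builds a single ingest payload that lists the transcription extractor first, followed by any others the caller supplies. It reuses only the field names from the SDK example above (`collection_id`, `source.url`, `feature_extractors`, `feature`, `model`). The helper `buildIngestRequest`, the interfaces, and the second extractor's `feature` and `model` values are hypothetical placeholders, not documented Mixpeek identifiers.

```typescript
// Payload shape copied from the SDK example; no other fields are assumed.
interface FeatureExtractor {
  feature: string;
  model: string;
}

interface IngestRequest {
  collection_id: string;
  source: { url: string };
  feature_extractors: FeatureExtractor[];
}

// The transcription extractor exactly as it appears in the SDK example.
const MIMO_ASR: FeatureExtractor = {
  feature: "transcription",
  model: "XiaomiMiMo/MiMo-V2.5-ASR",
};

// Hypothetical helper: one request that runs MiMo transcription plus
// whatever additional extractors the caller passes in.
function buildIngestRequest(
  collectionId: string,
  url: string,
  extraExtractors: FeatureExtractor[] = [],
): IngestRequest {
  return {
    collection_id: collectionId,
    source: { url },
    feature_extractors: [MIMO_ASR, ...extraExtractors],
  };
}

// Placeholder second extractor; substitute a real feature and model.
const request = buildIngestRequest(
  "music-library",
  "https://example.com/song.mp3",
  [{ feature: "example-feature", model: "example-org/example-model" }],
);

console.log(JSON.stringify(request, null, 2));
// With a configured client: await mx.collections.ingest(request);
```

Keeping payload construction separate from the network call makes the extractor list easy to inspect or unit-test before anything is ingested.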