MiMo-V2.5-ASR

by XiaomiMiMo

Dialect-robust ASR with SOTA accuracy and song lyrics transcription

2K downloads/month

~8B params

Identifiers

Model ID

XiaomiMiMo/MiMo-V2.5-ASR

Feature URI

mixpeek://transcription@v1/xiaomi_mimo_v25_asr_v1
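The feature URI above appears to pack three pieces of information into one string. The sketch below splits it apart, assuming the layout is `mixpeek://<feature>@<version>/<extractor id>`. That layout is inferred from this single example and is not a documented grammar; `parseFeatureUri` and the `FeatureUri` field names are hypothetical helpers, not part of the Mixpeek SDK.

```typescript
// Hypothetical shape for a parsed feature URI (field names are assumptions).
interface FeatureUri {
  feature: string;   // e.g. "transcription"
  version: string;   // e.g. "v1"
  extractor: string; // e.g. "xiaomi_mimo_v25_asr_v1"
}

// Split a URI of the assumed form mixpeek://<feature>@<version>/<extractor id>.
// Returns null when the string does not match that assumed layout.
function parseFeatureUri(uri: string): FeatureUri | null {
  const match = /^mixpeek:\/\/([^@\/]+)@([^\/]+)\/(.+)$/.exec(uri);
  if (!match) {
    return null;
  }
  return { feature: match[1], version: match[2], extractor: match[3] };
}

const parsed = parseFeatureUri(
  "mixpeek://transcription@v1/xiaomi_mimo_v25_asr_v1"
);
console.log(parsed);
```

A helper like this can be handy for logging or routing by feature type, but treat the field boundaries as a guess until they are confirmed against Mixpeek's own documentation.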

Overview

MiMo V2.5 ASR is Xiaomi's speech recognition model. It ranks first on the HuggingFace Open ASR Leaderboard with a mean word error rate (WER) of 5.73%. Beyond English accuracy, it performs well in areas where other models struggle: Chinese dialect recognition (Wu, Cantonese, Hokkien, Sichuanese), code-switching between languages, song lyrics transcription, and noisy multi-speaker environments.

On Mixpeek, MiMo fills a gap for content in Chinese dialects, multilingual recordings with code-switching, and music content where lyrics need to be searchable. Its robustness to background noise makes it suitable for real-world recordings where Whisper's accuracy drops.

Architecture

Large-scale speech encoder with language model decoder. Trained on diverse audio including dialects, code-switched speech, and music. Handles multi-speaker and noisy environments. Apache 2.0 license.

Mixpeek SDK Integration

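import { Mixpeek } from "mixpeek";

// Create a client; replace "API_KEY" with your own Mixpeek API key.
const mx = new Mixpeek({ apiKey: "API_KEY" });

// Ingest one audio file and run MiMo transcription on it.
await mx.collections.ingest({
  collection_id: "music-library",
  source: { url: "https://example.com/song.mp3" },
  feature_extractors: [{
    feature: "transcription",
    // Model ID as listed under Identifiers.
    model: "XiaomiMiMo/MiMo-V2.5-ASR"
  }]
});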

Capabilities

#1 on HuggingFace Open ASR Leaderboard (5.73% mean WER)
Chinese dialect recognition (Wu, Cantonese, Hokkien, Sichuanese)
Code-switching between Chinese and English (14.07% WER)
Song lyrics transcription (3.95% WER on m4singer)
Robust in multi-speaker and noisy environments

Use Cases on Mixpeek

Multilingual media: transcribe recordings with Chinese dialect content

Music indexing: extract searchable lyrics from music recordings

Conference calls: handle code-switching between languages

Noisy environments: transcribe real-world recordings with background noise

Benchmarks

Dataset	Metric	Score	Source
Open ASR Leaderboard	Mean WER	5.73%	HuggingFace Open ASR Leaderboard, 2026
LibriSpeech Clean	WER	1.45%	Xiaomi, 2026 — Model Card
m4singer (lyrics)	WER	3.95%	Xiaomi, 2026 — Model Card
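
For readers unfamiliar with the metric, the WER figures in this table count the word-level substitutions, deletions, and insertions needed to turn the model's transcript into the reference, divided by the number of reference words. The following is a minimal sketch of that calculation, not the leaderboard's or Xiaomi's evaluation code. The `tokenize` and `wordErrorRate` helpers are hypothetical names, and the lowercase-and-split-on-whitespace normalization is a simplifying assumption; published benchmarks typically apply more elaborate text normalization.

```typescript
// Simplified normalization (an assumption): lowercase, split on whitespace.
function tokenize(text: string): string[] {
  return text.toLowerCase().trim().split(/\s+/).filter((w) => w.length > 0);
}

// WER = (substitutions + deletions + insertions) / reference word count,
// computed as word-level Levenshtein distance over the reference length.
function wordErrorRate(reference: string, hypothesis: string): number {
  const ref = tokenize(reference);
  const hyp = tokenize(hypothesis);
  if (ref.length === 0) {
    throw new Error("reference must contain at least one word");
  }
  // prev[j] holds the edit distance between the first i-1 reference words
  // and the first j hypothesis words; curr is the row for i reference words.
  let prev: number[] = Array.from({ length: hyp.length + 1 }, (_, j) => j);
  for (let i = 1; i <= ref.length; i++) {
    const curr: number[] = [i];
    for (let j = 1; j <= hyp.length; j++) {
      const cost = ref[i - 1] === hyp[j - 1] ? 0 : 1;
      curr[j] = Math.min(
        prev[j] + 1,        // deletion: reference word missing from hypothesis
        curr[j - 1] + 1,    // insertion: extra word in hypothesis
        prev[j - 1] + cost  // match or substitution
      );
    }
    prev = curr;
  }
  return prev[hyp.length] / ref.length;
}

// One dropped word out of six reference words.
console.log(wordErrorRate("the cat sat on the mat", "the cat sat on mat"));
```

Whitespace tokenization does not suit unsegmented Chinese text, so evaluations on such data often score at the character level instead; this sketch would need a different tokenizer for that case.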