VibeVoice-ASR-HF
by microsoft
Unified ASR + diarization + timestamps in one 9B model — 60 min single-pass
microsoft/VibeVoice-ASR-HF
mixpeek://transcription@v1/microsoft_vibevoice_asr_v1
Overview
VibeVoice-ASR is Microsoft's unified speech recognition model. It produces structured transcriptions (speaker labels, word-level timestamps, and content) from up to 60 minutes of audio in a single pass. It replaces the traditional pipeline of separate ASR, diarization, and alignment models with one 9B-parameter model.
Supporting 50+ languages with native code-switching (no language flag required), it handles meetings, interviews, podcasts, and call center recordings where knowing who said what matters as much as what was said. On Mixpeek, it powers speaker-attributed transcription for video and audio assets.
Architecture
Encoder-decoder transformer (9B parameters) with multi-task training for simultaneous ASR, speaker diarization, and timestamp alignment. Processes up to 60 minutes of audio in a single pass without sliding window chunking.
Mixpeek SDK Integration
import { Mixpeek } from "mixpeek";

const mx = new Mixpeek({ apiKey: "API_KEY" });

await mx.collections.ingest({
  collection_id: "my-collection",
  source: { url: "https://example.com/meeting.mp4" },
  feature_extractors: [
    {
      name: "transcription",
      version: "v1",
      params: {
        model_id: "microsoft/VibeVoice-ASR-HF",
        enable_diarization: true
      }
    }
  ]
});
Capabilities
- Joint ASR + speaker diarization + word timestamps in one pass
- 60-minute single-pass processing without chunking
- 50+ languages with automatic code-switching
- Structured output: speaker ID, timestamps, and text per segment
- MIT license (permissive, including commercial use)
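The structured output described above (speaker ID, timestamps, and text per segment) lends itself to simple post-processing. The following is a minimal sketch, not the Mixpeek or model schema: the `TranscriptSegment` interface, its field names, and both helper functions are hypothetical and exist only for illustration. It merges consecutive segments from the same speaker into turns and totals the talk time per speaker.

```typescript
// Hypothetical segment shape. Field names are assumptions made for this
// example, not the documented Mixpeek or VibeVoice-ASR output schema.
interface TranscriptSegment {
  speaker: string; // speaker label, e.g. "SPEAKER_0"
  start: number;   // segment start in seconds
  end: number;     // segment end in seconds
  text: string;    // transcribed content
}

// Merge consecutive segments spoken by the same speaker into a single turn.
// The input is assumed to be sorted by start time.
function mergeSpeakerTurns(segments: TranscriptSegment[]): TranscriptSegment[] {
  const turns: TranscriptSegment[] = [];
  for (const seg of segments) {
    const last = turns[turns.length - 1];
    if (last !== undefined && last.speaker === seg.speaker) {
      // Same speaker continues: extend the current turn.
      last.end = seg.end;
      last.text = `${last.text} ${seg.text}`;
    } else {
      // New speaker: start a new turn (copy so the input is not mutated).
      turns.push({ ...seg });
    }
  }
  return turns;
}

// Sum the duration of all segments per speaker, in seconds.
function talkTimeBySpeaker(segments: TranscriptSegment[]): Map<string, number> {
  const totals = new Map<string, number>();
  for (const seg of segments) {
    const duration = seg.end - seg.start;
    totals.set(seg.speaker, (totals.get(seg.speaker) ?? 0) + duration);
  }
  return totals;
}
```

Merging into turns is one way to answer "who said what" at the conversation level, while the per-speaker totals give a quick view of participation in a meeting or call.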
Benchmarks
| Dataset | Metric | Score | Source |
|---|---|---|---|
| Earnings-22 (long-form) | WER | 11.2% | Microsoft, 2026 — Model Card |
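For readers unfamiliar with the metric, word error rate (WER) is the word-level edit distance between a reference transcript and the model output, divided by the number of reference words. The sketch below shows that standard definition on toy strings. It does not reproduce the Earnings-22 evaluation, and it assumes plain whitespace tokenization with no text normalization, which published benchmarks usually apply.

```typescript
// Word error rate: (substitutions + deletions + insertions) / reference words.
// Tokenization here is plain whitespace splitting, an assumption made for
// this example; real evaluations typically normalize case and punctuation.
function wordErrorRate(reference: string, hypothesis: string): number {
  const ref = reference.trim().split(/\s+/).filter((w) => w.length > 0);
  const hyp = hypothesis.trim().split(/\s+/).filter((w) => w.length > 0);
  if (ref.length === 0) {
    throw new Error("reference must contain at least one word");
  }

  // Levenshtein distance over words, keeping only the previous row.
  let prev: number[] = Array.from({ length: hyp.length + 1 }, (_, j) => j);
  for (let i = 1; i <= ref.length; i++) {
    const curr: number[] = new Array<number>(hyp.length + 1).fill(0);
    curr[0] = i; // deleting i reference words
    for (let j = 1; j <= hyp.length; j++) {
      const substitutionCost = ref[i - 1] === hyp[j - 1] ? 0 : 1;
      curr[j] = Math.min(
        prev[j] + 1,                   // deletion
        curr[j - 1] + 1,               // insertion
        prev[j - 1] + substitutionCost // match or substitution
      );
    }
    prev = curr;
  }
  return prev[hyp.length] / ref.length;
}
```

Under this definition, a hypothesis that drops one word from a six-word reference has a WER of 1/6, and WER can exceed 100% when the hypothesis contains many insertions.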
Research Paper
VibeVoice-ASR: Longform Structured Speech Recognition at Scale
arxiv.org
Build a pipeline with VibeVoice-ASR-HF
Add this model to a processing pipeline alongside other extractors. Combine with retrieval stages for end-to-end search.