    HF · Transcription · MIT

    VibeVoice-ASR-HF

    by microsoft

    Unified ASR + diarization + timestamps in one 9B model — 60 min single-pass

    295K downloads/month
    9B params
    Identifiers
    Model ID
    microsoft/VibeVoice-ASR-HF
    Feature URI
    mixpeek://transcription@v1/microsoft_vibevoice_asr_v1

    Overview

    VibeVoice-ASR is Microsoft's unified speech recognition model that produces structured rich transcriptions — speaker labels, word-level timestamps, and content — from up to 60 minutes of audio in a single forward pass. It replaces the traditional pipeline of separate ASR, diarization, and alignment models with one 9B parameter model.

    Supporting 50+ languages with native code-switching (no language flag required), it handles meetings, interviews, podcasts, and call center recordings where knowing who said what matters as much as what was said. On Mixpeek, it powers speaker-attributed transcription for video and audio assets.

    Architecture

    Encoder-decoder transformer (9B parameters) with multi-task training for simultaneous ASR, speaker diarization, and timestamp alignment. Processes up to 60 minutes of audio in a single pass without sliding window chunking.
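To make the "no sliding window chunking" point concrete, the sketch below counts how many windows a conventional chunked pipeline would need for the same 60 minutes of audio. The 30-second window and 5-second overlap are illustrative assumptions, not values taken from any specific pipeline, and `slidingWindows` is a hypothetical helper.

```typescript
// Hypothetical helper: enumerate [start, end) windows (in seconds) that a
// chunked pipeline would process. Window and overlap sizes are assumptions.
function slidingWindows(
  totalSec: number,
  windowSec: number,
  overlapSec: number
): Array<[number, number]> {
  if (windowSec <= 0 || overlapSec < 0 || overlapSec >= windowSec) {
    throw new RangeError("need windowSec > 0 and 0 <= overlapSec < windowSec");
  }
  const step = windowSec - overlapSec;
  const windows: Array<[number, number]> = [];
  for (let start = 0; start < totalSec; start += step) {
    const end = Math.min(start + windowSec, totalSec);
    windows.push([start, end]);
    if (end === totalSec) break; // the last window reached the end of the audio
  }
  return windows;
}

// 60 minutes of audio, 30 s windows, 5 s overlap (illustrative values).
const chunkedWindows = slidingWindows(60 * 60, 30, 5);
console.log(`chunked pipeline: ${chunkedWindows.length} windows; single pass: 1`);
```

Each of those windows would also need its speaker labels and timestamps stitched back together across boundaries, which is the step a single-pass model avoids.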

    Mixpeek SDK Integration

    import { Mixpeek } from "mixpeek";

    // Replace "API_KEY" with your Mixpeek API key.
    const mx = new Mixpeek({ apiKey: "API_KEY" });

    // Ingest a video and run the transcription extractor with diarization enabled.
    await mx.collections.ingest({
      collection_id: "my-collection",
      source: { url: "https://example.com/meeting.mp4" },
      feature_extractors: [{
        name: "transcription",
        version: "v1",
        params: {
          model_id: "microsoft/VibeVoice-ASR-HF",
          enable_diarization: true
        }
      }]
    });

    Capabilities

    • Joint ASR + speaker diarization + word timestamps in one pass
    • 60-minute single-pass processing without chunking
    • 50+ languages with automatic code-switching
    • Structured output: speaker ID, timestamps, and text per segment
    • MIT license, permitting commercial use, modification, and redistribution
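As a rough illustration of the structured output described above (speaker ID, timestamps, and text per segment), the sketch below models a segment and aggregates talk time and utterances per speaker. The `TranscriptSegment` field names and the `SPEAKER_0` label format are assumptions for illustration; the actual response schema may differ.

```typescript
// Hypothetical segment shape: field names are assumptions, not a documented schema.
interface TranscriptSegment {
  speaker: string; // speaker ID, e.g. "SPEAKER_0" (label format assumed)
  start: number;   // segment start in seconds
  end: number;     // segment end in seconds
  text: string;    // transcribed content
}

interface SpeakerSummary {
  talkSeconds: number;
  utterances: string[];
}

// Group segments by speaker, summing talk time and collecting their text.
function summarizeBySpeaker(
  segments: TranscriptSegment[]
): Map<string, SpeakerSummary> {
  const bySpeaker = new Map<string, SpeakerSummary>();
  for (const seg of segments) {
    const entry = bySpeaker.get(seg.speaker) ?? { talkSeconds: 0, utterances: [] };
    entry.talkSeconds += seg.end - seg.start;
    entry.utterances.push(seg.text);
    bySpeaker.set(seg.speaker, entry);
  }
  return bySpeaker;
}

// Made-up sample data, for illustration only.
const sampleSegments: TranscriptSegment[] = [
  { speaker: "SPEAKER_0", start: 0, end: 4, text: "Welcome, everyone." },
  { speaker: "SPEAKER_1", start: 4, end: 10, text: "Thanks for having me." },
  { speaker: "SPEAKER_0", start: 10, end: 13, text: "Let's get started." },
];
const speakerSummary = summarizeBySpeaker(sampleSegments);
```

Keeping the speaker ID on every segment is what makes per-speaker filtering and analytics a simple grouping step rather than a separate alignment job.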

    Use Cases on Mixpeek

    • Meeting transcription with speaker attribution
    • Podcast and interview indexing with per-speaker search
    • Call center analytics with speaker-separated transcripts
    • Video subtitle generation with speaker labels
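The subtitle use case can be sketched as a small formatter that turns speaker-labeled segments into SubRip (SRT) cues. The `LabeledSegment` shape is a hypothetical stand-in for the model's output, and prefixing each cue with the speaker in square brackets is one possible convention, not a requirement of the format.

```typescript
// Hypothetical segment shape; field names are assumptions for illustration.
interface LabeledSegment {
  speaker: string;
  start: number; // seconds
  end: number;   // seconds
  text: string;
}

// Format seconds as an SRT timestamp: HH:MM:SS,mmm
function toSrtTimestamp(seconds: number): string {
  const totalMs = Math.round(seconds * 1000);
  const ms = totalMs % 1000;
  const totalSec = Math.floor(totalMs / 1000);
  const s = totalSec % 60;
  const m = Math.floor(totalSec / 60) % 60;
  const h = Math.floor(totalSec / 3600);
  const pad = (n: number, width: number) => String(n).padStart(width, "0");
  return `${pad(h, 2)}:${pad(m, 2)}:${pad(s, 2)},${pad(ms, 3)}`;
}

// Build an SRT document: numbered cues separated by blank lines.
function toSrt(segments: LabeledSegment[]): string {
  return (
    segments
      .map(
        (seg, i) =>
          `${i + 1}\n` +
          `${toSrtTimestamp(seg.start)} --> ${toSrtTimestamp(seg.end)}\n` +
          `[${seg.speaker}] ${seg.text}`
      )
      .join("\n\n") + "\n"
  );
}

const srtPreview = toSrt([
  { speaker: "SPEAKER_0", start: 0, end: 2, text: "Hello" },
  { speaker: "SPEAKER_1", start: 2, end: 4.5, text: "Hi there" },
]);
console.log(srtPreview);
```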

    Benchmarks

    Dataset                 | Metric | Score | Source
    Earnings-22 (long-form) | WER    | 11.2% | Microsoft, 2026 (Model Card)
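For readers unfamiliar with the metric, word error rate (WER) is the word-level edit distance between a reference transcript and the model's hypothesis, divided by the number of reference words. The sketch below is a minimal version of that calculation; it is not the scoring script behind the figure above, and it omits the text normalization (casing, punctuation) that benchmark evaluations typically apply.

```typescript
// Minimal WER sketch: (substitutions + deletions + insertions) / reference words.
function wordErrorRate(reference: string, hypothesis: string): number {
  const ref = reference.trim().split(/\s+/).filter(Boolean);
  const hyp = hypothesis.trim().split(/\s+/).filter(Boolean);
  if (ref.length === 0) {
    throw new Error("reference must contain at least one word");
  }
  // prev[j] holds the edit distance between the reference words processed
  // so far and the first j hypothesis words (row-by-row Levenshtein).
  let prev: number[] = Array.from({ length: hyp.length + 1 }, (_, j) => j);
  for (let i = 1; i <= ref.length; i++) {
    const curr: number[] = [i];
    for (let j = 1; j <= hyp.length; j++) {
      const cost = ref[i - 1] === hyp[j - 1] ? 0 : 1;
      curr.push(
        Math.min(
          prev[j] + 1,       // deletion (reference word missing from hypothesis)
          curr[j - 1] + 1,   // insertion (extra hypothesis word)
          prev[j - 1] + cost // match or substitution
        )
      );
    }
    prev = curr;
  }
  return prev[hyp.length] / ref.length;
}

// One inserted word against a six-word reference gives a WER of 1/6.
const exampleWer = wordErrorRate(
  "the quarterly revenue grew ten percent",
  "the quarterly revenue grew by ten percent"
);
console.log(`WER: ${(exampleWer * 100).toFixed(1)}%`);
```

Read this way, a WER of 11.2% corresponds to roughly 11 word-level errors per 100 reference words.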

    Performance

    Input Size: Up to 60 minutes of audio
    GPU Latency: ~8 s per minute of audio (A100)
    GPU Throughput: ~7.5x realtime (A100)
    GPU Memory: ~18 GB
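The latency and throughput figures above are two views of the same number: 60 seconds of audio processed in about 8 seconds is 60 / 8 = 7.5x realtime. The sketch below applies that arithmetic to estimate processing time for a given recording; it assumes latency scales linearly with duration, which the figures imply but do not guarantee, and `estimateProcessingSeconds` is a hypothetical helper.

```typescript
// Approximate A100 figures quoted above; treat them as rough estimates.
const SECONDS_PER_AUDIO_MINUTE = 8;
const MAX_AUDIO_MINUTES = 60;

// Estimated GPU processing time, assuming latency scales linearly with duration.
function estimateProcessingSeconds(audioMinutes: number): number {
  if (audioMinutes < 0 || audioMinutes > MAX_AUDIO_MINUTES) {
    throw new RangeError(`audioMinutes must be between 0 and ${MAX_AUDIO_MINUTES}`);
  }
  return audioMinutes * SECONDS_PER_AUDIO_MINUTE;
}

// Realtime factor: seconds of audio processed per second of compute.
function realtimeFactor(): number {
  return 60 / SECONDS_PER_AUDIO_MINUTE;
}

const fullHourSeconds = estimateProcessingSeconds(60);
console.log(`60 min of audio: ~${fullHourSeconds / 60} min of GPU time`);
console.log(`realtime factor: ${realtimeFactor()}x`);
```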

    Specification

    Framework: HF
    Organization: microsoft
    Feature: Transcription
    Output: text + timestamps
    Modalities: video, audio
    Retriever: Transcript Search
    Parameters: 9B
    License: MIT
    Downloads/mo: 295K

    Research Paper

    VibeVoice-ASR: Longform Structured Speech Recognition at Scale

    arxiv.org

    Build a pipeline with VibeVoice-ASR-HF

    Add this model to a processing pipeline alongside other extractors. Combine with retrieval stages for end-to-end search.
