
    whisper-large-v3

    by openai

    Robust speech recognition trained on 680K hours of multilingual audio

    4.7M downloads/month
    5,555 likes
    1.5B params
    Identifiers
    Model ID
    openai/whisper-large-v3
    Feature URI
    mixpeek://transcription@v1/openai_whisper_large_v3

    Overview

    Whisper is a general-purpose speech recognition model trained on a massive dataset of diverse audio. It supports multilingual transcription, translation, and language identification. The large-v3 variant achieves near-human accuracy on many benchmarks.

    On Mixpeek, Whisper powers audio transcription for video and audio content, generating timestamped text that enables full-text search across spoken content.

    Architecture

    An encoder-decoder Transformer with 32 encoder layers and 32 decoder layers. Audio is processed in 30-second segments represented as 128-channel log-mel spectrograms (large-v3 increased the mel bins from the 80 used by earlier variants). Training uses a multi-task format in which special tokens mark timestamps, language, and task type (transcribe vs. translate).
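As a concrete check on the segment size: with Whisper's standard preprocessing assumptions (16 kHz input and a 10 ms hop between frames, neither stated above), each fixed 30-second segment yields a 3,000-frame spectrogram for the encoder.

```typescript
// Whisper's standard preprocessing (assumed here: 16 kHz audio, 10 ms hop)
// turns a fixed 30-second segment into a mel spectrogram of shape
// [n_mels, 3000] that the encoder consumes.
const SAMPLE_RATE = 16_000; // Hz (assumed standard Whisper input rate)
const HOP_LENGTH = 160;     // samples between frames (10 ms at 16 kHz)
const CHUNK_SECONDS = 30;   // fixed segment length from the text above

const melFrames = (CHUNK_SECONDS * SAMPLE_RATE) / HOP_LENGTH;
console.log(melFrames); // 3000 frames per 30 s chunk
```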

    Mixpeek SDK Integration

    import { Mixpeek } from "mixpeek";

    const mx = new Mixpeek({ apiKey: "API_KEY" });

    await mx.collections.ingest({
      collection_id: "my-collection",
      source: { url: "https://example.com/video.mp4" },
      feature_extractors: [{
        name: "audio_transcription",
        version: "v1",
        params: {
          model_id: "openai/whisper-large-v3"
        }
      }]
    });

    Capabilities

    • 99+ language transcription and translation
    • Word-level timestamps
    • Robust to background noise, accents, and domain-specific vocabulary
    • Automatic language detection

    Use Cases on Mixpeek

    • Transcribe video libraries for full-text search
    • Generate subtitles and closed captions at scale
    • Call center analytics: search call recordings by content
    • Podcast and webinar content indexing
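The subtitle use case reduces to converting timestamped transcript segments into a caption format. A minimal sketch, assuming Whisper-style { start, end, text } segments in seconds — the exact field names Mixpeek's extractor returns may differ:

```typescript
// Hypothetical transcript segments: { start, end, text } in seconds.
interface Segment { start: number; end: number; text: string; }

// Format seconds as an SRT timestamp: HH:MM:SS,mmm
function srtTime(s: number): string {
  const ms = Math.round(s * 1000);
  const pad = (n: number, w = 2) => String(n).padStart(w, "0");
  const h = Math.floor(ms / 3_600_000);
  const m = Math.floor((ms % 3_600_000) / 60_000);
  const sec = Math.floor((ms % 60_000) / 1000);
  return `${pad(h)}:${pad(m)}:${pad(sec)},${pad(ms % 1000, 3)}`;
}

// Emit numbered SRT cues from a list of segments.
function toSrt(segments: Segment[]): string {
  return segments
    .map((seg, i) =>
      `${i + 1}\n${srtTime(seg.start)} --> ${srtTime(seg.end)}\n${seg.text.trim()}\n`)
    .join("\n");
}

const demo: Segment[] = [
  { start: 0.0, end: 2.4, text: "Welcome to the webinar." },
  { start: 2.4, end: 5.1, text: "Today we cover speech search." },
];
console.log(toSrt(demo));
```

The same segment records drive the full-text-search use case: index each segment's text with its timestamps so a query can jump straight to the matching moment in the recording.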

    Benchmarks

    Dataset                    Metric    Score   Source
    Fleurs (62 langs)          Avg WER   10.4%   Radford et al., 2023 — Table 1
    LibriSpeech (test-clean)   WER       2.0%    Radford et al., 2023 — Table 2
    Common Voice 15            Avg WER   11.7%   Whisper model card

    Performance

    Input Size       30 s audio chunks
    GPU Latency      ~320 ms / 30 s chunk (A100)
    CPU Latency      ~4.2 s / 30 s chunk
    GPU Throughput   ~5.6× realtime (A100)
    GPU Memory       ~3.1 GB

    1.55B params, supports 99 languages
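The throughput figure above translates directly into batch-processing estimates. A back-of-envelope calculation at ~5.6× realtime on a single A100:

```typescript
// Estimate wall-clock GPU time to transcribe a batch of audio,
// using the ~5.6x realtime throughput figure from the table above.
const realtimeFactor = 5.6;  // audio seconds processed per wall-clock second
const audioMinutes = 60;     // one hour of source audio

const gpuMinutes = audioMinutes / realtimeFactor;
console.log(gpuMinutes.toFixed(1)); // "10.7" minutes per hour of audio
```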

    Specification

    Framework        HF
    Organization     openai
    Feature          Transcription
    Output           text + timestamps
    Modalities       video, audio
    Retriever        Transcript Search
    Parameters       1.5B
    License          apache-2.0
    Downloads/mo     4.7M
    Likes            5,555

    Research Paper

    Robust Speech Recognition via Large-Scale Weak Supervision

    arxiv.org

    Build a pipeline with whisper-large-v3

    Add this model to a processing pipeline alongside other extractors. Combine with retrieval stages for end-to-end search.

    Open Pipeline Builder