NEWWhy single embeddings fail for video.Read the post →
    Models/Speech & Audio/openai/whisper-large-v3-turbo
    HFTranscriptionMIT

    whisper-large-v3-turbo

    by openai

    Whisper at 216x real-time -- pruned decoder for production-grade ASR speed

    4.3Mdl/month
    809Mparams
    Identifiers
    Model ID
    openai/whisper-large-v3-turbo
    Feature URI
    mixpeek://transcription@v1/openai_whisper_large_v3_turbo

    Overview

    Whisper Large V3 Turbo is OpenAI's speed-optimized variant of Whisper Large V3, achieved by pruning the decoder from 32 layers to 4. This yields a 216x real-time speed factor with less than 1% increase in word error rate compared to the full model.

    The model retains the full encoder and multilingual capabilities of Whisper Large V3, supporting 100+ languages. On Mixpeek, it provides the best speed/quality tradeoff for production ASR workloads -- fast enough for batch processing of large video libraries while maintaining near-full accuracy.

    Architecture

    Encoder-decoder Transformer. Full Whisper Large V3 encoder (32 layers) with pruned decoder (4 layers, down from 32). 809M total parameters. 128 mel spectrogram input. Multilingual, multitask (transcription + translation).

    Mixpeek SDK Integration

    import { Mixpeek } from "mixpeek";
    const mx = new Mixpeek({ apiKey: "API_KEY" });
    await mx.collections.ingest({
    collection_id: "my-collection",
    source: { url: "https://example.com/interview.mp4" },
    feature_extractors: [{
    name: "transcription",
    version: "v1",
    params: {
    model_id: "openai/whisper-large-v3-turbo"
    }
    }]
    });

    Capabilities

    • 216x real-time speed factor
    • 100+ language transcription
    • Speech translation to English
    • Timestamp prediction
    • Near-identical accuracy to full Whisper Large V3

    Use Cases on Mixpeek

    High-throughput batch transcription of video libraries
    Real-time captioning with low GPU cost
    Multilingual media archive transcription
    Agent perception of audio streams at scale

    Benchmarks

    DatasetMetricScoreSource
    LibriSpeech CleanWER2.0%OpenAI, 2024 -- Model Card
    CommonVoice 15 (multilingual)WER11.7%OpenAI, 2024 -- Model Card

    Performance

    Input SizeUp to 30s audio chunks
    GPU Latency~0.14s / 30s chunk (A100)
    GPU Throughput~216x realtime (A100)
    GPU Memory~3.2 GB

    Specification

    FrameworkHF
    Organizationopenai
    FeatureTranscription
    Outputtext + timestamps
    Modalitiesvideo, audio
    RetrieverTranscript Search
    Parameters809M
    LicenseMIT
    Downloads/mo4.3M

    Research Paper

    Whisper Large V3 Turbo

    arxiv.org

    Build a pipeline with whisper-large-v3-turbo

    Add this model to a processing pipeline alongside other extractors. Combine with retrieval stages for end-to-end search.

    Open Studio