
    wav2vec2-large-960h

    by facebook

    Self-supervised speech representations for automatic speech recognition

    37K downloads/month · 35 likes · 317M params
    Identifiers

    Model ID: facebook/wav2vec2-large-960h
    Feature URI: mixpeek://transcription@v1/facebook_wav2vec2_large_v1

    Overview

    Wav2Vec 2.0 learns speech representations from raw audio through self-supervised pre-training, then fine-tunes with a small amount of labeled data. The 960h variant is fine-tuned on the full LibriSpeech dataset.

    On Mixpeek, Wav2Vec2 provides an alternative to Whisper for English transcription, with strong performance on clear speech and a smaller memory footprint.

    Architecture

    CNN feature encoder (7 convolutional layers) followed by a 24-layer Transformer. Self-supervised pre-training uses contrastive loss over quantized speech representations. Fine-tuned with CTC loss.
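The CTC output step mentioned above can be sketched in a few lines: greedy decoding takes the argmax token per frame, collapses consecutive repeats, and drops blanks. The frame sequences below are toy values, but wav2vec2's vocabulary really does use `<pad>` as the CTC blank and `|` as the word delimiter.

```typescript
// Greedy CTC decoding: collapse repeated frame tokens, drop blanks.
// wav2vec2 uses "<pad>" as the CTC blank and "|" as the word delimiter.
const BLANK = "<pad>";

function greedyCtcDecode(frameTokens: string[]): string {
  const out: string[] = [];
  let prev: string | null = null;
  for (const tok of frameTokens) {
    // Emit a token only when it differs from the previous frame
    // and is not the blank symbol.
    if (tok !== prev && tok !== BLANK) out.push(tok);
    prev = tok;
  }
  return out.join("").replace(/\|/g, " ");
}

// Repeats model duration; a blank between identical tokens
// separates genuinely repeated characters.
console.log(greedyCtcDecode(["C", "C", "<pad>", "A", "T", "T"])); // "CAT"
console.log(greedyCtcDecode(["H", "H", "<pad>", "H", "I"]));      // "HHI"
```

In practice the frame tokens come from the argmax over the model's per-frame logits; beam-search decoders with a language model replace this greedy step when lower WER is needed.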

    Mixpeek SDK Integration

    import { Mixpeek } from "mixpeek";

    const mx = new Mixpeek({ apiKey: "API_KEY" });

    await mx.collections.ingest({
      collection_id: "my-collection",
      source: { url: "https://example.com/podcast.mp3" },
      feature_extractors: [{
        name: "audio_transcription",
        version: "v1",
        params: {
          model_id: "facebook/wav2vec2-large-960h"
        }
      }]
    });

    Capabilities

    • Self-supervised pre-training on unlabeled audio
    • Strong English ASR performance
    • Raw waveform input (no spectrogram needed)
    • Efficient fine-tuning with limited labeled data
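Because the model consumes raw waveforms, the only preprocessing its processor applies is zero-mean, unit-variance normalization of the 16 kHz samples. A minimal sketch over a plain number array (real pipelines operate on typed arrays and add a small epsilon to the denominator):

```typescript
// Normalize a raw waveform to zero mean and unit variance,
// the preprocessing wav2vec2 expects instead of a spectrogram.
function normalizeWaveform(samples: number[]): number[] {
  const mean = samples.reduce((a, b) => a + b, 0) / samples.length;
  const variance =
    samples.reduce((a, b) => a + (b - mean) ** 2, 0) / samples.length;
  const std = Math.sqrt(variance) || 1; // guard against pure silence
  return samples.map((s) => (s - mean) / std);
}

console.log(normalizeWaveform([1, 3])); // → [-1, 1]
```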

    Use Cases on Mixpeek

    • English-focused transcription workflows
    • Low-resource language adaptation with limited training data
    • Audio content indexing for search and discovery

    Benchmarks

    Dataset                  | Metric | Score | Source
    LibriSpeech (test-clean) | WER    | 2.7%  | Baevski et al., 2020 — Table 5
    LibriSpeech (test-other) | WER    | 5.2%  | Baevski et al., 2020 — Table 5

    Performance

    Input Size: variable audio length
    GPU Latency: ~180 ms / 30 s chunk (A100)
    CPU Latency: ~2.8 s / 30 s chunk
    GPU Throughput: ~10× realtime (A100)
    GPU Memory: ~1.3 GB
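The per-chunk latencies above translate directly into a back-of-envelope cost estimate for a whole file, assuming sequential 30 s chunks (the helper below is illustrative, not part of the SDK):

```typescript
// Estimate total transcription latency from the per-chunk figures
// above: ~180 ms/chunk on an A100 GPU, ~2.8 s/chunk on CPU.
const CHUNK_SECONDS = 30;

function estimateLatencyMs(audioSeconds: number, perChunkMs: number): number {
  // Audio is processed in 30 s chunks; partial chunks still cost a full pass.
  return Math.ceil(audioSeconds / CHUNK_SECONDS) * perChunkMs;
}

// A one-hour podcast is 120 chunks:
console.log(estimateLatencyMs(3600, 180));  // 21600  (~22 s on GPU)
console.log(estimateLatencyMs(3600, 2800)); // 336000 (~5.6 min on CPU)
```

Batched GPU inference can do better than this sequential estimate, which is why the throughput row reports ~10× realtime rather than the ~167× implied by raw chunk latency.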

    Specification

    Framework: HF
    Organization: facebook
    Feature: Transcription
    Output: text + timestamps
    Modalities: video, audio
    Retriever: Transcript Search
    Parameters: 317M
    License: apache-2.0
    Downloads/mo: 37K
    Likes: 35

    Research Paper

    wav2vec 2.0: A Framework for Self-Supervised Learning of Speech Representations

    arxiv.org

    Build a pipeline with wav2vec2-large-960h

    Add this model to a processing pipeline alongside other extractors. Combine with retrieval stages for end-to-end search.
