wav2vec2-large-960h
by facebook
Self-supervised speech representations for automatic speech recognition
facebook/wav2vec2-large-960h
mixpeek://transcription@v1/facebook_wav2vec2_large_v1
Overview
Wav2Vec 2.0 learns speech representations from raw audio through self-supervised pre-training, then fine-tunes with a small amount of labeled data. The 960h variant is fine-tuned on the full 960 hours of LibriSpeech.
On Mixpeek, Wav2Vec2 provides an alternative to Whisper for English transcription, with strong performance on clear speech and a smaller memory footprint.
Architecture
A 7-layer convolutional feature encoder followed by a 24-layer Transformer. Self-supervised pre-training uses a contrastive loss over quantized latent speech representations; fine-tuning for ASR uses a CTC loss.
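Because fine-tuning uses CTC, the model emits one label per frame, including a blank token, and repeated labels must be collapsed at decode time. A minimal greedy-decoding sketch of that step (the blank index and character labels here are illustrative placeholders, not the model's actual vocabulary):

```typescript
// Greedy CTC decoding: take the per-frame label IDs, collapse
// consecutive repeats, then drop blanks. The blank index and label
// set are illustrative, not the real wav2vec2 vocabulary.
const BLANK = 0;

function ctcGreedyDecode(frameLabels: number[], labels: string[]): string {
  const out: string[] = [];
  let prev = -1;
  for (const id of frameLabels) {
    // Emit only when the label changes and is not the blank token.
    if (id !== prev && id !== BLANK) {
      out.push(labels[id]);
    }
    prev = id;
  }
  return out.join("");
}

// Frames: H, H, blank, H, I, I  →  "HHI"
// (the blank separates the two H's, so both survive collapsing)
console.log(ctcGreedyDecode([1, 1, 0, 1, 2, 2], ["", "H", "I"]));
```

In practice the frame labels would come from an argmax over the model's per-frame logits; beam-search decoders improve on this greedy pass but follow the same collapse-and-drop-blank rule.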
Mixpeek SDK Integration
import { Mixpeek } from "mixpeek";
const mx = new Mixpeek({ apiKey: "API_KEY" });
await mx.collections.ingest({
  collection_id: "my-collection",
  source: { url: "https://example.com/podcast.mp3" },
  feature_extractors: [{
    name: "audio_transcription",
    version: "v1",
    params: {
      model_id: "facebook/wav2vec2-large-960h"
    }
  }]
});
Capabilities
- Self-supervised pre-training on unlabeled audio
- Strong English ASR performance
- Raw waveform input (no spectrogram needed)
- Efficient fine-tuning with limited labeled data
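Since the model consumes raw waveforms rather than spectrograms, the only client-side preprocessing typically needed is resampling to 16 kHz mono and zero-mean, unit-variance normalization. A sketch of that normalization step (the function name is ours, not part of any SDK):

```typescript
// Normalize a raw waveform to zero mean and unit variance, the
// standard preprocessing for wav2vec2-style models. Assumes the
// audio is already mono and resampled to 16 kHz.
function normalizeWaveform(samples: Float32Array): Float32Array {
  const n = samples.length;
  let mean = 0;
  for (const s of samples) mean += s;
  mean /= n;
  let variance = 0;
  for (const s of samples) variance += (s - mean) ** 2;
  // Guard against divide-by-zero on pure silence.
  const std = Math.sqrt(variance / n) || 1;
  return Float32Array.from(samples, (s) => (s - mean) / std);
}
```

When ingesting through Mixpeek this preprocessing happens server-side; the sketch only illustrates what "raw waveform input" implies for the model.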
Research Paper
wav2vec 2.0: A Framework for Self-Supervised Learning of Speech Representations
arxiv.org
Build a pipeline with wav2vec2-large-960h
Add this model to a processing pipeline alongside other extractors. Combine with retrieval stages for end-to-end search.