
    whisper-large-v3

    by openai

    Robust speech recognition trained on 680K hours of multilingual audio

    4.7M downloads/month
    5,555 likes
    1.5B params
    Identifiers
    Model ID
    openai/whisper-large-v3
    Feature URI
    mixpeek://transcription@v1/openai_whisper_large_v3

    Overview

    Whisper is a general-purpose speech recognition model trained on a massive dataset of diverse audio. It supports multilingual transcription, translation, and language identification. The large-v3 variant achieves near-human accuracy on many benchmarks.

    On Mixpeek, Whisper powers audio transcription for video and audio content, generating timestamped text that enables full-text search across spoken content.

    Architecture

    An encoder-decoder Transformer with 32 encoder layers and 32 decoder layers. Audio is processed in 30-second segments represented as 128-channel log-mel spectrograms (large-v3 increased the mel bins from the 80 used by earlier variants). Training uses a multi-task format in which special tokens mark timestamps, language, and task type (transcribe vs. translate).
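As a concrete check on the segment size: with Whisper's standard preprocessing assumptions (16 kHz input and a 10 ms hop between frames, neither stated above), each fixed 30-second segment yields a 3,000-frame spectrogram for the encoder.

```typescript
// Whisper's standard preprocessing (assumed here: 16 kHz audio, 10 ms hop)
// turns a fixed 30-second segment into a mel spectrogram of shape
// [n_mels, 3000] that the encoder consumes.
const SAMPLE_RATE = 16_000; // Hz (assumed standard Whisper input rate)
const HOP_LENGTH = 160;     // samples between frames (10 ms at 16 kHz)
const CHUNK_SECONDS = 30;   // fixed segment length from the text above

const melFrames = (CHUNK_SECONDS * SAMPLE_RATE) / HOP_LENGTH;
console.log(melFrames); // 3000 frames per 30 s chunk
```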

    Mixpeek SDK Integration

    import { Mixpeek } from "mixpeek";

    const mx = new Mixpeek({ apiKey: "API_KEY" });

    await mx.collections.ingest({
      collection_id: "my-collection",
      source: { url: "https://example.com/video.mp4" },
      feature_extractors: [{
        name: "audio_transcription",
        version: "v1",
        params: {
          model_id: "openai/whisper-large-v3"
        }
      }]
    });

    Capabilities

    • 99+ language transcription and translation
    • Word-level timestamps
    • Robust to background noise, accents, and domain-specific vocabulary
    • Automatic language detection

    Use Cases on Mixpeek

    • Transcribe video libraries for full-text search
    • Generate subtitles and closed captions at scale
    • Call center analytics: search call recordings by content
    • Podcast and webinar content indexing
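The subtitle use case reduces to converting timestamped transcript segments into a caption format. A minimal sketch, assuming Whisper-style { start, end, text } segments in seconds — the exact field names Mixpeek's extractor returns may differ:

```typescript
// Hypothetical transcript segments: { start, end, text } in seconds.
interface Segment { start: number; end: number; text: string; }

// Format seconds as an SRT timestamp: HH:MM:SS,mmm
function srtTime(s: number): string {
  const ms = Math.round(s * 1000);
  const pad = (n: number, w = 2) => String(n).padStart(w, "0");
  const h = Math.floor(ms / 3_600_000);
  const m = Math.floor((ms % 3_600_000) / 60_000);
  const sec = Math.floor((ms % 60_000) / 1000);
  return `${pad(h)}:${pad(m)}:${pad(sec)},${pad(ms % 1000, 3)}`;
}

// Emit numbered SRT cues from a list of segments.
function toSrt(segments: Segment[]): string {
  return segments
    .map((seg, i) =>
      `${i + 1}\n${srtTime(seg.start)} --> ${srtTime(seg.end)}\n${seg.text.trim()}\n`)
    .join("\n");
}

const demo: Segment[] = [
  { start: 0.0, end: 2.4, text: "Welcome to the webinar." },
  { start: 2.4, end: 5.1, text: "Today we cover speech search." },
];
console.log(toSrt(demo));
```

The same segment records drive the full-text-search use case: index each segment's text with its timestamps so a query can jump straight to the matching moment in the recording.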

    Benchmarks

    Dataset                    Metric    Score   Source
    Fleurs (62 langs)          Avg WER   10.4%   Radford et al., 2023 — Table 1
    LibriSpeech (test-clean)   WER       2.0%    Radford et al., 2023 — Table 2
    Common Voice 15            Avg WER   11.7%   Whisper model card

    Performance

    Input Size       30 s audio chunks
    GPU Latency      ~320 ms / 30 s chunk (A100)
    CPU Latency      ~4.2 s / 30 s chunk
    GPU Throughput   ~5.6× realtime (A100)
    GPU Memory       ~3.1 GB

    1.55B params, supports 99 languages
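The throughput figure above translates directly into batch-processing estimates. A back-of-envelope calculation at ~5.6× realtime on a single A100:

```typescript
// Estimate wall-clock GPU time to transcribe a batch of audio,
// using the ~5.6x realtime throughput figure from the table above.
const realtimeFactor = 5.6;  // audio seconds processed per wall-clock second
const audioMinutes = 60;     // one hour of source audio

const gpuMinutes = audioMinutes / realtimeFactor;
console.log(gpuMinutes.toFixed(1)); // "10.7" minutes per hour of audio
```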

    Specification

    Framework        HF
    Organization     openai
    Feature          Transcription
    Output           text + timestamps
    Modalities       video, audio
    Retriever        Transcript Search
    Parameters       1.5B
    License          apache-2.0
    Downloads/mo     4.7M
    Likes            5,555

    Research Paper

    Robust Speech Recognition via Large-Scale Weak Supervision

    arxiv.org

    Build a pipeline with whisper-large-v3

    Add this model to a processing pipeline alongside other extractors. Combine with retrieval stages for end-to-end search.

    Open Pipeline Builder