whisper-large-v3
by openai
Robust speech recognition trained on 680K hours of multilingual audio
openai/whisper-large-v3
mixpeek://transcription@v1/openai_whisper_large_v3
Overview
Whisper is a general-purpose speech recognition model trained on a massive dataset of diverse audio. It supports multilingual transcription, translation, and language identification. The large-v3 variant achieves near-human accuracy on many benchmarks.
On Mixpeek, Whisper powers audio transcription for video and audio content, generating timestamped text that enables full-text search across spoken content.
Architecture
An encoder-decoder Transformer with 32 encoder layers and 32 decoder layers. Audio is processed in 30-second segments, each converted to an 80-channel log-mel spectrogram. Training uses a multi-task format in which special tokens specify the language, the task (transcribe or translate), and timestamps.
Mixpeek SDK Integration
import { Mixpeek } from "mixpeek";

const mx = new Mixpeek({ apiKey: "API_KEY" });

await mx.collections.ingest({
  collection_id: "my-collection",
  source: { url: "https://example.com/video.mp4" },
  feature_extractors: [{
    name: "audio_transcription",
    version: "v1",
    params: {
      model_id: "openai/whisper-large-v3"
    }
  }]
});
Capabilities
- Transcription in 99+ languages and translation into English
- Word-level timestamps
- Robust to background noise, accents, and domain-specific vocabulary
- Automatic language detection
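Word-level timestamps are what make transcripts searchable by position in the media. The sketch below shows one plausible shape for a timestamped result and a simple lookup over it; the field names are illustrative, not the exact Mixpeek response schema:

```typescript
// Hypothetical shape of a timestamped transcription result.
// Field names are illustrative, not Mixpeek's actual schema.
interface TranscriptWord {
  word: string;
  start: number; // seconds from the beginning of the media
  end: number;
}

interface TranscriptSegment {
  text: string;
  start: number;
  end: number;
  language: string;        // detected ISO 639-1 code, e.g. "en"
  words: TranscriptWord[]; // word-level timestamps
}

// Example: return the start time of every segment mentioning a term,
// i.e. the seek offsets for a "jump to mention" feature.
function findMentions(segments: TranscriptSegment[], term: string): number[] {
  const needle = term.toLowerCase();
  return segments
    .filter((s) => s.text.toLowerCase().includes(needle))
    .map((s) => s.start);
}
```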
Research Paper
Robust Speech Recognition via Large-Scale Weak Supervision
arxiv.org
Build a pipeline with whisper-large-v3
Add this model to a processing pipeline alongside other extractors. Combine with retrieval stages for end-to-end search.