granite-speech-4.1-2b
by ibm-granite
Compact 2B multilingual ASR and speech translation with Conformer encoder and 5.33 mean WER
ibm-granite/granite-speech-4.1-2b
mixpeek://transcription@v1/ibm_granite_speech_41_2b_v1
Overview
Granite Speech 4.1 2B is IBM's compact speech-language model designed for multilingual automatic speech recognition (ASR) and bidirectional automatic speech translation (AST) across English, French, German, Spanish, Portuguese, and Japanese. It combines a 16-layer Conformer encoder trained with dual-head CTC for character and BPE units with a 2-layer window Q-Former that downsamples acoustic embeddings by 10x, producing a 10Hz embedding rate for the language model.
Trained on 174,000 hours of public audio corpora plus synthetic datasets for Japanese ASR and keyword-biased recognition, the model achieves a mean WER of 5.33 on the Open ASR Leaderboard. On Mixpeek, it powers multilingual audio transcription for video and podcast content, enabling full-text search across spoken content in six languages.
Architecture
16-layer Conformer encoder with dual-head CTC (character + BPE). 2-layer window Q-Former downsamples acoustic embeddings by 10x to 10Hz. Trained on 174K hours of audio. Encoder training: 26 days on 8x H100; projector fine-tuning: 4 days.
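As a rough illustration of the downsampling arithmetic above: a 10x reduction that yields 10 Hz implies an acoustic embedding rate of about 100 Hz going into the Q-Former. The sketch below takes those stated figures at face value. The function name and the rounding behaviour are assumptions made for illustration and are not taken from the model card.

```typescript
// Figures stated in the architecture summary.
const DOWNSAMPLE_FACTOR = 10; // Q-Former downsampling factor
const EMBEDDING_RATE_HZ = 10; // embeddings per second passed to the language model

// Implied by the two figures above (an inference, not a published spec).
const ACOUSTIC_RATE_HZ = EMBEDDING_RATE_HZ * DOWNSAMPLE_FACTOR; // 100

/**
 * Estimate how many embeddings reach the language model for a clip.
 * Rounding (floor for frames, ceil for windows) is an assumption.
 */
function embeddingsForAudio(durationSeconds: number): number {
  if (!Number.isFinite(durationSeconds) || durationSeconds < 0) {
    throw new RangeError("durationSeconds must be a non-negative finite number");
  }
  const acousticFrames = Math.floor(durationSeconds * ACOUSTIC_RATE_HZ);
  return Math.ceil(acousticFrames / DOWNSAMPLE_FACTOR);
}
```

Under these assumptions a 30-second clip yields about 300 embeddings, which gives a sense of how short the audio prefix seen by the language model is.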
Mixpeek SDK Integration
import { Mixpeek } from "mixpeek";

const mx = new Mixpeek({ apiKey: "API_KEY" });

await mx.collections.ingest({
  collection_id: "multilingual-media",
  source: { url: "https://example.com/conference-talk.mp4" },
  feature_extractors: [
    {
      feature: "transcription",
      model: "ibm-granite/granite-speech-4.1-2b"
    }
  ]
});
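When several media files need the same transcription settings, it can help to build the request objects separately from the SDK call. The sketch below is one possible way to do that. The helper names (`buildTranscriptionRequest`, `buildTranscriptionBatch`) and the interfaces are hypothetical and not part of the Mixpeek SDK; only the field names are copied from the snippet above.

```typescript
// Shape mirrored from the ingest call above; these types are illustrative,
// not exported by the SDK.
interface FeatureExtractor {
  feature: string;
  model: string;
}

interface IngestRequest {
  collection_id: string;
  source: { url: string };
  feature_extractors: FeatureExtractor[];
}

const GRANITE_SPEECH_MODEL = "ibm-granite/granite-speech-4.1-2b";

// Hypothetical helper: builds one request body for a single media URL.
function buildTranscriptionRequest(collectionId: string, mediaUrl: string): IngestRequest {
  if (!/^https?:\/\//.test(mediaUrl)) {
    throw new TypeError(`expected an http(s) URL, got: ${mediaUrl}`);
  }
  return {
    collection_id: collectionId,
    source: { url: mediaUrl },
    feature_extractors: [{ feature: "transcription", model: GRANITE_SPEECH_MODEL }],
  };
}

// Hypothetical helper: one request per URL, all targeting the same collection.
function buildTranscriptionBatch(collectionId: string, mediaUrls: string[]): IngestRequest[] {
  return mediaUrls.map((url) => buildTranscriptionRequest(collectionId, url));
}
```

Each resulting object could then be passed to `mx.collections.ingest(...)` in a loop, assuming the call accepts the same shape shown in the snippet above.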
Capabilities
- 6-language ASR: English, French, German, Spanish, Portuguese, Japanese
- Bidirectional automatic speech translation
- Mean WER 5.33 on Open ASR Leaderboard
- Keyword-biased ASR for domain-specific terminology
- Compact 2B parameters for cost-efficient deployment
Benchmarks
| Dataset | Metric | Score (%) | Source |
|---|---|---|---|
| Open ASR Leaderboard | Mean WER | 5.33 | IBM model card, April 2026 |
| LibriSpeech (clean) | WER | 1.33 | IBM model card, April 2026 |
| LibriSpeech (other) | WER | 2.50 | IBM model card, April 2026 |
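For readers unfamiliar with the metric, word error rate is the number of word substitutions, deletions, and insertions needed to turn the hypothesis into the reference, divided by the number of reference words. The sketch below is a minimal version using word-level edit distance. The tokenization (lowercasing and whitespace splitting) is a simplifying assumption and is not the text normalization used to produce the scores in the table.

```typescript
// Simplified normalization: lowercase and split on whitespace (an assumption).
function tokenize(text: string): string[] {
  return text.toLowerCase().trim().split(/\s+/).filter((w) => w.length > 0);
}

/** WER = (substitutions + deletions + insertions) / reference word count. */
function wordErrorRate(reference: string, hypothesis: string): number {
  const ref = tokenize(reference);
  const hyp = tokenize(hypothesis);
  if (ref.length === 0) {
    throw new RangeError("reference must contain at least one word");
  }

  // prev[j] holds the edit distance between the reference prefix processed
  // so far and the first j hypothesis words.
  let prev: number[] = Array.from({ length: hyp.length + 1 }, (_, j) => j);

  for (let i = 1; i <= ref.length; i++) {
    const curr: number[] = [i]; // i deletions against an empty hypothesis
    for (let j = 1; j <= hyp.length; j++) {
      const substitutionCost = ref[i - 1] === hyp[j - 1] ? 0 : 1;
      curr.push(
        Math.min(
          prev[j] + 1, // deletion of a reference word
          curr[j - 1] + 1, // insertion of a hypothesis word
          prev[j - 1] + substitutionCost // match or substitution
        )
      );
    }
    prev = curr;
  }
  return prev[hyp.length] / ref.length;
}
```

The function returns a fraction; multiply by 100 to express it as a percentage. For example, comparing the reference "the cat sat on the mat" with the hypothesis "the cat sit on mat" counts one substitution and one deletion over six reference words, a WER of about 33%.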
Research Paper
Granite Speech 4.1
arxiv.org
Build a pipeline with granite-speech-4.1-2b
Add this model to a processing pipeline alongside other extractors. Combine with retrieval stages for end-to-end search.