granite-speech-4.1-2b

by ibm-granite

Compact 2B multilingual ASR and speech translation model with a Conformer encoder and 5.33 mean WER

185K downloads/month

2B parameters

Identifiers

Model ID

ibm-granite/granite-speech-4.1-2b

Feature URI

mixpeek://transcription@v1/ibm_granite_speech_41_2b_v1
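The Feature URI above appears to pack a feature name, a version, and a model slug into one string. The sketch below splits it on that reading. The `feature@version/slug` layout is inferred from this single example rather than taken from a documented grammar, and `parseFeatureUri` and `FeatureUriParts` are hypothetical names, not part of the Mixpeek SDK.

```typescript
// Hypothetical helper: the URI layout is inferred from one example, not a spec.
interface FeatureUriParts {
  feature: string;   // e.g. "transcription"
  version: string;   // e.g. "v1"
  modelSlug: string; // e.g. "ibm_granite_speech_41_2b_v1"
}

function parseFeatureUri(uri: string): FeatureUriParts {
  // Assumed shape: mixpeek://<feature>@<version>/<model slug>
  const match = /^mixpeek:\/\/([^@\/]+)@([^\/]+)\/(.+)$/.exec(uri);
  if (match === null) {
    throw new Error(`Not a recognised feature URI: ${uri}`);
  }
  return { feature: match[1], version: match[2], modelSlug: match[3] };
}

const parts = parseFeatureUri(
  "mixpeek://transcription@v1/ibm_granite_speech_41_2b_v1"
);
console.log(parts.feature, parts.version, parts.modelSlug);
```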

Overview

Granite Speech 4.1 2B is IBM's compact speech-language model for multilingual automatic speech recognition (ASR) and bidirectional automatic speech translation (AST) across English, French, German, Spanish, Portuguese, and Japanese. It pairs a 16-layer Conformer encoder, trained with dual-head CTC over character and BPE units, with a 2-layer window Q-Former that downsamples the acoustic embeddings by 10x, so the language model receives embeddings at a rate of 10 Hz.

Trained on 174,000 hours of public audio corpora plus synthetic datasets for Japanese ASR and keyword-biased recognition, the model achieves a mean WER of 5.33 on the Open ASR Leaderboard. On Mixpeek, it powers multilingual audio transcription for video and podcast content, enabling full-text search across spoken content in six languages.

Architecture

16-layer Conformer encoder with dual-head CTC (character + BPE). 2-layer window Q-Former downsamples acoustic embeddings by 10x to 10 Hz. Trained on 174K hours of audio. Encoder training: 26 days on 8x H100; projector fine-tuning: 4 days.
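To make the 10x downsampling figure concrete, the sketch below estimates how many acoustic embeddings reach the language model for a clip of a given length. It uses only the two numbers stated above (10x and 10 Hz). Rounding a partial window up is an assumption about padding, and `embeddingCount` is an illustrative helper, not something the model or the SDK exposes.

```typescript
// Figures taken from the architecture summary above.
const EMBEDDING_RATE_HZ = 10; // embeddings per second seen by the language model
const DOWNSAMPLE_FACTOR = 10; // Q-Former downsampling factor

// Rate before downsampling, if both stated figures are taken at face value.
const IMPLIED_INPUT_RATE_HZ = EMBEDDING_RATE_HZ * DOWNSAMPLE_FACTOR;

// Illustrative helper: approximate number of embeddings for a clip.
// Assumption: a trailing partial window is padded and still yields one embedding.
function embeddingCount(
  durationSeconds: number,
  rateHz: number = EMBEDDING_RATE_HZ
): number {
  if (!Number.isFinite(durationSeconds) || durationSeconds < 0) {
    throw new RangeError("durationSeconds must be a non-negative finite number");
  }
  return Math.ceil(durationSeconds * rateHz);
}

console.log(embeddingCount(30));   // a 30-second clip at 10 Hz
console.log(embeddingCount(3600)); // a one-hour recording at 10 Hz
console.log(IMPLIED_INPUT_RATE_HZ);
```

At this rate, sequence length on the language-model side grows by 600 embeddings per minute of audio, which is the main reason the downsampling step matters for long recordings.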

Mixpeek SDK Integration

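// Import the Mixpeek client from the SDK package.
import { Mixpeek } from "mixpeek";

// Replace "API_KEY" with your own Mixpeek API key.
const mx = new Mixpeek({ apiKey: "API_KEY" });

// Ingest a video and transcribe its audio with Granite Speech 4.1 2B.
// Top-level await requires running this file as an ES module.
await mx.collections.ingest({
  collection_id: "multilingual-media",
  source: { url: "https://example.com/conference-talk.mp4" },
  feature_extractors: [{
    feature: "transcription",
    model: "ibm-granite/granite-speech-4.1-2b"
  }]
});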

Capabilities

6-language ASR: English, French, German, Spanish, Portuguese, Japanese
Bidirectional automatic speech translation
Mean WER 5.33 on Open ASR Leaderboard
Keyword-biased ASR for domain-specific terminology
Compact 2B parameters for cost-efficient deployment

Use Cases on Mixpeek

Multilingual video transcription: index spoken content in six languages for search

Podcast and webinar processing: generate searchable transcripts at scale

Speech translation pipelines: transcribe and translate audio content across language pairs

Benchmarks

Dataset	Metric	Score	Source
Open ASR Leaderboard	Mean WER	5.33	IBM, April 2026 — Model Card
LibriSpeech (clean)	WER	1.33%	IBM, April 2026 — Model Card
LibriSpeech (other)	WER	2.50%	IBM, April 2026 — Model Card
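
For readers unfamiliar with the metric, word error rate is the word-level edit distance between a reference transcript and the model's output, divided by the number of reference words. The sketch below is a minimal illustration of that definition, not the scoring code behind the figures in the table. Benchmark pipelines usually normalize text (casing, punctuation) before scoring, which this example does not do, and `wordErrorRate` is a name chosen here for illustration.

```typescript
// WER = (substitutions + deletions + insertions) / number of reference words.
function wordErrorRate(reference: string, hypothesis: string): number {
  const ref = reference.trim().split(/\s+/).filter(Boolean);
  const hyp = hypothesis.trim().split(/\s+/).filter(Boolean);
  if (ref.length === 0) {
    throw new Error("reference must contain at least one word");
  }

  // prev[j] holds the edit distance between the reference words processed
  // so far and the first j hypothesis words (Levenshtein, one row at a time).
  let prev: number[] = Array.from({ length: hyp.length + 1 }, (_, j) => j);
  for (let i = 1; i <= ref.length; i++) {
    const curr: number[] = [i];
    for (let j = 1; j <= hyp.length; j++) {
      const substitution = prev[j - 1] + (ref[i - 1] === hyp[j - 1] ? 0 : 1);
      const deletion = prev[j] + 1;      // reference word missing from output
      const insertion = curr[j - 1] + 1; // extra word in output
      curr[j] = Math.min(substitution, deletion, insertion);
    }
    prev = curr;
  }
  return prev[hyp.length] / ref.length;
}

// One substituted word out of four reference words.
const wer = wordErrorRate("a b c d", "a x c d");
console.log(`${(wer * 100).toFixed(2)}%`); // prints "25.00%"
```

The function returns a fraction; multiplying by 100 gives the percentage form used in the table.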