    HF · Transcription · Apache-2.0

    granite-speech-4.1-2b

    by ibm-granite

    Compact 2B multilingual ASR and speech translation with Conformer encoder and 5.33 mean WER

    185K downloads/month
    2B parameters
    Identifiers
    Model ID
    ibm-granite/granite-speech-4.1-2b
    Feature URI
    mixpeek://transcription@v1/ibm_granite_speech_41_2b_v1
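
The Feature URI above appears to pack three pieces of information into one string. The sketch below splits it apart, assuming the layout is `mixpeek://<feature>@<version>/<model key>`. That layout is inferred from this single example and is not a documented format, and `parseFeatureUri` is a hypothetical helper, not part of the Mixpeek SDK.

```typescript
// Hypothetical shape for the parts of a Feature URI (names are illustrative).
interface FeatureUri {
  feature: string;
  version: string;
  modelKey: string;
}

// Assumed layout: mixpeek://<feature>@<version>/<model key>.
// This is inferred from the one URI shown on this page, not from a spec.
function parseFeatureUri(uri: string): FeatureUri {
  const match = /^mixpeek:\/\/([^@\/]+)@([^\/]+)\/(.+)$/.exec(uri);
  if (match === null) {
    throw new Error(`not a recognized feature URI: ${uri}`);
  }
  const [, feature, version, modelKey] = match;
  return { feature, version, modelKey };
}

const parsed = parseFeatureUri(
  "mixpeek://transcription@v1/ibm_granite_speech_41_2b_v1"
);
console.log(parsed.feature, parsed.version, parsed.modelKey);
```

A regular expression is used here rather than the built-in `URL` class because `URL` would treat the part before `@` as a username, which may not be the intended meaning.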

    Overview

    Granite Speech 4.1 2B is IBM's compact speech-language model for multilingual automatic speech recognition (ASR) and bidirectional automatic speech translation (AST) across English, French, German, Spanish, Portuguese, and Japanese. It pairs a 16-layer Conformer encoder, trained with dual-head CTC over character and BPE units, with a 2-layer window Q-Former that downsamples the acoustic embeddings by 10x, so the language model receives embeddings at 10 Hz.
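
To make the downsampling figures concrete, the following sketch estimates how many embeddings reach the language model for a clip of a given length. The 10x factor and the 10 Hz output rate come from the description above; the 100 Hz encoder frame rate is only what those two numbers imply and is not stated in the source. Padding and window-boundary effects are ignored, and `sequenceLengths` is an illustrative name.

```typescript
// Stated in the overview: the Q-Former emits 10 embeddings per second.
const LLM_EMBEDDING_RATE_HZ = 10;
// Stated in the overview: the Q-Former downsamples by 10x.
const QFORMER_DOWNSAMPLE = 10;
// Assumption: implied by the two figures above (10 Hz x 10), not stated directly.
const ENCODER_FRAME_RATE_HZ = LLM_EMBEDDING_RATE_HZ * QFORMER_DOWNSAMPLE;

// Rough sequence lengths for a clip; ignores padding and windowing details.
function sequenceLengths(durationSeconds: number): {
  encoderFrames: number;
  llmEmbeddings: number;
} {
  if (durationSeconds < 0) {
    throw new RangeError("duration must be non-negative");
  }
  const encoderFrames = Math.floor(durationSeconds * ENCODER_FRAME_RATE_HZ);
  const llmEmbeddings = Math.floor(encoderFrames / QFORMER_DOWNSAMPLE);
  return { encoderFrames, llmEmbeddings };
}

// A 30-second clip: 3000 encoder frames, 300 embeddings for the language model.
console.log(sequenceLengths(30));
```

Under these assumptions, each minute of audio costs the language model about 600 input positions, which is one reason a short embedding rate helps keep a 2B model inexpensive to run.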

    Trained on 174,000 hours of public audio corpora plus synthetic datasets for Japanese ASR and keyword-biased recognition, the model achieves a mean WER of 5.33 on the Open ASR Leaderboard. On Mixpeek, it powers multilingual audio transcription for video and podcast content, enabling full-text search across spoken content in six languages.

    Architecture

    16-layer Conformer encoder with dual-head CTC (character + BPE). 2-layer window Q-Former downsamples acoustic embeddings by 10x to 10Hz. Trained on 174K hours of audio. Encoder training: 26 days on 8x H100; projector fine-tuning: 4 days.

    Mixpeek SDK Integration

    import { Mixpeek } from "mixpeek";

    const mx = new Mixpeek({ apiKey: "API_KEY" });

    // Ingest one media file and transcribe it with Granite Speech 4.1 2B.
    await mx.collections.ingest({
      collection_id: "multilingual-media",
      source: { url: "https://example.com/conference-talk.mp4" },
      feature_extractors: [{
        feature: "transcription",
        model: "ibm-granite/granite-speech-4.1-2b"
      }]
    });
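
When several files need the same extractor, it can help to build the request objects separately from the network call. The sketch below does only that: it produces objects with the same fields as the ingest call above, one per URL. `IngestRequest` and `buildTranscriptionRequests` are hypothetical names introduced for illustration and are not part of the Mixpeek SDK; the field names are copied from the snippet above.

```typescript
// Hypothetical type mirroring the fields used in the ingest call on this page.
interface IngestRequest {
  collection_id: string;
  source: { url: string };
  feature_extractors: { feature: string; model: string }[];
}

const GRANITE_MODEL_ID = "ibm-granite/granite-speech-4.1-2b";

// Build one request object per media URL; no network calls are made here.
function buildTranscriptionRequests(
  collectionId: string,
  urls: string[]
): IngestRequest[] {
  return urls.map((url) => ({
    collection_id: collectionId,
    source: { url },
    feature_extractors: [
      { feature: "transcription", model: GRANITE_MODEL_ID },
    ],
  }));
}

const requests = buildTranscriptionRequests("multilingual-media", [
  "https://example.com/conference-talk.mp4",
  "https://example.com/panel-discussion.mp4",
]);
console.log(requests.length);
```

Each resulting object could then be passed to `mx.collections.ingest` one at a time, assuming the call accepts the same shape shown in the integration snippet.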

    Capabilities

    • 6-language ASR: English, French, German, Spanish, Portuguese, Japanese
    • Bidirectional automatic speech translation
    • Mean WER 5.33 on Open ASR Leaderboard
    • Keyword-biased ASR for domain-specific terminology
    • Compact 2B parameters for cost-efficient deployment

    Use Cases on Mixpeek

    • Multilingual video transcription: index spoken content in six languages for search
    • Podcast and webinar processing: generate searchable transcripts at scale
    • Speech translation pipelines: transcribe and translate audio content across language pairs

    Benchmarks

    Dataset               Metric    Score  Source
    Open ASR Leaderboard  Mean WER  5.33   IBM model card, April 2026
    LibriSpeech (clean)   WER       1.33%  IBM model card, April 2026
    LibriSpeech (other)   WER       2.50%  IBM model card, April 2026
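
For readers unfamiliar with the metric, word error rate (WER) is the word-level edit distance between a reference transcript and the model output, divided by the number of reference words. The sketch below is a minimal illustration of that definition. It is not the scoring code behind the figures above, and it skips the text normalization (casing, punctuation, number formats) that benchmark harnesses usually apply first.

```typescript
// Word error rate: (substitutions + deletions + insertions) / reference words.
// Minimal sketch: whitespace tokenization only, no text normalization.
function wordErrorRate(reference: string, hypothesis: string): number {
  const ref = reference.trim().split(/\s+/).filter(Boolean);
  const hyp = hypothesis.trim().split(/\s+/).filter(Boolean);
  if (ref.length === 0) {
    throw new RangeError("reference must contain at least one word");
  }
  // prev[j] holds the edit distance between the reference prefix processed
  // so far and the first j hypothesis words.
  let prev: number[] = Array.from({ length: hyp.length + 1 }, (_, j) => j);
  for (let i = 1; i <= ref.length; i++) {
    const curr: number[] = [i];
    for (let j = 1; j <= hyp.length; j++) {
      const cost = ref[i - 1] === hyp[j - 1] ? 0 : 1;
      curr[j] = Math.min(
        prev[j] + 1, // deletion of a reference word
        curr[j - 1] + 1, // insertion of a hypothesis word
        prev[j - 1] + cost // match or substitution
      );
    }
    prev = curr;
  }
  return prev[hyp.length] / ref.length;
}

// One substitution ("sat" -> "sit") and one deletion ("the") over 6 words.
console.log(wordErrorRate("the cat sat on the mat", "the cat sit on mat"));
```

A WER of 1.33% therefore corresponds to roughly one word-level error per 75 reference words.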

    Performance

    Input Size: Audio (any length, chunked internally)
    GPU Latency: ~0.3x real-time (A100)
    GPU Throughput: ~3.3x real-time (A100)
    GPU Memory: ~4.5 GB
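
The latency and throughput rows describe the same quantity from two directions: a real-time factor of about 0.3 means roughly 0.3 seconds of compute per second of audio, and its reciprocal is about 3.3 seconds of audio per second of compute. The sketch below turns that into a rough planning estimate. It assumes the listed A100 figure scales linearly with duration, which the source does not claim, and the function names are illustrative.

```typescript
// Approximate real-time factor on an A100, taken from the Performance figures.
const REAL_TIME_FACTOR = 0.3;

// Estimated compute time for a clip, assuming linear scaling with duration.
function estimateProcessingSeconds(
  audioSeconds: number,
  rtf: number = REAL_TIME_FACTOR
): number {
  if (audioSeconds < 0 || rtf <= 0) {
    throw new RangeError("audioSeconds must be >= 0 and rtf must be > 0");
  }
  return audioSeconds * rtf;
}

// Throughput is the reciprocal of the real-time factor.
function throughputMultiple(rtf: number = REAL_TIME_FACTOR): number {
  if (rtf <= 0) {
    throw new RangeError("rtf must be > 0");
  }
  return 1 / rtf;
}

// A one-hour recording: about 1080 seconds (18 minutes) of GPU time.
console.log(estimateProcessingSeconds(3600));
// About 3.33, consistent with the ~3.3x throughput figure.
console.log(throughputMultiple());
```

Actual timings depend on batch size, audio content, and hardware, so treat the result as an order-of-magnitude guide.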

    Specification

    Framework: HF
    Organization: ibm-granite
    Feature: Transcription
    Output: text + timestamps
    Modalities: video, audio
    Retriever: Transcript Search
    Parameters: 2B
    License: Apache-2.0
    Downloads/mo: 185K

    Research Paper

    Granite Speech 4.1 (arxiv.org)

    Build a pipeline with granite-speech-4.1-2b

    Add this model to a processing pipeline alongside other extractors. Combine with retrieval stages for end-to-end search.