    HF · Transcription · Apache 2.0

    granite-speech-4.1-2b-plus

    by ibm-granite

    Speaker-attributed ASR: diarization, word timestamps, and keyword biasing in a 2B-parameter model

    95K downloads/month
    2B params
    Identifiers
    Model ID
    ibm-granite/granite-speech-4.1-2b-plus
    Feature URI
    mixpeek://transcription@v1/ibm_granite_speech_41_2b_plus_v1
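
    The Feature URI appears to follow a `mixpeek://extractor@version/model_slug` layout. A minimal sketch of splitting it into parts, assuming that layout holds (it is inferred from the single example above; the `parseFeatureUri` helper and its field names are illustrative and not part of the Mixpeek SDK):

```typescript
// Hypothetical helper: splits a Mixpeek feature URI into its parts.
// The layout (extractor@version/model_slug) is inferred from one example,
// not taken from a published specification.
interface FeatureUri {
  extractor: string;
  version: string;
  modelSlug: string;
}

function parseFeatureUri(uri: string): FeatureUri {
  const match = /^mixpeek:\/\/([^@\/]+)@([^\/]+)\/(.+)$/.exec(uri);
  if (match === null) {
    throw new Error(`Unrecognized feature URI: ${uri}`);
  }
  const [, extractor, version, modelSlug] = match;
  return { extractor, version, modelSlug };
}

const parsed = parseFeatureUri(
  "mixpeek://transcription@v1/ibm_granite_speech_41_2b_plus_v1"
);
console.log(parsed.extractor, parsed.version, parsed.modelSlug);
```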

    Overview

    Granite Speech 4.1 2B Plus extends the base Granite Speech model with speaker attribution, word-level timestamp alignment (38.8 ms mean deviation), and keyword biasing, all in a single 2B-parameter model. Unlike pipeline approaches that chain separate ASR and diarization models, it produces speaker-labeled, timestamped transcripts in a single pass.

    With a Word Diarization Error Rate (WDER) of 0.9% on the FISHER dataset, it delivers production-grade speaker attribution. Keyword biasing lets you improve recognition of domain-specific terms (product names, technical jargon) without fine-tuning. On Mixpeek, it powers meeting transcription and call analytics pipelines where speaker identity and precise timing matter.
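
    To make "speaker-labeled, timestamped transcript" concrete, here is a minimal sketch that groups word-level output into speaker turns. The `Word` shape (text, speaker label, start and end in seconds) is an assumption for illustration; this page does not specify the actual response schema of the model or of Mixpeek.

```typescript
// Assumed word-level record; field names are illustrative only.
interface Word {
  text: string;
  speaker: string;
  start: number; // seconds
  end: number; // seconds
}

interface Turn {
  speaker: string;
  start: number;
  end: number;
  text: string;
}

// Merge consecutive words from the same speaker into one turn.
function groupIntoTurns(words: Word[]): Turn[] {
  const turns: Turn[] = [];
  for (const w of words) {
    const last = turns[turns.length - 1];
    if (last !== undefined && last.speaker === w.speaker) {
      last.text += " " + w.text;
      last.end = w.end;
    } else {
      turns.push({ speaker: w.speaker, start: w.start, end: w.end, text: w.text });
    }
  }
  return turns;
}

const sampleWords: Word[] = [
  { text: "Hello", speaker: "S1", start: 0.0, end: 0.4 },
  { text: "everyone", speaker: "S1", start: 0.45, end: 0.9 },
  { text: "Hi", speaker: "S2", start: 1.2, end: 1.4 },
];

for (const t of groupIntoTurns(sampleWords)) {
  console.log(`[${t.start.toFixed(2)}-${t.end.toFixed(2)}] ${t.speaker}: ${t.text}`);
}
```

    Because speaker labels arrive per word, turn boundaries fall out of a single linear scan; no separate diarization output needs to be merged with the transcript.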

    Architecture

    Autoregressive encoder-decoder (2B parameters) with multi-task training heads for ASR, speaker attribution, and timestamp alignment. Supports keyword biasing via attention-based shallow fusion. Native vLLM serving support.

    Mixpeek SDK Integration

    import { Mixpeek } from "mixpeek";

    // Replace "API_KEY" with your Mixpeek API key.
    const mx = new Mixpeek({ apiKey: "API_KEY" });

    // Ingest a recording and run the transcription extractor on it.
    await mx.collections.ingest({
      collection_id: "my-collection",
      source: { url: "https://example.com/meeting.mp4" },
      feature_extractors: [{
        name: "transcription",
        version: "v1",
        params: {
          model_id: "ibm-granite/granite-speech-4.1-2b-plus",
          enable_diarization: true, // request speaker labels
          keywords: ["Mixpeek", "RAG", "embeddings"] // terms to bias recognition toward
        }
      }]
    });

    Capabilities

    • Joint ASR + speaker diarization in one pass
    • Word-level timestamps (38.8 ms mean deviation)
    • Keyword biasing without fine-tuning
    • WDER 0.9% on FISHER dataset
    • Apache 2.0 license, vLLM-ready

    Use Cases on Mixpeek

    • Meeting transcription with speaker labels and precise timestamps
    • Call center analytics with per-speaker metrics
    • Legal deposition transcription with speaker attribution
    • Domain-specific transcription with keyword biasing for jargon

    Benchmarks

    Dataset                      | Metric         | Score   | Source
    FISHER (speaker diarization) | WDER           | 0.9%    | IBM, 2026 (Model Card)
    Timestamp accuracy           | Mean deviation | 38.8 ms | IBM, 2026 (Model Card)
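
    For readers unfamiliar with the two metrics, the sketch below computes them over a toy set of aligned words. It assumes the commonly used WDER definition (aligned words carrying the wrong speaker label, divided by all aligned words) and treats timestamp deviation as the mean absolute difference between predicted and reference word start times. Whether IBM's evaluation uses exactly these definitions is an assumption, and the `AlignedWord` type is hypothetical.

```typescript
// One hypothesis word aligned to one reference word. Insertions and
// deletions are assumed to have been dropped by the aligner already.
interface AlignedWord {
  refSpeaker: string;
  hypSpeaker: string;
  refStart: number; // seconds
  hypStart: number; // seconds
}

// Fraction of aligned words carrying the wrong speaker label.
function wder(words: AlignedWord[]): number {
  if (words.length === 0) return 0;
  const wrong = words.filter((w) => w.refSpeaker !== w.hypSpeaker).length;
  return wrong / words.length;
}

// Mean absolute start-time deviation, in milliseconds.
function meanDeviationMs(words: AlignedWord[]): number {
  if (words.length === 0) return 0;
  const total = words.reduce(
    (sum, w) => sum + Math.abs(w.hypStart - w.refStart),
    0
  );
  return (total / words.length) * 1000;
}

// Toy data, not taken from any benchmark.
const aligned: AlignedWord[] = [
  { refSpeaker: "A", hypSpeaker: "A", refStart: 0.0, hypStart: 0.02 },
  { refSpeaker: "A", hypSpeaker: "A", refStart: 0.5, hypStart: 0.54 },
  { refSpeaker: "B", hypSpeaker: "A", refStart: 1.0, hypStart: 1.0 },
  { refSpeaker: "B", hypSpeaker: "B", refStart: 1.5, hypStart: 1.44 },
];

console.log(`WDER: ${(wder(aligned) * 100).toFixed(1)}%`);
console.log(`Mean deviation: ${meanDeviationMs(aligned).toFixed(1)} ms`);
```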

    Performance

    Input Size: Variable-length audio
    GPU Latency: ~5 s per minute of audio (A100)
    GPU Throughput: ~12x realtime (A100)
    GPU Memory: ~5 GB
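
    The latency and throughput figures describe the same rate: 60 s of audio in about 5 s is 60 / 5 = 12x realtime. A small sketch for estimating processing time from audio duration, assuming the A100 figure scales linearly with length (an assumption; batching, model load time, and audio content will shift real numbers). The function names are illustrative.

```typescript
// Approximate A100 figure quoted on this page: ~5 s of compute per minute of audio.
const SECONDS_PER_AUDIO_MINUTE = 5;

// Estimated GPU time for a recording, under a linear-scaling assumption.
function estimateProcessingSeconds(audioSeconds: number): number {
  if (audioSeconds < 0) {
    throw new RangeError("audioSeconds must be non-negative");
  }
  return (audioSeconds / 60) * SECONDS_PER_AUDIO_MINUTE;
}

// Realtime factor implied by the latency figure.
function realtimeFactor(): number {
  return 60 / SECONDS_PER_AUDIO_MINUTE;
}

console.log(estimateProcessingSeconds(3600)); // 300 (a one-hour recording, about 5 minutes)
console.log(realtimeFactor()); // 12
```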

    Specification

    Framework: HF
    Organization: ibm-granite
    Feature: Transcription
    Output: text + timestamps
    Modalities: video, audio
    Retriever: Transcript Search
    Parameters: 2B
    License: Apache 2.0
    Downloads/mo: 95K

    Research Paper

    Granite Speech 4.1: Speaker-Attributed ASR

    arxiv.org

    Build a pipeline with granite-speech-4.1-2b-plus

    Add this model to a processing pipeline alongside other extractors. Combine with retrieval stages for end-to-end search.
