    HF | Scene Captioning | Apache-2.0

    gemma-4-26B-A4B-it

    by google

    Mixture-of-experts VLM delivering 97% of 31B quality at 8x less compute

    1.1M downloads/month
    26B total / 4B active parameters
    Identifiers
    Model ID
    google/gemma-4-26B-A4B-it
    Feature URI
    mixpeek://image_extractor@v1/google_gemma4_26b_a4b_v1
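
The Feature URI above appears to follow a `scheme://extractor@version/feature` layout. That reading is an assumption drawn from this one example, not a documented grammar, and `parse_feature_uri` is a hypothetical helper rather than part of the Mixpeek SDK. Under that assumption, a minimal sketch for splitting the URI into its parts:

```python
from typing import NamedTuple
from urllib.parse import urlsplit


class FeatureURI(NamedTuple):
    extractor: str  # assumed meaning: extractor family, e.g. "image_extractor"
    version: str    # assumed meaning: extractor version, e.g. "v1"
    feature: str    # assumed meaning: feature name, e.g. "google_gemma4_26b_a4b_v1"


def parse_feature_uri(uri: str) -> FeatureURI:
    """Split a mixpeek:// feature URI into (extractor, version, feature)."""
    parts = urlsplit(uri)
    if parts.scheme != "mixpeek":
        raise ValueError(f"unexpected scheme: {parts.scheme!r}")
    # The authority component looks like "image_extractor@v1".
    extractor, sep, version = parts.netloc.partition("@")
    feature = parts.path.lstrip("/")
    if not (extractor and sep and version and feature):
        raise ValueError(f"malformed feature URI: {uri!r}")
    return FeatureURI(extractor, version, feature)


parsed = parse_feature_uri("mixpeek://image_extractor@v1/google_gemma4_26b_a4b_v1")
print(parsed)
```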

    Overview

    Gemma 4 26B-A4B is Google's mixture-of-experts (MoE) vision-language model. It activates only 4B of its 26B total parameters per token. It ranked #6 on the Arena AI leaderboard at launch while using a fraction of the compute of dense models of comparable size.

    The model accepts both text and image input with a 256K context window, which makes it suitable for long-document visual understanding. Because only 4B parameters are active per token, it is a strong choice when you need high-quality VLM output at manageable cost.

    Architecture

    Mixture-of-Experts architecture with 26B total parameters, 4B active per token. Vision encoder processes image patches alongside text tokens. 256K context window. Supports optional 'thinking' mode for chain-of-thought reasoning.
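
To make "4B active per token" concrete, here is a toy sketch of generic top-k expert routing in pure Python. It illustrates the general MoE idea only: the expert count, the value of `k`, the gate logits, and the scalar "experts" are all made up for the example and are not Gemma 4's actual router or configuration.

```python
import math
from typing import Callable, List, Sequence, Tuple

Expert = Callable[[float], float]


def softmax(logits: Sequence[float]) -> List[float]:
    """Numerically stable softmax over a list of gate logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def route_top_k(x: float, gate_logits: Sequence[float],
                experts: Sequence[Expert], k: int) -> Tuple[float, List[int]]:
    """Run only the k highest-scoring experts and mix their outputs."""
    probs = softmax(gate_logits)
    top = sorted(range(len(experts)), key=lambda i: probs[i], reverse=True)[:k]
    norm = sum(probs[i] for i in top)  # renormalise over the chosen experts
    output = sum(probs[i] / norm * experts[i](x) for i in top)
    return output, top


# Four toy experts that scale their input; only two of them run per token.
experts = [lambda x, s=s: s * x for s in (1.0, 2.0, 3.0, 4.0)]
out, used = route_top_k(1.0, [0.0, 0.0, 5.0, 5.0], experts, k=2)
print(out, used)

# Share of parameters active per token, using the figures quoted above.
active_fraction = 4 / 26
print(f"{active_fraction:.1%}")
```

The experts that are not selected are never evaluated, which is where the compute saving comes from: with the figures quoted above, roughly 15% of the parameters participate in each token.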

    Mixpeek SDK Integration

    from mixpeek import Mixpeek

    mx = Mixpeek(api_key="YOUR_KEY")

    mx.ingest(
        collection_id="visual-docs",
        source="s3://reports/",
        extractors=[
            {
                "type": "scene_caption",
                "model": "google/gemma-4-26B-A4B-it",
                "output_feature": "caption"
            },
            {
                "type": "text_embedding",
                "model": "BAAI/bge-m3",
                "input_field": "caption",
                "output_feature": "caption_embedding"
            }
        ]
    )
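
In the snippet above, the second extractor's `input_field` ("caption") matches the first extractor's `output_feature`, which suggests that extractors are chained by name. Assuming that is how the wiring works, the following sketch checks a config for dangling references before it is submitted. `check_extractor_chain` is a hypothetical local helper and makes no SDK calls.

```python
from typing import Dict, List


def check_extractor_chain(extractors: List[Dict[str, str]]) -> List[str]:
    """Return a list of wiring problems; an empty list means the chain is consistent."""
    problems: List[str] = []
    produced = set()  # output_feature names emitted by earlier steps
    for i, step in enumerate(extractors):
        needed = step.get("input_field")
        if needed is not None and needed not in produced:
            problems.append(
                f"step {i} ({step.get('type')}): input_field {needed!r} "
                "is not produced by an earlier step"
            )
        out = step.get("output_feature")
        if out in produced:
            problems.append(f"step {i} ({step.get('type')}): duplicate output_feature {out!r}")
        if out:
            produced.add(out)
    return problems


extractors = [
    {"type": "scene_caption", "model": "google/gemma-4-26B-A4B-it",
     "output_feature": "caption"},
    {"type": "text_embedding", "model": "BAAI/bge-m3",
     "input_field": "caption", "output_feature": "caption_embedding"},
]
print(check_extractor_chain(extractors))
```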

    Capabilities

    • Multimodal understanding (text + images)
    • 256K context window for long documents
    • MoE efficiency: 4B active / 26B total
    • Built-in reasoning mode
    • Apache 2.0 license

    Use Cases on Mixpeek

    • Cost-efficient visual document captioning in Mixpeek ingestion pipelines
    • Long-document visual understanding (multi-page PDFs with charts)
    • Scene description for video frame analysis at scale

    Benchmarks

    Dataset                 Metric      Score        Source
    MMLU Pro                Accuracy    83%          Google, May 2026
    AIME 2026               Accuracy    85%          Google, May 2026
    Arena AI Leaderboard    ELO         1441 (#6)    Arena AI, May 2026

    Performance

    Input Size        Up to 256K tokens (text + image patches)
    GPU Latency       ~120 ms / image (A100, 4B active)
    GPU Throughput    ~65 images/sec (A100, batch 8)
    GPU Memory        ~18 GB (MoE, sparse activation)
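
As a rough planning aid, the throughput figure above can be turned into a back-of-the-envelope time estimate for a captioning job. The sketch below treats ~65 images/sec as a constant, which is an assumption: real throughput varies with image resolution, prompt length, and caption length. The 1 frame/sec sampling rate is also just an illustrative choice.

```python
import math


def frames_in_video(duration_s: float, sample_fps: float) -> int:
    """Number of frames extracted when sampling a video at sample_fps."""
    return math.ceil(duration_s * sample_fps)


def estimate_seconds(num_images: int, images_per_sec: float = 65.0) -> float:
    """Estimated GPU time in seconds, assuming constant throughput."""
    if images_per_sec <= 0:
        raise ValueError("images_per_sec must be positive")
    return num_images / images_per_sec


# One hour of video sampled at 1 frame per second (illustrative numbers).
frames = frames_in_video(duration_s=3600, sample_fps=1.0)
print(frames, round(estimate_seconds(frames), 1))
```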

    Specification

    Framework       HF
    Organization    google
    Feature         Scene Captioning
    Output          text
    Modalities      video, image
    Retriever       Semantic Search
    Parameters      26B total / 4B active
    License         Apache-2.0
    Downloads/mo    1.1M

    Research Paper

    Gemma 4: Byte for byte, the most capable open models

    arxiv.org

    Build a pipeline with gemma-4-26B-A4B-it

    Add this model to a processing pipeline alongside other extractors. Combine with retrieval stages for end-to-end search.
