    HF | Scene Captioning | Apache-2.0

    gemma-4-31B-it

    by google

    Top-3 open VLM with 256K context for dense visual document understanding

    820K downloads/month
    31B parameters

    Identifiers
    Model ID: google/gemma-4-31B-it
    Feature URI: mixpeek://image_extractor@v1/google_gemma4_31b_v1

    Overview

    Gemma 4 31B is Google's dense vision-language model, ranked #3 among open models on the Arena AI text leaderboard as of May 2026. Unlike the MoE variant (27B-A4B), which activates only a subset of its parameters per token, this dense model activates all 31B parameters on every token, which buys quality at a higher compute cost.

    The 256K context window and built-in thinking mode make it particularly strong for complex document understanding tasks where accuracy matters more than throughput.

    Architecture

    Gemma 4 31B is a dense transformer with 31B parameters. A vision encoder processes image patches, which share the 256K-token context window with text. Thinking mode enables chain-of-thought reasoning for complex visual tasks.

    Mixpeek SDK Integration

    from mixpeek import Mixpeek

    mx = Mixpeek(api_key="YOUR_KEY")

    mx.ingest(
        collection_id="technical-docs",
        source="s3://diagrams/",
        extractors=[
            # Step 1: caption each image with Gemma 4 31B.
            {
                "type": "scene_caption",
                "model": "google/gemma-4-31B-it",
                "output_feature": "caption",
            },
            # Step 2: embed the caption text produced by step 1.
            {
                "type": "text_embedding",
                "model": "Qwen/Qwen3-Embedding-4B",
                "input_field": "caption",
                "output_feature": "caption_embedding",
            },
        ],
    )
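
    In the snippet above, the second extractor's input_field has the same name as the first extractor's output_feature, which suggests extractors are chained by feature name. The following is a minimal sketch of a pre-flight check for that chaining. The helper validate_extractor_chain is hypothetical and not part of the Mixpeek SDK, and the rule that an input_field must be produced by an earlier extractor is an assumption inferred from the example.

```python
def validate_extractor_chain(extractors):
    """Check that every input_field names an output_feature produced earlier.

    Hypothetical helper, not part of the Mixpeek SDK. It assumes extractors
    run in list order and are wired together by feature name.
    """
    produced = set()
    for index, extractor in enumerate(extractors):
        input_field = extractor.get("input_field")
        if input_field is not None and input_field not in produced:
            raise ValueError(
                f"extractor {index} ({extractor['type']}) reads '{input_field}', "
                "which no earlier extractor produces"
            )
        produced.add(extractor["output_feature"])
    return sorted(produced)


# Same extractor list as in the SDK example.
extractors = [
    {
        "type": "scene_caption",
        "model": "google/gemma-4-31B-it",
        "output_feature": "caption",
    },
    {
        "type": "text_embedding",
        "model": "Qwen/Qwen3-Embedding-4B",
        "input_field": "caption",
        "output_feature": "caption_embedding",
    },
]

features = validate_extractor_chain(extractors)
print(features)
```

    Running a check like this before calling mx.ingest catches a misspelled or misordered feature name locally instead of after an ingestion job has started.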

    Capabilities

    • Top-3 open VLM (Arena #3 among open models)
    • 256K context window
    • Dense architecture for fine-tuning
    • Built-in reasoning mode
    • Apache 2.0 license

    Use Cases on Mixpeek

    • High-accuracy visual document extraction where quality is critical
    • Complex chart and diagram understanding
    • Fine-tuning on domain-specific visual data (dense architecture)

    Benchmarks

    Dataset              | Metric   | Score      | Source
    MMLU Pro             | Accuracy | 85.2%      | Google, May 2026
    AIME 2026            | Accuracy | 89.2%      | Google, May 2026
    Arena AI Leaderboard | Elo      | Top 3 open | Arena AI, May 2026

    Performance

    Input Size: Up to 256K tokens (text + image patches)
    GPU Latency: ~280 ms / image (A100)
    GPU Throughput: ~28 images/sec (A100, batch 4)
    GPU Memory: ~62 GB (dense, full activation)
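
    The ~62 GB memory figure matches what 31B parameters occupy at 2 bytes each. The sketch below works through that arithmetic. That the listed figure reflects 16-bit weights is an assumption, since the page does not state the precision, and the 8-bit and 4-bit rows are illustrative only. The estimate covers weights alone; the KV cache and activations need additional memory.

```python
def weight_memory_gb(num_params, bytes_per_param):
    """Memory for model weights alone, in decimal gigabytes (1 GB = 1e9 bytes)."""
    return num_params * bytes_per_param / 1e9


PARAMS = 31e9  # 31B parameters, from the model page

# Assumption: the listed ~62 GB corresponds to 16-bit (2-byte) weights.
bf16_gb = weight_memory_gb(PARAMS, 2)
# Illustrative lower-precision estimates, not figures from the page.
int8_gb = weight_memory_gb(PARAMS, 1)
int4_gb = weight_memory_gb(PARAMS, 0.5)

print(f"16-bit: {bf16_gb:.1f} GB, 8-bit: {int8_gb:.1f} GB, 4-bit: {int4_gb:.1f} GB")
```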

    Specification

    Framework: HF
    Organization: google
    Feature: Scene Captioning
    Output: text
    Modalities: video, image
    Retriever: Semantic Search
    Parameters: 31B
    License: Apache-2.0
    Downloads/mo: 820K

    Research Paper

    Gemma 4: Byte for byte, the most capable open models

    arxiv.org

    Build a pipeline with gemma-4-31B-it

    Add this model to a processing pipeline alongside other extractors. Combine with retrieval stages for end-to-end search.
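
    As a rough illustration of what a retrieval stage over caption embeddings does, the sketch below ranks stored vectors by cosine similarity to a query vector. It is a generic example and not Mixpeek's retriever implementation. The function names are hypothetical, and the 3-dimensional vectors are toy values standing in for real caption embeddings.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # treat a zero vector as matching nothing
    return dot / (norm_a * norm_b)


def rank_by_similarity(query_vec, indexed, top_k=3):
    """Return the top_k (doc_id, score) pairs, highest similarity first."""
    scored = [
        (doc_id, cosine_similarity(query_vec, vec))
        for doc_id, vec in indexed.items()
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]


# Toy vectors standing in for caption embeddings (made up for illustration).
indexed = {
    "diagram-a": [0.9, 0.1, 0.0],
    "diagram-b": [0.1, 0.9, 0.0],
    "diagram-c": [0.7, 0.7, 0.0],
}
query = [1.0, 0.0, 0.0]

results = rank_by_similarity(query, indexed, top_k=2)
for doc_id, score in results:
    print(f"{doc_id}: {score:.3f}")
```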
