    HF · Scene Captioning · Apache 2.0

    gemma-4-E4B-it

    by google

    Efficient 4B multimodal VLM with Per-Layer Embeddings for on-device AI

    5.7M downloads/month
    4.5B (effective) params
    Identifiers
    Model ID
    google/gemma-4-E4B-it
    Feature URI
    mixpeek://image_extractor@v1/google_gemma4_e4b_v1
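
The Feature URI above appears to follow an `extractor@version/feature` layout. The sketch below splits it into those parts, assuming that reading is correct. The helper `parseFeatureUri` and the field names `extractor`, `version`, and `feature` are hypothetical and are not part of the Mixpeek SDK.

```typescript
// Hypothetical helper: the URI layout is inferred from the single example
// on this page, not from a published Mixpeek specification.
interface FeatureUriParts {
  extractor: string;
  version: string;
  feature: string;
}

function parseFeatureUri(uri: string): FeatureUriParts | null {
  // Expected shape (assumed): mixpeek://<extractor>@<version>/<feature>
  const match = /^mixpeek:\/\/([^@\/]+)@([^\/]+)\/(.+)$/.exec(uri);
  if (match === null) {
    return null;
  }
  return {
    extractor: match[1] ?? "",
    version: match[2] ?? "",
    feature: match[3] ?? "",
  };
}

const parts = parseFeatureUri("mixpeek://image_extractor@v1/google_gemma4_e4b_v1");
console.log(parts);
```

Returning `null` for strings that do not match keeps callers from silently working with a half-parsed identifier.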

    Overview

    Gemma 4 E4B is Google DeepMind's efficient multimodal model that uses Per-Layer Embeddings (PLE) to achieve the representational depth of a larger model while maintaining a compact inference footprint. With 4.5 billion effective parameters, it processes text, images, and audio with a 128K token context window, making it one of the most capable small models available.

    On Mixpeek, Gemma 4 E4B powers lightweight multimodal understanding tasks including scene captioning, visual question answering, and document analysis where you need strong accuracy without the compute overhead of larger models.

    Architecture

    Gemma 4 E4B is a decoder-only transformer with hybrid attention that interleaves local sliding-window layers with full global attention layers. Per-Layer Embeddings (PLE) feed a secondary embedding signal into every decoder layer, which yields 4.5B effective parameters from a 2.3B-active compute footprint. The final layer always uses global attention.
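
To make the hybrid attention idea concrete, here is a minimal sketch of how local and global layers could be interleaved and how each kind restricts which tokens a query can see. The ratio of local to global layers and the window size are illustrative assumptions, not values from the Gemma 4 configuration. The function names `layerPattern` and `canAttend` are hypothetical.

```typescript
type LayerKind = "local" | "global";

// Build a layer schedule with `localPerGlobal` local layers before each
// global layer. The last layer is forced to global, matching the
// description above. The ratio itself is an assumed, illustrative value.
function layerPattern(numLayers: number, localPerGlobal: number): LayerKind[] {
  const kinds: LayerKind[] = [];
  for (let i = 0; i < numLayers; i++) {
    const isLast = i === numLayers - 1;
    const isGlobalSlot = (i + 1) % (localPerGlobal + 1) === 0;
    kinds.push(isLast || isGlobalSlot ? "global" : "local");
  }
  return kinds;
}

// Causal attention rule for one query/key pair.
// Global layers see every earlier token; local layers see only the most
// recent `window` tokens (including the query position itself).
function canAttend(kind: LayerKind, query: number, key: number, window: number): boolean {
  if (key > query) {
    return false; // causal mask: no attention to future tokens
  }
  return kind === "global" || query - key < window;
}

console.log(layerPattern(7, 2).join(", "));
// local, local, global, local, local, global, global
```

Local layers keep per-token attention cost bounded by the window, while the periodic global layers let information travel across the full context.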

    Mixpeek SDK Integration

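    import { Mixpeek } from "mixpeek";

    // Replace "API_KEY" with your Mixpeek API key.
    const mx = new Mixpeek({ apiKey: "API_KEY" });

    // Ingest a video and run the scene_description extractor with Gemma 4 E4B.
    // Top-level await requires an ES module context.
    await mx.collections.ingest({
      collection_id: "my-collection",
      source: { url: "https://example.com/video.mp4" },
      feature_extractors: [{
        name: "scene_description",
        version: "v1",
        params: {
          model_id: "google/gemma-4-E4B-it"
        }
      }]
    });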

    Capabilities

    • Multimodal input: text, image, and audio understanding
    • 128K token context window
    • Built-in thinking mode for step-by-step reasoning
    • Per-Layer Embeddings for compute-efficient inference
    • Fits under 1.5 GB with 2-bit quantization
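
The 2-bit figure in the last bullet can be sanity-checked with simple arithmetic: weight storage is roughly parameter count times bits per weight, divided by 8. The sketch below treats the 4.5B effective-parameter figure as the weight count, which is an assumption. It also ignores quantization scales, activations, and the KV cache, so real memory use will be higher. The helper name `weightFootprintGB` is hypothetical.

```typescript
// Approximate weight-only storage in decimal gigabytes.
// Assumes every parameter is stored at the same bit width.
function weightFootprintGB(params: number, bitsPerWeight: number): number {
  const bytes = (params * bitsPerWeight) / 8;
  return bytes / 1e9;
}

const EFFECTIVE_PARAMS = 4.5e9; // "4.5B (effective)" from this page

console.log(weightFootprintGB(EFFECTIVE_PARAMS, 2)); // 1.125
console.log(weightFootprintGB(EFFECTIVE_PARAMS, 4)); // 2.25
```

Under these assumptions the 2-bit weights come to about 1.1 GB, which is consistent with the "under 1.5 GB" claim once some overhead is allowed for.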

    Use Cases on Mixpeek

    • On-device visual understanding for mobile and edge media pipelines
    • Lightweight scene captioning across large video libraries without GPU-heavy inference
    • Multimodal document Q&A where images, text, and audio context must be processed together

    Benchmarks

    Dataset     Metric     Score   Source
    AIME 2026   Accuracy   42.5%   Google Gemma 4 technical report
    MMLU Pro    Accuracy   ~55%    Gemma 4 E4B model card

    Performance

    Input Size: Text + 224×224 px images
    GPU Latency: ~25 ms / image (A100)
    GPU Throughput: ~40 images/sec (A100)
    GPU Memory: ~3.5 GB (bf16)

    4.5B effective params via PLE — only 2.3B active at runtime
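
The latency and throughput figures above are two views of the same number: at about 25 ms per image, one A100 handles roughly 1000 / 25 = 40 images per second. The sketch below uses that relationship for a rough estimate of captioning time for a video library. The sampling rate of one frame per second and the 100-hour library are illustrative assumptions. The function names are hypothetical, and the estimate ignores decoding, I/O, and batching effects.

```typescript
// Convert per-image latency (milliseconds) to images per second,
// assuming images are processed one after another.
function throughputPerSecond(latencyMs: number): number {
  return 1000 / latencyMs;
}

// Rough GPU hours needed to caption a video library when sampling
// `framesPerSecond` frames from each second of video.
function estimateCaptioningHours(
  videoHours: number,
  framesPerSecond: number,
  latencyMs: number,
): number {
  const totalFrames = videoHours * 3600 * framesPerSecond;
  const seconds = totalFrames / throughputPerSecond(latencyMs);
  return seconds / 3600;
}

console.log(throughputPerSecond(25)); // 40
console.log(estimateCaptioningHours(100, 1, 25)); // 2.5
```

Under these assumptions, 100 hours of video sampled at one frame per second would take about 2.5 GPU hours on a single A100.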

    Specification

    Framework: HF
    Organization: google
    Feature: Scene Captioning
    Output: text
    Modalities: video, image
    Retriever: Semantic Search
    Parameters: 4.5B (effective)
    License: Apache 2.0
    Downloads/mo: 5.7M

    Research Paper

    Gemma 4 model overview

    arxiv.org

    Build a pipeline with gemma-4-E4B-it

    Add this model to a processing pipeline alongside other extractors. Combine with retrieval stages for end-to-end search.
