    HF | Scene Captioning | Apache 2.0

    gemma-4-12B-it

    by google

    Open 12B multimodal model for image, audio, and long-context agent perception

    581K downloads/month
    12B params
    Identifiers

    Model ID: google/gemma-4-12B-it
    Feature URI: mixpeek://image_extractor@v1/google_gemma4_12b_it_v1

    Overview

    Gemma 4 12B IT is an instruction-tuned open model from Google DeepMind. The model card describes Gemma 4 as multimodal, with text and image input across the family and audio support on the E2B, E4B, and 12B variants. It is a strong fit for agents that need to inspect retrieved images, short audio clips, or mixed evidence after first-stage search.

    On Mixpeek, Gemma 4 12B belongs in the inspection layer. Use cheaper embeddings and filters to retrieve candidates, then ask Gemma to produce concise observations, answer bounded visual questions, or turn multimodal evidence into structured fields that downstream agents can cite.
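The retrieve-then-inspect pattern described above can be sketched as follows. This is a minimal illustration rather than Mixpeek's implementation: `Candidate`, `Inspection`, `retrieveTopK`, and `retrieveThenInspect` are hypothetical names, the embeddings are assumed to be precomputed, and `inspect` is a caller-supplied stand-in for whatever call actually reaches the model.

```typescript
// Hypothetical shapes for illustration; not part of the Mixpeek SDK.
interface Candidate {
  id: string;
  embedding: number[];
}

interface Inspection {
  id: string;
  observation: string;
}

// Cosine similarity between two equal-length vectors (0 if either is all zeros).
function cosine(a: number[], b: number[]): number {
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return normA === 0 || normB === 0 ? 0 : dot / Math.sqrt(normA * normB);
}

// Stage 1: cheap vector scoring narrows the pool to the k best candidates.
function retrieveTopK(query: number[], pool: Candidate[], k: number): Candidate[] {
  return pool
    .map((candidate) => ({ candidate, score: cosine(query, candidate.embedding) }))
    .sort((x, y) => y.score - x.score)
    .slice(0, k)
    .map((scored) => scored.candidate);
}

// Stage 2: only the shortlist reaches the expensive inspector.
// `inspect` is an assumed callback standing in for a real model call.
async function retrieveThenInspect(
  query: number[],
  pool: Candidate[],
  k: number,
  inspect: (candidate: Candidate) => Promise<string>,
): Promise<Inspection[]> {
  const shortlist = retrieveTopK(query, pool, k);
  return Promise.all(
    shortlist.map(async (candidate) => ({
      id: candidate.id,
      observation: await inspect(candidate),
    })),
  );
}
```

Keeping `k` small is the point of the design: the number of model calls is bounded by the shortlist size, not by the size of the collection.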

    Architecture

    Gemma 4 12B IT is an instruction-tuned multimodal model exposed through Hugging Face Transformers. The 12B checkpoint supports a 256K context window, multilingual text handling, image input, audio input, and text output.
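One practical consequence of a fixed context window is that evidence has to be budgeted before it is sent. The sketch below shows a simple greedy packer under stated assumptions: `EvidenceItem` and `packWithinBudget` are hypothetical names, per-item token counts are estimates supplied by the caller (the real cost of an image or audio clip depends on the model's preprocessing), and "256K" is read here as 256,000 tokens.

```typescript
// Hypothetical evidence record; `tokens` is the caller's own estimate.
interface EvidenceItem {
  id: string;
  tokens: number;
}

// Assumption: "256K" interpreted as 256,000 tokens.
const CONTEXT_TOKENS = 256_000;

// Greedily keeps items, in the order given, while they fit in the budget
// that remains after reserving room for the model's output.
function packWithinBudget(
  items: EvidenceItem[],
  reserveForOutput: number,
  budget: number = CONTEXT_TOKENS,
): EvidenceItem[] {
  const available = budget - reserveForOutput;
  const packed: EvidenceItem[] = [];
  let used = 0;
  for (const item of items) {
    if (used + item.tokens > available) continue; // skip items that do not fit
    packed.push(item);
    used += item.tokens;
  }
  return packed;
}
```

Passing items in relevance order means the highest-ranked evidence is the last to be dropped.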

    Mixpeek SDK Integration

    import { Mixpeek } from "mixpeek";

    const mx = new Mixpeek({ apiKey: "API_KEY" });

    // Ingest keyframes and caption each one with Gemma 4 12B IT.
    await mx.collections.ingest({
      collection_id: "agent-visual-evidence",
      source: { url: "s3://media/keyframes/" },
      feature_extractors: [{
        feature: "scene_caption",
        model: "google/gemma-4-12B-it",
        params: {
          // Structured fields requested for each caption.
          schema: {
            visible_objects: "string[]",
            scene_summary: "string",
            evidence_quality: "number"
          }
        }
      }]
    });
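Downstream agents that cite these fields may want to check them first. The following is a minimal sketch that assumes the extractor's output arrives as a JSON string shaped like the schema above; the page does not document the response format, so `SceneCaption`, `isSceneCaption`, and `parseSceneCaption` are hypothetical helpers, not SDK functions.

```typescript
// Mirrors the fields requested in `params.schema`.
interface SceneCaption {
  visible_objects: string[];
  scene_summary: string;
  evidence_quality: number;
}

// Type guard: true only when every field is present with the declared type.
function isSceneCaption(value: unknown): value is SceneCaption {
  if (typeof value !== "object" || value === null) return false;
  const record = value as Record<string, unknown>;
  const objects = record["visible_objects"];
  const summary = record["scene_summary"];
  const quality = record["evidence_quality"];
  return (
    Array.isArray(objects) &&
    objects.every((o: unknown) => typeof o === "string") &&
    typeof summary === "string" &&
    typeof quality === "number" &&
    Number.isFinite(quality)
  );
}

// Parses raw model text; returns null instead of throwing on malformed output.
function parseSceneCaption(raw: string): SceneCaption | null {
  try {
    const parsed: unknown = JSON.parse(raw);
    return isSceneCaption(parsed) ? parsed : null;
  } catch {
    return null;
  }
}
```

Returning `null` rather than throwing lets a pipeline drop or retry a malformed caption without aborting the whole batch.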

    Capabilities

    • Image-text and audio-text understanding in a single instruction-tuned model
    • Long-context multimodal reasoning for evidence inspection
    • Multilingual support across a broad range of languages
    • Apache 2.0 license for production evaluation

    Use Cases on Mixpeek

    • Answer visual questions over retrieved product, ad, or scene images
    • Summarize mixed evidence that includes transcript snippets and frames
    • Extract structured observations from screenshots, charts, and media frames
    • Run second-stage agent inspection after vector search narrows the candidate set

    Performance

    Input Size: Text, image, and audio inputs with long-context text
    GPU Latency: Output length dependent
    GPU Throughput: Batch dependent
    GPU Memory: 12B multimodal deployment class

    Best used as a second-stage inspector after retrieval narrows candidates

    Specification

    Framework: HF
    Organization: google
    Feature: Scene Captioning
    Output: text
    Modalities: video, image
    Retriever: Semantic Search
    Parameters: 12B
    License: Apache 2.0
    Downloads/mo: 581K

    Research Paper

    Gemma 4 12B IT model card (arxiv.org)

    Build a pipeline with gemma-4-12B-it

    Add this model to a processing pipeline alongside other extractors. Combine with retrieval stages for end-to-end search.
