gemma-4-12B-it
by google
Open 12B multimodal model for image, audio, and long-context agent perception
google/gemma-4-12B-it
mixpeek://image_extractor@v1/google_gemma4_12b_it_v1
Overview
Gemma 4 12B IT is an instruction-tuned open model from Google DeepMind. The model card describes Gemma 4 as multimodal, with text and image input across the family and audio support on the E2B, E4B, and 12B variants. It is a strong fit for agents that need to inspect retrieved images, short audio clips, or mixed evidence after first-stage search.
On Mixpeek, Gemma 4 12B belongs in the inspection layer. Use cheaper embeddings and filters to retrieve candidates, then ask Gemma to produce concise observations, answer bounded visual questions, or turn multimodal evidence into structured fields that downstream agents can cite.
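The retrieve-then-inspect pattern described above can be sketched roughly as follows. This is an illustrative outline, not Mixpeek SDK code: the `Candidate` shape, the `selectForInspection` and `inspectShortlist` helpers, and the default thresholds are all hypothetical, and the `inspect` callback stands in for whatever call actually sends a candidate to Gemma.

```typescript
// Hypothetical shape of a first-stage retrieval hit; not a Mixpeek SDK type.
interface Candidate {
  id: string;
  score: number; // similarity score from the cheap first stage, higher is better
}

// Keep only hits at or above a score floor, then cap to the top-k by score.
// `filter` returns a new array, so the caller's input is not reordered.
function selectForInspection(
  candidates: Candidate[],
  minScore: number,
  maxCandidates: number
): Candidate[] {
  return candidates
    .filter((c) => c.score >= minScore)
    .sort((a, b) => b.score - a.score)
    .slice(0, Math.max(0, maxCandidates));
}

// Run the expensive inspector only on the shortlisted candidates.
// `inspect` is an assumed placeholder for the call that queries Gemma.
async function inspectShortlist<T>(
  candidates: Candidate[],
  inspect: (c: Candidate) => Promise<T>,
  minScore = 0.5,
  maxCandidates = 5
): Promise<Array<{ id: string; observation: T }>> {
  const shortlist = selectForInspection(candidates, minScore, maxCandidates);
  const results: Array<{ id: string; observation: T }> = [];
  for (const c of shortlist) {
    results.push({ id: c.id, observation: await inspect(c) });
  }
  return results;
}
```

Capping the shortlist before inspection bounds the number of calls to the larger model per query, which is the main reason to place it in the second stage.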
Architecture
Instruction-tuned Gemma 4 multimodal model exposed through Hugging Face Transformers. The 12B checkpoint supports a 256K context window, multilingual text handling, image input, audio input, and text output.
Mixpeek SDK Integration
import { Mixpeek } from "mixpeek";

const mx = new Mixpeek({ apiKey: "API_KEY" });

await mx.collections.ingest({
  collection_id: "agent-visual-evidence",
  source: { url: "s3://media/keyframes/" },
  feature_extractors: [
    {
      feature: "scene_caption",
      model: "google/gemma-4-12B-it",
      params: {
        schema: {
          visible_objects: "string[]",
          scene_summary: "string",
          evidence_quality: "number"
        }
      }
    }
  ]
});
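Because downstream agents cite the structured fields, it can help to check each extractor result against the requested schema before storing it. The sketch below assumes the model's output reaches your code as a JSON string, which the source does not confirm. The `SceneCaption` type and the `isSceneCaption` and `parseSceneCaption` helpers are hypothetical names and are not part of the Mixpeek SDK.

```typescript
// Mirrors the three fields requested in the `schema` of the ingest call.
interface SceneCaption {
  visible_objects: string[];
  scene_summary: string;
  evidence_quality: number;
}

// Runtime type guard: true only if every field is present with the expected type.
function isSceneCaption(value: unknown): value is SceneCaption {
  if (typeof value !== "object" || value === null) return false;
  const v = value as Record<string, unknown>;
  return (
    Array.isArray(v.visible_objects) &&
    v.visible_objects.every((o) => typeof o === "string") &&
    typeof v.scene_summary === "string" &&
    typeof v.evidence_quality === "number" &&
    Number.isFinite(v.evidence_quality)
  );
}

// Parse raw model text (assumed to be JSON); return null on malformed output
// instead of throwing, so callers can skip or retry the item.
function parseSceneCaption(raw: string): SceneCaption | null {
  try {
    const parsed: unknown = JSON.parse(raw);
    return isSceneCaption(parsed) ? parsed : null;
  } catch {
    return null;
  }
}
```

Returning `null` rather than throwing keeps one malformed response from failing a whole batch. Whether to retry or drop such items is a pipeline-level choice.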
Capabilities
- Image-text and audio-text understanding in a single instruction-tuned model
- Long-context multimodal reasoning for evidence inspection
- Multilingual support across a broad range of languages
- Apache 2.0 license for production evaluation
Use Cases on Mixpeek
Performance
Best used as a second-stage inspector after retrieval narrows candidates
Specification
Research Paper
Gemma 4 12B IT model card
arxiv.org
Build a pipeline with gemma-4-12B-it
Add this model to a processing pipeline alongside other extractors. Combine with retrieval stages for end-to-end search.