gemma-4-E4B-it

by google

Efficient 4B multimodal VLM with Per-Layer Embeddings for on-device AI

5.7M downloads/month

4.5B (effective) params


Identifiers

Model ID

google/gemma-4-E4B-it

Feature URI

mixpeek://image_extractor@v1/google_gemma4_e4b_v1
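
The Feature URI above appears to follow an `extractor@version/feature` layout. The following is a minimal sketch of splitting it into parts, assuming that layout holds. The helper `parseFeatureUri` and the field names `extractor`, `version`, and `feature` are hypothetical labels chosen for illustration, not Mixpeek terminology or part of the Mixpeek SDK.

```typescript
// Hypothetical shape: the field names are illustrative assumptions.
interface FeatureUri {
  extractor: string;
  version: string;
  feature: string;
}

// Split a URI of the assumed form mixpeek://<extractor>@<version>/<feature>.
// Returns null when the string does not match that form.
function parseFeatureUri(uri: string): FeatureUri | null {
  const match = /^mixpeek:\/\/([^@\/]+)@([^\/]+)\/(.+)$/.exec(uri);
  if (match === null) return null;
  const [, extractor, version, feature] = match;
  return { extractor, version, feature };
}

const parsed = parseFeatureUri("mixpeek://image_extractor@v1/google_gemma4_e4b_v1");
console.log(parsed);
```

A regular expression is used here instead of the built-in `URL` class because `URL` would interpret the text before `@` as a username rather than as an extractor name.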

Overview

Gemma 4 E4B is Google DeepMind's efficient multimodal model. It uses Per-Layer Embeddings (PLE) to achieve the representational depth of a larger model while keeping a compact inference footprint. With 4.5 billion effective parameters, it processes text, images, and audio within a 128K-token context window, which makes it a capable option among small models for on-device use.

On Mixpeek, Gemma 4 E4B powers lightweight multimodal understanding tasks, including scene captioning, visual question answering, and document analysis, where you need strong accuracy without the compute overhead of larger models.

Architecture

Gemma 4 E4B is a decoder-only transformer with hybrid attention that interleaves local sliding-window layers with full global-attention layers. Per-Layer Embeddings (PLE) feed a secondary embedding signal into every decoder layer, enabling 4.5B effective parameters from a 2.3B-active compute footprint. The final layer always uses global attention.
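
To make the hybrid attention layout concrete, here is a minimal sketch of a per-layer schedule and the matching causal visibility rule. It is an illustration under stated assumptions, not the model's actual configuration: the `localPerGlobal` ratio, the `window` size, and the function names are hypothetical, since the interleaving pattern and window length are not given here. The only property taken from the description above is that the final layer is always global. PLE is not modeled.

```typescript
type AttentionKind = "local" | "global";

// Build a per-layer attention schedule.
// `localPerGlobal` is a hypothetical knob: the real interleaving ratio is an assumption.
function buildLayerSchedule(numLayers: number, localPerGlobal: number): AttentionKind[] {
  const schedule: AttentionKind[] = [];
  for (let i = 0; i < numLayers; i++) {
    // Every (localPerGlobal + 1)-th layer is global; the rest are local.
    const isGlobal = (i + 1) % (localPerGlobal + 1) === 0;
    schedule.push(isGlobal ? "global" : "local");
  }
  // The final layer always uses global attention.
  if (numLayers > 0) schedule[numLayers - 1] = "global";
  return schedule;
}

// Causal visibility: may the query at position q attend to the key at position k?
// Local layers see only the most recent `window` positions (including q itself).
function canAttend(kind: AttentionKind, q: number, k: number, window: number): boolean {
  if (k > q) return false; // causal: never attend to future positions
  return kind === "global" ? true : q - k < window;
}

console.log(buildLayerSchedule(5, 2));
```

With five layers and two local layers per global one, the periodic pattern would end on a local layer, so the last entry is overridden to `"global"`. Local layers bound the attention cost per token by the window size, while the interleaved global layers keep long-range context reachable.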

Mixpeek SDK Integration

import { Mixpeek } from "mixpeek";

const mx = new Mixpeek({ apiKey: "API_KEY" });

// Managed: create a collection over a bucket; Mixpeek runs this model's extractor
const collection = await mx.collections.create({
  namespace_id: "my-namespace",
  collection_name: "my-collection",
  source: { type: "bucket", bucket_ids: ["bkt_your_bucket"] },
  feature_extractor: {
    feature_extractor_name: "scene_description",
    version: "v1",
    parameters: { model_id: "google/gemma-4-E4B-it" },
  },
});