HF · Scene Captioning · Apache-2.0

gemma-4-31B-it

by google

Top-3 open VLM with 256K context for dense visual document understanding

820K downloads/month

31B parameters

Identifiers

Model ID

google/gemma-4-31B-it

Feature URI

mixpeek://image_extractor@v1/google_gemma4_31b_v1
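
The Feature URI above appears to pack an extractor name, a version, and a feature name into one string. The sketch below splits it into those parts. The grammar (`mixpeek://<extractor>@<version>/<feature>`) is an assumption inferred from this single example, and `parseFeatureUri` and `FeatureUri` are hypothetical names, not part of the Mixpeek SDK.

```typescript
// Hypothetical helper: the URI shape is inferred from the one example on
// this page, not from a published Mixpeek specification.
interface FeatureUri {
  extractor: string; // e.g. "image_extractor"
  version: string;   // e.g. "v1"
  feature: string;   // e.g. "google_gemma4_31b_v1"
}

function parseFeatureUri(uri: string): FeatureUri {
  // Assumed shape: mixpeek://<extractor>@<version>/<feature>
  const match = /^mixpeek:\/\/([^@\/]+)@([^\/]+)\/(.+)$/.exec(uri);
  if (match === null) {
    throw new Error("Unrecognized feature URI: " + uri);
  }
  return {
    extractor: match[1] as string,
    version: match[2] as string,
    feature: match[3] as string,
  };
}

const parsedUri = parseFeatureUri(
  "mixpeek://image_extractor@v1/google_gemma4_31b_v1"
);
console.log(parsedUri.extractor, parsedUri.version, parsedUri.feature);
```

Failing loudly on an unexpected shape is a deliberate choice here: if the real format differs from the assumed one, the error surfaces immediately instead of producing silently wrong fields.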

Overview

Gemma 4 31B is Google's dense vision-language model, currently ranked #3 among open models on the Arena AI text leaderboard. Unlike the MoE variant (26B-A4B), this dense model activates all 31B parameters for every token, delivering the highest quality in the family at a higher compute cost.

The 256K context window and built-in thinking mode make it particularly strong for complex document understanding tasks where accuracy matters more than throughput.

Architecture

Dense transformer architecture with 31B parameters, paired with a vision encoder that processes image patches. The context window is 256K tokens, and a thinking mode enables chain-of-thought reasoning for complex visual tasks.
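
To make the dense versus MoE compute tradeoff concrete, here is a rough back-of-the-envelope sketch. It uses the common rule of thumb that a forward pass costs about 2 FLOPs per active parameter per token. The 4B active-parameter figure for the MoE sibling (google/gemma-4-26B-A4B-it) is an assumption read off the "A4B" suffix in its name, and the estimate ignores attention and vision-encoder costs.

```typescript
// Rule-of-thumb estimate: forward-pass FLOPs per token ~= 2 * active params.
// This ignores attention over long contexts and the vision encoder.
function flopsPerToken(activeParams: number): number {
  return 2 * activeParams;
}

const DENSE_ACTIVE_PARAMS = 31e9; // dense model: all 31B parameters are active
const MOE_ACTIVE_PARAMS = 4e9;    // assumption: "A4B" means ~4B active parameters

const computeRatio =
  flopsPerToken(DENSE_ACTIVE_PARAMS) / flopsPerToken(MOE_ACTIVE_PARAMS);

console.log("Dense vs MoE per-token compute: ~" + computeRatio.toFixed(2) + "x");
// prints "Dense vs MoE per-token compute: ~7.75x"
```

Under these assumptions the dense model does several times more work per generated token, which is the cost side of the quality tradeoff described on this page.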

Mixpeek SDK Integration

import { Mixpeek } from "mixpeek";

const mx = new Mixpeek({ apiKey: "API_KEY" });

// Managed: create a collection over a bucket; Mixpeek runs this model's extractor
const collection = await mx.collections.create({
  namespace_id: "my-namespace",
  collection_name: "my-collection",
  source: { type: "bucket", bucket_ids: ["bkt_your_bucket"] },
  feature_extractor: {
    feature_extractor_name: "scene_caption",
    version: "v1",
    parameters: { model_id: "google/gemma-4-31B-it" },
  },
});

Capabilities

Highest-quality open VLM (Arena #3)
256K context window
Dense architecture for fine-tuning
Built-in reasoning mode
Apache 2.0 license

Use Cases on Mixpeek

High-accuracy visual document extraction where quality is critical

Complex chart and diagram understanding

Fine-tuning on domain-specific visual data (dense architecture)

Benchmarks

Dataset	Metric	Score	Source
MMLU Pro	Accuracy	85.2%	Google, May 2026
AIME 2026	Accuracy	89.2%	Google, May 2026
Arena AI Leaderboard	ELO	Top 3 open	Arena AI, May 2026

Performance

Input Size: Up to 256K tokens (text + image patches)

GPU Latency: ~280 ms / image (A100)

GPU Throughput: ~28 images/sec (A100, batch 4)

GPU Memory: ~62 GB (dense, full activation)
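
The figures above can be sanity-checked and turned into a rough capacity estimate. The sketch below assumes 16-bit (2-byte) weights, which would account for the ~62 GB memory figure from weights alone, and it assumes the listed throughput holds steady across a whole job. Both are assumptions, and the function name `hoursToProcess` is hypothetical.

```typescript
// Figures taken from this page.
const PARAMS = 31e9;
const LATENCY_MS_PER_IMAGE = 280;
const BATCHED_IMAGES_PER_SEC = 28;

// Assumption: weights stored in a 16-bit format (2 bytes per parameter).
const BYTES_PER_PARAM = 2;
const weightGB = (PARAMS * BYTES_PER_PARAM) / 1e9;
console.log("Weights alone: ~" + weightGB + " GB"); // prints "Weights alone: ~62 GB"

// One image at a time, the latency figure implies this rate.
const sequentialRate = 1000 / LATENCY_MS_PER_IMAGE;
console.log("Sequential: ~" + sequentialRate.toFixed(2) + " images/sec"); // ~3.57

// Wall-clock hours on one GPU for a job of a given size.
function hoursToProcess(images: number, imagesPerSec: number): number {
  return images / imagesPerSec / 3600;
}
console.log(
  "1M images at batched rate: ~" +
    hoursToProcess(1e6, BATCHED_IMAGES_PER_SEC).toFixed(1) + " h"
); // ~9.9 h
```

The batched throughput is higher than the single-image latency alone would suggest, so the two numbers likely come from different measurement setups. Treat any estimate built on them as approximate and benchmark on your own data.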

Common Pipeline Companions

google/gemma-4-26B-A4B-it

Compare MoE vs dense quality/cost tradeoff

Qwen/Qwen3-Embedding-4B

Embed dense captions for retrieval
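
As an illustration of the caption-then-embed pattern, the sketch below ranks captions against a query by cosine similarity. The three-dimensional vectors are toy values made up for the example, not real outputs of Qwen/Qwen3-Embedding-4B, and the captions are hypothetical. In practice the embedding model and vector store would handle this step.

```typescript
// Cosine similarity between two equal-length vectors.
function cosine(a: number[], b: number[]): number {
  if (a.length !== b.length) throw new Error("dimension mismatch");
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

interface CaptionRecord {
  caption: string;
  embedding: number[];
}

// Toy data: hypothetical captions with made-up 3-d embeddings.
const records: CaptionRecord[] = [
  { caption: "bar chart of quarterly revenue", embedding: [0.9, 0.1, 0.0] },
  { caption: "wiring diagram of a circuit", embedding: [0.1, 0.9, 0.1] },
  { caption: "scanned invoice with line items", embedding: [0.0, 0.2, 0.9] },
];

// Toy query embedding, standing in for an embedded query such as "revenue chart".
const queryEmbedding = [0.8, 0.2, 0.1];

// Score every caption, then sort from most to least similar.
const ranked = records
  .map((r) => ({ caption: r.caption, score: cosine(queryEmbedding, r.embedding) }))
  .sort((x, y) => y.score - x.score);

console.log(ranked[0].caption); // prints "bar chart of quarterly revenue"
```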

Specification

Framework: HF

Organization: google

Feature: Scene Captioning

Output: text

Modalities: video, image

Retriever: Semantic Search

Parameters: 31B

License: Apache-2.0

Downloads/mo: 820K

Research Paper

Gemma 4: Byte for byte, the most capable open models

Build a pipeline with gemma-4-31B-it

Add this model to a processing pipeline alongside other extractors. Combine with retrieval stages for end-to-end search.

Alternative Models

Salesforce/blip2-opt-2.7b

Scene Captioning

microsoft/Florence-2-large

Scene Captioning

google/paligemma2-3b-mix-448

Scene Captioning

Qwen/Qwen3-VL-8B-Instruct

Scene Captioning