MiniCPM-V-4_5

by openbmb

Best sub-30B vision-language model with 10FPS video understanding

116K downloads/month

8B parameters

Identifiers

Model ID

openbmb/MiniCPM-V-4_5

Feature URI

mixpeek://image_extractor@v1/openbmb_minicpm_v45_v1

Overview

MiniCPM-V 4.5 is an 8B-parameter vision-language model that scores 77.0 on OpenCompass, surpassing GPT-4o and models 10x its size. It is built on Qwen3-8B, with SigLIP2-400M as the vision encoder, and processes both images and video. A 96x video token compression scheme lets it take in video at 10 frames per second, which is fast enough for near-real-time scene captioning.

The model excels at detailed scene description, OCR, chart understanding, and multi-image reasoning, making it a strong choice for video decomposition pipelines where each scene needs a rich caption.
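As a rough illustration of what "each scene needs a rich caption" implies upstream of the model, the sketch below picks frame timestamps for one scene at 10 FPS. The `Scene` type and `sampleTimestamps` helper are hypothetical names made up for this example; they are not part of the Mixpeek SDK or the model's own tooling, and a real pipeline may sample frames differently.

```typescript
// Hypothetical helper: choose frame timestamps (in seconds) for one scene.
// Nothing here calls the model or the Mixpeek SDK.
interface Scene {
  start: number; // scene start, in seconds
  end: number;   // scene end, in seconds
}

function sampleTimestamps(scene: Scene, fps: number = 10): number[] {
  // Round the frame count so small floating-point errors in the
  // duration do not drop a frame; always keep at least one frame.
  const count = Math.max(1, Math.round((scene.end - scene.start) * fps));
  // Derive each timestamp from an integer index rather than repeated
  // addition, which avoids accumulating floating-point drift.
  return Array.from({ length: count }, (_, i) => scene.start + i / fps);
}

const timestamps = sampleTimestamps({ start: 2, end: 2.5 });
console.log(timestamps.length, timestamps[0]);
```

The frames at these timestamps would then be sent together as one captioning request for the scene.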

Architecture

The model pairs a Qwen3-8B language model with a SigLIP2-400M vision encoder. 96x video token compression enables video processing at 10 FPS. Multiple images and video frames can be passed in a single forward pass.
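To make the effect of the 96x compression concrete, here is a small back-of-the-envelope sketch of the token budget for a clip. Only the 96x ratio and the 10 FPS rate come from the description above. The figure of 1,024 raw visual tokens per frame is an assumption chosen for illustration (it corresponds to a 448x448 frame split into 14-pixel patches), and `videoTokenBudget` is a hypothetical helper, not an SDK function.

```typescript
// Illustrative token-budget arithmetic; not part of any SDK.
interface BudgetInput {
  durationSec: number;       // clip length in seconds
  fps: number;               // sampled frames per second
  rawTokensPerFrame: number; // ASSUMPTION: visual tokens per frame before compression
  compression: number;       // compression ratio (96 per the description above)
}

interface Budget {
  frames: number;
  rawTokens: number;
  compressedTokens: number;
}

function videoTokenBudget(input: BudgetInput): Budget {
  const frames = Math.round(input.durationSec * input.fps);
  const rawTokens = frames * input.rawTokensPerFrame;
  // Round up: a partial group of tokens still occupies a slot.
  const compressedTokens = Math.ceil(rawTokens / input.compression);
  return { frames, rawTokens, compressedTokens };
}

const oneMinute = videoTokenBudget({
  durationSec: 60,
  fps: 10,
  rawTokensPerFrame: 1024, // assumed value, see note above
  compression: 96,
});
console.log(oneMinute);
```

Under these assumptions, one minute of video at 10 FPS is 600 frames and 614,400 raw visual tokens, which the 96x scheme would reduce to 6,400 tokens. That reduction is what makes dense frame sampling practical within a language model's context window.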

Mixpeek SDK Integration

import { Mixpeek } from "mixpeek";

// Replace "API_KEY" with your Mixpeek API key (ideally loaded from an environment variable)
const mx = new Mixpeek({ apiKey: "API_KEY" });

// Managed: create a collection over a bucket; Mixpeek runs this model's extractor
const collection = await mx.collections.create({
  namespace_id: "my-namespace",
  collection_name: "my-collection",
  source: { type: "bucket", bucket_ids: ["bkt_your_bucket"] },
  feature_extractor: {
    feature_extractor_name: "scene_caption",
    version: "v1",
    parameters: { model_id: "openbmb/MiniCPM-V-4_5" },
  },
});