MiniCPM-V-4_5
by openbmb
Best sub-30B vision-language model with 10FPS video understanding
openbmb/MiniCPM-V-4_5
mixpeek://image_extractor@v1/openbmb_minicpm_v45_v1
Overview
MiniCPM-V 4.5 is an 8B-parameter vision-language model that achieves 77.0 on OpenCompass, surpassing GPT-4o and models 10x its size. Built on Qwen3-8B with SigLIP2-400M as the vision encoder, it processes images and video with a 96x video token compression scheme that enables understanding video at 10 frames per second -- fast enough for near-real-time scene captioning.
The model excels at detailed scene description, OCR, chart understanding, and multi-image reasoning, making it a strong choice for video decomposition pipelines where each scene needs a rich caption.
Architecture
Qwen3-8B language model + SigLIP2-400M vision encoder. 96x video token compression enables 10FPS video processing. Supports multiple images and video frames in a single forward pass.
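To make the effect of the 96x compression concrete, here is a rough token-budget calculation. It is illustrative arithmetic only: the 96x ratio comes from the text above, while the function name `compressedVideoTokens` and the figure of 1024 raw visual tokens per frame are assumptions chosen for the example, not published specifications.

```typescript
// Compression ratio as stated in the architecture summary above.
const COMPRESSION_RATIO = 96;

// Hypothetical helper: estimates how many visual tokens reach the language
// model after compression. rawTokensPerFrame is an assumed input.
function compressedVideoTokens(
  durationSeconds: number,
  fps: number,
  rawTokensPerFrame: number,
): number {
  const frames = Math.ceil(durationSeconds * fps);
  const rawTokens = frames * rawTokensPerFrame;
  return Math.ceil(rawTokens / COMPRESSION_RATIO);
}

// A 60-second clip sampled at 10 FPS, assuming 1024 raw tokens per frame:
// 600 frames * 1024 tokens = 614,400 raw tokens before compression.
const tokens = compressedVideoTokens(60, 10, 1024);
console.log(tokens); // 6400
```

Under these assumptions, a one-minute clip at 10 FPS costs about 6,400 visual tokens instead of more than 600,000, which is what makes dense frame sampling practical within a single context window.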
Mixpeek SDK Integration
import { Mixpeek } from "mixpeek";

const mx = new Mixpeek({ apiKey: "API_KEY" });

await mx.collections.ingest({
  collection_id: "video-collection",
  source: { url: "https://example.com/video.mp4" },
  feature_extractors: [
    { feature: "scene_caption", model: "openbmb/MiniCPM-V-4_5" }
  ]
});
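When ingesting many videos, it can help to build the request payload in one place and check it before any network call. The sketch below mirrors the payload shape from the snippet above. The `IngestRequest` type and the `buildCaptionIngest` helper are hypothetical names introduced for illustration; they are not part of the SDK's published typings, and the exact shape the SDK accepts should be confirmed against its documentation.

```typescript
// Hypothetical types mirroring the payload shown above (not SDK typings).
interface FeatureExtractor {
  feature: string;
  model: string;
}

interface IngestRequest {
  collection_id: string;
  source: { url: string };
  feature_extractors: FeatureExtractor[];
}

const MODEL_ID = "openbmb/MiniCPM-V-4_5";

// Hypothetical helper: builds a scene-caption ingest payload and rejects
// inputs that are clearly unusable before they are sent anywhere.
function buildCaptionIngest(collectionId: string, videoUrl: string): IngestRequest {
  if (collectionId.trim() === "") {
    throw new Error("collectionId must not be empty");
  }
  if (!/^https?:\/\//.test(videoUrl)) {
    throw new Error(`Expected an http(s) URL, got: ${videoUrl}`);
  }
  return {
    collection_id: collectionId,
    source: { url: videoUrl },
    feature_extractors: [{ feature: "scene_caption", model: MODEL_ID }],
  };
}

const request = buildCaptionIngest("video-collection", "https://example.com/video.mp4");
console.log(JSON.stringify(request, null, 2));
```

The resulting object could then be passed to `mx.collections.ingest(request)`, assuming the SDK accepts the same shape as in the snippet above.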
Capabilities
- 77.0 on OpenCompass (surpasses GPT-4o)
- 10FPS video understanding via 96x token compression
- Multi-image reasoning across frames
- Strong OCR and chart/table understanding
- Apache-2.0 license for commercial use
Research Paper
MiniCPM-V 4.5
arxiv.org
Build a pipeline with MiniCPM-V-4_5
Add this model to a processing pipeline alongside other extractors. Combine with retrieval stages for end-to-end search.