MiniCPM-V-4.6
by openbmb
1B-parameter edge VLM that matches 2B-class quality on vision tasks
openbmb/MiniCPM-V-4.6
mixpeek://image_extractor@v1/openbmb_minicpm_v46_v1
Overview
MiniCPM-V-4.6 is a 1B-parameter multimodal language model from OpenBMB designed for deployment on mobile and edge devices. Built on Qwen3.5-0.8B with a SigLIP2-400M vision encoder, it achieves performance comparable to 2B-class models on vision-language benchmarks. It supports image understanding, video comprehension (up to 128 frames), OCR, and tool calling, all within a footprint small enough to run on smartphones.
Architecture
MiniCPM-V-4.6 is a frozen-tower vision-language model that combines a SigLIP2-400M image encoder with a Qwen3.5-0.8B language decoder. It uses mixed 4x/16x visual token compression to balance detail and efficiency, and it supports arbitrary image resolutions via dynamic tiling. Video input is processed at up to 128 frames with temporal position encoding.
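To make the 128-frame cap and the 4x/16x compression rates concrete, here is a minimal sketch in Python. It is not the model's actual preprocessing code: the uniform sampling strategy, the helper names `sample_frame_indices` and `tokens_after_compression`, and the 1024-patch tile used in the usage lines are all assumptions for illustration. How the two compression rates are mixed across tiles is not described here, so the sketch simply applies one rate to one patch count.

```python
import math

MAX_FRAMES = 128  # frame cap stated in the architecture description


def sample_frame_indices(total_frames: int, max_frames: int = MAX_FRAMES) -> list[int]:
    """Pick at most `max_frames` evenly spaced frame indices.

    Uniform sampling is an assumed strategy, not the model's documented one.
    """
    if total_frames <= 0:
        return []
    if total_frames <= max_frames:
        # Short clips fit under the cap, so every frame is kept.
        return list(range(total_frames))
    step = total_frames / max_frames
    return [int(i * step) for i in range(max_frames)]


def tokens_after_compression(patches: int, rate: int) -> int:
    """Number of visual tokens left after compressing `patches` by `rate`."""
    if rate not in (4, 16):
        raise ValueError("rate must be 4 or 16")
    return math.ceil(patches / rate)


if __name__ == "__main__":
    # Hypothetical 1024-patch tile; the real patch count per tile is not given.
    print(tokens_after_compression(1024, 4))   # 256
    print(tokens_after_compression(1024, 16))  # 64
    print(len(sample_frame_indices(3000)))     # 128
```

Under these assumptions, the 16x rate leaves a quarter of the tokens that the 4x rate does, which is the detail-versus-efficiency trade-off the description refers to.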
Mixpeek SDK Integration
    from mixpeek import Mixpeek

    mx = Mixpeek(api_key="YOUR_KEY")

    mx.ingest.videos(
        source="s3://ads/creatives/",
        collection="ad_library",
        feature_extractors=[
            {
                "name": "scene_caption",
                "model": "openbmb/MiniCPM-V-4.6",
                "params": {"max_frames": 64, "caption_detail": "detailed"},
            }
        ],
    )
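The extractor entry passed to `feature_extractors` is a plain dictionary, so it can be built and checked before any SDK call. The sketch below is a hypothetical helper, `build_caption_extractor`, which is not part of the Mixpeek SDK. It reuses only the field names from the snippet above and assumes that `max_frames` should stay within the model's stated 128-frame limit; whether Mixpeek enforces that limit itself is not stated here.

```python
MODEL_ID = "openbmb/MiniCPM-V-4.6"
MODEL_MAX_FRAMES = 128  # video frame limit stated in the overview


def build_caption_extractor(max_frames: int = 64, caption_detail: str = "detailed") -> dict:
    """Build one feature-extractor entry (hypothetical helper, not an SDK function)."""
    # Assumed validation: keep the request within the model's stated frame limit.
    if not 1 <= max_frames <= MODEL_MAX_FRAMES:
        raise ValueError(f"max_frames must be between 1 and {MODEL_MAX_FRAMES}")
    return {
        "name": "scene_caption",
        "model": MODEL_ID,
        "params": {"max_frames": max_frames, "caption_detail": caption_detail},
    }


if __name__ == "__main__":
    # The resulting list has the same shape as the `feature_extractors` argument above.
    extractors = [build_caption_extractor(max_frames=64)]
    print(extractors)
```

Validating the frame count locally means an out-of-range value fails fast on the client rather than during ingestion.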
Capabilities
- Image captioning and visual question answering
- Video understanding with multi-frame temporal reasoning
- Document OCR and structured text extraction
- Tool calling and agentic workflows
- On-device deployment (iOS, Android, HarmonyOS)
Benchmarks
| Dataset | Metric | Result | Notes |
|---|---|---|---|
| MMMU Pro | Accuracy | Matches Qwen3.5-2B level | At half the parameters |
| OCRBench | F1 | Competitive with 2B-class | Strong document text extraction |
Build a pipeline with MiniCPM-V-4.6
Add this model to a processing pipeline alongside other extractors. Combine with retrieval stages for end-to-end search.