MiniCPM-V-4.6

by openbmb

1B-parameter edge VLM that matches 2B-class quality on vision tasks

222K downloads/month

1B total (0.8B language + 0.4B vision) params

Identifiers

Model ID

openbmb/MiniCPM-V-4.6

Feature URI

mixpeek://image_extractor@v1/openbmb_minicpm_v46_v1

Overview

MiniCPM-V-4.6 is a 1B-parameter multimodal language model from OpenBMB designed for deployment on mobile and edge devices. Built on Qwen3.5-0.8B with a SigLIP2-400M vision encoder, it achieves performance comparable to models twice its size on vision-language benchmarks. It supports image understanding, video comprehension (up to 128 frames), OCR, and tool calling, all within a footprint small enough to run on smartphones.

Architecture

MiniCPM-V-4.6 is a frozen-tower vision-language model that combines a SigLIP2-400M image encoder with a Qwen3.5-0.8B language decoder. It uses mixed 4x/16x visual token compression to balance detail and efficiency, and it supports arbitrary image resolutions via dynamic tiling. Video input is processed at up to 128 frames with temporal position encoding.
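To make the compression ratios concrete, the sketch below estimates how many visual tokens one image might produce at 4x versus 16x compression. Only the 4x and 16x ratios come from the description above. The patch size (14 px), the tile size (448 px), and the helper names are illustrative assumptions, not confirmed values for MiniCPM-V-4.6.

```python
import math

# Assumed values for illustration only; the model card does not state them.
PATCH_SIZE = 14   # pixels per vision-encoder patch side (assumption)
TILE_SIZE = 448   # pixels per tile side used by dynamic tiling (assumption)


def tiles_for_image(width: int, height: int, tile: int = TILE_SIZE) -> int:
    """Number of tiles needed to cover the image, rounding up on each axis."""
    return math.ceil(width / tile) * math.ceil(height / tile)


def tokens_per_tile(compression: int, tile: int = TILE_SIZE,
                    patch: int = PATCH_SIZE) -> int:
    """Visual tokens left per tile after merging `compression` patches into one."""
    patches = (tile // patch) ** 2
    return patches // compression


def visual_tokens(width: int, height: int, compression: int) -> int:
    """Estimated visual tokens for a whole image at one compression ratio."""
    return tiles_for_image(width, height) * tokens_per_tile(compression)


if __name__ == "__main__":
    for ratio in (4, 16):
        print(f"{ratio}x compression: {visual_tokens(1344, 896, ratio)} tokens")
```

Under these assumptions a 1344x896 image needs six tiles, so moving from 4x to 16x compression cuts the estimate to a quarter. That is the detail-versus-efficiency trade-off the mixed scheme is meant to balance.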

Mixpeek SDK Integration

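from mixpeek import Mixpeek

# Authenticate with your Mixpeek API key.
mx = Mixpeek(api_key="YOUR_KEY")

# Ingest videos from an S3 prefix into the "ad_library" collection,
# captioning scenes with MiniCPM-V-4.6.
mx.ingest.videos(
    source="s3://ads/creatives/",
    collection="ad_library",
    feature_extractors=[{
        "name": "scene_caption",
        "model": "openbmb/MiniCPM-V-4.6",
        # max_frames stays within the model's 128-frame video limit.
        "params": {"max_frames": 64, "caption_detail": "detailed"}
    }]
)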
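The `max_frames` parameter in the example presumably caps how many frames are sampled from each video. How Mixpeek actually selects frames is not stated here, so the following is only a minimal sketch of one plausible strategy: evenly spaced sampling, with the cap clamped to the model's 128-frame limit. The helpers `sample_frame_indices` and `build_extractor` are hypothetical and are not part of the Mixpeek SDK.

```python
MODEL_MAX_FRAMES = 128  # upper bound stated in the model overview


def sample_frame_indices(total_frames: int, max_frames: int) -> list[int]:
    """Pick evenly spaced frame indices, at most `max_frames` (hypothetical helper)."""
    if total_frames <= 0 or max_frames <= 0:
        return []
    # Never request more frames than the video has or the model accepts.
    budget = min(max_frames, MODEL_MAX_FRAMES, total_frames)
    step = total_frames / budget
    return [int(i * step) for i in range(budget)]


def build_extractor(max_frames: int, caption_detail: str = "detailed") -> dict:
    """Build the feature-extractor dict from the example, validating max_frames."""
    if not 1 <= max_frames <= MODEL_MAX_FRAMES:
        raise ValueError(
            f"max_frames must be between 1 and {MODEL_MAX_FRAMES}, got {max_frames}"
        )
    return {
        "name": "scene_caption",
        "model": "openbmb/MiniCPM-V-4.6",
        "params": {"max_frames": max_frames, "caption_detail": caption_detail},
    }


if __name__ == "__main__":
    print(sample_frame_indices(10, 4))
    print(build_extractor(64))
```

Validating the frame budget on the client side before building the request is a design choice made for this sketch. It surfaces an out-of-range value early instead of relying on the service to reject it.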