    HF, Scene Captioning, Apache 2.0

    MiniCPM-V-4.6

    by openbmb

    1B-parameter edge VLM that matches 2B-class quality on vision tasks

    Downloads: 222K/month
    Params: 1B total (0.8B language + 0.4B vision)
    Identifiers
    Model ID: openbmb/MiniCPM-V-4.6
    Feature URI: mixpeek://image_extractor@v1/openbmb_minicpm_v46_v1

    Overview

    MiniCPM-V-4.6 is a 1B-parameter multimodal language model from OpenBMB designed for deployment on mobile and edge devices. Built on Qwen3.5-0.8B with a SigLIP2-400M vision encoder, it achieves performance comparable to models twice its size on vision-language benchmarks. It supports image understanding, video comprehension (up to 128 frames), OCR, and tool calling, all within a footprint that runs on smartphones.

    Architecture

    MiniCPM-V-4.6 is a frozen-tower vision-language model that combines a SigLIP2-400M image encoder with a Qwen3.5-0.8B language decoder. It uses mixed 4x/16x visual token compression to balance detail and efficiency, and supports arbitrary image resolutions via dynamic tiling. Video input is limited to 128 frames and uses temporal position encoding.
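
    The 128-frame limit means longer clips have to be subsampled before they reach the model. The page does not say how frames are selected, so the sketch below assumes plain uniform sampling; the function name and the midpoint-of-bin strategy are illustrative, not the model's documented behavior.

```python
def sample_frame_indices(total_frames: int, max_frames: int = 128) -> list[int]:
    """Pick at most max_frames evenly spaced frame indices from a clip.

    Assumption: uniform sampling. The actual preprocessing used by
    MiniCPM-V-4.6 or Mixpeek may differ.
    """
    if max_frames <= 0:
        raise ValueError("max_frames must be positive")
    if total_frames <= 0:
        return []
    if total_frames <= max_frames:
        # Short clip: keep every frame.
        return list(range(total_frames))
    # Split the clip into max_frames equal-width bins and take each bin's midpoint.
    step = total_frames / max_frames
    return [int(step * i + step / 2) for i in range(max_frames)]
```

    For a 3,000-frame clip this returns 128 strictly increasing indices spread across the whole clip, so the temporal ordering the decoder relies on is preserved.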

    Mixpeek SDK Integration

    from mixpeek import Mixpeek

    mx = Mixpeek(api_key="YOUR_KEY")  # replace with your API key

    # Caption every video under the S3 prefix with MiniCPM-V-4.6
    mx.ingest.videos(
        source="s3://ads/creatives/",
        collection="ad_library",
        feature_extractors=[{
            "name": "scene_caption",
            "model": "openbmb/MiniCPM-V-4.6",
            "params": {"max_frames": 64, "caption_detail": "detailed"},
        }],
    )
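
    When the extractor entry is built in several places, it can help to generate it from one function and check the frame budget up front. The sketch below is a hypothetical helper, not part of the Mixpeek SDK; it assumes that max_frames should stay within the model's stated 128-frame limit, which the page does not confirm for the extractor itself.

```python
MAX_MODEL_FRAMES = 128  # per-video frame limit stated in the model overview


def scene_caption_extractor(max_frames: int = 64,
                            caption_detail: str = "detailed") -> dict:
    """Build one feature_extractors entry (hypothetical helper, not an SDK call).

    Assumption: max_frames must not exceed the model's 128-frame limit.
    """
    if not 1 <= max_frames <= MAX_MODEL_FRAMES:
        raise ValueError(
            f"max_frames must be between 1 and {MAX_MODEL_FRAMES}, got {max_frames}"
        )
    return {
        "name": "scene_caption",
        "model": "openbmb/MiniCPM-V-4.6",
        "params": {"max_frames": max_frames, "caption_detail": caption_detail},
    }
```

    The returned dictionary has the same shape as the entry in the snippet above, so it could be passed as feature_extractors=[scene_caption_extractor(max_frames=64)].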

    Capabilities

    • Image captioning and visual question answering
    • Video understanding with multi-frame temporal reasoning
    • Document OCR and structured text extraction
    • Tool calling and agentic workflows
    • On-device deployment (iOS, Android, HarmonyOS)

    Use Cases on Mixpeek

    • High-throughput image/video captioning pipelines
    • Mobile and edge visual AI applications
    • Cost-efficient scene description at scale
    • Document understanding in resource-constrained environments

    Benchmarks

    Dataset     Metric      Score                        Notes
    MMMU Pro    Accuracy    Matches Qwen3.5-2B level     At half the parameters
    OCRBench    F1          Competitive with 2B-class    Strong document text extraction

    Performance

    Input Size: Variable
    GPU Latency: Input dependent
    GPU Throughput: ~120 images/sec (A100, batch 32)
    GPU Memory: ~2.2 GB
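
    The throughput figure can be turned into a rough capacity estimate. The sketch below is back-of-the-envelope arithmetic only: it assumes the quoted ~120 images/sec is sustained on a single A100 and ignores decoding, I/O, and variable input size, all of which would lower real-world throughput. The function name is illustrative.

```python
def estimate_gpu_hours(num_images: int, images_per_sec: float = 120.0) -> float:
    """Rough single-GPU hours needed to caption num_images.

    Assumption: throughput is constant at images_per_sec, using the
    ~120 images/sec (A100, batch 32) figure quoted above.
    """
    if images_per_sec <= 0:
        raise ValueError("images_per_sec must be positive")
    seconds = num_images / images_per_sec
    return seconds / 3600.0


# One million images: 1,000,000 / 120 is about 8,333 s, or roughly 2.3 GPU-hours.
hours_for_one_million = estimate_gpu_hours(1_000_000)
```

    Under the same assumption, a video sampled at 64 frames counts as 64 images, so one million such videos would need about 64 times as many GPU-hours.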

    Specification

    Framework: HF
    Organization: openbmb
    Feature: Scene Captioning
    Output: text
    Modalities: video, image
    Retriever: Semantic Search
    Parameters: 1B total (0.8B language + 0.4B vision)
    License: Apache 2.0
    Downloads/mo: 222K

    Build a pipeline with MiniCPM-V-4.6

    Add this model to a processing pipeline alongside other extractors. Combine with retrieval stages for end-to-end search.
