    HF · Scene Captioning · Apache-2.0

    MiniCPM-V-4_5

    by openbmb

    Best sub-30B vision-language model with 10FPS video understanding

    116K downloads/month
    8B parameters
    Identifiers
    Model ID
    openbmb/MiniCPM-V-4_5
    Feature URI
    mixpeek://image_extractor@v1/openbmb_minicpm_v45_v1
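
    The Feature URI above appears to pack an extractor name, a version, and a feature ID into one string. The sketch below splits it into those parts. The layout `mixpeek://<extractor>@<version>/<feature_id>` is an assumption inferred from this single example, and `FeatureUri` and `parseFeatureUri` are hypothetical names, not part of any SDK.

```typescript
// Hypothetical layout, inferred only from the one URI shown on this page:
//   mixpeek://<extractor>@<version>/<feature_id>
interface FeatureUri {
  extractor: string;
  version: string;
  featureId: string;
}

function parseFeatureUri(uri: string): FeatureUri {
  // Three capture groups: extractor, version, feature ID.
  const match = /^mixpeek:\/\/([^@\/]+)@([^\/]+)\/(.+)$/.exec(uri);
  if (match === null) {
    throw new Error(`Not a recognised feature URI: ${uri}`);
  }
  const [, extractor, version, featureId] = match;
  return { extractor, version, featureId };
}

const parsed = parseFeatureUri(
  "mixpeek://image_extractor@v1/openbmb_minicpm_v45_v1"
);
console.log(parsed);
```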

    Overview

    MiniCPM-V 4.5 is an 8B-parameter vision-language model that scores 77.0 on OpenCompass, surpassing GPT-4o and models 10x its size. It is built on Qwen3-8B with SigLIP2-400M as the vision encoder and processes both images and video. A 96x video token compression scheme lets it understand video at 10 frames per second, fast enough for near-real-time scene captioning.

    The model excels at detailed scene description, OCR, chart understanding, and multi-image reasoning, making it a strong choice for video decomposition pipelines where each scene needs a rich caption.

    Architecture

    The model pairs a Qwen3-8B language model with a SigLIP2-400M vision encoder. 96x video token compression enables video processing at 10 FPS, and multiple images and video frames can be handled in a single forward pass.
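
    To get a rough feel for what 96x compression means at 10 FPS, the sketch below estimates how many video tokens reach the language model for a clip of a given length. The compression ratio and frame rate come from the text above. The per-frame patch count of 1,024 (for example, a 448x448 frame cut into 14x14-pixel patches) is an assumption made for illustration, and `estimateVideoTokens` is a hypothetical helper.

```typescript
// Figures stated on this page.
const COMPRESSION_RATIO = 96;
const FRAMES_PER_SECOND = 10;

// ASSUMPTION (not stated on this page): 1,024 visual patches per frame
// before compression, e.g. a 448x448 frame with 14x14-pixel patches.
const ASSUMED_PATCHES_PER_FRAME = 1024;

function estimateVideoTokens(
  durationSeconds: number,
  fps: number = FRAMES_PER_SECOND,
  patchesPerFrame: number = ASSUMED_PATCHES_PER_FRAME,
  compressionRatio: number = COMPRESSION_RATIO
): number {
  // Number of whole frames sampled from the clip.
  const frames = Math.floor(durationSeconds * fps);
  // Patches across all frames, divided by the compression ratio.
  return Math.ceil((frames * patchesPerFrame) / compressionRatio);
}

const tokensPerMinute = estimateVideoTokens(60);
console.log(tokensPerMinute); // 6400 under the assumed patch count
```

    Under these assumptions, one minute of video costs about 6,400 tokens instead of 614,400 uncompressed patches, which is what makes dense frame sampling practical.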

    Mixpeek SDK Integration

    import { Mixpeek } from "mixpeek";

    const mx = new Mixpeek({ apiKey: "API_KEY" });

    await mx.collections.ingest({
      collection_id: "video-collection",
      source: { url: "https://example.com/video.mp4" },
      feature_extractors: [{
        feature: "scene_caption",
        model: "openbmb/MiniCPM-V-4_5"
      }]
    });
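
    When many videos are ingested, it can help to build the request object in one place. The sketch below mirrors the fields used in the call above as a typed helper. The `IngestRequest` and `FeatureExtractor` interfaces and the `buildCaptionIngest` function are hypothetical and only reflect the shape shown in this snippet. The SDK may accept other fields or export its own types.

```typescript
// Hypothetical types mirroring the request shape in the snippet above.
interface FeatureExtractor {
  feature: string;
  model: string;
}

interface IngestRequest {
  collection_id: string;
  source: { url: string };
  feature_extractors: FeatureExtractor[];
}

function buildCaptionIngest(collectionId: string, videoUrl: string): IngestRequest {
  // Reject obviously malformed inputs before any network call is made.
  if (!/^https?:\/\//.test(videoUrl)) {
    throw new Error(`Expected an http(s) URL, got: ${videoUrl}`);
  }
  return {
    collection_id: collectionId,
    source: { url: videoUrl },
    feature_extractors: [
      { feature: "scene_caption", model: "openbmb/MiniCPM-V-4_5" },
    ],
  };
}

const request = buildCaptionIngest(
  "video-collection",
  "https://example.com/video.mp4"
);
console.log(JSON.stringify(request, null, 2));
```

    The resulting object has the same shape as the argument passed to `mx.collections.ingest` above, so it could be handed to that call directly.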

    Capabilities

    • 77.0 on OpenCompass (surpasses GPT-4o)
    • 10 FPS video understanding via 96x token compression
    • Multi-image reasoning across frames
    • Strong OCR and chart/table understanding
    • Apache-2.0 license for commercial use

    Use Cases on Mixpeek

    • Video scene captioning: generate rich descriptions for each scene segment
    • Visual question answering over video content
    • Document understanding: extract structured data from complex layouts
    • Real-time agent perception: process video feeds at near-interactive speeds

    Specification

    Framework: HF
    Organization: openbmb
    Feature: Scene Captioning
    Output: text
    Modalities: video, image
    Retriever: Semantic Search
    Parameters: 8B
    License: Apache-2.0
    Downloads/month: 116K

    Research Paper

    MiniCPM-V 4.5

    arxiv.org

    Build a pipeline with MiniCPM-V-4_5

    Add this model to a processing pipeline alongside other extractors. Combine with retrieval stages for end-to-end search.
