    HF | Scene Captioning | Apache-2.0

    MiniCPM-o-4_5

    by openbmb

    9B omnimodal model — see, listen, and speak simultaneously with full-duplex streaming

    100K downloads/month
    9B params

    Identifiers

    Model ID: openbmb/MiniCPM-o-4_5
    Feature URI: mixpeek://image_extractor@v1/openbmb_minicpm_o45_v1
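The feature URI above appears to follow an `<extractor>@<version>/<feature>` layout. The sketch below splits it into those parts with the standard library. The field names `extractor`, `version`, and `feature` are labels assumed for illustration, not documented Mixpeek terminology, and `parse_feature_uri` is a hypothetical helper.

```python
from urllib.parse import urlsplit


def parse_feature_uri(uri: str) -> dict:
    """Split a mixpeek:// feature URI into its visible parts.

    The key names returned here are assumed labels, not official terms.
    """
    parts = urlsplit(uri)
    if parts.scheme != "mixpeek":
        raise ValueError(f"unexpected scheme: {parts.scheme!r}")
    # The authority section looks like "<extractor>@<version>".
    extractor, sep, version = parts.netloc.partition("@")
    if not sep or not extractor or not version:
        raise ValueError(f"expected '<extractor>@<version>' in {parts.netloc!r}")
    return {
        "extractor": extractor,
        "version": version,
        "feature": parts.path.lstrip("/"),
    }


parsed = parse_feature_uri("mixpeek://image_extractor@v1/openbmb_minicpm_o45_v1")
print(parsed)
```

Splitting on `@` by hand, rather than relying on the username and hostname attributes of the parsed result, keeps the original casing of both parts.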

    Overview

    MiniCPM-o 4.5 is OpenBMB's 9B-parameter omnimodal model that processes text, images, video, and audio input simultaneously while generating concurrent text and speech output in an end-to-end fashion. Built on SigLIP2 (vision), Whisper-medium (audio encoder), CosyVoice2 (speech decoder), and Qwen3-8B (language model), it supports full-duplex interaction — seeing, listening, and speaking at the same time without mutual blocking.

    At 9B parameters, the model fits in about 11 GB of VRAM with Int4 quantization (about 18 GB in FP16). It surpasses GPT-4o on OpenCompass (77.6 average across 8 benchmarks, per the OpenBMB model card) and approaches Gemini 2.5 Flash on vision-language tasks. On Mixpeek, MiniCPM-o 4.5 powers unified multimodal understanding pipelines that process video together with its audio, generating scene descriptions that account for both visual content and spoken dialogue in a single pass.

    Architecture

    The architecture is end-to-end omnimodal: a SigLIP2 vision encoder and a Whisper-medium audio encoder feed a Qwen3-8B language model, and a CosyVoice2 speech decoder produces the speech output. The model has 9B parameters in total, processes 1.8M-pixel images and 10 FPS video, compresses video tokens by 96x, and supports full-duplex real-time streaming.
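To make the 96x video token compression figure concrete, the following sketch works through the arithmetic for a small group of frames. The frame size (448x448) and patch size (14) are illustrative assumptions, not values stated on this page; only the 96x ratio comes from the text above.

```python
def patch_tokens(frame_side: int, patch_size: int) -> int:
    """Number of patch tokens for one square frame before compression."""
    per_side = frame_side // patch_size
    return per_side * per_side


def compressed_video_tokens(num_frames: int,
                            frame_side: int = 448,   # assumed frame size
                            patch_size: int = 14,    # assumed ViT patch size
                            ratio: int = 96) -> tuple:
    """Return (raw_tokens, compressed_tokens) for a group of frames."""
    raw = num_frames * patch_tokens(frame_side, patch_size)
    compressed = -(-raw // ratio)  # ceiling division
    return raw, compressed


raw, compressed = compressed_video_tokens(num_frames=6)
print(raw, compressed)  # 6144 64
```

Under these assumptions, six frames that would otherwise cost 6,144 tokens reach the language model as 64 tokens, which is what makes 10 FPS video input tractable for a 9B model.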

    Mixpeek SDK Integration

    from mixpeek import Mixpeek

    mx = Mixpeek(api_key="YOUR_KEY")

    # Caption each scene with MiniCPM-o 4.5, then embed the caption text
    # with bge-m3 so it can be searched semantically.
    mx.ingest(
        collection_id="media-library",
        source="s3://recordings/",
        extractors=[
            {
                "type": "scene_caption",
                "model": "openbmb/MiniCPM-o-4_5",
                "output_feature": "omni_caption",
            },
            {
                "type": "text_embedding",
                "model": "BAAI/bge-m3",
                "input_field": "omni_caption",
                "output_feature": "caption_embedding",
            },
        ],
    )
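In the snippet above, the second extractor reads the field that the first one writes. A small pre-flight check can catch a misspelled or misordered field name before an ingest job is submitted. This is a minimal sketch in plain Python that does not call the Mixpeek SDK. `validate_extractor_chain` is a hypothetical helper, and it assumes that every `input_field` must match the `output_feature` of an earlier extractor, which the page implies but does not state.

```python
from typing import Dict, List


def validate_extractor_chain(extractors: List[Dict[str, str]]) -> List[str]:
    """Return human-readable problems found in an ordered extractor list."""
    produced = set()  # output_feature names seen so far
    problems = []
    for index, extractor in enumerate(extractors):
        kind = extractor.get("type", "<missing type>")
        input_field = extractor.get("input_field")
        # Assumption: an input_field must name an earlier output_feature.
        if input_field is not None and input_field not in produced:
            problems.append(
                f"extractor {index} ({kind}) reads '{input_field}' "
                "before any extractor produces it"
            )
        output_feature = extractor.get("output_feature")
        if output_feature is None:
            problems.append(f"extractor {index} ({kind}) has no output_feature")
        elif output_feature in produced:
            problems.append(
                f"extractor {index} ({kind}) reuses output name '{output_feature}'"
            )
        else:
            produced.add(output_feature)
    return problems


extractors = [
    {"type": "scene_caption", "model": "openbmb/MiniCPM-o-4_5",
     "output_feature": "omni_caption"},
    {"type": "text_embedding", "model": "BAAI/bge-m3",
     "input_field": "omni_caption", "output_feature": "caption_embedding"},
]
print(validate_extractor_chain(extractors))  # prints []
```

Reversing the two extractors would produce one problem, because the embedding step would then read `omni_caption` before the captioning step has written it.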

    Capabilities

    • Full-duplex omnimodal: text, image, video, audio in; text and speech out
    • 77.6 avg on OpenCompass — surpasses GPT-4o
    • 10FPS video understanding with audio
    • Runs on 11GB VRAM (Int4 quantization)
    • Real-time streaming interaction without blocking

    Use Cases on Mixpeek

    • Video-with-audio captioning: generate descriptions that capture both visual scenes and spoken dialogue
    • Multimodal content understanding: process video calls, presentations, and lectures in a single pipeline
    • Interactive media analysis: query video content with natural language about what was seen and said

    Benchmarks

    Dataset | Metric | Score | Source
    OpenCompass (8 benchmarks) | Average | 77.6 | OpenBMB model card (2026)
    vs GPT-4o (vision-language) | OpenCompass | Surpasses GPT-4o | OpenBMB model card (2026)
    VRAM (Int4 quantization) | Memory | 11 GB | OpenBMB model card (2026)

    Performance

    Input Size: Text + images (1.8M px) + video (10 FPS) + audio
    GPU Latency: ~150 ms / frame (A100, full omni pipeline)
    GPU Throughput: ~10 FPS video processing (A100)
    GPU Memory: ~18 GB (FP16) / ~11 GB (Int4)
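The GPU memory figures can be sanity-checked with a back-of-the-envelope estimate of weight storage alone. The sketch below assumes every one of the 9B parameters is stored at the stated precision and counts 1 GB as 10^9 bytes. It ignores activations, the KV cache, and any components left unquantized, so it gives a lower bound, not a measurement.

```python
def weight_memory_gb(params: float, bits_per_param: int) -> float:
    """Memory needed to hold the weights alone, in GB (10^9 bytes)."""
    return params * bits_per_param / 8 / 1e9


PARAMS = 9e9  # 9B parameters, as listed on this page

fp16_gb = weight_memory_gb(PARAMS, 16)
int4_gb = weight_memory_gb(PARAMS, 4)
print(fp16_gb, int4_gb)  # 18.0 4.5
```

The FP16 estimate lines up with the ~18 GB listed above. The Int4 estimate of 4.5 GB sits well below the listed ~11 GB. A plausible reading, though one this page does not confirm, is that the gap covers runtime buffers and parts of the model kept at higher precision.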

    Specification

    Framework: HF
    Organization: openbmb
    Feature: Scene Captioning
    Output: text
    Modalities: video, image
    Retriever: Semantic Search
    Parameters: 9B
    License: Apache-2.0
    Downloads/mo: 100K

    Research Paper

    MiniCPM-o 4.5: Towards Real-Time Full-Duplex Omni-Modal Interaction (arxiv.org)

    Build a pipeline with MiniCPM-o-4_5

    Add this model to a processing pipeline alongside other extractors. Combine with retrieval stages for end-to-end search.
