MiniCPM-o-4_5

by openbmb

9B omnimodal model: see, listen, and speak simultaneously with full-duplex streaming

100K downloads/month

9B params


Identifiers

Model ID

openbmb/MiniCPM-o-4_5

Feature URI

mixpeek://image_extractor@v1/openbmb_minicpm_o45_v1
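The Feature URI above appears to pack an extractor name, a version, and a feature name into one string. As a minimal sketch, it could be split like this. The segment names (`extractor`, `version`, `feature`) and the `parseFeatureUri` helper are assumptions made for illustration, not a documented Mixpeek schema or SDK function.

```typescript
// Hypothetical helper: the segment names are our reading of the URI shown
// above, not a documented Mixpeek schema.
interface FeatureUri {
  extractor: string; // e.g. "image_extractor"
  version: string;   // e.g. "v1"
  feature: string;   // e.g. "openbmb_minicpm_o45_v1"
}

function parseFeatureUri(uri: string): FeatureUri | null {
  // Assumed layout: mixpeek://<extractor>@<version>/<feature>
  const match = /^mixpeek:\/\/([^@\/]+)@([^\/]+)\/(.+)$/.exec(uri);
  if (match === null) return null;
  const [, extractor, version, feature] = match;
  if (extractor === undefined || version === undefined || feature === undefined) {
    return null;
  }
  return { extractor, version, feature };
}

const parsed = parseFeatureUri(
  "mixpeek://image_extractor@v1/openbmb_minicpm_o45_v1",
);
console.log(parsed);
```

Returning `null` for anything that does not match keeps malformed identifiers from silently propagating into pipeline configuration.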

Overview

MiniCPM-o 4.5 is OpenBMB's 9B-parameter omnimodal model. It processes text, image, video, and audio input simultaneously while generating text and speech output concurrently, end to end. Built on SigLIP2 (vision encoder), Whisper-medium (audio encoder), CosyVoice2 (speech decoder), and Qwen3-8B (language model), it supports full-duplex interaction: seeing, listening, and speaking at the same time without one stream blocking another.

With 9B parameters and an 11GB VRAM footprint under Int4 quantization, it surpasses GPT-4o on OpenCompass (77.6 average across 8 benchmarks) and approaches Gemini 2.5 Flash on vision-language tasks, according to OpenBMB's model card. On Mixpeek, MiniCPM-o 4.5 powers unified multimodal understanding pipelines that process video together with its audio, generating scene descriptions that account for both visual content and spoken dialogue in a single pass.

Architecture

End-to-end omnimodal architecture: SigLIP2 vision encoder + Whisper-medium audio encoder + Qwen3-8B language model + CosyVoice2 speech decoder, for 9B total parameters. It processes 1.8M-pixel images and 10 FPS video, applies 96x video token compression, and supports full-duplex real-time streaming.

Mixpeek SDK Integration

import { Mixpeek } from "mixpeek";

const mx = new Mixpeek({ apiKey: "API_KEY" });

// Managed: create a collection over a bucket; Mixpeek runs this model's extractor
const collection = await mx.collections.create({
  namespace_id: "my-namespace",
  collection_name: "my-collection",
  source: { type: "bucket", bucket_ids: ["bkt_your_bucket"] },
  feature_extractor: {
    feature_extractor_name: "scene_caption",
    version: "v1",
    parameters: { model_id: "openbmb/MiniCPM-o-4_5" },
  },
});
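When several collections use this model, the request body from the snippet above can be assembled by a small typed helper. The sketch below is illustrative only: the field names are copied from that snippet, while `CollectionRequest`, `ExtractorConfig`, and `buildCollectionRequest` are hypothetical names introduced here and are not part of the Mixpeek SDK.

```typescript
// Field names mirror the SDK snippet above; the types and helper are
// hypothetical and only describe the shape shown there.
interface ExtractorConfig {
  feature_extractor_name: string;
  version: string;
  parameters: { model_id: string };
}

interface CollectionRequest {
  namespace_id: string;
  collection_name: string;
  source: { type: "bucket"; bucket_ids: string[] };
  feature_extractor: ExtractorConfig;
}

const MODEL_ID = "openbmb/MiniCPM-o-4_5";

function buildCollectionRequest(
  namespaceId: string,
  collectionName: string,
  bucketIds: string[],
): CollectionRequest {
  // Fail early instead of sending a request with no source buckets.
  if (bucketIds.length === 0) {
    throw new Error("at least one bucket id is required");
  }
  return {
    namespace_id: namespaceId,
    collection_name: collectionName,
    source: { type: "bucket", bucket_ids: [...bucketIds] },
    feature_extractor: {
      feature_extractor_name: "scene_caption",
      version: "v1",
      parameters: { model_id: MODEL_ID },
    },
  };
}

const request = buildCollectionRequest("my-namespace", "my-collection", [
  "bkt_your_bucket",
]);
console.log(JSON.stringify(request, null, 2));
```

Assuming the SDK accepts the same shape as in the snippet above, the resulting object would be passed to `mx.collections.create(request)`.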

Capabilities

Full-duplex omnimodal: text, image, video, audio in; text and speech out
77.6 avg on OpenCompass: surpasses GPT-4o
10FPS video understanding with audio
Runs on 11GB VRAM (Int4 quantization)
Real-time streaming interaction without blocking
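To put the 11GB Int4 figure in the list above in context, a rough weight-only estimate can be sketched as follows. It assumes all 9B parameters are stored at the stated bit width and uses decimal gigabytes. The source does not break down the 11GB figure, so this is back-of-the-envelope arithmetic rather than a measurement.

```typescript
// Rough weight-only memory estimate. Assumes every parameter is stored at
// `bitsPerParam` bits and ignores activations, caches, and runtime overhead.
function weightMemoryGB(params: number, bitsPerParam: number): number {
  const bytes = params * (bitsPerParam / 8);
  return bytes / 1e9; // decimal gigabytes
}

const PARAMS = 9e9; // 9B parameters, as stated in this document

const int4GB = weightMemoryGB(PARAMS, 4);  // 4.5
const fp16GB = weightMemoryGB(PARAMS, 16); // 18

console.log(`Int4 weights: ~${int4GB} GB`);
console.log(`FP16 weights: ~${fp16GB} GB`);
```

Under these assumptions the weights alone would take about 4.5 GB at 4 bits, compared with about 18 GB at 16 bits. The remainder of the reported 11GB presumably covers components and runtime state that this estimate leaves out.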

Use Cases on Mixpeek

Video-with-audio captioning: generate descriptions that capture both visual scenes and spoken dialogue

Multimodal content understanding: process video calls, presentations, and lectures in a single pipeline

Interactive media analysis: query video content with natural language about what was seen and said

Benchmarks

Benchmark	Metric	Result	Source
OpenCompass (8 benchmarks)	Average score	77.6	OpenBMB, 2026: Model Card
OpenCompass (vision-language)	Comparison with GPT-4o	Surpasses GPT-4o	OpenBMB, 2026: Model Card
Int4 quantization	VRAM	11 GB	OpenBMB, 2026: Model Card