MiniCPM-o-4_5
by openbmb
9B omnimodal model — see, listen, and speak simultaneously with full-duplex streaming
openbmb/MiniCPM-o-4_5
mixpeek://image_extractor@v1/openbmb_minicpm_o45_v1
Overview
MiniCPM-o 4.5 is OpenBMB's 9B-parameter omnimodal model that processes text, images, video, and audio input simultaneously while generating concurrent text and speech output in an end-to-end fashion. Built on SigLIP2 (vision), Whisper-medium (audio encoder), CosyVoice2 (speech decoder), and Qwen3-8B (language model), it supports full-duplex interaction — seeing, listening, and speaking at the same time without mutual blocking.
With only 9B parameters and 11GB VRAM (Int4 quantization), it surpasses GPT-4o on OpenCompass (77.6 avg across 8 benchmarks) and approaches Gemini 2.5 Flash for vision-language tasks. On Mixpeek, MiniCPM-o 4.5 powers unified multimodal understanding pipelines that need to process video with audio — generating scene descriptions that account for both visual content and spoken dialogue in a single pass.
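To put the 11GB figure in context, the short calculation below estimates how much memory the weights alone would need at different precisions. It is a back-of-the-envelope sketch: the 9B parameter count comes from the text above, while the helper name `weight_memory_gb` and the assumption that every parameter is stored at the same bit width are illustrative only. At 4 bits the weights come to roughly 4.5 GB, so the rest of the quoted 11GB would presumably be taken up by things such as activations, caches, and any components kept at higher precision; that breakdown is an assumption, not something the model card states here.

```python
def weight_memory_gb(num_params: float, bits_per_param: int) -> float:
    """Memory needed to store the weights alone, in GB (10^9 bytes).

    Assumes every parameter uses the same bit width, which real
    quantization schemes may not do.
    """
    return num_params * bits_per_param / 8 / 1e9


NUM_PARAMS = 9e9  # "9B parameters", taken from the overview text

for label, bits in [("FP16/BF16", 16), ("Int8", 8), ("Int4", 4)]:
    print(f"{label:>9}: {weight_memory_gb(NUM_PARAMS, bits):.1f} GB")
```

The gap between the weight-only estimate and the quoted VRAM figure is a reminder to budget for runtime overhead, not just parameter storage, when sizing a GPU.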
Architecture
End-to-end omnimodal architecture: SigLIP2 vision encoder + Whisper-medium audio encoder + Qwen3-8B language model + CosyVoice2 speech decoder. 9B total parameters. Processes 1.8M pixel images and 10FPS video. 96x video token compression. Supports full-duplex real-time streaming.
Mixpeek SDK Integration
from mixpeek import Mixpeek

mx = Mixpeek(api_key="YOUR_KEY")

mx.ingest(
    collection_id="media-library",
    source="s3://recordings/",
    extractors=[
        {
            "type": "scene_caption",
            "model": "openbmb/MiniCPM-o-4_5",
            "output_feature": "omni_caption",
        },
        {
            "type": "text_embedding",
            "model": "BAAI/bge-m3",
            "input_field": "omni_caption",
            "output_feature": "caption_embedding",
        },
    ],
)
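In the snippet above, the second extractor reads the `omni_caption` feature that the first one writes, so the order of the list matters. The sketch below checks that dependency locally before anything is sent to the service. The helper `check_extractor_chain` is hypothetical and not part of the Mixpeek SDK, and it assumes that an `input_field` can only refer to an `output_feature` produced by an earlier extractor; if the platform also lets `input_field` point at fields of the source object, the check would need to be relaxed.

```python
from typing import Dict, List


def check_extractor_chain(extractors: List[Dict[str, str]]) -> List[str]:
    """Return a list of problems; an empty list means the chain is consistent.

    Hypothetical helper: it only checks that each `input_field` names an
    `output_feature` written by an extractor earlier in the list.
    """
    problems = []
    produced = set()
    for position, extractor in enumerate(extractors):
        needed = extractor.get("input_field")
        if needed is not None and needed not in produced:
            problems.append(
                f"extractor {position} ({extractor.get('type')}) "
                f"reads '{needed}' before it is produced"
            )
        output = extractor.get("output_feature")
        if output is not None:
            produced.add(output)
    return problems


# Same extractor list as in the SDK snippet above.
extractors = [
    {
        "type": "scene_caption",
        "model": "openbmb/MiniCPM-o-4_5",
        "output_feature": "omni_caption",
    },
    {
        "type": "text_embedding",
        "model": "BAAI/bge-m3",
        "input_field": "omni_caption",
        "output_feature": "caption_embedding",
    },
]

print(check_extractor_chain(extractors))
```

Running the same check on the list in reverse order reports one problem, because the embedding step would then ask for a caption that has not been generated yet.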
Capabilities
- Full-duplex omnimodal: text, image, video, audio in; text and speech out
- 77.6 avg on OpenCompass — surpasses GPT-4o
- 10FPS video understanding with audio
- Runs on 11GB VRAM (Int4 quantization)
- Real-time streaming interaction without blocking
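As a rough illustration of what "without blocking" means for full-duplex interaction, the toy sketch below runs a listening loop and a speaking loop concurrently with Python's `asyncio`, so output can begin before all input has arrived. It does not call the model or any Mixpeek API; the coroutine names (`feed`, `listen`, `speak`) and the string chunks are placeholders chosen for this example.

```python
import asyncio
from typing import List, Optional


async def feed(incoming: asyncio.Queue, chunks: List[str]) -> None:
    """Simulate a microphone pushing audio chunks, then signal end of input."""
    for chunk in chunks:
        await incoming.put(chunk)
        await asyncio.sleep(0)  # yield so other tasks can run
    await incoming.put(None)


async def listen(incoming: asyncio.Queue, log: List[str]) -> None:
    """Consume incoming chunks as they arrive."""
    while True:
        chunk: Optional[str] = await incoming.get()
        if chunk is None:
            break
        log.append(f"heard {chunk}")


async def speak(chunks: List[str], log: List[str]) -> None:
    """Emit response chunks without waiting for listening to finish."""
    for chunk in chunks:
        log.append(f"said {chunk}")
        await asyncio.sleep(0)  # yield so other tasks can run


async def demo() -> List[str]:
    log: List[str] = []
    incoming: asyncio.Queue = asyncio.Queue()
    # All three coroutines make progress in the same time window.
    await asyncio.gather(
        feed(incoming, ["a1", "a2", "a3"]),
        listen(incoming, log),
        speak(["r1", "r2", "r3"], log),
    )
    return log


log = asyncio.run(demo())
print(log)
```

In a half-duplex design the speaking loop would only start after the listening loop had finished; here the two are interleaved in the log, which is the behavior the capability list describes at a much larger scale.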
Use Cases on Mixpeek
Benchmarks
| Dataset | Metric | Score | Source |
|---|---|---|---|
| OpenCompass (8 benchmarks) | Average | 77.6 | OpenBMB, 2026 — Model Card |
| vs GPT-4o (vision-language) | OpenCompass | Surpasses GPT-4o | OpenBMB, 2026 — Model Card |
| VRAM (Int4 quantization) | Memory | 11 GB | OpenBMB, 2026 — Model Card |
Performance
Specification
Research Paper
MiniCPM-o 4.5: Towards Real-Time Full-Duplex Omni-Modal Interaction
arxiv.org
Build a pipeline with MiniCPM-o-4_5
Add this model to a processing pipeline alongside other extractors. Combine with retrieval stages for end-to-end search.
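To make the idea of a retrieval stage concrete, here is a minimal sketch of ranking stored caption embeddings against a query embedding by cosine similarity. It is a generic illustration, not Mixpeek's retrieval implementation: the functions `cosine` and `top_k`, the clip identifiers, and the three-dimensional vectors are all made up for this example, and real caption embeddings would have far more dimensions.

```python
import math
from typing import Dict, List


def cosine(a: List[float], b: List[float]) -> float:
    """Cosine similarity between two non-zero vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k(query: List[float], index: Dict[str, List[float]], k: int = 2) -> List[str]:
    """Return the ids of the k vectors most similar to the query."""
    scored = [(cosine(query, vector), doc_id) for doc_id, vector in index.items()]
    scored.sort(reverse=True)
    return [doc_id for _, doc_id in scored[:k]]


# Toy stand-ins for stored caption embeddings (illustrative values only).
index = {
    "clip_a": [1.0, 0.0, 0.0],
    "clip_b": [0.0, 1.0, 0.0],
    "clip_c": [0.7, 0.7, 0.0],
}
query = [0.9, 0.1, 0.0]

print(top_k(query, index))
```

A production pipeline would normally delegate this step to a vector index instead of scanning every embedding, but the ranking principle is the same.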