Phi-4-multimodal-instruct
by microsoft
5.6B multimodal model processing text, images, and speech in a single architecture
microsoft/Phi-4-multimodal-instruct
mixpeek://image_extractor@v1/microsoft_phi4_multimodal_v1
Overview
Phi-4 Multimodal Instruct is Microsoft's 5.6B-parameter foundation model that unifies text, vision, and speech understanding in a single architecture. It builds on the Phi-4-mini backbone, adding vision and speech encoders with LoRA adapters for each modality. At release it ranked #1 on the Hugging Face Open ASR Leaderboard with a 6.14% word error rate (WER), and Microsoft describes it as the first open-source model capable of speech summarization.
On Mixpeek, Phi-4 Multimodal enables unified processing of mixed-media content where text, images, and audio need to be understood together. Its compact 5.6B size makes it deployable on edge devices while delivering competitive performance against much larger models on document understanding, visual QA, and speech recognition tasks.
Architecture
The model pairs a Phi-4-mini language model backbone with vision and speech encoders, connected through LoRA adapters, for 5.6B parameters in total. It supports a 128K-token context length and accepts text, image, and audio input simultaneously. It was trained on 5T text tokens, 2.3M hours of speech, and 1.1T image-text tokens.
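To make the role of the LoRA adapters concrete, here is a toy sketch of the generic LoRA idea: a frozen base weight matrix `W` is left untouched, and a low-rank update `B(Ax)` is added when an adapter is active. This illustrates the general technique only and is not Microsoft's implementation; the names `LoraAdapter`, `adapterFor`, and `forward`, the tiny dimensions, and the assumption that text-only input uses no adapter are all introduced here for illustration.

```typescript
type Matrix = number[][];
type Modality = "text" | "vision" | "speech";

// Low-rank adapter: a is (r x dIn), b is (dOut x r). All values here are illustrative.
interface LoraAdapter {
  a: Matrix;
  b: Matrix;
  scale: number;
}

// Plain matrix-vector product.
function matVec(m: Matrix, v: number[]): number[] {
  return m.map((row) => row.reduce((sum, w, k) => sum + w * v[k], 0));
}

// Assumption for this sketch: text-only input uses the base weights with no adapter.
function adapterFor(
  modality: Modality,
  adapters: { vision: LoraAdapter; speech: LoraAdapter }
): LoraAdapter | undefined {
  return modality === "text" ? undefined : adapters[modality];
}

// y = W x + scale * B (A x); the base weights w are never modified.
function forward(w: Matrix, x: number[], adapter?: LoraAdapter): number[] {
  const base = matVec(w, x);
  if (!adapter) return base;
  const delta = matVec(adapter.b, matVec(adapter.a, x));
  return base.map((y, i) => y + adapter.scale * delta[i]);
}
```

Because the adapter only adds a rank-`r` term, each modality can carry its own small set of extra weights while sharing one backbone, which is the property the architecture description relies on.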
Mixpeek SDK Integration
import { Mixpeek } from "mixpeek";

const mx = new Mixpeek({ apiKey: "API_KEY" });

await mx.collections.ingest({
  collection_id: "my-collection",
  source: { url: "https://example.com/presentation.mp4" },
  feature_extractors: [
    {
      name: "scene_caption",
      version: "v1",
      params: {
        model_id: "microsoft/Phi-4-multimodal-instruct"
      }
    }
  ]
});
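When several files share the same extractor configuration, the request object passed to `mx.collections.ingest` above could be built by a small helper. The sketch below is an assumption-laden illustration: the `IngestRequest` and `FeatureExtractor` types and the `buildIngestRequests` helper are hypothetical names introduced here, only the field names are taken from the snippet above, and the SDK's actual types may differ.

```typescript
// Hypothetical types mirroring the request shape used in the ingest call above.
interface FeatureExtractor {
  name: string;
  version: string;
  params: { model_id: string };
}

interface IngestRequest {
  collection_id: string;
  source: { url: string };
  feature_extractors: FeatureExtractor[];
}

const PHI4_MODEL_ID = "microsoft/Phi-4-multimodal-instruct";

// Builds one request per URL, all using the scene_caption extractor with Phi-4 Multimodal.
function buildIngestRequests(collectionId: string, urls: string[]): IngestRequest[] {
  return urls.map((url) => {
    // Basic sanity check only; this does not fully validate the URL.
    if (!/^https?:\/\//.test(url)) {
      throw new Error(`Unsupported source URL: ${url}`);
    }
    return {
      collection_id: collectionId,
      source: { url },
      feature_extractors: [
        { name: "scene_caption", version: "v1", params: { model_id: PHI4_MODEL_ID } }
      ]
    };
  });
}
```

Each returned object could then be passed to `mx.collections.ingest` in a loop, assuming the call accepts the same shape as in the snippet above.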
Capabilities
- Unified text + image + speech understanding in one model
- #1 on Open ASR Leaderboard at release (6.14% WER)
- 128K token context length
- DocVQA: 93.2%, MMBench: 86.7%, OCRBench: 84.4%
- First open-source model with speech summarization
Benchmarks
| Dataset | Metric | Score | Source |
|---|---|---|---|
| HF Open ASR Leaderboard | WER | 6.14% | Microsoft, Mar 2025 — Model Card |
| DocVQA | Accuracy | 93.2% | Microsoft, Mar 2025 — Model Card |
| MMBench | Accuracy | 86.7% | Microsoft, Mar 2025 — Model Card |
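For readers unfamiliar with the WER metric in the table, the following minimal sketch computes word error rate as the word-level edit distance (substitutions + deletions + insertions) divided by the number of reference words. The function name `wordErrorRate` is introduced here for illustration. Leaderboard scoring typically also normalizes text (casing, punctuation) before comparison, which this sketch does not do, so it will not reproduce the reported figure.

```typescript
// Word error rate: (substitutions + deletions + insertions) / reference word count,
// computed with a word-level Levenshtein distance using two rolling rows.
function wordErrorRate(reference: string, hypothesis: string): number {
  const ref = reference.trim().split(/\s+/).filter((w) => w.length > 0);
  const hyp = hypothesis.trim().split(/\s+/).filter((w) => w.length > 0);
  if (ref.length === 0) {
    throw new Error("reference must contain at least one word");
  }

  // prev[j] holds the distance between the reference prefix processed so far
  // and the first j hypothesis words.
  let prev: number[] = Array.from({ length: hyp.length + 1 }, (_, j) => j);
  for (let i = 1; i <= ref.length; i++) {
    const curr: number[] = [i];
    for (let j = 1; j <= hyp.length; j++) {
      const cost = ref[i - 1] === hyp[j - 1] ? 0 : 1;
      curr[j] = Math.min(
        prev[j] + 1, // deletion: reference word missing from hypothesis
        curr[j - 1] + 1, // insertion: extra word in hypothesis
        prev[j - 1] + cost // match or substitution
      );
    }
    prev = curr;
  }
  return prev[hyp.length] / ref.length;
}
```

As a reading aid, a WER of 6.14% corresponds to roughly six word-level errors per hundred reference words.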
Research Paper
Phi-4 Technical Report
arxiv.org
Build a pipeline with Phi-4-multimodal-instruct
Add this model to a processing pipeline alongside other extractors. Combine with retrieval stages for end-to-end search.