Phi-4-multimodal-instruct

by microsoft

5.6B multimodal model processing text, images, and speech in a single architecture

391K dl/month

5.6B params


Identifiers

Model ID

microsoft/Phi-4-multimodal-instruct

Feature URI

mixpeek://image_extractor@v1/microsoft_phi4_multimodal_v1
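
The Feature URI above appears to follow a `mixpeek://<extractor>@<version>/<feature>` layout. That layout is inferred from this single example, not from Mixpeek documentation, so the sketch below is illustrative only. `parseFeatureUri` and the `FeatureUri` field names are hypothetical helpers and are not part of the Mixpeek SDK.

```typescript
// Hypothetical helper: splits a feature URI into its apparent parts.
// Assumes the layout mixpeek://<extractor>@<version>/<feature>,
// which is inferred from the one URI shown on this page.
interface FeatureUri {
  extractor: string;
  version: string;
  feature: string;
}

function parseFeatureUri(uri: string): FeatureUri {
  const match = /^mixpeek:\/\/([^@\/]+)@([^\/]+)\/(.+)$/.exec(uri);
  if (match === null) {
    throw new Error(`Unrecognized feature URI: ${uri}`);
  }
  const [, extractor, version, feature] = match;
  return { extractor, version, feature };
}

const parsed = parseFeatureUri(
  "mixpeek://image_extractor@v1/microsoft_phi4_multimodal_v1"
);
console.log(parsed.extractor, parsed.version, parsed.feature);
```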

Overview

Phi-4 Multimodal Instruct is Microsoft's 5.6B-parameter foundation model that unifies text, vision, and speech understanding in a single architecture. It is built on the Phi-4-mini backbone, with vision and audio encoders attached through LoRA adapters. At release it ranked #1 on the HuggingFace Open ASR Leaderboard with a 6.14% word error rate (WER), and it is the first open-source model capable of speech summarization.

On Mixpeek, Phi-4 Multimodal enables unified processing of mixed-media content where text, images, and audio need to be understood together. At 5.6B parameters it is compact enough to deploy on edge devices, while remaining competitive with much larger models on document understanding, visual QA, and speech recognition tasks.

Architecture

The model pairs a Phi-4-mini language model backbone with vision and speech encoders connected via LoRA adapters, for 5.6B parameters in total. It supports a 128K-token context length and accepts text, image, and audio input simultaneously. Training used 5T text tokens, 2.3M speech hours, and 1.1T image-text tokens.
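
To make the LoRA adapters mentioned above more concrete, here is a minimal sketch of the generic low-rank adaptation arithmetic: a base weight matrix W is left untouched and a low-rank product B·A, scaled by alpha / r, is added on top. This illustrates the general technique only. The matrix sizes, the `alpha` value, and the helper names are assumptions for the example and do not reflect Microsoft's actual implementation or hyperparameters.

```typescript
type Matrix = number[][];

// Plain matrix product; throws if the inner dimensions disagree.
function matmul(a: Matrix, b: Matrix): Matrix {
  const inner = b.length;
  const cols = b[0].length;
  return a.map((row) => {
    if (row.length !== inner) throw new Error("shape mismatch");
    return Array.from({ length: cols }, (_, j) =>
      row.reduce((sum, v, k) => sum + v * b[k][j], 0)
    );
  });
}

// Effective weight = W + (alpha / r) * B·A,
// where B is (out x r), A is (r x in), and r is the adapter rank.
function applyLora(w: Matrix, b: Matrix, a: Matrix, alpha: number): Matrix {
  const rank = a.length;
  const scale = alpha / rank;
  const delta = matmul(b, a);
  return w.map((row, i) => row.map((v, j) => v + scale * delta[i][j]));
}

// Toy 3x3 base weight with a rank-1 adapter (illustrative values only).
const baseWeight: Matrix = [
  [1, 0, 0],
  [0, 1, 0],
  [0, 0, 1],
];
const loraB: Matrix = [[1], [0], [2]]; // 3 x 1
const loraA: Matrix = [[0.5, 0, -0.5]]; // 1 x 3

const adapted = applyLora(baseWeight, loraB, loraA, 2);
console.log(adapted); // [[2, 0, -1], [0, 1, 0], [2, 0, -1]]
```

Because only B and A are trained, an adapter of rank r adds r × (in + out) parameters per adapted matrix instead of in × out, which is how extra modalities can be attached to a shared backbone at modest cost.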

Mixpeek SDK Integration

import { Mixpeek } from "mixpeek";

const mx = new Mixpeek({ apiKey: "API_KEY" });

// Managed: create a collection over a bucket; Mixpeek runs this model's extractor
const collection = await mx.collections.create({
  namespace_id: "my-namespace",
  collection_name: "my-collection",
  source: { type: "bucket", bucket_ids: ["bkt_your_bucket"] },
  feature_extractor: {
    feature_extractor_name: "scene_caption",
    version: "v1",
    parameters: { model_id: "microsoft/Phi-4-multimodal-instruct" },
  },
});

Capabilities

Unified text + image + speech understanding in one model
#1 on Open ASR Leaderboard at release (6.14% WER)
128K token context length
DocVQA: 93.2%, MMBench: 86.7%, OCRBench: 84.4%
First open-source model with speech summarization

Use Cases on Mixpeek

Multimodal content analysis combining document images, text, and audio narration

Edge-deployed visual QA for mobile and embedded devices at 5.6B parameters

Meeting analysis with joint speech transcription and slide understanding

Benchmarks

Dataset	Metric	Score	Source
HF Open ASR Leaderboard	WER	6.14%	Microsoft, Mar 2025: Model Card
DocVQA	Accuracy	93.2%	Microsoft, Mar 2025: Model Card
MMBench	Accuracy	86.7%	Microsoft, Mar 2025: Model Card
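
The WER figure in the table is a word error rate: substitutions, deletions, and insertions divided by the number of words in the reference transcript. The sketch below computes it with a word-level edit distance. It is a simplified illustration, assuming lowercase whitespace tokenization; leaderboard evaluations typically apply more thorough text normalization, so it will not reproduce the published score exactly. The function names are hypothetical.

```typescript
// Lowercase and split on whitespace (a simplifying assumption).
function tokenize(text: string): string[] {
  return text.toLowerCase().trim().split(/\s+/).filter((w) => w.length > 0);
}

// WER = (substitutions + deletions + insertions) / reference word count,
// computed as word-level Levenshtein distance over the reference length.
function wordErrorRate(reference: string, hypothesis: string): number {
  const ref = tokenize(reference);
  const hyp = tokenize(hypothesis);
  if (ref.length === 0) {
    throw new Error("reference must contain at least one word");
  }
  // prev[j] holds the distance between the reference prefix processed
  // so far and the first j hypothesis words.
  let prev: number[] = Array.from({ length: hyp.length + 1 }, (_, j) => j);
  for (let i = 1; i <= ref.length; i++) {
    const curr: number[] = [i];
    for (let j = 1; j <= hyp.length; j++) {
      const cost = ref[i - 1] === hyp[j - 1] ? 0 : 1;
      curr.push(Math.min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost));
    }
    prev = curr;
  }
  return prev[hyp.length] / ref.length;
}

// One substitution (fox -> box) and one deletion (jumps) over 5 reference words.
const wer = wordErrorRate("the quick brown fox jumps", "the quick brown box");
console.log(`WER: ${(wer * 100).toFixed(1)}%`); // WER: 40.0%
```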