    HF · Scene Captioning · MIT

    Phi-4-multimodal-instruct

    by microsoft

    5.6B multimodal model processing text, images, and speech in a single architecture

    391K downloads/month
    5.6B parameters
    Identifiers
    Model ID: microsoft/Phi-4-multimodal-instruct
    Feature URI: mixpeek://image_extractor@v1/microsoft_phi4_multimodal_v1
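
    The Feature URI appears to follow an `extractor@version/feature` layout. The sketch below splits it into those parts. The layout is inferred from the single example on this page rather than from Mixpeek documentation, and `parseFeatureUri` and the `FeatureUri` field names are hypothetical.

```typescript
// Hypothetical helper: the URI layout is inferred from the one example on this
// page (mixpeek://<extractor>@<version>/<feature>), not from a published spec.
interface FeatureUri {
  extractor: string; // e.g. "image_extractor"
  version: string;   // e.g. "v1"
  feature: string;   // e.g. "microsoft_phi4_multimodal_v1"
}

function parseFeatureUri(uri: string): FeatureUri {
  const match = /^mixpeek:\/\/([^@\/]+)@([^\/]+)\/(.+)$/.exec(uri);
  if (match === null) {
    throw new Error(`Not a recognizable feature URI: ${uri}`);
  }
  const [, extractor, version, feature] = match;
  return { extractor, version, feature };
}

const parsed = parseFeatureUri(
  "mixpeek://image_extractor@v1/microsoft_phi4_multimodal_v1"
);
console.log(parsed.extractor, parsed.version, parsed.feature);
```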

    Overview

    Phi-4 Multimodal Instruct is Microsoft's 5.6B-parameter foundation model that handles text, vision, and speech in a single architecture. It is built on the Phi-4-mini backbone, with vision and speech encoders attached through LoRA adapters. At release it ranked #1 on the Hugging Face Open ASR Leaderboard with a 6.14% word error rate (WER), and it is the first open-source model capable of speech summarization.

    On Mixpeek, Phi-4 Multimodal enables unified processing of mixed-media content where text, images, and audio need to be understood together. Its compact 5.6B size makes it deployable on edge devices while delivering competitive performance against much larger models on document understanding, visual QA, and speech recognition tasks.

    Architecture

    The Phi-4-mini language model serves as the backbone, with vision and speech encoders connected via LoRA adapters. The model has 5.6B total parameters and a 128K-token context length. It was trained on 5T text tokens, 2.3M speech hours, and 1.1T image-text tokens, and it accepts text, image, and audio input simultaneously.
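
    To make the LoRA adapter idea concrete, here is a minimal sketch of a generic LoRA forward pass: a frozen weight matrix W plus a low-rank update B·A scaled by alpha / r. It illustrates the general technique only. The toy dimensions, values, and function names are assumptions and do not reflect how Phi-4's adapters are actually implemented or sized.

```typescript
type Matrix = number[][];

// Multiply a matrix (rows x cols) by a vector of length cols.
function matVec(m: Matrix, v: number[]): number[] {
  return m.map((row) => row.reduce((sum, w, i) => sum + w * v[i], 0));
}

// Generic LoRA forward pass: y = W x + (alpha / r) * B (A x)
//   W is d_out x d_in (frozen base weights)
//   A is r x d_in and B is d_out x r (the trainable low-rank adapter)
function loraForward(
  W: Matrix,
  A: Matrix,
  B: Matrix,
  alpha: number,
  x: number[]
): number[] {
  const rank = A.length;
  const base = matVec(W, x);
  const delta = matVec(B, matVec(A, x));
  const scale = alpha / rank;
  return base.map((y, i) => y + scale * delta[i]);
}

// Toy example with d_in = d_out = 2 and rank r = 1 (illustrative values only).
const W: Matrix = [[1, 0], [0, 1]];
const A: Matrix = [[1, 1]];
const B: Matrix = [[0.5], [0]];
const adapted = loraForward(W, A, B, 2, [1, 2]);
console.log(adapted); // [4, 2]
```

    Because only A and B are trained, an adapter adds r·(d_in + d_out) parameters per layer instead of d_in·d_out, which is what lets separate modality adapters share one backbone.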

    Mixpeek SDK Integration

    import { Mixpeek } from "mixpeek";

    const mx = new Mixpeek({ apiKey: "API_KEY" });

    await mx.collections.ingest({
      collection_id: "my-collection",
      source: { url: "https://example.com/presentation.mp4" },
      feature_extractors: [{
        name: "scene_caption",
        version: "v1",
        params: {
          model_id: "microsoft/Phi-4-multimodal-instruct"
        }
      }]
    });
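
    When several files need the same extractor settings, it can help to build the request objects separately from the call that sends them. The sketch below mirrors the request shape used in the snippet above. The interfaces, `buildCaptionRequests`, and the URL check are hypothetical and are not part of the Mixpeek SDK.

```typescript
// Shapes copied from the ingest snippet on this page; these type names are
// assumptions for illustration, not types exported by the Mixpeek SDK.
interface FeatureExtractorConfig {
  name: string;
  version: string;
  params: { model_id: string };
}

interface IngestRequest {
  collection_id: string;
  source: { url: string };
  feature_extractors: FeatureExtractorConfig[];
}

const PHI4_SCENE_CAPTION: FeatureExtractorConfig = {
  name: "scene_caption",
  version: "v1",
  params: { model_id: "microsoft/Phi-4-multimodal-instruct" },
};

// Build one ingest request per source URL, rejecting anything that is not http(s).
function buildCaptionRequests(collectionId: string, urls: string[]): IngestRequest[] {
  return urls.map((url) => {
    if (!/^https?:\/\//.test(url)) {
      throw new Error(`Unsupported source URL: ${url}`);
    }
    return {
      collection_id: collectionId,
      source: { url },
      feature_extractors: [PHI4_SCENE_CAPTION],
    };
  });
}

const requests = buildCaptionRequests("my-collection", [
  "https://example.com/presentation.mp4",
  "https://example.com/keynote.mp4",
]);
console.log(requests.length); // 2

// Each request could then be passed to the client from the snippet above:
//   for (const request of requests) await mx.collections.ingest(request);
```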

    Capabilities

    • Unified text + image + speech understanding in one model
    • #1 on Open ASR Leaderboard at release (6.14% WER)
    • 128K token context length
    • DocVQA: 93.2%, MMBench: 86.7%, OCRBench: 84.4%
    • First open-source model with speech summarization

    Use Cases on Mixpeek

    • Multimodal content analysis combining document images, text, and audio narration
    • Edge-deployed visual QA for mobile and embedded devices at 5.6B parameters
    • Meeting analysis with joint speech transcription and slide understanding

    Benchmarks

    Dataset                    Metric     Score    Source
    HF Open ASR Leaderboard    WER        6.14%    Microsoft, Mar 2025, Model Card
    DocVQA                     Accuracy   93.2%    Microsoft, Mar 2025, Model Card
    MMBench                    Accuracy   86.7%    Microsoft, Mar 2025, Model Card
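
    For readers unfamiliar with the WER metric in the table, word error rate is the word-level edit distance (substitutions, deletions, and insertions) between a reference transcript and a model transcript, divided by the number of reference words. The following is a minimal sketch of that standard definition. It splits on whitespace and skips the text normalization that leaderboards usually apply, so it will not reproduce the published scores.

```typescript
// Word error rate: (substitutions + deletions + insertions) / reference word count,
// computed with a standard Levenshtein distance over whitespace-separated words.
function wordErrorRate(reference: string, hypothesis: string): number {
  const ref = reference.trim().split(/\s+/).filter((w) => w.length > 0);
  const hyp = hypothesis.trim().split(/\s+/).filter((w) => w.length > 0);
  if (ref.length === 0) {
    throw new Error("Reference must contain at least one word");
  }

  // prev[j] holds the edit distance between the first i-1 reference words
  // and the first j hypothesis words.
  let prev: number[] = Array.from({ length: hyp.length + 1 }, (_, j) => j);
  for (let i = 1; i <= ref.length; i++) {
    const curr: number[] = [i];
    for (let j = 1; j <= hyp.length; j++) {
      const cost = ref[i - 1] === hyp[j - 1] ? 0 : 1;
      curr[j] = Math.min(
        prev[j] + 1,        // deletion
        curr[j - 1] + 1,    // insertion
        prev[j - 1] + cost  // substitution or match
      );
    }
    prev = curr;
  }
  return prev[hyp.length] / ref.length;
}

// One substituted word out of five reference words gives a WER of 0.2 (20%).
const wer = wordErrorRate(
  "the quick brown fox jumps",
  "the quick brown box jumps"
);
console.log(wer);
```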

    Performance

    Input Size: Images: variable resolution; Audio: variable length; Text: 128K tokens
    GPU Latency: ~25 ms / image (A100)
    GPU Throughput: ~40 images/sec (A100)
    GPU Memory: ~12 GB
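
    The figures above can be sanity-checked with simple arithmetic. The sketch below assumes 16-bit weights (2 bytes per parameter) and one image processed at a time; neither assumption is stated on this page. Under those assumptions the weights alone take about 11.2 GB, which would be consistent with the listed ~12 GB once runtime overhead is included, and 25 ms per image corresponds to 40 images per second.

```typescript
// Assumption: weights are stored in a 16-bit format (fp16 or bf16).
const BYTES_PER_PARAM_16BIT = 2;

// Memory needed for the weights alone, in decimal gigabytes.
function weightMemoryGB(paramCount: number, bytesPerParam: number): number {
  return (paramCount * bytesPerParam) / 1e9;
}

// Throughput when items are processed one after another with no batching.
function sequentialThroughputPerSec(latencyMs: number): number {
  return 1000 / latencyMs;
}

const weightsGB = weightMemoryGB(5.6e9, BYTES_PER_PARAM_16BIT);
const imagesPerSec = sequentialThroughputPerSec(25);

console.log(weightsGB.toFixed(1));   // "11.2"
console.log(imagesPerSec);           // 40
```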

    Specification

    Framework: HF
    Organization: microsoft
    Feature: Scene Captioning
    Output: text
    Modalities: video, image
    Retriever: Semantic Search
    Parameters: 5.6B
    License: MIT
    Downloads/mo: 391K

    Research Paper

    Phi-4 Technical Report (arxiv.org)

    Build a pipeline with Phi-4-multimodal-instruct

    Add this model to a processing pipeline alongside other extractors. Combine with retrieval stages for end-to-end search.
