    HF · Scene Captioning · MIT

    Phi-4-reasoning-vision-15B

    by microsoft

    Compact reasoning VLM — chain-of-thought over documents, screenshots, and math

    320K downloads/month
    15B params
    Identifiers

    Model ID: microsoft/Phi-4-reasoning-vision-15B
    Feature URI: mixpeek://image_extractor@v1/microsoft_phi4_reasoning_vision_v1
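
    The Feature URI appears to pack an extractor name, a version, and a feature name into one string. The sketch below splits it into those parts. The layout (scheme://extractor@version/feature) is inferred from the single example on this page rather than from a published specification, and `parseFeatureUri` and `FeatureUri` are hypothetical names, not part of the Mixpeek SDK.

```typescript
// Hypothetical helper: the URI layout (scheme://extractor@version/feature)
// is an assumption inferred from the one example shown on this page.
interface FeatureUri {
  scheme: string;
  extractor: string;
  version: string;
  feature: string;
}

function parseFeatureUri(uri: string): FeatureUri {
  // Groups: scheme, extractor (before "@"), version (before "/"), feature (rest).
  const match = /^([a-z][a-z0-9+.-]*):\/\/([^@\/]+)@([^\/]+)\/(.+)$/i.exec(uri);
  if (match === null) {
    throw new Error(`Unrecognized feature URI: ${uri}`);
  }
  const [, scheme = "", extractor = "", version = "", feature = ""] = match;
  return { scheme, extractor, version, feature };
}

const parsed = parseFeatureUri(
  "mixpeek://image_extractor@v1/microsoft_phi4_reasoning_vision_v1"
);
console.log(parsed.extractor, parsed.version, parsed.feature);
```

    Throwing on an unrecognized string keeps a malformed URI from silently producing empty fields further down a pipeline.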

    Overview

    Phi-4-reasoning-vision-15B combines a Phi-4-Reasoning language backbone with a SigLIP-2 vision encoder to produce a multimodal model that reasons step by step over visual input. Unlike captioning models that only describe what they see, this model chains logical inferences across visual evidence: it solves math problems from whiteboard photos, answers questions about complex charts, and grounds UI elements in screenshots.

    It scores 88.2 on ScreenSpot-V2 (GUI grounding), 76.0 on OCRBench, and 75.2 on MathVista. The MIT license makes it one of the most permissively licensed capable VLMs available. On Mixpeek, it powers document QA, visual reasoning over extracted frames, and structured data extraction from screenshots and slides.

    Architecture

    Mid-fusion architecture: a SigLIP-2 vision encoder converts images into visual tokens, which are interleaved with text tokens in a Phi-4-Reasoning transformer backbone (15B parameters). Chain-of-thought reasoning is available via <think> mode for multi-step visual inference.
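
    When <think> mode is on, downstream code usually wants the final answer separately from the reasoning trace. The sketch below assumes the model wraps its reasoning in a single <think>...</think> pair ahead of the answer. The closing tag and that ordering are assumptions, and `splitReasoning` is a hypothetical helper, not something the model or SDK provides.

```typescript
// Hypothetical post-processing helper. Assumes the raw model output contains
// at most one <think>...</think> block; that output format is an assumption.
interface ReasonedOutput {
  reasoning: string | null; // null when no <think> block is present
  answer: string;
}

function splitReasoning(raw: string): ReasonedOutput {
  // Non-greedy match so only the first <think> block is captured.
  const match = /<think>([\s\S]*?)<\/think>/.exec(raw);
  if (match === null) {
    return { reasoning: null, answer: raw.trim() };
  }
  // The answer is everything outside the matched block.
  const before = raw.slice(0, match.index);
  const after = raw.slice(match.index + match[0].length);
  return {
    reasoning: (match[1] ?? "").trim(),
    answer: (before + after).trim(),
  };
}

const sample =
  "<think>Q3 is 40 and Q4 is 55, so the difference is 15.</think>\nRevenue grew by 15.";
const result = splitReasoning(sample);
console.log(result.answer); // prints "Revenue grew by 15."
```

    Keeping the trace in its own field lets a pipeline index only the answer while still storing the reasoning for auditing.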

    Mixpeek SDK Integration

    import { Mixpeek } from "mixpeek";

    // Replace "API_KEY" with your Mixpeek API key.
    const mx = new Mixpeek({ apiKey: "API_KEY" });

    // Ingest a PDF and run scene captioning with reasoning enabled.
    await mx.collections.ingest({
      collection_id: "my-collection",
      source: { url: "https://example.com/presentation.pdf" },
      feature_extractors: [{
        name: "scene_caption",
        version: "v1",
        params: {
          model_id: "microsoft/Phi-4-reasoning-vision-15B",
          enable_reasoning: true
        }
      }]
    });

    Capabilities

    • Chain-of-thought reasoning over visual content
    • GUI grounding: locate UI elements by description (ScreenSpot-V2: 88.2)
    • Document understanding with OCR (OCRBench: 76.0)
    • Mathematical reasoning from visual input (MathVista: 75.2)
    • MIT license, which permits commercial use

    Use Cases on Mixpeek

    • Document QA: answer complex questions about charts, tables, and diagrams
    • Screenshot analysis: extract structured data from UI captures
    • Visual reasoning for agent perception: interpret whiteboard notes, slides, and forms
    • Automated grading and assessment from photographed work

    Benchmarks

    Dataset                        Metric    Score  Source
    ScreenSpot-V2 (GUI grounding)  Accuracy  88.2%  Microsoft, 2026, Model Card
    OCRBench                       Score     76.0   Microsoft, 2026, Model Card
    MathVista                      Accuracy  75.2%  Microsoft, 2026, Model Card

    Performance

    Input Size: Variable-resolution images + up to 32K text tokens
    GPU Latency: ~180 ms / image (A100)
    GPU Throughput: ~5.5 images/sec (A100)
    GPU Memory: ~30 GB
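
    The figures above can be sanity-checked with simple arithmetic. The sketch below derives throughput from the quoted latency, estimates wall-clock time for a batch of images, and shows that 15B parameters at 2 bytes each account for roughly the quoted memory. It assumes strictly sequential single-image processing and 16-bit weights. Neither assumption is stated on this page, and the function names are illustrative.

```typescript
// Back-of-envelope estimates from the quoted figures. Assumptions (not stated
// in the source): images are processed one at a time, weights are 16-bit.
const LATENCY_MS_PER_IMAGE = 180; // quoted A100 latency
const PARAM_COUNT = 15e9;         // 15B parameters
const BYTES_PER_PARAM = 2;        // assumption: fp16/bf16 weights

// Throughput implied by a per-image latency, in images per second.
function imagesPerSecond(latencyMs: number): number {
  return 1000 / latencyMs;
}

// Sequential processing time for a batch, in seconds.
function estimateSeconds(imageCount: number, latencyMs: number): number {
  return (imageCount * latencyMs) / 1000;
}

// Memory for the weights alone, in GB (activations and KV cache excluded).
function weightMemoryGB(params: number, bytesPerParam: number): number {
  return (params * bytesPerParam) / 1e9;
}

console.log(imagesPerSecond(LATENCY_MS_PER_IMAGE).toFixed(2)); // prints "5.56"
console.log(estimateSeconds(1000, LATENCY_MS_PER_IMAGE));      // prints 180
console.log(weightMemoryGB(PARAM_COUNT, BYTES_PER_PARAM));     // prints 30
```

    The implied 5.56 images/sec is consistent with the quoted ~5.5, and 1,000 extracted frames would take about three minutes on one GPU under these assumptions.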

    Specification

    Framework: HF
    Organization: microsoft
    Feature: Scene Captioning
    Output: text
    Modalities: video, image
    Retriever: Semantic Search
    Parameters: 15B
    License: MIT
    Downloads/mo: 320K

    Research Paper

    Phi-4 Reasoning: Training a Multimodal Reasoning Model

    arxiv.org

    Build a pipeline with Phi-4-reasoning-vision-15B

    Add this model to a processing pipeline alongside other extractors. Combine with retrieval stages for end-to-end search.