Phi-4-reasoning-vision-15B

by microsoft

Compact reasoning VLM: chain-of-thought over documents, screenshots, and math

320K downloads/month

15B params

Identifiers

Model ID

microsoft/Phi-4-reasoning-vision-15B

Feature URI

mixpeek://image_extractor@v1/microsoft_phi4_reasoning_vision_v1

Overview

Phi-4-reasoning-vision-15B pairs a Phi-4-Reasoning language backbone with a SigLIP-2 vision encoder, yielding a multimodal model that reasons step by step over visual input. Unlike captioning models, which describe what they see, it chains logical inferences across visual evidence: solving math problems from whiteboard photos, answering questions about complex charts, and grounding UI elements in screenshots.

It scores 88.2 on ScreenSpot-V2 (GUI grounding), 76.0 on OCRBench, and 75.2 on MathVista. The MIT license makes it one of the most permissively licensed capable VLMs available. On Mixpeek, it powers document QA, visual reasoning over extracted frames, and structured data extraction from screenshots and slides.

Architecture

The model uses a mid-fusion architecture: a SigLIP-2 vision encoder converts images into visual tokens, which are interleaved with text tokens in the Phi-4-Reasoning transformer backbone (15B parameters). A <think> mode enables chain-of-thought reasoning for multi-step visual inference.
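The interleaving step described above can be pictured with a small sketch. This is only an illustration of the idea: the `Token` and `Segment` types, the patch count, and the token ids are hypothetical and do not reflect the model's real tokenizer or encoder output.

```typescript
// Hypothetical types for illustration only; not the model's actual token format.
type Token =
  | { kind: "text"; id: number }
  | { kind: "visual"; patch: number };

// A prompt segment is either a run of text token ids or an image that the
// vision encoder is assumed to have turned into `numPatches` visual tokens.
type Segment =
  | { kind: "text"; ids: number[] }
  | { kind: "image"; numPatches: number };

// Build one flat sequence in prompt order, so the transformer backbone
// sees visual tokens in between the surrounding text tokens.
function interleave(segments: Segment[]): Token[] {
  const sequence: Token[] = [];
  for (const seg of segments) {
    if (seg.kind === "text") {
      for (const id of seg.ids) sequence.push({ kind: "text", id });
    } else {
      for (let p = 0; p < seg.numPatches; p++) {
        sequence.push({ kind: "visual", patch: p });
      }
    }
  }
  return sequence;
}

const sequence = interleave([
  { kind: "text", ids: [101, 102] },
  { kind: "image", numPatches: 4 },
  { kind: "text", ids: [103] },
]);

const layout = sequence.map((t) => t.kind).join(" ");
console.log(layout); // text text visual visual visual visual text
```

The point of the sketch is the ordering: the image contributes a contiguous run of tokens placed where it appears in the prompt, rather than being summarized into a single vector appended at the end.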

Mixpeek SDK Integration

import { Mixpeek } from "mixpeek";

const mx = new Mixpeek({ apiKey: "API_KEY" });

// Managed: create a collection over a bucket; Mixpeek runs this model's extractor
const collection = await mx.collections.create({
  namespace_id: "my-namespace",
  collection_name: "my-collection",
  source: { type: "bucket", bucket_ids: ["bkt_your_bucket"] },
  feature_extractor: {
    feature_extractor_name: "scene_caption",
    version: "v1",
    parameters: { model_id: "microsoft/Phi-4-reasoning-vision-15B" },
  },
});

Capabilities

Chain-of-thought reasoning over visual content
GUI grounding: locate UI elements by description (ScreenSpot-V2: 88.2)
Document understanding with OCR (OCRBench: 76.0)
Mathematical reasoning from visual input (MathVista: 75.2)
MIT license, which permits commercial use
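When chain-of-thought output is enabled, downstream code usually wants the final answer separately from the reasoning trace. The sketch below assumes the raw output wraps its reasoning in `<think>` and `</think>` tags ahead of the answer; that exact format is an assumption, and `splitReasoning` is a hypothetical helper, not part of the Mixpeek SDK.

```typescript
interface ParsedOutput {
  reasoning: string | null; // text inside <think>...</think>, or null if absent
  answer: string;           // everything after the closing tag
}

// Assumption: the reasoning trace appears once, inside <think>...</think>,
// followed by the final answer. Output without the tags is treated as
// answer-only.
function splitReasoning(raw: string): ParsedOutput {
  const match = raw.match(/<think>([\s\S]*?)<\/think>/);
  if (!match || match.index === undefined) {
    return { reasoning: null, answer: raw.trim() };
  }
  const answer = raw.slice(match.index + match[0].length).trim();
  return { reasoning: match[1].trim(), answer };
}

// Made-up example output, for illustration only.
const parsed = splitReasoning(
  "<think>The Q3 bar is the tallest in the chart.</think>\nQ3"
);
console.log(parsed.answer);
```

Keeping the trace in its own field makes it easy to log or audit the reasoning while indexing only the answer.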

Use Cases on Mixpeek

Document QA: answer complex questions about charts, tables, and diagrams

Screenshot analysis: extract structured data from UI captures

Visual reasoning for agent perception: interpret whiteboard notes, slides, and forms

Automated grading and assessment from photographed work

Benchmarks

Dataset	Metric	Score	Source
ScreenSpot-V2 (GUI grounding)	Accuracy	88.2%	Microsoft, 2026: Model Card
OCRBench	Score	76.0	Microsoft, 2026: Model Card
MathVista	Accuracy	75.2%	Microsoft, 2026: Model Card