Phi-4-reasoning-vision-15B
by microsoft
Compact reasoning VLM — chain-of-thought over documents, screenshots, and math
microsoft/Phi-4-reasoning-vision-15B
mixpeek://image_extractor@v1/microsoft_phi4_reasoning_vision_v1
Overview
Phi-4-reasoning-vision-15B combines a Phi-4-Reasoning language backbone with a SigLIP-2 vision encoder to produce a multimodal model that reasons step by step over visual input. Unlike captioning models that describe what they see, this model chains logical inferences across visual evidence: it solves math problems from whiteboard photos, answers questions about complex charts, and grounds UI elements in screenshots.
It scores 88.2 on ScreenSpot-V2 (GUI grounding), 76.0 on OCRBench, and 75.2 on MathVista. The MIT license makes it one of the most permissively licensed capable VLMs available. On Mixpeek, it powers document QA, visual reasoning over extracted frames, and structured data extraction from screenshots and slides.
Architecture
The model uses a mid-fusion architecture: a SigLIP-2 vision encoder converts images into visual tokens, which are interleaved with text tokens in a Phi-4-Reasoning transformer backbone (15B parameters). It supports chain-of-thought reasoning via <think> mode for multi-step visual inference.
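When <think> mode is enabled, downstream code usually needs to separate the reasoning trace from the final answer. The sketch below shows one way to do that. It assumes the raw output wraps the reasoning in a single `<think>...</think>` block followed by the answer. That output shape is an assumption, not something this page documents, and `splitReasoning` and `ReasonedOutput` are hypothetical names introduced for illustration.

```typescript
// Hypothetical result shape: not part of any SDK.
interface ReasonedOutput {
  reasoning: string | null; // null when no <think> block is present
  answer: string;
}

// Assumption: the model emits at most one <think>...</think> block,
// and everything outside it is the final answer.
function splitReasoning(raw: string): ReasonedOutput {
  const match = raw.match(/<think>([\s\S]*?)<\/think>/);
  if (!match) {
    return { reasoning: null, answer: raw.trim() };
  }
  const reasoning = match[1].trim();
  // Remove the whole <think> block and keep the remaining text.
  const answer = raw.replace(match[0], "").trim();
  return { reasoning, answer };
}

// Illustrative input only; not real model output.
const sample =
  "<think>The bar for Q3 is tallest, at about 42.</think>Q3 had the highest revenue.";
const parsed = splitReasoning(sample);
console.log(parsed.reasoning);
console.log(parsed.answer);
```

Keeping the trace separate makes it possible to store or log the reasoning for auditing while indexing only the final answer.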
Mixpeek SDK Integration
    import { Mixpeek } from "mixpeek";

    const mx = new Mixpeek({ apiKey: "API_KEY" });

    await mx.collections.ingest({
      collection_id: "my-collection",
      source: { url: "https://example.com/presentation.pdf" },
      feature_extractors: [
        {
          name: "scene_caption",
          version: "v1",
          params: {
            model_id: "microsoft/Phi-4-reasoning-vision-15B",
            enable_reasoning: true
          }
        }
      ]
    });
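If the same ingest call is issued for many documents, it can help to build the request object in one typed place. The following is a minimal sketch under stated assumptions: the field names (`collection_id`, `source.url`, `feature_extractors`, `model_id`, `enable_reasoning`) are copied from the example above, while the `IngestRequest` and `FeatureExtractor` types and the `buildIngestRequest` helper are hypothetical and not part of the Mixpeek SDK. The block only builds a plain object and does not call any API.

```typescript
// Hypothetical types mirroring the fields used in the ingest example.
interface FeatureExtractor {
  name: string;
  version: string;
  params: { model_id: string; enable_reasoning: boolean };
}

interface IngestRequest {
  collection_id: string;
  source: { url: string };
  feature_extractors: FeatureExtractor[];
}

const MODEL_ID = "microsoft/Phi-4-reasoning-vision-15B";

// Hypothetical helper: returns the payload, leaving the actual SDK call to the caller.
function buildIngestRequest(
  collectionId: string,
  url: string,
  enableReasoning: boolean = true
): IngestRequest {
  // Simple sanity check (an assumption of this sketch, not an SDK rule).
  if (!/^https?:\/\//.test(url)) {
    throw new Error(`Expected an http(s) URL, got: ${url}`);
  }
  return {
    collection_id: collectionId,
    source: { url },
    feature_extractors: [
      {
        name: "scene_caption",
        version: "v1",
        params: { model_id: MODEL_ID, enable_reasoning: enableReasoning },
      },
    ],
  };
}

const request = buildIngestRequest(
  "my-collection",
  "https://example.com/presentation.pdf"
);
console.log(JSON.stringify(request, null, 2));
```

The resulting object could then be passed to `mx.collections.ingest(...)` as in the example above, assuming the SDK accepts that shape.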
Capabilities
- Chain-of-thought reasoning over visual content
- GUI grounding: locate UI elements by description (ScreenSpot-V2: 88.2)
- Document understanding with OCR (OCRBench: 76.0)
- Mathematical reasoning from visual input (MathVista: 75.2)
- MIT license, which permits commercial use
Benchmarks
| Dataset | Metric | Score | Source |
|---|---|---|---|
| ScreenSpot-V2 (GUI grounding) | Accuracy | 88.2% | Microsoft, 2026 — Model Card |
| OCRBench | Score | 76.0 | Microsoft, 2026 — Model Card |
| MathVista | Accuracy | 75.2% | Microsoft, 2026 — Model Card |
Research Paper
Phi-4 Reasoning: Training a Multimodal Reasoning Model
arxiv.org
Build a pipeline with Phi-4-reasoning-vision-15B
Add this model to a processing pipeline alongside other extractors. Combine with retrieval stages for end-to-end search.