OmniParser-v2.0
by microsoft
Screen parser that turns screenshots into structured UI elements for agents
microsoft/OmniParser-v2.0
mixpeek://image_extractor@v1/microsoft_omniparser_v2_v1
Overview
OmniParser v2 is Microsoft's screen parsing model for computer-use agents. It converts screenshots into structured elements by detecting interactable regions and captioning icons, so an LLM can reason over a screen as objects with coordinates and functions.
On Mixpeek, OmniParser is relevant for indexing UI recordings, app screenshots, support sessions, and agent traces. It makes visual interfaces searchable by element semantics instead of raw pixels alone.
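To make "objects with coordinates and functions" concrete, here is a minimal sketch of what a parsed screen could look like and how it might be flattened into text for an LLM prompt. The `ScreenElement` shape, its field names, the normalized box convention, and the `toPromptLines` helper are all illustrative assumptions, not the actual output schema of OmniParser or Mixpeek.

```typescript
// Hypothetical shape for one parsed UI element (an assumption for illustration,
// not the real OmniParser or Mixpeek schema).
interface ScreenElement {
  id: number;
  // Bounding box as [x1, y1, x2, y2], assumed normalized to 0..1 of the screenshot.
  bbox: [number, number, number, number];
  // Functional description of the element, e.g. what an icon does.
  caption: string;
  // Whether the detector flagged the region as clickable/actionable.
  interactable: boolean;
}

// Serialize elements as one line each so an LLM can refer to them by id.
function toPromptLines(elements: ScreenElement[]): string {
  return elements
    .map((e) => {
      const box = e.bbox.map((v) => v.toFixed(2)).join(", ");
      const kind = e.interactable ? "interactable" : "static";
      return `[${e.id}] ${kind} "${e.caption}" at (${box})`;
    })
    .join("\n");
}

// Made-up example data, not real model output.
const example: ScreenElement[] = [
  { id: 0, bbox: [0.02, 0.03, 0.08, 0.07], caption: "Back navigation arrow", interactable: true },
  { id: 1, bbox: [0.1, 0.03, 0.9, 0.07], caption: "Page title", interactable: false },
];

console.log(toPromptLines(example));
```

Giving each element a stable id lets an agent answer with "click element 0" instead of guessing pixel positions from the raw image.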
Architecture
OmniParser is a two-model screen parser: a fine-tuned YOLOv8 detector locates icons and interactable regions, and a fine-tuned Florence-2 model captions each detected icon. V2 is trained on cleaner icon grounding data and runs at lower latency than V1.
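The detect-then-caption flow can be sketched as a small composition of two functions. This is only an outline of the idea under stated assumptions: `Detector`, `Captioner`, `parseScreen`, and the `minScore` threshold are hypothetical names, the stubs return made-up values, and the calls are kept synchronous for brevity where real model inference would normally be asynchronous.

```typescript
type RegionBox = [number, number, number, number]; // [x1, y1, x2, y2]

interface Detection {
  bbox: RegionBox;
  score: number; // detector confidence, assumed to lie in 0..1
}

interface ParsedElement extends Detection {
  caption: string;
}

// Stage 1 stands in for the icon detector, stage 2 for the icon captioner.
type Detector = (image: Uint8Array) => Detection[];
type Captioner = (image: Uint8Array, bbox: RegionBox) => string;

// Run detection, drop low-confidence boxes, then caption each remaining region.
function parseScreen(
  image: Uint8Array,
  detect: Detector,
  caption: Captioner,
  minScore = 0.3, // hypothetical threshold, not a documented default
): ParsedElement[] {
  return detect(image)
    .filter((d) => d.score >= minScore)
    .map((d) => ({ ...d, caption: caption(image, d.bbox) }));
}

// Stub models so the sketch runs without any real inference.
const fakeDetect: Detector = () => [
  { bbox: [0.1, 0.1, 0.2, 0.2], score: 0.92 },
  { bbox: [0.5, 0.5, 0.6, 0.6], score: 0.12 },
];
const fakeCaption: Captioner = (_image, bbox) => `icon at x=${bbox[0]}`;

const parsed = parseScreen(new Uint8Array(0), fakeDetect, fakeCaption);
console.log(parsed);
```

Passing the two stages in as functions keeps the composition independent of how each model is actually served.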
Mixpeek SDK Integration
import { Mixpeek } from "mixpeek";

const mx = new Mixpeek({ apiKey: "API_KEY" });

await mx.collections.ingest({
  collection_id: "ui-recordings",
  source: { url: "https://example.com/screenshot.png" },
  feature_extractors: [
    {
      feature: "scene_caption",
      model: "microsoft/OmniParser-v2.0"
    }
  ]
});
Capabilities
- Detects clickable and actionable UI regions
- Captions icons with functional semantics
- Converts screenshots into structured screen elements
- Useful with computer-use agents and GUI automation
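For GUI automation, an agent usually needs a pixel coordinate to click rather than a box. The following is a minimal sketch of that conversion, assuming boxes arrive as `[x1, y1, x2, y2]` normalized to the 0..1 range; the `clickPoint` helper and the box convention are assumptions for illustration, so check the actual output format before relying on it.

```typescript
type NormalizedBox = [number, number, number, number]; // [x1, y1, x2, y2] in 0..1 (assumed)

interface Point {
  x: number;
  y: number;
}

// Return the center of a normalized box in pixel coordinates
// for a screenshot of the given width and height.
function clickPoint(bbox: NormalizedBox, width: number, height: number): Point {
  const [x1, y1, x2, y2] = bbox;
  return {
    x: Math.round(((x1 + x2) / 2) * width),
    y: Math.round(((y1 + y2) / 2) * height),
  };
}

// Made-up box on a 1920x1080 screenshot.
const target = clickPoint([0.25, 0.5, 0.75, 0.6], 1920, 1080);
console.log(target); // { x: 960, y: 594 }
```

Clicking the box center is a simple default; an agent could choose a different point if an element's active area is off-center.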
Use Cases on Mixpeek
Benchmarks
| Dataset | Metric | Score | Source |
|---|---|---|---|
| ScreenSpot Pro | Average accuracy | 39.6 | Microsoft OmniParser v2 model card |
Performance
Best used for UI screenshots rather than natural scene imagery.
Specification
Research Paper
OmniParser for Pure Vision Based GUI Agent
arxiv.org
Build a pipeline with OmniParser-v2.0
Add this model to a processing pipeline alongside other extractors. Combine with retrieval stages for end-to-end search.