    HF · Scene Captioning · MIT and AGPL components

    OmniParser-v2.0

    by microsoft

    Screen parser that turns screenshots into structured UI elements for agents

    Downloads: 85K/month
    Parameters: YOLOv8 + Florence-2
    Identifiers
    Model ID: microsoft/OmniParser-v2.0
    Feature URI: mixpeek://image_extractor@v1/microsoft_omniparser_v2_v1

    Overview

    OmniParser v2 is Microsoft's screen parsing model for computer-use agents. It converts screenshots into structured elements by detecting interactable regions and captioning icons, so an LLM can reason over a screen as objects with coordinates and functions.
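To make "objects with coordinates and functions" concrete, here is a minimal sketch of how parsed elements might be represented and flattened into text for an LLM. The `ScreenElement` shape, its field names, and the `describeScreen` helper are hypothetical illustrations, not the actual OmniParser or Mixpeek output schema.

```typescript
// Hypothetical shape for one parsed UI element (not the official schema).
interface ScreenElement {
  id: number;
  // Bounding box as [x1, y1, x2, y2], assumed normalized to the 0..1 range.
  bbox: [number, number, number, number];
  caption: string; // functional description of the element
  interactable: boolean; // whether the detector flagged it as actionable
}

// Flatten elements into numbered lines that an LLM can reason over.
function describeScreen(elements: ScreenElement[]): string {
  return elements
    .map((e) => {
      const box = e.bbox.map((v) => v.toFixed(2)).join(", ");
      const kind = e.interactable ? "interactable" : "static";
      return `[${e.id}] ${e.caption} (${kind}) at [${box}]`;
    })
    .join("\n");
}

// Made-up sample data, for illustration only.
const sample: ScreenElement[] = [
  { id: 0, bbox: [0.1, 0.05, 0.2, 0.1], caption: "Save document", interactable: true },
  { id: 1, bbox: [0.3, 0.05, 0.6, 0.1], caption: "Page title", interactable: false },
];

console.log(describeScreen(sample));
```

Giving each element a stable numeric `id` lets a downstream agent refer to "element 0" in its reply instead of repeating raw coordinates.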

    On Mixpeek, OmniParser is relevant for indexing UI recordings, app screenshots, support sessions, and agent traces. It makes visual interfaces searchable by element semantics instead of raw pixels alone.

    Architecture

    OmniParser v2 is a two-model screen parser: a fine-tuned YOLOv8 detector locates icons and interactable regions, and a fine-tuned Florence-2 captioner describes each detected icon. V2 is trained on cleaner icon grounding data and runs with lower latency than V1.
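The detect-then-caption flow can be sketched as a two-stage function. This is an illustrative outline only: the `Detector` and `Captioner` types, the `parseScreen` helper, and the `minScore` threshold are assumptions, and the two models are replaced by stubs so the example runs without any weights.

```typescript
// Pixel-space bounding box: [x1, y1, x2, y2].
type Box = [number, number, number, number];

interface Detection { box: Box; score: number; }
interface ParsedElement { box: Box; score: number; caption: string; }

// Hypothetical stand-ins for the detector and captioner models.
type Detector = (image: Uint8Array) => Detection[];
type Captioner = (image: Uint8Array, box: Box) => string;

// Stage 1 finds candidate regions; stage 2 captions each region that
// clears the confidence threshold.
function parseScreen(
  image: Uint8Array,
  detect: Detector,
  caption: Captioner,
  minScore: number = 0.3, // assumed threshold, for illustration only
): ParsedElement[] {
  return detect(image)
    .filter((d) => d.score >= minScore)
    .map((d) => ({ box: d.box, score: d.score, caption: caption(image, d.box) }));
}

// Stub models returning made-up values.
const fakeDetect: Detector = () => [
  { box: [10, 10, 42, 42], score: 0.92 },
  { box: [60, 10, 92, 42], score: 0.12 },
];
const fakeCaption: Captioner = (_image, box) => `icon at x=${box[0]}`;

const parsed = parseScreen(new Uint8Array(0), fakeDetect, fakeCaption);
console.log(parsed);
```

Passing the two stages in as functions keeps the sketch independent of any particular inference runtime; only regions that survive the detector threshold incur captioning cost.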

    Mixpeek SDK Integration

    import { Mixpeek } from "mixpeek";

    const mx = new Mixpeek({ apiKey: "API_KEY" });

    await mx.collections.ingest({
      collection_id: "ui-recordings",
      source: { url: "https://example.com/screenshot.png" },
      feature_extractors: [{
        feature: "scene_caption",
        model: "microsoft/OmniParser-v2.0"
      }]
    });

    Capabilities

    • Detects clickable and actionable UI regions
    • Captions icons with functional semantics
    • Converts screenshots into structured screen elements
    • Useful with computer-use agents and GUI automation

    Use Cases on Mixpeek

    • Search UI recordings for a specific button, dialog, or workflow state
    • Give agents structured observations from application screenshots
    • Index support sessions by visible UI elements and user journeys
    • Ground natural-language instructions to screen coordinates
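As a rough illustration of grounding an instruction to screen coordinates, the sketch below picks the element whose caption shares the most words with the instruction and returns the center of its box. The naive word-overlap matcher is a deliberate simplification swapped in for real semantic retrieval; `UiElement`, `tokenize`, and `groundInstruction` are hypothetical names, and the element data is made up.

```typescript
interface UiElement {
  caption: string;
  bbox: [number, number, number, number]; // [x1, y1, x2, y2] in pixels
}

// Lowercase, split on non-alphanumeric characters, and drop duplicates.
function tokenize(text: string): string[] {
  return text
    .toLowerCase()
    .split(/[^a-z0-9]+/)
    .filter((t, i, all) => t.length > 0 && all.indexOf(t) === i);
}

// Return the center of the best-matching element, or null if no caption
// shares a word with the instruction.
function groundInstruction(
  instruction: string,
  elements: UiElement[],
): { x: number; y: number; caption: string } | null {
  const query = tokenize(instruction);
  let best: UiElement | null = null;
  let bestScore = 0;
  for (const el of elements) {
    const score = tokenize(el.caption).filter((t) => query.indexOf(t) !== -1).length;
    if (score > bestScore) {
      best = el;
      bestScore = score;
    }
  }
  if (best === null) return null;
  const [x1, y1, x2, y2] = best.bbox;
  return { x: (x1 + x2) / 2, y: (y1 + y2) / 2, caption: best.caption };
}

const screen: UiElement[] = [
  { caption: "Submit order button", bbox: [100, 200, 180, 240] },
  { caption: "Cancel and go back", bbox: [200, 200, 280, 240] },
];

console.log(groundInstruction("Click the submit button", screen));
```

A production pipeline would likely replace the overlap score with embedding similarity, but the final step stays the same: resolve the chosen element to a click point derived from its bounding box.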

    Benchmarks

    Dataset | Metric | Score | Source
    ScreenSpot Pro | Average accuracy | 39.6 | Microsoft OmniParser v2 model card

    Performance

    Input Size: Screenshot image
    GPU Latency: ~0.6 s / frame (A100)
    GPU Throughput: ~1.6 frames/sec (A100)
    GPU Memory: ~4 GB

    Best used for UI screenshots rather than natural scene imagery
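The latency and throughput figures above are two views of the same number, which makes it easy to budget GPU time for a recording. The sketch below assumes frames are processed one at a time on a single GPU; the sampling rate and recording length are made-up inputs, and the helper names are hypothetical.

```typescript
// Approximate per-frame latency from the Performance figures (A100).
const LATENCY_SEC_PER_FRAME = 0.6;

// With sequential processing, throughput is the reciprocal of latency.
function framesPerSecond(latencySec: number): number {
  return 1 / latencySec;
}

// Estimated GPU seconds needed to parse a recording sampled at `sampleFps`.
function estimateGpuSeconds(
  durationSec: number,
  sampleFps: number,
  latencySec: number,
): number {
  const frames = Math.ceil(durationSec * sampleFps);
  return frames * latencySec;
}

console.log(framesPerSecond(LATENCY_SEC_PER_FRAME).toFixed(2));
console.log(estimateGpuSeconds(300, 1, LATENCY_SEC_PER_FRAME));
```

Under these assumptions, a 5-minute recording sampled at 1 frame per second is 300 frames, or roughly 180 seconds of A100 time; sampling less often or skipping near-duplicate frames reduces that proportionally.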

    Specification

    Framework: HF
    Organization: microsoft
    Feature: Scene Captioning
    Output: text
    Modalities: video, image
    Retriever: Semantic Search
    Parameters: YOLOv8 + Florence-2
    License: MIT and AGPL components
    Downloads/mo: 85K

    Research Paper

    OmniParser for Pure Vision Based GUI Agent

    arxiv.org

    Build a pipeline with OmniParser-v2.0

    Add this model to a processing pipeline alongside other extractors. Combine with retrieval stages for end-to-end search.
