
HF | Scene Captioning | MIT and AGPL components

OmniParser-v2.0

by microsoft

Screen parser that turns screenshots into structured UI elements for agents

Downloads: 85K / month

Parameters: YOLOv8 + Florence-2


Identifiers

Model ID

microsoft/OmniParser-v2.0

Feature URI

mixpeek://image_extractor@v1/microsoft_omniparser_v2_v1

Overview

OmniParser v2 is Microsoft's screen parsing model for computer-use agents. It converts screenshots into structured elements by detecting interactable regions and captioning icons, so an LLM can reason over a screen as objects with coordinates and functions.

On Mixpeek, OmniParser is useful for indexing UI recordings, app screenshots, support sessions, and agent traces. It makes visual interfaces searchable by element semantics rather than by raw pixels alone.
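To make "objects with coordinates and functions" concrete, here is a minimal sketch of how one parsed element might be represented and flattened into text for an LLM prompt. The `ScreenElement` shape, its field names, and `describeScreen` are assumptions made for illustration; they are not the actual OmniParser or Mixpeek output schema.

```typescript
// Hypothetical element shape; field names are illustrative assumptions,
// not the real OmniParser or Mixpeek schema.
interface ScreenElement {
  id: number;
  kind: "icon" | "text";
  caption: string; // functional description, e.g. "Save document"
  bbox: [number, number, number, number]; // [x1, y1, x2, y2], normalized to 0..1
  interactable: boolean;
}

// Render each element as one compact text line that a prompt could include.
function describeScreen(elements: ScreenElement[]): string {
  return elements
    .map((e) => {
      const box = e.bbox.map((v) => v.toFixed(2)).join(", ");
      const flag = e.interactable ? "interactable" : "static";
      return `[${e.id}] ${e.kind} "${e.caption}" at (${box}) ${flag}`;
    })
    .join("\n");
}

// Made-up sample data, for illustration only.
const exampleElements: ScreenElement[] = [
  { id: 0, kind: "icon", caption: "Save document", bbox: [0.1, 0.05, 0.14, 0.09], interactable: true },
  { id: 1, kind: "text", caption: "Untitled.txt", bbox: [0.4, 0.05, 0.6, 0.08], interactable: false },
];

console.log(describeScreen(exampleElements));
```

Normalized coordinates are used here so the same record stays valid if the screenshot is resized; whether the real output is normalized is not stated on this page.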

Architecture

OmniParser v2 is a two-model screen parser that combines a fine-tuned YOLOv8 icon detector with a fine-tuned Florence-2 icon captioner. V2 uses cleaner icon grounding data and has lower latency than V1.
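The detect-then-caption flow can be sketched as two pluggable stages. Everything below is a hypothetical illustration: `DetectFn`, `CaptionFn`, `parseScreen`, the `minScore` threshold, and the stub functions are assumptions standing in for the real detector and captioner, which would run model inference and would normally be asynchronous.

```typescript
type Box = [number, number, number, number]; // [x1, y1, x2, y2], normalized to 0..1

interface Detection {
  bbox: Box;
  score: number; // detector confidence in 0..1
}

interface ParsedElement extends Detection {
  caption: string;
}

// Hypothetical stage signatures; real stages would wrap model inference.
type DetectFn = (image: string) => Detection[];
type CaptionFn = (image: string, bbox: Box) => string;

// Stage 1 finds candidate regions, stage 2 captions each region that survives
// the confidence filter. The 0.3 default is an arbitrary illustrative value.
function parseScreen(
  image: string,
  detect: DetectFn,
  caption: CaptionFn,
  minScore = 0.3,
): ParsedElement[] {
  return detect(image)
    .filter((d) => d.score >= minScore)
    .map((d) => ({ ...d, caption: caption(image, d.bbox) }));
}

// Stubs standing in for the detector and captioner models.
const stubDetect: DetectFn = () => [
  { bbox: [0.1, 0.1, 0.2, 0.2], score: 0.9 },
  { bbox: [0.5, 0.5, 0.6, 0.6], score: 0.1 },
];
const stubCaption: CaptionFn = (_image, bbox) => `icon at x=${bbox[0]}`;

const parsed = parseScreen("screenshot.png", stubDetect, stubCaption);
console.log(parsed);
```

Keeping the stages behind plain function types makes it easy to swap in a stub for testing, which is the only reason this sketch is structured that way.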

Mixpeek SDK Integration

import { Mixpeek } from "mixpeek";

// Replace "API_KEY" with your own Mixpeek API key.
const mx = new Mixpeek({ apiKey: "API_KEY" });

// Ingest one screenshot and run OmniParser as the captioning extractor.
await mx.collections.ingest({
  collection_id: "ui-recordings",
  source: { url: "https://example.com/screenshot.png" },
  feature_extractors: [{
    feature: "scene_caption",
    model: "microsoft/OmniParser-v2.0"
  }]
});

Capabilities

Detects clickable and actionable UI regions
Captions icons with functional semantics
Converts screenshots into structured screen elements
Supports computer-use agents and GUI automation

Use Cases on Mixpeek

Search UI recordings for a specific button, dialog, or workflow state

Give agents structured observations from application screenshots

Index support sessions by visible UI elements and user journeys

Ground natural-language instructions to screen coordinates
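As a rough illustration of the last use case, the sketch below maps an instruction to a click point by picking the element whose caption shares the most words with the instruction. This naive keyword overlap is a stand-in chosen for brevity; it is not how OmniParser or an LLM agent performs grounding. `UiElement`, `tokenize`, and `groundInstruction` are hypothetical names.

```typescript
// Hypothetical parsed element with a normalized bounding box [x1, y1, x2, y2].
interface UiElement {
  caption: string;
  bbox: [number, number, number, number];
}

// Lowercase and split on anything that is not a letter or digit.
function tokenize(s: string): string[] {
  return s.toLowerCase().split(/[^a-z0-9]+/).filter((t) => t.length > 0);
}

// Return the pixel center of the best-matching element, or null if no
// caption shares a word with the instruction.
function groundInstruction(
  instruction: string,
  elements: UiElement[],
  width: number,
  height: number,
): { x: number; y: number; caption: string } | null {
  const words = new Set(tokenize(instruction));
  let best: UiElement | null = null;
  let bestScore = 0;
  for (const el of elements) {
    const score = tokenize(el.caption).filter((t) => words.has(t)).length;
    if (score > bestScore) {
      best = el;
      bestScore = score;
    }
  }
  if (best === null) return null;
  const [x1, y1, x2, y2] = best.bbox;
  return {
    x: Math.round(((x1 + x2) / 2) * width),
    y: Math.round(((y1 + y2) / 2) * height),
    caption: best.caption,
  };
}

// Made-up sample elements, for illustration only.
const sampleUi: UiElement[] = [
  { caption: "Save document", bbox: [0.1, 0.1, 0.2, 0.2] },
  { caption: "Open settings menu", bbox: [0.8, 0.0, 0.9, 0.1] },
];

console.log(groundInstruction("click the settings menu", sampleUi, 1000, 500));
```

In practice an embedding model or an LLM would do the matching; the part that carries over is the final step of converting a normalized box into pixel coordinates.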

Benchmarks

Dataset	Metric	Score	Source
ScreenSpot Pro	Average accuracy	39.6	Microsoft OmniParser v2 model card

Performance

Input Size: Screenshot image

GPU Latency: ~0.6 s / frame (A100)

GPU Throughput: ~1.6 frames/sec (A100)

GPU Memory: ~4 GB

Best used for UI screenshots rather than natural scene imagery
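For capacity planning, the latency figure above converts into a back-of-the-envelope GPU-time estimate for a screen recording. The sketch assumes one A100, sequential processing with no batching, and a frame sampling rate you choose; `estimateGpuSeconds` and its parameters are hypothetical names.

```typescript
// Estimate GPU seconds needed to parse a recording, assuming frames are
// processed one at a time at a fixed per-frame latency (0.6 s, per the
// A100 figure listed above).
function estimateGpuSeconds(
  videoSeconds: number,
  sampleFps: number,
  secondsPerFrame = 0.6,
): number {
  const frames = Math.ceil(videoSeconds * sampleFps);
  return frames * secondsPerFrame;
}

// A 10-minute recording sampled at 1 frame per second is 600 frames.
const tenMinuteEstimate = estimateGpuSeconds(600, 1);
console.log(`${tenMinuteEstimate.toFixed(0)} GPU seconds`);
```

At 0.6 s per frame, one GPU processes about 1.67 frames per second, in line with the ~1.6 frames/sec listed, so sampling faster than that cannot be parsed in real time on one GPU under these assumptions.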

Common Pipeline Companions

Qwen/Qwen3-VL-Reranker-2B

Rerank UI screenshots for semantic relevance

BAAI/bge-large-en-v1.5

Embed extracted UI text and element captions

Specification

Framework: HF

Organization: microsoft

Feature: Scene Captioning

Output: text

Modalities: video, image

Retriever: Semantic Search

Parameters: YOLOv8 + Florence-2

License: MIT and AGPL components

Downloads/mo: 85K

Research Paper

OmniParser for Pure Vision Based GUI Agent

Build a pipeline with OmniParser-v2.0

Add this model to a processing pipeline alongside other extractors. Combine with retrieval stages for end-to-end search.

Alternative Models

Salesforce/blip2-opt-2.7b

Scene Captioning

microsoft/Florence-2-large

Scene Captioning

google/paligemma2-3b-mix-448

Scene Captioning

Qwen/Qwen3-VL-8B-Instruct

Scene Captioning