    HF · Object Detection · Apache 2.0

    owlvit-large-patch14

    by google

    Simple open-vocabulary object detection with Vision Transformers

    580K downloads/month
    ~300M parameters
    Identifiers
    Model ID: google/owlvit-large-patch14
    Feature URI: mixpeek://image_extractor@v1/google_owlvit_large_v1

    Overview

    OWL-ViT transfers image-text pre-trained models to open-vocabulary object detection using a standard ViT with minimal modifications. It supports both text-conditioned zero-shot detection and one-shot image-conditioned detection.
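To make text-conditioned detection more concrete, the sketch below scores per-box image embeddings against text query embeddings with cosine similarity and keeps the best-matching query for each box. It is a simplified illustration and not the model's actual classification head: the names `cosine` and `bestQueryPerBox`, the `QueryMatch` type, and the toy 2-D embeddings are all assumptions made for this example.

```typescript
// Cosine similarity between two equal-length vectors.
function cosine(a: number[], b: number[]): number {
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

// Hypothetical result type: index of the best text query and its similarity.
interface QueryMatch {
  queryIndex: number;
  score: number;
}

// For each box embedding, pick the text query with the highest similarity.
function bestQueryPerBox(boxEmbeds: number[][], queryEmbeds: number[][]): QueryMatch[] {
  return boxEmbeds.map((box) => {
    let best: QueryMatch = { queryIndex: -1, score: -Infinity };
    queryEmbeds.forEach((query, i) => {
      const score = cosine(box, query);
      if (score > best.score) {
        best = { queryIndex: i, score };
      }
    });
    return best;
  });
}

// Toy data (assumed): two text queries and two predicted boxes in a 2-D space.
const queryEmbeds = [[1, 0], [0, 1]];
const boxEmbeds = [[0.9, 0.1], [0.2, 0.8]];
const matches = bestQueryPerBox(boxEmbeds, queryEmbeds);
console.log(matches.map((m) => m.queryIndex)); // [ 0, 1 ]
```

Because the queries are just embeddings, the same scoring would apply to one-shot image-conditioned detection if the query vectors came from a reference image instead of text.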

    On Mixpeek, OWL-ViT serves as a simple detection model whose accuracy improves consistently with larger pre-trained backbones and more training data.

    Architecture

    Plain Vision Transformer (ViT-L/14) pre-trained with contrastive image-text learning, then fine-tuned end-to-end for detection. No detection-specific backbone changes needed.
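As a rough illustration of what "ViT-L/14" implies at the 840×840 input size listed under Performance, the sketch below counts the patch tokens the transformer processes. In OWL-ViT the detection heads are attached to the output tokens, so this count is also an upper bound on the number of candidate boxes per image. The helper name `patchTokenCount` is an assumption for this example.

```typescript
// Number of patch tokens for a square image split into square patches.
function patchTokenCount(imageSize: number, patchSize: number): number {
  if (imageSize % patchSize !== 0) {
    throw new Error("imageSize must be a multiple of patchSize");
  }
  const patchesPerSide = imageSize / patchSize;
  return patchesPerSide * patchesPerSide;
}

// 840 / 14 = 60 patches per side, so 60 * 60 tokens.
const tokenCount = patchTokenCount(840, 14);
console.log(tokenCount); // 3600
```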

    Mixpeek SDK Integration

    import { Mixpeek } from "mixpeek";

    // Create a client; replace "API_KEY" with your Mixpeek API key.
    const mx = new Mixpeek({ apiKey: "API_KEY" });

    // Ingest one image and run OWL-ViT object detection on it.
    await mx.collections.ingest({
      collection_id: "my-collection",
      source: { url: "https://example.com/image.jpg" },
      feature_extractors: [
        {
          name: "object_detection",
          version: "v1",
          params: { model_id: "google/owlvit-large-patch14" },
        },
      ],
    });

    Capabilities

    • Zero-shot text-conditioned object detection
    • One-shot image-conditioned detection
    • Consistent scaling with model and data size
    • Standard ViT architecture, minimal modifications

    Use Cases on Mixpeek

    • Detecting objects from text descriptions in images and video
    • One-shot detection using a reference image
    • Scalable visual search with text queries

    Benchmarks

    Dataset            Metric    Score   Source
    LVIS (zero-shot)   AP_rare   31.2    Minderer et al., 2022, Table 1
    COCO (zero-shot)   AP        34.6    Minderer et al., 2022, Table 1

    Performance

    Input Size: 840×840 px
    GPU Latency: ~18 ms / image (A100)
    CPU Latency: ~220 ms / image
    GPU Throughput: ~55 images/sec (A100)
    GPU Memory: ~1.5 GB
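The latency and throughput figures above are consistent with each other (1000 ms / 18 ms ≈ 55 images/sec), and they can be turned into a rough processing-time estimate. The sketch below does that arithmetic under stated assumptions: images are processed one at a time with no batching or I/O overhead, and the 60-frame workload (for example, a one-minute video sampled at one frame per second) is hypothetical. The function names are illustrative only.

```typescript
// Approximate per-image latencies from the Performance figures above.
const GPU_LATENCY_MS = 18;
const CPU_LATENCY_MS = 220;

// Images per second implied by a per-image latency.
function imagesPerSecond(latencyMs: number): number {
  return 1000 / latencyMs;
}

// Estimated seconds to process frameCount images sequentially.
function estimateSeconds(frameCount: number, latencyMs: number): number {
  return (frameCount * latencyMs) / 1000;
}

// Hypothetical workload: 60 sampled frames.
const frames = 60;
console.log(imagesPerSecond(GPU_LATENCY_MS).toFixed(1)); // "55.6"
console.log(estimateSeconds(frames, GPU_LATENCY_MS).toFixed(2)); // "1.08"
console.log(estimateSeconds(frames, CPU_LATENCY_MS).toFixed(2)); // "13.20"
```

Real pipelines will differ from this estimate once decoding, batching, and network transfer are included.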

    Specification

    Framework: HF
    Organization: google
    Feature: Object Detection
    Output: bbox + label
    Modalities: video, image
    Retriever: Object Filter
    Parameters: ~300M
    License: Apache 2.0
    Downloads/mo: 580K
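Given the "bbox + label" output and the Object Filter retriever listed above, a downstream filtering step might look like the sketch below. This is not the Mixpeek response schema: the `Detection` interface, its `score` field, the `[x, y, width, height]` box layout, and the `filterDetections` helper are assumptions introduced purely for illustration.

```typescript
// Hypothetical detection record. Only "bbox + label" comes from the
// specification; the score field and box layout are assumed.
interface Detection {
  label: string;
  score: number;
  bbox: [number, number, number, number]; // assumed [x, y, width, height]
}

// Keep detections whose label matches (case-insensitive) and whose
// score is at least minScore.
function filterDetections(
  detections: Detection[],
  label: string,
  minScore: number
): Detection[] {
  const wanted = label.toLowerCase();
  return detections.filter(
    (d) => d.label.toLowerCase() === wanted && d.score >= minScore
  );
}

// Made-up sample data for the example.
const sample: Detection[] = [
  { label: "dog", score: 0.82, bbox: [10, 20, 100, 80] },
  { label: "dog", score: 0.15, bbox: [300, 40, 50, 60] },
  { label: "cat", score: 0.91, bbox: [200, 10, 90, 90] },
];

const dogs = filterDetections(sample, "dog", 0.3);
console.log(dogs.length); // 1
```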

    Research Paper

    Simple Open-Vocabulary Object Detection with Vision Transformers

    arxiv.org

    Build a pipeline with owlvit-large-patch14

    Add this model to a processing pipeline alongside other extractors. Combine with retrieval stages for end-to-end search.
