owlvit-large-patch14
by Google
Simple open-vocabulary object detection with Vision Transformers
google/owlvit-large-patch14
mixpeek://image_extractor@v1/google_owlvit_large_v1
Overview
OWL-ViT transfers image-text pre-trained models to open-vocabulary object detection using a standard ViT with minimal modifications. It supports both text-conditioned zero-shot detection and one-shot image-conditioned detection.
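To make the text-conditioned case concrete, here is a minimal sketch that scores each candidate region against each text query by cosine similarity and keeps the best label above a threshold. The vectors are made-up toy values, and `labelRegions`, `Query`, and `Match` are hypothetical names for illustration; they are not part of the Mixpeek SDK or the OWL-ViT codebase, and real implementations may post-process the similarities further.

```typescript
// Toy sketch: match region embeddings to query embeddings by cosine similarity.
// All names and vectors here are hypothetical, for illustration only.
type Vec = number[];

interface Query { label: string; embedding: Vec }
interface Match { region: number; label: string; score: number }

function dot(a: Vec, b: Vec): number {
  return a.reduce((sum, ai, i) => sum + ai * (b[i] ?? 0), 0);
}

function cosine(a: Vec, b: Vec): number {
  return dot(a, b) / Math.sqrt(dot(a, a) * dot(b, b));
}

// For every region, pick the most similar query and keep it if it clears the threshold.
function labelRegions(regions: Vec[], queries: Query[], threshold: number): Match[] {
  const matches: Match[] = [];
  regions.forEach((embedding, region) => {
    let best: Match | undefined;
    for (const q of queries) {
      const score = cosine(embedding, q.embedding);
      if (best === undefined || score > best.score) {
        best = { region, label: q.label, score };
      }
    }
    if (best !== undefined && best.score >= threshold) matches.push(best);
  });
  return matches;
}

const queries: Query[] = [
  { label: "cat", embedding: [1, 0] },
  { label: "dog", embedding: [0, 1] },
];
const regions: Vec[] = [[0.9, 0.1], [0.1, 0.8], [0.5, 0.5]];
const matches = labelRegions(regions, queries, 0.9);
```

The same matching step would cover the image-conditioned case if the query embedding came from an example image rather than from text.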
On Mixpeek, OWL-ViT serves as a detection model whose accuracy improves consistently with larger pre-trained backbones and more training data.
Architecture
A plain Vision Transformer (ViT-L/14) is pre-trained with contrastive image-text learning, then fine-tuned end-to-end for detection. The backbone needs no detection-specific changes.
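One practical consequence of the /14 patch size is the number of image tokens the transformer processes. The sketch below counts them for a square input; `tokenGrid` is a hypothetical helper, and the 840-pixel input size is an assumption chosen for illustration rather than a value stated on this page.

```typescript
// Hypothetical helper: how many non-overlapping patches a square image yields.
function tokenGrid(imageSize: number, patchSize: number): { perSide: number; total: number } {
  const perSide = Math.floor(imageSize / patchSize); // patches along one edge
  return { perSide, total: perSide * perSide };
}

// Assumed 840 x 840 input with 14 x 14 patches.
const grid = tokenGrid(840, 14);
console.log(`${grid.perSide} x ${grid.perSide} = ${grid.total} tokens`);
```

Since token count grows with the square of the image side, input resolution is a main driver of compute for this kind of model.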
Mixpeek SDK Integration
import { Mixpeek } from "mixpeek";

const mx = new Mixpeek({ apiKey: "API_KEY" });

// Managed: create a collection over a bucket; Mixpeek runs this model's extractor
const collection = await mx.collections.create({
  namespace_id: "my-namespace",
  collection_name: "my-collection",
  source: { type: "bucket", bucket_ids: ["bkt_your_bucket"] },
  feature_extractor: {
    feature_extractor_name: "object_detection",
    version: "v1",
    parameters: { model_id: "google/owlvit-large-patch14" },
  },
});
Capabilities
- Zero-shot text-conditioned object detection
- One-shot image-conditioned detection
- Consistent scaling with model and data size
- Standard ViT architecture, minimal modifications
Benchmarks
| Dataset | Metric | Score | Source |
|---|---|---|---|
| LVIS (zero-shot) | AP_rare | 31.2 | Minderer et al., 2022, Table 1 |
| LVIS (zero-shot) | AP | 34.6 | Minderer et al., 2022, Table 1 |
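AP-style detection metrics rest on intersection-over-union (IoU): a predicted box counts as a hit only when its IoU with a ground-truth box clears a threshold. As a minimal sketch of that building block, assuming boxes given as pixel corner coordinates (the `Box` shape and function names are hypothetical, not taken from any benchmark toolkit):

```typescript
// Hypothetical box shape: top-left (x0, y0) and bottom-right (x1, y1) corners.
interface Box { x0: number; y0: number; x1: number; y1: number }

// Area of a box; degenerate or inverted boxes count as zero.
function area(b: Box): number {
  return Math.max(0, b.x1 - b.x0) * Math.max(0, b.y1 - b.y0);
}

// Intersection-over-union of two axis-aligned boxes, in [0, 1].
function iou(a: Box, b: Box): number {
  const inter = area({
    x0: Math.max(a.x0, b.x0),
    y0: Math.max(a.y0, b.y0),
    x1: Math.min(a.x1, b.x1),
    y1: Math.min(a.y1, b.y1),
  });
  const union = area(a) + area(b) - inter;
  return union > 0 ? inter / union : 0;
}

// Two 10 x 10 boxes overlapping by half their width.
const overlap = iou({ x0: 0, y0: 0, x1: 10, y1: 10 }, { x0: 5, y0: 0, x1: 15, y1: 10 });
```

Here the intersection is 50 and the union is 150, so `overlap` is one third; a prediction like this would fail a 0.5 IoU threshold.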
Research Paper
Simple Open-Vocabulary Object Detection with Vision Transformers
arxiv.org
Build a pipeline with owlvit-large-patch14
Add this model to a processing pipeline alongside other extractors. Combine with retrieval stages for end-to-end search.