clip-vit-large-patch14
by openai
Contrastive Language-Image Pre-Training for zero-shot visual understanding
openai/clip-vit-large-patch14
Overview
CLIP (Contrastive Language-Image Pre-Training) is a neural network trained on 400M image-text pairs from the internet. It learns visual concepts from natural language supervision, enabling zero-shot transfer to downstream tasks without task-specific training data.
On Mixpeek, CLIP powers visual embedding extraction, converting video frames and images into 768-dimensional vectors that capture semantic meaning. This enables similarity search across visual content using natural language queries.
Architecture
Vision Transformer (ViT-L/14) with 24 layers, 1024-dim hidden size, 16 attention heads. Text encoder is a 12-layer transformer. Both encoders project into a shared 768-dim embedding space via contrastive learning.
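Because both encoders project into the same 768-dim space, cross-modal relevance reduces to a vector operation: the similarity between a text query and an image or frame is the cosine similarity of their embeddings. A minimal sketch (the short vectors below are illustrative stand-ins for real 768-dimensional CLIP embeddings):

```typescript
// Cosine similarity: dot product of the vectors divided by the
// product of their magnitudes. CLIP embeddings are compared this way.
function cosineSimilarity(a: number[], b: number[]): number {
  const dot = a.reduce((sum, ai, i) => sum + ai * b[i], 0);
  const normA = Math.sqrt(a.reduce((sum, ai) => sum + ai * ai, 0));
  const normB = Math.sqrt(b.reduce((sum, bi) => sum + bi * bi, 0));
  return dot / (normA * normB);
}

const textEmbedding = [0.1, 0.9, 0.2];    // stand-in for a 768-dim text vector
const frameEmbedding = [0.15, 0.85, 0.1]; // stand-in for a 768-dim frame vector
console.log(cosineSimilarity(textEmbedding, frameEmbedding).toFixed(3));
```

A score near 1 means the text and the visual content are semantically close in the shared space; scores near 0 mean they are unrelated.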
Mixpeek SDK Integration
```typescript
import { Mixpeek } from "mixpeek";

const mx = new Mixpeek({ apiKey: "API_KEY" });

// Upload and extract visual embeddings
await mx.collections.ingest({
  collection_id: "my-collection",
  source: { url: "https://example.com/video.mp4" },
  feature_extractors: [{
    name: "image_embedding",
    version: "v1",
    params: {
      model_id: "openai/clip-vit-large-patch14"
    }
  }]
});
```
Capabilities
- Zero-shot image classification without fine-tuning
- Cross-modal text-to-image and image-to-text retrieval
- 768-dimensional dense vector embeddings
- Processes 224x224 pixel inputs, divided into 14x14 pixel patches
- Text encoder trained primarily on English captions; multilingual retrieval requires a multilingual CLIP variant
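Zero-shot classification follows directly from the embedding design: embed the candidate labels as text, score each against the image embedding, and softmax the scores. The sketch below illustrates the mechanism; the embeddings are placeholders, not real CLIP outputs.

```typescript
// Cosine similarity between two vectors.
function cosine(a: number[], b: number[]): number {
  const dot = a.reduce((sum, v, i) => sum + v * b[i], 0);
  return dot / (Math.hypot(...a) * Math.hypot(...b));
}

// CLIP-style zero-shot classification: rank candidate labels by the
// softmax over their (temperature-scaled) similarity to the image.
function zeroShotClassify(
  imageEmbedding: number[],
  labelEmbeddings: Record<string, number[]>
): { label: string; prob: number }[] {
  const entries = Object.entries(labelEmbeddings);
  // CLIP scales logits by a learned temperature; 100 is the typical value.
  const logits = entries.map(([, emb]) => 100 * cosine(imageEmbedding, emb));
  const max = Math.max(...logits);
  const exps = logits.map((l) => Math.exp(l - max));
  const total = exps.reduce((sum, e) => sum + e, 0);
  return entries
    .map(([label], i) => ({ label, prob: exps[i] / total }))
    .sort((a, b) => b.prob - a.prob);
}
```

No fine-tuning is involved: changing the label set only changes which text strings get embedded, which is why CLIP transfers to new classification tasks without task-specific training data.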
Benchmarks
| Dataset | Metric | Score | Source |
|---|---|---|---|
| ImageNet zero-shot | Top-1 Accuracy | 75.3% | Radford et al., 2021 — Table 11 |
| MS-COCO (text→image) | Recall@5 | 56.4% | Radford et al., 2021 — Table 8 |
| Flickr30k (text→image) | Recall@1 | 87.1% | Radford et al., 2021 — Table 8 |
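Recall@K, the metric in the retrieval rows above, is the fraction of queries whose correct item appears among the top K ranked results. A small sketch of the computation:

```typescript
// rankedIds[i] is the system's ranked result list for query i;
// correctIds[i] is that query's ground-truth item.
// Recall@K = fraction of queries whose correct item is in the top K.
function recallAtK(rankedIds: string[][], correctIds: string[], k: number): number {
  let hits = 0;
  rankedIds.forEach((ranking, i) => {
    if (ranking.slice(0, k).includes(correctIds[i])) hits++;
  });
  return hits / correctIds.length;
}
```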
Research Paper
Learning Transferable Visual Models From Natural Language Supervision
arxiv.org
Build a pipeline with clip-vit-large-patch14
Add this model to a processing pipeline alongside other extractors. Combine with retrieval stages for end-to-end search.