clip-vit-large-patch14
by openai
Contrastive Language-Image Pre-Training for zero-shot visual understanding
openai/clip-vit-large-patch14
Overview
CLIP (Contrastive Language-Image Pre-Training) is a neural network trained on 400M image-text pairs from the internet. It learns visual concepts from natural language supervision, enabling zero-shot transfer to downstream tasks without task-specific training data.
On Mixpeek, CLIP powers visual embedding extraction — converting video frames and images into 768-dimensional vectors that capture semantic meaning. This enables similarity search across visual content using natural language queries.
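As a sketch of how that similarity search works: once frames are embedded, a text query embedded into the same space can rank them by cosine similarity. The 3-dim vectors, frame IDs, and `rankBySimilarity` helper below are illustrative stand-ins for real 768-dim CLIP embeddings, not part of the Mixpeek SDK.

```typescript
// Toy frame store: each frame carries an embedding (3-dim here,
// 768-dim with real CLIP outputs).
type Frame = { id: string; embedding: number[] };

function cosine(a: number[], b: number[]): number {
  let dot = 0, na = 0, nb = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    na += a[i] * a[i];
    nb += b[i] * b[i];
  }
  return dot / (Math.sqrt(na) * Math.sqrt(nb));
}

// Rank frames by descending cosine similarity to the query embedding.
function rankBySimilarity(query: number[], frames: Frame[]): Frame[] {
  return [...frames].sort(
    (x, y) => cosine(query, y.embedding) - cosine(query, x.embedding)
  );
}

const frames: Frame[] = [
  { id: "frame_001", embedding: [0.9, 0.1, 0.0] },
  { id: "frame_002", embedding: [0.1, 0.9, 0.1] },
  { id: "frame_003", embedding: [0.8, 0.2, 0.1] },
];
// Hypothetical text-query embedding in the same shared space
const queryEmbedding = [1.0, 0.0, 0.0];
const ranked = rankBySimilarity(queryEmbedding, frames);
```

In production the query embedding would come from CLIP's text encoder and the ranking would run server-side over the collection's index.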
Architecture
The image encoder is a Vision Transformer (ViT-L/14, i.e. 14×14 pixel patches) with 24 layers, 1024-dim hidden size, and 16 attention heads. The text encoder is a 12-layer transformer. Both encoders project into a shared 768-dim embedding space trained with a contrastive objective.
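A minimal sketch of that contrastive objective: matched image-text pairs sit on the diagonal of a pairwise similarity matrix, and a symmetric cross-entropy pushes each image toward its own caption and each caption toward its own image. The toy 2-dim unit vectors and the `clipLoss` helper below are assumptions for illustration, not CLIP's actual training code.

```typescript
// Cross-entropy of a softmax over logits against a single target index.
function softmaxCrossEntropy(logits: number[], target: number): number {
  const max = Math.max(...logits);
  const exps = logits.map((l) => Math.exp(l - max));
  const sum = exps.reduce((s, e) => s + e, 0);
  return -Math.log(exps[target] / sum);
}

// Symmetric contrastive loss over N image and N text embeddings
// (assumed unit-normalized); pair k is (imgEmb[k], txtEmb[k]).
function clipLoss(imgEmb: number[][], txtEmb: number[][], temp = 0.07): number {
  const n = imgEmb.length;
  // Temperature-scaled pairwise similarity matrix
  const logits = imgEmb.map((img) =>
    txtEmb.map((txt) => img.reduce((s, v, i) => s + v * txt[i], 0) / temp)
  );
  let lossI = 0, lossT = 0;
  for (let k = 0; k < n; k++) {
    lossI += softmaxCrossEntropy(logits[k], k);                 // image -> text
    lossT += softmaxCrossEntropy(logits.map((row) => row[k]), k); // text -> image
  }
  return (lossI + lossT) / (2 * n);
}

// Aligned pairs give near-zero loss; swapped captions do not.
const aligned = clipLoss([[1, 0], [0, 1]], [[1, 0], [0, 1]]);
const mismatched = clipLoss([[1, 0], [0, 1]], [[0, 1], [1, 0]]);
```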
Mixpeek SDK Integration
import { Mixpeek } from "mixpeek";

const mx = new Mixpeek({ apiKey: "API_KEY" });

// Upload a video and extract visual embeddings
await mx.collections.ingest({
  collection_id: "my-collection",
  source: { url: "https://example.com/video.mp4" },
  feature_extractors: [
    {
      name: "image_embedding",
      version: "v1",
      params: {
        model_id: "openai/clip-vit-large-patch14",
      },
    },
  ],
});

Capabilities
- Zero-shot image classification without fine-tuning
- Cross-modal text-to-image and image-to-text retrieval
- 768-dimensional dense vector embeddings
- Accepts 224×224 pixel images, split into 14×14 pixel patches
- Text encoder trained primarily on English captions; multilingual retrieval requires a multilingual CLIP variant
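To illustrate the zero-shot classification flow listed above: embed each candidate label as a text prompt (e.g. "a photo of a dog"), compare the prompts with the image embedding, and softmax the temperature-scaled similarities. The 3-dim embeddings and label set below are hypothetical stand-ins for real CLIP outputs.

```typescript
const labels = ["dog", "cat", "car"];
// Hypothetical text embeddings, one per label prompt
const textEmb = [
  [0.95, 0.05, 0.0],
  [0.1, 0.95, 0.0],
  [0.0, 0.1, 0.95],
];
// Hypothetical image embedding of a dog photo
const imageEmb = [0.9, 0.2, 0.05];

const dot = (a: number[], b: number[]) =>
  a.reduce((s, v, i) => s + v * b[i], 0);
const norm = (a: number[]) => Math.sqrt(dot(a, a));

// Cosine similarity between the image and each label prompt
const sims = textEmb.map((t) => dot(imageEmb, t) / (norm(imageEmb) * norm(t)));

// Softmax over temperature-scaled similarities (CLIP scales by ~100)
const scaled = sims.map((s) => Math.exp(s * 100));
const total = scaled.reduce((a, b) => a + b, 0);
const probs = scaled.map((e) => e / total);
const predicted = labels[probs.indexOf(Math.max(...probs))];
```

No fine-tuning is involved: changing the label list changes the classifier.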
Research Paper
Learning Transferable Visual Models From Natural Language Supervision (arxiv.org)
Build a pipeline with clip-vit-large-patch14
Add this model to a processing pipeline alongside other extractors. Combine with retrieval stages for end-to-end search.