
    clip-vit-large-patch14

    by openai

    Contrastive Language-Image Pre-Training for zero-shot visual understanding

    Identifiers

    Model ID: openai/clip-vit-large-patch14
    Feature URI: mixpeek://video_descriptor@v1/openai_clip_large_v1

    Overview

    CLIP (Contrastive Language-Image Pre-Training) is a neural network trained on 400M image-text pairs from the internet. It learns visual concepts from natural language supervision, enabling zero-shot transfer to downstream tasks without task-specific training data.

    On Mixpeek, CLIP powers visual embedding extraction — converting video frames and images into 768-dimensional vectors that capture semantic meaning. This enables similarity search across visual content using natural language queries.
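
    To make the similarity-search idea concrete, the sketch below ranks stored frame embeddings against a query embedding by cosine similarity. It is a minimal, self-contained TypeScript illustration with placeholder types; in practice the vectors come from the CLIP extractor and are searched through Mixpeek's vector index, and none of the names here are part of the Mixpeek SDK.

    // Rank 768-dim frame embeddings against a query embedding by cosine similarity.
    // Illustrative only: real embeddings are produced by CLIP, not hand-built.
    type Frame = { id: string; embedding: number[] };

    function cosine(a: number[], b: number[]): number {
      let dot = 0, normA = 0, normB = 0;
      for (let i = 0; i < a.length; i++) {
        dot += a[i] * b[i];
        normA += a[i] * a[i];
        normB += b[i] * b[i];
      }
      return dot / (Math.sqrt(normA) * Math.sqrt(normB));
    }

    // Highest-scoring frames are the closest semantic matches to the query text.
    function rankFrames(queryEmbedding: number[], frames: Frame[]): Frame[] {
      return [...frames].sort(
        (a, b) => cosine(queryEmbedding, b.embedding) - cosine(queryEmbedding, a.embedding)
      );
    }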

    Architecture

    Vision Transformer (ViT-L/14) with 24 layers, 1024-dim hidden size, 16 attention heads. Text encoder is a 12-layer transformer. Both encoders project into a shared 768-dim embedding space via contrastive learning.
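
    The snippet below sketches how the shared embedding space is used in contrastive training: both embeddings are L2-normalized, their dot products are scaled by a temperature, and matching image-text pairs (the diagonal of the logit matrix) are pushed to score highest. It is a simplified illustration; the scale value and function names are assumptions, not CLIP's actual implementation.

    // Simplified view of the contrastive setup: cosine similarity between
    // normalized image and text embeddings, scaled by a temperature.
    function l2Normalize(v: number[]): number[] {
      const norm = Math.sqrt(v.reduce((sum, x) => sum + x * x, 0));
      return v.map((x) => x / norm);
    }

    // logits[i][j] = scaled similarity of image i to text j. Training maximizes the
    // diagonal (true pairs) with a symmetric cross-entropy over rows and columns.
    function similarityLogits(
      imageEmbeddings: number[][],
      textEmbeddings: number[][],
      logitScale = 100 // illustrative temperature; the real value is learned
    ): number[][] {
      const images = imageEmbeddings.map(l2Normalize);
      const texts = textEmbeddings.map(l2Normalize);
      return images.map((img) =>
        texts.map((txt) => logitScale * img.reduce((sum, x, k) => sum + x * txt[k], 0))
      );
    }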

    Mixpeek SDK Integration

    import { Mixpeek } from "mixpeek";
    
    const mx = new Mixpeek({ apiKey: "API_KEY" });
    
    // Upload and extract visual embeddings
    await mx.collections.ingest({
      collection_id: "my-collection",
      source: { url: "https://example.com/video.mp4" },
      feature_extractors: [{
        name: "image_embedding",
        version: "v1",
        params: {
          model_id: "openai/clip-vit-large-patch14"
        }
      }]
    });

    Capabilities

    • Zero-shot image classification without fine-tuning (see the sketch after this list)
    • Cross-modal text-to-image and image-to-text retrieval
    • 768-dimensional dense vector embeddings
    • Accepts 224x224 pixel input images, split into 14x14 pixel patches
    • Text encoder trained on English data; multilingual retrieval requires a multilingual CLIP variant
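
    As referenced in the zero-shot bullet above, the sketch below shows the standard zero-shot classification recipe: embed each candidate label as a text prompt, compare the prompts against the image embedding, and softmax over the similarities. The embeddings are assumed to come from the CLIP text and image encoders; the helper names and the temperature of 100 are illustrative, not part of any Mixpeek or CLIP API.

    // Zero-shot classification sketch: score an image embedding against text
    // embeddings of label prompts such as "a photo of a dog".
    const dot = (a: number[], b: number[]) => a.reduce((sum, x, i) => sum + x * b[i], 0);
    const norm = (v: number[]) => Math.sqrt(dot(v, v));
    const cosineSim = (a: number[], b: number[]) => dot(a, b) / (norm(a) * norm(b));

    function softmax(xs: number[]): number[] {
      const max = Math.max(...xs);
      const exps = xs.map((x) => Math.exp(x - max));
      const total = exps.reduce((a, b) => a + b, 0);
      return exps.map((e) => e / total);
    }

    // labelEmbeddings: CLIP text embeddings of prompts, one per candidate label.
    function zeroShotClassify(
      imageEmbedding: number[],
      labelEmbeddings: { label: string; embedding: number[] }[]
    ): { label: string; prob: number }[] {
      const scaled = labelEmbeddings.map((l) => 100 * cosineSim(imageEmbedding, l.embedding));
      const probs = softmax(scaled);
      return labelEmbeddings.map((l, i) => ({ label: l.label, prob: probs[i] }));
    }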

    Use Cases on Mixpeek

    • Visual search across video libraries — find frames matching natural language descriptions
    • Content moderation — detect brand logos, inappropriate content, or specific objects
    • E-commerce product matching — find visually similar products across catalogs
    • Media asset management — auto-tag and organize image/video archives

    Specification

    Framework: HF
    Organization: openai
    Feature: Visual Embeddings
    Output: 768-dim vector
    Modalities: video, image
    Retriever: Vector Search
    Parameters: 428M
    License: MIT
    Downloads/mo: 6.9M
    Likes: 1,966

    Research Paper

    Learning Transferable Visual Models From Natural Language Supervision

    arxiv.org

    Build a pipeline with clip-vit-large-patch14

    Add this model to a processing pipeline alongside other extractors. Combine with retrieval stages for end-to-end search.
