    HF · Visual Embeddings · MIT

    EVA02-CLIP-L-14-336

    by BAAI

    Enhanced CLIP visual encoder with masked image modeling pre-training at 336px resolution

    620K downloads/month
    430M parameters
    Identifiers
    Model ID
    BAAI/EVA02-CLIP-L-14-336
    Feature URI
    mixpeek://image_extractor@v1/baai_eva02_clip_large_v1

    Overview

    EVA02-CLIP-L-14-336 is a Vision Transformer CLIP model pre-trained with masked image modeling (MIM) to reconstruct language-aligned vision features, then fine-tuned with contrastive image-text learning. At 336px resolution with ~430M parameters, it achieves 80.4% zero-shot top-1 accuracy on ImageNet while using only ~1/6 the parameters and training data of the previous largest open-source CLIP.

    On Mixpeek, EVA02-CLIP provides high-quality visual embeddings with better efficiency than giant CLIP models, powering image and video frame search with strong zero-shot generalization across domains.
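The image and frame search described above comes down to nearest-neighbor lookup over embedding vectors. The following is a minimal sketch of brute-force cosine search in NumPy, not Mixpeek's retrieval implementation. The random vectors stand in for real 768-dimensional embeddings, and the helper names `l2_normalize` and `top_k` are hypothetical.

```python
import numpy as np

EMBED_DIM = 768  # embedding size listed for this model


def l2_normalize(x: np.ndarray) -> np.ndarray:
    """Scale vectors to unit length so dot products equal cosine similarity."""
    return x / np.linalg.norm(x, axis=-1, keepdims=True)


def top_k(query: np.ndarray, index: np.ndarray, k: int = 3) -> list[tuple[int, float]]:
    """Return (row, cosine similarity) for the k rows of `index` closest to `query`."""
    scores = l2_normalize(index) @ l2_normalize(query)
    best = np.argsort(-scores)[:k]  # highest similarity first
    return [(int(i), float(scores[i])) for i in best]


# Assumption: random vectors stand in for stored image embeddings.
rng = np.random.default_rng(0)
index = rng.standard_normal((1000, EMBED_DIM)).astype(np.float32)

# A slightly perturbed copy of row 42 plays the role of a query embedding.
query = index[42] + 0.1 * rng.standard_normal(EMBED_DIM).astype(np.float32)

hits = top_k(query, index)
print(hits)
```

Brute-force search like this is fine for small collections; larger ones typically rely on an approximate nearest-neighbor index instead.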

    Architecture

    EVA02 Vision Transformer (ViT-L/14) with 24 layers, pre-trained via masked image modeling with CLIP feature reconstruction targets, then fine-tuned with contrastive image-text learning over 6B seen image-text samples. Input resolution is 336x336 pixels with a patch size of 14.
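The input resolution and patch size determine how many tokens the transformer processes per image. A small sketch of that arithmetic follows; the single extra class token is an assumption about the token layout, and `vit_sequence_length` is a hypothetical helper.

```python
def vit_sequence_length(image_size: int, patch_size: int, extra_tokens: int = 1) -> int:
    """Number of tokens a ViT sees for a square image.

    extra_tokens=1 assumes a single class token is prepended (an assumption).
    """
    if image_size % patch_size != 0:
        raise ValueError("image_size must be divisible by patch_size")
    per_side = image_size // patch_size
    return per_side * per_side + extra_tokens


patches_per_side = 336 // 14                # 24 patches along each edge
seq_len_336 = vit_sequence_length(336, 14)  # 24 * 24 + 1 = 577 tokens
seq_len_224 = vit_sequence_length(224, 14)  # 16 * 16 + 1 = 257 tokens
print(patches_per_side, seq_len_336, seq_len_224)
```

Moving from 224px to 336px input more than doubles the sequence length, which is the main reason the higher resolution costs more compute per image.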

    Mixpeek SDK Integration

    from mixpeek import Mixpeek

    mx = Mixpeek(api_key="YOUR_KEY")

    # Ingest images from S3 and extract EVA02-CLIP embeddings for each one.
    mx.ingest(
        collection_id="image-archive",
        source="s3://images/",
        extractors=[{
            "type": "visual_embedding",
            "model": "BAAI/EVA02-CLIP-L-14-336",
            "output_feature": "image_embedding",  # field that stores the vector
        }],
    )

    Capabilities

    • 80.4% zero-shot ImageNet top-1 (best in class for L-scale)
    • MIM pre-training for robust visual features
    • 768-dimensional dense vector embeddings
    • 336px high-resolution input for fine-grained details
    • ~1/6 the parameters of the previous largest open-source CLIP

    Use Cases on Mixpeek

    • High-quality visual search across image and video collections
    • Zero-shot classification of visual content without fine-tuning
    • Visual embedding extraction where accuracy matters more than speed
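Zero-shot classification with a CLIP-style model usually means comparing an image embedding against text embeddings of candidate label prompts and taking a softmax over the similarities. The sketch below illustrates that step only. The random vectors stand in for real image and text embeddings, the `logit_scale` of 100 is an assumed value typical of CLIP models, and `zero_shot_probs` is a hypothetical helper.

```python
import numpy as np


def zero_shot_probs(image_emb: np.ndarray, text_embs: np.ndarray,
                    logit_scale: float = 100.0) -> np.ndarray:
    """Softmax over scaled cosine similarities between one image and N label prompts."""
    image_emb = image_emb / np.linalg.norm(image_emb)
    text_embs = text_embs / np.linalg.norm(text_embs, axis=1, keepdims=True)
    logits = logit_scale * (text_embs @ image_emb)
    logits = logits - logits.max()  # subtract the max for numerical stability
    probs = np.exp(logits)
    return probs / probs.sum()


labels = ["cat", "dog", "car"]

# Assumption: random vectors stand in for text embeddings of the label prompts.
rng = np.random.default_rng(1)
text_embs = rng.standard_normal((len(labels), 768))

# A stand-in image embedding constructed to lie close to the "dog" prompt.
image_emb = text_embs[1] + 0.5 * rng.standard_normal(768)

probs = zero_shot_probs(image_emb, text_embs)
predicted = labels[int(np.argmax(probs))]
print(predicted, probs)
```

In practice the text embeddings would come from the text encoder paired with this vision encoder, so that both live in the same embedding space.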

    Benchmarks

    Dataset             | Metric         | Score | Source
    ImageNet zero-shot  | Top-1 Accuracy | 80.4% | Fang et al., 2023 (EVA-CLIP paper)
    ImageNet fine-tuned | Top-1 Accuracy | 90.0% | Fang et al., 2023 (EVA-02 paper)
    ObjectNet           | Top-1 Accuracy | 72.3% | Fang et al., 2023 (EVA-CLIP paper)

    Performance

    Input Size: 336x336 px
    Embedding Dim: 768
    GPU Latency: ~10 ms / image (A100)
    CPU Latency: ~110 ms / image
    GPU Throughput: ~100 images/sec (A100)
    GPU Memory: ~1.8 GB
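These figures allow a rough capacity estimate for a collection. The sketch below assumes embeddings are stored as uncompressed float32 and that the ~100 images/sec A100 throughput is sustained on a single GPU; both are assumptions, and the helper names are hypothetical.

```python
BYTES_PER_FLOAT32 = 4


def index_size_bytes(num_vectors: int, dim: int = 768) -> int:
    """Raw storage for float32 embeddings, excluding any index overhead."""
    return num_vectors * dim * BYTES_PER_FLOAT32


def embedding_hours(num_images: int, images_per_sec: float = 100.0) -> float:
    """Wall-clock hours to embed a collection at a sustained throughput."""
    return num_images / images_per_sec / 3600


num_images = 1_000_000
size_gb = index_size_bytes(num_images) / 1e9  # 1M * 768 * 4 bytes = 3.072 GB
hours = embedding_hours(num_images)           # 10,000 s, roughly 2.78 hours
print(f"{size_gb:.3f} GB, {hours:.2f} h")
```

Real pipelines add image decoding, I/O, and index overhead on top of these numbers, so treat them as lower bounds.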

    Specification

    Framework: HF
    Organization: BAAI
    Feature: Visual Embeddings
    Output: 768-dim vector
    Modalities: video, image
    Retriever: Vector Search
    Parameters: 430M
    License: MIT
    Downloads/mo: 620K

    Research Paper

    EVA-02: A Visual Representation for Neon Genesis

    arxiv.org

    Build a pipeline with EVA02-CLIP-L-14-336

    Add this model to a processing pipeline alongside other extractors. Combine with retrieval stages for end-to-end search.
