EVA02-CLIP-L-14-336

by BAAI

Enhanced CLIP visual encoder with masked image modeling pre-training at 336px resolution

11K downloads/month

430M params


Identifiers

Model ID

BAAI/EVA02-CLIP-L-14-336

Feature URI

mixpeek://image_extractor@v1/baai_eva02_clip_large_v1

Overview

EVA02-CLIP-L-14-336 is a Vision Transformer CLIP model. Its image encoder is first pre-trained with masked image modeling (MIM) to reconstruct language-aligned vision features, and the full model is then trained with contrastive image-text learning. At 336px resolution with ~430M parameters, it achieves 80.4% zero-shot top-1 accuracy on ImageNet-1K while using only about 1/6 of the parameters and 1/6 of the image-text training data of the previous largest open-source CLIP.

On Mixpeek, EVA02-CLIP provides high-quality visual embeddings with better efficiency than giant CLIP models, powering image and video frame search with strong zero-shot generalization across domains.

Architecture

The image encoder is an EVA02 Vision Transformer (ViT-L/14) with 24 layers, pre-trained via masked image modeling with CLIP features as the reconstruction targets. It is then trained contrastively on image-text data, for about 6B seen samples in total. Input resolution is 336x336 pixels with a patch size of 14.
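To make the resolution and patch-size figures concrete, the sketch below works out how many tokens the encoder processes per image. It assumes a square input and a single class token prepended to the patch tokens (the standard ViT layout); `vitTokenCount` is a hypothetical helper written for this illustration, not part of the Mixpeek SDK.

```typescript
// Hypothetical helper (not an SDK function): sequence length of a ViT
// for a square image, assuming one extra class token by default.
function vitTokenCount(imageSize: number, patchSize: number, extraTokens = 1): number {
  if (imageSize % patchSize !== 0) {
    throw new Error("imageSize must be divisible by patchSize");
  }
  const patchesPerSide = imageSize / patchSize;
  return patchesPerSide * patchesPerSide + extraTokens;
}

const tokensAt336 = vitTokenCount(336, 14); // 24 * 24 + 1 = 577
const tokensAt224 = vitTokenCount(224, 14); // 16 * 16 + 1 = 257
console.log(tokensAt336, tokensAt224);
```

Under these assumptions, a 336px input yields a bit more than twice as many tokens as a 224px input with the same patch size. That gives the encoder finer spatial detail at a higher compute cost per image.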

Mixpeek SDK Integration

import { Mixpeek } from "mixpeek";

const mx = new Mixpeek({ apiKey: "API_KEY" });

// Managed: create a collection over a bucket; Mixpeek runs this model's extractor
const collection = await mx.collections.create({
  namespace_id: "my-namespace",
  collection_name: "my-collection",
  source: { type: "bucket", bucket_ids: ["bkt_your_bucket"] },
  feature_extractor: {
    feature_extractor_name: "visual_embedding",
    version: "v1",
    parameters: { model_id: "BAAI/EVA02-CLIP-L-14-336" },
  },
});

Capabilities

80.4% zero-shot ImageNet top-1 (best in class for L-scale)
MIM pre-training for robust visual features
768-dimensional dense vector embeddings
336px high-resolution input for fine-grained details
About 1/6 the parameters of the previous largest open-source CLIP model
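Dense embeddings like these are typically compared with cosine similarity. The following is a minimal sketch of ranking stored vectors against a query vector. It is a generic illustration, not how Mixpeek implements retrieval: `IndexedItem`, `cosineSimilarity`, and `topK` are hypothetical names, and the 4-dimensional toy vectors stand in for the model's 768-dimensional output.

```typescript
// Hypothetical shape for a stored embedding (illustrative only).
interface IndexedItem {
  id: string;
  embedding: number[];
}

// Cosine similarity between two equal-length, non-zero vectors.
function cosineSimilarity(a: number[], b: number[]): number {
  if (a.length !== b.length) {
    throw new Error("embedding dimensions do not match");
  }
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  if (normA === 0 || normB === 0) {
    throw new Error("cannot compare a zero vector");
  }
  return dot / Math.sqrt(normA * normB);
}

// Brute-force ranking: score every item, sort descending, keep the top k.
function topK(query: number[], items: IndexedItem[], k: number): { id: string; score: number }[] {
  return items
    .map((item) => ({ id: item.id, score: cosineSimilarity(query, item.embedding) }))
    .sort((x, y) => y.score - x.score)
    .slice(0, k);
}

// Toy data: real vectors from this model would have 768 entries.
const queryVector = [1, 0, 0, 0];
const toyIndex: IndexedItem[] = [
  { id: "a", embedding: [0.9, 0.1, 0, 0] },
  { id: "b", embedding: [0, 1, 0, 0] },
  { id: "c", embedding: [0.5, 0.5, 0, 0] },
];
const ranked = topK(queryVector, toyIndex, 2);
console.log(ranked.map((r) => r.id)); // "a" first, then "c"
```

A brute-force scan like this is only practical for small collections. Production systems usually rely on an approximate nearest-neighbor index instead.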

Use Cases on Mixpeek

High-quality visual search across image and video collections

Zero-shot classification of visual content without fine-tuning

Visual embedding extraction where accuracy matters more than speed
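Zero-shot classification with a CLIP-style model usually works by embedding a text prompt per candidate label, comparing each with the image embedding, and applying a softmax over the scaled similarities. The sketch below illustrates that idea under several assumptions: the label embeddings come from the model's paired text encoder (how to obtain them is not shown here), the logit scale of 100 is a common CLIP value rather than a confirmed setting for this model, and all function names are hypothetical.

```typescript
// Hypothetical shape for a label and its text-prompt embedding.
interface LabelEmbedding {
  label: string;
  embedding: number[];
}

// Scale a vector to unit length so dot products equal cosine similarity.
function l2Normalize(v: number[]): number[] {
  const norm = Math.sqrt(v.reduce((s, x) => s + x * x, 0));
  if (norm === 0) throw new Error("cannot normalize a zero vector");
  return v.map((x) => x / norm);
}

// Numerically stable softmax (subtracts the max logit before exponentiating).
function softmax(logits: number[]): number[] {
  const max = Math.max(...logits);
  const exps = logits.map((x) => Math.exp(x - max));
  const sum = exps.reduce((s, x) => s + x, 0);
  return exps.map((x) => x / sum);
}

// Returns labels sorted by probability, highest first.
// logitScale = 100 is an assumption, not a documented value for this model.
function zeroShotClassify(
  imageEmbedding: number[],
  labels: LabelEmbedding[],
  logitScale = 100
): { label: string; probability: number }[] {
  const image = l2Normalize(imageEmbedding);
  const logits = labels.map((l) => {
    const text = l2Normalize(l.embedding);
    const similarity = image.reduce((s, x, i) => s + x * text[i], 0);
    return logitScale * similarity;
  });
  const probabilities = softmax(logits);
  return labels
    .map((l, i) => ({ label: l.label, probability: probabilities[i] }))
    .sort((x, y) => y.probability - x.probability);
}

// Toy 3-dimensional vectors; real embeddings would be 768-dimensional.
const predictions = zeroShotClassify(
  [0.8, 0.6, 0],
  [
    { label: "a photo of a cat", embedding: [1, 0, 0] },
    { label: "a photo of a dog", embedding: [0, 1, 0] },
    { label: "a photo of a car", embedding: [0, 0, 1] },
  ]
);
console.log(predictions[0].label); // "a photo of a cat"
```

Because no classifier is trained, the label set can be changed at query time simply by embedding different prompts.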

Benchmarks

Dataset	Metric	Score	Source
ImageNet zero-shot	Top-1 Accuracy	80.4%	Fang et al., 2023: EVA-CLIP paper
ImageNet fine-tuned	Top-1 Accuracy	90.0%	Fang et al., 2023: EVA-02 paper
ObjectNet	Top-1 Accuracy	72.3%	Fang et al., 2023: EVA-CLIP paper