EVA02-CLIP-L-14-336
by BAAI
Enhanced CLIP visual encoder with masked image modeling pre-training at 336px resolution
BAAI/EVA02-CLIP-L-14-336
mixpeek://image_extractor@v1/baai_eva02_clip_large_v1
Overview
EVA02-CLIP-L-14-336 is a Vision Transformer CLIP model pre-trained with masked image modeling (MIM) to reconstruct language-aligned vision features, then fine-tuned with contrastive image-text learning. At 336px resolution with ~430M parameters, it achieves 80.4% zero-shot top-1 accuracy on ImageNet while using only ~1/6 the parameters and training data of the previous largest open-source CLIP.
On Mixpeek, EVA02-CLIP provides high-quality visual embeddings with better efficiency than giant CLIP models, powering image and video frame search with strong zero-shot generalization across domains.
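Zero-shot classification with a CLIP-style model usually means comparing one image embedding against the text embeddings of a few candidate prompts and taking a softmax over the scaled similarities. The sketch below illustrates that idea with toy 4-dimensional vectors standing in for real embeddings. The function names, the prompts, and the `logit_scale` of 100 are assumptions made for illustration, not values taken from this model or from the Mixpeek SDK.

```python
import math

def normalize(v):
    """Scale a vector to unit length so dot products become cosine similarities."""
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v]

def zero_shot_probs(image_emb, text_embs, logit_scale=100.0):
    """Softmax over scaled cosine similarities between one image and N prompts.

    logit_scale=100.0 is an assumed value for illustration.
    """
    img = normalize(image_emb)
    logits = [
        logit_scale * sum(a * b for a, b in zip(img, normalize(t)))
        for t in text_embs
    ]
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(l - m) for l in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Toy stand-ins for prompt and image embeddings (hypothetical values).
labels = ["a photo of a cat", "a photo of a dog", "a photo of a car"]
text_embs = [
    [1.0, 0.1, 0.0, 0.0],
    [0.1, 1.0, 0.0, 0.0],
    [0.0, 0.0, 1.0, 0.2],
]
image_emb = [0.9, 0.3, 0.05, 0.0]

probs = zero_shot_probs(image_emb, text_embs)
best = labels[probs.index(max(probs))]
print(best)  # a photo of a cat
```

No labeled training images are involved: changing the label set only requires embedding a different list of prompts.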
Architecture
The visual encoder is an EVA02 Vision Transformer (ViT-L/14) with 24 layers, pre-trained via masked image modeling with CLIP features as reconstruction targets. It is then fine-tuned with contrastive image-text learning on 6B image-text pairs. Inputs are 336x336 pixels, split into patches of size 14.
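The input resolution and patch size together determine how many tokens the transformer processes per image. A quick sketch of that arithmetic, assuming non-overlapping square patches and a single prepended class token (the helper name `patch_grid` is illustrative and not part of any library):

```python
def patch_grid(image_size: int = 336, patch_size: int = 14) -> tuple[int, int]:
    """Return (patches_per_side, total_patches) for a square ViT input."""
    if image_size % patch_size != 0:
        raise ValueError("image_size must be divisible by patch_size")
    per_side = image_size // patch_size
    return per_side, per_side * per_side

per_side, num_patches = patch_grid()
seq_len = num_patches + 1  # assumption: one class token is prepended
print(per_side, num_patches, seq_len)  # 24 576 577
```

For comparison, the same patch size at 224px gives a 16x16 grid of 256 patches, so the 336px input carries 2.25 times as many patch tokens.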
Mixpeek SDK Integration
from mixpeek import Mixpeek

mx = Mixpeek(api_key="YOUR_KEY")

mx.ingest(
    collection_id="image-archive",
    source="s3://images/",
    extractors=[
        {
            "type": "visual_embedding",
            "model": "BAAI/EVA02-CLIP-L-14-336",
            "output_feature": "image_embedding",
        }
    ],
)
Capabilities
- 80.4% zero-shot ImageNet top-1 (best in class for L-scale)
- MIM pre-training for robust visual features
- 768-dimensional dense vector embeddings
- 336px high-resolution input for fine-grained details
- Roughly 1/6 the parameters of the previous largest open-source CLIP
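Dense 768-dimensional embeddings like these are typically searched by cosine similarity. The following is a minimal sketch of that ranking step using only the standard library, with random vectors standing in for real image embeddings. The helper names and the in-memory `index` dictionary are hypothetical and do not reflect how Mixpeek stores or retrieves vectors.

```python
import math
import random

DIM = 768  # embedding width listed in the capabilities above

def l2_normalize(vec):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in vec]

def cosine_top_k(query, index, k=3):
    """Rank stored vectors by cosine similarity to the query, best first."""
    q = l2_normalize(query)
    scored = []
    for item_id, vec in index.items():
        v = l2_normalize(vec)
        scored.append((sum(a * b for a, b in zip(q, v)), item_id))
    scored.sort(reverse=True)
    return [(item_id, score) for score, item_id in scored[:k]]

# Random stand-ins for real embeddings (hypothetical data).
rng = random.Random(0)
index = {f"img_{i}": [rng.gauss(0, 1) for _ in range(DIM)] for i in range(5)}

# A query that is img_2 plus a small amount of noise.
query = [x + 0.01 * rng.gauss(0, 1) for x in index["img_2"]]

results = cosine_top_k(query, index, k=3)
print(results[0][0])  # img_2
```

A brute-force scan like this is fine for small collections; larger ones generally rely on an approximate nearest-neighbor index instead.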
Use Cases on Mixpeek
Benchmarks
| Dataset | Metric | Score | Source |
|---|---|---|---|
| ImageNet zero-shot | Top-1 Accuracy | 80.4% | Fang et al., 2023 — EVA-CLIP paper |
| ImageNet fine-tuned | Top-1 Accuracy | 90.0% | Fang et al., 2023 — EVA-02 paper |
| ObjectNet | Top-1 Accuracy | 72.3% | Fang et al., 2023 — EVA-CLIP paper |
Performance
Specification
Research Paper
EVA-02: A Visual Representation for Neon Genesis
arxiv.org
Build a pipeline with EVA02-CLIP-L-14-336
Add this model to a processing pipeline alongside other extractors. Combine with retrieval stages for end-to-end search.