AIMv2-large-patch14-native
by apple
Multimodal autoregressive vision encoder outperforming CLIP and SigLIP on understanding tasks
apple/AIMv2-large-patch14-native
mixpeek://image_extractor@v1/apple_aimv2_large_v1
Overview
AIMv2 Large is Apple's 309M-parameter vision encoder, pre-trained with a multimodal autoregressive objective: the encoder is paired with a decoder that autoregressively generates raw image patches and text tokens. Unlike contrastive models such as CLIP, AIMv2 learns fine-grained visual features through generative pre-training, and it outperforms both CLIP and SigLIP on multimodal understanding benchmarks.
On Mixpeek, AIMv2 provides high-quality visual feature extraction for downstream tasks like classification, grounding, and retrieval. Its native resolution variant accepts variable-size images without resizing artifacts, making it particularly effective for document images, satellite imagery, and other content where resolution matters.
Architecture
Vision Transformer with 24 layers, a 1024-dimensional hidden size, 8 attention heads, and a patch size of 14, for 309M parameters in total. Pre-trained with a multimodal autoregressive objective using a paired multimodal decoder that generates both image patches and text tokens. The native resolution variant supports variable input sizes without fixed resizing.
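To make the effect of variable input sizes concrete, the sketch below estimates how many patch tokens a patch-14 Vision Transformer would see for a given image. It is an illustration, not Mixpeek or Apple code: the helper name `patchGrid` is hypothetical, and it assumes partial patches at the right and bottom edges are dropped, which may differ from the model's actual preprocessing.

```typescript
// Hypothetical helper (not part of any SDK): estimate the patch-token grid
// a patch-14 ViT would produce for an image of arbitrary size.
const PATCH_SIZE = 14; // patch size from the architecture description

interface PatchGrid {
  rows: number;   // patches along the image height
  cols: number;   // patches along the image width
  tokens: number; // total patch tokens (rows * cols)
}

function patchGrid(
  heightPx: number,
  widthPx: number,
  patch: number = PATCH_SIZE
): PatchGrid {
  if (heightPx < patch || widthPx < patch) {
    throw new RangeError(`image must be at least ${patch}x${patch} pixels`);
  }
  // Assumption: leftover pixels that do not fill a whole patch are dropped.
  const rows = Math.floor(heightPx / patch);
  const cols = Math.floor(widthPx / patch);
  return { rows, cols, tokens: rows * cols };
}

// A 224x224 image yields a 16x16 grid; a 448x896 image yields a 32x64 grid.
const square = patchGrid(224, 224);
const wide = patchGrid(448, 896);
console.log(square.tokens, wide.tokens); // 256 2048
```

Because the token count grows with image area, larger native-resolution inputs cost more compute per image than a fixed 224x224 crop.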
Mixpeek SDK Integration
import { Mixpeek } from "mixpeek";

const mx = new Mixpeek({ apiKey: "API_KEY" });

await mx.collections.ingest({
  collection_id: "my-collection",
  source: { url: "https://example.com/satellite-image.tiff" },
  feature_extractors: [
    {
      name: "image_embedding",
      version: "v1",
      params: {
        model_id: "apple/AIMv2-large-patch14-native"
      }
    }
  ]
});
Capabilities
- Outperforms CLIP and SigLIP on multimodal understanding benchmarks
- Native resolution input without resizing artifacts
- 1024-dimensional feature representations
- Strong transfer to localization, grounding, and classification
- Outperforms DINOv2 on open-vocabulary detection
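As a rough illustration of how 1024-dimensional features can feed a retrieval step, the sketch below ranks candidate images by cosine similarity to a query vector. It assumes each image has already been reduced to a single pooled 1024-dimensional vector; the names `cosineSimilarity`, `rankBySimilarity`, and `unitVector` are hypothetical helpers, and the vectors are synthetic placeholders rather than real model output.

```typescript
const FEATURE_DIM = 1024; // feature size listed under Capabilities

interface Candidate {
  id: string;
  vector: number[];
}

// Cosine similarity: dot product divided by the product of the vector norms.
function cosineSimilarity(a: number[], b: number[]): number {
  if (a.length !== b.length) {
    throw new Error("vectors must have the same length");
  }
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  if (normA === 0 || normB === 0) {
    throw new Error("cosine similarity is undefined for a zero vector");
  }
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

// Score every candidate against the query and sort best match first.
function rankBySimilarity(
  query: number[],
  candidates: Candidate[]
): Array<{ id: string; score: number }> {
  return candidates
    .map((c) => ({ id: c.id, score: cosineSimilarity(query, c.vector) }))
    .sort((x, y) => y.score - x.score);
}

// Synthetic placeholder: a vector with a single 1 at the given index.
function unitVector(index: number): number[] {
  const v = new Array<number>(FEATURE_DIM).fill(0);
  v[index] = 1;
  return v;
}

const ranked = rankBySimilarity(unitVector(0), [
  { id: "orthogonal", vector: unitVector(1) },
  { id: "same-direction", vector: unitVector(0) },
]);
console.log(ranked.map((r) => r.id)); // same-direction first
```

In a real pipeline the similarity search would typically run inside a vector index rather than in application code; the loop above only shows the scoring logic.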
Benchmarks
| Dataset | Metric | Score | Source |
|---|---|---|---|
| ImageNet-1k (frozen trunk, 3B variant) | Top-1 Accuracy | 89.5% | Fini et al., 2024 — arXiv 2411.14402 |
| Multimodal understanding (avg) | Score | Outperforms CLIP ViT-L & SigLIP | Fini et al., 2024 — arXiv 2411.14402 |
Research Paper
Multimodal Autoregressive Pre-training of Large Vision Encoders
arxiv.org
Build a pipeline with AIMv2-large-patch14-native
Add this model to a processing pipeline alongside other extractors. Combine with retrieval stages for end-to-end search.