aimv2-large-patch14-native

by apple

Multimodal autoregressive vision encoder outperforming CLIP and SigLIP on understanding tasks

343 downloads/month

16 likes

309M params

Identifiers

Model ID

apple/aimv2-large-patch14-native

Feature URI

mixpeek://image_extractor@v1/apple_aimv2_large_v1

Overview

AIMv2 Large is Apple's 309M-parameter vision encoder. It is pre-trained with a multimodal autoregressive objective: the encoder is paired with a decoder that autoregressively generates raw image patches and text tokens. Unlike contrastive models such as CLIP, AIMv2 learns fine-grained visual features through this generative pre-training, and Fini et al. (2024) report that it outperforms both CLIP and SigLIP on multimodal understanding benchmarks.

On Mixpeek, AIMv2 provides high-quality visual feature extraction for downstream tasks like classification, grounding, and retrieval. Its native resolution variant accepts variable-size images without resizing artifacts, making it particularly effective for document images, satellite imagery, and other content where resolution matters.

Architecture

Vision Transformer with 24 layers, a 1024-dimensional hidden size, 8 attention heads, and a patch size of 14, for 309M parameters in total. It is pre-trained with a multimodal autoregressive objective using a paired decoder that generates image patches and text tokens. The native-resolution variant supports variable input sizes without resizing to a fixed resolution.
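
Because the patch size is fixed at 14 while the input size varies, the number of patch tokens the encoder processes grows with image area. The sketch below estimates that token count for a given image size. The `patchGrid` helper is hypothetical and not part of any SDK. It also assumes that partial patches at the right and bottom edges are dropped, which may not match how the actual preprocessing handles dimensions that are not multiples of 14.

```typescript
// Patch size taken from the architecture description above.
const PATCH_SIZE = 14;

interface PatchGrid {
  cols: number;   // patches along the width
  rows: number;   // patches along the height
  tokens: number; // total patch tokens (cols * rows)
}

// Hypothetical helper. Assumption: edge pixels that do not fill a whole
// patch are discarded (floor division); real preprocessing may differ.
function patchGrid(
  width: number,
  height: number,
  patchSize: number = PATCH_SIZE
): PatchGrid {
  if (width < patchSize || height < patchSize) {
    throw new RangeError("image must be at least one patch wide and tall");
  }
  const cols = Math.floor(width / patchSize);
  const rows = Math.floor(height / patchSize);
  return { cols, rows, tokens: cols * rows };
}

const square = patchGrid(224, 224); // 16 x 16 grid
const wide = patchGrid(1400, 700);  // 100 x 50 grid
console.log(square.tokens, wide.tokens);
```

Token count is a useful rough proxy for compute and memory cost, so a check like this can help decide whether very large document scans or satellite tiles should be split before ingestion.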

Mixpeek SDK Integration

import { Mixpeek } from "mixpeek";

const mx = new Mixpeek({ apiKey: "API_KEY" });

await mx.collections.ingest({
  collection_id: "my-collection",
  source: { url: "https://example.com/satellite-image.tiff" },
  feature_extractors: [{
    name: "image_embedding",
    version: "v1",
    params: {
      model_id: "apple/aimv2-large-patch14-native"
    }
  }]
});

Capabilities

Outperforms CLIP and SigLIP on multimodal understanding benchmarks
Native resolution input without resizing artifacts
1024-dimensional feature representations
Strong transfer to localization, grounding, and classification
Outperforms DINOv2 on open-vocabulary detection
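
For retrieval, feature vectors like the 1024-dimensional representations listed above are commonly compared with cosine similarity. The following is a minimal sketch of that comparison, not Mixpeek's retrieval implementation. It assumes each image has already been reduced to a single vector (for example by pooling patch features). The function names are hypothetical, and the toy vectors are 3-dimensional for readability.

```typescript
// Cosine similarity between two equal-length feature vectors.
function cosineSimilarity(a: number[], b: number[]): number {
  if (a.length !== b.length) {
    throw new Error("vectors must have the same length");
  }
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  // Treat a zero vector as having no similarity to anything.
  if (normA === 0 || normB === 0) return 0;
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

// Returns candidate indices ordered from most to least similar to the query.
function rankBySimilarity(query: number[], candidates: number[][]): number[] {
  return candidates
    .map((vec, index) => ({ index, score: cosineSimilarity(query, vec) }))
    .sort((x, y) => y.score - x.score)
    .map((entry) => entry.index);
}

// Toy example; real AIMv2 Large features would have 1024 entries each.
const query = [1, 0, 0];
const candidates = [
  [0, 1, 0],   // orthogonal to the query
  [1, 0, 0],   // identical direction
  [1, 1, 0],   // partially aligned
];
console.log(rankBySimilarity(query, candidates));
```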

Use Cases on Mixpeek

High-fidelity visual feature extraction for document images and satellite imagery at native resolution

Visual search backbone replacing CLIP for higher accuracy on understanding tasks

Open-vocabulary object detection and referring expression grounding on video frames

Benchmarks

Dataset	Metric	Score	Source
ImageNet-1k (frozen trunk, 3B variant)	Top-1 Accuracy	89.5%	Fini et al., 2024 — arXiv 2411.14402
Multimodal understanding (avg)	Score	Outperforms CLIP ViT-L & SigLIP	Fini et al., 2024 — arXiv 2411.14402