    HF · Visual Embeddings · Apple Sample Code License

    AIMv2-large-patch14-native

    by apple

    Multimodal autoregressive vision encoder outperforming CLIP and SigLIP on understanding tasks

    380K downloads/month
    309M params
    Identifiers
    Model ID
    apple/AIMv2-large-patch14-native
    Feature URI
    mixpeek://image_extractor@v1/apple_aimv2_large_v1

    Overview

    AIMv2 Large is Apple's 309M-parameter vision encoder, pre-trained with a multimodal autoregressive objective: the encoder is paired with a decoder that autoregressively generates raw image patches and text tokens. Unlike contrastive models such as CLIP, AIMv2 learns fine-grained visual features through generative pre-training, and it outperforms both CLIP and SigLIP on multimodal understanding benchmarks.

    On Mixpeek, AIMv2 provides high-quality visual feature extraction for downstream tasks like classification, grounding, and retrieval. Its native resolution variant accepts variable-size images without resizing artifacts, making it particularly effective for document images, satellite imagery, and other content where resolution matters.

    Architecture

    Vision Transformer with 24 layers, a 1024-dim hidden size, 8 attention heads, and patch size 14, for 309M parameters in total. Pre-trained with a multimodal autoregressive objective using a paired multimodal decoder. The native resolution variant supports variable input sizes without fixed resizing.
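
    To make the patch-size figure concrete, here is a minimal sketch of how many patch tokens a variable-size image would produce. The helper `patchGrid` is hypothetical, and it assumes each side is truncated to a whole number of 14-pixel patches; the model's actual preprocessing may pad or resize instead.

```typescript
// Sketch only: counts 14x14 patches for a variable-size image.
// Assumption (not from the model card): each side is truncated
// down to a multiple of the patch size.
const PATCH_SIZE = 14; // patch size stated in the Architecture section

interface PatchGrid {
  cols: number;
  rows: number;
  tokens: number;
}

function patchGrid(
  widthPx: number,
  heightPx: number,
  patchSize: number = PATCH_SIZE
): PatchGrid {
  if (widthPx < patchSize || heightPx < patchSize) {
    throw new RangeError("image must be at least one patch on each side");
  }
  const cols = Math.floor(widthPx / patchSize);
  const rows = Math.floor(heightPx / patchSize);
  return { cols, rows, tokens: cols * rows };
}

const square = patchGrid(224, 224);  // 16 x 16 grid
const page = patchGrid(1240, 1754);  // roughly an A4 scan at 150 dpi, 88 x 125 grid
console.log(square.tokens, page.tokens);
```

    Because the token count grows with image area, a full-page scan yields far more tokens than a 224-pixel thumbnail, so compute per image rises accordingly.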

    Mixpeek SDK Integration

    import { Mixpeek } from "mixpeek";

    // Replace "API_KEY" with your own key.
    const mx = new Mixpeek({ apiKey: "API_KEY" });

    // Ingest one image and extract AIMv2 embeddings for it.
    await mx.collections.ingest({
      collection_id: "my-collection",
      source: { url: "https://example.com/satellite-image.tiff" },
      feature_extractors: [{
        name: "image_embedding",
        version: "v1",
        params: {
          model_id: "apple/AIMv2-large-patch14-native"
        }
      }]
    });
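
    When ingesting several images, it may help to build the request objects separately from the SDK call. The sketch below only constructs payloads with the same field names as the snippet above; `buildIngestRequest` and the `IngestRequest` type are hypothetical helpers, not part of the Mixpeek SDK, and the second URL is a made-up example.

```typescript
// Hypothetical helper: field names mirror the SDK snippet above,
// but this type and function are NOT part of the Mixpeek SDK.
interface IngestRequest {
  collection_id: string;
  source: { url: string };
  feature_extractors: Array<{
    name: string;
    version: string;
    params: { model_id: string };
  }>;
}

const AIMV2_MODEL_ID = "apple/AIMv2-large-patch14-native";

function buildIngestRequest(collectionId: string, url: string): IngestRequest {
  // Reject obviously malformed URLs before anything is sent to an API.
  if (!/^https?:\/\/\S+$/.test(url)) {
    throw new Error("expected an http(s) URL, got: " + url);
  }
  return {
    collection_id: collectionId,
    source: { url: url },
    feature_extractors: [
      { name: "image_embedding", version: "v1", params: { model_id: AIMV2_MODEL_ID } },
    ],
  };
}

// Illustrative URLs only.
const urls = [
  "https://example.com/satellite-image.tiff",
  "https://example.com/scanned-invoice.png",
];
const requests = urls.map((u) => buildIngestRequest("my-collection", u));
console.log(JSON.stringify(requests[0], null, 2));
```

    Each element of `requests` has the shape passed to `mx.collections.ingest(...)` in the snippet above, so the objects could be sent one at a time in a loop.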

    Capabilities

    • Outperforms CLIP and SigLIP on multimodal understanding benchmarks
    • Native resolution input without resizing artifacts
    • 1024-dimensional feature representations
    • Strong transfer to localization, grounding, and classification
    • Outperforms DINOv2 on open-vocabulary detection

    Use Cases on Mixpeek

    • High-fidelity visual feature extraction for document images and satellite imagery at native resolution
    • Visual search backbone replacing CLIP for higher accuracy on understanding tasks
    • Open-vocabulary object detection and referring expression grounding on video frames
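
    As an illustration of the visual search use case, the following sketch ranks stored 1024-dimensional embeddings against a query embedding by cosine similarity. It is a brute-force toy, not Mixpeek's retriever; the scoring function, the `topK` helper, and the synthetic vectors are all assumptions made for the example.

```typescript
// Sketch: brute-force nearest-neighbour search over image embeddings.
// Assumption: cosine similarity is used for scoring; the platform's
// actual retriever may score and index differently.
const EMBEDDING_DIM = 1024; // embedding size from the model card

interface IndexedImage {
  id: string;
  vector: number[];
}

function cosineSimilarity(a: number[], b: number[]): number {
  if (a.length !== b.length) throw new Error("dimension mismatch");
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  if (normA === 0 || normB === 0) return 0; // avoid dividing by zero
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

function topK(query: number[], index: IndexedImage[], k: number) {
  return index
    .map((item) => ({ id: item.id, score: cosineSimilarity(query, item.vector) }))
    .sort((x, y) => y.score - x.score)
    .slice(0, k);
}

// Synthetic vectors standing in for real embeddings.
function sparseVector(entries: Array<[number, number]>): number[] {
  const v: number[] = [];
  for (let i = 0; i < EMBEDDING_DIM; i++) v.push(0);
  for (const [pos, value] of entries) v[pos] = value;
  return v;
}

const index: IndexedImage[] = [
  { id: "a", vector: sparseVector([[0, 1]]) },
  { id: "b", vector: sparseVector([[1, 1]]) },
  { id: "c", vector: sparseVector([[2, 1]]) },
];
const query = sparseVector([[0, 1], [1, 0.5]]);
const results = topK(query, index, 2);
console.log(results.map((r) => r.id)); // closest first
```

    A linear scan like this is only practical for small collections; larger ones would normally rely on an approximate nearest-neighbour index.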

    Benchmarks

    Dataset | Metric | Score | Source
    ImageNet-1k (frozen trunk, 3B variant) | Top-1 Accuracy | 89.5% | Fini et al., 2024, arXiv 2411.14402
    Multimodal understanding (avg) | Score | Outperforms CLIP ViT-L & SigLIP | Fini et al., 2024, arXiv 2411.14402

    Performance

    Input Size: Native resolution (variable, patch size 14)
    Embedding Dim: 1024
    GPU Latency: ~10 ms / image (A100)
    GPU Throughput: ~100 images/sec (A100)
    GPU Memory: ~1.5 GB
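
    The figures above can be turned into a rough capacity estimate. The sketch below assumes a single A100 sustaining the quoted ~100 images/sec and embeddings stored as float32 (4 bytes per dimension); both the storage format and the `estimateJob` helper are assumptions for illustration, and real throughput will vary with image resolution.

```typescript
// Back-of-the-envelope sizing from the Performance figures.
const IMAGES_PER_SEC = 100;   // ~100 images/sec on an A100 (approximate)
const EMBEDDING_DIM = 1024;   // embedding size from the model card
const BYTES_PER_FLOAT32 = 4;  // assumption: vectors stored as float32

interface JobEstimate {
  minutes: number;    // wall-clock time on one GPU
  storageMiB: number; // raw vector storage, excluding index overhead
}

function estimateJob(imageCount: number): JobEstimate {
  const seconds = imageCount / IMAGES_PER_SEC;
  const storageBytes = imageCount * EMBEDDING_DIM * BYTES_PER_FLOAT32;
  return {
    minutes: seconds / 60,
    storageMiB: storageBytes / (1024 * 1024),
  };
}

const est = estimateJob(1000000);
console.log(est.minutes.toFixed(1) + " min, " + est.storageMiB.toFixed(2) + " MiB");
```

    Under these assumptions, one million images take a little under three hours on one GPU and need about 3.8 GiB of raw vector storage.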

    Specification

    Framework: HF
    Organization: apple
    Feature: Visual Embeddings
    Output: 1024-dim vector
    Modalities: video, image
    Retriever: Vector Search
    Parameters: 309M
    License: Apple Sample Code License
    Downloads/mo: 380K

    Research Paper

    Multimodal Autoregressive Pre-training of Large Vision Encoders

    arxiv.org

    Build a pipeline with AIMv2-large-patch14-native

    Add this model to a processing pipeline alongside other extractors. Combine with retrieval stages for end-to-end search.
