    HF · Segmentation · Apache 2.0

    sam2.1-hiera-large

    by facebook

    Unified promptable segmentation for images and video with streaming memory

    1.8M downloads/month
    224.4M params
    Identifiers
    Model ID: facebook/sam2.1-hiera-large
    Feature URI: mixpeek://image_extractor@v1/facebook_sam2_large_v1
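The feature URI above appears to follow an `extractor@version/feature` layout. The sketch below splits it into those parts; the layout is inferred from this single example rather than from a documented grammar, and `parseFeatureUri` and `FeatureUri` are hypothetical names, not part of the Mixpeek SDK.

```typescript
// Assumed layout: mixpeek://<extractor>@<version>/<feature>
// (inferred from one example URI; not a documented grammar).
type FeatureUri = { extractor: string; version: string; feature: string };

// Hypothetical helper: returns null when the string does not match the assumed layout.
function parseFeatureUri(uri: string): FeatureUri | null {
  const match = /^mixpeek:\/\/([^@\/]+)@([^\/]+)\/(.+)$/.exec(uri);
  if (!match) return null;
  const [, extractor, version, feature] = match;
  return { extractor, version, feature };
}

const parsed = parseFeatureUri("mixpeek://image_extractor@v1/facebook_sam2_large_v1");
console.log(parsed);
```

A regular expression is used instead of the built-in `URL` class because `URL` would interpret the part before `@` as a username.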

    Overview

    SAM 2 extends SAM to video with a streaming memory architecture for real-time processing. On images it is 6x faster than SAM and more accurate, and it is the first foundation model to segment and track objects across video frames from prompts.

    On Mixpeek, SAM 2 enables video-native segmentation: track objects across frames, segment specific items at any point in a video, and extract per-object features over time.

    Architecture

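    SAM 2.1 pairs a Hiera image encoder with a streaming memory that supplies temporal context. The Large variant has 224.4M parameters and runs at 39.5 FPS on an A100. Memory attention modules propagate masks across frames by conditioning each new frame on stored memories, without re-running the image encoder on earlier frames.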
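To make the streaming idea concrete, here is a toy sketch of a fixed-size, first-in-first-out memory bank: each frame is processed once, sees only the memories of a few recent frames, and then contributes its own memory. This is an illustration of the general pattern under simplifying assumptions, not the SAM 2 implementation; `StreamingMemoryBank`, `FrameMemory`, and `processStream` are hypothetical names, and the real model stores learned feature maps rather than the empty placeholder used here.

```typescript
// Toy illustration of a streaming FIFO memory bank (not the SAM 2 code).
type FrameMemory = { frameIndex: number; maskSummary: number[] };

class StreamingMemoryBank {
  private entries: FrameMemory[] = [];
  private readonly capacity: number;

  constructor(capacity: number) {
    if (capacity < 1) throw new Error("capacity must be at least 1");
    this.capacity = capacity;
  }

  // Store the newest frame's memory, evicting the oldest one when full.
  push(entry: FrameMemory): void {
    this.entries.push(entry);
    if (this.entries.length > this.capacity) this.entries.shift();
  }

  // Memories currently available as temporal context.
  context(): readonly FrameMemory[] {
    return this.entries;
  }
}

// Visit frames in order and record which earlier frames each one can attend to.
function processStream(frameCount: number, capacity: number): number[][] {
  const bank = new StreamingMemoryBank(capacity);
  const contextPerFrame: number[][] = [];
  for (let i = 0; i < frameCount; i++) {
    contextPerFrame.push(bank.context().map((m) => m.frameIndex));
    bank.push({ frameIndex: i, maskSummary: [] }); // placeholder memory
  }
  return contextPerFrame;
}

console.log(processStream(4, 2)); // [[], [0], [0, 1], [1, 2]]
```

Because the bank has a fixed capacity, the per-frame cost stays bounded no matter how long the video is, which is what makes sequential, real-time processing feasible.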

    Mixpeek SDK Integration

    import { Mixpeek } from "mixpeek";

    // Replace "API_KEY" with your own key.
    const mx = new Mixpeek({ apiKey: "API_KEY" });

    // Ingest a video and run the segmentation extractor with SAM 2.1 Large.
    await mx.collections.ingest({
      collection_id: "my-collection",
      source: { url: "https://example.com/video.mp4" },
      feature_extractors: [{
        name: "segmentation",
        version: "v1",
        params: { model_id: "facebook/sam2.1-hiera-large" }
      }]
    });

    Capabilities

    • Video object segmentation and tracking
    • 6x faster than SAM on images
    • Streaming memory architecture for real-time video
    • Multi-object tracking with mask propagation
    • Image segmentation with improved accuracy

    Use Cases on Mixpeek

    • Video object tracking and segmentation across frames
    • Real-time content understanding in video streams
    • Per-object feature extraction in video pipelines
    • Interactive video annotation and editing
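As a rough idea of what per-object feature extraction can look like downstream, the sketch below derives two simple features (pixel area and bounding box) from one binary mask. The mask representation (a 2D array of 0/1 values) and the names `maskStats`, `MaskStats`, and `BBox` are assumptions made for illustration; they do not describe the format Mixpeek returns.

```typescript
// Assumed mask format: rows of 0/1 values (hypothetical, for illustration only).
type BBox = { x0: number; y0: number; x1: number; y1: number };
type MaskStats = { area: number; bbox: BBox | null };

// Compute the pixel area and the tight bounding box (inclusive) of a binary mask.
function maskStats(mask: number[][]): MaskStats {
  let area = 0;
  let x0 = Infinity, y0 = Infinity, x1 = -Infinity, y1 = -Infinity;
  mask.forEach((row, y) => {
    row.forEach((value, x) => {
      if (value > 0) {
        area++;
        x0 = Math.min(x0, x);
        y0 = Math.min(y0, y);
        x1 = Math.max(x1, x);
        y1 = Math.max(y1, y);
      }
    });
  });
  // An empty mask has no bounding box.
  return { area, bbox: area === 0 ? null : { x0, y0, x1, y1 } };
}

const stats = maskStats([
  [0, 1, 1],
  [0, 1, 0],
  [0, 0, 0],
]);
console.log(stats); // area 3, bbox x0=1, y0=0, x1=2, y1=1
```

Applying the same function to an object's mask in each frame yields a time series of area and position for that object.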

    Benchmarks

    Dataset             Metric   Score   Source
    SA-V (video seg.)   J&F      79.5    Ravi et al., 2024, Table 1
    DAVIS 2017 (val)    J&F      82.0    Ravi et al., 2024, Table 2
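For readers unfamiliar with the metric: J&F is the mean of J, the region similarity (intersection over union of predicted and ground-truth masks), and F, a boundary accuracy F-measure. The sketch below computes J for flattened binary masks and combines it with an F value supplied by the caller; computing F itself requires contour matching and is left out. The function names are hypothetical, and scoring two empty masks as 1 is a convention assumed here.

```typescript
// Region similarity J: intersection over union of two flattened binary masks.
function jaccard(pred: boolean[], truth: boolean[]): number {
  if (pred.length !== truth.length) throw new Error("masks must have the same length");
  let intersection = 0;
  let union = 0;
  for (let i = 0; i < pred.length; i++) {
    if (pred[i] && truth[i]) intersection++;
    if (pred[i] || truth[i]) union++;
  }
  // Assumed convention: two empty masks count as a perfect match.
  return union === 0 ? 1 : intersection / union;
}

// J&F is the arithmetic mean of J and the boundary F-measure.
function jAndF(j: number, f: number): number {
  return (j + f) / 2;
}

const j = jaccard([true, true, false, false], [true, false, true, false]);
console.log(j); // 1 overlapping pixel out of 3 in the union
console.log(jAndF(0.8, 0.9));
```

Benchmark tables usually report these values multiplied by 100, which is why the scores above read as 79.5 and 82.0.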

    Performance

    Input Size: 1024×1024 px
    GPU Latency: ~18 ms / frame (A100)
    GPU Throughput: ~55 frames/sec (A100)
    GPU Memory: ~2.8 GB

    Streaming architecture: processes video frames sequentially, carrying memory across frames
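The latency and throughput figures are two views of the same number: at roughly 18 ms per frame, one sequential stream yields about 1000 / 18 ≈ 55.6 frames per second. The sketch below does that arithmetic and estimates processing time for a clip. It assumes strictly sequential processing with no batching and ignores video decoding and I/O; the function names are hypothetical.

```typescript
// Convert per-frame latency (ms) to frames per second for one sequential stream.
function framesPerSecond(latencyMs: number): number {
  if (latencyMs <= 0) throw new Error("latency must be positive");
  return 1000 / latencyMs;
}

// Estimated model time (seconds) to process a clip, ignoring decode and I/O.
function processingSeconds(frameCount: number, latencyMs: number): number {
  return (frameCount * latencyMs) / 1000;
}

console.log(framesPerSecond(18).toFixed(1)); // "55.6"
// A 60-second clip at 30 fps has 1800 frames.
console.log(processingSeconds(60 * 30, 18).toFixed(1)); // "32.4"
```

Under these assumptions a 30 fps source is processed faster than real time, since 55.6 frames per second exceeds the 30 frames per second arriving.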

    Specification

    Framework: HF
    Organization: facebook
    Feature: Segmentation
    Output: mask + label
    Modalities: video, image
    Retriever: Mask Filter
    Parameters: 224.4M
    License: Apache 2.0
    Downloads/mo: 1.8M

    Research Paper

    SAM 2: Segment Anything in Images and Videos

    arxiv.org

    Build a pipeline with sam2.1-hiera-large

    Add this model to a processing pipeline alongside other extractors. Combine with retrieval stages for end-to-end search.