    HF · Segmentation · Apache 2.0

    sam-vit-huge

    by facebook

    Promptable foundation model for image segmentation

    3.2M downloads/month
    632M parameters
    Identifiers
    Model ID
    facebook/sam-vit-huge
    Feature URI
    mixpeek://image_extractor@v1/facebook_sam_vit_huge_v1

    Overview

    SAM (Segment Anything Model) is Meta's foundation model for image segmentation. Given prompts such as points, boxes, or masks, it produces high-quality object masks. It was trained on SA-1B, the largest segmentation dataset at the time of its release, with 1 billion masks on 11M images.

    On Mixpeek, SAM powers pixel-level object segmentation for precise content understanding, enabling mask-based filtering and region-specific feature extraction.
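As a rough illustration of what mask-based filtering and region-specific extraction can look like downstream, the sketch below derives a bounding box and a coverage fraction from a binary mask. It is not Mixpeek's implementation: `maskBoundingBox`, `maskCoverage`, and the row-major `Uint8Array` mask layout are assumptions made for this example.

```typescript
// Hypothetical helpers for illustration; not part of the Mixpeek SDK.
// A mask is a row-major Uint8Array of width * height values (non-zero = inside).

interface Box {
  x0: number; // left column, inclusive
  y0: number; // top row, inclusive
  x1: number; // right column, inclusive
  y1: number; // bottom row, inclusive
}

// Smallest box containing every masked pixel, or null for an empty mask.
function maskBoundingBox(mask: Uint8Array, width: number, height: number): Box | null {
  if (mask.length !== width * height) {
    throw new Error("mask length must equal width * height");
  }
  let x0 = width, y0 = height, x1 = -1, y1 = -1;
  for (let y = 0; y < height; y++) {
    for (let x = 0; x < width; x++) {
      if (mask[y * width + x] !== 0) {
        if (x < x0) x0 = x;
        if (x > x1) x1 = x;
        if (y < y0) y0 = y;
        if (y > y1) y1 = y;
      }
    }
  }
  return x1 < 0 ? null : { x0, y0, x1, y1 };
}

// Fraction of the image covered by the mask, usable as a simple filter
// (for example, to drop masks that cover only a handful of pixels).
function maskCoverage(mask: Uint8Array): number {
  if (mask.length === 0) return 0;
  let count = 0;
  for (let i = 0; i < mask.length; i++) {
    if (mask[i] !== 0) count++;
  }
  return count / mask.length;
}
```

The bounding box gives a crop region that a later extractor could run on, while the coverage value supports threshold-style filtering of very small or very large masks.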

    Architecture

    SAM pairs a ViT-H image encoder (632M parameters) with a prompt encoder and a lightweight mask decoder. The decoder produces 256×256 low-resolution masks, which are upscaled to the input resolution. Supported prompt types are points, boxes, and masks.
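The step from a 256×256 low-resolution mask to a full-resolution one can be sketched as follows. This is a minimal illustration, not SAM's actual post-processing: it uses nearest-neighbor sampling for brevity, and the function name `upscaleMask`, the logit threshold of 0, and the array layout are assumptions.

```typescript
// Hypothetical helper: not part of the Mixpeek SDK or the SAM codebase.
// Upscales a square grid of low-resolution mask logits to a target size with
// nearest-neighbor sampling, then thresholds the result into a binary mask.
function upscaleMask(
  logits: Float32Array, // row-major, lowRes * lowRes values
  lowRes: number,
  outWidth: number,
  outHeight: number,
  threshold = 0 // assumed cut-off between background and foreground
): Uint8Array {
  if (logits.length !== lowRes * lowRes) {
    throw new Error("logits length must equal lowRes * lowRes");
  }
  const out = new Uint8Array(outWidth * outHeight);
  for (let y = 0; y < outHeight; y++) {
    // Map each output row back to its nearest source row.
    const srcY = Math.min(lowRes - 1, Math.floor((y * lowRes) / outHeight));
    for (let x = 0; x < outWidth; x++) {
      const srcX = Math.min(lowRes - 1, Math.floor((x * lowRes) / outWidth));
      out[y * outWidth + x] = logits[srcY * lowRes + srcX] > threshold ? 1 : 0;
    }
  }
  return out;
}
```

Calling `upscaleMask(logits, 256, 1024, 1024)` would map a 256×256 decoder output onto a 1024×1024 image. A smoother interpolation would give cleaner mask edges than the nearest-neighbor shortcut used here.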

    Mixpeek SDK Integration

    import { Mixpeek } from "mixpeek";

    // Replace "API_KEY" with your Mixpeek API key.
    const mx = new Mixpeek({ apiKey: "API_KEY" });

    // Ingest one image and run SAM segmentation on it.
    await mx.collections.ingest({
      collection_id: "my-collection",
      source: { url: "https://example.com/image.jpg" },
      feature_extractors: [{
        name: "segmentation",
        version: "v1",
        params: { model_id: "facebook/sam-vit-huge" }
      }]
    });

    Capabilities

    • Promptable segmentation with points, boxes, or masks
    • Automatic mask generation for everything in an image
    • Zero-shot transfer competitive with supervised models
    • Trained on 1 billion masks (SA-1B dataset)

    Use Cases on Mixpeek

    • Pixel-level content segmentation in video and images
    • Automated mask generation for training data creation
    • Region-specific feature extraction pipelines
    • Interactive annotation assistance

    Benchmarks

    Dataset                Metric   Score   Source
    SA-1B (segmentation)   mIoU     79.3    Kirillov et al., 2023, Table 1
    COCO (instance seg.)   AP       46.5    Kirillov et al., 2023, Table 7
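
For readers unfamiliar with the mIoU metric listed above, the sketch below computes intersection-over-union for a pair of binary masks and averages it over several pairs. The function names and the convention of scoring two empty masks as 1 are assumptions for this example; the paper's evaluation protocol may differ in its details.

```typescript
// Illustrative metric code; names and conventions here are assumptions.
// Masks are same-sized Uint8Arrays where non-zero means "inside the object".

function maskIoU(predicted: Uint8Array, groundTruth: Uint8Array): number {
  if (predicted.length !== groundTruth.length) {
    throw new Error("masks must have the same size");
  }
  let intersection = 0;
  let union = 0;
  for (let i = 0; i < predicted.length; i++) {
    const inPred = predicted[i] !== 0;
    const inTruth = groundTruth[i] !== 0;
    if (inPred && inTruth) intersection++;
    if (inPred || inTruth) union++;
  }
  // Assumed convention: two empty masks count as a perfect match.
  return union === 0 ? 1 : intersection / union;
}

// Mean IoU over (predicted, groundTruth) pairs, as a fraction in [0, 1].
function meanIoU(pairs: Array<[Uint8Array, Uint8Array]>): number {
  if (pairs.length === 0) throw new Error("need at least one mask pair");
  let total = 0;
  for (const [predicted, groundTruth] of pairs) {
    total += maskIoU(predicted, groundTruth);
  }
  return total / pairs.length;
}
```

Multiplying the result by 100 gives a value on the same scale as the score in the table.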

    Performance

    Input Size: 1024×1024 px
    GPU Latency: ~42 ms / image (A100)
    CPU Latency: ~620 ms / image
    GPU Throughput: ~24 images/sec (A100)
    GPU Memory: ~2.6 GB

    The image encoder runs once per image; the mask decoder runs once per prompt (~6 ms each).
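
The encode-once, decode-per-prompt split makes per-image cost easy to estimate. The sketch below does the arithmetic, assuming the ~42 ms GPU figure covers the encoder pass only (the page does not state this) and that decoder passes run sequentially. The names `LatencyProfile`, `estimateLatencyMs`, and `throughputPerSec` are hypothetical.

```typescript
// Back-of-the-envelope latency model; the profile values come from the
// figures quoted above, and their interpretation is an assumption.
interface LatencyProfile {
  encoderMs: number;          // one pass per image
  decoderMsPerPrompt: number; // one pass per prompt
}

// Total time for one image with a given number of prompts.
function estimateLatencyMs(profile: LatencyProfile, promptCount: number): number {
  if (promptCount < 0) throw new Error("promptCount must be non-negative");
  return profile.encoderMs + promptCount * profile.decoderMsPerPrompt;
}

// Images per second for a given per-image latency, with no batching.
function throughputPerSec(latencyMs: number): number {
  if (latencyMs <= 0) throw new Error("latencyMs must be positive");
  return 1000 / latencyMs;
}

const a100: LatencyProfile = { encoderMs: 42, decoderMsPerPrompt: 6 };
const tenPromptsMs = estimateLatencyMs(a100, 10); // 42 + 10 * 6 = 102
```

Under these assumptions, ten prompts on one image cost about 102 ms rather than 10 × 48 ms, because the encoder output is reused. Likewise, 1000 / 42 ≈ 23.8 is consistent with the ~24 images/sec throughput listed above.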

    Specification

    Framework: HF
    Organization: facebook
    Feature: Segmentation
    Output: mask + label
    Modalities: video, image
    Retriever: Mask Filter
    Parameters: 632M
    License: Apache 2.0
    Downloads/mo: 3.2M

    Research Paper

    Segment Anything

    arxiv.org

    Build a pipeline with sam-vit-huge

    Add this model to a processing pipeline alongside other extractors. Combine with retrieval stages for end-to-end search.
