    HF · Segmentation · Apache 2.0

    clipseg-rd64-refined

    by CIDAS

    Text-prompted image segmentation for queryable masks and region crops

    1.7M downloads/month
    0.2B params
    Identifiers
    Model ID
    CIDAS/clipseg-rd64-refined
    Feature URI
    mixpeek://image_extractor@v1/cidas_clipseg_rd64_refined_v1

    Overview

    CLIPSeg RD64 Refined is a CLIP-conditioned segmentation model that produces a mask from an image plus a natural language prompt. Instead of requiring a fixed class label set, it lets a pipeline ask for regions like "red logo," "person holding a box," or "damaged corner" and turn those regions into indexed evidence.

    On Mixpeek, CLIPSeg is useful before region embedding or visual QA. The segmenter isolates the queried foreground, Mixpeek stores the mask geometry and crop lineage, and an agent can search or inspect the precise region instead of the whole frame.
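One way to picture the "mask geometry" step is as a tight bounding box computed from a binary mask, which can then drive the crop. The sketch below is illustrative only: `BoundingBox` and `maskBoundingBox` are hypothetical names, not part of the Mixpeek SDK, and the mask is assumed to be a row-major array of 0/1 values.

```typescript
// Hypothetical helper types for illustration; not part of the Mixpeek SDK.
interface BoundingBox {
  x: number;
  y: number;
  width: number;
  height: number;
}

/**
 * Returns the tightest box around all non-zero mask pixels,
 * or null when the mask contains no foreground.
 * `mask` is assumed to be row-major with width * height entries.
 */
function maskBoundingBox(
  mask: Uint8Array,
  width: number,
  height: number
): BoundingBox | null {
  if (mask.length !== width * height) {
    throw new Error("mask size does not match dimensions");
  }
  let minX = width;
  let minY = height;
  let maxX = -1;
  let maxY = -1;
  for (let y = 0; y < height; y++) {
    for (let x = 0; x < width; x++) {
      if (mask[y * width + x] !== 0) {
        if (x < minX) minX = x;
        if (x > maxX) maxX = x;
        if (y < minY) minY = y;
        if (y > maxY) maxY = y;
      }
    }
  }
  if (maxX < 0) return null; // no foreground pixel found
  return { x: minX, y: minY, width: maxX - minX + 1, height: maxY - minY + 1 };
}
```

Storing the box alongside the source frame identifier is one simple way to keep a crop traceable back to the image it came from.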

    Architecture

    CLIPSeg combines a CLIP visual-text backbone with a lightweight decoder for dense prediction. The RD64 refined checkpoint is optimized for image segmentation with natural language prompts and outputs pixel masks aligned to the input image.
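As a rough illustration of how a dense prediction becomes a pixel mask, the sketch below assumes the decoder emits one raw logit per pixel (as the Hugging Face implementation of CLIPSeg does) and binarizes it with a sigmoid and a threshold. The function names and the 0.5 default are assumptions for illustration, and resizing the mask back to the input resolution is left out.

```typescript
// Logistic function: maps a raw logit to a probability in (0, 1).
function sigmoid(z: number): number {
  return 1 / (1 + Math.exp(-z));
}

/**
 * Converts per-pixel logits into a binary mask (1 = foreground).
 * `threshold` is applied to the sigmoid probability; 0.5 is an
 * illustrative default, not a value taken from the model card.
 */
function logitsToMask(logits: Float32Array, threshold = 0.5): Uint8Array {
  const mask = new Uint8Array(logits.length);
  for (let i = 0; i < logits.length; i++) {
    mask[i] = sigmoid(logits[i]) >= threshold ? 1 : 0;
  }
  return mask;
}
```

Raising the threshold trades recall for tighter masks, which may matter when the mask is later used to crop a region.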

    Mixpeek SDK Integration

    import { Mixpeek } from "mixpeek";

    // Replace "API_KEY" with a real key.
    const mx = new Mixpeek({ apiKey: "API_KEY" });

    await mx.collections.ingest({
      collection_id: "visual-evidence",
      source: { url: "s3://media/keyframes/" },
      feature_extractors: [{
        feature: "segmentation",
        model: "CIDAS/clipseg-rd64-refined",
        params: {
          // Text prompts describing the regions to segment
          prompts: ["brand logo", "product in hand", "damaged surface"],
          return_crops: true
        }
      }]
    });

    Capabilities

    • Text-guided segmentation without a closed class list
    • Foreground mask generation for natural images and video keyframes
    • Region crop extraction before visual embedding
    • Spatial metadata for evidence citations
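To make "spatial metadata" concrete, the sketch below derives two simple descriptors from a binary mask: the fraction of the image it covers and its centroid in normalized coordinates. The `RegionStats` shape and `regionStats` function are hypothetical and do not describe the fields Mixpeek actually stores.

```typescript
// Hypothetical metadata shape for illustration only.
interface RegionStats {
  coverage: number;  // fraction of pixels marked foreground, in [0, 1]
  centroidX: number; // normalized horizontal center of mass, in (0, 1)
  centroidY: number; // normalized vertical center of mass, in (0, 1)
}

/**
 * Summarizes a row-major binary mask. Returns null for an empty mask.
 * Pixel centers are taken at (x + 0.5, y + 0.5) before normalizing.
 */
function regionStats(
  mask: Uint8Array,
  width: number,
  height: number
): RegionStats | null {
  let count = 0;
  let sumX = 0;
  let sumY = 0;
  for (let y = 0; y < height; y++) {
    for (let x = 0; x < width; x++) {
      if (mask[y * width + x] !== 0) {
        count++;
        sumX += x;
        sumY += y;
      }
    }
  }
  if (count === 0) return null;
  return {
    coverage: count / (width * height),
    centroidX: (sumX / count + 0.5) / width,
    centroidY: (sumY / count + 0.5) / height,
  };
}
```

Normalized values stay valid if the frame is later resized, which is convenient when a citation needs to point at a region rather than at pixel indices.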

    Use Cases on Mixpeek

    • Find frames where a specific described object appears
    • Crop queried regions before CLIP, SigLIP, or Nomic vision embedding
    • Build agent tools that inspect only the visual region relevant to a question
    • Filter product, ad, or screenshot libraries by prompted visual regions
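Filtering a library by prompted regions can be sketched as a threshold over per-frame mask coverage. The record shape, the function name, and the 1% default below are all assumptions for illustration and are not the Mixpeek retriever API.

```typescript
// Hypothetical record: one row per (frame, prompt) segmentation result.
interface PromptRegion {
  frameId: string;
  prompt: string;
  coverage: number; // fraction of the frame covered by the mask, in [0, 1]
}

/**
 * Returns the distinct frame IDs whose mask for `prompt` covers at least
 * `minCoverage` of the frame, in first-seen order.
 */
function framesMatchingPrompt(
  regions: PromptRegion[],
  prompt: string,
  minCoverage = 0.01
): string[] {
  const ids = new Set<string>();
  for (const region of regions) {
    if (region.prompt === prompt && region.coverage >= minCoverage) {
      ids.add(region.frameId);
    }
  }
  return Array.from(ids);
}
```

A coverage floor is a crude proxy for "the object appears"; a real pipeline would likely tune it per prompt.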

    Benchmarks

    Dataset     Metric                   Score   Source
    PhraseCut   Segmentation             -       CLIPSeg paper
    RefCOCO     Referring segmentation   -       CLIPSeg paper

    Performance

    Input Size: Image plus text prompt
    GPU Latency: ~20-40 ms / image on A100
    GPU Throughput: Batch dependent
    GPU Memory: ~1 GB

    Run on selected frames or first-stage candidates when prompt count is high
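The reason for this advice shows up in a back-of-the-envelope estimate. The sketch below assumes every (frame, prompt) pair costs one forward pass at the quoted per-image latency and ignores batching, so it is only a rough upper-bound style figure; `estimateGpuSeconds` is a hypothetical helper.

```typescript
/**
 * Rough GPU time estimate in seconds.
 * Assumption: one forward pass per (frame, prompt) pair, no batching gains.
 */
function estimateGpuSeconds(
  frames: number,
  prompts: number,
  msPerPass: number
): number {
  return (frames * prompts * msPerPass) / 1000;
}

// 10,000 frames x 3 prompts at 30 ms per pass
const fullRun = estimateGpuSeconds(10_000, 3, 30); // 900 seconds

// The same prompts on 500 first-stage candidates
const filteredRun = estimateGpuSeconds(500, 3, 30); // 45 seconds
```

Under these assumptions cost scales linearly with prompt count, so trimming the frame set first is the cheaper lever when many prompts are needed.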

    Specification

    Framework: HF
    Organization: CIDAS
    Feature: Segmentation
    Output: mask + label
    Modalities: video, image
    Retriever: Mask Filter
    Parameters: 0.2B
    License: Apache 2.0
    Downloads/mo: 1.7M

    Research Paper

    Image Segmentation Using Text and Image Prompts

    arxiv.org
