    HF · Object Detection · apache-2.0

    grounding-dino-base

    by IDEA-Research

    Open-set detection using natural language descriptions

    2.2M downloads/month
    186 likes
    233M parameters
    Identifiers
    Model ID
    IDEA-Research/grounding-dino-base
    Feature URI
    mixpeek://image_extractor@v1/idea_grounding_dino_base_v1

    Overview

    Grounding DINO combines a DINO-style detection transformer with grounded language understanding for open-set object detection. It achieves 52.5 AP on COCO in the zero-shot setting (that is, without any COCO training data) and 56.7 AP when fine-tuned.

    On Mixpeek, Grounding DINO enables detecting any object by describing it in text. Combined with segmentation models like SAM, it provides a powerful detect-then-segment pipeline.
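
    Text prompts for Grounding DINO are conventionally written as lowercase phrases, each ending in a period (for example "a cat. a remote control."), as recommended on the Hugging Face model card. The sketch below shows one way to build such a prompt from a list of category names. `buildPrompt` is a hypothetical helper for illustration, not part of the Mixpeek SDK, and whether Mixpeek normalizes prompts for you is not stated here.

```typescript
// Hypothetical helper (not part of any SDK): turns category names into a
// Grounding DINO style prompt of lowercase, period-terminated phrases.
function buildPrompt(labels: string[]): string {
  return labels
    .map((label) => label.trim().toLowerCase().replace(/\.+$/, "")) // normalize, drop trailing dots
    .filter((label) => label.length > 0) // skip blank entries
    .map((label) => `${label}.`) // terminate every phrase with a period
    .join(" ");
}

const prompt = buildPrompt(["A cat", "remote control.", "  "]);
console.log(prompt); // a cat. remote control.
```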

    Architecture

    DINO-style detection transformer with a Swin backbone, enhanced with text-grounding modules for open-vocabulary detection. The Swin-B variant achieves 56.7 AP on COCO.

    Mixpeek SDK Integration

    import { Mixpeek } from "mixpeek";
    
    const mx = new Mixpeek({ apiKey: "API_KEY" });
    
    // Managed: create a collection over a bucket; Mixpeek runs this model's extractor
    const collection = await mx.collections.create({
      namespace_id: "my-namespace",
      collection_name: "my-collection",
      source: { type: "bucket", bucket_ids: ["bkt_your_bucket"] },
      feature_extractor: {
        feature_extractor_name: "object_detection",
        version: "v1",
        parameters: { model_id: "IDEA-Research/grounding-dino-base" },
      },
    });

    Capabilities

    • Zero-shot detection: 52.5 AP on COCO without COCO training data
    • Natural language object descriptions as prompts
    • Fine-tuned detection: 56.7 AP (Swin-B)
    • Pairs with SAM for detect-then-segment pipelines
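
    In a detect-then-segment pipeline, the detector's boxes become prompts for the segmenter. The following is a minimal sketch of that hand-off, assuming detections arrive as a label, a confidence score, and a top-left/width/height box in pixels, and that the segmenter expects corner-format boxes. The `Detection` shape, the `toBoxPrompts` name, and the 0.35 score threshold are all assumptions for illustration; the actual Mixpeek output schema may differ.

```typescript
// Assumed detection shape; the real output schema may differ.
interface Detection {
  label: string;
  score: number; // confidence in [0, 1]
  box: { x: number; y: number; width: number; height: number }; // pixels, top-left origin
}

// Box prompt in corner format: [x0, y0, x1, y1].
type BoxPrompt = [number, number, number, number];

// Keep confident detections and convert each box to corner format.
// The default threshold is an arbitrary illustrative value.
function toBoxPrompts(detections: Detection[], minScore = 0.35): BoxPrompt[] {
  return detections
    .filter((d) => d.score >= minScore)
    .map((d): BoxPrompt => [
      d.box.x,
      d.box.y,
      d.box.x + d.box.width,
      d.box.y + d.box.height,
    ]);
}

const boxPrompts = toBoxPrompts([
  { label: "dog", score: 0.82, box: { x: 10, y: 20, width: 100, height: 50 } },
  { label: "dog", score: 0.12, box: { x: 0, y: 0, width: 5, height: 5 } },
]);
console.log(boxPrompts); // one box survives: [10, 20, 110, 70]
```

    Each resulting box would then be passed to the segmentation model as a box prompt; that call is omitted because it depends on the segmenter in use.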

    Use Cases on Mixpeek

    • Open-vocabulary object detection in video surveillance
    • Content tagging with arbitrary category sets
    • Visual grounding for question answering
    • Automated annotation for training data generation
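
    As an illustration of the content-tagging use case, the sketch below collapses a list of detections into per-label counts that could serve as tags. The `LabeledDetection` shape, the `tagsFromDetections` name, and the 0.5 threshold are hypothetical and chosen only for the example.

```typescript
// Assumed minimal detection shape for tagging; only label and score are needed.
interface LabeledDetection {
  label: string;
  score: number; // confidence in [0, 1]
}

// Count how many confident detections were found for each label.
function tagsFromDetections(
  detections: LabeledDetection[],
  minScore = 0.5, // illustrative threshold, tune per dataset
): Record<string, number> {
  const counts: Record<string, number> = {};
  for (const d of detections) {
    if (d.score < minScore) continue; // ignore low-confidence hits
    counts[d.label] = (counts[d.label] ?? 0) + 1;
  }
  return counts;
}

const tags = tagsFromDetections([
  { label: "forklift", score: 0.91 },
  { label: "forklift", score: 0.77 },
  { label: "hard hat", score: 0.64 },
  { label: "hard hat", score: 0.21 },
]);
console.log(tags); // { forklift: 2, 'hard hat': 1 }
```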

    Benchmarks

    Dataset                    Metric     Score   Source
    COCO val2017 (zero-shot)   AP         48.4    Liu et al., 2024, Table 1
    RefCOCO (val)              Accuracy   89.2%   Liu et al., 2024, Table 3

    Performance

    Input Size: 800×1333 px
    GPU Latency: ~32 ms / image (A100)
    CPU Latency: ~410 ms / image
    GPU Throughput: ~31 images/sec (A100)
    GPU Memory: ~1.6 GB
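
    The throughput figure follows from the latency: 1000 ms / 32 ms ≈ 31 images per second. The sketch below applies the same arithmetic to estimate processing time for a batch of images. It assumes strictly sequential, one-image-at-a-time inference with no I/O or queueing overhead, so treat the results as rough lower bounds; the function names are illustrative.

```typescript
// Images per second implied by a per-image latency in milliseconds.
function imagesPerSecond(latencyMs: number): number {
  return 1000 / latencyMs;
}

// Estimated wall-clock seconds to process imageCount images sequentially.
function estimateSeconds(imageCount: number, latencyMs: number): number {
  return (imageCount * latencyMs) / 1000;
}

// Example: a 10-minute video sampled at 1 frame per second gives 600 frames.
console.log(imagesPerSecond(32)); // 31.25 (A100 latency from the table above)
console.log(estimateSeconds(600, 32)); // 19.2 seconds on GPU
console.log(estimateSeconds(600, 410)); // 246 seconds on CPU
```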

    Specification

    Framework: HF
    Organization: IDEA-Research
    Feature: Object Detection
    Output: bbox + label
    Modalities: video, image
    Retriever: Object Filter
    Parameters: 233M
    License: apache-2.0
    Downloads/mo: 2.2M
    Likes: 186

    Research Paper

    Grounding DINO: Marrying DINO with Grounded Pre-Training for Open-Set Object Detection

    arxiv.org

    Build a pipeline with grounding-dino-base

    Add this model to a processing pipeline alongside other extractors. Combine with retrieval stages for end-to-end search.
