    HF | Object Detection | apache-2.0

    detr-resnet-50

    by facebook

    End-to-end object detection with Transformers, no anchor boxes needed

    246K downloads/month
    943 likes
    42M params
    Identifiers
    Model ID
    facebook/detr-resnet-50
    Feature URI
    mixpeek://image_extractor@v1/facebook_detr_r50_v1

    Overview

    DETR (DEtection TRansformer) reimagines object detection as a set prediction problem, using a transformer encoder-decoder architecture to directly output a set of bounding boxes and class labels without the need for hand-designed components like anchor boxes or non-maximum suppression.

    On Mixpeek, DETR extracts structured object annotations from video frames and images, producing bounding boxes with class labels that power attribute-based filtering in retrieval pipelines.
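To make the idea of attribute-based filtering concrete, here is a minimal sketch of filtering frames by detected label. The `Detection` shape, its field names, and the `framesWithLabel` helper are assumptions made for illustration; they are not Mixpeek's documented schema or API.

```typescript
// Hypothetical shape of one detection record. Field names are assumptions,
// not Mixpeek's documented output schema.
interface Detection {
  label: string; // class name, e.g. "person"
  score: number; // confidence in [0, 1]
  bbox: [number, number, number, number]; // [xMin, yMin, xMax, yMax] in pixels
  frameIndex: number; // index of the source video frame
}

// Return the sorted indices of frames that contain at least `minCount`
// detections of `label` with confidence at or above `minScore`.
function framesWithLabel(
  detections: Detection[],
  label: string,
  minScore = 0.7,
  minCount = 1,
): number[] {
  const counts = new Map<number, number>();
  for (const d of detections) {
    if (d.label === label && d.score >= minScore) {
      counts.set(d.frameIndex, (counts.get(d.frameIndex) ?? 0) + 1);
    }
  }
  const frames: number[] = [];
  counts.forEach((count, frame) => {
    if (count >= minCount) frames.push(frame);
  });
  return frames.sort((a, b) => a - b);
}
```

The same counting logic could back a "frames with at least N people" style filter by raising `minCount`; the 0.7 default threshold is arbitrary.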

    Architecture

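    A ResNet-50 CNN backbone feeds a transformer encoder-decoder with 6 encoder and 6 decoder layers. Training uses a bipartite matching loss: the Hungarian algorithm assigns each ground-truth object to exactly one prediction. The decoder processes 100 object queries in parallel, and each query yields one box and one class prediction (or "no object").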
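The one-to-one matching between predictions and ground truth can be illustrated with a toy example. The sketch below is not the Hungarian algorithm: it finds the same minimum-cost assignment by exhaustive search, which is only practical for a handful of boxes. The cost is also simplified (negative class probability plus L1 box distance); the paper's matching cost additionally uses a generalized IoU term and weighting factors. All type and function names are illustrative.

```typescript
type Box = [number, number, number, number]; // normalized [cx, cy, w, h]

interface Prediction {
  classProbs: number[]; // probability per class
  box: Box;
}

interface GroundTruth {
  classId: number;
  box: Box;
}

// Simplified matching cost: negative class probability plus L1 box distance.
function matchCost(p: Prediction, g: GroundTruth): number {
  let l1 = 0;
  for (let k = 0; k < 4; k++) l1 += Math.abs(p.box[k] - g.box[k]);
  return -p.classProbs[g.classId] + l1;
}

// Exhaustive search for the minimum-cost one-to-one assignment.
// Returns, for each ground-truth index, the index of its matched prediction.
// Predictions left unmatched would be trained toward "no object".
function bestAssignment(preds: Prediction[], gts: GroundTruth[]): number[] {
  if (gts.length > preds.length) {
    throw new Error("need at least as many predictions as ground truths");
  }
  let bestCost = Infinity;
  let best: number[] = [];
  const used: boolean[] = preds.map(() => false);
  const current: number[] = [];

  const search = (g: number, cost: number): void => {
    if (g === gts.length) {
      // Complete assignment: keep it if it is the cheapest so far.
      if (cost < bestCost) {
        bestCost = cost;
        best = current.slice();
      }
      return;
    }
    for (let i = 0; i < preds.length; i++) {
      if (used[i]) continue;
      used[i] = true;
      current.push(i);
      search(g + 1, cost + matchCost(preds[i], gts[g]));
      current.pop();
      used[i] = false;
    }
  };

  search(0, 0);
  return best;
}
```

With 100 queries, exhaustive search is infeasible, which is why a polynomial-time assignment solver such as the Hungarian algorithm is used in practice.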

    Mixpeek SDK Integration

    import { Mixpeek } from "mixpeek";

    // Replace "API_KEY" with your Mixpeek API key.
    const mx = new Mixpeek({ apiKey: "API_KEY" });

    // Ingest a video and run DETR object detection on it.
    await mx.collections.ingest({
      collection_id: "my-collection",
      source: { url: "https://example.com/video.mp4" },
      feature_extractors: [{
        name: "object_detection",
        version: "v1",
        params: {
          model_id: "facebook/detr-resnet-50"
        }
      }]
    });

    Capabilities

    • COCO object categories out of the box (91 label IDs, covering the 80 COCO detection classes)
    • Bounding box + class label predictions
    • Panoptic segmentation via an added mask head (not part of this detection checkpoint)
    • No hand-designed post-processing (NMS-free)
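Because there is no NMS step, turning raw query outputs into final detections reduces to a softmax, a confidence threshold, and a box conversion. The sketch below assumes the usual DETR output convention: per-query class logits whose last entry is the "no object" class, and boxes as normalized (cx, cy, w, h). The type names, the `decodeQueries` helper, and the 0.7 default threshold are illustrative choices, not part of any SDK.

```typescript
interface RawQuery {
  logits: number[]; // one logit per class; the last entry is "no object"
  box: [number, number, number, number]; // normalized [cx, cy, w, h]
}

interface PixelDetection {
  classId: number;
  score: number;
  bbox: [number, number, number, number]; // [xMin, yMin, xMax, yMax] in pixels
}

// Numerically stable softmax.
function softmax(logits: number[]): number[] {
  const max = Math.max(...logits);
  const exps = logits.map((v) => Math.exp(v - max));
  const sum = exps.reduce((a, b) => a + b, 0);
  return exps.map((e) => e / sum);
}

// Convert raw queries into pixel-space detections for an image of the
// given size. Assumes each query has at least one real class plus "no object".
function decodeQueries(
  queries: RawQuery[],
  width: number,
  height: number,
  threshold = 0.7,
): PixelDetection[] {
  const out: PixelDetection[] = [];
  for (const q of queries) {
    const probs = softmax(q.logits);
    // Pick the best real class, skipping the trailing "no object" slot.
    let classId = 0;
    for (let c = 1; c < probs.length - 1; c++) {
      if (probs[c] > probs[classId]) classId = c;
    }
    const score = probs[classId];
    if (score < threshold) continue; // low confidence or "no object"
    const [cx, cy, w, h] = q.box;
    out.push({
      classId,
      score,
      bbox: [
        (cx - w / 2) * width,
        (cy - h / 2) * height,
        (cx + w / 2) * width,
        (cy + h / 2) * height,
      ],
    });
  }
  return out;
}
```

Most of the 100 queries typically fall below the threshold because they predict "no object", so the surviving list is short without any overlap-based suppression.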

    Use Cases on Mixpeek

    Video surveillance: detect people, vehicles, and objects in security footage
    Retail analytics: count and classify products on shelves
    Content moderation: identify objects for compliance filtering
    Autonomous driving data: annotate frames with detected objects

    Benchmarks

    Dataset        Metric       Score   Source
    COCO val2017   AP (box)     42.0    Carion et al., 2020, Table 1
    COCO val2017   AP50         62.4    Carion et al., 2020, Table 1
    COCO val2017   AP (small)   20.5    Carion et al., 2020, Table 1
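For readers unfamiliar with the metrics: in COCO evaluation, AP50 counts a prediction as correct when its intersection-over-union (IoU) with a ground-truth box is at least 0.5, AP (box) averages over IoU thresholds from 0.50 to 0.95, and AP (small) restricts evaluation to objects with area under 32×32 pixels. The following sketch shows only the IoU and threshold checks, not a full AP computation; the helper names are illustrative.

```typescript
type XYXY = [number, number, number, number]; // [xMin, yMin, xMax, yMax] in pixels

// Intersection-over-union of two axis-aligned boxes.
function iou(a: XYXY, b: XYXY): number {
  const iw = Math.max(0, Math.min(a[2], b[2]) - Math.max(a[0], b[0]));
  const ih = Math.max(0, Math.min(a[3], b[3]) - Math.max(a[1], b[1]));
  const inter = iw * ih;
  const areaA = (a[2] - a[0]) * (a[3] - a[1]);
  const areaB = (b[2] - b[0]) * (b[3] - b[1]);
  const union = areaA + areaB - inter;
  return union > 0 ? inter / union : 0;
}

// A prediction matches a ground-truth box at a given IoU threshold
// (0.5 for AP50).
function matchesAt(pred: XYXY, gt: XYXY, threshold = 0.5): boolean {
  return iou(pred, gt) >= threshold;
}

// COCO's "small" bucket: object area below 32 x 32 pixels.
function isSmallObject(gt: XYXY): boolean {
  return (gt[2] - gt[0]) * (gt[3] - gt[1]) < 32 * 32;
}
```

Two 10×10 boxes offset horizontally by 5 pixels overlap in a 5×10 region, giving an IoU of 50/150, so they would not count as a match at the AP50 threshold.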

    Performance

    Input Size: 800×1333 px (max)
    GPU Latency: ~28 ms / image (A100)
    CPU Latency: ~340 ms / image
    GPU Throughput: ~35 images/sec (A100)
    GPU Memory: ~1.8 GB
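The figures above can be turned into rough planning numbers. This sketch makes two assumptions: that the 800×1333 limit follows the common DETR preprocessing convention (shorter side scaled to 800, longer side capped at 1333, aspect ratio preserved), and that the quoted ~35 images/sec throughput holds for your frames. The function names and the frame sampling rate are hypothetical; actual pipeline behavior may differ.

```typescript
interface Size {
  width: number;
  height: number;
}

// Resize target under the assumed convention: shorter side -> shortSide,
// unless that would push the longer side past longSideMax.
function detrInputSize(src: Size, shortSide = 800, longSideMax = 1333): Size {
  const minSide = Math.min(src.width, src.height);
  const maxSide = Math.max(src.width, src.height);
  let scale = shortSide / minSide;
  if (maxSide * scale > longSideMax) {
    scale = longSideMax / maxSide;
  }
  return {
    width: Math.round(src.width * scale),
    height: Math.round(src.height * scale),
  };
}

// Rough GPU time for a video, given how many frames per second are sampled
// and the throughput figure quoted above (an estimate, not a guarantee).
function estimateGpuSeconds(
  videoSeconds: number,
  sampledFps: number,
  imagesPerSecond = 35,
): number {
  const frames = Math.ceil(videoSeconds * sampledFps);
  return frames / imagesPerSecond;
}
```

Under these assumptions a 1920×1080 frame is resized to 1333×750, and a 10-minute video sampled at 1 frame per second (600 frames) needs roughly 17 seconds of GPU time. The latency and throughput figures are also mutually consistent: 1000 ms / 28 ms is about 35.7 images per second.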

    Specification

    Framework: HF
    Organization: facebook
    Feature: Object Detection
    Output: bbox + label
    Modalities: video, image
    Retriever: Object Filter
    Parameters: 42M
    License: apache-2.0
    Downloads/mo: 246K
    Likes: 943

    Research Paper

    End-to-End Object Detection with Transformers (Carion et al., 2020), arxiv.org

    Build a pipeline with detr-resnet-50

    Add this model to a processing pipeline alongside other extractors. Combine with retrieval stages for end-to-end search.
