detr-resnet-50
by facebook
End-to-end object detection with Transformers, no anchor boxes needed
facebook/detr-resnet-50
mixpeek://image_extractor@v1/facebook_detr_r50_v1
Overview
DETR (DEtection TRansformer) reimagines object detection as a set prediction problem, using a transformer encoder-decoder architecture to directly output a set of bounding boxes and class labels without the need for hand-designed components like anchor boxes or non-maximum suppression.
On Mixpeek, DETR extracts structured object annotations from video frames and images, producing bounding boxes with class labels that power attribute-based filtering in retrieval pipelines.
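As a rough illustration of attribute-based filtering over such annotations, the sketch below keeps only detections of one class above a confidence threshold. The `Detection` shape and the `filterByLabel` helper are hypothetical and are not the Mixpeek response schema; they assume only that each detection carries a label, a score, and a box.

```typescript
// Hypothetical shape for one detected object (an assumption, not the Mixpeek schema).
interface Detection {
  label: string;                          // class name, e.g. "dog"
  score: number;                          // confidence in [0, 1]
  box: [number, number, number, number];  // [xMin, yMin, xMax, yMax] in pixels
}

// Keep only detections of the requested class at or above a confidence threshold.
function filterByLabel(
  detections: Detection[],
  label: string,
  minScore = 0.5,
): Detection[] {
  return detections.filter((d) => d.label === label && d.score >= minScore);
}

// Illustrative data only.
const frameDetections: Detection[] = [
  { label: "dog", score: 0.97, box: [12, 40, 210, 300] },
  { label: "dog", score: 0.31, box: [400, 80, 450, 130] },
  { label: "person", score: 0.88, box: [220, 10, 380, 310] },
];

// Only the first "dog" entry clears the default 0.5 threshold.
const confidentDogs = filterByLabel(frameDetections, "dog");
```

The default of 0.5 is an arbitrary choice for the example; a real pipeline would tune it against its own precision and recall needs.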
Architecture
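A ResNet-50 CNN backbone extracts image features, which feed a transformer with 6 encoder layers and 6 decoder layers. Training uses a bipartite matching loss: the Hungarian algorithm assigns each ground-truth object to exactly one prediction. The decoder processes 100 object queries in parallel, so each image yields up to 100 candidate detections.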
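To make the one-to-one assignment concrete, here is a minimal sketch of minimum-cost bipartite matching. It uses brute-force search rather than the Hungarian algorithm, so it is only practical for a handful of objects, but it finds the same optimal assignment. The function name `matchBruteForce` and the cost values are made up for illustration; in DETR the cost combines class probability and box terms.

```typescript
// cost[i][j] = cost of assigning prediction i to ground-truth object j.
// Returns, for each ground-truth j, the index of the prediction matched to it,
// so that the total cost is minimal and no prediction is used twice.
// Assumes there are at least as many predictions as ground-truth objects.
// Brute force (not the Hungarian algorithm): exponential time, small inputs only.
function matchBruteForce(cost: number[][]): number[] {
  const numPreds = cost.length;
  const numTargets = numPreds > 0 ? cost[0].length : 0;
  const used: boolean[] = cost.map(() => false);
  const current: number[] = [];
  let bestTotal = Infinity;
  let best: number[] = [];

  const search = (target: number, total: number): void => {
    if (target === numTargets) {
      // Every ground-truth object has a prediction; keep the cheapest assignment.
      if (total < bestTotal) {
        bestTotal = total;
        best = current.slice();
      }
      return;
    }
    for (let p = 0; p < numPreds; p++) {
      if (used[p]) continue;
      used[p] = true;
      current.push(p);
      search(target + 1, total + cost[p][target]);
      current.pop();
      used[p] = false;
    }
  };

  search(0, 0);
  return best;
}

// Three predictions, two ground-truth objects (illustrative costs).
const exampleCost = [
  [0.9, 0.2],
  [0.1, 0.8],
  [0.5, 0.6],
];
const assignment = matchBruteForce(exampleCost); // [1, 0], total cost 0.3
```

Prediction 2 is left unmatched here; during DETR training, unmatched predictions are supervised toward the "no object" class.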
Mixpeek SDK Integration
import { Mixpeek } from "mixpeek";

const mx = new Mixpeek({ apiKey: "API_KEY" });

await mx.collections.ingest({
  collection_id: "my-collection",
  source: { url: "https://example.com/video.mp4" },
  feature_extractors: [
    {
      name: "object_detection",
      version: "v1",
      params: {
        model_id: "facebook/detr-resnet-50"
      }
    }
  ]
});
Capabilities
- COCO object categories out of the box (a 91-entry label map covering the 80 COCO object classes; the remaining entries are unused)
- Bounding box + class label predictions
- Panoptic segmentation with extensions
- No hand-designed post-processing (NMS-free)
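The sketch below shows what NMS-free post-processing can look like: each object query is decoded independently by applying a softmax, dropping the "no object" class, thresholding, and converting the box to pixel corners. The `QueryOutput` shape, the `decodeQueries` helper, and the 0.7 threshold are assumptions for illustration. They follow the output format described in the DETR paper (class logits with one extra "no object" entry, boxes as normalized center coordinates plus width and height) and are not a Mixpeek API.

```typescript
// Hypothetical raw output for one object query (assumed layout, see lead-in).
interface QueryOutput {
  logits: number[];                       // numClasses + 1 values; last is "no object"
  box: [number, number, number, number];  // normalized [cx, cy, w, h] in [0, 1]
}

interface DecodedBox {
  classIndex: number;
  score: number;
  box: [number, number, number, number];  // pixels [xMin, yMin, xMax, yMax]
}

// Numerically stable softmax.
function softmax(logits: number[]): number[] {
  const max = Math.max(...logits);
  const exps = logits.map((v) => Math.exp(v - max));
  const sum = exps.reduce((a, b) => a + b, 0);
  return exps.map((e) => e / sum);
}

// Decode every query on its own; no pairwise suppression step is involved.
function decodeQueries(
  queries: QueryOutput[],
  width: number,
  height: number,
  threshold = 0.7,
): DecodedBox[] {
  const results: DecodedBox[] = [];
  for (const q of queries) {
    const probs = softmax(q.logits).slice(0, -1); // drop the "no object" entry
    let classIndex = 0;
    for (let i = 1; i < probs.length; i++) {
      if (probs[i] > probs[classIndex]) classIndex = i;
    }
    const score = probs[classIndex];
    if (score < threshold) continue; // low confidence or mostly "no object"
    const [cx, cy, w, h] = q.box;
    results.push({
      classIndex,
      score,
      box: [
        (cx - w / 2) * width,
        (cy - h / 2) * height,
        (cx + w / 2) * width,
        (cy + h / 2) * height,
      ],
    });
  }
  return results;
}

// Two queries over a 200x100 image: one confident object, one "no object".
const decoded = decodeQueries(
  [
    { logits: [5, 0, 0], box: [0.5, 0.5, 0.5, 0.5] },
    { logits: [0, 0, 5], box: [0.1, 0.1, 0.1, 0.1] },
  ],
  200,
  100,
);
```

Because the set-based training objective discourages duplicate predictions for the same object, a per-query threshold like this is typically all the filtering that is applied.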
Benchmarks
| Dataset | Metric | Score | Source |
|---|---|---|---|
| COCO val2017 | AP (box) | 42.0 | Carion et al., 2020 — Table 1 |
| COCO val2017 | AP50 | 62.4 | Carion et al., 2020 — Table 1 |
| COCO val2017 | AP (small) | 20.5 | Carion et al., 2020 — Table 1 |
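For readers unfamiliar with these metrics: AP50 counts a predicted box as correct when its intersection over union (IoU) with a ground-truth box is at least 0.5, while the headline COCO AP averages over IoU thresholds from 0.50 to 0.95. A minimal IoU sketch follows; the `Box` type and the sample coordinates are illustrative only.

```typescript
// Axis-aligned box as [xMin, yMin, xMax, yMax].
type Box = [number, number, number, number];

// Intersection over union of two boxes; returns 0 when they do not overlap.
function iou(a: Box, b: Box): number {
  const interW = Math.max(0, Math.min(a[2], b[2]) - Math.max(a[0], b[0]));
  const interH = Math.max(0, Math.min(a[3], b[3]) - Math.max(a[1], b[1]));
  const inter = interW * interH;
  const areaA = (a[2] - a[0]) * (a[3] - a[1]);
  const areaB = (b[2] - b[0]) * (b[3] - b[1]);
  const union = areaA + areaB - inter;
  return union > 0 ? inter / union : 0;
}

const predicted: Box = [0, 0, 10, 10];
const groundTruth: Box = [5, 0, 15, 10];

// Intersection 50, union 150, so IoU = 1/3: not a match at the 0.5 threshold.
const overlap = iou(predicted, groundTruth);
const countsForAp50 = overlap >= 0.5;
```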
Research Paper
End-to-End Object Detection with Transformers
arxiv.org
Build a pipeline with detr-resnet-50
Add this model to a processing pipeline alongside other extractors. Combine with retrieval stages for end-to-end search.