detr-resnet-50
by facebook
End-to-end object detection with Transformers — no anchor boxes needed
facebook/detr-resnet-50
mixpeek://image_extractor@v1/facebook_detr_r50_v1
Overview
DETR (DEtection TRansformer) reimagines object detection as a set prediction problem, using a transformer encoder-decoder architecture to directly output a set of bounding boxes and class labels without the need for hand-designed components like anchor boxes or non-maximum suppression.
On Mixpeek, DETR extracts structured object annotations from video frames and images, producing bounding boxes with class labels that power attribute-based filtering in retrieval pipelines.
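As a rough illustration of attribute-based filtering over object annotations, the sketch below keeps only the frames that contain a given label above a confidence cutoff. The `Detection` and `FrameAnnotations` shapes, the `filterFramesByLabel` helper, and the 0.7 cutoff are all hypothetical and chosen for illustration; they are not Mixpeek's actual schema or API.

```typescript
// Hypothetical annotation shapes, assumed for this sketch only.
interface Detection {
  label: string;                          // class label, e.g. "dog"
  score: number;                          // confidence in [0, 1]
  box: [number, number, number, number];  // [xMin, yMin, xMax, yMax] in pixels
}

interface FrameAnnotations {
  frameId: string;
  detections: Detection[];
}

// Keep frames that have at least one detection of `label` scoring >= minScore.
function filterFramesByLabel(
  frames: FrameAnnotations[],
  label: string,
  minScore = 0.7,
): FrameAnnotations[] {
  return frames.filter((frame) =>
    frame.detections.some((d) => d.label === label && d.score >= minScore),
  );
}

// Made-up sample data.
const frames: FrameAnnotations[] = [
  { frameId: "frame-001", detections: [{ label: "dog", score: 0.95, box: [10, 20, 200, 220] }] },
  { frameId: "frame-002", detections: [{ label: "dog", score: 0.4, box: [5, 5, 50, 50] }] },
  { frameId: "frame-003", detections: [{ label: "car", score: 0.99, box: [0, 0, 300, 150] }] },
];

const dogFrames = filterFramesByLabel(frames, "dog");
console.log(dogFrames.map((f) => f.frameId));
```

Here only the first frame passes: the second contains a dog below the cutoff, and the third contains no dog at all.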
Architecture
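A ResNet-50 CNN backbone feeds a transformer with 6 encoder layers and 6 decoder layers. During training, a bipartite matching loss (computed with the Hungarian algorithm) assigns each ground-truth object to exactly one prediction. The decoder processes 100 object queries in parallel, and each query yields one class label (or "no object") and one bounding box.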
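To make the bipartite matching step concrete, here is a minimal sketch of the one-to-one assignment between predictions and ground-truth objects. It is not the Hungarian algorithm: it uses an exhaustive search that returns the same optimal assignment but is only practical for a handful of objects. The cost is also simplified to class probability plus L1 box distance, whereas the paper's cost adds a generalized IoU term and weights. All type and function names are hypothetical.

```typescript
type Box = [number, number, number, number]; // normalized [cx, cy, w, h]

interface Prediction { classProbs: number[]; box: Box; }
interface GroundTruth { classId: number; box: Box; }

// Simplified pairwise cost (assumption: no GIoU term, unit weights).
// Low when the prediction is confident in the true class and the boxes are close.
function matchCost(p: Prediction, g: GroundTruth): number {
  const l1 = p.box.reduce((sum, v, i) => sum + Math.abs(v - g.box[i]), 0);
  return -p.classProbs[g.classId] + l1;
}

// Exhaustive minimum-cost assignment. Returns, for each ground-truth index,
// the index of its matched prediction. Assumes preds.length >= gts.length.
// Unmatched predictions are the ones that would be trained toward "no object".
function matchBruteForce(preds: Prediction[], gts: GroundTruth[]): number[] {
  let bestCost = Infinity;
  let best: number[] = [];
  const used: boolean[] = new Array(preds.length).fill(false);
  const current: number[] = [];

  const search = (g: number, cost: number): void => {
    if (g === gts.length) {
      if (cost < bestCost) { bestCost = cost; best = [...current]; }
      return;
    }
    for (let p = 0; p < preds.length; p++) {
      if (used[p]) continue;
      used[p] = true;
      current.push(p);
      search(g + 1, cost + matchCost(preds[p], gts[g]));
      current.pop();
      used[p] = false;
    }
  };

  search(0, 0);
  return best;
}

// Toy example with classes 0 = cat, 1 = dog, 2 = "no object".
const preds: Prediction[] = [
  { classProbs: [0.1, 0.8, 0.1], box: [0.5, 0.5, 0.2, 0.2] },
  { classProbs: [0.05, 0.05, 0.9], box: [0.1, 0.1, 0.1, 0.1] },
  { classProbs: [0.9, 0.05, 0.05], box: [0.3, 0.3, 0.4, 0.4] },
];
const gts: GroundTruth[] = [
  { classId: 0, box: [0.3, 0.3, 0.4, 0.4] },
  { classId: 1, box: [0.5, 0.5, 0.2, 0.2] },
];

const assignment = matchBruteForce(preds, gts);
console.log(assignment); // [2, 0]: cat matches prediction 2, dog matches prediction 0
```

Prediction 1 is left unmatched, which is the case the loss would push toward the "no object" class. A real training loop would swap the exhaustive search for a polynomial-time solver.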
Mixpeek SDK Integration
import { Mixpeek } from "mixpeek";
// Replace "API_KEY" with your Mixpeek API key.
const mx = new Mixpeek({ apiKey: "API_KEY" });
// Ingest a video and run DETR object detection on it.
await mx.collections.ingest({
  collection_id: "my-collection",
  source: { url: "https://example.com/video.mp4" },
  feature_extractors: [{
    name: "object_detection",
    version: "v1",
    params: {
      model_id: "facebook/detr-resnet-50"
    }
  }]
});
Capabilities
- 91-entry COCO label space (80 annotated object categories) out of the box
- Bounding box + class label predictions
- Extensible to panoptic segmentation by adding a mask head
- No hand-designed post-processing (NMS-free)
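Because each object query yields at most one object, turning raw outputs into detections can be a simple threshold filter rather than non-maximum suppression. The sketch below assumes the output layout of the reference DETR implementation: per-query class logits with a trailing "no object" entry, and boxes as normalized [cx, cy, w, h]. The type names, the `decodeQueries` helper, and the 0.7 threshold are hypothetical choices for illustration.

```typescript
interface QueryOutput {
  logits: number[];                       // class scores; last entry is "no object"
  box: [number, number, number, number];  // normalized [cx, cy, w, h]
}

interface DecodedDetection {
  classId: number;
  score: number;
  box: [number, number, number, number];  // [xMin, yMin, xMax, yMax] in pixels
}

// Numerically stable softmax.
function softmax(logits: number[]): number[] {
  const max = Math.max(...logits);
  const exps = logits.map((v) => Math.exp(v - max));
  const sum = exps.reduce((a, b) => a + b, 0);
  return exps.map((v) => v / sum);
}

// Keep queries whose best real class scores at least `threshold`, and
// convert their boxes from normalized center format to pixel corners.
function decodeQueries(
  queries: QueryOutput[],
  imageWidth: number,
  imageHeight: number,
  threshold = 0.7, // illustrative value, tune per application
): DecodedDetection[] {
  const detections: DecodedDetection[] = [];
  for (const q of queries) {
    const probs = softmax(q.logits).slice(0, -1); // drop the "no object" entry
    let classId = 0;
    for (let i = 1; i < probs.length; i++) {
      if (probs[i] > probs[classId]) classId = i;
    }
    const score = probs[classId];
    if (score < threshold) continue;
    const [cx, cy, w, h] = q.box;
    detections.push({
      classId,
      score,
      box: [
        (cx - w / 2) * imageWidth,
        (cy - h / 2) * imageHeight,
        (cx + w / 2) * imageWidth,
        (cy + h / 2) * imageHeight,
      ],
    });
  }
  return detections;
}

// Toy example: two real classes plus "no object", on a 640x480 image.
const queries: QueryOutput[] = [
  { logits: [5, 0, 0], box: [0.5, 0.5, 0.5, 0.5] }, // confident class 0
  { logits: [0, 0, 5], box: [0.2, 0.2, 0.1, 0.1] }, // mostly "no object"
];

const decoded = decodeQueries(queries, 640, 480);
console.log(decoded);
```

In this toy input only the first query survives, with its box mapped to pixel corners [160, 120, 480, 360]. The second query is dominated by the "no object" score and is dropped.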
Use Cases on Mixpeek
Specification
Research Paper
End-to-End Object Detection with Transformers
arxiv.org
Build a pipeline with detr-resnet-50
Add this model to a processing pipeline alongside other extractors. Combine with retrieval stages for end-to-end search.