yolos-tiny
by hustvl
You Only Look at One Sequence, ViT-based real-time object detection
hustvl/yolos-tiny
Overview
YOLOS adapts the Vision Transformer (ViT) architecture for object detection by simply appending detection tokens to the input sequence. It demonstrates that a pure transformer can perform object detection without any convolutional components.
On Mixpeek, YOLOS Tiny provides a lightweight, fast alternative to DETR for object detection tasks where speed is prioritized over maximum accuracy.
Architecture
A Vision Transformer (ViT-Tiny) with 12 layers. YOLOS appends 100 learnable detection tokens to the image patch sequence and trains with a bipartite matching loss, as in DETR.
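The token arithmetic behind this design can be sketched in a few lines. This is a conceptual illustration only, assuming a 224×224 input and 16×16 patches (typical ViT defaults, not necessarily the exact training resolution used for YOLOS-Tiny):

```typescript
// Conceptual sketch of the YOLOS input sequence layout -- not the real model code.
// Assumed: 224x224 input, 16x16 patches (standard ViT defaults).
const imageSize = 224;
const patchSize = 16;
const numDetTokens = 100; // learnable [DET] tokens appended by YOLOS

const patchesPerSide = imageSize / patchSize;         // 14 patches per side
const numPatchTokens = patchesPerSide ** 2;           // 196 patch tokens
const sequenceLength = numPatchTokens + numDetTokens; // 296 tokens total

console.log(sequenceLength); // 296
```

Each of the 100 detection tokens is decoded into a class label and bounding box, so the model predicts at most 100 objects per image.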
Mixpeek SDK Integration
```typescript
import { Mixpeek } from "mixpeek";

const mx = new Mixpeek({ apiKey: "API_KEY" });

await mx.collections.ingest({
  collection_id: "my-collection",
  source: { url: "https://example.com/video.mp4" },
  feature_extractors: [
    {
      name: "object_detection",
      version: "v1",
      params: {
        model_id: "hustvl/yolos-tiny"
      }
    }
  ]
});
```
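Once extraction runs, each frame yields a set of labeled, scored bounding boxes. The `Detection` shape below is an assumption for illustration (the actual Mixpeek response schema may differ); it shows a typical post-processing step, confidence thresholding, which matters for a speed-optimized model like YOLOS-Tiny:

```typescript
// Hypothetical detection result shape -- the actual Mixpeek response schema may differ.
interface Detection {
  label: string; // COCO category name, e.g. "person"
  score: number; // confidence in [0, 1]
  box: { xmin: number; ymin: number; xmax: number; ymax: number };
}

// Keep only confident detections; since YOLOS-Tiny trades accuracy for speed,
// a moderately high threshold helps suppress noisy boxes.
function filterDetections(dets: Detection[], threshold = 0.7): Detection[] {
  return dets.filter((d) => d.score >= threshold);
}

const sample: Detection[] = [
  { label: "person", score: 0.92, box: { xmin: 10, ymin: 20, xmax: 110, ymax: 220 } },
  { label: "dog", score: 0.35, box: { xmin: 50, ymin: 60, xmax: 90, ymax: 120 } },
];
console.log(filterDetections(sample).map((d) => d.label)); // ["person"]
```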
Capabilities
- Lightweight ViT-based object detection
- Fast inference suitable for real-time processing
- COCO object categories
- Pure transformer architecture (no CNN backbone)
Benchmarks
| Dataset | Metric | Score | Source |
|---|---|---|---|
| COCO val2017 | AP (box) | 30.4 | Fang et al., 2021 — Table 1 |
| COCO val2017 | AP50 | 48.6 | Fang et al., 2021 — Table 1 |
Performance
6.5M params — optimized for edge and high-throughput scenarios
Specification
Research Paper
You Only Look at One Sequence
arxiv.org
Build a pipeline with yolos-tiny
Add this model to a processing pipeline alongside other extractors. Combine with retrieval stages for end-to-end search.
Open Pipeline Builder