yolos-tiny
by hustvl
You Only Look at One Sequence — ViT-based real-time object detection
Overview
YOLOS adapts the Vision Transformer (ViT) architecture for object detection by simply appending detection tokens to the input sequence. It demonstrates that a pure transformer can perform object detection without any convolutional components.
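The detection-token idea can be illustrated with a short PyTorch sketch. Shapes and names here are illustrative, not the authors' code; 192 is the ViT-Tiny hidden size:

```python
import torch

num_patches, num_det_tokens, dim = 196, 100, 192  # 14x14 patches, ViT-Tiny width
patch_tokens = torch.randn(1, num_patches, dim)   # embedded image patches
# Learnable detection tokens, shared across images and appended to the sequence
det_tokens = torch.nn.Parameter(torch.zeros(1, num_det_tokens, dim))
sequence = torch.cat([patch_tokens, det_tokens], dim=1)
print(sequence.shape)  # torch.Size([1, 296, 192])
```

The transformer then processes this combined sequence; each detection token's output embedding is decoded into one class prediction and one bounding box.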
On Mixpeek, YOLOS Tiny provides a lightweight, fast alternative to DETR for object detection tasks where speed is prioritized over maximum accuracy.
Architecture
A Vision Transformer (ViT-Tiny) with 12 layers. It appends 100 learnable detection tokens to the image-patch sequence and is trained with a bipartite matching loss, as in DETR.
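Outside Mixpeek, the checkpoint can also be run directly with the Hugging Face transformers library. A minimal sketch, assuming transformers, torch, and pillow are installed (a blank placeholder image stands in for a real photo):

```python
import torch
from PIL import Image
from transformers import YolosForObjectDetection, YolosImageProcessor

processor = YolosImageProcessor.from_pretrained("hustvl/yolos-tiny")
model = YolosForObjectDetection.from_pretrained("hustvl/yolos-tiny")

image = Image.new("RGB", (640, 480))  # placeholder; load a real image in practice
inputs = processor(images=image, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)  # class logits and boxes for 100 detection tokens

# Convert raw token predictions into (label, score, box) triples above a threshold
target_sizes = torch.tensor([image.size[::-1]])  # (height, width)
results = processor.post_process_object_detection(
    outputs, threshold=0.5, target_sizes=target_sizes
)[0]
for score, label, box in zip(results["scores"], results["labels"], results["boxes"]):
    print(model.config.id2label[label.item()], round(score.item(), 2), box.tolist())
```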
Mixpeek SDK Integration
import { Mixpeek } from "mixpeek";
const mx = new Mixpeek({ apiKey: "API_KEY" });
await mx.collections.ingest({
  collection_id: "my-collection",
  source: { url: "https://example.com/video.mp4" },
  feature_extractors: [{
    name: "object_detection",
    version: "v1",
    params: {
      model_id: "hustvl/yolos-tiny"
    }
  }]
});

Capabilities
- Lightweight ViT-based object detection
- Fast inference suitable for real-time processing
- Detects the 80 COCO object categories
- Pure transformer architecture (no CNN backbone)
Use Cases on Mixpeek
Specification
Research Paper
You Only Look at One Sequence: Rethinking Transformer in Vision through Object Detection
arxiv.org
Build a pipeline with yolos-tiny
Add this model to a processing pipeline alongside other extractors. Combine with retrieval stages for end-to-end search.