Tarsier2-7b-0115
by omni-research
SOTA video description — detailed, temporally-aligned captions that outperform GPT-4o
omni-research/Tarsier2-7b-0115
mixpeek://video_extractor@v1/omni_tarsier2_7b_v1
Overview
Tarsier2 generates highly detailed, temporally aligned video descriptions. It achieves state-of-the-art results across 16 video understanding benchmarks spanning captioning, QA, grounding, and hallucination detection, and it outperforms GPT-4o and Gemini 1.5 Pro on video description quality.
For video RAG, detailed description quality is critical: the richer the textual representation of video content, the better text-based retrieval performs. Tarsier2 produces the kind of dense, accurate descriptions that make video truly searchable.
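To make the retrieval argument concrete, here is a minimal sketch that scores two captions of the same clip against a text query by simple token overlap. Both captions are invented for illustration (they are not real Tarsier2 output), and the `tokenize` and `overlap_score` helpers are hypothetical; a production system would more likely use embeddings or BM25.

```python
import re
from collections import Counter


def tokenize(text: str) -> list[str]:
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def overlap_score(query: str, description: str) -> int:
    """Count query tokens that also occur in the description."""
    q = Counter(tokenize(query))
    d = Counter(tokenize(description))
    # Each query token contributes at most as often as it appears in the query.
    return sum(min(count, d[token]) for token, count in q.items())


# Hypothetical captions for the same clip; neither is real model output.
sparse_caption = "A man talks in a room."
dense_caption = (
    "A man in a grey suit sits at a wooden desk, picks up a red mug, "
    "and then points at a whiteboard while answering a question."
)

query = "man points at whiteboard"
scores = {
    "sparse": overlap_score(query, sparse_caption),
    "dense": overlap_score(query, dense_caption),
}
print(scores)
```

The denser caption matches every query term, while the sparse one matches only "man", which is the effect the paragraph above describes.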
Architecture
A 7B-parameter model from ByteDance Research, optimized for generating faithful, temporally ordered descriptions that minimize hallucination while maximizing detail density.
Mixpeek SDK Integration
mixpeek.ingest.from_url(
    url="s3://media/interview.mp4",
    collection="video_library",
    feature_extractors=[
        {
            "type": "caption",
            "model": "mixpeek://video_extractor@v1/omni_tarsier2_7b_v1",
        }
    ],
)
Capabilities
- Detailed video captioning
- Temporal grounding
- Video QA
- Hallucination-resistant description
- Scene narration
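One way to work with temporally aligned captions downstream is to store them as timestamped segments and look them up by time window. The sketch below assumes a segment shape of start time, end time, and text; this page does not specify Tarsier2's actual output format, so `CaptionSegment` and `segments_in_window` are hypothetical names and the sample segments are invented.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CaptionSegment:
    """A caption tied to a time span in the video (assumed shape)."""
    start_s: float
    end_s: float
    text: str


def segments_in_window(
    segments: list[CaptionSegment], window_start_s: float, window_end_s: float
) -> list[CaptionSegment]:
    """Return segments whose time span overlaps the given window."""
    return [
        s
        for s in segments
        if s.start_s < window_end_s and s.end_s > window_start_s
    ]


# Invented example segments, in temporal order.
segments = [
    CaptionSegment(0.0, 4.0, "A presenter walks on stage."),
    CaptionSegment(4.0, 9.5, "She opens a laptop and starts a slideshow."),
    CaptionSegment(9.5, 15.0, "The audience applauds."),
]

# Which captions describe what happens between 5 s and 10 s?
hits = segments_in_window(segments, 5.0, 10.0)
for segment in hits:
    print(f"[{segment.start_s:.1f}-{segment.end_s:.1f}] {segment.text}")
```

Keeping the timestamps next to the text lets a search hit point back to the moment in the video instead of only to the whole file.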
Benchmarks
| Dataset | Metric | Score | Source |
|---|---|---|---|
| Video Description (16 benchmarks) | Avg Rank | #1 | Model card |
Build a pipeline with Tarsier2-7b-0115
Add this model to a processing pipeline alongside other extractors. Combine with retrieval stages for end-to-end search.
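As a rough illustration of an extract-then-retrieve pipeline, the sketch below runs two stub extractors over each video, merges their text into an inverted index, and answers boolean AND queries. It does not call the Mixpeek SDK: the extractors return canned text, and every function name, the `s3://media/demo.mp4` URL, and the sample text are assumptions made for the example.

```python
import re
from collections import defaultdict
from typing import Callable

Extractor = Callable[[str], str]

# Canned outputs standing in for real extractors (hypothetical data).
_CAPTIONS = {
    "s3://media/interview.mp4": "Two people sit facing each other and talk across a table.",
    "s3://media/demo.mp4": "A hand assembles a drone on a workbench.",
}
_TRANSCRIPTS = {
    "s3://media/interview.mp4": "Welcome to the show, thanks for joining.",
    "s3://media/demo.mp4": "First attach the propellers.",
}


def stub_caption_extractor(video_url: str) -> str:
    """Stand-in for a video captioning model."""
    return _CAPTIONS.get(video_url, "")


def stub_transcript_extractor(video_url: str) -> str:
    """Stand-in for a speech-to-text extractor."""
    return _TRANSCRIPTS.get(video_url, "")


def tokenize(text: str) -> list[str]:
    return re.findall(r"[a-z0-9]+", text.lower())


def run_pipeline(
    video_urls: list[str], extractors: list[Extractor]
) -> dict[str, set[str]]:
    """Run every extractor on every video and build token -> URLs."""
    index: dict[str, set[str]] = defaultdict(set)
    for url in video_urls:
        for extract in extractors:
            for token in tokenize(extract(url)):
                index[token].add(url)
    return index


def search(index: dict[str, set[str]], query: str) -> list[str]:
    """Return URLs whose extracted text contains every query token."""
    tokens = tokenize(query)
    if not tokens:
        return []
    matches = set.intersection(*(index.get(t, set()) for t in tokens))
    return sorted(matches)


index = run_pipeline(
    list(_CAPTIONS), [stub_caption_extractor, stub_transcript_extractor]
)
print(search(index, "drone propellers"))
```

The query "drone propellers" only matches because the caption and the transcript are indexed together, which is the point of running a description model alongside other extractors.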