Tarsier2-7b-0115

by omni-research

State-of-the-art video description: detailed, temporally aligned captions that outperform GPT-4o

45K downloads/month

7B params

Identifiers

Model ID

omni-research/Tarsier2-7b-0115

Feature URI

mixpeek://video_extractor@v1/omni_tarsier2_7b_v1
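
The feature URI above appears to follow the pattern mixpeek://<extractor>@<version>/<feature>. That reading is an assumption drawn from this single example, not from a published spec. Under that assumption, a minimal sketch of splitting a URI into its parts might look like this; `parseFeatureUri` and the field names are hypothetical helpers, not part of the Mixpeek SDK.

```typescript
// Hypothetical helper, not part of the Mixpeek SDK.
// Assumes the layout mixpeek://<extractor>@<version>/<feature>,
// inferred from the single URI shown on this page.
interface FeatureUri {
  extractor: string;
  version: string;
  feature: string;
}

function parseFeatureUri(uri: string): FeatureUri {
  const match = /^mixpeek:\/\/([^@\/]+)@([^\/]+)\/(.+)$/.exec(uri);
  if (match === null) {
    throw new Error(`Unrecognized feature URI: ${uri}`);
  }
  return { extractor: match[1], version: match[2], feature: match[3] };
}

const parsed = parseFeatureUri("mixpeek://video_extractor@v1/omni_tarsier2_7b_v1");
console.log(parsed.extractor, parsed.version, parsed.feature);
```

A regular expression is used here rather than the built-in URL class because URL would interpret the text before "@" as a username rather than as an extractor name.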

Overview

Tarsier2 generates highly detailed, temporally aligned video descriptions. It achieves state-of-the-art results across 16 video understanding benchmarks spanning captioning, question answering, grounding, and hallucination detection, and it outperforms GPT-4o and Gemini 1.5 Pro on video description quality.

For video RAG, detailed description quality is critical: the richer the textual representation of video content, the better text-based retrieval performs. Tarsier2 produces the kind of dense, accurate descriptions that make video truly searchable.
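
To illustrate why description density matters for text-based retrieval, here is a toy sketch that scores two captions of the same clip against a query by term overlap. Both captions are invented for illustration and are not actual Tarsier2 output; the `tokenize` and `overlapScore` helpers are hypothetical. A production system would more likely use BM25 or embedding similarity, but the effect is the same in kind: a caption can only match on details it actually mentions.

```typescript
// Toy term-overlap retrieval; a stand-in for BM25 or embedding search.
// Lowercases the text and returns its unique alphanumeric tokens.
function tokenize(text: string): string[] {
  const words: string[] = text.toLowerCase().match(/[a-z0-9]+/g) ?? [];
  return words.filter((w, i) => words.indexOf(w) === i);
}

// Fraction of unique query terms that appear in the caption (0 to 1).
function overlapScore(query: string, caption: string): number {
  const queryTerms = tokenize(query);
  const captionTerms = tokenize(caption);
  if (queryTerms.length === 0) return 0;
  const hits = queryTerms.filter((t) => captionTerms.indexOf(t) !== -1).length;
  return hits / queryTerms.length;
}

// Invented example captions for the same hypothetical clip.
const sparseCaption = "A person cooking.";
const denseCaption =
  "A chef in a white apron chops red onions, then adds them to a steel pan on a gas stove.";

const query = "chef chops onions";
const sparseScore = overlapScore(query, sparseCaption);
const denseScore = overlapScore(query, denseCaption);
console.log({ sparseScore, denseScore });
```

In this example the sparse caption matches none of the query terms while the dense caption matches all three, so only the dense description makes the clip findable for this query.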

Architecture

Tarsier2-7b-0115 is a 7B-parameter model from ByteDance Research. It is optimized for generating faithful, temporally ordered descriptions that minimize hallucination while maximizing detail density.

Mixpeek SDK Integration

import { Mixpeek } from "mixpeek";

// Replace "API_KEY" with your own Mixpeek API key
const mx = new Mixpeek({ apiKey: "API_KEY" });

// Managed: create a collection over a bucket; Mixpeek runs this model's extractor
const collection = await mx.collections.create({
  namespace_id: "my-namespace",
  collection_name: "my-collection",
  source: { type: "bucket", bucket_ids: ["bkt_your_bucket"] },
  feature_extractor: {
    feature_extractor_name: "caption",
    version: "v1",
    parameters: { model_id: "mixpeek://video_extractor@v1/omni_tarsier2_7b_v1" },
  },
});