    HF | Scene Captioning | Apache 2.0

    Tarsier2-7b-0115

    by omni-research

    State-of-the-art video description: detailed, temporally aligned captions reported to outperform GPT-4o on description quality

    45K downloads/month
    7B parameters
    Identifiers
    Model ID
    omni-research/Tarsier2-7b-0115
    Feature URI
    mixpeek://video_extractor@v1/omni_tarsier2_7b_v1
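The Feature URI appears to follow a `scheme://extractor@version/feature` layout. The sketch below splits it into those parts with plain string operations. The field names (`extractor`, `version`, `feature`) and the helper `parse_feature_uri` are hypothetical: they are one reading of the URI shown above, not a documented Mixpeek format.

```python
from typing import NamedTuple


class FeatureURI(NamedTuple):
    """Parts of a feature URI; the field names are an assumed interpretation."""
    scheme: str
    extractor: str
    version: str
    feature: str


def parse_feature_uri(uri: str) -> FeatureURI:
    """Split 'scheme://extractor@version/feature' into its four parts."""
    scheme, sep, rest = uri.partition("://")
    head, slash, feature = rest.partition("/")
    extractor, at, version = head.partition("@")
    # Reject anything that does not match the assumed layout.
    if not (sep and slash and at and scheme and extractor and version and feature):
        raise ValueError(f"unexpected feature URI layout: {uri!r}")
    return FeatureURI(scheme, extractor, version, feature)


parsed = parse_feature_uri("mixpeek://video_extractor@v1/omni_tarsier2_7b_v1")
print(parsed)
```

Validating the layout up front can catch a mistyped URI before it is sent in an ingestion request.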

    Overview

    Tarsier2 generates highly detailed, temporally aligned video descriptions. According to the model card, it achieves state-of-the-art results across 16 video understanding benchmarks spanning captioning, QA, grounding, and hallucination detection, and it outperforms GPT-4o and Gemini 1.5 Pro on video description quality.

    For video RAG, detailed description quality is critical: the richer the textual representation of video content, the better text-based retrieval performs. Tarsier2 produces the kind of dense, accurate descriptions that make video truly searchable.
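To illustrate why caption detail matters for retrieval, here is a minimal sketch of text search over per-scene captions. It uses simple token overlap as a stand-in for a real semantic retriever. The `SceneCaption` type, the `search` helper, and the caption texts are all invented for illustration and are not Tarsier2 output or part of any Mixpeek API.

```python
import re
from dataclasses import dataclass


@dataclass
class SceneCaption:
    """One caption tied to a time span; all fields are hypothetical."""
    video_id: str
    start_s: float
    end_s: float
    text: str


def tokenize(text: str) -> set[str]:
    """Lowercase the text and return its set of alphanumeric tokens."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def search(query: str, captions: list[SceneCaption], top_k: int = 3):
    """Rank captions by the fraction of query tokens they contain."""
    q = tokenize(query)
    if not q:
        return []
    scored = []
    for cap in captions:
        overlap = len(q & tokenize(cap.text))
        if overlap:  # skip captions that share no tokens with the query
            scored.append((overlap / len(q), cap))
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:top_k]


# Made-up captions standing in for model output.
captions = [
    SceneCaption("interview.mp4", 0.0, 12.5,
                 "A woman in a blue blazer sits at a desk and greets the camera."),
    SceneCaption("interview.mp4", 12.5, 30.0,
                 "She picks up a red mug, sips, then points at a chart on the wall."),
    SceneCaption("interview.mp4", 30.0, 41.0,
                 "A man enters the room and shakes her hand."),
]

results = search("woman points at chart", captions)
for score, cap in results:
    print(f"{score:.2f} {cap.video_id} [{cap.start_s}s-{cap.end_s}s] {cap.text}")
```

The second scene ranks first only because its caption mentions the pointing gesture and the chart. A terser caption such as "A woman talks" would leave that moment unfindable by this query.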

    Architecture

    A 7B-parameter model from ByteDance Research, optimized for generating faithful, temporally ordered descriptions that minimize hallucination while maximizing detail density.

    Mixpeek SDK Integration

    # Assumes `mixpeek` is an already-initialized Mixpeek client.
    mixpeek.ingest.from_url(
        url="s3://media/interview.mp4",
        collection="video_library",
        feature_extractors=[{
            "type": "caption",
            "model": "mixpeek://video_extractor@v1/omni_tarsier2_7b_v1",
        }],
    )

    Capabilities

    • Detailed video captioning
    • Temporal grounding
    • Video QA
    • Hallucination-resistant description
    • Scene narration

    Use Cases on Mixpeek

    • Video-to-text for searchable video archives
    • Rich metadata generation for video RAG
    • Content description for accessibility
    • Ad creative analysis

    Benchmarks

    Dataset                            Metric    Score  Source
    Video Description (16 benchmarks)  Avg Rank  #1     Model card

    Performance

    Input Size: Variable
    GPU Latency: ~180 ms per scene (A100)
    GPU Throughput: ~6 scenes/sec
    GPU Memory: Model dependent
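As a rough capacity-planning aid, the approximate figures above can be turned into a back-of-the-envelope time estimate. This is a sketch only: the 1,200-scene library is a hypothetical workload, the helper names are illustrative, and real throughput will vary with hardware, batching, and scene length.

```python
SCENES_PER_SEC = 6.0   # approximate GPU throughput quoted above
LATENCY_S = 0.180      # approximate per-scene latency quoted above (A100)


def estimate_seconds(num_scenes: int, scenes_per_sec: float = SCENES_PER_SEC) -> float:
    """Estimate wall-clock seconds to caption num_scenes at a steady rate."""
    if num_scenes < 0 or scenes_per_sec <= 0:
        raise ValueError("num_scenes must be >= 0 and scenes_per_sec must be > 0")
    return num_scenes / scenes_per_sec


def sequential_rate(latency_s: float = LATENCY_S) -> float:
    """Scenes per second if scenes are processed strictly one at a time."""
    return 1.0 / latency_s


total_s = estimate_seconds(1200)  # hypothetical library of 1,200 scenes
print(f"~{total_s:.0f} s (~{total_s / 60:.1f} min) for 1,200 scenes")
print(f"one-at-a-time rate: ~{sequential_rate():.2f} scenes/sec")
```

The one-at-a-time rate implied by the latency (about 5.6 scenes/sec) is consistent with the rounded throughput figure of about 6 scenes/sec.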

    Specification

    Framework: HF
    Organization: omni-research
    Feature: Scene Captioning
    Output: text
    Modalities: video, image
    Retriever: Semantic Search
    Parameters: 7B
    License: Apache 2.0
    Downloads/mo: 45K

    Build a pipeline with Tarsier2-7b-0115

    Add this model to a processing pipeline alongside other extractors. Combine with retrieval stages for end-to-end search.
