    HF · Scene Captioning · Apache 2.0

    Tempo-6B

    by Vision-CAIR

    Compact 6B model for hours-long video understanding via query-aware temporal compression

    18K downloads/month
    6B params
    Identifiers
    Model ID: Vision-CAIR/Tempo-6B
    Feature URI: mixpeek://video_extractor@v1/visioncair_tempo_6b_v1
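
The Feature URI appears to follow a `scheme://extractor@version/model` layout. The sketch below splits it into parts using only the standard library. The component names (`extractor`, `version`, `model`) and the `parse_feature_uri` helper are assumptions made for illustration; the page does not document the URI grammar.

```python
from typing import NamedTuple
from urllib.parse import urlsplit


class FeatureURI(NamedTuple):
    extractor: str  # assumed meaning: extractor family, e.g. "video_extractor"
    version: str    # assumed meaning: extractor version, e.g. "v1"
    model: str      # assumed meaning: model slug


def parse_feature_uri(uri: str) -> FeatureURI:
    """Split a mixpeek:// feature URI into its apparent components."""
    parts = urlsplit(uri)
    if parts.scheme != "mixpeek":
        raise ValueError(f"expected a mixpeek:// URI, got {uri!r}")
    # The authority part looks like "<extractor>@<version>".
    extractor, sep, version = parts.netloc.partition("@")
    model = parts.path.lstrip("/")
    if not (extractor and sep and version and model):
        raise ValueError(f"unexpected feature URI layout: {uri!r}")
    return FeatureURI(extractor, version, model)


if __name__ == "__main__":
    print(parse_feature_uri("mixpeek://video_extractor@v1/visioncair_tempo_6b_v1"))
```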

    Overview

    Tempo is a 6B-parameter vision-language model built for long-video understanding. Most video VLMs struggle beyond a few minutes; Tempo processes hours-long videos using Adaptive Token Allocation (ATA), a query-aware compression mechanism that assigns between 0.5 and 16 visual tokens to each frame according to how relevant its content is to the query.
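
To make the idea of a per-frame token budget concrete, here is a minimal sketch that maps hypothetical relevance scores to a budget between 0.5 and 16 tokens per frame. It is not Tempo's actual mechanism: the linear mapping, the `allocate_tokens` name, and the relevance scores (which might come from, say, query-to-frame similarity) are all assumptions. A fractional budget such as 0.5 is read here as an average across frames, which is also an assumption.

```python
def allocate_tokens(relevance, min_tokens=0.5, max_tokens=16.0):
    """Map per-frame relevance scores to per-frame token budgets.

    Illustrative only: scores are min-max normalised, then mapped linearly
    onto [min_tokens, max_tokens]. Tempo's real allocation rule may differ.
    """
    if not relevance:
        return []
    lo, hi = min(relevance), max(relevance)
    span = hi - lo
    budgets = []
    for score in relevance:
        # If every frame is equally relevant, fall back to the minimum budget.
        norm = (score - lo) / span if span > 0 else 0.0
        budgets.append(min_tokens + norm * (max_tokens - min_tokens))
    return budgets


if __name__ == "__main__":
    # Hypothetical relevance scores for five sampled frames.
    print(allocate_tokens([0.0, 0.1, 0.9, 1.0, 0.5]))
```

Frames that score low still keep a small budget, so the model retains a coarse view of the whole timeline while spending most tokens on the frames the query cares about.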

    Despite having only 6B parameters, Tempo scores 52.3 on LVBench (average video length 4101 seconds), and is reported to outperform GPT-4o and Gemini 1.5 Pro on long-video benchmarks. On Mixpeek, Tempo is well suited to meeting recordings, surveillance footage, lectures, and other long-form video where understanding temporal structure across hours of content is critical.

    Architecture

    A vision encoder with query-aware Adaptive Token Allocation (ATA) compresses each video frame to 0.5–16 tokens based on its relevance to the query. The model has 6B parameters and processes videos up to several hours long within a bounded context window by dynamically allocating its representation budget across time.
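
A rough back-of-the-envelope sketch shows why a bounded context window forces this kind of allocation. The sampling rate of 1 frame per second, the context budget, and both helper names are assumptions for illustration; the page does not state how Tempo samples frames or how large its context is.

```python
def token_bounds(duration_s, fps, min_tokens=0.5, max_tokens=16.0):
    """Return (lowest, highest) total visual-token counts for a video."""
    frames = int(duration_s * fps)
    return frames * min_tokens, frames * max_tokens


def fit_to_budget(budgets, context_budget, min_tokens=0.5):
    """Shrink per-frame budgets so their sum fits a fixed context budget.

    Only the part of each budget above the per-frame floor is scaled, so
    no frame drops below min_tokens. This is a sketch, not Tempo's method.
    """
    floor_total = min_tokens * len(budgets)
    if floor_total > context_budget:
        raise ValueError("context budget is below the per-frame minimum")
    total = sum(budgets)
    if total <= context_budget:
        return list(budgets)
    scale = (context_budget - floor_total) / (total - floor_total)
    return [min_tokens + (b - min_tokens) * scale for b in budgets]


if __name__ == "__main__":
    # LVBench's average video length (4101 s) at an assumed 1 frame/second.
    low, high = token_bounds(4101, fps=1.0)
    print(low, high)
    print(fit_to_budget([16.0, 16.0, 0.5, 0.5], context_budget=17.0))
```

Under these assumptions, the gap between the lowest and highest totals spans a factor of 32, which is the range a query-aware allocator has to work within.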

    Mixpeek SDK Integration

    from mixpeek import Mixpeek

    mixpeek = Mixpeek(api_key="YOUR_API_KEY")

    mixpeek.ingest.videos(
        collection="meeting_recordings",
        source={"type": "s3", "bucket": "recordings"},
        pipeline={
            "captioning": {
                "model": "mixpeek://video_extractor@v1/visioncair_tempo_6b_v1"
            }
        },
    )
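
If the same ingestion call is repeated across several collections, the keyword arguments can be assembled as plain data first. The sketch below only builds dictionaries in the shape shown in the snippet above; `ingest_request` and `captioning_pipeline` are hypothetical helpers, not part of the Mixpeek SDK, and the SDK itself is not called.

```python
TEMPO_URI = "mixpeek://video_extractor@v1/visioncair_tempo_6b_v1"


def captioning_pipeline(feature_uri: str) -> dict:
    """Build the pipeline block used in the ingestion snippet."""
    if not feature_uri.startswith("mixpeek://"):
        raise ValueError(f"expected a mixpeek:// feature URI, got {feature_uri!r}")
    return {"captioning": {"model": feature_uri}}


def ingest_request(collection: str, bucket: str, feature_uri: str = TEMPO_URI) -> dict:
    """Assemble keyword arguments in the shape passed to ingest.videos above."""
    return {
        "collection": collection,
        "source": {"type": "s3", "bucket": bucket},
        "pipeline": captioning_pipeline(feature_uri),
    }


if __name__ == "__main__":
    request = ingest_request("meeting_recordings", "recordings")
    print(request)
    # Assuming the call shape from the snippet above, this could be passed as:
    # mixpeek.ingest.videos(**request)
```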

    Capabilities

    • Hours-long video understanding (4000+ second videos)
    • Query-aware temporal compression for efficient processing
    • Outperforms GPT-4o on long-video benchmarks with only 6B parameters
    • Temporal reasoning across scenes separated by minutes or hours

    Use Cases on Mixpeek

    • Meeting recording analysis and summarization
    • Surveillance footage search and event detection
    • Lecture and webinar content indexing
    • Long-form content moderation
    • Agent perception over extended video streams

    Benchmarks

    Dataset            Metric     Score   Source
    LVBench            Accuracy   52.3    Model card
    Video-MME (long)   Accuracy   58.7    Model card
    MLVU               Score      67.4    Model card

    Performance

    Input Size: Variable
    GPU Latency: ~0.5–16 tokens/frame (adaptive)
    GPU Throughput: Hours-long video in single pass
    GPU Memory: ~14 GB (A100)
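
As a sanity check on the ~14 GB figure, the weights of a 6B-parameter model can be sized with simple arithmetic. The sketch assumes 16-bit weights (2 bytes per parameter); the page does not state the serving precision, and attributing the remainder to activations and cache is likewise an assumption.

```python
def weight_memory_gb(n_params: float, bytes_per_param: int = 2) -> float:
    """Approximate memory for model weights in decimal gigabytes."""
    return n_params * bytes_per_param / 1e9


if __name__ == "__main__":
    weights_gb = weight_memory_gb(6e9)  # assumes fp16/bf16 weights
    listed_gb = 14                      # figure listed on this page
    print(f"weights: {weights_gb:.1f} GB")
    print(f"left for activations, cache, and runtime: {listed_gb - weights_gb:.1f} GB")
```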

    Specification

    Framework: HF
    Organization: Vision-CAIR
    Feature: Scene Captioning
    Output: text
    Modalities: video, image
    Retriever: Semantic Search
    Parameters: 6B
    License: Apache 2.0
    Downloads/mo: 18K

    Research Paper

    Model paper or technical report

    arxiv.org

    Build a pipeline with Tempo-6B

    Add this model to a processing pipeline alongside other extractors. Combine with retrieval stages for end-to-end search.
