Tempo-6B

by Vision-CAIR

Compact 6B model for hours-long video understanding via query-aware temporal compression

18K downloads/month

6B params

Identifiers

Model ID

Vision-CAIR/Tempo-6B

Feature URI

mixpeek://video_extractor@v1/visioncair_tempo_6b_v1

Overview

Tempo is a 6B-parameter vision-language model built for understanding very long videos. Most video VLMs struggle beyond a few minutes; Tempo handles hours-long videos with Adaptive Token Allocation (ATA), a query-aware compression mechanism that assigns between 0.5 and 16 visual tokens to each frame according to how relevant its content is to the query.

Despite having only 6B parameters, Tempo scores 52.3 on LVBench (average video length 4,101 seconds, about 68 minutes), outperforming GPT-4o and Gemini 1.5 Pro on long-video benchmarks. On Mixpeek, Tempo is well suited to meeting recordings, surveillance footage, lectures, and other long-form video where understanding temporal structure across hours of content is critical.

Architecture

Tempo pairs a vision encoder with query-aware Adaptive Token Allocation (ATA), which compresses each video frame to between 0.5 and 16 tokens based on its relevance to the query. The model has 6B parameters. By dynamically allocating its representation budget across time, it processes videos up to several hours long within a bounded context window.
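To give a feel for what a 0.5 to 16 tokens-per-frame range means for context size, here is a rough sketch. It is not Tempo's actual allocation rule: the function names, the 1 frame-per-second sampling rate, and the linear mapping from a relevance score to a token count are all assumptions made for illustration. Only the two per-frame bounds come from the description above.

```typescript
// Hypothetical illustration only; not Tempo's real ATA implementation.
const MIN_TOKENS_PER_FRAME = 0.5; // lower bound stated in the description
const MAX_TOKENS_PER_FRAME = 16;  // upper bound stated in the description

// Lower and upper bounds on total visual tokens for a video,
// given its duration and an assumed frame sampling rate.
function tokenBudgetRange(
  durationSeconds: number,
  framesPerSecond: number
): { frames: number; min: number; max: number } {
  const frames = Math.floor(durationSeconds * framesPerSecond);
  return {
    frames,
    min: frames * MIN_TOKENS_PER_FRAME,
    max: frames * MAX_TOKENS_PER_FRAME,
  };
}

// Toy allocator: map a per-frame relevance score in [0, 1] to a token
// count by linear interpolation between the two bounds. Scores outside
// [0, 1] are clamped. The linear mapping is an assumption.
function allocateTokens(relevance: number[]): number[] {
  const span = MAX_TOKENS_PER_FRAME - MIN_TOKENS_PER_FRAME;
  return relevance.map((score) => {
    const clamped = Math.min(1, Math.max(0, score));
    return MIN_TOKENS_PER_FRAME + clamped * span;
  });
}

// A 4101-second video (the LVBench average) sampled at an assumed 1 fps.
const range = tokenBudgetRange(4101, 1);
console.log(range);

// Three frames: irrelevant, fully relevant, and halfway in between.
console.log(allocateTokens([0, 1, 0.5]));
```

Under these assumptions, the 4,101-second example spans 4,101 frames and somewhere between 2,050.5 and 65,616 visual tokens, depending on how much of the video the query makes relevant. The three sample frames receive 0.5, 16, and 8.25 tokens respectively.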

Mixpeek SDK Integration

import { Mixpeek } from "mixpeek";

const mx = new Mixpeek({ apiKey: "API_KEY" });

// Managed: create a collection over a bucket; Mixpeek runs this model's extractor
const collection = await mx.collections.create({
  namespace_id: "my-namespace",
  collection_name: "my-collection",
  source: { type: "bucket", bucket_ids: ["bkt_your_bucket"] },
  feature_extractor: {
    feature_extractor_name: "s3",
    version: "v1",
    parameters: { model_id: "mixpeek://video_extractor@v1/visioncair_tempo_6b_v1" },
  },
});