VideoLLaMA3-7B

by DAMO-NLP-SG

Video understanding foundation model with efficient long-video processing

4K downloads/month

76 likes

8.0B parameters

Identifiers

Model ID

DAMO-NLP-SG/VideoLLaMA3-7B

Feature URI

mixpeek://video_extractor@v1/damo_videollama3_7b_v1

Overview

VideoLLaMA3 is a multimodal foundation model for image and video understanding from Alibaba DAMO Academy. It uses a vision-centric architecture and a four-stage training pipeline that ends with video-centric fine-tuning.

To process long videos efficiently, the model reduces the number of vision tokens where neighboring frames are similar. This makes it practical to index hours of footage without compute cost growing in proportion to video length.

Architecture

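A 7B-class model (the model listing reports 8.0B parameters in total) with a vision-centric design. Training proceeds in four stages: vision encoder adaptation → vision-language alignment → multi-task fine-tuning → video-centric fine-tuning. For long videos, vision tokens are reduced adaptively based on inter-frame similarity.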
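
To make the idea of similarity-based token reduction concrete, here is a minimal sketch. It is an illustration of the general technique only, not the model's actual implementation: the `Frame` type, the `pruneRedundantPatches` helper, the mean-absolute-difference metric, and the 0.1 threshold are all assumptions chosen for the example. Each frame is represented as a list of patch vectors, and a patch is dropped when it barely differs from the same patch position in the previous frame.

```typescript
// Illustrative sketch only: names, metric, and threshold are assumptions,
// not taken from the VideoLLaMA3 code base.
type Frame = number[][]; // one feature vector per patch position

interface KeptToken {
  frame: number; // index of the frame the token came from
  patch: number; // patch position within that frame
  vector: number[];
}

// Mean absolute difference between two equal-length vectors.
function meanAbsDiff(a: number[], b: number[]): number {
  let sum = 0;
  for (let i = 0; i < a.length; i++) sum += Math.abs(a[i] - b[i]);
  return a.length === 0 ? 0 : sum / a.length;
}

// Keep every patch of the first frame. For later frames, keep a patch only if
// it differs from the same patch in the previous frame by more than `threshold`.
function pruneRedundantPatches(frames: Frame[], threshold = 0.1): KeptToken[] {
  const kept: KeptToken[] = [];
  frames.forEach((frame, f) => {
    frame.forEach((vector, p) => {
      const prev = f > 0 ? frames[f - 1][p] : undefined;
      if (prev === undefined || meanAbsDiff(vector, prev) > threshold) {
        kept.push({ frame: f, patch: p, vector });
      }
    });
  });
  return kept;
}

// Three tiny frames with two patches each; the third frame repeats the second.
const frames: Frame[] = [
  [[0.0, 0.0], [1.0, 1.0]],
  [[0.0, 0.05], [0.2, 0.2]],
  [[0.0, 0.05], [0.2, 0.2]],
];
const kept = pruneRedundantPatches(frames);
console.log(`${kept.length} of 6 patch tokens kept`);
```

In this toy input the static patches are dropped and only the patch that changes is carried forward, so a mostly static video contributes far fewer tokens than a fast-changing one of the same length.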

Mixpeek SDK Integration

import { Mixpeek } from "mixpeek";

// "API_KEY" is a placeholder; substitute your own Mixpeek API key
const mx = new Mixpeek({ apiKey: "API_KEY" });

// Managed: create a collection over a bucket; Mixpeek runs this model's extractor
const collection = await mx.collections.create({
  namespace_id: "my-namespace",
  collection_name: "my-collection",
  source: { type: "bucket", bucket_ids: ["bkt_your_bucket"] },
  feature_extractor: {
    feature_extractor_name: "caption",
    version: "v1",
    parameters: { model_id: "mixpeek://video_extractor@v1/damo_videollama3_7b_v1" },
  },
});