    HF · Scene Captioning · Apache 2.0

    VideoLLaMA3-7B

    by DAMO-NLP-SG

    Video understanding foundation model with efficient long-video processing

    62K downloads/month
    7B parameters
    Identifiers
    Model ID
    DAMO-NLP-SG/VideoLLaMA3-7B
    Feature URI
    mixpeek://video_extractor@v1/damo_videollama3_7b_v1
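
The Feature URI above appears to pack three pieces of information into one string: an extractor name, a version, and a model slug. The sketch below splits it into those parts so a pipeline can validate the string before use. The component names (`extractor`, `version`, `model`) and the helper `parse_feature_uri` are hypothetical; they are inferred from the shape of the URI, not taken from Mixpeek documentation.

```python
import re
from typing import NamedTuple


class FeatureURI(NamedTuple):
    """Parts of a feature URI; field names are assumptions for illustration."""
    extractor: str
    version: str
    model: str


# Assumed layout: mixpeek://<extractor>@<version>/<model>
_FEATURE_URI = re.compile(
    r"^mixpeek://(?P<extractor>[^@/]+)@(?P<version>[^/]+)/(?P<model>.+)$"
)


def parse_feature_uri(uri: str) -> FeatureURI:
    """Split a feature URI into its parts, or raise ValueError if it does not fit."""
    match = _FEATURE_URI.match(uri)
    if match is None:
        raise ValueError(f"not a recognised feature URI: {uri!r}")
    return FeatureURI(**match.groupdict())


parsed = parse_feature_uri("mixpeek://video_extractor@v1/damo_videollama3_7b_v1")
print(parsed.extractor, parsed.version, parsed.model)
```

Checking the string up front means a typo in the model slug or a missing version fails in your own code, rather than after a request has been sent.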

    Overview

    VideoLLaMA3 is a frontier multimodal model for image and video understanding from Alibaba DAMO Academy. It uses a vision-centric architecture with a 4-stage training pipeline including video-centric fine-tuning.

    The model reduces vision tokens based on frame similarity for efficient long-video processing, making it practical for indexing hours of footage without proportional compute cost.
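
To make the idea of similarity-based token reduction concrete, here is a deliberately simplified sketch. It is not VideoLLaMA3's implementation. It assumes each frame is a list of patch vectors, compares each patch with the same position in the previous frame using a mean absolute difference, and drops patches that changed less than a threshold. The function names, the `threshold` value, and the toy data are all hypothetical.

```python
from typing import List, Sequence

Patch = Sequence[float]
Frame = Sequence[Patch]


def mean_abs_diff(a: Patch, b: Patch) -> float:
    """Mean absolute difference between two equally sized patch vectors."""
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)


def prune_similar_patches(frames: Sequence[Frame], threshold: float = 0.1) -> List[List[int]]:
    """For each frame, return the indices of the patches that are kept.

    The first frame keeps every patch. Later frames keep only patches that
    differ from the same position in the previous frame by at least `threshold`.
    """
    kept: List[List[int]] = []
    previous = None
    for frame in frames:
        if previous is None:
            kept.append(list(range(len(frame))))
        else:
            kept.append([
                i for i, patch in enumerate(frame)
                if mean_abs_diff(patch, previous[i]) >= threshold
            ])
        previous = frame
    return kept


# Toy clip: 3 frames, 2 patches per frame, 2 values per patch.
clip = [
    [[0.0, 0.0], [1.0, 1.0]],
    [[0.0, 0.0], [0.5, 1.0]],   # only the second patch changes
    [[0.0, 0.05], [0.5, 1.0]],  # almost nothing changes
]
kept_indices = prune_similar_patches(clip, threshold=0.1)
print(kept_indices)  # [[0, 1], [1], []]
```

In this toy clip, six patches shrink to three. Static footage benefits the most, because unchanged regions contribute no new tokens after the first frame.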

    Architecture

    7B-parameter model with a vision-centric design. 4-stage training: vision encoder adaptation → vision-language alignment → multi-task fine-tuning → video-centric fine-tuning. Adaptive token reduction based on inter-frame similarity for long videos.

    Mixpeek SDK Integration

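    # Assumes `mixpeek` is an already-initialized SDK client.
    mixpeek.ingest.from_url(
        url="s3://footage/episode.mp4",      # source video to ingest
        collection="video_archive",          # target collection
        feature_extractors=[{
            "type": "caption",               # generate scene captions
            "model": "mixpeek://video_extractor@v1/damo_videollama3_7b_v1"
        }]
    )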

    Capabilities

    • Video comprehension
    • Image understanding
    • Long-video processing
    • Scene description
    • Video QA
    • Temporal reasoning

    Use Cases on Mixpeek

    Video content indexing at scale
    Generating scene descriptions for video search
    Long-form video summarization
    Video QA for content libraries

    Performance

    Input Size: Variable
    GPU Latency: ~200ms per scene (A100)
    GPU Throughput: ~5 scenes/sec
    GPU Memory: Model dependent
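
The latency and throughput figures above agree with each other (1 / 0.2 s = 5 scenes/sec), so they can be used for rough capacity planning. The sketch below is a back-of-the-envelope estimate only. It assumes the listed throughput is sustained and scales linearly across GPUs, and the scene length of 6 seconds is a hypothetical value, not a figure from this page.

```python
def estimate_processing_seconds(
    num_scenes: int,
    scenes_per_second: float = 5.0,  # approximate throughput listed above
    num_gpus: int = 1,               # assumes linear scaling across GPUs
) -> float:
    """Estimate wall-clock seconds needed to caption `num_scenes` scenes."""
    if num_scenes < 0 or scenes_per_second <= 0 or num_gpus < 1:
        raise ValueError("inputs must be non-negative scenes and positive rates")
    return num_scenes / (scenes_per_second * num_gpus)


# Hypothetical workload: 10 hours of footage, one scene every 6 seconds.
footage_seconds = 10 * 3600
assumed_scene_length = 6
scenes = footage_seconds // assumed_scene_length      # 6000 scenes

single_gpu = estimate_processing_seconds(scenes)               # 1200.0 s
four_gpus = estimate_processing_seconds(scenes, num_gpus=4)    # 300.0 s
print(f"{scenes} scenes: {single_gpu / 60:.0f} min on 1 GPU, "
      f"{four_gpus / 60:.0f} min on 4 GPUs")
```

Real throughput depends on scene length, resolution, and batching, so treat the result as an order-of-magnitude guide.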

    Specification

    Framework: HF
    Organization: DAMO-NLP-SG
    Feature: Scene Captioning
    Output: text
    Modalities: video, image
    Retriever: Semantic Search
    Parameters: 7B
    License: Apache 2.0
    Downloads/mo: 62K

    Build a pipeline with VideoLLaMA3-7B

    Add this model to a processing pipeline alongside other extractors. Combine with retrieval stages for end-to-end search.
