    HF, Scene Captioning, CC-BY-NC-4.0

    4D-RGPT-8B

    by nvidia

    8B video model for region-grounded 3D and 4D reasoning

    108 downloads/month
    13 likes
    8B params
    Identifiers
    Model ID
    nvidia/4D-RGPT-8B
    Feature URI
    mixpeek://video_extractor@v1/nvidia_4d_rgpt_8b_v1
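
The Feature URI above appears to pack an extractor name, a version, and a feature name into one string. A minimal sketch of splitting it apart follows. The `scheme://extractor@version/feature` layout is an assumption inferred from this single example, not from a published specification, and `parseFeatureUri` and `FeatureUri` are hypothetical names that are not part of the Mixpeek SDK.

```typescript
// Hypothetical helper, not part of the Mixpeek SDK.
// Assumed layout (inferred from one example): scheme://extractor@version/feature
interface FeatureUri {
  scheme: string;
  extractor: string;
  version: string;
  feature: string;
}

function parseFeatureUri(uri: string): FeatureUri {
  // Capture groups: 1 = scheme, 2 = extractor, 3 = version, 4 = feature
  const match = /^([a-z][a-z0-9+.-]*):\/\/([^@\/]+)@([^\/]+)\/(.+)$/i.exec(uri);
  if (match === null) {
    throw new Error(`Unrecognized feature URI: ${uri}`);
  }
  return {
    scheme: match[1],
    extractor: match[2],
    version: match[3],
    feature: match[4],
  };
}

const parsedUri = parseFeatureUri("mixpeek://video_extractor@v1/nvidia_4d_rgpt_8b_v1");
console.log(parsedUri.extractor, parsedUri.version, parsedUri.feature);
```

Parsing the URI once and passing the parts around can make it easier to log which extractor version produced a given feature.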

    Overview

    4D-RGPT-8B is an NVIDIA video-text model focused on region grounding, 3D reasoning, and 4D reasoning. These capabilities matter when an agent needs more than a clip-level summary: it must know which region changed, where the object moved, and how the event evolved over time.

    On Mixpeek, 4D-RGPT can enrich video indexes with region-grounded temporal evidence. It suits robotics footage, surveillance review, sports clips, and operational video where the retrieval result must preserve spatial and temporal context.

    Architecture

    4D-RGPT-8B is a video-text-to-text model based on NVILA-Lite-8B. The Hugging Face metadata tags it for video understanding, region grounding, 3D reasoning, 4D reasoning, and perceptual distillation.

    Mixpeek SDK Integration

    import { Mixpeek } from "mixpeek";
    
    const mx = new Mixpeek({ apiKey: "API_KEY" });
    
    // Managed: create a collection over a bucket; Mixpeek runs this model's extractor
    const collection = await mx.collections.create({
      namespace_id: "my-namespace",
      collection_name: "my-collection",
      source: { type: "bucket", bucket_ids: ["bkt_your_bucket"] },
      feature_extractor: {
        feature_extractor_name: "s3",
        version: "v1",
        parameters: { model_id: "mixpeek://video_extractor@v1/nvidia_4d_rgpt_8b_v1" },
      },
    });

    Capabilities

    • Region-grounded video understanding
    • 3D and 4D reasoning over spatial-temporal evidence
    • Video-text-to-text analysis for agent perception loops
    • Designed for grounding objects and events through time

    Use Cases on Mixpeek

    • Retrieve clips where an object moves through a specific region
    • Audit robotics or physical-world agent observations
    • Build evidence bundles for security and operations review
    • Search sports or live-event video with spatial-temporal constraints
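
As an illustration of the first use case, the sketch below filters clips in which a labeled object starts outside a query region and later overlaps it. It assumes the model's text output has already been parsed into per-frame boxes. The `Observation` and `Box` shapes, the normalized coordinates, and the function names are all hypothetical and are not a Mixpeek or 4D-RGPT output format.

```typescript
// Hypothetical shapes: boxes use normalized coordinates in the range 0..1.
interface Box { x: number; y: number; width: number; height: number; }
interface Observation { clipId: string; timeSec: number; label: string; box: Box; }

// Axis-aligned rectangle overlap test.
function overlaps(a: Box, b: Box): boolean {
  return a.x < b.x + b.width && b.x < a.x + a.width &&
         a.y < b.y + b.height && b.y < a.y + a.height;
}

// Returns ids of clips where `label` is first seen outside `region`
// and at a later timestamp overlaps it.
function clipsWhereObjectEntersRegion(
  observations: Observation[],
  label: string,
  region: Box,
): string[] {
  // Group matching observations by clip.
  const byClip: Record<string, Observation[]> = {};
  for (const obs of observations) {
    if (obs.label !== label) continue;
    if (byClip[obs.clipId] === undefined) byClip[obs.clipId] = [];
    byClip[obs.clipId].push(obs);
  }

  const hits: string[] = [];
  for (const clipId of Object.keys(byClip)) {
    const ordered = byClip[clipId].slice().sort((a, b) => a.timeSec - b.timeSec);
    let seenOutside = false;
    for (const obs of ordered) {
      if (!overlaps(obs.box, region)) {
        seenOutside = true;
      } else if (seenOutside) {
        hits.push(clipId);
        break;
      }
    }
  }
  return hits;
}

// Illustrative data only.
const loadingDock: Box = { x: 0.5, y: 0.5, width: 0.4, height: 0.4 };
const sampleObservations: Observation[] = [
  { clipId: "clip_a", timeSec: 0, label: "forklift", box: { x: 0.1, y: 0.1, width: 0.1, height: 0.1 } },
  { clipId: "clip_a", timeSec: 2, label: "forklift", box: { x: 0.55, y: 0.55, width: 0.1, height: 0.1 } },
  { clipId: "clip_b", timeSec: 0, label: "forklift", box: { x: 0.6, y: 0.6, width: 0.1, height: 0.1 } },
  { clipId: "clip_b", timeSec: 1, label: "forklift", box: { x: 0.62, y: 0.6, width: 0.1, height: 0.1 } },
  { clipId: "clip_c", timeSec: 0, label: "person", box: { x: 0.1, y: 0.1, width: 0.1, height: 0.1 } },
  { clipId: "clip_c", timeSec: 1, label: "person", box: { x: 0.6, y: 0.6, width: 0.1, height: 0.1 } },
];
const enteringClips = clipsWhereObjectEntersRegion(sampleObservations, "forklift", loadingDock);
console.log(enteringClips);
```

The "outside first, then inside" rule is one possible reading of "moves through"; a stricter variant could also require the object to leave the region again.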

    Performance

    Input Size: Variable
    GPU Latency: Input dependent
    GPU Throughput: Video length dependent
    GPU Memory: ~18 GB

    Region-grounded video reasoning cost depends heavily on clip length and frame sampling.
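
To make the dependence on clip length and frame sampling concrete, here is a rough sketch that estimates how many frames a clip contributes at a given sampling rate. The uniform-sampling rule, the optional frame cap, and the name `estimateSampledFrames` are assumptions for illustration; they do not describe the sampler 4D-RGPT or Mixpeek actually uses.

```typescript
// Hypothetical estimate: uniform sampling at `sampleFps`, optionally capped.
// This is not the model's documented sampling strategy.
function estimateSampledFrames(
  durationSec: number,
  sampleFps: number,
  maxFrames: number = Infinity,
): number {
  if (durationSec <= 0 || sampleFps <= 0) {
    throw new Error("durationSec and sampleFps must be positive");
  }
  // At least one frame is sampled, even for very short clips.
  const uncapped = Math.max(1, Math.ceil(durationSec * sampleFps));
  return Math.min(uncapped, maxFrames);
}

// A 30 s clip at 2 frames per second versus a 10 minute clip at the same rate.
console.log(estimateSampledFrames(30, 2));
console.log(estimateSampledFrames(600, 2));
// The same 10 minute clip with an assumed cap of 256 frames.
console.log(estimateSampledFrames(600, 2, 256));
```

Under these assumptions the frame count grows linearly with clip length until the cap applies, which is why shorter clips or a lower sampling rate tend to be cheaper to process.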

    Specification

    Framework: HF
    Organization: nvidia
    Feature: Scene Captioning
    Output: text
    Modalities: video, image
    Retriever: Semantic Search
    Parameters: 8B
    License: CC-BY-NC-4.0
    Downloads/mo: 108
    Likes: 13

    Research Paper

    4D-RGPT

    arxiv.org

    Build a pipeline with 4D-RGPT-8B

    Add this model to a processing pipeline alongside other extractors. Combine with retrieval stages for end-to-end search.
