
    4D-RGPT-8B

    by nvidia

    8B video model for region-grounded 3D and 4D reasoning

    Downloads: 108/month
    Likes: 13
    Parameters: 8B

    Identifiers

    Model ID: nvidia/4D-RGPT-8B
    Feature URI: mixpeek://video_extractor@v1/nvidia_4d_rgpt_8b_v1
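
    The Feature URI above looks like a scheme, an extractor name with a version, and a feature name. That reading is an assumption based only on the string's shape, not on documented Mixpeek behavior. A minimal sketch that splits the URI with the Python standard library (the helper name and the field names are hypothetical):

```python
from urllib.parse import urlsplit


def split_feature_uri(uri: str) -> dict:
    """Split a feature URI into its visible parts.

    Assumed layout (not confirmed by the source):
    scheme://extractor@version/feature
    """
    parts = urlsplit(uri)
    # The authority component holds "extractor@version".
    extractor, _, version = parts.netloc.partition("@")
    return {
        "scheme": parts.scheme,
        "extractor": extractor,
        "version": version,
        "feature": parts.path.lstrip("/"),
    }


info = split_feature_uri("mixpeek://video_extractor@v1/nvidia_4d_rgpt_8b_v1")
print(info)
```

    A helper like this can be handy for logging or validating pipeline configs before submitting them.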

    Overview

    4D-RGPT-8B is an NVIDIA video-text model focused on region grounding, 3D reasoning, and 4D reasoning. These capabilities matter when an agent needs more than a clip-level summary: it needs to know which region changed, where the object moved, and how the event evolved over time.

    On Mixpeek, 4D-RGPT can enrich video indexes with region-grounded temporal evidence. It is a fit for robotics footage, surveillance review, sports clips, and operational video where the retrieval result must preserve spatial and temporal context.

    Architecture

    4D-RGPT-8B is a video-text-to-text model based on NVILA-Lite-8B. The Hugging Face metadata tags it for video understanding, region grounding, 3D reasoning, 4D reasoning, and perceptual distillation.

    Mixpeek SDK Integration

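    from mixpeek import Mixpeek

    # Create a client; replace the placeholder with a real API key.
    mixpeek = Mixpeek(api_key="YOUR_API_KEY")

    # Ingest videos from S3 and caption them with 4D-RGPT-8B.
    mixpeek.ingest.videos(
        collection="operations_video",
        source={"type": "s3", "bucket": "ops-footage"},
        pipeline={
            "captioning": {
                # Feature URI from the Identifiers section.
                "model": "mixpeek://video_extractor@v1/nvidia_4d_rgpt_8b_v1"
            }
        },
    )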

    Capabilities

    • Region-grounded video understanding
    • 3D and 4D reasoning over spatial-temporal evidence
    • Video-text-to-text analysis for agent perception loops
    • Designed for grounding objects and events through time

    Use Cases on Mixpeek

    • Retrieve clips where an object moves through a specific region
    • Audit robotics or physical-world agent observations
    • Build evidence bundles for security and operations review
    • Search sports or live-event video with spatial-temporal constraints
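
    To make the first use case concrete, here is a minimal sketch of filtering region-grounded records by a spatial region and a time window. The `RegionEvent` record shape, the normalized box format, and the function names are all hypothetical. They illustrate the idea only and are not the Mixpeek API or the model's actual output format.

```python
from dataclasses import dataclass
from typing import List, Tuple

# Hypothetical box format: (x_min, y_min, x_max, y_max), normalized to [0, 1].
Box = Tuple[float, float, float, float]


@dataclass(frozen=True)
class RegionEvent:
    """Hypothetical region-grounded record for one observation in a clip."""
    clip_id: str
    t_start: float  # seconds from clip start
    t_end: float
    box: Box
    caption: str


def boxes_overlap(a: Box, b: Box) -> bool:
    """True when two axis-aligned boxes share a non-empty area."""
    return a[0] < b[2] and b[0] < a[2] and a[1] < b[3] and b[1] < a[3]


def clips_through_region(
    events: List[RegionEvent], region: Box, t_from: float, t_to: float
) -> List[str]:
    """Return clip IDs with an event overlapping `region` during the window."""
    hits: List[str] = []
    for e in events:
        in_window = e.t_start < t_to and e.t_end > t_from
        if in_window and boxes_overlap(e.box, region) and e.clip_id not in hits:
            hits.append(e.clip_id)
    return hits


events = [
    RegionEvent("a", 0.0, 5.0, (0.1, 0.1, 0.3, 0.3), "forklift enters bay"),
    RegionEvent("b", 2.0, 4.0, (0.6, 0.6, 0.9, 0.9), "pallet is lowered"),
    RegionEvent("a", 6.0, 8.0, (0.2, 0.2, 0.4, 0.4), "forklift exits bay"),
]
print(clips_through_region(events, (0.0, 0.0, 0.5, 0.5), 0.0, 10.0))
```

    In practice a retrieval stage would apply this kind of constraint server-side; the sketch only shows what "spatial-temporal constraint" means at the record level.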

    Performance

    Input Size: Variable
    GPU Latency: Input dependent
    GPU Throughput: Video length dependent
    GPU Memory: ~18 GB

    Region-grounded video reasoning cost depends heavily on clip length and frame sampling.
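
    As a rough illustration of why clip length and frame sampling dominate cost, the sketch below counts how many frames would be sent to the model under uniform sampling. The uniform-sampling scheme, the optional frame cap, and the function name are assumptions for illustration. The source does not state how 4D-RGPT-8B actually samples frames.

```python
import math
from typing import Optional


def sampled_frame_count(
    clip_seconds: float, sample_fps: float, max_frames: Optional[int] = None
) -> int:
    """Number of frames processed under uniform sampling (assumed scheme)."""
    if clip_seconds < 0 or sample_fps <= 0:
        raise ValueError("clip_seconds must be >= 0 and sample_fps must be > 0")
    frames = math.ceil(clip_seconds * sample_fps)
    # An optional cap models pipelines that bound frames per request.
    if max_frames is not None:
        frames = min(frames, max_frames)
    return frames


# A 10-minute clip at 2 sampled frames per second, with and without a cap.
print(sampled_frame_count(600, 2))
print(sampled_frame_count(600, 2, max_frames=256))
```

    Under these assumptions the frame count grows linearly with both clip length and sampling rate, so halving either one roughly halves the visual input the model has to process.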

    Specification

    Framework: HF
    Organization: nvidia
    Feature: Scene Captioning
    Output: text
    Modalities: video, image
    Retriever: Semantic Search
    Parameters: 8B
    License: CC-BY-NC-4.0
    Downloads/mo: 108
    Likes: 13

    Research Paper

    4D-RGPT

    arxiv.org

    Build a pipeline with 4D-RGPT-8B

    Add this model to a processing pipeline alongside other extractors. Combine with retrieval stages for end-to-end search.
