4D-RGPT-8B
by nvidia
8B video model for region-grounded 3D and 4D reasoning
nvidia/4D-RGPT-8B
mixpeek://video_extractor@v1/nvidia_4d_rgpt_8b_v1
Overview
4D-RGPT-8B is an NVIDIA video-text model focused on region grounding, 3D reasoning, and 4D reasoning. These capabilities matter when an agent needs more than a clip-level summary: it must know which region changed, where the object moved, and how the event evolved over time.
On Mixpeek, 4D-RGPT can enrich video indexes with region-grounded temporal evidence. It is a fit for robotics footage, surveillance review, sports clips, and operational video where the retrieval result must preserve spatial and temporal context.
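To make "spatial and temporal context" concrete, here is a minimal sketch of how a retrieval result could carry region-grounded evidence through time. The class names, fields, and normalized box convention are hypothetical and chosen only for illustration; they are not the output schema of 4D-RGPT-8B or of Mixpeek.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RegionObservation:
    """One sighting of a region at a point in time (hypothetical schema)."""
    t_seconds: float
    # Assumed convention: (x_min, y_min, x_max, y_max), normalized to [0, 1].
    box: tuple[float, float, float, float]

    def center(self) -> tuple[float, float]:
        x_min, y_min, x_max, y_max = self.box
        return ((x_min + x_max) / 2, (y_min + y_max) / 2)


@dataclass
class RegionTrack:
    """A labeled region followed across a clip (hypothetical schema)."""
    label: str
    observations: list[RegionObservation]

    def _ordered(self) -> list[RegionObservation]:
        if not self.observations:
            raise ValueError("track has no observations")
        return sorted(self.observations, key=lambda o: o.t_seconds)

    def time_span(self) -> tuple[float, float]:
        """First and last timestamp at which the region was observed."""
        ordered = self._ordered()
        return (ordered[0].t_seconds, ordered[-1].t_seconds)

    def displacement(self) -> tuple[float, float]:
        """Movement of the box center between first and last observation."""
        ordered = self._ordered()
        x0, y0 = ordered[0].center()
        x1, y1 = ordered[-1].center()
        return (x1 - x0, y1 - y0)


track = RegionTrack(
    label="forklift",
    observations=[
        RegionObservation(t_seconds=3.0, box=(0.5, 0.1, 0.7, 0.3)),
        RegionObservation(t_seconds=1.0, box=(0.1, 0.1, 0.3, 0.3)),
    ],
)
print(track.time_span(), track.displacement())
```

A record shaped like this answers the three questions in the overview: which region (the label and box), where it moved (the displacement), and when (the time span).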
Architecture
4D-RGPT-8B is a video-text-to-text model based on NVILA-Lite-8B. The Hugging Face metadata tags it for video understanding, region grounding, 3D reasoning, 4D reasoning, and perceptual distillation.
Mixpeek SDK Integration
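from mixpeek import Mixpeek

mixpeek = Mixpeek(api_key="YOUR_API_KEY")

# Ingest videos from the S3 bucket into the "operations_video" collection,
# using 4D-RGPT-8B as the captioning model.
mixpeek.ingest.videos(
    collection="operations_video",
    source={"type": "s3", "bucket": "ops-footage"},
    pipeline={
        "captioning": {
            "model": "mixpeek://video_extractor@v1/nvidia_4d_rgpt_8b_v1"
        }
    },
)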
Capabilities
- Region-grounded video understanding
- 3D and 4D reasoning over spatial-temporal evidence
- Video-text-to-text analysis for agent perception loops
- Designed for grounding objects and events through time
Use Cases on Mixpeek
Performance
Region-grounded video reasoning cost depends heavily on clip length and frame sampling.
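As a rough illustration of why clip length and frame sampling dominate cost, the sketch below counts sampled frames under uniform sampling and multiplies by a per-frame token budget. The function names are hypothetical, and the 256 tokens per frame used in the example is an assumed placeholder, not a published figure for 4D-RGPT-8B.

```python
import math


def sampled_frame_count(clip_seconds: float, sample_fps: float) -> int:
    """Frames produced by sampling a clip uniformly at sample_fps."""
    if clip_seconds < 0 or sample_fps <= 0:
        raise ValueError("clip_seconds must be >= 0 and sample_fps > 0")
    return math.ceil(clip_seconds * sample_fps)


def estimate_visual_tokens(
    clip_seconds: float, sample_fps: float, tokens_per_frame: int
) -> int:
    """Approximate visual token count; tokens_per_frame is an assumption."""
    return sampled_frame_count(clip_seconds, sample_fps) * tokens_per_frame


# Assumed budget of 256 tokens per frame, for illustration only.
for fps in (1, 4):
    print(fps, estimate_visual_tokens(60, fps, tokens_per_frame=256))
```

Under these assumptions, a 60-second clip yields 15,360 visual tokens at 1 frame per second and 61,440 at 4 frames per second, so the estimate scales linearly with both clip length and sampling rate.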
Specification
Research Paper
4D-RGPT
arxiv.org
Build a pipeline with 4D-RGPT-8B
Add this model to a processing pipeline alongside other extractors. Combine with retrieval stages for end-to-end search.