Cosmos3-Nano
by nvidia
16B omni model with text, image, video, audio, and action generation, plus video reasoner input
nvidia/Cosmos3-Nano
mixpeek://video_extractor@v1/nvidia_cosmos3_nano_v1
Overview
Cosmos3-Nano is a compact member of NVIDIA's Cosmos3 family. The model card describes generator inputs across text, image, video with or without audio, and action trajectory, plus a reasoner path that accepts text, text plus image, and text plus video, then returns text. That makes it relevant to agent perception work where a system needs to inspect or reason over a short video candidate.
On Mixpeek, Cosmos3-Nano is most useful after retrieval has selected a small set of clips. Store timeline metadata and keyframe embeddings first, then run a video reasoning pass to extract events, object interactions, or natural-language answers tied back to the source clip.
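The retrieve-then-reason flow described above can be sketched as a small selection step that runs before any reasoning call. This is a minimal illustration, not Mixpeek's API: the `ClipCandidate` shape, the `selectForReasoning` helper, and the score threshold are all hypothetical names chosen for the example.

```typescript
// Hypothetical shape of a retrieval hit; field names are assumptions,
// not a Mixpeek response schema.
interface ClipCandidate {
  clipId: string;
  startSec: number;
  endSec: number;
  score: number; // retrieval similarity, higher is better
}

// Keep only the k best-scoring clips above a minimum score so the
// more expensive video reasoning pass runs on a small candidate set.
function selectForReasoning(
  candidates: ClipCandidate[],
  k: number,
  minScore: number = 0
): ClipCandidate[] {
  return candidates
    .filter((c) => c.score >= minScore) // filter() copies, so the input is not mutated
    .sort((a, b) => b.score - a.score)  // descending by score
    .slice(0, k);
}

const hits: ClipCandidate[] = [
  { clipId: "a", startSec: 0, endSec: 8, score: 0.2 },
  { clipId: "b", startSec: 40, endSec: 52, score: 0.9 },
  { clipId: "c", startSec: 90, endSec: 97, score: 0.5 },
];
const shortlist = selectForReasoning(hits, 2, 0.3);
console.log(shortlist.map((c) => c.clipId));
```

Each shortlisted clip keeps its `startSec` and `endSec`, so whatever the reasoner returns can be tied back to the source timeline.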
Architecture
Cosmos3-Nano is an omni model that exposes generator and reasoner interfaces. The reasoner accepts text, text plus image, or text plus video as input and returns text. The model card recommends sampling video reasoner input at around 4 fps and supports long-context inputs of up to 256K tokens.
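To make the 4 fps and 256K-token figures concrete, the sketch below computes frame timestamps for a clip and checks a rough token budget. Both helpers are hypothetical. The per-frame token cost is not stated in the text above, so `tokensPerFrame` is a caller-supplied assumption, and the default limit of 256000 is a conservative reading of "256K".

```typescript
// Hypothetical helper: timestamps (in seconds) of frames sampled at a fixed rate.
function sampleTimestamps(durationSec: number, fps: number = 4): number[] {
  if (durationSec < 0 || fps <= 0) {
    throw new RangeError("durationSec must be >= 0 and fps must be > 0");
  }
  const count = Math.floor(durationSec * fps);
  const timestamps: number[] = [];
  for (let i = 0; i < count; i++) {
    timestamps.push(i / fps);
  }
  return timestamps;
}

// Hypothetical budget check. tokensPerFrame is an assumed placeholder,
// not a figure from the model card.
function fitsContext(
  frameCount: number,
  tokensPerFrame: number,
  promptTokens: number,
  contextLimit: number = 256000
): boolean {
  return frameCount * tokensPerFrame + promptTokens <= contextLimit;
}

const frames = sampleTimestamps(2); // 2-second clip at the default 4 fps
console.log(frames.length, frames[frames.length - 1]);
console.log(fitsContext(frames.length, 256, 100));
```

A check like this is most useful as a guard before submitting long clips, where frame count grows linearly with duration.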
Mixpeek SDK Integration
import { Mixpeek } from "mixpeek";

const mx = new Mixpeek({ apiKey: "API_KEY" });

await mx.collections.ingest({
  collection_id: "video-inspection",
  source: { url: "s3://media/clips/" },
  feature_extractors: [
    {
      feature: "scene_caption",
      model: "nvidia/Cosmos3-Nano",
      params: {
        frame_rate: 4,
        output_schema: ["event_summary", "visible_objects", "uncertainty"]
      }
    }
  ]
});
Capabilities
- Video reasoner input for short retrieved clips
- Text and image conditioning for multimodal inspection
- Video, audio, and action generation interfaces for simulation workflows
- Long-context text handling around video evidence
Use Cases on Mixpeek
Performance
Run the model on retrieved clips or sampled windows rather than on every raw frame.
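One way to produce sampled windows is a fixed-length window advanced by a stride. The sketch below is an illustration under that assumption; `TimeWindow` and `sampledWindows` are hypothetical names, and the window and stride values are arbitrary.

```typescript
interface TimeWindow {
  startSec: number;
  endSec: number;
}

// Hypothetical helper: cut a video into windows of windowSec, advancing by
// strideSec. A stride larger than the window skips footage between windows;
// a stride equal to the window covers the video without overlap.
function sampledWindows(
  durationSec: number,
  windowSec: number,
  strideSec: number
): TimeWindow[] {
  if (durationSec < 0 || windowSec <= 0 || strideSec <= 0) {
    throw new RangeError("durationSec must be >= 0; windowSec and strideSec must be > 0");
  }
  const windows: TimeWindow[] = [];
  for (let start = 0; start < durationSec; start += strideSec) {
    // Clamp the final window to the end of the video.
    windows.push({ startSec: start, endSec: Math.min(start + windowSec, durationSec) });
  }
  return windows;
}

// 60-second video, 10-second windows every 30 seconds.
console.log(sampledWindows(60, 10, 30));
```

For the 60-second example, two 10-second windows at 4 fps amount to 80 frames, compared with 240 frames for the full clip at the same rate.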
Specification
Research Paper
Cosmos3-Nano model card
arxiv.org
Build a pipeline with Cosmos3-Nano
Add this model to a processing pipeline alongside other extractors. Combine with retrieval stages for end-to-end search.