Cosmos3-Nano

by nvidia

16B omni model with text, image, video, audio, action generation, and video reasoner input

36.7K downloads/month

16B params


Identifiers

Model ID

nvidia/Cosmos3-Nano

Feature URI

mixpeek://video_extractor@v1/nvidia_cosmos3_nano_v1

Overview

Cosmos3-Nano is a compact member of NVIDIA's Cosmos3 family. The model card describes generator inputs across text, image, video with or without audio, and action trajectory, plus a reasoner path that accepts text, text plus image, and text plus video, then returns text. That makes it relevant to agent perception work where a system needs to inspect or reason over a short video candidate.

On Mixpeek, Cosmos3-Nano is most useful after retrieval has selected a small set of clips. Store timeline metadata and keyframe embeddings first, then run a video reasoning pass to extract events, object interactions, or natural-language answers tied back to the source clip.
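The retrieve-then-reason flow described above can be sketched roughly as follows. This is a minimal illustration, not Mixpeek's API: the `ClipCandidate` and `ClipAnswer` shapes, `selectTopClips`, `reasonOverClips`, and the injected `Reasoner` callback are all hypothetical names, and the call that actually runs Cosmos3-Nano is left to the caller.

```typescript
// Hypothetical shapes: a retrieval hit and a reasoning result tied to its source clip.
interface ClipCandidate {
  clipId: string;
  startSec: number;
  endSec: number;
  score: number; // retrieval similarity, higher is better
}

interface ClipAnswer {
  clipId: string;
  startSec: number;
  endSec: number;
  answer: string;
}

// Assumption: the caller supplies whatever function invokes the video reasoner.
type Reasoner = (clip: ClipCandidate, question: string) => Promise<string>;

// Keep only the k best-scoring clips so the expensive reasoning pass stays small.
function selectTopClips(candidates: ClipCandidate[], k: number): ClipCandidate[] {
  return [...candidates]
    .sort((a, b) => b.score - a.score)
    .slice(0, Math.max(0, k));
}

// Run the reasoner over the selected clips and keep each answer linked to its clip.
async function reasonOverClips(
  candidates: ClipCandidate[],
  question: string,
  reason: Reasoner,
  k = 3,
): Promise<ClipAnswer[]> {
  const answers: ClipAnswer[] = [];
  for (const clip of selectTopClips(candidates, k)) {
    const answer = await reason(clip, question);
    answers.push({
      clipId: clip.clipId,
      startSec: clip.startSec,
      endSec: clip.endSec,
      answer,
    });
  }
  return answers;
}
```

Carrying the clip ID and time range through to each answer is what lets a downstream consumer jump back to the source footage. The clips are processed sequentially here for simplicity; a real pipeline might batch or parallelize them.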

Architecture

Cosmos3 omni model with generator and reasoner interfaces. The reasoner supports text, text plus image, and text plus video input with text output. The model card recommends sampling video reasoner input at around 4 fps and lists support for long-context inputs up to 256K tokens.
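To make the 4 fps and 256K-token figures concrete, here is a small sketch of how a caller might plan frame sampling for a clip and sanity-check it against the context limit. The helper names are hypothetical, `tokensPerFrame` is an assumed input rather than a figure from the model card, and 256K is treated here as 256,000 tokens, which is also an assumption.

```typescript
// Timestamps (in seconds) at which to sample frames, starting at 0.
// The default of 4 fps follows the model card's recommendation for reasoner input.
function sampleTimestamps(durationSec: number, fps = 4): number[] {
  if (durationSec <= 0 || fps <= 0) return [];
  const count = Math.floor(durationSec * fps);
  const timestamps: number[] = [];
  for (let i = 0; i < count; i++) {
    timestamps.push(i / fps);
  }
  return timestamps;
}

// Rough budget check. tokensPerFrame is a caller-supplied assumption;
// contextLimit defaults to 256,000 as an assumed reading of "256K".
function fitsContext(
  frameCount: number,
  tokensPerFrame: number,
  promptTokens: number,
  contextLimit = 256_000,
): boolean {
  return frameCount * tokensPerFrame + promptTokens <= contextLimit;
}

// Usage: a 2-second clip at 4 fps yields 8 timestamps, 0 through 1.75.
const frames = sampleTimestamps(2);
console.log(frames.length, fitsContext(frames.length, 256, 100));
```

If the check fails for a long clip, the usual options are to shorten the clip window or lower the sampling rate before sending it to the reasoner.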

Mixpeek SDK Integration

import { Mixpeek } from "mixpeek";

const mx = new Mixpeek({ apiKey: "API_KEY" });

// Managed: create a collection over a bucket; Mixpeek runs this model's extractor
const collection = await mx.collections.create({
  namespace_id: "my-namespace",
  collection_name: "my-collection",
  source: { type: "bucket", bucket_ids: ["bkt_your_bucket"] },
  feature_extractor: {
    feature_extractor_name: "scene_caption",
    version: "v1",
    parameters: { model_id: "nvidia/Cosmos3-Nano" },
  },
});