    Tags: Cosmos, Scene Captioning, Other

    Cosmos3-Nano

    by nvidia

    16B omni model with text, image, video, audio, action generation, and video reasoner input

    36.7K downloads/month
    16B parameters
    Identifiers

    Model ID: nvidia/Cosmos3-Nano
    Feature URI: mixpeek://video_extractor@v1/nvidia_cosmos3_nano_v1

    Overview

    Cosmos3-Nano is a compact member of NVIDIA's Cosmos3 family. The model card describes generator inputs across text, image, video with or without audio, and action trajectory, plus a reasoner path that accepts text, text plus image, and text plus video, then returns text. That makes it relevant to agent perception work where a system needs to inspect or reason over a short video candidate.

    On Mixpeek, Cosmos3-Nano is most useful after retrieval has selected a small set of clips. Store timeline metadata and keyframe embeddings first, then run a video reasoning pass to extract events, object interactions, or natural-language answers tied back to the source clip.
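The retrieve-then-reason flow described above can be sketched in a few lines. This is a minimal illustration, not Mixpeek's implementation: the `ClipCandidate` shape, the `Reasoner` callback, and both function names are hypothetical, and the reasoner call is left as a placeholder you would wire to your own client.

```typescript
// Hypothetical shapes for illustration; none of these come from the Mixpeek SDK.
interface ClipCandidate {
  clipId: string;
  startSec: number;
  endSec: number;
  score: number; // retrieval similarity, higher is better
}

interface ReasonedClip extends ClipCandidate {
  answer: string;
}

// Placeholder for whatever call actually runs the video reasoner on one clip.
type Reasoner = (clip: ClipCandidate, question: string) => Promise<string>;

// Keep only the k best-scoring clips so the expensive pass stays small.
function selectTopK(candidates: ClipCandidate[], k: number): ClipCandidate[] {
  return [...candidates]
    .sort((a, b) => b.score - a.score)
    .slice(0, Math.max(0, k));
}

// Run the reasoner sequentially over the selected clips and keep each
// answer attached to its source clip metadata.
async function reasonOverTopClips(
  candidates: ClipCandidate[],
  question: string,
  k: number,
  reasoner: Reasoner
): Promise<ReasonedClip[]> {
  const results: ReasonedClip[] = [];
  for (const clip of selectTopK(candidates, k)) {
    const answer = await reasoner(clip, question);
    results.push({ ...clip, answer });
  }
  return results;
}
```

Keeping the clip ID and time range on every result is what lets an answer be traced back to the source clip.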

    Architecture

    Cosmos3-Nano is a Cosmos3 omni model with generator and reasoner interfaces. The reasoner accepts text, text plus image, or text plus video as input and returns text. The model card recommends sampling video reasoner input at about 4 fps and supports long-context inputs of up to 256K tokens.
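To make the 4 fps and 256K-token figures concrete, the sketch below computes sample timestamps for a clip and checks a rough token budget. The function names are hypothetical, and `tokensPerFrame` is an assumed input: the text above does not say how many tokens a frame costs, so measure it for your own setup. Reading "256K" as 256,000 is also an assumption.

```typescript
const REASONER_FPS = 4; // sampling rate recommended for reasoner video input
const MAX_CONTEXT_TOKENS = 256_000; // assumption: "256K" read as 256,000

// Timestamps (in seconds) of frames sampled at a fixed rate from a clip.
function sampleTimestamps(
  startSec: number,
  endSec: number,
  fps: number = REASONER_FPS
): number[] {
  if (fps <= 0 || endSec <= startSec) return [];
  const count = Math.floor((endSec - startSec) * fps);
  const timestamps: number[] = [];
  for (let i = 0; i < count; i++) {
    timestamps.push(startSec + i / fps);
  }
  return timestamps;
}

// Rough budget check. tokensPerFrame is a caller-supplied estimate,
// not a documented property of the model.
function fitsContext(
  frameCount: number,
  tokensPerFrame: number,
  promptTokens: number,
  limit: number = MAX_CONTEXT_TOKENS
): boolean {
  return frameCount * tokensPerFrame + promptTokens <= limit;
}
```

For example, a 2-second clip yields 8 frames at 4 fps, so clip length is the main lever for staying inside the context window.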

    Mixpeek SDK Integration

    import { Mixpeek } from "mixpeek";

    const mx = new Mixpeek({ apiKey: "API_KEY" });

    await mx.collections.ingest({
      collection_id: "video-inspection",
      source: { url: "s3://media/clips/" },
      feature_extractors: [{
        feature: "scene_caption",
        model: "nvidia/Cosmos3-Nano",
        params: {
          frame_rate: 4,
          output_schema: ["event_summary", "visible_objects", "uncertainty"]
        }
      }]
    });
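Downstream code usually wants to validate whatever comes back for the three `output_schema` fields before storing it. The sketch below is one way to do that. It assumes, without confirmation from the text above, that results arrive as a JSON object with a string `event_summary`, a string array `visible_objects`, and a free-text `uncertainty`. The `SceneCaption` type and `parseSceneCaption` function are hypothetical and not part of the Mixpeek SDK.

```typescript
// Assumed result shape; the field types are guesses based on the field names.
interface SceneCaption {
  event_summary: string;
  visible_objects: string[];
  uncertainty: string;
}

// Returns a typed caption, or null if any field is missing or mistyped.
function parseSceneCaption(raw: unknown): SceneCaption | null {
  if (typeof raw !== "object" || raw === null) return null;
  const record = raw as Record<string, unknown>;

  const summary = record["event_summary"];
  const objects = record["visible_objects"];
  const uncertainty = record["uncertainty"];

  if (typeof summary !== "string") return null;
  if (typeof uncertainty !== "string") return null;
  if (!Array.isArray(objects)) return null;
  if (!objects.every((o) => typeof o === "string")) return null;

  return {
    event_summary: summary,
    visible_objects: objects as string[],
    uncertainty,
  };
}
```

Rejecting malformed results up front keeps partially filled captions out of the index.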

    Capabilities

    • Video reasoner input for short retrieved clips
    • Text and image conditioning for multimodal inspection
    • Video, audio, and action generation interfaces for simulation workflows
    • Long-context text handling around video evidence

    Use Cases on Mixpeek

    • Ask an agent to explain what happens in a retrieved video clip
    • Extract event descriptions from short candidate clips before reranking
    • Build visual inspection tools that reason over video, not only single frames
    • Prototype synthetic video or action data around agent perception evals

    Performance

    Input Size: Text, image, video, optional audio, and action inputs
    GPU Latency: Video length and output length dependent
    GPU Throughput: Batch dependent
    GPU Memory: 16B omni deployment class

    Use on retrieved clips or sampled windows rather than every raw frame
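When a video is too long to reason over in one pass, sampled windows can be produced with a simple sliding scheme. This is a minimal sketch: the `ClipWindow` type, the function name, and the window and stride lengths are all illustrative choices, not values recommended by the text above.

```typescript
// Hypothetical window type for illustration.
interface ClipWindow {
  startSec: number;
  endSec: number;
}

// Split a video of durationSec into windows of windowSec seconds, advancing
// by strideSec each step. The last window is clamped to the video end.
function slidingWindows(
  durationSec: number,
  windowSec: number,
  strideSec: number
): ClipWindow[] {
  if (durationSec <= 0 || windowSec <= 0 || strideSec <= 0) return [];
  const windows: ClipWindow[] = [];
  for (let start = 0; start < durationSec; start += strideSec) {
    const end = Math.min(start + windowSec, durationSec);
    windows.push({ startSec: start, endSec: end });
    if (start + windowSec >= durationSec) break; // video end reached
  }
  return windows;
}
```

A stride shorter than the window gives overlapping windows, which reduces the chance of an event being split across a boundary at the cost of more reasoning calls.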

    Specification

    Framework: Cosmos
    Organization: nvidia
    Feature: Scene Captioning
    Output: text
    Modalities: video, image
    Retriever: Semantic Search
    Parameters: 16B
    License: Other
    Downloads/mo: 36.7K

    Research Paper

    Cosmos3-Nano model card

    arxiv.org

    Build a pipeline with Cosmos3-Nano

    Add this model to a processing pipeline alongside other extractors. Combine with retrieval stages for end-to-end search.
