    HF · Scene Captioning · NVIDIA Open Model License

    Nemotron-3-Nano-Omni-30B-A3B-Reasoning-BF16

    by nvidia

    Omnimodal VLM that processes text, images, video, and audio with only 3B active parameters

    263K downloads/month
    31B total / 3B active params
    Identifiers
    Model ID
    nvidia/Nemotron-3-Nano-Omni-30B-A3B-Reasoning-BF16
    Feature URI
    mixpeek://image_extractor@v1/nvidia_nemotron3_nano_omni_v1
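The feature URI above appears to follow an `extractor@version/feature` layout. The sketch below splits it into those parts. The layout is inferred from this single example rather than from a published specification, and `parseFeatureUri` and `FeatureUri` are hypothetical names introduced here for illustration.

```typescript
// Hypothetical helper: the "mixpeek://extractor@version/feature" layout is
// inferred from the one URI shown on this page, not from a published spec.
interface FeatureUri {
  extractor: string;
  version: string;
  feature: string;
}

function parseFeatureUri(uri: string): FeatureUri | null {
  // Capture groups: extractor name, version tag, feature name.
  const match = /^mixpeek:\/\/([^@\/]+)@([^\/]+)\/(.+)$/.exec(uri);
  if (match === null) {
    return null; // Not in the assumed layout.
  }
  return { extractor: match[1], version: match[2], feature: match[3] };
}

const parsedUri = parseFeatureUri(
  "mixpeek://image_extractor@v1/nvidia_nemotron3_nano_omni_v1"
);
console.log(parsedUri);
```

Returning `null` for anything outside the assumed layout keeps callers from silently acting on a partially parsed identifier.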

    Overview

    Nemotron-3-Nano-Omni is NVIDIA's Mixture-of-Experts model that unifies vision, audio, and language understanding in a single architecture. With 31B total parameters but only 3B active per token, it delivers omnimodal perception at a fraction of the compute cost of dense models, with up to 9x the throughput of comparable open alternatives.
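To put the parameter counts in proportion, the short sketch below computes the share of parameters active per token from the two figures quoted above. The function name `activeFraction` is introduced here for illustration. The result describes parameter counts only and should not be read as a measured compute or latency ratio.

```typescript
// Parameter counts quoted in the overview, in billions.
const TOTAL_PARAMS_B = 31;
const ACTIVE_PARAMS_B = 3;

// Share of parameters that take part in processing a single token.
function activeFraction(totalB: number, activeB: number): number {
  if (totalB <= 0 || activeB < 0 || activeB > totalB) {
    throw new RangeError("expected 0 <= active <= total and total > 0");
  }
  return activeB / totalB;
}

const activeShare = activeFraction(TOTAL_PARAMS_B, ACTIVE_PARAMS_B);
console.log(`${(activeShare * 100).toFixed(1)}% of parameters active per token`);
```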

    On Mixpeek, Nemotron-3-Nano-Omni serves as a universal perception backbone: a single model call extracts understanding from video (up to 2 minutes), audio (up to 1 hour), images, and text. This eliminates the need for separate caption, transcription, and analysis models in complex pipelines.
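The input limits above (2 minutes of video, 1 hour of audio) can be checked before submitting media. The following is a minimal sketch, not platform behavior: the names `fitsSingleCall` and `callsNeeded` are hypothetical, and splitting longer media into limit-sized chunks is an assumption, since this page does not say how over-length inputs are handled.

```typescript
type MediaKind = "video" | "audio" | "image" | "text";

// Duration limits quoted in the text above, in seconds. No duration limit is
// stated for images or text, so they are treated as always accepted here.
const MAX_DURATION_SECONDS: Partial<Record<MediaKind, number>> = {
  video: 2 * 60,
  audio: 60 * 60,
};

// True when the item fits within the stated limit for its media kind.
function fitsSingleCall(kind: MediaKind, durationSeconds = 0): boolean {
  const limit = MAX_DURATION_SECONDS[kind];
  return limit === undefined || durationSeconds <= limit;
}

// Assumption: longer media is split into limit-sized chunks, one call each.
function callsNeeded(kind: MediaKind, durationSeconds: number): number {
  const limit = MAX_DURATION_SECONDS[kind];
  if (limit === undefined || durationSeconds <= 0) {
    return 1;
  }
  return Math.ceil(durationSeconds / limit);
}

console.log(fitsSingleCall("video", 90));
console.log(callsNeeded("video", 300));
```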

    Architecture

    Mamba2-Transformer hybrid MoE. C-RADIOv4-H vision encoder + Parakeet-TDT-0.6B audio encoder + MoE language decoder. 31B total / ~3B active params per token. 256K context window. Processes up to 2 minutes of video or 1 hour of audio.

    Mixpeek SDK Integration

    import { Mixpeek } from "mixpeek";

    // Replace "API_KEY" with your own key.
    const mx = new Mixpeek({ apiKey: "API_KEY" });

    // Ingest one video and run scene captioning with this model.
    await mx.collections.ingest({
      collection_id: "media-library",
      source: { url: "https://example.com/meeting-recording.mp4" },
      feature_extractors: [{
        feature: "scene_caption",
        model: "nvidia/Nemotron-3-Nano-Omni-30B-A3B-Reasoning-BF16"
      }]
    });
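When several files need the same extractor configuration, the request objects can be built up front. The sketch below only constructs plain objects in the shape used by the ingest call above. The `IngestRequest` interface and `buildIngestRequests` helper are hypothetical and are not part of the Mixpeek SDK, and the shape has not been checked against the SDK's published types.

```typescript
// Request shape copied from the ingest call shown on this page.
// Assumption: the SDK accepts exactly these fields.
interface IngestRequest {
  collection_id: string;
  source: { url: string };
  feature_extractors: { feature: string; model: string }[];
}

const MODEL_ID = "nvidia/Nemotron-3-Nano-Omni-30B-A3B-Reasoning-BF16";

// Build one request per URL, all using the same scene-caption extractor.
function buildIngestRequests(
  collectionId: string,
  urls: string[]
): IngestRequest[] {
  return urls.map((url) => ({
    collection_id: collectionId,
    source: { url },
    feature_extractors: [{ feature: "scene_caption", model: MODEL_ID }],
  }));
}

const ingestRequests = buildIngestRequests("media-library", [
  "https://example.com/meeting-recording.mp4",
  "https://example.com/product-demo.mp4", // placeholder URL
]);
console.log(JSON.stringify(ingestRequests[0], null, 2));
```

Each object could then be passed to `mx.collections.ingest(...)` as in the example above, provided the SDK accepts this shape.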

    Capabilities

    • Unified vision + audio + text understanding in one model
    • Only 3B active parameters per token (MoE efficiency)
    • 256K context window for long audio and document processing
    • Strong OCR and document understanding (67.04 on OCRBenchV2)
    • Video + audio QA (74.52 on DailyOmni)

    Use Cases on Mixpeek

    • Universal content understanding: caption, transcribe, and analyze in one pass
    • Video Q&A: answer questions about visual and audio content simultaneously
    • Meeting analysis: understand slides, speaker audio, and chat simultaneously
    • Multi-format content indexing: process mixed media without separate models

    Benchmarks

    Dataset                    | Metric   | Score  | Source
    Video MME                  | Accuracy | 72.2%  | NVIDIA, 2026 (arxiv:2604.24954)
    DailyOmni (video+audio QA) | Accuracy | 74.52% | NVIDIA, 2026 (arxiv:2604.24954)
    OCRBenchV2 (EN)            | Accuracy | 67.04  | NVIDIA, 2026 (arxiv:2604.24954)

    Performance

    Input Size: Video (2 min) / Audio (1 hr) / Image / Text
    GPU Latency: ~50ms / item (A100)
    GPU Throughput: ~20 items/sec (A100)
    GPU Memory: ~12 GB (MoE sparse activation)
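The throughput figure above allows a rough estimate of batch processing time. This is back-of-the-envelope arithmetic on the quoted ~20 items/sec, assuming that rate holds steadily on a single A100. Real throughput will depend on input type and length, and `estimateSeconds` is a name introduced here for illustration.

```typescript
// Approximate throughput quoted above for a single A100.
const ITEMS_PER_SECOND = 20;

// Estimated wall-clock seconds to process a batch at a steady rate.
function estimateSeconds(
  itemCount: number,
  itemsPerSecond: number = ITEMS_PER_SECOND
): number {
  if (itemCount < 0 || itemsPerSecond <= 0) {
    throw new RangeError("expected itemCount >= 0 and itemsPerSecond > 0");
  }
  return itemCount / itemsPerSecond;
}

const batchSeconds = estimateSeconds(10_000);
console.log(`~${batchSeconds} s (~${(batchSeconds / 60).toFixed(1)} min) for 10,000 items`);
```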

    Specification

    Framework: HF
    Organization: nvidia
    Feature: Scene Captioning
    Output: text
    Modalities: video, image
    Retriever: Semantic Search
    Parameters: 31B total / 3B active
    License: NVIDIA Open Model License
    Downloads/mo: 263K

    Research Paper

    Nemotron-3-Nano-Omni Technical Report

    arxiv.org

    Build a pipeline with Nemotron-3-Nano-Omni-30B-A3B-Reasoning-BF16

    Add this model to a processing pipeline alongside other extractors. Combine with retrieval stages for end-to-end search.
