omni-embed-nemotron-3b

by nvidia

Unified embedding model for text, image, audio, and video retrieval in a single vector space

3K downloads/month

126 likes

4.7B params

Identifiers

Model ID

nvidia/omni-embed-nemotron-3b

Feature URI

mixpeek://image_extractor@v1/nvidia_omni_embed_nemotron_3b_v1

Overview

Omni-Embed Nemotron is NVIDIA's omnimodal embedding model. It encodes text, images, audio, and video into a shared 2048-dimensional vector space. Built on the Thinker component of Qwen2.5-Omni-3B, it encodes each modality independently and projects the result into a single retrieval-ready embedding.

On Mixpeek, Omni-Embed Nemotron enables true cross-modal search: query with text and retrieve matching video clips, audio segments, document pages, or images from a single index. One model replaces four separate embedding pipelines.
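Because every modality lands in the same vector space, cross-modal search reduces to ranking stored vectors by their similarity to the query vector. The sketch below illustrates that idea with cosine similarity, which is a common choice but an assumption here: the page does not state which metric Mixpeek uses. The `Item` type, the item IDs, and the toy 4-dimensional vectors are all hypothetical stand-ins for real 2048-dimensional embeddings.

```typescript
// Hypothetical index entry: a media item and its (toy) embedding.
type Item = {
  id: string;
  modality: "text" | "image" | "audio" | "video";
  vector: number[];
};

// Cosine similarity between two vectors of equal length.
function cosineSimilarity(a: number[], b: number[]): number {
  if (a.length !== b.length) throw new Error("dimension mismatch");
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  if (normA === 0 || normB === 0) return 0; // avoid division by zero
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

// Score every item against the query and keep the topK best matches.
function rankBySimilarity(query: number[], items: Item[], topK: number) {
  return items
    .map((item) => ({
      id: item.id,
      modality: item.modality,
      score: cosineSimilarity(query, item.vector),
    }))
    .sort((x, y) => y.score - x.score)
    .slice(0, topK);
}

// Toy 4-dim vectors stand in for real 2048-dim embeddings (made-up values).
const toyIndex: Item[] = [
  { id: "clip_01", modality: "video", vector: [0.9, 0.1, 0.0, 0.1] },
  { id: "podcast_07", modality: "audio", vector: [0.1, 0.8, 0.3, 0.0] },
  { id: "slide_12", modality: "image", vector: [0.7, 0.2, 0.1, 0.3] },
];
const toyQuery = [1.0, 0.0, 0.0, 0.2]; // pretend this came from a text query

const topMatches = rankBySimilarity(toyQuery, toyIndex, 2);
console.log(topMatches.map((m) => `${m.id} (${m.modality})`));
```

A production index would use an approximate nearest-neighbor structure rather than this linear scan, but the ranking principle is the same.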

Architecture

Transformer-based encoder derived from Qwen2.5-Omni-3B (Thinker only, no Talker). 2048-dim output embeddings. 32K max context tokens. Modality-separated encoding with independent audio and video processing paths. 4.7B parameters.
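The 2048-dimension output has a direct storage cost that is easy to estimate. The sketch below is a back-of-the-envelope calculation, assuming vectors are stored as uncompressed 32-bit floats (4 bytes per value); the page does not say how Mixpeek stores vectors, and `estimateIndexBytes` is a hypothetical helper, not part of any SDK.

```typescript
// Raw storage for a set of dense vectors, ignoring index overhead and metadata.
// bytesPerValue = 4 assumes float32 storage (an assumption, not a documented fact).
function estimateIndexBytes(
  numVectors: number,
  dims: number = 2048,
  bytesPerValue: number = 4,
): number {
  return numVectors * dims * bytesPerValue;
}

const perVectorBytes = estimateIndexBytes(1); // 2048 * 4 = 8192 bytes
const millionVectorBytes = estimateIndexBytes(1000000);
const millionVectorGiB = millionVectorBytes / 1024 ** 3;

console.log(`${perVectorBytes} bytes per vector`);
console.log(`${millionVectorGiB.toFixed(2)} GiB per million vectors`);
```

Under these assumptions, one million embeddings take roughly 7.6 GiB before any index overhead or quantization.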

Mixpeek SDK Integration

import { Mixpeek } from "mixpeek";

const mx = new Mixpeek({ apiKey: "API_KEY" });

// Managed: create a collection over a bucket; Mixpeek runs this model's extractor
const collection = await mx.collections.create({
  namespace_id: "my-namespace",
  collection_name: "my-collection",
  source: { type: "bucket", bucket_ids: ["bkt_your_bucket"] },
  feature_extractor: {
    feature_extractor_name: "multimodal_embedding",
    version: "v1",
    parameters: { model_id: "nvidia/omni-embed-nemotron-3b" },
  },
});

Capabilities

Unified text, image, audio, and video embeddings in one model
2048-dimensional dense vectors for cross-modal retrieval
32K token context window
Video retrieval reported as state-of-the-art among embedding models (0.706 average nDCG@10 on LPM + FineVideo)
Competitive visual document retrieval (85.7 nDCG@5 on ViDoRe V1)

Use Cases on Mixpeek

Cross-modal search: query with text, retrieve matching video clips or audio segments

Unified media index: embed an entire multimedia library into one searchable vector space

Podcast and meeting search: find audio moments matching visual or textual queries

Video library retrieval: surface relevant clips by scene description or spoken content

Benchmarks

Dataset	Metric	Score	Source
ViDoRe V1 (visual doc)	nDCG@5	85.7%	NVIDIA, 2025: Model Card
MTEB text retrieval (10 tasks)	nDCG@10 avg	0.606	NVIDIA, 2025: Model Card
Video retrieval (LPM + FineVideo)	nDCG@10 avg	0.706	NVIDIA, 2025: Model Card
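For readers unfamiliar with the metric, nDCG@k rewards rankings that place relevant results near the top, normalized against the best possible ordering. The sketch below uses the linear-gain formulation, DCG@k = Σ rel_i / log2(i + 1) over ranks i = 1..k. It assumes the relevance list covers every judged document for the query; the exact variant and evaluation tooling behind the scores above are not stated on this page.

```typescript
// Discounted cumulative gain over the first k results.
// relevances[i] is the graded relevance of the result at rank i + 1.
function dcgAtK(relevances: number[], k: number): number {
  return relevances
    .slice(0, k)
    .reduce((sum, rel, i) => sum + rel / Math.log2(i + 2), 0);
}

// nDCG@k: DCG of the actual ranking divided by DCG of the ideal ranking.
function ndcgAtK(relevances: number[], k: number): number {
  const ideal = [...relevances].sort((a, b) => b - a);
  const idealDcg = dcgAtK(ideal, k);
  return idealDcg === 0 ? 0 : dcgAtK(relevances, k) / idealDcg;
}

// Made-up example: relevant results at ranks 1 and 3 out of 5 returned.
const exampleRanking = [1, 0, 1, 0, 0];
console.log(ndcgAtK(exampleRanking, 5).toFixed(3));
```

A ranking with every relevant result at the top scores 1.0, and the score drops as relevant results slip down the list.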