
    Multimodal Monday #28: Diffusion Thinks, Retrieval Unifies

    Multimodal Monday #28: Fast-dLLM v2 diffuses text 2.5x faster, Omni-Embed-Nemotron hunts across modalities, and Think-Then-Embed reasons to top MMEB-V2.


    Week of October 6-12, 2025


    📢 Quick Takes (TL;DR)

    Unified multimodal retrieval advances - Nvidia's Omni-Embed-Nemotron extends retrieval embeddings to handle text, images, audio, and video in a single architecture, building on recent work like ColPali and inspired by models like Qwen2.5-Omni.

    Diffusion models branch beyond images - RND1 applies diffusion to text generation. Fast-dLLM v2 converts autoregressive models to parallel generation. DiffusionNFT makes reinforcement learning work with diffusion. The technique generalizes.

    Models start explaining their thinking - Think-Then-Embed generates reasoning traces before embeddings. Hunyuan-Vision-1.5-Thinking processes visual info through structured reasoning steps. Chunkr-parse-1-thinking understands document context beyond text extraction. Explanation improves accuracy.

    🧠 Research Highlights

    Nvidia Omni-Embed-Nemotron: A Unified Multimodal Retrieval Model for Text, Image, Audio, and Video

    Nvidia built a single model that searches across text, images, audio, and video simultaneously. Unlike current systems that handle each media type separately, Omni-Embed-Nemotron performs both cross-modal searches (finding videos from text queries) and joint-modal searches (combining text and audio to find relevant content).

    Link: Paper
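
    To make the cross-modal vs. joint-modal distinction concrete, here is a minimal sketch of retrieval over a shared embedding space. The embeddings are random stand-ins and the additive query fusion is an illustrative assumption; the actual Omni-Embed-Nemotron encoder and fusion strategy may differ.

```python
# Sketch: cross-modal and joint-modal retrieval over a shared embedding space.
# The embeddings below are random stand-ins; in practice they would come from a
# unified multimodal encoder such as Omni-Embed-Nemotron (loading code omitted).
import numpy as np

rng = np.random.default_rng(0)
DIM = 512

# Pretend corpus: items of different modalities, all living in the same space.
corpus = {
    "report.pdf":        rng.normal(size=DIM),   # text/document embedding
    "meeting_audio.wav": rng.normal(size=DIM),   # audio embedding
    "training_clip.mp4": rng.normal(size=DIM),   # video embedding
}

def normalize(v: np.ndarray) -> np.ndarray:
    return v / np.linalg.norm(v)

def search(query_vec: np.ndarray, top_k: int = 3) -> list[tuple[str, float]]:
    """Rank corpus items by cosine similarity to the query embedding."""
    q = normalize(query_vec)
    scores = {name: float(normalize(vec) @ q) for name, vec in corpus.items()}
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)[:top_k]

# Cross-modal: a text query embedding retrieves audio/video items directly.
text_query_vec = rng.normal(size=DIM)
print(search(text_query_vec))

# Joint-modal: fuse text + audio query embeddings (simple sum of normalized
# vectors here; the model's fusion may differ) and search the same index.
audio_query_vec = rng.normal(size=DIM)
joint_vec = normalize(text_query_vec) + normalize(audio_query_vec)
print(search(joint_vec))
```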

    Meta SSDD: Single-Step Diffusion Decoder for Efficient Image Tokenization

    Meta created SSDD, a new image tokenizer that processes images in one step instead of multiple iterations. The system achieves better image quality (FID improves from 0.87 to 0.50) while running 3.8x faster than current methods, all without needing adversarial training.

    Figure (from the paper): speed-quality Pareto front for state-of-the-art f8c4 feedforward and diffusion autoencoders (left); reconstructions from KL-VAE and SSDD models with similar throughput (right); high-level overview of the method (bottom).

    Link: Paper

    Nvidia Fast-dLLM v2: Efficient Block-Diffusion LLM

    Nvidia's Fast-dLLM v2 converts standard autoregressive language models into parallel text generators using only about 1 billion training tokens (roughly 500x less than full-attention diffusion LLMs such as Dream). The system generates text 2.5x faster than comparable autoregressive models while maintaining quality, reaching 217.5 tokens per second at batch size 4.

    Links: Paper | Project Page
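
    As a rough illustration of why block diffusion decodes faster, the toy loop below fills an entire block of masked tokens over a few parallel refinement passes, committing the most confident predictions each time. The predict() function is a random stand-in and the confidence-based commit schedule is an assumption for illustration; this is not Fast-dLLM v2's actual decoder.

```python
# Toy sketch of block-parallel decoding in the spirit of block-diffusion LLMs:
# generate a block of B tokens per step, iteratively "unmasking" the most
# confident positions instead of emitting one token per forward pass.
import numpy as np

rng = np.random.default_rng(0)
VOCAB, BLOCK, STEPS = 50, 8, 4
MASK = -1

def predict(block_tokens: np.ndarray) -> np.ndarray:
    """Stand-in denoiser: returns fake logits of shape [BLOCK, VOCAB]."""
    return rng.normal(size=(BLOCK, VOCAB))

def decode_block() -> np.ndarray:
    tokens = np.full(BLOCK, MASK)
    for _ in range(STEPS):
        logits = predict(tokens)
        probs = np.exp(logits) / np.exp(logits).sum(-1, keepdims=True)
        conf = probs.max(-1)
        masked = np.where(tokens == MASK)[0]
        if masked.size == 0:
            break
        # Commit about half of the remaining masked positions, highest confidence first.
        k = max(1, masked.size // 2)
        chosen = masked[np.argsort(-conf[masked])[:k]]
        tokens[chosen] = probs[chosen].argmax(-1)
    # Fill any stragglers greedily on a final pass.
    tokens[tokens == MASK] = predict(tokens)[tokens == MASK].argmax(-1)
    return tokens

print(decode_block())  # 8 tokens produced in a handful of parallel steps
```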

    Character Mixing for Video Generation

    Researchers developed a system that places characters from different worlds into the same video while preserving their unique styles. The framework uses Cross-Character Embedding to learn character behaviors and Cross-Character Augmentation to create synthetic training data, enabling Mr. Bean to interact naturally in Tom & Jerry's cartoon world.

    Links: Announcement | Project Page | GitHub | Paper

    Think Then Embed: Generative Context Improves Multimodal Embedding

    Think-Then-Embed makes models reason through complex queries before creating embeddings. The two-stage approach first generates reasoning traces that explain the query, then produces embeddings based on both the original query and the reasoning, achieving state-of-the-art results on MMEB-V2 benchmark.

    Figure (from the paper): given a multimodal input, the model first reasons about the desired embedding content; the representation is conditioned on both the original input and the reasoning result.

    Link: Paper
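
    A minimal sketch of the two-stage pattern, assuming a text-only query for simplicity: reason() and embed() below are toy stand-ins for the paper's MLLM reasoner and trained embedder, and the "[SEP]" concatenation is an illustrative choice rather than the paper's exact conditioning scheme.

```python
# Sketch of the Think-Then-Embed pattern: first generate a reasoning trace for
# the query, then embed the query together with that trace.
import hashlib
import numpy as np

DIM = 256

def reason(query: str) -> str:
    """Stand-in for the stage-1 reasoner (an MLLM in the paper)."""
    return f"The query '{query}' asks for items matching both the scene and the described attribute."

def embed(text: str) -> np.ndarray:
    """Stand-in embedder: hashed bag-of-words, L2-normalized."""
    vec = np.zeros(DIM)
    for tok in text.lower().split():
        idx = int(hashlib.md5(tok.encode()).hexdigest(), 16) % DIM
        vec[idx] += 1.0
    return vec / (np.linalg.norm(vec) or 1.0)

query = "red vintage car parked near a lighthouse"
trace = reason(query)                      # stage 1: think
vector = embed(query + " [SEP] " + trace)  # stage 2: embed query + reasoning
print(vector.shape, float(vector @ embed(query)))
```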

    Token Perception for Multimodal Reinforcement Learning enhances agent understanding across different modalities through token-based perception mechanisms.
    Link: Paper

    Vision Language Models: A Survey of 26K Papers analyzes 26,000 VLM papers, revealing the field's shift toward multimodal LLMs and generative methods.
    Link: Paper

    DiffusionNFT: Online Diffusion Reinforcement with Forward Process makes reinforcement learning practical for diffusion models by working directly on the forward process.
    Links: Paper | GitHub

    ChronoEdit: Towards Temporal Reasoning for Image Editing and World Simulation treats image editing as video generation to maintain physical consistency.

    Links: Announcement | Project Page | Paper

    RLAD: Reasoning Abstractions for Decision Making teaches models to learn textual strategies that guide exploration in decision-making tasks.

    Figure (from the paper): generating abstractions by summarizing the future (left) and proposing and utilizing abstractions to guide exploration (right).

    Links: Announcement | Paper | Project Page

    VLM-Lens: Interpreting Vision-Language Models provides tools for analyzing vision-language models at any layer with YAML configuration.

    Figure (from the paper): VLM-Lens probing workflow, with training- and inference-time probes applied across supported models such as BLIP-2, PaliGemma, Qwen-VL, MiniCPM, and InternVL.

    Links: Announcement | GitHub | Paper

    NaViL: Rethinking Scaling Properties of Native Multimodal Large Language Models under Data Constraints shows how to scale multimodal LLMs when training data is limited.
    Link: Paper

    🛠️ Tools & Techniques

    StreamDiffusionV2: Real-Time Interactive Video Generation

    StreamDiffusionV2 generates live video at 42 FPS on 4 H100 GPUs and 16.6 FPS on 2 RTX 4090s. The system uses StreamVAE to compress 4 video frames into one latent frame, synced rolling KV cache for style consistency, and motion-aware noise control that adapts to how fast things move in your video.

    Links: Announcement | Project Page | GitHub
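
    A sketch of two of the ideas described above, under loose assumptions: a rolling buffer that groups four frames into one latent (standing in for StreamVAE) and a noise scale that grows with inter-frame motion. The compress() and motion_noise_scale() functions are illustrative stand-ins, not the released implementation.

```python
# Sketch of StreamDiffusionV2-style streaming: batch 4 frames into one latent
# (StreamVAE-style temporal compression) and scale noise by motion magnitude.
from collections import deque
import numpy as np

rng = np.random.default_rng(0)
H, W, C = 64, 64, 3
FRAMES_PER_LATENT = 4

def compress(frames: np.ndarray) -> np.ndarray:
    """Stand-in for StreamVAE: [4, H, W, C] frames -> one latent [H//8, W//8, 4]."""
    return frames.mean(axis=0)[::8, ::8, :1].repeat(4, axis=-1)

def motion_noise_scale(frames: np.ndarray, base: float = 0.1) -> float:
    """More inter-frame change -> more noise injected (motion-aware control)."""
    motion = np.abs(np.diff(frames, axis=0)).mean()
    return base * (1.0 + float(motion))

buffer: deque = deque(maxlen=FRAMES_PER_LATENT)
for t in range(10):                       # pretend incoming video stream
    buffer.append(rng.random((H, W, C)))
    if len(buffer) == FRAMES_PER_LATENT:
        chunk = np.stack(buffer)
        latent = compress(chunk)
        sigma = motion_noise_scale(chunk)
        noisy_latent = latent + rng.normal(scale=sigma, size=latent.shape)
        # ...denoise noisy_latent with the video diffusion model here...
        buffer.clear()
```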

    Tencent Hunyuan-Vision-1.5-Thinking

    Tencent released their latest vision-language model, ranked #3 on LM Arena. The model processes visual information through structured reasoning steps before generating responses, handling complex visual understanding tasks that require multi-step thinking.
    Links: Announcement | GitHub

    chunkr-parse-1 and chunkr-parse-1-thinking VLMs

    Chunkr built VLMs specifically for parsing documents with complex layouts, tables, and formulas in over 100 languages. The models outperform AWS Textract, Gemini 2.5 Pro, and Mistral at OCR tasks, with the "thinking" variant understanding document context beyond just text extraction.

    Figure: model performance bar chart comparing chunkr-parse-1 (orange) against other models (gray), including chunkr-parse-1-thinking, Gemini 2.5 Pro, Gemini 1.5 Flash, Marlin 0.3, Claude 3.5 Sonnet, GPT-4o mini, and GPT-4o, with scores from 0 to 1 across document tasks such as health underwriting.

    Links: Announcement | Website | Blog Post

    Paris: Decentralized Trained Open-Weight Diffusion Model

    Paris is the first diffusion model trained across multiple nodes without centralized control. The decentralized approach distributes computational resources while keeping the model weights open for anyone to use.

    Links: Announcement | Paper | HuggingFace

    RND1: Powerful Base Diffusion Language Model

    Radical Numerics launched RND1, their base diffusion language model that generates text through diffusion processes instead of traditional sequential generation. The model offers parallel text generation with improved controllability.

    RND1 diffusion text generation

    Links: Announcement | Blog Post | GitHub | HuggingFace

    MM-HELIX launches on Hugging Face as a 7B parameter multimodal model with thinking capabilities.
    Links: Paper | HuggingFace

    kani-tts-370m introduces a lightweight 370M parameter text-to-speech model for resource-constrained environments.

    Links: HuggingFace Model | Demo Space

    OpenAI GPT-5 Pro and AgentKit launch, pairing a more capable flagship model with a toolkit for building and deploying agents.
    Links: Docs

    UNIDOC-BENCH provides 70,000 real-world PDF pages spanning eight domains with 1,600 multimodal QA pairs for evaluating document RAG systems.
    Links: Paper | GitHub

    CFVBench offers a comprehensive video benchmark for fine-grained multimodal retrieval-augmented generation.
    Link: Paper

    Unified Multimodal Retrieval Architectures Consolidate Modalities

    Nvidia's Omni-Embed-Nemotron demonstrates a trend toward consolidating retrieval capabilities across text, images, audio, and video within single unified architectures. Building on recent work like ColPali, which showed that preserving document layout through image-based representations improves retrieval quality, and inspired by capabilities of models like Qwen2.5-Omni, the approach extends unified embeddings to encompass all major content modalities. Rather than maintaining separate retrieval pipelines for each content type, the model provides consistent embedding spaces that enable both cross-modal retrieval (searching across different modality types) and joint-modal retrieval (combining multiple modalities in queries).

    This architectural consolidation addresses practical challenges in managing heterogeneous content collections. Organizations maintaining documents, presentations, recorded meetings, and training videos can implement unified search interfaces rather than separate systems for each content type. The ability to find connections across media types, such as locating a concept mentioned in both a PDF document and a recorded presentation, becomes more straightforward when embeddings exist in consistent spaces.

    For multimodal indexing and retrieval systems, this trend suggests a shift from modality-specific pipelines toward unified architectures that handle diverse content types through consistent frameworks. The practical implications center on simplified infrastructure, more coherent search experiences, and the ability to discover relationships across content that exists in different formats. As these unified architectures mature, the distinction between searching "documents" versus "videos" versus "audio recordings" may become less relevant to end users, who simply search for information regardless of its original format.

    Diffusion Models Expand Beyond Image Generation

    The application of diffusion techniques to domains beyond image generation represents a significant expansion of the methodology's scope. RND1 demonstrates diffusion applied to text generation, achieving competitive performance with traditional autoregressive approaches while offering potential advantages in parallel generation and controllability. Fast-dLLM v2 shows that pretrained autoregressive models can be efficiently adapted into block diffusion models for parallel text generation, requiring only ~1B tokens of fine-tuning compared to the 580B tokens needed for full-attention diffusion LLMs like Dream. DiffusionNFT extends the technique to reinforcement learning, introducing a new online RL paradigm that works directly on the forward process via flow matching.

    These applications share a common pattern: adapting diffusion principles to new domains while addressing domain-specific challenges. For text, this involves handling discrete tokens and maintaining coherence across longer sequences. For reinforcement learning, it requires integrating reward signals into the diffusion process. The success of these adaptations suggests that diffusion's core principles, iterative refinement through learned denoising processes, generalize beyond the image domain where they first proved successful.
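
    The shared recipe, forward corruption followed by iterative learned refinement, can be sketched in a few lines. The denoise_step() below cheats by nudging toward a known clean target; a real model would predict that target (or the noise) from the current sample and timestep, whether the "data" is image latents, token distributions, or agent actions.

```python
# Minimal sketch of the diffusion recipe discussed above: corrupt data with a
# forward noising process, then iteratively refine it with a learned denoiser.
# denoise_step() is a stand-in for the trained network.
import numpy as np

rng = np.random.default_rng(0)
STEPS = 10

def forward_noise(x0: np.ndarray, t: float) -> np.ndarray:
    """Forward process: interpolate data toward pure Gaussian noise at time t in [0, 1]."""
    return (1 - t) * x0 + t * rng.normal(size=x0.shape)

def denoise_step(xt: np.ndarray, t: float, x0_estimate: np.ndarray) -> np.ndarray:
    """Stand-in reverse step: nudge the sample toward the clean target.
    A real model predicts x0 (or the noise/velocity) from xt and t."""
    return xt + (x0_estimate - xt) / max(STEPS * t, 1.0)

x0 = np.linspace(-1, 1, 16)               # pretend "data" (latents, logits, actions, ...)
x = forward_noise(x0, t=1.0)               # start from pure noise
for step in range(STEPS, 0, -1):
    t = step / STEPS
    x = denoise_step(x, t, x0)             # iterative refinement toward the data
print(np.abs(x - x0).mean())               # error shrinks over the refinement loop
```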

    The practical implications for multimodal systems are substantial. Text generation components that can operate in parallel rather than sequentially may reduce latency in multimodal pipelines. Reinforcement learning approaches that leverage diffusion could enable more sophisticated training of multimodal agents. As diffusion techniques continue expanding into new domains, multimodal systems gain access to a broader toolkit of generation and optimization methods that share common underlying principles, potentially simplifying integration and enabling novel combinations of capabilities across different modalities.

    🧩 Community + Shoutouts

    Shoutout to the Alibaba Qwen team for releasing Qwen3-VL Cookbooks. Comprehensive guides that teach while documenting.
    Links: Announcement | GitHub

    Shoutout to enigmatic_e for the entertaining stress test of Wan 2.2 Animate.
    Link: Post


    That's this week's Multimodal Monday. Unified retrieval architectures that handle multiple modalities. Real-time generation at production speeds. Models that think before they act. The research continues advancing.

    Ready to build multimodal solutions that actually work? Let's talk.