
    Multimodal Monday #39: Generative Giants, Retrieval Weaklings

Weeks of December 29 - January 4, 2026: Research reveals why MLLMs like Qwen2-VL fail at multimodal retrieval, MegaRAG builds multimodal knowledge graphs without manual construction, HiStream generates 1080p video 107x faster, and Google DeepMind discovers geometric memory in sequence models.


Happy New Year! This roundup covers the past two weeks, and 2026 has started with a bang: it is going to be a big year for multimodal AI.

    Quick Hits (TL;DR)

    • MLLMs can't retrieve what they can generate. New research shows models like Qwen2-VL and Paligemma2 fail at zero-shot multimodal retrieval despite excelling at generation. CLIP still wins for search tasks.
    • Sequence models discovered a new memory type. Google DeepMind identified geometric memory in deep sequence models, where embeddings encode global entity relationships rather than just co-occurrence patterns.
    • Video generation hit a speed breakthrough. HiStream eliminates spatial, temporal, and timestep redundancy to generate 1080p video 107x faster than previous methods.

    Tools, Models and Techniques

    Tencent HY-Motion 1.0

    Tencent released a billion-parameter text-to-motion model using Diffusion Transformer architecture and flow matching. The model generates 3D character animations from text prompts.

    Why it matters: You can now generate fluid, diverse character animations without motion capture data.
Links: Project Page | GitHub | Hugging Face | Technical report

    Diffusion Knows Transparency (DKT)

    DKT repurposes video diffusion models for transparent object depth and normal estimation. It achieves zero-shot SOTA on ClearPose/DREDS benchmarks, runs at 0.17s per frame, and maintains temporal consistency across videos.

    Why it matters: Transparent objects finally get reliable depth estimation without specialized training data.
    Links: Hugging Face | Paper | Website | Models

    LongVideoAgent

    A multi-agent framework where a master LLM coordinates a grounding agent for segment localization and a vision agent for observation extraction. The system uses reinforcement learning to optimize multi-agent cooperation with step limits.

    Why it matters: You can now query hour-long videos with targeted questions instead of processing entire timelines.
    Links: Paper | Website | GitHub
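
To make the division of labor concrete, here is a minimal sketch of that coordination loop. The function names and stub bodies below are illustrative stand-ins, not the paper's API; the point is the control flow, where a master LLM decides when to ground, when to look, and when to answer, under a hard step limit.

```python
# Illustrative sketch of a LongVideoAgent-style coordination loop.
# All agent internals are hypothetical stubs, not the paper's implementation.
from dataclasses import dataclass

@dataclass
class Observation:
    segment: tuple[float, float]   # (start_sec, end_sec) of the localized clip
    description: str               # what the vision agent reported for that clip

def grounding_agent(query: str, video_id: str) -> tuple[float, float]:
    """Localize the video segment most relevant to the query (stub)."""
    return (120.0, 150.0)

def vision_agent(video_id: str, segment: tuple[float, float]) -> str:
    """Extract a textual observation from the localized segment (stub)."""
    return "a person places a red box on the table"

def master_llm(question: str, observations: list[Observation]) -> dict:
    """Decide whether to gather more evidence or answer (stub).
    In the paper, RL optimizes this cooperation under the step limit."""
    if len(observations) < 2:
        return {"action": "ground", "query": question}
    return {"action": "answer", "text": "The red box ends up on the table."}

def answer_long_video_question(question: str, video_id: str, max_steps: int = 5) -> str:
    observations: list[Observation] = []
    for _ in range(max_steps):                      # step limit keeps queries targeted
        decision = master_llm(question, observations)
        if decision["action"] == "answer":
            return decision["text"]
        segment = grounding_agent(decision["query"], video_id)
        observations.append(Observation(segment, vision_agent(video_id, segment)))
    return "Step limit reached without a confident answer."

print(answer_long_video_question("What happens to the red box?", "lecture_01.mp4"))
```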

    Qwen-Image-2512

Qwen's new text-to-image model delivers more realistic humans, finer natural textures, and stronger text rendering. It sets a new SOTA for image generation quality.
    Links: Hugging Face | GitHub | Blog | Demo | GGUF

    Yume-1.5

    A text-controlled interactive world generation model that creates explorable 3D environments. Users can navigate and interact with generated spaces in real time.
    Links: Website | Hugging Face | Paper

    TwinFlow

    Enables one-step generation on large models using self-adversarial flows. The approach eliminates iterative sampling while maintaining output quality.
    Links: Hugging Face

    Stable Video Infinite 2.0 Pro

    The new version launched with immediate ComfyUI wrapper support from Kijai. Models are already available for download and integration.
    Links: Hugging Face | GitHub

    Soprano

    An ultra-lightweight TTS model that generates 10 hours of 32kHz audio in under 20 seconds. It streams with sub-15ms latency using only 80M parameters and less than 1GB VRAM.
    Links: GitHub

    Wan-NVFP4

    A fast video model claiming 28x faster render speeds than previous versions. Released by lightx2v on Hugging Face.
    Links: Hugging Face

    JavisGPT

A unified multimodal LLM for sounding-video comprehension and generation. The model handles video analysis and audio-visual synthesis in one framework.
    Links: Paper | GitHub | Models

    Dream-VL & Dream-VLA

    Open vision-language and vision-language-action models using a diffusion language model backbone. Both models integrate visual understanding with either language or robotic action outputs.
    Links: Paper | Hugging Face | Hugging Face | GitHub

    HyperCLOVA X SEED Omni 8B

    A unified multimodal model handling text, vision, audio, and video inputs with text, image, and audio outputs.
    Links: Hugging Face

    AMD ROCm

    AMD published a guide for accelerating multimodal inference in vLLM using batch-level dynamic programming switches.
    Links: Blog

    Research Highlights

    Generative Giants, Retrieval Weaklings: Why do Multimodal Large Language Models Fail at Multimodal Retrieval?

    MLLMs like Qwen2-VL and Paligemma2 fail at zero-shot multimodal retrieval despite excelling at generation. Using sparse autoencoders, researchers identified three limitations: training objectives optimize for generation not retrieval, evaluation focuses on generative tasks, and autoregressive architectures compute poor similarity scores.

Figure: Distribution of modality scores for learned concepts by (a) CLIP, (b) SigLIP2, (c) Qwen3-VL-8B-Instruct, and (d) Paligemma2-3b-Mix-224. The modality score quantifies each concept's bias toward the image modality or the text modality.

    Why it matters: You need specialized models like CLIP for retrieval tasks until MLLMs address these architectural limitations.
    Links: Paper
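
For retrieval today, the contrastive-embedding route the paper points back to looks like this in practice. A minimal sketch using the standard Hugging Face CLIP API (assumes `transformers`, `torch`, and `Pillow` are installed and that the image paths are placeholders you replace with your own):

```python
# Zero-shot text-to-image retrieval with CLIP via Hugging Face transformers.
# This is the setup the paper contrasts with MLLMs: both modalities map into
# one embedding space where cosine similarity ranks candidates directly.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

# Placeholder paths -- swap in your own image collection.
images = [Image.open(p) for p in ["dog.jpg", "beach.jpg", "kitchen.jpg"]]
query = "a dog playing fetch on the beach"

with torch.no_grad():
    image_emb = model.get_image_features(**processor(images=images, return_tensors="pt"))
    text_emb = model.get_text_features(**processor(text=[query], return_tensors="pt", padding=True))

# Normalize and rank by cosine similarity -- the similarity computation that
# autoregressive MLLM decoders are not trained to produce.
image_emb = image_emb / image_emb.norm(dim=-1, keepdim=True)
text_emb = text_emb / text_emb.norm(dim=-1, keepdim=True)
scores = (text_emb @ image_emb.T).squeeze(0)
best = scores.argmax().item()
print(f"Best match: image {best}, score {scores[best]:.3f}")
```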

    ReaSeq: Unleashing World Knowledge via Reasoning for Sequential Modeling

    ReaSeq solves two problems in recommender systems: knowledge poverty in ID-based representations and systemic blindness to off-platform user interests. The framework uses reasoning to incorporate world knowledge into sequential modeling.

    Why it matters: Recommendation systems can now reason about user interests beyond platform interaction logs.
    Links: Paper

    MegaRAG: Multimodal Knowledge Graph-Based Retrieval Augmented Generation

    MegaRAG automatically constructs multimodal knowledge graphs integrating text, visual, and spatial information from documents. It uses a two-round, page-based approach where LLMs extract entities in parallel then refine the graph by retrieving relevant subgraphs.

    Why it matters: RAG systems can now handle visually rich documents without manual graph construction.
    Links: Paper
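
As a rough sketch of that two-round, page-based flow (the helper functions below are hypothetical stand-ins for the paper's prompting steps, not its actual implementation):

```python
# Sketch of MegaRAG-style two-round multimodal knowledge-graph construction.
# llm_extract_entities, retrieve_subgraph, and llm_refine are hypothetical
# helpers; the real system's prompts and graph schema may differ.
from concurrent.futures import ThreadPoolExecutor

def llm_extract_entities(page: dict) -> list[dict]:
    """Round 1: extract entities/relations from one page's text, figures,
    and layout (stub)."""
    return [{"name": "Figure 2", "type": "visual", "page": page["number"]}]

def retrieve_subgraph(graph: list[dict], entities: list[dict]) -> list[dict]:
    """Fetch graph neighborhoods relevant to this page's entities (stub)."""
    return [e for e in graph if e["page"] != entities[0]["page"]]

def llm_refine(entities: list[dict], context: list[dict]) -> list[dict]:
    """Round 2: merge duplicates and add cross-page links using the
    retrieved subgraph as context (stub)."""
    return entities

def build_multimodal_kg(pages: list[dict]) -> list[dict]:
    # Round 1: pages are processed independently, so extraction parallelizes.
    with ThreadPoolExecutor() as pool:
        per_page = list(pool.map(llm_extract_entities, pages))
    graph = [entity for entities in per_page for entity in entities]
    # Round 2: refine each page's entities against the rest of the graph.
    refined = []
    for entities in per_page:
        refined.extend(llm_refine(entities, retrieve_subgraph(graph, entities)))
    return refined

kg = build_multimodal_kg([{"number": 1, "text": "..."}, {"number": 2, "text": "..."}])
print(len(kg), "entities")
```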

    Retrieval-augmented Prompt Learning for Pre-trained Foundation Models

    This framework enhances prompt learning by decoupling knowledge from memorization. The approach addresses instability in low-resource settings where parametric models overfit to shallow patterns.

    Why it matters: Prompt learning becomes reliable even with limited training data.
    Links: Paper | GitHub

    Latent Implicit Visual Reasoning

    LIVR discovers visual reasoning tokens without explicit supervision. The method outperforms approaches requiring costly annotations.

    Why it matters: Visual reasoning models no longer need expensive manual labeling.
    Links: Paper

    Geometric Memory in Sequence Models

    Google DeepMind identified geometric memory in deep sequence models, where embeddings encode global relationships between all entities including those never co-occurring in training. This contrasts with associative memory's brute-force lookup approach.

Figure: Associative vs. geometric memory of models trained on various graphs, illustrating two dramatically different ways to memorize a dataset of atomic facts.

Why it matters: Sequence models can now perform multi-step reasoning through single-step navigation in embedding space.
Links: Paper
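
A toy illustration of the distinction (this is not DeepMind's setup, just a small force-directed sketch): an associative store can only answer about pairs of entities it explicitly saw, while coordinates fitted to the same facts assign a distance to every pair, including ones that never co-occurred.

```python
# Toy contrast between associative and geometric memory (illustrative only).
# The "atomic facts" are the edges of a path graph 0-1-2-3-4.
import numpy as np

edges = [(0, 1), (1, 2), (2, 3), (3, 4)]

# Associative memory: a literal lookup over stored facts.
assoc = set(edges) | {(b, a) for a, b in edges}
print("associative memory knows (0, 4)?", (0, 4) in assoc)   # False: never co-occurred

# Geometric memory: 2-D coordinates fitted to the same facts with a simple
# force-directed layout (springs on observed edges, mild all-pairs repulsion).
rng = np.random.default_rng(0)
emb = rng.normal(scale=0.5, size=(5, 2))
for _ in range(500):
    grad = np.zeros_like(emb)
    for a, b in edges:                         # spring toward rest length 1.0
        diff = emb[a] - emb[b]
        dist = np.linalg.norm(diff) + 1e-9
        grad[a] -= (dist - 1.0) * diff / dist
        grad[b] += (dist - 1.0) * diff / dist
    for a in range(5):                         # repulsion keeps the chain unfolded
        for b in range(a + 1, 5):
            diff = emb[a] - emb[b]
            dist = np.linalg.norm(diff) + 1e-9
            grad[a] += 0.2 * diff / dist**2
            grad[b] -= 0.2 * diff / dist**2
    emb += 0.05 * grad

d = lambda a, b: float(np.linalg.norm(emb[a] - emb[b]))
# Distances typically grow with graph distance, so never-co-occurring pairs
# like (0, 4) still get a meaningful answer from the geometry.
print({f"d(0,{k})": round(d(0, k), 2) for k in (1, 2, 4)})
```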

    Step-DeepResearch

    A 32B parameter research agent matching OpenAI and Gemini DeepResearch through atomic capability training. It decomposes research into planning, information gathering, verification, and writing, achieving 61.42 on ResearchRubrics.
    Links: Paper | GitHub
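
The decomposition itself is easy to picture as a pipeline. A hedged sketch with hypothetical stage functions (not the released agent's code):

```python
# Sketch of an atomic-capability research pipeline in the spirit of
# Step-DeepResearch: plan -> gather -> verify -> write. Each stage is a stub.
def plan(question: str) -> list[str]:
    """Break the question into focused sub-queries (stub)."""
    return [f"background on {question}", f"recent results about {question}"]

def gather(sub_query: str) -> list[str]:
    """Collect candidate evidence for one sub-query (stub for search/tool calls)."""
    return [f"snippet answering '{sub_query}'"]

def verify(evidence: list[str]) -> list[str]:
    """Drop unsupported or contradictory snippets (stub)."""
    return [e for e in evidence if "snippet" in e]

def write(question: str, evidence: list[str]) -> str:
    """Draft the report from verified evidence only (stub)."""
    return f"Report on {question}: " + " ".join(evidence)

def deep_research(question: str) -> str:
    evidence: list[str] = []
    for sub_query in plan(question):           # each stage is trained and
        evidence += verify(gather(sub_query))  # evaluated as its own capability
    return write(question, evidence)

print(deep_research("geometric memory in sequence models"))
```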

    SpatialTree

    A 4-level cognitive hierarchy mapping spatial abilities in MLLMs from perception to agentic competence. Benchmarks 27 sub-abilities across 16 models and reveals transfer patterns.
    Links: Website | Paper | Benchmark

    FlowBlending

    Stage-aware multi-model sampling for fast and high-fidelity video generation. The approach optimizes different models for different stages of the generation process.
    Links: Paper

    SpaceTimePilot

    Adobe's video diffusion model disentangles space and time for controllable rendering. From a single input video, it enables independent control of camera viewpoint and motion for bullet-time, slow motion, and mixed trajectories.
    Links: Website | Paper

    HiStream

    Meta's autoregressive framework for 1080p video generation eliminates spatial, temporal, and timestep redundancy. HiStream achieves SOTA quality with up to 107.5x speedup.
    Links: Website | Paper | Code

    InsertAnywhere

    Bridges 4D scene geometry and diffusion models for realistic video object insertion. The method maintains spatial and temporal consistency across frames.
    Links: Paper | Website

    Robust-R1

    A framework making multimodal models robust to visual degradations through explicit degradation-aware reasoning chains. Achieves SOTA robustness on R-Bench while maintaining interpretability.
    Links: Paper | Demo | Dataset

    StoryMem

    ByteDance's multi-shot long video storytelling framework with memory. The system maintains narrative consistency across extended video sequences.
    Links: Website | Code

    Spatia

    Microsoft's video generation system maintains a 3D scene point cloud as persistent spatial memory. Enables long-horizon, spatially consistent video generation with explicit camera control and 3D-aware editing.
    Links: Website | Paper | Video

    DiffThinker

    Enables generative multimodal reasoning with diffusion models. The approach integrates reasoning capabilities directly into the diffusion generation process.
    Links: Paper | Website

    MLLMs Are Not Taking Over... Yet

    New research reveals a fundamental problem with multimodal large language models. Despite their impressive text generation, models like Qwen2-VL and Paligemma2 fail at zero-shot multimodal retrieval. They can describe an image in detail but cannot tell you which image best matches a query. They can answer complex questions about a video but cannot find relevant videos in a database.

    The problem is architectural. MLLMs train to generate text, not retrieve images. They optimize for autoregressive generation, not similarity computation. Their evaluation metrics focus on generation quality, not retrieval accuracy.

This matters for your systems. If you need multimodal retrieval, use CLIP. It was designed for retrieval. MLLMs were not. CLIP runs faster, retrieves more accurately, and is easier to train. The specialized model wins.

    This pattern extends beyond retrieval. Step-DeepResearch shows a 32B parameter model matches much larger research agents by focusing on atomic capabilities. HiStream achieves 107x speedup by eliminating redundancy rather than adding parameters. Soprano runs real-time speech synthesis on consumer hardware with 80M parameters.

    The trend is clear. Specialized models beat general-purpose models at specific tasks. You need to match your model to your task. The ecosystem is diversifying, not consolidating. Different problems need different architectures.

    MLLMs will improve at retrieval. Someone will solve the architectural issues. But that day is not today. Build with the tools that work now, not the tools you hope will work later.

    Community + Shoutouts

    ComfyUI Segmentation Agent

    Adam Barbato released an LLM-based character segmentation agent for ComfyUI using SAM 3.
    Links: GitHub

    CosyVoice 3 ComfyUI

    Machine Delusion released a voice cloning node pack featuring CosyVoice 3 for ComfyUI.
    Links: Announcement | GitHub

    SAM3 Video Tracking in X-AnyLabeling

    Important_Priority76 integrated SAM3 video object tracking into X-AnyLabeling for easy annotation workflows.
    Links: Reddit | GitHub


    That's a wrap for Multimodal Monday #39! This week exposed fundamental limitations in current multimodal models while demonstrating paths forward. MLLMs excel at generation but fail at retrieval due to architectural constraints. Meanwhile, HiStream proves speed comes from eliminating redundancy, MegaRAG shows automatic multimodal knowledge graphs work, and geometric memory reveals how models encode relationships they never saw in training.

    Ready to build multimodal solutions that actually work? Let's talk