
    Multimodal Monday #39: Generative Giants, Retrieval Weaklings

Weeks of December 29 - January 4, 2026: Research reveals why MLLMs like Qwen2-VL fail at multimodal retrieval, MegaRAG builds multimodal knowledge graphs without manual construction, HiStream generates 1080p video 107x faster, and Google DeepMind discovers geometric memory in sequence models.


Happy New Year! This roundup covers the past two weeks, and 2026 has started with a bang: it is going to be a big year for multimodal AI.

    Quick Hits (TL;DR)

    • MLLMs can't retrieve what they can generate. New research shows models like Qwen2-VL and Paligemma2 fail at zero-shot multimodal retrieval despite excelling at generation. CLIP still wins for search tasks.
    • Sequence models discovered a new memory type. Google DeepMind identified geometric memory in deep sequence models, where embeddings encode global entity relationships rather than just co-occurrence patterns.
    • Video generation hit a speed breakthrough. HiStream eliminates spatial, temporal, and timestep redundancy to generate 1080p video 107x faster than previous methods.

    Tools, Models and Techniques

    Tencent HY-Motion 1.0

    Tencent released a billion-parameter text-to-motion model using Diffusion Transformer architecture and flow matching. The model generates 3D character animations from text prompts.

    Why it matters: You can now generate fluid, diverse character animations without motion capture data.
Links: Project Page | GitHub | Hugging Face | Technical report

    Diffusion Knows Transparency (DKT)

    DKT repurposes video diffusion models for transparent object depth and normal estimation. It achieves zero-shot SOTA on ClearPose/DREDS benchmarks, runs at 0.17s per frame, and maintains temporal consistency across videos.

    Why it matters: Transparent objects finally get reliable depth estimation without specialized training data.
    Links: Hugging Face | Paper | Website | Models

    LongVideoAgent

    A multi-agent framework where a master LLM coordinates a grounding agent for segment localization and a vision agent for observation extraction. The system uses reinforcement learning to optimize multi-agent cooperation with step limits.

    Why it matters: You can now query hour-long videos with targeted questions instead of processing entire timelines.
    Links: Paper | Website | GitHub
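
To make the division of labor concrete, here is a minimal sketch of that coordination loop. The function names and stub bodies below are illustrative stand-ins, not the paper's API; the point is the control flow, where a master LLM decides when to ground, when to look, and when to answer, under a hard step limit.

```python
# Illustrative sketch of a LongVideoAgent-style coordination loop.
# All agent internals are hypothetical stubs, not the paper's implementation.
from dataclasses import dataclass

@dataclass
class Observation:
    segment: tuple[float, float]   # (start_sec, end_sec) of the localized clip
    description: str               # what the vision agent reported for that clip

def grounding_agent(query: str, video_id: str) -> tuple[float, float]:
    """Localize the video segment most relevant to the query (stub)."""
    return (120.0, 150.0)

def vision_agent(video_id: str, segment: tuple[float, float]) -> str:
    """Extract a textual observation from the localized segment (stub)."""
    return "a person places a red box on the table"

def master_llm(question: str, observations: list[Observation]) -> dict:
    """Decide whether to gather more evidence or answer (stub).
    In the paper, RL optimizes this cooperation under the step limit."""
    if len(observations) < 2:
        return {"action": "ground", "query": question}
    return {"action": "answer", "text": "The red box ends up on the table."}

def answer_long_video_question(question: str, video_id: str, max_steps: int = 5) -> str:
    observations: list[Observation] = []
    for _ in range(max_steps):                      # step limit keeps queries targeted
        decision = master_llm(question, observations)
        if decision["action"] == "answer":
            return decision["text"]
        segment = grounding_agent(decision["query"], video_id)
        observations.append(Observation(segment, vision_agent(video_id, segment)))
    return "Step limit reached without a confident answer."

print(answer_long_video_question("What happens to the red box?", "lecture_01.mp4"))
```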

    Qwen-Image-2512

Qwen's new text-to-image model delivers more realistic humans, finer natural textures, and stronger text rendering. It sets a new SOTA for image generation quality.
    Links: Hugging Face | GitHub | Blog | Demo | GGUF

    Yume-1.5

    A text-controlled interactive world generation model that creates explorable 3D environments. Users can navigate and interact with generated spaces in real time.
    Links: Website | Hugging Face | Paper

    TwinFlow

    Enables one-step generation on large models using self-adversarial flows. The approach eliminates iterative sampling while maintaining output quality.
    Links: Hugging Face

    Stable Video Infinite 2.0 Pro

    The new version launched with immediate ComfyUI wrapper support from Kijai. Models are already available for download and integration.
    Links: Hugging Face | GitHub

    Soprano

    An ultra-lightweight TTS model that generates 10 hours of 32kHz audio in under 20 seconds. It streams with sub-15ms latency using only 80M parameters and less than 1GB VRAM.
    Links: GitHub

    Wan-NVFP4

    A fast video model claiming 28x faster render speeds than previous versions. Released by lightx2v on Hugging Face.
    Links: Hugging Face

    JavisGPT

A unified multimodal LLM for sounding-video comprehension and generation. The model handles video analysis and audio-visual synthesis in one framework.
    Links: Paper | GitHub | Models

    Dream-VL & Dream-VLA

    Open vision-language and vision-language-action models using a diffusion language model backbone. Both models integrate visual understanding with either language or robotic action outputs.
    Links: Paper | Hugging Face | Hugging Face | GitHub

    HyperCLOVA X SEED Omni 8B

    A unified multimodal model handling text, vision, audio, and video inputs with text, image, and audio outputs.
    Links: Hugging Face

    AMD ROCm

    AMD published a guide for accelerating multimodal inference in vLLM using batch-level dynamic programming switches.
    Links: Blog

    Research Highlights

    Generative Giants, Retrieval Weaklings: Why do Multimodal Large Language Models Fail at Multimodal Retrieval?

    MLLMs like Qwen2-VL and Paligemma2 fail at zero-shot multimodal retrieval despite excelling at generation. Using sparse autoencoders, researchers identified three limitations: training objectives optimize for generation not retrieval, evaluation focuses on generative tasks, and autoregressive architectures compute poor similarity scores.

Figure: Distribution of modality scores for learned concepts by (a) CLIP, (b) SigLIP2, (c) Qwen3-VL-8B-Instruct, and (d) Paligemma2-3b-Mix-224. The modality score quantifies each concept's bias toward the image modality or the text modality.

    Why it matters: You need specialized models like CLIP for retrieval tasks until MLLMs address these architectural limitations.
    Links: Paper
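
For retrieval today, the contrastive-embedding route the paper points back to looks like this in practice. A minimal sketch using the standard Hugging Face CLIP API (assumes `transformers`, `torch`, and `Pillow` are installed and that the image paths are placeholders you replace with your own):

```python
# Zero-shot text-to-image retrieval with CLIP via Hugging Face transformers.
# This is the setup the paper contrasts with MLLMs: both modalities map into
# one embedding space where cosine similarity ranks candidates directly.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

# Placeholder paths -- swap in your own image collection.
images = [Image.open(p) for p in ["dog.jpg", "beach.jpg", "kitchen.jpg"]]
query = "a dog playing fetch on the beach"

with torch.no_grad():
    image_emb = model.get_image_features(**processor(images=images, return_tensors="pt"))
    text_emb = model.get_text_features(**processor(text=[query], return_tensors="pt", padding=True))

# Normalize and rank by cosine similarity -- the similarity computation that
# autoregressive MLLM decoders are not trained to produce.
image_emb = image_emb / image_emb.norm(dim=-1, keepdim=True)
text_emb = text_emb / text_emb.norm(dim=-1, keepdim=True)
scores = (text_emb @ image_emb.T).squeeze(0)
best = scores.argmax().item()
print(f"Best match: image {best}, score {scores[best]:.3f}")
```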

    ReaSeq: Unleashing World Knowledge via Reasoning for Sequential Modeling

    ReaSeq solves two problems in recommender systems: knowledge poverty in ID-based representations and systemic blindness to off-platform user interests. The framework uses reasoning to incorporate world knowledge into sequential modeling.

    Why it matters: Recommendation systems can now reason about user interests beyond platform interaction logs.
    Links: Paper

    MegaRAG: Multimodal Knowledge Graph-Based Retrieval Augmented Generation

    MegaRAG automatically constructs multimodal knowledge graphs integrating text, visual, and spatial information from documents. It uses a two-round, page-based approach where LLMs extract entities in parallel then refine the graph by retrieving relevant subgraphs.

    Why it matters: RAG systems can now handle visually rich documents without manual graph construction.
    Links: Paper
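
As a rough sketch of that two-round, page-based flow (the helper functions below are hypothetical stand-ins for the paper's prompting steps, not its actual implementation):

```python
# Sketch of MegaRAG-style two-round multimodal knowledge-graph construction.
# llm_extract_entities, retrieve_subgraph, and llm_refine are hypothetical
# helpers; the real system's prompts and graph schema may differ.
from concurrent.futures import ThreadPoolExecutor

def llm_extract_entities(page: dict) -> list[dict]:
    """Round 1: extract entities/relations from one page's text, figures,
    and layout (stub)."""
    return [{"name": "Figure 2", "type": "visual", "page": page["number"]}]

def retrieve_subgraph(graph: list[dict], entities: list[dict]) -> list[dict]:
    """Fetch graph neighborhoods relevant to this page's entities (stub)."""
    return [e for e in graph if e["page"] != entities[0]["page"]]

def llm_refine(entities: list[dict], context: list[dict]) -> list[dict]:
    """Round 2: merge duplicates and add cross-page links using the
    retrieved subgraph as context (stub)."""
    return entities

def build_multimodal_kg(pages: list[dict]) -> list[dict]:
    # Round 1: pages are processed independently, so extraction parallelizes.
    with ThreadPoolExecutor() as pool:
        per_page = list(pool.map(llm_extract_entities, pages))
    graph = [entity for entities in per_page for entity in entities]
    # Round 2: refine each page's entities against the rest of the graph.
    refined = []
    for entities in per_page:
        refined.extend(llm_refine(entities, retrieve_subgraph(graph, entities)))
    return refined

kg = build_multimodal_kg([{"number": 1, "text": "..."}, {"number": 2, "text": "..."}])
print(len(kg), "entities")
```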

    Retrieval-augmented Prompt Learning for Pre-trained Foundation Models

    This framework enhances prompt learning by decoupling knowledge from memorization. The approach addresses instability in low-resource settings where parametric models overfit to shallow patterns.

    Why it matters: Prompt learning becomes reliable even with limited training data.
    Links: Paper | GitHub

    Latent Implicit Visual Reasoning

    LIVR discovers visual reasoning tokens without explicit supervision. The method outperforms approaches requiring costly annotations.

    Why it matters: Visual reasoning models no longer need expensive manual labeling.
    Links: Paper

    Geometric Memory in Sequence Models

    Google DeepMind identified geometric memory in deep sequence models, where embeddings encode global relationships between all entities including those never co-occurring in training. This contrasts with associative memory's brute-force lookup approach.

Figure: Associative vs. geometric memory of models trained on various graphs, illustrating two dramatically different ways to memorize a dataset of atomic facts.

Why it matters: Sequence models can now perform multi-step reasoning through single-step navigation in embedding space.
Links: Paper
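
A toy illustration of the distinction (this is not DeepMind's setup, just a small force-directed sketch): an associative store can only answer about pairs of entities it explicitly saw, while coordinates fitted to the same facts assign a distance to every pair, including ones that never co-occurred.

```python
# Toy contrast between associative and geometric memory (illustrative only).
# The "atomic facts" are the edges of a path graph 0-1-2-3-4.
import numpy as np

edges = [(0, 1), (1, 2), (2, 3), (3, 4)]

# Associative memory: a literal lookup over stored facts.
assoc = set(edges) | {(b, a) for a, b in edges}
print("associative memory knows (0, 4)?", (0, 4) in assoc)   # False: never co-occurred

# Geometric memory: 2-D coordinates fitted to the same facts with a simple
# force-directed layout (springs on observed edges, mild all-pairs repulsion).
rng = np.random.default_rng(0)
emb = rng.normal(scale=0.5, size=(5, 2))
for _ in range(500):
    grad = np.zeros_like(emb)
    for a, b in edges:                         # spring toward rest length 1.0
        diff = emb[a] - emb[b]
        dist = np.linalg.norm(diff) + 1e-9
        grad[a] -= (dist - 1.0) * diff / dist
        grad[b] += (dist - 1.0) * diff / dist
    for a in range(5):                         # repulsion keeps the chain unfolded
        for b in range(a + 1, 5):
            diff = emb[a] - emb[b]
            dist = np.linalg.norm(diff) + 1e-9
            grad[a] += 0.2 * diff / dist**2
            grad[b] -= 0.2 * diff / dist**2
    emb += 0.05 * grad

d = lambda a, b: float(np.linalg.norm(emb[a] - emb[b]))
# Distances typically grow with graph distance, so never-co-occurring pairs
# like (0, 4) still get a meaningful answer from the geometry.
print({f"d(0,{k})": round(d(0, k), 2) for k in (1, 2, 4)})
```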

    Step-DeepResearch

    A 32B parameter research agent matching OpenAI and Gemini DeepResearch through atomic capability training. It decomposes research into planning, information gathering, verification, and writing, achieving 61.42 on ResearchRubrics.
    Links: Paper | GitHub
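
The decomposition itself is easy to picture as a pipeline. A hedged sketch with hypothetical stage functions (not the released agent's code):

```python
# Sketch of an atomic-capability research pipeline in the spirit of
# Step-DeepResearch: plan -> gather -> verify -> write. Each stage is a stub.
def plan(question: str) -> list[str]:
    """Break the question into focused sub-queries (stub)."""
    return [f"background on {question}", f"recent results about {question}"]

def gather(sub_query: str) -> list[str]:
    """Collect candidate evidence for one sub-query (stub for search/tool calls)."""
    return [f"snippet answering '{sub_query}'"]

def verify(evidence: list[str]) -> list[str]:
    """Drop unsupported or contradictory snippets (stub)."""
    return [e for e in evidence if "snippet" in e]

def write(question: str, evidence: list[str]) -> str:
    """Draft the report from verified evidence only (stub)."""
    return f"Report on {question}: " + " ".join(evidence)

def deep_research(question: str) -> str:
    evidence: list[str] = []
    for sub_query in plan(question):           # each stage is trained and
        evidence += verify(gather(sub_query))  # evaluated as its own capability
    return write(question, evidence)

print(deep_research("geometric memory in sequence models"))
```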

    SpatialTree

    A 4-level cognitive hierarchy mapping spatial abilities in MLLMs from perception to agentic competence. Benchmarks 27 sub-abilities across 16 models and reveals transfer patterns.
    Links: Website | Paper | Benchmark

    FlowBlending

    Stage-aware multi-model sampling for fast and high-fidelity video generation. The approach optimizes different models for different stages of the generation process.
    Links: Paper

    SpaceTimePilot

    Adobe's video diffusion model disentangles space and time for controllable rendering. From a single input video, it enables independent control of camera viewpoint and motion for bullet-time, slow motion, and mixed trajectories.
    Links: Website | Paper

    HiStream

    Meta's autoregressive framework for 1080p video generation eliminates spatial, temporal, and timestep redundancy. HiStream achieves SOTA quality with up to 107.5x speedup.
    Links: Website | Paper | Code

    InsertAnywhere

    Bridges 4D scene geometry and diffusion models for realistic video object insertion. The method maintains spatial and temporal consistency across frames.
    Links: Paper | Website

    Robust-R1

    A framework making multimodal models robust to visual degradations through explicit degradation-aware reasoning chains. Achieves SOTA robustness on R-Bench while maintaining interpretability.
    Links: Paper | Demo | Dataset

    StoryMem

    ByteDance's multi-shot long video storytelling framework with memory. The system maintains narrative consistency across extended video sequences.
    Links: Website | Code

    Spatia

    Microsoft's video generation system maintains a 3D scene point cloud as persistent spatial memory. Enables long-horizon, spatially consistent video generation with explicit camera control and 3D-aware editing.
    Links: Website | Paper | Video

    DiffThinker

    Enables generative multimodal reasoning with diffusion models. The approach integrates reasoning capabilities directly into the diffusion generation process.
    Links: Paper | Website

    MLLMs Are Not Taking Over... Yet

    New research reveals a fundamental problem with multimodal large language models. Despite their impressive text generation, models like Qwen2-VL and Paligemma2 fail at zero-shot multimodal retrieval. They can describe an image in detail but cannot tell you which image best matches a query. They can answer complex questions about a video but cannot find relevant videos in a database.

    The problem is architectural. MLLMs train to generate text, not retrieve images. They optimize for autoregressive generation, not similarity computation. Their evaluation metrics focus on generation quality, not retrieval accuracy.

This matters for your systems. If you need multimodal retrieval, use CLIP. It was designed for retrieval. MLLMs were not. CLIP runs faster, retrieves more accurately, and is easier to train. The specialized model wins.

    This pattern extends beyond retrieval. Step-DeepResearch shows a 32B parameter model matches much larger research agents by focusing on atomic capabilities. HiStream achieves 107x speedup by eliminating redundancy rather than adding parameters. Soprano runs real-time speech synthesis on consumer hardware with 80M parameters.

    The trend is clear. Specialized models beat general-purpose models at specific tasks. You need to match your model to your task. The ecosystem is diversifying, not consolidating. Different problems need different architectures.

    MLLMs will improve at retrieval. Someone will solve the architectural issues. But that day is not today. Build with the tools that work now, not the tools you hope will work later.

    Community + Shoutouts

    ComfyUI Segmentation Agent

    Adam Barbato released an LLM-based character segmentation agent for ComfyUI using SAM 3.
    Links: GitHub

    CosyVoice 3 ComfyUI

    Machine Delusion released a voice cloning node pack featuring CosyVoice 3 for ComfyUI.
    Links: Announcement | GitHub

    SAM3 Video Tracking in X-AnyLabeling

    Important_Priority76 integrated SAM3 video object tracking into X-AnyLabeling for easy annotation workflows.
    Links: Reddit | GitHub


    That's a wrap for Multimodal Monday #39! This week exposed fundamental limitations in current multimodal models while demonstrating paths forward. MLLMs excel at generation but fail at retrieval due to architectural constraints. Meanwhile, HiStream proves speed comes from eliminating redundancy, MegaRAG shows automatic multimodal knowledge graphs work, and geometric memory reveals how models encode relationships they never saw in training.

    Ready to build multimodal solutions that actually work? Let's talk