
    Multimodal Monday #38: World Models, Real-Time Generation

    Week of December 16 - December 22, 2025: TurboDiffusion achieves 100-205x video generation speedup, WorldPlay generates interactive 3D worlds with geometric consistency, Step-GUI reaches SOTA on AndroidWorld & OSWorld benchmarks, and LongVie 2 produces 5-minute continuous videos.


    Quick Hits (TL;DR)

    Long Context Solved Multiple Ways. LongVie 2 generates 5-minute continuous videos through architectural improvements. MemFlow handles streaming video through adaptive memory that decides what to remember. WorldPlay maintains geometric consistency across extended sequences.

    Video Generation Gets a Speed Boost. TurboDiffusion accelerates video diffusion models by 100-205 times, making real-time video generation practical for the first time. ReFusion bridges autoregressive models (ARMs) and masked diffusion models (MDMs) for efficient, high-performance text generation.


    Layer-Based Control Emerges. Qwen-Image-Layered separates images into editable RGBA layers. StereoPilot handles depth-aware stereo conversion. Generative Refocusing controls focus layers. DeContext adds protection layers.

    Tools, Models and Techniques

    T5Gemma 2

    Google released T5Gemma 2, the next generation of encoder-decoder models. The architecture combines bidirectional understanding with flexible text generation.

    Why it matters: T5Gemma 2 handles tasks that require both deep comprehension and creative output in a single model.
    Links: Blog | Model

    Perception Encoder Audiovisual (PE-AV)

    Meta released PE-AV, the technical engine behind SAM Audio’s audio separation capabilities. The model processes both visual and audio information to isolate individual sound sources.

    Why it matters: PE-AV separates overlapping audio sources with unprecedented accuracy by understanding what objects should sound like.
    Links: Announcement | Paper | Code
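
    For intuition about mask-based audio-visual separation (the general family PE-AV belongs to), here is a minimal sketch: a time-frequency mask is predicted from audio features fused with a visual embedding of the target object, then applied to the mixture spectrogram. The networks below are toy stand-ins, not PE-AV's actual architecture.

```python
import torch

def separate_source(mixture_spec, visual_embedding, audio_net, fusion_net):
    """Toy audio-visual separation: predict a time-frequency mask for the
    mixture conditioned on a visual embedding of the target source."""
    audio_features = audio_net(mixture_spec)                     # (T, F) -> (T, D)
    cond = visual_embedding.expand(audio_features.shape[0], -1)  # broadcast over time
    mask = torch.sigmoid(fusion_net(torch.cat([audio_features, cond], dim=-1)))
    return mixture_spec * mask                                   # keep only the target

# Toy stand-in networks: 128 time frames x 257 frequency bins, 64-dim embeddings.
audio_net = torch.nn.Linear(257, 64)
fusion_net = torch.nn.Linear(128, 257)
separated = separate_source(torch.rand(128, 257), torch.randn(1, 64), audio_net, fusion_net)
print(separated.shape)  # torch.Size([128, 257])
```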

    MiMo-V2-Flash

    Xiaomi released MiMo-V2-Flash, optimized for speed in real-time applications. The model sacrifices some accuracy for dramatic latency reductions.
    Links: Hugging Face | Report | Blog

    TurboDiffusion

    TurboDiffusion accelerates video diffusion models by 100-205 times through architectural optimizations. The speedup comes from reducing redundant computations without quality loss.

    Why it matters: Video generation moves from minutes per frame to real-time speeds, enabling interactive applications.
    Links: Hugging Face | GitHub | Paper
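
    One common way to cut redundant computation in diffusion sampling is to cache expensive intermediate features and reuse them across nearby steps. The sketch below illustrates that general idea with toy stand-in networks; it is not TurboDiffusion's actual optimization, which is detailed in the paper.

```python
import torch

def cached_denoise(deep_fn, shallow_fn, latents, num_steps, cache_interval=4):
    """Feature-caching denoising loop: the expensive deep pass runs only every
    `cache_interval` steps; intermediate steps reuse its cached output."""
    cache = None
    for i in range(num_steps):
        if i % cache_interval == 0:
            cache = deep_fn(latents)              # expensive pass, run rarely
        noise_pred = shallow_fn(latents, cache)   # cheap pass, run every step
        latents = latents - 0.1 * noise_pred      # toy update rule
    return latents

# Toy stand-ins for the real network blocks.
deep = torch.nn.Linear(64, 64)
shallow = lambda x, cached: torch.tanh(x + cached)
result = cached_denoise(deep, shallow, torch.randn(1, 64), num_steps=20)
print(result.shape)  # torch.Size([1, 64])
```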

    Qwen-Image-Layered

    Qwen-Image-Layered decomposes images into multiple RGBA layers that can be independently edited. Each layer isolates specific semantic or structural components while maintaining visual coherence.

    Why it matters: Image editing becomes precise and reversible when you control individual layers instead of manipulating pixels directly.
    Links: Hugging Face | ModelScope | Paper | Announcement | Demo
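
    The payoff of an RGBA decomposition is that edits stay local to a layer and the final image is just a recomposite. A minimal sketch of standard "over" compositing in plain NumPy (independent of Qwen-Image-Layered's model):

```python
import numpy as np

def composite_rgba(layers):
    """Composite RGBA layers (H, W, 4 float arrays in [0, 1]) back-to-front
    with the standard 'over' operator."""
    out_rgb = np.zeros_like(layers[0][..., :3])
    out_a = np.zeros_like(layers[0][..., 3:4])
    for layer in layers:                          # background first, foreground last
        rgb, a = layer[..., :3], layer[..., 3:4]
        out_rgb = rgb * a + out_rgb * (1 - a)
        out_a = a + out_a * (1 - a)
    return np.concatenate([out_rgb, out_a], axis=-1)

# Editing one layer leaves every other layer untouched; just recomposite afterwards.
background = np.zeros((64, 64, 4)); background[..., :3] = 0.2; background[..., 3] = 1.0
foreground = np.zeros((64, 64, 4)); foreground[16:48, 16:48] = [1.0, 0.0, 0.0, 1.0]
original = composite_rgba([background, foreground])
foreground[16:48, 16:48, :3] = [0.0, 0.6, 1.0]    # recolor only the object layer
edited = composite_rgba([background, foreground])
```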

    N3D-VLM

    N3D-VLM grounds spatial reasoning in native 3D representations rather than 2D projections. The model understands depth, distance, and spatial relationships directly.

    Why it matters: Vision-language models gain accurate spatial understanding without the distortions inherent in 2D image analysis.
    Links: GitHub | Website | Model
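
    As a toy illustration of why native 3D grounding helps, the sketch below computes spatial predicates directly from 3D object centroids (hypothetical coordinates, camera at the origin). With only 2D projections, a far-away object can appear right next to a near one; with depth, the same judgments are unambiguous.

```python
import numpy as np

def spatial_relations(obj_a, obj_b, camera_forward=np.array([0.0, 0.0, 1.0])):
    """Toy spatial predicates computed from 3D centroids (x, y, z in metres)."""
    a, b = np.asarray(obj_a, float), np.asarray(obj_b, float)
    return {
        "distance_m": float(np.linalg.norm(a - b)),
        "a_left_of_b": bool(a[0] < b[0]),
        "a_in_front_of_b": bool(a @ camera_forward < b @ camera_forward),
    }

# A cup 1 m away and a chair 4 m away may overlap in the 2D image,
# but their 3D relation is explicit: ~3 m apart, cup in front.
print(spatial_relations([0.1, 0.0, 1.0], [0.2, 0.0, 4.0]))
```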

    MemFlow

    MemFlow maintains adaptive memory for long streaming videos, deciding which frames to remember and which to discard. The system balances memory efficiency with video understanding quality.

    Why it matters: You can now process hours of video without exhausting memory or losing critical context.
    Links: Website | Paper | Model
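
    A minimal sketch of the adaptive-memory idea (not MemFlow's mechanism): keep a bounded set of frame embeddings and evict the most redundant one whenever the budget is exceeded, so arbitrarily long streams never exhaust memory.

```python
import numpy as np

class AdaptiveFrameMemory:
    """Toy streaming memory: keep at most `capacity` frame embeddings, dropping
    the one most similar to what is already stored (the least novel frame)."""

    def __init__(self, capacity=64):
        self.capacity = capacity
        self.embeddings = []                            # unit-norm frame vectors

    def add(self, emb):
        emb = emb / (np.linalg.norm(emb) + 1e-8)
        self.embeddings.append(emb)
        if len(self.embeddings) > self.capacity:
            mat = np.stack(self.embeddings)
            sims = mat @ mat.T - np.eye(len(mat))       # pairwise cosine similarity
            redundancy = sims.max(axis=1)               # how duplicated each frame is
            self.embeddings.pop(int(redundancy.argmax()))

memory = AdaptiveFrameMemory(capacity=8)
for _ in range(1000):                                   # simulate a long stream
    memory.add(np.random.randn(128))
print(len(memory.embeddings))  # stays bounded at 8
```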

    WorldPlay

    Tencent’s WorldPlay generates interactive 3D worlds with long-term geometric consistency. The model maintains spatial relationships across extended video sequences, allowing persistent interaction with generated environments.

    Why it matters: AI-generated worlds become navigable spaces instead of static videos.
    Links: Website | Paper | Model

    LongVie 2

    LongVie 2 generates 5-minute continuous videos with controllable elements and consistent geometry. The model handles multiple modalities and maintains coherence across thousands of frames.

    Why it matters: Video generation moves from clips to scenes with the length and control needed for actual storytelling.
    Links: Paper | Website | GitHub

    FoundationMotion

    FoundationMotion labels and analyzes spatial movement in videos automatically. The system identifies motion patterns and spatial trajectories without manual annotation.

    Why it matters: Video datasets gain spatial movement labels at scale, enabling better motion understanding in future models.
    Links: Paper | GitHub | Demo | Dataset | Models
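
    For intuition on automatic motion labeling, here is a toy trajectory builder that greedily links per-frame object centroids by nearest distance. FoundationMotion's pipeline is far more capable, but the output it produces, spatial trajectories without manual annotation, is the same kind of label.

```python
import numpy as np

def link_detections(frames_of_centroids, max_dist=30.0):
    """Toy trajectory builder: link each track to the nearest unmatched
    centroid in the next frame, producing motion labels automatically."""
    tracks = [[c] for c in frames_of_centroids[0]]
    for centroids in frames_of_centroids[1:]:
        unmatched = list(centroids)
        for track in tracks:
            if not unmatched:
                break
            dists = [np.linalg.norm(np.subtract(track[-1], c)) for c in unmatched]
            j = int(np.argmin(dists))
            if dists[j] <= max_dist:
                track.append(unmatched.pop(j))
    return tracks                                  # each track is a spatial trajectory

# Two objects drifting right and down across 5 frames.
frames = [[(10 + 5 * t, 20), (40, 60 + 3 * t)] for t in range(5)]
for trajectory in link_detections(frames):
    print(trajectory)
```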

    Generative Refocusing

    Generative Refocusing controls depth of field in existing images, simulating camera focus changes after capture. The model infers 3D scene structure to generate realistic blur patterns.

    Why it matters: You gain photographer-level depth control without specialized camera equipment or multiple exposures.
    Links: Website | Demo | Paper | GitHub
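
    A naive, non-generative baseline makes the problem concrete: split the image into depth layers and blur each one in proportion to its distance from the focal plane. Generative refocusing goes further by hallucinating plausible detail and handling occlusion boundaries, which simple blurring cannot. The image, depth map, and focal depth below are synthetic.

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def refocus(image, depth, focus_depth, max_sigma=6.0, n_layers=8):
    """Naive synthetic refocus: blur each depth layer by its distance from
    the focal plane, then recombine the layers."""
    out = np.zeros_like(image, dtype=float)
    weight = np.zeros(depth.shape, dtype=float)
    edges = np.linspace(depth.min(), depth.max(), n_layers + 1)
    for lo, hi in zip(edges[:-1], edges[1:]):
        mask = ((depth >= lo) & (depth <= hi)).astype(float)
        mid = 0.5 * (lo + hi)
        sigma = max_sigma * abs(mid - focus_depth) / (depth.max() - depth.min() + 1e-8)
        blurred = gaussian_filter(image, sigma=(sigma, sigma, 0))   # blur only spatial axes
        out += blurred * mask[..., None]
        weight += mask
    return out / np.maximum(weight[..., None], 1e-8)

image = np.random.rand(96, 96, 3)
depth = np.tile(np.linspace(0.0, 1.0, 96), (96, 1))    # depth increases left to right
result = refocus(image, depth, focus_depth=0.2)        # focus on the left of the frame
```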

    StereoPilot

    StereoPilot converts 2D videos to stereo 3D through learned generative priors. The system produces depth-aware conversions suitable for VR headsets.

    Why it matters: Existing 2D content becomes viewable in 3D without manual conversion or specialized filming.
    Links: Website | Model | GitHub | Paper
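
    The classical baseline here is depth-image-based rendering: shift pixels horizontally in proportion to inverse depth to fake the second eye's view. The sketch below (synthetic image and depth) shows that baseline; a generative approach like StereoPilot also inpaints the disocclusion holes this leaves behind.

```python
import numpy as np

def synthesize_stereo(image, depth, max_disparity=12):
    """Naive depth-image-based rendering: nearer pixels (small depth) are
    shifted further to synthesize the right-eye view."""
    h, w, _ = image.shape
    disparity = (max_disparity * (1.0 - depth)).astype(int)
    right = np.zeros_like(image)
    for y in range(h):
        for x in range(w):
            nx = x - disparity[y, x]
            if 0 <= nx < w:
                right[y, nx] = image[y, x]
    return image, right                            # (left eye, right eye)

image = np.random.rand(64, 64, 3)
depth = np.tile(np.linspace(0.0, 1.0, 64), (64, 1))    # 0 = near, 1 = far
left, right = synthesize_stereo(image, depth)
```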

    LongCat-Video-Avatar: Website | GitHub | Paper | ComfyUI

    Chatterbox Turbo: Hugging Face

    TRELLIS 2: generative model for high-fidelity image-to-3D generation. Links: Model | Demo

    Map Anything (Meta): Model

    Gemini 3 Flash (Google): Blog

    FunctionGemma (Google): Model

    Wan 2.6 (Alibaba): Announcement

    /agent (Firebase) for agentic web scraping: Announcement

    Research Highlights

    KV-Tracker: Real-Time Pose Tracking with Transformers

    KV-Tracker achieves real-time tracking at 30 FPS without any training. The approach uses transformer key-value pairs to track objects and scenes across frames.
    Links: Website | Announcement
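
    For intuition, tracking with transformer key-value features can be as simple as cosine-matching the target's key against every patch key in the next frame. The sketch below uses random stand-in features; KV-Tracker's actual procedure is described on its page.

```python
import numpy as np

def track_by_key_matching(query_key, frame_keys, frame_positions):
    """Toy training-free tracking step: return the position of the patch whose
    key is most similar to the tracked object's key."""
    q = query_key / np.linalg.norm(query_key)
    k = frame_keys / np.linalg.norm(frame_keys, axis=1, keepdims=True)
    scores = k @ q                               # cosine similarity per patch
    best = int(scores.argmax())
    return frame_positions[best], float(scores[best])

# 196 stand-in patch keys for a 14x14 grid in the new frame, plus (row, col) centers.
frame_keys = np.random.randn(196, 64)
positions = [(r, c) for r in range(14) for c in range(14)]
query = frame_keys[87] + 0.05 * np.random.randn(64)   # slightly perturbed target key
print(track_by_key_matching(query, frame_keys, positions))
```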

    DeContext: Protecting Images from Unwanted In-Context Edits

    DeContext adds imperceptible perturbations that prevent DiT models like FLUX and Qwen-Image from making unwanted edits. The protection preserves visual quality while blocking manipulation attempts.
    Links: Website | Paper | GitHub
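
    Protection methods in this family typically add a norm-bounded adversarial perturbation. Below is a generic PGD-style sketch against a stand-in encoder; DeContext's actual objective and target models differ, so treat this as an illustration of the mechanism, not the method.

```python
import torch

def protect_image(image, encoder, epsilon=4 / 255, steps=20, step_size=1 / 255):
    """Generic PGD-style sketch: add a small, bounded perturbation that pushes
    the image's features away from their original values."""
    original = encoder(image).detach()
    delta = torch.zeros_like(image, requires_grad=True)
    for _ in range(steps):
        loss = torch.nn.functional.mse_loss(encoder(image + delta), original)
        loss.backward()
        with torch.no_grad():
            delta += step_size * delta.grad.sign()    # ascend: maximize feature drift
            delta.clamp_(-epsilon, epsilon)           # stay imperceptible
            delta.grad.zero_()
    return (image + delta).detach().clamp(0, 1)

# Toy stand-in encoder; a real attack would target the editor's feature extractor.
encoder = torch.nn.Sequential(torch.nn.Flatten(), torch.nn.Linear(3 * 32 * 32, 128))
protected = protect_image(torch.rand(1, 3, 32, 32), encoder)
```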

    EgoX: Generate Immersive First-Person Video from Any Third-Person Clip

    EgoX transforms third-person videos into realistic first-person perspectives using video diffusion. The framework from KAIST AI and Seoul National University maintains spatial and temporal coherence during the transformation.

    Why it matters: Any third-person footage becomes immersive first-person content without specialized cameras or reshooting.
    Links: Website | Paper | GitHub

    MMGR: Multi-Modal Generative Reasoning

    MMGR benchmarks reveal systematic reasoning failures in GPT-4o and other leading multimodal models. The evaluation exposes gaps between perception and logical inference.

    Why it matters: Top models still struggle with basic reasoning tasks despite strong perceptual capabilities.
    Links: Website | Paper

    KlingAvatar 2.0 and Kling-Omni Technical Report

    KlingAvatar 2.0 generates high-fidelity avatar videos through a spatio-temporal cascade framework. Kling-Omni provides a generalist framework for multimodal video generation with a Co-Reasoning Director that fuses instructions across modalities.
    Links: Kling-Avatar Paper | Kling-Omni Paper | Try It

    Step-GUI Technical Report

    Step-GUI introduces a self-evolving pipeline for GUI automation. The system reaches state-of-the-art on AndroidWorld and OSWorld benchmarks through iterative improvement.
    Links: Paper | Website | GitHub | Model
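
    "Self-evolving" pipelines generally follow a loop of attempt, filter by verified success, retrain, repeat. The toy sketch below shows that loop with stand-in callables; it is an assumption-level illustration, not Step-GUI's actual pipeline, which is detailed in the report.

```python
import random

def self_evolving_loop(agent, run_task, train_on, tasks, rounds=3):
    """Toy self-evolution loop: roll out tasks, keep only successful
    trajectories, retrain on them, and repeat."""
    for r in range(rounds):
        successes = []
        for task in tasks:
            trajectory, success = run_task(agent, task)   # attempt the task in the GUI env
            if success:
                successes.append(trajectory)              # keep only verified wins
        agent = train_on(agent, successes)                # next round starts stronger
        print(f"round {r}: {len(successes)}/{len(tasks)} tasks solved")
    return agent

# Toy stand-ins: 'agent' is a skill score that improves with more training data.
run = lambda agent, task: (f"traj-{task}", random.random() < agent)
train = lambda agent, data: min(1.0, agent + 0.05 * len(data))
self_evolving_loop(0.3, run, train, tasks=list(range(20)))
```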

    ReFusion

    ReFusion combines diffusion models with parallel autoregressive decoding. The architecture bridges autoregressive models and diffusion models for faster text generation.
    Links: Paper

    Figure: Left (Inference): An iterative slot-level “plan-and-infill” loop. A diffusion step plans and drafts slots, followed by parallel autoregressive verify-and-predict infilling. Right (Training): Mirrors inference, optimizing a hybrid objective of autoregressive loss on permuted clean slots and denoising loss on masked slots.
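
    A toy version of that plan-and-infill loop, with stand-in drafter and verifier callables in place of the diffusion planner and autoregressive infiller (illustrative only, not ReFusion's implementation):

```python
import random

def plan_and_infill(num_slots, diffusion_draft, ar_verify, max_rounds=8):
    """Toy slot-level 'plan-and-infill' loop: a (stand-in) drafter proposes all
    empty slots in parallel, then a (stand-in) verifier keeps or rejects each."""
    slots = [None] * num_slots
    for _ in range(max_rounds):
        drafts = diffusion_draft(slots)                  # parallel drafts for empty slots
        for i, draft in drafts.items():
            slots[i] = draft if ar_verify(slots, i, draft) else None
        if all(s is not None for s in slots):
            break
    return slots

# Stand-ins: the drafter proposes random tokens, the verifier accepts ~70% of them.
vocab = ["the", "cat", "sat", "on", "a", "mat"]
drafter = lambda slots: {i: random.choice(vocab) for i, s in enumerate(slots) if s is None}
verifier = lambda slots, i, draft: random.random() < 0.7
print(plan_and_infill(6, drafter, verifier))
```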

    DEER

    DEER uses diffusion models to draft content and autoregressive models to verify it. The two-stage approach balances generation quality with computational efficiency.

    Why it matters: You get diffusion quality at autoregressive speeds by splitting generation and verification.
    Links: Paper | Website
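
    The control flow resembles speculative decoding: draft a block cheaply, verify it with the stronger model, and keep the accepted prefix. A toy sketch with stand-in callables (not DEER's models):

```python
import random

def draft_then_verify(draft_fn, verify_fn, block_size=8, max_tokens=32):
    """Toy draft-and-verify loop: a fast drafter proposes a block of tokens in
    parallel, a stronger verifier accepts the longest prefix it agrees with."""
    output = []
    while len(output) < max_tokens:
        block = draft_fn(output, block_size)            # cheap parallel draft
        n_accepted = verify_fn(output, block)           # how much of the draft survives
        output.extend(block[:max(n_accepted, 1)])       # always make some progress
    return output[:max_tokens]

# Stand-ins: the drafter emits increasing integers, the verifier accepts 75-100% of each block.
drafter = lambda out, k: list(range(len(out), len(out) + k))
verifier = lambda out, block: random.randint(3 * len(block) // 4, len(block))
print(draft_then_verify(drafter, verifier))
```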

    IC-Effect

    IC-Effect applies video effects through in-context learning without fine-tuning. The system learns effect patterns from examples and applies them to new videos.
    Links: Website | GitHub | Paper

    Flow Map Trajectory Tilting

    Flow Map Trajectory Tilting improves diffusion model outputs at test time using flow maps. The technique adjusts generation trajectories without retraining.
    Links: Paper | Website

    World Models

    Three major releases this week generate videos that understand 3D space. WorldPlay from Tencent maintains geometric consistency across extended sequences. LongVie 2 produces 5-minute continuous videos with controllable elements. Kling-Omni provides a generalist framework for multimodal video generation.

    These models don’t just generate pixels. They maintain spatial relationships. When an object moves behind another object in frame 100, it stays behind that object in frame 500. The geometry persists.

    This changes video search. You can now index the underlying 3D structure instead of raw pixels. A search for “car turning left at intersection” finds the spatial event, not just visual patterns that look like cars and intersections.
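
    Concretely, indexing spatial events rather than pixels might look like storing structured records and querying them. The schema and fields below are hypothetical, purely to illustrate the shift; a production system would combine this with embedding similarity.

```python
from dataclasses import dataclass

@dataclass
class SpatialEvent:
    """Hypothetical index record for a spatial event extracted from a
    world-model representation rather than raw pixels."""
    video_id: str
    start_frame: int
    end_frame: int
    actor: str          # e.g. "car"
    action: str         # e.g. "turn_left"
    location: str       # e.g. "intersection"

def search(index, actor, action, location):
    """Exact-match query over structured spatial events."""
    return [e for e in index
            if e.actor == actor and e.action == action and e.location == location]

index = [
    SpatialEvent("vid_001", 120, 180, "car", "turn_left", "intersection"),
    SpatialEvent("vid_001", 300, 340, "pedestrian", "cross", "intersection"),
]
print(search(index, "car", "turn_left", "intersection"))
```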

    This changes VR and AR. Generated worlds become explorable spaces. You navigate through them instead of watching them.

    This changes agent training. Robots learn in simulated 3D environments before deployment. The sim-to-real gap narrows when simulations maintain physical consistency.

    The models still have limits. Five minutes is impressive but not enough for long-form content. Geometric consistency breaks down in complex scenes. Control remains imperfect.

    But the foundation is here. Video generation now builds worlds, not just sequences of images.

    Community + Shoutouts

    Mini Literature Survey: Shoutout to Danfei Xu from Nvidia for a literature survey thread covering recent vision-language-action (VLA) developments.
    Links: Thread


    That's a wrap for Multimodal Monday #38! This week brought speed breakthroughs (TurboDiffusion's 100-205x acceleration, KV-Tracker's 30 FPS training-free tracking), spatial breakthroughs (WorldPlay's interactive 3D worlds, N3D-VLM's native 3D reasoning), and control breakthroughs (Qwen-Image-Layered's RGBA decomposition, DeContext's edit protection). Multimodal AI is gaining the speed, spatial understanding, and precision control needed for production applications. The era of slow, 2D, uncontrollable generation is ending.

    Ready to build multimodal solutions that actually work? Let's talk