Multimodal Monday #38: World Models, Real-Time Generation
Week of December 16 - December 22, 2025: TurboDiffusion achieves 100-205x video generation speedup, WorldPlay generates interactive 3D worlds with geometric consistency, Step-GUI reaches SOTA on AndroidWorld & OSWorld benchmarks, and LongVie 2 produces 5-minute continuous videos.

Quick Hits (TL;DR)
Long Context Solved Multiple Ways. LongVie 2 generates 5-minute continuous videos through architectural improvements. MemFlow handles streaming video through adaptive memory that decides what to remember. WorldPlay maintains geometric consistency across extended sequences.
Video Generation Gets a Speed Boost. TurboDiffusion accelerates video diffusion models by 100-205 times, making real-time video generation practical for the first time. ReFusion bridges autoregressive models (ARMs) and masked diffusion models (MDMs) for efficient, high-performance text generation.
Layer-Based Control Emerges. Qwen-Image-Layered separates images into editable RGBA layers. StereoPilot handles depth-aware stereo conversion. Generative Refocusing controls focus layers. DeContext adds protection layers.
Tools, Models and Techniques
T5Gemma 2
Google released T5Gemma 2, the next generation of encoder-decoder models. The architecture combines bidirectional understanding with flexible text generation.

Why it matters: T5Gemma 2 handles tasks that require both deep comprehension and creative output in a single model.
Links: Blog | Model
Perception Encoder Audiovisual (PE-AV)
Meta released PE-AV, the technical engine behind SAM Audio’s audio separation capabilities. The model processes both visual and audio information to isolate individual sound sources.

Why it matters: PE-AV separates overlapping audio sources with unprecedented accuracy by understanding what objects should sound like.
Links: Announcement | Paper | Code
MiMo-V2-Flash
Xiaomi released MiMo-V2-Flash, optimized for speed in real-time applications. The model sacrifices some accuracy for dramatic latency reductions.
Links: Hugging Face | Report | Blog

TurboDiffusion
TurboDiffusion accelerates video diffusion models by 100-205 times through architectural optimizations. The speedup comes from reducing redundant computations without quality loss.
Why it matters: Video generation moves from minutes per frame to real-time speeds, enabling interactive applications.
Links: Hugging Face | GitHub | Paper
Qwen-Image-Layered
Qwen-Image-Layered decomposes images into multiple RGBA layers that can be independently edited. Each layer isolates specific semantic or structural components while maintaining visual coherence.
Why it matters: Image editing becomes precise and reversible when you control individual layers instead of manipulating pixels directly.
Links: Hugging Face | ModelScope | Paper | Announcement | Demo
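To make the layered idea concrete, here is a minimal sketch of editing a single layer and recompositing, assuming the model has already exported per-layer RGBA PNGs; the file names and the example edit are hypothetical, and this is plain alpha compositing rather than Qwen-Image-Layered's own tooling.

```python
from PIL import Image

def composite_layers(layer_paths):
    """Alpha-composite RGBA layers back into one image, bottom to top."""
    layers = [Image.open(p).convert("RGBA") for p in layer_paths]
    canvas = Image.new("RGBA", layers[0].size, (0, 0, 0, 0))
    for layer in layers:
        canvas = Image.alpha_composite(canvas, layer)
    return canvas

# Edit only the subject layer (hypothetical file names), then rebuild the image;
# the background and shadow layers are untouched.
subject = Image.open("layer_2_subject.png").convert("RGBA")
subject = subject.rotate(10)  # the "edit": rotate just this layer
subject.save("layer_2_subject_edited.png")

result = composite_layers([
    "layer_0_background.png",
    "layer_1_shadow.png",
    "layer_2_subject_edited.png",
])
result.convert("RGB").save("recomposed.png")
```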
N3D-VLM
N3D-VLM grounds spatial reasoning in native 3D representations rather than 2D projections. The model understands depth, distance, and spatial relationships directly.
Why it matters: Vision-language models gain accurate spatial understanding without the distortions inherent in 2D image analysis.
Links: GitHub | Website | Model
MemFlow
MemFlow maintains adaptive memory for long streaming videos, deciding which frames to remember and which to discard. The system balances memory efficiency with video understanding quality.
Why it matters: You can now process hours of video without exhausting memory or losing critical context.
Links: Website | Paper | Model
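For intuition, here is a minimal sketch of the keep-or-discard pattern behind adaptive frame memory: a bounded buffer that skips near-duplicate frames and evicts the most redundant entry when full. This illustrates the general idea only, not MemFlow's actual scoring or architecture.

```python
# Bounded, redundancy-aware frame memory (illustrative only).
import numpy as np

def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))

class FrameMemory:
    def __init__(self, capacity=64, dup_threshold=0.95):
        self.capacity = capacity
        self.dup_threshold = dup_threshold
        self.features = []  # one feature vector per remembered frame

    def add(self, feat):
        if self.features:
            if max(cosine(feat, f) for f in self.features) > self.dup_threshold:
                return  # nearly identical to something already remembered
        if len(self.features) >= self.capacity:
            # Evict the stored frame whose information is best covered by the rest,
            # i.e. the one most similar to its nearest neighbor in memory.
            redundancy = [
                max(cosine(f, o) for j, o in enumerate(self.features) if j != i)
                for i, f in enumerate(self.features)
            ]
            self.features.pop(int(np.argmax(redundancy)))
        self.features.append(feat)
```

A streaming pipeline would call `add` once per incoming frame with a feature vector from whatever encoder it already uses.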
WorldPlay
Tencent’s WorldPlay generates interactive 3D worlds with long-term geometric consistency. The model maintains spatial relationships across extended video sequences, allowing persistent interaction with generated environments.
Why it matters: AI-generated worlds become navigable spaces instead of static videos.
Links: Website | Paper | Model
LongVie 2
LongVie 2 generates 5-minute continuous videos with controllable elements and consistent geometry. The model handles multiple modalities and maintains coherence across thousands of frames.
Why it matters: Video generation moves from clips to scenes with the length and control needed for actual storytelling.
Links: Paper | Website | GitHub
FoundationMotion
FoundationMotion labels and analyzes spatial movement in videos automatically. The system identifies motion patterns and spatial trajectories without manual annotation.
Why it matters: Video datasets gain spatial movement labels at scale, enabling better motion understanding in future models.
Links: Paper | GitHub | Demo | Dataset | Models
Generative Refocusing
Generative Refocusing controls depth of field in existing images, simulating camera focus changes after capture. The model infers 3D scene structure to generate realistic blur patterns.
Why it matters: You gain photographer-level depth control without specialized camera equipment or multiple exposures.
Links: Website | Demo | Paper | GitHub
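As a point of reference, a crude depth-of-field baseline just blurs each pixel in proportion to how far its estimated depth sits from a chosen focal plane. The sketch below shows that depth-to-blur idea; it assumes a precomputed depth map aligned with the image and is not the paper's generative approach, which synthesizes realistic blur rather than approximating it.

```python
# Naive depth-based refocusing: pick from a stack of progressively blurred copies
# according to each pixel's distance from the focal plane.
import numpy as np
from PIL import Image, ImageFilter

def refocus(image, depth, focal_depth, max_radius=8, n_levels=6):
    depth = (depth - depth.min()) / (np.ptp(depth) + 1e-8)   # normalize depth to [0, 1]
    coc = np.abs(depth - focal_depth)                        # circle-of-confusion proxy
    coc = coc / (coc.max() + 1e-8)
    idx = np.minimum((coc * (n_levels - 1)).astype(int), n_levels - 1)
    levels = [np.asarray(image.filter(ImageFilter.GaussianBlur(r)), dtype=np.float32)
              for r in np.linspace(0, max_radius, n_levels)]
    out = np.zeros_like(levels[0])
    for i, blurred in enumerate(levels):
        out += (idx == i)[..., None] * blurred
    return Image.fromarray(out.astype(np.uint8))

img = Image.open("photo.jpg").convert("RGB")   # hypothetical inputs
depth_map = np.load("photo_depth.npy")         # (H, W) depth aligned with the image
refocus(img, depth_map, focal_depth=0.25).save("refocused.jpg")
```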
StereoPilot
StereoPilot converts 2D videos to stereo 3D through learned generative priors. The system produces depth-aware conversions suitable for VR headsets.
Why it matters: Existing 2D content becomes viewable in 3D without manual conversion or specialized filming.
Links: Website | Model | GitHub | Paper
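The classical baseline here is depth-image-based rendering: shift each pixel horizontally by a disparity derived from its depth to synthesize the second view. The sketch below shows that warp step under the assumption of a precomputed depth map; learned approaches like StereoPilot instead use generative priors to produce the second view, including the disoccluded content a naive warp cannot recover.

```python
# Depth-image-based rendering sketch: forward-warp pixels by depth-derived disparity.
import numpy as np

def synthesize_right_view(image, depth, max_disparity=24):
    """image: (H, W, 3) uint8; depth: (H, W) with larger values = farther away."""
    h, w = depth.shape
    inv = 1.0 / (depth + 1e-6)  # nearer pixels get larger disparity
    disparity = (max_disparity * (inv - inv.min()) / (inv.max() - inv.min() + 1e-8)).astype(int)
    right = np.zeros_like(image)
    cols = np.arange(w)
    for y in range(h):
        new_cols = np.clip(cols - disparity[y], 0, w - 1)
        right[y, new_cols] = image[y, cols]  # forward warp; disocclusions remain as holes
    return right
```

Pairing the original frame with the synthesized view gives a basic stereo pair; the black holes left by the warp are exactly what a generative converter has to fill in plausibly.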
LongCat-Video-Avatar: Website | GitHub | Paper | ComfyUI
Chatterbox Turbo: Hugging Face
TRELLIS 2: 3D generative model designed for high-fidelity image-to-3D generation. Links: Model | Demo
Map Anything (Meta): Model
Gemini 3 Flash (Google): Blog
FunctionGemma (Google): Model
Wan 2.6 (Alibaba): Announcement
/agent (Firebase) for agentic web scraping: Announcement
Research Highlights
KV-Tracker: Real-Time Pose Tracking with Transformers
KV-Tracker achieves real-time tracking at 30 FPS without any training. The approach uses transformer key-value pairs to track objects and scenes across frames.
Links: Website | Announcement
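The blurb above describes reusing transformer key-value pairs across frames; as a loose illustration of that kind of training-free matching, the sketch below scores a new frame's patch features against features cached from a reference frame. This is a generic attention-style matcher with a placeholder backbone, not KV-Tracker's actual algorithm.

```python
import torch
import torch.nn.functional as F

def match_against_cache(query_feats, cached_keys):
    """query_feats: (N, D) patch features of the new frame.
    cached_keys:  (M, D) patch features cached from the tracked reference region.
    Returns, for each new patch, how strongly it attends to the cached region."""
    q = F.normalize(query_feats, dim=-1)
    k = F.normalize(cached_keys, dim=-1)
    attn = torch.softmax(q @ k.T / q.shape[-1] ** 0.5, dim=-1)  # (N, M)
    return attn.max(dim=-1).values  # high score = likely part of the tracked object
```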
DeContext: Protecting Images from Unwanted In-Context Edits
DeContext adds imperceptible perturbations that prevent DiT models like FLUX and Qwen-Image from making unwanted edits. The protection preserves visual quality while blocking manipulation attempts.
Links: Website | Paper | GitHub
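Protective perturbations of this kind are typically found by constrained optimization against the editing model's internal features. The sketch below shows that generic PGD-style pattern; the loss and the `edit_model_features` hook are placeholders, not DeContext's actual objective.

```python
# PGD-style protective perturbation: push the editing model's features away from
# the clean image while keeping the pixel change within a small epsilon-ball.
import torch

def protect(image, edit_model_features, epsilon=8 / 255, step=1 / 255, iters=50):
    """image: (1, 3, H, W) tensor in [0, 1]; edit_model_features: placeholder feature hook."""
    clean_feats = edit_model_features(image).detach()
    delta = torch.zeros_like(image, requires_grad=True)
    for _ in range(iters):
        feats = edit_model_features(image + delta)
        loss = -torch.nn.functional.mse_loss(feats, clean_feats)  # minimize = push features apart
        loss.backward()
        with torch.no_grad():
            delta -= step * delta.grad.sign()             # signed gradient step
            delta.clamp_(-epsilon, epsilon)               # keep the perturbation imperceptible
            delta.add_(image).clamp_(0, 1).sub_(image)    # keep image + delta a valid image
        delta.grad.zero_()
    return (image + delta).detach()
```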

EgoX: Generate Immersive First-Person Video from Any Third-Person Clip
EgoX transforms third-person videos into realistic first-person perspectives using video diffusion. The framework from KAIST AI and Seoul National University maintains spatial and temporal coherence during the transformation.
Why it matters: Any third-person footage becomes immersive first-person content without specialized cameras or reshooting.
Links: Website | Paper | GitHub
MMGR: Multi-Modal Generative Reasoning
MMGR benchmarks reveal systematic reasoning failures in GPT-4o and other leading multimodal models. The evaluation exposes gaps between perception and logical inference.

Why it matters: Top models still struggle with basic reasoning tasks despite strong perceptual capabilities.
Links: Website | Paper
KlingAvatar 2.0 and Kling-Omni Technical Report
KlingAvatar 2.0 generates high-fidelity avatar videos through a spatio-temporal cascade framework. Kling-Omni provides a generalist framework for multimodal video generation with a Co-Reasoning Director that fuses instructions across modalities.
Links: Kling-Avatar Paper | Kling-Omni Paper | Try It
Step-GUI Technical Report
Step-GUI introduces a self-evolving pipeline for GUI automation. The system reaches state-of-the-art on AndroidWorld and OSWorld benchmarks through iterative improvement.
Links: Paper | Website | GitHub | Model

ReFusion
ReFusion bridges autoregressive and diffusion language models, pairing diffusion-style parallel decoding with autoregressive modeling for faster text generation.
Links: Paper

DEER
DEER uses diffusion models to draft content and autoregressive models to verify it. The two-stage approach balances generation quality with computational efficiency.
Why it matters: You get diffusion quality at autoregressive speeds by splitting generation and verification.
Links: Paper | Website
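The split mirrors speculative decoding: a fast drafter proposes a block of tokens and a stronger autoregressive verifier keeps the longest prefix it agrees with. The sketch below shows that generic draft-then-verify loop with placeholder `drafter` and `verifier` callables; it is not DEER's exact acceptance rule.

```python
# Generic draft-then-verify decoding loop (illustrative, placeholder models).
def generate(prompt_tokens, drafter, verifier, block_size=8, max_len=256):
    tokens = list(prompt_tokens)
    while len(tokens) < max_len:
        draft = drafter(tokens, n=block_size)  # fast block proposal (e.g., a diffusion drafter)
        accepted = 0
        for tok in draft:
            # Verifier predicts the next token given everything accepted so far.
            if verifier(tokens + draft[:accepted]) != tok:
                break
            accepted += 1
        tokens += draft[:accepted]
        if accepted < len(draft):
            # On the first mismatch, fall back to the verifier's own token.
            tokens.append(verifier(tokens))
    return tokens
```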
IC-Effect
IC-Effect applies video effects through in-context learning without fine-tuning. The system learns effect patterns from examples and applies them to new videos.
Links: Website | GitHub | Paper
Flow Map Trajectory Tilting
Flow Map Trajectory Tilting improves diffusion model outputs at test time using flow maps. The technique adjusts generation trajectories without retraining.
Links: Paper | Website

Trends & Predictions
World Models
Three major releases this week generate videos that understand 3D space. WorldPlay from Tencent maintains geometric consistency across extended sequences. LongVie 2 produces 5-minute continuous videos with controllable elements. Kling-Omni provides a generalist framework for multimodal video generation.
These models don’t just generate pixels. They maintain spatial relationships. When an object moves behind another object in frame 100, it stays behind that object in frame 500. The geometry persists.
This changes video search. You can now index the underlying 3D structure instead of raw pixels. A search for “car turning left at intersection” finds the spatial event, not just visual patterns that look like cars and intersections.
This changes VR and AR. Generated worlds become explorable spaces. You navigate through them instead of watching them.
This changes agent training. Robots learn in simulated 3D environments before deployment. The sim-to-real gap narrows when simulations maintain physical consistency.
The models still have limits. Five minutes is impressive but not enough for long-form content. Geometric consistency breaks down in complex scenes. Control remains imperfect.
But the foundation is here. Video generation now builds worlds, not just sequences of images.
Community + Shoutouts
Mini Literature Survey
Shoutout to Danfei Xu from Nvidia for a literature survey thread covering recent vision-language-action (VLA) developments.
Links: Thread

That's a wrap for Multimodal Monday #38! This week brought speed breakthroughs (TurboDiffusion's 100-205x acceleration, KV-Tracker's 30 FPS training-free tracking), spatial breakthroughs (WorldPlay's interactive 3D worlds, N3D-VLM's native 3D reasoning), and control breakthroughs (Qwen-Image-Layered's RGBA decomposition, DeContext's edit protection). Multimodal AI is gaining the speed, spatial understanding, and precision control needed for production applications. The era of slow, 2D, uncontrollable generation is ending.
Ready to build multimodal solutions that actually work? Let's talk
