Multimodal Monday #42: See, Click, Speak
Week of January 19 - January 25, 2026: D4RT turns video into 4D space, PersonaPlex enables full-duplex speech with persona control, EvoCUA hits #1 on OSWorld, and Linum V2 brings 720p video gen to 2B parameters.

📢 Quick Take (TL;DR)
- Video models unlock the fourth dimension - Google's D4RT and the new OmniTransfer framework are pushing video understanding beyond simple playback, enabling AI to perceive time, space, and motion as manipulable 4D data structures.
- Agents are learning to see and click - From Microsoft's Rho-alpha controlling robots to Meituan's EvoCUA mastering computer interfaces, we are seeing a surge in "action-first" multimodal models that translate vision directly into physical or digital execution.
- Speech becomes a two-way street - New models like NVIDIA's PersonaPlex and Qwen3-TTS are moving us towards real-time, full-duplex conversational AI that can listen, think, and speak with human-like latency and emotional nuance.
🛠️ Tools, Models and Techniques
PersonaPlex
NVIDIA released a real-time, full-duplex speech-to-speech model that lets you control the speaker's persona through text prompts. Built on the Moshi architecture, it handles simultaneous listening and speaking with low latency, allowing for natural interruptions and back-and-forth conversation.
Why it matters: True conversational AI needs to be more than just a fast transcriber. PersonaPlex shows that we can build systems that not only understand words but also embody specific personalities and handle the messy, overlapping nature of real human dialogue.
Research Page | GitHub
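To make the full-duplex idea concrete, here is a minimal sketch of the interaction loop, assuming a step-based model that consumes one microphone frame and emits one speaker frame per tick. The `PersonaPlexModel` class, its `step()` method, and the frame size are hypothetical placeholders, not NVIDIA's released API; the point is the tick-by-tick structure.

```python
# Minimal full-duplex loop sketch. PersonaPlexModel, step(), and the frame size are
# illustrative stand-ins, not the released API; the structure is what matters.
import numpy as np

FRAME = 1920  # assume 80 ms of 24 kHz audio per step

class PersonaPlexModel:
    """Stub duplex model: one mic frame in, one speaker frame out, every step."""
    def __init__(self, persona_prompt: str):
        self.persona = persona_prompt  # text prompt steering voice and personality

    def step(self, mic_frame: np.ndarray) -> np.ndarray:
        # A real model would update dialogue state from mic_frame and decide whether
        # to keep talking, stay silent, or yield to an interruption.
        if float(np.abs(mic_frame).mean()) > 0.1:     # user is speaking: don't talk over them
            return np.zeros(FRAME)
        return np.random.uniform(-0.05, 0.05, FRAME)  # placeholder output audio

model = PersonaPlexModel("A calm, dry-witted museum guide who keeps answers short.")
for _ in range(10):                                   # listening and speaking on every tick
    mic = np.random.uniform(-1, 1, FRAME)             # stand-in for microphone capture
    speaker = model.step(mic)                         # output audio for the same tick
```

Because input and output share the same clock, barge-in handling becomes a per-frame decision rather than a turn-taking protocol.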
Qwen3-TTS
This open-source text-to-speech model can clone voices, design new ones, and speak naturally in 10 languages in real time. Its dual-track architecture and custom audio tokenizers keep quality high while maintaining the speed needed for live applications.
Why it matters: High-quality, low-latency TTS is the missing link for voice agents. By open-sourcing a model that rivals proprietary APIs, Qwen3-TTS democratizes access to the kind of voice technology needed to build responsive, human-level assistants.
Models | Writeup
Linum V2
A new 2B parameter text-to-video model that generates high-quality 720p video from text prompts. Trained from scratch by a small team, it aims to make animation accessible and affordable without requiring massive compute clusters.
Why it matters: Video generation has been dominated by massive, closed-source models. Linum V2 proves that efficient, smaller-scale models can still deliver impressive results, opening the door for more developers to integrate video creation into their apps.
Launch Post | Hugging Face
OpenVision 3
A unified visual encoder designed for both understanding and generation tasks. It outperforms standard CLIP-based encoders in generative frameworks, proving that a single model can effectively handle the dual tasks of analyzing images and creating them.
Why it matters: Most multimodal systems use separate models for "seeing" and "drawing." OpenVision 3 suggests a more efficient future where a single "visual brain" handles both perception and creation, simplifying pipelines and potentially improving semantic consistency.
Paper | GitHub
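Here's a rough PyTorch sketch of the shared-encoder wiring, with made-up layer sizes and heads (classification for understanding, per-patch pixel reconstruction for generation). This is not OpenVision 3's actual architecture; it only shows how one set of visual tokens can serve both tasks.

```python
# Shared-encoder sketch: one encoder, two heads. Sizes and heads are illustrative
# placeholders, not the OpenVision 3 architecture.
import torch
import torch.nn as nn

class UnifiedVisualEncoder(nn.Module):
    def __init__(self, dim=256):
        super().__init__()
        self.embed = nn.Conv2d(3, dim, kernel_size=16, stride=16)   # patchify the image
        self.encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model=dim, nhead=8, batch_first=True),
            num_layers=4,
        )
        self.understand_head = nn.Linear(dim, 1000)                 # e.g. classification
        self.generate_head = nn.Linear(dim, 16 * 16 * 3)            # pixels per patch

    def forward(self, images):
        tokens = self.embed(images).flatten(2).transpose(1, 2)      # (B, N, dim)
        tokens = self.encoder(tokens)
        logits = self.understand_head(tokens.mean(dim=1))           # "seeing"
        patches = self.generate_head(tokens)                        # "drawing"
        return logits, patches

model = UnifiedVisualEncoder()
logits, patches = model(torch.randn(2, 3, 224, 224))
print(logits.shape, patches.shape)  # torch.Size([2, 1000]) torch.Size([2, 196, 768])
```

In a real system the generation side would feed a diffusion or autoregressive decoder rather than a linear pixel head, but the shared token stream is the idea.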
EvoCUA
Meituan's "Evolutionary Computer Use Agent" is now the #1 open-source model for controlling operating systems. It achieves 56.7% on the OSWorld benchmark by generating its own synthetic training tasks and learning from its failures in a sandbox environment.
Why it matters: Training agents to use computers is hard because real-world interaction data is scarce. EvoCUA's self-improving loop shows a path forward where agents teach themselves to navigate software, reducing the reliance on expensive human demonstrations.
Paper | GitHub
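The self-improving loop is easy to sketch. Everything below (the task generator, the sandbox, the policy) is a stub standing in for whatever EvoCUA actually runs; only the generate-attempt-learn cycle is the point.

```python
# Toy self-improvement loop in the spirit of EvoCUA. All components are stubs,
# not Meituan's implementation; the cycle is what the sketch illustrates.
import random

def propose_task(seed_tasks):
    """Mutate an existing task into a new synthetic one (stand-in for an LLM task generator)."""
    base = random.choice(seed_tasks)
    return base + f" (variant {random.randint(0, 999)})"

class Sandbox:
    """Stand-in for an OS-level sandbox that executes GUI actions and reports success."""
    def run(self, task, policy):
        actions = policy(task)
        return random.random() < 0.3, actions   # pretend ~30% of attempts succeed

def policy(task):
    return [("click", 100, 200), ("type", "hello")]  # placeholder action trace

seed_tasks = ["Open the settings panel", "Rename a file on the desktop"]
replay_buffer = []
for _ in range(100):
    task = propose_task(seed_tasks)
    success, trace = Sandbox().run(task, policy)
    replay_buffer.append({"task": task, "trace": trace, "success": success})
    # A real system would periodically fine-tune the policy on replay_buffer,
    # upweighting successful traces and corrected failures.
```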
- Rho-alpha: Microsoft's first robotics model derived from the Phi series, translating language into bimanual robot actions. Story
- Waypoint-1: A real-time interactive video diffusion model from Overworld. Blog
- VibeVoice ASR: A speech-to-text model for long-form audio with speaker identification and custom hotwords. Model
- Stitch MCP Server: Google's new tool for connecting Gemini to external data and skills via the Model Context Protocol. Docs
- RF-DETR: A state-of-the-art real-time segmentation model from Roboflow, released under Apache 2.0. Blog
- Remotion Skills: New MCP skills for the Remotion video creation framework. GitHub
- VIGA: An agent that converts images into 3D Blender code, treating vision as inverse graphics. Project Page
- LuxTTS: A lightweight TTS model that is 150x faster than real-time. GitHub
- LightOnOCR: A vision-language model for converting complex documents into clean, ordered text. Hugging Face
- Higgsfield AI Influencer Factory: A studio for creating consistent AI characters and videos. Studio
- Docs2Synth: A framework for training document retrieval systems using synthetic data. Paper
🧠 Research Highlights
D4RT: Seeing the World in 4D
Google DeepMind introduced D4RT, a unified model that turns video into 4D representations (3D space + time). Unlike previous methods that process video frame-by-frame, D4RT understands the entire spatio-temporal volume, allowing it to track objects and geometry consistently over time.
Why it matters: This is a step change for video understanding. Instead of just labeling pixels, the model builds a coherent 3D world model from 2D footage. For indexing, this means we can search for actions and objects based on their physical behavior in space, not just their visual appearance in a single frame.
Blog | Project Page
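One way to picture a 4D representation is as something you can query at any point in space and time. The toy track index below is invented for illustration (D4RT learns this structure end-to-end rather than storing explicit point tables), but it shows why a spatio-temporal volume beats per-frame labels for indexing: you can ask where an object was at an arbitrary timestamp, not just what one frame looked like.

```python
# Toy 4D (space + time) track index. Purely illustrative, not how D4RT works internally.
from collections import defaultdict
import numpy as np

class Track4D:
    def __init__(self):
        self.tracks = defaultdict(dict)   # object_id -> {t: (x, y, z)}

    def add(self, object_id, t, xyz):
        self.tracks[object_id][t] = np.asarray(xyz, dtype=float)

    def position(self, object_id, t):
        """Linearly interpolate an object's 3D position at an arbitrary time t."""
        times = sorted(self.tracks[object_id])
        before = max((u for u in times if u <= t), default=times[0])
        after = min((u for u in times if u >= t), default=times[-1])
        if before == after:
            return self.tracks[object_id][before]
        w = (t - before) / (after - before)
        return (1 - w) * self.tracks[object_id][before] + w * self.tracks[object_id][after]

index = Track4D()
index.add("cup", t=0.0, xyz=(0.2, 0.0, 1.5))
index.add("cup", t=1.0, xyz=(0.5, 0.0, 1.4))
print(index.position("cup", 0.5))   # [0.35 0.   1.45] -- where the cup was mid-motion
```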
HERMES: Faster Streaming Video Understanding
HERMES uses the KV cache as a hierarchical memory system to speed up streaming video analysis. It achieves 10x faster time-to-first-token and reduces video tokens by 68% without sacrificing accuracy, making it feasible to run high-performance video understanding in real-time.
Why it matters: Video is heavy. Processing every frame kills latency and blows up costs. HERMES shows that we can be selective about what we keep in memory, discarding redundant visual information while retaining the context needed to answer questions. This is crucial for building responsive video search and alert systems.
Paper
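The keep-recent, compress-old idea can be shown with a toy memory. The window sizes and mean-pooling rule below are assumptions for illustration; HERMES learns what to keep rather than following a fixed rule, but the effect on token counts is the same in spirit.

```python
# Toy hierarchical memory for a video stream: recent frames kept at full resolution,
# older frames mean-pooled into compact summaries. Illustrative only.
import numpy as np

class HierarchicalVideoMemory:
    def __init__(self, recent_window=8, pool_size=4, dim=512):
        self.recent_window = recent_window   # frames kept verbatim
        self.pool_size = pool_size           # how many old summaries collapse into one
        self.recent, self.summaries = [], []
        self.dim = dim

    def add_frame(self, frame_tokens):
        self.recent.append(frame_tokens)
        if len(self.recent) > self.recent_window:
            old = self.recent.pop(0)
            self.summaries.append(old.mean(axis=0, keepdims=True))   # compress to 1 token
            if len(self.summaries) >= self.pool_size:                # compress the summaries too
                merged = np.concatenate(self.summaries).mean(axis=0, keepdims=True)
                self.summaries = [merged]

    def context_tokens(self):
        chunks = self.summaries + self.recent
        return np.concatenate(chunks) if chunks else np.zeros((0, self.dim))

mem = HierarchicalVideoMemory()
for _ in range(100):                      # 100 frames x 64 tokens = 6,400 raw tokens
    mem.add_frame(np.random.randn(64, 512))
print(mem.context_tokens().shape)         # only a few hundred tokens survive in the cache
```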
OmniTransfer
This all-in-one framework handles "spatio-temporal video transfer," meaning it can transfer styles, motion, and effects from one video to another. It can animate static images or completely restyle a video clip while preserving the original motion dynamics.
Why it matters: Generative video is moving beyond "text-to-video" to "video-to-video." OmniTransfer gives creators granular control over the look and feel of their content, allowing for complex edits and transformations that previously required manual VFX work.
Project Page | Paper
Think3D: Thinking with Space
Think3D demonstrates that smaller multimodal models can significantly improve their spatial reasoning skills without extra training, simply by using "tool-augmented spatial exploration." It teaches the model to use external tools to verify its spatial understanding, leading to more human-like 3D reasoning.
Why it matters: We often assume that reasoning requires massive scale. This paper shows that how a model thinks (and the tools it uses) matters just as much. For spatial indexing, this suggests we can get better results by equipping smaller, faster models with the right geometric tools rather than just throwing more parameters at the problem.
Paper
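A minimal sketch of tool-augmented spatial reasoning, assuming the "tools" are simple geometric routines the model can call to check a guess. The function names and the reasoning stub are invented; the takeaway is the verify-with-geometry loop, not any specific API.

```python
# Tool-augmented spatial reasoning sketch: propose an answer, then verify it with a
# geometric tool before committing. All names here are illustrative.
import numpy as np

def tool_distance(p, q):
    """Geometric tool: Euclidean distance between two 3D points."""
    return float(np.linalg.norm(np.asarray(p) - np.asarray(q)))

def answer_spatial_question(objects, question):
    # A VLM would normally guess directly from pixels; here the decision routes
    # through a tool call so the spatial claim is checked against actual geometry.
    if question == "Which object is closer to the camera?":
        camera = (0.0, 0.0, 0.0)
        distances = {name: tool_distance(camera, xyz) for name, xyz in objects.items()}
        return min(distances, key=distances.get), distances
    raise NotImplementedError

objects = {"mug": (0.3, 0.1, 1.2), "laptop": (0.0, 0.2, 2.5)}   # 3D positions from a depth tool
answer, evidence = answer_spatial_question(objects, "Which object is closer to the camera?")
print(answer, evidence)   # 'mug', with measured distances as verifiable evidence
```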
CoDance
A new model that animates characters in images based on text prompts and pose sequences. It uses an "unbind-rebind" paradigm to separate the character from their original spatial context, allowing for flexible re-posing and animation even in complex scenes with multiple subjects.
Project Page | Paper
- Knots to Knobs: Using sparse autoencoders to make collaborative filtering steerable. Paper
- Flow Policies in Robotics: A new method using Adjoint Matching to train flow policies for robot control. Paper
- 360Anything: Lifting standard images and videos into 360-degree geometries without geometry priors. Project Page
- UniX: Unifying autoregression and diffusion for chest X-ray understanding and generation. Paper
- PROGRESSLM: A study revealing that VLMs struggle with progress estimation, and a new model to fix it. Paper
📈 Trends & Predictions
The "Action-First" Multimodal Agent
This week, we saw a clear trend towards agents that do more than just chat—they act. Microsoft's Rho-alpha translates language directly into robot arm movements. Meituan's EvoCUA learns to control operating systems by trial and error. VIGA converts images into executable Blender code.
Why it matters:
- Direct Execution: We are moving from models that describe the world to models that manipulate it. The output of these models isn't text; it's a control signal, a mouse click, or a script.
- Synthetic Training Loops: EvoCUA's success highlights a critical shift: self-play and synthetic environments are becoming the primary way to train agents. We can't collect enough human data for every possible computer task, so agents must learn by doing.
- The "Inverse Graphics" Approach: VIGA's approach of treating vision as "inverse graphics" (reconstructing the code that could have created the image) is a powerful paradigm for 3D generation, bridging the gap between pixel data and structured 3D assets.
🧩 Community + Shoutouts
- WebGPU Pocket TTS: Shoutout to @ekzhang1 for porting Kyutai's Pocket TTS to WebGPU. You can now run high-quality text-to-speech directly in your browser with zero server latency. Demo
- VibeComfy: A huge thanks to PetersOdyssey for building VibeComfy, a CLI tool that lets Claude Code understand and edit your ComfyUI workflows. This is a game-changer for automating complex node graphs. Reddit Discussion
- Natural Speech Tutorial: Shoutout to @Mho_23 for a comprehensive tutorial on generating natural-sounding speech using the latest open and closed-source models. A must-watch for anyone building voice apps. X Post
That's a wrap for Multimodal Monday #42! The theme this week is capability unlocks: EvoCUA and Rho-alpha show agents that execute rather than describe, D4RT and OmniTransfer reveal video as an editable 4D canvas, and PersonaPlex and Qwen3-TTS finally make voice interfaces feel like real conversations. We're watching multimodal AI cross the threshold from perception to intervention.
Ready to build multimodal solutions that actually work? Let's talk
