Multimodal Monday #42: See, Click, Speak
Week of January 19 - January 25, 2026: D4RT turns video into 4D space, PersonaPlex enables full-duplex speech with persona control, EvoCUA hits #1 on OSWorld, and Linum V2 brings 720p video gen to 2B parameters.

📢 Quick Take (TL;DR)
- Video models unlock the fourth dimension - Google's D4RT and the new OmniTransfer framework are pushing video understanding beyond simple playback, enabling AI to perceive time, space, and motion as manipulable 4D data structures.
- Agents are learning to see and click - From Microsoft's Rho-alpha controlling robots to Meituan's EvoCUA mastering computer interfaces, we are seeing a surge in "action-first" multimodal models that translate vision directly into physical or digital execution.
- Speech becomes a two-way street - New models like NVIDIA's PersonaPlex and Qwen3-TTS are moving us towards real-time, full-duplex conversational AI that can listen, think, and speak with human-like latency and emotional nuance.
🛠️ Tools, Models and Techniques
PersonaPlex
NVIDIA released a real-time, full-duplex speech-to-speech model that lets you control the speaker's persona through text prompts. Built on the Moshi architecture, it handles simultaneous listening and speaking with low latency, allowing for natural interruptions and back-and-forth conversation.
Why it matters: True conversational AI needs to be more than just a fast transcriber. PersonaPlex shows that we can build systems that not only understand words but also embody specific personalities and handle the messy, overlapping nature of real human dialogue.
Research Page | GitHub
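To make the full-duplex idea concrete, here is a minimal sketch of the interaction loop, assuming a step-based model that consumes one microphone frame and emits one speaker frame per tick. The `PersonaPlexModel` class, its `step()` method, and the frame size are hypothetical placeholders, not NVIDIA's released API; the point is the tick-by-tick structure.

```python
# Minimal full-duplex loop sketch. PersonaPlexModel, step(), and the frame size are
# illustrative stand-ins, not the released API; the structure is what matters.
import numpy as np

FRAME = 1920  # assume 80 ms of 24 kHz audio per step

class PersonaPlexModel:
    """Stub duplex model: one mic frame in, one speaker frame out, every step."""
    def __init__(self, persona_prompt: str):
        self.persona = persona_prompt  # text prompt steering voice and personality

    def step(self, mic_frame: np.ndarray) -> np.ndarray:
        # A real model would update dialogue state from mic_frame and decide whether
        # to keep talking, stay silent, or yield to an interruption.
        if float(np.abs(mic_frame).mean()) > 0.1:     # user is speaking: don't talk over them
            return np.zeros(FRAME)
        return np.random.uniform(-0.05, 0.05, FRAME)  # placeholder output audio

model = PersonaPlexModel("A calm, dry-witted museum guide who keeps answers short.")
for _ in range(10):                                   # listening and speaking on every tick
    mic = np.random.uniform(-1, 1, FRAME)             # stand-in for microphone capture
    speaker = model.step(mic)                         # output audio for the same tick
```

Because input and output share the same clock, barge-in handling becomes a per-frame decision rather than a turn-taking protocol.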
Qwen3-TTS
This open-source text-to-speech model can clone voices, design new ones, and speak naturally in 10 languages in real time. Its dual-track architecture and custom audio tokenizers keep quality high while maintaining the speed needed for live applications.
Why it matters: High-quality, low-latency TTS is the missing link for voice agents. By open-sourcing a model that rivals proprietary APIs, Qwen3-TTS democratizes access to the kind of voice technology needed to build responsive, human-level assistants.
Models | Writeup
Linum V2
A new 2B parameter text-to-video model that generates high-quality 720p video from text prompts. Trained from scratch by a small team, it aims to make animation accessible and affordable without requiring massive compute clusters.
Why it matters: Video generation has been dominated by massive, closed-source models. Linum V2 proves that efficient, smaller-scale models can still deliver impressive results, opening the door for more developers to integrate video creation into their apps.
Launch Post | Hugging Face
OpenVision 3
A unified visual encoder designed for both understanding and generation tasks. It outperforms standard CLIP-based encoders in generative frameworks, proving that a single model can effectively handle the dual tasks of analyzing images and creating them.
Why it matters: Most multimodal systems use separate models for "seeing" and "drawing." OpenVision 3 suggests a more efficient future where a single "visual brain" handles both perception and creation, simplifying pipelines and potentially improving semantic consistency.
Paper | GitHub
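Here's a rough PyTorch sketch of the shared-encoder wiring, with made-up layer sizes and heads (classification for understanding, per-patch pixel reconstruction for generation). This is not OpenVision 3's actual architecture; it only shows how one set of visual tokens can serve both tasks.

```python
# Shared-encoder sketch: one encoder, two heads. Sizes and heads are illustrative
# placeholders, not the OpenVision 3 architecture.
import torch
import torch.nn as nn

class UnifiedVisualEncoder(nn.Module):
    def __init__(self, dim=256):
        super().__init__()
        self.embed = nn.Conv2d(3, dim, kernel_size=16, stride=16)   # patchify the image
        self.encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model=dim, nhead=8, batch_first=True),
            num_layers=4,
        )
        self.understand_head = nn.Linear(dim, 1000)                 # e.g. classification
        self.generate_head = nn.Linear(dim, 16 * 16 * 3)            # pixels per patch

    def forward(self, images):
        tokens = self.embed(images).flatten(2).transpose(1, 2)      # (B, N, dim)
        tokens = self.encoder(tokens)
        logits = self.understand_head(tokens.mean(dim=1))           # "seeing"
        patches = self.generate_head(tokens)                        # "drawing"
        return logits, patches

model = UnifiedVisualEncoder()
logits, patches = model(torch.randn(2, 3, 224, 224))
print(logits.shape, patches.shape)  # torch.Size([2, 1000]) torch.Size([2, 196, 768])
```

In a real system the generation side would feed a diffusion or autoregressive decoder rather than a linear pixel head, but the shared token stream is the idea.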
EvoCUA
Meituan's "Evolutionary Computer Use Agent" is now the #1 open-source model for controlling operating systems. It achieves 56.7% on the OSWorld benchmark by generating its own synthetic training tasks and learning from its failures in a sandbox environment.
Why it matters: Training agents to use computers is hard because real-world interaction data is scarce. EvoCUA's self-improving loop shows a path forward where agents teach themselves to navigate software, reducing the reliance on expensive human demonstrations.
Paper | GitHub
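The self-improving loop is easy to sketch. Everything below (the task generator, the sandbox, the policy) is a stub standing in for whatever EvoCUA actually runs; only the generate-attempt-learn cycle is the point.

```python
# Toy self-improvement loop in the spirit of EvoCUA. All components are stubs,
# not Meituan's implementation; the cycle is what the sketch illustrates.
import random

def propose_task(seed_tasks):
    """Mutate an existing task into a new synthetic one (stand-in for an LLM task generator)."""
    base = random.choice(seed_tasks)
    return base + f" (variant {random.randint(0, 999)})"

class Sandbox:
    """Stand-in for an OS-level sandbox that executes GUI actions and reports success."""
    def run(self, task, policy):
        actions = policy(task)
        return random.random() < 0.3, actions   # pretend ~30% of attempts succeed

def policy(task):
    return [("click", 100, 200), ("type", "hello")]  # placeholder action trace

seed_tasks = ["Open the settings panel", "Rename a file on the desktop"]
replay_buffer = []
for _ in range(100):
    task = propose_task(seed_tasks)
    success, trace = Sandbox().run(task, policy)
    replay_buffer.append({"task": task, "trace": trace, "success": success})
    # A real system would periodically fine-tune the policy on replay_buffer,
    # upweighting successful traces and corrected failures.
```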
- Rho-alpha: Microsoft's first robotics model derived from the Phi series, translating language into bimanual robot actions. Story
- Waypoint-1: A real-time interactive video diffusion model from Overworld. Blog
- VibeVoice ASR: A speech-to-text model for long-form audio with speaker identification and custom hotwords. Model
- Stitch MCP Server: Google's new tool for connecting Gemini to external data and skills via the Model Context Protocol. Docs
- RF-DETR: A state-of-the-art real-time segmentation model from Roboflow, released under Apache 2.0. Blog
- Remotion Skills: New MCP skills for the Remotion video creation framework. GitHub
- VIGA: An agent that converts images into 3D Blender code, treating vision as inverse graphics. Project Page
- LuxTTS: A lightweight TTS model that is 150x faster than real-time. GitHub
- LightOnOCR: A vision-language model for converting complex documents into clean, ordered text. Hugging Face
- Higgsfield AI Influencer Factory: A studio for creating consistent AI characters and videos. Studio
- Docs2Synth: A framework for training document retrieval systems using synthetic data. Paper
🧠 Research Highlights
D4RT: Seeing the World in 4D
Google DeepMind introduced D4RT, a unified model that turns video into 4D representations (3D space + time). Unlike previous methods that process video frame-by-frame, D4RT understands the entire spatio-temporal volume, allowing it to track objects and geometry consistently over time.
Why it matters: This is a step change for video understanding. Instead of just labeling pixels, the model builds a coherent 3D world model from 2D footage. For indexing, this means we can search for actions and objects based on their physical behavior in space, not just their visual appearance in a single frame.
Blog | Project Page
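One way to picture a 4D representation is as something you can query at any point in space and time. The toy track index below is invented for illustration (D4RT learns this structure end-to-end rather than storing explicit point tables), but it shows why a spatio-temporal volume beats per-frame labels for indexing: you can ask where an object was at an arbitrary timestamp, not just what one frame looked like.

```python
# Toy 4D (space + time) track index. Purely illustrative, not how D4RT works internally.
from collections import defaultdict
import numpy as np

class Track4D:
    def __init__(self):
        self.tracks = defaultdict(dict)   # object_id -> {t: (x, y, z)}

    def add(self, object_id, t, xyz):
        self.tracks[object_id][t] = np.asarray(xyz, dtype=float)

    def position(self, object_id, t):
        """Linearly interpolate an object's 3D position at an arbitrary time t."""
        times = sorted(self.tracks[object_id])
        before = max((u for u in times if u <= t), default=times[0])
        after = min((u for u in times if u >= t), default=times[-1])
        if before == after:
            return self.tracks[object_id][before]
        w = (t - before) / (after - before)
        return (1 - w) * self.tracks[object_id][before] + w * self.tracks[object_id][after]

index = Track4D()
index.add("cup", t=0.0, xyz=(0.2, 0.0, 1.5))
index.add("cup", t=1.0, xyz=(0.5, 0.0, 1.4))
print(index.position("cup", 0.5))   # [0.35 0.   1.45] -- where the cup was mid-motion
```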
HERMES: Faster Streaming Video Understanding
HERMES uses the KV cache as a hierarchical memory system to speed up streaming video analysis. It achieves 10x faster time-to-first-token and reduces video tokens by 68% without sacrificing accuracy, making it feasible to run high-performance video understanding in real-time.
Why it matters: Video is heavy. Processing every frame kills latency and blows up costs. HERMES shows that we can be selective about what we keep in memory, discarding redundant visual information while retaining the context needed to answer questions. This is crucial for building responsive video search and alert systems.
Paper
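The keep-recent, compress-old idea can be shown with a toy memory. The window sizes and mean-pooling rule below are assumptions for illustration; HERMES learns what to keep rather than following a fixed rule, but the effect on token counts is the same in spirit.

```python
# Toy hierarchical memory for a video stream: recent frames kept at full resolution,
# older frames mean-pooled into compact summaries. Illustrative only.
import numpy as np

class HierarchicalVideoMemory:
    def __init__(self, recent_window=8, pool_size=4, dim=512):
        self.recent_window = recent_window   # frames kept verbatim
        self.pool_size = pool_size           # how many old summaries collapse into one
        self.recent, self.summaries = [], []
        self.dim = dim

    def add_frame(self, frame_tokens):
        self.recent.append(frame_tokens)
        if len(self.recent) > self.recent_window:
            old = self.recent.pop(0)
            self.summaries.append(old.mean(axis=0, keepdims=True))   # compress to 1 token
            if len(self.summaries) >= self.pool_size:                # compress the summaries too
                merged = np.concatenate(self.summaries).mean(axis=0, keepdims=True)
                self.summaries = [merged]

    def context_tokens(self):
        chunks = self.summaries + self.recent
        return np.concatenate(chunks) if chunks else np.zeros((0, self.dim))

mem = HierarchicalVideoMemory()
for _ in range(100):                      # 100 frames x 64 tokens = 6,400 raw tokens
    mem.add_frame(np.random.randn(64, 512))
print(mem.context_tokens().shape)         # only a few hundred tokens survive in the cache
```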
OmniTransfer
This all-in-one framework handles "spatio-temporal video transfer," meaning it can transfer styles, motion, and effects from one video to another. It can animate static images or completely restyle a video clip while preserving the original motion dynamics.
Why it matters: Generative video is moving beyond "text-to-video" to "video-to-video." OmniTransfer gives creators granular control over the look and feel of their content, allowing for complex edits and transformations that previously required manual VFX work.
Project Page | Paper
Think3D: Thinking with Space
Think3D demonstrates that smaller multimodal models can significantly improve their spatial reasoning skills without extra training, simply by using "tool-augmented spatial exploration." It teaches the model to use external tools to verify its spatial understanding, leading to more human-like 3D reasoning.
Why it matters: We often assume that reasoning requires massive scale. This paper shows that how a model thinks (and the tools it uses) matters just as much. For spatial indexing, this suggests we can get better results by equipping smaller, faster models with the right geometric tools rather than just throwing more parameters at the problem.
Paper
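A minimal sketch of tool-augmented spatial reasoning, assuming the "tools" are simple geometric routines the model can call to check a guess. The function names and the reasoning stub are invented; the takeaway is the verify-with-geometry loop, not any specific API.

```python
# Tool-augmented spatial reasoning sketch: propose an answer, then verify it with a
# geometric tool before committing. All names here are illustrative.
import numpy as np

def tool_distance(p, q):
    """Geometric tool: Euclidean distance between two 3D points."""
    return float(np.linalg.norm(np.asarray(p) - np.asarray(q)))

def answer_spatial_question(objects, question):
    # A VLM would normally guess directly from pixels; here the decision routes
    # through a tool call so the spatial claim is checked against actual geometry.
    if question == "Which object is closer to the camera?":
        camera = (0.0, 0.0, 0.0)
        distances = {name: tool_distance(camera, xyz) for name, xyz in objects.items()}
        return min(distances, key=distances.get), distances
    raise NotImplementedError

objects = {"mug": (0.3, 0.1, 1.2), "laptop": (0.0, 0.2, 2.5)}   # 3D positions from a depth tool
answer, evidence = answer_spatial_question(objects, "Which object is closer to the camera?")
print(answer, evidence)   # 'mug', with measured distances as verifiable evidence
```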
CoDance
A new model that animates characters in images based on text prompts and pose sequences. It uses an "unbind-rebind" paradigm to separate the character from their original spatial context, allowing for flexible re-posing and animation even in complex scenes with multiple subjects.
Project Page | Paper
- Knots to Knobs: Using sparse autoencoders to make collaborative filtering steerable. Paper
- Flow Policies in Robotics: A new method using Adjoint Matching to train flow policies for robot control. Paper
- 360Anything: Lifting standard images and videos into 360-degree geometries without geometry priors. Project Page
- UniX: Unifying autoregression and diffusion for chest X-ray understanding and generation. Paper
- PROGRESSLM: A study revealing that VLMs struggle with progress estimation, and a new model to fix it. Paper
📈 Trends & Predictions
The "Action-First" Multimodal Agent
This week, we saw a clear trend towards agents that do more than just chat—they act. Microsoft's Rho-alpha translates language directly into robot arm movements. Meituan's EvoCUA learns to control operating systems by trial and error. VIGA converts images into executable Blender code.
Why it matters:
- Direct Execution: We are moving from models that describe the world to models that manipulate it. The output of these models isn't text; it's a control signal, a mouse click, or a script.
- Synthetic Training Loops: EvoCUA's success highlights a critical shift: self-play and synthetic environments are becoming the primary way to train agents. We can't collect enough human data for every possible computer task, so agents must learn by doing.
- The "Inverse Graphics" Approach: VIGA's approach of treating vision as "inverse graphics" (reconstructing the code that could have created the image) is a powerful paradigm for 3D generation, bridging the gap between pixel data and structured 3D assets.
🧩 Community + Shoutouts
- WebGPU Pocket TTS: Shoutout to @ekzhang1 for porting Kyutai's Pocket TTS to WebGPU. You can now run high-quality text-to-speech directly in your browser with zero server latency. Demo
- VibeComfy: A huge thanks to PetersOdyssey for building VibeComfy, a CLI tool that lets Claude Code understand and edit your ComfyUI workflows. This is a game-changer for automating complex node graphs. Reddit Discussion
- Natural Speech Tutorial: Shoutout to @Mho_23 for a comprehensive tutorial on generating natural-sounding speech using the latest open and closed-source models. A must-watch for anyone building voice apps. X Post
That's a wrap for Multimodal Monday #42! The theme this week is capability unlocks: EvoCUA and Rho-alpha show agents that execute rather than describe, D4RT and OmniTransfer reveal video as an editable 4D canvas, and PersonaPlex and Qwen3-TTS finally make voice interfaces feel like real conversations. We're watching multimodal AI cross the threshold from perception to intervention.
Ready to build multimodal solutions that actually work? Let's talk
