Multimodal Monday 33: Physical AI, Human Vision
Week of November 10 - November 16, 2025: Pelican-VL gives humanoid robots spatial intelligence, DeepMind teaches AI to see like humans, Marble creates 3D worlds from single images, and Meta opens speech recognition to 1,600+ languages.

📢 Quick Hits (TL;DR)
AI learns through action, not observation.
Traditional AI watches and classifies. Pelican-VL’s DPPO training lets robots practice and self-correct, SIMA 2 maintains goals across gaming sessions, and Holo2 navigates interfaces conceptually. All three learn by doing.
Vision AI groups by meaning, not appearance.
DeepMind’s odd-one-out method stops AI from grouping bananas with yellow cars. OmniVinci fuses vision, audio, and language in one space with 6x less training data. Both understand what things are, not just how they look.
Single images generate complete 3D worlds.
Fei-Fei Li’s World Labs Marble creates walkable environments from one photo. Depth Anything 3 extracts depth from any 2D image. PAN simulates nested physical interactions. Every image can now train spatial AI.
🧠 Research Highlights
UniVA: Universal Video Agent
UniVA works like LEGO for video AI: you plug in whatever tools you need. The demo shows it tracking objects, editing footage, and understanding complex scenes all in one system.
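The paper doesn't reduce its plumbing to a few lines, but the plug-in idea is easy to picture: each video tool is a named callable the agent can dispatch to. A minimal sketch under that assumption; the tool names and signatures below are illustrative, not UniVA's actual interface.

```python
# Minimal sketch of a plug-in tool registry for a video agent.
# Tool names and signatures are illustrative, not UniVA's real API.
from typing import Callable, Dict

TOOLS: Dict[str, Callable[..., str]] = {}

def register(name: str):
    """Decorator that adds a tool to the agent's registry."""
    def wrap(fn: Callable[..., str]) -> Callable[..., str]:
        TOOLS[name] = fn
        return fn
    return wrap

@register("track_objects")
def track_objects(video_path: str) -> str:
    return f"tracked objects in {video_path}"  # placeholder tracker

@register("caption_scene")
def caption_scene(video_path: str) -> str:
    return f"caption for {video_path}"  # placeholder captioner

def run_agent(plan: list) -> list:
    """Execute a plan as a sequence of (tool_name, kwargs) steps."""
    return [TOOLS[name](**kwargs) for name, kwargs in plan]

print(run_agent([("track_objects", {"video_path": "clip.mp4"}),
                 ("caption_scene", {"video_path": "clip.mp4"})]))
```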
Phys2Real: Sim-to-Real Transfer
This method trains robots in simulation then transfers that knowledge to the real world by accounting for real-world messiness. The robot learns what it doesn’t know and adapts accordingly.
Links: Project Page | Paper | Twitter
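The paper's recipe is more involved, but the core intuition (act cautiously where your learned dynamics are unsure) can be sketched with an ensemble: disagreement between ensemble members stands in for the real-world messiness the simulator didn't capture. Everything below is an illustrative toy, not Phys2Real's implementation.

```python
# Toy sketch: use ensemble disagreement as an uncertainty signal when
# transferring a sim-trained controller to the real world.
import numpy as np

rng = np.random.default_rng(0)

# Ensemble of dynamics models, each trained on slightly different sim data
# (random linear models standing in for learned networks).
ensemble = [np.eye(2) + 0.1 * rng.normal(size=(2, 2)) for _ in range(5)]

def predict_next(state: np.ndarray, action: np.ndarray):
    """Return mean prediction and ensemble disagreement (average std)."""
    preds = np.stack([A @ state + action for A in ensemble])
    return preds.mean(axis=0), preds.std(axis=0).mean()

def cautious_action(state: np.ndarray, nominal_action: np.ndarray) -> np.ndarray:
    """Shrink the action where the ensemble disagrees (high uncertainty)."""
    _, disagreement = predict_next(state, nominal_action)
    scale = 1.0 / (1.0 + 5.0 * disagreement)  # more disagreement -> smaller step
    return nominal_action * scale

print(cautious_action(np.array([1.0, -0.5]), np.array([0.4, 0.4])))
```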
Pelican-VL 1.0: The Embodied Intelligence Brain
Beijing’s Pelican-VL converts what robots see directly into 3D movement commands. Their DPPO training method works like human practice: make mistakes, reflect, improve.
Links: Project Page | Paper | GitHub | Hugging Face
OmniVinci: Omni-Modal Understanding LLM
NVIDIA’s OmniVinci processes vision, audio, and language in one unified space. It beats Qwen2.5-Omni by 19% while using 6x less training data.

Links: Project Page | Paper | Model
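NVIDIA hasn't boiled the architecture down to a snippet, but the "one unified space" idea amounts to per-modality encoders whose outputs get projected into a shared latent the language model reads as extra tokens. A hedged PyTorch sketch; the dimensions and module names are made up for illustration, not OmniVinci's actual design.

```python
# Sketch of fusing vision, audio, and text features into one shared
# embedding space before handing them to an LLM. Dims are illustrative.
import torch
import torch.nn as nn

class OmniFusion(nn.Module):
    def __init__(self, vis_dim=1024, aud_dim=768, txt_dim=4096, shared_dim=4096):
        super().__init__()
        # One projection per modality into the shared (LLM) embedding space
        self.vis_proj = nn.Linear(vis_dim, shared_dim)
        self.aud_proj = nn.Linear(aud_dim, shared_dim)
        self.txt_proj = nn.Linear(txt_dim, shared_dim)

    def forward(self, vis_tokens, aud_tokens, txt_tokens):
        # Project each modality, then concatenate along the sequence axis
        return torch.cat([
            self.vis_proj(vis_tokens),
            self.aud_proj(aud_tokens),
            self.txt_proj(txt_tokens),
        ], dim=1)  # shape: (batch, n_vis + n_aud + n_txt, shared_dim)

model = OmniFusion()
out = model(torch.randn(1, 16, 1024), torch.randn(1, 8, 768), torch.randn(1, 32, 4096))
print(out.shape)  # torch.Size([1, 56, 4096])
```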
Teaching AI to See the World More Like We Do
DeepMind used an “odd-one-out” test to show how differently AI sees things compared to humans. Their three-step alignment method fixes this, making AI group concepts the way you naturally would.
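The odd-one-out test itself is simple to reproduce with any embedding model: show three items, and the odd one is whichever is least similar to the other two; alignment is then just agreement with human picks. A minimal numpy sketch of that scoring step (the embeddings here are random stand-ins, not DeepMind's models or data).

```python
# Minimal odd-one-out evaluation: the odd item is the one least similar
# to the other two. Compare model picks with human picks to measure alignment.
import numpy as np

rng = np.random.default_rng(42)

def odd_one_out(embs: np.ndarray) -> int:
    """embs: (3, d) L2-normalized embeddings. Returns index of the odd item."""
    sims = embs @ embs.T                      # pairwise cosine similarities
    np.fill_diagonal(sims, 0.0)
    # The odd item has the lowest total similarity to the other two.
    return int(np.argmin(sims.sum(axis=1)))

# Fake triplet: banana, yellow car, taxi (random vectors for illustration)
triplet = rng.normal(size=(3, 64))
triplet /= np.linalg.norm(triplet, axis=1, keepdims=True)

model_pick = odd_one_out(triplet)
human_pick = 0  # e.g., humans say the banana is the odd one out
print("model:", model_pick, "human:", human_pick, "agree:", model_pick == human_pick)
```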
RF-DETR: First real-time segmentation model to beat YOLO models.
Links: Paper | GitHub | Hugging Face
Meta Omnilingual ASR: Speech recognition for 1,600+ languages in one model. Links: Blog Post | GitHub | Twitter
The Value of Personalized Recommendations: Netflix data shows how recommendation algorithms actually work. Links: Paper
🛠️ Tools, Models and Techniques
SIMA 2
Google’s SIMA 2 plays games with you, learns through trial and error, and actually reasons about what to do. Talk to it through text, voice, or images; it understands high-level goals and figures out how to achieve them.
Why it matters: Your next gaming buddy will be an AI that actually understands the game.
Links: Blog Post | Twitter
Depth Anything 3 (DA3)
DA3 generates depth maps from regular images with unprecedented accuracy. The demo shows it working on everything from selfies to satellite imagery.
Why it matters: Every 2D image can now become 3D data for your applications.
Links: Project Page | Paper | GitHub | Hugging Face | Twitter
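If you just want depth maps in your pipeline today, the Hugging Face depth-estimation pipeline is the shortest path. The checkpoint id below is a placeholder (DA3's exact Hugging Face id isn't pinned here), so swap in whichever Depth Anything checkpoint you're actually using.

```python
# Quick depth-map extraction via the Hugging Face depth-estimation pipeline.
# The model id is a placeholder; substitute the Depth Anything checkpoint you use.
from transformers import pipeline
from PIL import Image

depth = pipeline("depth-estimation", model="depth-anything/Depth-Anything-V2-Small-hf")

image = Image.open("room.jpg")          # any ordinary 2D photo
result = depth(image)

result["depth"].save("room_depth.png")  # PIL image of the predicted depth map
print(result["predicted_depth"].shape)  # raw depth tensor
```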
Marble
World Labs’ Marble creates persistent 3D worlds from a single image, video, or text prompt. Upload a photo of your living room, get a walkable 3D space.
Why it matters: 3D content creation just became as simple as taking a photo.
Links: Website | Blog Post | Twitter
Holo2
H-Company’s Holo2 leads computer-use benchmarks across web, desktop, and mobile. Drop it into your existing Holo setup and it works immediately on Ubuntu, Android, or Chrome.
Links: Blog Post | GitHub | Hugging Face
Music Flamingo
NVIDIA’s Music Flamingo understands full songs, not just clips. It analyzes music structure, identifies instruments, and reasons about compositions.
Why it matters: AI finally understands music the way musicians do.
Links: Project Page | Paper | Hugging Face | Demo
PAN: General world model simulates physical, agentic, and nested worlds. Links: Demo | Twitter
ERNIE-4.5-VL-28B-A3B-Thinking: Baidu’s natively omni-modal foundation model. Links: Hugging Face | Demo | Twitter
Llama-Embed-Nemotron-8B: NVIDIA’s universal text embedding for 100+ languages. Links: Paper | Hugging Face
DeepEyesV2: Multimodal agent that writes and runs code while searching the web. Links: Project Page | Paper | Hugging Face
Maya1: Create any voice from text. Links: Demo
📈 Trends & Predictions
The Perception-to-Action Gap Closes
This week shows three distinct approaches to the same problem: how do you get AI to actually do things, not just understand them?
Pelican-VL tackles this for robotics with its DPPO training method: the model practices tasks, fails, analyzes what went wrong, then adjusts. Think of it like teaching a robot to play piano: it doesn’t just memorize finger positions; it learns the relationship between what it sees and how to move. The Beijing team tested this on real humanoid robots doing manipulation tasks, and the results show genuine spatial reasoning emerging from visual input alone.
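The DPPO details live in the paper, but the practice-fail-reflect-adjust loop it describes can be sketched at a high level: roll out the policy, count the failures, turn them into a corrective update, repeat. The code below is a schematic toy of that cycle, not Pelican-VL's training code.

```python
# Schematic practice loop: act, collect failures, learn from the corrections.
# A toy stand-in for the practice/reflect/adjust cycle, not Pelican-VL's DPPO.
import random

random.seed(0)
skill = 0.2  # toy "policy quality" in [0, 1]

def rollout(skill: float) -> bool:
    """Attempt the task; higher skill means higher success probability."""
    return random.random() < skill

def reflect(failures: int) -> float:
    """Turn failures into a corrective update (more failures -> bigger update)."""
    return 0.02 * failures

for practice_round in range(10):
    failures = sum(not rollout(skill) for _ in range(20))  # 20 practice attempts
    skill = min(1.0, skill + reflect(failures))            # adjust from mistakes
    print(f"round {practice_round}: failures={failures}, skill={skill:.2f}")
```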
SIMA 2 solves this in virtual environments. Google’s agent doesn’t just execute commands; it maintains persistent goals across gaming sessions, reasons about cause and effect, and learns new skills without being explicitly programmed. When you tell it “build a house,” it figures out it needs to gather materials first, find a good location, and plan the structure. This kind of multi-step reasoning with environmental feedback is new.
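That "build a house" behavior is, at heart, goal decomposition with feedback: break the instruction into subgoals, attempt each, and replan when the environment pushes back. A hedged toy of that loop; nothing here resembles SIMA 2's actual planner or game interface.

```python
# Toy goal-decomposition loop with environment feedback and replanning.
# Illustrative only; SIMA 2's planner and game API look nothing like this.
PLAN = {"build a house": ["find flat ground", "place walls", "place roof"]}
PREREQUISITE = {"place walls": "gather wood"}  # what to do when a step fails

def attempt(subgoal: str, inventory: set) -> bool:
    """Pretend environment: walls need wood, gathering wood adds it."""
    if subgoal == "gather wood":
        inventory.add("wood")
        return True
    if subgoal == "place walls":
        return "wood" in inventory
    return True

def run(goal: str) -> None:
    inventory: set = set()
    for subgoal in PLAN[goal]:
        if attempt(subgoal, inventory):
            print(f"{subgoal}: done")
            continue
        # Feedback loop: the step failed, so satisfy its prerequisite and retry
        fix = PREREQUISITE[subgoal]
        print(f"{subgoal}: failed -> {fix}, then retry")
        attempt(fix, inventory)
        print(f"{subgoal}: {'done' if attempt(subgoal, inventory) else 'failed'}")

run("build a house")
```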
Holo2 brings this to computer interfaces. It’s not using predefined clicking patterns or UI maps; the model understands interface elements conceptually. It knows what a button does, not just where it is. H-Company’s benchmarks show it handling complex workflows across different operating systems without specific training for each one.
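A computer-use agent like this is usually wrapped in a screenshot-to-action loop: capture the screen, ask the model where to click or what to type, execute, repeat. The skeleton below assumes a hypothetical `propose_action` call standing in for the model; it is not H-Company's API.

```python
# Skeleton of a screenshot -> model -> UI action loop.
# `propose_action` is a hypothetical stand-in for a Holo2-style model call.
from dataclasses import dataclass

@dataclass
class Action:
    kind: str          # "click", "type", or "done"
    x: int = 0
    y: int = 0
    text: str = ""

def propose_action(screenshot: bytes, goal: str, step: int) -> Action:
    """Placeholder for the model; returns canned actions for illustration."""
    plan = [Action("click", 120, 48), Action("type", text="weekly report"), Action("done")]
    return plan[min(step, len(plan) - 1)]

def run_task(goal: str, max_steps: int = 10) -> None:
    for step in range(max_steps):
        screenshot = b""                       # capture_screen() in a real agent
        action = propose_action(screenshot, goal, step)
        print(f"step {step}: {action}")
        if action.kind == "done":              # model signals task completion
            break

run_task("open the docs and search for 'weekly report'")
```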
What connects these three? They’re all moving beyond the traditional pipeline of “perceive → classify → decide → act” toward integrated systems where perception and action inform each other continuously. The models learn by doing, not just by observing. This feedback loop between action and understanding is what makes these systems actually useful in unpredictable real-world scenarios.
The technical breakthrough here is handling uncertainty through interaction. Instead of needing perfect understanding before acting, these systems act to improve their understanding. That’s fundamentally different from how we’ve built AI systems until now.
Community + Shoutouts
dLLM
Zhanhui Zhou turned BERT into a chatbot using diffusion. Yes, you read that right: BERT can now chat.
Links: GitHub | Report | Hugging Face
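The trick behind diffusion-style generation with a masked LM is iterative unmasking: start from a fully masked reply and repeatedly fill in the position the model is most confident about. A toy sketch of that decoding idea using vanilla bert-base-uncased; it will produce rough text and is not the dLLM repo's training recipe.

```python
# Toy iterative-unmasking decoder with a masked LM: start fully masked,
# fill the most confident position each step. Not the dLLM repo's actual code.
import torch
from transformers import AutoTokenizer, AutoModelForMaskedLM

tok = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForMaskedLM.from_pretrained("bert-base-uncased").eval()

prompt = "the weather today is"
reply_len = 6
ids = tok(prompt, return_tensors="pt")["input_ids"][0, :-1]           # drop [SEP]
masked = torch.full((reply_len,), tok.mask_token_id)
seq = torch.cat([ids, masked, torch.tensor([tok.sep_token_id])]).unsqueeze(0)

with torch.no_grad():
    for _ in range(reply_len):
        logits = model(seq).logits[0]
        mask_pos = (seq[0] == tok.mask_token_id).nonzero().flatten()
        probs = logits[mask_pos].softmax(-1)
        conf, tokens = probs.max(-1)
        best = conf.argmax()                                           # most confident slot
        seq[0, mask_pos[best]] = tokens[best]                          # unmask it

print(tok.decode(seq[0], skip_special_tokens=True))
```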
Next Scene LoRA
OdinLovis built a LoRA that adds camera movement to image generation. Type “Next Scene” and watch your static image become a cinematic sequence.
Links: Hugging Face
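Using a trigger-phrase LoRA like this with diffusers follows the usual pattern: load a base pipeline, attach the LoRA weights, and put the trigger in the prompt. The repo ids below are placeholders since the exact base model and weight files aren't listed here.

```python
# Typical pattern for a trigger-phrase LoRA with diffusers.
# Repo ids are placeholders; swap in the actual base model and LoRA weights.
import torch
from diffusers import AutoPipelineForText2Image

pipe = AutoPipelineForText2Image.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0",  # placeholder base model
    torch_dtype=torch.float16,
).to("cuda")

# Attach the LoRA; the trigger phrase goes directly in the prompt.
pipe.load_lora_weights("OdinLovis/next-scene-lora")  # placeholder repo id

image = pipe(
    "Next Scene: the camera pulls back to reveal the city skyline at dusk"
).images[0]
image.save("next_scene.png")
```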
That’s a wrap for Multimodal Monday #33! From robots that understand space with Pelican-VL, to AI that sees concepts like humans via DeepMind, to instant 3D worlds through Marble, this week redefined what multimodal means. Add Meta’s 1,600-language ASR and NVIDIA’s Music Flamingo understanding full songs, and you’re looking at AI systems that perceive, reason, and act across every modality.
