Multimodal Monday #44: Agents Ship Code, Robots Skateboard
Feb 2 - 9: GPT-5.3-Codex and Claude Opus 4.6 handle full software lifecycles from debug to deploy, MiniCPM-o 4.5 beats GPT-4o on vision tasks at 9B parameters running on-device, HUSKY skateboards using physics-based control, and TinyLoRA fine-tunes models with a single parameter.

Quick Take (TL;DR)
- Coding agents ship whole products now - GPT-5.3-Codex and Claude Opus 4.6 don't just write code. They debug, deploy, and build standalone tools like a local media cropper, start to finish, without a human touching a terminal.
- Generated video has to obey physics now - Roblox's 4D Cube Foundation Model creates interactive 3D objects, and the HUSKY humanoid robot skateboards using physics-based control. The line between "looks real" and "acts real" is disappearing.
- Flagship performance runs on your phone - MiniCPM-o 4.5 beats GPT-4o on vision benchmarks at 9B parameters, and TinyLoRA proves you can fine-tune a model with a single trainable parameter. You no longer need a data center to run serious multimodal AI.
Tools, Models and Techniques
Kling 3.0 - Kling AI launched its "all-in-one creative engine" for native multimodal creation, positioning it as an accessible tool for high-quality video generation without production budgets. Why it matters: Another strong entry in the race to make video creation as easy as writing a prompt. X Post
Claude Opus 4.6 - Anthropic's new flagship has a 1M token context window and major upgrades in coding and reasoning. It plans complex workflows, debugs its own mistakes, and handles long-running autonomous tasks without losing the thread. Why it matters: A model that catches its own errors and stays coherent across massive contexts makes sustained agentic work practical, not just possible. News
Nemotron ColEmbed V2 - NVIDIA released visual document retrieval models (3B, 4B, 8B) that set new state-of-the-art scores, with the 8B model topping the ViDoRe V3 benchmark by 3%. These are purpose-built for finding information inside scanned documents and PDFs. Why it matters: Specialized visual embeddings now meaningfully outperform generic ones for document-heavy workflows. Paper | Hugging Face
GPT-5.3-Codex - OpenAI's latest coding agent runs 25% faster than its predecessor and handles deployment, app building, and debugging as integrated capabilities. OpenAI used the model to help develop and deploy itself. Why it matters: This closes the gap between "AI writes code" and "AI ships software." Blog
MiniCPM-o 4.5 - A 9B parameter multimodal model built for phones that supports real-time bilingual voice conversations and outperforms larger proprietary models on vision tasks. It runs entirely on-device with no cloud dependency. Why it matters: If the model runs locally, your data never leaves your phone, which unlocks sensitive use cases in health, finance, and personal conversations. Hugging Face
- Grok Imagine 1.0: xAI enters the multimodal video generation space with a new API.
- VK-LSVD: A massive dataset of 40 billion user interactions for short-video recommendation. Hugging Face
Research Highlights
Heterogeneous Computing for AI Agents - This paper introduces "Operational Intensity" and "Capacity Footprint" as better metrics for AI workloads and shows that memory capacity, not bandwidth or compute, is often the real bottleneck for agent inference. Why it matters: If you're building agent infrastructure, you're probably optimizing for the wrong hardware constraint. Paper
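To make the capacity argument concrete, here is a minimal back-of-the-envelope sketch. It does not reproduce the paper's definitions, which this roundup does not spell out. It assumes a roofline-style operational intensity (FLOPs per byte moved) and treats capacity footprint as weights plus KV cache. The model shape (8B parameters, 32 layers, 8 KV heads, head dimension 128, fp16) and all function names are hypothetical, chosen only for illustration.

```python
def kv_cache_bytes(layers, kv_heads, head_dim, context_tokens, bytes_per_value=2):
    """Bytes needed to cache keys and values for one sequence."""
    # Factor of 2 because both keys and values are stored per layer.
    return 2 * layers * kv_heads * head_dim * context_tokens * bytes_per_value


def capacity_footprint_bytes(param_count, bytes_per_param, kv_bytes):
    """Assumed footprint: model weights plus the KV cache of one sequence."""
    return param_count * bytes_per_param + kv_bytes


def operational_intensity(flops, bytes_moved):
    """Roofline-style intensity: arithmetic work per byte of memory traffic."""
    return flops / bytes_moved


# Hypothetical 8B-parameter model served in fp16 (2 bytes per value).
PARAMS = 8_000_000_000
weight_bytes = PARAMS * 2

# A long agent session: 128k tokens of accumulated context.
kv_bytes = kv_cache_bytes(layers=32, kv_heads=8, head_dim=128, context_tokens=128_000)
footprint = capacity_footprint_bytes(PARAMS, 2, kv_bytes)

# Rough decode-step estimate at batch size 1: about 2 FLOPs per parameter,
# while every weight and every cached key/value is read once.
decode_flops = 2 * PARAMS
intensity = operational_intensity(decode_flops, weight_bytes + kv_bytes)

print(f"weights:   {weight_bytes / 1e9:.1f} GB")
print(f"KV cache:  {kv_bytes / 1e9:.1f} GB")
print(f"footprint: {footprint / 1e9:.1f} GB")
print(f"intensity: {intensity:.2f} FLOPs/byte")
```

Under these assumptions the KV cache for a single long session is already slightly larger than the weights, and the decode step does well under one FLOP per byte moved. That is the kind of workload where adding compute does little and memory capacity decides how many agent sessions fit on a device.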
HUSKY: Skateboarding Humanoid - A humanoid robot uses physics-based control to balance and maneuver on a skateboard in the real world, mimicking human agility and stability. It handles the fast, unstable dynamics that most robots can't touch. Why it matters: If a robot handles skateboarding, it can handle the unpredictable physics of most real-world environments. Project Page
Context Forcing - A technique for generating consistent long-form video that keeps characters and backgrounds stable across many frames. It directly addresses the "morphing" problem where faces and objects drift between shots. Why it matters: Consistency is what separates a demo from a usable tool for storytelling and production. Project Page
InfoTok: Shared Visual Tokenization - This paper proposes a unified visual tokenization mechanism for multimodal LLMs, using information regularization to create shared tokens that work for both understanding and generation. Why it matters: A single visual vocabulary for seeing and creating removes the overhead of running separate encoders for each task. Paper
SwimBird - A framework that lets models dynamically switch reasoning modes between vision and text, choosing the best modality for each step of a problem. It improves both flexibility and performance on complex multi-step tasks. Why it matters: A model that knows when to look and when to read solves problems faster than one locked into a single modality. Project Page
- 3D-Aware Implicit Motion Control: For view-adaptive human video generation. Project Page
- InterPrior: Scaling generative control for physics-based human-object interactions. Paper
- Distribution-Aware Embedding: For streaming numerical features in click-through rate prediction. Paper
- MissMAC-Bench: A benchmark for robustness under missing modalities in emotion recognition. Paper
- HORAI & MM-TS: A foundation model and billion-point dataset for multimodal time-series analysis. Paper
- ERNIE 5.0: Technical report on Baidu's latest foundation model. Paper
- TinyLoRA: A method from Meta FAIR to fine-tune models with as few as one trainable parameter. Paper
- Col-Bandit: Zero-shot query-time pruning for faster late-interaction retrieval. Paper
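For readers new to the late-interaction retrieval that Col-Bandit speeds up, the usual scoring rule (the MaxSim operator popularized by ColBERT) can be sketched as follows. This is not Col-Bandit's pruning method, only the baseline scoring it would need to accelerate. The toy two-dimensional vectors and function names are illustrative assumptions.

```python
def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def maxsim_score(query_vecs, doc_vecs):
    """Late-interaction score: for each query token embedding, take its
    best-matching document token embedding, then sum over query tokens."""
    return sum(max(dot(q, d) for d in doc_vecs) for q in query_vecs)


# Toy embeddings: one vector per token (real systems use many more dimensions).
query = [[1.0, 0.0], [0.0, 1.0]]
doc_a = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
doc_b = [[0.5, 0.5]]

# Rank documents by descending MaxSim score.
ranked = sorted(
    [("doc_a", maxsim_score(query, doc_a)), ("doc_b", maxsim_score(query, doc_b))],
    key=lambda item: item[1],
    reverse=True,
)
print(ranked)
```

Each document costs one dot product per query-token and document-token pair, so the work grows with both lengths. That cost is why pruning comparisons at query time is attractive.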
Physics Meets Generative Video
Roblox's 4D Cube and HUSKY point to a clear trend: generated content needs to behave correctly, not just look correct.
What this means for you:
- Creation, not just consumption. Roblox's model lets players create interactive 3D objects from prompts. Generative AI moves from making things you watch to making things you use.
- Seeing translates to moving. HUSKY proves that visual understanding transfers to complex physical control. This is the link between perception models and robots that actually do useful work.
- Physics as a quality bar. Techniques like Context Forcing and InterPrior enforce physical consistency in generated video. Training simulators, virtual production, and game content all need this.
Community + Shoutouts
- Lingbot World Launcher: Shoutout to @zast57 for creating a 1-click Gradio launcher for the Lingbot World Model. It makes testing this powerful model accessible to anyone with a GPU. X Post
- Kling 3.0 Guide: Thanks to @fal for sharing a comprehensive prompting guide for Kling 3.0. As these models get more complex, knowing how to talk to them is a skill in itself. Blog
- Cropper: A local, private media cropper built entirely by GPT-5.3-Codex. X Post
- Pet Video Fun: The community is having a blast with LTX-2, using it to animate pet photos. A fun reminder that the best AI applications are often the ones that make you smile. Reddit
That's a wrap for Multimodal Monday #44! From GPT-5.3-Codex building a standalone media cropper without human help, to a 9B model outperforming GPT-4o on your phone, to a humanoid robot skateboarding on physics-based control alone, this week is about removing the things that kept multimodal AI theoretical: the need for human oversight of code, cloud dependency for inference, and physically implausible generated content. One by one, the blockers are falling.
Ready to build multimodal solutions that actually work? Let's talk
