Multimodal Monday #22: Spatial Crisis, Trust Bottleneck

Week of August 18-24, 2025

Quick Takes (TL;DR)

• MLLMs can't tell left from right - Even GPT-5 and o3 fail at kindergarten-level tasks: distinguishing 90° from 270° rotations. Meanwhile, MCOUT proves we've been thinking about AI reasoning all wrong, ditching language for continuous thought vectors yields 8% better performance.

• Open-source just ate closed-source's lunch - Intern-S1's 241B parameter model beats GPT and Claude at molecular synthesis and scientific tasks, proving specialized beats generalized. The era of "bigger is better" might be ending.

• Trust is the trillion-dollar problem - MultiTrust-X reveals current MLLMs are ticking time bombs across safety, fairness, and privacy. The gap between capability and trustworthiness is so wide, we're building Ferraris with bicycle brakes.

🧠 Research Highlights

Multimodal Chain of Continuous Thought for Latent-Space Reasoning in Vision-Language Models

MCOUT ditches language-based reasoning entirely, instead using continuous hidden vectors that evolve iteratively, like thoughts without words. The approach achieves up to 8.23% accuracy improvements on challenging benchmarks by reasoning directly in multimodal latent space rather than forcing everything through the bottleneck of language tokens.

With 1B parameters, MCOUT outperforms Kosmos-2 and MiniGPT4 variants, and nearly matches OpenFlamingo-9B and Qwen-VL, demonstrating strong efficiency in college-level reasoning.

Why It Matters: This paradigm shift could revolutionize how AI systems understand and connect information across text, images, and audio, no translation required.
Links: Paper | Code

Learning to Steer: Input-dependent Steering for Multimodal LLMs

L2S introduces context-aware behavior modification that adapts to each specific input rather than applying blanket safety rules. The system uses a small auxiliary module to predict query-specific steering vectors, enabling nuanced responses, like knowing when to refuse harmful requests versus redirecting medical questions to professionals.

Overview of L2S: during a first training phase (left), for each sample, input-dependent contrastive prompts (T + X and T − X ) are appended to the prompt and passed in teacher forcing mode through the LLM. The last token of the concatenated prompt for a layer L ∗ , as well as The last token of the base prompt at another layer L ′ are used to extract the steering vector. This steering vector is then modeled through the auxiliary network g. At inference time (right), this predicted steering vector is used to allow lightweight, input-dependent, behavior-specific correction of the model’s output.

Why It Matters: Finally, AI safety that isn't one-size-fits-all, critical for deploying systems that can handle diverse real-world scenarios without being overly restrictive.
Links: Paper

RotBench: Evaluating Multimodal Large Language Models on Identifying Image Rotation

This embarrassingly simple benchmark asks models to identify if an image is rotated 0°, 90°, 180°, or 270°, and they spectacularly fail. Even GPT-5, o3, and Gemini-2.5-Pro can't reliably distinguish between 90° and 270° rotations, exposing a fundamental blind spot in spatial reasoning that no amount of prompting or fine-tuning can fix.

They present two ROTBENCH images: one (left) to Gemini-2.5-Pro, the other (right) to GPT-5. Humans can easily identify the correct rotation of the two images, but both models fail to do so.

Why It Matters: If AI can't tell left from right, how can we trust it with autonomous driving, robotics, or any task requiring spatial awareness?
Links: Paper

SPANER: Shared Prompt Aligner for Multimodal Semantic Representation

SPANER creates truly unified semantic spaces where text, images, and audio cluster by meaning rather than modality using shared prompts as conceptual anchors. The framework is inherently extensible, new modalities plug in without architectural changes, achieving competitive few-shot retrieval while maintaining semantic coherence.

Why It Matters: This enables search that actually understands that a photo of a barking dog, the sound of barking, and the text "dog barking" are all the same concept.
Links: Paper

Unveiling Trust in Multimodal Large Language Models: Evaluation, Analysis, and Mitigation

MultiTrust-X benchmarks MLLMs across 32 tasks covering truthfulness, robustness, safety, fairness, and privacy, and the results are alarming. Current models show massive vulnerabilities, with multimodal training actually amplifying risks present in base LLMs, while most mitigation strategies create unexpected trade-offs that compromise utility.

Framework of MultiTrust-X, including aspect division of trustworthiness, evaluation strategy of multimodal-related risks, and categorization of mitigation methods from a machine learning perspective. Specifically, they implement 32 tasks to comprehensively benchmark both modern MLLMs and existing approaches, which are exemplified in the grids between the axes of aspect and method, to provide practical insights for future improvements in trustworthy MLLMs.

Why It Matters: This research proves we're nowhere near deployment-ready for high-stakes applications, current MLLMs are brilliant but fundamentally untrustworthy.
Links: Paper

Can we Evaluate RAGs with Synthetic Data?

This study reveals that synthetic benchmarks work reasonably well for testing retrieval configurations but completely fail when comparing different generator architectures. The breakdown occurs because synthetic questions tend toward technical specificity while human questions are ambiguous and general, a mismatch that yields misleading evaluation results.

Why It Matters: Using synthetic data to evaluate RAG systems is like training for a marathon on a treadmill, it might help, but it won't prepare you for real terrain.
Links: Paper | GitHub

🛠️ Tools & Techniques

Intern-S1: A Scientific Multimodal Foundation Model

Intern-S1 packs 241B parameters (28B activated) trained on 5 trillion tokens, with half specifically from scientific domains, achieving SOTA performance in molecular synthesis and crystal stability prediction. The model uses Mixture-of-Rewards to train on 1000+ tasks simultaneously, surpassing closed-source models in specialized scientific applications.

Why It Matters: Open-source just proved it can beat Big Tech at their own game, when you focus on doing one thing exceptionally well.
Links: Paper | HuggingFace | GitHub

ElevenLabs launches Eleven Music API (August 18, 2025)

ElevenLabs expanded from voice to music with the first commercially-licensed AI music generation API, already powering VEED, AMP, and Hedra Labs. The API generates both vocal and instrumental tracks across any genre from text descriptions, with full commercial rights cleared through partnerships with labels and artists.

Why It Matters: AI-generated content just went mainstream, this isn't a demo, it's production-ready infrastructure with real customers and proper licensing.
Links: Blog Post | Announcement

InScene: Flux Kontext LoRA for Scene Consistency

InScene solves the holy grail of AI image generation, maintaining scene consistency across multiple shots, using just 394 carefully curated image pairs. The LoRA enables variations while preserving backgrounds, characters, and visual styles, with 4,432 downloads last month alone proving its practical utility.

Why It Matters: Finally, AI that can generate a coherent visual story instead of random disconnected images.
Links: HuggingFace | Announcement | Dataset

SAM 2 and MatchAnything Integration into HuggingFace Transformers

Meta's SAM 2 brings segment-anything capabilities to video with frame-to-frame object tracking, while MatchAnything enables robust cross-modal visual matching. Both now integrate seamlessly with HuggingFace Transformers, complete with optimized inference pipelines and comprehensive documentation.

Why It Matters: State-of-the-art computer vision just became as easy to use as from transformers import SAM2 democratization at its finest.
Links: SAM 2 Integration | MatchAnything Integration

📈 Trends & Predictions

The Spatial Reasoning Crisis: Why MLLMs Can't Tell Left from Right

RotBench's revelation that even GPT-5 can't distinguish 90° from 270° rotations isn't just embarrassing, it's existential. This fundamental inability to understand spatial relationships exposes a critical flaw in how vision-language models process visual information. They're pattern matchers, not spatial reasoners.

The implications are staggering. Autonomous vehicles need to understand that a "turn left" sign rotated 180° means something entirely different. Robotic systems need spatial awareness to manipulate objects. AR/VR applications require understanding of 3D space and orientation. Yet our most advanced models fail at tasks a toddler could handle.

This suggests we need entirely new architectures—perhaps specialized spatial reasoning modules or fundamentally different approaches to visual encoding. The current transformer-based architectures might have hit a wall that no amount of scaling can overcome. Expect to see hybrid systems emerging that combine general multimodal models with specialized spatial reasoning components, similar to how the human brain has distinct regions for spatial processing.

Trust: The Trillion-Dollar Bottleneck

MultiTrust-X's comprehensive evaluation reveals an uncomfortable truth: we've built Formula 1 race cars with no brakes. Current MLLMs are incredibly capable but fundamentally untrustworthy across every critical dimension, safety, fairness, privacy, robustness, and truthfulness.

This trust gap is becoming the primary barrier to commercial deployment. Enterprises won't adopt systems that might hallucinate financial advice, generate biased hiring recommendations, or leak private information. The liability risks are astronomical. One high-profile failure could set the entire industry back years.

The solution won't come from post-hoc patches or safety fine-tuning. We need trustworthiness built into the architecture from day one. Expect to see new model designs that prioritize verifiability over capability, systems that can explain their reasoning in auditable ways, and architectures that enforce safety constraints at the computational level rather than through training.

The winners in the multimodal AI race won't be those with the highest benchmark scores, they'll be those who first achieve "boring" reliability. The company that builds the Toyota Camry of AI (reliable, safe, predictable) will capture more value than those building AI Ferraris that occasionally drive off cliffs.

🧩 Community + Shoutouts

Reality Check Award - Massive props to the RotBench team for having the audacity to ask "can AI identify rotated images?" and exposing that our emperor has no clothes. Sometimes the simplest questions reveal the biggest problems.

Open Source Heroes - Intern-S1 team deserves a standing ovation for proving open-source can compete with, and beat, Big Tech's closed models. This isn't just a technical achievement; it's a victory for accessible AI.

Builder Spotlight - Shoutout to @fofrAI for their GPT-5 + ffmpeg video editor that actually works, from basic cuts to complex transformations via natural language.

Also props to @xenovacom for the WebGPU-accelerated DINOv3 demo proving browser-based computer vision isn't just possible, it's fast.

That's a wrap for Multimodal Monday #22! This week proved that bigger isn't always better, trust is harder than intelligence, and somehow, we've built AI that can write symphonies but can't tell if a picture is sideways.

Ready to build multimodal solutions that actually work? Let's talk