Multimodal Monday 36: Factual Recall, Real-Time Video
Week of Dec 1-7, 2025: Research reveals why 11 of 14 VLMs fail at factual recall, BookRAG adds hierarchical structure to document retrieval, Adobe's RELIC and Alibaba's Reward Forcing enable real-time interactive video, and Microsoft's 0.5B parameter TTS model runs in real time.

Quick Hits (TL;DR)
VLMs are failing at basic factual recall. New research identifies why 11 out of 14 vision-language models perform worse than their text-only backbones: they form entity representations too late in processing. If you're building multimodal RAG systems, this explains the performance gap you've been seeing.
Real-time video generation arrives. Adobe's RELIC and Alibaba's Reward Forcing both generate interactive video in real time. You can now create and modify video content on the fly.
Small models keep getting faster & more efficient. Microsoft released a 0.5B parameter text-to-speech model that runs in real time. We're watching the pattern repeat: what required billions of parameters last year now works with half a billion.
Tools, Models and Techniques
Live Avatar
Alibaba's streaming system generates real-time audio-driven avatars with infinite length. The system handles continuous audio input and produces synchronized avatar animations without length constraints.
Why it matters: Streaming architecture removes the artificial time limits from avatar generation.
Links: Website | Paper | GitHub | Hugging Face
ViBT: The First Vision Bridge Transformer at 20B Parameters
ViBT models data-to-data translation directly by learning trajectory distributions for conditional generation. It runs up to 4x faster than comparable models while handling image and video generation in a unified framework.
Why it matters: Direct trajectory modeling eliminates intermediate representation steps that slow down generation.
Links: Website | Paper | GitHub | Demo | Model
Runway Gen-4.5
Runway's latest model improves motion quality, prompt adherence, and visual fidelity. The update focuses on temporal coherence and fine-grained control over generated motion.
Why it matters: Video generation quality now matches what you'd expect from manually edited footage.
Links: Twitter
Stable Video Infinite 2.0
Version 2.0 extends video generation length while maintaining consistency. The open-source release includes model weights and inference code.
Why it matters: Open alternatives to commercial video models keep improving.
Links: Hugging Face | GitHub | KJ version
VibeVoice-Realtime-0.5B
Microsoft's 0.5B parameter text-to-speech model runs in real time with low latency. The model trades parameter count for inference speed while maintaining voice quality.
Why it matters: Real-time speech synthesis now works on devices that couldn't run previous models.
Links: Hugging Face | Demo
Lux: Computer use model optimized for speed. Links: Website | SDK | Blog
YingVideo-MV: Animates still portraits into singing performances. Links: Website | Paper | GitHub
Reward Forcing: Streaming video generation in real time. Links: Website | Paper | Hugging Face | GitHub
EvoQwen2.5-VL Retriever: Open-source visual document retriever. Links: 7B | 3B
LongCat Image: 6B parameter image generation model optimized for efficiency. Links: Hugging Face | GitHub
OneThinker: Visual reasoning model handling multiple tasks. Links: Hugging Face | Paper
Research Highlights
Too Late to Recall: Explaining the Two-Hop Problem in Multimodal Knowledge Retrieval
VLMs show worse factual recall than their underlying language models because they construct visual entity representations too late in the network. The study tested 14 models and found poorly-performing ones (especially LLaVA-style architectures) resolve entities after the layers responsible for factual recall.
Why it matters: Your multimodal RAG system's performance depends on choosing models that form entity representations early enough to access factual knowledge.
Links: Paper | GitHub
BookRAG: A Hierarchical Structure-aware Index-based Approach for Retrieval-Augmented Generation on Complex Documents
BookRAG introduces BookIndex, which maps hierarchical document trees to knowledge graphs through a unified structure. This captures both the organizational flow and semantic relationships that standard retrieval systems miss.
Why it matters: Complex documents have inherent structure that flat indexing throws away.
Links: Paper
Attention Interaction Alignment (AIA)
Researchers found that decoupling unified multimodal models works because it forces task-specific interaction patterns. AIA achieves similar gains through a specialized loss function without architectural changes.
Why it matters: You can get the benefits of model decoupling without rebuilding your architecture.
Links: Paper | Project Page | GitHub
PowerCLIP
PowerCLIP aligns image sub-regions with text by treating them as powersets rather than flat representations. It outperforms existing models on zero-shot classification, retrieval, robustness, and compositional understanding.
Why it matters: Treating visual regions as structured sets captures compositional relationships that flat embeddings miss.
Links: Paper
LORE: A Large Generative Model for Search Relevance
Alibaba's three-year deployment improved e-commerce search by 27% through a framework recognizing that relevance needs three distinct capabilities: knowledge reasoning, multimodal matching, and rule compliance. The system handles these as separate modules rather than assuming a single model solves everything.
Why it matters: Real-world search relevance is multiple problems, not one.
Links: Paper
STARFlow-V: End-to-end video generation using normalizing flows. Links: Paper
RaySt3R: Predicts depth maps for completing occluded objects without training. Links: Paper | GitHub | Demo
VLA Models Are More Generalizable Than You Think: Physical and spatial modeling revisited. Links: Paper
RELIC World Model: Interactive video with long-horizon spatial memory. Links: Website
MG-Nav: Visual navigation using sparse spatial memory at dual scales. Links: Paper | Demo
BlockVid: Block diffusion for minute-long video generation. Links: Paper
NeuralRemaster: Phase-preserving diffusion for structure-aligned generation. Links: Paper
Infinity-RoPE: Training-free framework for unlimited length videos. Links: Website | Paper
Yann LeCun's Humanoid Robot Paper: Humanoid robots learn actions from AI-generated videos. Links: Paper
IMPASTO: Robotic oil painting system. Links: Website | Paper | Twitter
VLASH: Asynchronous inference for real-time vision-language-action models. Links: Paper | GitHub
Trends & Predictions
The Two-Hop Problem
Vision-language models are underperforming their text-only backbones on factual recall. New research shows why: 11 out of 14 tested models form entity representations too late in their processing pipeline.
Here's the mechanism. When you show a VLM an image of the Eiffel Tower, it needs to recognize the tower and then retrieve facts about it. Most VLMs delay the first step until after the network layers that handle factual recall. By the time the model knows it's looking at the Eiffel Tower, it has already passed the computational stages where its language backbone would access stored facts.
The data shows this clearly. LLaVA-style models consistently resolve entity representations late and perform poorly on factual recall. Models like Gemma-3-12B and Qwen2.5-VL-7B form entity representations early and maintain strong factual recall. The difference comes down to training: models with extensive multimodal fine-tuning learn to process visual entities earlier.
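The paper makes this case with layer-wise analysis. As a rough illustration of how you could probe it yourself, here is a generic logit-lens-style sketch (not the authors' code) that projects each layer's hidden state through the unembedding matrix and tracks when the entity's first token surfaces; the model ID, image path, and entity token are placeholder assumptions.

```python
# Rough logit-lens probe: at which layer does the VLM "know" the entity?
# Placeholders: model ID, image path, and the entity's first subword token.
import torch
from PIL import Image
from transformers import AutoProcessor, AutoModelForImageTextToText

model_id = "llava-hf/llava-1.5-7b-hf"  # placeholder VLM
processor = AutoProcessor.from_pretrained(model_id)
model = AutoModelForImageTextToText.from_pretrained(
    model_id, torch_dtype=torch.float16, device_map="auto"
)

messages = [{"role": "user", "content": [
    {"type": "image"},
    {"type": "text", "text": "What landmark is shown in this photo?"},
]}]
prompt = processor.apply_chat_template(messages, add_generation_prompt=True)
image = Image.open("landmark.jpg")  # placeholder image of the entity
inputs = processor(images=image, text=prompt, return_tensors="pt").to(model.device)

with torch.no_grad():
    out = model(**inputs, output_hidden_states=True)

# Project each layer's last-position hidden state through the unembedding and
# track the rank of the entity's first token. If the rank only collapses in the
# final few layers, the entity is resolved after the backbone's recall layers.
unembed = model.get_output_embeddings().weight  # [vocab_size, hidden_size]
entity_token = processor.tokenizer(" Eiffel", add_special_tokens=False).input_ids[0]
for depth, hidden in enumerate(out.hidden_states):
    logits = hidden[0, -1] @ unembed.T
    rank = int((logits > logits[entity_token]).sum())
    print(f"layer {depth:2d}: entity token rank {rank}")
```

This skips the final layer norm that a careful logit-lens analysis would apply, so treat the ranks as a qualitative signal rather than a measurement.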
This has three direct implications for anyone building multimodal systems.
First, model selection matters more than you think. Benchmark scores don't reveal whether a VLM processes entities early or late. You need to test factual recall specifically. Take the same question, ask it with text only using the base language model, then ask it with an image using the VLM. If performance drops significantly, you're seeing the two-hop problem (a minimal sketch of this check follows the third point below).
Second, current evaluation methods miss this issue. Standard benchmarks measure overall accuracy but don't isolate where models fail. You need targeted tests that separate visual recognition from factual recall. This tells you whether your model can't see what's in the image or can't remember facts about what it sees.
Third, you need realistic expectations. VLMs handle different tasks with different capabilities. Some excel at visual reasoning but struggle with factual recall. Others maintain factual performance but process images more slowly. Understanding these tradeoffs lets you choose the right model for your specific use case.
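Here is a minimal sketch of the comparison from the first point, assuming Hugging Face transformers pipelines. The model IDs, image URL, question, and entity are placeholders; swap in whichever base LLM / VLM pair you plan to deploy, and score many entity-fact pairs against ground truth rather than eyeballing one example.

```python
# Two-hop diagnostic sketch: same fact, asked via text (entity named) and via
# the VLM (entity shown). Model IDs, image URL, and question are placeholders.
from transformers import pipeline

text_lm = pipeline("text-generation", model="Qwen/Qwen2.5-7B-Instruct")
vlm = pipeline("image-text-to-text", model="Qwen/Qwen2.5-VL-7B-Instruct")

question = "In which year was this landmark completed?"

# Route 1: the entity is named in text, so the backbone can recall the fact directly.
text_out = text_lm(
    f"The Eiffel Tower. {question}", max_new_tokens=20, return_full_text=False
)

# Route 2: the entity must first be resolved from pixels, then the fact recalled.
vlm_out = vlm(
    text=[{"role": "user", "content": [
        {"type": "image", "url": "https://example.com/eiffel_tower.jpg"},  # placeholder
        {"type": "text", "text": question},
    ]}],
    max_new_tokens=20,
    return_full_text=False,
)

print("text-only :", text_out[0]["generated_text"])
print("with image:", vlm_out[0]["generated_text"])
# A consistent accuracy drop in the image condition, while the model still
# names the entity correctly, is the two-hop signature described above.
```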
The two-hop problem is fixable. Models that underwent sufficient multimodal fine-tuning already solve it. The challenge is identifying which models work well and understanding why. As training methods improve, more models will form entity representations early enough to access their full factual knowledge.
This matters because multimodal RAG depends on reliable factual recall. If your VLM can't consistently retrieve facts about recognized entities, your retrieval system breaks down. The research gives us a clear diagnostic: test entity recognition separately from factual recall. When both work together, you have a model you can build on.
Community + Shoutouts
Video models on 4GB VRAM and 16GB RAM Shoutout to yanokusnir for demonstrating current video models running on 4GB VRAM and 16GB RAM. Amazing how far we've come in a couple of years (remember the Will Smith spaghetti video?). Links: Reddit
SOTA image model comparison Shoutout to BoostPixels for comparing Z-Image-Turbo, Gemini 3 Pro, and Qwen Image Edit 2509 on uncanny valley performance. Links: Reddit
Basketball vision AI tutorial Shoutout to SkalSki for a basketball vision AI tutorial covering player detection with RF-DETR, tracking with SAM2, team clustering with SigLIP and K-means, and number recognition with SmolVLM2. Links: Twitter | YouTube
NanoBanana Pro LoRA Dataset Generator Shoutout to Lovis Odin for releasing the NanoBanana Pro LoRA Dataset Generator with @fal, which creates training datasets for Flux 2, Z-Image, Qwen Image Edit, and other image-to-image models. Links: Twitter | Website | GitHub
That's a wrap for Multimodal Monday #36! Research identifies why VLMs form entity representations too late for factual recall. BookRAG combines document trees with knowledge graphs for better retrieval. Real-time video generation arrives from Adobe and Alibaba. Microsoft releases 0.5B parameter TTS that runs in real time. Alibaba's Live Avatar generates infinite-length audio-driven avatars. Video generation now works on consumer hardware.
Ready to build multimodal solutions that actually work? Let's talk.
