
    🎯 Multimodal Monday #3 — Scaling Multimodal AI: Laws, Lightweights & Large Releases

    Apple’s new scaling law research redefines how multimodal models are built, while Moonshot and OpenGVLab drop powerful open-source VLMs with reasoning and tool-use.


    📢 Quick Take (TL;DR)

    • Apple tests multimodal scaling: A new scaling laws study finds training vision+language models from scratch (early-fusion) can match or beat the usual “attach a vision encoder to an LLM” approach, hinting at simpler architectures ahead.
    • Open-source VLMs on the rise: Moonshot AI launched Kimi-VL, a lightweight Mixture-of-Experts Vision-Language Model with only ~2.8B activated parameters, reasoning ability, and a 128K-token context window, while OpenGVLab’s InternVL3 family (1B–78B params) pushes state-of-the-art multimodal reasoning and tool use.
    • Real-world going multimodal: Google’s Search AI can now “see” images via Lens + Gemini (Google’s AI Mode can now see and search with images | The Verge), and industries from healthcare to retail are piloting AI that fuses modalities for better insights (This Week in AI: Global Trends and Developments, Early April 2025).
    [GIF: Google AI Mode with Lens]

    🧠 Research Highlights

    • Scaling Laws for Native Multimodal Models (Apple): Apple & Sorbonne researchers trained 457 models to compare early-fusion vs late-fusion multimodal architectures (the first sketch after this list illustrates the distinction). Key finding: no inherent advantage to “late-fusion” (e.g. a pre-trained vision encoder + LLM) over training a single multimodal model; in fact, early-fusion performs on par or slightly better at smaller scales. They also show these native models scale much like text-only LLMs and benefit from Mixture-of-Experts for modality-specific capabilities. Why it matters: It challenges the prevalent strategy of bolting image encoders onto language models. Training one unified model from scratch might simplify multimodal AI development without sacrificing performance.
    • Sparse Disentangled Retrieval Representations: A new method from IIT Delhi/Adobe proposes sparse, interpretable embeddings for multimodal retrieval, enabling queries with exclusions (e.g. “sports but not basketball”; the second sketch after this list shows the idea). Instead of huge dense vectors, they map similar words/concepts into shared dimensions to keep embeddings compact. On MSCOCO and Conceptual Captions, this approach beat dense models like CLIP by up to 11% in precision. Why it matters: By making features more interpretable and controllable, we can build search systems that answer complex queries (including negatives) and better explain why a result was retrieved – a big win for multimodal feature extraction transparency.
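    To make the architectural contrast in the Apple study concrete, here is a minimal PyTorch-style sketch. It is illustrative only: the module names, dimensions, and wiring are assumptions, not Apple's implementation. Late fusion projects features from a pre-trained vision encoder into an LLM's token space, while early (native) fusion embeds image patches and text into one transformer trained end to end.

```python
import torch
import torch.nn as nn

class LateFusionVLM(nn.Module):
    """Late fusion: project features from a pre-trained vision encoder into an LLM."""
    def __init__(self, vision_encoder, llm, vision_dim=1024, llm_dim=4096):
        super().__init__()
        self.vision_encoder = vision_encoder      # typically pre-trained, often (partly) frozen
        self.projector = nn.Linear(vision_dim, llm_dim)
        self.llm = llm                            # pre-trained language model

    def forward(self, pixel_values, text_embeds):
        img_feats = self.vision_encoder(pixel_values)   # (B, N_img, vision_dim)
        img_tokens = self.projector(img_feats)          # map into the LLM's embedding space
        return self.llm(torch.cat([img_tokens, text_embeds], dim=1))

class EarlyFusionVLM(nn.Module):
    """Early (native) fusion: image patches and text share one transformer from scratch."""
    def __init__(self, transformer, patch_dim=768, d_model=4096, vocab_size=32000):
        super().__init__()
        self.patch_embed = nn.Linear(patch_dim, d_model)  # light patch embedding, no separate encoder
        self.text_embed = nn.Embedding(vocab_size, d_model)
        self.transformer = transformer                    # single decoder over the mixed sequence

    def forward(self, patches, token_ids):
        seq = torch.cat([self.patch_embed(patches), self.text_embed(token_ids)], dim=1)
        return self.transformer(seq)
```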
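    And for the sparse-retrieval item, a toy NumPy sketch of how interpretable dimensions enable exclusion queries. This is not the paper's method, just the general mechanism; the concept dimensions are hand-set here, whereas the paper learns them.

```python
import numpy as np

# Toy vocabulary of interpretable dimensions (hand-set here; learned in the paper).
DIMS = ["sports", "basketball", "food", "outdoor"]

def embed(concepts):
    """Sparse embedding over named concept dimensions."""
    v = np.zeros(len(DIMS))
    for c in concepts:
        v[DIMS.index(c)] = 1.0
    return v

# Document embeddings (in practice produced by the multimodal encoder).
docs = {
    "soccer_match.jpg": embed(["sports", "outdoor"]),
    "nba_dunk.jpg":     embed(["sports", "basketball"]),
    "picnic.jpg":       embed(["food", "outdoor"]),
}

def search(include, exclude):
    """Score = similarity to included concepts, hard-filtered on excluded ones."""
    q, mask = embed(include), embed(exclude)
    results = []
    for name, d in docs.items():
        if np.dot(d, mask) > 0:          # any excluded concept present -> drop
            continue
        results.append((name, float(np.dot(d, q))))
    return sorted(results, key=lambda x: -x[1])

print(search(include=["sports"], exclude=["basketball"]))
# [('soccer_match.jpg', 1.0), ('picnic.jpg', 0.0)] -- the basketball image is excluded outright
```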

    🛠️ Tools & Techniques

    • InternVL3 (OpenGVLab): The InternVL3 suite (from the team behind InternImage/InternLM) rolled out seven checkpoint variants from 1B up to 78B parameters. It’s built with an InternViT vision encoder and a Qwen2.5 language decoder, and it outperforms Qwen2.5-VL by a good margin. These models demonstrate strong visual reasoning, can handle complex document understanding, and even exhibit tool use and agentic capabilities. A hedged usage sketch follows below.
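    Here is that sketch of querying an InternVL3 checkpoint through Hugging Face transformers. The repo id, the availability of a transformers-native checkpoint served by the generic "image-text-to-text" pipeline, and the output key are all assumptions; check the OpenGVLab model card for the officially supported usage.

```python
from transformers import pipeline

# Assumption: a transformers-native InternVL3 checkpoint exists under this id and is
# served by the generic "image-text-to-text" pipeline (verify against the model card).
pipe = pipeline("image-text-to-text", model="OpenGVLab/InternVL3-2B-hf")

messages = [{
    "role": "user",
    "content": [
        {"type": "image", "url": "https://example.com/invoice.png"},  # placeholder image URL
        {"type": "text", "text": "What is the total amount on this invoice?"},
    ],
}]

result = pipe(text=messages, max_new_tokens=128)
print(result[0]["generated_text"])  # assumed output key; may differ by transformers version
```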

    🏗️ Real-World Applications

    • Google Search gets multimodal: Google is rolling out image understanding in Search’s AI Mode. Users can now snap or upload a picture and ask questions about it, thanks to a custom Gemini AI model combined with Google Lens (Google’s AI Mode can now see and search with images | The Verge). The system can interpret an entire scene – recognizing objects, their relationships, and context – and then “fan out” multiple search queries to give a rich answer (e.g. identify the books in a photo and recommend similar titles). This marks a major step in putting multimodal retrieval into a flagship consumer product. (A toy sketch of the fan-out pattern appears after this list.)
    • Multimodal AI in healthcare: Healthcare is on the cusp of mainstream adoption of multimodal AI. Hospitals are piloting diagnostic tools that analyze medical images alongside patient text data, and virtual health assistants that understand speech and vision (This Week in AI: Global Trends and Developments, Early April 2025). Regulators are expected to green-light more medical AI systems soon, meaning tasks like radiology image analysis and personalized treatment recommendations could become routine. The goal: leverage combined modalities (e.g. MRI scans + clinical notes) to improve accuracy and assist doctors in care delivery (This Week in AI: Global Trends and Developments, Early April 2025).
    • Industry use-cases expanding: Beyond tech giants, sectors like retail are embracing multimodal feature extraction for better search and recommendation. E-commerce platforms now let shoppers search by image and text together for more precise results (think snapping a photo of a dress and adding the query “summer style”; see the image+text search sketch after this list). According to market analysis, multimodal AI adoption is accelerating across retail, finance, and media, as organizations seek to fuse data types to gain richer insights (Multimodal AI Research Report 2025: Market to Grow by Over). From automated video analysis in media archives to multimodal fraud detection in banking, the real-world use cases of multimodal systems are broadening by the week.
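    First, the “fan out” pattern from the AI Mode item above. The sketch below is a hypothetical outline (detector and search_engine are stand-in callables, not Google's components) showing how one image question can be decomposed into several text queries whose results are merged.

```python
def fan_out_search(image, detector, search_engine, question, top_k=3):
    """Decompose one image question into several text queries, then merge ranked results."""
    # 1. Understand the scene: detect objects and any recognizable attributes.
    objects = detector(image)          # e.g. [{"label": "book", "title": "Dune"}, ...]

    # 2. Fan out: one focused text query per detected object, seeded with the user's question.
    queries = [f'{question} {obj["label"]} {obj.get("title", "")}'.strip() for obj in objects]

    # 3. Gather results from the underlying text search engine and merge by score.
    merged = []
    for q in queries:
        merged.extend(search_engine(q, top_k=top_k))
    return sorted(merged, key=lambda r: r["score"], reverse=True)
```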
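    Second, the image-plus-text product search can be prototyped with an off-the-shelf CLIP checkpoint that embeds both modalities into one space. The file names, the precomputed catalog array, and the 50/50 weighting are illustrative choices, not a production recipe.

```python
import numpy as np
from PIL import Image
from sentence_transformers import SentenceTransformer

# CLIP checkpoint that embeds both images and text into the same vector space.
model = SentenceTransformer("clip-ViT-B-32")

# Embed the shopper's photo and their text refinement (placeholder file name).
img_vec = model.encode(Image.open("dress_photo.jpg"))
txt_vec = model.encode("summer style")

# Simple combination: weighted average of the two query vectors
# (the 0.5/0.5 weights are an illustrative choice, not a tuned value).
query = 0.5 * img_vec + 0.5 * txt_vec
query /= np.linalg.norm(query)

# Rank a pre-embedded product catalog by cosine similarity (placeholder array on disk).
catalog = np.load("catalog_clip_embeddings.npy")          # (num_products, dim)
catalog = catalog / np.linalg.norm(catalog, axis=1, keepdims=True)
scores = catalog @ query
print(np.argsort(-scores)[:5])                            # indices of the top-5 products
```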
    📈 Trends & Predictions
    • MoE for scalability: Mixture-of-Experts (MoE) is emerging as the go-to solution for scaling multimodal models. Meta’s Llama 4 uses MoE to juggle massive capacity—Llama 4 Maverick reportedly has 128 experts (400B total weights, 17B active) and achieves a 1 million-token context window, while the “smaller” Llama 4 Scout (109B total) hit an unprecedented 10 million-token context (Archive for April 2025). We’re seeing others follow suit (e.g. Kimi-VL, InternVL3) by using experts to boost model size or speed. Expect future state-of-the-art VLMs to rely on MoE to attain huge contexts and improved efficiency simultaneously. (A minimal top-k MoE layer sketch appears after this list.)
    • Integrated reasoning: There’s a clear pattern of multimodal models moving beyond perception into cognition. Several new models emphasize chain-of-thought reasoning: Google’s Gemini 2.5 “pauses to think” with an internal chain-of-thought approach (This Week in AI: Global Trends and Developments, Early April 2025), and Moonshot’s Kimi-VL-Thinking is explicitly trained for complex visual reasoning (Kimi VL and Kimi VL Thinking: Powerful Open Source Vision Models). Additionally, DeepMind CEO Demis Hassabis recently mentioned on the Possible podcast plans to merge Gemini with its Veo video-generation models, aiming for multimodal assistants that deeply integrate visual understanding with reasoning to support real-world interactions. This trend suggests next-gen multimodal systems won’t just describe images or videos – they’ll perform logical reasoning on them (e.g. solve math problems from a diagram or plan tasks from an image). Multimodal AI is gradually shifting from “understand and tell” towards “understand and figure out,” blurring the line between perception and problem-solving.
    • Keeping up with the flood: The pace of multimodal AI progress isn’t slowing – if anything, it’s accelerating. Every week brings new model releases, whether it’s an open-source vision-language model or a specialized framework for cross-modal search. This rapid churn is giving rise to tooling to manage it. (For instance, multimodal data warehouses like Mixpeek help teams by managing a library of feature extractors and retrieval pipelines so developers can plug in the latest and greatest models without reinventing the wheel; a hypothetical extractor-registry sketch follows this list.) A key prediction: organizations that invest in infrastructure to swiftly evaluate and swap in new multimodal models will stay ahead. Flexibility will be critical, because today’s state-of-the-art could be outpaced next month.
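    The MoE trend above rests on a simple mechanism: a learned router activates only a few expert feed-forward blocks per token, so total parameter count grows far faster than per-token compute. A minimal top-k MoE layer in PyTorch (dimensions and expert count are illustrative, far smaller than Llama 4's):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TopKMoE(nn.Module):
    """Token-level Mixture-of-Experts: route each token to its top-k expert FFNs."""
    def __init__(self, d_model=512, d_ff=2048, num_experts=8, k=2):
        super().__init__()
        self.router = nn.Linear(d_model, num_experts)
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))
            for _ in range(num_experts)
        ])
        self.k = k

    def forward(self, x):                                  # x: (batch, seq, d_model)
        gate = F.softmax(self.router(x), dim=-1)           # routing probabilities per token
        weights, idx = gate.topk(self.k, dim=-1)           # keep only the top-k experts
        weights = weights / weights.sum(dim=-1, keepdim=True)
        out = torch.zeros_like(x)
        for slot in range(self.k):
            for e, expert in enumerate(self.experts):
                mask = idx[..., slot] == e                 # tokens routed to expert e in this slot
                if mask.any():
                    out[mask] += weights[..., slot][mask].unsqueeze(-1) * expert(x[mask])
        return out

layer = TopKMoE()
print(layer(torch.randn(2, 16, 512)).shape)                # torch.Size([2, 16, 512])
```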
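    And on keeping up with the flood: one low-tech way to stay swappable is to hide every model behind a uniform extractor interface, so adopting a new release is a one-line registration rather than a pipeline rewrite. This is a generic hypothetical sketch, not Mixpeek's actual API; the extractor names and stub functions are placeholders.

```python
from typing import Callable, Dict
import numpy as np

# Registry mapping extractor names to embedding functions with a uniform signature.
EXTRACTORS: Dict[str, Callable[[bytes], np.ndarray]] = {}

def register(name: str):
    """Decorator that registers a feature extractor under a stable name."""
    def wrap(fn: Callable[[bytes], np.ndarray]):
        EXTRACTORS[name] = fn
        return fn
    return wrap

@register("clip-vit-b32")
def clip_embed(image_bytes: bytes) -> np.ndarray:
    # Placeholder: call your CLIP model here.
    return np.zeros(512)

@register("internvl3-stub")
def internvl3_embed(image_bytes: bytes) -> np.ndarray:
    # Placeholder: swap in a newer VLM without touching the retrieval pipeline.
    return np.zeros(1024)

def index_document(image_bytes: bytes, extractor: str = "clip-vit-b32") -> np.ndarray:
    """Pipelines depend only on the registry, so models can be swapped per collection."""
    return EXTRACTORS[extractor](image_bytes)

print(index_document(b"...", extractor="internvl3-stub").shape)   # (1024,)
```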

    🧩 Community + Shoutouts

    • Open-source heroes on X: Shoutout to the researchers sharing their multimodal work openly. For example, @Kimi_Moonshot unveiled Kimi-VL on X as a “lightweight yet powerful Vision-Language Model with reasoning capability” (You Jiacheng on X: “MoonViT trained with Muon? Interesting ...”), and @mervenoyann announced that OpenGVLab’s InternVL3 is now on Hugging Face, with sizes from 1B to 78B and support for reasoning.
    • Viral Ghibli-fication of images: The public is having fun with multimodal AI too. Shortly after OpenAI’s GPT-4o (image-generation enabled GPT-4) went live, a viral trend took off where users turned their personal photos into Studio Ghibli–style art. The craze was so popular it reportedly strained OpenAI’s servers (This Week in AI: Global Trends and Developments, Early April 2025)! It’s a lighthearted example of how quickly new multimodal capabilities get adopted by creators and casual users. From fake restaurant receipts to anime portraits, the community keeps finding creative (and unexpected) ways to use these tools.
    • Hugging Face hubs & demos: Every week, new models and demos pop up on Hugging Face. Recent highlights include community Spaces that let you try out image-question answering with open VLMs, and visualization notebooks comparing embeddings from models like CLIP, BLIP-2, and the latest research prototypes. The Hugging Face community (shoutout to contributors like the @smol-vision enthusiasts) is continuously evaluating and fine-tuning these models, which helps surface their strengths and weaknesses. If you’re not sure which new multimodal model works best for your task, chances are someone in the community already has a benchmark or demo up for it. 🎉
    Philip Bankier

    April 14, 2025 · 6 min read