Philip Bankier • 6 min read
🎯 Multimodal Monday #3 - Scaling Multimodal AI: Laws, Lightweights & Large Releases
Apple's new scaling-law research redefines how multimodal models are built, while Moonshot and OpenGVLab drop powerful open-source VLMs with reasoning and tool use.

📢 Quick Take (TL;DR)
- Apple tests multimodal scaling: A new scaling-laws study finds that training vision+language models from scratch (early fusion) can match or beat the usual "attach a vision encoder to an LLM" approach, hinting at simpler architectures ahead.
- Open-source VLMs on the rise: Moonshot AI launched Kimi-VL, a lightweight vision-language model with ~2.8B active parameters, built-in reasoning, and a 128K context window, while OpenGVLab's InternVL3 family (1B to 78B parameters) pushes state-of-the-art multimodal reasoning and tool use.
- Real-world going multimodal: Google's Search AI can now "see" images via Lens + Gemini (The Verge), and industries from healthcare to retail are piloting AI that fuses modalities for better insights (This Week in AI, early April 2025).

🧠 Research Highlights
- Scaling Laws for Native Multimodal Models (Apple): Apple and Sorbonne researchers trained 457 models to compare early-fusion vs. late-fusion multimodal architectures. Key finding: there is no inherent advantage to late fusion (e.g. a pre-trained vision encoder attached to an LLM) over training a single multimodal model from scratch; in fact, early fusion performs on par or slightly better at smaller scales. They also show these native models scale much like text-only LLMs and benefit from Mixture-of-Experts layers for modality-specific capabilities. Why it matters: It challenges the prevalent strategy of bolting image encoders onto language models. Training one unified model from scratch might simplify multimodal AI development without sacrificing performance. (See the first sketch below for a minimal contrast of the two architectures.)
- Sparse Disentangled Retrieval Representations: A new method from IIT Delhi and Adobe proposes sparse, interpretable embeddings for multimodal retrieval, enabling queries with exclusions (e.g. "sports but not basketball"). Instead of huge dense vectors, it maps similar words and concepts into shared dimensions to keep embeddings compact. On MSCOCO and Conceptual Captions, this approach beat dense models like CLIP by up to 11% in precision. Why it matters: By making features more interpretable and controllable, we can build search systems that answer complex queries (including negatives) and better explain why a result was retrieved, a big win for transparency in multimodal feature extraction. (See the second sketch below for a toy exclusion query.)
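To make the early- vs. late-fusion distinction from the Apple paper concrete, here is a minimal PyTorch sketch of the two architecture families being compared. It is purely illustrative: the layer counts, widths, and module choices are arbitrary assumptions, not Apple's actual setup.

```python
# Illustrative sketch (not Apple's code): late fusion bolts a vision encoder onto
# an LLM via a projector; early ("native") fusion feeds raw patch embeddings and
# text tokens into one backbone from the first layer. All sizes are assumptions.
import torch
import torch.nn as nn

D = 512  # shared model width

class LateFusion(nn.Module):
    """Pre-trained-style vision encoder + projector + language backbone."""
    def __init__(self):
        super().__init__()
        self.vision_encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(D, nhead=8, batch_first=True), num_layers=4)
        self.projector = nn.Linear(D, D)  # maps vision features into the LLM's space
        self.llm = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(D, nhead=8, batch_first=True), num_layers=8)

    def forward(self, image_patches, text_embeds):
        vis = self.projector(self.vision_encoder(image_patches))
        return self.llm(torch.cat([vis, text_embeds], dim=1))

class EarlyFusion(nn.Module):
    """One transformer trained from scratch on interleaved image + text tokens."""
    def __init__(self):
        super().__init__()
        self.patch_embed = nn.Linear(D, D)  # plain patch projection, no separate tower
        self.backbone = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(D, nhead=8, batch_first=True), num_layers=12)

    def forward(self, image_patches, text_embeds):
        tokens = torch.cat([self.patch_embed(image_patches), text_embeds], dim=1)
        return self.backbone(tokens)

imgs, txt = torch.randn(2, 16, D), torch.randn(2, 32, D)
print(LateFusion()(imgs, txt).shape, EarlyFusion()(imgs, txt).shape)
```

The takeaway from the paper is that the right-hand design, with no separate pre-trained vision tower, does not pay an inherent penalty; both modalities share one backbone and one training run.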
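And here is a toy illustration of why sparse, concept-aligned dimensions make exclusion queries easy: an excluded concept can simply be filtered out dimension-wise. The vocabulary, scoring, and hard-filter rule below are assumptions for illustration, not the paper's actual method.

```python
# Toy sketch: sparse, interpretable embeddings where each dimension maps to a
# concept cluster, so "sports but not basketball" becomes a dimension-level filter.
import numpy as np

VOCAB = ["sports", "basketball", "beach", "dog"]  # hypothetical concept dimensions

def sparse_embed(tags):
    """Toy stand-in for the model: set the dimension of each present concept."""
    v = np.zeros(len(VOCAB))
    for t in tags:
        v[VOCAB.index(t)] = 1.0
    return v

corpus = {
    "soccer_match.jpg": sparse_embed(["sports"]),
    "nba_game.jpg":     sparse_embed(["sports", "basketball"]),
    "beach_dog.jpg":    sparse_embed(["beach", "dog"]),
}

def search(include, exclude):
    q_pos, q_neg = sparse_embed(include), sparse_embed(exclude)
    scores = {}
    for name, doc in corpus.items():
        if (doc * q_neg).sum() > 0:   # drop anything touching an excluded concept
            continue
        score = float(doc @ q_pos)
        if score > 0:
            scores[name] = score
    return sorted(scores, key=scores.get, reverse=True)

print(search(include=["sports"], exclude=["basketball"]))  # -> ['soccer_match.jpg']
```

Because each dimension is nameable, the system can also explain a hit ("matched on the sports dimension"), which dense CLIP-style vectors cannot do directly.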
🛠️ Tools & Techniques
- Kimi-VL & Kimi-VL-Thinking (Moonshot AI): An efficient open-source MoE vision-language model with advanced visual reasoning, long-context understanding, and agent capabilities, all with just ~2.8B active parameters (GitHub: MoonshotAI/Kimi-VL). It packs a 128K-token context window and a native-resolution Vision Transformer, enabling multi-image and document understanding.
- How you might use it: The models (including the chain-of-thought-tuned "Thinking" variant) are available on Hugging Face. Developers can run Kimi-VL on modest GPUs to power image Q&A, OCR with reasoning, or multi-turn visual assistants that would typically require much larger models (see the hedged loading sketch after this list).

- InternVL3 (OpenGVLab): The InternVL3 suite (from the team behind InternImage and InternLM) rolled out seven checkpoint variants from 1B up to 78B parameters. It's built with an InternViT vision encoder and a Qwen2.5 language decoder, and it outperforms the original Qwen-VL model by a good margin. These models demonstrate strong visual reasoning, handle complex document understanding, and even exhibit tool use and agentic capabilities.
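As a rough idea of how you might try Kimi-VL (from the first bullet above), here is a hedged loading sketch using the standard transformers remote-code pattern. The repo id, chat-template call, processor arguments, and the image path are assumptions; check the Hugging Face model card for the exact usage.

```python
# Hedged sketch: loading a Kimi-VL checkpoint from Hugging Face.
# Repo id, message format, and processor interface are assumptions, not verified API.
import torch
from PIL import Image
from transformers import AutoModelForCausalLM, AutoProcessor

model_id = "moonshotai/Kimi-VL-A3B-Thinking"  # assumed id for the reasoning-tuned variant
processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto", trust_remote_code=True)

image = Image.open("invoice.png")  # placeholder image path
messages = [{"role": "user",
             "content": [{"type": "image", "image": "invoice.png"},
                         {"type": "text", "text": "What is the total amount due?"}]}]

# Build the prompt text from the chat template, then pack image + text together.
prompt = processor.apply_chat_template(messages, add_generation_prompt=True, tokenize=False)
inputs = processor(images=image, text=prompt, return_tensors="pt").to(model.device)

out = model.generate(**inputs, max_new_tokens=256)
new_tokens = out[0][inputs["input_ids"].shape[1]:]  # keep only the generated part
print(processor.decode(new_tokens, skip_special_tokens=True))
```

With only ~2.8B parameters active per token, this kind of workload (document Q&A with some reasoning) should fit on a single consumer-grade GPU, which is the main appeal of the release.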
🏙️ Real-World Applications
- Google Search gets multimodal: Google is rolling out image understanding in Search's AI Mode. Users can now snap or upload a picture and ask questions about it, thanks to a custom Gemini model combined with Google Lens (The Verge). The system can interpret an entire scene, recognizing objects, their relationships, and context, and then "fan out" search queries to give a rich answer (e.g. identify the books in a photo and recommend similar titles). This marks a major step in putting multimodal retrieval into a flagship consumer product.
- Multimodal AI in healthcare: Healthcare is on the cusp of mainstream adoption of multimodal AI. Hospitals are piloting diagnostic tools that analyze medical images alongside patient text data, and virtual health assistants that understand both speech and vision (This Week in AI, early April 2025). Regulators are expected to green-light more medical AI systems soon, meaning tasks like radiology image analysis and personalized treatment recommendations could become routine. The goal: leverage combined modalities (e.g. MRI scans plus clinical notes) to improve accuracy and assist doctors in care delivery.
- Industry use cases expanding: Beyond the tech giants, sectors like retail are embracing multimodal feature extraction for better search and recommendation. E-commerce platforms now let shoppers search by image and text together for more precise results (think snapping a photo of a dress and adding the query "summer style"); a simple sketch of this combined-query pattern follows this list. According to market analysis, multimodal AI adoption is accelerating across retail, finance, and media as organizations seek to fuse data types for richer insights (Multimodal AI Research Report 2025). From automated video analysis in media archives to multimodal fraud detection in banking, the real-world use cases of multimodal systems are broadening by the week.
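For the "photo of a dress plus the words summer style" pattern mentioned above, one simple baseline is to embed both inputs with an off-the-shelf model like CLIP and average the normalized vectors into a single query. This is a generic sketch under that assumption, not a description of any retailer's production system; the catalog below is random stand-in data.

```python
# Hedged sketch: combined image + text product search with CLIP.
# Averaging normalized embeddings is a simple baseline for composed queries.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def embed(image=None, text=None):
    feats = []
    if image is not None:
        px = processor(images=image, return_tensors="pt")
        feats.append(model.get_image_features(**px))
    if text is not None:
        tk = processor(text=[text], return_tensors="pt", padding=True)
        feats.append(model.get_text_features(**tk))
    v = torch.stack([f / f.norm(dim=-1, keepdim=True) for f in feats]).mean(0)
    return v / v.norm(dim=-1, keepdim=True)

query = embed(image=Image.open("dress_photo.jpg"), text="summer style")  # placeholder photo
catalog = torch.randn(1000, 512)                        # stand-in for product image embeddings
catalog = catalog / catalog.norm(dim=-1, keepdim=True)
top = (catalog @ query.T).squeeze(-1).topk(5).indices   # most similar products
print(top.tolist())
```

Production systems typically go further (learned fusion, re-ranking, metadata filters), but even this baseline illustrates why shared image-text embedding spaces make "photo + words" search straightforward.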

📈 Trends & Predictions
- MoE for scalability: Mixture-of-Experts (MoE) is emerging as the go-to solution for scaling multimodal models. Meta's Llama 4 uses MoE to juggle massive capacity: Llama 4 Maverick reportedly has 128 experts (400B total weights, 17B active) and achieves a 1 million-token context window, while the "smaller" Llama 4 Scout (109B total) hit an unprecedented 10 million-token context. We're seeing others follow suit (e.g. Kimi-VL, InternVL3) by using experts to boost model capacity or speed. Expect future state-of-the-art VLMs to rely on MoE to attain huge contexts and improved efficiency simultaneously (a minimal routing sketch follows this list).
- Integrated reasoning: There's a clear pattern of multimodal models moving beyond perception into cognition. Several new models emphasize chain-of-thought reasoning: Google's Gemini 2.5 "pauses to think" with an internal chain-of-thought approach (This Week in AI, early April 2025), and Moonshot's Kimi-VL-Thinking is explicitly trained for complex visual reasoning. Additionally, DeepMind CEO Demis Hassabis recently mentioned on the Possible podcast plans to merge Gemini with Veo's advanced video-generation models, aiming for multimodal assistants that deeply integrate visual understanding with reasoning to support real-world interactions. This trend suggests next-gen multimodal systems won't just describe images or videos; they'll perform logical reasoning on them (e.g. solve math problems from a diagram or plan tasks from an image). Multimodal AI is gradually shifting from "understand and tell" toward "understand and figure out," blurring the line between perception and problem-solving.
- Keeping up with the flood: The pace of multimodal AI progress isn't slowing; if anything, it's accelerating. Every week brings new model releases, whether it's an open-source vision-language model or a specialized framework for cross-modal search. This rapid churn is giving rise to tooling to manage it. (For instance, multimodal data warehouses like Mixpeek help teams by managing a library of feature extractors and retrieval pipelines so developers can plug in the latest and greatest models without reinventing the wheel.) A key prediction: organizations that invest in infrastructure to swiftly evaluate and swap in new multimodal models will stay ahead. Flexibility will be critical, because today's state-of-the-art could be outpaced next month.
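To ground the MoE trend from the first bullet above, here is a minimal top-1 routing layer: every token is scored by a router and dispatched to a single expert, so parameter count grows with the number of experts while per-token compute stays roughly constant. Expert sizes and the top-1 choice are illustrative assumptions; production MoE layers (top-k routing, load balancing, capacity limits) are considerably more involved.

```python
# Minimal top-1 Mixture-of-Experts layer, only to illustrate the routing idea.
import torch
import torch.nn as nn

class TinyMoE(nn.Module):
    def __init__(self, dim=256, num_experts=8):
        super().__init__()
        self.router = nn.Linear(dim, num_experts)  # scores each token against each expert
        self.experts = nn.ModuleList(
            [nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))
             for _ in range(num_experts)])

    def forward(self, x):                          # x: (batch, seq, dim)
        gates = self.router(x).softmax(dim=-1)     # (batch, seq, num_experts)
        top_w, top_idx = gates.max(dim=-1)         # one expert per token
        out = torch.zeros_like(x)
        for e, expert in enumerate(self.experts):
            mask = top_idx == e                    # tokens routed to this expert
            if mask.any():
                out[mask] = top_w[mask].unsqueeze(-1) * expert(x[mask])
        return out

tokens = torch.randn(2, 10, 256)
print(TinyMoE()(tokens).shape)  # total parameters scale with experts; per-token FLOPs do not
```

This is why a model can advertise hundreds of billions of total weights while activating only a small fraction per token, the same trade-off cited for Llama 4 Maverick above.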
🧩 Community + Shoutouts
- Open-source heroes on X: Shoutout to the researchers sharing their multimodal work openly. For example, @Kimi_Moonshot unveiled Kimi-VL on X as a "lightweight yet powerful Vision-Language Model with reasoning capability," and @mervenoyann (Hugging Face) shared OpenGVLab's InternVL3 release, with sizes from 1B to 78B and support for reasoning.
- Viral Ghibli-fication of images: The public is having fun with multimodal AI too. Shortly after OpenAI's GPT-4o (image-generation-enabled GPT-4) went live, a viral trend took off where users turned their personal photos into Studio Ghibli-style art. The craze was so popular it reportedly strained OpenAI's GPU capacity (This Week in AI, early April 2025)! It's a lighthearted example of how quickly new multimodal capabilities get adopted by creators and casual users. From fake restaurant receipts to anime portraits, the community keeps finding creative (and unexpected) ways to use these tools.

- Hugging Face hubs & demos: Every week, new models and demos pop up on Hugging Face. Recent highlights include community Spaces that let you try out image-question answering with open VLMs and visualization notebooks comparing embeddings from models like CLIP, BLIP-2, and the latest research prototypes. The Hugging Face community (shoutout to contributors like the @smol-vision enthusiasts) is continuously evaluating and fine-tuning these models, which helps surface their strengths and weaknesses. If you're not sure which new multimodal model works best for your task, chances are someone in the community already has a benchmark or demo up for it.