
    🎯 Multimodal Monday #2 — From Tiny VLMs to 10M‑Token Titans

    📢 Quick Take (TL;DR)

    • Major multimodal model releases: Meta unveiled Llama 4 Scout & Maverick – open Mixture-of-Experts models with native text+image (and even video/audio) support – and Microsoft introduced Phi-4-Multimodal, a compact 3.8B-param model integrating vision, text, and speech (Today is the start of a new era of natively multimodal AI… | AI at Meta) ([2503.01743] Phi-4-Mini Technical Report: Compact yet Powerful Multimodal Language Models via Mixture-of-LoRAs).


    🧠 Research Highlights

    🛠️ Tools & Techniques

    🏗️ Real-World Applications

    • Healthcare QA gets visual: Google Cloud’s Vertex AI Search (Healthcare) just added a Visual Q&A feature, enabling doctors to query medical records using images like charts or x-rays directly (Google Cloud unveils new genAI and visual search tech at HIMSS25 | Healthcare IT News). Instead of relying only on text, the system analyzes visual inputs (e.g. a diagram from a patient intake form) to find relevant info. This multimodal search promises more comprehensive clinical answers – a boon when ~90% of healthcare data is imagery. (Powered under the hood by Google’s Gemini 2.0 multimodal model.)
    • Assistive Tech Retrieval: A new assistive technology search dataset (from an ECIR 2025 paper) showed that representing products with both text and image features helps find the best disability aids for users (Multimodal Feature Extraction for Assistive Technology: Evaluation and Dataset | SpringerLink). In tests, a multimodal model beat text-only or image-only approaches at predicting which assistive products meet certain user “goals.” Interestingly, for abstract criteria (like a user’s end-goal), single modalities sufficed, but for detailed specs the multimodal richness won out (Multimodal Feature Extraction for Assistive Technology: Evaluation and Dataset | SpringerLink). Why it matters: This real-world study (built from a production AT database) underscores how multimodal retrieval can improve recommendations for special-needs equipment – an area aligned with inclusive design and UN accessibility goals. (A minimal sketch of this kind of text+image retrieval follows below.)
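    The assistive-tech result boils down to fusing image and text features into a single product representation before ranking. Here’s a minimal, illustrative sketch of that idea – not the paper’s actual pipeline – assuming the sentence-transformers clip-ViT-B-32 checkpoint, hypothetical product images and descriptions, and plain concatenation as the fusion step:

```python
# Hedged sketch: text+image retrieval via concatenated CLIP embeddings.
# Model choice, file names, and the fusion strategy are assumptions for
# illustration, not the method from the cited paper.
import numpy as np
from PIL import Image
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("clip-ViT-B-32")  # embeds both images and text (512-dim each)

def product_embedding(image_path: str, description: str) -> np.ndarray:
    """Fuse one product's image and text features by concatenation."""
    img_vec = model.encode(Image.open(image_path))
    txt_vec = model.encode(description)
    return np.concatenate([img_vec, txt_vec])

# Index a few hypothetical assistive-technology products.
catalog = {
    "grab-rail": product_embedding("grab_rail.jpg", "Wall-mounted bathroom grab rail"),
    "reacher": product_embedding("reacher.jpg", "Long-handled reaching aid with trigger grip"),
}

# Text-only query: zero out the image half so dimensions still line up.
query_vec = np.concatenate([
    np.zeros(512, dtype=np.float32),
    model.encode("help picking objects up off the floor"),
])

for name, vec in catalog.items():
    print(name, float(util.cos_sim(query_vec, vec)))
```

    Concatenation is the crudest fusion strategy; learned fusion layers or late-interaction scoring generally do better, but the retrieval loop looks the same.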
    📈 Trends & Predictions

    • Small & Modular is Big: Rather than simply scaling up parameter counts, the community is embracing efficient architectures. Meta’s Llama 4 uses mixture-of-experts to boost capacity by activating subsets of weights, and Phi-4-Multimodal keeps a base model frozen and adds LoRA adapters per modality – a technique that outperforms more brute-force approaches ([2503.01743] Phi-4-Mini Technical Report: Compact yet Powerful Multimodal Language Models via Mixture-of-LoRAs). Expect more such modular tricks (MoE, adapters, distillation) to deliver multimodal prowess without requiring trillion-parameter models. (A sketch of the adapter-per-modality pattern follows this list.)
    • Edge AI and Accessibility: A clear push is on to make multimodal AI run anywhere. From “smol” models by Hugging Face to optimized encoders, new releases aim to be **accessible on lower-spec hardware** (Hugging Face claims world’s smallest vision language models). This trend means we’ll see multimodal assistants in mobile apps, IoT cameras, AR glasses, etc., bringing vision-and-language understanding to devices that previously couldn’t handle large models. (A hedged example of running one of these small VLMs locally also follows this list.)
    • Robustness & Evaluation Focus: As multimodal systems deploy in high-stakes settings, there’s rising concern with their reliability. Recent work exposing poisoning vulnerabilities (One Pic is All it Takes: Poisoning Visual Document Retrieval Augmented Generation with a Single Image) and studies on how well multimodal models truly “understand” (e.g. aligning with human brain semantics) are spurring efforts to harden these models. We anticipate more benchmarks and defenses around factuality, security, and bias in multimodal AI – not just raw performance.
    • Enterprise Adoption Accelerates: Big players are investing in domain-specific multimodal AI. This week’s example: Google’s visual search in healthcare showing strong industry interest (Google Cloud unveils new genAI and visual search tech at HIMSS25 | Healthcare IT News). We foresee finance, security, education, and other sectors following suit with custom multimodal solutions (often retrieval-augmented). The presence of multimodal features in cloud platforms and enterprise software signals that “multimodal everything” could be the next competitive differentiator in AI products.
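    To make the “frozen backbone + adapter per modality” idea concrete, here’s a hedged sketch using the Hugging Face PEFT library with GPT-2 as a stand-in backbone – an illustration of the general pattern, not Microsoft’s actual Phi-4-Multimodal implementation:

```python
# Hedged sketch: one frozen language backbone, multiple named LoRA adapters.
# GPT-2 and the adapter names are stand-ins chosen for illustration.
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

base = AutoModelForCausalLM.from_pretrained("gpt2")
for p in base.parameters():
    p.requires_grad = False  # the shared backbone stays frozen

lora_cfg = LoraConfig(r=8, lora_alpha=16, target_modules=["c_attn"], task_type="CAUSAL_LM")

# One adapter per modality: only the small low-rank matrices are trainable.
model = get_peft_model(base, lora_cfg, adapter_name="vision")
model.add_adapter("speech", lora_cfg)

model.set_adapter("vision")   # route image-conditioned batches through the vision adapter
# ... train / run vision-conditioned inputs ...
model.set_adapter("speech")   # switch modalities without touching the frozen backbone
model.print_trainable_parameters()
```

    Because only the low-rank adapter weights train, each additional modality costs a few million parameters rather than a new multi-billion-parameter model.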
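    And for the edge/“smol” trend, this is roughly what running a small vision-language model on modest hardware looks like – a sketch assuming the HuggingFaceTB/SmolVLM-Instruct checkpoint, the transformers vision-to-sequence API, and a hypothetical local photo.jpg:

```python
# Hedged sketch: captioning one image with a ~2B-parameter VLM via transformers.
# The model id and sample image are assumptions for illustration; quantization
# or bfloat16 loading can shrink memory further on constrained devices.
from PIL import Image
from transformers import AutoProcessor, AutoModelForVision2Seq

model_id = "HuggingFaceTB/SmolVLM-Instruct"
processor = AutoProcessor.from_pretrained(model_id)
model = AutoModelForVision2Seq.from_pretrained(model_id)

image = Image.open("photo.jpg")  # hypothetical local image
messages = [{"role": "user",
             "content": [{"type": "image"},
                         {"type": "text", "text": "Describe this image in one sentence."}]}]
prompt = processor.apply_chat_template(messages, add_generation_prompt=True)

inputs = processor(text=prompt, images=[image], return_tensors="pt")
out = model.generate(**inputs, max_new_tokens=64)
print(processor.batch_decode(out, skip_special_tokens=True)[0])
```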

    🧩 Community + Shoutouts

    • Meta announces LlamaCon: Mark your calendars – LlamaCon, Meta’s first developer conference for generative AI (named after the Llama model series), is set for April 29 (This Week in AI: Maybe we should ignore AI benchmarks for now | TechCrunch). This reflects how large AI providers are actively building developer communities around their models. We expect to see demos, fine-tuning tutorials, and perhaps some Llama 4 surprises at the event. 🦙🎉
    • MAGMaR Workshop CFP: The 1st Workshop on Multimodal Augmented Generation via Retrieval (MAGMaR) is calling for papers (MAGMaR Workshop). Slated for ACL later this year, it focuses on exactly our newsletter’s theme: combining text, images, video, etc. with retrieval and generation. There’s even a shared task on video retrieval. Shoutout to scholars – submissions are open till May 1, 2025, for those pushing the envelope in multimodal RAG.
    • Llama 4 hype: The open release of Llama 4 caused an eruption on social media. Meta’s announcement (“a new era of natively multimodal AI”) racked up over 6,500 likes and hundreds of comments (Today is the start of a new era of natively multimodal AI… | AI at Meta). Developers across Reddit/Twitter immediately started experimenting – e.g. loading Llama 4 into chatbots and even running decentralized instances. 🚀 The enthusiasm highlights the growing impact of community-driven AI: when a powerful multimodal model is freely available, creators jump in to build cool demos and applications almost overnight.
    Ethan Steininger

    April 7, 2025 · 7 min read