
    Multimodal Monday #5: GPT-Image Drops, Security Pops
    Multimodal Monday

    A weekly pulse on everything multimodal—models, data, tools & community.


    🎯 Quick Take (TL;DR)

    • OpenAI GPT-Image 1 arrives in API: The model powering ChatGPT's viral image generation is now available to developers, enabling high-quality, professional-grade image creation directly in third-party apps. [Details ]
    • Baidu's Ernie 4.5 Turbo goes multimodal: The new LLM interprets pictures and videos while creating documents at 40% the cost of DeepSeek V3 and 25% of DeepSeek R1, accelerating multimodal adoption in the Chinese market. [Announcement ]
    • Meta drops a "Swiss Army knife for vision": New image/video encoders for vision-language and spatial understanding outperform InternVL3 and Qwen2.5-VL. [Details ]
    • Wan 2.1 offers free video generation: The platform now allows users to generate AI videos for free with a credit-based system for priority access, significantly lowering the barrier to entry for video creation. [Platform ]
    💡 Multimodal Researcher of the week: mxp.co/benlonnqvist

    🧠 Research Highlights

    • Meta's Perception Models redefine vision benchmarks: PerceptionLM (PLM) delivers state-of-the-art performance on vision-language tasks through a family of open and fully reproducible models, pushing forward research accessibility in multimodal understanding (a generic retrieval sketch follows at the end of this section). [GitHub ]
    • Cross-modal prompt injection exposes multimodal vulnerabilities: Researchers demonstrate how adversarial perturbations in visual or audio modalities can hijack multimodal agents to generate attacker-specified content, highlighting security challenges unique to multimodal systems. [Paper ]
    • Contour integration underlies human-like vision: New research shows humans excel at recognizing objects from fragmented contours while most AI models perform poorly until trained at massive scale (~5B dataset size). This integration bias—preferring directional fragments—appears to be a key mechanism for robust object recognition. [Paper ]
    💡 Research Highlight: mxp.co/contour-integration
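
    Meta's PerceptionLM and its companion vision encoders both target vision-language tasks such as zero-shot classification and image-text retrieval. As a rough illustration of what those benchmarks measure, here is a minimal sketch using the widely available CLIP model in Hugging Face transformers as a stand-in; it is not Meta's PerceptionLM or Perception Encoder API, and the image path and candidate captions are placeholders.

```python
# pip install torch transformers pillow
# Stand-in sketch: image-text matching with CLIP, not Meta's Perception Encoder API.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

image = Image.open("query.jpg")  # placeholder path
captions = [
    "a diagram of a sales funnel",
    "a photo of a dog catching a frisbee",
    "a screenshot of a spreadsheet",
]

# Encode the image and candidate captions into the same embedding space
inputs = processor(text=captions, images=image, return_tensors="pt", padding=True)
with torch.no_grad():
    outputs = model(**inputs)

# Higher similarity logits mean a better image-text match
probs = outputs.logits_per_image.softmax(dim=-1)[0]
for caption, p in zip(captions, probs.tolist()):
    print(f"{p:.2f}  {caption}")
```

    Retrieval-style evaluation works the same way: embed a corpus of images once, then rank them by similarity against each text query.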

    🛠️ Tools & Techniques

    • OpenAI GPT-Image 1: This natively multimodal model creates images across different styles, follows custom guidelines, leverages world knowledge, and renders text—all with built-in safety guardrails and C2PA watermarking. Pricing ranges from 2-19 cents per image depending on quality, with companies like Adobe, Figma, and Instacart already integrating it. [API Documentation ]
    • Meta's Vision Encoders: These large-scale vision encoder models deliver state-of-the-art performance on classification and retrieval tasks while producing strong general-purpose features that scale to downstream tasks. The architecture enables both vision-language capabilities and spatial understanding for object detection and similar tasks. [Repository ]
    • Dia-1.6B by Nari Labs: This open-weight (Apache 2.0) text-to-speech model generates highly realistic dialogue directly from transcripts, including speaker turns and non-verbal sounds like laughter and coughs. It supports voice cloning through audio prompts and runs in real time on enterprise GPUs, requiring about 10GB of VRAM (a minimal usage sketch follows the demo clip below). [Hugging Face ]
    Video: Dia-1.6B by Nari Labs vs. ElevenLabs vs. Sesame CSM-1B
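
    For readers who want to try Dia locally, the sketch below follows the usage pattern in the Nari Labs repository at the time of writing; the class and method names may have changed since, and the transcript and output path are illustrative.

```python
# pip install soundfile + the dia package from the Nari Labs repository
# Sketch based on the Nari Labs README at the time of writing; the API may have changed.
import soundfile as sf
from dia.model import Dia

# Downloads the ~1.6B-parameter weights from Hugging Face (roughly 10GB VRAM to run)
model = Dia.from_pretrained("nari-labs/Dia-1.6B")

# [S1]/[S2] mark speaker turns; parenthetical cues like (laughs) request non-verbal sounds
transcript = (
    "[S1] Have you heard the new open-weight dialogue model? "
    "[S2] I have. (laughs) It even does the laughing for you."
)

audio = model.generate(transcript)      # returns the synthesized waveform
sf.write("dialogue.wav", audio, 44100)  # Dia outputs 44.1 kHz audio
```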


    🏗️ Real-World Applications

    • Google's visual sales agent demo: At Google Cloud Next '25, Google demonstrated an AI agent capable of handling complex sales interactions with visual understanding, analyzing diagrams and responding naturally in real-time. The system showcases how multimodal AI is moving beyond simple Q&A to sophisticated business applications. [Keynote ]
    • Wan Video democratizes AI video creation: By offering free access to AI video generation with a credit-based system for priority access, Wan is making generative video accessible to creators and developers who previously faced high cost barriers, potentially accelerating innovation in video applications. [Platform ]
    • GPT-Image 1 drives industry adoption: Major companies including Adobe, Airtable, Wix, Instacart, GoDaddy, Canva, and Figma are already integrating OpenAI's image generation capabilities. Figma Design now lets users generate and edit images via the API, while Instacart is testing it for recipe images and shopping lists. [TechCrunch ]
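
    For teams following Figma and Instacart's lead, a basic call to the image endpoint looks roughly like the sketch below. It assumes the official openai Python SDK and an OPENAI_API_KEY in the environment; the prompt, size, and quality values are illustrative.

```python
# pip install openai
# Minimal sketch of generating an image with gpt-image-1 via the OpenAI API.
import base64
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

result = client.images.generate(
    model="gpt-image-1",
    prompt="Flat-lay photo of fresh ingredients for a weeknight pasta recipe",
    size="1024x1024",
    quality="medium",  # the quality tier is what drives the per-image price
)

# gpt-image-1 returns base64-encoded image data
image_bytes = base64.b64decode(result.data[0].b64_json)
with open("recipe.png", "wb") as f:
    f.write(image_bytes)
```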

    📈 Trends & Predictions

    • Multimodal agents evolve beyond Q&A: Google's visual sales agent demo represents a significant evolution in multimodal AI applications, moving from simple question answering to complex business interactions that combine visual understanding, reasoning, and natural dialogue. This signals a shift toward AI systems that can handle nuanced professional tasks requiring multiple cognitive abilities simultaneously. [Analysis ]
    • Open-source multimodal models gain momentum: With Meta's vision tools and Nari's Dia-1.6B, we're seeing increasingly powerful open-source alternatives to proprietary systems. This democratization trend is likely to accelerate innovation by allowing more developers to build on and customize sophisticated multimodal capabilities without prohibitive costs or licensing restrictions. [Trend ]

    🎉 Community & Shout-outs

    • SmolVLM Research Jam: Community researchers explored techniques for creating small and efficient multimodal models during a recent Genloop Research Jam, highlighting grassroots efforts to make VLMs more accessible on consumer hardware. [Discussion ]
    • "Whatever you do, don’t grow up" - Wan VFX Lora's in action. [Source]

    Mixpeek helps teams drop any of the above models (segmentation, video gen, color analysis) into production by wiring up feature extractors and retrieval pipelines, so next week’s breakthroughs plug in rather than force a rewrite.

    Philip Bankier

    April 28, 2025 · 3 min read