
    Multimodal Monday #4: From Pixels to Plans

    Visual CoT, video gen, and color benchmarks highlight this week's multimodal AI leaps—plus tools, papers, and real-world use cases.

    Multimodal Monday

    A weekly pulse on everything multimodal—models, data, tools & community.


    🎯 Quick Take (TL;DR)

    • OpenAI o3 and o4‑mini introduce visual chain‑of‑thought: ChatGPT can now manipulate and reason over images directly inside its reasoning chain, boosting accuracy on visual benchmarks. [OpenAI blog]
    • Wan 2.1‑FLF2V‑14B goes open‑source: The 14B video model supports text‑to‑video and first/last‑frame‑to‑video generation and is now freely available under Apache‑2.0. [Announcement]
    • ColorBench makes color a first‑class benchmark: New benchmark shows VLMs still fail subtle hue distinctions despite model scale and CoT prompting. [Paper]
    • Gemini 2.5 gains segmentation masks: Pro and Flash APIs now return PNG masks in JSON, making large-scale structured visual queries cheap and simple. [Demo & Analysis]
    • Kling AI 2.0 crosses 22 million users: Kling's generative video model outperforms Runway Gen-4 in blind comparisons and continues its mainstream momentum. [Business Insider]

    🛠️ Tools & Techniques

    • OpenAI o3 & o4‑mini Vision CoT: the models expose image‑manipulation functions (crop, rotate, zoom) within the prompt chain, letting developers build agents that inspect fine details or run step‑by‑step visual reasoning without extra vision code; see the chat‑completions sketch after this list. [OpenAI blog]
    • Gemini 2.5 Image Segmentation: the new mask field delivers base64 PNG silhouettes in the same JSON block as bounding boxes, so no second call or extra model is required. Developers can feed those masks directly into Mixpeek and instantly run color‑aware or region‑specific vector search over their image corpus, with no OpenCV hacks or custom post‑processing; a mask‑decoding sketch follows this list. [Docs]
    • BitNet b1.58 (2B parameters, ternary 1.58‑bit weights): by collapsing weights to {‑1, 0, +1}, Microsoft squeezes a competent LLM into under 500 MB of RAM while retaining competitive code, math and reasoning scores. That footprint means you can co‑locate the language core with a vision encoder on an edge box, enabling privacy‑preserving multimodal assistants inside kiosks, cars and wearables. [GitHub]
    • Wan 2.1‑FLF2V (first/last‑frame‑to‑video generation model): ready‑to‑run pipelines for text→video, image→video and first/last‑frame→video tasks. Because output frames arrive as standard PIL images or tensors, teams can pipe results straight into Mixpeek’s video‑to‑vector ingest for similarity search, captioning or downstream editing with no format wrangling; a text‑to‑video sketch follows this list. [Release]
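
    A minimal sketch of calling one of the new reasoning models with an image via the OpenAI Python SDK's chat‑completions endpoint. The image URL and question are placeholders, and the crop/zoom manipulation happens on the model's side rather than in your code:

```python
# Minimal sketch: send an image plus a question to a reasoning model and let it
# inspect/zoom into the picture during its chain of thought before answering.
# Assumes the OpenAI Python SDK (pip install openai) and OPENAI_API_KEY set;
# the image URL and question are placeholders.
from openai import OpenAI

client = OpenAI()

response = client.chat.completions.create(
    model="o4-mini",  # reasoning model with visual chain-of-thought
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "What does the small warning label in the corner say?"},
                {"type": "image_url", "image_url": {"url": "https://example.com/machine-panel.jpg"}},
            ],
        }
    ],
)

print(response.choices[0].message.content)
```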
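    And a minimal sketch of handling Gemini 2.5's segmentation output. It assumes the model was prompted to return a JSON list with box_2d, label and mask fields (the mask as a data‑URI PNG, per Google's segmentation prompting examples; your field names may differ), and it simulates one such entry locally so the decoding step runs as‑is:

```python
# Minimal sketch: decode the base64 PNG masks Gemini 2.5 can return alongside
# bounding boxes. The JSON schema below (box_2d / label / mask) is an assumption
# based on Google's segmentation prompting examples.
import base64
import io
import json

from PIL import Image

# Simulate one entry of the JSON the model returns (mask as a data-URI PNG).
demo_mask = Image.new("L", (64, 64), 255)
buf = io.BytesIO()
demo_mask.save(buf, format="PNG")
demo_b64 = base64.b64encode(buf.getvalue()).decode()

response_json = json.dumps([{
    "box_2d": [120, 80, 640, 560],
    "label": "red mug",
    "mask": f"data:image/png;base64,{demo_b64}",
}])

for item in json.loads(response_json):
    _, b64_png = item["mask"].split(",", 1)                   # strip the data-URI prefix
    mask = Image.open(io.BytesIO(base64.b64decode(b64_png)))  # grayscale mask image
    print(item["label"], item["box_2d"], mask.size)
    # resize the mask to its bounding box, then hand it to a region-aware indexer
```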
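    Finally, a minimal text‑to‑video sketch for Wan 2.1. It assumes the checkpoints are loaded through Hugging Face diffusers' WanPipeline; the Hub repo id, class names and settings are assumptions based on that integration and may differ from the official release scripts:

```python
# Minimal text-to-video sketch for Wan 2.1, assuming the diffusers integration
# (pip install diffusers transformers accelerate) and a large GPU.
import torch
from diffusers import AutoencoderKLWan, WanPipeline
from diffusers.utils import export_to_video

model_id = "Wan-AI/Wan2.1-T2V-14B-Diffusers"  # assumed Hub repo id
vae = AutoencoderKLWan.from_pretrained(model_id, subfolder="vae", torch_dtype=torch.float32)
pipe = WanPipeline.from_pretrained(model_id, vae=vae, torch_dtype=torch.bfloat16).to("cuda")

frames = pipe(
    prompt="A paper boat drifting down a rain-soaked street, cinematic lighting",
    num_frames=81,          # roughly 5 seconds at 16 fps
    guidance_scale=5.0,
).frames[0]                 # list of PIL images, ready for downstream ingest

export_to_video(frames, "wan_t2v.mp4", fps=16)
```

    The FLF2V checkpoint layers first/last‑frame conditioning on top of the same family of pipelines; check the release notes for its exact pipeline class and arguments.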

    🧠 Research Highlights

    • TinyLLaVA‑Video‑R1 (3B parameters): after group relative policy optimization (GRPO) on NextQA data, this 3B model leaps ~8 pp on MVBench and MMVU. Its generated <think> traces show multi‑step temporal reasoning, challenging the notion that video logic requires heavyweight 7–8B backbones; a sketch of the group‑relative advantage step follows this list. [Code & Paper]
    • Geo4D: Oxford & Naver researchers fine‑tune a DynamiCrafter video diffusion model to jointly predict depth, point and ray maps, then fuse them with a sliding‑window optimiser. The system outperforms monocular baselines like MonST3R on Sintel and KITTI depth while reconstructing full 4‑D geometry without NeRF‑style per‑scene training. [Paper]
    • ColorBench evaluation: 32 VLMs (GPT‑4o, Gemini 2.5, LLaVA‑Next) are grilled on color perception, reasoning and robustness. Larger language heads help, but absolute scores hover in the mid‑60% range. Chain‑of‑thought prompting adds 4–6 pp of robustness but still leaves hue‑based traps unresolved. [Paper]
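
    GRPO's core trick is to score each sampled answer against the statistics of its own sampling group rather than a learned value function. A minimal, illustrative sketch of that advantage computation (the reward values below are made up):

```python
# Minimal sketch of GRPO's group-relative advantage: each sampled completion is
# normalized against the mean/std of rewards within its own group, so no value
# network is needed. Reward values are illustrative.
import torch

def group_relative_advantages(rewards: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    """rewards: (num_prompts, group_size) -> advantages of the same shape."""
    mean = rewards.mean(dim=1, keepdim=True)
    std = rewards.std(dim=1, keepdim=True)
    return (rewards - mean) / (std + eps)

# 2 prompts, 4 sampled completions each (1 = correct answer, 0 = wrong)
rewards = torch.tensor([[1.0, 0.0, 0.0, 1.0],
                        [0.0, 0.0, 1.0, 0.0]])
print(group_relative_advantages(rewards))
# Positive advantages up-weight the completions (and their <think> traces)
# that beat their group's average.
```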

    🏗️ Real‑World Applications

    • Google Veo 2 in Gemini Advanced: paying users can now render 8‑second, 720p cinematic clips from text, or convert a single still into motion via Whisk Animate. Each video ships with SynthID watermarks and is capped to PG‑13 prompts, reflecting Google’s steady, safety‑first rollout of generative media.
    • Illumina × Tempus AI alliance: the partnership will merge Tempus’s 56‑petabyte multimodal clinical data lake (DNA, radiology, pathology, EHR) with Illumina’s sequencing models. The goal is end‑to‑end diagnostic predictors that can flag cardiac or neuro risks from a fusion of genomic variants and historical scans, an industrial‑scale test of multimodal healthcare AI.
    • Perceive ➜ Reason ➜ Create, end‑to‑end: Geo4D demonstrates that perception models can output structured 3‑D geometry, OpenAI’s visual CoT performs in‑context reasoning over those visuals, and Wan 2.1 turns abstract instructions into new video. Together they hint at assistants that ingest pixels and emit decisions or media without juggling separate OCR, planner and generator models. Imagine an app that inspects a broken shelf, reasons through repair steps and returns an annotated video tutorial.
    • Velocity as competitive edge: lightweight cores like BitNet paired with Mixpeek’s hot‑swappable extractors mean a team can A/B a new vision backbone on Monday and ship to prod by Friday. In domains like personalized shopping or micro‑learning, that iteration speed will trump massive but slower‑moving monoliths.

    🎉 Community & Shout‑outs

    • Pet‑morph mania: TikTok and X feeds are awash with “turn my Doberman into a nightclub bouncer” images generated via ChatGPT’s visual tools, evidence that frontier multimodal features leap from lab demo to meme template in record time [Mashable].
    • AutoGPT agent boom: The open‑source AutoGPT repository just rocketed past 175,000 GitHub stars, with builders wiring it into web browsers, vision encoders and trading bots, a sign of the community’s hunger for autonomous, multimodal agents that plan and execute tasks end‑to‑end [GitHub].

    Mixpeek helps teams drop any of the above models (segmentation, video gen, color analysis) into production by wiring up feature extractors and retrieval pipelines, so next week’s breakthroughs plug in instead of forcing a rewrite.

    Philip Bankier

    April 21, 2025 · 4 min read