
    Multimodal Monday 35: Small Models, Modular Vision

    Week of Nov 24-30, 2025: Alibaba's 6B Z-Image impresses, Tencent's 1B HunyuanOCR beats larger models and APIs, VisionRAG uses 6-9x less memory than ColPali, and RynnVLA-002 boosts real-world robot success by 50%.


    Quick Hits (TL;DR)

    Small models are closing the gap - A 6B image model matches commercial giants. A 1B OCR model beats larger competitors and paid APIs. VisionRAG uses 6-9x less memory than ColPali while matching performance. You no longer need massive models to get state-of-the-art results.

    Vision models become modular components - Microsoft’s framework treats vision models as plug-and-play “eyes” for language models. You can upgrade the vision component without retraining the entire system. This makes AI systems cheaper to improve and easier to maintain.

    World models handle real robots - Alibaba’s RynnVLA-002 boosts real-world robot task success by 50%. GigaAI’s GigaWorld-0 trains robots on simulated data that transfers to physical tasks. The simulation-to-reality gap is shrinking fast.

    Research Highlights

    VisionRAG: 3-Pass Pyramid Indexing for Vision-Enhanced Document Retrieval

    Inception AI built an OCR-free document retrieval system that matches ColPali’s performance using 6-9x less memory. The system uses pyramid indexing to extract semantic information at multiple levels, producing only 17-27 compact vectors per page instead of 1024.

Evolution of document retrieval approaches. (Top) OCR-based RAG flattens visual structure, losing layout and table context. (Middle) ColPali adds vision awareness via dense patch embeddings (~1,024 vectors/page) but at high cost. (Bottom) VisionRAG's 3-pass pyramid indexing keeps visual structure with only 17-27 compact vectors per page.

    Why it matters: Large-scale multimodal search becomes feasible when you can index 40x more documents on the same hardware.
    Links: Paper
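
To see why the vector count matters, here is a back-of-envelope index-size comparison. The dimensions are assumptions (128-dim float16 vectors for both schemes); the paper's actual vector sizes, and therefore its reported 6-9x memory figure, will differ, so this only illustrates the vector-count effect.

```python
# Back-of-envelope index size, assuming 128-dim float16 vectors for both
# schemes (illustrative numbers, not the paper's actual configuration).

BYTES_PER_DIM = 2          # float16
DIM = 128                  # assumed embedding dimension
PAGES = 1_000_000          # hypothetical corpus size

def index_size_gb(vectors_per_page: int) -> float:
    """Raw vector storage for the corpus, ignoring metadata and ANN overhead."""
    return PAGES * vectors_per_page * DIM * BYTES_PER_DIM / 1e9

colpali_gb = index_size_gb(1024)   # dense patch embeddings (~1,024/page)
visionrag_gb = index_size_gb(27)   # pyramid indexing, upper end of 17-27/page

print(f"ColPali-style index:   {colpali_gb:,.0f} GB")   # ~262 GB
print(f"VisionRAG-style index: {visionrag_gb:,.0f} GB")  # ~7 GB
print(f"Vector-count ratio:    {1024 / 27:.0f}x")         # ~38x
```

At these assumed sizes, cutting ~1,024 vectors per page to 27 shrinks raw vector storage by roughly 38x, which is where the "index 40x more documents on the same hardware" intuition comes from.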

    Be My Eyes: Extending Large Language Models to New Modalities

    Microsoft and USC researchers built a framework where vision models act as external “eyes” for language models through multi-agent collaboration. You can swap out the vision component without expensive retraining of the full architecture.

    Why it matters: Modular AI systems cost less to improve and adapt faster to new capabilities.
    Links: Paper
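
A minimal sketch of the plug-and-play idea, with stubbed components rather than the paper's actual agents or API: the language model depends only on a narrow "eye" interface, so the vision module can be swapped without touching the rest of the system.

```python
# Stubbed sketch of the plug-and-play pattern (not the paper's API): the LLM
# only sees text observations from whatever "eye" it is handed.

from typing import Protocol

class VisionEye(Protocol):
    def describe(self, image_path: str, question: str) -> str:
        """Return a textual observation the language model can reason over."""
        ...

class StubCaptionEye:
    """Placeholder for any off-the-shelf captioning/VQA model wrapper."""
    def describe(self, image_path: str, question: str) -> str:
        return f"(caption of {image_path}, focused on: {question})"

class StubLLM:
    """Placeholder for the frozen language model."""
    def generate(self, prompt: str) -> str:
        return f"Answer based on -> {prompt}"

def answer_with_eyes(llm: StubLLM, eye: VisionEye, image_path: str, question: str) -> str:
    observation = eye.describe(image_path, question)
    prompt = f"Vision module observation: {observation}\nQuestion: {question}"
    return llm.generate(prompt)

# Upgrading vision is a one-argument change; the LLM is never retrained.
print(answer_with_eyes(StubLLM(), StubCaptionEye(), "chart.png", "What is the trend?"))
```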

    E2E-GRec: An End-to-End Joint Training Framework for GNNs and Recommender Systems

    TikTok’s framework integrates Graph Neural Networks directly into industrial recommender systems with joint training. This eliminates repeated offline GNN inference and enables true end-to-end optimization.

    Why it matters: Graph-based recommendations become practical at scale when inference overhead disappears.
    Links: Paper
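
A toy PyTorch sketch of what "end-to-end joint training" means here (illustrative layers and shapes, not TikTok's production design): the recommendation loss backpropagates through the GNN, so there is no separate offline GNN inference stage.

```python
# Toy joint-training sketch: one loss, one optimizer, gradients flow through
# both the GNN and the recommender scorer.

import torch
import torch.nn as nn

class SimpleGNNLayer(nn.Module):
    def __init__(self, dim: int):
        super().__init__()
        self.proj = nn.Linear(dim, dim)

    def forward(self, node_feats: torch.Tensor, adj: torch.Tensor) -> torch.Tensor:
        # One round of mean-aggregation message passing.
        deg = adj.sum(dim=1, keepdim=True).clamp(min=1)
        neighbor_mean = adj @ node_feats / deg
        return torch.relu(self.proj(node_feats + neighbor_mean))

class JointRecModel(nn.Module):
    def __init__(self, num_nodes: int, dim: int = 32):
        super().__init__()
        self.embed = nn.Embedding(num_nodes, dim)
        self.gnn = SimpleGNNLayer(dim)
        self.scorer = nn.Sequential(nn.Linear(2 * dim, 64), nn.ReLU(), nn.Linear(64, 1))

    def forward(self, adj, user_ids, item_ids):
        node_feats = self.gnn(self.embed.weight, adj)   # graph-refined embeddings
        pair = torch.cat([node_feats[user_ids], node_feats[item_ids]], dim=-1)
        return self.scorer(pair).squeeze(-1)             # engagement logit

num_nodes = 100
model = JointRecModel(num_nodes)
adj = (torch.rand(num_nodes, num_nodes) < 0.05).float()
users = torch.randint(0, num_nodes, (256,))
items = torch.randint(0, num_nodes, (256,))
labels = torch.randint(0, 2, (256,)).float()

opt = torch.optim.Adam(model.parameters(), lr=1e-3)
opt.zero_grad()
loss = nn.functional.binary_cross_entropy_with_logits(model(adj, users, items), labels)
loss.backward()   # gradients reach the GNN and embeddings directly
opt.step()
```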

    The Image as Its Own Reward: Reinforcement Learning with Adversarial Reward

    ByteDance uses the generated image itself as the reward signal through adversarial training. Human evaluators prefer these results over baseline methods 70% of the time in head-to-head comparisons.

    Why it matters: This solves the reward hacking problem that prevented RL from improving image quality.
    Links: Paper
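
A GAN-style toy sketch of the core idea, not the paper's exact formulation: the reward comes from a discriminator that keeps training against the generator, so there is no frozen reward model for the generator to overfit or hack. "Images" are plain vectors here for brevity.

```python
# Adversarial reward in miniature: the discriminator is a moving target,
# and its score on each generation is the reward signal.

import torch
import torch.nn as nn

dim = 16
generator = nn.Sequential(nn.Linear(dim, 64), nn.ReLU(), nn.Linear(64, dim))
discriminator = nn.Sequential(nn.Linear(dim, 64), nn.ReLU(), nn.Linear(64, 1))
g_opt = torch.optim.Adam(generator.parameters(), lr=1e-4)
d_opt = torch.optim.Adam(discriminator.parameters(), lr=1e-4)

real = torch.randn(256, dim)   # stand-in for reference images

for step in range(100):
    fake = generator(torch.randn(256, dim))

    # 1) Discriminator update: tell reference samples apart from generations.
    d_loss = nn.functional.binary_cross_entropy_with_logits(
        discriminator(real), torch.ones(256, 1)
    ) + nn.functional.binary_cross_entropy_with_logits(
        discriminator(fake.detach()), torch.zeros(256, 1)
    )
    d_opt.zero_grad(); d_loss.backward(); d_opt.step()

    # 2) Generator update: maximize the (constantly retrained) discriminator
    #    score, so there is no fixed target to exploit.
    reward = discriminator(fake)
    g_loss = -reward.mean()
    g_opt.zero_grad(); g_loss.backward(); g_opt.step()
```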

    MIRA: Multimodal Iterative Reasoning Agent for Image Editing

    IBM’s agent uses iterative reasoning to plan and execute image edits. The system breaks down complex editing tasks into steps and refines results through multiple passes.

    Why it matters: Reasoning-based approaches produce more consistent edits than single-pass methods.
    Links: Project Page | Paper
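
A stubbed sketch of an iterative plan-edit-critique loop, not IBM's actual agent: the instruction is decomposed into steps, each step is applied, and a judge decides whether another refinement pass is needed.

```python
# Stubbed iterative editing loop; every component here is a placeholder.

def plan_edits(instruction: str) -> list[str]:
    # Stand-in for an LLM planner that decomposes the instruction.
    return [f"step: {instruction}"]

def apply_edit(image, step: str):
    # Stand-in for an image-editing model call.
    return image + [step]

def critique(image, instruction: str) -> bool:
    # Stand-in for a VLM judge; True when the edit satisfies the goal.
    return len(image) >= 2

def iterative_edit(image, instruction: str, max_passes: int = 3):
    for _ in range(max_passes):
        for step in plan_edits(instruction):
            image = apply_edit(image, step)
        if critique(image, instruction):
            break   # good enough, stop refining
    return image

print(iterative_edit([], "remove the lamp and warm up the lighting"))
```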

    Canvas-to-Image: A unified framework for compositional image generation. Links: Project Page | Paper

    Inferix: A next-gen inference engine for immersive world simulation. Links: GitHub | Paper

    MedSAM3: Segment Anything with Medical Concepts. Links: Paper

    OpenMMReasoner: Pushing the Frontiers for Multimodal Reasoning. Links: Paper

    SpaceMind: Camera-Guided Modality Fusion for Spatial Reasoning. Links: Paper

    Tools, Models and Techniques

    Z-Image

    Alibaba released a 6B parameter image generation model that competes with commercial systems. It produces photorealistic images and handles bilingual text rendering at a quality level comparable to leading paid services.


    Why it matters: You get commercial-grade image generation without a license fee.
    Links: Website | Hugging Face | ComfyUI | Announcement
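
A hedged loading sketch with Hugging Face diffusers. The repo id below is a placeholder, and the model may ship its own pipeline class or need trust_remote_code, so check the linked Hugging Face page before copying this.

```python
# Hedged sketch: generic diffusers loading, placeholder repo id, assumes a GPU.

import torch
from diffusers import DiffusionPipeline

pipe = DiffusionPipeline.from_pretrained(
    "Tongyi-MAI/Z-Image",          # placeholder repo id; see the model card
    torch_dtype=torch.bfloat16,
)
pipe.to("cuda")

image = pipe(
    prompt="a rainy Shanghai street at night, neon signs with legible Chinese and English text",
    num_inference_steps=30,
).images[0]
image.save("z_image_sample.png")
```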

    Claude Opus 4.5

    Anthropic’s Claude Opus 4.5 sets new benchmarks for coding, agents, and computer use. The model handles complex reasoning tasks and multi-step workflows better than previous versions.

    Why it matters: You can build more capable agents and automation systems.
    Links: Blog Post | Announcement
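
A quick sketch of calling it through Anthropic's Python SDK; the model identifier below is an assumption, so confirm it against Anthropic's model list.

```python
# Hedged sketch using Anthropic's messages API; the model id is assumed.

import anthropic

client = anthropic.Anthropic()   # reads ANTHROPIC_API_KEY from the environment

response = client.messages.create(
    model="claude-opus-4-5",     # assumed identifier; check the docs
    max_tokens=1024,
    messages=[{
        "role": "user",
        "content": "Plan the refactor of a Flask app into FastAPI, step by step.",
    }],
)
print(response.content[0].text)
```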

    Flux.2

    Black Forest Labs updated Flux with a new interface, additional filters, and expanded editing tools. The release includes improved workflows for image creation and manipulation.

    Why it matters: Better tools mean faster iteration on image editing projects.
    Links: Blog Post | Announcement

    HunyuanOCR

    Tencent’s 1B parameter vision-language model handles OCR tasks better than commercial APIs and larger models like Qwen3-VL-4B. It achieves state-of-the-art results on OCRBench for models under 3B parameters.

    Why it matters: You get better OCR quality while using less compute and paying nothing.
    Links: Technical Report | Model | Demo
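
A hedged transformers sketch for running OCR with a small vision-language model. The repo id, model classes, and prompt format below are assumptions; follow the model card, since many VLMs need trust_remote_code and their own chat template.

```python
# Hedged sketch: generic VLM-style OCR call; names and prompt format assumed.

import torch
from PIL import Image
from transformers import AutoProcessor, AutoModelForCausalLM

model_id = "tencent/HunyuanOCR"   # placeholder repo id; see the model card
processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, trust_remote_code=True
).to("cuda")

image = Image.open("invoice.png")
inputs = processor(
    images=image,
    text="Extract all text from this document.",
    return_tensors="pt",
).to("cuda")

output_ids = model.generate(**inputs, max_new_tokens=512)
print(processor.batch_decode(output_ids, skip_special_tokens=True)[0])
```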

    RynnVLA-002

Alibaba’s unified vision-language-action and world model combines robot action generation with environment dynamics prediction. It achieves 97.4% success on the LIBERO simulation benchmark and boosts real-world LeRobot task performance by 50%.

    Why it matters: Robots can learn from simulation and transfer that knowledge to physical tasks.
    Links: Paper | Model
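
A conceptual, stubbed sketch of what a unified VLA-plus-world-model enables (not Alibaba's architecture): one component maps observation and instruction to an action, another predicts the next observation, and together they support imagined rollouts before the robot acts on real hardware.

```python
# Stubbed VLA + world-model loop; every function is a placeholder.

def predict_action(observation, instruction):
    # Stand-in for the VLA head: image + language -> motor command.
    return {"dx": 0.01, "dy": 0.0, "gripper": "close"}

def predict_next_observation(observation, action):
    # Stand-in for the world-model head: current state + action -> next state.
    return observation + 1

def imagined_rollout(observation, instruction, horizon: int = 5):
    trajectory = []
    for _ in range(horizon):
        action = predict_action(observation, instruction)
        observation = predict_next_observation(observation, action)
        trajectory.append((action, observation))
    return trajectory

print(imagined_rollout(observation=0, instruction="pick up the red block"))
```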

    Vidi2: A 12B multimodal model for video understanding and creation. Links: Website | Paper | GitHub

    GigaWorld-0: A unified world model acting as a data engine for VLA learning. Links: Paper | Demo

    Adv-GRPO: An RL framework for image generation that uses adversarial rewards to combat reward hacking. Links: Paper | Model

    The Shift to Efficiency

    Model efficiency stopped being a nice-to-have. It became the competitive advantage.

    VisionRAG proves this. Instead of generating 1024 vectors per page like ColPali, it uses pyramid indexing to create 17-27 vectors. Same performance. 6-9x less memory. You can index 40 times more content on the same hardware. This makes large-scale multimodal search practical for companies that couldn’t afford it before.

    Z-Image proves it again. This 6B model competes with commercial systems that cost money and use far more parameters. Alibaba didn’t optimize for bigger. They optimized for smarter architecture and better training methods.

    HunyuanOCR makes the same point. Just 1B parameters. Beats larger models. Beats commercial APIs. Runs on devices, not data centers.

    Modular AI Systems

    Microsoft’s “Be My Eyes” framework changes how we build multimodal systems. Vision models become plug-and-play components. You swap them out without retraining the entire architecture.

This matters because retraining joint vision-language models costs millions. Modular systems slash that cost. You upgrade individual components as better versions release. Your system improves faster and cheaper.

    This pattern will spread. We’ll see modular audio components, modular video understanding, modular reasoning modules. AI systems will look less like monolithic models and more like composable toolchains.

    Community + Shoutouts

fofr’s Nano Banana Pro Guide Shoutout to fofr for this excellent guide to prompting Nano Banana Pro. A must-read for anyone who wants to get the most out of the model. Links: Blog Post

Helpful Nano Banana Pro Workflow Shoutout to techhalla for this practical Nano Banana Pro workflow. A great way to get started with the tool. Links: Post


FLUX.2 vs. Qwen Image Edit 2509 vs. Gemini 3 Pro Shoutout to BoostPixels for this thorough comparison of FLUX.2, Qwen Image Edit 2509, and Gemini 3 Pro. Useful for anyone deciding which image editing tool to use. Links: Reddit


    Consistent AI Avatar Shoutout to Jeff Dotson for this impressive consistent AI avatar. A great example of what is possible with the latest generation of AI tools. Links: Post


That’s a wrap for Multimodal Monday #35! Z-Image takes the community by storm at only 6B parameters. HunyuanOCR shows 1B models beat larger competitors and paid APIs. VisionRAG cuts memory usage 6-9x while matching ColPali performance. Microsoft’s “Be My Eyes” framework makes vision components modular and swappable. RynnVLA-002 boosts real-world robot task success by 50%. GigaWorld-0 trains robots on simulated data that transfers to physical tasks. TikTok’s E2E-GRec integrates GNNs into production recommender systems. ByteDance’s adversarial reward framework solves reward hacking in image generation. The pattern is clear: smaller models with better architecture beat bigger models with more compute. Efficiency became the competitive advantage.

    Ready to build multimodal solutions that actually work? Let's talk.