
    🧠 Multimodal Monday #1 - State of the Stack


    This week in multimodal AI:

    • Researchers introduce new methods that replace embeddings with discrete IDs for faster cross-modal search
    • New video retrieval systems show improvements from integrating audio with visual cues
    • Tutorials for building multimodal RAG applications
    • Multimodal use cases expand across healthcare and e-commerce
    • Cool demos showcase visual reasoning and multimodal agents

    🧠 Research Highlights

    The example below shows how to use BGE-VL to encode a combined image + text query and rank candidate images by similarity:
    # Multimodal Retrieval with BGE-VL
    # Demonstrates how to use BGE-VL for image + text queries
    
    import torch
    from transformers import AutoModel
    
    # Load the model - choose from base or large versions
    model = AutoModel.from_pretrained("BAAI/BGE-VL-base", trust_remote_code=True)
    model.set_processor("BAAI/BGE-VL-base")  # Initialize the processor
    model.eval()  # Set to evaluation mode
    
    # Example: Combined image + text query
    # The power of multimodal retrieval is combining both modalities
    with torch.no_grad():
        # Encode a query using both image and text instruction
        query_embedding = model.encode(
            images="./product_image.jpg", 
            text="Find this in blue color with leather material"
        )
        
        # Encode candidate images from your database
        candidate_embeddings = model.encode(
            images=["./candidate1.jpg", "./candidate2.jpg", "./candidate3.jpg"]
        )
        
        # Calculate similarity scores
        similarity_scores = query_embedding @ candidate_embeddings.T
        
        # Get the most similar item
        best_match_idx = torch.argmax(similarity_scores).item()
        print(f"Best match: candidate{best_match_idx+1}.jpg with score {similarity_scores[0][best_match_idx]:.4f}")

    🛠️ Tools & Techniques

    • RAP (Retrieval-Augmented Personalization) – A new open-source library (with a CVPR’25 paper) that lets you inject personalized knowledge into a multimodal LLM via retrieval (GitHub - Hoar012/RAP-MLLM: [CVPR 2025] RAP: Retrieval-Augmented Personalization). How you might use it: have a vision-language model “remember” custom concepts (e.g. who your family members are in photos) by feeding it a private image-text database, enabling personalized Q&A or content generation without retraining; a rough sketch of the idea follows below.
    [Figure: RAP-MLLM multimodal RAG flow]
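
    To make this concrete, here is a minimal sketch of the retrieval-augmented personalization idea – not the RAP-MLLM API itself. It uses off-the-shelf CLIP embeddings to index a hypothetical personal photo database, retrieves the closest entries for a new image, and injects the retrieved descriptions into the prompt of whatever vision-language model you use. All file paths and database entries are made up for illustration.

    # Minimal sketch of retrieval-augmented personalization (NOT the RAP-MLLM API).
    # Index a private image/text database with CLIP, retrieve the closest personal
    # concepts for a new photo, and inject them into the prompt of your MLLM.
    import torch
    from PIL import Image
    from transformers import CLIPModel, CLIPProcessor

    clip = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
    processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")
    clip.eval()

    # Hypothetical personal database: (image path, description of who/what it shows)
    personal_db = [
        ("./family/alice.jpg", "Alice, my sister"),
        ("./family/bob.jpg", "Bob, my father"),
        ("./pets/rex.jpg", "Rex, our golden retriever"),
    ]

    def embed_images(paths):
        """Return L2-normalized CLIP image embeddings for a list of file paths."""
        images = [Image.open(p).convert("RGB") for p in paths]
        inputs = processor(images=images, return_tensors="pt")
        with torch.no_grad():
            feats = clip.get_image_features(**inputs)
        return feats / feats.norm(dim=-1, keepdim=True)

    db_embeddings = embed_images([path for path, _ in personal_db])

    # Retrieve the two closest personal concepts for a new photo
    query_embedding = embed_images(["./new_photo.jpg"])
    scores = (query_embedding @ db_embeddings.T).squeeze(0)
    top_indices = torch.topk(scores, k=2).indices.tolist()
    retrieved_facts = [personal_db[i][1] for i in top_indices]

    # Inject the retrieved knowledge into the multimodal LLM prompt
    prompt = (
        "People and pets that appear in my photos: " + "; ".join(retrieved_facts) + "\n"
        "Question: Who is in this photo and what are they doing?"
    )
    print(prompt)  # send this prompt plus ./new_photo.jpg to your vision-language model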

    🏗️ Real-World Applications

    • “All-in-One” models: There’s a clear movement toward unified models that handle many modalities simultaneously. The debut of Qwen2.5-Omni (vision + audio + text in one) and other efforts (e.g. Google’s Gemini vision-language upgrades) suggest that future AI stacks will favor integrated multimodal understanding over siloed models – all while keeping model size manageable (7B parameters in Qwen’s case, optimized for edge deployment; Alibaba Cloud Releases Qwen2.5-Omni-7B An End-to-end Multimodal AI Model - Alibaba Cloud Community).
    • Retrieval-augmented multimodality: Many updates this week highlight retrieval as a crucial component of multimodal systems. From using external knowledge to ground image answers, to personalization via retrieved user data (RAP), to generative retrieval replacing indexes – combining search with multimodal models is becoming a standard strategy to boost accuracy and controllability.
    • Efficiency and scalability: A growing theme is making multimodal models and searches more efficient. Approaches like GENIUS avoid costly nearest-neighbor search by generating discrete IDs on the fly (GENIUS: A Generative Framework for Universal Multimodal Search), and new architectures (e.g. Qwen’s blockwise encoders and Thinker-Talker design) enable streaming inputs/outputs without giant compute overhead ([2503.20215] Qwen2.5-Omni Technical Report). We foresee a push toward resource-friendly multimodal AI that can run in real time and at scale; see the toy sketch below for why ID-based retrieval can be cheaper.
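
    As a rough illustration of why generating IDs can beat scanning embeddings, here is a toy comparison – this is not the GENIUS method, just the access-pattern difference it exploits. Exhaustive similarity search touches every candidate vector, while ID-based retrieval reduces a query to a dictionary lookup once items are assigned discrete codes; the sign-bit hashing used to assign codes here is purely illustrative.

    # Toy comparison of retrieval access patterns (illustrative only, not GENIUS).
    import numpy as np

    rng = np.random.default_rng(0)

    # 100k candidate embeddings, L2-normalized
    db = rng.normal(size=(100_000, 256)).astype("float32")
    db /= np.linalg.norm(db, axis=1, keepdims=True)

    # Query is a slightly perturbed copy of item 42
    query = db[42] + 0.01 * rng.normal(size=256).astype("float32")
    query /= np.linalg.norm(query)

    # 1) Embedding-based retrieval: score every candidate, O(N * d) work per query
    scores = db @ query
    print("nearest-neighbor result:", int(np.argmax(scores)))

    # 2) ID-based retrieval: each item is addressed by a short discrete code,
    #    so a query becomes a hash-table lookup instead of a full scan.
    #    Here the codes are simply the sign bits of the first 16 dimensions.
    codes = (db[:, :16] > 0).astype(np.uint8)
    bucket_index = {}
    for i, code in enumerate(map(bytes, codes)):
        bucket_index.setdefault(code, []).append(i)

    query_code = bytes((query[:16] > 0).astype(np.uint8))
    print("ID-based bucket (first matches):", bucket_index.get(query_code, [])[:5])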

    🧩 Community + Shoutouts

    • QVQ-Max visual reasoning demo: The Alibaba AI team unveiled QVQ-Max, a visual reasoning module for their Qwen chat—allowing users to upload an image or video and then see the model’s step-by-step “thinking” process when answering a question (Multimodal: AI News Week Ending 03/28/2025 - Ethan B. Holland). This peek under the hood of a multimodal LLM got the community excited about new ways to interpret and trust AI’s visual answers.
    • Together Chat (multimodal agent): Startup Together released a free web demo that combines several open-source models to handle diverse tasks. Dubbed Together Chat, it can perform web search, code writing, image generation, and even image analysis in one interface (Multimodal: AI News Week Ending 03/28/2025 - Ethan B. Holland). It showcases the power of connecting multimodal tools (like a language model + vision model) – all accessible to users for free, demonstrating the community’s push toward open, all-in-one AI assistants.
    Ethan Steininger

    April 2, 2025 · 5 min read