
    blip2-opt-2.7b

    by Salesforce

    Bootstrapping Language-Image Pre-training with frozen LLMs

    484K downloads/month · 433 likes · 3.7B params
    Identifiers
    Model ID
    Salesforce/blip2-opt-2.7b
    Feature URI
    mixpeek://image_extractor@v1/salesforce_blip2_v1

    Overview

    BLIP-2 bridges the modality gap between vision and language using a lightweight Querying Transformer (Q-Former) that connects a frozen image encoder to a frozen large language model. This enables powerful visual question answering and image captioning.

    On Mixpeek, BLIP-2 generates rich natural language descriptions of video frames and images, making visual content searchable with full-text queries.
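The caption-then-search pattern can be illustrated with a toy full-text match over generated captions. This is plain TypeScript, not the Mixpeek API; the captions and the naive term-matching function are illustrative assumptions:

```typescript
// Toy illustration of caption-based search (not the Mixpeek API):
// BLIP-2 captions become text documents that a full-text query can match.
const frameCaptions: { frameId: number; caption: string }[] = [
  { frameId: 0, caption: "a man riding a bicycle down a city street" },
  { frameId: 1, caption: "two dogs playing in a grassy park" },
  { frameId: 2, caption: "a red bicycle leaning against a brick wall" },
];

// Naive full-text match: return frames whose caption contains every query term.
function searchCaptions(query: string): number[] {
  const terms = query.toLowerCase().split(/\s+/);
  return frameCaptions
    .filter((f) => terms.every((t) => f.caption.includes(t)))
    .map((f) => f.frameId);
}

console.log(searchCaptions("bicycle")); // → [0, 2]
```

In production the matching would be handled by a real full-text index; the point is only that once frames have natural-language captions, visual search reduces to text search.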

    Architecture

    Three-stage architecture: (1) frozen ViT-G/14 image encoder, (2) Q-Former with 32 learnable query tokens that bridge vision and language, (3) frozen OPT 2.7B language model. Only the Q-Former is trained.
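The data flow through the three stages can be sketched at the level of tensor shapes. The dimensions below (257 patch tokens of width 1408 from ViT-G/14, 768-dim Q-Former internals, 2560-dim OPT-2.7B embeddings) are assumptions based on the released checkpoints, and the functions are shape bookkeeping only, not real inference:

```typescript
// Shape-level sketch of the BLIP-2 pipeline (illustrative, not real inference).
// All dimensions are assumptions based on the released 2.7B checkpoint.
type Shape = [tokens: number, dim: number];

// Stage 1: frozen ViT-G/14 image encoder.
// 257 patch tokens (16x16 grid + [CLS]), each 1408-dimensional.
function frozenImageEncoder(): Shape {
  return [257, 1408];
}

// Stage 2: Q-Former, the only trained component.
// 32 learnable queries (768-dim internally) cross-attend to all patch tokens,
// then a linear projection maps the 32 outputs into the LLM embedding space.
function qFormer(patches: Shape, llmDim: number): Shape {
  const numQueries = 32; // fixed output length, regardless of patches[0]
  return [numQueries, llmDim];
}

// Stage 3: frozen OPT-2.7B consumes the 32 projected tokens as a soft visual
// prompt prepended to the text embeddings (OPT-2.7B hidden size: 2560).
const llmDim = 2560;
const visualPrompt = qFormer(frozenImageEncoder(), llmDim);
console.log(visualPrompt); // → [32, 2560]
```

The design choice this sketch highlights: because only the 32-query Q-Former is trained, BLIP-2 adapts a large frozen LLM to vision with a small fraction of the compute of end-to-end training.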

    Mixpeek SDK Integration

    import { Mixpeek } from "mixpeek";
    
    // Authenticate with your Mixpeek API key
    const mx = new Mixpeek({ apiKey: "API_KEY" });
    
    // Ingest a video and run BLIP-2 scene captioning on its frames
    await mx.collections.ingest({
      collection_id: "my-collection",
      source: { url: "https://example.com/video.mp4" },
      feature_extractors: [{
        name: "scene_description",
        version: "v1",
        params: {
          model_id: "Salesforce/blip2-opt-2.7b"
        }
      }]
    });

    Capabilities

    • Natural language scene descriptions
    • Visual question answering
    • Image-grounded text generation
    • Zero-shot visual reasoning

    Use Cases on Mixpeek

    • Auto-captioning video archives for accessibility and search
    • Content discovery: find scenes by natural language description
    • Automated metadata generation for media asset management
    • Visual Q&A over surveillance or training footage

    Specification

    Framework: HF
    Organization: Salesforce
    Feature: Scene Captioning
    Output: text
    Modalities: video, image
    Retriever: Semantic Search
    Parameters: 3.7B
    License: MIT
    Downloads/mo: 484K
    Likes: 433

    Research Paper

    BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models

    arxiv.org

    Build a pipeline with blip2-opt-2.7b

    Add this model to a processing pipeline alongside other extractors. Combine with retrieval stages for end-to-end search.
