
    blip2-opt-2.7b

    by Salesforce

    Bootstrapping Language-Image Pre-training with frozen LLMs

    516K downloads/month
    438 likes
    3.7B params
    Identifiers
    Model ID
    Salesforce/blip2-opt-2.7b
    Feature URI
    mixpeek://image_extractor@v1/salesforce_blip2_v1

    Overview

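    BLIP-2 bridges the modality gap between vision and language using a lightweight Querying Transformer (Q-Former) that connects a frozen image encoder to a frozen large language model. Because neither the image encoder nor the language model is updated, the combined model supports visual question answering and image captioning with far fewer trainable parameters than end-to-end approaches.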

    On Mixpeek, BLIP-2 generates rich natural language descriptions of video frames and images, making visual content searchable with full-text queries.
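To make "searchable with full-text queries" concrete, here is a minimal sketch of keyword search over generated captions. It is not Mixpeek's retrieval implementation: the `CaptionedFrame` type, the function names, and the sample captions are all hypothetical, and the matching is a plain inverted index with AND semantics.

```typescript
// Hypothetical record: one caption per sampled frame.
interface CaptionedFrame {
  timestampSec: number;
  caption: string;
}

// Lowercase and split on anything that is not a letter or digit.
function tokenize(text: string): string[] {
  return text.toLowerCase().split(/[^a-z0-9]+/).filter((t) => t.length > 0);
}

// Map each term to the set of frame indices whose caption contains it.
function buildIndex(frames: CaptionedFrame[]): Map<string, Set<number>> {
  const index = new Map<string, Set<number>>();
  frames.forEach((frame, i) => {
    for (const term of tokenize(frame.caption)) {
      let posting = index.get(term);
      if (posting === undefined) {
        posting = new Set<number>();
        index.set(term, posting);
      }
      posting.add(i);
    }
  });
  return index;
}

// Return frames whose caption contains every query term, in frame order.
function searchCaptions(
  index: Map<string, Set<number>>,
  frames: CaptionedFrame[],
  query: string
): CaptionedFrame[] {
  const terms = tokenize(query);
  if (terms.length === 0) return [];
  let hits = frames.map((_, i) => i);
  for (const term of terms) {
    const posting = index.get(term);
    hits = posting === undefined ? [] : hits.filter((i) => posting.has(i));
  }
  return hits.map((i) => frames[i]);
}
```

Given captions such as "a dog runs on the beach" and "a dog chases a bicycle", the query "dog bicycle" would return only the second frame, along with its timestamp for seeking into the video.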

    Architecture

    The model has three components: (1) a frozen ViT-g/14 image encoder, (2) a Q-Former with 32 learnable query tokens that bridge vision and language, and (3) a frozen OPT-2.7B language model. Only the Q-Former is trained.
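The data flow through these components can be followed by tracking token counts and widths. The sketch below is only shape bookkeeping, not a model implementation. The hidden sizes (1408 for the image encoder, 768 for the Q-Former, 2560 for OPT-2.7B) are assumptions taken from the publicly released model configurations, and the `traceBlip2Shapes` helper is hypothetical.

```typescript
// Token counts and widths at each component boundary.
interface Blip2Shapes {
  imageTokens: number;     // patch tokens plus one class token from the ViT
  imageTokenDim: number;   // ViT hidden size (assumed 1408)
  queryTokens: number;     // learnable Q-Former queries (32, per the text above)
  queryTokenDim: number;   // Q-Former hidden size (assumed 768)
  llmPrefixTokens: number; // query outputs projected into the LLM input space
  llmTokenDim: number;     // OPT-2.7B hidden size (assumed 2560)
}

function traceBlip2Shapes(imageSize = 224, patchSize = 14): Blip2Shapes {
  const patchesPerSide = Math.floor(imageSize / patchSize);
  const queryTokens = 32;
  return {
    imageTokens: patchesPerSide * patchesPerSide + 1,
    imageTokenDim: 1408,
    queryTokens,
    queryTokenDim: 768,
    // A linear projection maps each query output to the LLM width,
    // so the LLM sees a fixed-length visual prefix regardless of image size.
    llmPrefixTokens: queryTokens,
    llmTokenDim: 2560,
  };
}
```

Under these assumptions, a 224×224 input yields 257 image tokens, which the Q-Former compresses to 32 tokens before they reach the language model. That fixed, short prefix is what keeps the language model's input cost independent of image resolution.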

    Mixpeek SDK Integration

    import { Mixpeek } from "mixpeek";

    const mx = new Mixpeek({ apiKey: "API_KEY" });

    await mx.collections.ingest({
      collection_id: "my-collection",
      source: { url: "https://example.com/video.mp4" },
      feature_extractors: [{
        name: "scene_description",
        version: "v1",
        params: {
          model_id: "Salesforce/blip2-opt-2.7b"
        }
      }]
    });
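When ingesting several videos, it can help to build the request objects separately from the SDK call. The sketch below only mirrors the fields shown in the snippet above. The `IngestRequest` type and `buildIngestRequest` helper are hypothetical and not part of the Mixpeek SDK, and the field set is assumed to be complete only as far as that snippet shows.

```typescript
// Hypothetical types mirroring the fields used in the ingest call above.
interface FeatureExtractor {
  name: string;
  version: string;
  params: { model_id: string };
}

interface IngestRequest {
  collection_id: string;
  source: { url: string };
  feature_extractors: FeatureExtractor[];
}

const BLIP2_MODEL_ID = "Salesforce/blip2-opt-2.7b";

// Build one request per source URL, all using the BLIP-2 scene description extractor.
function buildIngestRequest(collectionId: string, url: string): IngestRequest {
  return {
    collection_id: collectionId,
    source: { url },
    feature_extractors: [
      {
        name: "scene_description",
        version: "v1",
        params: { model_id: BLIP2_MODEL_ID },
      },
    ],
  };
}

const requests = [
  "https://example.com/video-1.mp4",
  "https://example.com/video-2.mp4",
].map((url) => buildIngestRequest("my-collection", url));
```

Each element of `requests` could then be passed to `mx.collections.ingest(...)` as in the snippet above, assuming the SDK accepts exactly this shape.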

    Capabilities

    • Natural language scene descriptions
    • Visual question answering
    • Image-grounded text generation
    • Zero-shot visual reasoning

    Use Cases on Mixpeek

    • Auto-captioning video archives for accessibility and search
    • Content discovery: finding scenes by natural language description
    • Automated metadata generation for media asset management
    • Visual Q&A over surveillance or training footage

    Benchmarks

    Dataset            Metric     Score    Source
    COCO Captioning    CIDEr      145.8    Li et al., 2023, Table 3
    VQAv2 (test-dev)   Accuracy   65.0%    Li et al., 2023, Table 4
    NoCaps (val)       CIDEr      121.6    Li et al., 2023, Table 3

    Performance

    Input Size       224×224 px
    GPU Latency      ~45 ms / image (A100)
    CPU Latency      ~680 ms / image
    GPU Throughput   ~22 images/sec (A100)
    GPU Memory       ~6.2 GB

    These figures include the OPT-2.7B LLM decoder used for caption generation.
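The latency and throughput figures are related by simple arithmetic, and they can be turned into a rough processing-time estimate for a video. The sketch below assumes frames are captioned one at a time with no batching or I/O overhead, and that frames are sampled at a fixed rate. The function names and the sampling rate are illustrative, not Mixpeek parameters.

```typescript
// Throughput implied by a per-image latency, in images per second.
function imagesPerSecond(latencyMs: number): number {
  return 1000 / latencyMs;
}

// Rough captioning time for a video, assuming sequential processing.
function estimateCaptioningSeconds(
  videoSeconds: number,
  sampledFramesPerSecond: number,
  latencyMs: number
): number {
  const frames = Math.ceil(videoSeconds * sampledFramesPerSecond);
  return (frames * latencyMs) / 1000;
}

// A 10-minute video sampled at 1 frame per second gives 600 frames.
const gpuSeconds = estimateCaptioningSeconds(600, 1, 45);  // 27 seconds at ~45 ms/image
const cpuSeconds = estimateCaptioningSeconds(600, 1, 680); // 408 seconds at ~680 ms/image
```

At ~45 ms per image, `imagesPerSecond(45)` is about 22.2, which matches the quoted GPU throughput of ~22 images/sec. Real pipelines that batch frames or overlap decoding with inference may differ from this estimate.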

    Specification

    Framework      HF
    Organization   Salesforce
    Feature        Scene Captioning
    Output         text
    Modalities     video, image
    Retriever      Semantic Search
    Parameters     3.7B
    License        MIT
    Downloads/mo   516K
    Likes          438

    Research Paper

    BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models

    arxiv.org

    Build a pipeline with blip2-opt-2.7b

    Add this model to a processing pipeline alongside other extractors. Combine with retrieval stages for end-to-end search.
