blip2-opt-2.7b

by Salesforce

Bootstrapping Language-Image Pre-training with frozen LLMs

484K downloads/month

433 likes

3.7B params

Identifiers

Model ID

Salesforce/blip2-opt-2.7b

Feature URI

mixpeek://image_extractor@v1/salesforce_blip2_v1

Overview

BLIP-2 bridges the modality gap between vision and language using a lightweight Querying Transformer (Q-Former) that connects a frozen image encoder to a frozen large language model. The Q-Former passes visual features to the language model, which then generates text conditioned on the image. This supports visual question answering and image captioning.

On Mixpeek, BLIP-2 generates rich natural language descriptions of video frames and images, making visual content searchable with full-text queries.
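To illustrate why captions make frames searchable, here is a minimal sketch of keyword search over per-frame captions. The `FrameCaption` type, the sample captions, and the `searchCaptions` helper are all hypothetical and are not part of the Mixpeek SDK. The term-overlap scoring is a simple stand-in for whatever retrieval Mixpeek actually performs.

```typescript
// Hypothetical record: one generated caption per sampled video frame.
interface FrameCaption {
  timestampSec: number;
  caption: string;
}

// Lowercase the text and split it into alphanumeric terms.
function tokenize(text: string): string[] {
  return text.toLowerCase().match(/[a-z0-9]+/g) ?? [];
}

// Rank frames by how many distinct query terms appear in their caption.
// Frames that share no term with the query are dropped.
function searchCaptions(frames: FrameCaption[], query: string): FrameCaption[] {
  const queryTerms = Array.from(new Set(tokenize(query)));
  return frames
    .map((frame) => {
      const captionTerms = new Set(tokenize(frame.caption));
      const score = queryTerms.filter((term) => captionTerms.has(term)).length;
      return { frame, score };
    })
    .filter((hit) => hit.score > 0)
    .sort((a, b) => b.score - a.score)
    .map((hit) => hit.frame);
}

// Made-up captions of the kind a captioning model might produce.
const frames: FrameCaption[] = [
  { timestampSec: 0, caption: "a man riding a bicycle down a city street" },
  { timestampSec: 5, caption: "a dog running on a beach" },
  { timestampSec: 10, caption: "a woman walking a dog in a city park" },
];

const results = searchCaptions(frames, "dog in the city");
console.log(results.map((frame) => frame.timestampSec));
```

The frame at 10 seconds ranks first because its caption matches three query terms ("dog", "in", "city"), while the other two frames match one term each.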

Architecture

Three-component architecture: (1) frozen ViT-G/14 image encoder, (2) Q-Former with 32 learnable query tokens that bridge vision and language, (3) frozen OPT 2.7B language model. Only the Q-Former is trained; the image encoder and the language model keep their pretrained weights.
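The data flow through these components can be sketched by tracing token shapes. The dimensions below (224-pixel input, vision width 1408, Q-Former width 768, OPT width 2560) are assumptions taken from the publicly released Salesforce/blip2-opt-2.7b configuration and do not appear on this page. The `Blip2Dims` type and `traceShapes` function are illustrative names only.

```typescript
// Assumed dimensions for Salesforce/blip2-opt-2.7b (from the public model
// configuration, not from this page).
interface Blip2Dims {
  imageSize: number;      // square input resolution in pixels
  patchSize: number;      // ViT patch edge, the "/14" in ViT-G/14
  visionWidth: number;    // hidden size of the frozen image encoder
  numQueryTokens: number; // learnable Q-Former query tokens
  qformerWidth: number;   // hidden size of the Q-Former
  llmWidth: number;       // hidden size of the frozen OPT 2.7B model
}

const BLIP2_OPT_2_7B: Blip2Dims = {
  imageSize: 224,
  patchSize: 14,
  visionWidth: 1408,
  numQueryTokens: 32,
  qformerWidth: 768,
  llmWidth: 2560,
};

// [number of tokens, feature width]
type Shape = [number, number];

function traceShapes(d: Blip2Dims): { vision: Shape; qformer: Shape; llmPrefix: Shape } {
  const patchesPerSide = Math.floor(d.imageSize / d.patchSize);
  // One token per image patch plus one class token.
  const visionTokens = patchesPerSide * patchesPerSide + 1;
  return {
    // (1) The frozen image encoder emits one feature vector per token.
    vision: [visionTokens, d.visionWidth],
    // (2) The Q-Former's queries cross-attend to those features and
    //     compress them into a fixed number of query outputs.
    qformer: [d.numQueryTokens, d.qformerWidth],
    // (3) A linear projection maps the query outputs to the LLM width so
    //     they can be prepended to the text embeddings as a visual prefix.
    llmPrefix: [d.numQueryTokens, d.llmWidth],
  };
}

const shapes = traceShapes(BLIP2_OPT_2_7B);
console.log(shapes);
```

Under these assumptions the language model sees only 32 visual tokens per image, however many patch features the encoder produces. That keeps the visual prefix short and is why the trainable bridge can stay small.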

Mixpeek SDK Integration

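import { Mixpeek } from "mixpeek";

// Replace "API_KEY" with your own Mixpeek API key.
const mx = new Mixpeek({ apiKey: "API_KEY" });

// Ingest a video and run the scene_description extractor with BLIP-2.
await mx.collections.ingest({
  collection_id: "my-collection",
  source: { url: "https://example.com/video.mp4" },
  feature_extractors: [{
    name: "scene_description",
    version: "v1",
    params: {
      // Hugging Face model ID of the captioning model
      model_id: "Salesforce/blip2-opt-2.7b"
    }
  }]
});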

Capabilities

Natural language scene descriptions
Visual question answering
Image-grounded text generation
Zero-shot visual reasoning

Use Cases on Mixpeek

Auto-captioning video archives for accessibility and search

Content discovery: find scenes by natural language description

Automated metadata generation for media asset management

Visual Q&A over surveillance or training footage

Specification

Framework: HF
Organization: Salesforce
Feature: Scene Captioning
Output: text
Modalities: video, image
Retriever: Semantic Search
Parameters: 3.7B
License: MIT
Downloads/mo: 484K
Likes: 433

Research Paper

BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models

arxiv.org

Build a pipeline with blip2-opt-2.7b

Add this model to a processing pipeline alongside other extractors. Combine with retrieval stages for end-to-end search.

Alternative Models

microsoft/Florence-2-large

Scene Captioning