    HF | Scene Captioning | mit

    Florence-2-large

    by microsoft

    Foundation model for unified vision tasks with sequence-to-sequence architecture

    1.2M downloads/month
    1,767 likes
    777M params

    Identifiers

    Model ID: microsoft/Florence-2-large
    Feature URI: mixpeek://image_extractor@v1/microsoft_florence2_large_v1
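
The feature URI above appears to follow a `mixpeek://<extractor>@<version>/<feature>` layout. The sketch below splits a URI into those three parts. The layout is an assumption inferred from the single example on this page, not from a published specification, and `FeatureUri` and `parseFeatureUri` are hypothetical names introduced for illustration.

```typescript
// Hypothetical shape for the parts of a feature URI. The field names are
// assumptions made for this sketch, not part of the Mixpeek SDK.
interface FeatureUri {
  extractor: string; // e.g. "image_extractor"
  version: string;   // e.g. "v1"
  feature: string;   // e.g. "microsoft_florence2_large_v1"
}

// Assumed layout: mixpeek://<extractor>@<version>/<feature>
function parseFeatureUri(uri: string): FeatureUri | null {
  const prefix = "mixpeek://";
  if (!uri.startsWith(prefix)) {
    return null; // not a mixpeek:// URI
  }
  const rest = uri.slice(prefix.length);
  const at = rest.indexOf("@");
  const slash = rest.indexOf("/", at + 1);
  // Require a non-empty extractor, version, and feature segment.
  if (at <= 0 || slash <= at + 1 || slash === rest.length - 1) {
    return null;
  }
  return {
    extractor: rest.slice(0, at),
    version: rest.slice(at + 1, slash),
    feature: rest.slice(slash + 1),
  };
}

const parsed = parseFeatureUri(
  "mixpeek://image_extractor@v1/microsoft_florence2_large_v1"
);
// parsed.extractor is "image_extractor", parsed.version is "v1",
// parsed.feature is "microsoft_florence2_large_v1"
```

Returning `null` for anything outside the assumed layout keeps the helper from guessing at URIs it does not understand.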

    Overview

    Florence-2 is a versatile vision foundation model that handles captioning, object detection, grounding, and OCR in a single unified architecture using a sequence-to-sequence paradigm. It processes images and task-specific text prompts to produce structured outputs.

    On Mixpeek, Florence-2 provides detailed scene descriptions that go beyond simple captions, covering spatial relationships, object attributes, and contextual information.

    Architecture

    Florence-2 pairs a DaViT vision encoder with a transformer-based sequence-to-sequence decoder and selects among vision tasks via task-specific prompt tokens. The large variant has roughly 0.77B parameters (listed as 777M on this page).
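
To make the prompt-token mechanism concrete, here is a minimal sketch of how a task could map to the text prompt the decoder receives. The token strings are those listed on the model's Hugging Face card. The task names on the left and the `buildPrompt` helper are hypothetical, and the Mixpeek SDK call on this page does not necessarily expose these tokens directly.

```typescript
// Task tokens as listed on the Hugging Face model card. The keys are
// hypothetical labels chosen for this sketch.
const FLORENCE2_TASK_PROMPTS = {
  caption: "<CAPTION>",
  detailedCaption: "<DETAILED_CAPTION>",
  moreDetailedCaption: "<MORE_DETAILED_CAPTION>",
  objectDetection: "<OD>",
  denseRegionCaption: "<DENSE_REGION_CAPTION>",
  phraseGrounding: "<CAPTION_TO_PHRASE_GROUNDING>",
  ocr: "<OCR>",
  ocrWithRegion: "<OCR_WITH_REGION>",
} as const;

type Florence2Task = keyof typeof FLORENCE2_TASK_PROMPTS;

// Builds the text prompt for a task. Some tasks take extra text after the
// token; phrase grounding, for example, takes the caption to be grounded.
function buildPrompt(task: Florence2Task, extraText: string = ""): string {
  const token = FLORENCE2_TASK_PROMPTS[task];
  return extraText === "" ? token : token + extraText;
}

// The same image can be sent with different prompts to get different outputs.
const captionPrompt = buildPrompt("moreDetailedCaption");
const groundingPrompt = buildPrompt(
  "phraseGrounding",
  "A green car parked in front of a yellow building."
);
```

Because the task is selected purely by the prompt, switching from captioning to detection or OCR changes only the input text, not the model weights or architecture.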

    Mixpeek SDK Integration

    import { Mixpeek } from "mixpeek";
    
    const mx = new Mixpeek({ apiKey: "API_KEY" });
    
    await mx.collections.ingest({
      collection_id: "my-collection",
      source: { url: "https://example.com/video.mp4" },
      feature_extractors: [{
        name: "scene_description",
        version: "v1",
        params: {
          model_id: "microsoft/Florence-2-large"
        }
      }]
    });

    Capabilities

    • Dense captioning with region descriptions
    • Referring expression comprehension
    • Object detection and visual grounding
    • OCR with text localization
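
For the detection capability, the Hugging Face model card shows the post-processed `<OD>` result as parallel `bboxes` and `labels` arrays, with each box given as `[x1, y1, x2, y2]`. The sketch below pairs them into one record per object. It assumes that shape; the specification on this page lists Mixpeek's output as text, so the hosted extractor may not return boxes at all. `pairDetections` and the sample values are hypothetical.

```typescript
// Box corners as [x1, y1, x2, y2], following the model card's example output.
type Box = [number, number, number, number];

// Assumed shape of a post-processed object detection result.
interface DetectionResult {
  bboxes: Box[];
  labels: string[];
}

interface Detection {
  label: string;
  box: Box;
}

// Pairs each label with the box at the same index.
function pairDetections(result: DetectionResult): Detection[] {
  if (result.bboxes.length !== result.labels.length) {
    throw new Error("bboxes and labels must have the same length");
  }
  return result.bboxes.map((box, i) => ({ label: result.labels[i], box }));
}

// Made-up sample values, for illustration only.
const sample: DetectionResult = {
  bboxes: [
    [34, 160, 597, 371],
    [272, 241, 303, 247],
  ],
  labels: ["car", "door handle"],
};

const detections = pairDetections(sample);
// detections[0] is { label: "car", box: [34, 160, 597, 371] }
```

Keeping label and box together in one record makes downstream filtering (for example, by label) simpler than carrying two parallel arrays.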

    Use Cases on Mixpeek

    • Rich scene understanding for video analytics
    • Multi-task visual extraction in a single pass
    • Grounded captioning for accessibility

    Specification

    Framework: HF
    Organization: microsoft
    Feature: Scene Captioning
    Output: text
    Modalities: video, image
    Retriever: Semantic Search
    Parameters: 777M
    License: mit
    Downloads/mo: 1.2M
    Likes: 1,767

    Research Paper

    Florence-2: Advancing a Unified Representation for a Variety of Vision Tasks

    arxiv.org

    Build a pipeline with Florence-2-large

    Add this model to a processing pipeline alongside other extractors. Combine with retrieval stages for end-to-end search.
