    HF | Scene Captioning | FAIR Noncommercial Research License

    Perception-LM-3B

    by facebook

    Meta Perception Language Model checkpoint for detailed image and video understanding

    Downloads: 1.3K/month
    Parameters: 3B class
    Identifiers
    Model ID
    facebook/Perception-LM-3B
    Feature URI
    mixpeek://image_extractor@v1/facebook_perception_lm_3b_v1
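
The Feature URI above appears to follow a `mixpeek://<extractor>@<version>/<feature id>` shape. The following is a minimal sketch, inferred only from that one example, of how such a URI could be split into its parts. The helper `parseFeatureUri` and the field names `extractor`, `version`, and `featureId` are hypothetical labels chosen for illustration and are not taken from Mixpeek documentation.

```typescript
// Hypothetical helper: splits a Mixpeek-style feature URI into its parts.
// The URI shape is an assumption inferred from the single example on this page.
interface FeatureUri {
  extractor: string; // e.g. "image_extractor"
  version: string;   // e.g. "v1"
  featureId: string; // e.g. "facebook_perception_lm_3b_v1"
}

function parseFeatureUri(uri: string): FeatureUri {
  // scheme://extractor@version/featureId
  const match = /^mixpeek:\/\/([^@\/]+)@([^\/]+)\/(.+)$/.exec(uri);
  if (match === null) {
    throw new Error(`Not a recognizable feature URI: ${uri}`);
  }
  return { extractor: match[1], version: match[2], featureId: match[3] };
}

const parsed = parseFeatureUri(
  "mixpeek://image_extractor@v1/facebook_perception_lm_3b_v1"
);
console.log(parsed.extractor, parsed.version, parsed.featureId);
```

Throwing on an unrecognized URI keeps malformed identifiers from silently flowing into pipeline configuration.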

    Overview

    Perception-LM-3B is part of Meta's PerceptionLM release for open, reproducible visual understanding research. The linked paper describes a transparent Perception Language Model stack for detailed image and video understanding, including human-labeled and synthetic data and a PLM-VideoBench evaluation for temporal perception.

    On Mixpeek, Perception-LM-3B is useful when teams want a research-friendly VLM for building searchable descriptions of images and video clips. Its license is research-only, so it should be treated as an evaluation and prototyping model rather than a default commercial production choice.
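
One way to keep a research-only checkpoint out of production by accident is a small policy check in pipeline configuration code. The sketch below is illustrative only: the `ModelPolicy` type, the `Stage` names, and `canUseInStage` are hypothetical and not part of the Mixpeek SDK, and it assumes for simplicity that "production" means commercial use.

```typescript
// Hypothetical policy check; none of these names come from the Mixpeek SDK.
type Stage = "evaluation" | "prototype" | "production";

interface ModelPolicy {
  modelId: string;
  license: string;
  commercialUseAllowed: boolean;
}

const perceptionLm3b: ModelPolicy = {
  modelId: "facebook/Perception-LM-3B",
  license: "FAIR Noncommercial Research License",
  commercialUseAllowed: false, // research-only, per the license named above
};

// Assumption: production deployments are commercial, so they require a
// license that permits commercial use. Other stages are always allowed.
function canUseInStage(model: ModelPolicy, stage: Stage): boolean {
  return stage !== "production" || model.commercialUseAllowed;
}

console.log(canUseInStage(perceptionLm3b, "evaluation")); // true
console.log(canUseInStage(perceptionLm3b, "production")); // false
```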

    Architecture

    Autoregressive vision-language model from the PerceptionLM family. The model combines a Perception Encoder visual backbone with a language decoder and is released in 1B, 3B, and 8B scales for detailed visual understanding experiments.

    Mixpeek SDK Integration

    import { Mixpeek } from "mixpeek";

    // Replace "API_KEY" with your own key.
    const mx = new Mixpeek({ apiKey: "API_KEY" });

    // Ingest video clips and caption them with Perception-LM-3B.
    await mx.collections.ingest({
      collection_id: "vlm-evals",
      source: { url: "s3://benchmarks/video-clips/" },
      feature_extractors: [{
        feature: "scene_caption",
        model: "facebook/Perception-LM-3B",
        params: {
          sample_rate: "1fps",
          caption_detail: "dense"
        }
      }]
    });
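
The `sample_rate: "1fps"` parameter in the snippet above suggests that one frame per second is sampled from each clip, although this page does not spell out the exact semantics. As a rough sketch under that assumption, the hypothetical helpers below parse an `"<N>fps"` string and list the timestamps that would be sampled from a clip of a given duration. They are not part of the Mixpeek SDK.

```typescript
// Hypothetical helpers illustrating fixed-rate frame sampling.
// Assumption: a rate string looks like "1fps" or "0.5fps".
function parseSampleRate(rate: string): number {
  const match = /^(\d+(?:\.\d+)?)fps$/.exec(rate);
  if (match === null) {
    throw new Error(`Unrecognized sample rate: ${rate}`);
  }
  const fps = Number(match[1]);
  if (fps <= 0) {
    throw new Error("Sample rate must be positive");
  }
  return fps;
}

// Returns the timestamps (in seconds) of frames sampled at a fixed rate,
// starting at 0 and stopping before the end of the clip.
function frameTimestamps(durationSeconds: number, rate: string): number[] {
  const fps = parseSampleRate(rate);
  const count = Math.floor(durationSeconds * fps);
  const timestamps: number[] = [];
  for (let i = 0; i < count; i++) {
    timestamps.push(i / fps);
  }
  return timestamps;
}

// A 5-second clip at 1fps yields timestamps 0, 1, 2, 3, 4.
console.log(frameTimestamps(5, "1fps"));
```

Under this assumption the number of captioned frames grows linearly with clip length, which is worth keeping in mind when estimating the cost of dense captioning.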

    Capabilities

    • Detailed image and video understanding
    • Visual question answering over frames and clips
    • Temporal video perception research via PLM-VideoBench
    • Transparent data and training recipe for reproducible VLM evaluation
    • Useful baseline for comparing closed and open visual reasoning models

    Use Cases on Mixpeek

    • Prototype video understanding pipelines with an open research checkpoint
    • Compare caption quality across VLMs before selecting a production model
    • Index image and video datasets for agent evaluation
    • Build evidence traces for visual QA benchmarks
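
To make the caption comparison use case concrete, here is a minimal sketch that scores candidate captions against a human reference using token-overlap F1. This is a simple stand-in metric chosen for illustration, not the evaluation used in the PerceptionLM paper or by Mixpeek. The function names, model labels, and example captions are all hypothetical.

```typescript
// Lowercases text and splits it into alphanumeric tokens.
function tokenize(text: string): string[] {
  return text.toLowerCase().match(/[a-z0-9]+/g) ?? [];
}

// Token-overlap F1 between a candidate caption and a reference caption.
// Repeated tokens are matched at most as often as they occur in the reference.
function tokenF1(candidate: string, reference: string): number {
  const cand = tokenize(candidate);
  const ref = tokenize(reference);
  if (cand.length === 0 || ref.length === 0) return 0;

  const refCounts = new Map<string, number>();
  for (const t of ref) refCounts.set(t, (refCounts.get(t) ?? 0) + 1);

  let overlap = 0;
  for (const t of cand) {
    const remaining = refCounts.get(t) ?? 0;
    if (remaining > 0) {
      overlap++;
      refCounts.set(t, remaining - 1);
    }
  }
  if (overlap === 0) return 0;

  const precision = overlap / cand.length;
  const recall = overlap / ref.length;
  return (2 * precision * recall) / (precision + recall);
}

// Hypothetical captions from two models for the same clip.
const reference = "a dog runs across a grassy field";
const captions: Record<string, string> = {
  "model-a": "a dog runs across a field",
  "model-b": "a cat sleeps on a sofa",
};
for (const [model, caption] of Object.entries(captions)) {
  console.log(model, tokenF1(caption, reference).toFixed(2));
}
```

Token overlap rewards shared wording rather than meaning, so it is best treated as a quick sanity check alongside human review or an embedding-based comparison.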

    Benchmarks

    Dataset | Metric | Score | Source
    PLM-VideoBench | Coverage | Introduced for temporal video understanding | PerceptionLM paper
    Visual understanding tasks | Scope | Image and video understanding | HuggingFace paper page

    Performance

    Input Size: Images or sampled video frames
    GPU Latency: Input dependent
    GPU Throughput: Batch dependent
    GPU Memory: Model dependent

    Research license requires access approval and noncommercial use

    Specification

    Framework: HF
    Organization: facebook
    Feature: Scene Captioning
    Output: text
    Modalities: video, image
    Retriever: Semantic Search
    Parameters: 3B class
    License: FAIR Noncommercial Research License
    Downloads/mo: 1.3K

    Research Paper

    PerceptionLM: Open-Access Data and Models for Detailed Visual Understanding

    arxiv.org

    Build a pipeline with Perception-LM-3B

    Add this model to a processing pipeline alongside other extractors. Combine with retrieval stages for end-to-end search.
