Perception-LM-3B

by facebook

Meta Perception Language Model checkpoint for detailed image and video understanding

1.3K downloads/month

3B-class parameters

Identifiers

Model ID

facebook/Perception-LM-3B

Feature URI

mixpeek://image_extractor@v1/facebook_perception_lm_3b_v1
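The Feature URI above reads as a scheme, an extractor name, an extractor version, and a feature name. A minimal sketch of splitting it into those parts follows. The `parseFeatureUri` helper and the `FeatureUri` field names are hypothetical, and the layout is inferred from this single example rather than from a documented Mixpeek grammar.

```typescript
// Hypothetical shape: the field names are illustrative, not part of the Mixpeek SDK.
interface FeatureUri {
  extractor: string;
  version: string;
  feature: string;
}

// Assumed layout, inferred from the one URI shown on this page:
//   mixpeek://<extractor>@<version>/<feature>
function parseFeatureUri(uri: string): FeatureUri {
  const match = /^mixpeek:\/\/([^@\/]+)@([^\/]+)\/(.+)$/.exec(uri);
  if (match === null) {
    throw new Error(`unrecognized feature URI: ${uri}`);
  }
  const [, extractor, version, feature] = match;
  return { extractor, version, feature };
}

const parsedUri = parseFeatureUri(
  "mixpeek://image_extractor@v1/facebook_perception_lm_3b_v1",
);
console.log(parsedUri.extractor, parsedUri.version, parsedUri.feature);
```

Rejecting anything that does not match the assumed pattern keeps a mistyped URI from silently producing empty fields.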

Overview

Perception-LM-3B is part of Meta's PerceptionLM release for open, reproducible visual understanding research. The linked paper describes a transparent Perception Language Model stack for detailed image and video understanding, including human-labeled and synthetic data and a PLM-VideoBench evaluation for temporal perception.

On Mixpeek, Perception-LM-3B is useful when teams want a research-friendly VLM for building searchable descriptions of images and video clips. Its license is research-only, so it should be treated as an evaluation and prototyping model rather than a default commercial production choice.

Architecture

Autoregressive vision-language model from the PerceptionLM family. The model combines a Perception Encoder visual backbone with a language decoder and is released in 1B, 3B, and 8B scales for detailed visual understanding experiments.

Mixpeek SDK Integration

import { Mixpeek } from "mixpeek";

const mx = new Mixpeek({ apiKey: "API_KEY" });

// Managed: create a collection over a bucket; Mixpeek runs this model's extractor
const collection = await mx.collections.create({
  namespace_id: "my-namespace",
  collection_name: "my-collection",
  source: { type: "bucket", bucket_ids: ["bkt_your_bucket"] },
  feature_extractor: {
    feature_extractor_name: "scene_caption",
    version: "v1",
    parameters: { model_id: "facebook/Perception-LM-3B" },
  },
});
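When the same collection setup is repeated for several models, for example to compare caption quality across VLMs, it can help to build the request object in one place. The sketch below does that under stated assumptions: `CollectionRequest` and `buildCaptionCollection` are hypothetical names, the field names are copied from the snippet above, and nothing here calls the Mixpeek API or confirms which other model IDs the `scene_caption` extractor accepts.

```typescript
// Hypothetical type: mirrors the fields used in the collection snippet on this page.
interface CollectionRequest {
  namespace_id: string;
  collection_name: string;
  source: { type: "bucket"; bucket_ids: string[] };
  feature_extractor: {
    feature_extractor_name: string;
    version: string;
    parameters: { model_id: string };
  };
}

// Builds the request body only; passing it to the SDK is left to the caller.
function buildCaptionCollection(
  namespaceId: string,
  collectionName: string,
  bucketIds: string[],
  modelId: string = "facebook/Perception-LM-3B",
): CollectionRequest {
  if (bucketIds.length === 0) {
    throw new Error("at least one bucket ID is required");
  }
  return {
    namespace_id: namespaceId,
    collection_name: collectionName,
    source: { type: "bucket", bucket_ids: [...bucketIds] },
    feature_extractor: {
      feature_extractor_name: "scene_caption",
      version: "v1",
      parameters: { model_id: modelId },
    },
  };
}

const request = buildCaptionCollection("my-namespace", "my-collection", [
  "bkt_your_bucket",
]);
console.log(JSON.stringify(request, null, 2));
```

Keeping the model ID as the only varying argument makes it easier to create one collection per candidate model over the same bucket and compare the resulting captions side by side.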

Capabilities

Detailed image and video understanding
Visual question answering over frames and clips
Temporal video perception research via PLM-VideoBench
Transparent data and training recipe for reproducible VLM evaluation
Useful baseline for comparing closed and open visual reasoning models

Use Cases on Mixpeek

Prototype video understanding pipelines with an open research checkpoint

Compare caption quality across VLMs before selecting a production model

Index image and video datasets for agent evaluation

Build evidence traces for visual QA benchmarks

Benchmarks

Dataset	Metric	Score	Source
PLM-VideoBench	Coverage	Introduced for temporal video understanding	PerceptionLM paper
Visual understanding tasks	Scope	Image and video understanding	HuggingFace paper page