Molmo2-8B

by allenai

Open VLM with video grounding: locate and track objects across frames

85K downloads/month

8B params


Identifiers

Model ID

allenai/Molmo2-8B

Feature URI

mixpeek://image_extractor@v1/allenai_molmo2_8b_v1
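The Feature URI above appears to follow an `<extractor>@<version>/<model_slug>` pattern. The sketch below splits a URI into those three parts under that assumption. The layout is inferred from this single example rather than from a published specification, and `parseFeatureUri` and `FeatureUri` are hypothetical names that are not part of the Mixpeek SDK.

```typescript
// Assumed layout (inferred from one example, not a documented spec):
//   mixpeek://<extractor>@<version>/<model_slug>
interface FeatureUri {
  extractor: string;
  version: string;
  modelSlug: string;
}

// Hypothetical helper: returns null when the string does not match the assumed layout.
function parseFeatureUri(uri: string): FeatureUri | null {
  const match = /^mixpeek:\/\/([^@\/]+)@([^\/]+)\/(.+)$/.exec(uri);
  if (match === null) return null;
  return { extractor: match[1], version: match[2], modelSlug: match[3] };
}

const parsed = parseFeatureUri("mixpeek://image_extractor@v1/allenai_molmo2_8b_v1");
console.log(parsed);
```

Returning `null` rather than throwing lets calling code decide how to handle a malformed identifier.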

Overview

Molmo2 is a fully open (weights + data) vision-language model from AI2 that supports image, video, and multi-image understanding with strong spatial grounding. It can point to, track, and count objects in video, outperforming Qwen3-VL on video counting (35.5 vs 29.6) and Gemini 3 Pro on video pointing (38.4 vs 20.0 F1).

Built on the Qwen3-8B language model and the SigLIP 2 vision encoder, Molmo2 stands out for offering both open weights and open training data, which enables full reproducibility.

Architecture

An 8B-parameter VLM that pairs a Qwen3-8B language backbone with a SigLIP 2 vision encoder. Multi-image and video inputs are handled via frame sampling. Spatial grounding is expressed as coordinates predicted in the output tokens.
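To make the two mechanisms above concrete, here is a minimal sketch of uniform frame sampling and of reading coordinates back out of generated text. Both parts rest on assumptions. The sampling strategy shown (segment midpoints) is one common choice and is not confirmed as Molmo2's. The `<point x="…" y="…">` tag format with percentage coordinates is a hypothetical stand-in for whatever the model actually emits. `sampleFrameIndices` and `parsePoints` are illustrative names only.

```typescript
// Pick up to `numFrames` indices spread evenly across a clip of `totalFrames` frames.
// Assumption: uniform sampling at the midpoint of each equal-width segment.
function sampleFrameIndices(totalFrames: number, numFrames: number): number[] {
  if (totalFrames <= 0 || numFrames <= 0) return [];
  const n = Math.min(numFrames, totalFrames);
  return Array.from({ length: n }, (_, i) =>
    Math.floor(((i + 0.5) * totalFrames) / n)
  );
}

interface Point {
  x: number; // pixels from the left edge
  y: number; // pixels from the top edge
}

// HYPOTHETICAL output format: <point x="50.0" y="25.0">label</point>,
// with x and y given as percentages of frame width and height.
function parsePoints(output: string, width: number, height: number): Point[] {
  const pattern = /<point x="([\d.]+)" y="([\d.]+)"/g;
  const points: Point[] = [];
  let m: RegExpExecArray | null;
  while ((m = pattern.exec(output)) !== null) {
    points.push({
      x: (parseFloat(m[1]) / 100) * width,
      y: (parseFloat(m[2]) / 100) * height,
    });
  }
  return points;
}

console.log(sampleFrameIndices(100, 4)); // [12, 37, 62, 87]
console.log(parsePoints('<point x="50" y="25">cat</point>', 640, 480)); // [{ x: 320, y: 120 }]
```

Emitting coordinates as text lets a single decoder handle captioning, counting, and pointing without a separate detection head. The cost is that downstream code has to parse and rescale the values, as `parsePoints` does here.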

Mixpeek SDK Integration

import { Mixpeek } from "mixpeek";

const mx = new Mixpeek({ apiKey: "API_KEY" });

// Managed: create a collection over a bucket; Mixpeek runs this model's extractor
const collection = await mx.collections.create({
  namespace_id: "my-namespace",
  collection_name: "my-collection",
  source: { type: "bucket", bucket_ids: ["bkt_your_bucket"] },
  feature_extractor: {
    feature_extractor_name: "caption",
    version: "v1",
    parameters: { model_id: "mixpeek://image_extractor@v1/allenai_molmo2_8b_v1" },
  },
});