    HF · Scene Captioning · Apache 2.0

    Molmo2-8B

    by allenai

    Open VLM with video grounding — locate and track objects across frames

    85K downloads/month
    8B parameters
    Identifiers
    Model ID: allenai/Molmo2-8B
    Feature URI: mixpeek://image_extractor@v1/allenai_molmo2_8b_v1
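The Feature URI listed above appears to follow a `mixpeek://<extractor>@<version>/<feature>` shape. The sketch below splits such a string into its parts. The field names (`extractor`, `version`, `feature`) and the helper `parse_feature_uri` are hypothetical labels inferred from the one example on this page, not names taken from Mixpeek documentation.

```python
import re
from typing import NamedTuple


class FeatureURI(NamedTuple):
    # Field names are assumptions inferred from the example URI on this page.
    extractor: str
    version: str
    feature: str


# Assumed shape: mixpeek://<extractor>@<version>/<feature>
_FEATURE_URI = re.compile(
    r"^mixpeek://(?P<extractor>[^@/]+)@(?P<version>[^/]+)/(?P<feature>.+)$"
)


def parse_feature_uri(uri: str) -> FeatureURI:
    """Split a mixpeek:// feature URI into extractor, version, and feature."""
    match = _FEATURE_URI.match(uri)
    if match is None:
        raise ValueError(f"not a recognised feature URI: {uri!r}")
    return FeatureURI(**match.groupdict())


parsed = parse_feature_uri("mixpeek://image_extractor@v1/allenai_molmo2_8b_v1")
print(parsed.extractor, parsed.version, parsed.feature)
```

Parsing the URI up front can help catch a mistyped extractor or version before a request is sent.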

    Overview

    Molmo2 is a fully open (weights + data) vision-language model from AI2 that supports image, video, and multi-image understanding with strong spatial grounding. It can point to, track, and count objects in video — outperforming Qwen3-VL on video counting (35.5 vs 29.6) and Gemini 3 Pro on video pointing (38.4 vs 20.0 F1).

    Built on Qwen3-8B and the SigLIP 2 vision encoder, Molmo2 is released with both open weights and open training data, enabling full reproducibility.

    Architecture

    An 8B-parameter VLM that pairs a Qwen3-8B language backbone with a SigLIP 2 vision encoder. Multi-image and video inputs are handled via frame sampling, and spatial grounding is expressed as coordinates predicted in the output tokens.
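Frame sampling can be illustrated with a small sketch. The page does not say which sampling strategy Molmo2 or Mixpeek uses, so the uniform, segment-centred scheme below and the function name `sample_frame_indices` are assumptions chosen purely for illustration.

```python
def sample_frame_indices(total_frames: int, num_samples: int) -> list[int]:
    """Pick num_samples frame indices spread evenly across a clip.

    Assumption: uniform sampling. The model's actual strategy is not
    described in the source page.
    """
    if total_frames <= 0 or num_samples <= 0:
        return []
    if num_samples >= total_frames:
        # Fewer frames than requested samples: use every frame once.
        return list(range(total_frames))
    # Split the clip into equal-width segments and take each segment's centre.
    step = total_frames / num_samples
    return [int(step * i + step / 2) for i in range(num_samples)]


print(sample_frame_indices(100, 4))  # [12, 37, 62, 87]
```

Taking the centre of each segment avoids always picking the first and last frames, which are often fades or title cards.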

    Mixpeek SDK Integration

    mixpeek.ingest.from_url(
        url="s3://footage/scene.mp4",
        collection="video_library",
        feature_extractors=[{
            "type": "caption",
            "model": "mixpeek://video_extractor@v1/allenai_molmo2_8b_v1"
        }]
    )
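To apply the same extractor configuration to several files, the call above can be wrapped in a small loop. This is a sketch under stated assumptions: `ingest_many` is a hypothetical helper, and the ingest function is passed in as a parameter so the sketch relies only on the keyword arguments shown in the snippet above (`url`, `collection`, `feature_extractors`) and makes no further claims about the SDK.

```python
from typing import Any, Callable, Iterable

# Extractor config copied from the SDK snippet on this page.
MOLMO2_CAPTION_EXTRACTOR = {
    "type": "caption",
    "model": "mixpeek://video_extractor@v1/allenai_molmo2_8b_v1",
}


def ingest_many(
    ingest_fn: Callable[..., Any],
    urls: Iterable[str],
    collection: str,
) -> list[Any]:
    """Call ingest_fn once per URL with a shared collection and extractor."""
    results = []
    for url in urls:
        results.append(
            ingest_fn(
                url=url,
                collection=collection,
                # Copy the dict so one call cannot mutate another's config.
                feature_extractors=[dict(MOLMO2_CAPTION_EXTRACTOR)],
            )
        )
    return results
```

With the SDK available, usage would presumably look like `ingest_many(mixpeek.ingest.from_url, ["s3://footage/scene.mp4"], "video_library")`. Any callable that accepts the same keyword arguments works for local testing.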

    Capabilities

    • Image understanding
    • Video understanding
    • Object pointing and tracking
    • Video counting
    • Multi-image reasoning
    • Visual grounding

    Use Cases on Mixpeek

    • Video scene analysis with object tracking
    • Temporal grounding for video RAG
    • Frame-level annotation and description
    • Agent visual perception

    Performance

    Input Size: Variable
    GPU Latency: ~120 ms per frame (A100)
    GPU Throughput: ~8 frames/sec
    GPU Memory: Model dependent
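The latency and throughput figures above are related by simple arithmetic: 1000 ms / 120 ms per frame is about 8.3 frames/sec. The sketch below uses that relationship for a rough processing-time estimate. It assumes frames are processed one after another on a single GPU with no batching or I/O overhead. The function names and the `sampled_fps` parameter are illustrative and do not come from the source.

```python
def throughput_fps(latency_ms_per_frame: float) -> float:
    """Frames per second implied by a per-frame latency, assuming no batching."""
    return 1000.0 / latency_ms_per_frame


def estimate_processing_seconds(
    clip_seconds: float,
    sampled_fps: float,
    latency_ms_per_frame: float = 120.0,  # A100 figure quoted on this page
) -> float:
    """Rough time to caption a clip when sampling sampled_fps frames per second."""
    frames = clip_seconds * sampled_fps
    return frames * latency_ms_per_frame / 1000.0


print(round(throughput_fps(120.0), 1))         # 8.3
print(estimate_processing_seconds(60.0, 1.0))  # 7.2
```

Under these assumptions, a 60-second clip sampled at one frame per second needs roughly 7.2 seconds of GPU time. Real pipelines will differ with batching, decoding cost, and queueing.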

    Specification

    Framework: HF
    Organization: allenai
    Feature: Scene Captioning
    Output: text
    Modalities: video, image
    Retriever: Semantic Search
    Parameters: 8B
    License: Apache 2.0
    Downloads/mo: 85K

    Build a pipeline with Molmo2-8B

    Add this model to a processing pipeline alongside other extractors. Combine with retrieval stages for end-to-end search.
