Molmo2-8B
by allenai
Open VLM with video grounding — locate and track objects across frames
allenai/Molmo2-8B
mixpeek://image_extractor@v1/allenai_molmo2_8b_v1
Overview
Molmo2 is a fully open (weights and data) vision-language model from AI2 that supports image, video, and multi-image understanding with strong spatial grounding. It can point to, track, and count objects in video, and it outperforms Qwen3-VL on video counting (35.5 vs 29.6) and Gemini 3 Pro on video pointing (38.4 vs 20.0 F1).
Built on Qwen3-8B and the SigLIP 2 vision encoder, Molmo2 is unique in offering both open weights and open training data, enabling full reproducibility.
Architecture
An 8B-parameter VLM that pairs the Qwen3-8B language backbone with the SigLIP 2 vision encoder. Multi-image and video input is handled via frame sampling. Spatial grounding is expressed as coordinates predicted in the output tokens.
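To make the frame-sampling step concrete, here is a minimal sketch in Python. The page does not say which sampling strategy Molmo2 uses, so uniform sampling is an assumption, and the helper name `sample_frame_indices` is hypothetical.

```python
def sample_frame_indices(total_frames: int, num_samples: int) -> list[int]:
    """Pick evenly spaced frame indices from a video.

    Assumption: uniform sampling. The model's actual strategy is not
    documented on this page and may differ.
    """
    if total_frames <= 0 or num_samples <= 0:
        return []
    # Short clips: keep every frame rather than repeating indices.
    if num_samples >= total_frames:
        return list(range(total_frames))
    # Split the video into equal segments and take the middle of each one.
    step = total_frames / num_samples
    return [int(step * i + step / 2) for i in range(num_samples)]


if __name__ == "__main__":
    print(sample_frame_indices(100, 4))
```

Taking the midpoint of each segment, rather than its first frame, avoids always sampling frame 0 and spreads the samples evenly around the centre of the clip.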
Mixpeek SDK Integration
mixpeek.ingest.from_url(
    url="s3://footage/scene.mp4",
    collection="video_library",
    feature_extractors=[
        {
            "type": "caption",
            "model": "mixpeek://video_extractor@v1/allenai_molmo2_8b_v1",
        }
    ],
)
Capabilities
- Image understanding
- Video understanding
- Object pointing and tracking
- Video counting
- Multi-image reasoning
- Visual grounding
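Pointing and counting results usually need post-processing before they can be drawn on a frame or aggregated. The sketch below is illustrative only: it assumes the model's points have already been parsed into a hypothetical `Point` record with coordinates normalized to the range 0 to 1. The page does not specify the actual output format, so treat the record layout and the helper names as assumptions.

```python
from collections import Counter
from typing import NamedTuple


class Point(NamedTuple):
    """Hypothetical parsed grounding result (not the model's raw output)."""
    frame: int  # index of the sampled frame
    x: float    # assumed normalized horizontal position in [0, 1]
    y: float    # assumed normalized vertical position in [0, 1]


def to_pixels(point: Point, width: int, height: int) -> tuple[int, int]:
    """Scale a normalized point to pixel coordinates, clamped to the image."""
    px = max(0, min(int(point.x * width), width - 1))
    py = max(0, min(int(point.y * height), height - 1))
    return px, py


def count_per_frame(points: list[Point]) -> dict[int, int]:
    """Count how many points were reported in each frame."""
    return dict(Counter(p.frame for p in points))


if __name__ == "__main__":
    pts = [Point(0, 0.5, 0.25), Point(0, 0.1, 0.9), Point(3, 1.0, 1.0)]
    print([to_pixels(p, 1920, 1080) for p in pts])
    print(count_per_frame(pts))
```

Clamping keeps a point at exactly 1.0 inside the image bounds, which matters when the pixel coordinates are used to index into a frame array.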
Build a pipeline with Molmo2-8B
Add this model to a processing pipeline alongside other extractors. Combine with retrieval stages for end-to-end search.