Marlin-2B
by NemoStation
2B video VLM with second-precise temporal captioning and grounding
NemoStation/Marlin-2B
mixpeek://image_extractor@v1/nemostation_marlin_2b_v1
Overview
Marlin-2B is a 2-billion-parameter video vision-language model from NemoStation that specializes in dense video captioning with second-level timestamp precision and temporal grounding. It tops the CaReBench leaderboard at the 2B scale and competes with models 3-4x its size on temporal understanding tasks. Built on Qwen3.5-2B, it samples video at 2 FPS and accepts up to 240 frames (about 2 minutes of footage per pass), which keeps it practical for production video indexing.
Architecture
Video VLM built on Qwen3.5-2B with a temporal-aware visual encoder. Processes video at 2 FPS sampling rate with a 240-frame cap (covering up to 2 minutes of video). Generates timestamped captions with [start:end] markers and supports temporal grounding queries that return specific time ranges.
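The sampling arithmetic above (2 FPS with a 240-frame cap) can be sketched in a few lines of Python. This is only an illustration: the function name `sample_timestamps` is hypothetical, and the choice to truncate at the cap is an assumption, since the source does not say how video longer than 2 minutes is handled.

```python
def sample_timestamps(
    duration_s: float, fps: float = 2.0, max_frames: int = 240
) -> list[float]:
    """Return the timestamps (in seconds) at which frames would be sampled.

    Assumption: once the frame cap is reached, the remainder of the video
    is simply dropped. The real model may subsample or chunk instead.
    """
    if duration_s <= 0 or fps <= 0 or max_frames <= 0:
        return []
    # Frames available at the sampling rate, limited by the cap.
    n_frames = min(int(duration_s * fps), max_frames)
    return [i / fps for i in range(n_frames)]


short = sample_timestamps(30.0)   # 30 s clip, well under the cap
long = sample_timestamps(300.0)   # 5 min clip, hits the 240-frame cap

print(len(short), short[-1])  # 60 29.5
print(len(long), long[-1])    # 240 119.5
```

With these defaults the cap is reached at 240 / 2 = 120 seconds, which matches the 2-minute figure quoted above.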
Mixpeek SDK Integration
from mixpeek import Mixpeek

mx = Mixpeek(api_key="YOUR_KEY")

mx.ingest.videos(
    source="s3://media/raw-footage/",
    collection="video_archive",
    feature_extractors=[
        {
            "name": "scene_caption",
            "model": "NemoStation/Marlin-2B",
            "params": {"fps": 2, "max_frames": 240, "timestamps": True},
        }
    ],
)
Capabilities
- Dense video captioning with second-precise timestamps
- Temporal grounding — find specific moments from natural language queries
- Video summarization with temporal structure
- Scene transition detection and labeling
- Multi-event timeline generation from continuous video
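
To make the timestamped-caption and timeline capabilities concrete, here is a minimal sketch of turning caption text into a queryable timeline. It assumes the `[start:end]` markers carry plain seconds (for example `[4:9.5]`); the exact output format is not documented here, so the regular expression, the `Event` type, and the helper names are all hypothetical. The sample caption text is invented for illustration and is not real model output.

```python
import re
from typing import NamedTuple


class Event(NamedTuple):
    start_s: float
    end_s: float
    caption: str


# Assumed marker shape: "[<start seconds>:<end seconds>] caption text"
_MARKER = re.compile(r"\[(\d+(?:\.\d+)?):(\d+(?:\.\d+)?)\]\s*([^\[]*)")


def parse_timeline(text: str) -> list[Event]:
    """Extract (start, end, caption) events, sorted by start time."""
    events = []
    for m in _MARKER.finditer(text):
        start, end = float(m.group(1)), float(m.group(2))
        if end < start:
            continue  # skip malformed ranges rather than guess
        events.append(Event(start, end, m.group(3).strip()))
    return sorted(events, key=lambda e: e.start_s)


def events_at(events: list[Event], t_s: float) -> list[Event]:
    """Return every event whose time range contains t_s (inclusive)."""
    return [e for e in events if e.start_s <= t_s <= e.end_s]


# Invented sample text in the assumed format.
sample = "[0:4] A car pulls into the driveway. [4:9.5] The driver gets out and waves."
timeline = parse_timeline(sample)
print(events_at(timeline, 6.0)[0].caption)  # The driver gets out and waves.
```

A structure like this is what a retrieval stage would typically index: each event becomes a searchable record with a time range that can be used to seek into the source video.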
Use Cases on Mixpeek
Benchmarks
| Dataset | Metric | Result | Notes |
|---|---|---|---|
| CaReBench | Leaderboard rank | #1 at 2B scale | Competitive with 7B+ models |
| TimeLens-Bench | Temporal accuracy | Matches Gemini-2.0-Flash | At 1/10th the parameter count |
Performance
Specification
Build a pipeline with Marlin-2B
Add this model to a processing pipeline alongside other extractors. Combine with retrieval stages for end-to-end search.