NEWManaged multimodal retrieval.Explore platform →
    Models/Captioning/NemoStation/Marlin-2B
    HFScene CaptioningApache 2.0

    Marlin-2B

    by NemoStation

    2B video VLM with second-precise temporal captioning and grounding

    4Kdl/month
    2Bparams
    Identifiers
    Model ID
    NemoStation/Marlin-2B
    Feature URI
    mixpeek://image_extractor@v1/nemostation_marlin_2b_v1

    Overview

    Marlin-2B is a 2-billion parameter video vision-language model from NemoStation that specializes in dense video captioning with second-level timestamp precision and temporal grounding. It tops the CaReBench leaderboard at the 2B scale and competes with models 3-4x its size on temporal understanding tasks. Built on Qwen3.5-2B, it processes video at 2 FPS with up to 240 frames, making it practical for production video indexing.

    Architecture

    Video VLM built on Qwen3.5-2B with a temporal-aware visual encoder. Processes video at 2 FPS sampling rate with a 240-frame cap (covering up to 2 minutes of video). Generates timestamped captions with [start:end] markers and supports temporal grounding queries that return specific time ranges.

    Mixpeek SDK Integration

    from mixpeek import Mixpeek
    mx = Mixpeek(api_key="YOUR_KEY")
    mx.ingest.videos(
    source="s3://media/raw-footage/",
    collection="video_archive",
    feature_extractors=[{
    "name": "scene_caption",
    "model": "NemoStation/Marlin-2B",
    "params": {"fps": 2, "max_frames": 240, "timestamps": True}
    }]
    )

    Capabilities

    • Dense video captioning with second-precise timestamps
    • Temporal grounding — find specific moments from natural language queries
    • Video summarization with temporal structure
    • Scene transition detection and labeling
    • Multi-event timeline generation from continuous video

    Use Cases on Mixpeek

    Indexing ad creative libraries with temporal metadata
    Building searchable video archives with timestamp-level granularity
    Automated highlight detection and clip extraction
    Video content moderation with temporal evidence
    Meeting/lecture video segmentation and indexing

    Benchmarks

    DatasetMetricScoreSource
    CaReBenchScore#1 at 2B scaleCompetitive with 7B+ models
    TimeLens-BenchTemporal AccMatches Gemini-2.0-FlashAt 1/10th the parameter count

    Performance

    Input SizeVariable
    GPU LatencyInput dependent
    GPU Throughput~8 videos/min (A100, 30s clips at 2 FPS)
    GPU Memory~5 GB

    Specification

    FrameworkHF
    OrganizationNemoStation
    FeatureScene Captioning
    Outputtext
    Modalitiesvideo, image
    RetrieverSemantic Search
    Parameters2B
    LicenseApache 2.0
    Downloads/mo4K

    Build a pipeline with Marlin-2B

    Add this model to a processing pipeline alongside other extractors. Combine with retrieval stages for end-to-end search.

    Open Studio