    HF · Scene Captioning · License: Other

    moondream3-preview

    by moondream

    Compact visual reasoning model for fast image QA and scene captions

    276K downloads/month
    Parameters: Compact VLM
    Identifiers

    Model ID: moondream/moondream3-preview
    Feature URI: mixpeek://image_extractor@v1/moondream3_preview_v1

    Overview

    Moondream3 Preview is a compact image-text model from Moondream for visual question answering, captioning, and deployable visual reasoning. Like earlier Moondream releases, it prioritizes a small, easy-to-deploy footprint while retaining enough visual reasoning quality for production perception pipelines.

    On Mixpeek, Moondream3 is a useful second-stage model after cheap embedding retrieval. Use it to caption candidate images, answer bounded visual questions, or extract concise observations that an agent can cite.
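The retrieve-then-inspect pattern described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the Mixpeek API: candidate embeddings are plain number arrays, first-stage similarity is cosine, and the caption callback stands in for whatever call actually runs Moondream3. The names Candidate, topKByCosine, and captionShortlist are hypothetical.

```typescript
// Hypothetical two-stage pipeline: cheap embedding retrieval first, then an
// expensive VLM pass (e.g. Moondream3 captioning) on the shortlist only.

interface Candidate {
  id: string;
  embedding: number[];
}

// Cosine similarity; assumes both vectors have the same length.
function cosine(a: number[], b: number[]): number {
  let dot = 0, na = 0, nb = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    na += a[i] * a[i];
    nb += b[i] * b[i];
  }
  return na === 0 || nb === 0 ? 0 : dot / (Math.sqrt(na) * Math.sqrt(nb));
}

// Stage 1: rank every candidate against the query embedding and keep the top k.
function topKByCosine(query: number[], items: Candidate[], k: number): Candidate[] {
  return items
    .map((c) => ({ c, score: cosine(query, c.embedding) }))
    .sort((x, y) => y.score - x.score)
    .slice(0, k)
    .map((x) => x.c);
}

// Stage 2: call the (slow) captioner only for the shortlisted candidates.
// `caption` is a placeholder for the real model call.
async function captionShortlist(
  query: number[],
  items: Candidate[],
  k: number,
  caption: (id: string) => Promise<string>,
): Promise<Map<string, string>> {
  const out = new Map<string, string>();
  for (const c of topKByCosine(query, items, k)) {
    out.set(c.id, await caption(c.id));
  }
  return out;
}
```

The point of the split is cost: the VLM runs k times rather than once per image in the collection.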

    Architecture

    An image-text-to-text model distributed as custom code for Hugging Face Transformers, so loading it requires trust_remote_code=True. It supports caption generation, visual question answering, and streaming output for interactive applications.
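Streaming output lets an interface show text before generation finishes. The sketch below assumes only that the serving layer yields text chunks as an AsyncIterable of strings; fakeStream is a hypothetical stand-in for a real model stream, and renderStream is an illustrative helper, not part of any Moondream or Mixpeek API.

```typescript
// Hypothetical stand-in for a model's streamed output: yields chunks in order.
async function* fakeStream(chunks: string[]): AsyncGenerator<string> {
  for (const chunk of chunks) {
    yield chunk;
  }
}

// Append each chunk as it arrives and report the partial text so a UI can
// update incrementally. Resolves to the complete text once the stream ends.
async function renderStream(
  stream: AsyncIterable<string>,
  onPartial: (textSoFar: string) => void,
): Promise<string> {
  let text = "";
  for await (const chunk of stream) {
    text += chunk;
    onPartial(text);
  }
  return text;
}
```

Because renderStream accepts any AsyncIterable, the same consumer works whether chunks come from a local generator or a network stream.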

    Mixpeek SDK Integration

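    import { Mixpeek } from "mixpeek";

    // Client authenticated with your Mixpeek API key (placeholder value shown).
    const mx = new Mixpeek({ apiKey: "API_KEY" });

    // Ingest images from S3 into a collection and caption each one with
    // Moondream3 Preview in short caption mode.
    // Top-level await requires running this as an ES module.
    await mx.collections.ingest({
      collection_id: "image-library",
      source: { url: "s3://assets/images/" },
      feature_extractors: [{
        feature: "scene_caption",
        model: "moondream/moondream3-preview",
        params: { caption_length: "short" }
      }]
    });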

    Capabilities

    • Image captioning with short and detailed modes
    • Visual question answering over retrieved images
    • Compact deployment compared with large VLMs
    • Streaming generation support
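One way to model the captioning and question-answering capabilities listed above in client code is a discriminated union, sketched below. The request shapes, the "short" and "detailed" labels, and the helpers makeQuery and summarize are hypothetical; they illustrate the pattern and do not describe the model's or Mixpeek's actual request format.

```typescript
// Hypothetical request shapes for the two task types: captioning and VQA.
type CaptionLength = "short" | "detailed";

type VisionRequest =
  | { task: "caption"; imageUrl: string; length: CaptionLength }
  | { task: "query"; imageUrl: string; question: string };

// Build a VQA request, trimming the question and rejecting empty ones.
function makeQuery(imageUrl: string, question: string): VisionRequest {
  const q = question.trim();
  if (q.length === 0) {
    throw new Error("question must not be empty");
  }
  return { task: "query", imageUrl, question: q };
}

// Narrow on the `task` tag; the `never` assignment makes the compiler
// report an error if a new task kind is added but not handled here.
function summarize(req: VisionRequest): string {
  switch (req.task) {
    case "caption":
      return `caption[${req.length}] ${req.imageUrl}`;
    case "query":
      return `query "${req.question}" ${req.imageUrl}`;
    default: {
      const unreachable: never = req;
      return unreachable;
    }
  }
}
```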

    Use Cases on Mixpeek

    Caption retrieved product or ad images before agent reasoning
    Answer visual questions over screenshots and camera frames
    Generate compact evidence summaries for multimodal search results
    Run lightweight inspection passes over candidate image sets
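For the evidence-summary use case, a model observation is most useful to an agent when it is short and tied to the image it came from. The sketch below packages a caption into a citable record; the Evidence shape, the 120-character default limit, and the helpers compact and toEvidence are illustrative assumptions, not a Mixpeek output format.

```typescript
// Hypothetical citable evidence record produced from a model observation.
interface Evidence {
  sourceId: string;    // ID of the retrieved image or frame
  modelId: string;     // model that produced the observation
  observation: string; // compact, possibly truncated text
}

// Collapse whitespace, then cut text longer than maxChars at the last word
// boundary before the limit and append "..." to mark the truncation.
function compact(text: string, maxChars: number): string {
  const t = text.trim().replace(/\s+/g, " ");
  if (t.length <= maxChars) return t;
  const cut = t.slice(0, maxChars);
  const lastSpace = cut.lastIndexOf(" ");
  return (lastSpace > 0 ? cut.slice(0, lastSpace) : cut) + "...";
}

// Attach the source ID and model ID so downstream agents can cite the record.
function toEvidence(sourceId: string, caption: string, maxChars = 120): Evidence {
  return {
    sourceId,
    modelId: "moondream/moondream3-preview",
    observation: compact(caption, maxChars),
  };
}
```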

    Performance

    Input: Image plus instruction
    GPU latency: Depends on output length
    GPU throughput: Depends on batch size
    GPU memory: Small-VLM deployment class

    Best used after first-stage retrieval or for high-throughput caption generation
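Since throughput depends on batch size, high-volume captioning is usually organized as fixed-size batches. The sketch below assumes a hypothetical captionBatch callback that returns one caption per input; the helper names chunk and captionInBatches are illustrative, and a suitable batch size depends on the GPU memory actually available.

```typescript
// Split items into consecutive batches of at most `size` items.
function chunk<T>(items: T[], size: number): T[][] {
  if (!Number.isInteger(size) || size < 1) {
    throw new RangeError("batch size must be a positive integer");
  }
  const batches: T[][] = [];
  for (let i = 0; i < items.length; i += size) {
    batches.push(items.slice(i, i + size));
  }
  return batches;
}

// Send one batch at a time to the (hypothetical) batch captioner and
// concatenate the results, preserving input order.
async function captionInBatches(
  imageIds: string[],
  batchSize: number,
  captionBatch: (ids: string[]) => Promise<string[]>,
): Promise<string[]> {
  const results: string[] = [];
  for (const batch of chunk(imageIds, batchSize)) {
    const captions = await captionBatch(batch);
    if (captions.length !== batch.length) {
      throw new Error("captionBatch must return one caption per input");
    }
    results.push(...captions);
  }
  return results;
}
```

Processing batches sequentially keeps peak memory bounded by one batch; raising the batch size trades memory for throughput.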

    Specification

    Framework: HF
    Organization: moondream
    Feature: Scene Captioning
    Output: text
    Modalities: video, image
    Retriever: Semantic Search
    Parameters: Compact VLM
    License: Other
    Downloads/month: 276K

    Research Paper

    Moondream3 Preview model card (arxiv.org)

    Build a pipeline with moondream3-preview

    Add this model to a processing pipeline alongside other extractors. Combine with retrieval stages for end-to-end search.
