    HF · Scene Captioning · MIT

    InternVL3-78B

    by OpenGVLab

    78B flagship multimodal LLM for image, video, and document understanding

    450K downloads/month
    78B parameters

    Identifiers

    Model ID: OpenGVLab/InternVL3-78B
    Feature URI: mixpeek://image_extractor@v1/opengvlab_internvl3_78b_v1
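
The Feature URI above appears to pack three pieces of information into one string. As a minimal sketch, assuming the layout is `mixpeek://<extractor>@<version>/<model slug>` (an interpretation of the single URI shown here, not a documented grammar), it can be split like this. The function name and the field names are hypothetical.

```python
import re

# Assumed layout: mixpeek://<extractor>@<version>/<model slug>.
# This pattern is inferred from the one URI on this page and is not an official grammar.
FEATURE_URI = re.compile(
    r"^mixpeek://(?P<extractor>[a-z0-9_]+)@(?P<version>v\d+)/(?P<model>[a-z0-9_]+)$"
)


def parse_feature_uri(uri: str) -> dict:
    """Split a Feature URI into extractor, version, and model slug (hypothetical helper)."""
    match = FEATURE_URI.match(uri)
    if match is None:
        raise ValueError(f"not a recognised Feature URI: {uri!r}")
    return match.groupdict()


parts = parse_feature_uri("mixpeek://image_extractor@v1/opengvlab_internvl3_78b_v1")
print(parts["extractor"], parts["version"], parts["model"])
```

Validating the URI before submitting a pipeline gives an early, local error for typos rather than a failed ingestion job.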

    Overview

    InternVL3-78B is OpenGVLab's flagship open-source multimodal LLM, scaling the InternVL3 architecture to 78B parameters for state-of-the-art performance across image understanding, video comprehension, document analysis, and chart interpretation.

    InternVL3-78B achieves top results among open-source MLLMs on general multimodal benchmarks, reasoning tasks, and agentic evaluations. On Mixpeek, it serves as the highest-quality option for scene description, visual Q&A, and structured extraction from complex visual content where accuracy matters more than latency.

    Architecture

    InternViT-6B vision encoder paired with a Qwen2.5-72B language model, for roughly 78B total parameters. Dynamic resolution support splits each image into 448×448-pixel tiles, which lets the model process inputs at up to 4K resolution. Supports interleaved image-text and multi-frame video input.
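
To make the tile-based encoding concrete, here is a simplified sketch of how an image could be mapped onto a grid of fixed-size tiles. It assumes 448-pixel tiles and a hypothetical budget of 12 tiles per image. The selection rule (closest aspect ratio, then closest pixel area) is an illustration of the idea, not the model's actual preprocessing code.

```python
from itertools import product

TILE = 448       # assumed tile side in pixels
MAX_TILES = 12   # hypothetical per-image tile budget


def choose_grid(width: int, height: int, max_tiles: int = MAX_TILES) -> tuple:
    """Return (cols, rows) for the grid that best matches the image shape."""
    target = width / height
    grids = [
        (cols, rows)
        for cols, rows in product(range(1, max_tiles + 1), repeat=2)
        if cols * rows <= max_tiles
    ]
    # Primary key: aspect-ratio mismatch. Tie-break: grid area closest to image area.
    return min(
        grids,
        key=lambda g: (
            abs(target - g[0] / g[1]),
            abs(g[0] * g[1] * TILE * TILE - width * height),
        ),
    )


def tile_boxes(cols: int, rows: int, tile: int = TILE) -> list:
    """List (left, top, right, bottom) crop boxes for the resized image, row by row."""
    return [
        (c * tile, r * tile, (c + 1) * tile, (r + 1) * tile)
        for r in range(rows)
        for c in range(cols)
    ]


cols, rows = choose_grid(3840, 2160)   # a 4K frame
print(cols, rows, len(tile_boxes(cols, rows)))
```

Under these assumptions a 4K frame is resized to a 4x2 grid and encoded as eight tiles, while a small square image stays a single tile, so the number of visual tokens grows with image size.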

    Mixpeek SDK Integration

    from mixpeek import Mixpeek

    # Authenticate with your Mixpeek API key
    mixpeek = Mixpeek(api_key="YOUR_API_KEY")

    # Ingest from S3 and caption with InternVL3-78B, referenced by its Feature URI
    mixpeek.ingest.videos(
        collection="documents",
        source={"type": "s3", "bucket": "visual-docs"},
        pipeline={
            "captioning": {
                "model": "mixpeek://image_extractor@v1/opengvlab_internvl3_78b_v1"
            }
        },
    )

    Capabilities

    • State-of-the-art open-source multimodal understanding
    • High-resolution image analysis with dynamic tiling
    • Complex document and chart comprehension
    • Multi-frame video understanding
    • Structured data extraction from visual content

    Use Cases on Mixpeek

    • High-accuracy scene captioning for critical pipelines
    • Complex document analysis (charts, tables, diagrams)
    • Visual Q&A requiring deep reasoning
    • Agent visual perception for complex environments
    • Quality-critical content moderation

    Benchmarks

    Dataset      Metric      Score   Source
    MMMU         Accuracy    72.2    Model card
    MathVista    Score       74.5    Model card
    DocVQA       Accuracy    94.8    Model card

    Performance

    Input Size: Variable
    GPU Latency: ~120 ms per image (A100 80GB)
    GPU Throughput: ~8 images/sec (A100)
    GPU Memory: ~160 GB (2x A100 80GB)
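
The memory and throughput figures above can be sanity-checked with back-of-the-envelope arithmetic. The sketch below assumes 16-bit weights (2 bytes per parameter) and counts weights only, so activations and the KV cache would come on top. The helper names are hypothetical.

```python
import math

PARAMS = 78e9          # parameter count from this page
BYTES_PER_PARAM = 2    # assumption: bf16/fp16 weights
GPU_MEMORY_GB = 80     # A100 80GB
LATENCY_S = 0.120      # ~120 ms per image, from the table above


def weight_memory_gb(params: float, bytes_per_param: int = BYTES_PER_PARAM) -> float:
    """Memory needed for the weights alone, in GB (10^9 bytes)."""
    return params * bytes_per_param / 1e9


def gpus_for_weights(params: float, gpu_gb: int = GPU_MEMORY_GB) -> int:
    """Minimum number of GPUs whose combined memory holds the weights."""
    return math.ceil(weight_memory_gb(params) / gpu_gb)


def sequential_rate(latency_s: float) -> float:
    """Images per second when requests are processed one after another."""
    return 1.0 / latency_s


print(weight_memory_gb(PARAMS))      # weights-only footprint in GB
print(gpus_for_weights(PARAMS))      # GPUs needed just to hold the weights
print(round(sequential_rate(LATENCY_S), 1))
```

Under these assumptions the weights alone take about 156 GB, which lines up with the listed ~160 GB across two A100 80GB cards, and a 120 ms latency corresponds to roughly 8 images per second when processed sequentially.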

    Specification

    Framework: HF
    Organization: OpenGVLab
    Feature: Scene Captioning
    Output: text
    Modalities: video, image
    Retriever: Semantic Search
    Parameters: 78B
    License: MIT
    Downloads/mo: 450K

    Research Paper

    Model paper or technical report

    arxiv.org

    Build a pipeline with InternVL3-78B

    Add this model to a processing pipeline alongside other extractors. Combine with retrieval stages for end-to-end search.
