InternVL3-78B

by OpenGVLab

78B flagship multimodal LLM for image, video, and document understanding

450K downloads/month

78B params


Identifiers

Model ID

OpenGVLab/InternVL3-78B

Feature URI

mixpeek://image_extractor@v1/opengvlab_internvl3_78b_v1
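Judging from this one example, the Feature URI looks like it follows the shape mixpeek://<extractor>@<version>/<model slug>. That grammar is inferred here, not documented, and the helper name parseFeatureUri is hypothetical. A minimal sketch of splitting the URI into its parts, under that assumption:

```typescript
// Hypothetical helper: the URI grammar below is inferred from a single
// example on this page, not taken from Mixpeek documentation.
interface FeatureUriParts {
  extractor: string; // e.g. "image_extractor"
  version: string;   // e.g. "v1"
  model: string;     // e.g. "opengvlab_internvl3_78b_v1"
}

function parseFeatureUri(uri: string): FeatureUriParts | null {
  // mixpeek://<extractor>@<version>/<model slug>
  const match = /^mixpeek:\/\/([^@\/]+)@([^\/]+)\/(.+)$/.exec(uri);
  if (match === null) {
    return null; // not in the assumed shape
  }
  const [, extractor, version, model] = match;
  return { extractor, version, model };
}

const parts = parseFeatureUri(
  "mixpeek://image_extractor@v1/opengvlab_internvl3_78b_v1",
);
console.log(parts);
```

Returning null rather than throwing lets callers decide how to handle identifiers that do not fit the assumed pattern.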

Overview

InternVL3-78B is OpenGVLab's flagship open-source multimodal LLM, scaling the InternVL3 architecture to 78B parameters for state-of-the-art performance across image understanding, video comprehension, document analysis, and chart interpretation.

InternVL3-78B achieves top results among open-source MLLMs on general multimodal benchmarks, reasoning tasks, and agentic evaluations. On Mixpeek, it serves as the highest-quality option for scene description, visual Q&A, and structured extraction from complex visual content where accuracy matters more than latency.

Architecture

InternViT-6B vision encoder + Qwen2.5-72B language model with dynamic resolution support. 78B total parameters. Processes images at up to 4K resolution by splitting them into fixed-size tiles and encoding each tile separately. Supports interleaved image-text and multi-frame video input.
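To make the tile-based encoding concrete, here is a simplified sketch of how a tile grid might be chosen for an image: pick the columns-by-rows layout whose aspect ratio is closest to the image's, subject to a cap on the number of tiles. The function name chooseTileGrid, the 448-pixel tile size, and the 12-tile cap are assumptions for illustration. This is not the model's actual preprocessing code.

```typescript
// Illustrative only: TILE_SIZE and the default maxTiles are assumed values,
// and chooseTileGrid is a hypothetical helper, not part of any SDK.
const TILE_SIZE = 448;

interface TileGrid {
  cols: number;
  rows: number;
  tiles: number;
}

function chooseTileGrid(width: number, height: number, maxTiles = 12): TileGrid {
  const aspect = width / height;
  let best: TileGrid = { cols: 1, rows: 1, tiles: 1 };
  let bestDiff = Math.abs(aspect - 1);

  // Enumerate every cols x rows grid that fits within the tile budget.
  for (let cols = 1; cols <= maxTiles; cols++) {
    for (let rows = 1; cols * rows <= maxTiles; rows++) {
      const diff = Math.abs(aspect - cols / rows);
      // Only a strictly closer aspect ratio replaces the current best,
      // so ties keep the grid that was found first.
      if (diff < bestDiff) {
        best = { cols, rows, tiles: cols * rows };
        bestDiff = diff;
      }
    }
  }
  return best;
}

// The image would then be resized to (cols * TILE_SIZE) x (rows * TILE_SIZE)
// and cut into tiles that the vision encoder processes one at a time.
const grid = chooseTileGrid(896, 448);
console.log(grid, grid.cols * TILE_SIZE, grid.rows * TILE_SIZE);
```

Under these assumptions, wider or taller images get more tiles along their long side, which is how a fixed-resolution encoder can cover a high-resolution input.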

Mixpeek SDK Integration

import { Mixpeek } from "mixpeek";

const mx = new Mixpeek({ apiKey: "API_KEY" });

// Managed: create a collection over a bucket; Mixpeek runs this model's extractor
const collection = await mx.collections.create({
  namespace_id: "my-namespace",
  collection_name: "my-collection",
  source: { type: "bucket", bucket_ids: ["bkt_your_bucket"] },
  feature_extractor: {
    // Extractor name and version as they appear in the Feature URI above
    feature_extractor_name: "image_extractor",
    version: "v1",
    parameters: { model_id: "mixpeek://image_extractor@v1/opengvlab_internvl3_78b_v1" },
  },
});