Keye-VL-8B-Preview

by Kwai-Keye

Short-video VLM with temporal precision via 3D positional encoding

38K downloads/month

8B params


Identifiers

Model ID

Kwai-Keye/Keye-VL-8B-Preview

Feature URI

mixpeek://video_extractor@v1/kwai_keye_vl_8b_v1

Overview

Keye-VL is a vision-language model (VLM) engineered for short-form video understanding that retains general vision-language abilities. Built by Kuaishou, the operator of one of the world's largest short-video platforms, it uses 3D RoPE to process text, images, and video in a unified way, with a one-to-one correspondence between position encoding and absolute time.

Trained on 600B+ tokens with an emphasis on video, Keye-VL excels at understanding the dominant content format of the modern internet: short clips.

Architecture

Keye-VL is an 8B-parameter model built on the Qwen3-8B language model and a SigLIP vision encoder. It uses 3D RoPE (Rotary Position Embedding) for unified spatial-temporal encoding, which enables precise temporal grounding in video content.
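To make the idea of time-aligned 3D positions concrete, here is a minimal sketch of how each video patch could be given a (time, height, width) index, with the temporal index derived from the frame's timestamp rather than its order in the sampled sequence. This is an illustration only, not Keye-VL's actual implementation: the `videoPositions` function, the `Position3D` type, and the `ticksPerSecond` granularity are all assumptions made for this example.

```typescript
// Hypothetical illustration: names and the ticksPerSecond granularity are
// assumptions, not taken from the Keye-VL codebase.
type Position3D = { t: number; h: number; w: number };

function videoPositions(
  frameTimesSec: number[], // timestamp of each sampled frame, in seconds
  gridH: number,           // patches per frame, vertically
  gridW: number,           // patches per frame, horizontally
  ticksPerSecond = 2,      // assumed temporal resolution of the position axis
): Position3D[] {
  const positions: Position3D[] = [];
  for (const time of frameTimesSec) {
    // The temporal index follows the frame's absolute timestamp,
    // so unevenly sampled frames keep their real spacing in time.
    const t = Math.round(time * ticksPerSecond);
    for (let h = 0; h < gridH; h++) {
      for (let w = 0; w < gridW; w++) {
        positions.push({ t, h, w });
      }
    }
  }
  return positions;
}

// Frames sampled unevenly at 0 s, 0.5 s, and 2 s, each split into a 2x2 patch grid.
const pos = videoPositions([0, 0.5, 2.0], 2, 2);
const temporalIndices = pos
  .filter((p) => p.h === 0 && p.w === 0)
  .map((p) => p.t);
console.log(temporalIndices); // [ 0, 1, 4 ]
```

In a rotary scheme, each of the three indices would drive the rotation of its own slice of the embedding dimensions. Because the temporal index here tracks seconds rather than frame count, the gap between the second and third frames (1.5 s) is preserved as a larger positional distance.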

Mixpeek SDK Integration

import { Mixpeek } from "mixpeek";

const mx = new Mixpeek({ apiKey: "API_KEY" });

// Managed: create a collection over a bucket; Mixpeek runs this model's extractor
const collection = await mx.collections.create({
  namespace_id: "my-namespace",
  collection_name: "my-collection",
  source: { type: "bucket", bucket_ids: ["bkt_your_bucket"] },
  feature_extractor: {
    feature_extractor_name: "caption",
    version: "v1",
    parameters: { model_id: "mixpeek://video_extractor@v1/kwai_keye_vl_8b_v1" },
  },
});