Keye-VL-8B-Preview
by Kwai-Keye
Short-video VLM with temporal precision via 3D positional encoding
Kwai-Keye/Keye-VL-8B-Preview
mixpeek://video_extractor@v1/kwai_keye_vl_8b_v1
Overview
Keye-VL is a vision-language model (VLM) engineered for short-form video understanding that retains general vision-language abilities. Built by Kuaishou, operator of one of the world's largest short-video platforms, it uses 3D RoPE to process text, images, and video in a single scheme, with a one-to-one correspondence between position encoding and absolute time.
Trained on more than 600B tokens with an emphasis on video, Keye-VL targets short clips, a dominant content format on the modern internet.
Architecture
Keye-VL-8B-Preview is an 8B-parameter model that pairs the Qwen3-8B language model with a SigLIP vision encoder. It uses 3D RoPE (Rotary Position Embedding) for unified spatial-temporal encoding, which enables precise temporal grounding in video content.
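To make the idea of time-aligned 3D positions concrete, here is a minimal sketch of how each visual token could be assigned a (time, height, width) index whose time component is proportional to the frame's timestamp. This is an illustration only, not Keye-VL's actual implementation: the function name `build_3d_position_ids`, the `ticks_per_second` parameter, and the rounding rule are all assumptions made for this example.

```python
from typing import List, Tuple


def build_3d_position_ids(
    timestamps_s: List[float],
    grid_h: int,
    grid_w: int,
    ticks_per_second: float = 2.0,  # hypothetical temporal resolution
) -> List[Tuple[int, int, int]]:
    """Return one (t, h, w) position triple per visual token.

    The temporal index is derived from the frame's absolute timestamp,
    so unevenly sampled frames get unevenly spaced temporal indices.
    """
    positions: List[Tuple[int, int, int]] = []
    for ts in timestamps_s:
        t = round(ts * ticks_per_second)  # absolute time -> temporal index
        for h in range(grid_h):
            for w in range(grid_w):
                positions.append((t, h, w))
    return positions


# Three frames sampled at 0.0 s, 0.5 s, and 2.0 s, each split into a 2x2 patch grid.
ids = build_3d_position_ids([0.0, 0.5, 2.0], grid_h=2, grid_w=2)
```

In this sketch the gap between the second and third frames (1.5 s) shows up as a larger jump in the temporal index than the gap between the first two (0.5 s), which is the property that lets a position be mapped back to a point in time.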
Mixpeek SDK Integration
mixpeek.ingest.from_url(
    url="s3://ugc/short-clip.mp4",
    collection="short_videos",
    feature_extractors=[
        {
            "type": "caption",
            "model": "mixpeek://video_extractor@v1/kwai_keye_vl_8b_v1",
        }
    ],
)
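When ingesting many clips, it can help to build the keyword arguments in one place so the model URI is never retyped. The sketch below only constructs plain dictionaries with the same keys as the call above; the helper `build_ingest_kwargs`, the constant `KEYE_VL_MODEL_URI`, and the second clip URL are hypothetical names introduced for illustration, and no Mixpeek SDK call is made.

```python
from typing import Any, Dict, List

# Model URI copied from the integration snippet above.
KEYE_VL_MODEL_URI = "mixpeek://video_extractor@v1/kwai_keye_vl_8b_v1"


def build_ingest_kwargs(
    url: str,
    collection: str,
    extractor_type: str = "caption",
) -> Dict[str, Any]:
    """Build keyword arguments shaped like the ingest call shown above."""
    if not url:
        raise ValueError("url must be a non-empty string")
    return {
        "url": url,
        "collection": collection,
        "feature_extractors": [
            {"type": extractor_type, "model": KEYE_VL_MODEL_URI}
        ],
    }


# The second URL is a made-up example.
clip_urls: List[str] = [
    "s3://ugc/short-clip.mp4",
    "s3://ugc/another-clip.mp4",
]
ingest_requests = [build_ingest_kwargs(u, "short_videos") for u in clip_urls]
```

Assuming the client accepts these keyword arguments as in the snippet above, each entry could then be passed along as `mixpeek.ingest.from_url(**kwargs)`.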
Capabilities
- Short-video understanding
- Temporal grounding
- Image understanding
- Video QA
- Scene classification
- Action recognition
Use Cases on Mixpeek
Performance
Common Pipeline Companions
Specification
Build a pipeline with Keye-VL-8B-Preview
Add this model to a processing pipeline alongside other extractors. Combine with retrieval stages for end-to-end search.