
    Semantic Video Chunking: Scene Detection

    Intelligent video chunking using scene detection and vector embeddings. This tutorial covers how to break down videos into semantic scenes, generate embeddings, and enable powerful semantic search capabilities.


    Just as RAG systems break down text documents into meaningful chunks for better processing and retrieval, video content benefits from intelligent segmentation through scene detection. This approach parallels text tokenization in several crucial ways:

    Semantic Coherence

    Scene detection identifies natural boundaries in video content, maintaining semantic completeness much as text chunking preserves sentence or paragraph integrity. Each scene represents a complete "thought" or action sequence rather than an arbitrary time-based split. For example, in a cooking tutorial:

    [Figure: scene-based video chunking with vector embeddings. A cooking tutorial (0:00 to 7:00) is segmented into Ingredient Prep, Mixing, Cooking, and Plating scenes, each mapped to its own embedding vector.]

    Retrieval Precision

    Scene-based chunks enable precise content retrieval. Instead of returning entire videos, systems can identify and serve the exact relevant scene, similar to how RAG systems return specific text passages rather than complete documents.

    Vector Embedding Quality

    Scene-based chunking produces higher quality embeddings because:

    1. Each embedding represents a coherent visual concept
    2. The embeddings aren't "confused" by mixing multiple scenes
    3. The semantic space remains clean, enabling better similarity matching

    Processing Efficiency

    Like token windows in language models, scene-based chunking helps manage video processing:

    1. Smaller, focused chunks enable efficient processing
    2. Parallel processing becomes more feasible (see the sketch below)
    3. Storage and retrieval operations are optimized
    4. Reduces redundant processing of similar frames
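    To make the parallel-processing point concrete, here is a minimal sketch that fans independent scene chunks out to a thread pool; embed_chunk is a hypothetical stand-in for whatever per-chunk work you do (upload, embedding, storage):

    from concurrent.futures import ThreadPoolExecutor
    
    def embed_chunk(chunk_path):
        # Hypothetical per-chunk work: upload the file, request an
        # embedding, store the vector.
        ...
    
    chunk_paths = ["chunk_0.mp4", "chunk_1.mp4", "chunk_2.mp4"]  # example paths
    
    # Scene chunks are independent of one another, so they can be processed concurrently
    with ThreadPoolExecutor(max_workers=4) as pool:
        results = list(pool.map(embed_chunk, chunk_paths))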
    [Figure: semantic chunking, text vs. video. A text document is split into passages, each with its own embedding; a video is split into detected scenes (e.g. opening scene 0:00-1:30, action sequence 1:31-3:00, closing scene 3:01-4:30), each with its own embedding. Key benefits: maintains semantic meaning, enables efficient processing, improves retrieval accuracy, preserves context.]

    Multimodal Understanding

    Unlike text tokenization, video scenes often contain multiple modalities (visual, audio, text-on-screen) that need to be processed together. This complexity makes intelligent chunking even more crucial for maintaining context and enabling accurate understanding.

    💡 Learn multimodal understanding for free: https://mixpeek.com/education

    Introduction

    Video understanding at scale requires efficient processing and indexing of video content. This tutorial demonstrates how to implement dynamic video chunking using scene detection, generate embeddings with Mixpeek, and store them in Weaviate for semantic search capabilities.

    [Figure: video processing pipeline. Video input (S3 / direct upload / URL) feeds scene detection (PySceneDetect, content-based), then video chunking (FFmpeg extraction, parallel processing), then Mixpeek embedding generation, then Weaviate vector storage, then semantic search (scene retrieval and ranking).]

    Prerequisites

    pip install scenedetect weaviate-client python-dotenv requests

    You'll also need FFmpeg installed and available on your PATH, since the chunking step shells out to the ffmpeg CLI.

    Implementation Guide

    1. Scene Detection with PySceneDetect

    First, let's implement the scene detection logic:

    from scenedetect import detect, ContentDetector
    
    def detect_scenes(video_path, threshold=27.0):
        """
        Detect scene changes in a video file using content detection.
        
        Args:
            video_path (str): Path to the video file
            threshold (float): Detection threshold (lower = more sensitive)
        
        Returns:
            list: List of scene timestamps (start, end) in seconds
        """
        # Detect scenes using content detection
        scenes = detect(video_path, ContentDetector(threshold=threshold))
        
        # Convert scenes to timestamp ranges
        scene_timestamps = []
        for scene in scenes:
            start_time = scene[0].get_seconds()
            end_time = scene[1].get_seconds()
            scene_timestamps.append((start_time, end_time))
        
        return scene_timestamps
    
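    For example, running detection on a local file (the file name is illustrative):

    scenes = detect_scenes("tutorial.mp4", threshold=27.0)
    for start, end in scenes:
        print(f"Scene: {start:.1f}s - {end:.1f}s")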

    2. Video Chunking Utility

    Create a utility to split the video into chunks based on detected scenes:

    import os
    import subprocess
    
    def chunk_video(video_path, output_dir, timestamps):
        """
        Split video into chunks based on scene timestamps.
        
        Args:
            video_path (str): Path to the source video
            output_dir (str): Directory to save video chunks
            timestamps (list): List of (start, end) timestamps
        
        Returns:
            list: Paths to generated video chunks
        """
        chunk_paths = []
        
        for idx, (start, end) in enumerate(timestamps):
            output_path = os.path.join(output_dir, f"chunk_{idx}.mp4")
            
            # Use ffmpeg to extract the chunk. Stream copy (-c copy) is fast,
            # but it cuts on keyframes, so chunk boundaries may be slightly
            # off; re-encode instead if you need frame-accurate splits.
            command = [
                'ffmpeg', '-i', video_path,
                '-ss', str(start),
                '-t', str(end - start),
                '-c', 'copy',
                output_path
            ]
            
            subprocess.run(command, capture_output=True, check=True)
            chunk_paths.append(output_path)
        
        return chunk_paths
    
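    Chaining detection and chunking together might look like this (file names are illustrative):

    scenes = detect_scenes("tutorial.mp4")
    os.makedirs("video_chunks", exist_ok=True)
    chunk_paths = chunk_video("tutorial.mp4", "video_chunks", scenes)
    print(f"Wrote {len(chunk_paths)} chunks")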

    3. Mixpeek Integration

    Set up the Mixpeek client for generating embeddings:

    import requests
    
    class MixpeekClient:
        def __init__(self, api_key: str):
            self.api_key = api_key
            self.base_url = "https://api.mixpeek.com"
            
        def generate_embedding(self, video_url: str, vector_index: str) -> dict:
            """
            Generate embeddings for a video chunk using Mixpeek.
            """
            headers = {
                'Content-Type': 'application/json',
                'Authorization': f'Bearer {self.api_key}'
            }
            
            payload = {
                "type": "url",
                "value": video_url,
                "vector_index": vector_index
            }
            
            response = requests.post(
                f"{self.base_url}/features/extractors/embed",
                headers=headers,
                json=payload
            )
            response.raise_for_status()  # surface HTTP errors instead of returning them silently
            
            return response.json()
    
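    A quick usage sketch, assuming the chunk has already been uploaded somewhere reachable (the URL below is a placeholder) and that the response contains the "embedding" field that store_embedding expects later:

    import os
    
    mixpeek = MixpeekClient(os.environ["MIXPEEK_API_KEY"])
    embedding_data = mixpeek.generate_embedding(
        "https://cdn.example.com/chunks/chunk_0.mp4",  # placeholder: host your chunk anywhere reachable
        "video_vector"
    )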

    4. Weaviate Integration

    Set up the Weaviate client for storing embeddings:

    import weaviate
    
    def setup_weaviate_schema(client):
        """
        Set up the Weaviate schema for video chunks.
        """
        class_obj = {
            "class": "VideoChunk",
            "vectorizer": "none",  # We'll use custom vectors from Mixpeek
            "properties": [
                {
                    "name": "videoId",
                    "dataType": ["string"]
                },
                {
                    "name": "chunkStart",
                    "dataType": ["number"]
                },
                {
                    "name": "chunkEnd",
                    "dataType": ["number"]
                },
                {
                    "name": "sourceUrl",
                    "dataType": ["string"]
                }
            ]
        }
        
        client.schema.create_class(class_obj)
    
    def store_embedding(client, embedding_data: dict, chunk_metadata: dict):
        """
        Store video chunk embedding in Weaviate.
        """
        vector = embedding_data["embedding"]
        
        properties = {
            "videoId": chunk_metadata["video_id"],
            "chunkStart": chunk_metadata["start_time"],
            "chunkEnd": chunk_metadata["end_time"],
            "sourceUrl": chunk_metadata["url"]
        }
        
        client.data_object.create(
            "VideoChunk",
            properties,
            vector=vector
        )
    
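    Connecting and creating the schema is a one-time setup step. A minimal sketch, assuming the v3 weaviate-client API used above (create_class raises if the class already exists):

    import os
    import weaviate
    
    client = weaviate.Client(os.getenv("WEAVIATE_URL", "http://localhost:8080"))
    setup_weaviate_schema(client)  # run once per Weaviate instance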

    5. Putting It All Together

    Here's how to use all the components together:

    import os
    from dotenv import load_dotenv
    
    def process_video(video_path: str, upload_base_url: str):
        """
        Process a video through the entire pipeline:
        1. Detect scenes
        2. Create chunks
        3. Generate embeddings
        4. Store in Weaviate
        """
        load_dotenv()
        
        # Initialize clients
        mixpeek_client = MixpeekClient(os.getenv("MIXPEEK_API_KEY"))
        weaviate_client = weaviate.Client(os.getenv("WEAVIATE_URL"))
        
        # Detect scenes
        scenes = detect_scenes(video_path)
        
        # Create chunks
        output_dir = "video_chunks"
        os.makedirs(output_dir, exist_ok=True)
        chunk_paths = chunk_video(video_path, output_dir, scenes)
        
        # Process each chunk
        video_id = os.path.basename(video_path)
        
        for chunk_path, (start_time, end_time) in zip(chunk_paths, scenes):
            # Upload chunk and get URL (implementation depends on your storage solution)
            chunk_url = f"{upload_base_url}/{os.path.basename(chunk_path)}"
            
            # Generate embedding
            embedding_data = mixpeek_client.generate_embedding(
                chunk_url,
                "video_vector"
            )
            
            # Store in Weaviate
            chunk_metadata = {
                "video_id": video_id,
                "start_time": start_time,
                "end_time": end_time,
                "url": chunk_url
            }
            
            store_embedding(weaviate_client, embedding_data, chunk_metadata)
    
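    With everything defined, processing a video is one call (file name and base URL are illustrative). Keep in mind that process_video only constructs chunk URLs; actually uploading the chunk files to that location is left to your storage layer:

    process_video("tutorial.mp4", "https://cdn.example.com/chunks")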

    Searching Video Chunks

    Here's how to search through the processed video chunks:

    def search_video_chunks(client, query_vector, limit=5):
        """
        Search for similar video chunks using the query vector.
        """
        response = (
            client.query
            .get("VideoChunk", ["videoId", "chunkStart", "chunkEnd", "sourceUrl"])
            .with_near_vector({
                "vector": query_vector,
                "certainty": 0.7
            })
            .with_limit(limit)
            .do()
        )
        
        return response["data"]["Get"]["VideoChunk"]
    
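    The search function needs a query vector in the same embedding space as the stored chunks. One way to get one is to embed the query text itself. The sketch below reuses the embed endpoint from the MixpeekClient above and assumes it accepts "text" inputs the way the search endpoint shown later does, and that the response exposes an "embedding" field:

    import os
    import requests
    import weaviate
    
    def embed_text_query(api_key, text, vector_index="video_vector"):
        """Hypothetical helper: embed a text query into the chunk vector space."""
        response = requests.post(
            "https://api.mixpeek.com/features/extractors/embed",
            headers={
                "Content-Type": "application/json",
                "Authorization": f"Bearer {api_key}",
            },
            # assumes the embed endpoint accepts "text" like the search endpoint does
            json={"type": "text", "value": text, "vector_index": vector_index},
        )
        response.raise_for_status()
        return response.json()["embedding"]  # assumed response shape
    
    weaviate_client = weaviate.Client(os.getenv("WEAVIATE_URL"))
    query_vector = embed_text_query(os.environ["MIXPEEK_API_KEY"], "cooking scene")
    results = search_video_chunks(weaviate_client, query_vector)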
    [Figure: video scene search. A text query ("cooking scene") is embedded and matched against stored scene vectors, returning ranked scenes with timestamps and match scores. Search metrics: 1024-dimensional vectors, cosine similarity, sub-100 ms response time, top-k of 10.]

    Best Practices

    1. Scene Detection Tuning
      • Adjust the threshold based on your video content
      • Consider using multiple detection methods for different types of content
      • Implement minimum/maximum chunk duration constraints (see the sketch after this list)
    2. Embedding Storage
      • Use batch processing for multiple chunks
      • Implement error handling and retries
      • Consider implementing a caching layer
    3. Performance Optimization
      • Process chunks in parallel when possible
      • Implement progressive loading for large videos
      • Use appropriate video codec settings for chunks
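    For the duration constraints in point 1, a small post-processing pass over the detect_scenes output is enough. A minimal sketch with illustrative thresholds:

    def enforce_duration_bounds(scenes, min_len=2.0, max_len=60.0):
        """
        Merge scenes shorter than min_len into the previous scene, and
        split scenes longer than max_len into fixed-size windows.
        """
        bounded = []
        for start, end in scenes:
            # Fold a very short scene into its predecessor
            if bounded and (end - start) < min_len:
                bounded[-1] = (bounded[-1][0], end)
                continue
            # Break an overly long scene into max_len windows
            while end - start > max_len:
                bounded.append((start, start + max_len))
                start += max_len
            bounded.append((start, end))
        return bounded
    
    scenes = enforce_duration_bounds(detect_scenes("tutorial.mp4"))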

    Conclusion

    This pipeline enables efficient video understanding by:

    • Breaking videos into meaningful segments
    • Generating rich embeddings for each segment
    • Enabling semantic search across video content

    The combination of PySceneDetect, Mixpeek, and Weaviate creates a powerful system for video understanding and retrieval.

    All this in two API calls

    Want to implement this entire pipeline in just two API calls? Here's how you can do it with Mixpeek:

    Ingest video:

    import requests
    
    url = "https://api.mixpeek.com/ingest/videos/url"
    
    payload = {
        "url": "https://example.com/sample-video.mp4",
        "collection": "scene_tutorial"
    }
    headers = {"Content-Type": "application/json"}
    
    response = requests.request("POST", url, json=payload, headers=headers)
    
    print(response.text)

    Hybrid search videos:

    import requests
    
    url = "https://api.mixpeek.com/features/search"
    
    payload = {
        "queries": [
            {
                "type": "text",
                "value": "boy outside",
                "vector_index": "multimodal"
            },
            {
                "type": "url",
                "value": "https://example.com/dog.jpg",
                "vector_index": "multimodal"
            }
        ],
        "collections": ["scene_tutorial"]
    }
    headers = {"Content-Type": "application/json"}
    
    response = requests.request("POST", url, json=payload, headers=headers)
    
    print(response.text)

    That's it! All the complexity of scene detection, chunking, embedding generation, vector storage, and semantic search is handled for you in these two simple API calls.

    Ethan Steininger

    December 22, 2024 · 6 min read