
    Semantic Video Chunking: Scene Detection

    Intelligent video chunking using scene detection and vector embeddings. This tutorial covers how to break down videos into semantic scenes, generate embeddings, and enable powerful semantic search capabilities.


    Just as RAG systems break down text documents into meaningful chunks for better processing and retrieval, video content benefits from intelligent segmentation through scene detection. This approach parallels text tokenization in several crucial ways:

    Semantic Coherence

    Scene detection identifies natural boundaries in video content, maintaining semantic completeness much as text chunking preserves sentence or paragraph integrity. Each scene represents a complete "thought" or action sequence rather than an arbitrary time-based split. For example, in a cooking tutorial:

    [Figure: scene-based video chunking with vector embeddings. A cooking tutorial (0:00 to 7:00) is segmented into Ingredient Prep, Mixing, Cooking, and Plating scenes, each mapped to its own embedding vector.]

    Retrieval Precision

    Scene-based chunks enable precise content retrieval. Instead of returning entire videos, systems can identify and serve the exact relevant scene, similar to how RAG systems return specific text passages rather than complete documents.

    Vector Embedding Quality

    Scene-based chunking produces higher quality embeddings because:

    1. Each embedding represents a coherent visual concept
    2. The embeddings aren't "confused" by mixing multiple scenes
    3. The semantic space remains clean, enabling better similarity matching

    Processing Efficiency

    Like token windows in language models, scene-based chunking helps manage video processing:

    1. Smaller, focused chunks enable efficient processing
    2. Parallel processing becomes more feasible (see the sketch below)
    3. Storage and retrieval operations are optimized
    4. Reduces redundant processing of similar frames
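    To make the parallel-processing point concrete, here is a minimal sketch that fans independent scene chunks out to a thread pool; embed_chunk is a hypothetical stand-in for whatever per-chunk work you do (upload, embedding, storage):

    from concurrent.futures import ThreadPoolExecutor
    
    def embed_chunk(chunk_path):
        # Hypothetical per-chunk work: upload the file, request an
        # embedding, store the vector.
        ...
    
    chunk_paths = ["chunk_0.mp4", "chunk_1.mp4", "chunk_2.mp4"]  # example paths
    
    # Scene chunks are independent of one another, so they can be processed concurrently
    with ThreadPoolExecutor(max_workers=4) as pool:
        results = list(pool.map(embed_chunk, chunk_paths))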
    [Figure: semantic chunking, text vs. video. A text document is split into passages, each with its own embedding; a video is split into detected scenes (e.g. opening scene 0:00-1:30, action sequence 1:31-3:00, closing scene 3:01-4:30), each with its own embedding. Key benefits: maintains semantic meaning, enables efficient processing, improves retrieval accuracy, preserves context.]

    Multimodal Understanding

    Unlike text tokenization, video scenes often contain multiple modalities (visual, audio, text-on-screen) that need to be processed together. This complexity makes intelligent chunking even more crucial for maintaining context and enabling accurate understanding.

    💡 Learn multimodal understanding for free: https://mixpeek.com/education

    Introduction

    Video understanding at scale requires efficient processing and indexing of video content. This tutorial demonstrates how to implement dynamic video chunking using scene detection, generate embeddings with Mixpeek, and store them in Weaviate for semantic search capabilities.

    [Figure: video processing pipeline. Video input (S3 / direct upload / URL) feeds scene detection (PySceneDetect, content-based), then video chunking (FFmpeg extraction, parallel processing), then Mixpeek embedding generation, then Weaviate vector storage, then semantic search (scene retrieval and ranking).]

    Prerequisites

    pip install scenedetect weaviate-client python-dotenv requests

    You'll also need FFmpeg installed and available on your PATH, since the chunking step shells out to the ffmpeg CLI.

    Implementation Guide

    1. Scene Detection with PySceneDetect

    First, let's implement the scene detection logic:

    from scenedetect import detect, ContentDetector
    
    def detect_scenes(video_path, threshold=27.0):
        """
        Detect scene changes in a video file using content detection.
        
        Args:
            video_path (str): Path to the video file
            threshold (float): Detection threshold (lower = more sensitive)
        
        Returns:
            list: List of scene timestamps (start, end) in seconds
        """
        # Detect scenes using content detection
        scenes = detect(video_path, ContentDetector(threshold=threshold))
        
        # Convert scenes to timestamp ranges
        scene_timestamps = []
        for scene in scenes:
            start_time = scene[0].get_seconds()
            end_time = scene[1].get_seconds()
            scene_timestamps.append((start_time, end_time))
        
        return scene_timestamps
    
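    For example, running detection on a local file (the file name is illustrative):

    scenes = detect_scenes("tutorial.mp4", threshold=27.0)
    for start, end in scenes:
        print(f"Scene: {start:.1f}s - {end:.1f}s")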

    2. Video Chunking Utility

    Create a utility to split the video into chunks based on detected scenes:

    import os
    import subprocess
    
    def chunk_video(video_path, output_dir, timestamps):
        """
        Split video into chunks based on scene timestamps.
        
        Args:
            video_path (str): Path to the source video
            output_dir (str): Directory to save video chunks
            timestamps (list): List of (start, end) timestamps
        
        Returns:
            list: Paths to generated video chunks
        """
        chunk_paths = []
        
        for idx, (start, end) in enumerate(timestamps):
            output_path = os.path.join(output_dir, f"chunk_{idx}.mp4")
            
            # Use ffmpeg to extract the chunk. Stream copy (-c copy) is fast,
            # but it cuts on keyframes, so chunk boundaries may be slightly
            # off; re-encode instead if you need frame-accurate splits.
            command = [
                'ffmpeg', '-i', video_path,
                '-ss', str(start),
                '-t', str(end - start),
                '-c', 'copy',
                output_path
            ]
            
            subprocess.run(command, capture_output=True, check=True)
            chunk_paths.append(output_path)
        
        return chunk_paths
    
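    Chaining detection and chunking together might look like this (file names are illustrative):

    scenes = detect_scenes("tutorial.mp4")
    os.makedirs("video_chunks", exist_ok=True)
    chunk_paths = chunk_video("tutorial.mp4", "video_chunks", scenes)
    print(f"Wrote {len(chunk_paths)} chunks")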

    3. Mixpeek Integration

    Set up the Mixpeek client for generating embeddings:

    import requests
    
    class MixpeekClient:
        def __init__(self, api_key: str):
            self.api_key = api_key
            self.base_url = "https://api.mixpeek.com"
            
        def generate_embedding(self, video_url: str, vector_index: str) -> dict:
            """
            Generate embeddings for a video chunk using Mixpeek.
            """
            headers = {
                'Content-Type': 'application/json',
                'Authorization': f'Bearer {self.api_key}'
            }
            
            payload = {
                "type": "url",
                "value": video_url,
                "vector_index": vector_index
            }
            
            response = requests.post(
                f"{self.base_url}/features/extractors/embed",
                headers=headers,
                json=payload
            )
            response.raise_for_status()  # surface HTTP errors instead of returning them silently
            
            return response.json()
    
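    A quick usage sketch, assuming the chunk has already been uploaded somewhere reachable (the URL below is a placeholder) and that the response contains the "embedding" field that store_embedding expects later:

    import os
    
    mixpeek = MixpeekClient(os.environ["MIXPEEK_API_KEY"])
    embedding_data = mixpeek.generate_embedding(
        "https://cdn.example.com/chunks/chunk_0.mp4",  # placeholder: host your chunk anywhere reachable
        "video_vector"
    )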

    4. Weaviate Integration

    Set up the Weaviate client for storing embeddings:

    import weaviate
    
    def setup_weaviate_schema(client):
        """
        Set up the Weaviate schema for video chunks.
        """
        class_obj = {
            "class": "VideoChunk",
            "vectorizer": "none",  # We'll use custom vectors from Mixpeek
            "properties": [
                {
                    "name": "videoId",
                    "dataType": ["string"]
                },
                {
                    "name": "chunkStart",
                    "dataType": ["number"]
                },
                {
                    "name": "chunkEnd",
                    "dataType": ["number"]
                },
                {
                    "name": "sourceUrl",
                    "dataType": ["string"]
                }
            ]
        }
        
        client.schema.create_class(class_obj)
    
    def store_embedding(client, embedding_data: dict, chunk_metadata: dict):
        """
        Store video chunk embedding in Weaviate.
        """
        vector = embedding_data["embedding"]
        
        properties = {
            "videoId": chunk_metadata["video_id"],
            "chunkStart": chunk_metadata["start_time"],
            "chunkEnd": chunk_metadata["end_time"],
            "sourceUrl": chunk_metadata["url"]
        }
        
        client.data_object.create(
            "VideoChunk",
            properties,
            vector=vector
        )
    
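    Connecting and creating the schema is a one-time setup step. A minimal sketch, assuming the v3 weaviate-client API used above (create_class raises if the class already exists):

    import os
    import weaviate
    
    client = weaviate.Client(os.getenv("WEAVIATE_URL", "http://localhost:8080"))
    setup_weaviate_schema(client)  # run once per Weaviate instance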

    5. Putting It All Together

    Here's how to use all the components together:

    import os
    from dotenv import load_dotenv
    
    def process_video(video_path: str, upload_base_url: str):
        """
        Process a video through the entire pipeline:
        1. Detect scenes
        2. Create chunks
        3. Generate embeddings
        4. Store in Weaviate
        """
        load_dotenv()
        
        # Initialize clients
        mixpeek_client = MixpeekClient(os.getenv("MIXPEEK_API_KEY"))
        weaviate_client = weaviate.Client(os.getenv("WEAVIATE_URL"))
        
        # Detect scenes
        scenes = detect_scenes(video_path)
        
        # Create chunks
        output_dir = "video_chunks"
        os.makedirs(output_dir, exist_ok=True)
        chunk_paths = chunk_video(video_path, output_dir, scenes)
        
        # Process each chunk
        video_id = os.path.basename(video_path)
        
        for chunk_path, (start_time, end_time) in zip(chunk_paths, scenes):
            # Upload chunk and get URL (implementation depends on your storage solution)
            chunk_url = f"{upload_base_url}/{os.path.basename(chunk_path)}"
            
            # Generate embedding
            embedding_data = mixpeek_client.generate_embedding(
                chunk_url,
                "video_vector"
            )
            
            # Store in Weaviate
            chunk_metadata = {
                "video_id": video_id,
                "start_time": start_time,
                "end_time": end_time,
                "url": chunk_url
            }
            
            store_embedding(weaviate_client, embedding_data, chunk_metadata)
    
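    With everything defined, processing a video is one call (file name and base URL are illustrative). Keep in mind that process_video only constructs chunk URLs; actually uploading the chunk files to that location is left to your storage layer:

    process_video("tutorial.mp4", "https://cdn.example.com/chunks")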

    Searching Video Chunks

    Here's how to search through the processed video chunks:

    def search_video_chunks(client, query_vector, limit=5):
        """
        Search for similar video chunks using the query vector.
        """
        response = (
            client.query
            .get("VideoChunk", ["videoId", "chunkStart", "chunkEnd", "sourceUrl"])
            .with_near_vector({
                "vector": query_vector,
                "certainty": 0.7
            })
            .with_limit(limit)
            .do()
        )
        
        return response["data"]["Get"]["VideoChunk"]
    
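    The search function needs a query vector in the same embedding space as the stored chunks. One way to get one is to embed the query text itself. The sketch below reuses the embed endpoint from the MixpeekClient above and assumes it accepts "text" inputs the way the search endpoint shown later does, and that the response exposes an "embedding" field:

    import os
    import requests
    import weaviate
    
    def embed_text_query(api_key, text, vector_index="video_vector"):
        """Hypothetical helper: embed a text query into the chunk vector space."""
        response = requests.post(
            "https://api.mixpeek.com/features/extractors/embed",
            headers={
                "Content-Type": "application/json",
                "Authorization": f"Bearer {api_key}",
            },
            # assumes the embed endpoint accepts "text" like the search endpoint does
            json={"type": "text", "value": text, "vector_index": vector_index},
        )
        response.raise_for_status()
        return response.json()["embedding"]  # assumed response shape
    
    weaviate_client = weaviate.Client(os.getenv("WEAVIATE_URL"))
    query_vector = embed_text_query(os.environ["MIXPEEK_API_KEY"], "cooking scene")
    results = search_video_chunks(weaviate_client, query_vector)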
    [Figure: video scene search. A text query ("cooking scene") is embedded and matched against stored scene vectors, returning ranked scenes with timestamps and match scores. Search metrics: 1024-dimensional vectors, cosine similarity, sub-100 ms response time, top-k of 10.]

    Best Practices

    1. Scene Detection Tuning
      • Adjust the threshold based on your video content
      • Consider using multiple detection methods for different types of content
      • Implement minimum/maximum chunk duration constraints (see the sketch after this list)
    2. Embedding Storage
      • Use batch processing for multiple chunks
      • Implement error handling and retries
      • Consider implementing a caching layer
    3. Performance Optimization
      • Process chunks in parallel when possible
      • Implement progressive loading for large videos
      • Use appropriate video codec settings for chunks
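    For the duration constraints in point 1, a small post-processing pass over the detect_scenes output is enough. A minimal sketch with illustrative thresholds:

    def enforce_duration_bounds(scenes, min_len=2.0, max_len=60.0):
        """
        Merge scenes shorter than min_len into the previous scene, and
        split scenes longer than max_len into fixed-size windows.
        """
        bounded = []
        for start, end in scenes:
            # Fold a very short scene into its predecessor
            if bounded and (end - start) < min_len:
                bounded[-1] = (bounded[-1][0], end)
                continue
            # Break an overly long scene into max_len windows
            while end - start > max_len:
                bounded.append((start, start + max_len))
                start += max_len
            bounded.append((start, end))
        return bounded
    
    scenes = enforce_duration_bounds(detect_scenes("tutorial.mp4"))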

    Conclusion

    This pipeline enables efficient video understanding by:

    • Breaking videos into meaningful segments
    • Generating rich embeddings for each segment
    • Enabling semantic search across video content

    The combination of PySceneDetect, Mixpeek, and Weaviate creates a powerful system for video understanding and retrieval.

    All this in two API calls

    Want to implement this entire pipeline in just two API calls? Here's how you can do it with Mixpeek:

    Ingest video:

    import requests
    
    url = "https://api.mixpeek.com/ingest/videos/url"
    
    payload = {
        "url": "https://example.com/sample-video.mp4",
        "collection": "scene_tutorial"
    }
    headers = {"Content-Type": "application/json"}
    
    response = requests.request("POST", url, json=payload, headers=headers)
    
    print(response.text)

    Hybrid search videos:

    import requests
    
    url = "https://api.mixpeek.com/features/search"
    
    payload = {
        "queries": [
            {
                "type": "text",
                "value": "boy outside",
                "vector_index": "multimodal"
            },
            {
                "type": "url",
                "value": "https://example.com/dog.jpg",
                "vector_index": "multimodal"
            }
        ],
        "collections": ["scene_tutorial"]
    }
    headers = {"Content-Type": "application/json"}
    
    response = requests.request("POST", url, json=payload, headers=headers)
    
    print(response.text)

    That's it! All the complexity of scene detection, chunking, embedding generation, vector storage, and semantic search is handled for you in these two simple API calls.

    Ethan Steininger

    December 22, 2024 · 6 min read