
    AI Video Tagging With Dynamic Taxonomies

    AI video tagging used to mean manual review and basic object detection. With multimodal models and dynamic taxonomies, you can now automatically detect brand moments, inappropriate content, actions, moods and trending content at scale.


    Dynamic taxonomies enable automatic classification of video content at scale. Instead of manually tagging thousands of hours of footage, multimodal AI can identify scenes, moods, actions, and key moments across your video library.

    Diagram: video segmentation and taxonomy classification. Four 15-second segments (0:00-1:00) map to the classifiers high_energy, emotional, dialog, and action, with confidence scores of 0.92, 0.85, 0.88, and 0.95.

    Real-World Applications

    Content Libraries

    • Scene-level categorization for episodic content
    • Identification of specific actions (fights, chases, emotional moments)
    • Automated content moderation
    • Mood-based classification for recommendation systems

    News & Sports

    • Automatic distinction between studio and field footage
    • Action detection (goals, plays, celebrations)
    • Speaker/anchor identification
    • On-screen text extraction and classification

    User-Generated Content

    • Brand moment detection
    • Inappropriate content flagging
    • Action/mood classification
    • Trending content identification

    Implementation Guide

    Define Your Taxonomy Structure

    Create hierarchical classifications that match your content:

    POST /entities/taxonomies
    {
      "taxonomy_name": "content_classifier",
      "nodes": [
        {
          "name": "moods",
          "embedding_config": [
            {
              "embedding_model": "multimodal",
              "type": "text",
              "value": "Scene mood and emotional atmosphere analysis"
            }
          ],
          "children": [
            {
              "name": "high_energy",
              "embedding_config": [
                {
                  "embedding_model": "multimodal",
                  "type": "video",
                  "value": "https://assets.example.com/reference/action_scene.mp4"
                },
                {
                  "embedding_model": "text",
                  "value": "Fast-paced, dynamic, intense action and movement"
                }
              ]
            },
            {
              "name": "emotional",
              "embedding_config": [
                {
                  "embedding_model": "multimodal",
                  "type": "video",
                  "value": "https://assets.example.com/reference/dramatic_scene.mp4"
                },
                {
                  "embedding_model": "text",
                  "value": "Dramatic, emotional, intimate character moments"
                }
              ]
            }
          ]
        }
      ]
    }
    
    Taxonomies - Mixpeek
    Create and manage hierarchical classifications for multimodal content organization
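
    If you prefer to script these calls, a thin wrapper around the REST API is enough. Here is a minimal sketch in Python; the base URL, bearer-token auth, and the mixpeek_post helper name are assumptions for illustration rather than documented values, and the taxonomy body is abridged to one node (use the full request body above in practice):

    import requests

    # Assumed endpoint and auth scheme for illustration; substitute your real values.
    BASE_URL = "https://api.mixpeek.com"
    HEADERS = {"Authorization": "Bearer YOUR_API_KEY"}

    def mixpeek_post(path: str, payload: dict) -> dict:
        """POST a JSON payload to the API and return the parsed JSON response."""
        response = requests.post(f"{BASE_URL}{path}", json=payload, headers=HEADERS)
        response.raise_for_status()
        return response.json()

    # Create the taxonomy (body abridged; see the full request above).
    taxonomy = mixpeek_post("/entities/taxonomies", {
        "taxonomy_name": "content_classifier",
        "nodes": [{
            "name": "moods",
            "embedding_config": [{
                "embedding_model": "multimodal",
                "type": "text",
                "value": "Scene mood and emotional atmosphere analysis",
            }],
        }],
    })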

    Set Up Processing Pipeline

    Configure your namespace and collection:

    POST /namespaces
    {
      "namespace_name": "video_processing",
      "vector_indexes": ["multimodal", "text"],
      "payload_indexes": [
        {
          "field_name": "taxonomy.classifications",
          "type": "keyword",
          "field_schema": {
            "type": "keyword",
            "is_tenant": false
          }
        }
      ]
    }
    
    Namespaces - Mixpeek
    Create isolated environments for organizing and managing your search applications
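
    With the same hypothetical mixpeek_post helper from the taxonomy sketch, the namespace request above becomes one call:

    namespace = mixpeek_post("/namespaces", {
        "namespace_name": "video_processing",
        "vector_indexes": ["multimodal", "text"],
        "payload_indexes": [{
            "field_name": "taxonomy.classifications",
            "type": "keyword",
            # Python's False serializes to JSON false.
            "field_schema": {"type": "keyword", "is_tenant": False},
        }],
    })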

    Process Videos

    Ingest videos with intelligent sampling and taxonomy classification:

    POST /ingest/videos/url
    {
      "url": "https://content.example.com/videos/episode_123.mp4",
      "collection": "premium_content",
      "feature_extractors": {
        "interval_sec": 10,
        "embed": [
          {
            "type": "url",
            "vector_index": "multimodal"
          }
        ],
        "describe": {
          "enabled": true,
          "vector_index": "text"
        }
      },
      "taxonomy_config": {
        "taxonomy_ids": ["tax_abc123"],
        "confidence_threshold": 0.75,
        "min_segment_duration": 5
      }
    }
    
    Feature Extraction - Mixpeek
    Configure and customize multimodal feature extraction for different content types
    Diagram: the processing pipeline from raw video to search index. Video ingestion feeds intelligent sampling (conf: 0.92) and taxonomy classification (conf: 0.88), producing searchable tags such as action, high_energy, mood, and scene.
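
    Scripted, the ingest request above fits naturally in a small function so the sampling interval stays adjustable per video. A sketch reusing the hypothetical mixpeek_post helper from earlier; the collection name and taxonomy ID mirror the request body above:

    def ingest_video(url: str, interval_sec: int = 10) -> dict:
        """Submit one video for embedding, description, and taxonomy classification."""
        return mixpeek_post("/ingest/videos/url", {
            "url": url,
            "collection": "premium_content",
            "feature_extractors": {
                "interval_sec": interval_sec,
                "embed": [{"type": "url", "vector_index": "multimodal"}],
                "describe": {"enabled": True, "vector_index": "text"},
            },
            "taxonomy_config": {
                "taxonomy_ids": ["tax_abc123"],
                "confidence_threshold": 0.75,
                "min_segment_duration": 5,
            },
        })

    ingest_video("https://content.example.com/videos/episode_123.mp4")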

    Intelligent Sampling Settings

    Choose sampling intervals based on content type:

    Content Type      Interval (sec)   Rationale
    Action/Sports     5-10             Capture rapid changes
    Dialog Scenes     15-20            Focus on key moments
    News/Interviews   20-30            Capture scene changes
    💡 For more intelligent sampling, consider dynamic scene splitting: https://blog.mixpeek.com/dynamic-video-chunking-scene-detection/
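
    When scripting ingestion, the table above can live in code so each video gets an interval suited to its type. A minimal sketch; the content-type keys and the fallback default are illustrative choices, not Mixpeek parameters:

    # Sampling intervals (seconds) per content type, taken from the table above.
    SAMPLING_INTERVALS = {
        "action_sports": 5,     # capture rapid changes
        "dialog": 15,           # focus on key moments
        "news_interviews": 20,  # capture scene changes
    }

    def interval_for(content_type: str, default: int = 10) -> int:
        """Pick an interval_sec value for the ingest request."""
        return SAMPLING_INTERVALS.get(content_type, default)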

    Key Optimizations

    Reference Selection

    • Use high-quality, representative video clips for each category
    • Include multiple examples per taxonomy node
    • Update reference content as your library evolves

    Confidence Thresholds

    • Start high (0.85+) for critical classifications
    • Lower (0.7+) for general categorization
    • Adjust based on validation results
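
    "Adjust based on validation results" can be made concrete with a small threshold sweep over a human-labeled sample. A sketch with made-up data; the validation set and function names are hypothetical:

    # Hypothetical validation set: (model confidence, human-verified correct?) pairs.
    validation_samples = [
        (0.91, True), (0.88, True), (0.82, True), (0.78, False), (0.73, False),
    ]

    def precision_at(threshold: float, samples) -> float:
        """Precision among classifications kept at a given confidence threshold."""
        kept = [correct for conf, correct in samples if conf >= threshold]
        return sum(kept) / len(kept) if kept else 0.0

    # Sweep candidate thresholds to pick a per-taxonomy confidence_threshold.
    for t in (0.70, 0.75, 0.80, 0.85, 0.90):
        print(f"threshold={t:.2f} precision={precision_at(t, validation_samples):.2f}")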

    Search Integration

    Query classified content:

    POST /features/search
    {
      "collections": ["premium_content"],
      "queries": [
        {
          "vector_index": "multimodal",
          "type": "text",
          "value": "high energy action sequence"
        }
      ],
      "filters": {
        "AND": [
          {
            "key": "taxonomy.classifications.node_id",
            "operator": "in",
            "value": ["tax_node_high_energy"]
          }
        ]
      },
      "group_by": {
        "field": "asset_id",
        "max_features": 5
      }
    }
    
    Queries - Mixpeek
    Build powerful multimodal search queries across text, images, and videos
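
    The same search runs through the hypothetical mixpeek_post helper sketched earlier. Note that the iteration at the end assumes a response shape (a top-level results list), which may differ in your API version:

    results = mixpeek_post("/features/search", {
        "collections": ["premium_content"],
        "queries": [{
            "vector_index": "multimodal",
            "type": "text",
            "value": "high energy action sequence",
        }],
        "filters": {"AND": [{
            "key": "taxonomy.classifications.node_id",
            "operator": "in",
            "value": ["tax_node_high_energy"],
        }]},
        "group_by": {"field": "asset_id", "max_features": 5},
    })

    # Assumed response shape: one entry per asset_id group with its matched features.
    for group in results.get("results", []):
        print(group)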

    Practical Tips

    1. Start Small
      • Begin with 2-3 main categories
      • Validate classification accuracy
      • Expand based on results
    2. Optimize Processing
      • Use appropriate sampling intervals
      • Batch process similar content
      • Monitor classification confidence
    3. Maintain Quality
      • Regularly update reference content
      • Review edge cases
      • Adjust thresholds based on needs

    Common Challenges

    1. Mixed Content
      • Solution: Use multiple reference examples
      • Example: News segments with both studio and field footage
    2. Temporal Context
      • Solution: Adjust sampling intervals
      • Example: Sports highlights need denser sampling
    3. Scale Issues
      • Solution: Batch processing with appropriate intervals (see the sketch after this list)
      • Example: Process episodic content in seasons
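
    For the scale case, batch processing is just a loop over the same ingest call, one season at a time. A sketch reusing the hypothetical ingest_video and interval_for helpers from earlier; the episode URLs are placeholders:

    # One season of hypothetical episode URLs, processed as a single batch.
    season_urls = [
        f"https://content.example.com/videos/episode_{n}.mp4" for n in range(1, 13)
    ]
    for url in season_urls:
        ingest_video(url, interval_sec=interval_for("dialog"))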
    Taxonomy configuration decision tree, summarized by content type:

    Content Type                      interval_sec   confidence   embedding         min_segment   Optimization
    High Action (Sports, Action)      5              0.85         multimodal        3s            Optimize for rapid changes
    Dialog Heavy (News, Interviews)   15             0.75         text+multimodal   10s           Prioritize speaker detection
    Mixed Content (UGC, Shows)        10             0.80         multimodal        5s            Balance accuracy/performance

    The power of dynamic taxonomies comes from combining intelligent sampling with multimodal understanding. By properly configuring your taxonomy structure and processing pipeline, you can automatically classify thousands of hours of content with high accuracy.


    Ethan Steininger

    December 30, 2024 · 4 min read