
    Multimodal Taxonomies: How to Classify Video, Images, Audio, and Text with One Category System

    Traditional taxonomies classify one content type at a time. Multimodal taxonomies unify classification across every format using embedding similarity: the missing layer between raw AI features and structured, searchable metadata.


    TL;DR: Traditional taxonomies classify one content type at a time. Text gets labels, photos get tags, video gets a separate system. Multimodal taxonomies unify classification across every format by matching content against reference collections using embedding similarity. They bridge raw AI features and structured, searchable metadata.


    What Is a Taxonomy?

    A taxonomy is a classification system that organizes content into categories. Gmail sorting emails into Primary/Social/Promotions, Shopify categorizing products into Google's 5,500+ product taxonomy, YouTube classifying videos for ad targeting. All taxonomies.

    In data infrastructure, taxonomies solve three problems: discovery (navigating categories instead of guessing search terms), governance (enforcing policies by content type), and enrichment (attaching structured metadata to unstructured content so downstream systems can filter, sort, and search it).

    Traditional taxonomies are manual and single-modal. A human reviews an article and assigns "Sports > Basketball > NBA." A separate system tags an image "outdoor, basketball court." Another transcribes a video. Each modality gets its own pipeline, its own maintenance burden. That was fine when content was mostly text.

    Why Single-Modal Classification Breaks

    Scale. YouTube receives 720,000 hours of video every day. TikTok ingests 34 million videos daily. That's nearly 400 per second. A trained analyst can classify ~10,000 documents per year. To manually classify one day of TikTok, you'd need 3,400 analysts working full-time for a year.
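The arithmetic is easy to sanity-check, using only the figures cited above:

```python
# Back-of-the-envelope: why manual classification can't keep up.
VIDEOS_PER_DAY = 34_000_000        # TikTok daily ingest (figure cited above)
DOCS_PER_ANALYST_YEAR = 10_000     # trained-analyst throughput
SECONDS_PER_DAY = 86_400

videos_per_second = VIDEOS_PER_DAY / SECONDS_PER_DAY
analyst_years_per_day = VIDEOS_PER_DAY / DOCS_PER_ANALYST_YEAR

print(f"{videos_per_second:.0f} videos/second")
print(f"{analyst_years_per_day:,.0f} analyst-years to label one day of uploads")
```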

    Context blindness. A meme with "this is fire" means different things depending on whether the image shows a concert or a burning building. An ICCV 2025 study quantified this: text-only models achieved F1 of 0.75–0.81 on video moderation. Adding visual and audio signals pushed that to 0.84–0.91. The missing 10–15% is cross-modal context.

    Consistency drift. The IAB Content Taxonomy has grown from ~400 categories in v2 to 1,500+ in v3, and even with that specificity, human reviewers routinely disagree on assignments.

    What Makes a Taxonomy "Multimodal"

    A multimodal taxonomy classifies content by understanding it across all modalities simultaneously, then matching against reference categories using embedding similarity rather than keyword rules.

    The key difference: instead of writing rules ("if text contains 'basketball' AND image has an orange round object..."), a multimodal taxonomy works like a semantic JOIN. You define categories with a reference collection of representative examples. New content is matched against those references using vector similarity across all extracted features: visual, audio, and textual, all at once.
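In code, the matching step looks roughly like this. A toy sketch with made-up vectors, labels, and threshold, not the production retriever; the only real idea is "nearest reference example wins":

```python
import math

def cosine(a, b):
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

# Reference collection: each category holds example embeddings. They can come
# from any modality, as long as they live in the same embedding space.
references = {
    "Sports > Basketball": [[0.9, 0.1, 0.0], [0.8, 0.2, 0.1]],
    "Music > Concert":     [[0.1, 0.9, 0.2]],
}

def classify(embedding, threshold=0.7):
    """Semantic JOIN: best-matching reference category above the threshold."""
    best_label, best_score = None, 0.0
    for label, examples in references.items():
        score = max(cosine(embedding, e) for e in examples)
        if score > best_score:
            best_label, best_score = label, score
    return (best_label, best_score) if best_score >= threshold else (None, best_score)

label, score = classify([0.85, 0.15, 0.05])   # lands near the basketball examples
```

Note there are no keyword rules anywhere: adding a new category is adding reference examples, not writing logic.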

    [Diagram] Traditional (single-modal): video → manual review → Label A; image → image tagger → Label B; text → keyword rules → Label C. Three pipelines, three labels, no cross-modal context. Multimodal taxonomy: video, image, audio, and text → feature extraction → taxonomy similarity JOIN → one unified label (Category: Sports; Sub: NBA; Brand: Nike; Conf.: 94%). One pipeline, full context.

    Flat vs. Hierarchical

    Flat Taxonomies

    Single-level reference collection. Every document is matched against the same categories, best match wins.

    Use cases: Face enrollment, logo detection, product recognition, entity linking. Fast to set up. Start here if your categories don't have meaningful parent-child relationships.

    Hierarchical Taxonomies

    Categories organized into a tree where classification cascades from broad to specific. Each level narrows the search space using different features, executing like a Common Table Expression (CTE). Each level builds on the previous.

    A document classified as "Nike → Athletic → Running" inherits enrichment fields from all three levels. Different levels can use different feature extractors: logo embeddings for brand detection, scene classification for categories, activity recognition for subcategories.

    [Diagram: hierarchical taxonomy, CTE-style execution] L0 brand detection (logo embeddings) → L1 Nike (+brand_id) or Adidas → L2 Athletic (+category) or Lifestyle → L3 Running (+SKU) or Basketball. A document classified as Running inherits enrichment from every level above it: Nike → Athletic → Running → SKU.

    Use cases: Media content classification, product categorization, organizational hierarchies, content moderation.
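The cascade can be sketched as plain data structures. Labels, IDs, and enrichment fields here are illustrative, not Mixpeek's actual schema; the point is that each level narrows the candidates and the result inherits fields from every matched level:

```python
# Each level maps labels to enrichment fields plus a pointer to the next level.
hierarchy = {
    "brands": {
        "Nike":   {"enrich": {"brand_id": "b1"}, "children": "nike_categories"},
        "Adidas": {"enrich": {"brand_id": "b2"}, "children": None},
    },
    "nike_categories": {
        "Athletic":  {"enrich": {"category": "athletic"}, "children": "athletic_subs"},
        "Lifestyle": {"enrich": {"category": "lifestyle"}, "children": None},
    },
    "athletic_subs": {
        "Running":    {"enrich": {"sku": "RUN-001"}, "children": None},
        "Basketball": {"enrich": {"sku": "BBL-002"}, "children": None},
    },
}

def cascade(match_at_level, root="brands"):
    """match_at_level(level_id, candidates) -> chosen label. It stands in for
    the per-level similarity search (which may use a different retriever and
    different features at each level)."""
    path, enrichment, level = [], {}, root
    while level:
        label = match_at_level(level, list(hierarchy[level]))
        node = hierarchy[level][label]
        path.append(label)
        enrichment.update(node["enrich"])   # inherit fields from this level
        level = node["children"]
    return path, enrichment

# Pretend the retriever picked the first candidate at every level:
path, fields = cascade(lambda level, candidates: candidates[0])
```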

    How It Works

    1. Feature extraction. Multiple AI models extract features from each modality: CLIP embeddings from video frames, speech transcription from audio, object detection from images, sentence embeddings from text. Each becomes a queryable vector.

    2. Input mapping. Configures which extracted features query which taxonomy level. A face-based taxonomy uses face embeddings; a content classification taxonomy might use CLIP at the top level and audio features deeper down.

    3. Similarity matching. Each document's features are compared against the reference collection using a retriever, the same infrastructure used for semantic search. Documents exceeding the threshold get enriched.

    4. Enrichment. Structured metadata from the reference collection is attached to the document: brand name, content policy, compliance flags, campaign IDs. Configurable field paths, target names, and merge modes (replace or append).
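A minimal sketch of the enrichment step. The replace/append merge semantics and field names here are assumptions for illustration, not the API's exact behavior:

```python
def enrich(document, reference_metadata, field_specs):
    """Attach fields from a matched reference document onto the content
    document, honoring a per-field merge mode (assumed semantics:
    "replace" overwrites, "append" extends a list)."""
    for spec in field_specs:
        value = reference_metadata[spec["field_path"]]
        target = spec.get("target", spec["field_path"])
        if spec.get("merge_mode") == "append":
            document.setdefault(target, []).append(value)
        else:  # "replace"
            document[target] = value
    return document

doc = enrich(
    {"document_id": "doc_001", "flags": ["reviewed"]},
    {"brand_name": "Nike", "compliance_flag": "cleared"},
    [
        {"field_path": "brand_name", "merge_mode": "replace"},
        {"field_path": "compliance_flag", "target": "flags", "merge_mode": "append"},
    ],
)
```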

    Real-World Applications

    Advertising. The IAB Content Taxonomy defines 1,500+ categories for programmatic ad targeting. Text-only classifiers can't categorize a cooking video with no description or a sports highlight with only crowd noise. AWS published a reference architecture requiring five separate services. A retriever-powered taxonomy collapses that into one pipeline.

    Media asset management. Libraries of 100,000+ video assets need search across visual content, dialogue, and audio. A hierarchical taxonomy classifies a broadcast as "Live Sports → Football → NFL → Highlight → Touchdown" using different features at each level, enriching with rights info and licensing metadata. Manual tagging costs $15–25 per asset. See how video search changes this.

    E-commerce. Shopify's multimodal system (BERT + MobileNet-V2) increased leaf-node classification precision by 8% and nearly doubled coverage vs. text-only. A 2025 study found CLIP-based fusion achieved 98.59% hierarchical F1 with a two-stage pipeline: lightweight text model first, multimodal model only when confidence is low.
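That confidence-gated routing fits in a few lines. Both model functions below are stand-ins, not real components:

```python
def classify_two_stage(item, text_model, multimodal_model, threshold=0.9):
    """Cheap text model first; escalate to the multimodal model only when
    the text model's confidence falls below the threshold."""
    label, confidence = text_model(item)
    if confidence >= threshold:
        return label, confidence, "text_only"
    label, confidence = multimodal_model(item)
    return label, confidence, "multimodal"

# Stand-in models (assumptions for illustration):
cheap = lambda item: ("Shoes", 0.95) if "sneaker" in item["title"] else ("Unknown", 0.30)
expensive = lambda item: ("Apparel > Shoes > Running", 0.97)

easy = classify_two_stage({"title": "red sneaker"}, cheap, expensive)
hard = classify_two_stage({"title": "IMG_2041.mp4"}, cheap, expensive)
```

The economics follow directly: the expensive model only runs on the fraction of items the cheap model can't settle.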

    Content moderation. An ICCV 2025 study tested multimodal AI on 1,500 videos across 12 languages. Best model (Gemini-2.0-Flash) achieved F1=0.91 vs. human F1=0.98, at 1/35th the cost ($28 vs. $974). The practical solution: multimodal AI handles the first pass, low-confidence cases escalate to humans.

    Brand safety. Enforcing "Talent X cannot appear within 5 seconds of a competitor product in negative-sentiment content" requires cross-modal reasoning: face recognition, logo detection, audio sentiment, temporal proximity. A multi-stage retrieval pipeline connects these with taxonomy enrichment for contract terms and compliance status.
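The temporal-proximity part of that rule is straightforward once detections carry timestamps. A sketch with illustrative inputs; in practice the face, logo, and sentiment signals would come from the extractors described earlier:

```python
def violates_policy(face_times, logo_times, sentiment, window=5.0):
    """Flag content where a talent face and a competitor logo appear within
    `window` seconds of each other in a negative-sentiment segment."""
    if sentiment != "negative":
        return False
    return any(abs(f - l) <= window for f in face_times for l in logo_times)

# Talent X at t=12.0s, competitor logo at t=15.5s, negative audio sentiment:
flagged = violates_policy([12.0], [15.5], "negative")   # within the 5s window
ok = violates_policy([12.0], [40.0], "negative")        # 28s apart
```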

    Building a Multimodal Taxonomy

    Create reference collections

    # Flat taxonomy: employee face recognition
    curl -sS -X POST "$MP_API_URL/v1/taxonomies" \
      -H "Authorization: Bearer $MP_API_KEY" \
      -H "X-Namespace: $MP_NAMESPACE" \
      -H "Content-Type: application/json" \
      -d '{
        "taxonomy_name": "employee_faces",
        "taxonomy_type": "flat",
        "retriever_id": "ret_face_matcher",
        "input_mappings": {
          "query_embedding": "mixpeek://face_detector@v2/face_embedding"
        },
        "source_collection": {
          "collection_id": "col_employee_embeddings",
          "enrichment_fields": [
            { "field_path": "metadata.name", "merge_mode": "enrich" },
            { "field_path": "metadata.department", "merge_mode": "enrich" }
          ]
        }
      }'
    

    Go hierarchical when you need precision

    curl -sS -X POST "$MP_API_URL/v1/taxonomies" \
      -H "Authorization: Bearer $MP_API_KEY" \
      -H "X-Namespace: $MP_NAMESPACE" \
      -H "Content-Type: application/json" \
      -d '{
        "taxonomy_name": "content_classification",
        "taxonomy_type": "hierarchical",
        "retriever_id": "ret_scene_classifier",
        "input_mappings": {
          "query_embedding": "mixpeek://clip@v1/scene_embedding"
        },
        "hierarchy": [
          {
            "node_id": "brands",
            "collection_id": "col_brand_references",
            "enrichment_fields": ["metadata.brand_name", "metadata.brand_id"]
          },
          {
            "node_id": "categories",
            "collection_id": "col_content_categories",
            "parent_node_id": "brands",
            "enrichment_fields": ["metadata.category", "metadata.content_policy"]
          },
          {
            "node_id": "campaigns",
            "collection_id": "col_campaign_assets",
            "parent_node_id": "categories",
            "retriever_id": "ret_campaign_matcher",
            "enrichment_fields": ["metadata.campaign_id", "metadata.flight_dates"]
          }
        ]
      }'
    

    Choose an execution mode

    Mode        | When                         | Tradeoff
    materialize | After ingestion (~30s)       | Low latency; results persisted
    on_demand   | Query time (retriever stage) | Always-fresh reference data, higher latency
    retroactive | Manual trigger via API       | Batch reclassification after taxonomy updates

    Attach to a collection:

    {
      "taxonomy_applications": [
        { "taxonomy_id": "tax_content_classification", "execution_mode": "materialize" }
      ]
    }
    

    Test before you materialize

    curl -sS -X POST "$MP_API_URL/v1/taxonomies/<taxonomy_id>/enrich" \
      -H "Authorization: Bearer $MP_API_KEY" \
      -H "X-Namespace: $MP_NAMESPACE" \
      -H "Content-Type: application/json" \
      -d '{
        "source_documents": [
          { "document_id": "doc_test_001", "mixpeek://clip@v1/scene_embedding": [0.12, 0.34] }
        ],
        "mode": "on_demand"
      }'
    

    If categories are wrong, add more reference examples. The taxonomy improves because matching is based on collection contents. No model retraining required.

    Governance

    There is no finished taxonomy. Updating a multimodal taxonomy means updating its reference collections, not rewriting rules or retraining models. Add examples, remove outdated categories, and the taxonomy adapts.

    Version your taxonomies before structural changes. Use retroactive application to reclassify existing documents after updates. Combine with clustering to discover new category candidates from unmatched documents.


    Start flat. Add hierarchy when you need precision. Version everything. Update reference collections instead of rewriting rules.

    Ready to build? Get started with Mixpeek or explore the taxonomy API reference.