
    Multimodal Taxonomies: How to Classify Video, Images, Audio, and Text with One Category System

    Traditional taxonomies classify one content type at a time. Multimodal taxonomies unify classification across every format using embedding similarity: the missing layer between raw AI features and structured, searchable metadata.


    TL;DR: Traditional taxonomies classify one content type at a time. Text gets labels, photos get tags, video gets a separate system. Multimodal taxonomies unify classification across every format by matching content against reference collections using embedding similarity. They bridge raw AI features and structured, searchable metadata.


    What Is a Taxonomy?

    A taxonomy is a classification system that organizes content into categories. Gmail sorting emails into Primary/Social/Promotions, Shopify categorizing products into Google's 5,500+ product taxonomy, YouTube classifying videos for ad targeting. All taxonomies.

    In data infrastructure, taxonomies solve three problems: discovery (navigating categories instead of guessing search terms), governance (enforcing policies by content type), and enrichment (attaching structured metadata to unstructured content so downstream systems can filter, sort, and search it).

    Traditional taxonomies are manual and single-modal. A human reviews an article and assigns "Sports > Basketball > NBA." A separate system tags an image "outdoor, basketball court." Another transcribes a video. Each modality gets its own pipeline, its own maintenance burden. That was fine when content was mostly text.

    Why Single-Modal Classification Breaks

    Scale. YouTube receives 720,000 hours of video every day. TikTok ingests 34 million videos daily. That's nearly 400 per second. A trained analyst can classify ~10,000 documents per year. To manually classify one day of TikTok, you'd need 3,400 analysts working full-time for a year.
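The arithmetic is easy to sanity-check, using only the figures cited above:

```python
# Back-of-the-envelope: why manual classification can't keep up.
VIDEOS_PER_DAY = 34_000_000        # TikTok daily ingest (figure cited above)
DOCS_PER_ANALYST_YEAR = 10_000     # trained-analyst throughput
SECONDS_PER_DAY = 86_400

videos_per_second = VIDEOS_PER_DAY / SECONDS_PER_DAY
analyst_years_per_day = VIDEOS_PER_DAY / DOCS_PER_ANALYST_YEAR

print(f"{videos_per_second:.0f} videos/second")
print(f"{analyst_years_per_day:,.0f} analyst-years to label one day of uploads")
```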

    Context blindness. A meme with "this is fire" means different things depending on whether the image shows a concert or a burning building. An ICCV 2025 study quantified this: text-only models achieved F1 of 0.75–0.81 on video moderation. Adding visual and audio signals pushed that to 0.84–0.91. The missing 10–15% is cross-modal context.

    Consistency drift. The IAB Content Taxonomy has grown from ~400 categories in v2 to 1,500+ in v3, and even with that specificity, human reviewers routinely disagree on assignments.

    What Makes a Taxonomy "Multimodal"

    A multimodal taxonomy classifies content by understanding it across all modalities simultaneously, then matching against reference categories using embedding similarity rather than keyword rules.

    The key difference: instead of writing rules ("if text contains 'basketball' AND image has an orange round object..."), a multimodal taxonomy works like a semantic JOIN. You define categories with a reference collection of representative examples. New content is matched against those references using vector similarity across all extracted features: visual, audio, and textual, all at once.
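In code, the matching step looks roughly like this. A toy sketch with made-up vectors, labels, and threshold, not the production retriever; the only real idea is "nearest reference example wins":

```python
import math

def cosine(a, b):
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

# Reference collection: each category holds example embeddings. They can come
# from any modality, as long as they live in the same embedding space.
references = {
    "Sports > Basketball": [[0.9, 0.1, 0.0], [0.8, 0.2, 0.1]],
    "Music > Concert":     [[0.1, 0.9, 0.2]],
}

def classify(embedding, threshold=0.7):
    """Semantic JOIN: best-matching reference category above the threshold."""
    best_label, best_score = None, 0.0
    for label, examples in references.items():
        score = max(cosine(embedding, e) for e in examples)
        if score > best_score:
            best_label, best_score = label, score
    return (best_label, best_score) if best_score >= threshold else (None, best_score)

label, score = classify([0.85, 0.15, 0.05])   # lands near the basketball examples
```

Note there are no keyword rules anywhere: adding a new category is adding reference examples, not writing logic.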

    [Diagram] Traditional (single-modal): video → manual review → Label A; image → image tagger → Label B; text → keyword rules → Label C. Three pipelines, three labels, no cross-modal context. Multimodal taxonomy: video, image, audio, and text → feature extraction → taxonomy similarity JOIN → one unified label (Category: Sports; Sub: NBA; Brand: Nike; Conf.: 94%). One pipeline, full context.

    Flat vs. Hierarchical

    Flat Taxonomies

    Single-level reference collection. Every document is matched against the same categories, best match wins.

    Use cases: Face enrollment, logo detection, product recognition, entity linking. Fast to set up. Start here if your categories don't have meaningful parent-child relationships.

    Hierarchical Taxonomies

    Categories organized into a tree where classification cascades from broad to specific. Each level narrows the search space using different features, executing like a Common Table Expression (CTE). Each level builds on the previous.

    A document classified as "Nike → Athletic → Running" inherits enrichment fields from all three levels. Different levels can use different feature extractors: logo embeddings for brand detection, scene classification for categories, activity recognition for subcategories.

    [Diagram: hierarchical taxonomy, CTE-style execution] L0 brand detection (logo embeddings) → L1 Nike (+brand_id) or Adidas → L2 Athletic (+category) or Lifestyle → L3 Running (+SKU) or Basketball. A document classified as Running inherits enrichment from every level above it: Nike → Athletic → Running → SKU.

    Use cases: Media content classification, product categorization, organizational hierarchies, content moderation.
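The cascade can be sketched as plain data structures. Labels, IDs, and enrichment fields here are illustrative, not Mixpeek's actual schema; the point is that each level narrows the candidates and the result inherits fields from every matched level:

```python
# Each level maps labels to enrichment fields plus a pointer to the next level.
hierarchy = {
    "brands": {
        "Nike":   {"enrich": {"brand_id": "b1"}, "children": "nike_categories"},
        "Adidas": {"enrich": {"brand_id": "b2"}, "children": None},
    },
    "nike_categories": {
        "Athletic":  {"enrich": {"category": "athletic"}, "children": "athletic_subs"},
        "Lifestyle": {"enrich": {"category": "lifestyle"}, "children": None},
    },
    "athletic_subs": {
        "Running":    {"enrich": {"sku": "RUN-001"}, "children": None},
        "Basketball": {"enrich": {"sku": "BBL-002"}, "children": None},
    },
}

def cascade(match_at_level, root="brands"):
    """match_at_level(level_id, candidates) -> chosen label. It stands in for
    the per-level similarity search (which may use a different retriever and
    different features at each level)."""
    path, enrichment, level = [], {}, root
    while level:
        label = match_at_level(level, list(hierarchy[level]))
        node = hierarchy[level][label]
        path.append(label)
        enrichment.update(node["enrich"])   # inherit fields from this level
        level = node["children"]
    return path, enrichment

# Pretend the retriever picked the first candidate at every level:
path, fields = cascade(lambda level, candidates: candidates[0])
```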

    How It Works

    1. Feature extraction. Multiple AI models extract features from each modality: CLIP embeddings from video frames, speech transcription from audio, object detection from images, sentence embeddings from text. Each becomes a queryable vector.

    2. Input mapping. Configures which extracted features query which taxonomy level. A face-based taxonomy uses face embeddings; a content classification taxonomy might use CLIP at the top level and audio features deeper down.

    3. Similarity matching. Each document's features are compared against the reference collection using a retriever, the same infrastructure used for semantic search. Documents exceeding the threshold get enriched.

    4. Enrichment. Structured metadata from the reference collection is attached to the document: brand name, content policy, compliance flags, campaign IDs. Configurable field paths, target names, and merge modes (replace or append).
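A minimal sketch of the enrichment step. The replace/append merge semantics and field names here are assumptions for illustration, not the API's exact behavior:

```python
def enrich(document, reference_metadata, field_specs):
    """Attach fields from a matched reference document onto the content
    document, honoring a per-field merge mode (assumed semantics:
    "replace" overwrites, "append" extends a list)."""
    for spec in field_specs:
        value = reference_metadata[spec["field_path"]]
        target = spec.get("target", spec["field_path"])
        if spec.get("merge_mode") == "append":
            document.setdefault(target, []).append(value)
        else:  # "replace"
            document[target] = value
    return document

doc = enrich(
    {"document_id": "doc_001", "flags": ["reviewed"]},
    {"brand_name": "Nike", "compliance_flag": "cleared"},
    [
        {"field_path": "brand_name", "merge_mode": "replace"},
        {"field_path": "compliance_flag", "target": "flags", "merge_mode": "append"},
    ],
)
```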

    Real-World Applications

    Advertising. The IAB Content Taxonomy defines 1,500+ categories for programmatic ad targeting. Text-only classifiers can't categorize a cooking video with no description or a sports highlight with only crowd noise. AWS published a reference architecture requiring five separate services. A retriever-powered taxonomy collapses that into one pipeline.

    Media asset management. Libraries of 100,000+ video assets need search across visual content, dialogue, and audio. A hierarchical taxonomy classifies a broadcast as "Live Sports → Football → NFL → Highlight → Touchdown" using different features at each level, enriching with rights info and licensing metadata. Manual tagging costs $15–25 per asset. See how video search changes this.

    E-commerce. Shopify's multimodal system (BERT + MobileNet-V2) increased leaf-node classification precision by 8% and nearly doubled coverage vs. text-only. A 2025 study found CLIP-based fusion achieved 98.59% hierarchical F1 with a two-stage pipeline: lightweight text model first, multimodal model only when confidence is low.
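That confidence-gated routing fits in a few lines. Both model functions below are stand-ins, not real components:

```python
def classify_two_stage(item, text_model, multimodal_model, threshold=0.9):
    """Cheap text model first; escalate to the multimodal model only when
    the text model's confidence falls below the threshold."""
    label, confidence = text_model(item)
    if confidence >= threshold:
        return label, confidence, "text_only"
    label, confidence = multimodal_model(item)
    return label, confidence, "multimodal"

# Stand-in models (assumptions for illustration):
cheap = lambda item: ("Shoes", 0.95) if "sneaker" in item["title"] else ("Unknown", 0.30)
expensive = lambda item: ("Apparel > Shoes > Running", 0.97)

easy = classify_two_stage({"title": "red sneaker"}, cheap, expensive)
hard = classify_two_stage({"title": "IMG_2041.mp4"}, cheap, expensive)
```

The economics follow directly: the expensive model only runs on the fraction of items the cheap model can't settle.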

    Content moderation. An ICCV 2025 study tested multimodal AI on 1,500 videos across 12 languages. Best model (Gemini-2.0-Flash) achieved F1=0.91 vs. human F1=0.98, at 1/35th the cost ($28 vs. $974). The practical solution: multimodal AI handles the first pass, low-confidence cases escalate to humans.

    Brand safety. Enforcing "Talent X cannot appear within 5 seconds of a competitor product in negative-sentiment content" requires cross-modal reasoning: face recognition, logo detection, audio sentiment, temporal proximity. A multi-stage retrieval pipeline connects these with taxonomy enrichment for contract terms and compliance status.
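The temporal-proximity part of that rule is straightforward once detections carry timestamps. A sketch with illustrative inputs; in practice the face, logo, and sentiment signals would come from the extractors described earlier:

```python
def violates_policy(face_times, logo_times, sentiment, window=5.0):
    """Flag content where a talent face and a competitor logo appear within
    `window` seconds of each other in a negative-sentiment segment."""
    if sentiment != "negative":
        return False
    return any(abs(f - l) <= window for f in face_times for l in logo_times)

# Talent X at t=12.0s, competitor logo at t=15.5s, negative audio sentiment:
flagged = violates_policy([12.0], [15.5], "negative")   # within the 5s window
ok = violates_policy([12.0], [40.0], "negative")        # 28s apart
```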

    Building a Multimodal Taxonomy

    Create reference collections

    # Flat taxonomy: employee face recognition
    curl -sS -X POST "$MP_API_URL/v1/taxonomies" \
      -H "Authorization: Bearer $MP_API_KEY" \
      -H "X-Namespace: $MP_NAMESPACE" \
      -H "Content-Type: application/json" \
      -d '{
        "taxonomy_name": "employee_faces",
        "taxonomy_type": "flat",
        "retriever_id": "ret_face_matcher",
        "input_mappings": {
          "query_embedding": "mixpeek://face_detector@v2/face_embedding"
        },
        "source_collection": {
          "collection_id": "col_employee_embeddings",
          "enrichment_fields": [
            { "field_path": "metadata.name", "merge_mode": "enrich" },
            { "field_path": "metadata.department", "merge_mode": "enrich" }
          ]
        }
      }'
    

    Go hierarchical when you need precision

    curl -sS -X POST "$MP_API_URL/v1/taxonomies" \
      -H "Authorization: Bearer $MP_API_KEY" \
      -H "X-Namespace: $MP_NAMESPACE" \
      -H "Content-Type: application/json" \
      -d '{
        "taxonomy_name": "content_classification",
        "taxonomy_type": "hierarchical",
        "retriever_id": "ret_scene_classifier",
        "input_mappings": {
          "query_embedding": "mixpeek://clip@v1/scene_embedding"
        },
        "hierarchy": [
          {
            "node_id": "brands",
            "collection_id": "col_brand_references",
            "enrichment_fields": ["metadata.brand_name", "metadata.brand_id"]
          },
          {
            "node_id": "categories",
            "collection_id": "col_content_categories",
            "parent_node_id": "brands",
            "enrichment_fields": ["metadata.category", "metadata.content_policy"]
          },
          {
            "node_id": "campaigns",
            "collection_id": "col_campaign_assets",
            "parent_node_id": "categories",
            "retriever_id": "ret_campaign_matcher",
            "enrichment_fields": ["metadata.campaign_id", "metadata.flight_dates"]
          }
        ]
      }'
    

    Choose an execution mode

    Mode        | When                         | Tradeoff
    materialize | After ingestion (~30s)       | Low latency; results persisted
    on_demand   | Query time (retriever stage) | Always-fresh reference data, higher latency
    retroactive | Manual trigger via API       | Batch reclassification after taxonomy updates

    Attach to a collection:

    {
      "taxonomy_applications": [
        { "taxonomy_id": "tax_content_classification", "execution_mode": "materialize" }
      ]
    }
    

    Test before you materialize

    curl -sS -X POST "$MP_API_URL/v1/taxonomies/<taxonomy_id>/enrich" \
      -H "Authorization: Bearer $MP_API_KEY" \
      -H "X-Namespace: $MP_NAMESPACE" \
      -H "Content-Type: application/json" \
      -d '{
        "source_documents": [
          { "document_id": "doc_test_001", "mixpeek://clip@v1/scene_embedding": [0.12, 0.34] }
        ],
        "mode": "on_demand"
      }'
    

    If categories are wrong, add more reference examples. The taxonomy improves because matching is based on collection contents. No model retraining required.

    Governance

    There is no finished taxonomy. Updating a multimodal taxonomy means updating its reference collections, not rewriting rules or retraining models. Add examples, remove outdated categories, and the taxonomy adapts.

    Version your taxonomies before structural changes. Use retroactive application to reclassify existing documents after updates. Combine with clustering to discover new category candidates from unmatched documents.


    Start flat. Add hierarchy when you need precision. Version everything. Update reference collections instead of rewriting rules.

    Ready to build? Get started with Mixpeek or explore the taxonomy API reference.