
    What is Semantic Join

    Semantic Join - A cross-collection enrichment operation that attaches context from one collection to results from another, using semantic similarity as the join key.

    A semantic join is the multimodal equivalent of a SQL JOIN. In a structured database, a JOIN combines rows from different tables by matching foreign keys. In a multimodal data warehouse, an enrich stage combines results from different collections by matching on embedding similarity or document relationships, so collections can be cross-referenced without predefined foreign keys.

    How It Works

    After a retrieval pipeline produces results from one collection (e.g., a media library search), an enrich stage queries a second collection (e.g., brand safety scores) and attaches contextual data to each result. The join key can be a document ID, semantic similarity, or matching metadata.
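    The similarity-based variant of this step can be sketched in plain Python. This is an illustrative sketch only, not the Mixpeek API: the function names (cosine, semantic_join), the record fields (embedding, payload), and the min_score threshold are all assumptions made for the example. Each result keeps its own fields and gains the payload of the closest record in the second collection, or None when nothing is similar enough (the same behavior as a SQL LEFT JOIN).

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def semantic_join(results, enrichment, min_score=0.8):
    """Attach the most similar enrichment record to each result.

    Hypothetical helper: results without a match above min_score are
    kept with enrichment=None, like the unmatched rows of a LEFT JOIN.
    """
    joined = []
    for row in results:
        best, best_score = None, min_score
        for candidate in enrichment:
            score = cosine(row["embedding"], candidate["embedding"])
            if score >= best_score:
                best, best_score = candidate, score
        enriched = dict(row)  # copy so the input results are not mutated
        enriched["enrichment"] = best["payload"] if best else None
        joined.append(enriched)
    return joined


# Toy data: two search hits and one record in a separate collection.
media_hits = [
    {"id": "clip-1", "embedding": [1.0, 0.0]},
    {"id": "clip-2", "embedding": [0.0, 1.0]},
]
brand_safety = [
    {"embedding": [0.9, 0.1], "payload": {"brand_safety": 0.95}},
]

joined = semantic_join(media_hits, brand_safety)
for row in joined:
    print(row["id"], row["enrichment"])
```

    In this toy data, clip-1 is close to the brand safety record and receives its payload, while clip-2 is not and is returned without enrichment. A production system would use an approximate nearest-neighbor index rather than the nested loop shown here.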

    Examples

    • Search media library for celebrity appearances → enrich with brand safety scores from a separate collection
    • Find similar products → enrich with pricing and availability from a catalog collection
    • Detect copyrighted audio → enrich with licensing terms from a rights database
    • Find relevant document passages → enrich with author and classification metadata

    Best Practices

    • Use enrich stages after reduce stages to minimize the number of cross-collection lookups
    • Keep enrichment collections focused: one collection per enrichment type (brand scores, rights, metadata)
    • Use semantic joins for fuzzy matching and document_enrich for exact ID-based joins
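    The first and third practices can be illustrated together with a small sketch. It is not the document_enrich stage itself; reduce_top_k, enrich_by_id, and the dictionary standing in for an enrichment collection are hypothetical names chosen for the example. Reducing to the top results first means the exact ID-based join performs one lookup per surviving result instead of one per candidate.

```python
def reduce_top_k(results, k):
    """Keep the k highest-scoring results (a stand-in for a reduce stage)."""
    return sorted(results, key=lambda r: r["score"], reverse=True)[:k]


def enrich_by_id(results, collection, field):
    """Exact ID-based join: one lookup per result.

    `collection` is a plain dict keyed by document ID, assumed here in
    place of a real enrichment collection. Returns the enriched rows and
    the number of lookups performed.
    """
    lookups = 0
    enriched = []
    for row in results:
        lookups += 1
        record = collection.get(row["id"])  # None when no record exists
        enriched.append({**row, field: record})
    return enriched, lookups


# Ten candidate results with ascending scores, and a small rights collection.
results = [{"id": f"doc-{i}", "score": i / 10} for i in range(10)]
rights = {
    "doc-9": {"license": "CC-BY"},
    "doc-8": {"license": "proprietary"},
}

top = reduce_top_k(results, k=2)
enriched, lookups = enrich_by_id(top, rights, "rights")
print(lookups, [row["id"] for row in enriched])
```

    Running the enrich step on all ten candidates would cost ten lookups; reducing first brings that down to two while returning the same top results.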

    Related Pages

    • Document Enrich stage: /docs/retrieval/stages/document-enrich
    • Retrieval Cookbook: /docs/retrieval/cookbook
    • Blog: Multi-Stage Retrieval Pipelines - /blog/multi-stage-retrieval-pipelines