# Mixpeek + Databricks

> Multimodal data warehouse meets data lakehouse -- complementary layers for the modern data stack

## Overview

Mixpeek and Databricks occupy different layers of the data stack. Mixpeek ingests unstructured multimodal files and extracts structured features, embeddings, and classifications. Databricks provides the lakehouse platform -- Delta Lake for storage, Unity Catalog for governance, and integrated ML for training and serving models. Together, they give you a complete path from raw media files to governed, analytics-ready data.

<CardGroup cols={2}>
  <Card title="Mixpeek" icon="wand-magic-sparkles">
    Ingests unstructured files, extracts features (embeddings, transcripts, classifications, metadata), and powers multimodal retrieval.
  </Card>

  <Card title="Databricks" icon="database">
    Stores structured outputs in Delta tables, enforces governance via Unity Catalog, and runs ML training and serving at scale.
  </Card>
</CardGroup>

## Architecture

```
                        Mixpeek                              Databricks
               +-----------------------+            +------------------------+
               |                       |            |                        |
  Files -----> |  Buckets & Collections|            |   Delta Lake Tables    |
  (images,     |                       |            |                        |
   video,      |  Decompose files into |  export    |  - classifications     |
   audio,      |  features:            | ---------> |  - extracted metadata  |
   PDFs)       |   - embeddings        |            |  - taxonomy labels     |
               |   - transcripts       |            |  - document payloads   |
               |   - classifications   |            |                        |
               |   - metadata          |  enrich    |  Unity Catalog governs |
               |                       | <--------- |  all tables. Databricks|
               |  Retrieval & Search   |            |  ML retrains models.   |
               +-----------------------+            +------------------------+
```

## Use Cases

### Write extracted features as Delta tables

After Mixpeek processes your files, export the structured outputs -- transcripts, object detections, taxonomy labels, metadata -- as Delta tables. This makes them queryable with Spark SQL, joinable with your existing business data, and available to any tool in the Databricks ecosystem.

### Use Unity Catalog for governance

Unity Catalog provides fine-grained access control, lineage tracking, and audit logging for all data assets. Once Mixpeek outputs land in Delta tables, Unity Catalog governs who can access them and how they flow through your organization.
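
As a rough sketch of what that access control can look like, the helper below builds the Unity Catalog `GRANT` statements that would give a group read-only access to one table. The table name is the one used in the Quick Start; the `data-analysts` group and the `read_only_grants` helper are hypothetical, and the statements assume Unity Catalog's standard privileges (`USE CATALOG`, `USE SCHEMA`, `SELECT`).

```python
def read_only_grants(table: str, principal: str) -> list[str]:
    """Build Unity Catalog GRANT statements for read-only access to one table.

    `table` is a three-level name: catalog.schema.table.
    """
    catalog, schema, _ = table.split(".")
    quoted = f"`{principal}`"  # backticks allow hyphens and spaces in principal names
    return [
        f"GRANT USE CATALOG ON CATALOG {catalog} TO {quoted}",
        f"GRANT USE SCHEMA ON SCHEMA {catalog}.{schema} TO {quoted}",
        f"GRANT SELECT ON TABLE {table} TO {quoted}",
    ]


# Hypothetical group name; replace with a principal that exists in your account.
statements = read_only_grants("mixpeek_catalog.default.documents", "data-analysts")
for statement in statements:
    print(statement)
```

Each statement could then be run through a Databricks SQL cursor or pasted into the SQL editor by a user who is allowed to grant privileges on those objects.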

### Combine Mixpeek retrieval with Databricks ML

Use Mixpeek to power real-time multimodal search and retrieval. Feed the same structured features into Databricks ML for batch training -- fine-tune classifiers, build recommendation models, or run large-scale analytics on extracted content.

## Quick Start

Export Mixpeek document metadata to a Delta table using the Mixpeek Python SDK and the Databricks SQL Connector.

<Steps>
  <Step title="Install dependencies">
    ```bash theme={null}
    pip install mixpeek databricks-sql-connector
    ```
  </Step>

  <Step title="List documents from Mixpeek">
    ```python theme={null}
    from mixpeek import Mixpeek

    client = Mixpeek(api_key="your-api-key")

    # List documents from a collection
    documents = client.collections.documents.list(
        collection_id="your-collection-id",
        page_size=100
    )
    ```
  </Step>

  <Step title="Write to a Delta table via Databricks SQL">
    ```python theme={null}
    from databricks import sql
    import json

    connection = sql.connect(
        server_hostname="YOUR_WORKSPACE.cloud.databricks.com",
        http_path="/sql/1.0/warehouses/YOUR_WAREHOUSE_ID",
        access_token="YOUR_ACCESS_TOKEN"
    )

    cursor = connection.cursor()

    # Create table if it does not exist
    cursor.execute("""
        CREATE TABLE IF NOT EXISTS mixpeek_catalog.default.documents (
            document_id STRING,
            source_url STRING,
            content_type STRING,
            metadata STRING,
            created_at TIMESTAMP
        )
    """)

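    # Insert each document using named parameter markers
    # (native parameters require databricks-sql-connector 3.0 or later)
    for doc in documents:
        cursor.execute(
            """
            INSERT INTO mixpeek_catalog.default.documents
                (document_id, source_url, content_type, metadata, created_at)
            VALUES (:document_id, :source_url, :content_type, :metadata, :created_at)
            """,
            {
                "document_id": doc.get("document_id"),
                # `or {}` also covers documents whose source is null
                "source_url": (doc.get("source") or {}).get("url"),
                "content_type": doc.get("content_type"),
                "metadata": json.dumps(doc.get("metadata") or {}),
                "created_at": doc.get("created_at"),
            },
        )

    connection.commit()
    cursor.close()
    connection.close()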
    ```
  </Step>
</Steps>
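
The loop above maps and inserts each document in one step. A minimal sketch of separating the two follows, assuming each document behaves like a dictionary with the fields used above. `to_params` and `chunked` are illustrative helpers, not part of either SDK, and the sample documents are made up. The named `:name` markers assume databricks-sql-connector 3.0 or later.

```python
import json
from itertools import islice
from typing import Any, Iterable, Iterator

# Named parameter markers (:name) assume databricks-sql-connector 3.0 or later.
INSERT_SQL = """
    INSERT INTO mixpeek_catalog.default.documents
        (document_id, source_url, content_type, metadata, created_at)
    VALUES (:document_id, :source_url, :content_type, :metadata, :created_at)
"""


def to_params(doc: dict[str, Any]) -> dict[str, Any]:
    """Flatten one document into the parameters expected by INSERT_SQL."""
    return {
        "document_id": doc.get("document_id"),
        # `or {}` covers both a missing and a null source
        "source_url": (doc.get("source") or {}).get("url"),
        "content_type": doc.get("content_type"),
        "metadata": json.dumps(doc.get("metadata") or {}),
        "created_at": doc.get("created_at"),
    }


def chunked(items: Iterable[Any], size: int) -> Iterator[list[Any]]:
    """Yield successive lists of at most `size` items."""
    iterator = iter(items)
    while batch := list(islice(iterator, size)):
        yield batch


# Stand-in documents; in practice these come from the Mixpeek client.
sample_docs = [
    {"document_id": "doc-1", "source": {"url": "s3://bucket/a.mp4"},
     "content_type": "video/mp4", "metadata": {"duration": 12}},
    {"document_id": "doc-2", "source": None, "content_type": "image/png"},
    {"document_id": "doc-3"},
]

batches = list(chunked((to_params(d) for d in sample_docs), size=2))
# With an open cursor, each batch could be sent as:
#     cursor.executemany(INSERT_SQL, batch)
```

Keeping the mapping in one function makes it easy to unit test without a warehouse connection, and working in batches gives a natural unit for logging and retries.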

<Tip>
  For production workloads, avoid row-by-row inserts. Write Mixpeek outputs as files to cloud object storage (such as S3, ADLS, or GCS) and use Databricks Auto Loader to incrementally ingest new files into Delta tables.
</Tip>
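
As a sketch of the first half of that pattern, the snippet below serializes documents to a newline-delimited JSON file, a layout Auto Loader can read when `cloudFiles.format` is set to `json`. It writes to a local temporary directory as a stand-in for cloud storage; the `write_jsonl` helper, the file naming scheme, and the sample documents are assumptions for illustration.

```python
import json
import tempfile
import uuid
from pathlib import Path
from typing import Any, Iterable


def write_jsonl(documents: Iterable[dict[str, Any]], out_dir: Path) -> Path:
    """Write documents to a uniquely named JSON Lines file and return its path."""
    out_dir.mkdir(parents=True, exist_ok=True)
    # A unique name per export means every run lands as a new file.
    path = out_dir / f"documents-{uuid.uuid4().hex}.jsonl"
    with path.open("w", encoding="utf-8") as handle:
        for doc in documents:
            # default=str keeps non-JSON values such as datetimes serializable
            handle.write(json.dumps(doc, default=str) + "\n")
    return path


# Stand-in documents; in practice these come from the Mixpeek client.
sample_docs = [
    {"document_id": "doc-1", "content_type": "video/mp4"},
    {"document_id": "doc-2", "content_type": "image/png"},
]

export_dir = Path(tempfile.mkdtemp()) / "mixpeek-exports"
exported = write_jsonl(sample_docs, export_dir)
print(exported.name)
```

In a real pipeline the same file would be uploaded to the bucket or container that the Auto Loader stream watches, so each export is picked up once as a new file.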

## When to Use Each

| Capability                                                  | Mixpeek                   | Databricks                        |
| ----------------------------------------------------------- | ------------------------- | --------------------------------- |
| Ingest unstructured files (video, images, audio, PDFs)      | Yes                       | No                                |
| Extract features (embeddings, transcripts, classifications) | Yes                       | No                                |
| Multimodal semantic search                                  | Yes                       | No                                |
| Structured SQL analytics                                    | No                        | Yes (Spark SQL)                   |
| Data governance and lineage                                 | Document-level ACL        | Unity Catalog                     |
| ML model training and serving                               | No                        | Yes (MLflow, Model Serving)       |
| Streaming ingestion                                         | Webhooks + batch triggers | Structured Streaming, Auto Loader |

<Info>
  Mixpeek turns raw multimodal files into structured features and serves retrieval over them. Databricks takes over once the data is structured: storage, governance, analytics, and ML. Use both to bridge the gap between raw multimodal files and governed, ML-ready data.
</Info>

## Related

* [Taxonomies](/enrichment/taxonomies) -- classify content and export labels
* [SQL Lookup Stage](/retrieval/stages/sql-lookup) -- query external databases from retriever pipelines
* [API Call Stage](/retrieval/stages/api-call) -- call external APIs during retrieval
* [Webhooks](/operations/webhooks) -- trigger Databricks jobs when Mixpeek processing completes
