> ## Documentation Index
> Fetch the complete documentation index at: https://docs.mixpeek.com/docs/llms.txt
> Use this file to discover all available pages before exploring further.

# Mixpeek + Databricks

> Multimodal data warehouse meets data lakehouse -- complementary layers for the modern data stack

## Overview

Mixpeek and Databricks occupy different layers of the data stack. Mixpeek ingests unstructured multimodal files and extracts structured features, embeddings, and classifications. Databricks provides the lakehouse platform -- Delta Lake for storage, Unity Catalog for governance, and integrated ML for training and serving models. Together, they give you a complete path from raw media files to governed, analytics-ready data.

* Mixpeek ingests unstructured files, extracts features (embeddings, transcripts, classifications, metadata), and powers multimodal retrieval.
* Databricks stores structured outputs in Delta tables, enforces governance via Unity Catalog, and runs ML training and serving at scale.

## Architecture

```
                       Mixpeek                               Databricks
              +-----------------------+              +------------------------+
              |                       |              |                        |
Files ----->  |  Buckets & Collections|              |  Delta Lake Tables     |
(images,      |                       |              |                        |
 video,       |  Decompose files into |   export     |  - classifications     |
 audio,       |  features:            |  --------->  |  - extracted metadata  |
 PDFs)        |   - embeddings        |              |  - taxonomy labels     |
              |   - transcripts       |              |  - document payloads   |
              |   - classifications   |              |                        |
              |   - metadata          |   enrich     |  Unity Catalog governs |
              |                       |  <---------  |  all tables. Databricks|
              |  Retrieval & Search   |              |  ML retrains models.   |
              +-----------------------+              +------------------------+
```

## Use Cases

### Write extracted features as Delta tables

After Mixpeek processes your files, export the structured outputs -- transcripts, object detections, taxonomy labels, metadata -- as Delta tables. This makes them queryable with Spark SQL, joinable with your existing business data, and available to any tool in the Databricks ecosystem.

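As a rough sketch of the export step described above, the helper below flattens document-like dictionaries into rows whose column order could match a Delta table. It assumes each document is a plain dictionary using the field names that appear in the Quick Start (`document_id`, `source.url`, `content_type`, `metadata`, `created_at`); the `COLUMNS` tuple, the `to_row` helper, and the sample record are hypothetical and only illustrate the shape of the transformation, not a confirmed Mixpeek schema.

```python
import json
from typing import Any, Dict, Tuple

# Hypothetical column order for the target Delta table (assumption, not a fixed schema).
COLUMNS = ("document_id", "source_url", "content_type", "metadata", "created_at")


def to_row(doc: Dict[str, Any]) -> Tuple[Any, ...]:
    """Flatten one document-like dict into a tuple ordered like COLUMNS.

    Missing fields become None; nested metadata is serialized to a JSON string
    so it fits in a single STRING column.
    """
    return (
        doc.get("document_id"),
        (doc.get("source") or {}).get("url"),
        doc.get("content_type"),
        json.dumps(doc.get("metadata") or {}, sort_keys=True),
        doc.get("created_at"),
    )


# Illustrative sample record; the values are made up for this sketch.
sample = {
    "document_id": "doc_1",
    "source": {"url": "s3://bucket/video.mp4"},
    "content_type": "video/mp4",
    "metadata": {"labels": ["sports"], "duration_s": 12},
    "created_at": "2024-01-01T00:00:00Z",
}

row = to_row(sample)
print(dict(zip(COLUMNS, row)))
```

Serializing `metadata` to a JSON string keeps the table schema stable even when extracted fields vary between documents; the string can still be parsed on the Databricks side when needed.
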
### Use Unity Catalog for governance

Unity Catalog provides fine-grained access control, lineage tracking, and audit logging for all data assets. Once Mixpeek outputs land in Delta tables, Unity Catalog governs who can access them and how they flow through your organization.

### Combine Mixpeek retrieval with Databricks ML

Use Mixpeek to power real-time multimodal search and retrieval. Feed the same structured features into Databricks ML for batch training -- fine-tune classifiers, build recommendation models, or run large-scale analytics on extracted content.

## Quick Start

Export Mixpeek document metadata to a Delta table using the Mixpeek Python SDK and the Databricks SQL Connector.

```bash
pip install mixpeek databricks-sql-connector
```

```python
from mixpeek import Mixpeek

client = Mixpeek(api_key="your-api-key")

# List documents from a collection
documents = client.collections.documents.list(
    collection_id="your-collection-id",
    page_size=100
)
```

```python
from databricks import sql
import json

connection = sql.connect(
    server_hostname="YOUR_WORKSPACE.cloud.databricks.com",
    http_path="/sql/1.0/warehouses/YOUR_WAREHOUSE_ID",
    access_token="YOUR_ACCESS_TOKEN"
)

cursor = connection.cursor()

# Create table if it does not exist
cursor.execute("""
    CREATE TABLE IF NOT EXISTS mixpeek_catalog.default.documents (
        document_id STRING,
        source_url STRING,
        content_type STRING,
        metadata STRING,
        created_at TIMESTAMP
    )
""")

# Insert each document
for doc in documents:
    cursor.execute(
        """
        INSERT INTO mixpeek_catalog.default.documents
            (document_id, source_url, content_type, metadata, created_at)
        VALUES (%s, %s, %s, %s, %s)
        """,
        (
            doc.get("document_id"),
            doc.get("source", {}).get("url"),
            doc.get("content_type"),
            json.dumps(doc.get("metadata", {})),
            doc.get("created_at"),
        )
    )

connection.commit()
cursor.close()
connection.close()
```

For production workloads, write Mixpeek outputs to cloud storage (S3 or ADLS) and use Databricks Auto Loader
to incrementally ingest new files into Delta tables.

## When to Use Each

| Capability                                                  | Mixpeek                   | Databricks                        |
| ----------------------------------------------------------- | ------------------------- | --------------------------------- |
| Ingest unstructured files (video, images, audio, PDFs)      | Yes                       | No                                |
| Extract features (embeddings, transcripts, classifications) | Yes                       | No                                |
| Multimodal semantic search                                  | Yes                       | No                                |
| Structured SQL analytics                                    | No                        | Yes (Spark SQL)                   |
| Data governance and lineage                                 | Document-level ACL        | Unity Catalog                     |
| ML model training and serving                               | No                        | Yes (MLflow, Model Serving)       |
| Streaming ingestion                                         | Webhooks + batch triggers | Structured Streaming, Auto Loader |

Mixpeek handles everything before the data is structured. Databricks handles everything after. Use both to bridge the gap between raw multimodal files and governed, ML-ready data.

## Related

* [Taxonomies](/enrichment/taxonomies) -- classify content and export labels
* [SQL Lookup Stage](/retrieval/stages/sql-lookup) -- query external databases from retriever pipelines
* [API Call Stage](/retrieval/stages/api-call) -- call external APIs during retrieval
* [Webhooks](/operations/webhooks) -- trigger Databricks jobs when Mixpeek processing completes