data pipeline
Apache Spark Connector
Distributed multimodal processing at warehouse scale
Process millions of documents, images, and media files in parallel using Apache Spark with Mixpeek enrichment. The connector provides a custom Spark data source that distributes Mixpeek API calls across your cluster, enabling warehouse-scale multimodal feature extraction.
spark
distributed processing
data lake
big data
batch enrichment
feature extraction
Get Started
Integrations
Apache Spark 3.x
Databricks
Amazon EMR
Google Dataproc
Quick Install:
npm install @mixpeek/spark
Use Cases
1. Bulk enrichment of data lake content
2. Feature extraction for ML training datasets
3. Parallel taxonomy classification of media archives
4. Streaming enrichment with Structured Streaming
Features
Custom Spark DataSource V2 for Mixpeek
Partition-aware parallel enrichment
Automatic rate limiting and retry logic
Support for Spark Structured Streaming
UDF wrappers for inline enrichment in SQL queries
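To make the partition-aware enrichment, rate limiting, and retry features more concrete, here is a minimal Python sketch of the general pattern. It is not the connector's actual implementation or API: `TransientError`, `call_with_retry`, `enrich_partition`, and the `enrich` callable are all hypothetical names introduced for illustration, and `enrich` stands in for whatever call sends one record to Mixpeek.

```python
import random
import time
from typing import Callable, Dict, Iterable, Iterator


class TransientError(Exception):
    """Hypothetical stand-in for a retryable failure (e.g. a throttled request)."""


def call_with_retry(
    fn: Callable[[Dict], Dict],
    record: Dict,
    max_attempts: int = 4,
    base_delay: float = 0.5,
    sleep: Callable[[float], None] = time.sleep,
) -> Dict:
    """Call fn(record), retrying transient failures with exponential backoff."""
    for attempt in range(max_attempts):
        try:
            return fn(record)
        except TransientError:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the error to the caller
            # Delay doubles each attempt; jitter spreads out concurrent retries.
            sleep(base_delay * (2 ** attempt) + random.uniform(0, base_delay))
    raise RuntimeError("max_attempts must be at least 1")


def enrich_partition(
    rows: Iterable[Dict],
    enrich: Callable[[Dict], Dict],
    max_calls_per_sec: float = 5.0,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> Iterator[Dict]:
    """Enrich one partition's rows, pacing calls to a per-partition rate cap."""
    min_interval = 1.0 / max_calls_per_sec
    last_call = None
    for row in rows:
        if last_call is not None:
            wait = min_interval - (clock() - last_call)
            if wait > 0:
                sleep(wait)  # simple pacing: keep calls at least min_interval apart
        last_call = clock()
        features = call_with_retry(enrich, row, sleep=sleep)
        yield {**row, "features": features}
```

Because `enrich_partition` takes an iterator and returns an iterator, it has the shape PySpark's `RDD.mapPartitions` expects, for example `df.rdd.mapPartitions(lambda rows: enrich_partition((r.asDict() for r in rows), my_enrich))`, where `my_enrich` is again a placeholder. Note that the cap here applies per partition, so the cluster-wide request rate is roughly the cap multiplied by the number of partitions running concurrently.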
Details
License: Apache 2.0
Category: data pipeline
Registry: npm
Ready to integrate?
Get started with the Apache Spark Connector in minutes. Check out the documentation or explore the source code on GitHub.
