data pipeline

Apache Spark Connector

Distributed multimodal processing at warehouse scale

Process millions of documents, images, and media files in parallel using Apache Spark with Mixpeek enrichment. The connector provides a custom Spark data source that distributes Mixpeek API calls across your cluster, enabling warehouse-scale multimodal feature extraction.
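
As a rough illustration of what distributing API calls across a cluster can look like, the sketch below processes one partition's rows in fixed-size batches, so each executor makes one request per batch instead of one per row. It is a minimal sketch in plain Python and does not show the connector's actual interface: enrich_partition, fake_enrich_batch, the batch size, and the "uri" and "n_chars" field names are all hypothetical.

```python
from itertools import islice
from typing import Callable, Dict, Iterable, Iterator, List

Row = Dict[str, object]


def enrich_partition(
    rows: Iterable[Row],
    enrich_batch: Callable[[List[Row]], List[Row]],
    batch_size: int = 32,
) -> Iterator[Row]:
    """Enrich one partition's rows in fixed-size batches.

    `enrich_batch` is a hypothetical stand-in for a call to the Mixpeek API;
    it is an assumption of this sketch, not a published client function.
    """
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    it = iter(rows)
    while True:
        batch = list(islice(it, batch_size))
        if not batch:
            return
        # One round trip per batch rather than one per row.
        yield from enrich_batch(batch)


def fake_enrich_batch(batch: List[Row]) -> List[Row]:
    # Placeholder for the real API call: attaches a dummy feature to each row.
    return [{**row, "n_chars": len(str(row["uri"]))} for row in batch]


if __name__ == "__main__":
    partition = [{"uri": f"s3://bucket/doc-{i}.pdf"} for i in range(5)]
    for row in enrich_partition(partition, fake_enrich_batch, batch_size=2):
        print(row)
```

In PySpark, a function of this shape could be passed to RDD.mapPartitions, with each Row converted to a dict via Row.asDict(), so that every partition is enriched independently on its own executor.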

spark

distributed processing

data lake

big data

batch enrichment

feature extraction

Integrations

Apache Spark 3.x

Databricks

Amazon EMR

Google Dataproc

Quick Install:

npm install @mixpeek/spark

Use Cases

Bulk enrichment of data lake content

Feature extraction for ML training datasets

Parallel taxonomy classification of media archives

Streaming enrichment with Structured Streaming

Features

Custom Spark DataSource V2 for Mixpeek

Partition-aware parallel enrichment

Automatic rate limiting and retry logic

Support for Spark Structured Streaming

UDF wrappers for inline enrichment in SQL queries
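
The rate limiting and retry behavior listed above can be pictured with a small sketch: a limiter that spaces out calls, plus a retry loop with exponential backoff and jitter. This is an assumption about the general technique, not the connector's implementation. TransientError, MinIntervalLimiter, call_with_retry, and all default values are hypothetical names chosen for illustration.

```python
import random
import time
from typing import Callable, TypeVar

T = TypeVar("T")


class TransientError(Exception):
    """Hypothetical marker for retryable failures (e.g. HTTP 429 or 503)."""


class MinIntervalLimiter:
    """Allow at most one call every `min_interval` seconds."""

    def __init__(self, min_interval: float,
                 clock: Callable[[], float] = time.monotonic,
                 sleep: Callable[[float], None] = time.sleep) -> None:
        self.min_interval = min_interval
        self._clock = clock
        self._sleep = sleep
        self._next_allowed = float("-inf")  # the first call never waits

    def wait(self) -> None:
        now = self._clock()
        if now < self._next_allowed:
            # Too early: sleep until the next permitted slot.
            self._sleep(self._next_allowed - now)
            now = self._next_allowed
        self._next_allowed = now + self.min_interval


def call_with_retry(fn: Callable[[], T], limiter: MinIntervalLimiter,
                    max_attempts: int = 4, base_delay: float = 0.5,
                    sleep: Callable[[float], None] = time.sleep) -> T:
    """Call `fn`, retrying on TransientError with exponential backoff."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(1, max_attempts + 1):
        limiter.wait()
        try:
            return fn()
        except TransientError:
            if attempt == max_attempts:
                raise  # out of attempts: surface the error to the caller
            # Backoff ceiling doubles each attempt; jitter spreads out retries.
            sleep(random.uniform(0, base_delay * 2 ** (attempt - 1)))
    raise AssertionError("unreachable")


if __name__ == "__main__":
    limiter = MinIntervalLimiter(min_interval=0.1)
    print(call_with_retry(lambda: "enriched", limiter))
```

The clock and sleep functions are injectable so the logic can be exercised without real delays. In a Spark job each executor would hold its own limiter, so the effective cluster-wide rate is roughly the per-executor rate multiplied by the number of concurrent tasks.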

Details

License: Apache 2.0

Category: data pipeline

Registry: npm

Resources

Documentation

Learn how to use this connector

Ready to integrate?

Get started with the Apache Spark Connector in minutes. Check out the documentation or explore the source code on GitHub.