    data pipeline

    Apache Spark Connector

    Distributed multimodal processing at warehouse scale

    Process millions of documents, images, and media files in parallel using Apache Spark with Mixpeek enrichment. The connector provides a custom Spark data source that distributes Mixpeek API calls across your cluster, enabling warehouse-scale multimodal feature extraction.
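
    As a rough illustration of what distributing API calls across partitions can look like, the sketch below processes one partition's rows in fixed-size batches and attaches the returned features to each row. It is not the connector's actual implementation: `enrich_partition`, `batched`, `fake_enrich_batch`, the `uri` and `features` field names, and the batch size are all hypothetical, and the real enrichment call is replaced by a local stand-in so the example runs without a cluster or network access.

```python
from itertools import islice
from typing import Callable, Dict, Iterable, Iterator, List

Row = Dict[str, object]


def batched(rows: Iterable[Row], size: int) -> Iterator[List[Row]]:
    """Yield consecutive lists of at most `size` rows from an iterator."""
    it = iter(rows)
    while True:
        batch = list(islice(it, size))
        if not batch:
            return
        yield batch


def enrich_partition(
    rows: Iterable[Row],
    enrich_batch: Callable[[List[Row]], List[object]],
    batch_size: int = 32,
) -> Iterator[Row]:
    """Enrich one partition lazily, one batch per enrichment call.

    `enrich_batch` is a hypothetical callable that takes a list of rows
    and returns one feature object per row, in the same order.
    """
    for batch in batched(rows, batch_size):
        features = enrich_batch(batch)
        if len(features) != len(batch):
            raise ValueError("enrich_batch must return one result per row")
        for row, feat in zip(batch, features):
            # Copy the row and add the (assumed) `features` field.
            yield {**row, "features": feat}


def fake_enrich_batch(batch: List[Row]) -> List[object]:
    """Local stand-in for a remote enrichment call (assumption)."""
    return [{"length": len(str(r["uri"]))} for r in batch]


if __name__ == "__main__":
    sample = [{"uri": "s3://bucket/file-%d.jpg" % i} for i in range(5)]
    for enriched in enrich_partition(sample, fake_enrich_batch, batch_size=2):
        print(enriched)
```

    In PySpark, a function of this shape could be applied per partition with something like `df.rdd.mapPartitions(lambda it: enrich_partition((r.asDict() for r in it), call_api))`, where `call_api` is whatever client function performs the real request. Working on an iterator keeps memory bounded to one batch per task.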

    spark
    distributed processing
    data lake
    big data
    batch enrichment
    feature extraction

    Quick Install:
    npm install @mixpeek/spark

    Use Cases

    1. Bulk enrichment of data lake content
    2. Feature extraction for ML training datasets
    3. Parallel taxonomy classification of media archives
    4. Streaming enrichment with Structured Streaming

    Features

    Custom Spark DataSource V2 for Mixpeek
    Partition-aware parallel enrichment
    Automatic rate limiting and retry logic
    Support for Spark Structured Streaming
    UDF wrappers for inline enrichment in SQL queries
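
    The rate limiting and retry behaviour listed above is not specified in detail here, so the following is only a minimal sketch of one common approach: a per-process limiter that spaces calls evenly, combined with exponential backoff and full jitter on retryable failures. `RateLimiter`, `call_with_retry`, `TransientError`, and all default values are hypothetical names and numbers chosen for illustration, not the connector's API.

```python
import random
import time
from typing import Callable, TypeVar

T = TypeVar("T")


class TransientError(Exception):
    """Hypothetical marker for retryable failures (e.g. HTTP 429 or 503)."""


class RateLimiter:
    """Space calls at least 1 / calls_per_second apart within one process."""

    def __init__(
        self,
        calls_per_second: float,
        clock: Callable[[], float] = time.monotonic,
        sleep: Callable[[float], None] = time.sleep,
    ) -> None:
        if calls_per_second <= 0:
            raise ValueError("calls_per_second must be positive")
        self.min_interval = 1.0 / calls_per_second
        self.clock = clock
        self.sleep = sleep
        self.next_allowed = 0.0

    def acquire(self) -> None:
        """Block until the next call is allowed, then reserve a slot."""
        now = self.clock()
        wait = self.next_allowed - now
        if wait > 0:
            self.sleep(wait)
            now = max(self.clock(), self.next_allowed)
        self.next_allowed = now + self.min_interval


def call_with_retry(
    fn: Callable[[], T],
    max_attempts: int = 5,
    base_delay: float = 0.5,
    max_delay: float = 30.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call `fn`, retrying on TransientError with exponential backoff."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(1, max_attempts + 1):
        try:
            return fn()
        except TransientError:
            if attempt == max_attempts:
                raise  # Out of attempts: surface the last error.
            # Backoff cap doubles each attempt; full jitter picks a
            # random delay in [0, cap] to avoid synchronized retries.
            cap = min(max_delay, base_delay * 2 ** (attempt - 1))
            sleep(random.uniform(0, cap))
    raise AssertionError("unreachable")
```

    A task would typically call `limiter.acquire()` before each request and wrap the request itself in `call_with_retry`. Because each executor process holds its own limiter, the effective cluster-wide rate in this sketch is roughly the per-process rate multiplied by the number of concurrent tasks.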

    Integrations

    Apache Spark 3.x
    Databricks
    Amazon EMR
    Google Dataproc

    Details

    License: Apache 2.0
    Category: data pipeline
    Registry: npm

    Ready to integrate?

    Get started with the Apache Spark Connector in minutes. Check out the documentation or explore the source code on GitHub.