
    Backblaze B2

    S3-compatible AI pipelines at roughly a quarter of the storage cost

    Connect Backblaze B2 buckets to Mixpeek for automatic multimodal extraction at a fraction of AWS S3 prices. Store your videos, images, and documents in B2, run feature extractors and embeddings through Mixpeek, and write indexed results back to B2 — with zero egress fees through Bandwidth Alliance partners.

    Backblaze B2 integration walkthrough

    The Problem

    Teams building multimodal AI pipelines hit a cost wall fast. AWS S3 charges $23/TB/month for storage and $0.09/GB for egress — costs that compound quickly when you're storing terabytes of video, images, and documents, then moving them to processing infrastructure. A 50TB media library costs $1,150/month just to store, and every extraction run that pulls data out of S3 adds egress fees on top. Teams end up choosing between processing everything they need and staying within budget.
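
The arithmetic above checks out and is worth seeing end to end. A quick sketch using only the prices quoted in this post (S3 storage at $23/TB/month, B2 at $6/TB/month, S3 egress at $0.09/GB); the 50 TB library size is the example from the text:

```python
# Monthly cost comparison for a 50 TB media library, using the list
# prices quoted above (assumption: no tiering or negotiated discounts).
S3_STORAGE_PER_TB = 23.0   # USD per TB per month
B2_STORAGE_PER_TB = 6.0    # USD per TB per month
S3_EGRESS_PER_GB = 0.09    # USD per GB transferred out of S3

library_tb = 50

s3_storage = library_tb * S3_STORAGE_PER_TB   # 1150.0 USD/month
b2_storage = library_tb * B2_STORAGE_PER_TB   # 300.0 USD/month
monthly_savings = s3_storage - b2_storage     # 850.0 USD/month

# One full extraction pass that reads the whole library out of S3:
egress_cost = library_tb * 1024 * S3_EGRESS_PER_GB  # 4608.0 USD per run

print(f"S3 storage:      ${s3_storage:,.0f}/month")
print(f"B2 storage:      ${b2_storage:,.0f}/month")
print(f"Storage savings: ${monthly_savings:,.0f}/month")
print(f"S3 egress for one full library scan: ${egress_cost:,.2f}")
```

Note that a single full-library extraction run incurs more in S3 egress than four months of storage, which is why zero-egress transfer matters as much as the per-TB price.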

    The Solution

    Mixpeek connects directly to Backblaze B2 via the S3-compatible API — same SDKs, same tools, no code changes. B2 stores your data at $6/TB/month (75% less than S3) with free egress through Bandwidth Alliance partners like Cloudflare. Mixpeek reads objects from your B2 buckets, runs multimodal extractors — visual embeddings, object detection, face recognition, OCR, and transcription — then indexes everything into retrievers. Processed results and vector indexes are written back to B2 through Mixpeek Vector Store, keeping your entire pipeline on low-cost infrastructure end-to-end.

    Measurable Impact

    What teams see after connecting Backblaze B2 to Mixpeek

    75% lower storage costs

    $6/TB/month on B2 vs $23/TB on AWS S3, saving $850/month on a 50TB library

    Zero egress fees

    Free data transfer through Bandwidth Alliance partners (Cloudflare, Fastly, Bunny CDN)

    No code changes

    S3-compatible API means existing SDKs and tools work out of the box with B2

    Same-hour setup

    Connect a B2 bucket, configure extractors, and start processing in under 60 minutes

    End-to-end B2 pipeline

    Source objects, extracted features, and vector indexes all stored on Backblaze

    Parallel extraction at scale

    Ray GPU clusters process thousands of assets concurrently across your entire library

    Pipeline Architecture

    Each step below shows how the components connect

    1. B2 Bucket Connection

    S3-Compatible API

    Connect your Backblaze B2 bucket to Mixpeek using the S3-compatible API. Same endpoint format, same SDKs — just point to your B2 region (e.g., s3.us-west-004.backblazeb2.com).

    2. Object Discovery

    Include Patterns

    Mixpeek scans your B2 bucket and applies include patterns to select which objects to process. Filter by file extension, path prefix, or naming convention.
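
Include-pattern selection can be pictured as glob matching over object keys. A sketch using Python's stdlib `fnmatch` — the pattern list here is illustrative, not Mixpeek's actual configuration syntax:

```python
from fnmatch import fnmatch

# Hypothetical include patterns: filter by extension or path prefix,
# mirroring the selection step described above.
INCLUDE_PATTERNS = ["videos/*.mp4", "images/*.jpg"]

def should_process(key: str) -> bool:
    """Return True if the object key matches any include pattern."""
    return any(fnmatch(key, pattern) for pattern in INCLUDE_PATTERNS)

keys = [
    "videos/launch.mp4",
    "videos/raw/take1.mov",
    "images/hero.jpg",
    "logs/2024-01-01.txt",
]
selected = [k for k in keys if should_process(k)]
print(selected)  # ['videos/launch.mp4', 'images/hero.jpg']
```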

    3. Multimodal Extraction

    Extractors

    Selected objects are processed through parallel extractors: visual embeddings, object detection, face identity, OCR, speech transcription, and scene splitting — running on Ray GPU clusters.

    4. Feature Indexing

    Collections

    Extracted features are stored in Mixpeek collections with full lineage back to the source B2 object, including bucket, key, and extraction metadata.
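
The lineage idea is simpler than it sounds: every stored feature carries enough information to trace it back to its B2 source object. A sketch of what such a document could look like — the field names and example values are illustrative, not Mixpeek's actual collection schema:

```python
from dataclasses import dataclass, field, asdict

# Hypothetical collection-document shape. The key point: each extracted
# feature keeps full lineage (bucket, key, extraction metadata) back to
# the source object in B2.
@dataclass
class FeatureDocument:
    source_bucket: str
    source_key: str
    extractor: str                  # e.g. "ocr", "speech_transcription"
    features: dict = field(default_factory=dict)
    extraction_metadata: dict = field(default_factory=dict)

doc = FeatureDocument(
    source_bucket="my-media-bucket",
    source_key="videos/launch.mp4",
    extractor="speech_transcription",
    features={"transcript": "welcome to the launch..."},
    extraction_metadata={"model": "example-asr-model", "duration_s": 312},
)
print(asdict(doc)["source_key"])  # videos/launch.mp4
```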

    5. Search Retriever

    Feature Search + Filters

    A retriever combines vector similarity, face identity matching, metadata filters, and full-text search. Query across all extracted features from a single API call.
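
To make "a single API call" concrete, here is what a combined query payload could look like. Everything below is a hypothetical shape — the field names and filter syntax are illustrative stand-ins, not the actual Mixpeek retriever API:

```python
import json

# Hypothetical retriever query combining the signal types listed above:
# text/vector search, face identity matching, and metadata filters.
query = {
    "retriever_id": "media-search",           # illustrative name
    "inputs": {
        "text": "product launch keynote",     # full-text + embedding
        "face_id": "person_042",              # face identity match
    },
    "filters": {
        "source_bucket": "my-media-bucket",
        "file_type": {"$in": ["mp4", "mov"]},
    },
    "limit": 10,
}
payload = json.dumps(query)
print(payload[:50])
```

The design point the page is making: instead of querying a vector DB, a face index, and a metadata store separately, one retriever request fans out across all extracted features.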

    6. Results to B2

    Mixpeek Vector Store

    Processed results, vector indexes, and scan reports are written back to Backblaze B2 via Mixpeek Vector Store. Your data stays on B2 end-to-end — zero egress fees.

    Backblaze B2 Integration Deep Dive

    Point a Mixpeek connector at your B2 bucket endpoint using the S3-compatible API. Mixpeek treats B2 buckets identically to AWS S3 — no adapter code, no migration. Set up collections with the extractors you need, configure include patterns to control which objects get processed, and Mixpeek handles the rest. New objects added to B2 are detected and processed automatically. The pipeline decomposes each asset into extracted features — scene compositions, detected objects, recognized faces, on-screen text, and transcribed speech — then indexes everything into a retriever with feature search and metadata filtering. Batch processing runs across your entire library in parallel on Ray GPU clusters, and results are written back to B2 via Mixpeek Vector Store with full lineage tracking.

    object-storage
    s3-compatible
    cost-optimization
    video
    images
    documents

    Ready to integrate?

    Get started with Mixpeek + Backblaze B2 in minutes. Read the docs, create a free account, or schedule a walkthrough with our team.