
    We Built a Pre-Publication IP Clearance Pipeline. Here's What We Learned.

    Every major IP enforcement tool finds violations after they're live. We built one that catches them before publication. Here's the architecture, the models, and what we learned.



    The Problem: Content Velocity vs. Clearance Bottleneck

    A mid-size media company publishes 200-400 creative assets per week. Each one needs to be checked for unauthorized faces, trademarked logos, and copyrighted audio before it ships. A single missed celebrity likeness in an ad campaign can trigger a seven-figure lawsuit. A logo that's "close enough" to a registered trademark gets a cease-and-desist within hours.

    The current options are bad:

    • Manual review doesn't scale. A trained compliance analyst can review maybe 50 assets/day with any rigor. That's a week's backlog by Tuesday.
    • Post-publication enforcement (Pixsy, Red Points, VISUA, Copyseeker) finds violations after they're already live. You pay for takedowns, not prevention. The damage — legal exposure, brand risk, platform penalties — is already done.
    • Perceptual hashing alone catches exact and near-exact copies, but misses stylized logos, different angles of the same face, or AI-generated content that's "inspired by" but not pixel-identical to protected IP.

    We wanted a pipeline that clears content before publication, runs in under a second per image and a few seconds per video, and catches the hard cases that hashing misses.


    Architecture Overview

    The system is built on three primitives from the Mixpeek API: Buckets (storage + ingestion triggers), Collections (processing pipelines with feature extractors), and Retrievers (multi-stage search).

    The high-level flow:

    Content Asset (image/video/audio)
        |
        v
    Bucket Upload (triggers collection pipeline)
        |
        v
    Collection Pipeline (parallel extractors)
        |--- Face Detection → Face Embedding (ArcFace 512d)
        |--- Scene Splitting → Object Detection (YOLO) → Logo Embedding (SigLIP 768d)
        |--- Audio Extraction → Spectrogram Fingerprinting
        |
        v
    Vector Storage (Qdrant — one namespace, three vector spaces)
        |
        v
    Retriever (multi-stage search across all three corpora)
        |
        v
    Clearance Result: { faces: [...], logos: [...], audio: [...] }
    

    Three detection layers run in parallel within a single collection pipeline. Each layer has its own feature extractor, its own embedding model, and its own reference corpus. A single retriever execution searches all three and returns a unified result.

    The key insight: the same pipeline that processes your content for clearance also builds your reference corpus. Celebrity headshots, trademarked logos, and copyrighted audio tracks are all ingested through the same bucket-collection flow. The only difference is metadata tagging — reference items get a corpus_type: "reference" field, content to be checked gets corpus_type: "submission".
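
    A sketch of that convention (the payload shape and helper below are illustrative, not the exact Mixpeek bucket API; the point is that reference items and submissions flow through the same ingestion path and differ only in one metadata field):

```python
def build_ingest_payload(object_url: str, corpus_type: str) -> dict:
    """Build a bucket-upload payload. corpus_type is the only field that
    distinguishes reference corpus items from content to be checked."""
    if corpus_type not in ("reference", "submission"):
        raise ValueError("corpus_type must be 'reference' or 'submission'")
    return {
        "object_url": object_url,
        "metadata": {"corpus_type": corpus_type},
    }

# Reference item (celebrity headshot) vs. content awaiting clearance
ref = build_ingest_payload("s3://corpus/faces/actor_01.jpg", "reference")
sub = build_ingest_payload("s3://intake/campaign_042.mp4", "submission")
```

    At query time, the retriever filters on corpus_type: "reference" so submissions never match against each other.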


    Query Pre-Processing: Why It Matters More Than Model Quality

    This is the section most teams skip, and it's the one that matters most. The quality of what you feed into your embedding model determines your recall far more than which embedding model you pick.

    Scene Splitting for Video

    Naive approach: sample frames at a fixed interval and run detection on each. This is expensive and produces massive redundancy — a 30-second talking-head clip contains 900 frames at 30fps, most of them nearly identical.

    Better approach: split by scene boundaries first. Mixpeek's scene_splitting extractor uses PySceneDetect's content-aware detection to identify hard cuts and gradual transitions. A typical 30-second ad breaks into 3-8 scenes. Run detection on representative frames from each scene, not every frame.

    # Collection config — scene splitting feeds into face detection
    {
        "collection_name": "ip_clearance_pipeline",
        "feature_extractors": [
            {
                "feature_extractor_name": "scene_splitting",
                "version": "v1",
                "parameters": {
                    "threshold": 27.0,
                    "min_scene_len": 15
                }
            },
            {
                "feature_extractor_name": "face_identity",
                "version": "v1",
                "parameters": {
                    "quality_threshold": 0.4,
                    "min_face_size": 40,
                    "detection_threshold": 0.5
                }
            },
            {
                "feature_extractor_name": "object_detection",
                "version": "v1",
                "parameters": {
                    "model": "yolov8x-worldv2",
                    "confidence_threshold": 0.25,
                    "classes": ["logo", "brand", "trademark", "sign", "label"]
                }
            }
        ]
    }
    

    This preprocessing step cuts compute by 10-50x on video inputs while actually improving detection quality — scene-representative frames are more likely to show faces and logos in clear, unblurred positions than arbitrary frame samples.
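
    The content-aware idea can be illustrated without PySceneDetect: compare consecutive frames and cut wherever the mean per-pixel difference spikes. (PySceneDetect's ContentDetector does this in HSV space with more care; this numpy version is a toy sketch using the same threshold convention.)

```python
import numpy as np

def detect_scene_cuts(frames, threshold=27.0):
    """Return frame indices where a hard cut likely occurs.

    frames: (N, H, W) grayscale array, values 0-255. A cut is flagged when
    the mean absolute difference between consecutive frames exceeds
    `threshold` (same spirit as the threshold=27.0 in the config above)."""
    frames = frames.astype(float)
    mean_diffs = np.abs(frames[1:] - frames[:-1]).mean(axis=(1, 2))
    return [i + 1 for i, d in enumerate(mean_diffs) if d > threshold]
```

    Detection then runs on one representative frame per detected segment rather than on all N frames.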

    Face Cropping Before Embedding

    You don't embed the full frame. You detect faces first (SCRFD — Sample and Computation Redistribution for Face Detection), crop each face region, align it to a normalized 112x112 template using 5 facial landmarks, and then generate the identity embedding.

    This is obvious in hindsight, but I've seen teams embed full frames and wonder why their face search has 40% recall. The face is 3% of the pixel area in a wide shot. The embedding is dominated by the background.
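
    The alignment step can be sketched in numpy: estimate a least-squares similarity transform (Umeyama) from the 5 detected landmarks to the canonical 112x112 template, then warp the crop with it. The template coordinates below are the widely used insightface values; in production the warp itself would go through cv2.warpAffine.

```python
import numpy as np

# Canonical 5-point template for a 112x112 ArcFace crop
# (left eye, right eye, nose tip, left mouth corner, right mouth corner)
ARCFACE_TEMPLATE = np.array([
    [38.2946, 51.6963], [73.5318, 51.5014], [56.0252, 71.7366],
    [41.5493, 92.3655], [70.7299, 92.2041],
])

def similarity_transform(src, dst):
    """Umeyama least-squares similarity transform: a 2x3 affine matrix
    mapping src landmarks onto dst landmarks (rotation + scale + shift)."""
    src_mean, dst_mean = src.mean(0), dst.mean(0)
    src_c, dst_c = src - src_mean, dst - dst_mean
    cov = dst_c.T @ src_c / len(src)
    U, S, Vt = np.linalg.svd(cov)
    D = np.diag([1.0, np.sign(np.linalg.det(U @ Vt))])  # reflection guard
    R = U @ D @ Vt
    scale = np.trace(np.diag(S) @ D) / src_c.var(0).sum()
    t = dst_mean - scale * R @ src_mean
    return np.hstack([scale * R, t[:, None]])

def apply_affine(M, pts):
    return pts @ M[:, :2].T + M[:, 2]
```

    Because the transform is restricted to similarity (no shear), a tilted or off-center face lands on the same canonical pose the embedding model was trained on.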

    Object Proposals for Logo Isolation

    Same principle for logos. YOLO generates bounding box proposals for logo-like regions. Each region is cropped and embedded independently with SigLIP. A single frame might yield zero logo proposals (clean background) or five (product shelf shot). Each proposal becomes a separate search query against the logo reference corpus.

    The alternative — embedding the full frame and hoping the model attends to the logo — works surprisingly well for prominent logos (center frame, large area) and fails badly for small, peripheral, or partially occluded marks. The crop-then-embed approach handles both cases.
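
    Crop-then-embed in sketch form: expand each proposal box by 10% padding, clamp to the frame, and crop. (The embedding step is omitted here; in the real pipeline each crop goes to SigLIP as an independent query.)

```python
import numpy as np

def crop_with_padding(frame, box, pad=0.10):
    """Crop a (H, W, C) frame to box=(x1, y1, x2, y2), expanded by `pad`
    of the box size on each side and clamped to the frame bounds."""
    h, w = frame.shape[:2]
    x1, y1, x2, y2 = box
    px, py = (x2 - x1) * pad, (y2 - y1) * pad
    x1 = max(0, int(x1 - px)); y1 = max(0, int(y1 - py))
    x2 = min(w, int(x2 + px)); y2 = min(h, int(y2 + py))
    return frame[y1:y2, x1:x2]

# Each proposal becomes a separate crop -> separate embedding -> separate query
frame = np.zeros((720, 1280, 3), dtype=np.uint8)
crops = [crop_with_padding(frame, b) for b in [(100, 100, 200, 150), (0, 0, 50, 50)]]
```

    The padding matters: YOLO boxes are often tight, and clipping a logo's edge pixels measurably hurts embedding similarity.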


    The Pipelines in Detail

    Face Detection → Recognition

    Pipeline stages:

    1. Scene splitting (video only): PySceneDetect content-aware detection → representative frames
    2. Face detection: SCRFD-2.5G scans each frame. Outputs bounding boxes, confidence scores, and 5 facial landmarks per detected face.
    3. Alignment: Landmarks are used to warp each face to a canonical 112x112 frontal pose. This normalization is what makes the system robust to head tilt, partial profile views, and camera angle variation.
    4. Embedding: ArcFace (ResNet-100, trained on MS1MV3) generates a 512-dimensional identity embedding. Cosine similarity in this space corresponds directly to identity — same person across lighting, age, expression, and moderate pose changes.
    5. ANN search: Each face embedding is searched against the reference corpus in Qdrant. Threshold: cosine similarity >= 0.28 (conservative; FAR ~1e-4 on LFW benchmark).

    Why ArcFace and not CLIP/SigLIP for faces? CLIP embeds semantic similarity. Two different red-haired women in similar settings score high. ArcFace is trained with angular margin loss specifically for identity discrimination — it is not interchangeable with general-purpose image embedders for biometric matching.

    # Retriever config — face search stage
    {
        "stage_id": "feature_search",
        "stage_type": "search",
        "parameters": {
            "searches": [
                {
                    "feature_uri": "mixpeek://face_identity@v1/arcface_embedding",
                    "query": {
                        "input_mode": "content",
                        "value": "{{submission_face_crop}}"
                    },
                    "filters": {
                        "corpus_type": "reference"
                    },
                    "top_k": 10,
                    "score_threshold": 0.28
                }
            ]
        }
    }
    

    Logo Detection → Recognition

    Pipeline stages:

    1. Scene splitting (video only): same as above
    2. Object detection: YOLOv8x-WorldV2 generates bounding box proposals for logo-like objects. We use the open-vocabulary variant so it generalizes beyond a fixed class set — you can prompt it with arbitrary class names at inference time.
    3. Region cropping: Each detected region is cropped with 10% padding
    4. Logo embedding: SigLIP (ViT-B/16, 768-dimensional) embeds each cropped region. SigLIP over CLIP because its sigmoid pairwise loss produces better-calibrated similarity scores for retrieval tasks.
    5. Dual matching: Each crop is matched against the reference corpus via both (a) embedding cosine similarity and (b) perceptual hash distance. Either signal above threshold triggers a match.

    The dual matching is important. Perceptual hashing catches trivial copies (exact logo, maybe resized or JPEG'd) cheaply. The embedding catches stylized variants, partial logos, color inversions, and the deformed versions that generative AI tends to produce. Running both in parallel with an OR-gate means high recall without relying solely on either approach.

    # Why pHash + embedding, not just embedding?
    #
    # pHash: O(1) lookup, exact/near-exact matches, zero false negatives on
    #         trivial copies. Catches ~60% of real-world violations.
    # Embedding: Handles stylization, partial occlusion, AI-generated variants.
    #            Catches the remaining 40% that hashing misses.
    #
    # Running both costs almost nothing extra — the hash is computed during
    # ingestion and stored as a payload field. At query time, it's a
    # Qdrant payload filter, not a separate search.
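
    The OR-gate itself is a few lines. The helper names here are hypothetical, but the thresholds are the post's defaults (cosine >= 0.75 for the embedding, Hamming <= 8 for the 64-bit pHash):

```python
def hamming64(a: int, b: int) -> int:
    """Hamming distance between two 64-bit pHash values."""
    return bin(a ^ b).count("1")

def dual_match(cos_sim: float, hamming_dist: int,
               cos_thr: float = 0.75, ham_thr: int = 8) -> bool:
    """OR-gate over the two signals: the embedding catches stylized
    variants, the hash catches trivial copies. Either one suffices."""
    return cos_sim >= cos_thr or hamming_dist <= ham_thr
```

    A stylized variant might score (0.81, 29) and match on the embedding alone; a resized exact copy might score (0.68, 3) and match on the hash alone.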
    

    Audio Fingerprinting

    Pipeline stages:

    1. Audio extraction: FFmpeg strips the audio track from video assets
    2. Spectrogram generation: Short-time Fourier transform → mel spectrogram
    3. Fingerprint embedding: The spectrogram is embedded into a dense vector representation for similarity search
    4. ANN search: Search against a reference corpus of copyrighted audio tracks, jingles, and licensed music

    Audio is the least mature of the three pipelines and the one where perceptual fingerprinting (Chromaprint/AcoustID-style) still outperforms learned embeddings for exact match detection. The embedding approach shines for covers, remixes, and tempo-shifted versions.
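
    Steps 2-3 in sketch form: a numpy STFT followed by a triangular mel filterbank. Libraries like librosa do this in one call; this toy version mainly shows the shapes involved.

```python
import numpy as np

def mel_spectrogram(signal, sr=16000, n_fft=512, hop=256, n_mels=40):
    """STFT magnitude -> mel-scaled spectrogram, shape (n_mels, n_frames)."""
    signal = np.asarray(signal, dtype=float)
    # Short-time Fourier transform via a sliding Hann window
    window = np.hanning(n_fft)
    n_frames = 1 + (len(signal) - n_fft) // hop
    frames = np.stack([signal[i * hop:i * hop + n_fft] * window
                       for i in range(n_frames)])
    mag = np.abs(np.fft.rfft(frames, axis=1)).T  # (n_fft//2 + 1, n_frames)

    # Triangular mel filterbank: equal spacing on the mel scale
    def hz_to_mel(f): return 2595 * np.log10(1 + f / 700)
    def mel_to_hz(m): return 700 * (10 ** (m / 2595) - 1)
    mel_pts = mel_to_hz(np.linspace(0, hz_to_mel(sr / 2), n_mels + 2))
    bins = np.floor((n_fft + 1) * mel_pts / sr).astype(int)
    fb = np.zeros((n_mels, n_fft // 2 + 1))
    for m in range(1, n_mels + 1):
        l, c, r = bins[m - 1], bins[m], bins[m + 1]
        if c > l: fb[m - 1, l:c] = (np.arange(l, c) - l) / (c - l)
        if r > c: fb[m - 1, c:r] = (r - np.arange(c, r)) / (r - c)
    return fb @ mag
```

    The fingerprint embedding in step 3 is learned on top of representations like this one; the mel compression is what gives the fingerprint its robustness to EQ and codec artifacts.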


    Models and Tuning

    Why SigLIP Over CLIP

    We evaluated both extensively for the logo matching use case. SigLIP (768-d, ViT-B/16) wins on three dimensions that matter for retrieval:

    • Calibrated scores: SigLIP's sigmoid loss produces similarity scores that are directly interpretable as match confidence. CLIP's softmax-normalized scores are relative within a batch, which makes threshold-setting fragile.
    • Cropped region performance: On our internal eval set of 500 logo crops against 3,000 reference brands, SigLIP at threshold 0.75 achieves 94% recall / 97% precision vs CLIP's 89% recall / 93% precision at its optimal threshold.
    • Zero-shot generalization: SigLIP handles brand logos it's never seen in training better than CLIP, likely due to the per-pair sigmoid loss not pushing negatives to the same scale as positives.

    Face Recognition: ArcFace Tradeoffs

    ArcFace ResNet-100 is the default. It's 512 dimensions, runs at ~3ms per face on GPU, and achieves 99.83% accuracy on LFW. The tradeoffs:

    • Pose sensitivity: Accuracy degrades beyond ~60-degree profile angles. The alignment step mitigates this for moderate poses, but a face visible only in full profile may not match.
    • Aging: Embeddings shift over 10+ year spans. A reference photo from 2010 may not match the same person in 2026 at conservative thresholds. Mitigation: include multiple reference images spanning different time periods.
    • Low resolution: Faces below ~40px wide after detection don't produce reliable embeddings. We set min_face_size: 40 as a hard floor.

    Custom YOLO Models

    The object detection stage supports custom model deployment. If your reference corpus is highly specialized — say, you need to detect pharmaceutical packaging marks or specific regulatory symbols — you can train a custom YOLO model and upload it as a ZIP file to the platform. The inference service loads it on-demand.

    For generic logo detection, YOLOv8x-WorldV2's open-vocabulary capability is sufficient. You specify the classes you care about at query time:

    # Open-vocabulary object detection — no retraining needed
    {
        "feature_extractor_name": "object_detection",
        "parameters": {
            "model": "yolov8x-worldv2",
            "classes": ["Nike swoosh", "McDonald's arches", "Apple logo", "brand logo"]
        }
    }
    

    The Reranker

    ANN search returns top-K candidates fast but approximate. For the final ranking, we run a cross-encoder reranker on the top results. The cross-encoder sees both the query and candidate simultaneously (not independently embedded), which allows it to capture fine-grained differences that dual-encoder models miss.

    This is especially valuable for logos where the top-10 ANN results might include 3 genuine matches and 7 visually similar but legally distinct marks. The cross-encoder precision on this disambiguation step is what separates "useful tool" from "alert fatigue generator."
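
    In sketch form, the reranking stage is just a re-scoring of the ANN top-K with a more expensive pairwise scorer. The toy_score function below is a stand-in for the real cross-encoder forward pass:

```python
def rerank(query, candidates, score_fn, top_k=5):
    """Re-score ANN candidates with a pairwise scorer that sees the query
    and each candidate together, then keep the top_k by the new score."""
    return sorted(candidates, key=lambda c: score_fn(query, c), reverse=True)[:top_k]

# Stand-in scorer: in production this is a cross-encoder forward pass
# over (query crop, candidate reference) pairs
def toy_score(query, candidate):
    return candidate["prior_confidence"]
```

    The design point: ANN cost scales with corpus size, cross-encoder cost scales with K. Running the expensive model on only the top 10 keeps the whole check inside the latency budget.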


    The Dataset Challenge

    Building the Reference Corpus

    This is where we spent most of our time and where most teams underinvest.

    Faces: How many reference images per identity do you need? Our empirical finding: 5-10 quality reference images per person covers enough pose/lighting variation to achieve >95% recall at FAR 1e-4. With only 1 reference image, recall drops to ~70%. Below 3, it's unreliable.

    We built our initial corpus from FaceScrub (~530 identities, ~2,900 images after URL death) supplemented with Wikipedia Commons portraits. The URL attrition on FaceScrub is brutal — it's a 2014 dataset and ~85% of the original URLs are dead. For production, you need a maintained reference database, not a research dataset.

    # Corpus quality matters more than quantity
    # These are our empirical numbers on the FaceScrub + Wikipedia corpus:
    #
    # References/identity | Recall@FAR=1e-4 | Notes
    # ------------------- | ---------------- | -----
    # 1                   | ~70%             | Single frontal portrait
    # 3                   | ~88%             | Frontal + 2 varied poses
    # 5                   | ~94%             | Diverse lighting/angle
    # 10                  | ~97%             | Diminishing returns here
    # 20+                 | ~98%             | Not worth the curation cost
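
    The reason extra references help is mechanical: a probe only has to clear the threshold against its nearest reference, so the identity score is a max over the reference set. A minimal sketch:

```python
import numpy as np

def identity_score(probe, references):
    """Best cosine similarity between a probe embedding (d,) and an
    identity's reference embeddings (n_refs, d). More references cover
    more pose/lighting variation, so the max climbs with corpus quality."""
    probe = probe / np.linalg.norm(probe)
    refs = references / np.linalg.norm(references, axis=1, keepdims=True)
    return float((refs @ probe).max())
```

    A profile-view probe that scores 0.21 against a frontal reference may score 0.34 against a three-quarter-view reference of the same person, which is the difference between a miss and a match at the 0.28 threshold.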
    

    Logos: We use LogoDet-3K (158,654 images, 3,000 brands, MIT license) as the base. The critical preprocessing step: LogoDet-3K uses numeric company IDs, not brand names. You must resolve the ID-to-brand mapping before ingestion, or your metadata says "Food/12345" instead of "McDonald's." We burned half a day on this.

    For logos, variation coverage matters: you need the logo on white, on dark backgrounds, in color, in grayscale, at multiple scales, and ideally in real-world context (storefront, product packaging, screen captures). A single clean vector logo file is insufficient as a reference.

    The Cold Start Problem

    When you first deploy, your reference corpus is whatever you curated. It doesn't cover edge cases — unusual lighting conditions, rare logo variants, faces that are only partially visible. The system's recall on day one is measurably worse than on day 30.

    The fix is interaction feedback, which feeds directly into the continuous learning loop described in the next section.


    Continuous Learning: The Interaction Loop

    This is the part that actually matters long-term, and it's the part most blog posts about ML pipelines skip.

    Every user interaction with the system generates a signal:

    • Click on a result → implicit positive signal
    • Skip a result → weak negative signal
    • Long view (>3s on a result) → implicit positive signal
    • Explicit feedback → "correct match" / "false positive" / "missed match"
    • Threshold override → analyst manually approves/rejects at a specific confidence level

    These signals are captured by Mixpeek's interaction tracking and stored in ClickHouse for analytics. The analytics endpoints expose confidence distributions and signal patterns per retriever:

    # Analyzing signal patterns for a retriever
    import requests
    
    # Get confidence distribution — where are matches landing?
    response = requests.get(
        "https://api.mixpeek.com/v1/analytics/retrievers/{retriever_id}/confidence",
        headers={"Authorization": f"Bearer {api_key}"}
    )
    
    # Returns histogram of match confidence scores
    # If there's a bimodal distribution with a gap at 0.35,
    # that's your natural threshold — not the default 0.28.
    
    # Get signal breakdown — what's being confirmed vs. rejected?
    signals = requests.get(
        "https://api.mixpeek.com/v1/analytics/retrievers/{retriever_id}/signals",
        headers={"Authorization": f"Bearer {api_key}"}
    )
    

    Over time, these signals inform three adjustments:

    1. Threshold tuning: If analysts are consistently rejecting matches above 0.28 but below 0.35, raise the threshold. If they're marking missed matches in the 0.22-0.28 range, consider lowering it for specific corpora.
    2. Fusion weight adjustment: The retriever combines face, logo, and audio signals with configurable weights. If logo matches have a higher false-positive rate than face matches in your domain, down-weight them.
    3. Reference corpus expansion: When a new face or logo is confirmed as a genuine match but wasn't in the reference corpus, add it. This is how the system improves its recall over time without model retraining.
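
    The bimodal-gap heuristic from the analytics snippet above (find where the confidence histogram dips between its two modes) can be sketched in a few lines. This is a toy version that assumes the distribution really is bimodal with non-adjacent peaks:

```python
import numpy as np

def natural_threshold(scores, bins=20, lo=0.0, hi=1.0):
    """Locate the valley between the two modes of a confidence histogram
    and return its bin center as a candidate match threshold."""
    counts, edges = np.histogram(scores, bins=bins, range=(lo, hi))
    centers = (edges[:-1] + edges[1:]) / 2
    order = np.argsort(counts)[::-1]                      # bins by count, desc
    p1 = order[0]
    p2 = next(i for i in order[1:] if abs(i - p1) > 1)    # skip p1's neighbors
    a, b = sorted((int(p1), int(p2)))
    valley = a + 1 + int(np.argmin(counts[a + 1:b]))      # emptiest bin between
    return float(centers[valley])
```

    The output is a suggestion for the analyst, not an auto-applied value, in keeping with the no-auto-adjustment rule below.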

    The system doesn't auto-adjust thresholds (that would be terrifying for a compliance tool). It surfaces the data; a human makes the call. But having the data surface automatically, instead of discovering your false-positive rate through legal complaints, is the difference between a proactive tool and an expensive audit.


    Performance Numbers

    Measured on our production deployment with a corpus of ~3,000 brand logos and ~530 face identities:

    Input Type              | Latency (p50) | Latency (p95) | Notes
    ----------------------- | ------------- | ------------- | -----
    Single image            | 220ms         | 480ms         | Face + logo detection + search
    Video (30s, ~5 scenes)  | 2.1s          | 3.8s          | Includes scene splitting
    Video (60s, ~12 scenes) | 4.3s          | 7.2s          | Scales linearly with scenes
    Audio-only check        | 180ms         | 350ms         | 30s clip fingerprint
    Batch (1000 images)     | ~4 min        | ~7 min        | Parallelized across workers

    False positive rates at the default thresholds:

    Detection Layer  | Threshold      | FPR    | Recall
    ---------------- | -------------- | ------ | ------
    Face (ArcFace)   | cosine >= 0.28 | ~0.01% | ~94%
    Logo (SigLIP)    | cosine >= 0.75 | ~3%    | ~94%
    Logo (pHash)     | hamming <= 8   | ~0.1%  | ~60%
    Logo (combined)  | either above   | ~3.1%  | ~97%

    The logo FPR is higher because visually similar but legally distinct marks are common (think: any swoosh-like shape near the Nike threshold). The reranker brings this down to ~0.5% in practice, but we report the pre-reranker numbers because that's what you'll see before adding the cross-encoder stage.


    What We'd Do Differently

    Having built this and deployed it, here's what we'd change if starting over:

    1. Start with perceptual hashing, add embeddings second

    pHash is trivial to implement, runs in microseconds, and catches ~60% of real-world logo violations (exact copies, resizes, JPEG re-compressions, minor crops). If you're building an MVP, ship pHash first and add the embedding pipeline when you need to catch stylized variants. We built the embedding pipeline first because it was more interesting, which was a mistake.
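
    A full pHash fits in ~20 lines of numpy: downsample to 32x32 grayscale, take the 2D DCT, keep the low-frequency 8x8 block, and threshold at its median to get 64 bits. Libraries like imagehash do this for you; this version uses a naive box-filter resize and is for illustration only.

```python
import numpy as np

def dct2(x):
    """Orthonormal 2D DCT-II via an explicit basis matrix."""
    n = x.shape[0]
    k, i = np.meshgrid(np.arange(n), np.arange(n), indexing="ij")
    D = np.sqrt(2 / n) * np.cos(np.pi * (2 * i + 1) * k / (2 * n))
    D[0] /= np.sqrt(2)
    return D @ x @ D.T

def phash(gray):
    """64-bit perceptual hash of a square grayscale image whose side is a
    multiple of 32 (naive box-filter downsample stands in for a resize)."""
    n = gray.shape[0] // 32
    small = gray.astype(float).reshape(32, n, 32, n).mean(axis=(1, 3))
    low = dct2(small)[:8, :8].flatten()      # low-frequency 8x8 block
    bits = low > np.median(low)              # median split -> 64 bits
    return int("".join("1" if b else "0" for b in bits), 2)

def hamming(a, b):
    return bin(a ^ b).count("1")
```

    Brightness shifts, resizes, and re-compressions barely move the low-frequency DCT ordering, so near-duplicates land within a few bits of each other while unrelated images sit ~32 bits apart.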

    2. Invest in reference corpus quality before model selection

    We spent weeks evaluating ArcFace vs. AdaFace vs. CosFace when the actual bottleneck was that 40% of our FaceScrub reference URLs were dead. 10 good reference images per identity with a mediocre model beats 1 reference image with SOTA. This isn't a theoretical observation — our recall jumped 12 percentage points from corpus cleanup alone, without touching the model.

    3. Build the feedback loop from day one

    We bolted interaction tracking on only after the first deployment was already live. That cost us three weeks of production data with no signal capture — three weeks of analyst corrections that were lost. The feedback loop is what makes the system improve over time. Build it in from the start, even if the analytics dashboard comes later.

    4. Don't underestimate the logo disambiguation problem

    There are thousands of logos that are vaguely circular with a swoosh-like element. At embedding similarity 0.7, the Nike swoosh matches dozens of unrelated marks. The cross-encoder reranker exists because of this problem. If your use case involves logos at all, budget for a reranking stage — you will need it.

    5. Video is a multiplier, not a different problem

    The per-frame detection is identical to image detection. The only video-specific work is scene splitting and deduplication (the same face appearing across multiple frames of the same scene should not generate multiple alerts). We initially over-engineered the video pipeline before realizing it's just "image pipeline + scene splitting + dedup."
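
    The dedup policy is essentially a reduce per (scene, identity) pair: multiple frame-level hits for the same match in the same scene collapse into one alert, keeping the best confidence. A minimal sketch:

```python
def dedup_alerts(detections):
    """detections: iterable of (scene_id, match_id, confidence) tuples.
    Returns one alert per (scene, identity), keeping the highest
    confidence observed for that identity within that scene."""
    best = {}
    for scene_id, match_id, conf in detections:
        key = (scene_id, match_id)
        if key not in best or conf > best[key]:
            best[key] = conf
    return [(s, m, c) for (s, m), c in sorted(best.items())]
```

    The same face across two different scenes still produces two alerts, which is intentional: scene context matters for the compliance review.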


    Try It

    The demo is live at copyright.mixpeek.com. Upload an image or video and see the face/logo/audio detection results in real time.

    The API is documented at mixpeek.com/docs. If you're building a pre-publication clearance pipeline, the key endpoints are:

    • Buckets for ingesting both reference corpora and content submissions
    • Collections with face_identity, object_detection, and audio extractors for processing
    • Retrievers with multi-stage search for running clearance checks

    If you're building something similar and hit a wall, the tutorial walks through the full pipeline setup from corpus ingestion to retriever configuration.

    We also open-sourced the demo frontend — it's a React app that calls the Mixpeek API. Clone it, point it at your namespace, and you have a working IP clearance UI.