Perceptual Image Hashing: How Agents Recognize the Same Picture After It Has Been Re-Encoded, Cropped, and Recolored

Why This Is a Different Problem From Image Embeddings

Most visual retrieval an agent does is semantic. You embed an image into a vector, embed the query the same way, and find pictures that *look like* the query: "find product photos similar to this one," "retrieve frames that contain a dog." That is similarity search, and it is the right tool when the agent wants a *kind* of image.

Perceptual hashing answers a fundamentally different question: "is this the same picture I already have?" Not a different photo of the same subject, not a similar scene -- the same source image, possibly re-saved at a lower quality, resized for a thumbnail, lightly cropped, watermarked, or run through a brightness filter. A semantic embedding is the wrong tool here because it generalizes: it will happily call two different photos of the same handbag a match, and its score drifts continuously, so there is no clean yes/no identity line. Perceptual hashing is built to be *specific and robust at the same time*, and it gets there with a compact deterministic fingerprint rather than a learned vector.

For an agent, this is the capability behind near-duplicate detection (is this upload already in the library?), known-content matching (does this image match an entry in a banned or copyrighted set?), and frame deduplication (collapse the dozens of near-identical frames a static shot produces before you pay to embed them). It is identity-level retrieval, it produces a tiny fixed-size code, and it scales to hundreds of millions of references with sub-millisecond comparisons.

The Core Constraint: Robustness Under Transformation

A naive approach -- hash the raw image bytes with a cryptographic hash like SHA-256 -- fails immediately and by design. Cryptographic hashes are built so that flipping a single input bit produces a completely different digest. Re-save a JPEG at 90 percent quality instead of 95 and the encoded byte stream changes, so the SHA-256 digest differs even though the picture is visually identical. We need a hash with the opposite property: visually similar inputs must produce *similar* codes, and the codes must survive the transformations real images go through:

Re-compression (JPEG/WebP re-encoding at different quality)

Resizing and rescaling (thumbnails, responsive variants)

Small brightness, contrast, and gamma shifts (filters, auto-correction)

Minor crops, borders, and watermarks (a logo stamped in the corner)

Format conversion (PNG to JPEG, color profile changes)

The insight that makes perceptual hashing work: the *low-frequency, coarse structure* of an image -- the broad arrangement of light and dark regions -- is what survives all of these. Fine detail is exactly what compression, resizing, and filtering throw away first. If we build our fingerprint out of coarse structure and discard high-frequency detail, we get a representation that is compact, distinctive, and survivable.
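The avalanche behavior that rules out cryptographic hashes can be checked with the standard library alone. This is a minimal sketch: the byte string is a stand-in for an encoded image file rather than a real JPEG, and `bit_difference` is a helper name introduced here for illustration.

```python
import hashlib

def bit_difference(a: bytes, b: bytes) -> int:
    """Count differing bits between two equal-length byte strings."""
    return sum(bin(x ^ y).count("1") for x, y in zip(a, b))

# Stand-in for an encoded image file (hypothetical payload, not a real JPEG).
original = bytes(range(256)) * 4
modified = bytearray(original)
modified[100] ^= 0x01          # flip a single bit in the input

d1 = hashlib.sha256(original).digest()
d2 = hashlib.sha256(bytes(modified)).digest()

input_bits_changed = bit_difference(original, bytes(modified))   # exactly 1
digest_bits_changed = bit_difference(d1, d2)                      # roughly half of 256
print(input_bits_changed, digest_bits_changed)
```

One changed input bit moves about half of the 256 digest bits, so the digest carries no information about how close two inputs were. A perceptual hash is designed so that this number stays small for small visual changes.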

Step 1: Normalize Away the Easy Variation

Every perceptual hash starts the same way: collapse the dimensions that should not matter for identity.

1. Convert to grayscale. Identity should not hinge on a color filter, and luminance carries most of the structure.

2. Downscale hard. Shrink to a tiny fixed grid (8x8 for average and difference hashing, 32x32 before the transform for pHash). Aggressive downscaling is not a side effect -- it *is* the low-pass filter that throws away the fragile high-frequency detail and forces a thumbnail and its full-resolution source to converge to the same small image.

from PIL import Image

def normalize(path, size):
    img = Image.open(path).convert("L")            # grayscale
    return img.resize((size, size), Image.LANCZOS)  # tiny fixed grid

After this step a 4000x3000 photo and its 200x150 thumbnail are nearly the same little array of pixels. Everything that follows is a way to turn that array into a comparable bit string.
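The convergence claim can be illustrated without any imaging library. The sketch below assumes a plain box filter (block averaging) instead of the Lanczos filter used above, and a synthetic 64x64 grid in place of a real photo; `box_downscale` is a hypothetical helper. With a box filter and exact size multiples the two paths agree exactly, whereas a real resampler will only agree approximately.

```python
def box_downscale(grid, size):
    """Average equal square blocks of a square 2D list down to size x size."""
    n = len(grid)
    block = n // size               # assumes n is an exact multiple of size
    out = []
    for r in range(size):
        row = []
        for c in range(size):
            total = sum(
                grid[r * block + i][c * block + j]
                for i in range(block)
                for j in range(block)
            )
            row.append(total / (block * block))
        out.append(row)
    return out

# Synthetic 64x64 "photo": a smooth diagonal ramp plus fine checkerboard texture.
full = [[(2 * x + y) + 20 * ((x + y) % 2) for x in range(64)] for y in range(64)]
thumb = box_downscale(full, 32)     # a 32x32 "thumbnail" of the same picture

small_from_full = box_downscale(full, 8)
small_from_thumb = box_downscale(thumb, 8)
max_gap = max(
    abs(a - b)
    for row_a, row_b in zip(small_from_full, small_from_thumb)
    for a, b in zip(row_a, row_b)
)
print(max_gap)   # effectively zero: both paths land on the same 8x8 grid
```

The checkerboard texture is averaged away in both paths, which is the low-pass behavior the step relies on.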

Step 2: Three Classic Hash Constructions

Average hash (aHash)

The simplest. Downscale to 8x8, compute the mean pixel value, then set each bit to 1 if that pixel is brighter than the mean and 0 otherwise. The result is a 64-bit code.

import numpy as np

def average_hash(path):
    px = np.asarray(normalize(path, 8), dtype=np.float64)
    return px > px.mean()   # 8x8 boolean grid -> 64 bits

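aHash is fast and intuitive, but every bit depends on a single global mean, which makes it brittle: an edit that moves the mean relative to the pixels (a gamma curve, clipped highlights, a bright border or watermark) can flip many bits at once. A purely uniform brightness offset, by contrast, shifts the mean along with the pixels and changes little.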

Difference hash (dHash)

Instead of comparing each pixel to a global mean, compare *adjacent* pixels. Downscale to 9x8 (nine columns, eight rows), then for each row set a bit to 1 if a pixel is brighter than the pixel to its right. You get 8x8 = 64 comparisons, 64 bits.

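def difference_hash(path):
    # resize takes (width, height): 9 columns x 8 rows, in a single resample
    img = Image.open(path).convert("L").resize((9, 8), Image.LANCZOS)
    px = np.asarray(img, dtype=np.float64)
    return px[:, :-1] > px[:, 1:]   # 1 where a pixel is brighter than its right neighbor -> 64 bits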

dHash encodes *relative gradients* rather than absolute brightness, so a uniform brightness or contrast change leaves most bits untouched. It is the workhorse default in many production dedup pipelines: cheap, and noticeably more robust than aHash.
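A small experiment makes the difference between the two thresholds concrete. This is a sketch under stated assumptions: the grid is a made-up, already-downscaled luminance grid, the gamma curve (exponent 2.0) is applied directly to that small grid, and the helper names are hypothetical. In a real pipeline the adjustment happens before downscaling, so dHash is only approximately stable rather than exactly stable as it is here.

```python
def average_bits(grid):
    """aHash-style bits: 1 where a pixel is brighter than the global mean."""
    flat = [v for row in grid for v in row]
    mean = sum(flat) / len(flat)
    return [v > mean for v in flat]

def difference_bits(grid):
    """dHash-style bits: 1 where a pixel is brighter than its right neighbor."""
    return [row[i] > row[i + 1] for row in grid for i in range(len(row) - 1)]

def flipped(a, b):
    """Number of positions where two bit lists disagree."""
    return sum(x != y for x, y in zip(a, b))

def gamma(grid, g=2.0):
    """Pixel-wise gamma curve on 0..255 values (g=2.0 darkens midtones)."""
    return [[255 * (v / 255) ** g for v in row] for row in grid]

# Hypothetical already-downscaled grid: 8 rows x 9 columns of luminance values.
row = [0, 60, 140, 255, 140, 60, 0, 255, 255]
grid9 = [row[:] for _ in range(8)]
grid8 = [r[:8] for r in grid9]          # aHash uses an 8x8 grid

a_flips = flipped(average_bits(grid8), average_bits(gamma(grid8)))
d_flips = flipped(difference_bits(grid9), difference_bits(gamma(grid9)))
print(a_flips, d_flips)
```

The gamma curve preserves the ordering of any two pixels, so every left-versus-right comparison keeps its answer and `d_flips` is zero. It does not preserve the mean, so midtone pixels that sat just above the old mean fall below the new one and `a_flips` is greater than zero.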

Perceptual hash (pHash, the DCT method)

The most robust of the classics. Downscale to a larger grid (typically 32x32), apply a 2D Discrete Cosine Transform (DCT), and keep only the top-left block of low-frequency coefficients (commonly 8x8). The DCT concentrates the coarse, slow-varying structure into those low-frequency terms -- exactly the part that survives compression and resizing -- and discards the high-frequency detail that does not. Threshold those coefficients against their median (skipping the very first DC term, which just encodes overall brightness) to produce 64 bits.

from scipy.fft import dct

def phash(path):
    px = np.asarray(normalize(path, 32), dtype=np.float64)
    d = dct(dct(px, axis=0, norm="ortho"), axis=1, norm="ortho")
    low = d[:8, :8]                      # keep low-frequency block
    med = np.median(low.flatten()[1:])   # median of the 63 AC terms; skips only the DC term at [0,0]
    return low > med                     # 64 bits
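To see why the low-frequency block is the stable part, it can help to look at a single row in one dimension. The sketch below writes the orthonormal DCT-II out from its definition (the same normalization as `norm="ortho"` above) and compares a smooth row with a copy that carries alternating pixel noise. The signal, the noise amplitude of 4, and the helper name `dct_ortho` are illustrative assumptions.

```python
import math

def dct_ortho(signal):
    """Orthonormal 1D DCT-II, written out directly from its definition."""
    n = len(signal)
    out = []
    for k in range(n):
        s = sum(v * math.cos(math.pi * (i + 0.5) * k / n) for i, v in enumerate(signal))
        scale = math.sqrt(1 / n) if k == 0 else math.sqrt(2 / n)
        out.append(scale * s)
    return out

# One hypothetical 32-pixel row: a smooth bump (coarse structure) ...
smooth = [128 + 100 * math.sin(math.pi * i / 31) for i in range(32)]
# ... and the same row with +/-4 alternating pixel noise (fine detail).
noisy = [v + (4 if i % 2 == 0 else -4) for i, v in enumerate(smooth)]

a, b = dct_ortho(smooth), dct_ortho(noisy)
low_shift = max(abs(x - y) for x, y in zip(a[:8], b[:8]))   # low-frequency terms
high_shift = abs(a[31] - b[31])                              # highest-frequency term
print(round(low_shift, 2), round(high_shift, 2))
```

The fine-grained noise lands almost entirely in the highest-frequency coefficient, while the first eight coefficients move by roughly one unit each. Those first coefficients are the ones pHash keeps.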

pHash costs more (a transform instead of a threshold) but it is the most resistant to compression, gamma shifts, and minor edits, which is why "perceptual hash" colloquially means pHash. A close cousin, wavelet hash (wHash), swaps the DCT for a Haar wavelet transform and keeps the low-frequency wavelet coefficients; it behaves similarly and can be marginally more robust to small spatial shifts because wavelets are localized in space as well as frequency.
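All three constructions above return an 8x8 boolean grid. For storage and indexing it is common to pack those 64 bits into a single integer or a fixed-width hex string. The sketch below shows one way to do that; the function names and the most-significant-bit-first ordering are choices made here, not a standard.

```python
def bits_to_int(bits):
    """Pack an iterable of truthy/falsy values into one integer, first bit most significant."""
    value = 0
    for bit in bits:
        value = (value << 1) | (1 if bit else 0)
    return value

def int_to_hex(value):
    """Fixed-width 16-character hex string for a 64-bit code."""
    return f"{value:016x}"

# Hypothetical 64-bit pattern standing in for a flattened hash grid.
bits = [i % 2 == 0 for i in range(64)]
code = bits_to_int(bits)
hex_code = int_to_hex(code)
print(hex_code)
```

For the grids returned by the functions above, passing `grid.flatten()` to `bits_to_int` should work, since each element is simply tested for truthiness. Whatever bit order you pick, use the same one for every hash in the index.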

Step 3: Matching With Hamming Distance

A perceptual hash is useful only because of how you compare two of them. The codes are bit strings, and the distance metric is the Hamming distance: the number of bit positions that differ.

def hamming(a, b):
    return int(np.count_nonzero(a.flatten() != b.flatten()))

The whole design pays off here. Because visually similar images produce similar codes, a re-compressed or resized copy lands a *small* Hamming distance from the original -- typically 0 to 6 bits out of 64 -- while an unrelated image sits far away, usually 25 to 35 bits (a random pair of 64-bit codes differs in about 32 bits on average). That wide separation is what gives you a clean identity threshold.
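The "about 32 bits" figure, and how unlikely a chance match is, follow from a simple binomial model. The sketch below assumes the 64 bits of two unrelated codes are independent and unbiased. Real hash bits are correlated, so real impostor distances have heavier tails than this model predicts, and the numbers are an idealized bound rather than a measurement.

```python
from math import comb

BITS = 64

def chance_within(r, bits=BITS):
    """P(distance <= r) for two independent, uniformly random codes."""
    return sum(comb(bits, k) for k in range(r + 1)) / 2 ** bits

# Expected distance and spread for a random pair under the same model.
mean_distance = sum(k * comb(BITS, k) for k in range(BITS + 1)) / 2 ** BITS
std_distance = (BITS * 0.25) ** 0.5     # binomial standard deviation, sqrt(n * p * (1 - p))

print(mean_distance, std_distance, chance_within(10))
```

Under this model a random pair sits at 32 bits with a standard deviation of 4, and the chance of a random pair landing within 10 bits is on the order of one in a hundred million.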

Distance 0 to ~5: almost certainly the same image (re-saved, resized, lightly filtered).

Distance ~6 to ~10: likely the same content with a heavier edit (a crop, a watermark, a strong filter). This is the gray zone where you trade false accepts against false rejects.

Distance > ~12: treat as different.

Set the exact cutoff the way you tune any retriever: with a labeled set of true near-duplicates and impostor pairs, picking the threshold that gives the precision/recall you need. Different hash constructions want different cutoffs, so calibrate per hash type.
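A calibration pass can be as simple as sweeping the cutoff over labeled distances. This is a minimal sketch: the distance lists are made up for illustration, and `precision_recall` and `sweep` are hypothetical helper names.

```python
def precision_recall(duplicate_distances, impostor_distances, threshold):
    """Treat distance <= threshold as a match and score it against the labels."""
    tp = sum(d <= threshold for d in duplicate_distances)
    fp = sum(d <= threshold for d in impostor_distances)
    fn = len(duplicate_distances) - tp
    precision = tp / (tp + fp) if (tp + fp) else 1.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall

def sweep(duplicate_distances, impostor_distances, max_threshold=16):
    """Return (threshold, precision, recall) rows for every candidate cutoff."""
    return [
        (t, *precision_recall(duplicate_distances, impostor_distances, t))
        for t in range(max_threshold + 1)
    ]

# Made-up distances for illustration only; real values come from your labeled pairs.
dups = [0, 2, 3, 5, 8]
imps = [9, 27, 30, 33, 35]
for t, p, r in sweep(dups, imps):
    print(t, round(p, 2), round(r, 2))
```

Reading the rows top to bottom shows the trade: raising the cutoff recovers the heavily edited duplicate at distance 8, and one step later it starts admitting the impostor at distance 9.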

Step 4: Searching Millions of Hashes Without a Linear Scan

Comparing one query hash to one reference is trivial. Comparing it to 200 million references on every lookup is not. A brute-force Hamming scan over a large catalog is too slow for an interactive agent, so production systems use one of two structures.

BK-tree. A metric tree built for discrete distances like Hamming. Each node stores a hash, and children are bucketed by their exact distance to the parent. A query with threshold *r* only descends into children whose stored distance lies within *r* of the query's distance to the node, pruning most of the tree. This turns a linear scan into a sublinear traversal for small *r*.
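The pruning rule is easier to follow in code. Below is a compact BK-tree sketch over hashes stored as Python integers. The class and method names are hypothetical, and a production version would also carry image ids and persist the tree.

```python
def hamming_int(a, b):
    """Hamming distance between two integer codes via XOR and popcount."""
    return bin(a ^ b).count("1")

class BKTree:
    def __init__(self):
        self.root = None            # node = [hash_value, {distance: child_node}]

    def insert(self, h):
        if self.root is None:
            self.root = [h, {}]
            return
        node = self.root
        while True:
            d = hamming_int(h, node[0])
            if d == 0:
                return              # identical code already stored
            child = node[1].get(d)
            if child is None:
                node[1][d] = [h, {}]
                return
            node = child            # descend into the bucket for distance d

    def query(self, h, r):
        """Return sorted (distance, hash) pairs within Hamming distance r."""
        results = []
        stack = [self.root] if self.root is not None else []
        while stack:
            value, children = stack.pop()
            d = hamming_int(h, value)
            if d <= r:
                results.append((d, value))
            # Triangle inequality: only buckets in [d - r, d + r] can hold matches.
            for edge, child in children.items():
                if d - r <= edge <= d + r:
                    stack.append(child)
        return sorted(results)

base = 0xF0F0F0F0F0F0F0F0
near = base ^ 0b111                 # 3 bits away from base
far = base ^ 0xFFFFFFFF             # 32 bits away from base
tree = BKTree()
for h in (base, near, far):
    tree.insert(h)
results = tree.query(base ^ 1, 4)
print(results)
```

The query never visits the `far` node: its bucket distance of 32 lies outside the window of edges the query is allowed to follow.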

Multi-index hashing (the banding trick). Split each 64-bit hash into, say, 4 bands of 16 bits. If two hashes are within Hamming distance 3 over the full code, then by the pigeonhole principle at least one of the 4 bands must match *exactly* (3 differing bits cannot cover all 4 bands). More generally, m bands guarantee a hit for any distance up to m - 1, so a wider search radius needs more, narrower bands. So you build 4 exact-match hash tables, one per band; at query time you look up each band, union the candidates, and only then compute full Hamming distance on that small candidate set. This converts approximate matching into a handful of exact lookups plus a cheap verification.

def band_keys(h):
    bits = h.flatten()
    return [bytes(np.packbits(bits[i:i+16])) for i in range(0, 64, 16)]
# index[band_position][band_key] -> set(image_ids); query unions the four buckets
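The build-and-query loop around those band keys might look like the following. This sketch stores hashes as Python integers rather than NumPy grids, and the `BandIndex` class, its method names, and the image ids are all hypothetical.

```python
from collections import defaultdict

BANDS = 4
BAND_BITS = 16
MASK = (1 << BAND_BITS) - 1

def hamming_int(a, b):
    """Hamming distance between two integer codes via XOR and popcount."""
    return bin(a ^ b).count("1")

def band_values(h):
    """Split a 64-bit integer code into 4 consecutive 16-bit bands."""
    return [(h >> (BAND_BITS * i)) & MASK for i in range(BANDS)]

class BandIndex:
    def __init__(self):
        # tables[band_position][band_value] -> set of image ids
        self.tables = [defaultdict(set) for _ in range(BANDS)]
        self.codes = {}

    def add(self, image_id, h):
        self.codes[image_id] = h
        for pos, value in enumerate(band_values(h)):
            self.tables[pos][value].add(image_id)

    def query(self, h, r=BANDS - 1):
        """Exact recall is only guaranteed for r <= BANDS - 1."""
        candidates = set()
        for pos, value in enumerate(band_values(h)):
            candidates |= self.tables[pos].get(value, set())
        hits = [(hamming_int(h, self.codes[i]), i) for i in candidates]
        return sorted((d, i) for d, i in hits if d <= r)

index = BandIndex()
base = 0x0123456789ABCDEF
index.add("ref-a", base)
index.add("ref-b", base ^ 0xFFFF0000FFFF0000)      # 32 bits away from ref-a
probe = base ^ ((1 << 0) | (1 << 16) | (1 << 32))  # one flipped bit in three different bands
matches = index.query(probe)
print(matches)
```

The probe differs from `ref-a` in three of the four bands, yet the untouched fourth band still produces an exact table hit, which is the pigeonhole guarantee at its limit.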

Both structures share a property worth noting for agents: inserts are cheap and incremental. Registering a newly banned or newly ingested image is an append, not a rebuild, so the reference set grows continuously without recomputing anything.

What Perceptual Hashing Cannot Do

The same coarse-structure trick that makes pHash robust is also its ceiling. Because it keys on the global low-frequency layout, it breaks under transformations that *rearrange* that layout:

Rotation and flipping move structure to new positions, so a 90-degree rotation reads as a different image. (Some pipelines hash a few canonical orientations to cover this.)

Heavy cropping that removes a large fraction of the frame changes the coarse layout enough to break the match.

A different photo of the same subject is not a near-duplicate at all -- it is a *semantically* similar but pixel-distinct image, which is exactly the embedding's job, not the hash's.

This is why mature systems pair the two. The hash is a fast, cheap, high-precision first pass for exact and near-exact duplicates; the embedding handles the semantic, transformation-heavy cases the hash cannot. A common pattern: hash everything on ingest to collapse obvious duplicates for free, then embed the survivors for semantic search.
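The hash-first ingest pattern can be sketched as a gate in front of the embedding step. The example below is a simplified illustration: it uses a linear scan where a real system would use one of the index structures from Step 4, and `split_duplicates`, the threshold of 5, and the image ids are assumptions made here.

```python
def hamming_int(a, b):
    """Hamming distance between two integer codes via XOR and popcount."""
    return bin(a ^ b).count("1")

def split_duplicates(items, threshold=5):
    """Greedy first pass: keep the first image of each near-duplicate group.

    items is a list of (image_id, hash_int). Returns (survivors, duplicate_of),
    where survivors still need embedding and duplicate_of maps a dropped id
    to the id it matched.
    """
    survivors = []
    duplicate_of = {}
    for image_id, h in items:
        match = next(
            (sid for sid, sh in survivors if hamming_int(h, sh) <= threshold),
            None,
        )
        if match is None:
            survivors.append((image_id, h))
        else:
            duplicate_of[image_id] = match
    return [sid for sid, _ in survivors], duplicate_of

a = 0x0F0F0F0F0F0F0F0F
b = 0xFFFF0000FFFF0000              # 32 bits away from a
items = [("a", a), ("a-resaved", a ^ 0b11), ("b", b)]
to_embed, duplicate_of = split_duplicates(items)
print(to_embed, duplicate_of)
```

Only the ids in `to_embed` go on to the embedding model; the dropped ids keep a pointer to the image that represents them.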

Hashing vs Embeddings: Pick by the Question

Question the agent asks	Right tool	Why
Is this the same image I already have?	Perceptual hash	Specific, transformation-robust, clean yes/no via Hamming distance
Has this exact picture been banned before?	Perceptual hash	Identity-level match against a known-content set, cheap and incremental
Find visually similar products	Image embedding	Generalizes across instances and viewpoints
Collapse near-identical frames before embedding	Perceptual hash	Sub-millisecond dedup avoids paying to embed redundant frames
Find a different photo of this subject	Image embedding	The hash is too literal; this is semantic similarity

A capable agent keeps both in its toolbox and routes by intent: identity questions to the hash index, similarity questions to the vector index.

In Mixpeek

In Mixpeek terms, the two tools live behind the same retrieval surface but use different feature extractors. A reference library is ingested into a collection with a perceptual hash extractor, which builds the Hamming-searchable index described above; the same source can also carry an image embedding extractor for semantic search. An agent's tool then chooses the path that fits the query.

{
  "collection": "image_references",
  "feature_extractors": [
    { "feature": "perceptual_hash", "model": "phash-dct-64" },
    { "feature": "image_embedding", "model": "google/siglip-base-patch16-224" }
  ]
}

An identity query ("is this upload already in the library, or does it match a banned image?") runs against the hash index and returns the matched reference id plus the Hamming distance as a confidence signal. A semantic query ("find product photos that look like this one") runs against the embedding index. A high-value pattern for video: run the perceptual hash extractor over sampled frames first to collapse the long runs of near-identical frames a static shot produces, so the embedding extractor only pays for visually distinct keyframes. Because hash ingestion is append-only, registering a new reference is an incremental insert -- the agent can add an image and immediately match against it without recomputing the catalog.
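The frame-collapsing idea is independent of any particular platform and can be sketched generically. The code below is not Mixpeek's API: `keyframe_indices`, the threshold of 5, and the integer frame hashes are hypothetical, and it assumes one hash per sampled frame has already been computed.

```python
def hamming_int(a, b):
    """Hamming distance between two integer codes via XOR and popcount."""
    return bin(a ^ b).count("1")

def keyframe_indices(frame_hashes, threshold=5):
    """Keep a frame only when it differs enough from the last kept frame."""
    kept = []
    last = None
    for i, h in enumerate(frame_hashes):
        if last is None or hamming_int(h, last) > threshold:
            kept.append(i)
            last = h
    return kept

shot_a = 0x0F0F0F0F0F0F0F0F
shot_b = 0xFFFF0000FFFF0000         # 32 bits away from shot_a
frames = [shot_a, shot_a ^ 1, shot_a ^ 3, shot_b, shot_b ^ 4, shot_a]
kept = keyframe_indices(frames)
print(kept)
```

Comparing against the last kept frame, rather than the immediately preceding one, stops slow drift from slipping through one small step at a time. A return to an earlier shot is kept again because this version only collapses consecutive runs.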

Key Takeaways

1. Perceptual hashing is identity, not similarity. It answers "is this the same picture?" and survives re-compression, resizing, and filtering, where a cryptographic hash flips entirely and a semantic embedding generalizes to the wrong instance.

2. Robustness comes from coarse structure. Grayscale, downscale hard, and keep only low-frequency information (a global mean for aHash, adjacent-pixel gradients for dHash, low-frequency DCT coefficients for pHash); the fragile high-frequency detail is exactly what transformations destroy.

3. Hamming distance gives a clean threshold. Near-duplicates land within a few bits of 64 while unrelated images sit near 32 bits, so a single distance cutoff makes a reliable yes/no decision -- calibrate it per hash type with labeled pairs.

4. Scale with BK-trees or multi-index banding. Pigeonhole banding turns approximate matching into a few exact lookups plus cheap verification, and both structures support cheap incremental inserts.

5. Use embeddings and hashes together. The hash is a fast, high-precision first pass for exact and near-exact duplicates and frame dedup; the embedding handles rotation, heavy crops, and semantically similar but pixel-distinct images the hash cannot.
