    Perception
    16 min read
    Updated 2026-06-30

    Instance-Level Visual Matching: Finding the Same Object, Not Just Similar Ones

    A vendor-neutral guide to geometric visual matching — keypoint detection, local descriptors, descriptor matching, and RANSAC geometric verification — the pipeline an agent uses to confirm two images contain the *same* physical object or scene, which a similarity embedding cannot decide on its own.

    Keypoint Matching
    SuperPoint
    RANSAC
    Geometric Verification
    Instance Retrieval
    Agent Perception

    Similar Is Not the Same



    A vision embedding answers one question well: *how visually similar are these two images?* Two red sneakers, two photos of the Eiffel Tower, two invoices with the same template all score high. That is exactly what you want for "find more like this." It is exactly what you do not want when the question is "is this the same physical object, scene, or document?" — find every photo of *this specific* painting; detect that *this* logo crop was reused; confirm two frames show the *same* location despite a different angle. A cosine score of 0.9 cannot tell "the same thing from a new viewpoint" apart from "a different but similar thing."

    Deciding *sameness* needs geometric correspondence: not a single vector per image, but a set of local points that can be matched between two images and then checked for a consistent spatial transformation. The classic pipeline has four stages (detect → describe → match → verify), and the last stage is where "some features matched" becomes "these points agree on one transformation, so this is the same scene."

    Stage 1 — Detection: Repeatable Keypoints



    A keypoint is a small, distinctive location — a corner, a blob, a junction — that can be re-found in another image of the same scene under a different scale, rotation, or lighting. *Repeatability* is the whole game: a detector is only useful if it fires on the same physical points across views.

    Classic detectors find these analytically: the Difference-of-Gaussians extrema used by SIFT locate blob-like structures across a scale pyramid (giving scale invariance); Harris-style detectors find corners. Learned detectors like SuperPoint train a small CNN to predict keypoint locations *and* their descriptors in one pass, learning repeatability directly from data rather than hand-designed filters. Either way the output has the same shape: a sparse set of (x, y) locations per image, with a scale and orientation attached when the detector estimates them (SIFT does; SuperPoint returns locations and confidence scores only).
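
    To make "corner detection" concrete, here is a minimal pure-Python sketch of the Harris corner response on a tiny synthetic image. It is an illustration, not a production detector: the function names are made up for this example, the window is a plain 3×3 box (real implementations use Gaussian weighting), and there is no non-maximum suppression or scale handling.

```python
def harris_response(img, k=0.04):
    """Harris response R = det(M) - k * trace(M)^2 for every interior pixel.

    M is the structure tensor: gradient products summed over a 3x3 window.
    k = 0.04 is a conventional choice, treated here as an assumption.
    """
    h, w = len(img), len(img[0])
    ix = [[0.0] * w for _ in range(h)]
    iy = [[0.0] * w for _ in range(h)]
    # Central-difference image gradients (border pixels stay at zero).
    for y in range(1, h - 1):
        for x in range(1, w - 1):
            ix[y][x] = (img[y][x + 1] - img[y][x - 1]) / 2.0
            iy[y][x] = (img[y + 1][x] - img[y - 1][x]) / 2.0
    resp = [[0.0] * w for _ in range(h)]
    for y in range(1, h - 1):
        for x in range(1, w - 1):
            sxx = syy = sxy = 0.0
            for dy in (-1, 0, 1):
                for dx in (-1, 0, 1):
                    gx, gy = ix[y + dy][x + dx], iy[y + dy][x + dx]
                    sxx += gx * gx
                    syy += gy * gy
                    sxy += gx * gy
            resp[y][x] = sxx * syy - sxy * sxy - k * (sxx + syy) ** 2
    return resp


def strongest_keypoints(resp, n):
    """Return the n pixels with the highest response as (x, y) tuples."""
    scored = [(resp[y][x], x, y)
              for y in range(len(resp)) for x in range(len(resp[0]))]
    scored.sort(key=lambda t: t[0], reverse=True)
    return [(x, y) for _, x, y in scored[:n]]


# A 12x12 black image with a bright 4x4 square: its four corners should win.
image = [[1.0 if 4 <= x <= 7 and 4 <= y <= 7 else 0.0 for x in range(12)]
         for y in range(12)]
keypoints = strongest_keypoints(harris_response(image), 4)
print(sorted(keypoints))  # [(4, 4), (4, 7), (7, 4), (7, 7)]
```

    Pixels along the square's edges have gradient energy in only one direction and score negative, while flat regions score zero. Only the corners, where both gradient directions are present, produce a strongly positive response.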

    Stage 2 — Description: A Fingerprint per Keypoint



    Each keypoint gets a descriptor — a vector summarizing the pixel pattern in its local neighborhood, built to be invariant to the nuisances that don't change identity (rotation, illumination, small viewpoint change). SIFT's descriptor is a 128-d histogram of gradient orientations in a grid around the point; learned descriptors (SuperPoint and successors) output a vector trained so that the *same* physical point in two images lands close in descriptor space and *different* points land far apart.
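
    As a rough illustration of the idea behind a gradient-orientation descriptor, the sketch below builds a single 8-bin, magnitude-weighted orientation histogram for one patch and L2-normalizes it. This is a deliberate simplification: SIFT concatenates 16 such histograms from a 4×4 grid of cells (hence 128 dimensions) and adds orientation normalization and other refinements. The function name and the toy patch are assumptions made for the example.

```python
import math


def orientation_histogram(patch, bins=8):
    """Magnitude-weighted histogram of gradient orientations, L2-normalized."""
    hist = [0.0] * bins
    for y in range(1, len(patch) - 1):
        for x in range(1, len(patch[0]) - 1):
            gx = (patch[y][x + 1] - patch[y][x - 1]) / 2.0
            gy = (patch[y + 1][x] - patch[y - 1][x]) / 2.0
            angle = math.atan2(gy, gx) + math.pi      # shift into [0, 2*pi]
            b = int(angle / (2 * math.pi) * bins) % bins
            hist[b] += math.hypot(gx, gy)
    norm = math.sqrt(sum(v * v for v in hist))
    return [v / norm for v in hist] if norm > 0 else hist


# Toy 6x6 patch, plus the same patch under a brightness offset and contrast gain.
patch = [[float(x * x + 3 * y) for x in range(6)] for y in range(6)]
brighter = [[2.0 * v + 10.0 for v in row] for row in patch]

d1 = orientation_histogram(patch)
d2 = orientation_histogram(brighter)
max_diff = max(abs(a - b) for a, b in zip(d1, d2))
print(max_diff < 1e-9)  # True: the descriptor ignores offset and gain
```

    The offset cancels when gradients are taken and the gain cancels in the normalization, which is one way a descriptor can be made insensitive to illumination while still reflecting local structure.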

    The key difference from a whole-image embedding: there are *hundreds* of these vectors per image, each tied to a specific location, not one global vector. That locality is what makes geometric reasoning possible in the next two stages.

    Stage 3 — Matching: Putting Points in Correspondence



    Given descriptors from image A and image B, matching finds, for each point in A, its most similar point in B (nearest neighbor in descriptor space). Raw nearest-neighbor matching is noisy, so the classic filter is Lowe's ratio test: keep a match only if the best neighbor is meaningfully closer than the *second*-best (a low first-to-second distance ratio). A point that matches two candidates almost equally well is ambiguous — discard it.
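
    The ratio test can be sketched in a few lines of pure Python. The descriptors below are made-up 3-d vectors for readability, the function name is illustrative, and the 0.75 threshold is a commonly used value that should be treated as a tunable assumption. Brute-force search stands in for whatever nearest-neighbor index a real system would use.

```python
import math


def euclidean(u, v):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def ratio_test_matches(desc_a, desc_b, ratio=0.75):
    """Return (index_in_a, index_in_b) pairs that pass Lowe's ratio test."""
    matches = []
    for i, d in enumerate(desc_a):
        ranked = sorted((euclidean(d, e), j) for j, e in enumerate(desc_b))
        if len(ranked) < 2:
            continue  # need a second-best neighbor to compare against
        (best, j), (second, _) = ranked[0], ranked[1]
        if best < ratio * second:
            matches.append((i, j))
    return matches


desc_a = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.5, 0.5, 0.0]]
desc_b = [[0.9, 0.1, 0.0], [0.0, 0.95, 0.05], [0.0, 0.0, 1.0]]

matches = ratio_test_matches(desc_a, desc_b)
print(matches)  # [(0, 0), (1, 1)]
```

    The third descriptor in `desc_a` sits almost equally close to two candidates in `desc_b` (distance ratio about 0.84), so it is dropped as ambiguous even though it has a nearest neighbor.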

    Modern learned matchers go further by reasoning about all matches *jointly* instead of independently. SuperGlue and its faster successor LightGlue use an attention-based graph network that considers the full set of keypoints in both images and their spatial layout, which resolves repetitive structure (rows of windows, tiled patterns) that breaks independent matching. LoFTR drops explicit keypoints entirely and matches densely in a coarse-to-fine way, which is what you want on textureless or low-feature surfaces where a detector finds nothing stable.

    Stage 4 — Geometric Verification: The Step That Decides Sameness



    Even after the ratio test, a chunk of matches are wrong. The decisive insight is that correct matches all agree on a single geometric transformation between the two images, and wrong ones don't. If both images show the same planar object, the correct correspondences are related by a homography (a 3×3 projective transform); for a rigid 3D scene from two viewpoints, by the fundamental/essential matrix (epipolar geometry).

    You recover that transform with RANSAC (Random Sample Consensus), which is robust to a match set that is mostly outliers:

    1. Randomly sample the minimal number of matches needed to fit the model (4 for a homography).
    2. Fit the candidate transform from that sample.
    3. Count inliers: every other match that the transform explains within a pixel tolerance.
    4. Repeat many times; keep the model with the most inliers, then refit on all of them.

    The output that matters is the inlier count. "37 geometrically consistent correspondences" is a fundamentally stronger statement than "cosine 0.91": it means there exists a real spatial mapping under which these points line up, which is the operational definition of *same scene*. A handful of inliers is usually coincidence; dozens of inliers under a clean transform is strong evidence of a match. That number is an explainable same/not-same score, though the threshold that separates the two depends on the detector, the image resolution, and the RANSAC tolerance, so calibrate it on labeled pairs from your own data.
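
    The RANSAC loop for a homography can be sketched in pure Python as follows. This is a teaching sketch under stated assumptions, not a replacement for a library routine: all function names are illustrative, the homography is parameterized with its last entry fixed to 1 (which fails for the rare transforms where that entry is near zero), the final least-squares refit on all inliers is omitted for brevity, and the tolerance, iteration count, and synthetic data are arbitrary.

```python
import random


def solve_linear(a, b):
    """Gaussian elimination with partial pivoting; None if (near) singular."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < 1e-9:
            return None
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        s = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - s) / m[r][r]
    return x


def fit_homography(pairs):
    """Exact homography (8 parameters, last entry fixed to 1) from 4 matches."""
    a, b = [], []
    for (x, y), (u, v) in pairs:
        a.append([x, y, 1, 0, 0, 0, -u * x, -u * y])
        b.append(u)
        a.append([0, 0, 0, x, y, 1, -v * x, -v * y])
        b.append(v)
    return solve_linear(a, b)


def project(h, pt):
    x, y = pt
    w = h[6] * x + h[7] * y + 1.0
    if abs(w) < 1e-12:
        return None
    return ((h[0] * x + h[1] * y + h[2]) / w, (h[3] * x + h[4] * y + h[5]) / w)


def ransac_homography(matches, iters=500, tol=2.0, seed=0):
    """Return (best_h, inlier_indices) over `iters` random minimal samples."""
    rng = random.Random(seed)
    best_h, best_inliers = None, []
    for _ in range(iters):
        h = fit_homography(rng.sample(matches, 4))
        if h is None:
            continue  # degenerate sample, e.g. collinear points
        inliers = []
        for i, (p, q) in enumerate(matches):
            r = project(h, p)
            if r and (r[0] - q[0]) ** 2 + (r[1] - q[1]) ** 2 <= tol * tol:
                inliers.append(i)
        if len(inliers) > len(best_inliers):
            best_h, best_inliers = h, inliers
    return best_h, best_inliers


# Synthetic data: 12 matches that obey one known homography, then 5 wrong ones.
TRUE_H = [1.2, 0.1, 5.0, -0.1, 1.1, 3.0, 0.001, 0.0005]
points = [(10, 10), (200, 15), (30, 180), (190, 170), (100, 50), (60, 120),
          (150, 90), (120, 160), (40, 60), (170, 40), (80, 20), (20, 130)]
matches = [(p, project(TRUE_H, p)) for p in points]
matches += [((50, 50), (300, 10)), ((90, 140), (5, 250)),
            ((130, 30), (220, 220)), ((160, 110), (40, 5)),
            ((70, 90), (310, 300))]

h, inliers = ransac_homography(matches)
print(len(inliers), "inliers out of", len(matches))  # 12 inliers out of 17
```

    The iteration count here is generous. The standard estimate for reaching confidence p with inlier fraction w and sample size s is log(1 − p) / log(1 − w^s); with w ≈ 12/17, s = 4, and p = 0.99 that comes to roughly 17 iterations, and it grows quickly as the inlier fraction drops.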

    Doing It at Scale: Retrieve, Then Verify



    You cannot run RANSAC between a query and every image in a million-item archive — geometric matching is far too expensive for a full scan. So instance retrieval uses the same two-stage pattern as text retrieve-then-rerank:

    1. Recall with a global embedding. A whole-image vector (CLIP, DINO, or a retrieval-tuned global descriptor) indexed in a vector store returns the top-K *visually similar* candidates cheaply via approximate nearest-neighbor search. This stage optimizes recall: get the true match into the shortlist.
    2. Verify with local matching. Run keypoint matching + RANSAC only on those K candidates, and re-rank by inlier count. This stage optimizes precision: promote the candidate that is geometrically the same and demote lookalikes.

    The embedding stage makes the search tractable; the geometric stage makes it *correct* about identity. Neither alone is enough: embeddings over-retrieve similar-but-different items, and geometric matching is too slow to be the first filter.
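
    The two-stage flow can be sketched as below. Everything here is illustrative: `retrieve_then_verify` and `count_inliers` are hypothetical names, an exact sort over a small dictionary stands in for an approximate nearest-neighbor index, the inlier counts are hard-coded stand-ins for a real matching + RANSAC step, and the `min_inliers=15` cutoff is an arbitrary value that would need tuning on real data.

```python
import math


def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))


def retrieve_then_verify(query_vec, index, count_inliers, k=3, min_inliers=15):
    """Shortlist by embedding similarity, then re-rank and filter by inliers.

    index:         {image_id: global_embedding}
    count_inliers: callable(image_id) -> int, standing in for stages 3 and 4
    """
    # Stage 1 (recall): top-k most similar embeddings.
    shortlist = sorted(index, key=lambda i: cosine(query_vec, index[i]),
                       reverse=True)[:k]
    # Stage 2 (precision): geometric verification only on the shortlist.
    scored = sorted(((count_inliers(i), i) for i in shortlist), reverse=True)
    return [(i, n) for n, i in scored if n >= min_inliers]


index = {
    "lookalike_a": [0.90, 0.10, 0.00],
    "lookalike_b": [0.80, 0.20, 0.10],
    "unrelated": [0.10, 0.90, 0.20],
    "same_object": [0.85, 0.15, 0.05],
}
# Pretend RANSAC results for each candidate (hard-coded for the example).
fake_inliers = {"lookalike_a": 4, "lookalike_b": 7, "unrelated": 0, "same_object": 41}

confirmed = retrieve_then_verify([1.0, 0.0, 0.0], index, fake_inliers.get)
print(confirmed)  # [('same_object', 41)]
```

    Note that `lookalike_a` has the highest cosine similarity to the query, yet only `same_object` survives verification. The expensive step runs on three candidates here instead of the whole index.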

    Where It Breaks



  1. Textureless or repetitive surfaces — blank walls, sky, tiled patterns — yield few or ambiguous keypoints. Detector-free dense matchers (LoFTR) and joint matchers (LightGlue) help, but identity on a featureless object is inherently hard.
  2. Extreme viewpoint or scale change stretches descriptor invariance past its limit; correspondence collapses.
  3. Deformable or articulated objects violate the rigid/planar assumption behind homography and epipolar models — RANSAC has no single transform to find.


    What This Unlocks for an Agent



    "Same vs similar" is a decision agents make constantly: copy and reuse detection, product matching across marketplaces, visual place recognition, document/edition matching, brand-asset tracking. An agent that only has cosine similarity can rank candidates but cannot *commit* to "this is the same one." Adding the geometric stage gives it a defensible answer — and the inlier count is an explanation it can act on or surface to a human.

    Doing It on Mixpeek



    The pattern maps directly onto a two-stage retriever: a visual `feature_search` stage recalls candidates by embedding, and a re-rank stage applies geometric verification to the shortlist.

    from mixpeek import Mixpeek

    mx = Mixpeek(api_key="API_KEY")

    # Stage 1 (recall): embedding nearest-neighbors get the visually-similar shortlist.
    # Stage 2 (precision): geometric re-rank promotes the SAME object by inlier count.
    results = mx.retrievers.execute(
        retriever_id="your-retriever-id",
        query="your search query",
    )


    The agent calls one retriever and gets back identity-confirmed matches, not just lookalikes. For the broader recall-then-precision pattern this generalizes, see multi-stage retrieval; for the global-embedding recall stage, see contrastive learning; and for the near-duplicate cousin of this problem, see perceptual image hashing.