    Intermediate
    Entertainment
    E-commerce
    Media
    6 min read

    Visual Taste & Recommendations

    Build visual recommendation engines that match on aesthetics, mood, and composition — not just metadata tags. Scene-similarity search with reinforcement learning from user behavior.

    Who It's For

    Streaming platforms, e-commerce companies, stock media libraries, and content marketplaces that want to recommend visually similar content based on what users actually engage with

    Problem Solved

    Collaborative filtering recommends what similar users watched. Tag-based systems recommend what has the same labels. Neither captures why a viewer chose a moody, rain-soaked thriller over a bright action sequence — the visual aesthetic, pacing, and emotional texture that define taste.

    See It in Action

    Upload a scene or image to find visually similar content ranked by aesthetic similarity

    Why Mixpeek

    Scene-level embeddings capture visual aesthetics that metadata tags miss. The retriever pipeline supports real-time reranking with RL signals without retraining. The same pipeline works for video, images, and short-form clips.

    Overview

    Visual taste is expressed in the textures, palettes, and compositions a user repeatedly selects — not in genre tags. A film buff who consistently picks dimly lit, slow-burn dramas and a viewer who always chooses high-saturation, fast-cut action films both click "Drama," but their visual preferences have nothing in common. Scene-similarity embeddings capture the visual signal that collaborative filtering and taxonomy matching miss.
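    As a minimal sketch of the idea, cosine similarity between scene embeddings scores aesthetic closeness directly. The three-dimensional vectors below are hypothetical stand-ins for real embeddings, which a vision model would produce:

    ```python
    import math

    def cosine_similarity(a, b):
        """Cosine similarity between two embedding vectors."""
        dot = sum(x * y for x, y in zip(a, b))
        norm_a = math.sqrt(sum(x * x for x in a))
        norm_b = math.sqrt(sum(y * y for y in b))
        return dot / (norm_a * norm_b)

    # Hypothetical scene embeddings (real ones come from a vision model)
    moody_thriller = [0.90, 0.10, 0.80]   # low-key lighting, slow pacing
    rainy_noir     = [0.85, 0.15, 0.75]   # similar aesthetic, different genre tag
    bright_action  = [0.10, 0.90, 0.20]   # high saturation, fast cuts

    print(cosine_similarity(moody_thriller, rainy_noir))     # close to 1.0
    print(cosine_similarity(moody_thriller, bright_action))  # much lower
    ```

    The thriller and the noir share no genre tag, yet their embeddings are nearly identical; tag matching would never pair them.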

    Challenges This Solves

    Metadata tags miss aesthetic preferences

    Genre, director, and cast tags describe content categories, not the visual and emotional qualities that drive viewing decisions

    Impact: Recommendation CTR plateaus as users learn the system recommends "more of the same category" rather than "more of what they actually like"

    Cold start for new content

    New titles have no engagement history, so collaborative filtering cannot rank them — they are invisible in recommendations until they accumulate clicks

    Impact: New content gets buried, reducing catalog utilization and hurting the discovery experience

    Cross-catalog similarity

    A user who liked a specific scene in one title may love visually similar content from a completely different genre or era — but keyword matching cannot find it

    Impact: Serendipitous discovery is eliminated; users churn when the catalog feels exhausted
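    The cold-start and cross-catalog challenges above reduce to the same mechanism: a title is embedded at ingest and ranked by visual similarity alone, so it needs no engagement history. A sketch, with all vectors and names hypothetical:

    ```python
    import math

    def cosine(a, b):
        dot = sum(x * y for x, y in zip(a, b))
        return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

    def rank_by_similarity(taste_vec, catalog):
        """Rank catalog items by visual similarity to a user's taste embedding."""
        return sorted(catalog, key=lambda item: cosine(taste_vec, item[1]), reverse=True)

    # Hypothetical embeddings; "new_title" was ingested today with zero clicks
    catalog = [
        ("established_hit", [0.2, 0.9]),
        ("new_title",       [0.8, 0.3]),
    ]
    user_taste = [0.9, 0.2]  # aggregated from scenes the user actually watched

    ranking = rank_by_similarity(user_taste, catalog)
    print([name for name, _ in ranking])  # new_title can rank first on similarity alone
    ```

    A collaborative filter could not surface `new_title` at all until it accumulated engagement; embedding-based ranking places it from the first ingest.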

    Recipe Composition

    This use case is composed of the following recipes, connected as a pipeline.

    1
    Feature Extraction

    Turn raw media into structured intelligence

    2
    Semantic Multimodal Search

    Find anything across video, image, audio, and documents

    3
    Taxonomy Enrichment Pipeline

    Classify content into custom or IAB taxonomies

    Retriever Stages Used

    Semantic Search

    Hybrid Search
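    Hybrid search fuses a semantic (embedding) ranking with a keyword ranking. One common fusion method is reciprocal rank fusion (RRF); the sketch below uses hypothetical result lists and is not Mixpeek's internal implementation:

    ```python
    def reciprocal_rank_fusion(rankings, k=60):
        """Fuse several ranked lists: each item scores sum of 1/(k + rank + 1)."""
        scores = {}
        for ranking in rankings:
            for rank, doc in enumerate(ranking):
                scores[doc] = scores.get(doc, 0.0) + 1.0 / (k + rank + 1)
        return sorted(scores, key=scores.get, reverse=True)

    # Hypothetical result lists from two retriever stages
    semantic_results = ["clip_a", "clip_b", "clip_c"]  # embedding similarity order
    keyword_results  = ["clip_b", "clip_c", "clip_a"]  # keyword match order

    fused = reciprocal_rank_fusion([semantic_results, keyword_results])
    print(fused)  # clip_b wins: it ranks well in both lists
    ```

    RRF rewards items that appear high in both rankings without requiring the two stages' scores to be on a comparable scale.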

    Expected Outcomes

    Recommendation CTR: +35-55% vs. tag-based systems

    Cold-start coverage: new content ranked from first ingest

    Catalog utilization: long-tail discovery improves 2-3x

    Build a visual taste recommendation engine

    Scene embeddings + RL reranking for aesthetics-driven recommendations.

    Estimated setup: 60 min
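    A minimal sketch of the RL-style reranking idea: base similarity scores are blended with a smoothed click-through estimate that updates online, so the ranking adapts to user behavior without retraining. The class, weights, and item names below are hypothetical:

    ```python
    class RLReranker:
        """Blend visual-similarity scores with a running click-through estimate."""

        def __init__(self, alpha=0.7):
            self.alpha = alpha   # weight on the visual-similarity score
            self.clicks = {}     # item -> click count
            self.views = {}      # item -> impression count

        def record(self, item, clicked):
            """Log one impression (and optionally a click) for an item."""
            self.views[item] = self.views.get(item, 0) + 1
            if clicked:
                self.clicks[item] = self.clicks.get(item, 0) + 1

        def ctr(self, item):
            # Laplace-smoothed click-through estimate; new items start near 0.5
            return (self.clicks.get(item, 0) + 1) / (self.views.get(item, 0) + 2)

        def rerank(self, candidates):
            """candidates: list of (item, similarity score in [0, 1])."""
            return sorted(
                candidates,
                key=lambda c: self.alpha * c[1] + (1 - self.alpha) * self.ctr(c[0]),
                reverse=True,
            )

    reranker = RLReranker(alpha=0.7)
    candidates = [("bright_action", 0.62), ("rainy_noir", 0.60)]

    # Simulate feedback: users click rainy_noir far more often
    for _ in range(50):
        reranker.record("rainy_noir", clicked=True)
        reranker.record("bright_action", clicked=False)

    print(reranker.rerank(candidates))  # rainy_noir overtakes the higher-similarity item
    ```

    The smoothing term keeps brand-new items from being penalized for having no feedback yet, which complements the cold-start behavior described above.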


    Ready to Implement This Use Case?

    Our team can help you get started with Visual Taste & Recommendations in your organization.