Turning Frames into DataFrames: AI-Powered Video Analytics
By applying the classic group_by pattern to structured video data at index time, you can turn raw frames into searchable, analyzable DataFrames aligned with how your users explore footage.

Suppose you want to run an analytical query on your basketball footage:
"Show me every jump shot from each season's highest scoring player where the opposing team is winning by 3 or less"
To answer it, you need a pipeline of modular, composable feature extractors that can split, aggregate, and merge across objects, actions, game context, and external stats.
Ok, so how does it all work together?
We Need to Create 3 Indexes
To support these advanced analytical and semantic search queries, we build three separate indexes:
- Objects
- Actions
- Game Context
Each index is constructed using group_by operations over key entities. This allows us to precompute, aggregate, and store enriched data up front.
Note: splitting videos often requires zero-shot or few-shot segmentation models.
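The snippets below lean on a small group_by helper and an index_store object. Both are stand-ins rather than a specific library; here is a minimal sketch of the grouping utility they assume:

```python
from collections import defaultdict

def group_by(items, key):
    """Bucket a list of dicts by the value of `key` (hypothetical helper used in the snippets below)."""
    groups = defaultdict(list)
    for item in items:
        groups[item[key]].append(item)
    return dict(groups)
```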
Object Index

- Split: Detect all player objects across frames.
- Group By: Player ID (using jersey OCR, pose, face recognition).
- External Join: Pull player stats via URL/API lookup (https://stats.nba.com/players/{id}/season_stats).
- Generate: Compute video clip embeddings for each player’s segments.
- Merge: Attach stats to each clip embedding.
```python
# Step 1: Split video into segments with detected players
clips = detect_objects_and_segment(video)  # returns list of {clip_id, player_id, timestamp, ...}

# Step 2: Group clips by player_id
grouped_by_player = group_by(clips, key="player_id")

# Step 3: Fetch external stats per player
def fetch_player_stats(player_id):
    url = f"https://stats.api.com/players/{player_id}/season_stats"
    return http_get(url)

# Step 4: Attach stats to each clip
enriched_clips = []
for player_id, player_clips in grouped_by_player.items():
    stats = fetch_player_stats(player_id)
    for clip in player_clips:
        clip["stats"] = stats
        enriched_clips.append(clip)

# Step 5: Save enriched clips to index
index_store.save("object_index", enriched_clips)
```
Action Index

- Split: Classify action segments (e.g., jump shots, dunks, assists).
- Group By: Player ID or action type.
- Aggregate: Summarize frequency, duration, success rate (if available).
- Merge: Tag clips for retrieval based on action + player pairings.
```python
# Step 1: Run action classifier to label clips
labeled_clips = classify_actions(video)  # returns list of {clip_id, action_label, player_id, timestamp, ...}

# Step 2: Filter for jump shots (materialize as a list so it can be grouped and saved)
jump_shots = [c for c in labeled_clips if c["action_label"] == "jump_shot"]

# Step 3: Group jump shots by player
jump_shots_by_player = group_by(jump_shots, key="player_id")

# Step 4: Enrich each jump shot with metadata (e.g., shot_clock, defender_distance)
for player_id, clips in jump_shots_by_player.items():
    for clip in clips:
        metadata = extract_context_metadata(clip["clip_id"])
        clip["metadata"] = metadata

# Step 5: Save to action index
index_store.save("action_index", jump_shots)
```
Game Context Index

- Split: Extract time, score, shot clock, and period using scoreboard overlays or synced metadata.
- Group By: Game ID or quarter.
- Compute: Derive score_diff, time_remaining, and other contextual flags.
- Merge: Add context to each clip (e.g., “opponent up by ≤ 3”).
```python
# Step 1: Extract scoreboard info from each clip
clips_with_scores = extract_scoreboard(video)  # returns list of {clip_id, team_score, opponent_score, timestamp, ...}

# Step 2: Compute score differential
for clip in clips_with_scores:
    clip["score_diff"] = clip["opponent_score"] - clip["team_score"]

# Step 3: Flag relevant clips (opponent winning by 3 or fewer points)
flagged_clips = [c for c in clips_with_scores if 0 < c["score_diff"] <= 3]

# Step 4: Tag with flag for downstream retrieval
for clip in flagged_clips:
    clip["flag"] = "close_game_opponent_leading"

# Step 5: Save to game context index
index_store.save("game_context_index", flagged_clips)
```
Query-Time Result
At retrieval, your retriever can now filter indexed clips with something like:
```sql
SELECT *
FROM jump_shot_clips
WHERE player_id = season_top_scorer
  AND score_diff <= 3
```
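Expressed over the three indexes directly, the same retrieval might look like the sketch below. It assumes index_store also exposes a load() counterpart to save(), and that the joined stats include a points_per_game field for picking the top scorer; both are assumptions, not a fixed API:

```python
# Sketch: combine the three precomputed indexes at query time.
object_index = index_store.load("object_index")          # clips + player stats (+ embeddings)
action_index = index_store.load("action_index")          # jump-shot clips
context_index = index_store.load("game_context_index")   # clips flagged with score_diff <= 3

# Hypothetical: pick the top scorer from the stats joined at index time.
season_top_scorer = max(object_index, key=lambda c: c["stats"]["points_per_game"])["player_id"]

close_game_clip_ids = {c["clip_id"] for c in context_index}
results = [
    c for c in action_index
    if c["player_id"] == season_top_scorer and c["clip_id"] in close_game_clip_ids
]
```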
Supported Content-Based Queries by Input Type
Input Type | Example Query | Index Used | Feature Extractors Involved | Query Mechanics |
---|---|---|---|---|
Text | “Jump shots by the top scorer when the team is losing by ≤3” | Objects, Actions, Context | Action classifier, Score diff calculator, Stats join | Semantic query → match metadata and tags across precomputed indexes |
Image | "Find plays where this player's pose matches this still image" | Objects, Actions | Pose estimation, Embedding similarity | Image embedding → nearest neighbor search in object/action clip embeddings |
Video | "Show me clips like this sequence" | Actions, Game Context | Temporal embedding, Sequence clustering | Video embedding → similarity over sequence vectors |
Text + Image | "Find all jump shots like this frame by LeBron James" | Objects, Actions | Object detection, Face ID, Action tagging | Image filters object, text filters action → intersected at retrieval |
Text + Video | "Give me all clutch plays like this highlight reel" | All indexes | Action + Context tagging, Embedding scoring | Combines natural language and visual patterns → top-K matches across all indexes |
Image + Stats | "Show me this player’s plays when his FG% is over 60%" | Objects, Game Context | Object detection, External stats join | Image localizes player, external stats filter → combined pre-indexed data |
Multimodal (Text + Image + Context) | "Where is this player shooting threes in the last minute of tied games?" | All indexes | Object, Action, Score/time context | Multi-filter match across all indexed dimensions |
This makes it easy to support hybrid queries that combine natural language, visual similarity, and structured filters — all made possible because your footage is preprocessed into queryable, analytics-friendly data structures.
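As a concrete illustration of the Text + Image row, the retriever could embed the query image, restrict candidates to jump shots via metadata, and rank by vector similarity. This is a sketch: embed_image() is a hypothetical encoder that shares the clip-embedding space, the player_id value is illustrative, and it assumes the indexed clips carry the embeddings computed earlier:

```python
import numpy as np

def cosine_similarity(a, b):
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# "Find all jump shots like this frame by <player>": text filters the action and player,
# the image ranks the remaining candidates by embedding similarity.
query_vec = embed_image("reference_frame.jpg")  # hypothetical image encoder
candidates = [
    c for c in index_store.load("action_index")
    if c["action_label"] == "jump_shot" and c["player_id"] == "lebron_james"  # illustrative id
]
ranked = sorted(candidates, key=lambda c: cosine_similarity(query_vec, c["embedding"]), reverse=True)
top_k = ranked[:10]
```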
Why Precompute This?
Doing this at index time means:
- You avoid re-scanning raw video every time someone queries
- You can attach aggregated stats directly to search results
- You bake in assumptions about how people want to explore the footage
It’s like adding indexes and rollups to a database table — you make future queries faster by doing the hard part up front.
TL;DR
If your video is structured into object events, group_by is a useful pattern not just for querying, but for indexing. Precomputing Split → Aggregate → Merge helps turn raw footage into something explorable, search-ready, and analytics-friendly.
You’re not just parsing frames — you’re building DataFrames.
Purpose-Built, for You
We offer purpose-built extraction pipelines, called Feature Extractors, for each access pattern.
Feature Extractors are executed in parallel, so to chain them together you create a Collection with Extractors and use that Collection as the source for a new one.
Finally, you pair them with purpose-built Retrievers for the ultimate Video Search and Analytics experience.
Going Beyond
At Mixpeek we introduce the concept of Taxonomies: flat or hierarchical collections that can be used as a join (materialized or computed). This enables you to enrich the processed outputs of your new videos with the overlap of another collection.
We also introduce Clusters, which act as a group (not to be confused with the pandas operation), letting you cluster and group related outputs.