Turning Frames into DataFrames: AI-Powered Video Analytics
By applying the classic group_by pattern to structured video data at index time, you can turn raw frames into searchable, analyzable DataFrames aligned with how your users explore footage.

Suppose you want to run an analytical query on your basketball footage:
"Show me every jump shot from each season's highest scoring player where the opposing team is winning by 3 or less"
To answer it, you need a pipeline of modular, composable feature extractors that can split, aggregate, and merge across objects, actions, game context, and external stats.
Ok, so how does it all work together?
We Need to Create 3 Indexes
To support these advanced analytical and semantic search queries, we build three separate indexes:
- Objects
- Actions
- Game Context
Each index is constructed using group_by operations over key entities. This allows us to precompute, aggregate, and store enriched data up front.
Note: splitting videos often requires zero-shot or few-shot segmentation models.
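The snippets below lean on a small group_by helper and an index_store object. Both are stand-ins rather than a specific library; here is a minimal sketch of the grouping utility they assume:

```python
from collections import defaultdict

def group_by(items, key):
    """Bucket a list of dicts by the value of `key` (hypothetical helper used in the snippets below)."""
    groups = defaultdict(list)
    for item in items:
        groups[item[key]].append(item)
    return dict(groups)
```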
Object Index

- Split: Detect all player objects across frames.
- Group By: Player ID (using jersey OCR, pose, face recognition).
- External Join: Pull player stats via URL/API lookup (https://stats.nba.com/players/{id}/season_stats).
- Generate: Compute video clip embeddings for each player’s segments.
- Merge: Attach stats to each clip embedding.
```python
# Step 1: Split video into segments with detected players
clips = detect_objects_and_segment(video)  # returns list of {clip_id, player_id, timestamp, ...}

# Step 2: Group clips by player_id
grouped_by_player = group_by(clips, key="player_id")

# Step 3: Fetch external stats per player
def fetch_player_stats(player_id):
    url = f"https://stats.api.com/players/{player_id}/season_stats"
    return http_get(url)

# Step 4: Attach stats to each clip
enriched_clips = []
for player_id, player_clips in grouped_by_player.items():
    stats = fetch_player_stats(player_id)
    for clip in player_clips:
        clip["stats"] = stats
        enriched_clips.append(clip)

# Step 5: Save enriched clips to index
index_store.save("object_index", enriched_clips)
```
Action Index

- Split: Classify action segments (e.g., jump shots, dunks, assists).
- Group By: Player ID or action type.
- Aggregate: Summarize frequency, duration, success rate (if available).
- Merge: Tag clips for retrieval based on action + player pairings.
```python
# Step 1: Run action classifier to label clips
labeled_clips = classify_actions(video)  # returns list of {clip_id, action_label, player_id, timestamp, ...}

# Step 2: Filter for jump shots (materialize as a list so it can be grouped and saved)
jump_shots = [c for c in labeled_clips if c["action_label"] == "jump_shot"]

# Step 3: Group jump shots by player
jump_shots_by_player = group_by(jump_shots, key="player_id")

# Step 4: Enrich each jump shot with metadata (e.g., shot_clock, defender_distance)
for player_id, clips in jump_shots_by_player.items():
    for clip in clips:
        metadata = extract_context_metadata(clip["clip_id"])
        clip["metadata"] = metadata

# Step 5: Save to action index
index_store.save("action_index", jump_shots)
```
Game Context Index

- Split: Extract time, score, shot clock, and period using scoreboard overlays or synced metadata.
- Group By: Game ID or quarter.
- Compute: Derive score_diff, time_remaining, and other contextual flags.
- Merge: Add context to each clip (e.g., “opponent up by ≤ 3”).
```python
# Step 1: Extract scoreboard info from each clip
clips_with_scores = extract_scoreboard(video)  # returns list of {clip_id, team_score, opponent_score, timestamp, ...}

# Step 2: Compute score differential
for clip in clips_with_scores:
    clip["score_diff"] = clip["opponent_score"] - clip["team_score"]

# Step 3: Flag relevant clips (opponent winning by 3 or fewer points)
flagged_clips = [c for c in clips_with_scores if 0 < c["score_diff"] <= 3]

# Step 4: Tag with flag for downstream retrieval
for clip in flagged_clips:
    clip["flag"] = "close_game_opponent_leading"

# Step 5: Save to game context index
index_store.save("game_context_index", flagged_clips)
```
Query-Time Result
At retrieval, your retriever can now filter indexed clips with something like:
```sql
SELECT *
FROM jump_shot_clips
WHERE player_id = season_top_scorer
  AND score_diff <= 3
```
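Expressed over the three indexes directly, the same retrieval might look like the sketch below. It assumes index_store also exposes a load() counterpart to save(), and that the joined stats include a points_per_game field for picking the top scorer; both are assumptions, not a fixed API:

```python
# Sketch: combine the three precomputed indexes at query time.
object_index = index_store.load("object_index")          # clips + player stats (+ embeddings)
action_index = index_store.load("action_index")          # jump-shot clips
context_index = index_store.load("game_context_index")   # clips flagged with score_diff <= 3

# Hypothetical: pick the top scorer from the stats joined at index time.
season_top_scorer = max(object_index, key=lambda c: c["stats"]["points_per_game"])["player_id"]

close_game_clip_ids = {c["clip_id"] for c in context_index}
results = [
    c for c in action_index
    if c["player_id"] == season_top_scorer and c["clip_id"] in close_game_clip_ids
]
```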
Supported Content-Based Queries by Input Type
Input Type | Example Query | Index Used | Feature Extractors Involved | Query Mechanics |
---|---|---|---|---|
Text | “Jump shots by the top scorer when the team is losing by ≤3” | Objects, Actions, Context | Action classifier, Score diff calculator, Stats join | Semantic query → match metadata and tags across precomputed indexes |
Image | "Find plays where this player's pose matches this still image" | Objects, Actions | Pose estimation, Embedding similarity | Image embedding → nearest neighbor search in object/action clip embeddings |
Video | "Show me clips like this sequence" | Actions, Game Context | Temporal embedding, Sequence clustering | Video embedding → similarity over sequence vectors |
Text + Image | "Find all jump shots like this frame by LeBron James" | Objects, Actions | Object detection, Face ID, Action tagging | Image filters object, text filters action → intersected at retrieval |
Text + Video | "Give me all clutch plays like this highlight reel" | All indexes | Action + Context tagging, Embedding scoring | Combines natural language and visual patterns → top-K matches across all indexes |
Image + Stats | "Show me this player’s plays when his FG% is over 60%" | Objects, Game Context | Object detection, External stats join | Image localizes player, external stats filter → combined pre-indexed data |
Multimodal (Text + Image + Context) | "Where is this player shooting threes in the last minute of tied games?" | All indexes | Object, Action, Score/time context | Multi-filter match across all indexed dimensions |
This makes it easy to support hybrid queries that combine natural language, visual similarity, and structured filters — all made possible because your footage is preprocessed into queryable, analytics-friendly data structures.
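As a concrete illustration of the Text + Image row, the retriever could embed the query image, restrict candidates to jump shots via metadata, and rank by vector similarity. This is a sketch: embed_image() is a hypothetical encoder that shares the clip-embedding space, the player_id value is illustrative, and it assumes the indexed clips carry the embeddings computed earlier:

```python
import numpy as np

def cosine_similarity(a, b):
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# "Find all jump shots like this frame by <player>": text filters the action and player,
# the image ranks the remaining candidates by embedding similarity.
query_vec = embed_image("reference_frame.jpg")  # hypothetical image encoder
candidates = [
    c for c in index_store.load("action_index")
    if c["action_label"] == "jump_shot" and c["player_id"] == "lebron_james"  # illustrative id
]
ranked = sorted(candidates, key=lambda c: cosine_similarity(query_vec, c["embedding"]), reverse=True)
top_k = ranked[:10]
```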
Why Precompute This?
Doing this at index time means:
- You avoid re-scanning raw video every time someone queries
- You can attach aggregated stats directly to search results
- You bake in assumptions about how people want to explore the footage
It’s like adding indexes and rollups to a database table — you make future queries faster by doing the hard part up front.
TL;DR
If your video is structured into object events, group_by is a useful pattern not just for querying, but for indexing. Precomputing Split → Aggregate → Merge helps turn raw footage into something explorable, search-ready, and analytics-friendly.
You’re not just parsing frames — you’re building DataFrames.
Purpose-Built, for You
We offer purpose-built extraction pipelines, called Feature Extractors, for each access pattern.
Feature Extractors are executed in parallel, so to chain them together you create a Collection with Extractors and use that Collection as the source for a new one.
Finally, you pair them with purpose-built Retrievers for the ultimate Video Search and Analytics experience.
Going Beyond
At Mixpeek we introduce the concept of Taxonomies: flat or hierarchical collections that can be used as a join (materialized or computed). This enables you to enrich the processed outputs of your new videos with the overlap of another collection.
We also introduce Clusters, which act as a group (not to be confused with the pandas operation), letting you cluster and group related outputs.