How does video face detection handle multiple people in the same frame?

The SCRFD face detection model can detect dozens of simultaneous faces per frame. Each face receives an independent bounding box, confidence score, and facial landmark set. Cross-frame tracking links detections of the same person across frames, so a person appearing in frames 1-500 is reported as a single face track.
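As a rough illustration of how cross-frame tracking can link per-frame detections into face tracks, here is a minimal sketch that greedily matches each box to the previous frame's track with the highest overlap (IoU). The greedy IoU strategy, the `link_detections` name, and the `iou_threshold` value are assumptions made for illustration; the tracker the API actually uses is not described here.

```python
from dataclasses import dataclass, field

# A bounding box as (x1, y1, x2, y2) in pixels.
Box = tuple[float, float, float, float]


def iou(a: Box, b: Box) -> float:
    """Intersection-over-union of two boxes (0.0 when they do not overlap)."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


@dataclass
class Track:
    track_id: int
    # (frame_index, box) pairs, in frame order
    boxes: list[tuple[int, Box]] = field(default_factory=list)


def link_detections(frames: list[list[Box]], iou_threshold: float = 0.3) -> list[Track]:
    """Greedily link boxes in consecutive frames into tracks (hypothetical helper)."""
    tracks: list[Track] = []
    active: list[Track] = []  # tracks that received a box in the previous frame
    for frame_index, boxes in enumerate(frames):
        unmatched = list(active)
        next_active: list[Track] = []
        for box in boxes:
            # Pick the still-unmatched track whose last box overlaps this one most.
            best = max(unmatched, key=lambda t: iou(t.boxes[-1][1], box), default=None)
            if best is not None and iou(best.boxes[-1][1], box) >= iou_threshold:
                unmatched.remove(best)
            else:
                best = Track(track_id=len(tracks))  # no good match: start a new track
                tracks.append(best)
            best.boxes.append((frame_index, box))
            next_active.append(best)
        active = next_active
    return tracks
```

In this sketch a track ends as soon as its face is missed for one frame; production trackers usually tolerate short gaps and also compare appearance, not just position.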

Can I match detected faces against a known identity database?

Yes. Provide a `reference_faces` array with labeled face images and the API will match detected faces against your references using ArcFace 512-dimensional embeddings. Each detection includes the best match identity (if above the `match_threshold`) along with a similarity score.
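To make the thresholded matching concrete, the sketch below compares one detection embedding against labeled reference embeddings and returns the best label only when the score clears a threshold. It assumes cosine similarity as the score and uses 3-dimensional toy vectors in place of 512-dimensional ones; the `best_match` helper and the default threshold of 0.5 are illustrative, not the API's documented behavior.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def best_match(embedding, references, match_threshold=0.5):
    """Return (label, score); label is None when no reference clears the threshold.

    `references` maps a label to its embedding. Hypothetical helper for illustration.
    """
    best_label, best_score = None, -1.0
    for label, reference in references.items():
        score = cosine_similarity(embedding, reference)
        if score > best_score:
            best_label, best_score = label, score
    if best_score < match_threshold:
        return None, best_score
    return best_label, best_score


# Toy 3-D embeddings standing in for 512-D vectors.
references = {"alice": [1.0, 0.0, 0.0], "bob": [0.0, 1.0, 0.0]}
print(best_match([0.9, 0.1, 0.0], references))  # closest to "alice"
print(best_match([0.0, 0.0, 1.0], references))  # no reference is similar enough
```

Raising the threshold trades missed matches for fewer false identifications, so it is worth tuning against a labeled sample of your own footage.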

What is the minimum face size that can be detected in video?

The SCRFD model can detect faces as small as 20x20 pixels in a frame. For reliable identity matching via ArcFace embeddings, faces should be at least 80x80 pixels. The `min_face_size` parameter lets you filter out very small faces that would produce low-quality embeddings.
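The effect of a size filter can be sketched as follows. This assumes a face's size is measured by the shorter side of its bounding box, which is a guess; the API may measure it differently. The `filter_by_size` helper is hypothetical.

```python
# A bounding box as (x1, y1, x2, y2) in pixels.
Box = tuple[float, float, float, float]


def shorter_side(box: Box) -> float:
    """Length in pixels of the box's shorter side."""
    return min(box[2] - box[0], box[3] - box[1])


def filter_by_size(boxes: list[Box], min_face_size: float = 80) -> list[Box]:
    """Keep only boxes whose shorter side is at least `min_face_size` pixels."""
    return [box for box in boxes if shorter_side(box) >= min_face_size]


boxes = [
    (0, 0, 20, 20),    # detectable, but too small for reliable matching
    (0, 0, 80, 80),    # meets the 80 px recommendation
    (0, 0, 120, 60),   # wide but only 60 px tall
]
print(filter_by_size(boxes))      # [(0, 0, 80, 80)]
print(filter_by_size(boxes, 20))  # all three boxes
```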

Does video face extraction support face clustering without a reference database?

Yes. Set `cluster_faces` to true and the API will automatically group all face detections into clusters representing unique individuals, without requiring any reference images. Each cluster includes a representative face crop, total screen time, and first/last appearance timestamps.
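One simple way to picture reference-free clustering is a greedy pass that assigns each detection to the most similar existing cluster or starts a new one. The sketch below does this with cosine similarity on toy 2-dimensional embeddings and records first/last appearance and detection count per cluster. The algorithm, the `similarity_threshold` value, and the screen-time estimate (detections divided by the sampling rate) are all assumptions for illustration, not the API's documented method.

```python
import math
from dataclasses import dataclass


def cosine_similarity(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


@dataclass
class Cluster:
    id: int
    representative: list[float]  # embedding of the first detection in the cluster
    first_appearance: float      # seconds
    last_appearance: float       # seconds
    detection_count: int = 1

    def screen_time(self, sample_fps: float) -> float:
        """Rough estimate: each sampled detection stands for 1/sample_fps seconds."""
        return self.detection_count / sample_fps


def cluster_detections(detections, similarity_threshold=0.6):
    """Greedy clustering of (timestamp_seconds, embedding) pairs (hypothetical helper)."""
    clusters: list[Cluster] = []
    for timestamp, embedding in sorted(detections, key=lambda d: d[0]):
        best, best_score = None, similarity_threshold
        for cluster in clusters:
            score = cosine_similarity(embedding, cluster.representative)
            if score >= best_score:
                best, best_score = cluster, score
        if best is None:
            clusters.append(Cluster(len(clusters), embedding, timestamp, timestamp))
        else:
            best.last_appearance = timestamp
            best.detection_count += 1
    return clusters


detections = [
    (0.0, [1.0, 0.0]), (0.5, [0.98, 0.2]),  # person A
    (1.0, [0.0, 1.0]), (1.5, [0.1, 0.99]),  # person B
]
for c in cluster_detections(detections):
    print(c.id, c.first_appearance, c.last_appearance, c.screen_time(sample_fps=2))
```

Greedy assignment depends on processing order; batch methods that compare all embeddings at once are generally more robust for long videos.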

Video to Faces Converter

Detect, track, and extract all faces appearing in a video. Returns aligned face crops, bounding boxes, timestamps, and optional identity embeddings for each detected face. Supports face clustering to group appearances of the same person across the video.

Max file size: 5 GB

Estimated: 3-15 min per hour of video

5 input formats

How It Works

Upload a video file or provide a URL to the Mixpeek API.

Frames are sampled and processed through a face detection model (SCRFD) to locate faces.

Detected faces are tracked across frames to maintain identity continuity.

Each unique face is aligned and cropped using facial landmark detection.

Face identity embeddings (ArcFace 512D) are generated for each unique face, enabling clustering and matching.
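The frame-sampling step above can be sketched as picking evenly spaced frame indices. This assumes uniform sampling driven by a rate like the `sample_fps` option; the function name and the exact sampling rule are illustrative, since the page does not specify how frames are chosen.

```python
def sampled_frame_indices(total_frames: int, native_fps: float, sample_fps: float) -> list[int]:
    """Indices of frames to process when sampling `sample_fps` frames per second.

    Hypothetical helper: sampling faster than the native rate falls back to every frame.
    """
    if native_fps <= 0 or sample_fps <= 0:
        raise ValueError("frame rates must be positive")
    step = max(1.0, native_fps / sample_fps)  # source frames between samples
    indices = []
    i = 0
    while int(i * step) < total_frames:
        indices.append(int(i * step))
        i += 1
    return indices


# Two seconds of 30 fps video sampled at 2 fps: one frame every 15 frames.
indices = sampled_frame_indices(total_frames=60, native_fps=30.0, sample_fps=2.0)
print(indices)                        # [0, 15, 30, 45]
print([i / 30.0 for i in indices])    # timestamps in seconds: [0.0, 0.5, 1.0, 1.5]
```

A lower sampling rate reduces processing time but can miss faces that appear only briefly.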

Code Examples

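from mixpeek import Mixpeek

# Authenticate with your Mixpeek API key
client = Mixpeek(api_key="YOUR_API_KEY")

# Convert a video into face detections grouped by person
result = client.convert(
    source="https://example.com/panel-discussion.mp4",
    from_format="video",
    to_format="faces",
    options={
        "cluster_faces": True,       # group detections into unique individuals
        "include_embeddings": True,  # return identity embeddings for each face
        "min_face_size": 80,         # filter out faces too small for reliable matching
        "sample_fps": 2
    }
)

# Summarize each cluster (one per unique individual)
print(f"Unique faces found: {len(result.face_clusters)}")
for cluster in result.face_clusters:
    print(f"  Person {cluster.id}: {cluster.screen_time}s on screen")
    print(f"    First seen: {cluster.first_appearance}s")
    print(f"    Detections: {cluster.detection_count}")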

Use Cases

Build face-searchable indexes for video libraries and media archives

Track on-screen talent appearances across advertising campaigns

Create face-based navigation for interview and panel discussion videos

Detect and count unique individuals in surveillance or event footage

Supported Input Formats

MP4

MOV

AVI

MKV

WebM

Quick Info

Category: media

Max File Size: 5 GB

Est. Time: 3-15 min per hour of video

Extractor: face-identity

Try This Conversion

Get started with the Mixpeek API and convert your first file in minutes.

Related Converters

Video to Keyframes

Automatically detect scene changes and extract representative keyframes from any video. Each keyframe includes a timestamp, scene label, and optional caption generated by a vision model.

Video to Thumbnails

Generate optimized thumbnail images from video files. Uses intelligent frame selection to pick the most visually appealing and representative frames, with optional face detection and composition scoring.

Video to Scenes

Automatically segment videos into individual scenes using visual and audio cue detection. Each scene includes a start and end timestamp, a representative keyframe, a descriptive label, and a confidence score for the detected boundary.

Video to Metadata

Extract comprehensive technical and semantic metadata from video files. Returns codec details, resolution, duration, frame rate, and AI-generated semantic tags including detected objects, scenes, dominant colors, and content categories.

Image to Metadata

Extract comprehensive technical and semantic metadata from images. Returns EXIF data, camera settings, GPS coordinates, and AI-generated semantic tags including detected objects, scene type, dominant colors, and content categories.

Ready to convert video to faces?

Start using the Mixpeek Video to Faces converter in minutes. Sign up for a free API key and follow the documentation to get started.
