Face embeddings are compact numerical vectors that encode the unique identity characteristics of a human face. Generated by deep neural networks trained on face recognition tasks, these embeddings enable face verification (is this the same person?), face identification (who is this person?), and face search (find all appearances of this person) at scale.
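To make the verification task concrete, here is a minimal NumPy sketch of comparing two embeddings by cosine similarity. The random vectors and the 0.5 threshold are placeholders standing in for real model outputs and a tuned operating point:

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two embedding vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Toy 512-dimensional vectors stand in for real model outputs here.
rng = np.random.default_rng(0)
emb_a = rng.normal(size=512)
emb_b = rng.normal(size=512)

# Verification: declare "same person" if similarity clears a tuned threshold.
THRESHOLD = 0.5  # illustrative; real systems tune this on a validation set
print(cosine_similarity(emb_a, emb_b) >= THRESHOLD)
```

Identification and search build on the same primitive: compare a query embedding against a gallery of known embeddings and rank by similarity.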
A face embedding pipeline first detects faces in an image or video frame using a face detector (e.g., SCRFD, RetinaFace). Detected faces are then aligned to a canonical pose using facial landmark coordinates, and each aligned crop is passed through a deep neural network (e.g., ArcFace, FaceNet) that produces a fixed-dimensional embedding vector (typically 128-512 dimensions). Embeddings of the same person have high cosine similarity, while embeddings of different people lie far apart in the vector space.
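The snippet below sketches this end-to-end pipeline using the insightface package, whose `FaceAnalysis` helper bundles an SCRFD detector with an ArcFace recognizer; the `buffalo_l` model pack is its default, the image paths are placeholders, and exact defaults may vary by version:

```python
import cv2
import numpy as np
from insightface.app import FaceAnalysis

# Loads SCRFD detection + ArcFace recognition models (downloaded on first use).
app = FaceAnalysis(name="buffalo_l")
app.prepare(ctx_id=0, det_size=(640, 640))  # ctx_id=0: first GPU, -1: CPU

img_a = cv2.imread("person_a.jpg")  # placeholder paths
img_b = cv2.imread("person_b.jpg")

# Each detected face carries a bounding box, 5 landmarks, and an embedding.
face_a = app.get(img_a)[0]
face_b = app.get(img_b)[0]

# normed_embedding is already L2-normalized, so the dot product equals
# the cosine similarity.
similarity = float(np.dot(face_a.normed_embedding, face_b.normed_embedding))
print("same person:", similarity >= 0.4)  # threshold is illustrative
```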
Modern face embedding models use ResNet or Vision Transformer backbones trained with angular margin losses (ArcFace, CosFace) that maximize inter-class separation while minimizing intra-class variance. Training draws on large-scale face datasets spanning hundreds of thousands to millions of identities. Preprocessing consists of face detection, 5-point landmark alignment, and a similarity transform (applied as an affine warp) to a standard 112x112 pixel crop. The final embedding is L2-normalized to unit length, and a distance threshold (typically 0.3-0.5 cosine distance, tuned per model on a validation set) separates match from non-match decisions.
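The alignment step can be sketched as follows. The destination coordinates are the widely used ArcFace 5-point reference template for 112x112 crops (as published in insightface's alignment code); `align_face` and `l2_normalize` are hypothetical helper names:

```python
import cv2
import numpy as np
from skimage.transform import SimilarityTransform

# ArcFace reference landmarks for a 112x112 crop: left eye, right eye,
# nose tip, left mouth corner, right mouth corner.
ARCFACE_DST = np.array(
    [
        [38.2946, 51.6963],
        [73.5318, 51.5014],
        [56.0252, 71.7366],
        [41.5493, 92.3655],
        [70.7299, 92.2041],
    ],
    dtype=np.float32,
)

def align_face(img: np.ndarray, landmarks: np.ndarray) -> np.ndarray:
    """Warp a face to the canonical 112x112 pose.

    landmarks: (5, 2) detected points in the same order as ARCFACE_DST.
    """
    tform = SimilarityTransform()
    tform.estimate(landmarks, ARCFACE_DST)  # least-squares similarity fit
    M = tform.params[:2, :]                 # 2x3 matrix for cv2.warpAffine
    return cv2.warpAffine(img, M, (112, 112))

def l2_normalize(embedding: np.ndarray) -> np.ndarray:
    """Scale to unit length so cosine similarity reduces to a dot product."""
    return embedding / np.linalg.norm(embedding)
```

Because the embeddings are unit length, cosine distance is simply 1 minus the dot product, so the 0.3-0.5 distance threshold translates directly into a dot-product cutoff.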