Photos on iOS, iPadOS, and macOS helps users explore, search, and relive memories with the people who matter to them. These features are built on machine learning algorithms that run privately on device to curate and organize photos, Live Photos, and videos. At the heart of this system is an algorithm that recognizes the people in images based on their visual appearance.
This person recognition capability is integral to several features within Photos. As illustrated in Figure 1A, users can identify individuals in a photo by swiping up and tapping a person’s circle, then navigating to other images featuring that person. The People Album, depicted in Figure 1B, provides a dedicated space to browse and confirm person tags across the photo library. Users can also manually add names to recognized individuals and search for specific people by name, as shown in Figure 1C.
Beyond individual recognition, Photos uses this identity information to construct a private, on-device knowledge graph that captures meaningful patterns in a user’s library, such as important groups of people, frequented places, past trips, events, and even the last photo taken of a specific person. This knowledge graph powers the popular Memories feature in Photos, which creates engaging video compilations around diverse themes in a user’s collection. Memories uses the prominent people in a user’s life to build themed memories, such as a “Together” memory, exemplified in Figure 1D.
Figure 1: Interface examples of person recognition features in Apple Photos, including identified people in images, People Album, person search, and person-specific Memories.
Overcoming Challenges in Person Recognition within User-Generated Content
Recognizing people in user-generated content presents unique challenges because of the inherent variability of the images. Individuals can appear at widely varying scales, under diverse lighting conditions, in different poses and expressions, and captured by a wide range of cameras. A robust system must build a comprehensive knowledge graph so that users can view all photos of a specific person, even when the subject is not posing, such as candid shots of dynamic scenes like a child bursting a bubble or friends celebrating with a toast.
Ensuring fairness in person recognition outcomes is paramount. Apple products are used globally by a diverse population. The goal is to provide a consistent and exceptional user experience for everyone, irrespective of the photographic subject’s skin color, age, or gender.
Two-Phase Approach to Person Recognition in Image Libraries
Person recognition in an image library is structured into two interconnected phases. The first phase progressively constructs a gallery of known individuals as the library grows. The second phase assigns each new person observation either to an existing individual in the gallery or to an unknown individual. Both phases rely on feature vectors, also known as embeddings, to represent person observations.
Figure 2: Diagram illustrating the two-phase process of person recognition in image collections using machine learning.
Embedding Extraction: Capturing Visual Identity
The process begins by detecting faces and upper bodies in each image. Facial detection can be challenging due to occlusions or subjects looking away from the camera. To address this, the system also considers upper bodies, which often exhibit consistent characteristics like clothing within a short timeframe, providing valuable cues for person identification across temporally proximate images.
A deep neural network is employed to process the entire image and output bounding boxes for detected faces and upper bodies. A matching algorithm then associates face bounding boxes with corresponding upper bodies, considering factors like bounding box area, position, and region overlap.
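To make the matching step concrete, the sketch below shows one plausible way to associate face boxes with upper-body boxes based on region overlap and relative position. The function names, the containment score, and the threshold are illustrative assumptions, not Apple’s implementation.

```python
# Illustrative sketch (not Apple's implementation): greedily associate each detected
# face box with the upper-body box that best contains it. Boxes are (x0, y0, x1, y1).

def area(box):
    x0, y0, x1, y1 = box
    return max(0.0, x1 - x0) * max(0.0, y1 - y0)

def intersection(a, b):
    x0, y0 = max(a[0], b[0]), max(a[1], b[1])
    x1, y1 = min(a[2], b[2]), min(a[3], b[3])
    return area((x0, y0, x1, y1))

def match_faces_to_bodies(faces, bodies, min_containment=0.7):
    """Return {face_index: body_index} pairs, using each upper-body box at most once."""
    matches, used = {}, set()
    for fi, face in enumerate(faces):
        best, best_score = None, 0.0
        for bi, body in enumerate(bodies):
            if bi in used:
                continue
            # Score: fraction of the face box covered by the upper-body box,
            # which rewards both overlap and a plausible relative position.
            score = intersection(face, body) / (area(face) + 1e-9)
            if score > best_score:
                best, best_score = bi, score
        if best is not None and best_score >= min_containment:
            matches[fi] = best
            used.add(best)
    return matches
```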
Extracted face and upper body crops are then fed into separate deep neural networks. These networks are designed to generate feature vectors, or embeddings, that represent the visual characteristics of the detected regions. Embeddings from different crops of the same person are designed to be closely clustered, while embeddings from different individuals are separated in the embedding space. This process is repeated across all images in a Photos library, resulting in a comprehensive collection of face and upper body embeddings.
Gallery Construction: Building a Knowledge Base of Known Individuals
The gallery serves as a repository of frequently encountered individuals within a user’s Photos library. To construct this gallery in an unsupervised manner, Photos utilizes clustering techniques. These techniques group face and upper body feature vectors that correspond to the individuals detected in the library. A novel agglomerative clustering algorithm was developed for Photos, enabling efficient incremental updates to existing clusters and scaling to accommodate large libraries.
The clustering process begins with an algorithm that combines face and upper body embeddings for each observation. This initial step is intentionally conservative, ensuring high precision by only grouping very similar instances. Each initial cluster represents a tight grouping of highly similar embeddings and is characterized by the running average of its embeddings as new instances are added.
Recognizing the temporal instability of upper body embeddings compared to face embeddings (due to clothing changes, for example), the initial clustering pass carefully compares upper body embeddings only within the same “moment.” A moment is defined by metadata such as time and location, linking together photos taken in close temporal and spatial proximity. Within a moment, a person’s upper body appearance is expected to be relatively consistent.
The algorithm utilizes a distance metric, Dij, which combines face embedding distance (Fij) and upper body embedding distance (Tij) using the formula: Dij = min(Fij, α⋅Fij + β⋅Tij). This formula prioritizes face embeddings but incorporates upper body embeddings when available and relevant within the same moment, weighted by tunable parameters α and β.
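The following sketch illustrates how such a combined distance could be computed from face and upper body embeddings, assuming cosine distance and placeholder values for the tunable weights α and β; the production metric and weights are not public.

```python
import numpy as np

def cosine_distance(a, b):
    """1 - cosine similarity between two embedding vectors."""
    a = a / (np.linalg.norm(a) + 1e-9)
    b = b / (np.linalg.norm(b) + 1e-9)
    return 1.0 - float(a @ b)

def combined_distance(face_i, face_j, torso_i=None, torso_j=None,
                      same_moment=False, alpha=0.6, beta=0.4):
    """D_ij = min(F_ij, alpha * F_ij + beta * T_ij).

    The upper body (torso) distance contributes only when both observations come
    from the same moment and both torso embeddings are available. The alpha and
    beta values here are illustrative, not the tuned production weights.
    """
    f_ij = cosine_distance(face_i, face_j)
    if same_moment and torso_i is not None and torso_j is not None:
        t_ij = cosine_distance(torso_i, torso_j)
        return min(f_ij, alpha * f_ij + beta * t_ij)
    return f_ij
```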
Following the initial greedy clustering, a second pass using hierarchical agglomerative clustering (HAC) is performed to further expand clusters and improve recall. This second pass relies solely on face embedding matching, enabling cluster growth across moment boundaries. The HAC algorithm iteratively merges cluster pairs that minimize the linkage distance, employing a median distance linkage strategy with optimizations for runtime and memory efficiency, achieving accuracy comparable to average-linkage HAC.
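Below is a minimal, unoptimized illustration of this second pass: hierarchical agglomerative clustering over face embeddings with a median-distance linkage, merging clusters until no pair falls below a threshold. The threshold and the brute-force implementation are simplifications for clarity, not the runtime- and memory-optimized algorithm that ships in Photos.

```python
import numpy as np
from itertools import combinations

def median_linkage_hac(embeddings, initial_clusters, merge_threshold=0.35):
    """Simplified second-pass HAC over face embeddings.

    embeddings:       (n, d) array of face embeddings.
    initial_clusters: list of lists of embedding indices from the conservative first pass.
    Repeatedly merges the pair of clusters with the smallest median pairwise distance
    until no pair is below merge_threshold.
    """
    clusters = [list(c) for c in initial_clusters]
    normed = embeddings / (np.linalg.norm(embeddings, axis=1, keepdims=True) + 1e-9)
    dist = 1.0 - normed @ normed.T        # pairwise cosine distances

    def linkage(a, b):
        return float(np.median([dist[i, j] for i in a for j in b]))

    while len(clusters) > 1:
        (ia, ib), best = min(
            (((i, j), linkage(clusters[i], clusters[j]))
             for i, j in combinations(range(len(clusters)), 2)),
            key=lambda x: x[1],
        )
        if best > merge_threshold:
            break
        merged = clusters[ia] + clusters[ib]
        clusters = [c for k, c in enumerate(clusters) if k not in (ia, ib)]
        clusters.append(merged)
    return clusters
```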
Figure 3 illustrates the processing time efficiency of this clustering approach compared to typical average linkage clustering as iterations grow, highlighting its scalability for large image libraries.
Figure 3: Processing time per iteration for typical average linkage clustering compared to our approach (lower is better).
This clustering algorithm runs periodically, typically during overnight device charging. It assigns every observed person instance to a cluster. The K largest clusters are likely to correspond to K distinct individuals in a user’s library. Heuristics based on cluster size distribution, inter- and intra-cluster distances, and user input within the app are used to determine the set of clusters that constitute the gallery of known individuals.
Identity Assignment: Matching New Observations to the Gallery
The second phase assigns an identity to new person observations by matching them against the constructed gallery. Inspired by K-means-based feature representation learning, the system does not use a simple nearest-neighbor approach; instead, each cluster in the gallery is represented by a set of canonical exemplars (X0, X1, …, Xc), which together form a dictionary D.
Given a new observation y, identity assignment is formulated as a sparse coding problem: minimize ‖y − D·x‖₂² + λ‖x‖₁ with respect to the sparse code x. The final assignment of y is made to the cluster corresponding to the maximum total “energy” of the sparse code x. This sparse coding approach provides improved accuracy compared to nearest-neighbor classification, especially when cluster sizes are small or when multiple clusters might represent the same identity. It enables rapid person identification as new photos are captured, allowing Photos to dynamically adapt to user libraries.
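A minimal sketch of this assignment step, using scikit-learn’s SparseCoder to solve the lasso problem over the exemplar dictionary and summing absolute coefficients per cluster as the “energy”; the regularization strength and rejection threshold below are illustrative assumptions.

```python
import numpy as np
from sklearn.decomposition import SparseCoder

def assign_identity(y, exemplars, cluster_ids, alpha=0.1, min_energy=0.2):
    """Sketch of sparse-coding identity assignment (illustrative, not Apple's code).

    exemplars:   (n_exemplars, d) array of canonical exemplar embeddings
                 (dictionary D, one L2-normalized row per exemplar).
    cluster_ids: length-n_exemplars array mapping each exemplar to its gallery cluster.
    Returns the cluster with the largest total absolute coefficient "energy",
    or None if the observation is not well explained by any cluster.
    """
    cluster_ids = np.asarray(cluster_ids)
    coder = SparseCoder(dictionary=exemplars,
                        transform_algorithm="lasso_lars",
                        transform_alpha=alpha)
    x = coder.transform(y.reshape(1, -1))[0]          # sparse code over exemplars
    energies = {cid: float(np.abs(x[cluster_ids == cid]).sum())
                for cid in np.unique(cluster_ids)}
    best = max(energies, key=energies.get)
    return best if energies[best] >= min_energy else None
```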
Filtering Unclear Faces: Enhancing Gallery Quality
To maintain the quality of the person recognition system, a filtering mechanism is crucial to remove observations that are not well-represented by face and upper body embeddings. Without filtering, false positives or out-of-distribution detections could accumulate in the gallery, degrading recognition accuracy.
Figure 4: Examples of challenging faces for recognition, including false positives and unclear or poorly captured faces.
This filtering process is an essential component of the pipeline, ensuring that only high-quality, representative embeddings contribute to the gallery and the overall accuracy of recognizing people in images using machine learning.
Ensuring Fairness in Face Embedding Generation
Achieving consistent accuracy across diverse demographics is a critical challenge in face representation learning. The model must perform equitably across age groups, genders, ethnicities, skin tones, and other attributes. Fairness is considered from the outset, influencing data collection, failure analysis, and model evaluation.
Data Diversity and Augmentation: Mitigating Bias
Continuous efforts are made to enhance dataset diversity and monitor for biases across demographic axes. Awareness of data biases informs subsequent data collection efforts and model training strategies. Curated datasets, gathered through paid crowd-sourcing with managed crowds, provide representative image content from participants worldwide, spanning diverse demographics.
Data augmentation plays a crucial role in improving model generalization. During training, random combinations of transformations are applied to input images, including pixel-level changes (color jitter, grayscale conversion), structural changes (flipping, distortion), Gaussian blur, compression artifacts, and cutout regularization. Transformations are added incrementally in a curriculum-learning fashion, progressing from easier to harder examples as training advances.
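The snippet below sketches such an augmentation stack using torchvision transforms; the operations mirror the list above, but the parameters and the curriculum schedule (gradually increasing probabilities and strengths as training advances) are placeholders rather than production values.

```python
import torchvision.transforms as T

# Illustrative augmentation stack, loosely mirroring the transformations described
# above; exact parameters and the curriculum schedule are not public.
train_transforms = T.Compose([
    T.RandomHorizontalFlip(p=0.5),                        # structural change: flipping
    T.ColorJitter(brightness=0.4, contrast=0.4,
                  saturation=0.4, hue=0.1),               # pixel-level change: color jitter
    T.RandomGrayscale(p=0.1),                             # grayscale conversion
    T.RandomApply([T.GaussianBlur(kernel_size=5)], p=0.2),
    T.RandomPerspective(distortion_scale=0.2, p=0.2),     # geometric distortion
    T.ToTensor(),
    T.RandomErasing(p=0.25, scale=(0.02, 0.2)),           # cutout-style regularization
])
```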
To address challenges posed by face masks during the COVID-19 pandemic, a synthetic mask augmentation technique was developed. Face landmarks are used to generate realistic mask shapes, overlaid with random textures onto input faces. This augmentation enables the model to focus on unmasked facial regions and maintain accuracy for masked faces without compromising performance on non-masked faces.
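A heavily simplified sketch of the idea, assuming hypothetical landmark inputs: a mask-shaped polygon spanning the nose bridge and jawline is drawn over the face crop and filled with a random color standing in for a random texture.

```python
import numpy as np
from PIL import Image, ImageDraw

def apply_synthetic_mask(face_img, jaw_points, nose_bridge_point):
    """Very simplified sketch of synthetic mask augmentation (not Apple's implementation).

    face_img:          PIL.Image face crop.
    jaw_points:        list of (x, y) landmarks along the lower face contour.
    nose_bridge_point: (x, y) landmark near the bridge of the nose.
    """
    out = face_img.copy()
    draw = ImageDraw.Draw(out)
    polygon = [tuple(nose_bridge_point)] + [tuple(p) for p in jaw_points]
    color = tuple(int(c) for c in np.random.randint(0, 256, size=3))
    draw.polygon(polygon, fill=color)   # solid color stands in for a random texture
    return out
```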
Network Architecture: Balancing Accuracy and Efficiency
Network architecture design prioritizes maximizing accuracy while ensuring efficient on-device performance, low latency, and a compact memory footprint. Trade-offs between accuracy and computational cost are carefully considered at each stage of network design. The chosen architecture is inspired by the lightweight and efficient AirFace model, with optimizations and increased depth.
Figure 5: Schematic representation of the face embedding network architecture optimized for on-device person recognition.
The network backbone features alternating shallow strided modules for dimensionality reduction and depth increase, alongside deep modules incorporating MobileNetv3-inspired bottlenecks. Each bottleneck employs an inverted residual and linear structure with a lightweight attention layer, including point-wise expansion convolution, depth-wise convolution, a channel attention block inspired by Squeeze and Excitation, and a point-wise reduction convolution. Residual connections, non-linear activations, and batch normalization are incorporated within bottlenecks.
A linear global depth-wise convolution, as used in MobileFaceNet, converts the final feature map into the face embedding. This approach is more effective than pooling mechanisms, allowing the network to focus on relevant receptive field parts for face recognition. The embedding is then normalized to create the final face embedding representation.
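The sketch below, in PyTorch, shows the kind of building blocks described above: an inverted-residual bottleneck with a Squeeze-and-Excitation-style attention layer, and a linear global depth-wise convolution head that turns the final feature map into an L2-normalized embedding. Channel counts, kernel sizes, and the embedding dimension are illustrative assumptions, not the production architecture.

```python
import torch
import torch.nn as nn

class SqueezeExcite(nn.Module):
    """Channel attention block in the spirit of Squeeze-and-Excitation."""
    def __init__(self, channels, reduction=4):
        super().__init__()
        self.pool = nn.AdaptiveAvgPool2d(1)
        self.fc = nn.Sequential(
            nn.Conv2d(channels, channels // reduction, 1), nn.ReLU(inplace=True),
            nn.Conv2d(channels // reduction, channels, 1), nn.Sigmoid(),
        )

    def forward(self, x):
        return x * self.fc(self.pool(x))

class Bottleneck(nn.Module):
    """MobileNetV3-style inverted residual: expand -> depth-wise -> SE -> linear project."""
    def __init__(self, in_ch, out_ch, expand, stride=1):
        super().__init__()
        mid = in_ch * expand
        self.use_residual = stride == 1 and in_ch == out_ch
        self.block = nn.Sequential(
            nn.Conv2d(in_ch, mid, 1, bias=False), nn.BatchNorm2d(mid), nn.Hardswish(),
            nn.Conv2d(mid, mid, 3, stride, 1, groups=mid, bias=False),
            nn.BatchNorm2d(mid), nn.Hardswish(),
            SqueezeExcite(mid),
            nn.Conv2d(mid, out_ch, 1, bias=False), nn.BatchNorm2d(out_ch),  # linear projection
        )

    def forward(self, x):
        out = self.block(x)
        return x + out if self.use_residual else out

class EmbeddingHead(nn.Module):
    """Linear global depth-wise convolution (as in MobileFaceNet) followed by
    L2 normalization, converting the final feature map into a face embedding."""
    def __init__(self, channels, feature_size=7, embedding_dim=128):
        super().__init__()
        self.gdconv = nn.Conv2d(channels, channels, feature_size,
                                groups=channels, bias=False)
        self.proj = nn.Conv2d(channels, embedding_dim, 1, bias=False)

    def forward(self, x):
        x = self.proj(self.gdconv(x)).flatten(1)
        return nn.functional.normalize(x, dim=1)
```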
Model Training: Optimizing Embedding Space
The training objective is to create an embedding space where embeddings of the same person are tightly clustered (intra-class compactness) and embeddings of different people are well-separated (inter-class discrepancy). This concept is visualized in Figure 6.
Figure 6: Illustration of the model training procedure and the distribution of face embeddings before and after training, demonstrating improved clustering and separation.
The training process, similar to ArcFace, compares extracted face embeddings to a set of anchors representing the center of face embeddings for each person in the training dataset. Cosine similarity between the embedding and anchors is computed, and an angular margin penalty m is added to the positive anchor’s cosine similarity to increase inter-class distance. The features are rescaled by s, and softmax cross-entropy loss is computed.
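A compact sketch of this ArcFace-style objective; the margin m and scale s values below are common defaults from the literature, not the values used to train the production model.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ArcFaceLoss(nn.Module):
    """Additive angular margin loss in the style of ArcFace (illustrative sketch)."""
    def __init__(self, embedding_dim, num_identities, margin=0.5, scale=64.0):
        super().__init__()
        # One learned anchor per identity, acting as the class center.
        self.anchors = nn.Parameter(torch.randn(num_identities, embedding_dim))
        self.margin, self.scale = margin, scale

    def forward(self, embeddings, labels):
        # Cosine similarity between normalized embeddings and per-identity anchors.
        cos = F.linear(F.normalize(embeddings), F.normalize(self.anchors))
        theta = torch.acos(cos.clamp(-1 + 1e-7, 1 - 1e-7))
        # Add the angular margin only to the positive (ground-truth) anchor.
        target = F.one_hot(labels, cos.size(1)).bool()
        logits = torch.where(target, torch.cos(theta + self.margin), cos)
        # Rescale by s and compute softmax cross-entropy.
        return F.cross_entropy(self.scale * logits, labels)
```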
Margin-mining softmax, inspired by Support Vector Guided Softmax and CurricularFace, is integrated into the loss function to underweight easy examples and emphasize hard examples. A re-weighting function f modulates cosine similarity for negative anchors based on their difficulty. This approach significantly improves model accuracy by preventing the loss from being dominated by easy examples and facilitating convergence even with smaller networks.
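The re-weighting idea can be sketched as a modulation of the negative-anchor logits, here in the spirit of CurricularFace: negatives that are harder than the margin-penalized positive are amplified, while easy negatives are left unchanged. The modulation function and the constant t are illustrative assumptions.

```python
import torch

def reweight_negatives(cos, target_mask, positive_logit, t=0.3):
    """Sketch of margin-mining re-weighting for negative anchors.

    cos:            (batch, num_identities) cosine similarities to all anchors.
    target_mask:    boolean mask of the positive anchor per sample.
    positive_logit: (batch,) margin-penalized positive logit, cos(theta_y + m).
    In practice t would be an exponential moving average of the positive cosine
    similarities rather than a constant.
    """
    hard = (~target_mask) & (cos > positive_logit.unsqueeze(1))
    modulated = cos * (t + cos)          # emphasize hard negatives
    return torch.where(hard, modulated, cos)
```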
Figure 7 demonstrates the impact of different training procedures on model accuracy, showing the benefits of augmentations and margin-mining softmax.
Figure 7: Impact of training procedure on model accuracy.
The model is trained from random initialization using the AdamW optimizer with a carefully tuned learning rate schedule based on the One Cycle Policy.
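In PyTorch terms, this corresponds roughly to the setup below; the learning rate, weight decay, and step counts are placeholders, not the tuned production schedule.

```python
import torch

# Illustrative optimizer and one-cycle schedule setup.
model = torch.nn.Linear(128, 128)  # stand-in for the face embedding network
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3, weight_decay=5e-4)
scheduler = torch.optim.lr_scheduler.OneCycleLR(
    optimizer, max_lr=1e-3, total_steps=100_000,
    pct_start=0.1, anneal_strategy="cos",
)
# In the training loop, call optimizer.step() and then scheduler.step() once per batch.
```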
To quantify uncertainty and detect out-of-distribution samples, an embedding confidence branch is fine-tuned using random non-face crops and out-of-domain data. This branch helps identify anomalies and prevents overconfidence when the model encounters unfamiliar inputs.
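One way such a confidence branch could look is sketched below: a small head on top of the (frozen) backbone features, fine-tuned with a binary objective that separates real faces from random non-face and out-of-domain crops. The architecture and training details are assumptions for illustration, not the production design.

```python
import torch
import torch.nn as nn

class ConfidenceBranch(nn.Module):
    """Hypothetical embedding-confidence head: scores how face-like an input crop is.
    Fine-tuned with real faces labeled 1 and random non-face / out-of-domain crops
    labeled 0; low scores can then be used to filter observations out of the gallery."""
    def __init__(self, feature_dim=128):
        super().__init__()
        self.head = nn.Sequential(
            nn.Linear(feature_dim, 64), nn.ReLU(inplace=True),
            nn.Linear(64, 1),
        )

    def forward(self, features):
        return torch.sigmoid(self.head(features)).squeeze(-1)

# Fine-tuning objective (backbone frozen): binary cross-entropy on face vs. non-face.
# loss = nn.functional.binary_cross_entropy(confidence(features), labels)
```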
On-Device Performance: Real-Time Efficiency
On-device performance is crucial for maintaining user privacy and enabling real-time features. The model is optimized to run end-to-end on the Apple Neural Engine (ANE), achieving face embedding generation in under 4ms on recent iOS hardware, an 8x improvement over GPU-based models. This efficiency enables real-time person recognition capabilities.
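For context, the public coremltools workflow below shows how a PyTorch embedding model can generally be converted to Core ML and allowed to run on the Neural Engine; this is not Apple’s internal deployment pipeline, and the input size and stand-in model are assumptions.

```python
import torch
import coremltools as ct

embedding_model = torch.nn.Sequential(             # stand-in for a trained network
    torch.nn.Conv2d(3, 16, 3, stride=2, padding=1),
    torch.nn.ReLU(),
    torch.nn.AdaptiveAvgPool2d(1),
    torch.nn.Flatten(),
    torch.nn.Linear(16, 128),
)

example_input = torch.rand(1, 3, 112, 112)         # assumed face-crop input size
traced = torch.jit.trace(embedding_model.eval(), example_input)
mlmodel = ct.convert(
    traced,
    inputs=[ct.TensorType(name="face_crop", shape=example_input.shape)],
    convert_to="mlprogram",
    compute_units=ct.ComputeUnit.ALL,              # allow CPU, GPU, and Neural Engine
)
mlmodel.save("FaceEmbedding.mlpackage")
```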
Visualizing Results: Improved Person Recognition
These advancements in person recognition, available in Photos on iOS 15, significantly improve recognition accuracy. As shown in Figure 8, the system can now reliably recognize individuals in challenging scenarios, including extreme poses, accessories, and occluded faces. The combination of face and upper body recognition makes it possible to match people even when their faces are not visible. The result is a significantly improved Photos experience that accurately identifies important people in previously challenging situations.
Figure 8: Examples of successful person recognition using the improved machine learning architecture in challenging real-world scenarios.
Acknowledgments
The authors acknowledge the contributions of Floris Chabert, Jingwen Zhu, Brett Keating, and Vinay Sharma to this research.
References
Adam Coates, Andrew Y. Ng. Learning Feature Representations with K-means. In Neural Networks: Tricks of the Trade, 2012.
Andrew Howard, Mark Sandler, Grace Chu, Liang-Chieh Chen, Bo Chen, Mingxing Tan, Weijun Wang, Yukun Zhu, Ruoming Pang, Vijay Vasudevan, Quoc V. Le, Hartwig Adam. Searching for MobileNetV3. arXiv:1905.02244, May 2019.
Ilya Loshchilov, Frank Hutter. Decoupled Weight Decay Regularization. arXiv:1711.05101, November 2017.
Jiankang Deng, Jia Guo, Niannan Xue, Stefanos Zafeiriou. ArcFace: Additive Angular Margin Loss for Deep Face Recognition. arXiv:1801.07698, January 2018.
Jie Hu, Li Shen, Samuel Albanie, Gang Sun, Enhua Wu. Squeeze-and-Excitation Networks. arXiv:1709.01507, September 2017.
Leslie N. Smith. A disciplined approach to neural network hyper-parameters. arXiv:1803.09820, March 2018.
Sheng Chen, Yang Liu, Xiang Gao, Zhen Han. MobileFaceNets: Efficient CNNs for Accurate Real-Time Face Verification on Mobile Devices. arXiv:1804.07573, April 2018.
Terrance DeVries, Graham W. Taylor. Improved Regularization of Convolutional Neural Networks with Cutout. arXiv:1708.04552, November 2017.
Yuge Huang, Yuhan Wang, Ying Tai, Xiaoming Liu, Pengcheng Shen, Shaoxin Li, Jilin Li, Feiyue Huang. CurricularFace: Adaptive Curriculum Learning Loss for Deep Face Recognition. arXiv:2004.00288, April 2020.
Xianyang Li, Feng Wang, Qinghao Hu, Cong Leng. AirFace: Lightweight and Efficient Model for Face Recognition. arXiv:1907.12256, July 2019.
Xiaobo Wang, Shuo Wang, Shifeng Zhang, Tianyu Fu, Hailin Shi, Tao Mei. Support Vector Guided Softmax Loss for Face Recognition. arXiv:1812.11317, December 2018.