There are several techniques to leverage computer vision for facial recognition. In this scenario, we demonstrate an approach to facial recognition for identifying persons of interest. There are multiple ways to achieve a positive ID. Deep learning can be applied to map and identify a three-dimensional model of human faces. Alternatively, 2D object detection techniques focused on human faces can be applied. The former requires a large-scale dataset of scanned 3D faces, which is prohibitively expensive to acquire; it relies on techniques that are still under active research and development; and it achieves accuracy levels well below 80%. Success with this approach would be difficult, uncertain, and expensive, so we consider it less practical. The latter (2D object detection) requires high-resolution photography as a dataset, which is achievable with a high-quality camera. 2D deep learning techniques, namely convolutional neural nets (CNNs), are applied today to many types of identification tasks. Leveraging 2D facial recognition techniques to identify a person of interest is practical, can achieve a high degree of accuracy (above 80%), and reduces total technical debt. We recommend developing a 2D facial recognition model prototype as a pragmatic approach to positive facial identification of a person of interest.
A General Approach:
Data is key. First, assess and map the data acquisition process and data structures (e.g. which camera is used, what the lighting is like, how many people typically appear in a photo). Based on that assessment, gather speed and accuracy requirements, establishing what is deemed “acceptable”. Then assess and document hardware constraints.
Once a proper assessment is complete, establish UI/UX flows for using the system and define API endpoints. Set up a system for collecting videos/photos and their labels, then parse the dataset into training, validation, and testing buckets. A data preprocessing pipeline will need to be developed to reduce blur and noise in the image and video data.
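As a minimal sketch of the bucketing step above (the 80/10/10 split ratios and the fixed seed are illustrative choices, not requirements):

```python
import random

def split_dataset(items, train=0.8, val=0.1, seed=42):
    """Shuffle deterministically, then partition into train/val/test buckets."""
    rng = random.Random(seed)
    shuffled = items[:]
    rng.shuffle(shuffled)
    n_train = int(len(shuffled) * train)
    n_val = int(len(shuffled) * val)
    return (shuffled[:n_train],                      # training bucket
            shuffled[n_train:n_train + n_val],       # validation bucket
            shuffled[n_train + n_val:])              # testing bucket
```

Splitting at the level of whole photos (or whole subjects) rather than individual face crops avoids leaking the same face between buckets.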
Architect a deep learning model (a CNN) or apply transfer learning to an existing one, training dedicated face detection, embedding, and classification models. After the core face detection and recognition models are developed, you will need a tracking model to follow multiple faces through a video.
Finally, a database and backend that integrates and serves the models will be developed. A human-in-the-loop process to continuously improve the model will need to be defined and implemented.
We recommend starting from open-source solutions for face detection, keypoint extraction, alignment, and embeddings. This allows a quick evaluation of existing techniques on the application of interest, and tells us where to focus improvement efforts, whether in the modeling techniques or in data acquisition. Then labeled images or videos are collected, with face bounding boxes and identities annotated.
Tune the selected model with further feature engineering and hyperparameter adjustments. Then test the system and identify where it fails. Once the model has been selected, create a plan for re-training, testing, deployment, and monitoring.
For the face detection and recognition models, we recommend using a convolutional neural net (e.g. ResNet) as a backbone. For the face detection loss function, we recommend the SSD loss (cross-entropy plus box regression loss). There are various choices of loss function for the face embedding model, such as cross-entropy loss and triplet loss.
For keypoint extraction, we recommend using regression loss.
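As a hedged, pure-Python sketch of the loss terms named above (real pipelines compute these on framework tensors over batches; the clipping constant and the regression weight are illustrative defaults): cross-entropy for the face/no-face classification head, and smooth L1 for box or keypoint regression.

```python
import math

def cross_entropy(p_face, is_face):
    """Binary cross-entropy for the face / not-face classification head."""
    p = min(max(p_face, 1e-7), 1 - 1e-7)   # clip to avoid log(0)
    return -math.log(p) if is_face else -math.log(1 - p)

def smooth_l1(pred, target):
    """Smooth L1 (Huber) regression loss, summed over box/keypoint coords."""
    total = 0.0
    for p, t in zip(pred, target):
        d = abs(p - t)
        total += 0.5 * d * d if d < 1.0 else d - 0.5
    return total

def detection_loss(p_face, is_face, pred_box, gt_box, reg_weight=1.0):
    """SSD-style combined loss: classification plus weighted box regression."""
    return cross_entropy(p_face, is_face) + reg_weight * smooth_l1(pred_box, gt_box)
```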
And, of course, you need training data. This may be a challenge. To achieve high levels of accuracy, you will need labeled training data (1 to 10 million faces).
Given an image, we need to detect the pixel region in which the face is present. Below is an example of our work where face detection is displayed.
Output: 1) bounding boxes (top, left, bottom, right) that contain frontal or slightly side-facing faces; 2) keypoints on faces that correspond to eyes, nose and mouth
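Detections are typically matched to ground-truth boxes (and deduplicated) via intersection-over-union. A minimal version for the (top, left, bottom, right) box format above:

```python
def iou(a, b):
    """Intersection-over-union of two (top, left, bottom, right) boxes."""
    top, left = max(a[0], b[0]), max(a[1], b[1])
    bottom, right = min(a[2], b[2]), min(a[3], b[3])
    if bottom <= top or right <= left:
        return 0.0                                   # boxes do not overlap
    inter = (bottom - top) * (right - left)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)
```

A common convention is to count a detection as correct when its IoU with a ground-truth box exceeds 0.5.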
Adopt state-of-the-art multi-path convolutional neural nets for this computation.
Use facial keypoints to perform 2D transformations on the image, so that the eyes and mouth land in roughly normalized positions.
Input: a cropped image patch with detected face
Output: image patch with face aligned
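To make the alignment step concrete, here is a minimal sketch that builds a similarity transform (rotation, uniform scale, translation) from the two detected eye centers. The canonical target positions (0.35, 0.4) and (0.65, 0.4) in a unit-square crop are illustrative assumptions; production systems usually fit the transform over all keypoints.

```python
import math

def eye_alignment_transform(left_eye, right_eye,
                            target_left=(0.35, 0.4), target_right=(0.65, 0.4)):
    """Return a function mapping image points into an aligned unit-square
    crop, so the eye centers land on canonical positions."""
    dx = right_eye[0] - left_eye[0]
    dy = right_eye[1] - left_eye[1]
    angle = math.atan2(dy, dx)                         # roll of the face
    scale = (target_right[0] - target_left[0]) / math.hypot(dx, dy)
    cos_a = math.cos(-angle) * scale                   # rotate by -angle,
    sin_a = math.sin(-angle) * scale                   # then scale

    def apply(pt):
        x, y = pt[0] - left_eye[0], pt[1] - left_eye[1]
        return (x * cos_a - y * sin_a + target_left[0],
                x * sin_a + y * cos_a + target_left[1])
    return apply
```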
Face Embedding (Descriptor)
Input: cropped face image patch
Output: a vector that describes the face
The face embedding model will be a convolutional neural net. It can be based on a pre-trained model (for example, one originally trained to classify 1000 object categories, e.g. cats and dogs). Perform transfer learning and fine-tune the model on our face matching dataset, using either cross-entropy loss or triplet loss (newer losses are also being proposed in recent literature).
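A minimal pure-Python illustration of the triplet loss mentioned above (frameworks provide batched tensor versions; the 0.2 margin is an illustrative default):

```python
import math

def l2(u, v):
    """Euclidean distance between two embedding vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def triplet_loss(anchor, positive, negative, margin=0.2):
    """Zero once the negative (different identity) is at least `margin`
    farther from the anchor than the positive (same identity)."""
    return max(0.0, l2(anchor, positive) - l2(anchor, negative) + margin)
```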
Face Clustering on the Graph
The embedded face descriptors live in a vector space. The distance of two faces in this vector space indicates how different they are. Clustering algorithms are performed on the face graph specified by these pairwise distances between face vectors. Each cluster includes faces that are very likely of the same person.
Input: descriptors of faces
Output: face clusters; each cluster is a set of faces that very likely belong to the same person.
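As a sketch of this step, here is a greedy single-link clustering over pairwise embedding distances; the 0.6 distance threshold is an illustrative assumption that would be tuned on validation data.

```python
import math

def cluster_faces(descriptors, threshold=0.6):
    """Greedy single-link clustering: a face joins the first cluster
    containing any member closer than `threshold`, else starts a new one."""
    def dist(u, v):
        return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

    clusters = []                    # each cluster: list of descriptor indices
    for i, d in enumerate(descriptors):
        for cluster in clusters:
            if any(dist(d, descriptors[j]) < threshold for j in cluster):
                cluster.append(i)
                break
        else:
            clusters.append([i])
    return clusters
```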
Project Flow Chart:
Pre-processing and Standardization
Labeled training data must provide photos, bounding boxes for faces, and an identity (a person's name or ID) for each bounding box. The labels need to be organized into an easy-to-consume format, like the PASCAL VOC format. For the scope of the prototype, we will assume data availability and structure.
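To make the label format concrete, here is a minimal sketch of reading one PASCAL VOC annotation with Python's standard library (the sample file name and identity label below are hypothetical; we reuse VOC's `name` field to carry the person's identity):

```python
import xml.etree.ElementTree as ET

def parse_voc_annotation(xml_text):
    """Parse one PASCAL VOC annotation into (filename, [(identity, box), ...]),
    where box is (xmin, ymin, xmax, ymax) in pixels."""
    root = ET.fromstring(xml_text)
    filename = root.findtext("filename")
    objects = []
    for obj in root.iter("object"):
        identity = obj.findtext("name")          # person's name or ID
        bb = obj.find("bndbox")
        box = tuple(int(bb.findtext(k)) for k in ("xmin", "ymin", "xmax", "ymax"))
        objects.append((identity, box))
    return filename, objects
```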
Improving the System After Deployment
Construct a human-in-the-loop component that audits the model in production and annotates the correct labels of bounding boxes and person identities. This includes developing a batch training process that keeps ingesting new training data.
The purpose of this human-in-the-loop component is to enlarge the database of known faces and embedding vectors. Annotators will add variations of each person's photos, covering different lighting, face angles, with/without glasses, etc., to increase the recall rate of recognizing that person in a new photo.
Accuracy depends on the distractor ratio (# of distractors per query face) of the 1:N face comparison. On Megaface (N=1 million), the most accurate systems achieve around 30% accuracy. On 1:1 face comparison, 99%+ accuracy can be achieved.
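A sketch of open-set 1:N comparison against a gallery of known embeddings (the 0.6 acceptance threshold is an illustrative assumption; real systems calibrate it against the distractor ratio):

```python
import math

def identify(query, gallery, threshold=0.6):
    """Open-set 1:N identification: return the closest gallery identity,
    or None when even the best match is farther than `threshold`."""
    def dist(u, v):
        return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

    best_id, best_d = None, float("inf")
    for identity, descriptor in gallery.items():
        d = dist(query, descriptor)
        if d < best_d:
            best_id, best_d = identity, d
    return best_id if best_d <= threshold else None
```

Returning None for weak matches is what keeps the false-positive rate manageable as the gallery (and with it the distractor count) grows.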
The quality of the training and validation images can make or break the project:
Complicated scenarios add noise that degrades the ability to identify faces accurately; for example, crowded scenes and occlusion by other objects must be handled.
Additionally, data size can be a burden on technology infrastructure and resources. Tens of millions of photos amount to a huge volume of data, which is problematic for limited compute resources and leads to long training cycles and slow iteration.
Hardware provides constraints. If the model is deployed “on the edge” (e.g. on a Raspberry Pi or mobile phone) instead of a cloud instance, only a small subset of models can be used. High amounts of usage can be problematic. The requests to the face recognition API can swamp the compute capacity, if many streams of camera data simultaneously consume the API.
There will be speed constraints. If the model is required to generate real-time predictions, this restricts the size and type of models that can be used.
Use a high quality and high resolution camera; install the camera in such a way to achieve the optimal angle, distance, and lighting.
Set expectation on model’s limitation. Focus on narrower use cases that identify the highest value targets.
More compute resources may be required. We will develop a better distributed model-training setup to lighten the load on resources.
High Amount of Usage
Provision enough machine resources during model deployment.
First, data needs to be gathered into a bucket in object storage (e.g. AWS S3). Then develop a training pipeline using a TensorFlow/PyTorch/Caffe framework, and provision GPU resources. Once the models are trained, model artifacts and metrics are persisted back to object storage.
During model training, we recommend performing continuous hyperparameter tuning. The knobs to tune include, but are not limited to, learning rate, batch size, network depth, and data-augmentation settings.
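As a sketch of one simple tuning strategy, a grid sweep over a small hyperparameter space (the `evaluate` callback returning a validation score is a hypothetical stand-in for a full train-and-validate run):

```python
from itertools import product

def grid_search(param_grid, evaluate):
    """Exhaustive sweep over a small hyperparameter grid; returns the
    configuration with the lowest validation score."""
    keys = sorted(param_grid)
    best_cfg, best_score = None, float("inf")
    for values in product(*(param_grid[k] for k in keys)):
        cfg = dict(zip(keys, values))
        score = evaluate(cfg)        # e.g. validation loss after training
        if score < best_score:
            best_cfg, best_score = cfg, score
    return best_cfg, best_score
```

For larger search spaces, random or Bayesian search is usually preferred over an exhaustive grid.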
Distributed training may be needed. If the number of images exceeds millions, distributed training on multiple GPU instances is often needed. We recommend Horovod (from Uber).
While 3D facial recognition is a viable solution, it is not without challenges: it is computationally expensive, and current accuracy benchmarks may be too low for many applications. 2D object detection for face ID should be good enough to handle most use cases. The above is an exploration of one approach among many. We encourage others to share theirs!