A General Approach for Using 2D Object Detection for Facial ID

There are several ways to apply computer vision to facial recognition. In this scenario, we demonstrate an approach for identifying persons of interest. There are two broad routes to a positive ID. Deep learning may be applied to model human faces in three dimensions. Alternatively, 2D object detection techniques focused on human faces may be applied. The former requires a large-scale dataset of scanned 3D faces, which is prohibitively expensive to acquire, relies on techniques that are still being researched and developed, and achieves accuracy levels far below 80%. Success with this approach will be difficult, uncertain, and expensive, so we believe it is the less practical option. The latter (2D object detection) requires high-resolution photography as a dataset, which is achievable with a high-quality camera. 2D deep learning techniques, namely convolutional neural nets (or CNNs), are applied today to many types of identification. Using 2D facial recognition techniques to identify a person of interest is practical, can achieve a high degree of accuracy (above 80%), and reduces total technical debt. We recommend developing a 2D facial recognition model prototype as a pragmatic approach to positive facial identification of a person of interest.


A General Approach:

Data is key. Initially, you need to assess and map the data acquisition process and data structures (e.g. which camera is used, what the lighting is like, how many people generally appear in a photo). Based on this assessment, gather speed and accuracy requirements and establish what is deemed “acceptable”. Then assess and document hardware constraints.

Once a proper assessment is complete, establish UI/UX flows for using the system and define API endpoints. Establish a system for collecting videos/photos and labels. Then split the datasets into training, validation, and testing buckets. A proper data preprocessing pipeline will need to be developed to reduce blur and noise in the image and video data.
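As a minimal sketch of the bucketing step, the Python below assigns samples to training, validation, and testing buckets. It assumes each sample is an (image path, identity) pair, and it splits by identity rather than by photo so that one person's photos never straddle two buckets. That choice, the 80/10/10 proportions, and the names assign_bucket and split_dataset are illustrative assumptions, not requirements of the approach.

```python
import hashlib

def assign_bucket(identity: str, val_pct: int = 10, test_pct: int = 10) -> str:
    """Deterministically map an identity to 'train', 'val', or 'test'.

    Hashing the identity (instead of drawing a random number) keeps the
    assignment stable when new photos of the same person arrive later.
    """
    digest = hashlib.sha256(identity.encode("utf-8")).hexdigest()
    slot = int(digest, 16) % 100  # stable value in [0, 100)
    if slot < test_pct:
        return "test"
    if slot < test_pct + val_pct:
        return "val"
    return "train"

def split_dataset(samples):
    """Group (image_path, identity) pairs into the three buckets."""
    buckets = {"train": [], "val": [], "test": []}
    for path, identity in samples:
        buckets[assign_bucket(identity)].append((path, identity))
    return buckets
```

With only a handful of identities the realized proportions can be far from 80/10/10; the percentages only hold approximately once the number of identities is large.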

Architect a deep learning model (CNN) from scratch, or apply transfer learning to an existing model, and train the face detection, embedding, and classification models on face-specific data. After the core face detection and recognition models are developed, you will need to develop a tracking model to track multiple objects throughout the video.

Finally, a database and backend that integrates and serves the models will be developed. A human-in-the-loop process to continuously improve the model will need to be defined and implemented.


Model Selection/Development:

We recommend starting from open source solutions for face detection, keypoint extraction, alignment, and embeddings. This allows a quick evaluation of existing techniques on the application of interest and tells us where to focus: on improving the modeling techniques or on data acquisition. Labeled images or videos are then collected, with face bounding boxes and identities annotated.

Tune the selected model with further feature engineering and hyperparameter adjustments. Then test the system and identify where it fails. Once the model has been selected, you will need to create plans for re-training, testing, deployment, and monitoring.


Technique Recommendation:

For face detection and recognition models, we recommend using a convolutional neural net (e.g. ResNet) as a backbone. For the face detection loss function, we recommend using SSD loss (cross-entropy and regression loss). There are various choices of loss function for the face embedding model:

    • Cross entropy loss
    • Triplet loss
    • Center loss
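To make one of these options concrete, here is a minimal sketch of triplet loss for a single (anchor, positive, negative) triple, written in plain Python. It assumes squared Euclidean distance between embeddings and an illustrative margin of 0.2; both are assumptions of this sketch, and a real training pipeline would compute this over batches inside a deep learning framework.

```python
def squared_distance(u, v):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v))

def triplet_loss(anchor, positive, negative, margin=0.2):
    """Loss for one triple of embeddings.

    anchor and positive are embeddings of the same person; negative is a
    different person. The loss is zero once the negative is farther from
    the anchor than the positive by at least `margin`.
    """
    gap = squared_distance(anchor, positive) - squared_distance(anchor, negative)
    return max(0.0, gap + margin)
```

The loss pulls same-person embeddings together and pushes different-person embeddings apart, which is what later makes distance in the embedding space meaningful.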

For keypoint extraction, we recommend using regression loss.

And, of course, you need training data. This may be a challenge. To achieve high levels of accuracy, you will need labeled training data on the order of 1 to 10 million faces.


Face Detection

Given an image, we need to detect the pixel region in which the face is present. Below is an example of our work where face detection is displayed.

[Figure: example face detection output]

Input: image

Output: 1) bounding boxes (top, left, bottom, right) that contain frontal or slightly side-facing faces; 2) keypoints on faces that correspond to eyes, nose and mouth
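A minimal sketch of how this output could be represented in code follows. The FaceDetection class and crop_face helper are hypothetical names introduced for illustration, the image is assumed to be a list of pixel rows, and treating bottom and right as exclusive bounds is an assumption of the sketch.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple

@dataclass
class FaceDetection:
    """One detected face: a bounding box plus named keypoints."""
    top: int
    left: int
    bottom: int  # exclusive (assumption of this sketch)
    right: int   # exclusive (assumption of this sketch)
    # Keypoint name -> (x, y) in image coordinates, e.g. "left_eye".
    keypoints: Dict[str, Tuple[float, float]] = field(default_factory=dict)

def crop_face(image: List[list], det: FaceDetection) -> List[list]:
    """Return the pixel region inside the box, clamped to the image bounds."""
    height = len(image)
    width = len(image[0]) if height else 0
    top, bottom = max(0, det.top), min(height, det.bottom)
    left, right = max(0, det.left), min(width, det.right)
    return [row[left:right] for row in image[top:bottom]]
```

The cropped patch is what the alignment and embedding stages consume.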

Use a state-of-the-art multi-path convolutional neural net to compute these outputs.


Face Alignment

Use facial keypoints to perform 2D transformations on the image, so that the eyes and mouth are in roughly normalized positions.


Input: a cropped image patch with detected face

Output: image patch with face aligned
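One simple way to derive such a 2D transformation is a similarity transform (rotation, uniform scale, and translation) that maps the two detected eye positions onto fixed target positions. The sketch below uses Python complex numbers to solve for it. The target coordinates are hypothetical values for a 112x112 crop, and using only the eyes (not the mouth) is a simplification of this sketch.

```python
def eye_alignment_transform(left_eye, right_eye,
                            target_left=(38.0, 52.0),
                            target_right=(74.0, 52.0)):
    """Solve z' = a*z + b so that the detected eyes land on the targets.

    Points are (x, y) tuples treated as complex numbers x + iy.
    `a` encodes rotation and uniform scale; `b` encodes translation.
    """
    p1, p2 = complex(*left_eye), complex(*right_eye)
    q1, q2 = complex(*target_left), complex(*target_right)
    if p1 == p2:
        raise ValueError("eye keypoints must be distinct")
    a = (q2 - q1) / (p2 - p1)
    b = q1 - a * p1
    return a, b

def apply_transform(a, b, point):
    """Map one (x, y) point through the transform."""
    z = a * complex(*point) + b
    return (z.real, z.imag)

def as_affine_matrix(a, b):
    """Express the same transform as a 2x3 affine matrix [R | t]."""
    return [[a.real, -a.imag, b.real],
            [a.imag, a.real, b.imag]]
```

The 2x3 matrix is the form that image libraries typically accept for warping; the warp itself is not shown here.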

Face Embedding (Descriptor)

Input: cropped face image patch

Output: a vector that describes the face

The face embedding model will be a convolutional neural net. It can be based on a pre-trained model (for example, one originally trained to classify 1000 object categories such as cats and dogs). Perform transfer learning and fine-tune the model on our face matching dataset, using either cross-entropy loss or triplet loss (newer losses continue to be proposed in more recent literature).


Face Clustering on the Graph

The embedded face descriptors live in a vector space. The distance between two faces in this vector space indicates how different they are. Clustering algorithms are run on the face graph defined by these pairwise distances between face vectors. Each cluster includes faces that are very likely of the same person.


Input: descriptors of faces

Output: face clusters, where each cluster is a set of faces that very likely belong to the same person.
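The text does not commit to a particular clustering algorithm, so the sketch below picks the simplest one that fits the graph description: connect two faces whenever their descriptors are closer than a threshold, then take connected components with union-find. The threshold value and the function names are assumptions; this single-linkage style of clustering can chain unrelated faces together through intermediate ones, so treat it as a baseline only.

```python
import math

def euclidean(u, v):
    """Euclidean distance between two equal-length descriptors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def cluster_faces(descriptors, threshold):
    """Return clusters as sorted lists of indices into `descriptors`."""
    n = len(descriptors)
    parent = list(range(n))

    def find(i):
        # Follow parent links to the root, halving the path as we go.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    # An edge exists between faces i and j when they are close enough.
    for i in range(n):
        for j in range(i + 1, n):
            if euclidean(descriptors[i], descriptors[j]) < threshold:
                root_i, root_j = find(i), find(j)
                if root_i != root_j:
                    parent[root_j] = root_i

    clusters = {}
    for i in range(n):
        clusters.setdefault(find(i), []).append(i)
    return sorted(clusters.values())
```

The pairwise loop is quadratic in the number of faces, which is fine for a prototype but would need an approximate nearest-neighbor index at larger scale.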


Project Flow Chart:

Pre-processing and Standardization

Labeled training data is required: photos, bounding boxes for faces, and an identity (a person’s name or ID) for each bounding box. The labels need to be organized into an easy-to-consume format, such as the PASCAL VOC format. For the scope of the prototype, we will assume that the data is available and already structured.
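As an illustration of consuming such labels, the sketch below reads one PASCAL VOC style XML annotation with Python's standard library. Storing the person identity in each object's name element is an assumption of this sketch (VOC uses that field for the object class), and the file name and identity in the example are made up.

```python
import xml.etree.ElementTree as ET

# Hypothetical annotation for one photo containing one labeled face.
EXAMPLE_XML = """
<annotation>
  <filename>lobby_0001.jpg</filename>
  <size><width>1920</width><height>1080</height><depth>3</depth></size>
  <object>
    <name>person_0042</name>
    <bndbox><xmin>100</xmin><ymin>50</ymin><xmax>220</xmax><ymax>200</ymax></bndbox>
  </object>
</annotation>
"""

def parse_voc_annotation(xml_text):
    """Return the file name and a list of face boxes with identities."""
    root = ET.fromstring(xml_text.strip())
    faces = []
    for obj in root.findall("object"):
        box = obj.find("bndbox")
        faces.append({
            "identity": obj.findtext("name"),
            "left": int(float(box.findtext("xmin"))),
            "top": int(float(box.findtext("ymin"))),
            "right": int(float(box.findtext("xmax"))),
            "bottom": int(float(box.findtext("ymax"))),
        })
    return {"filename": root.findtext("filename"), "faces": faces}
```

Calling parse_voc_annotation(EXAMPLE_XML) yields one face record whose box uses the same top, left, bottom, right convention as the detector output.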

[Figure: computer vision process flow chart]

Improving the System After Deployment

Construct a human-in-the-loop component that audits the model in production and annotates the correct labels of bounding boxes and person identities. This includes developing a batch training process that keeps ingesting new training data.

The purpose of this human-in-the-loop component is to enlarge the database of known faces and embedding vectors. Annotators add variation to each person’s photos (different lighting, face angles, with and without glasses, etc.) to increase the recall rate of recognizing that person in a new photo.

Accuracy depends on the distractor ratio (the number of distractors per query face) in 1:N face comparison. On MegaFace (N = 1 million), the most accurate systems achieve around 30% accuracy. In 1:1 face comparison, 99%+ accuracy can be achieved.
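The difference between the two comparison modes can be sketched in a few lines. In the hypothetical code below, verify answers the 1:1 question for a pair of descriptors, while identify searches a gallery of N known descriptors for the nearest match. The Euclidean metric, the threshold, and the function names are assumptions of this sketch.

```python
import math

def distance(u, v):
    """Euclidean distance between two equal-length descriptors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def verify(descriptor_a, descriptor_b, threshold):
    """1:1 comparison: are these two faces the same person?"""
    return distance(descriptor_a, descriptor_b) < threshold

def identify(query, gallery, threshold):
    """1:N comparison: return the closest gallery identity, or None.

    `gallery` is a list of (identity, descriptor) pairs. Every extra
    entry is a potential distractor that may sit closer to the query
    than the true match does.
    """
    best_identity, best_distance = None, float("inf")
    for identity, descriptor in gallery:
        d = distance(query, descriptor)
        if d < best_distance:
            best_identity, best_distance = identity, d
    return best_identity if best_distance < threshold else None
```

Each entry added to the gallery is one more chance for a wrong descriptor to win the nearest-neighbor search, which is why accuracy falls as N grows.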


Potential Challenges:

Poor-quality training and validation images will impede project success:

  • Lighting too weak or too strong.
  • Low resolution.
  • Motion blur.
  • Fisheye camera.

Complicated scenarios add noise, reducing the ability to identify faces accurately. For example, faces in crowded scenes or occluded by other objects must be isolated.

Additionally, the data size can be a burden on technology infrastructure and resources. Tens of millions of photos create a huge amount of data. This is problematic for limited compute resources and creates long training cycles, which in turn slow iteration.

Hardware imposes constraints. If the model is deployed “on the edge” (e.g. on a Raspberry Pi or mobile phone) instead of a cloud instance, only a small subset of models can be used. High usage can also be problematic: requests to the face recognition API can swamp the compute capacity if many camera streams consume the API simultaneously.

There will be speed constraints. If the model is required to generate real-time predictions, this restricts the size and type of models that can be used.


Overcoming Challenges:

Image Quality

Use a high-quality, high-resolution camera, and install it so as to achieve the optimal angle, distance, and lighting.

Complicated Scenario

Set expectations about the model’s limitations. Focus on narrower use cases that identify the highest-value targets.

Data Size

More compute resources may be required. We will develop a better distributed model training algorithm to lighten the load on resources.

High Amount of Usage

Provision enough machine resources during model deployment.


Training Process:

First, data needs to be gathered into a bucket in object storage (e.g. AWS S3). Next, develop a training pipeline using a framework such as TensorFlow, PyTorch, or Caffe. Then provision GPU resources. Once the models are trained, persist model artifacts and metrics to object storage.

During model training, we recommend performing continuous hyperparameter tuning. The knobs we tune include, but are not limited to:

    • Backbone architecture: VGG, ResNet, DenseNet, MobileNet, NASNet.
    • Learning rate scheduling.
    • Batch size.
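As a small illustration of these knobs, the sketch below enumerates a search space and defines a step-decay learning rate schedule. The specific values, the grid search strategy, and the names SEARCH_SPACE, iter_configs, and step_decay are assumptions for illustration; a real tuning setup would likely use a dedicated search tool and a schedule provided by the training framework.

```python
import itertools

# Hypothetical search space covering the three knobs listed above.
SEARCH_SPACE = {
    "backbone": ["resnet50", "mobilenet"],
    "base_lr": [0.1, 0.01],
    "batch_size": [64, 128],
}

def iter_configs(space):
    """Yield every combination of values as a config dictionary."""
    keys = sorted(space)
    for values in itertools.product(*(space[key] for key in keys)):
        yield dict(zip(keys, values))

def step_decay(base_lr, epoch, drop=0.1, epochs_per_drop=10):
    """Multiply the learning rate by `drop` every `epochs_per_drop` epochs."""
    return base_lr * (drop ** (epoch // epochs_per_drop))
```

Each yielded config would drive one training run, with the resulting metrics persisted alongside the model artifacts for comparison.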

Distributed training may be needed: if the number of images runs into the millions, training across multiple GPU instances is often required. We recommend Horovod (from Uber).

While 3D facial recognition is a viable solution, it’s not without challenges. It’s computationally expensive and current accuracy benchmarks may be too low for many applications. 2D object detection for face ID should be good enough to handle most use cases. The above provides an exploration of one approach of many. We encourage others to share theirs!