Much of the world's information is locked away in paper and PDF documents. Many companies process hundreds of thousands of documents and contracts each year, yet the data within them remains largely inaccessible. Extracting that data is challenging because document layouts and data field locations vary widely from document to document. For certain documents, the pages are often scanned and out of order, and the data itself may be typed, handwritten, or a combination of both.
Here we describe a production-deployed AI architecture capable of detecting, classifying, and extracting data fields across an array of contract types and variants. The system combines content classification and computer vision algorithms. It identifies individual contract pages, regardless of their order, in contracts from many states, and can correctly locate, classify, and extract over a dozen unique data fields.
Our case study is a production system we recently developed at Keller Williams, the world’s largest real estate company. It gives Keller Williams an enormous new source of data for predictive analytics by unlocking data previously trapped in hundreds of thousands of documents.
First, we evaluated several techniques that claim to have this capability. The challenge was selecting one that allowed us to achieve an extremely high level of accuracy. For object detection we used the Single Shot MultiBox Detector (SSD) from TensorFlow, a deep convolutional neural network with millions of parameters. We adapted the anchor boxes to account for the shapes of text fields, and we used learned embeddings for categorical variables such as document type and page number. The model outputs both the bounding boxes of relevant text fields and the category of each field (for example, cash amount, loan amount, property address).
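To illustrate what adapting anchor boxes to text fields can look like, here is a minimal sketch that generates anchor shapes from a base size, a set of scales, and a set of aspect ratios. The function name, the base size of 32 pixels, and both ratio sets are assumptions chosen for illustration; they are not the values used in the production system.

```python
import math


def anchor_shapes(base_size, scales, aspect_ratios):
    """Return (width, height) pairs for every scale/ratio combination.

    Each anchor has area (base_size * scale) ** 2 and width / height == ratio.
    """
    shapes = []
    for scale in scales:
        area = (base_size * scale) ** 2
        for ratio in aspect_ratios:
            height = math.sqrt(area / ratio)
            width = height * ratio
            shapes.append((width, height))
    return shapes


# Hypothetical ratio sets (width / height). Generic object detectors often use
# roughly square anchors; text fields tend to be much wider than tall.
GENERIC_RATIOS = (0.5, 1.0, 2.0)
TEXT_FIELD_RATIOS = (1.0, 2.0, 4.0, 8.0, 16.0)

generic = anchor_shapes(32, (1.0,), GENERIC_RATIOS)
text_field = anchor_shapes(32, (1.0,), TEXT_FIELD_RATIOS)

for width, height in text_field:
    print(f"{width:7.1f} x {height:5.1f}")
```

Keeping the area fixed while varying the ratio means wide anchors become short, which matches single-line fields such as an address or a dollar amount.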
We based our implementation on the RetinaNet model proposed in the 2017 research paper “Focal Loss for Dense Object Detection.” It supports a variety of backbone networks and uses a focal loss that addresses the extreme foreground-background class imbalance encountered when training dense detectors. We augmented the model’s architecture with the learned categorical embeddings described above to give it contextual information about the input document type and page number.
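The focal loss from that paper can be written for a single binary prediction as FL(p_t) = -alpha_t * (1 - p_t)^gamma * log(p_t), where p_t is the predicted probability of the true class. The sketch below is a plain-Python illustration of that formula, not the training code of the system described here; the defaults alpha = 0.25 and gamma = 2 are the values the paper reports as working well, and the function name and clipping constant are assumptions.

```python
import math


def focal_loss(p, y, alpha=0.25, gamma=2.0, eps=1e-7):
    """Focal loss for one binary prediction.

    p: predicted probability of the foreground class, in [0, 1]
    y: ground-truth label, 1 for foreground and 0 for background
    """
    p = min(max(p, eps), 1.0 - eps)           # clip to avoid log(0)
    p_t = p if y == 1 else 1.0 - p            # probability of the true class
    alpha_t = alpha if y == 1 else 1.0 - alpha
    return -alpha_t * (1.0 - p_t) ** gamma * math.log(p_t)


# An easy background example (confidently correct) contributes almost nothing,
# while a hard foreground example (confidently wrong) dominates the loss.
easy_background = focal_loss(0.01, 0)
hard_foreground = focal_loss(0.10, 1)
print(easy_background, hard_foreground)
```

With gamma set to 0 the modulating factor disappears and the expression reduces to alpha-weighted cross-entropy, which is a convenient sanity check.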
We used an iterative process of model training and data refinement. We had to develop a system for labeling bounding boxes and correcting incorrect category labels, and incorrect box coordinates had to be found and corrected as well. The model was trained on GPU machines in the cloud. Given the large variation in the shape and size of bounding boxes, the model sometimes missed very small or very elongated text fields. We adapted the anchor boxes and augmented the coordinate labels to achieve better detection accuracy.
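One way to see why elongated fields get missed is to check how well any anchor shape can overlap a labeled box. The sketch below compares shapes only, treating the box and the anchor as sharing a center, and reports the best intersection-over-union (IoU). The anchor sets and the 320 x 20 pixel example box are hypothetical; the 0.5 threshold mentioned in the comments is the value RetinaNet uses to assign anchors to ground-truth boxes.

```python
def shape_iou(box, anchor):
    """IoU of two (width, height) shapes placed on the same center."""
    box_w, box_h = box
    anchor_w, anchor_h = anchor
    intersection = min(box_w, anchor_w) * min(box_h, anchor_h)
    union = box_w * box_h + anchor_w * anchor_h - intersection
    return intersection / union


def best_anchor_iou(box, anchors):
    """Highest IoU any anchor shape can reach for this labeled box."""
    return max(shape_iou(box, anchor) for anchor in anchors)


# Hypothetical anchor sets, as (width, height) in pixels.
square_ish_anchors = [(64, 64), (45, 90), (90, 45)]
wide_anchors = [(128, 32), (256, 16)]

# A hypothetical single-line text field: very wide and very short.
text_field_box = (320, 20)

# Below 0.5, no anchor would be assigned to the box as a positive example.
print(best_anchor_iou(text_field_box, square_ish_anchors))
print(best_anchor_iou(text_field_box, wide_anchors))
```

Running the same check over all labeled boxes gives a quick estimate of how many fields a given anchor configuration can match at all, before any training time is spent.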
Building and training an accurate model is one thing; making it deployable and accessible to users is another. Our model was deployed as an independent web service with a REST API. After a user uploads a PDF document, the service converts each page into a high-resolution image, classifies the document type and page number, detects and classifies the target text fields, and then extracts each region and performs OCR on it. The service uses a queue-based architecture to handle requests asynchronously.
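The processing steps and the queue-based handling can be sketched with Python's standard `queue` and `threading` modules. Everything below is an assumption made for illustration: the stage functions are stubs with hypothetical names and return values, and a real service would put a REST endpoint in front of the queue and call an actual PDF rasterizer, the detection model, and an OCR engine.

```python
import queue
import threading


def convert_to_images(pdf_bytes):
    # Stub: a real service would rasterize each PDF page at high resolution.
    return ["page-image-0", "page-image-1"]


def classify_page(image):
    # Stub: a real classifier would predict document type and page number.
    return {"doc_type": "unknown", "page_number": 0}


def detect_fields(image, page_info):
    # Stub: a real detector would return boxes and field categories.
    return [{"category": "loan amount", "box": (0, 0, 10, 10)}]


def run_ocr(image, box):
    # Stub: a real OCR engine would read the text inside the box.
    return ""


def process_document(pdf_bytes):
    """Run every stage of the pipeline for one uploaded document."""
    extracted = []
    for image in convert_to_images(pdf_bytes):
        page_info = classify_page(image)
        for field in detect_fields(image, page_info):
            text = run_ocr(image, field["box"])
            extracted.append({**page_info, **field, "text": text})
    return extracted


def worker(jobs, results):
    """Pull jobs off the queue until a None sentinel arrives."""
    while True:
        job = jobs.get()
        if job is None:
            jobs.task_done()
            break
        job_id, pdf_bytes = job
        results[job_id] = process_document(pdf_bytes)
        jobs.task_done()


jobs = queue.Queue()
results = {}
threading.Thread(target=worker, args=(jobs, results), daemon=True).start()

# An upload handler would enqueue the document and return the job id at once.
jobs.put(("job-1", b"placeholder pdf bytes"))
jobs.put(None)   # tell the worker to stop after the pending job
jobs.join()      # wait until every queued item has been processed
print(results["job-1"])
```

Decoupling the upload from the processing this way lets the request return immediately while slow steps such as page rendering and OCR run in the background.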
By introducing techniques such as focal loss, adaptive anchor boxes, and learned embeddings, we produced a highly accurate model for detecting text fields in documents, with a mean average precision (mAP) of over 0.9. This is a significant improvement over the off-the-shelf open source SSD-based deep convolutional neural network, which achieved an mAP of 0.3.
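For readers unfamiliar with the metric, the sketch below shows one common way to compute average precision for a single field category and then average across categories. It uses all-point interpolation of the precision-recall curve; the exact evaluation protocol behind the figures above is not stated, so treat this as an assumption. The function names and the toy detections are hypothetical, and each category is assumed to have at least one ground-truth box.

```python
def average_precision(detections, num_ground_truth):
    """Area under the interpolated precision-recall curve for one category.

    detections: list of (confidence, is_true_positive) pairs
    num_ground_truth: number of labeled boxes for this category (must be > 0)
    """
    ranked = sorted(detections, key=lambda d: d[0], reverse=True)
    true_positives = 0
    precisions, recalls = [], []
    for rank, (_, is_true_positive) in enumerate(ranked, start=1):
        if is_true_positive:
            true_positives += 1
        precisions.append(true_positives / rank)
        recalls.append(true_positives / num_ground_truth)

    # Make precision non-increasing from left to right (interpolation).
    for i in range(len(precisions) - 2, -1, -1):
        precisions[i] = max(precisions[i], precisions[i + 1])

    area, previous_recall = 0.0, 0.0
    for precision, recall in zip(precisions, recalls):
        area += (recall - previous_recall) * precision
        previous_recall = recall
    return area


def mean_average_precision(per_category):
    """per_category maps a category name to (detections, num_ground_truth)."""
    scores = [average_precision(d, n) for d, n in per_category.values()]
    return sum(scores) / len(scores)


# Toy data: one false positive ranked between two correct detections.
toy = {
    "loan amount": ([(0.9, True), (0.8, False), (0.7, True)], 2),
    "property address": ([(0.9, True), (0.8, True)], 2),
}
print(mean_average_precision(toy))
```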