Much of the world’s information is locked away in paper and PDF documents. Keller Williams, the world’s largest real estate company, processes hundreds of thousands of contracts each year, yet the data within those contracts remains largely inaccessible.
Extracting the data is challenging because layouts and data field locations vary widely from document to document. For certain documents, the pages are scanned and out of order, and the data itself may be typed, handwritten, or a combination of both. We created a system that uses content classification and computer vision algorithms (convolutional neural networks, CNN, and optical character recognition, OCR) to extract data from Keller Williams’ offer contracts.
We produced a highly accurate text field detection model with a mean average precision (mAP) of over 0.9, a significant improvement over the 0.3 mAP of the off-the-shelf open source model, an SSD-based deep convolutional neural network. By unlocking data previously trapped in hundreds of thousands of documents, the system gives Keller Williams an enormous new source of data for predictive analytics.
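For readers unfamiliar with the mAP figures quoted above, the following is a minimal sketch of how average precision is commonly computed for a box detector. It is not the evaluation code used for this system. The IoU threshold of 0.5, the greedy matching rule, and all function names are assumptions chosen for illustration.

```python
from typing import List, Tuple

# Hypothetical box format for this sketch: (x_min, y_min, x_max, y_max)
Box = Tuple[float, float, float, float]


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def average_precision(
    preds: List[Tuple[float, Box]],  # (confidence, box) for one class
    truths: List[Box],               # labeled boxes for the same class
    iou_thresh: float = 0.5,         # assumed threshold, not from the source
) -> float:
    """Area under the interpolated precision-recall curve for one class."""
    if not truths:
        return 0.0
    matched = [False] * len(truths)
    hits = []
    # Visit predictions from most to least confident.
    for _, box in sorted(preds, key=lambda p: p[0], reverse=True):
        best_iou, best_j = 0.0, -1
        for j, truth in enumerate(truths):
            if matched[j]:
                continue  # simplified greedy rule: each truth matches once
            overlap = iou(box, truth)
            if overlap > best_iou:
                best_iou, best_j = overlap, j
        if best_j >= 0 and best_iou >= iou_thresh:
            matched[best_j] = True
            hits.append(True)   # true positive
        else:
            hits.append(False)  # false positive
    # Precision and recall after each ranked prediction.
    precisions, recalls, tp = [], [], 0
    for rank, hit in enumerate(hits, start=1):
        tp += int(hit)
        precisions.append(tp / rank)
        recalls.append(tp / len(truths))
    # Interpolate: precision at a recall level is the best precision to its right.
    for i in range(len(precisions) - 2, -1, -1):
        precisions[i] = max(precisions[i], precisions[i + 1])
    ap, prev_recall = 0.0, 0.0
    for p, r in zip(precisions, recalls):
        ap += (r - prev_recall) * p
        prev_recall = r
    return ap


def mean_average_precision(per_class_ap: List[float]) -> float:
    """mAP is the mean of the per-class average precisions."""
    return sum(per_class_ap) / len(per_class_ap)
```

Under these assumptions, a detector that finds every labeled field and ranks all correct boxes above any spurious ones scores 1.0, so a score above 0.9 indicates that few fields are missed and few false boxes outrank correct ones.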