Paper to Digital

December 14, 2018

AI Industry Insights

Paper to Digital: Unlocking Data in Documents

Much of the world's information is locked away in paper and PDF documents. Many companies process hundreds of thousands of documents and contracts each year, yet the data within them remains largely inaccessible. Extracting that data is challenging because document layouts and data field locations vary widely from document to document. For certain documents, the pages are often scanned and out of order, and the data itself may be typed, handwritten, or a combination of both.

Here we describe a production-deployed AI architecture capable of detecting, classifying, and extracting data fields across an array of contract types and variants. The system combines content classification and computer vision algorithms. It can identify individual pages of contracts from many states, regardless of page order, and can correctly locate, classify, and extract over a dozen unique data fields.

Our case study is a production-deployed system we recently developed at Keller Williams, the world’s largest real estate company. Keller Williams uses content classification and computer vision algorithms to extract data from contracts. The system gives Keller Williams an enormous new source of data for predictive analytics by unlocking data previously trapped in hundreds of thousands of documents.

Technique Exploration

First we evaluated several techniques that claim to have this capability. The challenge was selecting a technique that allowed us to achieve an extremely high level of accuracy. For object detection we used the Single Shot MultiBox Detector (SSD), a deep convolutional neural network with millions of parameters, implemented in TensorFlow. Anchor boxes were adapted to account for the shapes of text fields. We used learned embeddings for categorical variables such as document type and page number. The output of the model includes both the bounding boxes of relevant text fields and the category of each field (for example, cash amount, loan amount, property address).
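To make the idea of learned embeddings for categorical variables concrete, here is a minimal sketch in plain Python. It is an illustration under stated assumptions, not the production model: the class `EmbeddingTable`, the document type names, the page limit, and the embedding size are all hypothetical, and the vectors are only randomly initialised, whereas in a real network they would be trainable parameters updated during training.

```python
import random


class EmbeddingTable:
    """Lookup table that maps a categorical id to a dense vector.

    In a real model these vectors are trainable parameters updated by
    backpropagation; here they are only randomly initialised.
    """

    def __init__(self, num_categories, dim, seed=0):
        rng = random.Random(seed)
        self.dim = dim
        self.vectors = [
            [rng.uniform(-0.05, 0.05) for _ in range(dim)]
            for _ in range(num_categories)
        ]

    def lookup(self, category_id):
        return self.vectors[category_id]


# Hypothetical vocabularies, for illustration only.
DOC_TYPES = {"purchase_contract": 0, "listing_agreement": 1}
MAX_PAGES = 20

doc_type_embedding = EmbeddingTable(len(DOC_TYPES), dim=4, seed=1)
page_embedding = EmbeddingTable(MAX_PAGES, dim=4, seed=2)


def context_vector(doc_type, page_number):
    """Concatenate document-type and page embeddings (page_number is 1-based)."""
    return (doc_type_embedding.lookup(DOC_TYPES[doc_type])
            + page_embedding.lookup(page_number - 1))
```

Calling `context_vector("purchase_contract", 3)` returns one flat vector of length 8 that a detector could consume alongside its image features. How the vector is actually fused with the image features is a design choice this sketch does not cover.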

We based our implementation on the RetinaNet model proposed in a 2017 research paper “Focal Loss for Dense Object Detection”. It supports a variety of backend models and uses a focal loss that addresses the extreme foreground-background class imbalance encountered during training of dense detectors. We augmented the model’s architecture by incorporating the learned categorical embeddings described above to give it contextual information about the input document type and page number.
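The focal loss from that paper can be written as FL(p_t) = -α_t (1 - p_t)^γ log(p_t), where p_t is the probability the model assigns to the true class. The sketch below is a plain-Python illustration for a single binary prediction, not our training code. The defaults α = 0.25 and γ = 2 are the values the paper reports as working well; the helper `cross_entropy` is included only for comparison.

```python
import math


def focal_loss(p, y, alpha=0.25, gamma=2.0, eps=1e-7):
    """Binary focal loss for one predicted probability.

    p: predicted probability of the foreground class, in [0, 1]
    y: ground-truth label, 1 for foreground and 0 for background
    """
    p = min(max(p, eps), 1.0 - eps)         # avoid log(0)
    p_t = p if y == 1 else 1.0 - p          # probability given to the true class
    alpha_t = alpha if y == 1 else 1.0 - alpha
    return -alpha_t * (1.0 - p_t) ** gamma * math.log(p_t)


def cross_entropy(p, y, eps=1e-7):
    """Standard binary cross-entropy, for comparison."""
    p = min(max(p, eps), 1.0 - eps)
    return -math.log(p if y == 1 else 1.0 - p)
```

For an easy background location, such as `p = 0.01` with `y = 0`, the factor (1 - p_t)^γ shrinks the loss to a tiny fraction of the cross-entropy value, while a confidently wrong prediction keeps most of its loss. This is how the many easy background anchors are prevented from dominating training.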

Model Training

We used an iterative process of model training and data refinement. We had to develop a system for labeling bounding boxes and correcting incorrect category labels; box coordinates also had to be checked and corrected. The model was trained on GPU machines in the cloud. Given the large variation in the shape and size of bounding boxes, the model sometimes missed very small or very elongated text fields. We adapted the anchor boxes and augmented the coordinate labels to achieve better detection accuracy.
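The effect of adapting anchor boxes to elongated text fields can be sketched as follows. This is an illustration under assumptions, not our actual configuration: the base size, the scales, both sets of aspect ratios, and the example field size are hypothetical values chosen to show why wide anchors overlap long text fields better than roughly square ones.

```python
def make_anchors(base_size, scales, aspect_ratios):
    """Return anchor shapes as (width, height) pairs.

    aspect_ratios are width / height. Each anchor keeps the area
    (base_size * scale) ** 2 while its shape changes with the ratio.
    """
    anchors = []
    for scale in scales:
        area = (base_size * scale) ** 2
        for ratio in aspect_ratios:
            height = (area / ratio) ** 0.5
            width = height * ratio
            anchors.append((width, height))
    return anchors


def shape_iou(a, b):
    """IoU of two boxes that share the same centre, given as (width, height)."""
    inter = min(a[0], b[0]) * min(a[1], b[1])
    union = a[0] * a[1] + b[0] * b[1] - inter
    return inter / union


def best_iou(field, anchors):
    """Highest IoU any anchor shape reaches against one field shape."""
    return max(shape_iou(field, anchor) for anchor in anchors)


SCALES = [1.0, 1.26, 1.59]

# Ratios typical for natural-image objects versus hypothetical wide ratios
# chosen for single-line text fields.
default_anchors = make_anchors(64, SCALES, aspect_ratios=[0.5, 1.0, 2.0])
text_anchors = make_anchors(64, SCALES, aspect_ratios=[2.0, 5.0, 10.0])

address_field = (320, 24)  # hypothetical elongated text field, in pixels
```

With these numbers, `best_iou(address_field, default_anchors)` stays well below 0.5, while `best_iou(address_field, text_anchors)` rises above it. Detectors commonly require an IoU of around 0.5 before an anchor is treated as a positive match, so a field that no anchor overlaps well gives the model little to learn from.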

Solution Development

Building and training an accurate model is one thing; making the model deployable and accessible to users is another. Our model was deployed as an independent web service with a REST API. After a user uploads a PDF document, the service converts each page into a high-resolution image, classifies the document type and page number, detects and classifies target text fields, then extracts each region and performs OCR on it. The service uses a queue-based architecture to handle requests asynchronously.
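The request flow described above can be sketched with Python's standard `queue` and `threading` modules. This is a simplified illustration, not the deployed service: every stage function (`render_pages`, `classify_page`, `detect_fields`, `run_ocr`) is a hypothetical placeholder that returns dummy data, and an in-memory dictionary stands in for whatever job store and message queue a real deployment would use.

```python
import queue
import threading
import uuid

jobs = queue.Queue()   # pending uploads
results = {}           # job id -> status and extracted fields


# Placeholder stages returning dummy data; real versions would call a PDF
# renderer, the classification and detection models, and an OCR engine.
def render_pages(pdf_bytes):
    return [f"image-of-page-{i}" for i in range(1, 3)]


def classify_page(image):
    return {"doc_type": "unknown", "page_number": 1}


def detect_fields(image, page_info):
    return [{"category": "property address", "box": (0, 0, 10, 10)}]


def run_ocr(image, box):
    return ""


def process_document(pdf_bytes):
    """Run every stage in order for one uploaded document."""
    extracted = []
    for image in render_pages(pdf_bytes):
        page_info = classify_page(image)
        for field in detect_fields(image, page_info):
            text = run_ocr(image, field["box"])
            extracted.append({**page_info, "category": field["category"], "text": text})
    return extracted


def worker():
    """Consume jobs forever so uploads never block on model inference."""
    while True:
        job_id, pdf_bytes = jobs.get()
        try:
            results[job_id] = {"status": "done", "fields": process_document(pdf_bytes)}
        except Exception as exc:
            results[job_id] = {"status": "failed", "error": str(exc)}
        finally:
            jobs.task_done()


def submit(pdf_bytes):
    """What an upload endpoint might do: enqueue and return a job id at once."""
    job_id = str(uuid.uuid4())
    results[job_id] = {"status": "queued"}
    jobs.put((job_id, pdf_bytes))
    return job_id


threading.Thread(target=worker, daemon=True).start()
```

A client would call `submit(...)`, receive a job id immediately, and poll `results[job_id]` until the status changes from `queued`. Decoupling upload from processing in this way keeps the API responsive even when a document takes a long time to process.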

Accuracy

By combining techniques such as focal loss, adaptive anchor boxes, and learned embeddings, we produced a highly accurate document text field detection model with a mAP (mean average precision) of over 0.9, a significant improvement over the off-the-shelf open source SSD-based deep convolutional neural network, which reached 0.3 mAP.
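For readers unfamiliar with the metric, the sketch below shows one simplified way to compute average precision for a single field category and then average it across categories. It is an assumption-laden illustration, not the evaluation code behind the numbers above: it uses a non-interpolated AP and an IoU threshold of 0.5, and benchmark suites differ in both of those choices.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)


def average_precision(detections, ground_truth, iou_threshold=0.5):
    """Non-interpolated AP for one field category.

    detections: list of (score, box); ground_truth: list of boxes.
    """
    if not ground_truth:
        return 0.0
    matched = set()
    hits = []
    # Visit detections from most to least confident.
    for score, box in sorted(detections, key=lambda d: d[0], reverse=True):
        best_overlap, best_idx = 0.0, None
        for idx, gt in enumerate(ground_truth):
            if idx in matched:
                continue  # each ground-truth box can be matched only once
            overlap = iou(box, gt)
            if overlap > best_overlap:
                best_overlap, best_idx = overlap, idx
        if best_idx is not None and best_overlap >= iou_threshold:
            matched.add(best_idx)
            hits.append(True)
        else:
            hits.append(False)
    # Sum the precision at every rank where a new ground-truth box is found.
    ap, true_positives = 0.0, 0
    for rank, hit in enumerate(hits, start=1):
        if hit:
            true_positives += 1
            ap += true_positives / rank
    return ap / len(ground_truth)


def mean_average_precision(ap_per_category):
    """Average the per-category AP values into a single mAP score."""
    return sum(ap_per_category.values()) / len(ap_per_category)
```

A model that ranks every correct box above every false alarm, and misses none, scores an AP of 1.0 for that category. False alarms ranked above correct boxes, and missed fields, both pull the score down.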
