The state-of-the-art of extracting and unlocking unstructured data in documents
Much of the world’s information is held on paper or in PDFs, many of which are simply scans of physical documents. Document Analysis and Recognition (DAR) is the term for the effort to use computers to crack open these static documents and make them more usable and useful.
Once documents are unlocked and machine readable, a lot can be done with them using what’s called text mining or text analytics, including:
- Auto-summarization or semantic summarization
- Machine translation
- Natural language understanding
- Question answering
- Relationship extraction
- Text to speech
- Syntax analysis
- Entity recognition
- Content classification
- Text visualization
Multiple companies offer text analytics as machine-learning-as-a-service microservices, including:
- Wolfram Natural Language Understanding System
- Microsoft Azure Text Analytics
- Amazon Comprehend
- Google Cloud Natural Language
- IBM Watson Natural Language Understanding
For documents that are not machine readable, such as those scanned as PDFs, optical character recognition (OCR) is the key means of text recognition: the conversion of characters in a digital image to digital text. Although commercial OCR dates back to the 1950s and results can be very impressive, obtaining consistently high accuracy remains a challenging problem.
The best commercial OCR capabilities are available as machine-learning-as-a-service microservices, including:
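To make "accuracy" concrete, here is a minimal sketch of one common way to score OCR output: the character error rate (CER), which is the edit distance between the recognized text and a hand-checked reference, divided by the length of the reference. The sample strings and function names are illustrative assumptions, not output from any of the services mentioned in this article.

```python
def edit_distance(reference: str, hypothesis: str) -> int:
    """Levenshtein distance: fewest insertions, deletions, and substitutions."""
    previous = list(range(len(hypothesis) + 1))
    for i, ref_char in enumerate(reference, start=1):
        current = [i]
        for j, hyp_char in enumerate(hypothesis, start=1):
            substitution = previous[j - 1] + (ref_char != hyp_char)
            insertion = current[j - 1] + 1
            deletion = previous[j] + 1
            current.append(min(substitution, insertion, deletion))
        previous = current
    return previous[-1]


def character_error_rate(reference: str, hypothesis: str) -> float:
    """Edit distance normalized by the length of the reference text."""
    if not reference:
        raise ValueError("reference text must not be empty")
    return edit_distance(reference, hypothesis) / len(reference)


# Hypothetical example: "m" misread as "rn" and "i" misread as "l".
ground_truth = "Document Analysis"
ocr_output = "Docurnent Analysls"
cer = character_error_rate(ground_truth, ocr_output)
print(f"CER: {cer:.3f}  character accuracy: {1 - cer:.3f}")
```

Confusions like "rn" for "m" are why a page that looks almost perfect to the eye can still carry a noticeable error rate.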
- Google Cloud Vision API
- Microsoft Azure Computer Vision API
- Amazon Rekognition Text in Image and Computer Vision
Recognizing handwritten text is an even more formidable task than recognizing printed text, and the state of the art is not very good. Handwritten Text Recognition (HTR) systems must handle overlapping characters, a mixture of cursive and non-cursive writing, and huge variations in writing styles. The task can be nearly impossible in some cases. Many of us have even had the strange experience of struggling to read our own handwriting.
Until recently, HTR accuracy improved at a slow pace. Most gains were minimal and resulted from small tweaks to existing statistical modeling techniques, such as Hidden Markov Models (HMMs). The core algorithms remained fundamentally unchanged and recognition rates were low for even the best HTR systems.
Recent advances in machine learning, however, have revolutionized the field. In particular, the use of Convolutional Neural Networks (CNNs or ConvNets) and Long Short-Term Memory (LSTM) networks has produced the most significant accuracy improvements in decades. These hybrid deep networks are more robust, handle a larger range of handwriting inputs, and constitute a fundamentally new approach to HTR.
LSTM networks are a type of Recurrent Neural Network (RNN) that can learn tasks requiring memories of events that happened thousands or even millions of discrete time steps earlier. This makes them ideal for HTR where letter and word orders are highly correlated.
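To illustrate the memory mechanism, here is a minimal sketch of a single-unit LSTM cell using the standard gate equations. The weights are hand-picked for illustration rather than trained, and the input format (a value plus a "write" flag) is an assumption made for this example. Real HTR networks learn their weights from data and use many units per layer. The demo stores one value and then feeds 1,000 distractor inputs to show that the cell state barely changes.

```python
import math


def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))


# Hand-picked (not trained) weights for a single-unit cell. Each gate has
# weights for (value, write_flag, previous hidden state) plus a bias.
PARAMS = {
    "forget":    (0.0, -40.0, 0.0,  20.0),  # keep memory unless write_flag is 1
    "input":     (0.0,  40.0, 0.0, -20.0),  # accept input only when write_flag is 1
    "output":    (0.0,   0.0, 0.0,  20.0),  # always expose the cell state
    "candidate": (1.0,   0.0, 0.0,   0.0),  # candidate memory is tanh(value)
}


def lstm_step(x, h_prev, c_prev, params=PARAMS):
    """Advance the cell by one time step; x is a (value, write_flag) pair."""
    value, write_flag = x

    def pre_activation(gate):
        w_value, w_flag, w_hidden, bias = params[gate]
        return w_value * value + w_flag * write_flag + w_hidden * h_prev + bias

    forget_gate = sigmoid(pre_activation("forget"))
    input_gate = sigmoid(pre_activation("input"))
    output_gate = sigmoid(pre_activation("output"))
    candidate = math.tanh(pre_activation("candidate"))

    c = forget_gate * c_prev + input_gate * candidate  # updated cell state
    h = output_gate * math.tanh(c)                     # updated hidden state
    return h, c


h, c = 0.0, 0.0
h, c = lstm_step((0.8, 1.0), h, c)  # write a value into memory
stored = c
for t in range(1000):               # distractor inputs with the write flag off
    h, c = lstm_step((math.sin(t), 0.0), h, c)

print(f"stored: {stored:.6f}  after 1000 steps: {c:.6f}")
```

The forget gate stays close to 1 and the input gate close to 0 while the write flag is off, which is what lets the stored value survive the distractors.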
Tesseract 4.0, an open source multilingual OCR/HTR engine maintained by Google, was re-architected in the summer of 2017 to use a hybrid CNN/LSTM deep neural network. The model was trained for several weeks on a corpus of 400,000 text lines spanning approximately 4,500 fonts. The reported accuracy gains are tremendous and the engine now supports over 100 languages.
Despite the impressive gains achieved with deep learning techniques, HTR continues to trail OCR in performance and accuracy. There are several key best practices one can follow, however, to help improve recognition results. These include:
- Image Resizing: Most systems work best on images with a resolution of 300 DPI or higher. Upscaling lower-resolution images can often dramatically improve recognition accuracy.
- Binarization: Binarization is the process of converting color images to black and white. HTR systems don’t require color information, so most will automatically convert images before processing them. This procedure can produce suboptimal images, however, when the page background contrast varies too widely, so it’s important to make sure your images have a good separation of text from background.
- Noise Removal: Random variation in image brightness or color (noise) can also reduce recognition accuracy. Most HTR systems attempt to denoise input images, but certain types of noise cannot be eliminated. To minimize noise levels, always use good illumination when scanning documents.
- Deskewing: Documents that are not well aligned when scanned produce skewed output, with text flowing across the page at an angle instead of horizontally. This can severely affect line segmentation and reduce recognition accuracy.
- Lexical Matching: Recognition accuracy can also be improved if the output is constrained by a lexicon, a list of words that are allowed to occur in a document. This is typically a dictionary of valid words in the language being processed. This simple technique can eliminate many common errors.
- Field-Specific Models: Field-specific models use transfer learning, both fine-tuning and head retraining, to extend an existing model by training it on additional data sets specific to the problem domain. By reducing the range of inputs each model must recognize, field-specific models often achieve better performance and higher accuracy.
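Two of the practices above, binarization and lexical matching, are easy to sketch in a few lines. The example below is a rough illustration under stated assumptions: it uses Otsu's method to pick a global threshold (one common choice, not necessarily what any given HTR system does internally) and Python's standard difflib module as a stand-in for lexicon matching. The tiny "page" and the word list are made up for the demo.

```python
import difflib


def otsu_threshold(pixels):
    """Return the gray level (0-255) that best separates dark from light pixels."""
    histogram = [0] * 256
    for value in pixels:
        histogram[value] += 1
    total = len(pixels)
    sum_all = sum(level * count for level, count in enumerate(histogram))
    best_threshold, best_variance = 0, -1.0
    weight_dark, sum_dark = 0, 0
    for level in range(256):
        weight_dark += histogram[level]
        if weight_dark == 0:
            continue
        weight_light = total - weight_dark
        if weight_light == 0:
            break
        sum_dark += level * histogram[level]
        mean_dark = sum_dark / weight_dark
        mean_light = (sum_all - sum_dark) / weight_light
        # Between-class variance, up to a constant factor.
        variance = weight_dark * weight_light * (mean_dark - mean_light) ** 2
        if variance > best_variance:
            best_variance, best_threshold = variance, level
    return best_threshold


def binarize(image):
    """Map a grayscale image (rows of 0-255 ints) to 0 (ink) or 255 (paper)."""
    threshold = otsu_threshold([value for row in image for value in row])
    return [[0 if value <= threshold else 255 for value in row] for row in image]


def correct_with_lexicon(word, lexicon, cutoff=0.75):
    """Replace a word with its closest lexicon entry, if one is similar enough."""
    lowered = word.lower()
    if lowered in lexicon:
        return word
    matches = difflib.get_close_matches(lowered, lexicon, n=1, cutoff=cutoff)
    return matches[0] if matches else word


# Hypothetical 3x4 grayscale patch: dark ink strokes on light paper.
page = [
    [210, 205, 60, 215],
    [200, 55, 50, 220],
    [208, 212, 58, 206],
]
binary_page = binarize(page)

# Hypothetical recognizer output checked against a small domain lexicon.
lexicon = ["invoice", "total", "amount", "payment"]
recognized = ["lnvoice", "tota1", "amount", "qzx"]
corrected = [correct_with_lexicon(word, lexicon) for word in recognized]
print(binary_page)
print(corrected)
```

A global threshold like this is exactly what struggles when background contrast varies across the page. The cutoff in correct_with_lexicon is a tunable assumption: set it too low and valid but unlisted words get "corrected" into something else.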
For the times when computers can’t accurately assess either printed or handwritten text, have low confidence in their findings, or run into exceptions, the fallback is to create a human-in-the-loop workflow to properly identify what was written. In other words, a person is asked to read what something says and type the answer. With this approach, an overall workflow can be very accurate, even if the OCR and HTR can’t handle certain situations. Top vendors of these human-in-the-loop workflow services include Alegion and Figure Eight.
Finally, for those interested in digging deeper into these areas, there are several important recurring technical conferences on Document Analysis and Recognition:
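The routing step in such a workflow can be sketched as a simple confidence check. This is a minimal illustration, assuming the recognizer reports a confidence score between 0 and 1 for each field. The FieldResult type, the 0.90 threshold, and the sample values are all hypothetical, not part of any vendor's API.

```python
from dataclasses import dataclass


@dataclass
class FieldResult:
    field_name: str
    text: str
    confidence: float  # assumed range: 0.0 to 1.0, as reported by the recognizer


def route_results(results, threshold=0.90):
    """Split results into auto-accepted fields and fields for human review."""
    accepted, needs_review = [], []
    for result in results:
        if result.confidence >= threshold:
            accepted.append(result)
        else:
            needs_review.append(result)
    return accepted, needs_review


# Hypothetical recognizer output for one scanned form.
results = [
    FieldResult("name", "Jane Doe", 0.98),
    FieldResult("date", "O3/14/2O18", 0.62),
    FieldResult("amount", "120.00", 0.90),
]
accepted, needs_review = route_results(results)
print("auto-accepted:", [r.field_name for r in accepted])
print("sent to a person:", [r.field_name for r in needs_review])
```

The threshold is the main design choice: raising it sends more fields to people and improves overall accuracy at the cost of more manual work.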
- International Workshop on Document Analysis Systems (DAS 2018), which was held in April in Vienna, Austria
- Summer School on Document Analysis (SSDA 2018), July 2–6, 2018 in La Rochelle, France
- International Conference on Frontiers in Handwriting Recognition (ICFHR 2018), August 5–8, 2018 in Niagara Falls, NY
- International Conference on Document Analysis and Recognition (ICDAR 2019), September 22–25, 2019 in Brisbane, Australia
- Text Analytics Forum, November 7–8, 2018 in Washington, DC
New deep learning techniques have revolutionized the field of document and text analysis and are contributing to dramatic improvements in the state-of-the-art. Unlocking insights from unstructured data captured in static documents has broad applications with new use cases popping up all the time. Unfathomable amounts of data and insights are currently hidden in billions of physical and PDF documents. Imagine the intelligence and informed actions your business could unlock with these new technologies.