First American Data & Analytics Chief Information Officer Calvin Powell shared his thoughts last quarter on the evolution of data extraction technology and how that evolution is fueling the next generation of title and escrow automation. It’s true that machine learning (ML) and artificial intelligence (AI) have changed the way real estate data providers in our industry do business and have shortened the time it takes to provide updated, searchable document images to the market. So much so that First American Data & Analytics has become the industry leader in leveraging these techniques, and it continues to innovate along the way.
We recently sat down with Prabhu Narsina, who manages Data Technologies and Data Science in Calvin’s group, to delve further into this topic and are excited to share just how First American Data & Analytics uses AI and ML in extracting information today, how the process used to be completed, and how the people behind the process are the most valued resource.
Q: First American Data & Analytics maintains the industry’s largest repository of document images: more than 8 billion and growing. We understand the information on these images and documents is searchable. What is involved in extracting this information and turning it into a usable format?
A: Before we can dive into that question, we want to set the foundation for understanding. We are talking about unstructured documents: images of documents such as deeds, mortgages, assignments, foreclosures, and so on. The average county probably has about 1,000 different types of documents that it records. These documents vary from county to county across the United States. First American Data & Analytics collects all these document types from more than 2,000 counties; this means we are taking in and processing somewhere between 200,000 and 300,000 new documents every day, with each document ranging from one page to hundreds of pages.
The quality of the documents also varies. Some could be a day old and in pristine condition, while others could date back to the 1890s. In the latter case, the document images may be very low quality and might include handwritten information, stamps, and/or watermarks that must be captured. Or they could be scanned tilted, vertically, horizontally, folded, reversed, and so on.
Knowing that is what we collect, how does First American Data & Analytics make the data on these images usable? By that we mean, how do we turn images into text so they can be read by automated tools and searched? This has been an evolutionary process that has gone on for more than a decade. During the last five or six years, artificial intelligence (AI) and machine learning (ML) have transformed and accelerated this process.
Q: Can you walk us through how it was done before AI, and how it has progressed?
A: Historically, First American Data & Analytics and other major data providers captured critical real estate information manually using double-key and verify processes. This was just what it sounds like: two processors would each enter the same information and the result would be compared using software that identified the differences. This was the process because it worked well with documents that were challenging to early Optical Character Recognition (OCR) technology, such as forms with handwriting on them. But it was expensive and slow.
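The double-key-and-verify process described above can be sketched in a few lines. This is an illustrative stand-in, not the actual production software; the field names and values are hypothetical examples.

```python
# Minimal sketch of double-key-and-verify: two operators key the same
# document independently, and software flags every field where they disagree.

def verify_double_key(entry_a: dict, entry_b: dict) -> list:
    """Compare two independently keyed records; return fields that disagree."""
    mismatches = []
    for field in sorted(set(entry_a) | set(entry_b)):
        if entry_a.get(field) != entry_b.get(field):
            mismatches.append((field, entry_a.get(field), entry_b.get(field)))
    return mismatches

# Hypothetical entries from two keyers; only "grantor" disagrees.
keyer_1 = {"grantor": "JOHN A SMITH", "doc_date": "1994-03-12", "book": "1021"}
keyer_2 = {"grantor": "JOHN A SMYTH", "doc_date": "1994-03-12", "book": "1021"}

mismatches = verify_double_key(keyer_1, keyer_2)
```

Only the flagged fields would go back to a human for adjudication, which is why the method was accurate but slow and expensive at scale.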
To control cost, usually no more than 40 to 50 fields on a form would be captured. We knew that we were leaving a lot of valuable information on the table by not extracting all data elements that could be on a single image, perhaps as many as 400 to 500.
A little over 10 years ago, we started building a text version of our document images using OCR and leveraged Lucene/Elasticsearch to automate data capture. This was successful up to a point, but we could see there was more that could be done.
Fast forward to the 2017-2018 timeframe, when we started exploring how machine learning could transform information extraction. During the past five or six years, the combination of ML and OCR has allowed First American Data & Analytics to increase the number of fields captured to more than 450 and to increase the number of counties covered four-fold, all with only a minimal increase in cost.
Q: Can you give us a deeper dive into how you are using machine learning?
A: One of the first steps of the ML exercise is to automate identification of the document type using ML classification techniques. For example, we leverage simple XGBoost binary classification as well as modern transformer-based models and multiple ensembles to achieve document recognition accuracy greater than 96%. This means we are within a fraction of a percentage point of human accuracy. To achieve these high accuracy rates, we are very proactive in retraining and tuning the models so that they recognize more than 1,000 document classes across more than 2,000 counties.
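The production system uses XGBoost and transformer ensembles, but the core idea of the classification step — mapping OCR text to a document class — can be shown with a toy bag-of-words nearest-centroid classifier. Everything below (classes, training snippets) is a hypothetical illustration, not the actual model.

```python
# Toy document-type classifier: score an incoming document's token overlap
# against a per-class "centroid" of token counts, and pick the best class.
from collections import Counter

# Hypothetical training snippets for two document classes.
TRAINING = {
    "deed": ["warranty deed grantor grantee convey", "quitclaim deed convey title"],
    "mortgage": ["mortgage lender borrower promissory note", "deed of trust lender secure loan"],
}

def centroid(texts):
    """Sum token counts across a class's training examples."""
    total = Counter()
    for text in texts:
        total.update(text.split())
    return total

CENTROIDS = {label: centroid(texts) for label, texts in TRAINING.items()}

def classify(text: str) -> str:
    """Score each class by token overlap with its centroid; return the best."""
    tokens = Counter(text.lower().split())
    def score(label):
        return sum(min(tokens[t], CENTROIDS[label][t]) for t in tokens)
    return max(CENTROIDS, key=score)
```

A gradient-boosted or transformer model replaces this overlap score with learned features, which is what pushes accuracy past 96% across 1,000-plus document classes.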
The next major step of the process is information extraction. We have developed multiple ML Named Entity Recognition (NER) and relationship extraction models to accomplish this. In the process, we leverage multiple ML techniques, ranging from simple Natural Language Processing (NLP) to more advanced bidirectional long short-term memory (BiLSTM) networks used in deep learning, to extract the critical fields that make up more than 60% of our volume.
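Production extraction relies on learned NER models (from classic NLP up to BiLSTM networks), but the shape of the task — pulling typed fields out of recording text — can be illustrated with a simple pattern-based extractor. The field names, patterns, and sample text below are illustrative only.

```python
# Rule-based stand-in for NER: tag a few recording fields with regex patterns.
import re

PATTERNS = {
    "grantor": re.compile(r"grantor[:\s]+([A-Z][A-Z .]+?)(?=,|$)", re.IGNORECASE),
    "grantee": re.compile(r"grantee[:\s]+([A-Z][A-Z .]+?)(?=,|$)", re.IGNORECASE),
    "recording_date": re.compile(r"recorded on (\d{4}-\d{2}-\d{2})", re.IGNORECASE),
}

def extract_entities(text: str) -> dict:
    """Return the first match for each known field, if present."""
    entities = {}
    for field, pattern in PATTERNS.items():
        match = pattern.search(text)
        if match:
            entities[field] = match.group(1).strip()
    return entities

# Hypothetical one-line rendering of an OCR'd deed.
sample = "Grantor: JOHN A SMITH, Grantee: MARY B JONES, recorded on 1994-03-12"
entities = extract_entities(sample)
```

Learned NER models generalize where fixed patterns break, across varying county formats, OCR noise, and handwriting, which is why deep learning is needed at this scale.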
Interestingly, while the huge volume of what we process is a challenge in one respect, it is also a big advantage in another because it enables the active learning part of ML. The combination of deep learning and traditional methods lets us see multiple occurrences of the same entities and relationships in the same document and then link the information to the right entities and relationships.
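The active-learning loop that high volume enables can be sketched as a confidence triage: confident predictions flow through automatically, while uncertain ones are routed to human review and the corrections feed the next retrain. The threshold and records below are hypothetical.

```python
# Minimal active-learning triage: auto-accept confident model outputs,
# queue low-confidence ones for human review and future retraining.

CONFIDENCE_THRESHOLD = 0.90  # illustrative cutoff, not a production value

def triage(predictions):
    """Split predictions into auto-accepted results and review candidates."""
    accepted, needs_review = [], []
    for record in predictions:
        if record["confidence"] >= CONFIDENCE_THRESHOLD:
            accepted.append(record)
        else:
            needs_review.append(record)
    return accepted, needs_review

# Hypothetical model outputs for two documents.
predictions = [
    {"doc_id": "doc-001", "label": "deed", "confidence": 0.98},
    {"doc_id": "doc-002", "label": "mortgage", "confidence": 0.62},
]
accepted, needs_review = triage(predictions)
# Human-corrected labels from needs_review would be added to the training
# set, so the models improve exactly where they are weakest.
```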
We have developed very sophisticated workflows that take in hundreds of thousands of documents and images each day, passing each through multiple steps and several ML models for document classification, NER/relation extraction, non-personal information detection, and much more. This platform allows us to onboard a new workflow quickly and is built on cloud architecture, where we scale the infrastructure based on the document workload entering our system, allowing us to meet our service level agreements with little additional cost.
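The multi-step workflow described above amounts to a pipeline where each document flows through a sequence of model stages. The skeleton below shows that structure only; the step functions are hypothetical placeholders, not the actual models.

```python
# Skeletal document workflow: each stage annotates the document record and
# passes it on, so new stages (or whole new workflows) slot in easily.

def classify_step(doc):
    doc["doc_class"] = "deed"                      # placeholder for the classifier
    return doc

def extract_step(doc):
    doc["entities"] = {"grantor": "JOHN A SMITH"}  # placeholder for NER models
    return doc

def screen_step(doc):
    doc["npi_checked"] = True                      # placeholder for the non-personal information check
    return doc

PIPELINE = [classify_step, extract_step, screen_step]

def run_pipeline(doc):
    """Push one document through every stage in order."""
    for step in PIPELINE:
        doc = step(doc)
    return doc

result = run_pipeline({"doc_id": "doc-001", "text": "..."})
```

Because each stage is independent, the compute behind any one of them can be scaled up or down with the incoming document workload, which is the property the cloud architecture exploits.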
In fact, by using deep learning inferences in sequence, rather than batch inference, we’re getting even faster, and the technology on our platform is reducing our costs by a factor of 10.
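One way to see the benefit of sequential over batch inference is completion time: a batched document waits for its batch to fill, while a sequentially scored document finishes as soon as it arrives. The arrival times and one-unit inference cost below are hypothetical, and this sketch models latency only, not the full cost picture.

```python
# Compare when each document finishes under sequential vs. batch inference.
# Assumes arrivals are sorted in time order.

def sequential_completion_times(arrivals, infer_cost=1.0):
    """Each document is scored as soon as it arrives (plus inference time)."""
    done, now = [], 0.0
    for t in arrivals:
        now = max(now, t) + infer_cost
        done.append(now)
    return done

def batch_completion_times(arrivals, batch_size, infer_cost=1.0):
    """Documents wait until a full batch accumulates, then score together."""
    done = []
    for i in range(0, len(arrivals), batch_size):
        batch = arrivals[i:i + batch_size]
        start = batch[-1]  # the batch cannot start before its last arrival
        done.extend(start + infer_cost for _ in batch)
    return done

arrivals = [0.0, 1.0, 2.0, 3.0]
seq = sequential_completion_times(arrivals)
bat = batch_completion_times(arrivals, batch_size=4)
# Early arrivals finish sooner under sequential scoring; under batching,
# every document waits for the slowest arrival in its batch.
```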
Q: Tell us a little bit about your team and what's next?
A: We have great leadership in our organization. Calvin Powell encourages and drives innovation, making it easy to try new things and keep getting better. We also have a great data science team, led by Madhu Kolli, with deep experience in, and passion for, machine learning, data, and analytics.
These complex problems gave birth to new ideas and techniques, which resulted in the filing of multiple patents in ML and deep learning during the last couple of years. So far, we have successfully processed billions of document images and more than 100 million documents through multiple ML models designed for historical documents.
We still have a lot more improvements to make in extraction to reduce cost and to develop more data and AI products and services. As we know, success feeds success, and we are working on many projects within the different business units in the First American organization that are solving for these interconnected issues.