Intelligent Document Processing with AWS AI/ML

By: Sonali Sahu

Overview of this book

With the volume of data growing exponentially in this digital era, it has become paramount for professionals to process this data quickly and cost-effectively to get value out of it. The data that organizations receive is usually in raw document format, and being able to process these documents is critical to meeting growing business needs. This book is a comprehensive guide to AI/ML fundamentals and their application in document processing use cases. You'll begin by understanding the challenges of legacy document processing and discover how you can build end-to-end document processing pipelines with AWS AI services. As you advance, you'll get hands-on experience with popular Python libraries to process and extract insights from documents. The book also takes you through real industry use cases for document processing, such as delivering value-based care in the healthcare industry and accelerating loan application processing in the financial industry. Throughout the chapters, you'll find out how to apply your skill set to solve practical problems. By the end of this AWS book, you'll have mastered the fundamentals of document processing with machine learning through practical implementation.
Table of Contents (16 chapters)
Part 1: Accurate Extraction of Documents and Categorization
Part 2: Enrichment of Data and Post-Processing of Data
Part 3: Intelligent Document Processing in Industry Use Cases

Automating mortgage processing data capture and data categorization with IDP

The first stage of the IDP pipeline is data capture. During this stage, all documents (such as URLA-1003, Form W-2, pay stubs, bank statements, credit card statements, mortgage notes, Form 1099, ID documents such as passports and driver's licenses, and any other documents) are collected and aggregated in a central, secure data store on Amazon S3, where you can define the right access control for the data.
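As a rough illustration of the data capture stage, the following sketch uploads one document to S3 under a key that groups files by loan application. The bucket name, the `incoming` prefix, the `application_id` value, and both helper functions are hypothetical names introduced for this example. The use of KMS server-side encryption is also an assumption about what "secure" might mean here; access control itself (IAM policies, bucket policies) is configured separately and is not shown.

```python
from pathlib import Path


def build_object_key(application_id: str, filename: str, prefix: str = "incoming") -> str:
    """Return an S3 key that groups all documents of one application together.

    The "incoming/<application_id>/<file>" layout is an assumption for this sketch.
    """
    return f"{prefix}/{application_id}/{Path(filename).name}"


def capture_document(s3_client, bucket: str, application_id: str, local_path: str) -> str:
    """Upload one local document to S3 with server-side encryption; return its key.

    `s3_client` is expected to behave like boto3.client("s3").
    """
    key = build_object_key(application_id, local_path)
    with open(local_path, "rb") as handle:
        s3_client.put_object(
            Bucket=bucket,
            Key=key,
            Body=handle,
            ServerSideEncryption="aws:kms",  # assumption: encrypt at rest with KMS
        )
    return key


# Example usage (requires boto3 and AWS credentials; names are placeholders):
#   import boto3
#   s3 = boto3.client("s3")
#   capture_document(s3, "my-mortgage-docs", "app-001", "scans/w2_2023.pdf")
```

Passing the client in as a parameter keeps the helper easy to exercise with a stub, without touching a real bucket.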

Sometimes we already know the document type and can proceed directly to extraction. More often, though, we have no reliable way of identifying the documents, so we need to classify them before any further extraction. We can use Amazon Textract to extract raw text from any type of document, and then create sample labeled data for training an Amazon Comprehend classifier. Amazon Comprehend classification can help accurately categorize documents for mortgage application...
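The extract-then-classify flow described above might be sketched as follows. This is a minimal example under several assumptions: the document is a single page already stored in S3 (so the synchronous Textract call applies), a Comprehend custom classifier has already been trained and deployed to a real-time endpoint, and the function names, bucket, key, and endpoint ARN are all placeholders introduced for illustration. The `to_training_row` helper assumes the two-column "label, text" CSV layout used for multi-class training data.

```python
from typing import List, Tuple


def extract_lines(textract_response: dict) -> str:
    """Join the text of all LINE blocks in a Textract response."""
    blocks = textract_response.get("Blocks", [])
    return "\n".join(b["Text"] for b in blocks if b.get("BlockType") == "LINE")


def to_training_row(label: str, text: str) -> List[str]:
    """Build one labeled sample: [label, single-line document text]."""
    return [label, " ".join(text.split())]


def top_class(comprehend_response: dict) -> Tuple[str, float]:
    """Return the highest-scoring class name and its score."""
    best = max(comprehend_response["Classes"], key=lambda c: c["Score"])
    return best["Name"], best["Score"]


def classify_document(textract_client, comprehend_client,
                      bucket: str, key: str, endpoint_arn: str) -> Tuple[str, float]:
    """Extract raw text with Textract, then classify it with Comprehend.

    The clients are expected to behave like boto3.client("textract") and
    boto3.client("comprehend").
    """
    response = textract_client.detect_document_text(
        Document={"S3Object": {"Bucket": bucket, "Name": key}}
    )
    text = extract_lines(response)
    result = comprehend_client.classify_document(Text=text, EndpointArn=endpoint_arn)
    return top_class(result)


# Example usage (requires boto3, credentials, and a deployed endpoint):
#   import boto3
#   label, score = classify_document(
#       boto3.client("textract"), boto3.client("comprehend"),
#       "my-mortgage-docs", "incoming/app-001/w2_2023.pdf",
#       "<your-classifier-endpoint-arn>",
#   )
```

Rows produced by `to_training_row` could be written out with the standard `csv` module to assemble the sample labeled data before training; multi-page documents would need Textract's asynchronous operations instead.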