Intelligent Document Capture with Ephesoft - Second Edition

No blank forms available for training


Classification is most accurate when the system is trained with blank forms (forms that have not been filled in). If blank forms are not available, accurate classification can still be achieved in one of two ways.

The first option is to redact the samples you have (that is, remove sensitive and instance-specific data) before uploading them to Ephesoft for training.

The second option involves editing the HOCR file that is created after clicking on Learn Files in the Batch Class Management administrative interface. The HOCR file is the XML representation of the OCR output.

The XML file can be edited to remove any content that is not part of the blank form. After the XML file is updated, click on Learn Files again to update the index files used by Ephesoft. Doing so does not overwrite the changes you made to the XML file; the file is regenerated only if the source TIFF is updated.
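
For files with many words, the edit can be scripted rather than done by hand. The following is a minimal sketch, not Ephesoft's own tooling: it assumes each recognized word is stored in a `Span` element with a `Value` child, which you should confirm against your own HOCR file before use. The function name `strip_instance_text` and the sample XML are illustrative only. Work on a backup copy of the file.

```python
import xml.etree.ElementTree as ET
from typing import Set

# Hypothetical sample; the element names (Span, Value) are an assumption
# about the HOCR XML layout and must be checked against a real file.
SAMPLE = """<HocrPages><HocrPage><Spans>
<Span><Value>Name:</Value></Span>
<Span><Value>John</Value></Span>
<Span><Value>Date:</Value></Span>
<Span><Value>2017-01-05</Value></Span>
</Spans></HocrPage></HocrPages>"""


def strip_instance_text(hocr_xml: str, keep_words: Set[str]) -> str:
    """Return the XML with every Span removed whose Value is not in keep_words."""
    root = ET.fromstring(hocr_xml)
    # ElementTree elements have no parent pointer, so visit every element
    # and filter its direct Span children. The list() copies keep the
    # traversal stable while elements are being removed.
    for parent in list(root.iter()):
        for span in list(parent.findall("Span")):
            value = (span.findtext("Value") or "").strip()
            if value not in keep_words:
                parent.remove(span)
    return ET.tostring(root, encoding="unicode")


# Keep only the words that would appear on the blank form.
cleaned = strip_instance_text(SAMPLE, {"Name:", "Date:"})
print(cleaned)
```

Listing the words to keep, rather than the words to drop, means any filled-in value that was not anticipated is removed by default. After saving the cleaned XML in place of the original, click on Learn Files again as described above.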