As more companies start or continue their digital transformation, automating document capture is one of the key components of this journey. Many companies today use capture products to import electronic, scanned, or faxed documents and data into content repositories and databases but still manually sort, classify, and index documents. Capture technology can streamline processing by automating document classification and data extraction. Many companies are planning to add these capabilities, or are already doing so, to increase processing efficiency, improve SLAs, and lower costs.

For an optimal automated document classification and data extraction solution, there are five key areas to understand: document capture automation capabilities, document data models, optimal document sample set, classification and data extraction accuracy, and machine learning. The first part of this series will explore the capabilities and subsequent posts will discuss the other areas in detail.

Document Capture Automation Capabilities

Document Sources

Companies receive documents from a variety of sources and in various formats. Capture technology can process electronic documents such as Microsoft Word, Excel, text-based PDFs, text files, and emails. These documents are natively text searchable and therefore have the highest quality. Image-based documents such as TIFF, JPEG, and PNG were at some point paper and have been digitized through a scanner, fax machine, or mobile phone camera; they generally have lower quality, and a much wider range of quality levels, than electronic documents. Some Microsoft Word and PDF files contain a mix of searchable text and non-searchable text that resides in images.
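As a rough illustration of how incoming files might be routed based on the distinction above, here is a minimal sketch. The extension lists, the route names ("text", "ocr", "inspect", "unknown"), and the function name are assumptions made for this example and do not come from any specific capture product.

```python
from pathlib import Path

# Hypothetical extension groups, chosen only for illustration.
NATIVE_TEXT = {".txt", ".eml", ".xls", ".xlsx"}        # text is available immediately
IMAGE_BASED = {".tif", ".tiff", ".jpg", ".jpeg", ".png"}  # digitized paper, needs OCR
MAY_MIX = {".doc", ".docx", ".pdf"}                     # may mix searchable text and images


def route_document(filename: str) -> str:
    """Return a processing route for a file based only on its extension."""
    ext = Path(filename).suffix.lower()
    if ext in IMAGE_BASED:
        return "ocr"
    if ext in NATIVE_TEXT:
        return "text"
    if ext in MAY_MIX:
        # Word and PDF files can contain image-only text, so the content
        # has to be inspected before deciding whether OCR is needed.
        return "inspect"
    return "unknown"
```

A real solution would inspect file content rather than trust the extension; the sketch only shows why Word and PDF files cannot be assumed to be fully searchable.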

Improving Image Quality

The quality of paper-based documents depends on a number of factors, including document resolution (DPI) after digitization, image quality or noise level (e.g., specks, halftone, color level, roller lines), and orientation. Image pre-processing can improve the quality of image-based documents by correcting orientation, converting to black and white, and removing specks, halftone, backgrounds, and other elements that impair machine readability. The correct pre-processing must be used to achieve the highest image quality and therefore the best results.
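To make two of these pre-processing steps concrete, the following sketch converts a grayscale image to black and white and then removes isolated specks. It is a simplified illustration, not how any particular capture product works: the image is assumed to be a plain list of pixel rows, and the fixed threshold of 128 and the "no inked neighbours" speck rule are assumptions chosen for clarity.

```python
from typing import List

Image = List[List[int]]  # rows of grayscale pixels, 0 (black) to 255 (white)


def binarize(gray: Image, threshold: int = 128) -> Image:
    """Convert grayscale to black and white: 1 = ink, 0 = background."""
    return [[1 if px < threshold else 0 for px in row] for row in gray]


def remove_specks(bw: Image) -> Image:
    """Clear every ink pixel that has no ink among its 8 neighbours."""
    height = len(bw)
    width = len(bw[0]) if bw else 0
    cleaned = [row[:] for row in bw]  # copy so the input is not modified
    for y in range(height):
        for x in range(width):
            if bw[y][x] != 1:
                continue
            has_ink_neighbour = any(
                bw[ny][nx] == 1
                for ny in range(max(0, y - 1), min(height, y + 2))
                for nx in range(max(0, x - 1), min(width, x + 2))
                if (ny, nx) != (y, x)
            )
            if not has_ink_neighbour:
                cleaned[y][x] = 0
    return cleaned
```

Production filters typically use adaptive thresholds and size-based speck removal; the point here is only that each filter changes which pixels the OCR engine later sees as ink.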

Document Classification

Classification is the automated process of using a combination of optical character recognition (OCR), text extraction, and image analysis to identify documents against a known set of documents and document pages. Features of documents such as keywords, phrases of text, form numbers, and logos that people typically look for when manually classifying can be used to classify a document automatically. Other activities can also be automated including page sorting, document separation, and processing priority adjustment (based on process SLAs). Classification is required for accurate data extraction.
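As a minimal sketch of feature-based classification, the example below scores a document's text against keyword lists for each known document type. The document types, keywords, and the two-hit minimum are hypothetical values chosen for illustration; a real solution would also draw on image analysis, logos, and page layout.

```python
from typing import Dict, List, Optional

# Hypothetical document types and the text features a person might look for.
DOCUMENT_FEATURES: Dict[str, List[str]] = {
    "invoice": ["invoice number", "amount due", "remit to"],
    "w9": ["form w-9", "taxpayer identification number"],
    "claim_form": ["claim number", "date of loss", "policy number"],
}


def classify(text: str, min_hits: int = 2) -> Optional[str]:
    """Return the document type with the most matching features.

    Returns None when no type reaches min_hits, so the document can be
    sent to a person for manual classification.
    """
    lowered = text.lower()
    best_type: Optional[str] = None
    best_hits = 0
    for doc_type, features in DOCUMENT_FEATURES.items():
        hits = sum(1 for feature in features if feature in lowered)
        if hits > best_hits:
            best_type, best_hits = doc_type, hits
    return best_type if best_hits >= min_hits else None
```

Returning None instead of a best guess keeps low-confidence documents out of the extraction step, which matters because extraction rules are chosen by document type.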

Data Extraction

Data extraction is the process of locating data within the content of documents and transforming it for use in vital processes and applications. Manual indexing is reduced significantly, while processing time and accuracy are improved. Names, addresses, phone numbers, IDs, social security numbers, and dollar amounts are all examples of business-critical data that can be extracted from forms.

Text must be searchable in the content of the document to enable data extraction. For electronic documents, text is available immediately. For image-based documents, OCR is required to enable text searching. There is a wide array of OCR engines available in the market today. Many offer the ability to improve text results by utilizing different character sets and applying additional image pre-processing filters. Once the content of the document is available for analysis, data extraction rules can run and populate the index fields. Various methods of extraction exist, including zonal, freeform, handwriting, barcode, checkbox, and signature detection. The form layout and field definitions determine the best method of extraction for each field. Zonal data extraction is most appropriate for structured content where specific data elements exist in the same general location on every document. Freeform data extraction is most appropriate for unstructured content where a single data element can exist anywhere on a document. Documents can have both structured and unstructured sections and pages, so a combination of the different extraction types must be used.
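The difference between zonal and freeform extraction can be sketched as follows. This is an illustrative simplification: the Word structure stands in for positioned OCR output, and the zone coordinates, the social security number pattern, and all function names are assumptions made for this example.

```python
import re
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class Word:
    """One recognized word with the position of its top-left corner."""
    text: str
    x: float
    y: float


Zone = Tuple[float, float, float, float]  # left, top, right, bottom


def extract_zonal(words: List[Word], zone: Zone) -> str:
    """Zonal: return the words whose top-left corner falls inside a fixed zone."""
    left, top, right, bottom = zone
    inside = [w for w in words if left <= w.x <= right and top <= w.y <= bottom]
    inside.sort(key=lambda w: (w.y, w.x))  # reading order: top to bottom, left to right
    return " ".join(w.text for w in inside)


# Hypothetical pattern for a US social security number written as 123-45-6789.
SSN_PATTERN = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")


def extract_freeform(text: str, pattern: re.Pattern) -> Optional[str]:
    """Freeform: return the first match of a pattern anywhere in the text."""
    match = pattern.search(text)
    return match.group(0) if match else None
```

For example, a name field that always sits near the top of a form could be read with `extract_zonal(words, (90, 40, 300, 60))`, while an ID that may appear anywhere in a letter could be found with `extract_freeform(text, SSN_PATTERN)`.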

Document classification and data extraction are key components of a successful document automation solution. Defining the document data model is a critical first step. This will be explored in detail in the next post of this series.