What is Data Capture?

Data capture is the process of collecting data from paper or image format documents. Data is typically collected in field formats and stored in a database. The purpose of data capture is to replace the manual entry of documents and expedite the process by utilizing computer-processing power.

The Steps in Data Capture Include:

– Input of image files or documents
– Classification
– Location of fields
– Extraction of data in those fields
– Quality checking of the extracted data
– Export of the data in a desired format

2 Types of Data Capture

– Fixed Forms Processing
– Semi-Structured Forms Processing

There is some overlap between these two types, but the primary difference lies in the setup and configuration of the system. Fixed forms processing uses static coordinates on an image to find fields, while semi-structured forms processing uses relative coordinates and context.

Data Capture Uses – Identifying the Value

The uses of data capture vary across different document types and business processes, with some areas having a much greater demand. Organizations determine data capture needs based on the cost of doing business with certain document types. Below is a list of the top seven uses of data capture. This list does not take into account the number of installations for each type of use, only the need to use data capture.

Accounts Payable (AP) Automation

The most common document type found in AP automation is commercial invoices. AP processing also includes checks, purchase orders, remittance stubs, and occasionally bills of lading. The ability to automate these documents means not only a reduction in their manual entry cost, but also two other major benefits: speed of entry and employee time optimization. Now that organizations can enter these documents faster, they can reduce the time it takes to process them, and in many cases can reduce penalties or be able to take advantage of net discounts. Since the operators who typically enter these documents are not data entry clerks but employees at a higher pay scale, automation allows them to spend more time on critical thinking tasks.

Medical Billing

The most common document types in medical billing are Explanation of Benefits (EOB), Health Care Financing Administration Forms (HCFA), and Universal Billing Forms (UB). EOBs are sent by payers and processed by billers, whereas HCFA and UB documents are sent by billers and processed by payers. These documents represent the two directions of the billing process within the health care industry. Automating EOBs allows billers to retrieve the money owed to them more quickly. EOBs are classified as the most complex document type to automate, and because of their complexity, they also have the highest premium cost for entry. Hospitals and pharmacies regularly receive multi-patient EOBs which are commonly hundreds of pages long. Automating HCFA and UB documents allows payers to enter the procedures that are associated with their customers’ claims into their systems more quickly. This allows them to access the amount owed faster, reduce operation cost, and increase margins. Occasionally, billers and payers will process both documents at the same time for purposes of reconciliation.

Survey Entry Automation

While documents requiring hand-print processing (ICR) and mark-sense processing (OMR) represent a much smaller paper volume than the majority of the documents an organization encounters, they are still a common tool for collecting feedback, testing, or surveying of specific groups. Since each field on every page is in the same location, survey automation requires fixed form processing. Because of this, organizations can adapt survey entry automation technology quickly without much additional effort. The challenge of automating these documents is in the nature of hand-printed text. Handwriting is constantly changing. It varies from one individual to another, and sometimes even for the same individual over time. Because of this, it is necessary to have a properly designed form in order to constrain the writing as much as possible. Automating these forms reduces the cost of entry even with the mandatory step of quality assurance, and is usually much more accurate than manual entry.

Bill Automation

While bill automation is similar to AP processing, it differs in that corporate bills for utilities tend to be very different from commercial invoices, and typically have their own business process, value, and method of processing. Bill processing, unlike AP automation, does not combine document types. The most popular form of bill processing is the processing of telecommunication bills, which are bills generated by phone utilities for phone and cellular service. For large organizations, these bills create large avoidable expenses, and can quickly get out of control. Processing telecommunication bills requires a specific setup of the data capture system. These bills usually contain multiple pages of repeating data elements. Organizations require different information depending on their business process. For most organizations, circuit level data (the highest level of detail on a telecommunication bill) is not necessary, although for some it is essential. Like other bill types, the automation of these bills is an opportunity for companies to take advantage of net discounts, pay bills on time, and reduce entry cost. For telecommunication bills specifically, automation helps manage usage fees more closely, and increases an organization’s ability to negotiate service plans.

Mortgage Documents

The property closing process is filled with a large variety of critical documents. For banks, mortgage companies, and title insurance companies, the information contained within these documents is critical to ensuring that a loan is properly structured and complete. While these packets contain the same information for the most part, their format varies greatly, and they often include special information based on the situation or state of a loan. The primary purpose of automating mortgage documents has been to classify these pages so that they are suitable for automatic filing. For certain document types, organizations will take the additional step of extracting information, usually loan or property details. Automating these documents makes the entire process of recording and monitoring loans substantially easier. Before automation was used, these documents generally were not keyed, but stored in physical storage and retrieved on an as-needed basis. With automation, the search and retrieval process can be instantaneous and the need for storage is reduced.

Logistics Documents Automation

The automation of delivery documents may be a part of AP automation or it may be its own distinct process. The documents in this category include bills of lading (BOL), packing slips, and packing lists. While packing slips and packing lists are sometimes identical, they can occasionally come at different stages in the shipping and receiving process, and have separate value. The purpose of manually checking these documents is to verify that the contents of a shipment are accurate or to inform the receiving organization’s downstream processes associated with new deliveries. Automating these documents allows staff to focus less on the details of each shipment, which increases their ability to monitor the overall shipping process and improve efficiency. In addition, automation may allow organizations, typically manufacturers, to initiate faster downstream processes associated with part deliveries.

Human Resources

Human resource departments have a large range of document types that can be automated. The majority of the documents that a human resource department faces are one-off, new-hire documents that are too low in volume to warrant the use of technology. These documents, which consist of hire forms and resumes, are not usually automated. There are times, however, when organizations have a large volume of paper forms due to changes in employment policy, insurance plans, HR compliance, or just as a result of employee surveys. Because it is the HR staff’s responsibility to enter these documents, the entry time detracts from critical HR tasks. By automating these documents, organizations get instant value from them while saving precious staff time. In the cases of policy or insurance changes, these documents also pose a legal risk value, making time-to-entry critical.

Data Capture Approaches – Best Practices to Obtain Value

There are several ways for data capture applications to classify pages, find fields, and extract data. All data capture packages employ the following to recognize information on scanned images:

– Optical Character Recognition (OCR)
– Intelligent Character Recognition (ICR) for hand-print
– Optical Mark Recognition (OMR)
– Barcode technologies

Locating and Identifying Values

For the purposes of this discussion, templates are static and definitions are dynamic. The methods used to classify and locate information vary slightly from package to package, but there are four primary approaches:

1. Templates
2. Iterative Templates
3. Data Keyword Pairs
4. Combination of Iterative Templates and Data Keywords Pairs

Semi-Automated: The ‘Assisted’ Capture Approach

Semi-Automated data capture is the perfect solution for companies that are cautious about document automation, or want to start out slow. Semi-automated data capture is still a manual process. The goal is to make the operator using the software as efficient as possible by clicking on text versus typing it.

Semi-Automation – The Operator’s Assistant

Operators scan documents, view each page, and click-enter each field in order. Operators choose field lists per document type. As they click-enter, the software automatically navigates from one field to the next. Assisted capture solutions do not require special expertise during integration and can be installed and configured in hours. Even if an organization automates only one field with assisted capture and manually enters the rest, there is a time savings. Assisted capture solutions are less expensive to purchase and install. Additionally, the ROI for semi-automated systems may not be as high as for full automation, but it is easier to control and calculate.

Insight Examples

To give an example of the savings: an average data capture field is twelve characters in length. In assisted capture this means that, on average, an operator enters twelve characters of data per mouse-click, which is a total savings of eleven manual operations per field.
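The arithmetic behind this example can be sketched in a few lines. The function name is hypothetical, and the sketch assumes that each typed character and each mouse-click counts as one manual operation.

```python
def operations_saved_per_field(avg_field_length: int) -> int:
    """Manual operations saved when one click replaces typing a whole field.

    Assumption: one keystroke per character, one click per field.
    """
    typed_operations = avg_field_length  # typing the field by hand
    click_operations = 1                 # click-entering the field
    return typed_operations - click_operations


# Twelve-character average field from the example above.
print(operations_saved_per_field(12))  # 12 keystrokes - 1 click = 11
```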

The typical integration of semi-automated data capture reduces the number of full-time employees needed for data entry by a factor of three to five (AIIM e-Doc 2008). Additionally, because an operator sees every image and character, the process has accuracy equivalent to double-blind data entry in half the time. Organizations are able to integrate assisted capture in days and start obtaining the benefit from document automation immediately.

Assisted Capture to Fully Automated Data Capture

When organizations have obtained success through assisted capture, most are able to easily upgrade to fully automated data capture approaches. Multiple solutions allow for the template logic created during the use of assisted capture to be ported to a fully automated system. Organizations are choosing assisted capture instead of full automation so they can introduce automation sooner with less risk. Assisted capture provides a level of predictability and obviousness that fully automated solutions do not.

Semi-Automated Data Capture Integration

One of the greatest challenges of automation technology is how the technology fits into the organization. With semi-automated data capture, the impact of change is minimal and usually just a process of educating manual key entry operators on how to click on data versus type it. Because of this, organizations use assisted capture not only as a way to automate quickly but also to prepare their environment for automation. Assisted capture solutions demonstrate similar capabilities in export and quality assurance to fully automated solutions. In the last two years, when an assisted capture solution has had a clear upgrade path to full automation, it has been chosen over competing full automation solutions sixty percent of the time (AIIM e-Doc 2008).

Fully Automated: The Template Approach


‘Templating’ is the process of picking zones on a document where information is statically located. The template approach is a fully automated solution. In this approach, the software does not guess where information is located; it simply always looks in the same x, y, (height, width) location for each individual field. These fields are usually defined in the software through a process of viewing a sample image and rubber banding field locations (drawing rectangles on the image through the software’s user interface). These locations are then stored as a template and applied during processing to images of the template type. Templates are only used in fixed-forms processing data capture.
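A minimal sketch of template-based extraction follows. It assumes the recognition engine returns words with pixel bounding boxes; the `Box` and `Word` types, the field names, and the sample coordinates are all hypothetical and do not describe any particular product.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Box:
    x: int       # left edge in pixels
    y: int       # top edge in pixels
    width: int
    height: int

    def contains_center_of(self, other: "Box") -> bool:
        cx = other.x + other.width / 2
        cy = other.y + other.height / 2
        return (self.x <= cx <= self.x + self.width
                and self.y <= cy <= self.y + self.height)


@dataclass(frozen=True)
class Word:
    text: str
    box: Box


def apply_template(template: dict[str, Box], words: list[Word]) -> dict[str, str]:
    """Read each field from its fixed zone; no searching is involved."""
    result = {}
    for field_name, zone in template.items():
        inside = [w for w in words if zone.contains_center_of(w.box)]
        inside.sort(key=lambda w: (w.box.y, w.box.x))  # rough reading order
        result[field_name] = " ".join(w.text for w in inside)
    return result


# Hypothetical rubber-banded zones and recognized words for one fixed form.
template = {"last_name": Box(100, 50, 200, 30), "zip_code": Box(100, 120, 80, 30)}
words = [
    Word("SMITH", Box(105, 55, 60, 20)),
    Word("90210", Box(110, 125, 50, 20)),
    Word("Signature", Box(400, 300, 90, 20)),
]
print(apply_template(template, words))
```

Because the zones never move, any shift in the scanned image (skew, offset, a changed form revision) places the text outside the zone, which is why this approach suits only fixed forms.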

Iterative Templates

The iterative template approach was a technology that came out of manual zoning. This approach does not require any more expertise than templating. With this approach, there is a phase of training that follows these steps. First, an operator creates a new definition for a document type. He or she then loads a set of samples that represent that document type and the variations within it. The operator must iterate over each page in the training set and rubber-band the same fields. As the operator goes from image to image, the software calculates the variations in field location from page to page. By doing so, the software understands how a field may move from one location to another on the page. Once the training is done, the definition can be applied in production. This approach is employed by semi-structured forms processing systems.
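One simple way to picture the training phase is shown below. This is an assumption for illustration only: it models the learned variation as the smallest rectangle covering every rubber-banded sample, whereas real packages may use more elaborate statistical models. The function name and coordinates are hypothetical.

```python
def learn_search_region(samples: list[tuple[int, int, int, int]]) -> tuple[int, int, int, int]:
    """Return the smallest (x, y, width, height) rectangle covering all samples.

    Each sample is the rectangle an operator drew for the same field
    on one page of the training set.
    """
    if not samples:
        raise ValueError("at least one training sample is required")
    left = min(x for x, y, w, h in samples)
    top = min(y for x, y, w, h in samples)
    right = max(x + w for x, y, w, h in samples)
    bottom = max(y + h for x, y, w, h in samples)
    return (left, top, right - left, bottom - top)


# The same field rubber-banded on three training pages.
training = [(100, 50, 80, 20), (110, 58, 80, 20), (95, 46, 85, 20)]
print(learn_search_region(training))  # (95, 46, 95, 32)
```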

Semi-Structured: The Data Keyword Pairs Approach

Semi-Structured or Data Keyword Pairs

The semi-structured or data keyword pairs approach is unique compared to all other approaches. In this approach, coordinates are very rarely considered. If they are considered, they are relative to the subject page being processed at any given time. In this approach, there is also a stage of training where an operator loads a set of sample images of a particular document type. The operator then analyzes the variance across the images and determines the best logic to locate fields. Most fields are located using keywords. Fields can also be located using graphics, lines, and white spaces. Once a keyword has been identified for a particular field, the operator provides the logic to tell the software where the field is located relative to that keyword or object.


To find the invoice number on an invoice page, the logic would start by looking for the words “Invoice No.”, “Invoice Number”, “Invoice #”, or any other similar phrase that appears on the page. Now that the keyword for the invoice field is found, the next step of logic is to tell the software that the invoice number is to the right of the keyword, some number or percent of pixels below the top of the keyword, or some number or percent of pixels above the bottom of the keyword. In this case, the logic would probably also specify the type of characters an invoice number will contain. In this approach, the software guesses where the information is located and picks the best guess if there are several.
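The invoice-number logic described above can be sketched as follows. The sketch assumes the recognition engine returns text segments (such as “Invoice No.”) with pixel positions; the `Segment` type, the character rule, the tolerance value, and the sample data are hypothetical.

```python
import re
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Segment:
    text: str
    x: int       # left edge in pixels
    y: int       # top edge in pixels
    width: int
    height: int


KEYWORDS = ("invoice no.", "invoice number", "invoice #")
# Assumed data type rule: uppercase letters, digits, and hyphens, 4+ characters.
VALUE_PATTERN = re.compile(r"^[A-Z0-9-]{4,}$")


def find_invoice_number(segments: list[Segment],
                        vertical_tolerance: float = 0.5) -> Optional[str]:
    """Find a value to the right of an invoice keyword on roughly the same row."""
    keywords = [s for s in segments
                if s.text.strip().lower().rstrip(":") in KEYWORDS]
    candidates = []
    for kw in keywords:
        for s in segments:
            is_right = s.x >= kw.x + kw.width
            same_row = abs(s.y - kw.y) <= vertical_tolerance * kw.height
            if is_right and same_row and VALUE_PATTERN.match(s.text):
                distance = s.x - (kw.x + kw.width)
                candidates.append((distance, s.text))
    if not candidates:
        return None
    # Several guesses may qualify; the nearest one is taken as the best guess.
    return min(candidates)[1]


page = [
    Segment("Invoice No.", 400, 40, 90, 16),
    Segment("INV-20731", 500, 42, 80, 16),
    Segment("PO-5521", 700, 44, 60, 16),
    Segment("Date", 400, 70, 40, 16),
    Segment("2008-05-14", 500, 70, 90, 16),
]
print(find_invoice_number(page))  # INV-20731
```

Here both “INV-20731” and “PO-5521” satisfy the rule, and the nearest candidate wins. Choosing by distance is only one possible way to rank guesses.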

Logic Flexibility

There is a tug of war in the setup stage of this approach between how flexible the logic should be and how constrained it should be. The amount of flexibility is determined by the variance across pages and the complexity of the document type. Flexibility is usually determined at the field level. For example, in most commercial invoices, the total amount due is located on the bottom right portion of the last page of the invoice.


Some data capture packages combine the iterative template and data keyword pair approaches, either merged into one method or offered as a choice to the operator based on document type and the setup operator’s skill set. Other fully automated solutions incorporate the assisted capture approach for quality assurance.

Preparing for Document Automation

The approaches described above (semi-automated data capture, template, semi-structured, and the combination of template and semi-structured) cover the vast majority of documents being automated. There are other document types that are less common but also benefit from automation, and are in some respects similar to one or more of the types mentioned.

There is no question about the value of automating the entry of paper documents. Computer processing is cheap, accurate, and stable. Human labor is expensive and slow. It is easy to see the value of automation for any organization; the trick is in preparation to automate. The difference between successful data capture projects and the unsuccessful ones very rarely has to do with the technology itself, but rather has to do with the amount of preparation done by the organization to secure success. Organizations that take the proper steps to prepare for the induction of data capture technology have a greater success rate and ROI, as well as fewer surprises.

Needs Analysis

All large projects kick off with a needs analysis. In the needs analysis phase, organizations develop their wish list of automation capabilities to apply to their documents. For most organizations automating paper, this is a request to automate one particular document type associated with a single business process. It is recommended for organizations to initially pick a discrete collection of document types that are a part of a single business process. Ideally, the document types will be of moderate organizational risk, and have a high value associated with their automation, meaning automation will provide an ROI but not change dramatically how things are done. Later, as the organization learns how to automate, it should move on to higher risk projects, yielding even higher rewards. When an organization knows what documents it wishes to automate, it can start collecting the critical facts. All of these steps happen before any vendor selection, and before any testing of technology takes place. The objective is a better understanding of the need before any investigation of technology.

Preparing Sample Sets

The sample documents an organization uses to evaluate technology are the most important tool for gauging potential value, measuring exceptions, and ultimately picking a solution. The biggest mistake is forgoing the process of picking samples. Without a well-prepared static sample set, there is no consistency, which in the end diminishes the value of the sample set and testing period. Because this white paper deals with such a range of document types, it does not give an exact calculation of sample set quantities and variance. The following sections contain a general guide to make organizations aware of the elements they should consider.

Sample Set

The sample set is the collection of already imaged documents on which the prospective software packages are configured. Sample sets should consist of a fixed number of each document type the organization plans to automate. For example, if it includes AP processing of invoices, purchase orders, and checks, then there are three types. An organization should have no fewer than ten sample production documents per type. The documents should be exactly as the data capture software will receive them at its final integration. The number of samples will be scaled based on production volume, but should not exceed fifty per type, as quality analysis then becomes unbearable. Each type should contain as much or as little variation as is experienced in the production environment. If, for example, an organization processes a thousand commercial invoices a month and has a thousand separate vendors in its system, each sample invoice should be from a different vendor. But if an organization processes the same volume from only three separate vendors, then there should be several samples of each, and more for the two largest contributors in volume. Because organizations are sharing private data, they should take the proper measures to protect themselves. If an organization must sanitize documents before providing them to any vendor, it should not black out (redact) information it expects the data capture system to collect. The best option is to substitute real information with fake, as redaction could impact the technology evaluation process. The above sample set is ideal for demos and estimating value; for a proof of concept, the sample set needs to be revamped.

Production Sample Set

Production sample sets are the samples that are run through the prospective software packages after setup has been done on the above sample set. The production set should be two times the volume of sample sets and have exactly the same variational makeup. The reason the software is tested on an independent sample set is to best approximate the production environment and to isolate any effects of setup on static documents.

Truth Data

Truth data is the 100% accurate, manually entered data for a given set of documents. While truth data should ideally be prepared for both the sample set and the production sample set, many organizations will evaluate accuracy at the point of proof of concept, so truth data for the production sample set may be sufficient. The purpose of the truth data is to compare the prospective products’ recognition results to already known, 100% accurate, manually entered data.
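A comparison against truth data can be as simple as the sketch below, which lists every field where a product’s result differs from the manually entered value. The function name, field names, and values are hypothetical.

```python
from typing import Optional


def compare_to_truth(recognized: dict[str, str],
                     truth: dict[str, str]) -> dict[str, tuple[str, Optional[str]]]:
    """Return {field: (expected, actual)} for every field that does not match.

    A field missing from the recognized results is reported with actual=None.
    """
    mismatches = {}
    for field_name, expected in truth.items():
        actual = recognized.get(field_name)
        if actual != expected:
            mismatches[field_name] = (expected, actual)
    return mismatches


truth = {"invoice_number": "INV-20731", "total": "1,250.00", "date": "05/14/2008"}
recognized = {"invoice_number": "INV-20731", "total": "1,250.00", "date": "05/14/2003"}
print(compare_to_truth(recognized, truth))
```

Running the same comparison for every prospective product on the same production sample set keeps the evaluation consistent.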

Evaluation Method and Criteria

Organizations need to agree internally on the method that will be used for testing products and how they will be measured before any actual testing is done.

Recommended methods for most organizations’ needs include:

– Vendor Discovery
– View a “canned” demo of each prospective product
– Modify the prospective vendor list based on the demos
– Have vendor perform setup on the sample set
– See demo of each prospective product on the sample set
– Modify the prospective vendor list
– Begin price negotiation
– Obtain a trial from remaining vendors with tailored configuration for the sample set
– Run the production sample set through setup of the final prospective vendors’ products

Vendor Discovery
At the first interaction with the vendor, organizations should remove any potential deal killers such as the pricing model or support concerns. The organization needs to focus on the benefit of the technology and understand from the vendor the amount of preparation required and the skill level required based on work that has been done on the above sample set.

View a “canned” demo of each prospective product
Canned demos are pre-configured demonstrations of the software on vendors’ picked sample documents. These demos do not require the vendor to perform any work other than presentation of the demo.

At each step, organizations should evaluate the speed of creation, speed of processing, and accuracy. Using this method in conjunction with the above facts, the organization should end up with a vendor list and associated performance score for each vendor based on the organization’s needs and expectations.

Understanding Data Capture Accuracy

Data capture accuracy is often confused with full-page optical character recognition (OCR) accuracy, which is reported as a single percentage based on the number of correct characters. This confusion can cause many problems when an organization is deciding how to evaluate data capture products. Data capture accuracy is derived from a series of accuracy calculations. The calculations go step-wise, and each step impacts the next; failure in one step often prevents the other accuracy levels from being calculated. The best way to consider the accuracy of a package is to measure both the actual accuracy achieved against truth data and the percentage of uncertainty. Uncertainty is the percentage of characters a software package flags for manual review. If there are one hundred characters in a document and the percentage of uncertainty is five percent, then an operator will look at five characters. In data capture, it is important for organizations to understand that despite the reported accuracy rate, there is always a potential for false positives. A false positive is a result that the technology reports as accurate, but in reality is not. False positives are combated during setup with business rules and data types. Below are the stages of accuracy calculation.
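The relationship between uncertainty and false positives can be illustrated with a short sketch. It assumes the package reports a per-character review flag and that recognized and truth strings line up position by position; the function names and sample strings are hypothetical.

```python
def characters_to_review(total_chars: int, uncertainty_pct: float) -> int:
    """Number of characters an operator will look at."""
    return round(total_chars * uncertainty_pct / 100)


def review_stats(recognized: str, truth: str, flagged: list[bool]) -> dict[str, float]:
    """Summarize flagged characters and unflagged errors (false positives)."""
    if not (len(recognized) == len(truth) == len(flagged)) or not truth:
        raise ValueError("inputs must be non-empty and of equal length")
    to_review = sum(flagged)
    false_positives = sum(
        1 for r, t, f in zip(recognized, truth, flagged) if r != t and not f
    )
    return {
        "uncertainty_pct": 100.0 * to_review / len(truth),
        "chars_to_review": to_review,
        "false_positives": false_positives,
    }


print(characters_to_review(100, 5))  # 5, as in the example above

truth = "INVOICE 12845"
recognized = "INV0ICE 12345"      # two wrong characters
flagged = [False] * len(truth)
flagged[3] = True                 # only the "0" was flagged for review
stats = review_stats(recognized, truth, flagged)
print(stats["chars_to_review"], stats["false_positives"])  # 1 1
```

The second error was reported as certain, so no operator would see it. That is the false positive that business rules and data types are meant to catch.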

“Data capture accuracy is derived from a series of accuracy calculations and is not a single percentage.”

Page ID

The identification of a page and its associated document type. If a page is not identified as any type, the accuracy is zero percent; if the page type is accurately identified, its accuracy is one hundred percent. Pages identified as the wrong type (false positives) result in zero percent accuracy. Page ID is an all-or-nothing accuracy calculation. A zero percent accuracy may stop processing of a document entirely or result in false positives.

Field Location

This is the process of zoning the fields to be recognized on the document. If a field is not located at all, it is zero percent accurate. If a field is partially located, it is one to ninety-nine percent accurate, depending on the total possible field length and the length identified.
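As a sketch, field location accuracy can be read as the share of the field’s possible length that was actually zoned. The linear ratio and the function name are assumptions; packages may score partial location differently.

```python
def field_location_accuracy(located_length: int, total_length: int) -> float:
    """Percent of the total possible field length that was located."""
    if total_length <= 0:
        raise ValueError("total_length must be positive")
    located = max(0, min(located_length, total_length))  # clamp to valid range
    return 100.0 * located / total_length


print(field_location_accuracy(0, 8))  # 0.0   field not located at all
print(field_location_accuracy(6, 8))  # 75.0  field partially located
print(field_location_accuracy(8, 8))  # 100.0 field fully located
```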

Character Level OCR

This is the accuracy reported by the OCR engine per character. When referring to accuracy, Character Level OCR is the most commonly known and used form of measurement. Character level accuracy ranges from zero to one hundred percent accurate. If a field has ten characters and nine are correct, the field is ninety percent accurate.
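The ten-character example can be reproduced with a position-by-position comparison against truth data. This sketch assumes the recognized and truth strings are aligned; an evaluation that must handle inserted or dropped characters would need an alignment step instead. The function name is hypothetical.

```python
def character_accuracy(recognized: str, truth: str) -> float:
    """Percent of truth characters matched at the same position."""
    if not truth:
        raise ValueError("truth must not be empty")
    correct = sum(1 for r, t in zip(recognized, truth) if r == t)
    return 100.0 * correct / len(truth)


# Ten characters, nine correct ("B" was recognized instead of "8").
print(character_accuracy("1234567B90", "1234567890"))  # 90.0
```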

Business Rules and Data Types

This accuracy changes based on different software packages, but is applied at the final step of recognition. If a field does not match a particular data type, for example a proper date format, the accuracy of that field could be reduced by one hundred percent, or by a percentage representing the share of its characters that do not match the data type. If a business rule states that a particular field should be five characters long, and it is recognized as seven characters long, the accuracy of that field could be reduced by one hundred percent or less. Business rules and data types are the final tools to enhance accuracy and avoid false positives.
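The sketch below shows the all-or-nothing variant of this step, in which any failed check reduces the field’s accuracy to zero. The date format, the helper names, and the choice of the all-or-nothing variant are assumptions; as noted above, packages differ in how they apply the reduction.

```python
from datetime import datetime


def is_valid_date(value: str, fmt: str = "%m/%d/%Y") -> bool:
    """Data type check: does the value parse as a date in the given format?"""
    try:
        datetime.strptime(value, fmt)
        return True
    except ValueError:
        return False


def has_length(value: str, expected: int) -> bool:
    """Business rule check: is the field exactly the expected length?"""
    return len(value) == expected


def apply_rules(field_accuracy: float, checks: list[bool]) -> float:
    """All-or-nothing variant: any failed check drops the field to zero."""
    return field_accuracy if all(checks) else 0.0


print(apply_rules(100.0, [is_valid_date("13/45/2008")]))  # 0.0, not a real date
print(apply_rules(90.0, [has_length("12345", 5)]))        # 90.0, rule satisfied
print(apply_rules(90.0, [has_length("1234567", 5)]))      # 0.0, seven characters
```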

All four of the above accuracy calculations can be rolled into a single percentage that becomes the final calculation of data capture accuracy per page. Production environment accuracy is usually determined by running production documents for a set period of time and averaging the per-page accuracies. Companies dealing heavily with business-process-driven automation consider accuracy only at the complete document level as opposed to the page level. For example, consider AP automation of invoices, POs, and checks. There may be an average page-level accuracy of ninety-five percent on the invoices, ninety-five percent on the POs, and only seventy-five percent on checks. Because a document is only as strong as its least accurate type, the accuracy of AP automation would be considered to be seventy-five percent on average.
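A sketch of the roll-up follows. The text does not say how the four stages are combined, so the multiplication in `page_accuracy` is purely an assumption chosen because each step depends on the one before it. The weakest-type rule in `process_accuracy` follows the AP example directly. All names are hypothetical.

```python
def page_accuracy(page_id: float, field_location: float,
                  character: float, rules: float) -> float:
    """ASSUMPTION: combine the four stage percentages by multiplication."""
    combined = 1.0
    for stage in (page_id, field_location, character, rules):
        combined *= stage / 100.0
    return combined * 100.0


def production_accuracy(per_page: list[float]) -> float:
    """Average of per-page accuracies over a production run."""
    return sum(per_page) / len(per_page)


def process_accuracy(per_type: dict[str, float]) -> float:
    """A business process is only as accurate as its least accurate type."""
    return min(per_type.values())


# A page that fails Page ID scores zero regardless of later stages.
print(page_accuracy(0, 100, 100, 100))

# The AP example: invoices 95, POs 95, checks 75.
print(process_accuracy({"invoices": 95.0, "POs": 95.0, "checks": 75.0}))  # 75.0
```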

Goal Setting: Accuracy, Speed, Exceptions, and Expectations

With all the tools mentioned, an organization can now estimate the levels of automation it can expect to achieve with data capture technology. In this estimation, an organization should include detailed answers to the five categories below for determining success.

1. Desired Accuracy Range

Organizations should be realistic in the accuracy range they expect. The range should have as a mid-point the estimated accuracy achievable on the subject document types. This should be the mid-point because it is the organization’s biased expectation, and the real accuracy will generally be higher or lower. Based on our experience with this type of technology, the estimate might be off and organizations should be prepared to make adjustments. The calculation of the range is based on the monthly page volume of documents, and the accuracy it will take to automate the percentage needed to reduce data entry cost.

2. Exception Documents

Exception documents are documents for which a configuration was unable to extract any, or was only able to extract minimal, usable data. Exceptions often mean additional setup and fine-tuning of a configuration. Because of this they may dramatically impact ROI, as each round of fine-tuning will result in an internal or external cost. Organizations should isolate this variable and determine an acceptable range of exceptions. Exceptions will always occur, and the range is relative to the monthly page volume, variations between document types, and the expected amount of new variations per month or year. At this time the organization must also decide the number of times any one exception must repeat before a round of fine-tuning is considered. It is not advisable to fine-tune for a class of exceptions that occur once or even five times. It helps organizations to list an acceptable cost range for working with an exception document type, and to step into the calculation of the number of fine-tuning rounds permitted in a given time period. Organizations have been surprised that while they obtain tremendous ROI when they initially use data capture, the rate at which ROI decreases is tremendous due to poorly planned exception handling.
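The repeat threshold can be applied with a simple count over an exception log, as sketched below. The log format, the class labels, and the threshold of ten are hypothetical; each organization chooses its own threshold.

```python
from collections import Counter


def classes_due_for_fine_tuning(exception_log: list[str], min_repeats: int) -> list[str]:
    """Return exception classes seen at least min_repeats times."""
    counts = Counter(exception_log)
    return sorted(name for name, n in counts.items() if n >= min_repeats)


# One month of exceptions, labeled by class during manual handling.
log = (["vendor_a_new_layout"] * 12
       + ["skewed_fax"] * 3
       + ["handwritten_total"])
print(classes_due_for_fine_tuning(log, min_repeats=10))  # ['vendor_a_new_layout']
```

Classes below the threshold stay in manual handling, which keeps the number of fine-tuning rounds, and their cost, predictable.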

3. Technical Ability

Organizations should be aware of the technical ability of the staff appointed to configure and use the data capture system. Usually, organizations do not have to consider the operating complexity of the data capture product, as software companies design data capture products to be simple to operate once they are configured. What varies more is the setup of a system. The complexity range that is acceptable is determined by the skill set of the staff designated to set up and support the data capture system. Some packages offer “what you see is what you get” (WYSIWYG) setup that only requires personnel who are familiar with basic Windows application usage. However, some packages require a developer level of expertise. Most packages offer both, and developer assistance is only needed as one-off support. Setup complexity is important during the initial integration (though this may be handled by the vendor) and during fine-tuning for exception handling. The complexity of a product is not necessarily indicative of its accuracy.

4. Processing Speed

There are several stages in data capture processing where speed is an important consideration. With this type of technology, a slower speed usually correlates with more accuracy. Organizations need to decide what speed of processing, setup, and exception setup is acceptable for their business process. Speed of processing is judged against how long manual entry takes per page. Documents entered using data capture should take less time to enter than if they were entered manually. For some organizations, entry in the same amount of time is acceptable; for most, it is between thirty and two hundred percent less time per page. Expectations should be realistic and based largely on the complexity of the documents. Often, organizations focusing only on speed will pick faster, less accurate technology, and will not obtain an ROI as there will be more manual quality assurance checking.

5. Setup Time

The amount of time it takes to set up and train for expectations impacts the time it will take to start gaining value from automation. The more time it takes to set up, the longer it takes to start automating documents. Generally, however, the more time spent on setup, the more accurate the system will be. Organizations need to know what range of time is acceptable for initial setup. In data capture, initial setup averages between 3 and 6 months, with outliers on either side. The average time per exception document type can range from minutes to weeks.

Estimating ROI

With the above five categories for determining success, a company can now make an initial estimate of its ROI based on a certain percentage of automation. Expectations change frequently for organizations not yet experienced in data capture. As organizations become more familiar with the technology and educated in expectations for their area of automation, they increase their ability to estimate accuracy and performance. ROI ranges are highly dependent on document types and quality. Most general business documents of good quality are moderately complex to automate. Examples of complex document types are EOBs and student transcripts, which require substantial additional fine-tuning effort. On the easier side are packing slips and survey forms.

Data Capture ROI

Organizations should be mindful of the formula they use to calculate ROI, and repeat that formula during the evaluation of each product at each stage even as expectations change. Usually, data capture ROI is determined by how much money is saved in automation. There are, however, many other areas for organizations to gain ROI from data capture automation that should be considered. For example, automation often frees employee time to be spent on more critical tasks, thus making them more efficient. When an employee is more efficient, less staff is necessary, which decreases staffing costs. Another example is the reduced cost of paper storage. Because of automation, some organizations have a lower need to physically store paper, which reduces monthly storage fees. For some particular industries, ROI is based on a reduction of risk associated with compliance, or even a reduction of legal fees such as worker’s compensation claims due to manual entry.

Assisted Capture ROI

In assisted capture, the calculation of ROI is fairly basic and straightforward. It is a process of counting the number of user operations that are saved with click-entry versus manual key entry, then calculating what volume of savings is required to replace the work of one operator. As an organization’s paper volume increases, so do the savings. The easiest way is to compare how many pages can be entered manually with how many can be click-entered in the same time. This gives the time savings as a percentage.

Basic Sample Calculation


Average Operator Hourly Wage: $8.00
Average Document Content: 33 fields of 12 characters
Average Operations Per Operator Per Hour: 11,600 (Data Entry Management Association 1998)
Average Number of working days per month: 22
Average Number of working hours per day: 8 hours
Average Cost of Semi-Automated License: $6,600 one-time, $1,100 annual support
Average Professional Services for Semi-Automated Solution: $10,000

Calculating for 100,000 pages a month data entry:

Manual Entry

In an 8-hour day, a full-time employee will enter approximately 234 (11,600 keystrokes / (33 fields x 12 characters) x 8 hours) pages a day, or 5,148 (234 pages x 22 days) pages a month. It would take 20 (100,000 pages / 5,148 pages, rounded up) full-time employees (FTEs) to handle the monthly data entry volume, for a total cost of $28,160 (20 employees x 8 hours x 22 days x $8.00).

Total Manual Monthly Cost: $28,160

Assisted Capture

An operator performs an average of 34 clicks (one per field and one for template selection) and an average of 10 keystrokes (verification of data), for a total of 44 operations to enter a page.

The operator can then click-enter about 263.6 (11,600 operations / 44 operations) pages an hour, 2,109 (263.6 pages x 8 hours) pages a day, or 46,398 (2,109 pages x 22 days) pages a month. It would take 3 (100,000 / 46,398, rounded up) FTEs to click-enter the entire monthly volume, for a total cost of $4,224 (3 x 8 x 22 x $8.00).

Total Semi-Automated Monthly Cost: $4,224
Monthly Savings of using Semi-Automated: $23,936

3 Year Total Cost of Semi-Automated Solution: $36,400 (($6,600 one-time + $1,100 annually x 2) x 3 licenses + $10,000 services)

Return On Investment: 1.5 months ($36,400 cost / $23,936 savings).
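
The sample calculation can be reproduced with a short script, which makes it easier to substitute an organization's own wage, volume, and license figures. This is a minimal sketch: the variable and function names are illustrative, FTE counts are rounded up, and it assumes one license per operator as the three-year cost line does. It works from unrounded pages per hour, so intermediate page counts can differ slightly from the rounded figures in the text, while the FTE counts and costs match.

```python
import math

# Inputs copied from the sample calculation above.
HOURLY_WAGE = 8.00
OPS_PER_HOUR = 11_600
FIELDS_PER_PAGE = 33
CHARS_PER_FIELD = 12
HOURS_PER_DAY = 8
DAYS_PER_MONTH = 22
MONTHLY_PAGES = 100_000
LICENSE_ONE_TIME = 6_600
LICENSE_ANNUAL_SUPPORT = 1_100
PROFESSIONAL_SERVICES = 10_000

def monthly_cost(ops_per_page):
    """Return (FTEs needed, monthly wage cost) for a given number of
    operator operations per page."""
    hours_per_month = HOURS_PER_DAY * DAYS_PER_MONTH
    pages_per_fte = OPS_PER_HOUR / ops_per_page * hours_per_month
    ftes = math.ceil(MONTHLY_PAGES / pages_per_fte)
    return ftes, ftes * hours_per_month * HOURLY_WAGE

# Manual entry: every character of every field is keyed.
manual_ftes, manual_cost = monthly_cost(FIELDS_PER_PAGE * CHARS_PER_FIELD)
# Assisted capture: 34 clicks plus 10 verification keystrokes per page.
assisted_ftes, assisted_cost = monthly_cost(34 + 10)

monthly_savings = manual_cost - assisted_cost
# One-time license plus two further years of support, per operator, plus services.
three_year_cost = ((LICENSE_ONE_TIME + 2 * LICENSE_ANNUAL_SUPPORT) * assisted_ftes
                   + PROFESSIONAL_SERVICES)
payback_months = three_year_cost / monthly_savings

print(f"Manual: {manual_ftes} FTEs, ${manual_cost:,.0f} per month")
print(f"Assisted: {assisted_ftes} FTEs, ${assisted_cost:,.0f} per month")
print(f"Savings: ${monthly_savings:,.0f} per month")
print(f"Payback: {payback_months:.1f} months")
```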

Preparing and Evaluating a Vendor List

Now that the majority of the fact gathering is done, organizations can prepare the vendor list. Initially, they should be broad and look for prospective vendors that fit just the type of processing in question, either fixed or semi-structured document processing. Vendor lists should include the vendor name, their contact information, and the product they have that might be a fit. Organizations should set the number of iterations and evaluation criteria that are used for each vendor based on the above research. Once there is a completed vendor list, the evaluation process can begin. Because of the nature of the technology, there are several areas where vendors should be tested thoroughly.

Do They Provide Demos?

Vendors should give organizations the option to first see a canned web demo of the product in question. Once a canned demo has been given, vendors should accept sample document sets and be willing to perform some basic setup on the samples. When this has been done, seeing a demo based on the organization’s sample set will clarify the potential of the technology. Many vendors have websites where organizations can simply upload documents to test. When evaluating the demo with the sample set of documents, organizations need to understand the complexity of the setup that was done. How long did the vendor take to do the setup? What was the skill level of the person who performed the setup? Finally, what were the challenges during their setup?

Do They Provide Trials?

Does the vendor allow organizations to try the software for a period of time in a production environment? Often, data capture software is complex enough that a trial without guidance or initial setup does more harm than good. If this is the case with a particular vendor, they should offer to do a setup for the organization and provide a trial that is operator mode only, or clearly explain the skill level required and what one may expect to encounter when performing setup without training.

How is Their Support?

The trial period is an opportunity for organizations to contact the vendor’s general technical support line. It is important to know if the vendor’s technical support team is responsive and can answer their questions or escalate requests promptly.

Test the Metrics of the Software

Many vendors have different versions of the data capture product they offer. Organizations need to find the package that is the best fit for their needs. Differences in packages are usually associated with the volumes they can process, and it comes down to cost. During the trial of the appropriate packages, it is the organization’s responsibility to test all of the metrics: installation, speed of recognition, accuracy, and export file formats. Export formats are sometimes the primary focus of an organization. Organizations should not pick data capture products based on export format but be concerned only that the export gives them all of the data required to get to the desired format. Sometimes the data capture software provides not only data but also image file results. If this is the case, then image format capabilities and resulting file size will be an additional consideration. Many packages include compression and file tools that offer companies the ability to go directly from a data capture process to an Enterprise Content Management (ECM) or Document Archive System without any additional steps.

Form Design and Scan Settings – Best Practices

There are many factors that come into play when integrating data capture technology. Because of the interpretive nature of this technology, there are also many nuances to contend with. Even so, there are some clear ways to make the integration of data capture technology more accurate. Below are some of the primary influences on data capture accuracy that all organizations should consider.

Form Design

The way in which forms are created can dramatically impact data capture accuracy when they are scanned and processed. Organizations that control the creation of their forms have the most influence over this factor of accuracy. The best practices for printing forms differ between fixed and semi-structured types. The most control, and thus the greatest impact, can be gained on fixed forms, but semi-structured typographic forms also have potential for improvement.

Fixed Form Design

1. Cornerstones

Make sure your form has cornerstones in each corner of the page. The cornerstones should be at 90 degree angles to each of their neighbors. The ideal type is black 5 mm squares.

2. Form Title

Use a clear title printed at 24 points or larger in a font that is not stylized.

3. Completion Guide

It is optional but sometimes useful to print a guide at the top of the form showing how best to fill in the type of fields the form uses.

4. Mono-Spaced Fields by Data Type

For the fields to be completed, it is best to use field types that are character-by-character separated. Each character block should be 4 mm x 5 mm and should be separated by 2 mm or more. The best types of fields to use, in order, are letters separated by dotted frames, letters separated by drop-out color frames, and letters separated by complete square frames.

5. Segmented Fields by Data Type

For certain fields, it will be important to segment the field in portions to enhance ICR accuracy. The best example is date: instead of having one field for the complete date, split it into 3 separate parts, the first being a month field, the next a day field, and the last a year field. The same is done with numbers, codes, and phone numbers.

6. Separate Fields

Separate each field by 3 mm or more.

7. Consistent Fields

Make sure the form uses consistent field types.

8. Form Breaks

It is okay to break the form up into sections and separate those sections with solid lines. This often helps with template matching.

9. Placement of Field Names

This applies to the text that indicates what a field is, such as “first name” or “last name”. It is best to put these labels left justified to the left of the field at a distance of 5 mm or more. DO NOT put the field descriptor in dropout in the field itself.

10. Barcodes

Barcode form identifiers are useful in form identification. Use a unique ID per form page and place the barcode at the bottom of the page at least 10 mm from any field.
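
The spacing minimums in the guidelines above lend themselves to an automated check of a draft form. The following is a rough sketch, not part of any data capture product: the rule names, the `check_layout` and `mm_to_pixels` helpers, and the sample measurements are all hypothetical. The pixel conversion defaults to 300 DPI, the resolution recommended under Scan Prep.

```python
# Minimum distances from the fixed-form guidelines, in millimeters.
RULES_MM = {
    "char_gap": 2.0,            # space between character blocks
    "field_gap": 3.0,           # space between fields
    "label_gap": 5.0,           # distance from a field name to its field
    "barcode_clearance": 10.0,  # distance from the barcode to the nearest field
}

def check_layout(measured_mm):
    """Return the names of guidelines that the measured layout violates.
    A measurement that is missing counts as a violation."""
    return [name for name, minimum in RULES_MM.items()
            if measured_mm.get(name, 0.0) < minimum]

def mm_to_pixels(mm, dpi=300):
    """Convert a printed dimension to scanned pixels (25.4 mm per inch)."""
    return round(mm / 25.4 * dpi)

# Hypothetical measurements taken from a draft form.
draft = {"char_gap": 2.0, "field_gap": 2.5, "label_gap": 6.0, "barcode_clearance": 12.0}
print(check_layout(draft))   # guidelines the draft breaks
print(mm_to_pixels(5))       # size in pixels of a 5 mm cornerstone at 300 DPI
```

For this draft, only the field spacing falls short of its minimum.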

Semi-Structured Form Design

1. Spacing

Provide sufficient space in each field for data to be entered.

2. Limit Use of Lines

Input text often ends up printed on top of lines, which is problematic no matter which technology or imaging tool is used.

3. Field Names

Print field labels to the left of input text. It’s best not to allow input text to be below field labels, as the field label then often interferes with OCR.

4. Effective Dropout

When using dropout, make sure the form has some black-only elements. If all referencing elements on the form drop out, the data capture software has no reference points to find even the first field. It’s best to have field names as black text that would show up in a scan.

Scan Prep

Proper scan settings are absolutely critical to obtaining the highest level of accuracy in data capture. While there are many scan settings that are based on document type, there are a few ways to ensure that all documents are scanned properly.

1. Resolution

The optimal resolution at which to scan documents for data capture is 300 DPI. This setting is optimal for accuracy and speed of scan. Companies working with documents with small font or hand-print may consider scanning at a higher resolution, but this is rare.

2. Color Scanning

To ensure that the data capture software has the greatest possible amount of information to work with, organizations should scan in color. Often, organizations will pick a lower bit depth, considering only file size. Scanning in color will help obtain the highest accuracy and is a format that can be compressed and re-purposed.

3. Image Pre-Processing

Occasionally after a document scan, image preprocessing provides additional benefit to the accuracy of data capture. The types of image processing should only be chosen by organizations when necessary and proven to help accuracy. To do this, an image should be tested both with and without image processing. The types of image processing that are most beneficial to data capture are thresholding, despeckling, rotation, deskew, background removal, and correction of linear distortion.
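
Testing an image both with and without processing comes down to comparing field-level accuracy against hand-verified values. The sketch below shows one way to score that comparison. The field names and values are invented sample data, and the extraction results are assumed to come from running the same page through the capture product twice, once on the raw scan and once after deskew.

```python
def field_accuracy(extracted, expected):
    """Fraction of expected fields whose extracted value matches exactly."""
    if not expected:
        raise ValueError("expected must contain at least one field")
    correct = sum(1 for name, value in expected.items()
                  if extracted.get(name) == value)
    return correct / len(expected)

# Hand-verified values for one test page (hypothetical data).
truth = {"invoice_no": "10452", "date": "03/14", "total": "1,250.00", "po": "778"}

# Capture results for the same page (hypothetical data).
raw_result = {"invoice_no": "10452", "date": "03/14", "total": "1,250.00", "po": "77B"}
deskewed_result = {"invoice_no": "10452", "date": "03/14", "total": "1,250.00", "po": "778"}

baseline = field_accuracy(raw_result, truth)
processed = field_accuracy(deskewed_result, truth)

# Keep the processing step only if it is proven to help.
keep_preprocessing = processed > baseline
print(baseline, processed, keep_preprocessing)
```

In practice the comparison should be run over a representative sample set rather than a single page, since a processing step that helps one document type can hurt another.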

The Values of a Smarter Document Capture

Putting it All Together

Data capture is valuable to any organization with paper. Data capture reduces data entry cost, and enhances the value that paper documents provide an organization. For data capture technology to obtain the greatest return on investment, it is the responsibility of organizations to do the proper planning prior to testing and deploying this technology. By building an understanding of the internal business processes to be automated, preparing sample sets, creating evaluation criteria, and preparing for exceptions, organizations can exceed their expectations. Well-prepared organizations can increase their ROI and achieve a greater degree of automation than originally anticipated. Successful integration leads to the introduction of the technology across other business units and other paper-based processes.
