AUTOMATED DATABASE POPULATION OF ATTRIBUTE DATA
Alex Mackintosh
PlanGraphics, Inc.
1300 Spring Street, Suite 306
Silver Spring, MD 20910-3616
plangss@ix.netcom.com
ABSTRACT
This paper and presentation are intended as a general overview of Windows-based Optical Character Recognition (OCR) software for use in populating Geographical Information Systems (GIS) attribute databases.
Public Agencies and Utility Companies involved in populating the attribute portion of their GIS databases often face the problem of converting hardcopy attribute data to a digital tabular format in a cost-effective manner. Some Optical Character Recognition software is capable of converting text images to digital formats using efficient neural networking. When coupled with a database interface, hardcopy attribute data may be converted and input into a digital database in a cost-effective, automated fashion.
This paper addresses the basic characteristics of OCR software and the methodology of populating an ArcInfo database using OCR and a corresponding database interface. Public Agencies and Utility Companies should weigh the costs of developing attribute databases manually against the advantages that automated OCR and data input applications can offer. Various aspects of automated attribute conversion must be determined, including raster image formats, scanning variables, OCR accuracy, database input, error rates and methods of correction, and the resulting costs/benefits of automated data conversion techniques.
INTRODUCTION
The proliferation of OCR software for automated conversion of text images has led to the development of a number of OCR packages that convert text data using neural networking algorithms for recognizing characters. These packages are capable of being trained to recognize images of characters of poor quality and even handwritten images. Initial considerations similar to those made while planning graphic data conversion must be made with respect to the overall characteristics of the source data.
Before automated text recognition is performed, however, appropriate scanning resolutions for the hardcopy text images must be determined. Many inexpensive desktop scanners operate at the standard and fine mode fax resolutions of 200x100 and 200x200 dots per inch (dpi), respectively. The volume and condition of the source data may require a higher quality scanner capable of producing resolutions of 300 dpi or higher. For most source data in good condition, without folds and discoloration, and consisting of an 8 1/2" by 11" text image, a resolution of not less than 200 dpi is recommended for most OCR packages.
These resolutions provide crude images of text whose original sizes vary from eight to twelve points. The OCR software contains various tools for recognizing and converting text at a user-defined confidence level. It will also format the converted text into various ASCII delimited formats. The converted text may then be corrected using a standard or user-supplied dictionary.
The ASCII delimited text file is a common format that any database engine can import. The import commands will place the delimited text into the appropriate database fields, assuming that they are in the correct order in the delimited file. If not, a simple program performing a formatted read with appropriate condition statements and pointers may solve the problem. Further checks may then be run on the data to ensure that its content is correct.
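The reordering step described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the field names (parcel_id, owner, street) and the function reorder_fields are hypothetical, and the input is assumed to be comma delimited. Rows with the wrong number of fields are set aside for review rather than loaded.

```python
import csv
import io

# Hypothetical field order expected by the recipient database table.
DB_FIELD_ORDER = ["parcel_id", "owner", "street"]


def reorder_fields(delimited_text, source_order, target_order=DB_FIELD_ORDER):
    """Rearrange comma-delimited columns from source_order into target_order.

    Returns the reordered text and a list of (line_number, row) pairs for
    rows whose field count does not match source_order.
    """
    index = {name: pos for pos, name in enumerate(source_order)}
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    rejected = []
    reader = csv.reader(io.StringIO(delimited_text))
    for line_no, row in enumerate(reader, start=1):
        if len(row) != len(source_order):
            # Condition check: do not load rows with missing or extra fields.
            rejected.append((line_no, row))
            continue
        writer.writerow([row[index[name]].strip() for name in target_order])
    return out.getvalue(), rejected
```

For example, calling reorder_fields on OCR output whose columns arrive as owner, street, parcel_id returns the same records with parcel_id first, plus any rejected rows for an operator to inspect.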
INITIAL CONSIDERATIONS
One of the first steps of an automated data conversion campaign is to assess the quality of the source material. An inventory of source material should be assembled so that each category of material may be assessed in terms of its suitability for OCR techniques. For each candidate data source, several important aspects must be considered: the condition of the source material, its volume, the size of the source text, and the scanning resolution required.
The condition of the source material will dictate whether scanning is worth the effort. Older documents in poor condition may be scanned using a variety of novel techniques besides the traditional scanner and still yield an image of the text desired. If the image is noisy with significant amounts of extraneous linework, higher scanning resolutions and/or image scrubbing may be necessary. Both solutions entail higher costs in the forms of larger file sizes/storage space, and increased preparation effort, respectively.
The volume of acceptable source material will determine whether the investment in hardware, software and limited application development is cost-effective compared with manual data entry. Conversion of manual databases consisting of forms or recipe cards with handwritten entries requires enormous amounts of man-hours. The advantage of OCR using neural networking algorithms is that the relatively inexpensive software may be trained to recognize handwritten entries and output the result with some chosen level of confidence to an ASCII delimited format. A further investment in simple database applications programming can provide the desired database. The man-hours required to actually operate the OCR software are minimal after parameters providing acceptable results from the automated character recognition have been determined.
The size of the source text may present a problem because higher resolutions are necessary to define text of lesser point sizes adequately. If small text has to be converted, larger file sizes, regardless of format, will result from scanning at higher resolutions. Larger file sizes will also burden the host system if work is not scheduled properly.
Higher scanning resolutions produce finer images that make the task of the OCR software easier; however, file sizes grow roughly with the square of the resolution. An uncompressed image roughly quadruples in size when the resolution is doubled from 100 to 200 dpi, and grows about ninefold from 100 to 300 dpi. Higher resolutions also decrease the speed of the automated character recognition, although not severely. Background noise, caused by folds and discoloration in the source material, also increases in images of higher resolution. This can interfere with OCR algorithms even if they are properly trained to recognize handwriting and aberrant text. Caution should be exercised when estimating conversion times for poor-quality source material that requires scanning at very high resolutions.
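As a rough check on how resolution drives storage requirements, the following Python sketch estimates the size of an uncompressed scan. It assumes a 1-bit (black and white) image with no header or row padding; compressed formats will produce smaller files, and actual sizes depend on the scanner and format used. The function names are illustrative only.

```python
def uncompressed_bytes(width_in, height_in, dpi, bits_per_pixel=1):
    """Approximate size in bytes of an uncompressed raster scan.

    Assumes no file header and no row padding (a simplification).
    """
    pixels = round(width_in * dpi) * round(height_in * dpi)
    return pixels * bits_per_pixel // 8


def size_ratio(dpi_low, dpi_high):
    """Factor by which the pixel count grows between two resolutions."""
    return (dpi_high / dpi_low) ** 2


if __name__ == "__main__":
    # Compare an 8 1/2" by 11" page at several scanning resolutions.
    for dpi in (100, 200, 300, 400):
        size_kb = uncompressed_bytes(8.5, 11, dpi) / 1024
        print(f"{dpi} dpi: about {size_kb:.0f} KB uncompressed")
```

Under these assumptions a letter-size page occupies roughly 1 MB at 300 dpi, which is useful when planning disk space for a large inventory of source documents.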
INPUT & OUTPUT FORMATS
Most OCR packages expect raster image formats commonly found throughout the document management industry, such as TIF, BMP, PCX and DCX. Compressed formats such as JPEG are not supported by most OCR packages. These packages may also accept fax input from software such as FaxWorks and WinFax, and support common scanner drivers such as TWAIN and ISIS.
Output may be exported to any word processing package, spreadsheet or database engine. Generic DOS text or rich text format (RTF) is available. Less sophisticated formatted reads may be accomplished through the use of word processor macros. The easiest output for an import command in a given database engine to handle is delimited text of some kind, such as comma or dot delimited. Depending on the results of the OCR process, reformatting may have to be applied to force a group of delimited "words" into common categories, such as double road names. An example of this may be ",red,robin," in comma delimited format being reformatted to ",red robin," to prevent "red" being separated from "robin" in the case of "Red Robin Road".
Other dynamic links that are standard in Windows-based packages, such as Object Linking and Embedding (OLE), are also available for use with most OCR software. This feature is useful in training OCR software that uses neural algorithms.
RECOGNITION PROCESS
OCR software offers a variety of tools found commonly in Windows-based raster imaging software packages including tool and status bars, editing tools, hypertext help and various menus such as viewing, proofing and configuring. Basic recognition tasks consist of the following:
The size of the area chosen for conversion will depend on surrounding clutter from other characters or graphics in the text. OCR software ignores graphics; however, recognition problems may be created where graphics and text meet inadvertently.
Various editing menus exist to assist the operator in setting the software up for optimal performance. Some of these include variable magnification, image attribute variation and template association. Varying image attributes may enhance the manner in which the OCR software views text. Lowering the software, or display, resolution (not the scanned resolution) will allow the software to "see" a clearer image of the text, thus improving the initial chances of correct recognition.
Frame menus assist the operator in controlling the recognition process over an entire page, allowing multiple frames to be defined with various parameters set for each. This allows control over the order of text processing, which is important when only certain portions of a document page are deemed worthy of conversion.
Configuration settings are most important for successful OCR software operation. They determine the characteristics of what is being read, such as text type (plain, italic, numeric, image), text size (points, inches), and maximum text size to be read. Output formats are also determined in the configuration, such as plain DOS text or rich text format (RTF), which is text with defining attributes usually hidden from the user.
CONFIDENCE LEVELS OF RECOGNITION
Most OCR software allows the user to set margins of acceptable error when attempting to recognize a text image. This is necessary because no image of any text, particularly handwritten text, is perfect. Neural algorithms allow the software to "learn" patterns of text with respect to their shapes. Settings may then be made to recognize other instances of shapes that are similar within certain tolerances. Confidence levels are synonymous with measures of certainty. A person who is 100% confident is certain of being absolutely correct; conversely, a person who is 0% confident is sure of being completely wrong. Statistical practice commonly uses the 95% confidence level as a benchmark.
For simple machine generated text, two confidence levels may be set: global and table-based. Global confidence levels are applied to all text that is recognized. Table-based confidence levels allow the user to set confidence levels for individual characters. This forms the basis for more sophisticated analysis of text recognition based on errors that tend to repeat depending on the characters (or combinations of characters) involved. Table-based confidence levels may be saved and reused on certain types of documents where appropriate.
Error detection tools include spell checks and automatic corrections based on the software or user specific dictionaries. Various properties such as uppercase words, numeric expressions, roman numerals, proper nouns, math functions, abbreviations and acronyms may be ignored. Resolution of patterns involving text and numerals may be varied according to ambiguities dominant in the patterns. Proofreading may be performed manually, with corrections made as errors in the conversion are highlighted during proofing.
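The relationship between global and table-based confidence levels can be illustrated with a short Python sketch. The data layout here is an assumption made for illustration: each recognized character is paired with a confidence score between 0 and 1, and the per-character table simply overrides the global level. Actual OCR packages expose these settings through their own configuration menus, and the specific threshold values below are hypothetical.

```python
# Hypothetical global confidence level applied to all recognized text.
GLOBAL_THRESHOLD = 0.95

# Hypothetical table-based levels for characters that are easily confused
# (for example the digit 0 and the letter O).
CHAR_THRESHOLDS = {"0": 0.99, "O": 0.99, "1": 0.98, "l": 0.98}


def flag_uncertain(chars, global_threshold=GLOBAL_THRESHOLD, table=CHAR_THRESHOLDS):
    """Return (position, character, confidence) for results needing review.

    chars is a list of (character, confidence) pairs. A character is flagged
    when its confidence falls below its table-based level, or below the
    global level if the table has no entry for it.
    """
    flagged = []
    for pos, (ch, conf) in enumerate(chars):
        required = table.get(ch, global_threshold)
        if conf < required:
            flagged.append((pos, ch, conf))
    return flagged
```

With the values above, a letter "O" recognized at 0.96 confidence would be flagged for proofing even though it passes the global level, because the table demands 0.99 for that character.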
DATABASE INPUT
The first step in the final stage of the attribute input process is reformatting the delimited text to ensure that it reflects the items or fields found in the recipient database. Most delimited items will belong in their own database field; however, some database fields, particularly those designed to contain proper names, will have difficulty accepting proper names that are delimited. A custom AML or Avenue script may be written to recognize these occurrences and remove the delimiters from their positions between proper names. A simple algorithm that checks delimited items against standard lists of proper names is one method of accomplishing this. Another, less reliable method counts delimiters and eliminates them based on counted positions. This will work if the "look" of the incoming data is constant.
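The list-based method can be sketched as follows. The paper describes implementing this in AML or Avenue; Python is used here purely for illustration, and the name list and function name are hypothetical. The sketch joins adjacent delimited items whenever together they match an entry in a list of known multi-word names.

```python
# Hypothetical list of known multi-word proper names, stored in lowercase.
KNOWN_NAMES = {("red", "robin"), ("silver", "spring")}


def merge_proper_names(fields, known_names=KNOWN_NAMES):
    """Join adjacent fields that together form a known multi-word name."""
    max_len = max((len(name) for name in known_names), default=1)
    merged = []
    i = 0
    while i < len(fields):
        # Try the longest possible match first, down to two words.
        for size in range(min(max_len, len(fields) - i), 1, -1):
            candidate = tuple(f.strip().lower() for f in fields[i:i + size])
            if candidate in known_names:
                merged.append(" ".join(f.strip() for f in fields[i:i + size]))
                i += size
                break
        else:
            # No multi-word name starts here; keep the field unchanged.
            merged.append(fields[i])
            i += 1
    return merged
```

Applied to the items "1001", "red", "robin", "road", the function keeps "1001" and "road" as separate items and joins "red" and "robin" into a single item. The quality of the result depends entirely on how complete the name list is.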
The second step is to import the prepared data into the database file itself. After this has been managed, checks may be made on the attribute values contained in each field and their frequencies tabulated. The organization may then use AMLs or Avenue scripts to replace the errant values based on their record numbers and frequencies.
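A frequency check of this kind might look like the following Python sketch, again offered as an illustration rather than the AML or Avenue implementation the paper has in mind. It assumes records are held as dictionaries keyed by field name, and the field name "zoning" used in the note below is hypothetical. Values that occur rarely in a field are reported with their record numbers so they can be reviewed and replaced.

```python
from collections import Counter


def tabulate_field(records, field):
    """Count how often each value occurs in the given field."""
    return Counter(rec[field] for rec in records)


def suspect_records(records, field, min_count=2):
    """Return (record_number, value) pairs for rarely occurring values.

    Record numbers start at 1. A rare value is only a candidate error;
    it still needs to be checked against the source document.
    """
    counts = tabulate_field(records, field)
    return [
        (num, rec[field])
        for num, rec in enumerate(records, start=1)
        if counts[rec[field]] < min_count
    ]
```

This approach suits fields drawn from a small set of codes, such as a zoning field, where a misread value like "Rl" in place of "R1" stands out by its low frequency. It is less useful for fields such as owner names, where most values are legitimately unique.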
COSTS/BENEFITS
The costs associated with OCR technology are typical of any automated conversion effort. The hardware required consists of a scanner capable of up to 400 dpi resolution and a 486 or better PC with 16 MB of RAM. An adequate desktop scanner may cost from $1,200 to $3,000. Hard drive space and configuration depend on the amount of source material to be converted; 2 GB of disk space is a good starting point for small projects.
Most OCR software does not require much disk space, typically 7 MB to 25 MB. Desktop versions cost about $100 and professional versions about $700. Costs for enterprise OCR solutions capable of handling E-Mail and other imaging such as gray scales increase substantially due to licensing.
The effort required for an individual to re-type or enter data from a hardcopy 8 1/2" x 11" form into a database depends on how complete the form is. If twenty minutes is used as a benchmark, however, then OCR software is clearly faster, requiring only a couple of minutes from the initial image loading to the final delimited ASCII text output. The overhead of setting up the OCR process and developing the associated AMLs and/or Avenue scripts is worth the cost if there is a substantial amount of text conversion to be performed.
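The point at which the setup overhead pays for itself can be estimated with simple arithmetic, sketched below in Python. The twenty-minute manual benchmark comes from the text above; the two-minute OCR figure is an assumed reading of "a couple of minutes", and the setup effort passed to the function is a hypothetical input that each organization would need to estimate for itself.

```python
import math


def break_even_forms(setup_hours, manual_min_per_form=20.0, ocr_min_per_form=2.0):
    """Number of forms at which OCR setup effort is recovered.

    setup_hours is the one-time effort for configuring the OCR software
    and developing the supporting scripts (a hypothetical estimate).
    """
    saved_per_form = manual_min_per_form - ocr_min_per_form
    if saved_per_form <= 0:
        raise ValueError("OCR must be faster per form than manual entry")
    return math.ceil(setup_hours * 60 / saved_per_form)
```

For instance, if setup and script development took 40 hours, break_even_forms(40) indicates how many forms must be converted before the automated approach comes out ahead in labor time. Hardware and software purchase costs are not included in this estimate.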
CONCLUSIONS
Users of OCR software are often unfamiliar with the steps that are fundamental to producing successful results with automated text conversion processes. One solution to this problem is to find reliable OCR software that is easily understandable from the operator's perspective and that produces its output using neural networking algorithms. This software, if trained properly, should perform conversions automatically and reliably, enabling the user to have a high confidence level in the results.
In addition to the standard checks on attribute content available from the operation of the OCR software, the entire text conversion process should provide additional checks at the following points:
ACKNOWLEDGMENTS
The author wishes to thank Oleg Feldgajer of International Neural Machines, Waterloo, Ontario, for providing valuable insights into Optical Character Recognition.