Public agencies populating the attribute portion of their GIS databases often face the problem of converting hard-copy attribute data to a digital tabular format in a cost-effective manner. Optical Character Recognition (OCR) software can convert text images to digital formats using neural-network techniques. When coupled with a database interface, hard-copy attribute data can be converted and loaded into a digital database in a cost-effective, automated fashion. This paper addresses the basic characteristics of OCR software and a methodology for populating an ArcInfo database using OCR and a corresponding database input interface. Public agencies need to weigh the cost of developing attribute databases manually against the advantages that OCR and automated data input applications can offer. This paper and presentation are intended as a general overview of database population using automated techniques, OCR in particular. They will examine the following areas associated with automated attribute data input: raster image formats, scanning techniques, OCR accuracy, database input applications, error rates and corrections, and the resulting costs and benefits of automated data input.
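The OCR-to-database workflow described above can be sketched as a small validation step between recognition and load. This is a minimal illustration, not the paper's implementation: the pipe-delimited record layout and the field names (parcel ID, owner, acreage) are hypothetical, and a real application would route rejected reads to manual correction.

```python
# Sketch (illustrative only): validating OCR'd attribute records
# before loading them into a GIS attribute table. The record layout
# and field names are assumptions, not the paper's actual schema.

def parse_ocr_line(line):
    """Split one OCR'd record into (parcel_id, owner, acreage).

    Returns None when the line fails basic validation, so bad OCR
    reads go to a correction queue instead of the database.
    """
    parts = [p.strip() for p in line.split("|")]
    if len(parts) != 3:
        return None
    parcel_id, owner, acreage_text = parts
    if not parcel_id.isdigit():        # OCR often confuses O/0 and l/1
        return None
    try:
        acreage = float(acreage_text)
    except ValueError:
        return None
    return (parcel_id, owner, acreage)

records, rejects = [], []
for line in ["1042 | SMITH, J | 12.5", "1O43 | DOE, A | 9.x"]:
    rec = parse_ocr_line(line)
    (records if rec else rejects).append(rec or line)
```

Even a simple filter like this captures the error-rate/correction trade-off the abstract raises: each rejected line is a unit of manual work avoided in the database itself.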
Recently, scanners have become a primary source of data input for GIS, thanks to lower hardware costs, better vectorization software, and a greater awareness of the benefits of scanning in GIS. In the wake of this increased scanner use, the question of scanner accuracy is inevitable. GIS users must be able to quantify the accuracy of the data going into their system. Although accuracy is not an issue for some GIS because of known inaccuracies in the source data, most GIS have very specific accuracy requirements. In general, the average GIS database will require that input data be accurate to at least 0.018". This means that an input data location must be within 0.018" of its actual geographic location at the scale of the map. Some GIS have more stringent accuracy standards; many Federal Government GIS, for example, require that input data be accurate to within 0.005". In either case, the scanner cannot introduce more positional error than the maximum allowable in the GIS. Users are accustomed to dealing with standard accuracy issues such as media stability, source availability, and differences in data collection procedures. These issues are quantifiable, and the user can decide whether the resultant data are acceptable for their GIS prior to integration. Now, with the recent influx of scanned data, there is a new issue to be dealt with: the accuracy of the input scanner. This is an issue that few users really know how to quantify, but it is of paramount importance. Since scanners still tend to be quite expensive, the impact of scanning large amounts of data that do not meet the accuracy requirements of the GIS can be devastating. Users must be able to measure the accuracy of their own scanner, and service bureaus must be able to prove scanner accuracy to their clients. Unfortunately, scanner vendors are not much help in this area.
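The tolerances quoted above are stated in inches at map scale, so their ground-distance meaning depends on the map's scale denominator. The following sketch works through that arithmetic using the figures from the text; the 1:24,000 scale is an illustrative assumption, not a value from the paper.

```python
# Converting a map-sheet positional tolerance (inches on the map)
# to ground distance (meters) via the scale denominator.
# The 1:24,000 scale below is an illustrative assumption.

INCHES_TO_METERS = 0.0254

def ground_error_m(map_error_in, scale_denominator):
    """Ground-distance equivalent of a positional error on the map."""
    return map_error_in * INCHES_TO_METERS * scale_denominator

# The general 0.018" tolerance on a 1:24,000 map:
general = ground_error_m(0.018, 24_000)   # roughly 11 m on the ground
# The stricter 0.005" federal standard on the same map:
federal = ground_error_m(0.005, 24_000)   # roughly 3 m on the ground
```

The same map-sheet tolerance therefore tightens or loosens dramatically with source scale, which is why accuracy requirements are stated at the scale of the map.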
A typical marketing flyer for a feed-type scanner will claim that the device is accurate to plus or minus 0.04%. This sounds like a very accurate device, but the figure 0.04% is actually rather ambiguous. Scanner manufacturers define accuracy as the ability of the scanner to produce an image with output dimensions that are exactly proportional to the input document. They can guarantee that the image will be dimensionally correct within the specified tolerances, but they say nothing about the data within the body of the image. While the image may have exactly the right number of pixels, features within the image may be as far as three or four tenths of an inch from their correct location at the scale of the map, even though the scanner is operating within its stated accuracy specifications. Three tenths of an inch can translate to several hundred meters of error on the ground, depending on the scale of the source map. This is generally unacceptable for any GIS. This paper will instruct GIS users in a method for using ArcInfo to determine their own scanner's accuracy, and explain how to interpret the results. This method will help GIS users create a continuous surface from a test scan, measure errors digitally without the need for creating overlay plots (eliminating plotter error from the equation), display visual results on their computer screen or send output to a plotter, and pinpoint exactly which mechanism in their scanner is causing inaccuracies. The results will clearly represent the true accuracy of the scanner in a manner that is easy to understand and interpret.
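The gap between the vendor's dimensional claim and the feature-level error described above can be made concrete with a little arithmetic. In this sketch the 36" sheet width and the 1:24,000 source scale are illustrative assumptions; the 0.04% tolerance and the 0.3" internal feature error are the figures from the text.

```python
# Contrasting a vendor's dimensional-accuracy claim with the
# positional error of features inside the image body.
# Sheet width (36") and scale (1:24,000) are assumed for illustration.

INCHES_TO_METERS = 0.0254

sheet_width_in = 36.0
vendor_tolerance_in = sheet_width_in * 0.0004   # 0.04% of 36" = 0.0144"

feature_error_in = 0.3                          # error within the image body
feature_error_m = feature_error_in * INCHES_TO_METERS * 24_000  # ~183 m
```

A scan can thus hold its overall dimensions to hundredths of an inch while individual features are displaced by orders of magnitude more ground distance, which is exactly why the dimensional specification alone says so little about GIS suitability.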
The Missouri Department of Natural Resources (MDNR) participates with the Natural Resource Conservation Service (NRCS) in the National Cooperative Soil Survey (NCSS). This involves identifying, documenting, and mapping soils; developing this information for publication in the official survey books (usually by county); interpreting soils information for specific applications; and conducting related research. Soils maps and manuscripts have been developed and published for most of Missouri's 114 counties. The maps and interpretive information in the manuscripts represent a substantial investment (each county represents about 12 to 18 person-years of work) and are very valuable resources for natural resource planning, research, and development. Several agencies and other interested parties would like access to this information in an electronic format consistent with geographic information system (GIS) technology. The overall task of this project is to develop, in digital format, individual county soil coverages for the State of Missouri using available GIS and scanning technology, including raster-to-vector conversion and optical character recognition software.
When developing a GIS database from scratch, it is prudent to plan carefully what information, particularly feature attribute information, should be captured during the collection phase and how it will be collected, and also to explore how it will be used and maintained. But what does one do when an essential item is overlooked? This paper describes various ideas, techniques, and methods of pattern recognition and artificial intelligence that can be used to capture such information, which at first glance may not appear readily available but in reality can be accessed and used. And, best of all, the computer can do most of the work for you!