This paper describes methods for reducing data conversion costs by implementing statistical quality assurance procedures. The sampling approach identifies systematic data conversion problems at much lower cost than 100% inspections. This information may then be used to help data conversion contractors in process improvement efforts.
The National Imagery and Mapping Agency (NIMA) and
the Central Intelligence Agency's Map Services Center (MSC) have
an enormous inventory of paper maps that must be converted into
digital form. The data capture process is slow and quality assurance
of the digital products is costly. Therefore, these agencies require
an improved process for data conversion.
This paper is concerned with the quality assurance
(QA) of digital map data. It describes efforts to reduce the cost
of QA through the introduction of statistical sampling techniques.
Sampling cannot achieve the same levels of data confidence and
accuracy as 100% checking, so this is an exercise in optimization.
That is, how can we maximize data quality with limited resources
for QA?
Optimization implies that cost measures are being
traded off against some set of requirements. The NIMA digital
map data are produced for a user community with varying data content
requirements. Although many different map products are being produced,
this paper uses the conversion of the 1:250,000 scale Joint Operations
Graphics as an example. The 1:250,000 sources are converted into
a digital product called Vector Smart Map Level 1 or VMap1. VMap1
is a digital data product specification conforming to the Vector
Product Format Standard. VMap1 contains a number of feature classes
categorized as follows:
The VMap1 data are used for planning and some operations.
The following table lists feature classes that users feel are
critical for their missions. Some feature classes are more important
than others and a sampling-based QA approach must take advantage
of this.
Boundary
Data Quality
Elevation
Hydrography
Industry
Physiography
Population
Transportation
Utilities
Vegetation
Formal statistical sampling theory deals with the formulation of parameter estimators that minimize a function, the loss function, that describes how costly it is for estimates to be in error. Textbook statistics has a symmetric loss function in the background so that traditional measures of central tendency and dispersion are correct (that is, they minimize the symmetric loss function). The classic example of an asymmetric loss function is in the estimation of the size of a reservoir. An overestimate is less costly than an underestimate, so a biased population estimator is called for. Also, the loss function may show that certain components of a complex or aggregated estimator are more important. Then, stratified or other heterogeneous sampling strategies can mitigate the importance of the sampling error introduced into the parameter estimates.
The following charts illustrate the trade-off in optimizing the loss function. Here, the term "Cost" is a measure of the resources needed to check data. The first chart shows that confidence is a concave function of the sampling rate and that there is some error left even if all features are checked. Confidence is some measure of data accuracy. Its exact meaning is discussed later. The relation is concave because succeeding increments to the sampling rate do not contribute as much to enhancing confidence as did earlier increments. In part, this is because sampling error varies inversely with the square root of the sample size. The next chart indicates that increasing the sampling rate costs more and that the relation is roughly proportional. Therefore, as shown in the third chart, confidence is a concave function of QA costs. Now, add a fourth dimension to the problem. It is harder to check a complex map than a simple one. This is true because it is harder to distinguish features on a complex map than on a simple one and not because there are more features. The chart indicates that it costs more to reach a given level of confidence with a complex map than with a simple one. The trade-off is to sample complex maps (or complex areas within a single map) at a higher rate in order to reach a pre-selected level of confidence.
The loss function for estimating the quality of the
VMap1 digital data is subjective and hard to quantify. However,
it is clear that not all categories of features are equally important
to users. Therefore, QA procedures can be based on weighting
categories differently so that more of the features in the important
categories are checked than in the others.
With good quantitative data on the costs of QA it would be possible to construct a procedure that would allow a QA manager to select a confidence level and then to generate automatically a feature sampling scheme that would minimize the cost of achieving this level of confidence. In practice, there are several conceptual difficulties to be resolved before a fully automatic process can be developed. Some of these issues are described in Section 3.
Suppose we are evaluating a production process. We
want to be sure that our machine to make #10 hex nuts is performing
satisfactorily. The machine makes 1,000,000 nuts per day. It is
too costly to measure each nut to ensure that it meets specifications
so we are going to sample the production runs to measure the proportion
of nuts that fail to meet specs.
There are two basic forms of sampling, probability
samples and judgment samples. For probability samples, sampling
errors can be calculated and biases in selection and estimation
are nonexistent. The biases and sampling errors of judgment samples
cannot be calculated from the sample but must be determined by
expert judgment.
Probability samples are purely objective; a nut is
selected at random and measured. We record whether or not it met
specifications. Because the sampling is random, the selection
process is unbiased. The proportion of faulty nuts is measured
and the sampling error of this measurement can be calculated.
Now, we know a statistic (the measured proportion of faulty nuts)
and we know its sampling error. Therefore, we can formulate a
hypothesis test that the proportion of faulty nuts is lower than
some acceptable threshold and we can assign a confidence level
to this measure. For example, we might be able to say that the
proportion of faulty nuts is less than 0.01 percent with a confidence
of 99 percent. We are 99 percent certain that less than 0.01 percent
of the nuts are faulty.
The sampling procedure is random. We could put 1,000,000
nuts in a barrel, stir up the barrel, and then pick out the predetermined
number of nuts (that is, the number of nuts that need to be measured
to make the sampling error low enough to achieve the 99 percent
confidence level) to be measured. Alternatively, we could compute
a random sequence and use this to pick the nuts. Computed random
sequences are really pseudo-random but this is good enough.
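The computed-sequence alternative can be sketched in a few lines of Python. This is only an illustration: the function name `pick_sample`, the sample size of 500, and the seed are assumptions chosen for the example, not values from the nut-inspection scenario.

```python
import random

def pick_sample(population_size, sample_size, seed=None):
    """Return sorted item indices to inspect, drawn without replacement.

    A seeded pseudo-random generator makes the selection repeatable,
    which is useful when a QA run has to be audited later.
    """
    rng = random.Random(seed)
    return sorted(rng.sample(range(population_size), sample_size))

# Hypothetical run: choose 500 of the day's 1,000,000 nuts to measure.
indices = pick_sample(1_000_000, 500, seed=42)
```

Each index stands for one nut in the day's production; because the draw is without replacement, no nut is measured twice.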
Sampling-based quality checking is the key to process
improvement since it reveals systematic problems in the production
process. Suppose that a daily QA procedure shows that the proportion
of faulty nuts is higher on Mondays than on other days of the
week. This may imply that some maintenance problem occurs over
the weekend or that the machine (or its operators) needs extra
time to reach its efficient operating level.
Judgment samples depend on selecting "typical" or "representative" measurands or on selecting weighting factors that make allowances for characteristics of the population being measured that are not accounted for by the sampling itself. A stratified sample is a mix of the two forms. The population is divided into segments, the segments are assigned weights judgmentally, and then the segments are sampled randomly.
The quality assurance of spatial databases is concerned with measuring the proportion of spatial features that are "correct." We want to know the correctness of features over space and across the various categories of features. The sampling of a spatial database is judgmental because we know two things. First, data conversion is more difficult for regions of a map that are more dense. Second, not all feature categories are equally important to the users of the database. Therefore, the QA sampling scheme needs to be stratified over space and across feature categories. Then, we need methods to select random regions of the map to check and random features to check within categories.
Each feature is either correct or incorrect according
to a set of measurement criteria. Therefore, the statistical problem
is to estimate the proportion, p, of correct features from a large
population of features. The binomial distribution is used for
this case. The estimate of p is the proportion of correct features
found in the sample and the sampling error, $\sigma_p$, is equal
to $\sqrt{p(1-p)/n}$, where n is the sample size.
If n is large, then the binomial distribution may be approximated
by a normal distribution.
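As a minimal sketch of this calculation, the Python below estimates p and its sampling error from inspection counts and then forms an interval using the normal approximation. The function names, the counts (380 correct out of 400), and the multiplier z = 1.96 are illustrative assumptions, not figures from the project.

```python
import math

def proportion_estimate(n_correct, n_sampled):
    """Return (p_hat, std_err) for a sample of correct/incorrect features."""
    p_hat = n_correct / n_sampled
    std_err = math.sqrt(p_hat * (1 - p_hat) / n_sampled)
    return p_hat, std_err

def normal_interval(p_hat, std_err, z=1.96):
    """Two-sided interval from the normal approximation, clipped to [0, 1]."""
    return max(0.0, p_hat - z * std_err), min(1.0, p_hat + z * std_err)

# Hypothetical inspection: 380 of 400 sampled features were correct.
p_hat, se = proportion_estimate(380, 400)
low, high = normal_interval(p_hat, se)
```

The approximation is reasonable only when n is large and p is not too close to 0 or 1.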
Suppose we have a specification for the required value of p, the proportion of correct features. Then, for any desired level of confidence (the α-level of a hypothesis test) we can calculate the sample size needed to achieve this level of confidence. This calculation is based on simplifying assumptions that are not hard to accept.
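One common form of this calculation is sketched below. It assumes the normal approximation, a two-sided confidence level, and a tolerated margin of error around p; the margin, the function name `required_sample_size`, and the example values are assumptions introduced for illustration and are not taken from the paper.

```python
import math
from statistics import NormalDist

def required_sample_size(p_spec, margin, confidence=0.95):
    """Sample size so that the estimate of p lies within +/- margin
    of the true value at the given two-sided confidence level
    (normal approximation to the binomial)."""
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    return math.ceil(z ** 2 * p_spec * (1 - p_spec) / margin ** 2)

# Hypothetical requirement: p = 0.95, margin of 0.02, 95 percent confidence.
n = required_sample_size(0.95, 0.02, 0.95)
```

Halving the margin roughly quadruples the sample size, which is the square-root relation between sampling error and sample size seen from the other side.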
The quality of a spatial database is the accuracy
and completeness of its spatial features and attributes. NIMA
and MSC's goal in converting paper maps is to capture in digital
form all the information represented on the paper map. Then, the
definition of a high-quality spatial database is one that could
be used to replicate the paper map. The elements of the spatial
database need to be assessed with respect to spatial accuracy,
attribute accuracy, and feature completeness.
The single QA measure is the proportion of spatial
features that are represented correctly in the database. To be
correct, a feature has to be located in the right place, be categorized
properly, and have all of its attribute values correct. Errors
of commission occur if a feature is in the wrong place (or has
the wrong shape), is denoted as being in the wrong feature class,
or has an incorrect set of attribute values. Errors of omission
occur because a feature has been overlooked and is not in the
database.
There are two parts to the definition of digital
map database correctness. First, the formal Product Specification
describes the data dictionary for the database. The data dictionary
defines feature classes, feature types, required attributes, and
the valid domains of these attributes. The Product Specification
rigidly constrains the feature definitions allowed in the database.
Unfortunately, the data dictionary is not enough. Data capture
contractors also need a set of Digitizing Guidelines that spell
out in as much detail as possible the identification and interpretation
of cartographic symbolization on the source maps. Together, these
two documents comprise the definition of what is correct in the
digital database.
It is useful to define the terms "validation" and "verification" for use in the discussion of QA. A digital database is valid if it conforms to the Product Specification's data dictionary. Conformance is necessary for correctness but it is not sufficient. A feature may match the data dictionary and still be incorrect either because it is in the wrong place or because it has valid but incorrect attributes. Verification is defined as the process of checking data dictionary-compliant features to make sure they are attributed correctly and that they replicate the feature on the map.
This section describes sampling-based QA methods.
Sampling is needed because a full population census is too costly.
We are concerned here with the time of human QA technicians. Any
automated QA method, regardless of its computational difficulty,
will avoid sampling because it can be executed during off hours
and can, therefore, economize on the technician's time.
Validation can be automatic. A program can read the digital database and compare its content to the data dictionary. In addition, an automated checker can quality assure the format of a delivered database. That is, a program can validate both the content and structure of digital map data.
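A validation routine of this kind might look like the following sketch. The data dictionary layout, the feature types `road` and `tower`, and their attributes are hypothetical stand-ins; the real definitions come from the Product Specification.

```python
# Hypothetical data dictionary: feature type -> {attribute: valid domain}.
DATA_DICTIONARY = {
    "road": {"surface": {"paved", "unpaved"}, "lanes": {1, 2, 4}},
    "tower": {"lit": {"yes", "no"}},
}

def validate_feature(feature, dictionary=DATA_DICTIONARY):
    """Return a list of problems; an empty list means the feature is valid."""
    ftype = feature.get("type")
    if ftype not in dictionary:
        return [f"unknown feature type: {ftype!r}"]
    problems = []
    attrs = feature.get("attributes", {})
    # Every required attribute must be present and inside its domain.
    for name, domain in dictionary[ftype].items():
        if name not in attrs:
            problems.append(f"missing attribute: {name}")
        elif attrs[name] not in domain:
            problems.append(f"value {attrs[name]!r} outside domain of {name}")
    # Attributes the dictionary does not define are also reported.
    for name in attrs:
        if name not in dictionary[ftype]:
            problems.append(f"unexpected attribute: {name}")
    return problems

good = validate_feature({"type": "road",
                         "attributes": {"surface": "paved", "lanes": 2}})
bad = validate_feature({"type": "road", "attributes": {"surface": "gravel"}})
```

A feature that passes this check is valid but not necessarily correct; its position and attribute values still have to be verified against the source map.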
We want to stratify our sampling according to map
complexity. This is because we anticipate that error rates are
higher where map features are more dense. If we ignore the fact
that errors of omission are also more likely where features are
dense, then we can define a spatial sampling scheme weighted by
feature density from the digital database itself.
ArcInfo has commands to extract the vertices of
each kind of feature (point, line, polygon, etc.) and to write
these extracted points into a point coverage. ArcInfo also has
a command to combine all the point features in overlapping coverages
into one coverage. Then we can create a single coverage that contains
all the vertices in the database. Now, construct an array of rectangles
(at some resolution) that covers the area of the database being
checked and tabulate the number of points from the all-inclusive
point coverage that fall within each rectangle in the array. This
gives us a polygon coverage in which an attribute of each polygon
is (or, at least, is close enough to) the density of the source map.
Next, sort the rectangles by number of vertices and then normalize
the density measure such that it sums to one over all the rectangles.
A uniform (0, 1) random number is compared to the cumulative distribution
of rectangles to select a rectangle to be included in the sample.
Note that the probability of a rectangle being selected is roughly
proportional to the feature density of the area of the source
map it covers. Now, check features within this area manually.
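The rectangle selection can be sketched outside ArcInfo as follows. The sketch assumes the vertices have already been exported as (x, y) pairs; the function names and the grid layout (row-major cell indices) are illustrative choices. The sort step is omitted because it does not change the selection probabilities.

```python
import bisect
import itertools
import random

def cell_counts(points, xmin, ymin, xmax, ymax, nx, ny):
    """Count vertices in each cell of an nx-by-ny grid (row-major order)."""
    counts = [0] * (nx * ny)
    dx = (xmax - xmin) / nx
    dy = (ymax - ymin) / ny
    for x, y in points:
        # Points on the upper or right edge are assigned to the last cell.
        col = min(int((x - xmin) / dx), nx - 1)
        row = min(int((y - ymin) / dy), ny - 1)
        counts[row * nx + col] += 1
    return counts

def pick_cell(counts, rng):
    """Pick a cell index with probability proportional to its vertex count."""
    cumulative = list(itertools.accumulate(counts))
    # Scaling u by the total is equivalent to normalizing the counts to one.
    u = rng.random() * cumulative[-1]
    index = bisect.bisect_right(cumulative, u)
    # Guard against floating-point rounding at the upper edge.
    return min(index, len(counts) - 1)

# Hypothetical vertices on a 2-by-2 grid covering the unit-2 square.
vertices = [(0.5, 0.5), (0.6, 0.4), (1.5, 1.5)]
counts = cell_counts(vertices, 0, 0, 2, 2, 2, 2)
cell = pick_cell(counts, random.Random(7))
```

Empty cells are never selected, which follows from weighting by vertex count and is one reason the scheme ignores errors of omission in sparse areas.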
A second form of spatial sampling makes use of random points. Random points can be generated by drawing two uniformly distributed random numbers, one for longitude (or the horizontal axis) and one for latitude. A (0,1) uniform random number can be transformed into the range of each axis. Now, draw a circle around each random point and manually check the features found within the circle. ArcInfo commands can do this selection programmatically. The sampling rate is controlled by the number of points selected and the radius of the search circle.
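A minimal sketch of the random-point approach, independent of ArcInfo, is shown below. It assumes each feature is reduced to a single representative (x, y) point, which is a simplification for illustration; the function names, feature ids, and extent are hypothetical.

```python
import math
import random

def random_points(n, xmin, ymin, xmax, ymax, rng):
    """Draw n points by scaling two (0, 1) uniforms into the axis ranges."""
    return [(xmin + rng.random() * (xmax - xmin),
             ymin + rng.random() * (ymax - ymin)) for _ in range(n)]

def features_in_circles(features, centers, radius):
    """Return ids of features whose point lies within radius of any center.

    features maps a feature id to its representative (x, y) point.
    """
    selected = set()
    for fid, (fx, fy) in features.items():
        if any(math.hypot(fx - cx, fy - cy) <= radius for cx, cy in centers):
            selected.add(fid)
    return selected

# Hypothetical map extent and features.
centers = random_points(5, 0.0, 0.0, 10.0, 10.0, random.Random(3))
to_check = features_in_circles({"a": (0.0, 0.0), "b": (5.0, 5.0)},
                               [(0.5, 0.0)], radius=1.0)
```

Raising the number of points or the radius raises the sampling rate, at the cost of more features for the technician to check.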
Feature classes differ with respect to how important
they are to users and how difficult they are to verify. Therefore,
a stratified feature sampling scheme is called for. We want to
select features to check randomly but with a probability of selection
higher for more important feature classes. A simple sampling scheme
is to give each feature in a particular class a weight and then
normalize the weights over all features to sum to one. Then sort
the weights, form their cumulative distribution, and draw a (0, 1)
uniform random number to select a feature to check manually.
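The simple scheme can be sketched as follows. The class weights are hypothetical placeholders (real weights would come from user requirements), and the sketch relies on Python's `random.choices`, which performs the normalization and cumulative lookup internally. Features are removed once chosen so that none is checked twice.

```python
import random

# Hypothetical importance weights per feature class (not from the paper).
CLASS_WEIGHTS = {"Transportation": 3.0, "Elevation": 2.0, "Vegetation": 1.0}

def sample_features(features, k, rng, class_weights=CLASS_WEIGHTS):
    """Select up to k distinct feature ids, favoring heavier classes.

    features is a list of (feature_id, feature_class) pairs.
    """
    remaining = list(features)
    chosen = []
    while remaining and len(chosen) < k:
        weights = [class_weights[cls] for _, cls in remaining]
        idx = rng.choices(range(len(remaining)), weights=weights, k=1)[0]
        chosen.append(remaining.pop(idx)[0])
    return chosen

# Hypothetical database: 50 Transportation and 50 Vegetation features.
features = ([(i, "Transportation") for i in range(50)]
            + [(i + 50, "Vegetation") for i in range(50)])
to_check = sample_features(features, 10, random.Random(1))
```

With these weights a Transportation feature is three times as likely to be drawn as a Vegetation feature on any single draw.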
This approach needs to be refined to handle Air Force requirements. Any vertical obstruction, spot elevation, tower, silo, etc. is very important to pilots. However, these feature types are in different feature classes so the simple approach described above is not adequate. We need to evaluate sampling schemes to handle this special case.
Over time, QA technicians notice patterns and correlations
in errors. Sometimes, this correlation may be traced back to an
element of the data capture contractor's process. This kind of
insight is important for process improvement and may add to the
power of a sampling approach. Adaptive sampling uses error correlation
to modify feature or area selection probabilities.
At this time, little information exists to formulate an adaptive sampling approach. However, if we are to make use of this technique in the future, we must capture key information about possible error sources so correlated errors can be identified and explained. This has implications for the way data capture contractors document their work and for the way QA errors are annotated.
Optimal sampling is one way to reduce the cost of
inspecting digital maps; enhancing QA technicians' productivity
is another. This section describes a workstation environment for
digital map database inspection. Both the concept of operations
and the functionality for the workstation are preliminary. This
is only a concept and not a design.
The key to the workstation concept is to give the QA technician rapid access to information and visualizations that will support human pattern recognition. Also, the workstation will serve as the executive for launching non-real time tasks (such as automatic validation routines) and for reviewing the results of such tasks. Interactive functions are to be performed as fast as possible.
Figure 2 shows the workstation's user interface.
The screen is divided into three areas. The image area on the
left displays two overlapping images. One image is the scanned
source map. This is shown as a full color image but it may also
be a scanned separate. A cartographic representation of the digital
database being inspected is overlaid on the scanned map. The cartographic
image is the output of a map production preprocessing step. It
is either a raster map image or a graphics metafile. It uses a
simplified symbology that has been designed to encode features
in a way meaningful to an expert user. The overlay is easier to
see on the computer screen than on the printed page. The images
are selected by picking from a drop down list. The images must
be registered so that the QA technician may evaluate feature coding
and spatial accuracy very rapidly. There are controls for rapid
pan and zoom (in and out).
The "Fade" control determines a degree
of transparency for the overlaid cartographic product. If the
fade control is at its maximum, the cartographic image is completely
transparent and only the scanned source map appears. Sliding the
control downward brings out the cartographic image in stages until
the scanned map disappears when the control is at the bottom.
Image fading is a powerful tool used in image analysis for feature
discrimination and change detection. Here, fading is used for
checking spatial accuracy, finding features that have been left
out, and for high-level feature coding checks. The mouse cursor
is pointing to a brown line that should be the center line of
the black railroad. These segments are in error since they do
not line up properly.
Image fading has to be instantaneous to be effective. A preliminary feasibility study showed that fade can be implemented as described using graphics display hardware that is supplied with today's high-end Windows-based platforms. The two images are converted to a simpler form using a reduced set of colors (16 colors for the scanned map and eight colors for the cartographic metafile) and then the fade operation is performed using color lookup table animation.
The second area of the screen holds a different kind
of cartographic representation of the digital database. It is
a full geographic information system (GIS) display. This screen
area is drawn using ArcPlot (ArcEdit, ArcView, or a MapObjects
application may be better). It uses a more elaborate feature symbolization
than the cartographic overlay in the first screen area and, therefore,
conveys more attribute information. A feature "identify"
tool is active on this window so the operator can view the complete
attributes for a selected feature or set of features. This area
of the screen is used for checking feature attribution.
The two screen areas are implemented as child windows. Therefore, they can be positioned on the screen independently. For example, they do not have to be overlapped if there is enough screen area for both. The spatial extents displayed by the two windows are coordinated. That is, if the active window is panned or zoomed, the other window will adjust automatically to display the same spatial extent.
The third screen area is a toolbar. Ten tools are defined here. They are:
The QA Workstation has a standard set of menus. The Reports menu contains commands for reviewing various kinds of database reports such as the results of automatic validation routines. In addition, it is a way to access an ad hoc database reporting capability.
The literature on GIS database QA says little about
automatic spatial accuracy or attribute correctness checking.
However, some research results reported by the Ohio State University
GISOM Project are interesting. They verify some attribute correctness
and topological consistency constraints automatically. For example,
contour lines cannot intersect. An intersection test goes beyond
the usual attribute verification tests (are all the values permissible?)
and also beyond general topological consistency checks such as
looking for open polygons.
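An intersection test of this kind can be sketched with a standard orientation test on line segments. This is not the GISOM Project's method, which the text does not describe; it is a generic illustration, and it detects only proper crossings (contours that merely touch or overlap collinearly are not flagged).

```python
def _orient(a, b, c):
    """Twice the signed area of triangle abc; the sign gives the turn direction."""
    return (b[0] - a[0]) * (c[1] - a[1]) - (b[1] - a[1]) * (c[0] - a[0])

def segments_cross(p1, p2, q1, q2):
    """True if segment p1-p2 and segment q1-q2 cross at an interior point."""
    d1 = _orient(q1, q2, p1)
    d2 = _orient(q1, q2, p2)
    d3 = _orient(p1, p2, q1)
    d4 = _orient(p1, p2, q2)
    return d1 * d2 < 0 and d3 * d4 < 0

def contours_cross(line_a, line_b):
    """True if any segment of one contour crosses any segment of the other."""
    for a1, a2 in zip(line_a, line_a[1:]):
        for b1, b2 in zip(line_b, line_b[1:]):
            if segments_cross(a1, a2, b1, b2):
                return True
    return False

# Two hypothetical contours, each given as a list of (x, y) vertices.
crossing = contours_cross([(0, 0), (2, 2)], [(0, 2), (2, 0)])
parallel = contours_cross([(0, 0), (2, 0)], [(0, 1), (2, 1)])
```

The pairwise comparison is quadratic in the number of segments, so a production check would normally add a spatial index to limit the candidate pairs.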
An automated accuracy and completeness check can be used for checking linear features. The technique is based on the vector to raster conversion algorithms in computer graphics. The idea is to compare the lines generated by plotting the line features in the digital database to the source pixels that defined the line on the source map. Line quality is measured quantitatively. This is done as follows. First, establish a rule for defining centerline pixels on the source image. Then, count the proportion of centerline pixels that are intersected by the vector line. Note that each pixel has a finite spatial extent and that the vector line has no thickness. If the proportion of centerline pixels crossed by the vector line does not exceed a threshold value, then flag the segment where this occurred for later operator evaluation. See the example in Figure 3. The centerline quality of this line is one.
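The centerline quality measure can be sketched as below. The sketch assumes the centerline pixels have already been identified and are given as (column, row) pairs, and that the vector line is expressed in the same pixel coordinates. Instead of an exact vector to raster traversal it samples points densely along each segment, which is an approximation chosen for brevity; the function names, step size, and threshold are illustrative.

```python
import math

def pixels_crossed(vertices, step=0.1):
    """Approximate the set of pixels a polyline passes through by
    sampling points along each segment at intervals of at most `step`."""
    crossed = set()
    for (x0, y0), (x1, y1) in zip(vertices, vertices[1:]):
        length = math.hypot(x1 - x0, y1 - y0)
        n = max(1, math.ceil(length / step))
        for i in range(n + 1):
            t = i / n
            crossed.add((math.floor(x0 + t * (x1 - x0)),
                         math.floor(y0 + t * (y1 - y0))))
    return crossed

def centerline_quality(centerline_pixels, vertices, step=0.1):
    """Proportion of centerline pixels intersected by the vector line."""
    crossed = pixels_crossed(vertices, step)
    hits = sum(1 for pixel in centerline_pixels if pixel in crossed)
    return hits / len(centerline_pixels)

# Hypothetical centerline of four pixels along one image row.
centerline = {(0, 0), (1, 0), (2, 0), (3, 0)}
quality = centerline_quality(centerline, [(0.5, 0.5), (3.5, 0.5)])
needs_review = quality < 0.9  # hypothetical threshold
```

A vector line that runs through every centerline pixel scores one; a line displaced by a full pixel scores zero and would be flagged for operator evaluation.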
Quality assurance is the empirical foundation for
process improvement steps. The systematic errors found while inspecting
a number of digital maps reveal aspects of the data development
process that require investments in new or enhanced technology
and practice. The goal of process improvement is to produce more
digital data at lower cost and more quickly than before. In addition, insights
from analyses of paper map conversions are applicable to the formulation
of new data capture methods to be applied to future data sources.
Our experience to date indicates four broad areas for process improvement efforts. They are:
Alan Freiden
Island System Design
311 Fort Howell Drive
Hilton Head Island, SC 29926-2765
Telephone: (803) 342-3830
email: afreiden@digitel.net
Mark Johnson
CIA Map Library
Telephone: (703) 742-8071