Michael Kennedy

Patricia E. Bomba

The Dot Probability Paradigm

for the Storage of Spatial Data


Abstract

Organizations and agencies which use GISs are requiring more precise "metadata" which describe the confidence one might place in stored spatial data. This is true not only for primary datasets but for derived datasets as well. The need for such metadata, and for the quality control(QC) which it supports, will increase as GISs are used more often to decide issues which may produce litigation. The approach proposed herein allows the user to ascertain the degree of accuracy of the spatial data concerned. The intent of its design is to provide a universal data frame that promotes truly "honest" GIS processing, while at the same time permitting "fuzziness" in GIS data which both polygon and cell paradigms deny.

The Dot-Probability Paradigm (DPP) is a GIS dataframe for the storage and manipulation of a real, network, and point spatial data; further the DPP has built into it the ability to provide the user with detailed information about the quality of data contained in a given dataset.

The DPP project was sponsored by Esri and the Ohio Center for Mapping.


1. Definition and Arrangement of Dots

The DPP is based on a raster of closely spaced points, to be referred to as "dots." For purposes of discussion and illustration let us consider that the nominal spacing of these dots is one centimeter (cm)in a square array. An alternative spacing might be to use some fraction of a degree of latitude and some (other) fraction of a degree of longitude. Any other set of measures which will form an approximately square grid would be usable.

Of the set of all dots, usually only a portion are considered "active" or "lit" for any given theme. For example, for some (a real) theme "X" only every 2048th dot in the horizontal direction, and also in the vertical direction, might be active, giving (with our assumption of 1 cm dot spacing) a distance between adjacent active dots of 2048 cm, or roughly 67 feet. Only active dots represent data.

The set of active dots is specified by a number called the dot cover parameter ("dcp") which is calculated as (2 to the power "k"),where "k" is an integer chosen by the user. The use of the "dcp"-- 2048, or 2 to the eleventh power in the case above -- allows variety in the amount of data stored in a given area, and in the level of detail. Given two themes of equal "dcp", the user is guaranteed that all the data stored in one theme will be positionally congruent with all the data in the other; should one theme be less dense than the other, all the data in the less dense theme will be positionally congruent with data in the more dense theme.

2. Areal Data Storage

Two values are stored in connection with each active dot:

"z" - the most likely value of the theme precisely note 1 at the dot, and

"p" - a measure of the correctness of "z".

For categorical data, a particular "p" is the probability(0<=p<=1) note 2 that the value of the corresponding "z" is correct. For continuous data, "p" is a statistic(for example, the variance) which promotes an understanding of the confidence one might have in the corresponding "z". A value of "p" of less than 0.5 makes the value of "z" dubious indeed, although the corresponding value of "z" might be the best single guess. The paradigm also includes handling a value of "z" recorded as"NODATA".

The size of the dataset for a given area obviously depends on the "dcp". For an area in which data exist at every dot, the "dcp" would be 1 ("k" = 0) which corresponds to the most dense level of data storage; if data existed at only one-fourth the dots (every other dot in each direction) the "dcp" would be 2 ("k" = 1). A "k"of 11 would produce a "dcp" of 2048, or data dots every 20.48 meters. (The "dcp" is constrained to powers of two with an integer exponent to allow in filling of themes and overlaying of themes, as explained below.)

As with many raster systems, the location of a given dot is found by calculation, rather than by retrieving its "x" and "y" coordinates from storage. This calculation may be done by integer arithmetic-- a fact which may be taken advantage of to produce extremely fast processing. Further, the value of "p" can be stored as a scaled integer, again obviating the need for floating point processing.

3. Assignment of Dot Values from External Areal Sources

Data for DPP datasets may be taken from a number of other sources:

  1. Dot values ("z") may be digitized or scanned from existing maps. Efficient software for the rapid digitizing of dots has been constructed and tested, but does not exist for popular machines. The associated values of "p" might be determined from knowledge of the accuracy claimed for the map, the parameters of the digitizing or scanning process, and, with categorical data, the closeness of a dot to a perceived boundary.
  2. Dot values and probabilities may be generated by re-sampling other raster structures, such as LANDSAT or SPOT scenes, or other grid databases. Associated probabilities could be based on the metadata for the particular scene, and on the closeness of a dot to the implied boundaries of change in those scenes or databases.
  3. Dot values may be generated from automated vector-based polygon areal data using point-in-polygon techniques. Associated probabilities might be derived from an understanding of the quality of the original data and closeness of each given point to a boundary in the original dataset. The authors agree that the polygon technique for spatial data storage has considerable advantage over the DPP with regard to geographic specificity. However, "popping" polygon coverages into dot coverages prior to overlaying could result in considerable efficiency during analysis. Further, since dots near the boundaries could be assigned smaller values of "p" than those in the interior, honesty about the quality of the overlay could be enhanced. The user is provided with a general function to assign the rate of " probability dropoff" for dots near a border. The function reduces the probability of correctness according to distance from the arc and parameters supplied by the user. Or the user may choose to supply a function to calculate the reduced probability. The use of this approach paves the way for fuzzy set processing.

4. Assignment of Dot Values by Infilling

Given the existence of a dot dataset of "dcp" value 2 to the "m", where "m" is greater than 0, approximately three times as many data may be generated by "infilling" the dataset. Infilling consists of assigning a value to each dot which is precisely between each pair of adjacent active dots in each horizontal row, likewise for adjacent active dots in vertical columns, and in the center of each square formed by a set of four original adjacent active dots. The result is a dot dataset with a "dcp" of 2 to the "m-1". Each new active dot's "z" value comes from consideration of its neighboring dots. Each "p" value comes (a) from consideration of the "p" values of the neighboring dots, (b) from consideration of the "z" values of neighboring dots, (c) from statistical measures related to the characteristics of the theme being portrayed, and (d) from consideration of the distance between active dots of the original theme.

If more intense infilling were desired (i.e., going from "k" equal to "m," to "k" equal to "m-2" or "m-3"-- which we might call "order 2 infilling " or "order 3 infilling")the DPP infilling procedure described above would not be appropriate (even though using it recursively would result in a database of correct "dcp").Rather infilling should take place by defining new dots closest to original active dots and repeating this procedure until the proper order of infilling has been achieved.

5. Assignment of Dot Values by Overlaying

Active dots of a new composite theme may be generated by superimposing(overlaying) two established dot datasets with identical "dcp's". Given two dot themes, say "A" and "B" consisting of thematic or categorical data, for each active dot location the assignment of the value of "z" in the new theme, (say "C"), is based on some function of the corresponding "z" values of "A" and "B". This function could be as simple as the appropriate value in the Cartesian product formed by the possible values of themes "A" and "B." In general, the value of "z" in theme "C" could be determined by a high-level language program, or by a ProLog or other AI procedure.

The value of "p" for a given dot represents the joint probability of the correctness of the new value of "z"; it could be the product of the "p" values of the pair of dots in "A" and "B", or the result of a more sophisticated statistical process.

The actual resulting probability of the overlay of a pair of dots from two themes may well not simply be the product of the associated values of "p". The resulting value of "p" in the composite theme may depend on many factors, but certainly one might consider the values of "z" in each theme. If one overlays land use and land cover, and the two values are "pasture" and "grass" the probability of correctness might be increased. On the other hand if the values are "pasture" and "pavement" the probability might be decreased. Therefore the software allows the user, for each value in the Cartesian product, to specify an integer "i" in the range[-m to m] such that, for positive "i" the probability "p" is increased above the product of the constituent probabilities; for negative "i" the probability is decreased. For "i" equal to "m", the resulting probability is one; for "i" equal to "-m", the resulting probability is zero. The function is linear.

If the dcp's of two themes "A" and "B" are not equal, and overlaying is desired, one theme might be infilled. Alternatively or additionally, the other theme might be thinned, with, of course, the concomitant loss of data.

6. Use of the DPP for point, line, and other data

When using the DPP with point and line data, the active dots are simply those dots used to define the feature; the "dcp" plays no role. That is, every dot is addressable. Thus, using our assumed resolution of 1 cm, points could be delineated within 0.5 cm.

The DPP could store traditional point data by providing the coordinates of the dot closest to the reference point on the object being depicted. The "z" value would simply name the object; the "p" value might indicate spatial (x,y) precision.

Linear data could be stored by simply connecting the relevant dots. The "p" values for each dot used could again indicate spatial(x,y) precision.

It is not hard to envision TIN data, land records data, and data from surveyors in this system.

7. Favorable characteristics of the Dot-Probability Paradigm

The authors contend that the Dot Probability Paradigm has a number of characteristics which make it attractive as a method of spatial data storage and processing.

  1. The DPP has built into it the basis for quality control (QC). Measures for overall coverage accuracy -- whether of input or computed coverages-- would be trivial to compute, as would be estimates of accuracy related to particular geographic areas within the coverage and/or particular feature types. And, of course, accuracy measures could be displayed visually.
  2. Developed and developing statistical and numerical analytic techniques may be applied to the data in the DPP. In particular, many of the techniques used to enhance satellite image data could be applied. Formally, a given DPP data set could be considered a point sample.
  3. The DPP would allow the tool of fuzzy set theory to be applied to spatial data in a straightforward way.
  4. The paradigm accommodates both traditional point data and linear data.
  5. The re-emergence of cell-based modeling systems as major analytical GIS tools, and their integration with vector based systems, fits well with the raster nature of the DPP.
  6. The increasing availability of parallel machines and appropriate algorithms would serve the DPP quite well. DPP spatial overlaying is fundamentally a set of processes -- each concerned with a "more local" geographic area than processes on areas represented by a polygon system -- which maybe conducted in parallel, as no particular sequence is implied.
  7. The DPP is highly compatible with remote sensing datasets and with raster-based graphics. Its compatibility with raster I/O devices is obvious.
  8. Integer arithmetic may be used to determine the addresses of dots. Such an addressing scheme may be many times faster than those requiring floating point arithmetic. Using a full (32 bit) word for each dimension of the address (for the subscripting value), assuming 31 bits for the number, would provide coverage of an area equivalent to more than one-fifth of the earth's total surface, with dots spaced 1 cm apart; this probably constitutes an adequate tile size at a reasonable resolution for most applications. The values of "p" likewise could be stored as integers, scaled when necessary for use in some computations.
  9. Viewed in another way, the DPP could serve as the vehicle for the implementation of the Strabo Technique, as proposed by Dr. Thomas Poiker of Simon Fraser University. Under Strabo -- which has parallels to the Delphi Technique-- experts provide spatially distributed information to which they attach levels of confidence.

8. Areas for Research

Several issues would have to be addressed before the DPP could become a viable technique for spatial data storage and processing although a prototype exists based on ArcInfo and GRID. Further, the DPP could lead to several interesting research projects. A partial list addressing these concerns might include:

  1. Theoretical and automated techniques for determining "p" values to be assigned to various types of coverages, and to features within a coverage.
  2. Reforming dot coverages into polygon coverages, perhaps taking probabilities of accuracy into account and forming polygons indicative of different levels of accuracy.
  3. Integration of the DPP into existing software packages .
  4. DPP and fuzzy sets; how can "z" and "p" play a role?
  5. Using the DPP in measuring heterogeneity; in measuring homogeneity; in auto-correlation.
  6. Mechanisms for data compression and use of raster-based storage techniques such as quad trees.
  7. Exploration of picture enhancement techniques applied to DPP.


Notes:

  1. To be more detailed, the value at the dot is the specific area feature which surrounds the dot, since the dot itself has zero dimension. back to article
  2. The value of "p", while formally a number between zero and unity, could be stored as an integer. For example, one might substitute "999" for 0.999, thus using scaling to obviate the need for floating point storage and processing. Further one could restrict the range of values which might be assigned to "p" to, say, 999 (for virtual certainty),99, 98, 97, and so on, which could facilitate the use of run-length encoding or other data compression schemes. back to article


Michael Kennedy

Department of Geography
1451 Patterson Office Tower
University of Kentucky
Lexington, KY 40506-0027
Phone: 606-257-6494
Fax: 606-323-1969
kennedy@ukcc.uky.edu