Organizations and agencies that use GISs are requiring more precise "metadata" that describe the confidence one might place in stored spatial data. This is true not only for primary datasets but for derived datasets as well. The need for such metadata, and for the quality control (QC) that it supports, will increase as GISs are used more often to decide issues that may produce litigation. The approach proposed herein allows the user to ascertain the degree of accuracy of the spatial data concerned. The intent of its design is to provide a universal data frame that promotes truly "honest" GIS processing, while at the same time permitting "fuzziness" in GIS data which both polygon and cell paradigms deny.
The Dot-Probability Paradigm (DPP) is a GIS dataframe for the storage and manipulation of areal, network, and point spatial data; further, the DPP has built into it the ability to provide the user with detailed information about the quality of data contained in a given dataset.
The DPP project was sponsored by Esri and the Ohio Center for Mapping.
The DPP is based on a raster of closely spaced points, to be referred to as "dots." For purposes of discussion and illustration let us consider that the nominal spacing of these dots is one centimeter (cm) in a square array. An alternative spacing might be to use some fraction of a degree of latitude and some (other) fraction of a degree of longitude. Any other set of measures which will form an approximately square grid would be usable.
Of the set of all dots, usually only a portion are considered "active" or "lit" for any given theme. For example, for some (areal) theme "X" only every 2048th dot in the horizontal direction, and also in the vertical direction, might be active, giving (with our assumption of 1 cm dot spacing) a distance between adjacent active dots of 2048 cm, or roughly 67 feet. Only active dots represent data.
The set of active dots is specified by a number called the dot cover parameter ("dcp"), which is calculated as 2 to the power "k", where "k" is an integer chosen by the user. The use of the "dcp" (2048, or 2 to the eleventh power, in the case above) allows variety in the amount of data stored in a given area, and in the level of detail. Given two themes of equal "dcp", the user is guaranteed that all the data stored in one theme will be positionally congruent with all the data in the other; should one theme be less dense than the other, all the data in the less dense theme will be positionally congruent with data in the more dense theme.
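The relationship between "k", the "dcp", and positional congruence can be illustrated with a minimal Python sketch. It assumes, as the text implies but does not state, that every theme shares a common raster origin, so that a dot is active when both of its raster indices are multiples of the "dcp". The function names are hypothetical and chosen only for illustration.

```python
def dcp(k: int) -> int:
    """Dot cover parameter for a non-negative integer exponent k (2 to the power k)."""
    if k < 0:
        raise ValueError("k must be a non-negative integer")
    return 1 << k


def is_active(col: int, row: int, k: int) -> bool:
    """True if the dot at raster index (col, row) is active for exponent k.

    Assumption: all themes share the origin (0, 0), so active dots fall on
    indices that are multiples of the dcp in both directions.
    """
    step = dcp(k)
    return col % step == 0 and row % step == 0


def active_spacing_cm(k: int, dot_spacing_cm: int = 1) -> int:
    """Distance between adjacent active dots, in centimeters."""
    return dcp(k) * dot_spacing_cm
```

Under this assumption, any dot that is active at "k" = 11 is also active at "k" = 10, which is the congruence property described above: `is_active(4096, 2048, 11)` and `is_active(4096, 2048, 10)` both hold.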
Two values are stored in connection with each active dot:
"z" - the most likely value of the theme precisely at the dot (note 1), and
"p" - a measure of the correctness of "z".
For categorical data, a particular "p" is the probability (0 <= p <= 1) (note 2) that the value of the corresponding "z" is correct. For continuous data, "p" is a statistic (for example, the variance) which promotes an understanding of the confidence one might have in the corresponding "z". A value of "p" of less than 0.5 makes the value of "z" dubious indeed, although the corresponding value of "z" might be the best single guess. The paradigm also includes handling a value of "z" recorded as "NODATA".
The size of the dataset for a given area obviously depends on the "dcp". For an area in which data exist at every dot, the "dcp" would be 1 ("k" = 0), which corresponds to the most dense level of data storage; if data existed at only one-fourth the dots (every other dot in each direction) the "dcp" would be 2 ("k" = 1). A "k" of 11 would produce a "dcp" of 2048, or data dots every 20.48 meters. (The "dcp" is constrained to powers of two with an integer exponent to allow infilling of themes and overlaying of themes, as explained below.)
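The dependence of dataset size on "k" can be made concrete with a small Python sketch. It assumes a rectangular area measured in dots, with an active dot at index 0 in each direction; the function name is hypothetical.

```python
def active_dot_count(width_dots: int, height_dots: int, k: int) -> int:
    """Number of active dots in a width x height block of dots for exponent k.

    Assumption: the block starts at an active dot (index 0 in both directions).
    """
    step = 1 << k                             # the dcp, 2 to the power k
    cols = (width_dots + step - 1) // step    # ceiling division
    rows = (height_dots + step - 1) // step
    return cols * rows
```

For a block of 1000 by 1000 dots, raising "k" from 0 to 1 reduces the count from 1,000,000 to 250,000, the one-fourth ratio given in the text.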
As with many raster systems, the location of a given dot is found by calculation, rather than by retrieving its "x" and "y" coordinates from storage. This calculation may be done by integer arithmetic, a fact which may be taken advantage of to produce extremely fast processing. Further, the value of "p" can be stored as a scaled integer, again obviating the need for floating point processing.
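A sketch of how location and "p" might be handled with integers only is given below in Python. The 16-bit scale factor, the origin parameters, and all function names are assumptions made for illustration; the paper does not specify a storage format.

```python
from typing import Tuple

P_SCALE = 65535  # hypothetical choice: p stored as an unsigned 16-bit integer


def dot_location_cm(i: int, j: int, k: int,
                    origin_x_cm: int = 0, origin_y_cm: int = 0,
                    dot_spacing_cm: int = 1) -> Tuple[int, int]:
    """Location of the (i, j)-th active dot, computed rather than stored."""
    step = (1 << k) * dot_spacing_cm
    return origin_x_cm + i * step, origin_y_cm + j * step


def encode_p(p: float) -> int:
    """Convert a probability in [0, 1] to its scaled-integer form."""
    if not 0.0 <= p <= 1.0:
        raise ValueError("p must lie in [0, 1]")
    return round(p * P_SCALE)


def decode_p(q: int) -> float:
    """Convert a scaled integer back to a probability (for display only)."""
    return q / P_SCALE


def multiply_scaled(q1: int, q2: int) -> int:
    """Product of two scaled probabilities using integer arithmetic only."""
    return (q1 * q2 + P_SCALE // 2) // P_SCALE
```

With "k" = 11 and a 1 cm dot spacing, `dot_location_cm(3, 2, 11)` gives the position (6144, 4096) in centimeters without any coordinate lookup.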
Data for DPP datasets may be taken from a number of other sources:
Given the existence of a dot dataset of "dcp" value 2 to the "m", where "m" is greater than 0, approximately three times as many data may be generated by "infilling" the dataset. Infilling consists of assigning a value to each dot which is precisely between each pair of adjacent active dots in each horizontal row, likewise for adjacent active dots in vertical columns, and in the center of each square formed by a set of four original adjacent active dots. The result is a dot dataset with a "dcp" of 2 to the "m-1". Each new active dot's "z" value comes from consideration of its neighboring dots. Each "p" value comes (a) from consideration of the "p" values of the neighboring dots, (b) from consideration of the "z" values of neighboring dots, (c) from statistical measures related to the characteristics of the theme being portrayed, and (d) from consideration of the distance between active dots of the original theme.
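The geometry of a single infilling step can be sketched in Python as follows. The theme is held as a dictionary keyed by raster index, and the rule used to estimate "z" and "p" for categorical data (majority value among the neighbors, with "p" reduced when the neighbors disagree) is a placeholder assumption, not the rule used by the DPP; it draws only on considerations (a) and (b) above. All names are hypothetical.

```python
from collections import Counter
from typing import Dict, Hashable, List, Tuple

Dot = Tuple[Hashable, float]           # (z, p)
Theme = Dict[Tuple[int, int], Dot]     # keyed by (col, row) raster index


def _estimate(neighbors: List[Dot]) -> Dot:
    """Placeholder rule: majority z; p is the mean p of the agreeing
    neighbors scaled by the fraction of neighbors that agree.
    Ties go to the value encountered first."""
    counts = Counter(z for z, _ in neighbors)
    z_best, n_agree = counts.most_common(1)[0]
    mean_p = sum(p for z, p in neighbors if z == z_best) / n_agree
    return z_best, mean_p * n_agree / len(neighbors)


def infill(theme: Theme, k: int) -> Theme:
    """Return a theme of exponent k - 1 built from a theme of exponent k."""
    if k < 1:
        raise ValueError("a theme with k = 0 cannot be infilled")
    step = 1 << k
    half = step // 2
    result = dict(theme)
    for (c, r) in theme:
        # New dots: midpoint of the horizontal pair, midpoint of the
        # vertical pair, and center of the square of four active dots.
        candidates = {
            (c + half, r): [(c, r), (c + step, r)],
            (c, r + half): [(c, r), (c, r + step)],
            (c + half, r + half): [(c, r), (c + step, r),
                                   (c, r + step), (c + step, r + step)],
        }
        for new_pos, parents in candidates.items():
            if all(q in theme for q in parents):
                result[new_pos] = _estimate([theme[q] for q in parents])
    return result
```

Applied to four active dots at the corners of one square, the sketch yields a full 3 by 3 block of nine dots; over a large area the number of new dots approaches three times the original count.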
If more intense infilling were desired (i.e., going from "k" equal to "m" to "k" equal to "m-2" or "m-3", which we might call "order 2 infilling" or "order 3 infilling"), the DPP infilling procedure described above would not be appropriate (even though using it recursively would result in a database of correct "dcp"). Rather, infilling should take place by defining new dots closest to original active dots and repeating this procedure until the proper order of infilling has been achieved.
Active dots of a new composite theme may be generated by superimposing (overlaying) two established dot datasets with identical "dcp" values. Given two dot themes, say "A" and "B", consisting of thematic or categorical data, for each active dot location the assignment of the value of "z" in the new theme (say "C") is based on some function of the corresponding "z" values of "A" and "B". This function could be as simple as the appropriate value in the Cartesian product formed by the possible values of themes "A" and "B." In general, the value of "z" in theme "C" could be determined by a high-level language program, or by a Prolog or other AI procedure.
The value of "p" for a given dot represents the joint probability of the correctness of the new value of "z"; it could be the product of the "p" values of the pair of dots in "A" and "B", or the result of a more sophisticated statistical process.
The actual resulting probability of the overlay of a pair of dots from two themes may well not simply be the product of the associated values of "p". The resulting value of "p" in the composite theme may depend on many factors, but certainly one might consider the values of "z" in each theme. If one overlays land use and land cover, and the two values are "pasture" and "grass", the probability of correctness might be increased. On the other hand, if the values are "pasture" and "pavement", the probability might be decreased. Therefore the software allows the user, for each value in the Cartesian product, to specify an integer "i" in the range [-m, m] such that, for positive "i", the probability "p" is increased above the product of the constituent probabilities; for negative "i", the probability is decreased. For "i" equal to "m", the resulting probability is one; for "i" equal to "-m", the resulting probability is zero. The function is linear.
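One reading of this adjustment is sketched below in Python. It assumes the function is linear on each side of zero: for positive "i" it interpolates from the product toward one, and for negative "i" from the product toward zero. The paper states only that the function is linear and fixes the end points, so this piecewise form, along with the function names and the dictionary of adjustments, is an assumption.

```python
from typing import Callable, Dict, Hashable, Tuple

Dot = Tuple[Hashable, float]  # (z, p)


def adjusted_p(p_a: float, p_b: float, i: int, m: int) -> float:
    """Joint probability for an overlay, shifted by the user-chosen integer i."""
    if m <= 0 or not -m <= i <= m:
        raise ValueError("need m > 0 and -m <= i <= m")
    base = p_a * p_b                    # i = 0 leaves the plain product
    if i >= 0:
        return base + (1.0 - base) * i / m   # reaches 1 at i = m
    return base + base * i / m               # reaches 0 at i = -m


def overlay_dot(a: Dot, b: Dot,
                combine: Callable[[Hashable, Hashable], Hashable],
                adjust: Dict[Tuple[Hashable, Hashable], int],
                m: int) -> Dot:
    """Overlay one pair of congruent dots from themes A and B."""
    (z_a, p_a), (z_b, p_b) = a, b
    i = adjust.get((z_a, z_b), 0)       # pairs not listed get no adjustment
    return combine(z_a, z_b), adjusted_p(p_a, p_b, i, m)
```

For example, with "m" = 4 and constituent probabilities 0.8 and 0.5, an adjustment of +2 for ("pasture", "grass") raises the result from 0.4 to 0.7, while an adjustment of -2 for ("pasture", "pavement") lowers it to 0.2.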
If the "dcp" values of two themes "A" and "B" are not equal, and overlaying is desired, one theme might be infilled. Alternatively or additionally, the other theme might be thinned, with, of course, the concomitant loss of data.
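Thinning is the simpler of the two operations. A minimal Python sketch follows, assuming a theme held as a dictionary keyed by raster index with a common origin; the function name is hypothetical.

```python
from typing import Dict, Hashable, Tuple

Dot = Tuple[Hashable, float]           # (z, p)
Theme = Dict[Tuple[int, int], Dot]     # keyed by (col, row) raster index


def thin(theme: Theme, k_target: int) -> Theme:
    """Keep only the dots that remain active at the coarser exponent k_target.

    Dots that are dropped are simply discarded, which is the loss of data
    mentioned in the text.
    """
    step = 1 << k_target
    return {(c, r): dot for (c, r), dot in theme.items()
            if c % step == 0 and r % step == 0}
```

Thinning a full 3 by 3 block of dots ("k" = 0) to "k" = 1 leaves the four corner dots, which are then congruent with any other theme of "dcp" 2.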
When using the DPP with point and line data, the active dots are simply those dots used to define the feature; the "dcp" plays no role. That is, every dot is addressable. Thus, using our assumed resolution of 1 cm, points could be delineated within 0.5 cm.
The DPP could store traditional point data by providing the coordinates of the dot closest to the reference point on the object being depicted. The "z" value would simply name the object; the "p" value might indicate spatial (x,y) precision.
Linear data could be stored by simply connecting the relevant dots. The "p" values for each dot used could again indicate spatial (x,y) precision.
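Snapping point and line features to the dot raster might look like the following Python sketch. It assumes coordinates given in centimeters relative to the raster origin and represents a line by the sequence of dots nearest its vertices; both choices, and the function names, are illustrative assumptions.

```python
import math
from typing import List, Sequence, Tuple


def snap_to_dot(x_cm: float, y_cm: float,
                dot_spacing_cm: float = 1.0) -> Tuple[int, int]:
    """Raster index of the dot closest to (x_cm, y_cm).

    With a 1 cm spacing the snapped position is within 0.5 cm of the
    original in each direction.
    """
    return (math.floor(x_cm / dot_spacing_cm + 0.5),
            math.floor(y_cm / dot_spacing_cm + 0.5))


def snap_line(vertices: Sequence[Tuple[float, float]],
              dot_spacing_cm: float = 1.0) -> List[Tuple[int, int]]:
    """Snap each vertex of a line to its nearest dot, dropping repeats
    that arise when consecutive vertices fall on the same dot."""
    dots: List[Tuple[int, int]] = []
    for x, y in vertices:
        d = snap_to_dot(x, y, dot_spacing_cm)
        if not dots or dots[-1] != d:
            dots.append(d)
    return dots
```

A "z" value naming the object and a "p" value expressing spatial precision could then be attached to each snapped dot, as described above.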
It is not hard to envision TIN data, land records data, and data from surveyors in this system.
The authors contend that the Dot-Probability Paradigm has a number of characteristics which make it attractive as a method of spatial data storage and processing.
Several issues would have to be addressed before the DPP could become a viable technique for spatial data storage and processing, although a prototype based on ArcInfo and GRID exists. Further, the DPP could lead to several interesting research projects. A partial list addressing these concerns might include:
Michael Kennedy
Department of Geography
1451 Patterson Office Tower
University of Kentucky
Lexington, KY 40506-0027
Phone: 606-257-6494
Fax: 606-323-1969
kennedy@ukcc.uky.edu