Alan Witmer

The Best of Both Worlds:
Vector Conflation of Database Segments and Attributes

Abstract

Conflation technology offers a cost-effective, efficient method of combining the best attributes of separate vector databases. Using GDT's proprietary conflation technology as a case study, this presentation will discuss conflation techniques, the challenges of matching elements between two distinct databases, the hurdles that must be overcome, and the benefits of using conflation technology. The presentation will also highlight quality control issues such as preliminary database analysis and post-conflation QC processes.


Introduction

GDT is a supplier of premier digital mapping databases for a wide variety of business and consumer applications. Our customers demand the most current and most accurate data, and we are faced with a mountain of potential update sources representing more current or more accurate spatial and attribute information. Using traditional models of data maintenance would generate a prohibitive amount of manual work, with a high chance of mistakes and inconsistency. GDT has therefore augmented our traditional map compilation activities with automated conflation to minimize update cost and to improve our responsiveness to change.

There are five steps to the conflation process:

  1. Database preparation
  2. Building a common representation
  3. Matching
  4. Improving the database
  5. Final QC

Our approach has been to modularize the software and the process for each step listed above, allowing for individualized development, tuning, independent operation, and QC. We generally refer to the three middle steps as "conflation," but the first and last steps add crucial value to the process.

Although GDT conflation is sometimes used to generate merged, new databases, our predominant concern is to bring improvements from an outside source into our nationwide core database. I'll refer to these databases as the source and target, respectively. The source database usually covers a limited geographic area, and its information must be imported while maintaining strict rules for geography, topology, and attributes.

The Challenge

Figure 1 illustrates the basic challenge of conflation. We see a view of two overlaid street centerline databases. At first glance, it seems that they represent the same area. We see a major road in each database, with the same route number and general heading. There is a development north of that road in each case, with some similarity in names and geography. Even the crook in North St below the highway (label 1) bears enough similarity to that of the unnamed road to prompt a mental match, despite the difference in detail. But there are significant issues for software: roads are more angular in one database, lengths and proportions vary significantly, and streets that should match are often not the nearest streets. The following labeled areas illustrate other common challenges:

  2. Corresponding streets meet in differing intersection configurations (like the North St/Unnamed intersection with Route 16).
  3. The names are similar, but not exact: "Alton Hgts Ln" versus "Afton Ln".
  4. Two stretches of road in one database (North St.) match to only one in the other, and the single item must be split in order to build a one-to-one relationship.
  5. The B St/Unnamed match continues further in one database than in the other, and conflation must decide how much of the more complete street should be matched.

Database Preparation

A technician will study a candidate source database and verify any accompanying accuracy claims. We ensure that it is compatible with, or convertible to, our data model and rules. If initially accepted, we decide which layers and data fields should be subject to conflation, clean the coverage, and convert similar fields into a common representation. The databases may contain differing fields or use dissimilar grammar for their contents, in which case we may define a conversion function to improve compatibility for transfer or matching purposes. For realignment candidates, we use conflation-produced topology difference reports, described later, to pinpoint significant topological differences against our data. This helps us to quickly find problems in digitizing or interpretation before final acceptance.


Build a Common Representation

We build a topological representation from the selected arcs that we want to conflate in each database. The purpose is twofold: to filter out unwanted or incompatible data, and to define what constitutes a single entity.

We remove data that we have no interest in conflating, or data that is in some way incompatible. This might include unnamed roads, items with key substrings in the name, or features of certain classifications. We may eliminate double-lane boulevards or highways whose representation differs from ours, or avoid ramps that have no routable connections.

Beyond filtering, we also decide what we mean by an entity; i.e., is it a uniformly named 1-cell, or may it change name midstream? May it be a block or place boundary at some points, but not at others? Must its feature classification remain constant along its length?

Filtering and entity definition generate a new topological model. The new 1-cells consist of one or multiple arcs in the original representation; we'll call these chains. Notice that the new 0-cells include occasional valence-2 nodes separating chains whose merger would not meet the criteria for a single entity.
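As an illustration, chain building can be sketched as a union of arcs across qualifying valence-2 nodes. This is a minimal Python sketch, not GDT's implementation; it assumes each arc carries only a name attribute and uses identical names as the single-entity criterion:

```python
from collections import defaultdict

def build_chains(arcs):
    """Merge arcs into chains across valence-2 nodes whose incident
    arcs meet the single-entity criterion (here: identical names).
    Each arc is (start_node, end_node, name); returns groups of arc
    indices, one group per chain."""
    incident = defaultdict(list)
    for i, (a, b, _) in enumerate(arcs):
        incident[a].append(i)
        incident[b].append(i)

    # Union-find over arc indices.
    parent = list(range(len(arcs)))
    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    # Merge the two arcs at each valence-2 node if their names agree;
    # otherwise the node remains as a chain-separating 0-cell.
    for ids in incident.values():
        if len(ids) == 2 and arcs[ids[0]][2] == arcs[ids[1]][2]:
            parent[find(ids[0])] = find(ids[1])

    chains = defaultdict(list)
    for i in range(len(arcs)):
        chains[find(i)].append(i)
    return sorted(chains.values())

# n1 is valence-2 with matching names, so its arcs merge; n2 is also
# valence-2, but the name changes there, so the chains stay separate.
arcs = [("n0", "n1", "North St"),
        ("n1", "n2", "North St"),
        ("n2", "n3", "B St")]
```

Here the valence-2 node n2 survives as a 0-cell precisely because merging across it would violate the uniform-name criterion.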

We also build additional information at this time. In particular, the software locates and marks double-lane roads per user-supplied criteria. It assigns a directionality flag to indicate on which side the counterpart is found. This prevents ambiguity later, eliminating the possibility that the wrong lanes will be matched.
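The directionality flag can be derived from simple vector geometry. A minimal sketch, assuming straight lane segments; the function and its signature are hypothetical, and the production software first detects double-lane pairs per user-supplied criteria:

```python
def lane_side(lane, counterpart_point):
    """Return +1 if counterpart_point lies to the left of the directed
    lane segment, -1 if it lies to the right, using the sign of the
    2D cross product of the lane direction and the offset vector."""
    (x1, y1), (x2, y2) = lane
    px, py = counterpart_point
    cross = (x2 - x1) * (py - y1) - (y2 - y1) * (px - x1)
    return 1 if cross > 0 else -1
```

For an eastbound segment, a counterpart to the north yields +1 (left) and one to the south yields -1 (right), so each lane records unambiguously where its mate lies.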


Matching

We can now match corresponding features from the two geographic databases. As shown in Figure 1, this results in a set of one-to-one, one-to-partial (e.g., label 5, B Street), and one-to-many (e.g., label 4, North Street) associations between the chains that were created in the previous step.

Matching, or correlation, seems a highly subjective exercise; any two people may arrive at different, equally valid match conclusions for an area, depending on the internalized stochastic model they apply to the task. Why? Subjectivity arises from the fact that corresponding line work may contain many geometric, topological, and attribute properties to be used as match criteria, none of which may match completely for a corresponding feature pair. Corresponding features may be spatially distant; they may be incident at differing nodes (as in Figure 1, label 2); names may be completely different, or attributes may be missing entirely. Furthermore, when there is different topology or feature completeness, we may need to match a mere portion of one object to another. Exactly where a chain should be split is subjective: we may decide based on a certain similarity in shape, because a certain point appears similar to its mate's endpoint relative to nearby features, or by proportion based on the relative sizes of the features the chain is to match.

How, then, do we build a reliable correlator?

Automated matching of 1-cells always begins with node matching. Nodes are the confluence of a great deal of information, and are thus the places where pivotal matches can be assured. Like most other conflation software developers, we use iterative matching: choosing the strongest node matches in an early pass, then conceptually rubber sheeting and re-matching in repeated passes as long as new matches are found. Following node match, we use the matched nodes as a guide to matching arc chains; however, those matches are stochastic in nature as well. Chains running between matched nodes are not rubber-stamped as matches, and those running from matched nodes to unmatched areas may be split, or even matched whole if doing so meets the user's match strength criteria.
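The iterative pass structure might be sketched as follows. This is a simplified illustration, not the production algorithm: `score` is a hypothetical callback whose values may improve as accepted matches accumulate (standing in for conceptual rubber sheeting), and matching is greedy best-first within each pass:

```python
def iterative_node_match(src_nodes, tgt_nodes, score, threshold):
    """Greedy iterative node matching sketch. On each pass, score all
    unmatched candidate pairs, keep those above threshold, and accept
    them strongest-first; repeat while new matches appear."""
    matches = {}
    while True:
        candidates = [(score(a, b, matches), a, b)
                      for a in src_nodes if a not in matches
                      for b in tgt_nodes if b not in matches.values()]
        candidates = [c for c in candidates if c[0] >= threshold]
        if not candidates:
            return matches
        # Accept best-first, skipping nodes consumed earlier this pass.
        for s, a, b in sorted(candidates, reverse=True):
            if a not in matches and b not in matches.values():
                matches[a] = b

# Illustrative fixed scores; a real callback would rescore pairs as
# the accepted-match set (and hence the rubber sheet) evolves.
scores = {("a", "x"): 5, ("a", "y"): 1, ("b", "x"): 1, ("b", "y"): 4}
result = iterative_node_match(["a", "b"], ["x", "y"],
                              lambda a, b, m: scores[(a, b)], 2)
```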

The choice of criteria, their weights, and the summation methodology is crucial to good matching.


Matching: A Summation Methodology

A distance model is commonly used in conflation. This involves computing offset values for each criterion and summing them to arrive at a final "distance" for any potential match. It has the advantages of computational simplicity and ease of understanding.

Instead of offset values, I suggest that a stochastic variable be used directly. The unit of measurement is a ratio: the probability that a matched pair has a given characteristic divided by the probability that a random candidate pair has that characteristic. We use this for both discrete criteria, such as the existence of a boundary flag, and continuous criteria such as distance. In the latter case we generate continuous functions to represent the numerator and denominator.

To calculate the likelihood that any pair is a match, we multiply together each criterion's stochastic value for that pair. This treatment directly represents a model that assumes several independent, or orthogonal, statistical values. Of course, it is easier to process and debug if we work in logarithmic space, in which case we add the values' logarithms. It may sound as if we've come full circle to adding distances, but there is an important distinction: where distances have only positive values, each stochastic value may decrease or increase the accumulated probability of a match. Thus, in this model, a single criterion showing a poor match value may be more than negated by one or more other criteria with high match values, resulting in a meaningful match. When the stochastic model is sufficiently accurate and based on independent variables, we have the added advantage of describing the confidence of a match with statistical qualifiers: "we're over 99.9% confident of this match."
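A sketch of the log-space summation, with illustrative likelihood ratios only:

```python
import math

def match_log_likelihood(ratios):
    """Sum the logs of per-criterion likelihood ratios. Each ratio is
    P(observation | matched pair) / P(observation | random pair):
    values above 1 raise the accumulated match likelihood, values
    below 1 lower it."""
    return sum(math.log(r) for r in ratios)

# A poor distance ratio (0.5) is more than negated by strong name
# (20x) and topology (4x) evidence; the total stays well above 0.
score = match_log_likelihood([0.5, 20.0, 4.0])
```

A pure distance sum could never recover from the poor distance term this way, since every summand only pushes the total further from a match.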

Finally, perhaps the most important advantage of the stochastic model is that it lends itself directly to parameter tuning via empirical methods. That is, you can derive the parameters for a given criterion directly by sampling known matches and candidates to produce the two needed probability measures.

Matching Criteria

To match 0-cells or nodes, rubber-sheeted proximity is a primary match criterion. After accounting for rubber-sheet virtual repositioning, we use a probability function to determine the likelihood of a matched node at a given distance, and divide by the node density near the candidate to arrive at a very meaningful assessment. Figure 2 gives a sample of this function, as it would look if we assume a normal distribution for node offsets.
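A one-dimensional sketch of this criterion, assuming the normal offset distribution of Figure 2 and a locally uniform node density; all parameter values are illustrative:

```python
import math

def proximity_ratio(offset, sigma, node_density):
    """Likelihood ratio for node proximity: the probability density of
    a true-match offset (assumed normally distributed) divided by the
    chance of encountering a random node nearby, approximated here by
    a local node density."""
    p_matched = (math.exp(-offset ** 2 / (2 * sigma ** 2))
                 / (sigma * math.sqrt(2 * math.pi)))
    return p_matched / node_density
```

Dividing by the local density is what makes the measure meaningful: a 30 m offset is strong evidence in sparse countryside but weak evidence in a dense downtown grid.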

The second and stronger 0-cell criterion is the likelihood of match in the incident arc chains. Our fundamental concept works like this (though optimized considerably): compute match probabilities on the entire Cartesian product of all incident chains, noting all chain pairs whose match value exceeds a given match threshold. Process this list in order from the strongest matches to the weakest; each distinct match pair contributes a certain proportion of its match strength to the node's match criteria, while each incident chain not matched in this way results in a user-defined penalty.
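The incident-chain criterion might be sketched as a greedy, strongest-first consumption of the scored Cartesian product. The function name, weight, and penalty parameters here are hypothetical stand-ins for the user-defined values:

```python
def node_chain_criterion(pair_scores, src_chains, tgt_chains,
                         threshold, weight, penalty):
    """Score a node-pair by its incident chains: score every chain
    pair (the Cartesian product), keep pairs above the threshold,
    consume them strongest-first so each chain matches at most once,
    and penalize every chain left unmatched."""
    pairs = sorted(((s, sc, tc) for (sc, tc), s in pair_scores.items()
                    if s >= threshold), reverse=True)
    used_s, used_t, total = set(), set(), 0.0
    for s, sc, tc in pairs:
        if sc not in used_s and tc not in used_t:
            used_s.add(sc); used_t.add(tc)
            total += weight * s
    unmatched = ((len(src_chains) - len(used_s))
                 + (len(tgt_chains) - len(used_t)))
    return total - penalty * unmatched

# Illustrative chain-pair match strengths at one candidate node pair.
pair_scores = {("s1", "t1"): 0.9, ("s1", "t2"): 0.2,
               ("s2", "t1"): 0.1, ("s2", "t2"): 0.8}
value = node_chain_criterion(pair_scores, ["s1", "s2"], ["t1", "t2"],
                             threshold=0.5, weight=1.0, penalty=0.3)
```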

1-cell match criteria can take all available attributes into account: name, feature classification and weight, boundary coding, double-lane direction flags, etc. Whenever an attribute value is more detailed than a simple flag, we use continuous or fuzzy logic to generate a spectrum of values, rather than a simple matched/unmatched decision.

The name criterion is an excellent example of a graduated technique. Even postal-standardized names will differ due to misspellings, differences in word breakup or hyphenation, or in the completeness of the name itself. "Country Club Estates Rd" versus "Estates Rd" should contribute favorably to match likelihood. We use a pattern matcher to take advantage of naming similarities. Numeric names also represent a potential match snag, and require a parser to produce integers and fractions. Abbreviations must be decoded: "Fifth St" matches "5th St", and "CORD 15.25" matches "County Road 15 1/4". Conversely, a small numeric difference, even a fractional one, is significant despite a high pattern match: "Forest Service Rd 1702" is a far less likely match for "Forest Service Rd 1703" than pattern match alone would indicate. We have tuned these tools to be efficient enough for heavy use during the node match stage.
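A toy version of the graduated name comparison, using Python's difflib as the pattern matcher. The alias table is deliberately tiny and the numeric veto is simplified; the production parser also handles fractions such as "15 1/4":

```python
import re
from difflib import SequenceMatcher

# Tiny normalization table; the real decoder covers many abbreviations.
ALIASES = {"fifth": "5th", "cord": "county road"}

def name_similarity(a, b):
    """Graduated name score in [0, 1]: decode known aliases, let a
    numeric-token mismatch veto the match outright, and otherwise
    score the strings with a sequence matcher."""
    def norm(name):
        name = name.lower()
        for k, v in ALIASES.items():
            name = name.replace(k, v)
        return name
    a, b = norm(a), norm(b)
    if re.findall(r"\d+", a) and re.findall(r"\d+", b) \
            and re.findall(r"\d+", a) != re.findall(r"\d+", b):
        return 0.0  # "Rd 1702" vs "Rd 1703": the numbers veto it
    return SequenceMatcher(None, a, b).ratio()
```

Partial names like "Country Club Estates Rd" versus "Estates Rd" still score well above zero, while near-identical numeric names score zero, which is the graduated behavior described above.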

Geospatial and topological criteria are valuable, and they are the sole information available when matching naked vectors. We use overall orientation, overall curve matching, a strong shape matcher, and neighborhood node topology as criteria to achieve confident results with all vector sources.


Improve the Database

With match information, we can identify and import new geography, update attributes or import new ones, realign the database, or generate a rubber sheet mapping for use in realigning associated data.

Conflation software can generate important reports, too. Our realignment software can be run in report-only mode, to generate important information before undergoing realignment or for use while qualifying incoming GIS. These include summary reports and edit lists: geographically keyed information describing the topological, geometric, and attribute differences between the two data sets.

Realignment relocates all unambiguously matched, topologically clean arcs. It maintains consistent topology by using a constrained Delaunay triangulation to rubber sheet all unmatched items into an analogous relative position in the new geography. However, we have had to overcome several hurdles: individual triangles in the raw model may be inverted, producing crossing lines or inverted polygons. We have also modified the basic model to reduce or eliminate distortion in unmatched neighboring features. Finally, the model is constrained to gently sew realigned features into surrounding, unaligned features, and to not move matched arcs that are tagged as coming from a more accurate source. The software saves the final warping model to file, allowing us to use it at any time for our own purposes or to help customers relocate facilities and routes.
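The core of triangulation-based rubber sheeting is a per-triangle affine map via barycentric coordinates. This is a minimal sketch of that single step; building the constrained Delaunay triangulation and guarding against inverted triangles, as discussed above, are omitted:

```python
def warp_point(p, src_tri, dst_tri):
    """Map point p through one triangle of a rubber-sheet model:
    compute its barycentric coordinates in the source triangle and
    re-emit the same coordinates in the destination triangle."""
    (x1, y1), (x2, y2), (x3, y3) = src_tri
    px, py = p
    det = (y2 - y3) * (x1 - x3) + (x3 - x2) * (y1 - y3)
    l1 = ((y2 - y3) * (px - x3) + (x3 - x2) * (py - y3)) / det
    l2 = ((y3 - y1) * (px - x3) + (x1 - x3) * (py - y3)) / det
    l3 = 1.0 - l1 - l2
    (u1, v1), (u2, v2), (u3, v3) = dst_tri
    return (l1 * u1 + l2 * u2 + l3 * u3,
            l1 * v1 + l2 * v2 + l3 * v3)

# A point inside a unit triangle, warped into a triangle scaled by 2.
moved = warp_point((0.25, 0.25),
                   ((0, 0), (1, 0), (0, 1)),
                   ((0, 0), (2, 0), (0, 2)))
```

Because each unmatched vertex keeps its barycentric position, features inside a triangle retain their relative placement; a negative `det` would signal the inverted-triangle case the text describes.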


Final QC

Quality assurance reports are generated at various stages by the conflation software, but final QC is essential. It is important to detect crossing lines and other topological anomalies, to find inverted polygons so that off-line polygonal information can be reliably re-connected, to find street chain naming and address anomalies, and more. Carefully tuned and monitored conflation processes minimize the work in this step.


Conclusions

Conflation is a software development challenge, drawing heavily on mathematical and statistical disciplines, as well as expert systems, geography, and GIS. There is not a large body of expertise or a well-trodden path to successful conflation. And as with any en masse automated database modification technique, it must be carefully written, minutely tunable, and closely monitored to ensure the quality of the updates. We consider conflation software to be only one component of a larger update methodology, involving consistent quality control of both incoming resources and of the manual and automated processes used to acquire that information.

Drawing on the techniques outlined here, GDT has successfully employed a variety of automated conflation processes over the past two years. We take advantage of our conflation technology in roughly a dozen distinct applications, including TIGER import, realignment to attributed or naked-vector GIS, production of hybrid products, polygonal attribute importing, and conflation services to adjust GDT or customer data to any adjusted street centerline base. Conflation-generated reports help us qualify candidate data, provide accuracy and displacement information, and point out local areas for manual enhancement or research. Conflation has proven to be a cost-effective and timely way to improve our database and our products, and to serve our customers' needs.


Author Information

Alan Witmer
Senior Software Engineer
Geographic Data Technology
11 Lafayette St, Lebanon, NH, USA 03766
Telephone: (603) 643-0330

E-mail: alan_witmer@gdt1.com