Stanley L. Dallal

Conflation of TIGER with DLG

An excellent source for geographic information is the TIGER files generated by the U.S. Census Bureau.  With the completion of TIGER 2000, map producers will have an enormous amount of information available to update and enhance their data sets. The process of transferring the TIGER attribution onto their data sets, known as conflation, often involves a high degree of costly manual work to identify and update corresponding features.  ESEA has developed an automated conflation system that has been used successfully to conflate U.S. Census TIGER data to a variety of data sets that have good positional accuracy but little or no attribution.  This paper describes some of the issues that arise when conflating TIGER data with USGS DLG data.

Background

Two excellent low-cost sources of geographic information are U.S. Geological Survey (USGS) Digital Line Graph (DLG) data and U.S. Census Bureau TIGER line files.  In general, the TIGER data has richer attribution but poorer positional accuracy than the DLG data.  This is especially true for regions where the TIGER data was generated during the 1980 census. The need to conflate TIGER attribution to DLG arises for applications that require both DLG positional accuracy and TIGER attribution (such as TIGER address ranges).

Conflation is generally a costly and time-consuming task that involves matching corresponding arcs between two data sets and then transferring each arc’s attributes from one data set to the other. ESEA has developed a stand-alone automated conflation system that dramatically reduces the time required to perform conflation.  The ESEA Conflation System (ECS) accepts two shape files as input, conflates them, and outputs a new shape file containing the conflated data.

In practice, ESEA performs a conflation project with the use of ArcInfo and ECS.  The GIS is used to process data prior to and after automated conflation.  At ESEA, pre- and post-automated conflation processing is accomplished with ArcInfo procedures written in AML.  This paper describes a case study in which ESEA used ArcInfo and the ECS to transfer road name and address range attributes from TIGER data to DLG data for portions of New Mexico. 

Pre-Conflation Processing

ArcInfo 7.1 was used to exclude certain types of TIGER road features that were not of interest for this conflation project.  The purpose of this project was to transfer road names and address ranges from TIGER features to DLG. TIGER road features are classified according to the CFCC attribute.  TIGER road features that do not contain address ranges, such as limited access divided highways, four-wheel trails, driveways and service roads, were eligible for exclusion.

While it is desirable to eliminate extraneous TIGER road features to reduce ECS run-time, care must be taken not to remove features that are essential for maintaining similarity between the DLG and TIGER road networks.  This is because ECS performs optimally when the two data sets being conflated are similar. A higher degree of similarity results in a more accurate transfer of attributes.  Eliminating features that are necessary for maintaining TIGER network connectivity will cause TIGER data to differ from DLG and result in less accurate ECS results.  After careful review, it was decided that only TIGER road features with CFCC A74 could be eliminated.  These features correspond to driveways or service roads.
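The exclusion step can be illustrated with a minimal Python sketch. The project itself did this work in ArcInfo with AML, so this is not the authors' procedure. The record layout (a dict with a "cfcc" key) and the function name are assumptions made for the example.

```python
# Hypothetical record layout: each TIGER arc is a dict with a "cfcc" key.
EXCLUDED_CFCC = {"A74"}  # driveways or service roads, per the text above


def drop_excluded(arcs, excluded=EXCLUDED_CFCC):
    """Return only the arcs whose CFCC code is not in the excluded set."""
    return [arc for arc in arcs if arc["cfcc"] not in excluded]


# Made-up sample arcs for illustration only.
arcs = [
    {"id": 1, "cfcc": "A41", "name": "Oak St"},
    {"id": 2, "cfcc": "A74", "name": ""},
    {"id": 3, "cfcc": "A31", "name": "Main St"},
]
kept = drop_excluded(arcs)
```

Keeping the excluded codes in one small set makes it easy to review exactly which feature classes are dropped before conflation.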

Removing the TIGER road features with CFCC A74 has the side effect of leaving extra nodes in the data at junctions where a TIGER arc with CFCC A74 had previously crossed another TIGER arc.  These nodes should be removed because they often break up an arc’s address range.  For example, an arc with address range 0 to 100 might be divided into one arc with address range 0 to 50 and another arc with address range 52 to 100.  An AML was written to remove the degree 2 nodes that split arcs with the same road name. The AML unsplit the arcs and combined their address ranges.
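The unsplit operation can be sketched in Python as follows. This is an illustration of the idea rather than the AML that was actually used. It assumes a hypothetical arc record with from_node, to_node, name, addr_from and addr_to fields, a single address range per arc, and arcs digitized in a consistent direction.

```python
from collections import defaultdict


def unsplit(arcs):
    """Merge arcs that meet end-to-start at a degree 2 node and share a road name."""
    arcs = [dict(a) for a in arcs]  # work on copies
    merged = True
    while merged:
        merged = False
        # Count how many arc ends touch each node.
        degree = defaultdict(int)
        for a in arcs:
            degree[a["from_node"]] += 1
            degree[a["to_node"]] += 1
        for first in arcs:
            for second in arcs:
                if (first is not second
                        and first["to_node"] == second["from_node"]
                        and degree[first["to_node"]] == 2
                        and first["name"] == second["name"]):
                    # Extend the first arc over the second and combine the ranges.
                    first["to_node"] = second["to_node"]
                    first["addr_to"] = second["addr_to"]
                    arcs.remove(second)
                    merged = True
                    break
            if merged:
                break
    return arcs


# Made-up sample: node 2 splits Oak St; node 3 joins two different roads.
sample = [
    {"from_node": 1, "to_node": 2, "name": "Oak St", "addr_from": 0, "addr_to": 50},
    {"from_node": 2, "to_node": 3, "name": "Oak St", "addr_from": 52, "addr_to": 100},
    {"from_node": 3, "to_node": 4, "name": "Elm St", "addr_from": 1, "addr_to": 99},
]
result = unsplit(sample)
```

In this sample the two Oak St arcs become one arc with address range 0 to 100, while the Elm St arc is left alone because the road name changes at node 3.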

ESEA has learned from experience with ECS that it becomes cumbersome to conflate more than about 50,000 arcs at a time.  Conflation with ECS is a cyclical process of setting conflation parameters, performing automatic matching and manually correcting the results.  As the number of arcs increases, automatic matching takes longer and it becomes more difficult to keep track of which regions have already been reviewed and corrected.  For these reasons, ESEA breaks up the data into individual tiles containing no more than about 50,000 arcs.

TIGER data for each county in New Mexico were combined into one large coverage that matched the extent of the DLG data.  There are more than 400,000 TIGER arcs in New Mexico.  A 4 by 4 grid was calculated to divide the New Mexico data into 16 tiles. Data tiling causes complications with respect to conflation along the tile boundaries. Arcs near the edge of a tile may match arcs in an adjacent tile.  To help remedy this situation, the tiles were chosen so that they have a slight overlap. The coordinates bounding each tile were saved into a file that ECS uses when reading in data.
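The tile layout can be sketched in Python as a regular grid whose cells are each enlarged by a small overlap. The extent coordinates and the overlap value below are placeholders, not the values used in the project, and the format of the bounds file read by ECS is not reproduced here.

```python
def make_tiles(xmin, ymin, xmax, ymax, rows=4, cols=4, overlap=0.0):
    """Return (xmin, ymin, xmax, ymax) bounds for each tile, row by row."""
    width = (xmax - xmin) / cols
    height = (ymax - ymin) / rows
    tiles = []
    for r in range(rows):
        for c in range(cols):
            tiles.append((
                xmin + c * width - overlap,
                ymin + r * height - overlap,
                xmin + (c + 1) * width + overlap,
                ymin + (r + 1) * height + overlap,
            ))
    return tiles


# Placeholder extent of 400 by 400 units with a 5 unit overlap on every side.
tiles = make_tiles(0, 0, 400, 400, overlap=5)
```

With roughly 400,000 arcs spread over 16 tiles, the average tile holds about 25,000 arcs, which leaves headroom under the 50,000 arc limit for tiles that cover denser areas.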

The Conflation Process

ECS performs conflation on two coverages at a time. One of the coverages is identified to be more spatially accurate.  This coverage is referred to as the base coverage. The geometry of the base coverage is anchored and is not modified during the conflation process. The other, less spatially accurate coverage is referred to as the non-base coverage. The non-base geometry is transformed via rubber-sheeting to match the base geometry during conflation. In the project discussed in this paper, the base coverage was derived from DLG and the non-base coverage was derived from TIGER.

The conflation process is carried out with ECS in three steps: node matching, line (arc) matching, and feature merging. Each of these steps is discussed in turn below.

Node matching is performed to create rubber-sheeting transformations and to match node features.  Distance, topological and attribution measures are used for matching nodes.

In searching for candidate base and non-base node matches, ECS only considers node pairs within an operator-specified match distance.  All base and non-base node pairs separated by more than this match distance are excluded as node match candidates.
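The distance filter can be sketched in Python as a simple pairwise check. The internals of ECS are not described in this paper, so the function below only illustrates the stated rule. It uses a brute-force loop where a production system would more likely use a spatial index.

```python
import math


def candidate_pairs(base_nodes, nonbase_nodes, match_distance):
    """List (base_id, nonbase_id, distance) for node pairs within match_distance."""
    pairs = []
    for b_id, (bx, by) in base_nodes.items():
        for n_id, (nx, ny) in nonbase_nodes.items():
            d = math.hypot(bx - nx, by - ny)
            if d <= match_distance:  # pairs farther apart are excluded
                pairs.append((b_id, n_id, d))
    return pairs


# Made-up coordinates for illustration.
base = {"b1": (0, 0), "b2": (100, 0)}
nonbase = {"n1": (3, 4), "n2": (100, 50)}
pairs = candidate_pairs(base, nonbase, match_distance=10)
```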

Along with the distance between candidate matching nodes, the number and distribution of lines that meet at the node is also an important measure of similarity.  The ECS operator can specify the relative importance of this type of matching.
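One possible way to score this kind of topological similarity is sketched below in Python. The paper does not give the measure ECS uses, so the scoring rule, the tolerance parameter and the greedy pairing are all assumptions. Each node is described by the compass bearings, in degrees, of the lines that meet there.

```python
def angular_difference(a, b):
    """Smallest absolute difference between two bearings, in degrees (0 to 180)."""
    diff = abs(a - b) % 360.0
    return min(diff, 360.0 - diff)


def topology_score(base_bearings, nonbase_bearings, tolerance=30.0):
    """Score in [0, 1]: share of incident lines that pair up within tolerance."""
    if not base_bearings and not nonbase_bearings:
        return 1.0
    remaining = list(nonbase_bearings)
    paired = 0
    for bearing in base_bearings:
        # Greedy choice: take the closest unused non-base bearing.
        best = min(remaining,
                   key=lambda r: angular_difference(bearing, r),
                   default=None)
        if best is not None and angular_difference(bearing, best) <= tolerance:
            remaining.remove(best)
            paired += 1
    return paired / max(len(base_bearings), len(nonbase_bearings))


four_way = topology_score([0, 90, 180, 270], [5, 95, 185, 265])
missing_leg = topology_score([0, 90, 180], [0, 180])
rotated = topology_score([0, 180], [90, 270])
```

Dividing by the larger of the two line counts penalizes a node pair whose line counts differ, so both the number and the distribution of the lines affect the score.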

ECS can use a comparison of attribute values associated with line or node features as an additional node match measure. If arcs are present, the match is performed on attributes of the arcs incident at the nodes being compared. This match measure is effective at finding unique intersections of attributed arcs.  For instance, it will match the intersection of Oak Street and Main Street between two road coverages.  When appropriate attributes exist in the two maps, such as street names, this measure can clearly generate very high confidence node matches.
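An attribute measure of this kind might be sketched in Python by comparing the sets of street names on the arcs incident at each node. The Jaccard overlap and the simple normalization used here are illustrative choices, not the measure documented for ECS.

```python
def name_match_score(base_names, nonbase_names):
    """Jaccard overlap between the normalized street-name sets at two nodes."""
    a = {n.strip().upper() for n in base_names if n.strip()}
    b = {n.strip().upper() for n in nonbase_names if n.strip()}
    if not a or not b:
        return 0.0  # no attribution on one side means no evidence
    return len(a & b) / len(a | b)


same_corner = name_match_score(["Oak Street", "Main Street"],
                               ["OAK STREET", "main street"])
other_corner = name_match_score(["Oak Street", "Main Street"],
                                ["Oak Street", "Elm Street"])
```

A score of 1.0 corresponds to the Oak Street and Main Street example above, where the same pair of names appears at the node in both coverages.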

The matched node pairs are used to generate a rubber-sheeting transformation that brings the non-base coverage into better alignment with the base coverage.  Rubber-sheeting and node matching proceed iteratively.  Each iteration produces a new transformation that brings the coverages into better alignment, possibly causing some nodes that did not match previously to match and become anchor points for a new rubber-sheeting transformation. The rubber-sheeting and node matching proceed until no new node matches are found.
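The iterative loop can be sketched in Python as follows. The rubber-sheeting method used by ECS is not specified in this paper, so an inverse-distance-weighted shift stands in for it here, and node matching is reduced to a nearest-neighbor test within the match distance. All function names are hypothetical.

```python
import math


def nearest_matches(base, moved, match_distance):
    """Pair each non-base node with its nearest base node if within match_distance."""
    matches = {}
    for n_id, (nx, ny) in moved.items():
        b_id, d = min(
            ((b, math.hypot(bx - nx, by - ny)) for b, (bx, by) in base.items()),
            key=lambda item: item[1],
        )
        if d <= match_distance:
            matches[n_id] = b_id
    return matches


def rubber_sheet(point, anchors):
    """Shift a point by the inverse-distance-weighted mean of anchor displacements."""
    x, y = point
    wsum = dx = dy = 0.0
    for (sx, sy), (tx, ty) in anchors:
        d2 = (x - sx) ** 2 + (y - sy) ** 2
        if d2 == 0.0:
            return (tx, ty)  # the point is itself an anchor
        w = 1.0 / d2
        wsum += w
        dx += w * (tx - sx)
        dy += w * (ty - sy)
    return (x + dx / wsum, y + dy / wsum) if wsum else (x, y)


def iterate(base, nonbase, match_distance, max_rounds=10):
    """Alternate node matching and rubber-sheeting until no new matches appear."""
    matches, moved = {}, dict(nonbase)
    for _ in range(max_rounds):
        new = nearest_matches(base, moved, match_distance)
        if len(new) <= len(matches):
            break  # no new node matches were found
        matches = new
        anchors = [(nonbase[n], base[b]) for n, b in matches.items()]
        moved = {n: rubber_sheet(p, anchors) for n, p in nonbase.items()}
    return matches, moved


# Made-up nodes: n3 starts 15 units from b3, outside the match distance of 10.
base = {"b1": (0, 0), "b2": (100, 0), "b3": (200, 0)}
nonbase = {"n1": (8, 0), "n2": (108, 0), "n3": (215, 0)}
matches, moved = iterate(base, nonbase, match_distance=10)
```

In this sample the first round matches only n1 and n2. The resulting transformation pulls n3 to within about 7 units of b3, so the second round picks it up as a new anchor point.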

The operator may view the node matches to determine if the anchor points found are sufficient for building a good rubber-sheeting transformation.  The operator can also choose to manually add or remove individual node matches. The operator may then relax the node match criteria (for instance, the match distance) and start a new match iteration to find more anchor points.  Alternatively, an iteration may be redone with stricter criteria if poorly qualified anchor points were found.

Line matching proceeds once node matching has been completed to the operator’s satisfaction.  For each line to be matched a region is considered within an operator-specified distance from the line.  A path in the other map is considered for matching if it lies within this distance.  In addition to geometry, line matching can also use attribute information to help identify matches.
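The geometric part of this test can be sketched in Python as a corridor check. The sketch only tests the vertices of the candidate path against the line, which is an approximation of a full buffer test, and it is not taken from ECS.

```python
import math


def point_segment_distance(p, a, b):
    """Distance from point p to the segment from a to b."""
    (px, py), (ax, ay), (bx, by) = p, a, b
    dx, dy = bx - ax, by - ay
    length2 = dx * dx + dy * dy
    if length2 == 0:
        return math.hypot(px - ax, py - ay)
    # Clamp the projection parameter so the nearest point stays on the segment.
    t = max(0.0, min(1.0, ((px - ax) * dx + (py - ay) * dy) / length2))
    return math.hypot(px - (ax + t * dx), py - (ay + t * dy))


def point_line_distance(p, line):
    """Distance from point p to the nearest segment of a polyline."""
    return min(point_segment_distance(p, a, b) for a, b in zip(line, line[1:]))


def within_corridor(candidate, line, distance):
    """True if every vertex of the candidate path lies within distance of line."""
    return all(point_line_distance(p, line) <= distance for p in candidate)


# Made-up geometry for illustration.
line = [(0, 0), (100, 0)]
close_path = [(0, 5), (50, 6), (100, 4)]
far_path = [(0, 5), (50, 30)]
```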

Figure 1 shows a subregion of a Georgia Department of Transportation (DOT) base map in yellow and a TIGER non-base map in orange.  The TIGER features are offset from the DOT features in different directions and by varying amounts throughout the road network.

 

Figure 1: Line Match Conditions

Often a single arc in one map does not correspond to a single arc in the other map.  Three of these cases are highlighted in Figure 1.  Towards the top of the figure, two DOT arcs match a single TIGER arc (a two-to-one match).  Likewise, there are two one-to-two matches highlighted where one DOT arc matches two TIGER arcs.  These must be converted into one-to-one matches in order for the TIGER attributes to be transferred across to the DOT data.  In these situations ECS automatically adds a node to the single arc, creating two one-to-one matches.  Partial and many-to-many line matches are treated similarly.
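Adding a node to the single arc can be sketched in Python as splitting a polyline at the point closest to the node that separates the two arcs in the other map. How ECS chooses the split location is not stated in this paper, so the nearest-point rule is an assumption. The sketch does not handle a split that falls exactly on an end of the line.

```python
def split_at_nearest_point(line, node):
    """Split a polyline at the point on it closest to node; return both halves."""
    nx, ny = node
    best = None  # (squared distance, segment index, projected point)
    for i, ((ax, ay), (bx, by)) in enumerate(zip(line, line[1:])):
        dx, dy = bx - ax, by - ay
        length2 = dx * dx + dy * dy
        if length2 == 0:
            t = 0.0
        else:
            t = max(0.0, min(1.0, ((nx - ax) * dx + (ny - ay) * dy) / length2))
        px, py = ax + t * dx, ay + t * dy
        d2 = (nx - px) ** 2 + (ny - py) ** 2
        if best is None or d2 < best[0]:
            best = (d2, i, (px, py))
    _, i, split = best
    first = list(line[: i + 1])
    if first[-1] != split:
        first.append(split)  # avoid a duplicate vertex at the split point
    second = list(line[i + 1:])
    if second[0] != split:
        second.insert(0, split)
    return first, second


# Made-up example: one base arc, with the other map's shared node near its middle.
first, second = split_at_nearest_point([(0, 0), (100, 0)], (50, 7))
```

Each half can then be matched one-to-one with the corresponding arc in the other map.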

ECS allows the operator to visually inspect the results of automatic node and line matching.  The ECS Conflation Display Console is shown in Figure 2.  There are buttons to control whether point matches or line matches are displayed.  A number of selections enable the operator to edit manually in order to correct and augment the results of automatic matching.  These selections allow the operator to add and delete point matches, add and delete line matches, and split and unsplit line features.

 

Figure 2: ECS Conflation Console

In general, the automatic conflation match results will be more accurate when the base and non-base maps are similar. Figures 3 and 4 show the results of automatic node matching for two regions in New Mexico.

 

Figure 3: Automatic Match Results for a Complex Region

 

Figure 4: Line Match Conditions for an Easy Match Region

In Figures 3 and 4, yellow represents DLG, orange represents TIGER, and the blue and green links connecting the TIGER and DLG arcs represent node matches automatically found by ECS.  In Figure 3, some of the matches may need to be corrected while in Figure 4 all of the correct matches have been automatically found.

The next step in the ECS conflation process is feature merging, which consists of selecting the desired features and attributes to include in the target conflated data set.  Target features can be any combination of matched features, unmatched base coverage features and unmatched non-base features.  Any combination of base and non-base attributes can be transferred to the target coverage. 

The road network shown in Figure 1 contains a number of features that have no corresponding feature in the other map. A few of these are highlighted as either “unmatched base” or “unmatched non-base.”  In a manual conflation, special care must be taken in merging the unmatched non-base features to avoid making them unconnected roads in the target map.  To handle this case, ECS automatically adjusts these features to connect them to the appropriate base feature.  In this way, the correct road network connectivity is generated in the target map.

Post-Conflation Processing

ECS adds a buffer region to each tile to ensure correct matching along tile boundaries.  This buffer region is removed by ECS during feature merging. ECS then writes the conflated data into shape files.  There is one shape file for each tile conflated.  ArcInfo 7.1 is used to combine the tiles into a single coverage. Arcs that had crossed tile boundaries will have a node where the boundary had been.  An AML is used to remove these degree 2 nodes and combine the address ranges on the adjacent arcs. Another AML can be used to detect and correct unmatched TIGER arcs that have been broken along tile boundaries.

Quality assurance is then performed with an AML that scans the data for various conditions that suggest possible conflation errors.  Specifically, the AML adds an attribute to arcs that are involved in any of the following situations: a gap occurring within a sequence of arcs with the same road name, an address range that has reversed direction or changed from even to odd, or an intersection where three or more of the roads coming out of the intersection have the same name.  The conflated coverage can be compared with the original data to determine whether the flagged conditions were introduced by the conflation.  The data can then be corrected if necessary.
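Two of these checks can be sketched in Python as follows. This is an illustration of the conditions rather than the AML itself. The arc fields (id, addr_from, addr_to) are hypothetical, each arc is assumed to carry a single address range, and the arcs of a road are assumed to be supplied in order along the road.

```python
from collections import Counter


def flag_address_anomalies(arcs):
    """Flag arcs whose range changes parity or direction relative to the previous arc."""
    flagged = []
    for prev, cur in zip(arcs, arcs[1:]):
        reasons = []
        if prev["addr_to"] % 2 != cur["addr_from"] % 2:
            reasons.append("parity change")
        prev_dir = prev["addr_to"] - prev["addr_from"]
        cur_dir = cur["addr_to"] - cur["addr_from"]
        if prev_dir * cur_dir < 0:
            reasons.append("direction reversed")
        if reasons:
            flagged.append((cur["id"], reasons))
    return flagged


def repeated_names_at_node(names, threshold=3):
    """Return road names carried by at least threshold arcs meeting at one node."""
    return [n for n, count in Counter(names).items() if n and count >= threshold]


# Made-up road: the third arc switches to odd numbers and runs backwards.
road = [
    {"id": 1, "addr_from": 100, "addr_to": 198},
    {"id": 2, "addr_from": 200, "addr_to": 298},
    {"id": 3, "addr_from": 399, "addr_to": 301},
]
flags = flag_address_anomalies(road)
suspect = repeated_names_at_node(["Oak St", "Oak St", "Oak St", "Main St"])
```

Flagged arcs are only candidates for review, since some of these conditions may already be present in the original data.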

Conclusion

Geographic data sets that have both rich attribution and good positional accuracy can be developed by conflating DLG and TIGER data.  Cost and time savings can be achieved by using an automated conflation system along with a professional GIS such as ArcInfo.  The GIS is required to perform the processing before and after automated conflation. For regions where DLG and TIGER data are very similar, automated conflation can produce acceptable results with no manual editing required.  For other regions where the data sets are dissimilar, automated conflation will require some manual editing.  An ArcInfo AML that performs sanity checks on the data is useful for detecting subtle errors that may arise during the conflation process.

Author Information


Stanley L. Dallal, Senior Software Engineer

ESEA

100 W. El Camino Real, Suite 74

Mountain View, CA 94040

Telephone: 650-962-1167

Fax: 650-962-0976

dallal@esea.com