Carol Brandt

PROGRESS REPORT ON AUTOMATED FILE MATCHING AND TIGER DATA BASE UPDATE

The Bureau of the Census (BOC) is actively examining methods to improve the quality of the TIGER (Topologically Integrated Geographic Encoding and Referencing) data base. One method in development is a semi-automated process where the BOC matches a provider file with the appropriate county partition of the TIGER data base.

First, we convert the provider file to a TIGER-like format. Then we can perform matching operations using software developed in-house. We may perform a rubber-sheeting operation on one of the files. Then we add new features and the associated attributes to the TIGER data base.

We intend to discuss the process to date as well as the problems encountered, successes achieved, desired format for files we accept from a provider, possible information we can return to the provider, and future plans.


INTRODUCTION

The BOC has been investigating methods to update the TIGER data base using automated techniques. In the past few years, we have been focusing on the use of spatial files from other organizations (providers) and analyzing different methods to use their information. Much of the analysis portion of the work has been performed using ArcInfo.

The BOC is developing an automated process to match a provider file with the TIGER data base, compare the features and attributes, and modify the TIGER data base by adding features from the provider file and changing linear attributes (feature names and address range information). Most of this work is being done using software routines developed by the Geography Division programming staff. They are not using ArcInfo or other commercial software.

The BOC does not have the staff to review spatial files from every organization that offers information. Every file that we have received to date has required review and some form of resolution. Thus far, the review and resolution have been accomplished using a combination of ArcInfo and in-house software.

The BOC does not use ArcInfo to perform the standard update and maintenance of the TIGER data base. Interactive update and maintenance is performed using the Geographic Update System (GUS) developed by the programming staff in the Geography Division.

THE MATCH/MERGE PROCESS

To perform this automated comparison, we must first convert the provider file into a TIGER-like format, which we call the Digital Exchange (DEX) file. The DEX file differs from the TIGER data base: it contains only the linear half of the TIGER data base, that is, lines, feature names, and address range information.

Once the linear information is stored in the DEX file, additional programs must run before automated file matching can occur. One program deletes duplicate curve points or nodes; every file we have received to date has included duplicate curve points. If the provider file has topological structure, another program creates the polygons and a third calculates each polygon's perimeter, area, and centroid. In addition, the address range edit flags individual addresses where address range overlaps, out-of-parity, or out-of-range situations exist.
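The duplicate curve point step can be illustrated with a minimal Python sketch. It is not the Geography Division's program; the function name, the point representation, and the tolerance parameter are assumptions made for illustration only.

```python
from typing import List, Tuple

Point = Tuple[float, float]


def drop_duplicate_points(chain: List[Point], tolerance: float = 0.0) -> List[Point]:
    """Remove any point that repeats the point kept immediately before it.

    `tolerance` is a hypothetical parameter: two consecutive points are treated
    as duplicates when both coordinates differ by no more than this amount.
    """
    cleaned: List[Point] = []
    for x, y in chain:
        if cleaned:
            prev_x, prev_y = cleaned[-1]
            if abs(x - prev_x) <= tolerance and abs(y - prev_y) <= tolerance:
                continue  # same location as the previous kept point
        cleaned.append((x, y))
    return cleaned
```

Only consecutive repeats are dropped, so a chain that legitimately returns to an earlier location keeps its shape.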

Where the provider file did not have full topological structure, the programming staff devised methods to correct the file. They developed programs to identify lines having zero length, floating lines (a line or group of lines that exist within a polygon but do not intersect the polygon bounding chains), and lines that intersect themselves. The programs eliminate some errors automatically and also create files listing the errors found. Where the errors are not eliminated, a clerical review for resolution can occur.
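Two of these checks, zero-length lines and self-intersecting lines, can be sketched in Python as follows. This is an illustrative sketch under stated assumptions, not the in-house software: the function names are hypothetical, chains are assumed to be lists of (x, y) points, and only proper crossings (segments that cross at an interior point) are detected, not touching or overlapping segments.

```python
from typing import List, Tuple

Point = Tuple[float, float]


def is_zero_length(chain: List[Point]) -> bool:
    """A chain has zero length when every point equals its first point."""
    return all(p == chain[0] for p in chain)


def _orient(a: Point, b: Point, c: Point) -> float:
    """Twice the signed area of triangle abc (sign gives the turn direction)."""
    return (b[0] - a[0]) * (c[1] - a[1]) - (b[1] - a[1]) * (c[0] - a[0])


def _properly_cross(p1: Point, p2: Point, p3: Point, p4: Point) -> bool:
    """True when segment p1-p2 and segment p3-p4 cross at an interior point."""
    d1 = _orient(p3, p4, p1)
    d2 = _orient(p3, p4, p2)
    d3 = _orient(p1, p2, p3)
    d4 = _orient(p1, p2, p4)
    return d1 * d2 < 0 and d3 * d4 < 0


def self_intersects(chain: List[Point]) -> bool:
    """Compare every pair of non-adjacent segments in the chain."""
    segments = list(zip(chain, chain[1:]))
    for i in range(len(segments)):
        for j in range(i + 2, len(segments)):
            if _properly_cross(*segments[i], *segments[j]):
                return True
    return False
```

The pairwise comparison is quadratic in the number of segments, which is acceptable for a single chain; a production check over a whole file would more likely use a spatial index.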

After successful completion of these programs and removal of any obvious errors, the DEX file and the TIGER data base are run through the automated match process. This process is composed of programs developed to match two spatial files using a combination of: Exact Name (Feature Identifier [FID]) Matching, FID Intersection Matching, FID Chaining Match - for matching FIDs, FID Chaining Match - for non-matching FIDs, TIGER/Line® Identifiers Matching (where the provider file originated from a TIGER/Line® file), and a geometric match. Some of these matching programs use topological relationship checking for verification.

Where the two files are not in close alignment, the process may shift the coordinates of one of the files. However, shifting the coordinates of a file may change its topological structure, so the file must go through the topological edits again before proceeding.
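As a rough illustration of a coordinate shift, the Python sketch below applies a single uniform translation derived from pairs of already-matched points. This is a deliberate simplification and an assumption on our part: it is far simpler than rubber-sheeting, and the function names and the idea of using a mean offset are hypothetical rather than a description of the match software.

```python
from typing import List, Tuple

Point = Tuple[float, float]


def mean_offset(pairs: List[Tuple[Point, Point]]) -> Tuple[float, float]:
    """Average displacement from provider points to their matched TIGER points.

    Each pair is (provider_point, tiger_point).
    """
    if not pairs:
        raise ValueError("at least one matched pair is required")
    n = len(pairs)
    dx = sum(tiger[0] - provider[0] for provider, tiger in pairs) / n
    dy = sum(tiger[1] - provider[1] for provider, tiger in pairs) / n
    return dx, dy


def shift_chain(chain: List[Point], dx: float, dy: float) -> List[Point]:
    """Translate every point in a chain by the same offset."""
    return [(x + dx, y + dy) for x, y in chain]
```

A uniform translation preserves topology within the shifted file, whereas a local adjustment that moves points by differing amounts can cause lines to cross, which is why the topological edits are rerun after a shift.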

The match process identifies features that exist in one file and not the other as well as features that match between the files. When new features (provider features having no match in the TIGER data base) are added to the TIGER data base, the associated attributes are added simultaneously.

For the existing feature chains in the TIGER data base, we want to expand and enhance the existing attribute information, not merely replace the present information. Modifying attributes involves comparison of the feature name and address range information for the matching chains. We have developed, and continue to refine, rules for adding and replacing attribute information in the TIGER data base.

Where the TIGER chain and matched provider chain both contain a feature name, we may use a string comparator to determine whether the feature name on the matched chains is the 'same' or 'different'. The string comparator assigns a value from 100 (an exact match) to zero (an exact non-match). Where the comparator score is less than a set value (indicating a "new" name), we add the provider name to the matched feature in the TIGER data base. Using feature naming rules, we then determine which name to flag as the primary and which to flag as the alternate.
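The scoring and threshold test can be sketched in Python. The comparator actually used is not described here, so this sketch substitutes the standard library's difflib.SequenceMatcher as a stand-in; the function names, the normalization step, and the threshold of 80 are assumptions for illustration only.

```python
from difflib import SequenceMatcher


def name_score(name_a: str, name_b: str) -> int:
    """Score two feature names from 0 (nothing in common) to 100 (identical).

    Names are upper-cased and runs of whitespace are collapsed before
    comparison; this normalization is an assumption, not a stated rule.
    """
    norm_a = " ".join(name_a.upper().split())
    norm_b = " ".join(name_b.upper().split())
    return round(100 * SequenceMatcher(None, norm_a, norm_b).ratio())


def is_new_name(tiger_name: str, provider_name: str, threshold: int = 80) -> bool:
    """Treat the provider name as 'new' when the score falls below the threshold."""
    return name_score(tiger_name, provider_name) < threshold
```

A name judged new would then be added to the matched feature, with the feature naming rules deciding which name becomes the primary and which the alternate.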

One goal for the TIGER data base is to contain the most complete and accurate potential address ranges possible. Whenever possible, we merge the provider address information with the existing TIGER address information. We compare the address information, merge the information from the two files or replace the existing information with that stored in the provider file, and create the most complete potential address range(s) possible.
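One possible merge rule is sketched below in Python. The actual rules are still being refined and are not spelled out here, so everything in the sketch is an assumption: ranges are taken to lie on the same side of the same chain, overlapping or adjoining ranges are combined, disjoint ranges are kept separately, and a parity conflict is raised for clerical review.

```python
from typing import List, NamedTuple


class AddressRange(NamedTuple):
    low: int
    high: int


def parity(rng: AddressRange) -> str:
    """Return 'E' (even), 'O' (odd), or 'M' (mixed) for a range's end points."""
    if rng.low % 2 != rng.high % 2:
        return "M"
    return "E" if rng.low % 2 == 0 else "O"


def merge_ranges(tiger: AddressRange, provider: AddressRange) -> List[AddressRange]:
    """Combine two potential address ranges under the assumed rule.

    Ranges that overlap, or that adjoin (the next address of the same parity
    is 2 higher), become one range; disjoint ranges are both kept.
    """
    if parity(tiger) != parity(provider):
        raise ValueError("parity conflict: refer to clerical review")
    first, second = sorted([tiger, provider])
    if second.low <= first.high + 2:
        return [AddressRange(first.low, max(first.high, second.high))]
    return [first, second]
```

Keeping disjoint ranges apart, rather than stretching one range across the gap, avoids asserting potential addresses that neither file supports.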

ALTERNATIVES

To perform the automated match successfully, the provider file must have linear topology. Where a provider file does not fulfill the requirement, we are discussing possible alternate methods for TIGER update including:

  1. modifying the provider file to display on-screen with the TIGER data base for heads-up digitizing;
  2. plotting maps from a provider file and performing interactive digitizing;
  3. extracting the attribute information from the provider file and geocoding the information to the TIGER data base. Where a provider record has no match in the TIGER data base, we would perform research to determine the correct location for the information, then add it using interactive digitizing; and
  4. accepting address list files (where the provider has no spatial file) and geocoding the information to the TIGER data base. Where a provider record has no match in the TIGER data base, we will ask the provider to mark the correct spatial location for the address on a paper map, so we can add it using interactive digitizing.

FILES RECEIVED TO DATE

Over the past few years, we have accepted a number of files from different organizations to review, analyze, and develop methods to incorporate the information into the TIGER data base. ArcInfo has been invaluable for reviewing files from users in different parts of the country and for helping us identify possible problems, suggest solutions, and, we hope, ask the correct questions before beginning any large-scale operation.

Of the files we have received and reviewed, no two files have had the same structure. The topological structure has ranged from full spatial topology to files containing coincident lines, duplicate line IDs, and lines that cross without intersecting.

File content is another issue. Files may contain the basic attribute information we want but store it in a different order. Some files contain both potential and actual address information. Others store only actual address information, while a third group stores only potential address information. Some files contain both primary and alternate feature names, while other files contain only the primary feature names. Where one organization may store both names as attributes of the linear feature, another may store the primary name as an attribute of the linear feature and the alternate name in a related file.

FILE FORMAT

Any spatial files offered will go to the BOC regional office staff, who will evaluate them and convert them to a standard file format prior to conversion to a DEX file. At this time, the regional office staff is investigating conversion methods for different file formats. Possible acceptable file formats include: 1) the ARC export format; 2) the TIGER/Line® file Record Types 1 and 2 format (with additional Record Types acceptable); and 3) a set of ASCII files, one containing the coordinate information and a unique line id from the spatial file (a "spatial dump") and one or more other files containing the unique line id plus the attribute information. The attribute files can be in fixed-length format or can contain fields delimited by a tab or other character (the provider must indicate the character). The provider must include metadata explaining the format. Providing files in the form described in choice number three requires prior discussion and agreement with the BOC.
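To make the third choice concrete, the Python sketch below reads a spatial dump and a tab-delimited attribute file and joins them on the unique line id. The layouts are entirely hypothetical, since the real layout is agreed with the BOC in advance: the dump is assumed to hold one "line_id,x,y" record per point, and the attribute file is assumed to have a header row with a "line_id" column.

```python
import csv
import io
from collections import defaultdict


def read_spatial_dump(handle):
    """Collect points per line id from records of the form line_id,x,y."""
    chains = defaultdict(list)
    for row in csv.reader(handle):
        if not row:
            continue  # skip blank records
        line_id, x, y = row
        chains[line_id].append((float(x), float(y)))
    return dict(chains)


def read_attributes(handle, delimiter="\t"):
    """Index attribute records by line id; the delimiter comes from the provider."""
    return {row["line_id"]: row for row in csv.DictReader(handle, delimiter=delimiter)}


def join_lines(chains, attributes):
    """Attach attributes to each line and list attribute ids with no geometry."""
    joined = {
        line_id: {"points": points, "attrs": attributes.get(line_id)}
        for line_id, points in chains.items()
    }
    orphans = sorted(set(attributes) - set(chains))
    return joined, orphans


# Small in-memory example standing in for the two provider files.
spatial = io.StringIO("L1,0.0,0.0\nL1,1.0,0.5\nL2,1.0,0.5\nL2,2.0,0.5\n")
attribs = io.StringIO("line_id\tname\nL1\tMAIN ST\nL3\tOAK AVE\n")
joined, orphans = join_lines(read_spatial_dump(spatial), read_attributes(attribs))
```

Lines without attributes and attribute records without geometry are both reported rather than dropped, since either condition may need resolution with the provider.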

The BOC can accept files on a variety of media types and media formats. Media types include: Exabyte tape, quarter inch cartridge tape (QIC), 3 1/2 inch disks, and 5 1/4 inch disks. We can also accept files via the Internet. Formats created using standard operating system commands are acceptable for copying files to media. In a Unix system these commands include: tar, cpio, cp, or dd.

PROVIDER INPUT

We realize that not all organizations require full spatial topology and we will not refuse to accept a provider file if it does not conform to our ideal format. We will not ask a provider to modify their database to suit our requirements. However, we ask that you be cognizant of our needs. Think about the contents of your database. Does it contain coincident lines? If so, can you identify them? Can you eliminate the coincident lines prior to giving us a copy of the file? If you cannot eliminate them, can you give us a list of their ID numbers so that we can eliminate them and make the conversion to the DEX format easier? Is the feature naming in your file consistent? Does your file contain potential or actual address ranges?

Providing metadata with your file is important. With your metadata we can properly utilize your data, as well as code the information we include in the TIGER data base to benefit all future users of the data. Metadata can include: sources used to create your file, codes used to distinguish different features (road, hydrography, or railroad), the scale at which your file was developed, and the vintage of the data. Refer to the metadata standards developed by the Federal Geographic Data Committee (FGDC) for more complete information.

CENSUS ASSISTANCE

The BOC is looking for ways that we can lessen the burden for organizations that want to provide a copy of their spatial data base for TIGER update purposes. We are investigating several possibilities including: additional provider file formats for acceptance, alternative update methods that do not require a spatial file, supplying a list of steps required to eliminate topological problems in the provider file, and possibly, creating a macro for ARC users to output their data base information into a flat ASCII format.

ONE TEST CASE

The BOC received a file that originated from a PC based GIS. We had to convert the file into another form to review it in ArcInfo. Once loaded into an ARC coverage, we discovered that every linear feature in the file was a straight line - the originating file structure did not allow for shape points. In addition, the file contained almost 500 coincident lines. Other topological problems included lines crossing with no intersections and undershoots and overshoots. With much help from the programming staff, we removed the coincident lines. Then, to eliminate the topological problems we used CLEAN and BUILD in ARC. Finally, we could output the information in a format suitable for conversion to a DEX file.
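A simple way to list candidate coincident lines, such as the straight-line features in this file, is sketched below in Python. It is not the procedure the programming staff used; the function name is hypothetical, and lines are assumed to be coincident only when they share exactly the same coordinates, in either direction.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

Point = Tuple[float, float]


def find_coincident(lines: Dict[str, List[Point]]) -> List[List[str]]:
    """Group line ids whose point sequences are identical, ignoring direction."""
    groups = defaultdict(list)
    for line_id, points in lines.items():
        forward = tuple(points)
        backward = tuple(reversed(points))
        # Use the smaller of the two orderings as a direction-independent key.
        groups[min(forward, backward)].append(line_id)
    return [sorted(ids) for ids in groups.values() if len(ids) > 1]
```

The resulting groups correspond to the kind of ID list a provider could supply with a file so that coincident lines can be removed before conversion to the DEX format.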

Once the programming staff created the DEX file, they ran the match process and modified the coordinates of the provider file to match the TIGER File. The resulting TIGER File, consisting of approximately 192,000 arcs, included 7,800 new arcs and approximately 1,500 split arcs. Unfortunately, some of the new feature chains did not intersect with any existing TIGER chains. This added an interactive clerical resolution phase to connect these new features with the existing network. For the most part, we are pleased with the results. Adding that number of features automatically, even with clerical resolution, is much more cost-effective than performing the update manually. The positional accuracy of the added features is yet to be determined.

SUMMARY

The results of the automated file matching process require review and refinement. With every provider file we run through this process, we learn more, can modify and improve the process, and decrease the requirement for review and resolution. We want this process to benefit all users by passing information back to the original provider as well as making the information available in new versions of the TIGER/Line® files.

In addition, we hope to utilize what we have learned to support the development and maintenance of the "Framework" concept of the Federal Geographic Data Committee. Once there is a procedure in place, and we all can relate geo-spatial data to the framework, we can possibly then move to the next objective, a transaction-based exchange of updates.


Carol Brandt
Bureau of the Census
Geography Division
Washington, DC 20233-7400
Telephone: (301) 457-1100
FAX: (301) 457-4710