Updating TIGER with Non-Census Spatial Databases

Abstract

This paper examines recent developments and strategies to update TIGER with spatial databases maintained outside the Census Bureau. Since the early 1990s, Census Bureau geographers and programmers have been researching, developing, and refining automated tools and methodologies to transfer data from external spatial databases to TIGER. Recent successful automated and interactive updates from a variety of local geographic information system (GIS) files provide a broad-brush outline of existing capabilities as well as a framework for future developments. While the process has not led to a "black box" solution, these initial capabilities offer the Census Bureau a major opportunity to reduce maintenance costs, improve quality, and create more effective partnerships with the public, private, and academic communities.



Introduction

The Census Bureau has begun a very active nationwide update of the TIGER data base for Census 2000 that will have a significant impact on the Nation, in general, and the GIS community, in particular. A major resource for updating the TIGER data base is the independent development of a Master Address File (MAF). The MAF is a continually maintained, nationwide list of individual addresses that will be linked to the TIGER data base by geocoding each address in the MAF to an address range or location in TIGER. Building and maintaining an integrated TIGER/Master Address File (MAF) provides an ongoing mechanism for targeting updates in the Census Bureau's geographic data base. This infrastructure-building project, like the original development of the TIGER data base for the 1990 Census, portends significant and far-reaching benefits.

As a statistical agency, the Census Bureau is committed to the development and ongoing maintenance of a MAF through partnerships with the U.S. Postal Service and other federal, state, and local agencies. The MAF, unlike the many TIGER-related extracts, will not be a public product; all individual address information contained in the MAF is confidential and protected by Title 13 of the United States Code. The MAF will provide a continually updated inventory of the nation's housing units and business establishments to support all future statistical programs at the Census Bureau. An immediate use of the MAF will be to provide a list of addresses for the delivery of Census 2000 questionnaires. The MAF also will provide the basis for more timely small area data, intercensal sampling, and other non-decennial applications. While the statistical community will be the direct beneficiary of these ongoing developments, many indirect benefits will flow to the GIS community.

The integrated development of a MAF with the Census Bureau's geographic data base ensures that every "address" in the United States and its territories will have a "home" or link to a feature with address ranges in TIGER. As new addresses are added to the MAF, TIGER must be updated with new streets, street names, address ranges, and ZIP Codes so that all addresses in the MAF can be properly geocoded to their correct census geography. In the short term, the update and maintenance of this "home" for the MAF has proven to be highly labor-intensive, duplicating, in some cases, work already done by local governments and private agencies.

A primary reason for this situation is that a large number of local governments, and some private companies, now have, or are in the process of building, updated and accurate digital street centerline and attribute files. In this context, forging ahead with the same update methodologies and strategies used for the 1990 Census is leading increasingly to duplication of effort between the Census Bureau and both local government and the private sector. The existence of such files, which were available in only a few highly urban areas for the 1990 Census, offers a potentially valuable resource for Census 2000 preparations and beyond.

The Census Bureau has reacted to this new and "messy" environment by building local government partnerships as well as developing Cooperative Research and Development Agreements (CRADAs) and new contractual arrangements (CONOPS) with the private sector. This paper will focus on recent experiences in acquiring, editing, and converting locally provided digital files into a standard exchange format for automated processing. This work has provided the Census Bureau with a framework that will enable it to process a wide variety of publicly available and privately-held GIS files in a production environment.

Digital Update

For the purpose of this paper, digital update refers to the automated transfer of spatial and spatially-related data from a source file to a target file (i.e., the TIGER data base). While digital files can be plotted and used as reference sources or used as a backdrop for heads-up digitizing, these methodologies, from a census perspective, are not qualitatively different from traditional update strategies using paper map sources. The automated transfer of spatial data in a production environment is a qualitatively new methodology for updating files. The process depends on robust matching algorithms and an organizational capability to process and facilitate a wide variety of files and formats in a timely and cost-efficient manner. The basic objective of this process is to maximize the automated transfer of good quality data and minimize the transfer of poor quality data.

The practice of digital update at the Census Bureau has been simplified from earlier proposed models that involved a more complex transfer of all types of data (e.g., road, railroad, and hydrographic features, turn restrictions, coordinate enhancement, etc.). Matching files, or replacing TIGER with a provider file, and maintaining all the geographic area relationships in the TIGER data base has been a persistent obstacle to making digital exchange a more fully automated production process. The current digital exchange model was designed to meet the basic needs of integrating the MAF with TIGER. As a result, the system considers only street features and transfers attributes that expand the ability to link addresses in the MAF to TIGER. This model avoids some of the more complex matching problems, utilizes existing TIGER software, and works within the limited time frame remaining for Census 2000.

Digital geographic files submitted to the Census Bureau are derived from a variety of GIS, computer-aided design (CAD), or other mapping software packages in various formats, datums, and coordinate systems. The current digital update process demands that these files be based on a TIGER-like model of the real world. All street-centerline files must contain lines that are connected with nodes at each intersection; that is, they must contain the necessary structure to build polygon topology similar to TIGER. The minimum fields extracted are line ID, from/to coordinates, street name, address range, and ZIP Code, if available.
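The minimum field list and the connectivity requirement can be illustrated with a short sketch. The record layout, the field names, and the `dangling_endpoints` helper below are assumptions made for illustration only; they do not represent Census Bureau software. The check simply lists endpoints touched by a single line. Some of these will be legitimate dead ends, so the output is a list of candidates for operator review rather than a list of errors.

```python
from collections import Counter
from dataclasses import dataclass
from typing import List, Optional, Tuple

Point = Tuple[float, float]


@dataclass(frozen=True)
class CenterlineRecord:
    """Hypothetical record holding the minimum fields named in the text."""
    line_id: str
    from_xy: Point
    to_xy: Point
    street_name: str
    from_address: Optional[int] = None
    to_address: Optional[int] = None
    zip_code: Optional[str] = None


def dangling_endpoints(records: List[CenterlineRecord], precision: int = 6) -> List[Point]:
    """Return endpoints shared by no other line (possible connectivity gaps)."""
    counts = Counter()
    for rec in records:
        for x, y in (rec.from_xy, rec.to_xy):
            counts[(round(x, precision), round(y, precision))] += 1
    return sorted(pt for pt, n in counts.items() if n == 1)


records = [
    CenterlineRecord("1", (0.0, 0.0), (1.0, 0.0), "MAIN ST", 101, 199, "29720"),
    CenterlineRecord("2", (1.0, 0.0), (1.0, 1.0), "OAK AVE"),
    # This line starts at x = 1.2 instead of 1.0, so it does not meet line 2 at a node.
    CenterlineRecord("3", (1.2, 1.0), (2.0, 1.0), "ELM ST"),
]
for point in dangling_endpoints(records):
    print(point)
```

In this example the shared node at (1.0, 0.0) is not reported, while the near miss between lines 2 and 3 shows up as two separate dangling endpoints.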

Figure 1

The current digital exchange model is quite conservative. There is no conflation of TIGER to better coordinates in local files, no replacement of TIGER address ranges with local address ranges when parity or other conflicts exist, and no update of non-road features. However, as more files pass through the process, methodological refinements and loosening of constraints in the current matching algorithms are being explored and tested for implementation in a production environment.

Current Developments

Using a combination of commercial GIS and internal Census Bureau software and programs, digital files are now being systematically reviewed, edited, and converted to digital exchange (DEX) files that have the same internal structure as the TIGER data base. This work facilitates the matching process by allowing the Census Bureau to use software developed for the TIGER data base. A DEX file represents one county, or a portion of that county, and is compared to the corresponding TIGER county partition. These DEX files are then matched and run through a process that transfers base geographic data (street centerlines, street names, address ranges, and ZIP Codes) from the source file to the TIGER data base. The technical aspects of the latest developments in this rubber sheeting and matching process will be presented at the 1997 Urban and Regional Information Systems Association (URISA) Conference.

Significantly, the entire process now takes days and weeks rather than months and years. For example, digital geographic files received from state and local governments for Lancaster and Newberry counties in South Carolina for the 1998 Dress Rehearsal were processed through the above system, including quality assurance, in less than one week. The actual matching programs took less than one day to electronically resolve 58 percent of the two counties' workload or approximately 6,100 uncoded address clusters. Address clusters are census-defined work units that approximate a US Postal Service (USPS) ZIP+4 record. The same digital files were then backdropped behind TIGER as a reference source and used to interactively update TIGER (heads-up digitizing) to resolve the remaining clusters not currently handled by the automated process.

Other files have followed a similar path, with differences in processing time and geocoding results related to file size and quality. More powerful computers, increased disk space, and an increase in staff resources for this project will facilitate the further flow of files through the system. Technical and organizational enhancements are already underway to maximize data transfer capabilities.

Inventory and Assessment of Local Files

Through its ongoing acquisition and review of local digital geographic files, the Census Bureau has a unique vantage point concerning the availability and quality of updated street centerline GIS files throughout the country. Since most files are maintained for local purposes, they often contain more data than the Census Bureau requires or can currently handle. The quality of these files varies according to the resources, capabilities and reference sources available to local agencies.

Data quality is an area of considerable interest and concern in the exchange of digital data. In the absence of funding to evaluate the content quality of locally provided files through field checks, quality assurance procedures are being implemented in all stages of the digital update process. Census geographic staff perform a series of pre-processing edits of the local digital file using commercially available GIS software and post-processing edits of the updated TIGER data base using internal Census Bureau software. The automated update process utilizes geocoding checks to verify that DEX file updates enhance geocoding capability in the TIGER data base. Additional quality assurance operations are conducted to ensure that data were transferred according to specifications. The conservative nature of the current automated process minimizes the potential for introducing poor quality data from a provider file to the TIGER data base.

Geographic staff in the Census Bureau's 12 regional offices (ROs) are responsible for acquiring metadata and maintaining information on the availability of local digital geographic files. Through communications with local officials and review of the metadata, regional office staff make an initial determination of the file's "fitness for use" for automated or interactive processes to update the TIGER data base. In some cases, metadata are either non-existent or sparse, which may be a further indication of data quality.

Acquiring Local Files

If the initial assessment is positive, RO staff will acquire the digital file. In many cases, RO staff may have already resolved most uncoded address clusters using other non-digital reference sources. In this context, benefits from automated processing would be minimal. However, the digital file may still be converted to a DEX file and used as an interactive heads-up reference source.

Since digital geographic files are developed on various systems and in different environments, it is necessary to develop standard input file formats to support the automated update process. Currently, digital files are imported, edited, and either converted to two ASCII files, one containing the line coordinates and the other containing the attribute data, or they are submitted as TIGER/Line record type 1 and 2 files. These files are referred to as DEX-ready files. These ASCII files are then converted to the TIGER db format DEX files so that they can be handled by existing TIGER system software.
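The split into a coordinate file and an attribute file can be sketched as follows. The paper does not specify the layout of the DEX-ready ASCII files, so the comma-delimited layout, the column order, and the dictionary keys used here are assumptions chosen only to show how a shared line ID ties the two files together.

```python
import csv
import io


def write_dex_ready(lines, coord_out, attr_out):
    """Write one coordinate file and one attribute file keyed by line ID.

    The layout is hypothetical: one row per vertex in the coordinate file,
    one row per line in the attribute file.
    """
    coord_writer = csv.writer(coord_out, lineterminator="\n")
    attr_writer = csv.writer(attr_out, lineterminator="\n")
    for line in lines:
        for seq, (x, y) in enumerate(line["coords"]):
            coord_writer.writerow([line["id"], seq, f"{x:.6f}", f"{y:.6f}"])
        attr_writer.writerow(
            [line["id"], line["name"], line["from_addr"], line["to_addr"], line["zip"]]
        )


# Illustrative values only; in-memory buffers stand in for the two ASCII files.
lines = [
    {
        "id": "100",
        "coords": [(-80.77, 34.72), (-80.76, 34.73)],
        "name": "MAIN ST",
        "from_addr": 101,
        "to_addr": 199,
        "zip": "29720",
    }
]
coord_file, attr_file = io.StringIO(), io.StringIO()
write_dex_ready(lines, coord_file, attr_file)
print(coord_file.getvalue(), end="")
print(attr_file.getvalue(), end="")
```

Keeping geometry and attributes in separate files joined on the line ID mirrors the two-file arrangement described above and lets either file be regenerated without touching the other.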

Editing Local Files

The initial review and edit of incoming digital files is one of the most critical phases of the digital update process. Up-front spatial and/or attribute editing of the raw digital file familiarizes census geographic staff with the overall quality of the file and ensures that the automated process maximizes geocoding returns and minimizes post-processing cleanup. This phase involves a careful review of metadata and, if necessary, follow-up with file providers on specific technical questions not answered by the metadata.

The methodology for editing locally provided files was designed to ensure the overall efficiency and quality of the digital transfer process. Generic editing modules utilizing ArcInfo and ArcView capabilities were designed for GIS operators to detect and correct a wide variety of situations that could either reduce the digital transfer of features and attributes or add poor quality data to TIGER. Few files need to pass through all the modules; some may need little or no editing while others may need extensive editing.

Since many local digital files provided to the Census Bureau are in ArcInfo format, regional office geographic staff are increasingly adopting ArcInfo and ArcView software to import, display, edit, and process the files into a standard input format. Files are imported into ArcInfo as a coverage and, if necessary, converted from their native datum and coordinate system to a NAD83 geographic projection. This conversion not only helps the matching process, but facilitates the interactive use of the file for "heads-up" digitizing and creates a source for potential coordinate fixes later in the decade.

A digital file is then reviewed to ensure it contains only street features. Any non-street features are removed from the coverage. Some non-addressable street features (e.g., paper streets, private driveways, logging roads) also may be removed from the coverage, depending on whether they are flagged in the provider file, how extensive they are and how they affect the file matching algorithms, and the judgment of regional office geographers.

Every GIS file has its own unique characteristics and naming conventions; that is, they are not standardized. Field names for the same street name components, from and to address ranges, and left and right ZIP Codes vary from local agency to local agency. While a few states or regional entities have adopted similar naming conventions for these fields, most files contain unique field names. Although this is not generally a problem, operator intervention is necessary to ensure proper fields have been selected and extracted to create the DEX files.

Nearly all GIS files use from three to five fields (e.g., prefix directional, prefix type, feature name, feature type, and suffix directional) for the feature identifier (FID). All FID fields are concatenated in creating the standard input format and then reparsed according to TIGER standards when the files are converted to a TIGER db structure. Census geographic staff must ensure that the concatenation of the various name fields will not create duplication of name elements (e.g., 'HWY' may be in both the feature name and feature type field such as "HWY 100 HWY") in the DEX file.
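The duplication problem can be illustrated with a small sketch. The function names and the whole-token comparison are assumptions for illustration; the actual concatenation and reparsing are performed by TIGER system software that is not described here. The sketch flags repeated elements for operator review rather than removing them automatically, since a repeated token is sometimes a legitimate part of a name.

```python
def build_fid(prefix_dir="", prefix_type="", name="", feature_type="", suffix_dir=""):
    """Concatenate the non-empty name fields into one upper-case identifier."""
    fields = [prefix_dir, prefix_type, name, feature_type, suffix_dir]
    parts = [f.strip().upper() for f in fields if f and f.strip()]
    return " ".join(parts)


def duplicated_elements(fid):
    """Return tokens that occur more than once, in order of first repetition."""
    seen, duplicates = set(), []
    for token in fid.split():
        if token in seen and token not in duplicates:
            duplicates.append(token)
        seen.add(token)
    return duplicates


# The feature name already contains the type, as in the example in the text.
fid = build_fid(name="Hwy 100", feature_type="Hwy")
print(fid, duplicated_elements(fid))

# A clean record produces no duplicates.
fid = build_fid(prefix_dir="N", name="Main", feature_type="St")
print(fid, duplicated_elements(fid))
```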

TIGER system software can handle some standardizations of feature names. However, some digital files have abbreviations or naming standards that require special handling. For example, leading 0's in numeric street names (e.g., 07th St) must be eliminated as the current version of the TIGER standardizer does not recognize this pattern. Highway naming conventions also may vary within a file or between different files in the same state. Long names may be abbreviated in non-standard ways. Some files will leave an unnamed street blank while others will use a variety of terms such as "unnamed", "Unnamed St 38", and "unknown", among others. Similar situations exist for private roads, ramps, and other street features. Through ongoing feedback between geographic and programming staff, pre-processing and post-processing edits are being enhanced to handle these situations. In general, census geographic staff must conform local data standards or lack of standards to USPS conventions.
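Two of the pre-processing edits described above, removing leading zeros from numeric street names and spotting placeholder names for unnamed streets, might be sketched as follows. The regular expressions and function names are illustrative assumptions, and the placeholder terms are limited to the examples quoted in the text. The placeholder check is deliberately broad, so it would also flag a genuine name that happens to begin with "Unknown"; its results are candidates for review, not automatic corrections.

```python
import re

# A run of zeros at the start of a number, optionally followed by an ordinal suffix.
_LEADING_ZEROS = re.compile(r"\b0+(\d+(?:ST|ND|RD|TH)?)\b", re.IGNORECASE)

# Placeholder terms taken from the examples in the text; real files use others.
_PLACEHOLDER = re.compile(r"^(unnamed|unknown)\b", re.IGNORECASE)


def strip_leading_zeros(name):
    """Turn '07th St' into '7th St' while leaving '100 Main St' untouched."""
    return _LEADING_ZEROS.sub(r"\1", name)


def is_unnamed_placeholder(name):
    """Flag blank names and names that begin with a known placeholder term."""
    stripped = name.strip()
    return stripped == "" or bool(_PLACEHOLDER.match(stripped))


for raw in ["07th St", "100 Main St", "Unnamed St 38", "unknown", ""]:
    print(repr(raw), repr(strip_leading_zeros(raw)), is_unnamed_placeholder(raw))
```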

Some local governments utilize only two characters for the street type field. This situation leads to unusual and, often, ambiguous abbreviations that may have alternative meanings (e.g., "CR" for "Circle" or "Crossing") and are not recognized by or acceptable to the TIGER standardizer. In these cases, a new field is created with 4 characters, data are moved from the old field, and all cases of unacceptable abbreviations are systematically recoded to postal standards. Unacceptable abbreviations are determined by listing all unique occurrences of street types in a file and reviewing them against a list of acceptable abbreviations in the TIGER standardizer. In some cases, an abbreviation may be added to the standardizer if no possible conflict with other occurrences of this character string exists on a nationwide basis.
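The review of street type abbreviations lends itself to a simple tabulation. In the sketch below, the accepted list is a small illustrative subset of USPS street suffix abbreviations, not the actual contents of the TIGER standardizer, and the recode table is an assumption that an operator would build for a specific file after deciding what each ambiguous code means in that file.

```python
from collections import Counter

# Illustrative subset of postal abbreviations; the real list is much longer.
ACCEPTED_TYPES = {"ST", "AVE", "BLVD", "RD", "DR", "LN", "CT", "CIR", "PL", "XING", "HWY"}


def unacceptable_types(street_types, accepted=ACCEPTED_TYPES):
    """Count each unique street type that is not on the accepted list."""
    counts = Counter(t.strip().upper() for t in street_types if t.strip())
    return {t: n for t, n in sorted(counts.items()) if t not in accepted}


def recode_types(street_types, recode_map):
    """Apply an operator-supplied recode table; unlisted values pass through."""
    cleaned = (t.strip().upper() for t in street_types)
    return [recode_map.get(t, t) for t in cleaned]


types_in_file = ["ST", "AV", "CR", "ST", "BL", "CR"]
print(unacceptable_types(types_in_file))

# Hypothetical decision for this file: "CR" was confirmed to mean Circle.
file_recodes = {"AV": "AVE", "BL": "BLVD", "CR": "CIR"}
print(recode_types(types_in_file, file_recodes))
```

Because "CR" is ambiguous, the recode table is kept per file rather than global; the same two-character code could require a different mapping in another provider's data.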

The initial review process also examines address ranges and ZIP Codes. Files are reviewed for any address range inconsistencies or orientation and parity problems along chains. Edits are currently being designed to automatically correct some of these deficiencies. ZIP Codes also are reviewed for coverage and consistency. While ZIP Codes are not required for the automated update process, their absence may minimize the ability to geocode uncoded address clusters.
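A minimal sketch of the parity and orientation checks follows. It assumes each segment carries a single from/to range for one side of the street and that the segments are listed in order along a chain; the tuple layout and messages are illustrative, not the edits under design at the Census Bureau.

```python
def range_problems(segments):
    """Report mixed-parity ranges and chains whose ranges change direction.

    `segments` is a list of (segment_id, from_address, to_address) tuples
    for one side of the street, in order along the chain.
    """
    problems = []
    directions = set()
    for seg_id, start, end in segments:
        # Both ends of a one-sided range should be odd or both should be even.
        if start % 2 != end % 2:
            problems.append((seg_id, "mixed parity"))
        if end > start:
            directions.add("ascending")
        elif end < start:
            directions.add("descending")
    # Ranges along a chain are expected to run in a single direction.
    if len(directions) > 1:
        problems.append(("chain", "inconsistent orientation"))
    return problems


chain = [("a", 101, 199), ("b", 201, 299), ("c", 399, 300)]
for problem in range_problems(chain):
    print(problem)
```

Here segment "c" is reported twice over: its range mixes an odd and an even number, and it runs downward while the rest of the chain runs upward.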

If a moderate number of non-systematic errors in feature names, address ranges, or ZIP Codes are detected in this phase, they may be corrected, pending time and resources. Miskeyed entries are often easy to locate and fix. However, if there are a significant number of non-systematic errors, the file provider will be contacted for an updated and corrected version. If an updated file is not available and alternative paper or digital reference sources are unavailable to regional office staff for resolving uncoded address cluster fallout from the USPS Delivery Sequence File (DSF) and TIGER match, then the file may be used, with certain constraints, as an interactive heads-up source.

Post-processing Review of Local Files

Once a file has passed through all required edits, DEX-ready files are created and then processed into a TIGER db format or DEX file. At this stage, the DEX file is compared against the TIGER data base using internal TIGER Geographic Update System software called GusX, which runs in X Windows on a UNIX machine. While commercial GIS software can compare the digital file against TIGER/Line, GusX software compares the file against the current TIGER data base. Since TIGER/Line files are a TIGER data base extract, they are produced at a given point in time. As such, they quickly become outdated relative to the "live" TIGER data base. This situation is particularly important since the TIGER data base is currently experiencing rapid and ongoing update from a number of different geographic programs. In most cases, the initial phase of file review and editing minimizes the number of DEX files that are rejected or need to be re-created.

In order to make a more quantitative assessment of the quality of the digital file, the DEX file and TIGER data base are geocoded against the cluster file derived from the USPS Delivery Sequence File (DSF). These additional statistics provide an up-front indication of the comparative coverage and attribute accuracy of the two files. These statistics also are included as components in a model to determine file priority for the automated update process.
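The comparison can be thought of as computing a geocoding rate for each file against the same set of clusters. The sketch below is a simplified assumption: each cluster is reduced to a street name, ZIP Code, and one representative house number, and a match requires the number to fall within a range of the same parity. The actual cluster structure and matching rules are more involved and are not described in this paper.

```python
def geocodes(cluster, ranges):
    """Return True if the cluster falls inside any matching address range."""
    name, zip_code, number = cluster
    for r_name, r_zip, start, end in ranges:
        low, high = min(start, end), max(start, end)
        same_street = r_name == name and r_zip == zip_code
        if same_street and low <= number <= high and number % 2 == start % 2:
            return True
    return False


def match_rate(clusters, ranges):
    """Share of clusters that geocode to the given set of address ranges."""
    if not clusters:
        return 0.0
    return sum(geocodes(c, ranges) for c in clusters) / len(clusters)


# Illustrative data only.
tiger_ranges = [("MAIN ST", "29720", 101, 199)]
dex_ranges = tiger_ranges + [("OAK AVE", "29720", 2, 98)]
clusters = [
    ("MAIN ST", "29720", 105),
    ("MAIN ST", "29720", 150),
    ("OAK AVE", "29720", 10),
    ("ELM ST", "29720", 5),
]
print("TIGER rate:", match_rate(clusters, tiger_ranges))
print("DEX rate:  ", match_rate(clusters, dex_ranges))
```

A DEX rate noticeably above the TIGER rate suggests the local file carries streets or ranges that TIGER lacks, which is the kind of signal that could feed a file-priority model.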

Following the automated update of the TIGER data base from the DEX file, a subsequent operation examines the quality of the data transfer process using a random sample of update actions. This operation is conducted to ensure that the software programs are working properly. Other interactive operations are performed to review and edit unconnected street features and other problematic situations resulting from the digital update process.
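Drawing the quality assurance sample is straightforward to sketch. The function name, the representation of update actions as a simple list, and the use of a fixed seed to make the sample reproducible are all assumptions for illustration; the paper does not describe how the Census Bureau selects its sample or how large it is.

```python
import random


def sample_update_actions(actions, sample_size, seed=None):
    """Draw a simple random sample of update actions without replacement."""
    rng = random.Random(seed)
    actions = list(actions)
    return rng.sample(actions, min(sample_size, len(actions)))


# Hypothetical log of update actions produced by one automated run.
update_log = [f"action-{i:04d}" for i in range(1, 501)]
for action in sample_update_actions(update_log, sample_size=5, seed=1997):
    print(action)
```

Recording the seed alongside the results lets a reviewer regenerate the same sample later if a transferred feature is questioned.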

Once a file has completed these operations, it is used as a heads-up source for resolving address clusters not coded in the automated process. The current system has been resolving an average of 40 percent of uncoded clusters. This average reflects, in some sense, the initial conservative approach adopted to ensure quality and non-transfer of ambiguous data. Files that contain areas undergoing E-911 address conversion, or have features and attributes that do not exist in TIGER, will generally have more productive outcomes. Through time and additional experience in digital file update, the matching algorithms will become more aggressive. Ongoing feedback between census geographers and programmers resulting from the review, editing, and use of local digital files to update TIGER will lead to continuous improvements and enhancements in the digital update process.

The current digital exchange model is designed to assist ongoing efforts to link a nationwide MAF with the Census Bureau's geographic data base, reduce duplication of effort and costs, and enhance productivity and quality. Review of the original digital file and the converted DEX file is designed to maximize the electronic transfer of spatial and attribute data from a digital file to the TIGER data base. The automated process does not yet transfer enhanced coordinates, nor does it replace, in a wholesale manner, lines and attributes in the TIGER data base with those from a local GIS file. The current process is based on fairly conservative algorithms and "safe" update routines.

Conclusion

Based on the number of local digital files received and processed, documented cost savings from digital versus traditional update methodologies, the quality of local GIS files, and the possibility of receiving many more updated digital files nationwide, Census Bureau management is becoming increasingly aware of the potential savings from electronic data exchange. These initial files are only the tip of the iceberg of publicly available and continually maintained digital geographic files nationwide. As such, long-term savings resulting from ongoing and future data partnerships could multiply several-fold. This optimism is tempered by the fact that near-term progress depends on enhancing the current limited capacity to process large numbers of digital files and refining computer matching algorithms. Clerical update remains the principal update strategy to meet Census 2000 goals for MAF/TIGER integration. As such, knowledge of this rapidly expanding universe of publicly available good quality digital geographic files remains limited and, often, anecdotal.

In light of recent successes in digital update, the Census Bureau has much to gain and little to lose in pursuing a long-term strategy of ongoing digital exchange with local participants and private agencies. Encouraging and actively pursuing digital geographic data sharing partnerships will reduce duplication of effort within the public sector as well as between the public and private sector. These data sharing partnerships should translate into lower costs for the Census Bureau and, consequently, the American taxpayer. As such, this digital processing of local geographic files fulfills the basic tenets of Vice-President Gore's Reinventing Government Initiative and works within the spirit of the overall objectives of Executive Order 12906 and our emerging National Spatial Data Infrastructure (NSDI). The implications of this evolving process for post-census maintenance of a national street centerline file for census and other purposes are profound.



Jon Sperling

Geography Division
US Census Bureau
Washington, D.C. 20233-7400
Telephone: (301) 457-1100
Fax: (301) 457-4710