Serving the American FactFinder with TIGER Data

Authors

Ricardo J. Ruiz
Kimberly K. Newkirk
U.S. Census Bureau
Geography Division

Abstract

As part of the Census Bureau's efforts to quickly disseminate census data, the Bureau developed the American FactFinder web-based application. This application uses TIGER data to give users an easy way to select the area for which they want census data and to view those data.

The Geography Division is responsible for creating and maintaining the TIGER database. We developed the DADS System to extract geographic data from our proprietary database and to create shapefiles and other data products. These products are delivered to the Data Access and Dissemination System (DADS) staff to support the American FactFinder web-based application and the creation of census data products.

This paper describes our DADS System and provides helpful information on techniques we used to successfully develop a fully automated system.

I. Introduction

To facilitate the dissemination of data to the public, the U.S. Census Bureau built a web-based application, the American FactFinder (AFF). AFF allows users to search for products or datasets. Datasets can be correlated or tabulated. Users can build queries or select pre-defined queries and view the results in tables or on a thematic map.

The Geography Division, as guardian of the TIGER database, provides all the geographic data to support AFF. We provide geographic data for all entities or entity parts for which the U.S. Census Bureau tabulates data. This tabulation scheme is known as Summary Levels.

II. Summary Levels

A summary level type exists for each geographic entity type for which data have been tabulated. The summary level types are defined by listing entity codes in a specific order. When these entity codes are concatenated for a given record, they form a unique identifier named the Geoheader.

A record within a summary level could be for a "whole" geographic entity (e.g., SL050 County), or for a combination of entity types that can result in a partial geographic entity (e.g., SL155 County part of Place). Some of these summary levels can be ordered to provide data for a whole entity followed by data for smaller areas within the whole entity. For example, see the following summary levels:

040 State
050 State-County
060 State-County-County Subdivision

Summary level 040 provides data for a state. Summary level 050 provides data for each county within the state and summary level 060 provides data for each county subdivision within each county within the state. If you aggregate the area measurements for all the 060 records for a given county, the results should match the area measurements for the county.
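The relationship between the Geoheader and the area check described above can be sketched as follows. This is a minimal illustration in Python, not part of the DADS System; the field names, the sample codes, and the area values are hypothetical, and the Geoheader is assumed to be a plain concatenation of the entity codes in summary level order.

```python
from collections import defaultdict

# Entity codes listed in the order that defines each summary level.
# Field names are hypothetical stand-ins for the real entity codes.
SUMMARY_LEVELS = {
    "040": ["state"],
    "050": ["state", "county"],
    "060": ["state", "county", "cousub"],
}

# Made-up sample records with an area measurement.
RECORDS_050 = [
    {"state": "24", "county": "033", "area": 200.0},
    {"state": "24", "county": "031", "area": 310.0},
]
RECORDS_060 = [
    {"state": "24", "county": "033", "cousub": "90100", "area": 120.5},
    {"state": "24", "county": "033", "cousub": "90200", "area": 79.5},
    {"state": "24", "county": "031", "cousub": "90300", "area": 310.0},
]


def geoheader(summary_level, record):
    """Concatenate the entity codes in the order the summary level defines."""
    return "".join(record[field] for field in SUMMARY_LEVELS[summary_level])


def aggregate_to_parent(child_records, parent_level):
    """Sum child areas under the Geoheader of the enclosing parent entity."""
    totals = defaultdict(float)
    for record in child_records:
        totals[geoheader(parent_level, record)] += record["area"]
    return dict(totals)


def mismatched_parents(child_records, parent_records, parent_level, tolerance=1e-6):
    """Return Geoheaders of parents whose area differs from the sum of their children."""
    totals = aggregate_to_parent(child_records, parent_level)
    return [
        geoheader(parent_level, parent)
        for parent in parent_records
        if abs(totals.get(geoheader(parent_level, parent), 0.0) - parent["area"]) > tolerance
    ]
```

Calling `mismatched_parents(RECORDS_060, RECORDS_050, "050")` on the sample data returns an empty list, which is the expected outcome when county subdivisions cover their counties completely.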

This example works well because states are 100% covered by counties or county equivalents, and counties are 100% covered by county subdivisions. But how do you deal with less than 100% coverage? This is where remainder records are used.

A remainder record serves as a place holder for an area not covered by the geographic entity of the summary level type in question. Not all summary level types require a remainder record. For example, see the following summary levels:

700 State-County-Voting District/Remainder
710 State-County-Voting District/Remainder-County Subdivision
720 State-County-Voting District/Remainder-County Subdivision-Place/Remainder
730 State-County-Voting District/Remainder-County Subdivision-Place/Remainder-Census Tract

Not all counties participated in the Voting District Delineation program. Each county that did not participate gets a single remainder record that covers the whole county. This is necessary to be able to create summary level 710 records for that county. Similarly, for summary level 720, a remainder record is created for any area of a county subdivision, within a voting district within a county, that is not covered by a valid place. The voting district and place remainder records allow us to provide data for all the census tracts in a county even if the county does not contain voting districts or is not 100% covered by places.
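The voting district case can be sketched as below. This is an illustration in Python under stated assumptions: the `REMAINDER` value is a hypothetical placeholder rather than the code actually used in the products, and the county and voting district codes are made up.

```python
REMAINDER = "REMAINDER"  # hypothetical placeholder, not the actual remainder code


def sl700_records(counties, voting_districts):
    """Build State-County-Voting District/Remainder records.

    counties: list of (state, county) code pairs.
    voting_districts: dict mapping (state, county) to a list of voting
    district codes; counties that did not participate are absent or empty.
    """
    records = []
    for state, county in counties:
        # A county without voting districts gets one remainder record
        # that stands in for the whole county.
        districts = voting_districts.get((state, county)) or [REMAINDER]
        for district in districts:
            records.append((state, county, district))
    return records
```

With this placeholder in position, lower summary levels such as 710 can still be built for every county, because each county has at least one record at the 700 level.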

III. Requirements

We were required to provide geographic data to support the AFF in the form of non-spatial and spatial products.

A. Non-Spatial Requirements

For non-spatial products, we provide text files consisting of attribute data for a number of summary levels. These products do not contain boundary information and cannot be used to generate maps. Data for state-based summary levels are provided in state-level files. Data for summary levels whose entities may cross state boundaries are provided in a national file.

There are two types of non-spatial products:

1. Geobucket Files

Geobucket files consist of unsorted summary level records for a number of summary levels. These records are used by the AFF as index information to the spatial data.

2. Data Public Product (DPP) Files

DPP files consist of summary level records for a number of summary levels sorted in a specific order. Summary levels are grouped and ordered to provide all summary level records for a geographic entity sequentially. That is, all relevant summary levels for one entity are listed before the same summary level types are repeated for the next entity. DPP files are the basis for all tabulated public data products.
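One simple way to obtain this kind of "parent first, then its parts" ordering is to sort on the sequence of entity codes, so that a shorter code sequence sorts immediately ahead of the longer sequences that extend it. The Python sketch below assumes a hypothetical record layout; the actual DPP sort order is defined by the DPP specification and is not reproduced here.

```python
# Made-up, unsorted records (as they might appear in a Geobucket-style file).
SAMPLE = [
    {"sumlev": "060", "codes": ("24", "033", "90200")},
    {"sumlev": "050", "codes": ("24", "033")},
    {"sumlev": "040", "codes": ("24",)},
    {"sumlev": "060", "codes": ("24", "031", "90300")},
    {"sumlev": "050", "codes": ("24", "031")},
    {"sumlev": "060", "codes": ("24", "033", "90100")},
]


def dpp_order(records):
    """Sort so each entity's record is followed by the records nested within it.

    Python compares tuples element by element, and a tuple that is a prefix
    of another sorts first, which yields the hierarchical ordering.
    """
    return sorted(records, key=lambda record: record["codes"])
```

For the sample above, the sorted summary levels read 040, 050, 060, 050, 060, 060: the state, then each county followed by its own county subdivisions.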

B. Spatial Requirements

For each geographic entity tabulated during Census 2000, AFF requires the boundaries of that entity in a spatial data file. In addition, AFF requires spatial data for feature networks such as roads, streams, and parks.

AFF uses SDE Layers stored in an Oracle database. We had the option of providing spatial data in shapefile or SDE Layer format. GEO chose shapefile format for reasons explained later. The Data Access and Dissemination System (DADS) staff in turn converts these shapefiles into SDE Layers.

Specific AFF functionality making use of spatial files for Census 2000 geographies includes:

Map-based selection of one or more geographies as part of a data query,
The creation of reference maps for identification of tabulation geographies, and
The creation of thematic maps to aid data visualization.

1. Coordinate Precision

Spatial data files were required at different levels of coordinate precision. Depending on the level of detail required to show the boundaries, AFF uses detailed or generalized data to display them.

a. Detailed Shapefiles

Detailed shapefiles offer the greatest coordinate resolution available in our TIGER database. National and statewide shapefiles for over 50 summary levels were requested.

b. Generalized Shapefiles

Generalized shapefiles describe shapes with fewer coordinates than a TIGER file so as to produce a more compact, faster-drawing file. These shapefiles are more appropriate for a regional or large-area view, where the difference in precision between a generalized and a detailed file is not visually noticeable.
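The idea of describing a shape with fewer coordinates can be illustrated with the well-known Douglas-Peucker line simplification algorithm. This is a stand-in chosen for illustration only: the method used by the Generalization System is not described in this paper, and the tolerance values below are arbitrary.

```python
import math


def _point_segment_distance(point, start, end):
    """Shortest distance from point to the segment start-end."""
    (px, py), (ax, ay), (bx, by) = point, start, end
    dx, dy = bx - ax, by - ay
    if dx == 0 and dy == 0:
        # Degenerate segment (for example, a closed ring): distance to the point.
        return math.hypot(px - ax, py - ay)
    t = ((px - ax) * dx + (py - ay) * dy) / (dx * dx + dy * dy)
    t = max(0.0, min(1.0, t))
    return math.hypot(px - (ax + t * dx), py - (ay + t * dy))


def simplify(points, tolerance):
    """Douglas-Peucker: drop vertices closer than tolerance to the simplified line."""
    if len(points) < 3:
        return list(points)
    first, last = points[0], points[-1]
    worst_index, worst_distance = 0, 0.0
    for i in range(1, len(points) - 1):
        distance = _point_segment_distance(points[i], first, last)
        if distance > worst_distance:
            worst_index, worst_distance = i, distance
    if worst_distance <= tolerance:
        return [first, last]
    # Keep the farthest vertex and simplify the two halves independently.
    left = simplify(points[: worst_index + 1], tolerance)
    right = simplify(points[worst_index:], tolerance)
    return left[:-1] + right
```

A larger tolerance removes more vertices, which is the trade-off between file size and visual fidelity described above.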

2. Shape Types

All geographic boundaries are provided as polygon shapes. Some are also provided as line shapes. Polygon shapes facilitate area shading and name placement within an area. Line shapes are useful for displaying complex line symbols (e.g., a dash-dot line) whose pattern would be obliterated if rendered twice on a line segment shared by two polygons. Line shapes are also useful for the selective display of boundary lines, such as the suppression of boundaries along the national coastline.
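The reason shared segments matter can be shown with a small Python sketch that derives a set of boundary lines from polygon rings so that each shared segment appears only once. It assumes, for illustration, that neighboring polygons use exactly matching vertices along their common boundary; the polygon names and coordinates are made up.

```python
def boundary_lines(polygons):
    """Map each distinct boundary segment to the polygons that use it.

    polygons: dict of name -> list of (x, y) vertices; the ring is closed
    implicitly from the last vertex back to the first.
    """
    owners = {}
    for name, ring in polygons.items():
        for a, b in zip(ring, ring[1:] + ring[:1]):
            # Normalize direction so A->B and B->A count as the same segment.
            segment = (a, b) if a <= b else (b, a)
            owners.setdefault(segment, []).append(name)
    return owners
```

Segments with two owners are shared boundaries that a line file would draw once; segments with a single owner lie on the outer edge and could be displayed or suppressed selectively.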

The feature network files contain point, line, or polygon shapes to match how they are stored in TIGER. AFF shows features on maps to provide orienting information for viewing the boundaries of geographic entities.

IV. DADS System

The Census Bureau constantly faces the requirement to process a large amount of data in a short period of time, and the AFF requirements were no different. This was our main concern. Our DADS System had to process data for all fifty states, D.C., and Puerto Rico to produce hundreds of spatial data sets and text files within a three-month period. To accomplish this task, we needed to minimize human interaction with the system, that is, to automate the process as much as possible. This was our first design requirement.

Our second concern was last-minute requirement changes. AFF is a fairly new application that is still evolving, so such changes were highly probable. Therefore, our second design requirement was to make the system metadata driven: the software had to be developed so that requirement changes could be implemented immediately by simply updating the configuration files that drive the system.
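A metadata-driven design of this kind can be sketched in Python as follows. The JSON layout, the field names, and the "scope" values are hypothetical; the actual format of the DADS configuration files is not described in this paper.

```python
import json

# Hypothetical configuration: one entry per product to be generated.
CONFIG_TEXT = """
{
  "products": [
    {"summary_level": "050", "fields": ["state", "county"], "scope": "state"},
    {"summary_level": "060", "fields": ["state", "county", "cousub"], "scope": "state"}
  ]
}
"""


def plan_jobs(config_text, states):
    """Expand the configuration into a list of (summary level, target, fields) jobs."""
    config = json.loads(config_text)
    jobs = []
    for product in config["products"]:
        # State-based products run once per state; others run once nationally.
        targets = states if product["scope"] == "state" else ["US"]
        for target in targets:
            jobs.append((product["summary_level"], target, tuple(product["fields"])))
    return jobs
```

Under this arrangement, adding a summary level or changing its fields means editing the configuration text, while the processing code stays untouched.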

Our third major concern was future use. We expect more requests to produce non-spatial and spatial files for years to come. Although the requests will be similar, the system will need to adjust to variations on the requests.

We had about one year to design, develop, and test the DADS System. We had to decide which technology and tools to use. Our staff had plenty of experience developing C, Java, and perl applications. Our experience using relational databases was limited, but we had a few Oracle contractors available on-site. The DPP and Geobucket files contain many aggregated values, and SQL queries against a relational database make calculating aggregate attributes easy. After consulting with our Oracle contractors, we decided to use a relational database to produce the non-spatial products. Our staff’s desire to gain experience working with Oracle also made the decision easy.

On the spatial side, the decision was easier to make. The requirements called for SDE Layers or shapefiles. During the Census 2000 Dress Rehearsal, a Geography Division team provided SDE Layers to support AFF. The team ran into a number of problems producing SDE Layers for some areas of the country and was unable to determine the cause or find a workaround. The team was forced to produce shapefiles for the problematic areas. Since our staff had very limited experience with ArcSDE, we decided to develop AML scripts to produce shapefiles.

In summary, these are the tools and technologies used for different parts of the DADS System:

Perl scripts are the glue that keeps all the pieces together. The processing flow is controlled through the use of many perl scripts.
C is used to develop the software that extracts data from our custom TIGER database. The extracted data are stored in specific formats to support both non-spatial and spatial processing.
Java is used to load data into our Oracle database and to generate all the non-spatial products.
Oracle is used as an intermediate repository to create the non-spatial products. This facilitates the sorting of records and the aggregation of area for the creation of non-spatial products. In addition, production is controlled using an Oracle database and the Quality Assurance (QA) team also uses Oracle to facilitate the QA process.
ArcInfo tools and AML scripts are used to create all the spatial products.

We have five teams working on the DADS System. Teams split the work as follows:

Team 1 works on the Control System. As the name indicates, the Control System controls the processing for the creation, QA, and delivery of products.
Team 2 works on the Non-Spatial System. The Non-Spatial System processes data for the creation of the non-spatial (Geobuckets and DPP) files.
Team 3 works on the Spatial System. The Spatial System processes data for the creation of the shapefiles.
Team 4 works on the Generalization System. The Generalization System processes detailed data in preparation for the creation of the generalized shapes.
Team 5 works on the QA System. The QA System integrates the Non-Spatial and Spatial systems. It ensures all data products are consistent with our TIGER data and among each other.

Because different developers are working on the Non-Spatial and Spatial systems, we are able to use the resulting products from one system to QA the products for the other system. The QA System also uses TIGER/Line® and Geographic Reference files created by the Geography division as source files.
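The cross-check between independently produced products can be sketched as below. This Python illustration assumes, hypothetically, that each system's output has been reduced to a mapping from Geoheader to an area value, and that a small relative tolerance is acceptable; the actual QA rules are not described in this paper.

```python
def compare_products(non_spatial, spatial, tolerance=0.001):
    """Compare two Geoheader -> area mappings and list the discrepancies.

    tolerance is a relative difference; both the mappings and the
    tolerance are assumptions made for this sketch.
    """
    problems = []
    for key in sorted(set(non_spatial) | set(spatial)):
        if key not in spatial:
            problems.append((key, "missing from spatial"))
        elif key not in non_spatial:
            problems.append((key, "missing from non-spatial"))
        else:
            a, b = non_spatial[key], spatial[key]
            if abs(a - b) > tolerance * max(abs(a), abs(b)):
                problems.append((key, "area mismatch"))
    return problems
```

Because the two inputs come from separately developed code paths, an empty result gives some confidence that neither system introduced an error on its own.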

The next diagrams describe the Non-Spatial and Spatial systems.

A. Non-Spatial System

[Figure: Non-Spatial System diagram]

B. Spatial System

[Figure: Spatial System diagram]

The data are extracted from our internal TIGER database into ASCII files using a metadata driven C application. The ASCII data are in turn loaded into ArcInfo coverages using the GENERATE command.

The coverages are then manipulated using various ArcInfo commands, such as REGIONQUERY, REGIONDISSOLVE, DISSOLVE, COPY, and RESELECT, to produce coverages with the desired summary level information. Finally, the ARCSHAPE command is used to create the shapefiles.

V. Issues

A. Non-Spatial

Performance was the main issue. Five factors helped us reach the desired level of performance:

We divided our Oracle database into ordered groups of rows using a partition key, in our case COUNTY, and distributed the partitions across different disks. Because each partition acts as a separate entity, this improved both performance and maintenance.
We enabled the parallel query option. This allows a standard SQL query to kick off multiple queries in parallel, enabling the system to retrieve data from each partition much faster.
We used dynamic SQL processing. Due to the nature of the summary levels, it was hard to maintain programs with static SQL statements. We overcame this issue by dynamically constructing the SQL statement for each summary level from its parameters. This improved programmer productivity tremendously.
We used the windowing aggregate family of analytic functions to provide moving and cumulative processing for SQL aggregate functions such as SUM, AVG, MIN, and MAX. Analytic processing improved query speed and developer productivity by allowing flexible and powerful calculation expressions within existing SQL. A query that uses analytic functions is processed in three steps. First, all joins and the WHERE, GROUP BY, and HAVING clauses are performed. Second, the result set is made available to the analytic functions and all their calculations take place. Third, if the query ends with an ORDER BY clause, the ORDER BY is processed to allow for precise output ordering.
Many geographic entities, such as American Indian Areas, Consolidated Cities, and Places, are limited to only parts of the country. We optimized the queries for the summary levels involving these entities by eliminating, in the first sub-query of each query, the rows where these entities are not located.
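The dynamic SQL technique in the list above can be sketched as follows. This Python illustration uses an in-memory SQLite database as a stand-in for Oracle, and the table name `blocks`, the column names, and the sample rows are hypothetical. The column lists come from fixed metadata rather than user input, which is what makes building the statement text by string formatting acceptable here.

```python
import sqlite3

# Hypothetical metadata: grouping columns for each summary level.
SUMMARY_LEVELS = {
    "050": ["state", "county"],
    "060": ["state", "county", "cousub"],
}


def build_query(summary_level):
    """Construct the aggregate SQL statement for one summary level."""
    columns = ", ".join(SUMMARY_LEVELS[summary_level])
    return (
        f"SELECT {columns}, SUM(area) AS area FROM blocks "
        f"GROUP BY {columns} ORDER BY {columns}"
    )


def run(summary_level, rows):
    """Load sample rows into a scratch table and run the generated query."""
    connection = sqlite3.connect(":memory:")
    connection.execute(
        "CREATE TABLE blocks (state TEXT, county TEXT, cousub TEXT, area REAL)"
    )
    connection.executemany("INSERT INTO blocks VALUES (?, ?, ?, ?)", rows)
    result = connection.execute(build_query(summary_level)).fetchall()
    connection.close()
    return result
```

One generic builder replaces a separate hand-written statement per summary level, which is the maintenance benefit described above.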

B. Spatial

The development and testing of the Spatial System was not a straightforward process. During this process, we ran into various issues. Some of these issues are described below:

Sometimes, the GENERATE command produced unexpected results: the attributes in the coverage's attribute table looked as if they had been shifted. We found that each record in the files read by the GENERATE command MUST have an even number of characters. Records with an odd number of characters must be padded to an even number, or the results are unpredictable. To solve this problem, we programmed the extraction to add a line-feed character wherever padding was needed.
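The padding rule can be sketched in Python as below. The pad character defaults to a line feed to mirror the fix described above; whether the record terminator counts toward the record length is not stated, so this sketch simply assumes that the string passed in is the full record whose length must be even.

```python
def pad_to_even(record, pad="\n"):
    """Return record unchanged if its length is even, else append one pad character."""
    if len(pad) != 1:
        raise ValueError("pad must be a single character")
    return record if len(record) % 2 == 0 else record + pad
```

Applying `pad_to_even` to every record before writing the input file guarantees that each record has an even number of characters.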
Error handling is crucial for the DADS system because of the limited production schedule. There is no time to re-process multiple times before the cause of an error is found. One of the first obstacles was figuring out a way to have the AML communicate back to the controlling perl script when there was an error condition. We found that the ArcInfo returned value is based only on whether the system execution of the ‘arc’ command is successful. To solve this problem, we implemented inter-process signals using the UNIX ‘kill’ command. We used the following lines in the AML and perl scripts:

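In the controlling perl script:

$ENV{AMLSIGPID} = $$;                  # pass this script's process ID to the AML
$SIG{'USR1'} = 'aml_signal_catcher';   # run the handler when the AML signals an error
sub aml_signal_catcher {
    # insert error handling code here
}

In the AML, on an error condition (signal 16 must correspond to USR1 on the host system):

&sys kill -16 %AMLSIGPID%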
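The same signaling pattern can be sketched in Python for readers who want to experiment with it outside ArcInfo. This is an illustration under assumptions, not the DADS code: a shell command stands in for the AML, the function and variable names are hypothetical apart from `AMLSIGPID` and `aml_signal_catcher`, and the sketch requires a UNIX-like system where SIGUSR1 exists.

```python
import os
import signal
import subprocess

signals_seen = []  # filled in by the handler; empty means no error was reported


def aml_signal_catcher(signum, frame):
    """Record that the child reported an error (real handling would go here)."""
    signals_seen.append(signum)


def run_step(command):
    """Run a shell command; return True unless it signalled an error via SIGUSR1."""
    signals_seen.clear()
    signal.signal(signal.SIGUSR1, aml_signal_catcher)
    # Expose this process's ID so the child knows where to send the signal.
    child_env = dict(os.environ, AMLSIGPID=str(os.getpid()))
    # The exit status is deliberately ignored: as with the 'arc' command,
    # it is assumed to say nothing about errors inside the script itself.
    subprocess.run(command, shell=True, env=child_env, check=False)
    return not signals_seen
```

For example, `run_step('kill -s USR1 "$AMLSIGPID"')` plays the role of an AML that hits an error condition, while `run_step("true")` plays the role of one that completes normally.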

Some of the summary levels required a national delivery. For this, we had to aggregate the state coverages into a national coverage using the APPEND and CLEAN commands. However, our first attempt to aggregate failed: we encountered memory problems when CLEANing the APPENDed national coverage, and the many temporary files created during the CLEAN exceeded the ArcInfo limit of 10,000 files. We solved this problem by DISSOLVing the component state datasets to reduce the total number of polygons.

Unfortunately, after successfully delivering national shapefiles for our first set of summary levels, we again exceeded the maximum number of temporary files. There was no major difference in data size between the first and second sets of summary levels. With the help of Esri Support, we tried various methods to solve the problem. First, we attempted to break the nation into regions, APPENDing and CLEANing each region and then APPENDing the regions into a national file. This created fewer temporary files, but the files were huge and caused memory problems. Finally, we attempted to use MAPJOIN. The MAPJOIN command had previously been dismissed as a way to aggregate the states because it caused similar errors in early testing. This time, however, MAPJOIN was successful, and we were able to deliver our second set of summary levels.

VI. Conclusion

The design and development of the DADS System took many hours of research, coding, and testing. We gained valuable experience in Oracle, ArcInfo, and system integration during this project. The DADS System was an unprecedented task for the U.S. Census Bureau’s Geography Division.

The DADS System consists of over 20 perl scripts, 10,000 lines of C code, 5,000 lines of Java code, and 7,000 lines of code in over forty AML scripts. This system will support AFF and other areas around the U.S. Census Bureau for at least the next five years.

VII. Acknowledgments

Jay E. Spurlin, Project Leader of the Spatial System, for working on the initial design of the Spatial System and on the development of the many perl and AML scripts. He was responsible for the development of the spatial processing diagram.

Kimberly K. Newkirk, TIGER Systems Branch staff member, for working many hours developing and troubleshooting the Spatial System.

Mohamad Thahir, Oracle contractor, for helping us with the physical and logical designs of the database used by the Non-Spatial System. He also developed the Oracle procedure to process the queries.

Julie Liu, Oracle contractor, for building most of the SQL queries used for the Non-Spatial System.

Berhane Banko, SAIC contractor, for developing the Java applications to create the Non-Spatial products.

Krishna Tadepalli, SAIC contractor, for developing the Java applications to load the ASCII files into our Oracle database for the Non-Spatial System.

Ruth P. Johnson, TIGER Systems Branch staff member, for developing the TIGER Extraction software for both systems.

Constance Beard and Deanna Fowler, Cartographic Operations Branch staff members, for working with us to integrate their Generalization System into our Spatial System.

Barbara Rosen and her staff for working on the QA System.

Charles Dingman for all his geography tutorials and for quickly resolving all specification issues as they arose.

Gerard Boudriault and his staff for developing the DADS Control System.

VIII. References

American FactFinder Spatial Files Specification - Census 2000 Geographies
American FactFinder Spatial File Specification - Detailed Feature Network Files
American FactFinder Geobucket File Specification - Census 2000
DADS2000 Specification DPP Geography File

IX. Author Information

Ricardo J. Ruiz
Chief, TIGER Systems Branch
U.S. Census Bureau
Geography Division
4710 Silver Hill Road
Mail Stop 7400
Washington, DC 20233-7400
(301) 457-1066
(301) 457-5710
Ricardo.J.Ruiz@census.gov

Kimberly K. Newkirk
Computer Specialist
U.S. Census Bureau
Geography Division
4710 Silver Hill Road
Mail Stop 7400
Washington, DC 20233-7400
(301) 457-1066
(301) 457-5710
Kimberly.K.Kline@census.gov