EsriPaper

Esri Conference Proceedings, 2000, in press.
Providing Consistent and Flexible GIS for the U.S. Mexico Boarder

An Integrated Dual-Census Data-base Design.

W. James Hettrick, James B. Pick, Doug Mende, and Elliott Ellsworth

University of Redlands

Redlands, California 92373-0999

Introduction

For applications in public administration, a Geographical Information System (GIS) is helpful to decision makers at the township, city, regional, state, and national levels. It has been adopted extensively for decision-making in U.S. cities (Huxhold, 1991) and in most European cities. On the other hand, it has been applied to a limited extent at the local level in developing nations. The present paper offers a framework for GIS's to be built based on the census information from two nations. This framework becomes increasingly important as a GIS's areal units i.e. polygons become smaller. Multi-census data are often combined at the national levels and sometimes at the state levels, but less so for counties and rarely for city small area units, referred to as census tracts or block groups in the U.S. This paper will point out the challenges in building a dual census GIS and how the challenges may be overcome through an integrated data-base design.

Creating A Binational, Dual-Census GIS

Two or more national censuses rarely coordinate with each other on their design, operations, and standards. For instance, census unemployment definitions are different in many nations (for one example, see Fleck and Sorrentino, 1994). The United Nations and World Bank have difficulty in achieving consistency of definitions in their compilations of worldwide data-sets (United Nations, various years, and World Bank, 1999). They often have to eliminate nations that do not comply to standard definitions and/or nations that do not collect particular attributes. Another challenge to establishing consistent data-sets is the reduction in the number of attributes, as unit areas get smaller. This stems from the sheer size of collecting small area data and from the needs of confidentiality. Regarding confidentiality for instance, at the tract and block group levels, most information of the U.S. Economic Census is suppressed due to confidentiality (U.S. Census, 1999). Another large intra-censal difference is in the approaches to coding and table construction. Some censuses such as the U.S. Census are systematically hierarchical in their table construction, while others including Mexico are a-hierarchical and lack a systematic overall approach to table construction.

Four major challenges and technical solutions in dual-census GIS are:

Attribute definitional matching. Although some socio-economic attributes such as population, gender, births, and population density have standard definitions, other important attributes such as labor force participation, unemployment, and poverty lack standard definitions that are well accepted worldwide. The challenge for dual-census GIS is to find a limited set of attributes that match exactly and/or to identify a tolerance for other attributes that match closely enough, or can be adjusted to match closely. An example for the U.S. and Mexico is unemployment. In the Mexican Census underemployed can be defined as the ratio of persons working 1 to 32 hour per week to the economically active population. In the U.S. Census, the closest available match to this definition is persons working 1 to 35 hours per week to total labor force. These definitions are slightly different, not only in the range of hours of worked, but in shadings of meaning between "economically active" (Mexico) and "labor force" (U.S.). However, the present researchers considered the definitions close enough to "match." It is important to point out that perhaps twenty percent of the attributes available in both the U.S. and Mexican censuses can be matched exactly or closely. Thus, a limitation is that a dual-census GIS will have a reduced attribute set for binational maps and analysis than a single-census GIS.

Data-base design. The challenge here is to design data-bases that can systematically support a GIS for a large number of attributes and complex coverages. Two censuses produce individual data products that provide the data either in hardcopy, electronic, or web-accessible form. The problem for a dual-census GIS is that, between censuses, the data are arranged quite differently, in respect to coding, table dimensionality, order, calculation procedures, aggregation/ desegregation, and spatial referencing. It is almost impossible to construct a large-sized GIS accessing two nations' sets of data products directly, i.e. without translation or conversion.

Our approach is to convert unique data products of censuses into standardized sets of relational tables supported by a common data-base (see Figure 1). The pointers can be set differently in the relational tables for each census, so that the censuses' data dictionaries and relationships are equalized in format. There are three advantages to this approach: (1) data formats are standardized, (2) formula computations can be performed systematically on the data of both censuses, (3) the data can be easily aggregated as output with a single unique key field to reference the geography, and (4) the output data also can be transferred in a standard format for further statistical analysis or other modeling.

Although the process of aligning the dual-census data could be accomplished through tedious and complicated flat file conversion steps, the dual relational data-base design is enormously more efficient and more easily documented. Among other things, the macro scripts used to create formulas can be included directly as part of the documentation, especially since formulas for derived attributes are often quite complicated.

US and Mexico Census data provides the basis for comparative analysis of the US / Mexico Border Twin Cities project. This data is extracted from numerous CDs into Micorsoft Access files. These files allow for queries, manipulation of data and creation of tables to join with geographic data. A naming convention using all or part of the state name, summary tape file number and area information was established. For example, Arizona3a is the US Census data from summary tape file 3 for the entire State of Arizona; Cal3asd is the US Census data from summary tape file 3 for the San Diego area of the State of California.

Relational files were established for each of the US twin cities. Each of the US census CDs contained 34 tables, which contained summary data for the specific area. A relationship between the 34 tables was created using a common field, in this case logrecnu (or log record number). Once a relation had been established, the data could be arranged as desired and extracted through queries.

Once the data was moved into the Microsoft Access file, a query was performed to pull out data necessary to create a link to geographic layers. We took one of the files, usually ***3a00.dbf to query out at summary level 150. The data necessary to create "BKG_KEY" (the link) were Statefp, cnty, tractbna, blckgr, logrecnu, and sumlev. These data were brought into ArcView where new fields were created from these fields. In ArcView we added the .dbf file and began editing the table. We add a "junk", "newtract", and "bkg_key" field to the table. These are character fields. We calculate the junk field by combining trackbna + "00". This will add two zeros to the end of that field. We only need the left six characters, so we calculate the newtract field by taking the junk field + .left (6) command. Once the newtract field has been created, the statefp, cnty, newtract and blckgr fields can be added together to make the Bkg_key field. This is saved out as a database for import back into the Microsoft Access file. This updated .dbf is imported and linked to the other linked tables to create a relational database that is able to be joined to geographic layers. This will allow all census tables to be exported with a common BKG_KEY field to the US geography at a block group level. In this way the data could be linked back to geographic layers for display and analytical purposes.

Mexico Data Similar to US data, Mexico data was taken from the Scince CD and converted into tables in Microsoft Access. This data was arranged in tables by municipalities with ageb codes as a field in the table. There was not a link back to the geographic coverage at the ageb level. Therefore, a queried field was added in access. This field took the unique ID characters from the table and added the ageb code to create a key field. For example, the Mexicali area has an ID of 020020001. This "Area ID" and "ageb key" were added together to create a link back to the geographic layers, similar to the "BKG_KEY" field for the US data.

Creating variables with the US Census Data -

Census data contains unique fields for each variable. For example, total population is the census field P0010001. When queried, this field will generate the total population for an area. These fields were manipulated to generate "calculated variables". For example, a calculated variable for dependency was desired. The Dependency ratio was created by adding the census fields that contained data for persons of ages 0-17 with persons of age 65+ and then dividing this total by the fields for persons of ages 18-64. While this may not appear to be complicated, there are many fields for these categories that must be placed correctly in the equation. Due in part to limited detailed data from the Mexican Census, it was determined that, initially, sixteen variables (available from both US and Mexico Census data) would be needed to compare and contrast the Twin Cities. The process of calculating variables was performed for each Twin City (see Variable Query Structure below).

Creating variables with the Mexico Census Data -

We took the Mexican Scince data and created a query in Access to derive the sixteen variables. These same variables were generated for the U.S. side earlier. This data was exported as a .dbf and joined with the Mexico geographic layers. We were then able to query down to create a shape file for the specific ageb of interest. This data was matched up with the respective ageb coverage to generate a separate shape file for each area. Tijuana has five areas, the others have one each. An .avl (ArcView legend) file was created for each of the sixteen variables.

Microsoft Access - Variable Query Structure

From the thousands of data fields from the US and Mexican Census, we selected combinations of fields that would result in sixteen common variables that were generally comparable across the border. The sixteen variables and some of the fields necessary to generate these sixteen variables are listed below:

Block Group Key to link to GIS layer, BKG_KEY;

County Key to identify county, i.e. El Paso is 141, CNTY;

Total Population 1990, POP90

Population from zero to age 17, POP0_17

Population from 65 years and up, POP65+

Population between 18 and 64, POP18_64:

Ratio of Dependency; ages 0 through 17 and over 65 to ages 18 through 64 RDEP

Total Male Population, MALEPOP

Total Female Population, FEMPOP

Ratio of Males to Females, RGENDER

Population born in State, BORNIN

Ratio of population born in state to total population, RNATIVE

Population born outside of State, DIFFSTAT

Population over five years old, POP5+

In Migration - Population born outside of State to population over five years old, RINMIGRA

Population with 9th grade or lesser education, SCHL9+

Population over 15 years old, POP15+

Ratio of population with 9th grade (or lesser) education to population over 15, RPRIMED

Population with 9th to 12th grade or graduated from high school education, SCHL12+

Ratio of population with 9th to 12th grade, no diploma or graduated from high school education to population over 15, RSECED

Married population, MARRIED

Population over 12 years old, POP12+

Ratio of married persons to population over 15, RMARRIED
Economically active population, ECONACT

Ratio of economically active population to total population, RECONACT

Unemployed population, UNEMPLOY

Ratio of unemployed persons to economically active population, RUNEM

Underemployed population, UNDEREMP

Ratio of underemployed persons to economically active population, RUNDREMP

Population in poverty status between ages of 18 and 64, POV18_64

Ratio of persons in poverty to total population (ages 18 to 64), RPOVERTY

Total housing units, HOUSING

Total owner occupied housing units, OWNOCCP

Total occupied housing units, TOTOCC

Ratio of owner occupied to total occupied housing, ROWNHOME

Housing units with one bedroom, BDRM_ONE

Ratio of housing units with one room to total occupied housing units, RONEROOM

Housing units with kitchen facilities, KITCHEN

Ratio of housing units with kitchen facilities to total occupied housing units, RKITCHEN

Housing units with public sewer, SEWER

Ratio of housing units with public sewer to total occupied housing units, RSEWER.

County & Municipio Level Review

Since the 16 variables were already queried at the block group and ageb levels in Microsoft Access, we can use the same variable query to derive county and municipio level data. This is accomplished on the US data by finding the correct summary level for county level data. This summary level is "050" (block group level is "150"). STATEFP and CNTY fields are necessary to create a link field. On the Mexican data, we use the master table rather than municipio specific tables. For example, the ageb level query for Mexicali would look at table [20020001] and the municipio level query would look at table [2] using the same query.

The query for county level data in SQL format is as follow:

SELECT [2].MUN_CODE, [2].P01 AS POP90, [2].[P01]-[p08] AS POP0_17, [2].P10 AS [POP65+], [2]![P08]-[2]![P10] AS POP18_64, ([POP0_17]+[POP65+])/[POP18_64] AS RDEP, [2].[P01]-[P02] AS MALEPOP, [2].P02 AS FEMPOP, [MALEPOP]/[FEMPOP] AS RGENDER, [2].P11 AS BORNIN, [BORNIN]/[POP90] AS RNATIVE, [2].P14 AS DIFFSTAT, [2].P03 AS [POP5+], [DIFFSTAT]/[POP5+] AS RINMIGRA, [2].P21 AS [SCHL9+], [2].P06 AS [POP15+], [SCHL9+]/[POP15+] AS RPRIMED, [2].P24 AS [SCHL12+], [SCHL12+]/[POP15+] AS RSECED, [2].P30 AS MARRIED, [2].P05 AS [POP12+], [MARRIED]/[POP12+] AS RMARRIED, [ECONACT]/[POP90] AS RECONACT, [2].P34 AS ECONACT, [2].P35 AS UNEMPLOY, [UNEMPLOY]/[ECONACT] AS RUNEM, [2].P43 AS UNDEREMP, [UNDEREMP]/[ECONACT] AS RUNDREMP, [2].P46 AS POV, [POV]/[ECONACT] AS RPOVERTY, 0 AS HOUSING, [2].P70 AS OWNOCC, [2].P49 AS TOTOCC, [OWNOCC]/[TOTOCC] AS ROWNOCC, [2].P57 AS BDRM_ONE, [BDRM_ONE]/[TOTOCC] AS RONEROOM, [2].P61 AS KITCHEN, [KITCHEN]/[TOTOCC] AS RKITCHEN, [2].P64 AS SEWER, [SEWER]/[TOTOCC] AS RSEWER

FROM 2;

Where [2] represents Baja California. Each area is represented with a different character or numeric value. This SQL statement is copied into Microsoft WordPad. A "find and replace" function changes the query from one area to another. For example, the above SQL statement queries Baja California; to change the query to look at Chihuahua, we do a find and replace from [2] to [8], then copy this into a new query on Chihuahua municipios. The process is repeated for each state. Once completed you will have a query for each State (at municipio and county levels) that can be exported for use as a database.

These data are exported into a .dbf format that can be cut and pasted into a master .dbf file using excel. Once the master .dbf file is created it can be opened in ArcView and joined to the coverage or shape file for the related geography. The MUN_CODE or AGEB_CODE, STATEFP and CNTY fields are used to create links to the geography. In ArcView when a link is needed, you must edit the table, add a new field, and calculate that field with the other fields that make up the link. Once joined in ArcView, these data can be displayed by the sixteen variables. To be meaningful, however, these data should be analyzed in SPSS to determine appropriate cut points (see Cut Points below). Once cut points are determined, an .avl file can be created to quickly duplicate repetitive, standard legends for display. Other Variable that are not available through the Scince Data, will need to be queried from other INEGI CD's. These CD's are at the municipio level. It can be correlated with data from the US Counties CD or a combination of STF3 CDs. Binational variables to be queried from these data sources are: language, literacy, divorce, fertility, occupation, foreign born, higher education, male & female migration. Single nation variables are: ancestry, Hispanic origin, Mexican born, agricultural, and migration for the U.S.; and indigenous, Spanish, and migration for Mexico.

Join of DBF files to ArcView geographic features -

Using ArcView, a join was performed between geographic area features of US Census block groups and the equivalent Mexico agebs. These joined files were converted to shapefiles as the analysis was to be unique only to the geographic areas identified for analysis due to their apparent urban nature.

Cut Points -

Data was copied to an excel spreadsheet for analysis and determination of cut lines. This data was moved from the Access queries by: Selecting all data from Mexican query table, ctrl c, moving into the Excel document, ctrl v. Repeat process for U.S. side. Save spreadsheet. Cut points were developed for each of the following: U.S. Cities - An analysis of the 8 US cities independent of Mexican cities. Databases were assembled in sets of US cities and Mexico cities. The databases were assembled together into "us cities" and "mx cities" spreadsheets to determine proper cut points for these analyses. Cut points were determined by analyzing these spreadsheets using spss software. These cut points were then entered manually into the ArcView themes for the eight display areas. An .avl file was created to ease the processing of these legend cuts. Mexican Cities. - An analysis of the 8 Mexican cities independent of the US cities Same as US cities. All Cities - An analysis of all 16 cities. All of the databases were assembled together into an "all cities" spreadsheet to determine proper cut points for the project wide analysis. Cut points were determined by analyzing the "all cities" spreadsheet using spss software. These cut points were then entered manually into the Arcview themes for the eight display areas. An .avl file was created to ease the processing of these legend cuts.

Boundary file Matching. Another problem for dual-census GIS is the matching of boundary files i.e. sometimes called coverages (Clarke, 1997). In most nations, the census or an allied federal agency, is responsible for maintaining national boundary files, sometimes called "base coverages" (Huxhold, 1991). In the U.S., for instance, all the national geography is maintained by the U.S. Census Bureau in the nationwide "Tiger File" (U.S. Census, 1999). In other countries, the national geography is not unified into one national base file, but is maintained in a number of different base files that follow common definitions. Third party commercial vendors often build their national boundary files utilizing the national base file.

A particular challenge for coverages of dual census GIS's is to match the geographic boundaries at national borders. A starting point is that the two censuses' geographic projections need to be adjusted to be the same. A projection is a particular rendition of the 3-D surface of the earth to a flat surface and includes such well known ones as Albers equal area, asimuthal, cylindrical, Lambert conformal conic, and transverse (Clarke, 1997). However, the boundary file matching problem for borders extends beyond mere projection alignment. The problem is that the procedures to produce two nation's base files are different. The point, line, and polygon information is drawn from different sources such as satellites, GPS and other field measurements. At national borders, the differences show up acutely. When one nation's boundary file (or one derived from third party providers) is aligned with a second nation's, such border features as shared bridges, highways, railroads, rivers, etc. do not align. This requires development of special GIS macros to make the compromises necessary to align the borders. This is best done in an advanced GIS package, such as ArcInfo, that has a powerful macro language, "snapping" and other alignment features.

The method used to "normalize" the two nations' census boundary files involved studying the differences and then applying specific prescriptive GIS geographic editing. Because the U.S. Census had a clearer methodology used in its creation, this data was used as the border base data set. The Mexican data set was first converted and projected into Albers equal area, asimuthal, cylindrical, Lambert conformal conic (see Figure 3. and Figure 4.). The second step was to align or fit the known urbanized features (ageb, municipios) relationally to the US Counterpart features. Because basemap data was limited, some of the features were estimated based on known matches. The process did not use "rubber sheeting" as many alignment procedures would use, but rather used a move and rotation procedure similar to putting a puzzle together. The relational accuracy of the Mexican data was good so that it allowed for this method to occur quite easily. Where the Mexican data did not exactly align, decisions were made on how much movement would be allowed from node to node. This GIS process was carried out using Arc Macro Language scripts for working in Arcedit. Once the movement and rotation took place, the Mexican data border was removed and the nodes were snapped to the U.S. Tiger Line File. The result was a graphical seamless boundary that is displayed in the map book.

Modeling/Analysis with dual-census data. The essence of any GIS is its analysis and modeling features. This is tied to the idea that a GIS is a form of DSS, and that DSS is based on modeling features. In a dual-census GIS, there is the potential to do binational spatial analysis and modeling that is frequently restricted by lack of standardized data sets and geography. The spatial analysis can yield powerful results. For instance, joint environmental analysis can be done for air-polluted twin border cities; bi-national transport corridors can be planned spatially taking into account constraints, land features, and important determinant variables on both sides of the border. The obstacle to binational modeling is that the modeler needs to consider a valid model in both countries, not one with assumptions applying only to one nation or the other.

In summary, this section has examined the four major problems to achieving a dual-census GIS, and suggested for each ways to overcome the problems.

The U.S.-Mexico Border Twin Cities Project

The Twin Cities GIS Project utilized the general framework of this paper to construct a dual-census GIS for the twin cities of the U.S. and Mexico. There are eight major twin cities located along the 2,000 mile U.S.-Mexico border. The twin cities include the large ones of San Diego-Tijuana and El Paso (Texas)-Ciudad Juarez. They also include smaller cities such as Nogales (Arizona)-Nogales (Sonora) and Eagle Pass (Texas)-Piedras Negras (Coahuila). The total population of the twin cities in 1990 was 5.1 million, but this is expected to rise to 6.8 million by 2000 and an estimated 10.6 million by 2020 (Peach and Williams, 1999). The twin cities are particularly important as a result of the NAFTA agreement, since they serve important conduits for trade, commerce, and transportation between the two large nations.

The cities have suffered from lack of dual public administration planning. For instance, the metropolitan planning agency, SANDAG, has an advanced GIS for the metropolis of San Diego, but does not include any data on Tijuana, even though that city has surpassed one million population. Likewise, Tijuana's limited GIS from INEGI, the Mexican Census, includes no data on the U.S. side. Such planning deficiencies are repeated in all the border cities. Planning is better on the U.S. side and often utilizes GIS, but dual-census GIS's are lacking.

The Twin Cities GIS Project overcame these problems based on the system model discussed earlier and illustrated in Figure 1. Small area data were utilized for block groups in the U.S. and AGEBs (Area Geografía Basica) from the Mexican Census. Overall, these small area units had average population of 1,739. The U.S. Census data were converted and re-arranged in relational tables of a Microsoft Access data-base. Likewise, data from the Mexico Census of 1990 (CIEN files) were converted and re-arranged in tables in Access. In many cases, it is necessary to perform formula calculations to convert raw variables into finished ones, for instance, the project calculated "one-room housing" as the ratio of single bedroom housing units to all housing units. The formula computations for 16 matched variables were done in the Access scripting language. The variables included population, social, economic, and housing ones. The variables were output with systematic geographic key fields. The key field is a unique identifier for each block group and AGEB. In another procedure, the U.S. small area boundary files were acquired from the GIS provider Esri and the Mexican files from the INEGI; the U.S. and Mexico border and its features were aligned by utilizing macro language and spatial commands in the advanced GIS software, ArcInfo. The alignment consists of applying algorithms to modify the boundary files so the boundary edges conform and the common features such as roads, rivers, etc. join at the border, in order to produce a two-nation boundary file.. Finally, the two-nation attribute file output from Access was joined on the geographic key field to the two-nation boundary file (See Fig. 1). An example of the twin city output is the distribution of poverty in El Paso and Ciudad Juarez, shown in Figure 2. El Paso is to the north and east of the winding boundary line, and Ciudad Juarez is to the west and south. This one example indicates that poverty is located mostly in the city center of El Paso, whereas it is in the peripheral areas of Ciudad Juarez. Such information can be useful in joint binational planning between the two city governments. The project also utilized the dual-census data for spatial analysis and modeling; one example was applying cluster analysis to study urban structure and then constructing an index of binationality based on the cluster analysis to determine the extent of similarity or difference between the two sides.

Conclusion

There is a wealth of information from the Twin Cities GIS Project that can benefit planners in both the U.S. and Mexico. Even though there are constraints on which attributes can be used and on the matching of variables, there are many new perspectives in viewing these crucially important and often large "portal" cities. The challenge shifts from technical to political and governmental. Once technical barriers are overcome, the twin city governments have the will to cooperate and share data, knowledge, and expertise on an ongoing basis in order to understand each other's cities and cityspaces better? Beyond the cities' willpower, do the national governments provide support and encouragement for this cooperation. There needs to be a real governmental interest and motivation in order to have the most use and benefit of such a border twin cities GIS.

This example is important for other parts of the world as well. Cooperation among neighboring countries and blocs of nations is increasing, and dual-nation GISs can assist public administration in cooperative planning, whether for border cities or other bi-national geography. In order to achieve dual nation GIS's, public administration agencies at the levels of cities, states, national governments, and the European Union need to encourage systematic censal collection and standardized design frameworks such as the one illustrated to put binational or multi-nation GIS's into effect. The benefit could be great, since the world has numerous cities located at national boundaries and large amounts of population that is close to borders. Furthermore, many public administration problems and challenges of national borders remain, even as economic trading blocs contribute to the lowering of barriers.

Acknowledgement: Appreciation is expressed to the Ford Foundation, which sponsored the research project.

References

Clarke, Keith C. 1997. Getting Started with Geographic Information Systems. Upper Saddle River, New Jersey: Prentice Hall.

Fleck, Susan and Constacne Sorrention. 1994. "Employment and Unemployment in Mexico's Labor Force." Monthly Labor Review 117(11), pp. 3-31.

Huxhold, William E. 1991. An Introduction to Urban Geographic Information Systems. New York: Oxford University Press.

Murphy, Lisa. 1995. "Geographic Information Systems: Are They Decision Support Systems? Proceedings of the 28^th Annual Hawaii International Conference on System Sciences, 131-140.

Peach, James and James Williams. 1999. "Population Projections for the U.S.-Mexico Border Region." Paper Presented at the Meeting of the Association for Border Studies, Fort Worth, Texas, April.

Star, J. and J.E. Estes. 1990. Geographic Information Systems: Developments and Applications. Saddle River, New Jersey: Prentice Hall.

Turban, Efraim and Jay Aronson. 1998. Decision Support and Expert Systems. 5th Ed. Upper Saddle River, New Jersey: Prentice Hall.

United Nations. Various Years. U.N. Demographic Yearbook, New York: United Nations.

U.S. Bureau of the Census. 1999. Census Website http://www.census.gov.

World Bank, The. 1999. World Development Indicators. Washington, D.C.: The World Bank.

Figure 1. Design of Binational Twin City GIS

Figure 2. Sample Map

Figure 3. Border Matching Challenges

Figure 4. After Border Matching