Improving Locational Data of Environmental Concern:
EPA's Locational Data Improvement Project

Andrew T. Battin, U.S. Environmental Protection Agency
Charles D. Catlin, U.S. Environmental Protection Agency
Robert G. Palmer, EPA Systems Development Center
Theresa A. Urban, EPA Systems Development Center

Analysis of United States Environmental Protection Agency databases revealed that locational (latitude/longitude) data were either inaccurate or nonexistent. As a result, the Agency began the process of improving the locational component of its information systems.

EPA began the Locational Data Improvement Project (LDIP) by address-matching records pulled from Agency databases. The intent of this effort was to acquire as many documented coordinates as possible. Improved coordinates are passed through verification checks (point in polygon) and stored in the Locational Reference Tables (LRT). To be loaded into the LRT, extensive documentation must accompany the data. EPA also developed standards for documentation.

Once coordinates are collected for 100 percent of EPA-regulated facilities, operable units, and monitoring sites, efforts will focus on improving existing data (e.g., replacing/adding coordinates derived from more precise methods, such as global positioning systems [GPS]). The resulting data set is available on-line for use in geographic information systems (GIS) applications. In addition, an on-line mapping application is available to the public.


Evolution of the Locational Data Improvement Project (LDIP)

The United States Environmental Protection Agency (EPA) relies on a diverse reporting community to monitor over 700,000 EPA-regulated facilities and collects data for hundreds of thousands of monitoring sites. This information is stored in several Agency-regulated program system databases, each of which contains information on a specific medium (e.g., air, water, hazardous waste). An analysis of these databases in early 1990 revealed that locational data were either inaccurate or nonexistent.

In 1990 the Agency formed the Locational Accuracy Task Force (LATF), which, over a period of six months, collected and weighed a considerable amount of information on geocoding technologies and programmatic requirements. As a result, LATF recommended that EPA:

EPA Locational Data Policy

LATF's efforts led to EPA's adoption of a Locational Data Policy (LDP) (EPA/IRM Policy Manual, Chapter 13). The policy established the principles for collecting and documenting latitude/longitude coordinates for facilities, sites, and monitoring and observation points regulated or tracked under Federal environmental programs within the jurisdiction of EPA. The intent of the policy was to extend environmental analyses and allow data to be integrated based on location, thereby promoting the enhanced use of EPA's extensive data resources. The policy underscored EPA's intent to establish the data infrastructure necessary to enable data sharing and secondary data use.

LDP serves as a framework for collecting and documenting location data. It sets an accuracy goal of 25 meters, while allowing managers of individual data collection efforts to determine the level of precision and accuracy necessary to support their mission. The policy recommends the use of GPS to obtain latitude and longitude coordinates of the highest possible accuracy.

Throughout the early 1990s, the policy was not widely implemented by Agency program system managers because of the significant system redesign efforts needed to accommodate the changes it stipulated. This situation has changed, as evidenced by most of the current system reengineering efforts and the new programmatic focus on communities and "places", which underscores the importance of location data.

The problem remains, however, that most EPA systems were designed to manage the information flow resulting from a legislative directive. In the past, legislation focused on tracking the number of chemical discharge permits given to facilities, and the type and amount of chemical emissions. Little or no effort was made to acquire accurate locations of chemical releases, and even less attention was paid to the interrelationships among air, water, and land-based releases. All information was managed in separate computer systems without regard to the synergistic effects across the various media.

EPA strives to maximize the analyses that can be performed using spatial data in Agency databases. The first step was to standardize the type of information that accompanies latitude and longitude coordinates.

Data Standardization and Database Development

EPA's LDP mandates that latitude and longitude coordinates be collected by Agency program systems, and requires that these coordinates be documented with specific information. The most crucial elements include a method of collection (e.g., map interpolation), an accuracy value associated with the coordinate, and a description of the entity corresponding to the coordinate (e.g., front gate). This information is required in addition to, and not precluding, other critical location identification data that may be needed to satisfy individual program or project needs, such as street address, depth, elevation or altitude.

A technical workgroup within the Agency developed the Method Accuracy Description (MAD) (Version 6.1) Information Coding Standards for the U.S. Environmental Protection Agency's Locational Data Policy (LDP) to standardize the documentation provided by data collectors. Currently, there are nine required and nine recommended (or optional) data fields in the MAD codes. These codes are being reviewed and may be modified based on input from headquarters, regional, tribal, and state partners.

Concurrent with the LDP and development of the MAD codes, EPA's Office of Information Resources Management developed Envirofacts, a relational database management system (RDBMS) implemented in Oracle, which integrated five program systems: the Resource Conservation and Recovery Information System (RCRIS); the Permit Compliance System (PCS); the Aerometric Information Retrieval System (AIRS) Facility Subsystem (AIRS/AFS); the Comprehensive Environmental Response, Compensation, and Liability Information System (CERCLIS); and the Toxic Release Inventory System (TRIS).

Envirofacts also contains the Facility Index System (FINDS), which crosslinks facilities existing in multiple program systems. Similarly, the Envirofacts Master Chemical Integrator (EMCI) provides an index for EPA-regulated chemicals listed by program system. The Locational Reference Tables (LRT) provide latitude and longitude coordinates for EPA-regulated facilities and operable units.

Envirofacts mirrors the contents of the original five program systems and is updated monthly. Locational data collected as a result of LDIP are stored in Envirofacts, which is a component of the Envirofacts Warehouse.

Locational Data Improvement Project (LDIP) Accomplishments

In 1996 EPA's Deputy Administrator and the Agency's Executive Steering Committee for Information Resources Management (IRM) allocated funds to begin the process of placing the contents of Envirofacts in a geographic context. This initiative formed the basis for the Locational Data Improvement Project (LDIP). The goal of the project is to generate latitude/longitude information of documented origin for all EPA program system records within a targeted accuracy of +/- 25 meters by the end of CY2000.

Critical LDIP activities include maintaining locational information in the Envirofacts Warehouse, improving facility identification through geocoding and system linkages, supporting other EPA initiatives, and providing the GIS community with access to locational data. The current focus of LDIP is to acquire location data for each EPA-regulated facility. As the project proceeds, latitude and longitude coordinates will be collected for operable units (e.g., pipes and stacks) and monitoring sites, with the assistance of regional, state, tribal, and local partners.

The first major activity under LDIP was to obtain latitude and longitude coordinates for EPA program system records through address matching. The intent of this effort was to acquire as many documented coordinates as possible; the accuracy of the location was also documented. Over one million records were sent to a vendor for address matching. The results of this effort greatly improved the quantity of documented location information, as shown in Figure 1.

Figure 1. Documentation of EPA Program System Records

EPA developed and released the LRT as part of its Envirofacts Warehouse to store this information. Data contained in the LRT augment the information contained in EPA's program systems.

Locational Reference Tables

The LRT provide the relational structure to accommodate coordinate information within the Envirofacts Oracle database. For program system records to be considered for loading into the LRT, a latitude and longitude coordinate, collection method, and description category must be present. If the other required fields are not present or populated (i.e., accuracy, vertical measure, horizontal datum, point line area, and source scale), the information is derived from lookup tables within the LRT.
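The loading rule described above can be sketched roughly as follows. This is a minimal illustration, not the actual LRT load routine: the record layout, the function names, the method codes, and the default values are all hypothetical, and keying the lookup table on collection method is an assumption (the text says only that missing values are derived from lookup tables within the LRT).

```python
from dataclasses import dataclass, replace
from typing import Optional

# Hypothetical lookup table keyed on collection method. The codes and default
# values are invented for illustration and are not actual MAD codes.
METHOD_DEFAULTS = {
    "ADDRESS_MATCH": {"accuracy_m": 150.0, "horizontal_datum": "NAD83"},
    "MAP_INTERPOLATION": {"accuracy_m": 50.0, "horizontal_datum": "NAD27"},
}


@dataclass(frozen=True)
class CoordinateRecord:
    # Fields that must be present for a record to be considered for loading.
    latitude: Optional[float]
    longitude: Optional[float]
    collection_method: Optional[str]
    description_category: Optional[str]
    # A subset of the other required fields, which may arrive unpopulated.
    accuracy_m: Optional[float] = None
    horizontal_datum: Optional[str] = None


def is_loadable(rec: CoordinateRecord) -> bool:
    """True when the coordinate, collection method, and description are present."""
    mandatory = (rec.latitude, rec.longitude,
                 rec.collection_method, rec.description_category)
    return all(value is not None for value in mandatory)


def fill_defaults(rec: CoordinateRecord) -> CoordinateRecord:
    """Fill unpopulated fields from the lookup table; keep supplied values."""
    defaults = METHOD_DEFAULTS.get(rec.collection_method, {})
    updates = {name: value for name, value in defaults.items()
               if getattr(rec, name) is None}
    return replace(rec, **updates)
```

A record that passes is_loadable would then go through fill_defaults before being written, so that values supplied by the data collector are never overwritten by lookup defaults.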

The table design of the LRT is shown in Figure 2. Refer to the on-line data element dictionary for more specific information.

Figure 2. Table Structure for the LRT

Program system data in the Envirofacts database are obtained from the EPA IBM mainframe by running Natural language retrieval programs to create flat files. A program was developed in C which incorporates the National Geodetic Survey NADCON (NOAA Technical Memorandum NOS NGS-50) routines with an Albers meters projection algorithm (national parameters). This program is used to add NAD83 based Albers X, Y values to each record in the flat files. The refresh process then generates the update/modify/delete transactions which are run against the Oracle database. It is in the creation of the transaction files that the preferred flag field is populated. This flag is used to indicate the 'most accurate' or best representative location for the set of coordinates associated with a program system identifier. FINDS linkages are then used to flag the 'best' coordinate to represent an EPA regulated facility.
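The projection step can be illustrated with a short sketch. The production program is written in C and is not reproduced here; the code below is only the standard ellipsoidal Albers equal-area conic forward formula on the GRS 80 ellipsoid. The standard parallels (29.5 and 45.5 degrees north), the central meridian (96 degrees west), and the latitude of origin (23 degrees north) are assumptions, chosen because they are commonly used for the conterminous United States; the paper says only "national parameters". The NADCON datum shift is grid-based and is not attempted, so input coordinates are assumed to be NAD83 already.

```python
import math

# GRS 80 ellipsoid, the reference ellipsoid for NAD83.
A = 6378137.0                 # semi-major axis, meters
F = 1 / 298.257222101         # flattening
E2 = 2 * F - F * F            # eccentricity squared
E = math.sqrt(E2)

# Assumed projection parameters (not stated in the paper), in degrees.
LAT1, LAT2 = 29.5, 45.5       # standard parallels
LAT0, LON0 = 23.0, -96.0      # latitude of origin, central meridian


def _q(phi):
    """Authalic-latitude helper q for latitude phi (radians)."""
    s = math.sin(phi)
    return (1 - E2) * (s / (1 - E2 * s * s)
                       - math.log((1 - E * s) / (1 + E * s)) / (2 * E))


def _m(phi):
    """Radius of the parallel at phi (radians), as a fraction of A."""
    s = math.sin(phi)
    return math.cos(phi) / math.sqrt(1 - E2 * s * s)


def albers_xy(lat_deg, lon_deg):
    """Project a NAD83 latitude/longitude to Albers X, Y in meters."""
    phi, lam = math.radians(lat_deg), math.radians(lon_deg)
    phi0, lam0 = math.radians(LAT0), math.radians(LON0)
    phi1, phi2 = math.radians(LAT1), math.radians(LAT2)

    m1, m2 = _m(phi1), _m(phi2)
    q1, q2 = _q(phi1), _q(phi2)
    n = (m1 * m1 - m2 * m2) / (q2 - q1)   # cone constant
    c = m1 * m1 + n * q1

    rho = A * math.sqrt(c - n * _q(phi)) / n
    rho0 = A * math.sqrt(c - n * _q(phi0)) / n
    theta = n * (lam - lam0)
    return rho * math.sin(theta), rho0 - rho * math.cos(theta)
```

With these parameters the origin projects to (0, 0), X is positive east of the central meridian, and Y grows northward from the latitude of origin.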

The preferred coordinate calculation is predicated on the accuracy value. Accuracy values provided in degrees are converted, correcting for the eccentricity and radius of the Earth, as determined by the horizontal datum (NAD27 or NAD83) associated with the coordinate. The formula for conversion is:

Horizontal Datum Conversion Equation

The units for lp are meters/arcsecond with respect to the arc measure of the longitude. The LDP states that one accuracy value be provided for a position and that it should be the least accurate of the two coordinates. The assumption is that the accuracy of the point is the accuracy of the longitude since, in most cases, the accuracy of the longitude in seconds will be less accurate than the accuracy of the latitude in seconds. The constant a is the radius of the earth at the equator for a given datum. The latitude must be converted to radians before calculating lp.
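The conversion equation itself is not reproduced in this text. As an illustration only, the sketch below uses the standard geodetic expression for the length of one arcsecond of longitude on an ellipsoid, lp = (pi / 648000) * a * cos(lat) / sqrt(1 - e^2 * sin^2(lat)), which matches the description above (meters per arcsecond of longitude, dependent on the equatorial radius a, the eccentricity, and the latitude in radians) but may differ in form from the equation the authors used. The function names are hypothetical, and the ellipsoids are the ones conventionally paired with each datum (Clarke 1866 for NAD27, GRS 80 for NAD83).

```python
import math

# Semi-major (a) and semi-minor (b) axes in meters of the reference ellipsoid
# conventionally associated with each horizontal datum.
ELLIPSOIDS = {
    "NAD27": (6378206.4, 6356583.8),        # Clarke 1866
    "NAD83": (6378137.0, 6356752.314140),   # GRS 80
}


def meters_per_arcsecond_longitude(lat_deg, datum):
    """Length lp, in meters, of one arcsecond of longitude at a latitude."""
    a, b = ELLIPSOIDS[datum]
    e2 = 1.0 - (b * b) / (a * a)            # eccentricity squared
    phi = math.radians(lat_deg)             # latitude must be in radians
    radius_of_parallel = a * math.cos(phi) / math.sqrt(1.0 - e2 * math.sin(phi) ** 2)
    return radius_of_parallel * math.pi / (180.0 * 3600.0)


def accuracy_in_meters(accuracy_arcsec, lat_deg, datum):
    """Convert an accuracy value in arcseconds of longitude to meters."""
    return accuracy_arcsec * meters_per_arcsecond_longitude(lat_deg, datum)
```

An accuracy value reported in degrees would first be multiplied by 3600 to express it in arcseconds before calling accuracy_in_meters.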

The following steps describe the scoring and ranking process used in assigning the preferred flag:

  1. Calculate the raw score such that, for accuracy values less than or equal to 20, the raw score equals the accuracy value. For all other accuracy values, the raw score is calculated using:

    raw score = 15 + (accuracy value)^0.61

  2. Calculate the modified (mod) score using:

    mod_score = raw score - (raw score * verification weight)

    where the verification weight is a percentage. Verification codes with their associated verification weights are housed in a lookup table.

The preferred flag field is set to 'Y' in the record that has the lowest mod_score within the set of coordinates for each entity. All other records will have 'N' in this field.
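The scoring and ranking steps above can be sketched as follows. This is an illustration under stated assumptions, not the refresh program itself: the raw-score formula is read as 15 plus the accuracy value raised to the power 0.61 (an interpretation of the printed formula), the verification weight is expressed as a fraction (0.25 for 25 percent), the record layout and function names are hypothetical, and ties are broken in favor of the first record, which the text does not specify.

```python
def raw_score(accuracy_value):
    """Step 1: raw score from the accuracy value."""
    if accuracy_value <= 20:
        return accuracy_value
    # Assumed reading of the formula: accuracy value raised to the 0.61 power.
    return 15 + accuracy_value ** 0.61


def mod_score(accuracy_value, verification_weight):
    """Step 2: reduce the raw score by the verification weight (a fraction)."""
    raw = raw_score(accuracy_value)
    return raw - raw * verification_weight


def assign_preferred_flags(records):
    """Flag the record with the lowest mod_score 'Y' and all others 'N'.

    records is the set of coordinates for one entity: a list of dicts with
    'accuracy' and 'verification_weight' keys (hypothetical layout).
    """
    scores = [mod_score(r["accuracy"], r["verification_weight"]) for r in records]
    best = scores.index(min(scores))        # first record wins a tie (assumption)
    return [dict(r, preferred="Y" if i == best else "N")
            for i, r in enumerate(records)]
```

Because a lower score is better, a well-verified coordinate with a modest accuracy value can outrank an unverified coordinate with a nominally better one.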

Coverage Creation Logic

The LRT are the primary source for coordinate information for the ef layer in the EPA Spatial Data Library System (ESDLS); MAD code compliant coordinates take precedence over undocumented coordinates. The ef layer has been designed to contain the maximum number of Envirofacts points that can be displayed in a GIS application. The LRT facilitate this by housing two coverage source tables: LDIP_EF_COVERAGE_SRC for facilities and LDIP_EF_SUB_SRC for operable units.

Although the LDIP_EF_SUB_SRC table is populated, there is currently no corresponding layer in ESDLS. These two tables are not part of the relational structure of the LRT (Figure 2). Their purpose is to house data for the monthly ESDLS refresh ARC Macro Language (AML) programs. The ef_refresh AML reads the data for the ef layer from the LDIP_EF_COVERAGE_SRC table.

Figure 3 shows the population logic for the coverage source table. The population of this table is tied to the monthly refresh cycle so that the ESDLS ef layer and the Envirofacts database remain synchronized.

Figure 3. Population Logic for the Coverage Source Table

The LDIP_EF_COVERAGE_SRC table is read by the ef_refresh AML to create a coverage which is then loaded into ESDLS as the ef layer by the ef_load AML.

Public Access

Several acts of legislation, including the 1986 Superfund Reauthorization Act, require government agencies to make information not classified as confidential available to the public. EPA's Envirofacts Warehouse partially fulfills this obligation.

The Envirofacts Warehouse builds on the RDBMS created in the early 1990s and is available to the public via the Internet. Location information contained in the LRT is stored in the Envirofacts Warehouse and is available on-line for use in GIS applications. The LRT serve as a link between the attribute data contained in Oracle and the spatial data from ESDLS. EPA's Maps On Demand (MOD) capitalizes on this link by offering a series of mapping applications that access spatial data from ESDLS and attribute data from Oracle.

This information is available to the public on-line.

Future LDIP Activities

Early phases of LDIP increased the number of program system records with documented latitude and longitude coordinates. Another LDIP accomplishment is the development of the LRT, which house the improved location information. Despite this success, more work must be done to accomplish the objectives stated in the Locational Data Improvement Project Plan, September 25, 1996.

EPA headquarters is currently coordinating with regional offices to acquire latitude and longitude coordinates for the remaining undocumented records and to begin collection for spills, observation locations, and monitoring sites. Once this is achieved, the focus will shift to gathering coordinate pairs of better quality, potentially using more accurate methods of collection. Figure 4 shows the level of accuracy as of March 1997. Despite the improvement in the number of available coordinates, very few achieve the 25 meter accuracy goal or represent the point of most significant release.

Figure 4. Current Coordinate Accuracy Level

The Agency hopes to improve the accuracy of the coordinates associated with places of environmental concern to within 25 meters through the use of GPS. These include operable units (e.g., pipes, stacks), monitoring sites, and observation locations. The goal is to collect GPS coordinates for operable units by 1999 and for monitoring sites and observation locations by 2000. A representative point for each facility will be stored in the LRT by 1998.

EPA also hopes to establish guidelines that will assist data collectors in the field. These guidelines will focus on what is to be collected (rather than how). Validation steps will be taken to ensure the quality of the reported information. All LDIP activities rely on cooperation among headquarters and regional offices and state, tribal, and local agencies.

Conclusion

Automated geocoding techniques were vital to initiating this project and making progress on the original challenges expressed by the Locational Accuracy Task Force. As EPA moves forward to meet environmental challenges, GIS technology will further enable environmental protection in a geographic context; environmental protection is inherently spatial. Accurate locational data for all features are critical to the integrity of any particular environmental analysis. This is magnified at the community level and will become more important as communities themselves take a more active role in environmental protection.

Acknowledgments

The opinions expressed in this paper do not necessarily represent the views of EPA, the SDC or INDUS Corporation.

The mention of any hardware, software or other commercially available product or service does not necessarily constitute an endorsement of that product or service.

The authors wish to acknowledge the contributions by Loren Hall, Vickie Damm, Tony Selle, Cheryl Henley, Magdy Aziz, Steve Andrews, Anupam Tandon, Vladimir Entin, Sui Kho, Bryan McEnaney, Justin Brown, and Tracey Szajgin.

References

Locational Data Policy Implementation Guidance, Guide to Policy, 220-B-92-008, Office of Administration and Resources Management, March 1991.

Information Resources Management Policy Manual, EPA Directive Number 2100, Chapter 13 - Locational Data, April 8, 1991.

Method Accuracy Description (MAD), Version 6.1, Information Coding Standards for the U.S. Environmental Protection Agency's Locational Data Policy (LDP), LDP Sub-Workgroup of the Regional GIS Workgroup, November 7, 1994.

NOAA Technical Memorandum NOS NGS-50, NADCON, The Application of Minimum Curvature-Derived Surfaces in the Transformation of Positional Data From the North American Datum of 1927 to the North American Datum of 1983, Dewhurst, Warren T., January 1990. 30 p.

Locational Accuracy Task Force: Findings and Recommendations, Locational Accuracy Task Force, December 13, 1990.

The following documents are available through the Technical Support Center in EPA's Systems Development Center in Arlington, VA:

Design Document for the Locational Reference Tables (LRT) in Envirofacts, SDC-0055-091-RP-6013, January 21, 1997.

Design Document for the Geospatial Component of ESDLS Version 2.0, SDC-0055-065-KF-5034, September 30, 1997.

Locational Data Improvement Project Plan, SDC-0055-065-JB-5057A, September 25, 1996.

Latitude/Longitude Values Report, SDC-0055-091-SA-6033, March 3, 1997.

Locational Data Policy Compliance Assessment Report, SDC-0055-091-TU-6021B, March 31, 1997.

Locational Data Policy Compliance Assessment Report, SDC-0055-065-MC-4013C, September 29, 1995.

Locational Data Quality Assurance Report, SDC-0055-065-MC-4014C, September 29, 1995.


Author Information

Andrew T. Battin, Spatial Data Management Program Coordinator
U.S. Environmental Protection Agency, 401 M St. SW
Washington, DC 20460.
202/260-3061, battin.andrew@epamail.epa.gov

Charles D. Catlin, Spatial Data Management Program
U.S. Environmental Protection Agency, 401 M St. SW
Washington, DC 20460.
202/260-3069, catlin.dave@epamail.epa.gov

Robert G. Palmer, EPA Systems Development Center, INDUS Corporation
200 North Glebe Road, Suite 300
Arlington, VA 22203.
703/908-2610, palmer.rob@epamail.epa.gov

Theresa A. Urban, EPA Systems Development Center, INDUS Corporation
200 North Glebe Road, Suite 300
Arlington, VA 22203.
703/908-2016, urban.theresa@epamail.epa.gov