In-Line Geocoder - The Next Generation

Reno Fiedler, STC
Stephanie Fiedler, STC
Brad Tathum, STC

It is estimated that 80% of all data has a spatial component, such as an address. However, addresses in relational databases have been kept almost exclusively in semi-free form and used for reference and mailing purposes only. Current attempts to use these databases in new applications, such as GIS, have made many database administrators (DBAs) aware of the limitations of such addresses. Misspelled words, inconsistent acronyms, and missing information are only the beginning. It is computationally difficult to link free-form addresses to locations on maps. This paper discusses the challenges, benefits, and real-world experiences of implementing a next-generation In-Line Geocoder developed at Scientific Technologies Corporation.


Introduction

Geocoding has been described as:

  • The process of creating geographic coordinates for graphically referenced tabular data (Minnesota State Government Information, 2000)
  • The conversion of spatial information into computer-readable form (Clarke, 1997)
  • A method of deriving spatial coordinates for tabular data stored in addressable form (Dempsey, 2000)
  • A way of extending an address by relating it to a particular location within a spatial reference frame thereby enabling its use in Geographic Information Systems (GIS) applications (Masse and Fiedler, 2001)

    Addresses cannot be displayed on a map as they stand because they lack a computer-readable spatial reference. Geocoding allows the derivation of precise coordinates on the surface of the earth by assigning x- and y-coordinates to point locations. Geocoding is often the responsibility of a central mapping agency (Clarke, 1997).

    Geocoding enhances data so the inherent information contained within addresses can be harnessed by the researcher, planner, developer, or analyst. Once its data points are placed in a spatial context, the database can be used within a GIS, which increases the functionality of the database itself.

    Additional benefits of geocoding:

  • Provides a means to perform data quality assurance and field input validation
  • Provides more detailed mapping and analysis capabilities
  • Allows for the summarization of information by any spatial unit (e.g. legislative district, census tract, street)
  • Provides for market analysis of areas
  • Allows for the investigation of spatial patterns in the data
  • Enhances decision-making capabilities
  • Provides information to support site selection

    The spatial level of geocoding depends upon the intended uses of the data, and the information a database must contain varies with that level. For example, to geocode at the zip code level, the database requires only one field, zip code. Geocoding street addresses requires the most information. Therefore, statistical analysis with map units equal to or larger than zip codes is less demanding on address quality than routing applications that, for example, are intended to direct emergency vehicles to a location.

    Geocoding

    Process

    Geocoding can be accomplished with stand-alone applications or with geocoding functions built into individual software packages (e.g. ArcView® Geocoder, Centrus Geocoder). Typically, geocoding takes place on addresses already contained within a database. The records are packaged and exported for batch processing within the stand-alone geocoder or the internal geocoder of the software. This process can be time consuming and costly.

    The geocoding process utilizes at least two data sets: one containing the address information without map position information and the other containing a reference street map or other address dictionary with known map position information. Other data sources (e.g. USPS data sets) may provide useful information to increase the match rate between the address data set and the street data set. A software package links records in the two data sets by matching street names and addresses. Successful matches result in the addition of the map information, usually as latitude and longitude coordinates, to the original address database.
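
    The linking step can be sketched in a few lines of Python. This is a minimal illustration, not the method of any particular geocoder: the Segment layout, the sample street, and the linear interpolation along the segment are all assumptions made for the example. It matches a house number against the address range of a reference street segment and returns an estimated position.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class Segment:
    """One reference street segment (hypothetical layout)."""
    street: str
    zip_code: str
    from_addr: int               # lowest house number on the segment
    to_addr: int                 # highest house number on the segment
    start: Tuple[float, float]   # (lat, lon) at from_addr
    end: Tuple[float, float]     # (lat, lon) at to_addr


def match_address(number: int, street: str, zip_code: str,
                  segments: List[Segment]) -> Optional[Tuple[float, float]]:
    """Return an interpolated (lat, lon), or None if no segment matches."""
    for seg in segments:
        if seg.street != street or seg.zip_code != zip_code:
            continue
        if not seg.from_addr <= number <= seg.to_addr:
            continue
        span = seg.to_addr - seg.from_addr
        # Fraction of the way along the segment (assumed linear).
        t = 0.0 if span == 0 else (number - seg.from_addr) / span
        lat = seg.start[0] + t * (seg.end[0] - seg.start[0])
        lon = seg.start[1] + t * (seg.end[1] - seg.start[1])
        return (lat, lon)
    return None


# Made-up reference data for illustration only.
SEGMENTS = [
    Segment("MAIN ST", "00000", 100, 198, (32.0, -110.0), (32.0, -109.0)),
]
```

    A call such as match_address(149, "MAIN ST", "00000", SEGMENTS) returns a point halfway along the sample segment, while an address outside every range returns None and would count as unmatched.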

    Considerations

    Shortcomings
    Inherently, the completeness and correctness of both databases and the sophistication of the geocoder determine the accuracy of the geocoding process. It is important to note that the completeness and accuracy of data in address-ranged street network databases vary from area to area. Address matching is often particularly problematic in rural areas or areas where people rely on post office boxes. Additionally, if a relatively large number or a key set of cases is geocoded to particular street segments, it is worth verifying by field survey or some other method that the street names and address ranges on those segments are correct. Particular attention should be paid when the subsequent GIS processing of the case database involves assignment of cases to areas, especially to units like census blocks for which boundaries correspond to street center lines.

    Potential Errors
    Type I
    Occurs when addresses have not been geocoded although the address is valid. This is the most common error and is caused by incomplete street data or a failure of the geocoder to parse the address correctly.

    Type II
    Occurs when addresses are matched to the wrong street segment or zip code. This error may be caused by bad or incomplete address information or by incorrect parsing and cleaning of the address by the geocoder. Geocoders attempt to correct irregularities using two basic approaches. Some geocoders clean an address by using third-party data sets, such as the USPS product suite. Others change or auto-complete parts of the address based on 'best guesses'.

    Some corrections of address irregularities may actually cause Type II errors. A typical example is the street prefix 'N' being replaced by 'S': the geocoder cannot find the address on the north side of town, assumes a user input error, and may incorrectly match the same street on the south side of town. Geocoders deal with irregularities differently, resulting in different degrees of reliability in their results. Usually, geocoders indicate the level of correction employed to locate the address on a street network through an exception code.
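
    The prefix example can be made concrete with a short sketch. The numeric codes and the lookup structure below are hypothetical, since the actual exception codes are not listed here; the point is only that a 'best guess' correction should always be reported alongside the result.

```python
OPPOSITE = {"N": "S", "S": "N", "E": "W", "W": "E"}

# Hypothetical exception codes, chosen for this example only.
EXACT_MATCH = 0
DIRECTION_CHANGED = 1
NOT_FOUND = 9


def geocode_with_code(number, prefix, street, reference):
    """Look up (number, prefix, street) in a dict of known locations.

    If the exact address is missing, retry with the opposite street
    prefix and flag the result so the guess is visible to the caller.
    """
    key = (number, prefix, street)
    if key in reference:
        return reference[key], EXACT_MATCH
    flipped = (number, OPPOSITE.get(prefix, prefix), street)
    if flipped != key and flipped in reference:
        # Best guess: this may be a Type II error, so it must be coded.
        return reference[flipped], DIRECTION_CHANGED
    return None, NOT_FOUND
```

    A consumer of the results can then accept only EXACT_MATCH records for demanding uses such as routing and still use DIRECTION_CHANGED records for coarse statistical mapping.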

    Address Cleaning

    A precursor to geocoding is database address analysis and cleaning. Before the address information contained in a database can be geocoded, the addresses undergo cleaning to ensure storage consistency and to remove ambiguity where possible. Free-form addresses may need to be manually corrected to fit an address style (such as U.S. Streets). Address cleaning involves making educated guesses about the addresses and using additional address data sets to enhance them. This cleaning process is intended to improve overall geocoding accuracy. The level of address accuracy required depends upon the intended usage of the data, as discussed above: statistical analysis with map units equal to or larger than zip codes is less demanding on address quality than routing applications.
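
    A minimal sketch of the consistency part of address cleaning follows. The abbreviation table is a small illustrative sample, not a complete standard, and the function name is an assumption; real cleaning would also draw on external address data sets as described above.

```python
import re

# Small illustrative abbreviation table; a production table would be
# far larger and maintained by the system administrator.
ABBREVIATIONS = {
    "STREET": "ST", "AVENUE": "AVE", "BOULEVARD": "BLVD", "ROAD": "RD",
    "NORTH": "N", "SOUTH": "S", "EAST": "E", "WEST": "W",
}


def clean_address(raw: str) -> str:
    """Upper-case, strip punctuation, collapse spaces, standardize words."""
    text = re.sub(r"[.,#]", " ", raw.upper())
    tokens = [ABBREVIATIONS.get(tok, tok) for tok in text.split()]
    return " ".join(tokens)
```

    With this sketch, "4400 east Broadway  Boulevard." and "4400 E. BROADWAY BLVD" both become "4400 E BROADWAY BLVD", so they match the same reference record. Note that the naive word replacement would also shorten a street that is legitimately named "North", which is one way cleaning can introduce errors.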

    Geocoding software packages achieve different success rates. This may be due to additional data sets that are included with some geocoders or to the level of effort that has been put into their design. Valid addresses should perform well under most geocoders. However, high-end geocoders may show better resilience to address data irregularities.

    Improving address quality in databases may be the single most important step in meeting the challenges of the spatially enabled environment of modern organizations.

    Geocoder

    Stand Alone

    Traditional geocoding software packages allow for batch processing of data, usually at designated times. These batches of data are exported for use in external geocoding software or an external service and then imported once the map positions have been added. Details regarding the geocoding match rate are usually provided.

    In-Line Geocoder

    A new breed of geocoder is dedicated to In-Line Geocoding. These geocoders perform two functions: free-form address cleaning/validating and geocoding. Unlike traditional geocoders, they complete input validation before or at data input to the database.

    The In-Line Geocoder allows for the automatic geocoding of an address record, within a database, as soon as it is entered. The In-Line Geocoder can geocode single records or a batch of records.

    The following list summarizes typical features of In-Line Geocoders:

  • Runs stand alone and/or inside of databases
  • Uses multiple street data sources
  • Preferably uses publicly available data sets
  • Operates in-line, without user interaction
  • Reacts to triggers in databases
  • Requires no uploading/downloading of data
  • Provides status codes for later re-geocoding. These status codes tell why an address failed if it was not geocoded.
  • Acts as a quality assurance and field validation tool for address information
  • Lets system administrators supply their own data for streets, zip codes, and abbreviations
  • Offers APIs in common programming languages, such as Java
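
    Two of the features listed above, reacting to database triggers and recording a status code, can be illustrated with Python's built-in sqlite3 module. This is only a sketch under stated assumptions: the table layout, the function names, the status values 'OK' and 'NO_MATCH', and the in-memory reference data are invented for the example and do not describe STC's product.

```python
import sqlite3

# Hypothetical reference data: cleaned address -> (lat, lon).
REFERENCE = {"100 N MAIN ST": (32.25, -110.95)}


def geo_lat(address):
    hit = REFERENCE.get(address)
    return hit[0] if hit else None


def geo_lon(address):
    hit = REFERENCE.get(address)
    return hit[1] if hit else None


def geo_status(address):
    # Hypothetical status codes explaining the outcome.
    return "OK" if address in REFERENCE else "NO_MATCH"


def open_database():
    """Create an in-memory database that geocodes every inserted row."""
    conn = sqlite3.connect(":memory:")
    # Expose the Python functions to SQL so the trigger can call them.
    conn.create_function("geo_lat", 1, geo_lat)
    conn.create_function("geo_lon", 1, geo_lon)
    conn.create_function("geo_status", 1, geo_status)
    conn.executescript("""
        CREATE TABLE records (
            id INTEGER PRIMARY KEY,
            address TEXT,
            lat REAL,
            lon REAL,
            status TEXT
        );
        CREATE TRIGGER geocode_on_insert AFTER INSERT ON records
        BEGIN
            UPDATE records
               SET lat = geo_lat(NEW.address),
                   lon = geo_lon(NEW.address),
                   status = geo_status(NEW.address)
             WHERE id = NEW.id;
        END;
    """)
    return conn
```

    Any application that inserts an address into the records table gets coordinates and a status code written back without an export step or user interaction, which is the behavior the feature list describes.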

    Advantages of In-Line Geocoding

    By geocoding address records as they enter the database, the in-line geocoder removes the need to export address data to an external geocoding process or commercial service. This removes the cumbersome upload/download step and ensures data integrity. An additional benefit of the in-line geocoder is that it may utilize the frequently updated TIGER/Line street data from the U.S. Census Bureau, which is inexpensive and publicly available, whereas commercial street data sets are costly.

    Most commercial geocoding services require that their street data sources be obtained and used exclusively. Additionally, commercial geocoders often provide a match code as a percentage. STC's In-Line Geocoder adds exception codes to the result set that tell any user what happened to each address record. Records can be queried by exception code and re-geocoded later when new street data become available.
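
    The re-geocoding workflow might look like the following sketch. The record layout and the code values 'OK' and 'NO_MATCH' are hypothetical stand-ins for the actual exception codes, which are not listed in this paper.

```python
def regeocode_failed(records, reference, retry_codes=("NO_MATCH",)):
    """Retry only records whose stored exception code is in retry_codes.

    records   -- list of dicts with 'address', 'lat', 'lon', 'code' keys
    reference -- updated street data as a dict: address -> (lat, lon)
    Returns the number of records that were newly geocoded.
    """
    updated = 0
    for rec in records:
        if rec["code"] not in retry_codes:
            continue  # already geocoded, or failed for another reason
        hit = reference.get(rec["address"])
        if hit is not None:
            rec["lat"], rec["lon"] = hit
            rec["code"] = "OK"
            updated += 1
    return updated
```

    Because the selection is driven by the stored code, records that matched earlier are never touched, and only the failures that new street data could plausibly resolve are retried.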

    Limitations

    In-Line geocoders do not perform well on non-standardized address data sets. If an address is ambiguous and fits multiple street segments, the user cannot pick and choose where the address is located, as is possible with a traditional geocoder. The solution is to export these ambiguous addresses and process them separately.

    Case Study

    Scientific Technologies Corporation has developed the In-Line Geocoder 2.0, Java version. This In-Line Geocoder has been implemented within STC's immunization registries and other Public Health databases.

    Process

    Reports from local hospitals and doctors are sent via Internet transmission to the gateway. The gateway is a software application that acts as the entry point to the immunization registry. Within this gateway software, the data is processed in three ways: it undergoes Quality Assurance procedures, it is de-duplicated to remove duplicate records, and it is geocoded. These three procedures ensure that the data is correct, consistent, and spatially enabled before it is inserted into the Master Tables.
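
    The three gateway steps can be pictured as a small pipeline. The sketch below is purely illustrative: the field names, the rule that a record needs a name and an address, and the exact-match duplicate test are assumptions and are far simpler than what a real registry gateway would apply.

```python
def quality_check(record):
    """Hypothetical QA rule: a record needs a name and an address."""
    return bool(record.get("name")) and bool(record.get("address"))


def process_batch(incoming, master, geocode):
    """Apply QA, de-duplication, and geocoding before inserting.

    incoming -- list of record dicts received by the gateway
    master   -- list standing in for the Master Tables (modified in place)
    geocode  -- function mapping an address to a location or None
    Returns the records rejected by the quality check.
    """
    seen = {(r["name"], r["address"]) for r in master}
    rejected = []
    for record in incoming:
        if not quality_check(record):
            rejected.append(record)
            continue
        key = (record["name"], record["address"])
        if key in seen:
            continue  # duplicate of an existing or earlier record
        seen.add(key)
        master.append({**record, "location": geocode(record["address"])})
    return rejected
```

    Only records that pass all three steps reach the master list, so everything stored there already carries a location (or an explicit None when the geocoder found no match).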

    Benefits

    Prior to the In-Line Geocoder, the data sets had to be exported, on average, every quarter, which took an average of two man-hours of work. The exported data was then geocoded by a GIS specialist, which took another four man-hours. Finally, the data was imported into the database, which required an additional two man-hours. In total, geocoding the data with a stand-alone geocoder took an average of eight man-hours per cycle, not including communications and overhead.

    Conversely, the In-Line geocoder keeps the data consistent at all times and offers near "real-time" mapping possibilities, reducing the man-hours required for geocoding to none. Therefore, the In-Line geocoder pays for itself.

    Lessons Learned

    Following STC's In-Line Geocoder implementation, several points came to light.

  • Do not commit to the implementation unless there is a strong business case and commitment
  • Review all available data sets from USPS, utility companies, etc.
  • Clearly define the business objective (quick and dirty tool? commercial software? integration piece?)
  • Decide on a design strategy (specialized format versus utilizing unchanged street data)
  • Partner with the clients for feedback

    Summary

    STC's In-Line Geocoder is a success. Through this implementation a clear objective was defined: to make the In-Line Geocoder an independent tool. The implementation showed that the In-Line Geocoder was useful for address analysis, geocoding, and internationalization. The resulting geocoder is easy to use and requires little training. The In-Line Geocoder is a key step in spatially enabling an enterprise and is capable of working in a distributed environment.

    References

    Clarke, K.C. 1997. Lecture 4: Geocoding [online]. Prentice Hall. Available at http://www.geog.ucsb.edu/kclarke/G128/Lecture04.html [Accessed June 14, 2001].

    Dempsey, C. 2000. A Geocoding Primer: with an example using ArcView [online]. About The Human Internet. Available at http://gis.about.com/science/gis/library/weekly/aa053100a.htm [Accessed June 14, 2001].

    Masse, J. and Fiedler, S. 2001. STC's In-Line Geocoder® Press Release [online]. Available at http://www.stchome.com/pressreleases [Accessed June 14, 2001].

    Minnesota State Government Information. 2000. Geocoding and Address Matching [online]. Available at http://www.state.mn.us/intergov/metrogis/address/geocd.htm [Accessed June 14, 2001].



    Reno Fiedler, Director GIS Services
    Stephanie Fiedler, GIS Analyst
    Brad Tathum, Sr. GIS Analyst and Trainer

    Scientific Technologies Corporation, STC
    4400 E Broadway Blvd, Suite 705
    Tucson, AZ, 85750
    (520) 202 3333
    (520) 202 3340 (fax)
    www.stchome.com
    reno_fiedler@stchome.com