Jack Mills

Tain't Necessarily So: Address Geocoding in the Real World

Vagaries associated with spatial and address data require users of geocoding software to make many quick decisions in order to create timely, accurate information. Paramount among those decisions is a method for standardizing street addresses. This paper will reveal the methods that were used to rework voter files so that ArcView GIS software's geocoding functions could be used to help assign voters to their appropriate polling districts. Pitfalls to be avoided to improve accuracy and efficiency will also be discussed.


INTRODUCTION

Address geocoding (AG) is a function for which geographic information systems (GIS) software is particularly suited. Using a complex algorithm, AG adds a point to a virtual map at a calculated X,Y location which represents the coordinates of an address that is listed in an events table. Such a table can contain a variety of events including names of customers, competitors, students, etc. and their addresses. In this report, the events represented registered voters in Smith County, Texas.

The Smith County Clerk administers elections for school board members and school bond initiatives for the 10 independent school districts (ISD) located within the County's boundaries, for the seven Single Member districts within the Tyler ISD, and for the Tyler Junior College district. The Smith County Clerk contracted with the Office of Research Services/Geographic Information Systems Lab (UTT GIS) at the University of Texas at Tyler to assign registered voters to their appropriate school district. UTT GIS's task was to append school district codes, determined through the AG process, to each voter's record in the database of registered voters furnished by the Smith County Clerk and return the updated database. Because elections can be decided by a single vote, the accuracy of the AG was paramount. This paper reports the many problems encountered (and their solutions) during the completion of the AG task.

Following this Introduction is a Background of the project wherein the state of street addresses and of addresses incorporated in the voters database is described. Following the Background is an explanation of the problems posed by the database and the street maps and the limitations inherent in the ArcView (AV) geocoding software. Next are the details of the procedures followed in overcoming or working around those problems. A conclusion brings the paper to a close.

BACKGROUND

Basically, AG programs establish locations of customers, competitors, students, etc. by matching addresses in the events file to addresses in the map file. They are designed to compare the house number and street name in the events file to the range of house numbers and the street name in the map file. Because the programs literally interpret house numbers and street names, those elements must be consistently formatted. When there are inconsistencies, such as a street name spelled 'Hillcreek' in the events file and 'Hill Creek' in the map file, time-consuming intervention is the only solution.

In a great majority of AG projects, customers, competitors, students, etc. live in cities, towns, or villages where streets are laid out in block grids, streets have East and West or North and South sides, and buildings on one side of the street are assigned even numbers while those on the other side are given odd numbers. Most streets are named (e.g., AGGIE; PATRIOT; LONGHORN) or, if numbered, are in ordinal format (e.g., 5TH; 98TH; 123RD). Most house numbers follow a regular pattern (e.g., 1001 through 1299 on one side, 1000 through 1298 on the other).
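The odd/even range convention can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `Segment` class, its field names, and the `side_of_street` function are hypothetical and are not part of any GIS package.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Segment:
    """One street segment with an address range for each side (hypothetical layout)."""
    name: str
    left_from: int
    left_to: int
    right_from: int
    right_to: int


def side_of_street(house_number: int, seg: Segment) -> Optional[str]:
    """Return 'L' or 'R' if the number fits a side's range and parity, else None."""
    for side, start, end in (("L", seg.left_from, seg.left_to),
                             ("R", seg.right_from, seg.right_to)):
        lo, hi = min(start, end), max(start, end)
        # The number must lie inside the range and share the range's odd/even parity.
        if lo <= house_number <= hi and house_number % 2 == lo % 2:
            return side
    return None


seg = Segment("LONGHORN", 1001, 1299, 1000, 1298)
print(side_of_street(1105, seg))  # odd numbers fall on the left side of this segment
```

A number outside both ranges returns `None`, which corresponds to an address that cannot be placed on the segment.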

The addresses assigned to those customers, competitors, students, etc. are typically stored in standard United States Postal Service (USPS) format (house number, prefix direction, street name, street type, suffix direction, city, state, and ZIP code). Also, the prefix or suffix directions, the street type, and the state have been shortened to the official USPS abbreviations. Examples of such addresses are 3900 UNIVERSITY BLVD, TYLER TX 75799-0001, 7107 KICKAPOO ST, CHANDLER TX 75758-2310, and 1213 BARTO ST, ARP TX 75750.

Maps used in many AG projects reflect years of updating and correcting because they have been shared by many organizations and agencies whose employees have discovered and corrected errors before returning the maps to the sharing pool. Such intentional cooperation leads to improved AG output coming from all organizations and agencies involved. These maps typically have been edited to remove overlapping address ranges. They contain consistent left side-right side/odd number-even number designations. The street names for these maps have been compared to those in the events files, and all spelling and capitalization differences have been reconciled.

A fourth factor in most AG projects is the accuracy or match rate. Clearly, the match rate's acceptability is a function of how the geocoded addresses are to be used. Most everyone involved with an AG project is very satisfied if 80% of the customers, competitors, students, etc. are matched to the project's street map. If the match rate is above 90%, they all are elated.

Now that a typical AG project has been outlined, it will be informative to see how our current project differs. Smith County, Texas has many areas where streets and addresses follow the patterns just described. However, more than a third of Smith County's 98,000+ registered voters live on roads and highways whose designations include cardinal numbers as their final element (e.g., Farm-to-Market Road 117; County Road 21; State Highway 110; US Highway 69). That format can play havoc with AG matching algorithms.

Also, those roads and highways have, in most cases, been assigned street names and local address ranges over the sections that pass through the cities, towns, and villages (State Highway 110 is variously called Van Highway, Gentry Parkway, Glenwood Boulevard, 4th Street, 5th Street, Beckham Avenue, and Troup Highway within Tyler's city limits). Many early residents registered to vote at a time before outlying areas were annexed or incorporated and before the formerly rural roads were taken into the city's or town's street naming and numbering plan. Others, when they registered, listed only their rural route and box number address or their post office box number. Many others, who still reside outside city or town boundaries, registered to vote before a countywide rural addressing scheme was completed. All of these factors resulted in many outdated addresses appearing in the database of voters provided to UTT GIS by the Smith County Clerk.

An accurate centerline street map of Smith County was not available in digital format. Therefore TIGER-Line files were purchased, with the intention of adjusting the map records and/or the voters records so that the two could be matched.

Data conversion procedures

Voter records were received in .csv format from the County's data processing vendor. A typical entry was:
00000001,HOLLEY,BUDDY,454248122,12311919,M,0021,JAN,AVE,,,3932,,,TYLER,757010000,,,,
The following awk script was used to select the necessary elements and to arrange them in an order that the AG program could use:
awk -F, 'BEGIN { OFS = "," }  # comma-delimited output; adding 0 strips leading zeros
  { print $1+0, $2, $3, $7+0, $12, $10, $8, $9, $14, $15, substr($16, 1, 5) }' \
  UTVOTERS.CSV >> SMITHVOT.TXT
What resulted was a file in which the records were in ASCII format, such as this example:
1,HOLLEY,BUDDY,21,3932,,JAN,AVE,,TYLER,75701
Those records were added to a specially created INFO file.

The map coordinates and attributes were converted to an ARC/INFO line coverage from the shapefile format in which they were received. The line coverage was prepared for AG by using the ADDRESSCREATE and ADDRESSBUILD commands. The NOPARSE option was used to keep the street name elements (e.g., County Rd 135) together in one field.

Two maps accompany this section: a map of the roads and streets in Smith County, Texas, and a map of the roads and streets with OVRLAP errors in Smith County, Texas.

Map of roads and streets, Smith County, Texas
Map of roads and streets with OVRLAP errors, Smith County, Texas

PROBLEMS

The street map was made up of 18,624 segments. Each segment had linked to it a record containing beginning and ending address numbers for its left and right sides, such as LEFTADD1 = 13400, LEFTADD2 = 13598, RGTADD1 = 13401, RGTADD2 = 13599. That record also included a street name, city name, and ZIP codes for the street's left and right sides. If part of the address, the elements for prefix direction, street type, and suffix direction were included. One of the steps in preparing the street map for AG was the identification of possible errors. The ADDRESSERRORS command was used for that purpose. It uncovered 5,315 overlap errors, a result of one segment's beginning or ending address numbers falling within the range of address numbers in another segment. Many of those overlaps were caused by zero (0) values appearing in the To and From addresses in those cases where To and From addresses were missing.
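The kind of overlap described above can be illustrated with a short Python sketch. It is not the ADDRESSERRORS algorithm, whose internals are not documented here; it simply assumes that two segments of the same street overlap when their address ranges on one side intersect. The function names and the sample segment numbers are hypothetical.

```python
from itertools import combinations


def ranges_overlap(a, b):
    """True if two (from, to) address ranges share at least one number."""
    (a_lo, a_hi), (b_lo, b_hi) = sorted(a), sorted(b)
    return a_lo <= b_hi and b_lo <= a_hi


def find_overlaps(segments):
    """segments: list of (segment_id, street_name, from_addr, to_addr) for one side."""
    by_street = {}
    for seg_id, name, from_addr, to_addr in segments:
        by_street.setdefault(name, []).append((seg_id, (from_addr, to_addr)))
    overlaps = []
    for name, segs in by_street.items():
        # Compare every pair of segments that carry the same street name.
        for (id_a, rng_a), (id_b, rng_b) in combinations(segs, 2):
            if ranges_overlap(rng_a, rng_b):
                overlaps.append((name, id_a, id_b))
    return overlaps


segments = [
    (1, "CR 21", 13400, 13598),
    (2, "CR 21", 13600, 13798),
    (3, "CR 21", 0, 0),  # missing range stored as zeros
    (4, "CR 21", 0, 0),  # a second missing range on the same street
]
print(find_overlaps(segments))
```

In this made-up sample the two properly numbered segments do not collide, but the two zero-filled segments do, which is consistent with the observation that missing ranges stored as zeros inflate the overlap count.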

In hopes of bypassing these errors and hurrying along the project, it was decided to convert the voters' addresses into ZIP+4 numbers and match those to the 9-digit ZIP code layer that came with the purchased streets files. All voter addresses were entered into Semaphore Corporation's (http://www.semaphorecorp.com) ZP4 program. It assigned the official USPS 9-digit ZIP code to the records. In ArcView, those ZIP codes were matched to the ZIP4Centroid layer and points placed on the map. Then polygons of the county's school districts were overlain on the voters location points. At that time, each voter's record had attached to it the number of the school district polygon in which the voter lived. That way, each voter was assigned to a particular school district.
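The overlay step amounts to a point-in-polygon test for each voter location. The sketch below uses the standard ray-casting method in plain Python; it is not the ArcView implementation, and the district codes and coordinates are invented for illustration.

```python
def point_in_polygon(x, y, polygon):
    """Ray-casting test: polygon is a list of (x, y) vertices in order."""
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        # Does a horizontal ray from (x, y) cross this edge?
        if (y1 > y) != (y2 > y):
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside


def assign_district(x, y, districts):
    """Return the code of the first district polygon containing the point, else None."""
    for code, polygon in districts.items():
        if point_in_polygon(x, y, polygon):
            return code
    return None


# Two hypothetical, adjacent square districts sharing the boundary x = 10.
districts = {
    "ISD_A": [(0, 0), (10, 0), (10, 10), (0, 10)],
    "ISD_B": [(10, 0), (20, 0), (20, 10), (10, 10)],
}
print(assign_district(9.5, 5, districts))
print(assign_district(10.5, 5, districts))
```

Two points only one unit apart land in different districts, so the result is only as good as the placement of each point relative to the boundary.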

To test the accuracy of this output, points on either side of several school district boundaries were selected and the polygon numbers assigned to those locations were compared to the school district in which those voters were known to be located. Time and time again that test revealed errors. Neighbors, living along the same street but on opposite sides of a school district boundary, had the same school district number assigned to their records. After a thorough analysis of both the project's status at that point in time and the chances that a speedy solution could be created, it was decided to abandon the 9-digit ZIP code approach to AG.

To summarize the problems, digital street maps contained more overlap errors than could be corrected within a reasonable time period, many voters lived on roads the names for which had been changed since they registered to vote, and many others lived on roads with names that included cardinal numbers as their final element. All these combined with the need for utmost accuracy to separate this assignment from the standard AG undertaking.

SOLUTIONS

ARC/INFO, through the NOPARSE option, permits users to have complete control over how addresses are broken into the elements needed by ADDRESSMATCH. It was decided, therefore, to use ARC/INFO. To get an idea of the best approach to take, the INFO file of voters addresses was matched to the TIGER-file street coverage. Other than changing "County Road" to "CR" and "FM Road" to "F-M Rd," the latter being the format adopted by the TIGER-file street coverage, no attempt was made to identify and correct errors in either the INFO file or the street coverage. Offset and squeeze factors were set to 10 feet, the NOREJECTS option was chosen, and the minimum matching score was reduced to 90 from 100.
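The two name substitutions mentioned above could be scripted along the following lines. This is a minimal sketch, not the procedure actually used on the project; the `RULES` table and the `standardize_street` function are hypothetical names, and only the two replacements named in the text are included.

```python
import re

# Only the two substitutions named in the text; matching ignores letter case.
RULES = [
    (re.compile(r"^COUNTY ROAD\b", re.IGNORECASE), "CR"),
    (re.compile(r"^FM ROAD\b", re.IGNORECASE), "F-M Rd"),
]


def standardize_street(name: str) -> str:
    """Collapse extra spaces, then apply each replacement rule once."""
    cleaned = " ".join(name.split())
    for pattern, replacement in RULES:
        cleaned = pattern.sub(replacement, cleaned, count=1)
    return cleaned


print(standardize_street("County  Road 21"))
print(standardize_street("FM ROAD 117"))
```

Keeping the rules in one table makes it easy to see which formats have been reconciled between the events file and the street coverage.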

Out of 93,700 records, 65,161 were matched to the coverage and 28,539 were rejected. Because the matching score was set so low, all records that were rejected had scores of -1 (No matching name), -2 (No matching number), or -3 (Multiple matches of same score). The next step was to use FREQUENCY to determine how often various street names and associated scores appeared in the rejects file. With that information, we first changed voters data or the street coverage for those errors which occurred most often and scheduled for a later time changes to the data or coverage that appeared fewer times in the rejects file.

The process of correcting address errors was slow and arduous. It began with the sorting of the frequency file by frequency first and street name next. Working down that list from the most frequently rejected street to the least, with each street's error code and name at hand, the street coverage's .aat file was searched for all segments with that street name attached. If the error code was -1 (No matching name), the search was made of the ALTNAME1 item, knowing that residents who had registered to vote some time ago might live on roads the name of which had been changed over the years. Searches for streets with error codes -2 or -3 were made of the street name item.
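The tabulate-and-sort step can be sketched in Python as follows. It stands in for the FREQUENCY command only conceptually; the function name and the sample reject records are hypothetical.

```python
from collections import Counter


def reject_frequencies(rejects):
    """rejects: iterable of (street_name, score) pairs taken from a reject file."""
    counts = Counter(rejects)
    # Highest frequency first; ties are ordered by street name, then by score.
    return sorted(counts.items(),
                  key=lambda item: (-item[1], item[0][0], item[0][1]))


rejects = [
    ("CR 21", -1), ("CR 21", -1), ("MAIN ST", -3),
    ("CR 21", -1), ("MAIN ST", -3), ("HILLCREEK", -1),
]
for (street, score), count in reject_frequencies(rejects):
    print(count, street, score)
```

Working from the top of such a list means each correction clears the largest possible number of rejected records.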

When a search of the ALTNAME1 item returned a street name match, the value of the street name was noted and the affected voter data records were changed. If a match was not found, various hard-copy maps were searched and changes, as appropriate, made to the street name item in the voter data records or the street coverage. In those cases where a street that was found on hard copy maps was missing from the coverage, ArcTools Edit (ATE) was started and the new street was added, using usual heads-up digitizing methods. The attribute data for the new street (address ranges, street name, etc.) were added also in ATE.

In the case of -2 errors (No matching number), ATE was used to display the subject streets, to list their address ranges, and to change the address ranges if that action was called for. If a segment of an existing street had been omitted from the TIGER-file coverage, it and its attributes were added in ATE.

When an error resulted from multiple matches (error code -3), ATE was used to display the segments and their attributes. If changes to the coverage were indicated, they were made, as were changes to the voters data file under those circumstances where it was appropriate to do so. However, most -3 errors came about because the necessary prefix direction was missing from a voter's record. For example, an address of 103 Main St cannot be correctly matched to either 101-199 W Main St or 101-199 E Main St and will be properly rejected. The next step in matching voter records rejected with -3 error codes would be to use the "Search White Pages" section of the Excite PeopleFinder (http://peoplefinder.excite.com) or White Pages on MSN (http://search.msn.com/mod_wpages.asp) online search engines, using the voter's name, city, and state as the search variables.
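The three reject codes can be illustrated with a simplified matcher. This sketch is an assumption about the general logic, not the ADDRESSMATCH algorithm: real matching scores are graded, while this version returns 100 for a unique match and otherwise one of the three reject codes. The function name and the dictionary layout are hypothetical.

```python
NO_NAME, NO_NUMBER, MULTIPLE, MATCHED = -1, -2, -3, 100


def match_address(number, prefix, name, street_type, segments):
    """Return MATCHED or a reject code for one address against a list of segments."""
    # Candidates share the street name and type; a blank prefix cannot rule any out.
    candidates = [s for s in segments
                  if s["name"] == name and s["type"] == street_type
                  and (not prefix or s["prefix"] == prefix)]
    if not candidates:
        return NO_NAME
    in_range = [s for s in candidates
                if min(s["from_addr"], s["to_addr"]) <= number
                <= max(s["from_addr"], s["to_addr"])]
    if not in_range:
        return NO_NUMBER
    if len(in_range) > 1:
        return MULTIPLE
    return MATCHED


segments = [
    {"prefix": "W", "name": "MAIN", "type": "ST", "from_addr": 101, "to_addr": 199},
    {"prefix": "E", "name": "MAIN", "type": "ST", "from_addr": 101, "to_addr": 199},
]
print(match_address(103, "", "MAIN", "ST", segments))   # prefix missing
print(match_address(103, "W", "MAIN", "ST", segments))  # prefix supplied
```

With the prefix direction missing, both Main St segments qualify equally and the record is rejected; supplying the prefix resolves the tie.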

As corrections to the street coverage, to the voters database, or to both were made throughout the project, new ADDRESSMATCH runs were performed and the resulting reject files, each smaller than its predecessor, were generated and subsequently edited and processed.

The other part of the project, which saw the creation of boundary maps for the seven Single Member districts of the Tyler ISD, and the conversion of the boundary maps for Tyler Junior College and the County's 10 ISDs, also enjoyed its own challenges. The toughest was digitizing the Tyler ISD map. Attempts to SNAP boundary arcs to the background street coverage of Tyler led to unacceptable output. Next, a COPY of the Tyler street coverage was edited. Streets that represented boundary lines were left in place while other streets were selected and deleted. That too failed to meet the precision requirements. The final, successful attempt also used a copy of the Tyler street coverage. The changes invoked included DENSIFYing both the original and the copy of the Tyler street coverage and SNAPping the copy coverage to the original before non-boundary streets were selected and deleted. That produced a very accurate boundary.
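The idea behind densifying can be sketched as follows, assuming it means inserting extra vertices so that no two consecutive vertices are farther apart than a chosen spacing. This is a conceptual illustration only, not the DENSIFY command itself; the function name and the sample line are hypothetical.

```python
import math


def densify(line, max_spacing):
    """Insert evenly spaced vertices so no gap along the line exceeds max_spacing."""
    result = [line[0]]
    for (x1, y1), (x2, y2) in zip(line, line[1:]):
        length = math.hypot(x2 - x1, y2 - y1)
        pieces = max(1, math.ceil(length / max_spacing))
        for k in range(1, pieces + 1):
            t = k / pieces
            result.append((x1 + t * (x2 - x1), y1 + t * (y2 - y1)))
    return result


street = [(0, 0), (10, 0)]
print(densify(street, 4))  # a 10-unit segment is split into three equal pieces
```

With more vertices on both coverages, snapping has many more nearby points to work with, which is one plausible reason the densified copy followed the original so closely.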

Three maps accompany this section: a map of the boundaries of the ISDs in Smith County, Texas; a map of the boundary of the Tyler Junior College District, Smith County, Texas; and a map of the boundaries of the Single Member Districts of Tyler ISD, Smith County, Texas.

Map of boundaries of Independent School Districts, Smith County, Texas
Map of boundary of Tyler Junior College District, Smith County, Texas
Map of boundaries of Single Member Districts, Tyler ISD, Smith County, Texas

After the voters database records had been matched and the boundary coverages produced, IDENTITY was used to stamp the boundaries on top of the point coverages of voters residences, separating the voters into their correct voting district. Three items were added to the Point Attribute Table that the ADDRESSMATCHing process produced: one for the county school district, one for Tyler Junior College district, and one for the Tyler ISD Single Member district.

To summarize, edits to effect corrections to the street coverage and to the events file of voters addresses were scheduled based on the frequency with which street names appeared in the first reject file. Street names showing high frequency were analyzed and corrected first while those with lower frequencies were edited later. Once the matching was completed, overlays of boundary maps were made, thereby generating the information needed to assign voters to their correct voting district. That information was appended to each voter's record and the database was returned to the County's data processing vendor.

CONCLUSION

Address geocoding where rural roads are involved is a very complex undertaking, much more so than the typical AG project in which urban-style addressing is prevalent. Many hours must be spent in correcting and updating street coverages and voters addresses files in order to create the correspondence between the two which is crucial to accurate AG. What is called for is a program that will compare address components of both files before they are read into the AG programs. It must preserve the ability of the AG technician to control the form of the components. And it must simplify the editing process and should do that without requiring the editor to switch between map programs and database programs. Fuzzy logic implementation has advanced to the point where it could be incorporated into an editor program with the express purpose of reducing the choices the AG technician must consider. That program should be a tool to increase the efficiency of everyone involved in address geocoding.

ACKNOWLEDGEMENTS

Thanks to Cynthia Mills of Branch Design Group, Inc., Ken Rush of UT Tyler, Mary Morris, Judy Carnes, and Paula Patterson of the Smith County (Texas) Clerk's office, Jan Funderburgh of 9-1-1 Network of East Texas, Dan Allee of Smith County (Texas) Appraisal District, and to these Esri folks: Patrick Brennan, Robert Nicholas, Chuck Gaffney, Melissa Brenneman, Makram Murad-Al-Shaikh, and Tom Brenneman.


Jack Mills
Coordinator
Office of Research Services/Geographic Information Systems Lab
University of Texas at Tyler
3900 University Boulevard, Room 205, Business Building
Tyler, Texas 75799-0001
Phone 903-566-7366
Fax 903-566-7377
jack.mills@mailexcite.com or
jmills@mail.uttyl.edu