The Utility of Geographical Information Systems (GIS) and Spatial Analysis in Tuberculosis Surveillance in Harris County, Texas, 1995-1998

Matthew L. Stone

ABSTRACT

OBJECTIVE

The main purpose of this research study was to examine the spatial distribution of tuberculosis (TB) cases by area in Harris County, Texas over a three-year period, 1995-1998, using geographical information systems (GIS) software and spatial analytical techniques. In doing so, the study was expected to demonstrate some of the valuable assets of GIS in disease mapping and surveillance. The information gathered is expected to assist public health workers by identifying and providing effective examples of using epidemiologic data, public use statistical software and GIS in formulating study questions, generating and testing hypotheses, and critically evaluating maps that are prepared using GIS software and spatial statistical methods. In addition, useful resources (analytical and descriptive maps) have been produced for the Houston Tuberculosis Initiative, a population-based, active surveillance and molecular epidemiology study of tuberculosis cases reported to the City of Houston Tuberculosis Control Office. It is hoped that, in the future, public health workers and officials will see the added value that GIS can bring to an already well-established disease surveillance group. GIS can also provide data in a form that can be readily communicated to community groups and the general public if necessary.

INTRODUCTION

A Geographic Information System (GIS) is a very important tool for use in disease mapping, as well as in public health surveillance activities that help identify high-risk groups. Because GIS is mapping software with an embedded relational database component, it keeps the management and analysis of public health surveillance data well organized for determining spatial and time trends. Disease cases can be viewed in their surrounding social context, and patterns of their geographical distribution can be analyzed by using various spatial statistical methods that account for differences in location characteristics (e.g., latitude and longitude). In addition, due to its ability to identify and map environmental factors associated with disease vectors, GIS is increasingly important in infectious and vector-borne disease surveillance.^{7,25,26,38,46} All of these characteristics, coupled with ease of use given proper training, bring the mapping of surveillance and disease data within reach of even the smallest health departments. GIS can assist epidemiologists by adding descriptive images that are systematically created according to proper scientific protocol, and it allows for evaluations of potential clusters when combined with robust statistical methods and software. Detailed maps can be produced with GIS and revised any number of times with little effort, enabling the creation of a variety of new types of maps that could be useful in public health management and practice. An ideal outcome would be that communities have the capability to link together health information from various data sources for efficiency and centralization in order to recognize spatial data patterns that may suggest where cost-effective public health interventions can be applied.⁴⁰

The Role of GIS Technology in Public Health Efforts

The use of geography in epidemiologic studies is not a new phenomenon. Probably the most widely cited study for first combining field epidemiology and geographical analysis is John Snow’s analysis of local water pumps and their relationship with the spread of cholera in London during the 1850s. The field of epidemiology stresses the importance of understanding three components of disease distribution: the people involved, the time of disease outbreak or transmission, and the location of transmission. However, many epidemiologic studies in the past have failed to examine the role that spatial patterns play in the development of trends in disease in ways that go beyond descriptive methods.³⁶ The importance of using geography when studying disease is based upon the factors that lead to non-uniformity of disease distribution, such as physical and environmental factors; social, economic and cultural factors; and even behavioral factors.^{4,34,47} One tool that has come of age in the last twenty or so years, enabling researchers to easily investigate spatial trends and question the nature of disease distributions, is computerized GIS and its capabilities.

GIS mapping can also allow researchers to examine many different types of questions involving the particulars of a specific location, the distribution of certain phenomena, the changes that have occurred since a previous analysis, the impact of a specific event, or the relationships and systematic patterns of a region.⁶ The GIS database becomes a model of spatial information that can be used in epidemiologic and health research in order to recognize the specific spatial structure of a process.²³ A particular spatial structure includes the individuals affected and how they are connected in communities, as well as the dynamics of these communities and their organization into larger units.³⁶ The geographic component of the GIS becomes a method of classifying data records into groups (administrative areas) separately from the personal characteristics of the individuals, and it allows for examination of aspects of location that are not captured by variables observed directly for the individuals.⁴⁴ However, when detailed location information is available for individuals, it is not necessary to aggregate information into groups. It is possible to fit models that include spatial correlation components and to do so without compromising the confidentiality of the individuals by using varying levels of resolution to display patterns. The nature of GIS allows for flexibility in utilizing different techniques for mapping data through the use of area-based (counts of cases) and point-based (actual incident cases) data types.

Factors Influencing a Spatial Analysis of TB

Unfortunately, until recently, the use of GIS in the study of infectious disease and, more specifically, tuberculosis has been less documented than its use in the study of chronic or environmentally related illness. In fact, some believe that there is little appreciation among public health professionals of the value of mapping communicable diseases or associated risks. “Limited resources, large datasets, and concern for the maintenance of patient anonymity, combined with under-recognition of the benefits of conducting geographical analysis…mean that the spatial references required for disease mapping are frequently not made available.²” This does not have to be the case, though, as GIS allows the researcher to display data at different resolutions and aggregations in order to protect confidentiality. Such concerns should not limit the use of analytical methods to describe geographical variation in the distribution of infectious diseases in ways that will be readily understood and used by public health professionals.

A national plan dedicated to the elimination of tuberculosis (TB) in the United States by 2010 (defined by a case rate of less than 1 per 1,000,000 population) has been maintained since 1989 by the Centers for Disease Control and Prevention (CDC) and the Advisory Council for the Elimination of Tuberculosis (ACET).^{12,35} The most recent data for reported incident cases of TB in the United States show a low of 16,377 cases during 2000 compared to 17,531 cases for 1999.¹¹ Although data show a significant decrease in TB cases during most of the last decade, there is still concern among medical and public health professionals to provide new diagnostic and therapeutic tools to continue this progression towards TB elimination and to deal with the impending tide of individuals with latent TB infection who serve as a reservoir of future cases.¹ In fact, one of the greatest scientific advances in TB detection methods has been the use of genetic molecular characterization such as restriction fragment length polymorphism (RFLP) analysis. Population-based studies have shown that identification of TB case clusters is significantly enhanced by profiling certain copies of TB DNA (defined as probes), and this profiling has become a standard tool in TB epidemiologic studies.⁴³

Tuberculosis, in general, is frequently associated with marginalized populations such as the homeless, persons living at the poverty level, and those living in overcrowded housing, such as immigrants. Numerous studies, both in the United States and abroad, have shown these factors as well as human immunodeficiency virus (HIV) infection, increasing cases among foreign-born individuals (in the U.S.), drug use, multi-drug-resistant TB (MDR TB) and living in various institutional settings are responsible for a large proportion of the TB cases reported annually.^{5,8,9,13,15,17,37,39,45}

In 1999, the CDC provided revised recommendations for TB prevention that included issues for: using elimination strategies based on local epidemiology, establishing new strategic partnerships to effectively reach the diverse population of people at risk, enhancing the use of current tools for TB prevention and control, developing new tools for TB elimination, recommitting to the global battle against TB, and supporting broad-based efforts for TB prevention and control at all governmental levels in the U.S. Of specific relevance for this research study is CDC’s view that surveillance and program evaluation data show areas for improvement.¹² There are particular concerns about individual contacts maintaining compliance with or even starting TB therapy, as these are the individuals most likely to become future TB cases. According to the CDC, strategies that target groups at high risk for TB and treat those infected have often been poorly applied.¹² In order to effectively deal with this surveillance issue, one objective should be to develop and implement systems to conduct active case finding among high-risk populations, when appropriate.¹² As an example, TB-control staff members could be trained to use local epidemiologic data, coupled with GIS, to consistently identify high-risk groups that are deemed appropriate for targeted testing (e.g., immigrant populations) and to ensure that a greater proportion of infected persons begin and complete therapy.

Incidence of Tuberculosis in Texas and Harris County

Although there was a steady decrease in cases overall (from 12.7/100,000 to 9.2/100,000),¹¹ the incidence in Texas from 1995-1998 remained quite high compared to the national rate. Part of the explanation for these rates was the prevalence of HIV infection and the number of acquired immunodeficiency syndrome (AIDS) cases.²⁷ In 1998, the largest share of tuberculosis cases reported in Texas was in the 25-44 age group (38.4%). In terms of race/ethnicity, the largest share of tuberculosis cases in 1998 was reported among Hispanics (47.5%), followed by African-Americans (24.3%) and Whites (18.5%). The incidence of tuberculosis in Harris County, Texas decreased from 25.6 cases/100,000 in 1995 to 14.4 cases/100,000 in 1998.¹⁶ Incidence rates tended to be higher among minority groups than among Whites, although the rates in minority groups did follow the decreasing trend.

RESEARCH QUESTIONS

What is the geographical point distribution of all reported tuberculosis cases in Harris County, Texas from October 1995 through September 1998?
What are the spatial patterns of tuberculosis cases and clinically defined clusters in Harris County, Texas from October 1995 through September 1998 and what is the statistical significance of these patterns?
What is the estimated incidence of tuberculosis in Harris County, TX at the 2000 U.S. Census block group level from October 1995 through September 1998 and are there statistically significant low or high rate areas (compared to a standardized rate for Harris County) based on an assumption of spatial randomness and a p-value of .05?
What is the geographic distribution of tuberculosis cases of the same genetic type among individuals in reference to a specific mode of transmission (public transportation) and is there an apparent geographic clustering of these similar genetic types?

METHODOLOGY

Study Population

Secondary data analysis was performed on a subset of data collected during the 36-month period from October 1995 to September 1998 by the Houston Tuberculosis Initiative Program (HTIP). HTIP is an ongoing, population-based, active surveillance and molecular epidemiology study of tuberculosis cases reported to the City of Houston Tuberculosis Control Office, covering Houston and surrounding Harris County, Texas (referred to as Houston from now on in this study). Since 1995, 93% of all reported tuberculosis patients and 85% of all culture-positive tuberculosis patients in Houston have been enrolled in this study. Patients who had lived in Houston for less than 3 months were excluded as incident cases from the analysis. During the study period, 1774 cases of tuberculosis were reported to the City of Houston Tuberculosis Control Office. Of these identified cases, 1481 individuals agreed to participate in the study and were interviewed by members of HTIP. The remainder of the cases (302) were excluded for the following reasons: 108 cases could not be located, 98 cases were considered prevalent cases (had been residents of Houston for less than 3 months), and 96 cases declined to participate. Patients with newly diagnosed tuberculosis were approached by members of the HTIP research team, given a description of the study and asked to participate. If informed consent was obtained, patients were interviewed using the “Houston Mycobacteria Active Surveillance Form.” Patients were asked about their demographic characteristics, living situations, modes of transportation, travel, social contacts and places frequented. Additionally, they were asked to provide information on tobacco, alcohol, and drug use, as well as sexual and medical histories. For consenting patients, clinical records were reviewed for symptoms, dates of symptom onset, specimen details and dates, diagnoses, treatments, adherence to medication and outcomes, including death. Where possible, information on patients who had died (or left the area) was sought from proxy persons. M. tuberculosis isolates were then analyzed with 3 molecular typing methods, discussed elsewhere.¹⁷ In order to ensure strict confidentiality, personal identifiers were not recorded on the questionnaire and participants had the option to refuse any question or to terminate the interview at any time.

Database Organization and Geocoding

In order to allow for spatial analysis and mapping of individual TB cases, the address of each case at time of entry into the study was geocoded. This involved assigning a latitude and a longitude for the address utilizing specialized geocoding software²⁰ and a database of street network files.²² The process of address matching involved matching the street address number, street, city, and zip code with the corresponding street segment in the street network file. Over 98% of the cases (n=1459) were exactly matched by the geocoding software. Some of the edits involved in the address matching of the final 22 cases involved corrections of: misspellings; mistakes in street typing (Road instead of Street); and problems with street numbering. Only one case was unable to be matched as there was no valid address information for this case. Subsequently, this case was dropped from the final analysis.

When analytical methods were used that included the observed points only, without aggregation to specific areas, the subset of 1480 cases was used. However, because some of the spatial methods used in this analysis required boundary restrictions and aggregation of points to specific areas (block groups, census tracts), only the boundary of Harris County was used. Therefore, an additional 7 cases whose locations fell outside the county boundary were excluded from these types of analysis, resulting in a subset of 1473 cases. Using these two subsets, it was possible to construct further subsets of individuals whose isolates underwent molecular characterization for the exploration of genetic clustering of cases as well as unique subsets of individuals based on a variety of significant variables. Specific variables that have been utilized in previous studies performed in Houston and that have been considered associated with tuberculosis clustering were kept in the dataset.^{16,17,27,43,47} These variables included race/ethnicity, gender, household income, genetic type of TB strain, age, history of homelessness, HIV infection status, previous drug use, number of people per household, use of public transportation and country of birth (US vs. foreign-born). This subset information is shown in Table 1 and reflects the complete number of geocoded cases before dropping due to aggregation. Once this process was completed, all address information was stripped from the data set for confidentiality purposes, leaving only the geographic identifiers (latitude and longitude). At no time during this project were cases displayed at such a resolution that would allow for possible identification of specific individuals.

Most of the spatial analytical methods used in this project were based upon point pattern methods, where the objective is to determine whether there is a tendency for events (TB cases) to exhibit a pattern, that is, some form of regularity or clustering.³ The cases were the geocoded addresses from the data set as described above, and the attributes were the spatial coordinates (latitude and longitude) and the various independent variables under consideration. The data under study represented a complete map of events of tuberculosis between October 1995 and September 1998 (because the number of 1995 cases was so small, they were aggregated with 1996 cases, creating the time period 1996-1998 for analysis purposes), and the study region comprised the area of Harris County, Texas. The main purpose of this analysis was to follow exploratory spatial analysis methods to generate possible hypotheses for future analyses and to suggest possible explanatory models to describe the observed processes. There are various ways in which one can view a spatial point pattern, and the following exploratory methods outline the processes utilized in this research study to examine properties of intensity (mean number of events per unit area; first-order properties) and spatial dependence or interactions (relationships between numbers of events in the study area; second-order properties).

Kernel Estimation

The function of kernel estimation was to obtain a smooth estimate of a bivariate probability density from an observed sample. For any chosen kernel and bandwidth, values of intensity can be examined at locations on a suitably chosen grid over the study area to provide a useful visual indication of the variation in the intensity. The individual kernel estimates for each cell are summed to produce an overall estimate of density for that cell. This method provides a summary of how events tend to cluster throughout the study area as a means of assessing first-order properties. The study area was a rectangular grid placed over the whole of Harris County, with cell size based on the display of 2000 U.S. Census Block Groups (225 rows by 351 columns; approximately square cells of 0.2 miles). ArcView^® Spatial Analyst (v 1.1)¹⁹ was used with the kernel density function in order to generate the mean intensity of cases per square mile for all TB cases, cases stratified by each independent variable described above, and 2000 U.S. Census Population characteristics (fixed at the centroid) for the block group level.

In order to adjust density estimates for heterogeneous population distributions (such as population at risk of disease in an area), one can also use a ratio of kernel estimates for intensity of events and population density.³² This allows for the viewing of an image that takes into consideration the intensity of events along with the intensity of population and begins to provide an estimate of case risk. It also assists in judging whether an apparent convergence of cases towards a specific area is a function of population density or not. This procedure is readily available in CrimeStat^® software.³² For the purposes of this procedure, a quartic kernel method was used with a fixed bandwidth of 1.25 miles (for both kernel calculations) in order to obtain a smooth model for descriptive purposes. Cell size was adjusted in order to create a grid covering all of Harris County with squares approximately 0.5 miles long. The 2000 U.S. Census block group population, fixed at the centroids of the block group, was used for the ratio of kernel estimates method described above. All kernel ratios were fixed into similar classification schemes in order to provide a means of comparison. The classifications were in increments of 50 cases per 100,000 population, except for the last increment of 100 cases per 100,000 population. This scheme was used because of the variation in the number of case points compared to the number of population centroid points, and it provides a meaningful estimate of relative incidence rates within Harris County.
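The ratio of kernel estimates can be illustrated with a minimal sketch. It assumes planar coordinates in miles, the commonly used normalized form of the quartic kernel, and hypothetical function names and toy data; it is not the CrimeStat implementation, and the resulting rate covers whatever period the case points span.

```python
import math

def quartic_kernel(distance, bandwidth):
    """Quartic kernel weight; zero at or beyond the bandwidth, integrates to 1."""
    if distance >= bandwidth:
        return 0.0
    u = (distance / bandwidth) ** 2
    return 3.0 / (math.pi * bandwidth ** 2) * (1.0 - u) ** 2

def kernel_density(cell, points, bandwidth, weights=None):
    """Sum the kernel contributions of all points at one grid cell centre."""
    if weights is None:
        weights = [1.0] * len(points)
    total = 0.0
    for (px, py), w in zip(points, weights):
        d = math.hypot(cell[0] - px, cell[1] - py)
        total += w * quartic_kernel(d, bandwidth)
    return total

def rate_per_100k(cell, case_points, centroids, populations, bandwidth=1.25):
    """Ratio of case density to population density, per 100,000 population."""
    case_density = kernel_density(cell, case_points, bandwidth)
    pop_density = kernel_density(cell, centroids, bandwidth, populations)
    if pop_density == 0.0:
        return None  # no population within one bandwidth of this cell
    return 100000.0 * case_density / pop_density

# Toy data (hypothetical): two cases, two block group centroids.
cases = [(0.0, 0.0), (0.4, 0.3)]
centroids = [(0.0, 0.0), (1.0, 0.0)]
populations = [1500, 2000]
print(rate_per_100k((0.0, 0.0), cases, centroids, populations))
```

Evaluating `rate_per_100k` at every cell centre of a grid yields the surface that is then classified into rate increments for mapping.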

Nearest Neighbor Distances

Nearest neighbor distances were the exploratory methods used in this study to investigate second-order properties (possible relationships between points), using the nearest neighbor distance distribution functions G(w) and F(x)³ for observed events in a study area. These provide information about inter-event interactions at small distances, which could be useful when dealing with an infectious disease such as TB. Calculations of these distribution functions were provided by S-plus 2000^®,³³ and a limit on the total distance used was set at approximately 3.5 miles (0.05 degrees latitude). The resulting empirical distribution function G(w) was plotted against suitable values of distance, and the resulting empirical distribution function F(x) was plotted against the theoretical distribution function under complete spatial randomness (given by 1 - exp(-πλy²), where λ is the mean intensity of events and y is distance) in order to explore possible evidence of inter-event interactions.
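The comparison between an empirical nearest neighbor distribution and the curve expected under complete spatial randomness can be sketched as follows. This is an illustration only, assuming planar coordinates, no edge correction, and hypothetical function names and toy points; it is not the S-plus routine used in the study.

```python
import math

def nearest_neighbour_distances(points):
    """Distance from each event to its closest other event."""
    distances = []
    for i, (xi, yi) in enumerate(points):
        nearest = min(
            math.hypot(xi - xj, yi - yj)
            for j, (xj, yj) in enumerate(points)
            if j != i
        )
        distances.append(nearest)
    return distances

def empirical_g(points, w):
    """Share of events whose nearest neighbour lies within distance w."""
    distances = nearest_neighbour_distances(points)
    return sum(1 for d in distances if d <= w) / len(distances)

def csr_distribution(distance, intensity):
    """Theoretical curve under complete spatial randomness."""
    return 1.0 - math.exp(-math.pi * intensity * distance ** 2)

# Toy data (hypothetical): two tight pairs of events in a 10 x 10 region.
points = [(0.0, 0.0), (0.1, 0.0), (5.0, 5.0), (5.0, 5.2)]
intensity = len(points) / (10.0 * 10.0)  # events per unit area
for w in (0.15, 0.25, 1.0):
    print(w, empirical_g(points, w), csr_distribution(w, intensity))
```

An empirical curve that rises well above the theoretical curve at short distances is the signature of clustering that the plots are meant to reveal.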

The above nearest neighbor distance methods were useful for looking at patterns among the closest events and in considering small scales of pattern. A loss of information therefore occurs, because only these smallest scales of pattern are considered. The above statistics only indicate the direction of departure from complete spatial randomness but do not provide a means for interpreting a process that does not adhere to this assumption. An alternative approach that provided a more effective summary of spatial dependence over a wider range of scales for second-order properties was the K(h) function, which provided a test of randomness for every distance from the smallest up to the size of the study area.^{3,32} CrimeStat^® was used for calculating this function using 100 intervals (radii) at which the statistic was counted, based on an overall radius of approximately 33 miles.³² The resulting K(h) function was transformed into the square root function, L(h), and plotted against distance to reveal whether there was any clustering at certain distances or any dispersion at others. This transformation is useful to better visualize the function by making it more linear.³ Edge corrections were not considered in this preliminary analysis. Five hundred Monte Carlo simulations were run to calculate a random simulation envelope under theoretical spatial randomness for comparison.
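A minimal sketch of this second-order summary and its simulation envelope is given below. It assumes a rectangular study region, no edge correction, and the transformation L(h) = sqrt(K(h)/π); some software additionally subtracts h. Function names and the toy data are hypothetical, and this is not the CrimeStat routine.

```python
import math
import random

def k_function(points, h, area):
    """Uncorrected K estimate: scaled count of ordered pairs within distance h."""
    n = len(points)
    pairs = 0
    for i in range(n):
        for j in range(n):
            if i != j:
                dx = points[i][0] - points[j][0]
                dy = points[i][1] - points[j][1]
                if math.hypot(dx, dy) <= h:
                    pairs += 1
    return area * pairs / (n * n)

def l_function(points, h, area):
    """Square root transformation of K, roughly linear in h under randomness."""
    return math.sqrt(k_function(points, h, area) / math.pi)

def csr_envelope(n, h, width, height, simulations=99, seed=0):
    """Min and max of L(h) over simulated uniformly random patterns."""
    rng = random.Random(seed)
    values = []
    for _ in range(simulations):
        sim = [(rng.uniform(0, width), rng.uniform(0, height)) for _ in range(n)]
        values.append(l_function(sim, h, width * height))
    return min(values), max(values)

# Toy data (hypothetical): 20 events packed into one corner of a 10 x 10 region.
rng = random.Random(1)
clustered = [(rng.uniform(0, 1), rng.uniform(0, 1)) for _ in range(20)]
low, high = csr_envelope(n=20, h=1.0, width=10.0, height=10.0)
print(l_function(clustered, 1.0, 100.0), (low, high))
```

An observed L(h) above the upper envelope at a given distance suggests clustering at that scale, and a value below the lower envelope suggests dispersion.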

These calculations were used for all TB cases, cases stratified by each independent variable described above, and 2000 U.S. Census Population characteristics (fixed at the centroid) for the block group level for comparisons.

Spatial Filtering Method

The spatial filtering method as outlined by Rushton^{41,42} was also used as an exploratory technique in order to build upon the kernel methods used earlier. Not only can one view estimated disease rates based upon extrapolation of individual cases and underlying population to points on a fine grid, but this method allows for the input of probabilities for the event in order to generate Monte Carlo simulations of expected rates to compare with the observed rates and provide a level of significance for the observed rates. The output is generated as the proportion of simulated rates that were less than the observed rates, whereby contour lines can be portrayed on a map that show where this proportion was low or high. This procedure was used because it allowed the data to remain in its original form (N=1480) instead of being forced to aggregate to a larger area. The numerator files for this method were all TB cases and cases stratified by each independent variable described above. The denominator files utilized the 2000 U.S. census block level population characteristics aggregated to the centroid level. This level was chosen in order to maintain a distribution as if one had location information on all individuals at risk in an area. This method had been used before as an approximation to having all enumerated cases and population at risk and had shown no apparent difference in results (Rushton, personal communication). The probability files were calculated by dividing the total number of cases by the relevant population group for the case subgroup (3-year aggregated population counts for all TB case subgroups except those stratified by year of incidence), and a total of 750 simulations were run. A grid with points at one-mile intervals was used to overlay the numerators and denominators, and a 1-mile filter was used as the search radius. Of specific interest for this analysis was the area in which the incidence rate was significantly higher than the simulated rate, as rationale for where to focus TB control efforts. In order to analyze tuberculosis data utilizing this method, the D-map™ software found on an instructional CD-ROM produced by G. Rushton was used.⁴¹
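The logic of comparing an observed filtered rate with simulated rates at a single grid point can be sketched as follows. This is a simplified illustration and not the D-map algorithm: it assumes planar coordinates in miles, and it simulates by reassigning the observed number of cases to population centroids with probability proportional to population, which stands in for the probability files described above. All names and data are hypothetical.

```python
import math
import random

def within(center, point, radius):
    """True when point lies inside the circular filter around center."""
    return math.hypot(center[0] - point[0], center[1] - point[1]) <= radius

def filtered_rate(grid_point, case_points, centroids, populations, radius):
    """Cases divided by population inside the filter; None if no population."""
    pop = sum(p for c, p in zip(centroids, populations) if within(grid_point, c, radius))
    if pop == 0:
        return None
    cases = sum(1 for c in case_points if within(grid_point, c, radius))
    return cases / pop

def proportion_below_observed(grid_point, case_points, centroids, populations,
                              radius=1.0, simulations=750, seed=0):
    """Share of simulated rates that fall below the observed rate."""
    observed = filtered_rate(grid_point, case_points, centroids, populations, radius)
    if observed is None:
        return None
    rng = random.Random(seed)
    below = 0
    for _ in range(simulations):
        # Assumption: each simulated case lands on a centroid with
        # probability proportional to that centroid's population.
        simulated = rng.choices(centroids, weights=populations, k=len(case_points))
        rate = filtered_rate(grid_point, simulated, centroids, populations, radius)
        if rate < observed:
            below += 1
    return below / simulations

# Toy data (hypothetical): all ten cases sit on the first of two centroids.
centroids = [(0.0, 0.0), (10.0, 0.0)]
populations = [1000, 1000]
cases = [(0.0, 0.0)] * 10
print(proportion_below_observed((0.0, 0.0), cases, centroids, populations))
```

Repeating the calculation at every grid point gives the surface of proportions from which the contour lines are drawn.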

Spatial Scan Statistical Method

This final method, as outlined by Kulldorff et al.,^{24,29,30,31} was used in order to determine possible cluster areas for TB that were based on statistical likelihood. Each resulting cluster of areas would have an assigned p-value and relative risk measurement to compare to an expected value. For this analysis, the data were broken into cases (TB cases by genetic print type) and controls (all other TB cases) in order to determine areas of clustering for specific print types relative to all TB cases. One of the underlying assumptions is that shared print types have possible shared contacts. One of these possible contacts is the use of public transportation. If significant clusters can be determined for various print types, the actual case points with bus-route attribute information can be overlaid onto this area and provide a rationale for checking personal contact information where print type and bus routes are identical. For all analyses, a space-time scan statistical test was used as provided in SaTScan™ v2.1 in order to adjust for time variations (broken into year intervals, 1996-1998) as well as spatial variations. The test was set to scan for clusters with both high and low rates, and the underlying coordinates file was based on the centroids of the 2000 U.S. Census block groups. Three thousand Monte Carlo simulations were run for each analysis, as proposed for a medium-sized data set.²⁹ The maximum spatial cluster size was set at 10% of the population (controls) and the maximum temporal cluster size was set at the recommended 50% level.
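The likelihood comparison behind a case/control scan can be illustrated with a short sketch. It is a simplification under stated assumptions: it is purely spatial (the study used a space-time test), it looks only for high-rate windows, it uses a Bernoulli log likelihood ratio, and it omits the Monte Carlo step that assigns p-values. It is not the SaTScan implementation, and all names and data are hypothetical.

```python
import math

def bernoulli_llr(c, n, total_cases, total_n):
    """Log likelihood ratio for a window holding c cases among n individuals."""
    if n == 0 or n == total_n:
        return 0.0
    inside = c / n
    outside = (total_cases - c) / (total_n - n)
    if inside <= outside:
        return 0.0  # only high-rate windows are scored in this sketch

    def term(k, m):
        # k * log(k / m), with the convention 0 * log(0) = 0
        return k * math.log(k / m) if k > 0 else 0.0

    rest_n = total_n - n
    rest_c = total_cases - c
    alternative = (term(c, n) + term(n - c, n)
                   + term(rest_c, rest_n) + term(rest_n - rest_c, rest_n))
    null = term(total_cases, total_n) + term(total_n - total_cases, total_n)
    return alternative - null

def most_likely_cluster(locations, cases, totals, max_fraction=0.10):
    """Grow circles around each location; return (llr, centre, radius) of the best."""
    total_cases, total_n = sum(cases), sum(totals)
    best = (0.0, None, None)
    for cx, cy in locations:
        order = sorted(
            range(len(locations)),
            key=lambda i: math.hypot(locations[i][0] - cx, locations[i][1] - cy),
        )
        c = n = 0
        for i in order:
            c += cases[i]
            n += totals[i]
            if n > max_fraction * total_n:
                break  # window exceeds the maximum cluster size
            llr = bernoulli_llr(c, n, total_cases, total_n)
            if llr > best[0]:
                radius = math.hypot(locations[i][0] - cx, locations[i][1] - cy)
                best = (llr, (cx, cy), radius)
    return best

# Toy data (hypothetical): cases = one print type, totals = all TB cases per centroid.
locations = [(0, 0), (1, 0), (10, 0), (11, 0)]
cases = [20, 18, 1, 1]
totals = [100, 100, 100, 100]
print(most_likely_cluster(locations, cases, totals, max_fraction=0.5))
```

In a full analysis, the same search would be repeated on randomized data sets, and the rank of the observed maximum among the simulated maxima would give the p-value of the most likely cluster.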

RESULTS

It is not possible to provide examples of all results for the above-mentioned analytical and exploratory methods in this space. Instead, I focus on the complete set of all TB cases and one case subgroup stratified by race (Black) for comparison.

Results from Kernel Estimation

Figure 1 shows the relative density of TB cases per square mile of area for all cases in comparison to the density of the 2000 U.S. Census Block Group total population. Figure 2 shows the relative density of Black TB cases per square mile of area in comparison to the density of the 2000 U.S. Census Block Group Black population. Upon visual comparison of the images in Figure 1, it appears that the density of TB cases is more focused towards the geographic center of Harris County than overall population density. Figure 2, however, shows a high intensity of Black TB cases in the same, overall geographic locations where high Black population intensity occurs (according to the 2000 U.S. Census).

When comparing the ratio of total TB cases to total population, one can see that there seems to be an area of elevated risk at the center of Harris County (Figure 3). In addition, when comparing Black TB cases to the underlying Black population, one sees a slightly larger total area of elevated risk, with additional areas of elevated risk toward the south of the center of Harris County (See circle in Figure 3). Even though it appeared that Black TB cases were simply occurring at a higher intensity due to the higher intensity of underlying population (Figure 2), the high density areas remain even when adjusting for the underlying population.

Results from Nearest Neighbor Distances

The plots of the G(w) function for all TB cases and Black TB cases can be seen in Figure 4. On visual inspection, it is clear that there is relative clustering among all TB cases, as evidenced by the steep rise in the function at small distances. This trend is also evident among the Black TB cases. Plots of the F(x) function indicate a clustered pattern if the values of F(x) depart from the theoretical distribution function at larger distances. These plots are shown in Figure 5 for all TB cases and Black TB cases. On visual inspection of Figures 4 and 5, it is evident that there is large variation between the two functions (theoretical and empirical) for all TB cases and for Black TB cases.

The plots for the K(h) function transformed into the square root function, L(h), for all TB cases and Black TB cases can be seen in Figure 6. There is evidence of clustering at all scales for all TB cases, more so than for the total population. Because this function lies well outside the simulation envelopes given, there is some confidence in concluding that the locations of all TB cases are clustered. Black TB cases show evidence of clustering up to approximately 12 miles, after which the function falls steeply. Up to this distance, Black TB cases appear to be more clustered than the Black population, and there is confidence in this conclusion because the function falls well outside the simulation envelopes.

Results from the Spatial Filtering Method

The results from this method for all TB cases can be seen in the map in Figure 7. Here, the blue isolines indicate where the highest proportion of simulated TB incidence was lower than the observed incidence. The 3153 grid point locations have computed TB incidence rates based on more than 100 persons at risk (3-year aggregated block level population) within the 1-mile search radius. Actual case points help to determine where areas of high rates may be less meaningful (very few cases). The mean incidence rate that was calculated for the whole group was equal to 18.72 cases/100,000 population. There is definitely a large area of higher than average rates running in a North/South direction in the center of Harris County. This area cuts across the major Houston metropolitan area from the north of the inner Highway 610 Loop to the south of this Highway 610 Loop. The mean incidence rate that was calculated for this area was equal to 91.14 cases/100,000 population, with a range from 16.9 to over 1200 cases/100,000 population. Figure 8 shows the results from this method for Black TB cases. Again, 3153 grid point locations have computed TB incidence rates based on more than 100 Black persons at risk (3-year aggregated block level population for Black individuals) within the 1-mile search radius. The mean incidence that was calculated for this group was equal to 21.72 cases/100,000 population. Again there is an area of higher than average rates focused in the center of Harris County, inside the Highway 610 Loop (see circle in Figure 8). The mean incidence that was calculated for this area was equal to 182 cases/100,000 population, with a range from 49 to over 500 cases/100,000 population.

Results from the Spatial Scan Statistical Method

For this method, nine different print types were analyzed in order to find a most likely cluster in comparison to other TB cases. Figures 9 and 10 show two of these print types and their associated most likely cluster with significance level. In Figure 9, the map shows that the most likely cluster for Print Type 1 had an overall incidence nearly 3 times higher than that among all other areas (significant at p = .01). In Figure 10, the map shows that the most likely cluster for Print Type 4 had an overall incidence approximately 9 times higher than that among all other areas (significant at p < .01). There is strong evidence for the existence of these clusters, although their exact boundaries are uncertain because, under the procedure for this method, many overlapping circular windows will contain the most likely cluster. In using this method, however, one is able to take the information on most likely clusters and characterize the case attributes in order to look for significant patterns. As mentioned earlier under the objectives section, one of the possible patterns is characterized by the public bus routes that may or may not be shared between cases. In Figure 11, both Print 1 and Print 4 clusters are shown with the cases in each cluster characterized by their bus route. In the Print 4 cluster, there were at least 6 individuals who shared the same bus route (Route 82). In the Print 1 cluster, there were 3 individuals who shared one bus route (Route 25), 2 different individuals who shared another bus route (Route 15), and 2 different individuals who shared a third bus route (Route 80).
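
To make the idea of a "most likely cluster" more concrete, the following Python sketch scans circular windows centered on area centroids and keeps the window with the largest log likelihood ratio. It assumes a purely spatial Bernoulli (case/control) model, with cases of one print type compared against all other TB cases; the function names are hypothetical, and the sketch omits the Monte Carlo replication that supplies the significance level. It is an illustration only and not the SaTScan implementation.

```python
import math


def _xlogy(x, y):
    # x * log(y), with the convention that 0 * log(0) = 0.
    return 0.0 if x == 0 else x * math.log(y)


def bernoulli_llr(c, n, C, N):
    """Log likelihood ratio for a window holding c cases among n points,
    given C cases among N points overall. Returns 0 unless the case
    proportion inside the window exceeds the proportion outside."""
    if n == 0 or n == N or c / n <= (C - c) / (N - n):
        return 0.0
    out_c, out_n = C - c, N - n
    inside = _xlogy(c, c / n) + _xlogy(n - c, (n - c) / n)
    outside = _xlogy(out_c, out_c / out_n) + _xlogy(out_n - out_c, (out_n - out_c) / out_n)
    null = _xlogy(C, C / N) + _xlogy(N - C, (N - C) / N)
    return inside + outside - null


def most_likely_cluster(areas, max_fraction=0.5):
    """areas: list of (x, y, cases, controls), e.g. one entry per block group.
    Returns (llr, index of window center, window radius)."""
    C = sum(a[2] for a in areas)
    N = sum(a[2] + a[3] for a in areas)
    best = (0.0, None, None)
    for i, center in enumerate(areas):
        # Grow the window outward by adding areas in order of distance.
        by_distance = sorted(areas, key=lambda a: math.dist(center[:2], a[:2]))
        c = n = 0
        for a in by_distance:
            c += a[2]
            n += a[2] + a[3]
            if n > max_fraction * N:
                break
            llr = bernoulli_llr(c, n, C, N)
            if llr > best[0]:
                best = (llr, i, math.dist(center[:2], a[:2]))
    return best
```

Because many overlapping windows share most of the same areas, several windows typically have likelihood ratios close to the maximum, which is one way to see why the exact cluster boundary remains uncertain.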

Discussion

This study of cases from a three-year, population-based study of the epidemiology of tuberculosis in Harris County, Texas, used various spatial analytical methods to look at the intensity and spatial interactions of TB cases and to determine whether there were significant spatial patterns among cases that deviated from a random pattern. Through the use of kernel estimation methods it was evident that there were specific areas in which the intensity of TB cases during the three-year period was high, even in reference to the underlying population. This allowed for a quick assessment of potential centers of TB incidence that were stratified by various risk factors, such as ethnicity, under the assumption that case density would follow the underlying population distribution instead of a completely spatially random distribution. At first glance, the population density and the TB case density among Blacks looked very similar. However, when controlling for the underlying Black population by using a ratio method of kernel densities, it was discovered that even within areas of high Black population density, there were still areas of high TB case density among Blacks. Some may question the assumption required by the kernel ratio method used for this study, namely that the population values were centered at a specific location (centroids); this is an obvious limitation of this method. However, block groups generally contain between 600 and 3,000 people, with an optimum size of 1,500 people. While there still may be variation in a neighborhood area of this size, the effect of allocating all individuals to a single point produces only a small error (Levine, personal communication). 
Another method that could be used for comparison would be to use another point process as a surrogate measure of the underlying population (non-Black TB cases) as the denominator for the kernel ratio method, in much the same way as a case-control design in epidemiology.³ The above finding is notable, however, for hypothesis generation when compared with an epidemiological study performed by HTIP prior to this study using a similar data set.¹⁶ That study looked at the contributions of certain risk factors associated with clustering of TB cases (where at least two individuals had similar genetic print types). It found that among Blacks, the odds of clustering were 3 times greater (univariate OR of 3.1) than for Whites.¹⁶ Had the evidence from this study, that the intensity of TB among Black cases appeared to be high, been available prior to the HTIP study, one would have had more rationale for including ethnicity in a multivariate model with an underlying hypothesis that Blacks may be at high risk for clustering. Additional evidence for possible clustering among Black TB cases was given in the results of the Nearest Neighbor methods utilized in this project, most notably the K(h) function analysis. The use of simulation envelopes under the assumption of spatial randomness allowed one to assess significant departures of K(h) from its theoretical value. By providing the same analysis for the underlying population at risk, one is able to directly compare the functions and see that at distances up to approximately 12 miles, there was a tendency for Black TB cases to show more of a clustering effect than even the Black population.
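
The kernel ratio idea discussed above can be illustrated with a short Python sketch: a kernel density of case locations is divided by a kernel density of population, with each population centroid weighted by its count. The quartic kernel, the bandwidth, and the function names are assumptions made for illustration and do not necessarily match the settings used in this study.

```python
import math


def quartic_kernel(d, bandwidth):
    """Quartic (biweight) kernel weight for a point at distance d."""
    if d >= bandwidth:
        return 0.0
    u = d / bandwidth
    return 3.0 / (math.pi * bandwidth ** 2) * (1.0 - u * u) ** 2


def kernel_density(location, points, bandwidth, weights=None):
    """Weighted sum of kernel contributions at `location`."""
    weights = weights or [1.0] * len(points)
    return sum(
        w * quartic_kernel(math.dist(location, p), bandwidth)
        for p, w in zip(points, weights)
    )


def kernel_ratio(location, cases, centroids, populations, bandwidth):
    """Case density divided by population density at `location`.
    Returns None where no population falls within the bandwidth."""
    denom = kernel_density(location, centroids, bandwidth, populations)
    if denom == 0.0:
        return None
    return kernel_density(location, cases, bandwidth) / denom
```

Replacing `centroids` and `populations` with the locations of a comparison group (for example, non-Black TB cases, each with weight 1) would give the case-control style ratio suggested above.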

Another extrapolation technique that served to build upon the kernelling methods used was the Spatial Filtering method advocated by Rushton⁴². The added benefit of this technique was the use of simulation techniques in order to provide a level of significance for judging the observed relationships. Again, there was a definite area where relative TB incidence rates for all cases and relative incidence rates for Black cases appeared quite high. One can feel confident that these areas are meaningful if viewed under the aegis of exploratory analysis, and they can lead the researcher to refine areas for further analysis in the future. Again, one of the limiting factors for this analytical method is that there was no spatial point pattern for use as the denominator that took into account the total population at risk. The best available option was the use of population centroids at the smallest geographical area available from the U.S. Census Bureau (blocks). However, the method compares the observed case rates with a simulated distribution of case rates that inevitably uses the same variance structure as the observed rates⁴². In addition, previous studies have looked at using census-based approaches to account for the lack of population and socio-economic data at the individual level and noted that this approach is valid and meaningful when individual-level data are not available.²⁸ As a means for routine analysis under a surveillance group, one can quickly make tentative conclusions about the likelihood of case clusters and their geographic distribution based on sound methodology and follow these conclusions with the relevant epidemiological analyses⁴².

The spatial scan statistical method^{29,30,31} was utilized in order to find the most likely clusters based on genetic print type in comparison to all other TB cases. The previous analytical methods have tried to show the evidence of overall clustering but provide no information on where the locations for potential clustering may occur. The spatial scan statistical method was an attempt to provide location information for an observed cluster that is provided with a level of significance based on a maximum likelihood test. Earlier analysis¹⁶ had identified that the use of public transportation was a significant risk factor for the clustering of TB cases (multivariate OR of 1.4, p-value = .03). Therefore, it was assumed that if this spatial method could show, with statistical significance, the most likely genetic print clusters, one could compare the attribute information on public transportation for each case found in this cluster to look for relevant patterns. Four genetic print types were found to be significant geographic clusters based on comparing the cases with associated print type (aggregated to Census block groups) with all other TB cases as the control group (also aggregated to Census block groups). Among these four clusters, the cohort for Print type 4 geographic cluster was the same as that observed by previous epidemiological analysis by HTIP⁴⁷. Through molecular characterization and data collected from a standardized questionnaire, and matched case-control methods, researchers were able to determine that many of the individuals in this cohort frequented the same social locations (bars), had similar HIV+ status, had the same ethnic background (White), and had a history of drug use⁴⁷. In addition to these characteristics, it was demonstrated by the current study that at least six individuals in this geographic cluster alone (out of a total of 7 in the total cohort of 38) shared the same mode of public transportation (Bus Route 82). 
This analysis used a space-time scan statistic that calculated an overall relative risk that takes into consideration the location and the time of infection (based on the City of Houston TB control morbidity date) of the specific genetic print type cases relative to non-print cases. This method attempts to correct for any faulty assumptions based on the possibility that all cases occurring in the same time period may bias the overall results.

Conclusions

One of the main reasons for performing this analysis was to show that there is a definite utility in the use of GIS and spatial analysis in conjunction with epidemiological analyses in public health. The HTIP group has published numerous papers on the risk factors associated with TB clustering and developed novel ways of isolating the threat of increasing incidence rates.^{16,17,27,43,47} This project adds a benefit of performing another type of analysis that provides the researcher with a meaningful picture of the disease patterns that can be used in conjunction with output from epidemiologic studies. However, critics may be quick to point out that this benefit is also a limitation; the fact that this current analysis is coming on the heels of prior research findings is no guarantee that these methods would have steered the research group toward their findings. This should not hinder, however, the use of spatial analytical methods in conjunction with epidemiological studies, especially as hypothesis-generating activities and exploratory exercises useful for planning future explanatory analyses. The main focus of this project was to show that the description of spatial patterns in disease events can lead to important decisions as to where interventions may need to take place or dollars spent on control efforts. In addition, some may recognize the limits of simple univariate point analysis with the methods used here, preventing one from looking for spatial relationships that may adjust for a number of covariates together as is done in traditional epidemiological studies. 
There are methods for analyzing multivariate point patterns¹⁴, such as a bivariate K(h) function, that could be used in the future to compare spatial point patterns while accounting for the locations of two or more types of events in a study region. However, this type of decision should be made by all interested parties involved in the research tasks, with a variety of analytical and exploratory data for background comparison. This project serves to add to the wealth of information already present in Harris County.

The importance of GIS in health research has been documented in a large number of articles during the past decade. Various peer-reviewed journals have devoted whole issues to the topic of GIS in health research (Journal of Public Health Mgmt. Vol. 5 Nos. 2,4) and spatial analysis (Statistics in Medicine, Vol. 19 Nos. 17,18), and lengthy review articles have been published on both subjects.^{36,40} Several books have been written on the theme of GIS and health and on exploratory analyses using spatial statistical methods, and specialized software and training modules have been developed to meet the needs of researchers when stand-alone GIS software is not sufficient for more robust statistical analysis.^{3,18,21,29,32,41} The benefits of combining active health surveillance efforts with systematic collection and display of geographical information have also been discussed at length.^{32,34,42} GIS provides a visual component that is often lacking in scientific studies and that can provide useful information when combined with sound statistical methods. Incorporating GIS into already existing database structures in public health departments and surveillance systems should become the norm in an effort to promote the timely communication of disease trends to policy makers and the general public.

Acknowledgements

This study would not have been possible without the assistance of the Houston Tuberculosis Study and Dr. Edward Graviss, Ph.D, M.P.H., who agreed to let me use the necessary data for this study.

In addition, I would like to thank the researchers Martin Kulldorff, Ph.D, Ned Levine, Ph.D, and Gerard Rushton, Ph.D, who responded promptly to my questions about using their software.

References

1. American Thoracic Society. 2000. Diagnostic standards and classification of tuberculosis in adults and children. Am J Respir Crit Care Med. 161: 1376-95.

2. Atkinson, P. and Molesworth, A. 2000. Geographical analysis of communicable disease data. In: P. Elliot; J.C. Wakefield; N.G. Best; D.J. Briggs (Eds.) Spatial epidemiology: methods and applications. pp. 253-66. England: Oxford University Press.

3. Bailey, T.C. and Gatrell, A.C. 1995. Interactive spatial data analysis. England: Addison Wesley Longman Ltd.

4. Barnes, P.F.; Yang, Z.; Preston-Martin, S.; Pogoda, J.M.; Jones, B.E.; Otaya, M.; Eisenach, K.D.; Knowles, L.; Harvey, S.; Cave, M.D. 1997. Patterns of tuberculosis transmission in central Los Angeles. JAMA. 278(14): 1159-63.

5. Bellin, E.Y.; Fletcher, D.D.; Safyer, S.M. 1993. Association of tuberculosis infection with increased time in or admission to the New York City jail system. JAMA. 269: 2228-31.

6. Bernhardsen, T. 1999. Geographic information systems, an introduction, 2nd edition. New York: John Wiley and Sons, Inc.

7. Beyers, N.; Gie, R.P.; Zietsman, H.L.; Kunneke, M.; Hauman, J.; Tatley, M.; Donald, P.R. 1996. The use of a geographical information system (GIS) to evaluate the distribution of tuberculosis in a high-incidence community. S Afr Med J. 86:40-44.

8. Bifani, P.J.; Mathema, B.; Liu, Z.; Moghazeh, S.L.; Shopsin, B.; Templaski, B.; Driscoll, J.; Frothingham, R.; Musser, J.M.; Alcabes, P.; Kreiswirth, B.N. 1999. Identification of a W variant outbreak of Mycobacterium tuberculosis via population-based molecular epidemiology. JAMA. 282(24): 2321-2327.

9. Bishai, W.R.; Graham, N.M.H.; Harrington, S.; Pope, D.S.; Hooper, N.; Astemborski, J.; Sheely, L.; Vlahov, D.; Glass, G.E.; Chaisson, R.E. 1998. Molecular and geographic patterns of tuberculosis transmission after 15 years of directly observed therapy. JAMA. 280(19): 1679-1703.

10. Centers for Disease Control and Prevention. 2001. MMWR. 49(Nos. 51&52):1153-76.

11. Centers for Disease Control and Prevention. 2001. Division of Tuberculosis Elimination. (Online). Available: http://www.cdc.gov/nchstp/tb/surv/surv.htm [2001, June 15].

12. Centers for Disease Control and Prevention. 1999. Tuberculosis elimination revisited: obstacles, opportunities, and a renewed commitment. MMWR. 48(No. RR-9): 1-13.

13. Centers for Disease Control and Prevention. 1990. Tuberculosis among foreign-born persons entering the United States: recommendations of the advisory committee for elimination of tuberculosis. MMWR. 39(RR-18): 1-21.

14. Cressie, N.A.C. 1991. Statistics for spatial data. Chichester: John Wiley.

15. Daley, C.L.; Small, P.M.; Schecter, G.F.; Schoolnik, G.K.; McAdam, R.A.; Jacobs, W.R.; Hopewell, P.C. 1992. An outbreak of tuberculosis with accelerated progression among persons infected with the human immunodeficiency virus. N Engl J Med. 326: 231-235.

16. De Bruyn, G.; Adams, G.; Teeter L.; Soini, H.; Musser, J.M.; Graviss, E.A. 2001. The contribution of ethnicity to Mycobacterium tuberculosis strain clustering. Int J Tuberc Lung Dis. 5(7): 633-41.

17. El Sahly, H.M.; Adams, G.J.; Soini, H.; Teeter, L.; Musser, J.M.; Graviss, E.A. 2001. Epidemiologic differences between United States- and foreign-born tuberculosis patients in Houston, Texas. The Journal of Infectious Diseases. 183: 461-8.

18. Elliot, P.; Wakefield, J.C.; Best, N.G.; Briggs, D.J. (Eds.). 2000. Spatial epidemiology: methods and applications. England: Oxford University Press.

19. Environmental Systems Research Institute, Inc. 1999. ArcView Spatial Analyst Vers. 1.1, Redlands, CA.

20. Environmental Systems Research Institute, Inc. 1998. Atlas GIS 4.0, Redlands, CA.

21. Gatrell, A. and Löytönen, M. 1998. GIS and health. London: Taylor & Francis, Ltd.

22. Geographic Data Technology, Inc. 2000. Dynamap 1000 Street Network File for Texas, ver. 8. Lebanon, NH.

23. Goodchild, M.F. 1987. A spatial analytical perspective on geographical information systems. Int J Geographical Information Systems. 1(4): 327-34.

24. Hjalmars, U.; Kulldorff, M.; Gustafsson, G.; Nagarwalla, N. 1996. Childhood leukaemia in Sweden: using GIS and a spatial scan statistic for cluster detection. Statistics in Medicine. 15:707-715.

25. Jacquez, G.M. 1998. GIS as an enabling technology. In: A. Gatrell and M. Löytönen (Eds.) GIS and health. pp. 17-28. London: Taylor & Francis, Ltd.

26. Kleinschmidt, I.; Bagayoko, M.; Clarke, G.P.Y.; Craig, M.; Le Sueur, D. 2000. A spatial statistical approach to malaria mapping. International Journal of Epidemiology. 29:355-361.

27. Klovdahl, A.S.; Graviss, E.A.; Yaganehdoost, A.; Ross, M.W.; Wanger, A.; Adams, G.J.; Musser, J.M. 2001. Networks and tuberculosis: an undetected community outbreak involving public places. Soc Sci and Med. 52: 681-694.

28. Krieger, N. 1992. Overcoming the absence of socioeconomic data in medical records: validation and application of a census-based methodology. American Journal of Public Health. 82(5): 703-710.

29. Kulldorff, M.; Rand, K.; Gherman, G.; Williams, G.; DeFrancesco, D. 1998. SaTScan v2.1: Software for the spatial and space-time scan statistics. Bethesda, MD: National Cancer Institute.

30. Kulldorff, M.; Feuer, E.J.; Miller, B.A.; Freedman, L.S. 1997. Breast cancer clusters in the Northeast United States: a geographic analysis. American Journal of Epidemiology. 146(2): 161-170.

31. Kulldorff, M. and Nagarwalla, N. 1995. Spatial disease clusters: detection and inference. Statistics in Medicine. 14:799-810.

32. Levine, N. 2000. CrimeStat: A Spatial Statistics Program for the Analysis of Crime Incident Locations (Vers. 1.1). Ned Levine & Associates, Annandale, VA, and the National Institute of Justice, Washington, DC.

33. Mathsoft, Inc. 1999. S-Plus 2000 Professional Release 1. Seattle, WA.

34. Mayer, J.D. 1983. The role of spatial analysis and geographic data in the detection of disease causation. Soc Sci Med. 17:1213-21.

35. McKenna, M.T.; McCray, E.; Jones, J.L.; Onorato, I.M.; Castro, K.G. 1998. The fall after the rise: tuberculosis in the United States, 1991 through 1994. Am J Public Health. 88:1059-63.

36. Moore, D.A. and Carpenter, T.E. 1999. Spatial analytical methods and geographic information systems: use in health research and epidemiology. Epidemiologic Reviews. 21(2): 143-61.

37. Moore, M.; Onorato, I.M.; McCray, E.; Castro, K.G. 1997. Trends in drug-resistant tuberculosis in the United States, 1993-1996. JAMA. 278:833-7.

38. Ormerod, L.P.; Charlett, A.; Gilham, C.; Darbyshire, J.H.; Watson, J.M. 1998. Geographical distribution of tuberculosis notifications in national surveys of England and Wales in 1988 and 1993: report of the Public Health Laboratory Service/British Thoracic Society/Department of Health Collaborative Group. Thorax. 53:176-181.

39. Pablos-Mendez, A.; Raviglione, M.C.; Laszlo, A. et al. 1998. Global surveillance for antituberculosis-drug resistance, 1994-1997. N Engl J Med. 338:1641-9.

40. Richards, T.B.; Croner, C.M.; Rushton, G.; Brown, C.K.; Fowler, L. 1999. Geographic information systems and public health: mapping the future. Public Health Reports. 114:359-373.

41. Rushton, G.; Armstrong, M.P.; Lynch, C.; Rohrer, J. 1997. Improving public health through geographical information systems: an instructional guide to major concepts and their implementation, vers 2.5. Iowa City, IA: The University of Iowa, Department of Geography (CD-ROM).

42. Rushton, G. and Lolonis, P. 1996. Exploratory spatial analysis of birth defect rates in an urban population. Statistics in Medicine. 15: 717-726.

43. Soini, H.; Pan, X.; Teeter, L.; Musser, J.M.; Graviss, E.A. 2001. Transmission dynamics and molecular characterization of Mycobacterium tuberculosis isolates with low copy numbers of IS6110. Journal of Clinical Microbiology. 39(1): 217-221.

44. Westlake, A. 1995. Strategies for the use of geography in epidemiological analysis. In: M.J.C. de Lepper et al. (Eds.) The added value of geographical information systems in public and environmental Health. pp. 135-144. The Netherlands: Kluwer Academic Publishers.

45. Whalen, C.; Horsburgh, C.R. Jr.; Hom, D.; Lahart, C.; Simberkoff, M.; Ellner, J. 1997. Site of disease and opportunistic infection predict survival in HIV-associated tuberculosis. AIDS. 11: 455-60.

46. Wilkinson, D. and Tanser, F. 1999. GIS/GPS to document increased access to community-based treatment for tuberculosis in Africa. Lancet. 354(9176):394-5.

47. Yaganehdoost, A.; Graviss, E.A.; Ross, M.W.; Adams, G.J.; Ramaswamy, S.; Wanger, A.; Frothingham, R.; Soini, H.; Musser, J.M. 1999. Complex transmission dynamics of clonally related virulent Mycobacterium tuberculosis associated with barhopping by predominantly human immunodeficiency virus-positive gay men. The Journal of Infectious Diseases. 180: 1245-51.

Author Information

Matthew L. Stone

Public Health and GIS Researcher

Center for Health Policy Studies

University of Texas-Houston School of Public Health

1200 Herman Pressler, Suite RAS E929

Houston, TX 77030

713-500-9395

713-500-9493(fax)

mstone@sph.uth.tmc.edu