<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN" "http://www.w3.org/TR/html4/loose.dtd">
<html>
<HEAD>
<TITLE>Sampling Design Issues in Identifying Breast Cancer Sufferers</TITLE>
</HEAD>
<BODY>
<H1>Sampling Design Issues in Identifying Breast Cancer Sufferers</H1>
<P>Julie McCormick and Larry Anderson, PhD</P>
<P>Because the reliability of a relationship is not an intuitive concept, we depend upon statistical procedures to test whether a sample is representative of the population. But this is also not intuitive. It would be intuitive if we could use maps to demonstrate representativeness. For example, we would expect a sample of breast cancer sufferers to be representative of breast cancer sufferers in the USA if they were dispersed but concentrated in known population centers. We would also expect the sample to be representative if it was drawn from a representative panel of 250,000 households. However, the intuitive map and representative panel are not adequate because it is possible that breast cancer sufferers are not distributed uniformly throughout the population. We need to accompany the map with an acceptable statistical measure of representativeness and confirm that the breast cancer sufferers found in our specific sample represent the entire population. In other words, how probable is it that a similar group of sufferers would be found if another sample were drawn from the same population? We are not interested only in what is going on in our sample; we are interested in the sample only to the extent that it can provide information about the population. If the sample meets specific criteria, then the reliability of a relationship between variables observed in our sample can be quantitatively estimated and represented using a standard measure (p-value or statistical significance level). We will review a variety of nonparametric statistical procedures, including the sign test and the Wilcoxon signed-rank test.
The sign test computes the differences between the two variables for all cases and classifies the differences as either positive, negative, or tied. If the two variables are similarly distributed, the number of positive and negative differences will not differ significantly and we can conclude that breast cancer sufferers are evenly distributed throughout the population. An extension of this review will be to vary the level of aggregation of respondents, e.g. MSA, county, or state level. Although no particular distributions are assumed for the two variables, the Wilcoxon signed-rank test assumes that the population distribution of the paired differences is symmetric. So, an alternative methodology would be to use the Kolmogorov-Smirnov Z test or the Wald-Wolfowitz test, which are more general tests that detect differences in both the locations and the shapes of the distributions. The Kolmogorov-Smirnov test is based on the maximum absolute difference between the observed cumulative distribution functions for both samples. When this difference is significantly large, the two distributions are considered different and we can assume that the breast cancer sufferers are not evenly dispersed throughout the population. In general, the larger the sample size N, the smaller the sampling error tends to be. By running the various statistical tests, we will know the level of detail that can be reached with the sample size. If N is too small, there is not much point in attempting to map the data because the results will not be representative.</P>
<HR>
<P>In this paper, the authors discuss the difficult task of determining appropriate sample sizes for statistical analysis and subsequent geographic mapping. Should only statistically significant differences be mapped, and should geographic units be aggregated until an adequate sample size is achieved? If so, what are the appropriate guidelines to follow?
To address these questions, the authors reviewed two studies that dealt with the issue of representativeness in different ways. The authors compare and contrast these studies to uncover appropriate statistical guidelines for mapping analysis.</P>
<P>First, the New York State Cancer Surveillance Improvement Initiative (see www.health.state.ny.us for a detailed description) is reviewed. This comprehensive mapping study has the goal of providing easy-to-understand information about cancer to New York State residents. To accomplish this goal, the state mapped breast cancer incidence at the zip code level and published maps of every zip code in the state comparing actual breast cancer incidence with the number of cases expected in each zip code. The second study reviewed is one in which the sample of breast cancer sufferers came from the 250,000-member Home Testing Institute representative panel of The NPD Group, Inc. (a market research firm based in Port Washington, NY). In both studies the issues are the same: how large a sample is required to support statistical inferences, and, if the sample is large enough, how can statistical significance be measured?</P>
<H2>Review #1: New York State Cancer Surveillance Improvement Initiative</H2>
<H3>Descriptive or Inferential Statistics?</H3>
<BLOCKQUOTE>The goal of this project is to provide New Yorkers with information about cancer. It will also guide future research on the causes of cancer and cancer prevention programs. The scientists taking part in the project are looking for the best ways to map where cancer patients live. The maps in this book show the comparison of the actual incidence for individual ZIP Codes with the expected incidence of this type of cancer for the ZIP Code.
</BLOCKQUOTE>
<P>The maps show which grouping a ZIP Code falls in:</P>
<P>- more than 100% above expected<BR>
- 50%-100% above expected<BR>
- 15%-49% above expected<BR>
- within 15% of expected<BR>
- 15%-50% below expected<BR>
- more than 50% below expected<BR>
- very sparse data</P>
<P>The goal of the study is clearly inferential, and it has led to understandable inferences by anyone living in a zip code that has above-expected levels of breast cancer. The researchers warn readers of the report that:</P>
<IMG SRC="p9491.jpg" ALT="New York State Breast Cancer Incidence By Zip Code">
<BLOCKQUOTE>The maps show two things. First, they show whether the cancer incidence for each ZIP Code in New York State is higher, lower or about the same as expected. Second, they show areas of elevated incidence. These are areas of the State where, as a whole, cancer incidence is higher than expected and the elevation in these areas is likely not to be due to chance. The maps do not show any of the risk factors for getting cancer, the reason why people got cancer, or why some areas have a higher than expected incidence of cancer.</BLOCKQUOTE>
<P>The maps were published in many newspapers and, despite the warnings in the report, the media often published these maps with attention-grabbing headlines about local communities with elevated breast cancer rates.</P>
<P>The American Cancer Society estimates that during 2000, 13,700 new cases of breast cancer will be diagnosed among women in New York and 3,100 women will die of breast cancer in New York.
According to the American Cancer Society, the average annual age-adjusted mortality rates for breast cancer per 100,000 persons, by race, for 1992-1996 were:</P>
<IMG SRC="p9492.jpg" ALT="New York vs National Breast Cancer Incidence By Race">
<P>It is not within the scope of this review to question how mortality rates are related to the incidence of breast cancer, but if they are related, it would be a simple task to include race in the calculation of expected cases in each zip code. While race was not included, age was.</P>
<BLOCKQUOTE>The cancer rates for the entire State and the number of people in each age group of a ZIP Code were used to calculate the number of people who would be expected to get this type of cancer. This calculation assumes that people in the ZIP Code have the same risk of getting cancer, and would get cancer at the same rate, as people everywhere in the State.</BLOCKQUOTE>
<P>It is also not within the scope of this review to question the variance between New York and national levels of mortality, but it might be suspected that, for example, Western New York and Western Pennsylvania would form a more homogeneous unit than Western New York and New York City. In any comparative research, the standard (in this case, the New York State average) must be defended and variances or confidence intervals explored. A simple procedure to calculate confidence intervals for a population proportion for large samples like the New York State sample would be:</P>
<IMG SRC="p9493.jpg" ALT="Formula and Confidence Intervals">
<P>This confidence interval would explain why the State limited the mapping to zip codes where the variance from the state average was plus or minus 15%, or between 174 and 235. It could be argued that an incidence outside this range is unexpected and should be mapped. However, this is only true if the sample size is large.
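</P>
<P>To illustrate why the 15% band is only meaningful for large samples, the following is a minimal Python sketch. It assumes the usual large-sample normal approximation for a proportion (p plus or minus z times the square root of p(1-p)/n); the formula in the figure above may differ. The function names are hypothetical, and the two sample sizes are illustrative figures chosen to share a rate of 200 per 100,000, not values taken from the New York report.</P>
<PRE>
```python
import math


def proportion_confidence_interval(cases, population, z=1.96):
    """Normal-approximation interval for a proportion (z=1.96 gives about 95%)."""
    p = cases / population
    half_width = z * math.sqrt(p * (1 - p) / population)
    return p - half_width, p + half_width


def relative_half_width(cases, population, z=1.96):
    """Half-width of the interval expressed as a fraction of the observed rate."""
    low, high = proportion_confidence_interval(cases, population, z)
    return (high - low) / 2 / (cases / population)


# Illustrative figures only: the same rate (200 per 100,000) at two sample sizes.
for cases, population in [(30, 15_000), (3_000, 1_500_000)]:
    low, high = proportion_confidence_interval(cases, population)
    print(
        f"{cases} cases / {population} women: "
        f"{low * 100_000:.0f} to {high * 100_000:.0f} per 100,000 "
        f"(half-width {relative_half_width(cases, population):.1%} of the rate)"
    )
```
</PRE>
<P>Under these assumptions, the half-width of the interval is roughly 36% of the rate when only 30 cases are observed, well beyond a 15% band, but under 4% when 3,000 cases are observed.</P>
<P>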
If the incidence of breast cancer in a zip code is small, it will be necessary to combine or aggregate the results of several zip codes or report that insufficient data were available.</P>
<H3>Level of Aggregation</H3>
<P>The highest level of aggregation would be the state average, so any segmentation below the state level would conceivably offer meaningful information provided the sample size was large enough. How large does the sample size need to be? Marketing research clients would not be happy if we reached conclusions with sample sizes of fewer than 30 people, or, in this case, 30 breast cancer sufferers (based on the diminishing change at the p=.01 value of the binomial distribution function). For low-incidence products, a major research expense is screening to find a large enough sample (incidence) to trust the results of the research. If the same standards were applied to mapping, no zip code would be shown that had fewer than 30 cases of breast cancer. Those zip codes with lower incidence would be combined with adjacent zip codes until a large enough sample was aggregated. Many of the zip codes mapped by the state would fail this standard.</P>
<P>Are there enough people in the Warren and Washington county zip codes to provide meaningful statistical results? On the following map, all of the zip codes that fail the minimum sample size are marked. Only three zip codes survive, and all have incidence rates "within 15% of expected." Note the two zip codes in the map below where the incidence is identified. When they are combined, they change from "100% above" and "50% below" to "within 15% of expected."</P>
<IMG SRC="p9494.jpg" ALT="Warren and Washington Counties">
<IMG SRC="p9495.jpg" ALT="Table for Warren and Washington Counties">
<H3>Validity of the Research</H3>
<P>While the sample size would affect the reliability of the inferences made, other issues would affect the validity of the study. For example, why was the state average used to compare zip codes?
Are the people who live along Lake Champlain more like people from Vermont than residents of Long Island? Should the residents of Lake Placid be compared with the NY state average, the VT state average, the US national average, the average for rural incidence, or some other average? An incidence reported as below average when compared to the New York average might be above average when compared to rural incidence levels. A more specific model would take into consideration the known risk factors, including race and age, and individual incidence levels would be calculated at the lowest possible level of aggregation (Y=30 or roughly 15,000 women at risk). An algorithm could be written that combines adjacent zip codes into incidence clusters that would identify significantly different incidences of breast cancer based on geography.</P>
<H2>Review #2: NPD's 250,000-member Home Testing Institute Representative Panel</H2>
<P>Because the reliability of a relationship is not an intuitive concept, we depend upon statistical procedures to test whether a sample is representative of the population. But this is also not intuitive. It is far more intuitive when maps are used to demonstrate the representativeness. For example, in the map below, we suspect that the identified breast cancer sufferers are representative of breast cancer sufferers in the USA because they appear to be widely dispersed, yet concentrated in known population centers.</P>
<IMG SRC="p9496.jpg" ALT="Breast Cancer Sufferers Nationwide">
<P>The more intuitive a map is, the more tempting it is to infer some meaning from the data. As in the review of the State of New York mapping, inferences may not be intended, but the maps are so compelling that it is naive to assume that inferences are not being made.
Therefore, it is the responsibility of the researcher to make certain that only substantiated inferences are made, recognizing that a footnote or small-print disclaimer is not adequate.</P>
<P>In this study, the sample sizes and necessary aggregation, per the previous recommendations, have been followed. Two issues will be reviewed: first, what was the source of the data, and second, what statistical procedure can be used to determine whether breast cancer incidence has a geographic component.</P>
<H3>Representative Panel</H3>
<P>We suspect that the sample should be representative because it was drawn from a representative panel of 250,000 households. However, the intuitive map and representative panel alone are not adequate because it is possible that breast cancer sufferers are not distributed uniformly throughout the population. We need to accompany the map with an acceptable statistical measure of representativeness and confirm that the breast cancer sufferers found in our specific sample represent the entire population. In other words, how probable is it that a similar group of sufferers would be found if another sample were drawn from the same population? We are not interested only in what is going on in our sample; we are interested in the sample only to the extent that it can provide information about the population. If the sample meets specific criteria, then the reliability of a relationship between variables observed in our sample can be quantitatively estimated and represented using a standard measure (p-value or statistical significance level).</P>
<IMG SRC="p9497.jpg" ALT="Breast Cancer Incidence By State Population">
<P>In the above scatter chart, the number of breast cancer sufferers on the NPD HTI Representative Sufferers panel is highly correlated with the state population.
Basing the results on the ranks of the states (1=DC and 49=California) offers an advantage in that the ranks, rather than the raw data, are statistically tested. Although it could be argued that a state's rank should not be calculated for states with fewer than 30 reported cases, once the ranks are assigned, the sample size is no longer an issue. Rank tests are one of a variety of nonparametric statistical procedures, including the sign test and the Wilcoxon signed-rank test. The sign test computes the differences between the two variables for all cases and classifies the differences as either positive, negative, or tied. If the two variables are similarly distributed, the number of positive and negative differences will not differ significantly and we can conclude that breast cancer sufferers are evenly distributed throughout the population.</P>
<IMG SRC="p9498.jpg" ALT="Density of Breast Cancer Sufferers">
<P>An extension of this review would be to vary the level of aggregation of respondents, e.g. MSA, county, or state level. Although no particular distributions are assumed for the two variables, the Wilcoxon signed-rank test assumes that the population distribution of the paired differences is symmetric. So, an alternative methodology would be to use the Kolmogorov-Smirnov Z test or the Wald-Wolfowitz test, which are more general tests that detect differences in both the locations and the shapes of the distributions. The Kolmogorov-Smirnov test is based on the maximum absolute difference between the observed cumulative distribution functions for both samples. When this difference is significantly large, the two distributions are considered different and we can assume that the breast cancer sufferers are not evenly dispersed throughout the population.</P>
<P>In general, the larger the sample size N, the smaller the sampling error tends to be. By running statistical tests, we will know the level of detail that can be reached with the sample size.
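</P>
<P>The sign test and the Kolmogorov-Smirnov statistic discussed above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names are hypothetical, the two lists are made-up percentage shares for eight regions rather than data from the panel, the sign test uses an exact two-sided binomial calculation with ties dropped, and the Kolmogorov-Smirnov function returns only the maximum difference between the empirical distribution functions, not a p-value. A statistical package would normally be used for the full tests.</P>
<PRE>
```python
import math


def sign_test(x, y):
    """Exact two-sided sign test for paired observations; ties are dropped."""
    differences = [a - b for a, b in zip(x, y)]
    positives = sum(1 for d in differences if d > 0)
    negatives = sum(1 for d in differences if 0 > d)
    n = positives + negatives
    if n == 0:
        return positives, negatives, 1.0
    k = min(positives, negatives)
    # Probability of a split at least this lopsided under a fair coin, doubled.
    tail = sum(math.comb(n, i) for i in range(k + 1)) / 2 ** n
    return positives, negatives, min(1.0, 2 * tail)


def ks_statistic(sample_a, sample_b):
    """Maximum absolute difference between the two empirical CDFs."""
    def ecdf(sample, t):
        return sum(1 for v in sample if t >= v) / len(sample)

    points = sorted(set(sample_a) | set(sample_b))
    return max(abs(ecdf(sample_a, t) - ecdf(sample_b, t)) for t in points)


# Hypothetical percentage shares for eight regions (not data from the panel).
panel_share = [12.1, 8.4, 6.3, 5.0, 4.2, 3.9, 3.1, 2.5]
population_share = [11.8, 8.9, 6.0, 5.2, 4.0, 4.1, 3.3, 2.4]

positives, negatives, p_value = sign_test(panel_share, population_share)
print(f"positive={positives} negative={negatives} p={p_value:.3f}")
print(f"KS D={ks_statistic(panel_share, population_share):.3f}")
```
</PRE>
<P>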
If N is too small, there is not much point in attempting to map the data because the results will not be representative.</P>
<IMG SRC="p9499.jpg" ALT="New York County">
<H2>Summary</H2>
<P>In the map of Manhattan, only four zip codes had fewer than 30 cases of breast cancer reported; however, only the Upper East Side of Manhattan had an incidence of breast cancer that was "not likely due to chance." That is, the observed number of cases in the identified zip codes of the Upper East Side is significantly higher than the expected number (based on the New York State average). If it is appropriate to compare urban high-rise apartment dwellers to an average that includes suburban and rural women, a few additional explanations for the higher incidence may be race and affluence. Of course, affluence is highly correlated with education, and women with higher levels of education may be more likely to understand breast cancer risk factors and be checked. It may also be found that affluence is correlated with the frequency and quality of mammograms. It is unlikely that environmental factors can be blamed for the increased incidence because all of the women in Manhattan live in close proximity to each other and are exposed to the same environment. (Obviously, understanding whether environmental factors contribute to breast cancer is a highly complex issue and beyond the scope of this paper.)</P>
<P>An alternative way to view the data from the Cancer Registry is to assign each breast cancer case a latitude and longitude in GIS. Based on that exact position, a cluster analysis can be performed using the distance from the breast cancer sufferer to the cluster center as the Mahalanobis distance. This technique allows for the identification of homogeneous clusters, rather than arbitrarily using zip codes to define the study area.
A density measure (incidence) will identify which clusters have a higher than expected level of breast cancer sufferers.</P>
<P>The dilemma for the State of New York, and much of the criticism in this review, is that the research is based on reported incidence and not on a random sample of women who have the same likelihood of being screened for breast cancer. In the second review, 250,000 women who were selected for the HTI for reasons unrelated to health issues were asked to report whether they were suffering from breast cancer. The sample size (250,000) is too small to allow for local inferences about breast cancer, but it is large enough to provide a benchmark for breast cancer screening. Does screening start at an earlier age for affluent women? Are women who have health insurance more likely to have a mammogram?</P>
<P>Because maps are so intuitively obvious (it is easy for people to incorrectly infer meaning), the mapping of statistical findings should be statistically rigorous. A goal of the New York study was to determine whether breast cancer incidence was higher in those zip codes that were proximate to known environmental hazards such as nuclear power plants, hazardous waste sites, or industrial pollutants. There may be cases in New York where this can be demonstrated, but it will be difficult to prove until we understand why women on one side of the street have an elevated incidence of breast cancer while those on the other side are below expected.</P>
<HR>
Julie McCormick<BR>
Project Director<BR>
The NPD Group, Inc.<BR>
900 West Shore Rd.<BR>
Port Washington, NY 11050<BR>
(516) 625-4848<BR>
(516) 625-2329<BR>
julie_mccormick@npd.com<BR>
Larry Anderson, PhD<BR>
Director<BR>
The NPD Group, Inc.<BR>
900 West Shore Rd.<BR>
Port Washington, NY 11050<BR>
(516) 625-4149<BR>
(516) 625-2329<BR>
larry_anderson@npd.com<BR>
</BODY>
</HTML>