Testing for Geographic Representativeness of Subsets of Large Data Sets

Frederick D. Busche, PhD.

Currently, when artificial intelligence algorithms and statistical treatments are used in discovery-based data mining to find patterns in customer behavior, it is necessary to create training data sets in which the outcome being predicted is known, as well as testing data sets in which the outcome is also known, so that the accuracy of the prediction can be validated. Generally, the training data sets are larger and are skewed toward a positive response associated with the behavior that is being predicted. The predictive algorithm might be designed to predict a customer's propensity to respond to an offer or to buy a product. Tests of randomness are completed for each attribute in the data sets to see whether the data represent a randomly selected population for that attribute. However, purchasing behavior is based not only upon the demographic and psychographic attributes held in the database; it may also be influenced by the customer's location with respect to a commercial establishment. Some validation of random selection with regard to location must therefore also be completed. If a competitive analysis is being performed, this validation must also include an assessment of each customer's proximity to a competitor's location.

Common practice today when using Intelligent Miner (IM) does not assess the location variable with regard to the random distribution of customers. It may therefore introduce a geographic bias into the test or train data set, making either or both unrepresentative of the overall customer data set. The same problem has been apparent in gold mining exploration for decades.

People tend to cluster in much the same way as gold tends to nugget: people of like demographic and psychographic backgrounds tend to live near one another. Therefore, a density assessment can be used to estimate the size of sample that should be taken to be geographically representative. This population density can be viewed by plotting the locations in the data set or sets on a map. Drive time can be used to assess the geographic similarity of the test and train data sets to the apply data set. Plotting the frequency distribution of drive times for the three data sets allows the distributions to be compared, and this comparison allows an assessment of the prediction error that may result from geographic bias. The approach has been tested on a data set of 100,000 persons and the results will be presented. This small test seems to indicate that if one is concerned about the influence of geography on the predictability of a result using common predictive techniques within Intelligent Miner, then the locations of the respective test, train, and apply data sets must be plotted to assess the density and geographic similarity of the distribution of data between the data sets.

Currently, when artificial intelligence algorithms are used to discover patterns in customer behavior, it is necessary to create training data sets in which the outcome being predicted is known, as well as testing data sets in which the outcome is also known, so that the accuracy of the prediction can be validated. Generally, the training data sets are larger and are skewed toward a positive response associated with the behavior that is being predicted. The predictive algorithm might be designed to predict a customer's propensity to respond to an offer or to buy a product.

The data used to populate the train and test data sets are drawn, by some random selection procedure, from the overall data for which the results are known. For example, a random number may be generated to select a row from the whole data set, so as to ensure that the attributes in both the train and test data sets are representative of the entire data population being evaluated. Tests of randomness can then be completed for each attribute in the data sets to see whether the data represent a randomly selected population for that attribute. However, purchasing behavior is based not only upon the demographic and psychographic attributes held in the database; it may also be influenced by the customer's location with respect to a commercial establishment. Some validation of random selection with regard to location must therefore also be completed. If a competitive analysis is being performed, this validation must also include an assessment of each customer's proximity to a competitor's location.
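
The random selection and attribute check described above can be sketched as follows. This is a minimal illustration in Python, not the procedure used in this work: the function names, the row-as-dictionary layout, and the use of a simple comparison of shares in place of a formal test of randomness are all assumptions.

```python
import random
from collections import Counter

def random_split(rows, train_fraction, rng=None):
    """Assign each row to train or test by drawing one random number per row."""
    rng = rng or random.Random()  # unseeded unless the caller supplies an rng
    train, test = [], []
    for row in rows:
        (train if rng.random() < train_fraction else test).append(row)
    return train, test

def attribute_shares(rows, attribute):
    """Share of rows taking each value of one attribute (hypothetical helper)."""
    if not rows:
        raise ValueError("rows must not be empty")
    counts = Counter(row[attribute] for row in rows)
    return {value: count / len(rows) for value, count in counts.items()}

def max_share_gap(subset, population, attribute):
    """Largest absolute gap in share between a subset and the whole population."""
    sub = attribute_shares(subset, attribute)
    pop = attribute_shares(population, attribute)
    return max(abs(sub.get(value, 0.0) - share) for value, share in pop.items())
```

A small gap for every attribute suggests the subset resembles the whole population on those attributes; it says nothing about location, which is the concern raised here.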

Common practice today does not assess the location variable with regard to the random distribution of customers. It may therefore introduce a geographic bias into the test or train data set, making either or both unrepresentative of the overall customer data set. The same problem has been apparent in gold mining exploration for decades. Because of its inert chemistry, gold is rarely distributed evenly through rock; it generally occurs as nuggets within a particular geologic formation or, if present in more than one formation, is only very sparsely disseminated through the rock. Mathematical models have been developed to calculate the sample size needed to adequately represent the overall population. These models, developed by Gy and Pitard as well as others as a result of exploration and mining work in the Witwatersrand and other gold mining districts in South Africa, showed that the sample size needed to be representative of a gold deposit could be determined with a formula that uses the shape factor as a function of density. Using this calculation, it becomes apparent that extremely large samples must be taken to keep the error acceptable. Thus, when a sample is collected to characterize and predict the overall gold grade, and consequently to estimate the total ounces of gold contained in the deposit, a very large sample must be taken for analysis to ensure that the gold content of the sample is representative of the whole-rock gold value.
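
For orientation, the relation usually quoted as Gy's formula for the fundamental sampling error is reproduced below. It is not given in this paper, and the symbols are supplied here as assumptions: f is the particle shape factor, g the size distribution factor, c the mineralogical composition factor (which depends on the densities and grade of the constituents), \ell the liberation factor, d the nominal top particle size, M_S the sample mass, and M_L the mass of the lot being sampled.

```latex
\sigma_{FE}^{2} = f \, g \, c \, \ell \, d^{3} \left( \frac{1}{M_S} - \frac{1}{M_L} \right)
\qquad\Longrightarrow\qquad
M_S \approx \frac{f \, g \, c \, \ell \, d^{3}}{\sigma_{FE}^{2}} \quad \text{when } M_L \gg M_S
```

Read this way, a low-grade, nuggety material (large c) combined with a small acceptable error variance drives the required sample mass up sharply, which is the behavior described above.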

People tend to cluster in much the same way. People of like demographic and psychographic backgrounds tend to co-locate, or nugget, with each other much as gold does. Therefore, a density assessment can be used to estimate the size of sample that should be taken to be geographically representative. This population density can be viewed by plotting the locations in the data set or sets on a map.

Once customer locations have been plotted on a map together with the brick-and-mortar location or locations of interest, the difficulty a person faces in traveling from home to such a location can be approximated by calculating the drive time to the closest location of interest. By comparing the drive times of the test and train data subsets, one can ensure that no geographic bias enters the development of the estimator and prejudices the predictions of the model when it is used on a data set for which the outcome is unknown. After the test and train data sets have been compared, the geographic distribution of points within them must be compared with that of the data set to which the predictive algorithm will be applied. Again, drive time can be used to assess the geographic similarity of the test and train data sets to the apply data set. Plotting the frequency distribution of drive times for the three data sets allows the distributions to be compared. This comparison allows an assessment of the prediction error that may result from geographic bias.
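
The method used to calculate drive time is not stated here. As a rough illustration only, the sketch below approximates it from the great-circle (haversine) distance to the nearest location of interest divided by an assumed average speed. A real analysis would use a road network; the function names and the 40 km/h default are hypothetical.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2)
    # Clamp to 1.0 to guard against floating-point overshoot near antipodes.
    return 2 * EARTH_RADIUS_KM * math.asin(min(1.0, math.sqrt(a)))

def drive_time_minutes(home, locations, speed_kmh=40.0):
    """Crude drive-time proxy: straight-line distance to the closest location
    of interest, divided by an ASSUMED average speed (not from the paper)."""
    nearest_km = min(haversine_km(home[0], home[1], loc[0], loc[1])
                     for loc in locations)
    return nearest_km / speed_kmh * 60.0
```

Applying `drive_time_minutes` to every point in the test, train, and apply data sets yields the three lists of drive times whose frequency distributions are then compared.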

A data set of 100,000 persons who voted in the last four elections in North Carolina was selected as a test data set for examining the geographic bias discussed above. Initially, the data set was divided into three sets, using an unseeded random number method to select the members of each. One set contained approximately 50% of the data; the others contained 30% and 20%, respectively. All three data sets were plotted on a map and their densities compared. For each point in each data set, the drive time to the same arbitrary point was calculated, and the data sets were compared by plotting the frequency distribution across nine classes of drive time ranging from less than 2 minutes to over 18 minutes.
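
The frequency distribution over drive time classes can be tabulated with a short routine such as the one below. The class boundaries are not given in the text; the edges used here are hypothetical and were chosen only so that they yield nine classes running from under 2 minutes to 18 minutes and over.

```python
from bisect import bisect_right

# ASSUMED class edges in minutes (not from the paper): eight edges give nine
# classes: <2, 2-4, 4-6, 6-8, 8-10, 10-12, 12-15, 15-18, and 18 or more.
CLASS_EDGES = [2, 4, 6, 8, 10, 12, 15, 18]

def class_percentages(drive_times, edges=CLASS_EDGES):
    """Percent of points falling in each drive time class.

    A time equal to an edge is placed in the class that starts at that edge.
    """
    if not drive_times:
        raise ValueError("drive_times must not be empty")
    counts = [0] * (len(edges) + 1)
    for minutes in drive_times:
        counts[bisect_right(edges, minutes)] += 1
    return [100.0 * count / len(drive_times) for count in counts]
```

Running this once per data set gives one list of nine percentages for each, which can be plotted side by side.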

As one might expect, the tails of the frequency distributions, because they have fewer members, showed the greatest absolute variability among the three data groups. For the less-than-2-minute drive time class, the percent of the population in the test and train data sets differed from that of the same class in the apply data set by 7.7% and 6.5%, respectively. When 10% and 20% subsets were extracted from the apply data set, the same class showed 13% and 8% variability, respectively, relative to the apply data set. The overall absolute percentage variability in the number of members across the nine drive time classes increased from 2.3% in the first case to 13.5% in the second.
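
The exact calculation behind these figures is not spelled out, so the sketch below shows two plausible readings rather than the method actually used: a difference in percentage points per class, and a difference relative to the reference class share. The function names and the `relative` switch are assumptions.

```python
def class_differences(subset_pct, reference_pct, relative=False):
    """Per-class absolute difference between two percentage distributions.

    relative=False: difference in percentage points.
    relative=True:  difference as a percent of the reference class share.
    """
    if len(subset_pct) != len(reference_pct):
        raise ValueError("distributions must have the same number of classes")
    diffs = []
    for sub, ref in zip(subset_pct, reference_pct):
        gap = abs(sub - ref)
        if relative:
            # An empty reference class has no meaningful relative difference.
            gap = 100.0 * gap / ref if ref else float("inf")
        diffs.append(gap)
    return diffs

def total_absolute_difference(subset_pct, reference_pct):
    """Sum of percentage-point differences over all classes."""
    return sum(class_differences(subset_pct, reference_pct))
```

Under either reading, sparsely populated classes in the tails tend to show the largest relative swings, consistent with the observation above.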

In conclusion, results from this small test seem to indicate that if one is concerned about the influence of geography on the predictability of a result using common predictive techniques, then the locations of the respective test, train, and apply data sets must be plotted to assess the density and geographic similarity of the distribution of the data. Second, the relative differences in the time it takes to travel from each data point to either a common point or a group of common points must be assessed for every point in all data sets. This analysis might be referred to as the "Degree of Difficulty" in making a purchase. It is necessary for adequately assessing how geographically representative the data sets are of the overall data population, and so for guarding against an unacceptable prediction error that may result from differences in the geographic distributions of the data sets.


Frederick D. Busche, PhD.
IBM Global Services
Dallas, Texas