Steve Wise, Jingsheng Ma, and Bob Haining

SAGE - A System for the Interactive Analysis of Area-based Health Data linked to ArcInfo


GIS packages, such as ArcInfo, have very good facilities for many types of analysis, but are currently weak in the statistical analysis of spatial data and the use of scientific visualisation techniques. This paper describes the development of a system based around ArcInfo, which provides a wide range of facilities for the analysis of area-based health data. The system, called SAGE (Spatial Analysis in a GIS Environment), consists of purpose-written graphical and statistical software which calls ArcInfo, running as a server, to perform certain operations such as the provision of the data to be analysed, cartographic display and some GIS operations such as polygon dissolve. The maps produced by ArcInfo can also be linked to other graphs and data displays produced by SAGE. For example, highlighting a polygon on the map causes not only the relevant row in the attribute display to be highlighted, but also the corresponding data points on any graphs drawn using the data. SAGE exploits the ability of ArcInfo to use client/server computing to produce a flexible system, providing a range of analytical tools for spatial data held in ArcInfo. The facilities provided include a range of graphical display techniques, the ability to create purpose-built regions from basic spatial units such as census areas, and statistical techniques ranging from simple summary statistics to the fitting of regression models specially modified to deal with spatial data. Although developed with health applications in mind, the system would probably be of use in the analysis of many other area-based datasets.


Introduction

The analysis of spatial data has always been one of the principal strengths of Geographic Information Systems. For instance the earliest operational GIS, the Canadian GIS, was developed to allow the analysis of large amounts of environmental data. The commonest type of analytical operation in GIS involves the manipulation of coverages to produce new information or further coverages (Burrough 1986). The simplest example is where a single coverage is modified in some way to produce new information - for example a DTM might be used to generate a map of slope steepness, which might in turn be processed to select those areas above a certain angle. More complex operations can be performed by combining coverages, as in the various types of overlay operation, which are commonly used in sieve mapping.

ArcInfo is particularly rich in these kinds of cartographic modelling techniques, especially as it has the capability of performing vector or raster-based analyses. In addition, it contains functions for more specialised types of analysis, the two best examples being the hydrological modelling tools within GRID, and the network routing and location/allocation tools in NETWORK.

However the analytical capabilities of current GIS, including ArcInfo, are still limited in two areas:

Scientific visualisation

This term describes the use of interactive graphical techniques for the analysis of data, and was originally coined in respect of the analysis of vast amounts of information created by simulation programs in the physical sciences. Many of the display techniques used in visualisation, such as statistical graphs, maps, 3D views and block diagrams, are standard graphical devices. What makes visualisation different from simply producing graphs of data is that these tools are provided in a highly interactive environment which allows the user to explore the data by using a wide range of different graphical techniques. Some of the techniques are by their very nature unsuitable for the production of hard copy graphics - for example the use of animation or the ability to rotate 3D views in real time. A particularly powerful feature of some visualisation packages is the ability to link different views of the same data, so that highlighting an extreme data point on a scatter plot will highlight the location of that sample on a map.

Statistical analysis of spatial data

The statistical capabilities of many GIS packages are quite poor, even for producing standard summary statistics such as measures of central tendency and spread. ArcInfo actually has slightly better facilities for raster data, since GRID can calculate some statistical measures, and calculate correlations between coverages. However, there is also a wide range of more sophisticated statistical techniques for analysing patterns in spatial data and relationships between variables, very few of which are available in standard GIS packages (Haining 1990, Bailey and Gatrell 1995).

These limitations of the current analytical toolkit of GIS have been most strongly noted by academic researchers (Goodchild et al 1992) and some have argued that such facilities are too specialised to be of interest to the majority of GIS users. While it may be true that for many, a GIS is essentially a data management tool, there are also important GIS application areas where the analysis of data is important. One such area is in crime pattern analysis (CPA), where the analysis of patterns in the vast amounts of information stored by the police on the occurrence of crime can often lead to useful insights in tracking down the perpetrators of crime, or in planning preventative measures, and some work has been done to develop CPA facilities linked to ArcInfo (Cross and Openshaw 1991).

Another area where analysis is important is in the field of health which is the focus of this paper. There is a long history of analysing the spatial patterns of ill-health as a guide to causative factors, dating from Snow's pioneering work on cholera in London (Snow 1854) right up to present day studies of the evidence of the link between environmental pollution and some cancers suggested by the clustering of cases in certain areas (Bailey and Gatrell 1995). For this kind of analysis it is not enough to be able to plot a map of the cases of a disease. The human eye is notoriously prone to seeing clusters in randomly scattered points, and numerical techniques are needed to allow for the variation in the location of the background population for example and test whether a cluster genuinely exists.

The mapping of incidence rates of disease, or of uptake rates of services is also important in the management of health resources, a topic which is becoming increasingly important with the growing pressure on health services in many countries. In the UK, the need for efficient management has been increased by the separation of the health sector into providers and purchasers, which has generated a need for both groups to manage their resources as cost effectively as possible.

This is not to suggest that all health data analysis requires sophisticated techniques - for example the production of standard choropleth or dot density maps for standard reporting areas is still a very useful tool for many health professionals. However, we would suggest that the use of interactive visualisation techniques and spatial statistics is not limited to academic research, but has a real relevance to all those with an interest in health data.

In this paper, we describe the development of a software system called SAGE (Spatial Analysis in a GIS Environment) which is designed to assist in the analysis of area-based health data. In the next section we briefly describe some of the other work which has been done on the provision of extra analytical facilities for spatial data. This is followed by a description of the approach we have taken which we believe combines the best elements of previous work. The architecture of SAGE is briefly described, followed by an outline of the facilities provided, together with some examples of their use.

Linking GIS and Spatial Analysis

A number of reviews of the work on providing GIS with better spatial analysis facilities have already been written, and the reader is referred to these for a complete review of the field (Bailey 1994, Haining et al 1996). Here we concentrate on outlining the main approaches taken, identifying their strengths and weaknesses.

One of the earliest systems to illustrate the potential benefits of interactive visualisation tools for the analysis of spatial data was developed by Haslett et al (1990). The system ran on the Macintosh and was a purpose-written package which provided the ability to produce multiple linked views of a set of data. In an early example, a map of geochemical sample locations in one window was linked to a scatter plot of the copper and zinc contents of the samples. When a series of particularly anomalous values on the scatter plot was highlighted, the relevant locations on the map were automatically highlighted too, indicating that all the samples came from the same region. Since then, a number of workers have developed similar systems using different programming environments to speed up the development process (Dykes 1995, Brunsdon and Charlton 1995).

Although such systems can produce a wide range of graphic displays, and can link these together, they all suffer the drawback that the data must be imported into them from the GIS. This has a number of problems:

The design of SAGE

In designing SAGE, our aim was to provide a comprehensive set of tools to assist in the analysis of health data. The tools are described in more detail below, but they include both interactive visualisation methods, including the linking of graphical, tabular and cartographic displays, plus more quantitative tools such as exploratory and confirmatory statistics. In addition we wanted to make use of existing software capabilities wherever possible, which meant that it was planned to use the GIS for the storage and cartographic display of the data, and use other packages for some of the other functions, such as tabular display of the attribute data, statistical calculations etc.

These different packages are presented to the user via a consistent user interface as shown in Figure 1.

Figure 1: Typical screen layout during a SAGE session.

Four windows are shown. In the top left hand corner is the window from which the operation of SAGE is controlled - this is a tabular display of the attributes associated with the ArcInfo polygon coverage. The coverage itself is displayed in map form in the ARCPLOT window on the upper right. Notice that two of the polygons on the map are highlighted - these correspond to two of the rows in the table which are also highlighted (one is visible in Figure 1 - the other is further down the table). Selecting one or more rows of the table will cause the relevant polygons to be highlighted on the map - the selection can be a manual one or the result of an SQL-like query. Conversely, selecting one or more polygons on the map will cause the relevant rows in the table to be highlighted - again the selection can be manual or the result of a spatial query. All these functions can of course be performed within ArcInfo itself - however in SAGE, this linkage extends to other windows such as the scatter plot in the lower left hand corner - although it may not be very clear, the three uppermost points on the plots have been selected, and it is this selection which has actually caused the rows in the table and the polygons on the map to be highlighted. Other windows could also be opened up and these would also be linked to the existing views. For instance we might decide to see whether the outliers on the scatter plot related to polygons with small population values (and hence potentially unreliable incidence rates) and this could be checked by creating a window with a box plot of the population values for each polygon.
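The linking behaviour described above can be illustrated with a minimal sketch. It assumes a single shared selection that every window observes; the class and method names (SelectionModel, View, user_picks) are hypothetical and are not taken from SAGE itself.

```python
class SelectionModel:
    """Shared set of selected polygon IDs that every view observes."""

    def __init__(self):
        self._selected = set()
        self._views = []

    def attach(self, view):
        self._views.append(view)

    def select(self, ids, source=None):
        """Replace the selection and notify every view except the originator."""
        self._selected = set(ids)
        for view in self._views:
            if view is not source:
                view.highlight(self._selected)


class View:
    """Stand-in for a table, map or scatter plot window (hypothetical)."""

    def __init__(self, name, model):
        self.name = name
        self.model = model
        self.highlighted = set()
        model.attach(self)

    def highlight(self, ids):
        self.highlighted = set(ids)

    def user_picks(self, ids):
        # The originating view highlights its own picks, then tells the model,
        # which passes the selection on to all the other views.
        self.highlight(ids)
        self.model.select(ids, source=self)


model = SelectionModel()
table = View("table", model)
map_view = View("map", model)
scatter = View("scatter plot", model)

# Picking three points on the scatter plot highlights the same records
# in the table and on the map.
scatter.user_picks({12, 40, 77})
print(sorted(table.highlighted), sorted(map_view.highlighted))
```

Because every window talks only to the shared selection, a further window such as a box plot could be attached without changing any of the existing ones.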

As explained above, the system consists of a number of pieces of software, linked together using the client/server mechanism. The architecture of the system is shown in Figure 2.

Figure 2: Architecture of SAGE

In all, three pieces of software are involved, each shown as a row in the diagram: ArcInfo, running in server mode, a spatial statistical analysis (SSA) package and a Linking Interface (LI) both running as clients. As shown by the columns, both ArcInfo and the SSA can be thought of as having three elements - a user interface, an operational module that performs any computation required and an interface which allows the packages to communicate both internally and with external processes.

The system is controlled from the SSA, the heart of which is a purpose-written C program. However a great deal of use has been made of existing software in the construction of the SSA - for example the software which produces the tabular display in Figure 1 is a modified version of a public domain package and many of the graphical plots are produced by public domain code. This software re-use is possible because all these packages are written using object-oriented programming languages, which allows them not only to be used by other packages, but also to have their basic functionality modified.

When a user selects a row from the table, this information is passed on to the SSA interface. This keeps track of all the other displays which are currently active, and sends out the appropriate instructions for these to be updated to highlight the selected set of records. In the case of the map display, this means that the SSA sends a request to the LI, which translates this into a set of ArcInfo commands (largely calls to purpose-written AMLs) which are transmitted as requests to the ArcInfo server. Communication between the LI and both the SSA and ArcInfo is performed using the standard UNIX facilities of named pipes and RPC. ArcInfo provides an Inter-Application Communication mechanism (Esri, 1994) which could be used instead, and which would allow true distributed computing in that an SSA module running on one computer could communicate with an ArcInfo server running on another one. However, the current version of IAC only has a small communications buffer, making it unsuitable for transferring the results of complex requests.

One of the advantages of the SAGE architecture is that it uses data which is stored in ArcInfo, which means that all the normal functions of data entry, data editing and manipulation are automatically available (although not via SAGE itself - the point is that the data does not need to be exported from ArcInfo in order to be used by SAGE). However, when SAGE is running it does in fact take a local copy of some of the information held by ArcInfo, namely selected attribute values from the Polygon Attribute Table and the topology data from the Arc Attribute Table. This is done for three reasons:

  1. It speeds the system up because it can work with data held in memory.
  2. It allows changes to be made to the data without necessarily altering the actual data held in ArcInfo. A temporary INFO table is established at the start of the session, which is linked to the coverage PAT using a RELATE. Any changes, such as new attribute columns, are thus added to this table, and only added to the original PAT if requested by the user. This is important because many of the analytical techniques create new attribute columns which are only useful for the purpose of the analysis and don't need to be kept - one example would be the column of residual values which can be created when a regression is fitted between two of the variables.
  3. To allow the construction of contiguity information required by certain spatial analysis functions.
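The third reason can be illustrated with a short sketch of how contiguity information might be derived from arc topology. It assumes each arc record supplies the identifiers of the polygons on its left and right (as the LPOLY# and RPOLY# items of an Arc Attribute Table do), and that the outside "universe" polygon has ID 1. The function name and sample records are hypothetical, not SAGE's own code.

```python
from collections import defaultdict


def build_contiguity(arcs, universe_id=1):
    """Return polygon -> sorted list of neighbouring polygons.

    arcs is an iterable of (left_polygon, right_polygon) pairs, one per arc.
    Arcs bordering the universe polygon, or with the same polygon on both
    sides, do not define a neighbour relationship and are skipped.
    """
    neighbours = defaultdict(set)
    for left, right in arcs:
        if left == right or universe_id in (left, right):
            continue
        neighbours[left].add(right)
        neighbours[right].add(left)
    return {poly: sorted(adj) for poly, adj in neighbours.items()}


# Hypothetical arc records: polygons 2, 3 and 4 lie in a row.
arcs = [(2, 3), (3, 4), (2, 1), (4, 1), (3, 3)]
contiguity = build_contiguity(arcs)
print(contiguity)
```

Two polygons count as neighbours here if they share at least one arc. A polygon with no neighbours, such as an island, would not appear in the result and would need separate handling.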

Functionality of SAGE

As stated above, SAGE has been designed to support the analysis of area-based health data and so a wide range of analytical tools are provided. The main ones can be grouped together under three headings for the purposes of discussion, although there is some overlap between these categories and they do not actually cover every element of the system.

Exploratory tools

These include the facility of the linked graphical windows which has already been described above. Three types of graphical display are supported:

Cartographic

This makes use of the excellent facilities of ARCPLOT, and one of the strengths of SAGE is that there has been no need to write any software for this purpose.

Tabular

This is the spreadsheet-like display shown in the top left of Figure 1. This package holds a local copy of information from the coverage PAT.

Statistical

A wide range of statistical graph types are provided, including histograms, box plots and scatter plots.

In a sense these facilities can be used as analytical tools in their own right, to explore the data, identify patterns, relationships and outliers and suggest hypotheses which can then be tested using some of the other techniques. In addition they support many of the other types of analysis, as will be described below.

Classification and Regionalisation

A common problem with the analysis of area-based data is that the basic spatial units (bsus) for which the data are available are not well suited to the type of analysis. In the case of health data, the basic spatial units are usually census areas such as the UK Enumeration Districts (EDs). These are relatively small, which can cause sensitivity problems in some cases. For example, in a study of colorectal cancer in Sheffield, Haining et al (1994) were dealing with a disease with approximately 300 cases per year distributed over nearly 1100 EDs. Simply calculating incidence rates on an ED basis was not advisable because the results would be very sensitive to small errors in the data - a single case assigned to the wrong ED because of a mis-diagnosis or an error in geocoding could effectively double the apparent rate in that ED.
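The sensitivity problem can be made concrete with a little arithmetic. The case and ED counts below come from the study quoted above; the ED population of 500 is purely a hypothetical figure chosen for illustration.

```python
cases_per_year = 300
n_eds = 1100

# On average each ED sees well under one case per year.
mean_cases_per_ed = cases_per_year / n_eds


def rate_per_100k(cases, population):
    """Crude incidence rate per 100,000 population."""
    return 100_000 * cases / population


ed_population = 500  # hypothetical ED population, not from the study

true_rate = rate_per_100k(1, ed_population)        # one genuine case
rate_with_error = rate_per_100k(2, ed_population)  # plus one mis-assigned case
ratio = rate_with_error / true_rate

print(round(mean_cases_per_ed, 2), true_rate, rate_with_error, ratio)
```

With so few cases per ED, a single mis-assigned case changes the apparent rate by a factor of two, whatever the population of the ED.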

One solution to this problem is to use larger areal units. However the next standard unit in the UK, the ward, is too heterogeneous and may be too large. What is needed is the ability to group the basic spatial units into purpose-built regions and so SAGE includes a suite of regionalisation tools. These are described in more detail elsewhere (Wise et al, 1996) but they provide the ability to construct regions which satisfy any combination of the following three criteria:

Homogeneity

It will usually be important that the bsus which are merged into regions are similar in terms of one or more attributes. If studying the relationship between health and deprivation for example, it makes sense to do this using areas which have relatively uniform levels of deprivation, rather than being made up of a mixture of affluent and deprived areas.

Equality

When calculating incidence rates for health data, it is useful if the regions have similar populations, and so SAGE has an option to create regions which have equal values of some attribute - population is the clearest example but the user is given complete freedom to select any of the columns in the attribute table to be equalised.

Compactness

Simply grouping similar bsus may produce regions which have strange shapes. In some cases it may be desirable to constrain the shape, and try and produce regions which are reasonably compact - this will be important if the regions may be used for administrative purposes for example, but it may also accord with one's intuitive notion that natural regions within cities for example will form compact zones, possibly centred on some focal area.

Since these criteria are often competitive - forcing regions to be compact will almost certainly mean merging bsus which differ more than if homogeneity is the only criterion - they may each be weighted from 0 to 100% in importance.
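One way to picture the weighting is as a single cost function in which each criterion contributes a penalty scaled by its weight. The sketch below is an assumption made for illustration only: the paper does not give SAGE's actual objective function, and the penalty definitions, function name and data layout here are hypothetical.

```python
from statistics import mean, pvariance


def regionalisation_cost(units, assignment, weights):
    """Weighted cost of a regionalisation (lower is better).

    units:      unit id -> dict with keys "value", "pop", "x", "y"
    assignment: unit id -> region label
    weights:    dict with keys "homogeneity", "equality", "compactness",
                each a fraction between 0 and 1
    """
    regions = {}
    for uid, region in assignment.items():
        regions.setdefault(region, []).append(units[uid])

    # Homogeneity: within-region variance of the chosen attribute.
    homogeneity = sum(pvariance([u["value"] for u in m]) for m in regions.values())

    # Equality: squared deviation of each region's total from the mean total.
    totals = [sum(u["pop"] for u in m) for m in regions.values()]
    target = mean(totals)
    equality = sum((t - target) ** 2 for t in totals)

    # Compactness: squared distance of unit centroids from the region centroid.
    compactness = 0.0
    for m in regions.values():
        cx = mean(u["x"] for u in m)
        cy = mean(u["y"] for u in m)
        compactness += sum((u["x"] - cx) ** 2 + (u["y"] - cy) ** 2 for u in m)

    return (weights["homogeneity"] * homogeneity
            + weights["equality"] * equality
            + weights["compactness"] * compactness)


# Hypothetical data: two low-deprivation and two high-deprivation units.
units = {
    1: {"value": 2.0, "pop": 100, "x": 0, "y": 0},
    2: {"value": 2.0, "pop": 100, "x": 1, "y": 0},
    3: {"value": 8.0, "pop": 100, "x": 5, "y": 0},
    4: {"value": 8.0, "pop": 100, "x": 6, "y": 0},
}
weights = {"homogeneity": 0.5, "equality": 0.5, "compactness": 0.0}

similar_together = {1: "A", 2: "A", 3: "B", 4: "B"}
mixed = {1: "A", 3: "A", 2: "B", 4: "B"}

cost_similar = regionalisation_cost(units, similar_together, weights)
cost_mixed = regionalisation_cost(units, mixed, weights)
print(cost_similar, cost_mixed)
```

In practice the three penalties are measured in different units, so they would need to be rescaled before the weights could be read as percentages of importance.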

Regionalisation is a long-standing research topic in many areas and it is well known that it is very difficult to find the best possible regionalisation given a set of bsus (partly because the number of possible regionalisations is enormous, so that it is impossible to try them all to find the best (Cliff et al 1975)). One of the strengths of using SAGE is that it is possible to construct several different regionalisations and compare them using some of the exploratory tools described above.

Figure 3: 1981 Census Enumeration Districts of Sheffield

Figure 3 shows a map of the 1981 EDs for Sheffield, for which information on population and on deprivation, measured using the Townsend index, was available. The regionalisation tools were used to produce the 30 regions shown in Figure 4, based on two equally-weighted criteria: (1) homogeneity in terms of deprivation and (2) equality in terms of population.

Figure 4: 30 regions constructed from EDs. Criteria used were that deprivation within regions should vary as little as possible and that regions should have equal population.

The regionalisation process produces a new column in the attribute table held in SAGE, which indicates which region each ED belongs to. At this stage the polygons are not dissolved to form new polygons for the regions which means it is possible to use the graphical tools of SAGE to look at the results of the regionalisation before creating a new coverage. From the map it can be seen that without a compactness criterion, the regions are rather strange shapes in some cases.

Figure 5: Total population within the 30 regions in Figure 4.

Figure 5 is a histogram of the total population which each new region would have - the graph drawing tools have an option to produce a graph of one variable (population in this case) grouped according to another (the region ID). This shows that the majority of the regions have population counts which are very similar, but that two regions would have populations which are much larger. If this was a problem, then it would be possible to go back and redo the regionalisation with the population equality criterion given a greater weighting - this would produce another new column in the attribute table which could be graphed in the same way for comparison.
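The grouping operation behind Figure 5 amounts to summing one attribute column according to the values in another. A minimal sketch follows, using hypothetical rows and column names rather than the Sheffield data.

```python
from collections import defaultdict


def total_by_group(rows, value_key, group_key):
    """Sum the value column separately for each value of the group column."""
    totals = defaultdict(int)
    for row in rows:
        totals[row[group_key]] += row[value_key]
    return dict(totals)


# Hypothetical attribute table: one row per ED, with the region ID column
# added by the regionalisation step.
rows = [
    {"ed": "A01", "region": 1, "population": 420},
    {"ed": "A02", "region": 1, "population": 380},
    {"ed": "A03", "region": 2, "population": 510},
]

region_population = total_by_group(rows, "population", "region")
print(region_population)
```

Running the same function against the region column from a second regionalisation would give directly comparable totals.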

Figure 6: Inter quartile range of deprivation levels in 30 regions in Figure 4

Figure 6 shows a graph designed to assess the homogeneity of each new region. The inter-quartile range of the ED deprivation scores within each region has been calculated and plotted, again using the SAGE histogram tool. The inter-quartile range for the whole of Sheffield was 4.66. Many of the new regions have values below 2, which indicates that they are reasonably homogeneous, although there are a few with rather large values. Again, depending on the nature of the analysis to be undertaken, it might be desirable to try the regionalisation with a greater weight given to homogeneity.

The key point here is not that SAGE will guarantee to produce the optimum regionalisation, but that it provides the tools to explore a range of different ones and assess how suitable they will be for the particular purposes of the analysis. When a suitable regionalisation is found, the final step is to create a new coverage by merging all the bsus which belong to the same region. This is another reason why the link to ArcInfo is so useful, since this operation can be performed using ArcInfo's DISSOLVE command - in a free-standing package this functionality would have to be written from scratch.
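The homogeneity check plotted in Figure 6 could be reproduced outside SAGE along the following lines. This is a sketch using Python's standard statistics module with hypothetical deprivation scores. Packages define quartiles in slightly different ways, so the values need not match those SAGE would report.

```python
from statistics import quantiles


def iqr_by_region(scores):
    """Inter-quartile range of the scores in each region.

    scores maps a region ID to the list of ED scores in that region.
    Each region needs at least two scores for quartiles to be computed.
    """
    result = {}
    for region, values in scores.items():
        q1, _median, q3 = quantiles(values, n=4)
        result[region] = q3 - q1
    return result


# Hypothetical deprivation scores for the EDs in two candidate regions.
scores = {
    1: [1, 2, 3, 4, 5, 6, 7],
    2: [2.0, 2.5, 3.0, 3.5],
}

region_iqr = iqr_by_region(scores)
print(region_iqr)
```

Regions whose inter-quartile range is small relative to that of the whole study area are the more homogeneous ones.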

Spatial Statistics

As already seen, SAGE provides the facilities for undertaking standard statistical operations, such as calculating summary statistics, producing statistical graphs and fitting regressions. However, it is well known that spatial data has particular properties which can create problems for certain standard statistical methods (Haining 1990). It is commonly the case that values for a particular variable will be very similar for neighbouring areas, reflecting the underlying spatial trends in the phenomenon in question. For example, deprivation levels will vary in broad patterns across a city, and will not change abruptly at ED or ward boundaries. This tendency of neighbouring areas to have similar characteristics is called spatial autocorrelation, and where it is present it violates one of the basic assumptions of most classical statistical techniques, that the sample values should be independent of one another. This in turn can render invalid many of the significance tests which are normally performed.

A range of methods have been developed to deal with these problems (Haining 1990). Many of the methods require some knowledge of the spatial autocorrelation in the data, or of the connectivity between areas (i.e. which areas are neighbours of which). This is another good reason to link a system like SAGE to a GIS such as ArcInfo, because the topological information held in an ArcInfo polygon coverage can be used to provide this information (Ding and Fotheringham 1992).

SAGE therefore contains facilities for constructing the connectivity matrix for a given set of polygons, for calculating standard measures of spatial autocorrelation such as Moran's I and for fitting regression models which can take account of spatial autocorrelation in the data. These are facilities which are not available in most statistical packages, let alone in GIS packages!
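As an illustration of the kind of calculation involved, the sketch below computes Moran's I from a set of area values and a neighbour list, using binary weights (1 for neighbours, 0 otherwise). This is the textbook form of the statistic written in plain Python, not SAGE's own implementation; the function name and example data are hypothetical.

```python
def morans_i(values, neighbours):
    """Moran's I with binary contiguity weights.

    values:     area id -> attribute value
    neighbours: area id -> iterable of neighbouring area ids (symmetric)
    """
    n = len(values)
    mean_value = sum(values.values()) / n
    dev = {k: v - mean_value for k, v in values.items()}

    # S0 is the sum of all weights, i.e. the number of neighbour links
    # counted in both directions.
    s0 = sum(len(adj) for adj in neighbours.values())

    # Cross-products of deviations for every pair of neighbours.
    cross = sum(dev[i] * dev[j] for i, adj in neighbours.items() for j in adj)
    denom = sum(d * d for d in dev.values())
    return (n / s0) * (cross / denom)


# Four hypothetical areas arranged in a row: 1 - 2 - 3 - 4.
neighbours = {1: [2], 2: [1, 3], 3: [2, 4], 4: [3]}

smooth_trend = morans_i({1: 1.0, 2: 2.0, 3: 3.0, 4: 4.0}, neighbours)
alternating = morans_i({1: 1.0, 2: 4.0, 3: 1.0, 4: 4.0}, neighbours)
print(smooth_trend, alternating)
```

A smooth trend across neighbouring areas gives a positive value, while a pattern in which neighbours alternate between high and low values gives a negative one. Testing the significance of the result requires further steps not shown here.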

Discussion

The preceding section has hopefully given a flavour of the range of facilities available in SAGE, and the way in which they might be used to undertake an analysis of health data. The types of analysis which such a system allows will range from largely exploratory approaches, using the graphical tools and the ability to link the different views to simply look at patterns and relationships in the data, through to a more formal approach which might begin by constructing a set of areal units with desirable properties of homogeneity and equality of population, establish the strength of a relationship between health and other factors by fitting a regression model, and then use the graphical tools to look at outliers from this model.

There are often calls for extra facilities to be added to GIS, which is partly a reflection of the widespread use of spatial data in many areas of research and commerce, and the great interest in the power of GIS software for handling spatial data. However, it is not always possible or even advisable simply to add in extra functionality to an existing package such as ArcInfo. Apart from anything else, there is the risk of turning an already large system into an unwieldy giant, too complex for anyone to hope to master.

The approach described here for adding extra functionality to ArcInfo has a number of advantages, which may make it of interest as a general mechanism for linking specialised software to a general purpose GIS. There are two advantages in any approach which links new software to GIS, rather than simply importing data from it:

  1. It avoids duplicating the GIS database.
  2. It means that all the standard facilities of the GIS, such as cartographic display and standard GIS analysis operations such as polygon dissolve, are automatically available.

Performing the link via a client/server architecture can also make the resulting system very flexible, as has been illustrated in the case of SAGE, and provides the potential for a distributed strategy with interactive graphical software running on a desktop PC and communicating with a GIS server running on a powerful central computer. This type of approach is likely to become more common in the future, but there are currently two impediments to its widespread adoption:

  1. The lack of GIS packages offering the possibility of client/server working. ArcInfo has taken a lead in this respect, although there are still limitations in the current version such as the small communications buffer with the IAC mechanism.
  2. The lack of a standard GIS language. When linking packages to relational databases, for example, it is possible to use a standard language, SQL, which makes it easy to interface software to more than one database. The same is not true of GIS - if SAGE were to be linked to a GIS other than ArcInfo, this would require the rewriting of the parts of the system which translate user requests into commands for the GIS, and which deal with the information which is returned.

It will be interesting to see whether future developments remove these impediments.

Conclusion

In this paper we have described the design and implementation of a software system for the analysis of area-based health data which combines the power of ArcInfo with new software for visualisation and statistical spatial analysis. We have tried to show that the facilities this system offers will be of potential interest to many people working with health data, and not just to a handful of academic researchers, but whether this is true will only become clear once the system has been tested in practice. The system is still a prototype version, but if anyone is interested in discussing its possible use in their own work, they should contact the team as indicated below.


Acknowledgements

The authors acknowledge receipt of grant R000234470 from the UK Economic and Social Research Council which has made the work reported here possible.

References

Anselin, L., Dodson, R.F. and Hodak, S. (1993) Linking GIS and Spatial Data Analysis in Practice. Geographical Systems 1 (1), 3-23.

Bailey, T. C. (1994) A Review of Statistical Spatial Analysis in Geographical Information Systems. In Fotheringham S and Rogerson P. (ed) Spatial Analysis and GIS. Taylor and Francis, London. 11-44.

Bailey T.C. and Gatrell A.C. (1995) Interactive Spatial Data Analysis. Longman, Harlow.

Batty, M. and Yichun, X. (1994) Urban Analysis in a GIS Environment: Population Density Modelling using ArcInfo. In Fotheringham, S. and Rogerson, P. (ed) Spatial Analysis and GIS, Taylor and Francis, London, 189-220.

Brunsdon C and Charlton M. (1995) A Spatial Analysis Development System using LISP. Proc. GISRUK '95, 155-160.

Burrough P.A. (1986) Principles of Geographical Information Systems for Land Resources Assessment. Oxford University Press.

Cliff, Haggett, Ord, Bassett and Davies. (1975) Elements of spatial structure. Cambridge University Press.

Cross A. and Openshaw S. (1991) Crime pattern analysis: the development of ARC/CRIME. Proc. AGI 91 3.28.1-3.28.6. Westrade Fairs, London.

Ding, Y. and Fotheringham, S. (1992) The Integration of Spatial Analysis and GIS. Computers, Environment and Urban Systems, 16, 3-19.

Dykes J. (1995) Pushing Maps Past their Established Limits : a Unified Approach to Cartographic Visualization. Proc. GISRUK '95, 78-95.

Esri (1994) ArcDoc version 7.0 (online help). Esri, Redlands, CA.

Goodchild, M.F., Haining, R.P. and Wise, S.M. (1992) Integrating Geographic Information Systems and Spatial Data Analysis: Problems and Possibilities. Int. Journal of Geographical Information Systems, 6, 407-24.

Haining, R.P. (1990) Spatial Data Analysis in the Social and Environmental Sciences. Cambridge University Press.

Haining, R.P., Wise, S.M. and Blake, M. (1994) Constructing Regions for Small Area Analysis: Material Deprivation and Colorectal Cancer. Journal of Public Health Medicine, 16, 429-438.

Haining R.P., Wise S.M. and Ma J. (1996) The design of a software system for the interactive spatial statistical analysis linked to a GIS. Computational Statistics (in press).

Haslett J., Wills G. and Unwin A. (1990) SPIDER - an Interactive Statistical Tool for the Analysis of Spatially Distributed Data. Int. J. Geographical Information Systems 4(3), 285-296.

Snow J. (1854) On the mode of communication of cholera. Churchill Livingstone, London.

Wise S.M., Haining R.P. and Ma J. (1996) Regionalisation tools for the exploratory spatial analysis of health data. Paper presented at the 28th International Geographical Congress, The Hague, August 4th-10th, 1996.


Steve Wise, Jingsheng Ma, Bob Haining
Sheffield Centre for Geographic Information and Spatial Analysis
Department of Geography
University of Sheffield
Sheffield S10 2TN
UK
Telephone : +44 (0)114 282 4749
Fax: +44 (0)114 272 7919
Email : R.Haining@shef.ac.uk