Donald G. Brady
High Performance Computing, GIS Business
Group, Compaq Computer Corporation, 200 Forest Street
MRO1-1/P5, Marlboro, MA 01752 USA, Tel: (+1) 508 467
3028, Fax: (+1) 508 467 1137, don.brady@digital.com
Abstract
GIS (Geographic
Information Systems) users, until recently, were part of
a well-defined community, with well-defined connections
to their spatially-enabled database and application
servers. In the future, GIS users will be anyone in an
enterprise, accessing data from general-purpose desktop
environments, remote from the GIS database and
application servers.
This paper will review the technical challenges of
bringing GIS into the mainstream of enterprise computing,
by offering insight into how networking technologies and
related strategies can enable the distribution of spatial
servers and data to remote sites of an enterprise.
BACKGROUND
There is no doubt that GIS is
becoming a core and essential technology for industry. As
an enterprise becomes spatially enabled, GIS is becoming
part of the business and IT mainstream. Consequently, we
are witnessing rapid and unpredictable growth in the
number and variety of spatially-enabled applications, in
the number of GIS users, and in the amount of spatial
data managed by an enterprise.
Success with GIS requires networked performance:
Even as an enterprise scales its computing environment to
handle the increased GIS workload with bigger, faster,
and more hardware, with optimized databases, and with
re-engineered spatially-enabled applications, users at
remote sites may not directly benefit to the same extent
as their colleagues at headquarters. Users who do not
reside on the Local Area Network (LAN) of the database
and application servers may suffer from the performance
of the connection between their desktops and the
enterprise systems providing the spatially-enabled data.
The demand for spatial data will attract users whose
desktop systems are not configured as GIS clients. These
users will heavily burden the application and data
servers, since the GIS processing to satisfy their
spatial queries will be performed by the servers, rather
than the desktop clients.
These technology issues facing industry today are
variations of challenges that modern computing
environments have already addressed successfully. Let's now
investigate those technologies and demonstrate how they
can be applied to today's spatial data challenges.
THE ENTERPRISE IS BEING
SPATIALLY ENABLED
Technology advances:
Information technology (IT) has advanced rapidly in the
past 30 years, from batch computing to online computing,
timesharing, personal computing, network computing and
enterprise computing. In support of these technologies,
the hardware industry has evolved through mainframes,
minicomputers, workstations, and PCs. And programming
methodologies have matured, from stand-alone and
server-centric, to distributed, to client/server, and now
to Web-based.
Each advancement has brought computing and data to
increasingly greater numbers of users. Technology has
become vastly more affordable over time. (Consider
today's cost of a 400 MHz personal computer with 48
megabytes of memory and two gigabytes of disk, compared
with the cost of similar systems 10 and 20 years ago!)
Furthermore, we have moved from the era of the big
computer for computer scientists to the ubiquitous PC
connected to the Internet.
Getting more data to more users:
The result: GIS is thriving. It is widely accepted that
an estimated 80-85% of an enterprise's data is
location-dependent. Data is a valuable asset of an
enterprise, and good data is a formidable competitive
advantage. Thus the explosive growth in GIS applications
easily justifies the investments needed to deploy them.
The enterprise must be spatially enabled to be viable in
the next millennium!
As an enterprise expands its physical boundaries, users
of GIS data and applications inevitably will be
dispersed. Making GIS available to those remote users
becomes a difficult, and sometimes costly, challenge. To
pump water from a local well into my kitchen, I can
easily find the proper machinery at a reasonable price.
However, to pump large amounts of water over very long
distances to very many consumers requires an entirely
different solution. The same problem exists with computer
networks: transmitting data from a server to a client in
the same location is an easy problem to solve;
transmitting large amounts of data thousands of miles to
thousands of users in thousands of locations, is an
entirely different problem that requires more complex
solutions.
So the challenges are distance, data volume, and the
number of data consumers. And as GIS moves into the
enterprise, we are seeing tremendous increases in the
volume of spatial data types, in the number of users, and
in the geographic diversity of those users!
THE CHALLENGES OF DISTANCE AND
VOLUME
If the users of an organization's
spatial data do not reside close to the data repository,
we can move the data closer to them. Data replication,
which can be implemented in a variety of ways, is one
method of achieving this.
Regional data sets:
One implementation creates regional subsets of the data
for distribution to local sites, so that users can access
their own region's data over their Local Area Network
(LAN).
Operational issues arising from this type of solution are
twofold: 1) synchronization between headquarters' master
data and the multiple regional data subsets throughout
the enterprise; and 2) synchronization between regional
spatial data and the enterprise's non-spatial databases,
which generally do not encounter the same data management
issues. The first problem is addressed by routine uploads
of regional data to the master, or routine downloads of
master subsets to the regions; however, one functional
shortcoming of this mechanism is that it does not address
the issue of how a user in one region can access spatial
data of another region. The second problem can be solved
by replicating at each region the entire master database;
but this incurs a high cost for the duplication of disk
farms and the potential concerns of data integrity.
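To make the first synchronization mechanism concrete, here is a minimal sketch of a routine master-to-region download, assuming a hypothetical parcels table, columns, and bounding box; commercial replication tools are, of course, far more sophisticated.

```python
# Minimal sketch of a routine "master subset download" to a regional site.
# The parcels table, its columns, and the bounding box are hypothetical.
import sqlite3

REGION_BBOX = (-71.6, 42.1, -71.0, 42.5)   # xmin, ymin, xmax, ymax for one region

def refresh_regional_subset(master_path: str, region_path: str) -> int:
    """Copy the master rows that fall inside this region's bounding box."""
    master = sqlite3.connect(master_path)
    region = sqlite3.connect(region_path)
    region.execute("CREATE TABLE IF NOT EXISTS parcels "
                   "(id INTEGER PRIMARY KEY, x REAL, y REAL, attributes TEXT)")
    region.execute("DELETE FROM parcels")    # refresh the subset wholesale
    rows = master.execute(
        "SELECT id, x, y, attributes FROM parcels "
        "WHERE x BETWEEN ? AND ? AND y BETWEEN ? AND ?",
        (REGION_BBOX[0], REGION_BBOX[2], REGION_BBOX[1], REGION_BBOX[3]))
    copied = 0
    for row in rows:
        region.execute("INSERT INTO parcels VALUES (?, ?, ?, ?)", row)
        copied += 1
    region.commit()
    master.close()
    region.close()
    return copied
```

Run nightly, a job like this keeps a regional subset reasonably current, but it still leaves open the cross-region access problem noted above.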
Data caching:
A second form of spatial data replication makes use of
caching. Spatial data can be cached over the Wide Area
Network (WAN) by a small cache server at each regional
office. The replication is dynamic and is determined not
by the physical boundaries of the organization, but
rather by the data actually accessed by individual users,
allowing each regional office access to the enterprise's
entire database. Data updates will automatically be
applied concurrently to both the master database and the
cache, so the master database remains the single central
repository of spatial data and thus synchronized with
other enterprise databases. Since the most frequently
used data will nearly always be cached locally, the
central server will not have to service the high
proportion of spatial data reads that are implicit in GIS
data accesses.
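A toy read-through, write-through cache illustrates this behavior; the tile-keyed interface and the LRU eviction policy below are assumptions made for the sketch, not a description of any particular product.

```python
# A toy read-through / write-through cache for spatial tiles at a regional office.
# The master database stays the single repository; only misses and updates cross the WAN.
from collections import OrderedDict

class RegionalTileCache:
    def __init__(self, fetch_from_master, write_to_master, capacity=10_000):
        self._fetch = fetch_from_master        # WAN call: tile_id -> bytes
        self._write = write_to_master          # WAN call: (tile_id, bytes) -> None
        self._capacity = capacity
        self._tiles = OrderedDict()            # tile_id -> bytes, in LRU order

    def read(self, tile_id):
        if tile_id in self._tiles:             # local hit: no WAN traffic
            self._tiles.move_to_end(tile_id)
            return self._tiles[tile_id]
        data = self._fetch(tile_id)            # miss: go to the master once
        self._store(tile_id, data)
        return data

    def update(self, tile_id, data):
        self._write(tile_id, data)             # master remains the central repository
        self._store(tile_id, data)             # keep the cached copy consistent

    def _store(self, tile_id, data):
        self._tiles[tile_id] = data
        self._tiles.move_to_end(tile_id)
        if len(self._tiles) > self._capacity:
            self._tiles.popitem(last=False)    # evict the least-recently-used tile
```

Because reads are served locally whenever a tile is already cached, only misses and updates cross the WAN, while writes go through to the master so it stays synchronized with the rest of the enterprise.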
Even this data replication solution
incurs the cost of additional hardware (the cache servers
and the disk storage for the replication) and the
overhead of the intricate data-caching mechanisms. But it
does meet the goal of data replication: it gets the data
as close to the user as possible. The motivation behind
minimizing this distance is that significant performance
improvement occurs as we move data closer to the location
from which it will be accessed: the bandwidth of the
network (the "diameter of the pipe", the amount
of data per unit of time that can be transmitted)
increases. So, for example, a LAN provides better
performance than a WAN. In short, local problems are
easier to solve than global problems.
Making the world smaller:
Metaphorically, the world seems to be getting smaller. We
can cross oceans and continents in hours, if not minutes.
We are more intimate today with remote cultures and
peoples than at any other period of human history. We are
living in a global economy. But if the world really were
getting smaller, then the physical distance one would
have to traverse in order to travel from Point A to Point
B would shorten, and we could then apply "short
distance" technology to "long distance"
problems. In other words, the problem of getting large
amounts of water to thousands of consumers thousands of
miles away, would reduce to the problem of getting water
from one's local well to one's kitchen.
So, what if we had available to us
a range of technology solutions that could make long
physical distances logically short, and seamless? We
could apply that technology to the problem of making
spatial data available to remote users. Let us now take a
look at how technology can in fact allow us to apply
"short distance" solutions to "long
distance" problems.
THE NETWORK IS THE REPLICATION
It is important to understand that
networking technology enabled the computing evolution
from mainframes to PCs on the Internet. But more than raw
technology is necessary to enable millions of computers
to communicate with each other.
Standards enable communication:
The 1990s have been a time of dramatic shift toward
standardization in the IT industry. The world has
essentially settled on two operating systems, UNIX and
Windows NT. TCP/IP is a standard networking protocol.
There are very few chip designs to choose from. We have
standardized on object oriented methodologies; SQL, HTML,
Java, and a host of other software paradigms are
standard. Essentially gone are proprietary operating
systems and networking protocols; open APIs (Application
Programming Interfaces) enable communication among
otherwise incompatible applications. Adherence to
standards has allowed the Internet to flourish: how else
could millions of computers talk to each other in a
worldwide web! And as it has flourished, the number of
users and the amount of traffic have grown phenomenally.
No other "community" is as widely dispersed as
users of the Internet.
Web-based application design:
Application design is no longer LAN- or WAN-based, it is
Web-based, further anointing the Internet as an absolute
standard. Consequently, IT vendors are racing to deliver
higher performance across longer distances. The existence
of the infrastructure (namely, the worldwide web), and
the shift to Web-based development, have made data
available to more people at reasonable cost. And the GIS
industry is at the forefront of this latest shift. What
technologies make this possible? And how can they apply
to GIS and the challenges already presented?
As we shall see, due to
advancements in networking technologies driven by the
growth of the Internet, we are less dependent on bringing
data closer to the user: the network becomes the
replication.
Let's now look at a real-life GIS implementation,
TerraServer, which does not replicate data, and
investigate some of the technologies that make it
possible.
THE TERRASERVER: GIS DATA MADE
AVAILABLE TO THE WORLD
TerraServer is a collaboration of
Microsoft Corporation, Digital Equipment Corporation
(recently acquired by Compaq Computer Corporation), the
United States Geological Survey (USGS), and SPIN2, a
provider of declassified Soviet satellite imagery.
Terabyte multimedia database:
TerraServer is a multimedia database that stores aerial
and satellite images of the earth in a Microsoft SQL
Server™ Database served to the public via the
Internet. It is the world's largest atlas, containing
five terabytes (5 TB) of uncompressed satellite and
aerial image data from SPIN2 and the USGS, compressed to
1 TB of database data. The imagery covers nearly five
trillion square meters (about five million square
kilometers), which is more territory than all the
urban areas on Earth combined. It is also the world's
largest online database, and will double in size as more
images become available.
TerraServer design:
TerraServer can be accessed from any web browser:
navigation can be spatial via a point-and-click map, or
clients knowing only place names can navigate textually.
Clients send requests to the
TerraServer's Internet Information Server (IIS) built
into Windows NT. These requests are passed to Active
Server Pages (ASPs) programmed in VBscript, which in turn
send queries to stored procedures in the SQL Server
database to fetch image tiles. The ASPs dynamically
construct the HTML Web pages needed to mosaic the tiles
together to make a complete image. The ASP sends this HTML
back to the client's browser. The client browser then
requests the images needed to fill in the picture. These
URL requests generate between 30 and 500 database
accesses.
The database stores the USGS and SPIN2 data as small (10
kilobyte or less) tiles compressed with JPEG. Larger
images are created as a mosaic of these tiles, allowing
quick response to users over slow network connections.
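The mosaic idea can be sketched in a few lines; the example below uses Python rather than the actual VBscript ASPs, and the /tile URL scheme, its parameters, and the tile dimensions are hypothetical.

```python
# Sketch of building the mosaic page: given the grid of tiles covering the
# requested view, emit an HTML table whose cells reference individual JPEG tiles.
def mosaic_html(theme: str, level: int, rows: range, cols: range) -> str:
    parts = ['<table cellspacing="0" cellpadding="0" border="0">']
    for r in rows:
        parts.append("<tr>")
        for c in cols:
            parts.append(f'<td><img src="/tile?t={theme}&z={level}&r={r}&c={c}" '
                         'width="200" height="200"></td>')
        parts.append("</tr>")
    parts.append("</table>")
    return "".join(parts)

# A 3 x 4 view assembled from twelve small tiles:
page = mosaic_html("usgs", 12, range(100, 103), range(240, 244))
```

The browser then fetches each small tile independently, which is what keeps the design responsive over slow connections.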
Hardware:
TerraServer runs on a single Digital AlphaServer 8400
system with 10 GB of memory and eight Alpha processors,
and support for up to 160 PCI slots. The TerraServer
configuration hosts seven KZPBA dual-ported Ultra SCSI
host bus adapters -- one for each of seven disk storage
cabinets.
Each storage cabinet holds 46 9-GB drives, for a total of
324 drives and a total capacity of 2.9 TB. Hardware
RAID5 provided by Digital StorageWorks converts the 324
disks into 28 large RAID5 disks. Windows NT RAID0
software striping is used to convert these 28 RAID5 disks
into four huge logical volumes, each of which is 595 GB. The
TerraServer uses a single physical database spread across
all four logical volumes. The design masks any single
disk fault, masks many string failures, and masks some
controller failures. Spare drives are configured to help
availability.
TerraServer accesses four Map Servers running on
dedicated Compaq Intel servers each with four processors
and 256 MB of memory.
What makes it work?
The map servers and the SQL Server are on a LAN; the
physical structure of the database and the underlying
technology of the database server are outside the scope of
this paper.
Rather, the focus of this discussion is how the spatial
data -- the images of locations on the earth -- can
efficiently be transmitted from the database server over
long distances to many users.
Two issues are at play here: 1)
increasing the speed and bandwidth of the
"pipe" between the server and the user; and 2)
reducing the amount of data that must be passed from the
server to the user.
In the world of enterprise computing, LAN performance is
extending beyond the data center as a result of
advancements in networking technologies and supporting
software technologies. So we can apply "short
distance" technology to spatial data challenges,
rendering the replication of data sets unnecessary.
Networking Technologies:
TCP/IP is unquestionably the standard networking protocol
today. Whereas Ethernet was the dominant conduit for
short distance (LAN) TCP/IP traffic, and modems were the
common low-cost conduit for long distances, the TCP/IP
infrastructure today is much more robust.
Historically, as demand for remote access to computing
has grown, modem technology has advanced to meet the
rising demands, from the earliest 300 baud acoustic
couplers to today's 56 Kbps links. But the popularity of the
Internet has caused network traffic to surpass the
capacity of modem technology, causing increased levels of
frustration among users. And similarly, as GIS expands
into the enterprise, industry seeks more efficient means
of accessing and delivering spatial data to remote users.
Here are a few networking technologies that approach
local performance across very long distances, enabling
tasks that used to be done only locally to now be done at
remote sites.
For any application, of course, one needs to consider
cost, performance, ease of use, and availability. It is
within the scope of this paper to present options for
consideration, not to attempt to provide best-fit
scenarios for the various technologies.
Cable modems:
A cable modem system is
designed to deliver broadband IP by taking advantage
of coaxial and fiber connections used by the cable TV
industry's infrastructure. A cable modem creates a
virtual LAN connection, linking to a user's PC
through a standard 10Base-T Ethernet card and
twisted-pair wiring. Users can experience access
speeds approaching those of Ethernet.
Cable modems also offer constant connectivity: much
like in a LAN, a user's PC is always online with the
network. Unlike switched telephone networks where a
caller is allocated a dedicated circuit, cable modem
users do not consume fixed bandwidth. Rather, they
share the connection with other users and consume
bandwidth only when they actually send or receive
data. So, they are able to grab all the bandwidth
available during the instant of time they actually
transmit or receive data packets.
Asymmetric Digital Subscriber Line
(ADSL)
ADSL provides high data rate
broadband digital services over existing copper-based
telephone lines, for such services as high speed
Internet access and remote LAN access. The
'asymmetry' refers to the downstream data rate, from
the exchange to the user, being higher than the data
rate upstream. Like cable modems, ADSL uses an
existing infrastructure to provide bandwidth close to
that of Ethernet.
ADSL provides both analog phone service and
connection to digital services. Employing
ADSL technology over twisted-pair telephone lines
achieves access speeds of approximately 6-8 Mbps
downstream and 768 Kbps upstream.
Direct PC
Direct PC uses satellite
transmission to deliver TCP/IP to a set-top box. It's
an appealing application for field users who need
access to remote data. Direct PC delivers 400 Kbps
access.
Supporting Technologies
Consider, too, the benefit of
software technologies such as data compression and
back-end modules that can add custom functionality to
standard applications.
Data compression
Data compression allows data to
be represented and stored in a format that, although
it is not directly usable by an application, requires
less space than uncompressed data. Consequently,
compressed data sent over a network, since its volume
is reduced, will consume less network bandwidth than
its uncompressed equivalent.
The network still is -- and will be
for the foreseeable future -- the weak link of an IT
architecture. So any reduction in the volume of network
traffic, even at the expense of another component in the
IT architecture, will improve overall performance. The
cost of data compression is that algorithms must compress
the data for storage and transmission, and then
decompress it, or rebuild it, for use by the user
application. The compression/decompression algorithms are
much less costly at today's processor speeds (400 MHz)
than is the additional network bandwidth that would be
required to transmit uncompressed data.
Note that TerraServer stores approximately 5 TB of data,
compressed to 1 TB of database storage.
Java, JPEG, GIF, and RealAudio are
additional examples of common technologies that rely on
data compression.
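The trade-off is easy to demonstrate with a general-purpose compressor; the snippet below uses Python's standard zlib on a synthetic payload, not TerraServer's actual JPEG pipeline.

```python
# Demonstrates trading CPU cycles for network bandwidth with lossless compression.
import zlib

payload = b"ELEVATION,LAT,LON\n" + b"123.4,42.3601,-71.0589\n" * 50_000
compressed = zlib.compress(payload, 6)          # the sender spends CPU here

print(f"uncompressed: {len(payload):>9,} bytes")
print(f"compressed:   {len(compressed):>9,} bytes "
      f"({100 * len(compressed) / len(payload):.1f}% of original)")

# The receiver spends CPU to rebuild the data instead of waiting on the network.
assert zlib.decompress(compressed) == payload
```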
Backend modules
Major RDBMS vendors have
extended their servers with spatial operations. For
example, the Oracle Spatial Data Option and the
Informix Spatial DataBlade® module are extensions to
Oracle® Universal Server® and INFORMIX® Universal
Server that add support for spatial data and
analysis. They add datatypes that describe common
plane geometry shapes and polygons of arbitrary
complexity. They also provide spatial functions that
allow object creation, comparison, manipulation, and
queries.
Key operations used with position
information are incorporated into the database and are
accessible through SQL, both as SQL queries and from
within applications using supplied libraries.
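As an illustration only, an application might issue a window query such as the one below; the table, columns, spatial functions, and parameter style are generic stand-ins, since the exact SQL syntax differs between the Oracle and Informix products.

```python
# Hypothetical window query: the containment test runs inside the database,
# so only the matching rows ever cross the network to the client.
def parcels_in_window(connection, xmin, ymin, xmax, ymax):
    sql = ("SELECT parcel_id, owner FROM parcels "
           "WHERE ST_Contains(ST_MakeEnvelope(?, ?, ?, ?), boundary)")
    cursor = connection.cursor()
    cursor.execute(sql, (xmin, ymin, xmax, ymax))
    return cursor.fetchall()
```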
Where extensions such as the Spatial Data Option and the
Spatial DataBlade do not provide a specific piece of
functionality, a user can develop his or her own. The
benefit is that the client application can make a simple
network request of the user-defined extension, which can
then do its own pre-processing of the request before
passing it off to the server application. This type of
design simplifies and reduces network traffic, at the cost
of a small amount of additional processing complexity at
the server end. But again, as with compression algorithms,
additional computing by today's fast processors is
preferable to increased network traffic.
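A toy routine makes the point; the feature layout and the distance test below are assumptions of the sketch, not any vendor's extension API.

```python
# A user-defined, server-side routine: a small request (a point and a radius)
# produces a small response (matching ids), instead of shipping the whole
# layer across the WAN for the client to filter locally.
from math import hypot

def features_near(features, x, y, radius):
    """features: iterable of (feature_id, fx, fy) tuples held on the server."""
    return [fid for fid, fx, fy in features if hypot(fx - x, fy - y) <= radius]

nearby = features_near([(1, 0.0, 0.0), (2, 3.0, 4.0), (3, 50.0, 50.0)],
                       x=0.0, y=0.0, radius=10.0)
# -> [1, 2]
```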
SUMMARY
What this paper has endeavored to
illustrate is that technologies exist today which, when
applied smartly, can help solve the performance problems
caused by the explosive growth of spatial data and
applications.
Remote users need to access enterprise data with the same
efficiency as their colleagues at headquarters. Proper
selection of network products, and intelligent use of
application design can alleviate much of the performance
degradation commonly associated with networked
applications. TerraServer, accessible at www.terraserver.microsoft.com, demonstrates that real solutions to these
challenges exist.
REFERENCES
- Jim Gray et al. (Microsoft Research and Development),
Microsoft TerraServer Whitepaper, June 1998
- Informix Universal Server, November 1996
- All company names, brand
names, and product names used in this paper are
trademarks, registered trademarks, trade names,
or service marks of their respective owners.