Donald G. Brady
High Performance Computing, GIS Business
Group, Compaq Computer Corporation, 200 Forest Street
MRO1-1/P5, Marlboro, MA 01752 USA, Tel: (+1) 508 467
3028, Fax: (+1) 508 467 1137, don.brady@digital.com
Abstract
GIS (Geographic
Information Systems) users, until recently, were part of
a well-defined community, with well-defined connections
to their spatially-enabled database and application
servers. In the future, GIS users will be anyone in an
enterprise, accessing data from general-purpose desktop
environments, remote from the GIS database and
application servers.
This paper will review the technical challenges of
bringing GIS into the mainstream of enterprise computing,
by offering insight into how networking technologies and
related strategies can enable the distribution of spatial
servers and data to remote sites of an enterprise.
BACKGROUND
There is no doubt that GIS is
becoming a core and essential technology for industry. As
an enterprise becomes spatially enabled, GIS is becoming
part of the business and IT mainstream. Consequently, we
are witnessing rapid and unpredictable growth in the
number and variety of spatially-enabled applications, in
the number of GIS users, and in the amount of spatial
data managed by an enterprise.
Success with GIS requires networked performance:
Even as an enterprise scales its computing environment to
handle the increased GIS workload with bigger, faster,
and more hardware, with optimized databases, and with
re-engineered spatially-enabled applications, users at
remote sites may not directly benefit to the same extent
as their colleagues at headquarters. Users who do not
reside on the Local Area Network (LAN) of the database
and application servers may suffer from the performance
of the connection between their desktops and the
enterprise systems providing the spatially-enabled data.
The demand for spatial data will attract users whose
desktop systems are not configured as GIS clients. These
users will heavily burden the application and data
servers, since the GIS processing to satisfy their
spatial queries will be performed by the servers, rather
than the desktop clients.
These technology issues facing industry today are
variations of challenges that modern computing
environments have already addressed successfully. Let's now
investigate those technologies and demonstrate how they
can be applied to today's spatial data challenges.
THE ENTERPRISE IS BEING
SPATIALLY ENABLED
Technology advances:
Information technology (IT) has advanced rapidly in the
past 30 years, from batch computing to online computing,
timesharing, personal computing, network computing and
enterprise computing. In support of these technologies,
the hardware industry has evolved through mainframes,
minicomputers, workstations, and PCs. And programming
methodologies have matured, from stand-alone and
server-centric, to distributed, to client/server, and now
to Web-based.
Each advancement has brought computing and data to
increasingly greater numbers of users. Technology has
become vastly more affordable over time. (Consider
today's cost of a 400 MHz personal computer with 48
megabytes of memory and two gigabytes of disk, compared
with the cost of similar systems 10 and 20 years ago!)
Furthermore, we have moved from the era of the big
computer for computer scientists to the ubiquitous PC
connected to the Internet.
Getting more data to more users:
The result: GIS is thriving. It is widely accepted that
an estimated 80-85% of an enterprise's data is
location-dependent. Data is a valuable asset of an
enterprise, and good data is a formidable competitive
advantage. Thus the explosive growth in GIS applications
easily justifies the investments needed to deploy them.
The enterprise must be spatially enabled to be viable in
the next millennium!
As an enterprise expands its physical boundaries, users
of GIS data and applications inevitably will be
dispersed. Making GIS available to those remote users
becomes a difficult, and sometimes costly, challenge. To
pump water from a local well into my kitchen, I can
easily find the proper machinery at a reasonable price.
However, to pump large amounts of water over very long
distances to very many consumers requires an entirely
different solution. The same problem exists with computer
networks: transmitting data from a server to a client in
the same location is an easy problem to solve;
transmitting large amounts of data thousands of miles to
thousands of users in thousands of locations, is an
entirely different problem that requires more complex
solutions.
So the challenges are distance, data volume, and the
number of data consumers. And as GIS moves into the
enterprise, we are seeing tremendous increases in the
volume of spatial data types, in the number of users, and
in the geographic diversity of those users!
THE CHALLENGES OF DISTANCE AND
VOLUME
If the users of an organization's
spatial data do not reside close to the data repository,
we can move the data closer to them. Data replication,
which can be implemented in a variety of ways, is one
method of achieving this.
Regional data sets:
One implementation creates regional subsets of the data
for distribution to local sites, so that users can access
their own region's data over their Local Area Network
(LAN).
Operational issues arising from this type of solution are
twofold: 1) synchronization between headquarters' master
data and the multiple regional data subsets throughout
the enterprise; and 2) synchronization between regional
spatial data and the enterprise's non-spatial databases,
which generally do not encounter the same data management
issues. The first problem is addressed by routine uploads
of regional data to the master, or routine downloads of
master subsets to the regions; however, one functional
shortcoming of this mechanism is that it does not address
the issue of how a user in one region can access spatial
data of another region. The second problem can be solved
by replicating at each region the entire master database;
but this incurs a high cost for the duplication of disk
farms and the potential concerns of data integrity.
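To make the first synchronization mechanism concrete, here is a minimal sketch of a routine master-to-region download, assuming a hypothetical parcels table, columns, and bounding box; commercial replication tools are, of course, far more sophisticated.

```python
# Minimal sketch of a routine "master subset download" to a regional site.
# The parcels table, its columns, and the bounding box are hypothetical.
import sqlite3

REGION_BBOX = (-71.6, 42.1, -71.0, 42.5)   # xmin, ymin, xmax, ymax for one region

def refresh_regional_subset(master_path: str, region_path: str) -> int:
    """Copy the master rows that fall inside this region's bounding box."""
    master = sqlite3.connect(master_path)
    region = sqlite3.connect(region_path)
    region.execute("CREATE TABLE IF NOT EXISTS parcels "
                   "(id INTEGER PRIMARY KEY, x REAL, y REAL, attributes TEXT)")
    region.execute("DELETE FROM parcels")    # refresh the subset wholesale
    rows = master.execute(
        "SELECT id, x, y, attributes FROM parcels "
        "WHERE x BETWEEN ? AND ? AND y BETWEEN ? AND ?",
        (REGION_BBOX[0], REGION_BBOX[2], REGION_BBOX[1], REGION_BBOX[3]))
    copied = 0
    for row in rows:
        region.execute("INSERT INTO parcels VALUES (?, ?, ?, ?)", row)
        copied += 1
    region.commit()
    master.close()
    region.close()
    return copied
```

Run nightly, a job like this keeps a regional subset reasonably current, but it still leaves open the cross-region access problem noted above.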
Data caching:
A second form of spatial data replication makes use of
caching. Spatial data can be cached over the Wide Area
Network (WAN) by a small cache server at each regional
office. The replication is dynamic and is determined not
by the physical boundaries of the organization, but
rather by the data actually accessed by individual users,
allowing each regional office access to the enterprise's
entire database. Data updates will automatically be
applied concurrently to both the master database and the
cache, so the master database remains the single central
repository of spatial data and thus synchronized with
other enterprise databases. Since the most frequently
used data will nearly always be cached locally, the
central server will not have to service the high
proportion of spatial data reads that are implicit in GIS
data accesses.
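A toy read-through, write-through cache illustrates this behavior; the tile-keyed interface and the LRU eviction policy below are assumptions made for the sketch, not a description of any particular product.

```python
# A toy read-through / write-through cache for spatial tiles at a regional office.
# The master database stays the single repository; only misses and updates cross the WAN.
from collections import OrderedDict

class RegionalTileCache:
    def __init__(self, fetch_from_master, write_to_master, capacity=10_000):
        self._fetch = fetch_from_master        # WAN call: tile_id -> bytes
        self._write = write_to_master          # WAN call: (tile_id, bytes) -> None
        self._capacity = capacity
        self._tiles = OrderedDict()            # tile_id -> bytes, in LRU order

    def read(self, tile_id):
        if tile_id in self._tiles:             # local hit: no WAN traffic
            self._tiles.move_to_end(tile_id)
            return self._tiles[tile_id]
        data = self._fetch(tile_id)            # miss: go to the master once
        self._store(tile_id, data)
        return data

    def update(self, tile_id, data):
        self._write(tile_id, data)             # master remains the central repository
        self._store(tile_id, data)             # keep the cached copy consistent

    def _store(self, tile_id, data):
        self._tiles[tile_id] = data
        self._tiles.move_to_end(tile_id)
        if len(self._tiles) > self._capacity:
            self._tiles.popitem(last=False)    # evict the least-recently-used tile
```

Because reads are served locally whenever a tile is already cached, only misses and updates cross the WAN, while writes go through to the master so it stays synchronized with the rest of the enterprise.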
Even this data replication solution
incurs the cost of additional hardware (the cache servers
and the disk storage for the replication) and the
overhead of the intricate data-caching mechanisms. But it
does meet the goal of data replication: it gets the data
as close to the user as possible. The motivation behind
minimizing this distance is that significant performance
improvement occurs as we move data closer to the location
from which it will be accessed: the bandwidth of the
network (the "diameter of the pipe", the amount
of data per unit of time that can be transmitted)
increases. So, for example, a LAN provides better
performance than a WAN. In short, local problems are
easier to solve than global problems.
Making the world smaller:
Metaphorically, the world seems to be getting smaller. We
can cross oceans and continents in hours, if not minutes.
We are more intimate today with remote cultures and
peoples than at any other period of human history. We are
living in a global economy. But if the world really were
getting smaller, then the physical distance one would
have to traverse in order to travel from Point A to Point
B would shorten, and we could then apply "short
distance" technology to "long distance"
problems. In other words, the problem of getting large
amounts of water to thousands of consumers thousands of
miles away, would reduce to the problem of getting water
from one's local well to one's kitchen.
So, what if we had available to us
a range of technology solutions that could make long
physical distances logically short, and seamless? We
could apply that technology to the problem of making
spatial data available to remote users. Let us now take a
look at how technology can in fact allow us to apply
"short distance" solutions to "long
distance" problems.
THE NETWORK IS THE REPLICATION
It is important to understand that
networking technology enabled the computing evolution
from mainframes to PCs on the Internet. But more than raw
technology is necessary to enable millions of computers
to communicate with each other.
Standards enable communication:
The 1990s have been a time of dramatic shift toward
standardization in the IT industry. The world has
essentially settled on two operating systems, UNIX and
Windows NT. TCP/IP is a standard networking protocol.
There are very few chip designs to choose from. We have
standardized on object oriented methodologies; SQL, HTML,
Java, and a host of other software paradigms are
standard. Essentially gone are proprietary operating
systems and networking protocols; open APIs (Application
Programming Interfaces) enable communication among
otherwise incompatible applications. Adherence to
standards has allowed the Internet to flourish: how else
could millions of computers talk to each other in a
worldwide web! And as it has flourished, the number of
users and the amount of traffic have grown phenomenally.
No other "community" is as widely dispersed as
users of the Internet.
Web-based application design:
Application design is no longer LAN- or WAN-based, it is
Web-based, further anointing the Internet as an absolute
standard. Consequently, IT vendors are racing to deliver
higher performance across longer distances. The existence
of the infrastructure (namely, the worldwide web), and
the shift to Web-based development, have made data
available to more people at reasonable cost. And the GIS
industry is at the forefront of this latest shift. What
technologies make this possible? And how can they apply
to GIS and the challenges already presented?
As we shall see, due to
advancements in networking technologies driven by the
growth of the Internet, we are less dependent on bringing
data closer to the user: the network becomes the
replication.
Let's now look at a real-life GIS implementation,
TerraServer, which does not replicate data, and
investigate some of the technologies that make it
possible.
THE TERRASERVER: GIS DATA MADE
AVAILABLE TO THE WORLD
TerraServer is a collaboration of
Microsoft Corporation, Digital Equipment Corporation
(recently acquired by Compaq Computer Corporation), the
United States Geological Survey (USGS), and SPIN2, a
provider of declassified Soviet satellite imagery.
Terabyte multimedia database:
TerraServer is a multimedia database that stores aerial
and satellite images of the earth in a Microsoft SQL
Server™ Database served to the public via the
Internet. It is the world's largest atlas, containing
five terabytes (5 TB) of uncompressed satellite and
aerial image data from SPIN2 and the USGS, compressed to
1 TB of database data. The imagery covers nearly five
trillion square meters (about five million square
kilometers), which is more territory than all the
urban areas on Earth combined. It is also the world's
largest online database, and will double in size as more
images become available.
TerraServer design:
TerraServer can be accessed from any web browser:
navigation can be spatial via a point-and-click map, or
clients knowing only place names can navigate textually.
Clients send requests to the
TerraServer's Internet Information Server (IIS) built
into Windows NT. These requests are passed to Active
Server Pages (ASPs) programmed in VBscript, which in turn
send queries to stored procedures in the SQL Server
database to fetch image tiles. The ASPs dynamically
construct the HTML Web pages needed to mosaic the tiles
together to make a complete image. The ASP sends this HTML
back to the client's browser. The client browser then
requests the images needed to fill in the picture. These
URL requests generate between 30 and 500 database
accesses.
The database stores the USGS and SPIN2 data as small (10
kilobyte or less) tiles compressed with JPEG. Larger
images are created as a mosaic of these tiles, allowing
quick response to users over slow network connections.
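The mosaic idea can be sketched in a few lines; the example below uses Python rather than the actual VBscript ASPs, and the /tile URL scheme, its parameters, and the tile dimensions are hypothetical.

```python
# Sketch of building the mosaic page: given the grid of tiles covering the
# requested view, emit an HTML table whose cells reference individual JPEG tiles.
def mosaic_html(theme: str, level: int, rows: range, cols: range) -> str:
    parts = ['<table cellspacing="0" cellpadding="0" border="0">']
    for r in rows:
        parts.append("<tr>")
        for c in cols:
            parts.append(f'<td><img src="/tile?t={theme}&z={level}&r={r}&c={c}" '
                         'width="200" height="200"></td>')
        parts.append("</tr>")
    parts.append("</table>")
    return "".join(parts)

# A 3 x 4 view assembled from twelve small tiles:
page = mosaic_html("usgs", 12, range(100, 103), range(240, 244))
```

The browser then fetches each small tile independently, which is what keeps the design responsive over slow connections.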
Hardware:
TerraServer runs on a single Digital AlphaServer 8400
system with 10 GB of memory and eight Alpha processors,
and support for up to 160 PCI slots. The TerraServer
configuration hosts seven KZPBA dual-ported Ultra SCSI
host bus adapters -- one for each of seven disk storage
cabinets.
Each storage cabinet holds 46 9-GB drives, for a total of
324 drives and a total capacity of 2.9 TB. Hardware
RAID5 provided by Digital StorageWorks converts the 324
disks into 28 large RAID5 disks. Windows NT RAID0
software striping is used to convert these 28 RAID5 disks
into four huge logical volumes, each of which is 595 GB. The
TerraServer uses a single physical database spread across
all four logical volumes. The design masks any single
disk fault, masks many string failures, and masks some
controller failures. Spare drives are configured to help
availability.
TerraServer accesses four Map Servers running on
dedicated Compaq Intel servers each with four processors
and 256 MB of memory.
What makes it work?
The map servers and the SQL Server are on a LAN; the
physical structure of the database and the underlying
technology of the database server are outside the scope of
this paper.
Rather, the focus of this discussion is how the spatial
data -- the images of locations on the earth -- can
efficiently be transmitted from the database server over
long distances to many users.
Two issues are at play here: 1)
increasing the speed and bandwidth of the
"pipe" between the server and the user; and 2)
reducing the amount of data that must be passed from the
server to the user.
In the world of enterprise computing, LAN performance is
extending beyond the data center as a result of
advancements in networking technologies and supporting
software technologies. So we can apply "short
distance" technology to spatial data challenges,
rendering the replication of data sets unnecessary.
Networking Technologies:
TCP/IP is unquestionably the standard networking protocol
today. Whereas Ethernet was the dominant conduit for
short distance (LAN) TCP/IP traffic, and modems were the
common low-cost conduit for long distances, the TCP/IP
infrastructure today is much more robust.
Historically, as demand for remote access to computing
has grown, modem technology has advanced to meet the
rising demands, from the earliest 300 baud acoustic
couplers to today's 56 Kbps links. But the popularity of the
Internet has caused network traffic to surpass the
capacity of modem technology, causing increased levels of
frustration among users. And similarly, as GIS expands
into the enterprise, industry seeks more efficient means
of accessing and delivering spatial data to remote users.
Here are a few networking technologies that approach
local performance across very long distances, enabling
tasks that used to be done only locally to now be done at
remote sites.
For any application, of course, one needs to consider
cost, performance, ease of use, and availability. It is
within the scope of this paper to present options for
consideration, not to attempt to provide best-fit
scenarios for the various technologies.
Cable modems:
A cable modem system is
designed to deliver broadband IP by taking advantage
of coaxial and fiber connections used by the cable TV
industry's infrastructure. A cable modem creates a
virtual LAN connection, linking to a user's PC
through a standard 10Base-T Ethernet card and
twisted-pair wiring. Users can experience access
speeds approaching those of Ethernet.
Cable modems also offer constant connectivity: much
like in a LAN, a user's PC is always online with the
network. Unlike switched telephone networks where a
caller is allocated a dedicated circuit, cable modem
users do not consume fixed bandwidth. Rather, they
share the connection with other users and consume
bandwidth only when they actually send or receive
data. So, they are able to grab all the bandwidth
available during the instant of time they actually
transmit or receive data packets.
Asymmetric Digital Subscriber Line
(ADSL)
ADSL provides high data rate
broadband digital services over existing copper-based
telephone lines, for such services as high speed
Internet access and remote LAN access. The
'asymmetry' refers to the downstream data rate, from
the exchange to the user, being higher than the data
rate upstream. Like cable modems, ADSL uses an
existing infrastructure to provide bandwidth close to
that of Ethernet.
ADSL provides both analog phone service and
connection to digital services. Employing
ADSL technology over twisted-pair telephone lines
achieves access speeds of approximately 6-8 Mbps
downstream and 768 Kbps upstream.
Direct PC
Direct PC uses satellite
transmission to deliver TCP/IP to a set-top box. It's
an appealing application for field users who need
access to remote data. Direct PC delivers 400 Kbps
access.
Supporting Technologies
Consider, too, the benefit of
software technologies such as data compression and
back-end modules that can add custom functionality to
standard applications.
Data compression
Data compression allows data to
be represented and stored in a format that, although
it is not directly usable by an application, requires
less space than uncompressed data. Consequently,
compressed data sent over a network, since its volume
is reduced, will consume less network bandwidth than
its uncompressed equivalent.
The network still is -- and will be
for the foreseeable future -- the weak link of an IT
architecture. So any reduction in the volume of network
traffic, even at the expense of another component in the
IT architecture, will improve overall performance. The
cost of data compression is that algorithms must compress
the data for storage and transmission, and then
decompress it, or rebuild it, for use by the user
application. The compression/decompression algorithms are
much less costly at today's processor speeds (400 MHz)
than is the additional network bandwidth that would be
required to transmit uncompressed data.
Note that TerraServer stores approximately 5 TB of data,
compressed to 1 TB of database storage.
Java, JPEG, GIF, and RealAudio are
additional examples of common technologies that rely on
data compression.
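The trade-off is easy to demonstrate with a general-purpose compressor; the snippet below uses Python's standard zlib on a synthetic payload, not TerraServer's actual JPEG pipeline.

```python
# Demonstrates trading CPU cycles for network bandwidth with lossless compression.
import zlib

payload = b"ELEVATION,LAT,LON\n" + b"123.4,42.3601,-71.0589\n" * 50_000
compressed = zlib.compress(payload, 6)          # the sender spends CPU here

print(f"uncompressed: {len(payload):>9,} bytes")
print(f"compressed:   {len(compressed):>9,} bytes "
      f"({100 * len(compressed) / len(payload):.1f}% of original)")

# The receiver spends CPU to rebuild the data instead of waiting on the network.
assert zlib.decompress(compressed) == payload
```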
Backend modules
Major RDBMS vendors have
extended their servers with spatial operations. For
example, the Oracle Spatial Data Option and the
Informix Spatial DataBlade® module are extensions to
Oracle® Universal Server® and INFORMIX® Universal
Server that add support for spatial data and
analysis. They add datatypes that describe common
plane geometry shapes and polygons of arbitrary
complexity. They also provide spatial functions that
allow object creation, comparison, manipulation, and
queries.
Key operations used with position
information are incorporated into the database and are
accessible through SQL, both as SQL queries and from
within applications using supplied libraries.
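As an illustration only, an application might issue a window query such as the one below; the table, columns, spatial functions, and parameter style are generic stand-ins, since the exact SQL syntax differs between the Oracle and Informix products.

```python
# Hypothetical window query: the containment test runs inside the database,
# so only the matching rows ever cross the network to the client.
def parcels_in_window(connection, xmin, ymin, xmax, ymax):
    sql = ("SELECT parcel_id, owner FROM parcels "
           "WHERE ST_Contains(ST_MakeEnvelope(?, ?, ?, ?), boundary)")
    cursor = connection.cursor()
    cursor.execute(sql, (xmin, ymin, xmax, ymax))
    return cursor.fetchall()
```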
Where extensions such as the Spatial Data Option and the
Spatial DataBlade do not provide a specific piece of
functionality, a user can develop his or her own. The
benefit is that the client application can make a simple
network request of the user-defined extension, which can
then do its own pre-processing of the request before
passing it off to the server application. This type of
design simplifies and reduces network traffic, at the cost
of a small amount of additional processing complexity at
the server end. But again, as with compression algorithms,
additional computing by today's fast processors is
preferable to increased network traffic.
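A toy routine makes the point; the feature layout and the distance test below are assumptions of the sketch, not any vendor's extension API.

```python
# A user-defined, server-side routine: a small request (a point and a radius)
# produces a small response (matching ids), instead of shipping the whole
# layer across the WAN for the client to filter locally.
from math import hypot

def features_near(features, x, y, radius):
    """features: iterable of (feature_id, fx, fy) tuples held on the server."""
    return [fid for fid, fx, fy in features if hypot(fx - x, fy - y) <= radius]

nearby = features_near([(1, 0.0, 0.0), (2, 3.0, 4.0), (3, 50.0, 50.0)],
                       x=0.0, y=0.0, radius=10.0)
# -> [1, 2]
```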
SUMMARY
What this paper has endeavored to
illustrate is that technologies exist today which, when
applied smartly, can help solve the performance problems
caused by the explosive growth of spatial data and
applications.
Remote users need to access enterprise data with the same
efficiency as their colleagues at headquarters. Proper
selection of network products, and intelligent use of
application design can alleviate much of the performance
degradation commonly associated with networked
applications. TerraServer, accessible at www.terraserver.microsoft.com, demonstrates that real solutions to these
challenges exist.
REFERENCES
- Jim Gray et al. (Microsoft Research and Development),
Microsoft TerraServer Whitepaper, June 1998
- Informix Universal Server, November 1996
- All company names, brand
names, and product names used in this paper are
trademarks, registered trademarks, trade names,
or service marks of their respective owners.