Mark Harley

A Practical Maintenance System for Enterprise ArcSDE

ABSTRACT

ArcSDE is an outstanding product to use as a central data repository for enterprise GIS applications. Built on a relational database model, the information to be presented in ArcSDE format must be converted and loaded into the database. This session describes an efficient system for ArcSDE data delivery and long-term maintenance. Methods will be discussed for dealing with massive geographic layers, including tips for effectively maintaining near real-time data currency through synchronization with remote sources. Benchmarks are shown, trading off Geodatabase data modeling choices and selected loading methodology.


OVERVIEW

This paper is based on over seven years of real-world experience building and maintaining numerous large (Arc)SDE databases. During this time, ArcSDE has evolved significantly with dramatic increases in operating speed, functionality and integration with the full Esri software environment, now known as ArcGIS.

When designing a large ArcSDE database, there are some important considerations which affect the choice of data installation and maintenance methods. They include the size of the completed database, the computing environment, the relational database used, the geographic coverage, requirements for data currency and the end application served by the database.

As part of any large GIS server installation, it is necessary not only to design the database, but to identify data sources and load the corresponding Feature Classes into ArcSDE. Only then is the data available for general query and use.

Topics discussed here include:


ARCSDE SERVER COMPONENTS

ArcSDE could be described as a "middleware" component, adding spatial query capabilities to a standard relational database. Functionally, it provides an interface between a relational database and the end-user GIS software. This high-level block diagram, from Esri, shows the basic relationship between ArcSDE and the other major components of ArcGIS.

(See p10101.jpg for a diagram of ArcSDE server components)

Notice that ArcSDE connects the Geodatabase with the other software components. The data resides in the Multiuser Geodatabase, stored according to a predefined Data Model. The Geodatabase can be built on many standard relational database management systems (RDBMS), including Oracle, DB2, Informix, or SQL Server. Before any of the end-user GIS modules, such as ArcView, ArcInfo or an ArcIMS-driven browser can access the data, it must first be loaded into the Geodatabase. ArcSDE provides a generic interface to the spatial data. Therefore, it is possible for the same end-user application to connect to a Geodatabase built on any of the standard RDBMS.

This paper is covers issues related to efficiently loading and maintaining data currency within a variety of Geodatabases. Design details relating to the functionality of specific end-user applications are a separate topic, beyond the scope of this writing.


ARCSDE DATABASE DESIGN AND PREPARATION

When designing and building a large enterprise Geodatabase, there are numerous challenges to address. It is important to have a solid system design before proceeding with implementation. A system capacity plan should be developed, including an estimated number of active connections, a list of anticipated spatial or relational queries: typical and worst-case, a maximum acceptable query wait time, and a hardware plan including scalability for future expansion and/or upgrading in the event additional throughput is required.

Other elements to plan up front include:

Design and Data Model

The software design and data model required to drive the GIS application must be completed first, before a corresponding data delivery solution can be engineered. Data sources for each proposed Feature Class must be identified and evaluated to ensure the system requirements are met. Adjustments to the design may be required, depending on the availability or completeness of certain data layers.

ArcSDE Installation, Configuration and Tuning

There are several complex components which together make an enterprise Geodatabase server. When installing the ArcSDE system, it is essential that the components be evaluated and tuned as a system. There can be subtle interactions between the physical disk layout, the operating system, the commercial relational database software and ArcSDE. Each layer has its own mechanisms for buffering, caching and optimizing I/O. By understanding the nature of the most critical queries and carefully tuning the entire system, the maximum performance can be obtained.

It can not be stressed enough that the complete system must be evaluated as a whole. There are numerous occasions where a client has installed and attempted to tune each component separately. New server hardware is purchased and the appropriate version of the operating system is installed. The physical disks may be discrete local drives, several drives striped together, a local RAID, a network appliance or even a storage area network. A database administrator (DBA) installs the RDBMS software, attempting to make best use of the disk storage available and optimizing the database parameters according to its proposed use. Finally, the GIS people are called in to install and tune ArcSDE. After data has been installed and some test loading of the server is done, the performance and interaction of each major component should be evaluated. This last step is often omitted, frequently resulting in sub-optimal server performance.


DATA DELIVERY

The data delivery method discussed here has been used successfully and refined over the last seven years. Typically, the GIS data involved includes a full coverage of the United States and often Canada as well, including layers for streets, water, highways, railroads, numerous political boundaries, postal geography, landmarks, routing information. Some of the layers are similar in structure to the TIGER data available from the US Census Bureau, but in the form of a high-end commercial data product with frequent updates. Most of the enterprise-scale spatial databases covering the full United States are at least tens of gigabytes in size.

Data Preparation

The source data is usually assembled at the at a data provider’s location and the destination is an ArcSDE database at a client site, so the data delivery process is divided into two steps. First, "database-loadable" files are built and pre-validated for each ArcSDE layer. Then the database-loadable files are delivered to the end-user, along with a custom ArcSDE client that can read the database-loadable files and insert the data into ArcSDE. The final process is similar to running Esri’s tools shp2sde or cov2sde to load a shapefile or coverage into an ArcSDE Feature Class. There is considerable additional functionality provided with the custom ArcSDE data loading software, however.

Because the building of the database-loadable files is done prior to delivery to the client, each Feature Class is pre-verified ensure it meets the data modeling specifications. The data for every geographic feature (i.e. point, line, polygon, donut polygon, region, etc.) is run through a process to ensure that it contains valid geometry, according to the rules defined by ArcSDE. Attribute values destined for the corresponding Business Table entries are also checked ahead of time. This up-front verification guarantees that the data is clean and "ArcSDE-ready" prior to delivery, removing a major stumbling block and often saving considerable time in the data loading process.

Prior to client delivery, scripts are written to create the ArcSDE Business Tables, the ArcSDE Feature Classes and load each of the database-loadable files in sequence. When appropriate (for example, when using ArcSDE with an Oracle database) initial and next extent sizes are pre-calculated for each database entity in each Feature Class. A Feature Class has initial and next extent sizes on the Feature, Business and Spatial Index Tables. In addition, there are six Oracle indexes between those three tables. This means eighteen database size parameters are pre-calculated per Feature Class. For a typical Geodatabase with about twenty-five layers, this means 450 values are set up correctly ahead of time, saving considerable trial and error.

Initial ArcSDE Data Loading

At the client site, the second step in the process is executed. The scripts are actually executed to create the Business Tables, ArcSDE Feature Classes and load the data. A custom ArcSDE client is used to read the features and attributes from the database loadable files and insert them into the appropriate ArcSDE Features Classes. Because the process is fully scripted, the feature geometry is pre-verified as valid for ArcSDE, and key database parameters are calculated ahead of time, the process is expected to proceed smoothly. Loading of a large Geodatabase such as streets for the full United States can take several days, depending on the exact configuration of the server. The loading process should be carefully monitored to ensure that it executes to a successful completion.

The following diagram shows an overall view of the components in a typical ArcSDE system. The blue blocks are standard system components, including the ArcSDE software, relational database, operating system and hard disks. At the top-left are some examples of standard Esri clients: ArcInfo, ArcView and ArcIMS. Shown in light yellow are the components of the custom data loading solution. Database loadable files are read by the custom data loading application which communicates with the ArcSDE database through a universal API. The end result is to have ArcSDE data loaded in the RDBMS tables, and therefore available to other ArcSDE clients.

(See p10102.jpg for a diagram showing ArcSDE data loading)


DATA MAINTENANCE OPTIONS

When maintaining remote Geodatabases of US streets and related layers, different clients have different currency requirements. For some, it is sufficient to reload the entire database with a new release of data annually or quarterly. Others have more demanding needs and must have more frequent updates, like monthly or daily snapshots. The ArcSDE data loading system includes the option of applying transactional updates without requiring a full rebuild of the database. This saves the down-time associated with deleting are completely reloading the large Feature Classes.

Complete Reload

The SDE data loader described in the Data Delivery section was originally written to execute INSERT transactions. These are used when loading an empty Feature Class for the first time, or when deleting and completely replacing a layer. The delivered database loadable files contain geometry (point, line, polygon, donut polygon, etc.) and associated attributes. Each INSERT transaction, as executed, consists of reading a feature with its attributes from the database loadable file and inserting it into the ArcSDE database, via the ArcSDE API. A user-sizable buffer, usually set to around 1000 transactions, allows the loader to store multiple inserts and apply them to the ArcSDE database as one operation. This greatly improves the throughput of data loading when compared with inserting features one at a time.

Loading benchmarks have been performed on several relatively fast UNIX and Windows servers, well-tuned, running Oracle and SQL Server databases. With the ArcSDE loader set up to buffer features one thousand at a time, ArcSDE is often able to process INSERT transactions at a rate of two to four seconds per thousand features. One of the largest feature sets loaded is the full street layer for the United States. With approximately 40 million line features, an insert rate of two to four seconds per thousand represents about 22 to 44 hours of total loading time. This is approximately one to two days of continuous, uninterrupted processing.

Transactional Updates

The SDE data loader has been extended to execute two additional types of transactions: UPDATE and DELETE. This allows an existing ArcSDE layer to be modified, changing only those features and/or attributes which are different since the last full data reload. ArcSDE is particularly well suited to this type of transactional update model, due to the rows-and-columns nature of the underlying RDBMS storage.

A street database can be maintained efficiently using transactional updates. A feature class as large as streets for full United States can be loaded into a Geodatabase. As edits are made to a master database, they are captured and bucketed into INSERT, UPDATE and DELETE transactions. The transactions are then written to a database loadable file. Note that to support DELETE and UPDATE transactions, there must be field containing a unique identifier in the Business Table for each feature. This tells the software which feature to UPDATE or DELETE.

The ArcSDE data loading utility is used to apply transactions in the same way as an initial data load. One or more database loadable files are read and the contents, including INSERT, UPDATE and DELETE transactions are applied to the ArcSDE database. With medium to large feature classes, a unique index should always be created on the "unique identifier" field prior to applying the transactions. This allows the UPDATE and DELETE operations to execute in a reasonable amount of time.

Near Real-Time Updates

A method has been developed to maintain database currency in numerous remote ArcSDE enterprise servers from a single master database. The keeper of the master database is a provider of high-quality street level data for the United States and Canada. As the master database is updated, snapshots of only the changes, represented as INSERT, UPDATE and DELETE transactions, are captured and written to database loadable files. ArcSDE transactions can be captured on a monthly, or in some cases as frequently as a daily basis and delivered to the remote clients.

By frequently applying the transactions to the Street Feature Class in a remote database, it is maintained at nearly the same currency as the master database. This would allow, for example, a client to report a missing street or incorrect ZIP code to a data provider. The data provider would then verify the information, enter appropriate update(s) to the master database, and promptly deliver a set of ArcSDE transactional updates to apply the fix to the client database. The client is able to take advantage of the new information quickly, and even apply the transactions to a live, production ArcSDE database. This saves the typical down time of up to several days to delete and completely reload the full United States or North American street layer. The relational database foundation and overall design of ArcSDE allows this sort of transactional update system to function extremely well in the field.

Additional Functions Available While Loading

The ArcSDE data loader has some additional functionality which can be used during the data loading process. A strength of this tool is the ability to perform these functions during either a full data load, or when applying transactional updates.

Reprojection is available on-the-fly as the data is processed. Because the loader is accessing each feature, reading them one at a time from the compressed load file, it is able to perform an optional reproject prior to inserting the geometry into the ArcSDE database. Normally, database loadable files are created using straight geographic projection - that is, unprojected, raw, latitude and longitude coordinates. A parameter file is provided to pass the Esri coordinate reference string and any relevant parameters to the data loader. This makes it simple to build a Feature Class projected in Albers, a UTM zone or a state plane projection, for example. Timing tests have shown no measurable increase in loading time to specify a projection when compared with loading the unprojected latitude and longitude coordinates directly.

Another feature is the use of an attribute selector file. This allows for selection of a subset of the attributes in the database loadable file. For example, if a street layer were being built using data similar to the US Census Bureau’s TIGER data, the street geometry might include attributes for the Census tract, block group and block on the left and right sides of the segment. If the Geodatabase were being built for a system needing to display geography and include street names, the census tract and block fields might not be required. The desired fields would be listing in an attribute selector file and only those would be loaded into ArcSDE.

Attribute selector files have the additional feature that they can specify a different field name for use in the target ArcSDE database. Take an example where a layer consisting of ZIP code polygons was being built and the database loadable file contained a field called "ZIP". If the data model for the completed Geodatabase specified a field called ZIP_CODE instead, the attribute selector file could be used to specify both the loading of the ZIP field, and renaming it to ZIP_CODE in the target database.

The ArcSDE loading tool will also convert data types automatically, when possible. Using the ZIP field example once again, take a situation where the database loadable file stores the ZIP field as a string. If the ZIP (or ZIP_CODE, if renaming) field in the target ArcSDE database was declared to be a numeric type, the loader will convert the string ZIP codes to integers automatically. Note that this feature will only work properly in situations where the strings contain digits only. Conversion in the other direction will always work, however. That is, a numeric field can always be optionally loaded into a string in the target database and the loader will perform the conversion automatically.

Each of these features are integral to the ArcSDE data loading tool. Because the same tool is used for both a full database reload and applying transactional updates, the projection, attribute selection, field names and data types will be applied consistently, through all updates to the ArcSDE Feature Class.


CONCLUSIONS

The ArcSDE data loading tool, with associated scripts, pre-certified geometry and pre-calculated database parameters make it relatively easy to set up and load a large ArcSDE database. Once loaded, monthly or near real time transactional updates are an excellent methods to keep an ArcSDE database current. By only altering the features which have changed since the last reload, time can be saved compared with rebuilding the Feature Classes from scratch. With mission-critical systems, an ArcSDE server can remain on-line and available to users as the transactions are applied. There are numerous active ArcSDE clients successfully using this technology today, including selected Feature Sets on Esri’s own Geography Network server.


AUTHOR

Mark Harley
Director, Consultative Services, Geographic Data Technology, Inc.
11 Lafayette St.
Lebanon, NH 03766-1445
Telephone: (603) 643-0330
Fax: (603) 653-0249
E-mail: mark_harley@gdt1.com
Web: http://www.geographic.com/