Julie Conquest, Eddie Speer

Disseminating ArcInfo Dataset Documentation In a Distributed Computing Environment

The King County (WA) Department of Information and Administrative Services is currently completing a three-year project to create a central GIS library of ArcInfo datasets that is accessible to all county departments. Many of the departments in the King County government have independently developed ArcInfo datasets, but in the past other departments rarely utilized these datasets because no formal mechanism for sharing GIS data existed. As a part of this effort, an application has been developed to make dataset documentation available county-wide as well. Core datasets and associated documentation are kept on a central server that is accessible to most county departments via a wide-area network. Documentation includes metadata, data dictionary information, history, and custodial information. Updating and distributing the documentation involves three components: a documentation maintenance tool, translation scripts that convert the documentation to HTML, and a Web server. Because the documentation is available using a Web browser, any user on the county's network, or eventually the Internet, can view the documentation or submit an electronic request for a copy of the data.

Introduction

In 1994 the King County Government began a three-year project to create a central GIS library of ArcInfo datasets that would be available to all county departments via the county's wide-area network. Many of the datasets planned for inclusion in the central GIS library already existed, having been created and maintained by one of the county's departments. When these datasets became accessible to a wider audience, many of the users would no longer be aware of the history and content of a dataset. Central GIS library datasets would be less valuable to the organization if they were not accompanied by a complete set of documentation. Furthermore, if the documentation could be maintained and accessed on-line, it would save the Central GIS staff much time and effort. It was agreed that dataset owners would take responsibility for maintaining dataset documentation, because they are the ones most familiar with the history and content of their dataset.

The basic application requirements were:

  1. A complete set of documentation must be available to anyone requesting a dataset from the central GIS library.
  2. Datasets and the associated documentation must be available on-line to anyone having access to the county's wide-area network.
  3. GIS dataset maintainers must have an easy method for maintaining documentation.
  4. New or updated documentation must be transferred and transformed automatically from the GIS Maintenance Area to the point of access.
Early in the project it was decided to make use of Web browsers for viewing documentation and making requests for datasets. Web browsers are inexpensive and readily available on most platforms. Additionally, at some time in the future a Web server could be made available to the general public.

As designed, dataset documentation must pass through three phases: documentation collection, translation to HTML, and posting to a Web server where it can be viewed using a Web browser. During the first phase, dataset documentation is created or modified and moved to the Central GIS Library. In the second phase documentation is translated from ASCII to HTML using PERL scripts. In the third and final phase the updated HTML documents are made available county-wide on a Web server.
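The three phases can be sketched as a simple pipeline. The sketch below is in Python for illustration; the function names (collect_docs, doc2html, post_to_web) and paths are hypothetical stand-ins, not the actual King County scripts.

```python
# Hypothetical sketch of the three-phase documentation pipeline.
# collect_docs, doc2html, and post_to_web are illustrative names only.

def collect_docs(dataset):
    """Phase 1: assemble the ASCII documentation set for a dataset."""
    return f"DATASET: {dataset}\nCUSTODIAN: (entered via DOCTOOL forms)"

def doc2html(ascii_doc):
    """Phase 2: translate the ASCII documentation to HTML."""
    body = ascii_doc.replace("\n", "<br>")
    return f"<html><body>{body}</body></html>"

def post_to_web(html_page, server_dir):
    """Phase 3: place the HTML where the Web server can serve it."""
    return f"{server_dir}/index.html", html_page

path, page = post_to_web(doc2html(collect_docs("strms")), "/web/gislib/strms")
```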

Phase 1: Creation and Maintenance of Dataset Documentation

Before the application could be developed, the King County GIS Database Administrator and GIS users needed to reach consensus on what information would be included in the documentation. The King County GIS Database Administrator worked closely with the clients of the central GIS library to develop a series of forms that covered everything that would be included in the final documentation set. Agreement was reached that a complete set of documentation would have four components: dataset custodian information, metadata, data dictionary, and dataset history. While the application for creating and maintaining documentation was under development, dataset information was entered manually on the forms to get a jump-start on the documentation effort. The Metadata Form (see graphic) is an example of one of these forms.

To make the task of documentation creation and maintenance easier, an application was envisioned that would automatically extract metadata and data dictionary information from central GIS library datasets and provide a graphical user interface for dataset custodians to enter the remaining required dataset information. In the summer of 1995, no existing application could be found that would fulfill all of the application requirements. The GIS application development team made the decision to develop the application, DOCTOOL, in-house. To encourage use of DOCTOOL the application needed to be easy to learn and to use. After considering Motif or Visual Basic for developing the application's graphical user interface, ArcInfo FORMS was selected. ArcInfo FORMS was selected for several reasons: It could be easily ported to the county's various GIS platforms, the interface could be rapidly constructed and would integrate easily with the AML portion of the application, and GIS analysts would not have to learn additional skills in order to build or maintain the graphical user interface.

DOCTOOL was written to automatically extract as much of the required dataset information as possible from the dataset itself. The "DESCRIBE" command is used to extract most of the required metadata information. Feature attribute tables and INFO tables associated with the dataset can be queried for item names and item definitions. The ArcInfo FORMS menus provide an easy-to-use interface for entering the remaining information.

Any time a dataset is modified and placed in the GIS Maintenance Area for posting to the Central GIS Library, the documentation must also be updated to reflect and explain the changes made to the dataset. When a dataset is selected for documentation editing, DOCTOOL reads in the most current version of the documentation, either from the Central GIS Library or the GIS Maintenance Area. For this reason, DOCTOOL resides on the central GIS server even though most coverages are not maintained there.

Dataset Keywords

Dataset names are kept to 8 or fewer characters because of the limitation of the DOS operating system, which is the primary platform for county ArcView users. Because dataset names are short, the content of a dataset is not always obvious from the name alone. To help GIS users find the data they need, keywords can be associated with a dataset. A keyword is a single word or phrase that describes the information contained in a dataset. Keywords can be used to search for datasets pertaining to a topic when using a Web browser to access the Central GIS Library dataset documentation. The Keyword Menu allows entry of one or more keywords.
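A keyword search over the documentation can be sketched as follows. The dataset names and keyword lists below are illustrative, not actual Central GIS Library entries.

```python
# Illustrative sketch of keyword-based dataset search, assuming keywords
# are stored as a list of words or phrases per dataset.
# Dataset names and keywords are hypothetical.

keywords = {
    "strms": ["streams", "hydrography", "water"],
    "roads": ["streets", "transportation"],
    "parcel": ["parcels", "property", "assessor"],
}

def find_datasets(term):
    """Return names of datasets whose keywords match the search term."""
    term = term.lower()
    return sorted(name for name, words in keywords.items()
                  if any(term in w for w in words))

result = find_datasets("water")
```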

Dataset Custodian Information:

The Custodian Information Menu provides entry fields for the department (or outside organization) that owns and maintains a dataset, and for the person or persons who should be contacted with questions or problems concerning the dataset.

Metadata:

The Metadata Menu prompts for information that describes the data itself: coverage type, map units, projection, datum, etc. Much of this information can be automatically generated using the ARC DESCRIBE command.

Data Dictionary:

The data dictionary portion of the application has several submenus in addition to the Data Dictionary Main Menu. The Data Dictionary Main Menu lists the dataset's feature attribute tables and allows dataset maintainers to associate other INFO tables with the dataset. The Data Dictionary Main Menu also includes a list of annotation subclass levels in the dataset. To describe the items in a table the application user highlights a table name and presses the "Items" button. The Data Dictionary Item Menu contains a list of all items in the table. The selected item can be described. Item definitions are automatically extracted. The dataset maintainer can provide both a short and a long description of the item. Currency date, percent complete, and percent correct values help dataset users determine the quality of the data for an item. An item look-up table, a list of item codes, and/or valid value range can also be specified for an item. To provide descriptive information for an annotation subclass level, the dataset maintainer selects the subclass level and presses the "Describe" button. Annotation size, type, symbology, and fit can be stored for each annotation subclass level.
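The quantities collected by these menus can be pictured as a per-item record. This sketch uses Python for illustration; the field names mirror the text above but do not represent the actual DOCTOOL schema.

```python
# Illustrative per-item documentation record; field names follow the
# quantities described in the text, not DOCTOOL's actual file layout.

from dataclasses import dataclass, field

@dataclass
class ItemDoc:
    name: str
    definition: str               # extracted automatically from INFO
    short_desc: str = ""
    long_desc: str = ""
    currency_date: str = ""
    percent_complete: float = 0.0
    percent_correct: float = 0.0
    lookup_table: str = ""        # optional item look-up table
    valid_codes: list = field(default_factory=list)
    valid_range: tuple = ()       # optional (low, high) value range

item = ItemDoc(name="STRM-CLASS", definition="2,2,I",
               short_desc="Stream classification code",
               valid_codes=["1", "2", "3"])
```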

Dataset History:

The Dataset History Menu prompts for two types of history: initial history and on-going history. The initial history includes dataset source information, a description of the methods used to create the dataset, and the quality assurance procedures used. Once the initial history section has been completed it will seldom need modification. The On-going History section tracks changes made to the dataset throughout its existence. The expectation is that a brief description of maintenance activities will be entered into the on-going history section every time a modified version of the dataset is posted to the Central GIS Library.

Saving Documentation:

Once documentation maintenance is completed, the modified documentation is saved to an ASCII file in the GIS Maintenance Area. DOCTOOL has been designed to allow additional items to be included in the documentation sets without requiring editing of existing documentation.
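One way such extensibility can work is a "keyword: value" ASCII layout in which readers simply carry unknown keys along, so new items never force edits to existing documentation. This sketch is illustrative; the actual DOCTOOL file format is not shown in this paper.

```python
# Sketch of an extensible "KEY: value" ASCII documentation layout.
# A parser that tolerates unknown keys lets new items be added to the
# documentation sets without breaking existing files. (Illustrative only.)

def parse_doc(text):
    """Parse 'KEY: value' lines into a dict, keeping unrecognized keys."""
    doc = {}
    for line in text.splitlines():
        if ":" in line:
            key, _, value = line.partition(":")
            doc[key.strip()] = value.strip()
    return doc

sample = """DATASET: strms
CUSTODIAN: Surface Water Management
NEW_ITEM_ADDED_LATER: still parses fine"""

doc = parse_doc(sample)
```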

Posting Documentation to the Public Library:

New or updated documentation and datasets are posted from the GIS Maintenance Area to the Central GIS Library on a nightly basis. Datasets are run through Esri's QCAP (Quality Control Application) and some additional checks to ensure that all files associated with the dataset exist and that all items are present with the correct item definition. QCAP tables were originally populated using documentation generated by DOCTOOL. As part of the check-in process, the last modification date on history documentation for a dataset is checked and compared against the modification date on the dataset. If history documentation has not been modified since changes were last made to the dataset, a warning is issued to the GIS Database Administrator and the dataset custodian.
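The date comparison in the check-in step can be sketched as follows. The paths are hypothetical and the notification step is omitted; only the modification-time test is shown, in Python for illustration.

```python
# Sketch of the check-in staleness test: warn if the dataset was
# modified after its history documentation. Paths are hypothetical.

import os
import tempfile
import time

def history_is_stale(dataset_path, history_path):
    """True if the dataset is newer than its history documentation."""
    return os.path.getmtime(dataset_path) > os.path.getmtime(history_path)

# Demonstrate with two temporary files standing in for a dataset
# and its history documentation.
with tempfile.TemporaryDirectory() as d:
    dataset = os.path.join(d, "strms")
    history = os.path.join(d, "strms.hist")
    for p in (dataset, history):
        open(p, "w").close()
    # Pretend the history was last touched a day before the dataset.
    os.utime(history, (time.time() - 86400,) * 2)
    stale = history_is_stale(dataset, history)
```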

Once all documentation updates have been checked in, DOC2HTML, a tool for converting ASCII output from DOCTOOL to HTML (Hypertext Markup Language), is called to process modified documentation sets.

Phase 2: Convert ASCII Documentation to HTML (Hypertext Markup Language)

In order to present the documentation in a readable format, a scripting program reads the ASCII files created by DOCTOOL and creates output files in HyperText Markup Language (HTML). Files in this format can be viewed with any of a number of available Web browsers, either by opening the files directly or by accessing them over the Internet.

The scripting language chosen in this project was PERL (ref. 1). PERL interpreters exist on many different platforms, and most of the available web servers can execute PERL scripts. At design time, it was not known whether these scripts would be invoked at the time of presentation, or in batch by an administrator.
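The translation step can be sketched as follows. The production scripts were written in PERL; this Python equivalent merely illustrates the shape of the transformation, and the input format and function name are invented, not the actual DOC2HTML layout.

```python
# Illustrative ASCII-to-HTML translation in the spirit of DOC2HTML.
# The real scripts were PERL; the input format here is hypothetical.

import html

def ascii_to_html(title, lines):
    """Wrap 'KEY: value' documentation lines in a simple HTML table."""
    rows = []
    for line in lines:
        key, _, value = line.partition(":")
        rows.append(f"<tr><td>{html.escape(key.strip())}</td>"
                    f"<td>{html.escape(value.strip())}</td></tr>")
    return (f"<html><head><title>{html.escape(title)}</title></head>"
            f"<body><table>{''.join(rows)}</table></body></html>")

page = ascii_to_html("strms metadata",
                     ["PROJECTION: stateplane", "DATUM: NAD27"])
```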

The first step in developing the scripts was to define a standard "look" for the pages that would serve as the output. As of fall 1995, existing sites on the Internet that contained metadata were difficult to find, so the manual forms used by the Database Administrator served as an initial guide for the pages. It soon became apparent that a decision regarding the use of HTML/3.0 Tables needed to be made. In the summer of 1995, Netscape version 1.1 or later supported Tables, while Mosaic did not. Since it appeared that other browsers would eventually support tables, the pages were designed using them. An initial overview page lists the name of the coverage, a short description, and the number of features in each feature class. If a sample graphic is available, it is linked here, along with a link to a "comment page" for users to send comments back to the Database Administrator. At the bottom of the page are buttons that link the user to the other documentation pages.

Since each of these pages is required for each coverage, a standard set of scripts was developed to create them. The syntax of the calling script is:


% onecover <cover> <source directory> <output directory>
<cover> is the name of the coverage
<source directory> is the full path to the location of the ASCII files created by DOCTOOL
<output directory> is the full path to the location where the output HTML files should be written

This calling script calls several other scripts, each of which generally creates a single HTML page. An exception is the script for the data dictionary. The data dictionary page contains links for each database table referenced by the coverage. This includes extension tables, look-up tables, and symbol tables, as well as feature attribute tables. Also, if the user entered a list of "valid values" in the DOCTOOL application, these are presented as a separate page. As noted by Schulzrinne (ref. 2), HTML is suited for display, not for printing. For this reason, two versions of the data dictionary were created. One has hypertext links for each table, and the other lists all tables in a large, printable page. By default, the user sees the linked page, which also has a link to the more printable page.
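The two data-dictionary renderings can be sketched like this, in Python for illustration; the table names, output file names, and descriptions are invented.

```python
# Sketch of the two data-dictionary renderings: a page of hypertext
# links (one HTML file per table) and a flat, printable page.
# Table and file names are hypothetical.

def linked_page(tables):
    """One anchor per table, pointing at a separate per-table page."""
    links = "".join(f'<li><a href="{t}.html">{t}</a></li>' for t in tables)
    return (f"<ul>{links}</ul>"
            '<p><a href="all.html">Printable version</a></p>')

def printable_page(tables, describe):
    """All table descriptions concatenated into one large page."""
    return "".join(f"<h2>{t}</h2><p>{describe[t]}</p>" for t in tables)

tables = ["strms.aat", "strms.lut"]
descs = {"strms.aat": "arc attribute table",
         "strms.lut": "look-up table"}
```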

Another group of scripts was written to create "summary lists" of the available documentation, sorted by coverage size, date of last update, or theme.
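Such summary lists amount to simple sorts over per-dataset records. The fields and values below are illustrative, not actual library contents.

```python
# Illustrative summary-list sorts over hypothetical dataset records.

datasets = [
    {"name": "strms", "size_kb": 820, "updated": "1996-04-12",
     "theme": "hydrography"},
    {"name": "roads", "size_kb": 2400, "updated": "1996-05-01",
     "theme": "transportation"},
    {"name": "parcel", "size_kb": 9100, "updated": "1996-03-20",
     "theme": "cadastral"},
]

# Largest first, most recently updated first, and alphabetical by theme.
by_size = sorted(datasets, key=lambda d: d["size_kb"], reverse=True)
by_date = sorted(datasets, key=lambda d: d["updated"], reverse=True)
by_theme = sorted(datasets, key=lambda d: d["theme"])
```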

Phase 3: Make HTML Documents Available on a Web Server

As this project was being designed, the ultimate method of distributing the pages was uncertain. The following options were available:
  1. Install "HTTP Server" (web server) software on the primary library server, and generate the documentation "on-the-fly" using the PERL scripts
  2. Run the PERL scripts in batch and copy the output HTML files to a file server.
  3. Run the PERL scripts in batch and copy the output HTML files to a Compact Disc.
  4. Run the PERL scripts in batch and copy the output HTML files to a separate Web server.

As the project progressed, it was determined that the primary library server should not be used as a web server, both because it had never been specified to perform this function and for security reasons. A separate workgroup server was used as a Web server, and plans to move to a web server outside the agency firewall are being implemented. Although option 4 above is the primary method for distributing the documents, options 2 and 3 are also used for outside users or clients who do not have network access.

Conclusion

Since DOCTOOL was developed, a new version of DOCUMENT.AML was released with ArcInfo 7.04. The new version of DOCUMENT.AML covers more of the original requirements for the documenting tool, and may have been sufficient for our needs.

The greatest amount of effort for the project has gone into doing the initial dataset documentation. The quality of the documentation varies greatly depending on the dataset maintainer and the complexity of the dataset itself. Because the dataset custodians have been given the task of doing documentation, it has been difficult for the central GIS staff to require that documentation be complete. However, it is believed that documentation quality will improve as more GIS users access the Central GIS Library datasets and contact dataset custodians to inquire about missing portions of the documentation.

As final testing is completed for DOCTOOL and DOC2HTML, the Central GIS staff is already saving time and effort by having these tools. King County GIS users are also benefiting by having up-to-date documentation available instantaneously.

References

1. Wall, Larry and Randal L. Schwartz. Programming Perl. O'Reilly & Associates, 1991. (See also the Perl Frequently Asked Questions.)
2. Schulzrinne, Henning. "World-Wide Web: Whence, Whither, What Next?," IEEE Network, vol. 10, March/April 1996.

Author Information

Author: Julie Conquest, Senior Analyst, (206) 684-1493
Author: Eddie Speer, GIS Project Manager, (206) 684-2071
Organization: King County Department of Information and Administrative Services
Mailing Address: 821 Second Avenue, MS 170, Seattle, WA 98104
Fax: (206) 689-3145
Email: julie.conquest@metrokc.gov, eddie.speer@metrokc.gov