Julie Conquest, Eddie Speer
Disseminating ArcInfo Dataset Documentation In a
Distributed Computing Environment
The King County (WA) Department of Information and Administrative
Services is currently completing a three-year project to create a
central GIS library of ArcInfo datasets that is accessible to all
county departments. Many departments in the King County government
have independently developed ArcInfo datasets, but in the past other
departments rarely used these datasets because no formal mechanism
for sharing GIS data existed. As part of this effort, an application has
been developed to make dataset documentation available county-wide as well.
Core datasets and associated documentation are kept on a central server that
is accessible to most county departments via a wide-area network.
Documentation includes metadata, data dictionary information, history, and
custodial information. Updating and distributing the documentation occurs
by utilizing three components:
- An ArcInfo AML program for creating and maintaining dataset
documentation.
- PERL scripts for translating the ASCII files generated by the AML
program to HTML (Hypertext Markup Language).
- Netscape (or other WWW browser) for viewing dataset documentation.
Because the documentation is available using a Web browser, any user on the
county's network, or eventually the Internet, can view the documentation
or submit an electronic request for a copy of the data.
Introduction
In 1994 the King County Government began a three-year project to create a
central GIS library of ArcInfo datasets that would be available to all
county departments via the county's wide-area network. Many of the
datasets planned for inclusion in the central GIS library already existed,
having been created and maintained by one of the county's departments.
When these datasets became accessible to a wider audience, many
users would be unfamiliar with the history and content of a dataset.
Central GIS library datasets would be less valuable to the organization if
they were not accompanied by a complete set of documentation. Furthermore,
if the documentation could be maintained and accessed on-line, it would
save the Central GIS staff much time and effort. It was agreed that
dataset owners would take responsibility for maintaining dataset
documentation, because they are the ones most familiar with the history
and content of their dataset.
The basic application requirements were:
- A complete set of documentation must be available to anyone requesting
a dataset from the central GIS library.
- Datasets and the associated documentation must be available on-line to
anyone having access to the county's wide-area network.
- GIS dataset maintainers must have an easy method for maintaining
documentation.
- New or updated documentation must be transferred and transformed
automatically from the GIS Maintenance Area to the point of access.
Early in the project it was decided to make use of Web browsers for viewing
documentation and making requests for datasets. Web browsers are
inexpensive and readily available on most platforms. Additionally, at some
time in the future a Web server could be made available to the general
public.
As designed, dataset documentation must pass through three phases:
documentation collection, translation to HTML, and posting
to a Web server where it can be viewed using a Web browser.
During the first phase, dataset documentation is created or modified and
moved to the Central GIS Library. In the second phase documentation is
translated from ASCII to HTML using PERL scripts. In the third and final
phase the updated HTML documents are made available county-wide on a Web
server.
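The three phases above can be sketched as a simple batch pipeline. The Python sketch below is illustrative only: the county's actual tools were an AML program and PERL scripts, and every function and field name here is hypothetical.

```python
# Hypothetical sketch of the three-phase documentation pipeline.
# Data shapes are invented for illustration; the real tools were AML/PERL.

def collect(dataset_name):
    """Phase 1: DOCTOOL-style step -- gather documentation as ASCII."""
    return {"name": dataset_name, "format": "ascii"}

def translate(doc):
    """Phase 2: PERL-script-style step -- convert ASCII to HTML."""
    return {**doc, "format": "html"}

def post(doc, web_server):
    """Phase 3: publish the HTML document on the Web server."""
    web_server.append(doc)
    return web_server

server = []
post(translate(collect("parcel")), server)
```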
Phase 1: Creation and Maintenance of Dataset Documentation
Before the application could be developed, the King County GIS Database
Administrator and GIS users needed to reach consensus on what information
would be included in the
documentation. The King County GIS Database Administrator worked closely
with the clients of the central GIS library to develop a series of forms
that covered everything that would be included in the final documentation
set. Agreement was reached that a complete set of documentation would have
four components: dataset custodian information, metadata, data dictionary,
and dataset history. While the application for creating and maintaining
documentation was under development, dataset information was entered
manually on the forms to get a jump-start on the documentation effort.
The Metadata Form (see graphic) is an example
of one of these forms.
To make the task of documentation creation and maintenance easier, an
application was envisioned that would automatically extract metadata and
data dictionary information from central GIS library datasets and provide
a graphical user interface for dataset custodians to enter the remaining
required dataset information. In the summer of 1995, no existing
application could be found that would fulfill all of the application
requirements. The GIS application development team made the decision to
develop the application, DOCTOOL, in-house. To encourage use of DOCTOOL
the application needed to be easy to learn and to use. After considering
Motif or Visual Basic for developing the application's graphical user
interface, ArcInfo FORMS was selected. ArcInfo FORMS was selected for
several reasons: It could be easily ported to the county's various GIS
platforms, the interface could be rapidly constructed and would integrate
easily with the AML portion of the application, and GIS analysts would
not have to learn additional skills in order to build or maintain the
graphical user interface.
DOCTOOL was written to automatically extract as much of the required
dataset information as possible from the dataset itself. The
"DESCRIBE" command is used to extract most of the required metadata
information. Feature attribute tables and INFO tables associated with the
dataset can be queried for item names and item definitions. The ArcInfo
FORMS menus provide an easy-to-use interface for entering the remaining
information.
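As a rough illustration of this extraction step (the real tool calls the ARC DESCRIBE command from AML; the report format below is invented, not actual ARC output), the parsing might look like:

```python
# Hypothetical DESCRIBE-style report; the actual ARC output differs.
SAMPLE_REPORT = """\
Coverage type: polygon
Projection: STATEPLANE
Datum: NAD27
Map units: FEET
"""

def parse_report(text):
    """Turn 'Key: value' lines into a metadata dictionary."""
    meta = {}
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if sep:  # skip lines without a colon
            meta[key.strip().lower().replace(" ", "_")] = value.strip()
    return meta
```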
Any time a dataset is modified and placed in the GIS Maintenance Area
for posting to the Central GIS Library, the documentation must also be
updated to reflect and explain the changes made to the dataset. DOCTOOL
reads in the most current version of documentation for a dataset, either
from the Central GIS Library or the GIS Maintenance Area, when a dataset
is selected for documentation editing. For this reason, DOCTOOL resides on
the central GIS server even though most coverages are not maintained
there.
Dataset Keywords
Dataset names are kept to 8 or fewer characters because of the
eight-character filename limit of the DOS operating system, which is the primary platform
for county ArcView users. Because dataset names are short,
the content of a dataset is not always obvious from the name alone.
To help GIS users find the data they need, keywords can be associated
with a dataset. A keyword is a single word or phrase that describes the
information contained in a dataset. Keywords can be used to search for
datasets pertaining to a topic when using a Web browser to access the
Central GIS Library dataset documentation. The
Keyword Menu allows entry of one or more keywords.
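A keyword lookup of this kind reduces to a simple set-membership search. A minimal Python sketch follows; the dataset names and keyword sets are made up for illustration.

```python
# Hypothetical catalog mapping short dataset names to keyword sets.
CATALOG = {
    "strtnet": {"streets", "roads", "transportation"},
    "parcel":  {"parcels", "property", "taxation"},
    "hydro":   {"rivers", "lakes", "water"},
}

def find_by_keyword(catalog, word):
    """Return dataset names whose keyword set contains the search word."""
    word = word.lower()
    return sorted(name for name, keywords in catalog.items()
                  if word in keywords)
```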
Dataset Custodian Information:
The Custodian Information Menu provides entry fields for the name of
the department or organization (if outside of the county) that owns and
maintains a dataset, and for the person or persons who should be
contacted with questions or problems concerning the dataset.
Metadata:
The Metadata Menu prompts
for information that describes the data itself: coverage type, map
units, projection, datum, etc. Much of this information can be
automatically generated using the ARC DESCRIBE command.
Data Dictionary:
The data dictionary portion of the application has several submenus in
addition to the Data Dictionary Main Menu. The Data Dictionary Main Menu
lists the dataset's feature attribute tables and allows dataset
maintainers to associate other INFO tables with the dataset. The Data
Dictionary Main Menu also includes a list of annotation subclass levels
in the dataset. To describe the items in a table, the application user
highlights a table name and presses the "Items" button. The Data
Dictionary Item Menu contains a list of all items in the table. The
selected item can then be described: item definitions are extracted
automatically, and the dataset maintainer can provide both a short and
a long description of the item.
Currency date, percent complete, and percent correct values help dataset
users determine the quality of the data for an item. An item look-up
table, a list of item codes, and/or valid value range can also be
specified for an item. To provide descriptive information for an
annotation subclass level, the dataset maintainer selects the subclass
level and presses the "Describe" button. Annotation size, type, symbology,
and fit can be stored for each annotation subclass level.
Dataset History:
The Dataset History Menu
prompts for two types of history: initial history and on-going history.
The initial history includes dataset source information, a description of
the methods used to create the dataset and the quality assurance
procedures used. Once the initial history section has been completed it
will seldom need modification. The second type of history gathered is
on-going history. The On-going History section tracks changes made to the
dataset throughout its existence. The expectation is that a brief
description of maintenance activities will be entered into the on-going
history section every time a modified version of the dataset is posted to
the Central GIS Library.
Saving Documentation:
Once documentation maintenance is completed, the modified documentation is
saved to an ASCII file in the GIS Maintenance Area. DOCTOOL has been
designed to allow additional items to be included in the documentation
sets without requiring editing of existing documentation.
Posting Documentation to the Public Library:
New or updated documentation and datasets are posted from the GIS
Maintenance Area to the Central GIS Library on a nightly basis. Datasets
are run through Esri's QCAP (Quality Control Application) and some
additional checks to ensure that all files associated with the
dataset exist and that all items are present with the correct item
definition. QCAP tables were originally populated using documentation
generated by DOCTOOL. As part of the check-in process, the last
modification date on history documentation for a dataset is checked and
compared against the modification date on the dataset. If history
documentation has not been modified since changes were last made to the
dataset, a warning is issued to the GIS Database Administrator and the
dataset custodian.
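The date comparison amounts to checking two modification times. A hedged sketch, assuming timestamps obtained with something like os.path.getmtime (the county's actual file layout and check-in scripts are not shown in this paper):

```python
import os

def history_is_stale(dataset_mtime, history_mtime):
    """True if the dataset changed after its history was last edited."""
    return history_mtime < dataset_mtime

def check_in(dataset_path, history_path):
    """Compare on-disk timestamps and return a warning string, or None."""
    if history_is_stale(os.path.getmtime(dataset_path),
                        os.path.getmtime(history_path)):
        return ("WARNING: history documentation for "
                f"{dataset_path} is out of date")
    return None
```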
Once all documentation updates have been checked in, DOC2HTML, a tool for
converting ASCII output from DOCTOOL to HTML (Hypertext Markup Language),
is called to process modified documentation sets.
Phase 2: Convert ASCII Documentation to HTML
(Hypertext Markup Language)
In order to present the documentation in a readable format, a scripting
program reads the ASCII files created by DOCTOOL and creates output files
in HyperText Markup Language (HTML). Files in this format can be viewed
with any of a number of available Web browsers, either by opening the
files directly or by retrieving them over the network.
The scripting language chosen in this project was PERL
(ref. 1). PERL
interpreters exist on many different platforms, and most of the available
web servers can execute PERL scripts. At design time, it was
not known whether these scripts would be invoked at the time of
presentation, or in batch by an administrator.
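Though the production scripts were written in PERL, the core transformation is straightforward. The Python sketch below shows the idea; the (label, value) field layout of the ASCII files is an assumption, not the documented DOCTOOL format.

```python
from html import escape

def page_from_fields(title, fields):
    """Render (label, value) pairs from a DOCTOOL-style ASCII file as HTML."""
    rows = "\n".join(
        f"<tr><td>{escape(label)}</td><td>{escape(value)}</td></tr>"
        for label, value in fields)
    return ("<html><head><title>" + escape(title) + "</title></head>\n"
            "<body><h1>" + escape(title) + "</h1>\n"
            '<table border="1">\n' + rows + "\n</table></body></html>")

html_page = page_from_fields("parcel metadata",
                             [("Projection", "STATEPLANE"),
                              ("Datum", "NAD27")])
```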
The first step in developing the scripts was to define a standard "look"
to the pages that would serve as the output. As of fall 1995, existing
sites on the Internet that contained metadata were difficult to find.
Therefore, the manual forms used by the Database Administrator were used
as an initial guide for the pages. It soon became apparent that a decision
regarding the use of HTML 3.0 tables needed to be made. In the summer of
1995, Netscape version 1.1 and later supported tables, while Mosaic
did not. Since it appeared that other browsers would
eventually support tables, the pages were designed using tables. An
initial overview page lists the name of the
coverage, a short description, and the number of features in each feature
class. If a sample graphic is available, it is linked here, as well as a
link to a "comment page" for users to send comments back to the Database
Administrator. At the bottom of the page are buttons that link the user to
other pages:
- Custodian Contact information
- Metadata - contains projection, datum,
usable scale, and geographic extent
- Data Dictionary - links to all
associated tables used with the dataset, with
item definitions, valid ranges and values.
- History - source material, automation, and
quality control techniques used
- Disclaimer - any wording required for plots or other warnings is
listed on this page
Since each of these pages is required for each coverage, a standard set
of scripts was developed to create them. The syntax of the calling
script is:
% onecover <cover> <source directory> <output directory>
<cover> is the name of the coverage
<source directory> is the full path to the location of the ASCII files
created by DOCTOOL
<output directory> is the full path to the location where the output
HTML files should be written
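The driver's job is simply to run one page builder per documentation component for the named coverage. A Python sketch of the same shape follows; the real onecover is a PERL script, and the builder names and file-naming scheme here are hypothetical.

```python
def onecover(cover, builders, out_dir):
    """Run each page builder for one coverage; return (path, html) pairs."""
    written = []
    for page_name, build in builders.items():
        html = build(cover)          # each builder returns an HTML string
        path = f"{out_dir}/{cover}_{page_name}.html"
        written.append((path, html))
    return written

# Hypothetical builders standing in for the per-page scripts.
builders = {
    "custodian": lambda c: f"<h1>{c}: custodian</h1>",
    "metadata":  lambda c: f"<h1>{c}: metadata</h1>",
}
pages = onecover("parcel", builders, "/tmp/html")
```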
This calling script calls several other scripts, each of which generally
creates a single HTML page. An exception is the script for the data
dictionary. The data dictionary page contains links for each database
table referenced by the coverage. This includes extension tables, look-up
tables, and symbol tables, as well as feature attribute tables. Also, if
the user entered a list of "valid values" in the DOCTOOL application,
these are presented as a separate page. As noted by
Schulzrinne (ref. 2),
HTML is suited for display, not for printing. For this reason, two
versions of the data dictionary were created. One has hypertext links for
each table, and the other lists all tables in a large, printable page. By
default, the user sees the linked page, which also has a link to the more
printable page.
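Producing both versions from one table list is inexpensive, since each is a different rendering of the same data. A hedged sketch (the file-naming conventions here are invented):

```python
def dictionary_pages(cover, tables):
    """Build a linked index, per-table pages, and one printable page."""
    index = "\n".join(
        f'<a href="{cover}_{t}.html">{t}</a>' for t in tables)
    per_table = {f"{cover}_{t}.html": f"<h2>{t}</h2>" for t in tables}
    printable = "\n".join(f"<h2>{t}</h2>" for t in tables)
    return index, per_table, printable

index, per_table, printable = dictionary_pages("parcel",
                                               ["aat", "pat", "lut"])
```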
Another group of scripts was written to create "summary lists" of the
available documentation, sorted by coverage size, date of last update,
or theme.
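Each summary list is just the catalog re-sorted on a different field. A minimal sketch, with wholly invented catalog records:

```python
# Hypothetical catalog records for illustration only.
DATASETS = [
    {"name": "parcel",  "size_mb": 120, "updated": "1996-03-01",
     "theme": "cadastral"},
    {"name": "strtnet", "size_mb": 45,  "updated": "1996-05-15",
     "theme": "transportation"},
    {"name": "hydro",   "size_mb": 80,  "updated": "1995-11-20",
     "theme": "hydrography"},
]

def summary(datasets, key, reverse=False):
    """Return dataset names ordered by the given field."""
    return [d["name"] for d in sorted(datasets,
                                      key=lambda d: d[key],
                                      reverse=reverse)]
```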
Phase 3: Make HTML Documents Available on a Web Server
As this project was being designed, the ultimate method of distributing
the pages was uncertain. The following options were available:
- Install "HTTP Server" (web server) software on the primary library
server, and generate the documentation "on-the-fly" using the PERL
scripts
- Run the PERL scripts in batch and copy the output HTML files to a
file server.
- Run the PERL scripts in batch and copy the output HTML files to a
Compact Disc.
- Run the PERL scripts in batch and copy the output HTML files to a
separate Web server.
As the project progressed, it was determined that the primary library
server should not be used as a web server, as it had never been specified
to perform this function, and for security reasons. A separate workgroup
server was used as a Web server, and plans to move to a web server outside
the agency firewall are being implemented. Although the fourth option
listed above is the primary method for distributing the documents, the
second and third options are also used for outside users or clients who
do not have network access.
Conclusion
Since DOCTOOL was developed, a new version of DOCUMENT.AML was
released with ArcInfo 7.04. The new version of DOCUMENT.AML covers more
of the original requirements for the documenting tool, and may have been
sufficient for our needs.
The greatest amount of effort for the project has gone into doing the
initial dataset documentation. The quality of the documentation varies
greatly depending on the dataset maintainer and the complexity of the
dataset itself. Because the dataset custodians have been given the task
of doing documentation, it has been difficult for the central GIS staff
to require that documentation be complete. However, it is believed that
documentation quality will improve as more GIS users access the Central
GIS Library datasets and contact dataset custodians to inquire about
missing portions of the documentation.
As final testing is completed for DOCTOOL and DOC2HTML, the Central
GIS staff is already saving time and effort by having these tools. King
County GIS users are also benefiting by having up-to-date documentation
available instantaneously.
References
1. Wall, Larry and Randal L. Schwartz. Programming Perl. O'Reilly &
Associates, 1991. (See also the Frequently Asked Questions about Perl.)
2. Schulzrinne, Henning. "World-Wide Web: Whence, Whither, What Next?,"
IEEE Network, vol. 10, March/April 1996.
Author Information
Author: Julie Conquest, Senior Analyst, (206) 684-1493
Author: Eddie Speer, GIS Project Manager, (206) 684-2071
Organization: King County Department of Information and Administrative
Services
Mailing Address: 821 Second Avenue, MS 170, Seattle, WA 98104
Fax: (206) 689-3145
Email: julie.conquest@metrokc.gov, eddie.speer@metrokc.gov