Pat Horn Fell and David T. Hansen

Construction of a Theme Keyword Thesaurus for Indexing Search and Retrieval across Networks

Abstract:
The efficient management, searching, and retrieval of geospatial information are supported by indexing thematic content as well as geographic location and extent of data. The FGDC Content Standards for Geospatial Metadata requires that data providers use theme keyword indexing derived from an associated thesaurus. Use of thesauri assists data producers and managers to achieve some consistency in selecting and assigning indexing terms to related data sets. The consistency of thesaurus-assisted indexing provides structure and direction for data searching and retrieval, which is enhanced by the transparency of a browsable thesaurus.

This application began with the American Geographical Society Map Catalog Subject Entries and the U.S. Bureau of Reclamation Thesaurus of Water Resource Terms. The Theme Keyword Thesaurus for Geospatial Data is presented in association with geospatial metadata and bibliographic description of GIS data and other information in a variety of formats available from the USBR Mid Pacific Region server. Sources for terms included in the thesaurus are identified. URLs and links are provided when possible. Intended to be flexible and hospitable, the thesaurus may be extended by key terms from the thesauri (glossaries, definitions of terms, etc.) of the disciplines involved in the development of datasets and metadata at USBR.

The capture of a theme keyword list for digital or nondigital data is part of the preliminary documentation. Capture of keywords provides a subset of the thesaurus which becomes a list of keywords used in metadata for Mid Pacific geospatial datasets. At any time, the theme keyword list for a GIS dataset may be included as part of fully compliant metadata documentation. Assignment of keywords may be done in either ArcView or ArcInfo. Searches using the theme keyword index are done in ArcView or ArcInfo, or via World Wide Web browsers such as Netscape or Mosaic.

Software: Implementation of the theme keyword thesaurus has been done in HTML, Avenue scripts, and Arc Macro Language (AML).

Purpose

We have been working on a scheme to identify the thematic or subject content of GIS datasets and other information generated by USBR Mid Pacific Region that will comply with the keyword thesaurus requirements of the FGDC Content Standards for Geospatial Metadata with as little pain as possible. Compliance with the Content Standard has been hampered by all kinds of interesting issues, among which theme keywords probably don't get top priority for practitioners of GIS. Still, it has to be done.

Our thesaurus of theme or subject keywords is intended to be used to describe digital datasets and written materials created and used at the USBR Mid-Pacific Region. For producers of digital spatial datasets (a.k.a. GIS files) and written materials (reports and surveys), the thesaurus is meant to help in the selection of keywords for abstracting and indexing. Anyone searching for datasets or reports through the USBR Intranet or, eventually, the WWW will be encouraged to use the thesaurus as an aid to understanding the scope of what is available and to select terms to describe what they want. Subject searching is most successful when the query language matches the indexing.

The word is half his that speaks and half his that hears it. - Montaigne.

Development

In the beginning was the FGDC Content Standards for Digital Geospatial Metadata, section 1.6, which defines keywords as "words or phrases summarizing an aspect of the dataset" and requires that theme (subject), place, stratum and temporal keywords be supplied in metadata and that these keywords be associated with or "reference[d] to a formally registered thesaurus or a similar authoritative source" (1.6.2.1). There actually is a place to register thesauri or at least deposit them. There are two international clearinghouses: one for English language thesauri at the library of the University of Toronto and one for other languages in Warsaw (NISO, 1993). We haven't contacted them yet. Instead, I looked for available thesauri in the fields and disciplines represented in datasets produced at the Bureau of Reclamation Mid-Pacific Region offices. The FGDC Content Standard did provide a source of authoritative (and presumably registered) thesauri, the USMARC Code List for Relators, Sources, Description Conventions. That's where I found out that we already had a thesaurus.

In 1971, the Department of the Interior and the Bureau of Reclamation published a Thesaurus of Water Resources Terms: a Collection of Water Resources and Related Terms for Use in Indexing Technical Information which is terrific, except that it's more than 25 years old and the technical information it was meant to index was text, not geospatial digital datasets. The missing elements were, of course, recent and current terminology, but also terms to describe the map-like content of GIS data layers.

Many of the thesauri I looked at included some mapping and/or cartographic terms, although none had the GIS-specific terminology we needed. The American Geographical Society's Map Catalog Subject Entries, which they no longer use, contains a lot of useful cartographic terms, was available online, and seemed like a reasonable starting place. Using it as a backbone, I've flattened it a bit (more later about hierarchy), cut it a lot, and added to it from a number of more recent sources to begin developing a thesaurus that will, we hope, fulfill the letter and the spirit of the Metadata Content Standard requirements.

Sources used for thesaurus terms so far have been those mentioned above plus:

a title keyword printout of the contents of the Bureau of Reclamation Mid-Pacific Regional Library
terms used in Cultural Resource Management, solicited from the archaeologists in the Environmental Resources Management Division who referred us to the glossary in California Archaeology, the language of the federal and state regulations which define CRM, and the language of the California State Historical Preservation Office's site inventory form
judicious mining of the USGS online map feature definitions
the earthbound portions of NASA's online thesaurus
keywords from existing metadata for datasets produced at USBR
ambient terminology - words that hang like dust motes in the air above the cubicles of people working on their questions in ArcInfo and ArcView

Implementation of the Thesaurus

This implementation has been designed with the intent to provide an opportunity for a dataset developer to select a set of keywords describing the theme or subject content of a dataset. Once the set is completed to the satisfaction of the data developer, it can be written to the appropriate section of the documentation. Permanent subsets may be created for use in describing recurring themes. The same functions may be used by a searcher to describe the theme or subject content of a desired dataset.

In our implementation, we have been aware of some impediments to the assignment and use of keywords. The implementation has set the following requirements.

Ability to make use of any keyword thesaurus that is or can be made available as a text file
Vocabulary control, identification of preferred terms, standardized spelling
Ability to select multiple keywords from full alphabetized lists or from shorter thematic keyword lists.
Insertion of the keywords into metadata for the GIS theme
Ability to identify the usage of keywords in metadata
Assignment of keywords by the person responsible for the source data, not only GIS personnel

Keyword lists or thesauri are subject to continual modification. There are a variety of lists representing different disciplines or areas of practice. They represent terms used by one or more authorities and are assumed to be related to active terminology for the discipline. As such, it is assumed that a current glossary exists which defines the usage of the terms.

The structure of contributing formal thesauri usually includes and identifies major terms and associated or related terms in a hierarchy. Terms for keyword lists may be selected from any level of hierarchy.

The vocabulary control function of a thesaurus identifies preferred terms and those they replace. Both sides of the relationship are noted. Terms which are controlled out of the vocabulary have a USE or SEE reference to the term that is to be used instead. The preferred term has a UF (Used For) or SEE FROM reference to the term or terms it replaces. Only preferred terms may be selected for use in keyword lists.
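The two-sided USE / UF relationship described above can be sketched in a few lines. This is an illustration only, written in Python rather than in the Avenue scripts we actually used; the names `USE`, `build_used_for`, and `resolve` are hypothetical, and the sample entries are borrowed from the bathymetry and levees examples elsewhere in this paper.

```python
# Hypothetical USE references: non-preferred term -> preferred term.
# In practice these would be read from the thesaurus table, not hard-coded.
USE = {
    "submarine relief": "bathymetry",
    "dikes (water protection)": "levees",
}


def build_used_for(use_refs):
    """Derive the reciprocal UF (Used For) references from the USE references."""
    used_for = {}
    for non_preferred, preferred in use_refs.items():
        used_for.setdefault(preferred, []).append(non_preferred)
    return used_for


def resolve(term, use_refs):
    """Return the preferred form of a term; terms with no USE reference pass through."""
    key = term.strip().lower()
    return use_refs.get(key, key)


USED_FOR = build_used_for(USE)

print(resolve("Submarine relief", USE))
print(USED_FOR["bathymetry"])
```

Keeping only the USE side as data and deriving UF from it is one way to guarantee that both sides of the relationship stay in step when the list is modified.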

The initial implementation of our tool for selecting and reporting keywords for GIS themes is in ArcView. Avenue scripts were prepared to read in ASCII text files of the keyword lists or thesauri and output a reformatted comma-delimited text file. This text file can then be loaded into INFO or database tables. Any number of lists can be read in as required for different disciplines or to keep the tables current.

For loading into a database, preferred terms, modifiers of the preferred term, related terms, and terms that are not used are identified by different columns. The AGS Map Catalog Subject Entries was used as a test file. This list has some common characteristics of a thesaurus. There is a hierarchy of major terms, and terms with modifying phrases. Associated terms (See also) are identified. Terms which replace other terms (Used for) are identified. Figure 1 shows a portion of the keyword list in the original format and as an ASCII file ready for loading into a database table.

Portion of source file of the Map Catalog Subject Entries of the American Geographical Society and reformatted ASCII file

All major headings are retained. Modifying phrases are linked with their major term to form a keyword entry. Associated terms are paired with related terms in a second column heading. Terms which are replaced by another term are in the third column. All controlled words for the keyword database table are in the first column. All phrases in any column are available to generate a searchable list. Although it is not formatted as a hierarchical thesaurus, relationships are identified. This file is then ready for loading into INFO or as an ArcView table to form the basic keyword list.
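As a rough sketch of the reformatting step just described, the following Python fragment turns a small thesaurus-like text into three-column comma-delimited rows. It is not our Avenue script, and the input layout is an assumption made for illustration: major headings start in the first column, indented lines are modifying phrases, and "See also:" and "Used for:" lines are labelled as such. The sample terms, the `--` joiner, and the column names are likewise hypothetical.

```python
import csv
import io

# Hypothetical source layout; the real AGS file differs in detail.
SAMPLE = """\
Bathymetry
    Used for: Submarine relief
    See also: Oceanography
Boundaries
    Administrative
    See also: Land tenure
"""


def reformat(lines):
    """Return [keyword, related term, replaced term] rows from the source lines."""
    rows = []
    major = None
    for raw in lines:
        text = raw.strip()
        if not text:
            continue
        if not raw[0].isspace():
            # An unindented line is a major heading; it is retained as a keyword.
            major = text
            rows.append([major, "", ""])
        elif major is None:
            # An indented line before any heading has nothing to attach to.
            continue
        elif text.lower().startswith("see also:"):
            rows.append([major, text.split(":", 1)[1].strip(), ""])
        elif text.lower().startswith("used for:"):
            rows.append([major, "", text.split(":", 1)[1].strip()])
        else:
            # A modifying phrase is linked with its major term.
            rows.append([f"{major}--{text}", "", ""])
    return rows


def to_csv(rows):
    """Write the rows as comma-delimited text with a header line."""
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow(["keyword", "related", "used_for"])
    writer.writerows(rows)
    return out.getvalue()


rows = reformat(SAMPLE.splitlines())
print(to_csv(rows))
```

The controlled terms all land in the first column, so a selection list can be built from that column alone while the other two remain available for searching.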

As a database table, this list provides the basis for Avenue scripts to generate selected theme keywords to be included in metadata indexing or to be applied as a search string against existing metadata. Additionally, the table provides the management functions of identifying the terms actually used for indexing and the capacity to count the number of times each term is used.
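The two management functions mentioned here, counting how often each term is used and applying selected terms as a search against existing metadata, might look something like the sketch below. It is written in Python for illustration; the dataset names, keywords, and function names are invented and do not describe our actual tables.

```python
from collections import Counter

# Hypothetical metadata records: dataset name -> assigned theme keywords.
METADATA = {
    "canals_1996": ["canals", "irrigation", "water conveyance"],
    "soils_delta": ["soils", "landforms"],
    "levee_survey": ["levees", "flood control", "landforms"],
}


def keyword_usage(metadata):
    """Count the number of datasets in which each keyword is used."""
    counts = Counter()
    for keywords in metadata.values():
        # set() so a keyword repeated within one record is counted once.
        counts.update(set(keywords))
    return counts


def search(metadata, wanted):
    """Return the names of datasets indexed with every keyword in `wanted`."""
    wanted = {term.lower() for term in wanted}
    return sorted(
        name
        for name, keywords in metadata.items()
        if wanted <= {k.lower() for k in keywords}
    )


usage = keyword_usage(METADATA)
print(usage["landforms"])
print(search(METADATA, ["landforms", "soils"]))
```

A term that appears in the thesaurus but has a count of zero is a candidate for review: it may be unnecessary, or it may be a sign that indexers cannot find it.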

Selecting keywords from very long lists or thesauri can be a frustrating experience. You need access to the entire list, but a selection routine that requires return to the entire list for each term is going to inspire very short lists and half-hearted indexing. The designed routine therefore permits the user to select terms from the entire thesaurus, to create a group of keywords for a theme, or to select words from existing groups associated with particular themes or subject areas. Figure 2 shows the process of reviewing and selecting keywords.
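The idea of combining picks from the full list with terms drawn from saved theme groups can be sketched as follows. This Python fragment only illustrates the logic, not the ArcView routine itself; the group names, their contents, and `build_keyword_list` are hypothetical.

```python
# Hypothetical saved keyword groups for recurring themes.
GROUPS = {
    "cultural resources": [
        "archaeological sites",
        "historic sites",
        "site inventories",
    ],
    "boundaries": ["administrative boundaries", "watershed boundaries"],
}


def build_keyword_list(from_full_list=(), group_names=(), groups=GROUPS):
    """Merge individually chosen terms with saved groups, dropping duplicates.

    The order of first selection is preserved so the list reads the way the
    indexer built it.
    """
    candidates = list(from_full_list)
    for name in group_names:
        candidates.extend(groups[name])

    selected = []
    seen = set()
    for term in candidates:
        key = term.lower()
        if key not in seen:
            seen.add(key)
            selected.append(term)
    return selected


keywords = build_keyword_list(
    from_full_list=["soils", "historic sites"],
    group_names=["cultural resources"],
)
print(keywords)
```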

Keyword selection for a GIS theme using the ArcView Help Utility

The user can begin the keyword list for the theme either from the ArcView menu bar or in the Help window. The search utility of ArcView Help permits rapid viewing of long lists of keywords. It also identifies and links to associated terms. Typing in terms that have been replaced in the controlled list takes the user to the preferred term. The Mid-Pacific Region does not have a professional indexer. Providing keyword selection in ArcView allows us to have the professional engineer or scientist take responsibility for describing their own data. Identifying groups of keywords for recurring themes allows the GIS professional to assign reviewed terms to sets of related themes.

Presently, the Document program in ArcInfo provides very limited space for keywords in the metadata description. We use the Document program as one of our documentation tools. This ArcView application prepares a text file that can contain a robust list of identified keywords for a theme. This text file contains other metadata information that is important for our office in managing our GIS themes. This text file can then be substituted for the narrative file that is generated by the Document program. Figure 3 shows various keyword forms for a narrative text file of one GIS theme.

Two versions of the same set of keywords are shown. The upper version contains the list of keywords as originally captured from the AGS Map Catalog Subject Entries. A revised version flattens the structure, which may be more usable for currently available search engines.

To recap, our implementation is in ArcView with storage of local thesauri and keyword groups in INFO or in dBase format. Any ASCII file of potential keywords can be used, provided they follow some basic conventions. Avenue scripts are used to review and reformat text files of keyword thesauri. Access for the user is in ArcView. No limit is set to the number of keyword phrases that can be identified for a theme.

Barriers

Some barriers to the development and implementation of keyword thesauri in the GIS community have been:

Lack of familiarity with thesaurus structure: Even worse, you may have had experiences with highly structured and exhaustive thesauri which clearly took years of mind-numbing, anal-retentive, intellectual labor to produce.
Lack of familiarity with thesauri in the range of disciplines represented in your datasets: Let's say you're a soil scientist turned GIS specialist helping a former CAD expert now well on his way to becoming a GIS technician to build an analytic dataset (dare one say map?) combining historical and prehistorical archaeological data with paleogeologic data, contemporary and historical hydrology, soils, and landforms. You're working with data from at least half a dozen disciplines and subdisciplines. Few, if any, are your own. Where do you shop for descriptive terminology?
Inadequacy of existing thesauri to represent the kinds of datasets you produce: The concepts are too new, too cross-disciplinary, not scholarly enough to have been incorporated into a hard copy thesaurus. Most of the thesauri I found in research libraries were between 5 and 30 years old. Most of the living, growing thesauri are online, some disguised as descriptors and keywords for databases or internal documents tended by the organizations that grew them.
Perception that the FGDC requirement is unrealistic and will be revised to reflect what the community finds possible or worth doing: A project manager I contacted called the requirement unrealistic. Then he sent me off to have a look at a sort of thesaurus designed for one of his projects. It consisted of several broad theme headings associated with minimal sets of subject keywords and phrases, but it was formatted in an elaborate hypertext software environment. I got it right away that it was lots more fun to create the environment than the thesaurus.
The eternal hope that some other organization will do the work and make it available to the rest of the GIS community: There is hope here. Work has been done and is being done in a lot of agencies and organizations, including NASA, USGS, USBR, and at the project level. It's clear, though, when you have a look at what's been done, that we're all building thesauri for local data and local needs. NASA's thesaurus is gorgeous, and it may be great for NASA, but they occupy, literally, a whole different stratum than most of us.

Benefits

The point of a theme or subject keyword thesaurus in the context of GIS datasets is to support data sharing. The bottom line here is the bottom line: this is about creating the broadest possible base through which to distribute data gathering costs (Frank, 1994, p.588). Research, gathering and creating data, is big fun but not cheap. Digitizing data is somewhat less fun and the single greatest expense associated with GIS. Why should your organization pay the full price of duplicating data that is already available?

It's almost too obvious to say out loud, but how much information about your organization's data depends on the memories of individuals? How much is in files with labels that once seemed perfectly clear but have become positively cryptic over time? How much work is duplicated even within your organization in, for example, the creation of macros? Metadata offers a structure for keeping internal records for your organization.

Subject Keyword Thesauri

Subject access (and its obverse, subject indexing) is a big messy epistemological can of worms and the larger the database, the messier it gets. If you have any doubts about the messiness of uncontrolled subject keyword searching in very large databases, try using any of the major WWW search engines for a simple search on a one or two word topic of your choice. Call me when you've removed the duplicates and winnowed the list down to whatever you really wanted or needed to know about. Thesauri are one way of dealing with the mess by providing a little control and a little context to lessen the cognitive load.

The control part is fairly simple stuff that can be done to whittle down the sheer size of the potential searching and indexing vocabulary. It standardizes (and corrects) spelling. It establishes whether a singular or plural form is to be used. It refers the user to so-called preferred terms - which means that you'll have to use bathymetry rather than submarine relief if you actually want to retrieve something on the topic or have it found. Scope notes (little parenthetical clarifications like this one) are added to sort out the meanings of potentially confusing terms such as: DIKES (igneous intrusions). For water protection use LEVEES.

The context part is not so simple. It has to do with the relatedness of terms and, in carefully wrought comprehensive thesauri, the hierarchical relations between terms. The arguments for hierarchical arrangement are based on what is believed to be a hierarchical ordering principle in human memory (Najarian, 1981; Rosch, 1978). The difficulty in working with hierarchy is that, while the principle may be at work in all of us, the particular hierarchical structures of individuals are unique; they differ with respect to the breadth and depth of knowledge and experience. That's my excuse, or rationale, for abandoning hierarchical structure except for a trial collection of groups of very strongly related terms having to do with Cultural Resource Management, analytical methods, laws and regulations, and a stab at classifying kinds of boundaries.

Summary - Metadata for Resource Sharing

MARC, MAchine Readable Cataloging, has become the basis for resource sharing among libraries. The hard line proposed by some excellent librarians is that digital spatial data ought to be cataloged using the MARC format so it, too, can be shared. (If sharing reminds you of either kindergarten or therapy groups, think distribution.) Because MARC was originally designed to create catalog records for books and textual materials, it had to be adapted for the cataloging of hard copy maps by acknowledging cartographic data attributes, such as scale, projection, and coordinates, with their own fields. The excellent librarians have backed their proposal to catalog digital spatial data in MARC format with a lot of hard collective work tweaking MARC structure once again, this time to accommodate the attributes of GIS data (Mangan, 1995). There is now a crosswalk for catalogers to link the fields of the FGDC Content Standard for Digital Geospatial Metadata with those of USMARC. What this means for libraries is that they can now catalog GIS data more or less the way they catalog other formats. What it means for the GIS community is that you can let them do it.

Data producing organizations, quaintly called the "drawing office" by one map librarian (Perkins, 1992), have a different approach to organizing data and output. They have fewer items to control than libraries, fewer requests to process, and their community of data users tends to be aware of both format and subject matter. Organizational schemes used by data producers don't generally run to library-style universal classification and cataloging. Depending on the size and nature of the organization, they may organize in piles or the computer equivalent of piles (odd collections of files spread throughout the system in unclassified order), or impose a formal local system based on anything from client, location or project identification to some form of cataloging with multiple access points. A lucky few get to hand the problem over to the National Archives to sort out.

Both of these approaches, from the library and from the data-producing agencies, have their advantages and limitations. The FGDC is encouraging you, through the requirements of the Content Standard, to bridge the gap between universal and local forms for the organization of data for digital distribution. The matter of subject keywords is a small piece of the problem. We hope our solution demonstrates two things: that theme keyword thesauri needn't be built from scratch and that the whole process can be done fairly simply using software you already have.

This implementation makes use of ArcView with storage of thesauri and keyword groups in INFO or in dBase format. Any ASCII file of keywords can be used provided that it follows some basic conventions. The lists are then available for selection of keywords for assignment to a GIS theme or for query against the existing metadata.

References

American Geographical Society (n.d.). AGS Map Catalog Subject Entries [Online]. Available at http://leardo.lib.uwm.edu/maptops.html [March 23, 1997].

Federal Geographic Data Committee (1994). Data standards: Content standard for Federal geospatial metadata [Online]. Available at: http://fgdc.er.usgs.gov/metaover.html [March 22, 1996].

Frank, Steven (1994). Cataloging digital geographic data in the information infrastructure: a literature and technology review. Information Processing and Management, 30(5), pp. 587-606.

Library of Congress, Network Development and MARC Standards Office (1993). USMARC code list for relators, sources, description conventions (1993 ed.). Washington, D.C.: Cataloging Distribution Service, Library of Congress.

Mangan, Elizabeth U. (1995). The making of a standard. Information Technology and Libraries, 14(2), 99-110.

Moratto, Michael J. (1984). California archaeology. New York: Academic Press.

Najarian, Suzanne (1981). Organizational factors in human memory: implications for library organization and access systems. Library Quarterly, 51(3), 269-291.

National Information Standards Organization (U.S.) (1993). Guidelines for the construction, format, and management of monolingual thesauri (ANSI/NISO Z39.19-1993). Bethesda, MD: NISO.

Perkins, Chris (1992). Metaphysical mayhem? Retrieving and describing maps and spatial data in the map library and drawing office. Bulletin of the Society of University Cartographers, 26(2), pp. 21-24.

Rosch, Eleanor (1978). Principles of categorization. In Eleanor Rosch and Barbara B. Lloyd (Eds.), Cognition and Categorization (pp. 27-48). Hillsdale, NJ: Lawrence Erlbaum Associates.

U.S. Department of the Interior, Bureau of Reclamation (1971). Thesaurus of water resources terms: a collection of water resources and related terms for use in indexing technical information. Denver, CO: U.S. Department of the Interior.

U.S. Geological Survey (1996, February 14). USGS Mapping Information: Feature Class Types [Online]. Available at: http://mapping.usgs.gov/www/gnis/features.html [March 31, 1997].

U.S. Geological Survey (1995, October 4). USGS National Mapping Information: Geographic Names Information System Data Users Guide: APPENDIX C.--Geographic Names Information System (GNIS) Feature Class Definitions [Online]. Available at: http://mapping.usgs.gov/www/ti/GNIS/gnis_users_guide_appendixc.html [March 31, 1997].

Works Consulted

Bovey, J. D. (1995). Building a thesaurus for a collection of cartoon drawings. Journal of Information Science, 21(2), pp. 115-122.

Lai, Pohchin & Gillies, Charles F. (1991). The impact of geographical information systems on the role of spatial data libraries. International Journal of Geographical Information Systems, 5(2), 241-251.

Library of Congress (1976, and supplements through 1991). Library of Congress Classification Schedule: Class G: Geography, Maps, Anthropology, Recreation (4th ed.). Washington, D.C.: Library of Congress.

NASA, Scientific and Technical Program Office (1994). NASA Thesaurus [Online]. Available at http://www.sti.nasa.gov/nasa-thesaurus.html [April 18, 1997].

NASA, Scientific and Technical Program Office (1997, January). NASA Thesaurus Supplement [Online]. Available at http://www.sti.nasa.gov/Pubs/Thesaurus-Supplement.html [April 18, 1997].

Milstead, Jessica L. (1993). Thesaurus management software. In Encyclopedia of library and information science (Vol. 51, Supp. 14, pp. 389-407). New York: Dekker.

Weinberg, Bella Hass, & Cunningham, Julie A. (1988). The design of online thesauri. In Martha E. Williams & Thomas H. Hogan (Compilers), National online meeting proceedings - 1988: Proceedings of the 9th national online meeting, New York, May 10-12, 1988 (pp. 411-419), Medford, NJ: Learned Information.

Pat Horn Fell
MLIS Candidate,
School of Library and Information Science
San Jose State University
San Jose, California
Email: phfell@netcom.com

David T. Hansen
GIS Specialist / Soil Scientist
U.S. Bureau of Reclamation
Mid Pacific Region
2800 Cottage Way
Sacramento, CA 95825-1898
Telephone: (916) 979-2418
Fax: (916) 979-2505
Email: dhansen@mpgis7.mp.usbr.gov