Doug Nebert

Supporting Search for Spatial Data on the Internet:
What it means to be a Clearinghouse Node

The Federal Geographic Data Committee (FGDC) has been facilitating access to digital spatial information on the Internet among federal and state organizations. By Executive Order of the President, all agencies are required to document their digital spatial data and make it available to the public to encourage re-use of this expensive information. The National Geospatial Data Clearinghouse is an FGDC-sponsored activity that provides a set of technical solutions for making spatial data discoverable on the Internet. The Clearinghouse is an implementation of concepts that define service within the National Spatial Data Infrastructure (NSDI). This paper describes the requirements for a site to be considered a service node within the National Geospatial Data Clearinghouse.


Background

The current definition of the National Geospatial Data Clearinghouse is available on the Federal Geographic Data Committee (FGDC) Web server and describes the Clearinghouse activity in terms of the institutions, policies, electronic infrastructure, and data that it comprises.

Although the Clearinghouse activity clearly includes institutions, policies, electronic infrastructure (hardware and software), and data, attention must now be focused on consolidating the infrastructural definition of a "Clearinghouse Node in the NSDI." Without explicit agreement on which Internet protocols, attributes, metadata formats, and data delivery formats can be expected, the goal of providing search capability across many NSDI nodes cannot be fulfilled. As increasing numbers of spatial data services appear on the Net, the ability to discover and exploit them decreases accordingly.

Browse versus Search

Most existing spatial data services provide only Web server access to their spatial metadata or data. Web servers provide multimedia content delivery that presents a "front door" to an organization and its holdings. The use of Web servers to hold organizational information is becoming overloaded as agencies discover the ease of "publishing" on the Net using the Web. Consequently, any one piece of information or activity gets buried in ever deeper levels of linkages, and not all activities can be hot-linked from the "home page" or starting point. There is also no guarantee that a user encountering a page will start at the "top" or home page, and the user may never traverse the right links to find the information. For example, a recent visit to one large federal Web site revealed that extensive spatial data were available, but access to the actual data required traversal of 15 levels of links. More than five levels of links exceed the attention span of most users, and the inclusion of large numbers of tangential "look here" links along the way can serve to elaborate on a concept or, more likely, distract users from their target. What we are seeing are pressures on the features of the Web that indicate serious limits to its scalability for operational resource discovery in large organizational settings.

There is an interesting pattern of evolution revealing itself on the Internet with respect to information service protocols. A protocol is created that provides a unique new level of access and organization to the Net, usually for a restricted community. Once it becomes popular, the ability to find out "who" is serving "what" using that protocol becomes difficult, and an index of all services of that protocol on the Net is subsequently created. This happened years ago with anonymous ftp sites, which were immediately useful to scientists wanting to share files on the Internet. When too many of these sites became available, the Archie software was developed by McGill University to traverse the Internet at night and index the file names held in anonymous ftp sites. The Gopher protocol is similarly indexed by Veronica to map out all of "gopherspace."

With the popularity and complexity of the Web, a number of Web index services have appeared on the Internet -- Lycos, Yahoo, Web Crawler -- all with the intent of indexing the text that occurs within on-line HTML documents. Such services (already very popular and often overwhelmed) can provide users with the ability to find documents on servers by pointing at that lowest level. This approach, however, presents several problems of information architecture.

In summary, the Web offers a high level of flexibility and creativity in the service of spatial and non-spatial information but presents challenges in indexing and consistency of presentation. The service of spatial information requires both a browse capability, as provided by the Web, and a search capability as provided through the Z39.50 protocol.

Federal Mandates

The specification of the behavior of a Clearinghouse service must include a set of requirements. The requirements are essential in specifying solutions to specific objectives -- solutions that require information, software, and hardware to be fulfilled. The most authoritative sources for requirements that affect digital spatial data service in the federal government are three documents: the Executive Order on coordinating geographic data acquisition and access that established the NSDI, OMB Circular A-130 on the management of federal information resources, and the Federal Information Processing Standard for the Government Information Locator Service (GILS). Each is discussed below.

The Executive Order is non-specific about the technical capabilities of the Clearinghouse, calling it a "...distributed network of geospatial data producers, managers, and users linked electronically." Other than requiring compatibility with the National Information Infrastructure and adherence to standards developed by the Federal Geographic Data Committee, the Executive Order is vague about implementation details of the Clearinghouse so as not to limit the scope of its implementation over the succeeding years.

OMB Circular A-130 was published to encourage government agencies to make more government-held information accessible to the public and to other agencies. It specifies a requirement to make data available and encourages use of the Internet as a dissemination mechanism. It encourages agencies to recover the cost of dissemination (the incremental cost of information delivery) but reminds federal agencies that they cannot collect fees to recover the cost of the initial data collection, which was already paid for through appropriations.

More recently, the Government Information Locator Service (GILS) was approved as a Federal Information Processing Standard (FIPS) and is the official method for making information collections known to the Internet user community. GILS specifies the use of the Z39.50 service protocol and the GILS attribute set, which includes approximately 20 attributes that can be specifically addressed in a search. Although all fields in GILS are considered optional, GILS does include spatial elements for a bounding rectangle and polygons, adopted from the FGDC metadata standard. Designed primarily to identify information systems in government (one GILS record per database description), GILS could actually be pushed a level further down to describe spatial data, with one GILS record per spatial data set.

These guidelines point to the need for an Internet solution for search and retrieval of digital spatial data but do not specify how this can be done for extensive collections of spatial metadata and data. It is the intent of this document to provide a starting point for the implementation of a Clearinghouse information service.

From USGS experience in developing Internet applications for spatial data over the past three years, it is suggested that a Clearinghouse service should:

  1. support both "search" and "browse" on the Internet,
  2. use compatible metadata elements and operators,
  3. provide nominal or extended client capabilities to the largest possible community of users,
  4. minimize investment in software development and updates,
  5. provide for distributed search from a client process to many servers,
  6. present results from a search in predictable forms, yet
  7. allow for the expansion of supported data types and levels of abstraction (collection down to spatial feature).

The balance of this paper will attempt to address each of these requirements in terms of recommended solutions.

Requirements

1. Support both "search" and "browse" on the Internet

The World Wide Web presents a perfect example of the browse capabilities afforded by hypertext links between documents. A user typically traverses links in a directed fashion, as organized by the Web page author, or in a random fashion, as pages may include links to related documents not under the direct control of the author. The proliferation of specialized topical home pages with "interesting" links found by the author exemplifies this latter approach, which has value in connecting a visitor to similar information. This is analogous to browsing the books on a library shelf that sit physically near the book of interest, which can add value where it is exploited.

A Clearinghouse Node should provide access to digital spatial information collections through an active HyperText Transport Protocol (http) server. Information to be served through this Web server should include spatial metadata in FGDC verbose format as HTML documents with embedded links to browse graphics and online spatial data, where available. Additional information useful to the evaluation of the spatial data, including processing algorithms or software, collection-level metadata, or links to similar sites, is encouraged in order to use the hypertext medium to its fullest.
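
As a rough illustration of the kind of page such a Web server might return, the Python sketch below renders a few verbose metadata elements as HTML with embedded links to a browse graphic and a downloadable data file. The element values and file names are invented for the example.

    # Sketch only: render a metadata entry as an HTML page with links to a
    # browse graphic and the on-line data.  Values and file names are invented.

    ENTRY = {
        "Title": "Statewide hydrography",
        "West_Bounding_Coordinate": "-109.05",
        "East_Bounding_Coordinate": "-102.04",
    }

    def metadata_page(entry, browse_href, data_href):
        lines = ["<html><body><h1>%s</h1><pre>" % entry["Title"]]
        for tag, value in entry.items():
            lines.append("%s: %s" % (tag, value))      # verbose FGDC form
        lines.append("</pre>")
        lines.append('<p><a href="%s">Browse graphic</a> | '
                     '<a href="%s">Download data</a></p>' % (browse_href, data_href))
        lines.append("</body></html>")
        return "\n".join(lines)

    print(metadata_page(ENTRY, "hydro_browse.gif", "hydro.tar.gz"))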

The Z39.50 protocol is the preferred ANSI/ISO standard for network information search and retrieval. It was designed by the library community, initially to provide search and retrieval capabilities for bibliographic catalog entries on the Internet, and has since been used in public-domain (freeWAIS) and commercial (WAIS, Inc.) text/document indexing and service software. Other information service companies (Ameritech, TRW, OCLC, Chemical Abstracts) use more recent versions of Z39.50 that support profiles of attributes for specific user communities as "well-known" fields that can be queried. This provides low-level support for search interoperability among servers operating within the same community but using different server software. Although one can use Z39.50 and data indexing software to locally index a collection of pages at a Web site, the requirement for search capability at a site must be that a remote client, using Z39.50, can search one or more servers directly and obtain links to metadata and data stored there.

The definition of a profile, such as GILS, within the Z39.50 implementor community facilitates a level of interoperability between clients and servers, standardizing both a specific attribute set (fields) and the operators used to work with each attribute (e.g. greater-than, falls-within). A Z39.50 client and server connect and pass queries and results to one another using the search and retrieve protocol. In contrast with familiar versions of WAIS and freeWAIS, where the indexer and the server were the same software, some recent versions of Z39.50 software decouple the search/database capability from the server and allow use of one or more search solutions -- a relational database, a text-search engine, or a spatial search (GIS). This can provide for more creative and current solutions than the existing freeWAIS-sf model by allowing a user to query the spatial database directly rather than querying static collections of metadata reports.

The data elements and structures of the FGDC metadata content standard have been placed into an attribute set, called GEO for geospatial metadata, to be implemented under the Z39.50 service protocol. This will permit the query of specifically identified fields of information, define the operators that will work on them, and define the types of results and formats that will be returned to the user as the result of a presentation request.
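
As an illustration of how such a profile query might look inside a client, the short Python sketch below represents a search as attribute/operator/term triples combined with a boolean operator, in the spirit of a Z39.50 Type-1 query. The numeric Use attributes and field choices are placeholders for this sketch, not the values actually registered for the GEO attribute set.

    # Illustrative sketch: a profile query held as attribute/operator/term
    # triples before being encoded for the wire.  The numeric Use attributes
    # below are placeholders, not registered GEO values.

    from dataclasses import dataclass

    @dataclass
    class AttributeTerm:
        use_attribute: int      # which profile-defined field to search
        relation: str           # operator, e.g. "=", ">=", "within"
        term: str               # the search value

    # "theme = 'hydrography' AND east bounding coordinate >= -100"
    query = ("and",
             AttributeTerm(use_attribute=2002, relation="=", term="hydrography"),
             AttributeTerm(use_attribute=2060, relation=">=", term="-100"))

    def to_rpn(node):
        """Flatten the nested query into reverse Polish notation, the order in
        which Type-1 queries combine operands with boolean operators."""
        if isinstance(node, AttributeTerm):
            return [node]
        op, left, right = node
        return to_rpn(left) + to_rpn(right) + [op]

    for item in to_rpn(query):
        print(item)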

The Center for Networked Information Discovery and Retrieval (CNIDR) at Research Triangle Park, NC has developed freely available Z39.50 server software, known as I-Site, that provides Version 2 and 3 service. I-Site permits use of a CNIDR-developed indexing and search engine called I-Search to index the contents of documents. It also provides an application programming interface (API) for users to connect the server with other search software directly for a more custom solution. The Postgres object-relational database has been interfaced to the I-Site server through this API, which can provide database and rudimentary GIS services, including polygon overlap. CNIDR is implementing the GEO profile in its I-Site and I-Search software for the FGDC and the North Carolina Center for Geographic Information Analysis. This will provide baseline functionality against metadata documents (as text) and demonstrate the interface with the Postgres database for more sophisticated users.

Enhancement of the I-Site software through a contract with CNIDR to include support for FGDC metadata (the GEO profile) will initially allow metadata records that are provided in SGML to be indexed, served, and presented rapidly via Z39.50. SGML format files can be generated as an output of the USGS metadata parser (mp) software available on the Net. The parser also validates the content of a metadata record against the production rules of the FGDC metadata content standard. A second deliverable from CNIDR will be the creation of the metadata tables, entities, and forms using the Postgres (freely-available) database software from U.C. Berkeley along with a direct interface to the I-Site server.

The FGDC is also working with several projects that are implementing the metadata content standard elements in relational and object-oriented databases. The storage of metadata and spatial data in active databases shows long-term promise in simplifying the generation and management of spatial metadata. Connections between a Z39.50 client and a combined Z39.50/SQL server have been demonstrated by CNIDR in a NASA project and by the USGS EROS Data Center. The FGDC will pursue the support of FGDC-defined search attributes in these Z39.50/SQL servers and share reference implementations with the Clearinghouse community as they become available.

Multiple Access Path Strategy (MAPS)

It is suggested that the FGDC encourage search interoperability through the use of Z39.50 Version 3 software and the GEO profile, a robust set of the FGDC metadata elements that can be presented or accessed by the client. At the same time, the FGDC should encourage the development of Web services by organizations to impart organizational information and context to on-line geographic resources. Where possible, "hits" or documents that are found via Z39.50 should provide linkages to related or similar data sets available at the site in order to promote search first, followed by some browse capability. Support of only a Web server at a site would not constitute a full NSDI Clearinghouse Node.

A matrix of existing Internet spatial information services is provided below to assist in the evaluation of sites and to provide a growth path for sites seeking to become fully searchable NSDI Clearinghouse Nodes.

Category Zero Interface Support does not qualify as a full Clearinghouse Node, as it does not provide adequate interoperability.

The USGS NSDI Node is a level C2 server. The USGS EROS Data Center is testing a level C3 server. Custom services such as feature extraction, clipping, data overlay, and user-defined conversions done on the fly constitute features of a Category D content service. Category 2 and 3 servers are desirable because they permit spatial search using coordinates, whereas Category 0 and 1 servers do not. The Clearinghouse activity will strive for C2, C3, D2, and D3 types of services but expects some A's and B's due to institutional constraints.
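
The spatial search that distinguishes Category 2 and 3 servers reduces, at its simplest, to a bounding-rectangle overlap test like the Python sketch below. The field names follow the FGDC bounding-coordinate elements; the records themselves are invented.

    # Minimal sketch of the spatial predicate behind coordinate-based search:
    # does a data set's bounding rectangle overlap the user's area of interest?

    def boxes_overlap(a, b):
        """True if two west/east/south/north rectangles intersect.
        (Rectangles crossing the 180-degree meridian are not handled here.)"""
        return (a["westbc"] <= b["eastbc"] and b["westbc"] <= a["eastbc"] and
                a["southbc"] <= b["northbc"] and b["southbc"] <= a["northbc"])

    area_of_interest = {"westbc": -110.0, "eastbc": -100.0,
                        "southbc": 35.0, "northbc": 45.0}

    records = [
        {"title": "Statewide hydrography", "westbc": -109.1, "eastbc": -102.0,
         "southbc": 37.0, "northbc": 41.0},
        {"title": "Coastal imagery", "westbc": -80.0, "eastbc": -75.0,
         "southbc": 32.0, "northbc": 36.0},
    ]

    hits = [r["title"] for r in records if boxes_overlap(r, area_of_interest)]
    print(hits)    # only the first record falls within the area of interest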

2. Use compatible metadata elements and operators

Profiles of Z39.50 specify three characteristics that are known to and used by server and client software: the set of attributes (fields) that can be addressed in a query, the operators that can be applied to each attribute, and the forms in which results are returned to the client.

These characteristics allow a generic client that supports the profile to pose a query and receive results. Each attribute that Z39.50 describes is then "mapped" by the server to its actual location and identity, if different, to perform the search. This means that information indexed in a text file may actually be associated with the string "East_Bounding_Coordinate:" or might equate to a field called "eastbc" in a relational database, shielding the end user from the translation that is being performed to satisfy the query.

When we state that we will use compatible metadata elements and operators, this can be described on two levels. First, it is important for the purposes of a distributed search that the attribute tags that are used are the same on all systems -- at least as far as Z39.50 is concerned. Second, it is desirable that the information that is returned to the user is presented in a form with attribute naming that is familiar to the user.

To accommodate a systematic search, the draft GEO profile of Z39.50 provides a set of 8-character tags, numeric Z39.50 tags, and their definitions as taken from the FGDC metadata content standard. The 8-character tags should be used by implementors interested in sharing database schemas, to assist in visual review of the structure and elements. The numeric tags of each element must be known to the Z39.50 server for a query to be processed. A table or file is usually provided with a configurable server to draw an equivalence between a numeric tag and its location in the data collection entries.
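
The Python sketch below suggests what such an equivalence table might look like inside a configurable server; the numeric Use attributes and local column names are placeholders for illustration rather than the registered GEO values.

    # Sketch of a server-side equivalence table: numeric Use attribute ->
    # (8-character tag, where the value actually lives locally).  Numbers and
    # column names are placeholders, not registered GEO values.

    ATTRIBUTE_MAP = {
        2060: ("eastbc",  "metadata_table.east_bounding"),
        2061: ("westbc",  "metadata_table.west_bounding"),
        2062: ("northbc", "metadata_table.north_bounding"),
        2063: ("southbc", "metadata_table.south_bounding"),
    }

    def resolve(use_attribute):
        """Translate an incoming numeric Use attribute into the local field to
        be searched; the remote client never sees this mapping."""
        if use_attribute not in ATTRIBUTE_MAP:
            raise ValueError("attribute %d not supported by this server" % use_attribute)
        short_tag, local_field = ATTRIBUTE_MAP[use_attribute]
        return short_tag, local_field

    print(resolve(2060))    # -> ('eastbc', 'metadata_table.east_bounding')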

3. Present results from a search in predictable forms

To promote consistency in presentation (and in the absence of any other specific guidance on the official presentation form of FGDC metadata), it has been generally accepted to present metadata elements as verbose, mixed-case, multi-word tags followed by a colon (e.g. East_Bounding_Coordinate:) and then by the attribute value. Underscores or spaces are required between words in multi-word tags. The USGS metadata parser also expects that some form of indentation (space or tab) be used to denote subsections within the standard elements. This may not be required for presentation, but it does provide increased legibility for some readers.
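
A small Python sketch of this presentation convention follows; the element names are FGDC tags and the coordinate values are invented.

    # Sketch of the verbose presentation form: mixed-case tags ending in a
    # colon, indented to show the nesting of compound elements.

    def present(element, value=None, depth=0):
        """Print one metadata element, indented two spaces per nesting level."""
        indent = "  " * depth
        if value is None:
            print("%s%s:" % (indent, element))
        else:
            print("%s%s: %s" % (indent, element, value))

    present("Spatial_Domain")
    present("Bounding_Coordinates", depth=1)
    present("West_Bounding_Coordinate", "-109.05", depth=2)
    present("East_Bounding_Coordinate", "-102.04", depth=2)
    present("North_Bounding_Coordinate", "41.00", depth=2)
    present("South_Bounding_Coordinate", "37.00", depth=2)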

A companion standard to SGML is the Document Style Semantics and Specification Language (DSSSL), which allows the creation of style sheets to selectively and systematically present the contents of an SGML marked-up file. Multiple style sheets may be invoked against a single SGML file, allowing for several "standard" presentation views of the document. For digital spatial metadata, this can support a set of common views of metadata from multiple agencies, whether displayed on screen or printed to paper. This conversion of SGML using a style sheet may be done on a server, with the result returned to the client in HTML, or it can be done by the client using a helper application that can interpret and format SGML.
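
DSSSL itself is a full style language, but the idea of applying interchangeable style sheets to a single SGML-tagged source can be suggested with the toy Python sketch below, in which two "views" are applied to the same fragment. The tags follow the 8-character convention; the style rules and values are invented.

    # Toy sketch of the style-sheet idea: one SGML fragment, several
    # presentation views.  Not DSSSL, only an illustration of the concept.

    import re

    SGML_FRAGMENT = "<title>Statewide hydrography</title><eastbc>-102.04</eastbc>"

    HTML_VIEW = {"title": ("<h2>", "</h2>"), "eastbc": ("<p>East bound: ", "</p>")}
    TEXT_VIEW = {"title": ("Title: ", "\n"), "eastbc": ("East_Bounding_Coordinate: ", "\n")}

    def apply_style(sgml, style):
        """Replace each start/end tag pair with the strings the chosen style
        sheet assigns to that element."""
        def repl(match):
            tag, content = match.group(1), match.group(2)
            before, after = style.get(tag, ("", ""))
            return before + content + after
        return re.sub(r"<(\w+)>(.*?)</\1>", repl, sgml)

    print(apply_style(SGML_FRAGMENT, HTML_VIEW))
    print(apply_style(SGML_FRAGMENT, TEXT_VIEW))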

One area not previously addressed by the standard or by the Clearinghouse is the proper exchange format for FGDC metadata. It is desirable that both complete metadata entries and collections be transportable without loss of context or information. For this purpose it is suggested that metadata providers consider making metadata entries from the Clearinghouse available in Standard Generalized Markup Language (SGML), an international standard for data markup. SGML can be produced for a metadata entry through use of the USGS metadata parser (mp) software. The tags used for the representation of a metadata entry in SGML shall use the 8-character tags described in the GEO profile draft. Each beginning tag must have a corresponding ending tag to close it, ensuring proper demarcation of each piece of information. All elements, including compound elements such as Identification_Information, must be included in the SGML for full exchange of information.
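
The Python sketch below suggests how a nested metadata structure could be emitted in this paired-tag SGML form; in practice the mp parser produces such output, and the element names and values here are only an invented fragment.

    # Sketch of emitting paired 8-character SGML tags from a nested structure,
    # including compound elements.  The fragment below is invented.

    def to_sgml(element, content, depth=0):
        """Emit <tag>...</tag> pairs, recursing through compound elements."""
        indent = "  " * depth
        if isinstance(content, dict):
            inner = "\n".join(to_sgml(k, v, depth + 1) for k, v in content.items())
            return "%s<%s>\n%s\n%s</%s>" % (indent, element, inner, indent, element)
        return "%s<%s>%s</%s>" % (indent, element, content, element)

    entry = {
        "idinfo": {                         # Identification_Information (compound)
            "title": "Statewide hydrography",
            "bounding": {
                "westbc": "-109.05", "eastbc": "-102.04",
                "northbc": "41.00", "southbc": "37.00",
            },
        },
    }

    for tag, body in entry.items():
        print(to_sgml(tag, body))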

4. Provide nominal or extended client capabilities to the largest possible community of users

One critique of the use of specialized graphical user interfaces to connect to WAIS databases was that each was yet another interface to learn. Since the inception of WAIS, Gopher, and telnet, the Web client has become a nearly universal client application for a number of protocols, shielding the user from the details and presenting information in a consistent, familiar form. Versions of the NCSA Mosaic product for the UNIX operating system included direct support for the now obsolete WAIS (Z39.50-1988) protocol. To provide widespread access to spatial Z39.50 data servers, the USGS and FGDC are actively pursuing the gateway and client approaches described below.

A Web-to-Z39.50 gateway allows a client to download a form from a Web server and submit a query, in the form of variables, back to the server, where it is translated into a Z39.50 query.

The client sends and receives all information as HTML using the HyperText Transport Protocol. The server gateway script makes one or more connections to Z39.50 servers, waits for responses, and then prepares a composite result, in the form of HTML, to present back to the user. The advantage of this arrangement is that a large number of Web clients (with forms support) gain access to a distributed query capability. The disadvantages of the gateway approach lie in the performance of the brokering script and the fact that a gateway may be handling many simultaneous requests. The establishment of single gateways to collections also presents a single point of failure: if the gateway server is down, access to all the services behind it is lost even though they remain active. Although this provides a convenient method for serving a large number of initial clients, it is not scalable and should be viewed over the longer term as a back-up solution.
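
A skeletal Python sketch of the gateway's brokering role follows: decode the form variables, fan the query out to each target server, and fold the responses into one HTML page. The z3950_search() function is a placeholder stub rather than a real Z39.50 client call, and the server names and form fields are invented.

    # Sketch of the gateway's brokering logic.  z3950_search() is a stub
    # standing in for the real Z39.50 exchange; hosts and fields are invented.

    from urllib.parse import parse_qs

    SERVERS = ["z3950.agency-one.example.gov", "z3950.agency-two.example.gov"]

    def z3950_search(host, term, bbox):
        """Placeholder for the actual Z39.50 search-and-present exchange."""
        return [{"host": host, "title": "Sample hit for '%s'" % term}]

    def gateway(query_string):
        form = parse_qs(query_string)
        term = form.get("term", [""])[0]
        bbox = form.get("bbox", [""])[0]
        hits = []
        for host in SERVERS:                 # one connection per target server
            hits.extend(z3950_search(host, term, bbox))
        rows = "\n".join("<li>%s (%s)</li>" % (h["title"], h["host"]) for h in hits)
        return "<html><body><ul>\n%s\n</ul></body></html>" % rows

    print(gateway("term=hydrography&bbox=-110,-100,35,45"))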

5. Minimize investment in software development and updates

Two major client extensions are being designed by the FGDC for deployment by March 1996: the ability to embed a native Z39.50 client stub in a Web browser, and a map query tool that lets users pick geographic areas of interest from an interactive map rather than entering coordinates. After researching developments in Web technology, it appears that the Java programming language can be used to develop these applications for multiple platforms from the same software. A small portable application, called an applet, can be written and called from an HTML page to extend Web browsers with the required functionality. This will allow a Netscape or Oracle Web browser client to gain native Z39.50 client capabilities or a map query tool without the user having to download and configure software for many different platform types. Netscape's market dominance is also expected to push competing browsers toward Java applet support soon. Until Java interpreters are available on all platforms (Windows 3.1, 95, NT, and Solaris 2.3+ are supported now) and these applets are written, support for a gateway approach to spatial servers will be pursued, with a primary gateway at the FGDC facility and mirror gateways at other voluntary sites.

6. Provide for distributed search from a client process to many servers

The current gateway software used to connect existing servers running the freeWAIS-sf software (fwais.pl) already supports the query of multiple servers, providing a single set of results. A demonstration of multi-server query can be found at http://nsdi.usgs.gov/public/fgdcquery.html. This capability will be incorporated into the Z39.50 Version 2/3 client applet to permit more local processing of the information and to reduce the load on the gateway. The results from a search will be translated by the Z39.50 client applet back into HTML for easy and familiar display by the browser.
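
The Python sketch below outlines the distributed-search pattern itself: the same query issued to several servers in parallel and the answers merged into a single, source-tagged result list. The search_one_server() function stands in for the actual Z39.50 exchange a client applet or gateway would perform, and the host names are invented.

    # Sketch of distributed search: query several servers concurrently and
    # merge the answers.  search_one_server() is a placeholder stub.

    from concurrent.futures import ThreadPoolExecutor

    HOSTS = ["serverA.example.gov", "serverB.example.gov"]

    def search_one_server(host, term):
        """Placeholder for a per-server Z39.50 search; returns a list of hits."""
        return [{"source": host, "title": "%s data set" % term}]

    def distributed_search(term):
        with ThreadPoolExecutor(max_workers=len(HOSTS)) as pool:
            futures = [pool.submit(search_one_server, h, term) for h in HOSTS]
            merged = []
            for f in futures:
                merged.extend(f.result())   # a failed server could be skipped here
        return merged

    for hit in distributed_search("hydrography"):
        print(hit["source"], "->", hit["title"])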

In addition, the FGDC will host a "spatial directory of servers" with basic information about Z39.50 and Web services that hold spatial metadata and data. When it is ready, site developers can register their spatial data servers with the FGDC by providing some basic information about the server in a standard form, enabling discovery of the resource by users who are not sure where to search. This list of servers will be kept current and will be used in on-line forms to assist users in posing and refining queries on the network.

7. Allow for the expansion of supported data types and levels of abstraction (collection down to spatial feature).

The focus of the FGDC metadata content standard has been on the collection of semi-static metadata at the data set level. Where individual data products, such as those developed for a scientific study, are unique, the preparation and management of metadata is not a significant problem. Agencies attempting to document series of spatial data sets, or even aerial photography, find that a large amount of metadata is redundant or re-usable between entries. For example, flight-line metadata are shared by all of the individual photographs that compose the flight-line; a significant amount of metadata could be inherited from the flight-line entry if that information were collected so it could be re-used. The ability to collect, manage, and query spatial metadata at levels of abstraction such as a "collection" or a "project," the traditional data set, and even features within a data set is a desirable extension of the concepts behind the metadata standard, and it calls for innovative methods of metadata management and data service. Research into the implementation of spatial data collections using innovative techniques that still support the familiar metadata elements will be supported by the FGDC over the next year.
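
The flight-line example can be made concrete with the small Python sketch below, in which a collection-level record supplies the shared elements and each photograph record adds only what differs; the element names and values are invented.

    # Sketch of collection-level metadata inheritance: the flight-line record
    # carries the shared elements, each photograph supplies only what differs.

    flight_line = {                      # collection-level, entered once
        "originator": "Example Mapping Agency",
        "flight_date": "1995-06-14",
        "scale": "1:40000",
    }

    photographs = [                      # per-item metadata, much smaller
        {"frame": "12-104", "westbc": "-105.2", "eastbc": "-105.0"},
        {"frame": "12-105", "westbc": "-105.0", "eastbc": "-104.8"},
    ]

    def full_record(collection, item):
        """Merge collection-level metadata with item-level values."""
        record = dict(collection)
        record.update(item)              # item values win if both define a field
        return record

    for photo in photographs:
        print(full_record(flight_line, photo))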

Research into digital libraries and distributed data archives supported through National Science Foundation and NASA grants will also provide new metaphors and methods and better access to spatial data. The development of a Common Object Request Broker Architecture (CORBA) for digital imagery data under the NASA EOSDIS project shows promise in delivering spatial data to applications directly, and yet does not solve the problem of "who has what" information. A coupling of search and service capabilities between Z39.50 and CORBA technologies is a topic of research by several of these projects.

Conclusions

This paper presents an overview of the elements that the FGDC seeks to promote as robust, interoperable facets of a Clearinghouse activity. Support of browse access to metadata and spatial data using the World Wide Web is encouraged, although the ability to search collections directly using Z39.50 is a more critical measure of interoperability in a growing on-line spatial data community. Browse and search capabilities complement one another -- neither is sufficient by itself as a Clearinghouse service.

The FGDC Clearinghouse Working Group will continue to develop software for public use, and to encourage commercial solutions, so that the largest number of spatial data producers can share their data and the largest number of potential clients can access it.


Douglas D. Nebert, Clearinghouse Coordinator
Federal Geographic Data Committee
U.S. Geological Survey, Mail Stop 590
Reston, VA 22092
Telephone: (703) 648-4151
Fax: (703) 648-5755
E-Mail: ddnebert@usgs.gov