There is an interesting pattern of evolution revealing itself on the Internet with respect to information service protocols. A protocol is created which provides a unique new level of access and organization to the Net, usually for a restricted community. Once it becomes popular, the ability to find out "who" is serving "what" using that protocol becomes difficult, and an index of all services of that protocol on the Net is subsequently created. This happened years ago with anonymous ftp sites, which were immediately useful to scientists wanting to share files on the Internet. When too many of these sites became available, the Archie software was developed by McGill University to traverse the Internet at night and index the file names held in anonymous ftp sites. The Gopher protocol is similarly indexed by Veronica to map out all "gopherspace."
With the popularity and complexity of the Web, a number of Web index services have appeared on the Internet -- Lycos, Yahoo, Web Crawler -- all with the intent of indexing text that occurs within on-line HTML documents. Such a service (already very popular and overwhelmed) can provide users with the ability to find documents on servers by pointing at that lowest level. This presents three problems in information architecture:
The OMB Circular A-130 was published to encourage government agencies to make more government-held information accessible to the public and to other agencies. It specifies a requirement to make data available and encourages use of the Internet as a dissemination mechanism. It encourages agencies to recover the cost of dissemination (incremental cost of information delivery) but reminds federal agencies that they cannot collect fees for recovering the cost of the initial data collection, which was already paid for through appropriations.
More recently, the Government Information Locator Service, GILS, was approved as a Federal Information Processing Standard (FIPS) as the official method for making information collections known to the Internet user community. GILS specifies the use of the Z39.50 service protocol and the GILS attribute set which includes approximately 20 attributes that can be specifically addressed in a search. Although all fields in GILS are considered optional, GILS does include the spatial elements for a bounding rectangle and polygons, adopted from the FGDC metadata standard. Designed primarily to identify information systems in government (one GILS record per database description), GILS could actually be pushed a level further down to describe spatial data, one GILS record per spatial data set.
These guidelines point to the need for an Internet solution for search and retrieval of digital spatial data but do not specify how this can be done for extensive collections of spatial metadata and data. It is the intent of this document to provide a starting point for the implementation of a Clearinghouse information service.
From USGS experience in developing Internet applications for spatial data over the past three years, it is suggested that a Clearinghouse service should:
1. Support both "search" and "browse" on the Internet
The World Wide Web presents a perfect example of the browse capabilities afforded by hypertext links between documents. A user typically traverses links in a directed fashion, as organized by the Web page author, or in a random fashion, as pages may include links to related documents not under direct control of the author. The proliferation of specialized topical home pages with "interesting" links found by the author exemplifies this latter approach, which has value in connecting a visitor to similar information. This is analogous to looking at books on a shelf in the library that are physically near the book of interest, which can add value to the information where exploited.
A Clearinghouse Node should provide access to digital spatial information collections through an active Hypertext Transfer Protocol (HTTP) server. Information to be served through this Web server should include spatial metadata in FGDC verbose format as HTML documents with embedded links to browse graphics and online spatial data, where available. Additional information useful to the evaluation of the spatial data, including processing algorithms or software, collection-level metadata, or links to similar sites, is encouraged in order to use the hypertext media to its fullest.
The Z39.50 protocol is the preferred ANSI/ISO standard for network information search and retrieval. It was designed by the library community initially to provide search and retrieval capabilities for bibliographic catalog entries on the Internet and has since been used in public-domain (freeWAIS) and commercial (WAIS, Inc.) text/document indexing and service software. Other information service companies (Ameritech, TRW, OCLC, Chemical Abstracts) use more recent versions of Z39.50 that support profiles of attributes for specific user communities as "well-known" fields that can be queried. This provides low-level support for search interoperability among servers operating within the same community but using different server software. Although one can use Z39.50 and data indexing software to locally index a collection of pages at a Web site, the requirement for search capability of a site must be that a remote client, using Z39.50, can search one or more servers directly and obtain links to metadata and data stored there.
The definition of a profile, such as GILS, within the Z39.50 implementor community facilitates a level of interoperability between clients and servers, standardizing both a specific attribute set (fields) and the operators used to work with each attribute (e.g. greater-than, falls-within). A Z39.50 client and server connect and pass queries and results to one another using the search and retrieve protocol. In contrast with familiar versions of WAIS and freeWAIS, where the indexer and the server were the same software, some recent versions of Z39.50 software decouple the search/database capability from the server and allow use of one or more search solutions -- relational database, text-search engine, spatial search (GIS). This can provide for more creative and current solutions than the existing freeWAIS-sf model by allowing a user to query the spatial data base directly and not query static collections of metadata reports.
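To make the idea of a shared attribute set plus shared operators concrete, here is a minimal sketch in Python. It is not Z39.50 and not the GILS or GEO profile: the attribute names (`pubyear`, `scale`), the operator table, and the `Term` structure are all hypothetical, chosen only to mirror the "greater-than" and "falls-within" examples in the text.

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict, List

# Hypothetical operator table. The names follow the examples in the text;
# they are not the relation codes of any real Z39.50 attribute set.
OPERATORS: Dict[str, Callable[[Any, Any], bool]] = {
    "equal": lambda field, value: field == value,
    "greater-than": lambda field, value: field > value,
    # For falls-within, value is an inclusive (low, high) pair.
    "falls-within": lambda field, value: value[0] <= field <= value[1],
}


@dataclass(frozen=True)
class Term:
    """One query term: an agreed attribute, an agreed operator, a value."""
    attribute: str
    operator: str
    value: Any


def matches(record: Dict[str, Any], term: Term) -> bool:
    # A record that lacks the attribute cannot satisfy the term.
    if term.attribute not in record:
        return False
    return OPERATORS[term.operator](record[term.attribute], term.value)


def search(records: List[Dict[str, Any]], terms: List[Term]) -> List[Dict[str, Any]]:
    # All terms must hold (an implicit AND).
    return [r for r in records if all(matches(r, t) for t in terms)]


# Invented sample records, for illustration only.
RECORDS = [
    {"title": "Roads", "pubyear": 1993, "scale": 24000},
    {"title": "Hydrography", "pubyear": 1995, "scale": 100000},
    {"title": "Elevation", "pubyear": 1990},
]

hits = search(RECORDS, [
    Term("pubyear", "greater-than", 1992),
    Term("scale", "falls-within", (20000, 50000)),
])
print([r["title"] for r in hits])  # ['Roads']
```

Because client and server agree in advance on both the attribute names and the operator semantics, any server holding these fields could answer the same query regardless of the search engine behind it.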
The data elements and structures of the FGDC metadata content standard have been placed into an attribute set, called GEO for geospatial metadata, to be implemented under the Z39.50 service protocol. This will permit the query of specifically identified fields of information, define the operators that will work on them, and define the types of results and formats that will be returned to the user as the result of a presentation request.
The Center for Networked Information Discovery and Retrieval (CNIDR) at Research Triangle Park, NC has developed freely-available Z39.50 server software, known as I-Site, that provides Version 2 and 3 service. I-Site permits use of a CNIDR-developed indexing and search engine called I-Search to index the contents of documents. It also provides an application programming interface (API) for users to connect the server with other search software directly for a more custom solution. The Postgres object-relational database has been interfaced to the I-Site server through this API and can provide database and rudimentary GIS services, including polygon overlap. CNIDR is implementing the GEO profile in its I-Site and I-Search software for the FGDC and the North Carolina Center for Geographic Information Analysis. This will provide baseline functionality against metadata documents (as text) and demonstrate the interface with the Postgres database for more sophisticated users.
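As a rough illustration of the simplest form of spatial search, the sketch below tests whether two bounding rectangles overlap. This is a simplification of the polygon overlap mentioned above, not the Postgres or I-Site implementation; the `BBox` type, the catalog entries, and the coordinates are illustrative assumptions, and rectangles crossing the 180-degree meridian are not handled.

```python
from typing import Dict, List, NamedTuple


class BBox(NamedTuple):
    """Bounding rectangle in decimal degrees (hypothetical structure)."""
    west: float
    east: float
    south: float
    north: float


def overlaps(a: BBox, b: BBox) -> bool:
    """True when the rectangles share any area or edge.

    Assumes west <= east for both boxes (no antimeridian crossing).
    """
    return (a.west <= b.east and b.west <= a.east
            and a.south <= b.north and b.south <= a.north)


# Illustrative catalog; the extents are approximate and invented for the example.
CATALOG: Dict[str, BBox] = {
    "Colorado roads": BBox(-109.05, -102.04, 36.99, 41.00),
    "Virginia hydrography": BBox(-83.68, -75.24, 36.54, 39.47),
}


def spatial_search(query: BBox) -> List[str]:
    # Return the names of data sets whose rectangle overlaps the query.
    return sorted(name for name, box in CATALOG.items() if overlaps(box, query))


denver_area = BBox(-105.2, -104.6, 39.5, 40.0)
print(spatial_search(denver_area))  # ['Colorado roads']
```

A rectangle test like this can be run against text-indexed metadata, whereas true polygon overlap generally needs a database or GIS back end.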
Enhancement of the I-Site software through a contract with CNIDR to include support for FGDC metadata (the GEO profile) will initially allow metadata records that are provided in SGML to be indexed, served, and presented rapidly via Z39.50. SGML format files can be generated as an output of the USGS metadata parser (mp) software available on the Net. The parser also validates the content of a metadata record against the production rules of the FGDC metadata content standard. A second deliverable from CNIDR will be the creation of the metadata tables, entities, and forms using the Postgres (freely-available) database software from U.C. Berkeley along with a direct interface to the I-Site server.
The FGDC is also working with several projects that are implementing the metadata content standard elements in relational and object-oriented databases. The storage of metadata and spatial data in active databases shows long-term promise in simplifying the generation and management of spatial metadata. Connections between Z39.50 client and a Z39.50 and SQL-based server have been demonstrated by CNIDR in a NASA project, and by the USGS EROS Data Center. The FGDC will pursue the support of FGDC-defined search attributes in these Z39.50/SQL servers and share reference implementations with the Clearinghouse community as they become available.
A matrix of existing Internet spatial information services is provided below to assist in the evaluation of sites and provide a growth path for sites seeking to become fully searchable NSDI Clearinghouse Nodes.
Category Zero Interface Support does not qualify as a full Clearinghouse Node as it does not provide adequate interoperability.
The USGS NSDI Node is a level C2 server. USGS EROS Data Center is testing a level C3 server. Custom services such as feature extraction, clipping, data overlay, and user-defined conversions done on-the-fly constitute features of a category D content service. Category 2 and 3 servers are desirable because they permit spatial search using coordinates, whereas Category 0 and 1 do not. The Clearinghouse activity will strive for C2, C3, D2, D3 types of services but will expect some A's and B's due to institutional constraints.
When we state that we will use compatible metadata elements and operators, this can be described on two levels. First, it is important for the purposes of a distributed search that the attribute tags that are used are the same on all systems -- at least as far as Z39.50 is concerned. Second, it is desirable that the information that is returned to the user is presented in a form with attribute naming that is familiar to the user.
To accommodate a systematic search, the draft GEO profile of Z39.50 provides a set of 8-character tags, numeric Z39.50 tags, and their definitions as taken from the FGDC metadata content standard. The 8-character tags should be used by implementors interested in sharing data base schemas to assist in visual review of the structure and elements. The numeric tags of each element must be known to the Z39.50 server for a query to be processed. A table or file is usually provided with a configurable server to draw an equivalence between a numeric tag and its location in the data collection entries.
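The equivalence table described above might look something like the following sketch. The file layout, the numeric tags (9001 and up), the short tags, and the field paths are all hypothetical placeholders, not values from the draft GEO profile or from any particular server's configuration format.

```python
from typing import Any, Dict, List, Tuple

# Hypothetical configuration text: numeric tag, short tag, path into a record.
TAG_TABLE_TEXT = """
# numeric  short     path
9001       title     citation/title
9002       west      bounding/west
9003       east      bounding/east
"""


def load_tag_table(text: str) -> Dict[int, Tuple[str, List[str]]]:
    """Parse the table into {numeric_tag: (short_tag, path_parts)}."""
    table: Dict[int, Tuple[str, List[str]]] = {}
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()  # drop comments and blank lines
        if not line:
            continue
        number, short, path = line.split()
        table[int(number)] = (short, path.split("/"))
    return table


def lookup(record: Dict[str, Any], table: Dict[int, Tuple[str, List[str]]],
           numeric_tag: int) -> Any:
    """Resolve a numeric tag to the value stored in one record."""
    _short, path = table[numeric_tag]  # KeyError means the tag is not configured
    node: Any = record
    for key in path:
        node = node[key]
    return node


# Invented sample record.
record = {
    "citation": {"title": "Roads of Example County"},
    "bounding": {"west": -105.2, "east": -104.6},
}

table = load_tag_table(TAG_TABLE_TEXT)
print(lookup(record, table, 9001))  # Roads of Example County
```

Keeping the mapping in a table rather than in code lets two sites store the same element in different places while still answering a query on the same numeric tag.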
A companion standard to SGML is the Document Style Semantics and Specification Language (DSSSL), which allows the creation of style sheets to selectively and systematically present the contents of an SGML marked-up file. Multiple style sheets may be invoked against a single SGML file, allowing for several "standard" presentation views of the document. For digital spatial metadata, this can support a set of common views of metadata from multiple agencies when viewed in a viewer or printed to paper. This conversion of SGML using a style sheet may be done on a server, with the result returned to a client in HTML, or can be done by the client using a helper application that can interpret and format SGML.
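The idea of several views over one marked-up record can be sketched without DSSSL itself. The example below is an analogy only: it uses Python's standard XML parser on a small well-formed record as a stand-in for SGML, and represents each "style sheet" as a plain list of labels and element paths. The element names and both views are hypothetical.

```python
import html
import xml.etree.ElementTree as ET
from typing import List, Tuple

# Invented record; well-formed XML used here as a stand-in for SGML.
RECORD = """<metadata>
  <title>Roads of Example County</title>
  <abstract>Road centerlines digitized at 1:24,000 &amp; checked.</abstract>
  <bounding><west>-105.2</west><east>-104.6</east></bounding>
</metadata>"""

# Two hypothetical "style sheets": each selects and labels elements.
BRIEF_VIEW: List[Tuple[str, str]] = [("Title", "title")]
FULL_VIEW: List[Tuple[str, str]] = [
    ("Title", "title"),
    ("Abstract", "abstract"),
    ("West", "bounding/west"),
    ("East", "bounding/east"),
]


def render(record_text: str, view: List[Tuple[str, str]]) -> str:
    """Render the selected elements of one record as an HTML definition list."""
    root = ET.fromstring(record_text)
    rows = []
    for label, path in view:
        value = root.findtext(path)
        if value is not None:  # skip elements the record does not carry
            rows.append(f"<dt>{html.escape(label)}</dt><dd>{html.escape(value)}</dd>")
    return "<dl>" + "".join(rows) + "</dl>"


print(render(RECORD, BRIEF_VIEW))
print(render(RECORD, FULL_VIEW))
```

The same function could run on a server, returning HTML, or in a client-side helper; only the choice of view changes the presentation, never the stored record.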
One area not addressed by the standard or by the Clearinghouse previously is the proper exchange format for FGDC metadata. It is desirable that both complete metadata entries and collections be transportable without loss of context or information. For this purpose it is suggested that metadata providers consider making metadata entries from the Clearinghouse available as Standard Generalized Markup Language (SGML), an international standard for data markup. SGML can be produced for a metadata entry through use of the USGS metadata parser (mp) software. The tags used for the representation of a metadata entry in SGML (e.g.
The client sends and receives all information as HTML using the Hypertext Transfer Protocol. The server gateway script makes one or more connections to Z39.50 servers, waits for responses, and then prepares a composite result in the form of HTML to present back to the user. The advantage of this arrangement is that a large number of Web clients (with form support) gain access to a distributed query capability. The disadvantages of the gateway approach lie in the performance of the brokering script and the fact that a gateway may be handling many simultaneous requests. The establishment of single gateways to collections also presents a critical point of failure -- if the gateway server is down, access to all other services is lost even though they remain active. Although this provides a convenient method for a large number of initial clients, it is not scalable and should be viewed as a longer-term back-up solution.
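The gateway's fan-out and merge step can be sketched as follows. This is an assumption-laden outline, not the actual gateway script: the three `search_node_*` functions are stubs standing in for real Z39.50 client calls, the node names are invented, and timeouts are left out for brevity.

```python
import html
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, List, Tuple


# Stub back ends standing in for Z39.50 searches (hypothetical).
def search_node_a(query: str) -> List[str]:
    return ["Roads of Example County"]


def search_node_b(query: str) -> List[str]:
    raise ConnectionError("server not responding")


def search_node_c(query: str) -> List[str]:
    return ["Statewide road network"]


SERVERS: Dict[str, Callable[[str], List[str]]] = {
    "node-a": search_node_a,
    "node-b": search_node_b,
    "node-c": search_node_c,
}


def gateway(query: str, servers: Dict[str, Callable[[str], List[str]]]
            ) -> Tuple[List[Tuple[str, str]], List[Tuple[str, str]]]:
    """Query every server concurrently; collect hits and per-server failures."""
    hits: List[Tuple[str, str]] = []
    failures: List[Tuple[str, str]] = []
    with ThreadPoolExecutor(max_workers=max(1, len(servers))) as pool:
        futures = [(name, pool.submit(fn, query)) for name, fn in servers.items()]
        for name, future in futures:
            try:
                hits.extend((name, title) for title in future.result())
            except Exception as exc:  # one failed server must not sink the rest
                failures.append((name, str(exc)))
    return hits, failures


def to_html(query: str, hits: List[Tuple[str, str]],
            failures: List[Tuple[str, str]]) -> str:
    """Build the composite HTML page returned to the Web client."""
    items = "".join(f"<li>{html.escape(title)} ({html.escape(name)})</li>"
                    for name, title in hits)
    notes = "".join(f"<p>No response from {html.escape(name)}</p>"
                    for name, _reason in failures)
    return f"<h1>Results for {html.escape(query)}</h1><ul>{items}</ul>{notes}"


hits, failures = gateway("roads", SERVERS)
print(to_html("roads", hits, failures))
```

Even with concurrent requests, the page cannot be returned until the slowest server has answered or failed, and every client still depends on this one process, which is the scaling and single-point-of-failure concern raised above.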
In addition, the FGDC will host a "spatial directory of servers" with basic information about Z39.50 and Web services that hold spatial metadata and data. When ready, a site developer can register their spatial data server with the FGDC by providing some basic information about the server in a standard form to enable discovery of the resource by the user community not sure where to search. This list of servers would be kept current and would be used in on-line forms to assist users in asking and refining queries on the Network.
Research into digital libraries and distributed data archives supported through National Science Foundation and NASA grants will also provide new metaphors and methods and better access to spatial data. The development of a Common Object Request Broker Architecture (CORBA) for digital imagery data under the NASA EOSDIS project shows promise in delivering spatial data to applications directly, and yet does not solve the problem of "who has what" information. A coupling of search and service capabilities between Z39.50 and CORBA technologies is a topic of research by several of these projects.
The FGDC Clearinghouse Working Group will continue the development of software for public use, as well as encouraging commercial solutions, to enable the largest number of spatial data producers to share their data and the largest number of potential clients to access it.