A Digital "Living" Library - A Prototype for Harvesting Ecological Data over the Internet.


Sudha Ram, Michael R. Kunzmann, Jongseo Kim and Jeff Abbruzzi


Abstract

To improve the efficiency of environmental decision making, the rate of data collection must be increased. In addition, reducing the costs associated with managing information and making it available to the public through a user-friendly interface is an important goal for federal land-managing agencies. To facilitate these goals and the goals of the National Biological Information Infrastructure (NBII) program, we have developed a digital "living" library that promotes data distribution and contribution using a World Wide Web interface. By providing the capability to share data and resources, important ecological data can be updated and distributed more efficiently. In addition, landscape-level determinations derived from the digital library can be more readily applied by management in day-to-day decision making and integrated with long-term land management policy development. Our digital library system not only allows users to search for and retrieve spatial data sets and other information, but also grows dynamically through users' contributions of data. To facilitate data "harvesting", numerous tools and data protocols had to be developed to increase the overall utility of the library and to assist with the identification of user needs and other data management issues such as security. The automated collection of natural and cultural resource information and associated metadata decreases maintenance requirements and delivers on the promise of a digital "living" library.


Introduction to Saguaro Digital Library

The Saguaro Digital Library (SDL) is a comprehensive digital library system providing a full range of services: it facilitates our understanding of the impacts of natural and human environmental hazards, provides models of environmental change that can access and utilize data, processing tools, and algorithms across the Internet, and gives a wide range of users the ability to obtain quantitative measures of this change.

The primary focus of the SDL is to facilitate the responsible stewardship of our natural assets and good ecosystem management. The SDL directly addresses the goals of the National Biological Information Infrastructure (NBII) by developing the capability to share data and resources so that biodiversity and ecosystem findings can be more readily applied in management and policy. The ultimate goal of the SDL is to allow components of the digital library to evolve independently and yet be able to call on one another efficiently and conveniently. Thus, the digital library will support heterogeneous and federated collections of digital content, including data, metadata, models, tools, and algorithms. The digital library will specifically provide decision support tools to improve monitoring of ecosystem status, better predict and mitigate change, and optimize sustainable productivity. The following figure describes the overall architecture of the Saguaro Digital Library.

The Saguaro Digital Library is a joint project being developed by a consortium of university research groups as well as Federal and State agencies in conjunction with industrial partners. State agencies partnering in the Saguaro Digital Library project include the Arizona State Lands Department, the Arizona State Cartographers Office, the Arizona State Geological Survey, and the Arizona Geographic Information Council. Federal agencies participating in the project include the United States Geological Survey (USGS) Sonoran Desert Field Station, the US Army, the Rocky Mountain Research Station, Los Alamos National Laboratory, The Nature Conservancy, and the US National Park Service. Industrial partners involved in the project include Online Computer Library Center Inc. (OCLC), Raytheon STX, and Simons International Corporation. K-12 partners include Lawrence Intermediate School, Fort Lowell Elementary School, and the Vail School District from Arizona. The development of the library is led by the Department of Management Information Systems in collaboration with other departments at the University of Arizona (UA), including the UA Library, Hydrology and Water Resources, Arid Land Studies, Geography, Electrical and Computer Engineering, the Arizona Regional Image Archive, and Renewable Natural Resources.


Resource Harvesting

For the most part, digital libraries are mausoleums: static information sets that require periodic updates. Such updates are often temporary and costly. Ideally, however, a digital library should support dynamic resources that can be updated by users or an information provider at any time. By employing appropriate data management techniques to assure quality, security, and searchability, together with a means to generate metadata, it is possible to create dynamic digital libraries. To collect the information, however, it is essential to provide digital forms for the appropriate information set and to establish update rules. Furthermore, the evolution of the library must be supported by tools that allow new resources to be added and old or obsolete resources to be removed periodically.

Providing mechanisms for resource harvesting and supporting dynamic evolution of digital libraries would have enormous practical benefits for the public at large. Potential benefits include the following:

(1) reduced cost of obtaining updated information

(2) direct communication with the end user to learn what information is used and why

(3) creation of a cooperative atmosphere to encourage information exchange

(4) improved cost effectiveness of research efforts by expediting and increasing information flow

(5) mechanisms for direct public participation in the information gathering process

In addition, it creates an environment in which users and the public have a significant role and responsibility to add information they deem necessary to affect the outcome maps or decision surfaces that may be created through information exchange.


Architecture

The Harvest System is composed of a Harvester, Metadata, XML Parser, Metadata Storage, and Templates (see the figure Overall Architecture of Harvest System). Harvester provides a user interface through which users can submit the URL for GIS data and its metadata. Based on the user's input, Harvester determines which template file to display. If Harvester receives an XML file, it creates Metadata after extracting the metadata from that file using XML Parser, and then saves the Metadata information in Metadata Storage. Each component of the Harvest System is described below.
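The interplay among these components can be summarized in the following minimal Java sketch. Apart from the component roles taken from this section, every class, interface, and method shown is an illustrative assumption, not the actual implementation.

    // Hypothetical sketch of the component interplay described above.
    import java.io.InputStream;

    interface XmlMetadataParser {              // role of XML Parser
        FgdcRecord parse(InputStream xml) throws Exception;
    }

    interface MetadataStore {                  // role of Metadata Storage
        void save(FgdcRecord record) throws Exception;
    }

    class FgdcRecord {                         // role of the Metadata component
        String title;
        String originator;
        // remaining FGDC elements omitted for brevity
    }

    class HarvesterSketch {                    // role of Harvester
        private final XmlMetadataParser parser;
        private final MetadataStore store;

        HarvesterSketch(XmlMetadataParser parser, MetadataStore store) {
            this.parser = parser;
            this.store = store;
        }

        // Handle an XML metadata submission: extract the record, then persist it.
        void handleXmlSubmission(InputStream xml) throws Exception {
            FgdcRecord record = parser.parse(xml);  // XML Parser extracts metadata
            store.save(record);                     // Metadata Storage keeps it
        }
    }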

Harvester provides a user interface for data submission as well as for retrieval of a URL for GIS data or its metadata in XML file format. Harvester consists of MetadataHarvest.java (which extends ADRGServlet.java), MetadataUpload.java, MetadataAccess.java, MetadataFormReader.java, and HarvesterUser.java (which extends ADRGUserImpl.java). Users can select from among the following three options:

1) No digital metadata available. Help me to create it.

2) Use existing metadata as a template for creating new metadata.

3) Submit metadata as an FGDC XML file.

According to the user's selection, Harvester retrieves metadata from Metadata Storage or works with XML Parser to handle what the user has submitted. Harvester also uses the files in Templates to render its user interface.
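A minimal sketch of this dispatch is shown below. The servlet calls are the standard javax.servlet API and the template file names are taken from the Templates component described below; the "option" parameter name and the renderTemplate helper are illustrative assumptions.

    // Sketch of Harvester's dispatch among the three options.
    import java.io.IOException;
    import javax.servlet.ServletException;
    import javax.servlet.http.HttpServlet;
    import javax.servlet.http.HttpServletRequest;
    import javax.servlet.http.HttpServletResponse;

    public class MetadataHarvestSketch extends HttpServlet {
        protected void doPost(HttpServletRequest req, HttpServletResponse res)
                throws ServletException, IOException {
            String option = req.getParameter("option");  // hypothetical form field
            if ("create".equals(option)) {
                renderTemplate("metadata.wm", res);      // 1) blank metadata form
            } else if ("template".equals(option)) {
                renderTemplate("metadatamain.wm", res);  // 2) pick an existing record
            } else if ("upload".equals(option)) {
                // 3) FGDC XML submitted: hand the stream to XML Parser,
                // then report the compliance result.
                renderTemplate("compliant.wm", res);
            } else {
                renderTemplate("error.wm", res);
            }
        }

        // Stand-in for WebMacro rendering (see the Templates component).
        private void renderTemplate(String name, HttpServletResponse res)
                throws IOException {
            res.getWriter().println("render " + name);
        }
    }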

Metadata is built following the Content Standard for Digital Geospatial Metadata approved by the FGDC, and consists of MetadataComponent.java, Metadata.java, MetadataLists.java, FGDCCompliance.java, Cntinfo.java, Citeinfo.java, Timeinfo.java, Mdattim.java, Rngdates.java, Cntaddr.java, Keywords.java, Metextns.java, and DomainException.java. Metadata is used to create a metadata object for Digital Geospatial Metadata from data extracted from a parsed XML file that users submit, from data retrieved from Metadata Storage, or from data entered into the Metadata Form page provided by Harvester.
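As an illustration of how these classes compose, the following abbreviated sketch mirrors a few of them using FGDC short element names; the fields shown are a small, assumed subset of the real classes.

    // Abbreviated, assumed composition of the Metadata classes named above.
    class Citeinfo {
        String origin;    // Originator
        String pubdate;   // Publication Date (YYYYMMDD)
        String title;     // Title
    }

    class Rngdates {
        String begdate;   // Beginning Date
        String enddate;   // Ending Date
    }

    class Timeinfo {
        String caldate;   // a single Calendar Date, or
        Rngdates range;   // a Range of Dates
    }

    class Cntaddr {
        String city;
        String state;
        String postal;
    }

    class Cntinfo {
        String cntorg;    // Contact Organization
        Cntaddr address;  // Contact Address
        String cntvoice;  // Contact Voice Telephone
    }

    class Metadata {
        Citeinfo citation;        // from Identification Information
        Timeinfo timePeriod;      // Time Period of Content
        String abstractText;      // Abstract
        String[] themeKeywords;   // Theme Keywords (see Keywords.java)
        Cntinfo metadataContact;  // from Metadata Reference Information
    }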

XML Parser was created by extending XML4J, an XML parser for Java developed by IBM. Cooperating with Harvester, XML Parser parses an XML file that users submit, creates a metadata object, and stores it in Metadata Storage. XML Parser consists of HarvestXMLParser.java and DOMParserSaveEncoding.java.
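Since XML4J implements the W3C DOM interfaces, the parsing step can be sketched with the standard org.w3c.dom API. The sketch below uses the generic JAXP factory as a stand-in for XML4J's own entry points, which we do not reproduce here.

    // Sketch of extracting FGDC elements from a submitted XML file.
    import java.io.File;
    import javax.xml.parsers.DocumentBuilder;
    import javax.xml.parsers.DocumentBuilderFactory;
    import org.w3c.dom.Document;
    import org.w3c.dom.Node;
    import org.w3c.dom.NodeList;

    public class HarvestXmlParserSketch {

        // Return the text of the first occurrence of an element, or null.
        static String firstElementText(Document doc, String tagName) {
            NodeList nodes = doc.getElementsByTagName(tagName);
            if (nodes.getLength() == 0) return null;          // element absent
            Node text = nodes.item(0).getFirstChild();
            return (text == null) ? null : text.getNodeValue().trim();
        }

        public static void main(String[] args) throws Exception {
            DocumentBuilder builder =
                DocumentBuilderFactory.newInstance().newDocumentBuilder();
            Document doc = builder.parse(new File(args[0]));  // submitted FGDC XML

            // A few FGDC elements, by their standard short names.
            System.out.println("title:    " + firstElementText(doc, "title"));
            System.out.println("origin:   " + firstElementText(doc, "origin"));
            System.out.println("abstract: " + firstElementText(doc, "abstract"));
        }
    }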

For the servlet design, we used WebMacro, an HTML template engine and back-end servlet development framework. Templates includes main.wm, metadata.wm, metadatamain.wm, xmlview.wm, htmlview.wm, compliant.wm, notcompliant.wm, error.wm, metaregister.wm, login.wm, loginfailure.wm, dbchoice.wm, changepw.wm, changepwfailure.wm, changepwsuccess.wm, register.wm, registerfailure.wm, registersuccess.wm, and thankyou.wm. Harvester uses these template files to display the user interface.
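The pattern Harvester follows with these templates is sketched below: fill a context with named values, then render a .wm file that references them with WebMacro's $variable substitution syntax. The renderTemplate helper is a hypothetical stand-in; we do not reproduce WebMacro's actual API here.

    // A template such as login.wm might reference context values with
    // WebMacro's $variable syntax, e.g.
    //
    //   <p>Welcome back, $userName. You have $recordCount records.</p>
    //
    import java.io.PrintWriter;
    import java.util.HashMap;
    import java.util.Map;

    public class TemplateSketch {

        // Hypothetical stand-in: a real engine would load wmFile and
        // substitute each $name with its context value.
        static void renderTemplate(String wmFile, Map<String, Object> context,
                                   PrintWriter out) {
            out.println("rendered " + wmFile + " with " + context);
        }

        public static void main(String[] args) {
            Map<String, Object> context = new HashMap<>();
            context.put("userName", "jskim");             // illustrative values
            context.put("recordCount", Integer.valueOf(12));
            renderTemplate("login.wm", context, new PrintWriter(System.out, true));
        }
    }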


Harvesting System Description

The Arizona NBII Metadata Harvesting System is a precursor to the Saguaro Digital Library, a confederation of natural resources data which dynamically grows through user contributions. Although many individuals and organizations have GIS data sets which they are willing to make available to the public, metadata is often not available for these data sets, or is not present in a form which can be easily processed digitally. The harvesting system addresses two key challenges: storing GIS metadata in a form which allows for maximum flexibility in performing searches and presenting this data, and providing a mechanism for users to easily create and contribute metadata for their coverages.

The system stores metadata, and although it is capable of also storing the GIS data files themselves, it is designed primarily to hold only references to such files somewhere on the Internet. A user can place data files in an accessible location on the Internet and then use our system to submit or create metadata for them. Once this is completed, other users can search for and locate these data files through the system. This makes it possible for organizations to publish their data without developing their own web interface for searching and downloading it.

The metadata creation process begins in one of three ways. The user may choose to create the metadata from scratch, in which case they are presented with a blank metadata form. Alternatively, the user may choose to create metadata using an existing metadata record as a template. In this case, the system presents a list of existing metadata records from which the user may choose. The system presents the same metadata form, but the fields are filled with the values of the metadata record used as a template. This is very helpful when creating a series of metadata records which are mostly similar, since only the data which differs must be changed before submitting the new metadata record (see the sketch below). Finally, the user may have existing metadata for the data set, in which case producing metadata using the form is tedious and unnecessary. In such cases, the user may upload the metadata as a valid XML file corresponding to the FGDC Content Standard XML Document Type Definition. Many organizations use the USGS Metadata Parser (mp) by Peter Schweitzer to prepare FGDC-compliant metadata, and this tool has an XML output option. Metadata in a variety of forms can thus be converted to XML format and uploaded to the site. The system parses the XML file and then fills the metadata form with the appropriate data. Any desired changes can be made, although a compliant metadata record will not require them; the metadata form can then be submitted to place the record in the database.
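The template path amounts to a copy-then-edit step, as sketched below; the field names and values are illustrative assumptions.

    // Sketch of the "existing record as template" path: copy the chosen
    // record's values into a fresh form so the user edits only what differs.
    import java.util.HashMap;
    import java.util.Map;

    public class TemplatePrefillSketch {

        // Copy a stored record's values into a new form-field map.
        static Map<String, String> prefillForm(Map<String, String> existing) {
            return new HashMap<>(existing);  // every value carried over
        }

        public static void main(String[] args) {
            Map<String, String> record = new HashMap<>();
            record.put("title",  "Vegetation map, Saguaro NP East, 1997");
            record.put("origin", "USGS Sonoran Desert Field Station");

            Map<String, String> form = prefillForm(record);
            form.put("title", "Vegetation map, Saguaro NP West, 1997");
            System.out.println(form);  // only the title needed changing
        }
    }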

Even if the user chooses to create entirely new metadata, the system provides assistance to make the process as fast and easy as possible. Users must register at the site in order to use the system. This enables us to track the way the system is used, and it also allows users to optionally submit personal information such as their address, phone number, email and other details. When a user creates new metadata, this information is automatically included in the metadata form, eliminating the need to enter it each time a submission is made. When creating metadata using another record as a template, the user has the option of keeping the original contact information or replacing it with his or her own. Fields which have a limited set of values, or a list of recommended values specified in the FGDC content standard, include those values in pull-down menus for easy metadata creation. The value lists for other pull-down menus, such as the menu for the "Originator" field, are dynamically pulled from the database. If the desired value exists in a previously entered metadata record, the user may choose it from the menu rather than typing the value. Thus, users can very quickly create metadata with a minimal amount of typing.
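A dynamically populated menu of this kind can be sketched with a JDBC query, as below; the JDBC URL, table, and column names are assumptions, not the system's actual schema.

    // Sketch of populating the "Originator" pull-down from the database.
    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.ResultSet;
    import java.sql.Statement;
    import java.util.ArrayList;
    import java.util.List;

    public class OriginatorMenuSketch {

        // Collect the distinct Originator values already in the database.
        static List<String> originatorValues(Connection conn) throws Exception {
            List<String> values = new ArrayList<>();
            Statement stmt = conn.createStatement();
            ResultSet rs = stmt.executeQuery(
                "SELECT DISTINCT origin FROM metadata ORDER BY origin");
            while (rs.next()) {
                values.add(rs.getString(1));
            }
            rs.close();
            stmt.close();
            return values;
        }

        public static void main(String[] args) throws Exception {
            Connection conn = DriverManager.getConnection(args[0]); // JDBC URL
            for (String origin : originatorValues(conn)) {
                System.out.println("<option>" + origin + "</option>");
            }
            conn.close();
        }
    }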

When the user has finished creating the metadata and submits the metadata form, the system checks it for FGDC compliance and specifically reports the errors it finds, such as a missing mandatory field or a numeric value outside the allowable range. Understanding and correcting these errors is made far easier through an online help system which provides information about the FGDC content standard requirements for each metadata field. Clicking on the field name in the metadata form causes a new window to appear which contains abbreviated FGDC content standard documentation. Once the user submits a fully-compliant metadata record, he or she may review the record as an XML file. Currently, this view supports only raw XML delivered as an XML or text data type. In the future, we would like to enhance the display options for the data. Internet Explorer 5 provides a tree structure interface for examining raw XML files, which by itself is a useful viewing tool for this data. The XML file is useful for more than review. The user may save the file to their local system and then process it through mp to create FGDC-compliant metadata in a variety of formats, for purposes completely unrelated to Arizona NBII data search and services. Even users who choose not to actually submit their data may find the automated metadata creation tools useful.
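The kind of checking FGDCCompliance.java presumably performs can be sketched as below. The two rules shown (a mandatory Title, and the West Bounding Coordinate range) use FGDC element names; the form representation is an illustrative assumption.

    // Sketch of FGDC compliance checks: mandatory elements present,
    // numeric values within allowed ranges.
    import java.util.ArrayList;
    import java.util.List;
    import java.util.Map;

    public class ComplianceSketch {

        static List<String> validate(Map<String, String> form) {
            List<String> errors = new ArrayList<>();

            // Rule 1: a mandatory field must be present and non-empty.
            String title = form.get("title");
            if (title == null || title.trim().isEmpty()) {
                errors.add("Title (mandatory) is missing.");
            }

            // Rule 2: a numeric value must fall within its allowable range.
            String westbc = form.get("westbc");  // West Bounding Coordinate
            if (westbc != null) {
                try {
                    double w = Double.parseDouble(westbc);
                    if (w < -180.0 || w > 180.0) {
                        errors.add("West Bounding Coordinate must be in [-180, 180].");
                    }
                } catch (NumberFormatException e) {
                    errors.add("West Bounding Coordinate must be numeric.");
                }
            }
            return errors;  // an empty list means these checks passed
        }
    }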

The set of metadata fields used by the harvesting system does not include the entire FGDC metadata standard. We chose to include all fields which are identified as mandatory, and any other fields which are useful for metadata search. Because the system was primarily designed to capture data which would enable users to search for, locate and download data sets, those optional fields which would rarely be used for search criteria were left out of the system. The result is a subset of important metadata elements which allow the user to create a minimal fully-compliant metadata record quickly and easily. Although the richness of the FGDC standard is lost, this decision makes the system more practical, efficient and user-friendly. In addition, it eases the significant database design burden placed upon developers by the FGDC content standard, which is highly recursive and irregular. Its complexity makes it very difficult to represent in a relational database, although storage of metadata in a database system, as opposed to collections of files or file-based systems, provides a more scalable, efficient, flexible and secure solution.
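One way to flatten a fragment of the standard into relational tables is sketched below, with a repeating element (theme keywords) moved into a child table; this layout is our own illustration, not the system's actual schema.

    // Illustration of flattening part of the FGDC standard relationally.
    import java.sql.Connection;
    import java.sql.Statement;

    public class SchemaSketch {

        static void createTables(Connection conn) throws Exception {
            Statement stmt = conn.createStatement();
            stmt.executeUpdate(
                "CREATE TABLE metadata_record (" +
                "  record_id INTEGER PRIMARY KEY," +
                "  title     VARCHAR(250) NOT NULL," +  // mandatory element
                "  origin    VARCHAR(250) NOT NULL," +
                "  pubdate   VARCHAR(14))");            // FGDC dates are text
            stmt.executeUpdate(
                "CREATE TABLE theme_keyword (" +        // repeating element
                "  record_id INTEGER REFERENCES metadata_record(record_id)," +
                "  keyword   VARCHAR(100))");
            stmt.close();
        }
    }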


Future Direction and Conclusion

The Harvest System project is an attempt to develop a Harvesting Agent (HA) for the Saguaro Digital Library in support of resource harvesting. The Harvest System provides users with a friendly interface through which they can submit geographic data and its metadata. It also offers three options for users who have geographic data but not its metadata, and helps less experienced users to submit metadata.

The Harvest System can be extended in the following ways:

The current system has been developed based on Identification Information and Metadata Reference Information. In the future, it will include Data Quality Information, Spatial Data Organization Information, Spatial Reference Information, and Entity and Attribute Information, in addition to Distribution Information, to allow users to describe their metadata in more detail. As a result, the system will be able to provide an advanced search option with which people can access the exact metadata they need.

The effort to develop this Harvest System has been largely oriented toward information harvesting. However, since the ultimate goal is also to share geographic data and to increase knowledge, users should be able to access and evaluate our data sets. To achieve this goal, the system will also be search-enabled in the future. People will be able to submit geospatial information they have gathered and share it with other users.

Even though the file format recommended by the FGDC is XML, people often have an indented text file as their metadata. Also being considered is the ability to extend the system to receive indented text files, parse them to an XML file using mp, validate the XML file, and then follow the same process as when an XML file is received.
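That extension could shell out to mp, as sketched below. We recall mp writing XML output with a -x option, but this and the file paths are assumptions to verify against the documentation of the installed mp version.

    // Sketch of the planned text-to-XML step using the USGS Metadata Parser (mp).
    import java.io.BufferedReader;
    import java.io.InputStreamReader;

    public class MpConversionSketch {

        static int convertToXml(String textFile, String xmlFile) throws Exception {
            Process mp = Runtime.getRuntime().exec(
                new String[] { "mp", "-x", xmlFile, textFile });
            BufferedReader err = new BufferedReader(
                new InputStreamReader(mp.getErrorStream()));
            String line;
            while ((line = err.readLine()) != null) {
                System.err.println("mp: " + line);  // compliance problems, if any
            }
            return mp.waitFor();  // nonzero exit suggests the run failed
        }

        public static void main(String[] args) throws Exception {
            int status = convertToXml("metadata.txt", "metadata.xml"); // paths illustrative
            System.out.println("mp exit status: " + status);
        }
    }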

Our overwhelming hope for this system is that it will not only help people to submit and share more geospatial information, but also help them better understand the interdependence between the economy and the environment, actively conserve biodiversity, and protect natural ecosystems to preserve the quality of human life.


Acknowledgements

We would like to express our sincere thanks to the following organizations and individuals for their expertise and willingness to contribute funding, information, or time: The Eller College of Business and Public Administration, The College of Agriculture, The University of Arizona Advanced Resources Technology group (ART), The Arizona State Cartographers Office, The Arizona State Lands Department (ALRIS), the USGS National Biological Information Infrastructure program (NBII), the USGS California Science Center, The University of Arizona Library, The Arizona Remote Sensing Center, The University of Arizona College of Arts and Sciences, and the many University of Arizona graduate students who provided many hours of code checking and data entry to make all of this happen.


References

[1]     The Saguaro Digital Library for Natural Asset Management
          http://vishnu.bpa.arizona.edu/projects/saguaro.html

[2]     Sudha Ram, Jinsoo Park and Dongwon Lee, 
         Digital Libraries for the Next Millennium: 
         Challenges and Research Directions, Information Systems Frontiers
         1:1, 75-94, 1999

[3]     Coordinating Geographic Data Acquisition and Access:
         The National Spatial Data Infrastructure
         http://www.fgdc.gov/publications/documents/geninfo/execord.html

[4]     Content Standard for Digital Geospatial Metadata, Version 1.0
         Metadata Standards Development, The Federal Geographic Data Committee
         http://www.fgdc.gov/publications/documents/metadata/metav1-0.html

[5]    Ramez Elmasri and Shamkant B. Navathe, "Fundamentals of Database Systems",
        Second Edition, Addison-Wesley Publishing Company, 1994

[6]    Jinsoo Park, "Facilitating Interoperability among Heterogeneous Geographic
        Database Systems: A Theoretical Framework, A Prototype System, and
        Evaluation", 1998

[7]    Jason Hunter with William Crawford, "Java Servlet Programming", O'Reilly, 1999

[8]    Doug Tidwell, XML Programming in Java
         http://www-4.ibm.com/software/developer/education/xmljava/

[9]    WebMacro Java Servlet Framework
         http://www.webmacro.org/

[10]  Declaring Elements and Attributes in an XML DTD
         http://www.informatik.tu-darmstadt.de/DVSI/staff/bourret/xml/xmldtd.html

[11]   Todd Freter, XML: Mastering Information on the Web
          http://www.sun.com/980310/xml/

[12]   STS Prasad and Anand Rajaraman,
         Virtual Database Technology, XML, and the Evolution of the Web,
         IEEE Computer Society Technical Committee on Data Engineering, 1999

[13]  Andrew V. Royappa, Implementing Catalog Clearinghouse with XML and XSL,
         pp. 616-623, ACM, 1999

[14]   Howard Smith and Kevin Poulter, Share the Ontology in XML-based Trading Architectures,
         Communications of the ACM 42(3), pp. 110-111, March 1999

[15]   Tim Bray, Beyond HTML: XML and Automated Web Processing
          http://developer.netscape.com/viewsource/bray_xml.html

[16]   Norman Walsh, A Technical Introduction to XML (1998)
          http://nwalsh.com/docs/articles/xml/

[17]   C. M. Sperberg-McQueen, What is XML and Why Should Humanists Care?
          http://users.ox.ac.uk/~drh97/Papers/Sperberg.html

[18]   XML: Structuring Data for the Web: An Introduction
          http://wdvl.com/Authoring/Languages/XML/Intro/fixing.html

[19]   Udi Manber and Peter A. Bigot, Connecting Diverse Web Search Facilities,
          IEEE Computer Society Technical Committee on Data Engineering, 1999

[20]   Jon Bosak, XML, Java, and the Future of the Web
          http://metalab.unc.edu/pub/sun-info/standards/xml/why/xmlapps.htm

[21]   Robert J. Glushko, Jay M. Tenenbaum and Bart Meltzer,
         An XML Framework for Agent-based E-commerce,
         Communications of the ACM 42(3), pp. 106-114, 1999


Author Information:

Sudha Ram is a professor in the Management Information Systems Department, which is an integral part of the Eller College of Business and Public Administration, one of the 17 colleges at The University of Arizona. Correspondence may be addressed to: Dr. Sudha Ram, Department of Management Information Systems, McClelland Hall 430, The University of Arizona, Tucson, Arizona, 85721. Dr. Ram may also be contacted by telephone at (520) 621-4113 or by email at Ram@bpa.arizona.edu.

Michael R. Kunzmann is an Ecologist at the USGS Sonoran Desert Field Station located within the School of Renewable Natural Resources. The School of Renewable Natural Resources is in the College of Agriculture and is centrally located on The University of Arizona campus. Correspondence may be addressed to: Michael R. Kunzmann, USGS Sonoran Desert Field Station, The University of Arizona, 125 Biological Sciences East, Tucson, Arizona, 85721. Mr. Kunzmann may also be reached by telephone at (520) 621-7282 or by email at mrsk@npscpsu.srnr.arizona.edu.

Jongseo Kim is a graduate student in the Management Information Systems Department at The University of Arizona. Correspondence may be addressed to: Mr. Jongseo Kim, Department of Management Information Systems, McClelland Hall 430, The University of Arizona, Tucson, Arizona, 85721. Mr. Kim may also be contacted by telephone at (520) 621-2328 or by email at jskim@bpa.arizona.edu.

Jeff Abbruzzi recently received a master's degree from the Management Information Systems Department at The University of Arizona. Mr. Abbruzzi is in the process of relocating to Phoenix, Arizona to pursue a new career opportunity. In the interim, correspondence may be addressed to: Mr. Jeff Abbruzzi, Department of Management Information Systems, McClelland Hall 430, The University of Arizona, Tucson, Arizona, 85721.