The Geospatial and Statistical Data Center at the University of Virginia is one of four library-based electronic centers. Other centers focus their user services and collections on texts, images and multimedia, and digitized rare materials. How do library-based GIS services fit within such a diverse collection of holdings? What challenges do librarians face when working with GIS metadata standards, especially when trying to adapt these standards, and those for other media, to establish a core data model? How can and how do GIS users influence the process?
Today I want to focus on a specific context for GIS: a diverse and well-established, yet changing, digital library environment at the University of Virginia Library. I want to emphasize the role that less technical factors, such as organizational mission and culture, play in defining an approach to "digital library" design, and I'll be speaking from my experiences as "the GIS guy" at the UVa Library. As we know, GIS and libraries both rely heavily on the people who know how material is organized, and who can help others use it. We should similarly consider that a digital library may exist within both physical and virtual space. That is certainly the approach that my organization has taken over the past seven years. We actively seek out and encourage new users of digital technologies, and provide interested faculty and students with heavy support. Rather than build a digital library machine behind the scenes, we have instead adopted a laboratory and service approach, creating several digital centers in our buildings and enabling users to collaborate in our experiments with new technologies, including GIS.
But this approach has had consequences for the way we organize and publish our digital materials. The term "digital library" has no stable meaning right now, but I think that most of us would assume that digital libraries will not only improve access to materials, but also provide enhanced tools for interpretation and use of those materials. It's not just a matter of building modes of access, but of building meaningful modes of access. One Esri promotional campaign asks: "How in the world do you put GIS data on the Web?" Librarians rephrase this: "Once you get GIS data on the web, how in the world do you help users find it? How can you integrate such data with many other digital collections--including non-spatial data--that may be similar in theme, but foreign in type?" I've met GIS users from all over the country who, when they find out I work for a library, become excited: "We need to get the librarians involved to help us solve the metadata issue." The open secret is: librarians don't really have it solved yet either. We're working on it, just like everyone else, but not only through the lens of the Content Standard for Digital Geospatial Metadata.
The library's GIS services at UVa are coordinated through a unit we call Geostat: the Geospatial and Statistical Data Center. I'll speak more about our GIS services in a moment, but let me provide some context for our work. Besides Geostat, there are three similar digital centers sponsored directly by the Library, and the library houses two research computing institutes. These efforts began in 1992 with the Electronic Text Center, which creates and collects SGML-encoded documents, including literature and significant non-fictional works. Our Digital Media Center focuses its collection efforts on image, video, and audio materials, and supports a variety of disciplines, focusing heavily on the fine arts. The Special Collections Digital Center has less direct public contact than the other three centers, but its digitization of texts and images from our collection of rare manuscripts, books, and photographs improves access to these valuable materials. The Institute for Advanced Technology in the Humanities and the Virginia Center for Digital History both live in our main library building, just down the hall from Geostat and Etext. These two institutes sponsor large projects by supporting a few faculty directly with fellowships. While we have learned a lot by working with them, the library's centers have a broader service mission: to work with faculty and students to develop materials in response to their teaching or research needs.
Geostat collects, manages, and delivers machine-readable data for both spatial and statistical analysis. We began six years ago as the Social Sciences Data Center, the library's response to the increasing numbers of electronic data sets we were acquiring through the Government Documents Depository Program. We have close working relationships with users in the sciences, arts, and social sciences. We support teaching and research by instructing classes or consulting on projects, and we collect a wide variety of data, including spatial data sets and paper maps, social and political opinion surveys, and national and international economic indicators. In many cases, users come to us, get their data file, and go. But our focus requires a lab-based service approach because the data we deliver--whether in person or through the web--usually requires software other than a web browser to fully realize its value. We provide a large computing lab where students can work with GIS and statistical software packages, and we staff the lab heavily during the school year. We hold short instruction sessions on use of the software in the lab, which also serves as a teaching space for faculty in the Department of Environmental Sciences.
The ways we have published data on the web have been heavily influenced by the structure of the center. For several years, what is now Geostat was two separate units: the Social Sciences Data Center and the Geographic Information Center. They divided their collections and services according to their names: numbers in one place, spatial data in another. Both centers constructed web-based data browsers which allowed users to interact with server-housed programs such as ArcInfo or SAS, and which guided the user through analysis and subsetting of large spatial and quantitative data sets. However, the split between the Geographic Information Center and the Social Sciences Data Center meant that web-based applications developed by each sometimes didn't make use of materials provided by the other. For several years the Social Sciences Data Center provided online access to large federal data sets with spatial components, like the County Business Patterns, published by the U.S. Census Bureau, and the Regional Economic Projections, published by the Bureau of Economic Analysis. These included no mapping components. The Geographic Information Center delivered TIGER coverages of Virginia via a web interface to ArcInfo starting in 1995. But this provided no statistical component. The narrowly defined missions of the two service areas got in the way of providing a useful interpretive tool. When Geostat unified, we focused on adding online mapping as a function for the statistical data. Because these sets were already in SAS format, we made use of SAS's graphing functions to create statistical maps and deliver them online. In this case, we're using the tool that works best for a particular purpose--expediency is the rule of thumb in our centers. SAS's statistical functionality is more flexible than ArcInfo's, and its graphics functions speedier, but its GIS features are less well developed (and we don't know them as well). It would of course make little sense to store and deliver geodata to our users in anything except a GIS data exchange format--such as ArcInfo export files.
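To make this pattern concrete, here is a minimal sketch, in Python, of the kind of web-to-server-program pipeline just described: a request parameter drives a batch SAS job that subsets a table and draws a statistical map. Everything here--the data set names, the PROC GMAP call, the output file--is a hypothetical stand-in, not one of our production scripts.

```python
#!/usr/bin/env python3
"""Hypothetical CGI sketch: a web request drives a server-housed SAS
job that subsets a data set and returns a statistical map."""

import os
import subprocess
import sys
from urllib.parse import parse_qs

query = parse_qs(os.environ.get("QUERY_STRING", ""))
variable = query.get("variable", ["employment"])[0]  # column to map
state = query.get("state", ["51"])[0]                # 51 = Virginia's FIPS code

# Generate a small SAS job: subset the table, then draw a choropleth
# map with PROC GMAP. (Graphics-device setup is omitted for brevity.)
sas_job = f"""
data subset;
    set cbp.counties;
    where statefp = "{state}";
run;
proc gmap map=maps.counties data=subset;
    id county;
    choro {variable} / levels=5;
run;
"""

with open("mapjob.sas", "w") as f:
    f.write(sas_job)

# Run SAS in batch mode; assume the job writes its map to map.gif.
subprocess.run(["sas", "-sysin", "mapjob.sas"], check=True)

# Return the image to the browser.
sys.stdout.write("Content-Type: image/gif\r\n\r\n")
sys.stdout.flush()
with open("map.gif", "rb") as img:
    sys.stdout.buffer.write(img.read())
```

The TIGER browser rested on the same division of labor, with ArcInfo behind the web interface instead of SAS.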
The split between services in a pre-unified Geostat is mirrored throughout the centers as a whole. When we began opening these centers focused on specific media--texts, images, numeric and spatial data, audio and video--the library bet that by separating responsibilities, both staff and collections could develop more rapidly. This approach made the most sense at the time, and it has worked: each of the centers knows well what its clients need, and each has developed an outstanding variety of materials. The center-based approach has resulted in an ever-growing, ever-diverse collection of electronic data in library hands, but it has also led to a rather balkanized set of collections that do not unify well into a "digital library." (Witness geo and stat.) While we all work together well on occasional projects, each of the separate centers has approached data publishing and management through the lens of its own collections and what works well for it. Data does not mean the same thing to all of us. Geostat is in fact an oddball in the group of centers. The balance of expertise in our library tips toward SGML, not SQL; texts, rather than numbers; rasterized photography, instead of vectorized cartography. While almost everyone can now readily grasp the concept of putting the text of a book on the web, it's hard to imagine what a spatial query looks like if you've never seen one performed.
Furthermore, there has been no overarching system in place--on an organizational chart, or in reality--to control the data, or the interface to the data, produced by each of these centers. The methods that each center has adopted for data publishing and collection building have not been random, but they have been pragmatic and circumstantial. These centers pre-dated the web, and their methods have developed with the means of distribution. As a result, the digital collections are a confusing mix of full-text SGML, contorted pseudo-SGML databases, ColdFusion and FileMaker Pro databases, RealAudio files, SAS data files, and ArcInfo coverages, scattered hither and yon among many servers. The problem is not that we're using different approaches and tools, but that the collections have grown too quickly to develop a coherent and coordinated metadata structure among them all. In many cases, the deep and useful metadata which describes these collections sits with the data itself, rather than in an overall catalog; or it sits unrecorded and unexpressed, except in printed documentation; or it sits primarily in the scripts and syntax written specifically to make the data dance in a certain way on the web. As a result, even where potential connections exist across the collections of different centers, the tools that we have built do not connect the sets with each other, nor do they allow our users to easily search for and discover materials unknown to them.
But didn't librarians invent metadata? Isn't data management what libraries do? What about library catalogs? Doesn't the library's catalog, which we call VIRGO, work to comprehend all of these diverse collections? The answer is that library catalogs can work in some instances and for some purposes, but they cannot do all the things our users want done in a digital library. (Ask yourself: do all nodes on the National Spatial Data Infrastructure Clearinghouse Network provide you with all the functionality you want?) Our catalog works as a finding aid, not primarily as a tool to add value to and enable use of a data collection. Furthermore, the preponderance of electronic data has spawned new metadata standards, which evolve slowly but still outpace the practices of traditional library cataloging. Library catalog records--defined by the standard known as MARC (MAchine-Readable Cataloging)--are not robust enough to comprehend the depth and uniqueness of some of the data in these collections, especially the non-textual items. FGDC and other highly complicated metadata structures were not developed for use in these systems, but by the people who would use the data itself. When we set up a web-based archive of material, we have the opportunity to define in advance some frequently performed searches to speed access. In the library catalog, your options are more limited: you can't easily pre-program database searches into the structure of the cataloging record for a large collection. A library catalog is very much like the spatial data clearinghouse that lists only contact information for retrieving data. A link providing you with access to the resource is a bonus.
None of us has abandoned the catalog. It works well for cataloging digital items that have a close physical analogue, like an electronic text and its corresponding book. Even for some of our own spatial data, it is the best means of access we can currently provide. Over 800 Digital Raster Graphics for the state of Virginia are available for retrieval through our catalog, along with their complete FGDC metadata records. However, you can only retrieve them through the catalog. We planned this as a stopgap until we can acquire more server space and better image processing and delivery software, and have the time to work on an index that will give users close to the same functionality you get from a printed index and a drawer full of quad sheets. Unfortunately, that is beyond the scope of the catalog as it now exists.
For some time now, all of us in our centers have discussed the need for a management scheme that can incorporate and extend the traditional library cataloging model. Our system is messy. We want to unify the tools for resource discovery (what catalogs do best), resource access (which catalogs can do increasingly well), and resource interpretation (which has to be done by dozens of different programs, Perl scripts, SAS jobs, and AMLs). Such a system, by necessity, must be able to comprehend data, and metadata, of many formats. A unified repository would manage and track the data. It would allow for basic searching across all collections, using a standard metadata set such as Dublin Core, while maintaining the integrity of the distinct metadata and data collections for specialists like the GIS user. Furthermore, it has to incorporate the tools we build more directly into the system, to save effort and time.
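Here is a minimal sketch of what that cross-collection search could look like, assuming each item exposes a small Dublin Core-style record alongside its richer native metadata. The records, fields, and file paths below are invented for illustration.

```python
"""Sketch: every collection, whatever its native metadata standard
(FGDC, DDI, TEI headers), also exposes a small Dublin Core-style
record, and one search runs across all of them."""

# Each item keeps its rich, format-specific metadata elsewhere; the
# repository only needs this common core for cross-collection discovery.
dublin_core_records = [
    {"title": "TIGER/Line roads, Albemarle County, VA",
     "type": "spatial data; ArcInfo coverage",
     "subject": ["roads", "transportation", "Virginia"],
     "native_metadata": "fgdc/albemarle_roads.xml"},   # hypothetical path
    {"title": "County Business Patterns, 1996",
     "type": "numeric data; SAS data set",
     "subject": ["employment", "business", "counties"],
     "native_metadata": "ddi/cbp1996.xml"},
    {"title": "The Examination of Sarah Good",
     "type": "text; SGML",
     "subject": ["Salem witchcraft trials", "court records"],
     "native_metadata": "tei/good_examination.sgm"},
]

def search(records, term):
    """Return every record whose title or subjects mention the term."""
    term = term.lower()
    return [r for r in records
            if term in r["title"].lower()
            or any(term in s.lower() for s in r["subject"])]

# One query now spans spatial, numeric, and textual collections; a
# specialist can follow native_metadata to the full FGDC or DDI record.
for hit in search(dublin_core_records, "salem"):
    print(hit["title"], "->", hit["native_metadata"])
```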
There is no such single tool out there. Products from the commercial sector often seem to be designed around commercial functionality; mechanisms in development at other library systems focus--not surprisingly--on the strengths of a particular collection and upon the research interests of the development team. However, we have begun cooperating with researchers at other universities who share our interests, and have quietly started our own research and development effort that will attempt to build an open, scalable, modular system out of many existing and developing pieces. It is very much a work in progress, and each of the centers will be developing data to test the system over the coming year. That is the context in which our GIS publishing will take place.
The major new software we will experiment with is known as FEDORA (Flexible and Extensible Digital Object and Repository Architecture), under development at Cornell University. The goal here is to create a repository system in which our data are self-aware objects. While new GIS software (notably ArcInfo 8) has begun to deploy an object-oriented data model, we hope to develop such a model not just for GIS data, but also for electronic texts, images, and numerical data sets. In the digital library, these data objects have to be the stars, and they need to carry their properties and behaviors with them. To accomplish this, we will need to add new tools to the system, such as XML and Java, to create, process, and manage metadata collections, and to extend metadata's role from a descriptive to a functional one. Data should carry with it the metadata that describes it, allowing for discovery in the broad catalog. It should carry the metadata that describes how it can be used, allowing specialists to determine how appropriate it is to their purposes. Finally, it will have to carry the metadata that can talk to the software that uses it, and be able to launch those processes, whether they are SAS, ArcInfo, or simply another web browser window. The real trick in this system will be programming our tools to accept metadata sent to them by the repository. By making statistical and GIS software talk to new data types, such as a codebook or metadata record that has been marked up in XML, we can extend the usefulness of metadata. That is the dream.
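As a rough illustration of the object model we have in mind--a sketch of the idea, not of FEDORA's actual architecture--consider an object that bundles its data stream with descriptive metadata (for discovery), technical metadata (for specialists), and a set of named behaviors that know which tool to launch. All names and behaviors below are invented.

```python
"""Sketch of a 'self-aware data object': data plus the metadata that
describes it, the metadata that says how to use it, and behaviors
that can launch outside software against it."""

class DigitalObject:
    def __init__(self, data_file, descriptive, technical):
        self.data_file = data_file        # e.g. an ArcInfo export file
        self.descriptive = descriptive    # Dublin Core-style fields
        self.technical = technical        # format, datum, codebook, etc.
        self.behaviors = {}               # name -> function(data_file)

    def register(self, name, behavior):
        """Attach a named behavior, e.g. 'map' or 'summarize'."""
        self.behaviors[name] = behavior

    def invoke(self, name):
        """The repository asks the object to act on its own data."""
        return self.behaviors[name](self.data_file)

# Hypothetical behaviors that hand the object's data to outside tools.
def map_with_arcinfo(path):
    return f"arc: running an AML against {path}"   # stand-in for a real launch

def summarize_with_sas(path):
    return f"sas: running a summary job on {path}"

tiger = DigitalObject(
    data_file="va_roads.e00",
    descriptive={"title": "TIGER roads, Virginia", "subject": ["roads"]},
    technical={"format": "ArcInfo export", "datum": "NAD83"},
)
tiger.register("map", map_with_arcinfo)
tiger.register("summarize", summarize_with_sas)

print(tiger.invoke("map"))  # the object carries its own modes of use
```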
The more prosaic reality means making all of that metadata. Geostat's first focus in our research effort will be somewhat old-fashioned: building a new clearinghouse model to describe data to our basic information community. But in fact, such a finding aid will be the backbone of everything else that is done. Once metadata exists for the material to be mapped or otherwise manipulated, all other processes in the chain of access will talk to that record, or to information generated from that record. Our early experiments with XML and stylesheets for other types of metadata suggest that we should be able to use them to design a clearinghouse at once more functional and simpler to use, with the ability to deliver different parts of a record depending upon the user's needs. But the target is a system that does more than report on the data. It should also allow the user to plot and view, subset, or otherwise manipulate the data before retrieving it for her own use. It should also be able to deliver a newly generated, compliant metadata record that has been specifically created to reflect her manipulations.
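Here is a small sketch of that idea: one record stored in XML, with different slices rendered for different audiences. The element names below are simplified stand-ins, not real FGDC tags, and in practice an XML stylesheet would do the selection that this Python function does.

```python
"""Sketch: one full metadata record in XML, with different parts
delivered depending on the user's needs."""

import xml.etree.ElementTree as ET

record = ET.fromstring("""
<metadata>
  <identification>
    <title>TIGER/Line roads, Virginia, 1995</title>
    <abstract>Road features extracted from Census TIGER files.</abstract>
  </identification>
  <spatial_reference>
    <projection>Geographic</projection>
    <datum>NAD83</datum>
  </spatial_reference>
  <distribution>
    <format>ArcInfo export</format>
    <online_link>http://fisher.lib.virginia.edu/...</online_link>
  </distribution>
</metadata>
""")

# Which sections each audience sees; a stylesheet would encode the same.
VIEWS = {
    "browser":    ["identification", "distribution"],
    "specialist": ["identification", "spatial_reference", "distribution"],
}

def render(rec, audience):
    """Print only the parts of the record this user needs."""
    for section in VIEWS[audience]:
        for elem in rec.find(section):
            print(f"{elem.tag}: {elem.text.strip()}")

render(record, "browser")      # short view for discovery
```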
To achieve this, we'll be rethinking the ways we currently deliver and map data on the web, and experimenting to create a replicable process. We are joining the numerous data librarians who have begun experimenting with the Data Documentation Initiative (DDI), an XML tag set being defined to describe the structure of tabular data. While FGDC metadata can be very narrative in structure and content, DDI focuses heavily on describing the structure of a flat data file, column by column, field by field. We will be using DDI to define the structure of 1990 STF3a Census files for Virginia, while working with the FGDC standard to describe the core spatial materials, such as coverages generated from TIGER files. At this point, we are working with old data we know very well. As we refine the process, we'll move on to new methods of storing and distributing data that take advantage of functionality such as we are seeing in new GIS tools like ArcInfo 8, ArcSDE, and ArcIMS. Our job here is first to build a system that is replicable for other sets of numeric and spatial data, and then to build one that will allow discrete sets of data to interact with each other through the repository model.
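To show the contrast in miniature: where an FGDC record is largely narrative, a DDI codebook describes the flat file column by column. The sketch below generates a DDI-like fragment for a few hypothetical STF3a columns; the element names follow the general shape of the DDI data-description section but are simplified here.

```python
"""Sketch: building a DDI-style, column-by-column description of a
flat data file. Column names, labels, and widths are hypothetical."""

import xml.etree.ElementTree as ET

# Hypothetical layout for a few columns of an STF3a extract.
columns = [
    ("STATEFP",  "State FIPS code",  "character", 2),
    ("COUNTYFP", "County FIPS code", "character", 3),
    ("P0010001", "Total persons",    "numeric",   9),
]

data_dscr = ET.Element("dataDscr")
for name, label, vartype, width in columns:
    var = ET.SubElement(data_dscr, "var", name=name)
    ET.SubElement(var, "labl").text = label
    ET.SubElement(var, "varFormat", type=vartype)
    # The field width lets software read the raw fixed-width file directly.
    ET.SubElement(var, "location", width=str(width))

ET.dump(data_dscr)  # an XML fragment a SAS or GIS front end could consume
```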
In developing new data management structures, we'll keep our mission focused on users: improving their access to and use of new materials and collections, and, where possible, collaborating with them to produce new collections. These collaborations promise to be a very fruitful way to extend the usefulness of GIS and, at the same time, to enhance the value and usefulness of multiple collections, including those that have not previously employed it. We may be able to more deeply exploit the spatial information latent in our literary and historical documents. Both the Institute for Advanced Technology in the Humanities and the Virginia Center for Digital History now have research faculty very interested in using GIS to study a range of topics, including post-Civil War communities and the witchcraft hysteria in Salem in 1692. Imagine a textual database of court records of the Salem witchcraft trials--this exists already. Imagine then an historic gazetteer of old Salem Village that shows where the accusers and accused lived in 1692--this is under construction. Imagine bringing these two databases together under one system to allow for a broader, richer, and deeper research experience than either can provide on its own: you can begin to see what a digital library should perhaps look like.
All Web citations were verified on August 1, 1999.
Content Standard for Digital Geospatial Metadata
http://www.fgdc.gov/metadata/contstan.html
Data Documentation Initiative
http://www.icpsr.umich.edu/DDI/
Digital Media Center at the University of Virginia Library
http://www.lib.virginia.edu/dmc
The Electronic Text Center at the University of Virginia Library
http://etext.lib.virginia.edu
FGDC National Spatial Data Infrastructure Clearinghouse Network
http://www.fgdc.gov/clearinghouse/clearinghouse.html
Flexible and Extensible Digital Object and Repository Architecture (FEDORA)
http://www2.cs.cornell.edu/payette/papers/ECDL98/FEDORA.html
Geospatial and Statistical Data Center at the University of Virginia Library
http://fisher.lib.virginia.edu
County Business Patterns
http://fisher.lib.virginia.edu/cbp/
Regional Economic Projections (at the Geostat Center, University of Virginia Library)
http://fisher.lib.virginia.edu/projection/
Virginia TIGER/Line Data Browser
http://www.lib.virginia.edu/gic/spatial/tiger.browse.html
Institute for Advanced Technology in the Humanities
http://www.iath.virginia.edu/
MARC Standards
http://lcweb.loc.gov/marc/
The Salem GIS Project
http://fisher.lib.virginia.edu/projects/salem
Special Collections Digital Center at the University of Virginia Library
http://www.lib.virginia.edu/speccol/scdc/
University of Virginia Library
http://www.lib.virginia.edu
Virginia Center for Digital History
http://www.vcdh.virginia.edu/
VIRGO (University of Virginia Library Online Catalog)
http://virgo.lib.virginia.edu/cgi-local/vg.pl
Witchcraft in Salem Village
http://etext.virginia.edu/salem/witchcraft/
Mike Furlough
Geospatial and Statistical Data Center
University of Virginia Library