Jeffrey S. Malovich

USING DATA EXTRACTION FOR GIS DATABASE POPULATION AND VISUALIZATION

INTRODUCTION

A key problem for users of information systems (information analysts, decision makers, planners) is the magnitude of data that has become available as a result of improvements in the acquisition of machine-readable text and the dissemination of textual documents across proliferating networks. It is estimated that the total amount of data in the world is doubling every 12 to 15 months. The rapidly growing breadth of information sources makes it increasingly difficult to distill information for analysis. Tools are needed to 'mine' information on an ongoing basis to bring out key, relevant facts for analysis, and to support interfaces to analytical tools. Helping a user rapidly filter and assimilate useful information from a variety of data sources is a major way to leverage the individual's productivity and meet the mission-critical objectives and deadlines of the organization.

Another source of leverage is the set of tools being developed for data transformation and presentation. These visualization tools provide substantial assistance in allowing users to formulate, explore, and evaluate textual content using visual characteristics rather than a textual interface. However, for these tools to be practical, a mechanism must be available to extract and normalize the data on which they operate; often these data are embedded in messages, documents, or other textual sources.

Data Extraction tools now exist to provide the mechanism to perform the extraction and normalization on textual data from a variety of sources. These tools can be combined to address the following principles:

  1. Provide new operating principles that allow users to find and identify information without spending weeks or months organizing or reading all individual documents supporting a research task.
  2. Provide users with tools to collect, compile and organize information, and help understand the meaning of information.

These tools are designed to facilitate navigation through a text collection in search of evidence or information that cannot be readily defined with just a search query. They provide uniform methods for analyzing and evaluating information used to support analysis and decision making. Data extraction tools are used to discover, access, and retrieve information from a variety of open and closed sources of textual information.

Products and capabilities can be combined with data extraction tools to allow users to understand and analyze more data in less time, and provide a mechanism to convert the data into useful information in a cost-effective and timely manner. Together, these tools provide an end-to-end text processing system that incorporates commercial off-the-shelf (COTS) technologies where practical, and includes value-added tools where technology 'holes' exist. This engineering approach reduces the time and costs normally associated with development projects. The approach is highly modular, so each tool can operate independently or in conjunction with the others. Additionally, this design easily allows for the incorporation of new technologies and capabilities as new requirements are identified.

Key to improving the analysis process is reducing the effort required to get information into analysis tools. Techniques are available to automatically prepare data extraction results for input to databases, geographic information systems, data visualization, and other analysis tools. These techniques enable users and support staff to quickly and easily follow links from one representation of the data to another. For example, the user can search for documents that reference computer hardware acquisitions between countries, and request that the results be displayed on a map or timeline. The resultant display may show a world map with lines between countries that have been involved in purchases and/or acquisitions of computer hardware technology. The user can now understand the activity spatially, and may more easily discern patterns or trends that would be hard to detect by reading the text alone.

CONCEPT

Information retrieval is the initial process in which users retrieve a set of documents of interest by entering a representative set of terms. From an operational perspective, extraction simply allows the user to ask for more specific information. Extraction retrieves references to entities and entity relationships. Integrating extraction with traditional search technology simplifies the user's conceptualization of the problem. Queries can contain any combination of terms, entities and entity relationship information. Examples of queries include:

The concept for data extraction is derived from the demands by users to access and process data, and quickly assimilate it into usable information. The resultant capabilities provide a method for users to access and analyze information in a seamless environment. When combined into a system, the data extraction tools allow users to quickly understand the following:

Who is in the text (companies, individuals, etc.)?

What happened?

When did things occur?

Where did things occur?

How much?

Why? Causes?

How?

SOURCES

Data extraction tools can be tailored to accept text from a variety of open and closed source systems. Open source systems deliver information from a variety of subscription services (e.g., Dialog, NEXIS, and DataTime). Other open source systems include newswires (e.g., AP and UPI) and Internet Newsgroups.

Closed source systems include text from intranet sources (e.g., corporate databases, e-mail, and word processing documents). Additionally, users may have static sources of information (e.g., OCR text) that can be accessed and processed by the data extraction tools.

RETRIEVAL

The data extraction tools can quickly process text based on keyword searches and Boolean logic. Once the text has been processed, the user can extract and categorize the text according to a specific domain (i.e., concept) that has been pre-defined. Use of this tool does not require the source ASCII text to be indexed or formatted. This tool is also used to pre-process the textual data that populate the data visualization tools.

EXTRACTION

Once the information is extracted from the text, it is presented to the user in an intuitive manner so that the user can quickly assess and interpret the content. The Entity Extraction process uses semantic information to locate entities in the documents. This information is then processed by a Relationship Extractor that identifies relationships among entities within documents.

These extraction methods process textual data and extract template information to allow a user to quickly review the document content and assess whether it contains relevant information to the analytic process. The user has the option to review only the template information (i.e., pre-defined values) or review the entire article. The template information can be used to populate a database at the user's discretion. This structure can be used as a decision support system or used by other applications to support database development and visual representation of information.

VISUALIZATION

Visualization tools may consist of COTS products that can be integrated to work with the data extraction tools. One of the visualization tools, ArcView II, assists the user in interpreting, identifying, and analyzing relationships from data by providing a visual mechanism that reflects relationships. This COTS product helps the user to quickly visualize correlations and associations through graphic representations, thereby reducing the amount of text that has to be reviewed and analyzed. This tool provides a capability to spatially represent the geographic relationships described in textual documents.

FUNCTIONAL DESCRIPTION

The retrieval tools allow a user to retrieve information from a variety of sources. Using the various tools, the user can limit and control the data retrieval process. Once retrieved, the data can be processed by the data extraction capabilities. Extracted data can be presented to the user using the visualization tool incorporated into the tool set.

RETRIEVAL

The retrieval environment allows a user to generate and execute structured queries against different information sources without having to know the identity or structure of the source information. The retrieval process allows a user to perform both keyword and concept searches on raw (i.e., unindexed) textual data. After the target text files have been selected, the user provides a keyword to search for. The keyword can be combined with another keyword to construct a Boolean operation for searching. Once the keywords are provided, the user "launches" the query and receives immediate feedback on the number of documents processed and matches found.
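The keyword and Boolean search over unindexed text can be illustrated with a minimal Python sketch. It is not the tool described here; the function names, the set of operators, and the sample documents are assumptions made for illustration. Each document is scanned directly, with no index, and the function reports how many documents were processed and which ones matched.

```python
import re

def contains(text, keyword):
    """True if keyword occurs in text as a whole word or phrase (case-insensitive)."""
    pattern = r"\b" + re.escape(keyword) + r"\b"
    return re.search(pattern, text, re.IGNORECASE) is not None

def boolean_search(documents, first, second=None, operator="AND"):
    """Scan raw documents; return (number processed, titles of matching documents)."""
    matches = []
    for title, text in documents.items():
        hit = contains(text, first)
        if second is not None:
            other = contains(text, second)
            if operator == "AND":
                hit = hit and other
            elif operator == "OR":
                hit = hit or other
            elif operator == "NOT":
                hit = hit and not other
            else:
                raise ValueError("unsupported operator: " + operator)
        if hit:
            matches.append(title)
    return len(documents), matches

# Hypothetical sample documents (title -> raw text).
documents = {
    "doc1": "Microsoft announced a merger with a software firm.",
    "doc2": "The central bank raised interest rates.",
    "doc3": "Microsoft reported quarterly earnings.",
}

processed, found = boolean_search(documents, "Microsoft", "merger", "AND")
print(processed, "documents processed;", len(found), "matches:", found)
```

Because nothing is indexed, every query rereads every document. That cost is the trade-off for not requiring the source text to be prepared in advance.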

After the initial search has been performed, the user has a refined set of documents that may be of interest. The user may review the articles or further refine the search criteria by selecting a predefined concept. For example, a user may perform an initial search of financial documents for articles on the Microsoft Corporation and reduce the number of documents from thousands to hundreds. Using the resultant document set, the user may want to search based on a "merger" concept. The search would be performed against the current document set for terms the user has predefined to be associated with the concept (e.g., "merger", "acquisition", "buyout", etc.).
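A concept of this kind can be modeled as a named list of terms that is applied to the current document set. The Python sketch below assumes a simple in-memory dictionary of documents and plain substring matching; the names `CONCEPTS` and `concept_search` are hypothetical, and the term list comes from the "merger" example above.

```python
# A concept is a user-defined name mapped to the terms associated with it.
CONCEPTS = {
    "merger": ["merger", "acquisition", "buyout"],
}

def concept_search(documents, concept, concepts=CONCEPTS):
    """Keep only the documents that mention at least one term of the concept."""
    terms = concepts[concept]
    return {
        title: text
        for title, text in documents.items()
        if any(term in text.lower() for term in terms)
    }

# Hypothetical result set left over from an earlier keyword search.
current_set = {
    "a": "Microsoft agreed to a buyout of the firm.",
    "b": "Microsoft released a new product.",
    "c": "Talks about an Acquisition stalled.",
}

refined = concept_search(current_set, "merger")
print(sorted(refined))
```

Substring matching also catches plurals such as "acquisitions", at the price of occasional false hits inside longer words.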

Once the user has processed the documents according to the search criteria, the resulting documents are displayed for review. The initial display contains the title of the document and the document paragraph that contains the keyword (or concept). The user may elect to review the next document paragraph or review the entire parent document that contains the displayed paragraph. The user may also copy the displayed document to Windows Notepad, or insert it into other software components (e.g., Microsoft Word).

The retrieval component may also perform preprocessing of the documents for inclusion into the visualization component. The preprocessing capability performs the requisite activity of organizing the text and formatting a database that is required by the ArcView COTS software. This activity insulates the user from having to understand and perform the data preparation activity that is involved in using the ArcView product.
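The exact table format that the preprocessing step produces is not specified here. As a rough Python sketch, assume the GIS can load a delimited text table with coordinate columns, and that place names are resolved through a small gazetteer. The gazetteer entry, its coordinates, and the column names are placeholders, not real data.

```python
import csv
import io

# Hypothetical gazetteer: place name -> (longitude, latitude). Placeholder values.
GAZETTEER = {
    "Placeville": (10.0, 20.0),
}

def to_point_table(records, gazetteer=GAZETTEER):
    """Turn (document title, place name) pairs into a delimited point table."""
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow(["title", "place", "lon", "lat"])
    for title, place in records:
        if place in gazetteer:  # places that cannot be located are skipped
            lon, lat = gazetteer[place]
            writer.writerow([title, place, lon, lat])
    return out.getvalue()

table = to_point_table([("doc1", "Placeville"), ("doc2", "Unknown Town")])
print(table)
```

In practice the table would be written to a file that the visualization component reads; the in-memory buffer keeps the sketch self-contained.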

DATA EXTRACTION

The first step toward developing an extractor is conceptualization. The result of conceptualization is a well-defined problem in which the pieces needed for the solution have been identified. These pieces include: what entities are of interest; how they can be categorized; what activities and relationships are of interest; and a determination of where the entities and relationships should occur (same paragraph, sentence, document). It is not necessary to determine every single entity (company, technology, person, etc.) or relationship of interest during the conceptualization phase. New entities and relationships that are discovered during research can be added to the extractor later.

A thesaurus is constructed that lists the various forms in which a particular subject or relationship might appear in the relevant sources. For example, for the entry for Britain in the conceptualization, you might use a Britain lexicon (dictionary) that contains the following thesaurus entries: GB, UK, United Kingdom, England, Briton, British, London, John Major.
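A lexicon of this kind maps one canonical entry to all of its surface forms. The Python sketch below uses the Britain entries listed above; the `LEXICON` structure and the `find_entities` function are illustrative assumptions, not the format used by any particular extraction product.

```python
import re

# Canonical entity -> thesaurus entries (surface forms), taken from the Britain example.
LEXICON = {
    "Britain": ["GB", "UK", "United Kingdom", "England", "Briton",
                "British", "London", "John Major"],
}

def find_entities(text, lexicon=LEXICON):
    """Return {canonical entity: surface forms found in text}, matching whole words."""
    found = {}
    for entity, forms in lexicon.items():
        hits = [form for form in forms
                if re.search(r"\b" + re.escape(form) + r"\b", text)]
        if hits:
            found[entity] = hits
    return found

print(find_entities("The British delegation left London on Tuesday."))
# prints {'Britain': ['British', 'London']}
```

Matching is case-sensitive here so that abbreviations such as "GB" and "UK" are not confused with lowercase fragments of other words.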

The extractor is built based on the conceptualization. Extractors can be run against all available documents, or any subset thereof. When looking for more in-depth information, workgroup or personal extractors can use any existing entity and/or relationship information. Once an extractor is created, it can be applied by users to any or all documents, given appropriate security access. An existing extractor can be modified through extension, replacement, addition, elevation or demotion.

The data extraction process identifies predefined template information from raw (i.e., unindexed) textual data based on algorithms that use a defined lexicon, rules of grammar, and evidential abstraction. A template data structure that reflects the data extraction terms is applied to the text.
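The algorithms themselves are not detailed here, so the following Python sketch only illustrates the general idea of filling a template from text. It assumes a tiny company lexicon and a single hand-written grammar rule of the form "<seller> sold <item> to <buyer>"; the company names, the rule, and the `Transaction` template are all hypothetical.

```python
import re
from dataclasses import dataclass

@dataclass
class Transaction:
    """Template whose fields are filled from the text."""
    seller: str
    buyer: str
    item: str

# Hypothetical lexicon of company names.
COMPANIES = ["Acme Corp", "Globex"]

# One toy grammar rule built from the lexicon: "<seller> sold <item> to <buyer>".
_names = "|".join(re.escape(name) for name in COMPANIES)
SALE_RULE = re.compile(
    rf"(?P<seller>{_names}) sold (?P<item>[\w ]+?) to (?P<buyer>{_names})"
)

def extract_transactions(paragraph):
    """Apply the rule to one paragraph and return the filled templates."""
    return [
        Transaction(m.group("seller"), m.group("buyer"), m.group("item"))
        for m in SALE_RULE.finditer(paragraph)
    ]

paragraph = "Last year Acme Corp sold 200 workstations to Globex."
for transaction in extract_transactions(paragraph):
    print(transaction)
```

A real extractor would combine many such rules with a much larger lexicon; the point of the sketch is only that a lexicon plus a grammar rule yields structured field values from unstructured text.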

Once the user has processed the documents according to a selected template, the resulting document paragraphs are displayed to the user for review. The document display contains the title of the document and the document paragraph that contains the source data from which the template information was derived. The user is also provided with the option to review the parent document from which the displayed paragraph has been extracted.

The transaction portion of the screen displays the template fields and values that were extracted from the text. After reviewing the transaction, the user may save the document transaction for further processing, or delete it from the data set. Saved transactions can be used to populate predefined databases (e.g., Microsoft Access, Sybase, Lotus Notes). The user is also provided with a separate "Notes" area to record information pertaining to each transaction.
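Populating a database from saved transactions can be sketched with Python's built-in `sqlite3` module. SQLite stands in for the databases named above purely for convenience, and the table layout (seller, buyer, item, notes) is an assumed example of template fields plus the analyst's notes.

```python
import sqlite3

def save_transactions(conn, transactions):
    """Insert reviewed transactions (seller, buyer, item, notes) into the database."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS transactions "
        "(seller TEXT, buyer TEXT, item TEXT, notes TEXT)"
    )
    conn.executemany(
        "INSERT INTO transactions VALUES (?, ?, ?, ?)", transactions
    )
    conn.commit()

# In-memory database so the sketch leaves no files behind.
conn = sqlite3.connect(":memory:")
save_transactions(conn, [
    ("US", "France", "computer hardware", "reviewed by analyst"),
    ("US", "Germany", "computer hardware", ""),
])

rows = conn.execute(
    "SELECT seller, buyer FROM transactions ORDER BY buyer"
).fetchall()
print(rows)
```

Deleted transactions are simply never passed to `save_transactions`, which mirrors the review step: only what the user chooses to keep reaches the database.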

VISUALIZATION

The ArcView component provides COTS tools that are customized to help a user find, "see", and understand the meaning of large amounts of information. The tool assists in visualizing patterns, trends, and relationships through the use of icons, objects, and color graphics.

ArcView is used to construct graphical representations of the links and relationships within textual or tabular data. It allows the user to see the critical ties between people, places, and other important elements by providing an intuitive mechanism for understanding the meaning of data, drawing conclusions, and anticipating trends. This type of visualization makes it easier to understand the correlations and associations represented in the data.

Using GIS as a visualization tool allows users to extend their analysis and understanding of information derived from data. For example, the user may perform a search on nuclear plants. The resultant search and extraction process may indicate that North Korea is building a new nuclear facility at Pungsan. A point is positioned at the geographic location of the city. The user can now understand the spatial infrastructure around Pungsan, including roads, population, vegetation, etc. The user can perform proximity analyses, comparing the plant location with respect to lakes and rivers. Additionally, the user can quickly understand where other plants are in relation to where this plant is being built. All of these features may be important in analyzing and understanding why the country has decided to build a facility there.
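A proximity analysis of this kind reduces to measuring distances between the site and nearby features. The Python sketch below uses the standard haversine great-circle formula with a mean Earth radius of 6371 km. It is not how ArcView performs the analysis, and the site and feature coordinates are placeholder values rather than real locations.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometers between two (lat, lon) points in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = math.radians(lat2 - lat1)
    dlam = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))

def features_within(site, features, radius_km):
    """Names of features whose distance from the site is at most radius_km."""
    return [
        name for name, (lat, lon) in features.items()
        if haversine_km(site[0], site[1], lat, lon) <= radius_km
    ]

# Placeholder coordinates (latitude, longitude); not real locations.
site = (0.0, 0.0)
features = {"river": (0.0, 0.1), "lake": (0.0, 5.0)}

print(features_within(site, features, radius_km=50))
```

The same function answers the "where are the other plants" question if the feature table holds plant locations instead of lakes and rivers.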

A user may want to understand technology trade activity between the US and Europe. This relationship can be stated by: "show me all activity where the US is the seller and Europe (defined as France, Germany, England, Spain, Italy) is the buyer." The user can elect to represent different technologies via distinct line colors, and quantities by line thickness. A map would be displayed showing color-coded lines of varying thickness between the US and the European countries. The user could quickly understand the type and quantity of trade activity. The user may further refine the query to reflect date ranges, technology types, etc.
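The query and its mapping to line styles can be sketched in Python as a filter and aggregation over extracted transactions. The record fields, the color table, and the width scaling are assumptions made for illustration; only the seller, the list of European buyers, and the color and thickness idea come from the example above.

```python
from collections import defaultdict

# Europe as defined in the example query.
EUROPE = {"France", "Germany", "England", "Spain", "Italy"}

# Hypothetical technology -> line color assignments.
COLORS = {"computers": "red", "aircraft": "blue"}

def trade_lines(transactions, seller="US", buyers=EUROPE):
    """Aggregate matching transactions into one styled map line per buyer and technology."""
    totals = defaultdict(int)
    for t in transactions:
        if t["seller"] == seller and t["buyer"] in buyers:
            totals[(t["buyer"], t["technology"])] += t["quantity"]
    largest = max(totals.values(), default=1)
    return [
        {"from": seller, "to": buyer,
         "color": COLORS.get(technology, "gray"),
         "width": 1 + 4 * quantity / largest}  # width grows with traded quantity
        for (buyer, technology), quantity in sorted(totals.items())
    ]

# Hypothetical extracted transactions.
transactions = [
    {"seller": "US", "buyer": "France", "technology": "computers", "quantity": 10},
    {"seller": "US", "buyer": "France", "technology": "computers", "quantity": 30},
    {"seller": "US", "buyer": "Germany", "technology": "aircraft", "quantity": 20},
    {"seller": "Japan", "buyer": "France", "technology": "computers", "quantity": 5},
    {"seller": "US", "buyer": "Japan", "technology": "computers", "quantity": 7},
]

for line in trade_lines(transactions):
    print(line)
```

Refining the query by date range or technology type would add further conditions to the filter before aggregation.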

These types of questions can be presented in different ways, according to user preference. Using data extraction and database population, this type of information can be quickly assimilated and made available with little user intervention. Additionally, the information can be easily included in reports and briefings so that an audience can quickly and easily grasp the meaning of the data.

CONCLUSION

This is an era where computer technology is permeating the company environment and vast amounts of data are becoming more available and accessible. Companies are having to do more work with a smaller budget and workforce to remain competitive. It is essential to have tools that allow people to extract information from a wide variety of data sources, and quickly understand its meaning. Data Extraction provides a mechanism to quickly populate databases and make the information a strategic resource. Combining this capability with visualization tools can quickly transform data into a corporate resource to facilitate the planning and decision making activity.

Jeffrey S. Malovich
TASC, Inc.
12100 Sunset Hills Road
Reston, VA 22090
Telephone: (703) 834-5000
Fax: (703) 318-7900
E-mail: jsmalovich@tasc.com