Mapping Cyberspace with GIS


Esri User Conference Paper No. 615

Submitted to the 1998 Agenda Committee by Paul Terpstra
from the Office of Mines and Minerals of the Illinois Department of Natural Resources
3 April 1998

Abstract

The potential for spatial analysis of the Internet is demonstrated in a procedure that uses the ArcInfo Geographic Information System (GIS) to map any World-Wide-Web address, or Uniform Resource Locator (URL), to a unique location in a 2-dimensional, unitless Cartesian coordinate system. A UNIX script called mapsites extracts URLs starting with "http://" from any site list and produces a decimal dump of rearranged URL segments. An Arc Macro Language (AML) program called cybermap.aml converts the URL dump to a file of X-Y coordinates to generate a point coverage. From other files output by mapsites, points on the map can then be given attributes such as URL, Title, or any other information present on the original list of web sites.

Two small sample files (about 2200 web sites total) were mapped to the "cybergrid" using a test version of cybermap.aml that uses only the top-level domain (TLD) and 2 characters from the second-level domain of each URL. Preliminary results revealed that cyberspace is still mainly "ocean" (empty space), scattered with about 210 "islands" (2-letter TLDs), and 7 "continents" (3-letter TLDs). Of the 7 continents, one (.int) appears to be virtually uninhabited, while another (.com) looks seriously overcrowded. A cyberspace map may prove to be a convenient method of depicting the rapid changes occurring in this newborn universe, especially in these first few years after the "big (cyber)bang".


Introduction

GIS users like to know where they are at all times. Our ability to track location is a skill of such value in spatial data analysis that we hope never to learn the meaning of the word "lost".
. . . and then along comes the Internet -- a marvelous and awesome new phenomenon that entices us with the promise of instant information, and then terrifies us by teaching us the meaning of the dreaded 'l-word'.

The ability to spatially perceive the real world does not help us navigate the Web for one simple reason: In cyberspace, there is no coordinate system.

The Office of Mines and Minerals at the Illinois Department of Natural Resources maintains a library of links to Internet resources on GIS, GPS, geology, the mining industry, and environmental regulations. Early attempts to categorize web sites by content were frustrated by the fuzzily defined topic boundaries and sites that defiantly refused to be pigeonholed into a specific category. A need was perceived for an objective arrangement that could precisely map each Internet address without ambiguity.

The current project is merely one of several approaches to the cybermapping cartographic challenge. Some efforts, such as Matrix Information and Directory Services, project cyberspace to real space, showing servers and backbones as points and lines existing in latitude and longitude. Others, such as Boardwatch Magazine, map the Internet by tracing its connections, emphasizing topological continuity more than precise location. An excellent, link-filled compendium of cybergeography research can be found in Martin Dodge's Atlas of Cyberspaces at the Centre for Advanced Spatial Analysis at University College London.

The pages in the LRD Web-site library are organized by URL fields and arranged alphabetically within these fields. The general format of the organization scheme is:

http://h.s.m.t/d/f

where	t = top-level domain
	m = main domain
	s = subdomain(s)
	h = host
	d = directory(ies)
	f = file
Fields m and t are present in every URL. If h and f are present, they occur only once. s and d may be present multiple times. The URLs are alphabetized by fields in the following order:

t -> m -> s -> h -> d -> f
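
The ordering above can be sketched in shell. This is a minimal illustration, not part of the library's own tooling: the function name url_sort_key is hypothetical, and the sketch assumes URLs that begin with "http://" in lowercase. It reverses the dot-separated host fields so that the top-level domain comes first, then appends the directory and file fields in their original order.

```shell
#!/bin/sh
# url_sort_key (hypothetical helper): print the fields of one URL
# in the order t m s h d f, separated by spaces.
url_sort_key() {
    rest=${1#http://}                 # drop the scheme
    host=${rest%%/*}                  # text before the first slash
    case $rest in
        */*) path=${rest#*/} ;;       # directories and file, if any
        *)   path= ;;
    esac
    host=${host%%:*}                  # drop an optional port number
    host=$(printf '%s' "$host" | tr 'A-Z' 'a-z')
    key=
    # Walk the host fields left to right, prepending each one,
    # so that h.s.m.t comes out as "t m s h".
    old_ifs=$IFS
    IFS=.
    for label in $host; do
        key="$label${key:+ $key}"
    done
    IFS=$old_ifs
    # Append directory and file fields in their original order.
    if [ -n "$path" ]; then
        key="$key $(printf '%s' "$path" | tr '/' ' ')"
    fi
    printf '%s\n' "$key"
}

url_sort_key 'http://www.esri.com/library/index.html'
```

Piping the keys for a whole list through sort would then give the alphabetical arrangement by field described above.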

One way to create a grid for plotting URL points would be to use IP numbers, which are unique to each host. This would be less complicated for two reasons:

  1. The fields of an IP address are already in sequential order.
  2. Since the fields are already numeric, there would be no need for a decimal dump.
Unfortunately, IP addresses have two major disadvantages:
  1. They would limit the cybermap to the scale of the host computer rather than that of the individual web pages. A single server (such as dnr.state.il.us or www.esri.com) may contain thousands of individual web pages.
  2. IP numbers are assigned sequentially, and thus have no mathematical relationship to the URL. A cyberspace map plotted on an IP-based grid would have temporal but not spatial significance. For example, the Illinois State Library at 199.15.3.3 and the Illinois State Museum at 163.191.200.94 should be neighbors, but they would be far apart in IP-space.
A discussion of IP numbers as a coordinate grid by Owen Rowley and Brian Behlendorf can be found in the VRML Hypermail Archive.
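
To make the second disadvantage concrete, the sketch below converts a dotted IP number to a single integer, which is one simple way to place hosts along an IP-based axis. The paper does not say how an IP grid would actually be built, so this linearization and the function name ip_to_int are assumptions made for illustration only.

```shell
#!/bin/sh
# ip_to_int (hypothetical helper): convert a dotted-quad IP number
# to one integer by weighting the four fields as base-256 digits.
# Assumes the fields carry no leading zeros.
ip_to_int() {
    old_ifs=$IFS
    IFS=.
    set -- $1                         # split the address on dots
    IFS=$old_ifs
    echo $(( $1 * 16777216 + $2 * 65536 + $3 * 256 + $4 ))
}

ip_to_int 199.15.3.3        # Illinois State Library
ip_to_int 163.191.200.94    # Illinois State Museum
```

The two addresses already differ in their first field, so the resulting integers are far apart even though both hosts belong to Illinois state agencies.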

Procedure

The maps shown below were produced by the following procedure:
  1. A UNIX shell script was written to extract URLs from a list, rearrange the fields of each URL into segments of decreasing map extent, and output a decimal dump of its ASCII characters:
     
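    # mapsites: extract host names from a URL list and dump them as decimal codes
    if [ $# -lt 2 ]
    then
      echo 'Usage: mapsites <infile> <outfile>'
    else
      echo Generating coordinate file $2 from URL list $1...
      # Keep lines containing a URL and mark the start of each host name with '{'
      grep -i 'http://' $1 | sed 's@http://@{@' > xx00
      # Isolate the host name (text between '{' and the first '/') in lowercase
      cut -d'{' -f2 xx00 | cut -d'/' -f1 | tr '[A-Z]' '[a-z]' > xx01
      # Keep dotted names containing letters, remove duplicates, strip any port number
      grep '\.' xx01 | grep '[a-z]' | sort -u | cut -d: -f1 > xx02
      # Split each host name into its dot-separated fields
      cut -d'.' -f1 xx02 > xf1
      cut -d'.' -f2 xx02 > xf2
      cut -d'.' -f3 xx02 > xf3
      cut -d'.' -f4 xx02 > xf4
      cut -d'.' -f5 xx02 > xf5
      cut -d'.' -f6 xx02 > xf6
      cut -d'.' -f7 xx02 > xf7
      # Rejoin the fields in reverse order (top-level domain first), separated by '@'
      paste -d@ xf7 xf6 xf5 xf4 xf3 xf2 xf1 | tr -s '@' '@' > xx03
      # Dump the top-level and second-level domains as decimal byte values
      cut -d@ -f2-3 xx03 | od -tu1 | cut -c9-80 > xx04
      # Recode newline (010) as record separator '@' and '@' (064) as field separator 'Z';
      # this assumes an od that pads each value to three digits
      sed 's/010/@/g' xx04 | sed 's/064/Z/g'> xx05
      # Join the dump into one numbered line per host name
      paste -d' ' -s xx05 | tr '@' '\012' | sed 's/^ //' | nl > $2
      echo Your coordinate file is $2.
    fi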
  2. An AML was written to generate a coordinate file from the output of step 1. A map of the ASCII character decimal values was used to choose conversion factors to generate a coordinate file for an X-Y grid centered at 0,0.
    /* cybermap.aml: Converts decimal dump of URL list to X-Y coordinate file
    /* Revised 03 April 1998
    /*
    &args infile outfile
    &if [null %infile%] or [null %outfile%] &then
    &do
      &type 'Usage: &r cybermap <infile> <outfile>'
      &return
    &end
    &else
    /* 
    /* Open input and output files, and read input file:
    /*
    &s openin [open %infile% openinstat -read]
    &s openout [open %outfile% openoutstat -write]
    &s line [read %openin% readstatus]
    &do &while %readstatus% = 0
    /*
    /* Extract elements from current line and process them:
    /*
      &s id [extract 1 [unquote [before %line% Z]]]
      &s n1 [extract 2 [unquote [before %line% Z]]]
      &s n2 [extract 3 [unquote [before %line% Z]]]
      &s n3 [extract 4 [unquote [before %line% Z]]]
      &s n4 [extract 1 [unquote [after %line% Z]]]
      &s n5 [extract 2 [unquote [after %line% Z]]]
      &s n6 [extract 3 [unquote [after %line% Z]]]
      &s n7 [extract 4 [unquote [after %line% Z]]]
      &s x1 [calc ( %n1% - 110 ) * 100]
      &s y0 [calc ( %n2% - 110 ) * 100]
      &if [null %n3%] &then &s y1 %y0%
      &else &s y1 [calc %y0% + ( %n2% - 50 )]
      &s x2 [calc ( %n4% - 110 ) * 10]
      &s y2 [calc ( %n5% - 110 ) * 10]
      &if ^ [null %n7%] &then
      &do
        &s x3 [calc %n6% - 110 ]
        &s y3 [calc %n7% - 110 ]
      &end
      &else
      &do
        &s x3 0
        &s y3 0
      &end
      &s x [calc %x1% + %x2% + %x3%]
      &s y [calc %y1% + %y2% + %y3%]
    /*
    /* Write record to output file, and read next line:
    /*
      &s record %id% %x% %y%
      &s write [write %openout% [quote %record%]]
      &s line [read %openin% readstatus]
    &end
    /*
    /* Close files and exit
    /*
    &s closein [close %openin%]
    &s closeout [close %openout%]
    &type Your output file is %outfile%.
    &return
  3. A point coverage was generated from the coordinate file:
    	Arc: generate ccmap
    	Generate: input genxy
    	Generate: points
    	Creating points with coordinates loaded from genxy
    	Generate: q
    	Externalling BND and TIC...
  4. Items URL and TLD were added to the coverage:
    	Arc: build ccmap point
    	Arc: additem ccmap.pat ccmap.pat URL 40 40 c
    	Arc: additem ccmap.pat ccmap.pat TLD 3 3 c
  5. Grid lines were generated.
  6. The items URL and TLD were given values with an AML derived from the temporary file xx02, generated in Step 1.
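
As a worked example of the coordinate arithmetic in Step 2, the sketch below repeats the calculation from the cybermap.aml listing for a single top-level and second-level domain. It is a shell restatement for illustration, not the AML itself: the function names ord, char_at, and cyber_xy are hypothetical, and the sketch assumes lowercase names with at least two characters in each field.

```shell
#!/bin/sh
# ord: print the decimal ASCII code of a single character.
ord() {
    printf '%d' "'$1"
}

# char_at: print the Nth character (1-based) of a string, or nothing.
char_at() {
    printf '%s' "$1" | cut -c"$2"
}

# cyber_xy (hypothetical): print "x y" for a top-level domain and a
# second-level domain, using the constants from the AML listing.
cyber_xy() {
    tld=$1
    sld=$2
    n1=$(ord "$(char_at "$tld" 1)")
    n2=$(ord "$(char_at "$tld" 2)")
    x1=$(( (n1 - 110) * 100 ))
    y1=$(( (n2 - 110) * 100 ))
    # A 3-letter TLD gets the extra offset that the listing applies.
    if [ -n "$(char_at "$tld" 3)" ]; then
        y1=$(( y1 + n2 - 50 ))
    fi
    n4=$(ord "$(char_at "$sld" 1)")
    n5=$(ord "$(char_at "$sld" 2)")
    x2=$(( (n4 - 110) * 10 ))
    y2=$(( (n5 - 110) * 10 ))
    x3=0
    y3=0
    # The third and fourth characters count only when a fourth exists.
    if [ -n "$(char_at "$sld" 4)" ]; then
        x3=$(( $(ord "$(char_at "$sld" 3)") - 110 ))
        y3=$(( $(ord "$(char_at "$sld" 4)") - 110 ))
    fi
    echo "$(( x1 + x2 + x3 )) $(( y1 + y2 + y3 ))"
}

cyber_xy com esri
cyber_xy us il
```

Because the character code 110 (the letter n) is subtracted at every level, names near the middle of the alphabet fall near the origin, and each successive level shifts the point by a factor of ten less than the level before it.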

Results

The figure above shows the TLD-based cyberspace grid with two point coverages plotted. The cyan coverage -- representing 1600 randomly selected sites -- shows the high-density clustering of sites in some of the 3-letter domains, especially .com (left center), .edu (bottom left), and .net (bottom center). The 600 red points were collected primarily from four sites known to have links to a large number of top-level domains:

The figure below is an ArcView display of URL text for numerous sites in the northwest part of the map. ArcView can append a field from a TLD table to the active theme table, so that when a point is clicked, the text in the "Identify Results" box lists the country associated with the TLD. The selected element (shown in yellow) is identified as a web site in Azerbaijan.

Conclusions

  1. When an ordered coordinate system is imposed on the chaos of the Internet, GIS can be a powerful tool to display information about cybertopology.

  2. This test represents only one of many possible coordinate systems for cyberspace, but the grid is more helpful than an IP-based system, because it clusters related sites together.

  3. This test was limited to the first- and second-level domains, but it could easily be extended downward to the host, site, and directory levels. Every web page has a unique location.

  4. This test mapped a very small number of URLs (about 2200 out of over 300 million), but the automated procedure could theoretically map a list of sites the size of the HotBot index.

  5. There is plenty of wide-open cyberspace out there. Only about a third of the 676 possible 2-letter TLDs are in use, so it's a mystery why so many newbies feel compelled to squeeze into the crowded 3-letter domains.