Digital Preservation Case Study:

The Internet Archive http://www.archive.org/

Background/Personnel

The Internet Archive is a 501(c)(3) nonprofit that was founded in 1996 to archive, preserve, and provide access to our collected memory in, and of, cyberspace. The Archive defines itself as an “Internet Library”, and its mission is to provide free and permanent access to all publicly posted web pages. As one might imagine, this is no small feat. While the Archive must deal with some of the same issues as an organization undertaking a digital reformatting project (and truly, that is all most digital preservation projects are), preserving digitally born materials raises many distinct troubles of its own, not to mention, in this project, problems of scale.

Even the libraries, museums, and archives that have made commitments to actual digital preservation projects are really only attempting to preserve (to invert a phrase from James O’Donnell) ‘old wine in new bottles’. The Internet Archive’s goal is to preserve the new wine in the new bottles, and allow researchers, historians, and the public continual access to the history of the World Wide Web.

The thought of archiving the internet, of attempting to preserve the web, is almost ludicrous: it is mostly unorganized, vast, deep, and constantly in flux. How does one even begin going about such a task? To understand the process, it’s best to start from the beginning, and that beginning is the closely related Alexa Internet.

Alexa Internet is a web navigation service that collects data about the web through “crawls”. A crawl is carried out by a robot (similar to the ones search engines use for indexing) that starts at a site and follows each link on each page it finds, eventually covering all publicly accessible web sites. Each crawl takes about 2 months and gathers about 100 gigabytes per day. This data is recorded and used by Alexa to improve the services of their free downloadable web navigation toolbar. When a user of Alexa arrives at a site, Alexa’s servers search their databases for relevant information about the site and display it. These services are broken down into three parts. The first is Site Stats, which includes who has registered the domain name and their contact information, Securities and Exchange Commission (SEC) information about companies, how recently a site has been updated (freshness), how fast the connection is, and other important statistics. The second is Relevant Links, which uses data mined from usage paths and user suggestions to build a list of sites related to the one the user is at. The third, and a greatly helpful one, is the Archive of the Web. When a user’s browser returns the common and frustrating “404” page-not-found error, Alexa searches its database of crawled sites to see if it has an archived version. If it does, the user can view the site from Alexa’s servers.
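The link-following crawl described above can be sketched in a few lines. This is only an illustration, not Alexa’s actual robot: the toy link graph, the example URLs, and the function name `crawl` are all hypothetical, and a real crawler would fetch pages over the network rather than read them from a dictionary.

```python
from collections import deque

# Hypothetical toy "web": each URL maps to the links found on that page.
TOY_WEB = {
    "http://a.example/": ["http://a.example/about", "http://b.example/"],
    "http://a.example/about": ["http://a.example/"],
    "http://b.example/": ["http://c.example/"],
    "http://c.example/": [],
}


def crawl(start_url, web):
    """Visit every page reachable from start_url, breadth-first."""
    visited = []               # pages in the order they were crawled
    seen = {start_url}         # guards against following a link twice
    queue = deque([start_url])
    while queue:
        url = queue.popleft()
        visited.append(url)
        for link in web.get(url, []):
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return visited


order = crawl("http://a.example/", TOY_WEB)
print(order)
```

The `seen` set is what keeps the robot from looping forever when two pages link to each other, as the first two pages in the toy graph do.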

So, how does the Internet Archive fit into all of this? Well, Alexa donates all of its web crawls to the Internet Archive (which is registered as a charity). Since the two founders of Alexa Internet, Brewster Kahle and Bruce Gilliat, also make up half of the board of the Internet Archive, this seems to be a fully beneficial symbiotic relationship, with everyone, even the public, coming out ahead.

The Internet Archive is currently made up of 4 board members [bios taken from Internet Archive Web site]:

Brewster Kahle: Brewster is an engineer by profession and an archivist at heart. He designed supercomputers for Thinking Machines and helped found WAIS, Inc., Alexa Internet, and the Internet Archive.

Peter Lyman: Peter is university librarian and a professor in the School of Information Management and Systems at the University of California, Berkeley.

Kathleen Burch: Kathleen has helped start and run nonprofits, including the San Francisco Center for the Book, since 1973. She is a Xerox PARC Artist in Residence for 2000.

Bruce Gilliat: Bruce, with a background in networking and online content strategies, is a cofounder (with Brewster Kahle) of Alexa Internet.

The Internet Archive also has 5 staff members: Marlita Kahn (Managing Director), Kurt Bollacker (Technical Director), Joseph Kacmarcik (Systems Administrator), Melanie Farley (Facilities Project Manager), and Belinda Greene (Office Manager).

Content/Selection Criteria:

Although the bulk of the Internet Archive’s content comes from Alexa (there is a six-month waiting period between when they collect the data and when they donate it), donations are welcome from other institutions. Alexa’s donated crawls from 1996 to late 1998 include images and other media, but the collection since then has included only ASCII text. The Internet Archive plans to do its own crawls in the near future, both to enlarge its collection and to fill in the multimedia and image files that have not been included in the more recent crawls. As of now (Spring 2000), 2.7 TB is available for the public to access through the Archive’s Unix machines. The sheer amount of data collected makes it unrealistic to attempt web access to the collections; essentially it would be like a map maker drawing a to-scale map of the earth. The site does, however, give a taste of what their digital preservation mission has helped others do. For instance, the Smithsonian has used web pages collected by the Internet Archive to create a historical exhibit of how politicians used the Web during the 1996 elections. The Internet Archive site gives a demonstration (a flashing tease, more like it) of sites archived from the 1996 season, as well as a nice hyperlinked tour of some of the preserved sites the Smithsonian used in their exhibit. The site also names several other big institutions, such as Xerox PARC, IBM, AT&T, and the Library of Congress, with links and brief explanations of how these researchers are using the Internet Archive’s collections.

Since their goal is to archive the public space of the internet, there are no real selection criteria. The policy is geared more towards what is not included than what is. Sites that are password protected, on private servers, or whose owners request not to be crawled are not archived. Site owners may also ask the Internet Archive to remove any information collected from their site. Their site states this very clearly, and provides information on how webmasters can write simple HTML code to prevent robots from crawling their pages. Although the Internet Archive has collected small amounts of FTP and USENET data, no chat or e-mail is collected at all. They do, however, warn that since they archive publicly posted sites, and people do publicly post personal information on their own, it is possible that personal information could be archived. Once again they remind the user that a request can be made to remove any collected data. Yet there is no easy way to determine whether your site has been crawled. One way could be to download Alexa, visit your own site, and see if it has been archived. Perhaps the Internet Archive could develop a search engine for their collection so that users could determine this, though judging by the scope and size of the collection, this could be a difficult project.
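One widely used way for a site owner to keep robots out is the robots exclusion convention, in which a crawler consults a robots.txt file before fetching pages. The sketch below assumes that convention purely for illustration (the Archive’s own instructions are not reproduced here); the rules, the agent name `example-archiver`, and the URLs are hypothetical.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: every robot is asked to stay out of /private/.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())


def may_archive(url, agent="example-archiver"):
    """Return True if the site's rules allow this robot to fetch url."""
    return parser.can_fetch(agent, url)


print(may_archive("http://www.example.com/index.html"))
print(may_archive("http://www.example.com/private/diary.html"))
```

A well-behaved crawler would run a check like `may_archive` before every request and simply skip any page for which it returns False.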

In the Collection:

[tables from Internet Archive site: http://www.archive.org/collections/index.html]

World Wide Web Pages in the Archive

DATES:		October 1996 to now
SIZE:		13.8 terabytes (about 1 billion pages, text only during 1999)
RATE OF GROWTH:		About 2 terabytes a month as of March 2000
ACCESSIBLE:		From late 1998 to six or more months ago (the collection contains no material less than six months old), or about 3 terabytes as of March 2000. We hope to make the rest of the material (collected from late 1996 to late 1998) available during 2000
ACCESS:		See the Archive’s Terms of Use
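The figures quoted so far can be cross-checked with some rough arithmetic. The sketch below assumes decimal units (1 TB = 10^12 bytes) and reads “about 2 months” as 60 days; both are assumptions made for the calculation, not statements from the Archive.

```python
GB = 10**9   # decimal gigabyte (assumption)
TB = 10**12  # decimal terabyte (assumption)

# Web collection: 13.8 terabytes holding about 1 billion pages.
collection_bytes = 13_800 * GB
page_count = 10**9
avg_page_bytes = collection_bytes // page_count

# One crawl: about 100 gigabytes per day for about 2 months (taken as 60 days).
crawl_days = 60
crawl_total_tb = (100 * GB * crawl_days) / TB

print(avg_page_bytes)   # average bytes per archived page
print(crawl_total_tb)   # terabytes gathered by one full crawl
```

On these assumptions an average archived page is roughly 14 kilobytes, and a single full crawl gathers on the order of 6 terabytes.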

FTP Sites in the Archive

DATES:		July to October 1996
SIZE:		0.05 terabyte (about 50,000 sites)
RATE OF GROWTH:		N/A; collection of FTP sites is currently on hold
ACCESSIBLE:		No access at present; we hope to move the collection from (slow) tape to (faster) disk and provide access during 2000
ACCESS:		See the Archive’s Terms of Use

Usenet Bulletin Boards in the Archive

DATES:		October 1996 to late 1998
SIZE:		0.592 terabyte (about 16 million postings)
RATE OF GROWTH:		N/A; collection of Usenet bulletin boards is currently on hold
ACCESSIBLE:		No access at present; we hope to move the collection from (slow) tape to (faster) disk and provide access during 2000. Try www.deja.com, which maintains a more complete collection
ACCESS:		See the Archive’s Terms of Use

Funding:

The Internet Archive is a public nonprofit organization and, according to their site, receives in-kind and financial donations from Alexa Internet, the Kahle/Austin Foundation, and Quantum Corporation (makers of Digital Linear Tape). In June 1999, Alexa Internet became a wholly owned subsidiary of Amazon.com.

Infrastructure/Technical:

Until late 1998, data was stored on DLT tape. Although this tape was an inexpensive storage medium, it was determined to be too slow for querying. New data is being stored on disk, and they are in the process of migrating the rest of the DLT data to disk as well. The data from the crawls is stored on Linux machines run by a server, facade.archive.org. These Linux machines have either 12 or 20 disk drives, holding files in three formats:

ARC (.arc): These are 100 MB files, each of which contains the complete data from a number of files in the collection. Alexa Internet is proposing that ARC become the standard for archiving Internet objects.

DAT (.dat) or MDT (.dt): These are metadata files containing data such as URLs and image references from the ARC files. This contextual information allows for easier indexing of ARC files and helps researchers to use these files to study things like link structures.

IDX (index): These each contain a list of URLs and their associated places in the ARC and DAT files.
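The division of labor between a container file and its index can be illustrated with a toy model. This is not the actual ARC or IDX layout, which is not described here; the function names, the in-memory byte string standing in for a container file, and the (offset, length) index entries are all assumptions made for the sketch.

```python
def build_container(pages):
    """Pack page bodies into one blob and record where each one sits."""
    blob = bytearray()
    index = {}
    for url, body in pages.items():
        index[url] = (len(blob), len(body))  # (offset, length) within the blob
        blob.extend(body)
    return bytes(blob), index


def fetch(blob, index, url):
    """Use the index to pull one page out of the blob without scanning it."""
    offset, length = index[url]
    return blob[offset:offset + length]


# Hypothetical pages standing in for crawled documents.
pages = {
    "http://a.example/": b"<html>A</html>",
    "http://b.example/": b"<html>B page</html>",
}
blob, index = build_container(pages)
print(fetch(blob, index, "http://b.example/"))
```

The point of the separate index is that a researcher looking for one URL can jump straight to its place in a large container file instead of reading the whole file.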

Users connect to the Internet Archive Server through secure shell (ssh) access, through which they reference the requested files from the Linux machines.

Preservation:

The Internet Archive’s web site recognizes three main issues of preservation: accidents, media decay, and technological obsolescence. Although their site does not go into any depth on these topics, it does mention them and explains each a little.

Under the title of Accidents, it is mentioned that storing copies of the data in different locations is a wise idea. Part of their collection has already been copied and moved off-site, and we are told the rest is soon to follow. They seem to distinguish between accidents and natural disasters, which implies that accidents are man-made. Missing from these concerns are statements about the environment. Clearly temperature and relative humidity are important in the storage of both tape and disk, though being headquartered in San Francisco definitely reduces many of these troubles.

The second issue they address is Migration. It is here they mention media decay: how the storage media can degrade to a point where the data is completely lost. As of now, their plans are to migrate to new media every 10 years, as is the general rule of thumb, but they anticipate having to do it more often than that. They plan to do so even though DLT (their digital storage tape) is rated for 30 years. Clearly a good idea!

Under the heading of Data Formats, the site mentions digital preservation’s monkey on the back: technological obsolescence. As for a solution, all they mention is that “We will be collecting software and emulators that will aid future researchers, historians, and scholars in their research.”

Access/Privacy/Copyright:

The Internet Archive’s collections are open to the public through an application process. The potential user fills out a form (the same form used for donations or projects involved with the archives) that asks for some personal information like name, phone number, address, and e-mail. In a day or two, the user will receive an e-mail with a username and password, as well as some initial directions on accessing the archive. Because of how the collections have been stored, the user must have some working knowledge of Unix commands in order to navigate within the file structures on the Internet Archive’s machines.

There exists a substantial Terms of Use Agreement that the user must agree to. The agreement states that users must not interfere with other researchers using the archive – this is possible because of the open style Unix system – and that they themselves will use their passwords and access for “ways consistent with this Agreement — no other access to or use of the Site, the Collections, or the Archive's services is authorized.” The agreement also warns users that they are using the resource ‘at their own risk’, and are responsible for knowing local, national, and international laws that may apply to their viewing and use of the archives. In particular, pertaining to intellectual property, the agreement states: “...you certify that your use of any part of the Archive's Collections will be noncommercial and will be limited to noninfringing or fair use under copyright law.”

Having no selection criteria opens up the possibility for all kinds of material, especially on the web. So, as a bridge into their legal language about not being responsible for basically anything, there is this warning: “Because the content of the Collections comes from around the world and from many different sectors, the Collections may contain information that might be deemed offensive, disturbing, pornographic, racist, sexist, bizarre, misleading, fraudulent, or otherwise objectionable. The Archive does not endorse or sponsor any content in the Collections, nor does it guarantee or warrant that the content available in the Collections is accurate, complete, noninfringing....”

Along with the Terms of Use Agreement, users agree to what is listed in a Privacy Policy. This document begins by warning users that the archive is an open environment, making a comparison to a physical library: “This open approach is somewhat like the situation in a public library, where staff and patrons might see who else was in the library and a bit of what they were working on.” It then goes on to mention the standard privacy issues – saying that since the web site uses standard ‘web logging’ in its servers, it records domain names, IP address, web pages requested, and so on. Also, like most of the web, it makes use of cookies. Since it is also collecting web pages of individuals, it must address their privacy concerns as well. To do so, the document reiterates the ability of individuals to ask for their data to be removed. It also mentions that the Internet Archive may share any data collected (in the crawls) with others, like it already has in joint projects with the Smithsonian and Library of Congress.

Evaluation:

Because of the unique nature of this undertaking compared to most “digital preservation” projects, one cannot examine it on exactly the same criteria. This is especially true with regard to on-line content. With an analog image collection that has been digitally reformatted, one can compare the quality of the originals with the screen versions; the Internet Archive, by contrast, is preserving ‘digitally born’ material that by its own nature looks exactly the same. However, ease of searching and ease of use are important to look at, and those are two good problems to begin with.

The Internet Archive claims that tools will eventually be developed for exploring their collected data far more easily, but until then the user must not only have some programming skills but also understand what she or he is looking at. For the average web surfer this is very unlikely, and so the audience for the Internet Archive’s collection of data will remain serious researchers. The exception to this is the Internet Archive’s “for-profit” sister, Alexa, whose toolbar allows individually archived pages to be called up from the database. So on ease of use, the Archive scores low, but it must be remembered that this is an on-going project with an enormously idealistic mission.

Another comment on the site is that it lacks any real discussion of preservation, which is odd, because that’s what they are all about. The one sentence about collecting software and emulators just doesn’t cut it. These are issues that Brewster Kahle and friends are certainly wrestling with, and must have thought about even to attempt such a project (in fact, he has written articles about some of the issues), so why isn’t there more mention of them? Perhaps because, besides migration, nobody has really figured out a good way to preserve digital materials over the long haul. But then there should at least be links to discussions and writings on such topics.

The largest problem is that with the Web growing at such a fast rate, and with the upcoming introduction of higher bandwidth infrastructures, there will come a point when even the Internet Archive will have to become selective, and then the real issues will arise.

But after all is said and done, the real test will be in time. One hundred years from now, will the poets, historians, politicians, and school children have a clear understanding of this exciting birthing process of new modes of expression? Will there be anything around for grad students in Media Studies programs to write communication evolution papers on? Will we enter another Dark Ages, as some have suggested? It is from studying our past now that we realize we must preserve our present. And our present most definitely includes the vast diversity of the World Wide Web.

All quotations are from the Internet Archive’s web site (www.archive.org)

Alexander Zimmerman Copyleft 02000
