The Internet Archive is a 501(c)(3) nonprofit founded in 1996 to archive, preserve, and provide access to our collected memory in, and of, cyberspace. A self-described “Internet Library”, it aims to provide free and permanent access to all publicly posted web pages. As one might imagine, this is no small feat. While the Archive must deal with some of the same issues as an organization undertaking a digital reformatting project (which is all that most digital projects really are), preserving digitally born materials raises many distinct problems, not least, in this project, problems of scale.
Even the libraries, museums, and archives that have made commitments to actual digital preservation projects are really only attempting to preserve (to invert a phrase from James O’Donnell) ‘old wine in new bottles’. The Internet Archive’s goal is to preserve the new wine in the new bottles, and to allow researchers, historians, and the public continual access to the history of the World Wide Web.
The thought of archiving the internet, of attempting to preserve the web, is almost ludicrous: it is mostly unorganized, vast, deep, and constantly in flux.
How does one even begin going about such a task?
To understand the process, it’s best to start from the beginning, and
that beginning is the closely related Alexa Internet.
Alexa Internet is a web navigation service that collects data about the web through “crawls”. A crawl is carried out by a robot (similar to the ones search engines use for indexing) that starts at one publicly accessible web site, follows each link on the page, and continues in the same way from the pages it reaches. Each crawl takes about 2 months and gathers about 100 gigabytes per day. Alexa records this data and uses it to improve the services of its free downloadable web navigation toolbar. When an Alexa user arrives at a site, Alexa’s servers search their databases for relevant information about the site and display it. These services fall into three parts. The first is Site Stats, which includes who has registered the domain name and their contact information, Securities and Exchange Commission (SEC) information about companies, how recently a site has been updated (freshness), how fast the connection is, and other important statistics. The second is Relevant Links, which uses data mined from usage paths and user suggestions to build a list of sites related to the one the user is visiting. The third, and most helpful, service is the Archive of the Web. When a user’s browser returns the common and frustrating “404” page-not-found error, Alexa searches its database of crawled sites to see if it has an archived version. If it does, the user can view the site from Alexa’s servers.
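The link-following idea behind a crawl can be sketched in a few lines of Python. This is only an illustration, not Alexa's robot: the `LINKS` table is a made-up stand-in for the live web, and the `crawl` function and its breadth-first order are assumptions chosen for clarity.

```python
from collections import deque

# Hypothetical link graph standing in for the live web: page -> links found on it.
LINKS = {
    "a.example/": ["a.example/about", "b.example/"],
    "a.example/about": ["a.example/"],
    "b.example/": ["c.example/", "a.example/about"],
    "c.example/": [],
}

def crawl(start, fetch_links):
    """Visit every page reachable from `start`, following each link once."""
    seen = {start}
    queue = deque([start])
    order = []
    while queue:
        page = queue.popleft()
        order.append(page)            # a real robot would store the page here
        for link in fetch_links(page):
            if link not in seen:      # never queue the same page twice
                seen.add(link)
                queue.append(link)
    return order

visited = crawl("a.example/", lambda page: LINKS.get(page, []))
```

Keeping a set of pages already seen is what stops the robot from looping forever when two pages link to each other, as the first two pages in this toy graph do.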
So, how does the Internet Archive fit into all of this? Alexa donates all of its web crawls to the Internet Archive (which is registered as a charity). Since the two founders of Alexa Internet, Brewster Kahle and Bruce Gilliat, also make up half of the board of the Internet Archive, this seems to be a mutually beneficial relationship, with everyone, even the public, coming out ahead.
The Internet Archive is currently made up of 4 board
members [bios taken from Internet Archive Web site]:
Brewster Kahle: Brewster is an engineer by profession and an archivist at heart. He designed supercomputers for Thinking Machines and helped found WAIS, Inc., Alexa Internet, and the Internet Archive.
Peter Lyman: Peter is university librarian and a professor in the School of Information Management and Systems at the University of California, Berkeley.
Kathleen Burch: Kathleen has helped start and run nonprofits, including the San Francisco Center for the Book, since 1973. She is a Xerox PARC Artist in Residence for 2000.
Bruce Gilliat: Bruce, with a background in networking and online content strategies, is a cofounder (with Brewster Kahle) of Alexa Internet.
The Internet Archive also has 5 staff members: Marlita Kahn (Managing
Director), Kurt Bollacker (Technical Director),
Joseph Kacmarcik (Systems Administrator), Melanie Farley (Facilities
Project Manager), and Belinda Greene (Office Manager).
Content/Selection Criteria:
Although the bulk of the Internet Archive’s content comes from Alexa (there is a six-month waiting period between when Alexa collects the data and when it donates it), donations are welcome from other institutions. Alexa’s donated crawls from 1996 to late 1998 include images and other media, but the collection since then has included only ASCII text. The Internet Archive plans to do its own crawls in the near future, both to enlarge its collection and to fill in the multimedia and image files that have not been included in the more recent crawls.
As of now (Spring 2000), 2.7 TB is available to the public through the Archive’s Unix machines. The sheer amount of data collected makes web access to the collections unrealistic (it would be like a map maker drawing a full-scale map of the earth). The site does, however, give a taste of what the Archive’s digital preservation mission has helped others do.
For instance, the Smithsonian has used web pages collected by the Internet Archive to create a historical exhibit of how politicians used the Web during the 1996 elections. The Internet Archive site gives a demonstration (a flashing tease, more like it) of sites archived from the 1996 season, as well as a nice hyperlinked tour of some of the preserved sites the Smithsonian used in its exhibit. The site also names several other big users, such as Xerox PARC, IBM, AT&T, and the Library of Congress, with links and brief explanations of how these researchers are using the Internet Archive’s collections.
Since the Archive’s goal is to archive the public space of the internet, there are no real selection criteria. The policy is geared more towards what is not included than what is. Sites that are password protected, on private servers, or whose owners request not to be crawled are not archived. Site owners may also ask the Internet Archive to remove any information collected from their site. The Archive’s site states this very clearly, and provides information on how webmasters can write simple HTML code to prevent robots from crawling their pages.
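The Archive's exact instructions to webmasters are not reproduced in this report. Assuming they follow the common robots exclusion convention (a `robots.txt` file at the site root), the sketch below shows how a well-behaved robot would read such a file, using Python's standard `urllib.robotparser`. The file contents, the example URLs, and the robot name `ia_archiver` are assumptions made for illustration.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt a webmaster might publish. "ia_archiver" is assumed
# here to be the name the archiving robot announces; all other robots are
# only kept out of /private/.
ROBOTS_TXT = """\
User-agent: ia_archiver
Disallow: /

User-agent: *
Disallow: /private/
"""

def may_crawl(robots_txt, agent, url):
    """Return True if `agent` is permitted to fetch `url` under these rules."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(agent, url)

archiver_allowed = may_crawl(ROBOTS_TXT, "ia_archiver", "http://www.example.com/index.html")
other_allowed = may_crawl(ROBOTS_TXT, "somebot", "http://www.example.com/index.html")
other_private = may_crawl(ROBOTS_TXT, "somebot", "http://www.example.com/private/page.html")
```

Under these example rules the archiving robot is refused everywhere, while other robots may fetch the home page but not anything under `/private/`. The convention relies on the robot choosing to honor the file; it does not technically block access.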
Although the Internet Archive has collected small amounts of FTP and Usenet data, no chat or e-mail is collected at all. The Archive does warn, however, that since it archives publicly posted sites, and people do publicly post personal information on their own, it is possible that personal information could be archived. Once again it reminds the user that a request can be made to remove any collected data. Yet there is no easy way to determine whether your site has been crawled. One way could be to download Alexa, visit your own site, and see if it has been archived. Perhaps the Internet Archive could develop a search engine for its collection so that users could determine this, though judging by the scope and size of the collection, this could be a difficult project.
[tables from Internet Archive site: http://www.archive.org/collections/index.html]
World Wide Web collection
DATES: October 1996 to now
SIZE: 13.8 terabytes (about 1 billion pages, text only during 1999)
RATE OF GROWTH: About 2 terabytes a month as of March 2000
ACCESSIBLE: From late 1998 to six or more months ago (the collection contains no material less than six months old), or about 3 terabytes as of March 2000. We hope to make the rest of the material (collected from late 1996 to late 1998) available during 2000
ACCESS: See the Archive’s Terms of Use

FTP collection
DATES: July to October 1996
SIZE: 0.05 terabyte (about 50,000 sites)
RATE OF GROWTH: N/A; collection of FTP sites is currently on hold
ACCESSIBLE: No access at present; we hope to move the collection from (slow) tape to (faster) disk and provide access during 2000
ACCESS: See the Archive’s Terms of Use

Usenet collection
DATES: October 1996 to late 1998
SIZE: 0.592 terabyte (about 16 million postings)
RATE OF GROWTH: N/A; collection of Usenet bulletin boards is currently on hold
ACCESSIBLE: No access at present; we hope to move the collection from (slow) tape to (faster) disk and provide access during 2000. Try www.deja.com, which maintains a more complete collection
ACCESS: See the Archive’s Terms of Use
The Internet Archive is a public nonprofit organization and, according to its site, receives in-kind and financial donations from Alexa Internet, the Kahle/Austin Foundation, and Quantum Corporation (makers of Digital Linear Tape). In June 1999, Alexa Internet became a wholly owned subsidiary of Amazon.com.
Infrastructure/Technical:
Until late 1998, data was stored on DLT tape. Although tape was an inexpensive storage medium, it proved too slow for querying. New data is being stored on disk, and the Archive is in the process of migrating the rest of the DLT data to disk as well. The data from the crawls is stored on Linux machines run by a server, facade.archive.org. These Linux machines have either 12 or 20 disk drives and hold files in three formats:
ARC (.arc): These are 100 MB files, each containing the complete data for a number of files in the collection. Alexa Internet is proposing that ARC become the standard for archiving Internet objects.
DAT (.dat) or MDT (.dt): These are metadata files containing data such as URLs and image references from the ARC files. This contextual information allows for easier indexing of ARC files and helps researchers use these files to study things like link structures.
IDX (index): These files each contain a list of URLs and their associated locations in the ARC and DAT files.
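The relationship between a large archive file and its index can be illustrated with a small sketch. It does not reproduce the real ARC, DAT, or IDX layouts, which are not described here; the byte layout, the `build_archive` and `fetch` helpers, and the example URLs are all hypothetical. It only shows why a list of URLs with their locations makes lookups cheap.

```python
import io

def build_archive(pages):
    """Concatenate page bodies into one blob and record where each one lands."""
    blob = io.BytesIO()
    index = {}
    for url, body in pages:
        offset = blob.tell()
        blob.write(body)
        index[url] = (offset, len(body))   # URL -> (byte offset, byte length)
    return blob.getvalue(), index

def fetch(archive_bytes, index, url):
    """Use the index to read a single page without scanning the archive."""
    offset, length = index[url]
    stream = io.BytesIO(archive_bytes)
    stream.seek(offset)                    # jump straight to the record
    return stream.read(length)

# Made-up pages standing in for crawled content.
pages = [
    ("http://a.example/", b"<html>first page</html>"),
    ("http://b.example/", b"<html>second page</html>"),
]
archive_bytes, index = build_archive(pages)
second = fetch(archive_bytes, index, "http://b.example/")
```

With files of around 100 MB each, being able to seek directly to one record instead of reading the whole file is what makes a per-URL lookup practical.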
Users connect to the Internet Archive Server through secure shell (ssh) access, through which they reference the requested files from the Linux machines.
The Internet Archive’s web site recognizes three main issues of preservation: accidents, media decay, and technological obsolescence. Although the site does not go into any depth on these topics, it does mention them and explain a little.
Under the title of Accidents, the site mentions that storing copies of the data in different locations is a wise idea. Part of the collection has already been copied and moved off-site, and we are told the rest is soon to follow. The site seems to distinguish between accidents and natural disasters, which implies that accidents are man-made. Missing from these concerns are statements about the environment. Clearly temperature and relative humidity are important in the storage of both tape and disk, though being headquartered in San Francisco definitely reduces many of these troubles.
The second issue the site addresses is Migration. It is here that it mentions media decay: how the storage media can degrade to a point where the data is completely lost. As of now, the plan is to migrate to new media every 10 years, as is the general rule of thumb, but the Archive anticipates having to do so more often than that. It plans to do so even though DLT (its digital storage tape) is rated for 30 years. Clearly a good idea!
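The site does not say how a migrated copy is checked against the original. One common approach, shown here purely as an assumption and not as the Archive's procedure, is to record a checksum for each file before migration and compare it afterwards. The file names and contents below are invented for the example.

```python
import hashlib

def digest(data):
    """Return a SHA-256 fingerprint of one stored file's bytes."""
    return hashlib.sha256(data).hexdigest()

def make_manifest(files):
    """Record a fingerprint for every file before it leaves the old medium."""
    return {name: digest(data) for name, data in files.items()}

def changed_files(files, manifest):
    """List files whose bytes no longer match the recorded fingerprint."""
    return sorted(name for name, data in files.items()
                  if digest(data) != manifest.get(name))

# Hypothetical file names and contents standing in for data on old tape.
old_media = {"crawl-0001.arc": b"page data", "crawl-0002.arc": b"more page data"}
manifest = make_manifest(old_media)

# A faithful copy passes; a copy with one damaged byte is caught.
good_copy = dict(old_media)
bad_copy = dict(old_media)
bad_copy["crawl-0002.arc"] = b"more page dat\x00"

good_result = changed_files(good_copy, manifest)
bad_result = changed_files(bad_copy, manifest)
```

The same manifest can be rechecked between migrations, so decay is noticed while an undamaged copy still exists elsewhere.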
Under
the heading of Data Formats, the site mentions digital preservation’s monkey
on the back: technological obsolescence. As
for a solution, all they mention is that “We will be collecting software and
emulators that will aid future researchers, historians, and scholars in their
research.”
Access/Privacy/Copyright:
The Internet Archive’s collections are open to the public through an application process. The potential user fills out a form (the same form used for donations or projects involving the archives) that asks for some personal information such as name, phone number, address, and e-mail. In a day or two, the user receives an e-mail with a username and password, as well as some initial directions on accessing the archive. Because of how the collections are stored, the user must have some working knowledge of Unix commands in order to navigate the file structures on the Internet Archive’s machines.
There exists a substantial Terms of Use Agreement that the user must agree to. The agreement states that users must not interfere with other researchers using the archive – this is possible because of the open style Unix system – and that they themselves will use their passwords and access for “ways consistent with this Agreement — no other access to or use of the Site, the Collections, or the Archive's services is authorized.” The agreement also warns users that they are using the resource ‘at their own risk’, and are responsible for knowing local, national, and international laws that may apply to their viewing and use of the archives. In particular, pertaining to intellectual property, the agreement states: “...you certify that your use of any part of the Archive's Collections will be noncommercial and will be limited to noninfringing or fair use under copyright law.”
Having no selection criteria opens the door to all kinds of material, especially on the web. So, as a bridge into the legal language about not being responsible for basically anything, the agreement gives this warning: “Because the content of the Collections comes from around the world and from many different sectors, the Collections may contain information that might be deemed offensive, disturbing, pornographic, racist, sexist, bizarre, misleading, fraudulent, or otherwise objectionable. The Archive does not endorse or sponsor any content in the Collections, nor does it guarantee or warrant that the content available in the Collections is accurate, complete, noninfringing....”
Along with the Terms of Use Agreement, users agree to what is listed in a Privacy Policy. This document begins by warning users that the archive is an open environment, making a comparison to a physical library: “This open approach is somewhat like the situation in a public library, where staff and patrons might see who else was in the library and a bit of what they were working on.” It then goes on to mention the standard privacy issues, saying that since the web site uses standard ‘web logging’ on its servers, it records domain names, IP addresses, web pages requested, and so on. Also, like most of the web, it makes use of cookies. Since the Archive is also collecting the web pages of individuals, it must address their privacy concerns as well. To do so, the document reiterates the ability of individuals to ask for their data to be removed. It also mentions that the Internet Archive may share any data collected (in the crawls) with others, as it already has in joint projects with the Smithsonian and Library of Congress.
Evaluation:
Because this undertaking is so different from most “digital preservation” projects, one cannot examine it on exactly the same criteria. This is especially true with regard to on-line content. With an analog image collection that has been digitally reformatted, one can compare the quality of the originals with the screen versions; the Internet Archive, by contrast, is preserving ‘digitally born’ material that by its very nature looks exactly the same. However, ease of searching and ease of use are important to look at, and those are two good problems to begin with.
The Internet Archive claims that tools will eventually be developed for exploring its collected data far more easily, but until then the user not only has to have some programming skills, but also must understand what she or he is looking at. For the average web surfer this is very unlikely, and so the audience for the Internet Archive’s collection of data will remain serious researchers. The exception is the Internet Archive’s “for-profit” sister, Alexa, whose toolbar allows individually archived pages to be called up from the database. So on ease of use, the Archive scores low, but it must be remembered that this is an on-going project with an enormously idealistic mission.
Another comment on the site is that it lacks any real discussion of preservation, which is odd, because preservation is what the Archive is all about. The one sentence about collecting software and emulators just doesn’t cut it. These are issues that Brewster Kahle and friends are certainly wrestling with, and must have thought about even to attempt such a project (in fact, he has written articles about some of the issues), so why isn’t there more mention of them? Perhaps because, besides migration, nobody has really figured out a good way to preserve digital materials over the long haul. But then there should at least be links to discussions and writings on such topics. The largest problem is that with the Web growing at such a fast rate, and with the upcoming introduction of higher-bandwidth infrastructures, there will come a point when even the Internet Archive will have to become selective. Then the real issues will arise.
But after all is said and done, the real test will be time. One hundred years from now, will the poets, historians, politicians, and school children have a clear understanding of this exciting birthing process of new modes of expression? Will there be anything around for grad students in Media Studies programs to write communication evolution papers on? Will we enter another Dark Ages, as some have suggested? It is from studying our past now that we realize we must preserve our present. And our present most definitely includes the vast diversity of the World Wide Web.
All quotations are from the Internet Archive’s web site (www.archive.org)
Alexander Zimmerman Copyleft 02000