By Chris Sherman
There's a big problem with
most search engines, and it's one many
people aren't even aware of.
The problem is that vast expanses of the
Web are completely invisible
to general purpose search engines like
AltaVista, HotBot and
Google. Even worse, this "Invisible Web" is in
all likelihood growing
significantly faster than the visible Web
you're familiar with.
So what is this Invisible
Web and why aren't search engines indexing
it? To answer this question, it's important to
first define the
"visible" Web, and
describe how search engines compile their indexes.
The Web was created a little
over ten years ago by Tim Berners-Lee, a
researcher at the CERN
high-energy physics laboratory in Switzerland.
Berners-Lee designed the Web
to be platform-independent, so that
researchers at CERN could
share materials residing on any type of
computer system, avoiding
cumbersome and potentially costly conversion
issues. To enable this cross-platform capability, Berners-Lee
created
HTML, or HyperText Markup
Language - essentially a dramatically
simplified version of SGML
(Standard Generalized Markup Language).
HTML documents are simple:
they consist of a "head" portion, with a
title and perhaps some
additional metadata describing the document,
and a "body"
portion, the actual document itself. The
simplicity of
this format makes it easy
for search engines to retrieve HTML
documents, index every word
on every page, and store them in huge
databases that can be
searched on demand.
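To make the indexing step concrete, here is a minimal sketch of a word index built from HTML pages. It is an illustration only, not a description of how any particular engine works: the sample pages, the example.com URLs, and the function names are all assumptions made for the example.

```python
from collections import defaultdict
from html.parser import HTMLParser
import re

# Hypothetical sample pages, keyed by made-up URLs.
PAGES = {
    "http://example.com/a": (
        "<html><head><title>Invisible Web</title></head>"
        "<body>Databases hide content.</body></html>"
    ),
    "http://example.com/b": (
        "<html><head><title>Visible Web</title></head>"
        "<body>Spiders follow links.</body></html>"
    ),
}


class TextExtractor(HTMLParser):
    """Collects the text found in the head and body of a page."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)


def words_in(html):
    """Return the lowercased words of an HTML document."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return re.findall(r"[a-z0-9]+", " ".join(parser.chunks).lower())


def build_index(pages):
    """Map every word to the set of URLs whose pages contain it."""
    index = defaultdict(set)
    for url, html in pages.items():
        for word in words_in(html):
            index[word].add(url)
    return index


def search(index, query):
    """Return the URLs that contain every word in the query."""
    terms = re.findall(r"[a-z0-9]+", query.lower())
    if not terms:
        return set()
    results = set(index.get(terms[0], set()))
    for term in terms[1:]:
        results &= index.get(term, set())
    return results


index = build_index(PAGES)
print(sorted(search(index, "web")))
print(sorted(search(index, "invisible web")))
```

The first query matches both sample pages and the second matches only the first page. A production index would also record word positions and rank results, which this sketch leaves out.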
What's less easy is the task
of actually finding all the pages on the
Web. Search engines use automated programs called
spiders or robots
to "crawl" the Web
and retrieve pages. Spiders function
much like a
hyper-caffeinated Web
browser - they rely on links to take them from
page to page.
Crawling is a
resource-intensive operation. It also
puts a certain
amount of demand on the host
computers being crawled. For these
reasons, search engines will
often limit the number of pages they
retrieve and index from any
given Web site. It's tempting to think
that these unretrieved pages
are part of the Invisible Web, but they
aren't. They are visible and indexable, but the
search engines have
made a conscious decision not
to index them.
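The crawling behavior described above can be sketched in a few lines. This is a simplified illustration under stated assumptions: the link graph, the a.example and b.example URLs, and the max_pages_per_site parameter are all hypothetical, and a real spider would fetch pages over the network rather than read them from a dictionary.

```python
from collections import deque
from urllib.parse import urlparse

# Hypothetical link graph: each URL maps to the URLs it links to.
LINKS = {
    "http://a.example/": [
        "http://a.example/1",
        "http://a.example/2",
        "http://b.example/",
    ],
    "http://a.example/1": ["http://a.example/3"],
    "http://a.example/2": [],
    "http://a.example/3": [],
    "http://b.example/": [],
    "http://b.example/orphan": [],  # no page links here
}


def crawl(seed, links, max_pages_per_site):
    """Follow links breadth-first, capping the pages taken per site."""
    retrieved = []
    per_site = {}
    seen = {seed}
    queue = deque([seed])
    while queue:
        url = queue.popleft()
        site = urlparse(url).netloc
        if per_site.get(site, 0) >= max_pages_per_site:
            continue  # reachable and indexable, but deliberately skipped
        per_site[site] = per_site.get(site, 0) + 1
        retrieved.append(url)
        for target in links.get(url, []):
            if target not in seen:
                seen.add(target)
                queue.append(target)
    return retrieved


for url in crawl("http://a.example/", LINKS, max_pages_per_site=3):
    print(url)
```

With a cap of three pages per site, the sketch retrieves three pages from a.example and skips the fourth, even though a link leads to it. The page at /orphan is never even discovered, because the spider has no link to follow to it.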
In recent months, much has
been made of these overlooked pages. Many
of the major engines are
making serious efforts to include them and
make their indexes more
comprehensive. Unfortunately, the
engines
have also discovered through
their "deep crawls" that there's a
tremendous amount of
duplication and spam on the Web. Current
estimates put the Web at
about 1.2 to 1.5 billion indexable pages.
Both Inktomi and AltaVista
have claimed that they've spidered most of
these documents, but have
been forced to cull their indexes to cope
with duplicates and
spam. Inktomi puts the size of the
distilled Web
at about 500 million pages;
AltaVista at about 350 million.
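As a rough illustration of what culling duplicates can involve, the sketch below drops pages whose text is identical once case and whitespace differences are ignored. The engines have not been quoted here on their actual methods, so treat this as one simple assumed approach; the sample pages and function names are hypothetical.

```python
import hashlib
import re


def fingerprint(text):
    """Hash the text after lowercasing and collapsing whitespace."""
    normalized = re.sub(r"\s+", " ", text.strip().lower())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def cull_duplicates(pages):
    """Keep only the first URL seen for each distinct fingerprint."""
    kept = {}
    seen = set()
    for url, text in pages.items():
        digest = fingerprint(text)
        if digest not in seen:
            seen.add(digest)
            kept[url] = text
    return kept


# Hypothetical crawl results: the second page mirrors the first.
crawled = {
    "http://example.com/report": "Annual report for 1999",
    "http://mirror.example.com/report": "Annual   Report for 1999\n",
    "http://example.com/contact": "Contact the webmaster",
}

for url in cull_duplicates(crawled):
    print(url)
```

Exact fingerprints catch only pages that match after normalization. Near-duplicates and spam call for fuzzier comparisons, which are beyond this sketch.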
But these numbers don't
include Web pages that can't be indexed, or
information that's available
via the Web but isn't accessible by the
search engines. This is the stuff of the Invisible Web.
Why can't some pages be
indexed? The most basic reason is that
there
are no links pointing to a
page that a search engine spider can
follow. Or, a page may be made up of data types that
search engines
don't index - graphics, CGI
scripts, Macromedia Flash or PDF files,
for example.
But the biggest part of the
Invisible Web is made up of information
stored in databases. When an
indexing spider comes across a database,
it's as if it has run smack
into the entrance of a massive library
with securely bolted doors. Spiders
can record the library's address,
but can tell you nothing
about the books, magazines or other documents
it contains.
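The bolted-door problem is easy to reproduce in miniature. In the sketch below, a stand-in for a database-backed site serves a search form at its front door and returns records only in response to a query. Everything here is hypothetical: the records, the handle_request function, and the paths are assumptions made for the illustration, not any real site's interface.

```python
from html.parser import HTMLParser

# Hypothetical records held in a database behind the site.
RECORDS = {
    "smith": "Smith, J. 555-0100",
    "jones": "Jones, A. 555-0101",
}

FORM_PAGE = (
    "<html><head><title>Directory</title></head><body>"
    '<form action="/search" method="get"><input name="q"></form>'
    "</body></html>"
)


def handle_request(path, query=None):
    """Stand-in for the site's server: a form page plus query results."""
    if path == "/":
        return FORM_PAGE
    if path == "/search" and query:
        hit = RECORDS.get(query.lower())
        return "<html><body>{}</body></html>".format(hit or "No match")
    return "<html><body>Not found</body></html>"


class LinkExtractor(HTMLParser):
    """Collects the href of every anchor tag, as a spider would."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def links_on(html):
    """Return the links a spider could follow from this page."""
    parser = LinkExtractor()
    parser.feed(html)
    parser.close()
    return parser.links


print(links_on(handle_request("/")))       # what the spider can follow
print(handle_request("/search", "smith"))  # what a person who queries sees
```

The spider finds no links on the front page, so it can record the address and nothing more. The records appear only when someone types a query into the form, which a link-following spider never does.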
There are thousands -
perhaps millions - of databases containing
high-quality information
that are accessible via the Web. But in
order to search them, you
typically must visit the Web site that
provides an interface to the
database. The advantage to this direct
approach is that you can use
search tools that were specifically
designed to retrieve the
best results from the database. The
disadvantage is that you
need to find the database in the first place,
a task the search engines
may or may not be able to help you with.
Another problem is that
content in some databases isn't designed to be
directly searchable. Instead, Web developers are taking advantage
of
database technology to offer
customized content that's often assembled
on the fly. Search engine
results pages are an example of this type of
dynamically generated
content - so are services like My Excite and My
Yahoo. As Web sites get more complex and users
demand more
personalization, this trend
toward dynamically generated content will
accelerate, making it even
harder for search engines to create
comprehensive Web indexes.
In a nutshell, the Invisible
Web is made up of content
that search engines either
can't or won't index. It's a huge part
of
the Web, and it's
growing. Fortunately, there are several
reasonably
thorough guides to the
Invisible Web.
Gary Price, Reference
Librarian at the Gelman Library at George
Washington University, is
considered one of the foremost authorities
on online databases and other
invaluable search resources on the
Invisible Web. Price has
assembled a massive collection of links to
Invisible Web resources at
his Direct Search page
<http://gwis2.circ.gwu.edu/~gprice/direct.htm>.
"A good librarian would
not start looking for a phone number
(specialized, Invisible Web
info) by searching the Encyclopaedia
Britannica (general
knowledge resource)," says Price. "Both
professional and casual
searchers should at least be aware that they
could be missing some
information or wasting time finding what could
be found more easily if the
right tool for the job is easily
accessible. This is very
similar to a good reference librarian
'knowing' the major
reference tools in his or her collection."
What kinds of databases does
Price consider to be essential Invisible
Web search tools? He names four as examples:
- The many databases that
make up GPO Access.
<http://www.access.gpo.gov/su_docs/aces/aaces002.html>
- Any of the telephone
directory databases such as Anywho
<http://www.anywho.com/>, Switchboard <http://www.switchboard.com/>,
and Phone Net U.K. <http://www.bt.com/phonenetuk/>.
And two that are crucial to
the business searcher:
- Any of the many flavors of
EDGAR, particularly the 10K Wizard.
- The Mercury Center
searchable version of the PricewaterhouseCoopers
Money Tree Survey of venture
capital made available by the San Jose
Mercury News. <http://wwdyn.mercurycenter.com/business/moneytree/>
"In addition to text
media, the Internet is serving up many other
formats. One that
interests me a great deal is streaming media. One
experimental project that is
noteworthy is the Speechbot engine that
is being developed and
tested by Compaq," says Price.
<http://speechbot.research.compaq.com/>
Two other Invisible Web
resources Price maintains are his NewsCenter
<http://gwis2.circ.gwu.edu/~gprice/newscenter.htm>, which focuses on
sources providing up-to-the-minute news stories on any subject
imaginable, and his Web
Audio Current Awareness Resources page
<http://gwis2.circ.gwu.edu/~gprice/audio.htm>, with links to hundreds
of live and recorded
audio/video news and public affairs programming
on the Web.
"By the way, do not
mistake an interest in the Invisible Web as a slam
on the general search
engines because it is NOT," says Price. "General
search tools are still 100%
essential for accessing material on the
Internet."
One of the largest gateways
to the Invisible Web is the aptly named
Invisibleweb.com <http://www.invisibleweb.com>
from Intelliseek.
"Invisible Web sources
are critical because they provide users with
specific, targeted
information, not just static text or HTML pages,"
says Sundar Kadayam, CTO and
Co-Founder, Intelliseek.
"InvisibleWeb.com is a
Yahoo-like directory. It is a high
quality,
human edited and indexed,
collection of highly targeted databases that
contain specific answers to
specific questions," says Kadayam.
Intelliseek also makes
BullsEye, a desktop-based metasearch engine
that can also access many of
the sites included in InvisibleWeb.com.
More information can be
found at
<http://www.intelliseek.com/prod/bullseye.htm>.
Other notable Invisible Web
resources include:
AlphaSearch
<http://www.calvin.edu/library/searreso/internet/as/>
AlphaSearch is an extremely
useful directory of "gateway" sites that
collect and organize Web
sites that focus on a particular subject.
Created and maintained by
the Hekman Library at Calvin College, it's
both searchable and
browsable by either subject discipline or
descriptor.
The Big Hub
The Big Hub maintains a
directory of over 1,500 subject-specific
searchable databases in over
300 categories. Listings for each
database feature both
annotations and search forms to directly access
the database. While these are useful for quick and dirty
searches, Big Hub's search forms omit most of the advanced
searching features that each database offers on its own site.
Infomine Multiple Database Search
<http://infomine.ucr.edu/search.phtml>
Infomine might be called an
"academic" search engine, focusing on
scholarly resource
collections, electronic journals and books, online
library card catalogs, and
directories of researchers. Unlike many
Invisible Web search tools, Infomine
allows simultaneous searching of
multiple databases.
WebData.com
WebData is a database
portal, specializing in finding, categorizing
and organizing online
databases, and providing annotated links with
quality rankings.
As fast as the Web has been
growing over the past ten years, it's
likely that its growth rate
is accelerating, perhaps exponentially.
Speaking at the
NetWorld+Interop conference in May 2000, Inktomi CEO
David Peterschmidt said he
expected the Web to grow to more than 8
billion documents by the end
of the year - more than a fivefold
increase from its current
size.
The major search engines
have done a creditable job of scaling with
the visible Web. For the foreseeable future, however,
valuable
resources that are part of
the Invisible Web will be beyond their
reach. Fortunately, we have other workmanlike tools
that can help us
navigate the portion of the
Web that the search engines can't see.
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
Chris Sherman is the Web
Search Guide for About.com,
<http://websearch.about.com>.
Chris holds an MA from Stanford
University in Interactive
Educational Technology and has worked in the
Internet/Multimedia industry
for two decades, currently as President
of Searchwise.net, a Web
consulting and training firm. He's a
frequent contributor to
information industry trade publications
including Online Magazine
and Information Today. His email address is