http://gwis2.circ.gwu.edu/~gprice/direct.htm

By Chris Sherman

There's a big problem with most search engines, and it's one many

people aren't even aware of. The problem is that vast expanses of the

Web are completely invisible to general purpose search engines like

AltaVista, HotBot and Google. Even worse, this "Invisible Web" is in

all likelihood growing significantly faster than the visible Web

you're familiar with.

So what is this Invisible Web and why aren't search engines indexing

it? To answer this question, it's important to first define the

"visible" Web, and describe how search engines compile their indexes.

The Web was created a little over ten years ago by Tim Berners-Lee, a

researcher at the CERN high-energy physics laboratory in Switzerland.

Berners-Lee designed the Web to be platform-independent, so that

researchers at CERN could share materials residing on any type of

computer system, avoiding cumbersome and potentially costly conversion

issues. To enable this cross-platform capability, Berners-Lee created

HTML, or HyperText Markup Language - essentially a dramatically

simplified version of SGML (Standard Generalized Markup Language).

HTML documents are simple: they consist of a "head" portion, with a

title and perhaps some additional metadata describing the document,

and a "body" portion, the actual document itself. The simplicity of

this format makes it easy for search engines to retrieve HTML

documents, index every word on every page, and store them in huge

databases that can be searched on demand.
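
As a rough illustration of the indexing step just described, the following Python sketch pulls the title and body words out of two HTML documents and builds a tiny word-to-page index. The sample pages, their URLs and the names PageParser and build_index are invented for this example; it is a simplified picture, not a description of how any particular engine works.

```python
from collections import defaultdict
from html.parser import HTMLParser
import re


class PageParser(HTMLParser):
    """Collects the title text and the words found in the body."""

    def __init__(self):
        super().__init__()
        self.title = ""
        self.words = []
        self._in_title = False
        self._in_body = False

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self._in_title = True
        elif tag == "body":
            self._in_body = True

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False
        elif tag == "body":
            self._in_body = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data
        elif self._in_body:
            # Index every lowercase word or number in the body text.
            self.words.extend(re.findall(r"[a-z0-9]+", data.lower()))


def build_index(pages):
    """Map each word to the set of URLs whose body contains it."""
    index = defaultdict(set)
    for url, html in pages.items():
        parser = PageParser()
        parser.feed(html)
        parser.close()
        for word in parser.words:
            index[word].add(url)
    return index


# Hypothetical sample pages (URL -> HTML).
PAGES = {
    "http://example.com/a": "<html><head><title>Spiders</title></head>"
                            "<body><p>Spiders crawl the Web.</p></body></html>",
    "http://example.com/b": "<html><head><title>Databases</title></head>"
                            "<body><p>Databases hide behind the Web.</p></body></html>",
}

index = build_index(PAGES)
print(sorted(index["web"]))      # both pages mention "web"
print(sorted(index["spiders"]))  # only the first page does
```

Looking a word up in the index is then a single dictionary access, which is the sense in which the stored pages "can be searched on demand."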

What's less easy is the task of actually finding all the pages on the

Web. Search engines use automated programs called spiders or robots

to "crawl" the Web and retrieve pages. Spiders function much like a

hyper-caffeinated Web browser - they rely on links to take them from

page to page.
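
The link-following behavior can be sketched in a few lines of Python. Everything here is an assumption made for illustration: the four-page SITE, the LinkParser and crawl names, and the choice of a breadth-first visiting order. The sketch reads pages from a dictionary rather than the network so that it runs anywhere.

```python
from collections import deque
from html.parser import HTMLParser


class LinkParser(HTMLParser):
    """Collects the href value of every <a> tag on a page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


# Hypothetical site (path -> HTML). Nothing links to /orphan.html.
SITE = {
    "/index.html": '<a href="/about.html">About</a> <a href="/news.html">News</a>',
    "/about.html": '<a href="/index.html">Home</a>',
    "/news.html": '<a href="/about.html">About</a> <a href="/missing.html">Gone</a>',
    "/orphan.html": "<p>No page links here.</p>",
}


def crawl(start, fetch):
    """Breadth-first crawl: visit a page, then queue every link not yet seen."""
    seen = {start}
    queue = deque([start])
    visited = []
    while queue:
        path = queue.popleft()
        html = fetch(path)
        if html is None:  # broken link, nothing to retrieve
            continue
        visited.append(path)
        parser = LinkParser()
        parser.feed(html)
        for link in parser.links:
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return visited


found = crawl("/index.html", SITE.get)
print(found)
print(sorted(set(SITE) - set(found)))  # pages the spider never reached
```

Because the spider only ever learns about a page from a link on another page, the page that nothing links to is never retrieved, even though it sits on the same site.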

Crawling is a resource-intensive operation. It also puts a certain

amount of demand on the host computers being crawled. For these

reasons, search engines will often limit the number of pages they

retrieve and index from any given Web site. It's tempting to think

that these unretrieved pages are part of the Invisible Web, but they

aren't. They are visible and indexable, but the search engines have

made a conscious decision not to index them.

In recent months, much has been made of these overlooked pages. Many

of the major engines are making serious efforts to include them and

make their indexes more comprehensive. Unfortunately, the engines

have also discovered through their "deep crawls" that there's a

tremendous amount of duplication and spam on the Web. Current

estimates put the Web at about 1.2 to 1.5 billion indexable pages.

Both Inktomi and AltaVista have claimed that they've spidered most of

these documents, but have been forced to cull their indexes to cope

with duplicates and spam. Inktomi puts the size of the distilled Web

at about 500 million pages; AltaVista at about 350 million.
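
One simple way to picture the culling of duplicates is to fingerprint each page's text and keep only the first page seen for each fingerprint. The Python sketch below works under that assumption alone. The engines' actual methods for detecting duplicates and spam are not described here, and the sample pages and the names fingerprint and cull_duplicates are invented.

```python
import hashlib


def fingerprint(text):
    """Hash the text after lowercasing it and collapsing whitespace."""
    normalized = " ".join(text.lower().split())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def cull_duplicates(pages):
    """Keep the first URL seen for each distinct fingerprint."""
    kept = {}
    seen = set()
    for url, text in pages.items():
        digest = fingerprint(text)
        if digest not in seen:
            seen.add(digest)
            kept[url] = text
    return kept


# Hypothetical crawl results (URL -> page text); the mirror copy differs
# from the original only in capitalization and spacing.
PAGES = {
    "http://example.com/a": "The Invisible Web is growing.",
    "http://mirror.example.com/a": "the  invisible web is   growing.",
    "http://example.com/b": "Spiders follow links.",
}

distilled = cull_duplicates(PAGES)
print(list(distilled))
```

Exact fingerprints of this kind only catch copies that are identical after normalization; pages that differ by even one word would need a fuzzier comparison.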

But these numbers don't include Web pages that can't be indexed, or

information that's available via the Web but isn't accessible by the

search engines. This is the stuff of the Invisible Web.

Why can't some pages be indexed? The most basic reason is that there

are no links pointing to a page that a search engine spider can

follow. Or, a page may be made up of data types that search engines

don't index - graphics, CGI scripts, Macromedia Flash or PDF files,

for example.

But the biggest part of the Invisible Web is made up of information

stored in databases. When an indexing spider comes across a database,

it's as if it has run smack into the entrance of a massive library

with securely bolted doors. Spiders can record the library's address,

but can tell you nothing about the books, magazines or other documents

it contains.
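
The bolted-door effect can be shown with a small Python sketch. The catalog, the search form and the search function are all hypothetical stand-ins for a database-backed site. The point is only that the page a spider retrieves contains a form and no links, so the records behind it never come into view.

```python
from html.parser import HTMLParser

# Hypothetical catalog held in a database, reachable only through a query.
BOOKS = {"moby dick": "Herman Melville", "walden": "Henry David Thoreau"}

# The only static page the site exposes: a search form with no links.
SEARCH_PAGE = (
    '<form action="/search" method="get">'
    '<input type="text" name="title"><input type="submit">'
    "</form>"
)


def search(title):
    """What the site returns when a person types a title into the form."""
    author = BOOKS.get(title.lower())
    return f"<p>{title}: {author}</p>" if author else "<p>No match.</p>"


class LinkParser(HTMLParser):
    """Collects the href value of every <a> tag, as a spider would."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self.links.extend(v for k, v in attrs if k == "href" and v)


spider = LinkParser()
spider.feed(SEARCH_PAGE)
print(spider.links)      # nothing for the spider to follow
print(search("Walden"))  # a person using the form gets an answer
```

The spider can record the address of the form page, but since it does not type queries into forms, the records themselves stay out of the index.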

There are thousands - perhaps millions - of databases containing

high-quality information that are accessible via the Web. But in

order to search them, you typically must visit the Web site that

provides an interface to the database. The advantage to this direct

approach is that you can use search tools that were specifically

designed to retrieve the best results from the database. The

disadvantage is that you need to find the database in the first place,

a task the search engines may or may not be able to help you with.

Another problem is that content in some databases isn't designed to be

directly searchable. Instead, Web developers are taking advantage of

database technology to offer customized content that's often assembled

on the fly. Search engine results pages are an example of this type of

dynamically generated content - so are services like My Excite and My

Yahoo. As Web sites get more complex and users demand more

personalization, this trend toward dynamically generated content will

accelerate, making it even harder for search engines to create

comprehensive Web indexes.

In a nutshell, the Invisible Web is made up of content

that search engines either can't or won't index. It's a huge part of

the Web, and it's growing. Fortunately, there are several reasonably

thorough guides to the Invisible Web.

Gary Price, Reference Librarian at the Gelman Library at George

Washington University, is considered one of the foremost authorities

on online databases and other invaluable search resources on the

Invisible Web. Price has assembled a massive collection of links to

Invisible Web resources at his Direct Search page

<http://gwis2.circ.gwu.edu/~gprice/direct.htm>.

"A good librarian would not start looking for a phone number

(specialized, Invisible Web info) by searching the Encyclopaedia

Britannica (general knowledge resource)," says Price. "Both

professional and casual searchers should at least be aware that they

could be missing some information or wasting time finding what could

be found more easily if the right tool for the job is easily

accessible. This is very similar to a good reference librarian

'knowing' the major reference tools in his or her collection."

What kinds of databases does Price consider to be essential Invisible

Web search tools? He names four as examples:

- The many databases that make up GPO Access.

<http://www.access.gpo.gov/su_docs/aces/aaces002.html>

- Any of the telephone directory databases such as Anywho

<http://www.anywho.com/>, Switchboard <http://www.switchboard.com/>,

and Phone Net U.K. <http://www.bt.com/phonenetuk/>.

And two that are crucial to the business searcher:

- Any of the many flavors of EDGAR, particularly the 10K Wizard.

<http://www.tenkwizard.com/>

- The Mercury Center searchable version of the PricewaterhouseCoopers

Money Tree Survey of venture capital made available by the San Jose

Mercury News. <http://wwdyn.mercurycenter.com/business/moneytree/>

In addition to text media, the Internet is serving up many other

formats. "One that interests me a great deal is streaming media. One

experimental project that is noteworthy is the Speechbot engine that

is being developed and tested by Compaq," says Price.

<http://speechbot.research.compaq.com/>

Two other Invisible Web resources Price maintains are his NewsCenter

<http://gwis2.circ.gwu.edu/~gprice/newscenter.htm>, which focuses on

sources providing up-to-the-minute news stories on any subject

imaginable, and his Web Audio Current Awareness Resources page

<http://gwis2.circ.gwu.edu/~gprice/audio.htm>, with links to hundreds

of live and recorded audio/video news and public affairs programming

on the Web.

"By the way, do not mistake an interest in the Invisible Web as a slam

on the general search engines because it is NOT," says Price. "General

search tools are still 100% essential for accessing material on the

Internet."

One of the largest gateways to the Invisible Web is the aptly named

Invisibleweb.com <http://www.invisibleweb.com> from Intelliseek.

"Invisible Web sources are critical because they provide users with

specific, targeted information, not just static text or HTML pages,"

says Sundar Kadayam, CTO and Co-Founder, Intelliseek.

"InvisibleWeb.com is a Yahoo-like directory. It is a high quality,

human edited and indexed, collection of highly targeted databases that

contain specific answers to specific questions," says Kadayam.

Intelliseek also makes BullsEye, a desktop-based meta search engine

that can also access many of the sites included in InvisibleWeb.com.

More information can be found at

<http://www.intelliseek.com/prod/bullseye.htm>.

Other notable Invisible Web resources include:

AlphaSearch

<http://www.calvin.edu/library/searreso/internet/as/>

AlphaSearch is an extremely useful directory of "gateway" sites that

collect and organize Web sites that focus on a particular subject.

Created and maintained by the Hekman Library at Calvin College, it's

both searchable and browsable by either subject discipline or

descriptor.

The Big Hub

<http://www.thebighub.com/>

The Big Hub maintains a directory of over 1,500 subject-specific

searchable databases in over 300 categories. Listings for each

database feature both annotations and search forms to directly access

the database. While these are useful for quick and dirty searches,

Big Hub's search forms omit most advanced searching features offered

by each database on its own site.

Infomine Multiple Database Search

<http://infomine.ucr.edu/search.phtml>

Infomine might be called an "academic" search engine, focusing on

scholarly resource collections, electronic journals and books, online

library card catalogs, and directories of researchers. Unlike many

Invisible Web search tools, Infomine allows simultaneous searching of

multiple databases.

WebData.com

<http://www.webdata.com/>

WebData is a database portal, specializing in finding, categorizing

and organizing online databases, and providing annotated links with

quality rankings.

As fast as the Web has been growing over the past ten years, it's

likely that its growth rate is accelerating, perhaps exponentially.

Speaking at the NetWorld+Interop conference in May 2000, Inktomi CEO

David Peterschmidt said he expected the Web to grow to more than 8

billion documents by the end of the year - more than a fivefold

increase from its current size.
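
The "fivefold" figure can be checked against the size estimates quoted earlier in this article (1.2 to 1.5 billion indexable pages). The variable names in this short Python check are just labels for those numbers.

```python
# Estimates quoted in the article, in pages.
current_low = 1.2e9
current_high = 1.5e9
projected = 8e9

# Growth factor against the high and low ends of the current estimate.
growth_vs_high = projected / current_high
growth_vs_low = projected / current_low

print(round(growth_vs_high, 1))  # 5.3
print(round(growth_vs_low, 1))   # 6.7
```

Even measured against the larger estimate of today's Web, the projection works out to a bit more than five times the current size.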

The major search engines have done a creditable job of scaling with

the visible Web. For the foreseeable future, however, valuable

resources that are part of the Invisible Web will be beyond their

reach. Fortunately, we have other workmanlike tools that can help us

navigate the portion of the Web that the search engines can't see.

- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -

Chris Sherman is the Web Search Guide for About.com,

<http://websearch.about.com>. Chris holds an MA from Stanford

University in Interactive Educational Technology and has worked in the

Internet/Multimedia industry for two decades, currently as President

of Searchwise.net, a Web consulting and training firm. He's a

frequent contributor to information industry trade publications

including Online Magazine and Information Today.