H6303 INFORMATION STORAGE AND RETRIEVAL

ASSIGNMENT 1

NAME: CHU HO CHEUNG DOMINIC

COMPARE AND CONTRAST THE ESSENTIAL FEATURES OF THE DB/TEXT AND DTSEARCH SOFTWARE

This report discusses the database design and creation features and the search and retrieval features of two information retrieval (IR) software packages, namely Inmagic DB/Text and dtSearch. In each case, a small database is built from 4 bibliographic records taken from LISA Plus.

 

A) Inmagic DB/Text

Database design and creation

First, define the textbase (the text database structure): specify a name for the textbase and define its fields.

i) Defining fields

Define one field after another. For each field, specify the field name, the field type and the indexing method (term, word or both).

The following is an example of the textbase structure for a simple bibliographic database:

Textbase name: 6303a1

Field name          Field type   Indexing (T = term, W = word)

Author last name    Text         TW
Author first name   Text         TW
Title               Text         TW
Journal             Text         TW
Volume              Number       T
Issue               Number       T
Year                Number       T
Page no             Number       T
ISSN                Number       T
Keywords            Text         TW
Abstract            Text         TW
Photo               Image        T
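The difference between the two indexing methods can be illustrated with a minimal Python sketch. It assumes that a term index stores each whole entry as one unit while a word index stores every individual word; the function name and the sample records are hypothetical and do not come from DB/Text itself.

```python
from collections import defaultdict


def build_indexes(records, field):
    """Return (term_index, word_index) for one field of a list of records."""
    term_index = defaultdict(set)  # whole entry -> ids of records containing it
    word_index = defaultdict(set)  # single word -> ids of records containing it
    for record_id, record in enumerate(records):
        for entry in record.get(field, []):
            # Term indexing (assumed): the complete entry is one index key.
            term_index[entry.lower()].add(record_id)
            # Word indexing (assumed): every word of the entry is an index key.
            for word in entry.lower().split():
                word_index[word].add(record_id)
    return term_index, word_index


# Hypothetical sample data modelled on the Keywords field of the example textbase.
records = [
    {"Keywords": ["World Wide Web", "Java"]},
    {"Keywords": ["Web pages", "JavaScript"]},
]
term_index, word_index = build_indexes(records, "Keywords")
```

In this sketch the word web is found in both records through the word index, but only the full entry world wide web is a key in the term index.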

In addition, validation rules can be specified for each field:

Entry Validation options:

Content Validation options:

In this example database, the entry validation options ‘Field entry required’ and ‘Unique entries only’ are chosen for Author last name, so that every record has an author last name and no two records share the same one.

Whenever the database structure is changed, DB/Text will automatically refresh the indexes where necessary (e.g. when a new field is added or the indexing type is changed for existing fields).

 

ii) Populating the records into the database


A default tabular form is provided for the user to key in the data for each record. For fields of image type, DB/Text expects the user to fill in the path and filename of the image.

In our example, if the user tries to save a new record with a blank Author last name, the warning message

Entry in field ‘Author last name’ cannot be empty.

appears, and the user must fill in the field before the record can be saved.

If the user tries to save a new record with an existing Author last name, say Blake, the warning message

Entry in field ‘Author last name’ should be unique, but is not.

appears, and the user must fill in the field with a different last name before the record can be saved.
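The two entry validation checks described above can be sketched in Python as follows. This is only an illustration of the logic, not DB/Text code: the function name is hypothetical, and treating the uniqueness check as case-insensitive is an assumption.

```python
def validate_new_record(record, existing_records, field="Author last name"):
    """Return a warning message, or None if the record may be saved."""
    value = record.get(field, "").strip()
    # "Field entry required": the field must not be blank.
    if not value:
        return f"Entry in field '{field}' cannot be empty."
    # "Unique entries only": no existing record may hold the same entry.
    # Comparing in lower case is an assumption made for this sketch.
    existing = {r.get(field, "").strip().lower() for r in existing_records}
    if value.lower() in existing:
        return f"Entry in field '{field}' should be unique, but is not."
    return None


# Hypothetical textbase that already contains one record.
existing = [{"Author last name": "Blake"}]
message = validate_new_record({"Author last name": "Blake"}, existing)
```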

 

Records can be added, updated or deleted using the default form, which lists all fields of a record; the indexes are updated on completion of each action.

For a large volume of input, the user can choose to import records from an external source, provided that the input file follows the specification of the target textbase. In our example, a valid external source is preferably an ASCII file with records separated by CR (carriage return)/LF (line feed) and fields separated by commas or tabs.

E.g.

Weiss,A.,Building bridges to Java,Internet World,9,1,1998,94,

10643923,World Wide Web,Discusses ways in which JavaScript routines can be written to produce Java applets. Presents a step by step procedure for carrying out the program,a:\7arrow1.bmp

(Assume the above is a single record on one line, without line breaks, in the external source file.)

This line will be recognized as a valid record by the express import function. To accept external sources in other formats, such as HTML or Lotus 1-2-3, a separate add-on DB/Text import filter must be installed.
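To show how such a comma-delimited line maps onto the textbase structure, here is a minimal Python sketch using the standard csv module. It is not the DB/Text import function; the function name is hypothetical, and the field order is assumed to follow the textbase definition given earlier.

```python
import csv
import io

# Field order assumed to match the example textbase structure.
FIELDS = [
    "Author last name", "Author first name", "Title", "Journal",
    "Volume", "Issue", "Year", "Page no", "ISSN",
    "Keywords", "Abstract", "Photo",
]


def parse_import_line(line):
    """Split one comma-delimited line into a field-name -> value mapping."""
    row = next(csv.reader(io.StringIO(line)))
    if len(row) != len(FIELDS):
        raise ValueError(f"expected {len(FIELDS)} fields, found {len(row)}")
    return dict(zip(FIELDS, row))


# The example record from the report, written as a raw string so that the
# backslash in the image path is kept literally.
line = (
    r"Weiss,A.,Building bridges to Java,Internet World,9,1,1998,94,"
    r"10643923,World Wide Web,Discusses ways in which JavaScript routines "
    r"can be written to produce Java applets. Presents a step by step "
    r"procedure for carrying out the program,a:\7arrow1.bmp"
)
record = parse_import_line(line)
```

A line with a missing or extra comma raises an error in this sketch, which mirrors the requirement that the external source follow the textbase structure exactly.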

 

 

Search and retrieval features

To browse all records in the database, click the Find All Records button (the globe button in the toolbar). This triggers the Find All Records command, which runs a query whose criteria select all records.

A default report of all records is displayed. Users can design their own report layout to display only the fields of interest. Besides printing the reports, the user can also write the current report to 3 different file formats, plain text, Rich Text Format (RTF) or HTML, so that the reports can be processed in other applications.

In our example, to view the image file for a record, click the Show Record Images button. The image is displayed in a separate window, in which the user can perform simple functions such as zooming in and out, rotating the image and inverting the colors.

 

Two searching interfaces are provided:

1) QBE - Query by Example

Type criteria into the boxes and optionally toggle the Boolean AND, OR, NOT buttons. Click the Execute Query button (the green Go button) to submit the search.

The user can place a word, a term, or a Boolean or proximity search query in each box to search a particular field.

2) Command query

Type the criteria as a statement, using a particular syntax. This is useful when specifying complex criteria.

Note that searches are not case-sensitive.

The following examples show searches in QBE and the corresponding command queries that achieve the same effect:

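Each example gives the QBE statement, the corresponding command query and the result.

i) Word/Phrase search

QBE statement (in the Keywords box): world wide web
Corresponding command query: find (Keywords ct world wide web)
Result: retrieves records with the phrase World Wide Web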

 
   

ii) Term search

QBE statement (in the Keywords box): Java
Corresponding command query: find (Keywords ct java)
Result: retrieves records with the term Java

 
   

iii) Truncation: right truncation only, using *

QBE statement (in the Title box): Java*
Corresponding command query: find (Title ct java*)
Result: retrieves records whose title contains a word beginning with Java, such as Java or Javascript

   

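iv) Comparison/range search

e.g. in the Year box

QBE statement: 1996
Corresponding command query: find (Year ct 1996)
Result: retrieves records published in 1996

QBE statement: >1996
Corresponding command query: find (Year >1996)
Result: retrieves records published after 1996

QBE statement: 1996:1997
Corresponding command query: find (Year ct 1996:1997)
Result: retrieves records published between 1996 and 1997 inclusive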

   

v) Proximity search

e.g. in the Keywords box

QBE statement: world w2 web
Corresponding command query: find (Keywords ct world w2 web)
Result: returns records with World Wide Web, where web appears within 2 words of world

QBE statement: world p2 web
Corresponding command query: find (Keywords ct world p2 web)
Result: returns records with World Wide Web, where world precedes web by at most 2 words
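The difference between the two proximity operators can be sketched in Python. This is an illustration only: the function names are hypothetical, and counting distance as the difference between word positions is an assumption chosen so that world and web in World Wide Web are 2 words apart, as in the examples above.

```python
import re


def _positions(words, target):
    """Positions at which the target word occurs in the word list."""
    return [i for i, word in enumerate(words) if word == target]


def within(text, first, second, n):
    """wN-style test: the two words occur within n words, in either order."""
    words = re.findall(r"\w+", text.lower())
    return any(
        abs(i - j) <= n
        for i in _positions(words, first)
        for j in _positions(words, second)
    )


def precedes(text, first, second, n):
    """pN-style test: first occurs before second, at most n words apart."""
    words = re.findall(r"\w+", text.lower())
    return any(
        0 < j - i <= n
        for i in _positions(words, first)
        for j in _positions(words, second)
    )
```

For the entry World Wide Web both tests succeed for world and web with n = 2. For an entry in which web comes before world, only the first test can succeed.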

 
   

vi) Boolean search (using AND, OR, NOT)

  • Boolean search on a particular field, e.g. the Title field

QBE statement: java / javascript
Corresponding command query: find (Title ct java / javascript)
Result: retrieves records with Java or Javascript

  • Boolean search across multiple fields

QBE statement: Author last name: Blake, AND Keywords: world wide web
Corresponding command query: find (Author last name ct blake) and (Keywords ct world wide web)
Result: retrieves records with the last name Blake and with world wide web in the Keywords field

 

 

 

 

 

B) dtSearch

 

Database design and creation

1) Creating index file

dtSearch accepts user-supplied files as the database records and indexes them, updating an index file for later search and retrieval operations.

The user can choose to create a basic index (just specify the name of the index, and the index file will reside in the dtSearch program directory) or an advanced index, where the user can specify the name and path of the index file and the options for accent-sensitive and case-sensitive indexing.

With case-sensitive indexing, Food, FOOD and food are treated as different words.

With accent-sensitive indexing, accents are taken into account when indexing words.

The index will be larger if either of the two options is turned on.
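The effect of the two options on index size can be seen in a short Python sketch. It does not reproduce how dtSearch builds its index; the function name is hypothetical, and stripping accents through Unicode decomposition is an assumption made for illustration.

```python
import unicodedata


def index_key(word, case_sensitive=False, accent_sensitive=False):
    """Return the key under which a word would be stored in the index."""
    if not accent_sensitive:
        # Decompose accented letters and drop the combining accent marks.
        decomposed = unicodedata.normalize("NFD", word)
        word = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    if not case_sensitive:
        word = word.lower()
    return word


words = ["Food", "FOOD", "food"]
insensitive_keys = {index_key(w) for w in words}
sensitive_keys = {index_key(w, case_sensitive=True) for w in words}
```

With both options off, the three spellings collapse into a single key. With case sensitivity on, they remain three separate keys, which is why a sensitive index holds more entries.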

 

2) Populating the database

 

To populate the database, the user supplies dtSearch with the files to be indexed, either one file at a time using the Add file button or a whole directory of files using the Add folder button.

File name: dtSearch supports filename filters, and the user can supply a filename pattern such as SMITH*.DOC to select files like SMITHA.DOC and SMITH11.DOC.

File formats: dtSearch automatically recognizes major word processor files, DBF files, ANSI files and ZIP files through the available filters. The user can also use the exclude filter to skip certain types of files during indexing.

Once the files are added and the OK button is pressed, the index is updated and the status is displayed. A completion alert box tells the user how many files have been added to the index.

From the pull-down menu, the user can reach the Update Index dialog box to maintain the index at any time. The index manager allows the user to copy, delete or compress an existing index.

Once the index is created, the user can click the Search button and start searching for records by composing a search request.

 

Search and retrieval features

The user can browse the whole index file, shown as a complete indexed word list giving each indexed word and its number of hits (e.g. name occurs 8 times in the example below). Common words that are not useful in searches are called noise words (e.g. must in the example shown).

The user composes the search statement in the Search request area, turns a number of search features below it on or off, and then clicks the Search button to run the search.

Presentation of search results

All search results are presented as a list of files containing indexed words that meet the search criteria. The upper pane lists information on the matching files: filename, relevance score in % or number of hits (by which the list is ordered), path of the file, date of creation and title of the file. The lower pane shows the contents of the selected file, with hits (matches for the search) highlighted in yellow. The user can press the previous / next buttons to view all the hits.

 

A search report can be generated, with options to select the amount of content to be included.

Many search features are available:

i) Stemming

e.g. a search for function will retrieve records with the words function or functions

a search for apply will retrieve records with apply or application, etc.

ii) Truncation

Wildcard search at any position is supported (* for any number of characters and ? for a single character)

java* for Java and Javascript, etc

*a?a for Lava and Java and Strata, etc
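The behaviour of the two wildcard characters can be reproduced with Python's standard fnmatch module, which uses the same meaning for * and ?. This is only a sketch of the matching rule, not of dtSearch itself; the function name is hypothetical, and fnmatch also accepts bracket patterns such as [abc], which the source does not mention for dtSearch.

```python
from fnmatch import fnmatchcase


def wildcard_match(pattern, words):
    """Return the words matching the pattern, ignoring upper/lower case."""
    lowered = pattern.lower()
    return [word for word in words if fnmatchcase(word.lower(), lowered)]


# Hypothetical indexed word list.
indexed_words = ["Java", "Javascript", "Lava", "Strata", "Web"]
right_truncated = wildcard_match("java*", indexed_words)
anywhere = wildcard_match("*a?a", indexed_words)
```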

 

iii) Phonic

Searches for the words supplied and also for words with the same pronunciation.

For example, a phonic search for program will also find programme

iv) Fuzzy

This search feature helps to find words even when they are misspelled. The degree of acceptable misspelling is controlled by the Fuzziness bar: the greater the Fuzziness, the greater the extent of misspelling that will be accepted.

This is particularly useful for documents scanned and translated from image to text using OCR where 100% recognition is not always guaranteed.

e.g. raising the Fuzziness to 3 and searching for the word Jaza will match records containing the word Java
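One common way to implement this kind of tolerance is edit distance, the number of single-character insertions, deletions and substitutions needed to turn one word into another. The Python sketch below uses that technique purely as an illustration: the source does not say which algorithm dtSearch uses, and the function names and the max_edits parameter are hypothetical and are not the Fuzziness scale.

```python
def edit_distance(a, b):
    """Levenshtein distance between two strings."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # delete a character from a
                current[j - 1] + 1,      # insert a character into a
                previous[j - 1] + cost,  # substitute (or keep) a character
            ))
        previous = current
    return previous[-1]


def fuzzy_match(query, words, max_edits):
    """Return the words within max_edits edits of the query, ignoring case."""
    return [w for w in words if edit_distance(query.lower(), w.lower()) <= max_edits]


matches = fuzzy_match("Jaza", ["Java", "Javascript", "Web"], max_edits=1)
```

In this sketch Jaza is one substitution away from Java, so it matches as soon as a single edit is tolerated.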

 

v) Synonym search

A thesaurus is used to find synonyms of words in your search request.

The thesaurus can be user defined or taken from the WordNet concept network, which can cover Synonym, Antonym (opposite meaning), Hypernym (a broader category to which the search word belongs), Hyponym (a subcategory of the search word), Meronym (a part or member of the search word) and Holonym (a whole of which the search word is a part or member).

The Search type radio button allows the user to specify whether a search is a Boolean search or a natural language search.

vi) search options exclusive to Boolean search type

a) Boolean (AND, OR, NOT)

e.g. easy AND data

retrieves records with both words easy and data

b) Proximity (w/5 – within 5 words, w/25 – within 25 words are supported)

e.g. examples W/5 action

returns records where action is within 5 words of examples

c) Range

The user is prompted with a dialog box to key in the numeric range by specifying the lower and upper bounds, which translate into the syntax lower bound~~upper bound

e.g. 1~~8

returns records indexed with numbers between 1 and 8 (inclusive).

The user can also choose to specify just a lower or an upper bound.
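The inclusive range test can be sketched in Python as follows. The function names are hypothetical. Writing an open-ended range by leaving one side of ~~ empty is an assumption made for this sketch, since the source does not show how dtSearch writes a range with only one bound.

```python
def parse_range(expression):
    """Split 'low~~high' into (low, high); an empty side means no bound."""
    low_text, separator, high_text = expression.partition("~~")
    if not separator:
        raise ValueError(f"not a range expression: {expression!r}")
    low = float(low_text) if low_text.strip() else None
    high = float(high_text) if high_text.strip() else None
    return low, high


def numbers_in_range(numbers, expression):
    """Return the numbers that fall inside the range, bounds included."""
    low, high = parse_range(expression)
    return [
        n for n in numbers
        if (low is None or n >= low) and (high is None or n <= high)
    ]


hits = numbers_in_range([0, 1, 5, 8, 9], "1~~8")
```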

d) Field search

Field search is supported provided that the fields are defined by the user. For example, for the ISSN field the beginning marker is ISSN and the end marker is the next field tag, Keywords.

The user can then use the field search option to specify the field to search.
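The idea of delimiting a field by a beginning and an end marker can be sketched in Python. This is an illustration of the marker principle only, not of how dtSearch stores field definitions; the function name is hypothetical, and the sample text follows the layout of the exported documents in the appendices.

```python
def extract_field(document, begin_marker, end_marker):
    """Return the text between the two markers, or None if the field is absent."""
    start = document.find(begin_marker)
    if start == -1:
        return None
    start += len(begin_marker)
    end = document.find(end_marker, start)
    if end == -1:
        # No end marker: assume the field runs to the end of the document.
        end = len(document)
    return document[start:end].strip()


# Fragment of an exported record, one field tag or value per line.
document = "Page no\n94\nISSN\n10643923\nKeywords\nWorld Wide Web,Java"
issn = extract_field(document, "ISSN", "Keywords")
```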

 

 

vii) search type Natural language (when Boolean, proximity, range, fields are disabled)

A natural language search request is any combination of words, phrases, or sentences. After a natural language search, dtSearch sorts retrieved documents by their relevance. Weighting of retrieved documents takes into account: the number of documents each word in your search request appears in (the more documents a word appears in, the less useful it is in distinguishing relevant from irrelevant documents); the number of times each word in the request appears in the documents; and the density of hits in each document. Noise words and search connectors like NOT and OR are ignored.

e.g.

java illustrated

returns results ranked by relevance score, expressed as a percentage
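The weighting factors listed above resemble the familiar combination of inverse document frequency and term density. The Python sketch below shows one simplified way to combine them. It is not dtSearch's actual formula, which the source does not give; the function names, the logarithmic weight and the scaling of the best score to 100% are all assumptions.

```python
import math
import re


def tokenize(text):
    """Lower-case the text and split it into words."""
    return re.findall(r"\w+", text.lower())


def rank(query, documents, noise_words=frozenset()):
    """Return (name, score in %) pairs, most relevant document first."""
    terms = [t for t in tokenize(query) if t not in noise_words]
    tokenized = {name: tokenize(text) for name, text in documents.items()}
    n_docs = len(tokenized)
    scores = {}
    for name, words in tokenized.items():
        score = 0.0
        for term in terms:
            # Words found in many documents get a lower weight.
            doc_freq = sum(1 for other in tokenized.values() if term in other)
            if doc_freq == 0:
                continue
            weight = math.log(1 + n_docs / doc_freq)
            # Density: occurrences of the word relative to document length.
            density = words.count(term) / len(words) if words else 0.0
            score += weight * density
        scores[name] = score
    best = max(scores.values(), default=0.0)
    ranked = [(name, round(100 * s / best) if best else 0) for name, s in scores.items()]
    return sorted(ranked, key=lambda item: item[1], reverse=True)


# Hypothetical miniature collection.
documents = {
    "a": "java java illustrated guide",
    "b": "java applets",
    "c": "library software",
}
results = rank("java illustrated", documents)
```

In this miniature collection the document containing both request words ranks first, and the document containing neither scores 0%.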

 

viii) More search options

The user can specify a combination of indexed or unindexed search, filtering by filename, date and size, and a limit on the number of search results.

 

ix) Search history provides a means to refine search statements by combining the current statement with any previously conducted search statement, using the following strategies.

E.g.
Current search statement: illustrated
Selected historical search statement: java

Strategy: Insert (append the current statement to the selected historical one)
Resultant search statement: java illustrated

Strategy: Broaden (apply an OR Boolean operator between the selected historical search statement and the current one)
Resultant search statement: (java) OR (illustrated)

Strategy: Narrow (apply an AND Boolean operator between the selected historical search statement and the current one)
Resultant search statement: (java) AND (illustrated)

Strategy: Exclude (apply a NOT Boolean operator to the selected historical search statement and AND it with the current one)
Resultant search statement: (NOT java) AND (illustrated)
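The four strategies amount to simple string composition, as the following Python sketch shows. The function name and the lower-case strategy keys are hypothetical; the output formats are taken from the examples above.

```python
def refine(strategy, historical, current):
    """Combine a historical search statement with the current one."""
    if strategy == "insert":
        return f"{historical} {current}"
    if strategy == "broaden":
        return f"({historical}) OR ({current})"
    if strategy == "narrow":
        return f"({historical}) AND ({current})"
    if strategy == "exclude":
        return f"(NOT {historical}) AND ({current})"
    raise ValueError(f"unknown strategy: {strategy!r}")


narrowed = refine("narrow", "java", "illustrated")
```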

 

 

 

C) Discussion

 

Database design and creation

Both packages provide an easy-to-use interface for creating a database. DB/Text allows the user to specify the structure of the database at field level, including the type, indexing method (term or word) and validation rules of each field; this is not available in dtSearch. In DB/Text the user can change the field definitions even when the textbase is already populated with records.

To populate the database with records, DB/Text relies on the data entry form, which is tedious. While an import function is available, the external source must follow a specified format that matches the textbase structure, which is not always practical when sources come in different formats and layouts. dtSearch is more flexible, as indexing can be performed on any recognizable file without stringent limits on its format.

Both packages support links to external image files. DB/Text uses a Photo field to store the location of the image file, while dtSearch associates an image file with an indexed document by filename rather than by the contents of the document. Links to another textbase at field level are available only in DB/Text.

Index management in dtSearch seems more extensive, as index compression is available to save storage space. The same feature is not available in DB/Text.

 

 

Comparison of search features between DB/Text and dtSearch

The following table compares the search features of the two packages:

Type of search: Word or phrase
DB/TextWorks: Available
dtSearch: Not available

Type of search: Term
DB/TextWorks: Available
dtSearch: Available

Type of search: Truncation
DB/TextWorks: Only right truncation using *
dtSearch: More flexible wildcard search is supported, using * for any no. of characters and ? for a single character at any position

Type of search: Comparison/range
DB/TextWorks: Use = < > <= >= : in a Term indexed field.
dtSearch: Only lower bound ~~ upper bound is supported, equivalent to <=, >= and : in DB/Text

Type of search: Proximity
DB/TextWorks: Use the proximity operators w# and p# in a Word indexed field.
dtSearch: Only w/5 and w/25 are supported, no preceding operators

Type of search: Boolean
DB/TextWorks: Type Boolean operators (& / !) between items in a box to represent and, or, and not. For example, cars&boats finds records only if they contain both words (cars and boats). cars/boats finds records that contain either word (cars or boats). cars!boats finds records about cars but not boats. Toggle the AND, OR, NOT button in front of a box to combine multiple requests.
dtSearch: AND, OR, NOT are used

Type of search: Natural language
DB/TextWorks: Not available
dtSearch: Relevance score is available

Type of search: Refining search using historical search statements
DB/TextWorks: Not available
dtSearch: Available

Type of search: Search using multiple indexes
DB/TextWorks: Available when the user searches multiple fields; the indexes of each field will be used
dtSearch: Available only when multiple index files are created

Type of search: Synonym
DB/TextWorks: Not available
dtSearch: Using user-defined or WordNet thesaurus

Type of search: Stemming
DB/TextWorks: Not available
dtSearch: Available

Type of search: Fuzzy
DB/TextWorks: Not available
dtSearch: Available

Type of search: Phonic
DB/TextWorks: Not available
dtSearch: Available

 

DB/Text is stronger in supporting phrase indexing, comparison/range search and proximity search, whereas dtSearch is stronger in truncation (wildcards supported), natural language search, synonym search and refining searches using historical search statements.

 

 

Comparison of retrieval features between DB/Text and dtSearch

DB/Text allows users to export the full search results to plain text, RTF or HTML files for further processing, while dtSearch allows the user to select the amount of context from the search results to be included in a search report.

dtSearch also lets the user browse the index file and export it to a text file. The user can speed up the composition of a search statement by choosing indexed words from the index file shown.

In terms of presentation of search results, DB/Text lets the user design the layout and choose the fields to include in reports, which is not available in dtSearch.

 

Application

DB/Text is more useful for building databases that require records to be validated against certain rules, such as bibliographic records. However, its tightly defined structure makes it less flexible in handling free-text documents, which dtSearch handles well. DB/Text is also more capable in generating reports of search results, in terms of the layout and the fields to be included.

Both packages have numerous searching capabilities, and users need training to make use of all the powerful features. Otherwise, QBE in DB/Text and natural language search in dtSearch are the best places for novice users to start searching.

 

 

 

 

-- END OF REPORT --

D) APPENDICES

Contents are extracted from LISA plus database and used to build DB/Text textbase.

Documents are then exported from records of the DB/Text textbase and indexed by dtSearch.

1) h6303a1dtf1.txt

Author last name

Weiss

Author first name

A.

Title

Building bridges to Java

Journal

Internet World

Volume

9

Issue

1

Year

1998

Page no

94

ISSN

10643923

Keywords

World Wide Web,Web sites,Authoring,Software,Java,Applets,JavaScript

Abstract

Discusses ways in which JavaScript routines can be written to produce Java applets. Presents a step by step procedure for carrying out the program.

Photo

a:\7arrow1.bmp

2) h6303a1dtf2.txt

Author last name

Hoque

Author first name

R.

Title

Brewing Javascript

Journal

Internet World

Volume

8

Issue

2

Year

1997

Page no

104

ISSN

10643923

Keywords

Javascript,Software,Authoring,Web pages,World Wide Web

Abstract

Presents an illustrated guide to the use of Javascript, introduced by Netscape with their Navigator 2.0. It is designed as a cost free, easy to learn scripting language for the tailoring and personalizing of World Wide Web (WWW) home web pages and for performing a range of other functions. Points to the advantage of Javascript over Java in its ability to manipulate the content and appearance of the web page itself. (The author may be contacted by electronic mail at [email protected]).

Photo

a:\7arrow2.bmp

 

3) h6303a1dtf3.txt

Author last name

Duval

Author first name

B. K.

Title

Microcomputer applications in the library

Journal

Library Software Review

Volume

16

Issue

3

Year

1997

Page no

164

ISSN

07425759

Keywords

Library technology,Software,World Wide Web,JavaScript

Abstract

Explains how Java, a programme language similar to C++ relates to HTML and Javascript. Examines Javascript as a means of increasing user interactivity with a Web page. Discusses Javascript fundamentals and includes several examples of it in action. Original abstract-amended.

Photo

a:\7arrow3.bmp

4) h6303a1dtf4.txt

Author last name

Blake

Author first name

P.

Title

Create Web pages with automated expertise

Journal

Information Today

Volume

13

Issue

3

Year

1996

Page no

53

ISSN

87556286

Keywords

Internet Studio,JavaScript,InTEXT,Software,Authoring,Web pages,World Wide Web

Abstract

Describes new products available to assist Web page authoring. A fresh generation of tools is being developed to make it easy to create `live' documents based on Sun's Java and its rivals. Reviews InTEXT to generate HTML pages and automatic hypertext links with a facility to view, search, summarize and retrieve data using natural language. 28 companies including Apple, Silicon Graphics and Hewlett-Packard have signed up for JavaScript, a program from Netscape to aid the creation of Java applets. Discusses Java's chief rival in World Wide Web publishing, Microsoft's Internet Studio. Discusses the implications of Internet Studio's reliance on Object Linking and Embedding.

Photo

a:\7arrow4.bmp

 

 
