NTU MSc(IS) dissertation proposal

NAME : CHU HO CHEUNG, DOMINIC

MATRIC ID : 980012H74

TOPIC : AUTOMATIC DOCUMENT CLASSIFICATION THROUGH NEURAL NETWORKS

INTRODUCTION

In the information age, hundreds of thousands of new articles are produced every year in printed media such as books, journals, newspapers and magazines, not to mention the millions of new web pages created in the same timeframe. While these create a critical mass of human knowledge in all disciplines, they must be organized in a way that facilitates retrieval. Various classification schemes are in use; the dominant one in academic library environments is the Library of Congress Classification (LCC), which is also applied in initiatives like CyberStack on the World Wide Web.

Document classification is a popular research area, and there are many approaches to building an automatic mechanism, such as statistical methods and artificial intelligence. This project focuses on applying neural networks to classify textual documents. Once a class is assigned to a document, the searcher can use it as the starting point for seeking related information sources of the same class in any media.

THE STATEMENT OF THE PROBLEM

The purpose of this research is to build a mechanism for automatic document classification using artificial neural networks (ANN), or neural networks in short. The project will analyze a database of Library of Congress (LC) catalog records and construct a neural network to determine which keywords are associated with which LC class. Afterwards, textual documents like newspaper articles can be fed into the neural network to determine which LC class should be assigned.

THE SUBPROBLEMS

1. Keyword extraction

This involves extracting the keywords from the title field of the LC catalog records. A stop list will be used to filter out common words that do not contribute to representation of the concepts of the records.

2. Construction and evaluation of the neural network

Once the keywords are extracted to represent a record, they will serve as the inputs to a neural network, and the output will be the LC class. A process to train the neural network should be developed. The construction of the neural network is complete once the training cycles are finished.

3. Assigning LC class to new articles

Once the neural network is ready for use, a process should be developed to handle new documents (e.g. newspaper articles), from extracting the keywords to assigning the LC class using the neural network built.

HYPOTHESIS AND ASSUMPTIONS

The success of classifying textual documents depends on the following:

representative keywords can be extracted from both the LC catalog records and the textual articles.

the LC catalog records used in training the neural network have accurate LC classes assigned.

the neural network can tolerate a level of ‘noise’ input when the keywords of certain records are loosely related or totally irrelevant to the LC class assigned.

the neural network built can recognize a reasonable percentage of

DELIMITATIONS

This project will be limited to classifying textual documents. There will be no attempt to classify documents in other media such as images or videos, as that requires techniques from other disciplines, such as image recognition and understanding, which are beyond the researcher’s capability and time constraints.

The tentative LC class output will be represented by the alphabetic portion of the class number (H, HA, Q, QA, etc.). This provides a high-level classification with limited dispersion, whereas the numeric portion is more prone to expansion in future editions of LCC.
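
Deriving the alphabetic portion from a call number can be sketched as follows. This is a minimal Python illustration, not part of the proposed toolset; the function name `lc_class_letters` is hypothetical, and the limit of one to three leading letters is an assumption about the shape of LC class numbers.

```python
import re

# Leading run of one to three capital letters, e.g. "LB" in "LB1737.U6 A75 1998".
# The 1-3 letter limit is an assumption made for this sketch.
_CLASS_LETTERS = re.compile(r"^\s*([A-Z]{1,3})")


def lc_class_letters(call_number):
    """Return the alphabetic class portion of an LC call number, or None."""
    match = _CLASS_LETTERS.match(call_number.upper())
    return match.group(1) if match else None


print(lc_class_letters("LB1737.U6 A75 1998"))  # LB
```

A call number that does not start with a letter yields `None`, so such records can be set aside before training.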

RESEARCH METHODOLOGY

By nature this is experimental research, and computer software tools will be used extensively throughout the project to process the data. The following sections elaborate on the proposed approach to each subproblem in the project.

1. Keyword extraction

The following is a sample of the LC records to be used

LC CALL NO.: LB1737.U6 A75 1998

FORM OF MATERIAL: Book

LCCN: 97-3421

TITLE: Teaching in the secondary school : an introduction /

EDITION: 4th ed.

PUBLISHED: Upper Saddle River, N.J. : Merrill/Prentice Hall, c1998.

DESCRIPTION: p. cm.

NAMES: Armstrong, David G. (MAIN ENTRY)

Savage, Tom V. (ADDED ENTRY)

Armstrong, David G. Secondary education. (ADDED ENTRY)

SUBJECTS: High school teaching--United States.

Education, Secondary--United States.

STANDARD NO: ISBN: 0-13-496498-5

DEWEY CLASS NO.: 373.1102 ED: 21

NOTES: Rev. ed. of: Secondary education. 3rd ed.

Includes bibliographical references and index.

PROCESSING INFORMATION

RECORD STATUS: New Record

TYPE OF RECORD: Language material

BIB LEVEL: Monograph/item

ENC LEVEL: Prepublication level

DESC CAT FORM: AACR 2

DATE ENTERED: 970113

DATE LAST TRANS: 19970116091251.8

CATALOGING SOURCE: Library of Congress

MODIFIED RECORD: Not modified

The researcher is interested in only the LC CALL NO. and the TITLE fields. A tool has to be identified to extract them from each record. If a tool cannot be found, the researcher will develop a Visual Basic program to perform the field extraction.
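
The field extraction step can be sketched as follows. Python is used here purely for illustration in place of the Visual Basic program mentioned above. The sketch assumes each field starts on a line of the form `LABEL: value`, with the label in capital letters, and that any other line continues the previous field; the name `extract_fields` is hypothetical.

```python
import re

# A field line starts with an upper-case label followed by a colon,
# e.g. "LC CALL NO.: LB1737.U6 A75 1998" (assumed layout).
FIELD_LINE = re.compile(r"^([A-Z][A-Z .]*?):\s*(.*)$")


def extract_fields(record_text, wanted=("LC CALL NO.", "TITLE")):
    """Return the wanted fields of one catalog record as a dict."""
    fields = {}
    current = None
    for raw in record_text.splitlines():
        line = raw.strip()
        if not line:
            continue
        match = FIELD_LINE.match(line)
        if match:
            current = match.group(1)
            fields[current] = match.group(2)
        elif current is not None:
            # Continuation line: append it to the field being read.
            fields[current] += " " + line
    return {label: fields.get(label, "") for label in wanted}


sample = """LC CALL NO.: LB1737.U6 A75 1998
FORM OF MATERIAL: Book
TITLE: Teaching in the secondary school : an
introduction /
EDITION: 4th ed.
"""
print(extract_fields(sample))
```

Fields that repeat within a record would overwrite each other in this sketch, which is acceptable here because only the call number and title are needed.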

Next, a keyword extraction tool should be identified to extract the keywords from the TITLE field. A stop list will be used to filter out common words that do not help represent the concepts of the records. Again, if a tool cannot be found, the researcher is prepared to develop a Visual Basic program to perform the function.
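
Stop-list filtering of a title can be sketched as follows, again in Python for illustration only. The stop list shown is a small placeholder, since the proposal does not specify one, and the decision to drop repeated words is an assumption of this sketch.

```python
import re

# Placeholder stop list for illustration; the real list is yet to be chosen.
STOP_WORDS = {"a", "an", "and", "the", "in", "of", "to", "for", "on", "with", "by"}


def extract_keywords(title, stop_words=STOP_WORDS):
    """Return the words of a title that are not on the stop list."""
    seen = set()
    keywords = []
    for word in re.findall(r"[A-Za-z]+", title):
        key = word.lower()
        if key in stop_words or key in seen:
            continue  # skip common words and repeats
        seen.add(key)
        keywords.append(word)
    return keywords


print(extract_keywords("Teaching in the secondary school : an introduction /"))
# ['Teaching', 'secondary', 'school', 'introduction']
```

Punctuation such as the ` : ` and ` /` separators in the TITLE field is discarded because only alphabetic runs are kept.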

The output from this subproblem will be a collection of extracts with each record in the following format

Keyword 1

Keyword 2

Keyword n

# <- end of keyword list

LC Class

% <- end of record marker

A possible extract of the example record is as follows:

Teaching

secondary

school

introduction

#

LB

%

An extract record is admissible if at least 1 keyword is extracted; otherwise the record will be dropped from the collection.
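
The extract format and the admissibility rule can be sketched together as follows. This is an illustrative Python sketch with hypothetical function names. It assumes the markers `#` and `%` each stand alone on a line, and it uses `LB` as the class of the sample record on the assumption that the class is the alphabetic portion of the call number LB1737.U6 A75 1998.

```python
def format_extract(keywords, lc_class):
    """Return one extract record as text, or None if it is inadmissible."""
    if not keywords:
        return None  # admissibility rule: at least one keyword is required
    return "\n".join(list(keywords) + ["#", lc_class, "%"])


def parse_extracts(text):
    """Read extract records back as (keywords, lc_class) pairs."""
    records, keywords, lc_class, in_class = [], [], None, False
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        if line == "#":        # end of keyword list
            in_class = True
        elif line == "%":      # end of record
            if keywords and lc_class:
                records.append((keywords, lc_class))
            keywords, lc_class, in_class = [], None, False
        elif in_class:
            lc_class = line
        else:
            keywords.append(line)
    return records


extract = format_extract(["Teaching", "secondary", "school", "introduction"], "LB")
print(extract)
print(parse_extracts(extract))
```

Writing and reading the same layout keeps the keyword extraction stage and the neural network stage independent of each other.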

The resultant LC catalog record extracts will be arbitrarily divided into two collections: one for training and another for testing.
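
One way to make the arbitrary division reproducible is a seeded random shuffle, sketched below in Python. The 20% test fraction and the seed value are assumptions for illustration; the proposal does not fix either.

```python
import random


def split_records(records, test_fraction=0.2, seed=0):
    """Shuffle the records and return (training, testing) lists."""
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)  # seeded so the split can be repeated
    n_test = int(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


training, testing = split_records(range(10))
print(len(training), len(testing))  # 8 2
```

Every record ends up in exactly one of the two collections, so the testing collection stays unseen during training.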

The size of the collection depends on further study of the available source data.

2. Construction and evaluation of the neural network

A multilayer feedforward network trained with backpropagation is chosen, as it is one of the oldest and most established supervised learning models. The tentative number of layers of the neural network is four: 1 input layer, 2 hidden layers and 1 output layer. The keywords in the training collection will serve as the inputs for building and training the neural network. The actual LC class of each record will serve as the target output in the backpropagation learning process.
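
The four-layer arrangement can be sketched in plain Python as follows. This is only an illustration of the technique, not the tool proposed for the project. It rests on assumptions the proposal does not state: keywords are encoded as a 0/1 vector over a fixed vocabulary, all units are sigmoid, the error is squared error, and the hidden layer sizes, learning rate and number of training cycles are arbitrary. The toy records are invented for the example.

```python
import math
import random


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


class FeedforwardNet:
    """Fully connected sigmoid network trained by plain backpropagation."""

    def __init__(self, sizes, seed=0):
        rng = random.Random(seed)
        # weights[l][j] holds the weights into unit j of layer l+1;
        # the last entry of each row is the bias weight.
        self.weights = [
            [[rng.uniform(-0.5, 0.5) for _ in range(n_in + 1)] for _ in range(n_out)]
            for n_in, n_out in zip(sizes[:-1], sizes[1:])
        ]

    def forward(self, inputs):
        """Return the activations of every layer, input layer first."""
        activations = [list(inputs)]
        for layer in self.weights:
            prev = activations[-1] + [1.0]  # constant input for the bias
            activations.append(
                [sigmoid(sum(w * a for w, a in zip(row, prev))) for row in layer]
            )
        return activations

    def train_step(self, inputs, targets, rate=0.5):
        """One backpropagation update for a single example."""
        activations = self.forward(inputs)
        deltas = [(o - t) * o * (1.0 - o) for o, t in zip(activations[-1], targets)]
        for l in range(len(self.weights) - 1, -1, -1):
            prev = activations[l] + [1.0]
            layer = self.weights[l]
            # Error terms for the layer below, computed before the update.
            lower = [
                sum(layer[j][i] * deltas[j] for j in range(len(layer)))
                * prev[i] * (1.0 - prev[i])
                for i in range(len(prev) - 1)
            ]
            for j, row in enumerate(layer):
                for i in range(len(row)):
                    row[i] -= rate * deltas[j] * prev[i]
            deltas = lower

    def predict(self, inputs):
        """Return the index of the output unit with the highest activation."""
        outputs = self.forward(inputs)[-1]
        return outputs.index(max(outputs))


def encode(keywords, vocabulary):
    """Encode keywords as a 0/1 vector over the vocabulary."""
    present = {k.lower() for k in keywords}
    return [1.0 if word in present else 0.0 for word in vocabulary]


# Toy records invented for illustration; not real catalog data.
TOY_RECORDS = [
    (["teaching", "secondary", "school"], "LB"),
    (["school", "curriculum"], "LB"),
    (["algebra", "geometry"], "QA"),
    (["algebra", "calculus"], "QA"),
]
vocabulary = sorted({k for keywords, _ in TOY_RECORDS for k in keywords})
classes = sorted({c for _, c in TOY_RECORDS})


def one_hot(label):
    return [1.0 if c == label else 0.0 for c in classes]


def total_error(net):
    """Sum of squared output errors over the toy records."""
    return sum(
        (o - t) ** 2
        for keywords, label in TOY_RECORDS
        for o, t in zip(net.forward(encode(keywords, vocabulary))[-1], one_hot(label))
    )


net = FeedforwardNet([len(vocabulary), 4, 4, len(classes)])  # 2 hidden layers
error_before = total_error(net)
for _ in range(2000):  # training cycles (arbitrary count)
    for keywords, label in TOY_RECORDS:
        net.train_step(encode(keywords, vocabulary), one_hot(label))
error_after = total_error(net)
print(f"squared error before: {error_before:.3f}, after: {error_after:.3f}")
```

With a real catalog the vocabulary would run to many thousands of words, so the size of the input layer is one of the practical questions the study of the source data has to settle.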

The neural network tool Trajan 4.0 will be used. It is a full-featured neural network simulation package that supports a wide range of neural network types and training algorithms, and provides graphical and statistical feedback on neural network performance.

The criteria for completion of training depend on further study of the neural network model and the capability of the tools selected.

Upon completion of the training stage, the testing collection will be fed into the neural network to measure the accuracy of classification of new records as a percentage. Detailed analysis and presentation of results is pending further consideration. A possible option is to use a statistical tool to analyse the results.
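
The percentage measure can be sketched as follows in Python. The per-class breakdown is an optional addition assumed for this sketch, and the class labels in the example are made up.

```python
from collections import Counter


def accuracy_percent(predicted, actual):
    """Percentage of records whose predicted class equals the actual class."""
    if len(predicted) != len(actual):
        raise ValueError("predicted and actual must have the same length")
    if not actual:
        return 0.0
    correct = sum(1 for p, a in zip(predicted, actual) if p == a)
    return 100.0 * correct / len(actual)


def misclassifications(predicted, actual):
    """Count (actual, predicted) pairs for the records that were wrong."""
    return Counter((a, p) for p, a in zip(predicted, actual) if p != a)


predicted = ["LB", "QA", "H", "QA"]   # made-up network outputs
actual = ["LB", "QA", "HA", "QA"]     # made-up catalog classes
print(accuracy_percent(predicted, actual))  # 75.0
print(misclassifications(predicted, actual))
```

Counting which classes are confused with which gives a starting point for the more detailed analysis that is still to be decided.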

3. Assigning LC class to new articles

Tools similar to those in the first two subproblems will be used. The new articles will tentatively come from newspapers, which can be obtained online or from CD-ROM sources. Keywords of a particular article will be extracted and fed into the neural network, and the output will be the LC class that should be assigned to it.
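
The flow from article text to LC class can be sketched as follows in Python. The names here are hypothetical. `score_fn` stands for whatever produces one score per class from an input vector, such as the forward pass of a trained network; in the example a hand-written stand-in is used, so the result shows the data flow only and not a trained network's behaviour.

```python
import re

# Placeholder stop list for illustration; the real list is yet to be chosen.
STOP_WORDS = {"a", "an", "and", "the", "in", "of", "to", "for", "on", "with", "by"}


def classify_article(text, vocabulary, class_labels, score_fn):
    """Return the class label with the highest score for an article."""
    words = {w.lower() for w in re.findall(r"[A-Za-z]+", text)} - STOP_WORDS
    vector = [1.0 if word in words else 0.0 for word in vocabulary]
    scores = score_fn(vector)
    best = max(range(len(scores)), key=scores.__getitem__)
    return class_labels[best]


# Made-up vocabulary, classes and scoring function for the example.
vocabulary = ["school", "teaching", "algebra"]
class_labels = ["LB", "QA"]


def stand_in_scores(vector):
    return [vector[0] + vector[1], vector[2]]


print(classify_article("New algebra syllabus announced", vocabulary,
                       class_labels, stand_in_scores))  # QA
```

Article words that are not in the training vocabulary are ignored, which is one source of the difficulty in keyword extraction for new articles.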

The size of the collection of new articles depends on further study of their availability and format, with consideration of potential difficulty in keyword extraction.

RESOURCES REQUIREMENT

PC with Pentium 200 processor or above

32-bit Windows operating system (Windows 95, 98 or NT)

Visual Basic 6.0 or above for keyword extraction

Trajan 3.0 or above for construction and training of neural network

Text processing tools like InMagic DB/Text to reduce the need to develop new programs

Statistical tools (optional for measurement of performance of the neural networks in classifying new articles)

MS Office for documentation, routine calculations and graphical presentation of results

PROJECT PLAN

WEEK	ACTIVITIES
Year 1 semester break	Subject study on the project topic
Year 2 semester 1
Week 2 - 5	Literature review
Week 2	Submission/confirmation of dissertation topic and preparation of draft work plan to supervisor
Week 3 - 5	Analyze source data; Evaluation and selection of tools
Week 5 - 11	Collection of new articles for subproblem 3
Week 6 - 7	Tackle subproblem 1
Week 8 - 11	Tackle subproblem 2
Week 12 - 14	Tackle subproblem 3
Week 15 - 17	Review results and refine solutions when necessary
Year 2 semester break	Write-up of chapters on: Methodology, Findings, Discussion
Year 2 semester 2
Week 2	Submission of progress report
Week 3 - 8	Review results and refine solutions when necessary
Week 9	Submission of final draft of dissertation to supervisor
Week 13	Submission of two soft-bound copies of dissertation for examination (using form R/768/96)
Week 14 - 17	Dissertation examination and revision
Week 22	Submission of 3 hard-bound copies to Division office and a softcopy on 3.5" disk

