NTU MSc(IS) dissertation proposal
NAME : CHU HO CHEUNG, DOMINIC
MATRIC ID : 980012H74
TOPIC : AUTOMATIC DOCUMENT CLASSIFICATION THROUGH NEURAL NETWORKS
INTRODUCTION
In the information age, hundreds of thousands of new articles are produced every year in printed media such as books, journals, newspapers and magazines, and millions of new web pages are created in the same timeframe. While these create a critical mass of human knowledge in all disciplines, they must be organized in a way that facilitates retrieval. Various classification schemes are in use; the dominant one in academic libraries is the Library of Congress Classification (LCC), which is also applied in World Wide Web initiatives such as CyberStack.
Document classification is a popular research area, and there are many approaches to building an automatic mechanism, such as statistical methods and artificial intelligence. This project focuses on applying neural networks to classify textual documents. Once a class is assigned to a document, the searcher can use it as the starting point for seeking related information sources of the same class in any media.
THE STATEMENT OF THE PROBLEM
The purpose of this research is to build a mechanism for automatic document classification using artificial neural networks (ANN), or neural networks for short. The project will analyze a database of Library of Congress (LC) catalog records and construct a neural network to determine which keywords are associated with which LC class. Afterwards, textual documents such as newspaper articles can be fed into the neural network to determine which LC class number should be assigned.
This involves extracting the keywords from the title field of the LC catalog records. A stop list will be used to filter out common words that do not help represent the concepts of the records.
Once the keywords are extracted to represent a record, they will serve as the inputs to a neural network, and the output will be the LC class. A process to train the neural network should be developed. Construction of the neural network will be complete once the training cycles finish.
Once the neural network is ready for use, a process should be developed to handle new documents (e.g. newspaper articles), from extracting the keywords to assigning the LC class using the neural network built.
HYPOTHESIS AND ASSUMPTIONS
The success of classifying textual documents depends on the following:
DELIMITATIONS
RESEARCH METHODOLOGY
By nature this is an experimental type of research. Computer software tools will be used extensively throughout the project to process the data. The following elaborates on the proposed approach to each subproblem in the project.
The following is a sample of the LC records to be used:
LC CALL NO.: LB1737.U6 A75 1998
FORM OF MATERIAL: Book
LCCN: 97-3421
TITLE: Teaching in the secondary school : an
introduction /
EDITION: 4th ed.
PUBLISHED: Upper Saddle River, N.J. : Merill/Prentice
Hall, c1998.
DESCRIPTION: p. cm.
NAMES: Armstrong, David G. (MAIN ENTRY)
Savage, Tom V. (ADDED ENTRY)
Armstrong, David G. Secondary education. (ADDED
ENTRY)
SUBJECTS: High school teaching--United States.
Education, Secondary--United States.
STANDARD NO: ISBN: 0-13-496498-5
DEWEY CLASS NO.: 373.1102 ED: 21
NOTES: Rev. ed. of: Secondary education. 3rd ed.
Includes bibliographical references and index.
PROCESSING INFORMATION
RECORD STATUS: New Record
TYPE OF RECORD: Language material
BIB LEVEL: Monograph/item
ENC LEVEL: Prepublication level
DESC CAT FORM: AACR 2
DATE ENTERED: 970113
DATE LAST TRANS: 19970116091251.8
CATALOGING SOURCE: Library of Congress
MODIFIED RECORD: Not modified
The researcher is interested in only the LC CALL NO. and the TITLE fields. A tool has to be identified to extract them from each record. If a tool cannot be found, the researcher will develop a Visual Basic program to perform the field extraction.
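As an illustration only, the field extraction step could be sketched as follows. Python is used here for brevity; it is not the Visual Basic program mentioned above. The function name `extract_fields` and the parsing rules (a field starts with an upper-case label followed by a colon, and an unlabeled line continues the previous field) are assumptions based on the sample record, not a confirmed description of the LC export format.

```python
import re

# Assumption: a field line looks like "LABEL: value", where LABEL consists of
# upper-case letters, spaces and dots (as in the sample record above).
FIELD_LINE = re.compile(r"^([A-Z][A-Z .]*?)\s*:\s*(.*)$")


def extract_fields(record_text, wanted=("LC CALL NO.", "TITLE")):
    """Return the wanted fields of one catalog record as a dict."""
    fields = {}
    current = None
    for line in record_text.splitlines():
        line = line.strip()
        if not line:
            continue
        match = FIELD_LINE.match(line)
        if match:
            # A new labeled field starts here.
            current = match.group(1)
            fields[current] = match.group(2).strip()
        elif current is not None:
            # Assumption: an unlabeled line continues the previous field.
            fields[current] += " " + line
    return {name: fields.get(name, "") for name in wanted}


SAMPLE = """LC CALL NO.: LB1737.U6 A75 1998
FORM OF MATERIAL: Book
TITLE: Teaching in the secondary school : an
introduction /
EDITION: 4th ed."""

print(extract_fields(SAMPLE))
```

A continuation line that itself begins with an upper-case word followed by a colon would be misread as a new field, so the rule would need checking against the actual source data.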
Next, a keyword extraction tool should be identified to extract the keywords from the TITLE field. A stop list will be used to filter out common words that do not contribute to representation of the concepts of the records. Again, if a tool cannot be found, the researcher is prepared to develop a Visual Basic program to perform the function.
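A minimal sketch of keyword extraction with a stop list is given below, assuming that a keyword is simply any alphabetic word in the title that is not on the stop list. The stop list shown is a tiny illustrative one, and the name `extract_keywords` is hypothetical; the actual stop list and tool are still to be selected.

```python
import re

# Illustrative stop list only; the real list is still to be chosen.
STOP_WORDS = {"a", "an", "and", "for", "in", "of", "on", "the", "to", "with"}


def extract_keywords(title, stop_words=STOP_WORDS):
    """Split a title into alphabetic words and drop those on the stop list."""
    keywords = []
    for word in re.findall(r"[A-Za-z]+", title):
        # Comparison is case-insensitive; the original spelling is kept.
        if word.lower() not in stop_words:
            keywords.append(word)
    return keywords


print(extract_keywords("Teaching in the secondary school : an introduction /"))
```

Punctuation such as the colon and the trailing slash in the TITLE field is discarded because only runs of letters are treated as words.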
The output from this subproblem will be a collection of extracts, with each record in the following format:
Keyword 1
Keyword 2
..
Keyword n
# <- end of keyword list
LC Class
% <- end of record marker
A possible extract of the example record is as follows:
Teaching
secondary
school
introduction
#
LB
%
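The extract format above could be produced by a small routine such as the following sketch. It assumes that the LC class is the leading run of letters in the call number (so that LB1737.U6 A75 1998 gives LB, as in the example); the function names `lc_class` and `format_extract` are hypothetical.

```python
import re


def lc_class(call_number):
    """Return the leading letters of an LC call number, e.g. 'LB'.

    Assumption: the class is the alphabetic prefix of the call number.
    """
    match = re.match(r"[A-Za-z]+", call_number.strip())
    return match.group(0).upper() if match else ""


def format_extract(keywords, call_number):
    """Render one record: keywords, '#', LC class, then the '%' marker."""
    lines = list(keywords) + ["#", lc_class(call_number), "%"]
    return "\n".join(lines)


print(format_extract(["Teaching", "secondary", "school", "introduction"],
                     "LB1737.U6 A75 1998"))
```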
An extract record is admissible only if at least one keyword is extracted; otherwise the record will be dropped from the collection.
The resulting LC catalog record extracts will be arbitrarily divided into two collections: one for training and another for testing.
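The admissibility check and the division into two collections might look like the sketch below. It interprets "arbitrarily" as a seeded random shuffle, and the 50/50 split is a placeholder, since the collection sizes are still to be decided. The function name `split_extracts` and its parameters are hypothetical.

```python
import random


def split_extracts(extracts, train_fraction=0.5, seed=0):
    """Drop records without keywords, then split the rest at random.

    `extracts` is a list of (keywords, lc_class) pairs.
    `train_fraction` and `seed` are illustrative assumptions.
    """
    # Admissibility: at least one keyword must have been extracted.
    admissible = [(kw, cls) for kw, cls in extracts if len(kw) >= 1]
    rng = random.Random(seed)
    rng.shuffle(admissible)
    cut = int(len(admissible) * train_fraction)
    return admissible[:cut], admissible[cut:]


records = [(["Teaching", "secondary"], "LB"), ([], "QA"), (["networks"], "QA")]
training, testing = split_extracts(records)
print(len(training), len(testing))
```

Fixing the seed keeps the split reproducible between runs, which makes later accuracy figures comparable.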
The size of the collection is pending further study of the available source data.
A multilayer feedforward network with backpropagation is chosen, as it is one of the oldest and most established supervised learning models. The tentative number of layers in the neural network is four: one input layer, two hidden layers and one output layer. The keywords in the training collection will serve as the input for building and training the neural network. The actual LC class of each record will provide the target output for the backpropagation learning process.
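To make the proposed architecture concrete, the following is a small sketch of a four-layer feedforward network trained by online backpropagation, written in plain Python. It is not how Trajan works internally. The binary keyword encoding, sigmoid units, squared-error criterion, learning rate, weight initialization and hidden layer sizes are all assumptions chosen for illustration, and the names `encode` and `FeedforwardNet` are hypothetical.

```python
import math
import random


def encode(keywords, vocabulary):
    """Binary input vector: 1.0 where a (lower-case) vocabulary word occurs."""
    present = {k.lower() for k in keywords}
    return [1.0 if word in present else 0.0 for word in vocabulary]


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


class FeedforwardNet:
    """Fully connected sigmoid network trained by online backpropagation."""

    def __init__(self, sizes, seed=0):
        rng = random.Random(seed)
        # weights[l][j] holds the incoming weights of unit j in layer l + 1;
        # the last entry of each row is the bias weight.
        self.weights = [
            [[rng.uniform(-0.5, 0.5) for _ in range(n_in + 1)]
             for _ in range(n_out)]
            for n_in, n_out in zip(sizes[:-1], sizes[1:])
        ]

    def forward(self, inputs):
        """Return the activations of every layer, input layer first."""
        activations = [list(inputs)]
        for layer in self.weights:
            prev = activations[-1] + [1.0]  # constant input for the bias
            activations.append(
                [sigmoid(sum(w * a for w, a in zip(row, prev)))
                 for row in layer])
        return activations

    def train_one(self, inputs, targets, rate=0.5):
        """One backpropagation step on a single (inputs, targets) pair."""
        activations = self.forward(inputs)
        # Output error terms for squared error with sigmoid units.
        deltas = [(o - t) * o * (1.0 - o)
                  for o, t in zip(activations[-1], targets)]
        for l in range(len(self.weights) - 1, -1, -1):
            prev = activations[l] + [1.0]
            layer = self.weights[l]
            below = []
            if l > 0:
                # Error terms of the layer below, using pre-update weights.
                below = [
                    sum(layer[j][i] * deltas[j] for j in range(len(layer)))
                    * prev[i] * (1.0 - prev[i])
                    for i in range(len(prev) - 1)
                ]
            for j, row in enumerate(layer):
                for i in range(len(row)):
                    row[i] -= rate * deltas[j] * prev[i]
            deltas = below

    def predict(self, inputs):
        """Index of the output unit with the highest activation."""
        outputs = self.forward(inputs)[-1]
        return outputs.index(max(outputs))


vocabulary = ["teaching", "secondary", "school", "introduction"]
net = FeedforwardNet([len(vocabulary), 3, 3, 2])  # input, 2 hidden, output
x = encode(["Teaching", "school"], vocabulary)
net.train_one(x, [1.0, 0.0])  # target: one output unit per LC class
print(net.predict(x))
```

Here each output unit stands for one LC class, so the target vector has a 1.0 at the position of the record's actual class and 0.0 elsewhere.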
A neural network tool, Trajan 4.0, will be used. It is a fully featured neural network simulation package that supports a wide range of neural network types and training algorithms, and provides graphical and statistical feedback on neural network performance.
The criteria for completion of training are pending further study of the neural network model and the capability of the tools selected.
Upon completion of the training stage, the testing collection will be fed into the neural network to measure the accuracy of classification of new records as a percentage. Detailed analysis and presentation of results are pending further consideration. A possible option is to use a statistical tool to analyse the results.
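The accuracy measure described here can be stated precisely as the share of test records whose predicted LC class equals the actual LC class. A minimal sketch follows; the function name `accuracy_percent` and the sample class labels are made up for illustration.

```python
def accuracy_percent(predicted, actual):
    """Percentage of records whose predicted class equals the actual class."""
    if len(predicted) != len(actual):
        raise ValueError("predicted and actual must have the same length")
    if not actual:
        return 0.0  # convention chosen here for an empty test collection
    correct = sum(1 for p, a in zip(predicted, actual) if p == a)
    return 100.0 * correct / len(actual)


# Made-up labels: three of the four predictions match the actual class.
print(accuracy_percent(["LB", "QA", "LB", "HF"], ["LB", "QA", "PN", "HF"]))
```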
Tools similar to those in the first two subproblems will be used. The new articles will tentatively come from newspapers, which can be obtained online or from CD-ROM sources. Keywords of a particular article will be extracted and fed into the neural network, and the output will be the LC class that should be assigned to it.
The size of the collection of new articles is pending further study of their availability and format, with consideration of potential difficulty in keyword extraction.
RESOURCES REQUIREMENT
PROJECT PLAN
| WEEK | ACTIVITIES |
| --- | --- |
| Subject study on the project topic | |
| Year 2 semester 1 | |
| Week 2 - 5 | Literature review |
| Week 2 | Submission/confirmation of dissertation topic and preparation of draft work plan to supervisor |
| Week 3 - 5 | Analyze source data; Evaluation and selection of tools |
| Week 5 - 11 | Collection of new articles for subproblem 3 |
| Week 6 - 7 | Tackle subproblem 1 |
| Week 8 - 11 | Tackle subproblem 2 |
| Week 12 - 14 | Tackle subproblem 3 |
| Review results and refine solutions when necessary | |
| Year 2 semester break | Write-up chapters on: |
| Year 2 semester 2 | |
| Week 2 | Submission of progress report |
| Week 3 - 8 | Review results and refine solutions when necessary |
| Week 9 | Submission of final draft of dissertation to supervisor |
| Week 13 | Submission of two soft-bound copies of dissertation for examination (using form R/768/96) |
| Week 14 - 17 | Dissertation examination and revision |
| Week 22 | Submission of 3 hard-bound copies to Division office and a softcopy on 3.5" disk |
REFERENCES
Haykin, S. (1999). Neural Networks: A Comprehensive Foundation. New York: Macmillan Publishing.
Patterson, D. (1996). Artificial Neural Networks. Singapore: Prentice Hall.
Chester, M. (1993). Neural Networks, A Tutorial. New Jersey: Prentice Hall.
Landau, L. (1998). Concepts for Neural Networks. London: Springer-Verlag.
Sarle, W. ed. (1997). Neural Network FAQ, part 2 of 7: Learning. [Online]. Available: ftp://ftp.sas.com/pub/neural/FAQ2.html