| [1] |
Akiko N. Aizawa.
Linguistic techniques to improve the performance of automatic text
categorization.
In NLPRS, pages 307-314, 2001. [ bib | .pdf | .pdf ] |
| [2] |
Maria-Luiza Antonie and Osmar R. Zaïane.
Text document categorization by term association.
In ICDM, pages 19-26, 2002. [ bib | http | .pdf ] A good text classifier is a classifier that efficiently categorizes large sets of text documents in a reasonable time frame and with an acceptable accuracy, and that provides classification rules that are human readable for possible fine-tuning. If the training of the classifier is also quick, this could become in some application domains a good asset for the classifier. Many techniques and algorithms for automatic text categorization have been devised. According to published literature, some are more accurate than others, and some provide more interpretable classification models than others. However, none can combine all the beneficial properties enumerated above. In this paper, we present a novel approach for automatic text categorization that borrows from market basket analysis techniques using association rule mining in the data-mining field. We focus on two major problems: (1) finding the best term association rules in a textual database by generating and pruning; and (2) using the rules to build a text classifier. Our text categorization method proves to be efficient and effective, and experiments on well-known collections show that the classifier performs well. In addition, training as well as classification are both fast and the generated rules are human readable. |
| [3] |
V. Richard Benjamins, Dieter Fensel, and Asunción Gómez-Pérez.
Knowledge management through ontologies.
In PAKM, 1998. [ bib | .ps | .pdf ] |
| [4] |
Caterina Caracciolo, Willem Robert van Hage, and Maarten de Rijke.
Towards topic driven access to full text documents.
In ECDL, pages 495-500, 2004. [ bib | http | .pdf ] We address the issue of providing topic driven access to full text documents. The methodology we propose is a combination of topic segmentation and information retrieval techniques. By segmenting the text into topic driven segments, we obtain small and coherent documents that can be used in two ways: as a basis for automatically generating hypertext links, and as a visualization aid for the reader who is presented with a small set of focused and restricted text snippets. In the presence of a concept hierarchy, or ontology, information retrieval techniques can be used to connect the segments obtained to concepts in the ontology. In this paper we concentrate on the text segmentation phase: we describe our approach to segmentation, discuss issues related to evaluation, and report on preliminary results. |
| [5] |
Franca Debole and Fabrizio Sebastiani.
Supervised term weighting for automated text categorization.
In SAC, pages 784-788, 2003. [ bib | .pdf ] The construction of a text classifier usually involves (i) a phase of term selection, in which the most relevant terms for the classification task are identified, (ii) a phase of term weighting, in which document weights for the selected terms are computed, and (iii) a phase of classifier learning, in which a classifier is generated from the weighted representations of the training documents. This process involves an activity of supervised learning, in which information on the membership of training documents in categories is used. Traditionally, supervised learning enters only phases (i) and (iii). In this paper we propose instead that learning from the training data should also affect phase (ii), i.e. that information on the membership of training documents to categories be used to determine term weights. We call this idea supervised term weighting (STW). As an example of STW, we propose a number of "supervised variants" of tfidf weighting, obtained by replacing the idf function with the function that has been used in phase (i) for term selection. The use of STW allows the terms that are distributed most differently in the positive and negative examples of the categories of interest to be weighted highest. We present experimental results obtained on the standard Reuters-21578 benchmark with three classifier learning methods (Rocchio, k-NN, and support vector machines), three term selection functions (information gain, chi-square, and gain ratio), and both local and global term selection and weighting. |
| [6] |
Evgeniy Gabrilovich and Shaul Markovitch.
Feature generation for text categorization using world knowledge.
In IJCAI, pages 1048-1053, 2005. [ bib | .pdf | .pdf ] We enhance machine learning algorithms for text categorization with generated features based on domain-specific and common-sense knowledge. This knowledge is represented using publicly available ontologies that contain hundreds of thousands of concepts, such as the Open Directory; these ontologies are further enriched by several orders of magnitude through controlled Web crawling. Prior to text categorization, a feature generator analyzes the documents and maps them onto appropriate ontology concepts, which in turn induce a set of generated features that augment the standard bag of words. Feature generation is accomplished through contextual analysis of document text, implicitly performing word sense disambiguation. Coupled with the ability to generalize concepts using the ontology, this approach addresses the two main problems of natural language processing: synonymy and polysemy. Categorizing documents with the aid of knowledge-based features leverages information that cannot be deduced from the documents alone. Experimental results confirm improved performance, breaking through the plateau previously reached in the field. |
| [7] |
Eui-Hong Han and George Karypis.
Fast supervised dimensionality reduction algorithm with applications
to document categorization & retrieval.
In CIKM, pages 12-19, 2000. [ bib | http | .pdf ] |
| [8] |
Eui-Hong Han, George Karypis, and Vipin Kumar.
Text categorization using weight adjusted k-nearest neighbor
classification.
In PAKDD, pages 53-65, 2001. [ bib | http | .pdf ] |
| [9] |
Thomas Hofmann.
Probabilistic latent semantic indexing.
In SIGIR, pages 50-57, 1999. [ bib | .pdf ] |
| [10] |
Andreas Hotho, Steffen Staab, and Gerd Stumme.
Explaining text clustering results using semantic structures.
In PKDD, pages 217-228, 2003. [ bib | http | .pdf ] Common text clustering techniques offer rather poor capabilities for explaining to their users why a particular result has been achieved. They have the disadvantage that they do not relate semantically nearby terms and that they cannot explain how resulting clusters are related to each other. In this paper, we discuss a way of integrating a large thesaurus and the computation of lattices of resulting clusters into common text clustering in order to overcome these two problems. As its major result, our approach achieves an explanation using an appropriate level of granularity at the concept level as well as an appropriate size and complexity of the explaining lattice of resulting clusters. |
| [11] |
Alberto Lavelli, Bernardo Magnini, and Fabrizio Sebastiani.
Building thematic lexical resources by term categorization.
In SIGIR, pages 415-416, 2002. [ bib | http | .pdf ] We discuss work in progress in the semi-automatic generation of thematic lexicons by means of term categorization, a novel task employing techniques from information retrieval (IR) and machine learning (ML). Specifically, we view the generation of such lexicons as an iterative process of learning previously unknown associations between terms and themes (i.e. disciplines, or fields of activity). The process is iterative, in that it generates, for each c_i in a set of themes, a sequence of lexicons, bootstrapping from an initial lexicon L_i^0 and a set of text corpora given as input. The method is inspired by text categorization, the discipline concerned with labelling natural language texts with labels from a predefined set of themes, or categories. However, while text categorization deals with documents represented as vectors in a space of terms, we formulate the task of term categorization as one in which terms are (dually) represented as vectors in a space of documents, and in which terms (instead of documents) are labelled with themes. As a learning device, we adopt boosting, since (a) it has demonstrated state-of-the-art effectiveness in a variety of text categorization applications, and (b) it naturally allows for a form of "data cleaning", thereby making the process of generating a thematic lexicon an iteration of generate-and-test steps. |
| [12] |
Bing Liu, Yang Dai, Xiaoli Li, Wee Sun Lee, and Philip S. Yu.
Building text classifiers using positive and unlabeled examples.
In ICDM, pages 179-188, 2003. [ bib | http | .pdf ] This paper studies the problem of building text classifiers using positive and unlabeled examples. The key feature of this problem is that there is no negative example for learning. Recently, a few techniques for solving this problem were proposed in the literature. These techniques are based on the same idea, which builds a classifier in two steps. Each existing technique uses a different method for each step. In this paper, we first introduce some new methods for the two steps, and perform a comprehensive evaluation of all possible combinations of methods of the two steps. We then propose a more principled approach to solving the problem based on a biased formulation of SVM, and show experimentally that it is more accurate than the existing techniques. |
| [13] |
Tao Liu, Zheng Chen, Benyu Zhang, Wei-Ying Ma, and Gongyi Wu.
Improving text classification using local latent semantic indexing.
In ICDM, pages 162-169, 2004. [ bib | http | .pdf ] Latent Semantic Indexing (LSI) has been shown to be extremely useful in information retrieval, but it is not an optimal representation for text classification. It always degrades text classification performance when applied to the whole training set (global LSI) because this completely unsupervised method ignores class discrimination while only concentrating on representation. Some local LSI methods have been proposed to improve the classification by utilizing class discrimination information. However, their performance improvements over original term vectors are still very limited. In this paper, we propose a new local LSI method called "Local Relevancy Weighted LSI" to improve text classification by performing a separate Singular Value Decomposition (SVD) on the transformed local region of each class. Experimental results show that our method is much better than global LSI and traditional local LSI methods on classification within a much smaller LSI dimension. |
| [14] |
Alessandro Moschitti and Roberto Basili.
Complex linguistic features for text classification: A
comprehensive study.
In ECIR, pages 181-196, 2004. [ bib | http | .pdf ] Previous research on advanced representations for document retrieval has shown that statistical state-of-the-art models are not improved by a variety of different linguistic representations. Phrases, word senses and syntactic relations derived by Natural Language Processing (NLP) techniques were found ineffective at increasing retrieval accuracy. For Text Categorization (TC), fewer and less definitive studies on the use of advanced document representations are available, as it is a relatively new research area (compared to document retrieval). In this paper, advanced document representations have been investigated. Extensive experimentation on representative classifiers, Rocchio and SVM, as well as a careful analysis of the literature have been carried out to study how some NLP techniques used for indexing impact TC. Cross validation over 4 different corpora in two languages allowed us to gather overwhelming evidence that complex nominals, proper nouns and word senses are not adequate to improve TC accuracy. |
| [15] |
Terri Oda and Tony White.
Developing an immunity to spam.
In GECCO, pages 231-242, 2003. [ bib | .pdf ] Immune systems protect animals from pathogens, so why not apply a similar model to protect computers? Several researchers have investigated the use of an artificial immune system to protect computers from viruses and others have looked at using such a system to detect unauthorized computer intrusions. This paper describes the use of an artificial immune system for another kind of protection: protection from unsolicited email, or spam. |
| [16] |
Sam Scott and Stan Matwin.
Feature engineering for text classification.
In ICML, pages 379-388, 1999. [ bib | .pdf ] Most research in text classification to date has used a "bag of words" representation in which each feature corresponds to a single word. This paper examines some alternative ways to represent text based on syntactic and semantic relationships between words (phrases, synonyms and hypernyms). We describe the new representations and try to justify our hypothesis that they could improve the performance of a rule-based learner. The representations are evaluated using the RIPPER learning algorithm on the Reuters-21578 and DigiTrad test corpora. On their own the new representations are not found to produce significant performance improvements. We also try combining classifiers based on different representations using a majority voting technique, and this improves performance on both test collections. In our opinion, more sophisticated Natural Language Processing techniques need to be developed before better text representations can be produced for classification. |
| [17] |
Gerd Stumme.
Formal concept analysis on its way from mathematics to computer
science.
In ICCS, pages 2-19, 2002. [ bib | http | .pdf ] |
| [18] |
Aixin Sun and Ee-Peng Lim.
Hierarchical text classification and evaluation.
In ICDM, pages 521-528, 2001. [ bib | .pdf ] Hierarchical classification refers to the assignment of one or more suitable categories from a hierarchical category space to a document. While previous work in hierarchical classification focused on virtual category trees where documents are assigned only to the leaf categories, we propose a top-down level-based classification method that can classify documents to both leaf and internal categories. As the standard performance measures assume independence between categories, they have not considered the documents incorrectly classified into categories that are similar or not far from the correct ones in the category tree. We therefore propose the Category-Similarity Measures and Distance-Based Measures to consider the degree of misclassification in measuring the classification performance. An experiment has been carried out to measure the performance of our proposed hierarchical classification method. The results showed that our method performs well on the Reuters text collection when enough training documents are given and that the new measures have indeed considered the contributions of misclassified documents. |
| [19] |
Osmar R. Zaïane and Maria-Luiza Antonie.
Classifying text documents by associating terms with text categories.
In Australasian Database Conference, 2002. [ bib | .pdf | .pdf ] |
| [20] |
Kjersti Aas and Line Eikvil.
Text categorisation: A survey, 1999. [ bib | .ps | .pdf ] In this report we give a survey of the state-of-the-art in text categorisation. To be able to measure progress in this field, it is important to use a standardised collection of documents for analysis and testing. One such data set is the Reuters-21578 collection of newswires for the year 1987, and our survey will focus on the work on text categorisation that has used this collection for testing. |
| [21] |
Rie Kubota Ando.
Latent semantic-space: iterative scaling improves precision of
inter-document similarity measurement.
In SIGIR, pages 216-223, 2000. [ bib | http | .pdf ] We present a novel algorithm that creates document vectors with reduced dimensionality. This work was motivated by an application characterizing relationships among documents in a collection. Our algorithm yielded inter-document similarities with an average precision up to 17.8% higher than that of singular value decomposition (SVD) used for Latent Semantic Indexing. The best performance was achieved with dimensional reduction rates that were 43% [...] Our algorithm creates basis vectors for a reduced space by iteratively "scaling" vectors and computing eigenvectors. Unlike SVD, it breaks the symmetry of documents and terms to capture information more evenly across documents. We also discuss correlation with a probabilistic model and evaluate a method for selecting the dimensionality using log-likelihood estimation. |
| [22] |
Pierre Baldi, Paolo Frasconi, and Padhraic Smyth.
Modeling the Internet and the Web: Probabilistic Methods and
Algorithms.
John Wiley, 2003. [ bib | http | .pdf ] |
| [23] |
A. Basu, Carolyn R. Watters, and Michael A. Shepherd.
Support vector machines for text categorization.
In HICSS, page 103, 2003. [ bib | http | .pdf ] Text categorization is the process of sorting text documents into one or more predefined categories or classes of similar documents. Differences in the results of such categorization arise from the feature set chosen to base the association of a given document with a given category. Advocates of text categorization recognize that the sorting of text documents into categories of like documents reduces the overhead required for fast retrieval of such documents and provides smaller domains in which the users may explore similar documents. In this paper we are interested in examining whether automatic classification of news texts can be improved by prefiltering the vocabulary to reduce the feature set used in the computations. First we compare artificial neural network and support vector machine algorithms for use as text classifiers of news items. Secondly, we identify a reduction in feature set that provides improved results. |
| [24] |
Ron Bekkerman, Ran El-Yaniv, Naftali Tishby, and Yoad Winter.
Distributional word clusters vs. words for text categorization.
Journal of Machine Learning Research, 3:1183-1208, 2003. [ bib | .pdf ] We study an approach to text categorization that combines distributional clustering of words and a Support Vector Machine (SVM) classifier. This word-cluster representation is computed using the recently introduced Information Bottleneck method, which generates a compact and efficient representation of documents. When combined with the classification power of the SVM, this method yields high performance in text categorization. This novel combination of SVM with word-cluster representation is compared with SVM-based categorization using the simpler bag-of-words (BOW) representation. The comparison is performed over three known datasets. On one of these datasets (the 20 Newsgroups) the method based on word clusters significantly outperforms the word-based representation in terms of categorization accuracy or representation efficiency. On the two other sets (Reuters-21578 and WebKB) the word-based representation slightly outperforms the word-cluster representation. We investigate the potential reasons for this behavior and relate it to structural differences between the datasets. |
| [25] |
Soumen Chakrabarti.
Mining the Web. Discovering Knowledge from Hypertext Data.
Morgan Kaufmann Publishers, 2002. [ bib | .pdf ] |
| [26] |
Stephen D'Alessio, Keitha Murray, Robert Schiaffino, and Aaron Kershenbaum.
The effect of using hierarchical classifiers in text categorization.
In Proceedings of RIAO-00, 6th International Conference
"Recherche d'Information Assistee par Ordinateur", pages 302-313, Paris,
FR, 2000. [ bib | .pdf | .pdf ] Given a set of categories, with or without a preexisting hierarchy among them, we consider the problem of assigning documents to one or more of these categories from the point of view of a hierarchy with more or less depth. We can choose to make use of none, part or all of the hierarchical structure to improve the categorization effectiveness and efficiency. It is possible to create additional hierarchy among the categories. We describe a procedure for generating a hierarchy of classifiers that model the hierarchy structure. We report on computational experience using this procedure. We show that judicious use of a hierarchy can significantly improve both the speed and effectiveness of the categorization process. Using the Reuters-21578 corpus, we obtain an improvement in running time of over a factor of three and a 5% improvement in F-measure. |
| [27] |
Offer Drori.
Identifying the subject of documents in digital libraries
automatically using frequently-occurring words - study and findings, May 23
2003. [ bib | .pdf | .pdf ] Contemporary information databases contain millions of electronic documents. The immense number of documents makes it difficult to conduct efficient searches on the Internet. Several studies have found that associating documents with a subject or list of topics can make them easier to locate online [5] [6] [7]. Effective cataloging of information is performed manually, requiring extensive resources. Consequently, at present most information is not cataloged. |
| [28] |
Dave Elliman.
Automatic derivation of on-line document ontologies, July 26 2001. [ bib | .pdf | .pdf ] This paper describes a method for constructing an ontology which will represent the set of web pages on a specified site. We are developing a technique that will extract knowledge from digital sources, create ontologies containing reusable knowledge to be shared with software agents, and present a view of this knowledge to users. This method will provide a solution to the problem of classifying information and supporting mechanisms that explore its structure, as well as allowing knowledge to be extracted and shared with other software agents. |
| [29] |
Dave Elliman and J. R. G. Pulido.
Visualizing ontology components through self-organizing maps.
In IV, page 434, 2002. [ bib | http | .pdf ] This paper describes a method for identifying ontology components by using Self-Organizing Maps. Our system represents the knowledge contained in a particular digital archive by assembling and displaying the ontology components. This novel approach provides an alternative solution to the problem of classifying on-line information and retrieval, supports mechanisms that explore domains, and allows knowledge to be displayed in a browsable manner. |
| [30] |
Radu Florian and David Yarowsky.
Dynamic nonlocal language modeling via hierarchical topic-based
adaptation.
In 37th Annual Meeting of the Association for Computational
Linguistics, pages 167-174, 1999. [ bib | .pdf ] |
| [31] |
Hichem Frigui and Olfa Nasraoui.
Simultaneous categorization of text documents and identification of
cluster-dependent keywords, April 07 2002. [ bib | .pdf | .pdf ] In this paper, we propose a new approach to unsupervised text document categorization based on a coupled process of clustering and cluster-dependent keyword weighting. The proposed algorithm is based on the K-Means clustering algorithm. Hence it is computationally and implementationally simple. Moreover, it learns a different set of keyword weights for each cluster. This means that, as a by-product of the clustering process, each document cluster will be characterized by a possibly different set of keywords. The cluster dependent keyword weights have two advantages. First, they help in partitioning the document collection into more meaningful categories. |
| [32] |
Gongde Guo, Hui Wang, David A. Bell, Yaxin Bi, and Kieran Greer.
An kNN model-based approach and its application in text
categorization.
In Alexander F. Gelbukh, editor, Proceedings of CICLING-04, 5th
International Conference on Computational Linguistics and Intelligent Text
Processing, pages 559-570, Seoul, KO, 2004. Springer Verlag, Heidelberg,
DE.
Published in the "Lecture Notes in Computer Science" series, number
2945. [ bib | .pdf ] |
| [33] |
Andreas Hotho and Gerd Stumme.
Conceptual clustering of text clusters, May 23 2002. [ bib | .pdf | .pdf ] Common clustering techniques have the disadvantage that they do not provide intensional descriptions of the clusters obtained. Conceptual Clustering techniques, on the other hand, provide such descriptions, but are known to be rather slow. In this paper, we discuss a way of combining both techniques. We first cluster the documents by a variant of K-Means, using a thesaurus as background knowledge. This clustering reduces the large number of documents to a relatively small number of clusters, which can then be clustered conceptually in the second step. |
| [34] |
M. Jarrar and R. Meersman.
Scalability and knowledge reusability in ontology modeling.
In Veljko Milutinovic, editor, Proceedings of the International
conference on Infrastructure for e-Business, e-Education, e-Science, and
e-Medicine, volume SSGRR2002s, Rome, Italy, 2002. SSGRR education center. [ bib | .pdf ] |
| [35] |
George Karypis and Eui-Hong Han.
Concept indexing: A fast dimensionality reduction algorithm with
applications to document retrieval and categorization.
Computer science department TR-00-0016, University of
Minnesota, 2000. [ bib | .pdf ] In recent years, we have seen a tremendous growth in the volume of text documents available on the Internet, digital libraries, news sources, and company-wide intranets. This has led to an increased interest in developing methods that can efficiently categorize and retrieve relevant information. Retrieval techniques based on dimensionality reduction, such as Latent Semantic Indexing (LSI), have been shown to improve the quality of the information being retrieved by capturing the latent meaning of the words present in the documents. Unfortunately, the high computational requirements of LSI and its inability to compute an effective dimensionality reduction in a supervised setting limit its applicability. In this paper we present a fast dimensionality reduction algorithm, called concept indexing (CI), that is equally effective for unsupervised and supervised dimensionality reduction. CI computes a k-dimensional representation of a collection of documents by first clustering the documents into k groups, and then using the centroid vectors of the clusters to derive the axes of the reduced k-dimensional space. Experimental results show that the dimensionality reduction computed by CI achieves comparable retrieval performance to that obtained using LSI, while requiring an order of magnitude less time. Moreover, when CI is used to compute the dimensionality reduction in a supervised setting, it greatly improves the performance of traditional classification algorithms such as C4.5 and kNN. |
| [36] |
Taku Kudo and Yuji Matsumoto.
A boosting algorithm for classification of semi-structured text.
In Proceedings of EMNLP-04, 9th Conference on Empirical Methods
in Natural Language Processing, Barcelona, ES, 2004. [ bib | .pdf | .pdf ] The focus of research in text classification has expanded from simple topic identification to more challenging tasks such as opinion/modality identification. Unfortunately, the latter goals exceed the ability of the traditional bag-of-words representation approach, and a richer, more structural representation is required. Accordingly, learning algorithms must be created that can handle the structures observed in texts. In this paper, we propose a Boosting algorithm that captures sub-structures embedded in texts. The proposal consists of i) decision stumps that use subtrees as features and ii) the Boosting algorithm which employs the subtree-based decision stumps as weak learners. We also discuss the relation between our algorithm and SVMs with tree kernel. Two experiments on opinion/modality classification confirm that subtree features are important. |
| [37] |
Dawn Lawrie and W. Bruce Croft.
Discovering and comparing topic hierarchies, October 13 2000. [ bib | .pdf | .pdf ] Hierarchies have been used for organization, summarization, and access to information, yet a lingering issue is how best to construct them. In this paper, our goal is to automatically create domain specific hierarchies that can be used for browsing a document set and locating relevant documents. We examine methods of automatically generating hierarchies and evaluating them. To this end, we compare and contrast two methods of generating topic hierarchies from the text of documents: one, subsumption hierarchies, uses subsumption relations found within document sets, and the other, lexical hierarchies, utilizes frequently used words within phrases. Our evaluation shows that subsumption hierarchies divide documents into smaller groups, allowing one to find all relevant documents without looking at as many non-relevant documents. However, such hierarchies are more likely to contain no path to a relevant document. |
| [38] |
R. E. Madsen, J. Larsen, and L. K. Hansen.
Part-of-speech enhanced context recognition.
In S. Douglas A.K. Barros, J. Principe, J. Larsen, T. Adali,
editor, Proceedings of IEEE Workshop on Machine Learning for Signal
Processing XIV, pages 635-644, Piscataway, New Jersey, September 2004.
IEEE Press. [ bib | http | .pdf ] Language independent 'bag-of-words' representations are surprisingly effective for text classification. In this communication our aim is to elucidate the synergy between language independent features and simple language model features. We consider term tag features estimated by a so-called part-of-speech tagger. The feature sets are combined in an early binding design with an optimized binding coefficient that allows weighting of the relative variance contributions of the participating feature sets. With the combined features documents are classified using a latent semantic indexing representation and a probabilistic neural network classifier. Three medium size data-sets are analyzed and we find consistent synergy between the term and natural language features in all three sets for a range of training set sizes. The most significant enhancement is found for small text databases where high recognition rates are possible. Keywords: text mining, latent space, context recognition |
| [39] |
Seong-Bae Park and Byoung-Tak Zhang.
Co-trained support vector machines for large scale unstructured
document classification using unlabeled data and syntactic information.
Information Processing and Management, 40(3):421-439, 2004. [ bib | .pdf ] Most document classification systems consider only the distribution of content words of the documents, ignoring the syntactic information underlying the documents though it is also an important factor. In this paper, we present an approach for classifying large scale unstructured documents by incorporating both lexical and syntactic information of documents. For this purpose, we use the co-training algorithm, a partially supervised learning algorithm, in which two separate views for the training data are employed and the small number of labeled data are augmented by a large number of unlabeled data. Since both lexical and syntactic information can play the roles of separate views for the unstructured documents, the co-training algorithm enhances the performance of document classification using both of them and a large number of unlabeled documents. The experimental results on the Reuters-21578 corpus and TREC-7 filtering documents show the effectiveness of unlabeled documents and the use of both lexical and syntactic information. |
| [40] |
William M. Pottenger.
Detecting patterns in the LSI term-term matrix.
Technical report, September 25 2002. [ bib | .pdf | .pdf ] applications use techniques that explicitly or implicitly employ a limited degree of transitivity in the co-occurrence relation. In this work we show use of higher orders of co-occurrence in the Singular Value Decomposition (SVD) algorithm and, by inference, on the systems that rely on SVD, such as LSI. Our empirical and mathematical studies prove that term co-occurrence plays a crucial role in LSI. |
| [41] |
David Ramamonjisoa.
Towards automated research topics discovery on scientific domain by
agents system, January 02 2003. [ bib | .pdf | .pdf ] In our project on multiagent for web mining, we developed KAROKA (Keywords Association Rules Optimizer Knobots Advisers) as a model of discovery in text database used in WWW. In this paper, we explain our model and its application to discover new research topics in scientific domain on the web. This tool aims to support researchers for their bibliographical investigation and help to avoid information overload. The WWW sources are converted into a highly structured collection of text. Then, KAROKA tries to extract topics, association rules, regularities, exception and useful information in the collection of text. |
| [42] |
Anthony Scime.
Web Mining: applications and techniques.
Idea Group, 2005. [ bib | .pdf ] |
| [43] |
Mark Sinka and David Corne.
Evolving document features for web document clustering: A
feasibility study.
In Proceedings of the 2004 IEEE Congress on Evolutionary
Computation, pages 891-897, Portland, Oregon, 20-23 June 2004. IEEE Press. [ bib | .pdf ] Document analysis research underpins the envisaged 'semantic web'. A key issue is how to encode a document without losing salient information. Current research almost always uses fixed-length vectors based on word (term) frequency (TF) and/or variants thereof. We explore alternative encodings using an evolutionary algorithm (EA). These alternatives use a variety of other features that can be extracted from a document, and the EA explores the space of weighted combinations of these. Tests are able to find encodings which outperform previous results. Among several tentative findings it seems clear that the ideal encoding is highly task-dependent, and we can recommend certain features as useful for specific types of document clustering tasks. Keywords: Other, Real-world applications |
| [44] |
Noam Slonim and Naftali Tishby.
The power of word clusters for text classification.
In Proceedings of ECIR-01, 23rd European Colloquium on
Information Retrieval Research, Darmstadt, DE, 2001. [ bib | http | .pdf ] The recently introduced Information Bottleneck method provides an information theoretic framework for extracting features of one variable that are relevant for the values of another variable. Several previous works already suggested applying this method for document clustering, gene expression data analysis, spectral analysis and more. In this work we present a novel implementation of this method for supervised text classification. Specifically, we apply the information bottleneck method to find word-clusters that preserve the information about document categories and use these clusters as features for classification. Previous work used a similar clustering procedure to show that word-clusters can significantly reduce the feature space dimensionality, with only a minor change in classification accuracy. In this work we reproduce these results and go further to show that when the training sample is small, word clusters can yield significant improvement in classification accuracy (up to 18%) over the performance using the words directly. |
| [45] |
Alexander Strehl, Joydeep Ghosh, and Raymond Mooney.
Impact of similarity measures on web-page clustering.
In Proceedings of the 17th National Conference on Artificial
Intelligence: Workshop of Artificial Intelligence for Web Search (AAAI 2000),
30-31 July 2000, Austin, Texas, USA, pages 58-64. AAAI, July 2000. [ bib | .pdf ] |
| [46] |
Domonkos Tikk, Jae Dong Yang, and Sun Lee Bang.
Hierarchical text categorization using fuzzy relational thesaurus,
April 22. [ bib | .pdf | .pdf ] Text categorization is the task of assigning a text document to an appropriate category in a predefined set of categories. We present a new approach to text categorization by means of a Fuzzy Relational Thesaurus (FRT). FRT is a multilevel category system that stores and maintains an adaptive local dictionary for each category. The goal of our approach is twofold: to develop a reliable text categorization method for a certain subject domain, and to expand the initial FRT with automatically added terms, thereby obtaining an incrementally defined knowledge base of the domain. We implemented the categorization algorithm and compared it with some other hierarchical classifiers. Experimental results show that our algorithm outperforms its rivals on all document corpora investigated. |
| [47] |
J. J. Verbeek.
Supervised feature extraction for text categorization, February 14
2002. [ bib | .ps.gz | .pdf ] This paper concerns finding the 'optimal' number of word groups for text classification. We present a method to select which words to cluster into word groups and how many such word groups to use on the basis of a set of pre-classified texts. The method involves a 'greedy' search through the space of possible word groups. The words are grouped according to the 'Jensen-Shannon divergence' between the corresponding distributions over the classes. The criterion to decide which number of word groups to use is based on Rissanen's MDL Principle. We present empirical results that indicate that the proposed method performs well. Furthermore, the proposed method outperforms cross-validation in the sense that far fewer word groups are selected while prediction accuracy is just slightly worse. For the experimentation we used a subset of the '20 Newsgroup' dataset. |
| [48] |
Ian H. Witten, Gordon W. Paynter, Eibe Frank, Carl Gutwin, and Craig G.
Nevill-Manning.
KEA: Practical automatic keyphrase extraction.
CoRR, cs.DL/9902007, 1999. [ bib | http | .pdf ] Keyphrases provide semantic metadata that summarize and characterize documents. This paper describes Kea, an algorithm for automatically extracting keyphrases from text. Kea identifies candidate keyphrases using lexical methods, calculates feature values for each candidate, and uses a machine-learning algorithm to predict which candidates are good keyphrases. The machine learning scheme first builds a prediction model using training documents with known keyphrases, and then uses the model to find keyphrases in new documents. We use a large test corpus to evaluate Kea's effectiveness in terms of how many author-assigned keyphrases are correctly identified. The system is simple, robust, and available under the GNU General Public License; the paper gives instructions for use. |
| [49] |
Wai-chiu Wong and Ada Wai-chee Fu.
Incremental document clustering for web page classification,
August 31 2000. [ bib | .ps | .pdf ] Motivated by the benefits in organizing the documents in Web search engines, we consider the problem of automatic Web page classification. We employ the clustering techniques. Each document is represented by a feature vector. By analyzing the clusters formed by these vectors, we can assign the documents within the same cluster to the same class automatically. Our contributions are the following: (1) We propose a feature extraction mechanism which is more suitable to Web page classification. (2) We introduce a tree structure called the DC-tree to make the clustering process incremental and less sensitive to the document insertion order. (3) We show with experiments on a set of Internet documents from Yahoo! that the proposed clustering algorithm can classify Web pages effectively. Keywords: Incremental update, Tree, Document, Clustering, Web, Classification |