Categorical Proportional Difference: A Feature Selection Method for Text Categorization. The Australasian Data Mining Conference (AusDM 2008).
| Content Provider | CiteSeerX |
|---|---|
| Author | Simeon, Mondelle; Hilderman, Robert |
| Abstract | Supervised text categorization is a machine learning task where a predefined category label is automatically assigned to a previously unlabelled document based upon characteristics of the words contained in the document. Since the number of unique words in a learning task (i.e., the number of features) can be very large, the efficiency and accuracy of the learning task can be increased by using feature selection methods to extract from a document a subset of the features that are considered most relevant. In this paper, we introduce a new feature selection method called categorical proportional difference (CPD), a measure of the degree to which a word contributes to differentiating a particular category from other categories. The CPD for a word in a particular category in a text corpus is a ratio that considers the number of documents of a category in which the word occurs and the number of documents from other categories in which the word also occurs. We conducted a series of experiments to evaluate CPD when used in conjunction with SVM and Naive Bayes text classifiers on the OHSUMED, 20 Newsgroups, and Reuters-21578 text corpora. Recall, precision, and the F-measure were used as the measures of performance. The results obtained using CPD were compared to those obtained using six common feature selection methods found in the literature: χ², information gain, document frequency, mutual information, odds ratio, and simplified χ². Empirical results showed that, in general, according to the F-measure, CPD outperforms the other feature selection methods in four out of six text categorization tasks. |
| File Format | |
| Publisher Date | 2008-01-01 |
| Access Restriction | Open |
| Subject Keyword | Feature Selection Method; Text Categorization; Categorical Proportional Difference; Australasian Data Mining Conference; AusDM; Particular Category; Naive Bayes Text Classifier; Unique Word; Common Feature Selection Method; Text Categorization Task; Learning Task; Predefined Category Label; Text Corpus; Document Frequency; Machine Learning Task; Mutual Information; Empirical Result; Reuters-21578 Text Corpus; New Feature Selection Method; Information Gain |
| Content Type | Text |
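
The abstract describes CPD only in words, as a ratio built from two document counts. The sketch below is one plausible reading of that description, not the authors' code. It assumes the form CPD(w, c) = (A − B) / (A + B), where A is the number of documents in category c that contain word w and B is the number of documents in other categories that contain w. That formula is not stated in this record. The function name `cpd_scores` and the toy corpus are hypothetical.

```python
from collections import defaultdict


def cpd_scores(docs):
    """Return {(word, category): CPD} for labelled, tokenized documents.

    docs: iterable of (category, tokens) pairs.
    Assumed formula (not given in the abstract): (A - B) / (A + B), where
    A = documents of the category containing the word and
    B = documents of all other categories containing the word.
    Only (word, category) pairs with A > 0 are scored.
    """
    # word -> category -> number of documents containing the word
    doc_freq = defaultdict(lambda: defaultdict(int))
    for category, tokens in docs:
        for word in set(tokens):  # count each word once per document
            doc_freq[word][category] += 1

    scores = {}
    for word, per_category in doc_freq.items():
        total = sum(per_category.values())
        for category, a in per_category.items():
            b = total - a  # occurrences in documents of other categories
            scores[(word, category)] = (a - b) / (a + b)
    return scores


# Hypothetical toy corpus: two categories, two documents each.
example_docs = [
    ("sports", ["goal", "match"]),
    ("sports", ["goal", "team"]),
    ("finance", ["market", "goal"]),
    ("finance", ["market", "stock"]),
]
scores = cpd_scores(example_docs)
```

Under this assumed formula the score ranges from −1 to 1. A word that occurs only in one category scores 1 for that category, and a word spread evenly across two categories scores 0, which matches the abstract's description of CPD as a measure of how strongly a word differentiates a category.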