Authors: EMİN BORANDAĞ, AKIN ÖZÇİFT, YEŞİM KAYGUSUZ
Abstract: The increase in the number of texts available as digital documents from numerous sources such as customer reviews, news, and social media has made text categorization crucial for managing the enormous amount of data. The high-dimensional nature of these texts requires a preliminary feature selection task to reduce the feature space, with a potential increase in prediction accuracy. In this study, an ensemble feature selection method, namely majority vote rank allocation, was developed for Turkish text categorization purposes. The method uses a majority voting ensemble strategy in combination with a rank allocation approach to combine weak filters such as information gain, symmetric uncertainty, relief, and correlation-based feature selection. Thus, the proposed method measures the quality of each feature among all features using the majority votes of the filters and rank allocation. The feature selection efficacy of the method was tested on two datasets, one from the literature and one newly collected. The effect of the obtained features on classification prediction performance was evaluated with the naive Bayes, support vector machine, J48, and random forest algorithms. It was empirically observed that the developed method improved the prediction accuracies of the classifiers compared to the mentioned filters. The statistical significance of the experimental results was also validated with a two-way analysis of variance test.
Keywords: Hybrid feature selection, new enhance, Turkish text categorization, majority voting, ensemble feature strategy, rank allocation
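The abstract describes combining several filters through majority voting and rank allocation but does not give the exact procedure. The sketch below is one possible reading, not the authors' algorithm. It assumes that each filter ranks all features by score, that a filter "votes" for the features in its top `top_k`, that a feature is kept when more than half of the filters vote for it, and that kept features are ordered by the sum of their ranks across filters. The function name, the `top_k` parameter, and the toy scores are all hypothetical.

```python
from collections import defaultdict


def majority_vote_rank_allocation(filter_scores, top_k):
    """Illustrative ensemble of filter rankings (an assumed interpretation).

    filter_scores: dict mapping filter name -> dict of feature -> score,
                   where a higher score means a more relevant feature.
                   Every filter is assumed to score the same feature set.
    top_k:         hypothetical cutoff; a filter votes for its top_k features.
    """
    votes = defaultdict(int)     # number of filters voting for each feature
    rank_sum = defaultdict(int)  # sum of rank positions across filters
    n_filters = len(filter_scores)

    for scores in filter_scores.values():
        # Rank by descending score; break ties by name for determinism.
        ordered = sorted(scores, key=lambda f: (-scores[f], f))
        for rank, feature in enumerate(ordered, start=1):
            rank_sum[feature] += rank
            if rank <= top_k:
                votes[feature] += 1

    # Keep features chosen by a strict majority of the filters.
    kept = [f for f in rank_sum if votes[f] > n_filters / 2]
    # Order the kept features by total rank (lower is better).
    return sorted(kept, key=lambda f: (rank_sum[f], f))


# Made-up scores for four Turkish tokens from three filters.
scores = {
    "info_gain": {"kitap": 0.9, "film": 0.7, "ve": 0.1, "top": 0.4},
    "symmetric_uncertainty": {"kitap": 0.8, "film": 0.3, "ve": 0.2, "top": 0.6},
    "relief": {"kitap": 0.5, "film": 0.6, "ve": 0.05, "top": 0.4},
}

selected = majority_vote_rank_allocation(scores, top_k=2)
print(selected)
```

With these toy scores, "kitap" is in the top two of all three filters and "film" in the top two of two filters, so both pass the majority threshold, while "top" and "ve" do not. The selected features would then be passed to a classifier such as naive Bayes or a support vector machine.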