Turkish Journal of Electrical Engineering and Computer Sciences
DOI
10.3906/elk-1501-236
Abstract
The immense number of documents published on the web calls for automatic classifiers that organize these large resources and make their information retrievable. Typically, automatic web page classifiers handle millions of web pages, tens of thousands of features, and hundreds of categories. Most classifiers use the vector space model to represent the dataset of web pages, and the components of each vector are computed using the term frequency-inverse document frequency (TFIDF) scheme. Unfortunately, TFIDF-based classifiers must cope with very large input data, which leads to long processing times and increased resource requirements. There is therefore a growing demand to alleviate these problems by reducing the size of the input data without affecting the classification results. In this paper, we propose a novel approach that improves web page classifiers by reducing the size of the input data (i.e. reducing both the number of web pages and the number of features) using the hypertext induced topic search (HITS) algorithm. We also employ the HITS results to weight the remaining features. We evaluate the proposed approach by comparing it with a TFIDF-based classifier and demonstrate that it significantly reduces the time needed for classification.
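For readers unfamiliar with HITS, the following is a minimal sketch of the standard hub and authority iteration on a toy link graph. It is not the authors' implementation: the function name `hits`, the `links` dictionary format, the fixed iteration count, and the toy graph are all assumptions made for illustration. The abstract does not say how the resulting scores are used to select pages or weight features, so the sketch stops at computing the scores.

```python
import math

def hits(links, iterations=50):
    """Standard HITS iteration (illustrative sketch, not the paper's code).

    links: dict mapping each page to the list of pages it links to.
    Returns (hub, authority) dicts with L2-normalized scores.
    """
    # Collect every page that appears as a source or a target.
    pages = set(links)
    for targets in links.values():
        pages.update(targets)

    hub = {p: 1.0 for p in pages}
    auth = {p: 1.0 for p in pages}

    for _ in range(iterations):
        # Authority of p: sum of hub scores of the pages linking to p.
        new_auth = {p: 0.0 for p in pages}
        for src, targets in links.items():
            for dst in targets:
                new_auth[dst] += hub[src]
        norm = math.sqrt(sum(v * v for v in new_auth.values())) or 1.0
        auth = {p: v / norm for p, v in new_auth.items()}

        # Hub of p: sum of authority scores of the pages p links to.
        new_hub = {p: sum(auth[d] for d in links.get(p, ())) for p in pages}
        norm = math.sqrt(sum(v * v for v in new_hub.values())) or 1.0
        hub = {p: v / norm for p, v in new_hub.items()}

    return hub, auth

# Hypothetical toy graph: pages "a" and "b" both link to "c"; "c" links to "d".
toy_links = {"a": ["c"], "b": ["c"], "c": ["d"], "d": []}
hub, auth = hits(toy_links)
best_authority = max(auth, key=auth.get)  # page with the highest authority score
```

In this toy graph, the page with the most incoming links from good hubs receives the highest authority score, while pages with no incoming links receive an authority score of zero. Low-scoring pages are the kind of candidates a size-reduction step could discard, although the actual selection criterion is described in the paper itself, not here.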
Keywords
Hypertext induced topic search, link analysis, support vector machine, web mining
First Page
2015
Last Page
2032
Recommended Citation
MEADI, MOHAMED NADJIB; BABAHENINI, MOHAMED CHAOUKI; and AHMED, ABDELMALIK TALEB (2017) "New use of the HITS algorithm for fast web page classification," Turkish Journal of Electrical Engineering and Computer Sciences: Vol. 25: No. 3, Article 32.
https://doi.org/10.3906/elk-1501-236
Available at: https://journals.tubitak.gov.tr/elektrik/vol25/iss3/32
Included in
Computer Engineering Commons, Computer Sciences Commons, Electrical and Computer Engineering Commons