DOI

10.3906/elk-1501-236

Abstract

The immense number of documents published on the web requires automatic classifiers that can organize these large resources and extract information from them. Typically, automatic web page classifiers handle millions of web pages, tens of thousands of features, and hundreds of categories. Most classifiers use the vector space model to represent the dataset of web pages, with the components of each vector computed using the term frequency-inverse document frequency (TFIDF) scheme. Unfortunately, TFIDF-based classifiers must cope with large-scale input data, which leads to long processing times and increased resource requirements. There is therefore a growing demand to alleviate these problems by reducing the size of the input data without affecting the classification results. In this paper, we propose a novel approach that improves web page classifiers by reducing the size of the input data (i.e. reducing both the number of web pages and the number of features) using the hypertext induced topic search (HITS) algorithm. We then use the HITS results to weight the remaining features. We evaluate the proposed approach by comparing it with a TFIDF-based classifier and demonstrate that it significantly reduces the time needed for classification.
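
For readers unfamiliar with HITS, the following is a minimal Python sketch of the standard hub and authority iteration on a small link graph. It is not the authors' implementation: the function name `hits`, the toy graph, the iteration count, and the L2 normalization are illustrative assumptions, and the abstract does not specify how the scores are used to select pages or weight features.

```python
import math


def hits(links, iterations=50):
    """Return (hubs, authorities) for a directed link graph.

    `links` maps each page to the list of pages it links to.
    This is the generic HITS iteration, shown for illustration only.
    """
    # Collect every page that appears as a source or a target.
    pages = set(links)
    for targets in links.values():
        pages.update(targets)

    hubs = {p: 1.0 for p in pages}
    auths = {p: 1.0 for p in pages}

    for _ in range(iterations):
        # Authority update: sum the hub scores of pages that link in.
        new_auths = {p: 0.0 for p in pages}
        for src, targets in links.items():
            for dst in targets:
                new_auths[dst] += hubs[src]

        # Hub update: sum the authority scores of pages linked to.
        new_hubs = {
            p: sum(new_auths[dst] for dst in links.get(p, []))
            for p in pages
        }

        # Normalize both score vectors to unit L2 length.
        a_norm = math.sqrt(sum(v * v for v in new_auths.values())) or 1.0
        h_norm = math.sqrt(sum(v * v for v in new_hubs.values())) or 1.0
        auths = {p: v / a_norm for p, v in new_auths.items()}
        hubs = {p: v / h_norm for p, v in new_hubs.items()}

    return hubs, auths


# Hypothetical toy graph: pages "a" and "b" both link to "c"; "c" links to "d".
toy_links = {"a": ["c"], "b": ["c"], "c": ["d"], "d": []}
toy_hubs, toy_auths = hits(toy_links)
best_authority = max(toy_auths, key=toy_auths.get)
```

In this toy graph, `best_authority` is the page with the most incoming hub weight. A reduction step of the kind the abstract describes could, under these assumptions, keep only the highest-scoring pages before computing TFIDF vectors.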

Keywords

Hypertext induced topic search, link analysis, support vector machine, web mining

First Page

2015

Last Page

2032

Recommended Citation

MEADI, MOHAMED NADJIB; BABAHENINI, MOHAMED CHAOUKI; and AHMED, ABDELMALIK TALEB (2017) "New use of the HITS algorithm for fast web page classification," Turkish Journal of Electrical Engineering and Computer Sciences: Vol. 25: No. 3, Article 32. https://doi.org/10.3906/elk-1501-236
Available at: https://journals.tubitak.gov.tr/elektrik/vol25/iss3/32

Included in

Computer Engineering Commons, Computer Sciences Commons, Electrical and Computer Engineering Commons

Turkish Journal of Electrical Engineering and Computer Sciences

New use of the HITS algorithm for fast web page classification
