Abstract

This paper introduces the concept of product identity-clustering, together with new similarity metrics and new performance metrics for web-crawled products. Product identity-clustering is defined here as the clustering of identical products, e.g., for price comparison purposes. Products blindly crawled from web sources, e.g., online marketplaces, have different description formats, in which the features describing the same product differ in both number and representation. This leads to imperfect feature vectors: the vectors are not uniform in length or structure, their features are of various data types (numeric, categorical), and the vector structures are unknown. Furthermore, the product information usually contains redundant, missing, or faulty data, which are regarded as noise here. Product identity-clustering becomes a challenge when the vectors' metadata are not known in advance and the imperfect feature vectors are also affected by noise. In this paper, the product identity-clustering concept is introduced as a new mining metric in e-commerce. Then novel similarity metrics are introduced to improve on the product identity-clustering performance of legacy metrics. Finally, novel performance metrics are proposed to measure the performance of identity-clustering algorithms. Using these metrics, the legacy similarity metrics (Euclidean, cosine, etc.) are compared with the proposed similarity metrics. The results show that legacy metrics are not successful in discriminating identical web-crawled products and that the proposed metrics perform better on the product identity-clustering problem.
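To make the notion of imperfect feature vectors concrete, the following is a minimal Python sketch. It is not the similarity metric proposed in the paper, which the abstract does not specify. The listings, the feature names, and both functions (`cosine_zero_filled`, `shared_feature_similarity`) are hypothetical and chosen only for illustration. The sketch contrasts a legacy-style cosine similarity, computed after filling missing numeric features with zeros, with a simple comparison restricted to the features two listings share.

```python
from math import sqrt

# Hypothetical crawled listings: a and b describe the same product with
# different feature sets; c is a different model with the same numeric specs.
listing_a = {"brand": "Acme", "model": "X100", "ram_gb": 8, "weight_g": 1500}
listing_b = {"brand": "acme", "model": "x100", "ram_gb": 8, "screen_in": 15.6}
listing_c = {"brand": "Acme", "model": "X200", "ram_gb": 8, "weight_g": 1500}


def _num(record, key):
    """Return the feature as a float, or 0.0 if it is missing or non-numeric."""
    value = record.get(key)
    return float(value) if isinstance(value, (int, float)) else 0.0


def cosine_zero_filled(a, b):
    """Legacy-style cosine over numeric features; missing features become 0."""
    keys = sorted(
        k for k in set(a) | set(b)
        if isinstance(a.get(k), (int, float)) or isinstance(b.get(k), (int, float))
    )
    x = [_num(a, k) for k in keys]
    y = [_num(b, k) for k in keys]
    dot = sum(p * q for p, q in zip(x, y))
    norm = sqrt(sum(p * p for p in x)) * sqrt(sum(q * q for q in y))
    return dot / norm if norm else 0.0


def shared_feature_similarity(a, b):
    """Illustrative only: average agreement over the features both listings have."""
    shared = set(a) & set(b)
    if not shared:
        return 0.0
    total = 0.0
    for key in shared:
        u, v = a[key], b[key]
        if isinstance(u, (int, float)) and isinstance(v, (int, float)):
            top = max(abs(u), abs(v))
            # Relative closeness of two numbers, clamped to [0, 1].
            total += 1.0 if top == 0 else max(0.0, 1.0 - abs(u - v) / top)
        else:
            # Categorical features: case-insensitive exact match.
            total += 1.0 if str(u).strip().lower() == str(v).strip().lower() else 0.0
    return total / len(shared)


print(cosine_zero_filled(listing_a, listing_b), cosine_zero_filled(listing_a, listing_c))
print(shared_feature_similarity(listing_a, listing_b), shared_feature_similarity(listing_a, listing_c))
```

On this toy data, the zero-filled cosine scores the different-model pair (a, c) higher than the same-product pair (a, b), because the missing features dominate the vectors, while the shared-feature comparison ranks them the other way round. This only illustrates the kind of failure the abstract attributes to legacy metrics; it says nothing about how the proposed metrics are actually defined.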

DOI

10.3906/elk-1307-127

Keywords

Product clustering, similarity metrics, identity clustering, performance metrics, web mining

First Page

1195

Last Page

1208

Recommended Citation

YETGİN, ZEKİ and GÖZÜKARA, FURKAN (2015) "New metrics for clustering of identical products over imperfect data," Turkish Journal of Electrical Engineering and Computer Sciences: Vol. 23: No. 4, Article 20. https://doi.org/10.3906/elk-1307-127
Available at: https://journals.tubitak.gov.tr/elektrik/vol23/iss4/20


Included in

Computer Engineering Commons, Computer Sciences Commons, Electrical and Computer Engineering Commons


Turkish Journal of Electrical Engineering and Computer Sciences

New metrics for clustering of identical products over imperfect data
