Turkish Journal of Electrical Engineering and Computer Sciences
DOI
10.3906/elk-1307-127
Abstract
This paper introduces the concept of product identity-clustering, together with new similarity metrics and new performance metrics for web-crawled products. Product identity-clustering is defined here as the clustering of identical products, e.g., for price comparison purposes. Products crawled blindly from web sources, e.g., online marketplaces, have different description formats, in which the features describing the same product differ in both number and representation. This yields imperfect feature vectors: vectors that are not uniform in length or structure, that mix features of various data types (numeric, categorical), and whose structure is unknown. Furthermore, the product information usually contains redundant, missing, or faulty data, which are regarded as noise here. Product identity-clustering becomes a challenge when the vectors' metadata are not known in advance and the imperfect feature vectors are also noisy. In this paper, the product identity-clustering concept is introduced as a new mining metric in e-commerce. Then novel similarity metrics are introduced to improve on the product identity-clustering performance of legacy metrics. Finally, novel performance metrics are proposed to measure the performance of identity-clustering algorithms. Using these metrics, the legacy similarity metrics (Euclidean, cosine, etc.) are compared with the proposed similarity metrics. The results show that legacy metrics are not successful in discriminating identical web-crawled products and that the proposed metrics perform better on the product identity-clustering problem.
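As a rough illustration of the imperfect-vector problem described in the abstract, the following minimal Python sketch applies one legacy metric, cosine similarity, to feature vectors that differ in length and content. It is not the similarity metric proposed in the paper. The token-count featurization, the function names `token_features` and `cosine_similarity`, and the sample listings are all assumptions made for illustration.

```python
import math


def token_features(description):
    """Hypothetical featurization: lowercase whitespace tokens mapped to counts."""
    counts = {}
    for token in description.lower().split():
        counts[token] = counts.get(token, 0) + 1
    return counts


def cosine_similarity(a, b):
    """Cosine similarity of two sparse feature dicts.

    A feature missing from one dict is treated as 0, so the two
    vectors do not need the same length or the same keys.
    """
    shared = set(a) & set(b)
    dot = sum(a[k] * b[k] for k in shared)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty vector carries no evidence of similarity
    return dot / (norm_a * norm_b)


# Two made-up listings meant to describe the same product in different formats.
listing_a = token_features("acme x100 phone 64gb black")
listing_b = token_features("acme x100 smartphone black 64 gb unlocked")
print(cosine_similarity(listing_a, listing_b))
```

In this sketch, "64gb" and "64 gb" produce different tokens, so the score falls even though the listings are meant to be identical. This is the kind of representation mismatch that, according to the abstract, legacy metrics handle poorly.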
Keywords
Product clustering, similarity metrics, identity clustering, performance metrics, web mining
First Page
1195
Last Page
1208
Recommended Citation
YETGİN, ZEKİ and GÖZÜKARA, FURKAN (2015) "New metrics for clustering of identical products over imperfect data," Turkish Journal of Electrical Engineering and Computer Sciences: Vol. 23: No. 4, Article 20.
https://doi.org/10.3906/elk-1307-127
Available at:
https://journals.tubitak.gov.tr/elektrik/vol23/iss4/20
Included in
Computer Engineering Commons, Computer Sciences Commons, Electrical and Computer Engineering Commons