Turkish Journal of Electrical Engineering and Computer Sciences

A fast text similarity measure for large document collections using multireference cosine and genetic algorithm

HAMID MOHAMMADI
SEYED HOSSEIN KHASTEH

DOI

10.3906/elk-1906-30

Abstract

One of the critical factors that make a search engine fast and accurate is a concise, duplicate-free index. To remove duplicate and near-duplicate (DND) documents from the index, a search engine needs a swift and reliable DND text document detection system. Traditional approaches to this problem, such as brute-force comparisons or simple hash-based algorithms, are unsuitable because they do not scale and cannot detect near-duplicate documents effectively. In this paper, a new signature-based approach to text similarity detection is introduced that is fast, scalable, and reliable and needs less storage space. The proposed method is based on the idea of using reference texts to generate signatures for text documents. The novelty of this paper is the use of genetic algorithms to generate better reference texts. The proposed method is examined on standard text document datasets such as CiteseerX, Enron, Gold Set of Near-duplicate News Articles, and other similar datasets. The results are promising and comparable with the best cutting-edge algorithms in terms of accuracy and performance.
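
The abstract does not spell out how signatures are built, so the following is only a minimal sketch of the general idea of reference-text signatures, not the authors' implementation. It assumes that a document's signature is the tuple of cosine similarities between its term-frequency vector and each reference text, and that two documents are treated as near-duplicate candidates when their signatures agree within a tolerance. The function names, the whitespace tokenizer, and the `tolerance` value are all hypothetical, and the genetic-algorithm step that selects reference texts is not shown.

```python
import math
from collections import Counter


def term_vector(text):
    """Bag-of-words term frequencies (assumed tokenizer: lowercase + whitespace split)."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two term-frequency Counters."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def signature(text, references):
    """Assumed signature: one cosine value per reference text."""
    vec = term_vector(text)
    return tuple(cosine(vec, term_vector(ref)) for ref in references)


def is_candidate_pair(sig_a, sig_b, tolerance=0.05):
    """Flag a pair as a near-duplicate candidate if every component is close.

    Close signatures do not guarantee similar documents, so a real system
    would still verify each candidate pair.
    """
    return all(abs(x - y) <= tolerance for x, y in zip(sig_a, sig_b))


# Hypothetical reference texts; the paper generates these with a genetic algorithm.
references = [
    "search engine index duplicate",
    "genetic algorithm reference text",
]

doc_a = "search engine index"
doc_b = "search engine index duplicate free"
doc_c = "genetic algorithm text"

sig_a = signature(doc_a, references)
sig_b = signature(doc_b, references)
sig_c = signature(doc_c, references)
```

Under these assumptions, comparing two documents costs one pass over a short tuple of numbers instead of a full text comparison, which is where the speed and storage savings described in the abstract would come from.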

Keywords

Text similarity, near-duplicate, reference text, genetic algorithm

First Page

999

Last Page

1013

Recommended Citation

MOHAMMADI, HAMID and KHASTEH, SEYED HOSSEIN (2020) "A fast text similarity measure for large document collections using multireference cosine and genetic algorithm," Turkish Journal of Electrical Engineering and Computer Sciences: Vol. 28: No. 2, Article 28. https://doi.org/10.3906/elk-1906-30
Available at: https://journals.tubitak.gov.tr/elektrik/vol28/iss2/28

Included in

Computer Engineering Commons, Computer Sciences Commons, Electrical and Computer Engineering Commons
