Turkish Journal of Electrical Engineering and Computer Sciences

DOI

10.3906/elk-1901-91

Abstract

Sentence similarity is the task of assessing how similar two snippets of text are. Similarity techniques are used extensively in clustering, summarization, classification, plagiarism detection, and related tasks. Because a sentence offers only a small vocabulary to compare, sentence similarity is considered a difficult problem in natural language processing. Solving it raises two issues: (1) which similarity technique to use for word-pair similarity and (2) how to generalize that to sentence pairs. We used the weighted path, a WordNet-based similarity assessment, and the paraphrase database to obtain word-pair similarity values. We then extracted the maximum values from the pairwise similarity matrix and computed a similarity value for each sentence pair. We also incorporated a vector space model technique to form a robust similarity measure. On the STSS65 test dataset, our method outperformed state-of-the-art methods, reaching a Pearson's correlation of 87% with human similarity scores. On the STSS131 test dataset, our approach performed on par with other methods under the same measure. It also outperformed all the other WordNet-based methods compared on both datasets.
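
The step from word-pair scores to a sentence-pair score can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract does not state how the extracted maxima are combined, so averaging the row and column maxima is an assumption made here, and `toy_word_sim` with its `TOY_SCORES` table is a hypothetical stand-in for the WordNet weighted-path and paraphrase-database scores.

```python
from typing import Callable, Dict, FrozenSet, List

# Hypothetical toy scores; a real system would query WordNet or a
# paraphrase database instead of this hand-written table.
TOY_SCORES: Dict[FrozenSet[str], float] = {
    frozenset({"car", "automobile"}): 0.9,
    frozenset({"fast", "quick"}): 0.8,
}


def toy_word_sim(a: str, b: str) -> float:
    """Return 1.0 for identical words, a table score if known, else 0.0."""
    if a == b:
        return 1.0
    return TOY_SCORES.get(frozenset({a, b}), 0.0)


def similarity_matrix(
    s1: List[str], s2: List[str], word_sim: Callable[[str, str], float]
) -> List[List[float]]:
    """Build the pairwise matrix: one row per word of s1, one column per word of s2."""
    return [[word_sim(a, b) for b in s2] for a in s1]


def sentence_similarity(
    s1: List[str],
    s2: List[str],
    word_sim: Callable[[str, str], float] = toy_word_sim,
) -> float:
    """Average the best match of every word in both directions (assumed aggregation)."""
    if not s1 or not s2:
        return 0.0
    matrix = similarity_matrix(s1, s2, word_sim)
    row_max = [max(row) for row in matrix]        # best match for each word of s1
    col_max = [max(col) for col in zip(*matrix)]  # best match for each word of s2
    return (sum(row_max) + sum(col_max)) / (len(row_max) + len(col_max))


if __name__ == "__main__":
    print(sentence_similarity(["fast", "car"], ["quick", "automobile"]))
```

Taking maxima in both directions keeps the score symmetric in the two sentences. The vector space model component mentioned in the abstract is not covered by this sketch.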

Keywords

Sentence similarity, plagiarism detection, text mining, vector space model, paraphrase database

First Page

3779

Last Page

3790
