In multiword expressions (MWEs), multiple words combine to form a new unit of language. When MWE identification is treated as a binary classification task, one of the most important factors in performance is training the classifier with a sufficient number of labelled samples. Since manual labelling is time-consuming, the performance of MWE recognition studies is limited by the size of the training sets. In this study, we propose comparison-based and common-decision co-training approaches to enlarge the MWE dataset. In the experiments, the proposed approaches were compared with standard co-training and with manual labelling, using statistical and linguistic features as two different views of the MWE dataset. A number of tests with different settings were performed on a Turkish MWE dataset. Ten different classifiers were used in the experiments, and the best-performing classifier pair was observed to be the SMO-SMO pair. The experimental results showed that the common-decision co-training approach is an alternative to hand-labelling large MWE datasets, and that both newly proposed approaches outperform standard co-training when the training set is to be enlarged in MWE classification.
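As an illustration of the baseline only, the following is a minimal sketch of standard co-training over two feature views. It is not the implementation used in the study: the nearest-centroid classifier is a stand-in for the classifiers evaluated in the paper, the function names (`fit`, `predict`, `co_train`), the parameters `rounds` and `per_round`, and the toy feature values are all hypothetical. The comparison-based and common-decision variants are not sketched, since the abstract does not specify them.

```python
from math import dist  # Python 3.8+


def centroid(rows):
    """Component-wise mean of a list of equal-length feature vectors."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]


def fit(vectors, labels):
    """Nearest-centroid model (a stand-in classifier): one centroid per class."""
    return {
        c: centroid([x for x, y in zip(vectors, labels) if y == c])
        for c in set(labels)
    }


def predict(model, x):
    """Return (label, confidence); confidence is the distance margin
    between the closest and second-closest class centroids."""
    ranked = sorted((dist(x, c), label) for label, c in model.items())
    margin = ranked[1][0] - ranked[0][0] if len(ranked) > 1 else 0.0
    return ranked[0][1], margin


def co_train(labeled, unlabeled, rounds=5, per_round=1):
    """Standard co-training sketch.

    labeled:   list of ((view1, view2), label), views given as tuples
    unlabeled: list of (view1, view2)
    Each round, one classifier per view is trained on the labelled set,
    and each adds its most confident predictions to that set.
    """
    labeled, pool = list(labeled), list(unlabeled)
    for _ in range(rounds):
        if not pool:
            break
        labels = [y for _, y in labeled]
        picks = {}
        for v in (0, 1):  # 0: first view, 1: second view
            model = fit([x[v] for x, _ in labeled], labels)
            ranked = sorted(pool, key=lambda x: predict(model, x[v])[1],
                            reverse=True)
            for x in ranked[:per_round]:
                # Keep the first view's label if both views pick the same sample.
                picks.setdefault(x, predict(model, x[v])[0])
        for x, y in picks.items():
            labeled.append((x, y))
            pool.remove(x)
    return labeled


# Toy data (hypothetical): view 1 = a statistical score, view 2 = a linguistic score.
seed = [(((0.9,), (1.0,)), 1), (((0.1,), (0.0,)), 0)]  # 1 = MWE, 0 = non-MWE
pool = [((0.8,), (0.9,)), ((0.2,), (0.1,))]
result = co_train(seed, pool)
```

In this sketch a sample selected by either view enters the labelled set with that view's predicted label, which is the behaviour the proposed approaches are reported to improve upon.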
Multiword expression, classification, training set, co-training
METİN, SENEM KUMOVA. "Enlarging multiword expression dataset by co-training," Turkish Journal of Electrical Engineering and Computer Sciences: Vol. 26: No. 5, Article 34.
Available at: https://journals.tubitak.gov.tr/elektrik/vol26/iss5/34