Turkish is an agglutinative language with a complex morphological structure, so a given word generally has more than one possible parse. Morphological disambiguation is therefore required before further processing to determine the correct morphological analysis of each word. It is one of the first and most crucial steps in natural language processing, since later analyses depend on its success. Our proposed morphological disambiguation method uses a transformer-based sequence-to-sequence neural network architecture. Transformers are commonly used in various NLP tasks and produce state-of-the-art results in machine translation. However, to the best of our knowledge, transformer-based encoder-decoders have not been studied for morphological disambiguation. In this study, in addition to character-level tokenization, three subword input representations are evaluated: unigram, byte-pair, and wordpiece tokenization. The best accuracy, 96.25%, was achieved with the character-level input representation. Although the proposed model was developed for Turkish, it is not language-dependent and can be applied to a larger set of languages.
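As an illustration of the character-level input representation mentioned above, the following minimal sketch shows how a sentence might be turned into a character sequence and then into integer ids for a sequence-to-sequence encoder. It is not the authors' preprocessing code: the function names, the `▁` word-boundary marker, the special symbols, and the example phrase are all assumptions made for illustration.

```python
# Illustrative sketch only; every name below is hypothetical and not taken from the paper.
SPACE = "▁"  # assumed marker that keeps word boundaries visible in the character stream
SPECIALS = ["<pad>", "<s>", "</s>", "<unk>"]  # assumed special symbols


def char_tokenize(sentence):
    """Split a sentence into characters, inserting SPACE between words."""
    tokens = []
    for i, word in enumerate(sentence.split()):
        if i > 0:
            tokens.append(SPACE)
        tokens.extend(word)  # a str iterates character by character
    return tokens


def build_vocab(token_lists):
    """Assign an integer id to each distinct token, after the special symbols."""
    vocab = {tok: idx for idx, tok in enumerate(SPECIALS)}
    for tokens in token_lists:
        for tok in tokens:
            vocab.setdefault(tok, len(vocab))
    return vocab


def encode(tokens, vocab):
    """Map tokens to ids and wrap them in start and end symbols."""
    unk = vocab["<unk>"]
    return [vocab["<s>"]] + [vocab.get(t, unk) for t in tokens] + [vocab["</s>"]]


# "evin" is ambiguous in Turkish: "of the house" (genitive) or "your house" (possessive).
sentence = "evin kapısı"
tokens = char_tokenize(sentence)
vocab = build_vocab([tokens])
ids = encode(tokens, vocab)
print(tokens)
print(ids)
```

In a setup like this, the encoder would read the character ids of the sentence and the decoder would emit the selected morphological analysis for each word. Subword schemes such as unigram, byte-pair, or wordpiece would replace `char_tokenize` with a learned segmenter while leaving the rest of the pipeline unchanged.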
Natural language analysis, agglutinative languages, machine learning methods, morphological disambiguation, morphological analysis, transformer network
ÖZER, HİLAL and KORKMAZ, EMİN ERKAN
"Transmorph: a transformer based morphological disambiguator for Turkish,"
Turkish Journal of Electrical Engineering and Computer Sciences: Vol. 30: No. 5, Article 15.
Available at: https://journals.tubitak.gov.tr/elektrik/vol30/iss5/15