Turkish Journal of Electrical Engineering and Computer Sciences
Author ORCID Identifier
SERDAR YILDIZ: 0000-0002-0430-690X
ABBAS MEMİŞ: 0000-0003-2645-8071
SONGÜL VARLI: 0000-0002-1786-6869
Abstract
This paper introduces a novel, high-performance encoder-decoder-based deep model called TRCaptionNet++ for generic Turkish image captioning tasks. The proposed model is an improved and refined version of TRCaptionNet, which essentially employs a CLIP (contrastive language–image pretraining) image encoder, a feature projection layer and a BERT (bidirectional encoder representations from transformers) text decoder. Within the scope of the study, the regular TRCaptionNet model was trained and specifically fine-tuned with a massive set of image data. In the initial stage, approximately 2,000,000 random images representing the words in the MS COCO and Flickr caption sets were retrieved through web crawling. Then, nearly 8,000,000 caption texts in total were generated for these images via 4 different image captioning models. Finally, the text decoder module of the proposed model was improved by using the image-caption features of these crawled images. The performance of the TRCaptionNet++ model was evaluated on two Turkish caption datasets (TasvirEt and Turkish MS COCO) and two machine-translated caption sets (MS COCO and Flickr30K) by measuring common image captioning metrics such as BLEU, METEOR, ROUGE-L, CIDEr and SPICE. In these tests, the proposed model achieved remarkable captioning scores and outperformed all the related works. Project details and demo links of TRCaptionNet++ will also be available on the project page: https://serdaryildiz.com/TRCaptionNetpp
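The figures in the abstract (about 2,000,000 crawled images, 4 captioning models, nearly 8,000,000 captions) are consistent with one generated caption per model per image. The following is a minimal sketch of that pairing step only, under this assumption. The function `build_caption_pairs`, the stub captioners and the file names are hypothetical illustrations and are not taken from the paper or its code.

```python
from typing import Callable, Dict, List, Tuple

# Hypothetical type: a captioner maps an image identifier to a caption string.
Captioner = Callable[[str], str]


def build_caption_pairs(
    image_ids: List[str], captioners: Dict[str, Captioner]
) -> List[Tuple[str, str, str]]:
    """Return one (image_id, model_name, caption) triple per image and captioner."""
    pairs = []
    for image_id in image_ids:
        for name, captioner in captioners.items():
            pairs.append((image_id, name, captioner(image_id)))
    return pairs


def make_stub(name: str) -> Captioner:
    """Stand-in for a real captioning model (assumption: real models would run here)."""
    return lambda image_id: f"{name} caption for {image_id}"


# Four placeholder captioners and three placeholder images.
captioners = {f"model_{i}": make_stub(f"model_{i}") for i in range(1, 5)}
images = ["img_0001.jpg", "img_0002.jpg", "img_0003.jpg"]

pairs = build_caption_pairs(images, captioners)
print(len(pairs))     # 3 images x 4 captioners
print(2_000_000 * 4)  # the same product at the scale stated in the abstract
```

At the scale described in the abstract, the same product gives 2,000,000 × 4 = 8,000,000 image-caption pairs, which matches the reported caption count.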
DOI
10.55730/1300-0632.4150
Keywords
Image captioning, Turkish image captioning, image encoders, text decoders, deep learning
First Page
669
Last Page
687
Publisher
The Scientific and Technological Research Council of Türkiye (TÜBİTAK)
Creative Commons License

This work is licensed under a Creative Commons Attribution 4.0 International License.
Recommended Citation
YILDIZ, S, MEMİŞ, A, & VARLI, S (2025). TRCaptionNet++: A high-performance encoder-decoder based deep Turkish image captioning model fine-tuned with a large-scale set of pretrain data. Turkish Journal of Electrical Engineering and Computer Sciences 33 (5): 669-687. https://doi.org/10.55730/1300-0632.4150