Turkish Journal of Electrical Engineering and Computer Sciences

TRCaptionNet++: A high-performance encoder-decoder based deep Turkish image captioning model fine-tuned with a large-scale set of pretrain data

Abstract

This paper introduces a novel, high-performance encoder-decoder-based deep model called TRCaptionNet++ for generic Turkish image captioning tasks. The proposed model is an improved and refined version of TRCaptionNet, which essentially employs a CLIP (contrastive language–image pretraining) image encoder, a feature projection layer, and a BERT (bidirectional encoder representations from transformers) text decoder. In this study, the original TRCaptionNet model was trained and then specifically fine-tuned with a large-scale set of image data. In the initial stage, approximately 2,000,000 random images representing the words in the MS COCO and Flickr caption sets were retrieved through web crawling. Then, nearly 8,000,000 caption texts in total were generated for these images via 4 different image captioning models. Finally, the text decoder module of the proposed model was improved by using the image-caption features of these crawled images. The TRCaptionNet++ model was evaluated on two Turkish caption datasets (TasvirEt and Turkish MS COCO) and two machine-translated caption sets (MS COCO and Flickr30K) using common image captioning metrics: BLEU, METEOR, ROUGE-L, CIDEr, and SPICE. The tests show that the proposed model achieves strong captioning scores and outperforms all the related works. Project details and demo links of TRCaptionNet++ will also be available on the project page: https://serdaryildiz.com/TRCaptionNetpp
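
The caption-generation stage described in the abstract (every crawled image captioned by each of several models) can be sketched as follows. This is only an illustrative sketch, not the authors' pipeline: the names `build_synthetic_pairs`, `make_dummy_captioner`, and `model_0` to `model_3` are hypothetical, and the dummy captioners stand in for real image captioning models, which the abstract does not name.

```python
from typing import Callable, Dict, Iterable, List, Tuple

# Hypothetical stand-in type: a captioner maps an image ID to a caption string.
Captioner = Callable[[str], str]


def build_synthetic_pairs(
    image_ids: Iterable[str],
    captioners: Dict[str, Captioner],
) -> List[Tuple[str, str, str]]:
    """Return (image_id, captioner_name, caption) for every image/model pair."""
    pairs = []
    for image_id in image_ids:
        for name, captioner in captioners.items():
            pairs.append((image_id, name, captioner(image_id)))
    return pairs


def make_dummy_captioner(tag: str) -> Captioner:
    # Placeholder only: a real captioner would run a model on the image pixels.
    return lambda image_id: f"{tag} caption for {image_id}"


# Four placeholder captioners, mirroring the "4 different models" in the abstract.
captioners = {f"model_{i}": make_dummy_captioner(f"model_{i}") for i in range(4)}
pairs = build_synthetic_pairs(["img_0", "img_1", "img_2"], captioners)
print(len(pairs))  # 3 images x 4 captioners = 12 pairs
```

At the scale reported in the abstract, the same product gives roughly 2,000,000 images x 4 captioners, or about 8,000,000 caption texts.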

DOI

10.55730/1300-0632.4150

Keywords

Image captioning, Turkish image captioning, image encoders, text decoders, deep learning

First Page

669

Last Page

687

Publisher

The Scientific and Technological Research Council of Türkiye (TÜBİTAK)

Creative Commons License

This work is licensed under a Creative Commons Attribution 4.0 International License.

Recommended Citation

YILDIZ, S, MEMİŞ, A, & VARLI, S (2025). TRCaptionNet++: A high-performance encoder-decoder based deep Turkish image captioning model fine-tuned with a large-scale set of pretrain data. Turkish Journal of Electrical Engineering and Computer Sciences 33 (5): 669-687. https://doi.org/10.55730/1300-0632.4150
