Turkish Journal of Electrical Engineering and Computer Sciences

Author ORCID Identifier

SELEN PEHLİVAN: 0000-0002-8658-8890

Abstract

Transferring knowledge from large-scale, independently pretrained image and text models to video understanding requires addressing several challenges, including maintaining the generalization capabilities of the models, integrating them into multimodal architectures, and fine-tuning them to capture temporal dynamics. This study evaluates the effectiveness of parameter-efficient fine-tuning (PEFT) techniques in transferring pretrained knowledge from two independent models to video action recognition within a simple, streamlined multimodal fusion pipeline. Specifically, we adapt CLIP as the text branch and DINOv2 as the image branch, keeping both backbones frozen to preserve their pretrained robustness, while introducing lightweight, task-specific modules to adapt and fuse the branches with temporal dynamics. A simple fusion transformer combines the image and text branches, enabling their efficient integration at minimal training cost. We systematically evaluate the framework on widely recognized midscale video benchmark datasets, comparing prompt-based and adapter-based PEFT techniques across different data regimes. Our results demonstrate that this combination achieves competitive performance, highlight the transferability and scalability of independently pretrained models for a targeted task, and provide practical insights for adapting large models using midscale, task-specific video datasets. In particular, adaptations of the DINOv2 image encoder and the CLIP text encoder improve recognition accuracy over the frozen baseline by up to an average absolute gain of 3.47% across the K5 to KAll settings. Moreover, the proposed DoRA-adapted DINOv2 combined with an adapter-based CLIP text encoder achieves performance competitive with the state of the art on UCF101, HMDB51, and DIVING48, consistently outperforming prior methods in few-shot scenarios and reaching up to 82.0% accuracy in the K2 setting.
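
For readers unfamiliar with DoRA, the following is a minimal sketch of the general weight-decomposed low-rank reparameterization on a single frozen linear layer. It is not the implementation used in this study: the class name `DoRALinear`, the helper functions, the rank, the initialization scale, and the toy layer size are all illustrative assumptions. The sketch only shows the idea that the pretrained weight stays frozen while a per-column magnitude vector and a low-rank direction update are the trainable parts.

```python
import math
import random


def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


def column_norms(w):
    """Euclidean norm of every column of w."""
    return [math.sqrt(sum(row[j] ** 2 for row in w)) for j in range(len(w[0]))]


class DoRALinear:
    """Hypothetical illustration: frozen weight w0 plus DoRA-style parameters.

    Effective weight: W' = m * (w0 + B A) / ||w0 + B A||_column
    where m (magnitude), B and A (low-rank direction update) are trainable.
    """

    def __init__(self, w0, rank, seed=0):
        rng = random.Random(seed)
        self.d_out, self.d_in, self.rank = len(w0), len(w0[0]), rank
        self.w0 = w0                      # frozen pretrained weight
        self.m = column_norms(w0)         # trainable, initialized to ||w0||
        # B starts at zero so the layer initially reproduces w0 exactly.
        self.B = [[0.0] * rank for _ in range(self.d_out)]
        self.A = [[rng.gauss(0.0, 0.01) for _ in range(self.d_in)]
                  for _ in range(rank)]

    def merged_weight(self):
        delta = matmul(self.B, self.A)
        v = [[self.w0[i][j] + delta[i][j] for j in range(self.d_in)]
             for i in range(self.d_out)]
        norms = column_norms(v)
        return [[self.m[j] * v[i][j] / norms[j] for j in range(self.d_in)]
                for i in range(self.d_out)]

    def forward(self, x):
        w = self.merged_weight()
        return [sum(w[i][j] * x[j] for j in range(self.d_in))
                for i in range(self.d_out)]

    def num_trainable(self):
        return self.d_in + self.rank * (self.d_in + self.d_out)


# Toy usage: an 8x8 frozen layer adapted with rank 1.
_rng = random.Random(1)
w0 = [[_rng.gauss(0.0, 1.0) for _ in range(8)] for _ in range(8)]
layer = DoRALinear(w0, rank=1)
print(layer.num_trainable(), "trainable vs", 8 * 8, "frozen parameters")
```

In this toy setting the trainable parameters (magnitude vector plus the two low-rank factors) number fewer than the frozen ones, and the gap widens with realistic layer sizes, which is the property that makes this family of methods parameter-efficient.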

DOI

10.55730/1300-0632.4184

Keywords

Action recognition, multimodal, adaptation, parameter-efficient fine-tuning

First Page

437

Last Page

452

Publisher

The Scientific and Technological Research Council of Türkiye (TÜBİTAK)

Creative Commons License

This work is licensed under a Creative Commons Attribution 4.0 International License.