Turkish Journal of Electrical Engineering and Computer Sciences
Author ORCID Identifier
SELEN PEHLİVAN: 0000-0002-8658-8890
Abstract
Transferring knowledge from large-scale, independently pretrained image and text models to video understanding requires addressing several challenges, including maintaining the models' generalization capabilities, integrating them into multimodal architectures, and fine-tuning with temporal dynamics. This study evaluates the effectiveness of parameter-efficient fine-tuning (PEFT) techniques in transferring pretrained knowledge from two independent models for video action recognition within a simple, streamlined multimodal fusion pipeline. Specifically, we adapt CLIP as the text branch and DINOv2 as the image branch, keeping both backbones frozen to preserve their pretrained robustness, while introducing lightweight, task-specific modules to adapt and fuse the branches with temporal dynamics. A simple fusion transformer combines the image and text branches, enabling their efficient integration with minimal training cost. We systematically evaluate the framework on widely recognized midscale video benchmark datasets, comparing prompt-based and adapter-based PEFT techniques across different data regimes. Our results demonstrate that this combination achieves competitive performance, highlights the transferability and scalability of independent pretrained models for a targeted task, and provides practical insights for adapting large models using midscale, task-specific video datasets. In particular, adaptations of the DINOv2 image encoder and CLIP text encoder improve recognition accuracy over the frozen baseline, with average absolute gains of up to 3.47% across K5–KAll. Moreover, the proposed DoRA-adapted DINOv2 combined with an adapter-based CLIP text encoder achieves performance competitive with the state of the art on UCF101, HMDB51, and DIVING48, consistently outperforming prior methods in few-shot scenarios and reaching up to 82.0% accuracy with K2 training examples.
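To give a rough sense of why adapter-style PEFT keeps training cost low when the backbones stay frozen, the sketch below counts trainable parameters for a single linear layer under full fine-tuning, a LoRA-style low-rank update, a DoRA-style update, and a bottleneck adapter. It is an illustration only, not the configuration used in the paper: the hidden size of 768, the rank of 8, the bottleneck width of 64, and all function names are assumptions chosen for the example, and the DoRA count assumes one learned magnitude per output feature.

```python
# Illustrative trainable-parameter counts for one linear layer.
# All sizes and names below are hypothetical; they are not taken from the paper.

def full_finetune_params(d_in: int, d_out: int) -> int:
    """Weights updated when the whole d_out x d_in matrix is trained (bias ignored)."""
    return d_in * d_out

def lora_params(d_in: int, d_out: int, rank: int) -> int:
    """Low-rank update B @ A with B of shape (d_out, rank) and A of shape (rank, d_in)."""
    return rank * (d_in + d_out)

def dora_params(d_in: int, d_out: int, rank: int) -> int:
    """Low-rank directional update plus a learned magnitude vector.
    Assumption: one magnitude entry per output feature."""
    return lora_params(d_in, d_out, rank) + d_out

def bottleneck_adapter_params(d_model: int, bottleneck: int) -> int:
    """Down-projection and up-projection, each with a bias term."""
    down = d_model * bottleneck + bottleneck
    up = bottleneck * d_model + d_model
    return down + up

if __name__ == "__main__":
    d, r, b = 768, 8, 64  # assumed hidden size, rank, and bottleneck width
    full = full_finetune_params(d, d)
    for name, count in [
        ("full fine-tuning", full),
        ("LoRA-style", lora_params(d, d, r)),
        ("DoRA-style", dora_params(d, d, r)),
        ("bottleneck adapter", bottleneck_adapter_params(d, b)),
    ]:
        print(f"{name:>20}: {count:>7} trainable ({count / full:.2%} of full)")
```

Under these assumed sizes, the low-rank variants train only a few percent of the parameters of the layer they modify, which is the property that makes it practical to adapt two large backbones on midscale video datasets.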
DOI
10.55730/1300-0632.4184
Keywords
Action recognition, multimodal, adaptation, parameter-efficient fine-tuning
First Page
437
Last Page
452
Publisher
The Scientific and Technological Research Council of Türkiye (TÜBİTAK)
Creative Commons License

This work is licensed under a Creative Commons Attribution 4.0 International License.
Recommended Citation
PEHLİVAN, S (2026). Adapting independent large-scale pretrained models for human action recognition. Turkish Journal of Electrical Engineering and Computer Sciences 34 (3): 437-452. https://doi.org/10.55730/1300-0632.4184
Included in
Computer Engineering Commons, Computer Sciences Commons, Electrical and Computer Engineering Commons