Author ORCID Identifier

Abstract

Transferring knowledge from large-scale, independently pretrained image and text models to video understanding requires addressing several challenges, including maintaining the models' generalization capabilities, integrating them into multimodal architectures, and fine-tuning with temporal dynamics. This study evaluates the effectiveness of parameter-efficient fine-tuning (PEFT) techniques in transferring pretrained knowledge from two independent models for video action recognition within a simple, streamlined multimodal fusion pipeline. Specifically, we adapt CLIP as the text branch and DINOv2 as the image branch, keeping both backbones frozen to preserve their pretrained robustness, while introducing lightweight, task-specific modules to adapt and fuse the branches with temporal dynamics. A simple fusion transformer combines the image and text branches, enabling their efficient integration with minimal training cost. We systematically evaluate the framework on widely recognized midscale video benchmark datasets, comparing prompt-based and adapter-based PEFT techniques across different data regimes. Our results demonstrate that this combination achieves competitive performance, highlights the transferability and scalability of independent pretrained models for a targeted task, and provides practical insights for adapting large models using midscale, task-specific video datasets. In particular, adaptations of the DINOv2 image encoder and CLIP text encoder improve recognition accuracy over the frozen baseline, with average absolute gains of up to 3.47% across K5–KAll. Moreover, the proposed DoRA DINOv2 combined with an adapter-based CLIP text encoder achieves competitive state-of-the-art performance on UCF101, HMDB51, and DIVING48, consistently outperforming prior methods in few-shot scenarios and reaching up to 82.0% accuracy with K2 training examples.

DOI

10.55730/1300-0632.4184

Keywords

Action recognition, multimodal, adaptation, parameter-efficient fine-tuning

First Page

437

Last Page

452

Publisher

The Scientific and Technological Research Council of Türkiye (TÜBİTAK)

Creative Commons License

This work is licensed under a Creative Commons Attribution 4.0 International License.

Recommended Citation

PEHLİVAN, S (2026). Adapting independent large-scale pretrained models for human action recognition. Turkish Journal of Electrical Engineering and Computer Sciences 34 (3): 437-452. https://doi.org/10.55730/1300-0632.4184

Included in

Computer Engineering Commons, Computer Sciences Commons, Electrical and Computer Engineering Commons

Turkish Journal of Electrical Engineering and Computer Sciences

Adapting independent large-scale pretrained models for human action recognition
