DOI

10.55730/1300-0632.4038

Abstract

Training supervised machine learning models depends critically on annotating large-scale datasets. Semisupervised learning approaches aim to achieve performance comparable to supervised methods with less annotation, without sacrificing generalization capacity. In line with this objective, ways of leveraging unlabeled data have been the subject of intense research. However, semisupervised video action recognition has received less attention than its image-domain counterparts. Existing semisupervised video action recognition methods trained from scratch rely heavily on augmentation techniques, complex architectures, and/or additional modalities, while distillation-based methods use models that have only been trained for 2D computer vision tasks. In another line of work, pretrained vision-language models have shown very promising results in generating general-purpose visual features, with reports of high zero-shot performance on many downstream tasks. In this work, we exploit a language-supervised visual encoder to learn video representations for video action classification. We propose a teacher-student learning paradigm based on feature distillation and pseudo-labeling. Our experimental results serve as a proof of concept, showing that multimodal feature extractors can be used for spatiotemporal feature extraction in a semisupervised learning context and achieve performance comparable to state-of-the-art (SOTA) methods, especially in the low-label regime.
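
To make the teacher-student idea concrete, the following is a minimal sketch of how a feature distillation term and a pseudo-labeling term might be combined for one sample. The abstract does not specify the loss functions, the confidence threshold, or the source of the teacher's class scores, so everything here is an assumption for illustration: the mean squared error for distillation, the 0.9 threshold, and the names `feature_distillation_loss`, `pseudo_label`, and `semisupervised_loss` are hypothetical and are not taken from the paper.

```python
import math


def softmax(logits):
    """Convert raw scores to probabilities (numerically stable)."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def feature_distillation_loss(student_feat, teacher_feat):
    """Assumed distillation term: mean squared error between feature vectors."""
    if len(student_feat) != len(teacher_feat):
        raise ValueError("feature vectors must have the same length")
    sq = [(s - t) ** 2 for s, t in zip(student_feat, teacher_feat)]
    return sum(sq) / len(sq)


def pseudo_label(teacher_logits, threshold=0.9):
    """Return the teacher's predicted class if it is confident, else None.

    The threshold value is an illustrative assumption.
    """
    probs = softmax(teacher_logits)
    confidence = max(probs)
    return probs.index(confidence) if confidence >= threshold else None


def cross_entropy(logits, label):
    """Negative log-likelihood of the target class."""
    return -math.log(softmax(logits)[label])


def semisupervised_loss(student_feat, teacher_feat, student_logits,
                        teacher_logits, label=None, threshold=0.9):
    """Distillation on every sample; classification only when a target exists.

    Labeled samples use the ground-truth label. Unlabeled samples use a
    pseudo-label, and only when the teacher is confident enough.
    """
    loss = feature_distillation_loss(student_feat, teacher_feat)
    target = label if label is not None else pseudo_label(teacher_logits, threshold)
    if target is not None:
        loss += cross_entropy(student_logits, target)
    return loss
```

In this sketch, an unlabeled clip for which the teacher is not confident contributes only the distillation term, which is one common way to keep noisy pseudo-labels from dominating training when few labels are available.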

Keywords

Video action classification, multimodal learning, semisupervised learning, feature distillation

First Page

1129

Last Page

1145

Recommended Citation

ÇELİK, ASLI; KÜÇÜKMANİSA, AYHAN; and URHAN, OĞUZHAN (2023) "Feature distillation from vision-language model for semisupervised action classification," Turkish Journal of Electrical Engineering and Computer Sciences: Vol. 31: No. 6, Article 14. https://doi.org/10.55730/1300-0632.4038
Available at: https://journals.tubitak.gov.tr/elektrik/vol31/iss6/14

Included in

Computer Engineering Commons, Computer Sciences Commons, Electrical and Computer Engineering Commons

Turkish Journal of Electrical Engineering and Computer Sciences

Feature distillation from vision-language model for semisupervised action classification
