Transferring knowledge from task-agnostic pre-trained deep models to downstream tasks is an important topic in computer vision research. With the growth of computational capacity, we now have open-source vision-language pre-trained models trained at large scale in both model architecture and amount of data. In this study, we focus on transferring knowledge for video classification tasks. Conventional methods randomly initialize the linear classifier head for vision classification, but they leave the use of the text encoder for downstream visual recognition tasks unexplored. In this paper, we revisit the role of the linear classifier and replace it with different knowledge from the pre-trained model.
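A minimal sketch of the core idea, assuming CLIP-style encoders: rather than randomly initializing the classifier head, the text encoder's embeddings of the class names serve directly as the classifier weights, and logits are cosine similarities between video features and those embeddings. The random tensors below are stand-ins for the (frozen) encoder outputs; all names and shapes are illustrative, not the paper's implementation.

```python
import numpy as np

rng = np.random.default_rng(0)
num_classes, dim, batch = 5, 512, 4

def normalize(x):
    # L2-normalize along the last axis so dot products are cosine similarities
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

# Stand-in for text_encoder("a video of a {class name}") per class;
# these embeddings replace the randomly initialized linear head.
text_embeds = normalize(rng.standard_normal((num_classes, dim)))

# Stand-in for video_encoder(frames), pooled over time into one feature per clip
video_feats = normalize(rng.standard_normal((batch, dim)))

# Cosine-similarity logits: the text embeddings act as classifier weights
logits = video_feats @ text_embeds.T
preds = logits.argmax(axis=-1)
print(logits.shape)
```

Because the "classifier" rows are semantic text embeddings rather than random vectors, the head starts out already aligned with the visual feature space of the pre-trained model.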
2022: Wenhao Wu, Zhun Sun, Wanli Ouyang
Ranked #1 on Action Recognition on ActivityNet
https://arxiv.org/pdf/2207.01297v3.pdf