Transferring knowledge from task-agnostic pre-trained deep models to downstream tasks is an important topic in computer vision research. Along with the growth of computational capacity, we now have open-source vision-language pre-trained models at large scale, in terms of both model architecture and amount of data. In this study, we focus on transferring knowledge for video classification tasks. Conventional methods randomly initialize the linear classifier head for vision classification, but they leave the usage of the text encoder for downstream visual recognition tasks unexplored. In this paper, we revise the role of the linear classifier and replace it with different knowledge from the pre-trained model.
2022: Wenhao Wu, Zhun Sun, Wanli Ouyang
Ranked #1 on Action Recognition on ActivityNet
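To make the classifier-replacement idea above concrete, here is a minimal sketch assuming a CLIP-style vision-language model loaded through the open-source `clip` package. The class names, the prompt template, and the choice to freeze the text-derived weights are illustrative assumptions for this sketch, not details taken from the paper.

```python
import torch
import torch.nn as nn
import clip  # OpenAI CLIP; any vision-language model with a text encoder would do

# Hypothetical class names used only for illustration.
class_names = ["archery", "playing piano", "rock climbing"]

device = "cuda" if torch.cuda.is_available() else "cpu"
model, _ = clip.load("ViT-B/32", device=device)

# Encode the class names once with the (frozen) text encoder.
with torch.no_grad():
    tokens = clip.tokenize([f"a video of {c}" for c in class_names]).to(device)
    text_embeds = model.encode_text(tokens).float()
    text_embeds = text_embeds / text_embeds.norm(dim=-1, keepdim=True)

embed_dim = text_embeds.shape[1]

# Conventional head: a randomly initialized linear classifier.
random_head = nn.Linear(embed_dim, len(class_names), bias=False)

# Alternative sketched here: set the classifier weights to the text embeddings
# of the class names, so the head carries knowledge from the pre-trained model.
text_head = nn.Linear(embed_dim, len(class_names), bias=False)
text_head.weight.data.copy_(text_embeds)
text_head.weight.requires_grad = False  # assumption: keep the transferred weights fixed
text_head = text_head.to(device)

# video_features stands in for pooled features from the visual encoder.
video_features = torch.randn(2, embed_dim, device=device)
video_features = video_features / video_features.norm(dim=-1, keepdim=True)
logits = text_head(video_features)  # [batch, num_classes] class scores
```

In this sketch the only change from the conventional setup is where the classifier weights come from; the rest of the video recognition pipeline is left untouched.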