Modern self-supervised learning algorithms typically enforce persistency of instance representations across views. While highly effective for learning holistic image and video representations, such an objective becomes suboptimal for learning spatio-temporally fine-grained features in videos, where scenes and instances evolve through space and time. In this paper, we present Contextualized Spatio-Temporal Contrastive Learning (ConST-CL) to effectively learn spatio-temporally fine-grained video representations via self-supervision. We first design a region-based pretext task which requires the model to transform instance representations from one view to another, guided by context features. Further, we introduce a simple network design that reconciles the simultaneous learning of both holistic and local representations. We evaluate our learned representations on a variety of downstream tasks and show that ConST-CL achieves competitive results on six datasets, including Kinetics, UCF, HMDB, AVA-Kinetics, AVA and OTB. Our code and models will be available at https://github.com/tensorflow/models/tree/master/official/projects/const_cl.
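The region-based pretext task described above can be sketched as follows. This is a hypothetical, simplified illustration (not the authors' implementation): region features from one view attend to context features from the other view, and the transformed features are trained to match their counterpart regions with an InfoNCE-style loss. All function and variable names here are assumptions for illustration only.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def contextualized_contrastive_loss(regions_a, context_b, regions_b, tau=0.1):
    """Hypothetical sketch of a region-level contrastive objective:
    view-A region features are transformed via cross-attention over
    view-B context features, then matched to the corresponding view-B
    region features with an InfoNCE-style loss."""
    # Cross-attention: transform view-A regions using view-B context.
    attn = softmax(regions_a @ context_b.T / np.sqrt(regions_a.shape[-1]))
    predicted_b = attn @ context_b                      # (N, D)
    # L2-normalize before computing cosine similarities.
    p = predicted_b / np.linalg.norm(predicted_b, axis=-1, keepdims=True)
    t = regions_b / np.linalg.norm(regions_b, axis=-1, keepdims=True)
    logits = p @ t.T / tau                              # (N, N)
    # InfoNCE: each transformed region should match its own counterpart
    # (the diagonal); all other regions act as negatives.
    log_probs = logits - np.log(np.exp(logits).sum(axis=-1, keepdims=True))
    idx = np.arange(len(logits))
    return -log_probs[idx, idx].mean()
```

In practice the transformation would be a learned attention module and the loss would be back-propagated through the video encoder; the sketch only shows the shape of the objective.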
|Title of host publication
|Proceedings - 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2022
|Publisher
|IEEE Computer Society
|Published - 2022
|2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2022 - New Orleans, United States
Duration: 2022 Jun 19 → 2022 Jun 24
|Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition
Bibliographical note: Publisher Copyright © 2022 IEEE.
All Science Journal Classification (ASJC) codes
- Computer Vision and Pattern Recognition