Detailed record

Temporal-masked skeleton-based action recognition with supervised contrastive learning

Article written by: Zhao, Zhifeng ; Lin, Yuxiang ; Chen, Guodong

Abstract: Recent years have seen a resurgence of self-supervised learning for visual representation, driven by Contrastive Learning and Masked Image Modeling. Existing self-supervised methods for skeleton-based action recognition typically learn only feature invariance of the data, through contrastive learning. In this paper, we propose a contrastive learning method combined with a temporal-masking mechanism for skeleton sequences, which encourages the network to learn action representations beyond feature invariance, e.g., occlusion invariance, by implicitly reconstructing the masked sequences. However, direct masking destroys the feature consistency of the samples, so we propose Supervised Positive Sample Mining and a self-attention module for embeddings to improve the generalization of the model. First, supervised contrastive learning can improve the robustness of the model by using prior knowledge of labels. Second, to prevent excessive masking from hindering the model from learning the correct occlusion invariance, a self-attention mechanism is necessary, which further discriminates the distances between action classes in the feature space. Results under various experimental protocols on the NTU 60, NTU 120, and PKU-MMD datasets demonstrate the advantages of our method, which outperforms existing state-of-the-art contrastive methods. Code is available at https://github.com/ZZFCV/SASOiCLR.
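The two ingredients named in the abstract, temporal masking of a skeleton sequence and a label-aware contrastive objective, can be illustrated with a minimal NumPy sketch. This is not the authors' implementation (see the linked repository for that). The function names `temporal_mask` and `supervised_contrastive_loss`, the `(frames, joints, channels)` layout, the zero-fill masking, the `mask_ratio` and `temperature` values, and the use of the standard supervised contrastive (SupCon) loss as a stand-in for Supervised Positive Sample Mining are all assumptions made for illustration.

```python
import numpy as np


def temporal_mask(seq, mask_ratio=0.3, rng=None):
    """Zero out a random subset of frames of a (T, J, C) skeleton sequence.

    Hypothetical helper: zero-filling whole frames is an assumed masking
    scheme, not necessarily the one used in the paper.
    """
    rng = np.random.default_rng() if rng is None else rng
    num_frames = seq.shape[0]
    num_masked = int(round(mask_ratio * num_frames))
    masked_idx = rng.choice(num_frames, size=num_masked, replace=False)
    out = seq.copy()
    out[masked_idx] = 0.0
    return out, np.sort(masked_idx)


def supervised_contrastive_loss(embeddings, labels, temperature=0.1):
    """Standard SupCon loss: samples sharing a label are treated as positives."""
    labels = np.asarray(labels)
    # L2-normalise so the dot product is a cosine similarity.
    z = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    sim = z @ z.T / temperature
    n = sim.shape[0]
    self_mask = np.eye(n, dtype=bool)
    # Exclude each anchor from its own softmax denominator.
    sim = np.where(self_mask, -np.inf, sim)
    row_max = sim.max(axis=1, keepdims=True)
    shifted = sim - row_max
    log_prob = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    # Positives: same label, different sample.
    pos = (labels[:, None] == labels[None, :]) & ~self_mask
    pos_count = pos.sum(axis=1)
    valid = pos_count > 0
    if not valid.any():
        return 0.0
    # np.where avoids multiplying the -inf diagonal by zero.
    pos_log_prob = np.where(pos, log_prob, 0.0).sum(axis=1)
    per_anchor = -pos_log_prob[valid] / pos_count[valid]
    return float(per_anchor.mean())


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    sequence = rng.normal(size=(64, 25, 3))  # 64 frames, 25 joints, xyz
    masked, idx = temporal_mask(sequence, mask_ratio=0.3, rng=rng)
    print("masked frames:", idx)

    features = rng.normal(size=(8, 16))      # stand-in encoder outputs
    actions = np.array([0, 0, 1, 1, 2, 2, 3, 3])
    print("loss:", supervised_contrastive_loss(features, actions))
```

In a training loop of this kind, the masked and unmasked views of a sequence would both pass through an encoder, and the loss would pull together embeddings that share an action label regardless of which frames were masked.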

Language: English