Bi-calibration Networks for Weakly-Supervised Video Representation Learning
Article written by: Long, Fuchen; Luo, Jiebo; Mei, Tao; Tian, Xinmei; Yao, Ting; Qiu, Zhaofan
Abstract: Leveraging large volumes of web videos paired with queries (short phrases used to search for the videos) or surrounding text (longer textual descriptions, e.g., video titles) offers an economical and extensible alternative to supervised video representation learning. Nevertheless, modeling such weak visual-textual connections is not trivial due to query polysemy (i.e., many possible meanings for a query) and text isomorphism (i.e., the same syntactic structure shared by different texts). In this paper, we introduce a new design of mutual calibration between query and text to achieve more reliable visual-textual supervision for video representation learning. Specifically, we present Bi-Calibration Networks (BCN), which couple two calibrations to learn the correction from text to query and vice versa. Technically, BCN clusters all the titles of the videos retrieved by an identical query and takes the centroid of each cluster as a text prototype. All the queries constitute the query set. The representation learning of BCN is then formulated as video classification over text prototypes and queries, with text-to-query and query-to-text calibrations. A selection scheme is also devised to balance the two calibrations. Two large-scale web video datasets paired with queries and titles, named YOVO-3M and YOVO-10M, are newly collected for weakly-supervised video feature learning. The video features of BCN with a ResNet backbone learnt on YOVO-3M (3M YouTube videos) obtain superior results under the linear protocol on action recognition. More remarkably, BCN trained on the larger YOVO-10M set (10M YouTube videos) with further fine-tuning leads to a 1.3% gain in top-1 accuracy on the Kinetics-400 dataset over the state-of-the-art TAda2D method with ImageNet pre-training.
Language:
English
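
The prototype-building step mentioned in the abstract (cluster the titles retrieved by one query, then take each cluster centroid as a text prototype) can be illustrated with a minimal sketch. The abstract does not say which clustering algorithm or title encoder is used, so everything concrete here is an assumption for illustration only: plain k-means, the function names `kmeans_centroids` and `build_text_prototypes`, the number of clusters `k`, and the toy 2-D "title embeddings". The calibration and selection schemes are not covered.

```python
import numpy as np


def kmeans_centroids(points, k, n_iter=20, seed=0):
    """Plain k-means (an assumed choice, not confirmed by the abstract)."""
    rng = np.random.default_rng(seed)
    points = np.asarray(points, dtype=float)
    # Initialise centroids from k distinct data points (indexing copies them).
    centroids = points[rng.choice(len(points), size=k, replace=False)]
    for _ in range(n_iter):
        # Squared Euclidean distance from every point to every centroid.
        dists = ((points[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        assign = dists.argmin(axis=1)
        for j in range(k):
            members = points[assign == j]
            if len(members) > 0:  # keep the old centroid if a cluster empties
                centroids[j] = members.mean(axis=0)
    return centroids


def build_text_prototypes(title_embeddings_by_query, k):
    """Map each query to the centroids of its clustered title embeddings."""
    return {
        query: kmeans_centroids(embeddings, k)
        for query, embeddings in title_embeddings_by_query.items()
    }


# Hypothetical toy data: two queries, each with four 2-D title embeddings.
toy = {
    "jaguar": np.array([[0.0, 0.0], [0.1, 0.0], [5.0, 5.0], [5.1, 5.0]]),
    "apple": np.array([[1.0, 1.0], [1.0, 1.2], [9.0, 0.0], [9.0, 0.2]]),
}
prototypes = build_text_prototypes(toy, k=2)
query_set = sorted(toy)  # all queries together form the query set
```

In this reading, each video retrieved by a query would be classified both over `query_set` and over the rows of `prototypes[query]`; how those two label spaces correct each other is the part the paper itself specifies.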