Efficient and self-adaptive rationale knowledge base for visual commonsense reasoning
Article by: Song, Zijie; Hong, Richang; Hu, Zhenzhen
Abstract: The visual commonsense reasoning (VCR) task leads to a cognitive level of understanding across the visual and linguistic domains. Its three sub-tasks, i.e., Q→A, QA→R, and Q→AR, require the ability to predict the correct answer and a rationale explaining it from the given image and question. Unlike other visual reasoning tasks, such as VQA and GQA, VCR focuses on exploring the facts that clarify the causes, context, and consequences of the image and question, which is a process of acquiring knowledge and thorough understanding. In this paper, we propose a rationale knowledge base (RKB) incorporating a convolution fusion mechanism to import VCR-related knowledge. We emphasize that (1) the RKB is extracted from and then trained on the VCR dataset (VCR-set) itself, and (2) the convolution fusion mechanism is carefully designed to be self-adaptive and computationally efficient. Experiments on the large-scale VCR-set demonstrate the effectiveness of the proposed method on all three sub-tasks.
Language:
English