
Detailed record

Co-attention graph convolutional network for visual question answering

Article written by: Liu, Chuan; Tan, Ying-Ying; Xia, Tian-Tian; Zhang, Jiajing; Zhu, Ming

Abstract: Visual Question Answering (VQA) is a challenging task that requires a fine-grained understanding of both the visual content of images and the textual content of questions. Conventional visual attention models, designed primarily from the perspective of the attention mechanism, lack the ability to reason about relationships between visual objects and ignore the multimodal interactions between questions and images. In this work, we propose a model that combines a graph convolutional network with a co-attention network to circumvent these problems. The model employs binary relational reasoning as a graph learner module to learn a graph structure that captures relationships between visual objects, and it learns a question-specific image representation with an awareness of spatial location via spatial graph convolution. We then perform parallel co-attention learning by passing the image representations and the features of the question words through a deep co-attention module. Experimental results show that our model achieves an overall accuracy of 68.67% on the test-std set of the benchmark VQA v2.0 dataset, outperforming most existing models.
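The abstract describes a three-stage flow: a graph learner scores pairwise relations between visual objects conditioned on the question, a graph convolution aggregates object features under that learned adjacency, and a co-attention module lets image nodes and question words attend to each other. The PyTorch sketch below is a minimal illustration of that flow under assumed layer sizes and module names; it is not the authors' implementation, and the bounding-box coordinates that would supply the "spatial" awareness, as well as the mean-pooled question vector, are simplifying assumptions.

```python
# Minimal sketch of the graph-learner -> graph-convolution -> co-attention
# pipeline described in the abstract. All dimensions and names are
# illustrative assumptions, not the paper's actual architecture.
import torch
import torch.nn as nn
import torch.nn.functional as F

class GraphLearner(nn.Module):
    """Binary relational reasoning: score every object pair given the question."""
    def __init__(self, obj_dim, q_dim, hidden=512):
        super().__init__()
        self.proj = nn.Linear(2 * obj_dim + q_dim, hidden)
        self.score = nn.Linear(hidden, 1)

    def forward(self, objs, q):
        # objs: (B, N, obj_dim) object features; q: (B, q_dim) pooled question
        B, N, D = objs.shape
        oi = objs.unsqueeze(2).expand(B, N, N, D)           # object i in each pair
        oj = objs.unsqueeze(1).expand(B, N, N, D)           # object j in each pair
        qq = q[:, None, None, :].expand(B, N, N, q.size(-1))
        pair = torch.cat([oi, oj, qq], dim=-1)
        logits = self.score(torch.relu(self.proj(pair))).squeeze(-1)  # (B, N, N)
        return F.softmax(logits, dim=-1)                    # soft adjacency matrix

class GraphConv(nn.Module):
    """One graph-convolution step: aggregate neighbours under the learned adjacency.
    (Appending bounding-box coordinates to objs would add spatial awareness.)"""
    def __init__(self, obj_dim, out_dim):
        super().__init__()
        self.lin = nn.Linear(obj_dim, out_dim)

    def forward(self, objs, adj):
        return torch.relu(self.lin(torch.bmm(adj, objs)))   # (B, N, out_dim)

class CoAttention(nn.Module):
    """Parallel co-attention between image nodes and question-word features."""
    def __init__(self, dim):
        super().__init__()
        self.W = nn.Linear(dim, dim, bias=False)

    def forward(self, v, w):
        # v: (B, N, dim) image nodes; w: (B, T, dim) word features
        aff = torch.bmm(self.W(v), w.transpose(1, 2))       # (B, N, T) affinity
        v_att = torch.bmm(F.softmax(aff, dim=1).transpose(1, 2), v)  # per-word image view
        w_att = torch.bmm(F.softmax(aff, dim=2), w)                  # per-object word view
        return v_att, w_att

if __name__ == "__main__":
    B, N, T, D = 2, 36, 14, 512                             # batch, objects, words, feature dim
    objs, words = torch.randn(B, N, D), torch.randn(B, T, D)
    q = words.mean(dim=1)                                   # crude pooled question vector
    adj = GraphLearner(D, D)(objs, q)
    nodes = GraphConv(D, D)(objs, adj)
    v_att, w_att = CoAttention(D)(nodes, words)
    print(v_att.shape, w_att.shape)  # torch.Size([2, 14, 512]) torch.Size([2, 36, 512])
```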


Language: English