Abstract:
Image captioning is a popular vision-and-language task that aims to generate accurate textual descriptions of images. Recently, some works have used objects to ease image-text alignment and learn better cross-modal representations, achieving strong performance on this task. In this paper, we argue that relations are also important for learning semantics, and we explore whether relations between objects, used as a prior, can likewise improve performance. First, we take the annotated relations between objects and use them as tags in an image captioning model to align image and text. We further aim to integrate relations from the text into the image features. To this end, we focus on the masking strategy, replacing random masking with relation masking to study how the training strategy can strengthen the semantic alignment of object relations. In our experiments, we found that incorporating object relations improved captioning performance on common metrics. However, when we changed the masking strategy to target a specific part of the caption during training, the model captured more object relations in an image, but because this removed the randomness of masking, overall performance decreased and the generated relations were often inconsistent with the image content.
Type: Poster at MIRU Symposium (画像の認識・理解シンポジウム)
Date: July 2023
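
The abstract does not give implementation details, so the following is only a minimal Python sketch of the relation-masking idea it describes, under the assumption that "relation masking" means deterministically masking the annotated relation words in a caption instead of masking tokens uniformly at random. The `RELATION_WORDS` set, the token-level matching, and the 15% masking ratio are all hypothetical stand-ins, not the paper's actual method.

```python
import random

# Hypothetical relation vocabulary; in practice the relation words would
# come from the dataset's relation annotations (e.g. a scene graph).
RELATION_WORDS = {"on", "under", "holding", "riding", "next", "behind"}

MASK_TOKEN = "[MASK]"

def random_masking(tokens, ratio=0.15):
    """Standard random masking: each token is masked independently."""
    return [MASK_TOKEN if random.random() < ratio else t for t in tokens]

def relation_masking(tokens, ratio=0.15):
    """Mask the relation words so the model must predict object relations
    from the image features; fall back to random masking when the caption
    contains no annotated relation word."""
    relation_idx = [i for i, t in enumerate(tokens) if t in RELATION_WORDS]
    if not relation_idx:
        return random_masking(tokens, ratio)
    masked = list(tokens)
    for i in relation_idx:
        masked[i] = MASK_TOKEN
    return masked

caption = "a man riding a horse on the beach".split()
print(relation_masking(caption))
# ['a', 'man', '[MASK]', 'a', 'horse', '[MASK]', 'the', 'beach']
```

Note how this sketch also illustrates the trade-off reported in the abstract: always masking the relation words makes the masking deterministic, which is the loss of randomness the abstract associates with the drop in overall performance.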