Captions are often expected to carry detailed, essential information about images, but current image captioning models tend to play it safe and generate generic captions that are less informative. Cross-modal retrieval is a promising solution, as texts with more details perform better in retrieval. In this
work, we first explore two types of salient n-grams, i.e., Support N-grams (SN) and Deletion N-grams (DN), in captions that significantly affect the performance of typical cross-modal retrieval models. We further exploit these n-grams to enhance the original learning objectives for generating descriptive captions with more details. Experiments on two benchmark datasets show that our proposed model significantly outperforms baselines across a wide range of metrics.
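As a toy illustration of one plausible reading of these categories (the paper's exact definitions are not given here), the sketch below ablates each n-gram from a caption and checks how a hypothetical cross-modal similarity function `score(image, text)` changes: n-grams whose removal hurts retrieval are treated as support n-grams, and those whose removal helps are treated as deletion n-grams. All names and the threshold are illustrative assumptions, not the authors' method.

```python
def salient_ngrams(caption, image, score, n=2, threshold=0.1):
    """Classify each n-gram in `caption` by the retrieval-score change
    caused by deleting it. `score(image, text)` is a hypothetical
    cross-modal similarity function; higher means a better match."""
    tokens = caption.split()
    base = score(image, " ".join(tokens))
    support, deletion = [], []
    for i in range(len(tokens) - n + 1):
        ablated = tokens[:i] + tokens[i + n:]
        delta = base - score(image, " ".join(ablated))
        gram = " ".join(tokens[i:i + n])
        if delta > threshold:        # removing it hurts retrieval
            support.append(gram)
        elif delta < -threshold:     # removing it helps retrieval
            deletion.append(gram)
    return support, deletion
```

With a toy scorer that rewards image keywords and penalizes a distractor word, "red car" comes out as a support bigram while a misleading bigram falls into the deletion set.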
Article ID: 2021S23
Publisher: Canadian Artificial Intelligence Association