Descriptive Image Captioning with Salient Retrieval Priors

Captions are often expected to carry detailed, essential information about images, but current image captioning models tend to play it safe and generate generic captions that are less informative. Cross-modal retrieval is a promising solution, as texts with more details perform better in retrieval. In this work, we first explore two types of salient n-grams, i.e., deletion n-grams and support n-grams, that are critical to cross-modal retrieval, and then incorporate them into the learning objectives to generate more descriptive captions.


Introduction
Image captioning [1,2] is a multimodal problem that lies at the intersection of natural language processing and computer vision. Good captions are expected to have high fidelity to images while being descriptive and fluent. Recent years have seen significant improvements in image captioning performance [1,3,4,5]. However, existing models tend to play it safe and generate generic captions; generating informative captions needs further study.
To assess how informative a caption is, we can design an explicitly discriminative task whose success depends on how accurately the caption describes the visual input. Leveraging cross-modal retrieval [6] is a promising approach to this problem, because more descriptive captions often yield better performance at finding relevant images in text-to-image retrieval and can be more easily retrieved when images are used as queries in image-to-text retrieval.
Inspired by the concept of saliency maps [7], we further probe the cross-modal retrieval model by searching for fine-grained n-grams that contribute to retrieval performance. More specifically, given Figure 1, we seek answers to two questions: (1) Which n-grams would cause a plunge in retrieval performance if removed from the caption (Deletion N-grams)? (2) Which n-grams could significantly support retrieval performance on their own (Support N-grams)? (See Section 3.2 for details.) We then leverage the salient n-grams in training by regularizing the original learning objective (see Section 3.3 for details) in both the cross-entropy (CE) and reinforcement learning (RL) optimization stages, encouraging the generated sentences to cover them.
We perform extensive experiments with the proposed models on two benchmarks: MSCOCO [8] and FLICKR30K [9]. Our contributions are summarized as follows: (1) We investigate two categories of salient n-grams in captions and show how critical they are for cross-modal retrieval. (2) We design novel mechanisms that incorporate the salient n-grams to regularize the learning objective for generating more descriptive captions. (3) Our method achieves significantly better performance than the baselines over a wide range of metrics on two benchmark datasets, and human evaluation further confirms its effectiveness.

Related Work
Image captioning models, which aim to generate visually grounded descriptions for images, often use a CNN or its variants as the image encoder and an RNN as the decoder to generate sentences [2,10,11]. To improve performance on reference-based automatic evaluation metrics, visual attention mechanisms [1,3,4], explicit high-level attribute detection [12,13], reinforcement learning [14], contrastive or adversarial learning [15,16], multi-step decoding [17], and scene graph detection [18,19] have been proposed. The work of [20,21] is most related to ours; it uses a retrieval loss as a reward signal to produce descriptive captions. Unlike these approaches, our method utilizes salient n-grams, which are critical to retrieval performance, to guide caption generation. We build our method on top of these related models to verify its effectiveness.

Cross-Modal Retrieval
We leverage the stacked cross attention network (SCAN) [22] as our cross-modal retrieval model. SCAN takes multi-modal inputs: an image with multiple object features I = {v_1, ..., v_k} and a sentence of contextualized word features C = {c_1, ..., c_n}. The output is a similarity score s(I, C) for the input image-sentence pair. SCAN computes fine-grained latent alignments between each v_i and c_j to derive the image-sentence similarity. We further train the score function s(I, C) with the triplet loss

$$\mathcal{L}_{\mathrm{trip}} = \max\big(0, \alpha - s(I, C) + s(I, \hat{C})\big) + \max\big(0, \alpha - s(I, C) + s(\hat{I}, C)\big), \tag{3.1}$$

where α is the margin between the positive pair s(I, C) and the negatively sampled image-sentence pairs s(I, Ĉ) and s(Î, C).
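To make the objective concrete, here is a minimal PyTorch sketch of the triplet loss above, assuming the similarity scores s(I, C), s(I, Ĉ), and s(Î, C) have already been computed by SCAN; the function and variable names are illustrative and not taken from the SCAN codebase.

```python
import torch

def triplet_loss(s_pos: torch.Tensor,      # s(I, C),    matched pairs,     shape (B,)
                 s_neg_cap: torch.Tensor,  # s(I, C-hat), negative captions, shape (B,)
                 s_neg_img: torch.Tensor,  # s(I-hat, C), negative images,   shape (B,)
                 alpha: float = 0.2) -> torch.Tensor:
    """Margin-based triplet loss over precomputed similarity scores."""
    loss_cap = torch.clamp(alpha - s_pos + s_neg_cap, min=0.0)
    loss_img = torch.clamp(alpha - s_pos + s_neg_img, min=0.0)
    return (loss_cap + loss_img).mean()

# Toy usage with random scores standing in for SCAN outputs.
loss = triplet_loss(torch.rand(4), torch.rand(4), torch.rand(4))
```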

Salient N-grams Searching
To find n-grams critical for retrieval, we propose to search for two types of salient n-grams (n from 1 to 5): Deletion N-grams and Support N-grams. The former are n-grams whose removal would significantly impair retrieval performance; the latter are n-grams that, on their own, maximally support retrieval.
Given any n-gram m in caption C, we can represent C as a union of two partitions, C = C_m ∪ C_\m. We are interested in the image-text similarity when (1) C_m is unobserved, in which case the similarity score can be expressed as s(I, C_\m), and (2) C_\m is unobserved, in which case the score can be written as s(I, C_m). Our objective is to find all m that satisfy either of the following two conditions:

$$s(I, C) - s(I, C_{\setminus m}) \geq \alpha, \tag{3.2}$$
$$s(I, C_m) \geq s(I, C) - \alpha, \tag{3.3}$$

where α is the margin in the cross-modal retrieval model. For simplicity, we denote the set of n-grams satisfying Equation (3.2) as M_DN and that satisfying Equation (3.3) as M_SN.
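The search itself reduces to scoring a caption's sub-sequences against a fixed image. Below is a minimal sketch under the assumption that the retrieval model is wrapped in a `score` callable with the image held fixed; the toy scorer at the bottom is purely illustrative.

```python
from typing import Callable, List, Set, Tuple

Ngram = Tuple[str, ...]

def salient_ngrams(caption: List[str],
                   score: Callable[[List[str]], float],  # s(I, .) with image I fixed
                   alpha: float,
                   max_n: int = 5) -> Tuple[Set[Ngram], Set[Ngram]]:
    """Split contiguous n-grams (n = 1..max_n) into deletion n-grams M_DN
    (Equation 3.2) and support n-grams M_SN (Equation 3.3)."""
    s_full = score(caption)                              # s(I, C)
    m_dn: Set[Ngram] = set()
    m_sn: Set[Ngram] = set()
    for n in range(1, max_n + 1):
        for i in range(len(caption) - n + 1):
            m = tuple(caption[i:i + n])
            c_without = caption[:i] + caption[i + n:]    # C_\m
            if s_full - score(c_without) >= alpha:       # removal causes a plunge
                m_dn.add(m)
            if score(list(m)) >= s_full - alpha:         # n-gram alone supports retrieval
                m_sn.add(m)
    return m_dn, m_sn

# Toy scorer: similarity grows with caption length (illustration only).
dn, sn = salient_ngrams("a dog catches a red frisbee".split(),
                        lambda words: 0.1 * len(words), alpha=0.2)
```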

Image Captioning Training
The goal of image captioning is to train a conditional text generation model p_θ(C | I) on a training dataset (I, C) ∈ D. We optimize the model first with the CE loss and then with the RL loss:

$$\mathcal{L}_{\mathrm{CE}}(\theta) = -\sum_{i=1}^{n} \log p_\theta(c_i \mid c_{<i}, I), \qquad \mathcal{L}_{\mathrm{RL}}(\theta) = -\mathbb{E}_{C' \sim p_\theta}\big[r(C')\big],$$

where r(·) is the sentence-level reward.

Regularized Learning Objectives. We leverage the salient n-grams set M to enhance the CE and RL objectives, respectively. For the CE stage, we reweight the token-level losses:

$$\mathcal{L}_{\mathrm{RCE}}(\theta) = -\sum_{i=1}^{n} w_i \log p_\theta(c_i \mid c_{<i}, I), \tag{3.7}$$

where w_i is a weight for c_i; its value is 1 if c_i is not contained in any m ∈ M, and otherwise a constant λ_c greater than 1. For the RL stage, we regularize the reward:

$$r'(C') = r(C') + \lambda_r \cdot \#(C', M),$$

where #(C', M) is the number of n-grams in C' belonging to M, and λ_r is a balancing weight.
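As a concrete reference, here is a minimal PyTorch sketch of both regularized objectives, assuming the decoder's logits and a boolean mask marking tokens covered by some m ∈ M are available; all names and shapes are illustrative.

```python
from typing import Iterable, Set, Tuple

import torch
import torch.nn.functional as F

def regularized_ce(logits: torch.Tensor,      # (B, T, V) decoder outputs
                   targets: torch.Tensor,     # (B, T)    gold token ids
                   in_salient: torch.Tensor,  # (B, T)    bool, token lies in some m in M
                   lambda_c: float = 1.1) -> torch.Tensor:
    """Token-wise weighted cross entropy (Equation 3.7)."""
    nll = F.cross_entropy(logits.transpose(1, 2), targets, reduction="none")  # (B, T)
    w = torch.where(in_salient, torch.full_like(nll, lambda_c), torch.ones_like(nll))
    return (w * nll).mean()

def regularized_reward(base_reward: float,
                       caption_ngrams: Iterable[Tuple[str, ...]],
                       salient_set: Set[Tuple[str, ...]],
                       lambda_r: float = 0.1) -> float:
    """r'(C') = r(C') + lambda_r * #(C', M) for the RL stage."""
    return base_reward + lambda_r * sum(1 for g in caption_ngrams if g in salient_set)
```

In the RL stage the bonus simply shifts the sampled caption's reward before the policy-gradient update; λ_c and λ_r take the values reported in the implementation details below.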
Experiments

Evaluation Metrics. We evaluate caption quality with SPICE [23], CIDEr-D [24], METEOR [25], ROUGE-L [26], and BLEU [27]. In addition, we evaluate descriptiveness with an extrinsic evaluation: we feed the generated captions to a text-to-image retrieval model [22] and measure retrieval performance with the R@K (K = 1, 5) metric on 1K test-set images [22]. Furthermore, we perform human evaluation for descriptiveness, fluency, and fidelity.

Compared Models. We use AoANet, ATTN, and DISC as baselines. ATTN [14] is an LSTM-based visual attention network that encodes an image into a set of features and decodes tokens using the LSTM state and a weighted average of the image features. AoANet [28] adopts an attention-on-attention module and achieves superior performance on automatic evaluation metrics. We also include the self-retrieval enhanced captioning model DISC [20], which is built upon ATTN, to verify whether our method can further boost descriptiveness. We also report performance when different salient n-gram sets are used, namely M_DN and M_SN; models trained with M_SN are denoted with the suffix +SN (e.g., AoA+SN).

Implementation Details. For a fair comparison, we use the same experimental setup as the compared baselines. The hyper-parameter λ_c is set to 1.1; λ_r is set to 0.1 for MSCOCO and 0.2 for FLICKR30K. Following Up-Down [4], we use the same bottom-up features, extracted by a Faster R-CNN [29] (with a ResNet-101 [30] backbone) fine-tuned on Visual Genome [31], for the AoANet and ATTN baselines on the MSCOCO and FLICKR30K datasets. We set the initial learning rate to 0.0002 in the CE phase and 0.00002 in the RL phase for both baselines. The mini-batch size is 128 for ATTN and 16 for AoANet. The maximum number of training epochs is 25 for CE and 20 for RL optimization, respectively. For sequence generation at inference, we adopt beam search with a beam size of 2.
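For the extrinsic R@K evaluation, a minimal NumPy sketch is shown below, assuming a precomputed 1K x 1K similarity matrix from the retrieval model in which generated caption i is matched to image i; this mirrors the standard protocol but is not the exact evaluation code.

```python
import numpy as np

def recall_at_k(sim: np.ndarray, ks=(1, 5)) -> dict:
    """Text-to-image R@K. sim[i, j] = s(image_j, generated_caption_i);
    the ground-truth image for caption i is assumed to be image i."""
    ranks = np.empty(sim.shape[0], dtype=int)
    for i in range(sim.shape[0]):
        order = np.argsort(-sim[i])                 # images sorted by similarity
        ranks[i] = int(np.where(order == i)[0][0])  # rank of the matched image
    return {f"R@{k}": float((ranks < k).mean()) for k in ks}

# Toy usage with random similarities for a 1K-image test split.
print(recall_at_k(np.random.rand(1000, 1000)))
```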

Results and Analysis
Overall Performance. Table 1 reports the overall results; our method outperforms the baselines on both benchmarks across the evaluation metrics.

Performance with Fewer Training Data. The salient n-grams provide a greater benefit when less training data is available. Since the training set of the MSCOCO standard split is three fourths the size of the Karpathy split, AoA+SN achieves a considerably higher relative improvement there, raising CIDEr-D and R@1 by 0.9% and 2.4%, respectively, compared to 0.6% and 1.5% on the Karpathy split. Moreover, the FLICKR30K standard split has only 30% as much training data as the MSCOCO Karpathy split; on it, AoA+SN obtains relative improvements of 8.7% and 1.8% on R@1 and SPICE, respectively, significantly higher than the 1.5% and 0.9% on the MSCOCO Karpathy split.

Types of Salient N-grams. We additionally investigate the four salient n-gram sets. M_SN is much larger than M_DN in both datasets. In MSCOCO, the intersection M_SN ∩ M_DN covers more than half of the n-grams in M_DN, and in FLICKR30K the coverage exceeds 70%.

Hyper-parameters and Ablation Analysis. There are two major hyper-parameters, λ_c and λ_r: (1) λ_c is the weight for a token in the salient n-grams (Equation 3.7). We observe that λ_c has little effect on the final performance after RL optimization when its value lies in [1, 1.15]; outside this range it negatively influences the final performance. However, λ_c has a positive impact on the performance after CE training (see the ablation analysis below). (2) λ_r is the weight for salient n-grams in the regularized reward function. As shown in Figure 2, the model achieves the best performance when λ_r is set to 0.1 for MSCOCO. In addition, we observe that when λ_r is too large, the generated sentences tend to repeat salient n-grams, e.g., "A wooden table with chairs on a wooden table".

Ablation Analysis. We conduct an ablation analysis on FLICKR30K with AoANet, presenting results for combinations of the different training stages. As shown in Figure 3, the regularized CE (RCE) outperforms the basic CE on CIDEr-D and SPICE. This improvement, however, is mitigated after RL optimization.

Human Evaluation. Table 3 shows that ATTN+SN performs better than ATTN on fluency, descriptiveness, and fidelity. Moreover, DISC+SN further improves descriptiveness.

Qualitative Examples. Figure 4 includes two examples comparing our model (ATTN+SN) with ATTN, DISC, and the ground truth. Our model produces captions with more detailed and key content; e.g., it generates details such as "container" and "broccoli" in the first case and "next to a train" in the second.

Conclusions
We propose an effective approach to producing more descriptive captions by using salient n-grams from cross-modal retrieval models. Specifically, we incorporate two types of salient n-grams, i.e., deletion n-grams and support n-grams, into the CE and RL learning objectives. The proposed model outperforms the compared models on two widely used benchmarks across a wide range of metrics and in human evaluation.

Acknowledgement
We would like to thank the anonymous reviewers for their valuable comments. This research is supported by NSERC Discovery Grants and DND Supplement.