Learning to Recover Reasoning Chains for Multi-Hop Question Answering via Cooperative Games

We propose the new problem of learning to recover reasoning chains from weakly supervised signals, i.e., the question-answer pairs. We propose a cooperative game approach to deal with this problem, in which how the evidence passages are selected and how the selected passages are connected are handled by two models that cooperate to select the most confident chains from a large set of candidates (from distant supervision). For evaluation, we created benchmarks based on two multi-hop QA datasets, HotpotQA and MedHop; and hand-labeled reasoning chains for the latter. The experimental results demonstrate the effectiveness of our proposed approach.


Introduction
NLP tasks that require multi-hop reasoning have recently enjoyed rapid progress, especially multi-hop question answering (Ding et al., 2019; Nie et al., 2019; Asai et al., 2019). Advances have benefited from rich annotations of supporting evidence, as in the popular multi-hop QA and relation extraction benchmarks, e.g., HotpotQA (Yang et al., 2018) and DocRED (Yao et al., 2019), where the evidence sentences for the reasoning process were labeled by human annotators.
Such evidence annotations are crucial for modern model training, since they provide finer-grained supervision that better guides model learning. Furthermore, they allow model training in a pipeline fashion, with each step, such as passage ranking and answer extraction, trained as a supervised learning sub-task. This is crucial from a practical perspective for reducing memory usage when handling a large number of inputs with advanced, large pre-trained models (Peters et al., 2018; Radford et al., 2018; Devlin et al., 2019).
Manual evidence annotation is expensive, so only a few benchmarks have supporting evidence annotated. Even for these datasets, the structures of the annotations are still limited, as new model designs keep emerging and may require different forms of evidence annotations. As a result, the supervision from these datasets can still be insufficient for training accurate models.
Taking question answering with multi-hop reasoning as an example, annotating only supporting passages is not sufficient to show the reasoning process, due to the lack of necessary structural information (Figure 1). One example is the order of the annotated evidence, which is crucial in logical reasoning and whose importance has also been demonstrated in text-based QA (Wang et al., 2019). Another example is how the annotated evidence pieces are connected, which requires at least the definition of arguments, such as a linking entity, concept, or event. Such information has proved useful in the recently popular entity-centric methods (De Cao et al., 2019; Kundu et al., 2019; Xiao et al., 2019; Godbole et al., 2019; Ding et al., 2019; Asai et al., 2019) and would intuitively benefit these methods if available.
We propose a cooperative game approach to recovering the reasoning chains with the aforementioned necessary structural information for multi-hop QA. Each recovered chain corresponds to a list of ordered passages, and each pair of adjacent passages is connected by a linking entity. Specifically, we start with a model, the Ranker, which selects a sequence of passages arriving at the answer, with the restriction that each adjacent passage pair shares at least one entity. This is essentially an unsupervised task, and the selection suffers from noise and ambiguity. We therefore introduce another model, the Reasoner, which predicts the exact linking entity that points to the next passage. The two models play a cooperative game and are rewarded when they find a consistent chain. In this way, we restrict the selection to satisfy not only the format constraints (i.e., ordered passages with connected adjacent pairs) but also the semantic constraints (i.e., finding the next passage given the partial selection can be effectively modeled by a Reasoner). Therefore, the selection can be less noisy.
We evaluate the proposed method on datasets with different properties, i.e., HotpotQA and MedHop (Welbl et al., 2018), to cover cases with both 2-hop and 3-hop reasoning. We created labeled reasoning chains for both datasets.[1] Experimental results demonstrate the significant advantage of our proposed approach.

Task Definition
Reasoning Chains Examples of reasoning chains in HotpotQA and MedHop are shown in Figure 1. Formally, we aim at recovering the reasoning chain in the form of (p_1, e_{1,2}, p_2, e_{2,3}, ..., e_{n-1,n}, p_n), where each p_i is a passage and each e_{i,i+1} is an entity that connects p_i and p_{i+1}, i.e., appears in both passages. The last passage p_n in the chain contains the correct answer. We say p_i connects e_{i-1,i} and e_{i,i+1} in the sense that it describes a relationship between the two entities.

Our Task Given a QA pair (q, a) and all its candidate passages P, we can extract all possible candidate chains that satisfy the conditions above, denoted as C. The goal of reasoning chain recovery is to extract the correct chains from all the candidates, given q, a, and P as inputs.

Related Work Although there is recent interest in predicting reasoning chains for multi-hop QA (Ding et al., 2019; Chen et al., 2019; Asai et al., 2019), these works all consider a fully supervised setting, i.e., annotated reasoning chains are available. Our work is the first to recover reasoning chains in a more general unsupervised setting.

[1] We will release our code and labeled evaluation data.
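The candidate-chain extraction above can be sketched for the 2-hop case as follows. This is a minimal illustration, not the paper's implementation: the passage schema (a dict with a "text" string and an "entities" set) and the function name are assumptions for the example.

```python
from itertools import product

def extract_candidate_chains(question_entities, answer, passages):
    """Enumerate 2-hop candidate chains (p_h, e, p_t) from a passage pool.

    Hypothetical schema: `passages` maps a passage id to a dict with a
    "text" string and an "entities" set.
    """
    # Head passages must contain an entity from the question.
    heads = [pid for pid, p in passages.items() if p["entities"] & question_entities]
    # Tail passages must contain the answer.
    tails = [pid for pid, p in passages.items() if answer in p["text"]]
    chains = []
    for ph, pt in product(heads, tails):
        if ph == pt:
            continue
        # Adjacent passages must share at least one linking entity.
        for e in sorted(passages[ph]["entities"] & passages[pt]["entities"]):
            chains.append((ph, e, pt))
    return chains
```

The set C of candidates grows with the number of shared entities, which is why the denoising described in Section 3 is needed.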

Method
The task of recovering reasoning chains is essentially an unsupervised problem, as we have no access to annotated reasoning chains. Therefore, we resort to the noisy training signal from chains obtained by distant supervision. We first propose a conditional selection model that optimizes the passage selection by considering passage order (Section 3.1). We then propose a cooperative Reasoner-Ranker game (Section 3.2) in which the Reasoner recovers the linking entities that point to the next passage. This enhancement encourages the Ranker to select chains whose distribution is easier for a linking-entity prediction model (the Reasoner) to capture. It therefore enables our model to denoise the supervision signals while recovering chains with entity information. Figure 2 gives our overall framework, with a flow describing how the Reasoner passes additional rewards to the Ranker.

Passage Ranking Model
The key component of our framework is the Ranker model, which is provided with a question q and K passages P = {p_1, p_2, ..., p_K} from a pool of candidates, and outputs a chain of selected passages.
Passage Scoring For each step of the chain, the Ranker estimates a distribution over the selection of each passage. To this end, we first encode the question and passages with a 2-layer bi-directional GRU network, resulting in an encoded question Q = {q_0, q_1, ..., q_N} and H_i = {h_{i,0}, h_{i,1}, ..., h_{i,M_i}} for each passage p_i ∈ P of length M_i. We then use the MatchLSTM model (Wang and Jiang, 2016) to compute the matching score between Q and each H_i and derive the passage-selection distribution P(p_i|q) (see Appendix A for details).

Conditional Selection To model passage dependency along the chain of reasoning, we use a hard selection model that builds a chain incrementally. Provided with the K passages, at each step t the Ranker computes P_t(p_i|Q_{t-1}), i = 0, ..., K, the probability of selecting passage p_i conditioned on the query and the previous-state representation Q_{t-1}. We then sample one passage p_τ^t according to the predicted selection probability.
The first step starts with the original question Q_0. A feed-forward network projects the concatenation of the query encoding and the selected passage's encoding m_{p_τ}^t back to the query space, and the new query Q_{t+1} is used to select the next passage.
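The incremental selection loop can be sketched as below. This is an illustrative stand-in only: dot-product scoring replaces the paper's MatchLSTM scorer, and averaging the query with the selected passage replaces the learned feed-forward projection.

```python
import math
import random

def softmax(xs):
    """Numerically stable softmax over a list of scores."""
    m = max(xs)
    es = [math.exp(x - m) for x in xs]
    s = sum(es)
    return [e / s for e in es]

def select_chain(query_vec, passage_vecs, n_steps, seed=0):
    """Sample an ordered chain of passage indices, one per step.

    Stand-ins (assumptions, not the paper's method): dot-product scoring
    instead of MatchLSTM, and query/passage averaging instead of the
    learned query-update network.
    """
    rng = random.Random(seed)
    q = list(query_vec)
    chain = []
    for _ in range(n_steps):
        scores = [sum(a * b for a, b in zip(q, p)) for p in passage_vecs]
        # Selection distribution P_t(p_i | Q_{t-1}) over passages.
        probs = softmax(scores)
        idx = rng.choices(range(len(passage_vecs)), weights=probs)[0]
        chain.append(idx)
        # Update the query state with the selected passage's representation.
        q = [(a + b) / 2 for a, b in zip(q, passage_vecs[idx])]
    return chain
```

Because selection is a hard (sampled) decision, the model is trained with policy gradient rather than backpropagation through the selection, as described in the next subsection.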

Reward via Distant Supervision
We use policy gradient (Williams, 1992) to optimize our model. As we have no access to annotated reasoning chains during training, the reward comes from distant supervision. Specifically, we reward the Ranker if a selected passage appears as the corresponding part of a distantly supervised chain in C. The model receives an immediate reward at each step of selection.
In this paper we only consider chains consisting of at most three passages (2-hop and 3-hop chains).[2] For the 2-hop cases, our model predicts a chain of two passages from the candidate set C in the form p_h → e → p_t. Each candidate chain satisfies the condition that p_t contains the answer, while p_h and p_t contain a shared entity e. We call p_h the head passage and p_t the tail passage. Let P_T / P_H denote the sets of all tail/head passages from C. Our model receives rewards r_h, r_t according to its selections.

[2] It has been shown that chains of at most 3 hops cover most real-world cases, such as KB reasoning (Xiong et al., 2017; Das et al., 2018).
For the 3-hop cases, we need to select an additional intermediate passage p_m between p_h and p_t. If we rewarded any p_m selection that appears in the middle of a chain in the candidate chain set C, the number of feasible options could be very large. Therefore, we make our model first select the head passage p_h and the tail passage p_t independently, and then select p_m conditioned on (p_h, p_t). We further require that each path in C have a head passage containing an entity from q. The selected p_m is then only rewarded if it appears in a chain in C that starts with p_h and ends with p_t.
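The 3-hop reward scheme can be sketched as follows. The exact reward equations are omitted in this version of the text, so the binary 0/1 reward values below are assumptions; candidate chains are simplified to (head, middle, tail) passage-id triples.

```python
def chain_rewards_3hop(ph, pm, pt, candidate_chains):
    """Distant-supervision rewards for a selected (p_h, p_m, p_t) triple.

    `candidate_chains` plays the role of the set C; binary rewards are an
    illustrative assumption, not the paper's exact reward definition.
    """
    heads = {c[0] for c in candidate_chains}
    tails = {c[2] for c in candidate_chains}
    # Head and tail are rewarded independently for matching some chain in C.
    r_h = 1.0 if ph in heads else 0.0
    r_t = 1.0 if pt in tails else 0.0
    # The middle passage is only rewarded inside a full matching chain,
    # i.e., a chain in C that starts with p_h and ends with p_t.
    r_m = 1.0 if (ph, pm, pt) in candidate_chains else 0.0
    return r_h, r_m, r_t
```

Conditioning r_m on the full (p_h, p_m, p_t) match is what keeps the number of rewarded middle-passage options small.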

Cooperative Reasoner
To alleviate the noise in the distant supervision signal C, in addition to the conditional selection we further propose a cooperative Reasoner model, also implemented with the MatchLSTM architecture (see Appendix A), to predict the linking entity from the selected passages. Intuitively, when the Ranker makes more accurate passage selections, the Reasoner works with less noisy data and thus succeeds more easily. Specifically, the Reasoner learns to extract the linking entity from chains selected by a well-trained Ranker, and it benefits the Ranker's training by providing extra rewards. Taking the 2-hop case as an example, we train the Ranker and Reasoner alternately as a cooperative game:

Reasoner Step: Given the first passage p_t selected by the trained Ranker, the Reasoner predicts the probability of each entity e appearing in p_t. The Reasoner is trained with the cross-entropy loss.

Ranker Step: Given the Reasoner's top-1 predicted linking entity e, the Ranker receives an extra reward at the 2nd selection step if its next selection is connected by e. The extension to the 3-hop case is straightforward; the only difference is that the Reasoner reads both the selected p_h and p_t and outputs two entities. The Ranker receives one extra reward if the Reasoner picks the correct linking entity from p_h, and likewise for p_t.
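The alternating schedule above can be sketched as a training skeleton. The `ranker`/`reasoner` interfaces (`select`, `fit`, `predict`, `update`) are hypothetical stand-ins for the MatchLSTM-based models; this mirrors the alternation and the extra reward, not the paper's exact implementation.

```python
def cooperative_training(ranker, reasoner, batches, n_rounds=2):
    """Alternate Reasoner and Ranker updates, as in the cooperative game.

    `batches` yields (question, passages, gold_entity) triples, where
    `passages` is a list of dicts with an "entities" set; this interface
    is an assumption for illustration.
    """
    log = []
    for _ in range(n_rounds):
        # Reasoner step: learn to predict the linking entity on chains
        # selected by the (frozen) Ranker, via cross-entropy in the paper.
        for q, passages, gold_entity in batches:
            chain = ranker.select(q, passages)
            reasoner.fit(passages[chain[0]], gold_entity)
            log.append("reasoner")
        # Ranker step: policy-gradient update; the second selection earns
        # an extra reward when it is connected by the Reasoner's top-1 entity.
        for q, passages, gold_entity in batches:
            chain = ranker.select(q, passages)
            e = reasoner.predict(passages[chain[0]])
            extra = 1.0 if e in passages[chain[1]]["entities"] else 0.0
            ranker.update(q, chain, extra)
            log.append("ranker")
    return log
```

Each model thus trains against the other's current outputs, so improvements in one reduce the noise seen by the other.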

Settings
Datasets We evaluate our path selection model on HotpotQA bridge-type questions and on the MedHop dataset. In HotpotQA, the entities are preprocessed Wiki anchor-link objects, and in MedHop they are drug/protein database identifiers.
For HotpotQA, two supporting passages are provided along with each question. We ignore the support annotations during training and use them to create ground truth on the development set: following Wang et al. (2019), we determine the order of passages according to whether a passage contains the answer. We discard ambiguous instances.
For MedHop, no evidence is annotated. Therefore, we created a new evaluation dataset by manually annotating the correct paths for part of the development set: we first extract all candidate paths in the form of passage triplets (p_h, p_m, p_t), such that p_h contains the query drug and p_t contains the answer drug, and p_h/p_m and p_m/p_t are connected by shared proteins. We label a chain as positive if all the drug-protein or protein-protein interactions are described in the corresponding passages. Note that the positive paths are not unique for a question.
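The triplet extraction step for MedHop can be sketched as follows. The passage schema (id mapped to {"drugs", "proteins"} sets) and the function name are assumptions for illustration, not the dataset's actual format.

```python
from itertools import product

def extract_3hop_candidates(query_drug, answer_drug, passages):
    """Enumerate (p_h, p_m, p_t) candidate triplets for MedHop-style chains.

    Hypothetical schema: `passages` maps a passage id to a dict with
    "drugs" and "proteins" sets of identifiers.
    """
    heads = [i for i, p in passages.items() if query_drug in p["drugs"]]
    tails = [i for i, p in passages.items() if answer_drug in p["drugs"]]
    triplets = []
    for ph, pm, pt in product(heads, passages, tails):
        if len({ph, pm, pt}) < 3:
            continue
        # Adjacent passages must be connected by shared proteins.
        if (passages[ph]["proteins"] & passages[pm]["proteins"]
                and passages[pm]["proteins"] & passages[pt]["proteins"]):
            triplets.append((ph, pm, pt))
    return triplets
```

Triplets extracted this way are candidates only; the manual labeling step then checks whether each interaction is actually described in the passages.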
During training we select chains from the full passage set P; at inference time we extract chains from the candidate set C (see Section 2).

Baselines and Evaluation Metric We compare our model with (1) a random baseline, which randomly selects a candidate chain from the distant supervision chain set C; and (2) a distantly supervised MatchLSTM, which uses the same base model as ours but scores and selects the passages independently. We use accuracy as our evaluation metric. As HotpotQA does not provide ground-truth linking entities, we only evaluate whether the supporting passages are fully recovered (though our model still outputs full chains). For MedHop we evaluate whether the whole predicted chain is correct. More details can be found in Appendix B. We use GloVe (Pennington et al., 2014) as word embeddings for HotpotQA, and embeddings from Zhang et al. (2019) for MedHop.

Results
HotpotQA We first evaluate on the 2-hop HotpotQA task. Our best-performing model first selects the tail passage p_t and then the head passage p_h, because the number of tail candidates is smaller (∼2 per question). As shown in Table 1, training a ranker with distant supervision performs significantly better than the random baseline, showing that the training process itself has a certain degree of denoising ability to distinguish the more informative signals among distant supervision labels. By introducing the additional inductive bias of passage order, the conditional selection model further improves by a large margin. Finally, our cooperative game gives the best performance, showing that a trained Reasoner can ignore entity links that are irrelevant to the reasoning chain.

Table 2 demonstrates the effect of selection direction, together with the methods' recall on head and tail passages. The latter is evaluated on a subset of bridge-type questions in HotpotQA with no ambiguous support annotations in passage order; i.e., among the two human-labeled supporting passages, only one contains the answer and thus must be the tail. The results show that selecting the tail first performs better. The cooperative game mainly improves the head selection.
MedHop Results in Table 1 show that recovering chains from MedHop is a much harder task: first, the large number of distant supervision chains in C introduces too much noise, so the distantly supervised Ranker improves by only 3%; second, the dependent model leads to no improvement because C is strictly ordered given our data construction. Our cooperative game manages to remain effective and gives further improvement.

Conclusions
In this paper we propose the problem of recovering reasoning chains in multi-hop QA from weak supervision signals. Our model adopts a cooperative game approach in which a Ranker and a Reasoner cooperate to select the most confident chains. Experiments on the HotpotQA and MedHop benchmarks show the effectiveness of the proposed approach.
Figure 2: Model overview. The cooperative Ranker and Reasoner are trained alternately. The Ranker selects a passage p at each step conditioned on the question q and the selection history, and receives reward r1 if p is evidence. Conditioned on q, the Reasoner predicts which entity from p links to the next evidence passage. The Ranker receives an extra reward r2 if its next selection is connected by the entity predicted by the Reasoner. Both q and the answer a are model inputs. While q is fed to the Ranker/Reasoner as input, empirically the best way of using a is for constructing the candidate set and thus computing the reward r1. We omit the flow from q/a for simplicity.
Our setting is a more general unsupervised one, so our approach falls into the direction of denoising over distantly supervised signals. From this perspective, the most relevant studies in the NLP field include Wang et al. (2018) and Min et al. (2019) for evidence identification in open-domain QA, and Lei et al. (2016), Perez et al. (2019), and Yu et al. (2019) for rationale recovery.

Table 1: Reasoning Chain selection results.