Investigating the State-of-the-Art Performance and Explainability of Legal Judgment Prediction

In the past decade, deep learning models have achieved impressive performance on a wide range of tasks. However, they still face challenges in many high-stakes problems. In this paper we study Legal Judgment Prediction (LJP), an important high-stakes task that uses fact descriptions obtained from court cases to predict final judgments. We investigate the state of the art of the LJP task by leveraging a recent deep learning model, the longformer, and demonstrate that we obtain state-of-the-art performance, even with a limited amount of training data, benefiting from the advantages of pretraining and the long-sequence modeling capability of the longformer. However, our analyses suggest that the improvement is due to the model's fitting to spurious correlations, in which the model makes correct decisions based on information irrelevant to the task itself. We advocate that serious caution be exercised when explaining the obtained results. The second challenge in many high-stakes problems is the interpretability required of models. The final predictions made by deep learning models are useful only if the evidence supporting the models' decisions is consistent with that used by subject-matter experts. We demonstrate that, by using post-hoc interpretation, the conventional method XGBoost is capable of providing explainable results with performance comparable to the longformer model, while not being subject to the spurious-correlation issue. We hope our work contributes to the line of research on understanding the advantages and limitations of deep learning for high-stakes problems.


Introduction
Legal Judgment Prediction (LJP) is an important task that utilizes fact descriptions obtained from court cases to predict the final outcome. The development of these models is crucial as they can reduce the time legal professionals spend determining the outcome of an ongoing case [1]. Alternatively, such models may be used by these same professionals to reinforce their opinion on a decision to be made. By analyzing the bias that may have been present, practitioners in the legal field can identify decisions that may previously have been made due to other commonly present factors, such as nationality or gender [2]. Such analysis is also essential for selecting a smaller number of cases that generalize well and contain an acceptable amount of human bias for further annotation by professionals, since annotation is expensive and time-consuming. Additional insights into model behaviour are essential in high-stakes decision-making fields such as healthcare, finance, and law. For this reason, we are interested not only in analyzing the predictions made by models and obtaining models with high performance, but also in interpretability. Previous works that utilize fact descriptions to predict final outcomes under binary classification and multi-label classification settings [3] mainly focus on performance, and interpretability is largely ignored. Few works [4] tackle the challenge of interpretability for legal judgment prediction, and those that do mainly focus on datasets in the Chinese language.
In this paper, we obtain state-of-the-art results by using a transformer-based model called the longformer [5], which is capable of modelling long sequences. To the best of our knowledge, we are the first to utilize the longformer for the LJP task. Previous models such as BERT [6] were restricted to modelling sequence lengths of 512 tokens. The task we tackle in the legal domain is the high-stakes decision-making problem of LJP, which comprises binary classification for determining the presence of a violation of a human rights article. Our analyses suggest that the improvement in the models' predictive capabilities relies on spurious correlations. This is echoed in our results, as state-of-the-art performance is obtainable even with a limited amount of training data, i.e., under sparse-data conditions. In other words, we must be wary of transformer-based models producing state-of-the-art results in the legal judgment task, as those results may be obtained for the wrong reasons. By making use of post-hoc interpretation methods such as LIME [7], we address the interpretability requirement of high-stakes decision-making. We exhibit the capabilities of the conventional XGBoost model [8] in providing explainable results with performance similar to the best results produced by the longformer [5].

Related Work
Language modelling architectures have been created for modelling long sequences [5], which are particularly useful for legal tasks, as court cases consist of long documents. Most current legal document classification works utilize hierarchical architectures [3], while other works merely point out the long-document challenge [9]. In this work, we use a recently introduced transformer-based model for modelling long sequences, called the longformer [5].
Previous works have also dealt with explaining specific model predictions, which can be seen as an important goal of interpretability research. However, it is non-trivial to explain why a deep non-linear neural network model makes a certain decision, a task categorized under the umbrella of ante-hoc methods [10]. Because of this uncertainty, and because ante-hoc methods do not allow a cross-comparison between the models chosen in this paper, we refrain from utilizing them. Post-hoc interpretability, on the other hand, is the application of interpretation methods to an already trained model. We utilize LIME [7] and Layer Integrated Gradients [11] to obtain insights into the decision-making path from input text to the classification outputs of our models. In the legal field, [12] highlights the importance of using such methods, although their application has largely been unexplored. In this work, we explore the application of these post-hoc interpretation methods to the task of legal judgment prediction.

Dataset
We utilize the ECHR dataset [3] for our experiments. The dataset is split into training, validation, and test sets. The training and validation sets are roughly balanced, while the test set is imbalanced, with 66% of cases containing violations. The ground truth corresponds to the presence or absence of violations in cases. Following the original work [3], we also utilize an anonymized version of the dataset to identify tokens that may be influencing predictions based on demographic or factual information. This setting simply makes use of a named entity recognizer to replace tokens related to location, gender, etc. with type tags [3].
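The anonymization step can be illustrated with a minimal sketch. Note that this is a toy lexicon-based stand-in written for illustration only; the actual anonymized split was produced with a named entity recognizer [3], and the tag names and word lists below are our own assumptions.

```python
# Toy stand-in for NER-based anonymization: demographic/factual tokens are
# replaced with type tags. The tag set and lexicon here are illustrative,
# not those used to build the anonymized ECHR split.
TYPE_LEXICON = {
    "LOC": {"turkey", "russia", "london"},
    "PER": {"mr", "mrs", "ms"},
}

def anonymize(text):
    out = []
    for tok in text.split():
        key = tok.lower().strip(".,")  # normalize for lexicon lookup
        tag = next((t for t, words in TYPE_LEXICON.items() if key in words), None)
        out.append(f"<{tag}>" if tag else tok)
    return " ".join(out)
```

A real pipeline would replace the lexicon lookup with the NER model's entity labels, but the effect on the input text is the same: the model can no longer condition on the specific location or person mentioned.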

Methodology
We combine different models and interpretation methods to identify the best-performing pair in terms of both performance and explainability. We emphasize that in this paper we treat both the models and the interpretability methods as objects of study.

XGBoost: This model is an ensemble of weak decision-tree predictors [8]. In addition to its performance, XGBoost is highly interpretable, making it an ideal candidate for our experiments.

Longformer: This recently introduced transformer-based model achieves state-of-the-art results on long sequences. It makes use of a linearly scalable attention mechanism, which combines a local windowed attention with a task-motivated global attention [5]. Given its performance on long sequences, we select this model to gain insights into the attributes influencing its decisions.

LIME: Local Interpretable Model-agnostic Explanations (LIME) [7] is a widely used method. It learns a sparse linear model around perturbed versions of an instance, which we take to be our test case. The idea behind fitting a simpler model is that a faithful global interpretation may be difficult to obtain, whereas a local one is simpler. In this paper, we use LIME for both XGBoost and the longformer.

Layer Integrated Gradients: A strong axiomatic attribution method [11] for identifying the input features that contribute to a prediction of a deep neural network. We use this method on the longformer to gain additional insights into model behaviour.
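The core idea of LIME — perturb the instance, query the black box, fit a sparse local surrogate — can be sketched in a few lines. This is a simplified illustration, not the `lime` library used in our experiments: it uses a plain least-squares fit rather than LIME's weighted sparse regression, and `predict_fn` stands in for any black-box classifier (XGBoost or the longformer in our setting).

```python
import random
import numpy as np

def lime_explain(text, predict_fn, num_samples=200, seed=0):
    """Minimal LIME-style sketch: perturb tokens of one instance, query the
    black-box classifier, and fit a local linear surrogate whose
    coefficients serve as per-token attributions."""
    rng = random.Random(seed)
    tokens = text.split()
    n = len(tokens)
    masks, texts = [], []
    for _ in range(num_samples):
        # Randomly drop each token with probability 0.5.
        mask = [int(rng.random() > 0.5) for _ in range(n)]
        masks.append(mask)
        texts.append(" ".join(t for t, m in zip(tokens, mask) if m))
    X = np.array(masks, dtype=float)
    y = np.array(predict_fn(texts), dtype=float)
    # Least-squares linear surrogate with an intercept column.
    design = np.hstack([X, np.ones((num_samples, 1))])
    w, *_ = np.linalg.lstsq(design, y, rcond=None)
    # Tokens sorted by attribution magnitude.
    return sorted(zip(tokens, w[:n]), key=lambda p: -abs(p[1]))
```

With a toy classifier that fires only when the word "violation" is present, the surrogate correctly assigns that token the dominant positive weight, which is exactly the behaviour we rely on when reading LIME outputs in the Results section.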

Experimental Setup
For the XGBoost model, we utilize TF-IDF vectors with an n-gram range of (1, 5), following the hyperparameter values set in the original paper [3]. For interpretation, we use LIME [7] out-of-the-box with default parameters, which produces 10 features from 5000 samples.
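To make the feature setup concrete, the following toy sketch shows what an n-gram range of (1, 5) means for the TF-IDF features fed to XGBoost. In practice one would use a library vectorizer (e.g. scikit-learn's TfidfVectorizer with ngram_range=(1, 5)); this pure-Python version is only an assumption-free illustration of the featurization.

```python
from collections import Counter
import math

def word_ngrams(text, n_min=1, n_max=5):
    """All word n-grams of length n_min..n_max, matching the (1, 5) range."""
    toks = text.lower().split()
    return [" ".join(toks[i:i + n])
            for n in range(n_min, n_max + 1)
            for i in range(len(toks) - n + 1)]

def tfidf(docs, n_min=1, n_max=5):
    """Toy TF-IDF over word n-grams: term frequency times log(N / df)."""
    grams = [Counter(word_ngrams(d, n_min, n_max)) for d in docs]
    df = Counter(g for c in grams for g in c)  # document frequency
    n_docs = len(docs)
    return [{g: tf * math.log(n_docs / df[g]) for g, tf in c.items()}
            for c in grams]
```

N-grams that occur in every document receive zero weight under this idf, while case-specific phrases are up-weighted, which is what makes the resulting XGBoost feature importances readable.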
For the longformer model, we run experiments with a batch size of 1, a learning rate of 2e-5, and 10 epochs on two RTX 2080Ti GPUs. Training time for 100% of the data is ≈ 8 hours. For interpreting the transformer-based model with LIME, we set the number of features calculated to 5 and the number of samples to 50. For the Layer Integrated Gradients [11] method, we set the number of attribution steps to 500.
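The role of the attribution-steps parameter can be seen in a minimal integrated-gradients sketch: attributions are the input-minus-baseline difference scaled by the average gradient along a straight path between the two, approximated with a Riemann sum over the chosen number of steps. This illustrates the idea only; in our experiments the layer variant is applied to the longformer's embedding layer via a library implementation, and `grad_fn` below is a hypothetical stand-in for the model's gradient.

```python
import numpy as np

def integrated_gradients(x, baseline, grad_fn, steps=500):
    """Integrated-gradients sketch: (x - baseline) times the mean gradient
    along the straight path from baseline to x, using a midpoint Riemann
    sum with `steps` evaluations (we use 500 in our experiments)."""
    alphas = (np.arange(steps) + 0.5) / steps      # midpoints in (0, 1)
    path = baseline + alphas[:, None] * (x - baseline)
    grads = np.array([grad_fn(p) for p in path])   # gradient at each point
    return (x - baseline) * grads.mean(axis=0)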

Results and Discussion
Results on both the non-anonymized and anonymized datasets are shown in Table 1. The majority-class results are identical to those reported in [3]. Under the non-anonymized setting, our XGBoost model achieves stable, high performance comparable to that of our transformer-based model in the same setting. The longformer, being capable of modelling longer sequences, consistently achieves better results, as expected. We attain state-of-the-art results by a margin of ≈ 1% in F1 score when utilizing as little as 10% of the training data for the longformer model. However, we must be wary of results produced with such small amounts of training data, as the performance may be a direct result of spurious correlations.
For the anonymized setting, the XGBoost model consistently outperforms the longformer across all subsample sizes considered. Again, without utilizing the entire training dataset, the best results for both models are obtained using ≥ 50% of the training data. Under this setting, the results produced by the XGBoost model remain stable as the amount of training data varies, and the model consistently outperforms the longformer. This indicates that the high performance of the longformer in the non-anonymized setting may have been due to tokens such as gender and location, which anonymization replaces with type tags. This points directly to decisions being made for incorrect reasons, i.e., the well-known problem of spurious correlations. The performance of the XGBoost model, in contrast, is largely unaffected by anonymization.

Further insights into the decisions made by transformer-based models, and a comparison with the conventional model, can be drawn from interpretation methods that perturb inputs and treat the models themselves as black boxes. We use LIME [7], a post-hoc interpretation method, for the XGBoost model. We observe from Figure 2 that the most commonly occurring tokens accounting for the predictions are relatively unchanged, in terms of impact on probability, with respect to the percentage of data utilized. This suggests that, for the XGBoost model under the non-anonymized dataset, a limited amount of data may suffice to make the predictions, and to make them for the right reasons. For the longformer model, we observe that post-hoc interpretation methods such as LIME [7] and Layer Integrated Gradients [11], when used out-of-the-box, highlight the importance of sentence beginnings, punctuation, and numbers.
In terms of interpretability, this may suggest that the model makes correct predictions for incorrect reasons. Alternatively, invalid reasons may be picked up because of the computational resource constraints of the methods required to interpret large transformer-based models. Comparing this with the longformer's results under the anonymized setting in Table 1, we observe that the state-of-the-art results obtained when utilizing 10% of the training dataset may mostly be due to the model's confidence in a select number of tokens that consistently appear in texts leading to violations.

Conclusion
In this paper, we produce state-of-the-art results on a binary classification task of legal judgment prediction for identifying violations in English under the non-anonymized data setting, using the longformer model. We observe that a limited number of training samples is sufficient to achieve these results. Furthermore, under this setting, the XGBoost model achieves comparable performance. We also experiment with the anonymized data setting and conclude that conventional XGBoost models can outperform transformer-based models when the model is unable to learn from bias-inducing tokens, such as nationality and gender. We conclude that the XGBoost model is largely unaffected by such bias, whereas the longformer is prone to learning from these bias-inducing tokens. This can be observed in the longformer's reduced performance under the anonymized setting compared with the non-anonymized one. Finally, experimenting with post-hoc interpretation methods, we observe that the predictions made by the XGBoost model offer, at a minimum, transparency, whereas the interpretations obtained for the transformer-based model indicate that it may have been making predictions for invalid reasons (spurious correlations).