A Simple and Interpretable Predictive Model for Healthcare

Deep learning based models currently dominate most state-of-the-art solutions for disease prediction. Existing works employ RNNs along with multiple levels of attention mechanisms to provide interpretability. These deep learning models, with trainable parameters running into millions, require huge amounts of compute and data to train and deploy. These requirements are sometimes so large that they render the use of such models infeasible. We address these challenges by developing a simpler yet interpretable non-deep learning based model for application to EHR data. We model and showcase our results on the task of predicting the first occurrence of a diagnosis, a task often overlooked in existing works. We push the capabilities of a tree based model and come up with a strong baseline for more sophisticated models. Its performance shows an improvement over deep learning based solutions (both with and without the first-occurrence constraint) while maintaining interpretability.


INTRODUCTION
Deep Learning has taken the world by storm and has become the go-to choice for developing solutions in areas such as image processing [12,23], text processing [21], and even healthcare [8,20]. The use of RNNs (particularly LSTMs [13] and their variants) to model sequential EHR data for disease prediction has been seen in many recent works [14,18]. Recent advancements in the deep learning space through the use of attention [2] have helped add interpretability to these otherwise sophisticated black-box models. RNNs with attention have also been successfully applied to healthcare in several works [6,7,11,27]. Choi et al. [8], in their paper titled "RETAIN: An Interpretable Predictive Model for Healthcare using Reverse Time Attention Mechanism", used a reverse time attention mechanism to achieve good performance while remaining clinically interpretable when applied to Electronic Health Records data. Ma et al. [17] rectified some drawbacks of RETAIN by introducing bidirectional RNNs in place of the normal RNN based approach to capture both the past and future medical experiences of patients. Research and development in this space is possible due to the adoption and availability of EHR data. Several works have highlighted the positive impact of such predictive works on improving the quality of healthcare [4,15,26].
Deep Learning models showcase exemplary performance, at times even outpacing their human counterparts. This boost in performance comes at the cost of compute requirements, training time and volume of training data. In many cases, these costs are prohibitive, both financially and otherwise.
We address these limitations by proposing a simpler yet interpretable tree based predictive model for healthcare. The following were the major motivations behind this work. The first was to test the performance of non-deep learning approaches. We wanted to develop a competitive baseline which can act as a benchmark for current, highly parameterised deep learning approaches. The second was to provide a novel way of preparing sequential EHR data for non-deep learning approaches. Our approach had a significant impact on overall model performance. Third, models such as RETAIN [8] provide an intuitive way of interpreting instance level results. We wanted to apply model agnostic approaches to non-deep learning models and provide similar levels of interpretability. The final motivation was to provide an efficient and easy to deploy alternative to Deep Learning models while providing on-par performance and interpretability.
Our model was tested on multiple EHR datasets, each having at least a 24 month historical timeline. We experimented with different datasets and target diseases to ensure generalisable performance metrics. We also cater to first-occurrence prediction, i.e. predicting the first ever incidence of the diagnosis in consideration. This constraint adds complexity to the prediction task. Our experiments show that our simpler approach improves over deep learning based solutions (both with and without the first-occurrence constraint) while maintaining interpretability.
The rest of the paper is organised as follows. Section 2 details the Data Preparation step, which had a significant impact on the overall model performance. Section 3 describes the overall approach and model choice along with the different experiments and their results. We also provide details on the choice of evaluation metric, and compare performance with deep learning based models such as RETAIN [8] and Dipole [17]. In Section 4 we discuss the need for model interpretability and showcase both instance level interpretability results, using a model agnostic approach called SHAP [16], and global interpretability results for our model. Section 5 presents commentary on the effectiveness of our work in this domain and Section 6 concludes the paper.

DATA PREPARATION
Data Preparation is an important aspect of this work and a major motivation. As mentioned earlier, the aim was to prepare longitudinal EHR data for the first occurrence prediction task. To capture enough historical traits and variability, similar to the works of Choi et al., we use a 24 month historical period for training. For the predictions to be useful and actionable, we use a gap of 3 months between the training period and the first occurrence date of the diagnosis. Table 1 showcases a quick summary of our EHR dataset.
Let us denote a patient as $p$ having a certain history of diagnoses, denoted as $H = \{t_1, t_2, \ldots, t_N\}$, where $t_i$ is one timestep, i.e. a visit, in the patient's history. Each timestep consists of various diagnosis codes (represented as ICD codes), procedure codes (represented as CPT codes), prescriptions (represented as RX codes) and demographic details of the patient, and can be represented as $t_i = \{ICD_{(1..N)}, CPT_{(1..N)}, RX_{(1..N)}, demographics\}$. Assume that the task is to predict the first occurrence of a disease $d$, and that this patient was diagnosed with $d$ at time steps $t_n$ and $t_{n+k}$ (where $k > 0$). The first occurrence of $d$ for $p$ would be time step $t_n$, which would be our target instance. The response variable is_d (for instance is_diabetic) would be set to 1 for such patients and 0 otherwise (i.e. for patients who have never been diagnosed with $d$, for instance diabetes).
Using the above mentioned procedure, we prepare the response variables for our population. We prepare a different dataset for each diagnosis. The feature space consists of ICD codes along with demographic attributes like age and gender. To prepare an aggregated feature vector for each patient $p$, we dissolve the time steps, i.e. the patient vector is represented as $p = [ICD_1, ICD_2, \ldots, ICD_N, age, gender]$, where the value for each $ICD_i$ depends upon the experiment being considered. We experimented with two different versions. The first experiment involved setting the value of each $ICD_i$, $CPT_i$ and $RX_i$ to the number of such diagnoses, procedures or medications. So, for a given patient $p_i$ the feature vector would be $p_i = [n_{i1}, n_{i2}, \ldots, n_{iN}, a_i, g_i, y_i]$, where $n_{ij}$ is the ICD count value for patient $i$ and ICD $j$ (similarly for CPT and RX codes), and $a_i$, $g_i$ and $y_i$ are the age, gender and response value respectively for patient $i$, with $a_i \in (0, \infty)$, $g_i \in \{Male, Female\}$ and $y_i \in \{0, 1\}$. In the second experiment we treated $ICD_i$ as a binary categorical feature. We share details on the model performance for these two experiments in the following section.
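The aggregation described above can be sketched in a few lines of Python. This is a minimal illustration rather than the pipeline used in this work: the visit layout (a list of `(month, codes)` tuples), the function names `first_occurrence_month` and `build_patient_vector`, and the choice of the last observed month as the reference point for patients without the diagnosis are all assumptions made for the example. Only the 24 month window and the 3 month gap come from the text.

```python
from collections import Counter

HISTORY_MONTHS = 24  # observation window, as stated in the text
GAP_MONTHS = 3       # gap between the window and the first occurrence


def first_occurrence_month(visits, target_prefix):
    """Return the month of the earliest visit carrying the target code, else None."""
    months = [month for month, codes in visits
              if any(code.startswith(target_prefix) for code in codes)]
    return min(months) if months else None


def build_patient_vector(visits, target_prefix, end_month, binary=False):
    """Dissolve the time steps into one feature dict plus a 0/1 response.

    visits: list of (month_index, [codes]) tuples (hypothetical layout).
    end_month: reference month for patients without the diagnosis
               (an assumption; the text does not specify this).
    """
    onset = first_occurrence_month(visits, target_prefix)
    label = 1 if onset is not None else 0
    reference = onset if onset is not None else end_month
    window_end = reference - GAP_MONTHS
    window_start = window_end - HISTORY_MONTHS

    # Count every code seen inside the observation window.
    counts = Counter()
    for month, codes in visits:
        if window_start <= month < window_end:
            counts.update(codes)

    if binary:
        # Second experiment: presence/absence instead of counts.
        return {code: 1 for code in counts}, label
    return dict(counts), label
```

Demographic attributes such as age and gender would be appended to the returned dictionary before fitting a model; they are left out here to keep the sketch short.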

EXPERIMENTAL SETUP
Deep learning models are very effective in a majority of tasks. Before the widespread usage of such models, tree based models [3,5,19] were the go-to choice. The major reasons behind the popularity of tree based models were their low bias, robustness against outliers, ease of interpretability and speed of training and inference. These were also our motivations for choosing XGBoost [5] as our model. Its being battle-tested in different scenarios such as production use-cases, academia and ML competitions further reinforced our decision. Global interpretability is another factor which contributes to understanding model behaviour. We also utilised model-agnostic instance level feature interpreters such as LIME [24] and SHAP [16]. These instance level approaches were used to identify contributing features for each patient. Figure 1 shows our overall approach in a flow diagram.

Evaluation Metric
Disease incidence is usually very low in observed patient samples. This low incidence leads to issues associated with class imbalance for classification models. Accuracy as a measure is not helpful in such scenarios and leads to models biased towards the majority class. We use Receiver Operating Characteristic-Area Under Curve, or ROC-AUC [10], as our metric of choice. ROC is a probability curve that describes the performance of the model at different thresholds, while the area under the curve denotes the degree of separability between classes. ROC-AUC is a robust measure for imbalanced datasets. We also measure Recall@K as an additional metric and define it as follows: given the predicted probability scores across all the observations binned into deciles, Recall@30 is the percentage of true cases for which the predicted probability falls in the top 3 deciles.
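Both metrics can be written down directly from their definitions. The sketch below is illustrative only: the function names `roc_auc` and `recall_at_k` are hypothetical, the pairwise ROC-AUC computation is quadratic and meant for small examples rather than full datasets, and ties at the Recall@K cutoff are broken by input order, which is an assumption.

```python
def roc_auc(y_true, scores):
    """ROC-AUC as the probability that a random positive outranks a random negative.

    Ties between a positive and a negative score count as half a win.
    """
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("both classes are required")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def recall_at_k(y_true, scores, k_percent=30):
    """Share of true cases found in the top k_percent of observations by score."""
    total_pos = sum(y_true)
    if total_pos == 0:
        raise ValueError("at least one positive case is required")
    ranked = sorted(zip(scores, y_true), key=lambda pair: pair[0], reverse=True)
    cutoff = round(len(ranked) * k_percent / 100)
    hits = sum(y for _, y in ranked[:cutoff])
    return hits / total_pos
```

With `k_percent=30` the second function corresponds to Recall@30 as defined above: it reports how many of the true cases land in the top three deciles of predicted scores.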

Experiments and Results
We performed various experiments to understand model performance using ROC-AUC as our metric of choice. The aim is to prepare a classifier to identify the first occurrence of a disease in the dataset. Since our focus is a classification task (first occurrence of a diagnosis for diabetes, heart failure and kidney failure respectively), we split the dataset into three parts: train, validation and test. Our datasets are highly imbalanced. The class imbalance stands at (class 1:class 0 = 12:88) for diabetes, (class 1:class 0 = 15:85) for heart failure and (class 1:class 0 = 14:86) for kidney failure. Stratified sampling was performed while creating these splits. As a first step we fit a logistic regression model on our datasets. This was done to have a baseline in place and to understand the relative strength of each of the models. Since this was a binary classification task, we could directly fit a logistic regression model. The models achieved an ROC-AUC of 0.711, 0.754 and 0.731 for diabetes, heart failure and kidney failure respectively. This is quite a decent performance given the simplicity of the model. The models where we utilised counts of features rather than binary categorisation achieved better performance throughout. For the rest of this section we will refer to the count based feature set as our primary dataset unless stated otherwise.
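A stratified three-way split of the kind mentioned above can be sketched as follows. The function name `stratified_split`, the 70/15/15 fractions and the fixed seed are assumptions for illustration; the split ratios used in this work are not stated in the text.

```python
import random


def stratified_split(labels, fractions=(0.7, 0.15, 0.15), seed=0):
    """Return (train, validation, test) index lists that preserve the class ratio.

    The default fractions are an assumption, not values taken from the text.
    """
    rng = random.Random(seed)

    # Group row indices by class so each class is split separately.
    by_class = {}
    for index, label in enumerate(labels):
        by_class.setdefault(label, []).append(index)

    train, validation, test = [], [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        n_train = round(len(indices) * fractions[0])
        n_val = round(len(indices) * fractions[1])
        train += indices[:n_train]
        validation += indices[n_train:n_train + n_val]
        test += indices[n_train + n_val:]  # remainder goes to the test set
    return train, validation, test
```

Splitting each class separately keeps the minority class share (for example 12% for diabetes) roughly the same in all three parts, which matters when the positive class is this small.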
Moving ahead with this baseline, the next experiment involved fitting an XGBoost model with default parameters. XGBoost is a tree based boosting algorithm with numerous hyper-parameters such as learning rate, number of estimators, regularisation parameters and so on. We denote this XGBoost model with default parameters as xgb_def henceforth. The XGBoost models with default settings for diabetes, heart failure and kidney failure resulted in ROC-AUC values of 0.78, 0.837 and 0.823 respectively, which is a good improvement over the logistic regression baseline.

Hyper-parameter Tuning
As mentioned earlier, XGBoost has a host of hyper-parameters available for fine-tuning. Since our aim was to push the boundaries of non-deep learning models, the logical next step was to fine-tune the xgb_def model. One way of identifying the right values for each of the hyper-parameters is to perform a greedy search. We did not proceed with the usual grid search due to the sheer size of the hyper-parameter search space. A grid search would have consumed too much time and effort.
The greedy search paradigm works as follows:
• Use the xgb_def model as the base for this greedy search
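Only the first step of the procedure is listed here, so the following sketch shows one common reading of a greedy search and should not be taken as the exact algorithm used in this work: starting from the default configuration, tune one hyper-parameter at a time, keep the best value found, and move on to the next. The function name `greedy_search`, the `evaluate` callback and the order in which parameters are visited are assumptions. In practice `evaluate` would train an XGBoost model with the given parameters and return its validation ROC-AUC.

```python
def greedy_search(base_params, search_space, evaluate):
    """Tune one hyper-parameter at a time, keeping the best value found so far.

    base_params: starting configuration (the role played by xgb_def).
    search_space: dict mapping parameter name to candidate values,
                  visited in insertion order (an assumed ordering).
    evaluate: hypothetical callback returning a score to maximise,
              for instance validation ROC-AUC.
    """
    best_params = dict(base_params)
    best_score = evaluate(best_params)
    for name, candidates in search_space.items():
        for value in candidates:
            trial = dict(best_params, **{name: value})
            score = evaluate(trial)
            if score > best_score:
                # Lock in the improvement before moving to the next value.
                best_params, best_score = trial, score
    return best_params, best_score
```

The appeal over a grid search is the cost: the number of model fits grows with the sum of the candidate counts per parameter rather than their product. The trade-off is that interactions between parameters are only explored along the path the search happens to take.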

Comparison with Recent works
Works by Choi et al. [8] and Ma et al. [17] utilise complex attention mechanisms to showcase improvements in their results. These works compare their results against weak baselines only. They also overlook experiments concerning first occurrence prediction. We believe this additional constraint is an important one for the models to be useful in real life use cases. We also observed that performance (across models) tends to improve drastically if this constraint is removed. This makes intuitive sense since, for many diagnoses, a repeat occurrence is quite obvious. From a business and healthcare standpoint, it makes sense to predict the first occurrence so that preventive or corrective action can be taken in time.
To provide a common framework, a competitive baseline and useful constraints, we trained RETAIN [8] and Dipole [17] on our datasets, preparing the data in the formats they expect, and performed hyper-parameter tuning to report the best results on our test dataset.
We observed significant improvements in ROC-AUC values for both RETAIN and Dipole as compared to the xgb_def and logistic regression baselines. This was expected as both these models are highly parameterised and complex implementations. Both works also report improvements over logistic regression in their respective papers. The surprising aspect was the comparison with our fine-tuned XGBoost model, i.e. xgb_opt. Our proposed model shows considerable improvements over RETAIN and Dipole for all three target diseases. The results are showcased in Table 3 for reference.

Experiments with Full ICD Feature Set
The experiments and results outlined in the previous section utilised ICD codes truncated to 3 characters (or ICD3 for short). [...] same to only 250, which refers to the class of diabetic diagnoses. By doing so, we reduce the overall dimensionality of our already sparse feature set.
Table 3: Comparison of results of our proposed method with recent papers
Even though such a grouping is helpful in reducing the impact of a sparse feature set, it leads to a loss of understanding and interpretability. To enable better and more granular interpretation, we experimented with complete ICD codes, or ICD-Full. The dataset was prepared as described in the Data Preparation section, the only difference being that the feature set consists of ICD-Full codes while the targets are still ICD3. This was done to ensure we have enough training samples for each class. This new dataset was used to train and tune RETAIN, Dipole and XGBoost for comparison. As in the previous experiments, XGBoost outperformed its more sophisticated competitors on the ROC-AUC metric. Results are shared in Table 4 for reference. We attribute the improvement in performance across models to the added granularity in the feature set while maintaining a similar class distribution. The results were cross validated to ensure model stability.

INTERPRETABILITY
Interpretability is an important factor when it comes to use cases such as disease prediction. Typically there is a trade-off between model performance and its interpretability. Most Deep Learning models are highly complex and are often treated as black boxes. To overcome these limitations, works by Choi et al. [8] and Ma et al. [17] utilise attention mechanisms.
Since our fine-tuned XGBoost model, xgb_opt, is not a deep learning model, interpretability at the instance level had to be solved in a different way. For global (or dataset) level feature importance, tree based algorithms are a go-to choice. The XGBClassifier [1] also provides similar functionality out of the box. Important features for xgb_opt for first occurrence prediction of diabetes are reported as ICD_I10 (Hypertension), ICD_R73 (Elevated blood glucose levels), etc., which are in line with factors leading to diabetes. The Heart Failure task using ICD-Full as the feature set resulted in top 5 features including ICD5939 (Unspecified disorder of kidney and ureter) and ICD7931 (Nonspecific (abnormal) findings on radiological and other examination of lung field). Figure 3 presents the top 5 features in detail for each of the target diseases.

Patient Diagnosis Interpretability
xgb_opt outperforms its deep learning counterparts on the task of first occurrence prediction while remaining globally interpretable. One downside of XGBoost is its inability to provide instance level or, in this case, patient level interpretability. To handle this scenario, we leverage a model agnostic approach by Lundberg and Lee [16].
This approach mimics the behaviour outlined in the works of Choi et al. Their work explains the theoretical motivations and working in detail. To better understand the impact on our work, let us work through a randomly sampled patient from our test dataset for diabetes. We use xgb_opt to predict the first occurrence probability of this patient being diabetic. This particular patient is predicted to be diabetic with a probability score of 0.979 (the ground truth for this patient was observed to be 1). XGBoost is supported by the SHAP framework out of the box. Upon analysing this instance using SHAP, we observe the following. Features such as age, LAB_4548-4_H (a diagnostic test for Haemoglobin A1c), RX_841 (diabetes testing supplies) and so on have positive SHAP values. Positive SHAP values move the logit value (classification decision) of the classifier from approximately −3.0 to 3.85. These are the past diagnoses, events, lab tests or prescriptions which lead the model to assign a high probability (0.979) to this patient being diabetic three months down the line. The same is visually showcased for heart failure and kidney failure in Figure 4 for reference.
Similar to the attention based interpretation plots showcased in RETAIN [8], we leverage SHAP values and force plots to provide patient level interpretability for our work. A similar exercise can be performed using LIME [24] to understand instance level feature importance.
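The relation between the logit and the reported probability can be checked with a few lines of arithmetic. SHAP values for a tree model are additive in logit space: the base value plus the sum of the per-feature values gives the model's logit, and the sigmoid maps it to a probability. In the sketch below only the base value (about −3.0), the final logit (3.85) and the probability (0.979) come from the text; the individual per-feature values are made up so that they sum to the implied shift of 6.85.

```python
import math


def sigmoid(logit):
    """Map a log-odds value to a probability."""
    return 1.0 / (1.0 + math.exp(-logit))


base_value = -3.0  # approximate starting logit reported in the text

# Hypothetical per-feature SHAP values; only their total (6.85) is implied
# by the text, the split across features is invented for illustration.
shap_values = {"age": 1.2, "LAB_4548-4_H": 3.4, "RX_841": 2.25}

logit = base_value + sum(shap_values.values())
probability = sigmoid(logit)

print(f"logit = {logit:.2f}, probability = {probability:.3f}")
print(f"probability at the base value = {sigmoid(base_value):.3f}")
```

A logit of 3.85 corresponds to a probability of about 0.979, matching the score reported for this patient, while the base value of −3.0 corresponds to roughly 0.047.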

Known Limitations
One known limitation of XGBoost models as compared to their deep learning counterparts [8,17] is the lack of visit-level importance. In the data preparation step, we outlined the fact that the aggregated feature vector does not include the time aspect of a patient's history. We dissolve the time steps while preparing the patient vector p.
While sequence to sequence based deep learning models can provide time-step level interpretability, our model does not have such capability out of the box. To handle this scenario, we present a simple workaround. We firstly narrow down the top most important features at the patient level. The next step is to identify visits which had these features present. We can then mark such visits as important in identifying the final diagnosis and also provide physicians with supplementary information regarding such a decision.
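The workaround can be sketched as below. This is an illustrative reading of the steps above, not the implementation used in this work: the function name `important_visits`, the visit layout, and the choice to rank features by the absolute magnitude of their SHAP value are assumptions.

```python
def important_visits(visits, shap_values, top_n=3):
    """Flag visits that contain any of the patient's top-N features.

    visits: list of (visit_id, [codes]) tuples (hypothetical layout).
    shap_values: dict mapping feature name to its SHAP value for this patient.
    Features are ranked by absolute SHAP value (an assumed ranking rule).
    """
    ranked = sorted(shap_values, key=lambda f: abs(shap_values[f]), reverse=True)
    top_features = ranked[:top_n]

    flagged = []
    for visit_id, codes in visits:
        matched = [feature for feature in top_features if feature in codes]
        if matched:
            flagged.append((visit_id, matched))
    return flagged
```

The returned list pairs each flagged visit with the features that triggered it, which is the kind of supplementary information that could be shown to a physician alongside the prediction.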

EFFECTIVENESS OF TREE BASED APPROACHES
Our experiments outline the effective predictive performance of XGBoost based models as compared to their sophisticated deep learning counterparts. Despite having far fewer parameters, our optimised versions were able to outperform attention based architectures such as RETAIN [8] and Dipole [17]. Such a strong baseline can be attributed to two main aspects of our experimental setup. The first and foremost is the domain and its data. We shared our results and corresponding interpretations with medical professionals. The experts were able to verify the results and the interpretations from a random sample of our test sets. They also highlighted the importance of the sequential/longitudinal nature of electronic health records. Though time is an important factor in a number of diagnoses (such as Alzheimer's), not all diagnoses are time dependent, especially the target diseases in our experiments. They highlighted the fact that even though past diagnoses impact current and future health states, the time gap is not always an important factor. This goes hand in hand with our results and the fact that a simpler model outperforms complex ones. It also highlights a gap in the data recording process. Each diagnosis in an EHR dataset is associated with a visit to a doctor or medical professional, and its date is not the actual date of incidence. Thus, the time information in an EHR dataset depends upon when a particular person visits a medical facility for diagnosis. This might include delays due to personal factors such as seriousness of symptoms, access to healthcare, pain tolerance and so on. Such variability between incidence and reporting requires more study and experiments.
The second aspect is from the algorithmic standpoint. Despite successful application across various domains and data types (mostly unstructured), deep learning is yet to make a mark when it comes to tabular or structured datasets. Tree based ensembles, especially XGBoost and its variants, dominate this space [22]. Real world datasets are typically high dimensional yet sparse in nature. In other words, they can be represented in a lower dimensional space easily (say a hyperplane). This process is termed unfolding or manifold learning. Tree based boosting algorithms are highly efficient for manifold learning with hyperplane boundaries (a characteristic of tabular datasets) [9]. Another reason behind the better performance of tree based ensembles over their deep learning counterparts is their ease and speed of training. Deep Learning models are over-parameterised and, even though they are termed universal function approximators, finding the optimal set of parameters is not a trivial task. They require far more training samples and time as compared to traditional methods [25].

CONCLUSION
We presented a simple and interpretable predictive model for disease prediction. Sophisticated and complex deep learning models are the focus of research work in the disease prediction domain. Choi et al. [8] present an attention based approach to prepare an interpretable disease prediction model. Their work and similar ones present comparisons with weak baselines, mostly using logistic regression. The focus of this work was to push the capabilities of a tree based non-deep learning model and come up with a strong baseline for more sophisticated models. We presented a novel data preparation pipeline which was observed to have a positive impact on overall model performance. We used ROC-AUC [10] as our evaluation metric, given that the datasets in consideration are highly skewed. Our work outlined different experiments and a simple algorithm to fine-tune the XGBoost model for performance. We compared the performance of our work with that of RETAIN [8] and Dipole [17]. It was surprising to observe that our fine-tuned model outperformed these deep learning solutions by a good margin. This was despite the fact that both deep learning implementations were fine-tuned with respect to the datasets in consideration. We also presented strategies to interpret our model at both global and instance levels. The instance level interpretation utilised the SHAP framework by Lundberg and Lee. SHAP values help us understand patient level feature importance. We also discussed the limitation of our model in identifying visit level importance, and closed by providing a simple workaround for this known limitation. We leveraged the XGBoost implementation by Chen and Guestrin [5] to prepare our models.