An Explainable Deep Few-shot Network for Protein Family Classification

Protein sequence analysis is a challenging bioinformatics problem spanning areas and applications such as sequence annotation, metagenomics, and comparative genomics. Recent proteomics studies report that machine learning techniques outperform conventional alignment-based and alignment-free methods for analyzing protein sequences. However, these machine learning techniques depend on handcrafted features, often extracted from large-scale data sets, which may require domain knowledge in addition to analytics expertise. In this study, by leveraging a deep language model designed for proteins together with transfer learning, we propose an explainable, high-performing deep few-shot Siamese network for the protein family classification task. To the best of our knowledge, this is the first explainable deep network tailored for primary sequence family classification that performs well with a very limited number of observations. We are currently running intensive quantitative and clinical experiments to validate the proposed network, and we plan to release the network and findings publicly once the validation process is complete.


Introduction
Proteins are essential for biological processes, such as the functionality, structure, and regulation of the body's tissues. They play a central role in functions ranging from enzymatic catalysis to transporting molecules from one organ to another. Understanding the unknown properties of proteins, and their functions based on measurable features, is crucial for disease research, precision medicine, and therapeutics. Proteins are built from 20 different amino acids forming a one-dimensional sequence, called the primary protein sequence. When a new protein is discovered, primary sequence family classification seeks to assign it, based on its properties, to a previously labeled family of proteins with similar function and general behavior. Given the emergence of sequencing technologies and the resulting large-scale protein databases with unknown properties, protein family classification is an open problem in the bioinformatics research area [1].
Recent advances in computer science and digital technologies have opened new gates to researchers in various scientific domains [2]. Bioinformatics, as an intermediary research field, takes advantage of these advancements, from conventional machine learning methods to large language models and biostatistics [3]. Conventional alignment-based and alignment-free techniques such as K-mer and Kmacs are commonly used for protein family classification, but they require parameters that are computationally expensive or difficult to estimate. Moreover, in the classification step, they depend on heuristic methods that can lead to a rough approximation of alignment distances. The latter limitation has been offset via the training step in machine learning models [4]. Although machine learning techniques such as Random Forest, Support Vector Machines, and K-Nearest Neighbors have been utilized for protein family classification, they depend on domain experts to generate features, which can be a time-consuming and challenging task [5].
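To make the alignment-free baseline concrete, the sketch below shows a minimal k-mer feature extractor of the kind used by such methods: it turns a protein sequence of any length into a fixed-size frequency vector over the 20-letter amino-acid alphabet. The function name and the choice of k = 2 are illustrative assumptions, not the paper's implementation.

```python
from collections import Counter
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard amino acids

def kmer_features(sequence: str, k: int = 2) -> list[float]:
    """Return a normalized k-mer frequency vector for a protein sequence.

    The vector has one entry per possible k-mer over the 20-letter
    alphabet (20**k entries), so it is alignment-free and has the same
    length regardless of how long the input sequence is.
    """
    counts = Counter(sequence[i:i + k] for i in range(len(sequence) - k + 1))
    total = max(sum(counts.values()), 1)  # avoid division by zero
    vocab = ("".join(p) for p in product(AMINO_ACIDS, repeat=k))
    return [counts[kmer] / total for kmer in vocab]

# Example: dipeptide (k=2) frequencies of a short toy sequence
vec = kmer_features("MKVLAAK", k=2)
```

Even this toy version illustrates the parameter-choice problem noted above: the vector length grows as 20^k, so k must be picked carefully before any classifier sees the data.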
Deep learning (DL) algorithms have shown promising results in proteomics; however, their application is limited by the need for massive training data sets. Since the required data comes from experiments, it can be highly complex or incomplete. Moreover, optimizing parameters in DL models can be time-consuming and costly. Addressing the black box problem and understanding what is happening behind a complex DL network is also crucial, especially in bioinformatics research [3]. As an alternative, techniques adopted from computer vision, such as meta-learning models, can generalize from a few observations. To rank the similarity between inputs, these networks employ a unique ranking structure that does not require extensive training. To address the mentioned limitations, in this research we plan to design and implement an explainable deep few-shot learning network for protein family classification.

Data and Methodology
We use the UniProtKB/Swiss-Prot dataset [6]. First, we retrieved the reviewed protein sequences, yielding 569,213 records (as of March 2023) of various lengths [5]. Unlike other sequence classification networks, we do not filter out records by sequence length, ensuring the model is insensitive to the length of the sequence. Our few-shot deep Siamese neural network contains two identical ProtBert [8] models as pre-trained transformers. Siamese neural networks are metric-based meta-learning models that can generalize from a few observations, also known as shots. The Siamese network is given a pair of proteins as input, and the ProtBert model produces embeddings of both protein sequences. ProtBert, proposed by Elnaggar et al. (2021) and employed as the embedding model in our proposed architecture, is pre-trained on 2.1 billion protein sequences [8]. This transformer is based on the BERT [9] model, which originated in the natural language processing (NLP) domain. The output vector of the embedding network is fed to fully connected layers that produce the classification output. To interpret the network and assess its trustworthiness, we plan to use Local Interpretable Model-agnostic Explanations (LIME), which provides insight into how the classifier makes its predictions by highlighting the features the model relies on. This explanatory framework helps make the model more interpretable for end users [10].
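The twin-branch design described above can be sketched as follows. This is a minimal PyTorch illustration under stated assumptions: the shared encoder here is a small embedding-bag module standing in for the pre-trained ProtBert transformer, and the class name, layer sizes, and distance-based head are illustrative choices, not the paper's exact implementation.

```python
import torch
import torch.nn as nn

class SiameseProteinNet(nn.Module):
    """Sketch of a few-shot Siamese architecture for protein pairs.

    Both branches share a single embedding model; in the proposed
    network this role is played by ProtBert, replaced here by a small
    mean-pooling encoder (an assumption for illustration only).
    """

    def __init__(self, vocab_size: int = 25, embed_dim: int = 64):
        super().__init__()
        # Placeholder for ProtBert: maps token IDs to one sequence vector.
        self.encoder = nn.EmbeddingBag(vocab_size, embed_dim, mode="mean")
        # Fully connected head scoring the paired embeddings.
        self.head = nn.Sequential(
            nn.Linear(embed_dim, 32), nn.ReLU(), nn.Linear(32, 1)
        )

    def forward(self, seq_a: torch.Tensor, seq_b: torch.Tensor) -> torch.Tensor:
        emb_a = self.encoder(seq_a)  # identical weights are applied
        emb_b = self.encoder(seq_b)  # to both inputs of the pair
        # The element-wise embedding distance feeds the classifier head.
        score = self.head(torch.abs(emb_a - emb_b))
        return torch.sigmoid(score)  # probability the pair is same-family

# Two toy tokenized sequences (batch of 1) run through the twin branches.
model = SiameseProteinNet()
a = torch.randint(0, 25, (1, 12))
b = torch.randint(0, 25, (1, 12))
prob = model(a, b)
```

The key property the sketch preserves is weight sharing: because one encoder embeds both sequences, the network learns a similarity metric over pairs rather than per-class decision boundaries, which is what allows generalization from a few shots.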
We validate our architecture using an unreviewed dataset from UniProt as unseen data, restricted to the human organism category. In addition, we plan to test the proposed solution on a clinical dataset. The goal of our deep few-shot learning architecture is to classify unseen primary protein sequences with high performance from a very limited number of observations. The effectiveness of our architecture will be compared against three different baseline architectures.