Protein sequence analysis is a challenging bioinformatics problem that spans areas and applications such as sequence annotation, metagenomics, and comparative genomics. Recent proteomics studies report that machine learning techniques outperform conventional alignment-based and alignment-free methods for analyzing protein sequences. However, these techniques depend on handcrafted features, often extracted from large-scale data sets, whose design may require domain knowledge in addition to analytics expertise. In this study, by leveraging a deep language model designed for proteins together with transfer learning, we propose an explainable, high-performing deep few-shot Siamese network for the protein family classification task. To the best of our knowledge, this is the first explainable deep network tailored for primary sequence family classification that can perform well with a very limited number of observations. We are currently running intensive experiments, both quantitative and clinical, to validate the proposed network. We plan to release the network and findings publicly once the validation process is complete.
Article ID: 2023GL6
Publisher: Canadian Artificial Intelligence Association