Industrial applications can benefit significantly from pretrained Transformer-based language models, whose efficient exploitation of textual content provides a competitive edge to many processes. However, some specialized corpora and problems require additional handling to adapt adequately to Transformer models. Our application involves one such problem, as long-term dependencies in long longitudinal sequences of specialized texts require careful modeling. This paper proposes LongiBERT, a classification model that relies on a BERT-like pretrained Transformer language model trained with our novel Same File Prediction task. This pretraining objective captures repeated elements in a longitudinal text sequence. We evaluate it on the detection of costly insurance claims, a binary classification task using the private corpus of a major Canadian insurer. Our study indicates that our proposed model and pretraining objective yield more stable performance and outperform RoBERTa's robust MLM training approach for modeling long-term dependencies.
Article ID: 2024L1
Month: May
Year: 2024
Address: Online
Venue: The 37th Canadian Conference on Artificial Intelligence
Publisher: Canadian Artificial Intelligence Association
URL: https://caiac.pubpub.org/pub/3tx6wizq