Personality Trait Detection using a Hierarchy of Tree-Transformers and a Graph Attention Network

Automatic personality trait detection from a person's writing helps professionals assess an individual's mental health, and helps individuals identify their strengths and weaknesses when making choices about personal improvement, workplace compatibility, and lifestyle. Psychologists have identified a set of personality traits that may be present in an individual's personality. This work classifies the writings of an individual into a subset of these traits. The classifier model comprises a hierarchical structure of tree-transformers and a graph attention network (GAT): the tree-transformers encode the sentences, and the following GAT layer encodes the complete text of an individual's writing. Our model shows a large performance boost over previous works on two benchmark corpora.


Introduction
Artificial intelligence (AI) has become a valuable tool for aiding psychiatrists and healthcare professionals in addressing the growing incidence of mental health-related issues and disorders [1]. This upward trajectory has garnered recent attention, with studies like "Changes in Mental Ill Health and Health-Related Behaviors in Two Cohorts of UK Adolescents" revealing that rates of depression symptoms and self-harm were several times higher in 2015 than in 2005 [2]. In addition, research has examined the effect of social media on mental health, including its impact on adolescents' mental health and the increasing prevalence of teen suicide [3].
The COVID-19 pandemic has exacerbated the rising incidence of mental health concerns. A Kaiser Family Foundation survey reported that individuals have become more distressed and disconnected from their social lives, with nearly 50% of American residents reporting that the pandemic has negatively impacted their mental wellbeing [4,5].
A 2020 Harris Poll [6] shows that social media usage has increased among US adults, with about 50% reporting higher usage during the pandemic. This trend was particularly noticeable among younger age groups, with 60% of those aged 18 to 34, 64% of those aged 35 to 49, and 34% of those aged 65 and older reporting increased social media usage [7].
Personality traits refer to a collection of enduring qualities, rooted in psychological research [8], that define an individual's emotions and actions in a relatively consistent manner. The Big Five personality trait model (also called OCEAN) is the most widely accepted and most commonly used model of personality [9]. OCEAN describes personality along five dimensions: Openness to Experience, Conscientiousness, Extraversion, Agreeableness, and Neuroticism (or, positively keyed, emotional stability) [1]. Another frequently used personality model, the Myers-Briggs Type Indicator (MBTI) [10], categorizes 16 personality types characterized by a combination of four binary categories: Extroversion or Introversion, Sensing or Intuition, Thinking or Feeling, and Judging or Perceiving. These traits play important roles in an individual's future and life outcomes [11,12].
The upsurge in social media activity during the pandemic has resulted in more digital footprints being left behind. These footprints can reveal an individual's personality and emotional traits, as demonstrated by Kosinski et al. [13]. This presents an opportunity to leverage these data to provide tailored support to individuals based on their unique needs, thus turning the pandemic challenge into a potential advantage for mental health care.
Many countries have faced an additional burden on their mental health services due to the COVID-19 pandemic, as highlighted by a survey conducted by the World Health Organization (WHO) [14]. Given the scarcity of mental health service resources and the surge in mental health issues, the rise in social media usage presents a window of opportunity for AI researchers to leverage the resulting digital footprints to aid in diagnosing individuals' mental health concerns.
Prior research has explored the connection between personality traits and mental health disorders. Several studies have shown that neuroticism is a crucial factor in the development of depression and anxiety disorders [15,16]. In addition, studies have found that resilience is inversely correlated with neuroticism and positively associated with conscientiousness and extraversion. Moreover, the positive correlation between openness and resilience is modest but statistically significant [17]. Thus, automatically understanding an individual's personality can have a significant impact on the treatment process for mental health concerns, with the potential to improve treatment outcomes and alleviate the burden on mental health services.
In this study, we have developed two deep-learning models that integrate tree-transformers [18] and graph attention networks [19]. The aim is to generate more nuanced vector representations of statements, preserving their underlying semantics, to facilitate the identification of personality traits through subsequent classification. To evaluate the efficacy of our approach, we have conducted experiments on benchmark corpora in which statements are labeled with one or more personality traits. Our experimental results demonstrate that our models surpass the performance of many state-of-the-art methods.

Related Work
Given the current mismatch between limited mental health resources and demand, automated assistant tools can provide valuable support in diagnosing mental health issues. AI models have shown promise as automated assistants for such services due to their superior performance in personality judgment compared to humans [20]. Numerous studies have effectively utilized machine learning techniques to identify personality traits in social media content [21,22]. Personality traits can be detected from various features, including demographic data and text data (e.g., self-descriptions and content from social media). One early example is Argamon et al.'s [23] model, which used support vector machines (SVMs) and statistical features extracted from functional lexicons to identify personality traits. Farnadi et al. [24] built on the work of Argamon et al. [23] and used SVMs to detect personality traits based on features such as network size, density, and status-update frequency. Zhusupova [25] utilized social media activity and demographic data to detect the personality traits of Twitter users from Portugal.
In recent years, several notable works have employed deep learning models for identifying personality traits. Kalghatgi et al. [26] used neural networks, specifically multilayer perceptrons (MLPs), along with hand-crafted features to detect personality traits. Su et al. [27] employed recurrent neural networks (RNNs) and hidden Markov models (HMMs) to identify personality traits from Chinese Linguistic Inquiry and Word Count (LIWC) annotations extracted from dialogues. Tandera et al. [28] and Sun et al. [29] employed long short-term memory (LSTM) networks and convolutional neural networks (CNNs) to detect personality traits directly from text data collected from Facebook posts. Liu et al. [30] developed a hierarchical structure based on bidirectional recurrent neural networks to learn representations of words and sentences that can predict personality traits from multilingual (English, Spanish, and Italian) statements. Experimenting on 275 LinkedIn profiles, Van de Ven et al. [31] demonstrated that extroversion can be accurately inferred from the self-descriptions in user profiles. Lynn et al. [32] used message-level attention over Facebook posts to analyze users' personality traits. Majumder et al. [33] utilized psycholinguistic features [34] and deep learning models, such as hierarchical CNNs, for automatic personality detection. Gjurković et al. [35] utilized Sentence-BERT [36] over their self-created corpus. Kazameini et al. [37] applied an ensemble of SVMs over BERT embeddings and achieved better performance than other models on the Essays corpus [38] for Big Five trait classification. Mehta et al. [39] experimented with various combinations of BERT-based models and psycholinguistic features, analyzed each feature's impact on trait prediction, and achieved state-of-the-art results on different corpora. A comprehensive analysis of previous models is presented in [40], while a review of perspectives is discussed in [41].
While these models have improved accuracy over time, they face several limitations that hinder their effectiveness in practice. The main issue is that textual meaning is complex: dependencies between distant words and constituency (phrasal) structure both contribute substantially to a sentence's semantics, and purely sequential models cannot capture this information adequately. Moreover, although pre-trained language model-based approaches have achieved state-of-the-art results on the benchmark personality trait classification corpora, these language models can take a maximum of 512 input tokens and therefore see only part of a long statement. This is a definite hindrance for real-life applications of these models as automated assistant tools. Considering these two issues, we have investigated a model which uses tree-transformers to exploit word-level dependency and phrasal information, followed by a graph attention network (GAT) to combine the sentence representations into the full statement representation. By using the tree-transformers and the GAT, this approach preserves the syntactical structure of the sentences while not being restricted by word limits on each sentence or on the text as a whole, unlike Kazameini et al. [37] and Mehta et al. [39].

Methodology
The personality trait classification model is built upon a hierarchical structure consisting of a sentence encoder followed by a full statement encoder. The sentence encoder unit works over the words and generates a vector for each sentence in the text. The statement encoder unit then generates a vector representation of the complete text from the individual sentence vectors. To exploit the syntactical information present in the text, we have experimented with two types of tree-transformers as the sentence encoder: constituency and dependency tree-transformers. To combine the sentence representations into the statement representation, we have utilized a graph attention network (GAT). This section first describes the individual building blocks and then the proposed model as a whole.

Sentence Encoding Module
To analyze an individual's personality traits from text, we need to consider the syntactical structure of the sentences, as the sentence representations play a crucial role in generating the whole statement embedding. To this end, we have investigated two types of tree-structured transformer models, since Tai et al. [42] have shown that tree-structured representations fit text data better than sequential representations. Sequential models cannot adequately capture correlations between distant words or the phrasal structure present in a sentence. Attention [43,44] goes a long way toward solving the problems caused by long distances between related words. However, such models still do not match tree-structured models [18,45], which also take into account the relationships between words and the phrases those words compose.
There are two types of tree-based representations used to convey information about a sentence: constituency and dependency trees. These representations capture different aspects of the sentence's syntax, with constituency trees representing the structure of phrases and dependency trees illustrating the relationships between individual words located at different positions in the sentence. In a study by Ahmed et al. [18], two tree-transformer models have been proposed to make use of this syntactic structure information: a constituency tree-transformer and a dependency tree-transformer. The aim of these models is to carefully examine each sub-tree inside a constituency or dependency tree structure and recursively compute the root of each sub-tree, generating a sentence vector representation at the root of the tree through attentive processing over branches.
In a dependency tree, each node corresponds to a word in the sentence. When traversing a sub-tree of this type, the dependency tree-transformer takes into account the representations of both the parent and child nodes. A constituency tree, on the other hand, has words only at the leaf nodes, and the vectors for non-terminal nodes are computed only after the full traversal of the sub-tree is completed.
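To make the two representations concrete, the snippet below parses one sample sentence into both tree forms. It is a minimal sketch assuming the Stanza package (the Python interface to Stanford's NLP tools) as a convenient stand-in for the Stanford Core NLP parser used in our experimental setup.

```python
# A minimal sketch: obtaining both tree types for one sentence with Stanza,
# assumed here as a stand-in for the Stanford Core NLP parser.
import stanza

# stanza.download("en")  # first run only
nlp = stanza.Pipeline(lang="en",
                      processors="tokenize,pos,lemma,depparse,constituency")

doc = nlp("The quick brown fox jumps over the lazy dog.")
sent = doc.sentences[0]

# Constituency tree: words at the leaves, phrase labels at internal nodes.
print(sent.constituency)
# e.g. (ROOT (S (NP (DT The) (JJ quick) (JJ brown) (NN fox)) (VP ...) ...))

# Dependency tree: every node is a word; edges link heads to dependents.
for word in sent.words:
    head = sent.words[word.head - 1].text if word.head > 0 else "ROOT"
    print(f"{word.text:8s} <--{word.deprel}-- {head}")
```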
Ahmed et al. [18] have enriched the dependency and constituency tree representations of a sentence by using self-attention over the branches, which involves computing query ($Q$), key ($K$), and value ($V$) matrices. These matrices are computed as follows [44]:

$$Q = M\omega^{q}, \qquad K = M\omega^{k}, \qquad V = M\omega^{v} \qquad (3.1\text{--}3.3)$$

To create the matrix $M$ in a dependency tree, the word vectors of all child nodes corresponding to each parent node are concatenated. For a constituency tree, $M$ is formed by concatenating the word vectors within a constituent. The tree-transformer models use the $Q$, $K$, and $V$ matrices to compute the self-attention matrix in the following manner:

$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^{T}}{\sqrt{d_k}}\right)V \qquad (3.4)$$

Here, $d_k$ represents the dimension of the $K$ matrix. To perform multi-branch attention with $n$ branches $B_i$, $n$ sets of the key ($K$), query ($Q$), and value ($V$) matrices are created using the corresponding weight matrices ($\omega_i$), and the scaled dot-product attention of Eq. 3.4 is performed on each branch:

$$B_i = \mathrm{Attention}(Q\omega_i^{q}, K\omega_i^{k}, V\omega_i^{v}), \qquad i = 1, \ldots, n \qquad (3.5)$$
Next, a residual connection is applied to these tensors, followed by layer normalization, and a scaling factor $\mu$ is used to create the branch representation:

$$\bar{B}_i = \mathrm{LayerNorm}(B_i + M) \times \mu \qquad (3.6)$$

In the subsequent step, a position-wise CNN (PCNN) is applied to each $\bar{B}_i$, comprising two convolution operations at each position separated by a ReLU activation function. The PCNN layer operates as shown in Equation 3.7:

$$\mathrm{PCNN}(x) = \mathrm{Conv}\big(\mathrm{ReLU}(\mathrm{Conv}(x))\big) \qquad (3.7)$$

The final attentive representation of the semantic sub-spaces, generated from the PCNN layer, is obtained by carrying out a linear weighted summation (Equation 3.8), where $\gamma \in \mathbb{R}^{n}$ is a trainable parameter of the model:

$$\mathrm{BranchAttn}(Q, K, V) = \sum_{i=1}^{n} \gamma_i \, \mathrm{PCNN}(\bar{B}_i) \qquad (3.8)$$
Finally, a residual connection is created with the output of the BranchAttn layer, followed by the application of a non-linear activation function (tanh), and the parent node representation is calculated by performing element-wise summation (ExS):

$$\chi_{attn} = \mathrm{ExS}\big(\tanh(\mathrm{BranchAttn}(\chi) + \chi)\big) \qquad (3.9)$$

where $\chi$ and $\chi_{attn}$ are the input and output features of the attention calculation module.
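The following PyTorch sketch summarizes how Eqs. 3.1-3.9 fit together for a single branch matrix $M$. It is an illustrative reconstruction under stated assumptions, not the authors' released implementation; all dimensions and module names are assumptions.

```python
# Illustrative reconstruction of multi-branch attention (Eqs. 3.1-3.9) for
# one branch matrix M (rows = child/constituent word vectors). Dimensions
# and module names are assumptions, not the released code.
import math
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiBranchAttention(nn.Module):
    def __init__(self, d_model=300, n_branches=6, d_pcnn=341):
        super().__init__()
        self.n = n_branches
        # One (query, key, value) projection per branch (Eqs. 3.1-3.3, 3.5).
        self.w_q = nn.ModuleList([nn.Linear(d_model, d_model, bias=False) for _ in range(self.n)])
        self.w_k = nn.ModuleList([nn.Linear(d_model, d_model, bias=False) for _ in range(self.n)])
        self.w_v = nn.ModuleList([nn.Linear(d_model, d_model, bias=False) for _ in range(self.n)])
        self.norm = nn.LayerNorm(d_model)
        self.mu = nn.Parameter(torch.ones(self.n))                      # scaling, Eq. 3.6
        self.gamma = nn.Parameter(torch.full((self.n,), 1.0 / self.n))  # weights, Eq. 3.8
        # Position-wise CNN: two 1x1 convolutions separated by ReLU (Eq. 3.7).
        self.pcnn = nn.ModuleList([
            nn.Sequential(nn.Conv1d(d_model, d_pcnn, 1), nn.ReLU(),
                          nn.Dropout(0.1), nn.Conv1d(d_pcnn, d_model, 1))
            for _ in range(self.n)])

    def forward(self, m):                        # m: (branch_len, d_model)
        d_k = m.size(-1)
        out = 0.0
        for i in range(self.n):
            q, k, v = self.w_q[i](m), self.w_k[i](m), self.w_v[i](m)
            attn = F.softmax(q @ k.transpose(-2, -1) / math.sqrt(d_k), -1) @ v  # Eq. 3.4
            b = self.norm(attn + m) * self.mu[i]                                # Eq. 3.6
            b = self.pcnn[i](b.transpose(-2, -1).unsqueeze(0)).squeeze(0).transpose(-2, -1)
            out = out + self.gamma[i] * b                                       # Eq. 3.8
        return torch.tanh(out + m).sum(dim=0)    # residual + tanh + ExS (Eq. 3.9)

# Example: a parent vector computed from four child-node embeddings.
parent = MultiBranchAttention()(torch.randn(4, 300))   # -> shape (300,)
```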

Statement Encoding Module
Once the sentence representations are generated by the sentence encoding module, the graph attention network (GAT) [19] is applied over them to generate the vector representation of the statement. For this work, we have designed the graph $G = \{V, E\}$ such that the sentence nodes of the statement are connected to the statement node $D$. For any statement comprising $n$ sentences, the graph has $n + 1$ nodes ($n$ nodes for the $n$ sentences and one node representing the whole statement from the individual), with $V = \{s_1, s_2, \ldots, s_n, D\}$. Edges are established between node $D$ and the sentence nodes $s_1, s_2, \ldots, s_n$, so the graph $G$ ends up with $n$ edges.
This module updates only the statement node ($D$) using the sentence nodes ($s_1, s_2, \ldots, s_n$). The sentence nodes are initialized with the sentence embeddings generated by the sentence encoding module (see Section 3.1). The GAT layer is formulated as follows:

$$\alpha_{i,j} = \mathrm{softmax}_{j}\Big(\mathrm{LeakyReLU}\big(\omega_a^{T}\,[\omega_q h_i \,\|\, \omega_k h_j]\big)\Big) \qquad (3.10)$$

$$h'_i = \sigma\Big(\sum_{j \in N_i} \alpha_{i,j}\, \omega_v h_j\Big) \qquad (3.11)$$

where $\|$ indicates the concatenation operation. The weight matrices $\omega_a$, $\omega_q$, $\omega_k$, and $\omega_v$ in the GAT layer are updated by back-propagation. The set of neighbouring nodes for a given node $i$ is represented by $N_i$, while the attention score between $h_i$ and $h_j$ is represented by $\alpha_{i,j}$. The GAT layer with multi-head attention, using $M$ attention heads, is expressed as:

$$H_i = \Big\Vert_{m=1}^{M} \sigma\Big(\sum_{j \in N_i} \alpha_{i,j}^{m}\, \omega_v^{m} h_j\Big) \qquad (3.12)$$

This final hidden representation $H_i$ is used as the statement representation vector.
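A minimal sketch of this star-graph update is given below, assuming PyTorch and illustrative module names; how node $D$ is initialized (here, the mean of the sentence vectors) is also an assumption.

```python
# Sketch of the star-graph statement encoder (Eqs. 3.10-3.12): only the
# statement node D is updated from its n sentence-node neighbours.
import torch
import torch.nn as nn
import torch.nn.functional as F

class StatementGAT(nn.Module):
    def __init__(self, d_model=300, n_heads=6):
        super().__init__()
        self.w_q = nn.ModuleList([nn.Linear(d_model, d_model, bias=False) for _ in range(n_heads)])
        self.w_k = nn.ModuleList([nn.Linear(d_model, d_model, bias=False) for _ in range(n_heads)])
        self.w_v = nn.ModuleList([nn.Linear(d_model, d_model, bias=False) for _ in range(n_heads)])
        self.w_a = nn.ModuleList([nn.Linear(2 * d_model, 1, bias=False) for _ in range(n_heads)])
        self.proj = nn.Linear(d_model * n_heads, d_model)  # fold concatenated heads

    def forward(self, sentences, d_node):
        # sentences: (n, d_model) sentence embeddings; d_node: (d_model,)
        heads = []
        for wq, wk, wv, wa in zip(self.w_q, self.w_k, self.w_v, self.w_a):
            q = wq(d_node).expand(sentences.size(0), -1)             # query from D
            k = wk(sentences)                                        # keys from s_j
            score = F.leaky_relu(wa(torch.cat([q, k], dim=-1)))      # Eq. 3.10 logits
            alpha = F.softmax(score, dim=0)                          # softmax over N_D
            heads.append(F.elu((alpha * wv(sentences)).sum(dim=0)))  # Eq. 3.11
        return self.proj(torch.cat(heads, dim=-1))                   # Eq. 3.12

# Example: encode a statement of seven sentences.
sent_vecs = torch.randn(7, 300)           # outputs of the sentence encoder
d_init = sent_vecs.mean(dim=0)            # one plausible initialisation of D
statement_vec = StatementGAT()(sent_vecs, d_init)   # -> shape (300,)
```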

Model Architecture
For each individual's statement, the model first utilizes RoBERTa [46] to generate word embeddings. In our experiments, we have also tried GloVe [47], fastText [48], and BERT [49] word embeddings; however, the best results have been achieved with RoBERTa word embeddings.
The tree-transformers are applied over these word embeddings to generate the sentence embeddings (see Section 3.1). The following statement encoding module then generates the embedding for the whole statement using the GAT (see Section 3.2). This feature vector for the individual's statement is fed to a dense layer followed by a sigmoid classifier, which returns a probability score for each personality trait. We have used the binary cross-entropy loss function to train the model. With $N$ as the total number of considered personality traits, $y_i$ as the original label, and $\mathrm{prob}(y_i)$ as the predicted probability of that particular trait, the binary cross-entropy can be formulated as:

$$\mathcal{L} = -\frac{1}{N}\sum_{i=1}^{N}\Big[y_i \log\big(\mathrm{prob}(y_i)\big) + (1 - y_i)\log\big(1 - \mathrm{prob}(y_i)\big)\Big] \qquad (3.13)$$

The overall model architecture is sketched in Figure 1.
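A small sketch of this classification head follows, under the assumption of a 300-dimensional statement vector and the five Essays traits.

```python
# Sketch of the classification head: a dense layer and a sigmoid produce one
# probability per trait, trained with binary cross-entropy (Eq. 3.13).
# The 300-dimensional statement vector and trait count are assumptions.
import torch
import torch.nn as nn

n_traits = 5                               # e.g. the Big Five for Essays
head = nn.Linear(300, n_traits)            # dense layer over the statement vector

statement_vecs = torch.randn(4, 300)       # a batch of 4 statement embeddings
probs = torch.sigmoid(head(statement_vecs))          # prob(y_i) per trait
targets = torch.tensor([[1., 0., 1., 1., 0.]] * 4)   # binary trait labels
loss = nn.BCELoss()(probs, targets)                  # Eq. 3.13, averaged
preds = (probs > 0.5).float()                        # multi-label decisions
```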

Experimental Setup and Result Analysis
In this section, we report how well our model performs on personality trait classification, using accuracy as the evaluation metric. In the context of personality trait identification, each individual can be assigned multiple personality traits at the same time, as the traits are not mutually exclusive. Therefore, we have formulated personality trait identification as a multi-label classification task, and the performance of the model is assessed on each individual class label, as in the sketch below. This section also provides a concise overview of the benchmark datasets used in the experiments. We conclude by comparing the effectiveness of the proposed model with that of the top-performing previous models.
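```python
# Sketch of the per-trait evaluation: accuracy is computed independently
# for each class label; variable names are illustrative.
import numpy as np

def per_trait_accuracy(y_true: np.ndarray, y_pred: np.ndarray) -> np.ndarray:
    """y_true, y_pred: (n_samples, n_traits) binary matrices."""
    return (y_true == y_pred).mean(axis=0)   # one accuracy per trait column

y_true = np.array([[1, 0, 1], [0, 1, 1], [1, 1, 0]])
y_pred = np.array([[1, 0, 0], [0, 1, 1], [1, 0, 0]])
print(per_trait_accuracy(y_true, y_pred))    # -> [1.0, 0.667, 0.667]
```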

Overview of the Benchmark Corpora
Two publicly available benchmark personality datasets have been used in our analyses: (i) Essays [38] and (ii) Kaggle MBTI [50].

Essays
The stream-of-consciousness dataset, also known as the "Essays" dataset, contains 2468 essays written by students and annotated with binary labels over five personality traits. These binary labels indicate the presence or absence of the Big Five personality traits, which were identified using a standardized self-report questionnaire [38].

Kaggle MBTI
The data in this corpus were accumulated from the PersonalityCafe forum, which hosts a broad range of individuals interacting in an informal online social environment. The dataset consists of 8675 entries, each containing the last 50 posts made by an individual on the website, together with the individual's MBTI personality type. To work with this corpus, we have slightly modified the class labels. The corpus comes with four binary class labels: (i) Extroversion or Introversion, (ii) Sensing or Intuition, (iii) Thinking or Feeling, and (iv) Judging or Perceiving; each entry is labeled with four traits, one from each binary pair. For our experiments, we have tagged entries with 1 and 0 as follows: Extroversion (1) or Introversion (0); Sensing (1) or Intuition (0); Thinking (1) or Feeling (0); and Judging (1) or Perceiving (0).
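A minimal sketch of this label encoding, assuming the four-letter type strings of the Kaggle corpus (e.g., "INTJ"), is shown below.

```python
# Sketch of the MBTI-to-binary label mapping described above, assuming
# four-letter type strings as found in the Kaggle corpus.
def encode_mbti(mbti_type: str) -> list[int]:
    """Map an MBTI code to four binary labels (E/I, S/N, T/F, J/P -> 1/0)."""
    positive = "ESTJ"                    # the letters tagged as 1, per position
    return [int(c == p) for c, p in zip(mbti_type.upper(), positive)]

print(encode_mbti("INTJ"))   # [0, 0, 1, 1]
print(encode_mbti("ESFP"))   # [1, 1, 0, 0]
```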

Experimental Setup
The model uses an initial learning rate of 0.1, reduced by 80% in any iteration in which the validation accuracy decreases from the previous iteration. The batch size is set to 10. The multi-branch attention block consists of six PCNN layers, and six attention branches have been used for the tree-transformers in the sentence encoding module. Following Ahmed et al.'s [18] work, we have deployed each PCNN layer with two CNNs, where the first uses 341-dimensional kernels and no dropout, and the second uses 300-dimensional kernels with a dropout rate of 0.1. The GAT in the statement encoding unit employs six attention heads. The model parameters are trained using the Adagrad [51] optimizer.
Both models use 768-dimensional RoBERTa word embeddings as input, collected by feeding each sentence to the pre-trained RoBERTa model. We have assessed the performance of our models using 10-fold cross-validation, utilizing StratifiedKFold from the scikit-learn package. All experiments have been conducted in an Ubuntu 22.04 LTS environment with an NVIDIA 1080 Ti GPU. To parse the sentences and generate their dependency and constituency tree representations, we have used the Stanford Core NLP parser.
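The following sketch illustrates this training scaffolding; `model` and the data arrays are hypothetical placeholders, and the learning-rate rule is expressed exactly as described above.

```python
# Sketch of the training scaffolding: Adagrad at lr=0.1, an 80% learning-rate
# cut whenever validation accuracy drops, and stratified 10-fold CV.
# `model` and the data arrays are placeholders for the full pipeline/corpora.
import numpy as np
import torch
from sklearn.model_selection import StratifiedKFold

model = torch.nn.Linear(300, 5)            # placeholder for the full model
optimizer = torch.optim.Adagrad(model.parameters(), lr=0.1)

prev_val_acc = 0.0
def maybe_decay_lr(val_acc: float) -> None:
    """Cut the learning rate by 80% if validation accuracy dropped."""
    global prev_val_acc
    if val_acc < prev_val_acc:
        for group in optimizer.param_groups:
            group["lr"] *= 0.2
    prev_val_acc = val_acc

# 10-fold cross-validation; stratification needs a 1-D key, so a single
# label column is shown here for the placeholder data.
X = np.zeros((100, 1))
y = np.random.randint(0, 2, size=100)
for train_idx, test_idx in StratifiedKFold(n_splits=10, shuffle=True,
                                           random_state=42).split(X, y):
    pass  # train with batch size 10 on train_idx, evaluate on test_idx
```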

Performance Analysis
Tables 1 and 2 show the accuracies achieved by our models on the two benchmark corpora, along with the published results of previous notable works. In addition to the accuracies over the whole corpora, accuracies for each individual class are also provided for a better assessment of the improvements.
Looking at Table 1, it is clear that both proposed models outperform the previous works by a margin of 5.4 to 5.8 percentage points on average on the Essays dataset. For individual traits, the margin is 3.4 to 7.1 percentage points. The proposed models perform best when predicting Conscientiousness, where the performance boost is 6.7 to 7.1 percentage points; the lowest gain is for Neuroticism, with a margin of 3.4 to 4.1 percentage points. In all cases except one (Agreeableness), the model using the dependency tree-transformer as the sentence encoding module outperforms the one using the constituency tree-transformer. Table 2 depicts the performance of the models on the Kaggle MBTI corpus, where, on average, our models show a 2.9 to 3.5 percentage point performance boost over the previous models. On the Thinking/Feeling class, the model with the dependency tree-transformer achieves 3.8 percentage points higher accuracy than the previous works, while the model with the constituency tree-transformer gains 3 percentage points. Between the proposed models, the one with the dependency tree-transformer performs better: across all classes, it gains a 0.5 to 0.8 percentage point accuracy boost over the model that parses sentences with the constituency parser.
In our research, conducting a full ablation study is not possible due to the interdependence of the modules in our pipeline. However, we have employed comparative studies to enhance our analysis, as presented in Table 3. To investigate the significance of both the tree-transformers and the GAT, we have conducted two experiments on each dataset. In the first experiment, we have replaced the tree-transformer layer with RoBERTa CLS tokens to generate sentence representations. In the second, we have substituted the GAT layer with a mean-pooling layer over the sentence representations obtained from the tree-transformers. These experiments have allowed us to gauge the contributions of the tree-transformer and GAT components in our model. The outcomes clearly indicate a decline in performance in all of the aforementioned cases. When the tree-transformer layer is replaced, there is a notable drop of 5.3-5.5 percentage points (comparing averages) on the Essays dataset and 4.8-5.4 percentage points (comparing averages) on the MBTI corpus, compared to the performance of our proposed model shown in Tables 1 and 2. These findings provide compelling evidence that preserving syntactical information through tree-structured representations contributes to better semantic preservation in our model. Similar results are observed when the GAT layer is replaced with a mean-pooling layer: the results drop by 6.1-6.3 percentage points (comparing averages) on Essays and 6.5-7.1 percentage points (comparing averages) on MBTI. These findings provide strong evidence that fusing sentence representations with an attentive graph neural network such as GAT generates superior statement representations, which we attribute to GAT's ability to assign varying weights to different sentences within a statement, despite its higher computational cost compared to a mean-pooling layer.
These statistics make it clear that our proposed models surpass the previous works in terms of performance, for two reasons. Firstly, our proposed models can work with the complete text, unlike the BERT-based personality trait classifier models [39,52]. Due to BERT's 512-token limit, those models consider either only the first 512 tokens, only the last 512 tokens, or the first 256 and last 256 tokens, whereas our proposed models can work with sentences and texts of any length. When assessing an individual's personality traits, it is important to consider the person's complete written statement. Secondly, when generating the sentence embeddings, we have utilized tree-structured representations of the sentences, which helps the models incorporate syntactical information and preserve better semantics. By using dependency and constituency tree-transformers, our models can consider word-level dependencies and phrasal information. We have also noticed that the model with the dependency tree-transformer outperforms the model with the constituency tree-transformer. By analyzing the data, we have arrived at the hypothesis that the sentences in the benchmark corpora are reasonably simple, with few phrases used, which is why considering word-level dependencies is more beneficial here. Furthermore, unlike the other models [29,33,37,39], our models do not require any additional psycholinguistic features and still provide better results.

Conclusion
In this paper, we have proposed two models using a hierarchy of tree-transformers and a graph attention network for personality trait identification, and these models have outperformed the previous state-of-the-art models on the Essays and Kaggle MBTI corpora. Analysis of the results also shows that using tree-structured representations for sentence embedding preserves better semantics when encoding whole statements from individuals. Still, there is scope for improvement. Instead of using fixed word embeddings from BERT-based models, we could update the word embeddings as in Wang et al. [54] to improve the performance of the model. Furthermore, as in Kazameini et al. [1], this model could be modified to provide interpretable representations.

Figure 1. Structure of the proposed system for identifying personality traits.

Table 1. Performance analysis of the proposed models along with the other prominent works over the Essays dataset. All the performance scores are accuracy (in %). The best results are presented in bold. Here, CTT means constituency tree-transformer and DTT means dependency tree-transformer. Column headings: O: Openness, C: Conscientiousness, E: Extraversion, A: Agreeableness, N: Neuroticism.

Table 2. Performance analysis of the proposed models along with the other prominent works over the Kaggle MBTI dataset. All the performance scores are accuracy (in %). The best results are presented in bold. Here, CTT means constituency tree-transformer and DTT means dependency tree-transformer. Column headings: I/E: Introversion/Extroversion, S/N: Sensing/Intuition, T/F: Thinking/Feeling, J/P: Judging/Perceiving.

Table 3. Comparative studies of the proposed model with different modules replaced. Row headings: RoBERTa [CLS]: the tree-transformer layer is replaced with RoBERTa CLS tokens; Mean Pooling: the GAT layer is replaced with a mean-pooling layer.