Ultra Efficient Transfer Learning with Meta Update for Continuous EEG Classification Across Subjects

The pattern of Electroencephalogram (EEG) signals differs significantly across subjects, which poses challenges for EEG classifiers in terms of 1) effectively adapting a learned classifier to a new subject, and 2) retaining knowledge of known subjects after the adaptation. We propose an efficient transfer learning method, named Meta UPdate Strategy (MUPS-EEG), for continuous EEG classification across subjects. The model learns effective representations with a meta update, which accelerates adaptation to a new subject and mitigates forgetting of knowledge about previous subjects at the same time. The proposed mechanism originates from meta learning and works to 1) find a feature representation that is broadly suitable for different subjects, and 2) maximize the sensitivity of the loss function for fast adaptation to a new subject. The method can be applied to all deep learning oriented models. Extensive experiments on two public datasets demonstrate the effectiveness of the proposed model, which outperforms the current state of the art by a large margin in terms of both adapting to a new subject and retaining knowledge of learned subjects.


Introduction
Electroencephalogram (EEG) signals are widely used to analyze the activity of the human brain. The signal is recorded by placing electrodes on different regions of the human scalp while the subject performs executive/imaginary tasks or perceives outside stimuli. EEG signals have proved effective for restoring motion capabilities of disabled people [1], interpreting human intention [2], and enhancing the experience of gaming control [3].
EEG signals exhibit significant pattern variability across subjects, resulting in two major challenges for EEG classifiers: 1) achieving good performance on new, previously unseen users, and 2) retaining knowledge of previously learnt subjects after the adaptation. We propose to tackle both challenges simultaneously with the Meta UPdate Strategy (MUPS-EEG), which involves two steps: (1) extracting versatile features that are effective across different subjects with meta learned representations, and (2) performing a meta update for fast adaptation to a new subject. The meta update mechanism significantly reduces the amount of labeled target data needed to adapt to the target subject, and the meta learned representations help preserve learned knowledge of previous subjects. This facilitates the use of BCI systems in real world scenarios with constant shifts between different subjects.
For the extraction of versatile and subject invariant features, previous works adopt either signal processing techniques or deep learning models. For example, [4] utilized filter bank (FB) and common spatial pattern (CSP) methods for effective feature extraction, with the features then sent to a Fisher linear discriminant (FLD). [5] extracted features from the power spectral density (PSD) of EEG signals and used support vector machines (SVM) as the classifier. Models based on deep learning have emerged as a promising approach, as they alleviate the need for manual feature engineering and achieve state of the art performance. EEGNet [6] is a compact convolutional neural network (CNN) that can be applied to different BCI paradigms. [7] introduced a cascade and parallel structure on CNN for improved performance. CRAM [8], proposed recently, adopts an LSTM with an attention mechanism to help the model focus on the most discriminative temporal features, and achieved promising results.
Transfer learning techniques are utilized in EEG classifiers to transfer models to a target subject for improved performance. Previous works involve both classic transfer learning [9][10] and domain adaptation [11][12] to transfer knowledge across subjects. [13] proposed an inter-subject transfer learning framework built on top of a CNN model. [11] and [12] explored the performance of multiple domain adaptation methods, including transfer component analysis (TCA-EEG), maximum independence domain adaptation (MIDA-EEG), and information theoretical learning (ITL), for emotion recognition. Deep-Transfer [14] is a transfer learning framework built on a deep CNN-LSTM network to transfer knowledge across subjects. RA-MDRM [15] utilizes covariance matrices from different subjects to form a calibration-less system suitable for low resource scenarios.
In this work, we propose a simple and computationally efficient meta update strategy for cross subject EEG classification, which is applicable to all deep learning oriented classifiers. It allows the EEG classifier to adapt to a new subject using only a small amount of target data. Furthermore, the model mitigates the forgetting that often occurs when transferring a deep learning model to a new context. This Meta UPdate Strategy (MUPS-EEG) originates from meta learning [16][17]. It involves a meta representation learning phase followed by meta adaptation to the target subject. Meta representation learning is performed on the known source subjects and extracts versatile features that are effective across different subjects, while meta adaptation fits the model to a new subject through a small number of gradient steps without losing knowledge of known subjects. A desirable property of the model is that it does not overfit even when target data is very limited, allowing it to function properly in low target-resource scenarios.

Methodology
MUPS-EEG allows efficient adaptation to a new subject while simultaneously retaining knowledge of known subjects. Its meta learned representations are broadly effective across different subjects, and meta adaptation fits the model to a new subject with efficient target data usage. The difference between MUPS-EEG and classic transfer learning lies in both the optimization process and the training mechanism.
For traditional optimization, weights are sequentially updated after each time step, seeking sensible parameters with

Θ* = argmax_Θ log p(Θ | D_s, D_t),  (2.1)

where Θ is the collection of model parameters, D_s is the training data from the source subjects, and D_t is the small amount of data from the target subject.
MUPS-EEG decomposes the problem into two steps by setting up meta parameters Φ. Given

p(Θ | D_s, D_t) = ∫ p(Θ | Φ, D_t) p(Φ | D_s) dΦ,  (2.2)

maximizing the log likelihood is approximated by first finding meta parameters that maximize log p(Φ | D_s),

Φ* = argmax_Φ log p(Φ | D_s),  (2.3)

and then approximating eq. 2.1 as

Θ* ≈ argmax_Θ log p(Θ | Φ*, D_t).  (2.4)

The mechanism can thus be interpreted as helping the model learn a prior of transferable knowledge about the subjects. This prior is later used to infer the posterior parameters of the network after the model sees a small amount of data from the new subject. The prior learned during meta training acts as an inductive bias for minimizing the generalization error during evaluation, which allows the EEG classifier to function properly on the new subject.
During meta training, MUPS-EEG involves interaction between a base learner and a meta learner, each formed by a representation learning network and a prediction learning network. The representation learning network extracts effective features from the raw EEG signal, which are then fed to the prediction learning network for classification. Both the representation learning network and the prediction learning network can be arbitrary deep learning models.
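To make the two-part learner concrete, the sketch below shows one possible PyTorch realization of a representation network over raw EEG plus a fully connected prediction network. The layer sizes, kernel shapes, and class names here are illustrative assumptions, not the paper's exact architecture (which is an EEGNet-like CNN, see Implementation Details):

```python
import torch
import torch.nn as nn

class RepresentationNet(nn.Module):
    """Hypothetical CNN feature extractor over raw EEG of shape
    (batch, 1, channels, time); sizes are placeholders."""
    def __init__(self, n_channels=22):
        super().__init__()
        self.conv = nn.Sequential(
            # temporal convolution across the time axis
            nn.Conv2d(1, 8, kernel_size=(1, 64), padding=(0, 32)),
            nn.BatchNorm2d(8),
            # spatial convolution collapsing the electrode axis
            nn.Conv2d(8, 16, kernel_size=(n_channels, 1)),
            nn.BatchNorm2d(16),
            nn.ELU(),
            nn.AdaptiveAvgPool2d((1, 4)),
        )
        self.out_dim = 16 * 4

    def forward(self, x):
        return self.conv(x).flatten(1)

class PredictionNet(nn.Module):
    """Two fully connected layers producing class logits."""
    def __init__(self, in_dim=64, n_classes=4):
        super().__init__()
        self.fc = nn.Sequential(
            nn.Linear(in_dim, 32), nn.ELU(), nn.Linear(32, n_classes)
        )

    def forward(self, z):
        return self.fc(z)
```

Keeping the two networks as separate modules makes it easy to update only the prediction parameters θ during base learning while leaving the representation parameters ϕ untouched.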
The workflow of MUPS-EEG is as follows. An ensemble of M meta tasks E_meta = {T_1, T_2, ..., T_M} is created from the source dataset D_s = {(x_1, y_1), ..., (x_N, y_N)}, which covers a total of L known subjects. Each meta task T_i = {(x^i_1, y^i_1), ..., (x^i_m, y^i_m)} contains m data points from l subjects, where m ≪ N and l < L. Each cycle of meta update is called an episode and includes two phases: base learning and meta learning. In each episode, a meta task T_i is sampled from the task pool E_meta, with p data points used for base learning (T_b) and q data points for meta learning (T_m) (indexing on i is omitted for conciseness), where p + q = m.
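The task construction above can be sketched in plain Python; the helper names and the dict-based data layout are assumptions made for illustration, not part of the paper:

```python
import random

def make_meta_tasks(source_data, n_tasks, m, l):
    """Build an ensemble of meta tasks from source subjects.

    source_data: dict mapping subject id -> list of (x, y) pairs.
    Each task draws m data points from l randomly chosen subjects.
    """
    tasks = []
    for _ in range(n_tasks):
        subjects = random.sample(list(source_data), l)
        pool = [pt for s in subjects for pt in source_data[s]]
        tasks.append(random.sample(pool, m))
    return tasks

def split_task(task, p):
    """Split one meta task into a base-learning portion (p points)
    and a meta-learning portion (the remaining m - p points)."""
    shuffled = random.sample(task, len(task))
    return shuffled[:p], shuffled[p:]
```

Because l < L, any single task sees only a subset of the subjects, which is what forces the meta update to favor representations that transfer across subjects rather than fit one of them.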
MUPS-EEG adopts a two stage optimization approach with two optimizers, one for the base learner and the other for the meta learner. The base learner includes a representation learning net parameterized by ϕ and a prediction learning net parameterized by θ. The meta learner keeps another set of parameters {ϕ*, θ*}. During initialization, {ϕ, ϕ*} is pretrained for a warm start, and {θ, θ*} is randomly initialized. In later episodes, both the base learner and the meta learner inherit parameter values from the meta learner of the previous episode.
In the base learner, the gradient is evaluated with the loss function L_{T_b}(θ, ϕ) being the cross entropy for the classification task. When updating the base learner, we only update the parameters of the prediction learning net:

θ ← Adam(θ, α, ∇_θ L_{T_b}(θ, ϕ)),  (2.5)

where α is the learning rate of the base optimizer. Here Adam can be replaced by any optimizer operating on first order gradients. After the base learning loop ends, the meta task T_m is applied to obtain the meta gradient ∇_{θ,ϕ} L_{T_m}(θ, ϕ), and the parameters of both the representation learning net and the prediction learning net are updated accordingly:

{θ*, ϕ*} ← {θ*, ϕ*} − β ∇_{θ,ϕ} L_{T_m}(θ, ϕ),  (2.6)

where β is the learning rate of the meta optimizer. Note that this meta optimization is performed on the meta learner, whereas the objective gradient is computed from the updated base learner, as its gradient descent direction is broadly effective across different subjects. The meta learner is kept across episodes and then adapted to the target subject during evaluation. The algorithm is outlined in Algorithm 1.
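A minimal first-order sketch of one episode of this two-stage update is given below. The function signature and the plain gradient step for the meta update are assumptions for clarity; the paper uses Adam for both stages, and the exact update rule should follow Eqs. 2.5 and 2.6:

```python
import copy
import torch
import torch.nn.functional as F

def meta_episode(rep_net, pred_net, task_b, task_m,
                 alpha=1e-3, beta=1e-3, n_base_updates=10):
    """One meta episode (first-order sketch, names illustrative).

    rep_net / pred_net hold the meta learner's parameters {phi*, theta*};
    the base learner is a copy that adapts its prediction net on task_b,
    after which the meta gradient from task_m updates both meta nets.
    """
    base_rep = copy.deepcopy(rep_net)
    base_pred = copy.deepcopy(pred_net)
    # base learning: Adam updates theta only (Eq. 2.5)
    base_opt = torch.optim.Adam(base_pred.parameters(), lr=alpha)
    xb, yb = task_b
    for _ in range(n_base_updates):
        loss = F.cross_entropy(base_pred(base_rep(xb)), yb)
        base_opt.zero_grad()
        loss.backward()
        base_opt.step()

    # meta learning: gradient of the meta loss through the adapted
    # base learner, applied to the meta learner's parameters (Eq. 2.6)
    xm, ym = task_m
    meta_loss = F.cross_entropy(base_pred(base_rep(xm)), ym)
    grads = torch.autograd.grad(
        meta_loss,
        list(base_rep.parameters()) + list(base_pred.parameters()))
    with torch.no_grad():
        meta_params = list(rep_net.parameters()) + list(pred_net.parameters())
        for p, g in zip(meta_params, grads):
            p -= beta * g
    return meta_loss.item()
```

Copying the meta learner at the start of each episode mirrors the inheritance step in the text: the base learner always starts from the meta learner's current parameters.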

Experiments
We compare our method against the current state of the art on two public datasets, with the detailed experimental settings described below.
Datasets: The proposed model is evaluated on two public datasets, namely BCI competition IV dataset 2a (abbreviated as BCI IV-2a below) [18] and the DEAP dataset [19]. BCI IV-2a involves 9 subjects performing 4-class motor imagery tasks. Each subject is tested in two sessions, and each session consists of 288 trials. Signals are recorded with 22 electrodes at a 250 Hz sampling rate. The DEAP dataset is for emotion recognition, with a total of 32 subjects. 40 trials are recorded for each subject as they watch music videos inducing different types of arousal. The signal comprises 32 channels at a sampling rate of 512 Hz.
Implementation Details: The model is implemented in PyTorch. We used a three layer convolutional neural network (CNN) similar to EEGNet [6] as the representation learning network, which is compact and versatile across different BCI paradigms. The prediction learning network includes two fully connected layers. The representation learning network is pretrained with an SGD optimizer with the learning rate set to 0.01. The Adam optimizer is adopted during meta training for the adaptation of the base learner and the meta learner, with the learning rate set to 0.001. The learning rate is discounted by 0.2 every 5 steps. We run 10 epochs for representation learning pretraining and 20 epochs for meta training. Each meta episode involves ten iterations of base learner updates and one meta update. During a meta episode, one data batch containing 12 sampled meta tasks is fed into the model, and each task is made up of 20 data segments: 10 segments are used for the base update and the other 10 for the meta update.
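The optimizer and schedule settings above can be written down directly in PyTorch; the two `nn.Linear` modules below are placeholders standing in for the CNN representation net and the FC prediction net:

```python
import torch

# Placeholder modules standing in for the actual networks.
rep_net = torch.nn.Linear(64, 32)   # representation learning net
pred_net = torch.nn.Linear(32, 4)   # prediction learning net

# Pretraining: SGD at learning rate 0.01 for the representation net.
pretrain_opt = torch.optim.SGD(rep_net.parameters(), lr=0.01)

# Meta training: Adam at learning rate 0.001,
# discounted by a factor of 0.2 every 5 steps.
meta_opt = torch.optim.Adam(
    list(rep_net.parameters()) + list(pred_net.parameters()), lr=0.001)
scheduler = torch.optim.lr_scheduler.StepLR(meta_opt, step_size=5, gamma=0.2)
```

`StepLR` with `step_size=5, gamma=0.2` implements the stated decay: the learning rate is multiplied by 0.2 after every 5 scheduler steps.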
Table 1. Comparison of accuracy and ROC-AUC on the target subject for the BCI IV-2a and DEAP datasets. BCI IV-2a has a total of nine subjects; the models are trained on eight subjects and tested on the subject left out. Similarly, for DEAP one subject is left out for testing and the models are trained on the other 31 subjects. The reported result is averaged across all subjects. The first three models are subject independent and do not use any target subject data. For the other transfer learning approaches we used the same amount of target data (1 minute of EEG recording for BCI IV-2a and 5 minutes for DEAP) for a fair comparison. MUPS-EEG outperforms the comparison methods on both datasets with its efficient meta adaptation mechanism.

We evaluate the model on: 1) performance on the new target subject, and 2) knowledge retained from previously learnt subjects. Performance on the new subject is measured with both accuracy and ROC-AUC, and the knowledge retaining ability is measured with the averaged accuracy (Avg. Acc) and averaged ROC-AUC (Avg. RA) across previous subjects after adaptation finishes.

Result Analysis: Model performance on the target subject for the BCI IV-2a and DEAP datasets is presented in Table 1. We conducted a comprehensive comparison with models that perform well on cross subject classification tasks and have code publicly available. The first three comparison models (EEGNet, CTCNN, CRAM) do not involve the transfer process, and no target data is used. For the other transfer learning approaches, we used the same amount of target subject data (1 minute of EEG recording for BCI IV-2a and 5 minutes for DEAP) for a fair comparison. On the BCI IV-2a dataset, MUPS-EEG improves by at least 2.2% in accuracy and 1.3% in ROC-AUC over the other models. The classification accuracy varies across individual subjects: MUPS-EEG classifies 7 out of 9 subjects with above 70% accuracy, which is generally deemed an acceptable threshold for the application of BCI systems [13]. On the DEAP dataset, MUPS-EEG outperforms the other approaches by at least 3.4% in accuracy and 1.5% in ROC-AUC. This performance improvement comes from MUPS-EEG's ability to rapidly adapt to the target domain with a small amount of target data. Table 2 shows the model's capability to retain knowledge of previously learnt subjects after adaptation finishes. MUPS-EEG outperforms the other models by a margin of 1.6% in Avg. Acc and 0.5% in Avg. RA on the BCI IV-2a dataset. On the DEAP dataset, the model achieves a 2.6% gain in Avg. Acc and a 0.7% gain in Avg. RA.
We further explored the influence of different amounts of target subject data on model performance, shown in Fig. 1. Performance is positively correlated with the amount of target data; we observed that both accuracy and ROC-AUC fully converge with 2 minutes of EEG recording from the target subject on the BCI IV-2a task, while 5 minutes of recording is needed for the DEAP dataset.
Comparing model performance on the two datasets, DEAP proved more challenging than BCI IV-2a for the cross subject classification task: only 3 out of 8 models reach above 60% accuracy in Table 1, given that the theoretical chance of random guessing is 33.3%. With the current models performing below the 70% accuracy generally deemed an acceptable threshold for the application of BCI systems [13], further performance improvement is needed on the DEAP dataset.

Conclusion
Pattern variability of EEG signals across subjects is a major challenge for cross subject EEG classification. We propose an efficient transfer learning model built on a meta update mechanism for this task. The two step meta update approach, operating on meta tasks, enables the model to rapidly adapt to a new subject while retaining knowledge of known subjects. The model is efficient in its use of target data thanks to its tailored optimization process for target adaptation. We evaluate the model on two public datasets, where it outperforms the current state of the art by a large margin.

Algorithm 1: MUPS-EEG
Input: data from source subjects D_s, data from target subject D_t, base learning rate α, meta learning rate β
Output: optimal meta learned model
1:  for samples in D_s do
2:      pretrain ϕ based on L_{D_s}(ϕ)
3:  end
4:  while not done do
5:      sample a batch of tasks {T_{1∼K}} ∈ E_meta
6:      for meta episode k from 1 to K do
7:          split T_k into T_b and T_m
8:          for number of base updates do
9:              optimize θ with T_b by Eq. 2.5
10:         end
11:         optimize {θ*, ϕ*} with T_m by Eq. 2.6
12:         {θ, ϕ} ← {θ*, ϕ*}
13:     end
14: end

Figure 1. MUPS-EEG performance on (a) the BCI IV-2a dataset and (b) the DEAP dataset with different amounts of target subject data. We observe that both accuracy and ROC-AUC converge with 2 minutes of target subject data for BCI IV-2a, while 5 minutes of target data is needed for convergence on DEAP.

Table 2. Comparison of averaged accuracy (Avg. Acc) and averaged ROC-AUC (Avg. RA) on the learnt source subjects for the BCI IV-2a and DEAP datasets. The training setting is the same as described in Table 1; Avg. Acc and Avg. RA are evaluated on the source subjects after adaptation finishes. EEGNet, CTCNN and CRAM are subject independent approaches and are not included here, as their performance is the same as reported in Table 1. MUPS-EEG performs consistently better than the comparison baselines in retaining knowledge of learned subjects.