Guided Learning of Human Sensor Models with Low-Level Grounding

Sensor data often lacks intuitive interpretability in its raw form, unlike language or image data. Furthermore, standard end-to-end training leaves little control over local representation learning. We postulate that guided local representation learning can be used to tackle both issues. In this paper we introduce a novel framework for sensor models which uses low-level grounding for guided learning of human sensor models. Our framework is amenable to different model architectures. We demonstrate our method on two different human activity datasets, one containing labels of low-level actions used in performing high-level activities, and one without any low-level labeling. We provide a comprehensive analysis of our framework's performance across many low-level action subsets and demonstrate how it can be easily adapted to data with no low-level labeling. Our results demonstrate that low-level grounding can be used to improve both the interpretability and performance of sensor models.


Introduction
Sensor-based models for activity recognition have many applications within healthcare, health monitoring, activity tracking and smart home technology [1]. One key limitation of most sensor models is the difficulty of interpreting how decisions are made inside the model. It is challenging to understand the physical representations of the data encoded within the model. In addition, one would also like to understand whether human knowledge can be leveraged to help build sensor-based models.
Deep learning models learn good local representations which are composed to make global predictions. As shown in CHARM [2], models trained end-to-end on high-level activities tend to build low-level feature representations based on how they are correlated to high-level activities and not based on the underlying low-level motions themselves.
In this study, we aim to structure the local model representations in a way that ensures they use valid and relevant criteria during decision making, and perhaps to leverage human knowledge during training. This presents a conflict, as we would also like to give the model freedom to learn organic representations. During supervised learning we optimize the model based on high-level criteria (such as classification accuracy), which leaves us little control over local representation learning. Standard explainable AI (XAI) methods often relate high-level predictions to local input, requiring human interpretation of that input.
Unlike other modalities such as language, sensor data is not inherently interpretable, which makes the use of standard explainable AI solutions highly limited. Common solutions include feature extraction and weighting (SHAP [3]) and heatmap visualizations (LRP [4], etc.). When adopting these strategies for sensor data, one must rely on manual feature extraction or on interpretation of the raw waveform. In this research, we aim to explore explainable/interpretable sensor-based frameworks. We would like to impose some control over how and what the model learns during training. Specifically, to interpret decision making at a high level, we must understand how the neural network's local representations are used for high-level activity prediction. At the same time, training a model to predict low-level activities would require large amounts of low-level training data, which is infeasible in most cases and would restrict the use of many current activity recognition datasets.
In this paper we propose an adaptable framework that uses these low-level activities to aid local representation learning. We ground the model's local representations to the low-level activity representations, while allowing the model to build these representations itself. Our framework takes advantage of vector quantization (VQ) which, combined with low-level grounding of the embedding space, allows us to evaluate how the model makes predictions internally. We conduct our experiments on human activities, which are made up of many smaller low-level actions, and demonstrate the advantages of the proposed model.
In summary, our contributions are as follows:
• We introduce a novel interpretable sensor-based method for activity recognition tasks.
• We propose the use of low-level grounding to guide local representation learning and allow for interpretation.
• We test our model on two different datasets and show how it can easily be adapted for activity recognition datasets which do not have extensive low-level labeling.
• We show how our model can be used to evaluate sensor-based decision making without relying on the raw waveform.

Related Work
Sensor-based activity recognition has attracted increasing attention recently [5]-[7]. Most of this prior research has focused on improving model performance and does not focus on low-level representation learning. For a model to be explainable we need to understand the local feature representations and their physical meaning within the context of the activity being performed.

High-level Human Activity Recognition
We define low-level activities as the set of actions performed within a high-level activity. For example, when preparing breakfast, one may open the fridge, open a cupboard, or spread butter on toast. Another example: while running, one performs a set of gait actions (low-level activities) which make up the high-level act of running. In essence, low-level activities are how we can break down and understand more complicated high-level activities.
CHARM [2] investigates complex high-level activities over large time windows. It introduces a two-stage neural architecture which compresses short sequences of sensor data into local continuous feature representations. The local representations are then passed to a high-level encoder to model the high-level activity probability distribution. Even without training on low-level activities, the model learns to cluster certain low-level activities. Importantly, there is still overlap within the embedding space between many of the low-level actions. Due to the continuous nature of the embedding space, it can be challenging to interpret the physical meaning of values within it.
Low-level representation learning has also been explored as a way of predicting high-level activities by [8]. They use temporal pattern mining to generate features from low-level actions based on frequent temporal patterns within the sequences of low-level activities. Unfortunately, this strategy relies on low-level labeling of the training set, making it prohibitively restrictive for most common HAR datasets and expensive for practical applications.

Interpretable Models for Sensor Data
Interpretable and explainable methods have generated interest for sensor data, especially for health monitoring applications. One study [9] demonstrates the use of Layer-Wise Relevance Propagation (LRP), which can be effective in identifying the parts of the input signal that most impact the predictions. Unfortunately, it relies on users to interpret the raw sensor input, which is inherently difficult compared to image and text data.
More recent work [10] applies SHAP, LRP, Grad-CAM and Relevance-CAM as methods for wearable-sensor-based gait analysis to identify patients with osteopenia and sarcopenia. SHAP can be used to explain model predictions in terms of the input features. Unfortunately, it relies on hand-crafted features, which limits the use of some modern deep learning architectures. The other methods were used to analyze deep-learning-based architectures and ran into the same raw-signal interpretability problem mentioned above.
The work presented in [11] introduces a CNN-based architecture which can select important sensor signals and boost the interpretability of multi-sensor systems for HAR. While the model can tell which sensors are being used most for predictions, it does not reveal what the model has learned about those sensor signals.

Vector Quantization
Vector quantization (VQ) models were introduced with VQ-VAE [12], a variational autoencoder proposed to generate high-quality image, speech, and video data. VQ-VAE successfully models important features that span many dimensions. When processing sensor data, we often reduce the dimensionality of the input when learning local representations. We expect VQ to allow us to control low-level representation learning.

Methods
In Figure 1 we introduce our framework for low-level grounding. In typical end-to-end training it is very difficult to control the model parameters, and we want to avoid manually building the embedding space, which would inhibit representation learning. We therefore use VQ to generate discrete embeddings, which can then be manipulated using a set of low-level actions. Specifically, we first apply the low-level encoder to the input data, generating the low-level representations of the sequence. We also apply the encoder to the low-level actions, generating a matching low-level representation for each value in the codebook. We compare the low-level embeddings with the values from the codebook and choose the closest codebook value for each embedding. These are then passed to the high-level encoder for activity prediction.
During training, we calculate the pairwise distance between each encoded low-level action and the codebook values.
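To make the pipeline concrete, the following is a minimal PyTorch sketch of the forward pass, assuming hypothetical module and variable names (ll_encoder, hl_encoder, codebook); it illustrates the nearest-codebook lookup and straight-through gradient described here and in Figure 1, not the authors' exact implementation.

```python
import torch
import torch.nn as nn

class GroundedVQModel(nn.Module):
    """Sketch of the grounded VQ pipeline: low-level encoder -> nearest
    codebook entry -> high-level encoder. Names and sizes are illustrative."""
    def __init__(self, ll_encoder, hl_encoder, num_codes, embed_dim):
        super().__init__()
        self.ll_encoder = ll_encoder            # maps each sensor window to an embedding
        self.hl_encoder = hl_encoder            # maps quantized embeddings to activity logits
        self.codebook = nn.Parameter(torch.randn(num_codes, embed_dim))

    def forward(self, x):                       # x: (batch, num_windows, window * channels)
        e = self.ll_encoder(x)                  # (batch, num_windows, embed_dim)
        # Distance from every window embedding to every codebook entry.
        cb = self.codebook.unsqueeze(0).expand(e.size(0), -1, -1)
        dists = torch.cdist(e, cb)              # (batch, num_windows, num_codes)
        idx = dists.argmin(dim=-1)              # nearest codebook index per window
        e_c = self.codebook[idx]                # quantized embeddings
        # Straight-through estimator: the high-level gradient is passed
        # directly to the encoder, as in Figure 1.
        e_q = e + (e_c - e).detach()
        logits = self.hl_encoder(e_q.flatten(1))
        return logits, e, e_c
```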

Vector-quantization-based Model Design
Our method employs vector quantization as a way of grounding the model's low-level representations to the encoder representation of the low-level actions. VQ-based models have not previously been applied to inertial sensor data, and here we discuss the construction of the VQ model without low-level grounding. The encoder output space is defined as ℝ^(N×d), where N is the number of local embeddings generated by the encoder and d is the embedding dimension. The input x is passed to the encoder E(x), which generates embedding vectors e_i ∈ ℝ^d, i = 1, 2, ..., N. These embedding vectors are compared to a discrete codebook c ∈ ℝ^(C×d), where C is the size of the discrete embedding space, and each e_i is mapped onto the nearest codebook entry c_j. The quantized embeddings are then fed to the decoder D(e_c) as shown in Equation 1.
The VQ model without low-level grounding is optimized using a loss with three components, shown in Equation 2. We use mean squared error (MSE) to calculate the distance between embedding vectors and cross-entropy loss (CE) to optimize the model output.
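For reference, a standard VQ-VAE-style objective with a cross-entropy classification term takes the following form, where sg[·] denotes the stop-gradient operator and β weights the commitment term; the exact weighting used in Equation 2 may differ:

L_VQ = CE(D(e_c), y) + ||sg[E(x)] − e_c||² + β·||E(x) − sg[e_c]||²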

Learning
While our model is identical to the VQ-based model during the inference phase, we perform low-level grounding through optimization of the embedding space. Our training paradigm is meant to optimize the model to make the correct predictions, shift the codebook towards the low-level embeddings, and shift the encoder outputs towards the codebook embeddings. This encourages our model to use the real physical representations requested, without limiting the actual mapping between the sensor data and the encoder embedding output. We can then interpret whether the model is preserving the physical meaning of the embeddings or not.
We introduce our low-level actions, which take the same form as the windowed encoder input. To ground our codebook to the low-level actions, we calculate the pairwise distance between the encoded low-level actions and the codebook values as our low-level loss (Equation 3), where 1 is a vector of ones and ε is a small constant. We also aim to ensure that the output of the encoder predicts values close to the values of the codebook, since the output is continuous. To this end, we match the codebook with the closest corresponding embeddings (as in Equation 1), denoted e_c, and calculate the MSE between the encoder outputs and the codebook values. The overall loss function is shown in Equation 4:

L = CE(D(e_c), y) + MSE[e_c, E(x)] + λ·L_LL    (4)

Because the pairwise distance can drown out the other terms during optimization, we introduce a damping coefficient λ which controls the influence of the low-level grounding.
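A minimal sketch of how these terms could be combined in PyTorch is shown below; the function and variable names, the pairing of codebook entries with low-level embeddings, and the reduction are assumptions rather than the paper's exact formulation.

```python
import torch
import torch.nn.functional as F

def grounded_vq_loss(logits, targets, e, e_c, codebook, ll_embeddings, lam=0.1):
    """Sketch of the overall loss: classification term, commitment term, and a
    pairwise-distance grounding term scaled by a damping coefficient `lam`.
    `ll_embeddings` are encoder outputs for the chosen low-level actions,
    one per codebook entry (the pairing is an illustrative assumption)."""
    ce = F.cross_entropy(logits, targets)            # optimize the model output
    commit = F.mse_loss(e, e_c.detach())             # pull encoder outputs toward the codebook
    # Ground each codebook entry to its matching low-level action embedding.
    ll_loss = F.pairwise_distance(codebook, ll_embeddings, eps=1e-6).mean()
    return ce + commit + lam * ll_loss
```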

Low-Level Actions for OPPORTUNITY dataset
The Opportunity [13] dataset provides comprehensive low-level labeling. To generate our low-level action subset for training, we randomly pull one sensor sample for each action from the training dataset. We note that we only use one low-level action subset during training: we expect our model to be easily adaptable to other sensor-based activity tasks, and training on a large range of low-level subsets would make the application of our method much more resource intensive.
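As an illustration of this sampling step, a minimal sketch is shown below, assuming a hypothetical array of pre-windowed training data with per-window low-level labels.

```python
import numpy as np

def sample_low_level_actions(windows, ll_labels, rng=None):
    """Pick one sensor window per low-level action label.
    `windows`: array of shape (num_windows, window_len, channels),
    `ll_labels`: array of shape (num_windows,) with low-level action ids."""
    rng = rng or np.random.default_rng(0)
    subset = {}
    for label in np.unique(ll_labels):
        candidates = np.flatnonzero(ll_labels == label)
        subset[label] = windows[rng.choice(candidates)]
    return subset  # one representative window per low-level action
```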

Low-Level Actions for HAR dataset
In our study, we chose 10 low-level actions which humans can associate with the high-level activities. The ones associated with walking-related motions were initial contact, final contact, acceleration, upwards acceleration, and downwards acceleration. For the sitting, standing, and laying activities we chose neutral (the device is in a neutral position, as it would be during sitting), to sit, stand, to lay, and laying. We chose the to lay and to sit low-level actions because we noticed that some of these trials did not look entirely at rest, although we may expect some confusion as these may be very similar to the sit and lay low-level actions. A single example of each low-level action was recorded.

Encoder Design
Our model is agnostic to the design of the encoders. We design a low-level encoder to generate the vectors used for comparison to the codebook, and a high-level encoder to map the low-level embeddings for classification. A sequence of sensor data is defined as X ∈ ℝ^(T×S), where T is the total length of the raw sensor sequence and S is the dimension of the sensor data. To create the discrete local representations, we transform the data into the form ℝ^(q×(w·S)), where w is the discrete window size and q = T/w is the number of windows.
The low-level encoder is applied to each window, generating the local representations. These local representations are then consolidated and fed to the high-level encoder to make predictions. The low-level encoder consists of two linear layers; layer normalization and a leaky ReLU with a negative slope of 0.01 follow each layer. The high-level encoder follows the same structure but with a softmax function in place of the final activation function. We apply dropout between each layer.
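A minimal sketch matching this description is shown below (two linear layers per encoder, layer normalization and leaky ReLU with negative slope 0.01, dropout, and a final softmax); the windowing helper, layer sizes, and class count are illustrative assumptions.

```python
import torch
import torch.nn as nn

def window_sequence(x, w):
    """Reshape a raw sequence of shape (T, S) into q = T // w windows of shape (w * S,)."""
    T, S = x.shape
    q = T // w
    return x[: q * w].reshape(q, w * S)

class LowLevelEncoder(nn.Module):
    def __init__(self, in_dim, hidden_dim=32, dropout=0.1):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_dim, hidden_dim), nn.LayerNorm(hidden_dim),
            nn.LeakyReLU(0.01), nn.Dropout(dropout),
            nn.Linear(hidden_dim, hidden_dim), nn.LayerNorm(hidden_dim),
            nn.LeakyReLU(0.01),
        )

    def forward(self, windows):       # (q, w * S) -> (q, hidden_dim)
        return self.net(windows)

class HighLevelEncoder(nn.Module):
    def __init__(self, in_dim, hidden_dim=32, num_classes=4, dropout=0.1):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_dim, hidden_dim), nn.LayerNorm(hidden_dim),
            nn.LeakyReLU(0.01), nn.Dropout(dropout),
            nn.Linear(hidden_dim, num_classes),
        )

    def forward(self, embeddings):    # consolidated low-level embeddings -> class probabilities
        return torch.softmax(self.net(embeddings), dim=-1)
```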
When we remove the vector quantization and low-level grounding and directly connect the low-level and high-level encoders, the model closely resembles CHARM, with a few minor changes. We will refer to this model as the deep temporal neural network (DTNN).

The OPPORTUNITY Task
The Opportunity dataset consists of four users with six runs per user. Five of these runs are daily-living runs in which the users are free to perform high-level activities naturally. The sixth run is a drill run where the user is instructed to perform a scripted sequence of activities. This dataset is used to perform high-level activity recognition over long sequences. We follow the same dataset and sensor configuration as [2] to maintain comparability within the literature. The data is split into sequences of 2,560 datapoints and we ensure that there is no overlap of high-level activities within each sequence; 50% overlap is allowed between sequences where applicable. For sensors, we use one sensor on each lower arm and one sensor on the back (accelerometer and gyroscope, 18 total sensor channels). The final user is held out for testing purposes. In CHARM they found that the longest sequences of the "relaxing" type were ~100 s, while the shortest sequences of the other classes were around 100 s, and chose to omit this class; for consistency we also leave out the relaxing class. There were many instances in the dataset where the high- and low-level activities were labeled as "null"; we remove all of these sequences as they have no associated meaning. The final set of high-level activities is Coffee time, Early Morning, Cleanup and Sandwich time.
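For illustration, a minimal sketch of this segmentation (fixed-length sequences with 50% overlap, keeping only sequences with a single high-level label) is shown below; it omits the sensor-channel selection and the exact null-filtering.

```python
import numpy as np

def segment_sequences(data, labels, seq_len=2560, overlap=0.5):
    """Split a long recording into fixed-length sequences with the given overlap,
    keeping only sequences that contain a single high-level activity label."""
    step = int(seq_len * (1 - overlap))
    sequences = []
    for start in range(0, len(data) - seq_len + 1, step):
        window_labels = labels[start : start + seq_len]
        if len(set(window_labels)) != 1:        # skip mixed high-level activities
            continue
        sequences.append((data[start : start + seq_len], window_labels[0]))
    return sequences
```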
We randomly sample a low-level sequence of length 16, to match our local encoder input size, for each low-level action. We chose the low-level activities associated with the right hand, as it appeared to be the dominant hand across the user base. The 13 low-level activities were unlock, stir, lock, close, reach, open, sip, clean, bite, cut, spread, release, and move.

The Human Activity Recognition (HAR) Task
Methods that rely on large volumes of low-level labeling are simply infeasible and impractical for most applications. We therefore wanted to test how our method adapts to datasets with no low-level labeling, and chose the widely used UCI-HAR dataset [14]. Here users perform six tasks, and the sensor data is recorded via a smartphone attached at the waist. The tasks are walking, walking upstairs, walking downstairs, sitting, standing, and laying. The data is pre-segmented into sequences of 128 datapoints.

Training Details
To maintain consistency and comparability, we choose a window size of 16 for all tasks and use a hidden dimension of 32. We maintain a similar number of training steps across each model. For the Opportunity dataset and the DTNN we use a batch size of 1, a learning rate of 5e-4, and train for 10 epochs. For the VQ-based models we use a batch size of 4 and train for 40 epochs. We found that a coefficient value of 0.1 provided the best results for the VQ model with low-level grounding and 0.2 without low-level grounding. We tested randomly varying the low-level action subset during development and did not find that it improved the results. On the HAR dataset we use a batch size of 128, a learning rate of 1e-3, and train for 40 epochs across all models. We note that there is severe class imbalance within the Opportunity dataset and the number of training samples is quite small; we therefore sample from the dataset proportionally for each class and use a weighted cross-entropy loss with weights that are inversely proportional to the number of samples per class in the dataset. Our implementation is in PyTorch [15].
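A minimal sketch of the inverse-frequency class weighting described here is shown below, assuming integer class labels; the proportional sampling scheme is only summarized in prose above.

```python
import torch
import torch.nn as nn

def make_weighted_ce(train_labels, num_classes):
    """Cross-entropy loss with weights inversely proportional to class frequency."""
    counts = torch.bincount(torch.as_tensor(train_labels), minlength=num_classes).float()
    weights = counts.sum() / (counts + 1e-8)    # inverse-frequency weights
    weights = weights / weights.sum()           # normalize for stability
    return nn.CrossEntropyLoss(weight=weights)
```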

Results on the OPPORTUNITY Task
Table 1 shows that our implementation of the discrete temporal neural network achieves almost 2% higher accuracy than the original CHARM model. The VQ model has lower accuracy on the classification task compared to the DTNN. Our grounded model's accuracy varies between 0.92 and 0.98 depending on the set of low-level actions used for grounding. We believe this is due to the quality of the low-level actions used: poor low-level actions can inhibit the model during training, while strong low-level actions may enhance and support the construction of the embedding space. Overall, our model reduces error compared to the original CHARM model by roughly 66% and shows a 50% reduction in error compared to the base DTNN when suitable low-level actions are used.

Table 1: Classification accuracies of each model on the Opportunity dataset.

OPPORTUNITY: Low-Level Ablation
We perform an ablation over 100 different low-level action samples used for low-level grounding of our model, training the model 5 times with each sample. Figure 2 shows that there is no variation in accuracy across the repeated runs for a given low-level sample, while accuracy differs considerably between samples. This indicates that model performance is directly predicated on the quality of the samples used for low-level grounding.

OPPORTUNITY: Low Level Embedding Interpretation
Table 2 shows the codebook values associated with each high-level activity. Below, we list several challenges with using the real low-level actions as embeddings for our model. The true low-level actions for each task are extremely imbalanced: classes such as "move" or "reach" make up large proportions of the overall low-level activities but are not specific to the high-level activity. We removed the null class from our analysis as it has no interpretable meaning; the model, however, needs to match each discrete interval with an embedding, and therefore we cannot expect it to correctly predict all the low-level embeddings. We also noticed that "bite" is a key low-level action for activities labeled "Sandwich time" but accounts for only 1.74% of the low-level actions within the "Sandwich time" class, even with the null class removed. Additionally, low-level activities may be performed for longer periods of time than our discrete time intervals and may even overlap within these intervals; this means that two completely different looking inputs can be related to the same low-level action. While humans can visually recognize different actions, there is overlap between the motions of the low-level actions from the sensor frame of reference.
All these factors mean that directly using these low-level actions would likely result in poor classification performance, and it is highly unlikely that the model would learn to make predictions directly using the low-level actions. What we hope to achieve is to ground the learned embeddings of the model to a corresponding low-level action while allowing the model to build its own representation of each low-level action. This can ground the embeddings to each representation and help guide the model when building the embedding space. Our results demonstrate that this is in fact the case: a good set of low-level actions can significantly improve model performance. The caveat is that when a poor set of low-level actions is used for grounding, performance is inhibited. Our model also helps with interpretability, but does not guarantee that the model will use the physical representations of the low-level actions for predictions. In this case we can see that while the model is highly accurate, the physical interpretation of the embedding space is not preserved.

OPPORTUNITY: T-SNE codebook analysis
Our T-SNE [16] analysis shows the difference between the codebook representations of two models trained with different low-level embeddings: one with an accuracy of 98% and one with an accuracy of 92% (Figure 3). We want to compare the low-level embeddings used to guide the model against the optimal solution to the embedding space that the model has found. The low-level embeddings for the higher-accuracy model show improved separation compared to those of the lowest-accuracy model. Better separation may mean that they are more descriptive (we would not want two embeddings to represent the same data) and that they are a better point of reference when building the embedding space. This supports our theory that a strong choice of low-level embeddings for grounding can improve the construction of a descriptive embedding space, which improves model accuracy, and vice versa.

Results on the HAR Task
Table 3 reports the classification results on the HAR dataset. Based on these accuracies we can see that the VQ model does not perform as well as the other two models. We demonstrate that by using only one example of each low-level action picked from the training dataset, we can achieve almost the same performance as the base model. It is highly likely that, given a better choice of low-level actions, the accuracy could be further improved.

HAR: Low Level Embedding Interpretation
We apply this model using a single, manually selected set of low-level actions, which does not significantly increase the amount of data labeling required. In this case our model is slightly less accurate than the base model (although this could likely be improved through a better choice of low-level actions). When we compare the quality of our grounded embeddings against those of the VQ model without grounding, we can clearly see that the embeddings generated through low-level grounding are more descriptive. One key example is that the non-grounded embeddings appear to rely heavily on embeddings 1, 2 and 3 when predicting standing; despite no other action strongly associating with these embeddings, the prediction accuracy for standing is low at 78%. Furthermore, all three walking activities share a very similar set of embeddings, which may be why the non-grounded model has a slightly lower accuracy compared to the grounded model.
We would like to point out that since we do not train the model to correctly predict low-level actions (we allow it to learn these relationships organically), we expect some error. The grounded model does not associate "laying" with laying and instead strongly represents "to lie" with "laying". This could be caused by the "to lie" and "laying" embeddings being too similar, and perhaps the laying embedding was not a strong identifier. The model more strongly associated "standing" with "final contact" and has some confusion when differentiating between the acceleration types. A large amount of the error could likely be associated with the manual selection of the low-level actions. Additionally, overlapping and mixed actions within each timeframe would contribute to some error. Our model is interpretable as it shows which actions are most strongly associated with predictions.

Table 3: Classification accuracies on the HAR dataset.
Model                    F1-Score
Discrete Temporal NN     0.9324
VQ NN                    0.9161
VQ + LL Grounding        0.9252
Our model solves the problem found in CHARM, where models build low-level representations based on how they are correlated to high-level predictions. Ours builds low-level embeddings based on important qualities of our choosing through low-level grounding and then learns the high-level relationships for classification. In essence, we can choose the low-level activities we want the model to learn, without overly restricting representation learning.

HAR: T-SNE Codebook Analysis
The T-SNE analysis between the codebook and the low-level action embeddings shows that the model preserves relationships between the low-level embeddings and the codebook for well-separated low-level actions (initial contact, upstairs, acceleration, final contact, stand, etc.); it tends to create separation for low-level actions that are close together (downstairs and sit, to lie and laying). This may explain why the model is very sensitive to the low-level grounding samples provided. If the quality of the low-level actions is poor, the model must work against the grounding when building its own descriptive codebook, and the low-level grounding is counterproductive. On the other hand, if the quality of the low-level embeddings is high, then the grounding can help improve optimization. This may also explain some of the confusion between the "to lie" and "laying" codebook representations.
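A minimal sketch of this kind of comparison, projecting the learned codebook and the low-level action embeddings together with scikit-learn's TSNE, is shown below; variable names and T-SNE parameters are assumptions.

```python
import numpy as np
from sklearn.manifold import TSNE
import matplotlib.pyplot as plt

def tsne_codebook_plot(codebook, ll_embeddings, ll_names):
    """Project codebook entries and low-level action embeddings into 2D together."""
    combined = np.vstack([codebook, ll_embeddings])
    coords = TSNE(n_components=2, perplexity=5, init="pca", random_state=0).fit_transform(combined)
    n = len(codebook)
    plt.scatter(coords[:n, 0], coords[:n, 1], marker="s", label="codebook")
    plt.scatter(coords[n:, 0], coords[n:, 1], marker="o", label="low-level actions")
    for (x, y), name in zip(coords[n:], ll_names):
        plt.annotate(name, (x, y))
    plt.legend()
    plt.show()
```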

Conclusion
We introduce a novel VQ-based model which uses low-level grounding to help guide the construction of the model's embedding space. Our model can boost classification performance and can be applied to activity recognition datasets with minimal additional effort. Since the codebook of our VQ model is tied to local representations of real actions, we can interpret how the model is making predictions. We provide two different cases: one that demonstrates the model making predictions based on our predefined valid criteria, and one in which the model does not learn the relation between the high- and low-level activities. In either case our method improves the quality of the embedding space compared to the standard VQ model.
We believe that low-level action grounding is a very interesting means both for guiding embedding space construction and for attaching physical meaning to low-level representation learning, while still allowing the encoder to learn representations organically. Our recommendations for future work center on two limitations. First, our method does not guarantee that the model will find a solution which preserves the physical meaning of the embedding space; one challenge would be to guarantee this without limiting the organic representation learning which has allowed deep learning to become so powerful. Secondly, it can be challenging to determine which set of low-level actions will have a positive impact on guiding embedding space construction. At the same time, one key advantage of this model is that it needs only a trivial number of low-level actions for grounding. Ideally, we could automatically select the optimal set of low-level actions without the need to increase low-level labeling.

Acknowledgement
This research is partially supported by the Ingenuity Labs Research Opportunities Seed Funds (ROSF) at Queen's University and the NSERC Discovery Grants.

Figure 1 :
Figure 1: Low-level grounding. Sensor input (x_i) is passed to the encoder and then compared to the codebook embeddings using Equation 1. The codebook values are then compared to the low-level actions using Equation 3. During backpropagation we pass the gradient of the decoder (G) directly to the encoder. Our training paradigm is to optimize the model output, encourage the encoder to predict values (e_i) close to those of the codebook (CB_i), and ground the codebook values to the encoder representation of the low-level actions (e_Li).

Figure 2 :
Figure 2: Accuracy of our model with low-level grounding. The model was trained 5 times on each action; there was no variance in accuracy across runs for the same set of low-level actions.

Figure 3 :
Figure 3: T-SNE analysis of the highest and lowest accuracy models.

Figure 4 :
Figure 4: T-SNE analysis between the LL actions and learned codebook embeddings on the HAR dataset.


Table 2 :
Summary of codebook values associated with each task. (Top) Distribution of the codebook values without low-level grounding. (Middle) Codebook associations with low-level grounding. (Bottom) True percentage of low-level actions associated with each task.

Table 4 :
Analysis of LL embeddings used for HAR. Our model uses LL actions that belong to the correct high-level activity most of the time, despite no training to correctly associate the LL actions.