An Empirical Analysis on Pattern Reconstruction for Optimal Storage of Wearable Sensor Data

In this data age, connected devices continuously generate petabytes of images, text, and Internet of Things (IoT) sensor data. One approach to storing this massive data efficiently is to extract the relevant and representative features and store only those features instead of the continuous streaming data. However, this raises the questions of how much information content we can retain from the data and whether we can reconstruct pseudo-original data when needed. By facilitating relevant and representative feature extraction, storage, and reconstruction of near-original patterns, we aim to address some of the challenges posed by the explosion of streaming data. We present a preliminary study in which we explored multiple autoencoders for concise feature extraction and reconstruction of human activity recognition (HAR) sensor data. Our Multi-Layer Perceptron (MLP) deep autoencoder achieved a storage reduction of 90.18%, whereas the convolutional autoencoder achieved 11.18%. The Long Short-Term Memory (LSTM) autoencoder reduced storage by 91.47% and the convolutional LSTM autoencoder by 72.35%. The storage reduction depended on the size and dimension of the concise representation: for higher dimensions of the representation, the storage reduction was low, but relevant information retention was high, as validated by classification performed on the reconstructed data.


Introduction
Streaming data is growing exponentially. Forbes reports that, by 2025, the quantity of data will double every 12 hours [1]. A large portion of this streaming data comes from the Internet of Things (IoT), i.e., the numerous connected devices on the Internet. This explosion in IoT data presents a critical challenge in effective data storage and management for efficient query, analysis, and decision support [2][3][4].
Especially after two COVID-affected years of telehealth utilization, contact tracing, outbreak tracking, virus testing, remote work, and medical research, the growth of healthcare data is now beyond any earlier prediction or estimate [5]. Currently, owing to the low cost of storage, the general approach is to store all the data, which creates the problems of effectively extracting and computing useful knowledge, storing that knowledge, executing search, and linking knowledge for decision support. These challenges led us to the idea of storing only the relevant and characteristic features instead of the whole incoming data.
We aim to validate our approach for the human activity recognition (HAR) use case scenario using wearable sensor data (accelerometer, gyroscope).
We address this data management problem using the following Research Questions (RQs):
• RQ1: How can the representative features be extracted from the streaming data using a machine learning model?
• RQ2: Can the representative features be stored to reduce storage?
• RQ3: How can a pseudo-original representation be reconstructed from the stored concise representative features?
The contributions of this research work are as follows:
A) Reducing storage by keeping the concise representative features instead of the whole incoming data.
B) Reconstructing pseudo-original data from the stored representative features, which can be used in place of the original data.
The paper is organized into the following sections. Section 2 discusses the related work for storage reduction of IoT data as well as time-series data reconstruction using autoencoders. Section 3 presents the description, results, and discussion of the performed experiments. Finally, section 4 discusses the final thoughts and conclusion of the research work.

Related Work
Rani, Khurana, Sharma, and Moudgil [6] discussed different storage optimization techniques for IoT data. Moreover, according to Correa, Pinto, and Montez [7], lossy Data Compression (DC) techniques can be better alternatives to lossless ones, as they are computationally less complex and provide a better compression ratio. In their paper, they discussed a category of lossy compression techniques based on machine learning models using artificial neural network (ANN) architectures. For our research work, we focus only on this category, as ANNs are increasingly used in IoT scenarios to implement smart devices, paving the way toward the concept of smart cities.
Reconstruction and generation of text, images, and other types of data have come a long way since their advent, owing to the progress and widespread application of deep learning using ANNs. By itself, the area of image reconstruction, which generates high-quality images from corrupted, noisy, or low-quality ones, has opened the door to a myriad of applications. We focus on autoencoders for the reconstruction of IoT data because of their increasing and versatile use in reconstructing and generating other types of data, such as images. Sagheer and Kotb [8] and Nguyen, Tran, Thomassey, and Hamad [9] showed that LSTM autoencoders work better in modeling time-series data.

Implementation
We save the concise representation from the encoder part of a trained autoencoder. The four models implemented for the reconstruction component are an MLP deep autoencoder, a convolutional autoencoder, an LSTM autoencoder, and a convolutional LSTM autoencoder.
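The core idea above can be sketched in a few lines: split the autoencoder at its bottleneck, persist only the low-dimensional code produced by the encoder, and run the decoder later to obtain the pseudo-original data. The sketch below uses a hypothetical single-layer stand-in with random (untrained) weights purely to illustrate the split; a real pipeline would use the trained weights of one of the four models.

```python
import numpy as np

rng = np.random.default_rng(0)

def relu(x):
    return np.maximum(x, 0.0)

# Hypothetical single-layer stand-in for a *trained* autoencoder.
# The random weights are for illustration only.
D_IN, D_CODE = 1152, 32                  # flattened HAR window -> concise code
W_enc = rng.normal(size=(D_IN, D_CODE)) * 0.05
W_dec = rng.normal(size=(D_CODE, D_IN)) * 0.05

def encode(x):
    return relu(x @ W_enc)               # concise representation: store this

def decode(z):
    return z @ W_dec                     # pseudo-original reconstruction

x = rng.normal(size=(1, D_IN))
z = encode(x)                            # what gets written to storage
x_hat = decode(z)                        # what gets served on demand
print(z.shape, x_hat.shape)              # (1, 32) (1, 1152)
```

Only `z` ever touches disk, so the per-sample storage footprint scales with the code dimension rather than the input dimension.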
The experiments were run on Google Colab using the GPU hardware accelerator runtime. We used the UCI HAR dataset [10], a simple dataset with only 6 activity classes (WALKING, WALKING_UPSTAIRS, WALKING_DOWNSTAIRS, SITTING, STANDING, LAYING).
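For context, each UCI HAR sample is a fixed-length window of 128 timesteps over 9 inertial channels (body acceleration, angular velocity, and total acceleration, each with x/y/z axes), sampled at 50 Hz. Flattening such a window is what yields the 1152-dimensional input assumed by the MLP autoencoder:

```python
import numpy as np

# One UCI HAR window: 128 timesteps x 9 inertial channels.
window = np.zeros((128, 9))
flat = window.reshape(-1)    # flattened form fed to the MLP autoencoder
print(flat.shape)            # (1152,)
```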
We run the experiments as follows. First, the deep autoencoders are trained on the same training data as the convolutional LSTM classifier. The concise representations are then saved from the encoder layers of the deep autoencoders. Next, the saved representations are loaded from storage and reconstructed using the decoder parts of the autoencoders. Finally, the reconstructed representations are fed into the convolutional LSTM classifier.
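The save/load portion of this pipeline can be sketched as below. The arrays here are random stand-ins (7352 is the UCI HAR training-set size; the 32-dimensional codes match Experiment 1's bottleneck), and the measured reduction is illustrative only: real dtypes, file formats, and metadata overheads change the exact percentages reported in Table 1.

```python
import os
import tempfile
import numpy as np

rng = np.random.default_rng(0)

# Random stand-ins for the flattened training windows and the codes a
# trained encoder would produce (shapes follow UCI HAR / Experiment 1).
x_train = rng.normal(size=(7352, 1152))
codes   = rng.normal(size=(7352, 32))

def size_mb(path):
    # st_size is in bytes; divide by 1e6 to report MB, as in the paper.
    return os.stat(path).st_size / 1e6

with tempfile.TemporaryDirectory() as d:
    raw_path  = os.path.join(d, "raw.npy")
    code_path = os.path.join(d, "codes.npy")
    np.save(raw_path, x_train)           # storing everything (baseline)
    np.save(code_path, codes)            # storing only the concise codes
    reduction = (1 - size_mb(code_path) / size_mb(raw_path)) * 100
    print(f"storage reduction: {reduction:.2f}%")

    # Later: reload the codes and hand them to the decoder for reconstruction.
    restored = np.load(code_path)
    assert restored.shape == (7352, 32)
```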
• Experiment 1: The first experiment is performed with a simple MLP deep autoencoder. The MLP encoder has 5 layers of 512, 256, 128, 64, and 32 neurons, respectively. The MLP decoder has 5 layers of 64, 128, 256, 512, and 1152 neurons, respectively. All layers use the Rectified Linear Unit (ReLU) activation function except the last layer of the decoder, which uses the Sigmoid activation function. In the convolutional LSTM autoencoder, all conv1D and LSTM layers use ReLU, except the last two layers of the decoder, which use linear activation functions.
Validation: The validation was based on the accuracy achieved by the convolutional LSTM classifier. The storage reduction in percentage was considered to determine the usability of the approach for optimal storage of the wearable sensor data.
Results: The storage space was calculated using the st_size attribute returned by the Python function os.stat(). The attribute returns the file size in bytes, so it was divided by 1e+6 to obtain the storage space in MB. For all of the experiments, the storage size of the training data was 67.75 MB. Table 1 compares the storage reduction achieved across the experiments.
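Experiment 1's architecture can be written out as a forward pass to make the layer widths and activations concrete. This is a minimal sketch with random weights, not the trained model; the 1152-dimensional input is inferred from the decoder's output width.

```python
import numpy as np

rng = np.random.default_rng(0)

relu    = lambda x: np.maximum(x, 0.0)
sigmoid = lambda x: 1.0 / (1.0 + np.exp(-x))

# Layer widths from Experiment 1 (input width inferred as 1152).
enc_dims = [1152, 512, 256, 128, 64, 32]
dec_dims = [32, 64, 128, 256, 512, 1152]

def make_weights(dims):
    # One weight matrix per consecutive pair of layer widths.
    return [rng.normal(size=(a, b)) * 0.05 for a, b in zip(dims, dims[1:])]

W_enc, W_dec = make_weights(enc_dims), make_weights(dec_dims)

def forward(x):
    for W in W_enc:                  # 5 ReLU encoder layers
        x = relu(x @ W)
    code = x                         # 32-dim concise representation
    for W in W_dec[:-1]:             # 4 ReLU decoder layers
        x = relu(x @ W)
    x = sigmoid(x @ W_dec[-1])       # final decoder layer uses Sigmoid
    return code, x

code, recon = forward(rng.normal(size=(1, 1152)))
print(code.shape, recon.shape)       # (1, 32) (1, 1152)
```

The Sigmoid on the final layer implies inputs scaled to [0, 1], which is consistent with reconstructing normalized sensor windows.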

Discussion
When the concise representation is taken from the encoder part of an autoencoder, it provides the option of reconstructing pseudo-original data comparable to the original. The classification performance on the reconstructed pseudo-original data depended on the size and dimension of the concise representation. For higher dimensions of the concise representation, the classification performance was higher, as demonstrated for the convolutional autoencoder, although the storage reduction in that case was not as good as for the other options. If a good balance between the two is found, this approach can meet the requirements of both storage reduction and reconstruction of pseudo-original data.
The approach has not yet been validated on real-life streaming data, nor has this work considered concept drift in the streaming data. No ablation study was performed on the implemented models.

Conclusion
IoT data is constantly growing, even more so in recent years with the rise of remote health monitoring systems and connected wearable sensors. If efficient storage reduction techniques for streaming data are not adopted, the response time of remote health monitoring systems will eventually degrade. Our approach of storing concise representative features rather than the whole incoming data offers a way to meet the growing need for fast response times in remote healthcare monitoring systems. Through our empirical study of pattern reconstruction, we found that the efficacy of the concise representation depended on its dimension and size. Further exploration and experimentation are needed to verify the preliminary findings of this research work.