Detecting Flashover in a Room Fire based on the Sequence of Thermal Infrared Images using Convolutional Neural Networks

Flashover, a phenomenon accompanying rapid fire propagation in a room, occurs when the hot smoke from a fire accumulates in the upper part of the room. It presents one of the most frightening and challenging situations for firefighters. A typical approach to mitigating and preventing the impact of flashover is to train firefighters to monitor a few common pre-flashover indicators, such as moving dark smoke, high heat, and fire rollover. In actual compartment fire events, these pre-flashover indicators are hard to recognize. Furthermore, determining the exact flashover time simply by observing fire activity is difficult while firefighters have other vital rescue duties to perform. Hence, automatic detection and prediction of flashover in real time are of paramount importance to save lives and reduce damage costs. Flashover prediction is still an open area of research for fire-safety experts. Deep convolutional neural networks currently dominate the field of computer vision, and these state-of-the-art deep learning models have been successfully used in various applications, including object detection, localization, and segmentation. Unlike previous studies that use RGB images, sensors, and gauges, we utilize the power of deep learning techniques to detect flashover from image sequences captured by thermal infrared (IR) cameras. Our experimental results indicate that our proposed approach can not only detect flashover in IR video data with high precision, but can also detect it a few frames before it happens. Our technique is a promising approach that can be used in the future for real-time flashover prediction.


Introduction
A fire that starts in a room can grow and spread to the rest of the building. The growing number of buildings raises the severity of this problem each year, and firefighters routinely encounter room fires in emergency response calls. More than 30,000 firefighters are injured each year during firefighting operations [1]. In a room with modern furniture containing many combustible materials [2], such as chemical fibre/foam products, fire tends to grow fast, leaving only a short period of time for firefighters to rescue, operate, and escape before the room flashes over. When room flashover occurs, all the combustible materials in the room ignite near-simultaneously, since the hot smoke layer accumulated near the ceiling rapidly heats the exposed surfaces of all the items in the room. Figures 1.A to 1.D show a few images of a test compartment fire. As can be seen in Figure 1, it takes less than 3 minutes from ignition to flashover onset.
Failing to recognize signs of an impending flashover could cost the lives of firefighters and building occupants. There is no absolute criterion for flashover prediction, since flashover occurrence depends on several factors, such as the room specifications, the fuels in the room, and the sizes of windows and doors. In a typical room fire, the onset of flashover corresponds to the time when the upper smoke layer in the room reaches a temperature of around 600 °C [3]. Also, at the time of flashover occurrence, the heat flux (HF) from the upper smoke layer reaches approximately 20 kW/m² when measured at the floor [3]. In addition, as another criterion for flashover detection, the onset of flashover can be estimated by monitoring the growth of the Heat Release Rate (HRR). While it varies with the size of the room, doors, and windows, the average HRR value needed for flashover is more than 1000 kW and less than 2000 kW for a small room (the standard compartment described in the ISO 9705 standard [4]) with one open doorway.
Various investigations have been conducted to develop new methods for flashover detection and prediction employing HRR, HF, and temperature data [3-5]. Simulation data from CFD (Computational Fluid Dynamics) modelling and data analysis by machine learning techniques have been used to study flashover occurrence and to predict flashover onset time [6-8]. Many of these studies used temperature, HF, and HRR trends to detect flashover time, the acquisition of which requires various sensors and gauges. The main drawback of these methods is that several sensors, such as thermocouples and heat flux gauges with sufficient high-temperature resistance, must be fixed in advance at specific places in a fire compartment. Therefore, these flashover prediction approaches cannot be easily employed by firefighters in actual fire situations; rather, they are suited to limited analyses of fires in controlled environments and simulations.
With recent advancements in digital technology and computer science, thermal IR (infrared) and digital RGB cameras have become more affordable and powerful than before, and firefighters are now equipped with different types of these cameras (see Figure 1.E). Consequently, the cameras already available on the fireground could provide straightforward solutions for detecting and predicting flashover. In addition, new vision-based solutions can be powered by state-of-the-art machine learning (ML) techniques. For instance, a recent study [9,10] employed RGB and IR cameras to predict flashover in room fires [9], and the authors in [11] proposed a preliminary method for predicting flashover time based on the dynamic change of the measured smoke area in RGB video data. To maximize the data analytics of both RGB and thermal IR data, which is essential in developing a powerful vision-based technique for flashover detection and prediction, features such as smoke temperature must be extracted from the different RGB and thermal IR frames by aligning them to the same field of view and the same scale. To address this problem, [9,10,12] used modified Generative Adversarial Networks (GANs), a recent deep learning method for image transformation, followed by image segmentation techniques, to find superimposed RGB and thermal IR data. The drawback of this idea is that the transformed data may not be completely accurate and reliable, and the whole process is computationally expensive. The authors in [13] studied flashover prediction in image data generated by CFD-based simulations. For these reasons, an approach relying only on IR data or only on RGB video data enables fully automatic solutions that avoid extra loops of image processing.

Method
In this study, we propose a new classifier using deep learning to detect flashover occurrence in room fires captured by a thermal IR camera. Our approach is fully automatic and end-to-end, since it relies only on the IR data without using RGB video data, which enhances the accuracy and speed of the solution. Convolutional Neural Networks (CNNs) are feedforward neural networks with several consecutive layers, inspired by the visual cortex region of the human brain and designed to extract hierarchies of features from digitized data such as images, from low-level to high-level patterns [14]. A typical CNN model consists of an input layer, many convolutional blocks, a few fully connected layers, and an output layer. Each convolutional block consists of convolutional layers and might be followed by pooling, dropout, and batch-normalization layers [14,15]. Mathematically, a CNN is a composition of linear weight multiplications and non-linear operations from input to output. By fine-tuning thousands of weights in the different layers of a CNN, the network can memorize, or be trained to extract, the features of the input data. Depending on the application, the output layer acts as a classifier and classifies input data based on the extracted features. Training a CNN is an iterative optimization process in which the gradient of the error (the difference between the network output and the ground truth) is calculated layer by layer using the chain rule and back-propagated from the output layer to the input layer in each iteration. After several iterations, the network weights are tuned so that they extract features even from novel input images. Detailed descriptions of CNN architectures and the functionality of each layer can be found in the deep learning literature [14-16].
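To make the above concrete, the following is a minimal PyTorch sketch of a generic convolutional block and one training iteration (forward pass, loss, back-propagation, weight update). The layer sizes and the toy classifier head are purely illustrative and are not the architecture used in this work:

```python
import torch
import torch.nn as nn

class ConvBlock(nn.Module):
    """A typical convolutional block: convolution, batch normalization,
    non-linear activation, and pooling, as described in the text."""
    def __init__(self, in_channels, out_channels):
        super().__init__()
        self.conv = nn.Conv2d(in_channels, out_channels, kernel_size=3, padding=1)
        self.bn = nn.BatchNorm2d(out_channels)
        self.relu = nn.ReLU(inplace=True)
        self.pool = nn.MaxPool2d(kernel_size=2)

    def forward(self, x):
        return self.pool(self.relu(self.bn(self.conv(x))))

# One training iteration on dummy data (5 images of size 224x224).
block = ConvBlock(3, 16)
head = nn.Linear(16 * 112 * 112, 2)            # toy two-class head (illustrative only)
optimizer = torch.optim.SGD(list(block.parameters()) + list(head.parameters()), lr=0.001)
criterion = nn.CrossEntropyLoss()

x = torch.randn(5, 3, 224, 224)                # dummy mini-batch
y = torch.randint(0, 2, (5,))                  # dummy labels: 0 = no-flashover, 1 = flashover
loss = criterion(head(block(x).flatten(1)), y)
optimizer.zero_grad()
loss.backward()                                # gradients back-propagated layer by layer
optimizer.step()                               # weights updated from the gradients
```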
In spite of the achievements of CNN models in various computer vision tasks, the number of convolutional layers in the original CNN models was limited to a few. The reason is that the back-propagated gradient, calculated from the output layer toward the input layer, shrinks after each backward layer; for a CNN architecture with hundreds of layers, the gradient effectively reaches zero before it arrives at the input layer. This problem is known as the vanishing gradient. One of the recent enhancements in CNN models to mitigate this problem is a convolutional block called the residual block [17]. Figure 2 shows a sample residual block, in which the gradient can also flow through skip connections between CNN layers to avoid vanishing. This idea has enabled researchers to propose and train much deeper convolutional neural networks, applicable to more complicated computer vision problems than previous CNN models could handle. In the present study, we selected the Residual Network model [17] with 18 layers (ResNet18). We trained ResNet18 over all of its layers to detect flashover in thermal IR image sequences. We also trained ResNet18 using transfer learning [18] by freezing all layers of the network while training only the last fully connected layer. In both training approaches, ResNet18 was initialized with weights pre-trained on the ImageNet dataset [19]. We modified the last layer of ResNet18 to have two neurons. Details of the ResNet18 network used in this study are presented in Figure 3.
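A minimal sketch of the two training configurations described above, using the torchvision ResNet18 implementation; the helper function and its name are ours, and the exact initialization code of this study is not shown here, so treat this as an assumed, representative setup:

```python
import torch.nn as nn
from torchvision import models

def build_resnet18(freeze_backbone: bool) -> nn.Module:
    """ResNet18 initialized with ImageNet weights; the last fully connected
    layer is replaced with a two-neuron classifier (flashover / no-flashover)."""
    model = models.resnet18(pretrained=True)   # newer torchvision: weights="IMAGENET1K_V1"
    if freeze_backbone:
        # Transfer-learning variant: freeze all layers, train only the new fc layer.
        for param in model.parameters():
            param.requires_grad = False
    model.fc = nn.Linear(model.fc.in_features, 2)  # new layer; trainable by default
    return model

model_full = build_resnet18(freeze_backbone=False)      # all layers trained
model_transfer = build_resnet18(freeze_backbone=True)   # only the last layer trained
```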

Experiments and Discussions

Dataset Preparation
In order to train, validate, and test our ResNet18 model, we gathered thermal IR image frames captured from six room fires conducted by the Fire Safety Unit of the National Research Council of Canada (NRC) [20] for the Characterization of Fires in Multi-Suite Residential Dwellings (CFMRD) project. Thermal IR video and sensor data were acquired from the six tests, which used different room specifications and materials. By synchronizing the thermal IR videos with the sensor data, we could locate the flashover onset and determine the corresponding IR frame. We used four videos for training and validation, split 90% and 10%, respectively. Table 1 provides metadata of the videos used in our experiments (IR video name, utilization, burning items, and the number of IR frames from ignition to the end of flashover). To test our trained ResNet18 models, we considered two unseen full thermal IR videos: 23-SI-76, which contains flashover (Figure 4.A), and 22-SI-22 (Figure 4.B), which does not. As can be seen from Figure 4, both videos have similar scene and fire conditions; however, analysis of the sensor data and thermal IR video verified that flashover did not occur in test 22-SI-22. It is worth noting that thermal IR videos contain thermal information whose pixel intensities are not in the [0-255] range of the RGB video data used in typical computer vision applications. Normalizing the thermal IR frames into the [0-255] range made them usable in a format similar to RGB while preserving the original thermal information as intensity levels: the higher the intensity, the higher the temperature at those pixels. Table 1 also itemizes the number of thermal IR frames used in our experiments, as well as the frame identified as the flashover onset, which was determined using the common flashover criteria (temperature, HF, and HRR). The details of the test room specifications, the number of sensors, the types of burning materials in each room, and the test data can be found in the original CFMRD report [20]. Our train/validation dataset contains 222 frames of no-flashover scenes and 93 flashover frames, 315 thermal IR frames in total. Consequently, the dataset is not balanced between the no-flashover and flashover classes.
To mitigate the problem of the small dataset size, we used simple data augmentation consisting of normalization, random and center cropping with resizing, and horizontal flipping.
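A sketch of this kind of augmentation pipeline using torchvision transforms; the exact parameter values used in this study are not reported, so the crop sizes and the ImageNet normalization statistics below are assumptions (chosen to match the pre-trained ResNet18 input convention):

```python
from torchvision import transforms

# Training-time augmentation: random crop/resize and horizontal flip, then normalization.
train_transform = transforms.Compose([
    transforms.RandomResizedCrop(224),
    transforms.RandomHorizontalFlip(),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])

# Validation/test-time: deterministic resize and center crop only.
eval_transform = transforms.Compose([
    transforms.Resize(256),
    transforms.CenterCrop(224),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])
```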

Experiments and Results
As mentioned in the previous section, we developed two versions of ResNet18: one trained over the whole network, and one trained only in its last layer using the transfer learning method, both initialized with pre-trained weights from the ImageNet dataset [19]. The input size of both networks is 224×224, and we selected the cross-entropy loss function and the Stochastic Gradient Descent (SGD) optimization algorithm [15] for training and validation. The SGD learning rate starts at 0.001 and is decreased by a factor of gamma = 0.1 every seven iterations, with a momentum of 0.9. We used mini-batch training, randomly selecting mini-batches of size 5, for 25 epochs for both ResNet18 models. The models with the best loss and accuracy values in the validation stage were saved for the subsequent testing stage. All experiments were conducted on a tower PC equipped with an NVIDIA Titan RTX GPU, implemented in Python using the PyTorch deep learning library.
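A sketch of this optimization setup in PyTorch (cross-entropy loss, SGD with momentum 0.9, and step decay of the learning rate by gamma = 0.1). It assumes a `model` such as the ResNet18 built in the earlier sketch and a `train_loader` DataLoader yielding mini-batches of size 5; the use of `StepLR` and stepping the scheduler once per epoch are our assumptions about how the decay was scheduled:

```python
import torch.nn as nn
import torch.optim as optim
from torch.optim import lr_scheduler

criterion = nn.CrossEntropyLoss()
optimizer = optim.SGD(model.parameters(), lr=0.001, momentum=0.9)
# Decay the learning rate by a factor of gamma = 0.1 every 7 scheduler steps.
scheduler = lr_scheduler.StepLR(optimizer, step_size=7, gamma=0.1)

num_epochs = 25
for epoch in range(num_epochs):
    for inputs, labels in train_loader:        # mini-batches of size 5
        optimizer.zero_grad()
        outputs = model(inputs)                # two logits per frame
        loss = criterion(outputs, labels)
        loss.backward()
        optimizer.step()
    scheduler.step()
```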
Our flashover detection experiments revealed that both ResNet18 models performed similarly on the test video 23-SI-76. In contrast, on the test video 22-SI-22, the ResNet18 fine-tuned by transfer learning performed better, with fewer false flashover detections. Figures 5 to 8 present the flashover predictions of our method compared with the sensor data from the room fire tests. In the figures, the flashover period is marked from its onset as estimated from the sensor data. The temperature of the near-ceiling region, the HRR, and the heat flux at the floor are plotted and compared with our model's flashover probability, since fire-safety experts consider all flashover criteria together (i.e., the temperature of the near-ceiling region, HRR, door and window specifications, and the heat flux at floor level). Figures 5 and 6 show that ResNet18 detects the impending flashover considerably earlier than the actual flashover onset verified by the test sensor data: on average 41 seconds (21 frames) in advance, with an average flashover-detection probability of 94.4%. One reason for this early detection is that, viewed frame by frame, flashover onset is not a sharp phenomenon; pre-flashover frames are highly visually correlated with frames in the flashover region, which our model captures well.
Our experimental results also show that the sharp increase in the ResNet18 predictions before the actual flashover (as in Figure 5) can be generalized into a promising solution for flashover prediction. To evaluate the performance of the ResNet18 models in situations where the fire does not grow to the flashover level, we conducted the same comparison study for the test video 22-SI-22; the sensor data from that test verified that no flashover happened in the 22-SI-22 experiment. As compared in Figures 7 and 8, the ResNet18 model trained by the transfer learning method performed better than the ResNet18 model trained from scratch. This is not an unexpected conclusion, since many studies have shown that training a deep learning model with transfer learning and domain adaptation increases the performance of the model in most cases.
Figure 5. Results of flashover detection by the ResNet18 model trained from scratch, compared with the flashover period verified by the sensor data from the room fire test 23-SI-76.
Figure 6. Results of flashover detection by the ResNet18 model with only its last layer trained by the transfer learning technique, compared with the flashover period verified by the sensor data from the room fire test 23-SI-76.
There are plenty of predetermined classification and detection metrics (see Table 2) that allow us to assess the performance of our deep learning models. To calculate those metrics, the first step is to determine the confusion matrix for each experiment, which presents the number of true positive (TP), true negative (TN), false positive (FP), and false negative (FN) predictions. Binary classification is the particular case where there are just two classes of data, in our case flashover (positive) and non-flashover (negative). Since the output of the last dense layer of the deep learning models is the probability of flashover occurrence in every frame of the test videos, a threshold value must be applied to convert the probabilities into binary values. For this reason, the values of the classification metrics depend on the selected threshold. We evaluated our ResNet18 models on the two test thermal videos. Figure 9 presents the classification metric rates for each experiment.
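As an illustration, a small sketch of how the per-frame flashover probabilities can be thresholded into binary predictions and summarized as confusion-matrix counts; the function name and the default threshold are ours, not taken from the study:

```python
import numpy as np

def confusion_counts(probs, labels, threshold=0.5):
    """Threshold flashover probabilities into binary predictions and count
    TP, TN, FP, FN (label 1 = flashover, 0 = non-flashover)."""
    preds = (np.asarray(probs) >= threshold).astype(int)
    labels = np.asarray(labels)
    tp = int(np.sum((preds == 1) & (labels == 1)))
    tn = int(np.sum((preds == 0) & (labels == 0)))
    fp = int(np.sum((preds == 1) & (labels == 0)))
    fn = int(np.sum((preds == 0) & (labels == 1)))
    return tp, tn, fp, fn
```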
The False Positive Rate (FPR) (Equation 1), also known as the type I error, shows the fraction of false alerts produced by our model. It is important for fire-safety research because false-positive alarms raised by the system would be a massive waste of fire-safety resources. The False Negative Rate (FNR) (Equation 2), also known as the type II error, is one of the most critical metrics for fire-safety research; it shows what portion of the flashover frames were predicted by the model as non-flashover. The True Negative Rate (TNR) (Equation 3), or Specificity, measures how many of all the non-flashover cases the model classified correctly as non-flashover. Similarly, the True Positive Rate (TPR) (Equation 4), or Recall, also known as Sensitivity, measures how many flashover cases were correctly classified as flashover. These four metrics are rarely used alone to evaluate classification results; rather, they are used to calculate more reliable metrics. From Figure 9, both models show relatively low FPR and FNR values and high TNR and TPR values for thresholds between 30 and 70, so they can be considered good classifiers. The FPR rarely reached zero; the reason is that there is no clear, sharp border between the flashover and non-flashover regions in our test thermal IR video data. This also indicates that our system should work well for fire cases similar to the ones in our dataset. At the same time, the imbalanced nature of our dataset, with fewer positive samples than negative ones, is the reason for the non-zero FNR at thresholds above 30. We calculated the Positive Predictive Value (PPV) (Equation 5), or Precision, to measure how many of the predicted flashover cases are in fact actual flashovers. Our experimental results show that our models can reach a precision of 0.8, which is good for a dataset with a limited number of samples. The Accuracy (ACC) metric (Equation 6) can give an overly favourable picture on an imbalanced dataset like ours. Nevertheless, we determined the accuracy by measuring how many cases, both flashover and non-flashover, were correctly classified. Our best ResNet18 model reached an average accuracy of 91.3%, which is considerable. From the combination of precision and recall, we obtain the F1 score (Equation 7), one of the most commonly used metrics for classification tasks in machine learning. The higher the F1 score, the better the classification model performs. The F1 score is also crucial for judging performance on datasets with unbalanced samples, and thus it tells more than other metrics, such as accuracy, given the nature of our dataset. Our best ResNet18 model reached an average F1 score of 84.1%. From the F1 score, we can also determine which threshold value should be selected for our experiments. It can be seen from Figure 9 that the results of our models do not change significantly as the threshold varies, which indicates the robustness of our ResNet18 models.
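For reference, the metrics referred to above as Equations 1-7 are the standard definitions in terms of TP, TN, FP, and FN; the numbering below follows the order in which they are cited in the text:

```latex
\begin{align}
  \mathrm{FPR} &= \frac{FP}{FP + TN} \tag{1} \\
  \mathrm{FNR} &= \frac{FN}{FN + TP} \tag{2} \\
  \mathrm{TNR} &= \frac{TN}{TN + FP} \tag{3} \\
  \mathrm{TPR} &= \frac{TP}{TP + FN} \tag{4} \\
  \mathrm{PPV} &= \frac{TP}{TP + FP} \tag{5} \\
  \mathrm{ACC} &= \frac{TP + TN}{TP + TN + FP + FN} \tag{6} \\
  \mathrm{F1} &= \frac{2 \cdot \mathrm{PPV} \cdot \mathrm{TPR}}{\mathrm{PPV} + \mathrm{TPR}} \tag{7}
\end{align}
```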
There is a trade-off between many classification metrics; for instance, TPR and FPR are in tension. To visualize this trade-off, we plot TPR against FPR for every threshold value in a graph known as the Receiver Operating Characteristic (ROC) curve. Across all threshold values, a larger Area Under the ROC Curve (AUC) means better classification results. Similar to the F1 score, the AUC has the advantage of revealing the real performance on an unbalanced dataset by considering the impact of both the positive and the negative samples, and is therefore important for our model. Likewise, we can analyze the trade-off between precision and recall and determine the AUC of that curve as well. Figure 10 shows the ROC and Precision-Recall graphs for both ResNet18 models on the test thermal IR video 23-SI-76, where the AUC for the ResNet18 model trained from scratch was 0.88, and the model whose last layer was fine-tuned by transfer learning reached an AUC of 0.93.
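A sketch of how the ROC and Precision-Recall curves and their AUC values can be computed from per-frame probabilities, here using scikit-learn; the library choice and the dummy arrays standing in for the model output are our assumptions:

```python
import numpy as np
from sklearn.metrics import roc_curve, auc, precision_recall_curve

# labels: 1 = flashover, 0 = non-flashover; probs: per-frame flashover probabilities.
# The arrays below are dummy placeholders standing in for the model's test output.
labels = np.array([0, 0, 0, 1, 1, 1])
probs = np.array([0.10, 0.40, 0.60, 0.70, 0.80, 0.95])

fpr, tpr, _ = roc_curve(labels, probs)
roc_auc = auc(fpr, tpr)                          # area under the ROC curve

precision, recall, _ = precision_recall_curve(labels, probs)
pr_auc = auc(recall, precision)                  # area under the Precision-Recall curve

print(f"ROC AUC = {roc_auc:.2f}, Precision-Recall AUC = {pr_auc:.2f}")
```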

Conclusion
In this study, we showed that deep convolutional neural networks, specifically the ResNet18 model, can be trained on a small number of thermal IR images using a domain adaptation approach to detect flashover onset in thermal IR videos. Unlike previous attempts at using deep learning methods for flashover prediction, our approach is end-to-end and fully automatic, without any extra data processing, video transformations, or extensive data augmentation. The temperature of the smoke layer near the ceiling and the floor heat flux gathered by sensors and gauges, as well as the heat release rate calculated from the room fire tests, were used to determine the exact occurrence of flashover in terms of time and the corresponding IR frame in our dataset. Two thermal IR videos were selected for evaluating our proposed deep learning methods: one captured from a room fire test with no flashover and another from a room fire test with flashover. The comparison between the network predictions for the two test thermal IR videos and the sensor data from the room fire tests indicates that the ResNet18 model can detect the impending flashover earlier than the actual flashover onset. We also calculated a comprehensive list of classification performance metrics to evaluate our methods. The ResNet18 models reached an average accuracy of 91% with an ROC AUC of 0.93, which is a considerable result relative to our small dataset. As a future path, our new vision-based technique, benefiting only from thermal IR video data, is a promising solution for predicting flashover well before it actually occurs.