Robin Yadav, Year 2 Engineering.
Abstract
Fire disasters are dangerous and damaging events that require accurate surveillance and monitoring for detection and mitigation. In this paper, I propose a video-based fire detection system using the YOLOv3 object detection model. The model was trained on a dataset of approximately 3000 images of fire in various contexts, along with additional augmented data. YOLOv3 achieved an AP of 89.5% on a test set consisting of fire in high-risk and emergency situations and a 97% AP on a test set consisting of single-flame images. Data preparation significantly impacts the performance of the model, suggesting that certain labeling and augmentation techniques make the patterns and features of fire more distinct and recognizable. Due to the high AP and fast inference speed of the model, this system is viable for real-time monitoring and fire detection on a Raspberry Pi.
Introduction
In Canada alone, there are approximately forty-four thousand instances of fire every year, resulting in an average of four hundred deaths and three thousand injuries (Canadian Center for Justice Statistics, 2017). Fire disasters destroyed over a hundred structures between 2010 and 2014 (Statistics Canada, 2017). It is critical to have an accurate fire detection system that can help mitigate the effects of a catastrophic fire and provide monitoring information to firefighters. Most previous video-based fire detection systems use hand-crafted features such as the spatial, temporal, and color characteristics of fire to perform recognition (Phillips et al. 2002, Toulouse et al. 2015, Toulouse et al. 2017). In recent years, however, there has been interest in leveraging CNNs (Convolutional Neural Networks) for fire image classification. A CNN adapted from GoogLeNet was used to classify images as fire or non-fire (Muhammad et al. 2018). Other developments include a CNN which performs patch identification on images of fire (Zhang et al. 2016). However, these CNN classifiers lack localization functionality, so they cannot identify where the fire is in an image. In this paper, I propose a video-based fire detection system using state-of-the-art deep learning object detection models. YOLOv3 (You Only Look Once) and YOLOv3 Tiny (Redmon and Farhadi, 2018) were chosen and compared to determine the most suitable model for fire detection. The advantage of YOLO is that it can identify the position of the fire within an image without the use of hand-crafted features. This approach allows for robust and accurate fire detection while still maintaining the ability to run in real time on low-cost devices (Vidyavani et al. 2019). The model is also very versatile because the training data includes fires in many different contexts, from small localized fires to larger forest and bush fires. Implementing the model on a Raspberry Pi offers a lightweight and accurate fire detection system.
Materials and Methods
Digital images of fire were web-scraped or collected from smaller pre-existing datasets. The FireSmoke, FireDetectionImage, and FlickrFireSmoke datasets were used (DeepQuestAI 2019, Cair 2017, Cazzolato et al. 2017). Images with low resolution (below 200 by 200 pixels), low fire visibility, or inaccurate representations of fire were discarded. This includes images in which the fire was completely obstructed by smoke or was too small and blurred to be recognized distinctly as fire. An example of a discarded image is shown in Figure 1.
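Of these criteria, only the resolution check is easily automated; the visibility and blur criteria still require manual judgment. A minimal sketch of the resolution filter, assuming Pillow and a flat directory of JPEG images (both assumptions, since the original tooling is not specified), could look like this:

```python
# Minimal sketch of the resolution filter described above.
# The directory layout and use of Pillow are assumptions.
from pathlib import Path
from PIL import Image

MIN_SIDE = 200  # images below 200 x 200 pixels were discarded

def filter_low_resolution(image_dir):
    """Return the paths of images that meet the minimum resolution."""
    kept = []
    for path in Path(image_dir).glob("*.jpg"):
        with Image.open(path) as img:
            width, height = img.size
        if width >= MIN_SIDE and height >= MIN_SIDE:
            kept.append(path)
    return kept

if __name__ == "__main__":
    print(f"{len(filter_low_resolution('raw_fire_images'))} images kept")
```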

Figure 1: The only distinguishing feature of the fire in this picture is its light. However, the light emitted by the fire occupies only a small portion of the image and is very similar to the other, non-fire light sources.

Figure 2: The top 3 images are from the 50 image single flame test set. The bottom 3 images are from the high risk emergency fire situations test set.
Negative examples containing red objects, e.g. fire hydrants, were web-scraped and added to the dataset. The dataset contained a total of 3057 images: 2732 images of fire (totaling 8000 instances) and 325 images without fire. A test set of 100 images of fire in high-risk emergency situations and another test set of 50 single-flame fire images, e.g. campfires, were compiled. Images from both test sets are displayed in Figure 2.
The images were annotated using Microsoft VoTT (Visual Object Tagging Tool) to specify bounding boxes around the fire objects. Fire was annotated using two different labeling strategies. In the first approach, the individual flames of a whole fire were annotated separately. In the second approach, the fire was left unsegmented and annotated as a whole, regardless of the fact that it was composed of individual flames. The two labeling approaches are presented in Figure 3.

Figure 3: The two images on the left show segmented labelling, where the fire is labelled according to its individual flames. The two images on the right show their unsegmented counterparts, where the whole fire is labelled.
Data augmentation was used to increase the size of the dataset. Image cropping, translation, rotation, reflection, and HSV (hue, saturation, value) transformations were applied. Different variations of the dataset were created based on the amount and type of augmented images.
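The sketch below illustrates two of these transformations (horizontal reflection and an HSV shift) using OpenCV. The exact augmentation code used for this dataset is not published, so the parameter values are assumptions; note that geometric transforms such as reflection require the bounding boxes to be adjusted, while HSV shifts leave them unchanged.

```python
# Illustrative augmentation sketch; parameter values are assumptions.
import cv2
import numpy as np

def reflect_horizontal(image, boxes):
    """Mirror the image and flip YOLO-format (x_center, y_center, w, h) boxes."""
    flipped = cv2.flip(image, 1)
    flipped_boxes = [(1.0 - xc, yc, w, h) for xc, yc, w, h in boxes]
    return flipped, flipped_boxes

def shift_hsv(image, hue_shift=10, sat_scale=1.2, val_scale=1.0):
    """Apply a hue/saturation/value transformation; boxes are unchanged."""
    hsv = cv2.cvtColor(image, cv2.COLOR_BGR2HSV).astype(np.float32)
    hsv[..., 0] = (hsv[..., 0] + hue_shift) % 180           # OpenCV hue range is 0-179
    hsv[..., 1] = np.clip(hsv[..., 1] * sat_scale, 0, 255)
    hsv[..., 2] = np.clip(hsv[..., 2] * val_scale, 0, 255)
    return cv2.cvtColor(hsv.astype(np.uint8), cv2.COLOR_HSV2BGR)
```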
A 90%-10% split was used to form the training and validation sets. Transfer learning (Pascal VOC pre-trained weights) was used during training; YOLOv3 Tiny was trained with TensorFlow, while YOLOv3 was trained with DarkNet. The images were resized to 416 by 416 pixels and the batch size was varied between 16 and 128. An Intel i5-6200U CPU was used to perform inference and evaluate the models using the mAP (mean Average Precision) metric.
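As an illustration, the split can be produced with a few lines of Python; the directory layout and file names below are assumptions, and the output list files follow the plain-text, one-path-per-line format that DarkNet expects for its training and validation image lists.

```python
# Sketch of the 90%-10% training/validation split; paths are assumptions.
import random
from pathlib import Path

def split_dataset(image_dir, val_fraction=0.1, seed=0):
    images = sorted(Path(image_dir).glob("*.jpg"))
    random.Random(seed).shuffle(images)
    n_val = int(len(images) * val_fraction)
    val, train = images[:n_val], images[n_val:]
    # DarkNet reads plain text files listing one image path per line.
    Path("train.txt").write_text("\n".join(str(p) for p in train))
    Path("valid.txt").write_text("\n".join(str(p) for p in val))
    return len(train), len(val)
```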

Figure 4: Schematic of the Raspberry Pi fire detection system.
A Raspberry Pi 3 was used to run real-time inference with YOLOv3 Tiny. A camera connected to the Raspberry Pi was activated when a signal from the smoke detector was received. Inference was performed on the video data, and an alarm was activated when fire was detected. Figure 4 shows a schematic of the detection system.
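For concreteness, a hypothetical sketch of this loop is shown below, using OpenCV's DNN module to run the DarkNet weights and RPi.GPIO for the smoke-detector and alarm pins. The pin numbers, file names, and confidence threshold are illustrative assumptions rather than details of the actual system.

```python
# Hypothetical sketch of the Raspberry Pi loop in Figure 4.
import cv2
import RPi.GPIO as GPIO

SMOKE_PIN, ALARM_PIN = 17, 27  # assumed wiring, not from the paper

def detect_fire(net, frame, conf_threshold=0.5):
    """Return True if YOLOv3 Tiny finds fire in the frame above the threshold."""
    blob = cv2.dnn.blobFromImage(frame, 1 / 255.0, (416, 416), swapRB=True, crop=False)
    net.setInput(blob)
    for output in net.forward(net.getUnconnectedOutLayersNames()):
        # each row: [x, y, w, h, objectness, class score] for the single fire class
        if (output[:, 4] * output[:, 5]).max() > conf_threshold:
            return True
    return False

net = cv2.dnn.readNetFromDarknet("yolov3-tiny-fire.cfg", "yolov3-tiny-fire.weights")
GPIO.setmode(GPIO.BCM)
GPIO.setup(SMOKE_PIN, GPIO.IN)
GPIO.setup(ALARM_PIN, GPIO.OUT)
camera = cv2.VideoCapture(0)

try:
    GPIO.wait_for_edge(SMOKE_PIN, GPIO.RISING)  # wait for the smoke detector signal
    while True:
        ok, frame = camera.read()
        if ok and detect_fire(net, frame):
            GPIO.output(ALARM_PIN, GPIO.HIGH)   # activate the alarm
finally:
    camera.release()
    GPIO.cleanup()
```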
Results

Figure 5: Unsegmented fire labelling leads to significantly greater model performance at every measured batch size. The average difference in mAP between segmented and unsegmented fire labelling is 17.7%.
YOLOv3 Tiny is used to experiment with the data. The AP values are measured at a confidence threshold of 10%, and predictions are made on the 100-image test set. Figure 5 shows the difference in AP between segmented and unsegmented fire labeling with no data augmentation. The average AP was calculated over IoU thresholds from 10% to 50% with a step size of 10%. Unsegmented labeling performs better in every case.
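For clarity, the sketch below shows how this AP@[0.1:0.1:0.5] figure is formed, assuming the per-threshold AP values have already been produced by the evaluation tool; the IoU helper and the example numbers are purely illustrative.

```python
# Sketch of the averaged-AP metric used in this section; numbers are placeholders.
def iou(box_a, box_b):
    """Intersection over union for boxes given as (x1, y1, x2, y2)."""
    x1, y1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    x2, y2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0, x2 - x1) * max(0, y2 - y1)
    union = ((box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
             + (box_b[2] - box_b[0]) * (box_b[3] - box_b[1]) - inter)
    return inter / union if union else 0.0

def average_ap(ap_per_threshold):
    """AP@[0.1:0.1:0.5] is the mean of AP at IoU thresholds 0.1, 0.2, ..., 0.5."""
    return sum(ap_per_threshold.values()) / len(ap_per_threshold)

# Hypothetical per-threshold values, for illustration only:
print(average_ap({0.1: 0.70, 0.2: 0.68, 0.3: 0.66, 0.4: 0.62, 0.5: 0.55}))
```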

Figure 6: The highest performance of YOLOv3 Tiny with segmented labelling was obtained when 25% of the raw data was augmented, with a batch size of 16. Augmenting 25% of the raw data outperforms 50% augmentation by a slight margin in two of the three tests. Augmented data leads to higher mAP in almost every case. Note that HSV augmentation is applied here.
Further testing was done to determine whether data augmentation could improve the performance of segmented labelling. Figure 6 compares the performance of YOLOv3 Tiny on different amounts of augmented data with HSV transformations, and Figure 7 compares the performance without HSV transformations.

Figure 7: Data augmentation without HSV transformations increases the performance of YOLOv3 Tiny compared to no augmentation across all batch sizes. The greatest performance is obtained when 50% of the raw data is augmented.
Disabling HSV transformations increases model performance by 1.1%. A more significant effect is observed with unsegmented fire labeling. Figure 8 highlights the difference in AP for unsegmented fire labeling without HSV transformations.

Figure 8: Data augmentation increases the performance of the YOLOv3 Tiny model with unsegmented fire labelling. Specifically, the highest model performance is obtained when 50% of the raw data is augmented.
The highest AP achieved with unsegmented fire labeling without HSV transformations is 64.3%, an 8.4% increase over segmented fire labeling. Interestingly, when the entire dataset is augmented, an AP@[0.1:0.1:0.5] of 64.6% is obtained, which is on par with augmenting 50% of the dataset.
Around 400 additional images of fire were collected and labeled in the unsegmented format. In Figure 9, the performance with the additional images is compared to the performance with the original amount of raw training data.

Figure 9: YOLOv3 Tiny trained on 2700 raw images consistently performs better than when trained on 2500 raw images.
YOLOv3 (trained with the optimal parameters observed for YOLOv3 Tiny) obtains a significantly higher AP@[0.1:0.1:0.5] of 93.3% on the test set. The AP at IoU 0.5 is 89.5%, which is twice the AP obtained by YOLOv3 Tiny at IoU 0.5. On the single-flame test set, YOLOv3 obtained a 97.6% mAP at IoU 0.5 and the Tiny version obtained 84.5% mAP at IoU 0.5.
The large increase in mAP prompted further testing of the DarkNet-based YOLOv3 Tiny model. The DarkNet implementation obtains an AP@[0.1:0.1:0.5] of 79.4%, a 19.2% improvement over the TensorFlow version. Additionally, the mAP at IoU 0.5 is 68.2%, a 51.6% improvement over the TensorFlow version.
The average inference speeds for YOLOv3 Tiny and YOLOv3 are 0.16 seconds and 1.3 seconds per image, respectively.
Discussion
It was determined that the unsegmented labeling approach with augmented data (no HSV transformation) resulted in the greatest AP value. YOLOv3 had the highest performance with an AP of 89.5% on the test set of emergency fire images and 97.6% on the single flame images.
Perhaps annotating fire with bounding boxes is not the most suitable method for representing the features of fire and is degrading model performance. Certain objects are well defined by a rectangular box, e.g. cars, pedestrians, and signs, but fire is often very irregular. Polygon or semantic annotation could be used to eliminate ambiguities and difficult edge cases in fire labeling, and these approaches also provide a richer representation of objects (Endres et al. 2010). Similarly, some object detection applications in medical imaging show that performance is better for brain detection than for lung detection because the shape of the brain is better approximated by a bounding box (Rajchl et al. 2016).
Many ambiguities arise when labeling fire with bounding boxes. For example, two flames connected at the base can be labeled as two independent flames or as one large flame. Segmenting fire also creates additional patterns the object detector must recognize: there are more shapes, sizes, illumination conditions, and angles to consider. Notably, contextual information about the location and shape of the whole fire is required for segmentation. In contrast, labeling a region of fire in its entirety, without flame segmentation, decreases the complexity of the task and allows the model to recognize consistent patterns in the data.
Usually, an increase in augmented data prevents the model from overfitting during training, which leads to better performance (Shorten and Khoshgoftaar 2019). However, an increase in augmented data in the segmented fire dataset does not consistently improve the performance of YOLOv3 Tiny. This inconsistency might emerge from the fact that segmented fire detection is already a complex task; increasing the amount of augmented data can introduce unnecessary noise and irregularity, making object detection more difficult.
Furthermore, model performance increased when HSV transformations were removed. Since most of the features of fire lie in the red part of the spectrum, HSV transformations create unrealistic data by shifting the color of the fire, which adds variance and leads to inconsistent results.
Interestingly, there is not a significant difference in AP even when data augmentation is applied to the unsegmented fire dataset, possibly because there is not enough augmented data to create large changes in performance. The greatest ratio of augmented data to raw data was 1:1; perhaps a larger ratio, such as 2:1 or 3:1, is needed to observe significant effects of data augmentation. In addition, there was only one test for every batch size and ratio. Testing each batch size and ratio three to four times and averaging the results would give a better indication of performance on the dataset.
There is a notable difference in AP between the emergency fire test set and the single-flame test set. Since a single flame has a much simpler structure and shape than a multi-flame fire, its patterns are more recognizable and distinct, making it easier to detect. Fire in emergency situations is also usually clouded by smoke and is less distinct against the background, making it more difficult to detect. The ability to perform well on both test sets reflects the difficulty and diversity of the training data. The roughly 3000 images in the dataset include fire in many different contexts and visibility settings. Large fires, such as forest, bush, and house fires, are included alongside smaller, more localized fires. Therefore, the model is able to generalize and perform well on difficult and diverse test sets. Other object detection approaches either lack a large dataset (500-1000 images) or use a dataset that is too homogeneous (Sucuoğlu et al. 2019, Barmpoutis et al. 2019). The 97% AP obtained by YOLOv3 on the single-flame test set demonstrates that it could be used in real settings, i.e. homes, offices, and other public areas where fire usually begins as a small flame. It is also versatile enough to perform in settings where there is a large, high-risk fire, such as a forest fire. Figure 10 shows some examples of the detections made on both test sets.

Figure 10: The top 3 images show detections from the single flame test set. The bottom 3 images are from the emergency fire test set. These detections demonstrate the model’s ability to perform in many different settings and conditions.
Another important consideration is the speed-accuracy trade-off between YOLOv3 and YOLOv3 Tiny. YOLOv3 is fast enough for real-time detection but is not at the level of human visual cognition. YOLOv3 Tiny can operate well within that speed range, although with lower accuracy. Differences in accuracy and speed between DarkNet and TensorFlow are most likely due to the framework implementations.
While YOLOv3 obtains an AP at IoU 0.5 in the 90% range, inference can be improved by applying some computationally inexpensive post-processing steps. Detections can be processed to remove false positives that emerge when a smaller flame is detected separately within a larger flame. Also, the confidence threshold of the detections could be changed dynamically when the model runs on a video feed: if consecutive frames contain high-confidence detections, the threshold could be lowered to detect more fire, and if consecutive frames contain only low-confidence detections, the threshold could be raised to eliminate them. YOLOv3 could also be used for the initial detection, and once the detections surpass a 90% confidence threshold (a very high chance there is fire), the system could switch to the Tiny version for faster performance.
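A minimal sketch of the first two of these post-processing ideas is given below; the containment ratio, threshold step, and window size are assumptions chosen only for illustration.

```python
# Sketch of nested-box removal and a dynamic confidence threshold.
from collections import deque

def overlap_ratio(inner, outer):
    """Fraction of `inner` (x1, y1, x2, y2) covered by `outer`."""
    x1, y1 = max(inner[0], outer[0]), max(inner[1], outer[1])
    x2, y2 = min(inner[2], outer[2]), min(inner[3], outer[3])
    inter = max(0, x2 - x1) * max(0, y2 - y1)
    area = (inner[2] - inner[0]) * (inner[3] - inner[1])
    return inter / area if area else 0.0

def drop_nested_boxes(boxes, containment=0.9):
    """Remove detections that lie almost entirely inside another detection."""
    kept = []
    for i, box in enumerate(boxes):
        if not any(j != i and overlap_ratio(box, other) > containment
                   for j, other in enumerate(boxes)):
            kept.append(box)
    return kept

class DynamicThreshold:
    """Lower the threshold after confident frames, raise it after weak ones."""
    def __init__(self, base=0.5, step=0.05, window=5):
        self.threshold, self.step = base, step
        self.history = deque(maxlen=window)

    def update(self, max_confidence):
        self.history.append(max_confidence)
        if len(self.history) == self.history.maxlen:
            if all(c > self.threshold for c in self.history):
                self.threshold = max(0.1, self.threshold - self.step)
            elif all(c < self.threshold for c in self.history):
                self.threshold = min(0.9, self.threshold + self.step)
        return self.threshold
```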
Further improvements could be made by increasing the size of the dataset. Manual and automated web scraping could be used to find an additional 1000 to 2000 images. A pseudo-labeling process can then be executed by having the already trained model run predictions on the new data; any bad predictions are relabeled manually, and the model is retrained on the larger dataset.
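A rough sketch of such a pseudo-labeling pass is shown below; the predict() callable and the 0.7 confidence cutoff are hypothetical placeholders for whichever inference code and acceptance rule are actually used.

```python
# Sketch of the pseudo-labeling workflow; predict() and the cutoff are placeholders.
def pseudo_label(image_paths, predict, confidence_cutoff=0.7):
    """Split new images into auto-labelled ones and ones needing manual review."""
    auto_labelled, needs_review = [], []
    for path in image_paths:
        detections = predict(path)  # list of (box, confidence) pairs
        if detections and all(conf >= confidence_cutoff for _, conf in detections):
            auto_labelled.append((path, detections))
        else:
            needs_review.append(path)
    return auto_labelled, needs_review
```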
To improve inference speed on lightweight, low-power devices, YOLOv3 can be converted to optimized frameworks such as TensorFlow Lite or TensorFlow with Intel MKL for model deployment.
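For example, assuming the trained network has first been exported to a TensorFlow SavedModel (the path below is an assumption), conversion to TensorFlow Lite is a short script:

```python
# Minimal sketch of a TensorFlow Lite conversion; the SavedModel path is assumed.
import tensorflow as tf

converter = tf.lite.TFLiteConverter.from_saved_model("yolov3_tiny_fire_saved_model")
converter.optimizations = [tf.lite.Optimize.DEFAULT]  # enable post-training optimization
tflite_model = converter.convert()

with open("yolov3_tiny_fire.tflite", "wb") as f:
    f.write(tflite_model)
```

The resulting .tflite file can then be loaded on the Raspberry Pi with tf.lite.Interpreter.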
While a deep learning model on a Raspberry Pi is relatively lightweight and portable, the cost of the system is a drawback. For a household, a smoke alarm is much more affordable than this system. Therefore, its practical application is in commercial and large buildings, e.g. warehouses. It can be integrated with existing smoke alarms to reduce the high number of false positives they produce (Ruttiman 2014). Furthermore, outdoor settings can benefit from this system, since smoke alarms have little range outdoors while video cameras do not. Although this method can still suffer from false positives, it is feasible for a human to check the video feed from a remote location to determine whether there is an emergency.
Future research can test different object detection models and compare them to YOLOv3, specifically SSD (Single Shot Detector) and Faster R-CNN (Region-based Convolutional Neural Network) (Liu et al. 2016, Girshick et al. 2014). Additionally, this system can be modified to include algorithms that use hand-crafted fire features as post-processing steps after the deep learning model.
In this paper, I have presented a lightweight, video-based fire detection system that leverages the advantages of deep learning. I experimented with data preparation and model parameters to optimize the performance of YOLOv3 and YOLOv3 Tiny. This system has the potential to make fire detection in outdoor areas, offices, and other buildings more reliable. Furthermore, the portability and versatility of this fire detection system allow it to be integrated with other firefighting technology. Fire detection combined with surveillance and monitoring equipment can provide critical information to firefighters, helping them evaluate situations and fight fires.
References
Canadian Center for Justice Statistics. Fire statistics in Canada, Selected Observations from the National Fire Information Database 2005 to 2014. Govt. of Canada, 2017.
Statistics Canada. Table 35-10-0195-01 Fire-related deaths and persons injured, by type of structure. Govt. of Canada. https://www150.statcan.gc.ca/t1/tbl1/en/tv.action?pid=3510019501.
Muhammad, Khan, et al. “Convolutional Neural Networks Based Fire Detection in Surveillance Videos.” IEEE Access, vol. 6, 6 Mar. 2018, pp. 18174–18183.
Zhang, Qingjie, et al. “Deep Convolutional Neural Networks for Forest Fire Detection.” Proceedings of the 2016 International Forum on Management, Education and Information Technology Application, 2016.
Redmon, Joseph, and Ali Farhadi. “YOLO9000: Better, Faster, Stronger.” 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 25 Dec. 2017.
DeepQuestAI. “DeepQuestAI/Fire-Smoke-Dataset.” GitHub, 28 June 2019, github.com/DeepQuestAI/Fire-Smoke-Dataset.
Cazzolato, Mirela T., et al. “FiSmo: A Compilation of Datasets from Emergency Situations for Fire and Smoke Analysis.” Brazilian Symposium on Databases – Dataset Showcase Workshop, Uberlândia, MG, Brazil, October 2017.
Cair. “Cair/Fire-Detection-Image-Dataset.” GitHub, github.com/cair/Fire-Detection-Image-Dataset. Accessed September 2019.
“Visual Object Tagging Tool.” VoTT (Visual Object Tagging Tool), Microsoft, 11 Sept. 2019, https://github.com/microsoft/VoTT.
Shorten, Connor, and Taghi M. Khoshgoftaar. “A Survey on Image Data Augmentation for Deep Learning.” Journal of Big Data, vol. 6, no. 1, June 2019.
Barmpoutis, Panagiotis, et al. “Fire Detection from Images Using Faster R-CNN and Multidimensional Texture Analysis.” ICASSP 2019 – 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2019.
Sucuoğlu, Saygın H, et al. “Real Time Fire Detection Using Faster R-CNN Model.” International Journal of 3D Printing Technologies and Digital Industry, vol. 3, no. 3, 9 Dec. 2019, pp. 220–226.
Jadon, Arpit, et al. “FireNet: A Specialized Lightweight Fire & Smoke Detection Model for Real-Time IoT Applications.” ArXiv, 4 Sept. 2019.
Toulouse, T, et al. “A Multimodal 3D Framework for Fire Characteristics Estimation.” Measurement Science and Technology, vol. 29, no. 2, 2018, p. 025404.
Phillips, Walter, III, et al. “Flame Recognition in Video.” Pattern Recognition Letters, vol. 23, no. 1-3, 2002, pp. 319–327.
Toulouse, Tom, et al. “Automatic Fire Pixel Detection Using Image Processing: a Comparative Analysis of Rule-Based and Machine Learning-Based Methods.” Signal, Image and Video Processing, vol. 10, no. 4, 2015, pp. 647–654.
Vidyavani, A., et al. “Object Detection Method Based on YOLOv3 Using Deep Learning Networks.” International Journal of Innovative Technology and Exploring Engineering Regular Issue, vol. 9, no. 1, Oct. 2019, pp. 1414–1417.
Endres, Ian, et al. “The Benefits and Challenges of Collecting Richer Object Annotations.” 2010 IEEE Computer Society Conference on Computer Vision and Pattern Recognition – Workshops, 2010.
Rajchl, Martin, et al. “DeepCut: Object Segmentation From Bounding Box Annotations Using Convolutional Neural Networks.” IEEE Transactions on Medical Imaging, vol. 36, no. 2, 2017, pp. 674–683.
Ruttiman, Lance. Reducing False Fire Alarms: A Study of Selected European Countries. Siemens Switzerland Ltd, 2014, pp. 1–10.
Liu, Wei, et al. “SSD: Single Shot MultiBox Detector.” Computer Vision – ECCV 2016 Lecture Notes in Computer Science, 2016, pp. 21–37.
Girshick, Ross, et al. “Rich Feature Hierarchies for Accurate Object Detection and Semantic Segmentation.” 2014 IEEE Conference on Computer Vision and Pattern Recognition, 2014.