A Comparative Analysis of Variant Deep Learning Models for COVID-19 Protective Face Mask Detection

The world is in the midst of a major pandemic owing to the rapid spread of coronavirus disease (COVID-19), caused by the virus "SARS-CoV-2". It is mainly transmitted between people through airborne droplets containing the virus, produced when an infected person sneezes or coughs without covering their face. The World Health Organization (WHO) has issued numerous guidelines stating that the spread of this disease can be limited if people shield their faces with protective face masks in public or crowded areas. As a precautionary measure, many nations have made face masks mandatory in public spaces. However, manual monitoring of large crowds in public spaces for face masks is laborious. This calls for the development of an automated face mask detection system using deep learning models and related technologies. The detection system should be viable and deployable in real time and should predict accurately, so that monitoring bodies can use it to ensure that the public follows the face mask guidelines, thereby preventing disease transmission. In this paper we perform a comparative analysis of several image classifiers based on deep learning, in terms of key performance metrics, to identify an effective deep learning model for face mask detection.


Introduction
The outbreak of COVID-19 has created a catastrophic situation around the world. According to the latest COVID-19 Epidemiological Update report [1] published by the World Health Organization (WHO), over 146 million people have been infected and over 3 million have died from coronavirus disease 2019. Researchers in various fields have been working to develop intelligent systems using digital technologies that aid in monitoring and controlling the spread of the disease. S. Tuli, et al. [2] devised a system that predicts the development of the outbreak using machine learning and cloud computing, so that policies and strategies can be developed to manage its propagation and to monitor the disease efficiently. The system presented in [3] uses an Internet of Things (IoT) framework to gather users' data on disease symptoms in real time, in order to detect suspected COVID-19 cases at an early stage, to track the treatment response of recovered patients and to understand virus behaviour by collecting and analysing relevant data. Y. Ke, et al. [4] used an Artificial Intelligence (AI) based tool to identify the potential of marketed medicines for treating COVID-19.
Following the COVID-19 outbreak, the WHO has released a number of precautionary recommendations to combat coronavirus transmission. Some of the most significant guidelines are practicing social/physical distancing, sanitizing hands, wearing face masks and avoiding crowds. Even though a few global pharmaceutical companies have successfully developed COVID-19 vaccines that were approved by the WHO and medical authorities, more and more COVID-19 cases are emerging every day worldwide. This is because the disease continues to spread among people who do not follow the protective measures and guidelines mentioned above. The use of face masks plays a significant role in reducing person-to-person transmission of the virus [5] and thus helps break the chain of transmission. The governments of the majority of countries have mandated the use of face masks in public places to control community spread of the virus. However, manually inspecting large crowds in public spaces for face masks is an arduous process. Thus, an automatic face mask detection system is needed to aid this process.
The idea of a face mask detection system is to identify whether or not a person is wearing a face mask using data in the form of images, recorded videos or a live video stream. An efficient system can be developed by selecting a Convolutional Neural Network (CNN) that keeps the computational cost low without compromising the performance of the system. In this paper we perform face mask detection using various deep learning networks to compare their performance and analyse their efficiency. The rest of this paper is structured as follows: section II presents the related work and literature survey, section III discusses the examined methodology, section IV presents the results and section V gives the conclusion.

Problem Definition
Face mask detection is the process of determining whether or not a person is wearing a mask and where their face is located. It is a combination of general object recognition, which is used to identify different categories of objects (the mask in this case), and face detection, which is used to identify the face, in a digital image. The aforementioned systems have an assortment of real world applications such as surveillance, autonomous vehicles, robot vision and activity, etc.
Object recognition is primarily concerned with locating and classifying objects in images. Traditional algorithms (non-neural approaches) such as Scale-Invariant Feature Transform (SIFT) and Histogram of Oriented Gradients (HOG) can be used for such tasks, but they depend largely on feature engineering. Neural networks (in deep learning) can outperform the traditional algorithms without the need for handcrafted features. There are two groups of such object detection algorithms [6]:
• Two-stage detectors: In stage one, a Region Proposal Network (RPN) identifies regions of interest. In stage two, a classifier processes the region candidates identified. Examples are Faster R-CNN (Region-based Convolutional Neural Networks) and Mask R-CNN.
• One-stage detectors: They run detection directly over a dense sample of possible locations and omit the RPN stage. Examples are You Only Look Once (YOLO), Single Shot MultiBox Detector (SSD), etc.

Literature Survey
The following is the literature survey of various research works related to the development of face mask detection systems.
The system proposed by A. Das, et al. [7] entails a cascaded classifier and a pre-trained CNN containing dense neuron layers connected with two 2D convolution layers; it gave an accuracy of 94.58% on a dataset obtained from Kaggle [8]. A. Negi, et al. [9] built a custom image-based CNN model with a Haar cascade classifier for face mask detection and used Keras-Surgeon for model pruning, to reduce the model size so that it can be implemented on embedded systems, and attained 98.9% validation accuracy. In [10] M. R. Bhuiyan, et al. used the YOLOv3 network, a one-stage object detector, for detection of face masks and obtained a mean average precision (mAP) of 0.96. G. Draughon, et al. [11] presented a framework to detect and track people and their face mask usage, using deep learning and Computer Vision. The face mask detector employed in the framework was a CNN-based binary classifier with a residual network architecture; it gave a classification accuracy of 96%. M. Jiang, et al. [12] proposed the one-stage detector RetinaFaceMask, in which a feature pyramid network fuses high-level semantic information with several feature maps. RetinaFaceMask with ResNet was used to detect face masks with a precision of 93.4%. In [13] J. Zhang, et al. developed a framework named Context-Attention R-CNN to detect the fine-grained wearing condition of face masks; it gave an mAP of 84.1% on a new practical dataset they created covering various conditions. B. Wang, et al. [14] devised a hybrid (deep) transfer learning and broad learning system for detecting face masks. It contains two stages, pre-detection (implemented using Faster R-CNN) followed by verification, and gave a precision score of 94.84%.

Methodology
As stated earlier, the aim of this paper is to compare various neural network architectures and evaluate their performance for face mask detection. A Convolutional Neural Network (CNN) convolves the input images or feature maps with convolution kernels in order to extract higher-level features. It is thus an effective tool for Computer Vision tasks such as image classification, object detection and pattern identification. The neural network architectures studied, evaluated and compared in this paper are VGG16, InceptionV3, ResNet50V2, MobileNetV2 and Xception. Image classification is achieved more effectively by applying transfer learning to these models.
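To make the convolution operation concrete, the following minimal sketch slides a small kernel over a single-channel image in pure Python. The image, the kernel values and the function name conv2d_valid are illustrative assumptions and are not taken from the models evaluated in this paper. As in most deep learning frameworks, the sketch computes a cross-correlation (the kernel is not flipped), with stride 1 and no padding.

```python
def conv2d_valid(image, kernel):
    """Slide `kernel` over `image` (both lists of lists), stride 1, no padding."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    output = []
    for i in range(out_h):
        row = []
        for j in range(out_w):
            # Weighted sum of the image patch under the kernel.
            acc = 0
            for di in range(kh):
                for dj in range(kw):
                    acc += image[i + di][j + dj] * kernel[di][dj]
            row.append(acc)
        output.append(row)
    return output


# Hypothetical 4x4 image containing a vertical edge between columns 1 and 2.
image = [[0, 0, 1, 1] for _ in range(4)]
# Hypothetical 2x2 kernel that responds to left-to-right intensity increases.
kernel = [[-1, 1],
          [-1, 1]]

feature_map = conv2d_valid(image, kernel)
print(feature_map)
```

The resulting feature map is non-zero only where the kernel straddles the edge, which is the sense in which a kernel "extracts" a feature.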

Understanding Transfer Learning
Transfer learning is one of the most widely used methods for computer vision tasks such as classification and segmentation. In this method, the weights or knowledge gained from solving one problem are reused to solve other, similar problems. When the application areas are closely related, transfer learning helps to decrease training time. Transfer learning can be accomplished in one of the two following ways.

Research Article
Vol. 12 No.6 (2021), 2841-2848
• The first is by employing a pre-trained model, that is, a model that has been trained beforehand on an extensive standard dataset such as ImageNet, MNIST, CIFAR, etc. ResNet, DenseNet, MobileNet, etc. are examples of this kind.
• The second is by using a custom output layer for classification, which uses the features extracted by the pre-trained model (without its output layer).

Neural Networks Evaluated
The following neural network architectures (pre-trained models) have been selected for evaluation. These models are pre-trained on the standard ImageNet dataset and can perform 1000-class classification on any colour image given as input.

VGG16
VGG16, developed by the Visual Geometry Group (VGG) at the University of Oxford [15], is a popular pre-trained model for image classification. The "16" in VGG16 denotes the number of weighted layers in the network. VGG16 comprises 13 convolutional, 3 dense and 5 pooling layers. Rather than using larger filters, VGG stacks a greater number of smaller filters, which increases depth while requiring fewer parameters.
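The parameter saving from stacking small filters can be checked with simple arithmetic. The sketch below compares two stacked 3x3 convolutions, which together cover a 5x5 receptive field, with a single 5x5 convolution. The channel width of 64 and the helper name conv_weights are assumptions chosen for illustration, and biases are ignored.

```python
def conv_weights(kernel_size, in_channels, out_channels):
    """Number of weights in one convolution layer (biases ignored)."""
    return kernel_size * kernel_size * in_channels * out_channels


channels = 64  # hypothetical channel width, kept constant across layers

# Two stacked 3x3 layers see the same 5x5 region as one 5x5 layer.
two_3x3 = 2 * conv_weights(3, channels, channels)
one_5x5 = conv_weights(5, channels, channels)

# The "16" in VGG16: only layers with weights are counted (pooling has none).
weighted_layers = 13 + 3

print(two_3x3, one_5x5, weighted_layers)
```

Under these assumptions the stacked design needs 18 weights per channel pair instead of 25, and it also adds an extra non-linearity between the two layers.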

InceptionV3
Inception, or GoogLeNet [16], developed by Google, was the winner of the ILSVRC competition in 2014. This network uses inception modules. In the naïve version of an inception module, convolution is performed on the input with filters of three sizes p x p (where p = 1, 3, 5) along with max pooling, and the outputs are concatenated and sent to the following inception module. The architecture contains 9 such modules and is 22 layers deep. InceptionV3, which is 48 layers deep, is the improved version of InceptionV2. InceptionV3 additionally uses greater factorization, the RMSProp optimizer and batch normalization.

ResNet50V2
ResNet50V2 is a Residual Neural Network with a depth of 50 layers [17]. It uses the concept of residual learning to overcome the accuracy saturation caused by the growing depth of a network. Here, rather than learning features directly, the network seeks to learn a residual, which is the difference between the feature to be learned and the input of the given layer. For this, shortcut connections are used, i.e. the input of layer x is connected directly to another layer (x+n). In contrast to the plain stacking of layers in VGG16 or Inception networks, ResNet modifies the underlying mapping of the layers.
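The effect of a shortcut connection can be illustrated with a minimal sketch in which the block output is the learned residual added to the block input. The function names and the toy residual functions are hypothetical; in the sketch x denotes the block input vector (not a layer index), and a real residual block would compute the residual with convolution layers.

```python
def residual_block(x, residual_fn):
    """Return F(x) + x, where residual_fn computes the residual F."""
    fx = residual_fn(x)
    # The shortcut connection adds the unchanged input to the residual.
    return [f + xi for f, xi in zip(fx, x)]


def zero_residual(x):
    """Toy residual that has learned nothing: F(x) = 0."""
    return [0.0 for _ in x]


def small_residual(x):
    """Toy residual that applies a small correction: F(x) = 0.1 * x."""
    return [0.1 * v for v in x]


x = [1.0, -2.0, 3.0]
identity_out = residual_block(x, zero_residual)    # block acts as identity
adjusted_out = residual_block(x, small_residual)   # input plus a correction

print(identity_out, adjusted_out)
```

With a zero residual the block simply passes its input through, which suggests why adding such blocks should not make a deeper network harder to fit than a shallower one.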

MobileNetV2
MobileNetV2 expands on the concepts of MobileNetV1 (developed by Google in 2017) [18,19] by using depthwise separable convolutions as efficient building blocks and adds two architectural features: linear bottlenecks between layers, and shortcut links between bottlenecks. A depthwise separable convolution combines a depthwise convolution, i.e. a channel-wise p×p spatial convolution applied to each input channel separately, with a subsequent convolution of size 1x1, called pointwise convolution. This architecture has 53 layers.
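The saving offered by a depthwise separable convolution can be seen by counting weights. The following sketch compares a standard convolution with its depthwise separable counterpart; the layer sizes (3x3 kernel, 32 input channels, 64 output channels) and function names are assumptions chosen for illustration, and biases are ignored.

```python
def standard_conv_weights(k, c_in, c_out):
    """Standard convolution: every output channel filters every input channel."""
    return k * k * c_in * c_out


def separable_conv_weights(k, c_in, c_out):
    """Depthwise (one k x k filter per input channel) plus 1x1 pointwise."""
    depthwise = k * k * c_in
    pointwise = c_in * c_out
    return depthwise + pointwise


k, c_in, c_out = 3, 32, 64  # hypothetical layer dimensions

standard = standard_conv_weights(k, c_in, c_out)
separable = separable_conv_weights(k, c_in, c_out)

print(standard, separable, round(standard / separable, 2))
```

For these sizes the separable form needs roughly an eighth of the weights, which is the reason such blocks suit mobile and embedded deployment.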

Xception
Xception (Extreme version of Inception) [20] is an extension of the Inception architecture in which depthwise separable convolutions replace the standard Inception modules. The Xception architecture has 36 convolutional layers forming its feature extraction base. It uses a modified depthwise separable convolution, i.e. the pointwise convolution (1x1) is performed first and is followed by the channel-wise spatial convolution (nxn).

Proposed Methodology
The diagrammatic representation of the proposed model is depicted in Figure 1 and the steps of the process flow are as follows:
Step 1: The dataset used (samples shown in Figure 2) is a publicly available dataset [21]. It consists of 3850 real-time images, of which 1920 images contain a face mask and 1930 images are without a face mask. The dataset is split in the ratio 80:20 for training and testing.

Step 2: Image augmentation of the dataset is performed. It is a method to enlarge the training dataset by altering images in the dataset artificially using the operations rotation, zoom, horizontal flipping, shearing, height and width shift, etc.
Step 3: Images of size (224, 224, 3) (obtained after step 1) are given as input to the transfer learning model pretrained using ImageNet weights.
Step 4: The original output layer of the pre-trained model is replaced with the following set of layers: a flattening layer (to convert the data from the pre-trained model into a 1-dimensional array for input to the next layer), followed by a dense layer (with 128 neurons) with the Rectified Linear Unit (ReLU) activation function and a dropout layer with a rate of 0.5 (the dropout layer helps prevent the model from overfitting). The feature map from step 3 passes through these layers.

Step 5: The output from step 4 is sent to a dense layer with sigmoid activation function and two neurons. This layer classifies whether or not the person in the image is wearing a mask.
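The steps above can be sketched with the Keras API of TensorFlow. This is a minimal sketch under stated assumptions rather than the exact implementation used in the experiments: MobileNetV2 stands in for any of the five backbones, freezing the pre-trained layers, the Adam optimizer and the binary cross-entropy loss are assumptions that the text does not specify, and the function names split_counts and build_mask_classifier are hypothetical.

```python
def split_counts(total_images, train_fraction=0.8):
    """Return (train, test) image counts for the 80:20 split of Step 1."""
    train = round(total_images * train_fraction)
    return train, total_images - train


def build_mask_classifier(input_shape=(224, 224, 3)):
    """Steps 3 to 5: pre-trained base followed by the custom head."""
    # Imported here so that split_counts works without TensorFlow installed.
    import tensorflow as tf

    # Step 3: ImageNet weights, original 1000-class output layer removed.
    base = tf.keras.applications.MobileNetV2(
        weights="imagenet", include_top=False, input_shape=input_shape
    )
    base.trainable = False  # assumption: pre-trained layers are kept frozen

    model = tf.keras.Sequential([
        base,
        tf.keras.layers.Flatten(),                        # Step 4: flatten
        tf.keras.layers.Dense(128, activation="relu"),    # Step 4: dense, ReLU
        tf.keras.layers.Dropout(0.5),                     # Step 4: dropout 0.5
        tf.keras.layers.Dense(2, activation="sigmoid"),   # Step 5: two neurons
    ])
    # Assumption: optimizer and loss are not stated in the text.
    # binary_crossentropy with two outputs expects one-hot encoded labels.
    model.compile(optimizer="adam", loss="binary_crossentropy",
                  metrics=["accuracy"])
    return model


train_count, test_count = split_counts(3850)
print(train_count, test_count)
```

Calling build_mask_classifier() requires TensorFlow and downloads the ImageNet weights on first use. Swapping the backbone amounts to replacing MobileNetV2 with VGG16, InceptionV3, ResNet50V2 or Xception from the same tf.keras.applications module.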

Results and Analysis
This evaluation of deep learning models for face mask detection was implemented on Google Colaboratory (Colab Notebook), which runs in the cloud. The proposed methodology was implemented using Python and TensorFlow, and model training and testing were performed on an NVIDIA Tesla K80 GPU.
The deep learning models used in this experiment are VGG16, InceptionV3, ResNet50V2, MobileNetV2 and Xception. The classification results on the testing data are divided into the following four categories used for performance calculation: true positives, true negatives, false positives and false negatives.

Confusion Matrix
A confusion matrix is a square matrix of size p × p (where p = number of target categories) used to evaluate the performance of a classification model. It compares the actual target values with the model's predictions. Here p = 2, as the target groups are with and without masks. Table 1 shows the confusion matrices obtained for each model evaluated.
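As an illustration of how such a matrix is built for p = 2, the sketch below counts label pairs for a small set of made-up predictions. The label lists are hypothetical and are not results from this experiment. The sketch assumes that rows index the actual class and columns the predicted class, with 0 for without mask and 1 for with mask.

```python
def confusion_matrix(actual, predicted, num_classes=2):
    """Count (actual, predicted) pairs; rows = actual, columns = predicted."""
    matrix = [[0] * num_classes for _ in range(num_classes)]
    for a, p in zip(actual, predicted):
        matrix[a][p] += 1
    return matrix


# Hypothetical labels: 0 = without mask, 1 = with mask.
actual    = [0, 0, 0, 1, 1, 1, 1, 0]
predicted = [0, 1, 0, 1, 1, 0, 1, 0]

cm = confusion_matrix(actual, predicted)
print(cm)
```

The diagonal entries count correct predictions and the off-diagonal entries count the two kinds of error.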

Performance Report
Several measures can be used to demonstrate the performance of a classification model. The four central metrics considered here are precision, recall, f1-score and accuracy. The classification report, i.e. the precision, recall, f1-score and accuracy values, is shown in Table 2. In the table, 0 signifies without mask and 1 signifies with mask.
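For reference, the four metrics can be computed from the counts of true positives (tp), false positives (fp), false negatives (fn) and true negatives (tn) using their standard definitions, as in the sketch below. The counts passed in are hypothetical and are not the values reported in Table 2.

```python
def classification_metrics(tp, fp, fn, tn):
    """Standard precision, recall, f1-score and accuracy from four counts."""
    precision = tp / (tp + fp)          # share of predicted positives that are correct
    recall = tp / (tp + fn)             # share of actual positives that are found
    f1 = 2 * precision * recall / (precision + recall)  # harmonic mean
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    return {
        "precision": precision,
        "recall": recall,
        "f1": f1,
        "accuracy": accuracy,
    }


# Hypothetical counts for the "with mask" class on 200 test images.
metrics = classification_metrics(tp=90, fp=10, fn=5, tn=95)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

Precision and recall are computed per class, so a report such as Table 2 lists them once for class 0 and once for class 1, whereas accuracy is a single value for the whole test set.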

Loss and Accuracy Graphs
Loss signifies how well or poorly a model performs after each optimization iteration. The relation between accuracy and the loss incurred during training is used to understand the nature and size of the errors that a model has made. Figure 3 shows the training loss, validation loss, training accuracy and validation accuracy measured for the various models.
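The text does not name the loss function used. As an assumption for illustration only, the sketch below uses binary cross-entropy, a common choice for two-class problems, to show how a loss value reflects the size of a model's errors. The labels and predicted probabilities are made up.

```python
import math


def binary_cross_entropy(labels, probabilities, eps=1e-12):
    """Mean binary cross-entropy between 0/1 labels and predicted probabilities."""
    total = 0.0
    for y, p in zip(labels, probabilities):
        p = min(max(p, eps), 1 - eps)  # clip to avoid log(0)
        total += -(y * math.log(p) + (1 - y) * math.log(1 - p))
    return total / len(labels)


labels = [1, 0, 1, 0]  # hypothetical: 1 = with mask, 0 = without mask

# Confident, mostly correct predictions give a low loss.
confident = binary_cross_entropy(labels, [0.9, 0.1, 0.8, 0.2])
# Hesitant predictions close to 0.5 give a higher loss.
uncertain = binary_cross_entropy(labels, [0.6, 0.4, 0.5, 0.5])

print(round(confident, 4), round(uncertain, 4))
```

Both sets of predictions classify most samples correctly at a 0.5 threshold, yet their losses differ, which is why loss curves are examined alongside accuracy curves.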

Conclusion
The COVID-19 pandemic has led to an upsurge of infections worldwide. The governments of several nations have enforced the obligatory use of protective face masks to prevent transmission of the virus. However, manually monitoring the public for the use of face masks is an arduous process. This has motivated researchers to develop automated face mask detection systems that can effectively identify whether or not people are using face masks. In this paper, we have implemented the pre-trained models VGG16, InceptionV3, ResNet50V2, MobileNetV2 and Xception for face mask detection and evaluated their performance using various metrics. The results show that the InceptionV3 and Xception models produced outstanding accuracies on the given dataset. Our future work will concentrate on the detection of face masks using hybrid models.