
Face mask detection algorithm based on HSV+HOG features and SVM

2022-09-19

HE Yumin, WANG Zhaohui, GUO Siyu, YAO Shipeng, HU Xiangyang

(School of Mechanical and Electrical Engineering, Xi’an University of Architecture and Technology, Xi’an 710055, China)

Abstract: To automatically detect whether a person is wearing a mask properly, we propose a face mask detection algorithm based on hue-saturation-value (HSV) + histogram of oriented gradient (HOG) features and a support vector machine (SVM). Firstly, the human face and five facial feature points are detected with the RetinaFace face detection algorithm. The feature points are used to locate the mouth and nose region, and the HSV+HOG features of this region are extracted and input to an SVM for training, so as to detect whether a mask is worn. Secondly, RetinaFace is used to locate the nose tip area of the face, the YCrCb elliptical skin tone model is used to detect the exposure of skin in that area, and the optimal classification threshold for deciding whether the mask is worn properly is determined from the experimental results. Experiments show that the accuracy of detecting whether a mask is worn reaches 97.9%, and the accuracy of detecting whether the mask is worn correctly reaches 87.55%, which verifies the feasibility of the algorithm.

Key words: hue-saturation-value (HSV) features; histogram of oriented gradient (HOG) features; support vector machine (SVM); face mask detection; feature point detection

0 Introduction

At the end of 2019, the outbreak of COVID-19 posed a great threat to the lives and health of people around the world. The virus spreads mainly through respiratory routes such as droplets[1]. In order to block this transmission route, experts require people to wear masks correctly in public. At present, many public places are staffed with dedicated personnel who, in addition to their own duties, remind the general public to wear masks. This increases their workload, yet mistakes caused by lapses in attention still cannot be avoided. Therefore, it is particularly important to study a method that can automatically detect whether people are wearing masks properly.

Computer vision technology can be used to quickly detect whether people are wearing masks and whether they are wearing them properly. Mask detection belongs to the class of target detection within the field of computer vision. Currently there are two main approaches to target detection: deep learning and traditional methods[2]. The deep learning approach can obtain a model with excellent performance after training the network on a large amount of data[3]. The features extracted by such a model are more robust, its generalization ability is better, and end-to-end training can be achieved. The disadvantages are that the model is complex and a large number of samples must be used to ensure its robustness, otherwise it tends to overfit[4], and the model demands high computational power from the device. Several companies and organizations have made progress in mask detection algorithms based on deep learning[5]. Based on the PaddlePaddle deep learning platform, Baidu used the Wider Face dataset and a self-collected dataset of more than 100 000 masked faces to train and develop PyramidBox-Lite, a lightweight model; SPIDER IoT proposed a lightweight target detection model for mask detection based on the CenterFace and MobileNetV2 networks, trained on over 10 000 images; AIZOO trained an SSD-based mask detection model with the Keras deep learning framework. Training these detection models requires a large number of samples, whose collection, cleaning and labeling are time-consuming and laborious, and demands high-performance equipment.

Traditional methods mainly rely on manually designed extractors to extract features. Although their generalization ability and robustness are relatively poor, they are usually suitable for small-sample problems. Liu et al.[6] proposed a method that used histogram of oriented gradient (HOG) features + support vector machine (SVM) to classify high-throughput digital polymerase chain reaction (dPCR) gene chip fluorescence images, and the accuracy rate was as high as 98% on small-sample data. Ma et al.[7] proposed a rice disease detection method based on HOG features + SVM. They used 1 000 images to train the model; the model's recognition rate of disease spots was over 94%, and the accuracy of disease spot location was 91.7%. Shi et al.[8] proposed a local binary patterns (LBP) + SVM three-dimensional face recognition algorithm, which used the Texas 3D face recognition database (Texas 3DFRD) with 1 151 images. Compared with deep learning methods, traditional methods based on SVM require smaller sample sizes and less device computing power. In traditional methods, HOG features and hue-saturation-value (HSV) features can represent the edge shape information and color information of an image, respectively, and combining them can describe certain specific targets well. Hu et al.[9] constructed a vehicle appearance similarity model by fusing HSV features and HOG features. Cheng et al.[10] proposed a pedestrian tracking method for complex scenes based on the kernel correlation filter algorithm, fusing HOG features and HSV features.

To date, most studies have focused on detecting whether masks are worn, and there are few studies on detecting whether masks are worn properly. Xiao[11] used the YCrCb elliptical skin color model to detect skin, and judged whether the mask was worn correctly by calculating the exposure of skin in the mouth and nose region from the detection results; the recognition rate reached 82.48%. This method cannot precisely locate the coordinates of the nose and mouth in the picture, but can only determine their approximate relative positions by statistical observation, and it requires that the face not be tilted at a large angle, which limits the recognition rate.

In summary, to address the problem of detecting whether people are correctly wearing a mask, a face mask detection algorithm based on HSV+HOG features and SVM is presented in this paper. The algorithm is divided into two parts: (1) First, the RetinaFace face detection algorithm is used to locate the face and five facial feature points, and face alignment is performed. The mouth and nose region is located according to the facial feature points, and the HSV+HOG features of this region are extracted and input into an SVM for training, so as to detect whether a mask is worn. (2) For pictures in which a mask is detected, the algorithm first locates the nose area centered on the nose tip with RetinaFace. The YCrCb elliptical skin color model is then used for skin detection, and whether the mask is worn properly is determined by calculating the exposure of skin in the nose region. The algorithm uses the HSV+HOG features of the mouth and nose region to maximize the distinction between faces with and without masks, and uses RetinaFace to accurately locate the nose position, which improves accuracy. The algorithm is discussed below in two parts: detecting whether a mask is worn and detecting whether the mask is worn properly.

1 Mask wearing detection

For the problem of automatically detecting whether people are wearing masks, the flow chart of the algorithm is shown in Figs.1 and 2. Fig.1 shows the process of training SVM classifier, and Fig.2 shows the process of detecting whether a face is wearing a mask or not with the face mask detection algorithm.

Fig.1 Flow chart of training SVM classifier

Fig.2 Flow chart of face mask detection algorithm

1.1 Pre-processing

The pre-processing of the dataset includes face detection, facial feature point localization, and face alignment with RetinaFace. The data used by the algorithm come from the Wider Face dataset and part of the masked faces dataset (MAFA dataset). Wider Face is a commonly used face detection dataset containing 32 203 images with a total of 393 703 faces, which vary widely in scale, pose, and occlusion; a large number of images of unmasked faces can be collected from it. MAFA is a dataset of occluded faces in which most occluding objects are masks, so a large number of masked face images can be collected from it. RetinaFace is a multi-task, single-stage face detection algorithm that can accurately locate multi-scale faces. It includes tasks such as face detection, face bounding box regression, and facial feature point positioning. RetinaFace achieves an average accuracy of 91.4% in detecting faces on the Wider Face dataset[12]. The effect of detecting faces and positioning feature points (left and right eyes, nose tip, left and right mouth corners) is shown in Fig.3. After the coordinates and size of the face bounding box are detected, the face is cropped from the image.

Fig.3 Detecting faces and locating feature points with RetinaFace

When the face is tilted at a certain angle, subsequent detection is adversely affected; the larger the tilt angle, the more obvious the effect. Taking this into account, the algorithm introduces a face alignment step after face detection, which eliminates the undesirable effects caused by the tilt angle. The method is to calculate the inclination angle of the line connecting the left and right corners of the mouth relative to the horizontal, and then rotate the image as a whole by this angle so that the two mouth corners lie on the same horizontal line, which improves the descriptive power of the subsequently extracted features, as shown in Fig.4. The inclination angle of the line between the left and right mouth corners is calculated by

Fig.4 Schematic diagram of oblique angle of left and right corners of mouth

θ = arctan((y2 - y1)/(x2 - x1)),    (1)

where θ is the tilt angle, and (x1, y1) and (x2, y2) are the coordinates of the left and right corners of the mouth in the picture, respectively. The original face image and the face alignment effect are shown in Fig.5(a) and Fig.5(b), respectively.

Fig.5 Example of face alignment effect
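As a sketch, the angle computation of Eq.(1) needs only the Python standard library; the two landmark coordinates are assumed to come from RetinaFace:

```python
import math

def mouth_tilt_angle(left_corner, right_corner):
    """Tilt angle theta (degrees) of the line through the two mouth
    corners relative to the horizontal, as in Eq.(1)."""
    (x1, y1), (x2, y2) = left_corner, right_corner
    return math.degrees(math.atan2(y2 - y1, x2 - x1))

# Right corner 10 px lower than the left over a 100 px span:
theta = mouth_tilt_angle((40, 120), (140, 130))
# Rotating the whole image by -theta (e.g. with cv2.getRotationMatrix2D
# and cv2.warpAffine) places both corners on one horizontal line.
```

Using atan2 rather than a plain arctan of the slope also handles the degenerate case x1 = x2 without a division by zero.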

1.2 Mouth and nose region positioning

Fig.6 shows the mouth and nose region located by the proposed algorithm.

(a) Mouth and nose region without a mask

(b) Mouth and nose region with a mask
Fig.6 Mouth and nose region localized by proposed algorithm

When a person is wearing a mask, the feature information of the mouth and nose region is lost due to occlusion, while the feature information of other regions is not affected. In order to extract more representative features, reduce the amount of computation and improve the speed of the algorithm, the algorithm uses the facial feature points located by RetinaFace to locate the mouth and nose region, and uses the feature information of this region to describe faces with and without masks.

1.3 Feature extraction and fusion

After the mouth and nose region is located, the HSV features and HOG features of this region are extracted, and the two kinds of features are concatenated in series.

When a person wears a mask, the mouth and nose region is covered by the mask. Commercially available masks are mostly blue-white or black, far from common skin colors, so the algorithm extracts HSV features of this region to distinguish faces with and without masks. HSV is a color space created by A. R. Smith in 1978 based on the intuitive characteristics of color, also known as the hexcone model. HSV stands for hue, saturation, value; the space is also known as HSB (B for brightness). Hue is measured by angle, ranging from 0° to 360° counterclockwise: 0° is red, 60° is yellow, 120° is green, 180° is cyan, 240° is blue, and 300° is magenta. Saturation indicates how close a color is to the corresponding spectral color; a color can be seen as a mixture of a spectral color and white, and the greater the proportion of the spectral color, the closer the color is to it and the higher the saturation. Value measures the brightness of a color and usually ranges from 0% to 100%. Firstly, the localized mouth and nose region is converted from the RGB color space to the HSV color space. Then its H channel (i.e. the hue channel) is separated out, and the HSV features of this channel are extracted using the function calcHist() in OpenCV. The extracted feature is a one-dimensional vector of 360 values, where each value is the number of pixels with the corresponding hue. The HSV features extracted from Fig.6(a) and Fig.6(b) are drawn as histograms in Fig.7(a) and Fig.7(b), respectively.

(a) HSV color histogram of mouth and nose area without a mask

(b) HSV color histogram of mouth and nose area with a mask
Fig.7 HSV color histogram of mouth and nose region
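The hue-histogram computation can be sketched in NumPy. This mirrors a cv2.calcHist call on the H channel as described above; note that 8-bit OpenCV images store H in [0, 180), so the full 0-360 hue range here assumes a float HSV conversion or rescaling:

```python
import numpy as np

def hue_histogram(h_channel, n_bins=360):
    """360-bin hue histogram: entry k counts the pixels whose hue is
    k degrees, matching the 360-value HSV feature vector in the text."""
    h = np.asarray(h_channel).ravel().astype(int) % 360
    return np.bincount(h, minlength=n_bins)

# A patch dominated by blue hues (~240 deg), as on a typical surgical mask:
patch = np.full((8, 8), 240)
hist = hue_histogram(patch)
# All mass lands in the ~240 deg bin, while bare-skin hues cluster near
# 0-50 deg, so the histogram alone already separates the two cases.
```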

When the color of the mask is close to skin color, extracting HSV features alone is obviously not enough. The edge shape information of the mouth, nose, and chin is lost through occlusion by the mask, and HOG features can be used to describe this contour information[13]. The algorithm therefore also extracts HOG features of the mouth and nose region. The HOG feature extraction algorithm was proposed by Dalal et al.[14] in 2005 and is widely used in pedestrian detection. The algorithm in this study uses the function hog() in the skimage.feature module to extract HOG features. Firstly, the mouth and nose region is converted to grayscale and scaled down to 64 pixel×64 pixel. Then it is divided into small connected regions (cells), and a histogram of the gradient directions of the pixels in each region is collected. Finally, these histograms are combined to form the HOG feature, which is essentially a one-dimensional vector. The HOG feature maps extracted from Fig.6(a) and Fig.6(b) are shown in Fig.8(a) and Fig.8(b), respectively.

(a) HOG feature map of unmasked mouth and nose region

(b) HOG feature map of masked mouth and nose region
Fig.8 HOG feature map of mouth and nose region
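The paper calls skimage.feature.hog directly; the core building block it relies on — a magnitude-weighted histogram of gradient orientations over one cell — can be illustrated in a few lines of NumPy. This is a simplified single-cell sketch, not the full block-normalized descriptor:

```python
import numpy as np

def cell_orientation_histogram(gray, n_bins=9):
    """Histogram of unsigned gradient orientations (0-180 deg) over one
    cell, weighted by gradient magnitude, as in a single HOG cell."""
    gray = np.asarray(gray, dtype=float)
    gy, gx = np.gradient(gray)                  # row- and column-direction gradients
    mag = np.hypot(gx, gy)                      # gradient magnitude per pixel
    ang = np.degrees(np.arctan2(gy, gx)) % 180.0
    bins = np.minimum((ang / (180.0 / n_bins)).astype(int), n_bins - 1)
    hist = np.zeros(n_bins)
    np.add.at(hist, bins.ravel(), mag.ravel())  # magnitude-weighted voting
    return hist

# A vertical step edge produces horizontal gradients (orientation ~0 deg):
cell = np.tile([0, 0, 0, 0, 255, 255, 255, 255], (8, 1))
h = cell_orientation_histogram(cell)
```

The full HOG pipeline then groups cells into overlapping blocks and L2-normalizes each block before concatenation, which is what makes the descriptor robust to illumination changes.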

After the HSV features and HOG features are extracted, the two kinds of features are stitched together in series. The obtained HSV+HOG features describe both the color information and the edge shape information of the image, which distinguishes faces with and without masks well.
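In code, the serial stitching is a single concatenation. The unit-sum scaling of the histogram below is an added normalization choice (not stated in the paper) that keeps the pixel-count histogram from dominating the HOG part:

```python
import numpy as np

def fuse_features(hsv_hist, hog_vec):
    """Serial concatenation of the two descriptors into one HSV+HOG
    feature vector; the histogram is scaled to unit sum first
    (an assumed normalization step, not specified in the paper)."""
    hsv = np.asarray(hsv_hist, dtype=float)
    if hsv.sum() > 0:
        hsv = hsv / hsv.sum()
    return np.concatenate([hsv, np.asarray(hog_vec, dtype=float)])

# 360 hue bins followed by a HOG vector (length depends on hog() settings):
feat = fuse_features(np.ones(360), np.zeros(144))
```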

1.4 Training SVM classifier

The acquired HSV+HOG features are input into an SVM for training. SVM[15] was first proposed in 1995 and has many unique advantages in solving small-sample, nonlinear and high-dimensional pattern recognition problems. It can also be extended to other machine learning problems such as function fitting. SVM is generally used to solve binary classification problems, but it can also be applied to multi-class problems. The HSV+HOG features of the dataset are input into the SVM for training, and an appropriate kernel function is selected to map the input features into a high-dimensional space. The SVM then searches for the optimal separating hyperplane in that space, and the SVM classifier is obtained after training.

In this study, the SVM implementation in the Sklearn module is used. The dataset is divided into a training set and a test set. On the training set, grid search and cross validation are adopted to train the classifier. The test set is used to evaluate the performance of the classifier and find the optimal model parameters. The training finally outputs the SVM classifier with the best performance.
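A minimal sketch of this training stage with scikit-learn, using the RBF kernel and the parameter search ranges reported later in Section 3.3.1; the feature-loading side is assumed, and synthetic vectors stand in for the HSV+HOG features here:

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

def train_mask_classifier(features, labels):
    """Grid search with 3-fold cross validation over the RBF-SVM
    parameters C and g (gamma): C in [10.5, 20] step 0.5,
    g in [0.05, 1] step 0.05, as in Section 3.3.1."""
    param_grid = {
        "C": np.arange(10.5, 20.01, 0.5),
        "gamma": np.arange(0.05, 1.001, 0.05),
    }
    grid = GridSearchCV(SVC(kernel="rbf"), param_grid, cv=3)
    grid.fit(features, labels)
    return grid.best_estimator_, grid.best_params_

# Synthetic two-class vectors standing in for HSV+HOG features:
rng = np.random.RandomState(0)
X = np.vstack([rng.randn(30, 8) + 2.0, rng.randn(30, 8) - 2.0])
y = np.array([1] * 30 + [0] * 30)  # hypothetical labels: 1 = no_mask, 0 = mask
clf, best = train_mask_classifier(X, y)
```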

1.5 Face mask detection

When using the algorithm for detection, face detection and face alignment are first performed on the images captured by the camera. Secondly, the algorithm locates the mouth and nose region, extracts the HSV features and HOG features of this region, and stitches them together in series. Finally, the resulting HSV+HOG features are input to the classifier to detect whether the face is wearing a mask.

2 Detection of whether mask is worn properly

The detection of whether the mask is worn properly builds on the result of the detection of whether a mask is worn. The flow charts of the detection algorithm are shown in Figs.9 and 10. Fig.9 shows the process of finding the optimal classification threshold from the training set pictures, and Fig.10 shows the process of using the algorithm to detect pictures.

Fig.9 Flow chart of finding optimal classification threshold

Fig.10 Flow chart of detecting whether mask is worn properly

1) Load the training set images. The training set images are all face images wearing masks, divided into two categories, properly worn and improperly worn, corresponding to the labels mask_well and mask_wrong, respectively.

2) Skin detection with the YCrCb elliptical skin tone model. The YCrCb elliptical skin color model[16] was proposed by Hsu et al. Compared with other skin color detection models, it performs better and minimizes the influence of light intensity on the detection results. The image is mapped from the RGB color space to the YCrCb color space and a nonlinear transformation is performed. After detection with the elliptical skin color model, a binary image is output in which skin pixels are set to 255 (white) and non-skin pixels to 0 (black), as shown in Fig.11.

Fig.11 Original image with mask, nose tip and result of skin color detection
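The per-pixel test of the elliptical model can be sketched as follows. The ellipse constants are the values commonly quoted from Hsu et al.'s paper and should be treated as an assumption here; the luma-adaptive nonlinear transform of the full model is omitted in this simplified version:

```python
import math

# Ellipse constants commonly quoted from Hsu et al.'s elliptical
# Cb-Cr skin model (assumed values, simplified sketch):
CX, CY = 109.38, 152.02   # ellipse centre in the Cb-Cr plane
THETA = 2.53              # rotation angle, radians
ECX, ECY = 1.60, 2.41     # centre offset after rotation
A, B = 25.39, 14.03       # semi-axes

def is_skin(r, g, b):
    """Map one RGB pixel to YCrCb (BT.601) and test it against the
    ellipse; True corresponds to a white (255) pixel in the binary mask."""
    y = 0.299 * r + 0.587 * g + 0.114 * b
    cr = (r - y) * 0.713 + 128.0
    cb = (b - y) * 0.564 + 128.0
    x = math.cos(THETA) * (cb - CX) + math.sin(THETA) * (cr - CY)
    z = -math.sin(THETA) * (cb - CX) + math.cos(THETA) * (cr - CY)
    return ((x - ECX) / A) ** 2 + ((z - ECY) / B) ** 2 <= 1.0

# A typical skin tone falls inside the ellipse; a pure blue mask pixel does not.
skin_pixel = is_skin(200, 150, 120)
mask_pixel = is_skin(0, 0, 255)
```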

3) Calculate the proportion of non-skin pixels in the nose tip area. Generally speaking, when a mask is worn, as long as the nose is exposed, the mask is considered improperly worn. Therefore, to determine whether the mask is correctly worn, it is only necessary to calculate the ratio of non-skin pixels in an area near the tip of the nose to the total pixels in that area, i.e. the exposure of skin there. The original face image and the nose tip detected by RetinaFace are shown in Fig.11(a) and Fig.11(b), respectively. An area of 30 pixel×30 pixel centered on the nose tip point is taken, and the proportion of non-skin pixels to total pixels in this area is calculated.
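Step 3 reduces to counting the zeros of the binary skin mask inside a 30 pixel×30 pixel window around the nose tip; a sketch (the nose-tip coordinate is assumed to come from RetinaFace):

```python
import numpy as np

def nonskin_ratio(skin_mask, nose_tip, half=15):
    """Fraction of non-skin pixels in the (2*half)x(2*half) window
    centred on the nose-tip point; skin_mask is the binary output of the
    skin detector (255 = skin, 0 = non-skin), clipped at image borders."""
    x, y = nose_tip
    h, w = skin_mask.shape
    win = skin_mask[max(0, y - half):min(h, y + half),
                    max(0, x - half):min(w, x + half)]
    return float(np.count_nonzero(win == 0)) / win.size

# All-skin mask with a 20x20 non-skin patch (e.g. an exposed-nose test case):
mask = np.full((100, 100), 255, dtype=np.uint8)
mask[40:60, 40:60] = 0
ratio = nonskin_ratio(mask, (50, 50))  # 400 non-skin px in a 900 px window
```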

4) Find the optimal classification threshold. A threshold is set for the proportion of non-skin pixels in the nose tip area of all images in the training set. If the proportion exceeds this threshold, the mask is deemed properly worn; otherwise, it is deemed improperly worn. When the threshold is set too low, the recognition accuracy of the correct category is high and that of the incorrect category is low; the opposite holds when the threshold is set too high. Therefore, it is necessary to find the optimal classification threshold, at which the recognition accuracies for the two types of images are roughly equal.
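One way to implement this search: sweep candidate thresholds and keep the one where the two per-class accuracies are closest. The candidate range below is an assumption taken from the 20%-40% interval examined in Section 3.4.1:

```python
import numpy as np

def best_threshold(ratios_well, ratios_wrong,
                   candidates=np.arange(0.20, 0.41, 0.01)):
    """ratio > t predicts mask_well; return the candidate threshold where
    the per-class accuracies on the training ratios are most nearly equal."""
    best_t, best_gap = None, float("inf")
    for t in candidates:
        acc_well = np.mean(np.asarray(ratios_well) > t)    # correct-category accuracy
        acc_wrong = np.mean(np.asarray(ratios_wrong) <= t)  # incorrect-category accuracy
        gap = abs(acc_well - acc_wrong)
        if gap < best_gap:
            best_t, best_gap = t, gap
    return best_t

# Toy ratios: masked noses show high non-skin proportions, exposed noses low.
t_opt = best_threshold([0.5, 0.6, 0.7], [0.1, 0.2, 0.3])
```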

5) Detect with the mask wearing correctness detection algorithm. In practical application, a face image wearing a mask is input, and steps 2 and 3 are performed. If the proportion calculated in step 3 is greater than the optimal classification threshold, the mask is judged to be worn correctly; otherwise, it is judged to be worn incorrectly.

3 Experiment and analysis

3.1 Evaluation indexes

The evaluation metrics of the proposed mask detection algorithm are shown in Table 1.

Table 1 Evaluation index of proposed algorithm

In Table 1, n_tp is the number of samples judged positive that are indeed positive; n_fp is the number judged positive but actually negative; n_fn is the number judged negative but actually positive; n_tn is the number judged negative that are indeed negative. M_C is the confusion matrix: each column represents a predicted category, and the column total is the number of samples predicted as that category; each row represents a true category, and the row total is the number of samples in that category. In addition to the indicators shown in the table, several others are used: P is the number of samples judged positive; N is the number of samples judged negative; F_accuracy is the proportion of samples whose predicted value is consistent with the true value among all samples; F_precision is the proportion of samples whose true value is positive among those predicted positive; F_recall is the proportion of samples predicted positive among those whose true value is positive; F_F-score is the harmonic mean of F_precision and F_recall.
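These definitions translate directly into code, using the section's notation (the counts below are illustrative, not the paper's):

```python
def metrics(n_tp, n_fp, n_fn, n_tn):
    """F_accuracy, F_precision, F_recall and the F-score (harmonic mean
    of precision and recall) from the four confusion-matrix counts."""
    total = n_tp + n_fp + n_fn + n_tn
    accuracy = (n_tp + n_tn) / total
    precision = n_tp / (n_tp + n_fp)
    recall = n_tp / (n_tp + n_fn)
    f_score = 2 * precision * recall / (precision + recall)
    return accuracy, precision, recall, f_score

# Illustrative counts: 50 TP, 10 FP, 5 FN, 35 TN
acc, prec, rec, f1 = metrics(50, 10, 5, 35)
```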

3.2 Dataset

In this study, the algorithm uses the Wider Face dataset and part of the MAFA dataset for experiments. There are 3 983 images of faces without masks collected from Wider Face and 3 909 images of faces with masks collected from MAFA, for a total of 7 892 images. Correspondingly, there are two types of labels, mask and no_mask; 6 314 images are used for training and 1 578 for testing. The specific allocation is shown in Table 2. Images that are labeled as wearing a mask and correctly predicted are used as the dataset for testing whether the mask is properly worn.

Table 2 Specific allocation of training set and testing set

3.3 Experiment of mask detection algorithm

3.3.1 Finding optimal SVM classifier parameters

There are four commonly used SVM kernel functions: (1) the linear kernel; (2) the polynomial kernel; (3) the radial basis function (RBF) kernel; (4) the sigmoid kernel. A suitable kernel function not only improves the classification performance of the SVM classifier, but also improves its generalization ability and reduces the time overhead of modeling and classification. The RBF kernel has become the preferred kernel function due to its small number of parameters, high generality, and high classification accuracy. In this study, the RBF kernel is used as the kernel function of the SVM, and the penalty parameter C of the SVM and the parameter g of the RBF kernel are taken as the set of parameters to be optimized. The experimental procedure is as follows.

1) HSV+HOG features extracted from the training set data are input into SVM for training.

2) The parameters of the grid search algorithm are set. The penalty parameter C takes the value space [10.5, 20] with a search step of 0.5; the RBF kernel parameter g takes the value space [0.05, 1] with a search step of 0.05.

3) With the parameters set in step 2, the parameters C and g are optimized by three-fold cross validation (3-CV).

The 3D surface plotted according to the classification accuracy of different parameter combinations is shown in Fig.12.

Fig.12 Parametric optimization graphs

According to the experimental results, the highest classification accuracy of the final output SVM classifier is 98.7%, with the corresponding C being 11.5 and g being 0.35.

3.3.2 Testing performance of SVM classifier

The performance of the SVM classifier is tested on the test set, and the experimental results are represented by the confusion matrix shown in Fig.13.

Fig.13 Confusion matrix of mask detection algorithm

F_accuracy = (n_tp + n_tn)/(P + N) = 0.979,
F_precision = n_tp/(n_tp + n_fp) = 0.985,
F_recall = n_tp/(n_tp + n_fn) = 0.97.

According to the confusion matrix, the recognition accuracies of the mask and no-mask categories reach 97% and 98.7%, respectively, and the overall accuracy reaches 97.9%, which meets the standard of practical application. Sample recognition results are shown in Fig.14, in which faces without masks are marked with thin-line boxes and faces with masks with thick-line boxes.

Fig.14 Experimental results of mask detection algorithm

3.4 Experiment of mask wearing correctness detection algorithm

3.4.1 Finding optimal classification threshold

The threshold is set to a series of different values, the recognition accuracy under each threshold is calculated, and a line chart is drawn as shown in Fig.15. From the figure, it can be seen that as the threshold increases from 20% to 40%, the detection accuracy of the correct category gradually decreases while that of the incorrect category gradually increases, and the two curves intersect near 29%. To appropriately improve the detection accuracy of the incorrect category, the threshold is set to 30%.

Fig.15 Detection accuracy at different thresholds

3.4.2 Testing whether mask is properly worn

The test set for detecting whether the mask is worn properly is composed of face pictures whose predicted and true labels are both mask in the SVM classifier performance test. A total of 450 images are selected: 250 images of the correct wearing category, labeled mask_well, and 200 images of the incorrect wearing category, labeled mask_wrong. The detection results are represented by the confusion matrix shown in Fig.16.

Fig.16 Confusion matrix of mask wearing correctness detection algorithm

F_accuracy = (n_tp + n_tn)/(P + N) = 0.8755,
F_precision = n_tp/(n_tp + n_fp) = 0.894,
F_recall = n_tp/(n_tp + n_fn) = 0.88.

According to the confusion matrix, the recognition accuracies of the correct and incorrect categories reach 88.0% and 87.0%, respectively, and the overall accuracy reaches 87.55%, which basically meets the standard of practical application. Some recognition examples are shown in Fig.17, in which a thick-line box indicates proper wearing and a thin-line box indicates improper wearing.

Fig.17 Experimental results of mask wearing correctness detection algorithm

4 Conclusions

To address the problem of automatically detecting whether a person is wearing a mask properly, a face mask detection algorithm based on HSV+HOG features and SVM is proposed. Firstly, the algorithm uses RetinaFace for face detection and facial feature point localization. Then, the feature points are used to locate the mouth and nose region, and the HSV+HOG features of this region are extracted and used to train an SVM to distinguish between faces with and without masks. For images in which a mask is detected, the algorithm first uses RetinaFace to locate the nose region, and then calculates the exposure of skin in this region from the skin detection result of the YCrCb elliptical skin color model to determine whether the mask is worn properly. Experiments show that the proposed face mask detection algorithm has high detection accuracy and meets the requirements of practical applications.