
A Machine Learning Approach for Expression Detection in Healthcare Monitoring Systems

Computers, Materials & Continua, 2021, Issue 5 (published 2021-12-16)

Muhammad Kashif, Ayyaz Hussain, Asim Munir, Abdul Basit Siddiqui, Aaqif Afzaal Abbasi, Muhammad Aakif, Arif Jamal Malik, Fayez Eid Alazemi and Oh-Young Song

1 Department of Computer Science & Software Engineering, International Islamic University, Islamabad, 44000, Pakistan

2 Department of Computer Science, Quaid-i-Azam University, Islamabad, 44000, Pakistan

3 Department of Computer Science, Capital University of Science & Technology, Islamabad, 44000, Pakistan

4 Department of Software Engineering, Foundation University, Islamabad, 44000, Pakistan

5 Department of Computer Science, Abdul Wali Khan University, Mardan, 23200, Pakistan

6 Department of Computer Science and Information Systems, College of Business Studies, PAAET, 12062, Kuwait

7 Department of Software, Sejong University, Seoul, 05006, Korea

Abstract: Expression detection plays a vital role in determining a patient's condition in healthcare systems. It helps monitoring teams respond swiftly in case of emergency. Due to the lack of suitable methods, results are often compromised in an unconstrained environment because of pose, scale, occlusion and illumination variations in the image of the patient's face. A novel patch-based multiple local binary patterns (LBP) feature extraction technique is proposed for analyzing human behavior using facial expression recognition. It consists of three-patch (TPLBP) and four-patch (FPLBP) LBP-based feature engineering. The image representation is encoded from local patch statistics using these descriptors. Contrary to pixel-based methods, TPLBP and FPLBP capture information encoded as short bit strings that reflect similarities between adjacent patches of pixels. The coded images are transformed into the frequency domain using the discrete cosine transform (DCT). The most discriminant features extracted from the coded DCT images are combined to generate a feature vector. Support vector machine (SVM), k-nearest neighbor (KNN), and Naïve Bayes (NB) classifiers are used for the classification of facial expressions using the selected features. Extensive experimentation is performed to analyze human behavior on the standard extended Cohn-Kanade (CK+) and Oulu-CASIA datasets. Results demonstrate that the proposed methodology outperforms the techniques used for comparison.

Keywords: Detection; expressions; gestures; analytics; pain; patch-based local binary descriptor; discrete cosine transform; healthcare

1 Introduction

In the last two decades, extensive research has been done in the field of facial expression recognition (FER). Facial expression is the most expressive way of communication among humans and is generally categorized into seven basic expression types: anger, disgust, fear, happiness, sadness, surprise, and neutral [1]. Human behavior detection through recognition of expressions plays a very important role in human-computer interaction and has attracted much attention in the areas of surveillance, healthcare, forensics, missing-individual identification, crime investigation, interactive games, intelligent transportation and many other applications [2]. Facial expression recognition is a problem related to pattern recognition and computer vision, where a two-dimensional image of the face is acquired to extract features for classification. Thorough research has been carried out in recent years and several techniques have been proposed to achieve better performance. These techniques usually produce good results in a constrained environment, but identifying expressions from images captured in an unconstrained environment remains challenging due to variations in resolution and illumination in certain datasets. The proposed method is able to deal with images that resemble real-world images in terms of these factors and to estimate the expression adequately.

In facial expression recognition, the main step is feature extraction from still images or video frames to obtain the appearance-based and geometric-based variations that map to a target facial expression [3]. This paper investigates the use of multiple LBPs (TPLBP, FPLBP) in facial expression recognition. Coded images are obtained after the extraction of features using TPLBP and FPLBP. In the next step, these features are converted into the frequency domain using DCT. The most discriminative features are computed to analyze emotions from both forms of coded DCT images to obtain a fused feature vector. SVM, KNN and NB classifiers are used to classify the emotions on the publicly available databases CK+ [4-10] and Oulu-CASIA [11-13]. The proposed technique gives better performance in terms of accuracy and robustness than the techniques presented in the literature survey.

This research paper is organized into five sections. Section 2 briefly reviews the relevant literature. The proposed technique is presented in Section 3. Section 4 describes the experimental results and discussion. Finally, Section 5 presents the conclusion and future work.

2 Related Work

Texture-based and appearance-based descriptors are popular and are extensively used for multi-scale facial expressions, such as principal component analysis (PCA) for dimensionality reduction, linear discriminant analysis (LDA) for feature selection, and local binary patterns (LBP) [14,15]. LBP [16-19] is a successful feature extraction technique for emotion recognition and other image processing applications. LBP is calculated for each pixel in an image by involving the neighbors around that pixel. It produces a binary number from the 8 neighbors by thresholding their values against the central pixel, and the binary values are used to generate histograms representing the appearance-based regions. Many LBP variations have been investigated in previous related work to resolve the issues of illumination, multi-scale, and high-dimension variations in FER. Automatic emotion recognition is investigated in [20,21] using the Weber local descriptor (WLD) technique for frontal and spontaneous images and implemented in an e-healthcare environment. The WLD histogram features are selected using the Fisher discriminant ratio (FDR) and classified through a support vector machine (SVM). The WLD-based system gives better performance on the JAFFE and Cohn-Kanade databases. Guo et al. [22] proposed an enhanced deep learning hybrid CNN-BiLSTM (EJH-CNN-BiLSTM) algorithm to detect pain intensity using facial expressions. A fine-tuned VGG-Face pre-trained network is used as a feature extraction tool on the balanced UNBC-McMaster Shoulder Pain Archive database. Principal component analysis was applied to reduce dimensionality and enhance efficiency. The algorithm is used to estimate four different levels of pain, and the results show that it is a potential tool for automatic pain detection in medical diagnostics.

A novel algorithm called online sequential extreme learning machine with spherical clustering (OSELM-SC) is proposed by Muhammad et al. [23]. In this approach, different techniques are applied to the original face images: the Viola-Jones detector for face detection and cropping, and histogram equalization for illumination variations. Features are extracted by applying the curvelet transform to every region of the face image, and statistical features are then obtained through the mean, standard deviation, and entropy. The features are classified through the proposed OSELM algorithm, and the best performance is achieved for frontal and spontaneous images. A facial decomposition technique based on the IntraFace (IF) algorithm is presented that uses landmarks to compute regions of interest (ROI). Texture- and shape-based LTP, HOG, LBP, and CLBP features are extracted and classified through SVM; this technique achieves a better recognition rate than other decomposition-based techniques. In [24], appearance-based and geometric-based features are extracted by computing local face regions and then combined. LBP and normalized central moments (NCM) are used as features. The technique compares local region-based and grid-based holistic representations; the local region-based method performs better than the grid-based representation after applying the SVM classifier to the features. The histogram of oriented gradients (HOG) and the most discriminant discrete cosine transform (DCT) features are extracted in [25]. The system is accurate and reliable in handling illumination variation and multi-scale (resolution variance) problems, and achieves better performance on MMI and CK+ dataset images with features classified by KNN, sequential minimal optimization (SMO), and random forest (RF). Donia et al. [26] proposed a framework based on a deep sparse network that automatically recognizes expressions. This technique is purely feature-based: active appearance model (AAM), principal component analysis (PCA), and histogram of oriented gradients (HOG) features are extracted and used as input to a deep sparse coding network, the deep sparse auto-encoder (DSAE), which gives better performance in terms of accuracy. Weber's local binary image cosine transform (WLBI-CT) has been proposed in [27]. This technique investigates multi-orientation and multi-scale face images. The frequency components of images are computed through local binary patterns (LBP) and Weber local descriptors (WLD); the WLD and LBP face image features are obtained through DCT with and without orientation, respectively. These feature vectors are then evaluated with different classifiers, including Naïve Bayes (NB), sequential minimal optimization (SMO), multilayer perceptron (MLP), K-nearest neighbors (KNN), and classification trees. The main aim of the research in [28] is to recognize the basic emotions. Microsoft Kinect is used for 3D face modeling, representing the face with 121 specific points plotted in coordinates by the Kinect device. Six action units of the FACS system are used as features to describe the emotions and are classified through KNN and MLP. Min Guo et al. proposed the K-ELBP algorithm by integrating the extended local binary pattern (ELBP) and the Karhunen-Loeve transform (KLT) [29]. ELBP is used to keep the uniform patterns and discard the others, while KLT is used to reduce the dimensionality. K-ELBP histograms are obtained from the segmented blocks, and a multi-class SVM is applied to the combined histogram to recognize expressions.

A salient geometric feature-based framework is presented by Ekman et al. [29] for an automatic FER system. The elastic bunch graph matching (EBGM) algorithm and the Kanade-Lucas-Tomasi (KLT) tracker are used for facial point initialization and tracking, respectively. Three types of geometric features, based on points, lines, and triangles, are extracted from the tracked facial points. The discriminant line and triangle features are selected through the extreme learning machine (ELM) and AdaBoost, while the point-based features are used directly; SVM is used to classify the selected features. The best performance is obtained across multiple datasets. To improve the power of learned deep features, a novel island loss technique is implemented for a convolutional neural network (CNN) [30]. The island loss CNN (IL-CNN) is used to reduce the intra-class variations that arise from head pose changes, occlusions and illumination variations; it outperforms the baseline CNN as well as other techniques on the CK+ dataset for seven expression classes and on the Oulu-CASIA dataset. To enhance healthcare services in smart cities, a FER technique [30] is proposed that extracts sub-bands by applying the bandlet transform to the face image. The weighted center-symmetric local binary pattern (CS-LBP) is computed for every sub-band in a block-wise manner, and the feature vector is formed by combining the CS-LBP histograms.

The most dominant features are extracted and classified by a support vector machine (SVM) and a Gaussian mixture model; the performance of this technique is better on the JAFFE and CK datasets [30]. A novel FER system with a support vector machine (FERS) is proposed by Bargshady et al. [31]. Faces are detected in the image by combining the self-quotient image (SQI) filter and Haar-like features, where the SQI filter is used to overcome light variations. Features are computed from the faces using angular radial transforms (ART), the discrete cosine transform (DCT), and Gabor filters (GF). The support vector machine (SVM) classifies the features and gives the best recognition rate for training and testing patterns. A "simultaneous feature and dictionary learning" (SFDL) technique is proposed for sets of face images, in which the images of each training and testing set are captured with different illumination and pose variations. The SFDL method operates on raw face pixels and learns the features and dictionaries; in the first stage of the learning procedure, the facial image sets are manipulated together. The deep SFDL (D-SFDL) method is proposed for non-linear face samples of image sets, learning both class-specific dictionaries and hierarchical non-linear transformations. In another approach, a shallow module is executed to extract the most discriminant information from the global and local regions to learn the low-level features, and a part-based module is then constructed to extract and learn dynamic local region information related to facial expressions. Long short-term memory (LSTM) and gated recurrent unit (GRU) layers are used to learn long-term dependencies. Extensive experiments show that this technique gives better performance on the CK+ and Oulu-CASIA datasets.

3 Proposed Patch-Based Multiple Descriptors Technique

The proposed technique uses novel patch-based multiple LBP descriptors, TPLBP and FPLBP. The image representation is computed from local patch values through these descriptors, which encode the properties of the local micro-texture around every pixel using short binary strings. Three-patch LBP and four-patch LBP [31] are implemented to encode the most discriminative types of local texture-based information. The technique consists of four steps. In the first step, faces are detected and cropped to overcome multi-scale variations using the Viola-Jones algorithm, as described in Section 3.1. The feature engineering and fusion step extracts texture-based features through multiple LBP-based techniques; the discriminative features are then obtained through DCT and fused to form the feature vector, as described in Section 3.2. Finally, the classification step, described in Section 3.3, identifies the expression type. The architecture of the proposed patch-based multiple descriptors technique is shown in Fig. 1.

Figure 1: Architecture of the proposed patch-based multiple descriptors technique

3.1 Preprocessing

The input images are preprocessed using the Viola-Jones algorithm to detect and crop the face from the entire image, as shown in Fig. 2. The faces are converted to grayscale if the images are in RGB form. Preprocessing is performed because of the variance in image resolution; all the data are uniformly mapped to 384×288 for the CK+ dataset and 65×65 for the Oulu-CASIA dataset. The features are then extracted and encoded in the feature engineering and fusion step.

Figure 2: (a) Input image (b) face detection using the Viola-Jones algorithm
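As a concrete illustration, the following sketch detects, crops, converts and resizes a face using OpenCV's bundled Haar cascade, an implementation of the Viola-Jones detector; the input path, cascade file, and detection parameters are illustrative assumptions rather than the paper's settings.

```python
import cv2

# Haar-cascade face detector bundled with OpenCV (an implementation of
# the Viola-Jones algorithm); 'face.jpg' is a placeholder input path.
cascade = cv2.CascadeClassifier(
    cv2.data.haarcascades + 'haarcascade_frontalface_default.xml')

img = cv2.imread('face.jpg')
gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)        # RGB -> grayscale

for (x, y, w, h) in cascade.detectMultiScale(gray, scaleFactor=1.3, minNeighbors=5):
    face = gray[y:y + h, x:x + w]                   # crop the detected face
    face = cv2.resize(face, (384, 288))             # CK+ target size as (width, height)
    cv2.imwrite('face_cropped.png', face)
```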

3.2 Feature Engineering and Fusion

This section describes the three-patch LBP and four-patch LBP (TPLBP, FPLBP) feature extraction mechanisms and the encoding process performed on the pre-processed images.

3.2.1 Three-Patch LBP (TPLBP) Coding

TPLBP and FPLBP work like the simple local binary pattern (LBP) [19] technique but are extended to a patch-based version. Three patch values are compared with each other to produce a single value that is assigned to every pixel in the image, forming the TPLBP coded image. For every pixel in the input face image, a w×w patch centered on that pixel is considered. S additional patches are distributed uniformly around the central patch on a circle of radius r, and each pair of patches lying z patches apart along the circle is compared with the central patch. A single bit is set according to which of the two patches is closer to the central patch, resulting in an S-bit code per pixel. The following formula is executed for every pixel in the image to produce the three-patch LBP coding.
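Following the standard three-patch LBP formulation, and in the notation defined below, Eq. (1) can be written as

TPLBP_{r,S,w,z}(p) = \sum_{i=0}^{S-1} f\left( d(Y_i, Y_p) - d(Y_{i+z \bmod S}, Y_p) \right) 2^i \qquad (1)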

In Eq. (1), Y_p denotes the central patch, and Y_i and Y_{i+z mod S} denote the pairs of patches along the ring. The distance between any two patches p1 and p2 is given by d(p1, p2), e.g., the L2 norm of their gray-level differences. The function f is formulated as:
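In the standard formulation, f is the threshold function

f(x) = \begin{cases} 1, & x \ge t \\ 0, & x < t \end{cases} \qquad (2)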

Some stability for uniform regions is provided by the value t in Eq. (2) (e.g., t = 0.01). The processing speed is increased by obtaining the patches through nearest-neighbor sampling instead of interpolating their values.

A coded image is produced by encoding the input image similarly to the CSLBP descriptor [24]. The coded image is split into a grid of non-overlapping regions, and for every region a histogram is computed that measures the frequency of each binary value. Each regional histogram is normalized to unit length, its values are truncated at 0.2, and it is normalized to unit length again. A single vector is formed by concatenating the histograms generated for an image.
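A minimal sketch of the TPLBP coding step, assuming S = 8 ring patches (so each pixel receives an 8-bit code), nearest-neighbor sampling of ring positions, and gray levels scaled to [0, 1] so that t ≈ 0.01 is meaningful; the parameter defaults are illustrative, not the paper's settings.

```python
import numpy as np

def tplbp(img, S=8, w=3, r=2, z=1, t=0.01):
    """Three-patch LBP coding of a grayscale image (illustrative defaults).

    For each pixel, the w x w patch centred on it is compared with S
    patches spaced uniformly on a ring of radius r; bit i encodes which
    of patches i and (i + z) mod S is closer to the central patch,
    i.e., Eqs. (1)-(2) above.
    """
    img = img.astype(np.float64) / 255.0       # scale to [0, 1] so t ~ 0.01 is meaningful
    h, w_img = img.shape
    half, pad = w // 2, r + w // 2
    padded = np.pad(img, pad, mode='edge')

    def patch(y, x):                            # w x w patch at padded coords (y, x)
        return padded[y - half:y + half + 1, x - half:x + half + 1]

    angles = 2 * np.pi * np.arange(S) / S       # nearest-neighbour ring offsets
    ring = [(int(round(r * np.sin(a))), int(round(r * np.cos(a)))) for a in angles]

    code = np.zeros((h, w_img), dtype=np.uint8)
    for y in range(h):
        for x in range(w_img):
            yc, xc = y + pad, x + pad
            cp = patch(yc, xc)
            bits = 0
            for i in range(S):
                j = (i + z) % S
                d1 = np.linalg.norm(patch(yc + ring[i][0], xc + ring[i][1]) - cp)
                d2 = np.linalg.norm(patch(yc + ring[j][0], xc + ring[j][1]) - cp)
                bits |= int(d1 - d2 >= t) << i  # f of Eq. (2)
            code[y, x] = bits
    return code
```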

3.2.2 Four-Patch LBP Coding

In four-patch LBP, two circles of radii r1 and r2 are centered on each pixel of the input image, and S patches of size w×w are spread uniformly along each circle. Center-symmetric pairs of patches on the inner circle are compared with center-symmetric pairs on the outer circle located z patches apart along the circle; for each comparison, the pair with the higher similarity defines one bit of the pixel's code. The S/2 center-symmetric pairs obtained from the S patches along each circle determine the length of the binary code [14]. The coded image is computed by executing this two-step process, and the following equation computes the FPLBP coded image.
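Following the standard four-patch LBP formulation, with Y^1_i and Y^2_i denoting the patches along the inner and outer circles respectively, the code can be written as

FPLBP_{r_1,r_2,S,w,z}(p) = \sum_{i=0}^{S/2-1} f\left( d(Y^1_i, Y^2_{i+z \bmod S}) - d(Y^1_{i+S/2}, Y^2_{i+S/2+z \bmod S}) \right) 2^i \qquad (3)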

Figure 3: (a,b) Original images from the CK+ and Oulu-CASIA datasets (c,d) TPLBP coded images of the CK+ and Oulu-CASIA datasets (e,f) FPLBP coded images of the CK+ and Oulu-CASIA datasets

3.2.3 Discrete Cosine Transform (DCT)

DCT is a technique that describes an image as a sum of sinusoids of varying magnitudes and frequencies. For feature extraction, the DCT-II is implemented in the proposed approach. The two-dimensional (2-D) DCT is applied to an input image and converts it into a matrix of DCT coefficients of the same size as the input image. The most significant information is stored in just a few coefficients in the top-left corner of the transform output, the low frequencies. These are extracted in a zigzag manner and the high frequencies are discarded, as shown in Figs. 3c and 3d. For this reason, the DCT is widely used in image compression applications such as JPEG. For an X×Y input image, the DCT is computed by the following equation:
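In the standard 2-D DCT-II formulation, with \alpha(0) = \sqrt{1/X} and \alpha(x) = \sqrt{2/X} for x > 0 (and analogously for \alpha(y) with Y):

F(x,y) = \alpha(x)\,\alpha(y) \sum_{u=0}^{X-1} \sum_{v=0}^{Y-1} f(u,v) \cos\left[ \frac{(2u+1)x\pi}{2X} \right] \cos\left[ \frac{(2v+1)y\pi}{2Y} \right] \qquad (4)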

where f(u,v) denotes the image intensity and F(x,y) denotes the computed DCT coefficients in 2-D matrix form.

3.2.4 Feature Fusion

The TPLBP and FPLBP codes are generated and features are extracted through DCT separately for each of these methods in the feature engineering phase. The features are then fused to construct a feature vector in a zigzag manner. The zigzag function takes a matrix and a required number of features, such as 64 (8×8), as input and returns a one-dimensional array containing the results of the zigzag scan. For example, it stores the value of the first pixel and proceeds in alternating right and down directions until the 8×8 block is covered, as shown in Figs. 4d and 4h. The same process is repeated for all images coded through TPLBP and FPLBP, respectively, to form a concatenated feature vector of 128 values (64 for each coded image), as shown in Tab. 1.

Figure 4: (a,e) Original images from the CK+ and Oulu-CASIA datasets (b,f) TPLBP coded images of the original images (c,g) DCT output of the coded images (d,h) Feature extraction in zigzag manner from the DCT output
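A minimal sketch of the zigzag extraction and fusion, assuming a JPEG-style zigzag ordering and an orthonormal 2-D DCT computed with SciPy; the function names and scan direction are illustrative assumptions.

```python
import numpy as np
from scipy.fftpack import dct

def zigzag(mat, k=64):
    """Return the first k entries of mat read in JPEG-style zigzag order."""
    h, w = mat.shape
    order = sorted(((i, j) for i in range(h) for j in range(w)),
                   key=lambda q: (q[0] + q[1],                    # diagonal index
                                  q[0] if (q[0] + q[1]) % 2 else q[1]))
    return np.array([mat[i, j] for i, j in order[:k]])

def dct2(a):
    """Orthonormal 2-D DCT-II applied along rows, then columns."""
    return dct(dct(a, axis=0, norm='ortho'), axis=1, norm='ortho')

def fused_vector(tplbp_coded, fplbp_coded, k=64):
    """Concatenate the k leading zigzag DCT coefficients of each coded
    image into the 2k-dimensional (here 128-dimensional) fused vector."""
    return np.concatenate([zigzag(dct2(tplbp_coded.astype(float)), k),
                           zigzag(dct2(fplbp_coded.astype(float)), k)])
```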

3.3 Classification

The features encoded and extracted in the feature engineering step and combined in the feature fusion step form the feature vector for each dataset. SVM, KNN and SMO classifiers [31] are used to classify the facial expressions using the selected features.
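For illustration, a linear-kernel SVM (as used in the experiments below) can be trained on the fused vectors. The data here are random stand-ins for the 128-dimensional features, and the 10-fold cross-validation protocol is an assumption, as the paper does not state its evaluation protocol.

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.model_selection import cross_val_score

# Random stand-ins for the 128-dim fused vectors and seven expression labels;
# in practice X would come from fused_vector() applied to every dataset image.
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 128))
y = rng.integers(0, 7, size=300)

svm = SVC(kernel='linear')                       # linear-kernel SVM, as in Tab. 5
print(cross_val_score(svm, X, y, cv=10).mean())  # assumed 10-fold cross-validation
```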

4 Experimental Results and Discussion

Experiments have been performed on two facial expression datasets: the extended Cohn-Kanade (CK+) dataset and a subset of the Oulu-CASIA dataset. Details of the datasets are given in Tab. 2. The proposed technique has been assessed using performance measures such as precision, recall, accuracy, specificity, sensitivity, and F-score. To determine the efficiency of the proposed technique, comparisons have been performed with existing methods on the included datasets.

Table 1: Concatenated feature vector record

Table 2: Specifications of facial expression datasets

4.1 Experiments on the Extended Cohn-Kanade (CK+) Dataset

Initially, experiments are performed on the CK+ dataset according to the descriptions in Tab. 3; the distribution of images from the dataset is shown in Tab. 4. Fig. 5 shows the performance of multiple classifiers, namely SVM, KNN, and SMO, on the CK+ dataset.

Table 3: Contents of facial expression datasets

Tab. 5 shows the recognition rate of the linear-kernel SVM classifier. Values in bold indicate the best recognition cases for each class of the CK+ dataset. The recognition rate is high, but anger is misclassified as disgust and happy.

Tab. 6 shows the recognition rate when the KNN classifier is used. The recognition rate is good, but anger and disgust are confused with fear, and disgust is misclassified as happy.

Table 4: Number of CK+ images per expression

Figure 5: Performance metric values using the CK+ dataset for multiple classifiers

Table 5: Confusion matrix for the linear-kernel SVM classifier using the CK+ database

Tab. 7 shows the recognition rate for the SMO classifier. The recognition rate is high, but anger is confused with sad, while disgust and sad are misclassified as anger.

A comparison of the proposed technique with techniques from the literature on the CK+ dataset is presented in Tab. 8. One compared technique generates histogram of oriented gradients features along with the discrete cosine transform; this hybrid technique performs better when images with multi-scale and illumination variations are used. Another obtains local binary pattern (LBP) and Weber local descriptor (WLD) features from the face images together with a DCT-based feature extraction mechanism for classification, but it is observed that this technique is not robust to noise. Tab. 8 shows that the performance of the proposed system is comparable with other relevant techniques, and the results demonstrate the effectiveness of the proposed technique.

Table 6: Confusion matrix for the KNN classifier using the CK+ database

Table 7: Confusion matrix for the SMO classifier using the CK+ database

Table 8: Comparison of the proposed technique with existing techniques using the CK+ dataset

4.2 Experiments on Oulu-CASIA Dataset

The performance of the proposed technique is also evaluated on a subset of the Oulu-CASIA dataset. The dataset was captured with two camera types, NIR (near-infrared) and VL (visible light), to produce image sequences, under different illumination conditions including dark, strong, and weak light. The full dataset contains image sequences of 80 different persons. Data of 20 persons captured with the VL camera are used here, with occlusion (with glasses) and without occlusion (without glasses), separately. For each of the 20 persons, image sequences under dark, strong and weak light are combined for each expression, in both the with-occlusion and without-occlusion cases.

4.2.1 Experiments on Oulu-CASIA with Occlusion

The experiments are evaluated on a combination of dark, strong, and weak illumination frontal and spontaneous images with glasses. Fig. 6 shows the average accuracy, sensitivity, specificity, precision, recall and F-score using multiple classifiers, namely SVM, KNN and SMO.

Figure 6: Performance of multiple classifiers using the Oulu-CASIA dataset with occlusion

Tab. 9 shows the recognition rate of the linear-kernel SVM classifier on the Oulu-CASIA dataset with occlusion. The recognition rate is good, but disgust is misclassified as anger, while happiness and surprise are confused with fear.

Table 9: Confusion matrix for the linear-kernel SVM using the Oulu-CASIA with-occlusion dataset

Tab. 10 shows the recognition rate for the KNN classifier. Bold values indicate the best recognition rates on the Oulu-CASIA with-occlusion dataset. The recognition rate is good, but disgust is misclassified as fear and happiness, while fear is confused with surprise.

Table 10: Confusion matrix for the KNN classifier using the Oulu-CASIA with-occlusion dataset

Tab. 11 shows the recognition results of the SMO classifier on the Oulu-CASIA with-occlusion dataset. The confusion matrix indicates that disgust is confused with anger and sad, while fear is classified as surprise. Sad is also confused with anger, disgust and fear.

Table 11: Confusion matrix for the SMO classifier using the Oulu-CASIA with-occlusion dataset

4.2.2 Experiments on Oulu-CASIA Without Occlusion

The experiments are performed on the dark, strong, and weak illumination frontal and spontaneous images without glasses, combined as one dataset. Fig. 7 shows the average accuracy, sensitivity, specificity, precision, recall, and F-score using multiple classifiers.

Figure 7: Performance of multiple classifiers using the Oulu-CASIA dataset without occlusion

Tab. 12 shows the results of the linear-kernel SVM classifier on the Oulu-CASIA without-occlusion dataset. The recognition rates show that anger is confused with disgust.

Table 12: Confusion matrix for the linear-kernel SVM using the Oulu-CASIA without-occlusion dataset

Tab. 13 describes the results obtained with the KNN classifier on the Oulu-CASIA dataset without occlusion.

Table 13: Confusion matrix for the KNN classifier using the Oulu-CASIA without-occlusion dataset

The results obtained with the SMO classifier on the Oulu-CASIA without-occlusion dataset are listed in Tab. 14. The values reveal that disgust is confused with fear and happiness.

Table 14: Confusion matrix for the SMO classifier using the Oulu-CASIA without-occlusion dataset

The average recognition rate of the proposed technique is compared with existing methods on the same Oulu-CASIA dataset. The IL-CNN technique is used to reduce intra-class variations, but its performance and accuracy are lower than those of the proposed technique. The LSTM- and GRU-based technique learns long-term dependencies. Tab. 15 shows that the performance of the proposed system is better than that of the other techniques.

Table 15: Comparison of the proposed technique with existing techniques using the Oulu-CASIA dataset

5 Conclusion and Future Work

Novel patch-based multiple LBP descriptor techniques, namely three-patch local binary patterns (TPLBP) and four-patch local binary patterns (FPLBP), have been proposed. The proposed system exploits the feature extraction ability of TPLBP and FPLBP along with DCT to overcome the issues of illumination, scale variations, high dimensionality, noisy images, and the high computational complexity of texture-based features. Multiple classifiers are used on the standard CK+ and Oulu-CASIA datasets, which contain posed and spontaneous emotions as well as illumination-variant and multi-scale face images. The proposed technique achieves a high performance rate, which is relatively difficult in situations with variations in angle and noise. The performance can be further improved with respect to these factors by using additional pre-processing techniques along with TPLBP and FPLBP.

Funding Statement: This research was supported in part by the MSIT (Ministry of Science and ICT), Korea, under the ITRC (Information Technology Research Center) support program (IITP-2020-2016-0-00312) supervised by the IITP (Institute for Information & Communications Technology Planning & Evaluation) and in part by the Faculty Research Fund of Sejong University in 2019.

Conflicts of Interest: The authors declare that they have no conflicts of interest to report regarding the present study.