
Label distribution expression recognition algorithm based on asymptotic truth value

2021-09-15

HUANG Hao,GE Hongwei

(1. School of Artificial Intelligence and Computer Science, Jiangnan University, Wuxi 214122, China; 2. Jiangsu Provincial Engineering Laboratory of Pattern Recognition and Computational Intelligence, Wuxi 214122, China)

Abstract: Ambiguous expression is a common phenomenon in facial expression recognition (FER), and it severely limits recognition performance. The reason may be that a single label cannot effectively describe the complex emotional intentions that are vital in FER. Label distribution learning carries more information and is a possible way to solve this problem. To apply label distribution learning to FER, a label distribution expression recognition algorithm based on asymptotic truth value is proposed. Without incorporating extraneous quantitative information, the original information of the database is fully used to generate and utilize the label distribution. Firstly, in the training stage, single-label learning is used to collect the mean value of the overall distribution of the data. Then, the true value of the data label is approached gradually at the granularity of the data batch. Finally, the whole network model is retrained using the generated label distribution data. Experimental results show that this method clearly improves the accuracy of the network model and is competitive with advanced algorithms.

Key words: facial expression recognition (FER); label distribution learning; label smoothing; ambiguous expression

0 Introduction

As one of the most important and accessible emotional expressions of human beings, facial expressions have been extensively studied in psychology. Facial expression recognition (FER) systems in the field of computer vision focus on automatic FER. Their major task is to recognize the facial expressions in pictures or picture sequences (including videos). This technology is believed to be of great significance to efficient human-computer interaction in the future[1], as well as to the fields of fatigue driving detection[2] and the treatment of mental diseases[3]. Li S et al.[4] concluded that there are two difficulties in deep expression recognition at this stage: network overfitting due to the lack of effective data, and difficulty in feature extraction due to a large amount of redundant information. A large number of recent studies have shown that it is not appropriate to simply regard deep expression recognition as an application of neural networks. Even with high-standard valid data and advanced network models, the accuracy of deep expression recognition is still limited by the characteristics of the expression itself. Especially in the past two years, the ambiguous expression phenomenon has received more and more attention from researchers[5-7].

We study FER on static pictures. Usually, a picture corresponds to only one label. In fact, a picture may contain complex emotional intentions. From this perspective, the labels for static expression recognition are not accurate enough. To illustrate this problem intuitively, we use real data from the FER2013 database as an example, as shown in Fig.1.

Fig.1 Picture of FER2013

Taking the above figures as an example, we briefly describe what the network learns under the three forms of one-hot label, multi-label and label distribution. These images are taken from the training set of the FER2013 database, and their labels in FER+[8] are shown in Table 1.

Table 1 Corresponding label of Fig.1

When the label conversion is carried out by majority voting, a1, a2 and a3 are classified as Angry, Neutral and Unknown, respectively. As single-label data, a1, a2 and a3 lose the information of 2, 4 and 10 labels, respectively; for figure a1, the one-hot label is in essence [0,0,0,0,10,0,0,0,0,0]. Obviously, if each annotator has the same weight, this classification is not quite fair. When figure a3 is converted into multiple labels, the data take the form [1,0,1,0,0,0,0,0,0], and the essence of what the network learns is [5,0,5,0,0,0,0,0], which also loses part of the information. The classification by the ten annotators is partly accidental, and both results are acceptable, but in practice it is not wise to ignore 20% or 10% of emotional preferences.
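The difference between the three label forms can be made concrete with a small sketch. The vote counts below are hypothetical (Table 1 is not reproduced here); the class order follows the FER+ order given in the Datasets section, and the function names are illustrative only.

```python
# FER+ label order as listed in the Datasets section.
CLASSES = ["neutral", "happy", "surprise", "sadness", "anger",
           "disgust", "fear", "contempt", "unknown", "no face"]


def one_hot_by_majority(votes):
    """Keep only the class with the most votes (ties go to the first class)."""
    winner = votes.index(max(votes))
    return [1 if i == winner else 0 for i in range(len(votes))]


def multi_label(votes):
    """Mark every class that received at least one vote."""
    return [1 if v > 0 else 0 for v in votes]


def label_distribution(votes):
    """Normalize the vote counts so that they sum to 1."""
    total = sum(votes)
    return [v / total for v in votes]


# Hypothetical votes from 10 annotators: 8 anger, 1 neutral, 1 sadness.
votes = [1, 0, 0, 1, 8, 0, 0, 0, 0, 0]
print(one_hot_by_majority(votes))   # the two minority votes disappear
print(multi_label(votes))           # all three classes look equally present
print(label_distribution(votes))    # the 8:1:1 proportions are preserved
```

Only the last form keeps the minority votes together with their relative weight, which is the information the single label discards.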

Label distribution learning is very promising for solving this problem. Label distribution learning (LDL) was formally proposed by Geng et al.[9], who argue that although multi-label learning can solve the problem of label ambiguity to a certain extent, the overall label distribution is more important for many practical problems. Subsequently, Gao et al.[10] proposed the deep label distribution learning (DLDL) model, which applies label distribution learning to tasks such as age prediction and head pose estimation. In the field of FER, similar work has been influential. In addition to the FER+ database relabeling work used in the previous example, there is also the work of Chen et al.[5]. They proposed the label distribution learning on auxiliary label space graphs (LDL-ALSG) framework, which uses auxiliary tasks similar to FER and an approximate K-nearest neighbor algorithm to calculate the distribution of the current data in the auxiliary task solution space of similar data, and then uses the auxiliary network for judgment. Barsoum et al.[8] used crowdsourcing to convert single-label data into label distribution data, which improves the recognition performance of deep neural networks. However, this method requires multiple people to relabel the data, which costs a lot of manpower and money. The method of Chen et al. has the disadvantage that the amount of calculation during training is extremely large, and its effect depends on a good auxiliary task model.

As far as we know, the IPA2LT framework proposed by Zeng et al.[11] is the first attempt to solve the problem of ambiguous labels. Their work is based on the recognition that facial expression data have a latent true label. That is, for a given face picture, the expression label it contains has a definite truth value (they regard ambiguous expressions as inconsistent labeling of expression labels). Inspired by this, our work is based on the following idea: for given facial expression data, there is a latent true emotional distribution, and in single-label data the label is a high-level generalization of this distribution. We use an approximation method to find the latent true expression label distribution of the data. The significant difference from the former work is that we consider the label distribution to be the more appropriate way to describe expressions. Therefore, this study acknowledges the authenticity and representativeness of the labels given by the data annotators.

1 Proposed method

1.1 Overall framework

The latent truth is a classical assumption. In essence, assuming that the data can obtain a perfect label distribution is the same as assuming that the loss function of the network can be minimized, which is an ideal situation. Although the loss function of a neural network cannot be minimized in most cases, the nature of the loss function makes it possible to optimize the network model step by step. In this process, the minimum value is never actually reached, but the expectation that the loss function approaches 0 does improve the network model. The algorithm presented in this paper attempts this kind of approximation. Although the true value of the label distribution of the data is not known, it can be approximated through a better expectation.

The overall framework of the algorithm in this paper is shown in Fig.2. The training process is as follows: first, the training data are sent to the network model and the single labels of the data are used to learn knowledge; then, during the last training iteration, the softmax output of the training data is collected as the initial value L_T of the true label distribution, and the class-average distribution L_M of each class of data is collected at the same time. The label distribution update process is as follows: the original label L_O of the data and the L_T and L_M obtained during training are considered together to update the label distribution of the training data. In the testing process, predictions are made on the test set of the database with the network trained on the label distribution data.
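As a minimal sketch of the collection step, the class-average distribution can be computed by averaging the softmax outputs of all training samples that share the same original label. This is one plausible reading of L_M, not necessarily the exact procedure used here; the function name and the toy values are assumptions for illustration.

```python
def class_average_distribution(softmax_outputs, labels, num_classes):
    """Average the softmax outputs of all samples sharing the same original label.

    Returns a num_classes x num_classes table: row c is the mean predicted
    distribution of the samples whose original single label is c.
    """
    sums = [[0.0] * num_classes for _ in range(num_classes)]
    counts = [0] * num_classes
    for probs, label in zip(softmax_outputs, labels):
        counts[label] += 1
        for k, p in enumerate(probs):
            sums[label][k] += p
    # Classes without any sample get an all-zero row.
    return [
        [s / counts[c] for s in sums[c]] if counts[c] else [0.0] * num_classes
        for c in range(num_classes)
    ]


# Toy softmax outputs (made up) for four samples and three classes.
L_T = [[0.9, 0.1, 0.0], [0.7, 0.2, 0.1], [0.2, 0.6, 0.2], [0.1, 0.1, 0.8]]
L_O = [0, 0, 1, 2]  # original single labels
L_M = class_average_distribution(L_T, L_O, num_classes=3)
print(L_M[0])  # mean distribution of the two samples labeled as class 0
```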

Fig.2 Overall framework of proposed method

1.1.1 Label update strategy

The algorithm adopts the class-average distribution L_M of every class of data in the training set as one of the references for the real label distribution, and 0.5 is taken as the lower limit of the label update strategy according to the principle of majority voting. Specifically, for a data sample in the training set, the prediction of the network model falls into one of the following three situations:

1) The classification is correct, and the predicted probability of the correct class is greater than the average expectation of that class in L_M, the overall average of the data. In this case, the classification of the data is already very accurate, and only a higher requirement can drive further improvement. The label distribution of this sample is set to the one-hot label of the data;

2) The classification is correct, but the predicted probability is lower than the average. This is the ambiguous data label that this study tries to address. The expectation expressed by the original single label is too optimistic; in fact, the distribution of facial expressions contained in the data is not so clear. The label distribution of the data is set to the corresponding value in L_M;

3) The classification is wrong, so we should not expect too much of the expression distribution label of the data; it is enough that the probability of the correct class in the next distribution approaches 0.5. It is unrealistic to require this kind of data to achieve the precise classification implied by the single label. The essence of this process is label smoothing with a larger threshold.

The formulation of the above description is given by

(1)

Fig.3 Label update of three situations
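The three situations can be sketched as a single update function. This is an illustrative reading of the textual description only: the function name is hypothetical, and in case 3 the text fixes only the 0.5 lower limit on the correct class, so spreading the remaining probability uniformly over the other classes is an assumption of this sketch.

```python
def update_label(pred, label, class_mean, floor=0.5):
    """Return a new target distribution for one training sample.

    pred       : softmax output of the network for this sample
    label      : index of the original single label (L_O)
    class_mean : row of L_M that belongs to this label
    floor      : lower limit taken from the majority-voting principle
    """
    n = len(pred)
    predicted = pred.index(max(pred))
    if predicted == label and pred[label] > class_mean[label]:
        # Case 1: correct and above the class average -> one-hot label.
        return [1.0 if k == label else 0.0 for k in range(n)]
    if predicted == label:
        # Case 2: correct but not above the class average -> class mean.
        return list(class_mean)
    # Case 3: misclassified -> smoothed label with `floor` on the true class.
    # ASSUMPTION: the remaining mass is spread uniformly over the other classes.
    rest = (1.0 - floor) / (n - 1)
    return [floor if k == label else rest for k in range(n)]


class_mean = [0.8, 0.15, 0.05]                          # toy row of L_M
print(update_label([0.9, 0.05, 0.05], 0, class_mean))   # case 1
print(update_label([0.6, 0.3, 0.1], 0, class_mean))     # case 2
print(update_label([0.2, 0.7, 0.1], 0, class_mean))     # case 3
```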

1.1.2 Asymptotic bounds

The purpose of the label update is to generate the label distribution of the data reasonably. There are two key issues in this process. The first is how to ensure the representativeness of the class-average distribution L_M, that is, how to ensure that the model has learned enough knowledge without over-fitting the training data. There is no simple solution to this problem; a suitable setting can only be found through extensive experiments. The second is at which point the output of the network model should be taken as L_T in the label update process once L_M has been fixed. Ideally, this problem could be dealt with together with the previous one, that is, the state of the network model after single-label training would be exactly the most reasonable state for the label update. In reality, however, the network model does not learn only once per iteration: it learns from every batch during training. L_M considers the average over the entire training set of the database. If the model used to fix L_M is also used to initialize L_T, it is almost certain that half of the data will fall below the L_M reference standard in the label distribution. Under the premise of authenticity and representativeness, it is obviously inappropriate to think that only half of the data are accurately labeled.

In order to solve the above two problems at the same time, a certain compromise is made on the classification ability of the network model when fixing L_M: the network model after a certain batch is used. In fact, as the number of iterations increases, the growth of the feature extraction ability of the network model slows down, so the loss in L_M caused by this treatment is acceptable. Likewise, when L_T is initialized, the network model after a certain batch is used. After multiple batches of training, the feature extraction ability of the network is further improved. If there is an expected threshold for the accuracy on the data, the network model can get as close to this threshold as possible at the granularity of the batch. One practical problem is that, if the batch size is too small, updating the entire L_T after every batch is too inefficient: the information in one batch is limited and may not change L_T at all, yet all the data must still be traversed. Undoubtedly, the training time cost of such processing is huge. The proposed algorithm therefore uses two approximations: one updates L_T once after every K batches, and the other updates only the L_T of the current batch after each batch. Both approximations give priority to updating the earlier batches each time. However, since the order of the batches is random, averaging over multiple runs can offset this defect. Furthermore, after L_M is fixed, the network model is very close to fitting, and the difference between the L_T updated in one batch and in the next is not large.
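The two approximations differ only in when, and for which samples, L_T is refreshed. The sketch below shows the schedule alone, without any network. It assumes that the update after every K batches refreshes the whole L_T while the per-batch update refreshes only the rows of the current batch; the function name and arguments are hypothetical.

```python
def update_schedule(num_samples, batch_size, k=None):
    """Yield (step, indices) pairs saying which rows of L_T to refresh.

    k=None : refresh only the rows of the current batch after every batch.
    k=int  : refresh all rows once after every k batches.
    """
    order = list(range(num_samples))  # shuffled in real training; fixed here
    batches = [order[i:i + batch_size] for i in range(0, num_samples, batch_size)]
    for step, batch in enumerate(batches, start=1):
        if k is None:
            yield step, batch
        elif step % k == 0:
            yield step, order


# Ten samples, batch size 4: three batches per pass over the data.
print(list(update_schedule(10, 4)))        # per-batch refresh
print(list(update_schedule(10, 4, k=2)))   # full refresh after every 2 batches
```

With per-batch refresh, each update touches at most `batch_size` rows, so no extra pass over the training set is needed; with the every-K-batches variant, each refresh traverses all samples.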

2 Experiments

2.1 Experimental setup

2.1.1 Datasets

To evaluate the performance of the algorithm for FER in the wild, we select several in-the-wild databases that have been popular in recent years: FER+[8], AffectNet[13] and RAF-DB[12]. The FER+ database, also written as Ferplus, is a relabeling of the FER2013 database. For each of the 35 887 pictures of FER2013, 10 labels are given by 10 annotators. The label categories are, in order: neutral, happy, surprise, sadness, anger, disgust, fear, contempt, unknown, and no face. Ferplus addresses the much-criticized low credibility of the FER2013 labels. The FER2013 database provides low-resolution grayscale images of 48×48 pixels. Even today, such low resolution remains a challenge. The RAF-DB database is the database with the most stringent production standards so far. Li et al. provided single-label data obtained by majority voting over the labels of 40 annotators for each picture. The database provides a total of 29 672 high-resolution pictures, of which 15 339 are single-label 7-category basic expressions (including neutral) and 3 954 are single-label 11-category compound expressions. In addition to these high-quality labeled data, their group also provides aligned face data uniformly processed to 100×100 pixels. Subsequent researchers can directly use the aligned data, which simplifies preprocessing. AffectNet is currently the database with the largest number of single-label data. It contains more than one million high-resolution pictures, of which about 450 000 are manually labeled with one of 11 single-label categories (a "none" label is added relative to FER+). Due to time and money costs, each of these pictures is labeled by only one annotator. The data set also provides valence and arousal labels for these manually annotated pictures, and the annotation work was carried out by 12 professional annotators. Under the basic emotion system, AffectNet labels have low credibility, and making expression recognition predictions on this database is a huge challenge.

2.1.2 Parameters and environment

Unless otherwise specified, the experimental parameters are set as follows. The network model accepts 224×224 three-channel RGB images. The SGD optimizer is used with a momentum of 0.5, the learning rate is 0.01, and the number of iterations is 20. Weight decay and learning rate decay are adopted: the weight decay coefficient is 10^-5, and the learning rate decays exponentially. The network model is deployed on one Nvidia 2080 Ti GPU using the PyTorch deep learning framework. Because the approximation requires an appropriate batch size, the batch size is set to a relatively large value of 64. The value of K is set to 50. The following experiments all take the best result within the 20 iterations as the final result.
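For reference, the settings above can be gathered into one configuration object. The sketch below is framework-independent; the class and field names are illustrative, and the decay factor `lr_gamma` is an assumption, since the text states only that the decay is exponential.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TrainConfig:
    input_size: int = 224        # 224x224 three-channel RGB input
    momentum: float = 0.5        # SGD momentum
    base_lr: float = 0.01        # initial learning rate
    weight_decay: float = 1e-5   # weight decay coefficient
    iterations: int = 20         # number of training iterations
    batch_size: int = 64
    k: int = 50                  # L_T is updated once every K batches
    lr_gamma: float = 0.9        # ASSUMPTION: decay factor not given in the text


def learning_rate(cfg, iteration):
    """Exponential decay: lr = base_lr * gamma ** iteration."""
    return cfg.base_lr * cfg.lr_gamma ** iteration


cfg = TrainConfig()
for it in (0, 1, 2):
    print(it, learning_rate(cfg, it))
```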

2.2 Analysis of hyper-parameter

In order to evaluate the influence of the hyper-parameters, we first analyze the accuracy threshold and the number of single-label training iterations, and then test and apply the method on different network models. The experimental design and analysis are as follows:

Approximate method 1: Considering the relatively large amount of calculation in approximate method 1, experiments are performed only on RAF-DB. With the number of single-label training iterations set to 3, 4 and 5, the method is tested on the ResNet18 and ResNet34 network models, and the accuracy threshold is varied from 0.7 to 1.0 with a step size of 0.05. The experimental results are shown in Tables 2 and 3.

Table 2 Experiment of ResNet18 on RAF-DB database (approximate method 1)

Table 3 Experiment of ResNet34 on RAF-DB database (approximate method 1)
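The size of this hyper-parameter sweep follows directly from the settings stated above; a short sketch (variable names are illustrative) enumerates it.

```python
from itertools import product

models = ["ResNet18", "ResNet34"]
single_label_iterations = [3, 4, 5]
# 0.70, 0.75, ..., 1.00, built from integers to avoid floating-point drift
thresholds = [t / 100 for t in range(70, 101, 5)]

grid = list(product(models, single_label_iterations, thresholds))
print(len(thresholds), len(grid))  # 7 thresholds, 42 runs in total
```

Each table therefore covers 21 runs for one network model: three iteration settings times seven thresholds.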

In order to analyse the data more intuitively, we plot the accuracy threshold on the abscissa and the accuracy on the ordinate to obtain Fig.4, which illustrates the experimental results of ResNet18 and ResNet34.

(a)ResNet18 on RAF-DB

Although the overall test accuracy fluctuates slightly as the accuracy threshold changes, the following information can still be obtained. Firstly, the experimental results on ResNet34 are significantly better than those on ResNet18. Secondly, the number of single-label training iterations should not be too low (refer to line 3 in Fig.4). The fluctuation of the data is due to the characteristics of the neural network: the batches are randomly shuffled in each training run, the random parameters obtained in each run are different, and the fitting of the network parameters during training is therefore also slightly different. The number of single-label training iterations cannot be too low because the network is then seriously under-fitted, and the initialization of L_T will contain a lot of erroneous information. Especially when the accuracy threshold is set to a large value, the error caused by the under-fitted network model is further reflected in the L_T update process. According to the original intention of the scheme design, as the accuracy threshold changes, there will be two local best values of accuracy in the experimental results, such as line 5 in Fig.5 and line 4 in Fig.5. The former best value is due to the “correction” of the data label distribution by the label update algorithm, and the latter is the joint effect of the further fitting of the network model parameters and the label update algorithm.

Fig.5 is a schematic diagram of how the accuracy changes with the accuracy threshold.

(a)ResNet18 on RAF-DB

Approximate method 2: The basic parameters and experimental settings are the same as those of approximate method 1. The experimental results are listed in Tables 4 and 5.

Table 4 Experiment of ResNet18 on RAF-DB database (approximate method 2)

Table 5 Experiment of ResNet34 on RAF-DB database (approximate method 2)

It can be seen that, for approximation method 2, the experimental results with 4 single-label training iterations have obvious advantages, especially on the ResNet34 network model. At the same time, the two result curves with 4 iterations are also in line with the conjecture of two local best values in the previous analysis. The obvious difference from approximation method 1 is that the series of experiments with 5 single-label training iterations performs poorly. It is speculated that approximation method 2 is more sensitive to over-fitted data and is more negatively affected by an over-fitted network.

In order to compare the performance of the two approximation methods visually, we take the best results of each, as shown in Fig.6. For approximation method 1, we take the series of experiments in which the number of single-label training iterations is 5; for approximation method 2, we take the series in which it is 4.

(a)Comparison on RAF-DB (ResNet18)

It can be seen that approximation method 2 is better than approximation method 1 in both the best value and the overall test accuracy, so approximation method 2 is used in the subsequent comparative experiments. In addition, because the huge amount of calculation caused by traversing the entire training set multiple times no longer needs to be considered, the batch size of the network model in training is no longer limited. The experimental results for batch sizes of 16, 32 and 64 are shown in Table 6.

Table 6 Impact of batch size

Fig.7 Influence of batch size on RAF-DB(ResNet34)

It can be seen that the algorithm performs better when the batch size is small. It is speculated that a smaller batch size gives a finer granularity of label distribution updates, which also partially explains why approximation method 1 is worse than approximation method 2: the former updates more labels each time.

All in all, the method proposed in this paper is sensitive to hyper-parameters. The hyper-parameters that achieve the best result on RAF-DB are as follows: the number of single-label training iterations is 4, the batch size is 16, the accuracy threshold is 0.9, and the network model is ResNet34. The best accuracy is 86.51%.

2.3 Comparative experiment

First, to verify the effectiveness of the proposed method, we compare its experimental results on the three databases FER+, AffectNet and RAF-DB with those of the benchmark method on ResNet34.

Table 7 shows that, after applying the proposed method, the accuracy of the network model improves by 1.35% on AffectNet, 3.46% on RAF-DB, and 4.69% on FER+.

Table 7 Comparison with baseline

Then, a comparative experiment was conducted with some advanced algorithms that have performed well in recent years, and the experimental results are shown in Table 8. On AffectNet, the proposed method is compared with DLP-CNN, EAU-Net, pACNN, and IPA2LT. The hyper-parameters that achieve the best result on AffectNet are as follows: the number of single-label training iterations is 4, the batch size is 32, the accuracy threshold is 1, and the network model is ResNet34.

Table 8 Comparison with advanced methods on AffectNet

As shown in Table 9, on the RAF-DB database, the proposed method is compared with DLP-CNN, EAU-Net, gACNN and DeepExp3D, and it achieves the best result among these advanced methods. The hyper-parameters that obtain the best result on the RAF-DB database are as follows: the number of single-label training iterations is 4, the batch size is 16, the accuracy threshold is 0.95, and the network model is ResNet34.

Table 9 Comparison with advanced methods on RAF-DB

As shown in Table 10, on FER+, the proposed method is compared with SHCNN, TFE-JL, VGG13-PLD, and ESR-9, and it achieves the best result. The best hyper-parameters on FER+ are as follows: the number of single-label training iterations is 4, the batch size is 16, the accuracy threshold is 0.95, and the network model is ResNet34.

Table 10 Comparison with advanced methods on FER+

3 Conclusions

In this work, a label distribution expression recognition algorithm based on asymptotic truth value is proposed to address the problem of ambiguous expression. In order to describe the emotional tendency in image data accurately, we use the label distribution to avoid the ambiguity caused by single-label data. We propose a simple label generation strategy and a corresponding set of training methods. For the sake of rigor, we use the overall intra-class mean and the lower bound introduced by absolute majority voting as constraints. This is the highlight of this work and also the focus of future improvement. In future research, many different attempts are needed to find a more reasonable reference standard for label generation.