
Image-Based Lifelogging: User Emotion Perspective


Computers, Materials & Continua, 2021, Issue 5

Junghyun Bum, Hyunseung Choo and Joyce Jiyoung Whang

1 College of Computing, Sungkyunkwan University, Suwon, 16417, Korea

2 School of Computing, KAIST, Daejeon, 34141, Korea

Abstract: Lifelog is a digital record of an individual's daily life. It collects, records, and archives a large amount of unstructured data; therefore, techniques are required to organize and summarize those data for easy retrieval. Lifelogging has been utilized for diverse applications including healthcare, self-tracking, and entertainment, among others. With regard to image-based lifelogging, even though most users prefer to present photos with facial expressions that allow us to infer their emotions, there have been few studies on lifelogging techniques that focus upon users' emotions. In this paper, we develop a system that extracts users' own photos from their smartphones and configures their lifelogs with a focus on their emotions. We design an emotion classifier based on convolutional neural networks (CNN) to predict the users' emotions. To train the model, we create a new dataset by collecting facial images from the CelebFaces Attributes (CelebA) dataset and labeling their facial emotion expressions, and by integrating parts of the Radboud Faces Database (RaFD). Our dataset consists of 4,715 high-resolution images. We propose the Representative Emotional Data Extraction Scheme (REDES) to select representative photos based on inferring users' emotions from their facial expressions. In addition, we develop a system that allows users to easily configure diaries for a special day and summarize their lifelogs. Our experimental results show that our method is able to effectively incorporate emotions into lifelogs, allowing an enriched experience.

Keywords: Lifelog; facial expression; emotion; emotion classifier; transfer learning

1 Introduction

Nowadays, people record their daily lives on smartphones thanks to the availability of devices with cameras and high-capacity memory cards. The proliferation of social media encourages the creation and sharing of more personal photos and videos. Users can now store and access their entire life through their smartphones [1]. As the number of photo collections preserved by individuals rapidly grows, it becomes more difficult to browse and retrieve among the unorganized collections. People often waste their time exploring vast numbers of photos trying to remember certain events or memorable moments; this accelerates the need to automatically organize large collections of personal media [2-4].

Different methods have been proposed for automatically collecting, archiving, and summarizing huge collections of lifelog data [5]. Structuring images into temporal segments (that is, separating them into events) facilitates image retrieval [6]. Metadata such as time and location are typically used to identify events in a photo collection, and then key photos are selected within events. Lifelogging helps to strengthen the memory of elderly people and dementia patients by recording the events of their daily lives and allowing them to recall what they have seen and done [5,7]. Photo-sharing social-networking services have received increased attention in recent years; smartphones and wearable cameras combined with social media extend the scope of lifelogging.

Most users prefer photos with facial expressions in their photo collections. According to a recent study [8], human presence and facial expressions are key indicators of how appealing photos are. Google Photos and Apple's iPhoto applications now automatically generate summaries based on people, places, and things. With the recent development of research on visual sentiment analysis, sentiment-based recommender systems and explanation interfaces have been studied [9]. However, few studies use emotional characteristics to select key photos (especially emotions in facial expressions such as surprise, joy, and sadness). Moreover, accurate analysis and interpretation of the emotion conveyed by human facial expressions remain a great challenge.

With the advent of deep neural networks, facial expression recognition systems have changed profoundly [10]. In particular, convolutional neural networks (CNN) have led to breakthroughs in image recognition, classification, and object tracking based on big datasets such as ImageNet [11], and are widely applied in various fields. Training a deep-learning model from scratch by collecting sufficient data is an expensive and time-consuming task. One recent alternative is transfer learning [12,13], whereby a new model is trained with pre-trained weights. Face recognition systems that used hand-crafted features such as histograms of oriented gradients (HOG) and Gabor filters have given way to systems using deep-learning features. In particular, CNN architectures have shown excellent results in the field of computer vision. Here, we develop a new facial-emotion classifier based on CNN architectures.

In the present paper, we develop a lifelogging system that extracts photos from smartphones, analyzes the users' emotions based on their faces, selects a set of representative images that show different emotions, and then outputs a summary in diary form. We also propose a selection technique that can improve emotional diversity and representativeness in a photo collection. Our main contributions are the creation of a new high-resolution emotion dataset, the development of an emotion classifier, and the use of emotion features to group images in a visual lifelog. The rest of this paper is organized as follows. Section 2 briefly reviews related studies. Section 3 explains the structure and processes of our image-based lifelogging system. Implementation and user experiments are presented in Section 4. Finally, future research directions are discussed in Section 5.

2 Related Work

Lifelog research that collects and visualizes data using wearable cameras and sensors has received much attention in recent years [14-17]. Since high-end smartphones have become widely used by the public, enabling lifelogs to be collected easily without requiring users to carry additional devices, various studies have been conducted on the use of smartphones as lifelogging platforms. UbiqLog is a lightweight and extendable lifelog framework that uses mobile phones as a lifelog tool through continuous sensing [18]. Gou et al. proposed a client-server system that uses a smartphone to manage and analyze personal biometric data [2]. To automatically organize large photo collections, a hierarchical photo organization method using topic-related categories has been proposed [4]. The method estimates latent topics in the photos by applying probabilistic latent semantic analysis and automatically assigns a name to each topic by relying on a lexical database.

Lifelogs must be summarized in a concise and meaningful manner because similar types of data are repeatedly accumulated. We focus on images from among the different kinds of lifelog data. The most relevant study was presented in [19], where emotion factors in lifelogs were used when selecting keyframes. A wearable biosensor was used to measure emotions quantitatively through the physiological response of skin conductance. Assuming that the most important and memorable moments in life involve emotional connections among human beings, they collectively evaluated image quality and emotional criteria for keyframe selection. The important moments that individuals want to remember are highly subjective, making it difficult to achieve satisfactory results using uniform objective criteria. Even though it is interesting to include user emotions in keyframe selection, there are limitations arising from the need to wear a biosensor to measure emotions. A recent study has explored how lifelogging technology can help users collect data for self-monitoring and reflection [20]; the authors use a biosensor and a camera to provide a timeline of experience, including self-reported key experiences, lifelog experiences, heart rates, decisions, and valence. They conclude that their results support recall and data richness. However, when these techniques are combined with automated tools such as key photo selection, better user experiences can be achieved [21].

EmoSnaps, the product of another emotion-related study, is a mobile application that takes a picture with the smartphone while the user is unaware, to help the user recall the emotion of that moment later [22]. Users evaluate their photos every day and enter emotional values. In addition, they can observe changes in their happiness every day and week. The study shows that facial expressions play an important role in the memory of one's emotions. It has also been shown that emotions are closely related to facial expressions, making it difficult to hide basic emotions such as anger, contempt, disgust, fear, happiness, neutrality, sadness, and surprise [23]. In this study, emotion features are extracted from facial expressions and key images are selected to represent the diversity of lifelogs.

Facial emotion recognition is a challenging task in the computer vision field. Large-scale databases play a significant role in solving this problem. The facial emotion recognition 2013 (FER2013) dataset was created in [24]. The dataset consists of 35,887 facial images that were resized to 48×48 pixels and then converted to grayscale. Human recognition on this dataset has an accuracy of 65% ± 5%. It has been revealed that CNNs are indeed capable of outperforming handcrafted features for recognizing facial emotion. The extended Cohn-Kanade (CK+) facial expression database [25] includes 592 sequences from 210 adults of various ethnicities (primarily European- and African-Americans). The database contains both grayscale and 24-bit color images, of which the majority are grayscale. The Japanese female facial expression (JAFFE) database [26] contains 213 images of 10 Japanese female models exhibiting six facial expressions. All images are grayscale. The Radboud Faces Database (RaFD) [27] consists of 4,824 images collected from 67 Caucasian and Moroccan Dutch models displaying eight emotions. RaFD is a high-quality face database, with all images captured in an experimental environment with a white background.

With the advancement of deep neural networks, the CNN architecture has yielded excellent results in image-related problems by learning spatial characteristics [28,29]. Research on facial emotion recognition has also noticeably improved. Kumar et al. [30] utilized a CNN with nine convolutional layers to train and classify seven types of emotion on the FER-2013 and CK+ databases. They were able to achieve an accuracy of about 90% or more. Li et al. [31] proposed new strategies for face cropping and rotation as well as a simplified CNN architecture with two convolution layers, two sub-sampling layers, and one output layer. The image was cropped to remove the useless region, and histogram equalization, Z-score normalization, and downsampling were conducted. They experimented on the CK+ and JAFFE databases and obtained high average recognition accuracies of 97.38% and 97.18%, respectively. Handouzi et al. [32] proposed a deep convnet architecture for facial expression recognition based on the RaFD dataset. This architecture consists of a class of networks with six layers: two convolutional layers, two pooling layers, and two fully connected layers. Recent research has extended from basic emotion recognition to compound emotion recognition. Gou et al. [33] released a dataset consisting of 50 classes of compound emotions composed of dominant and complementary emotions (e.g., happily-disgusted and sadly-fearful). The compound-emotion pairs are more difficult to recognize; thus, the top-5 accuracy is less than 60%. Slimani et al. [34] proposed a highway CNN architecture for the recognition of compound emotions. The highway layers of this architecture facilitate the transfer of the error to the top of the CNN. They achieved an accuracy of 52.14% for 22 compound-emotion categories.

VGGNet [35] is a CNN model composed of convolutional layers, pooling layers, and fully connected layers. VGG16 uses 3 × 3 convolution filters, and the simplicity of the VGG16 architecture makes it quite appealing. VGG16 performs almost as well as the larger VGG19 network. Transfer learning, in which knowledge is transferred from one domain to another, has received much attention in recent years [36]. A pre-trained model is a saved network that was previously trained on a large dataset such as ImageNet; we can use the pre-trained model as it is or apply transfer learning to customize it for our own purposes. Transfer learning is useful when insufficient data are available; it has the advantage of requiring relatively little computation because only the weights of the designated layers need to be adjusted. In general, the lower layers of a CNN capture general features regardless of the problem domain, whereas the higher layers are optimized to the specific dataset. Therefore, we can reduce the number of parameters to be trained while reusing the lower-level layers. In this paper, we experiment with various methods, including modified transfer learning and fine-tuning, to select the optimal model.
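To make this transfer-learning setup concrete, the following Keras sketch loads a VGG16 convolutional base pre-trained on ImageNet, freezes it, and attaches a new fully connected head with a softmax output. It is a minimal illustration rather than the authors' implementation; the 224 × 224 input size, the 512-unit dense layer, and the seven-class output are assumptions.

```python
from tensorflow import keras
from tensorflow.keras import layers

# Convolutional base of VGG16 pre-trained on ImageNet, classifier head removed.
base = keras.applications.VGG16(weights="imagenet", include_top=False,
                                input_shape=(224, 224, 3))
base.trainable = False  # freeze all pre-trained convolution layers

# New classification head for the target task (assumed: seven emotion classes).
inputs = keras.Input(shape=(224, 224, 3))
x = base(inputs, training=False)   # run the frozen base in inference mode
x = layers.Flatten()(x)
x = layers.Dense(512, activation="relu")(x)
outputs = layers.Dense(7, activation="softmax")(x)
model = keras.Model(inputs, outputs)
```

Only the new head is trained, which is why transfer learning needs far less computation than training the whole network from scratch.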

3 The Proposed Scheme

We propose the Representative Emotional Data Extraction Scheme (REDES), a technique for selecting representative images from the user's emotional perspective to identify the most representative photos from the user's lifelog data. We develop a novel emotion classifier in this process to correctly extract emotions and use the Systemsensplus server to archive photos from the smartphones. Details are provided in the following subsections.

3.1 System Architecture

The proposed system is divided into two parts: a mobile application and a server based on Systemsensplus. The overall architecture is shown in Fig. 1. The mobile application consists of a face-registration module and a diary-generation module that includes REDES. The face-registration module registers the user's face for recognition, while the REDES submodule identifies the days of special events and chooses representative photos. The diary-generation module displays the representative photos selected by REDES on the user's smartphone screen and creates a diary for a specific date.

Figure 1: Basic architecture and modules of the proposed system including REDES

Systemsens [37] is a monitoring tool that tracks data on smartphone usage (e.g., CPU, memory, network connectivity). We use the extended Systemsensplus to effectively manage user photos and related tagging information. The emotion-detection module identifies the user's face, predicts emotion from the facial expression, and tags the information using International Press Telecommunications Council (IPTC) photographic metadata standards. The IPTC format is suitable because it provides metadata such as a description of the photograph, copyright, date created, and custom fields. The Systemsensplus client module operates as a background process on the user's smartphone; it collects and stores data at scheduled times and events in a local database. When the smartphone is inactive, the Systemsensplus uploader sends data to the server overnight.

3.2 Emotion Classification Model

Well-refined, large-scale datasets are required for high accuracy in facial expression classification. On the FER2013 dataset [24], it is difficult to achieve high performance due to the very low resolution of the given images. RaFD [27] consists of 4,824 images collected from 67 participants and includes eight facial expressions. The CelebFaces Attributes (CelebA) dataset [38] contains 202,599 celebrity facial images but includes no labels for facial expressions. We collect seven facial emotion expressions (excluding contempt) from the RaFD dataset and manually label facial emotion expressions in the CelebA dataset. The integrated dataset covers seven emotions, and Tab. 1 shows the number of images for each facial emotion.

Table 1: The number of images per emotion in our dataset

For the experiments, we divided the dataset into three parts: training (70%), validation (10%), and testing (20%). The AlexNet [39], VGG11, VGG16, and VGG19 network structures are used to compare transfer learning and fine-tuning on the facial expression recognition problem. The pre-trained VGG16 and VGG19 networks perform facial expression classification using transfer-learning methods. To make a fair comparison, the same training options are used for all scenarios. When recognizing emotions from facial expressions, a pre-processing step is performed to detect facial regions, i.e., detecting faces in images and cropping only the face region. For this pre-processing step, we utilize the face_recognition Python library [40]. This library has an accuracy of 99.38% on the Labeled Faces in the Wild benchmark [41]. A sample of our dataset is shown in Fig. 2.
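As a rough illustration of this pre-processing step, the sketch below uses the face_recognition library to locate a face and crop it. The 224 × 224 output size is an assumption (matching a VGG-style input), not a value reported in the paper.

```python
import face_recognition
from PIL import Image

def crop_face(path, size=(224, 224)):
    """Detect the first face in an image and return the cropped, resized face region."""
    image = face_recognition.load_image_file(path)      # RGB numpy array
    locations = face_recognition.face_locations(image)  # list of (top, right, bottom, left)
    if not locations:
        return None                                     # no face detected
    top, right, bottom, left = locations[0]
    return Image.fromarray(image[top:bottom, left:right]).resize(size)
```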

Figure 2: Exemplary images from our refined dataset

We aim to select the best model by comparing the performance of facial emotion classifiers using transfer-learning or fine-tuning techniques based on pre-trained CNN models. We evaluated four models (AlexNet, VGG11, VGG16, and VGG19) using different learning scenarios. The last fully connected layer of each model is replaced with new fully connected layers for the purpose of emotion classification, and the softmax function is applied at the end. For the VGG16 and VGG19 models, four methods are implemented: A) transfer learning in which all of the convolution layers are frozen and only the fully connected layers are trained; B) and C) two transfer-learning settings in which the lower convolution layers are fixed and the weights of the higher convolution layers, i.e., convolution block 5 for B) and convolution blocks 4 and 5 for C), are trained; and D) a fine-tuning technique in which the weights of all layers are adjusted.
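One possible way to realize these scenarios in Keras is to toggle the trainable flag per layer, as in the hedged sketch below for scenario B; the head sizes are placeholders, and the scenario mapping in the comments is illustrative rather than the authors' code.

```python
from tensorflow import keras
from tensorflow.keras import layers

base = keras.applications.VGG16(weights="imagenet", include_top=False,
                                input_shape=(224, 224, 3))

# Scenario B: freeze blocks 1-4 and train only convolution block 5.
# (Scenario A: freeze every conv layer; C: also unfreeze "block4"; D: unfreeze all.)
for layer in base.layers:
    layer.trainable = layer.name.startswith("block5")

# New fully connected head (sizes assumed); it is trainable by default.
x = layers.Flatten()(base.output)
x = layers.Dense(512, activation="relu")(x)
outputs = layers.Dense(7, activation="softmax")(x)
model = keras.Model(base.input, outputs)
```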

Our experiments were conducted using the Keras [42] deep-learning library. An Nvidia GeForce RTX 2080 Ti GPU was used to execute the experiments, and the Adam optimizer was applied to mini-batches of 32 images with categorical cross-entropy loss. The number of epochs and the learning rate are set to 100 and 10^-4, respectively. Tab. 2 shows the validation and test accuracies of facial emotion classification under transfer learning for the four CNN models. The results with the best test accuracy are highlighted in bold.
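The stated training options translate into the following compile-and-fit sketch; x_train, y_train, x_val, and y_val are assumed arrays of face crops and one-hot labels, not variables defined in the paper.

```python
from tensorflow import keras

model.compile(optimizer=keras.optimizers.Adam(learning_rate=1e-4),
              loss="categorical_crossentropy",
              metrics=["accuracy"])

# x_train, y_train, x_val, y_val: assumed arrays of face crops and one-hot labels.
history = model.fit(x_train, y_train,
                    validation_data=(x_val, y_val),
                    batch_size=32, epochs=100)
```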

On VGG16, the worst performance is observed when its convolution layers are fixed and only its fully connected layers are trainable. Considering the results, we see that the features extracted from the model pre-trained on ImageNet are not perfectly suitable for our classification problem. Although the results vary slightly, the best test accuracy is obtained when the lower convolution layers are fixed and the layers of the last convolution block are trained. We also observe that the fine-tuning method, in which the weights of all layers are adjusted, fails to achieve the best score.

Table 2: The validation and test accuracies of the emotion classification models

Inspired by VGG16's best transfer-learning results, we experimented to see whether performance could be further improved by fixing the weights of its layers 1 to 7 and making the remaining layers shallower. The CNN architecture of the emotion recognition classifier is shown in Fig. 3. Based on the VGG16 architecture, the number of layers and filters in convolution blocks 4 and 5 are adjusted to improve the training speed and to prevent overfitting. The assumption is that emotion recognition does not require many high-level feature maps.
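A hedged sketch of this idea is given below: the first three VGG16 blocks (seven convolutional layers) are reused with frozen weights, and shallower replacement blocks are appended. The filter counts and dense-layer size are placeholders, since the exact values in Fig. 3 are not reproduced here.

```python
from tensorflow import keras
from tensorflow.keras import layers

# Pre-trained VGG16 without its classification head.
base = keras.applications.VGG16(weights="imagenet", include_top=False,
                                input_shape=(224, 224, 3))
for layer in base.layers:
    layer.trainable = False          # blocks 1-3 are reused with frozen weights

# Cut the network after block 3 and append shallower replacements for blocks 4 and 5.
x = base.get_layer("block3_pool").output
x = layers.Conv2D(256, 3, padding="same", activation="relu", name="new_block4_conv")(x)
x = layers.MaxPooling2D(name="new_block4_pool")(x)
x = layers.Conv2D(256, 3, padding="same", activation="relu", name="new_block5_conv")(x)
x = layers.MaxPooling2D(name="new_block5_pool")(x)
x = layers.Flatten()(x)
x = layers.Dense(256, activation="relu")(x)
outputs = layers.Dense(7, activation="softmax")(x)   # seven emotion classes

model = keras.Model(base.input, outputs)
```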

Figure 3: Proposed CNN model for facial emotion classification

We trained the proposed model using the same parameters. The test and validation accuracies of the proposed model are 95.22% and 97.01%, respectively, which are higher than those of the best-performing model, VGG16 transfer learning (layers 1-7 fixed). Fig. 4 depicts the confusion matrix for the test samples using the proposed model; Fig. 5 depicts the model accuracy and loss during the training and validation phases. We can observe from the plots that the model converges. The classification results of the model in terms of precision, recall, and F1-score are provided in Tab. 3. We utilize this emotion classifier to extract emotion features from the users' photos.
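One way to reproduce the kind of per-class evaluation reported in Fig. 4 and Tab. 3 is with scikit-learn, assuming y_test holds integer class labels and x_test the held-out face crops; this is an illustrative sketch, not the authors' evaluation script.

```python
import numpy as np
from sklearn.metrics import classification_report, confusion_matrix

EMOTIONS = ["anger", "disgust", "fear", "happiness", "neutrality", "sadness", "surprise"]

# y_test: integer labels of the test set; x_test: the corresponding face crops (assumed).
probs = model.predict(x_test)
y_pred = np.argmax(probs, axis=1)

print(confusion_matrix(y_test, y_pred))                               # counterpart of Fig. 4
print(classification_report(y_test, y_pred, target_names=EMOTIONS))   # precision/recall/F1 as in Tab. 3
```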

Figure 4: Confusion matrix for the seven emotion categories

Figure 5: Training accuracy vs. validation accuracy (left); training loss vs. validation loss (right)

Table 3: The classification results based on precision, recall, and F1-score

3.3 Data Collection Phase

This subsection describes how photos and related information are collected from users. The Systemsensplus client module operates as a background process and stores the required metadata in a local database, along with smartphone usage data. In this study, we focus on the image-based lifelogging system, which mainly stores photo-related data.

When a user launches the application for the first time, the face-registration screen is activated. After the user's facial image has been successfully registered, the user can be recognized and emotion scores can be extracted. Users are also able to register new facial images through the face-registration user interface. Systemsensplus supports two types of virtual sensors: event-based sensors generate log records whenever the corresponding state changes and are activated when a user registers new face photos, while polling sensors record information at regular intervals. The upload mechanism is designed to upload data only when the smartphone is being charged and has a network connection.

The emotion-detection module on the Systemsensplus server estimates the emotion scores of newly added photos at a predefined time after uploading is completed. The extraction of emotion scores occurs in two steps: (1) determining whether the user's face is in the photo and, if it is, (2) calculating the confidence scores for the seven emotion types. If faces are found, their similarity to the registered user is checked, and if the face in the photo belongs to the user, the emotion scores are acquired. Otherwise, the process stops. The acquired emotion types and scores are tagged in the photo in accordance with the IPTC standards. The type of emotion with the largest score is tagged as text in the 'category' column, and the seven emotion scores are stored as an array in the custom field of the IPTC content. Systemsensplus transfers only the metadata of the photos that have been tagged with emotion data to the client during the next scheduled cycle.
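The two-step flow could be sketched as follows, combining the face_recognition library with the emotion classifier from Section 3.2. This is an assumed pipeline, not the server's actual code; the 224 × 224 resizing and 1/255 scaling are placeholder preprocessing choices, and writing the returned dictionary into the photo's IPTC fields is left out.

```python
import numpy as np
import face_recognition
from PIL import Image

EMOTIONS = ["anger", "disgust", "fear", "happiness", "neutrality", "sadness", "surprise"]

def tag_photo(path, registered_encoding, emotion_model):
    """Return IPTC-style emotion metadata for a photo, or None if processing stops."""
    image = face_recognition.load_image_file(path)
    locations = face_recognition.face_locations(image)
    if not locations:
        return None                                             # step 1: no face found

    # Check whether the detected face belongs to the registered user.
    encodings = face_recognition.face_encodings(image, known_face_locations=locations)
    if not face_recognition.compare_faces([registered_encoding], encodings[0])[0]:
        return None

    # Step 2: crop the face and obtain the seven emotion confidence scores.
    top, right, bottom, left = locations[0]
    face = Image.fromarray(image[top:bottom, left:right]).resize((224, 224))
    scores = emotion_model.predict(np.asarray(face)[np.newaxis] / 255.0)[0]

    return {"category": EMOTIONS[int(np.argmax(scores))],        # text tag for the 'category' field
            "emotion_scores": scores.tolist()}                   # array for the IPTC custom field
```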

3.4 Representative Photo Selection Phase

Our goal is to maximize the diversity and representativeness of the lifelog images from the emotional perspective. To obtain diversity, we use the emotion scores. Photos with similar emotion scores are grouped together, and we choose key photos within each group. REDES uses the k-means clustering algorithm for this task. This algorithm partitions the given data into k clusters such that the variance of the distances within each cluster is minimized [43].

We use the k-means clustering algorithm with eight-dimensional vectors, consisting of time and the seven emotion scores:

—Time: the date/timestamp from the exchangeable image file format (EXIF) header of the photos, normalized to a real number between 0 and 1.

—Emotion: seven emotion scores for anger, disgust, fear, happiness, neutrality, sadness, and surprise (where $\sum_{i=1}^{7} E[i] = 1$).

Given a set of photos {x_1, x_2, ..., x_N}, with each photo represented in an 8-dimensional vector space, the algorithm partitions the N photos into k (≤ N) sets C = {C_1, C_2, ..., C_k} so as to minimize the sum of squared deviations of the points within each cluster, as shown in Eq. (1):

$$\underset{C}{\arg\min} \sum_{i=1}^{k} \sum_{x \in C_i} \lVert x - \mu_i \rVert^2 \quad (1)$$

where $\mu_i$ is the mean vector of cluster $C_i$.

To apply the k-means clustering algorithm, the number of clusters k must be specified. It can be calculated as the square root of half the input size [44], as shown in Eq. (2):

$$k = \sqrt{N/2} \quad (2)$$

where N is the number of photos.

We obtain a final set of k cluster groups. Then, REDES determines the photos represented by the data points closest to each cluster center. Each cluster is a group of similar photos, so the data point closest to the center can be used to represent its cluster group.
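A minimal sketch of this selection step is shown below, assuming the timestamps and per-photo emotion scores are available as arrays; it uses scikit-learn's KMeans rather than whatever implementation the authors used.

```python
import math
import numpy as np
from sklearn.cluster import KMeans

def select_representatives(timestamps, emotion_scores):
    """Pick one representative photo index per cluster from 8-dimensional feature vectors."""
    t = np.asarray(timestamps, dtype=float)
    t = (t - t.min()) / (t.max() - t.min() + 1e-9)                  # normalize time to [0, 1]
    features = np.hstack([t[:, None], np.asarray(emotion_scores)])  # N x 8 vectors

    k = max(1, round(math.sqrt(len(features) / 2)))                 # k = sqrt(N / 2), as in Eq. (2)
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(features)

    reps = []
    for c in range(k):
        members = np.where(km.labels_ == c)[0]
        dists = np.linalg.norm(features[members] - km.cluster_centers_[c], axis=1)
        reps.append(int(members[np.argmin(dists)]))                 # photo closest to the cluster center
    return sorted(reps)
```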

4 Experiments

The proposed system extracts emotion scores and stores the photos on the server. Tab. 4 shows the distribution of emotions collected from the facial expressions of one graduate student over the last five years; as we might expect, the distribution is quite skewed. However, it helps REDES to select diverse facial expressions. The results of the proposed method are shown in Fig. 6.

Table 4: Ratio of emotions extracted from one user's photos

To evaluate the performance of REDES, we conducted a user experiment with 32 participants and three datasets. One dataset contains 300 photos taken over the past three months, whereas the others are special-day datasets of 97 and 48 photos, respectively. We employed a clustering method using only the time feature as the baseline method, called Time in our comparison. In addition, we designed two comparable methods: one using only the emotion features, called REDES(Emotion), and one that selects the photos with the largest emotion scores, denoted MaxEmotion. To conduct this experiment, we prepared an online questionnaire; first, we showed the entire dataset to the participants and instructed them to "Please rate each photo collection according to your preference (i.e., how representative and/or appealing the group of photos is)." We prepared four different groups of photo collections according to the photo selection strategies, i.e., REDES, REDES(Emotion), MaxEmotion, and Time, and randomly presented them to the participants. The scoring system was a scale of 1 to 5; the higher the score, the higher the user's preference.

Figure 6: An example of the REDES result for a special day. The entire photostream of a day (top); representative photos selected by REDES (bottom)

The overall grade-distribution ranges and score averages are presented in Fig. 7, indicating that our proposed method outperforms the other methods. Comparing the mean values of each method, the preferred summary methods are REDES, REDES(Emotion), Time, and MaxEmotion, in that order. To further check whether the differences among the four methods are statistically significant, a one-way analysis of variance (ANOVA) was performed using the MATLAB statistics and machine learning toolbox [45]. The ratio of the between-group variability to the within-group variability (F) is 34.604 and the P-value is 8.57E-20. Thus, we conclude that there is a significant difference in the preference of photo selection among the methods at a significance level of 0.01. In other words, there is a significant difference between at least two of the group means.

Figure 7: The rating results from the experiment. The dots indicate the means; the boxes indicate the interquartile range, with vertical lines depicting the range and the red lines indicating the median

To determine which means differ significantly from that of our proposed method, we conducted post hoc comparisons between pairs of methods. Using the overall rating results (N = 96), a paired t-test was conducted. As shown by the P-values in Tab. 5, there is a significant difference between REDES and each of the other methods.
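The paper runs these tests in MATLAB; an equivalent Python sketch with SciPy is shown below, where redes, redes_emotion, max_emotion, and time_only are assumed arrays holding the N = 96 ratings for each method.

```python
from scipy import stats

# redes, redes_emotion, max_emotion, time_only: assumed arrays of the N = 96 ratings per method.
f_stat, p_anova = stats.f_oneway(redes, redes_emotion, max_emotion, time_only)
print(f"one-way ANOVA: F = {f_stat:.3f}, p = {p_anova:.3g}")

# Post hoc paired t-tests between REDES and each competing method.
for name, other in [("REDES(Emotion)", redes_emotion),
                    ("MaxEmotion", max_emotion),
                    ("Time", time_only)]:
    t_stat, p_val = stats.ttest_rel(redes, other)
    print(f"REDES vs. {name}: t = {t_stat:.3f}, p = {p_val:.3g}")
```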

Table 5: Result of a paired t-test between our proposed method and other methods

Our detailed findings are as follows. First, the MaxEmotion method selects the photos with the largest emotion scores; thus, similar photos with high happiness scores tend to be chosen. Second, the time clustering method, used as the baseline, selects photos at significant time intervals. We found that this method yielded good or bad results somewhat arbitrarily, depending on the dataset. Finally, since the REDES(Emotion) method mainly focuses on emotional diversity, the selected photos sometimes do not cover the entire period of the datasets. Thus, we conclude that REDES outperforms the other methods, and through the paired t-test we observe that the result is statistically significant. Our proposed scheme visualizes the lifelog by distinguishing the emotions displayed in the user's photos. Representative photos are selected using emotion and time features, and the user can conveniently configure the diary using the metadata provided with the photos. Representative photos and emotions are displayed together so that the user can easily recall the emotions of the day. Therefore, our proposed system, REDES, is able to effectively generate a lifelog based on the user's emotional perspective.

5 Conclusion and Future Work

In this work, we have proposed a new scheme for visualizing lifelogs using emotions extracted from the facial expressions in a user's photos. The experimental results show that users preferred REDES over the baseline methods. For future work, we plan to extend our clustering scheme to more effectively capture the underlying clustering structure [46] and to develop a more sophisticated lifelogging system that can automatically generate a diary by capturing objects, locations, and people as well as the user's emotions.

Acknowledgement: The authors acknowledge Yechan Park for providing the personal dataset and supporting the experiments.

Funding Statement: This research was supported by the MSIT (Ministry of Science and ICT), Korea, under the Grand Information Technology Research Center Support Program (IITP-2020-2015-0-00742), the Artificial Intelligence Graduate School Program (Sungkyunkwan University, 2019-0-00421), and the ICT Creative Consilience Program (IITP-2020-2051-001) supervised by the IITP. This work was also supported by the NRF of Korea (2019R1C1C1008956, 2018R1A5A1059921) to J. J. Whang.

Conflicts of Interest: The authors declare that they have no conflicts of interest to report regarding the present study.