
Sentiment classification model for bullet screen based on self-attention mechanism

2021-12-21

ZHAO Shuxu, LIU Lijiao, MA Qinjing

(1. School of Electronic and Information Engineering, Lanzhou Jiaotong University, Lanzhou 730070, China; 2. School of Information Engineering, Gansu Forestry Polytechnic, Tianshui 741020, China)

Abstract: With the development of the short video industry, videos and bullet screens have become important channels for spreading public opinion. Sentiment analysis of bullet screens makes public attitudes available in a timely manner and reduces the difficulty of managing online public opinion. To address the problem of effectively modeling relations among words and identifying key words in sentiment classification of short texts that lack complete context, a convolutional neural network model based on multi-head attention is proposed. Firstly, word positions are encoded so that the model can use the order information of input sequences. Secondly, a multi-head attention mechanism is used to obtain semantic representations in different subspaces, effectively capturing internal relevance, enhancing dependency relationships among words, and highlighting the emotional weights of key sentiment words. Then, a dilated convolution is used to enlarge the receptive field and extract more features. On this basis, the multi-head attention mechanism is combined with a convolutional neural network to model and analyze seven emotion categories of bullet screens. Experiments from the perspectives of both model and dataset validate the effectiveness of the approach. Finally, bullet-screen emotions are visualized to provide data support for hot-event control and other fields.

Key words: bullet screen; text sentiment classification; self-attention mechanism; visual analysis; hot event control

0 Introduction

With the advancement of online video technology and the rapid popularization of the Internet, the online video industry has developed rapidly. The commenting behavior of film and television audiences often influences the development of film and television communication and of public opinion. Hot events such as “Tianjin Uncle Touching Porcelain” and “Chongqing Bus Falling into Water” occur all the time, and live footage may be posted to BILIBILI and other platforms by onlookers or passersby, leading audiences to comment in the form of barrage on the events themselves and on the people and things involved. These comments reflect audiences’ attention to and attitudes toward those events. Through sentiment analysis of bullet screens, the relevant management departments can track audience attitudes and emotional trends, formulate timely measures to control the fermentation of such events, and avoid continuous deterioration. Unlike filmography barrages, classification into only positive, neutral and negative emotions cannot accurately describe how barrage senders perceive hot events: simple labels such as “happy” and “hate” cannot identify rubbernecking onlookers, anti-social personalities or others who seriously distort normal public opinion. Therefore, fine-grained emotion classification is necessary for video barrages of hot events.

On this basis, text mining technology is used to model and analyze the data, providing reliable data support for event management and control. A sentiment classification model, the multi-head attention convolutional neural network (MH-ACNN), is therefore proposed in this paper for bullet screens of hot events. In this model, position encoding is added to the network input layer to strengthen positional relationships among words, and a self-attention mechanism computes the relationships among words, solving the problem that existing sentiment analysis models cannot capture associations among the words of short bullet-screen texts. A multi-head attention mechanism then captures feature information in different subspaces and obtains deep semantic representations of barrage sentences, providing a better input representation for emotion classification.

Video bullet screens are intensive, fast, real-time comments that appear over the video frame as text and other symbols, different from the traditional comments in a movie's message area[1]. As a key technology of text data mining, natural language processing (NLP) has been widely used in sentiment analysis of commodity reviews, microblog short texts and other domains[2]. Compared with microblog and e-commerce comments, video barrages are temporal and concentrated. As a typical short text, how to analyze the emotional information contained in video bullet screens has become a hot topic in text sentiment analysis and video research.

The most important steps of traditional machine-learning-based sentiment analysis are feature selection and model training. Feature selection methods[3] mainly include term frequency-inverse document frequency (TF-IDF), the chi-square test (CHI), information gain (IG) and mutual information (MI); words are ranked by these values and filtered against a threshold. In practice, for short texts without grammatical structure such as video barrages, features extracted from single words are too sparse, and network phrases like “666” and “not bad” are deleted as noise during data cleaning. It is therefore urgent to build a corpus for video barrages and to find a suitable text representation for them. Among machine learning approaches, deep learning methods do not require manual feature extraction: texts can be represented by vectors trained by a deep learning model[4]. Deep learning therefore provides theoretical and methodological support for sentiment analysis of video barrage texts.

Kim et al.[5] represented texts with vectors trained by deep learning models and fed them to a CNN[6], with its local perception and parameter sharing, and to a long short-term memory (LSTM) network[7], with its temporal modeling of forward and backward correlations, to distinguish the emotional categories of movie reviews. Kalchbrenner et al.[8] proposed a dynamic pooling strategy to model sentence semantics, addressing the problem that CNNs cannot capture associations among long-distance words in sentences. With limited contextual information, Santos[9] used two convolutional layers of a deep CNN to learn information from words to sentences and construct semantic representations.

As a new type of comment, barrage commentary has become an object of research on text sentiment analysis. Based on bullet-screen data of films on BILIBILI, Zheng et al.[10] analyzed sentence-level emotions with an emotion dictionary and visualized the results to obtain distribution curves of bullet-screen emotions. Deng Yang et al.[11] constructed a barrage word classification algorithm based on latent Dirichlet allocation as a recommendation basis for video clips; however, its training set was traditional normative text that ignored the peculiarities of non-normative short texts. Hong Qing et al.[12] classified bullet-screen emotions with a user-based improved k-means algorithm and used the results to analyze emotional differences among the audiences of specific films. Zhuang Xuqiang et al.[13] used an attention mechanism to identify emotional keywords in video barrages and combined an LSTM model with the emotional dependence between preceding and following barrage comments to extract “highlight” video clips. Wang Xiaoyan[14] classified bullet-screen sentiment with a deep-learning-based text sentiment analysis method and tagged the clustering results of video keyframes with emotions, so that the emotional information of keyframes could be displayed visually and users could decide from others’ comments whether to watch a segment. However, current research on video barrages does not consider their incomplete grammatical structure and covers only barrages of film and television works. For this reason, an emotion classification model of bullet screens oriented to hot events is proposed in this paper, to support decisions in event management and control.

1 MH-ACNN model

Aiming at the need for fine-grained emotion classification in hot events and other fields, an MH-ACNN model is constructed. In this model, a self-attention mechanism models the relationships among words, a multi-head attention mechanism extends the model to extract emotional expressions at different positions in bullet screens, and position and emotion-symbol embeddings are added to the input alongside word-vector embeddings, so that the model can make full use of the input information in emotion modeling and analysis.

1.1 Position encoding

In a self-attention mechanism, sequence information cannot be captured because there is no iterative operation like that of a recurrent neural network[15]. Therefore, the position of each word must be provided for the model to recognize order relationships in language. In this paper, the positional embedding method[16] is used to encode word positions, with a position embedding matrix of dimension $[l_{max}, d_{model}]$, where $l_{max}$ is the maximum text length and $d_{model}$ is the word vector dimension. Specifically, sin and cos functions are used to provide the position information.

$P_{(pos, 2i)} = \sin\left(pos / 10\,000^{2i/d_{model}}\right),$

(1)

$P_{(pos, 2i+1)} = \cos\left(pos / 10\,000^{2i/d_{model}}\right),$

(2)

where $pos$ is the position of a word in the sentence, with range $[0, l_{max})$; $i$ indexes the word vector dimension, with range $[0, d_{model})$. Eqs.(1) and (2) apply to even and odd word-vector dimensions, respectively (e.g., the pairs (0,1) and (2,3)), which are processed with the sin and cos functions above and therefore vary with different periods. The period of the position embedding function ranges from $2\pi$ to $10\,000 \cdot 2\pi$, so each position receives a unique combination of sin and cos values across the word-vector dimensions, generating a unique positional signature from which the model can learn the positional dependencies and ordering characteristics of natural language.
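The encoding can be computed directly. The following NumPy sketch (illustrative only; the paper publishes no code, and the function name and shapes are our assumptions) builds the $[l_{max}, d_{model}]$ position matrix of Eqs.(1)-(2) for an even $d_{model}$.

```python
import numpy as np

def position_encoding(l_max: int, d_model: int) -> np.ndarray:
    """Return an [l_max, d_model] matrix of sinusoidal position embeddings."""
    pos = np.arange(l_max)[:, np.newaxis]            # word positions 0..l_max-1
    two_i = np.arange(0, d_model, 2)[np.newaxis, :]  # even dimension indices 2i
    angle = pos / np.power(10000.0, two_i / d_model) # pos / 10000^(2i/d_model)
    pe = np.zeros((l_max, d_model))
    pe[:, 0::2] = np.sin(angle)                      # even dimensions, Eq.(1)
    pe[:, 1::2] = np.cos(angle)                      # odd dimensions, Eq.(2)
    return pe

print(position_encoding(l_max=50, d_model=256).shape)  # (50, 256)
```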

1.2 Self-attention mechanism

The self-attention mechanism[17] is an encoding scheme for learning text representations proposed by the Google machine translation team in 2017. To learn multiple meaning expressions, three weight matrices $W^Q$, $W^K$ and $W^V$ are applied to the input vector $X$ as linear mappings. Specifically, as shown in Fig.1, where $L_x$ is the input length, three matrices $Q$, $K$ and $V$ are obtained, as given in Eqs.(3)-(5).

Fig.1 Linear mapping of input vector

$Q = \mathrm{Linear}(X_{embedding}) = X_{embedding} W^Q,$

(3)

$K = \mathrm{Linear}(X_{embedding}) = X_{embedding} W^K,$

(4)

$V = \mathrm{Linear}(X_{embedding}) = X_{embedding} W^V.$

(5)

Self-attention captures syntactic and semantic features between the words of a sentence by computing the correlation between any two words, shortening the distance between dependent features[16] and making long-distance interdependent features in a sentence easier to capture. The calculation is divided into three steps, as shown in Fig.2.

Fig.2 Calculation process of self-attention

In Fig.2, $K_i$, $Q$ and $V_i$ represent keys, the query and values, respectively; $f(\cdot,\cdot)$ is a similarity function; $s_i$ is a similarity score; * denotes multiplication; $a_i$ is the weight coefficient of the corresponding value, and $A$ is the attention value matrix of the input sentence.

In the first phase, the weight coefficient of the value corresponding to each $K$ is obtained by computing the similarity between $Q$ and each $K$. Commonly used similarity functions are the dot product (Eq.(6)), concatenation (Eq.(7)) and perceptron (Eq.(8))

$f(Q, K_i) = Q^{\mathrm{T}} K_i,$

(6)

$f(Q, K_i) = \mathrm{concat}(W_\alpha [Q, K_i]),$

(7)

$f(Q, K_i) = V_\alpha \tanh(W_\alpha Q + U_\alpha K_i).$

(8)

In the second phase, to prevent the results from becoming too large, the similarity scores are scaled, and a softmax function normalizes them into weights. The specific calculation is

$a_i = \mathrm{softmax}\left(\frac{f(Q, K_i)}{\sqrt{d_k}}\right) = \frac{\exp\left(f(Q, K_i)/\sqrt{d_k}\right)}{\sum_j \exp\left(f(Q, K_j)/\sqrt{d_k}\right)},$

(9)

where $d_k$ is the dimension of $K$.

In the third phase, the weights $a_i$ and the values $V_i$ corresponding to each $K$ are weighted and summed to obtain the final attention expression

$A = \mathrm{Attention}(Q, K, V) = \sum_{i=1}^{L_x} a_i V_i.$

(10)
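As a concrete illustration of the three phases, the NumPy sketch below maps an input to $Q$, $K$ and $V$ (Eqs.(3)-(5)), computes dot-product similarities (Eq.(6)), scales and normalizes them (Eq.(9)), and takes the weighted sum (Eq.(10)). The random weights stand in for learned parameters; nothing here is from the paper's code.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X, W_q, W_k, W_v):
    Q, K, V = X @ W_q, X @ W_k, X @ W_v   # Eqs.(3)-(5): linear mappings
    d_k = K.shape[-1]
    s = Q @ K.T                           # phase 1: dot-product similarity, Eq.(6)
    a = softmax(s / np.sqrt(d_k))         # phase 2: scaled softmax weights, Eq.(9)
    return a @ V                          # phase 3: weighted sum, Eq.(10)

rng = np.random.default_rng(0)
X = rng.normal(size=(10, 64))             # 10 tokens, 64-dim embeddings
A = self_attention(X, *(rng.normal(size=(64, 64)) for _ in range(3)))
print(A.shape)                            # (10, 64)
```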

1.3 MH-ACNN network framework

Barrage texts are severely colloquial and contain almost no complete contextual information. Compared with the LSTM models widely used in chapter-level sentiment analysis, the CNN is preferred for sentiment analysis of video barrage texts. In this paper, an MH-ACNN model combining a CNN with a multi-head attention mechanism is constructed for emotion classification of video barrage texts. To preserve the complete feature information of a sentence, a multi-head attention layer is placed before the convolutional layer. As shown in Fig.3, the MH-ACNN model consists of an embedding layer, a multi-head attention layer, a convolutional layer, a pooling layer, a fully connected layer and a softmax layer.

Fig.3 Structure diagram of MH-ACNN model

Embedding layer: each word and emotion symbol in the text, together with its position, is mapped to a low-dimensional vector space. Word2vec pre-trained word vectors are loaded, a dictionary is constructed to encode words, and each token in the sample data is replaced by its dictionary ID, through which word vectors are obtained directly by ID lookup. According to Eqs.(1)-(2), the position with ID $pos$ is mapped to a $d_{model}$-dimensional position vector, and text sequences are padded with zeros at the end to a fixed length, yielding a position vector matrix. The matrix of word vectors and emotion-symbol vectors $E_x$ and the position vector matrix $E_p$ are spliced as the input of the self-attention mechanism, described as

$E_{embedding} = E_x \oplus E_p.$

(11)
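The splice of Eq.(11) can be read either as element-wise addition (the usual convention for sinusoidal position embeddings, since $E_p$ shares the dimension $d_{model}$) or as concatenation along the feature axis; the paper's wording admits both. A minimal sketch of the two readings, with random stand-ins for the Word2vec and position matrices:

```python
import numpy as np

rng = np.random.default_rng(0)
l_max, d_model = 50, 256
E_x = rng.normal(size=(l_max, d_model))  # stand-in: word + emotion-symbol vectors
E_p = rng.normal(size=(l_max, d_model))  # stand-in: sinusoidal position matrix

E_add = E_x + E_p                             # additive reading of E_x ⊕ E_p
E_cat = np.concatenate([E_x, E_p], axis=-1)   # concatenation reading
print(E_add.shape, E_cat.shape)               # (50, 256) (50, 512)
```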

Multi-head attention layer: the self-attention mechanism linearly maps the input matrix to obtain three matrices $Q$, $K$ and $V$ of the same dimension (Eqs.(13)-(15)). Similarity is computed with the dot product; to prevent the inner products from becoming too large, a scaled dot product is used. The weighted value of the scaled dot-product attention is calculated by

$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\left(\frac{QK^{\mathrm{T}}}{\sqrt{d_k}}\right)V,$

(12)

$Q = \mathrm{Linear}(E_{embedding}) = E_{embedding} W^Q,$

(13)

$K = \mathrm{Linear}(E_{embedding}) = E_{embedding} W^K,$

(14)

$V = \mathrm{Linear}(E_{embedding}) = E_{embedding} W^V.$

(15)

Eight self-attention heads are used in the MH-ACNN model so that it can learn relevant information in different representation subspaces and obtain deep semantic expressions of the input. The $Q$, $K$ and $V$ matrices are linearly transformed and fed into the scaled dot-product attention of Eq.(12), whose weights are computed eight times with a different linear transformation each time (Eq.(17)). The eight self-attention results are then linearly transformed and taken as the result of multi-head attention (Eq.(16)).

$M(Q, K, V) = \mathrm{concat}(W^O [h_1, h_2, \ldots, h_8]),$

(16)

$h_i = \mathrm{Attention}(Q W_i^Q, K W_i^K, V W_i^V), \quad i = 1, 2, \ldots, 8.$

(17)
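The eight-head computation of Eqs.(16)-(17) can be sketched as follows; the per-head weight shapes ($d_{model}/8$ columns each) and the random initialization are illustrative assumptions, not the paper's settings.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_attention(E, n_heads=8, seed=0):
    l, d = E.shape
    d_k = d // n_heads
    rng = np.random.default_rng(seed)
    heads = []
    for _ in range(n_heads):
        W_q, W_k, W_v = (rng.normal(size=(d, d_k)) * d**-0.5 for _ in range(3))
        Q, K, V = E @ W_q, E @ W_k, E @ W_v   # per-head linear maps, Eq.(17)
        a = softmax(Q @ K.T / np.sqrt(d_k))   # scaled dot-product attention, Eq.(12)
        heads.append(a @ V)
    W_o = rng.normal(size=(d, d)) * d**-0.5
    return np.concatenate(heads, axis=-1) @ W_o   # concat + output map, Eq.(16)

M = multi_head_attention(np.random.default_rng(1).normal(size=(10, 256)))
print(M.shape)  # (10, 256)
```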

Convolutional layer: a dilated convolution is used to expand the receptive field without increasing the number of convolution kernel parameters, thereby widening the range of feature extraction. With expansion factor $n$ and original kernel size $N$, the expanded kernel size is $N' = n(N-1)+1$. For a one-dimensional input sequence $X \in \mathbb{R}^n$ and a convolution kernel $f: \{0, \ldots, N-1\} \to \mathbb{R}$, the dilated convolution on element $s$ is defined as

$F(s) = (X *_d f)(s) = \sum_{i=0}^{N-1} f_d(i)\, X_{s-di},$

(18)

where $f_d$ is the dilated convolution kernel; * denotes the convolution operation; $d$ is the dilation rate; $s-di$ points in the direction of past information, in which the minus sign indicates a shift and $di$ acts as the step of the convolution kernel. The dilated convolution is thus equivalent to inserting a fixed hop between every two adjacent filter taps. When $n=1$, the dilated convolution reduces to an ordinary convolution. A larger expansion factor allows the top-layer outputs to represent a wider input range, effectively expanding the receptive field. Feature maps are obtained by applying the dilated convolution to the weight matrix $M$ produced by the multi-head attention mechanism.

(19)
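A minimal causal 1-D dilated convolution, matching Eq.(18): output $s$ looks back at inputs $s, s-d, \ldots, s-(N-1)d$, so a size-3 kernel with $d=2$ covers an effective window of $N' = 2(3-1)+1 = 5$. The names and the zero treatment at the sequence start are our choices.

```python
import numpy as np

def dilated_conv1d(x, f, d=2):
    """Causal dilated convolution: F(s) = sum_i f(i) * x[s - d*i]."""
    y = np.zeros(len(x))
    for s in range(len(x)):
        y[s] = sum(f[i] * x[s - d * i] for i in range(len(f)) if s - d * i >= 0)
    return y

x = np.arange(10, dtype=float)
print(dilated_conv1d(x, f=np.ones(3), d=2))
# [ 0.  1.  2.  4.  6.  9. 12. 15. 18. 21.]
```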

Pooling layer: the local features extracted by the dilated convolution are screened to retain the most important emotional feature information. Max pooling is used to select the most salient features $C'$. The feature information after pooling is

$C' = \max(C).$

(20)

Fully connected layer: the most abstract features obtained by pooling are integrated. To avoid overfitting, a dropout layer is included in the fully connected layer: its neurons are randomly discarded with a certain probability to reduce the dependence among neurons and improve the model's generalization ability. Finally, a softmax function determines the emotional tendency of the barrage text

$y = \mathrm{softmax}(W_f X_r + B_f),$

(21)

where $X_r$ is the output of the pooling layer, $W_f$ is the weight matrix of the fully connected layer, and $B_f$ is the bias of the fully connected layer.

The model is trained by minimizing the cross-entropy loss

$L = -\sum_{i=1}^{D} \sum_{j} y_i'^{(j)} \log y_i^{(j)},$

(22)

where $D$ is the size of the dataset, $j$ is the sentiment category label of review text $i$, $y'$ is the actual category, and $y$ is the predicted category.
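To make the whole pipeline concrete, the following PyTorch sketch assembles the layers described above: embedding plus position encoding, eight-head attention, dilated convolution, max pooling, dropout and a softmax classification head trained with the cross-entropy loss of Eq.(22). The 256-dimensional embedding and eight heads follow the text; all other sizes, and the use of `nn.MultiheadAttention`, are illustrative assumptions rather than the paper's implementation.

```python
import torch
import torch.nn as nn

class MHACNN(nn.Module):
    def __init__(self, vocab_size, d_model=256, n_heads=8, n_filters=128,
                 kernel_size=3, dilation=2, n_classes=7, dropout=0.5):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)  # Word2vec vectors can be loaded here
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.conv = nn.Conv1d(d_model, n_filters, kernel_size, dilation=dilation)
        self.drop = nn.Dropout(dropout)
        self.fc = nn.Linear(n_filters, n_classes)       # softmax is applied inside the loss

    def forward(self, ids, pos_enc):
        e = self.embed(ids) + pos_enc                   # embedding + position encoding
        m, _ = self.attn(e, e, e)                       # multi-head self-attention
        c = torch.relu(self.conv(m.transpose(1, 2)))    # dilated convolution over time
        p = c.max(dim=-1).values                        # max pooling
        return self.fc(self.drop(p))                    # logits for the 7 emotion classes

model = MHACNN(vocab_size=20000)
ids = torch.randint(0, 20000, (4, 50))                  # a batch of 4 barrages, length 50
logits = model(ids, torch.zeros(50, 256))               # zero stand-in for the position matrix
loss = nn.CrossEntropyLoss()(logits, torch.randint(0, 7, (4,)))  # Eq.(22)
print(logits.shape, float(loss))
```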

2 Experiment and analysis

2.1 Experimental data and evaluation indicators

All barrages of hot-event videos on the BILIBILI website, crawled with Python, are used in this paper as the bullet screen dataset (BSD) to verify the performance of the MH-ACNN model. Video bullet-screen data is noisy, so punctuation marks, url links and advertisements must be removed before building the corpus.
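A minimal sketch of this cleaning step, under the assumption that the removal is regex-based; advertisement filtering would need a platform-specific blacklist and is not shown.

```python
import re

def clean_barrage(text: str) -> str:
    text = re.sub(r"https?://\S+", "", text)          # drop url links
    text = re.sub(r"[^\w\u4e00-\u9fff]+", " ", text)  # drop punctuation, keep CJK/word chars
    return text.strip()

print(clean_barrage("前方高能!!! 详见 https://example.com 666"))
# -> "前方高能 详见 666"
```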

To verify the effectiveness of the MH-ACNN model, and in line with the fine-grained expression of emotion in barrage texts, samples in the BSD dataset are labeled with seven emotion tags: like, sadness, anger, disgust, happiness, fear and surprise. To verify the generalization ability of the model, the public dataset MDS from the NLP2013 Chinese microblog emotion evaluation task is also used for testing, as shown in Table 1.

Table 1 Datasets

Three measurement indicators based on a confusion matrix are selected to evaluate the classification performance of the model: precision ($p$), recall ($r$) and the $F1$ value, calculated as

$p = TP/(TP + FP),$

(23)

$r = TP/(TP + FN),$

(24)

$F1 = 2pr/(p + r),$

(25)

where $TP$ is the number of positive samples predicted as positive; $FP$ is the number of negative samples predicted as positive; $FN$ is the number of positive samples predicted as negative.
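For one class treated as the positive category, the three indicators can be computed directly from predictions, as in the small sketch below (the labels and data are made up for illustration).

```python
def prf(y_true, y_pred, label):
    tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
    fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
    fn = sum(t == label and p != label for t, p in zip(y_true, y_pred))
    prec = tp / (tp + fp) if tp + fp else 0.0                   # Eq.(23)
    rec = tp / (tp + fn) if tp + fn else 0.0                    # Eq.(24)
    f1 = 2 * prec * rec / (prec + rec) if prec + rec else 0.0   # Eq.(25)
    return prec, rec, f1

y_true = ["anger", "like", "disgust", "anger"]
y_pred = ["anger", "anger", "disgust", "like"]
print(prf(y_true, y_pred, "anger"))  # (0.5, 0.5, 0.5)
```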

2.2 Model parameter settings

In this experiment, convolution kernels with different window sizes are used for feature extraction, and model hyperparameters are selected by 10-fold cross-validation. After cross-validation, the final weights of each layer are retrained on all training data. During training, a random search algorithm[17] is used to tune the remaining parameters for optimal classification performance. Parameter settings are shown in Table 2.
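A hedged sketch of this tuning loop: candidate values are drawn at random and scored by cross-validation. The search space and the `train_and_score` stand-in (which would train MH-ACNN on one fold split and return an F1 score) are illustrative, not the paper's actual settings.

```python
import random

SPACE = {"lr": [1e-3, 5e-4, 1e-4], "dropout": [0.3, 0.5], "kernel_size": [3, 4, 5]}

def random_search(train_and_score, folds, n_trials=20, seed=0):
    rng = random.Random(seed)
    best_params, best_score = None, -1.0
    for _ in range(n_trials):
        params = {k: rng.choice(v) for k, v in SPACE.items()}
        score = sum(train_and_score(params, f) for f in folds) / len(folds)  # CV mean
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score

# usage: params, f1 = random_search(train_and_score, folds=range(10))
```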

Table 2 Parameter settings of the model

2.3 Analysis of results

As shown in Table 3, the highest evaluation indices over multiple training runs are reported as the final experimental results of the MH-ACNN model on the BSD dataset.

Table 3 Experimental results of MH-ACNN model

To better evaluate the performance of the MH-ACNN model, it is compared with the multivariate Naive Bayes model (MNB)[18], the multi-channel convolutional neural network (MCNN)[5], the dual attention convolutional neural network (DAM)[15] and a multi-head attention bidirectional LSTM model (MH-BiLSTM). Comparison results are shown in Table 4.

Table 4 Experimental results of different models

From the experimental results, the following conclusions are drawn. The MH-ACNN model achieves the best classification performance on $F1$. Compared with the MNB model, macro-average $F1$ and weighted-average $F1$ increase by 12.7% and 24.31%, respectively, showing that deep learning models outperform traditional machine learning models in emotion classification tasks. Compared with the DAM model, the MH-ACNN model improves macro-average $F1$ only slightly but improves weighted-average $F1$ markedly, indicating that it learns emotional features better and thereby improves classification. The MH-BiLSTM model improves both macro-average $F1$ and weighted-average $F1$ over the DAM model, but its classification performance is slightly lower than that of the MH-ACNN model, showing that a CNN is more suitable for feature extraction from short texts such as bullet screens.

To verify the generalization ability of the model, the microblog review dataset was used to test the classification performance of the MH-ACNN model. Results are shown in Table 5.

As can be seen from Table 5, the classification results of the MH-ACNN model on the video barrage dataset are somewhat higher than those on the Weibo dataset, by up to 11%. The model's results on the MDS dataset improve by 10%, indicating that the proposed model has strong generalization ability.

Table 5 Experimental results of different datasets

To evaluate the importance of different components of the MH-ACNN model, the parameters were varied in different ways and the resulting performance changes on the BSD dataset were measured. The results are presented in Figs.4 and 5.

In Fig.4, the number of attention heads is varied while keeping the other parameters constant. Comparing the MH-ACNN and MH-BiLSTM models shows that single-head attention is worse than the four-head and eight-head settings, while quality also drops off with too many heads.

Fig.4 Effect of number of heads on classification performance

In Fig.5, the model embedding dimension is varied while keeping the other parameters constant. Comparing the MH-ACNN, MH-BiLSTM and DAM models shows that a larger embedding is better, but the classification gains diminish as the embedding grows. This paper therefore selects 256 as the model embedding dimension.

Fig.5 Effect of dimension of model embedding on classification performance

2.4 Visual analysis of video barrage corpus

In experimental exploration, one should pay attention not only to classification performance but also to the regularities behind the data, so as to provide data support for practical applications.

Taking the hot event “Female Passenger Missing Her Stop on the Yangtze River Bridge Line” as an example, its barrage comments are visually analyzed. Bullet screens were counted along the video timeline from the occurrence of the event through its development, as shown in Fig.6. The video-time barrage volume curve shows that, compared with TV series with more plot, hot-event barrage volumes have a clearer sense of hierarchy and more prominent focal points. The climax of the whole incident occurred at around 20 s-70 s of the video, where a series of actions by the female passenger who missed her stop drew public ridicule.

Fig.6 Amount of hot event barrage in video time

Because of the BILIBILI platform's design, bullet screens also carry natural-time attributes, so barrages are visualized in natural time from the event's occurrence to its fading from attention. Fig.7 shows the monthly totals of bullet screens for this event. The figure shows that the bullet screens were concentrated in November and December 2018, and the numbers from March 2019 to 2020 indicate that the event received almost no attention in later periods.

Fig.7 Variation curve of number of barrages of hot events in natural time

Fig.8 shows the daily distribution of barrages during November-December 2018, when barrages were relatively concentrated. The number of barrages fluctuated in November 2018, mostly between November 4th and November 9th, while the barrage volume tended to zero in December. The barrages of November 2018 were therefore analyzed again by day. As shown in Fig.9, November 7 was the turning point from the peak to small fluctuations.

Fig.8 Variation curve of number of natural time barrages in 2018

Fig.9 Variation of number of barrages in natural time measured in days

From the above analyses, the following conclusions are drawn. Firstly, the barrage volume shows that the hot event passed through six stages: initiation, gestation, development, climax, processing and rest, consistent with the general law by which hot events arise, evolve and finally calm down. Secondly, as the old saying that guarding against the mouths of the people is harder than guarding against a river suggests, public opinion matters. In ancient times, when information spread with great difficulty, the public response was likened to a large river; in the modern era of highly developed information, public attitudes have grown from a river into a vast ocean, whose importance is self-evident. According to the development pattern of hot events, the management of malignant events must begin before the climax period, and the key issues must be resolved during the development period, so as to keep the situation from continually deteriorating.

To explore the public's views on and main emotions about the event, keywords are extracted from the barrage corpus, and the MH-ACNN model proposed in this paper is used to classify and visualize the emotions in the corpus.

As shown in Fig.10, the public believes the female passenger's behavior could have “murdered other passengers”: she not only frightened the other passengers and the users watching the videos, but also seriously endangered public safety, which constitutes a crime for which she should bear the corresponding legal responsibility or punishment.

Fig.10 Barrage corpus keywords

Fig.11 intuitively reflects the main emotions of the video audience, covering all emotion tags. The emotions “happiness” and “like” account for only a very small share, while “anger” and “disgust” account for the largest proportions. Looking back at the video, it records a typical crime of endangering public safety, and the female passenger's behavior is deeply resented by the audience. Nevertheless, the sentiment analysis shows that tendencies of “happiness” and “like” do appear, indicating that some barrage senders do not criticize such behavior but instead vent their own emotions with a rubbernecking, fun-seeking mentality, which seriously distorts the normal direction of public opinion and poses hidden dangers in communication. Platform managers need to pay close attention to such groups: if multiple comments under the same account run against the direction of public opinion, the account should attract high attention.

Fig.11 Emotional distribution of hot events

3 Conclusions

In the field of text sentiment analysis of video barrages, existing research is mostly directed at personalized recommendation and similar applications, and has not connected the real-time nature of barrages with the field of public opinion communication. To this end, a sentiment classification model is built in this paper for the fine-grained analysis of public opinion and sentiment on the network. Exploiting the characteristics of barrage language, features of short barrage texts are extracted with a CNN, while relationships among words are computed with a self-attention mechanism. Because existing sentiment analysis models cannot capture relationships among the words of short bullet-screen texts, a multi-head attention mechanism is used to capture feature information of different subspaces and obtain deep semantic representations of barrage sentences, providing better inputs for emotion classification. By comparing the advantages and disadvantages of each emotion classification model, the effectiveness and superiority of the MH-ACNN model in text emotion classification are verified. Finally, visual analyses of barrage volumes and emotional distributions in hot events provide data support for event management and control.