Multivariate time series prediction based on AR_CLSTM

2021-09-15

Journal of Measurement Science and Instrumentation, 2021, Issue 3

QIAO Gangzhu, SU Rong, ZHANG Hongfei

(School of Data Science and Technology, North University of China, Taiyuan 030051, China)

Abstract: Time series data are widely used in fields such as electricity forecasting, exchange rate forecasting, and solar power generation forecasting, so time series prediction is of great significance. Recently, the encoder-decoder model combined with long short-term memory (LSTM) has been widely used for multivariate time series prediction. However, the encoder can only encode information into fixed-length vectors, so the performance of the model decreases rapidly as the length of the input or output sequence increases. To solve this problem, we propose a combination model named AR_CLSTM based on the encoder_decoder structure and linear autoregression. The model uses a time step-based attention mechanism to enable the decoder to adaptively select past hidden states and extract useful information, and uses a convolution structure to learn the internal relationships between the different dimensions of the multivariate time series. In addition, AR_CLSTM incorporates the traditional linear autoregressive method to learn the linear relationships of the time series, which further reduces the prediction error of the encoder_decoder structure and improves multivariate time series prediction. Experiments show that the AR_CLSTM model performs well on different time series prediction tasks, and its root mean square error, mean square error, and mean absolute error all decrease significantly.

Key words: encoder_decoder; attention mechanism; convolution; autoregression model; multivariate time series

0 Introduction

With the advent of the big data era, time series, as a widely used data type, have gradually become a research hotspot in various fields. Time series prediction refers to predicting values over a future period by analyzing the relationship between the historical values of the time series and its future values[1]. Research on time series prediction has been extended to meteorology[2], economy[3], industry[4-5], transportation[6], medicine[7] and other fields. Traditional time series prediction methods using autoregressive (AR) models, moving average (MA) models, autoregressive moving average (ARMA) models, or autoregressive integrated moving average (ARIMA) models mainly focus on stationary time series, and can only model univariate, homoscedastic and linear time series. However, most time series are multivariate, heteroscedastic and nonlinear, so those methods cannot meet actual needs. Recently, many nonlinear modeling methods have been proposed, such as artificial neural networks, support vector regression, genetic programming and other algorithms, which have strong learning and data processing capabilities and do not need to assume a functional relationship between the data in advance. Through training on a large number of samples, such a model can spontaneously approximate nonlinear characteristics that are complex and even difficult to describe mathematically[13].

The recurrent neural network (RNN) is a type of neural network used to process sequence data. However, the traditional RNN generally suffers from vanishing and exploding gradients, which greatly limits its effectiveness in long-term sequence prediction. The long short-term memory (LSTM) network was proposed specifically to solve this problem and greatly improved the analysis results, so a large number of researchers have focused on the development and optimization of related models.

As LSTM has shown excellent performance in time series data processing, many improved LSTM-based time series processing methods have been proposed. Zhang et al.[8] divided the data into different scales and processed the data of each scale with a multi-layer LSTM. Du et al.[9] used a bidirectional LSTM layer as the encoding network, combined with the attention mechanism, to adaptively learn the long-term correlations and hidden correlation characteristics of multivariate time series data. Heidari et al.[10] used a time-step attention LSTM model to predict the short-term energy utilization of solar water heating systems. The models listed above are all optimized models based on LSTM. In particular, the encoder-decoder combined with LSTM is widely used for multivariate time series prediction[11-15]. However, since the encoder can only encode information into fixed-length vectors, the performance of the model decreases rapidly as the length of the input or output sequence increases. Furthermore, due to the nonlinear characteristics of the neural network, the output scale is not sensitive to the input scale, whereas in real data sets the scale of the input signal changes continuously in a non-periodic manner, which significantly reduces the prediction accuracy of neural network models.

To improve the sensitivity of the output scale of the neural network to the input scale and to reduce the difficulty of learning the relationships between the variables of a multivariate time series, we propose a combination model, called AR_CLSTM, based on the encoder_decoder structure and linear autoregression.

1 Convolution neural network and LSTM network

1.1 Convolution neural network

One of the important reasons why the convolution neural network (CNN) achieves excellent performance on images lies in its feature extraction capability. When multiple convolution layers are stacked, low-level features such as the pixels of the original image are gradually abstracted into edges, corners, contours and, eventually, the entire target. This hierarchical representation of features is not unique to image data. Time series data are locally correlated, so the same process also yields a hierarchical representation of their features. Therefore, convolution operations are used to learn the correlation between short time steps and the local dependencies between time series variables. In this study, one-dimensional convolution is applied to the time series data. By convolving the sequence data with different convolution kernels, different features can be extracted from the time series into its feature space. The structure of the convolution operation is shown in Fig.1.

Fig.1 Convolution operation diagram

The convolution layer is generally composed of multiple filters, or convolution kernels. The output of the $k$-th filter scanning the input matrix is

$$X_k = f(W_k * t + b_k), \tag{1}$$

where $*$ denotes the convolution operation; $W_k$ and $b_k$ are the parameters to be trained; $t$ consists of $\{t_1, t_2, t_3, \ldots, t_i, \ldots, t_n\}$; $f(\cdot)$ is the activation function; and $X_k$ represents the output of the convolution layer.
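As an illustration of Eq.(1), the following is a minimal NumPy sketch of a single filter sliding over a univariate sequence. The function name `conv1d`, the "valid" (no padding) boundary handling, the use of cross-correlation (as most deep learning libraries do), and the default `tanh` activation are assumptions made for this example, not details given in the paper.

```python
import numpy as np

def conv1d(t, W, b, f=np.tanh):
    """Apply one filter W (with bias b and activation f) along sequence t.

    Hypothetical helper: 'valid' sliding window without padding,
    implemented as cross-correlation (the kernel is not flipped).
    """
    t = np.asarray(t, dtype=float)
    W = np.asarray(W, dtype=float)
    n_out = len(t) - len(W) + 1
    # Dot product of the kernel with each window of the sequence
    z = np.array([np.dot(W, t[i:i + len(W)]) for i in range(n_out)])
    return f(z + b)

# A difference kernel with identity activation responds to local changes
x = conv1d([1.0, 2.0, 3.0, 4.0], W=[1.0, -1.0], b=0.0, f=lambda z: z)
print(x)  # [-1. -1. -1.]
```

Stacking several such filters, each with its own $W_k$ and $b_k$, gives one output channel $X_k$ per filter.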

1.2 LSTM network

The LSTM network is a variant of the recurrent neural network. LSTM solves the long-distance dependence problem that traditional recurrent neural networks cannot handle[16-17]. The LSTM adds a cell state to the recurrent neural network and uses cells to store the long-term state. In order to enable the cell state to effectively preserve the long-term state, the LSTM neural network adds three control gates: a forget gate, an input gate and an output gate. The network structure of the LSTM unit is shown in Fig.2.

Fig.2 LSTM unit structure diagram

The forget gate $f_t$ controls how much information of the unit state $c_{t-1}$ from the previous moment is retained in the current state. The forget gate is calculated by

$$f_t = \sigma(W_f[h_{t-1}, x_t] + b_f), \tag{2}$$

where $W_f$ is the weight matrix of the forget gate; $[h_{t-1}, x_t]$ is the vector that concatenates the output of the previous moment and the input of the current moment; $b_f$ is the bias of the forget gate; and $\sigma$ is the Sigmoid activation function, whose value lies between 0 and 1 and describes how much of each component can pass: 0 means that nothing is allowed to pass, and 1 means that everything is allowed to pass.

The input gate $i_t$ controls the input of the immediate state to the long-term state $c$, and it is calculated by

$$i_t = \sigma(W_i[h_{t-1}, x_t] + b_i), \tag{3}$$

where $W_i$ is the weight matrix of the input gate, and $b_i$ is the bias term of the input gate. Assuming $c_t$ is the unit state at the current moment, it is calculated by

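$$c_t = f_t \odot c_{t-1} + i_t \odot \tilde{c}_t, \tag{4}$$

where $\tilde{c}_t$ refers to the current state of the cell input, and $\odot$ denotes element-wise multiplication.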

Thanks to the control of the forget gate, the state of the unit can save information from a long time ago. In addition, the input gate can prevent irrelevant information from entering the memory.

The output gate $o_t$ controls the influence of long-term memory on the current output, and it is calculated by

$$o_t = \sigma(W_o[h_{t-1}, x_t] + b_o). \tag{5}$$

The final output of the LSTM network is

$$h_t = o_t \odot \tanh(c_t), \quad y_t = \sigma(W h_t). \tag{6}$$
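The gate equations can be traced step by step in a short NumPy sketch of one LSTM time step. The function `lstm_step`, the parameter layout, and the candidate state $\tilde{c}_t = \tanh(W_c[h_{t-1}, x_t] + b_c)$ follow the standard LSTM formulation and are assumptions of this example; the paper does not spell out the candidate-state formula.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, params):
    """One LSTM step; params = (W_f, b_f, W_i, b_i, W_c, b_c, W_o, b_o)."""
    W_f, b_f, W_i, b_i, W_c, b_c, W_o, b_o = params
    z = np.concatenate([h_prev, x_t])      # [h_{t-1}, x_t]
    f_t = sigmoid(W_f @ z + b_f)           # forget gate, Eq.(2)
    i_t = sigmoid(W_i @ z + b_i)           # input gate, Eq.(3)
    c_tilde = np.tanh(W_c @ z + b_c)       # candidate state (assumed standard form)
    c_t = f_t * c_prev + i_t * c_tilde     # cell state update
    o_t = sigmoid(W_o @ z + b_o)           # output gate, Eq.(5)
    h_t = o_t * np.tanh(c_t)               # hidden state, Eq.(6)
    return h_t, c_t

# Run a toy sequence of 5 steps with d = 3 inputs and m = 4 hidden units
rng = np.random.default_rng(0)
d, m = 3, 4
params = []
for _ in range(4):
    params += [rng.normal(scale=0.1, size=(m, m + d)), np.zeros(m)]
h, c = np.zeros(m), np.zeros(m)
for x_t in rng.normal(size=(5, d)):
    h, c = lstm_step(x_t, h, c, params)
print(h.shape, c.shape)  # (4,) (4,)
```

With all weights set to zero every gate equals 0.5 and the candidate state is 0, so the cell state is simply halved at each step, which makes the role of the forget gate easy to check by hand.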

2 Structure and principle of AR_CLSTM

2.1 Data processing

For the collected multivariate time series data, a fixed-size window selects a fixed number of consecutive time steps as the input of the model, and the step immediately following the window is used as the output of the model. By sliding the window from left to right, the inputs and outputs of the model are constructed. As shown in Fig.3, the time series is represented by $T=(t_1, t_2, t_3, \ldots, t_n)$. A sliding window with a size of 4 is selected, where $(X_1, X_2, \ldots, X_p)$ are the inputs of the model, $(Y_1, Y_2, \ldots, Y_p)$ are the outputs of the model, and $p$ is the number of samples in the resulting data set.

Fig.3 Data processing diagram
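The sliding-window construction can be sketched as follows. The helper name `make_windows` and the array layout (samples, window, variables) are choices made for this example rather than details from the paper.

```python
import numpy as np

def make_windows(series, window):
    """Build (X, Y) pairs: each X holds `window` consecutive steps,
    and Y is the step that immediately follows the window."""
    series = np.asarray(series, dtype=float)
    if series.ndim == 1:
        series = series[:, None]           # treat as a single variable
    p = len(series) - window               # number of samples
    if p <= 0:
        raise ValueError("series must be longer than the window")
    X = np.stack([series[i:i + window] for i in range(p)])
    Y = series[window:]
    return X, Y

X, Y = make_windows(np.arange(10), window=4)
print(X.shape, Y.shape)  # (6, 4, 1) (6, 1)
```

For a series of length $n$ and a window of size $w$, this yields $p = n - w$ samples.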

2.2 AR_CLSTM

In AR_CLSTM model,a linear structure and a nonlinear structure are combined in an end-to-end way.Thus,the AR_CLSTM structure is divided into two parts,one part learns the nonlinear characteristics of the data,and the other learns the linear characteristics of the data.

1)Nonlinear learning

The nonlinear feature learning module is shown in Fig.4. After the constructed input data are processed by the convolution module, the feature space of the sequence data is obtained, denoted as $S=(s_1, s_2, \ldots, s_t)$, $S \in \mathbb{R}^{t \times d}$, where $d$ is the number of variables in the multivariate sequence. The specific convolution operations are described in section 1.1. The feature space is input into the encoder to obtain the hidden states of the sequence data. In the attention module, the hidden states of different time steps are selected in different proportions to form the coding vector of the $t$-th time step. Inputting all the obtained time step encoding vectors into the decoder gives the predicted value $y_t$[18].

Fig.4 Nonlinear learning diagram

As shown in Fig.4, the encoder is essentially a recurrent neural network with LSTM as the basic unit. For a given input feature $s_t$, the encoder learns the mapping $f_1$ from $s_t$ to $h_t$, with $h_t = f_1(h_{t-1}, s_t)$. It encodes the input sequence into a feature $h$ of fixed size. The hidden states are $h = (h_1, h_2, \ldots, h_t)$, where $h_t \in \mathbb{R}^m$ is the hidden state of the encoder at time $t$, and $m$ is the size of the hidden unit.

In order to adaptively select the relevant encoder hidden states across all time steps, we add an attention mechanism over the hidden states of the different time steps. Specifically, the $t$-th context vector $C_t$ is obtained by weighting the encoder hidden states according to the hidden state $h_t$ of the decoder, namely

$$C_t = \sum_{i} a_i h_i, \tag{7}$$

$$\mathrm{Score} = V\tanh(W_1 h + W_2 h_t). \tag{8}$$

The vector $a$ in Eq.(7) is obtained by applying the SoftMax function to the Score function. The Score function is given in Eq.(8), where $h$ denotes the hidden states of all steps of the encoder, $h_t$ is the hidden state of step $t$, and $W_1$, $W_2$ and $V$ are the parameters to be trained.
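A compact NumPy sketch of this attention step is given below. It assumes that the context vector is the weighted sum of the encoder hidden states with SoftMax weights, which is the usual additive-attention form; the function name `attention_context` and the parameter shapes are hypothetical.

```python
import numpy as np

def attention_context(h, h_t, W1, W2, V):
    """Additive attention over encoder states.

    h   : (T, m) hidden states of all encoder steps
    h_t : (m,)   hidden state of step t
    W1, W2 : (k, m) and V : (k,) are the trainable parameters of Eq.(8)
    """
    scores = np.tanh(h @ W1.T + h_t @ W2.T) @ V   # Score, one value per step
    scores = scores - scores.max()                # numerical stability
    a = np.exp(scores) / np.exp(scores).sum()     # SoftMax weights
    context = a @ h                               # weighted sum of hidden states
    return context, a

rng = np.random.default_rng(1)
h = rng.normal(size=(5, 4))
h_t = rng.normal(size=4)
W1, W2 = rng.normal(size=(3, 4)), rng.normal(size=(3, 4))
C, a = attention_context(h, h_t, W1, W2, rng.normal(size=3))
print(C.shape, round(float(a.sum()), 6))  # (4,) 1.0
```

When $V$ is zero all scores are equal, so the weights become uniform and the context vector reduces to the mean of the encoder hidden states.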

The structure of the decoder is the same as that of the encoder. Initially, the last step $s_t$ and $C_1$ are concatenated as the input of the LSTM basic unit. After that, the previous LSTM output $y_{t-1}$ and $C_t$ are concatenated as the input of the LSTM basic unit, which finally yields the predicted value $y_t$.

2)Linear learning

Due to the nonlinear nature of the convolutional and recurrent components, one of the main disadvantages of neural network models is that the output scale is not sensitive to the input scale. In order to solve this problem, we add a linear prediction part to the above structure, using the classic multivariate autoregressive model (VAR) to learn linear features, as shown in Eq.(9)

(9)

where $w_1$ and $b_1$ are the parameters to be trained, $t_i$ is the time step in the training set, and $w$ is the window size. VAR learns the linear relationship between the current step and each historical time step.
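The linear part can be illustrated with a small autoregressive sketch in which the prediction is a weighted sum of the $w$ steps in the window plus a bias. For brevity the example is univariate and fits the weights by ordinary least squares with `np.linalg.lstsq`; this is a stand-in chosen for the illustration, whereas the paper trains $w_1$ and $b_1$ jointly with the rest of the network. The names `fit_ar` and `ar_predict` are hypothetical.

```python
import numpy as np

def fit_ar(X, y):
    """Least-squares fit of y ~ X @ w1 + b1, with X of shape (p, w)."""
    A = np.hstack([X, np.ones((len(X), 1))])   # extra column for the bias
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    return coef[:-1], coef[-1]

def ar_predict(X, w1, b1):
    """Linear prediction from the w historical steps in each window."""
    return X @ w1 + b1

# Synthetic windows whose targets are exactly linear in the window values
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 4))
true_w, true_b = np.array([0.1, 0.2, 0.3, 0.4]), 0.5
y = X @ true_w + true_b
w1, b1 = fit_ar(X, y)
print(np.allclose(w1, true_w), np.isclose(b1, true_b))  # True True
```

Because the prediction is linear in the window values, rescaling the inputs rescales the output proportionally, which is the scale sensitivity that the nonlinear branch lacks.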

The output of the nonlinear part of the AR_CLSTM model is denoted as $y_1$, the output of the linear part is denoted as $y_2$, and the output of AR_CLSTM can then be written as

$$y = \alpha_1 y_1 + \alpha_2 y_2, \tag{10}$$

where $\alpha_1$ and $\alpha_2$ are the parameters to be trained, and the AR_CLSTM model is combined in an end-to-end way. Here, we use adaptive moment estimation (Adam) to solve for the optimal parameters. In this way, the linear model and the nonlinear model compensate for each other, so that the overall performance of the model is improved.
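To make the combination in Eq.(10) concrete, the sketch below learns only $\alpha_1$ and $\alpha_2$ with a hand-written Adam update on the MSE loss, treating the two branch outputs as fixed arrays. This is a simplification for illustration: in the paper all parameters are trained jointly end to end. The function name `fit_alpha`, the initial values, the learning rate and the synthetic data are assumptions of the example.

```python
import numpy as np

def mse(pred, target):
    return float(np.mean((pred - target) ** 2))

def fit_alpha(y1, y2, y, lr=0.01, steps=2000,
              beta1=0.9, beta2=0.999, eps=1e-8):
    """Learn alpha in y ~ alpha[0]*y1 + alpha[1]*y2 with Adam on the MSE."""
    P = np.stack([y1, y2], axis=1)             # (n, 2) branch outputs
    alpha = np.array([0.5, 0.5])               # assumed initial weights
    m, v = np.zeros(2), np.zeros(2)
    for k in range(1, steps + 1):
        grad = 2.0 * P.T @ (P @ alpha - y) / len(y)
        m = beta1 * m + (1 - beta1) * grad     # first moment estimate
        v = beta2 * v + (1 - beta2) * grad ** 2  # second moment estimate
        m_hat = m / (1 - beta1 ** k)           # bias correction
        v_hat = v / (1 - beta2 ** k)
        alpha = alpha - lr * m_hat / (np.sqrt(v_hat) + eps)
    return alpha

rng = np.random.default_rng(0)
y1, y2 = rng.normal(size=200), rng.normal(size=200)
y = 0.7 * y1 + 0.3 * y2                        # synthetic target
loss_before = mse(0.5 * y1 + 0.5 * y2, y)
alpha = fit_alpha(y1, y2, y)
loss_after = mse(alpha[0] * y1 + alpha[1] * y2, y)
print(loss_after < loss_before)
```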

3 Experiment

3.1 Data set

In order to demonstrate the effectiveness of AR_CLSTM, we use four data sets for testing. The first is the electricity data set, which contains the electricity consumption (in kWh) of 321 customers from 2012 to 2014, recorded every 1 h. The second is the exchange_rate data set, which contains the daily exchange rates from 1990 to 2016 of eight countries, including the United Kingdom, Canada, Switzerland, China, Japan, New Zealand and Singapore. The third is the solar_AL data set, which records data from 137 photovoltaic power plants in Alabama in 2006, sampled every 10 min. The fourth is the traffic data set, which contains 48 months (2015-2016) of hourly road occupancy rates (between 0 and 1) from the California Department of Transportation, measured by different sensors on highways in the San Francisco Bay area. The sizes of the training and test sets and the number of variables of each data set are shown in Table 1. In this study, 80% of the experimental data are used as the training set and 20% as the test set. In addition, in order to eliminate the influence of differences in magnitude between the features, the data are normalized.
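The split and normalization can be sketched as below. The paper does not state the normalization method or whether the split is chronological, so this example assumes a chronological 80/20 split and min-max scaling with statistics computed on the training set only; `split_and_normalize` is a hypothetical helper.

```python
import numpy as np

def split_and_normalize(data, train_frac=0.8):
    """Chronological split followed by per-variable min-max scaling.

    Assumption: scaling statistics come from the training part only,
    so no information from the test period leaks into training.
    """
    data = np.asarray(data, dtype=float)
    n_train = int(len(data) * train_frac)
    train, test = data[:n_train], data[n_train:]
    lo, hi = train.min(axis=0), train.max(axis=0)
    scale = np.where(hi > lo, hi - lo, 1.0)    # guard against constant columns
    return (train - lo) / scale, (test - lo) / scale

data = np.arange(20, dtype=float).reshape(10, 2)   # 10 steps, 2 variables
train_n, test_n = split_and_normalize(data)
print(train_n.shape, test_n.shape)  # (8, 2) (2, 2)
```

With training-set statistics, test values may fall outside [0, 1] when the test period exceeds the range seen in training.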

Table 1 Specifications of data set

3.2 Network parameters

The data flow through the AR_CLSTM network structure is shown in Fig.5. The specific parameter settings of AR_CLSTM are shown in Table 2.

It can be seen in Fig.5 that the convolution block used by AR_CLSTM is a three-layer convolution network. Each layer is composed of Conv1D, max pooling and dropout. The specific parameter settings are shown in Table 2.

Fig.5 AR_CLSTM network structure diagram

Table 2 AR_CLSTM parameter

Convolution learns the relationships between different dimensions. The pooling layer performs a down-sampling operation on the convolution output to retain strong features and remove weak features. At the same time, the dropout layer is introduced to randomly deactivate units during training and thus prevent overfitting. In addition, it can be seen from Table 2 that the encoding and decoding structures are both composed of LSTM layers with a hidden unit size of 100. The size selected for the dense layer is the size of the output features of the fully connected layer. The structures of VAR, attention and model choice are described in section 2.2, and they have no control parameters. The model choice structure is given in Eq.(10). The model uses the mean square error (MSE) as the loss function, and uses the Adam optimization algorithm to update the weights of the network.
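The effect of the pooling and dropout layers can be seen in a minimal NumPy sketch. The pool size of 2, the non-overlapping windows and the inverted-dropout scaling are assumptions chosen for the example; the actual settings are those listed in Table 2.

```python
import numpy as np

def max_pool1d(x, size=2):
    """Non-overlapping max pooling: keep the strongest value in each
    block of `size` steps, which shortens (down-samples) the sequence."""
    x = np.asarray(x, dtype=float)
    n = (len(x) // size) * size                # drop an incomplete last block
    return x[:n].reshape(-1, size).max(axis=1)

def dropout(x, rate, rng, training=True):
    """Inverted dropout: zero a random fraction `rate` of the units during
    training and rescale the rest; acts as the identity at inference."""
    if not training or rate == 0.0:
        return x
    mask = rng.random(x.shape) >= rate
    return x * mask / (1.0 - rate)

print(max_pool1d([1.0, 3.0, 2.0, 5.0, 4.0]))  # [3. 5.]
```

Dropout leaves the number of trainable parameters unchanged; it only masks activations during training.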

3.3 Evaluation criteria

In order to demonstrate the effectiveness of the AR_CLSTM model, we use MSE, root mean square error (RMSE), and mean absolute error (MAE) to evaluate the errors produced by the model predictions. MSE, RMSE and MAE are calculated by Eqs.(11)-(13), respectively. The smaller the values of MSE, RMSE, and MAE, the better the accuracy of the model.

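$$\mathrm{MSE} = \frac{1}{n}\sum_{i=1}^{n}(y_i - \hat{y}_i)^2, \tag{11}$$

$$\mathrm{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}(y_i - \hat{y}_i)^2}, \tag{12}$$

$$\mathrm{MAE} = \frac{1}{n}\sum_{i=1}^{n}\left|y_i - \hat{y}_i\right|, \tag{13}$$

where $n$ is the number of predicted values, $y_i$ is the true value, and $\hat{y}_i$ is the predicted value.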
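The three evaluation criteria follow their standard definitions and can be computed directly with NumPy, as in this short sketch (the function names are chosen for the example).

```python
import numpy as np

def mse(y_true, y_pred):
    """Mean square error."""
    return float(np.mean((np.asarray(y_true) - np.asarray(y_pred)) ** 2))

def rmse(y_true, y_pred):
    """Root mean square error."""
    return float(np.sqrt(mse(y_true, y_pred)))

def mae(y_true, y_pred):
    """Mean absolute error."""
    return float(np.mean(np.abs(np.asarray(y_true) - np.asarray(y_pred))))

y_true = [1.0, 2.0, 3.0]
y_pred = [1.0, 2.0, 5.0]
print(mse(y_true, y_pred), rmse(y_true, y_pred), mae(y_true, y_pred))
```

In this toy case only the last prediction is wrong, by 2, so the MSE is 4/3, the RMSE is its square root, and the MAE is 2/3. RMSE penalizes large individual errors more heavily than MAE.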

3.4 Experimental results and analysis

3.4.1 Analysis of evaluation index

In order to verify the effectiveness of the AR_CLSTM model, we compare it with four baseline methods on four data sets. The experimental results are shown in Table 3.

Table 3 Experimental results

Table 3 shows the RMSE, MAE, and MSE of the different models on the different data sets. The window in the table is the size of the window mentioned in section 2.1. The value of the window is selected through a grid search over $w=\{12, 24, 30, 48\}$.

It can be seen from Table 3 that the proposed model learns the linear and nonlinear characteristics of the data very well and performs best on all data sets. In terms of RMSE, on the electricity data set the AR_CLSTM model improves by 11.8% over VAR, by 57.8% over LSTM, and by 59.1% over GRU. On the exchange_rate data set it improves by 60.4% over VAR, by 49.7% over LSTM, and by 14.7% over GRU. On the solar_AL data set it improves by 13.5% over VAR, by 34.3% over LSTM, and by 40.2% over GRU. On the traffic data set it improves by 22.8% over VAR, by 26.7% over LSTM, and by 28.3% over GRU. It can be seen that CNN has the worst learning ability for time series data, GRU and LSTM are similar, and AR_CLSTM is the best.

3.4.2 Analysis of model time consumption

In this study, we chose the solar_AL data set, which has the largest test set, and the traffic data set, which has the highest number of variables, to analyze and compare the prediction time of the AR_CLSTM model with those of VAR, LSTM, and GRU. The time required by the four models to complete the prediction on the solar_AL data set is shown in Table 4, and the time required on the traffic data set is shown in Table 5.

Table 4 Time consumption on solar_AL

Table 5 Time consumption on traffic

Tables 4 and 5 show that, compared with the other models, the AR_CLSTM model does not take much time to complete the prediction, yet its accuracy is higher.

3.4.3 Display of prediction results

In order to better observe the details of the prediction results, we plot the predictions for the last 300 steps of each data set. Figs.6-9 show the partial prediction results of the AR, LSTM, and AR_CLSTM models on the electricity, exchange_rate, solar_AL, and traffic data sets, respectively. The predicted values are drawn with a dashed line, and the true values are drawn with a solid line. It can be seen that the prediction results of the models differ across data sets.

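Figs.6-9 Partial prediction results on the electricity, exchange_rate, solar_AL and traffic data sets: (a) AR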

For the electricity and exchange_rate data sets, AR_CLSTM has the best fit. On the solar_AL data set, the predictions of the AR model for night-time solar energy are not as good as those of AR_CLSTM, and the overall fit of LSTM is not as good as that of AR_CLSTM. On the traffic data set, none of the AR, LSTM, and AR_CLSTM models predicts the irregular, extremely high peaks satisfactorily. However, it can be seen from the prediction results that AR_CLSTM has the strongest learning ability for sequence data across the different data sets.


3.4.4 Ablation experiment

In order to further analyze the contribution of the convolution, attention and autoregression modules of the AR_CLSTM model relative to the original model, we compare them on the electricity and traffic data sets. For ease of description, the encoder_decoder network without the attention mechanism is named en_de, and the model with the attention mechanism is named en_de_att. After the convolution module is introduced, it is named CNN_en_de_att. The results are shown in Tables 6 and 7. It can be seen that the attention mechanism, convolution module, and autoregressive module each improve on the original model to varying degrees. However, the attention mechanism and autoregressive module clearly contribute more to AR_CLSTM. This raises a new challenge: how to use CNN more effectively to learn the characteristics of sequence data.

Table 6 Ablation experiment on electricity

Table 7 Ablation experiment on traffic

4 Conclusions

The AR_CLSTM model is proposed for multivariate time series prediction. It uses a convolution structure to learn the internal relations between the different dimensions of the multivariate time series, and then feeds the resulting feature information, which contains less redundant data, into the encoder_decoder network structure based on time step attention to extract the coupling relationships in the time series data. The model also uses the traditional linear autoregressive method to learn the linear relationships of the time series, and improves the accuracy of multivariate time series forecasting by learning both the linear and nonlinear relationships of the time series. The effectiveness of the model is verified by testing on four multivariate time series data sets. Experimental results show that the AR_CLSTM model can combine the advantages of linear prediction and nonlinear prediction, and that its effect varies across different types of data sets.

It can be seen from the experiments that adding CNN modules improves the learning ability of the model to a certain extent. However, how to use the convolution operation to further extract the features of time series data needs further research and exploration. In addition, since the performance of deep learning models depends in most cases on the parameters and the conditions of the data set, how to improve and stabilize the predictive ability of this model and better support online time series prediction and early warning applications is also a focus of future research.
