
Parallel Reinforcement Learning-Based Energy Efficiency Improvement for a Cyber-Physical System


IEEE/CAA Journal of Automatica Sinica, 2020, Issue 2

Teng Liu, Bin Tian, Yunfeng Ai, and Fei-Yue Wang

Abstract—As a complex and critical cyber-physical system (CPS), the hybrid electric powertrain is significant for mitigating air pollution and improving fuel economy. The energy management strategy (EMS) plays a key role in improving the energy efficiency of this CPS. This paper presents a novel bidirectional long short-term memory (LSTM) network based parallel reinforcement learning (PRL) approach to construct the EMS for a hybrid tracked vehicle (HTV). This method contains two levels. The high level first establishes a parallel system, which includes a real powertrain system and an artificial system. Then, the synthesized data from this parallel system is used to train a bidirectional LSTM network. The lower level determines the optimal EMS using the trained action-value function in the model-free reinforcement learning (RL) framework. PRL is a fully data-driven and learning-enabled approach that does not depend on any prediction or predefined rules. Finally, real vehicle testing is implemented, and the relevant experimental data is collected and calibrated. Experimental results validate that the proposed EMS achieves a considerable energy efficiency improvement compared with the conventional RL approach and deep RL.

I. Introduction

CYBER-PHYSICAL systems (CPSs) are defined as systems wherein the physical components are deeply intertwined with the software components to exhibit various and distinct behavioral patterns [1]. Recently, increased performance demands and complex usage patterns have accelerated the development and research of CPSs [2]–[4]. As a typical application of CPS in green transportation, hybrid electric vehicles (HEVs) show great potential to reduce energy consumption and air pollution [5], [6]. In such a system, the hybrid electric powertrain and driving environments constitute the physical resources, while communication and control data compose the cyber part of the system [7], [8]. Strong nonlinearities and uncertainties in the interactions between the cyber and physical resources increase the difficulty of control, management, and optimization of HEVs [9], [10]. In particular, the energy management of HEVs is critical, and several challenges remain to be resolved, such as optimization, calculation time, and adaptability [11], [12].

Energy management strategies (EMSs) have been an active research topic for decades because they can achieve remarkable energy efficiency. Existing EMSs are generally classified into three categories: rule-based, optimization-based, and learning-based. Rule-based strategies depend on a set of predefined criteria without knowledge of real-world driving conditions [13], [14]. Binary control, as a typical example, is used to adjust the power split between the battery and engine when the state of charge (SOC) exceeds threshold values. When the trip information is known a priori, many approaches have been applied to search for the optimal control strategies, such as dynamic programming (DP) [15], stochastic dynamic programming (SDP) [16], Pontryagin's minimum principle (PMP) [17], model predictive control (MPC) [18], and the equivalent consumption minimization strategy (ECMS) [19]. However, these strategies are usually inappropriate for varying driving environments [20]. Owing to the ultrafast development of computing capability, learning-based methods show great potential in learning control strategies from recorded historical driving data [21], [22]. This type of method needs to be developed further.

As a complex CPS, the hybrid electric powertrain still faces several issues in handling energy management problems. The first is data scarcity [23]. The controller needs to collect new data and learn new model parameters to derive different strategies for new driving conditions. The second is data inefficiency [24]. The large-dimensional actions and states of a complex CPS need to be calibrated and scheduled reasonably to guide the controller. The final one is universality. Adaptive and efficient control strategies need to be generated to accommodate dynamic real-world driving conditions.

Fig. 1. Bidirectional LSTM network-based parallel reinforcement learning framework.

To address these difficulties, we develop a novel bidirectional long short-term memory (LSTM) network based parallel reinforcement learning (PRL) framework to construct the EMS for a hybrid tracked vehicle (HTV); see Fig. 1 for an illustration. This framework involves two levels. In the high-level structure, an artificial vehicle powertrain system is built in analogy to the real vehicle to constitute the parallel powertrain system. The large amount of synthesized data from this parallel system is utilized to relieve the data scarcity problem. A bidirectional LSTM network is proposed to represent the dependence between multi-actions and states. This network can capture more details of the interactions between multi-action embeddings to solve the data inefficiency problem. In the lower-level skeleton, a model-free reinforcement learning (RL) algorithm is finally used to compute the adaptive control strategy based on the trained data.

This work makes contributions from three perspectives:

1) A parallel system of the HTV is constructed to generate a large amount of synthesized data based on the limited real historical data;

2) A bidirectional LSTM network is proposed to train on the available data so as to effectively model the action-value function;

3) A model-free RL technique is applied to derive the adaptive EMS that accommodates different driving conditions.

Experimental results illustrate that the proposed EMS achieves considerable energy efficiency improvement compared with the conventional RL approach and deep RL.

The remainder of this paper is organized as follows. Section II describes the high-level architecture of the deep neural network for data estimation and the bidirectional LSTM network framework. Section III describes the modeling of the hybrid electric powertrain, wherein the optimal control problem is constructed, and also introduces the structure of the lower-level model-free RL algorithm. In Section IV, the data collection in real vehicle tests and the synthesized data processing are elaborated, and experimental results comparing three control strategies are presented. Key takeaways are summarized in Section V.

II. Bidirectional LSTM Framework

The bidirectional LSTM network framework for action-value function estimation is introduced in this section. First, a multilayer deep neural network is constructed by taking the powertrain states and actions as inputs. The states are the battery SOC and the generator speed, and the actions are the engine torque, power demand, and motor speed. Based on this network, the bidirectional LSTM theory is formulated to approximate the action-value function. The detailed components are illustrated as follows.

A. Multilayer Neural Network

A deep neural network is a logical-mathematical model that seeks to simulate the behavior and function of a biological neuron [25]. Three layers, named the input layer, hidden layer, and output layer, are included in this network; see Fig. 2(a) for an illustration. The input vector z = [z1, z2, …, zN] is weighted by the elements ω1, ω2, …, ωN, then summed with a bias b and passed through an activation function f to generate the neuron output as follows:
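For clarity, a sketch of this standard single-neuron computation, written with the symbols defined below (net input x, neuron output y, weights ω_j, bias b, activation f), is:

x = \sum_{j=1}^{N} \omega_j z_j + b, \qquad y = f(x)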

Fig. 2. Multilayer neural network for states and actions training. (a) Deep neural network construction; (b) LSTM memory block.

where x denotes the net input, y is the neuron output, N is the total number of inputs, and z_j is the jth input.

The log-sigmoid activation function is adopted in this paper, and thus the output of the overall network is depicted as

where f2 and f1 represent the activation functions of the hidden layer and the output layer, respectively, and S is the number of neurons in the hidden layer. The hidden-layer weights connect the jth input to the ith hidden neuron, the output-layer weights connect the ith hidden neuron to the output-layer neuron, and each hidden neuron and the output neuron carry their own bias terms.
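Assuming the notation ω^1_{ij} and ω^2_i for the hidden- and output-layer weights and b^1_i and b^2 for the hidden and output biases described above (these symbol names are introduced here only for illustration), the overall network output can be sketched as:

y = f_1\!\left(\sum_{i=1}^{S} \omega^2_{i}\, f_2\!\left(\sum_{j=1}^{N} \omega^1_{ij}\, z_j + b^1_i\right) + b^2\right)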

B. Forward Pass of LSTM

A memory block is the key constituent of an LSTM network. For each block, three adaptive and multiplicative gating units are shared by multiple cells, as shown in Fig. 2(b). Furthermore, a recurrently self-connected linear unit called the constant error carousel (CEC) is the core of each block. The CEC can provide short-term memory storage for extended time periods by recirculating activation and error signals indefinitely. The three gating units can be trained to recognize, store, and read information from the memory block. All the cells are combined into the block to share the same gates and reduce the number of adaptive parameters [26].

In this paper, the LSTM network is operated in two courses, and the time steps are discretized as t = 0, 1, 2, …. The two courses are the forward pass and the backward pass, which correspond to updating the units' activations and calculating the error signals for the weights, respectively. The notation in the following is defined as: j is the index of the memory block, v denotes the position of a cell within block j, and thus c_j^v means the vth cell in the jth memory block. y_c is the output of the memory cell, which is calculated from the cell state s_c, cell input z_c, input gate z_in, output gate z_out, and forget gate z_θ. ω_lm is the weight connecting units l and m. The components of one cell are described as follows.

1) Input: In the forward pass, the cell input is first computed as

This variable is squashed by the input function g and later used to generate the new cell state.

The input gate activation y_in is derived by applying a logistic sigmoid squashing function f_in, with range [0, 1], to the gate's net input z_in:
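A sketch of these two quantities in the standard LSTM formulation (using the symbols defined above; not necessarily the paper's exact expressions) is:

z_c(t) = \sum_{m} \omega_{c m}\, y_m(t-1), \qquad z_{in}(t) = \sum_{m} \omega_{in\, m}\, y_m(t-1), \qquad y_{in}(t) = f_{in}\big(z_{in}(t)\big)

where m ranges over the source units feeding the memory block.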

2) Cell State: The memory cell state s_c is initialized to zero at t = 0, and then it accumulates based on the input and the discount factor of the forget gate. First, the forget gate activation is defined as

where f_θ represents a logistic sigmoid function ranging from 0 to 1. Then, the new cell state is derived as follows:
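In the standard formulation, this reads (again a sketch rather than the paper's exact equations):

y_{\theta}(t) = f_{\theta}\Big(\sum_{m} \omega_{\theta m}\, y_m(t-1)\Big), \qquad s_c(t) = y_{\theta}(t)\, s_c(t-1) + y_{in}(t)\, g\big(z_c(t)\big), \qquad s_c(0) = 0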

The input gate decides what information to store in the memory block, and the forget gate determines when to erase outdated information. In this way, the memory block retains fresh data and the cell state cannot grow to infinity.

3) Output: Read access to the information is controlled by the output gate via multiplication with the output from the CEC. The relevant activation is calculated by applying the squashing function (range [0, 1]) to the net input:

Then, the cell output y_c is described by the cell state and the output gate activation as follows:

Finally, the activation of the output units k is depicted as

where m ranges over all units, and f_k is the output squashing function.
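A sketch of these output-side computations, omitting the optional squashing of the cell state that some LSTM variants apply before the output gate, is:

y_{out}(t) = f_{out}\Big(\sum_{m} \omega_{out\, m}\, y_m(t-1)\Big), \qquad y_c(t) = y_{out}(t)\, s_c(t), \qquad y_k(t) = f_k\Big(\sum_{m} \omega_{k m}\, y_m(t)\Big)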

C. Backward Pass of LSTM

LSTM's backward pass is a truncated version of real-time recurrent learning (RTRL) for the weights to the cell input, input gates, and forget gates. In addition, it efficiently combines this with the backpropagation (BP) errors of the output units and output gates.

1) Output Units and Gates: Based on the target t_k, the squared error objective function is depicted as

where e_k is the externally injected error. The gradient descent algorithm is used to minimize the objective function. The weight ω_lm is updated by the variation Δω_lm, which is calculated via the negative gradient of E times the learning rate α. Hence, the standard BP weight changes of the output units are

The standard BP is also utilized to compute the weight changes for connections to the output gate from source units m:
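As a sketch of these quantities in the single-cell-per-block case, following the standard truncated backpropagation derivation for LSTM (the exact expressions in the paper may differ):

E(t) = \frac{1}{2}\sum_{k} e_k(t)^2, \qquad e_k(t) = t_k(t) - y_k(t)

\Delta\omega_{km}(t) = \alpha\, \delta_k(t)\, y_m(t), \qquad \delta_k(t) = f_k'\big(z_k(t)\big)\, e_k(t)

\Delta\omega_{out\, m}(t) = \alpha\, \delta_{out}(t)\, y_m(t-1), \qquad \delta_{out}(t) = f_{out}'\big(z_{out}(t)\big)\, s_c(t) \sum_{k} \omega_{k c}\, \delta_k(t)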

2) Truncated RTRL Partials: Forward propagation in time is necessary for the partial derivatives in RTRL. These partials for the weights to the cell input (c), input gate (in), and forget gate (θ) are updated as follows:

where, at t = 0, these partials are equal to zero.
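Following the standard truncated-RTRL derivation (a sketch; the paper's exact partials may differ slightly):

\frac{\partial s_c(t)}{\partial \omega_{c m}} = \frac{\partial s_c(t-1)}{\partial \omega_{c m}}\, y_{\theta}(t) + g'\big(z_c(t)\big)\, y_{in}(t)\, y_m(t-1)

\frac{\partial s_c(t)}{\partial \omega_{in\, m}} = \frac{\partial s_c(t-1)}{\partial \omega_{in\, m}}\, y_{\theta}(t) + g\big(z_c(t)\big)\, f_{in}'\big(z_{in}(t)\big)\, y_m(t-1)

\frac{\partial s_c(t)}{\partial \omega_{\theta m}} = \frac{\partial s_c(t-1)}{\partial \omega_{\theta m}}\, y_{\theta}(t) + s_c(t-1)\, f_{\theta}'\big(z_{\theta}(t)\big)\, y_m(t-1)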

3) RTRL Weight Changes: In the backward pass, the RTRL partials are employed to compute the weight changes Δω_lm for connections to the forget gate, cell, and input gate as

At each memory cell, the internal state error e_sc is determined as
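In the single-cell case, a sketch of the internal state error and the resulting weight changes is:

e_{s_c}(t) = y_{out}(t) \sum_{k} \omega_{k c}\, \delta_k(t), \qquad \Delta\omega_{l m}(t) = \alpha\, e_{s_c}(t)\, \frac{\partial s_c(t)}{\partial \omega_{l m}}, \quad l \in \{c,\ in,\ \theta\}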

D. Bidirectional LSTM Outline

In bidirectional recurrent nets, the forward and backward sequences of each training example are regarded as two independent recurrent nets connected to the same output layer. Taking the time sequence from t – 1 to t as an example, the outline that combines the bidirectional algorithm with LSTM is illustrated as follows (a minimal code sketch of training such a network is given after this outline).

1) Forward Pass: Feed all input data of the sequence into the LSTM and determine all the output units.

a) For the forward states (from time t – 1 to t) and backward states (from time t to t – 1), realize the forward pass process in Section II-B;

b) For the output layer, realize the forward pass process in Section II-B.

2) Backward Pass: Compute the relevant partial derivatives of the error for the sequence used in the forward pass.

a) For the output neurons, achieve the backward pass process introduced in Section II-C;

b) For the forward states (from time t to t – 1) and backward states (from time t – 1 to t), achieve the backward pass process discussed in Section II-C.

3) Update Weight Changes: Finally, (16)–(19) are used to update the RTRL weight changes.
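As a concrete illustration (not the authors' implementation), the following minimal sketch builds a bidirectional LSTM regressor that maps a short sequence of state-action features to a scalar value estimate. It assumes TensorFlow/Keras is available; the sequence length, feature count, layer sizes, and training settings are illustrative placeholders, and the training data below is synthetic.

# Minimal bidirectional-LSTM value-function approximator (illustrative sketch only).
import numpy as np
import tensorflow as tf

TIME_STEPS = 10   # length of each state-action sequence (assumed)
N_FEATURES = 5    # SOC, generator speed, engine torque, power demand, motor speed

inputs = tf.keras.Input(shape=(TIME_STEPS, N_FEATURES))
x = tf.keras.layers.Bidirectional(tf.keras.layers.LSTM(32))(inputs)  # forward + backward passes
x = tf.keras.layers.Dense(16, activation="sigmoid")(x)               # sigmoid hidden layer
outputs = tf.keras.layers.Dense(1)(x)                                 # scalar value estimate
model = tf.keras.Model(inputs, outputs)
model.compile(optimizer="adam", loss="mse")

# Synthetic placeholder data standing in for samples from the parallel system.
X = np.random.rand(256, TIME_STEPS, N_FEATURES).astype("float32")
y = np.random.rand(256, 1).astype("float32")
model.fit(X, y, epochs=5, batch_size=32, verbose=0)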

III. Powertrain Model and Parallel Reinforcement Learning

In this section, the energy management of a hybrid tracked vehicle (HTV) is constructed as an optimal control problem. The modeling of the battery pack and the engine-generator set (EGS), together with the optimization objective, is first introduced. To resolve the data scarcity problem of a complex CPS, a parallel system of the hybrid electric powertrain is then proposed to generate artificial data. The real and artificial driving data constitute the synthesized data, which is trained to approximate the action-value function. Finally, the Q-learning algorithm is applied to compute the optimal control action according to the trained data from the bidirectional LSTM network.

The studied complex CPS is a self-built HTV and Fig. 3 depicts the sketch of the powertrain architecture. The main energy sources to propel the powertrain are the EGS and battery [10]. Table I lists the key characteristics of the HTV powertrain.

A. Powertrain Modeling

Fig. 3. Sketch of the self-built HTV architecture.

TABLE I Specification of the HTV

For the EGS, the rated engine power is 52 kW at a speed of 6200 rpm. The rated generator output power is 40 kW within the speed range from 3000 rpm to 3500 rpm. The generator speed is the first state variable and is computed based on the torque equilibrium constraint

where n_g and n_e are the rotational speeds, and T_g and T_e are the torques of the generator and engine, respectively. T_e is one of the control variables in this work. J_e and J_g are the rotational moments of inertia of the engine and generator, respectively. i_eg is the gear ratio connecting the generator and engine, and 0.1047 is the unit conversion factor (1 r/min = 0.1047 rad/s).
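One common way to write this torque balance, assuming the engine is geared to the generator shaft with n_e = i_eg·n_g so that the engine torque and inertia are reflected through i_eg (a sketch, not necessarily the paper's exact equation), is:

\big(J_g + i_{eg}^2 J_e\big)\, \frac{d\big(0.1047\, n_g\big)}{dt} = i_{eg}\, T_e - T_g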

The output voltage and torque of the generator are derived as follows:

where K_e is the electromotive force coefficient, U_g and I_g are the generator voltage and current, respectively, K_x·n_g is the electromotive force term, and K_x = 3P·L_g/π, in which L_g is the armature synchronous inductance and P is the number of poles.

In the hybrid electric powertrain, the battery SOC is selected as the other state variable. The output voltage and the derivative of the SOC of the battery are depicted via an equivalent first-order model:

where I_bat and C_bat are the battery current and capacity, respectively, P_bat is the battery power, r_in is the battery internal resistance, and V_oc is the open-circuit voltage. U_bat is the output voltage of the battery, and r_dis(SOC) and r_ch(SOC) describe the internal resistance during discharging and charging, respectively.
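These definitions correspond to the usual internal-resistance battery model, sketched as:

\dot{SOC} = -\frac{I_{bat}}{C_{bat}}, \qquad I_{bat} = \frac{V_{oc} - \sqrt{V_{oc}^2 - 4\, r_{in}\, P_{bat}}}{2\, r_{in}}, \qquad U_{bat} = V_{oc} - I_{bat}\, r_{in}

with r_in = r_dis(SOC) during discharging and r_in = r_ch(SOC) during charging.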

The optimization goal to be minimized is a trade-off between the charge-sustaining constraint and the fuel consumption over a finite horizon:
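A representative form of such a cost function, where β is a hypothetical weighting factor on the charge-sustaining term and SOC_ref is the reference SOC (both introduced here for illustration only), is:

J = \int_{0}^{T} \Big[\, \dot{m}_f\big(T_e(t), n_e(t)\big) + \beta\,\big(SOC_{ref} - SOC(t)\big)^2 \,\Big]\, dt

where \dot{m}_f is the instantaneous fuel consumption rate.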

Furthermore, the instantaneous physical limits need to be observed to guarantee the reliability and safety of the powertrain:

where n_e,min, n_e,max, T_e,min, and T_e,max are the permitted lower and upper bounds of the engine speed and torque, respectively; n_m is the motor speed, with n_m,min and n_m,max its boundary values; and P_dem,min and P_dem,max are the thresholds of the admissible power demand set, analogously to n_g,min and n_g,max.
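Written out, these limits amount to the box constraints

n_{e,min} \le n_e \le n_{e,max}, \qquad T_{e,min} \le T_e \le T_{e,max}, \qquad n_{g,min} \le n_g \le n_{g,max}

n_{m,min} \le n_m \le n_{m,max}, \qquad P_{dem,min} \le P_{dem} \le P_{dem,max}

which hold at every time instant.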

Note that, since the core of this article focuses on the PRL technique for a complex CPS, the traction motors are assumed to be power conversion devices with identical efficiency, and battery aging is not considered in this study [9], [10].

B. Parallel Powertrain System

Fei-Yue Wang first initiated the parallel system theory in 2004 [28], [29], in which the ACP method was proposed to deal with complex CPS problems. The ACP approach comprises artificial societies (A) for modeling, computational experiments (C) for analysis, and parallel execution (P) for control. An artificial system is usually built by modeling to explore data and knowledge as the real system does. By executing independently and complementarily in these two systems, the learning model can be made more efficient and less data-hungry. The ACP approach has been employed in several fields to address different problems in complex CPSs [30]–[32].

Fig. 4. Parallel powertrain system for the self-built HTV.

For a self-built HTV, there are not sufficient operating environments available to obtain enough actual data. Hence, we build an artificial powertrain system in MATLAB/Simulink to address the data scarcity problem in action-value function training. This artificial system, combined with the real powertrain system, constitutes the parallel system; see Fig. 4(a) for an illustration. Since the steering power is small enough to be neglected, the speeds of the two tracks are taken as their average speed. By taking a few field-test data as guidance and regulating the parameters of the powertrain model and environments, a large amount of artificial data is acquired, including the SOC, generator speed, engine torque, engine speed, power demand, battery current, battery voltage, and the two motor speeds. The synthesized data from the parallel system is collected and calibrated to derive the optimal EMS using the bidirectional LSTM network and reinforcement learning.

C. Reinforcement Learning

A learning agent interacts with a stochastic environment in the reinforcement learning (RL) framework, and this interaction is modeled as a discrete discounted Markov decision process (MDP). The MDP is expressed as a quintuple (S, A, P, R, γ), where A and S are the sets of control actions and state variables, P is the transition probability matrix (TPM), R is the reward function, and γ is a discount factor. In particular, the state variables in this paper involve the SOC and generator speed, the control actions consist of the engine torque, power demand, and motor speed, the reward function represents the fuel consumption rate, and p(s′ | s, a) denotes the probability of transferring from state s to the next state s′ when taking action a.

The value function is defined as the expected future reward

Then, the finite expected discounted and accumulated reward is summarized as the optimal value function, where π is the control policy, which depicts the control action distribution over the time sequence. To deduce the optimal control action at each time instant, (26) is reformulated recursively as

The optimal control policy is determined based on the optimal value function in (27)

Furthermore, the action-value function and its corresponding optimal measure are described as follows:
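Since the reward here is the fuel consumption rate, optimality corresponds to minimization; a sketch of these standard definitions (not necessarily the paper's exact notation) is:

V^{\pi}(s) = \mathbb{E}\Big[\sum_{t=0}^{T} \gamma^{t}\, r(s_t, a_t) \,\Big|\, s_0 = s, \pi\Big], \qquad V^{*}(s) = \min_{\pi} V^{\pi}(s)

V^{*}(s) = \min_{a}\Big[r(s,a) + \gamma \sum_{s'} p(s' \mid s, a)\, V^{*}(s')\Big], \qquad \pi^{*}(s) = \arg\min_{a}\Big[r(s,a) + \gamma \sum_{s'} p(s' \mid s, a)\, V^{*}(s')\Big]

Q^{*}(s,a) = r(s,a) + \gamma \sum_{s'} p(s' \mid s, a)\, \min_{a'} Q^{*}(s', a'), \qquad V^{*}(s) = \min_{a} Q^{*}(s,a)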

Fig. 5 shows the bidirectional LSTM-based deep reinforcement network that is utilized to estimate the action-value function in RL. This structure includes two deep neural networks, one for the state-variable embeddings and another for the control sub-action embeddings. The bidirectional LSTM network is expected to capture more information on how the individual embeddings are combined into an integrated embedding owing to its nonlinear structure.

The inner product is used to compute the new action-value estimate by combining the state and sub-action neuron outputs as

where K_1 and K_2 are the numbers of states and sub-actions, respectively, and the result denotes the expected accumulated future reward associated with the specific state variables.
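For illustration, if φ(s) and ψ(a) denote hypothetical embedding vectors of a common dimension d produced by the state and sub-action networks, the combination can be sketched as the inner product

Q(s, a) \approx \langle \phi(s), \psi(a) \rangle = \sum_{k=1}^{d} \phi_k(s)\, \psi_k(a)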

Finally, the action-value function corresponding to an optimal control policy can be computed using the Q-learning algorithm as [33]

Fig. 5. Bidirectional LSTM-based deep neural network.

Algorithm 1 describes the pseudo-code of the Q-learning algorithm. The discount factor γ is set to 0.96, the decaying factor μ is related to the time instant k and decreases over time to accelerate the convergence rate, the number of iterations N_it is 10 000, and a fixed sample time is used.

Algorithm 1: Q-learning Algorithm
1. Extract Q(s, a) from training and initialize the iteration number N_it
2. Repeat for time instant k = 1, 2, 3, …
3.   Based on Q(s, ·), choose action a (ε-greedy policy)
4.   Execute action a and observe r, s′
5.   Define a* = arg min_a Q(s′, a)
6.   Q(s, a) ← Q(s, a) + μ(r(s, a) + γ min_a′ Q(s′, a′) − Q(s, a))
7.   s ← s′
8. until s is terminal
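A minimal tabular sketch of Algorithm 1 follows (illustrative only; the state/action discretization, reward table, transition model, and learning-rate schedule below are placeholders rather than the paper's calibrated values).

# Tabular Q-learning with cost minimization, mirroring Algorithm 1.
# All sizes, rewards, and transitions below are illustrative placeholders.
import numpy as np

n_states, n_actions = 50, 10
rng = np.random.default_rng(0)
Q = np.zeros((n_states, n_actions))            # could instead be warm-started from the LSTM estimate
reward = rng.random((n_states, n_actions))     # placeholder fuel-rate table r(s, a)
P = rng.random((n_states, n_actions, n_states))
P /= P.sum(axis=2, keepdims=True)              # placeholder transition probabilities p(s'|s, a)

gamma, n_iter, epsilon = 0.96, 10_000, 0.1
s = 0
for k in range(1, n_iter + 1):
    mu = 1.0 / k                               # decaying learning rate (assumed schedule)
    if rng.random() < epsilon:                 # epsilon-greedy exploration
        a = int(rng.integers(n_actions))
    else:
        a = int(np.argmin(Q[s]))               # cost minimization, as in Algorithm 1
    r = reward[s, a]
    s_next = int(rng.choice(n_states, p=P[s, a]))
    Q[s, a] += mu * (r + gamma * Q[s_next].min() - Q[s, a])
    s = s_next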

The TPMs of the power demand and the vehicle model are the inputs of the RL technique for optimal EMS calculation. The RL algorithm is implemented in MATLAB via the Markov decision process (MDP) toolbox presented in [34] on a microprocessor with an Intel quad-core CPU. The proposed EMS is compared with the conventional RL approach and deep RL to demonstrate its optimality and availability in the next section.

IV. Experiment Results and Discussions

The proposed bidirectional LSTM enabled PRL-based energy management strategy (EMS) is assessed on the self-built HTV powertrain in this section. First, the data collection and processing are introduced in detail. We operate the HTV in real scenarios to collect real vehicle driving data. Based on this data, we generate synthesized data from the parallel system for action-value function estimation, including all the states and control variables. Then, the presented PRL-based energy management strategy is compared with the conventional RL and deep RL approaches to evaluate its availability and optimality. Simulation results indicate that the proposed energy management strategy is superior to the two benchmark techniques in control performance.

A. Data Collection and Processing

The real vehicle experiment is implemented on the self-built HTV in a suburban area to represent cross-country scenarios, and the real and target driving cycles are depicted in Fig. 6. The vehicle data and powertrain states are collected from the CAN bus at a fixed sampling frequency. The collected driving data is applied to create a large amount of artificial data in the parallel powertrain system. Subject to the physical constraints of the powertrain, the inputs of the parallel system are randomly generated engine throttle and rotational speed commands, and the outputs are the state variables and control actions. Fig. 7 illustrates a period of the driving data generated by the parallel powertrain system.

Fig. 6. Real vehicle testing scenario and the corresponding driving cycles.

Fig. 7. One period of the generated driving data from the parallel system.

Furthermore, to eliminate the influence of different variable units on training, the input state variables and control actions of the network are scaled to the range from 0 to 1.
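A minimal min-max scaling sketch of this preprocessing step (the function name and any column layout are hypothetical):

# Min-max scaling of each signal to [0, 1] before network training (illustrative).
import numpy as np

def minmax_scale(data: np.ndarray) -> np.ndarray:
    """Scale each column of a (samples x signals) array to the range [0, 1]."""
    lo = data.min(axis=0)
    hi = data.max(axis=0)
    return (data - lo) / np.maximum(hi - lo, 1e-12)  # guard against constant columns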

B. Comparison of Different EMSs

Based on the trained action-value function, the proposed bidirectional LSTM enabled PRL-based EMS is compared with the conventional RL and deep RL controls in this section to certify its availability and optimality. In the energy management problem, the simulation cycle is a real vehicle driving cycle, and the initial values of the state variables SOC and generator speed are 0.7 and 1200 rpm, respectively.

The SOC trajectories over a certain driving cycle and the corresponding generator speed are illustrated in Fig. 8. It can be discerned that the SOC trajectory under the proposed model-free EMS is close to that under deep RL control, and both differ from that under conventional RL control. This can be explained by the different power split between the EGS and battery, which is decided by the action-value functions. It demonstrates that the training process in the deep neural network can improve the accuracy and optimality of the control policy derived by the Q-learning algorithm. An analogous result for the generator speed trajectory is also given in Fig. 8.

Fig. 8. State variables SOC and generator speed trajectories for the different control strategies.

Taking the engine torque as an example, the above observation can be explained by the different distributions of engine torque over the state variables. As a control variable, different values of the engine torque determine the multiple operating modes of the powertrain, as shown in Fig. 9.

The convergence processes of the action-value function in the proposed EMS, conventional RL, and deep RL are illustrated in Fig. 10. The mean discrepancy depicts the deviation between two action-value functions every 100 iterations. Note that an increase in the number of iterations is accompanied by a decreased mean discrepancy, which implies the convergence characteristic of the Q-learning algorithm.

Fig. 9. Control action engine torque with the state variables.

Fig. 10. Convergence rates of the action value function in three controls.

Fig. 10 also shows that the proposed control is superior to the conventional RL and deep RL controls in control performance, while its convergence rate is slightly slower than theirs. This can be explained by the additional training process of the action-value function in the bidirectional LSTM network. With an acceptable calculation speed, the proposed EMS adapts to real-time driving conditions better than the conventional RL and deep RL controls, which demonstrates its availability.

Table II lists the fuel consumption after SOC correction and the computation time for the three control strategies. It is apparent that the fuel consumption under the PRL-enabled EMS is lower than those under the conventional RL-based and deep RL controls, which demonstrates its optimality. Also, the consumed time of PRL is lower than that of deep RL and conventional RL, which implies that it has the potential to be applied in real time.

TABLE II The Fuel Consumption in Three Control Strategies

V. Conclusion

We propose a novel bidirectional LSTM network based PRL framework to construct the EMS for an HTV in this paper. First, the upper level builds an artificial vehicle powertrain system in analogy to the real vehicle to constitute the parallel powertrain system. Second, a bidirectional LSTM network is proposed to train on the large synthesized data from this parallel system to represent the dependence between multi-actions and states. Third, in the lower-level skeleton, a model-free RL algorithm is finally used to compute the adaptive control strategy based on the trained data.

Tests prove the optimality and availability of the proposed energy management strategy. In addition, the advantages in control performance and energy efficiency imply that the proposed adaptive control can be applied in real situations.

The proposed combination of the bidirectional LSTM network and RL is, in fact, a simplified specification of the so-called parallel learning [35], which aims to build a more general framework for data-driven intelligent control. Future work will focus on applying the parallel learning and PRL frameworks to different research fields of automated vehicles, such as driving style recognition [36], braking intensity estimation [37], [38], and lane changing intention prediction [39], [40]. The parallel system could generate abundant driving data and easily evaluate the performance of different controllers.