
Application of the asynchronous advantage actor-critic machine learning algorithm to real-time accelerator tuning

2019-10-18

Nuclear Science and Techniques, No. 10, 2019

Yun Zou · Qing-Zi Xing · Bai-Chuan Wang · Shu-Xin Zheng · Cheng Cheng · Zhong-Ming Wang · Xue-Wu Wang

Abstract This paper describes a real-time beam tuning method with an improved asynchronous advantage actor-critic (A3C) algorithm for accelerator systems. The operating parameters of devices are usually inconsistent with the predictions of physical designs because of errors in mechanical matching and installation. Therefore, parameter optimization methods such as pointwise scanning, evolutionary algorithms (EAs), and robust conjugate direction search are widely used in beam tuning to compensate for this inconsistency. However, it is difficult for them to deal with a large number of discrete local optima. The A3C algorithm, which has been applied in the automated control field, provides an approach for improving multi-dimensional optimization. The A3C algorithm is introduced and improved for the real-time beam tuning code for accelerators. Experiments in which optimization is achieved by pointwise scanning, the genetic algorithm (one kind of EA), and the A3C algorithm were conducted and compared to optimize the currents of four steering magnets and two solenoids in the low-energy beam transport (LEBT) section of the Xi'an Proton Application Facility. Optimal currents are determined when the highest transmission of a radio frequency quadrupole (RFQ) accelerator downstream of the LEBT is achieved. The optimal work point of the tuned accelerator was obtained with currents of 0 A, 0 A, 0 A, and 0.1 A for the four steering magnets, and 107 A and 96 A for the two solenoids. Furthermore, the highest transmission of the RFQ was 91.2%. Meanwhile, the lower time required for optimization with the A3C algorithm was successfully verified: it consumed 42% and 78% less time than pointwise scanning with random initialization and pre-trained initialization of the weights, respectively.

Keywords Real-time beam tuning · Parameter optimization · Asynchronous advantage actor-critic algorithm · Low-energy beam transport

1 Introduction

The low-energy beam transport (LEBT) section is commonly applied in proton and heavy ion accelerators to transport and match a beam to the entrance of a radio frequency quadrupole (RFQ) accelerator [1, 2]. The magnetostatic focusing LEBT, which has been widely explored, has the following advantages: the space charge effect can be neutralized by the residual gas, and emittance growth can be decreased [3]. The magnetostatic focusing LEBT usually consists of several steering magnets and solenoids. The beam envelope and trajectory can be adjusted by changing the currents of their power supplies [4, 5]. However, the operational currents differ from the predictions of physical designs because of errors in mechanical matching and installation. Therefore, parameters should be optimized in beam tuning to compensate for this inconsistency.

Methods of parameter optimization mainly comprise pointwise scanning and algorithm optimization. In pointwise scanning, after every point is scanned, a global optimal work point can be detected. However, the scanning process may take a long time if multi-dimensional parameters are involved, and an even longer time in the case of re-tuning and re-scanning. Thus, algorithm optimization is proposed to reduce the cost. This optimization is achieved by evolutionary algorithms (EAs) [6], the robust conjugate direction search (RCDS) algorithm [7], machine learning algorithms [8], etc.

EAs are a set of population-based, fitness-oriented, and variation-driven algorithms inspired by Darwinian natural selection [9]. In particular, EAs start with a group of solutions that evolve through variation and other policies until the fittest solution is obtained. Two typical EAs are the genetic algorithm (GA) and particle swarm optimization (PSO). The GA encodes parameters into binary codes that cross over and mutate, as genes do, to generate the next generation; the fittest codes in the new generation are retained and evolve again. PSO starts with multiple solutions; the fittest solution is then searched for along the optimization direction of the local historically best and global best solutions [6]. RCDS is an improved conjugate direction search algorithm that combines Powell's method with a robust line optimizer and can realize optimization on noisy data [10]. In reinforcement learning (RL), a machine learning approach, a neural network is adopted as an agent that reacts to the real world, and a series of actions is chosen to obtain the highest reward [8].

The above algorithms have been widely applied in the parameter optimization of accelerators. Bartolini optimized the beam dynamics in free electron lasers through the GA [11], and the set point of the LANSCE linac was found by Pang through PSO [12]. RCDS has been adopted by Huang for the online optimization of accelerators at SLAC [10]. Machine learning has been applied in low-level radio frequency tuning for FNAL by Steimel [13] and in radio frequency design by Shin [14].

However, owing to time constraints and instabilities in the above algorithms, the global optimum is difficult to obtain within a short period of real-time beam tuning. To realize fast real-time tuning for accelerators, the asynchronous advantage actor-critic (A3C) algorithm is introduced and improved in this paper. The A3C algorithm is an RL algorithm that uses an asynchronous parallel actor-critic approach to speed up the training process and obtain a converged solution [15]. It has been adopted in the automated control field [16].

To verify the improved A3C algorithm, we conducted three beam tuning experiments at the Xi’an Proton Application Facility (XiPAF) with parameter optimization by pointwise scanning, the GA, and the improved A3C algorithm.

This paper is organized as follows: Section 2 presents RL and the A3C algorithm. Section 3 describes the improvements made to the A3C algorithm and its application in LEBT beam tuning. Section 4 discusses the experiments and results. Finally, the conclusions are summarized in Sect. 5.

2 Reinforcement Learning and A3C algorithm

2.1 Reinforcement Learning

RL aims to create an optimal action policy that can be used to choose actions based on the current environment and to maximize the long-term sum of rewards, which are evaluations of the environment [17]. Figure 1 represents the algorithm structure of RL. RL acquires the current state from the environment, which can be the beam position, beam current, etc. The algorithm also needs rewards, which are calculated as feedback from the environment. Then, the action policy is updated to maximize the reward, and an action is chosen to change the environment.
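The state–action–reward loop described above can be sketched as follows. The `ToyEnv` environment, its reward, and the random policy are illustrative placeholders, not the beam-line environment used in this paper.

```python
import random

class ToyEnv:
    """Illustrative stand-in for the accelerator environment."""
    def __init__(self):
        self.state = 0.0  # e.g., a beam-related reading

    def step(self, action):
        # The action nudges the state; reward is highest near state 1.0.
        self.state += action
        reward = -abs(self.state - 1.0)
        return self.state, reward

def run_episode(env, policy, n_steps=20):
    total_reward = 0.0
    state = env.state
    for _ in range(n_steps):
        action = policy(state)            # policy maps state -> action
        state, reward = env.step(action)  # environment returns new state and reward
        total_reward += reward
    return total_reward

random.seed(0)
policy = lambda s: random.choice([-0.1, 0.0, 0.1])
print(run_episode(ToyEnv(), policy))
```

An RL algorithm then adjusts the policy between episodes so that the total reward grows; the sections below describe how the A3C algorithm performs this adjustment.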

2.2 A3C algorithm

Fig. 1 Structure of reinforcement learning

The actor-critic (AC) method contains two networks, as shown in Fig. 2. One is the action policy network (the actor), which chooses an action to change the environment; the other is the value function network (the critic), which predicts the reward obtained when the environment is changed by the action chosen by the actor.

Fig. 2 Structure of the actor-critic algorithm

An improvement on the AC algorithm, the A3C algorithm is an asynchronous AC algorithm with an advantage function. Asynchronous means that there are multiple ACs in the algorithm, as shown in Fig. 3. Each AC is independent: it downloads the network parameters and runs an episode without interacting with the other ACs, and a new episode does not start for an AC until its current episode is completed. Then, the headquarters (a database that stores the parameters of the AC network) merges all updates from each AC to calculate the total update, which it then applies to its network parameters. Multiple independent ACs running independent episodes reduce the correlation between selected actions, so that A3C can avoid being trapped in local optima. The advantage function is used as a normalized value function, which reduces the bias of the prediction after each step and speeds up the convergence process. The pseudocode of the A3C algorithm is shown below:

Fig. 3 Structure of the A3C algorithm

Algorithm 1: A3C (procedure for each AC)
1. The headquarters initializes the global actor parameters θ and critic parameters ω; the global episode counter T is set to 0.
2. Repeat asynchronously in each AC:
3.   Download the current parameters θ and ω from the headquarters.
4.   Run one episode: for each step t, choose action at according to πθ(st, ·), apply it to the environment, and observe the reward R(st, at) and the next state st+1.
5.   Accumulate the parameter updates dθ and dω over the episode according to the update functions below.
6.   Upload dθ and dω; the headquarters merges them into θ and ω, and T ← T + 1.
7. Until T reaches the maximum number of episodes.
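The asynchronous download–update–upload cycle between the ACs and the headquarters can be sketched with threads sharing a parameter store. The two-element parameter vector and the fixed per-episode update are placeholders; a real AC would compute the update by running an episode.

```python
import threading

# Shared "headquarters": global parameter vector and a lock for merging updates.
global_params = [0.0, 0.0]
lock = threading.Lock()

def worker(worker_id, n_episodes=5):
    for _ in range(n_episodes):
        # Download the current global parameters (independent local copy).
        with lock:
            local = list(global_params)
        # Run one episode locally and compute an update (placeholder: +0.1 each).
        update = [0.1 for _ in local]
        # Upload: the headquarters merges this worker's update into the network.
        with lock:
            for i, u in enumerate(update):
                global_params[i] += u

threads = [threading.Thread(target=worker, args=(k,)) for k in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(global_params)  # both parameters ≈ 2.0 (4 workers x 5 episodes x 0.1)
```

Because each worker episode runs independently between the locked download and upload, the action sequences of the workers are decorrelated, which is the property the A3C relies on to avoid local optima.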

The update functions of the weights for the actor and critic can be described as follows [15]:

θ ← θ + α [R(st, at) + γ qω(st+1, at+1) − qω(st, at)] ∇θ ln πθ(st, at),
ω ← ω + β [R(st, at) + γ qω(st+1, at+1) − qω(st, at)] ∇ω qω(st, at),   (1)

where s is the environment state, a is the action, ω represents the parameters of the critic network, θ represents the parameters of the actor network, t is the current step, qω(st, at) is the current step's reward predicted by the critic, πθ(st, at) is the action probability from the actor when the action is at and the state is st, α and β are the learning rates of the actor and critic, γ is the discount factor, R(st, at) is the real reward of the current step calculated from the environment when the state is st and the action is at, and qω(st+1, at+1) is the reward of the next step predicted by the critic when the action is at+1 and the state is st+1.
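A toy scalar version of these update rules can be sketched as follows. The tabular critic and the scalar action preference stand in for the real networks, and all numerical values are illustrative.

```python
# Toy scalar actor-critic update mirroring the update rules above.
# alpha, beta: learning rates; gamma: discount factor (values are illustrative).
alpha, beta, gamma = 0.1, 0.1, 0.9

q = {("s0", "a0"): 0.0, ("s1", "a1"): 0.5}   # critic: q_omega(s, a)
h = {("s0", "a0"): 0.0}                      # actor: preference for (s, a)

def td_error(s, a, reward, s_next, a_next):
    # delta_t = R(s_t, a_t) + gamma * q(s_{t+1}, a_{t+1}) - q(s_t, a_t)
    return reward + gamma * q[(s_next, a_next)] - q[(s, a)]

delta = td_error("s0", "a0", reward=1.0, s_next="s1", a_next="a1")
q[("s0", "a0")] += beta * delta    # critic moves toward the observed return
h[("s0", "a0")] += alpha * delta   # actor raises the preference of a rewarded action
print(delta, q[("s0", "a0")], h[("s0", "a0")])  # delta = 1 + 0.9*0.5 - 0 = 1.45
```

The common factor in both updates is the temporal-difference error: when the observed reward exceeds the critic's prediction, the critic's estimate rises and the actor's preference for the chosen action rises with it.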

The update function for the actor increases the probability of a "good" action, which is expected to acquire a high reward, and decreases the probability of a "bad" action. The degree of the update depends on the reward and the probability of the action: if a low-probability action achieves a high reward, the actor drastically increases its probability. The critic update decreases the difference between the real reward from the environment and the reward predicted by the critic [18].

3 Application of A3C to real-time beam tuning

3.1 XiPAF LEBT

The LEBT of XiPAF contains two pairs of steering magnets and two solenoids [19], as shown in Fig. 4. Figure 5 is an illustration of the LEBT of XiPAF.

In this experiment, the current of the steering magnets ranges from −0.4 to 0.4 A with an increment of 0.1 A. The current of the solenoids ranges from 90 to 110 A with an increment of 1 A. Beam tuning achieves the highest RFQ transmission by adjusting the currents of the steering magnets and solenoids.
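With the ranges and increments above, the size of a full joint scan grid can be computed directly, illustrating why exhaustively scanning all six parameters together is impractical:

```python
# Size of the full pointwise scan grid for the XiPAF LEBT ranges quoted above.
def n_points(lo, hi, step):
    # Number of settings from lo to hi inclusive with the given increment.
    return round((hi - lo) / step) + 1

steering = n_points(-0.4, 0.4, 0.1)   # 9 settings per steering magnet
solenoid = n_points(90, 110, 1)       # 21 settings per solenoid
full_grid = steering**4 * solenoid**2
print(steering, solenoid, full_grid)  # 9 21 2893401
```

Nearly three million joint settings rule out a brute-force scan, which motivates the alternating scans and algorithmic optimization discussed in this paper.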

Fig. 4 Layout of XiPAF LEBT

Fig. 5 LEBT of the linac injector of XiPAF

3.2 Improvements made to the A3C algorithm

Real-time beam tuning with the A3C algorithm faces several challenges. First, existing local optima hinder the pursuit of the global optimum. Second, although the A3C algorithm can converge more rapidly than the AC algorithm [18], it is too time-consuming for real-time beam tuning. Therefore, we improved the A3C algorithm in the following three ways:

(A) To prevent the optimization from being trapped in local optima, an exploration item was added to the actor update in Eq. 1:

θ ← θ + α {[R(st, at) + γ qω(st+1, at+1) − qω(st, at)] ∇θ ln πθ(st, at) + ε ∇θ H(st)},
H(st) = −Σa p(st, a) ln p(st, a),   (2)

where ε is the discount factor of the exploration item, and p(st, at) is the probability of the current action at when the current state is st.

The exploration item is the entropy [20] of the action at when the state is st, which increases the probability of the unchosen actions. Furthermore, random start points are set in the algorithm, thereby allowing the algorithm to explore more points in the scanning range.
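The entropy term rewards uncertainty in the policy, which is why maximizing it pushes probability toward unchosen actions. A minimal sketch with two illustrative four-action policies:

```python
import math

def entropy(probs):
    # H = -sum_a p(a) * ln p(a); larger when the policy is less certain.
    return -sum(p * math.log(p) for p in probs if p > 0)

peaked = [0.97, 0.01, 0.01, 0.01]   # nearly deterministic policy
uniform = [0.25, 0.25, 0.25, 0.25]  # maximally exploratory policy
print(entropy(peaked) < entropy(uniform))  # True
```

Adding ε·H to the actor objective therefore penalizes a policy that collapses onto one action too early, which is exactly the behavior that traps the tuning in a local optimum.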

Figure 6 shows that, without the exploration term, the A3C algorithm converges at an RFQ transmission of around 60%, reached when the currents of the four steering magnets are 0.2 A, 0.1 A, 0.1 A, and 0 A. This shows that, without the exploration term, the A3C becomes stuck in a local optimum, and that introducing the exploration term avoids this situation.

In Fig. 6, after convergence, the RFQ transmission fluctuates by approximately 20% owing to two reasons: (1) the instability of the accelerator system, including the beam current fluctuation of 10% from the ion source and reading errors from the ACCT and Faraday cup, and (2) the A3C algorithm itself, which may explore work points with low RFQ transmission to avoid being trapped in local optima.

(B) To speed up the convergence process, a factor was introduced into the learning rate β of the update rule in Eq. 1, as Eq. 3 shows:

β′ = (Nt / Ns) β,   (3)

where Nt is the current number of steps in one episode and Ns is the total number of steps in one episode.

After the modification, the learning rate is low at the beginning of an episode and increases as the optimization proceeds. The convergence process is then shorter than with the original update rule, in which updates are weighted equally for each scan step. The first scan steps search in random directions, which would cause fluctuations in the convergence if all updates were weighted equally. Therefore, with the modified learning rate, the convergence process is shorter.
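One plausible realization of this schedule scales the base rate by the fraction of the episode completed; the exact factor is given by Eq. 3, and this sketch only reproduces the qualitative behavior described above (low early, full rate late).

```python
# Step-dependent learning rate: scale the base rate beta0 by Nt / Ns,
# where Nt is the current step and Ns the total steps in the episode.
def scheduled_beta(beta0, n_t, n_s):
    return beta0 * n_t / n_s

# Rate at the start, middle, and end of a 20-step episode.
rates = [scheduled_beta(0.1, n_t, 20) for n_t in (1, 10, 20)]
print(rates)
```

Early random-direction steps thus contribute only small updates, while the later, better-informed steps dominate the learning.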

Fig. 6 RFQ transmission versus episodes for A3C with/without the exploration term

Figure 7 shows that the A3C algorithm without the modified learning rate converges after 600 episodes, when the currents of the steering magnets are 0 A, 0.1 A, 0 A, and 0 A and the RFQ transmission is 87%. After modifying the learning rate, A3C obtains the same optimal steering currents of 0 A, 0.1 A, 0 A, and 0 A and an RFQ transmission of 87% within 200 episodes.

Fig. 7 RFQ transmission versus episodes for the A3C algorithm with/without the modified learning rate

(C) To further speed up the convergence process, the network structure was altered. In pointwise scanning, as jointly scanning multi-dimensional parameters is impractical, an alternating-scan method that divides the process into several parts is used. In each part, only a few parameters are scanned. The scanning process can alternate between parts because the parameters in different parts are less dependent on each other. For example, in the LEBT, the steering magnets and solenoids exhibit different effects on the beam, and their fields are not coupled. Thus, the currents of the steering magnets can be scanned first to determine their optimal values. Then, using the same method, the currents of the solenoids and the currents of the steering magnets can be scanned alternately until the highest RFQ transmission is acquired.

Learning from the alternating scanning, the A3C adopts the alternating network structure shown in Fig. 8.

Fig. 8 Alternating network structure of the A3C algorithm (red lines depict the weights that are frozen in the present scanning process). (Color figure online)

In the A3C network, the alternating structure freezes the weights associated with the solenoid current parameters (red lines in Fig. 8) during the scanning of the steering magnets to locate their optima. Then, the weights associated with the currents of the steering magnets are frozen to obtain the optimal solenoid currents. In this way, the number of parameters trained in each scanning process is reduced, which expedites the convergence. Figure 9 shows that the A3C algorithm with the frozen structure converges more quickly than the one without it.
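Freezing can be sketched as skipping the gradient step for the inactive parameter group. The parameter values and unit gradients below are illustrative placeholders:

```python
# Sketch of alternating optimization with frozen parameter groups. A gradient
# step is applied only to the active group; the other group's weights stay
# fixed, mimicking the red (frozen) connections in Fig. 8.
params = {"steering": [0.2, -0.1, 0.0, 0.3], "solenoid": [0.5, 0.4]}

def update(params, grads, active, lr=0.1):
    for group, values in params.items():
        if group != active:
            continue  # frozen group: skip the update entirely
        for i, g in enumerate(grads[group]):
            values[i] -= lr * g

grads = {"steering": [1.0, 1.0, 1.0, 1.0], "solenoid": [1.0, 1.0]}
update(params, grads, active="steering")
print(params["steering"], params["solenoid"])  # only "steering" changed
```

Alternating the `active` group between "steering" and "solenoid" reproduces the alternating-scan schedule while halving the number of trainable parameters at any time.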

Fig. 9 RFQ transmission versus episodes for the A3C algorithm with/without frozen neurons

3.3 Design of the input state and output action

The input state, output action, and reward are specified in order to apply the A3C algorithm to real-time beam tuning of the accelerator. The A3C network consists of four layers: one input layer, two hidden layers, and one output layer. Each hidden layer contains 50 neurons. In the LEBT, the input layer consists of the steering magnet currents, the solenoid currents, the beam currents at the entrance and exit of the RFQ, and the total number of steps Ns in one episode. The output layer of the actor contains the actions that increase or decrease the currents of the power supplies. The output layer of the critic includes only one neuron, for the reward.
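A minimal forward pass through a network of this shape can be sketched as follows. The input size of 9 follows from the listed inputs (4 steering currents, 2 solenoid currents, 2 RFQ beam currents, and Ns); the 12 actor outputs (one increase and one decrease action per power supply), the tanh activations, and the softmax head are assumptions of this sketch, not details stated in the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

# Layer sizes: 9 inputs, two hidden layers of 50 neurons, 12 actor outputs
# (assumed: increase/decrease per power supply), 1 critic output.
n_in, n_hidden, n_actions = 9, 50, 12

def layer(n_from, n_to):
    return rng.normal(scale=0.1, size=(n_from, n_to)), np.zeros(n_to)

W1, b1 = layer(n_in, n_hidden)
W2, b2 = layer(n_hidden, n_hidden)
Wa, ba = layer(n_hidden, n_actions)  # actor head
Wc, bc = layer(n_hidden, 1)          # critic head

def forward(x):
    h = np.tanh(x @ W1 + b1)
    h = np.tanh(h @ W2 + b2)
    logits = h @ Wa + ba
    policy = np.exp(logits) / np.exp(logits).sum()  # softmax over actions
    value = (h @ Wc + bc)[0]                        # predicted reward
    return policy, value

policy, value = forward(np.zeros(n_in))
print(policy.shape)  # (12,): one probability per increase/decrease action
```

The actor samples one of the 12 actions from `policy`, while `value` is the critic's reward prediction used in the update rules of Sect. 2.2.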

3.4 Reward design

Zero RFQ transmission has been observed for around 90% of the points in the current scanning of the steering magnets and solenoids, and the variation in RFQ transmission after an action is not continuous. The intent of our scan is to find an optimal point when an episode is over, which means that the final step is the most important in the scanning process. For these reasons, the reward is expressed in the following form, as in Eq. 4:

R = ΔE/|ΔE| + (E − 0.5),           t < Ns,
R = 10 [ΔE/|ΔE| + (E − 0.5)],      t = Ns,   (4)

where R is the reward, and E is the RFQ transmission.

The first item of Eq. 4 includes two parts. ΔE/|ΔE| represents the trend of the derivative: if the transmission increases, this part is 1; otherwise, it is −1. (E − 0.5) is the penalty for a low-RFQ-transmission point, which means that if the RFQ transmission is lower than 50%, this part is negative. The reward in the final step is 10 times larger than those in the other steps, a factor determined experimentally to obtain fast convergence in our case.
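The reward described above can be sketched as follows; the functional form is reconstructed from the textual description (trend term plus low-transmission penalty, with the final step weighted 10 times), and the zero-change case is handled as a neutral trend by assumption.

```python
# Sketch of the reward: sign of the transmission change, plus a penalty
# for transmission below 50%, with the final step weighted 10x.
def reward(e_new, e_old, is_final_step):
    delta = e_new - e_old
    trend = 0.0 if delta == 0 else delta / abs(delta)  # +1 improving, -1 worsening
    r = trend + (e_new - 0.5)
    return 10 * r if is_final_step else r

print(reward(0.6, 0.4, False))  # improving and above 50%: 1 + (0.6 - 0.5)
print(reward(0.3, 0.4, True))   # final step, worsening and low: 10 * (-1 + (0.3 - 0.5))
```

The discrete ±1 trend term keeps the reward informative even though most scanned points yield zero transmission, and the ×10 final-step weight concentrates the learning signal on the point where the episode ends.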

4 Experiment and results

4.1 Experimental method

In pointwise scanning, the alternating scanning process mainly includes three steps: (1) scan the currents of the four steering magnets; (2) scan the currents of the two solenoids; and (3) re-scan the currents of the four steering magnets.
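Each of these steps is an exhaustive scan over a small group of parameters. A minimal sketch of one such scan, with a toy quadratic "transmission" peaked at illustrative currents standing in for the real RFQ measurement:

```python
from itertools import product

def scan(settings_axes, measure):
    # Exhaustive scan over the cartesian product of the given axes; returns
    # the setting with the highest measured transmission.
    return max(product(*settings_axes), key=measure)

# Toy "transmission" peaked at currents (0.0, 0.1) -- a stand-in for the
# real RFQ transmission measurement at each work point.
measure = lambda c: -((c[0] - 0.0) ** 2 + (c[1] - 0.1) ** 2)
axis = [round(-0.4 + 0.1 * i, 1) for i in range(9)]  # -0.4 ... 0.4 in 0.1 A steps
print(scan([axis, axis], measure))  # (0.0, 0.1)
```

In the real procedure, steps (1)–(3) alternate between the steering-magnet group and the solenoid group, re-scanning because the optima of one group shift when the other group's currents change.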

In GA optimization, we used a population of 100 parameter sets and ran the algorithm for 50 generations to acquire the optimized steering magnet currents. In A3C-algorithm optimization, the improved A3C algorithm was used to search for the optimal currents of the steering magnets and solenoids. The AC algorithm used within the A3C algorithm for real-time beam tuning of the XiPAF LEBT is shown in Fig. 10. The scanning process is presented in Fig. 11.
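A minimal GA with the population size and generation count quoted above can be sketched as follows. The real-valued encoding, fitness function, selection, crossover, and mutation here are illustrative placeholders, not the paper's implementation (which encodes parameters as binary codes):

```python
import random

random.seed(1)

def ga(fitness, n_pop=100, n_gen=50, n_genes=4):
    # Random initial population of candidate steering-current sets.
    pop = [[random.uniform(-0.4, 0.4) for _ in range(n_genes)] for _ in range(n_pop)]
    for _ in range(n_gen):
        pop.sort(key=fitness, reverse=True)
        parents = pop[: n_pop // 2]               # selection: keep the fittest half
        children = []
        for _ in range(n_pop - len(parents)):
            a, b = random.sample(parents, 2)
            cut = random.randrange(1, n_genes)    # one-point crossover
            child = a[:cut] + b[cut:]
            i = random.randrange(n_genes)         # mutation of one gene
            child[i] += random.gauss(0, 0.05)
            children.append(child)
        pop = parents + children
    return max(pop, key=fitness)

# Toy fitness peaked at all-zero currents, standing in for RFQ transmission.
best = ga(lambda c: -sum(x * x for x in c))
print(all(abs(x) < 0.2 for x in best))  # converges near the toy optimum at 0
```

On this smooth toy fitness the GA converges easily; the experimental results below show that on the real, highly multimodal transmission landscape it is instead trapped in local optima.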

4.2 Experimental results and discussions

4.2.1 Pointwise scanning

Fig.10 Diagram of the AC algorithm in real-time beam tuning of the LEBT

Figure 12 presents the RFQ transmission after pointwise scanning of the currents of the four steering magnets is performed for the first time (the currents of the first and second solenoids are fixed at 110 A and 90 A, respectively). In each cube, the 3D scanning results of the first three steering magnets are projected onto the three coordinate planes. On each coordinate plane, the point values are the maximum RFQ transmissions in the corresponding projection direction. The highest RFQ transmission is 87%, obtained when the currents of the steering magnets are 0 A, 0.1 A, 0 A, and 0 A. The first and third steering magnets are positively correlated, which means that increasing the currents of these two steering magnets has the same effect on the RFQ transmission. The currents of the second and fourth steering magnets are negatively correlated.

Fig. 11 Flowchart of the real-time beam tuning of the LEBT with A3C

Fig. 12 RFQ transmission after the first pointwise scanning of the currents of the four steering magnets (the currents of the first and second solenoids are fixed at 110 A and 90 A, respectively)

The scanning result of the solenoids in Fig. 13 reveals that 89% is the highest RFQ transmission that can be achieved, and the solenoid currents are 107 A and 96 A.

Figure 14 shows the results of the re-scanning of the currents of the four steering magnets, with the solenoid currents fixed at 107 A and 96 A. The highest transmission is 90.6%, obtained when the steering magnet currents are 0 A, 0 A, 0 A, and 0.1 A. The currents of the second and fourth steering magnets are still negatively correlated but with an offset that was not observed after the first scan, and the first and third steering magnets are no longer positively correlated. These results confirm that the solenoid currents influence the optimal points found when scanning the steering magnets. Therefore, alternating scanning is necessary in pointwise scanning.

Fig. 13 RFQ transmission by the pointwise scanning of solenoid currents

Fig. 14 RFQ transmission after the second pointwise scanning of the currents of the four steering magnets (the currents of the first and second solenoids are fixed at 107 A and 96 A, respectively)

From these results, we can see that there are many local optima in the beam tuning of the LEBT. Furthermore, the relationships among the four steering magnets can be altered when the currents of the solenoids change.

4.2.2 Optimization with the genetic algorithm

Figure 15 shows the results when optimization was conducted with the GA six times. Each time, the GA converged at a different RFQ transmission with a different set of steering magnet currents. The results reveal that the final solutions are trapped in local optima. Premature convergence and instability are the major weaknesses of the GA [21], which means its exploratory ability is limited and its results are unstable. Therefore, the GA cannot converge at the global optimum owing to the presence of too many discrete local optima, as shown in the results of pointwise scanning. This means that the GA is not suitable for the beam tuning of the LEBT.

Fig. 15 RFQ transmissions for six experiments obtained by the genetic algorithm

4.2.3 Optimization with the A3C algorithm

To verify the A3C algorithm in real-time beam tuning, two sets of experiments were performed. The first experiment involved optimization with random weights as the initialization, which verified that the A3C algorithm has a higher efficiency than pointwise scanning. The second comprised optimization initialized with pre-trained weights, that is, weights from the previous scanning process; this experiment verified the learnability of the algorithm.

Figure 16 presents the RFQ transmission with episodes with random weights as the initialization. The scanning result reveals that the real-time beam tuning code converges at 260, 150, and 240 episodes for the steering magnet scanning, solenoid scanning, and steering magnet re-scanning, respectively. The corresponding numbers of scans are approximately 3800, 2000, and 3000.

A maximum of 20 actions can be performed in each episode. (One action refers to one current scan, and each scan takes approximately 2 s, considering the RF repetition rate of 1 Hz and an additional 1 s for the stabilization of the power supplies.) Meanwhile, the average number of actions is approximately 10 per episode, because the A3C algorithm can stop early when the global optimum is obtained.

The optimal currents are 0 A, 0 A, 0 A, and 0.1 A for the steering magnets, and 107 A and 96 A for the solenoids. These optimal currents are the same as those found by pointwise scanning. Furthermore, the performance of the tuned accelerator is the same, with the highest transmission of the RFQ being 91.2%. The RFQ transmission fluctuates because of the exploration factor and the current instability (±5%) of the ion source. Optimization with the A3C algorithm in the first experiment consumed 42% less time than pointwise scanning.

Fig. 16 RFQ transmission with episodes with random weights as the initialization. a Current scanning of the steering magnets for the first time; b current scanning of the solenoids; and c current scanning of the steering magnets for the second time

Fig. 17 RFQ transmission with episodes with the pre-trained initialization of the weights (red line). For comparison, Fig. 16a is drawn as a blue line. (Color figure online)

In the second experiment, the current scanning of the steering magnets and solenoids and the current re-scanning of the steering magnets were performed successively. The current re-scanning of the steering magnets was conducted with the weights from the previous scanning of the steering magnets. Figure 17 presents the RFQ transmission with episodes as a red line. The RFQ transmission converges at 120 episodes (corresponding to 1500 current scans). With the pre-trained initialization of the weights, optimization with the A3C algorithm converges faster than with the random initialization and consumes 78% less time than pointwise scanning. The optimal currents are the same as those in the first experiment. The RFQ transmissions achieved in the two experiments differ slightly, but the steering magnet currents are actually the same when convergence occurs. The difference is caused by the fact that the accelerator status is not very stable; it includes a 10% fluctuation of the ion source and instability in reading the ACCT.

5 Conclusion

Pointwise multi-dimensional parameter scanning can be time-consuming in the real-time beam tuning of an accelerator: for the current optimization of four steering magnets and two solenoids, the total scanning process lasts approximately 3.6 h. To speed up the multi-dimensional optimization, the A3C algorithm was introduced and improved for a real-time beam tuning code. To prevent the A3C from being trapped in a local optimum, we add an exploration item to the update functions of the weights. To speed up the convergence process, variations in the learning rate and a network structure matching the alternating-scan method are proposed.

The improved A3C algorithm was verified through two experiments in which it was applied to beam tuning in the LEBT of XiPAF. The currents of the four steering magnets and two solenoids in the LEBT were optimized to achieve the highest transmission of an RFQ accelerator downstream of the LEBT. By pointwise scanning, with optimal currents of 0 A, 0 A, 0 A, and 0.1 A for the steering magnets, and 107 A and 96 A for the solenoids, the highest transmission of the RFQ obtained was 91.2%. The RFQ transmissions optimized by the GA varied from 66.1% to 82.6%, which means that the GA cannot find the global optimum. Optimization with the A3C algorithm converges faster with the pre-trained initialization than with the random initialization and consumes 78% less time than pointwise scanning. The optimized currents of the steering magnets and solenoids are the same as those of pointwise scanning.

The results show that the improved A3C algorithm is time-effective even when the scanning is performed with random weights. The time can be further reduced if weights from a previous scan are adopted for initialization. The improved A3C algorithm can provide an efficient approach for the real-time beam tuning of accelerators. Furthermore, it is expected to be applicable to other automated control systems with similar requirements.