
Maneuvering target tracking of UAV based on MN-DDPG and transfer learning

2021-03-23

Defence Technology, 2021, Issue 2

Bo Li, Zhi-peng Yang, Da-qing Chen, Shi-yang Ling, Hao Ma

a School of Electronics and Information, Northwestern Polytechnical University, Xi'an 710072, China

b School of Engineering, London South Bank University, London, SE1 0AA, UK

c AVIC Xi'an Aeronautics Computing Technique Research Institute, Xi'an 710068, China

Keywords: UAVs; Maneuvering target tracking; Deep reinforcement learning; MN-DDPG; Mixed noises; Transfer learning

ABSTRACT Tracking a maneuvering target autonomously, accurately, and in real time in an uncertain environment is one of the challenging missions for unmanned aerial vehicles (UAVs). In this paper, aiming to address the control problem of maneuvering target tracking and obstacle avoidance, an online path planning approach for UAVs is developed based on deep reinforcement learning. Through end-to-end learning powered by neural networks, the proposed approach can achieve perception of the environment and continuous motion output control. The proposed approach includes: (1) a deep deterministic policy gradient (DDPG)-based control framework to provide learning and autonomous decision-making capability for UAVs; (2) an improved method named MN-DDPG, which introduces a type of mixed noise to assist the UAV in exploring stochastic strategies for online optimal planning; and (3) an algorithm of task decomposition and pre-training for efficient transfer learning to improve the generalization capability of the UAV's control model built on MN-DDPG. The experimental simulation results have verified that the proposed approach can achieve good self-adaptive adjustment of the UAV's flight attitude in maneuvering target tracking tasks, with a significant improvement in the generalization capability and training efficiency of the UAV tracking controller in uncertain environments.

1.Introduction

Maneuvering target tracking technologies are widely employed in various fields, such as surveillance and early warning, search and rescue, and high-altitude photography [1-3]. There are two major approaches to improving the accuracy of maneuvering target tracking: (1) implementing a filtering algorithm to estimate and predict the status of the maneuvering target by using data fusion techniques [4]; and (2) designing an online path planning method to realize continuous and stable tracking of moving targets by controlling the movement of perception platforms such as unmanned aerial vehicles (UAVs) [5-7]. In this paper, we focus on developing an effective path planning algorithm to deal with UAV tracking control of a maneuvering target with sustainable real-time decision-making.

Traditional path planning algorithms, such as A* [8], the ant colony algorithm [9], and the particle swarm optimization algorithm [10], have been implemented to obtain an optimal path based on environment modeling and performance evaluation. However, if a target moves or the external environment changes, these methods need to re-model the environment and plan a new effective path accordingly. This process consumes a lot of computational resources and can significantly reduce the effectiveness of real-time target tracking [11]. Therefore, it is meaningful to provide autonomous learning and adaptive capabilities for UAV systems, so that UAVs can detect changes in the environment dynamically and make tracking decisions autonomously in real time.

By combining the perceptual abilities of Deep Learning [12] with the decision-making capabilities of Reinforcement Learning [13], Deep Reinforcement Learning (DRL) [14] can be used to generate adaptive strategies online through an interactive process between an agent and an environment, and it has performed well in the field of UAV motion planning [15-19]. Liu et al. [15] transformed the UAV route programming problem into a partially observable Markov Decision Process (MDP) and utilized the Deep Q Network (DQN) [20] algorithm combined with a greedy strategy to support UAV trajectory planning and optimization in a large-scale complex environment. By using the value-based Dueling Double DQN (DDQN) [21] with multi-step learning, Zeng [19] introduced a high-precision navigation technique for UAVs, which improved the effectiveness of target tracking. Although these methods have been successfully applied in UAV target tracking tasks, they oversimplify the UAV flight environment and divide the continuous action space into a limited set of action intervals before constructing the decision-making model, which can significantly affect the UAV attitude stability and the tracking accuracy.

To address these problems and to achieve continuous action control of UAVs, researchers have explored other DRL algorithms based on policy gradients [22]. One such efficient and feasible algorithm is Deep Deterministic Policy Gradient (DDPG), proposed by Lillicrap [23] by incorporating DQN into the Actor-Critic [24] framework, which can map continuous observations directly to continuous actions. The DDPG algorithm has gained increasing popularity in intelligent continuous control of UAVs [25-29]. For instance, Yang [26] utilized the DDPG algorithm to assist UAVs in solving air combat situational awareness and dynamic planning decision-making problems. Ramos [27] used the DDPG algorithm to train a UAV to complete an automatic landing task in a real dynamic environment and achieved good results. However, despite some progress, existing DDPG-based UAV action control methods are still far from satisfactory; their solutions tend to fall into local optima because of the generation of deterministic strategies. In addition, they are limited to single, pre-defined tasks, and can hardly be generalized to a new task where a target moves with random trajectories, or to a more complex situation that may contain unknown obstacle threats.

To address these limitations, an improved DRL algorithm is proposed in this paper. This new approach provides UAVs with an accurate tracking control strategy for maneuvering targets in an uncertain environment. The main contributions of this research are as follows:

(1) A dedicated UAV decision-making framework for target tracking is established based on the DDPG algorithm, in which a tracking controller of the UAV can be optimized according to the flight direction and velocity of the UAV to adjust the UAV flight attitude. This framework can be used by the UAV to learn autonomous real-time maneuvering target tracking and obstacle avoidance.

(2) An improved MN-DDPG algorithm is proposed, in which mixed noises composed of Gaussian noise and Ornstein-Uhlenbeck (OU) noise are introduced into the UAV training process to guide the UAV in strategy exploration and help the network escape from local optima.

(3) An optimized approach based on transfer learning [30] is introduced to improve the generalization ability of the UAV control model, which helps the UAV quickly learn to track randomly moving targets after becoming proficient in tracking a target with a fixed trajectory. In particular, we apply transfer learning twice for double pre-training, which better prepares the UAV for the final task. Experimental results show that these optimizations improve the efficiency and stability of the deep reinforcement learning training process, and the approach can be effectively adapted to training UAVs for precise flight actions and target tracking with obstacle avoidance.

The remainder of this paper is organized as follows: Section 2 elaborates the background of the maneuvering target tracking task for UAVs and the related theoretical methods. Section 3 introduces the core approach, i.e., an improved MN-DDPG algorithm combined with a parameter-based transfer learning approach, where the crucial improvements are mixed-noise strategy exploration and parameter transfer. The performance, efficiency, and adaptability of the proposed technique are presented through the experiments in Section 4. Finally, Section 5 concludes the paper and envisages future work on UAV intelligent autonomous decision systems.

2.Background

In this Section, an introduction is provided to the essential concepts of maneuvering target tracking and the motion and observation models of the UAV. Furthermore, the relevant theoretical background of the DRL-based DDPG algorithm and transfer learning is given.

2.1.Mathematical model of maneuvering target tracking for UAV

Maneuvering target tracking refers to observing a moving target through sensors, using an airborne computer to process the sensor signals, perceiving the environment in which the target is located, and making tracking decisions accordingly [31]. In this paper, the maneuvering target tracking problem under consideration involves a UAV, a group of sensors, and a ground maneuvering target. Therefore, it is essential to formulate the relevant UAV kinematic model and observation model.

2.1.1.Kinematic model of UAV

Many autopilots are equipped with advanced facilities such as high-precision dynamic three-dimensional information processing, AHRS, and GPS-INS inertial navigation aids [32], so there is no need for the user to specify wing control input for the UAV. For the sake of brevity and without loss of generality, in this research we suppose that the UAV flies at a fixed altitude with the help of the autopilot (i.e., the pitch angle is fixed and the flight height z(t) = H, where H is a positive constant). Hence, the continuous motion equation of the UAV with four degrees of freedom can be expressed as,
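A standard fixed-altitude kinematic form consistent with the variables described below (a sketch, not necessarily the paper's exact Equation (1)); the control inputs are the acceleration and yaw rate later used as the UAV action:

```latex
% Sketch of a four-degree-of-freedom, fixed-altitude UAV kinematic model:
% x, y: planar position; \varphi: yaw angle; v: speed;
% \dot{v}, \dot{\varphi}: commanded acceleration and yaw rate (assumed control inputs).
\begin{cases}
\dot{x}(t) = v(t)\cos\varphi(t) \\
\dot{y}(t) = v(t)\sin\varphi(t) \\
\dot{\varphi}(t) = \dot{\varphi}(t) \big|_{\text{command}} \\
\dot{v}(t) = \dot{v}(t) \big|_{\text{command}}
\end{cases}
```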

where x and y represent the two-dimensional coordinates of the UAV, φ signifies the yaw angle of the UAV, and v is the flight velocity of the UAV. Furthermore, the state update of the UAV over a time interval Δt at time t can be described as,
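A simple Euler-discretized update consistent with these kinematics (a sketch, not necessarily the paper's exact Equation (2)):

```latex
% Sketch of the discrete state update over a step \Delta t.
\begin{cases}
x_{t+1} = x_t + v_t \cos\varphi_t \,\Delta t \\
y_{t+1} = y_t + v_t \sin\varphi_t \,\Delta t \\
\varphi_{t+1} = \varphi_t + \dot{\varphi}_t \,\Delta t \\
v_{t+1} = v_t + \dot{v}_t \,\Delta t
\end{cases}
```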

Note that it is reasonable to assume the speed v of the UAV is within a certain range, i.e., v(t) ∈ [v_min, v_max].

2.1.2.Observation model of UAV

During the target tracking process, the UAV obtains the motion information of the target continuously in real time through radar sensors or other auxiliary systems, as shown in Fig. 1(a). Considering that the UAV flies at a fixed altitude, the tracking process can be simplified to two-dimensional coordinates as shown in Fig. 1(b).

Fig. 1. Illustration of a UAV's observations of a ground target in a tracking task.

In addition, a set of range sensors is employed to help the UAV detect possible obstacle threats ahead within the range L. As shown in Fig. 2, the blue area within the sensor range is considered the UAV threat detection area. At each moment, the UAV's observation of obstacles is defined as
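Based on the description that follows, the obstacle observation is presumably the vector of the seven range readings (a sketch):

```latex
% Presumed form of the obstacle observation built from the seven range sensors.
O_o = \left[\, d_1,\; d_2,\; d_3,\; d_4,\; d_5,\; d_6,\; d_7 \,\right]
```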

Fig. 2. UAV obstacle threat detection based on range sensors.

where d_1~d_7 denote the corresponding sensor readings; d_n = L when no obstacle is detected, otherwise d_n ∈ [0, L] represents the relative distance between the UAV and the obstacle threat.

2.2.Deep deterministic policy gradient based on deep reinforcement learning

2.2.1.Deep reinforcement learning

Deep reinforcement learning is a technique that trains an agent to interact with its environment and to generate reasonable coping behaviors in accordance with dynamic perception, based on the powerful fitting capability of neural networks. DRL uses a Markov Decision Process (MDP) to model the training process. At each time step of this process, the agent interacts with the environment and obtains the state s ∈ S at the current moment, and then selects a corresponding action a ∈ A. After performing the policy, the agent receives a reward r based on the reward function R and obtains a new state s′ ∈ S at the next stage. By continuing trial-and-error interactions with the environment, the agent can leverage previous experience to learn to take rational actions in the direction of high returns, optimize its behavior to adapt to varying circumstances, and ultimately excel in making strategies that meet the mission requirements.

2.2.2.DDPG algorithm

The DDPG algorithm is a mature DRL algorithm that can be used to tackle continuous action control problems. Built on the Actor-Critic framework, the algorithm uses the actor online network μ to output the action a_t = μ(s_t|θ^μ) according to the current agent state, and the critic online network Q to evaluate the value of this action, where θ^μ and θ^Q denote the parameters of the actor online network and the critic online network. In addition, an actor target network μ′ and a critic target network Q′ are constructed for the subsequent update process.


When updating the actor and critic networks, a mini-batch of N transitions [s_t, a_t, r_t, s_{t+1}] is sampled from the experience replay buffer M to calculate the loss function L for the critic network, which is given by,
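A standard form of this critic loss, consistent with the definitions that follow (a sketch rather than the paper's exact equation):

```latex
% Mean-squared temporal-difference loss for the critic over the sampled mini-batch.
L = \frac{1}{N}\sum_{i=1}^{N}\left( Y_i - Q(s_i, a_i \mid \theta^{Q}) \right)^{2}
```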

where Y_i = r_i + γQ′(s_{i+1}, μ′(s_{i+1}|θ^{μ′})|θ^{Q′}) is the target value, γ is the discount coefficient, i is the index of the sampled transition, and θ^{μ′} and θ^{Q′} represent the parameters of the actor target network and the critic target network, respectively. Meanwhile, the actor network is trained according to the policy gradient, which is expressed as,
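The corresponding sampled deterministic policy gradient, in its standard form (a sketch):

```latex
% Sampled deterministic policy gradient for the actor parameters.
\nabla_{\theta^{\mu}} J \approx \frac{1}{N}\sum_{i=1}^{N}
\nabla_{a} Q(s, a \mid \theta^{Q})\big|_{s=s_i,\, a=\mu(s_i)}\;
\nabla_{\theta^{\mu}} \mu(s \mid \theta^{\mu})\big|_{s=s_i}
```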

Finally, the two target network parameters θ^{μ′} and θ^{Q′} are updated by a soft update strategy:
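The standard soft (Polyak) update with rate τ (a sketch):

```latex
% Soft update of the target networks toward the online networks.
\theta^{Q'} \leftarrow \tau\,\theta^{Q} + (1-\tau)\,\theta^{Q'},
\qquad
\theta^{\mu'} \leftarrow \tau\,\theta^{\mu} + (1-\tau)\,\theta^{\mu'}
```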

where τ is a configurable constant coefficient used to regulate the soft update rate.

2.3.Transfer learning

Transfer learning (TL) is a machine learning method that transfers learned model parameters to a new model to assist its training [33]. Two basic concepts are involved in TL: (1) the source domain, which represents the object to be transferred from; and (2) the target domain, which represents the target to be endowed with knowledge. As shown in Fig. 3, through transfer learning, the learned model parameters or knowledge from the source-domain task can be shared with the new model in the target-domain task, which can speed up the training process of new tasks and optimize the learning efficiency [34].

Fig.3.Schematic of transfer learning.

There are some common implementation methods of TL, such as the instance-based approach, the feature-based approach, and the parameter-based approach [35]. The instance-based TL method weights different data samples according to their similarity and importance. However, this method needs to collect a large number of instance samples and calculate the similarity between these instance samples and the new learning samples, which consumes large amounts of memory and computational resources [36]. The feature-based TL approach needs to project the features of the source domain and the target domain into the same feature space and utilize machine learning methods to process the feature matrix [37]; this approach is mainly used for solving classification and recognition problems. The parameter-based TL approach applies the model trained in the source domain to the target domain and completes the new, similar task through a short retraining [38]. To build the UAV tracking decision-making model in this research, parameter transfer is a simple and effective way, as it can help the UAV learn similar strategies from a more reasonable initial network based on previously trained model parameters [39]. As a result, the tracking task is simplified into a set of simple sub-tasks: we can train the model to fulfill the sub-tasks and migrate the sub-task models to the final task through parameter-based transfer learning, which will be explained in detail in Section 3.
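As an illustration of parameter-based transfer (not the authors' implementation), the following minimal PyTorch sketch shows a source-task policy whose weights initialize a target-task policy of the same architecture; the framework, file name, and network shape are assumptions.

```python
# Minimal sketch of parameter-based transfer: the target-domain model starts from the
# source-domain weights instead of a random initialization, then is fine-tuned.
import torch
import torch.nn as nn

def build_policy(state_dim: int = 12, action_dim: int = 2) -> nn.Module:
    """Small fully connected policy network; the shape is illustrative."""
    return nn.Sequential(
        nn.Linear(state_dim, 64), nn.ReLU(),
        nn.Linear(64, 64), nn.ReLU(),
        nn.Linear(64, action_dim), nn.Tanh(),  # bounded continuous action output
    )

source_policy = build_policy()
# ... train source_policy on the simpler source-domain task, then save its parameters ...
torch.save(source_policy.state_dict(), "source_task_policy.pth")

target_policy = build_policy()                                   # identical architecture
target_policy.load_state_dict(torch.load("source_task_policy.pth"))
# ... continue training (fine-tuning) target_policy on the harder target-domain task ...
```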

3.Proposed method

This Section details the proposed approach for realizing autonomous tracking control of UAVs for a maneuvering target in an uncertain environment, including a DDPG-based UAV control framework, an improved algorithm named MN-DDPG, and an optimization based on transfer learning.

3.1.DDPG based framework for UAV control

Fig. 4 describes the UAV maneuvering target tracking framework based on the DDPG algorithm. The DDPG networks output a UAV action based on the current status of the UAV; the UAV then receives the output of the actor network and generates a corresponding flight strategy to track the target. Throughout an interactive process with its environment, the networks are updated, and the UAV keeps learning and making effective sequential control policies from the previous experiences recorded in the experience buffer. The UAV state space, UAV action space, and reward function under this framework are defined and explained below.

Fig.4.UAV maneuvering target tracking framework.

3.1.1.State space

The state space of the UAV represents the valuable information the UAV can obtain before decision-making, which is used to help the UAV assess the situation. To train the tracking decision-making model, the UAV state space S_t is used as the input of the neural network in our DDPG framework and is defined as,
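Based on the description that follows and the 12-input actor network reported in Section 4, the state vector is presumably:

```latex
% Presumed 12-dimensional state vector: UAV position and speed, relative azimuth and
% distance to the target, and the seven obstacle range readings.
S_t = \left[\, x_t,\; y_t,\; v_t,\; \varphi_t,\; D_t,\; d_1,\; \dots,\; d_7 \,\right]
```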

where x_t and y_t are the coordinates of the UAV on the X-axis and Y-axis, and v_t denotes the velocity of the UAV. φ_t and D_t represent the relative azimuth and relative distance between the UAV and the target, as formulated in Equation (3). Moreover, d_1, …, d_7 represent the states of the surrounding obstacles based on the UAV's obstacle observation O_o, as mentioned in Equation (4).

3.1.2.Action space

Considering that the UAV needs to maintain a reasonable speed and a stable heading when performing tracking tasks, we employ a UAV strategy controller based on the rates of change of the UAV flight speed and heading. Therefore, the action output of the UAV is defined as,
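Based on the description that follows and the 2-output actor network reported in Section 4, the action vector is presumably:

```latex
% Presumed two-dimensional action vector: acceleration and yaw rate commands.
a_t = \left[\, \dot{v}_t,\; \dot{\varphi}_t \,\right]
```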

where the acceleration v̇_t and the yaw rate φ̇_t at time t manipulate the UAV attitude according to Equation (2). Taking into account the airborne capabilities of the UAV, we constrain the maneuvering characteristics that the UAV must obey, which are specified in the experimental verification in Section 4.

3.1.3.Reward function

The reward in DRL, which is the feedback signal available for the agent's learning, is used to evaluate the action taken by the agent. A simple approach is to set a sparse reward, in which the agent can get a positive return only if the task is accomplished. However, this method is inefficient at collecting useful experience data to aid UAV learning. Accordingly, the convergence of network updating is slow and the UAV may not learn optimal strategies.

In this paper, a non-sparse reward is designed to guide UAV maneuvering target tracking with obstacle avoidance. Four types of excitations are considered in the reward: (1) a tracking reward about the distance between the UAV and the target; (2) a UAV course reward; (3) a UAV steady flight reward; and (4) a safety reward about obstacle avoidance. Specifically, the tracking reward is formulated as,

where D_{t-1} and D_t denote the distance between the UAV and the target at the previous moment and at the current moment t, and ξ_1 and ξ_2 are the positive weights of the two terms in the tracking distance reward. The UAV is penalized moderately if the target escapes the sensor detection range L. Conversely, the UAV is considered to be keeping up with the moving target and is given a positive reward when the first condition in Equation (10) is met, where D_e is the maximum eligible tracking distance for the UAV. In addition, the other three rewards are defined as

where φ is the relative azimuth between the UAV and the target, and v̇_t is the acceleration of the UAV. Autonomous obstacle avoidance is regarded as one of the most essential capabilities of a UAV. In Equation (13), d_1~d_7 represent the readings of the seven sensors. The UAV is expected to stay away from obstacles to maintain its own safety, and it is punished when obstacles are detected. To summarize, the final reward function in the learning process is defined as,
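The composite reward is presumably a weighted sum of the four terms listed above (a sketch; the term names are illustrative):

```latex
% Presumed composite reward with gain factors \lambda_1 ... \lambda_4 weighting the
% tracking, course, steady-flight, and safety terms.
r_t = \lambda_1\, r_{\mathrm{track}} + \lambda_2\, r_{\mathrm{course}}
    + \lambda_3\, r_{\mathrm{steady}} + \lambda_4\, r_{\mathrm{safe}}
```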

Here, we introduce four relative gain factors λ_1~λ_4, which represent the respective weights of the four reward or penalty terms.

3.2.MN-DDPG with mixture noise for strategy exploration

A major challenge of DDPG-based learning in continuous action spaces is exploration. In the process of UAV training, as the target trajectory and environment change, the UAV needs to explore new strategies to complete tracking tasks. Therefore, we propose an improved MN-DDPG algorithm, in which mixed noises composed of Gaussian noise and Ornstein-Uhlenbeck (OU) noise are added to optimize the deterministic strategy generated by DDPG and guide the UAV in strategy exploration.

Considering that DRL uses an MDP to model the tracking task as a sequential decision-making problem, and given the characteristics of the UAV's continuous action output, the temporally correlated Ornstein-Uhlenbeck (OU) random process is adopted to provide action exploration for the UAV to deal with the changing environment. The noise based on the OU process is given as
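A standard OU-process form consistent with the symbols defined below (a sketch):

```latex
% Ornstein-Uhlenbeck process for temporally correlated exploration noise:
% \beta pulls the process toward the mean \bar{a}; \sigma_1 scales the Wiener increment.
\mathrm{d}a_t = \beta\,(\bar{a} - a_t)\,\mathrm{d}t + \sigma_1\,\mathrm{d}W_t
```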

where a_t is the action at moment t, ā is the mean of the sampled action data, β is the learning rate of the random process, σ_1 is the OU random weight, and W_t represents a Wiener process.

In addition, considering that the deterministic optimal policies learned by the transferred model for previous tasks might be unsuitable in new task scenarios, a Gaussian noise is introduced to help the UAV learn adaptive stochastic behaviors. This exploratory process is particularly important at the beginning of transfer learning. The optimized behavior based on the policy network output μ_t with the mixed noises is updated as
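A plausible form of the mixed-noise behavior policy, consistent with the description (a sketch):

```latex
% Presumed mixed-noise exploration: the deterministic actor output is perturbed by
% the OU noise N_t^{OU} and a zero-mean Gaussian with variance \sigma_2(e)^2.
a_t = \mu(s_t \mid \theta^{\mu}) + N_t^{\mathrm{OU}} + \mathcal{N}\!\left(0,\; \sigma_2(e)^2\right)
```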

where σ_2(e)^2 denotes the Gaussian variance in episode e, which ensures that the UAV has uniform and stable exploration capability in each episode, maintaining the effectiveness of exploration and correcting deviation. Meanwhile, as learning proceeds, the transferred model gradually adapts to the new task scenarios, which requires an exponential decay of the Gaussian variance:
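A plausible exponential decay using the attenuation coefficient δ defined next (a sketch; the paper's exact schedule may differ):

```latex
% Presumed exponential decay of the Gaussian exploration variance with episode index e.
\sigma_2(e)^2 = \sigma_2(0)^2 \exp(-\delta\, e)
```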

where δ is the attenuation coefficient. During the training process, we introduce the MN-DDPG algorithm to update the DDPG control framework, in which the mixed noises are added for the UAV's stochastic strategy exploration, and then associate the training process with the following transfer learning.

3.3.Transfer learning integrated into MN-DDPG

In this Section, the process of combining the transfer learning approach with MN-DDPG is discussed. Essentially, this process decomposes the tracking task into two progressive sub-tasks and carries out double pre-training. Specifically, to help the UAV learn to track the target with obstacle avoidance step by step, we decompose the overall task of the UAV tracking a complex maneuvering target while avoiding obstacles into two simpler sub-tasks: 1) a UAV tracking task for a target in uniform linear motion; and 2) a UAV tracking task for a target with random motion. The learning process of UAV maneuvering target tracking based on MN-DDPG combined with transfer learning, with task decomposition and double pre-training, is summarized in Fig. 5.

During the interaction between the UAV and the environment in sub-task 1, i.e., the first pre-training, the preliminary tracking model of the UAV can be trained based on the MN-DDPG algorithm, and the weights and bias parameters of the current networks can be saved as ℕ1. Then, we transfer the trained ℕ1 into sub-task 2 as its initial network. As the target moves with various trajectories, the UAV can be trained based on the MN-DDPG algorithm to continuously learn new strategies. After finishing the sub-task 2 learning, we save the networks as ℕ2. At this point, the second pre-training of the UAV is considered complete. Ultimately, we add obstacles to the task scenarios and transfer ℕ2 into the final task ℕtracking. Through training in multiple new environments with random obstacles, the model can not only keep tracking the target accurately but also learn new knowledge about obstacle avoidance. With the successive migration of each trained sub-task model, the UAV can ultimately be trained to be proficient in generating a comprehensive strategy. A detailed description of the complete algorithm is given in Algorithm 1.

Fig. 5. The learning process of UAV maneuvering target tracking based on MN-DDPG combined with transfer learning, where the task is decomposed into two sub-tasks for double pre-training.

Algorithm 1. UAV maneuvering target tracking training workflow of the proposed approach.
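The following Python-style sketch illustrates the workflow of Section 3.3; it is not the authors' exact Algorithm 1, and make_env, make_agent, and the agent methods (act, ou_noise, store, update, save_parameters, load_parameters) are hypothetical placeholders.

```python
# Illustrative sketch of the double pre-training + parameter transfer workflow.
import numpy as np

def mn_ddpg_train(env, agent, episodes=800, gaussian_std=0.5, decay=0.001):
    """Train an agent with MN-DDPG-style exploration: the deterministic actor output is
    perturbed by OU noise plus zero-mean Gaussian noise whose scale decays per episode."""
    for _ in range(episodes):
        state = env.reset()
        done = False
        while not done:
            action = agent.act(state)                                  # deterministic policy output
            action = action + agent.ou_noise() \
                     + np.random.normal(0.0, gaussian_std, size=np.shape(action))
            next_state, reward, done = env.step(action)
            agent.store(state, action, reward, next_state)             # experience replay buffer
            agent.update()                                             # critic loss, policy gradient, soft update
            state = next_state
        gaussian_std *= np.exp(-decay)                                 # shrink Gaussian exploration over episodes
    return agent

def double_pretraining_workflow(make_env, make_agent):
    """Task decomposition with double pre-training and parameter transfer (Section 3.3)."""
    agent = make_agent(state_dim=12, action_dim=2)

    # Sub-task 1: track a linearly moving (CV) target; save the trained networks as N1.
    mn_ddpg_train(make_env("cv_target"), agent)
    n1 = agent.save_parameters()

    # Sub-task 2: start from N1, track a randomly maneuvering (CV/CT) target; save N2.
    agent.load_parameters(n1)
    mn_ddpg_train(make_env("random_target"), agent)
    n2 = agent.save_parameters()

    # Final task: start from N2, add random obstacles, and fine-tune the full policy.
    agent.load_parameters(n2)
    mn_ddpg_train(make_env("random_target_with_obstacles"), agent)
    return agent
```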

4.Experiment and result analysis

In this Section, we discuss the simulation experiment settings and analyze the effectiveness and performance gains of our proposed approach based on the experimental results.

Table 1. The detailed environment settings in the maneuvering target tracking tasks.

4.1.Simulation environment settings

In the experiments, settings and constraints on the parameters of the task environments have been made. According to the UAV aircraft performance, the acceleration and angular velocity ranges of the UAV can be determined, and the velocity of the UAV is set to (m/s). The initial velocities of both the UAV and the target are set to 20 m/s. In addition, the Constant Velocity (CV) [31] model has been introduced for the target in sub-task 1, where the target performs a fixed uniform linear motion. Then, a target with stochastic CV and Constant Turn (CT) [31] models has been considered in sub-task 2, where the target maintains a constant speed but its trajectory is random. To train and verify the decision-making ability of UAVs in unknown environments, three random obstacles are added in the final task based on sub-task 2. In our experiment, the simulation step size is Δt = 0.1 s. Taking into account the fuel consumption of the UAV, the maximum working time of the UAV is set to 40 s. When the UAV completes the tracking task within the specified time (i.e., the UAV keeps tracking for 400 steps) or the target moves out of the detection range of the radar sensors, the task in the current training episode is considered over, and the simulation environment is reset. Table 1 gives the detailed environment settings for the tasks.

In the UAV maneuvering target tracking framework, the actor network and its target network are constructed as two 12×64×64×2 fully connected neural networks based on the UAV's observation, and the critic network and its target network are constructed as two 14×64×64×1 fully connected neural networks. When the experience replay buffer is full of data, the Adam optimizer is employed to update the neural network parameters. The detailed hyper-parameters in different stages of training are set as shown in Table 2. In addition, the parameters relating to the mixed noises are set as follows: the learning rate of the OU random process β = 0.1, the OU random weight σ_1 = 0.2, the initial Gaussian variance σ_2(0)^2 = 0.5, and the attenuation coefficient of the Gaussian variance δ = 0.001.
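As a sketch of the reported layer sizes (PyTorch is assumed; the paper does not state its framework, activations, or learning rates), the 12-64-64-2 actor and 14-64-64-1 critic could be written as:

```python
# Fully connected networks matching the reported sizes: actor 12-64-64-2 (state in,
# action out) and critic 14-64-64-1 (state + action in, Q-value out). Activations,
# output squashing, and learning rates below are assumptions.
import torch
import torch.nn as nn

class Actor(nn.Module):
    def __init__(self, state_dim=12, action_dim=2):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, 64), nn.ReLU(),
            nn.Linear(64, 64), nn.ReLU(),
            nn.Linear(64, action_dim), nn.Tanh(),       # bounded acceleration / yaw-rate commands
        )

    def forward(self, state):
        return self.net(state)

class Critic(nn.Module):
    def __init__(self, state_dim=12, action_dim=2):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim + action_dim, 64), nn.ReLU(),   # 14 inputs in total
            nn.Linear(64, 64), nn.ReLU(),
            nn.Linear(64, 1),                                    # scalar Q-value
        )

    def forward(self, state, action):
        return self.net(torch.cat([state, action], dim=-1))

actor, critic = Actor(), Critic()
actor_opt = torch.optim.Adam(actor.parameters(), lr=1e-4)    # Adam, as stated in the text;
critic_opt = torch.optim.Adam(critic.parameters(), lr=1e-3)  # these learning rates are illustrative
```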

Table 2. Hyper-parameters in the UAV training process based on MN-DDPG.

4.2.Simulation results analysis

First, we carried out the first pre-training, i.e., sub-task 1, as shown in Fig. 6. In Fig. 6(a) we present the real-time position changes of the UAV and the target with green and blue trajectories, respectively. The UAV continuously learned through the MN-DDPG algorithm and selected the appropriate acceleration and angular velocity to adjust its speed and attitude, thereby keeping the distance to the target within a relatively large fluctuation range, as demonstrated in Fig. 6(b). Although the UAV did not stabilize its own speed, as evident in Fig. 6(c), the proposed MN-DDPG algorithm completed the preliminary goal of training the UAV for simple target tracking.

Fig. 6. Simulation 1: UAV tracking a linearly moving target.

After setting the environment variables, the second pre-training was conducted to fine-tune the model in the new scenario on the basis of sub-task 1, where the weights of the trained network were transferred to sub-task 2 as the initial weights of the current policy network. Furthermore, the CT model was added for the target on the basis of the original CV model. Fig. 7(a), (b) and (c) show the UAV tracking trajectory, tracking distance deviation, and flight speed, respectively, in sub-task 2. It can be clearly seen that the UAV kept following the target and maintained a flight attitude similar to that of the target. With continuous iteration, the distance between the UAV and the target gradually decreased, and the velocity of the UAV tended to become stable. Compared with the previous training, the UAV optimized its strategies and effectively tracked the complex maneuvering target.

Fig. 7. Simulation 2: UAV tracking a complex maneuvering target.

The final task was to train the UAV to reach the navigation goal of tracking a complex maneuvering target while avoiding obstacles, based on sub-task 2. After accomplishing the double pre-training and transferring the trained network weight parameters into the final task, the entire training was completed by fine-tuning the neural networks, and the training result is shown in Fig. 8. In Fig. 8(a), the red areas represent obstacles. Although it was difficult for the UAV to make maneuvering decisions on its own in a complex dynamic stochastic environment involving obstacles, the UAV still navigated safely and tracked the maneuvering target accurately after a quick transfer learning. As shown in Fig. 8(b), the distance between the UAV and the target remained dynamically stable within a small range. Fig. 8(c) displays the statistical results, with minor fluctuations of the UAV speed from 0 to 400 steps in one whole tracking process. This indicates that the MN-DDPG algorithm with transfer learning via double pre-training can not only help the UAV cope with unexpected changes in complex environments, but also improve the flight stability of the UAV.

Fig. 8. Simulation 3: UAV tracking a complex maneuvering target in an environment with random obstacles.

4.2.1.Algorithm effectiveness

To verify the advantages of the MN-DDPG algorithm for strategy exploration, a comparative experiment was designed against conventional DDPG. Fig. 9 shows the reward curves of the UAV implementing the two algorithms in sub-task 1. The red curve and blue curve represent the rewards of the MN-DDPG algorithm and the DDPG algorithm, respectively. As can be seen from the figure, the blue curve was in a downtrend, which means that the UAV could not obtain substantial and stable rewards. After the two algorithms converged, the reward value of the MN-DDPG algorithm converged to 271.37, whereas the reward value of DDPG stabilized at -212.68, which was significantly lower than the convergence value of the MN-DDPG algorithm.

Fig.9.Reward in each episode during the training phase.

During testing, we randomly selected 50 episodes as samples, as shown in Fig. 10, from which we can see that the reward of the MN-DDPG algorithm was relatively stable, with an average value of 310.7828, whereas the reward of DDPG fluctuated significantly, with an average value of -110.2586. This means that the trained UAV using the DRL-based MN-DDPG algorithm can explore effective policies and complete maneuvering target tracking tasks in a simple environment.

Fig.10.Rewards in each episode during the testing phase.

4.2.2.Algorithm improvement

Fig. 11 depicts the UAV trajectory after we directly applied the trained network from sub-task 1 to sub-task 2, from which we can conclude that direct network parameter application without supplementary learning and fine-tuning performs poorly in our tasks. When the target moved in a straight line, the UAV could track the target within a controllable range. But when the target moved in a curve, the UAV failed to make an effective response decision and gradually fell behind the target. This means that the decision-making model trained based on the DDPG algorithm without transfer learning had weak generalization capability.

Fig.11.Tracking simulation based on directly applying parameters without transfer learning.

To verify the effectiveness of MN-DDPG combined with the transfer learning algorithm, we collected the UAV reward data over 800 rounds using traditional DDPG, MN-DDPG, and MN-DDPG with transfer learning. In Fig. 12, the green curve shows the rewards during training in the final complicated scenario starting from initial networks using the traditional DDPG algorithm. In addition, the blue curve represents the cumulative reward of each round in our task based on the MN-DDPG algorithm, and the red curve represents the aggregate reward in each round optimized by MN-DDPG combined with transfer learning. Clearly, the learning performance of the UAV based on the plain DDPG algorithm is comparatively poor; it only gradually gains unstable reward growth after about 480 rounds. The UAV optimized by MN-DDPG via mixed exploratory noises can start to learn valid knowledge shortly after the beginning of training, which means that the MN-DDPG algorithm improved the exploration efficiency of the UAV and accelerated the learning process. In this case, the UAV obtained a stable high reward after about 600 episodes. Compared with the two situations discussed above, the UAV trained using MN-DDPG with transfer learning via double pre-training in the complex task scenario could obtain a high reward around the 210th episode. The reward curve stabilized quickly after a slight fluctuation, which means that the UAV can form a rational tracking strategy after only about 270 training episodes.

Fig. 12. Comparison of DDPG, MN-DDPG, and MN-DDPG combined with transfer learning, based on the sum of rewards the agent obtains in each training episode.

These results indicate that the MN-DDPG algorithm combined with transfer learning greatly improves the performance and efficiency of DRL-based UAV maneuvering target tracking. It enables the UAV to accomplish the tracking task in an environment with complex maneuvering targets and random disturbances, and also accelerates the training compared with the original DRL approach.

5.Conclusions

Traditional algorithms solve the UAV maneuvering target tracking problem by establishing a tracking model matched to the target motion based on a known environment. When the environment changes or an obstacle threat emerges, the environment model must be updated in real time, which consumes a lot of computing resources and reduces the real-time performance and effectiveness of tracking.

The intent of this paper was to illustrate that DRL-based MN-DDPG combined with transfer learning is an efficient approach to developing an autonomous maneuvering target tracking control system for UAV applications in dynamic environments. Using the MN-DDPG algorithm, this paper constructs an online decision-making method and realizes autonomous maneuvering target tracking and obstacle avoidance for the UAV. The simulation results show that the mixed exploratory noises and the parameter-based transfer learning we introduced can improve the convergence speed of the neural network in the original DDPG algorithm, and improve the generalization capability of the UAV control model.

We intend to extend the UAV missions in the future to a realistic 3D space to evaluate the performance, efficiency, and robustness of our algorithms, so as to support UAV flight with six degrees of freedom. Furthermore, we also plan to increase the diversity and fidelity of the simulated scenarios, e.g., adding wind interference and other uncertainties, and considering UAV positioning errors and other errors, which will accelerate the transition from virtual digital simulation to real UAVs and other aerospace platforms.

Declaration of competing interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Acknowledgements

The authors would like to acknowledge the National Natural Science Foundation of China (Grant Nos. 61573285 and 62003267), the Aeronautical Science Foundation of China (Grant No. 2017ZC53021), the Open Fund of the Key Laboratory of Data Link Technology of China Electronics Technology Group Corporation (Grant No. CLDL-20182101), and the Natural Science Foundation of Shaanxi Province (Grant No. 2020JQ-220) for funding this work.