Autonomous Vehicle Platoons in Urban Road Networks: A Joint Distributed Reinforcement Learning and Model Predictive Control Approach
Luigi D’Alfonso, Francesco Giannini, Student Member, IEEE, Giuseppe Franzè, Senior Member, IEEE, Giuseppe Fedele, Francesco Pupo, and Giancarlo Fortino
Abstract—In this paper, platoons of autonomous vehicles operating in urban road networks are considered. From a methodological point of view, the problem of interest consists of formally characterizing vehicle state trajectory tubes by means of routing decisions complying with traffic congestion criteria. To this end, a novel distributed control architecture is conceived by taking advantage of two methodologies: deep reinforcement learning and model predictive control. On one hand, the routing decisions are obtained by using a distributed reinforcement learning algorithm that exploits available traffic data at each road junction. On the other hand, a bank of model predictive controllers is in charge of computing the most adequate control action for each involved vehicle. Such tasks are here combined into a single framework: the deep reinforcement learning output (action) is translated into a set-point to be tracked by the model predictive controller; conversely, the current vehicle position, resulting from the application of the control move, is exploited by the deep reinforcement learning unit to improve its reliability. The main novelty of the proposed solution lies in its hybrid nature: on one hand, it fully exploits deep reinforcement learning capabilities for decision-making purposes; on the other hand, time-varying hard constraints are always satisfied during the dynamical platoon evolution imposed by the computed routing decisions. To efficiently evaluate the performance of the proposed control architecture, a co-design procedure, involving the SUMO and MATLAB platforms, is implemented so that complex operating environments can be used, and the information coming from road maps (links, junctions, obstacles, traffic lights, etc.) and vehicle state trajectories can be shared and exchanged. Finally, by considering as operating scenario an entire real city block and a platoon of eleven vehicles described by double-integrator models, several simulations have been performed with the aim of highlighting the main features of the proposed approach. Moreover, it is important to underline that, in different operating scenarios, the proposed reinforcement learning scheme is capable of significantly reducing traffic congestion phenomena when compared with well-reputed competitors.
I. INTRODUCTION
IN the last decade, advances in vehicular networking, communication and computing technologies have facilitated the practical deployment of autonomous vehicles (AVs) [1]-[3]. On the other hand, the increasing number of vehicles has led to an urgent need for innovative solutions to deal with road traffic issues. By joining these novelties with modern control technologies, vehicular systems are expected to shift from individual driving automation to the platoon-based vehicular system paradigm [4]. It is envisioned that the use of platoon configurations can improve road traffic safety and efficiency. Therefore, the control of AV platoons has attracted increasing attention from different perspectives starting from the ’00s [5]-[14]. Indeed, vehicle routing decisions assume a key role because the optimization of the solution has a positive impact on traffic congestion, see [15]-[17] and references therein. Moreover, these aspects require particular attention because private mobility within urban road networks is almost always unsustainable [18]. As a consequence, future smart cities should refer to autonomous mobility systems that may offer a new way to provide equivalent service capabilities at possibly lower congestion levels. In the sequel, these objectives will be addressed by formally combining two different approaches: model predictive control (MPC) for constrained multi-agent evolutions and deep reinforcement learning (DRL) techniques for routing decision purposes.
A. Literature Review
Coordination and control of multi-vehicle systems have gained increasing interest in the control community, as testified by several important contributions, see e.g., [19]-[23]. Essentially, two main lines are pursued: the characterization of admissible vehicle state trajectory tubes and the design of ad-hoc control architectures to keep the relative positions of the vehicles within safe and acceptable limits. According to this premise, MPC appears to be the most adequate approach to comply with hard constraints during the system dynamical evolution. Hereafter, a brief analysis of some pertinent works is provided. In [24], a distributed receding horizon control (DRHC) scheme for nonlinear platoons is developed, while [25] proposes a synchronous DRHC for nonholonomic multi-vehicle systems. In [26], the constrained DRHC problem of a vehicle platoon with constrained nonlinear dynamics is investigated by using γ-gain stability arguments, whereas in [27], the distributed MPC (DMPC) scheme is based on state trajectories parameterized as polynomial splines. Conversely, in [28] a DMPC strategy for solving the formation problem of leader-follower systems is derived by exploiting explicit robust model-based control and set-theoretic ideas. In [29], a DMPC algorithm is developed for heterogeneous vehicle platoons with unidirectional topologies and an a-priori unknown desired set point. Finally, it is worth mentioning that the possibility of integrating MPC techniques within AV platoon scenarios has gained increasing interest because the resulting scheme is flexible with respect to topology changes and feasible when switching constraint sets arise, see e.g., [30].
In the last decade, the MPC philosophy has been coupled with reinforcement learning (RL) schemes for addressing complex problems in different real contexts: building energy management [31], control of multi-energy systems [32], quadrotor micro air vehicle navigation [33], power system control [34], to cite a few.
More recently, RL algorithms have been applied in several areas of intelligent transportation systems by virtue of the achievable high performance [35], [36]. Few DRL algorithms have been exploited for vehicle platoon control purposes. Since the accelerations (or speeds) of all vehicles should be properly selected, the simple exploitation of a centralized RL unit could lead to an exponential increase of the plant size, making the underlying computations intractable [37], [38]. Hence, distributed descriptions of single-unit RL algorithms are mandatory, see [39] and references therein.
Although some studies have been published on the integration between machine learning and distributed model predictive control approaches, see [40]-[42] and references therein, it is worth investigating the capability of distributed MPC to take full advantage of the potentialities of the class of RL algorithms in terms of achievable control performance. In particular, MPC-based strategies, being robust and based on worst-case analysis, may be too conservative under dynamic and changing working conditions; conversely, reinforcement learning algorithms, although capable of dynamically adapting to time-varying operating scenarios, are not able to guarantee hard constraint satisfaction.
B. Distributed Reinforcement Learning Based Model Predictive Control Scheme
The analysis of the above discussed contributions highlights two key aspects often left out in the proposed strategies: 1) the capability to efficiently take or modify the routing decisions during the on-line operations; 2) the definition of an ad-hoc control architecture capable of coupling the nominal path planning (defined by the sequence of routing decisions) with the real dynamics of the autonomous vehicles, where constraints and/or model impairments must be taken into account at each time instant.
Starting from this reasoning, the main aim of this paper is to propose new insights in the field of path planning for constrained autonomous vehicles moving in the cluttered environments of urban road networks. To this end, the guideline is to make the resulting control architecture scalable, flexible and computationally undemanding, in order to be capable of addressing time-varying operating scenarios along the path. From a methodological perspective, two aspects deserve particular attention:
1) The computation of routing decisions for mitigating traffic congestion phenomena;
2) A constrained control strategy in charge of adequately exploiting the routing decisions.
These considerations naturally lead to consider a distributed framework, necessary to enjoy scalability and flexibility specifications, where the first aspect is addressed by resorting to an innovative DRL scheme, while the second is attacked by means of an MPC approach adequately designed to reduce as much as possible the computational load pertaining to the underlying optimizations. Preliminary studies along these lines have been presented in [43], [44].
To comply with this reasoning, the two approaches are combined into a single distributed control architecture, where the capability of the DRL scheme to derive routing decisions that mitigate undesired traffic effects allows the platoon to be driven safely towards the pre-assigned target under the requirement that no collisions amongst the involved vehicles occur. The main contributions of this paper can then be summarized as follows:
1) Formalization of local MPC controllers that take advantage of the receding horizon philosophy and the platoon topology: in particular, most of the computations are performed by the leader, while the followers exploit the information spread along the chain and set-membership conditions for reducing their computational complexity;
2) The periodic RL activation is the job of the leader, while the rest of the platoon only accesses and evaluates the reward function for routing decision purposes;
3) Time-varying topology scenarios are allowed, i.e., the initial platoon can split, whenever recommended (different routing decisions between the leader and its followers), into two or more leader-follower configurations and, vice versa, two platoons can safely join each other.
As will be shown in the simulation section, these properties allow the overall performance to be improved when looking at the minimization of traffic congestion occurrences.
A second important outcome of this paper concerns the development of a co-design procedure between the SUMO platform [45] and MATLAB packages that allows realistic simulations to be performed by using real city maps and data series of the traffic evolution. This prescribes that data are recast in the XML format, a client-server paradigm is defined for making the communication between the two platforms admissible, and application programming interface (API) functions are implemented to get, modify or add information about the vehicles.
Finally, it is important to analyze the possible implications of using the proposed control architecture in real contexts. On one side, by looking at the computational burdens, the fact that the proposed strategy is computationally “light” represents a crucial property for operating in unknown environments populated by obstacles and/or external agents, where routing decisions and command inputs must be quickly adopted. On the other hand, an effective application requires taking care of the following issues: develop accurate mathematical models describing vehicle dynamics and operating environments; introduce anti-collision capabilities into the overall strategy; define adequate communication features in order to correctly acquire and evaluate the information available on the sensor units located along the urban road network.
C. Paper Structure
The rest of the paper is organized as follows. In Section II, the path planning problem for urban road networks is formulated. In Section III, the proposed deep reinforcement learning algorithm is introduced and its main properties are discussed. Sections IV and V describe the proposed DMPC architecture and the resulting algorithm, respectively. Section VI details the co-design procedure. Finally, simulations and some concluding remarks end the paper.
D. Notation and Definitions
II. PROBLEM FORMULATION
Moreover, we assume that the plant description (3) has real (e.g., double-integrator models) or artificially added (i.e., thanks to a pre-compensation action) integral effects so that, under zero velocity and ui(t) = 0, any point pi ∈ R² in the planar space is an equilibrium. In the context of the intelligent vehicles field, this is by now a standard assumption, as outlined in [46].
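Since the plant description (3) is not reproduced here, the following minimal sketch only illustrates a generic discrete-time double-integrator model of the i-th vehicle with the equilibrium property mentioned above; the sampling time and matrices are illustrative assumptions, not necessarily the exact model used in the paper.

```python
import numpy as np

Ts = 0.1  # sampling time [s] -- illustrative assumption

# Discrete-time double integrator: state x_i = [p_x, p_y, v_x, v_y],
# input u_i = planar acceleration. Any point with zero velocity and
# u_i = 0 is an equilibrium, as required in the text.
A = np.block([[np.eye(2), Ts * np.eye(2)],
              [np.zeros((2, 2)), np.eye(2)]])
B = np.block([[0.5 * Ts**2 * np.eye(2)],
              [Ts * np.eye(2)]])

def step(x, u):
    """One-step state update x(t+1) = A x(t) + B u(t)."""
    return A @ x + B @ u

# Sanity check: zero velocity and zero input keep the vehicle at p_i.
x_eq = np.array([10.0, -3.0, 0.0, 0.0])
assert np.allclose(step(x_eq, np.zeros(2)), x_eq)
```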
Throughout the paper it is assumed that the team is topologically organized as a platoon, see Fig. 1.
Fig. 1. Leader-follower topology.
- Information Exchange: At each time instant t, each i-th vehicle sends to the (i+1)-th its predicted future state trajectory, namely x̂i(t).
Operating scenario: The platoon (3) and (4) operates within an urban road network (URN), consisting of M links and F junctions, where sequences of routing decisions have to be taken during the travel.
Hereafter, the following definition will be exploited.
Definition 4: A link is defined as a portion of a road comprised between two junctions.
Then given a road map, the problem we want to solve can be stated as follows.
Platoons in Urban Environments (PL-UE): Given a platoon of constrained AVs (3) and (4) and a target xf, design a distributed state-feedback control policy
satisfying constraints (4) and such that, starting from an admissible initial condition, the team is driven towards the target xf regardless of any junction occurrence.
According to the PL-UE statement, the following definition is exploited for the next developments.
Definition 5: A state x ∈ X1×···×XL is said to be admissible if there exists at least one path from x to xf ∈ X1×···×XL compatible with (4).
The problem will be addressed by means of a DMPC scheme properly customized to exploit the capabilities of a recent DRL algorithm. In particular, the idea we develop consists of exploiting the routing decisions, provided by the DRL unit at each junction occurrence, as set-point references for the underlying MPC controllers.
A. Proposed Solution: An Overview
In the sequel, the key ingredients of a DMPC-based architecture devoted to addressing the PL-UE problem are discussed. First, without loss of generality, it is assumed that an admissible path complying with the PL-UE problem prescriptions exists. Then, the overall control architecture is reported in Fig. 2. Essentially, it consists of three units: a Path Planner based on the DRL algorithm, a Reference Generator and a Controller, each one tuned w.r.t. the involved vehicles AVi, i = 1,...,L.
Fig. 2. Distributed control architecture exploiting the DRL algorithm.
where the direction of travel and the center line are taken into account. By taking advantage of this reasoning, the set-point zi(t) is then formally achieved by solving the following optimization:
which provides the admissible reference zi(t), at the maximum distance from the current condition xi(t), compatible with the road constraint set selected by the action ai(t).
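The optimization defining zi(t) is not reproduced above; the sketch below only illustrates the idea under simplifying assumptions: the road segment selected by the action ai(t) is modeled as an axis-aligned box, and the reference is taken as the point of that box farthest from the current position. The bounds and helper names are hypothetical.

```python
import numpy as np

def farthest_admissible_reference(p_i, road_lb, road_ub):
    """
    Pick the point of the (assumed box-shaped) road constraint set
    selected by a_i(t) that maximizes the distance from the current
    position p_i. For a box, the maximizer is attained at a vertex:
    for each coordinate, take the bound farther from p_i.
    """
    lb = np.asarray(road_lb, dtype=float)
    ub = np.asarray(road_ub, dtype=float)
    p = np.asarray(p_i, dtype=float)
    return np.where(np.abs(ub - p) >= np.abs(lb - p), ub, lb)

# Example: vehicle at (2, 1), road link occupying [0, 50] x [0, 3].
z_i = farthest_admissible_reference([2.0, 1.0], [0.0, 0.0], [50.0, 3.0])
print(z_i)  # -> [50. 3.], the corner of the link farthest from the vehicle
```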
Moreover, a decision zone for routing decision purposes has to be formally defined for each link. This translates into the following half-space description:
see the explanatory example of Fig. 3.
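The half-space description itself is not reproduced above; the following sketch simply checks whether a vehicle position has entered an assumed half-space {p : aᵀp ≤ b} associated with a link, which is the event that triggers the routing decision. The vector a and scalar b are illustrative, and the 3.5 m offset follows the choice reported in Section VII.

```python
import numpy as np

def in_decision_zone(p_i, a, b):
    """True if the position p_i lies in the half-space {p : a^T p <= b}
    modeling the decision zone of the current link."""
    return float(np.dot(a, p_i)) <= b

# Example: a zone starting 3.5 m before a junction located at x = 100 m
# along a link aligned with the x-axis (so a = [-1, 0], b = -(100 - 3.5)).
a = np.array([-1.0, 0.0])
b = -(100.0 - 3.5)
print(in_decision_zone(np.array([97.0, 1.2]), a, b))  # True: decide now
print(in_decision_zone(np.array([90.0, 1.2]), a, b))  # False: keep driving
```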
Fig. 3. Geometric constraints (dashed green zone) due to the direction of travel and the center line (dashed red line).
Hence, the Controller computes an admissible command ui(t) in a distributed receding horizon fashion. Specifically:
• MPC controllers - Each i-th vehicle is equipped with a DMPC:
1) The leader (namely i = 1) implements a local receding horizon controller (RHC) with the prediction horizon length N1 = 0;
2) Each follower (namely i = 2,...,L) uses an MPC controller with Ni = i−1;
Remark 1: Notice that the above choices are dictated by the following reasoning. In real contexts, the platoon is subject to a certain level of uncertainty within the URN, e.g., other agents moving in the same road links, obstacles, and so on. Since the leader should compute control actions in a guaranteed fashion, this leads to consider a receding horizon controller for the leader with a control horizon length N1 = 0 within the positively invariant region, which can be simply checked by using on-board perception modules. As far as the followers are concerned, the simplest idea is to increment the length by one at each successive level so that the overall feasibility can be kept according to the Bellman optimization principle [47].
• Time-varying LF topology - According to a given switching rule, the proposed strategy allows the platoon to be split into two or more LF configurations whenever different routing decisions occur. Conversely, two platoons can be combined into a single configuration only if the leader of the first platoon reconnects to the leaf node of the second one. As a consequence, feasibility is preserved if the following actions are performed:
– Let t̄ be the time instant when the i-th, i > 1, vehicle leaves the initial LF configuration. At t̄+1 it becomes the leader of a new platoon and the prediction horizon is set to Ni = 0. If 0
Remark 2: In order to ensure that the leader-follower (LF) formation is preserved, each vehicle should be aware of the position of its predecessor from the current time instant onward. Within an MPC framework, this prescribes that each element of the LF configuration informs its followers about its future state trajectory (predicted sequence) in order to properly define the LF formation constraints within the time interval defined by the prediction horizon length.
III. REINFORCEMENT LEARNING SCHEME
This section is devoted to implementing a DRL scheme for routing decision purposes. To this end, the routing problem can be easily recast as a path search problem on a fully connected graph by exploiting the decision zone as a threshold, see Fig. 3. For the sake of computational complexity savings, the resulting graph can be appropriately pruned to reduce as much as possible the search space.
In the sequel, the following definitions are of interest.
Definition 6: Given a road link j, at each time instant t the vehicle density ρj(t) is defined as follows:
where µj(t) ∈ Z+ is the number of vehicles on j at t, d̄j ∈ R+ the average road width and lj the road length.
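The displayed density formula is missing from the extraction; a plausible reading of Definition 6, in which the vehicle count is normalized by the link area, is sketched below. The exact normalization used by the authors may differ.

```python
def vehicle_density(mu_j, dbar_j, l_j):
    """
    Density of road link j at time t, read as vehicles per unit area:
        rho_j(t) = mu_j(t) / (dbar_j * l_j),
    with mu_j the vehicle count, dbar_j the average road width [m]
    and l_j the road length [m]. This normalization is an assumption.
    """
    return mu_j / (dbar_j * l_j)

# Example: 12 vehicles on a 150 m long, 7 m wide link.
print(vehicle_density(12, 7.0, 150.0))  # ~0.0114 vehicles per square metre
```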
By referring to a distributed framework, a Q-learning approach [48] is adopted and the multi-agent RL model is as follows:
where
• V = {AV1,...,AVh} accounts for the set of vehicles (agents) contributing to the RL task;
where EDi is a one-hot encoding string accounting for the usable edges, while ρj(t), j = 1,...,Mi, are the distribution densities of the road link i and its neighbors (a sketch of this encoding is given after the list);
• Λ collects the admissible vehicle actions, e.g., “turn right”, “turn left”, “go straight on”;
• Φ : Σ × Λ × Σ → R is the global reward function.
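The symbol used for the local state was lost in the extraction; the sketch below illustrates one way the description above can be realized, namely a one-hot encoding of the usable edges concatenated with the densities of the current link and its neighbors, yielding a vector of length 2M (cf. the NN input dimension in Section VII). Names and shapes are assumptions.

```python
import numpy as np

def local_state(num_edges, usable_edges, densities):
    """
    sigma_i = [ED_i | rho]:
      - ED_i: one-hot (multi-hot) string over the M edges, 1 for edges
              usable from the current junction, 0 otherwise;
      - rho:  per-edge densities, zero for edges outside the neighborhood.
    The resulting vector has length 2*M, matching the NN input layer.
    """
    ed = np.zeros(num_edges)
    ed[usable_edges] = 1.0
    rho = np.zeros(num_edges)
    for j, dj in densities.items():
        rho[j] = dj
    return np.concatenate([ed, rho])

# Example with M = 179 edges (2M = 358, as in the Bologna scenario):
sigma = local_state(179, usable_edges=[12, 13, 40],
                    densities={12: 0.011, 13: 0.020, 40: 0.004})
print(sigma.shape)  # (358,)
```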
Fig. 4. Neural network description.
Unfortunately, during the training procedure, a trade-off between the exploration and exploitation phases arises. Such an issue can be overcome by using an epsilon-greedy policy [49], i.e., the probability ϵ(·) of choosing a random action is an inversely proportional function of time.
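A minimal sketch of the epsilon-greedy rule described above, with ϵ(·) decaying as an inversely proportional function of time, is given below; the decay constants and the Q-value container are placeholders.

```python
import random

ACTIONS = ["turn right", "turn left", "go straight on"]  # the set Lambda

def epsilon(t, eps0=1.0, c=0.01):
    """epsilon(t) inversely proportional to time (constants are assumptions)."""
    return eps0 / (1.0 + c * t)

def epsilon_greedy(q_values, t, rng=random):
    """Pick a random admissible action with probability epsilon(t),
    otherwise the greedy action arg max_a Q_i(sigma_i(t), a; theta_i)."""
    if rng.random() < epsilon(t):
        return rng.choice(ACTIONS)
    return max(ACTIONS, key=lambda a: q_values[a])

# Example: the greedy choice dominates once epsilon has decayed.
q = {"turn right": 0.2, "turn left": 0.7, "go straight on": 0.4}
print(epsilon_greedy(q, t=5000))  # most likely "turn left"
```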
Here, the following local reward function has been selected:
Notice that one obtains good rewards (performance) as long as the vehicle density decreases.
In order to adequately train the considered NN, the following squared loss function is used:
where
represents the so-called expected reward that characterizes the quality of the information acquired by the i-th agent on its surroundings. Moreover, in the sequel the moving time-window average reward is exploited
with T being the moving window length.
Finally, whenever a new reward measurement is available, a gradient descent update is performed
with α ∈ (0,1) the learning rate.
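The displayed loss and update equations were lost in the extraction; the following sketch only illustrates the mechanics with a linear-in-features Q approximation, Qi(σ, a; θi) = θi[a]ᵀσ: the squared error between a reward-based target and the current Q-value is reduced by one gradient step with learning rate α ∈ (0,1). This is a didactic stand-in for the deep network of Fig. 4, not the authors' exact update; α and γ follow the values reported in Section VII.

```python
import numpy as np

ACTIONS = ["turn right", "turn left", "go straight on"]

def q_value(theta, sigma, a):
    """Linear Q approximation Q_i(sigma, a; theta_i) = theta[a]^T sigma."""
    return float(theta[a] @ sigma)

def td_update(theta, sigma, a, reward, sigma_next, alpha=0.01, gamma=0.99):
    """
    One gradient-descent step on the squared loss
        L = (y - Q(sigma, a; theta))^2,
    where y = reward + gamma * max_a' Q(sigma_next, a'; theta) plays the
    role of the expected-reward target (assumed form).
    """
    y = reward + gamma * max(q_value(theta, sigma_next, ap) for ap in ACTIONS)
    err = y - q_value(theta, sigma, a)
    theta[a] = theta[a] + alpha * err * sigma  # descent direction (factor 2 absorbed in alpha)
    return theta

# Example usage with a 358-dimensional state (cf. Section VII):
rng = np.random.default_rng(0)
theta = {a: rng.normal(scale=0.01, size=358) for a in ACTIONS}
sigma, sigma_next = rng.random(358), rng.random(358)
theta = td_update(theta, sigma, "turn left", reward=0.8, sigma_next=sigma_next)
```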
All the above developments allow the following computable algorithm to be written down.
DQL-Algorithm
Initialization:
1: Extrapolate σi(0), i = 1,...,h, from the URN;
2: Initialize Q(σi, a; θi), i = 1,...,h.
At each tdec do:
1: Select an admissible action:
   ai(tdec) = rand(Λ) with probability ϵ(tdec), or
   ai(tdec) = argmax_{a∈Λ} Qi(σi(tdec), a; θi) otherwise;
2: Observe the global reward Φ(∪k σk(tdec + ∆Tdec(t))) and σi(tdec + ∆Tdec(t)), ∀i = 1,...,h;
3: Compute the local reward ϕi(tdec + ∆Tdec(t)), ∀i = 1,...,h;
4: Update θi, ∀i = 1,...,h.
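To make the listing concrete, a compact sketch of one decision instant is given below; it reuses the helper functions introduced in the previous sketches (q_value, epsilon_greedy, td_update, ACTIONS), while the environment interface (apply_actions) and the aggregation of the global reward from the local ones are assumptions.

```python
def dql_decision_step(agents, env, t_dec):
    """
    One decision instant of the DQL-Algorithm for h agents:
    1) each agent selects an admissible action via epsilon-greedy;
    2) the new local states and local rewards are observed;
    3) the global reward Phi is aggregated (assumed: sum of local rewards);
    4) each agent updates its weights theta_i by a gradient step.
    """
    actions = {}
    for i, agent in agents.items():
        q = {a: q_value(agent["theta"], agent["sigma"], a) for a in ACTIONS}
        actions[i] = epsilon_greedy(q, t_dec)

    new_sigmas, local_rewards = env.apply_actions(actions)  # hypothetical API
    global_reward = sum(local_rewards.values())             # assumed aggregation

    for i, agent in agents.items():
        agent["theta"] = td_update(agent["theta"], agent["sigma"],
                                   actions[i], local_rewards[i],
                                   new_sigmas[i])
        agent["sigma"] = new_sigmas[i]
    return global_reward
```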
Remark 3: Notice that, in the present context, DQL algorithms are more suitable than standard Q-table based schemes [50], [51]. In fact, deep Q-learning uses a neural network to calculate the Q-values necessary to take an action in the environment. Essentially, the Q-values are not directly updated; rather, the neural network weights are corrected, according to the reward for each action, and used to compute them. This means that they are versatile and require much less memory than any Q-table counterpart. As a matter of fact, the most appropriate actions come out.
IV. DMPC STRATEGY: SOLUTION AND FEASIBILITY
The above control framework poses the following methodological questions:
1) Given an initial LF configuration, define local MPC problems for the L involved vehicles;
2) Characterize the conditions under which the platoon split is feasible;
3) Determine the conditions under which two platoons can be combined into a single LF configuration.
A. DMPC Control Units: Single Platoon
Notice that the sequence x̂i is transmitted to the follower i+1, which in turn hypothesizes that the i-th vehicle will implement it during the update prediction time interval [t+k+1, t+k+Ni+1].
In order to formally define the optimal control problem underlying the MPC strategy, the following ingredients are required:
• Input sequence parametrization:
with Ki ∈ R^{mi×ni} a stabilizing and admissible state feedback law;
• Cost-to-go:
In the proposed DMPC scheme, the MPC controller of each vehicle has to be obtained by also taking care of the main PL-UE goal, i.e., converging to the target xf. Then, it is necessary to deal with the fact that the LF formation moves towards xf: the latter prescribes that the terminal condition is time-varying and the related constraint must be accordingly modified. Moreover, the vehicle must take into account routing decisions at each junction, so that the resulting reference zi(t) comes into play when the terminal region is derived. First, the following conditions have to be satisfied for all the involved vehicles:
where Ξi(t) is computed such that
As far as the followers are concerned, an admissible computation of the sets Ξi(t), i = 2,...,L, prescribes that each vehicle transmits to its direct follower i+1 the PI set computed at the previous time instant, i.e., Ξi(t−1).
Then, one has that
• If i = 1 then Ξ1(t) is such that
– If z1(t) ∈ Ξ1(t−2) ∩ Ξ1(t−1) then
– else
• else
– If zi(t) = zi−1(t) then
– else if zi(t) ∈ Ξi(t−2) ∩ Ξi(t−1)
– otherwise
Finally, it is mandatory to preserve the LF configuration at each time instant. Here, it is required that each follower vehicle maintain a safe distance from its predecessor and neighbors. This translates into LF formation constraints in terms of the desired separation and the desired relative bearing between the i-th vehicle and its predecessor [52] (a numerical sketch is given after Remark 4). Hence, given the current state measurement xi(t|t) ∈ Xi and the predecessor assumed state sequence (t) computed according to the following strategy:
the optimization problem for thei-th follower, hereafter denoted as(t), is
Conversely, the DMPC optimization pertaining to the leader vehicle, named DMPC-PL(t), reduces to the off-line computation of an admissible pair (Ξ1,K1) via the following semidefinite programming (SDP) problem (see e.g., [53]):
DMPC-PL(t):
Notice that DMPC-PL(t) does not involve the satisfaction of (36) because this vehicle leads the team, while the LF configuration is exclusively preserved by its follower nodes.
Remark 4: Although the l.h.s. of (36) gives rise to non-convex constraints, they are addressed by straightforwardly adapting the convexification results of [54] (Proposition 1, p. 292).
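As a numerical illustration of the LF formation constraints mentioned above, the sketch below evaluates the separation and the relative bearing between the i-th vehicle and its predecessor and checks them against desired values with tolerances. Symbols, tolerances and the box-type form of the check are assumptions; in the paper the non-convex distance constraint is handled via the convexification of [54], not via this simple test.

```python
import numpy as np

def formation_error(p_i, p_prev, d_des, beta_des):
    """
    Separation and relative bearing of vehicle i w.r.t. its predecessor:
      d    = ||p_prev - p_i||           (desired value d_des)
      beta = atan2 of the relative position (desired value beta_des).
    Returns the two tracking errors used to express the LF constraints.
    """
    delta = np.asarray(p_prev, dtype=float) - np.asarray(p_i, dtype=float)
    d = np.linalg.norm(delta)
    beta = np.arctan2(delta[1], delta[0])
    return d - d_des, beta - beta_des

def lf_constraint_satisfied(p_i, p_prev, d_des, beta_des,
                            d_tol=1.0, beta_tol=np.deg2rad(10)):
    """Box-type check |d - d_des| <= d_tol and |beta - beta_des| <= beta_tol
    (tolerances are illustrative)."""
    e_d, e_b = formation_error(p_i, p_prev, d_des, beta_des)
    return abs(e_d) <= d_tol and abs(e_b) <= beta_tol

# Example: follower 6 m behind a predecessor that is (almost) due east of it.
print(lf_constraint_satisfied([0.0, 0.0], [6.2, 0.3], d_des=6.0, beta_des=0.0))
```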
B. Platoon Splitting and Queuing
Along the road links, the initial platoon can split into two or more sub-LF configurations and any two platoons (even singletons) can regroup.
Notice that the splitting phase occurs during the on-line operations when different routing decisions are provided to the platoon vehicles. More precisely, since the reward is periodically updated until the decision zone of any road junction is reached, a vehicle and its predecessor (father) could assume different directions of travel, so that a split into two or more sub-platoons takes place.
As a consequence of such an event, it is assumed that an adequate re-numbering procedure is implemented so that the new platoons LF1 and LF2 are renamed as follows and
On the other hand, two platoons could regroup according to the following reasoning. The idea is that LF1 is added to LF2, or vice versa, at the leaf node of the first LF chain. As a consequence, the associated DMPC controllers will be implemented over horizons of lengths Nh = NL1+h, h = 1,...,L2.
The main difficulty with such an aim relies on the capability of the external platoon to detect the leaf node of the platoon to regroup with. In fact, no a-priori information about the roles and positions of the vehicles within the LF chain is available on the external vehicles' side. Then, one can use the following technicality to overcome this hitch.
Statement 1: At each time instant t, each i-th vehicle of the LF1 formation makes available a data packet of the form Counti = {Counti.child(i), Counti.Ni}, where the integer child(i) accounts for its follower, i.e., child(i) = 0 implies no follower, and Ni is the horizon length. Such a packet can be acquired by another h-th vehicle if there exists a time instant t̄ such that
It is worth noticing that Statement 1 is simply implementable because current production robots are usually equipped with embedded, purpose-oriented communication devices with low energy consumption, see e.g., [55].
Then, the LF2 platoon is able to regroup with LF1 if condition (42) becomes valid for some i-th vehicle for which Counti.child(i) = 0. This means that the AV12 vehicle will be a leaf node of the LF1 formation and, as a consequence, communication facilities will be enabled and a new LF configuration will come out.
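A minimal sketch of the data packet of Statement 1 and of the leaf-node test used for regrouping is given below; the reception condition (42) is replaced here by a simple communication-range check, which is only a placeholder for the condition lost in the extraction.

```python
from dataclasses import dataclass
import math

@dataclass
class Count:
    """Data packet broadcast by the i-th vehicle of LF1 (Statement 1)."""
    child: int   # index of its follower; 0 means no follower (leaf node)
    N: int       # prediction horizon length N_i

def can_acquire(p_sender, p_receiver, comm_range=30.0):
    """Placeholder for condition (42): the packet is received when the two
    vehicles are within an assumed communication range [m]."""
    return math.dist(p_sender, p_receiver) <= comm_range

def regroup_possible(packet: Count, p_leaf, p_joining_leader) -> bool:
    """LF2 can queue behind LF1 only if the packet of a reachable LF1
    vehicle reports no follower, i.e., it is the leaf of the chain."""
    return packet.child == 0 and can_acquire(p_leaf, p_joining_leader)

# Example: the leaf of LF1 (child = 0, N = 4) is ~12 m ahead of LF2's leader.
print(regroup_possible(Count(child=0, N=4), (120.0, 8.0), (108.5, 7.0)))  # True
```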
The key question to investigate is whether splitting and queuing operations preserve the overall feasibility property of the DMPC strategy. To this end, the following arguments will be considered:
imposed equal to the number of followers of the initial platoon
and this always allows an admissible sequence to be provided when the platoons LF1 and LF2 regroup. According to this reasoning, first the positively invariant regions are derived by resorting to the so-called delay-dependent time-delay scenario [57] and the prescribed constraints (4). To this end, by considering the state-feedback control law
and the regulated plant
the delayed system technicalities of [57] are hereafter used. The feedback control law (45) stabilizes the plant and satisfies the prescribed constraints if the following matrix inequalities in the objective variables Ki, Qi, Ψi, Ri and τ are satisfied
Hence, the ellipsoidal set
Conversely, if the control horizon becomes Ni+k, a solution always exists because Ki xi*(t+Ni|t) can be consecutively applied as prescribed by (47)-(49), i.e.,
V. DISTRIBUTED DRL-MPC ALGORITHM
The developments of the previous sections are here collected to define a computable algorithm based on the availability of an urban road network map characterized in terms of admissible directions of travel for each road link.
Recursive feasibility and asymptotic stability of the DRL-MPC Algorithm are stated in the following proposition.
Proposition 3: Let the initial and final conditions xi(0) and xif, i = 1,...,L, be given. Then, the DRL-MPC Algorithm always satisfies the prescribed constraints, complies with the requirements of the PL-UE problem and ensures that the regulated state trajectories are asymptotically stable.
Proof: Under the hypothesis that feasible input sequences are off-line available for both the PL and PiF optimization problems, the following arguments are exploited for feasibility purposes. The following scenarios can arise:
1) Single platoon phase: The leader feasibility arises from the fact that, if an optimal solution exists at the time instant t, namely u1*(t|t) = K1(t)x1(t), then at the next time instant t+1 the positive invariance property of Ξ1(t) ensures that the state trajectory will at least remain confined within Ξ1(t) and, in virtue of (22), (23), the transition to the new controller is asymptotically guaranteed. Similar arguments can be exploited for the generic follower because, after Ni-step ahead state predictions, the state trajectory xi(·) keeps confined within the positively invariant set Ξi, in turn updated by (27);
2) Splitting mode: Without loss of generality, let us assume that two platoons arise, namely L1 and L2. At t+1 the local optimization PL1 remains feasible because the domain of attraction Ξ1(t) is invariant w.r.t. the state feedback K1(t), while the feasibility of the leader of the new platoon L2 is not affected because admissible, though not optimal, solutions always exist in virtue of the results of Proposition 1. Again, similar ideas apply to the follower vehicles;
3) Queuing mode: At t+1 the feasibility of the platoon L2 is guaranteed in virtue of Proposition 2, which allows an admissible, though not optimal, sequence of control moves to be defined by exploiting the delay property of the PI region Ξi(t) computed by complying with (47)-(49).
Finally, the closed-loop asymptotic stability derives from the same analysis because, according to each routing decision zi(t), the vehicle position converges to xif as t → ∞. ■
VI. SUMO VERSUS MATLAB CO-DESIGN
In this section, a simulation framework is developed with the aim of stressing the capabilities of the proposed DRL-MPC architecture in complex environments. Specifically, two software platforms are considered: the Simulation of Urban MObilityⓇ (SUMO) software [45] and MATLAB 2022aⓇ.
Recall that SUMO is an open-source, highly portable, microscopic and continuous multi-modal traffic simulation package, globally accepted by the scientific community and designed to handle large road networks [58]. Although from a macroscopic point of view traffic is usually described by means of departure times and routes with certain durations, in a real scenario it is highly conditioned by various other factors like driver mobility, driving styles, weather, infrastructure within the region or other incidents affecting the environment. As is well known, the environment complexity prevents any traffic modeling via explicit mathematical expressions [59]; therefore, the only chance consists in performing reliable simulations instrumental to show critical waypoints or to predict traffic behaviors [45]. This software is particularly suitable to simulate URNs where different classes of vehicles (such as cars, trains, buses) lie. The simulation engine is based on a hybrid description: discrete-time and continuous-space. It enjoys collision avoidance capabilities, multi-lane roads with lane changing, junction-based right-of-way rules, and lane-to-lane connections. Moreover, SUMO provides all the URN data, the traffic demand and the resulting simulation in XML formats.
In the sequel, a co-design procedure is outlined in order to allow information sharing between SUMO and MATLAB. Essentially, this prescribes that, starting from the set of functions provided by TraCI4MatlabⓇ [60], which are customized to comply with the DQL-Algorithm, the co-design in terms of software communication and data exchange is performed. In more detail, the TraCI4Matlab toolbox allows the communication between any MATLAB application and the SUMO traffic simulator. The TraCI application-level protocol is based on the client-server paradigm and is built on top of the TCP/IP stack: the application developed in MATLAB plays the role of a client and it can access and modify the simulation environment provided by SUMO, which acts as a server. Hence, the proposed co-design is in charge of automatically mapping all the URN information into properly encoded MATLAB objects. Then, a MATLAB graph is obtained by exploiting the SUMO XML files, where the URN junctions (i.e., crossroads, traffic circles, road entrances and exits) are the graph nodes while the roads are the graph edges, see the explicative Fig. 5. A set of APIs has then been developed to get, modify or add information about vehicles within the simulation environment. For instance, these routines can be used to query SUMO about where a vehicle is currently moving or to add a new vehicle in terms of its position, speed, attitude and driving features in the URN reference frame. In addition, the proposed toolbox straightforwardly allows obtaining data regarding road vehicle density, spatial constraints (i.e., road width, length and topology), obstacles, and so on.
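The co-design described above relies on TraCI4Matlab; as an analogous, hedged illustration, the snippet below uses SUMO's standard Python TraCI client to start a simulation, add and query a vehicle, and obtain the per-edge data needed for the density of Definition 6. The configuration and route names are placeholders, and the exact calls used by the MATLAB toolbox may differ.

```python
import traci  # shipped with SUMO; analogous role to TraCI4Matlab

# Client-server paradigm: this script is the client, SUMO the server.
traci.start(["sumo", "-c", "andrea_costa.sumocfg"])  # config name is a placeholder

# Add a platoon vehicle on a pre-defined route (route id is a placeholder).
traci.vehicle.add("AV_1", routeID="route_to_parking")

for step in range(100):
    traci.simulationStep()                            # advance the hybrid simulation
    if "AV_1" in traci.vehicle.getIDList():
        x, y = traci.vehicle.getPosition("AV_1")      # current vehicle position

    # Per-edge data needed for the density rho_j(t) of Definition 6.
    for edge_id in traci.edge.getIDList():
        mu_j = traci.edge.getLastStepVehicleNumber(edge_id)  # vehicles on edge j
        l_j = traci.lane.getLength(edge_id + "_0")           # length of lane 0 of edge j

traci.close()
```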
As far as the DQL-Algorithm is concerned, the proposed co-design adapts to new operating scenarios by simply selecting a new SUMO XML describing file, which in turn increases its flexibility for routing decision purposes. In particular, the designed APIs allow the computation of the local reward (13), while the optimization of the NN weights is achieved by means of the MATLAB Reinforcement Learning ToolboxⓇ and the MATLAB Deep Learning ToolboxⓇ [61], [62].
VII. SIMULATIONS
In this section, a set of simulations is proposed aiming at highlighting the performance of the proposed DRL-MPC algorithm by using real traffic data. The simulation setup has been implemented within the MATLAB R2022a environment by using the SUMO 1.11.0 toolbox, and the computations have been carried out on a Lenovo IdeaPad L340-15RH laptop equipped with an Intel Core i9 processor under a 64-bit operating system.
TABLE I PLATOON INITIAL POSITIONS
A. Vehicle Platoon, Urban Road Network and Constraints
In the sequel, the district around the “Andrea Costa” road in Bologna (IT) has been chosen for testing the features of the proposed scheme, see Fig. 6. To this end, the real traffic data set reported in [63] includes the area in the proximity of the city football stadium, which well adapts to simulate vehicle mobility during big and popular events such as football matches or concerts. In particular, these traffic data are related to Bologna's peak hour (8:00 am–9:00 am) and have been generated by considering the highest traffic demand during a football match on March 3rd, 2010, and the daily flow one week before. Moreover, additional road information has also been considered: traffic light locations and plans, inductive loop positions, the public bus transportation system, and so on.
The URN has been built within the public project iTETRIS [64], which is specifically oriented to define large-scale traffic management solutions for vehicular communications.
Operating scenario - The reference initial position is located at latitude 44°29′23.6″ N and longitude 11°18′30.2″ E. Within a local tangent plane East-North-Up (ENU) reference frame [65], the AVs' initial positions have been set as reported in Table I with vix = 0 and viy = 0, ∀i. The aim consists in driving all the vehicles towards the parking area located close to the target position on the south-east side, see Fig. 7.
Fig. 7. The URN description via SUMO: the leader initial (yellow) and target (red) locations are denoted with pins.
The following knobs are used for the DRL-MPC algorithm implementation: i = 1,...,L, α = 0.01 and γ = 0.99. Moreover, the decision zones (8) are straightforwardly derived according to the assumption that the decision is made 3.5 m before each junction. Finally, the NN architecture of Fig. 4 is customized within the MATLAB Deep Network Designer as depicted in Fig. 8, where the state input layer dimension is twice the number of edges of the considered URN, i.e., 2M = 358.
Fig. 8. Neural network from the MATLAB Deep Network Designer. The activation functions are labeled “elus” and “elua” because they are two ELU functions, one for the state path and another for the action path.
B. Results
All the achieved results are collected in Figs. 9-16. As far as the Q-function is concerned, each AV benefits from a pre-trained initialization performed by using the MATLAB Reinforcement Learning Toolbox built-in routines and the SUMO data set related to the “Andrea Costa” district. Such a phase has been carried out by generating 2000 traffic episodes [49]. Each episode is here defined as a finite sequence of time instants, where the first one corresponds to the initial vehicle position configuration of Table I, while the last corresponds to the target xf (parking area). Notice that during the training, the NN weights are updated according to (18). Specifically, the training process evolution is depicted in Fig. 9, where the episode reward (the sum of all the rewards received by the agent during each episode), the moving average reward and the expected episode reward are reported. There, it can be observed that the capability of the vehicle to select the most adequate decision improves as the training process evolves: the average reward (red line) increases, while the expected one (orange line) shows an asymptotically convergent behavior to the effective reward.
By taking a look at the regulated behavior, all the prescribed constraints are always fulfilled, see Figs. 10 and 14. Hence, for the sake of comprehension, the attention will be focused on the two critical modes: platoon splitting and platoon queuing.
Fig. 9. Training from the MATLAB Reinforcement Learning Toolbox.
Fig. 10. 8:00 am–8:30 am traffic data set: constraints fulfillment. The left bottom graph refers to a single platoon, while the right column refers to the two-platoon scenario (LF1 (top) and LF2 (bottom)).
Fig. 11. 8:00 am–8:30 am traffic data set: (a) single platoon before split, (b) splitting, (c) two platoons moving in the URN, (d) platoons reaching the target position. Yellow cars denote the traffic along the path.
The first experiment is performed by considering the traffic data in the time interval 8:00 am-8:30 am. In this case, the initial platoon configuration is kept the same until around
Fig. 12. 8:00 am–8:30 am traffic data set: single leader (blue line) and two leaders (AV1 red line, AV2 green line).
Fig. 13. 8:00 am–8:30 am traffic data set: routing decisions. Leader a1 and first splitting vehicle a9.
Fig. 14. 8:30 am–9:00 am traffic data set: constraints fulfillment. The left bottom graph refers to a single platoon, the right top and middle graphs to the two-platoon configuration, and the right bottom one to the single platoon after queuing.
Fig. 15. 8:30 am–9:00 am traffic data set: (a) single platoon before split, (b) splitting, (c) two platoons moving in the URN, (d) platoons reaching the joining condition. In all the pictures, green vehicles are the leaders and red vehicles are the followers. Yellow cars denote the traffic along the path.
Fig. 16. 8:30 am–9:00 am traffic data set: leaders trajectory in the cases of a single platoon, before the split and after the join (blue line), and of two platoons (red line for LF1 and green line for LF2).
Fig. 17. 8:30 am-9:00 am traffic data set: routing decisions. Leader a1 and first splitting vehicle a6.
platoon reaches the target at t ≈ 715 s. For the sake of completeness, constraint evolutions and platoon behavior snapshots within SUMO are summarized in Figs. 14 and 15, respectively. Finally, Fig. 16 depicts the achieved paths during the prescribed mission.
For the interested readers, simulation demos are available at the following web link: https://youtu.be/YcxVJctk75k.
Finally, some numerical comparisons have been carried out by contrasting the DQL algorithm with a well-reputed competitor, namely the Dijkstra algorithm [66]. To this end, the same DMPC controller has been exploited for both routing schemes. In addition to the simulations described above, a further scenario is addressed: within the time window 8:00 am-8:30 am, the initial platoon configuration is initialized by imposing that all the vehicles have zero velocities, the leader is located at (x11(0), x12(0)) = (874.54, 979.20) m and the target is xf = [1611.76, 883.38, 0, 0]^T m. Notice that the ENU reference frame is the same exploited for the other simulations. The results are collected in Table II in terms of the maximum travel time required, i.e., the time interval required by the leaf vehicle. Such a computation has been parceled out at 20%, 50% and 100% of the overall route length. As a common denominator, the three simulations show that during the first segment (20%) the two schemes have quite similar dynamical behaviors, while as the platoon moves towards the target xf the proposed DRL-MPC increasingly outperforms the Dijkstra-based solution. The main reason for this behavior relies on the capability of the DQL algorithm to simply and quickly adapt its routing decisions to the time-varying traffic conditions, while this is difficult, or even unviable, for the Dijkstra counterpart.
VIII. CONCLUSIONS
In this paper, the path planning problem for vehicle platoons subject to routing decisions and operating in urban road networks has been addressed by means of a new control architecture based on the joint exploitation of DRL and MPC properties. As one of the main merits of the proposed approach, the DMPC algorithm has been formalized by fully taking advantage of the routing decisions obtained by the DRL unit. As a consequence, time-varying topologies of the multi-vehicle system, arising according to the minimization of traffic congestion criteria, have also been considered without compromising the overall feasibility. In the simulation section, a platoon of eleven vehicles, operating within the area around the football stadium of Bologna, Italy, has been used to highlight the main capabilities of the proposed DRL-MPC Algorithm. Future studies will follow two directions. First, it is important to extend the proposed approach to comply with obstacle avoidance requirements and with roadways having more than one lane, allowing the use of grid vehicle topologies. Then, the performance of the DRL algorithm can be improved by accelerating the distributed learning phase to get faster vehicle planning decision responses.
TABLE II ROUTING TIMES IN THE TWO SIMULATED SCENARIOS