Decentralised grid scheduling approach based on multi-agent reinforcement learning and gossip mechanism
2018-04-12
Jun Wu, Xin Xu✉
1 College of Aerospace Science, National University of Defense Technology, Changsha, People's Republic of China
2 College of Mechatronics and Automation, National University of Defense Technology, Changsha, People's Republic of China
1 Introduction
With the rapid development of Big Data and Artificial Intelligence, the demand for computing resources has risen sharply in recent years. At the same time, the widespread use of computers means that the numerous machines in local networks and the Internet offer a possible solution to this problem. To exploit this potential, grid computing [1,2] has emerged as a paradigm for supporting complicated computing and information-sharing problems by enabling the sharing, selection and aggregation of geographically distributed heterogeneous resources (see Fig. 1).
For most grid systems, the urgent unresolved problems are how to organise, coordinate and utilise all computing resources efficiently, and a proper scheduling algorithm is therefore extremely important [3–5]. It is well known that only through flexible resource scheduling methods can grids improve computing efficiency and service quality [6,7].
For grids, the scalability and adaptability of job scheduling are two central problems. Centralised resource scheduling suffers from unavoidable limitations in scalability and computational performance [8]. Moreover, owing to the heterogeneity and variation of resources, the diversity of applications and the dynamics of grid environments, an adaptive and robust scheduling method is desired [9]. In this paper, a novel distributed job scheduling method, named gossip-based reinforcement learning (GRL), is proposed. A decentralised scheduling architecture with coordination is designed to achieve better scalability and efficiency. Moreover, a multi-agent reinforcement learning (MARL) scheme is applied to improve adaptability. Compared with existing MARL-based scheduling methods [10–12], the proposed method possesses two innovations. First, a distributed bandit-like reinforcement learning algorithm is adopted by each scheduler: a utility table whose size scales only with the number of resources is learned to estimate the current efficiency and load of the resources, and a decentralised scheduling policy can be obtained from the utility table directly, so GRL is computationally efficient. Second, good coordination is achieved with a gossip mechanism [13–15], which permits each scheduler to communicate with its neighbours randomly. Unlike the schedulers in ordinal sharing learning (OSL) [9] and Condor Flock P2P [16], GRL-based schedulers can work without dependency on special schedulers; in this way, GRL achieves relaxed and robust coordination. Finally, a grid system is simulated to verify the proposed method, and the results show that GRL is effective and flexible.
This paper is organised as follows. Section 2 gives an overview of related works. Section 3 defines the system model and discusses the performance measures. Section 4 presents the GRL method for grid job scheduling. Section 5 describes the performance evaluation of the GRL method in computational grids. Finally, conclusions are drawn in Section 6.
2 Related works
In the past decade, the scale and complexity of grid systems have increased constantly. For example, one of the largest grid infrastructures, the Large Hadron Collider Computing Grid [17], consists of a large number of resource centres distributed around the world, offering large-scale CPU resources. Moreover, grid environments are inherently dynamic: the topology, scale and membership of grids can change at any moment. Despite the significant advances in grid scheduling techniques, how to improve scalability and adaptability is still an open problem.
Generally speaking, grid system schedulers can be classified into three categories: centralised, hierarchical and decentralised [18]. A centralised grid system needs a single authority to schedule all the resources and lacks scalability and fault tolerance due to the single point of failure [7,19]. Hierarchical schedulers, such as KOALA [20] and GridWay [21], only solve the problem partially. To achieve better fault tolerance and scalability, decentralised grid schedulers have been widely studied in recent years. However, most existing decentralised schedulers, e.g. Condor-G [22] and AppLeS [23], perform scheduling without fully considering the effects of other schedulers and cannot meet the requirements of synchronisation. Moreover, herd behaviour arises when schedulers run without communication and cooperation [24,25]. If the job scheduling task is carried out with full coordination, as in Condor Flock P2P [16], the strong dependency on negotiation can result in a heavy communication load. Therefore, it is important to design a scalable scheduling algorithm for decentralised schedulers with an appropriate coordination cost.
Fig.1 Illustration of resource management in grid computing
To realise prominent adaptability, reinforcement learning (RL) has recently been applied to job scheduling in grid systems [26]. As an important class of machine learning methods, RL can sequentially approximate optimal policies in uncertain dynamic environments [27]. It provides a model-free methodology for adaptive optimal control and is very promising for solving the difficulties of grid job scheduling problems.
Several independent learner (IL) methods have been introduced for grid resource allocation, such as the simple learning selection (SLS) method [10], the R-learning method [28] and hybrid RL [26]. However, such IL methods achieve good performance only in some special cases, for instance small-scale grid systems, because they lack coordination.
To achieve better policies, some centralised RL methods were applied to resource allocation [29–31]. In particular, in [30], a learner agent performs the learning process by collecting all the reward information from the other scheduler agents and then distributes the learned value function back to them. However, the communication cost is high.
To achieve better coordination and policies, some decentralised RL methods were introduced for resource allocation [9,32]. A parallel Q-learning approach was introduced to automate resource allocation [32]; the agents coordinate and accelerate the learning process by sharing Q-value estimates, but the scalability problem remains. An OSL method was proposed for job scheduling problems in grids [9]; all the schedulers learn with an ordinal distributed learning strategy and realise coordination based on an information-sharing mechanism with limited communication. Because both the learning and sharing strategies are ordinal, a directed cyclic topology among all schedulers must be predefined, which may result in a lack of robustness and flexibility.
Besides the above value-function-based learning algorithms, some policy-gradient learning algorithms [12,33] were adopted for job scheduling. A fair action learner (FAL) algorithm was applied to share the resources in a decentralised manner [12]; FAL adopts a policy-gradient ascent method to learn policies and converges slowly. An RL algorithm named the weighted policy learner was introduced for the distributed task allocation problem [33]; however, its convergence rate is also slow.
Therefore, despite the existing achievements, the efficiency and scalability of RL-based scheduling methods still need to be improved in large-scale grids.
3 Problem statement
In general, a grid job scheduling environment consists of a large number of users, schedulers and resources. All nodes are connected through existing routed networks, such as a local network or the Internet. As mentioned above, to achieve better fault tolerance and scalability, a decentralised grid scheduling architecture can be adopted, as depicted in Fig. 2 [9].
In Fig. 2, N schedulers carry out the job scheduling task. L users merely produce and submit jobs to the schedulers, and each scheduler serves some users by delivering their jobs to the M resources for computing. It is assumed that a job only requires CPU resources for completion and can be uniquely characterised by its duration J. Resources execute jobs and differ in their computing capabilities: the capacity C of a resource is defined as the inverse of the time it needs to finish one job of unit length. Furthermore, at any given moment only one job is executed in a resource while the others are queued according to their arrival time. Each scheduler is able to distribute a job to any computing resource.
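For concreteness, the following Python sketch illustrates this model (it is our own illustration, not code from the paper; the class and attribute names are assumptions): a resource with capacity C, a first-come-first-served queue of jobs characterised only by their lengths J, and the resulting load.

```python
import collections

class Resource:
    """A computing resource in the scheduling model of Section 3 (illustrative)."""
    def __init__(self, capacity):
        self.capacity = capacity          # C_j: inverse of the time to finish a unit-length job
        self.queue = collections.deque()  # pending jobs, served one at a time in arrival order

    def submit(self, job_length):
        """A scheduler delivers a job, characterised only by its duration J."""
        self.queue.append(job_length)

    def load(self):
        """LoR_j: total length of the queued jobs divided by the capacity C_j."""
        return sum(self.queue) / self.capacity
```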
Fig.2 Decentralised architecture for grid job scheduling
The above model captures the most important characteristics of the grid environment, i.e. heterogeneity, randomness and a large number of resources and users [10].
Generally speaking, a proper scheduling policy helps to achieve load balancing in distributed computations and thereby to complete all the jobs at hand as soon as possible. To measure the scheduling performance, a competent metric is used, namely the average load of resources (ALoR) [9]. The jth resource's load, LoR_j, is defined as the total length of its queued jobs divided by its capacity C_j. The average load, i.e. ALoR, can be expressed as

$$\mathrm{ALoR}=\frac{1}{M}\sum_{j=1}^{M}\mathrm{LoR}_j,\qquad \mathrm{LoR}_j=\frac{1}{C_j}\sum_{k=1}^{O_j}J_k$$
where O_j denotes the number of jobs in the jth resource's queue, J_k the length of the kth queued job, and M the number of resources. The standard deviation of LoR, i.e. DLoR, is expressed as

$$\mathrm{DLoR}=\sqrt{\frac{1}{M}\sum_{j=1}^{M}\bigl(\mathrm{LoR}_j-\mathrm{ALoR}\bigr)^{2}}$$
The above two metrics together reflect the efficiency and fairness of a scheduling policy. Therefore, the objective of grid job scheduling algorithms is to minimise both metrics simultaneously.
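A minimal sketch of the two metrics, assuming the per-resource loads LoR_j have already been computed as queued job length divided by capacity:

```python
import math

def alor_dlor(loads):
    """loads: per-resource LoR_j values (queued job length divided by capacity C_j)."""
    m = len(loads)
    alor = sum(loads) / m                                      # ALoR: mean load over the M resources
    dlor = math.sqrt(sum((l - alor) ** 2 for l in loads) / m)  # DLoR: standard deviation of the loads
    return alor, dlor

# three resources with loads 2.0, 3.0 and 4.0 -> ALoR = 3.0, DLoR ~= 0.816
print(alor_dlor([2.0, 3.0, 4.0]))
```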
4 GRL algorithm with gossip-based cooperative learning
Many job scheduling algorithms use global knowledge, i.e. the schedulers obtain the load information directly from the Grid Information System (GIS) and make the corresponding decisions. However, due to the large scale, decentralised architecture and dynamics of grid systems, the global information obtained from a GIS is always seriously delayed. Furthermore, as the scale of grid systems increases, the cost of building and maintaining a usable GIS becomes more and more expensive. Therefore, a more robust and adaptive scheduling algorithm without a GIS, or even without any prior knowledge, is required. Fortunately, such a requirement can be fulfilled by using reinforcement learning techniques [27].
4.1 RL for multi-agent systems
RL helps agents learn an optimal policy from experience data gathered by interacting with the environment. Single-agent RL algorithms depend on the model of Markov decision processes [34]. As an extension of RL to distributed decision-making environments, MARL methods adopt the model of stochastic games (SGs) [34], e.g. JAL [35], Team-Q [36], OAL [37] and so on. In an SG model, the joint state space and joint action space are considered to ensure the Markovian and stationarity properties. Nevertheless, such MARL algorithms may incur huge computational burdens, since the joint learning space expands exponentially with the number of agents. Besides a few small-scale applications, SG-based MARL methods usually perform poorly in real-world applications, let alone in large-scale job scheduling problems.
Different from SG-based MARL, another technique called the independent learner (IL) was developed [10,35] and applied in MARL [38]. IL-based MARL algorithms directly adopt a standard single-agent RL technique in each agent, so each agent learns independently according to local states and local rewards. Although IL-based agents do not need to explore the exponentially growing joint state-action space, their environments are no longer stationary, which can cause convergence problems and oscillatory behaviours. For example, when simple IL-based MARL was applied to job scheduling in grids without coordination, herd behaviour was observed [8].
To deal with the above difficulties, a promising solution for adopting MARL in grid job scheduling is to combine the IL technique with autonomous coordination techniques so as to meet the requirements of efficiency and optimality. Following this idea, a new MARL method named gossip-based reinforcement learning (GRL) is presented in the following subsection.
4.2 GRL method for job scheduling
The GRL method has two main features. First, it adopts a decentralised scheduling structure and an IL strategy based on a bandit-like model in each agent, so the learning complexity can be reduced remarkably. Second, it makes use of a gossip mechanism to achieve coordination, so the coordination relationship is relaxed and the communication cost is limited.
4.2.1 ILs based on bandit-like models: (i) The scheduler agent: The schematic diagram of each scheduler agent in the grid system is depicted in Fig. 3.
In Fig. 3, each scheduler agent consists of two parts, i.e. a Learner and an Actor. The Learner takes charge of action decision, reward conversion, policy learning and coordination. The Actor takes charge of receiving, queuing and submitting jobs. The scheduler agent receives jobs from users and queues them in the Jobs Submission module; the Action Decision module then selects certain resources to serve the jobs according to the Value Function. The Jobs Submission module executes the decisions and submits the jobs to the selected resources. The States Watch module monitors the job-submission and job-completion processes, and the Reward Converter module creates proper rewards according to the job states. Finally, the Learner uses the reward information to update the Value Function; the Sharing module also uses received value functions to update the internal Value Function. The Value Function module maintains a simple utility table to estimate the state of the resources. The following subsections present the GRL approach in more detail.
Generally speaking, there are two key problems for a scheduler agent adopting the IL technique. First, proper rewards for learning are hard to obtain directly, so how to calculate rewards from the available information must be solved. Second, although the global LoR information (i.e. {LoR_j}, j = 1, ..., M) lays the foundation for scheduling decisions, it is hard to obtain without a GIS, so how to build a local value function that approximates the global LoR information is another problem to be solved.
(ii) The mechanism for creating reward signals: In the decentralised scheduling model, the environment provides no global signals, only local information about job completion; therefore, a reward signal must be extracted from the obtainable information, such as the cached job-submission information and the job-completion information. In GRL, a reward-creation mechanism is designed with two kinds of signals, i.e. the instantaneous reward signal and the periodical reward signal.
Fig.3 Schematic diagram of a GRL-based scheduler agent
The instantaneous reward signal r_inst(j) is created immediately after action j is executed, i.e. after a job is submitted to the jth resource:

$$r_{\mathrm{inst}}(j)=-1$$
The periodical reward signal r_peri(j) is designed based on two rules. First, the reward mechanism works at fixed time intervals, even if no jobs arrive during an interval. Second, at the end of each interval, agent g_i obtains a reinforcement signal for action j according to the states of the cached jobs, i.e.

$$r^{k}(j)=\begin{cases}+1, & \text{if the }k\text{th cached job submitted to resource }j\text{ has been completed}\\ -1, & \text{if it is still uncompleted}\end{cases},\qquad j=1,\ldots,M$$
where M is the number of resources. When several jobs were submitted to the same resource, every job contributes its individual reinforcement signal. Finally, the reward of an action in each period is obtained by summing up all of these signals:

$$r_{\mathrm{peri}}(j)=\sum_{k=1}^{K_j}r^{k}(j)$$
where K_j represents the number of jobs submitted by this scheduler that are currently being served by the jth resource. For instance, if an agent has sent three jobs to the first resource and, after one time interval, only one job is finished while the other two are still uncompleted, the periodical reward for the action of choosing the first resource is r_peri(1) = 1 + 2 × (−1) = −1.
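The periodical reward can thus be computed by a simple scan over the jobs this scheduler has cached. The sketch below is illustrative only (the job-record fields are our own) and reproduces the example from the text:

```python
def periodical_reward(cached_jobs, resource_id):
    """+1 for every completed and -1 for every still-pending cached job that this
    scheduler submitted to the given resource (the K_j jobs in the text)."""
    return sum(1 if job["finished"] else -1
               for job in cached_jobs if job["resource"] == resource_id)

# the example from the text: three jobs sent to resource 1, one finished after the interval
cached = [{"resource": 1, "finished": True},
          {"resource": 1, "finished": False},
          {"resource": 1, "finished": False}]
print(periodical_reward(cached, 1))  # -> -1, i.e. 1 + 2 * (-1)
```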
(iii) MARL based on the bandit-like model: After generating the reward signal, a proper value-function representation is the key problem. In GRL, agent g_i adopts a bandit-like model and keeps a utility table U_i to estimate the LoR of the resources, where U_i(j) represents the evaluation of the jth resource's load, j ∈ {1, 2, ..., |S|} = {1, 2, ..., M}; in other words, U_i(j) gives a score for the action of choosing the jth resource from the resource set S = {s_1, s_2, ..., s_M}. The utility table for scheduler g_i is depicted in Fig. 4.
During every time interval, agent g_i executes the resource selection operation and updates the utility table in the following two steps:
Step 1: Agent g_i checks whether a new job has arrived. If not, it goes to Step 2 directly; if so, it repeats Step 1 until all jobs are submitted. The agent selects the jth resource to serve a job with a greedy-in-the-limit-with-infinite-exploration (GLIE) strategy, i.e.

$$j=\begin{cases}\arg\max_{k\in\{1,\ldots,M\}}U_i(k), & \text{with probability }1-\varepsilon_1\\ \text{a resource chosen uniformly at random}, & \text{with probability }\varepsilon_1\end{cases}$$
Fig.4 Mapping between resources and utility table Ui
where ε_1 is the exploration probability, which decays with a factor γ at every step, i.e. ε_1 ← γ × ε_1. After the jth action has been executed, the current job is submitted to the jth resource and recorded as an unfinished job in the scheduler. The instantaneous reward r_inst(j) = −1 is obtained, and the utility value of this action, U_i(j), is updated immediately:

$$U_i(j)\leftarrow(1-\alpha)\,U_i(j)+\alpha\,r_{\mathrm{inst}}(j)$$
where α denotes the learning rate.
Step 2: Agent g_i calculates the whole periodical reward vector (r_peri(1), r_peri(2), ..., r_peri(M)) and updates the utility table:

$$U_i(j)\leftarrow(1-\alpha)\,U_i(j)+\alpha\,r_{\mathrm{peri}}(j),\qquad j=1,\ldots,M$$
At this point, agent g_i can use the utility table U_i to estimate the load of all the resources. For example, if a resource's queue is very long or its capacity is very poor after a scheduler has submitted a job to it, the scheduler must wait a long time to receive the completion response from that resource. Consequently, the scheduler receives the reward signal −1 far more often than the reward +1, and the resulting utility value for this resource becomes very small. Obviously, a larger utility value denotes a better state of the resource.
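Putting Steps 1 and 2 together, a single scheduler's learner reduces to a small utility table with ε-greedy (GLIE) selection and a blending update. The sketch below is illustrative only; the class name is ours, the default parameter values are taken from the experimental settings in Section 5, and the exponential-moving-average update follows the reconstructed update rule above:

```python
import random

class BanditScheduler:
    """Independent learner with a utility table U_i over M resources (illustrative)."""
    def __init__(self, num_resources, alpha=0.5, eps=0.9, gamma=0.996):
        self.U = [0.0] * num_resources   # U_i(j): score of choosing resource j
        self.alpha, self.eps, self.gamma = alpha, eps, gamma

    def select(self):
        """Step 1: GLIE (epsilon-greedy with decaying epsilon) resource selection."""
        if random.random() < self.eps:
            j = random.randrange(len(self.U))                     # explore
        else:
            j = max(range(len(self.U)), key=lambda k: self.U[k])  # exploit
        self.eps *= self.gamma                                    # epsilon_1 <- gamma * epsilon_1
        return j

    def update(self, j, reward):
        """Blend a reward (r_inst or r_peri) into U_i(j) with learning rate alpha."""
        self.U[j] = (1 - self.alpha) * self.U[j] + self.alpha * reward
```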
In this decentralised scheduling architecture,all scheduler agents make decisions independently and simultaneously.Obviously,the decision of the scheduler will change the load of certain resources,but other schedulers cannot detect the change before submitting jobs to the same resource.To improve the estimation accuracy,some effective coordination mechanisms need to be considered,such as information sharing among individuals.
4.2.2 Multi-agent coordination based on the gossip mechanism: Since each agent keeps a local utility table to estimate the LoR of the resources, it is reasonable to improve the accuracy of these estimates by sharing such knowledge. However, the large scale of the grid makes it impractical for schedulers to share their whole utility tables directly and frequently. Therefore, in GRL, a gossip-based sharing mechanism is proposed to solve this problem.
(i) The gossip mechanism: The advent of decentralised networks, such as sensor, wireless ad hoc and peer-to-peer networks, has motivated the design of gossip algorithms, which have become a way to maintain scalability and simplicity while achieving fault tolerance [15]. A gossip mechanism resembles the spread of epidemics but offers a high degree of fault tolerance and self-stability. In addition, it is simple and usually produces only moderate or negligible communication cost, even compared with the best deterministic protocols [8].
The large-scale decentralised grid strongly resembles the above distributed networks, so the gossip mechanism can be adopted to achieve the desired performance as quickly and efficiently as possible, as well as to be robust against changes in the network topology. In GRL, the gossip mechanism is adopted as follows (depicted in Fig. 5). After a scheduler agent submits a job to the jth resource, it starts a gossip process and shares the corresponding utility value U_i(j) with some immediate neighbours chosen randomly. Each neighbour scheduler retransmits the received U_i(j) in the same manner, except back to its father node. If a scheduler receives the same utility value again, it does not use it but retransfers it to other nodes randomly. This process typically terminates after a number of rounds of information exchange. As depicted in Fig. 5, in a grid with 15 schedulers, the utility value can be shared by all schedulers after a gossip with depth 3.
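A rough sketch of this gossip dissemination follows; it is illustrative only, and the neighbourhood representation, function name and callback are our own assumptions. The value is forwarded for a fixed number of rounds (the depth) to a few randomly chosen neighbours per hop (the width), never back to the father node, and is used only on first receipt:

```python
import random

def gossip(neighbours, origin, value, width, depth, apply_share):
    """Spread one utility value for `depth` rounds, forwarding it to at most `width`
    random neighbours per hop and never back to the father node; a value already
    seen is not used again but is still retransmitted."""
    frontier = [(origin, None)]            # (node, father) pairs active in this round
    seen = {origin}
    for _ in range(depth):
        nxt = []
        for node, father in frontier:
            targets = [n for n in neighbours[node] if n != father]
            for k in random.sample(targets, min(width, len(targets))):
                if k not in seen:
                    apply_share(k, value)  # merge the value into scheduler k's table
                    seen.add(k)
                nxt.append((k, node))
        frontier = nxt
```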
(ii) Multi-agent coordination based on different sharing strategies: With the help of the above gossip mechanism, a utility value U_i(j) can be shared with the agents g_k that receive the gossip information. Defining such agents as a set G_s, the sharing process is defined as follows:

$$U_k(j)\leftarrow f\bigl(U_k(j),\,U_i(j)\bigr),\qquad \forall g_k\in G_s$$
Fig.5 Gossip mechanism for information sharing
where k denotes the kth agent sharing the utility value through the gossip mechanism, U_k is the utility table of agent g_k and f(·,·) is the sharing strategy. There are two strategies for sharing the received utility value, i.e. the linear sharing strategy and the minimal sharing strategy. (i) Linear sharing strategy: The received utility value is shared with a weighted linear combination as follows:

$$U_k(j)\leftarrow(1-\beta)\,U_k(j)+\beta\,U_i(j)$$
where β is a fixed sharing factor. (ii) Minimal sharing strategy: The received utility value is shared with a nonlinear mean as follows:

$$U_k(j)\leftarrow\min\bigl(U_i(j),\,U_k(j)\bigr)$$
where min(·) returns the smaller of U_i(j) and U_k(j). The adopted gossip-based coordination algorithm is therefore quite simple and is depicted in Fig. 6, where the parameter ε_g controls the gossip probability after a utility value is received. The number of neighbouring schedulers sharing the utility value is defined as the gossip width W. To ensure the validity of the information dissemination [39], the gossip width W should be set to no less than ln(N).
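The two sharing rules themselves are one-liners. The sketch below assumes the linear rule is the weighted blend given above (with β the sharing factor) and the minimal rule keeps the smaller, i.e. more pessimistic, estimate:

```python
def linear_share(u_own, u_received, beta=0.5):
    """Linear sharing: blend the local estimate with the received one."""
    return (1 - beta) * u_own + beta * u_received

def minimal_share(u_own, u_received):
    """Minimal sharing: keep the smaller (more pessimistic) estimate."""
    return min(u_own, u_received)
```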
The number of gossip rounds is defined as the gossip depth D. To share the information efficiently, the selection of the gossip depth D is very important. Given W and D, the total number of gossiped schedulers T can be calculated by

$$T=\sum_{d=0}^{D}W^{d}=\frac{W^{D+1}-1}{W-1}$$
Since the information can be shared efficiently only when the inequality T ≥ N is satisfied, a lower bound of D, i.e. D_LB, can be calculated approximately as

$$D_{\mathrm{LB}}=\mathrm{ceil}\!\left(\frac{\ln N}{\ln W}\right)$$
The function ceil(X) returns the smallest integer that is not less than the floating-point value X. Having presented the IL-based learning method and the gossip-based coordination mechanism, the GRL algorithm for grid job scheduling is given in Fig. 7.
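As a quick check of the bound, a small helper (a hypothetical function of our own) reproduces D_LB = 3 for the experimental setting in Section 5 (N = 50 schedulers, gossip width W = 4):

```python
import math

def gossip_depth_lower_bound(num_schedulers, width):
    """D_LB ~= ceil(ln N / ln W): smallest depth for which the number of reachable
    schedulers T = 1 + W + ... + W^D is at least N."""
    return math.ceil(math.log(num_schedulers) / math.log(width))

print(gossip_depth_lower_bound(50, 4))  # -> 3, matching the setting in Section 5
```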
Fig.6 Gossip-based coordination algorithm for agent gi
Compared with other MARL algorithms, the GRL algorithm is more effective and easier to implement in large-scale applications. As a method based on a bandit-like model, it has no explicit state variable in the utility-table function. Therefore, it can achieve better robustness and adaptability for grid scenarios in which both resources and applications are highly diverse and dynamic. Moreover, the moderate communication cost and high fault tolerance of the gossip-based coordination are two further prominent merits.
4.3 Time consumption (TC) of the GRL algorithm
The time consumption of the GRL algorithm, denoted TC, consists of the computation of all steps in Algorithm 2. It takes M + 1 effort in Step 1, C × (W × (M + 3) + 5) effort in Step 2, K + 1 effort in Step 3, K × M × (C + 1)/N effort in Step 4, K × C/N effort in Step 5, K × M × (C + 1)/N effort in Step 6 and K effort in Step 7. Hence, the total TC can be calculated approximately as

$$\mathrm{TC}\approx(M+1)+C\bigl(W(M+3)+5\bigr)+(K+1)+\frac{KM(C+1)}{N}+\frac{KC}{N}+\frac{KM(C+1)}{N}+K$$

where K is the maximal number of iteration steps in the learning process, C is the total number of jobs submitted by all users in one learning time interval, N is the number of schedulers, M is the number of resources and W is the gossip width.
Fig.8 Results with linear sharing strategy under different gossip depths and system loads
5 Performance evaluations and comparisons
In this section, the proposed GRL method is evaluated and analysed in different experiments. First of all, a grid system for testing was configured and simulated.
The scale of a grid system can be described mainly by (L, N, M), i.e. the number of users L, the number of schedulers N and the number of resources M. In the following experiments, the system scale is configured as (400, 50, 1000), which is large enough to represent a common decentralised grid. The system load l_system is jointly determined by the total disposal capacity of the resources and the total number of arriving jobs; apparently, the system load should not exceed 100%. The jobs' lengths are distributed over [10, 990], and the resources' capacities over [50, 350].
5.1 Performance evaluations for linear sharing strategy
First, the scheduling performance of the GRL method with the linear sharing strategy was evaluated. To compromise between new and old experiences, as well as between one's own and others' experiences, both the learning rate α and the sharing factor β are set to 0.5. The coefficients of the GLIE strategy, i.e. the exploring coefficient ε_1 and the decaying factor γ, are initialised as 0.9 and 0.996, respectively. The control parameter ε_g, which controls the percentage of agents sharing the gossiped information, is set to 0, 10, 42.5 and 100% for D = 0, 1, 2 and 3, respectively. The time interval for simulation and learning is set to 0.1 s. To ensure the validity of sharing, the gossip width W is set to 4. To evaluate the new method's scheduling performance, experiments under different gossip depths D, i.e. 0, 1, 2 and 3, were carried out. By selecting proper job arrival rates, three kinds of system loads, namely 50, 75 and 90%, are simulated. The performance comparisons in terms of ALoR/DLoR curves for the GRL method are shown in Fig. 8.
The convergent ALoR/DLoR curves show that the GRL method can schedule the jobs and achieve load balancing effectively under different system loads. Moreover, with increasing gossip depth, the utility values are shared by more agents and the GRL method generally performs better. However, schedulers with D = 0 can achieve better scheduling performance under some conditions: they work better than the ones with D = 3 under low load (50%), approximately as well under medium load (75%), and worse under high load (90%). This is because the reward mechanism of the GRL method introduces a hidden coordination by adopting periodical rewards to estimate all schedulers' states; however, such a coordination mechanism works well only under a low system load. As the system load becomes higher, a more effective coordination mechanism, such as the gossip-based linear sharing strategy, is needed.
Moreover, it can be seen that the ALoR/DLoR values increase quickly at the beginning, then decrease, and level off after reaching a peak. This is because the GRL-based schedulers have no prior knowledge of the grid at first, so their performance is initially unsatisfactory; good scheduling performance is achieved in the end, after enough experience has been accumulated. Fig. 9 gives a snapshot of the LoR of all resources when the scheduling performance has levelled off. The corresponding system load and gossip depth are 50% and 3, respectively.
Additional experiments show that the scheduling performance is not improved by a gossip depth greater than the lower bound D_LB (i.e. 3). This is because sharing with gossip depth D_LB already reaches all the schedulers; a gossip depth exceeding D_LB results in nothing more than higher communication and computation cost.
5.2 Performance evaluations for minimal sharing strategy
Fig.9 Snapshot for LoR in resources
Besides the linear sharing strategy, a minimal sharing strategy can be used to coordinate the scheduling process. In the following, experiments for the GRL method with the minimal sharing strategy are performed in the same way as in the above section. The ALoR/DLoR results are shown in Fig. 10.
The convergent ALoR/DLoR curves show that GRL-based selection (GRLS) with the minimal sharing strategy can also deal with the job scheduling effectively. Compared with the results in Fig. 8, the GRL method with the minimal sharing strategy performs worse under almost all conditions except D = 3; for D = 3, the minimal sharing strategy is slightly better than the linear sharing strategy, by a margin of around 10%. Obviously, the linear sharing strategy utilises the shared information more fully, whereas the minimal sharing strategy merely makes use of the pessimistic information. Therefore, in the following, the GRL method with the linear sharing strategy is adopted for the experiments, comparison and analysis.
5.3 Robustness evaluation under coordination communication failure
Some communication failures may occur during the gossip process and result in coordination failure; once such failures occur, the number of schedulers that successfully share the gossiped information decreases. To simulate and test such cases, the parameter ε_g in Algorithm 1 is used to control the number/percentage of successful sharing. In the following experiments, two scheduling tasks with D = 2 and D = 3 under different system loads, i.e. 50 and 75%, were executed. Because of the communication failure, the percentage of success for D = 2 decreased from 42.5 to 25%, and that for D = 3 decreased from 100 to 75%. The contrastive results are shown in Fig. 11, where the ALoR curves for the tasks without communication failure are marked as 'D = 2' or 'D = 3' and the degraded ones are marked as 'D = 2, F' or 'D = 3, F', respectively.
The resulting curves show that the scheduling performance of the GRL method degrades to a certain extent because of the sharing failures. However, the GRL method is still robust enough to handle the scheduling task and finally converges.
5.4 Performance comparisons among different scheduling rules
Moreover, the GRLS rule is compared with other job-scheduling rules, namely random selection (RS), least-load selection (LLS), SLS [10] and ordinal sharing learning selection (OSLS) [9]. An agent with the RS rule selects a resource randomly. An agent with the LLS rule selects the least-loaded resource; the LLS rule assumes that all agents can obtain global resource information in real time, e.g. from an ideal GIS. With the SLS rule, an agent learns with a simple RL method; unlike the GRLS rule, an SLS-based agent does not update its value function until it has received the job-completion signals, and each agent learns without coordination [10]. The OSLS rule learns in a similar way to the GRLS rule but differs in the sharing strategy: the OSLS rule shares the utility values along a directed cyclic topology among the schedulers, whereas the GRLS rule does so via the relaxed gossip mechanism. For a comprehensive evaluation and comparison, all methods were tested under the same conditions. Three kinds of system loads are simulated, namely 50, 75 and 90%. The grid system and the other parameters are the same as in the former experiments. The comparisons of the ALoR curves under different system loads are shown in Fig. 12.
Fig.10 Results with minimal sharing strategy under different gossip depths and system loads
The results show that both the RS rule and the SLS rule fail in the job-scheduling task, and the performance of the RS rule is even worse than that of the SLS rule. RS-based agents do not take efficiency into account at all and choose resources randomly, so the LoR of resources with low capacities increases without bound. SLS-based agents learn only from the job-completion information; however, the reward of a job completion is sparse and seriously delayed, so the scheduler cannot follow the dynamics of the grid system. Moreover, SLS-based schedulers learn and work independently, regardless of the coordination problem. This behaviour leads to the over-utilisation of some resources and the under-utilisation of others, i.e. herding behaviour [24]. The centralised LLS-based scheduler achieves the best scheduling performance, but at an expensive cost in computing and communication. The OSL-based agents achieve good job-scheduling performance under all conditions because they adopt a proper learning model and sharing mechanism. However, the proposed GRL rule achieves better scheduling performance than the OSL rule in a decentralised, robust and low-cost way.
Fig.11 ALoR with linear sharing strategy under different gossip depths and system loads
Fig.12 ALoR comparison for different scheduling rules
Table 1 Comparison of different rules
Table 1 summarises and compares the characteristics of all the rules.
According to the above evaluation and analysis,the proposed GRL method provides an effective,efficient and robust solution for scheduling scenarios without an expensive GIS or high network overhead.
6 Conclusions
Grid computing uses distributed heterogeneous resources to process large-scale or complex computing tasks, and appropriate job scheduling algorithms are crucial to the success of grid applications. In this paper, a novel MARL method, the GRL method, is proposed for job scheduling in grids. The GRL method avoids the scalability limitation by adopting a decentralised structure and achieves optimised performance by adopting a gossip-based learning strategy. Simulation results show that the GRL method can achieve load balancing effectively with moderate communication cost. Future work includes refining and applying the method to real grid problems, as well as verifying the algorithm's robustness in changing network environments.
7 Acknowledgment
This work was supported by the National Natural Science Foundation of China under grant U1564214.
[1]Joseph,J.,Fellenstein,C.:‘Grid computing’(Prentice-Hall,Upper Saddle River,NJ,USA,2003)
[2]Chang,J.S.,Chang,R.S.:‘A probabilistic cost estimation model for R fid real-time applications in data grids’,Int.J.Innov.Comput.,Inf.Control,2010,6,(12),pp.5421–5438
[3]Dong,F.P.,Akl,S.G.:‘Scheduling algorithms for grid computing:state of the art and open problems’.Technical Report No.2006-504,School of Computing,Queen’s University,Kingston,Ontario,Canada,January 2006
[4]Xhafa,F.,Abraham,A.:‘Computational models and heuristic methods for grid scheduling problems’,Future Gener.Comput.Syst.,2010,26,pp.608–621
[5]Izakian,H.,Ladani,B.,Abraham,A.,et al.:‘A discrete particle swarm optimization approach for grid job scheduling’,Int.J.Innov.Comput.,Inf.Control,2010,6,(9),pp.4219–4233
[6]Chung,W.C.,Chang,R.S.:‘A new mechanism for resource monitoring in grid computing’,Future Gener.Comput.Syst.,2009,25,pp.1–7
[7]Lin,J.F.:‘Performance analysis and discussion on a heuristic approach for schedulingmultiprocessortasksinagridcomputingenvironment’,Int.J.Innov.Comput.,Inf.Control,2010,6,(12),pp.5451–5462
[8]Brugnoli,M.,Heymann,E.,Senar,M.A.,et al.:‘Grid scheduling based on collaborative random early detection strategies’.18th Euromicro Conf.Parallel,Distributed and Network-based Processing,Pisa,Italy,February 2010,pp.35–42
[9]Wu,J.,Xu,X.,Zhang,P.C.,et al.:‘A novel multi-agent reinforcement learning approach for job scheduling in grid computing’,Future Gener.Comput.Syst.,2011,27,(5),pp.430–439
[10]Galstyan,A.,Czajkowski,K.,Lerman,K.,et al.:‘Resource allocation in the grid with learning agents’,J.Grid Comput.,2005,3,(1–2),pp.91–100
[11]Vengerov,D.:‘A reinforcement learning approach to dynamic resource allocation’,Eng.Appl.Artif.Intell.,2007,20,(3),pp.383–390
[12]Zhang,C.J.,Lesser,V.P.,Shenoy,P.,et al.:‘A multi-agent learning approach to online distributed resource allocation’.Proc.of the IJCAI2009,Pasadena,CA,USA,July 2009,pp.361–366
[13]Ganesh,A.J.,Kermarrec,A.,Massoulié,L.,et al.: ‘Peer-to-peer membership management for gossip-based protocols’,IEEE Trans.Comput.,2003,52,(2),pp.139–149
[14]Kermarrec,A.,Massoulié,L.,Ganesh,A.J.,et al.: ‘Probabilistic reliable dissemination in large-scale systems’,IEEE Trans.Parallel Distrib.Syst.,2003,14,(3),pp.248–258
[15]Boyd,S.,Ghosh,A.,Prabhakar,B.,et al.:‘Randomized gossip algorithms’,IEEE Trans.Inf.Theory,2006,52,(6),pp.2508–2530
[16]Butt,A.R.,Zhang,R.,Hu,Y.C.,et al.: ‘A self-organizing flock of condors’,J.Parallel Distrib.Comput.,2006,66,(1),pp.145–161
[17]Lamanna,M.: ‘The LHC computing grid project at CERN’,Nucl.Instrum.Methods Phys.Res.A.,2004,534,(1–2),pp.1–6
[18]Vasilakos,X.,Sacha,J.,Pierre,G.,et al.:‘Decentralized as-soon-as-possible grid scheduling:a feasibility study’.Proc.19th Int.Conf.Computer Communications and Networks(ICCCN),Zurich,Switzerland,August 2010,pp.1–6
[19]Krauter,K.,Buyya,R.,Maheswaran,M.,et al.:‘A taxonomy and survey of grid resource management systems for distributed computing’,Softw.-Pract.Exp.,2002,32,(2),pp.135–164
[20]Mohamed,H.,Epema,D.: ‘KOALA:aco-allocatinggridscheduler’,Concurrency Comput.Pract.Exp.,2008,20,(16),pp.1851–1876
[21]Huedo,E.,Montero,R.S.,Llorente,I.M.,et al.:‘The gridway framework for adaptive scheduling and execution on grids’,Scalable Comput.Pract.Exp.,2005,6,(3),pp.1–8
[22]Frey,J.,Tannenbaum,T.,Livny,M.,et al.:‘Condor-G:a computation management agent for multi-institutional grids’.Cluster Computing,2002,5,(3),pp.237–246
[23]Berman,F.,Wolski,R.,Shao,G.,et al.:‘Adaptive computing on the grid using AppLeS’,IEEE Trans.Parallel Distrib.Syst.,2003,14,(4),pp.369–382
[24]Cirne,W.,Berman,F.:‘When the herd is smart:aggregate behavior in the selection of job request’,IEEE Trans.Parallel Distrib.Syst.,2003,14,(2),pp.181–192
[25]Zheng,Q.,Yang,H.,Sun,Y.,et al.:‘How to avoid herd:a novel stochastic algorithm in grid scheduling’.Proc.15th IEEE Symp.on High Performance Distributed Computing,Paris,France,June 2006,pp.267–278
[26]Tesauro,G.,Jong,N.K.,Das,R.,et al.:‘A hybrid reinforcement learning approach to autonomic resource allocation’.IEEE Int.Conf.on Autonomic Computing,2006(ICAC ’06),Dublin,Ireland,June 2006,pp.65–73
[27]Sutton,R.S.,Barto,A.G.:‘Reinforcement learning:an introduction’(MIT Press,Cambridge,Massachusetts,1998)
[28]Zhang,Z.,Zheng,L.,Li,N.,et al.:‘Minimizing mean weighted tardiness in unrelated parallel machine scheduling with reinforcement learning’,Comput.Oper.Res.,2012,39,(7),pp.1315–1324
[29]Zhao,T.,Zheng,X.,Li,K.,et al.:‘Proactive scheduling in distributed computing-a reinforcement learning approach’,J.Parallel Distrib.Comput.,2014,74,(7),pp.2662–2672
[30]Moradi,M.:‘A centralized reinforcement learn method for multi-agent job scheduling in grid’.Proc.of the TCCKE 2016,Ferdowsi University of Mashhad,20–21 October 2016,pp.171–176
[31]Arabnejad,H.,Pahl,C.,Jamshidi,P.,et al.:‘A comparison of reinforcement learning techniques for fuzzy cloud auto-scaling’.IEEE ACM Int.Symp.on Cluster,Cloud and Grid Computing,Madrid,Spain,May 2017,pp.64–73
[32]James,D.,Enda,H.,Duggan,J.,et al.:‘Applying reinforcement learning towards automating resource allocation and application scalability in the cloud’,Concurrency Comput.Pract.Exp.,2013,25,(12),pp.1656–1674
[33]Abdallahy,S.,Lesser,V.:‘Learning the task allocation game’.Proc.of the fifth AAMAS,Japan,8–12 May 2006,pp.850–857
[34]Busoniu,L.,Babuska,R.,De Schutter,B.,et al.:‘A comprehensive survey of multi-agent reinforcement learning’,IEEE Trans.Syst.Man Cybern.,2008,38,(2),pp.156–172
[35]Claus,C.,Boutilier,C.:‘The dynamics of reinforcement learning in cooperative multiagent systems’.Proc.of the 15th AAAI/IAAI-98,Madison,USA,26–30 July 1998,pp.746–752
[36]Littman,M.L.: ‘Value-function reinforcement learning in Markov games’,J.Cogn.Syst.Res.,2001,2,(1),pp.55–66
[37]Wang,X.,Sandholm,T.:‘Reinforcement learning to play an optimal Nash equilibrium in team Markov games’.Advances in Neural Information Proc.Systems 15(NIPS’02),Vancouver,2003,pp.1571–1578
[38]Wang,Y.,Silva,C.W.:‘Multi-robot box-pushing:single-agent Q-learning vs.team Q-learning’.Proc.of the IROS2006,Beijing,China,2006,pp.3694–3699
[39]Eugster,P.T.,Guerraoui,R.,Kermarrec,A.,et al.:‘From epidemics to distributed computing’,IEEE Comput.,2004,37,(5),pp.60–67