
N-SVRG: Stochastic Variance Reduction Gradient with Noise Reduction Ability for Small Batch Samples

2022-04-12

Haijie Pan and Lirong Zheng

School of Information Science and Engineering, Fudan University, Shanghai 200433, China

ABSTRACT The machine learning model converges slowly and trains unstably because of the large variance introduced by estimating the gradient from a single random sample in SGD. To this end, we propose a noise reduction method for the Stochastic Variance Reduction Gradient (SVRG), called N-SVRG, which uses a small batch of samples instead of all samples for the average gradient calculation, while performing an incremental update of the average gradient. In each round of iteration, a small batch of samples is randomly selected for the average gradient calculation, and during the internal iterations the average gradient is updated by discarding the gradients computed with past models. By suitably reducing the batch size B, both the memory storage and the number of iterations can be reduced. The experiments are compared with the state-of-the-art Mini-Batch SGD, AdaGrad, RMSProp, SVRG and SCSG, and they demonstrate that N-SVRG outperforms SVRG and SASG and is on par with SCSG. Finally, by exploring the relationship between the small values of the different parameters n, B and k and the effectiveness of the algorithm, we show that our N-SVRG algorithm is stable and can achieve sufficient accuracy even with a small batch size. The advantages and disadvantages of the various methods are compared experimentally, and the stability of N-SVRG is explored through parameter settings.

KEYWORDS Machine learning; SGD; SVRG; memory storage

1 Introduction

The variance introduced by the stochastic nature of the SGD algorithm has become a central problem for optimization algorithms. Because of this variance, SGD reaches only a sublinear convergence rate with a fixed step size [1], while the accuracy of a stochastic algorithm is positively related to the sampling variance: when the variance tends to 0, the deviation of the algorithm also tends to 0. In that case, SGD can still converge quickly even with a large step size. Therefore, how to reduce the variance of the stochastic gradient has become an important research issue [2].

For the SGD variance problem, there are currently three mainstream methods to reduce the sampling variance: importance sampling, stratified sampling and the control variable method. The objective function in machine learning is usually solved using Batch Gradient Descent (BGD) or SGD [3]. The BGD algorithm computes the gradients of all samples at each iteration to perform the weight update, whereas SGD randomly selects one training sample at a time and updates the parameters from that sample's gradient. The improved Mini-Batch SGD (MBGD) algorithm followed, in which each iteration randomly selects m data samples from the original data to compute the gradient and perform the weight update, as sketched below. SGD has the advantage that each step relies only on the gradient of a single random sample, so its computational cost is only a fraction of that of standard GD [4]. Its disadvantage is that, with a constant step size, the variance introduced by the randomness leads to slow convergence.
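To make these update rules concrete, the following minimal sketch implements mini-batch SGD with a hypothetical per-sample gradient function `grad_fn(w, x, y)`; the function names and default values are illustrative assumptions, not taken from the paper. Setting `batch_size=1` recovers plain SGD, while `batch_size=n` recovers BGD.

```python
import numpy as np

def minibatch_sgd(grad_fn, w0, data, lr=0.1, batch_size=32, epochs=10, seed=0):
    """Plain MBGD: each step uses the average gradient of a random batch of samples."""
    rng = np.random.default_rng(seed)
    w = np.array(w0, dtype=float)
    n = len(data)
    for _ in range(epochs):
        order = rng.permutation(n)                        # shuffle once per epoch
        for start in range(0, n, batch_size):
            batch = [data[i] for i in order[start:start + batch_size]]
            g = np.mean([grad_fn(w, x, y) for x, y in batch], axis=0)
            w = w - lr * g                                # constant step; gradient noise limits the rate
    return w
```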

In recent years, many scholars have built on the SGD algorithm. The momentum algorithm [5], for example, is based on the idea of accumulating gradient momentum and uses an exponentially weighted average to reduce the oscillation of the gradient. It introduces the two concepts of velocity and friction: the gradient changes the velocity, and friction gradually reduces it, so that the growth and decay of momentum over the whole training process drive the method toward convergence. AdaGrad (Adaptive Gradient) learns by gradually decreasing the step size [6]: by accumulating squared gradients, the learning rate of each weight is related to its previous gradient values. The consequence of this accumulation, however, is that the step size keeps shrinking as training proceeds and training eventually stalls. To address this stagnation problem, Hinton proposed RMSProp (Root Mean Square Prop), which computes the accumulated gradient with a moving average over a single window, so that the step size adapts to the current gradient and better optimization is achieved [7,8]; both update rules are sketched below. For the variance problem introduced by the randomness of SGD, SVRG corrects the gradient used in each model update with the global average gradient information [9]. Theoretical analysis and experiments show that SVRG attains linear convergence by reducing the variance. Subsequently, Zhao et al. [10] proposed the SASG (Stochastic Average Gradient Average) algorithm, which shares the main idea of SVRG but uses a block of memory to store the original gradients of all samples; the global average gradient is updated by continually refreshing these stored gradients during training, which requires a large amount of memory [11]. The SCSG (Stochastically Controlled Stochastic Gradient) algorithm was then proposed: SCSG computes the average gradient of a randomly selected subset of samples as a surrogate for the global gradient, but because the number of inner updates is chosen at random, the computation becomes more variable and tedious, and the cost is large [12]. A series of algorithms [13] such as the novel Mini-Batch SCSG [14,15] and b-NICE, SAGA [16,17] were subsequently derived from the idea of variance reduction.
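As an illustration of the two adaptive rules discussed above, the sketch below contrasts the AdaGrad accumulation with the RMSProp moving average; the step sizes and decay rate are illustrative defaults, not values from the cited works.

```python
import numpy as np

def adagrad_step(w, g, cache, lr=0.01, eps=1e-8):
    """AdaGrad: accumulate every squared gradient, so the effective step keeps shrinking."""
    cache = cache + g * g
    w = w - lr * g / (np.sqrt(cache) + eps)
    return w, cache

def rmsprop_step(w, g, cache, lr=0.001, rho=0.9, eps=1e-8):
    """RMSProp: exponential moving average of squared gradients (a single decaying window)."""
    cache = rho * cache + (1.0 - rho) * g * g
    w = w - lr * g / (np.sqrt(cache) + eps)
    return w, cache
```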

However, machine learning also involves the structural risk minimization problem, which is composed of a "loss function + regularization term"; different forms of the regularization term lead to different complex problems, such as overlapping group lasso and graph-guided fused lasso [18], which are very difficult for SGD-based approaches, while the ADMM algorithm applies to a wider range of models and its excellent performance has proven it an effective optimization tool. Several variance reduction algorithms have been proposed in combination with ADMM, including SAG-ADMM [19], SDCA-ADMM [20] and SVRG-ADMM [21], all of which are improvements built on the ADMM update strategy. Ref. [22] then proposed APA-SVRG, which trains on the basis of SVRG using proximal averaging, and the experimental results show that APA-SVRG is comparable to SVRG-ADMM. The proximal stochastic L-BFGS method [23], an improved algorithm based on Newton's method, mainly trains loss functions that are smooth, while the mirror stochastic sub-gradient descent method [24] trains loss functions that are non-smooth. However, these schemes compute the unbiased gradient estimate from the average gradient of a small batch of samples, which cannot reduce the total complexity linearly; the computational cost and memory consumption increase, and the computation and update of the small-batch gradients increase the computational cost of the whole algorithm [25].

In order to address the above challenges, we propose a noise reduction method for the stochastic gradient method that uses the average gradient of a small sample instead of the global average gradient. The resulting algorithm, N-SVRG, trains on small samples while updating the average gradient to achieve variance reduction. We describe the algorithm flow and the convergence analysis of N-SVRG in detail and compare it with the mainstream Mini-Batch SGD, AdaGrad, RMSProp, SVRG and SCSG algorithms, showing that N-SVRG outperforms SVRG, SASG and the other baselines and is on par with SCSG. Finally, by exploring the relationship between the small values of the different parameters n, B and k and the effectiveness of the algorithm, we show that the N-SVRG algorithm is stable and can achieve sufficient accuracy even with a small batch size. The advantages and disadvantages of the various methods are compared experimentally, and the stability of the N-SVRG algorithm is explored through parameter settings.

The contributions of this paper are as follows:

* We propose N-SVRG, which uses a small batch of samples instead of all samples for the average gradient calculation, while performing an incremental update of the average gradient. By suitably reducing the batch size B, both the memory storage and the number of iterations can be reduced.

* The convergence analysis of the proposed algorithm shows that it can converge stably to a certain lower bound and can find saddle points through smooth gradient descent during the whole training process of the neural network. This also enables the algorithm to reduce the computational cost and memory burden.

* The experiments against the state-of-the-art baselines demonstrate that N-SVRG outperforms them. By exploring the relationship between the small values of the different parameters n, B and k and the effectiveness of the algorithm, we show that our N-SVRG algorithm is stable and can achieve sufficient accuracy even with a small batch size.

2 Background

For the supervised learning problem in machine learning [2,26]: assume that for each model input x there is a function model l in the space L, i.e., there is a prediction l(x). That is, given n training data (x_1, y_1), (x_2, y_2), ..., (x_n, y_n), where x_i ∈ R^d denotes the input sample data and y_i ∈ R denotes the corresponding training label, our goal is to find the prediction function that minimizes the loss caused by inaccurate predictions. Since the prediction l(x_i) and the label y_i will differ each time the model predicts, a loss function is used to assess the difference between the two. Each sample corresponds to a loss function, and the empirical risk is the average of these n sample loss functions.

Empirical risk:

$P_n(w) = \dfrac{1}{n}\sum_{i=1}^{n} f_i\big(l(w; x_i, y_i)\big)$   (1)

Expected risk:

$P(w) = \mathbb{E}_{(x,y)}\big[f\big(l(w; x, y)\big)\big]$   (2)

where w represents the parameters of the function model and f_i(l(w; x_i, y_i)) represents the loss function corresponding to the sample (x_i, y_i). Formulas (1) and (2) clearly show how the expected risk and the empirical risk depend on the error function, the sample space and the sample set.

Empirical risk minimization solves for the w that minimizes the empirical risk:

$\min_{w\in\mathbb{R}^d} P_n(w) = \min_{w\in\mathbb{R}^d}\ \dfrac{1}{n}\sum_{i=1}^{n} f_i\big(l(w; x_i, y_i)\big).$

However, during training the model may have strong expressive capability because it has a large number of parameters, while the data provided for training are insufficient. With insufficient training data, the model may overfit over a large number of repeated training passes, i.e., "the learned model fits a specific set of data so well that it fails to reliably fit other data or future observations". In order to reduce the impact of overfitting to some extent [4,27], a regularization term is added to the empirical risk to limit the complexity of the model, which constitutes the structural risk minimization [28] problem of the form "loss function + regularization term":

$\min_{w\in\mathbb{R}^d}\ P_n(w) + \lambda\, r(w),$

where λ > 0 is a hyperparameter that balances the weight of the regularization term through its value: the larger λ is set, the heavier the penalty on the weights. r(w) takes different forms depending on the desired effect, including the L1 norm, the L2 norm [29] and the L∞ norm. The most common form is the L2 norm, i.e., r(w) = ‖w‖₂, which can be calculated as

$\|w\|_2 = \Big(\sum_{j=1}^{d} w_j^{2}\Big)^{1/2}.$

The loss function in machine learning is a non-negative real function f(l(x), y). Common loss functions include the logarithmic loss, the squared loss, the 0-1 loss and the cross-entropy error loss (CEL) [30], as shown below:

Logarithmic loss function:

$f\big(l(x), y\big) = -\log P(y \mid x)$

Mean square error loss function:

$f\big(l(x), y\big) = \big(l(x) - y\big)^{2}$

0-1 loss function:

$f\big(l(x), y\big) = \begin{cases} 0, & l(x) = y \\ 1, & l(x) \neq y \end{cases}$

Cross-entropy error loss function:

$f\big(l(x), y\big) = -\sum_{k} y_k \log l_k(x),$

where y_k is the one-hot label of class k and l_k(x) is the predicted probability of class k.

The choice among these loss functions should depend on the specific problem; for example, the cross-entropy error loss function can be combined with the Softmax function to form a Softmax-with-Loss layer for training and learning on classification problems [31], as sketched below.
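For reference, a minimal NumPy sketch of such a Softmax-with-Loss computation (softmax followed by the cross-entropy error) might look as follows; the batch-mean convention is an assumption.

```python
import numpy as np

def softmax(z):
    z = z - np.max(z, axis=1, keepdims=True)               # subtract the row max for numerical stability
    e = np.exp(z)
    return e / np.sum(e, axis=1, keepdims=True)

def softmax_with_loss(logits, y_onehot, eps=1e-12):
    """Return the mean cross-entropy loss and its gradient w.r.t. the logits."""
    probs = softmax(logits)
    loss = -np.mean(np.sum(y_onehot * np.log(probs + eps), axis=1))
    grad = (probs - y_onehot) / logits.shape[0]             # gradient of the mean loss
    return loss, grad
```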

3 Related Concepts

To facilitate the subsequent analysis, we give the basic concepts and definitions that will be used in the paper. Let x_t be an iterative sequence with limit x*, and let two adjacent iteration points x_{k+1}, x_k satisfy

$\lim_{k\to\infty}\dfrac{\|x_{k+1}-x^{*}\|}{\|x_{k}-x^{*}\|}=\mu .$

Then:

1) If μ = 1, the algorithm achieves sublinear convergence;

2) If 0 < μ < 1, the algorithm achieves linear convergence [32];

3) If μ = 0, the algorithm achieves superlinear convergence.

Let the set C ⊂ R^d. If for any x_1 ∈ C, x_2 ∈ C and θ ∈ [0,1] it holds that (1 − θ)x_1 + θx_2 ∈ C, then the set C is called a convex set.

Let the set C ⊂ R^d be a convex set and f be a function defined on C. If for any x_1 ∈ C, x_2 ∈ C and θ ∈ [0,1],

$f\big(\theta x_1 + (1-\theta)x_2\big) \le \theta f(x_1) + (1-\theta) f(x_2),$

then f(x) is a convex function defined on the convex set C.

For a function f(y) with step η > 0, its proximal operator at the point x [33] is defined as

$\operatorname{prox}_{\eta f}(x) = \arg\min_{y}\Big\{ f(y) + \dfrac{1}{2\eta}\,\|y - x\|^{2} \Big\}.$

If f = I_C is the indicator function of the convex set C,

$I_C(y) = \begin{cases} 0, & y \in C \\ +\infty, & y \notin C, \end{cases}$

then

$\operatorname{prox}_{\eta f}(x) = \arg\min_{y \in C} \|y - x\|^{2} = \Pi_C(x).$

That is, the proximal operator of the function f(y) at x can be viewed as the Euclidean projection of x onto the set C.

Assume (Lipschitz-continuous gradient) that the objective function F: R^d → R is continuously differentiable and that its gradient ∇F: R^d → R^d is Lipschitz continuous with Lipschitz constant L > 0, i.e.,

$\|\nabla F(x) - \nabla F(y)\| \le L\,\|x - y\| \quad \text{for all } x, y \in \mathbb{R}^d.$

This ensures that the gradient of F does not change arbitrarily fast with respect to the parameter vector. The assumption is essential for the convergence analysis of most gradient-based methods; without it, the gradient would not be a good indicator of how far to move to decrease F.

Assumption 1: If the function f_i is L-smooth, then for any x, y ∈ R^d, with ∇f_i(x) the gradient of f_i at x, there exists L > 0 such that

$f_i(y) \le f_i(x) + \nabla f_i(x)^{\top}(y - x) + \dfrac{L}{2}\,\|y - x\|^{2}.$

The constant L is called the Lipschitz constant and is generally used to measure the smoothness of a function. A function is L-smooth when its gradient satisfies the Lipschitz continuity condition. For a fixed f_i, L is a fixed value. This condition restricts how fast the value of the function can change.

If the function f_i is strongly convex, then for any x, y ∈ R^d, with ∇f_i(x) the gradient of f_i at x, there exists γ > 0 such that

$f_i(y) \ge f_i(x) + \nabla f_i(x)^{\top}(y - x) + \dfrac{\gamma}{2}\,\|y - x\|^{2}.$

4 The Proposed Algorithm

4.1 Controlled Variable Method

This section describes the control variable method [34]. Assume that we need to use Monte Carlo sampling [35] to estimate the expectation μ of a random variable X, i.e., E[X] = μ, and that we can already estimate relatively easily the expectation τ of another random variable Y, i.e., E[Y] = τ. We construct a new random variable

$Z = X + C\,(Y - \tau),$

where C ∈ R is the corresponding coefficient. It can be seen from Eq. (17) that Z remains an unbiased estimator of μ [16] and that the variance of Z is

$\operatorname{Var}(Z) = \operatorname{Var}(X) + C^{2}\operatorname{Var}(Y) + 2C\operatorname{Cov}(X, Y).$

A simple calculation shows that Var(Z) is minimized at

$C = -\dfrac{\operatorname{Cov}(X, Y)}{\operatorname{Var}(Y)},$

when

$\operatorname{Var}(Z) = \big(1 - \rho_{X,Y}^{2}\big)\operatorname{Var}(X),$

where ρ_{X,Y} = Corr(X, Y). As can be seen from (18) and (19), the new random variable Z, like X, is an unbiased estimate of μ, but the variance of Z is smaller than that of X. Therefore it is better to use Z rather than X as an unbiased estimate of μ. The formula shows that the variance of Z is sufficiently small as long as the random variable X is sufficiently correlated with Y [3], so Y is also called the control variable of X. This is the control variable method, illustrated by the toy example below.
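The following toy Monte Carlo example illustrates the control variable method: it estimates E[e^U] for U ~ Uniform(0, 1) (true value e − 1) using Y = U with known mean τ = 1/2 as the control variable. The example and its numbers are ours, not from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
u = rng.uniform(size=100_000)

X = np.exp(u)                              # samples whose mean mu = e - 1 we want to estimate
Y = u                                      # control variable with known expectation tau = 0.5
tau = 0.5

C = -np.cov(X, Y)[0, 1] / np.var(Y)        # variance-minimizing coefficient C = -Cov(X,Y)/Var(Y)
Z = X + C * (Y - tau)                      # still an unbiased estimator of mu

print("plain MC   :", X.mean(), "var", X.var())
print("control var:", Z.mean(), "var", Z.var())   # same mean, much smaller variance
```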

4.2 SVRG

SVRG is a new variance reduction method built on the control variable method; instead of constructing the control variable from similar samples, it constructs it from the perspective of the optimization variable. Its iterative update is

$w_j = w_{j-1} - \eta\Big(\nabla f_{i_j}(w_{j-1}) - \nabla f_{i_j}(\tilde{w}) + \tilde{\mu}\Big), \qquad \tilde{\mu} = \dfrac{1}{n}\sum_{i=1}^{n}\nabla f_i(\tilde{w}),$

where w ∈ R^d denotes the model parameters, η denotes the update step, $\tilde{\mu}$ is the global average gradient computed with the model parameters $\tilde{w}$ obtained at the end of the previous cycle, $\nabla f_{i_j}(w_{j-1})$ denotes the gradient of the function f over the sample i_j with respect to the parameters $w_{j-1}$, $\nabla f_{i_j}(\tilde{w}) - \tilde{\mu}$ is the deviation of the control variable from its unbiased estimate, and $\nabla f_{i_j}(w_{j-1}) - \nabla f_{i_j}(\tilde{w}) + \tilde{\mu}$ is the corrected unbiased estimate used to update $w_j$.

SVRG itself is a periodic algorithm. As the number of iterations increases, $\tilde{w}$ and $w_{j-1}$ gradually converge to $w^{*}$, that is, $\tilde{w}, w_{j-1} \to w^{*}$ as $t \to \infty$, so $\nabla f_{i_j}(\tilde{w})$ and $\nabla f_{i_j}(w_{j-1})$ get closer and closer. Because $\nabla f_{i_j}(\tilde{w})$ is the control variable of $\nabla f_{i_j}(w_{j-1})$, the theory of Section 4.1 tells us that the closer the two become, the higher their correlation, and the variance of the corrected estimate converges to 0. The variance can thus be driven to 0 by the control variable method, and the SVRG algorithm achieves linear convergence.

The SVRG algorithm steps are as follows:

Step 1: Given the initialization weight $\tilde{w}_0$, the number of SGD steps m, the number of outer loop iterations T and the learning step η, initialize t = 0;

Step 2: Calculate the global average gradient $\tilde{\mu} = \frac{1}{n}\sum_{i=1}^{n}\nabla f_i(\tilde{w}_t)$ and initialize $w_0 = \tilde{w}_t$, j = 0;

Step 3: j = j + 1, select a random sample $i_j \in \{1,\cdots,n\}$, calculate $w_j = w_{j-1} - \eta\big(\nabla f_{i_j}(w_{j-1}) - \nabla f_{i_j}(\tilde{w}_t) + \tilde{\mu}\big)$; cycle Step 3 until j = m, set $\tilde{w}_{t+1} = w_m$, go to the next step;

Step 4: t = t + 1, go to Step 2 until t = T. Output: Choice 1: $w^{*} = \tilde{w}_T$; Choice 2: $w^{*}$ selected uniformly at random from the inner iterates. From the pseudo-code we can see that the main calculations of the algorithm are in Steps 2 and 3, where Step 2 computes the average gradient and Step 3 iteratively computes the corrected unbiased gradient estimates from $w_0$ to $w_m$, at a computational cost of n + 2m per cycle; a minimal sketch of these steps is given below.
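A minimal Python sketch of these four steps, assuming a per-sample gradient function `grad_fn(w, x, y)` and taking Choice 1 (the last inner iterate) as the output; the function names and defaults are illustrative assumptions.

```python
import numpy as np

def svrg(grad_fn, w0, data, lr=0.05, m=None, T=10, seed=0):
    """SVRG sketch following Steps 1-4: full average gradient per cycle, m corrected SGD steps."""
    rng = np.random.default_rng(seed)
    n = len(data)
    m = m or 2 * n
    w_tilde = np.array(w0, dtype=float)
    for _ in range(T):                                                   # Step 4: outer loop over t
        mu = np.mean([grad_fn(w_tilde, x, y) for x, y in data], axis=0)  # Step 2: global average gradient
        w = w_tilde.copy()
        for _ in range(m):                                               # Step 3: inner SGD loop
            i = rng.integers(n)
            x, y = data[i]
            v = grad_fn(w, x, y) - grad_fn(w_tilde, x, y) + mu           # corrected unbiased estimate
            w = w - lr * v
        w_tilde = w                                                      # end of cycle: Choice 1
    return w_tilde
```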

Assuming that all $f_i$ are convex functions that satisfy Eqs. (19) and (20) with γ > 0, and assuming that m is large enough that

$\alpha = \dfrac{1}{\gamma\eta(1-2L\eta)m} + \dfrac{2L\eta}{1-2L\eta} < 1,$

SVRG converges geometrically in expectation [21]:

$\mathbb{E}\big[P(\tilde{w}_t) - P(w^{*})\big] \le \alpha^{t}\,\big[P(\tilde{w}_0) - P(w^{*})\big].$

Proof: For any i, consider the function

$g_i(w) = f_i(w) - f_i(w^{*}) - \nabla f_i(w^{*})^{\top}(w - w^{*}).$

Since $\nabla g_i(w^{*}) = 0$, we have $g_i(w^{*}) = \min_w g_i(w)$, and therefore

$0 = g_i(w^{*}) \le \min_{\eta}\, g_i\big(w - \eta\nabla g_i(w)\big) \le g_i(w) - \dfrac{1}{2L}\,\|\nabla g_i(w)\|_2^{2},$

that is, $\|\nabla f_i(w) - \nabla f_i(w^{*})\|_2^{2} \le 2L\,\big[f_i(w) - f_i(w^{*}) - \nabla f_i(w^{*})^{\top}(w - w^{*})\big].$

Summing inequality (21) over i = 1, 2, ..., n and using $\nabla P(w^{*}) = 0$, we obtain

$\dfrac{1}{n}\sum_{i=1}^{n}\|\nabla f_i(w) - \nabla f_i(w^{*})\|_2^{2} \le 2L\,\big[P(w) - P(w^{*})\big].$

The first inequality uses the previously obtained bound and the second inequality uses the convexity of P(w). By accumulating over the stages j = 1, ..., m, we obtain

The second inequality uses the strong convexity property, thus obtaining

$\mathbb{E}\big[P(\tilde{w}_t) - P(w^{*})\big] \le \alpha\,\big[P(\tilde{w}_{t-1}) - P(w^{*})\big],$

which, applied recursively over t, gives the geometric convergence stated above.

5 N-SVRG Algorithm

The SASG algorithm is the memory-consuming member of the SVRG family of algorithms. Although the computational cost of the SVRG algorithm is large when computing the average of all sample gradients, that computation can be parallelized with matrix operations. The difference between the SASG and SVRG algorithms is that SASG requires a block of memory to store all sample gradients, which not only increases the computational cost (because matrix parallelism can no longer be used) but also increases the memory burden of the computer. In order to reduce the computational cost and memory burden, we designed an update scheme that discards stored gradients after use [2,33].

5.1 Algorithm Introduction

In the SASG algorithm, a block of memory stores the original gradients of all samples, and after each pass a new gradient $\nabla f_{i_j}(w_{t-1})$ computed with the current weights replaces the stored one, thereby updating $\nabla P_n(w_{t-1})$. However, its memory consumption is too large and the computational cost increases, which makes SASG impractical when the data are large and the model parameters are huge. The SCSG algorithm was mainly designed to reduce the computational cost: it does not need to calculate the gradients of all samples, but only uniformly samples a small batch and uses its average gradient in place of the global sample gradient, which obtains the same training accuracy and convergence speed as SVRG [15,34].

The small-batch average gradient in SCSG and the SGD-style update in the SASG algorithm inspired our algorithm. The N-SVRG algorithm discards the calculation of the global average gradient over all samples: it uniformly samples a small batch from the data and replaces the global average sample gradient by the batch average gradient $\tilde{\mu} = \frac{1}{B}\sum_{i \in H_t}\nabla f_i(\tilde{w})$, which reduces the computational cost. At the same time, as in the SASG algorithm, we store the gradients of these samples; however, instead of replacing a stored gradient with its recomputed value after each weight update, N-SVRG simply removes the sample gradient computed with the past model from the stored set and recomputes the average over the remaining gradients. The gradient variance is thus reduced by directly discarding stale sample gradients. From the perspective of the control variable method this can be interpreted as follows: the mean of the original weight gradients is used as the estimator, and the variance is reduced by gradually decreasing the number of stored samples. Consequently, the number of inner iterations that compute the unbiased gradient estimate is reduced to fewer than B, which indirectly reduces the computational cost.

The N-SVRG algorithm steps are as follows:

Step 1: Given the initial model parameters $\tilde{w}_0$, the number of outer cycles T, the step size η, the number of SGD steps k and the size of each batch B, initialize t = 0;

Step 2: Randomly and uniformly sample a batch $H_t \subset \{1,\cdots,n\}$ with size $|H_t| = B$;

Step 3: Calculate the gradient $\nabla f_i(\tilde{w}_t)$ for each sample $i \in H_t$ one by one and store it in memory, initializing the batch average gradient $\tilde{\mu}_t = \frac{1}{B}\sum_{i\in H_t}\nabla f_i(\tilde{w}_t)$ and $w_0 = \tilde{w}_t$;

Step 6: t = t + 1, go to Step 2 until t = T; output $w^{*} = w_T$.

As can be seen from Step 5, the stored batch gradient set is finite and a deleted gradient is never used again; to avoid deleting the same sample twice by mistake, the program randomly orders the samples of the batch and draws them sequentially. The algorithm steps show that the main computational cost of the N-SVRG algorithm lies in Step 3 and Step 5: the number of gradient computations in Step 3 is B, and Step 5 contains an inner loop of length k. Because k < B, the total computational cost per outer cycle stays well below the n + 2m cost of SVRG, as illustrated by the sketch below.
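Since Steps 4 and 5 are not reproduced above, the sketch below reconstructs the inner loop from the textual description in Section 5.1 (draw the stored batch samples in random order, apply the corrected update, discard each used gradient and recompute the batch average); `grad_fn` and all defaults are illustrative assumptions, not the authors' reference implementation.

```python
import numpy as np

def n_svrg(grad_fn, w0, data, lr=0.05, B=250, k=249, T=20, seed=0):
    """Hedged sketch of N-SVRG consistent with Steps 1-3 and 6 and the text above."""
    rng = np.random.default_rng(seed)
    n = len(data)
    w_tilde = np.array(w0, dtype=float)
    for _ in range(T):                                        # Step 6: outer loop over t
        H = rng.choice(n, size=B, replace=False)              # Step 2: uniform batch H_t, |H_t| = B
        stored = {i: grad_fn(w_tilde, *data[i]) for i in H}   # Step 3: store the batch gradients
        mu = np.mean(list(stored.values()), axis=0)
        w = w_tilde.copy()
        order = rng.permutation(H)                            # random order, drawn sequentially
        for j in range(min(k, B - 1)):                        # assumed inner loop (k < B)
            i = order[j]
            v = grad_fn(w, *data[i]) - stored[i] + mu         # corrected unbiased estimate
            w = w - lr * v
            del stored[i]                                     # discard the stale stored gradient
            mu = np.mean(list(stored.values()), axis=0)       # incremental update of the average
        w_tilde = w
    return w_tilde                                            # output w* = w_T
```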

5.2 Algorithm Convergence Analysis

For simplicity, we only consider the case where each f_i(w) is convex and smooth and P(w) is strongly convex. We make the following assumptions:

f_i is a convex function with an L-Lipschitz gradient [35]:

$\|\nabla f_i(x) - \nabla f_i(y)\| \le L\,\|x - y\|.$

Assume that P(w) is a strongly convex function:

$P(y) \ge P(x) + \nabla P(x)^{\top}(y - x) + \dfrac{\gamma}{2}\,\|y - x\|^{2}.$

The expectation of the stochastic gradient in Step 3 of the SVRG algorithm equals the full gradient, but the variance problem makes the deviation of individual estimates non-zero. The variance of $\nabla f_{i_j}(w_{j-1})$ is reduced by using the corrected estimate $\nabla f_{i_j}(w_{j-1}) - \nabla f_{i_j}(\tilde{w}) + \tilde{\mu}$ in the update rule of the algorithm.

In the algorithm of this paper, we want to achieve the same effect as SVRG by updating the batch average gradient. Consider the j-th inner loop in the t-th outer cycle.

From the literature it is clear that

Taking the expectation with respect to the random sample $i_j$, we obtain

The first three inequalities use $\|a+b\|_2^2 \le 2\|a\|_2^2 + 2\|b\|_2^2$ and the fourth inequality uses $\|\mathbb{E}[x]\|_2^2 \le \mathbb{E}\|x\|_2^2$ [26].

The second inequality uses the strong convexity property (38), and we thus obtain

We thus have the convergence of N-SVRG:

The convergence of the N-SVRG algorithm shows that its convergence rate is related to the choice of the parameters B and k.

How the choice of the parameters B and k affects the stability and convergence speed of the algorithm is explored in detail in the experimental section.

6 Experimental Design and Experimental Results

6.1 Experimental Data

MNIST [35] is a handwritten digit image dataset collected to enable the recognition of handwritten digits, as shown in Fig. 1. Since Yann LeCun used it with the LeNet-5 network in 1998, this dataset has been widely used in machine learning and deep learning to test the effectiveness of algorithms such as SVMs, linear classifiers, neural nets and KNN [31,35].

Figure 1: Example of the MNIST dataset

The dataset consists of a training set and a test set. The training set contains 60,000 handwritten digit images and the labels corresponding to those images; similarly, the test set contains 10,000 handwritten digit images with their corresponding labels. The dataset is composed of digit images from 0 to 9, each of which is a single-channel grayscale image of 28 × 28 pixels, and each pixel is of the uint8 data type, i.e., an 8-bit unsigned integer with a value between 0 and 255. Each image is labeled "1", "3", "5", etc.

6.2 Experimental Setup

6.2.1 Data Pre-Processing

The experimental dataset is the MNIST dataset. In the preprocessing stage, we normalize the images and flatten them into one dimension, so that when we use batch processing we can transform a batch of digit image samples into a two-dimensional array for computation. At the same time, we encode the data labels as one-hot vectors. A one-hot encoding is an array in which the entry for the correct label is 1 and all others are 0; for example, the label '7' is encoded as {0, 0, 0, 0, 0, 0, 0, 1, 0, 0}. A minimal preprocessing sketch is given below.
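A minimal sketch of this preprocessing (flattening the 28 × 28 images into 784-dimensional vectors scaled to [0, 1] and one-hot encoding the labels); the function and variable names are illustrative.

```python
import numpy as np

def preprocess(images, labels, num_classes=10):
    """Flatten 28x28 uint8 images into 784-d float vectors in [0, 1] and one-hot encode the labels."""
    x = images.reshape(len(images), -1).astype(np.float32) / 255.0
    t = np.zeros((len(labels), num_classes), dtype=np.float32)
    t[np.arange(len(labels)), labels] = 1.0      # e.g. label 7 -> [0,0,0,0,0,0,0,1,0,0]
    return x, t
```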

6.2.2 Network Structure

All networks in the programs are structured as 784 × 40 × 10: the input layer has 784 neurons with an input function only, the hidden layer has 40 neurons with the ReLU activation, and the output layer is a Softmax-with-Loss layer with 10 neurons, which is used for learning. A minimal sketch of this structure follows.
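A minimal NumPy sketch of this 784 × 40 × 10 structure (forward pass only, producing logits that can be fed to a Softmax-with-Loss layer); the random weight initialization is an illustrative assumption.

```python
import numpy as np

class TwoLayerNet:
    """784 x 40 x 10 network: ReLU hidden layer, logits fed to a Softmax-with-Loss output layer."""
    def __init__(self, d_in=784, d_hidden=40, d_out=10, seed=0):
        rng = np.random.default_rng(seed)
        self.W1 = 0.01 * rng.standard_normal((d_in, d_hidden))
        self.b1 = np.zeros(d_hidden)
        self.W2 = 0.01 * rng.standard_normal((d_hidden, d_out))
        self.b2 = np.zeros(d_out)

    def forward(self, x):
        h = np.maximum(0.0, x @ self.W1 + self.b1)   # hidden layer with ReLU activation
        return h @ self.W2 + self.b2                 # class scores for the 10 digits
```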

6.2.3 Experimental Basis

The algorithms are implemented in Python 3.6; for the implementation, the structures in [15,17] were consulted. The device used in this paper has an Intel(R) Core(TM) i5-7500 CPU @ 3.40 GHz with 16.00 GB of RAM.

The experiment consists of two parts. The first part is the comparative evaluation of the algorithm of this paper against the current mainstream algorithms; the second part explores the relationships between the parameters of the algorithm.

6.3 Algorithm Comparison

The N-SVRG algorithm is programmed in PyCharm using Python 3.6, and is compared with the mainstream Mini-Batch SGD, AdaGrad, RMSProp, SVRG and SCSG algorithms.

Parameter settings of each algorithm:

Mini-Batch SGD: maximum number of iterations 4800, batch size 250;

AdaGrad: maximum number of iterations 4800, batch size 250;

RMSProp: maximum number of iterations 4800, batch size 250;

SVRG: maximum number of epochs 20, number of SGD steps 5000;

SCSG: maximum number of iterations 4800, batch size 250, number of SGD steps 250;

N-SVRG: maximum number of iterations 4800, batch size 250, number of SGD steps 249.

The following points can be seen in Fig. 2 and Table 1:

Table 1: Algorithm comparison results

(1) N-SVRG, RMSProp and SCSG all reach around 65% accuracy by the first epoch, while SVRG, Mini-Batch SGD and AdaGrad only reach below 40%. As the iterations proceed, N-SVRG, RMSProp and SCSG reach 80% or higher accuracy after only 7 epochs, SVRG reaches 65%, and Mini-Batch SGD and AdaGrad reach only 50%.

(2) N-SVRG, RMSProp and SCSG all achieve high accuracy smoothly and quickly. In terms of accuracy, the difference between N-SVRG and SCSG is 0.26%, and the difference between RMSProp and SCSG is 1.13%. The detection accuracy of SVRG is slightly lower than the previous three algorithms, but it is still better than Mini-Batch SGD and AdaGrad.

Figure 2: Comparison of algorithm accuracy

6.4 Stability Exploration

The stability of the N-SVRG algorithm is mainly affected by two factors: the size of each batch B and the number of SGD steps k. The size of B directly affects the computational cost and storage cost of the algorithm, and an appropriate B allows the algorithm to reach better accuracy while converging quickly; the number of SGD steps k has a similar effect to B. The size relationship between k and B also needs to be discussed. Inspired by [24], this paper focuses on two experimental investigations: the relationship between the ratio of B to the number of training samples n, and the relationship between the ratio of B to k.

B and n ratio exploration:

In the experiment exploring the effect of the B-to-n ratio on the algorithm, in order to diversify the experimental sample size, we take 10,000, 8,000 and 4,000 samples in turn from the MNIST data for training, with B ∈ {n/2, n/4, n/8, n/16, n/32} and k = B − 1.

The following can be seen from Fig. 3 and Table 2:

Table 2: Results of the B to n ratio investigation

(1) The ratio between B and n affects the performance of the algorithm: the larger B is, the lower the training accuracy, and the training time also becomes longer because the number of inner iterations is tied to the setting of k;

(2)Relatively best accuracy at

(3) Although the detection accuracy shows that the size of B is related to the convergence speed and accuracy of the algorithm, it has little impact on the stability of the algorithm, and relatively good results are still obtained with a simple network structure, showing strong robustness.

Figure 3: Results of the B vs. n ratio investigation

In the experiments exploring the effect of the B-to-k ratio on the algorithm, we directly use the complete MNIST dataset and uniformly sample small batches of sizes 250, 500 and 1000, respectively, while we set the number of SGD steps k ∈ {B − 1, 0.8B, 0.6B, 0.4B, 0.2B} for the experiments.

The following can be seen in Fig. 4 and Table 3:

Table 3: Results of the investigation of the ratio of B to k

(1) The relationship between the sizes of k and B is positively related to the accuracy of the algorithm: the closer k is to B, the higher the test accuracy, and conversely the accuracy decreases; this phenomenon does not change with the size of B.

(2) Cross-observation of the data shows that when k is held constant (k = 200), the accuracy decreases as B increases, which confirms the above point.

(3) Comparing the results of the two experiments, the B-to-n exploration and the B-to-k exploration, we can see that the N-SVRG algorithm is robust and maintains good experimental results for general optimization problems even when the batch size is small.

Figure 4: Results of the B vs. k ratio exploration

6.5 Discussion

Through the experiments we find that the N-SVRG algorithm is outstanding in convergence speed and accuracy compared with Mini-Batch SGD, AdaGrad and SVRG, and is comparable to RMSProp and SCSG. The N-SVRG algorithm is relatively stable: although the two parameters B and k affect its accuracy, the experiments show that decreasing B tends to increase the accuracy of the algorithm, a property that makes the algorithm suitable for general optimization problems.

7 Conclusion

In this paper, we propose a noise reduction method for SVRG that uses a small batch of samples instead of all samples for the average gradient computation and updates the average gradient incrementally. The experiments compare it with the mainstream Mini-Batch SGD, AdaGrad, RMSProp, SVRG and SCSG algorithms and show that the N-SVRG algorithm outperforms SVRG and SASG and is on par with SCSG. Finally, by exploring the relationship between the small values of the different parameters n, B and k and the performance of the algorithm, we show that the N-SVRG algorithm is stable.

Funding Statement:This work was supported by the National Natural Science Foundation of China under Grant 62076066.

Conflicts of Interest:The authors declare that they have no conflicts of interest to report regarding the present study.