Pseudo-label based semi-supervised learning in the distributed machine learning framework①
WANG Xiaoxi (王晓曦), WU Wenjun②, YANG Feng, SI Pengbo, ZHANG Xuanyi, ZHANG Yanhua
(∗Faculty of Information Technology, Beijing University of Technology, Beijing 100124, P.R.China)
(∗∗Beijing Capital International Airport Co., Ltd., Beijing 101317, P.R.China)
Abstract With the emergence of various intelligent applications, machine learning technologies face many challenges in practice, including large-scale models, application-oriented real-time datasets and the limited capabilities of nodes. Therefore, distributed machine learning (DML) and semi-supervised learning methods, which help solve these problems, have received much attention in both academia and industry. In this paper, the semi-supervised learning method and the data parallelism DML framework are combined. The pseudo-label based local loss function for each distributed node is studied, and the stochastic gradient descent (SGD) based distributed parameter update principle is derived. A demo that implements the pseudo-label based semi-supervised learning in the DML framework is conducted, and the CIFAR-10 dataset for target classification is used to evaluate the performance. Experimental results confirm the convergence and the accuracy of the model trained with pseudo-label based semi-supervised learning in the DML framework. When the proportion of the pseudo-label dataset is 20%, the accuracy of the model is over 90% as long as the number of local parameter update steps between two global aggregations is less than 5. Besides, with the global aggregation interval fixed to 3, the model converges with acceptable performance degradation when the proportion of the pseudo-label dataset varies from 20% to 80%.
Key words: distributed machine learning (DML), semi-supervised, deep neural network (DNN)
0 Introduction
Recently, the rapid growth of emerging applications including unmanned driving, face recognition and automatic navigation has greatly promoted the development of artificial intelligence (AI) technologies. However, in some scenarios, such as unmanned aerial vehicle (UAV) networks[1] and Internet of vehicles (IoV)[2], the implementation of AI technologies faces challenges arising from limited battery capacity, low computing capability and data privacy concerns. Moreover, in most practical cases, the performance of AI technologies is limited by the size of the training sample set and the accuracy of the labels. To address the above problems, distributed machine learning (DML)[3] and semi-supervised learning[4] have attracted enormous attention.
DML is a distributed collaboration architecture in which multiple worker nodes train a machine learning (ML) model together[3]. Generally, there are two basic parallelism modes for DML: model parallelism and data parallelism. In the model parallelism mode, the ML model is partitioned among workers and each worker updates part of the parameters using the entire dataset[5]. In the data parallelism mode, each worker has a local copy of the complete ML model and updates the model parameters based on its local data[3,6]. Nowadays, data parallelism is more widely adopted than model parallelism, given that most ML models can be entirely stored in the memory of modern GPUs. Since workers do not need to send their raw data to a central node, the data privacy issue is well addressed.
As for the aggregation process of DML, both the synchronous mode and the asynchronous mode can be used for the communication between the workers and the aggregator[7]. In the synchronous mode, all the workers stop at the global synchronization barrier and wait for the other workers to finish their local training before the barrier. In the asynchronous mode, such as HogWild![8] and Cyclades[9], each worker can send its parameters or model to the aggregator whenever it completes several local training steps. Obviously, the synchronous mode wastes time on waiting, but its aggregation algorithm is simple, while the asynchronous mode makes full use of the time. However, due to the different computation capabilities of the workers, the slower workers may reduce the convergence rate and drag down the whole model.
Semi-supervised learning is a class of methods that can train deep neural networks (DNNs) using both labeled and unlabeled data. With these methods, the lack of labeled data in some real-time application scenarios can be overcome. One of the early methods of training a DNN on both labeled and unlabeled data was studied in Ref.[10]. Ref.[11] proposed a semi-supervised deep learning method for hyperspectral image classification which uses limited labeled data and unlabeled data to train a DNN. In the research area of modulation classification, combining handcrafted features with deep learning, Ref.[12] proposed a few-shot modulation classification method based on feature dimension reduction and pseudo-label training.
Although DML and semi-supervised learning are widely used in areas such as image classification[13], face recognition[14] and natural language processing[15], the implementation that combines these two technologies has not been well studied. For some scenarios, especially AI applications that collect unlabeled data in real time with devices of limited capability, such as UAV-based emergency rescue and real-time high-definition mapping in IoV, the combination of DML and semi-supervised learning is of great necessity.
In this paper, a data parallelism architecture enabling pseudo-label based semi-supervised learning in DML is proposed. Then the cross-entropy based local loss function with pseudo-labels at each worker is given and the learning problem is formulated. Stochastic gradient descent (SGD) is adopted in the training process and the corresponding local parameter updating equation is derived. A demo that implements the pseudo-label based semi-supervised learning in the DML framework is conducted, and the CIFAR-10 dataset for target classification is used to evaluate the performance. When the proportion of the pseudo-label dataset is 20%, results show that the model converges when the number of local update steps between every two global aggregations is less than 5. Results also confirm that the proportion of the pseudo-label dataset affects the convergence rate and the accuracy.
1 Distributed semi-supervised method
1.1 Architecture
A typical data parallelism architecture enabling pseudo-label based semi-supervised learning in DML is considered, as shown in Fig.1. The raw data is locally stored at N worker nodes, each of which trains a complete machine learning model (i.e., a DNN) using its local data. The local dataset of each worker consists of two parts, i.e., the labeled dataset and the unlabeled dataset. Moreover, it is assumed that all the workers collaborate in a synchronous way and that a parameter server implements the parameter aggregation process.
Fig.1 Architecture
1.2 Loss function of the distributed semi-supervised learning
2 SGD-based training process
Obviously, how to use the local data to calculate Eq.(14) and to update the local parameter vector in practice is crucial.
In this work, the stochastic gradient descent (SGD)[16] method is adopted for the local training at each worker. In each local update step of the SGD method, the gradient of the loss function is computed based on a randomly selected subset of the samples, referred to as a mini-batch, rather than the whole sample dataset. The mini-batch chosen at local training step t at the i-th worker can thus be defined as S_i^t ⊂ D_i, which includes both labeled and unlabeled samples. Then S_{i1}^t ⊂ D_{i1} and S_{i2}^t ⊂ D_{i2} are the sets of labeled and unlabeled samples in S_i^t, respectively. Moreover, the proportion of the pseudo-label samples in the whole mini-batch can be defined as μ = |S_{i2}^t| / |S_i^t|.
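As an illustration of the mini-batch composition described above, the following minimal sketch (not the authors' code; the function and variable names are hypothetical) samples |S_{i1}^t| labeled and |S_{i2}^t| unlabeled indices so that the pseudo-label proportion equals μ.

# Minimal sketch of composing one mini-batch S_i^t with pseudo-label
# proportion mu = |S_i2^t| / |S_i^t|; names are illustrative only.
import numpy as np

def sample_minibatch(labeled_idx, unlabeled_idx, batch_size=20, mu=0.2, rng=None):
    """Return (S_i1^t, S_i2^t): indices of labeled and unlabeled samples."""
    rng = rng if rng is not None else np.random.default_rng()
    n_unlabeled = int(round(batch_size * mu))      # |S_i2^t|
    n_labeled = batch_size - n_unlabeled           # |S_i1^t|
    s_i1 = rng.choice(labeled_idx, size=n_labeled, replace=False)
    s_i2 = rng.choice(unlabeled_idx, size=n_unlabeled, replace=False)
    return s_i1, s_i2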
Based on the above definitions and the local loss function given in Eq.(5), the local loss function of the i-th worker at time step t in the SGD-based training process can be written as
Algorithm 1 Training procedure at the aggregator
1  Initialize t ← 0, global step ← 0
2  Initialize w(0) as a random vector and send it to all workers
3  Repeat:
4      If t % τ = 0
5          Receive w_i(t) from each worker i
6          Compute w(t) = Σ_{i=1}^{N} (|D_i| / |D|) w_i(t)
7          Broadcast w(t) to all the workers
8          global step ← global step + 1
9  Until the model converges
10 Set STOP flag and send it to all the workers
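The aggregation rule in line 6 of Algorithm 1 is a dataset-size weighted average of the workers' parameters. A minimal sketch of this step (illustrative only, not the code used in the demo) is given below.

# Sketch of the global aggregation in Algorithm 1 (line 6): the parameter
# server averages the workers' parameter vectors weighted by |D_i| / |D|.
import numpy as np

def aggregate(worker_params, dataset_sizes):
    """worker_params: list of N flat parameter vectors w_i(t);
    dataset_sizes: list of the local dataset sizes |D_i|."""
    total = float(sum(dataset_sizes))
    weights = [n / total for n in dataset_sizes]
    return sum(wt * p for wt, p in zip(weights, worker_params))

# Example: three workers with 10 000, 20 000 and 20 000 samples.
w_global = aggregate([np.ones(5), 2 * np.ones(5), 3 * np.ones(5)],
                     [10000, 20000, 20000])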
Algorithm 2 Training procedure at the i-th worker
1  Initialize t ← 0
2  Repeat:
3      Select a mini-batch (including labeled data and unlabeled data) from the local dataset of the i-th worker
4      Obtain the pseudo-label following y′_jk = 1 if k = argmax_{k′} f_{k′}(w, x_j), and y′_jk = 0 otherwise
5      Receive w(t) from the aggregator, set w′_i(t) ← w(t)
6      for μ = 1, 2, …, τ do
7          t ← t + 1
8          Compute:
           w_i(t) = w′_i(t − 1)
                    − η (1/|S_{i1}^t|) Σ_{j∈S_{i1}^t} Σ_{k=1}^{C} [ −y_jk / f_k^j + (1 − y_jk) / (1 − f_k^j) ] ∂f_k^j / ∂w′_i(t − 1)
                    − η (1/|S_{i2}^t|) Σ_{j∈S_{i2}^t} Σ_{k=1}^{C} [ −y′_jk / f′_k^j + (1 − y′_jk) / (1 − f′_k^j) ] ∂f′_k^j / ∂w′_i(t − 1)
9          if μ < τ then
10             w′_i(t) ← w_i(t)
11         else
12             Send w_i(t) to the aggregator
13         end if
       end for
14 Until STOP flag is received
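To make the worker-side procedure concrete, the sketch below implements one local step in the spirit of Algorithm 2, using a softmax linear classifier as a stand-in for the DNN f(w, x): pseudo-labels are the one-hot argmax of the current predictions (line 4), and the gradient combines the labeled and pseudo-labeled cross-entropy terms. The paper's exact gradient follows its Eq.(5), so this is only an illustrative approximation.

# Illustrative worker-side update for one local step (cf. Algorithm 2).
# A softmax linear classifier stands in for the DNN f(w, x); treat this
# as a sketch, not the gradient expression derived in the paper.
import numpy as np

def softmax(z):
    z = z - z.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def pseudo_labels(W, X_unlabeled):
    """One-hot pseudo-labels: y'_jk = 1 for k = argmax_k' f_k'(w, x_j)."""
    probs = softmax(X_unlabeled @ W)
    return np.eye(W.shape[1])[probs.argmax(axis=1)]

def local_step(W, X_lab, Y_lab, X_unlab, eta=0.01):
    """One SGD step on the combined labeled + pseudo-labeled cross-entropy."""
    Y_pseudo = pseudo_labels(W, X_unlab)
    grad_lab = X_lab.T @ (softmax(X_lab @ W) - Y_lab) / len(X_lab)
    grad_unlab = X_unlab.T @ (softmax(X_unlab @ W) - Y_pseudo) / len(X_unlab)
    return W - eta * (grad_lab + grad_unlab)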
3 Experimentation
3.1 Hardware environment
A DML framework that consists of three workers (worker0, worker1 and worker2) and a parameter server (ps0) is adopted in the experiment. The experiment demo is conducted on three laptops, which are in a local area network and connected by a router; ps0 and worker0 are deployed on the same laptop. The configuration parameters of the three PCs are as follows.
(1) worker0 and ps0 PC. Processor: AMD Ryzen 7 4800H with Radeon Graphics, 2.90 GHz. Memory: 32.00 GB RAM. System type: 64-bit operating system, x64-based processor. Operating system version: Windows 10.
(2) worker1 PC. Processor: Intel(R) Core(TM) i7-9750H CPU @ 2.60 GHz (2.59 GHz). Memory: 16.00 GB RAM. System type: 64-bit operating system, x64-based processor. Operating system version: Windows 10.
(3) worker2 PC. Processor: Intel(R) Core(TM) i7-9750H CPU @ 2.60 GHz (2.59 GHz). Memory: 32.00 GB RAM. System type: 64-bit operating system, x64-based processor. Operating system version: Windows 10.
3.2 Software environment
The deep neural network of the target classification model is GoogLeNet[17], a convolutional neural network that exploits the local sparsity of the model. It consists of 27 layers, including convolution layers, max pooling layers, inception structures, dropout layers, linear layers and softmax layers. At the end of the network, the fully connected layer is replaced by an average pooling layer, but the use of dropout remains essential.
The DML demo is implemented based on the distributed TensorFlow framework[18], using Python 3.7 and TensorFlow-gpu 1.13.1 with CUDA 10.0 and cuDNN 7.5. Worker0, which is in charge of initializing and restoring the model, is the chief worker in the TensorFlow cluster, and the other workers wait for the chief worker to finish its initialization before starting their training. Since the number of local update steps τ between every two global aggregations affects the convergence of the training process, different values of τ are evaluated under the condition of μ = 20%.
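For reference, a minimal sketch of a distributed TensorFlow 1.x setup of this kind is shown below. The host addresses, ports and the toy loss are placeholders (the actual demo trains GoogLeNet), so this only illustrates the cluster layout with one parameter server and a chief worker.

# Minimal sketch of a distributed TensorFlow 1.x cluster with one parameter
# server (ps0) and three workers; addresses, ports and the toy loss are
# placeholders, not the configuration used in the paper's demo.
import tensorflow as tf  # TensorFlow 1.x API

cluster = tf.train.ClusterSpec({
    "ps":     ["192.168.0.10:2222"],                  # ps0 (shares a laptop with worker0)
    "worker": ["192.168.0.10:2223",                   # worker0 (chief)
               "192.168.0.11:2222",                   # worker1
               "192.168.0.12:2222"],                  # worker2
})
job_name, task_index = "worker", 0                    # set per node

server = tf.train.Server(cluster, job_name=job_name, task_index=task_index)
if job_name == "ps":
    server.join()                                     # the parameter server just serves variables
else:
    # Variables are placed on the ps job, computation on the local worker.
    with tf.device(tf.train.replica_device_setter(
            cluster=cluster,
            worker_device="/job:worker/task:%d" % task_index)):
        global_step = tf.train.get_or_create_global_step()
        w = tf.get_variable("w", shape=[10], initializer=tf.zeros_initializer())
        loss = tf.reduce_sum(tf.square(w - 1.0))      # toy loss standing in for GoogLeNet's
        train_op = tf.train.GradientDescentOptimizer(0.01).minimize(
            loss, global_step=global_step)
    # The chief (worker0) initializes the model; the others wait for it.
    hooks = [tf.train.StopAtStepHook(last_step=100)]
    with tf.train.MonitoredTrainingSession(master=server.target,
                                           is_chief=(task_index == 0),
                                           hooks=hooks) as sess:
        while not sess.should_stop():
            sess.run(train_op)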
3.3 Dataset
The CIFAR-10 dataset is used, which includes 60 000 color images (50 000 for training and 10 000 for testing) belonging to 10 different classes[19]. In this experiment, the training set is randomly divided into three parts of 10 000 samples, 20 000 samples and 20 000 samples, which serve as the local datasets of worker0, worker1 and worker2, respectively. Thus each worker holds uniformly distributed (but incomplete) information. In the training process, the mini-batch size is set to 20 for all the workers.
The proportion of the pseudo-label dataset is another factor that affects the convergence performance, thus different values of μ are also evaluated with τ fixed to 3. Since all the samples in CIFAR-10 are originally labeled, some samples are treated as unlabeled data by ignoring their labels, and pseudo-labels are generated for them following Eq.(4) during the training process.
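A minimal data-preparation sketch consistent with this setup is given below: it partitions the CIFAR-10 training images into local datasets of 10 000, 20 000 and 20 000 samples and masks a fraction μ of each local dataset as "unlabeled". The loader call and the random seed are conveniences for illustration, not details taken from the paper.

# Sketch of the data preparation: partition the 50 000 CIFAR-10 training
# images into 10k/20k/20k local datasets and treat a fraction mu of each
# local dataset as unlabeled (its labels are ignored during training).
import numpy as np
import tensorflow as tf

(x_train, y_train), _ = tf.keras.datasets.cifar10.load_data()
rng = np.random.default_rng(0)
perm = rng.permutation(len(x_train))
splits = np.split(perm, [10000, 30000])        # worker0: 10k, worker1: 20k, worker2: 20k
mu = 0.2                                       # proportion treated as unlabeled

local_datasets = []
for idx in splits:
    n_unlabeled = int(len(idx) * mu)
    unlabeled_idx, labeled_idx = idx[:n_unlabeled], idx[n_unlabeled:]
    local_datasets.append({
        "x_labeled": x_train[labeled_idx], "y_labeled": y_train[labeled_idx],
        "x_unlabeled": x_train[unlabeled_idx],   # labels ignored; pseudo-labels are used instead
    })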
4 Results and analysis
4.1 Performance evaluation of different values of τ
In the first part of the experiments, the value of μ is fixed at 20% and the value of τ is varied from 3 to 7.
The training results in Fig.2 and Fig.3 show that when the value of τ is not greater than 5, the target classification model converges, but the convergence performance differs for different values of τ. When τ = 3, the loss is close to 0.25 and the accuracy is 97.7% after 12 000 global steps (36 000 local update steps). When τ = 5, the model needs much more time to reach an acceptable performance: it converges after about 220 000 global steps (1 100 000 local update steps), and the final loss and accuracy are about 0.48 and 96%, respectively. However, when the value of τ is greater than 5, taking τ = 7 as an example, the target classification model cannot converge. The loss drops at first and then rises again after several global steps, while the accuracy rises to about 56% and then falls. This is because when τ is too large, the local gradients may deviate too much from the global gradient, resulting in poor convergence.
Fig.2 Training results of accuracy (μ=20%)
Fig.3 Training results of loss (μ=20%)
The test results in Fig.4 and Fig.5 are consistent with the training results in most cases, and the accuracy and loss are slightly worse than the training performance due to the difference between the training and test samples. When τ = 3, the test loss is about 0.6 and the accuracy is about 92%. It is worth noting that when the value of τ is equal to 5, the loss cannot converge on the test dataset. That is to say, the acceptable value of τ is less than 5 in practice.
Further, the performance of the well-trained models is also tested. The models with the best training performance in Fig.2 for different values of τ are selected, and the test results are given in Fig.6. The performance of τ = 3 and τ = 4 is as good as that given in Fig.2 and Fig.5. However, when τ = 5 and τ = 6, the performance is not good, which is consistent with the phenomenon occurring in Fig.5. Since τ = 5 is the critical value and the performance is unstable, the distributions of the loss over the test samples, using models obtained at different global steps, are counted in Fig.7. When the average loss is 5, which is acceptable, the loss values of most samples are less than 5. But when the average loss is around 70, only 18.8% of the samples attain a low loss (less than 5), and 25.9% of the samples have an extremely high loss (larger than 100).
Fig.4 Test results of accuracy (μ=20%)
Fig.5 Test results of loss (μ=20%)
4.2 Performance evaluation of different values of μ
The impact of different proportions of the pseudo-label dataset on the performance is evaluated. Since the critical value for μ = 20% is τ = 5, the value of τ is set to 3 in the following experiments to ensure a certain margin, and μ is set to 20%, 50% and 80%.
Fig.6 The best test performance for different values of τ
Fig.7 Proportion of samples with different test loss results
Fig.8 Training results of accuracy (τ=3)
Fig.9 Training results of loss (τ=3)
As shown in Fig.8 and Fig.9, the target classification model converges. When μ = 20%, the accuracy increases to 98% and the loss descends to 0.17. When μ = 50%, the model also achieves an acceptable accuracy of about 95%, and the loss is about 0.3. However, when μ = 80%, the curve differs from those for μ = 20% and μ = 50%. The convergence rate decreases obviously, and three steps can be observed in the ascending stage. At the beginning, the accuracy remains at an extremely low level, because credible pseudo-labels for the unlabeled data cannot be obtained from the initial model. But after about 80 000 global steps, the accuracy rises gradually, and the final accuracy reaches 94%.
As for the test results given in Fig.10 and Fig.11, the performance is slightly degraded, but still acceptable. When μ = 20% and μ = 50%, the accuracy of the model can reach 92%. But when μ = 80%, which is very large, the accuracy on the test dataset is only 87%.
Fig.10 Test results of accuracy (τ=3)
4.3 Calculation and communication cost analysis
The cost of the pseudo-label based semi-supervised learning algorithm is composed of two parts, i.e., the computational cost of the local training process and the communication cost of the parameter transmission.
Fig.11 Test results of loss (τ=3)
The measurement of the computational cost mainly focuses on the number of floating-point operations (FLOPs). As mentioned in Section 3, the convolutional neural network GoogLeNet is adopted, and the time complexity of all convolution layers can be expressed as
This is summed over the parameters and feature maps of all layers, where K is the size of the convolution kernel, C_l is the number of output channels of the l-th layer, and M is the length of the output feature map. Since each worker needs to upload its local parameters to the aggregator and download the updated global parameters from the aggregator, the communication cost of one local worker can be measured as 2Ω_para. As for GoogLeNet, the number of parameters of the whole model is Ω_para ≈ 6.8 M according to the calculation in Ref.[17]. Thus, the communication cost of one local worker is about 13.6 M parameters.
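For reference, the widely used complexity expressions for convolution layers, written with the symbols defined above, are sketched below; they are standard forms and may differ slightly from the exact equation used in the original derivation.

\[
\Omega_{\mathrm{Time}} \sim O\Big(\sum_{l=1}^{d} M_l^{2}\, K_l^{2}\, C_{l-1}\, C_l\Big),
\qquad
\text{Space} \sim O\Big(\underbrace{\sum_{l=1}^{d} K_l^{2}\, C_{l-1}\, C_l}_{\text{parameters}}
\;+\; \underbrace{\sum_{l=1}^{d} M_l^{2}\, C_l}_{\text{feature maps}}\Big),
\]
where d is the number of convolution layers.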
Since all the workers train their local models in parallel and communicate with the aggregator at the same time, the cost of one global update can be estimated as Ω_Time · τ · |S_i^t| · (1 + μ) + 2Ω_para from the aspect of time efficiency. Therefore, the cost of the whole distributed training process is T_M [Ω_Time · τ · |S_i^t| · (1 + μ) + 2Ω_para], where T_M is the number of global updates.
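As a quick back-of-the-envelope check of the communication term, the snippet below plugs in the values from the τ = 3, μ = 20% experiment; the 32-bit parameter assumption and the normalized per-sample computation cost are illustrative assumptions, not figures from the paper.

# Back-of-the-envelope check of the cost expression above; the float32
# assumption and the normalized omega_time are illustrative assumptions.
omega_para = 6.8e6                                  # GoogLeNet parameters (Ref.[17])
comm_params_per_round = 2 * omega_para              # upload + download ≈ 13.6 M parameters
comm_bytes_per_round = comm_params_per_round * 4    # ≈ 54.4 MB at 4 bytes per parameter

tau, batch_size, mu = 3, 20, 0.2                    # settings of the tau = 3 experiment
omega_time = 1.0                                    # per-sample computation cost (normalized)
t_m = 12000                                         # global updates observed for tau = 3
cost_per_global_update = omega_time * tau * batch_size * (1 + mu) + comm_params_per_round
total_cost = t_m * cost_per_global_update
print(comm_bytes_per_round, cost_per_global_update, total_cost)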
5 Conclusions
In this paper, the main work is the implementation of pseudo-label based semi-supervised learning in a distributed machine learning framework. The local loss function of each distributed learning worker is studied, and the SGD-based parameter update equation is derived. The GoogLeNet based target classification model is evaluated using the CIFAR-10 dataset. Results show that the model converges when the number of local update steps between every two global aggregations is less than 5 and the proportion of the pseudo-label dataset is 20%. Further evaluation under the condition of τ = 3 shows that increasing the proportion of the pseudo-label dataset slows down the convergence rate and reduces the accuracy. But even when μ = 80%, the model still converges.