
Semantic segmentation method of road scene based on Deeplabv3+ and attention mechanism

2021-12-21

BAI Yanqiong, ZHENG Yufu, TIAN Hong

(School of Electronic and Information Engineering, Lanzhou Jiaotong University, Lanzhou 730070, China)

Abstract: In the study of automatic driving, understanding the road scene is key to improving driving safety. Semantic segmentation divides an image into regions associated with semantic categories at the pixel level, helping the vehicle perceive the surrounding road environment and thus improving driving safety. Deeplabv3+ is a currently popular semantic segmentation model, but in its segmentation tasks small targets are missed and similar objects are easily misjudged, which leads to rough segmentation boundaries and reduces semantic accuracy. Focusing on this issue, this study takes the Deeplabv3+ network structure as the basis and combines it with the attention mechanism to increase the weight of the segmented regions, proposing an improved Deeplabv3+ road scene semantic segmentation method that fuses the attention mechanism. First, a group of parallel position attention and channel attention modules is introduced at the Deeplabv3+ encoding end to capture more spatial context information and high-level semantic information. Then, an attention mechanism is introduced at the decoding end to restore spatial detail information, and the data are normalized to accelerate the convergence of the model. The segmentation effects of models with different attention mechanisms are compared on the CamVid and Cityscapes datasets. The experimental results show that the mean Intersection over Union of the improved model is boosted by 6.88% and 2.58% on the two datasets, respectively, outperforming Deeplabv3+. The method does not significantly increase the amount of network computation or complexity, and achieves a good balance between speed and accuracy.

Key words: autonomous driving; road scene; semantic segmentation; Deeplabv3+; attention mechanism

0 Introduction

Automatic driving technology requires vehicles to understand, as a human driver does, the relationships among traffic participants such as pedestrians, vehicles and obstacles, and to respond to complex traffic environments. Semantic segmentation is one of the most commonly used methods in the visual perception tasks of automatic driving: it divides the target areas of interest in an image at the pixel level, labels different objects according to their semantic categories, and thereby obtains an image with pixel-wise semantic annotation[1-2].

In recent years, the segmentation accuracy and speed of deep-learning-based segmentation methods have improved significantly. In 2015, the Fully Convolutional Network (FCN)[3] was proposed, which implemented an end-to-end segmentation method supporting input images of any size. Since then, applying convolutional neural networks to semantic segmentation has become mainstream and produced good results[4]. In order to reduce the loss of spatial information caused by down-sampling and pooling operations, which degrades the segmentation effect of the model, an encoder-decoder network was proposed in Ref.[5] that uses a decoder to recover the detailed information of image features. For example, the SegNet network proposed in Ref.[6] and the U-Net network proposed in Ref.[7] both use the encoder-decoder structure to capture rich spatial information. DeepLabv3+ was proposed in Ref.[8]; it adds a simple and effective decoding module to the DeepLabv3 network[9], which allows it to capture sufficient spatial information in the shallow layers, helps the model recover target details, and achieves a good segmentation effect.

In computer vision tasks, the attention mechanism (AM)[10] is a problem-solving approach proposed by imitating human attention, and it has been applied in many tasks. For example, Ref.[11] introduced the AM into FCN and proposed the Dual Attention Network (DANet), which achieved ideal results in scene segmentation tasks. In addition, the AM can be inserted after one or more layers as an operator to recognize important features in the image.

The currently popular Deeplabv3+ semantic segmentation model captures high-level semantic information through an encoder and recovers spatial detail information through a decoder, and performs relatively well in segmenting complex and diverse urban road scene images. However, problems remain in local details, such as rough boundary segmentation, small objects being ignored, and misjudgment of objects with similar shapes. To solve these problems, this study uses Deeplabv3+ as the basic network structure, introduces the AM, and proposes a road scene image semantic segmentation method that integrates the attention mechanism. Attention weights are calculated and assigned through the AM to guide feature learning. The position attention module (PAM) captures the spatial dependence between any two positions in the feature map, so that any two similar position features can reinforce each other; the similarity affects the allocation of attention weights, and the feature at a specific position is then determined by a weighted sum over the features at all positions. The channel attention module (CAM) captures the correlation between the features of any two channels and enhances the feature of each channel through a weighted sum. The outputs of the two attention modules are then integrated to enhance the feature representation of the feature map. Finally, the batch normalization (BN)[9] operation is used to normalize the data in order to obtain refined output results.

1 Network architecture

1.1 Deeplabv3+ network structure

Deeplabv3+ is a typical network framework for semantic segmentation, developed on the basis of Deeplabv1-3. First, Deeplabv1[12] uses atrous convolution to explicitly control the resolution of feature responses in deep convolutional neural networks. It reduces down-sampling operations while enlarging the network's receptive field to obtain dense feature maps, but its performance on multi-scale segmentation is poor. To overcome this shortcoming, Deeplabv2[13] uses the Atrous Spatial Pyramid Pooling (ASPP) structure, with multiple sampling rates and effective fields of view, to segment targets at multiple scales. To improve segmentation accuracy, Deeplabv3 uses image-level features to enhance the ASPP module, which captures longer-range information and merges global context information; it also introduces the BN operation to facilitate training. Deeplabv3+ then adds a simple and efficient decoder module to Deeplabv3 to optimize target boundary segmentation results through end-to-end training. Compared with Deeplabv3, the encoder-decoder structure of Deeplabv3+ can arbitrarily control the resolution of extracted features through atrous convolution, improving the image segmentation effect and achieving a balance between speed and accuracy. The DeepLabv3+ network has achieved ideal results on many public datasets, such as PASCAL VOC2012 and Cityscapes. The network structure is shown in Fig.1.

Fig.1 Deeplabv3+ network structure
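As a concrete illustration of the atrous convolution on which the whole Deeplab family rests, the following minimal sketch (PyTorch is assumed here; the paper does not name its framework) shows how a dilation rate enlarges the receptive field of a 3×3 convolution without reducing the feature-map resolution:

```python
import torch
import torch.nn as nn

# A 3x3 convolution with dilation rate r samples the input with gaps of
# r - 1 pixels between taps, enlarging the receptive field to
# (2r + 1) x (2r + 1) without extra parameters or down-sampling.
x = torch.randn(1, 256, 64, 64)  # (batch, channels, H, W)
atrous = nn.Conv2d(256, 256, kernel_size=3, dilation=6, padding=6)
y = atrous(x)                    # padding = dilation keeps H and W unchanged
print(y.shape)                   # torch.Size([1, 256, 64, 64])
```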

The encoder mainly uses the Xception and ASPP modules for feature extraction. Xception is a deep convolutional neural network (DCNN) that serves as the backbone. To increase the receptive field of the network, the ASPP module first applies a 1×1 convolution to compress the feature map, and uses 3×3 atrous convolutions with expansion rates of 6, 12 and 18 to learn multi-scale features in parallel. This not only reduces down-sampling operations but also obtains more context information by using a global average pooling layer to capture global information, so that the boundary information of the segmented targets in the feature map can be captured; multi-scale targets are thereby segmented and the segmentation effect improved. The feature maps output by the ASPP module are then concatenated and compressed by a 1×1 convolution, and finally the high-level feature maps are output.
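A minimal sketch of the ASPP module just described is given below (PyTorch assumed; BN and ReLU after each branch are omitted for brevity, and the channel counts are illustrative):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ASPP(nn.Module):
    """1x1 conv, three 3x3 atrous convs (rates 6, 12, 18) and a global
    average pooling branch, concatenated and compressed by a 1x1 conv."""
    def __init__(self, in_ch=2048, out_ch=256):
        super().__init__()
        self.conv1x1 = nn.Conv2d(in_ch, out_ch, 1)
        self.atrous = nn.ModuleList(
            [nn.Conv2d(in_ch, out_ch, 3, padding=r, dilation=r) for r in (6, 12, 18)])
        self.pool_conv = nn.Conv2d(in_ch, out_ch, 1)     # after global average pooling
        self.project = nn.Conv2d(5 * out_ch, out_ch, 1)  # compress the 5 branches

    def forward(self, x):
        h, w = x.shape[2:]
        feats = [self.conv1x1(x)] + [conv(x) for conv in self.atrous]
        g = self.pool_conv(F.adaptive_avg_pool2d(x, 1))  # global context branch
        feats.append(F.interpolate(g, size=(h, w), mode="bilinear",
                                   align_corners=False))
        return self.project(torch.cat(feats, dim=1))
```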

In the decoder, to prevent the low-level features from containing more channels than the encoder output features, a 1×1 convolution is applied to the features output by the Xception module to reduce the number of low-level feature channels. A bilinear interpolation up-sampling operation is performed on the high-level features output by the encoding end, and the result is fused with the low-level features after the 1×1 convolution to strengthen the recovery of the boundary information of the target. A 3×3 convolution is then applied to the fused features to restore the details and spatial information of the feature map. After a final bilinear interpolation up-sampling operation, the segmentation image is obtained.
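The decoding steps just described can be sketched as follows (again a hedged PyTorch illustration; the 48-channel reduction and the up-sampling factor follow the standard Deeplabv3+ design, and the class count is illustrative):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class Decoder(nn.Module):
    """Reduce low-level channels with a 1x1 conv, upsample the encoder
    output, concatenate, refine with a 3x3 conv, then upsample to input size."""
    def __init__(self, low_ch=256, high_ch=256, num_classes=11):
        super().__init__()
        self.reduce = nn.Conv2d(low_ch, 48, 1)  # shrink low-level channels
        self.refine = nn.Sequential(
            nn.Conv2d(48 + high_ch, 256, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(256, num_classes, 1))

    def forward(self, high, low):
        high = F.interpolate(high, size=low.shape[2:], mode="bilinear",
                             align_corners=False)
        x = self.refine(torch.cat([self.reduce(low), high], dim=1))
        # final 4x bilinear up-sampling back to the input resolution
        return F.interpolate(x, scale_factor=4, mode="bilinear",
                             align_corners=False)
```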

1.2 Attention mechanism

The AM[11] can be understood as a resource allocation mechanism that redistributes resources according to the importance of the attended objects. In computer vision, the resource allocated by the AM is the weight, which is obtained from the high-level feature map containing rich semantic information and the low-level feature map containing global context information. This study introduces the PAM and the CAM to better capture context information in the channel and spatial dimensions and thereby improve the segmentation effect of the model.

1.2.1 Position attention module

The PAM[11] selectively aggregates the features at each position through a weighted sum of the features at all positions. Similar features are related to each other no matter how far apart they are, so linking the correlation between any two features can enhance the representation of each. Enhancing and distinguishing feature representations is very important in scene understanding tasks. The PAM can establish rich context relationships on local features and encode more context information into local features, thereby enhancing their representation capabilities. The workflow of the PAM is shown in Fig.2.

Fig.2 Position attention module

As shown in Fig.2, the local feature matrix M is obtained through the backbone network, M∈R^{C×H×W}. First, convolution operations on M generate two new feature matrices X and Y, {X, Y}∈R^{C×H×W}. The dimensions of X and Y are then reshaped so that {X, Y}∈R^{C×N}, where N = H×W is the number of pixels. Finally, the transposed matrix X^T is multiplied with Y, and a softmax layer is used to calculate the spatial attention map S, S∈R^{N×N}. The calculation process is expressed as[11]

$$s_{ji} = \frac{\exp(X_i \cdot Y_j)}{\sum_{i=1}^{N} \exp(X_i \cdot Y_j)} \qquad (1)$$

where s_{ji} is the impact factor of the i-th position on the j-th position; the more similar the feature representations of two positions, the stronger the correlation and the influence between them. X_i is the i-th position element of matrix X, X∈R^{C×N}, and Y_j is the j-th position element of matrix Y, Y∈R^{C×N}.

At the same time, a convolution operation is applied to M∈R^{C×H×W} to obtain a new feature matrix Z, Z∈R^{C×H×W}, whose dimension is reshaped to Z∈R^{C×N}. The reshaped matrix Z is then multiplied with the transposed matrix S^T, and the result is reshaped back to R^{C×H×W}. This result is multiplied by the scale parameter α and summed with M to obtain the final matrix P, P∈R^{C×H×W}. The initial value of α is 0, and the weight it assigns is learned gradually during training. The final output is expressed as[11]

$$P_j = \alpha \sum_{i=1}^{N} (s_{ji} Z_i) + M_j \qquad (2)$$

where Z_i is the i-th position element of matrix Z. It can be seen from Eq.(2) that the output feature P at each position aggregates the features of all positions in addition to its original feature, so the network does not lose the original feature information even if it has not learned new features. Guided by the spatial attention map S, the PAM selectively aggregates context information, captures global information, and uses the similarity between semantic features to improve intra-class compactness and semantic consistency.
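A minimal PyTorch sketch of the PAM following Eqs.(1)-(2) is shown below; unlike DANet, which reduces the query/key channels, this sketch keeps all C channels to match the dimensions stated above:

```python
import torch
import torch.nn as nn

class PAM(nn.Module):
    """Position attention module, Eqs. (1)-(2)."""
    def __init__(self, ch):
        super().__init__()
        self.query = nn.Conv2d(ch, ch, 1)          # produces X
        self.key = nn.Conv2d(ch, ch, 1)            # produces Y
        self.value = nn.Conv2d(ch, ch, 1)          # produces Z
        self.alpha = nn.Parameter(torch.zeros(1))  # scale, initialized to 0
        self.softmax = nn.Softmax(dim=-1)

    def forward(self, m):                          # m: (B, C, H, W)
        b, c, h, w = m.shape
        n = h * w
        x = self.query(m).view(b, c, n).permute(0, 2, 1)  # (B, N, C) = X^T
        y = self.key(m).view(b, c, n)                     # (B, C, N)
        s = self.softmax(torch.bmm(x, y))                 # (B, N, N), Eq.(1)
        z = self.value(m).view(b, c, n)                   # (B, C, N)
        out = torch.bmm(z, s.permute(0, 2, 1)).view(b, c, h, w)
        return self.alpha * out + m                       # Eq.(2)
```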

1.2.2 Channel attention module

The CAM[11] reallocates resources among convolutional channels. Each convolution kernel in each layer of a convolutional network corresponds to a feature channel, and the channel map of each high-level feature can be regarded as a class-specific response, with different semantic responses related to each other. The interdependence between channel mappings is used to learn the correlation of the channel feature maps and improve the representation of specific semantic features. By introducing the CAM, the interdependence among channel maps can be made explicit, feature maps with weak correlation can be adjusted according to their degree of dependence, and more useful information can be obtained. The workflow of the CAM is shown in Fig.3.

Fig.3 Channel attention module

The CAM first reshapes the feature matrix M, M∈R^{C×H×W}, to M∈R^{C×N}, where N = H×W is the number of pixels. The reshaped matrix M is then multiplied with its transpose M^T, and the channel attention map F, F∈R^{C×C}, is obtained through a softmax layer. The calculation process is expressed as[11]

$$f_{ji} = \frac{\exp(M_i \cdot M_j)}{\sum_{i=1}^{C} \exp(M_i \cdot M_j)} \qquad (3)$$

where f_{ji} is the impact factor of the i-th channel on the j-th channel. The transposed matrix F^T is multiplied with the reshaped matrix M, and the result is reshaped back to R^{C×H×W}. This result is then multiplied by the scale parameter β and summed with M to obtain the final output P, P∈R^{C×H×W}, where β is initialized to 0 and its weight is learned gradually. The final output is expressed as[11]

$$P_j = \beta \sum_{i=1}^{C} (f_{ji} M_i) + M_j \qquad (4)$$

It can be seen from Eq.(4) that the final feature P of the channel corresponding to each convolution kernel integrates the features of all channels with the local features obtained from the original backbone network. The CAM can use the correlation and dependence between the spatial information of all channels to readjust the feature map and enhance the discriminability of features.
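A matching PyTorch sketch of the CAM following Eqs.(3)-(4):

```python
import torch
import torch.nn as nn

class CAM(nn.Module):
    """Channel attention module, Eqs. (3)-(4)."""
    def __init__(self):
        super().__init__()
        self.beta = nn.Parameter(torch.zeros(1))  # scale, initialized to 0
        self.softmax = nn.Softmax(dim=-1)

    def forward(self, m):                                  # m: (B, C, H, W)
        b, c, h, w = m.shape
        flat = m.view(b, c, -1)                            # (B, C, N)
        f = self.softmax(torch.bmm(flat, flat.permute(0, 2, 1)))  # (B, C, C), Eq.(3)
        out = torch.bmm(f.permute(0, 2, 1), flat).view(b, c, h, w)
        return self.beta * out + m                         # Eq.(4)
```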

1.3 Improved Deeplabv3+ network structure

To increase the receptive field of the network, the Deeplabv3+ network uses atrous convolutions with different expansion rates to replace down-sampling operations. However, atrous convolution samples the feature map discretely, and the lack of correlation between sampling points produces a gridding effect[14-15]. Discrete sampling achieves good results when obtaining the semantic information of large targets, although some local information is still lost. For small targets, when the expansion rate is set too large, the learned features lack relevance because of the large sampling interval, and wrong feature information may be learned, which makes segmentation boundaries unclear and causes misjudgments.

To reduce the impact of the gridding effect on segmentation accuracy, a group of parallel PAM and CAM is added after the Xception module at the Deeplabv3+ encoding end. These modules capture global context information and high-level semantic information to enhance the representation of the output feature map. The outputs of the two attention modules are transformed by convolution and then summed element-wise to fuse their output features. To further refine the output, PAM and CAM are also added after the 3×3 convolution operation at the decoding end; these attention modules model the interdependence of semantic information in the spatial and channel dimensions, capturing global context information and enhancing detailed features. The feature map is then restored to its original size through a 4× bilinear up-sampling operation, and the refined segmentation result is output. The improved Deeplabv3+ network model structure is shown in Fig.4.

Fig.4 Deeplabv3+ network structure with attention mechanism

The feature maps in Fig.4 are processed by PAM and CAM respectively, their outputs are fused after convolution transformation, and the result is used as the input to subsequent operations to obtain a more refined segmentation result. To further refine the output, a BN operation is applied to the output of each attention module, which speeds up training, improves the generalization capability of the network, accelerates model convergence, and prevents gradients from vanishing or exploding.
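The parallel wiring of Fig.4 can be sketched as follows, reusing the PAM and CAM classes above; the exact convolution/BN placement is an assumption based on the description in the text, not the paper's definitive implementation:

```python
import torch.nn as nn

class ParallelAttention(nn.Module):
    """Parallel PAM + CAM branches: convolve and batch-normalize each
    branch output, then fuse them by element-wise summation."""
    def __init__(self, ch):
        super().__init__()
        self.pam, self.cam = PAM(ch), CAM()  # modules sketched in Sect. 1.2
        self.conv_p = nn.Sequential(nn.Conv2d(ch, ch, 3, padding=1),
                                    nn.BatchNorm2d(ch), nn.ReLU(inplace=True))
        self.conv_c = nn.Sequential(nn.Conv2d(ch, ch, 3, padding=1),
                                    nn.BatchNorm2d(ch), nn.ReLU(inplace=True))

    def forward(self, x):
        return self.conv_p(self.pam(x)) + self.conv_c(self.cam(x))
```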

2 Results and discussion

2.1 Experimental parameters

2.1.1 Datasets and experimental environment configuration

The CamVid dataset[16] is a public dataset with rich objects and fine semantic annotation. Most of its images are taken from the perspective of a driving car with on-board cameras, and it contains a large number of high-quality, high-resolution color video frames of various urban driving scenes. The CamVid dataset is divided into 11 semantic categories, including road, building, sky, tree, sidewalk, car, pole, symbol and fence. The dataset contains 701 finely annotated images, of which 367 are used for training, 233 for testing and 101 for validation.

The Cityscapes dataset[17] is a publicly available large-scale urban road scene dataset. It consists of video frames taken on the streets of 50 different cities under different times and weather conditions. The dataset contains 5 000 images with high-quality pixel-level annotations, of which 2 975 are used for training, 500 for testing and 1 525 for validation. It also contains 20 000 coarsely labeled images; the experiments in this study use only the 5 000 finely labeled images to verify the effectiveness of the improved model. The Cityscapes dataset defines a total of 30 target categories from the perspective of frequency and practicality, mainly divided into 8 semantic groups: flat, construction, nature, vehicle, sky, object, human and void. The machine environment configuration of the experiments in this study is shown in Table 1.

Table 1 Experimental environment configuration

2.1.2 Experimental procedure

To verify the segmentation performance of the improved network model, under the same experimental environment this study tested six network models: Deeplabv3+, Deeplabv3+ with PAM, Deeplabv3+ with CAM, Deeplabv3+ with PAM and CAM in series, Deeplabv3+ with PAM and CAM in parallel, and Deeplabv3+ with PAM and CAM in parallel plus the BN operation. The segmentation results of Deeplabv3+ on the CamVid and Cityscapes datasets were used as the baseline for comparing the segmentation performance of each network model.

2.1.3 Evaluation indexes

In semantic segmentation, the intersection over union (IoU)[18-20] describes the overlap ratio between the predicted area and the ground-truth area of an image; the more the areas overlap, or the fewer the misjudged areas, the larger the IoU value. Pixel accuracy (PA) is the ratio of correctly classified pixels to total pixels. This study uses the mean intersection over union (mIoU) and mean pixel accuracy (mPA) as the standards to measure the segmentation accuracy of the network model before and after the improvement; the larger the values of mIoU and mPA, the better the segmentation effect of the model.
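Both metrics can be computed from a class confusion matrix; a small NumPy sketch follows (the 1e-10 smoothing term and the void-label handling are implementation choices of this sketch, not from the paper):

```python
import numpy as np

def miou_mpa(pred, label, num_classes):
    """mIoU and mPA from a confusion matrix; pred and label are
    integer class maps of the same shape."""
    mask = label < num_classes  # ignore void / unlabeled pixels
    hist = np.bincount(num_classes * label[mask] + pred[mask],
                       minlength=num_classes ** 2).reshape(num_classes, num_classes)
    tp = np.diag(hist)                        # correctly classified pixels per class
    gt, pr = hist.sum(axis=1), hist.sum(axis=0)
    iou = tp / (gt + pr - tp + 1e-10)         # per-class IoU
    pa = tp / (gt + 1e-10)                    # per-class pixel accuracy
    return iou.mean(), pa.mean()
```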

2.2 Results on CamVid dataset

2.2.1 Comparison of model segmentation effect before and after improvement

In the experiment, the number of training iterations of each network model was set to 30 000, the batch_size was set to 2, and the initial learning rate was set to 1e-4. To verify the segmentation effect and speed of the improved model, the mIoU value, the mPA value and the average time required to generate one predicted image were compared for each model. The statistical results are shown in Table 2.

Table 2 Segmentation effect of different models on CamVid

It can be seen from the mIoU and mPA values in Table 2 that introducing PAM and CAM improves the segmentation effect to varying degrees, and that the parallel structure achieves a better segmentation effect than the other models. From the average time each model takes to generate a predicted image, adding the attention modules hardly increases the prediction time. After the BN operation is added, the processing time of the model increases slightly, with the average time per predicted image increasing by 59 ms; at the same time, however, the corresponding mIoU and mPA values increase by 0.72% and 1.08%, respectively. Compared with the improvement in accuracy, the increase in prediction time is negligible. Fig.5 shows the segmentation results of the model on the CamVid dataset when PAM and CAM are added separately.

Fig.5 Comparison of visualization results of models with different attention modules on CamVid. (a) Original image;(b) Without attention module;(c) With CAM;(d) With PAM

2.2.2 Impact of batch_size on segmentation accuracy

Considering that the batch_size in deep learning has a certain impact on segmentation results, this study set the batch_size to 1, 2, 4 and 8 in experiments under the improved network structure and recorded the corresponding mIoU values. The results are shown in Table 3.

Table 3 Segmentation effect of model under different batch_size

It can be seen from Table 3 that as the batch_size increases, the segmentation effect of the model improves to a certain extent. However, a larger batch_size occupies more video memory and requires more capable hardware, and reaching the same segmentation accuracy greatly increases the training time of the model. When the batch_size is too small, the statistics of the BN layer become significantly less accurate, making the model difficult to converge. After comprehensively comparing the experimental results under different batch_size values, this study chose a batch_size of 2 to evaluate the segmentation effect of each model; the segmentation speed and effectiveness of the improved network model were verified, and good results were obtained. A training skeleton matching these settings is sketched below.
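The sketch below is a hypothetical iteration-based training loop matching the reported settings (30 000 iterations, batch_size 2, initial learning rate 1e-4); the paper does not name its optimizer or loss function, so Adam and cross-entropy are assumptions:

```python
import torch
import torch.nn as nn

def train(model, loader, num_iters=30_000, lr=1e-4, device="cuda"):
    """loader yields (image, label) batches of size 2; void pixels are
    assumed to be labeled 255 (a common convention, not from the paper)."""
    opt = torch.optim.Adam(model.parameters(), lr=lr)  # optimizer assumed
    loss_fn = nn.CrossEntropyLoss(ignore_index=255)    # loss assumed
    model.to(device).train()
    it = 0
    while it < num_iters:
        for img, label in loader:
            img, label = img.to(device), label.to(device)
            opt.zero_grad()
            loss = loss_fn(model(img), label)
            loss.backward()
            opt.step()
            it += 1
            if it >= num_iters:
                return
```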

2.2.3 Comparison with classic networks

Common semantic segmentation models include SegNet, FCN and Deeplabv3+. The segmentation accuracies (mIoU) of these network models on the CamVid dataset are compared, and the results are shown in Table 4.

Table 4 Segmentation accuracy of several network models

It can be seen from Table 4 that the Deeplabv3+ network model integrated with the AM proposed in this study has better segmentation accuracy. The mIoU value of the improved network model on the CamVid dataset is 6.88% higher than that of the original Deeplabv3+ and 12.39% higher than that of the SegNet network. The experimental results show that the AM can capture more context and semantic information, enhance the feature representation ability, and thereby improve the segmentation effect of the model.

Fig.6 shows part of the segmentation results of the network model on the CamVid dataset before and after the improvement.

Fig.6 Different segmentation effects of model before and after improvement on CamVid

It can be seen from Fig.6 that, compared with the segmentation results of the original Deeplabv3+ model, the segmentation effect of the network model integrated with the AM is significantly improved. For example, the improved model segments the outline of a cyclist on the road more clearly, and missed judgments, such as the poles of roadside traffic lights and bicycles parked on the sidewalk, are also reduced. The colors representing the semantic categories in this experiment differ from the original labels, but this does not affect the judgment of the segmentation effect, as shown in the color legend of the semantic categories in Fig.6.

2.3 Results on Cityscapes dataset

2.3.1 Comparison of model segmentation effect before and after improvement

For the Cityscapes dataset, the initial learning rate was set to 1e-4 and the batch_size to 4. The segmentation results of each network model were recorded to verify the effectiveness of the improved model. The statistical results are shown in Table 5.

Table 5 Segmentation effect of different models on Cityscapes

It can be seen from Table 5 that on the Cityscapes dataset, compared with the Deeplabv3+ model, introducing the AM still improves the segmentation accuracy. Adding the BN operation increases the time the model takes to generate a predicted image, but its mIoU value increases by 0.51% and its mPA value by 0.81% compared with not using the BN operation. Therefore, compared with the improvement in accuracy, the time cost is negligible. Fig.7 shows the loss curves on the CamVid and Cityscapes datasets before and after the BN operation; it can be seen that the model converges more easily after the BN operation.

Fig.7 Loss curves before and after BN operation. (a) On CamVid; (b) On Cityscapes

Fig.8 shows the segmentation effect of the model when PAM and CAM are integrated. It can be seen that the model integrated with PAM segments slightly better than the one with CAM, and the model integrating both PAM and CAM segments better than Deeplabv3+.

Fig.8 Comparison of visualization results of models with different attention modules on Cityscapes. (a) Original image; (b) Without attention module; (c) With CAM; (d) With PAM

2.3.2 Comparison with classic networks

The mIoU values of SegNet, FCN, Deeplabv3+ and the improved network on the Cityscapes dataset are compared in Table 6. The experimental results show that the improved network model has a better segmentation effect.

Table 6 Segmentation accuracy of several network models

Fig.9 shows partial segmentation results of the model before and after the improvement on Cityscapes. It can be seen that the improved model alleviates, to a certain extent, the problems of small objects being missed and objects with similar shapes being misjudged. Some problems remain, however. For example, in the first image of Fig.9, the segmentation of the building misjudged as a telegraph pole is slightly improved compared with the model before the improvement, but the classification is still not correct; and in the last image, a pedestrian pushing a stroller is judged as a cyclist. These problems need to be addressed in future research.

Fig.9 Different segmentation effects of model before and after improvement on Cityscapes

3 Conclusions

Aiming at image semantic segmentation technology in the field of automatic driving, a road scene semantic segmentation method combining the attention mechanism is proposed in this study. With Deeplabv3+ as the basic network model, PAM and CAM are introduced at both the encoding and decoding ends. A parallel structure is adopted for the two attention modules to capture more context and semantic information in the spatial and channel dimensions, and the refined results are finally output.

The experimental results show that the improved network structure improves the mIoU on the CamVid and Cityscapes datasets by 6.88% and 2.58%, respectively, compared with Deeplabv3+. Although the BN operation increases the time the model takes to generate predicted images, this time cost is negligible compared with the improvement in segmentation accuracy; BN also accelerates the convergence of the model and better addresses the problems of fuzzy segmentation boundaries, misjudgment and missed judgment under the Deeplabv3+ network structure.

A limitation of this study is that although the segmentation effect of the model has improved to a certain extent, the segmentation time has not been reduced, and the segmentation result is affected by the batch_size. Future studies should reduce the network complexity and shorten the training time while maintaining the segmentation effect, so as to achieve a balance between segmentation speed and accuracy in actual driving scenes and thereby ensure the driving safety of vehicles.