
Multidimensional attention and multiscale upsampling for semantic segmentation

2022-04-18

LU Zhongda, ZHANG Chunda, WANG Lijing, XU Fengxia

(1. School of Mechanical and Electrical Engineering, Qiqihar University, Qiqihar 161000, China; 2. Heilongjiang Province Collaborative Innovation Center for Intelligent Manufacturing Equipment Industrialization, Qiqihar 161000, China)

Abstract: Semantic segmentation is a pixel-level classification task, and contextual information has an important impact on segmentation performance. In order to capture richer contextual information, we adopt ResNet as the backbone network and design an encoder-decoder architecture based on a multidimensional attention (MDA) module and a multiscale upsampling (MSU) module. The MDA module calculates attention matrices along three dimensions to capture the dependencies of each position and adaptively aggregates image features. The MSU module adopts parallel branches to capture multiscale image features, and the aggregation of multiscale features enhances contextual information. A series of experiments on the Cityscapes and Camvid datasets demonstrates the validity of the model.

Key words: semantic segmentation; attention mechanism; multiscale feature; convolutional neural network (CNN); residual network (ResNet)

0 Introduction

With the rapid development of deep learning, convolutional neural networks (CNNs) have made remarkable achievements in computer vision[1]. As one of the research hotspots in computer vision, semantic segmentation segments the different objects in an image and labels them with different annotations. Each pixel in an image is classified[2], so semantic segmentation is also referred to as a pixel-level classification task. It has been applied to many fields, such as medical image segmentation, autonomous driving, and remote sensing.

Recently, state-of-the-art semantic segmentation methods based on the fully convolutional network (FCN)[3] have achieved remarkable results. However, due to the limitation of the local receptive fields of the FCN, only short-range contextual information can be obtained. The lack of semantic information confuses the segmentation of similar objects, which degrades the segmentation performance of the network. In order to solve the above problem and obtain rich semantic information, some works capture effective contextual information to improve segmentation accuracy. For example, the pyramid scene parsing network (PSPNet)[4] uses a pyramid pooling module to capture contextual information. Some works[4-5] based on pooling fuse contextual features in a non-adaptive manner. Although this feature fusion captures information about objects of different sizes, it cannot exploit the dependencies between objects. Other works[6-7] use the atrous spatial pyramid pooling (ASPP) module with multiscale dilated convolutions to capture global semantic information. However, methods based on atrous convolutions gather information from surrounding pixels rather than dense contextual information. Peng et al.[8-9] captured rich contextual information through large convolution kernels or fine-grained network structures. Ronneberger et al.[10-11] used encoder-decoder network structures to fuse different levels of semantic information. However, using the same contextual dependency for all pixels does not satisfy the requirements: different pixels require different contextual dependencies, which is important for semantic segmentation.

In order to capture pixel-wise contextual information, Hu et al.[12-14] have focused on attention mechanisms. Current attention mechanisms are divided into two main types, channel attention and spatial attention, as shown in Figs.1(a) and (b).

Channel attention[12] obtains the dependency relationship among channels by assigning different weights to them, but it operates only on the channel dimension. Spatial attention[13] captures the spatial dependence between any two positions of the feature map in the width-height plane, with weights determined by the feature similarity between the corresponding positions. Both approaches have achieved good results. The convolutional block attention module (CBAM)[14] combines channel attention and spatial attention to capture contextual information, using two branches: one for channel attention and the other for spatial attention. However, describing three-dimensional contextual information with two-dimensional or one-dimensional similarity matrices loses some feature information. To solve this problem, we propose multidimensional attention (MDA), which assigns each location a weight that accounts for the whole three-dimensional feature map, as shown in Fig.1(c). The multidimensional attention is obtained by calculating attention matrices along the three dimensions of channel, width and height.

In our work, a residual network (ResNet) is used as the backbone. To capture rich contextual information, the multidimensional attention and multiscale upsampling network is designed on top of the ResNet. The network is an encoder-decoder architecture, where the encoder contains the MDA module and the decoder contains the multiscale upsampling (MSU) module. The MDA module introduces a self-attention mechanism to capture feature dependencies in the channel and spatial dimensions. It calculates attention matrices for the channel, width, and height dimensions of the three-dimensional feature map. The attention matrices capture the dependencies between positions according to the ratios of the different feature map dimensions. The MSU module introduces parallel branches to upsample low-resolution features to high-resolution features of different sizes, which are then fine-tuned by the nonlinear mapping of convolutional layers. Meanwhile, an auxiliary training strategy is introduced: auxiliary loss functions supervise the feature upsampling at different stages to improve segmentation performance and are discarded at the inference stage.

1 Related work

1.1 Semantic segmentation

There has been great interest in semantic segmentation in recent years. FCN[3] is a milestone in the development of semantic segmentation: convolutional layers replace the fully connected layers of deep convolutional neural networks (DCNNs) to perform pixel-level classification. The network structure of most semantic segmentation models is an encoder-decoder. The encoder downsamples to extract semantic features, and the decoder upsamples to recover the feature maps to the original image resolution. SegNet[11] applies pooling indices in the decoder to recover the image resolution, and U-Net[10] uses skip connections to fuse features from different layers. RefineNet[9] adopts a multi-path refinement structure to refine the prediction results, and DeepLab[6] adopts dilated convolutions to increase the receptive field. PSPNet[4] utilizes spatial pyramid pooling to fuse different features. The ASPP network[7] uses parallel atrous convolutions with different dilation rates and upsamples them to the same size, capturing contextual information at multiple scales.

1.2 Self-attention mechanism

The self-attention mechanism first appeared in natural language processing and is now widely used in computer vision. The squeeze-and-excitation network (SENet)[12] assigns different weights to channels by learning and suppresses features that are less useful for the task, depending on the importance of each channel. The non-local neural network[13] captures long-range dependencies of feature maps, breaking the limitation of the convolution kernel's receptive field. CCNet[15] proposes a criss-cross attention module to obtain contextual information for each pixel. The point-wise spatial attention network (PSANet)[16] relaxes the local neighborhood constraint by adaptively learning attention masks that connect different positions of the feature maps. The squeeze-and-attention network (SANet)[17] introduces an attention convolutional channel that combines pixel-group attention with convolution to increase spatial-channel dependence. CBAM[14] calculates channel-wise and spatial-wise attention to adjust the feature map. The dual attention network (DANet)[18] combines position attention and channel attention to integrate local features and global dependencies adaptively. Similarly, the object context network (OCNet)[19] aggregates contextual features by computing the similarity of each pixel to other pixels.

All the above one-dimensional and two-dimensional attention methods achieve remarkable results but compress some information. In this work, the multidimensional attention module calculates the dependencies for each position in the three-dimensional feature map.

1.3 Multiscale context

Due to the different sizes of objects in an image and the limitation of the fixed receptive field of the convolution kernel, contextual features cannot be accurately described at a single scale. Multiscale feature fusion enhances feature representation in semantic segmentation networks. FCN[3] adopts skip connections to fuse different levels of feature maps, which results in higher segmentation accuracy. Spatial pyramid pooling (SPP)[4] aggregates information from various regions and enhances contextual information. ASPP[7] shares a similar idea and adopts atrous convolutions with different dilation rates to achieve multiscale feature aggregation.

In this study, we combine the multiscale upsampling module with the auxiliary training strategy. The multiscale upsampling module captures features at different scales, and the decoder uses an auxiliary training strategy to supervise the upsampled features at different stages.

2 Approach

This section provides a detailed description of the multidimensional attention and multiscale upsampling network (MMNet). Firstly, the general framework of MMNet is introduced. Then, the MDA module is designed to obtain the dependency of each position in the feature map. Finally, the MSU module and the auxiliary training strategy are introduced.

2.1 Network architecture

ResNet has been widely applied to other vision tasks due to its strong performance in image classification. In this study, the ResNet is used as the backbone of the segmentation network. MMNet adopts an encoder-decoder structure, as shown in Fig.2.

As shown in Fig.2, the encoder extracts semantic features by downsampling and reduces the feature maps to 1/32 of the input image resolution. We design the MDA module in the encoder. The MDA calculates attention matrices along three dimensions to obtain the dependencies of each position in the feature map. The encoder explores contextual information by employing the attention mechanism to establish connections between features. The MDA enhances the feature representation for semantic segmentation by adaptively aggregating contextual information. The decoder recovers the image resolution by upsampling. In the decoder, the MSU is designed. The MSU adopts parallel branches to upsample low-resolution features and combines them with feature information from the encoder. These features are adjusted by convolution and fused. To further improve segmentation accuracy, an auxiliary training strategy is proposed by adding auxiliary segmentation heads at different decoder stages. The auxiliary training strategy assists the multiscale upsampling and enhances the feature representation, and it is discarded at the inference stage.

Fig.2 MMNet structure: MDA denotes the multidimensional attention module, MSU denotes the multiscale upsampling module, and Seg head denotes an auxiliary training strategy that is discarded when inferring.

2.2 MDA

Semantic segmentation requires capturing long-range contextual information to classify each pixel in an image. Traditional fully convolutional networks can only capture local features due to the limitation of the receptive field. Many pixels therefore lack contextual information, which can lead to misclassification. Multidimensional attention is proposed to obtain rich contextual relations in feature maps. Next, we describe the idea of multidimensional attention and its implementation.

Spatial attention enhances the connection between all the positions in both width and height dimensions, and channel attention adjusts the feature map only in the channel dimension. In contrast to the above attention approaches, multidimensional attention captures the dependency of each position in the three-dimensional space, as shown in Fig.3. For a given local feature M ∈ R^(C×H×W), the attention matrix is computed using the softmax function in each of the three dimensions of C, H and W, whose expression is

(1)

where O[i,j] is the impact of position j on position i, which helps to describe the correlation between different positions. Since the sizes of the three dimensions vary, the proportion of each dimension in the feature map M needs to be calculated. Finally, the three matrices are summed to obtain the multidimensional attention matrix N ∈ R^(C×H×W).
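To make this idea concrete, one hedged formalization is given below; the use of the simple ratios C/(C+H+W), H/(C+H+W) and W/(C+H+W) as the dimension proportions is our reading of the description above rather than a formula taken from the paper:

N = (C/(C+H+W))·softmax_C(M) + (H/(C+H+W))·softmax_H(M) + (W/(C+H+W))·softmax_W(M),

where softmax_d(·) denotes a softmax applied along dimension d only.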

Fig.3 Multidimensional attention: M denotes input matrix; N denotes multidimensional attention matrix; and C, H and W denote different dimensions.

The MDA module in the encoder is built on the multidimensional attention mechanism. As shown in Fig.4, a given local feature M ∈ R^(C×H×W) is used as input. The attention in the width dimension is computed using a 1×k convolution to generate a new feature map P′ ∈ R^(C×H×W) as

P′ = F_{1×k}(M),

(2)

where F(·) denotes the convolution operation. P′ applies a softmax operation in the width dimension only to compute the width attention map P ∈ R^(C×H×W) as

(3)

where P[i,j] is the impact of position j on position i in the same width dimension. P is then multiplied by the proportion of the width dimension among the three dimensions, which gives

(4)

where A_p is the width attention matrix; and C, H and W denote the channel, height and width dimensions, respectively.
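Operationally, a softmax "in the width dimension only" normalizes each row of the C×H×W tensor along its last axis. The following NumPy snippet illustrates this step; the W/(C+H+W) weighting is our assumption of what the "proportion of the width dimension" means.

import numpy as np

C, H, W = 4, 8, 8
p_prime = np.random.randn(C, H, W)            # feature map after the 1 x k convolution

# Softmax along the width axis only: every (c, h, :) row sums to 1.
exp = np.exp(p_prime - p_prime.max(axis=2, keepdims=True))
p = exp / exp.sum(axis=2, keepdims=True)

# Assumed proportional weighting of the width branch (hypothetical reading of the text).
a_p = (W / (C + H + W)) * p
print(p.sum(axis=2).round(3))                 # all ones
print(a_p.shape)                              # (4, 8, 8)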

The height-dimension attention and channel-dimension attention are calculated similarly. A k×1 convolution is applied in the height dimension, and a 1×1 convolution is applied in the channel dimension, to obtain the new feature maps Q′, V′ ∈ R^(C×H×W) as

Q′ = F_{k×1}(M),

(5)

V′ = F_{1×1}(M),

(6)

where Q′ applies a softmax operation in the height dimension only and V′ applies a softmax operation in the channel dimension only, to obtain the attention matrices Q, V ∈ R^(C×H×W) as

(7)

(8)

where Q[i,j] is the impact of position j on position i in the same height dimension, and V[i,j] is the impact of position j on position i in the same channel dimension. Then Q and V are multiplied by the weights of their own dimensions as

(9)

(10)

Fig.4 MDA: M denotes the input; N denotes the output; A denotes multidimensional attention matrix; C, H and W denote different dimensions; and P, Q and V denote different dimensional attention matrices.

The three attention matrices are summed to obtain the multidimensional attention matrix A ∈ R^(C×H×W). The matrix A is then multiplied by the local feature M to capture the dependencies between positions. Finally, the attention features and the input features are fused into the output N ∈ R^(C×H×W), which enhances the feature representation, that is,

(11)

where A_i denotes the attention matrices in the channel, width and height dimensions; and 1 is a constant matrix with the same dimensions as A_i, whose elements are all 1.

The dimensions of channel, width and height differ and change as the network deepens. In the shallower layers, the feature maps have larger width and height and fewer channels, so the width and height dimensions play the primary role. As the network layers deepen and downsampling proceeds, the number of channels gradually increases while the height and width decrease, so the channel dimension plays the major role in the deeper layers. The multidimensional attention module automatically adjusts the proportions of the three dimensions of attention as the network deepens.
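For illustration, a minimal PyTorch-style sketch of the MDA computation is given below. It is not the authors' implementation (the paper uses PaddlePaddle); the kernel size k, the use of the simple C/H/W ratios as the dimension weights, and the fusion N = (A + 1)⊙M are our reading of the description around Eq.(11).

import torch
import torch.nn as nn
import torch.nn.functional as F

class MDASketch(nn.Module):
    """Illustrative multidimensional attention: width, height and channel softmax
    branches, weighted by each dimension's share and fused with the input."""
    def __init__(self, channels, k=3):       # k is an assumed kernel size
        super().__init__()
        self.conv_w = nn.Conv2d(channels, channels, (1, k), padding=(0, k // 2))  # 1 x k
        self.conv_h = nn.Conv2d(channels, channels, (k, 1), padding=(k // 2, 0))  # k x 1
        self.conv_c = nn.Conv2d(channels, channels, 1)                             # 1 x 1

    def forward(self, m):                            # m: (B, C, H, W)
        _, c, h, w = m.shape
        p = F.softmax(self.conv_w(m), dim=3)         # softmax along the width axis only
        q = F.softmax(self.conv_h(m), dim=2)         # softmax along the height axis only
        v = F.softmax(self.conv_c(m), dim=1)         # softmax along the channel axis only
        s = float(c + h + w)
        a = (w / s) * p + (h / s) * q + (c / s) * v  # multidimensional attention matrix A
        return (a + 1.0) * m                         # assumed fusion: N = (A + 1) * M

x = torch.randn(2, 512, 32, 64)
print(MDASketch(512)(x).shape)                       # torch.Size([2, 512, 32, 64])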

2.3 MSU

Upsampling recovers low-resolution feature maps to high-resolution feature maps. Common upsampling methods are unpooling, bilinear interpolation and transposed convolution. Transposed convolution is similar to convolution in that both require learnable parameters. Therefore, we regard transposed convolution as a special form of feature extraction rather than merely a way to recover image resolution. Multiscale processing samples information at different granularities. In downsampling, multi-branch and spatial pyramid structures capture different receptive fields to identify objects of different sizes. In this study, transposed convolution is used for feature extraction, so the multiscale extraction of features used during downsampling can also be applied to upsampling. The MSU is proposed based on this idea as follows. MSU uses multiple parallel branches to upsample low-resolution features to different scales. The results are then adjusted by the nonlinear mapping of multiple convolutional layers. Finally, the multiscale feature maps are fused with the feature maps from the downsampling process.
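For example, in PyTorch a transposed convolution with kernel size 2 and stride 2 doubles the spatial resolution while carrying learnable weights, which is the sense in which it acts as trainable feature extraction rather than fixed interpolation (a framework-level illustration, not the paper's code):

import torch
import torch.nn as nn

# Illustration only: the transposed convolution doubles resolution with learnable weights.
up = nn.ConvTranspose2d(in_channels=256, out_channels=128, kernel_size=2, stride=2)
x = torch.randn(1, 256, 32, 64)
print(up(x).shape)                               # torch.Size([1, 128, 64, 128]): resolution doubled
print(sum(p.numel() for p in up.parameters()))   # learnable parameters, unlike bilinear interpolation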

As shown in Fig.5, given a feature map U ∈ R^(C×H×W), 8×, 4× and 2× upsamplings are performed to obtain X ∈ R^(C/8×8H×8W), Y ∈ R^(C/4×4H×4W) and Z ∈ R^(C/2×2H×2W). Here, X, Y and Z are adjusted by multiple convolution layers so that each feature map becomes twice the size of the feature map U. The adjusted feature maps are then fused with the low-level spatial features E ∈ R^(C/2×2H×2W) from the encoder, and the number of channels is reduced using a 1×1 convolution to obtain the output O ∈ R^(C/2×2H×2W). The MSU combines information from different layers and scales to enhance the feature representation of the semantic segmentation network.
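A minimal PyTorch-style sketch of this data flow is given below. It is illustrative only: the exact adjustment convolutions that bring the 8× and 4× branches back to the 2H×2W resolution and the concatenation-based fusion with E are our assumptions, and the paper's implementation uses PaddlePaddle.

import torch
import torch.nn as nn

class MSUSketch(nn.Module):
    """Illustrative multiscale upsampling: parallel 8x/4x/2x transposed-convolution
    branches, each adjusted back to (C/2, 2H, 2W) and fused with the encoder feature E."""
    def __init__(self, channels):
        super().__init__()
        c = channels
        self.up8 = nn.ConvTranspose2d(c, c // 8, kernel_size=8, stride=8)
        self.up4 = nn.ConvTranspose2d(c, c // 4, kernel_size=4, stride=4)
        self.up2 = nn.ConvTranspose2d(c, c // 2, kernel_size=2, stride=2)
        # Bring every branch to the 2x resolution with C/2 channels (one plausible choice).
        self.adj8 = nn.Conv2d(c // 8, c // 2, 3, stride=4, padding=1)
        self.adj4 = nn.Conv2d(c // 4, c // 2, 3, stride=2, padding=1)
        self.adj2 = nn.Conv2d(c // 2, c // 2, 3, stride=1, padding=1)
        self.fuse = nn.Conv2d(4 * (c // 2), c // 2, kernel_size=1)  # 3 branches + E, fused by concatenation (assumption)

    def forward(self, u, e):                  # u: (B, C, H, W), e: (B, C/2, 2H, 2W)
        x = self.adj8(self.up8(u))            # (B, C/2, 2H, 2W)
        y = self.adj4(self.up4(u))            # (B, C/2, 2H, 2W)
        z = self.adj2(self.up2(u))            # (B, C/2, 2H, 2W)
        return self.fuse(torch.cat([x, y, z, e], dim=1))

u = torch.randn(1, 512, 16, 32)               # low-resolution decoder feature U
e = torch.randn(1, 256, 32, 64)               # low-level encoder feature E
print(MSUSketch(512)(u, e).shape)              # torch.Size([1, 256, 32, 64])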

2.4 Auxiliary training strategy

The MSU module generates feature maps of different scales from low-resolution feature maps. Large upsampling multipliers introduce invalid information and reduce segmentation performance. In this study, we introduce the auxiliary training strategy to supervise the upsampled feature maps at different stages and reduce the invalid information. The segmentation head is shown in Fig.6. The segmentation heads at different stages of the decoder promote back propagation and enhance the feature representation as a regularization term, and they are discarded at the inference stage. During training, the auxiliary losses are added to the total loss of the network as

L_total = L + α∑L_aux,

(12)

where L_total denotes the total loss; α denotes the weight of the auxiliary loss, which is set to 0.3; and L_aux denotes the auxiliary loss.

Fig.6 Segmentation head: N denotes the number of segmentation categories, r denotes the upsampling ratio, and Ci and Ct denote different numbers of channels.
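A hedged sketch of an auxiliary segmentation head and the weighted loss of Eq.(12) is shown below; the internal layout of the head (a 3×3 convolution from Ci to Ct channels followed by a 1×1 classifier and bilinear upsampling by r) is our assumption based on Fig.6, and cross-entropy with an ignore label of 255 is the usual Cityscapes convention rather than a detail stated in the paper.

import torch
import torch.nn as nn
import torch.nn.functional as F

class SegHeadSketch(nn.Module):
    """Auxiliary head: Ci -> Ct -> N-class score map, upsampled by ratio r (assumed layout)."""
    def __init__(self, c_i, c_t, num_classes, r):
        super().__init__()
        self.proj = nn.Sequential(nn.Conv2d(c_i, c_t, 3, padding=1),
                                  nn.BatchNorm2d(c_t), nn.ReLU(inplace=True))
        self.cls = nn.Conv2d(c_t, num_classes, kernel_size=1)
        self.r = r

    def forward(self, x):
        logits = self.cls(self.proj(x))
        return F.interpolate(logits, scale_factor=self.r, mode='bilinear', align_corners=False)

def total_loss(main_logits, aux_logits_list, target, alpha=0.3):
    """Eq.(12): main loss plus alpha times the sum of the auxiliary losses."""
    loss = F.cross_entropy(main_logits, target, ignore_index=255)    # 255 = assumed ignore label
    aux = sum(F.cross_entropy(a, target, ignore_index=255) for a in aux_logits_list)
    return loss + alpha * aux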

3 Experiments

We conduct experiments on two challenging semantic segmentation datasets: Cityscapes and Camvid. The datasets and implementation details are given first. Ablation experiments are then performed on the Cityscapes dataset to verify the effectiveness of each module. Finally, MMNet is compared with other methods on both datasets.

3.1 Experimental settings

3.1.1 Datasets

The Cityscapes dataset is a large-scale dataset with 5 000 high-quality annotated images from 50 different cities. The dataset contains 30 labels, and only 19 categories are used for training and evaluation. It contains 2 975 images for training, 1 525 images for testing, and 500 images for validation, with an image resolution of 1 024×2 048 pixels.

The Camvid dataset is a segmentation dataset for autonomous driving with 701 images. The dataset contains 32 labels, and only 11 categories are used for training and evaluation. It contains 367 images for training, 233 images for testing, and 101 images for validation, with an image resolution of 720×960 pixels.

3.1.2 Implementation details

The experiments are conducted on Nvidia Tesla V100 GPUs with 32 GB of memory per GPU, and the deep learning framework is PaddlePaddle. The experiments employ a poly learning rate policy as

lr = base_lr × (1 - epoch/num_epoch)^power,

(13)

where lr denotes the learning rate, base_lr denotes the base learning rate, epoch denotes the current epoch, num_epoch denotes the total number of epochs, and power denotes the decay exponent. The experiments set base_lr to 0.01 and power to 0.9. Mini-batch stochastic gradient descent (SGD) is employed with a batch size of 8, a momentum of 0.9, and a weight decay of 1×10^(-4). In the experiments, we use random scaling and random flipping for data augmentation. The experiments use ResNet as the backbone and set atrous convolutions with rates of 2 and 4 in the last two blocks. Three 3×3 convolutions replace the 7×7 convolution in the first block. The mean intersection over union (mIoU) is the main evaluation metric, defined as

mIoU = (1/k)∑_i [p_ii / (∑_j p_ij + ∑_j p_ji - p_ii)],

(14)

where k denotes the number of categories, and p_ji denotes the number of pixels belonging to class j but predicted to be class i.
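For concreteness, a small Python sketch of the poly schedule and the mIoU computation from a confusion matrix is given below; the per-epoch stepping of the schedule and the use of a confusion matrix are illustrative choices, not details taken from the paper.

import numpy as np

def poly_lr(epoch, base_lr=0.01, num_epoch=200, power=0.9):
    """Poly learning rate policy: decays base_lr towards 0 over num_epoch epochs (illustrative)."""
    return base_lr * (1.0 - epoch / num_epoch) ** power

def mean_iou(conf):
    """conf[j, i] = number of pixels of class j predicted as class i (k x k matrix)."""
    inter = np.diag(conf).astype(np.float64)               # p_ii
    union = conf.sum(axis=1) + conf.sum(axis=0) - inter    # sum_j p_ij + sum_j p_ji - p_ii
    return float(np.mean(inter / np.maximum(union, 1)))    # average of per-class IoU

print(poly_lr(0), poly_lr(100))          # 0.01 at the start, roughly halved at mid-training
conf = np.array([[50, 5], [10, 35]])     # toy 2-class confusion matrix
print(round(mean_iou(conf), 4))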

3.2 Ablation studies

3.2.1 Ablation studies for MMNet

The MMNet captures rich contextual information to enhance segmentation accuracy. We set up ablation experiments to verify the performance of each module.

As shown in Table 1, the experiments utilize ResNet 50 and ResNet 101 as the backbone network to verify the performance of the MDA and MSU. Compared with the baseline (ResNet 50), the mIoU of MDA reaches 76.51%, an improvement of 7.33%. Using only the MSU, there is a 5.95% improvement over the baseline. When the network contains both modules, the mIoU reaches 77.42% and the pixel accuracy (Acc) improves to 90.62%. When MMNet uses ResNet 101 as the backbone network, the mIoU reaches a maximum of 79.93%. MMNet with a ResNet 101 backbone improves the mIoU by 2.51% over the ResNet 50 backbone, because the deeper network enhances the feature mapping. The results show that multidimensional attention and multiscale upsampling bring great benefits to scene understanding. The experiments use the number of parameters and floating-point operations (FLOPs) to measure the complexity of the network. The number of parameters of MMNet is 81.9 M (1 M = 2^20): the four MDA modules contribute 9.67 M parameters, and the four MSU modules contribute 29.64 M parameters. The parameters of the MDA are mainly concentrated in the two asymmetric convolutions and the 1×1 convolution. The MSU has multiple branches, and each branch requires a large number of nonlinear mappings for tuning, so the number of MSU parameters is large. In Table 1, the computational cost of MMNet reaches 979.93 G FLOPs (1 G = 2^30).

Table 1 Ablation studies for MMNet on Cityscapes dataset

3.2.2 Ablation studies for upsampling multiplier

The MSU module upsamples the feature maps by different multipliers. To obtain the maximum upsampling multiplier at each stage, we perform different upsampling experiments. As shown in Table 2, the five stages of upsampling recover the feature map to the resolution of the input image when the maximum upsampling multipliers of the MSU module are (32,16,8,4,2).

Table 2 Ablation studies for upsampling multipliers on Cityscapes dataset

The mIoU is lowest at 73.56% because the excessive upsampling multiplier introduces a large amount of noise. As the upsampling multiplier decreases, the mIoU keeps improving. When the upsampling multipliers are (8,8,8,4,2), the mIoU and Acc reach maxima of 79.93% and 94.35%, respectively. As the upsampling multiplier continues to decrease, the number of branches of the MSU decreases and the network obtains fewer features, resulting in a lower mIoU. We also study upsampling beyond the input image size and find that the results are not satisfactory.

3.2.3 Ablation studies for auxiliary training strategy

The MSU captures the multiscale features of the image by upsampling at different scales. However, the MSU also adds some noise. In this study, an auxiliary training strategy is used that acts as a regularization term to enhance the feature representation and improve segmentation accuracy. Ablation studies are designed with different weights to obtain the optimal weight of the auxiliary branch loss. As shown in Table 3, when α = 0, there is no auxiliary training strategy and the mIoU is 77.01%. The highest semantic segmentation accuracy is achieved when the auxiliary branch loss weight α is set to 0.3.

Table 3 Ablation studies for auxiliary training strategy on Cityscapes dataset

3.2.4 Ablation studies for MDA

The multidimensional attention mechanism calculates attention matrices in three dimensions: channel, width, and height. MDA is compared with channel attention, spatial attention and mixed attention to verify the module's performance. As shown in Table 4, channel attention (SE module) has the lowest mIoU. The reason is that semantic segmentation is a pixel-level classification task and channel attention only enhances the connections among channels, whereas the connections between positions in space are also extremely important. Compared with channel attention, spatial attention (non-local) improves the mIoU by 2.17%. The mixed attention (CBAM) combines spatial and channel attention and outperforms the first two. However, features are compressed when computing channel attention and spatial attention. MDA calculates attention matrices in three dimensions, and the mIoU reaches 78.04%. The multidimensional attention mechanism enhances the connections between positions in the three-dimensional space and obtains rich semantic information.

Table 4 Ablation studies for MDA on Cityscapes dataset

3.2.5 Ablation studies for MSU

In the decoder, the MSU captures feature information at different scales for upsampling. The MSU is compared with other upsampling methods to verify the performance of the module. The experiments use MMNet as the backbone network and set up different upsampling methods. As shown in Table 5, compared to the other two upsampling methods, mIoU is improved by 2.75% and 1.89%, respectively. Upsampling in semantic segmentation is used to recover the image resolution. The idea of MSU is to use transposed convolution as a special way of feature extraction. By extracting information at different scales and refining the adjustment using convolutional layers, rich semantic information is extracted.

Table 5 Ablation studies for MSU on Cityscapes dataset

3.2.6 Ablation studies for improvement strategies

The experiments employ a number of strategies to improve performance further. Table 6 covers data augmentation (DA) with random scaling, using a hierarchy of different dilation rates (4,8,16) in the final ResNet block (multi-grid), and averaging the segmentation probability maps over 8 image scales {0.5,0.75,1,1.25,1.5,1.75,2,2.2} at inference (multi-scale). With these strategies, the mIoU is improved by 0.81%, reaching 80.74%.
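The multi-scale inference step can be sketched as follows; model stands for any segmentation network returning per-pixel logits, and the wrapper itself is an illustration of the standard procedure rather than the authors' code.

import torch
import torch.nn as nn
import torch.nn.functional as F

@torch.no_grad()
def multi_scale_infer(model, image, scales=(0.5, 0.75, 1, 1.25, 1.5, 1.75, 2, 2.2)):
    """Average the segmentation probability maps over several rescaled copies of the input."""
    _, _, h, w = image.shape
    prob_sum = 0.0
    for s in scales:
        scaled = F.interpolate(image, scale_factor=s, mode='bilinear', align_corners=False)
        logits = model(scaled)
        logits = F.interpolate(logits, size=(h, w), mode='bilinear', align_corners=False)
        prob_sum = prob_sum + torch.softmax(logits, dim=1)
    return (prob_sum / len(scales)).argmax(dim=1)    # per-pixel class prediction

model = nn.Conv2d(3, 19, kernel_size=1)              # stand-in for MMNet with 19 Cityscapes classes
image = torch.randn(1, 3, 128, 256)
print(multi_scale_infer(model, image).shape)          # torch.Size([1, 128, 256])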

Table 6 Ablation studies for improvement strategies on Cityscapes dataset

3.3 Comparison experiments

MMNet is compared with other models on the Cityscapes and Camvid datasets to evaluate the segmentation performance of the network further. The segmentation results on the Cityscapes dataset are shown in Table 7.

Table 7 Segmentation results on Cityscapes dataset

Methods such as FPENet, SoENet, CFNet, and PCNNet have achieved impressive results. However, in these methods each position of the feature map lacks adaptive relationships with the other positions, which leads to misclassification of pixels. In contrast, MMNet uses multidimensional attention to capture the dependencies among positions. Meanwhile, the multiscale upsampling module fuses low-level spatial information with multiscale information. Richer semantic information is obtained, and the mIoU reaches 80.7%.

The visual segmentation results of MMNet on the Cityscapes dataset are shown in Fig.7. The baseline incorrectly segments the region where the truck and the trees overlap, and the details of pedestrians and bikes are poorly segmented. MMNet obtains richer contextual information by focusing on the dependencies between positions through the attention mechanism. In addition, the multiscale upsampling module adds some spatial detail information. Our method segments adjacent objects and small objects better, such as the details of bikes and pedestrians and the boundary between trucks and trees.

Fig.7 Visualization results: from top to bottom, original image, ground truth, baseline and MMNet

To further validate the effectiveness of the proposed method, we also conduct comparison experiments on the Camvid dataset. As shown in Table 8, compared with DeepLab-v2, FPENet, and other methods, MMNet achieves an mIoU of 70.6%. The above comparison experiments illustrate that the proposed MMNet model can effectively improve segmentation accuracy.

Table 8 Segmentation results on Camvid dataset

4 Conclusions

The MMNet is proposed to capture rich contextual information. The proposed MDA obtains the dependencies of each position of the feature map in three dimensions: channel, width, and height. Compared with channel attention, spatial attention and mixed attention, MDA reduces feature compression and obtains more complete dependencies. The MSU combines low-level spatial information with multiscale features. The final comparisons with other models on the Cityscapes and Camvid datasets illustrate that the proposed MMNet captures richer contextual information and improves segmentation accuracy. However, the number of parameters of MMNet is large, and making the model lightweight will be the focus of future work.