
MAAUNet: Exploration of U-shaped encoding and decoding structure for semantic segmentation of medical image

2022-11-28

SHAO Shuo, GE Hongwei

(1. Jiangsu Provincial Engineering Laboratory of Pattern Recognition and Computational Intelligence, Jiangnan University, Wuxi 214122, China; 2. School of Artificial Intelligence and Computer Science, Jiangnan University, Wuxi 214122, China)

Abstract: Medical image semantic segmentation faces multi-scale variation of segmentation targets, noise interference, rough segmentation results and slow training. To address these problems, a multi-scale residual aggregation U-shaped attention network, MAAUNet (MultiRes aggregation attention UNet), is proposed based on MultiResUNet. Firstly, an aggregate connection is introduced, which differs from the original feature aggregation at the same level. The skip connection is redesigned to aggregate features of different semantic scales in the decoder subnet, which further reduces the semantic gap that may exist across skip connections. Secondly, a convolution block attention module is added after the multi-scale convolution module to focus and integrate features along the channel and spatial attention directions, adaptively optimizing the intermediate feature map. Finally, the original convolution block is improved. The convolution channels are expanded with a series convolution structure so that they complement each other and extract richer spatial features. The residual connections are retained, and the convolution block is turned into a multi-channel convolution block that enables the model to extract multi-scale spatial features. The experimental results show that MAAUNet is highly competitive on challenging datasets, and shows good segmentation performance and stability when dealing with multi-scale input and noise interference.

Key words: U-shaped attention network structure of MAAUNet; convolutional neural network; encoding-decoding structure; attention mechanism; medical image; semantic segmentation

0 Introduction

With the development of computer vision, image segmentation has achieved superior performance on both natural images and biomedical images. In medical images, the parts that need to be segmented are often only specific regions, such as tumor regions, organ tissues and diseased regions. Unlike natural images, medical images vary in scale, are difficult to collect into datasets, and contain more noise interference. Manual inspection requires considerable expertise and depends on subjective judgment. Therefore, the development of semantic segmentation technology for medical images is an important research topic.

Early image segmentation methods commonly include region segmentation, boundary segmentation, thresholding and feature-based clustering. Although traditional segmentation methods have brought certain improvements in segmentation accuracy, they require prior knowledge, are not applicable to challenging tasks, and cannot maintain robustness. For example, the ISIC-2018[1] dataset contains skin lesion images of different scales. Fig.1 demonstrates that the scale, shape and color of skin lesions can vary greatly in dermoscopy images. Traditional segmentation methods give unsatisfactory results on images with complex shapes or unclear boundaries.

Fig.1 Variation of scale in medical images

With the popularity of deep convolutional neural networks (CNN[2]) in computer vision, CNNs were quickly adopted for medical image segmentation tasks[3]. Networks such as fully convolutional networks (FCN[4]), SegNet[5], U-Net[6], V-Net[7], ResNet[8], DDANet[9], PSPNet[10], DenseNet[11], MultiResUNet[12], U-Net++[13], DC-UNet[14] and DoubleUNet[15] are used for image and voxel segmentation in various medical imaging modalities. These methods have achieved good performance on many complex datasets, proving the effectiveness of CNNs in learning and identifying features to segment organs or diseased tissues from medical images.

The fully convolutional network (FCN)[4] was proposed to perform end-to-end image segmentation, and it outperformed the existing algorithms at the time. SegNet[5] improves on FCN with a new architecture that includes a 13-layer deep encoder to extract spatial features and a corresponding 13-layer deep decoder to give segmentation results. DeepLab[16] uses a deep CNN with a fully connected conditional random field (CRF) to refine the segmentation result. DeepLabV2[17] improves on it by using atrous convolution to reduce the degree of signal down-sampling. DeepLabV3[18] employs an atrous spatial pyramid pooling (ASPP) module to capture long-range context. DeepLabV3+[19] employs an encoder-decoder structure for semantic segmentation. U-Net[6] includes a contracting path for acquiring context and a symmetrical expanding path for precise localization. It adds skip connections to the encoder-decoder image segmentation network (such as SegNet) to improve the accuracy of the model and alleviate the vanishing gradient problem. V-Net[7] is a similar architecture that adds residual connections and replaces 2D operations with 3D operations to process 3D voxel images. It also proposes directly optimizing the Dice coefficient, a widely used segmentation metric. Some studies have developed a segmentation version of the densely connected network architecture DenseNet[11], which uses an encoder-decoder framework like U-Net.

However, these models still face the problems of variable segmentation target scale, noise interference, rough segmentation results, slow training and insufficient robustness. In response to these problems, a multi-scale residual aggregation U-shaped attention network, MAAUNet (MultiRes aggregation attention U-Net), is proposed. Extensive experiments on different medical image datasets show that MAAUNet is better than the classic U-Net model and the recent MultiResUNet model in most cases. The contributions of this article can be summarized as follows.

1) An aggregate connection is introduced, which differs from the original single feature aggregation at the same level. Skip connections are redesigned to merge features of different semantic scales at multiple levels, which further reduces the semantic gap across skip connections.

2) After the multi-scale convolution module, a convolution block attention module is added to focus and integrate features in the two attention directions of channel and space, and optimize the intermediate feature map.

3) The residual connection is improved in the original convolution block. The convolution channel is expanded with a series convolution structure. The residual connection is retained, and the multi-scale convolution block is turned into a multi-channel convolution block.

1 Prior knowledge

1.1 U-Net architecture

Fig.2 shows the U-Net network architecture, which consists of an encoder and a decoder. The encoder follows the typical structure of a convolutional network. It repeatedly applies two 3×3 convolutions, each followed by a rectified linear unit (ReLU), and then a 2×2 max pooling operation with a stride of 2 for down-sampling. In each down-sampling step, the number of feature channels is doubled. This operation is repeated four times. Each step in the decoder up-samples the feature map, applies a 2×2 up-convolution that halves the number of feature channels, and concatenates the result with the corresponding encoder feature map delivered by the skip connection. Two 3×3 convolutions, each followed by a ReLU, are then applied. Since boundary pixels are lost in each convolution, the encoder feature map must be cropped before concatenation. In the last layer, a 1×1 convolution maps each feature vector to the required number of classes.

Fig.2 U-Net architecture

In addition, the U-Net architecture introduces skip connections that transmit the output of the encoder to the decoder. These feature maps are concatenated with the output of the up-sampling operation, and the concatenated feature maps are propagated to subsequent layers. Skip connections allow the network to retrieve spatial features lost due to pooling operations.
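The channel and resolution bookkeeping described above can be traced with a short sketch. It is only an illustration under stated assumptions: the input size of 256, the base width of 64 channels and the use of padded convolutions (so that no cropping is needed) are hypothetical choices, not values given in this paper.

```python
def unet_shapes(size=256, base_channels=64, depth=4):
    """Trace (spatial size, channels) through a U-Net with padded convolutions.

    size, base_channels and depth are illustrative assumptions.
    """
    encoder = []
    channels = base_channels
    for _ in range(depth):
        encoder.append((size, channels))  # output of the two 3x3 convolutions
        size //= 2                        # 2x2 max pooling with stride 2
        channels *= 2                     # channels double at each down-sampling step
    bottleneck = (size, channels)

    decoder = []
    for skip_size, skip_channels in reversed(encoder):
        size *= 2                         # up-sampling doubles the resolution
        channels //= 2                    # 2x2 up-convolution halves the channels
        assert (size, channels) == (skip_size, skip_channels)
        # (resolution, channels after concatenation, channels after the 3x3 convs)
        decoder.append((size, channels + skip_channels, channels))
    return encoder, bottleneck, decoder


encoder, bottleneck, decoder = unet_shapes()
print(encoder)
print(bottleneck)
print(decoder)
```

The assertion inside the decoder loop makes the role of the skip connection explicit: each up-sampled map has exactly the resolution and width of the encoder map it is concatenated with, so the concatenation doubles the channel count before the two 3×3 convolutions reduce it again.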

1.2 MultiResUNet architecture

MultiResUNet revisits and redesigns the U-Net model. To handle the diversity of scale in medical images, the original convolutional layers are replaced by an Inception-like[20] block, which copes better with objects of different scales, as shown in Fig.3.

Fig.3 MultiRes block improvement process

The parallel structure in the left picture of Fig.3 is converted into the serial structure in the middle picture through reconstruction. Based on the serial structure, a residual connection is added to form the structure in the right picture of Fig.3. In this way, a series of smaller 3×3 convolutional layers replaces the larger 5×5 and 7×7 convolutional layers, and a 1×1 convolutional layer is added as a residual connection[8], which can provide some additional spatial features. This structure is called the MultiRes block.
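The replacement of the 5×5 and 7×7 layers by stacked 3×3 layers rests on a simple receptive-field argument, which the sketch below checks numerically. The equal layer width of 32 channels is a hypothetical simplification for the weight count and is not the filter allocation used by MultiResUNet.

```python
def receptive_field(kernel_sizes):
    """Receptive field (pixels per side) of stacked stride-1 convolutions."""
    field = 1
    for k in kernel_sizes:
        field += k - 1  # each stride-1 layer widens the field by k - 1
    return field


def conv_weights(kernel_size, channels_in, channels_out):
    """Number of weights in one convolutional layer, ignoring biases."""
    return kernel_size * kernel_size * channels_in * channels_out


channels = 32  # hypothetical width, assumed equal for every layer
parallel = sum(conv_weights(k, channels, channels) for k in (3, 5, 7))
serial = 3 * conv_weights(3, channels, channels)

print(receptive_field([3, 3]))     # same field as one 5x5 convolution
print(receptive_field([3, 3, 3]))  # same field as one 7x7 convolution
print(parallel, serial)            # weights: parallel 3x3/5x5/7x7 vs three 3x3 in series
```

Under this assumption the serial chain covers the same three receptive fields with roughly a third of the weights, because the outputs of the first and second 3×3 layers are reused as the 3×3-scale and 5×5-scale features.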

There may be a semantic gap between the corresponding levels of the encoder-decoder architecture. The main reason is that the feature maps obtained by the encoder are not directly compatible with the feature maps output by the decoder. To reduce this gap, convolutional layers are added along the skip connection, and the modified path is called ResPath. Instead of simply concatenating the encoder feature maps with the decoder feature maps, they are first passed through a chain of 3×3 convolutional layers with residual connections and then concatenated with the decoder features. The ResPath structure is shown in Fig.4.

The MultiRes block and ResPath are added to the U-shaped structure to form the MultiResUNet model, as shown in Fig.5.

Fig.4 ResPath structure diagram

Fig.5 MultiResUNet architecture

2 MAAUNet model

Based on MultiResUNet, the MAAUNet model introduces the aggregation connection to reduce the semantic gap, integrates the attention mechanism to optimize the intermediate feature map, and proposes a multi-channel convolution block to deal with interference at different scales. With these improvements, the model can effectively deal with scale changes and background interference, and provide more effective segmentation.

2.1 Aggregate connection

MultiResUNet reduces the semantic gap between encoder and decoder by adding ResPath at the corresponding level. To further bridge the semantic gap between the start of the encoder and the end of the decoder, an aggregate connection strategy is adopted while the original ResPath is retained. The deeper feature maps are up-sampled and fused in each layer with the low-level feature maps of the skip connection, which helps to deal with images of different scales and to further reduce the semantic gap.

This is because the deep-level feature map has more accurate semantic information, but is a coarse-grained feature map that is not conducive to recovering details. The low-level feature map has less accurate semantics, but is a fine-grained feature map that helps to restore segmentation details. Therefore, the upward aggregation connection not only makes full use of the semantic information, but can also restore fine segmentation results. The skip connection is redesigned to aggregate features of different semantic scales in the decoder sub-network, forming a highly flexible feature fusion scheme. In this way, features of different levels can be merged and integrated through feature superposition.

Specifically, the center of the U-shaped structure is filled with nodes while the original ResPath is retained. Each node concatenates the ResPath result of the previous node at the same depth with the up-sampling result of the node at the next depth, which together constitute the restoration information of this node. The specific structure is shown in Fig.6.
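The wiring rule for the filled-in nodes can be written down as a small lookup table. The sketch below is one possible reading of the description, not the authors' implementation: the grid indexing (level, column), the depth of 4 and the labels "respath" and "upsample" are all hypothetical names introduced for illustration.

```python
def aggregate_inputs(depth=4):
    """Map each non-encoder node (level, column) to its two input sources.

    Column 0 holds the encoder nodes; level 0 is the shallowest level.
    A node (level, col) exists whenever level + col <= depth.
    """
    wiring = {}
    for col in range(1, depth + 1):
        for level in range(depth - col + 1):
            wiring[(level, col)] = [
                ("respath", (level, col - 1)),       # previous node at the same depth
                ("upsample", (level + 1, col - 1)),  # node one level deeper
            ]
    return wiring


wiring = aggregate_inputs()
for node in sorted(wiring):
    print(node, "<-", wiring[node])
```

With a depth of 4 this gives ten aggregation nodes. The nodes on the diagonal (level + col = depth) play the role of the usual decoder, and the final output node (0, 4) is fed, through intermediate nodes, by features from every encoder level.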

Receptive fields of different sizes have different sensitivities to target objects of different sizes. For example, features with large receptive fields can easily identify large objects. However, in medical image segmentation, the edge information of large objects and the small objects themselves are easily lost through the down-sampling and up-sampling of a deep network. In this case, features with small receptive fields are needed to help. Therefore, aggregating features from different depths helps to deal with scale changes.

Fig.6 Diagram of aggregation connection

2.2 Convolution block attention module

The existing U-shaped structure treats all image features equally, but in fact the information in a picture is not evenly distributed and is concentrated in certain areas. To pay more attention to areas rich in informative features, the CNN model is combined with the attention mechanism to improve the segmentation performance of the model. For the channel attention mechanism, the squeeze-and-excitation module[21] was proposed earlier, which can distinguish the importance of different channels. For the spatial attention mechanism, a spatial attention module has been introduced, as in SA-UNet[22]. The module infers an attention map along the spatial dimension and multiplies it by the input feature map for adaptive feature refinement.

The convolutional block attention module (CBAM[23]) is introduced, which can refine the feature map along the channel and spatial dimensions and integrate the two dimensions. Given an intermediate feature map $F \in \mathbb{R}^{C \times H \times W}$, CBAM sequentially infers a one-dimensional channel attention map $M_c \in \mathbb{R}^{C \times 1 \times 1}$ and a two-dimensional spatial attention map $M_s \in \mathbb{R}^{1 \times H \times W}$. The overall attention process can be summarized as

$$F' = M_c(F) \otimes F,$$

$$F'' = M_s(F') \otimes F',$$

where $\otimes$ represents element-wise multiplication. During the multiplication, the attention values are broadcast accordingly: the channel attention values are broadcast along the spatial dimensions, and vice versa. $F''$ is the final refined output. Fig.7 shows the calculation process of each attention map.

Fig.7 CBAM module structure

The channel attention sub-module compresses the feature map into a one-dimensional vector along the spatial dimensions. Global maximum pooling and global average pooling are used to aggregate the spatial information of the feature map. Each pooled descriptor is passed through a shared fully connected layer, and the two results are added element-wise. Using global average pooling and global maximum pooling together extracts richer high-level features and provides more accurate information. The spatial attention sub-module takes the output of the channel sub-module, applies average pooling and maximum pooling along the channel axis, and concatenates the results into a feature descriptor. A convolutional layer is then applied to this descriptor to generate the spatial attention map.

To calculate the channel attention efficiently, the spatial dimensions of the input feature map are compressed. The most commonly used method to aggregate spatial information is average pooling, but maximum pooling also collects important clues about distinctive object features, which can be used to infer finer channel attention. The maximum pooling feature, which encodes the most salient part, can compensate for the average pooling feature, which encodes global statistics. Therefore, both average pooling and maximum pooling features are used.
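The channel attention computation described above can be sketched in plain Python on nested lists. This is a minimal illustration rather than the implementation used in the experiments: the shared layer is modelled as a two-layer perceptron with weights w1 and w2, and the sigmoid squashing follows the usual CBAM formulation; the weights in the usage example are hypothetical.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def shared_mlp(vec, w1, w2):
    """Two-layer perceptron with ReLU, applied to both pooled descriptors."""
    hidden = [max(0.0, sum(w * v for w, v in zip(row, vec))) for row in w1]
    return [sum(w * h for w, h in zip(row, hidden)) for row in w2]


def channel_attention(feature, w1, w2):
    """Return M_c(F): one weight in (0, 1) per channel of a C x H x W map."""
    avg = [sum(map(sum, ch)) / (len(ch) * len(ch[0])) for ch in feature]
    mx = [max(map(max, ch)) for ch in feature]
    avg_out = shared_mlp(avg, w1, w2)  # same weights for both descriptors
    max_out = shared_mlp(mx, w1, w2)
    return [sigmoid(a + m) for a, m in zip(avg_out, max_out)]


def apply_channel_attention(feature, weights):
    """F' = M_c(F) (x) F: each channel weight is broadcast over H x W."""
    return [[[weights[c] * v for v in row] for row in ch]
            for c, ch in enumerate(feature)]


# Usage with hypothetical weights: 2 channels reduced to 1 hidden unit.
feature = [[[1.0, 2.0], [3.0, 4.0]], [[0.0, 0.0], [0.0, 0.0]]]
w1 = [[1.0, 0.0]]    # hidden x C
w2 = [[1.0], [0.0]]  # C x hidden
weights = channel_attention(feature, w1, w2)
refined = apply_channel_attention(feature, weights)
print(weights)
```

Because the same perceptron processes both descriptors, the average-pooled and max-pooled statistics are mapped into a common space before they are summed, which is what lets the two poolings complement each other.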

Fig.8 CBAM sub-module structure diagram

The spatial relationship between features is used to generate a spatial attention map. Spatial attention and channel attention are complementary. To calculate the spatial attention, average pooling and maximum pooling operations are first applied along the channel axis, and the results are concatenated to generate an effective feature descriptor. Applying pooling operations along the channel axis can effectively highlight informative regions. A convolutional layer is applied to the concatenated feature descriptor to generate the spatial attention map $M_s(F) \in \mathbb{R}^{H \times W}$, which encodes the positions to be enhanced or suppressed.
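The spatial attention step can be sketched in the same plain-Python style. Again this is an illustration under assumptions: the kernel is passed in as a nested list of shape 2 × k × k with odd k, zero padding keeps the output at H × W, and the sigmoid follows the usual CBAM formulation; the kernel size and the weights in the usage example are hypothetical.

```python
import math


def spatial_attention(feature, kernel, bias=0.0):
    """Return M_s(F) of shape H x W for a feature map of shape C x H x W."""
    c, h, w = len(feature), len(feature[0]), len(feature[0][0])
    # Pool along the channel axis to build the 2 x H x W descriptor.
    avg = [[sum(feature[k][i][j] for k in range(c)) / c for j in range(w)]
           for i in range(h)]
    mx = [[max(feature[k][i][j] for k in range(c)) for j in range(w)]
          for i in range(h)]
    descriptor = [avg, mx]

    ksize = len(kernel[0])
    pad = ksize // 2
    attention = []
    for i in range(h):
        row = []
        for j in range(w):
            total = bias
            for d in range(2):
                for di in range(ksize):
                    for dj in range(ksize):
                        y, x = i + di - pad, j + dj - pad
                        if 0 <= y < h and 0 <= x < w:  # zero padding
                            total += kernel[d][di][dj] * descriptor[d][y][x]
            row.append(1.0 / (1.0 + math.exp(-total)))
        attention.append(row)
    return attention


# Usage with a hypothetical 1x1 kernel that sums the two pooled maps.
feature = [[[0.0, 1.0]], [[0.0, -1.0]]]  # C=2, H=1, W=2
kernel = [[[1.0]], [[1.0]]]
print(spatial_attention(feature, kernel))
```

Multiplying the resulting map element-wise with every channel of the input gives the spatially refined feature map, so positions with values near 1 are kept and positions with values near 0 are suppressed.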

After the convolution module of the network, the convolution attention module is introduced to adaptively refine the generated feature map, focus on the feature-rich channel and space, and then perform the next layer of convolution to obtain a more accurate and finer intermediate feature map.

2.3 Multi-channel block

Note that the MultiRes block contains only a simple residual connection. This residual connection provides some additional spatial features, which may not be enough for some challenging tasks. Features of different scales have shown great potential in medical image segmentation. Therefore, to overcome the problem of insufficient spatial features, a sequence of three concatenated 3×3 convolutional layers is used to expand the convolution channel in the MultiRes block, so that the two convolutional channels in series can complement each other and provide richer spatial features. The original residual connections are kept because they break the symmetry of the convolutional structure, which prevents the neural network from degenerating and improves the convergence speed during training. This block is called the multi-channel block. Its structure is shown in Fig.9.

Fig.9 Multi-channel block structure

On this basis, the overall architecture of MAAUNet is proposed. Based on MultiResUNet, aggregate connections are added to reduce the semantic gap, the convolutional block attention module optimizes the intermediate feature map, and the multi-channel block deals with scale changes. Its structure is shown in Fig.10.

Fig.10 MAAUNet structure diagram

3 Experiments

3.1 Experimental setup

The model in this article is implemented in Python using Keras with a TensorFlow backend. The operating system of the experimental platform is Linux 4.4.0, and the GPU is a GeForce RTX 2080 Ti.

To verify the effectiveness and segmentation performance of the MAAUNet model, comparative experiments against U-Net, MultiResUNet and DC-UNet are carried out on the ISIC-2018, Murphy lab[24], CVC-ClinicDB[25] and ISBI-2012 datasets.

3.1.1 Datasets

Four public datasets are selected to test the performance of four U-Net based models. The nuclei in the Murphy lab dataset are irregular in brightness, and the images often contain noticeable debris. Some images in the ISBI-2012 electron microscope dataset contain considerable interference, such as noise and other parts of the cell, which affects the ability of the model to identify boundaries. The ISIC-2018 dataset contains skin lesion images of different scales, and the shape, size and colour of the lesion areas all vary. In the colonoscopy images of CVC-ClinicDB, the boundaries of polyps are very blurred and difficult to distinguish, and the shape, size, structure and location of polyps also vary. These factors make this dataset the most challenging. Table 1 briefly describes the datasets used in the experiments.

The fluorescence microscope image dataset is collected by Murphy lab. It contains fluorescence microscope images in which the cell nuclei are manually segmented by experts. The brightness of the nuclei is irregular, and the images usually contain bright fragments, making it a challenging dataset of microscopy images.

For electron microscopy images, the dataset of the ISBI-2012 2D EM segmentation challenge is used. The images contain slight alignment errors and are corrupted by noise.

ISIC-2018 is a dermoscopy image dataset. It includes a total of 2 594 images of different types of skin lesions with expert annotations. The original and input resolutions are shown in Table 1. CVC-ClinicDB is a colonoscopy image database used in the endoscopy image experiments. The images are frames extracted from 29 colonoscopy video sequences. A total of 612 images are obtained.

3.1.2 Pre-processing/post-processing

The purpose is to study the performance of the proposed MAAUNet architecture compared with the original U-Net and MultiResUNet. Therefore, no dataset-specific pre-processing is applied. The only pre-processing is to resize the input images to fit the GPU memory, and to divide the pixel values by 255 to bring them into the range [0,1].
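The two pre-processing steps can be illustrated with a dependency-free sketch on a single-channel image stored as nested lists. The nearest-neighbour interpolation is an assumption made here for simplicity; the paper does not state which interpolation method is used for resizing.

```python
def resize_nearest(image, out_h, out_w):
    """Resize a 2D image (list of rows) by nearest-neighbour sampling.

    The interpolation method is an assumption, not taken from the paper.
    """
    in_h, in_w = len(image), len(image[0])
    return [[image[i * in_h // out_h][j * in_w // out_w] for j in range(out_w)]
            for i in range(out_h)]


def preprocess(image, out_h, out_w):
    """Resize to the network input size and scale 8-bit pixels to [0, 1]."""
    resized = resize_nearest(image, out_h, out_w)
    return [[pixel / 255.0 for pixel in row] for row in resized]


image = [[0, 255], [51, 102]]
print(preprocess(image, 2, 2))
print(preprocess(image, 4, 4)[3])
```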

3.1.3 Training

For a batch containing n images, the loss function J is

The models are trained with the Adam optimizer with parameters β1 = 0.9 and β2 = 0.999. The number of training epochs varies with the size of the dataset.
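For reference, the role of β1 and β2 can be seen in a scalar sketch of the textbook Adam update. Only β1 = 0.9 and β2 = 0.999 come from the text; the learning rate and epsilon below are assumed values, and the Keras optimizer used in the experiments may differ in small details.

```python
import math


def adam_step(param, grad, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-7):
    """One textbook Adam update for a scalar parameter (t starts at 1).

    lr and eps are assumptions; beta1 and beta2 follow the text.
    """
    m = beta1 * m + (1 - beta1) * grad         # running mean of gradients
    v = beta2 * v + (1 - beta2) * grad * grad  # running mean of squared gradients
    m_hat = m / (1 - beta1 ** t)               # bias correction for the zero start
    v_hat = v / (1 - beta2 ** t)
    param = param - lr * m_hat / (math.sqrt(v_hat) + eps)
    return param, m, v


# Usage: minimise f(x) = x^2, whose gradient is 2x.
x, m, v = 1.0, 0.0, 0.0
for t in range(1, 2001):
    x, m, v = adam_step(x, 2 * x, m, v, t)
print(x)
```

β1 controls how quickly the gradient average forgets old gradients, while β2 does the same for the squared-gradient average that scales the step size per parameter.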

3.1.4 Evaluation metric

In semantic segmentation, the target pixels occupy different proportions of the entire image. Indicators such as accuracy and recall are therefore not sufficient: they change with the proportion of background and may exaggerate segmentation performance. The Jaccard index is used instead to evaluate the image segmentation models. The Jaccard index of two sets $A$ and $B$ is defined as the ratio of the size of their intersection to the size of their union, that is, $|A \cap B| / |A \cup B|$.
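The contrast between pixel accuracy and the Jaccard index can be made concrete on binary masks. The sketch below is illustrative only; returning 1.0 when both masks are empty is a convention assumed here, not one stated in the paper.

```python
def jaccard_index(pred, truth):
    """Jaccard index |A n B| / |A u B| of two binary masks (lists of rows)."""
    intersection = union = 0
    for pred_row, truth_row in zip(pred, truth):
        for p, t in zip(pred_row, truth_row):
            intersection += 1 if (p and t) else 0
            union += 1 if (p or t) else 0
    # Assumed convention: two empty masks count as a perfect match.
    return intersection / union if union else 1.0


def pixel_accuracy(pred, truth):
    """Fraction of pixels whose label matches the ground truth."""
    pairs = [(p, t) for pr, tr in zip(pred, truth) for p, t in zip(pr, tr)]
    return sum(1 for p, t in pairs if bool(p) == bool(t)) / len(pairs)


# A 10 x 10 image with a 2 x 2 target, and a prediction that finds nothing.
truth = [[1 if (i < 2 and j < 2) else 0 for j in range(10)] for i in range(10)]
empty = [[0] * 10 for _ in range(10)]
print(pixel_accuracy(empty, truth))  # high, although the target is missed
print(jaccard_index(empty, truth))   # zero, because nothing overlaps
```

A prediction that misses the target entirely still labels 96 of the 100 pixels correctly, so accuracy rewards the large background, whereas the Jaccard index depends only on the target region.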

3.2 Results and discussion

The proposed MAAUNet is compared with U-Net, MultiResUNet and DC-UNet on four datasets: Murphy lab, ISBI-2012, ISIC-2018 and CVC-ClinicDB. The model is analysed from both quantitative and qualitative perspectives to validate its segmentation performance. The comparison results are shown in Table 2. For better readability, the fractional values of the Jaccard index have been converted to percentages (%). Bold values in Table 2 represent the best performance for each dataset.

Table 2 Comparison of experimental results of different models

3.2.1 Quantitative analysis

Compared with the classic U-Net model, the proposed model improves the Jaccard index by 2.979 7%, 7.958 0%, 5.072 8% and 6.392 9% on the Murphy lab, ISBI-2012, ISIC-2018 and CVC-ClinicDB datasets, respectively. The improvements on the ISBI-2012 and CVC-ClinicDB datasets are the largest, and the proposed model also clearly improves on U-Net for the Murphy lab and ISIC-2018 datasets.

Compared with the MultiResUNet model, the Jaccard index of the proposed model achieves improvement of 0.628 7%, 7.419 5% and 1.201 7% for Murphy lab dataset, ISBI-2012 dataset and ISIC-2018 dataset. For CVC-ClinicDB dataset only, MAAUNet seems to be equivalent to MultiResUNet. Compared with the DC-UNet model, the proposed model has also been improved on all datasets.

Quantitatively, for the Jaccard index, which is widely used in medical image segmentation, the proposed model MAAUNet has achieved considerable performance improvement on the datasets which have multi-scale input and interference of bright noise. The proposed multi-scale model with aggregation connection and attention mechanism has indeed achieved good improvements.

3.2.2 Qualitative analysis

Typical samples from the datasets are selected for qualitative analysis. As shown in Fig.11, the proposed model is more robust to images of different scales. The segmentation of boundaries, fragments and small areas is more refined, and the interference of high-brightness noise is avoided more effectively.

For example, the first row of Fig.11 shows the experiment results on the ISBI-2012 dataset. Because of variations in brightness, the original MultiResUNet segmentation results contain many messy lines inside the cells. In the segmentation results of the proposed model, these fragments and lines are filtered out and the cell interiors are cleaner.

Experiments on the ISIC-2018 dataset are shown in the second row. Segmentation results obtained by MultiResUNet are too fine and narrow due to blurred boundaries and noise interference, while MAAUNet integrates multi-scale features to obtain more overlapping lesion segmentation areas, which achieves relative improvement.

The third row of Fig.11 shows experiment results on the CVC-ClinicDB dataset. MultiResUNet incorrectly segments a small target because of the influence of other similar tissues, while MAAUNet avoids both the wrong segmentation and the interference of similar tissues.

Experiment results on the Murphy lab dataset are shown in the fourth row of Fig.11. The lower right corner of the input image contains highlighted fragments. Because of this highlight interference, MultiResUNet produces an incomplete cell segmentation that contains small cell fragments, while the proposed MAAUNet is not confused by the highlight noise and obtains a clearer and more complete cell segmentation. The qualitative analysis shows that, when faced with multi-scale input and noise interference, the MAAUNet model obtains more refined and clearer segmentation results, and its effectiveness and robustness are verified on multiple datasets.

Fig.11 Qualitative analysis. Segmentation results for different models on four datasets. Dataset (row from top to bottom): ISBI-2012, ISIC-2018, CVC-ClinicDB, Murphy Lab. Image (Column from left to right): Original image, Ground truth, MultiResUNet, MAAUNet.

3.3 Ablation experiment

To further confirm that the aggregation connection marked as ①, the convolutional block attention mechanism module marked as ② and the multi-channel module marked as ③ do play a positive role, the ablation experiments as shown in Table 3 are performed.

3.3.1 Aggregate connection

Compared with the original MultiResUNet network, adding aggregate connections improves the segmentation performance by 0.237 5%, 4.392 1% and 0.502 9% on the Murphy lab, ISBI-2012 and ISIC-2018 datasets, respectively. Adding aggregate connections on top of the attention mechanism also yields improvements of 2.023 7%, 0.131 9% and 1.067 6% on the ISBI-2012, ISIC-2018 and CVC-ClinicDB datasets. Adding aggregate connections on top of the multi-channel module still achieves performance improvements of 0.133 6%, 2.051 3% and 0.389 5% on the Murphy lab, ISBI-2012 and ISIC-2018 datasets.

Table 3 Ablation experiment results

The aggregation connection reduces the semantic gap and helps the decoder to recover low-level position information. It can be seen from Fig.11 that the original skip connection at the same level is replaced with flexible cross-level aggregation connections. The segmentation results on the electron microscope dataset, the skin disease image dataset and the cell nucleus dataset have all been improved, and comparable results are obtained on the endoscopy dataset. The aggregation connection can fuse feature information from different levels, which is more conducive to the accurate restoration of segmented images. The aggregate connection is a beneficial extension of the U-shaped structure, and is very helpful for medical image segmentation with different scales and sensitive boundary information.

3.3.2 Convolution block attention module

Adding the attention module improves the performance of the model by 0.260 6%, 4.351 1% and 0.383 3% on the Murphy lab, ISBI-2012 and ISIC-2018 datasets, respectively. Adding the attention module on top of the aggregation connection also improves the performance by 1.982 7%, 0.012 3% and 0.271 1% on the ISBI-2012, ISIC-2018 and CVC-ClinicDB datasets, respectively. Adding the attention module on top of the multi-channel module improves the segmentation performance by 1.770 6% and 0.366 8% on the ISBI-2012 and ISIC-2018 datasets.

With the convolutional block attention module inserted, the model can extract richer high-level features, provide more refined information, adaptively optimize the intermediate feature map, and obtain more accurate segmentation results. Improvements are obtained on the fluorescence microscope, dermoscopy and electron microscope datasets, while comparable results are achieved on the endoscopy dataset.

3.3.3 Multi-channel block

Using multi-channel modules improves the performance of the model by 0.306 0%, 5.435 7% and 0.174 5% on the Murphy lab, ISBI-2012 and ISIC-2018 datasets, respectively. Adding multi-channel modules on top of the aggregation connection improves the segmentation performance on all four datasets, by 0.202 1%, 3.094 9%, 0.061 1% and 0.606 0% on the Murphy lab, ISBI-2012, ISIC-2018 and CVC-ClinicDB datasets, respectively. Adding multi-channel modules on top of the attention module improves the performance by 2.855 2%, 0.158 0% and 1.557 1% on the ISBI-2012, ISIC-2018 and CVC-ClinicDB datasets, respectively.

The multi-channel convolution block has a positive effect on gradient propagation during network training. Improvements are obtained on the fluorescence microscope, electron microscope and dermoscopy datasets. With the multi-channel convolution module, the model can better extract spatial features of different scales, enrich complementary feature information and produce better segmentation results.

Finally, the aggregation connection, attention mechanism module and multi-channel convolution block are merged into the original U-shaped encoding-decoding structure, and the proposed model achieves better segmentation results.

4 Conclusions

By analyzing the architectures of the classic U-Net and the recent MultiResUNet with respect to different image scales, noise interference and other influencing factors, the aggregate connection structure, the convolution block attention module and the multi-channel convolution block are designed to better capture multi-scale features, optimize intermediate feature maps and reduce the semantic gap. On this basis, a new U-shaped architecture, MAAUNet, is proposed.

To verify the segmentation performance of the model, MAAUNet is compared with a variety of mainstream models on four public medical datasets. The experiments verify the efficiency and stability of MAAUNet in medical image segmentation. The qualitative results also show finer segmentation, with fuzzy boundaries detected more effectively and noise interference avoided.

In summary, the proposed model MAAUNet with aggregation connection and attention mechanism has indeed achieved good segmentation results. Of course, it is necessary to continue the research in lightening the model structure and improving the generalization ability of the model in the future.