
Research on single image super-resolution based on very deep super-resolution convolutional neural network

2022-09-19

HUANG Zhangyu

(Department of Electronic, Electrical and Systems Engineering, University of Birmingham, Birmingham B15 2TT, United Kingdom)

Abstract: Single image super-resolution (SISR) is a fundamentally challenging problem because a low-resolution (LR) image can correspond to a set of high-resolution (HR) images, most of which are not the desired result. Recently, SISR has been achieved by deep learning-based methods. By constructing a very deep super-resolution convolutional neural network (VDSRCNN), LR images can be improved to HR images. This study pursues two objectives: image super-resolution (ISR) and image deblurring with VDSRCNN. Firstly, for ISR, we modify different training parameters to test the performance of VDSRCNN. Secondly, we add motion blurred images to the training set to optimize the performance of VDSRCNN. Finally, we use image quality indexes to evaluate the difference between the images produced by classical methods and by VDSRCNN. The results indicate that a properly optimized VDSRCNN performs better in generating HR images from LR images.

Key words: single image super-resolution (SISR); very deep super-resolution convolutional neural network (VDSRCNN); motion blurred image; image quality index

0 Introduction

Image super-resolution (ISR) is a process which creates a high-resolution (HR) image from a low-resolution (LR) image. HR means that the pixel density within an image is high; therefore, an HR image can offer details that may be critical in various applications[1]. For ISR, the main method is super-resolution image reconstruction (SRIR). The learning-based method, one of the SRIR methods, is used in our work. Learning-based methods mainly depend on the missing high-frequency information of LR images, which can be inferred by learning from a training set of LR and HR images[2]. Very deep super-resolution (VDSR) is a learning-based method based on deep learning. In several visual recognition tasks, convolutional neural networks (CNNs) have demonstrated recognition accuracy better than or comparable to that of human beings[3]. Single image super-resolution (SISR) is a fundamentally challenging problem because an LR image can correspond to a set of HR images, most of which are not the desired result. Recently, SISR has been achieved by deep learning-based methods. By constructing a very deep super-resolution convolutional neural network (VDSRCNN), LR images can be improved to HR images. Kim et al. considered that training a very deep CNN was difficult because of its slow convergence rate; however, residual learning and extremely high learning rates can solve this problem[4-5]. Rampal et al. proposed a VDSRCNN based on residual learning, which means the network learns to estimate the residual image.

In recent years, CNNs have been extensively popular owing to their success in image classification, as stated by He et al.[6]. CNN-based SISR started from the super-resolution convolutional neural network (SRCNN) proposed by Dong et al.[7]. Lee et al. then analyzed SRCNN, which includes three convolution layers and generates super-resolution images with a higher image quality index than earlier SISR methods[8]. The three layers are composed of an image input layer, several convolutional and rectified linear unit (ReLU) layers, and a regression layer in place of a final ReLU layer. VDSRCNN is based on SRCNN. Kim et al. noted that VDSRCNN, with 20 convolutional layers, is deeper than SRCNN with its three layers[9]. Besides, Lee et al. pointed out that VDSR is extremely hard to implement in hardware at a reasonable cost, so it is usually realized through software-based optimization[8], in which three interpolation algorithms are often employed: bilinear interpolation, bicubic interpolation and nearest neighbour interpolation. Among the three classical interpolations, bicubic interpolation has the lowest operational speed but gives the best results.

The main objectives of this study are to construct a VDSRCNN using Matlab and to modify the parameters of the VDSRCNN training options to test its performance. The former is assessed by observing the image quality indexes of the test images to verify the validity of VDSRCNN. In addition, motion blur is added to the training images and HR images are then reconstructed from motion blurred test images; in the same way, we can judge the performance of the VDSRCNN optimized for motion blurred images.

1 Image super-resolution

1.1 Structure of VDSRCNN

Training a very deep CNN is difficult because of its slow convergence rate. VDSRCNN is based on residual learning, which means the network learns to estimate the residual image. A residual image is the difference between an HR image and an LR image that has been up-scaled to match the reference image size[4-5]. In this case, a residual image includes the high-frequency information of the test image.
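In symbols, writing $I_{HR}$ for the reference image and $u(I_{LR})$ for the LR image up-scaled by bicubic interpolation (notation introduced here only for illustration), the residual the network learns to predict is

$$r = I_{HR} - u(I_{LR}).$$

During reconstruction, the estimate of $r$ is simply added back to $u(I_{LR})$ to form the HR output.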

SRCNN includes three convolution layers and can generate super-resolution images with a higher image quality index than earlier SISR methods[6-8]. The three layers are an image input layer, several convolutional and ReLU layers, and a regression layer in place of a final ReLU layer. The difference between SRCNN and VDSRCNN is that SRCNN has three layers while VDSRCNN has 20 convolutional layers, which means the latter is deeper[9]. The structure of VDSRCNN is shown in Fig.1.

Fig.1 Structure of VDSRCNN

In Fig.1, circles represent image patches, F represents the input layer, and L represents the output layer. Layer 1 is the image input layer, layers 2-18 are the convolution and ReLU layers, layer 19 is the computation layer, and layer 20 is the regression layer. Except for the input/output layers, the size of each layer is 41×41.

The middle layers contain 18 convolutional and ReLU layers. Each convolutional layer includes 64 filters of size 3×3. In this study, the mini-batch size is also specified as 64. To keep the feature maps the same size as the input after each convolution, each convolutional layer is zero-padded. To break the symmetry of the deep network, the He initializer is used. The penultimate convolutional layer reconstructs the image with a single filter of size 3×3. The final layer is a regression layer; its function is to evaluate the mean square error between the residual images and the prediction of the network.
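As a rough illustration, the layer stack described above could be assembled in Matlab along the following lines. This is a sketch assuming the Deep Learning Toolbox; the layer names, the 'he' weights-initializer option and the loop structure are our own illustrative choices, not the authors' code.

```matlab
% Illustrative sketch of the VDSR layer stack described above (not the authors' exact code).
layers = imageInputLayer([41 41 1], 'Name', 'input');             % 41x41 luminance patches

for k = 1:18                                                      % 18 middle convolution + ReLU pairs
    layers = [layers
        convolution2dLayer(3, 64, 'Padding', 1, ...               % 64 filters of size 3x3, zero-padded
            'WeightsInitializer', 'he', ...                       % He initialization
            'Name', sprintf('conv%d', k))
        reluLayer('Name', sprintf('relu%d', k))];
end

layers = [layers
    convolution2dLayer(3, 1, 'Padding', 1, 'Name', 'residual')    % single 3x3 filter reconstructs the residual
    regressionLayer('Name', 'output')];                           % mean-square-error loss against the residual
```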

1.2 Image quality index

After obtaining HR images from VDSRCNN, several methods can be used to assess the quality of the image reconstruction. These methods can be divided into objective and subjective ones. The subjective methods depend on human judgement, such as visual inspection, while the objective methods are based on numerical accuracy[10-11]. The commonly used objective methods are the peak signal to noise ratio (PSNR), the structural similarity index (SSIM) and the naturalness image quality evaluator (NIQE). Research shows that the PSNR value approaches infinity as the mean square error (MSE) approaches zero, which means that a higher PSNR value indicates higher image quality[12]. PSNR is calculated by

$$\mathrm{PSNR}=10\lg\frac{(2^n-1)^2}{\mathrm{MSE}} \tag{1}$$

where PSNR is defined as the logarithm of the ratio of $(2^n-1)^2$ to the MSE between the original image and the pre-processed image, and $n$ is the number of bits of each sampling value. For two images $x$ and $y$, the SSIM can be derived by[13]

$$\mathrm{SSIM}(x,y)=\frac{(2\mu_x\mu_y+c_1)(2\sigma_{xy}+c_2)}{(\mu_x^2+\mu_y^2+c_1)(\sigma_x^2+\sigma_y^2+c_2)} \tag{2}$$

where $\mu_x$ and $\mu_y$ are the means of $x$ and $y$, $\sigma_x^2$ and $\sigma_y^2$ are their variances, $\sigma_{xy}$ is their covariance, and $c_1$ and $c_2$ are small constants that stabilize the division.

1.3 Motion blur

Motion blur is a phenomenon generated when the camera or the scene moves during the exposure time. In real life, motion blur can be observed when someone looks out of the window of a car moving at a breakneck speed.

It also appears when the camera shakes. Bora et al. pointed out that adding motion blur to an image depends on the perception of the person creating the forgery[14]. Hence, the motion blur in this study is not a natural phenomenon generated by camera shaking or other causes. To restore an image from a motion blurred image, the blur direction and the motion blurred image itself are the essential inputs[15]. To generate a deblurred test image, the blur direction and the blurred images should therefore be included in a new training set for the VDSRCNN.

2 SISR in VDSRCNN

2.1 Preparation of training data

VDSRCNN can learn the mapping between LR and HR images; therefore, it is essential to create a training data set. The training data set comprises up-sampled images and the corresponding residual images[16-18]. Considering the performance of the graphics processing unit (GPU), the training data set includes 344 images. The training data are prepared in the following steps:

1) Convert the training set images to the ones in YCbCr colour space.

2) Decrease the luminance channel Y by different scale factors to generate LR images, and then restore the processed images to their original size using bicubic interpolation.

3) Calculate the difference between the original luminance image and the restored one, which gives the residual image. One of the up-sampled images in the training set is shown in Fig.2, and the corresponding residual image is shown in Fig.3.

Fig.2 One of up-sampled images in training set

Fig.3 Corresponding residual image of Fig.2
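A minimal Matlab sketch of these three preparation steps for a single training image is given below; the file name and the scale factor of 4 are assumptions for illustration.

```matlab
% Prepare one training pair: an up-sampled LR luminance image and its residual (illustrative sketch).
I    = imread('oneTrainingImage.png');        % hypothetical file name
Iycc = rgb2ycbcr(im2double(I));               % step 1: convert to YCbCr colour space
Y    = Iycc(:,:,1);                           % keep only the luminance channel

scale = 4;                                    % example scale factor
Ylow  = imresize(Y, 1/scale, 'bicubic');      % step 2: shrink the luminance channel ...
Yup   = imresize(Ylow, size(Y), 'bicubic');   % ... and restore it to the original size

residual = Y - Yup;                           % step 3: high-frequency information lost by down-sampling
```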

In this way, the input of VDSRCNN can be obtained. However, the number of input images is not sufficient. To increase the number of input images of VDSRCNN, we specify random rotation by 90 degrees and random reflection along the x-axis. The random patch extraction datastore can then extract a large set of small image patches[19]. At every iteration of an epoch, the datastore provides one mini-batch of data to VDSRCNN.
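With the pairs of up-sampled and residual images stored in two image datastores (named dsUpsampled and dsResidual here purely for illustration), the augmented patch extraction described above might be set up roughly as follows, assuming the Image Processing Toolbox.

```matlab
% Sketch of the augmented random patch extraction (datastore names are assumed).
augmenter = imageDataAugmenter( ...
    'RandRotation', @() randi([0 3]) * 90, ...   % random rotation by multiples of 90 degrees
    'RandXReflection', true);                    % random reflection along the x-axis

patchSize = [41 41];                             % matches the 41x41 receptive field / patch size
dsTrain = randomPatchExtractionDatastore(dsUpsampled, dsResidual, patchSize, ...
    'PatchesPerImage', 64, ...                   % illustrative number of patches per image
    'DataAugmentation', augmenter);
```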

2.2 Construction of VDSRCNN

After creating the input images of VDSRCNN, the next step is to build the layers of VDSRCNN. Layer 1, the image input layer, operates on image patches. The patch size is based on the receptive field of VDSRCNN; ideally, the size of the receptive field equals that of the image patches, so that all high-frequency features can be observed within the receptive field. A network with n convolutional layers has a receptive field of (2n+1)×(2n+1). In VDSRCNN there are 20 convolutional layers, so the receptive field size is 41×41, and the image patch size is also 41×41. Because VDSR is trained only on the luminance channel, the image input layer accepts a single channel. The signal processing of VDSRCNN is shown in Fig.4.

As shown in Fig.4, $X$ represents the image data, $A_m$ represents the mapping layer plus the enhancement layer, $W_m$ is the weight matrix corresponding to the linear transformation, and $Y$ represents the output image classification result. For example, suppose that we provide the input image data $X$ and use the function $XW_i$. The mapping feature $z_i$ of group $i$ is generated by the mapping $XW_i$, where $W_i$ is a random weight coefficient with appropriate dimensions. The notation $Z^i \equiv [z_1,\dots,z_i]$ represents all mapping features of the first $i$ groups. Similarly, we denote the enhancement node $\xi([z_1,z_2,\dots,z_n]W_{h_j}+\beta_{h_j})$ in group $j$ as $h_j$, and all the enhancement nodes in group $j$ as $H^j \equiv [h_1,\dots,h_j]$. Different $i$ and $j$ can be selected according to the complexity of the modeling task.

Fig.4 Signal processing of VDSRCNN

Firstly, the feature values of the images are extracted by the constructed CNN. Then, the feature values are sent to the feature layer of VDSRCNN. Finally, the classification results are obtained at the output layer through the hidden layers.
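Putting the pieces together, training might look like the following sketch; dsTrain and layers refer to the earlier sketches, and the specific option values (learning rate schedule, epoch count) are assumptions rather than the authors' exact settings.

```matlab
% Train the VDSR network on the patch datastore (values are illustrative, not the authors' exact settings).
opts = trainingOptions('sgdm', ...
    'InitialLearnRate', 0.1, ...            % one of the settings explored in Section 2.5
    'LearnRateSchedule', 'piecewise', ...
    'LearnRateDropFactor', 0.01, ...
    'MiniBatchSize', 64, ...                % matches the mini-batch size stated above
    'MaxEpochs', 10, ...                    % assumed epoch count
    'Verbose', false);

net = trainNetwork(dsTrain, layers, opts);  % dsTrain and layers come from the earlier sketches
```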

2.3 Generation of HR test images from VDSRCNN

The stage after constructing VDSRCNN is to generate HR test images. Here, we take the Matlab image 'sherlock.jpg' as an example. High-frequency features are lost when the image is resized with a scale factor of 0.25. Fig.5 shows the LR image with a scale factor of 0.25, and Fig.6 shows the classical HR image obtained by bicubic interpolation.

Fig.5 LR image (scale factor=0.25, sherlock)

Fig.6 HR image by using bicubic interpolation (sherlock)

The VDSRCNN only uses the luminance channel because, in terms of human perception, changes in brightness are more evident than changes in colour. The residual image from VDSRCNN is shown in Fig.7.

Fig.7 Residual image by using VDSRCNN (sherlock)

By adding the residual image to the luminance channel, we obtain the HR luminance image. The HR image in YCbCr colour space is then converted to RGB colour space to give the final HR image, as shown in Fig.8.

Fig.8 HR image from VDSRCNN (sherlock)
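The reconstruction procedure of Section 2.3 might be sketched as follows, assuming a trained network net from the earlier sketches and reusing the illustrative layer name 'residual' for the last convolutional layer; the file name and scale factor are likewise placeholders.

```matlab
% Sketch of super-resolving a test image with the trained network (names and values are illustrative).
Ilow  = imread('sherlock_low.png');              % hypothetical LR test image
Iycc  = rgb2ycbcr(im2double(Ilow));
scale = 4;                                       % example up-scaling factor

Yb = imresize(Iycc(:,:,1), scale, 'bicubic');    % luminance channel goes through the network
Cb = imresize(Iycc(:,:,2), scale, 'bicubic');    % chrominance channels keep the bicubic estimate
Cr = imresize(Iycc(:,:,3), scale, 'bicubic');

residual = activations(net, Yb, 'residual');     % residual predicted by the convolutional part of the network
Yhr      = Yb + double(residual);                % add the residual to the bicubic luminance

Ihr = ycbcr2rgb(cat(3, Yhr, Cb, Cr));            % convert back to RGB for the final HR image
```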

2.4 Usage of image quality indexes to test the performance of images

Although the two HR images from bicubic interpolation (BI) and VDSRCNN have been obtained, it is hard to tell with the naked eye which image performs better. Hence, image quality indexes are a suitable way to distinguish them. Table 1 shows the difference between the images obtained by BI and VDSRCNN in one training result.

Table 1 Difference in image quality index between BI and VDSRCNN

In Table 1, NIQE is the naturalness image quality evaluator. It can be seen that for PSNR and SSIM, the larger the value, the better the performance; for NIQE, the smaller the value, the better the performance.
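In Matlab, the three indexes in Table 1 can be computed directly, assuming the Image Processing Toolbox; Iref, Ibicubic and Ivdsr are illustrative names for the ground-truth HR image and the two reconstructions.

```matlab
% Compare the two HR estimates against the ground-truth reference (illustrative variable names).
psnrBI   = psnr(Ibicubic, Iref);    psnrVDSR = psnr(Ivdsr, Iref);   % larger is better
ssimBI   = ssim(Ibicubic, Iref);    ssimVDSR = ssim(Ivdsr, Iref);   % larger is better
niqeBI   = niqe(Ibicubic);          niqeVDSR = niqe(Ivdsr);         % no-reference metric, smaller is better
```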

2.5 Modification of different parameters to test the performance of VDSRCNN

To ensure that the values of the image quality indexes are not accidental, five training runs are completed for each setting of the training options. Different training parameters are used to distinguish the image quality. Table 2 shows the image quality indexes under different training parameters. The learning rate drop factor and the initial learning rate are adjusted to obtain different image performances. The resulting image quality indexes are discussed in Section 4.

Table 2 Three image quality indexes in different training parameters
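A sketch of the parameter sweep behind Table 2 is shown below; the candidate values and the five-run loop follow the description above, while everything else (schedule, recording of the indexes) is an assumption.

```matlab
% Sketch of the parameter sweep: five runs for every (initial learning rate, drop factor) pair (values illustrative).
initRates   = [0.1 0.01];
dropFactors = [0.1 0.01];
for r = initRates
    for d = dropFactors
        for run = 1:5
            opts = trainingOptions('sgdm', 'InitialLearnRate', r, ...
                'LearnRateSchedule', 'piecewise', 'LearnRateDropFactor', d, ...
                'MiniBatchSize', 64, 'Verbose', false);
            net = trainNetwork(dsTrain, layers, opts);
            % ... super-resolve the test image with net and record psnr/ssim/niqe here ...
        end
    end
end
```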

3 Image restoration from motion blur in VDSRCNN

3.1 Modification of training data by adding motion blur

The first element that should be modified is the training set. In the original training set, none of the images contains motion blur. Because the original training set has 344 images, the motion blurred images should account for more than half of it. Hence, 200 images in the original training set are replaced with their motion blurred versions. For simplicity, the motion blur is created in the horizontal direction.

Figs.9 and 10 show one of the original images and the same image with motion blur in the new training set, respectively.

Fig.9 One of original training images

Fig.10 One of original training images with motion blur
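The horizontal motion blur can be simulated with a linear motion point-spread function, for example as follows; the blur length of 21 pixels and the file name are assumed values.

```matlab
% Add horizontal motion blur to a training image (blur length is an assumed example value).
I     = imread('oneTrainingImage.png');   % hypothetical file name
h     = fspecial('motion', 21, 0);        % linear motion PSF: 21 pixels long, 0 degrees (horizontal)
Iblur = imfilter(I, h, 'replicate');      % convolve the image with the PSF
```

Applying such a blur to 200 of the 344 training images and rebuilding the datastores gives the new training set described above.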

3.2 Usage of image quality index to test the performance of deblurred image

The CNN for motion blurred images keeps the same structure as the original VDSRCNN; the only change is the training set, which now contains 200 motion blurred images. The training options are set to those that gave the best performance of VDSRCNN, namely an initial learning rate of 0.1 and a learning rate drop factor of 0.01.

After constructing the CNN for motion blurred images, we again use the image quality indexes to test the performance of the deblurred image. The difference between the motion blurred image and the deblurred image obtained by the CNN is shown in Fig.11.

Fig.11 Difference between motion blurred image and deblurred image from new VDSRCNN

Although it is hard to tell by eye whether the deblurred image is better, the image quality indexes can still quantify the performance.

4 Result and discussion

4.1 Result of ISR

From the experiment above, it can be seen that the performance of VDSR is better. Compared with the classical bicubic interpolation, the image generated by VDSRCNN has larger PSNR and SSIM values and a smaller NIQE value, and for NIQE a smaller value indicates better image quality.

To find which training parameters give the best performance of VDSRCNN, the average of the PSNR is shown in Table 3 and its variance in Table 4. When the initial learning rate is 0.1 and the learning rate drop factor is 0.01, VDSR achieves the highest average PSNR, although the variance of the PSNR is not the smallest; in other words, this setting is not the most stable one. Nevertheless, in terms of average performance, VDSRCNN performs best when the initial learning rate is 0.1 and the learning rate drop factor is 0.01.

Table 3 Average of PSNR

Table 4 Variance of PSNR

4.2 Result of motion blurred image using VDSRCNN

In the new VDSRCNN, the training parameters are set according to the above results, with an initial learning rate of 0.1 and a learning rate drop factor of 0.01. The results are shown in Table 5.

Table 5 Result of deblurred image in three image quality indexes

It is obvious that the deblurred image produced by the new VDSRCNN has a better image quality than that of the classical method. Finally, the results prove the effectiveness of the new VDSRCNN.

5 Conclusions

This study is devoted to reconstructing HR images, including from motion blurred images, by constructing a VDSRCNN and modifying the training parameters for its best performance. We obtain the best training parameters of VDSRCNN and successfully verify its effectiveness. Besides, adding motion blur to the training set yields a new VDSRCNN that includes a motion blurred training set and can deblur images with motion blur. If the structure of VDSRCNN is modified in a suitable way, it might show even better performance in generating HR images from LR images.