
A convolutional neural artistic stylization algorithm for suppressing image distortion

2021-09-15

SHEN Yu,YANG Qian,ZHANG Hongguo,WANG Lin

(School of Electronics and Information Engineering,Lanzhou Jiaotong University,Lanzhou 730070,China)

Abstract: Aiming at the problems of semantic content distortion and blurred foreground-background boundaries during the transfer process of convolutional neural image stylization, we propose a convolutional neural artistic stylization algorithm for suppressing image distortion. Firstly, the VGG-19 network model is used to extract feature maps from the input content image and style image and to reconstruct the content and style. Then the transfer from the input images to the output image is constrained to be locally affine in color space, and the Laplacian matting matrix is constructed from the local affine functions of the input image's RGB channels. For each output patch, an affine transformation maps the RGB values of the input image to the corresponding output position, which realizes the constraint of semantic content and the control of spatial layout. Finally, the synthesized image is superimposed on a white-noise image and updated iteratively with the back-propagation algorithm to minimize the loss function and complete the image stylization. Experimental results show that the method can generate images with distinct foreground and background edges and clear texture, suppress semantic content distortion, realize spatial constraint and color mapping of the transferred images, and make the stylized images visually satisfactory.

Key words: neural network; style transfer; deep learning; affine transformation

0 Introduction

Image artistic style rendering is an important research direction in the field of computer vision. Image artistic stylization means representing the content of image A as much as possible in the style of image B, turning an ordinary image into an artist-style image[1]. The key point of simulation painting is to imitate the real painting process, including stroke attribute definition, stroke arrangement strategy and stroke direction design[2-3]. However, most such methods are manually modeled and thus require professional experience and complex mathematical formulas. Deep learning has become the main research method for the artistic stylization of images and can quickly achieve artistic creation. Deep learning methods such as convolutional neural networks perform outstandingly in image feature extraction, which is highly consistent with the tasks of image style extraction and content extraction.

Gatys et al.[4-8] used convolutional neural networks to separate the content feature representation and style feature representation of an image, and independently processed the high-level abstract feature representations to realize image style transfer and obtain artistic effects. However, the edges of the transferred image were blurred, the foreground and background were not distinct, and the image content was inconsistent. Li et al.[9] integrated a whitening and coloring transformation pair into a feed-forward network to match the statistical distribution and correlation of the intermediate features of content and style, and the method showed a good effect in arbitrary style transformation. Ulyanov et al.[10-11] introduced a new learning method and replaced batch normalization with instance normalization modules. The new learning method samples from the Julesz texture ensemble with as little bias as possible, which improves stylization speed, but the quality of forward texture synthesis and image stylization is only similar to that of images generated by optimization. Huang et al.[12] proposed the adaptive instance normalization (AdaIN) layer to align the mean and variance of the content features with those of the style features and transmit feature statistics, which can realize real-time transfer of arbitrary styles, but small particles covering the semantic content appear on the generated images. Mechrez et al.[13] proposed a loss function that requires no spatial alignment; it performs feature similarity matching between regions with similar semantics and ignores the spatial information of content images. This method has a good stylization effect on images with similar features, but a poor effect on images without them. Chuan et al.[14] proposed Markovian generative adversarial networks (MGANs), which capture the feature statistics of Markovian patches and directly generate outputs of arbitrary dimension. MGANs can also decode photos directly into art paintings and improve the speed of image synthesis; however, the effect of transferring faces between two different people is poor. Liu et al.[15] combined a feed-forward transformation network and an optimization-based method with an additional depth loss function, using depth preservation as an additional loss and retaining the semantic content and layout of the content image to a certain extent. However, the color of the generated images is generally dark, which does not conform to the style image. Johnson et al.[16] defined and optimized a perceptual loss function based on high-level features extracted from a pre-trained network to improve the speed of style conversion.

According to the above research, during the image transfer process the image layout is destroyed and the boundaries between the foreground, background and other objects become blurred. As a result, not only are the semantic content and spatial layout of the image retained to a limited degree, but the generated image is also distorted and visually unsatisfactory. Therefore, a convolutional neural artistic stylization algorithm for suppressing image distortion is proposed to address these problems. We constrain the image stylization process to local affine transformations in color space and construct a Laplacian matting matrix on the RGB channels to constrain the stylized image, which helps generate images that retain their semantic content, thus preserving the coherence and spatial layout of the original content image and inhibiting distortion.

1 Image matting

The image matting matrix divides an image into foreground and background in order to extract the foreground. The matting process returns a probability value indicating whether an image region belongs to the foreground or the background, and a gradual transition appears where foreground and background interact[17]. A good matting algorithm handles hair, fine fur and other details more accurately, making the matting look more natural.

Matting is expressed by the formula $I_i = \alpha R_i + (1-\alpha) S_i$, where I_i is the currently observed ith pixel of the image, α is the matte, R is the foreground pixel, and S is the background pixel. The original image can be viewed as the foreground and background superimposed with certain weights. The matte is locally represented as a linear combination of the image color channels. For windows whose foreground and background colors are each roughly uniform, the matte has a strong normalized correlation with the image and can be generated by multiplying one of the color channels by a scaling factor and adding a constant. When the matte is constant throughout a window, it is obtained by multiplying the three channels by zero and adding a constant. Since the matte on most image windows is constant (0 or 1), the matte in these windows can be represented in a simple way as a linear function of the image. Constructing the Laplacian image matting matrix on the three channels of the image preserves the semantic content of the original content image and restrains image distortion.
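As a minimal illustration of this compositing model (the function and variable names below are ours, not from the paper), the observed image is a per-pixel blend of foreground and background weighted by the matte:

```python
import numpy as np

def composite(foreground, background, matte):
    """Compositing model I = alpha * R + (1 - alpha) * S, applied per pixel.

    foreground, background: float arrays of shape (H, W, 3) in [0, 1]
    matte: float array of shape (H, W) in [0, 1]
    """
    alpha = matte[..., np.newaxis]          # broadcast over the color channels
    return alpha * foreground + (1.0 - alpha) * background

# Example: a constant matte of 0.5 blends the two layers equally.
R = np.ones((4, 4, 3)) * 0.8                # bright foreground
S = np.ones((4, 4, 3)) * 0.2                # dark background
I = composite(R, S, np.full((4, 4), 0.5))   # every pixel becomes 0.5
```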

1.1 Gray matting

We assume that the local color distribution follows a linear color model, i.e., the color in a local window can be represented as a linear combination of two colors[18], and that R and S are approximately constant over a small window around each pixel:

$\alpha_i \approx x I_i + y, \quad \forall i \in \omega,$ (1)

where i indexes the pixels of image I and ω is the small 3×3 window around each pixel. The windows overlap, so information is shared between adjacent pixels. We need to find x, y and α that minimize the cost function; the smaller the cost function, the more accurate the matting. The cost function is

$J(\alpha, x, y) = \sum_{j \in I} \Big( \sum_{i \in \omega_j} \big( \alpha_i - x_j I_i - y_j \big)^2 + \varepsilon x_j^2 \Big),$ (2)

where ε is the coefficient of the regularization term, added to make the solution more stable, and ω_j is the small window around pixel j. Minimizing Eq. (2) over x and y leaves a cost that depends on α alone through an N×N matrix L, whose (i,j)th element is

$L(i,j) = \sum_{k \mid (i,j) \in \omega_k} \Big( \delta_{ij} - \frac{1}{|\omega_k|} \Big( 1 + \frac{(I_i - \mu_k)(I_j - \mu_k)}{\varepsilon / |\omega_k| + \sigma_k^2} \Big) \Big),$ (3)

$J(\alpha) = \alpha^{\mathrm{T}} L \alpha,$ (4)

where δ_ij is the Kronecker delta, μ_k and σ_k^2 are the mean and variance of the intensities in window ω_k, and |ω_k| is the number of pixels in the window.
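A direct, unoptimized sketch of Eq. (3) for a grayscale image (our naming; dense storage and interior-windows-only iteration are used purely for clarity, since a real closed-form matting implementation stores L sparsely):

```python
import numpy as np

def gray_matting_laplacian(img, eps=1e-7, radius=1):
    """Builds the N x N matting Laplacian of Eq. (3) for a grayscale image.

    img: float array of shape (H, W); eps: regularization coefficient;
    radius: window radius (1 gives the 3x3 windows used in the paper).
    """
    H, W = img.shape
    N = H * W
    L = np.zeros((N, N))
    win_size = (2 * radius + 1) ** 2
    # Interior windows only, for brevity; each window omega_k contributes
    # one rank-update to every pixel pair (i, j) it contains.
    for cy in range(radius, H - radius):
        for cx in range(radius, W - radius):
            ys, xs = np.mgrid[cy - radius:cy + radius + 1,
                              cx - radius:cx + radius + 1]
            idx = (ys * W + xs).ravel()     # flat indices of window pixels
            vals = img[ys, xs].ravel()      # their intensities
            mu, var = vals.mean(), vals.var()
            d = vals - mu
            # Pairwise term of Eq. (3) for every (i, j) in this window.
            term = (1.0 + np.outer(d, d) / (eps / win_size + var)) / win_size
            L[np.ix_(idx, idx)] += np.eye(win_size) - term
    return L
```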

1.2 Color matting

Applying the cost function to a color image amounts to applying the gray-level cost function separately in each channel. The linear model for pixel I_i is

$\alpha_i \approx \sum_c x^c I_i^c + y, \quad \forall i \in \omega,$ (5)

where c represents the color channel of the image.

Using the linear model of Eq. (5), the image matting cost function of an RGB image can be defined as

$J(\alpha, x, y) = \sum_{j \in I} \Big( \sum_{i \in \omega_j} \big( \alpha_i - \sum_c x_j^c I_i^c - y_j \big)^2 + \varepsilon \sum_c (x_j^c)^2 \Big).$ (6)

Similar to the gray-level case, the quadratic cost function in the unknown α is

$J(\alpha) = \alpha^{\mathrm{T}} L \alpha,$ (7)

where L is an N×N matrix whose (i,j)th element is

$L(i,j) = \sum_{k \mid (i,j) \in \omega_k} \Big( \delta_{ij} - \frac{1}{|\omega_k|} \big( 1 + (I_i - \mu_k)^{\mathrm{T}} \big( A_k + \frac{\varepsilon}{|\omega_k|} I_3 \big)^{-1} (I_j - \mu_k) \big) \Big),$ (8)

where A_k is the 3×3 covariance matrix of the colors in window ω_k, μ_k is the vector of mean colors in window ω_k, and I_3 is the 3×3 identity matrix.

The matrix L of Eqs. (3) and (8) is called the Laplacian matting matrix. If ε = 0 is used, the null space of L also includes each color channel of image I, since each color channel can be represented as a linear function of itself. The matting effect of the Laplacian matting matrix is shown in Fig.1. The standard linear system represented by the matrix L_I of the input image I can be minimized with a least-squares constraint function, and the semantic content of the stylized image can be constrained by using the matte to constrain the image style transfer process.
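Given L, the matte itself is typically recovered by adding user constraints (scribbles) and solving a sparse linear system. A sketch following the standard closed-form matting formulation (the names and the penalty weight λ are our choices, not the paper's):

```python
import numpy as np
from scipy.sparse import csr_matrix, diags
from scipy.sparse.linalg import spsolve

def solve_matte(L, scribbles, lam=100.0):
    """Minimizes a^T L a + lam * sum_{i in S} (a_i - s_i)^2.

    L: sparse N x N matting Laplacian; scribbles: length-N vector with
    1 for foreground strokes, 0 for background strokes, NaN elsewhere.
    The minimizer solves (L + lam * D) a = lam * D s, where D is the
    diagonal indicator of scribbled pixels.
    """
    known = ~np.isnan(scribbles)
    D = diags(known.astype(float))
    s = np.nan_to_num(scribbles)
    alpha = spsolve(csr_matrix(L) + lam * D, lam * (D @ s))
    return np.clip(alpha, 0.0, 1.0)
```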

Fig.1 Matting effect

2 Image stylization algorithm

Image style transfer is mainly optimized from three aspects: color control, spatial constraint and efficiency. Our neural style transfer is based on the public VGG-19 network model and focuses on spatial constraints and color mapping[20]. Deep convolutional classification networks have good feature extraction abilities, and the features extracted from different layers have different meanings. Every trained network can be regarded as a good feature extractor. A deep network is composed of layers of nonlinear functions and can be regarded as a complex multivariate nonlinear function that maps the input image to the output result. The trained deep network is used as a loss function calculator.

The high-level features extracted by the VGG-19 network model capture the objects and layout of the input image, while the low-level features express its pixel information. The style transfer process usually preserves the semantic content and spatial layout of the content image. The smaller the Euclidean distance between the extracted high-dimensional features, the more similar the content of the generated image is to the original content image. The feature expression of style at different layers has different visual effects, and fusing multi-layer features enriches the style representation, achieving an overall representation of the style.

2.1 Loss functions

Gatys et al.[4] laid the foundation for neural artistic image stylization. They separated and reconstructed the content and style of natural images, minimizing the error between the stylized image and both the style image and the content image to generate a new image of high perceptual quality. The main process of error calculation is as follows. In a feature map, each value comes from the convolution of a filter at a specific position and represents the strength of a feature. If convolution layer l has N_l filters, N_l feature maps are obtained. The total loss function of style transfer is a weighted sum:

$L_{total} = \alpha L_c + \beta L_s,$ (9)

where α represents the weight of the content loss function L_c and β represents the weight of the style loss function L_s. The total content loss between content image y and generated image z is

$L_c = \frac{1}{2 N_w N_h N_c} \sum_{l} \sum_{i,j} \big( Z[O]_{ij}^{l} - Y[I]_{ij}^{l} \big)^2.$ (10)

The total style loss between style image x and generated image z is

$L_s = \sum_{l} \frac{1}{4 N_c^2 (N_w N_h)^2} \sum_{i,j} \big( G(Z[O])_{ij}^{l} - G(X[I])_{ij}^{l} \big)^2,$ (11)

where $G(F)_{ij} = \sum_k F_{ik} F_{jk}$ is the Gram matrix of a feature map F, N_w represents the width of the feature map, N_h its height, N_c the number of feature channels, Z[O] the feature matrix of the output image, Y[I] the feature matrix of the input content image, X[I] the feature matrix of the input style image, i the ith filter of layer l, and j the jth position in layer l.
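A compact sketch of Eqs. (9)-(11) for feature maps already extracted from VGG-19 and reshaped to channels × positions (all function names and array shapes here are our assumptions):

```python
import numpy as np

def content_loss(Z, Y):
    """Eq. (10) for one layer: squared error between output and content
    features. Z, Y: arrays of shape (Nc, Nw * Nh)."""
    Nc, Nwh = Z.shape
    return np.sum((Z - Y) ** 2) / (2.0 * Nc * Nwh)

def gram(F):
    """Gram matrix G(F)_ij = sum_k F_ik F_jk of a (Nc, Nw * Nh) feature map."""
    return F @ F.T

def style_loss(Z, X):
    """Eq. (11) for one layer: squared error between Gram matrices of the
    output and style features. Z, X: arrays of shape (Nc, Nw * Nh)."""
    Nc, Nwh = Z.shape
    return np.sum((gram(Z) - gram(X)) ** 2) / (4.0 * Nc ** 2 * Nwh ** 2)

def total_loss(content_pairs, style_pairs, alpha=1.0, beta=1e3):
    """Eq. (9): weighted sum of per-layer content and style terms."""
    Lc = sum(content_loss(Z, Y) for Z, Y in content_pairs)
    Ls = sum(style_loss(Z, X) for Z, X in style_pairs)
    return alpha * Lc + beta * Ls
```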

2.2 Semantic content constraint items

We add the Laplacian matting matrix constraint to the style transfer loss function to retain the semantic content of the image and to balance bias and variance, fitting ability and generalization ability, and the average loss and the structural error, thus solving the problems of color distortion and spatial distortion in the process of neural artistic stylization. We combine the Laplacian matting matrix with the local affine transformation of the image's color space, and then process the R, G and B color channels of the pixels in each feature neighborhood separately. For each output patch, the input image is mapped to the corresponding output position, forming the final Laplacian matting matrix constraint term.

The standard linear system represented by the matrix L_I of the input image I can be minimized with a least-squares constraint function. L_I is the Laplacian matting matrix, an N×N symmetric matrix, and R_h[O] is the N×1 vector of the output image in channel h. The output of the local affine transformation is constrained by the regularization term

$L_{Re} = \sum_{h=1}^{3} R_h[O]^{\mathrm{T}} L_I R_h[O].$ (12)
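Eq. (12) is inexpensive to evaluate once L_I is stored sparsely; a sketch (our naming, with L_I built as in Section 1):

```python
import numpy as np
from scipy.sparse import csr_matrix

def photoreal_regularizer(L_I, output_img):
    """Eq. (12): sum over the three channels of R_h[O]^T L_I R_h[O].

    L_I: sparse N x N matting Laplacian of the input content image.
    output_img: float array of shape (H, W, 3); each channel is
    flattened into the N x 1 vector R_h[O].
    """
    L_I = csr_matrix(L_I)
    total = 0.0
    for h in range(3):
        r = output_img[..., h].ravel()      # R_h[O], length N = H * W
        total += r @ (L_I @ r)              # quadratic form r^T L r
    return total
```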

Then the total error function generated in the process of neural artistic stylization is the sum of three parts:

$L_{total} = \alpha L_c + \beta L_s + \gamma L_{Re},$ (13)

where γ represents the weight of the regularization term L_Re.

The synthesized image is superimposed on a white-noise image for back propagation to generate an image that matches the feature responses of the original images. The total error function is optimized iteratively: the smaller the value of the error function, the more similar the semantic content of the generated image is to the original content image, and the closer the style of the generated image is to the original style image.
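A minimal sketch of this iterative update, written against TensorFlow 2's eager API rather than the TF 1.12 graph code the experiments actually use; `total_error` is assumed to evaluate Eq. (13) on the current image:

```python
import tensorflow as tf

def stylize(content_shape, total_error, steps=1000, lr=0.02):
    """Iteratively updates an image initialized from white noise so that
    the total error of Eq. (13) is minimized by back propagation."""
    img = tf.Variable(tf.random.uniform(content_shape))      # white-noise init
    opt = tf.keras.optimizers.Adam(learning_rate=lr)
    for _ in range(steps):
        with tf.GradientTape() as tape:
            loss = total_error(img)        # alpha*Lc + beta*Ls + gamma*LRe
        grad = tape.gradient(loss, img)
        opt.apply_gradients([(grad, img)])
        img.assign(tf.clip_by_value(img, 0.0, 1.0))  # keep valid pixel range
    return img.numpy()
```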

As can be seen from Fig.2, different values of γ suppress the distortion of semantic content to different degrees. When γ=0, the image is severely distorted: the road surface, windows and so on cannot be identified, and the color distribution is not mapped from the input to the output. When γ=1, the constraint term is added and plays a certain role, but the value is too small and there is still some distortion. When γ=10, distortion is suppressed well and the scene boundaries are relatively clear. When γ=10^2, the suppression effect is obvious, edges such as the road surface and windows become gradually clearer, and the color mapping effect is better. When γ=10^3, the distortion of the image is smaller, the semantic content is clearly expressed, and the spatial layout conforms to the content image. When γ=10^4, the color mapping effect of the stylized image is good, and the boundary between foreground and background is clear. When the value of γ increases further, some neurons of the neural network lose their learning ability, and the generalization ability of the network and the quality of image transfer decline. A large number of experiments show that varying the value within (1, 10^4) suppresses the distortion of semantic content to varying degrees.

Fig.2 Effect diagram of constraint weight change

2.3 Image stylization

We use the VGG-19 model to extract the feature maps of the input style image and content image, and perform style reconstruction and content reconstruction. Then we constrain the texture synthesis process to be locally affine in color space and combine it with the Laplacian matting matrix, so that the semantic content and color mapping of the stylized image achieve texture synthesis under the matte constraint of the Laplacian matting matrix. For each output patch, an affine transformation maps the RGB values of the input image to the corresponding output position. The Laplacian matting matrix constrains the semantic content of the generated image, suppresses its distortion, makes the stylized content conform to the original content image, and improves the color mapping effect. Our algorithm flow chart is shown in Fig.3.

Fig.3 Algorithm flow chart

3 Experiment and analysis

3.1 Experiment

Our experiment was built under the TensorFlow framework. The CPU of the computer is an Intel i9-9900K at 5.0 GHz and the RAM is 64 GB. An NVIDIA RTX 2080Ti graphics card is used to run the VGG-19 network with GPU acceleration. The experimental software configuration is Python 3.6 + Anaconda3-5.1.0 + CUDA 9.0 + cuDNN 7.3.1 + tensorflow-gpu 1.12 + PyCharm 2019.2 Community Edition.

We use a pre-trained VGG-19 as the feature extractor. High-dimensional features extracted from convolutional layers conv4_2 and conv5_2 are used for content representation. The feature expression of style at different layers has different visual effects, so multi-layer feature fusion is adopted to enrich the style expression: conv1_1, conv2_1, conv3_1, conv4_1 and conv5_1 are selected. With random noise initialization, the code of Gatys'[4] original TensorFlow implementation is modified. The weight ratio of the content loss function to the style loss function is 10^3, and the weight of the constraint term is 10^4.
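In configuration form, this setup might look as follows (a sketch; the dictionary layout and the uniform per-layer style weights are our assumptions, not the paper's code):

```python
# Layers of the pre-trained VGG-19 used for each loss term (Section 3.1).
CONFIG = {
    "content_layers": ["conv4_2", "conv5_2"],
    "style_layers": ["conv1_1", "conv2_1", "conv3_1", "conv4_1", "conv5_1"],
    "style_layer_weights": [0.2] * 5,      # assumed uniform fusion weights
    "content_to_style_ratio": 1e3,         # weight ratio of Lc to Ls
    "gamma": 1e4,                          # weight of the constraint term LRe
    "init": "random_noise",
}
```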

3.2 Result analysis

We combine the Laplacian matting matrix with the local affine transformation in color space to constrain the image stylization process. The stylized results are compared with the stylized images of Gatys et al.[4], Huang et al.[12], Sheng et al.[21] and Chen et al.[22]. The comparison shows that our method achieves better results. The experimental results are shown in Fig.4: the first row shows the style images, the first column shows the content images, and the rest are the stylized images.

Fig.4 Our results

3.2.1 Subjective analysis and comparison

Huang's method can quickly achieve arbitrary style transfer with high flexibility, but the stylized images have poor quality compared with the images in columns (f) and (g) of Fig.5. Chen's method matches each content patch with the best-matching style patch and obtains less image distortion; however, the content image is not well represented by the style image, and the texture and color representation are poor, so the stylization effect is poor. Gatys' method can achieve transfer of any style, but the stylization is slow and the results are unstable, producing artifacts (Fig.5(3)(e)). Sheng's method seamlessly recombines the style representation according to the semantic spatial distribution of the content image; the global tone and local strokes are better and the stylized effect is good, but the semantic content of the stylized image is distorted (Fig.5(4)(f)).

Fig.5 Comparison of stylized effect

Our method stylizes with little distortion of the image's semantic content, and the content image is well represented by the texture and color of the style image. The stylized image is of high quality, the global tones and local lines are rendered well, and there are no artifacts around the image. The image has good continuity, the global scene is fully displayed, and the visual effect is satisfactory.

3.2.2 Objective analysis and comparison

Images are highly structured: there are strong correlations between pixels that carry important information about object structure in the visual scene, and an image becomes distorted after transformation. The human visual system can extract the structural information of a scene with high adaptability, so image distortion can be judged by comparing changes in structural information, giving an objective quality evaluation. We use local structural similarity (SSIM)[23-25] and root mean square error (RMSE)[26] to evaluate the degree of image distortion. The value range of SSIM is (0,1); the larger the value, the more similar the structures of the two images. Since the statistical features of an image are unevenly distributed in space and the distortion also varies spatially, the local SSIM index measures quality better than the global one. RMSE compares the pixel difference between the stylized image and the original content image; the smaller the RMSE value, the smaller the image distortion. It is calculated as

$I_{RMSE}(p,q) = \sqrt{\frac{1}{N} \sum_{i=1}^{N} (p_i - q_i)^2},$ (14)

where p and q represent the original content image and the stylized image, respectively.

SSIM measures image similarity in terms of luminance, contrast and structure. The structural similarity evaluation model is

$I_{SSIM}(p,q) = l(p,q)^{\alpha} \times c(p,q)^{\beta} \times s(p,q)^{\gamma},$ (15)

$l(p,q) = \frac{2\mu_p \mu_q + c_1}{\mu_p^2 + \mu_q^2 + c_1},$ (16)

$c(p,q) = \frac{2\sigma_p \sigma_q + c_2}{\sigma_p^2 + \sigma_q^2 + c_2},$ (17)

$s(p,q) = \frac{\sigma_{pq} + c_3}{\sigma_p \sigma_q + c_3},$ (18)

$I_{SSIM}(p,q) = \frac{(2\mu_p \mu_q + c_1)(2\sigma_{pq} + c_2)}{(\mu_p^2 + \mu_q^2 + c_1)(\sigma_p^2 + \sigma_q^2 + c_2)},$ (19)

where μ_p and μ_q are the means of p and q, σ_p^2 and σ_q^2 their variances, and σ_pq their covariance; c_1=(k_1 L)^2 and c_2=(k_2 L)^2 are constants used to maintain stability, with k_1=0.01, k_2=0.03, c_3=c_2/2, and L the dynamic range of the pixel values. Eq. (19) is the simplified form obtained with α=β=γ=1.
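A sketch of both metrics on grayscale float arrays, using the global simplified SSIM of Eq. (19) (the local SSIM reported in the paper applies the same formula over sliding windows, which we omit for brevity):

```python
import numpy as np

def rmse(p, q):
    """Eq. (14): root mean square error between two images in [0, 1]."""
    return np.sqrt(np.mean((p.astype(float) - q.astype(float)) ** 2))

def ssim_global(p, q, L=1.0, k1=0.01, k2=0.03):
    """Eq. (19): simplified global SSIM (alpha = beta = gamma = 1, c3 = c2/2).

    p, q: grayscale float arrays with dynamic range L.
    """
    c1, c2 = (k1 * L) ** 2, (k2 * L) ** 2
    mu_p, mu_q = p.mean(), q.mean()
    var_p, var_q = p.var(), q.var()
    cov_pq = np.mean((p - mu_p) * (q - mu_q))
    return ((2 * mu_p * mu_q + c1) * (2 * cov_pq + c2)) / \
           ((mu_p ** 2 + mu_q ** 2 + c1) * (var_p + var_q + c2))
```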

As can be seen from Table 1, the overall structural similarity of Chen's method is relatively high because its content image is not really represented by the style image; the texture and color representation are poor and the result is almost the same as the original content image. Huang et al., Gatys et al. and Sheng et al. all realize image stylization, but the structural similarity of their images is low. The stylized images of our method have high structural similarity and small error, with improved retention of semantic content and structure; the generated images are closer to the original content and style images.

Table 1 Structural similarity comparison

4 Conclusions

We propose a convolutional neural artistic stylization algorithm that suppresses image distortion. By combining the Laplacian matting matrix with the local affine transformation in color space, we constrain the stylization process and preserve the semantic content of the image. Our method constrains the image layout to obtain clear boundaries between the foreground, background and other objects, as well as small color affine deviation. In addition, the proposed algorithm realizes the color mapping and spatial control of image style transfer, preventing semantic content distortion in the generated image. The resulting stylization is visually pleasing and satisfactory.