
A method to generate foggy optical images based on unsupervised depth estimation

2021-04-14

WANG Xiangjun,LIU Linghao,NI Yubo,WANG Lin

(1. State Key Laboratory of Precision Measuring Technology and Instruments,Tianjin University,Tianjin 300072,China;2. MOEMS Education Ministry Key Laboratory,Tianjin University,Tianjin 300072,China)

Abstract: For traffic object detection in foggy environment based on convolutional neural network (CNN), data sets captured in fog-free environment are generally used to train the network directly. As a result, the network cannot learn the object characteristics of the foggy environment from the training set, and the detection effect is poor. To improve traffic object detection in foggy environment, we propose a method of generating foggy images from fog-free images from the perspective of data set construction. First, taking the KITTI object detection data set as the original fog-free images, we generate the depth image of each original image by using an improved Monodepth unsupervised depth estimation method. Then, a geometric prior depth template is constructed and fused with the depth image, with the image entropy taken as the weight. After that, a foggy image is generated from the depth image based on the atmospheric scattering model. Finally, we take two typical object detection frameworks, the two-stage Faster region-based convolutional neural network (Faster-RCNN) and the one-stage network YOLOv4, and train them on the original data set, the foggy data set and the mixed data set, respectively. According to the test results on the RESIDE-RTTS data set, collected in outdoor natural foggy environment, the models trained on the mixed data set show the best effect: the mean average precision (mAP) is increased by 5.6% under the YOLOv4 model and by 5.0% under the Faster-RCNN network, respectively. It is proved that the proposed method can effectively improve the object identification ability in foggy environment.

Key words: traffic object detection; foggy image generation; unsupervised depth estimation; YOLOv4 model; Faster region-based convolutional neural network (Faster-RCNN)

0 Introduction

More than 80% of the information drivers use in driving is acquired through vision. In foggy environment, rays of light are blocked by mist and micro-particles, which weakens the brightness of objects and background, impairs the drivers' visual recognition of objects, and seriously endangers safe driving. That is why traffic accidents occur frequently in foggy environment[1]. Object detection in foggy environment has thus received increasing attention in the academic field. With the development of deep learning, convolutional neural network (CNN) has been widely used for object detection. At present, studies on object detection in foggy environment mainly follow two directions: one pretreats the input image with dehazing and then feeds it into the detection network; the other detects the objects in the fog directly. The first method requires calculation at both the dehazing and detection stages, which demands considerable computing resources[2]. Meanwhile, pixel-level dehazing destroys the original structure of the image, so the detection effect is worse than expected[3]. Consequently, the second method has become an idea that urgently needs to be explored.

Object detection networks for foggy weather require high-quality training sets. At present, the open data sets on the Internet include the DENSE-HAZE outdoor foggy data set provided by the NTIRE competition and the real-world task-driven testing set (RESIDE-RTTS). However, the DENSE-HAZE data set includes only 55 images, which cannot be used for training. The RESIDE-RTTS data set collects images of foggy environments in Europe and Northern China based on Google Maps; its purpose is to test the effect of object detection in real scenes, and its number of pictures cannot support training, testing and verification at the same time. Since it is difficult to capture images in real foggy environments, methods that generate foggy images from original fog-free images have gradually been applied. In Ref.[4], RESIDE converts the NYU2_DEPTH V2 indoor depth data set into a foggy data set based on the atmospheric scattering model. Sakaridis et al. proposed a method of creating foggy images through data collection[5]. In the former method, the RESIDE foggy images are generated from the depth images that come with the original data set; the images are not annotated and thus not applicable to object detection. Sakaridis utilized a depth filling method based on a standard stereo matching algorithm and generated the depth map from the original left and right image pairs. However, this method is computationally complex and not suitable for large-scale data set generation; only 550 synthetic foggy images were generated. To sum up, there is a lack of large-scale outdoor traffic data sets for training in foggy environments.

In order to generate large-scale labeled outdoor traffic data sets, a method to generate foggy images from original fog-free images is studied in our work, using the KITTI data set, currently the largest automated driving scenario evaluation data set[6], as the original data set. First, the depth image of the original image is generated by the improved Monodepth unsupervised depth estimation model. Then, a geometric prior depth template is constructed, and the image entropy is used as the weight to fuse it with the depth map for optimization. Next, a new foggy data set containing 7 481 images is generated according to the atmospheric scattering model. At last, to verify the usefulness of the generated data set, object detection experiments are carried out: the generated data set is mixed into the original KITTI data set as the training set, and the RESIDE-RTTS data set is used as the test set. After training the object detection networks, the detection results show that the performance of the networks is improved by mixing the generated data set into the original data set.

1 Atmospheric scattering model

In the foggy environment,the quality of the images captured by the camera decreases due to two reasons:(1) The suspended particles in the atmosphere have a scattering effect on the incident light,leading to the attenuation of the incident light in the process of propagation from the scene point to the observation point of the imaging equipment;(2) The atmospheric particles deviate from the original propagation direction,integrate into the imaging light path,and then participate in imaging in conjunction with the reflected light of the object[7].The process of imaging in foggy days is shown in Fig.1.

Fig.1 Imaging model for foggy days

The mathematical model of foggy imaging is expressed as

I(x, λ) = R(x, λ)e^(−β(λ)d(x)) + L∞(1 − e^(−β(λ)d(x))),

(1)

where d(x) refers to the depth image; R(x, λ) represents the original fog-free image; β(λ) is the atmospheric scattering coefficient; and L∞ is the atmospheric light value at infinity.

According to Eq.(1), given the depth image d(x), the atmospheric scattering coefficient β(λ) and the atmospheric light value at infinity L∞, a foggy image can be generated from a fog-free image. The generation process is shown in Fig.2.

Fig.2 Foggy image generation process

Here L∞ determines the color of the fog, and β(λ) determines the concentration of the fog. Both can be adjusted as needed. Therefore, the depth image is the key to generating the foggy image.
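As a minimal sketch of this generation step (the function and parameter names are illustrative, not from the paper), Eq.(1) can be applied pixel-wise with NumPy, with β(λ) and L∞ as the tunable fog density and fog color:

```python
import numpy as np

def synthesize_fog(image, depth, beta=0.3, L_inf=(255, 255, 255)):
    """Apply the atmospheric scattering model of Eq.(1):
    I = R * exp(-beta * d) + L_inf * (1 - exp(-beta * d)).

    image : HxWx3 float array in [0, 255], the fog-free image R
    depth : HxW float array of scene depths d(x)
    beta  : atmospheric scattering coefficient (fog concentration)
    L_inf : atmospheric light value at infinity (fog color)
    """
    t = np.exp(-beta * depth)[..., None]      # transmission map, HxWx1
    L = np.asarray(L_inf, dtype=np.float64)   # broadcast fog color over pixels
    foggy = image * t + L * (1.0 - t)
    return np.clip(foggy, 0, 255).astype(np.uint8)
```

Setting the depth to zero reproduces the fog-free image, while a very large depth converges to the pure fog color L∞.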

2 Depth estimation

The key to generating a foggy image is to obtain an effective depth estimation of the original image. Meanwhile, to obtain image annotation information, we use the KITTI object detection data set, which includes 7 481 training pictures and 7 518 test pictures taken by the vehicle's left camera. In our work, the improved Monodepth model is utilized to estimate the depth of the KITTI object detection data set and generate the depth maps.

2.1 Monodepth model

The Monodepth model estimates depth without ground-truth labels: trained on rectified stereo pairs, it predicts a disparity map from the left image, reconstructs each view from the other, and uses the reconstruction error as the training signal. The binocular depth estimation process is shown in Fig.3.

Fig.3 Binocular depth estimation process

The infrastructure of the Monodepth model is the U-Net encoder-decoder model[9],and the structure of the model is shown in Fig.4.

Fig.4 U-Net encoder-decoder model

This model has been widely used in the semantic segmentation of images.The whole neural network is mainly composed of an encoder and a decoder.Specifically,the main function of the encoder is to capture the context information in the image,while the decoder is used to precisely locate the parts in the image that should be segmented.

The encoder of the Monodepth model adopts ResNet[10], whose core is the residual block. It differs from a plain network in its skip connections, which allow information from one residual block to flow unimpeded to the next, improving information flow and avoiding the vanishing-gradient and degradation problems caused by overly deep networks. ResNet is divided into multiple variants according to the number of layers; those variants are compared in Table 1.

Table 1 Comparison of ResNet networks

The Monodepth model uses ResNet18. To improve the effect of the model, ResNet50 and ResNet101 are used herein to replace ResNet18 for comparative experiments.

2.2 Loss function of the Monodepth model

The loss function of Monodepth model is

C = α_ap(C_ap^l + C_ap^r) + α_ds(C_ds^l + C_ds^r) + α_lr(C_lr^l + C_lr^r),

(2)

where C_ap characterizes the similarity between the reconstructed image and the input training image; C_ds represents the smoothness of the disparity image; C_lr expresses the difference between the depth images generated from the left and right views; the superscripts l and r denote the left-view and right-view terms, and α_ap, α_ds and α_lr are the corresponding weights. Therein, C_ap plays a key structural role in the whole image, and it is defined as

C_ap^l = (1/N)Σ_(i,j)[α(1 − SSIM(I_ij^l, Ĩ_ij^l))/2 + (1 − α)‖I_ij^l − Ĩ_ij^l‖],

(3)

where I^l is the input left image, Ĩ^l is the left image reconstructed from the right view, N is the number of pixels, and α weighs the SSIM term against the absolute difference.

The structural similarity (SSIM)[11] is used to calculate three comparative measures, brightness l, contrast c and structure s, for two samples x and y. They are defined respectively as

l(x, y) = (2μ_x μ_y + c_1)/(μ_x² + μ_y² + c_1),

(4)

c(x, y) = (2σ_x σ_y + c_2)/(σ_x² + σ_y² + c_2),

(5)

s(x, y) = (σ_xy + c_3)/(σ_x σ_y + c_3),

(6)

where μ_x and μ_y are the means, σ_x and σ_y the standard deviations, and σ_xy the covariance of the two samples, and c_1, c_2 and c_3 are small constants that stabilize the division. The three measures are combined into

SSIM(x, y) = l(x, y)^α c(x, y)^β s(x, y)^γ.

(7)

When the statistics are computed within a local window, a normalized Gaussian weighting function w = {w_k} with Σ_k w_k = 1 is applied:

μ_x = Σ_k w_k x_k,

(8)

σ_x = (Σ_k w_k(x_k − μ_x)²)^(1/2),

(9)

σ_xy = Σ_k w_k(x_k − μ_x)(y_k − μ_y).

(10)

The window is slid across the whole image, and the local results are averaged to obtain the global SSIM. This Gaussian-weighted variant is referred to as SSIM_B.
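The sliding-window computation can be sketched as follows, assuming the standard Gaussian-weighted SSIM of Ref.[11] with its usual defaults (11×11 window, σ=1.5, α=β=γ=1 and c₃=c₂/2, which folds the structure term into the contrast term); none of these values are stated in the paper:

```python
import numpy as np

def gaussian_window(size=11, sigma=1.5):
    # 2-D Gaussian weights normalized to sum to 1, as in Eqs.(8)-(10)
    ax = np.arange(size) - size // 2
    g = np.exp(-(ax ** 2) / (2 * sigma ** 2))
    w = np.outer(g, g)
    return w / w.sum()

def ssim_global(x, y, c1=(0.01 * 255) ** 2, c2=(0.03 * 255) ** 2, win=11):
    """Slide a Gaussian-weighted window over two grayscale images, compute
    the local SSIM of Eqs.(4)-(7) (simplified form with c3 = c2/2), and
    average the local scores into a global SSIM."""
    w = gaussian_window(win)
    h, wdt = x.shape
    scores = []
    for i in range(h - win + 1):
        for j in range(wdt - win + 1):
            px, py = x[i:i + win, j:j + win], y[i:i + win, j:j + win]
            mx, my = (w * px).sum(), (w * py).sum()        # weighted means
            vx = (w * (px - mx) ** 2).sum()                # weighted variances
            vy = (w * (py - my) ** 2).sum()
            cxy = (w * (px - mx) * (py - my)).sum()        # weighted covariance
            scores.append(((2 * mx * my + c1) * (2 * cxy + c2)) /
                          ((mx ** 2 + my ** 2 + c1) * (vx + vy + c2)))
    return float(np.mean(scores))
```

Identical inputs yield a global SSIM of 1; any structural difference lowers the score.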

2.3 Depth estimation with Monodepth

In order to provide the training model with more application scenarios, we use the KITTI 2015 data set for training; 42 524 images are used as the training set and 3 396 as the verification set, and each picture is furnished with corresponding left and right views. ResNet18 is trained with a batch size of 12 for 20 epochs, ResNet50 with a batch size of 8 for 25 epochs, and ResNet101 with a batch size of 4 for 35 epochs. The initial learning rate is 10^(−4), decreased to 10^(−5) for the last five epochs. The GPU is an RTX 2080Ti with 12 GB of memory. The generated depth images are shown in Fig.5.

Fig.5 Generated depth images: (a) ResNet18

According to the experimental results, the texture of the depth image becomes clearer as the number of ResNet layers increases and the loss function is adjusted. After the Gaussian weighting function is introduced, some miscalculated void areas without valid depth are effectively improved.

3 Generation of foggy images

After the depth images are obtained, the foggy images can be generated from them according to Eq.(1). The foggy images generated under different atmospheric scattering coefficients β(λ) and atmospheric light values L∞ are shown in Fig.6.

Fig.6 Generated foggy images: (a) β(λ)=0.3, L∞ pixel value [255, 255, 255]

3.1 Geometric prior template

In an actual foggy scene, the transitions between parts of the image appear relatively smooth to the human eye. However, in foggy images generated from the depth image, areas such as tree trunks, pedestrians and vehicles show abrupt changes in depth. The unsmooth edges and inaccurate depth estimation in such areas make the generated foggy image deviate from a natural scene, thus degrading the quality of the data set.

Fig.7 Diagram of unsmooth edges and inaccurate depth estimation

In order to solve the above problems,a geometric prior template is constructed to combine the depth image with the template for mitigating defects[12].

Most outdoor traffic data sets, including the KITTI data set, follow a certain rule in the change of image depth: the bottom of the image is close to the camera, so the depth value is small; the top of the image is close to the sky, so the depth value is large. The depth increases from the bottom to the top of the image, which is referred to as the bottom-to-top rule in our work. A bottom-to-top template is constructed according to this rule. The gray value at each point a(i, j) of the template is e^(k(h−i)/h+ln(min_depth)), where i is the row index, h is the KITTI image width of 1 225, min_depth is set to 20 and k is set to 5. The template is shown in Fig.8.

Fig.8 Diagram of bottom-to-top template
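A sketch of the template construction, assuming the exponential bottom-to-top profile above with the exponent normalized by the image height so that the bottom row equals min_depth exactly (this normalization is our assumption; the function name is illustrative):

```python
import numpy as np

def bottom_to_top_template(h, w, k=5.0, min_depth=20.0):
    """Build the bottom-to-top prior template: depth grows exponentially
    with row height, from min_depth at the bottom row (i = h-1) up to
    min_depth * e**k at the top row (i = 0)."""
    rows = np.arange(h, dtype=np.float64)  # i = 0 is the top row
    depth_col = np.exp(k * (h - 1 - rows) / (h - 1) + np.log(min_depth))
    return np.tile(depth_col[:, None], (1, w))  # constant along each row
```

Each row carries a single depth value, so the template is smooth everywhere and free of the edge artifacts that the estimated depth image may contain.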

Since the KITTI data set is taken in a road scene,many images contain lane lines and road vanishing points,as shown in Fig.9.

Fig.9 Detection of road vanishing point

For the images in which a vanishing point is detected, a middle-to-around template is constructed: the vanishing point is the farthest visible position of the road, so the template depth is largest there and decreases gradually toward the surrounding image borders.

Fig.10 Diagram of middle-to-around template

In the images where no vanishing point is detected,as shown in Fig.11,the middle-to-around template will not be used.

Fig.11 Image without vanishing point

3.2 Fusion of depth images and geometric prior templates

To better fuse the depth image with the geometric prior templates, the fusion weight of each part is considered. The depth image contains many details, and the richer the details, the more likely unsmooth edges and inaccurate depth estimation occur, as shown in Fig.12. Therefore, the richer the details of the depth image, the more it needs to be compensated by the prior templates.

Fig.12 Impacts of depth images with different richness of details on foggy images

In our work,two-dimensional (2D) entropy of the image is used to measure the richness of details in the depth image.The 2D entropy of the image is a characteristic statistical form that reflects the average amount of information in the image,and indicates the clustering feature and the spatial feature of image gray distribution.The calculation equation for entropy is presented as

P_ij = f(i, j)/N²,

(11)

H = −Σ_i Σ_j P_ij log₂P_ij,

(12)

where i represents the gray value of the pixel, j means the gray average of its neighborhood, f(i, j) refers to the frequency of occurrence of the characteristic binary group (i, j), and N is the scale of the image. The entropy of a pure white or pure black image is 0.
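The 2D entropy of Eqs.(11)-(12) can be sketched as below; the 3×3 neighborhood used for the gray average j is an assumption, since the paper does not state the neighborhood size:

```python
import numpy as np

def entropy_2d(img):
    """Two-dimensional image entropy of Eqs.(11)-(12): each pixel forms a
    pair (i, j) of its own gray value i and the mean gray value j of its
    3x3 neighborhood; H = -sum P_ij * log2(P_ij) over the pair histogram."""
    img = img.astype(np.float64)
    n = img.size
    # mean of the 3x3 neighborhood (edge pixels reuse the border values)
    padded = np.pad(img, 1, mode='edge')
    neigh = sum(padded[di:di + img.shape[0], dj:dj + img.shape[1]]
                for di in range(3) for dj in range(3)) / 9.0
    # encode each (i, j) pair as a single integer and histogram it
    pairs = img.astype(np.int64) * 256 + neigh.astype(np.int64)
    counts = np.bincount(pairs.ravel())
    p = counts[counts > 0] / n
    return float(-(p * np.log2(p)).sum())
```

A uniform (pure white or pure black) image produces a single pair and therefore zero entropy, matching the statement above; richer detail yields more distinct pairs and higher entropy.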

Let the 2D entropies of the depth image, the bottom-to-top template and the middle-to-around template be H_1, H_2 and H_3, respectively; the corresponding weights are calculated by

(13)

(14)

(15)

The final depth image is

ω_1 I_d(i, j) + ω_2 I_b(i, j) + ω_3 I_m(i, j),

(16)

where I_d(i, j), I_b(i, j) and I_m(i, j) represent the depth image, the bottom-to-top template and the middle-to-around template, respectively.
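Given the entropy-derived weights, the fusion of Eq.(16) is a plain pixel-wise weighted sum; a minimal sketch with illustrative names:

```python
import numpy as np

def fuse_depth(depth, bottom_top, middle_around, w1, w2, w3):
    """Eq.(16): pixel-wise weighted sum of the estimated depth image and
    the two geometric prior templates (weights derived from 2D entropies)."""
    assert abs(w1 + w2 + w3 - 1.0) < 1e-6, "weights should sum to 1"
    return w1 * depth + w2 * bottom_top + w3 * middle_around
```

The templates pull the fused depth toward a smooth geometric prior wherever the estimated depth image is noisy, while a low template weight preserves the estimated details.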

Comparison of foggy images generated before and after fusion is shown in Fig.13.

Fig.13 Comparison of foggy images generated before and after fusion: (a) before fusion

4 Evaluation of foggy images

When the foggy images are generated, it is necessary to evaluate them. To verify the role of the data set in practical application, object detection experiments in foggy environment are carried out.

YOLOv4, a state-of-the-art one-stage object detection network, is used as the detection network[13]. The training sets consist of: ① the original KITTI data set; ② the foggy data set generated with β(λ)=0.3; ③ the foggy data set generated with β(λ)=0.5; ④ the original KITTI data set + the foggy data set with β(λ)=0.3; ⑤ the original KITTI data set + the foggy data set with β(λ)=0.5; ⑥ the original data set + the data set after Gaussian blur.

The RESIDE-RTTS set is adopted as the test set. During training, the initial learning rate is 0.001 and the attenuation coefficient is 0.000 5. The batch size is 64, the subdivision number is 8, and the maximum number of training batches is 60 000. Mosaic image enhancement is applied in all experiments. The GPU is an RTX 2080Ti with 12 GB of memory. The experimental results are shown in Table 2.

Table 2 Experimental results of YOLOv4 object detection

The experimental results show that adding the generated data set to the original data set as data enhancement greatly improves object detection: the mAP is improved by 5.6%. Meanwhile, the results also show that a high β(λ) may decrease the detection precision. A possible reason is that the actual test scenes are dominated by mist; in addition, some objects are completely obscured by dense fog, resulting in wrong image annotations. In deep learning, Gaussian blur and similar methods are typically used for data enhancement. To compare the object detection effects of foggy-image enhancement and Gaussian blur, experiment ⑥ was carried out. The results show that the mAP acquired with foggy-image data enhancement is 3.9% higher than that with Gaussian blur.


The region-proposal-based CNN architecture Faster-RCNN[14] is used to test the object detection effect of the data sets under a two-stage detection network. Faster-RCNN with a ResNet50 backbone is selected, and the model is trained for 20 epochs. The experimental results are shown in Table 3.

Table 3 Experimental results of Faster R-CNN object detection

The experimental results show that the generated data sets also improve two-stage object detection networks. The mAP is 5.0% higher than that obtained by directly using the original data set, indicating that the proposed data set construction and generation methods have strong generalization ability and adapt well to current classical object detection algorithms.

5 Conclusions

To address the lack of large, fully annotated outdoor traffic data sets for foggy weather, a method to generate foggy images from original fog-free images is studied. The data set constructed by the proposed method is trained and tested on two typical CNN object detection networks. The results show that detection in foggy environment is significantly improved, and the model has strong universality and generalization ability. Moreover, the method is applicable to any data set with left and right camera views, and has practical application value. In the future, the scale of the data set will be expanded and the data set further cleaned to achieve a better training effect; the application of the data set in image dehazing and other fields will also be studied.