Anomaly detection and segmentation based on multi-student teacher network

2022-05-05

Journal of Measurement Science and Instrumentation, 2022, Issue 2

REN Chaoqiang, LIU Dengfeng

(1. School of Artificial Intelligence and Computer Science, Jiangnan University, Wuxi 214122, China; 2. Jiangsu Provincial Engineering Laboratory of Pattern Recognition and Computational Intelligence, Jiangnan University, Wuxi 214122, China)

Abstract: In automated industrial inspection, it is often necessary to train models on anomaly-free images and then perform anomaly detection on products, which is an important and challenging task in computer vision. The student-teacher network trains students to regress the output of the teacher and uses the difference between the outputs of the student network and the pre-trained teacher network to locate anomalies, which has achieved advanced results in anomaly segmentation. However, it is slow to predict an image, and it does not perform image-level anomaly detection. A multi-student teacher network is proposed, which uses multiple student networks to jointly regress the output of the teacher network and selects the minimum squared difference between the outputs of the students and the teacher in each dimension as the difference value. The information in the middle layer of the network is used to represent each area of the image and to calculate the anomaly distance for anomaly segmentation, and the maximum anomaly score is used to represent the abnormal degree of the image for anomaly detection. Experimental results on the MVTec anomaly detection (MVTec AD) dataset show that the algorithm predicts an image in 0.17 s and outputs anomaly detection results at the same time, with image AUROC reaching 91.1% and pixel AUROC reaching 94.5%. On a wall tile dataset produced by photographing real scenes, image AUROC reaches 89.7% and pixel AUROC reaches 89.1%. Compared with the original student-teacher network, the proposed method completes anomaly segmentation and anomaly detection quickly and simultaneously with better accuracy, and it also performs well in real applications.

Key words: student-teacher network; anomaly detection; anomaly segmentation; unsupervised learning

0 Introduction

In industrial production, it is necessary to find abnormal products in time, so as to investigate the cause and eliminate abnormal products promptly. In the past, manual methods were used to find abnormal products[1]. Currently, deep learning has developed rapidly[2-3] and has been widely used to solve various problems with good results[4-5]. An anomaly detection method based on deep learning can be used as an auxiliary tool for manual inspection or can directly replace manual labor. The technique of detecting whether an image is abnormal is called anomaly detection, and the technique of segmenting out the abnormal areas is called anomaly segmentation.

At present, anomaly detection can be divided into two types: supervised and unsupervised. Supervised algorithms, such as U-Net[6-7], Fast R-CNN[8-9] and YOLOv4[10-11], require a large number of labeled abnormal sample images for training. Unsupervised anomaly detection algorithms, in contrast, need only images of normal samples for training. In anomaly detection tasks for expensive products such as chips, a supervised algorithm requires a large number of abnormal chip samples, which is costly, while an unsupervised algorithm requires only normal samples and is low in cost, so it has better development potential.

Existing work mainly focuses on generative algorithms, such as generative adversarial networks (GANs)[12-13] or variational autoencoders[14-15]. These methods detect anomalies using the per-pixel reconstruction error or the density of the probability distribution estimated by the model, which has been shown to be problematic because of inaccurate reconstruction or poor calibration[16-17].

In recent years, anomaly detection methods based on features extracted by a pre-trained network have emerged and have generally achieved better results than generative models. This type of method uses a publicly available trained classification network to extract features and judges whether a test image or patch is abnormal by comparing its features with those of normal images. The student-teacher network has achieved advanced results in anomaly segmentation, but it takes 1 hour to locate the anomalies in one image, which is not suitable for practical industrial applications, and it has no anomaly detection function. In the proposed method, multiple student networks are used in the training phase to learn the features that the teacher network extracts from anomaly-free images. In the inference phase, the students and the teacher jointly predict the results, as shown in Fig.1.

Fig.1 Inference graph

The intermediate-layer information of the neural network is used to represent each area of the image, so that the difference between the test image and normal images can be calculated for anomaly segmentation. The maximum abnormal score over all pixels is used as the abnormal score of the image for anomaly detection. This method is fast and completes anomaly segmentation and detection at the same time. Multiple student networks are used to jointly regress the output of the teacher network, and the minimum squared difference between the outputs of the students and the teacher in each dimension is selected as the difference value. The joint prediction is more stable and avoids poor model accuracy caused by training errors. By combining, in each dimension, the student that predicts best, the multi-student network obtains a better overall prediction. In addition, a wall tile dataset is built from photographs of real scenes to evaluate the model in practical applications.

1 Student-teacher network

This section briefly introduces the principles and shortcomings of the student-teacher network.

1.1 Fundamentals

Paul Bergmann et al. proposed a student-teacher network[18] and achieved the best anomaly segmentation results on the MNIST, CIFAR10 and MVTec anomaly detection (MVTec AD) datasets. Since the official method was not disclosed, Paul Bergmann reproduced the method.

For an anomaly-free training dataset $D=\{I_1, I_2, \ldots, I_N\}$, a student network is trained so that it can detect the abnormal areas of a test image $T$. Let $a$ and $c$ denote the side length and depth of a cropped patch, and $p$ and $d$ the side length and depth of the feature map. The student-teacher network crops an image into multiple regions $P \in \mathbb{R}^{a \times a \times c}$ and calculates the abnormal value of each region, which represents the abnormal value of the pixel at its center $(r_1, r_2)$. For a region $P$, the student network outputs a feature map $S(P) \in \mathbb{R}^{p \times p \times d}$ and the teacher network outputs a feature map $T(P) \in \mathbb{R}^{p \times p \times d}$, and MSELoss between the two is used to calculate the degree of abnormality. MSELoss is shown as Eq.(1), where $x$ and $y$ represent the outputs of the student network and the teacher network, and $n$ represents the number of points in the feature map.

$$\mathrm{MSELoss}(x, y) = \frac{1}{n}\sum_{i=1}^{n}\left(x_i - y_i\right)^2 \tag{1}$$
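As a small illustration of the mean squared error used as the degree of abnormality, the following sketch computes it on two short vectors. The function name `mse_loss` and the sample values are assumptions made for this example; in practice the inputs would be the flattened student and teacher feature maps.

```python
def mse_loss(x, y):
    """Mean squared error between two equally sized flat sequences."""
    if len(x) != len(y) or not x:
        raise ValueError("x and y must be non-empty and of equal length")
    return sum((xi - yi) ** 2 for xi, yi in zip(x, y)) / len(x)


# Hypothetical flattened outputs; real feature maps would have p * p * d entries.
student = [0.5, 1.0, 2.0, 4.0]
teacher = [0.5, 1.0, 1.0, 2.0]
print(mse_loss(student, teacher))  # (0 + 0 + 1 + 4) / 4 = 1.25
```

A larger value means the student failed to reproduce the teacher on this input, which is what is interpreted as abnormality.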

1.2 Shortcomings

This method calculates the abnormal value of the area $P$ to represent the abnormal value of its center point $(r_1, r_2)$. When the image is large, a large number of areas must be evaluated to obtain the abnormal value of each pixel, which consumes a lot of time. For example, for a 900×900 image, cropping the image into 128×128 areas and calculating the abnormal value of each requires a total of (900-128)×(900-128)=595 984 calculations, so inference takes about one hour per image. In addition, in the author's later reproduction of the method, only one student network is used, which makes the results more prone to large fluctuations caused by training deviations, and only anomaly segmentation is performed, without image-level anomaly detection.
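The patch count above can be checked with a few lines of arithmetic. This is only a sketch: the helper name `patch_evaluations` is hypothetical, the count follows the expression given in the text, and the per-patch time budget is merely what a one-hour total would imply, not a measured value.

```python
def patch_evaluations(image_side, patch_side):
    """Number of patch evaluations, using the (side - patch)^2 count from the text."""
    per_axis = image_side - patch_side
    return per_axis * per_axis


count = patch_evaluations(900, 128)
print(count)  # 595984

# If the whole image takes roughly one hour, each patch gets only a few milliseconds.
print(round(3600 / count * 1000, 2), "ms per patch")
```

Even at a few milliseconds per patch, the sheer number of evaluations dominates the cost, which motivates predicting the whole image in one pass.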

Therefore, a new multi-student teacher network structure is proposed. It takes the entire image as input and estimates the abnormal value of each point from the extracted feature map. The method combines the output features of multiple students for prediction and selects, at each point and in each dimension, the minimum distance between the students and the teacher as the abnormal value. The maximum abnormal value over all points of the test image is taken as the abnormal value of the image, which is used to determine whether the image is abnormal. The method can therefore complete anomaly detection and segmentation at the same time and meets industrial requirements in terms of speed.

2 Design of method

This method uses a pre-trained network as the teacher network to extract features from the input image and trains three student networks to regress the output of the teacher network. When the test image is abnormal, the outputs of the teacher network and the student networks differ greatly. Whether each area of the image is abnormal is judged by calculating the squared difference in each dimension at each point of the feature maps.

2.1 Patch distribution model

The features extracted by the neural network are used to represent the information of each region of the image for abnormal segmentation, so that only one calculation is needed to predict an image, which is very fast, as shown in Fig.2. Let $w$, $h$ and $c$ denote the width, height and depth of the input image, and $p$ and $d$ the side length and depth of the feature map. For the input image $I \in \mathbb{R}^{w \times h \times c}$, the teacher network extracts the feature map $T \in \mathbb{R}^{p \times p \times d}$ and the student network extracts the feature map $S \in \mathbb{R}^{p \times p \times d}$, each of which is equivalent to a mapping of the input image. After feature extraction, each point carries $d$-dimensional information. MSELoss is used to calculate the abnormal value between corresponding points of the student and teacher feature maps, giving the abnormal map $M \in \mathbb{R}^{p \times p}$. The calculation formula is shown in Eq.(1). Bilinear interpolation is used to resize the abnormal map $M$ to $w \times h$ to obtain the abnormal segmentation map. The largest abnormal value over all points is used as the abnormal value of the image for abnormal detection.

The experimental results show that the feature map can be used to represent the input image to calculate abnormal values and obtain good results. This method is not as detailed as the original student-teacher network block-by-block prediction, but the receptive field is larger, and it has more advantages for the overall prediction. What’s more, it has a very fast prediction speed, which can meet industrial requirements, and can perform abnormality detection at the same time.
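The single-pass pipeline described in this section can be sketched in plain Python as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names are hypothetical, feature maps are nested lists of shape p × p × d, the interpolation step is taken to be bilinear, and corner-aligned sampling is an arbitrary choice that the text does not specify.

```python
def anomaly_map(student, teacher):
    """Per-point mean squared difference over the d feature dimensions."""
    return [
        [
            sum((s - t) ** 2 for s, t in zip(s_vec, t_vec)) / len(t_vec)
            for s_vec, t_vec in zip(s_row, t_row)
        ]
        for s_row, t_row in zip(student, teacher)
    ]


def bilinear_resize(grid, out_h, out_w):
    """Bilinear upsampling of a 2-D grid (corner-aligned sampling, an assumption)."""
    in_h, in_w = len(grid), len(grid[0])
    out = []
    for r in range(out_h):
        y = r * (in_h - 1) / (out_h - 1) if out_h > 1 else 0.0
        y0 = int(y)
        y1 = min(y0 + 1, in_h - 1)
        fy = y - y0
        row = []
        for c in range(out_w):
            x = c * (in_w - 1) / (out_w - 1) if out_w > 1 else 0.0
            x0 = int(x)
            x1 = min(x0 + 1, in_w - 1)
            fx = x - x0
            top = grid[y0][x0] * (1 - fx) + grid[y0][x1] * fx
            bottom = grid[y1][x0] * (1 - fx) + grid[y1][x1] * fx
            row.append(top * (1 - fy) + bottom * fy)
        out.append(row)
    return out


# Toy 2 x 2 x 2 feature maps: the student disagrees with the teacher at one point.
teacher = [[[0.0, 0.0], [0.0, 0.0]], [[0.0, 0.0], [0.0, 0.0]]]
student = [[[0.0, 0.0], [0.0, 0.0]], [[0.0, 0.0], [2.0, 2.0]]]

m = anomaly_map(student, teacher)       # [[0.0, 0.0], [0.0, 4.0]]
seg = bilinear_resize(m, 3, 3)          # stand-in for resizing to w x h
image_score = max(max(row) for row in seg)
print(m, seg[1][1], image_score)        # center is interpolated to 1.0, score is 4.0
```

The segmentation map comes from thresholding `seg`, while `image_score` is the single number used for image-level detection.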

Fig.2 Relational graph

2.2 Multi-student joint prediction

Three student networks are used to regress the output of the teacher network. For a test image, the student networks output three feature maps $S_1, S_2, S_3 \in \mathbb{R}^{p \times p \times d}$ and the teacher network outputs a feature map $T \in \mathbb{R}^{p \times p \times d}$. Specifically, for a feature point, the minimum squared difference between $S_1, S_2, S_3$ and $T$ is taken in each dimension, and the mean of these minima over all dimensions is selected as the abnormal value of the point. The calculation for a feature point is shown in Eq.(2), where $x_{ij}$ represents the value of the $i$th dimension at the point in $S_j$, and $y_i$ represents the value of the $i$th dimension at the point in $T$.

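$$m = \frac{1}{d}\sum_{i=1}^{d}\min_{j \in \{1,2,3\}}\left(x_{ij} - y_i\right)^2 \tag{2}$$

Here $m$ denotes the abnormal value of the feature point and $d$ is the depth of the feature map.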

With this method, the abnormal value of each point of the feature map is calculated to obtain the abnormal map $M \in \mathbb{R}^{p \times p}$, which is then resized to $w \times h$ to obtain the abnormal segmentation map. Selecting the minimum value when calculating the difference between the features of the students and the teacher in each dimension effectively filters out values with abnormal prediction results and makes the results more stable. Experiments on MVTec AD also confirm that this method achieves better results than taking the average or maximum value.
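The multi-student abnormal map can be sketched as below. The function names `point_score` and `multi_student_anomaly_map` and the toy feature values are assumptions for illustration; the per-point rule follows the description above (minimum squared difference over students in each dimension, then the mean over dimensions).

```python
def point_score(student_vecs, teacher_vec):
    """Mean over dimensions of the minimum squared student-teacher difference."""
    per_dim = [
        min((s[i] - y_i) ** 2 for s in student_vecs)
        for i, y_i in enumerate(teacher_vec)
    ]
    return sum(per_dim) / len(per_dim)


def multi_student_anomaly_map(student_maps, teacher_map):
    """Apply point_score at every location of p x p x d feature maps."""
    return [
        [
            point_score([s_map[r][c] for s_map in student_maps], t_vec)
            for c, t_vec in enumerate(t_row)
        ]
        for r, t_row in enumerate(teacher_map)
    ]


# Toy 1 x 2 x 2 maps: no student matches the teacher at the first point,
# while at the second point every dimension is matched by at least one student.
teacher = [[[0.0, 0.0], [1.0, 1.0]]]
s1 = [[[2.0, 1.0], [1.0, 1.0]]]
s2 = [[[1.0, 2.0], [3.0, 1.0]]]
s3 = [[[3.0, 3.0], [1.0, 4.0]]]

print(multi_student_anomaly_map([s1, s2, s3], teacher))  # [[1.0, 0.0]]
```

At the second point, the errors of individual students (3.0 instead of 1.0, 4.0 instead of 1.0) are discarded because another student is exact in that dimension, which is the filtering effect described above.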

3 Experiments

The method is first compared with other methods on the MVTec AD dataset, and then its practical effect is tested on the self-made wall tile dataset.

3.1 Dataset

3.1.1 MVTec AD

The public dataset MVTec AD[19] is used in this experiment. It imitates actual industrial production scenes and is mainly used for unsupervised defect detection and localization. MVTec AD has 3 629 training images and 1 725 test images in 15 categories: 5 texture categories (carpets, wood, grids, tiles and leather) and 10 object categories (bottles, metal nuts, cables, capsules, hazelnuts, pills, screws, toothbrushes, transistors and zippers). Each image is square, with sides ranging from 700 to 1 024 pixels. The dataset is a comprehensive and realistic representation of the needs of defect detection in industrial practice. Therefore, many methods use it to calculate evaluation indicators and test model performance.

3.1.2 Walltiles dataset

Photographs of real scenes are taken to create a dataset of wall tiles. In order to reflect the actual effect of the model in daily scenes, the images are collected under various everyday conditions. The same object looks different under different lighting, angles and distances, which makes anomaly detection much more difficult. The objects in MVTec AD are located in the center of the image, and the illumination and shooting distance are fixed. The images in our dataset are obtained under a variety of conditions, which can reflect the effect of the model in real applications.

Referring to the number of pictures of a class of objects in MVTec AD, this data set contains 249 training pictures and 86 test pictures. In this data set, the training images are all normal wall tile images, and the test images have normal and abnormal wall tile images. This data set can reflect the effect of the model in practical applications.

3.2 Implementation details

The model is built with PyTorch, with VGG19 as the backbone network, and the output of the 13th layer of VGG19 is selected as the feature. The teacher network and the student networks share the same structure, but the parameters of the teacher network are pre-trained on ImageNet, while the parameters of the student networks are initialized randomly. In the training phase, the student networks are trained to fit the output of the teacher network. The model is trained for 100 epochs with a batch size of 1, using MSELoss. In the preprocessing stage, the input images are resized to 224×224×3 and randomly shuffled. For experiments on the wall tile dataset, the images are first resized to 256×256×3 and then center-cropped to 224×224×3, which achieved better results.
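The center-cropping step used for the wall tile images can be illustrated with a short sketch. The helper `center_crop` is a hypothetical name operating on a 2-D list of rows; the resize step is omitted, and an actual pipeline would more likely rely on an image library.

```python
def center_crop(image, size):
    """Return the central size x size window of a 2-D grid (list of rows)."""
    height, width = len(image), len(image[0])
    if size > height or size > width:
        raise ValueError("crop size exceeds image size")
    top = (height - size) // 2
    left = (width - size) // 2
    return [row[left:left + size] for row in image[top:top + size]]


# For the 256 -> 224 crop described above, 16 pixels are trimmed from each side.
print((256 - 224) // 2)  # 16

# Tiny 4 x 4 stand-in image, cropped to its central 2 x 2 window.
toy = [[r * 4 + c for c in range(4)] for r in range(4)]
print(center_crop(toy, 2))  # [[5, 6], [9, 10]]
```

Cropping after a slightly larger resize discards the image border, which may be one reason it helps on photographs taken at varying distances and angles; the source reports only that it gave better results.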

Two threshold-independent indicators, the image AUROC score and the pixel AUROC score, are used to evaluate anomaly detection and anomaly localization, respectively. After the abnormal score of each point of the image is predicted, the area under the receiver operating characteristic curve over all points (pixel AUROC) is calculated to evaluate defect localization. The maximum abnormal score within each image is used as the abnormal score of that image, and the area under the receiver operating characteristic curve over all images (image AUROC) is calculated to evaluate defect detection.
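The two indicators can be sketched without external libraries by using the pairwise-ranking definition of AUROC (the probability that an abnormal sample scores higher than a normal one, with ties counted as one half). All function names and the toy maps below are assumptions for illustration; an established library routine would normally be used for real evaluations.

```python
def auroc(scores, labels):
    """AUROC as the fraction of (abnormal, normal) pairs ranked correctly."""
    pos = [s for s, lab in zip(scores, labels) if lab == 1]
    neg = [s for s, lab in zip(scores, labels) if lab == 0]
    if not pos or not neg:
        raise ValueError("both normal and abnormal samples are required")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def image_auroc(anomaly_maps, image_labels):
    """Image score is the maximum of each anomaly map."""
    image_scores = [max(max(row) for row in m) for m in anomaly_maps]
    return auroc(image_scores, image_labels)


def pixel_auroc(anomaly_maps, masks):
    """Pool every pixel of every image against its ground-truth mask."""
    scores = [v for m in anomaly_maps for row in m for v in row]
    labels = [v for mask in masks for row in mask for v in row]
    return auroc(scores, labels)


# Toy data: one normal and one abnormal 2 x 2 image with ground-truth masks.
maps = [[[0.1, 0.2], [0.1, 0.3]], [[0.2, 0.9], [0.1, 0.1]]]
masks = [[[0, 0], [0, 0]], [[0, 1], [0, 0]]]
print(image_auroc(maps, [0, 1]), pixel_auroc(maps, masks))  # 1.0 1.0
```

The pairwise loop is quadratic in the number of samples, so it is suitable only for small illustrations like this one.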

3.3 Effects of different designs

In this section, a number of comparative experiments are designed to prove the effectiveness of the structural design proposed in this paper.

3.3.1 Influence of different training strategies on single student network

A single-student network is used to study the impact of different training strategies on the results. First, following the original method, the overall MSELoss is computed to obtain the image AUROC, and feature extraction with the 18th layer of VGG19, layer 3 of ResNet18 and layer 3 of Wide ResNet50 is compared. Then, following the method in this paper, the effect of extracting features with the 18th and 32nd layers of VGG19 is evaluated. That is, the average distance between the feature maps of the student network and the teacher network over the dimensions of a point is taken as the abnormal value of the point, and the maximum over all points is then used as the abnormal value of the image. It can be seen from Table 1 that the method in this paper is better than the original method. VGG19 performs better than ResNet18 and Wide ResNet50, and the 18th layer of VGG19 gives excellent feature extraction.

The result shows that the choices made in this paper are more reasonable. As a pre-trained network, VGG19 has achieved better results in many experiments and generalizes better. As described in Section 2.1, the two-step calculation of the image abnormal value in this method is more reasonable and finer-grained than the one-step MSELoss calculation of the original method, and it can filter out some abnormal values that may interfere with the inference results.

Table 1 Prediction results of single-student networks under different strategies

3.3.2 Impact of different joint forecasting methods

Three student networks are trained to jointly regress the output of the teacher network. When calculating the distance of each dimension of the feature points between each student network and the teacher network, the minimum value is selected as the joint prediction result of this dimension. Table 2 additionally shows the effect of selecting the maximum value or the mean value, indicating that the method in this paper obtains better results.

Table 2 Results of different joint forecasting strategies

For a picture, the output of the student network and the teacher network have the same size, and the student network strives to obtain the same value as the teacher network in all dimensions during the training phase. In the testing phase, when a student network has a poor predictive effect for a test picture in a certain dimension, it will reduce the accuracy of anomaly detection. If the output of the appropriate student network can be selected as the predicted value in each dimension, the accuracy of the model will be greatly improved. When the abnormal area is easily confused with the original image, selecting the minimum distance between each student and teacher in each dimension of each feature point for prediction can better distinguish the abnormality. When the abnormal area is significantly different from the original image, the maximum distance can be used for prediction to prevent the normal area from being misjudged as abnormal. Selecting the mean distance will make the result more stable and avoid the problem of large fluctuations in neural network training results.
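The three joint prediction strategies discussed above differ only in how the per-dimension squared differences of the students are reduced. The sketch below makes that explicit; the function `fused_point_score`, its `reduce` parameter and the toy vectors are assumptions for illustration, with one student deliberately given a poor prediction in one dimension of a normal point.

```python
def fused_point_score(student_vecs, teacher_vec, reduce="min"):
    """Point score with min, mean or max fusion over students in each dimension."""
    reducers = {
        "min": min,
        "max": max,
        "mean": lambda values: sum(values) / len(values),
    }
    if reduce not in reducers:
        raise ValueError("reduce must be 'min', 'mean' or 'max'")
    fuse = reducers[reduce]
    per_dim = [
        fuse([(s[i] - y_i) ** 2 for s in student_vecs])
        for i, y_i in enumerate(teacher_vec)
    ]
    return sum(per_dim) / len(per_dim)


# A normal point: two students match the teacher, the third is off in dimension 0.
teacher_vec = [0.0, 0.0]
student_vecs = [[0.0, 0.0], [0.0, 0.0], [3.0, 0.0]]

for mode in ("min", "mean", "max"):
    print(mode, fused_point_score(student_vecs, teacher_vec, mode))
# min 0.0
# mean 1.5
# max 4.5
```

In this toy case only the minimum keeps the normal point at zero, which matches the intuition that one poorly trained student should not raise the abnormal score; which strategy works best on real data is an empirical question answered by Table 2.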

3.4 Results and discussion

Compared with the original teacher-student network, this method can quickly perform anomaly segmentation and anomaly detection. Table 3 shows the time taken to predict an image. M-ST is used as the abbreviation for our method.

Table 3 Inference time per image

In addition, this method is compared with several methods with better results. As shown in Table 4, the proposed method has good anomaly detection and segmentation effects. Since the original teacher-student network consumes too much time and the evaluation indicators are different from this article, our improved single-student network is used as an alternative for comparison.

Table 4 Effect of anomaly detection and anomaly segmentation on MVTec AD

This method uses fewer feature points to calculate abnormal values and then maps them onto the entire image, so each feature point represents a region of the image. This is not as detailed as the original method of predicting block by block, which leads to a decrease in pixel AUROC, but the actual predicted area is close. When more attention is paid to anomaly detection, this method is better. Fig.3 shows the test image, ground truth, heat map, mask and abnormal segmentation image from left to right. It can be seen that the predicted area of the mask is basically correct, but the fine texture is worse. In the prediction for the capsule in the second row, this method additionally marks an abnormality in the background. How to remove the disturbance of the background area from the prediction results is a problem worthy of study.

Fig.3 Abnormal location effect on MVTec AD

In addition, photographs of real scenes are used to test the model. On the wall tile dataset, the image AUROC of this method reaches 89.7% and the pixel AUROC reaches 89.1%, which are good results. The images in the wall tile dataset are collected under a variety of everyday conditions, which makes the task more challenging, as shown in Fig.4. For example, tiles show different colors under natural light and artificial lighting. The tiles appear at different sizes when shot from far away or close up, and their shape differs with the shooting angle. These varying factors in real situations make anomaly recognition more difficult, but they also show that the model performs well in practical applications.

Fig.4 Abnormal location effect on Wall Tiles Dataset

Finally, ablation experiments are performed. Table 5 lists the effects of the original teacher-student network, the teacher-student network based on the patch distribution model, and the multi-student teacher network, which gradually becomes better. This shows that the structure proposed in this paper can effectively improve the effect of the model.

Table 5 Ablation experiment on MVTec AD

4 Conclusions

A new teacher-student network structure is proposed for anomaly detection and segmentation. It uses multiple students to jointly regress the output of the teacher network and uses the middle-layer features of the network to calculate the abnormal value of each area of the image. It is fast and accurate, and it can meet the actual needs of industry.

Photographs of real scenes are used to create a dataset of wall tiles, and the model performs well on it, indicating good results in practical applications. Reasonably combining the better knowledge learned by each student network achieves better results in anomaly detection. This method uses the minimum distance in each dimension between the student networks and the teacher network for prediction, and there is still much room for improvement. In future work, it should be considered how to better combine multiple student networks and how to apply the model to anomaly detection for more types of objects, such as roads, cabinets, tables and chairs. This can make it much more convenient for people to detect object anomalies.
