Crop Leaf Disease Recognition Network Based on Brain Parallel Interaction Mechanism

2022-05-09

YUAN Hui(袁 惠), HAO Kuangrong(郝矿荣), WEI Bing(隗 兵)

Engineering Research Center of Digitized Textile & Apparel Technology, Ministry of Education, College of Information Science and Technology, Donghua University, Shanghai 201620, China

Abstract: In actual complex environments, the recognition accuracy of crop leaf diseases is often not high. Inspired by the brain parallel interaction mechanism, a two-stream parallel interactive convolutional neural network (TSPI-CNN) is proposed to improve the recognition accuracy. TSPI-CNN includes a two-stream parallel network (TSP-Net) and a parallel interactive network (PI-Net). TSP-Net simulates the ventral and dorsal streams. PI-Net simulates the interaction between the two pathways during human brain visual information transmission. Extensive experiments show that the proposed TSPI-CNN performs well on the MK-D2, PlantVillage, Apple-3 leaf, and Cassava leaf datasets. Furthermore, the effect of the number of interactions on the recognition performance of TSPI-CNN is discussed. The experimental results show that as the number of interactions increases, the recognition accuracy of the network also increases. Finally, the network is visualized to show its working mechanism and provide enlightenment for future research.

Key words: brain parallel interaction mechanism; recognition accuracy; convolutional neural network; crop leaf disease recognition

Introduction

Crop leaf diseases are one of the main challenges in agriculture, and catastrophic crop leaf diseases aggravate the shortage of food supply[1]. Identifying crop leaf diseases timely and effectively is the first step toward taking relevant control measures and stopping losses in time.

With the continuous development of machine vision and artificial intelligence, crop leaf disease diagnosis based on visible light images and infrared spectral images has also developed rapidly. Although infrared spectral and hyperspectral images[2] carry more abundant information than visible light images, they are limited by heavy and expensive acquisition equipment, and the cost of image acquisition is often relatively high. These shortcomings limit the wide development and application of the technology[3]. In contrast, visible light images can be easily obtained through cameras, mobile phones, and other smart electronic devices. As a result, visible light images are increasingly popular with researchers[4].

Good progress has also been made in the image-based recognition of visible crop leaf diseases. Support vector machine (SVM) models[5], Bayesian classification[6], random forest classification[7], artificial neural networks[8], and so on are widely used in crop leaf disease recognition research. These methods identify diseases by extracting plant-specific image features. Although they have achieved good recognition results, such specific features cannot completely characterize crop disease information. At the same time, under different backgrounds, weather conditions, and degrees of damage, the efficiency of these methods is low. Moreover, the number of experimental samples selected in the above studies is limited, and the selected leaves are often limited to the same plant.

In recent years, the convolutional neural network (CNN)[9] has been used to solve the problem of crop disease image recognition[10]. A CNN is a multi-level, non-fully-connected neural network that simulates the structure of the human brain. Using a supervised deep learning method, visual patterns can be recognized directly from raw images. Compared with traditional classification and recognition methods, it can obtain the global and contextual features of an image and achieve high robustness and a high recognition rate[11].

Against complex backgrounds, crop leaf disease images often suffer from cluttered background information, changeable disease characteristics, high disease correlations, and high disease complexity. From the perspective of biological vision, we intend to extract and transmit disease image information by simulating the human brain, and thus propose a network based on the human brain visual mechanism that can accurately recognize diseases against complex backgrounds.

In this paper, we propose a two-stream parallel interactive convolutional neural network (TSPI-CNN) based on the human brain vision mechanism for crop leaf disease recognition. This work has the following three main contributions.

(1) Inspired by the two-stream mechanism of human brain vision, a two-stream parallel network (TSP-Net) is proposed. It enhances the extraction of color, shape, and position features in the spot areas of crop leaves.

(2) A new deep learning module, called the parallel interactive network (PI-Net), is proposed based on the brain visual mechanism. PI-Net is introduced into TSPI-CNN to enhance the distinguishability of visual features.

(3) The proposed framework is validated on different datasets by comparison with other methods.

The rest of this paper is organized as follows. Section 1 describes the relevant structures and mechanisms of the biological vision system. Section 2 presents TSPI-CNN based on the brain vision mechanism and expounds the network architecture and implementation details. Section 3 conducts experiments on different datasets and networks, and subsequently visualizes and analyzes the results. Conclusions and interesting future work are given in Section 4.

1 Biological Evidence for New Model Learning

1.1 Parallel mechanism

The mechanism of human brain visual information processing is shown in Fig. 1. Biological studies have shown that there are cone cells and rod cells in the human eye[12]. The nerve impulses they produce are analyzed by the afferent nerve center and transmitted through the P pathway and the M pathway, respectively, ending at the visual cortex. The receptive fields of the P pathway are generally smaller and sensitive to color, while the receptive fields of the M pathway are larger, and some of its cells are sensitive to motion. This can be regarded as the transmission of these two types of information into the cerebral cortex.

Fig. 1 Visual information processing in human brain

The central parts of the visual cortex[13] are the extrastriate cortex and the primary visual cortex. The extrastriate cortex mainly includes the second, third, fourth, and fifth visual regions, namely V2, V3, V4, and V5. The primary visual cortex is also called the first visual region, V1. The accepted view of cortical information processing is that the primary visual cortex receives information from the lateral geniculate bodies, after which the whole process is completed simultaneously by two parallel pathways, the dorsal and ventral streams[14]. Among them, V1, V2, and V5 constitute the dorsal stream, which realizes dynamic information extraction; it mainly stores the spatial information and motion control information of the target object in order to complete motion detection and positioning. V1, V2, and V4 constitute the ventral stream, which realizes static information extraction; it mainly stores the color and shape information of the target object to realize object recognition. When the information processing terminal is reached, the outputs of the two streams interact in the shared target brain region for data fusion.

1.2 Interaction mechanisms

The human brain visual interaction mechanism refers to the interaction of the two information streams during human brain visual information transmission. It has been proved that the two streams (dorsal and ventral) in the human brain visual mechanism interact at different stages when processing feature information. Biological studies also show that this interaction controls, to some extent, the specificity of different regions when the human brain processes visual information[15]. In addition, researchers have carried out many mechanistic studies on how the two main visual pathways interact[16]. Reference [17] showed that there are many anatomical intersections between the ventral stream and the dorsal stream; the two information streams fuse different information through horizontal interaction, thus improving the brain's ability to recognize objects. Reference [18] found that if the brain used only ventral stream information, its recognition performance would decline. This indicates that dorsal stream information can assist the ventral stream and improve the recognition ability of the brain.

2 Proposed Approach

The aim of this study is to construct a bio-inspired visual computational model with good robustness to achieve fast and accurate recognition of crop disease images. In this section, TSPI-CNN is introduced. First, the architecture of TSPI-CNN is described in detail. Second, the concrete structures of TSP-Net and PI-Net are described, and their implementation details are introduced.

2.1 Overview of methodology

Aiming at the robust extraction of crop leaf spot features and drawing on the visual processing mechanism of the human brain[17, 19], TSPI-CNN is proposed to identify crop leaf diseases. It is intended to allow the network to extract more abundant disease image features and further enhance its nonlinear fitting ability, so as to realize high-precision and strongly robust crop leaf disease recognition.

The overall architecture of the proposed TSPI-CNN is shown in Fig. 2, which includes TSP-Net and PI-Net. In Fig. 2, the solid arrow connections in the network represent the defined TSP-Net, and the dotted arrows connecting the orange highlighted modules represent the defined PI-Net. Given a crop disease image, our network outputs the classification probability of the input image through TSP-Net and PI-Net, thereby obtaining the classification result. PI-Net combines the different information extracted by the two branches to provide richer features for the subsequent layers, thus improving the robustness and recognition rate of the whole network. Finally, a bilinear pooling layer and a softmax layer are placed at the end of the parallel branches.

Fig. 2 Overall architecture of TSPI-CNN for crop leaf disease recognition

2.2 TSP-Net

TSP-Net is designed based on the visual parallel processing mechanism and has two branches: one called P-CNN and the other called M-CNN. In this paper, the network structures of the two branches are similar, and each branch in TSP-Net contains one convolutional layer and four residual modules. The difference between the two branches is that their first convolution layers have different receptive fields: the convolution kernel size is 3×3 in P-CNN and 5×5 in M-CNN. This corresponds to the receptive fields of the P pathway and the M pathway in human brain visual information processing. Some parameters of TSP-Net are shown in Table 1.

Table 1 Some parameters of TSP-Net
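To make the two-stem design concrete, the following is a minimal Keras sketch of the branch stems, written under the assumption of standard TensorFlow 2 layers; the channel count (64) and layer names are illustrative placeholders, not the values in Table 1.

```python
import tensorflow as tf
from tensorflow.keras import layers

def branch_stem(kernel_size, name):
    # The only structural difference between the branches is the first
    # convolution's receptive field: 3x3 for P-CNN, 5x5 for M-CNN.
    return tf.keras.Sequential(
        [layers.Conv2D(64, kernel_size, padding="same", activation="relu")],
        name=name)

inputs = layers.Input(shape=(256, 256, 3))
p_feat = branch_stem(3, "P_CNN_stem")(inputs)  # color/shape-sensitive stream
m_feat = branch_stem(5, "M_CNN_stem")(inputs)  # spatial/position-sensitive stream
# Each stem is followed by four residual modules (Table 2, Fig. 3).
```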

The residual module is used to speed up training convergence and suppress network overfitting. If the number of channels of the input feature map is m, then the number of channels of the output feature map is 2m. The specific parameters of the residual module are shown in Table 2, and its internal structure is shown in Fig. 3. The residual module contains three convolution layers, namely C1, C2, and C3. T1, T2, and T3 are the outputs produced as the input passes through these three convolution layers. T1 and T3 have the same number of channels, so they can be combined directly: T1 and T3 are added to form the total output of the entire residual module. The residual module uses the idea of residual learning, and the convolution layers learn residual functions.

Fig. 3 Internal structure of residual module with three convolution layers

Table 2 Specific parameters of residual module

It is assumed that

$F(x) = R(x) + T_1 = T_3 + T_1$,

(1)

where $x$ represents the input image; $F(x)$ is the total output of the whole residual sub-network; $R(x)$ is the residual function.

Then the residual function $R(x)$ is defined as

$R(x) = F(x) - T_1 = T_3$.

(2)

By learning R(x), we capture more of the prominent tiny details in images. In the whole module, the low-level features from the C1 convolution and the finer detail features from the three convolution layers C1, C2, and C3 are extracted together. The extracted features are then passed to the following network structure to continue finer feature extraction.
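As a reading aid, the following is a minimal sketch of the residual module in Fig. 3, assuming $T_1 = C_1(x)$, $T_3 = C_3(C_2(T_1))$, and $F(x) = T_1 + T_3$ as in Eq. (1); the kernel sizes here are assumptions, since Table 2 holds the actual parameters.

```python
import tensorflow as tf
from tensorflow.keras import layers

def residual_module(x, m):
    # C1 doubles the channel count from m to 2m, so T1 and T3 match.
    t1 = layers.Conv2D(2 * m, 1, padding="same", activation="relu")(x)   # C1
    t2 = layers.Conv2D(2 * m, 3, padding="same", activation="relu")(t1)  # C2
    t3 = layers.Conv2D(2 * m, 3, padding="same", activation="relu")(t2)  # C3
    # F(x) = T1 + T3, so the stacked convolutions learn the residual R(x) = T3.
    return layers.Add()([t1, t3])
```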

At the end of the two branches, their feature information is fused by bilinear pooling[20]. We define the output features at the ends of the two branches as $E_p \in \mathbb{R}^{C \times A}$ and $E_m \in \mathbb{R}^{C \times D}$, i.e., $E_p$ and $E_m$ extract features of size $C \times A$ and $C \times D$, respectively. Multiplying these two features yields the matrix $B$ with dimension $A \times D$. This process can be expressed as

$B(x, E_p, E_m) = E_p^{\mathrm{T}} E_m$.

(3)

To facilitate the matrix operation, $B$ over all positions is sum-pooled to obtain the matrix $\eta$. At this point, the dimension of $\eta$ is still $A \times D$. The process can be expressed as

$\eta(x) = \sum B(x, E_p, E_m) \in \mathbb{R}^{A \times D}$.

(4)

Subsequently, we reshape the matrix $\eta$ into a vector, which is denoted as the bilinear vector $v$:

$v = \operatorname{vec}(\eta(x)) \in \mathbb{R}^{AD \times 1}$.

(5)

Finally, the fused feature $z$ is obtained by performing matrix normalization and L2 normalization operations on the vector $v$. Following the bilinear pooling of reference [20], the process can be expressed as

$z = \dfrac{\operatorname{sign}(v)\sqrt{|v|}}{\left\|\operatorname{sign}(v)\sqrt{|v|}\right\|_2}$.

(6)

After the fused feature $z$ is obtained, a softmax layer is added to output the classification probability of the input image. Assume a score vector $s = (s_1, s_2, \cdots, s_k)$. The process can be expressed as

$\delta(s)_{y_i} = \dfrac{e^{s_{y_i}}}{\sum_{j=1}^{k} e^{s_j}}$,

(7)

where $y_i$ is the label of sample $i$, i.e., $y_i$ is the ordinal number of the sample's real category in the category set; $\delta(s)$ represents the softmax output value.
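The fusion pipeline of Eqs. (3)-(7) can be sketched as follows, assuming the branch outputs have been flattened over spatial positions; the signed square root stands in for the matrix normalization of Eq. (6) and follows the recipe of reference [20].

```python
import tensorflow as tf

def bilinear_fusion(e_p, e_m):
    # e_p: (batch, L, A), e_m: (batch, L, D), with L spatial positions.
    eta = tf.matmul(e_p, e_m, transpose_a=True)  # Eqs. (3)-(4): outer products summed over L
    v = tf.reshape(eta, [tf.shape(eta)[0], -1])  # Eq. (5): bilinear vector of length A*D
    v = tf.sign(v) * tf.sqrt(tf.abs(v))          # signed square root
    return tf.math.l2_normalize(v, axis=-1)      # Eq. (6): L2 norm -> fused feature z

# Eq. (7): a Dense layer with softmax turns z into class probabilities, e.g.
# probs = tf.keras.layers.Dense(num_classes, activation="softmax")(z)
```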

2.3 PI-Net

Considering the interaction between the two branches, we propose PI-Net. PI-Net consists of an input, a feature interaction module, and an output. PI-Net is defined as

$\mathrm{PI} = (L_P, L_M, F, O)$,

(8)

where $L_P$ represents the P-CNN branch; $L_M$ represents the M-CNN branch; $F$ represents the feature interaction unit; $O$ represents the output. The input of the parallel interactive network comes from the outputs $L_P$ and $L_M$ of the residual modules of the two parallel branches. The features extracted by $L_P$ and $L_M$ interact in $F$ to help $L_P$ extract more abundant features.

(9)

(10)

(11)

Fig. 4 PI-Net structure diagram: (a) parameter relationship and interaction location of PI-Net; (b) feature interaction module

(12)

Parallel interaction can provide more abundant feature information, but some highly redundant and strongly correlated information will also be produced. At this point, we need to filter the feature information, deleting the redundant information and extracting the effective disease information (a structural sketch follows Eq. (13)). The process can be expressed as

(13)
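Since the bodies of Eqs. (9)-(13) are not reproduced here, the sketch below shows only one plausible reading of the description above: the two branches' residual-module outputs are combined, and a 1×1 convolution filters out redundant channels before the result is fed back to the P-CNN branch. Both the concatenation and the 1×1 convolution are assumptions, not necessarily the paper's exact operations.

```python
import tensorflow as tf
from tensorflow.keras import layers

def interaction_unit(p_feat, m_feat, out_channels):
    # Combine the information carried by the two streams (assumed fusion).
    fused = layers.Concatenate(axis=-1)([p_feat, m_feat])
    # Filter redundant, strongly correlated channels; keep effective disease cues.
    return layers.Conv2D(out_channels, 1, padding="same", activation="relu")(fused)

# The filtered features re-enter the P-CNN branch as richer input (Fig. 4).
```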

3 Experiments and Results

3.1 Datasets

We conduct experiments on four crop leaf disease image datasets, including MalayaKew leaf dataset (MK-D2)[21], PlantVillage dataset[22], Apple leaf dataset[23], and Cassava leaf dataset[24]. The detailed statistics with category numbers and data splits are summarized in Table 3.

Table 3 Statistics of crop leaf disease datasets used in this paper

MK-D2 is a subset of the MalayaKew leaf dataset and consists of 44 classes, collected at the Royal Botanic Gardens, Kew, England. MK-D2 contains 43 472 images, with 788 training images and 200 test images per class, and each image is a color image of 256×256 pixels in size. Examples of three categories in MK-D2 are listed in Fig. 5. This dataset is very challenging because leaves from different classes have very similar appearances.

Fig. 5 Image examples of three categories in MK-D2

The PlantVillage project addresses the problem of plant disease diagnosis and opens its database to all users. The database contains images of diseased and healthy leaves of a variety of plants. A total of 61 525 leaf images collected by the PlantVillage project were used as experimental data, covering 39 kinds of diseased and healthy leaves of 14 plants. In this dataset, each image is a color image of 256×256 pixels in size. At the same time, we used six different augmentation techniques to increase the dataset size: image flipping, gamma correction, noise injection, principal component analysis (PCA) color augmentation, rotation, and scaling (a code sketch follows Fig. 6). Examples of three categories in the PlantVillage dataset are listed in Fig. 6.

Fig. 6 Image examples of three categories in PlantVillage dataset
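A minimal sketch of the six augmentations, assuming TensorFlow image ops on float images in [0, 1]; the parameter values are illustrative, and PCA color augmentation is only stubbed out because it requires per-dataset color eigenvectors.

```python
import tensorflow as tf

def augment(image):
    image = tf.image.random_flip_left_right(image)                  # flipping
    image = tf.image.adjust_gamma(image, gamma=0.9)                 # gamma correction
    image = image + tf.random.normal(tf.shape(image), stddev=0.01)  # noise injection
    image = tf.image.rot90(image)                                   # rotation (here 90 degrees)
    image = tf.image.resize(image, [288, 288])                      # scaling up ...
    image = tf.image.resize_with_crop_or_pad(image, 256, 256)       # ... and crop back
    # PCA color augmentation would shift RGB along dataset eigenvectors (omitted).
    return tf.clip_by_value(image, 0.0, 1.0)
```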

The Apple leaf dataset images come from a Kaggle competition called Plant Pathology 2020-FGVC7. The competition is part of the Fine-Grained Visual Categorization (FGVC7) workshop at the Conference on Computer Vision and Pattern Recognition (CVPR 2020). This dataset consists of 3 462 images, including 1 821 training samples and 1 641 test samples. The dataset contains 4 categories: healthy, infected with apple rust, infected with apple scab, and those with more than one disease. Each image is a color image of 2 048×1 325 pixels in size. Examples of four categories in the Apple leaf dataset are listed in Fig. 7.

The Cassava leaf dataset images come from a competition organized by the Makerere University AI laboratory, which introduced a dataset of 21 367 labeled images collected during a regular survey in Uganda. Most images were crowdsourced from farmers taking photos of their gardens and annotated by experts at the National Crops Resources Research Institute (NaCRRI) in collaboration with the AI lab at Makerere University, Kampala. The format most realistically represents what farmers would need to diagnose in real life. The dataset used here consists of 9 430 images, including 5 656 training samples and 3 774 test samples. The dataset contains 5 categories, and each image is a color image. Examples of five categories in the Cassava leaf dataset are listed in Fig. 8.

Fig. 8 Image examples of five categories in Cassava leaf dataset

3.2 Experiment setting and results

3.2.1 Experiment setting

The experiments in this study are carried out on a machine with an NVIDIA 2080Ti GPU using the TensorFlow platform. During training, we use the Adam optimizer with an initial learning rate of 0.001, and the batch size is set to 64. As the number of epochs increases, the learning rate is automatically decayed with a decay rate of 0.1. Subsequently, we use the test dataset to verify the performance of our methods. We test each network 10 times and average the results.
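The optimizer setup described above can be sketched as follows; the milestone epochs for the 0.1 decay and the total epoch count are assumptions, since the paper does not state when the decay is triggered.

```python
import tensorflow as tf

optimizer = tf.keras.optimizers.Adam(learning_rate=1e-3)  # initial learning rate 0.001

def lr_schedule(epoch, lr):
    # Multiply the learning rate by the 0.1 decay rate at assumed milestones.
    return lr * 0.1 if epoch in (30, 60) else lr

callbacks = [tf.keras.callbacks.LearningRateScheduler(lr_schedule)]
# Usage sketch:
# model.compile(optimizer=optimizer, loss="sparse_categorical_crossentropy")
# model.fit(train_ds.batch(64), epochs=90, callbacks=callbacks)  # batch size 64
```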

3.2.2 Classification results

We experimented on the above four datasets to verify the recognition accuracy of TSPI-CNN. The input image size of the network is 256×256. For MK-D2, we use the provided training set and test set; all 44 leaf types are used, with 788 training images and 200 test images per type. For the PlantVillage dataset, we randomly extracted 80% of the images as the training set for the training model and used the remaining images as the test set to evaluate model performance. For the Apple leaf dataset, we use the single-label part of the competition's training data as our experimental dataset, i.e., we remove the combined-disease training samples. Because the competition platform does not provide labels for the test images, we use only the training set and name it the Apple-3 leaf dataset. The resulting dataset has 1 729 images in three categories; Table 4 shows each class and the number of corresponding images. For the Cassava leaf dataset, we likewise use the labeled training data provided with the competition as our experimental dataset. The resulting dataset has 5 656 images in five categories; Table 5 shows each class and the number of corresponding images. For the Apple-3 leaf and Cassava leaf datasets, we selected 80% of the images as training sets and 20% as test sets to evaluate recognition performance. To avoid overfitting during training and improve the generalization ability of the model, we augment the training set with rotation, up-and-down translation, horizontal flipping, vertical flipping, and so on.

Table 4 Disease name and number of corresponding images in Apple-3 leaf dataset

After training, the test set is used to evaluate the classification accuracy of our networks, including TSP-Net and the three improved networks. We test each network 10 times and average the results. The recognition accuracy on MK-D2 is shown in Table 6. At the same time, we compare other methods on this dataset, including the deep CNN (MLP) method[21], the deep CNN SVM (linear) method[21], the self-attention convolutional neural network (SACNN) method[25], the combined SVM (linear) method[26], the hand-crafted features (HCF) method[27], the scale-invariant feature transform (SIFT) SVM (linear) method[28], and the LeafSnap SVM (RBF) method[29].

As can be seen from Table 6, adding PI-Net helps improve the recognition accuracy, and the accuracy improves as more PI-Nets are added: adding one PI-Net increases the recognition accuracy by 0.08%, two PI-Nets by 0.17%, and three PI-Nets by 0.21%. Compared with other methods, our proposed method performs best. The recognition accuracy of the network without PI-Net already reaches 99.71%, and the recognition accuracy with three PI-Nets is as high as 99.92%.

Table 6 Recognition accuracy on MK-D2

The experimental results on the other three datasets are shown in Table 7. We also take out a single branch (One-Net) to train and test its recognition performance. Clearly, adding PI-Net improves recognition accuracy, and the CNN with two branches obtains better results than the CNN with only one branch. For the PlantVillage dataset, the recognition accuracy of TSP-Net without PI-Net is 98.82%; adding one, two, and three PI-Nets improves it by 0.11%, 0.46%, and 0.60%, respectively. For the Apple-3 leaf dataset, using two branches increases the recognition accuracy by about 1.40%, and adding one, two, and three PI-Nets to TSPI-CNN raises it by 0.52%, 0.83%, and 1.77%, respectively, over the network without PI-Net. For the Cassava leaf dataset, the two-branch CNN again outperforms the single-branch CNN, with the recognition accuracy increasing by about 2.07%; adding one, two, and three PI-Nets raises it by 0.75%, 1.44%, and 2.80%, respectively, over the network without PI-Net.

Table 7 Recognition accuracy on other three datasets

3.3 Visualization analysis

Taking the Apple-3 leaf dataset as an example, we visualize the models with different numbers of PI-Nets. The parameter counts of the models before bilinear pooling are shown in Table 8. It can be seen from Table 8 that adding PI-Nets changes the number of parameters, but the parameters do not increase linearly with the number of PI-Nets. This shows that once information interaction occurs, the deeper the network, the higher the complexity of the model.

Table 8 Parameters before bilinear pooling for models with different numbers of PI-Nets in Apple-3 leaf dataset

We visualize our TSPI-CNN with three PI-Nets on a raw image of apple leaf disease, shown in Fig. 9. Figures 10, 11, and 12 show the feature maps before and after the first, second, and third interactions, respectively. It can be seen that the feature maps learned by P-CNN and M-CNN are different. Because the features extracted by the two branches differ, the richness of the features increases during feature fusion, and the location and shape of the disease spot can be seen more clearly after each interaction.

Fig. 9 A raw image of apple leaf disease (rust)

Fig. 10 Visualization of feature maps before and after the first interaction: (a) feature maps extracted by P-CNN before the first interaction;(b) feature maps extracted by M-CNN before the first interaction;(c) feature maps after the first interaction

Fig. 11 Visualization of feature maps before and after the second interaction: (a) feature maps extracted by P-CNN before the second interaction; (b) feature maps extracted by M-CNN before the second interaction; (c) feature maps after the second interaction

Fig. 12 Visualization of feature maps before and after the third interaction: (a) feature maps extracted by P-CNN before the third interaction;(b) feature maps extracted by M-CNN before the third interaction;(c) feature maps after the third interaction
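Feature maps such as those in Figs. 10-12 can be extracted with a probe sub-model; the sketch below assumes a Keras model, and the layer name is a hypothetical placeholder for the actual interaction layers.

```python
import tensorflow as tf

def feature_maps(model, layer_name, image):
    # Build a sub-model that stops at the layer of interest.
    probe = tf.keras.Model(inputs=model.input,
                           outputs=model.get_layer(layer_name).output)
    return probe(image[tf.newaxis, ...])  # (1, H, W, channels) activations to plot
```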

By comparing the feature information extracted at different depths of the network, we can see that as the network deepens, the extracted features become more and more representative. Shallow layers learn low-level features of disease spots, such as edges and textures, while deep layers learn holistic concepts and the most discriminative characteristics.

Therefore, we speculate that TSP-Net, with its two-stream parallel structure, can extract richer disease feature information, while PI-Net can fuse different feature information and remove redundant feature information. The two networks work together to improve the classification accuracy of diseases.

3.4 Experimental result analysis

Biological research on the human brain shows that the brain adaptively adjusts the interaction positions and the number of interactions between the two information streams, thus maximizing its image recognition ability. In this experiment, we simulate the two-stream transmission and interaction of brain information and adjust the number of interactions, and then perform a comparative analysis by adding different numbers of PI-Nets to our network. The experimental results on the above datasets show that the two-branch architecture is better than the one-branch architecture. At the same time, adding PI-Net improves the disease recognition accuracy of the network, and the accuracy improves further as the number of PI-Nets increases. As we infer, the two-branch CNN is similar to the human brain's two information streams and can extract different characteristics of crop leaf diseases, while the data abstraction ability of the convolution kernels in the feature interaction module of PI-Net resembles the brain's information interaction ability, bringing more abundant disease feature information. The two networks work together to achieve higher recognition accuracy.

4 Conclusions

In this paper, we introduce the visual mechanism of the human brain into CNNs. Inspired by the interaction between the dorsal and ventral information streams of the human brain, we propose TSPI-CNN to recognize crop leaf diseases with high precision. Experiments on the MK-D2, PlantVillage, Apple-3 leaf, and Cassava leaf datasets prove that TSPI-CNN is superior to existing crop leaf disease identification methods. Furthermore, we discuss the effect of the number of PI-Nets on disease recognition performance to demonstrate its working mechanism. In the future, adaptively selecting the locations and number of interaction modules to enrich the feature information and extracting finer disease features from the image are the next steps of this study.