
Rotating machinery fault detection and diagnosis based on deep domain adaptation: A survey

2023-02-09

Chinese Journal of Aeronautics, 2023, Issue 1

Siyu ZHANG, Lei SU*, Jiefei GU, Ke LI, Lang ZHOU, Michael PECHT

a Jiangsu Key Laboratory of Advanced Food Manufacturing Equipment and Technology, School of Mechanical Engineering, Jiangnan University, Wuxi 214122, China

b HUST-Wuxi Research Institute, Wuxi 214071, China

c Center for Advanced Life Cycle Engineering, University of Maryland, College Park, MD 20742, USA

KEYWORDS Deep learning; Domain adaptation; Fault detection and diagnosis; Transfer learning

Abstract In practical mechanical fault detection and diagnosis, it is difficult and expensive to collect enough large-scale supervised data to train deep networks. Transfer learning can reuse the knowledge obtained from the source task to improve the performance of the target task, which performs well on small data and reduces the demand for high computation power. However, direct transfer significantly reduces detection performance because of the domain difference. Domain adaptation (DA) can transfer the distribution information from the source domain to the target domain and solve a series of problems caused by the distribution difference of data. In this survey, we review various current DA strategies combined with deep learning (DL) and analyze the principles, advantages, and disadvantages of each method. We also summarize the application of DA combined with DL in the field of fault diagnosis. This paper provides a summary of the research results and proposes future work based on analysis of the key technologies.

1. Introduction

With the rapid development of modern industry, industrial system integration and information technology have also been growing, and highly reliable components and systems are the guarantee for the safe operation of aerospace equipment. Undetected aviation faults may lead to catastrophic failures of aviation machinery. As the core component of aircraft engines, space shuttles, and other rotating machinery, the mechanical transmission system is prone to various faults after long periods of operation under high speed, heavy load, and harsh conditions, which directly affects the safe operation of the mechanical system. Rotating machinery, represented by aviation equipment, has therefore attracted the attention of researchers. Early fault detection and diagnosis technology can predict the fault development trend, which plays a key role in preventing faults in mechanical transmission systems. Therefore, to keep minor faults from developing into major accidents, all industries attach great importance to intelligent fault diagnosis systems for rotating machinery [1].

DL has achieved great success in the intelligent fault diagnosis of rotating machinery and has benefited industrial applications. Compared with shallow machine learning, DL can automatically extract high-level features and fuse feature extraction and classification into one structure, which avoids a large amount of trial and error and achieves reliable performance [2,3]. However, DL demands long training times and large numbers of labeled samples. In many practical applications, it is difficult and expensive to obtain enough large-scale labeled data to train a deep network to adequate performance. Further, most current machine learning algorithms rely on the basic assumption that the training and testing data are independent and identically distributed. This assumption rarely holds in actual industry because the data change with time and space. As a result, the generalization error of the diagnosis model often fails to meet the fault diagnosis requirements of actual production [4].

Transfer learning can transfer a model trained on a separate, labeled source domain to the desired unlabeled or sparsely labeled target domain, which works well on small data and reduces the demand for high computational power. However, direct cross-domain transfer results in poor performance because of a phenomenon known as domain shift, in which the probability distributions of data and labels differ between domains. Domain shift appears in many situations, such as from one dataset to another, from RGB images to depth images, and from simulation to reality [5].

DA relaxes the constraint in traditional machine learning that the training and testing data must be independent and identically distributed [6]. It can mine invariant features and essential structures between different but related domains, which effectively solves the problems resulting from domain shift, small target size, and unbalanced data [7]. In recent years, DA has become the focus of transfer learning. In the DL era, three kinds of DA methods are usually employed to build a diagnosis model for the source and target domains: the discrepancy-based method, the adversarial-based method, and the reconstruction-based method. The discrepancy-based method reduces the discrepancy between the source and target domains on the corresponding feature layers of the network. The adversarial-based method introduces a domain discriminator to encourage domain confusion and thereby learn features that are invariant across domains. The reconstruction-based method reconstructs the samples of the source and target domains to ensure intra-domain differences and inter-domain commonality. In this paper, these deep DA approaches are introduced by category: the discrepancy-based method is divided into the statistic transformation, structure optimization, and geometric transformation methods; the adversarial-based method is analyzed through two common algorithms; and the reconstruction-based method is introduced through the encoder-decoder structure and its variants. On the basis of these methods, the multi-source domain adaptation (MDA) method is further analyzed for supervised data from multiple source domains. Then, the innovations of deep DA methods and the actual testing results for typical rotating machinery fault parts are discussed. Finally, based on these deep DA strategies, this paper proposes future work for deep DA in fault diagnosis.

2. Deep DA and its application in rotating machinery

DA can reduce the distribution differences between domains so that the model trained on the source data performs well on the target data. The transfer process of deep DA reduces the differences between domains by continually adjusting the network parameters. The goal of DA is to use the labeled source data $X_S$ to learn a classifier that predicts the label $Y_T$ of the target data $X_T$, as shown in Fig. 2. When the difference is small enough, the model parameters and the detection results on the target domain will both be optimal. Once the environment changes, the data collected in the new environment can be treated as the target data, and high accuracy can be achieved by transfer without retraining a new model from randomly initialized parameters. Deep DA therefore has several major advantages, as shown in Fig. 3. First, because DA reduces the distribution difference, labeled data close to the target data can be used to build the model and to increase the annotation of the target data. Second, during training, model parameters trained on big data by large companies can be transferred, fine-tuned, and updated adaptively for the target task to achieve better results. Third, in practical applications, DA can be used for adaptive learning to meet personalized needs: considering the similarities and differences between users, the general model is flexibly adjusted to complete the target task.

However, DA also has a drawback, negative transfer, as shown in Fig. 3. The core problem of DA is to find the similarity between the two domains. If the two domains share no similarity, or very little, the effect of DA is greatly degraded. The main reasons for negative transfer are: (1) data problems: the source domain and the target domain are not similar at all; (2) method problems: the source domain and the target domain are similar, but the DA method is not good enough to find the components that can be transferred. Negative transfer hinders the research and application of DA. In practical applications, finding reasonable similarity and choosing or developing suitable transfer learning methods can avoid negative transfer.

Fig. 1 Marginal distribution of domains.

Fig. 2 Process of DA.

In recent years, there have been three major areas of research on DA. The first group is the discrepancy-based method, which measures the distance between the source domain and the target domain on the feature layers of the model and utilizes statistical approaches to reduce the domain difference. Statistic transformation, structure optimization, and geometric transformation are the main targets of research. The second group is the adversarial-based method, which contains two competing structures. This group can be categorized into two subgroups: generative adversarial DA and non-generative adversarial DA. Here, a domain discriminator is used to encourage domain confusion via an adversarial objective that minimizes the distance between the two domains, which is one of the new research topics in DL. The third group is the reconstruction-based method, in which the domain difference is reduced by mapping the source data, the target data, or both into a shared domain. Encoder-decoder models and generative adversarial nets (GAN) are typical examples. Fig. 4 shows the three research areas of DA.

Fig. 4 Three research groups of DA.

Fig. 3 Major advantages and disadvantage of DA.

Table 1 Comparison of four main detection methods for bearing.

Deep DA can diagnose many components of rotating machinery and has achieved satisfactory results. The most important type is complex mechanical transmission parts, such as bearings and gears. These parts work at high speed under variable working conditions, their operating environment is often very complex, and diagnostic samples are difficult to obtain. However, the requirements for detection accuracy and efficiency are often high. Deep DA is therefore an effective method to diagnose these components. For example, the bearing is one of the most important components in rotating machinery. There are several methods for bearing diagnosis, such as signal processing and analysis, shallow machine learning, deep learning, and deep DA. Signal processing and analysis identifies a bearing fault from the energy change in each frequency band of the vibration signal when the fault occurs. Although existing feature extraction methods have achieved good results in application, the process is complex and some steps rely on expert experience. Shallow machine learning learns quickly and needs fewer training samples, which enables fast fault diagnosis of bearings, but it also relies on experience to select strongly correlated fault features. Deep learning learns from the original fault data layer by layer and establishes the mapping between fault samples and fault categories without manual feature extraction, but it is difficult to apply effectively when fault samples are scarce. Because of its advantages, deep DA is the preferred method for diagnosing rotating machinery with small samples in complex environments. Table 1 shows the comparison of the four main detection methods for bearings.

With the development of rotating machinery, the quality requirements for transmission parts are becoming higher and higher, and deep DA will have increasing potential as an effective method for the detection of mechanical transmission components. This survey gives a specific overview of the research advances of deep DA, followed by a description of applications for fault detection of rotating machinery components.
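
The signal-processing baseline mentioned above, which looks at the energy in each frequency band of the vibration signal, can be illustrated with a short NumPy sketch. The sampling rate, the band edges, and the 120 Hz "fault" component are hypothetical values chosen only for illustration; real bearing fault frequencies depend on geometry and shaft speed.

```python
import numpy as np

def band_energies(signal, fs, bands):
    """Spectral energy of `signal` inside each (low, high) band, in Hz."""
    spectrum = np.abs(np.fft.rfft(signal)) ** 2
    freqs = np.fft.rfftfreq(len(signal), d=1.0 / fs)
    return np.array([spectrum[(freqs >= lo) & (freqs < hi)].sum() for lo, hi in bands])

fs = 1000                                  # assumed sampling rate in Hz
t = np.arange(fs) / fs                     # one second of samples
healthy = np.sin(2 * np.pi * 50 * t)       # shaft-related component only
faulty = healthy + 0.8 * np.sin(2 * np.pi * 120 * t)  # hypothetical fault component

bands = [(0, 100), (100, 200), (200, 500)]
ratio_healthy = band_energies(healthy, fs, bands)
ratio_healthy = ratio_healthy / ratio_healthy.sum()
ratio_faulty = band_energies(faulty, fs, bands)
ratio_faulty = ratio_faulty / ratio_faulty.sum()
```

In this toy case the share of energy in the 100-200 Hz band rises from essentially zero to a substantial fraction once the extra component appears, which is the kind of band-energy change an expert would then have to interpret.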

3. Research advances of deep DA

The research advances of deep DA are given in this section, which provides the basis for analyzing the application of DA in fault detection of rotating machinery components. The main research aspects of deep DA shown in Fig. 4, namely the discrepancy-based, adversarial-based, and reconstruction-based methods, are studied.

3.1. Discrepancy-based method

The basic idea of the discrepancy-based method is to align the feature distributions of different domains by minimizing the distribution difference. According to the measurement and transformation used, the discrepancy-based method can be further divided into statistic transformation, structure optimization, and geometric transformation sub-methods, as shown in Table 2 [8-19]. The differences between these subgroups are elaborated in the following. Statistic transformation starts from the network parameters and minimizes the domain differences by adjusting them. The most commonly used measures for comparing and reducing distribution shift are maximum mean discrepancy (MMD), the coral measure, and the Wasserstein distance, among others. For $n_S$ source samples $x_i^S$ and $n_T$ target samples $x_j^T$, MMD compares the mean value difference between the source domain and the target domain [8], which can be expressed as

$$\mathrm{MMD}(X_S, X_T) = \left\| \frac{1}{n_S} \sum_{i=1}^{n_S} \varphi(x_i^S) - \frac{1}{n_T} \sum_{j=1}^{n_T} \varphi(x_j^T) \right\|_{\mathcal{H}}$$

Table 2 Summary for discrepancy-based method.

where φ(·) is the mapping. On this basis, MMD was combined with a neural network for the first time in 2014 (DaNN) [20], which reduces the distribution difference in the latent space. However, its representation ability is limited by its shallow feature extraction layer, so the DA problem cannot be solved effectively. By combining MMD with a convolutional neural network (CNN), deep domain confusion (DDC) was proposed, which extends deep CNNs to DA [21]. The corresponding loss function is consistent with that of DaNN. However, DDC adapts only one layer of the network, which limits transferability across multiple layers. Based on DDC, deep adaptation networks (DAN) were proposed with two improvements. The first is to apply MMD to multiple layers; the later layers, which extract more private features, were selected for adaptation. The second is to replace MMD with multiple kernel MMD (MK-MMD), which uses a convex combination of multiple kernel functions [9]. The corresponding multilayer loss adds the MK-MMD of each adapted layer to the classification loss on the labeled source data.
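
A common way to write this multilayer loss, given here as an assumption about notation rather than as the survey's own equation, is $L = L_c + \lambda \sum_{l} d_k^2(D_S^l, D_T^l)$, where $L_c$ is the source classification loss, $\lambda$ a trade-off weight, and $d_k^2$ the squared MK-MMD between source and target features at layer $l$. The NumPy sketch below estimates one such $d_k^2$ term with a convex combination of Gaussian kernels; the bandwidths, the equal kernel weights, and the synthetic features are illustrative choices.

```python
import numpy as np

def rbf_kernel(a, b, bandwidth):
    """Gaussian kernel matrix k(a_i, b_j) = exp(-||a_i - b_j||^2 / (2 * bandwidth^2))."""
    sq_dist = np.sum(a**2, axis=1)[:, None] + np.sum(b**2, axis=1)[None, :] - 2.0 * a @ b.T
    return np.exp(-np.maximum(sq_dist, 0.0) / (2.0 * bandwidth**2))

def mk_mmd2(xs, xt, bandwidths=(0.5, 1.0, 2.0), weights=None):
    """Biased estimate of squared MMD under a convex combination of RBF kernels."""
    if weights is None:
        weights = np.full(len(bandwidths), 1.0 / len(bandwidths))
    total = 0.0
    for beta, bw in zip(weights, bandwidths):
        total += beta * (rbf_kernel(xs, xs, bw).mean()
                         + rbf_kernel(xt, xt, bw).mean()
                         - 2.0 * rbf_kernel(xs, xt, bw).mean())
    return total

rng = np.random.default_rng(0)
source = rng.normal(0.0, 1.0, size=(200, 4))       # stand-in for source-layer features
target_near = rng.normal(0.0, 1.0, size=(200, 4))  # same distribution as the source
target_far = rng.normal(1.5, 1.0, size=(200, 4))   # shifted distribution

mmd_near = mk_mmd2(source, target_near)
mmd_far = mk_mmd2(source, target_far)
```

The shifted target yields a clearly larger value than the matched one, which is the signal a network would be trained to reduce. In DAN the kernel weights are learned; fixing them to equal values is a simplification made here.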

Coral (correlation alignment) matches the second-order statistics of the two domains, which can be expressed as

$$\min_A \left\| A^{T} C_S A - C_T \right\|_F^2$$

where $A$ is the second-order feature transformation, $T$ is the corresponding transpose operation, $C_S$ is the source covariance matrix, and $C_T$ is the target covariance matrix. Coral is simple and efficient to solve in deep DA and is suited to mismatches in low-level image statistics such as texture, edge contrast, and color. Coral is therefore usually applied to the shallow layers of DL, as shown in Fig. 6. Sometimes it is necessary to transform one distribution into another continuously while preserving the geometric characteristics of the distribution itself, so that the result can serve as a reliable measure of feature difference between task-specific classifiers. For this purpose, a sliced Wasserstein discrepancy method was proposed by defining a sliced barycenter of discrete measures [12]. However, the Wasserstein distance is difficult to calculate, which limits its application. All the above methods perform well in their respective domain scenarios, but it is not clear which layer or layers of a deep network need to implement DA.
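
The coral measure described above can be sketched in NumPy as follows. The sketch assumes the usual closed-form choice $A = C_S^{-1/2} C_T^{1/2}$ (whitening the source features and re-coloring them with the target covariance) and the $1/(4d^2)$ scaling often used for the deep variant; the function names, the small ridge term `eps`, and the synthetic data are illustrative.

```python
import numpy as np

def sym_power(mat, power):
    """Power of a symmetric positive-definite matrix via its eigendecomposition."""
    vals, vecs = np.linalg.eigh(mat)
    return (vecs * vals**power) @ vecs.T

def coral_loss(xs, xt):
    """Squared Frobenius distance between the two covariance matrices, scaled by 1/(4 d^2)."""
    d = xs.shape[1]
    diff = np.cov(xs, rowvar=False) - np.cov(xt, rowvar=False)
    return np.sum(diff**2) / (4.0 * d * d)

def coral_align(xs, xt, eps=1e-6):
    """Apply A = C_S^(-1/2) C_T^(1/2) to the source features."""
    d = xs.shape[1]
    cs = np.cov(xs, rowvar=False) + eps * np.eye(d)  # ridge keeps the matrices invertible
    ct = np.cov(xt, rowvar=False) + eps * np.eye(d)
    return xs @ sym_power(cs, -0.5) @ sym_power(ct, 0.5)

rng = np.random.default_rng(0)
xs = rng.normal(size=(500, 3)) @ np.array([[2.0, 0.0, 0.0], [0.5, 1.0, 0.0], [0.0, 0.0, 0.5]])
xt = rng.normal(size=(500, 3)) @ np.array([[0.5, 0.0, 0.0], [0.0, 1.5, 0.3], [0.0, 0.0, 2.0]])

loss_before = coral_loss(xs, xt)
loss_after = coral_loss(coral_align(xs, xt), xt)
```

After alignment the covariance gap is close to zero, whereas in a deep network the same loss would instead be minimized by gradient descent on the layer weights.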

The above studies assume that the conditional distributions of the source domain and the target domain are the same and that only the marginal distributions differ. Sometimes, however, it is better to measure the difference of the joint distribution (the conditional and marginal distributions together), because the source classifier cannot then be applied to the target data directly. To make the approach more general, a joint adaptation network combined with MMD (JMMD) [13] was proposed in 2016, which minimizes the joint distribution difference by simultaneously learning class and feature invariants between domains. Fig. 7 shows the main research methods of joint distribution adaptation; different data distribution differences lead to different combination methods. In the above methods, the marginal distribution and the conditional distribution contribute equally to the network. However, the two distributions are not equally important for some tasks. Therefore, dynamic distribution adaptation (DDA) was further introduced to dynamically adjust the importance of each distribution, which achieved a better distribution adaptation effect [14]. DDA is the first accurate quantitative estimation method that utilizes the global and local domain properties to estimate the value of the balance factor μ.
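
One commonly cited form of this estimate, stated here as an assumption rather than as the survey's exact expression, is $\hat{\mu} = 1 - d_M / (d_M + \sum_c d_c)$, where $d_M$ is a distance between the marginal distributions and $d_c$ is the distance for class $c$. The distances are often taken to be proxy A-distances, $2(1 - 2\epsilon)$, with $\epsilon$ the error of a classifier trained to tell the two domains apart. The plain-Python sketch below uses hypothetical function names and made-up error values.

```python
def proxy_a_distance(domain_classifier_error):
    """Proxy A-distance 2 * (1 - 2 * error); 0 when the domains are indistinguishable."""
    return 2.0 * (1.0 - 2.0 * domain_classifier_error)

def balance_factor(d_marginal, d_classes):
    """mu = 1 - d_M / (d_M + sum_c d_c); falls back to 0.5 if every distance is zero."""
    denom = d_marginal + sum(d_classes)
    if denom <= 0.0:
        return 0.5
    return 1.0 - d_marginal / denom

def dynamic_distance(mu, d_marginal, d_classes):
    """Weighted discrepancy (1 - mu) * marginal term + mu * conditional term."""
    return (1.0 - mu) * d_marginal + mu * sum(d_classes)

# Hypothetical domain-classifier errors: one global, one per fault class.
d_m = proxy_a_distance(0.10)
d_c = [proxy_a_distance(e) for e in (0.30, 0.40, 0.45)]
mu = balance_factor(d_m, d_c)
weighted = dynamic_distance(mu, d_m, d_c)
```

With these numbers the marginal gap dominates, so μ comes out below 0.5 and the marginal term receives the larger weight.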

Fig. 6 Application of coral.

Fig. 7 Research of joint distribution adaptation.

Fig. 5 Research of MMD.

Structure optimization starts from the architecture of the network to minimize the domain discrepancy, and it can be applied in most deep DA networks. The related domain knowledge, which can be represented by the statistics of the batch normalization (BN) layer, is stored in the weight matrix of each layer. The statistics in the BN layer can be adjusted to obtain well-trained cross-domain models without any additional parameters to tune. The statistics of the source data in each BN layer can therefore be replaced with those of the target data to align the distributions, as shown in Fig. 8. BN normalizes the mean and standard deviation of each individual feature channel so that each layer receives features with a similar distribution, whether they come from the target or the source [15]. Fig. 9 shows the research advances for BN, which is easy to extend not only to most deep DA networks but also to multiple source domains. In contrast to statistic transformation methods such as MMD and the coral measure, which update the weights for deep DA, this method only adjusts the statistics in the BN layers.

However, neurons are sometimes not effective for all domains because of domain biases, and it is useless for them to capture other features and clutter. In view of this, the weight coefficient optimization method has been widely studied because it pays more attention to shared features without completely discarding feature information. The weighting scheme used in deep DA can quantify the transferability of source examples and control the importance of the source data in learning the target task [16]. The weight coefficient optimization method adjusts the DL architecture to improve its ability to learn shared features, which resolves the ambiguity between shared and private features but does not completely eliminate the influence of the private part.
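
The BN-statistics replacement described above can be sketched for a single layer in NumPy. The feature arrays stand in for the activations entering one BN layer, and the scale and shift parameters (`gamma`, `beta`) are kept at their identity values; all names and numbers are illustrative assumptions.

```python
import numpy as np

def bn_statistics(features):
    """Per-channel mean and variance of the activations entering a BN layer."""
    return features.mean(axis=0), features.var(axis=0)

def batch_norm(features, mean, var, gamma, beta, eps=1e-5):
    """Normalize with the given statistics, then apply the learned scale and shift."""
    return gamma * (features - mean) / np.sqrt(var + eps) + beta

rng = np.random.default_rng(0)
source_feats = rng.normal(0.0, 1.0, size=(256, 8))
target_feats = rng.normal(3.0, 2.0, size=(256, 8))  # shifted and rescaled domain
gamma, beta = np.ones(8), np.zeros(8)               # weights stay untouched

src_mean, src_var = bn_statistics(source_feats)
tgt_mean, tgt_var = bn_statistics(target_feats)

# Target data pushed through a layer that still holds source statistics.
with_source_stats = batch_norm(target_feats, src_mean, src_var, gamma, beta)
# Same layer after its statistics are replaced by those of the target data.
with_target_stats = batch_norm(target_feats, tgt_mean, tgt_var, gamma, beta)
```

Only the stored mean and variance change; with the target statistics in place the layer output is again roughly zero-mean and unit-variance per channel, as it was for the source data during training.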

Geometric transformation starts from the geometrical properties of the source and target domains to reduce the domain shift. Inspired by incremental learning, the Grassmann manifold and its geodesics are widely used in DA. The basic idea is to treat the source domain and the target domain as two points on the Grassmann manifold and then take points on the geodesic between them to form a path along which the source domain can be transformed into the target domain. The geodesic flow kernel method is the most typical; it introduces kernel learning (KL) and settles the question of how many intermediate points to use by integrating over the infinitely many points on the path, as shown in Fig. 10 [17]. Inspired by such intermediate representations, Luo et al. introduced an end-to-end progressive graph learning framework that adopts a graph neural network with episodic training and adversarial learning to solve the domain shift problem at both the sample and manifold levels [18]. Kundu et al. aligned the high source density regions in the latent space by learning a latent space of the target, which accommodated new target classes in the latent space while preserving semantic information [19]. These approaches take the specific geometric structure of the source and target domains into account to describe the nature of the data, which effectively reduces the domain differences, though they face a major bottleneck in their high computational complexity.
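
The path between the two domains can be illustrated with a NumPy sketch that represents each domain by a PCA subspace and samples intermediate subspaces on the Grassmann geodesic. This only samples points on the path; the geodesic flow kernel itself integrates over all of them in closed form, which is not shown. The PCA dimension, the synthetic data, and the function names are assumptions, and the sketch requires the two subspaces not to be orthogonal to each other.

```python
import numpy as np

def pca_basis(x, dim):
    """Orthonormal basis (features x dim) of the top principal directions of x."""
    centered = x - x.mean(axis=0)
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return vt[:dim].T

def grassmann_geodesic(ps, pt, t):
    """Basis of the subspace at position t in [0, 1] on the geodesic from span(ps) to span(pt)."""
    m = ps.T @ pt                              # overlap matrix, assumed invertible
    a = (pt - ps @ m) @ np.linalg.inv(m)       # component of pt orthogonal to span(ps)
    u, s, vt = np.linalg.svd(a, full_matrices=False)
    theta = np.arctan(s)                       # principal angles between the subspaces
    return ps @ vt.T @ np.diag(np.cos(theta * t)) + u @ np.diag(np.sin(theta * t))

rng = np.random.default_rng(0)
source = rng.normal(size=(200, 6)) * np.array([5.0, 4.0, 1.0, 1.0, 1.0, 1.0])
angle = np.pi / 6
rot = np.eye(6)
rot[0, 0], rot[0, 2], rot[2, 0], rot[2, 2] = np.cos(angle), -np.sin(angle), np.sin(angle), np.cos(angle)
target = source @ rot                          # target = rotated copy of the source

ps, pt = pca_basis(source, 2), pca_basis(target, 2)
path = [grassmann_geodesic(ps, pt, t) for t in (0.0, 0.25, 0.5, 0.75, 1.0)]
```

Projecting the data onto each basis in `path` gives a sequence of intermediate feature representations that moves gradually from the source subspace to the target subspace.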

Fig. 8 Batch normalization method.15

Fig. 9 Research of BN.

3.2. Adversarial-based method

Adversarial training is inspired by the two-player game in game theory. The generator and the discriminator play against each other to complete the adversarial training. The corresponding training process for generative adversarial nets (GAN) can be expressed as:

$$\min_G \max_D V(D, G) = E_{x \sim P_m(x)}[\log D(x)] + E_{z \sim P_Z(z)}[\log(1 - D(G(z)))]$$

where $E_{x \sim P_m(x)}$ is the expectation over the distribution of true data, $E_{z \sim P_Z(z)}$ is the expectation over the distribution from which generated data are produced, $D(x)$ is the discriminator mapping for true data, and $D(G(z))$ is the discriminator mapping for generated data. In this survey, the adversarial-based method based on GAN is further divided into two sub-methods: the generative adversarial DA (GADA) method with an additional generator and the non-generative adversarial DA (non-GADA) method without the additional generator.
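
As a small numerical illustration, the standard GAN value function $V(D, G) = E[\log D(x)] + E[\log(1 - D(G(z)))]$ can be estimated directly from discriminator outputs. The function name and the example probabilities below are illustrative assumptions, not values from any cited study.

```python
import numpy as np

def gan_value(d_real, d_fake, eps=1e-12):
    """Estimate V(D, G) from discriminator outputs on real and generated samples."""
    d_real = np.clip(np.asarray(d_real, dtype=float), eps, 1.0 - eps)
    d_fake = np.clip(np.asarray(d_fake, dtype=float), eps, 1.0 - eps)
    return np.mean(np.log(d_real)) + np.mean(np.log(1.0 - d_fake))

# Discriminator that separates real from generated data well.
v_confident = gan_value([0.9, 0.95, 0.85], [0.1, 0.05, 0.2])
# Discriminator that cannot tell them apart (outputs 0.5 everywhere).
v_confused = gan_value([0.5, 0.5, 0.5], [0.5, 0.5, 0.5])
```

The discriminator tries to push this value up and the generator tries to push it down; when the discriminator is reduced to outputting 0.5 everywhere, the value equals $-\log 4$.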

GADA can learn a transformation in an unsupervised way based on GAN, and most research focuses on generating data that are similar to the target data while keeping the annotations. The coupled generative adversarial net, composed of two GANs, was developed to adjust the joint distribution of the source domain and the target domain. By imposing a parameter-sharing constraint on the two GANs, their capacity was restricted so that they generate data with the same distribution from noise [22]. Bousmalis et al. transformed the source data into the target data via a generator and used a discriminator to judge whether the data were real or fake. When the discriminator could no longer tell them apart, the generated data followed the same distribution as the target data [23]. Volpi et al. proposed a new DA method with feature augmentation and domain invariance. First, a feature extractor (ES) with a softmax layer (C) was trained on the source data, and a generator was trained adversarially to perform data augmentation in the source feature space. Then, a domain-invariant feature extractor (E1) combined with the generated features was trained adversarially. Finally, the proposed module was combined with the previously trained softmax layer to perform testing on the source and target data, as shown in Fig. 11 [24]. Johnson et al. presented a novel theory for the GAN approach that does not depend on the traditional minimax formulation. A strong discriminator and a good generator were designed via composite functional gradient learning so that the distance measure between domains improves after each functional gradient step until it converges to zero [25]. Generating target data with ground-truth annotations is one of the effective means to solve the lack of training data.

Fig. 10 Geodesic flow kernel method.17

The aim of deep DA is to learn domain-invariant feature representations from the source data and the target data, so it is crucial to make these representations indistinguishable. Inspired by GAN, the domain confusion loss produced by the discriminator is used without a generator to improve deep DA, which is called the non-GADA method. In non-GADA, the generator no longer generates new samples but instead performs feature extraction, so it is replaced by a feature extractor in many adversarial-based deep DA methods. In 2017, adversarial discriminative domain adaptation (ADDA) was proposed as a general framework, and many existing methods can be regarded as special cases of it. In this method, a classifier was first pre-trained on the source data. Then, the discriminator structure was used to project the target data to the source domain. Finally, the source classifier was used to classify the target domain [26].

Shen et al. estimated the Wasserstein distance between the source and target domains and optimized the feature parameters via adversarial training to minimize it [27]. Long et al. introduced a discriminator for features and a discriminator for class information to align the joint distribution between domains [28]. Cao et al. designed an adversarial mechanism in which a weight was used in each discriminator and the corresponding weight was affected by the predicted labels of the target data; that is, the discriminator was weighted at the category level to select the source data suitable for the target data [29]. Pinheiro et al. combined a similarity-based classifier with adversarial training to jointly learn the domain-invariant features and the categorical prototype representations [30]. Luo et al. added a discriminator after the classifier to adaptively weight the adversarial loss of different features, which emphasized the importance of class-level feature alignment in reducing the domain shift [31]. Zhang et al. proposed domain-symmetric networks, which use an additional classifier based on shared neurons to learn domain discrimination and confusion [32]. Lee et al. used adversarial dropout to learn strongly discriminative features and designed the loss function to realize robust DA [33]. Shu et al. proposed a novel DA method based on curriculum learning and adversarial training to select noiseless and transferable source data, which enhanced positive transfer [34]. Yu et al. further extended the concept of DDA to GAN and proposed a dynamic adaptive network to solve the problem of dynamic JDA via two discriminators [35]. Xue et al. proposed a novel deep adversarial mutual learning method, in which two groups of domain discriminators learn from each other to obtain domain-invariant properties [36]. Shin et al. presented a new two-phase pseudo-label densification method that combines intrinsic spatial correlations in the first phase with confidence-based easy-hard classification in the second phase; on this basis, feature alignment was achieved via adversarial learning [37]. Hu et al. proposed a discriminative partial domain adversarial framework (DPDAN), which used hard binary weights to maximize the distribution divergence between the source-negative data and all the other data and to minimize the domain shift between the target data and the source-positive data [38], as shown in Fig. 12. Wu et al. proposed a dual mixup regularized learning network, including category mixup regularization and domain mixup regularization, to enforce prediction consistency and explore more intrinsic features for better DA [39]. Bao et al. presented a two-level approach in which the first level utilizes MMD to reduce the distribution discrepancy of deep features between domains and the second level brings the deep features closer to their category centers via the domain adversarial network [40].
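
The domain confusion idea behind non-GADA can be sketched with the pair of losses commonly associated with ADDA-style training: the discriminator minimizes a cross-entropy on true domain labels, while the target feature extractor minimizes the same cross-entropy with inverted labels. The fixed linear discriminator `w` and the synthetic features below are hypothetical stand-ins; no network is actually trained in this sketch.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def discriminator_loss(p_source, p_target, eps=1e-12):
    """Cross-entropy for a discriminator that outputs P(domain = source)."""
    p_source = np.clip(p_source, eps, 1.0 - eps)
    p_target = np.clip(p_target, eps, 1.0 - eps)
    return -np.mean(np.log(p_source)) - np.mean(np.log(1.0 - p_target))

def confusion_loss(p_target, eps=1e-12):
    """Inverted-label loss: small when target features look like source features."""
    return -np.mean(np.log(np.clip(p_target, eps, 1.0 - eps)))

rng = np.random.default_rng(0)
w = np.ones(4)                                    # hypothetical fixed discriminator weights
source_feats = rng.normal(1.0, 1.0, size=(100, 4))
target_shifted = rng.normal(-1.0, 1.0, size=(100, 4))  # target features before adaptation
target_aligned = rng.normal(1.0, 1.0, size=(100, 4))   # target features after adaptation

p_src = sigmoid(source_feats @ w)
loss_d = discriminator_loss(p_src, sigmoid(target_shifted @ w))
conf_before = confusion_loss(sigmoid(target_shifted @ w))
conf_after = confusion_loss(sigmoid(target_aligned @ w))
```

The confusion loss drops once the target features overlap with the source features, which is the direction in which the feature extractor is updated during adversarial training.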

Compared with other generative models, GADA generates samples directly from a random distribution, so that it can in theory truly approach the target data, which is the biggest advantage of GAN. However, mode collapse may occur while a GAN is learning, in which case it always generates the same samples and cannot continue learning. When the generator collapses, the discriminator also steers similar samples in the same direction, and the training cannot continue. Non-GADA does not rely on a generator to produce identically distributed samples but realizes domain alignment through adversarial training with the discriminator, which maintains high sample diversity. However, adversarial training needs to reach a Nash equilibrium, which cannot always be achieved, so the training is less stable than traditional training methods.

3.3. Reconstruction-based method

In much deep DA research, a data reconstruction task based on the autoencoder structure and GAN has been proposed to preserve the invariant features while keeping the individual features of each domain during the transfer. The basic autoencoder framework consists of an encoding and a decoding process: it first encodes the data into feature representations and then decodes these representations back into a reconstructed version. Reconstruction-based DA methods usually share the encoder to learn domain-invariant representations and keep the domain-specific representations by minimizing the reconstruction loss for the source and target domains.
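
The shared-encoder idea can be sketched with a deliberately simple stand-in: a linear autoencoder with tied weights, fitted by PCA on the pooled source and target data instead of by gradient descent. This is a swapped-in simplification, not the architecture of any cited method, and the data, the code dimension, and the function names are assumptions.

```python
import numpy as np

def fit_shared_encoder(xs, xt, code_dim):
    """Fit one linear encoder on pooled data; returns the data mean and weight matrix."""
    pooled = np.vstack([xs, xt])
    mean = pooled.mean(axis=0)
    _, _, vt = np.linalg.svd(pooled - mean, full_matrices=False)
    return mean, vt[:code_dim].T               # weights: features x code_dim

def encode(x, mean, w):
    return (x - mean) @ w

def decode(code, mean, w):
    return code @ w.T + mean                   # tied weights: decoder is the transpose

def reconstruction_loss(x, mean, w):
    """Mean squared error between x and its encode-decode round trip."""
    return np.mean((x - decode(encode(x, mean, w), mean, w)) ** 2)

rng = np.random.default_rng(0)
basis = rng.normal(size=(2, 5))                # both domains share a 2-D latent structure
xs = rng.normal(0.0, 1.0, size=(100, 2)) @ basis + 0.01 * rng.normal(size=(100, 5))
xt = rng.normal(2.0, 1.0, size=(100, 2)) @ basis + 0.01 * rng.normal(size=(100, 5))

mean, w = fit_shared_encoder(xs, xt, code_dim=2)
loss_source = reconstruction_loss(xs, mean, w)
loss_target = reconstruction_loss(xt, mean, w)
total_loss = loss_source + loss_target         # objective minimized over both domains
```

Because one encoder has to reconstruct both domains, the code it learns captures the structure they have in common; deep variants add domain-specific branches on top of this shared part.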

Fig. 11 Detailed framework of deep DA method combined GAN.24

Fig. 12 Detailed framework of DPDAN.38

The reconstruction-based autoencoder aims to reconstruct the input information; it can extract good feature descriptions and has strong feature learning ability. The autoencoder was combined with KL divergence to reconstruct data, which achieves better domain alignment [41]. On this basis, Bousmalis et al. proposed domain separation networks (DSN). The source and target domains were each divided into two parts (a public part and a private part) via feature reconstruction and a domain distance measure. The public part learned the domain-invariant features, and the private part maintained the independent features of each domain, as shown in Fig. 13 [42]. Ghifary et al. designed two networks: a classification model for the source data and a reconstruction network for the target data. The two networks were trained at the same time, encoding the data on a shared network to learn the common features [43]. By separating the space in such a way, the shared features are not influenced by the individual features of each domain, so that better DA ability can be achieved. To realize knowledge transfer between two domains with large distribution differences, a selective learning algorithm was built, which defined an intermediate domain to transfer the information flow from the source domain to the target domain through reconstruction, and a regularization term was added to control the data selection for transfer [44]. Furthermore, a stacked local constraint autoencoder method was used to learn domain-invariant features via back propagation and a low-dimensional manifold. To measure the importance of each neuron in aligning the domain differences, this method calculated the weighted sum of neighboring data on the defined manifold [45]. To further guarantee the separation effect and promote the completeness and uniqueness of the learned features, a hybrid generative network comprising encoder, decoder, classifier, and separation modules has attracted wide attention, in which the separation module learns the units that are irrelevant to classification [46].

In summary, reconstruction technology in DA methods can be seen as an auxiliary task that separates the shared and private characteristics between domains, so that only the shared characteristics are transferred and the negative transfer caused by private characteristics is avoided. However, because the decoder is trained by minimizing the reconstruction error rather than by fooling a discriminator as in GAN, it may have more difficulty generating sharp new images. Combining the autoencoder with the advantages of other models (such as GAN and dictionary models) can achieve better performance.

Fig. 13 Detailed framework of DSN.42

Owing to the unique advantages of reconstruction technology and GAN, deep DA methods that combine the two have attracted the attention of researchers. In these methods, dual-learning mechanisms (one for primal tasks and the other for dual tasks) can teach each other via reinforcement learning and the reconstruction error [47]. Inspired by the dual-learning mechanism, an adversarial loss was learned to bring the distributions of the two domains close, while an inverse mapping was defined to reconstruct the domain: it maps the generated images back to the source domain under a cycle consistency loss to learn the mapping relationship between the input and the output in the aligned dataset, as shown in Fig. 14 [48]. Dual-learning was further improved to minimize the losses of the two GANs and the reconstruction errors to complete the transformation between domains, which can solve the problem that neither the source domain nor the target domain has labels [49]. To ensure a one-to-one mapping relationship, a combination of a standard GAN and a GAN with reconstruction loss was developed to learn the relation between domains [50]. Although the reconstruction-based GAN method combines the advantages of the two algorithms, it also has some problems. The reconstruction error addresses the authenticity of the generated samples but also greatly reduces their diversity, which is a contradiction. In addition, the complex structure makes parameter updating difficult.
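
The cycle consistency loss mentioned above penalizes the difference between a sample and its round trip through the forward and inverse mappings. The NumPy sketch below uses simple linear maps as hypothetical stand-ins for the two generators; the L1 form of the loss is a common choice and is assumed here.

```python
import numpy as np

def cycle_consistency_loss(x, forward_map, backward_map):
    """Mean absolute error between x and backward_map(forward_map(x))."""
    return np.mean(np.abs(backward_map(forward_map(x)) - x))

rng = np.random.default_rng(0)
a = rng.normal(size=(3, 3)) + 3.0 * np.eye(3)   # hypothetical source-to-target mapping
a_inv = np.linalg.inv(a)

def to_target(x):
    return x @ a

def back_exact(y):
    return y @ a_inv                             # inverse mapping that undoes to_target

def back_poor(y):
    return y                                     # inverse mapping that ignores to_target

x_source = rng.normal(size=(50, 3))
loss_exact = cycle_consistency_loss(x_source, to_target, back_exact)
loss_poor = cycle_consistency_loss(x_source, to_target, back_poor)
```

The loss is essentially zero when the inverse mapping really undoes the forward one and grows otherwise, which is what ties the two generators together during training.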

4. Research and application of deep DA in fault diagnosis

Thus far, previous studies can be divided into two stages according to the diagnosis framework. In the first stage, the framework mainly comprises data collection, feature extraction, and fault diagnosis (Fig. 15(a)). Under this framework, much effort has gone into the extraction and selection of manual features, which ultimately depends on a wide range of expertise.51,52 In addition, the designed features are always aimed at specific tasks, so their adaptability is limited when the diagnostic problem is new or the physical characteristics of the original system differ. Furthermore, the final diagnosis performance is usually sensitive to the model parameters, which means that an additional parameter optimization process is needed.53,54 To solve these problems, a diagnosis framework of automatic feature learning based on DL was developed in the second stage, which provides an end-to-end learning process from the input data to the output diagnosis label, as shown in Fig. 15(b).55 DL shows strong feature learning and fitting ability, which has significantly promoted the application of artificial intelligence in fault diagnosis. However, domain differences are ignored in most of the developed methods. In actual industrial diagnosis tasks, fault data are limited, and the training data usually come from the experimental environment or from the historical data of related equipment. Because of the complex application scenarios of equipment, real-time data may follow different feature distributions. Therefore, research on cross-domain fault diagnosis has important practical significance.

In recent years, deep DA has shown wide application prospects. We have reviewed its current development in Section 3. Related algorithms, such as the discrepancy-based, adversarial-based, and reconstruction-based methods, have gradually been designed for image detection tasks. However, in the field of intelligent fault diagnosis, the application of deep DA is rarely considered as a way to enhance the applicability and flexibility of the diagnosis framework for tasks in different fields,56-58 as shown in Fig. 15(c). In this survey, the research advances of deep DA for fault detection of rotating machinery components are described below and divided into three categories.

Fig. 14 Detailed framework of reconfiguration technology-based GAN.48

Fig. 15 Intelligent fault diagnosis framework. (a) Stage I,51,52 (b) Stage II55 and (c) New one.56

4.1. Discrepancy-based method for fault diagnosis

In the actual working environment of rotating machinery, it is difficult to obtain usable diagnosis information, which creates a bottleneck in diagnosis. Therefore, many researchers have carried out exploratory studies on deep fault diagnosis of rotating machinery based on deep DA. The deep DA method used in DL measures the difference between the source domain and the target domain on the corresponding feature layers of two networks. The measurement has two goals. The first is to measure the similarity of the two domains, not only telling us qualitatively whether they are similar but also giving the degree of similarity quantitatively. The second is to increase the similarity between the two domains, based on this measurement, via the chosen machine learning method, so as to complete transfer learning. Therefore, finding the similarity of the two domains is the core; the next step is how to measure and use it.

Because a kernel mapping is introduced into MMD to improve its computational and memory efficiency, MMD has become one of the most commonly used transformation methods for cross-domain fault diagnosis. Single-layer MMD in deep fault diagnosis has achieved a certain degree of cross-domain feature extraction.59-65 As shown in Fig. 16, MMD was introduced into the adaptation layer after the feature extraction layers to measure the feature distribution difference between the source and target domains. The difference was then added to the network loss for training, which can be explained as

Fig. 17 shows the diagnosis accuracy of DL and the proposed deep DA trained on different data samples. All trials of deep DA achieve an accuracy greater than 95%, and the average accuracy is 97.96%, both higher than traditional DL.62 Further, some variants of MMD have been developed and applied to fault diagnosis. Li et al. utilized CMD to build a DA CNN for fault testing.66 Zhang et al. applied MMD to sparse filtering by mapping the source and target data into a kernel space to obtain more transferable features; to reduce the domain discrepancy, the L1-norm and L2-norm were applied to MMD to obtain a sparse distribution of the domains.67 Zhang et al. combined the maximum variance discrepancy with MMD for feature matching.68 Wang et al. evaluated the distribution distance of the corresponding domains in a high-dimensional kernel space.69 Deng et al. developed an order spectrum transfer algorithm to transform the target data to the source domain according to the fault characteristic orders.70 CORAL is used in a deep neural network in the same way as MMD, with the MMD layer replaced by CORAL. It is more powerful than DDC (which only aligns sample means) and much simpler to optimize than DAN.11 Si et al. matched the second moments and the high-order moments at the same time to achieve a new intelligent fault diagnosis approach for bearings, which combined MMD and CORAL.71 Xu et al. modified a second-order statistics fusion network based on CORAL to learn deep nonlinear domain-invariant features.72 On this basis, the high-order moments of the domain-specific distributions73 were proposed after the fully connected layers to achieve bearing fault diagnosis. However, adapting only one layer of the network leads to low transferability across multiple layers.
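To make the two single-layer statistics discussed above concrete, the following minimal NumPy sketch computes a biased estimate of the squared MMD with a Gaussian kernel and a CORAL-style covariance distance between source and target feature batches. The kernel choice, the bandwidth `sigma`, the 1/(4d^2) scaling, and the function names are assumptions for illustration; the cited works may use different kernels and estimators.

```python
import numpy as np

def gaussian_kernel(a, b, sigma=1.0):
    """Gaussian kernel matrix between the rows of a and the rows of b."""
    sq = (np.sum(a**2, axis=1)[:, None] + np.sum(b**2, axis=1)[None, :]
          - 2.0 * a @ b.T)
    return np.exp(-np.maximum(sq, 0.0) / (2.0 * sigma**2))

def mmd2(xs, xt, sigma=1.0):
    """Biased estimate of the squared MMD between source and target features."""
    return float(gaussian_kernel(xs, xs, sigma).mean()
                 + gaussian_kernel(xt, xt, sigma).mean()
                 - 2.0 * gaussian_kernel(xs, xt, sigma).mean())

def coral_loss(xs, xt):
    """Squared Frobenius distance between feature covariances, scaled by 1/(4 d^2)."""
    d = xs.shape[1]  # assumes at least two feature dimensions
    cs = np.cov(xs, rowvar=False)
    ct = np.cov(xt, rowvar=False)
    return float(np.sum((cs - ct) ** 2) / (4.0 * d * d))

def adaptation_loss(classification_loss, xs, xt, lam=1.0):
    """Classification loss plus a weighted MMD penalty on one adaptation layer."""
    return classification_loss + lam * mmd2(xs, xt)
```

Either statistic can serve as the penalty on the adaptation layer; replacing `mmd2` with `coral_loss` in `adaptation_loss` gives the CORAL variant.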
By extending single-layer DA to multi-layer DA, An et al.,74 Zhu et al.75 and Che et al.76 proposed models based on multi-layer MK-MMD for defect diagnosis, which realize domain alignment in the last two feature extraction layers of the network. Zhang et al.77 and Li et al.78 developed intelligent fault diagnosis approaches for bearings that reduce the distribution discrepancy of the learned transferable features based on multi-layer MMD. Zhang et al. designed a two-stage method to obtain strong fault-discriminative and domain-invariant performance: pre-training the feature extractor on the labeled source data and then fine-tuning it on the target data. Li et al. developed different kernel MMDs to construct multiple deep transfer networks.79 Xiong et al. adopted CMD to minimize the domain discrepancy at each dense block inside the network.80 Most existing DA methods based on MMD are applicable to vectors only. To adapt the source and target tensor representations directly, a set of alignment matrices without vectorization was presented to align the tensor representations of the two domains into an invariant tensor subspace.81 All the above methods perform well in their respective domain scenarios, but it is not clear which layer or layers of a deep network need DA. Hoffman et al. observed that the first several layers are independent of classification and the last layers are independent of domain, so the first several layers were used to extract the source and target features, and these features were then fused in the later layers for classification.82 All the above studies assume that the conditional distributions of the source and target domains are the same and that the marginal distributions are different.
That is, only the distance between the hidden-layer features of the source and target domains is considered. Sometimes it is better to measure the difference of the joint distribution, because the source classifier cannot be applied to the target data directly. To make the approach more general, a joint adaptation network combined with MMD was proposed,13 which minimized the joint distribution difference by simultaneously learning class and feature invariants between domains. The corresponding loss function is expressed as

Fig. 16 Detailed framework of DL with MMD.62

Fig. 17 Comparison results of DL with MMD.62

where CXS,1:|L|(RS) is the joint distribution of the source domain computed via the tensor product of feature spaces, and CXT,1:|L|(RT) is the joint distribution of the target domain. Pseudo label learning is one of the most commonly used methods to adjust the conditional distribution, and it is usually combined with statistical transformation to adjust the joint distribution between domains. For example, MMD and pseudo labels were used to adjust the marginal and conditional distributions to achieve multi-representation adaptation;83,84 pseudo label learning combined with a similarity-guided constraint can reduce the distance between similar images and increase the distance between dissimilar images;85 and pseudo label learning combined with a domain discriminative loss based on CORAL can align the joint distribution in a typical two-stream network framework for DA (J-DCDA), as shown in Fig. 18.86 On this basis, Han et al. developed a deep transfer network based on a joint distribution adaptation (JDA) approach, which presented smooth convergence and avoided negative adaptation in comparison with marginal distribution adaptation. Across different detection tasks and methods for rotating machinery, deep JDA achieves the highest accuracy, as shown in Fig. 19.56 Tong et al. realized JDA via MMD and pseudo test labels obtained from a nearest-neighbor classifier in the feature space, which extracted transferable features for testing under different environments.87,88 Yang et al. proposed regularization terms of multi-layer DA and pseudo label learning on the parameter set of a domain-shared CNN, which can learn transferable and discriminative features.89 Qian et al. developed an improved JDA, which conducted dimension reduction on high-dimensional inputs and used softmax regression to generate more accurate pseudo labels.90 Further, MK-MMD combined with pseudo test labels has been developed.91 Wang et al. aligned the conditional distributions of multi-scale high-level features, extracted through a multi-scale feature extractor, via MMD.92 Wu et al. introduced the grey wolf optimization algorithm to adaptively learn the key parameters of joint distribution adaptation.93

A series of joint distribution adaptation approaches have been developed according to the characteristics of the data distribution, and different distribution differences lead to different combinations of methods. In practical applications, the proportions of the two distributions often differ according to the data distribution characteristics between domains; the marginal and conditional distributions are not equally important in some tasks. To evaluate the domain differences quantitatively, a balance factor is employed to weight the two distributions, and it needs to be set according to the data distribution.94 On this basis, a multi-kernel dynamic distribution adaptation method (MDDA) based on DL was proposed to achieve cross-machine fault diagnosis; it defined a mixed kernel function to map different domains to a unified space and dynamically evaluated the importance of the marginal and conditional distributions,95 as shown in Fig. 20. Moreover, for bearing fault diagnosis, MDDA solves the problem that DDA needs to obtain some cross-characteristics of different domains in advance. Fig. 21 shows the diagnosis results of different methods under different levels of additional noise, and MDDA shows the best test performance. The quantitative estimation of the marginal and conditional distributions is of great significance to the study of DA, and applying different probability distribution adaptation methods to DL is a trend. However, because DL has multiple feature layers, it is not clear which layer needs domain adaptation. At the same time, an inappropriate distance metric or transformation method produces a poor transfer result. The details of the statistical transformation method in rotating machinery are summarized in Table 3.56,62,66-81,87-95
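The interplay of pseudo labels, marginal and conditional discrepancies, and the balance factor can be sketched as follows. This is a minimal illustration under stated assumptions: the discrepancy is a linear-kernel MMD (distance between feature means), pseudo labels come from a nearest-class-centroid rule, and the weighting takes the form (1 - mu) * marginal + mu * conditional. The function names and the exact weighting form are hypothetical and not taken from the cited methods.

```python
import numpy as np

def linear_mmd2(a, b):
    """Squared distance between feature means (MMD with a linear kernel)."""
    return float(np.sum((a.mean(axis=0) - b.mean(axis=0)) ** 2))

def pseudo_labels(xs, ys, xt):
    """Assign each target sample the label of the nearest source class centroid."""
    classes = np.unique(ys)
    centroids = np.stack([xs[ys == c].mean(axis=0) for c in classes])
    dist = ((xt[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
    return classes[np.argmin(dist, axis=1)]

def dynamic_distance(xs, ys, xt, yt_pseudo, mu):
    """Weighted sum of marginal and class-conditional discrepancies.

    mu = 0 uses the marginal term only; mu = 1 uses the conditional term only.
    """
    marginal = linear_mmd2(xs, xt)
    conditional = 0.0
    for c in np.unique(ys):
        s, t = xs[ys == c], xt[yt_pseudo == c]
        if len(s) and len(t):  # skip classes with no pseudo-labeled target samples
            conditional += linear_mmd2(s, t)
    return (1.0 - mu) * marginal + mu * conditional

# Toy example: the target is the source shifted by one unit in each feature.
xs = np.array([[0.0, 0.0], [0.0, 0.0], [4.0, 4.0], [4.0, 4.0]])
ys = np.array([0, 0, 1, 1])
xt = xs + 1.0
yt_hat = pseudo_labels(xs, ys, xt)
```

In an iterative scheme the pseudo labels would be refreshed after each adaptation step, which is the source of the extra training time associated with pseudo label learning.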

Fig. 18 Detailed framework of J-DCDA method.86

Fig. 19 Detailed framework and comparison results of DL with JDA.56

Fig. 20 Detailed framework of MDDA.95

Fig. 21 Comparison results of MDDA.95

The BN method has performed well in fault diagnosis. As shown in Fig. 22,96 for bearing fault detection, BN was employed after the feature layers (WDCNN) to align the mean and variance of the two domains, which achieved high-performance testing under different conditions. Fig. 23 shows the comparison results of different methods under six domain shifts: FFT-SVM performs poorly in domain adaptation, while MLP and DNN perform better, both achieving roughly 80% accuracy. In contrast, the proposed WDCNN method is much more precise than the compared algorithms, achieving 90.0% accuracy on average and demonstrating the domain invariance of BN. BN used in a stacked autoencoder and a DNN has also achieved high-precision bearing fault diagnosis by fine-tuning and modifying the network.97 Further, Hu et al. utilized an adaptive BN based on the exponential moving average, instead of the common BN, to self-adapt to the traits of different domains.98 Wu et al. proposed an adaptive logarithm normalization to preprocess the data distribution.99 The structure of the BN method is simple: it only adjusts the statistics of the BN layer and has no additional parameters to tune. However, this kind of global information weakens the particularity between domains in some cases, and sometimes neurons are not effective for all domains because of domain biases. It is useless for such neurons to capture other features and clutter.
In view of this situation, Zhang et al. designed a CNN with kernel dropout for fault diagnosis of bearings under different operating conditions.100 Shen et al. selected valid source channels via the cross Minkowski distance and fused the target channels via a two-order selective ensemble.101 Further, the weight coefficient optimization method has been widely studied because it pays more attention to shared features and does not completely discard feature information. Fine-tuning partial weights can interleave the DL layers in the adaptor with those in the base to improve the domain perceptual sensitivity.102 The weight coefficient method was combined with statistical transformation to build an attention-aware weighted distance method103 and an adaptive weighted complement entropy104 to learn the discriminative features in different domains, which encouraged incorrect classes to receive uniform and low scores in the process of DA. A novel fault diagnosis network based on a modified TrAdaBoost method for weight coefficient optimization was used to test small target data under different operating conditions and fault types, and it achieved higher accuracy.105 By combining the swarm optimization method and L2 regularization to optimize the weights, the DL network ensured diagnosis stability under different conditions.106 The detection results exceed 90%, much higher than those of the traditional methods. However, the algorithm also has shortcomings: if the source samples are noisy at the beginning and the number of iterations is not well controlled, training the classifier becomes more difficult. Both the BN method and the weight coefficient optimization method have performed well in fault diagnosis.
When the source data and the target data have a lot in common, the weight coefficient optimization method can achieve good results. It adjusts the architecture of DL to improve the ability to learn more shared features, which resolves the ambiguity between shared and private features but does not completely eliminate the influence of the private part. The details of the structure optimization method in rotating machinery are listed in Table 4.96-101,105,106
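The BN-based adaptation described above, in which only the layer statistics change between domains, can be sketched as follows. This is a simplified illustration assuming a single BN layer with shared scale and shift parameters; `bn_forward`, `domain_stats`, and `ema_update` are hypothetical helper names, and the momentum value is an arbitrary choice.

```python
import numpy as np

def bn_forward(x, mean, var, gamma=1.0, beta=0.0, eps=1e-5):
    """Normalize features with the given statistics, then scale and shift."""
    return gamma * (x - mean) / np.sqrt(var + eps) + beta

def domain_stats(x):
    """Per-feature mean and variance estimated from one domain."""
    return x.mean(axis=0), x.var(axis=0)

def ema_update(running, batch_value, momentum=0.9):
    """Exponential moving average of a BN statistic over mini-batches."""
    return momentum * running + (1.0 - momentum) * batch_value

# Source and target features with different means and scales (synthetic data).
rng = np.random.default_rng(0)
x_src = rng.normal(0.0, 1.0, size=(200, 3))
x_tgt = rng.normal(5.0, 3.0, size=(200, 3))

# gamma and beta stay shared; only the statistics are re-estimated per domain.
z_src = bn_forward(x_src, *domain_stats(x_src))
z_tgt = bn_forward(x_tgt, *domain_stats(x_tgt))
```

After normalization with their own statistics, both domains have approximately zero mean and unit variance per feature, so the layers that follow see aligned first and second moments without any additional trainable parameters.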

The Grassmann manifold and the geodesic method are widely used in the geometric transformation method. For example, Rui et al. extended the sampling geodesic flow via smooth polynomial functions described by splines on the Grassmann manifold.107 Shaw et al. used the Grassmannian manifold to exploit the parameter space structure for DA via subspace estimation.108 Li et al. minimized the domain difference while keeping the discriminative information, finding new representations in a common subspace via the conjugate gradient method on the Grassmann manifold.109 Thopalli et al. presented a multiple subspace alignment method, which used low-dimensional subspaces to represent the datasets and exploited the natural geometry of the subspaces on the Grassmann manifold to align domains.110 Hua et al. proposed a new progressive data augmentation method for DA, which generated a series of intermediate virtual domains via interpolation to achieve multiple subspace alignment.111 Inspired by the intermediate representations, the source and target data of rotating machinery are projected into intermediate subspaces along the shortest geodesic path connecting the two d-dimensional subspaces on the Grassmann manifold, which reduces the domain differences via the intermediate subspaces connecting the source domain and the target domain.57,58,112 To address the fact that normal data far outnumber fault data in actual diagnosis, a novel DA model based on the geodesic flow kernel was further improved to realize fault diagnosis when there are only normal data in the target domain.58 To make the model more adaptable to the target domain, a cross-domain stacked denoising autoencoder network was built, which introduced MMD and manifold-regularized fine-tuning for cross-domain and cross-task fault diagnosis. Fig. 25 shows the framework and Fig. 24 shows the corresponding results; the method performs well on both gear detection and rolling bearing detection under different tasks.113 To eliminate feature distortions during distribution alignment, an adaptive factor based on the A-distance was proposed to dynamically adjust the influence of the conditional and marginal distributions; the manifold embedded distribution alignment based on the adaptive factor was then applied in a new transferable fault diagnosis method to obtain transformed representations.114 These approaches take into account the specific geometric structure of the source and target domains to describe the nature of the data, which effectively reduces the domain differences, although they face a major bottleneck in their high computational complexity. The details of the geometric transformation method in rotating machinery are listed in Table 5.58,113,114
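The geometric quantities behind these methods can be illustrated with a short NumPy sketch. It builds d-dimensional PCA subspaces for two domains, computes the principal angles between them (whose norm is the geodesic distance on the Grassmann manifold), and applies a plain subspace alignment. Subspace alignment is used here as a simpler stand-in for the geodesic flow kernel, and all function names are hypothetical.

```python
import numpy as np

def pca_basis(x, d):
    """Orthonormal basis (n_features x d) of the top-d principal subspace."""
    xc = x - x.mean(axis=0)
    _, _, vt = np.linalg.svd(xc, full_matrices=False)
    return vt[:d].T

def principal_angles(ps, pt):
    """Principal angles between two subspaces given by orthonormal bases."""
    # The singular values of Ps^T Pt are the cosines of the principal angles.
    cosines = np.linalg.svd(ps.T @ pt, compute_uv=False)
    return np.arccos(np.clip(cosines, -1.0, 1.0))

def geodesic_distance(ps, pt):
    """Length of the shortest geodesic between the two subspaces."""
    return float(np.linalg.norm(principal_angles(ps, pt)))

def align_source(xs, ps, pt):
    """Project centered source data onto its basis and rotate toward the target basis."""
    return (xs - xs.mean(axis=0)) @ ps @ (ps.T @ pt)
```

The target data would be projected with `(xt - xt.mean(axis=0)) @ pt`, after which a classifier trained on the aligned source features can be applied. The SVD calls are the main cost, consistent with the computational bottleneck noted above.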

Table 3 Application of statistic transformation method for rotating machinery.

Fig. 22 Detailed framework of DL with BN.96

Fig. 23 Comparison results of DL with BN.96

It can be concluded from Table 3,56,62,66-81,87-95 Table 4,96-101,105,106 and Table 5 58,113,114 that the discrepancy-based method has been widely used for detecting unlabeled fault data in practical applications, and the comparison results show satisfactory performance. However, the marginal distribution obtained by the hidden layers may be destroyed by nonlinear mapping, which weakens the DA performance. In addition, although most of the existing deep DA methods achieve domain adaptation by adjusting the marginal distribution, they still lack sufficient domain adaptation ability. Matching both the marginal distribution and the conditional distribution helps to achieve better domain adaptation. When there are no labeled data in the target domain, the pseudo label learning method is used to align the conditional distribution. This method needs to update the network parameters iteratively to obtain satisfactory performance, which increases the training time. In summary, mastering the distribution discrepancy characteristics of different domains is very important for achieving highly accurate and efficient fault diagnosis. Future research on the discrepancy-based method will continue. For the statistical transformation method, how to find an appropriate difference measure and transformation function effectively will be the key research area. For the structure optimization method, it is necessary to eliminate the influence of the private part while keeping the feature information intact. For the geometric transformation method, the key is to further normalize the intermediate step size and the transformation path. In addition, joint distribution adaptation is often used to improve the performance of DA, but it lacks a systematic evaluation mechanism for the marginal and conditional distributions.
How to achieve high-performance DA with an appropriate discrepancy-based method while ensuring suitable contributions from both the marginal distribution and the conditional distribution is also an important research task.

Table 4 Application of structure optimization method for rotating machinery.

Fig. 24 Results of DL with manifold learning.113

4.2. Adversarial-based method for fault diagnosis

Based on the theory of adversarial training, the generative adversarial net (GAN) was proposed, which consists of two parts. One is the generator (G), which is responsible for generating samples that are as realistic as possible. The other is the discriminator (D), which judges whether the generated samples are real. The generated samples are encouraged to pass as true samples so that the discriminator cannot distinguish the differences between the two domains.

Table 5 Application of geometric transformation method for rotating machinery.

Fig. 25 Structure of DL with manifold learning.113

Fig. 26 Two-stage model based on GADA.115

Fig. 27 Comparison results of two-stage model based on GADA.115

Fig. 28 Detailed framework of WD-DTL.121

Adversarial-based methods are developed by using GAN to align the domain distributions. They have been successfully used in mechanical fault diagnosis because of their strong flexibility and robustness. For the GADA method in fault diagnosis, a two-stage model training approach was proposed, which generated data for different classes in the target domain via different generators in the first stage and trained the cross-domain classifier in the second stage (Fig. 26).115 Fig. 27 shows the comparison results of different methods under four cross-domain fault diagnosis tasks. The proposed method achieves the best testing diagnosis accuracy under the different tasks, and all the testing results are above 92%, which demonstrates the superiority of the GADA fault diagnosis framework. To avoid losing information in GADA, a cycle-consistent GAN was further designed for fault diagnosis DA, which generated new data based on known conditions via two circular mappings in GAN and pre-trained a classifier for testing raw fault diagnosis data under various conditions.116 To solve the data imbalance problem in rolling bearing fault diagnosis, Li et al. proposed a unified framework incorporating a predictive generative denoising autoencoder and a deep CORAL network, in which the generative model is used to generate extra fault data and the deep CORAL network is used for fault recognition via correlation alignment.117 To fully utilize the information of the labeled source data, Wu et al. designed a BN long short-term memory model to learn the mapping between two domains and generate auxiliary data; a transfer maximum classifier discrepancy method was then applied via adversarial training to align the probability distributions of the generated auxiliary data and the unlabeled target data.118

Fig. 29 Comparison results of WD-DTL.121

For the non-GADA method, the feature extraction part of the network can be regarded as a generator. In transfer learning, the process of generating samples can sometimes be avoided because a source domain and a target domain already exist; the data in one of the domains (usually the target data) can be seen as the generated samples. As one of the commonly used unsupervised deep DA algorithms, the non-GADA method has been integrated into a unified framework for fault diagnosis,119 and it has been continually developed by using a domain discriminator to encourage feature extractors to learn domain-independent features.120 Based on the idea of adversarial training, a new Wasserstein distance-based deep transfer learning network (WD-DTL) for fault diagnosis tasks was proposed, as shown in Fig. 28, which learned domain features with the feature extractor and minimized the distribution distance between domains via adversarial training. Fig. 29 shows the comparison results of different methods under different fault diagnosis tasks.121 The testing accuracy of WD-DTL increases from 59.47% to 64% as the number of samples increases, and the accuracies of WD-DTL are all higher than those of DAN and CNN. However, the results also reveal a limitation of WD-DTL: the improvement is limited when the differences are large. Further, Li et al. used an adversarial training scheme to realize marginal domain fusion for different bearing working conditions.122,123 Guo et al. built a domain classifier and a distribution discrepancy metric based on MMD to learn domain-invariant features.124 Jiao et al. presented a double-level adversarial DA for fault diagnosis, which achieved domain-level alignment via adversarial training between the feature extractor and the domain discriminator, and class-level alignment via adversarial training based on the Wasserstein discrepancy between the feature extractor and two classifiers.125 Jiao et al. further designed a one-dimensional residual network for adaptive feature learning, which used JMMD and an adversarial discriminator to eliminate the conditional and marginal distribution differences.126 To further realize class-level alignment between domains, Li et al. created an adversarial multi-classifier method for fault diagnosis, which exploited the overfitting phenomena of different classifiers via adversarial training to extract domain-invariant features and build cross-domain classifiers.127 To obtain more valid data for conditional distribution alignment, Zhang et al. proposed a statistical distribution recalibration method of soft labels (SDRS); SDRS and a center distance metric were then used to reduce the distribution differences of fault data via adversarial learning.128 To reduce negative transfer, Li et al. proposed a class-weighted adversarial network that attaches class-level weights to the source domain to encourage positive transfer of the shared features and ignore the source outliers.129 To improve domain generalization, Li et al. used a domain augmentation method to expand the available data, which used domain adversarial training and distance metric learning to learn generalized features of different domains and enhanced robustness by scaling the temporal vibration data horizontally.130,131 To enhance the adaptability of auxiliary data, Li et al. used the joint distribution of labeled auxiliary data and unlabeled target data via domain-adversarial training, which improved the performance of transfer and classification.132 To solve the problems of high training cost and low testing accuracy of traditional deep DA, such as DDC and RevGrad, Qin et al. proposed a parameter-sharing adversarial DA model, which built a shared classifier and a domain classifier to reduce the complexity of the model structure and added CORAL to enhance the domain confusion via unbalanced adversarial training.133 To handle unsupervised cross-domain fault diagnosis tasks, Zhao et al. developed an improved joint distribution adaptation model via adversarial learning, which calculated the value of the joint discrepancy more accurately and without any approximation.134

The application details of the adversarial-based method in fault diagnosis are shown in Table 6.115-118,120-130,132-134 The research and application of GAN in the DA field are of great significance for practical fault diagnosis. Because GAN does not require models that follow any particular factorization, any generator and discriminator can be used. Therefore, GADA is a very flexible design framework, and various types of loss functions can be integrated into the GAN model. When the probability density cannot be calculated in DA, some generative models that rely on the natural interpretation of data cannot be learned and applied. GADA can still be used in this case because it introduces a clever training mechanism of internal confrontation, which can approximate objective functions that are not easy to calculate. Although great progress has been made, the interpretability of the generative model is poor because its distribution is not expressed explicitly, and training is difficult because the discriminator and the generator need to be well synchronized. Further, the generator easily degenerates because of the collapse problem, while the discriminator points similar samples in similar directions, so the training cannot continue. Non-GADA does not rely on the generator to produce identically distributed samples and keeps high sample diversity. In addition, for both GADA and non-GADA, the discriminant network in effect provides an adaptive loss according to the task and data, which is more robust than a non-adversarial network with a fixed loss. However, adversarial training needs to reach a Nash equilibrium, which gradient descent achieves only sometimes, so the training is unstable compared with traditional training methods.
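The idea that the discriminator supplies an adaptive loss can be illustrated with a deliberately small example. The sketch below assumes a linear (logistic) domain discriminator trained by plain gradient descent on fixed feature batches; in a full non-GADA model the feature extractor would be updated in the opposite direction (for example through a gradient reversal layer) so as to increase this loss. All names and hyperparameters are hypothetical.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def domain_loss(w, b, f_src, f_tgt, eps=1e-12):
    """Binary cross-entropy of a linear domain discriminator (source=1, target=0)."""
    p_src = sigmoid(f_src @ w + b)
    p_tgt = sigmoid(f_tgt @ w + b)
    return float(-0.5 * (np.log(p_src + eps).mean()
                         + np.log(1.0 - p_tgt + eps).mean()))

def train_discriminator(f_src, f_tgt, steps=200, lr=0.1):
    """Fit the discriminator by gradient descent on the loss above."""
    w, b = np.zeros(f_src.shape[1]), 0.0
    for _ in range(steps):
        p_src = sigmoid(f_src @ w + b)
        p_tgt = sigmoid(f_tgt @ w + b)
        grad_w = 0.5 * (((p_src - 1.0)[:, None] * f_src).mean(axis=0)
                        + (p_tgt[:, None] * f_tgt).mean(axis=0))
        grad_b = 0.5 * ((p_src - 1.0).mean() + p_tgt.mean())
        w, b = w - lr * grad_w, b - lr * grad_b
    return w, b

# Synthetic features: a shifted target is easy to tell apart from the source.
rng = np.random.default_rng(0)
f_src = rng.normal(0.0, 1.0, size=(100, 2))
f_shifted = rng.normal(3.0, 1.0, size=(100, 2))
w, b = train_discriminator(f_src, f_shifted)
loss_shifted = domain_loss(w, b, f_src, f_shifted)

# Identical features give the discriminator nothing to learn: loss stays at ln 2.
w0, b0 = train_discriminator(f_src, f_src)
loss_aligned = domain_loss(w0, b0, f_src, f_src)
```

A discriminator loss close to ln 2 indicates that the two feature distributions are indistinguishable to this discriminator, which is the state the feature extractor tries to reach; the instability noted above comes from the two players being updated against each other.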

Table 6 Application of adversarial-based method for rotating machinery.

4.3. Reconstruction-based method for fault diagnosis

Fig. 30 Detailed framework of sparse stacked denoising autoencoder with MMD.136

Fig. 31 Comparison results of sparse stacked denoising autoencoder with MMD.136

Fig. 32 Detailed framework of CatAAE.144

The reconstruction of the source data or the target data can improve the performance of DA, and it has gradually been applied successfully in fault diagnosis. To achieve better performance, the aforementioned methods can be used together. For example, considering the reconstruction ability of data, MMD was combined with the autoencoder structure to minimize the domain discrepancy between the training and testing features, which improved the accuracy of fault diagnosis.135,136 Fig. 30 shows a sparse stacked denoising autoencoder structure with MMD,136 which sent the target data directly to the model trained on the source data in the fine-tuning process. Therefore, the model pre-trained on the source domain can be applied directly to the target data without retraining. Comparison results of the proposed reconstruction-based method and other methods are shown in Fig. 31.136 This comparison shows the superiority of the proposed DA method. Inspired by the shared encoder structure, a shared dictionary matrix that combined two regularization terms on a common low-dimensional subspace was proposed to learn the knowledge from the source and target domains, which aims at reducing large differences between domains.137 To diagnose gearbox faults effectively with very few training data, He et al. presented a new deep transfer multi-wavelet autoencoder, which designed a new type of multi-wavelet autoencoder based on a multi-wavelet activation function and utilized a similarity measure to select high-quality auxiliary data to train a source model containing the features shared with the target domain.138 Peng et al. proposed a new DA model with a sparse autoencoder and a CNN, which drastically reduced the risk of negative transfer by using the instantaneous rotating speed information of the target domain in the training process.139 Li et al. utilized an autoencoder model to project the features of different equipment into the same subspace and adopted a cross-machine adaptation algorithm based on MMD for knowledge generalization, which minimized the distribution discrepancy between domains.140 Cao et al. developed a soft JMMD based on class weight bias to reduce the marginal and conditional distribution differences of the extracted features via a reconstruction-based unsupervised learning strategy. Its conflict with classification tasks is low, which is more conducive to transfer learning.141 Synthetic fault data may not follow the true fault data distribution or may over-exploit the small amount of available data, which results in bias or overfitting. What is more, the value of the abundant normal data, which carry essential information for fault discrimination, has not been well explored. To address these issues, Lu et al. developed a new two-stage transferable common feature space mining method for fault diagnosis. A weakly supervised DA convolutional autoencoder with MMD was used to learn the shared features underlying multi-domain data in the first stage, and the common feature net was then combined with a unique feature net to construct a dual-channel feature extraction and comparison model in the second stage, which can mine both the transferable shared features and the unique features of different faults.142
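The combination of a reconstruction objective with a discrepancy penalty on the latent features, as used by several of the methods above, can be sketched as follows. This is a minimal illustration assuming a one-layer tied-weight autoencoder with a tanh encoder and a linear-kernel MMD (distance between latent means); the architecture, the weighting `lam`, and the function names are assumptions, not the configuration of any cited work.

```python
import numpy as np

def encode(x, w):
    """Shared encoder: one dense layer with tanh activation."""
    return np.tanh(x @ w)

def decode(h, w):
    """Decoder with tied weights (an assumption made to keep the sketch short)."""
    return h @ w.T

def reconstruction_da_loss(xs, xt, w, lam=1.0):
    """Reconstruction error on both domains plus a latent discrepancy penalty.

    Returns (total, reconstruction, discrepancy) so the terms can be inspected.
    """
    hs, ht = encode(xs, w), encode(xt, w)
    recon = np.mean((decode(hs, w) - xs) ** 2) + np.mean((decode(ht, w) - xt) ** 2)
    shift = np.sum((hs.mean(axis=0) - ht.mean(axis=0)) ** 2)  # linear-kernel MMD
    return float(recon + lam * shift), float(recon), float(shift)
```

Minimizing the total with respect to `w` pushes the shared encoder toward features that both reconstruct each domain and have matching latent means. A classifier trained on the source latent features would be added as a further term in a complete model.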

Fig. 33 Comparison results of CatAAE.144

Table 7 Application of reconstruction-based method for rotating machinery.

Inspired by GAN, more and more researchers have embedded the idea of GAN in reconstruction-based methods. For the reconstruction-based GAN method in fault diagnosis, a framework combining GAN and the stacked denoising autoencoder was developed to perform cross-domain fault diagnosis tasks.143 To adjust the conditional distribution at the same time, a categorical adversarial autoencoder (CatAAE) was proposed, including an encoder, a decoder, and a discriminator, to impose a prior class distribution and obtain a satisfactory unsupervised clustering result. As shown in Fig. 32, the encoder was treated as the generator that produces latent vectors (fake samples), which can be confused with random samples (true samples) by a discriminator.144 Fig. 33 shows the corresponding comparison results, which show good intra-class compactness and inter-class difference compared with other methods. In addition, the deep stack autoencoder145 and the variational autoencoder146 have been successfully used in cross-domain fault identification by combining them with a domain discriminator. The application details of reconstruction-based methods in fault diagnosis are shown in Table 7.135-146 Deep DA methods based on reconstruction have received wide attention, because they can effectively separate the shared and private features between domains without relying on explicit functions and avoid negative transfer. However, their powerful flexibility also brings some problems. In particular, reconstruction ignores the private features of the target domain, and the model is hard to train under adversarial training.
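The adversarial part of an autoencoder of this kind can be sketched with two loss terms. The code below is a generic adversarial-autoencoder illustration in plain Python, not the CatAAE formulation itself: the function names are hypothetical, the discriminator is represented only by its output probabilities, and the categorical prior and the reconstruction term are omitted.

```python
import math

def bce(prob, label):
    """Binary cross-entropy for one predicted probability and a 0/1 label."""
    eps = 1e-12
    p = min(max(prob, eps), 1.0 - eps)  # clamp to avoid log(0)
    return -(label * math.log(p) + (1 - label) * math.log(1.0 - p))

def discriminator_loss(d_prior, d_latent):
    """Discriminator objective.

    d_prior: discriminator outputs on samples drawn from the prior (true samples).
    d_latent: discriminator outputs on encoder latent vectors (fake samples).
    """
    real = sum(bce(p, 1) for p in d_prior) / len(d_prior)
    fake = sum(bce(p, 0) for p in d_latent) / len(d_latent)
    return real + fake

def encoder_adversarial_loss(d_latent):
    """Encoder (generator) objective: make latent vectors be scored as true."""
    return sum(bce(p, 1) for p in d_latent) / len(d_latent)
```

Training would alternate between lowering `discriminator_loss` with respect to the discriminator and lowering `encoder_adversarial_loss` with respect to the encoder, alongside the usual reconstruction loss for the encoder and decoder.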

Fig. 34 Data distribution in different domains.

5. Multi-source DA method (MDA)

In practice, there may be more than one labeled dataset. Unlike the general single-source DA problem, MDA can collect supervised data from multiple source domains with different distributions, which is of great significance in practice. For example, data can sometimes be obtained in multiple domains. A very natural idea is to combine these data into one data set to train the model. However, as shown in Fig. 34, because the distribution of each domain is different, such a processing method cannot provide sufficient data, and sometimes it may even have a negative impact on the model. Therefore, MDA has attracted more attention from both academia and industry. In essence, the purpose of DA with a single source domain or with multiple source domains is to align the features of the target domain with the source domain, so single-source DA methods can also be applied to MDA. On this basis, Rebuffi et al. proposed a residual adaptive module to compress multiple domains and share substantial parameters between domains to achieve multiple-domain learning.147 Mancini et al. associated the source domain to a latent domain to find multiple latent domains and introduced specific domain alignment layers based on BN to learn variables.148 Zhao et al. proposed a novel multi-source distilling DA method, which mapped the target data to the feature space of each source by minimizing the empirical variance and selected the source data via the domain weight computed from the difference between each source domain and the target domain.149 Peng et al. used moment matching to align the multiple source domains with the target domain, in which the error bound was proposed in the framework of cross-moment divergence.150 Li et al. presented a new multi-source DA method based on a mutual learning network, including a branch conditional adversarial DA network trained on the target domain and a single source domain, and a guidance conditional adversarial DA network trained on the target domain and the multi-source domain. The two networks were aligned with each other to realize mutual learning, as shown in Fig. 35.151 Lin et al. introduced a multi-source sentiment GAN to find a unified sentiment latent shared space to handle multi-source data.152 It can be seen from the above research that MDA methods combining different deep DA methods have been designed, but many implementation details remain open, such as how to align the target domain with multiple source domains, whether feature extractors are shared, how to select more relevant sources, and how to combine multiple predictions from different classifiers.
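As an illustration of the moment-matching idea mentioned above, the sketch below compares the first- and second-order moments of several source feature sets with a target feature set and with each other. It is only loosely modeled on moment matching for multi-source DA: the function names, the use of raw (uncentered) moments, and the averaging scheme are assumptions, not the formulation of the cited work.

```python
import math
from itertools import combinations

def kth_moment(features, k):
    """Element-wise k-th raw moment of a batch of feature vectors."""
    n = len(features)
    dim = len(features[0])
    return [sum(f[j] ** k for f in features) / n for j in range(dim)]

def moment_gap(a, b, orders=(1, 2)):
    """Sum over the given orders of the Euclidean distance between moments."""
    gap = 0.0
    for k in orders:
        ma, mb = kth_moment(a, k), kth_moment(b, k)
        gap += math.sqrt(sum((u - v) ** 2 for u, v in zip(ma, mb)))
    return gap

def moment_distance(sources, target, orders=(1, 2)):
    """Average source-target gap plus average source-source gap."""
    src_tgt = sum(moment_gap(s, target, orders) for s in sources) / len(sources)
    pairs = list(combinations(sources, 2))
    src_src = (sum(moment_gap(a, b, orders) for a, b in pairs) / len(pairs)
               if pairs else 0.0)
    return src_tgt + src_src
```

Including the source-source term reflects one possible design choice: the sources are pulled toward each other as well as toward the target, so that a single shared feature space can serve all domains.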

Fig. 35 MDA with GAN.151

Fig. 36 MDA with GAN for fault diagnosis.153

Multi-source domains often exist in actual industry, which has led to the wide application of MDA in fault diagnosis. A fault diagnosis approach based on the multi-source domain has been successfully designed, which used local Fisher discriminant analysis to learn discriminant directions from multimodal data by preserving the intra-class local structure and utilized the Karcher mean to compute the mean source subspace on the Grassmann manifold to assist the target task.58 Furthermore, to achieve more flexible DA, GAN is introduced into the training process of MDA. As shown in Fig. 36, an MDA framework based on GAN was developed to extract general features with discriminative information about different equipment and machine health conditions, which transferred diagnostic knowledge learned from multi-source rotating machines to the target machine via adversarial training.153 Fig. 37 shows the superiority of MDA by comparison with other methods under different training sizes. In general, the adversarial MDA method consists of a shared feature extractor, a multi-classifier, and a domain discriminator, in which the parameters are updated via a cross-entropy loss, a domain alignment loss, and a domain classifier alignment loss that together minimize the distribution differences for all domains.154 A multi-source DA framework was presented to transfer the knowledge from multiple labeled source domains into a single unlabeled target domain by aligning feature-level and task-specific distributions based on the sliced Wasserstein discrepancy, which can be easily turned into a single-source DA problem and readily extended to unsupervised DA in other fields.155 Wang et al. developed an MDA method with different weights applied to different source domains for machinery fault classification. The weights are assigned under different working conditions based on the distributional similarities of the source domains to the target domain.156
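The idea of weighting source domains by their similarity to the target can be sketched as follows. This is an illustrative example only, not the method of the cited work: the discrepancy used here (distance between feature means), the softmax weighting, the `temperature` parameter, and all function names are assumptions.

```python
import math

def mean_gap(source, target):
    """Hypothetical discrepancy: Euclidean distance between feature means."""
    dim = len(target[0])
    mu_s = [sum(x[j] for x in source) / len(source) for j in range(dim)]
    mu_t = [sum(x[j] for x in target) / len(target) for j in range(dim)]
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(mu_s, mu_t)))

def source_weights(sources, target, temperature=1.0):
    """Softmax over negative discrepancies: closer sources get larger weights."""
    scores = [-mean_gap(s, target) / temperature for s in sources]
    top = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def combine_predictions(per_source_probs, weights):
    """Weighted average of class-probability vectors from per-source classifiers."""
    n_classes = len(per_source_probs[0])
    return [sum(w * p[c] for w, p in zip(weights, per_source_probs))
            for c in range(n_classes)]
```

For example, with one source whose features lie close to the target and another far away, `source_weights` would assign most of the weight to the first, and `combine_predictions` would then follow that source's classifier more closely.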

The application details of MDA in fault diagnosis are shown in Table 8.153,155,156 MDA is a powerful development for exploring more available information, which can utilize supervised data from multiple sources with different distributions.157 However, how to choose the most relevant data in each source domain automatically and adaptively is still a key problem.

6. Discussion

As a kind of transfer learning, DA fits the setting where the source data labels are available and the target data labels are unavailable, which is commonly seen in practical detection environments in various areas. The broad application prospects of deep DA have been recognized in different research fields.56-58 Some DA algorithms have gradually been designed for detection, and they can be categorized into four branches: the discrepancy-based DA method, adversarial-based DA method, reconstruction-based method, and multi-source DA method. Table 9 shows the characteristics of different deep DA methods.

In the discrepancy-based method, the distance metric functions represented by MMD and CORAL are easy to implement, need no additional parameters, and are computationally efficient.8,11 After combining them with DL, all network parameters are adjusted by back propagation to minimize the domain differences.9,20,21 The BN method realizes DA by optimizing the network structure, which only adjusts the parameters of the BN layer and does not produce additional parameters.15 The geometric transformation method represented by manifold transformation can well describe the characteristics of data, especially its specific geometric structure.17 All discrepancy-based methods describe the distribution characteristics of domains well. However, when the distribution characteristics cannot be calculated in DA, some models that rely on the natural interpretation of data cannot be learned and applied. In this case, the adversarial-based method was developed. It can adaptively fit the objective function according to specific tasks and data to learn the distribution characteristics via an internal adversarial training mechanism, which yields more robust detection results.17 But the discriminator is usually implemented as a part of the network, which adds new learnable parameters. To avoid negative transfer caused by private features, the reconstruction-based method was proposed, which focuses on separating shared and private features between domains and can be seen as an auxiliary task in DL.158 However, it is more difficult to generate a sharp new image, because the decoder is trained to minimize the reconstruction error rather than to fool the discriminator as in GAN. By combining the reconstruction method with GAN, the advantages of both algorithms and powerful flexibility can be obtained. The reconstruction error addresses the authenticity of the generated samples but also greatly reduces their diversity, which is a contradiction. Further, to collect supervised data from multiple source domains, MDA, which can be combined with the above methods, has received attention as a way to explore more available information.
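The claim that discrepancy metrics such as CORAL need no additional parameters can be seen from a short sketch. The code below computes a CORAL-style loss as the squared Frobenius norm of the difference between the source and target feature covariance matrices, scaled by 1/(4d^2), in plain Python. The function names are illustrative, and the unbiased (n - 1) covariance estimate is an assumption.

```python
def covariance(features):
    """Sample covariance matrix (unbiased, n - 1) of a batch of feature vectors."""
    n = len(features)
    d = len(features[0])
    mean = [sum(f[j] for f in features) / n for j in range(d)]
    centered = [[f[j] - mean[j] for j in range(d)] for f in features]
    return [[sum(c[i] * c[j] for c in centered) / (n - 1) for j in range(d)]
            for i in range(d)]

def coral_loss(source, target):
    """Squared Frobenius distance between covariances, scaled by 1 / (4 d^2)."""
    d = len(source[0])
    cs, ct = covariance(source), covariance(target)
    frob_sq = sum((cs[i][j] - ct[i][j]) ** 2
                  for i in range(d) for j in range(d))
    return frob_sq / (4.0 * d * d)
```

The loss depends only on the features themselves, so it can be attached to any layer of a network as an extra term without introducing trainable parameters. It is invariant to a constant shift of the features, which means it aligns second-order statistics but not the means.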

Fig. 37 Comparison results of MDA with GAN.153

Table 8 Application of MDA method for rotating machinery.

Deep DA-based methods have performed well in reducing cross-domain discrepancy and have gradually been used to solve the domain shift problem in intelligent fault diagnosis. In this survey, we have comprehensively reviewed the present development of deep DA in rotating machinery fault detection and diagnosis. In recent years, only a few studies have considered the application of DA to enhance the applicability and flexibility of rotating machinery fault detection and diagnosis tasks across different domains.56-58 Nevertheless, deep DA is still expected to be more widely used for fault diagnosis of rotating machinery in the future because of its broad capabilities. From the application of DA in different detection fields, it can be seen that the harder problems for DA are far from being solved (Table 9). Based on the introduction and analysis of the literature on deep DA-based fault diagnosis models published in different periods, we summarize the main problems and future research directions. More studies need to focus on these difficult problems to achieve higher-performance fault diagnosis in rotating machinery:

Table 9 Summary for deep DA method.

(1) As we have seen, unsupervised deep DA has been successfully applied to fault diagnosis, but it relies on the classification performance on the source data. However, if the marginal distributions are significantly different, optimizing the source classification loss and the divergence between domains will actually increase the target error.94,95 How to better align the label distributions without target labels is an important problem.

(2) In the discrepancy-based method, the transfer results depend on the metric and transformation method used for the distribution discrepancy (such as MMD,77 CMD,78 BN,97 and the Grassmann manifold113,114), and the transfer result on the target domain in diagnosis models will be poor if the measured distance cannot correctly describe the discrepancy. Therefore, how to design a metric and transformation method suited to the specific data is one of the necessary research directions in DA. Further, it is not clear which layer needs to be measured.82

(3) For deep DA based on GAN, the best generalization ability and feature matching in the source domain are expected to produce good generalization ability in the target domain. However, if XS ∩ XT = ∅ and the feature mapping space is complex enough, the distributions of different categories in the source and the target domains can be mismatched in the feature space while the two optimization objectives of DA are still attained simultaneously: 1) the error rate of the classifier in the source domain is as small as possible; 2) in the feature space, the feature distributions in the source and the target domains are identical. In this case, the model can be mistaken for being well trained. Therefore, how to add alignment constraints is one of the necessary research directions in GAN. Further, the interpretability of the generative model is poor, because its distribution is not expressed explicitly, and it is hard to train.

(4) The reconstruction-based method needs to pay more attention to the decoder to generate better images. Further, it ignores the uniqueness of the target domain in the process of classification, so how to mine more information in the feature space is important. MDA lacks the ability to process labeled source data of various modalities, such as vibration data, acoustic data, and image data. It is necessary to find a way to integrate different data modalities in the multi-source DA method.157 Further, how to choose the most relevant data automatically and adaptively in each source domain is still a key problem.

(5) For all deep DA methods, there is no hyperparameter tuning method for specific tasks. Most of them depend on empirical values or cross-validation, which is time-consuming and labor-intensive. It is necessary to design a uniform hyperparameter tuning method so that other methods can get similar benefits from this tuning; sometimes, due to the domain transformation in fault detection, it is also necessary to further optimize the parameters to achieve the bidirectional network. In addition, in the literature analyzed above, different detection methods are compared to confirm the superiority of the respective DA methods. However, the DA methods proposed in different papers are not compared on the same data set, so it is difficult to make a direct comparison between them. More experiments are needed to compare these methods. Similarly, other promising methods may be superior to DA methods on some data sets, which can be evaluated by additional experiments.
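The failure mode described in item (3) can be illustrated with a toy one-dimensional example. The data, the threshold classifier, and the function names below are all hypothetical; the example only shows that zero source error and identical marginal feature distributions do not rule out a complete mismatch between categories.

```python
def classify(z):
    """Source-trained threshold classifier on a 1-D feature (toy assumption)."""
    return 1 if z > 0.0 else 0

def error_rate(features, labels):
    """Fraction of samples whose predicted class differs from the label."""
    wrong = sum(classify(z) != y for z, y in zip(features, labels))
    return wrong / len(features)

# Source features: class 0 lies on the negative side, class 1 on the positive side.
src_z = [-1.2, -0.8, 0.8, 1.2]
src_y = [0, 0, 1, 1]

# Target features after alignment: the same set of values overall,
# but the two classes have been mapped to the opposite sides.
tgt_z = [1.2, 0.8, -0.8, -1.2]
tgt_y = [0, 0, 1, 1]

marginals_match = sorted(src_z) == sorted(tgt_z)
source_error = error_rate(src_z, src_y)
target_error = error_rate(tgt_z, tgt_y)
print(marginals_match, source_error, target_error)
```

Both optimization objectives are met here (the source error is zero and the marginal feature distributions coincide), yet every target sample is misclassified. Without target labels this cannot be detected from the two objectives alone, which is why additional alignment constraints on the class-conditional distributions are of interest.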

Based on the above analyses, Fig. 38 summarizes several possible directions for future research on deep DA methods, including designing metric and transformation methods for specific domain difference measurement to achieve more accurate testing, studying alignment constraint methods in GAN for more effective transfer, mining more feature information for better training, developing DA methods with multi-mode data fusion to extract more information, building a general adaptive hyperparameter tuning module for application to a wider range of tasks, and conducting further experimental comparisons between DA methods to clarify which aspects of a DA method are responsible for performance gains and to find more effective combinations of DA methods. Furthermore, designing more realistic label distribution fitting methods in unsupervised environments, visualizing data distribution characteristics in adversarial-based DA methods to enhance their interpretability, and developing adaptive selection methods for relevant data in multi-source DA are also expected to be better implemented in future research.

Fig. 38 Future development of DA.

7. Conclusions

In this survey, we focused on reviewing deep DA technology and its application in fault diagnosis from the aspects of the discrepancy-based method, adversarial-based method, reconstruction-based method, and multi-source DA method. For the discrepancy-based method, statistic transformation, structure optimization, and geometric transformation are introduced, and the application technologies in these three operations are analyzed. The adversarial-based method is divided into two categories, GADA and non-GADA, depending on whether there are additional generators; in non-GADA, the feature extraction structure is usually regarded as the generator. The reconstruction-based method is introduced from encoder-decoder and GAN structures, which can ensure the invariance of features during the transfer. For labeled datasets of multiple domains, MDA is analyzed based on the above methods. Finally, based on the analysis of deep DA methods in these references, we summarized the current problems and suggested possible future work on deep DA methods for intelligent fault diagnosis.

Declaration of Competing Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Acknowledgements

This work is supported by the National Natural Science Foundation of China (Grant Nos. 52175096, 51775243, and 11902124), the fellowship of China Postdoctoral Science Foundation (Grant No. 2021T140279) and 111 Project (Grant No. B18027).