A VIBRATION RECOGNITION METHOD BASED ON DEEP LEARNING AND SIGNAL PROCESSING
CHENG Zhi-gang , LIAO Wen-jie , CHEN Xing-yu , LU Xin-zheng
(1. Civil Engineering Department, Tsinghua University, Beijing 100084, China; 2. Beijing Engineering Research Center of Steel and Concrete Composite Structures, Tsinghua University, Beijing 100084, China; 3. Key Laboratory of Civil Engineering Safety and Durability of China Education Ministry, Tsinghua University, Beijing 100084, China)
Abstract: Effective vibration recognition can improve the performance of vibration control and structural damage detection and places high demands on signal processing and advanced classification. Signal-processing methods can extract the salient time-frequency-domain characteristics of signals; however, the performance of conventional characteristics-based classification needs to be improved. Widely used deep learning algorithms (e.g., convolutional neural networks (CNNs)) can conduct classification by extracting high-dimensional data features, with outstanding performance. Hence, combining the advantages of signal processing and deep-learning algorithms can significantly enhance vibration recognition performance. A novel vibration recognition method based on signal processing and deep neural networks is proposed herein. First, environmental vibration signals are collected; then, signal processing is conducted to obtain the coefficient matrices of the time-frequency-domain characteristics using three typical algorithms: the wavelet transform, Hilbert-Huang transform, and Mel frequency cepstral coefficient extraction method. Subsequently, CNNs, long short-term memory (LSTM) networks, and combined deep CNN-LSTM networks are trained for vibration recognition, according to the time-frequency-domain characteristics. Finally, the performance of the trained deep neural networks is evaluated and validated. The results confirm the effectiveness of the proposed vibration recognition method combining signal preprocessing and deep learning.
Key words: vibration recognition; signal processing; time-frequency-domain characteristics; convolutional neural network (CNN); long short-term memory (LSTM) network
Vibrations can significantly degrade product quality in industrial manufacturing and occupant comfort in buildings. Additionally, vibration recognition can be beneficial in damage detection for mechanical and civil-engineering structures. Previous studies indicated that effective vibration recognition could improve the performance of vibration control[1], seismic damage prediction[2-3], and structural damage detection[4]. Moreover, accuracy is a critical factor in vibration recognition[1,5]. Hence, the development of an accurate feature-extraction and recognition method is imperative[6].
Traditional vibration recognition methods generally require complex algorithms to extract features, resulting in low performance and efficiency. Deep-learning algorithms have developed rapidly in recent years and have achieved excellent performance in object recognition based on computer vision[7] owing to their ability to extract in-depth data features[8]. Among them, convolutional neural networks (CNNs) are highly invariant to the translation, scaling, and distortion of images[9] and are widely utilized in image recognition[10]. Furthermore, long short-term memory (LSTM) networks can accurately identify time-series data and avoid vanishing- and exploding-gradient problems[11]; they are well suited for natural language processing[12]. Therefore, CNNs and LSTM networks exhibit considerable potential for vibration recognition.
CNN and LSTM training generally requires a large amount of input data, whereas the number of available vibration samples is limited. To overcome this data-quantity limitation, preprocessing the data and extracting data features to improve the training performance are essential[13]. Three classical signal-processing methods are useful for extracting the time-frequency-domain characteristics of vibrations and assisting the training of deep neural networks: the wavelet transform (WT), the Hilbert-Huang transform (HHT), and Mel frequency cepstral coefficient (MFCC) extraction. The WT is widely used for processing non-stationary signals via multi-scale analysis[14-15]. The HHT combines adaptive time-frequency analysis (i.e., empirical mode decomposition (EMD)) with the Hilbert transform to obtain signal characteristics[16-17]. The MFCC extraction method characterizes signal features according to the known variation of the human ear's critical frequency bandwidth[18] and is widely utilized in speech recognition. Previous studies combining signal processing and deep-learning algorithms have proven the applicability of this approach. For example, Zhang et al. (2020)[19] and Zeng (2016)[20] utilized a CNN, ensemble EMD, and a continuous WT for mechanical fault diagnosis. Lu et al. (2020)[1] adopted a WT and a CNN to recognize the micro-vibrations induced by vehicles and construction.
Thus, in the present study, different signal-processing methods and deep neural networks were combined to improve the vibration recognition accuracy, and their performance was compared to identify the optimal combination. First, environmental vibrations were collected. Subsequently, three feature-extraction algorithms (WT, HHT, and MFCC extraction) and three deep neural networks (CNN, LSTM, and CNN-LSTM) were adopted; additionally, comparisons and parametric studies were performed. The results indicated that the combination of the CNN and MFCC extraction is optimal for vibration recognition.
1 Method framework
The framework of the proposed vibration recognition method is presented in Figure 1. The critical parts are the feature extraction and the training of the deep neural networks. The specific steps are as follows:
1) Signal acquisition and data cleaning: Four types of typical vibration signals are collected, and the types of vibration sources are set as labels. Subsequently, the collected data are cleaned to identify valid data and establish the vibration datasets, according to a short-term energy analysis.
2) Extraction of vibration time-frequency-domain characteristics: Three signal-processing methods (WT, HHT, and MFCC extraction) are adopted, and the datasets are established with the obtained coefficient matrices of the time-frequency-domain features.
3) Vibration recognition based on deep-learning algorithms: The datasets from steps 1) and 2) are adopted to train the CNN, LSTM, and CNN-LSTM models for vibration recognition.
4) Evaluation and application of deep neural networks: The CNN, LSTM, and CNN-LSTM models trained in step 3) are evaluated. Subsequently, the qualified models can be adopted for (micro) vibration control, damage detection, fault identification, etc.
To further demonstrate the proposed framework, details of each step are introduced in Sections 2 to 5, taking micro-vibration control as a potential scenario.
2 Data collection and cleaning
The main purpose of micro-vibration control is to reduce the negative influence of environmental vibrations on product quality during industrial production. The potential environmental vibration sources mainly include construction activities, subways, and trucks[1]. In this work, bus-, construction-, subway-, and high-speed rail (HSR)-induced vibrations were collected using acceleration sensors placed on hard, flat ground approximately 10 m-50 m from the vibration source (Figure 2(a)). An Orient Smart Testing-941B (V) accelerometer with a sampling resolution of approximately 10⁻⁴ g and a sampling frequency of 256 Hz was used.
Fig.1 Framework of the novel vibration recognition method
Figure 2 shows partial time-series data for the four types of vibration signals. The subway- and bus-induced vibrations exhibited obvious peak portions, whereas the HSR- and construction-induced vibrations generally maintained a stable or regular distribution over short durations.
The collected raw vibrations primarily comprised significant environmental vibrations and slight sensor white noise. However, the training datasets for the neural networks should comprise only valid vibration segments and the corresponding labels. Therefore, data cleaning was performed to remove the noise and obtain valid segments from the raw signal data. Short-term energy is a fundamental parameter of a vibration signal, characterizing the energy of each frame of the signal[21]. Because valid signals differ from noise in acceleration amplitude (equivalently, in short-term energy), a short-term energy analysis was performed for data cleaning.
Fig.2 Vibration collection setup and comparison of four types of collected vibration signals in the time domain
In this study, bus-induced vibrations were used to explain the data-cleaning process based on short-term energy. Figure 3(a) shows the time-domain waveforms of the raw vibration signals. After framing the data, the absolute values (or square values) of the amplitudes in each frame were summed to obtain the short-term energy, as shown in Figure 3(b) and Figure 3(c), respectively. As shown in Figure 3(c), the difference between valid signals and noise is more obvious in the square-value-summed short-term energy; thus, it was utilized to identify valid signals. Because the vibration energy was generally concentrated near the peak amplitude, the signal length was uniformly set to 15 s around the peak portion. Furthermore, different types of vibration signals necessitate different energy thresholds to eliminate the primary noise effectively, as well as to obtain sufficient data for neural-network training. The thresholds for the bus-, construction-, subway-, and HSR-induced vibrations were 150 × (10⁻² g)², 80 × (10⁻² g)², 2500 × (10⁻² g)², and 800 × (10⁻² g)², respectively. According to these thresholds, the valid vibration signals were identified; the numbers of valid signals for the four types were 476, 302, 477, and 266, respectively. As shown in Figure 3, short-term energy-based vibration interception is an effective data extraction method.
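The cleaning step can be expressed compactly. The following is a minimal sketch, assuming the acceleration record is a NumPy array in units of g; the frame length, hop size, and the `extract_valid_segment` helper name are illustrative choices not specified in the paper, while the thresholds are the source-specific values listed above.

```python
import numpy as np

def short_term_energy(signal, frame_len=256, hop=128):
    """Square-value-summed short-term energy of each frame."""
    n_frames = 1 + (len(signal) - frame_len) // hop
    return np.array([np.sum(signal[i * hop:i * hop + frame_len] ** 2)
                     for i in range(n_frames)])

def extract_valid_segment(signal, threshold, fs=256, frame_len=256, hop=128,
                          seg_len_s=15):
    """Return a 15 s window centred on the peak-energy frame, or None if the
    peak energy stays below the source-specific threshold (noise only).
    Boundary handling is simplified: windows near the record ends may be
    shorter than 15 s."""
    energy = short_term_energy(signal, frame_len, hop)
    if energy.max() < threshold:
        return None
    centre = int(np.argmax(energy)) * hop + frame_len // 2
    start = max(0, centre - seg_len_s * fs // 2)
    return signal[start:start + seg_len_s * fs]
```

For a bus-induced record in units of g, `extract_valid_segment(record, threshold=150e-4)` applies the 150 × (10⁻² g)² threshold given above.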
Fig.3 Waveform and short-term energy of the 10-min vibration signals induced by buses
3 Data processing for feature extraction
Deep neural network-based big-data analysis performs well in civil engineering[22-23]. However, obtaining sufficient high-quality data and corresponding labels in civil engineering is a critical challenge owing to the high cost of installing and maintaining sensors[24]. Hence, to address the data-quantity limitation, a method combining classic time-frequency-domain analysis with high-performance deep learning is proposed, which can achieve better recognition results when the available data are limited. In this study, three signal-processing methods (i.e., the WT, the HHT, and MFCC extraction) were adopted to extract preliminary signal features for improving the vibration recognition performance of the deep-learning algorithms.
3.1 WT
The WT was derived from the Fourier transform[25] and is widely utilized in time-frequency-domain analysis[26]. The basis function and scale are critical parameters of the WT; they determine the size of the coefficient matrices (i.e., the frequency range) and the transformation speed. Hence, parametric analyses of different basis functions and scales were performed based on the continuous WT.
The WT with the Cgau (complex Gaussian) basis function is faster and more adaptive to different transformation requirements than those with the Morlet and Mexican Hat basis functions. Therefore, the Cgau basis function was adopted. Figure 4 presents the WT results for the HSR-induced vibration based on the 4th-order and 8th-order Cgau basis functions at different scales. The results indicate that the WTs with the 4th-order and 8th-order functions performed similarly at the same scales. The processing with Cgau4 (i.e., the 4th-order Cgau basis function) was faster, satisfying the rapid-output requirement for vibration control. Furthermore, the WT speed varied with the number of scales; more scales corresponded to a lower transformation speed, higher precision, and a larger number of low-frequency features.
Fig.4 Effects of the Cgau wavelet order and scale number (HSR-induced vibration signal)
According to the foregoing analysis, considering the processing speed and extracted-feature quality, the Cgau4 wavelet basis function with 512 scales was adopted in the WT. Figure 5 presents the wavelet images of four typical vibrations, indicating that the time-frequency-domain characteristics of the four types of vibrations differed significantly.
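As a point of reference, the continuous WT with the Cgau4 basis function and 512 scales can be computed with PyWavelets; this is a sketch, and the choice of scale vector and the use of coefficient magnitudes (rather than complex values) are assumptions not fixed by the paper.

```python
import numpy as np
import pywt

fs = 256                                  # sampling frequency [Hz]
signal = np.load("vibration_1s.npy")      # hypothetical 1 s segment, shape (256,)

scales = np.arange(1, 513)                # 512 scales, as selected above
coeffs, freqs = pywt.cwt(signal, scales, "cgau4", sampling_period=1 / fs)

# coeffs is complex with shape (512, 256): 512 scales x 256 time samples.
# Taking the magnitude and transposing gives a (256, 512) matrix matching the
# (256, 512, 1) CNN input form used in Section 4.
wt_matrix = np.abs(coeffs).T
```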
3.2 HHT
The HHT is an adaptive time-frequency-domain analysis method that does not involve the selection of basis functions. It is composed of EMD and the Hilbert transform. Owing to the requirements for the intrinsic mode function (IMF) in EMD[27], the IMFs can characterize stationary and instantaneous non-stationary signals. Figure 6 presents the IMFs of the four types of vibration signals after EMD, indicating that the one-dimensional raw signals are composed of a finite number of simple IMFs with different frequencies and amplitudes. For example, Figure 6(a) shows the decomposition results for the bus-induced vibration signals, including the time-domain waveforms of the typical mode functions (IMF1, IMF2, IMF11, IMF12) and the residual term (Res). As shown in Figure 6, the amplitude and waveform differed significantly between IMF1 and IMF2 for the four vibrations. However, noise interference and slight differences in the decomposed signal frequencies may cause modal aliasing during decomposition (e.g., IMF1 and IMF2 in Figure 6(d)), reducing the recognition accuracy[16]. Additionally, because of the different characteristics of the vibration signals, the numbers of IMFs differed. The Hilbert transform was applied to the IMFs to obtain the Hilbert spectra.
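A sketch of this HHT pipeline is given below, assuming the PyEMD (EMD-signal) package for EMD and SciPy's Hilbert transform; the paper does not name its implementation, and the handling of the residual term and the spectrum binning are simplified here.

```python
import numpy as np
from PyEMD import EMD                     # pip install EMD-signal
from scipy.signal import hilbert

fs = 256
signal = np.load("vibration_1s.npy")      # hypothetical 1 s segment, shape (256,)

# Empirical mode decomposition; how the residual is separated from the last
# IMF depends on the EMD implementation used.
imfs = EMD().emd(signal)                  # shape (n_imf, 256)

# The Hilbert transform of each IMF gives its analytic signal, from which the
# instantaneous amplitude and frequency (the Hilbert spectrum) follow.
analytic = hilbert(imfs, axis=-1)
amplitude = np.abs(analytic)                                  # (n_imf, 256)
phase = np.unwrap(np.angle(analytic), axis=-1)
inst_freq = np.diff(phase, axis=-1) * fs / (2 * np.pi)        # [Hz], (n_imf, 255)
```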
3.3 MFCC extraction
Fig.5 Comparison of time-frequency-domain characteristics based on the WT ((d): typical HSR-induced vibrations)
Mel-frequency cepstrum analysis is widely utilized in speech recognition, and its nonlinear-transformation characteristic can effectively eliminate noise. The process of MFCC extraction includes preprocessing, the discrete Fourier transform, Mel-frequency wrapping, and the discrete cosine transform[21,28]. Figure 7 presents the linear and logarithmic power spectra of the vibration signals after Mel-frequency wrapping, where the decibel value at the maximum power is 0 and the decibel values are calculated relative to the maximum power; hence, the brighter areas of the images correspond to a higher vibration energy. Figure 7(a) shows the energy distributions of the bus-induced vibrations in the linear and logarithmic power spectra. The power is distributed uniformly in the linear power spectrum, whereas in the logarithmic power spectrum, the energy is concentrated in the low-frequency band, which is beneficial for signal recognition. After Mel-frequency wrapping, applying the discrete cosine transform to the logarithmic spectra yields the MFCC matrices. The matrix size is mainly determined by the frame length, frame shift, signal length, and number of bandpass Mel-scale filters, which can be adjusted according to the practical conditions.
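MFCC matrices of the kind described above can be obtained, for instance, with librosa; the frame length, frame shift, and the numbers of Mel filters and cepstral coefficients below are placeholders, since the paper states only that they are adjusted to the practical conditions.

```python
import numpy as np
import librosa

fs = 256
signal = np.load("vibration_1s.npy").astype(np.float32)   # hypothetical 1 s segment

# Frame length (n_fft), frame shift (hop_length), and the numbers of Mel
# filters and cepstral coefficients are illustrative placeholders.
mfcc = librosa.feature.mfcc(y=signal, sr=fs, n_mfcc=13,
                            n_fft=128, hop_length=64, n_mels=16, fmax=fs / 2)
# mfcc has shape (13, n_frames); transpose it for a (time, coefficient) layout.
```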
4 Deep neural network training
According to the extracted signal features, CNN and LSTM models were used to classify the vibration types. With a limited amount of training data, the depth of the network architecture should not be excessive, to avoid overfitting and low performance[29-30]. Four types of deep neural networks were adopted in this study: CNN_1C1P, CNN_2C2P, CNN_1C1P+LSTM, and LSTM (Figure 8). "C" and "P" in the CNN labels correspond to the convolutional and pooling layers, respectively, and the preceding number is the number of such layers. The LSTM network comprised two layers of LSTM units. CNN_1C1P represents the CNN model with one convolutional layer and one pooling layer; CNN_1C1P+LSTM represents a model with a layer of LSTM units after the pooling layer of CNN_1C1P. The deep neural networks were trained on a computer equipped with an AMD Ryzen Threadripper 2990WX 32-core CPU, an NVIDIA GeForce GT 710 GPU, and the Windows 10 operating system.
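For illustration, the four architectures can be sketched in Keras as follows; the layer sizes, kernel and pooling sizes, and dropout ratios are placeholders to be tuned as in the parametric studies of Sections 4.1-4.4, not the paper's final settings.

```python
from tensorflow.keras import layers, models

def cnn_1c1p(input_shape=(256, 512, 1), n_classes=4, n_filters=32,
             kernel=3, pool=3, dense_units=64, dropout=0.5):
    """CNN_1C1P: one convolutional layer and one pooling layer."""
    return models.Sequential([
        layers.Input(shape=input_shape),
        layers.Conv2D(n_filters, kernel, activation="relu"),
        layers.MaxPooling2D(pool),
        layers.Flatten(),
        layers.Dropout(dropout),
        layers.Dense(dense_units, activation="relu"),
        layers.Dense(n_classes, activation="softmax"),
    ])

# CNN_2C2P repeats the Conv2D/MaxPooling2D pair once more; CNN_1C1P+LSTM
# inserts layers.Reshape((-1, n_filters)) and an LSTM layer after the pooling
# layer of CNN_1C1P.

def lstm_net(input_shape=(256 * 512, 1), n_classes=4, units=64, dropout=0.5):
    """LSTM: two stacked layers of LSTM units. Flattened WT inputs give very
    long sequences, consistent with the slow LSTM training reported later."""
    return models.Sequential([
        layers.Input(shape=input_shape),
        layers.LSTM(units, return_sequences=True),
        layers.LSTM(units),
        layers.Dropout(dropout),
        layers.Dense(n_classes, activation="softmax"),
    ])
```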
In this study, the dataset of signal time-frequency-domain features was adopted for training. Lu et al. (2020)[1] determined that vibration recognition should be completed within the first 1.5 s of the vibration signal during vibration control tasks, to avoid a control delay. Hence, the input signals used for recognition constituted the first 1 s of the vibration signals.
Fig.6 IMFs and residuals for four types of vibrations (the first and last two IMFs are presented)
There were four types of input data: 1) raw time-series vibration signals, 2) WT coefficient matrices, 3) HHT coefficient matrices, and 4) MFCC matrices. Previous studies indicated that recognition using the coefficient matrices as inputs was faster and more accurate than recognition using images as inputs[1]; thus, the coefficient matrices were adopted as the inputs instead of the images. A comparative study on the vibration recognition performance of different neural networks and different data inputs was conducted.
Fig.7 Power spectra of four types of vibrations based on Mel-frequency cepstrum analysis
Furthermore, the architectures of the deep neural networks varied with the input data type. For example, for the CNN models, the raw time-series data were inputted in the form of (256, 2, 1), where the time and acceleration values occupied two columns; the WT coefficient matrix was inputted in the form of (256, 512, 1), where 512 and 1 denote the number of frequency scales and the channel, respectively. For the LSTM models, the raw data and the WT matrix were reshaped to (256 × 2, 1) and (256 × 512, 1), respectively. The networks using the HHT and MFCC matrices as inputs were similar to those using the WT matrix.
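A small reshaping sketch, with hypothetical arrays, shows how the same 1 s record maps to the CNN and LSTM input forms listed above.

```python
import numpy as np

fs = 256

# Hypothetical arrays: a raw 1 s two-column record (time stamp, acceleration)
# and its WT coefficient matrix.
raw = np.random.randn(fs, 2)
wt = np.random.randn(fs, 512)

# CNN inputs: append a trailing channel axis.
raw_cnn = raw.reshape(fs, 2, 1)        # (256, 2, 1)
wt_cnn = wt.reshape(fs, 512, 1)        # (256, 512, 1)

# LSTM inputs: flatten to a single sequence with one feature per time step.
raw_lstm = raw.reshape(fs * 2, 1)      # (512, 1)
wt_lstm = wt.reshape(fs * 512, 1)      # (131072, 1)
```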
4.1 Raw vibration signal input
The training results of the deep neural networks with the raw time-series signal data input are presented in Table 1. The hyperparameters of the various neural networks are presented in the "parameter value" column as i(j), k, l, m, n, and p, where i, k, and m represent the numbers of neural units in the first and second convolutional layers (or the LSTM layer) and the fully connected layer, respectively; j and l represent the sizes of units in the first convolutional layer and the pooling layer, respectively; n represents the dropout ratio; and p represents the batch size (the number of samples per batch for batch training). For the different parameter combinations, each type of network was trained for 200-400 epochs until the performance was stable. The accuracy and loss values for the last 10 epochs after training stabilization were adopted to evaluate the training performance, to avoid unreasonable results due to numerical deviation[1]. Then, the networks with the best performance were selected.
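Assuming the networks are trained with Keras `model.fit`, the last-10-epoch evaluation described above can be computed from the returned history object; the metric keys below assume `metrics=["accuracy"]` and a validation split (older Keras versions use "val_acc").

```python
import numpy as np

def stabilised_metrics(history, n_last=10):
    """Average validation accuracy and loss over the last `n_last` epochs,
    smoothing out epoch-to-epoch numerical fluctuation."""
    acc = float(np.mean(history.history["val_accuracy"][-n_last:]))
    loss = float(np.mean(history.history["val_loss"][-n_last:]))
    return acc, loss
```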
Figure 9 presents the CNN_1C1P training process over 100 epochs, indicating obvious overfitting: the test accuracy is much lower than the training accuracy. As indicated by Table 1 and Figure 9, the various neural networks could hardly achieve satisfactory results with the raw time-series signal input, possibly owing to the short signals with indistinct features and the limited number of signals. Additionally, as the number of samples per batch and the number of neural units increased, the training slowed; the LSTM layer was relatively complex, resulting in a lower training speed.
Fig.8 Typical architectures of the deep neural networks
Table 1 Training results for the raw data input
Fig.9 CNN_1C1P training process with the raw time-series data input
4.2 WT coefficient matrix input
With the WT coefficient matrix input, analyses were performed to determine suitable hyperparameters for the different deep neural networks. CNN_1C1P is taken as an example for a detailed investigation. Parametric analyses of CNN_1C1P were conducted for different batch sizes, numbers of neural units in the convolutional layer, sizes of units in the pooling layer, and dropout ratios. The results are shown in Figure 10.
Fig.10 Parameter analysis results for the CNN_1C1P network (“~” indicates the parameter analyzed)
Figure 10(a) shows the influence of the batch size on the network performance, indicating that the optimal batch size for CNN_1C1P was 84, corresponding to the maximum accuracy and minimum loss. As shown in Figure 10(b), the network performance fluctuated with an increase in the number of convolutional kernels, and the performance decreased after this quantity exceeded 256, owing to the declining generalization ability of the complex network. Figure 10(c) presents the effect of the size of the neural units in the pooling layer, revealing that the optimal size was 3, as a smaller size corresponded to a larger number of parameters trained in the neural network. Furthermore, a dropout layer was utilized to reduce the potential overfitting caused by the small amount of training data[32]. As shown in Figure 10(d), the influence of the dropout ratio was evaluated, and the optimal dropout ratio was 0.5.
The parametric adjustment process for CNN_2C2P was similar to that described above. The effects of the batch size, the dropout ratio (with a learning rate of 0.95), and the learning rate were analyzed; the remaining parameters were identical to those of the optimal CNN_1C1P.
The effects of the dropout ratio (with a batch size of 32) and the batch size (with a dropout ratio of 0.6) are presented in Figure 11(a) and Figure 11(b), respectively, indicating that the optimal parameter values were 0.6 and 64, respectively. The effects of the learning rate and dropout ratio (with a batch size of 32) are presented in Table 2, indicating that the dropout ratio should be increased to maintain satisfactory performance when the learning rate is increased. The optimal learning rate and dropout ratio were 1.0 and 0.6, respectively.
Fig.11 Parametric analysis results for the CNN_2C2P network
The parametric adjustment processes for CNN_1C1P+LSTM and LSTM were similar; only the number of neural units in the LSTM layer was adjusted. Table 3 presents the optimal performance of the four neural networks with the wavelet coefficient matrix input. As shown, with an increasing number of LSTM layers, the training speed of the neural network decreased significantly, and the classification performance deteriorated. The performance of the LSTM was inferior to that of the CNN, possibly because the short time series with weak internal data connections were unfavorable for signal feature extraction using LSTM.
Table 2 Training effects of different learning rates and dropout rates with the WT matrix input
Thus, adopting the WT coefficient matrices as the inputs helped the CNN learn the data features sufficiently and perform well; however, they were unsuitable for LSTM.
4.3 Hilbert spectrum input
In the EMD process, the number of IMFs varied for different vibration signals, causing differences in the sizes of the Hilbert spectra. Thus, the parts lacking an IMF were filled with zeros to maintain identical sizes. The size of the HHT spectrum was Nmax × d, where N represents the number of IMFs (in this study, N varied from 10 to 13, and Nmax was 13), and d represents the number of data points in 1 s (the sampling frequency was 256 Hz; thus, d was 256).
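The zero-padding step can be sketched as follows; the (n_imf, d) layout of the amplitude matrix is an assumption about how the Hilbert spectrum is stored.

```python
import numpy as np

def pad_hilbert_spectrum(amplitude, n_max=13):
    """Zero-pad an (n_imf, d) Hilbert amplitude matrix to (n_max, d) so that
    spectra with different IMF counts (10-13 here) share one input size."""
    n_imf, d = amplitude.shape
    padded = np.zeros((n_max, d), dtype=amplitude.dtype)
    padded[:n_imf] = amplitude
    return padded
```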
Compared with the foregoing parametric analysis, the training speed was higher with the Hilbert spectrum input, but the test results were less accurate, with more severe overfitting, owing to the small amount of data. Table 4 presents the training results and parameters for the neural networks with the best performance. As shown, the CNN was optimal, but its overall performance was worse than that with the WT coefficient matrix input.
Table 3 Training effects of different types of neural networks under the WT
4.4 MFCC matrix input
As the MFCC inputs had the advantage of the smallest data size for signal feature characterization, the training and testing of the neural networks with MFCC inputs were fast, with good performance. Furthermore, the LSTM layer did not significantly reduce the processing speed or accuracy, and CNN_2C2P exhibited the best performance among the networks tested (Table 5).
Table 4 Training effects of different neural networks with the Hilbert spectrum input
Table 5 Training effects of different neural networks with the MFCC matrix input
4.5 Discussion on deep neural network performance
In addition to comparing the performance of different input types and network models, this study analyzed the effects of the vibration length and the dataset splitting scheme. Table 6 lists the influence of the vibration length. The vibration recognition accuracy improved only slightly with increasing vibration length, indicating that the vibration length is not a critical factor in the recognition performance. Table 7 shows the influence of different dataset splitting schemes on the recognition accuracy. In this study, when 60% or 70% of the data were used for training, the recognition accuracy was reduced. In comparison, when 80% or 90% of the data were used for training, a reliable recognition ability was achieved. Moreover, Zhou[33] recommended that the proportion of data used for training should not exceed 80%, to avoid unstable testing performance. Thus, using 80% of the data for training is suggested when the data are sufficient.
Table 6 Training effects of different vibration lengths (based on CNN_1C1P)
Table 7 Training effects of different dataset proportions (based on CNN_1C1P)
Consequently, compared with the raw time-series data, the time-frequency characteristic data effectively improved the vibration recognition accuracy. The primary reason is that the initial time-frequency characteristics constrain the range of features extracted by the deep neural networks and guide their optimization direction. Another reason is that the time-frequency characteristic input fully exploits the CNN's big-data analysis capability by expanding the data size from (256, 1) to (256, 512, 1).
5 Validation and evaluation of models
The validation datasets were adopted for case studies, with the proportions of bus-, construction-, subway-, and HSR-induced vibration signals being 2:1:2:1. Ten signals were randomly selected from the validation set as the standard prediction set (numbered 1-10). Moreover, various signals were randomly selected and synthesized into coupled signals, to validate the recognition ability for different signals. The maximum short-term energy determined the dominant signal type in each coupled signal. The artificial coupled signals were numbered 11-15, and their dominant signals were subway-, subway-, HSR-, HSR-, and bus-induced vibration signals, respectively. Figure 12 presents the time-frequency-domain features of the signals in the prediction set. The numbers represent the prediction order; the first 1 s of the vibration signals was used as the input. The optimal CNN_1C1P, CNN_2C2P, CNN_1C1P+LSTM, and LSTM from Section 4 were utilized for the predictions.
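A hypothetical sketch of the coupled-signal construction is given below; it assumes the components are summed sample by sample and that the square-value-summed short-term energy of Section 2 decides the dominant label, which is one plausible reading of the rule stated above.

```python
import numpy as np

def peak_frame_energy(sig, frame_len=256, hop=128):
    """Maximum square-value-summed short-term energy over all frames."""
    n_frames = 1 + (len(sig) - frame_len) // hop
    return max(np.sum(sig[i * hop:i * hop + frame_len] ** 2)
               for i in range(n_frames))

def couple(sig_a, label_a, sig_b, label_b):
    """Sum two cleaned records and label the result with the component whose
    peak short-term energy is larger (the dominant vibration type)."""
    coupled = sig_a + sig_b
    dominant = label_a if peak_frame_energy(sig_a) >= peak_frame_energy(sig_b) else label_b
    return coupled, dominant
```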
The types of standard vibration signals were predicted, and the prediction accuracy and speed were examined. Table 8 presents the results, where the numbers of the vibration signals denote the prediction order, and the bolded and underlined signal types correspond to wrong predictions. The results indicate that the optimal neural networks effectively recognized the vibration type, with both the wavelet coefficient matrix and MFCC matrix inputs. In contrast, the deep neural networks performed poorly with the Hilbert spectrum input. The incorrect predictions may have been caused by the unbalanced compositions of the training datasets, in which the bus- and subway-induced vibration signals accounted for the largest amounts of data. All the predictions took < 0.2 s, and the networks with the MFCC matrix input were the fastest. The validation results were similar to the training results.
When recognizing coupled signals, it is rational to classify the coupled vibrations as the dominant type, according to the vibration energy. Table 9 presents the dominant vibrations of the coupled signals and the prediction results. The bold and underlined signal types correspond to recognition results that differed from the dominant types. The results indicate that for coupled signals 11-14, which had significant internal differences, most of the neural networks recognized them as the dominant types. The recognition results for coupled signal 15 differed significantly: owing to the superimposed effects, some networks recognized it as an HSR-induced vibration signal. Consequently, the recognition results with the MFCC matrix and wavelet coefficient matrix inputs were optimal for coupled vibrations.
Fig.12 Time-frequency-domain features of the vibration signals in the prediction datasets (“Bus”, “Cons”, “Sub”, and“HSR” correspond to bus-, construction-, subway-, and HSR-induced vibration signals, respectively)
Table 8 Average prediction accuracy and processing speed for different inputs
Table 9 Recognition results for coupled vibration
6 Conclusions
A deep learning-based vibration recognition method using signal processing and diverse deep neural networks was proposed. The proposed method performed well, and the following conclusions are drawn.
(1) Adopting a short-term energy analysis to remove irrelevant noise from the raw data and identify valid signal segments yields good performance.
(2) When the raw time-series data were used as inputs, the deep neural networks performed poorly owing to the short signal length and the small amount of training data. Three signal-processing methods (i.e., the WT, the HHT, and MFCC extraction) were used to extract time-frequency-domain features and yield the coefficient matrices that characterize the vibration signals. The extracted characteristics guided the neural networks' optimization direction and improved the recognition performance.
(3) Four types of neural networks (CNN_1C1P, CNN_2C2P, CNN_1C1P+LSTM, and LSTM) were built and trained with the coefficient matrix inputs. The four deep neural networks performed best with the MFCC matrix input, and the CNNs exhibited excellent recognition performance. Consequently, the recognition method combining MFCC extraction with the CNN was optimal, considering the evaluation results and the requirements of practical application scenarios.
Effective recognition of coupled vibration signals in complex situations will be the focus of our next study.
Acknowledgement
The authors would like to acknowledge the financial support of the Shandong Co-Innovation Center for Disaster Prevention and Mitigation of Civil Structures, the Tsinghua University "Improved Micro-Vibration Control Algorithm Research Based on Deep Learning" SRT Project (No. 2011T0017), the Initiative Scientific Research Program, and the QingYuan Space of the Department of Civil Engineering. The authors would also like to acknowledge Mr. Zhou Yuan, Mr. Zihan Wang, Ms. Peng An, Mr. Haolong Feng, and Mr. Haoran Tang for their contributions to this work.