Speech Enhancement with Nonnegative Dictionary Training and RPCA
2019-07-30
(1. First Military Representative Office of Air Force Equipment Department in Changsha, Changsha 410100, China; 2. College of Information Science and Engineering, Hebei University of Science and Technology, Shijiazhuang 050000, China; 3. College of Information Science and Engineering, Yanshan University, Qinhuangdao 066000, China; 4. Shanghai Nanhui Senior High School, Shanghai 201300, China; 5. Institute of Command and Control Engineering, Army Engineering University, Nanjing 210007, China)
Abstract: An unsupervised single-channel speech enhancement algorithm is proposed. It combines nonnegative dictionary training with Robust Principal Component Analysis (RPCA), and is therefore referred to as NRPCA for short. The combination is accomplished by incorporating a nonnegative speech dictionary, learned via Nonnegative Matrix Factorization (NMF), into the RPCA model. The resulting NRPCA optimization problem is solved with the Alternating Direction Method of Multipliers (ADMM). Objective evaluations using the Perceptual Evaluation of Speech Quality (PESQ) on TIMIT with 20 noise types at various Signal-to-Noise Ratio (SNR) levels demonstrate that the proposed NRPCA model yields superior results over the conventional NMF and RPCA methods.
Keywords: speech enhancement; robust principal component analysis; nonnegative dictionary training
Working as a pre-processor for speech recognition, speech coding, and other applications, speech enhancement has been a challenging research topic for decades [1]. It attempts to improve the quality of noisy speech and to suppress the speech distortion caused by interfering noise. Over the years, a large number of speech enhancement approaches have been proposed, generally divided into two broad classes: unsupervised and supervised [2].
Unsupervised approaches include a wide range of algorithms, such as Spectral Subtraction (SS) [3], Wiener [4] and Kalman filtering [5], Short-Time Spectral Amplitude (STSA) estimation [6], and methods based on statistical models of speech and noise signals [7]. Among them, one well-known family is based on the Robust Principal Component Analysis (RPCA) model [8] and its improved versions [9-10]. Under the hypothesis that the spectrogram of speech is sparse while that of noise is low-rank, RPCA-based approaches decompose the noisy spectrogram into a sparse component and a low-rank component, which correspond to the speech and noise parts of the noisy mixture, respectively. Under suitable constraints, this separation is accomplished by solving the RPCA optimization problem with a proper optimization algorithm. These approaches have the significant advantage that they neither explicitly model nor require any prior knowledge of the noisy speech to be enhanced, which makes them easy to apply in real-world scenarios. However, their performance degrades severely in non-stationary noisy environments.
Supervised speech enhancement algorithms [11] have been proposed to overcome this limitation and gain a competitive edge by making proper use of available prior knowledge of the speech or noise signal. In particular, Nonnegative Matrix Factorization (NMF), one of the most powerful machine learning algorithms successfully applied in signal processing, has proved to be a popular tool for supervised speech enhancement [12-13]. NMF was proposed by Lee and Seung [14]; it approximates the magnitude/power spectrogram of speech as the product of a set of nonnegative basis vectors and the corresponding activation coefficients.
Given the variety of speech enhancement algorithms, combining or fusing different methods in a proper way can be effective for better performance [15]. In this paper, an unsupervised speech enhancement algorithm is proposed by combining nonnegative dictionary training with RPCA. The proposed algorithm is unsupervised in the sense that nothing is known about the noisy mixture to be enhanced. Under these circumstances, we assume that performance can be improved if prior information that is available beforehand is exploited properly. The nonnegative speech dictionary traditionally used in NMF-based speech enhancement can serve as such a prior; it is learned from a sufficient number of training samples chosen randomly from public datasets. It should be noted that these speech training samples are unrelated to the noisy speech to be enhanced. The combination in the proposed model (NRPCA) is accomplished by incorporating the nonnegative speech dictionary into the RPCA model. To decompose the noisy spectrogram into speech and noise parts, a mathematical model for NRPCA is established under sparse and low-rank constraints. Finally, the enhancement is completed by solving the optimization problem of the NRPCA model via the Alternating Direction Method of Multipliers (ADMM) [16]. Unlike the multiplicative updates of NMF, ADMM updates the variables separately and alternately by solving the corresponding sub-problems within its framework. Detailed analysis and comparisons of the proposed NRPCA model with several state-of-the-art approaches are carried out in this paper.
1 Framework of the proposed algorithm
The proposed NRPCA-based speech enhancement scheme combines the techniques of RPCA and NMF. The overall diagram is illustrated in Fig.1, which mainly consists of a dictionary training process and an enhancement stage. In the beginning, the input noisy mixture is chopped into frames and the Fast Fourier Transform (FFT) is applied to compute the magnitude spectrogram Y and its phase ∠Y. In the training process, the nonnegative speech dictionary Ws is learned by performing NMF on the magnitude spectrograms of the speech samples used for training. In the enhancement stage, the enhanced spectrograms of speech and noise are separated via the RPCA model combined with the trained nonnegative speech dictionary Ws. Finally, the time-domain speech s and noise n are reconstructed by the inverse FFT and the overlap-add method.
Fig.1 The processing flowchart of the proposed speech enhancement algorithm
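To make the processing flow concrete, the following is a minimal sketch of the analysis/synthesis pipeline around the enhancement stage. It uses scipy's STFT routines; the function enhance and the placeholder separate_fn standing in for the NRPCA separation of Section 3 are illustrative names of ours, not the authors' implementation.

```python
# Illustrative pipeline sketch (not the authors' code): STFT analysis,
# magnitude-domain separation, and overlap-add synthesis with the noisy phase.
import numpy as np
from scipy.signal import stft, istft

def enhance(noisy, separate_fn, fs=8000, n_fft=512, hop=128):
    # Analysis: 64 ms window (512 samples at 8 kHz), 16 ms shift (128 samples).
    _, _, Y_complex = stft(noisy, fs=fs, nperseg=n_fft, noverlap=n_fft - hop)
    Y_mag, Y_phase = np.abs(Y_complex), np.angle(Y_complex)

    # Separation: split the magnitude spectrogram into speech and noise parts,
    # e.g. with the NRPCA model of Section 3 (supplied here as a callable).
    S_mag, L_mag = separate_fn(Y_mag)

    # Synthesis: reuse the noisy phase; istft performs the overlap-add internally.
    _, s_hat = istft(S_mag * np.exp(1j * Y_phase), fs=fs, nperseg=n_fft, noverlap=n_fft - hop)
    _, n_hat = istft(L_mag * np.exp(1j * Y_phase), fs=fs, nperseg=n_fft, noverlap=n_fft - hop)
    return s_hat, n_hat
```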
2 Nonnegative dictionary training via NMF
Let V denote the F×M nonnegative magnitude spectrogram of the clean training speech. NMF approximates V as the product of an F×R nonnegative speech dictionary Ws and an R×M nonnegative activation matrix Hs:
V(f,m)≈[WsHs](f,m)=∑_{r=1}^{R}Ws(f,r)Hs(r,m).
(1)
The variables f (1≤f≤F), r (1≤r≤R) and m (1≤m≤M) above are the indices of frequency bins, speech bases and frames, respectively; F, R and M denote the total numbers of frequency bins, speech bases and frames.
The factors Ws and Hs are obtained by minimizing the Euclidean reconstruction error under nonnegativity constraints:
min_{Ws≥0,Hs≥0} ‖V−WsHs‖_F^2.
(2)
After convergence, only the basis matrix Ws is retained as the nonnegative speech dictionary.
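As a concrete illustration of the dictionary training step, the following is a minimal sketch using the classical Lee-Seung multiplicative updates for the Euclidean cost in Equation (2); the function name, the normalization and the fixed iteration count are illustrative choices of ours.

```python
# Minimal sketch of nonnegative speech dictionary training via NMF
# (Lee-Seung multiplicative updates for the Euclidean cost).
import numpy as np

def train_speech_dictionary(V, R=40, n_iter=200, eps=1e-12, seed=0):
    """V: F x M nonnegative magnitude spectrogram of the training speech."""
    rng = np.random.default_rng(seed)
    F, M = V.shape
    Ws = rng.random((F, R)) + eps   # speech bases (dictionary)
    Hs = rng.random((R, M)) + eps   # activation coefficients

    for _ in range(n_iter):
        # Multiplicative updates keep Ws and Hs nonnegative by construction.
        Hs *= (Ws.T @ V) / (Ws.T @ Ws @ Hs + eps)
        Ws *= (V @ Hs.T) / (Ws @ Hs @ Hs.T + eps)

    # Normalize the bases; only Ws is kept as the speech dictionary.
    Ws /= np.linalg.norm(Ws, axis=0, keepdims=True) + eps
    return Ws
```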
3 Proposed RPCA model and its optimization problem solved via ADMM
For speech enhancement, it is common practice to assume that the clean speech is contaminated by additive uncorrelated noise. Let s(t) and l(t) represent the clean speech and the noise signal, respectively. The noisy speech y(t) can then be written as
y(t)=s(t)+l(t).
(3)
The Short-Time Fourier Transform (STFT) is applied to transform the signal in Equation (3) into the time-frequency domain:
Y=S+L.
(4)
Then, according to the sparsity and low-rank hypotheses for speech and noise, RPCA can be employed to decompose the spectrogram of the noisy speech Y into the sparse speech term S and the low-rank noise term L [18]:
min_{S,L} rank(L)+λ‖S‖_0, s.t. Y=S+L,
(5)
where λ>0 trades off the sparsity of S against the rank of L. Since Equation (5) is intractable, it is relaxed into its convex surrogate
min_{S,L} ‖L‖_*+λ‖S‖_1, s.t. Y=S+L,
(6)
where ‖·‖_* is the nuclear norm (the sum of singular values) and ‖·‖_1 is the sum of the absolute values of the matrix entries.
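For reference, a minimal numpy sketch of solving the convex RPCA problem in Equation (6) with an ADMM-style iteration is shown below; it alternates singular-value thresholding for L and soft-thresholding for S, and the parameter choices (λ, ρ) are common defaults rather than the settings used in this paper.

```python
# Minimal sketch of convex RPCA (Equation (6)) solved by an ADMM-style iteration.
import numpy as np

def soft_threshold(X, tau):
    return np.sign(X) * np.maximum(np.abs(X) - tau, 0.0)

def rpca(Y, lam=None, rho=None, n_iter=200):
    F, M = Y.shape
    lam = lam if lam is not None else 1.0 / np.sqrt(max(F, M))   # common default
    rho = rho if rho is not None else 1.25 / np.linalg.norm(Y, 2)
    S, L, Omega = np.zeros_like(Y), np.zeros_like(Y), np.zeros_like(Y)

    for _ in range(n_iter):
        # L-update: singular-value thresholding of (Y - S + Omega).
        U, sig, Vt = np.linalg.svd(Y - S + Omega, full_matrices=False)
        L = (U * np.maximum(sig - 1.0 / rho, 0.0)) @ Vt
        # S-update: elementwise soft-thresholding.
        S = soft_threshold(Y - L + Omega, lam / rho)
        # Scaled dual update for the constraint Y = S + L.
        Omega += Y - S - L
    return S, L
```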
With the assumption that better performance can be achieved if prior knowledge is utilized properly, the nonnegative speech dictionary Ws trained via NMF, which is exactly the dictionary obtained in Section 2, is incorporated into the RPCA model in the form below:
min_{Hs,L} ‖L‖_*+λ‖WsHs‖_1, s.t. Y=WsHs+L, Hs≥0.
(7)
To make each term separable for the ADMM solution, auxiliary variables S, H̃s and L̃ are introduced, which yields the equivalent constrained problem
min_{S,Hs,H̃s,L,L̃} ‖L̃‖_*+λ‖S‖_1, s.t. Y=S+L, S=WsHs, Hs=H̃s, L=L̃, H̃s≥0.
(8)
Equation (8) can be rewritten as a scaled augmented Lagrangian function using the Euclidean distance, as below:
Lρ=‖L̃‖_*+λ‖S‖_1+(ρ/2)(‖Y−S−L+ΩY‖_F^2+‖S−WsHs+ΩS‖_F^2+‖Hs−H̃s+ΩHs‖_F^2+‖L−L̃+ΩL‖_F^2), s.t. H̃s≥0,
(9)
where ΩY, ΩS, ΩHs and ΩL are the scaled dual variables and ρ is the scaling parameter that controls the convergence rate. As shown in Fig.2, the objective function in Equation (9) can be efficiently solved by the ADMM algorithm [8]. The value of λ is set in the same way as in previous research [9]. S(·) is the soft-threshold operator, and S+(·) denotes the nonnegative version of the corresponding soft-threshold operator S(·).
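For clarity, one common reading of the two elementwise operators mentioned above is sketched below; interpreting S+(·) as soft-thresholding followed by clipping at zero is an assumption on our part.

```python
# Elementwise shrinkage operators used in the ADMM sub-problems (our reading).
import numpy as np

def S(X, tau):
    """Soft-threshold operator: sign(x) * max(|x| - tau, 0)."""
    return np.sign(X) * np.maximum(np.abs(X) - tau, 0.0)

def S_plus(X, tau):
    """Nonnegative soft-threshold: max(x - tau, 0), i.e. the soft-threshold
    restricted to the nonnegative orthant (assumed form)."""
    return np.maximum(X - tau, 0.0)
```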
Fig.2 Pseudocode of the proposed algorithm
4 Experiments and results
4.1 Experiments preparations
The clean test speech, consisting of samples lasting 25 seconds spoken by 2 male and 2 female speakers, is chosen randomly from the TIMIT dataset [19]. 20 types of noise from the Noizeus-92 dataset [20] are included: babble, birds, buccaneer1, buccaneer2, casino, eatingchips, f16, factory1, factory2, hfchannel, jungle, leopard, m109, ocean, pink, rain, stream, thunder and white. The signals are mixed at five Signal-to-Noise Ratio (SNR) levels from -10 dB to 10 dB in steps of 5 dB. The nonnegative speech dictionary is learned from 1000 clean utterances produced by 20 different speakers. The numbers of speech and noise bases are 40 each. All files are sampled at 8 kHz. To compute the spectrograms, a window length of 64 ms (512 points) and a frame shift of 16 ms (128 points) are used.
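As an illustration of how the test mixtures can be generated, the sketch below scales a noise recording so that the mixture reaches a prescribed SNR; the function name is ours and the scaling follows the standard power-ratio derivation.

```python
# Mix clean speech and noise at a target SNR (dB): scale the noise so that
# 10*log10(P_speech / P_noise_scaled) equals the desired value.
import numpy as np

def mix_at_snr(speech, noise, snr_db):
    speech = speech.astype(np.float64)
    noise = noise[:len(speech)].astype(np.float64)   # truncate to the speech length
    p_speech = np.mean(speech ** 2)
    p_noise = np.mean(noise ** 2)
    scale = np.sqrt(p_speech / (p_noise * 10 ** (snr_db / 10)))
    return speech + scale * noise
```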
4.2 The chosen baselines and evaluation metrics
To compare the performance of the proposed approach, four speech enhancement algorithms are chosen as baselines: Semi-supervised NMF (SNMF), an improved version of the traditional RPCA method in the magnitude spectrogram domain [9], the well-known SS [3], and a state-of-the-art Noise Estimation (NE) algorithm [21]. The number of iterations for SNMF and RPCA is set to 200, at which convergence is observed in all experiments. To evaluate the speech enhancement performance, the widely used Perceptual Evaluation of Speech Quality (PESQ) score is adopted to measure speech quality; a higher PESQ score indicates better enhancement performance.
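For reproducibility, the PESQ score can be computed, for instance, with the open-source `pesq` Python package; this tooling and the file names below are assumptions of ours, not necessarily what was used for the reported results.

```python
# Minimal sketch: narrowband PESQ between a clean reference and an enhanced
# signal, both at 8 kHz, using the open-source `pesq` package (pip install pesq).
from scipy.io import wavfile
from pesq import pesq

fs, clean = wavfile.read("clean.wav")         # hypothetical reference file
_, enhanced = wavfile.read("enhanced.wav")    # hypothetical enhanced output

score = pesq(fs, clean, enhanced, "nb")       # 'nb' = narrowband mode for 8 kHz
print(f"PESQ = {score:.2f}")
```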
4.3 Enhancement performance of the algorithms
From the scores in Fig.3, it can be seen that combining the nonnegative speech dictionary with RPCA brings a significant improvement in PESQ, yielding higher values than the traditional semi-supervised NMF-based enhancement and the RPCA baseline. Compared with the well-known unsupervised algorithms SS and NE, the proposed algorithm shows better performance for most of the 20 considered noise types. As for the noise estimation algorithm of Ref.[21], it outperforms SS, SNMF and RPCA and achieves results competitive with the proposed algorithm.
Fig.3 The PESQ values of all the algorithms
Remarks: The numbers are the mean values over the five input SNR conditions
The numbers in Tab.1 are the average values of the five methods for the 20 noise types at different SNR levels. The proposed algorithm achieves substantially higher values than the SS, SNMF and RPCA methods at all SNR levels. Additionally, the proposed algorithm outperforms the NE algorithm at low SNR conditions (-10 dB, -5 dB, 0 dB), while it falls behind at the high SNR condition (10 dB). Moreover, when computing the improvements of the proposed method and RPCA over SS at all SNR levels, the improvement is more pronounced at low SNR conditions (-10 dB, -5 dB, 0 dB) than at high SNR conditions (5 dB, 10 dB). A possible explanation is that the speech part of the noisy mixture is sparser at low SNR conditions, which makes the RPCA model more effective. In addition, for some noise types, such as factory2, leopard and m109, the proposed algorithm does not improve greatly; the robustness differs across noise environments, and the properties of such noise may not fit the sparse and low-rank assumptions of the proposed model well.
Tab.1 Average PESQ values of all methods for 20 noise types
Remarks: 1) denotes the input SNR
5 Conclusion
This paper proposes an unsupervised speech enhancement algorithm that combines robust principal component analysis with nonnegative dictionary training. The prior knowledge carried by the nonnegative speech dictionary, trained via nonnegative matrix factorization, is incorporated into the robust principal component analysis model. The optimization problem of the resulting model is efficiently solved by the alternating direction method of multipliers. Experimental results with 20 noise types at different SNR levels demonstrate that incorporating the nonnegative speech dictionary into the RPCA model is an effective way to obtain better performance. However, compared with state-of-the-art algorithms, the proposed algorithm still leaves room for improvement at high SNR levels, which will be the focus of future research.