GTB-PPI: Predict Protein-protein Interactions Based on L1-regularized Logistic Regression and Gradient Tree Boosting

2020-09-02 Bin Yu, Cheng Chen, Hongyan Zhou, Bingqiang Liu, Qin Ma

Genomics, Proteomics & Bioinformatics, 2020, Issue 5

Bin Yu*, Cheng Chen, Hongyan Zhou, Bingqiang Liu, Qin Ma*

1 School of Life Sciences, University of Science and Technology of China, Hefei 230027, China

2 College of Mathematics and Physics, Qingdao University of Science and Technology, Qingdao 266061, China

3 Artificial Intelligence and Biomedical Big Data Research Center, Qingdao University of Science and Technology, Qingdao 266061, China

4 School of Mathematics, Shandong University, Jinan 250100, China

5 Department of Biomedical Informatics, College of Medicine, The Ohio State University, Columbus, OH 43210, USA

KEYWORDS Protein-protein interaction; Feature fusion; L1-regularized logistic regression; Gradient tree boosting; Machine learning

Abstract Protein-protein interactions (PPIs) are of great importance for understanding genetic mechanisms, delineating disease pathogenesis, and guiding drug design. With the increase of PPI data and the development of machine learning technologies, prediction and identification of PPIs have become a research hotspot in proteomics. In this study, we propose a new prediction pipeline for PPIs based on gradient tree boosting (GTB). First, the initial feature vector is extracted by fusing pseudo amino acid composition (PseAAC), pseudo position-specific scoring matrix (PsePSSM), reduced sequence and index-vectors (RSIV), and autocorrelation descriptor (AD). Second, to remove redundancy and noise, we employ L1-regularized logistic regression (L1-RLR) to select an optimal feature subset. Finally, the GTB-PPI model is constructed. Five-fold cross-validation showed that GTB-PPI achieved accuracies of 95.15% and 90.47% on the Saccharomyces cerevisiae and Helicobacter pylori datasets, respectively. In addition, GTB-PPI could be applied to predict the independent test datasets for Caenorhabditis elegans, Escherichia coli, Homo sapiens, and Mus musculus, the one-core PPI network for CD9, and the crossover PPI network for the Wnt-related signaling pathways. The results show that GTB-PPI can significantly improve the accuracy of PPI prediction. The code and datasets of GTB-PPI can be downloaded from https://github.com/QUST-AIBBDRC/GTB-PPI/.

Introduction

Knowledge of protein-protein interactions (PPIs) can help to probe the mechanisms underlying various biological processes, such as DNA replication, protein modification, and signal transduction [1,2]. The accurate understanding and analysis of PPIs can reveal multiple functions at the molecular and proteome levels, and this has become a research hotspot [3,4]. However, wet-lab identification methods suffer from incomplete coverage and false predictions [5]. Alternatively, employing reliable bioinformatics methods for PPI prediction could provide candidates for subsequent experimental validation in a cost-effective way.

Compared with structure-based methods, sequence-based methods are straightforward and do not require a priori information, and they have been widely used. Martin et al. [6] proposed the signature kernel method to extract protein sequence feature information, but they did not use physicochemical property information. Subsequently, Guo et al. [7] employed seven physicochemical properties of amino acids to predict PPIs by combining autocovariance and support vector machine (SVM).

Different feature extraction methods can complement each other, and prediction accuracy can be improved by effective feature fusion [8,9]. For instance, Du et al. [8] constructed a PPI prediction framework called DeepPPI, which employed deep neural networks as the classifier. They fused amino acid composition information-based features and physicochemical property-based sequence features. However, the presence of information redundancy, noise, and excessively high dimensionality after feature fusion can reduce classification accuracy. You et al. [10] used minimum redundancy maximum relevance (mRMR) to determine important and distinguishable features for predicting PPIs based on SVM.

Ensemble learning systems can achieve higher prediction performance than a single classifier. For example, Jia et al. [11] combined seven random forest (RF) classifiers according to voting principles. As an ensemble learning method, gradient tree boosting (GTB) has been widely applied in miRNA-disease association [12], drug-target interaction [13], and RNA-binding residue prediction [14]. In these applications, GTB has outperformed SVM and RF, showing superior model generalization performance.

Although a large number of algorithms have been proposed and developed, challenges remain for the sequence-based PPI predictors currently available. First, the sequence-only information of PPIs is not fully represented and elucidated, and satisfactory results cannot be obtained by merely adjusting individual parameters. Multi-information fusion, which fuses multiple descriptors such as pseudo amino acid composition (PseAAC) and pseudo position-specific scoring matrix (PsePSSM), is a very useful strategy and has been widely applied in PPI prediction [15], Gram-negative protein localization prediction [16], identification of submitochondrial locations [17], and apoptosis protein localization prediction [18]. Second, there is a severe data imbalance problem in PPI prediction: the number of non-interacting protein pairs is much higher than that of interacting protein pairs. Current machine learning methods do not handle such imbalanced data well, which can result in poor overall performance [19].

To overcome the aforementioned limitations of machine learning methods, this study proposes a new PPI prediction pipeline called GTB-PPI. First, we fuse PseAAC, PsePSSM, reduced sequence and index-vectors (RSIV), and autocorrelation descriptor (AD) to extract amino acid composition-based information, evolutionary information, and physicochemical information. Second, to retain effective details representing PPIs without losing important and reliable characteristic information, L1-regularized logistic regression (L1-RLR) is used, for the first time in PPI prediction, to eliminate redundant features. Third, we employ GTB as a classifier to bridge the gap between the extracted PPI features and the class label. Our data show that the PPI prediction performance of GTB is better than that of SVM, RF, Naïve Bayes (NB), and K-nearest neighbors (KNN) classifiers. The linear combination of decision trees can fit the PPI data well. When applied to network prediction, GTB-PPI obtains accuracy values of 93.75% and 95.83% for the one-core PPI network for CD9 and the crossover PPI network for the Wnt-related signaling pathways, respectively.

Method

Data source

The Saccharomyces cerevisiae PPI dataset was obtained from the Database of Interacting Proteins (DIP) (DIP: 20070219) [7]. Protein sequences consisting of < 50 amino acid residues or showing sequence identity ≥ 40% via CD-HIT [20] were removed. Thus, 5594 interacting protein pairs are considered as positive samples; 5594 protein pairs with different subcellular location information are selected as negative samples, and their location information is obtained from Swiss-Prot. The Helicobacter pylori PPI dataset, constructed previously [6], contains 2916 samples (1458 PPI pairs and 1458 non-PPI pairs).

Four independent PPI datasets [21] were also used to test the performance of GTB-PPI. These datasets are from Caenorhabditis elegans (4013 interacting pairs), Escherichia coli (6954 interacting pairs), Homo sapiens (1412 interacting pairs), and Mus musculus (313 interacting pairs). The number of unique proteins in each dataset is shown in Table S1.

Feature extraction

We fuse PseAAC, PsePSSM, RSIV, and AD to extract the PPI feature information, including sequence-based features, evolutionary information features, and physicochemical property features. Detailed descriptions of the methods are presented in File S1.

L1-RLR

L1-RLR is an embedded feature selection method. Given the sample dataset $D = \{(x_1, y_1), (x_2, y_2), \ldots, (x_l, y_l)\}$ with labels $y_i \in \{-1, +1\}$, L1-RLR can be transformed into an unconstrained optimization problem:

$$\min_{\omega} \; \|\omega\|_1 + C \sum_{i=1}^{l} \log\left(1 + e^{-y_i \omega^{T} x_i}\right) \qquad (1)$$

where $\|\cdot\|_1$ represents the L1 norm; $l$ is the number of samples; $\omega$ represents the weight coefficient; and $C$ represents the penalty term, which determines the number of selected features. We use the coordinate descent algorithm in LIBLINEAR [22] to solve Equation (1).
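To illustrate how an L1 penalty drives the weights of uninformative features to exactly zero, the following is a minimal pure-Python sketch that minimizes the L1-regularized logistic loss in its LIBLINEAR form. It is not the authors' implementation: it uses proximal gradient descent (ISTA) instead of the coordinate descent solver of LIBLINEAR, it omits the intercept, and the function names, step size, and four-sample toy data are assumptions made only for this example.

```python
import math

def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)

def soft_threshold(v, t):
    """Shrink v toward zero by t; values within [-t, t] become exactly 0."""
    return math.copysign(max(abs(v) - t, 0.0), v)

def l1_logistic_regression(X, y, C=1.0, step=0.5, n_iter=300):
    """Minimize ||w||_1 + C * sum_i log(1 + exp(-y_i * w.x_i)).

    X is a list of feature rows and y holds labels in {-1, +1}.
    The fixed step size is an assumption that suits the toy data below.
    """
    n_features = len(X[0])
    w = [0.0] * n_features
    for _ in range(n_iter):
        # Gradient of the smooth (logistic loss) part of the objective.
        grad = [0.0] * n_features
        for xi, yi in zip(X, y):
            margin = yi * sum(wj * xj for wj, xj in zip(w, xi))
            coeff = -C * yi * sigmoid(-margin)
            for j, xj in enumerate(xi):
                grad[j] += coeff * xj
        # Gradient step followed by the proximal step for the L1 norm.
        w = [soft_threshold(wj - step * gj, step) for wj, gj in zip(w, grad)]
    return w

def selected_features(w):
    """Indices of features whose weights were not shrunk to zero."""
    return [j for j, wj in enumerate(w) if wj != 0.0]

# Toy data: feature 0 equals the label, feature 1 is uncorrelated with it.
X = [[1.0, 1.0], [1.0, -1.0], [-1.0, 1.0], [-1.0, -1.0]]
y = [1, 1, -1, -1]
weights = l1_logistic_regression(X, y, C=1.0)
print(selected_features(weights))
```

On this toy data only feature 0 keeps a nonzero weight, which is the sense in which L1-RLR acts as an embedded feature selector. A smaller C strengthens the relative penalty and removes more features.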

GTB

GTB can be used to aggregate multiple decision trees [23,24]. Unlike other ensemble learning algorithms, GTB fits each new regression tree to the residuals, given by the negative gradient of the loss, at each iteration.

GTB can be expressed as the relationship between the label $y$ and the vector of input variables $x$, which are connected via a joint probability distribution $p(x, y)$. The goal of GTB is to obtain the estimated function $\hat{F}(x)$ by minimizing the expected value of the loss $L(y, F(x))$:

$$\hat{F} = \arg\min_{F} \; E_{x,y}\left[L(y, F(x))\right]$$

Let $h_m(x)$ be the $m$-th decision tree and $J_m$ the number of its leaves. The tree partitions the input space into $J_m$ disjoint regions $R_{1m}, R_{2m}, \ldots, R_{J_m m}$ and predicts a numerical value $b_{jm}$ for each region $R_{jm}$. The output of $h_m(x)$ can be described as:

$$h_m(x) = \sum_{j=1}^{J_m} b_{jm} \, I(x \in R_{jm})$$

where $I(\cdot)$ is the indicator function.

GTB can compensate for the weak learning ability of a single decision tree, thus improving the ability of representation, optimization, and generalization. GTB can capture higher-order information and is invariant to scaling of the sample data. GTB can effectively avoid overfitting through its weighted combination scheme. GTB-PPI uses the GTB algorithm of Scikit-learn [25].
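The idea of fitting each tree to the negative gradient of the loss can be sketched in a few lines of pure Python. The example below is an illustration under simplifying assumptions, not the Scikit-learn implementation used by GTB-PPI: each tree is a one-split stump on a single feature, leaf values are plain means of the residuals, and the function names, learning rate, and toy data are hypothetical.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def fit_stump(x, residuals):
    """Fit a two-leaf regression tree to the residuals by least squares."""
    best = None
    values = sorted(set(x))
    for lo, hi in zip(values, values[1:]):
        threshold = (lo + hi) / 2.0
        left = [r for xi, r in zip(x, residuals) if xi <= threshold]
        right = [r for xi, r in zip(x, residuals) if xi > threshold]
        b_left = sum(left) / len(left)
        b_right = sum(right) / len(right)
        sse = (sum((r - b_left) ** 2 for r in left)
               + sum((r - b_right) ** 2 for r in right))
        if best is None or sse < best[0]:
            best = (sse, threshold, b_left, b_right)
    return best[1:]

def stump_predict(stump, xi):
    threshold, b_left, b_right = stump
    return b_left if xi <= threshold else b_right

def boost(x, y, n_rounds=50, learning_rate=0.1):
    """Gradient boosting for labels in {0, 1} with the log loss.

    Assumes both classes are present so the initial log-odds are finite.
    """
    p0 = sum(y) / len(y)
    f0 = math.log(p0 / (1.0 - p0))
    scores = [f0] * len(x)
    stumps = []
    for _ in range(n_rounds):
        # Negative gradient of the log loss w.r.t. the score is y - p.
        residuals = [yi - sigmoid(s) for yi, s in zip(y, scores)]
        stump = fit_stump(x, residuals)
        stumps.append(stump)
        scores = [s + learning_rate * stump_predict(stump, xi)
                  for s, xi in zip(scores, x)]
    return f0, learning_rate, stumps

def predict_proba(model, xi):
    f0, learning_rate, stumps = model
    score = f0 + learning_rate * sum(stump_predict(s, xi) for s in stumps)
    return sigmoid(score)

# Toy data: the label switches from 0 to 1 between x = 2 and x = 3.
x = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0]
y = [0, 0, 0, 1, 1, 1]
model = boost(x, y)
probabilities = [predict_proba(model, xi) for xi in x]
```

The final score is the initial value plus a weighted sum of tree outputs, which is the additive form described by the leaf-value formula above. Each round moves the predicted probabilities further toward the labels.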

Performance evaluation

In the GTB-PPI pipeline, recall, precision, overall prediction accuracy (ACC), and Matthews correlation coefficient (MCC) are used to evaluate the model performance [8]. The definitions are as follows:

$$\mathrm{Recall} = \frac{TP}{TP + FN}$$

$$\mathrm{Precision} = \frac{TP}{TP + FP}$$

$$\mathrm{ACC} = \frac{TP + TN}{TP + TN + FP + FN}$$

$$\mathrm{MCC} = \frac{TP \times TN - FP \times FN}{\sqrt{(TP + FP)(TP + FN)(TN + FP)(TN + FN)}}$$

where TP is the number of PPI samples correctly predicted as interacting; TN is the number of non-PPI samples correctly predicted as non-interacting; and FP and FN are the numbers of false positives and false negatives, respectively. Receiver operating characteristic (ROC) curve [26], precision-recall (PR) curve [27], area under ROC curve (AUROC), and area under PR curve (AUPRC) are also used to evaluate the generalization ability of GTB-PPI.
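For readers who want to reproduce the four scalar metrics from confusion-matrix counts, a small helper along the following lines may be useful. It uses the standard definitions of recall, precision, ACC, and MCC; the function name and the convention of returning 0.0 for an undefined ratio are assumptions of this sketch.

```python
import math

def ppi_metrics(tp, tn, fp, fn):
    """Recall, precision, ACC, and MCC from confusion-matrix counts."""
    def ratio(numerator, denominator):
        # Assumption of this sketch: an undefined ratio is reported as 0.0.
        return numerator / denominator if denominator else 0.0

    mcc_denominator = math.sqrt(
        (tp + fp) * (tp + fn) * (tn + fp) * (tn + fn)
    )
    return {
        "recall": ratio(tp, tp + fn),
        "precision": ratio(tp, tp + fp),
        "acc": ratio(tp + tn, tp + tn + fp + fn),
        "mcc": ratio(tp * tn - fp * fn, mcc_denominator),
    }

perfect = ppi_metrics(tp=50, tn=50, fp=0, fn=0)
chance = ppi_metrics(tp=25, tn=25, fp=25, fn=25)
```

A perfect classifier scores 1.0 on all four metrics, whereas a classifier at chance level on balanced data has an ACC of 0.5 and an MCC of 0.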

Results and discussion

GTB-PPI pipeline

The pipeline of GTB-PPI for predicting PPIs is shown in Figure 1; it can be implemented using MATLAB 2014a and Python 3.6. GTB-PPI consists of five steps, described below.

Figure 1 Overall framework of GTB-PPI for PPI prediction

Figure 2 Prediction results of different parameters λ, ξ, and lag on the S. cerevisiae and H. pylori datasets

Data input

The input values of GTB-PPI are PPI samples, non-PPI samples, and the corresponding binary labels.

Feature extraction

PseAAC, PsePSSM, RSIV, and AD are fused to transform the protein character signal into a numerical signal. 1) Amino acid sequence composition and sequence order information are obtained using PseAAC to construct (20 + λ)-dimensional vectors. 2) The PSSM of the protein sequence is obtained and 20 + 20 × ξ features are extracted based on PsePSSM. 3) Feature information is extracted using RSIV according to six physicochemical properties; each protein sequence is encoded as a 120 + 77 = 197-dimensional vector. 4) The protein sequence is transformed into a (3 × 7 × lag)-dimensional vector by Moreau-Broto autocorrelation (MBA), Moran autocorrelation (MA), and Geary autocorrelation (GA). λ, ξ, and lag are the hyperparameters of GTB-PPI; their detailed meanings are given in File S1.
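The exact autocorrelation formulas are given in File S1 and are not reproduced here. As a rough illustration of how one of the three descriptors yields one value per lag, the sketch below implements a commonly used normalized form of Moreau-Broto autocorrelation, in which the products of property values at positions i and i + lag are averaged along the sequence. The function names are hypothetical, and the input is assumed to be a list of already standardized property values for one physicochemical property.

```python
def moreau_broto(values, lag):
    """Normalized Moreau-Broto autocorrelation of one property at one lag.

    Assumed definition: the mean of values[i] * values[i + lag] over
    all positions i for which both indices fall inside the sequence.
    """
    n = len(values)
    if not 1 <= lag < n:
        raise ValueError("lag must satisfy 1 <= lag < len(values)")
    products = (values[i] * values[i + lag] for i in range(n - lag))
    return sum(products) / (n - lag)

def mba_descriptor(values, max_lag):
    """One autocorrelation value for each lag from 1 to max_lag."""
    return [moreau_broto(values, lag) for lag in range(1, max_lag + 1)]

# Toy property profile for a four-residue sequence (made-up numbers).
profile = [1.0, 2.0, 3.0, 4.0]
descriptor = mba_descriptor(profile, max_lag=3)
```

Computing such a vector for each property and each of the three autocorrelation types, then concatenating the results, gives a descriptor whose length grows linearly with lag, consistent with the 3 × 7 × lag dimension stated above.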

Dimensionality reduction

L1-RLR is first employed to remove redundant features by adjusting the penalty parameters in logistic regression. The performance of L1-RLR is then compared with that of semi-supervised dimension reduction (SSDR), principal component analysis (PCA), kernel principal component analysis (KPCA), factor analysis (FA), mRMR, and conditional mutual information maximization (CMIM) on the S. cerevisiae and H. pylori datasets.

PPI prediction based on GTB

After feature extraction (step 2) and dimensionality reduction (step 3), the features retained by L1-RLR capture the details of the sequence representation more effectively. The GTB-PPI model is then constructed using GTB as the classifier.

PPI prediction on independent test datasets and network datasets

The optimal feature set representing PPIs can be obtained through feature encoding, fusion, and selection. GTB is employed to predict the binary labels on four independent test datasets and two network datasets.

Parameter optimization of PseAAC,PsePSSM,and AD

It is essential to optimize the parameters of PseAAC, PsePSSM, and AD for GTB-PPI predictor construction. We implement the hyperparameter optimization through five-fold cross-validation.
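A hyperparameter search driven by five-fold cross-validation can be sketched as follows. The sketch assumes stratified folds, which the text does not specify, and it leaves the actual model training to a user-supplied callback; all function names are hypothetical.

```python
import random

def stratified_k_fold(labels, k=5, seed=0):
    """Split sample indices into k folds with near-equal class proportions."""
    rng = random.Random(seed)
    folds = [[] for _ in range(k)]
    for cls in sorted(set(labels)):
        indices = [i for i, lab in enumerate(labels) if lab == cls]
        rng.shuffle(indices)
        for position, index in enumerate(indices):
            folds[position % k].append(index)
    return folds

def cross_validated_score(labels, evaluate, k=5, seed=0):
    """Average evaluate(train_indices, test_indices) over the k folds."""
    folds = stratified_k_fold(labels, k, seed)
    scores = []
    for test_indices in folds:
        held_out = set(test_indices)
        train_indices = [i for i in range(len(labels)) if i not in held_out]
        scores.append(evaluate(train_indices, test_indices))
    return sum(scores) / k

def select_parameter(candidates, labels, make_evaluator):
    """Return the candidate value with the highest cross-validated score.

    make_evaluator(value) must return a function that trains on the given
    training indices and returns a score on the given test indices.
    """
    return max(
        candidates,
        key=lambda value: cross_validated_score(labels, make_evaluator(value)),
    )

# Toy labels: 10 interacting and 10 non-interacting pairs.
labels = [1] * 10 + [0] * 10
folds = stratified_k_fold(labels)
```

In a real run, `make_evaluator` would encode the sequences with the candidate parameter value, train GTB on the training indices, and return the accuracy on the held-out fold.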

To extract features from the sequence, the values for λ of PseAAC, ξ of PsePSSM, and lag of AD should be determined. We set the values of λ to 1, 3, 5, 7, 9, and 11; similarly, ξ and lag are each set to 1, 3, 5, 7, 9, and 11 in turn. GTB is then used to predict the binary labels (Tables S2-S4). As shown in Figure 2, the prediction performance on the S. cerevisiae and H. pylori datasets changes with the values of the respective parameters. For the parameter λ in PseAAC, the highest prediction performance for the two datasets is obtained at different λ values: the optimal λ for S. cerevisiae is 9, while the optimal λ for H. pylori is 11. Considering that PseAAC generates lower-dimensional vectors than the other three feature extraction methods (PsePSSM, RSIV, and AD), we choose λ = 11 to mine more PseAAC information. The parameter selection of ξ and lag can be found in File S2. In summary, for each protein sequence, PseAAC extracts 20 + 11 = 31 features, PsePSSM obtains 20 + 20 × 9 = 200 features, the dimension of RSIV is 197, and AD encodes 3 × 7 × 11 = 231 features. Fusing all four coding methods yields 31 + 200 + 197 + 231 = 659-dimensional vectors. The 1318-dimensional feature vectors are then constructed by concatenating the vectors of the two sequences in each protein pair.
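The dimension bookkeeping above can be checked with a few lines of arithmetic. The helper below is only a worked example; its name and argument names are hypothetical, and the defaults (7 properties for AD, 197 dimensions for RSIV) are taken from the figures quoted in the text.

```python
def feature_dimensions(lam, xi, lag, n_properties=7, rsiv_dim=197):
    """Per-method, per-protein, and per-pair feature dimensions."""
    dims = {
        "PseAAC": 20 + lam,
        "PsePSSM": 20 + 20 * xi,
        "RSIV": rsiv_dim,
        "AD": 3 * n_properties * lag,
    }
    per_protein = sum(dims.values())
    # A protein pair is encoded by concatenating the two protein vectors.
    per_pair = 2 * per_protein
    return dims, per_protein, per_pair

dims, per_protein, per_pair = feature_dimensions(lam=11, xi=9, lag=11)
print(dims, per_protein, per_pair)
```

With λ = 11, ξ = 9, and lag = 11 this gives 659 features per protein and 1318 per protein pair, matching the values reported above.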

Effect of dimensionality reduction

L1-RLR can effectively improve prediction performance with higher computational efficiency. The process of parameter selection is described in File S3. To evaluate the performance of L1-RLR (C = 1), we compared its prediction performance with that of SSDR [28], PCA [29] (setting of contribution rate is shown in Table S5), KPCA [30] (adjustment of contribution rate is shown in Table S6), FA [31], mRMR [32], and CMIM [33] (Table S7). ROC and PR curves of the different dimensionality reduction methods are shown in Figure 3. The AUROC and AUPRC values are shown in Table S8. The numbers of raw features and optimal features are shown in Figures S1 and S2.

As shown in Figure 3A and B, ROC curves for both the S. cerevisiae and H. pylori datasets show that L1-RLR has superior model performance. For the S. cerevisiae dataset, the AUROC value of L1-RLR is 0.9875, which is 4.55%, 4.83%, 6.13%, 3.21%, 1.07%, and 1.09% higher than those of SSDR, PCA, KPCA, FA, mRMR, and CMIM, respectively (Table S8). For the H. pylori dataset, the AUROC value of L1-RLR is 0.9559, which is 3.47%, 9.80%, 8.59%, 8.33%, 1.04%, and 9.55% higher than those of SSDR, PCA, KPCA, FA, mRMR, and CMIM, respectively (Table S8). As shown in Figure 3C and D, in the PR curves, L1-RLR obtains the highest precision at almost every recall value. The AUPRC values of L1-RLR are 1.22%-6.21% and 0.36%-11.94% higher than those of the other six dimensionality reduction methods on the S. cerevisiae and H. pylori datasets, respectively (Table S8). These results indicate that L1-RLR can effectively remove the redundant features without losing important information. The effective features related to PPIs can then be fed into a GTB classifier, generating a reliable GTB-PPI prediction model.
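AUROC values such as those compared above can be computed directly from predicted scores without drawing the curve, because the area under the ROC curve equals the probability that a randomly chosen positive sample is scored higher than a randomly chosen negative one. The following pure-Python sketch uses this pairwise formulation; the function name and toy scores are made up for illustration.

```python
def auroc(labels, scores):
    """Area under the ROC curve via pairwise comparison of scores.

    labels holds 1 for PPI pairs and 0 for non-PPI pairs. A tie between
    a positive and a negative score counts as half a correct ordering.
    """
    positives = [s for lab, s in zip(labels, scores) if lab == 1]
    negatives = [s for lab, s in zip(labels, scores) if lab == 0]
    if not positives or not negatives:
        raise ValueError("both classes must be present")
    correct = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                correct += 1.0
            elif p == n:
                correct += 0.5
    return correct / (len(positives) * len(negatives))

print(auroc([1, 0, 1, 0], [0.9, 0.8, 0.7, 0.1]))  # 3 of 4 pairs ordered correctly
```

The quadratic loop is adequate for small examples; for datasets of the size used here a rank-based computation would be faster.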

Selection of classifier algorithms

GTB is used as a classifier with the number of iterations set to 1000 and the loss function set to "deviance". The prediction results of four other classifiers are also obtained via five-fold cross-validation: KNN [34] (number of neighbors = 3) (Table S9), NB [35], SVM [36] (radial basis function as the kernel function), and RF [37] (number of base decision trees = 1000) (Table S10). The prediction results of KNN, SVM, NB, RF, and GTB on the S. cerevisiae and H. pylori datasets are shown in Table S11 and Figures S3 and S4. We also obtain the ROC and PR curves (Figure 4) and the AUROC and AUPRC values for the different classifiers (Table S12).

Figure 4 Comparison of GTB with KNN, NB, SVM, and RF classifiers

As shown in Figure 4A and B, ROC curves for both the S. cerevisiae and H. pylori datasets show that the GTB classifier outperforms KNN, NB, SVM, and RF. The AUROC values of GTB are 1.16%-24.65% and 0.53%-22.95% higher than those of the other four classifiers on the S. cerevisiae and H. pylori datasets, respectively (Table S12). As shown in Figure 4C and D, the PR curves likewise show that the prediction performance of GTB is superior to that of KNN, NB, SVM, and RF. The AUPRC values of GTB are 1.42%-24.32% and 0.22%-24.56% higher than those of the other four classifiers on the S. cerevisiae and H. pylori datasets, respectively (Table S12). These results demonstrate that GTB-PPI can accurately indicate whether a pair of proteins interact with each other within the S. cerevisiae or H. pylori dataset. GTB is an ensemble method based on a boosting algorithm that can achieve superior generalization performance over a single learner. In particular, RF achieves worse performance than GTB because all the base decision trees of RF are treated equally: if the base classifiers' predictions are biased, the final ensemble classifier may produce unreliable and biased results. GTB can use the steepest descent algorithm to bridge the gap between the sequence and PPI label information.

Figure 5 Prediction results of one-core and crossover networks using GTB-PPI

Table 1 Performance comparison of GTB-PPI with other state-of-the-art predictors on the S. cerevisiae dataset

Comparison of GTB-PPI with other PPI prediction methods

To verify the validity of the GTB-PPI model, we compare GTB-PPI with ACC+SVM [7], DeepPPI [8], and other state-of-the-art methods on the S. cerevisiae and H. pylori datasets.

As shown in Table 1, for the S. cerevisiae dataset, the ACC of GTB-PPI is 0.14%-9.00% higher than that of other existing methods; the recall of GTB-PPI is 0.15% higher than DeepPPI [8] and 1.54% higher than MCD+SVM [10]; the precision of GTB-PPI is 1.32% higher than DeepPPI [8] and 0.81% higher than MIMI+NMBAC+RF [41].

As shown in Table 2, for the H. pylori dataset, the performance of GTB-PPI is better than that of the other tested predictors. In terms of ACC, GTB-PPI is 2.88%-7.07% higher than other methods (7.07% higher than SVM [6], 4.24% higher than DeepPPI [8], and 3.73% higher than DCT+WSRC [45]). At the same time, the recall of GTB-PPI is 1.71%-12.15% higher than other methods (4.72% higher than DCT+WSRC [45] and 7.91% higher than MCD+SVM [10]). The precision of GTB-PPI is 1.76%-5.67% higher than other methods (4.29% higher than SVM [6] and 5.67% higher than DeepPPI [8]).

PPI prediction on independent test datasets

The performance of GTB-PPI can also be evaluated using cross-species datasets. After feature extraction, fusion, and selection, the S. cerevisiae dataset is used as a training set to predict the PPIs of four independent test datasets.

As shown in Table 3, for the C. elegans dataset, the ACC of GTB-PPI is 0.26% higher than MIMI+NMBAC+RF [41], 4.71% higher than MLD+RF [39], and 11.23% higher than DCT+WSRC [45], but 2.42% lower than DeepPPI [8]. For the E. coli dataset, the ACC of GTB-PPI (94.06%) is 1.26%-27.98% higher than DeepPPI (92.19%) [8], MIMI+NMBAC+RF (92.80%) [41], MLD+RF (89.30%) [39], and DCT+WSRC (66.08%) [45]. For the H. sapiens dataset, the ACC of GTB-PPI (97.38%) is 3.05%-15.16% higher than DeepPPI (93.77%) [8], MIMI+NMBAC+RF (94.33%) [41], MLD+RF (94.19%) [39], and DCT+WSRC (82.22%) [45]. For the M. musculus dataset, the ACC of GTB-PPI (98.08%) is 2.23%-18.21% higher than DeepPPI (91.37%) [8], MIMI+NMBAC+RF (95.85%) [41], MLD+RF (91.96%) [39], and DCT+WSRC (79.87%) [45]. The findings indicate that the hypothesis of mapping PPIs from one species to another is reasonable. We can conclude that PPIs in one organism might have "co-evolved" with those in other organisms [41].
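The improvement ranges quoted above are differences in percentage points between the ACC of GTB-PPI and the ACC of each competing method. The short check below recomputes them from the accuracies stated in the text for the three datasets where every value is given; the function and variable names are hypothetical.

```python
def improvement_range(ours, others):
    """Smallest and largest ACC gain, in percentage points."""
    gains = [round(ours - other, 2) for other in others.values()]
    return min(gains), max(gains)

# Accuracies (%) as quoted in the text for each independent dataset.
reported = {
    "E. coli": (94.06, {"DeepPPI": 92.19, "MIMI+NMBAC+RF": 92.80,
                        "MLD+RF": 89.30, "DCT+WSRC": 66.08}),
    "H. sapiens": (97.38, {"DeepPPI": 93.77, "MIMI+NMBAC+RF": 94.33,
                           "MLD+RF": 94.19, "DCT+WSRC": 82.22}),
    "M. musculus": (98.08, {"DeepPPI": 91.37, "MIMI+NMBAC+RF": 95.85,
                            "MLD+RF": 91.96, "DCT+WSRC": 79.87}),
}

ranges = {name: improvement_range(ours, others)
          for name, (ours, others) in reported.items()}
print(ranges)
```

In each case the smallest gain is over MIMI+NMBAC+RF and the largest is over DCT+WSRC.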

Table 2 Performance comparison of GTB-PPI with other state-of-the-art predictors on the H. pylori dataset

Table 3 Performance comparison of GTB-PPI with other state-of-the-art predictors on independent datasets

PPI network prediction

Graph visualization of a PPI network can provide a broad and informative view for understanding the proteome and analyzing protein functions. We employ GTB-PPI to predict the simple one-core PPI network for CD9 [46] and the crossover PPI network for the Wnt-related signaling pathways [47], using the S. cerevisiae dataset as a training set.

As shown in Figure 5A, GTB-PPI fails to predict only the interaction between CD9 and Collagen-binding protein 2, which was also not predicted by Shen et al. [48]. Compared with Shen et al. [48] and Ding et al. [41], GTB-PPI achieves superior prediction performance. The ACC is 93.75%, which is 12.50% higher than Shen et al. (81.25%) [48] and 6.25% higher than Ding et al. (87.50%) [41]. As shown in Figure 5B, 92 of the 96 PPI pairs are identified by GTB-PPI. The ACC is 95.83%, which is 19.79% higher than Shen et al. (76.04%) [48] and 1.04% higher than Ding et al. (94.79%) [41].

The palmitoylation of CD9 could support the interaction of CD9 with CD53 [49]. In the one-core network for CD9, the interaction between CD9 and CD53 is predicted successfully by GTB-PPI. In the crossover PPI network for the Wnt-related signaling pathways, ANP32A, CRMP1, and KIAA1377 are linked to the Wnt signaling pathway via PPIs. ANP32A has been demonstrated to be a potential tumor suppressor [50], and GTB-PPI could predict its interactions with the corresponding proteins. However, the interaction between ROCK1 and CRMP1 is not predicted. This is likely because we use the S. cerevisiae dataset as a training set, and ROCK1 and CRMP1 are genes from an organism other than S. cerevisiae. In addition, ROCK1 is part of the noncanonical Wnt signaling pathway [47], and GTB-PPI may not be very effective in this case. A previous study has reported that AXIN1 could interact with multiple proteins [51]. Here, we find that GTB-PPI can predict the interactions between AXIN1 and its satellite proteins, which provides new insights for elucidating the biological mechanism of the PPI network.
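Because every pair in these networks is a true interaction, the ACC is simply the fraction of pairs predicted as interacting. The arithmetic below reproduces the reported percentages. The count of 92 of 96 pairs is stated in the text; the total of 16 pairs in the one-core network and the counts for the other methods are not stated and are inferred here from the reported percentages, so they should be read as assumptions.

```python
def accuracy_percent(correct, total):
    """Accuracy as a percentage rounded to two decimals."""
    return round(100.0 * correct / total, 2)

# Crossover network: 92 of 96 pairs (stated in the text).
crossover_gtb = accuracy_percent(92, 96)

# One-core network: one missed interaction; 16 pairs in total is inferred
# from the reported 93.75% and is an assumption of this example.
one_core_gtb = accuracy_percent(15, 16)

# Counts inferred from the percentages reported for the other methods.
crossover_shen = accuracy_percent(73, 96)
crossover_ding = accuracy_percent(91, 96)

print(one_core_gtb, crossover_gtb, crossover_shen, crossover_ding)
```

Under these inferred counts, the 1.04-point gap to Ding et al. on the crossover network corresponds to a single additional correctly predicted pair.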

Conclusion

The knowledge and analysis of PPIs can help us to reveal the structure and function of proteins at the molecular level, including growth, development, metabolism, signal transduction, differentiation, and apoptosis. In this study, a new PPI prediction pipeline called GTB-PPI is presented. First, PseAAC, PsePSSM, RSIV, and AD are concatenated as the initial feature information for predicting PPIs. PseAAC obtains not only the amino acid composition information but also the sequence order information. PsePSSM can mine the evolutionary information and local order information. RSIV can obtain the frequency feature information using the reduced sequence. AD reflects the physicochemical property features over the global amino acid sequence. Second, L1-RLR can obtain effective information features related to PPIs without losing accuracy and generalization. The performance of L1-RLR is superior to that of SSDR, PCA, KPCA, FA, mRMR, and CMIM (Figure 3). Finally, the PPIs are predicted based on GTB, whose base classifier is a decision tree, which can bridge the gap between amino acid sequence information features and the class label. Experimental results show that the PPI prediction performance of GTB is better than that of SVM, RF, NB, and KNN. Notably, in the field of binary PPI prediction, L1-RLR is used for dimensionality reduction for the first time, and GTB is employed as a classifier for the first time. In summary, GTB-PPI shows good performance, representation ability, and generalization ability.

Availability

All datasets and code of GTB-PPI can be obtained on https://github.com/QUST-AIBBDRC/GTB-PPI/.

CRediT author statement

Bin Yu: Conceptualization, Data curation, Formal analysis, Investigation, Methodology, Writing - original draft, Validation, Writing - review & editing. Cheng Chen: Data curation, Formal analysis, Investigation, Methodology, Writing - original draft, Validation, Visualization. Hongyan Zhou: Formal analysis, Investigation, Methodology, Validation, Visualization. Bingqiang Liu: Formal analysis, Investigation, Methodology, Writing - original draft. Qin Ma: Data curation, Formal analysis, Investigation, Methodology, Writing - original draft, Writing - review & editing. All authors read and approved the final manuscript.

Competing interests

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Acknowledgments

This work was supported by the National Natural Science Foundation of China (Grant No. 61863010), the Key Research and Development Program of Shandong Province of China (Grant No. 2019GGX101001), and the Natural Science Foundation of Shandong Province of China (Grant No. ZR2018MC007).

Supplementary material

Supplementary data to this article can be found online at https://doi.org/10.1016/j.gpb.2021.01.001.

ORCID

0000-0002-2453-7852 (Bin Yu)

0000-0002-4354-5508 (Cheng Chen)

0000-0003-4093-2585 (Hongyan Zhou)

0000-0002-5734-1135 (Bingqiang Liu)

0000-0002-3264-8392 (Qin Ma)
