NOGEA:A Network-oriented Gene Entropy Approach for Dissecting Disease Comorbidity and Drug Repositioning

2021-03-30ZihuGuoYingxueFuChaoHuangChunliZhengZiyinWuXuetongChenShuoGaoYaohuaMaMohamedShahenYanLiPengfeiTuJingboZhuZhenzhongWangWeiXiaoYonghuaWang

Genomics,Proteomics & Bioinformatics 2021年4期

Zihu Guo ,Yingxue Fu ,Chao Huang ,Chunli Zheng ,Ziyin Wu,§ ,Xuetong Chen ,Shuo Gao,Yaohua Ma,Mohamed Shahen,Yan Li,Pengfei Tu,Jingbo Zhu,Zhenzhong Wang,Wei Xiao,*,Yonghua Wang,,,*

1College of Life Science,Northwest University,Xi’an 710069,China

2College of Life Science,Northwest A &F University,Yangling 712100,China

3Zoology Department,Faculty of Science,Tanta University,Tanta 31527,Egypt

4Key Laboratory of Industrial Ecology and Environmental Engineering(Ministry of Education),Faculty of Chemical,Environmental and Biological Science and Technology,Dalian University of Technology,Dalian 116024,China

5State Key Laboratory of Natural and Biomimetic Drugs,School of Pharmaceutical Sciences,Peking University,Beijing 100191,China

6School of Food Science and Technology,Dalian Polytechnic University,Dalian 116034,China

7State Key Laboratory of New-tech for Chinese Medicine Pharmaceutical Process,Lianyungang 222001,China

Abstract Rapid development of high-throughput technologies has permitted the identification of an increasing number of disease-associated genes (DAGs),which are important for understanding disease initiation and developing precision therapeutics.However,DAGs often contain large amounts of redundant or false positive information,leading to difficulties in quantifying and prioritizing potential relationships between these DAGs and human diseases.In this study,a networkoriented gene entropy approach (NOGEA) is proposed for accurately inferring master genes that contribute to specific diseases by quantitatively calculating their perturbation abilities on directed disease-specific gene networks.In addition,we confirmed that the master genes identified by NOGEA have a high reliability for predicting disease-specific initiation events and progression risk.Master genes may also be used to extract the underlying information of different diseases,thus revealing mechanisms of disease comorbidity.More importantly,approved therapeutic targets are topologically localized in a small neighborhood of master genes in the interactome network,which provides a new way for predicting drug-disease associations.Through this method,11 old drugs were newly identified and predicted to be effective for treating pancreatic cancer and then validated by in vitro experiments.Collectively,the NOGEA was useful for identifying master genes that control disease initiation and co-occurrence,thus providing a valuable strategy for drug efficacy screening and repositioning.NOGEA codes are publicly available at https://github.com/guozihuaa/NOGEA.

KEYWORDS Systems pharmacology;Gene entropy;Disease gene network;Disease comorbidity;Drug repositioning

Introduction

The onset and progression of most complex diseases often involve the dysfunction of thousands of genes as well as certain altered interactions among them.High-throughput technologies such as gene expression profiling and wholegenome sequencing have permitted the identification of an increasing number of disease-associated genes(DAGs)[1],which may provide valuable insights into mechanisms of disease initiation and progression.However,as the existing DAGs are usually derived from multiple sources,they often contain large amounts of redundant or false positive information [2] due to collection bias and noise,such that causal relationships among these genes in most cases remain elusive.Therefore,identifying master genes that control disease state transitions from large numbers of DAGs plays a critical role in understanding the mechanisms of disease initiation.In addition,complex diseases show considerable comorbidity [3].The defects of master genes in one disease may initiate interaction cascades that lead to the co-occurrence of multiple diseases in a given patient.Pharmacological targeting of the DAG module in the human interactome has proven to be a valuable strategy for drug efficacy screening[4].At present,it is unclear whether the identification of master genes will further facilitate the network-based drug repositioning.

Recent trends in omics technologies and complex biological networks have led to a proliferation of attempts to find the master genes for different diseases.For example,genome-wide association studies(GWAS)have emerged as a powerful tool for detecting sequence variations associated with many human traits and diseases [5].Due to the low frequency of many mutations,GWAS usually require large cohort size to attain sufficient statistical power.More importantly,GWAS identify only the genetic risk factors associated with the disease rather than the master genes of the disease phenotypes,because patient genomes contain a certain proportion of “passenger mutations” [6] and the initiation of many diseases is often triggered by the interplay between genetic and non-genetic factors.Transcriptome analysis is considered to be an effective complement of GWAS due to its ability to capture nongenetic perturbations to the organism.Yet variations in mRNA expression are sometimes caused by aberrant protein activities of upstream regulators such as transcription factors,making it difficult to directly identify the master gene set using transcriptome profiling [7].

Recently,gene co-expression-based approaches have been proposed to construct context-specific regulatory networks [8],and a local network entropy measure has been developed based on co-expression networks for identifying master genes[9].While these approaches provide new ways to find master genes,building a highly confident coexpression regulatory network often requires large sample size,which is usually not available for relatively rare diseases.To overcome this limitation,protein-protein interaction (PPI) network-based approaches have been developed to infer master genes that are important for disease-related biological processes,such as predicting therapeutic targets [10] or driver genes [11].Some topological parameters such as the degree and betweenness centrality of the nodes are usually used as important measures to screen master genes [12].However,current approaches are based mainly on the constant global undirected interactome,ignoring the fact that disease initiation and therapeutics are frequently context-dependent,depending on specific tissues or pathological microenvironments[13].Therefore,some genes that exhibit important topological properties on the interaction network,such as the hub genes[14],will be automatically selected as key regulators for disease state initiation and maintenance,leading to a possible increase in false-positive master genes.Conversely,some classes of genes presenting as upstream regulators of a signaling cascade,such as the G protein-coupled receptors [15],may be identified as dispensable genes due to their relatively low degrees in the interactome,thus decreasing the sensitivity for distinguishing core ones from the giant pool of DAGs.

In this study,we developed a network-oriented gene entropy approach(NOGEA)to quantify the perturbation or regulatory ability of each DAG in distinct disease contexts by assembling and interrogating disease-specific regulatory networks.For each disease,genes exhibiting high entropy values by our in silico method were identified as master genes,whose altered expression was considered to be sufficient for disease state transitions;these master genes were further adopted to investigate comorbidity and causal relationships among different diseases.We further confirmed that existing effective drugs are most likely to target the local module of master genes in the interactome,and identified 11 old drugs as potent anti-cancer agents for pancreatic cancer treatment.

Method

Dataset collection

The DAGs for all diseases analyzed in this study were obtained from four publicly available databases,including KEGG Disease [16],Comparative Toxicogenomics Database [17],Therapeutic Target Database [18],and PharmGKB[19].All disease names and their corresponding IDs were standardized by mapping to Medical Subject Headings (MeSH;https://www.nlm.nih.gov/mesh/),and official gene symbols for these DAGs were retrieved from GeneCards (http://www.genecards.org/).We then conducted a disease filtering process to ensure disease specificity.We first removed diseases with levels ＜2 on the MeSH trees,such as “nervous system diseases” and“cardiovascular diseases”,as these disease types are too broad.Tanimoto similarity (ratio of the number of shared DAGs to the number of joined DAGs) was then computed for each disease pair and used to remove diseases showing high similarity (＞0.50) with its descendant diseases.The weighted directed PPI network was constructed using data from a previous study [20],which consisted of 13,684 weighted interactions among 6082 proteins.The DAGs were then mapped to corresponding proteins in the PPI network,and those diseases with at least 15 DAGs in the human interactome were retained,because they are likely to induce a module on the network.As a result,we obtained 11,414 disease-gene associations between 274 diseases and 2848 protein-coding genes.For each disease,we manually extracted drug-disease associations from the drug indication information in DrugBank[21].In addition,we obtained drug-target interactions for all FDA-approved drugs from DrugBank.To construct a disease comorbidity network,we retrieved disease pairs with comorbidity relationships from a previous study[3]of 665 diseases and their corresponding genes extracted from Online Mendelian Inheritance in Man(OMIM)[22].

Construction of NOGEA

Construction of a flux matrix based on the expectation of the Bernoulli distribution

wherea=1 or 0,indicating whether signal transduction exists between the paired nodes,andis a scale parameter to adjust the likelihood for different distances.In addition,d(i,j)is the directed distance between the given paired nodes,i.e.,the number of edges in a directed shortest path connecting them,and is calculated using the “igraph” package based on Dijkstra’s algorithm,reflecting the possibility of the pairwise regulatory relationship from.The details for determining the optimal scale parameter are presented in File S1 and Figures S1 and S2.Therefore,the space of“possible” values assumed byI(i,j) is{0,1},and ifa=1,represents the likelihood that there is a signaling flux between the paired nodes.In the field of network communication,it is widely accepted that the success rate of signal propagation decays exponentially with increasing distance [24].In addition,previous studies have demonstrated that exponential decay is a popular kernel to characterize the network influence between two nodes[25].Previously,we have used the exponential component to evaluate the association between two nodes in PPI networks[26].Thus,we believe that the success probability of signal transduction between two proteins decays exponentially with the increase of their distance,and the exponential componentis useful for representing the success probability.In this way,the stochastic information flux matrix for a given disease is obtained by a simplified equation

Normalization of the flux matrix

The probability distribution of signal flux is estimated from

whereZis the normalization constant or partition function,and

to ensure that the sum of the probability is 1.

Definition and calculation of disease gene entropy

Based on the probability distribution of signal flux,we calculate the entropy for a given diseaseS(D)in terms of the weighted Shannon entropy formula,which can be interpreted as the degree of disorder or complexity for the disease-specific context

whereis the out-degree of nodein the directed PPI network,which is calculated using the “igraph” package.Interestingly,we find that the disease entropyS(D)can be factorized as shown in Equation(8)

Gene entropy value normalization

Through the aforementioned procedure,a gene entropy map is established for 274 diseases.For any given diseaseD,the gene entropy Z-scores are calculated,making the gene entropy values of different diseases comparable

Rank score calculation of gene entropy

The gene entropy values for diseaseDare sorted in an ascending order,and a rank list is generated

Disease-gene classification based on the gene entropy values

To comprehensively explore the biological meaning of the entropy,we divide all DAGs into three(i.e.,master,interim,and redundant) groups based on their entropy values using an adaptive approach.Briefly,we create an entropy value curve for each disease,and identify two inflection points in the curve as thresholds.Specifically,for each diseaseD,we rank each gene entropy valuein ascending order.Then we map each entropy value onto a two-dimensional coordinate system,such that the lowest entropy valuebecomes coordinate,the second lowest value becomes,and so on,until the maximum entropy valueis reached.Two inflection points are identified in the entropy value curve from the intervals of 10th-50th percentile and 51st-90th percentile of all entropy values,respectively,which are separately defined as the threshold points of most rapid increase from the low to the medium and from the medium to the high entropy values.The entropy values corresponding to the two inflection points are used as the adaptive disease-specific classification thresholds.Master genes of all diseases are then merged and adopted as the whole master gene set to explore their common biological meanings.Interim and redundant genes from different diseases are treated in the same way to obtain the whole interim and redundant gene sets,respectively.Some genes may belong to all three gene sets (master,interim,and redundant),because they play different roles in distinct disease contexts.

Disease comorbidity relationship evaluation

We first construct a real human disease comorbidity network (HDCN),where nodes represent diseases and edges represent the reported comorbidity relationships,respectively.Then,five different types of inferred disease comorbidity networks are built to compare with the HDCN:1)a master gene-based disease network (M-GDN),where edges link two different diseases only if they share at least one high-entropy gene,2) a redundant gene-based disease network (R-GDN),3) an interim gene-based disease network (I-GDN),4) an all DAG-based disease network(A-GDN),and 5) a traditional hereditary disease network(THDN).A Tanimoto coefficient is used to evaluate the similarity between different networks as shown in Equation(14)

whereAandBare different networks,represents the edge set of a given network andis the number of edges in the network.To assess the significance of the similarity of different networks,a random gene-based disease network(R-GN) is randomly generated 1000 times and compared with the HDCN using Equation (14).In the R-GN,each disease involves a randomly sampled gene set with the same size as the disease in A-GDN.

Previous research has demonstrated that cellular interaction links result in statistically significant comorbidity patterns [3].Therefore,we believe that the directed interaction strength from the DAGs of one disease to another in the directed cellular network can reflect the causal relationship between the two diseases.To evaluate whether a causal relationship exists between two diseases,we estimate the significance of the interaction strength between the DAGs of the disease pairs using the Monte Carlo method.We first define a raw causal relationship score(RCRS) for two given diseasesD1 andD2

wherepcutis a threshold,below which the probability is discarded and considered not contributive to the overall interaction,andpcutis determined according to a previous study [27].We then use a normalized causal relationship score(NCRS)to quantify the risk that diseaseD1 will induce diseaseD2.The NCRS is defined in Equation (17)

Calculation of drug disturbance entropy

To quantify the effects of a drug on each disease based on the gene network entropy,we apply an ensemble approach,referred to as drug disturbance entropy (DDE),to evaluate the relationship between drug targets and disease-associated proteins (encoded by DAGs) in the interactome.We first evaluate the linkage strength between each DAG and drug target in the interactome,which is then transformed to a probability.The perturbation value for each target and DAG is defined as the product of the strength probability and the DAG entropy

whereTcutis a cut-off threshold of the perturbation values,which is determined by extensive sampling,and relationships with a perturbation value below this threshold are discarded.The remaining values are summed as the raw DDE of the drug to the disease.The advantage of this procedure is that weak relationships are eliminated,which greatly reduces noise and improves the robustness of the measure.By sampling across the range ofTcutchoices,the threshold that leads to the highest ROC AUC is chosen.We obtain the properTcutas 0.89×max(T(t,i))by evaluating the performance of predictions of drug-disease associations.Detailed information for determiningTcutis depicted in File S1.

To avoid possible high DDE that may be caused by a large number of drug targets and DAGs,we convert raw DDE to a size-bias-free value using the mean and standard deviation of raw DDE modeled from sets of random molecules,so that the potential therapeutic effects between distinct drugs and diseases could be evaluated under the same metric.The raw DDE score is transformed to a sizebias-free score under Equation(22)

whereTandV(D) are the drug target set and the DAG set,respectively;are the estimation of the expectation and standard deviation of DDE under this condition,respectively.

The estimation procedures ofμ(ET(T,V(D)))andare as follows:for each pair of(T,V(D)),we construct 1000 random set pairs with|T|targets andDAGs,preserving the degree distribution of the randomized targets and disease-associated proteins.To avoid repeatedly choosing the same nodes during the degree-preserving random selection,we use a binning approach as described ina previous report[4].

Cell culture and viability assays

A human pancreatic adenocarcinoma cell line,moderately differentiated BxPC3,was obtained from the American Type Culture Collection(Manassas,VA,USA).Cells were maintained at 37°C under 5% CO2air atmosphere in Dulbecco’s Modified Eagle Medium supplemented with 10%fetal bovine serum.The BxPC3 cells were plated in 96-well tissue culture microtiter plates at a density of 5 × 103cells/well and treated with the selected drugs for at least five different concentrations.Cytotoxic effects of drugs on cells were determined by the MTT assay.The absorbance was recorded on a microplate reader (Catalog No.DNM-9602,Beijing Pulang New Technology,Beijing,China) at a wavelength of 490 nm.The maximum drug effects on cell viability were experimentally observed at the endpoint,and the IC50value was determined after 72 h of treatment.All experiments were performed in quadruplicate and repeated three times.

Results and discussion

Computation and characterization of gene entropy in disease networks

To identify master genes in distinct disease contexts,NOGEA model was developed(Figure 1A and B).Briefly,Shannon entropy theory was applied to quantify the amount of disorder within intracellular signals in each diseasespecific context,which was subsequently factorized as the summation of contribution of each DAG.First,directed disease-specific gene networks for 274 diseases were constructed to reflect the distinct disease contexts by mapping all DAGs (Table S1) to a previously established directed PPI network(Table S2)[20].A directed network visualizes the hierarchy of intracellular signal transduction between the interacting proteins,and hence clearly reflects the importance of each DAG in a certain physiological and pathological context.The regulation likelihood between each pair of DAGs was then calculated based on the directed distance on the PPI network to generate a probability-based signaling flux matrix(Figure 1A).Finally,the perturbation ability of each DAG in a disease-specific context was calculated by the network-oriented gene entropy metric(Figure 1B;see Method).The distribution of entropy values for all DAGs is illustrated as a histogram in Figure S3,and the perturbation ability of each DAG was then ranked based on their entropy values(Table S1).

Figure 1 Computation and characterization of gene entropy in disease networks

To efficiently explore the biological features of entropy distribution for each disease,all DAGs were classified as“Master”,“Interim”,and “Redundant” genes which represent high-,medium-,and low-entropy genes,respectively.We created an entropy value curve for each disease and then identified two inflection points as thresholds to separate the low-,medium-,and high-entropy genes,respectively (see Method).We then merged the master genes of all diseases into a whole master gene set.Interim and redundant genes from different diseases were treated in the same way to obtain the whole interim and redundant gene sets,respectively.As a result,770 master,1894 interim,and 1332 redundant genes were obtained(Figure 1C;Table S3).

In order to verify whether the master genes play a key role in disease initiation and development,enrichment analyses were performed using several well-established gene sets (Table S4).We observed that there was an overrepresentation (enrichment Z-score=22.61) of master genes in the OMIM gene set,which was higher than the enrichment Z-scores of both interim and redundant genes(Figure 1D).The “essential” genes were demonstrated to play critical roles in human diseases[28],and master genes were enriched in the “essential” gene set,whose Z-score was two times larger than that of the redundant genes(Figure 1D).More importantly,we found that master genes were highly enriched in the cancer-associated gene set;however,redundant genes showed less enrichment(Figure 1D).Further KEGG analysis of the master genes showed that these genes were mainly enriched in pathways with close relationships with cancer initiation and progression (Figure S4).For example,PI3K-Akt signaling pathway (has:04151),which is commonly perturbed in cancers,was found among the top 5 enriched pathways(P＜10E-30).In a recent study,genes in the interactome were classified into different node types,in which “indispensable”nodes were found to be key players in mediating the transition of disease states.As shown in Figure S5A,we found that master genes were highly enriched in the “indispensable”gene set,but redundant genes were enriched in the “dispensable” gene set.Consistent with these observations,master genes were highly enriched in the “critical”gene set that acted as driver nodes in all control configurations (Figure S5B) [26].Further dissection of all different functional classes within signaling proteins revealed that master genes were most likely enriched in the“kinase”and “membrane receptor (MR)” gene sets (Figure 1E).In summary,these results indicate that the master genes are preferred key regulators in disease initiation and development,reflecting the reliability of the NOGEA method.

Traditional network topology parameters,such as the connective degree and betweenness centrality,are commonly used as baseline methods for characterizing the importance of nodes in biological networks [29].To validate the effectiveness of NOGEA,we compared it with four baseline methods(connective degree,connective in-degree,connective out-degree,and betweenness centrality-based methods) and four newly proposed methods (Katz [30],Catapult[30],HANRD[31],and GPS[32]),all of which are network-based methods for prioritizing DAGs.We first compared the area under the receiver operating characteristic curve(AUROC)values between different methods(see Method) and found that NOGEA significantly outperformed both the baseline methods and the newly proposed methods (Figure 1F).We further evaluated the area under the precision-recall curve(AUPRC)for each method.NOGEA consistently surpassed all other methods,overmatching the second-best method by～10%(Figure 1F).

Correlations between gene entropy values and four traditional network topology parameters were assessed using Pearson’s correlation coefficients(PCCs).For most diseases,we observed that the PCCs between gene entropy values and network topology parameters were relatively small(＜0.25;Figure S6A).Nonetheless,significant correlation was observed between the connective in-degree(R=0.2254,P＜1.0E-15;Figure 1G),connective out-degree (R=0.5239,P＜1.0E-15;Figure 1H),connective degree (sum of indegree and out-degree;R=0.3942,P＜1.0E-15;Figure 1I)and betweenness centrality (R=0.1774,P＜1.0E-15;Figure 1J) for genes in the primary directed PPI networkversusgene entropy values.Fisher’s exact test was then applied to further determine whether gene entropy is associated with these four traditional network topology parameters.Specifically,we constructed a contingency table to classify the DAGs into different bins based on their entropy values and network parameter values (Figure 1K).We found that gene entropy was significantly associated with traditional network topology parameters,including connective degree(P＜0.01),connective in-degree(P＜0.01),connective out-degree (P＜0.01),and betweenness centrality(P＜0.01).All these results demonstrate that master genes prefer to possess high topology parameter values,indicating relative consistency between gene entropy and the four network topology parameters.

To investigate variation of the regulatory role of a specific gene in different diseases,we calculated the divergencedegree of gene entropy across diseases using the coefficient of variation(CV)(Table S1;Figure S6B).The results show that up to 60% of the DAGs have a high CV (＞0.15),indicating that these DAGs play distinct roles in different disease contexts.We then examined the entropy value variation of the shared DAGs in different diseases,and observed that these DAGs usually exhibited similar entropy values in distinct diseases within the same disease category.For example,corticotropin-releasing hormone receptor 1(CRHR1) is related to several mental health-associated diseases with similar entropy rank scores (rank ＞0.80),including anxiety and depressive disorders (Table S1),which is consistent with its major role in mental disorders [33].We also observed a low entropy rank score forCRHR1in pulmonary disease (rank=0.55),indicating variation in its regulatory role in distinct disease contexts.Further,we found that～15%of DAGs have approximately equal rank scores in their associated diseases.For instance,among these DAGs,both interleukin 4 receptor(IL4R)and phosphatidylinositol-4,5-bisphosphate 3-kinase catalytic subunit alpha (PIK3CA) have high rank scores in their associated diseases (Table S1),especially for neoplasms,suggesting their crucial roles in these diseases.Taken together,NOGEA provides a new way to explore the regulatory roles of each DAG in distinct disease contexts.

NOGEA for exploring disease comorbidity

Exploration of the underlying mechanisms of comorbidity,which refers to the co-occurrence of multiple diseases or disorders,is difficult due to complex interactions among environmental,lifestyle,and treatment-related factors[34].In addition,disease comorbidity includes not only the cooccurrence of multiple diseases,but also the potential cause-and-effect relationships among these diseases.Thus,uncovering the underlying mechanisms of disease cooccurrence and causal relationships is of great significance for their prevention and treatment.Using experiment-based approaches or mathematical models,previous studies explored the molecular features of disease comorbidity for several diseases,including from gastritis to gastric cancer[35]and from diabetes to cancer [36].However,existing experiment-based methods for exploring the underlying mechanisms of co-occurrence and causal relationships remain costly and labor-intensive,and sometimes focus on a small fraction of molecular features.Comparatively,mathematical models provide novel ways to reveal disease comorbidity using multi-omics data;however,these models are difficult to apply in other diseases,due to the lack of multi-scale information for these diseases.

The results discussed above demonstrate that NOGEAinferred master genes are closely associated with disease initiation and development,prompting us to investigate whether the network entropy-based approach would be capable of uncovering the molecular basis of disease cooccurrence.Therefore,we constructed M-GDN,where an edge would link two different diseases if they share at least one master gene (Table S5).For comparison,we constructed five other disease comorbidity networks:R-GDN,I-GDN,A-GDN,THDN,and R-GN.

To test whether the M-GDN would provide an accurate picture of disease comorbidity,we evaluated the Tanimoto similarities between these networks and the HDCN,which was extracted from the Medicare Claims Database and constructed in a previous study [3].The M-GDN showed the highest similarity with the HDCN (higher than that of R-GDN and THDN) and remarkably higher level than the average of random similarity values (Figure 2A),which indicates that genes most associated with disease comorbidity tend to be master genes with high entropy rather than arbitrary DAGs.In contrast to previous THDN models,M-GDN considers genetic factors as well as genes that respond to environmental,lifestyle,and/or treatment-related factors,thus providing a more comprehensive solution for exploring disease comorbidity.Furthermore,in view of the impact of cellular network interactions on disease comorbidity,we extended our result to a PPI-based M-GDN(Table S6),where two diseases were linked if the master gene of one disease directly interacted with genes of the other disease in the PPI network.Consistent with the aforementioned results,the PPI-based M-GDN demonstrated the best predictive ability in identifying disease comorbidity.We then observed that the inferred underlying molecular mechanisms of disease comorbidity are in accordance with current pathobiological knowledge (Figure 2B).For example,M-GDN confirmed the conclusion thatAKT1mutations lead to schizophrenia and type 2 diabetes mellitus(T2DM)[37],with entropy rank scores of 0.96 and 0.94 in schizophrenia and T2DM,respectively.We also observed that in the M-GDN,ADRB2mutations may lead to asthma and obesity with entropy rank scores of 0.95 and 0.97 in asthma and obesity,respectively,which is consistent with a previous study[38].Previous reports have suggested that mutations in theIRS1gene are closely related to the comorbidity of T2DM and obesity[39],andPTGS2influences the inflammatory response and is also closely connected with the comorbidity of T2DM and obesity [40].Here,M-GDN also revealed thatIRS1andPTGS2plays a crucial role in the comorbidity of T2DM and obesity.Another example revealed in M-GDN is the comorbidity of leukemia and cardiomyopathy,whose underlying mechanisms remain unclear.Interestingly,FASis involved in the regulation of cell apoptosis,which affects left ventricular function [41],whilePRKCAenhances cell resistance [42],regulates cardiac contractility,and has been implicated in increased risk for heart failure.More importantly,theFASPRKCAinteraction has been identified as the top connected cross-talk PPI by in situ proximity ligation assays [43].These results demonstrate that the interaction betweenFASandPRKCAmay account for the comorbidity of leukemia and cardiomyopathy.Taken together,these results suggest that M-GDN helps bridge the gap between bench-based biological discoveries and bedside clinical solutions,and thus may provide new insights into the mechanisms of disease comorbidity.

Next,we investigated the molecular basis of disease causal relationships from the perspective of directed biological networks.As an illustration,we constructed a directed comorbidity network(Figure 2C;Table S7)centered on Parkinson’s disease.We observed high co-occurrence risks between Parkinson’s disease and other diseases including Alzheimer’s disease.Recent research suggests that these diseases are related to the accumulation of commonproteins in the brain,such as alpha-synuclein protein [44].Using alcoholism and Parkinson’s disease as an example,we observed a significant directed interaction from alcoholism to Parkinson’s disease (P＜0.01;see Method),but notvice versa.This result is consistent with recent clinical studies,which suggest that alcoholism may an inducer of Parkinson’s disease [45].A subsequent network analysis further discovered that the aberration of alcoholismassociated master genes may lead to the modification of most Parkinson’s disease-associated master genes (Figure 2D).Collectively,NOGEA is potentially useful for investigating mechanisms underlying disease comorbidity as well as their causal relationships.

Figure 2 Exploration of disease comorbidity using NOGEA

NOGEA can infer drug-disease associations

Recently,several state-of-the-art network-based methods have been proposed to investigate the relationships between drugs and diseases,such as the network proximity approach(NPA) and network inference algorithm (NIA) [4,46].In this study,we assessed relationships between DAGs and drug targets based on the gene network entropy to evaluate the effects of drugs on each disease.For each drug-disease relationship,we calculated the DDE parameter,which represents potential therapeutic effects of the drug (Tables S8-S10;see Method).To further investigate DDE’s effectiveness,we evaluated the correlation between the DDE score and the number of hits for known drug-disease interactions(DDIs),and found that the occurrence number of known DDIs in each bin increased with increasing DDE scores(Figure 3A).Consistent with previous research[4],a highly significant correlation occurred between the average DDE score of each bin and the enrichment fold of hits for known DDIs (R2=0.75,P=2.2E-16;Figure 3B),indicating a high likelihood that a drug will successfully treat a disease if the drug is capable of strongly perturbing the local module of master genes in the interactome.

Figure 3 Drug-disease association inference based on NOGEA

To validate the utility of DDE for distinguishing known drug-disease pairs from the unknown ones,we compared the AUROC values for different drug-disease association prediction methods (see Method).To obtain a robust AUROC estimation,the drug-disease set was split into a training set and a testing set according to a given fraction coefficient for developing and validating the model,respectively (Figure S7).We compared the DDE’s performancewith several other state-of-the-art methods [4,46],including NIA,NPA,network kernel approach (NKA),network shortest approach(NSA),network center approach(NCA),and network separation approach (NSEA).As shown in Figure 3C,DDE exhibited the best performance(average AUROC=0.70) in discriminating known and unknown drug-disease pairs.Interestingly,NIA appeared to be the second-best method(average AUROC=0.68),which was also able to construct a directed disease-specific gene network and identify master genes before predicting the drug-disease associations.A compressive comparison between the two methods demonstrated their connection and difference (File S1;Figure S8;Tables S11 and S12).Collectively,these results suggest that DDE is effective for predicting drug-disease associations.

Pancreatic cancer is a refractory malignant carcinoma of the digestive tract with a 5-year survival rate of～4%[47],and it modestly responds to very few existing chemotherapy treatment options.Revisiting the complex interaction pattern between drug targets and pancreatic cancer-associated genes in a systemic manner is essential for developing more effective therapeutic regimens.Therefore,we used pancreatic cancer as an example to explore the utility of NOGEA for drug-disease association inference.By measuring the entropy of each pancreatic cancer-associated gene in the pancreatic cancer-specific network (Figure 3D and E),we found that those genes with high entropy such asMET,KDR(VEGFR-2),ERBB2,CD44,andEGFRmay play more important roles than the lower-entropy genes for pancreatic cancer treatment.As reported in a previous study [48],EGFR-mediated signaling is involved in the tumorigenesis of pancreatic cancer,and the preclinical data supportEGFRinhibition as a potential treatment strategy for pancreatic cancer.In addition,c-Met protein,which is encoded by theMETgene,is a marker of pancreatic cancer stem cells and thus a therapeutic target [49].KDRis known to be crucial for embryonic vasculature development by modulating endothelial cell proliferation and migration [50].Moreover,CD44is a potentially interesting prognostic marker and therapeutic target in pancreatic cancer[51].

To investigate differences in the targeting patterns between effective drugs and other less-effective drugs from a network-based perspective,we constructed a gene entropy map for pancreatic cancer.We first calculated the linkage strength between drug targets and pancreatic cancerassociated genes for two FDA-approved drugs:axitinib and erythromycin(Figure 3D).Axitinib targetsFLT4,FLT1,andKDR,among whichKDRwas identified as a pancreatic cancer master gene by NOGEA.The DDE score of axitinib to pancreatic cancer was 37.6,suggesting that targets of axitinib are more closely related to pancreatic cancerassociated genes than expected by chance.Conversely,the DDE score of erythromycin (whose efficacy remains unknown)to pancreatic cancer was 1.1.Even though this drug inhibitsABCB1,ALB,andKCNH2,they are not closely related to pancreatic cancer-associated genes than expected by randomly selecting gene sets.However,some drugs that do not directly inhibit the pancreatic cancer-associated master genes may still have the potential to be effective drugs.For example,sirolimus,which is currently in phase II clinical trials,targets three proteins (FKBP1A,FGF2,andMTOR) but no known pancreatic cancer-associated genes.Nevertheless,sirolimus had a high DDE score of 12.1 to pancreatic cancer due to the relatively strong perturbation of high-entropy genes such asCD44andEGFR(Figure 3E)viaFGF2.Drugs(e.g.,pravastatin,DDE=-0.7)were predicted to be ineffective pancreatic cancer drugs due to their weak perturbation of nearly all pancreatic cancer-associated genes(Figure 3E).Collectively,these results suggest that NOGEA may be capable of identifying the core genes among many DAGs that provide the basis for rational drug discovery.

Screening of potential drugs for pancreatic cancer treatment

Due to the encouraging performance of the DDE metric for accurately inferring drug-disease associations,we screened potentially effective drugs for pancreatic cancer treatment.We first calculated and prioritized DDE scores for all FDAapproved drugs (Tables S13 and S14).From top 10% of these drugs,we selected 19 molecules that were not known to be associated with pancreatic cancer for further experimental validation.The half-maximal inhibitory concentration(IC50)of a molecule,an important metric to measure its response to a certain cancer cell line,has been widely applied in the screening of potential anti-proliferative agents in preclinical cancer pharmacogenomics.The BxPC3 human pancreatic cancer cell line,which has been frequently used in the studies of pancreatic cancer and screening of chemo preventive agents[52],was used in ourin vitrostudy to evaluate its response to the candidate drugs.We identified 11 candidate drugs that inhibited BxPC3 cell line in a dosedependent manner and exhibited low IC50values(＜100 μM;Figure 4A-C,Figure S9),demonstrating their efficacies for inhibiting pancreatic cancer cell proliferation and potentials for pancreatic cancer therapyin vivo.One drug for example,vinorelbine,is a drug that has already been approved for non-small-cell lung cancer treatment [53].In our study,vinorelbine exhibited a low IC50value of 1.55 nM (Figure 4A).Interestingly,some non-classical anti-cancer drugs also displayed acceptable suppressive effects on BxPC3.For example,saquinavir(mainly used with other medications for HIV/AIDS treatment or prevention [54]) and celecoxib(mainly used for treatment of pain and inflammation in adults[55]),showed low IC50values of 22.63 μM(Figure 4B)and 45.36 μM (Figure 4C),respectively.These results indicate that our model has the capacity to predict proper drug candidates for disease therapy.

Figure 4 Screening of potential drugs for pancreatic cancer treatment

Transcriptional expression analysis was conducted to validate our hypothesis that efficient drugs tend to perturb the master genes directly or through their targets.We first identified 1335 differentially expressed genes (DEGs)after saquinavir treatment (referred to as SAQDEGs)(Figure S10A;Table S15).Then,we identified 849 most possibly affected pancreatic cancer-associated master genes aftersaquinavir treatment(named as SAQPAGs;Table S15),and further incorporated them with their corresponding neighbor genes in the interactome.Finally,a hypergeometric test was used to assess the overlap between SAQDEGs and SAQPAGs.The results showed that the SAQDEGs were significantly enriched for SAQPAGs(P＜0.01,Figure 4D).Results for celecoxib treatment were similar to those for saquinavir treatment (Figure 4E,Figure S10B),suggesting a close relationship between genes perturbed by the efficient drugs and the local module of master genes.

Finally,to demonstrate the reliability of the DDE approach for extensive screening of pancreatic cancer candidate drugs,we further conducted a literature mining analysis to evaluate the therapeutic potential of the top 10%FDA-approved drugs ranked by DDE scores (drugs without clear pharmacological category were excluded;n=108)as described in our previous report[56](see Method).We observed that 9 of the top 10 selected drugs were antineoplastic agents (ANAs) and showed significant correlation with pancreatic cancer (P＜0.01,Table S16).In addition,most selected drugs belonging to ANAs (27/31,87.1%) were significantly associated with pancreatic cancer (Figure 4F;Table S16),suggesting the sensitivity of this model.Interestingly,an analysis of the categories of these candidate drugs revealed that the largest proportion(41/108,38.0%) was assigned to central nervous system agents (CNSAs) (Figure 4F).For example,celecoxib,which is sensitive to the BxPC3 cell line as mentioned above (Figure 4C),also acts as a CNSA.In general,these results indicate that DDE provides a rational strategy for drug repurposing due to its capacity to quantify drug targeting tendency in the interactome.

Conclusion

Disease phenotypes typically result from interactions among multiple complex environmental and genetic factors.The onset,development,and treatment of a disease usually involve hundreds of genes [29].In this study,we propose NOGEA for accurately inferring master genes that contribute to specific diseases by quantitatively calculating their perturbation abilities on directed disease-specific gene networks.Our results confirm that master genes are enriched in gene sets that account for disease onset and development.This may imply that at a molecular level,master genes with high entropy are the underlying start points of the disease state,impacting those redundant genes with low entropy through a directed disease-specific gene network.Interestingly,the comorbidity prediction model built using the master genes shows the best agreement with the independent clinical dataset compared to the model established using the whole disease gene set.This indicates that our method may decrease the influence of noise and improve the efficiency for extracting more important genes from massive genomic datasets.Finally,through this method,11 old drugs were newly identified and predicted to be effective for treating pancreatic cancer and then validated byin vitroexperiments.However,it remains challenging to simulate the complex contents of the tumor microenvironmentin vitro,making it difficult to comprehensively evaluate drug response using IC50.Therefore,despite our encouraging results,future work focusing onin vivovalidation before clinical use is needed.

Although the identified master genes may be important for elucidating mechanisms of disease progression and drug screening,we acknowledge that it is difficult to directly evaluate the accuracy of NOGEA for identifying master genes at this stage due to the lack of ‘gold standard’ reference datasets.Nevertheless,the availability of more personal genome data in the future will allow for construction of patient-specific networks,and NOGEA will provide new opportunities to identify patient-specific master genes and promote the development of personalized medicine.Emerging deep learning methods may become powerful techniques for exploring poly-pharmacy side effects [57] and discovering disease-gene associations [58]from massive datasets [59].Because gene entropy values can be used as novel disease feature data,we expect that integrating deep learning with NOGEA will significantly improve the accuracy for determining disease-drug or disease-disease associations.Extending the systematic approach presented here from signal drugs to multiple drugs may pave the way toward a better understanding of drug combinations.

Code availability

The source code and a detailed usage guide of NOGEA are freely available on GitHub at https://github.com/guozihuaa/NOGEA.

CRediT author statement

Zihu Guo:Methodology,Data curation,Formal analysis,Software,Investigation,Methodology,Visualization,Writing -original draft,Writing -review &editing.Yingxue Fu:Methodology,Investigation,Writing-original draft.Chao Huang:Methodology,Investigation,Visualization,Writing -original draft.Chunli Zheng:Methodology,Writing -original draft.Ziyin Wu:Validation,Visualization.Xuetong Chen:Data curation.Shuo Gao:Data curation.Yaohua Ma:Data curation.Mohamed Shahen:Writing-original draft.Yan Li:Writing-review&editing.Pengfei Tu:Investigation.Jingbo Zhu:Investigation.Zhenzhong Wang:Resources.Wei Xiao:Conceptualization,Resources,Supervision,Project administration.Yonghua Wang:Conceptualization,Resources,Supervision,Project administration,Funding acquisition.All authors have read and approved the final manuscript.

Competing interests

The authors have declared no competing interests.

Acknowledgments

This study was supported by the National Natural Science Foundation of China(Grant Nos.U1603285 and 81803960)and the National Science and Technology Major Project of China(Grant No.2019ZX09201004-001).We thank TopEdit(www.topeditsci.com)for its linguistic assistance during the preparation of this manuscript.

Supplementary material

Supplementary data to this article can be found online at https://doi.org/10.1016/j.gpb.2020.06.023.

ORCID

0000-0003-3975-3141 (Zihu Guo)

0000-0003-3052-4131 (Yingxue Fu)

0000-0003-1186-0973(Chao Huang)

0000-0001-9552-8040 (Chunli Zheng)

0000-0002-3382-2620 (Ziyin Wu)

0000-0002-0566-0740 (Xuetong Chen)

0000-0003-1857-0995 (Shuo Gao)

0000-0003-4402-5464 (Yaohua Ma)

0000-0003-3608-2581 (Mohamed Shahen)

0000-0001-8295-530X (Yan Li)

0000-0002-3848-8174 (Pengfei Tu)

0000-0002-0522-3890 (Jingbo Zhu)

0000-0002-1364-3537 (Zhenzhong Wang)

0000-0001-8809-9137 (Wei Xiao)

0000-0002-3060-1072 (Yonghua Wang)

Genomics,Proteomics & Bioinformatics

2021年4期