APP下载

The improved FASTmrEMMA and GCIM algorithms for genome-wide association and linkage studies in large mapping populations

2020-10-21YngjunWenYwenZhngJinZhngJinyingFengYunmingZhng

The Crop Journal 2020年5期

Yngjun Wen,Ywen Zhng,Jin Zhng,Jinying Feng,Yunming Zhng,*

aState Key Laboratory of Crop Genetics and Germplasm Enhancement,Nanjing Agricultural University,Nanjing 210095,Jiangsu,China

bCrop Information Center,College of Plant Science and Technology,Huazhong Agricultural University,Wuhan 430070,Hubei,China

ABSTRACT

1.Introduction

There is an increasing interest in using multi-locus models to test for the association between marker and trait in genomewide association studies(GWAS)and linkage analyses because of their demonstrated effectiveness in multiple testing,polygenic background,and population stratification controlling[1–3].Among these multi-locus methods,fast multi-locus random-SNP-effect mixed linear model(mrMLM)[4],fast multi-locus random-SNP-effect efficient mixed model association(FASTmrEMMA)[5],and genome-wide composite interval mapping(GCIM)[6,7]are developed,under the framework of the mixed linear model(MLM),to further increase the detection power and reduce false positive rate in the GWAS,especially for small-effect and linked quantitative trait nucleotides(QTN).

However,fitting MLM and conducting each SNP test are computationally nontrivial and involved in the determinant and matrix inversion computations.This results in the significant increase of running time for each marker scanning,especially in large mapping population.To address this issue,Kang et al.[8]proposed the efficient mixed-model association eXpedited(EMMAX)algorithm and Zhang et al.[9]developed the compressed MLM(CMLM)and population parameters previously determined(P3D)algorithms.Thereafter,a series of fast algorithms have been proposed,such as,multiple loci linear mixed model(MLMM)[10],GRAMMAR-Gamma[11],factored spectrally transformed linear mixed model(FaST-LMM)[12],FaST-LMMSelect[13],genome-wide efficient mixed-model association(GEMMA)[14],enriched CMLM(ECMLM)[15],the settlement of MLM under progressively exclusive relationship(SUPER)[16],fast Bayesian mixed-model association(BOLT-LMM)[17],fixed and random model circulating probability unification(FarmCPU)[18],and fastGWA[19].Recently,Zhang et al.[20]used the GEMMA algorithm and Miller's matrix transformation[21]to improve the computation efficiency of Wang et al.[4].This is FASTmrMLM.However,the running time of FASTmrEMMA[5]increases with the quadratic of the number of individuals(n),indicating long running time in large mapping population.

In this article,some matrix identities were used to replace the spectral decomposition in each marker scanning of the FASTmrEMMA algorithm.This replacement substantially reduces the running time of each marker scanning,especially in large mapping population.The results from three real data analyses and a series of Monte Carlo simulation experiments validated the improved FASTmrEMMA algorithm.In addition,the new algorithm was also extended to the GCIM algorithm of Wen et al.[7].

2.Materials and methods

2.1.Methods

2.1.1.Genetic model

The n×1 phenotypic vector of quantitative trait,y,for all the individuals(or lines)in mapping population may be described by the following model

2.1.2.The FASTmrEMMA algorithm

FASTmrEMMA[5]is an existing fast multi-locus two-stage GWAS approach.In the first stage,SNP effect was treated as random and a new matrix transformation was used to whiten the covariance matrix of the polygenic matrix and the environmental noise.Then,a polygenic-to-residual variance ratio under the null hypothesis was fixed in all the single marker genome tests.In the second stage,all the selected SNP effects in the first stage were placed into one multi-locus model,all the effects were estimated by expectation–maximization empirical Bayes,and non-zero effects were further detected by likelihood ratio test for true QTN identification.

To estimate parameterλβ,here the Newton-Raphson(NR)algorithm was used to obtain the ML and REML estimates

where the first derivatives were

and the second derivatives were

The best linear unbiased prediction(BLUP),^β=E(β|yc),is

It was found from the above formula that the eigenvectors URand(η1,…,ηv)Tshould be calculated for each marker scanning in the FASTmrEMMA algorithm.This makes computation time consuming,especially for large mapping population.To overcome this issue,in this study we proposed an improved algorithm without calculating the eigenvectors.

2.1.3.The improved FASTmrEMMA algorithm

In this study,the matrix identities and properties,such as the Woodbury Matrix identity,are used to infer the corresponding formulas without matrix inversion and eigenvectors.The related formulas are briefly described here;for technical detail the readers are referred to Supplementary Methods.Based on the FASTmrEMMA algorithm,we concluded that the ML and REML estimates forλβare equivalent to maximizing the following target functions:

To estimate the parameterλβ,we still used the formula(5).However,the first derivatives(6)and(7)in formula(5)were respectively replaced by

and the second derivatives(8)and(9)in formula(5)were respectively replaced by

Once the estimates forλβwas obtained,the BLUP forβwas

Under the null hypothesis,the above D follows approximately a mixture of twoχ2distributions and the P-values can be calculated accordingly from the literature[5].

To understand the improved algorithms,the major differences of the squared form,β,inverse matrix,testing statistic and running time between original and improved algorithms are summarized in Table 1.The improved algorithm can be implemented in R[20].

Table 1–Expressional com parisons of the squared form,β,inverse m atrix,testing statistic and running tim e between original and improved algorithms.

2.1.4.The improved GCIM algorithm

According to the framework of the GCIM method[6],the FASTmrEMMA algorithm was extended to detect small and linked QTL in F2[7].First,the FASTmrEMMA algorithm is used to separately conduct genome scanning for additive or dominant effects in F2.For each kind of effect,all the peaks of negative logarithm P-value curve are viewed as potential QTL.Then,all the potential QTL are included into one multilocus model,all the effects were estimated by adaptive lasso,and all the non-zero effects were further identified by LRT for true QTL identification.

The main difference between the origin and improved GCIM is that the FASTmrEMMA in the GCIM is replaced by the improved FASTmrEMMA.The others are the same as those in GCIM.

The improved GCIM can be implemented in R[22].

2.2.Monte Carlo simulation experiments

2.2.1.Genome-wide association studies

2.2.2.Linkage analyses in F2

Each sample was analyzed by the GCIM-random and its improved algorithm.For each simulated QTL,we counted the samples in which the LOD score had passed 2.5.A detected QTL within 5 cM of the simulated QTL was considered to be a true QTL.

2.3.Real datasets in Arabidopsis thaliana,rice,and maize

To validate the improved FASTmrEMMA method,three real datasets in Arabidopsis thaliana[23]and rice[20,24,25]were reanalyzed in this study.The datasets I and II were used to confirm faster running time of the improved method as compared with original method,while the dataset III was used to confirm the advantage of the improved method over widely-used approaches(FarmCPU[18],EMMAX[8],GEMMA[14],and TASSEL[26]).In the dataset I,the phenotypic and genotypic observations were downloaded from http://www.arabidopsis.usc.edu/.The datasets include a total of 199 Arabidopsis lines,each line has 216,130 SNP markers and four flowering time traits:days to flowering under long days(LD),days to flowering under short days with vernalization(SDV),leaf number at flowering with 8 weeks vernalization,greenhouse(8W GH LN),and days to flowering,8 weeks vernalization,greenhouse(8W GH FT).The others were found in the literature[5,23].In the dataset II,the phenotypic and genotypic observations were downloaded from the Rice SNPSeek Database.In the large datasets,the number of phenotypic accessions was 2262,the number of markers was 1.01 million,and the trait of interest was grain width(GW).The others were found in the literature[20,24].In the dataset III,the phenotypic and genotypic observations were downloaded from www.ricediversity.org/44kgwas.In this dataset,the number of phenotypic accessions was 413,the number of markers was 44 K,and the trait of interest was plant height.The others were found in the literature[25].

To validate the improved GCIM,two real datasets in rice[27]and maize[28]were re-analyzed in this study.The dataset IV was used to confirm faster running time of the improved method as compared with original method,while the dataset V was used to confirm the advantage of the improved method over widely-used approaches(ICIM[29],CIM[30],and MCIM[31]).In the dataset IV,the phenotypic and genotypic observations were downloaded from http://www.pnas.org/content/suppl/2012/09/07/1214141109.DCSupplemental.The dataset includes a total of 278 IMF2individuals,and each individual has 1619 bin markers and four yield related traits:yield per plant(YIELD),tillers per plant(TILLER),grains per panicle(GRAIN),and thousand grain weight(KGW).The others were found in the literature[7,27].In the dataset V,the phenotypic and genotypic observations were downloaded from https://github.com/mdzievit/Meta_QTL-LAMapping-Paper.The dataset includes a total of 125 F2plants,and each plant has 2710 bins and the trait of interest was leaf angle.The others were found in the literature[28].

The analysis of dataset II[24]was implemented on the computer(an Intel(R)Xeon(R)Gold 6130 CPU@2.10 GHz,64 CPU cores and 629G memory)[20],while the other real and simulated data analyses were implemented on the computer(an Intel Core i7-7700 2.80 GHz CPU 16G).

2.4.Availability

Improved FASTmrEMMA and GCIM algorithms have been implemented in R and their software packages(mrMLM,mrMLM.GUI,QTL.gCIMapping,and QTL.gCIMapping.GUI)can be freely available from R(https://cran.r-project.org/web/packages)or BioCode(https://bigd.big.ac.cn/biocode/categories)[20,22].The computer memory should be at least 8G.

3.Results

3.1.Computational efficiency

3.1.1.Comparison of FASTmrEMMA with its improved method

To validate the computational efficiency for the improved FASTmrEMMA algorithm,a series of Monte Carlo simulation experiments,along with real data analyses,were conducted.Each sample was analyzed by origin and its improved methods.As a result,the running times for the first to third simulation datasets were 1.707, 1.722, and 1.754 h, respectively, for the improved algorithm, and 5.360, 5.722, and 5.159 h, respectively, for its original method (Table 2). This means that the improved method takes approximately 68.15%, 69.91%, and 66.00% less than its original method,respectively, for the 1st to 3rd simulation experiments.

In addition, we re-analyzed two groups of real datasets in rice [24] and A. thaliana [23]. In the dataset II, the improved algorithm took 11.58 h, whereas FASTmrEMMA took 31.68 h[20]. This indicates that the improved method takes approximately 63.45% less than its original approach (Table 2). In the dataset I, the improved method took 1.28, 1.16, 1.18, and 1.14 min, respectively, for LD, SDV, 8W GH LN, and 8W GH FT,whereas FASTmrEMMA took 3.26, 3.17, 3.19, and 3.10 min,respectively, for the above four traits (Table 2). This indicates that the improved method took about 60.74%, 63.41%,63.01%, and 63.23% for the above four traits less than its original method.

Table 2–Comparison of running times for real data analyses in Arabidopsis thaliana and rice and simulation studies using original and improved algorithms.

3.1.2. Comparison of GCIM with its improved GCIM

To validate the computational efficiency for the improved GCIM method, a series of Monte Carlo simulation experiments, along with real data analyses, were performed. As a result, the improved GCIM method took 4.690, 3.170, and 4.579 h, respectively, for the first to third simulation datasets,whereas the original method took 6.204, 4.773, and 6.161 h,respectively, for the three datasets (Table 3). This indicates that the improved method took about 24.40%, 33.58%, and 25.68% for the above three simulation datasets less than its original method.

In addition, we re-analyzed one group of real datasets in rice [27]. In the rice IMF2datasets, the improved algorithm took 25.65, 23.52, 24.46, and 19.82 s, respectively, for the four traits YIELD, TILLER, GRAIN and KGW, whereas GCIM-random took 37.58, 30.34, 30.73, and 24.94 s, respectively, for the above four traits (Table 3). This indicates that the improved method took about 31.75%, 22.48%, 20.40%, and 20.53% for the above four traits less than its original method.

3.2. Power, accuracy, false positive rate (FPR) and false negative rate (FNR)

As described above, the improved approaches significantly decreased the running times, especially for large mapping populations. However, the more important things for the improved algorithms are the statistical power, false positive rate (FPR) and false negative rate (FNR) for QTN detection, and the accuracy for the estimates of QTN effect and positions. Allthe results are listed in Tables S1–S3. We found the same results between original and improved methods.

Table 3–Comparison of running times for rice real and simulated data analyses using original and improved GCIM.

3.3. Comparison of the improved approaches with widely-used approaches

The dataset III was re-analyzed by the improved FASTmrEMMA and widely-used (FarmCPU, EMMAX, GEMMA and TASSEL)approaches. As a result, 13, 7, 1, 1, and 0 QTN were found,respectively, by the above five approaches to be significantly associated with plant height (Tables 4 and S4). At this case, we established the linear regression of plant height on all the QTN from each approach. The corresponding Akaike information criterion (AIC) values for the above five methods were 3004.57,3257.16, 3389.04, 3389.04, and 3425.38, respectively, indicating the best model fit of the improved method (Table 4). Around these significant QTN, 4, 0, 1, 1, and 0 were found to be closely linked with previously reported genes (Tables 4 and S4), indicating the good performance of the improved method. In addition, the improved method took 84.20 s, while its original method took 279.42 s (Table 4). This is similar to the results in Table 2.

The dataset V was re-analyzed by the improved and widely-used (ICIM, CIM and MCIM) methods. As a result, 10,4, 5, and 3 QTL were found, respectively, by the above four approaches to be significantly associated with leaf angle in maize (Tables 4 and S5). At this case, we also established the linear regression of maize leaf angle on all the QTL from each approach. The corresponding AIC values for the above four methods were 844.93, 867.14, 873.71, and 869.96, respectively, indicating the best model fit of the improved method (Table 4). Around these significant QTL,3, 1, 0, and 0 were found to be closely linked with previously reported genes (Tables 4 and S5), indicating the good performance of the improved method. In addition,the improved method took 15.06 s, while its original method took 25.99 s (Table 4). This is similar to the results in Table 3.

4.Discussion

Table 4–Comparison of QTN/QTL detection via the improved and widely-used algorithms.

In the GCIM,adaptive lasso is used to obtain significant QTL from all the potential QTL.In adaptive lasso,the parameterλis estimated from cross-validation experiment,which needs one random seed.In theory,the random seed affects the results of QTL mapping via the GCIM method.Thus,we add the seed information(Table S6)and a user option for the selection of random seed in our updated version of the GCIM software,although we have set one default value.

In single-locus GWAS,Bonferroni correction is frequently used to control false positive rate.However,this stringent correction results in the exclusion of important loci.To address this issue,our multi-locus GWAS methodologies have been proposed.In these approaches without Bonferroni correction,our simulation studies have confirmed their low false positive rates[1,20].Meanwhile,our simulation studies in the literatures[6,7]confirmed the low false positive rate of the GCIM algorithm.These significant QTN/QTL can be used to mine candidate genes for the traits of interest via multiomics data analyses.

Although our approaches have high statistical powers in QTN detection,high accuracies in the estimation of QTN positions and effects,and low false positive and negatives(as compared with other approaches)(Tables S1–S3),the running time is relatively long while sample sizes are more than tens of thousands and the number of markers is more than one million.In this case,some new techniques should be incorporated into our methods in near future.

5.Conclusions

Improved FASTmrEMMA and GCIM algorithms are proposed in this study with the help of the matrix identities and properties.In the new algorithms,it is unnecessary to calculate the eigenvectors,and this substantially reduces each marker computation time.Meanwhile,there are no changes in the statistical power,accuracy and false positives and negatives between original and improved algorithms.In addition,real data analyses indicate the advantages of the improved methods in QTN/QTL detection over widely-used methods.

Declaration of competing interest

The authors declare no conflict of interest.

Acknowledgments

The work was supported by the Fundamental Research Funds for the Central Universities(KJQN201849 and 2662020ZKPY017),National Natural Science Foundation of China(31701071,31871242 and 31571268),Huazhong Agricultural University Scientific&Technological Self-innovation Foundation(2014RC020),and State Key Laboratory of Cotton Biology Open Fund(CB2019B01).

Author contributions

Yuanming Zhang conceived and designed the study.Yangjun Wen and Yuanming Zhang developed the improved algorithms and wrote the manuscript.Yangjun Wen,Yawen Zhang,Jin Zhang,Jianying Feng,and Yuanming Zhang analyzed the simulated and real datasets.All authors have read and approved the final manuscript.

Appendix A. Supplementary data

Supplementary data for this article can be found online at https://doi.org/10.1016/j.cj.2020.04.008.