A Computational Approach for Modeling the Allele Frequency Spectrum of Populations with Arbitrarily Varying Size

2019-06-03HuaChen

Genomics,Proteomics & Bioinformatics 2019年6期

Hua Chen

1 CAS Key Laboratory of Genomic and Precision Medicine, Beijing Institute of Genomics, Chinese Academy of Sciences,Beijing 100101, China

2 CAS Center for Excellence in Animal Evolution and Genetics, Chinese Academy of Sciences, Kunming 650223, China

3 School of Future Technology, University of Chinese Academy of Sciences, Beijing 100049, China

Abstract The allele frequency spectrum (AFS), or site frequency spectrum, is commonly used to summarize the genomic polymorphism pattern of a sample, which is informative for inferring population history and detecting natural selection. In 2013, Chen and Chen developed a method for analytically deriving the AFS for populations with temporally varying size through the coalescence time-scaling function.However,their approach is only applicable to population history scenarios in which the analytical form of the time-scaling function is tractable.In this paper,we propose a computational approach to extend the method to populations with arbitrary complex varying size by numerically approximating the time-scaling function. We demonstrate the performance of the approach by constructing the AFS for two population history scenarios: the logistic growth model and the Gompertz growth model,for which the AFS are unavailable with existing approaches.Software for implementing the algorithm can be downloaded at http://chenlab.big.ac.cn/software/.

KEYWORDS Allele frequency spectrum;Complex demography;Population history;Population genetic inference;Coalescent

Introduction

The allele frequency spectrum (AFS, aka, the site frequency spectrum)is a series of fundamental statistics for summarizing genomic polymorphism. It is defined as the sampling distribution of allele frequencies of genetic polymorphism in a finite sample. In practice, AFS can be the number or proportion of SNPs constructed by binning them according to the counts of derived alleles.For a sample of n sequences with m identified segregating sites (polymorphic sites), AFS is written as｛（Si）， 1 ≤i ＜n｝, with=m, where Sidenotes the number of segregating sites in the sample that have i copies of derived alleles among the n haplotypes. AFS has been a main focus in theoretical and methodological studies in the past decades,since it is informative for the inference of ancient demography of populations[1]. The theoretical expectation of AFS under a given population history and parameter setting can be developed using both coalescent theory and diffusion[2-4]. Methods for ancestral inference based on AFS are then developed in a Poisson random field framework by assuming that each entry of the AFS follows a Poisson distribution with the mean equal to the theoretical expectation of the AFS given population genetic parameters[2-17].These methods are gaining popularity with the abundance of genomic sequencing data.

Coalescent theory has been applied to developing AFS in a single population with time-varying population sizes,including the exponential-growth model [7,18] and the n-epoch model,which models the population size changes using several consecutive periods (epochs) with different constant sizes [8]. Compared with the AFS developed with diffusion, the coalescentbased AFS has the advantage of being in analytical form,and the estimation is fast and accurate for small samples. In contrast,the diffusion approximation has to rely on numerical methods, such as finite difference approaches, to approximate the solutions[9,19].The coalescent-based AFS is thus very useful for the inference of past demographic history and has been extensively applied to data analysis [20-24].

One limitation of the coalescent-based AFS methods is that AFS can only be analytically derived for some simple population growth models, such as the n-epoch model and the exponential-growth model or their combinations thereof[7,8,23], and generalizations to other complex population histories are often impracticable [7,13,25]. A second limitation is that for large samples (e.g., haplotype number n ＞50), it is hard to accurately calculate the expected AFS from the formulae. The expected coalescence times ETi， 1 ≤i ＜n are essential for deriving the coalescent-based AFS,which contain coefficients in the alternating sum of the hypergeometric series and are explosively large, causing overflow for large sample sizes [26]. When the sample size is large, AFS and its derived statistics are informative for inferring recent population history. And thus, calculating AFS for large samples becomes common in population genetic inference from genomic data[23,27,28]. A high-precision arithmetic library is usually adopted to obtain accurate numerical values when analyzing larger samples, which requires tedious programming and intensive computation [8]. Some alternative solutions were proposed, specifically for AFS of a single population, e.g.,Polanski and Kimmel[7]replaced it with hypergeometric summation to avoid estimating coefficients with large values.Their approach can efficiently solve the numerical issue,but it is difficult to generalize to other scenarios with complex population histories for which the integral function in the hypergeometric summation is difficult to compute. Most studies have adopted coalescent simulations to generate a large number of samples to approximate AFS under specific demographic histories and applied them to analyze genomic polymorphism. However, this approach is computationally very intensive[14,23,27,29-31].

To address the numerical issue in large samples, Chen and Chen used the large-sample asymptotic distributions of coalescence times [32]. Griffiths proved that the coalescence times and ancestral lineage numbers asymptotically follow a normal distribution in a constant population [33]. Chen and Chen extended their forms to populations with time-varying sizes by using a time-scaling function scheme (see the ‘‘Coalescence times”subsection below;[34-36])and then used the first-order Taylor expansion approximation to achieve the coalescence times (and further AFS). They illustrated the usage of this approach by deriving a simple-form formula for AFS in populations under exponential growth,which shows high accuracy compared with simulated results.Note that the first-order Taylor expansion approximation and time-scaling function approach of Chen and Chen work for both large and small size samples[32].Technically,their approach allows to derive AFS in any populations with arbitrary complex demography.However,as illustrated in the‘‘Method”section,for some complex demographies,it is difficult to derive the analytical form of the time-scaling function and/or its inverse function, which are essential in deriving the coalescent-based AFS. In this paper,we propose a computational approach to efficiently approximate the analytical formula of the time-scaling function with a finite sum approximation, and find the set of coalescence times ETi， 1 ≤i ＜n, with the computing time being nearly constant as the sample size increases. It is applicable to any arbitrary complex history for which the time-scaling function is not tractable. This greatly extends the application of AFSbased methods in population genetic inference and other studies,e.g.,cancer evolution.We demonstrate the performance of the approach by obtaining AFS for two population history scenarios that were difficult to derive using the existing approaches: the logistic growth model and the Gompertz model.

In the following sections we first review the coalescent theory framework for obtaining AFS of a single population. We then summarize the first-order Taylor expansion approximation method for populations with time-varying size proposed by Chen and Chen[32].We illustrate the idea of the computational approach to construct AFS for arbitrary demography,and we further derive AFS for populations with two demographic histories to demonstrate its performance.

Method

Modeling framework

For a sample of n lineages (haplotypes), the coalescence time Tkis defined as the time when k+1 lineages merge into k lineages,and time is measured backward(from the present to the past). The intercoalescence time Wk=Tk-1-Tk is the time during which there are k lineages. Following Fu [2], we say that any of the k branches spanning the intercoalescence time Wkhas the branch of size k. We assume an infinitely-manysites model for mutations,and further the mutations occurring on any branch along the gene genealogy follow a Poisson process.The number of mutations occurring at any branch of size k then follows a Poisson distribution with the mean of μkE （Wk）,where μ is the point mutation rate.During the bifurcation process in which k lineages increase to n lineages at present, any of these mutations increases the allele count from a single copy to j among the n lineages with the probability[3,37]:

Summing over mutations that occur on branches with different sizes, we can obtain the entries for AFS:

Coalescence times

In a constant-size population, the distribution of coalescence times follows that of the standard Kingman’s n-coalescent,which are exponential variables with the mean where μmis the coalescence time in units of haploid population size N. In addition, the intercoalescence times are mutually independent.

For a population with time-varying size,we denote its population history as N（t），t∈[0，∞）.It is not trivial to derive the distribution or the expectation of coalescence times for a population with time-varying sizes. The joint distribution of coalescence timesfor populations with timevarying size is calcluated as described previously[3]and shown as follows.

where λ （t）=N （t）/N0is the relative size function. Polanski et al.derived the marginal probability density function of coalescence times fTmby expanding an integral transform of the marginal probability density function (PDF) into partial fractions[26].Another way to derive fTmis based on the definition of a pure-death process,in the form of a function of the ancestral lineage number,[13,38]. With the marginal distribution of coalescence times derived, Polanski and Kimmel obtained AFS for a population under exponential growth,which is in complex form, and requires calculating the hypergeometric series and exponential integral [7].

Chen and Chen[32]used the time rescaling approach in the variable-population-size coalescent model [34-36]. The coalescence time is rescaled at the rate 1/N （t）, denoted as τm:

where τmfollows the coalescence time distribution in the standard Kingman’s n- coalescent [39]. Chen and Chen [32] then used a Taylor expansion of Tm=g-1（τm）around μmto achieve the approximation:

Thus we have the mean and variance of Tm,

In general, for any population history N（t），0 ≤t ＜∞,time-scaling function g （t） can be obtained as in Equation (5),and further ETm=g-1（μm） can be obtained as shown above.Chen and Chen demonstrate the application of this approach using an exponentially growing population as an example[32]. ET for the exponential growth model is in a very simple analytical form:

with γ is the population growth rate, and the obtained AFS is highly accurate (Figure 6 of [32]).

Since it is not trivial to derive the coalescence times for populations with time-varying size in existing studies, and simulations are usually required as a replacement for most studies [14,23,27,30,31], Chen and Chen’s [32] approach provides simple and efficient solution to obtaining ETm. However, for some complex demographies, the analytical form of the time-scaling function g（t） and its inverse function,which are essential for deriving ETm, are not tractable. This prohibits the general usage of their approach for arbitrary population histories.

Coalescence times under complex demographic history

In this section, we illustrate how to extend Chen and Chen’s[32] method to be applicable to arbitrary population histories using a computational approach. As we can see from the section above, g （t） and g-1（t） are the two essential components for deriving coalescence times for a given population history N （t）[see Equation(7)].Note that to obtain ETm,the analytical form is not required for calculating an arbitrary point t. In contrast,we only need to find a finite number of Tmvalues that correspond to μm， 1 ≤m ＜n and satisfy

The following two numerical schemes are thus proposed for calculating ETm, applicable to different situations. The first approach is generally applicable to all cases, including those for which g（t） cannot be obtained; the second approach is specifically for the case in which an analytical form of g（t） is available but g-1（t） is not tractable.

Approach 1 (finite-sum approximation)

For a sample of size n under the population history N（t），t ∈[0，∞）, the integral of the time scaling function equation can be simply approximated using the discrete finite summation:

Then, for each μm, the corresponding expected coalescence times ETmcan be obtained during the following sequential summation procedures:

Step 1 Given a series of expected coalescence times under the standard n-Kingman’s coalescent1 ≤m ＜n, initialize the procedure from generation 0 (the current generation) with

Step 2 Keep increasing the discrete generation time t, and calculate G=until the value t satisfies

Step 3 Repeat Step 2, and keep increasing t to obtain the rest of the values for ETi， n-2 ≥i ≥1.

Step 4 Terminate the process when ET1is obtained.

Approach 2

For some population histories,analytical form of the time scaling function g（t） can be achieved, but the inverse function g-1（t）is not tractable.An alternative approach can be applied to obtain ETmfor such cases through the following procedures.For each Tm，1 ≤m ＜n, we have the non-linear equation,

Table 1 Procedures for calculating coalescence times using the finite-sum approximation

The non-linear equations above can be solved using numerical algorithms to obtain Tm，such as Newton-Raphson[40].In this paper, we adopt two numerical methods implemented in MATLAB. The first one is the fzero function, which implements Dekker’s algorithm as a combination of bisection,secant, and inverse quadratic interpolation methods [41]. The second is the fminsearch function, which uses the simplex search method of Lagarias and colleagues [42]. This approach usually takes more time than Approach 1, as for each coalescence time Tm, we need to solve the corresponding equation iteratively. Furthermore, the number of equations and the computational complexity increase with the sample size, and thus Approach 2 is more suitable for small samples.

Results

Various population growth models have been proposed to approximate the ancient population history of humans and other species. For example, Gazave et al. proposed a fivescenario model for the European population, including two stages of population bottlenecks and a very recent exponential growth[23]. The simple exponential population growth model may be the most commonly used model.It assumes a constant growth rate, which is valid when space and resources are unlimited. The exponential growth model is a good approximation for the early stage of humans,bacteria,and most populations. In cancer evolution studies, models with more parameters were developed to describe tumor growth [43].These models are complicated by modifying growth rates with carrying capacity or other factors, e.g., the logistic growth model and Gompertz model.

In this section, exponential, logistic, and Gompertz growth models are used to illustrate the usage of our proposed approach. For the exponential growth model with a growth rate γ, N（t）=N0e-rt, it is straightforward to analytically derive the expected coalescence times ETm[Equation(9)].Running time using the three approaches (including the analytical approach,the finite-sum approximation,and Approach 2)was then compared for the model with the two parameters N0=100，000 and γ=0.003. For Approach 2, two numerical methods were adopted: the bisection+interpolation method implemented in the MATLAB function fzero and the downhill simplex method implemented in the MATLAB function fminsearch. The running time was averaged over 1000 repeats run in MATLAB and is presented in Table 2. The detailed results for the logistic growth and Gompertz model are elaborated below.

AFS of the logistic growth model

Compared to the exponential growth model, the logistic growth model regulates the growth rate with a factor, in which Nk is the carrying capacity. It thus has a sigmoid shape and reaches an equilibrium size of Nkinstead of unlimited growth (Figure 1A). A logistic growth model is consistent with the population dynamics of many organisms and is widely used in ecological research. Let γ be the maximum population growth rate (aka, intrinsic growth rate), for a population under logistic growth,the population dynamics is described by the differential equation as below.

Table 2 Comparison of running time between different methods for three population growth models

Note that in the equation above, time is measured forward(from the past to the present), and denoted with t～to distinguish it from the backward time in other sections.The population size N t～follows a logistic curve,

After changing the variable of forward time t～to backward time t,

and the model includes three free parameters: Nk, γ, and T.

Given the population history function N （ t）, the timescaling function for the logistic growth model can be derived as follows,

and further obtained its inverse function,

where W（·） is the Lambert W function, which is calculated numerically.

According to Chen and Chen[32],the expected coalescence time ETm=g-1（μm） can be obtained from Equation (17),which can also be calculated through Approaches 1 and 2 as described in the previous section. AFS generated from Equation (17) (‘‘Analytical”) and Approach 1 (‘‘Approach 1”)for Nk=10，000, T=5000 at three different growth rates γ=0.003， 0.006， and 0.015 were shown in Figure 1B-D. In addition, AFS was also obtained using Approach 2, and the comparison of the running time for a specific parameter setting(Nk=10，000, T=5000, and γ=0.005) for three approaches was listed in Table 2.

It can be seen that the AFS obtained by finite-sum approximation (Approach 1) is very close to that from the analytical approach (Figure 1B-D). The differences in AFS obtained using Approach 1 and that obtained using the analytical approach were further quantified by plotting1 ≤i ≤50 for each entry of the AFS(Figure 1E).The resulting values are within the range of [-0.02， 0.02],confirming the accuracy of the approximation using Approach 1.

AFS of the Gompertz growth model

The Gompertz model is another widely used model to approximate population dynamics. It was originally proposed to explain human mortality [44] and is also used to describe the population growth of other species, including bacteria, animals, and plants [45]. The Gompertz model was found to fit well with the growth of breast cancer and 19 other tumor cell populations [46-48]. One of its forms is

And the solution of the differential equation is

Figure 1 The allele freqeuncy spectra of the logistic growth modelA.The population size as a function of time.B.-D.AFS of the logistic growth model for three growth rates(γ)of 0.003(B),0.006(C),and 0.015 (D), respectively, with the carrying capacity Nk =10，000 and initial time T=5000. E. The relative errors of the AFS from the computational approach compared to the analytical results for the three growth rates of 0.003(B),0.006(C),and 0.015(D),respectively.AFS, allele frequency spectrum.

Figure 2 The allele frequency spectra of the Gompertz modelA.The population growth rate as a function of time.B.The population size as a function of time.C.-H.AFS of the Gompertz model for six settings with different combination of growth rate(γ)and its exponential decay rate(α):γ=0.01,α=0.0005(C);γ=0.01,α=0.001(D);γ=0.02,α=0.004(E);γ=0.03,α=0.004(F);γ=0.05,α=0.004(G),and γ=0.05,α=0.008(H),with initial population size N0 =1 and time T=5000.

where γ is the initial growth rate; N0is the initial population size when it started to grow; and α can be viewed as the exponential decay rate of the growth rate.

It is unfeasible to derive the time-scaling function g（t） and its inverse function g-1（t）for the Gompertz model.Therefore,there is no analytical calculation or numerical solution(Approach 2) of the coalescence times for the Gompertz model. In Figure 2A and B, the growth rates and population size trajectories as a function of time were shown for six parameter settings: γ=0.01， α=0.0005; γ=0.01， α=0.001;γ=0.02， α=0.004; γ=0.03， α=0.004; γ=0.05， α=0.004;and γ=0.05，α=0.008. The corresponding AFS for n=50 haplotypes at these six parameter settings are presented in Figure 2C-H. The running time of Approach 1 for a specific parameter setting (T=5000, r=0.01, α=0.001 and N0=1)and with different sample sizes (10, 50, 100, and 500) is presented in Table 2.

Comparison of computing time of different approaches

We compared the computing times to construct the coalescence times Tm， 1 ≤m ＜n， using Approach 1 (finite-sum approximation), Approach 2, and the analytical approach.For Approach 2, we used two methods for solving the nonlinear equations, including the combination of bisection and interpolation (bisection+interpolation; implemented in the MATLAB function fzero) and the downhill simplex (implemented in fminsearch) methods. All the comparisons are run in MATLAB for three population growth models: the exponential growth, logistic growth, and Gompertz growth model.The running time for constructing the coalescence times was recorded for four sample sizes (n=10， 50， 100, and 500) and averaged over 1000 repeats, as listed in Table 2 (in seconds).

A trend in Table 2 worth noting is that the finite-sum approximation runs very fast. The running time of finite-sum approximation is close to that of the analytical calculation,nearly of the same magnitude, and much shorter than that of numerical approaches (Approach 2). The only outlier is the logistic model, for which the finite-sum approximation runs much faster than the analytical approach. This is because the analytical form of the g（t）function for the logistic model consists of the Lambert W function, which is calculated numerically and is time-consuming.

Second, the running time of the finite-sum approximation approach is nearly constant with increasing sample size n. As we mentioned above, the computational complexity is O （1）,and thus, it is insensitive to the sample size. This guarantees the computational efficiency of the approach when the sample size becomes large, enabling its application to large-sample data analysis.

The numerical approach for solving the g（t） function(Approach 2) also works efficiently but is more timeconsuming than the finite-sum approximation approach for all three population growth models.Furthermore,the running time increases with the sample size n,as the number of nonlinear equations to solve increases linearly with n.

Conclusion

AFS is informative for population genetic inference. Various AFS-based methods have been developed for inferring population histories and detecting natural selection in the past years.They have gained popularity with the abundance of genomic sequencing data (e.g., [3,5,7-10,49]). Compared with the diffusion-based AFS methods that require approximation of the solutions with numerical approaches,modeling AFS using coalescent theory is computationally efficient. Most population genetic inference methods using the coalescent likelihood require computationally intensive algorithms for parameter estimation, such as importance sampling or Markov chain Monte Carlo, while the coalescent-based AFS methods only depend on the expected coalescence times, which guarantee the analytical form [2,3,13].

The coalescent-based AFS methods have shortcomings.First, for large samples it is impossible to obtain accurate calculations due to numerical overflow of large coefficients in the hypergeometric series. Second, it is difficult to derive the coalescent-based AFS for complex population histories,which limits its application to simple growth models, such as the exponential growth and n-epoch models. Chen and Chen [32]showed that for complex demography, we can obtain the expected coalescence times through a linear Taylor expansion approximation, which involves the time-scaling function g（t）and its inverse function g-1（t）.The analytical equations of coalescence times derived using this approach are in a simple form and can successfully overcome the numerical issue for large samples. Furthermore, the time-scaling scheme is technically applicable to arbitrary complex population histories.However,in practice, the analytical forms of the time-scaling function g（t）and its inverse function are not achievable for many cases,limiting the applications. For example, in the study of cancer cell growth, various population growth models in complex form were proposed to describe the dynamics of cancer cells[43],for which the analytical form of AFS is difficult to derive.In this paper, we propose a computational approach, the finite-sum approximation,which efficiently solves the problem of Chen and Chen [32] when the analytical form of the timescaling function g（t） and its inverse function g-1（t） are not derivable.

We apply the computational approach to three widely used models, including the exponential, logistic, and Gompertz growth models to demonstrate its performance. As shown in the Results section, the finite-sum approximation approach is computationally very efficient, and the running time is nearly on the magnitude of that of the analytical approach. Furthermore, the computational time does not increase linearly with the sample size,ensuring its efficiency for AFS of large sample sizes. This is especially attractive for the flexibility to tackle a complex population history that is intractable by using the analytical approach,for example,the Gompertz growth model shown in Table 2. The computational approach presented in this paper is applicable to single populations with arbitrary complex varying size and significantly enables the application of the coalescent-based AFS approaches to population genetic inference in the genomic sequencing era. However, we should note that using the proposed computational approach to model the joint AFS of multiple populations with arbitrary population size changes and gene flows remains challenging and will be addressed in future work.

Availability

Software for implementing the algorithm can be downloaded from the lab webpage at http://chenlab.big.ac.cn/software/.

Author’s contributions

HC designed the study,developed the method,and performed the analysis. HC wrote the manuscript and approved the final manuscript.

Competing interests

The author has declared no competing interests.

Acknowledgments

I am grateful to Dr.Kun Chen for helpful discussions,to Shilei Zhao for helping with the numerical methods in MATLAB,to Di Wei for converting the manuscript from LaTeX to Word,and to the reviewers and editor for their valuable comments.This project was supported by the National Natural Science Foundation of China (Grant Nos. 91731302, 31571370, and 91631106), the National Key R&D Program of China (Grant No. 2018YFC1406902), the Strategic Priority Research Program of the Chinese Academy of Sciences (Grant No.XDB13000000), Shanghai Municipal Science and Technology Major Project (Grant No. 2017SHZDZX01), and the ‘‘100-Talent” Program of the Chinese Academy of Sciences, China.

Genomics,Proteomics & Bioinformatics

2019年6期