7 Methods

This section is only intended as a quick reference guide. The primary literature should be consulted for further information about the methods implemented in Genepop.

7.1 Null alleles

When apparent null homozygotes are observed, one may wonder whether these are truly null homozygotes, or whether some technical failure independent of genotype has occurred. Maximum likelihood estimates of null allele frequency, or of this frequency jointly with the failure rate, can be obtained by the EM algorithm (Dempster, Laird, and Rubin 1977; Hartl and Clark 1989; Kalinowski and Taper 2006), which is one of the methods implemented in Genepop (menu option 8.1).
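The EM logic for the simpler of these two models (apparent null homozygotes taken as true nulls, no failure rate) can be sketched as follows. This is an illustrative reimplementation under Hardy-Weinberg assumptions, not Genepop's code; the function name and data layout are invented for the example.

```python
import numpy as np

def em_null_allele(het_counts, hom_counts, n_blank, tol=1e-10, max_iter=10000):
    """EM estimate of null allele frequency under HWE (illustrative sketch).

    het_counts: dict mapping (i, j) with i < j to counts of observed
                heterozygotes for visible alleles i and j
    hom_counts: list, hom_counts[i] = count of apparent homozygotes for
                visible allele i (each may be A_i A_i or A_i/null)
    n_blank:    number of apparent null homozygotes, assumed true nulls
    """
    k = len(hom_counts)
    n = sum(het_counts.values()) + sum(hom_counts) + n_blank  # individuals
    # initialize with uniform frequencies over the k visible alleles + null
    p = np.full(k, 1.0 / (k + 1))
    r = 1.0 / (k + 1)
    for _ in range(max_iter):
        # E-step: expected allele counts given current frequencies
        counts = np.zeros(k)
        null_count = 2.0 * n_blank  # each blank contributes two null alleles
        for (i, j), c in het_counts.items():
            counts[i] += c
            counts[j] += c
        for i, c in enumerate(hom_counts):
            # P(genotype is A_i/null | apparent i-homozygote)
            #   = 2 p_i r / (p_i^2 + 2 p_i r) = 2r / (p_i + 2r)
            w_null = 2.0 * r / (p[i] + 2.0 * r)
            counts[i] += c * (2.0 - w_null)
            null_count += c * w_null
        # M-step: frequencies = expected allele counts / 2n
        p_new, r_new = counts / (2.0 * n), null_count / (2.0 * n)
        if abs(r_new - r) + np.abs(p_new - p).sum() < tol:
            p, r = p_new, r_new
            break
        p, r = p_new, r_new
    return p, r
```

With a single visible allele, the fixed point reduces to the square root of the blank frequency, which provides a simple sanity check of the sketch.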

Also implemented is a simpler estimator defined by Brookfield (1996) for the case where apparent null homozygotes are true null homozygotes. He also described this as a maximum likelihood estimator, but there are some (often small) differences from the ML estimates derived by the EM algorithm as implemented in this and previous versions of Genepop. These may be due to the fact that Brookfield wrote a likelihood formula for the numbers of apparent homozygotes and heterozygotes, while the EM implementation is based on a likelihood formula in which apparent homozygotes and heterozygotes for different alleles are distinguished.

For the case where one is unsure whether apparent null homozygotes are true null homozygotes, Chakraborty et al. (1992) described a method to estimate the null allele frequency from the other data, excluding any apparent null homozygote. This estimator is not implemented in Genepop because, beyond its relatively low efficiency, its behavior is sometimes puzzling: for example, whenever there is no obvious heterozygote in a sample, the estimated null allele frequency is 1, whatever the number of alleles obviously present and even if only non-null genotypes are present. Actually, even if apparent null homozygotes are not true null homozygotes, their number brings some information, and it is more logical to estimate the null allele frequency jointly with the nonspecific genotyping failure rate by maximum likelihood (Kalinowski and Taper 2006). This analysis is possible when at least three alleles are obviously present.

7.2 Exact tests

The probability of a sample of genotypes depends on allele frequencies at one or more loci. In the tests of Hardy-Weinberg (HW) equilibrium, population differentiation and pairwise independence between loci (“linkage equilibrium”) implemented in Genepop, one is not interested in the allele frequencies themselves and, since these are unknown, the aim is to derive valid conclusions whatever their values. In these different cases, this can be achieved by considering only the probability of samples conditional on observed allelic counts (e.g. for HW tests) or genotypic counts (e.g. for tests of population differentiation not assuming HW equilibrium). Because exact probabilities are computed, these conditional tests are also known as exact tests. See Cox and Hinkley (1974) and Lehmann (1994) for the underlying theory; a much more elementary introduction to the tests implemented in Genepop is Rousset and Raymond (1997).

7.3 Algorithms for exact tests

Conditional tests require in principle the complete enumeration of all possible samples satisfying the given condition. In many cases this is not practical, and the \(P\)-value may be computed by simple permutation algorithms or by more elaborate Markov chain algorithms, in particular the Metropolis-Hastings algorithm (Hastings 1970). The latter algorithm explores the universe of samples satisfying the given condition in a “random walk” fashion. For HW testing Guo and Thompson (1992) found a Metropolis-Hastings algorithm to be efficient compared to permutations. A slight modification of their algorithm is implemented in Genepop. Guo and Thompson also considered tests for contingency tables (Technical report No. 187, Department of Statistics, University of Washington, Seattle, USA, 1989) and again a slightly modified algorithm is implemented in Genepop (Raymond and Rousset 1995a). A run of the Markov chain (MC) algorithms starts with a dememorization step; if this step is long enough, the state of the chain at the end of the dememorization is independent of the initial state. Then, further simulation of the MC is divided into batches. In each batch a P-value estimate is derived by counting the proportion of time the MC spends visiting sample configurations more extreme (according to the given test statistic) than the observed sample. If the batches are long enough, the P-value estimates from successive batches are essentially independent of each other and a standard error for the P-value can be derived from the variance of per-batch P-values (Hastings 1970). As could be expected, the longer the runs, the lower the standard error.
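The dememorization/batch scheme can be illustrated on the simplest contingency-table case, a 2x2 table with fixed margins, where the exact P-value is also available by complete enumeration for comparison. This is a toy sketch, not Genepop's algorithm; all names are hypothetical.

```python
import math
import random

def hypergeom_pmf(a, r1, r2, c1):
    """Probability of cell count a in a 2x2 table with row margins
    (r1, r2) and first column margin c1, under fixed margins."""
    return (math.comb(r1, a) * math.comb(r2, c1 - a)) / math.comb(r1 + r2, c1)

def mc_fisher_pvalue(a_obs, r1, r2, c1, n_batches=20, batch_len=5000,
                     demem=2000, seed=1):
    """Metropolis-Hastings estimate of the 2x2 probability-test P-value,
    with a batch standard error as described in the text."""
    rng = random.Random(seed)
    lo, hi = max(0, c1 - r2), min(r1, c1)  # admissible range for cell a
    p_obs = hypergeom_pmf(a_obs, r1, r2, c1)

    def step(a):
        # symmetric proposal a -> a +/- 1; Metropolis acceptance ratio
        prop = a + rng.choice((-1, 1))
        if prop < lo or prop > hi:
            return a
        ratio = hypergeom_pmf(prop, r1, r2, c1) / hypergeom_pmf(a, r1, r2, c1)
        return prop if rng.random() < ratio else a

    a = a_obs
    for _ in range(demem):  # dememorization: forget the starting state
        a = step(a)
    batch_means = []
    for _ in range(n_batches):
        hits = 0
        for _ in range(batch_len):
            a = step(a)
            # "more extreme" = configuration no more probable than observed
            if hypergeom_pmf(a, r1, r2, c1) <= p_obs + 1e-12:
                hits += 1
        batch_means.append(hits / batch_len)
    p_hat = sum(batch_means) / n_batches
    var = sum((b - p_hat) ** 2 for b in batch_means) / (n_batches * (n_batches - 1))
    return p_hat, var ** 0.5  # estimate and its batch standard error
```

For larger tables the state can no longer be summarized by a single cell, but the structure of the run (dememorization, batches, per-batch estimates, standard error from the per-batch variance) is the same.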

7.4 Accuracy of P values estimated by the Markov chain algorithms

For most data sets the MC “mixes well” so that the default values of the dememorization length and batch length implemented in Genepop appear quite sufficient [in many other applications of MC algorithms, things are not so simple; e.g. Brooks and Gelman (1998)]. Nevertheless, inaccurate P-values can be detected when the standard error is large, or else if the number of switches (the number of times the sample configuration changes in the MC run) is low (this may occur when the P-value estimate is close to 0 or 1). Therefore, it is wise to increase the number of batches if the standard error is too large, in particular if it is of the order of \(P\) (the P-value) for small \(P\) or of the order of \(1-P\) for large \(P\), or else if the number of switches is low (\(<1000\)).

7.5 Test statistics

The Markov chain algorithms were first implemented for probability tests, i.e. tests where the rejection zone is defined out of the least likely samples under the null hypothesis. Such tests also had Fisher’s preference (e.g. Fisher 1935); in particular the probability test for independence in contingency tables is known as Fisher’s exact test. However, probability tests are not necessarily the most powerful. Depending on the alternative hypothesis of importance, other test statistics are often preferable (Lehmann 1994). Efficient tests for detecting heterozygote excesses and deficits (Rousset and Raymond 1995) were introduced in Genepop from the start (see option 1), and log likelihood ratio (\(G\)) tests were introduced with the implementation of the genotypic tests for population differentiation (Goudet et al. 1996). The allelic weighting implicit in the \(G\) statistic is indeed optimal for detecting differentiation under an island model (Rousset 2007) and use of the \(G\) statistic has been generalized to all contingency table tests in Genepop 4.0, though probability tests performed in earlier versions of Genepop are still available.
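The \(G\) statistic referred to here is the standard log likelihood ratio statistic for a contingency table; a minimal sketch (illustrative only, not Genepop's code):

```python
import math

def g_statistic(table):
    """Log likelihood ratio (G) statistic for an r x c contingency table:
    G = 2 * sum_ij n_ij * ln(n_ij / e_ij), where the expected counts
    e_ij = (row total * column total) / grand total come from the margins."""
    rows = [sum(row) for row in table]
    cols = [sum(col) for col in zip(*table)]
    n = sum(rows)
    g = 0.0
    for i, row in enumerate(table):
        for j, nij in enumerate(row):
            if nij > 0:  # empty cells contribute nothing (lim x ln x = 0)
                e = rows[i] * cols[j] / n
                g += nij * math.log(nij / e)
    return 2.0 * g
```

In the exact tests, \(G\) is used only to rank sample configurations (larger \(G\) = more extreme), so no chi-square approximation of its distribution is involved.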

Global tests are performed either using methods tuned to specific alternative hypotheses (for heterozygote excess or deficiency) or using Fisher’s combination of probabilities technique. While the latter has been criticized (Whitlock 2005), the recommended alternative can fail spectacularly on discrete data.
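Fisher's combination technique sums the logs of the per-test P-values; under the null hypothesis, and with continuous P-values (discreteness is precisely where complications arise), the statistic is chi-square distributed with \(2k\) degrees of freedom. A stdlib-only sketch, using the closed-form chi-square survival function for even degrees of freedom:

```python
import math

def fisher_combined_pvalue(pvals):
    """Fisher's combination of k independent P-values.
    X = -2 * sum(ln p_i) is chi-square with 2k df under H0 (continuous case);
    for even df the survival function has the closed form
    P(X > x) = exp(-x/2) * sum_{i=0}^{k-1} (x/2)^i / i!."""
    k = len(pvals)
    x = -2.0 * sum(math.log(p) for p in pvals)
    half = x / 2.0
    return math.exp(-half) * sum(half ** i / math.factorial(i) for i in range(k))
```

With a single P-value the combination returns that P-value unchanged, which is a convenient sanity check.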

7.7 Bootstraps

Option 6 constructs approximate bootstrap confidence intervals (DiCiccio and Efron 1996), assuming that each locus is an independent realization of genealogical and mutation processes. The bootstrap is a general methodology with different incarnations: the ABC, BC and BCa variants are implemented for this option. The default bootstrap method, ABC, was chosen for typical microsatellite data sets because it balances moderate computation needs (for small numbers of loci) with good accuracy compared to alternatives. Bootstrap methods are approximate, and simulation tests of their performance (a practice too rare in statistical population genetics) for the present application are reported in Leblois, Estoup, and Rousset (2003) and Watts et al. (2007).

For SNP data sets of thousands of loci, the ABC method can become very slow and the alternative BC bootstrap method may be useful. BC is the bias-corrected percentile method discussed in the early bootstrap literature (Efron 1987) and superseded by the BCa method, which is more accurate for small samples. However, the BCa method (also implemented) will again be slow for large numbers of loci, while BC may be both reasonably accurate and reasonably fast in that case.
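The BC variant can be sketched as a bias-corrected percentile bootstrap over loci. This is a generic illustration of the method, not Genepop's implementation; the per-locus data layout and estimator interface are invented for the example.

```python
import random
import statistics

def bc_bootstrap_ci(per_locus_values, estimator, n_boot=2000, alpha=0.05,
                    seed=1):
    """Bias-corrected (BC) percentile bootstrap CI, resampling loci.

    per_locus_values: one value per locus (e.g. a per-locus statistic)
    estimator: function combining per-locus values into the multilocus
               estimate (here assumed to operate on a list of numbers)
    """
    rng = random.Random(seed)
    nd = statistics.NormalDist()
    theta_hat = estimator(per_locus_values)
    n = len(per_locus_values)
    boots = []
    for _ in range(n_boot):
        # resample loci with replacement, recompute the estimate
        resample = [per_locus_values[rng.randrange(n)] for _ in range(n)]
        boots.append(estimator(resample))
    boots.sort()
    # bias correction z0: fraction of replicates below the point estimate
    frac = sum(b < theta_hat for b in boots) / n_boot
    frac = min(max(frac, 1.0 / n_boot), 1.0 - 1.0 / n_boot)  # avoid +/- inf
    z0 = nd.inv_cdf(frac)
    z_alpha = nd.inv_cdf(alpha / 2.0)
    # shifted percentile levels of the bootstrap distribution
    lo_q = nd.cdf(2.0 * z0 + z_alpha)
    hi_q = nd.cdf(2.0 * z0 - z_alpha)
    lo = boots[min(n_boot - 1, int(lo_q * n_boot))]
    hi = boots[min(n_boot - 1, int(hi_q * n_boot))]
    return lo, hi
```

The cost is dominated by the `n_boot` re-estimations, which is why the percentile-based BC and BCa variants scale better than ABC only when the estimator itself is cheap to recompute.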

The ABC method is also applied over individuals in option 8 to compute confidence intervals for null allele frequency estimates.

7.8 Mantel test

The principle of the Mantel permutation procedure is to permute samples between geographical locations, so it generates a distribution conditional on having \(n\) given sets of genotypic data in \(n\) different samples. The permutations provide the distribution of any statistic under the null hypothesis of independence between the two variables (here, genotype counts and geographic location).

Mantel (1967) considered a particular statistic and approximations for its distribution; Genepop uses no such approximation. Isolation by distance will generate positive correlations between geographic distance and genetic distance estimates, and this is best tested using one-tailed P-values. The program provides both one-tailed P-values (one for each tail). The probability of observing exactly the sample correlation is the sum of these two P-values minus 1.
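The permutation procedure and the two one-tailed P-values can be sketched as follows. This is a toy implementation, not Genepop's code; matrices are plain nested lists and all names are hypothetical.

```python
import random

def mantel_test(x, y, n_perm=9999, seed=1):
    """Permutation Mantel test for two n x n symmetric distance matrices.
    Permutes sample labels of one matrix and returns the two one-tailed
    P-values (upper tail, lower tail) for the cross-product statistic."""
    rng = random.Random(seed)
    n = len(x)

    def stat(perm):
        # sum over unordered pairs of x[i][j] * y[perm[i]][perm[j]]
        return sum(x[i][j] * y[perm[i]][perm[j]]
                   for i in range(n) for j in range(i + 1, n))

    ident = list(range(n))
    obs = stat(ident)
    ge = le = 1  # the observed (identity) permutation counts in both tails
    for _ in range(n_perm):
        perm = ident[:]
        rng.shuffle(perm)
        s = stat(perm)
        if s >= obs:
            ge += 1
        if s <= obs:
            le += 1
    p_high = ge / (n_perm + 1)  # one-tailed: positive association
    p_low = le / (n_perm + 1)   # one-tailed: negative association
    return p_high, p_low
```

Since every permutation is counted in at least one tail, `p_high + p_low - 1` is the estimated probability of hitting exactly the observed statistic, as stated above.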

7.8.1 Misuse 1: tests of correlation at different distances

Genetic processes of isolation by distance generate asymptotically decreasing variation in genetic differentiation with increasing geographic distance, and there is some temptation to use the Mantel test to test for the presence of correlation at specific distances. However, this is logically unsound, so Genepop prevents it; the more quantitative methods it provides are better suited to addressing how patterns vary with distance.

As soon as a process generates data with an expected non-zero correlation at some distance, it contradicts the null hypothesis under which the Mantel test is an exact test. Thus it may not make sense to use a Mantel test for testing correlation at some distance if there is correlation at another distance.

One can still wonder whether a permutation-based test could have some approximate validity for testing absence of correlation at some distance. However, the bootstrap procedure already addresses this case. Alternative procedures would require further definition on an ad-hoc basis to be operational (e.g., the idea of eliminating samples that form pairs below or above a given distance may not unambiguously define a sample selection procedure that will retain power) and would be likely to generate some confusion.

For these reasons, in the present implementation the Mantel tests are always based on all pairs, ignoring all selection of data according to distance.

7.8.2 Misuse 2: partial Mantel tests

Partial Mantel tests have been used to test for effects of a variable Y on a response variable Z, while supposedly removing spatial autocorrelation effects on Z. Both standard theory of exact tests (as used by Raufaste and Rousset 2001) and simulation (Oden and Sokal 1992; Raufaste and Rousset 2001; Rousset 2002b; Guillot and Rousset 2013) show that the permutation procedure of the Mantel test is not appropriate for the partial Mantel test when the Y variable itself presents spatial correlations. Asymptotic arguments have also been proposed to support the use of such permutation tests (e.g. Anderson 2001) but they fail in the same conditions. As shown by Raufaste and Rousset (2001), the problem is inherent to the permutation procedure, not to a specific test statistic. Unfortunately, some papers maintain confusion about these different aspects of “partial Mantel tests”. Legendre and Fortin (2010) harshly dismissed the papers by Raufaste and Rousset (2001) and Rousset (2002b), and claimed that some versions of the tests should be preferred because they used pivotal statistics (without evidence that the statistics were indeed pivotal, a property that depends on the statistical model). Guillot and Rousset (2013) reviewed old and more recent literature demonstrating issues with the partial Mantel test, provided new simulations showing that the different tests discussed by Legendre and Fortin (2010) fail, and criticized their verbal arguments. Despite this, Legendre, Fortin, and Borcard (2015) again criticized this more recent paper for ignoring the old literature, and repeated the same kind of verbal explanations that had previously failed.