7 Methods

This section is only intended as a quick reference guide. The primary literature should be consulted for further information about the methods implemented in Genepop.

7.1 Null alleles

When apparent null homozygotes are observed, one may wonder whether these are truly null homozygotes, or whether some technical failure independent of genotype has occurred. Maximum likelihood estimates of null allele frequency, or of this frequency jointly with the failure rate, can be obtained by the EM algorithm (Dempster, Laird, and Rubin 1977; Hartl and Clark 1989; Kalinowski and Taper 2006), which is one of the methods implemented in Genepop (menu option 8.1).
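The EM logic for the simpler of these two models (apparent null homozygotes taken as true nulls, no failure rate) can be sketched as follows. This is an illustrative reimplementation under Hardy-Weinberg assumptions, not Genepop's code; the function name and data layout are invented for the example.

```python
import numpy as np

def em_null_allele(het_counts, hom_counts, n_blank, tol=1e-10, max_iter=10000):
    """EM estimate of null allele frequency under HWE (illustrative sketch).

    het_counts: dict mapping (i, j) with i < j to counts of observed
                heterozygotes for visible alleles i and j
    hom_counts: list, hom_counts[i] = count of apparent homozygotes for
                visible allele i (each may be A_i A_i or A_i/null)
    n_blank:    number of apparent null homozygotes, assumed true nulls
    """
    k = len(hom_counts)
    n = sum(het_counts.values()) + sum(hom_counts) + n_blank  # individuals
    # initialize with uniform frequencies over the k visible alleles + null
    p = np.full(k, 1.0 / (k + 1))
    r = 1.0 / (k + 1)
    for _ in range(max_iter):
        # E-step: expected allele counts given current frequencies
        counts = np.zeros(k)
        null_count = 2.0 * n_blank  # each blank contributes two null alleles
        for (i, j), c in het_counts.items():
            counts[i] += c
            counts[j] += c
        for i, c in enumerate(hom_counts):
            # P(genotype is A_i/null | apparent i-homozygote)
            #   = 2 p_i r / (p_i^2 + 2 p_i r) = 2r / (p_i + 2r)
            w_null = 2.0 * r / (p[i] + 2.0 * r)
            counts[i] += c * (2.0 - w_null)
            null_count += c * w_null
        # M-step: frequencies = expected allele counts / 2n
        p_new, r_new = counts / (2.0 * n), null_count / (2.0 * n)
        if abs(r_new - r) + np.abs(p_new - p).sum() < tol:
            p, r = p_new, r_new
            break
        p, r = p_new, r_new
    return p, r
```

With a single visible allele, the fixed point reduces to the square root of the blank frequency, which provides a simple sanity check of the sketch.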

Also implemented is a simpler estimator defined by Brookfield (1996) for the case where apparent null homozygotes are true null homozygotes. He also described this as a maximum likelihood estimator, but there are some (often small) differences from the ML estimates derived by the EM algorithm as implemented in this and previous versions of Genepop. These may be due to the fact that Brookfield wrote a likelihood formula for the numbers of apparent homozygotes and heterozygotes, while the EM implementation is based on a likelihood formula in which apparent homozygotes and heterozygotes for different alleles are distinguished.

For the case where one is unsure whether apparent null homozygotes are true null homozygotes, Chakraborty et al. (1992) described a method to estimate the null allele frequency from the other data, excluding any apparent null homozygote. This estimator is not implemented in Genepop because, beyond its relatively low efficiency, its behavior is sometimes puzzling: for example, whenever there is no obvious heterozygote in a sample, the estimated null allele frequency is 1, whatever the number of alleles obviously present and even if only non-null genotypes are present. Actually, even if apparent null homozygotes are not true null homozygotes, their number brings some information, and it is more logical to estimate the null allele frequency jointly with the nonspecific genotyping failure rate by maximum likelihood (Kalinowski and Taper 2006). This analysis is possible when at least three alleles are obviously present.

7.2 Exact tests

The probability of a sample of genotypes depends on allele frequencies at one or more loci. In the tests of Hardy-Weinberg (HW) equilibrium, population differentiation and pairwise independence between loci (“linkage equilibrium”) implemented in Genepop, one is not interested in the allele frequencies themselves and, since these are unknown, the aim is to derive valid conclusions whatever their values. In these different cases, this can be achieved by considering only the probability of samples conditional on observed allelic counts (e.g. for HW tests) or genotypic counts (e.g. for tests of population differentiation not assuming HW equilibrium). Because exact probabilities are computed, these conditional tests are also known as exact tests. See Cox and Hinkley (1974) and Lehmann (1994) for the underlying theory; a much more elementary introduction to the tests implemented in Genepop is Rousset and Raymond (1997).

7.3 Algorithms for exact tests

Conditional tests require in principle the complete enumeration of all possible samples satisfying the given condition. In many cases this is not practical, and the \(P\)-value may be computed by simple permutation algorithms or by more elaborate Markov chain algorithms, in particular the Metropolis-Hastings algorithm (Hastings 1970). The latter algorithm explores the universe of samples satisfying the given condition in a “random walk” fashion. For HW testing Guo and Thompson (1992) found a Metropolis-Hastings algorithm to be efficient compared to permutations. A slight modification of their algorithm is implemented in Genepop. Guo and Thompson also considered tests for contingency tables (Technical report No. 187, Department of Statistics, University of Washington, Seattle, USA, 1989) and again a slightly modified algorithm is implemented in Genepop (Raymond and Rousset 1995a). A run of the Markov chain (MC) algorithms starts with a dememorization step; if this step is long enough, the state of the chain at the end of the dememorization is independent of the initial state. Then, further simulation of the MC is divided into batches. In each batch a P-value estimate is derived by counting the proportion of time the MC spends visiting sample configurations more extreme (according to the given test statistic) than the observed sample. If the batches are long enough, the P-value estimates from successive batches are essentially independent of each other and a standard error for the P-value can be derived from the variance of per-batch P-values (Hastings 1970). As could be expected, the longer the runs, the lower the standard error.
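The dememorization/batch scheme can be illustrated on the simplest contingency-table case, a 2x2 table with fixed margins, where the exact P-value is also available by complete enumeration for comparison. This is a toy sketch, not Genepop's algorithm; all names are hypothetical.

```python
import math
import random

def hypergeom_pmf(a, r1, r2, c1):
    """Probability of cell count a in a 2x2 table with row margins
    (r1, r2) and first column margin c1, under fixed margins."""
    return (math.comb(r1, a) * math.comb(r2, c1 - a)) / math.comb(r1 + r2, c1)

def mc_fisher_pvalue(a_obs, r1, r2, c1, n_batches=20, batch_len=5000,
                     demem=2000, seed=1):
    """Metropolis-Hastings estimate of the 2x2 probability-test P-value,
    with a batch standard error as described in the text."""
    rng = random.Random(seed)
    lo, hi = max(0, c1 - r2), min(r1, c1)  # admissible range for cell a
    p_obs = hypergeom_pmf(a_obs, r1, r2, c1)

    def step(a):
        # symmetric proposal a -> a +/- 1; Metropolis acceptance ratio
        prop = a + rng.choice((-1, 1))
        if prop < lo or prop > hi:
            return a
        ratio = hypergeom_pmf(prop, r1, r2, c1) / hypergeom_pmf(a, r1, r2, c1)
        return prop if rng.random() < ratio else a

    a = a_obs
    for _ in range(demem):  # dememorization: forget the starting state
        a = step(a)
    batch_means = []
    for _ in range(n_batches):
        hits = 0
        for _ in range(batch_len):
            a = step(a)
            # "more extreme" = configuration no more probable than observed
            if hypergeom_pmf(a, r1, r2, c1) <= p_obs + 1e-12:
                hits += 1
        batch_means.append(hits / batch_len)
    p_hat = sum(batch_means) / n_batches
    var = sum((b - p_hat) ** 2 for b in batch_means) / (n_batches * (n_batches - 1))
    return p_hat, var ** 0.5  # estimate and its batch standard error
```

For larger tables the state can no longer be summarized by a single cell, but the structure of the run (dememorization, batches, per-batch estimates, standard error from the per-batch variance) is the same.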

7.4 Accuracy of P values estimated by the Markov chain algorithms

For most data sets the MC “mixes well” so that the default values of the dememorization length and batch length implemented in Genepop appear quite sufficient [in many other applications of MC algorithms, things are not so simple; e.g. Brooks and Gelman (1998)]. Nevertheless, inaccurate P-values can be detected when the standard error is large, or else if the number of switches (the number of times the sample configuration changes in the MC run) is low (this may occur when the P-value estimate is close to 0 or 1). Therefore, it is wise to increase the number of batches if the standard error is too large, in particular if it is of the order of \(P\) (the P-value) for small \(P\) or of the order of \(1-P\) for large \(P\), or else if the number of switches is low (\(<1000\)).

7.5 Test statistics

The Markov chain algorithms were first implemented for probability tests, i.e. tests where the rejection zone is defined out of the least likely samples under the null hypothesis. Such tests also had Fisher’s preference (e.g. Fisher 1935); in particular the probability test for independence in contingency tables is known as Fisher’s exact test. However, probability tests are not necessarily the most powerful. Depending on the alternative hypothesis of importance, other test statistics are often preferable (Lehmann 1994). Efficient tests for detecting heterozygote excesses and deficits (Rousset and Raymond 1995) were introduced in Genepop from the start (see option 1), and log likelihood ratio (\(G\)) tests were introduced with the implementation of the genotypic tests for population differentiation (Goudet et al. 1996). The allelic weighting implicit in the \(G\) statistic is indeed optimal for detecting differentiation under an island model (Rousset 2007) and use of the \(G\) statistic has been generalized to all contingency table tests in Genepop 4.0, though probability tests performed in earlier versions of Genepop are still available.
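The \(G\) statistic referred to here is the standard log likelihood ratio statistic for a contingency table; a minimal sketch (illustrative only, not Genepop's code):

```python
import math

def g_statistic(table):
    """Log likelihood ratio (G) statistic for an r x c contingency table:
    G = 2 * sum_ij n_ij * ln(n_ij / e_ij), where the expected counts
    e_ij = (row total * column total) / grand total come from the margins."""
    rows = [sum(row) for row in table]
    cols = [sum(col) for col in zip(*table)]
    n = sum(rows)
    g = 0.0
    for i, row in enumerate(table):
        for j, nij in enumerate(row):
            if nij > 0:  # empty cells contribute nothing (lim x ln x = 0)
                e = rows[i] * cols[j] / n
                g += nij * math.log(nij / e)
    return 2.0 * g
```

In the exact tests, \(G\) is used only to rank sample configurations (larger \(G\) = more extreme), so no chi-square approximation of its distribution is involved.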

Global tests are performed either using methods tuned to specific alternative hypotheses (for heterozygote excess or deficiency) or using Fisher’s combination of probabilities technique. While the latter has been criticized (Whitlock 2005), the recommended alternative can fail spectacularly on discrete data.
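Fisher's combination technique sums the logs of the per-test P-values; under the null hypothesis, and with continuous P-values (discreteness is precisely where complications arise), the statistic is chi-square distributed with \(2k\) degrees of freedom. A stdlib-only sketch, using the closed-form chi-square survival function for even degrees of freedom:

```python
import math

def fisher_combined_pvalue(pvals):
    """Fisher's combination of k independent P-values.
    X = -2 * sum(ln p_i) is chi-square with 2k df under H0 (continuous case);
    for even df the survival function has the closed form
    P(X > x) = exp(-x/2) * sum_{i=0}^{k-1} (x/2)^i / i!."""
    k = len(pvals)
    x = -2.0 * sum(math.log(p) for p in pvals)
    half = x / 2.0
    return math.exp(-half) * sum(half ** i / math.factorial(i) for i in range(k))
```

With a single P-value the combination returns that P-value unchanged, which is a convenient sanity check.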

7.7 Bootstraps

Option 6 constructs approximate bootstrap confidence intervals (DiCiccio and Efron 1996), assuming that each locus is an independent realization of genealogical and mutation processes. The bootstrap is a general methodology with different incarnations: the ABC, BC and BCa variants are implemented for this option. The default bootstrap method, ABC, was chosen for typical microsatellite data sets because it balances moderate computation needs (for small numbers of loci) with good accuracy compared to alternatives. Bootstrap methods are approximate, and simulation tests of their performance (a practice too rare in statistical population genetics) for the present application are reported in Leblois, Estoup, and Rousset (2003) and Watts et al. (2007).

For SNP data sets of thousands of loci, the ABC method can become very slow and the alternative BC bootstrap method may be useful. BC is the bias-corrected percentile method discussed in the early bootstrap literature (Efron 1987) and superseded by the BCa method, which is more accurate for small samples. However, the BCa method (also implemented) will again be slow for large numbers of loci, while BC may be both reasonably accurate and reasonably fast in that case.
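The BC variant can be sketched as a bias-corrected percentile bootstrap over loci. This is a generic illustration of the method, not Genepop's implementation; the per-locus data layout and estimator interface are invented for the example.

```python
import random
import statistics

def bc_bootstrap_ci(per_locus_values, estimator, n_boot=2000, alpha=0.05,
                    seed=1):
    """Bias-corrected (BC) percentile bootstrap CI, resampling loci.

    per_locus_values: one value per locus (e.g. a per-locus statistic)
    estimator: function combining per-locus values into the multilocus
               estimate (here assumed to operate on a list of numbers)
    """
    rng = random.Random(seed)
    nd = statistics.NormalDist()
    theta_hat = estimator(per_locus_values)
    n = len(per_locus_values)
    boots = []
    for _ in range(n_boot):
        # resample loci with replacement, recompute the estimate
        resample = [per_locus_values[rng.randrange(n)] for _ in range(n)]
        boots.append(estimator(resample))
    boots.sort()
    # bias correction z0: fraction of replicates below the point estimate
    frac = sum(b < theta_hat for b in boots) / n_boot
    frac = min(max(frac, 1.0 / n_boot), 1.0 - 1.0 / n_boot)  # avoid +/- inf
    z0 = nd.inv_cdf(frac)
    z_alpha = nd.inv_cdf(alpha / 2.0)
    # shifted percentile levels of the bootstrap distribution
    lo_q = nd.cdf(2.0 * z0 + z_alpha)
    hi_q = nd.cdf(2.0 * z0 - z_alpha)
    lo = boots[min(n_boot - 1, int(lo_q * n_boot))]
    hi = boots[min(n_boot - 1, int(hi_q * n_boot))]
    return lo, hi
```

The cost is dominated by the `n_boot` re-estimations, which is why the percentile-based BC and BCa variants scale better than ABC only when the estimator itself is cheap to recompute.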

The ABC method is also applied over individuals in option 8 to compute confidence intervals for null allele frequency estimates.

7.8 Mantel test

The principle of the Mantel permutation procedure is to permute samples between geographical locations, so it generates a distribution conditional on having \(n\) given sets of genotypic data in \(n\) different samples. The permutations provide the distribution of any statistic under the null hypothesis of independence between the two variables (here, genotype counts and geographic location).

Mantel (1967) considered a particular statistic and approximations for its distribution; Genepop uses no such approximation. Isolation by distance will generate positive correlations between geographic distance and genetic distance estimates, and this is best tested using one-tailed P-values. The program provides both one-tailed P-values (one for each tail). The probability of observing exactly the sample correlation is the sum of these two P-values minus 1.
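The permutation procedure and the two one-tailed P-values can be sketched as follows. This is a toy implementation, not Genepop's code; matrices are plain nested lists and all names are hypothetical.

```python
import random

def mantel_test(x, y, n_perm=9999, seed=1):
    """Permutation Mantel test for two n x n symmetric distance matrices.
    Permutes sample labels of one matrix and returns the two one-tailed
    P-values (upper tail, lower tail) for the cross-product statistic."""
    rng = random.Random(seed)
    n = len(x)

    def stat(perm):
        # sum over unordered pairs of x[i][j] * y[perm[i]][perm[j]]
        return sum(x[i][j] * y[perm[i]][perm[j]]
                   for i in range(n) for j in range(i + 1, n))

    ident = list(range(n))
    obs = stat(ident)
    ge = le = 1  # the observed (identity) permutation counts in both tails
    for _ in range(n_perm):
        perm = ident[:]
        rng.shuffle(perm)
        s = stat(perm)
        if s >= obs:
            ge += 1
        if s <= obs:
            le += 1
    p_high = ge / (n_perm + 1)  # one-tailed: positive association
    p_low = le / (n_perm + 1)   # one-tailed: negative association
    return p_high, p_low
```

Since every permutation is counted in at least one tail, `p_high + p_low - 1` is the estimated probability of hitting exactly the observed statistic, as stated above.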

7.8.1 Misuse 1: tests of correlation at different distances

Genetic processes of isolation by distance generate asymptotically decreasing variation in genetic differentiation with increasing geographic distance, and there is some temptation to use the Mantel test to test for the presence of correlation at specific distances. However, this is logically unsound, so Genepop prevents it; the more quantitative methods it provides are better suited to addressing how patterns vary with distance.

As soon as a process generates data with an expected non-zero correlation at some distance, it contradicts the null hypothesis under which the Mantel test is an exact test. Thus it may not make sense to use a Mantel test for testing correlation at some distance if there is correlation at another distance.

One can still wonder whether a permutation-based test could have some approximate validity for testing absence of correlation at some distance. However, the bootstrap procedure already addresses this case. Alternative procedures would require further definition on an ad-hoc basis to be operational (e.g., the idea of eliminating samples that form pairs below or above a given distance may not unambiguously define a sample selection procedure that will retain power) and would be likely to generate some confusion.

For these reasons, in the present implementation the Mantel tests are always based on all pairs, ignoring all selection of data according to distance.

7.8.2 Misuse 2: partial Mantel tests

Partial Mantel tests have been used to test for effects of a variable Y on a response variable Z, while supposedly removing spatial autocorrelation effects on Z. Both standard theory of exact tests (as used by Raufaste and Rousset 2001) and simulation (Oden and Sokal 1992; Raufaste and Rousset 2001; Rousset 2002b; Guillot and Rousset 2013) show that the permutation procedure of the Mantel test is not appropriate for the partial Mantel test when the Y variable itself presents spatial correlations. Asymptotic arguments have also been proposed to support the use of such permutation tests (e.g. Anderson 2001) but they fail in the same conditions. As shown by Raufaste and Rousset (2001), the problem is inherent to the permutation procedure, not to a specific test statistic. Unfortunately, some papers maintain confusion about these different aspects of “partial Mantel tests”. Legendre and Fortin (2010) harshly dismissed the papers by Raufaste and Rousset (2001) and Rousset (2002b), and claimed that some versions of the tests should be preferred because they used pivotal statistics (without evidence that the statistics were indeed pivotal, a property that depends on the statistical model). Guillot and Rousset (2013) reviewed old and more recent literature demonstrating issues with the partial Mantel test, provided new simulations showing that the different tests discussed by Legendre and Fortin (2010) fail, and criticized their verbal arguments. Despite this, Legendre, Fortin, and Borcard (2015) again criticized this more recent paper for ignoring the old literature, and repeated the same kind of verbal explanations that had previously failed.