5 All menu options
5.1 Option 1: Hardy-Weinberg (HW) exact tests
The following menu appears:
Hardy Weinberg tests:
HW test for each locus in each population:
H1 = Heterozygote deficiency.......1
H1 = Heterozygote excess...........2
Probability test...................3
Global test:
H1 = Heterozygote deficiency.......4
H1 = Heterozygote excess...........5
Main menu.............................6
5.1.1 Sub-options 1–3: Tests for each locus in each population
Three distinct tests are available, all concerned with the same null hypothesis (random union of gametes). The difference between them is the construction of the rejection zone. For the Probability test (sub-option 3), the probability of the observed sample is used to define the rejection zone, and the \(P\)-value of the test corresponds to the sum of the probabilities of all tables (with the same allelic counts) with the same or lower probability. This is the “exact HW test” of Haldane (1954), Bruce S. Weir (1996), Guo and Thompson (1992) and others. When the alternative hypothesis of interest is heterozygote excess or deficiency, more powerful tests than the probability test can be used (Rousset and Raymond 1995). One of them, the score test or \(U\) test, is available here, either for heterozygote deficiency (sub-option 1) or heterozygote excess (sub-option 2). The multi-sample versions of these two tests are accessible through sub-options 4 and 5.
Two distinct algorithms are available. The first is the complete enumeration method described by Louis and Dempster (1987). This algorithm works for fewer than five alleles. As an exact \(P\)-value is calculated by complete enumeration, no standard error is computed. The second is a Markov chain (MC) algorithm that estimates the exact \(P\)-value of this test without bias (Guo and Thompson 1992); three parameters are needed to control this algorithm (see Section 7.3). These values may be provided either at Genepop’s request, or through the Dememorisation, BatchLength and BatchNumber settings. Two results are provided for each test by the MC algorithm: the estimated \(P\)-value associated with the null hypothesis of HW equilibrium, and the standard error (S.E.) of this estimate.
For all tests concerned with sub-options 1–3, there are three possible cases. The number of distinct alleles at each locus in each sample is
no more than 4: Genepop will give you the choice between the complete enumeration and the MC method. If you have fewer than 1000 individuals per sample, the complete enumeration is recommended. Otherwise, the MC method could be much faster. But there is no general rule: results are highly variable, depending also on allele frequencies.
always 5 or more: Genepop will automatically perform only the MC method.
sometimes higher than 4, sometimes not: For cases where the number of alleles is 4 or lower, Genepop will give you the choice between both methods. For the other situations (5 alleles or more in some samples), the MC method will be automatically performed.
Whether one wants enumeration or MC methods to be performed can be specified at runtime, or otherwise by the HWtests setting, with options HWtests=enumeration and HWtests=MCMC. The default in the batch mode is enumeration.
5.1.2 Output
Results are stored in a file named as follows:
sub-option | Output file |
---|---|
1 | yourdata.D |
2 | yourdata.E |
3 | yourdata.P |
4 | yourdata.DG |
5 | yourdata.EG |
where yourdata is (throughout this document) the name of the input file.
For each test, several values are indicated on the same line: (i) the \(P\)-value of the test (or “-” if no data were available, or only one allele was present, or two alleles were detected but one was represented by only one copy); (ii) the standard error (only if a MC method was used); (iii) two estimates of \(F_\mathrm{IS}\): the B. S. Weir and Cockerham (1984) estimate (W&C), and the Robertson and Hill (1984) estimate (R&H). The latter has a lower variance under the null hypothesis. Finally, the number of “steps” is given: for the complete enumeration algorithm this is the number of different genotypic matrices considered, and for the Markov chain algorithm the number of switches (changes of genotypic matrix) performed.
5.1.3 Sub-options 4,5: Global tests across loci or across samples
For sub-option 3, a global test across loci or across samples is constructed using Fisher’s method. This method (sometimes conservative because discrete probabilities are analyzed) is only performed for convenience, and its relevance should first be established (e.g. statistical independence of loci).
General statistical theory shows that there is no uniformly better way to combine \(P\)-values of different tests. When an alternative model is specified, it is possible to find a better way of combining results from different data sets than Fisher’s method, and usually not by combining \(P\)-values. In the present context one such method is the multisample score test of Rousset and Raymond (1995), which defines a global test across loci and/or across samples generalizing the tests of sub-options 1 and 2. The global tests are performed by sub-options 4 and 5, only by the MC algorithm. Independence of loci is also assumed for these global tests.
The output file reports global P value estimates and standard errors per population, per locus, and over all loci and populations. For each global P value, the average number of switches per test combined is also reported. Since it is tempting to reduce the chain length parameters in this option, special care is needed in checking this accuracy diagnostic (see p.41).
This option generates several large temporary files. The space used temporarily by Genepop can be estimated as (number of loci + number of populations + 1) × batches × (iterations per batch) × 8 bytes. For example, about 240 MB of temporary hard disk space is required if you have 10 loci, 50 samples and a chain of 500,000 steps (100 batches of 5000 iterations).
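The disk space estimate above is simple arithmetic and can be checked with a short Python sketch. The function name and argument names are hypothetical and are not part of Genepop.

```python
def temp_space_bytes(n_loci, n_pops, batches, iterations_per_batch):
    """Estimated temporary disk space, in bytes, from the formula quoted in the text."""
    return (n_loci + n_pops + 1) * batches * iterations_per_batch * 8


# The example from the text: 10 loci, 50 samples, 100 batches of 5000 iterations.
space = temp_space_bytes(10, 50, 100, 5000)
print(f"{space / 1e6:.0f} MB")  # prints "244 MB", i.e. about 240 MB
```

Halving either the number of batches or the batch length halves the estimate, since the formula is linear in the total chain length.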
5.1.4 Analyzing a single genotypic matrix
It is possible to perform a single HW test independently of the Genepop input file. This option is not presented in the Genepop menu. You should have an input file with a genotypic matrix (which can be taken from the output file of option 5 and edited), and use the HWfile setting. When Genepop is launched in this way, the following menu will appear:
HW test for each locus in each population:
H1 = Heterozygote deficiency .................1
H1 = Heterozygote excess .....................2
Probability test .............................3
Allele frequencies, expected genotypes, Fis .... 4
Quit ........................................... 5
All HW tests corresponding to options 1.1–3 of “regular” Genepop are available through options 1–3, and basic information similar to that given by regular option 5.1 is available through the present option 4. Results are stored at the end of your input file. The exact format of the input file is:
First line: anything. Use this line to store information about your data.
Second line: The number of alleles \(n\).
Lines three through \(n+2\): the genotypic matrix (see example).
Beyond line \(n+2\) : anything (this is not read by the program).
An example with four alleles is:
Human Monoamine Oxidase (MOAO) Data
4
2
12 24
30 34 54
22 21 20 10
If this file is named MOAO, you can analyze it by setting HWfile=MOAO in the settings; you can also set HWfileOptions=1 to run option 1 without making your way through the menus. All this can be done through the console command line. For example
Genepop HWFile=MOAO HWfileOptions=1,2,3,4
will perform all four analyses available through the above menu. The general settings Dememorisation, BatchLength, BatchNumber, and Mode all affect these analyses in the same way as they affect analyses of regular input files.
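As an illustration of the single-matrix format described above, here is a minimal Python sketch that reads such a file and derives allele counts from it. It is not Genepop code, and the function names are hypothetical. It assumes, as the example suggests, that the matrix is lower-triangular: row \(i\) holds the counts of genotypes \((i,1)\) to \((i,i)\), so the last entry of each row is a homozygote count.

```python
def read_genotypic_matrix(text):
    """Parse the single-matrix format: title line, allele number n, then n rows.

    Assumption: row i (1-based) holds i counts, for genotypes (i,1) .. (i,i).
    """
    lines = text.splitlines()
    title = lines[0]
    n = int(lines[1].split()[0])
    rows = [[int(x) for x in lines[2 + i].split()] for i in range(n)]
    for i, row in enumerate(rows):
        if len(row) != i + 1:
            raise ValueError(f"row {i + 1} should have {i + 1} entries")
    return title, rows


def allele_counts(rows):
    """Count gene copies of each allele from the lower-triangular matrix."""
    counts = [0] * len(rows)
    for i, row in enumerate(rows):
        for j, c in enumerate(row):
            # Each individual of genotype (i, j) carries one copy of allele i
            # and one of allele j; homozygotes (i == j) are thus counted twice.
            counts[i] += c
            counts[j] += c
    return counts


example = """Human Monoamine Oxidase (MOAO) Data
4
2
12 24
30 34 54
22 21 20 10
"""
title, rows = read_genotypic_matrix(example)
n_individuals = sum(map(sum, rows))
print(title, n_individuals, allele_counts(rows))
```

A quick consistency check is that the allele counts sum to twice the number of individuals.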
5.1.5 Code checks
Code for HW tests has a now venerable history of testing. Early versions of Genepop were compared with the Exactp step in Biosys (Swofford and Selander 1989) for two-allele cases, and with data published in Louis and Dempster (1987) and Guo and Thompson (1992) for more alleles. The sample files LouisD87.txt and GuoT92.txt contain two such test samples, in single-matrix format.
5.2 Option 2: Tests and tables for linkage disequilibrium
The following menu appears:
Pairwise associations (haploid and genotypic disequilibrium):
Test for each pair of loci in each population ......... 1
Only create genotypic contingency tables .............. 2
Menu ....................................................... 3
5.2.1 Sub-option 1: Tests
For this option the null hypothesis is: “Genotypes at one locus are independent from genotypes at the other locus”. For a pair of diploid loci, no assumption is made about the gametic phase in double heterozygotes. In particular, it is not inferred assuming one-locus HW equilibrium, as such equilibrium is not assumed anywhere in the formulation of the test. The test is thus one of association between diploid genotypes at both loci, sometimes described as a test of the composite linkage disequilibrium (Bruce S. Weir 1996, 126–28). For a haploid locus and a diploid one, a test of association between the haploid and diploid genotypes is computed (there is no concern about gametic phase in this case). This makes it easy to test for cyto-nuclear disequilibria. For a pair of loci with haploid information, a straightforward test of association of alleles at the two loci is computed.
The default test statistic is now the log likelihood ratio statistic (\(G\)-test). However, one can still perform probability tests (as implemented in earlier versions of Genepop) by using the GameticDiseqTest=Proba setting.
For a given pair of loci within one sample, the relevant information is represented by a contingency table looking e.g. like
GOT2
1.1 1.3 3.3 1.7 3.7
EST _________________________
1.1 1 1 0 0 1 3
1.2 16 6 1 3 2 28
_________________________
17 7 1 3 3 31
for two diploid loci (1.1, etc., are the diploid genotypes at each locus). Contingency tables are created for all pairs of loci in each sample, then a \(G\) test or a probability test is computed for each table using the Markov chain algorithm of Raymond and Rousset (1995a). The number of switches of the algorithm is given for each table analyzed.
5.2.2 Output
Results are stored in the file yourdata.DIS. Three intractable situations are indicated: empty tables (“No data”), tables with one row or one column only (“No contingency table”), and tables for which all row or all column marginal sums are 1 (“No information”). For each locus pair within each sample, the unbiased estimate of the P-value is indicated, as well as the standard error. Next, a global test (Fisher’s method) for each pair of loci is performed across samples.
See also the next section for analysis of a single table.
5.3 Option 3: population differentiation
The following menu appears:
Testing population differentiation :
Genic differentiation:
for all populations ........................ 1
for all pairs of populations ............... 2
Genotypic differentiation:
for all populations ........................ 3
for all pairs of populations ............... 4
Main menu ...................................... 5
All tests are based on Markov chain algorithms. The Markov chain parameters are controlled exactly as in option 1.
5.3.1 Sub-options 1 or 2 (genic differentiation)
These sub-options are concerned with the distribution of alleles in the various samples. The null hypothesis tested is “alleles are drawn from the same distribution in all populations”. For each locus, the test is performed on a contingency table like this one:
Sub-Pop. Alleles
1 2 Total
_______
1 14 46 60
2 6 76 82
3 10 74 84
4 4 58 62
_______
Total 34 254 288
For each locus, an unbiased estimate of the P-value is computed. The test statistic is either the probability of the sample conditional on marginal values, the \(G\) log likelihood ratio, or the level of gene diversity. In the first case, the test is Fisher’s exact probability test, and the algorithm is described in Raymond and Rousset (1995a). A simple modification of this algorithm is used for the exact \(G\) test. Genepop’s default is the \(G\) test. You can revert to Fisher’s test by using the DifferentiationTest=Proba setting. Finally, the level of gene diversity can be used as a test statistic when coupled with the GeneDivRanks setting (this was new to version 4.1; see Section 5.3.4).
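To make the \(G\) statistic concrete, the following Python sketch computes the log likelihood ratio \(G = 2\sum O \ln(O/E)\) for the genic contingency table shown above. It only evaluates the statistic. The exact \(P\)-value that Genepop reports comes from the Markov chain algorithm, which is not reproduced here. The function name is hypothetical.

```python
import math


def g_statistic(table):
    """Log likelihood ratio statistic G = 2 * sum(O * ln(O / E)) for an r x c table."""
    row_tot = [sum(r) for r in table]
    col_tot = [sum(c) for c in zip(*table)]
    n = sum(row_tot)
    g = 0.0
    for i, row in enumerate(table):
        for j, obs in enumerate(row):
            if obs > 0:  # empty cells contribute nothing to the sum
                expected = row_tot[i] * col_tot[j] / n
                g += obs * math.log(obs / expected)
    return 2.0 * g


# Allele counts per sub-population, copied from the table in the text.
genic = [[14, 46], [6, 76], [10, 74], [4, 58]]
print(round(g_statistic(genic), 3))
```

A table whose rows are exactly proportional gives \(G = 0\), and any departure from proportionality gives a positive value.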
For sub-option 2, the tests are the same, but they are performed for all pairs of samples for all loci.
5.3.2 Sub-options 3 or 4 (genotypic differentiation)
These sub-options are concerned with the distribution of diploid genotypes in the various populations. The null hypothesis tested is “genotypes are drawn from the same distribution in all populations”. For each locus, the test is performed on a contingency table like this one:
Genotypes:
-------------------------
1 1 2 1 2 3
Pop: 1 2 2 3 3 3 All
----
Pop1 142 27 0 13 1 0 183
Pop2 149 20 0 11 0 4 184
Pop3 131 12 0 9 0 1 153
Pop4 119 22 1 10 0 0 152
Pop5 120 17 1 10 1 0 149
Pop6 134 18 2 15 0 0 169
Pop7 116 15 1 10 1 1 144
Pop8 214 41 3 14 2 1 275
Pop9 84 17 0 7 2 0 110
Pop10 107 18 0 15 3 0 143
Pop11 134 32 1 21 4 0 192
Pop12 105 26 1 11 1 4 148
Pop13 97 19 2 23 4 0 145
Pop14 95 28 3 19 3 1 149
All: 1747 312 15 188 22 12 2296
An unbiased estimate of the P-value of a log-likelihood ratio (\(G\)) based exact test is computed. For this test, the statistic defining the rejection zone is the \(G\) value computed on the genic table derived from the genotypic one (see Jérome Goudet et al. 1996 for the choice of this statistic), so that the \(P\)-value is the sum of the probabilities of all tables (with the same marginal genotypic values as the observed one) having a \(G\) value computed on the derived genic table higher than or equal to the observed \(G\) value.
For sub-option 4, the test is the same but is performed for all pairs of samples for all loci.
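The “genic table derived from the genotypic one” can be illustrated with a short Python sketch that collapses genotype counts into allele counts, population by population. This is only an illustration of the derivation step, with hypothetical names, and is not taken from Genepop’s source. The genotype labels follow the column headers of the table above.

```python
def genic_table(genotype_labels, genotypic_rows):
    """Collapse a genotypic contingency table into allele counts per population."""
    alleles = sorted({a for g in genotype_labels for a in g})
    index = {a: k for k, a in enumerate(alleles)}
    genic = []
    for row in genotypic_rows:
        counts = [0] * len(alleles)
        for (a, b), c in zip(genotype_labels, row):
            # Each genotype contributes one gene copy per allele it carries.
            counts[index[a]] += c
            counts[index[b]] += c
        genic.append(counts)
    return alleles, genic


labels = [(1, 1), (1, 2), (2, 2), (1, 3), (2, 3), (3, 3)]
rows = [[142, 27, 0, 13, 1, 0],   # Pop1
        [149, 20, 0, 11, 0, 4]]   # Pop2
alleles, genic = genic_table(labels, rows)
print(alleles, genic)
```

Each row of the derived table sums to twice the number of individuals in the corresponding population (366 and 368 here).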
5.3.3 Output
For the four sub-options, results are stored in a file named as follows:
sub-option | test | output file name |
---|---|---|
1 | Probability test | yourdata.PR |
1 | \(G\) | yourdata.GE |
2 | Probability test | yourdata.PR2 |
2 | \(G\) | yourdata.GE2 |
3 | \(G\) | yourdata.G |
4 | \(G\) | yourdata.2G2 |
All contingency tables are saved in the output file. Two intractable situations are indicated: empty tables or tables with one row or one column only (“No table”), and tables for which all row or all column marginal sums are 1 (“No information”). Estimates of P-values are given, as well as (for sub-options 1 and 3) a combination of all test results (Fisher’s method), which assumes statistical independence across loci. For sub-options 2 and 4, this combination of all tests across loci (Fisher’s method) is performed for each sample pair. The result “Highly sign.” (highly significant) is reported when at least one of the individual tests being combined yielded a zero \(P\)-value estimate.
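Fisher’s method itself is easy to sketch. Under independence, \(-2\sum_i \ln P_i\) over \(k\) tests follows a chi-square distribution with \(2k\) degrees of freedom. The Python sketch below (hypothetical function name, standard library only) uses the closed form of the chi-square upper tail for an even number of degrees of freedom. It also shows that a zero \(P\)-value estimate makes the statistic infinite, which is presumably why no finite combined value is reported in that case.

```python
import math


def fisher_combined(p_values):
    """Combine independent P-values: X = -2 * sum(ln p), chi-square with 2k df.

    Returns (X, combined P-value).
    """
    k = len(p_values)
    if any(p == 0.0 for p in p_values):
        return math.inf, 0.0  # a zero estimate makes the statistic infinite
    x = -2.0 * sum(math.log(p) for p in p_values)
    # Upper tail of a chi-square with 2k df: exp(-x/2) * sum_{i<k} (x/2)^i / i!
    half = x / 2.0
    term, tail = 1.0, 0.0
    for i in range(k):
        tail += term
        term *= half / (i + 1)
    return x, math.exp(-half) * tail


print(fisher_combined([0.5, 0.5]))
print(fisher_combined([0.0, 0.5]))
```

With a single \(P\)-value the combined value is that \(P\)-value itself, which is a convenient sanity check.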
5.3.4 Gene diversity as a test statistic
DifferentiationTest=GeneDiv
GeneDivRanks=2,1,3,3,3
DifferentiationTest=GeneDiv makes Genepop use gene diversity as the test statistic in tests of genetic differentiation (option 3). The test will look for a decrease in gene diversity from populations ranked first (value 1 in GeneDivRanks) to populations ranked last. This should work for both genic and genotypic tables, and for pairwise comparisons as well as for all populations, i.e. for all sub-options 3.1 to 3.4. The test statistic is
\[\sum_{\textrm{all subsamples $i$}}\sum_{j>i} (Q_j-Q_i)(R_j-R_i)\]
where \(Q_i\) is gene identity in subsample \(i\) and \(R_i\) is the GeneDivRanks value for this subsample.
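The test statistic above is a plain double sum and can be sketched in Python as follows. The gene identity values used here are invented purely for illustration, and the function name is hypothetical. How Genepop estimates \(Q_i\) and assesses significance is not shown.

```python
def gene_div_rank_statistic(q, ranks):
    """Sum over pairs i < j of (Q_j - Q_i) * (R_j - R_i)."""
    if len(q) != len(ranks):
        raise ValueError("one rank is needed per subsample")
    total = 0.0
    for i in range(len(q)):
        for j in range(i + 1, len(q)):
            total += (q[j] - q[i]) * (ranks[j] - ranks[i])
    return total


ranks = [2, 1, 3, 3, 3]             # as in GeneDivRanks=2,1,3,3,3
q = [0.30, 0.25, 0.40, 0.42, 0.45]  # hypothetical gene identities, illustration only
print(gene_div_rank_statistic(q, ranks))
```

Pairs of subsamples with equal ranks contribute nothing, and the statistic is positive when gene identity increases (gene diversity decreases) with rank.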
This option also works on input files in contingency table format (strucfile setting). In that case each row of the table is interpreted as a new population.
5.3.5 Analyzing a single contingency table
It is possible to analyse any contingency table independently of the Genepop input file. You should have an input file with a contingency table, and use the strucFile setting. This option is not presented in the Genepop menu. Both the \(G\) and probability tests are available and performed as in option 3.1. Results are stored at the end of your input file. An example input file is:
Dull example
6 5
1 2 5 10 11
2 0 8 11 15
0 0 1 5 6
10 15 20 51 55
0 0 0 2 1
4 5 6 11 10
If this file is named structest, you can analyze it by writing StrucFile=structest in the settings file, or by the console command line
Genepop StrucFile=structest
The exact format of the input file is:
First line: anything. Use this line to store information about your data.
Second line: The numbers of rows (\(n\)) and columns.
Lines three through \(n+2\): the contingency table (see example).
Beyond line \(n+2\) : anything (this is not read by the program).
The default is to perform a \(G\) test, but as in options 3.1 and 3.2 you can revert to Fisher’s exact test by the setting DifferentiationTest=Proba.
5.4 Option 4: private alleles
This option provides a multilocus estimate of the effective number of migrants (\(Nm\)) by Barton and Slatkin’s (1986) method. Three estimates of \(Nm\) are provided, using the three regression lines published in that reference, and a corrected estimate is provided using the values from the closest regression line. Results are stored in the file yourdata.PRI.
5.5 Option 5: Basic information, \(F_\mathrm{IS}\), and gene diversities
The following menu appears:
Allele and genotype frequencies per locus and per sample .. 1
Gene diversities & Fis :
Using allele identity ......... 2
Using allele size ............. 3
Main menu ................................................. 4
5.5.1 Sub-option 1: Allele and genotype frequencies
This option provides basic information on the data set. The output is saved in the file yourdata.INF. For each locus in each sample, several variables are calculated:
allele frequencies.
observed and expected genotype proportions.
\(F_\mathrm{IS}\) estimates for each allele following B. S. Weir and Cockerham (1984).
global estimate of \(F_\mathrm{IS}\) over alleles according to B. S. Weir and Cockerham (1984) (W&C) and Robertson and Hill (1984) (R&H).
observed and “expected” numbers of homozygotes and heterozygotes. “Expected” here means the expected numbers, conditional on observed allelic counts, under HW equilibrium; the difference from naive products of observed allele frequencies is sometimes called Levene’s correction, after Levene (1949).
the genotypic matrix.
A table of allele frequencies for each locus and for each sample is also computed.
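The “expected” numbers conditional on allelic counts can be sketched as follows. With \(n_i\) copies of allele \(i\) among \(2N\) genes, the usual form of Levene’s correction gives \(n_i(n_i-1)/(2(2N-1))\) expected homozygotes \(ii\) and \(n_i n_j/(2N-1)\) expected heterozygotes \(ij\). The Python below is an illustration under that assumption, with a hypothetical function name. It is not extracted from Genepop.

```python
def expected_genotype_counts(allele_counts):
    """Expected genotype counts under HW, conditional on observed allelic counts.

    allele_counts[i] is the number of copies of allele i among the 2N genes.
    Returns a dict keyed by (i, j) with i >= j.
    """
    two_n = sum(allele_counts)
    if two_n < 2 or two_n % 2:
        raise ValueError("need an even, positive number of genes")
    expected = {}
    for i, ni in enumerate(allele_counts):
        expected[(i, i)] = ni * (ni - 1) / (2 * (two_n - 1))
        for j in range(i):
            expected[(i, j)] = ni * allele_counts[j] / (two_n - 1)
    return expected


# Five individuals, with 6 copies of one allele and 4 of the other.
exp = expected_genotype_counts([6, 4])
naive_het = 5 * 2 * 0.6 * 0.4  # naive product of allele frequencies
print(exp[(1, 0)], naive_het)
```

The expected counts always sum to the number of individuals, and in this small sample the corrected heterozygote expectation (24/9) is noticeably larger than the naive one (2.4).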
5.5.2 Sub-option 2: Identity-based gene diversities and \(F_\mathrm{IS}\)
This option takes the observed frequencies of identical pairs of genes as estimates (\(\hat{Q}\)) of corresponding probabilities of identity (\(Q\)) and then simply computes diversities as \(1-\hat{Q}\): gene diversity within individuals (1-Qintra), and among individuals within samples (1-Qinter), per locus per sample, and averaged over samples or over loci. One-locus \(F_\mathrm{IS}\) estimates are also computed in a way consistent with B. S. Weir and Cockerham (1984). No estimate is given when no information is available (e.g. no estimate of diversity between individuals within a sample when only one individual has been genotyped).
For haploid data, only the gene diversity among individuals is computed. Multilocus estimates ignore haploid loci, or on the contrary ignore diploid loci if the setting EstimationPloidy=Haploid is used. Single-locus estimates are computed for both haploid and diploid loci irrespective of this setting.
The output is saved in the file yourdata.DIV.
5.5.3 Sub-option 3: Allele size-based gene diversities and \(\rho_{\mathrm{IS}}\)
Option 5.3 is analogous to option 5.2. It computes measures of diversity based on allele size, namely mean squared allele size differences within individuals (MSDintra), and among individuals within samples (MSDinter), per locus per sample, and averaged over samples or over loci. Corresponding \(\rho_\mathrm{IS}\) (the \(F_\mathrm{IS}\) analogue, see Section 7.6.2) estimates are also computed. Allele size is the allele name unless it has been given through the AlleleSizes setting.
For haploid data, only the mean squared difference MSDinter among individuals is computed. Multilocus estimates ignore haploid loci, or on the contrary ignore diploid loci if the setting EstimationPloidy=Haploid is used. Single-locus estimates are computed for both haploid and diploid loci irrespective of this setting.
The output is saved in the file yourdata.MSD.
5.6 Option 6: Fst and other correlations, isolation by distance
The following menu appears:
Estimating spatial structure:
The information considered is :
--> Allele identity (F-statistics)
For all populations ............ 1
For all population pairs ....... 2
--> Allele size (Rho-statistics)
For all populations ............ 3
For all population pairs ....... 4
Isolation by distance
between individuals ............ 5
between groups.................. 6
Main menu ................................. 7
Data ploidy | pop = individual? | isolationStatistic setting | Estimator used |
---|---|---|---|
Diploid | Yes (option 6.5) | =a | \(\hat{a}\) |
Diploid | Yes (option 6.5) | =e | \(\hat{e}\) |
Diploid | No (option 6.6) | none (default) | \(F_\mathrm{ST}\)/(1-\(F_\mathrm{ST}\)) |
Diploid | No (option 6.6) | =singleGeneDiv | \(F/(1-F)\) variant with denominator common to all pairs |
Haploid | Yes (option 6.5) | none (default) | \(\hat{a}\)-like statistic with stand-in for within-deme gene diversity |
Haploid | No (option 6.6) | none (default) | \(F_\mathrm{ST}\)/(1-\(F_\mathrm{ST}\)) |
Haploid | No (option 6.6) | =singleGeneDiv | \(F/(1-F)\) variant with denominator common to all pairs |
Sub-options 5 and 6 provide a variety of analyses of isolation by distance patterns, including bootstrap confidence intervals for the slope of the spatial regression (or equivalently, for “neighborhood” size estimates). Starting with version 4.1, it is even possible to test given values of the slope, through the testPoint setting; and additional estimators (minor variations on a common logic) have been implemented, in particular for haploid data. Table 5.1 summarizes the choice of methods, each of which will now be detailed.
5.6.1 Sub-options 1–4: \(F\)-statistics and \(\rho\)-statistics
These options compute estimates of \(F_\mathrm{IS}\), \(F_\mathrm{IT}\) and \(F_\mathrm{ST}\) or analogous correlations for allele size, either for each pair of populations (sub-options 2 and 4) or a single measure for all populations (sub-options 1 and 3). \(F_\mathrm{ST}\) is estimated by a “weighted” analysis of variance (Cockerham 1973; B. S. Weir and Cockerham 1984), and the analogous measure of correlation in allele size (\(\rho_\mathrm{ST}\)) is estimated by the same technique (see Section 7.6.2). Multilocus estimates are computed as detailed in Section 7.6.1. For haploid data, remember to use the EstimationPloidy=Haploid setting.
In sub-option 1, the output is saved in the file yourdata.FST. Beyond \(F_\mathrm{IS}\), \(F_\mathrm{IT}\) and \(F_\mathrm{ST}\) estimates, estimates of within-individual gene diversity and within-population among-individual gene diversity are reported as in option 5.2.
In sub-option 2 (pairs of populations), single locus and multilocus estimates are written in the yourdata.ST2 file and multilocus estimates are also written in the yourdata.MIG file in a format suitable for analysis of isolation by distance (see option 6.6 for further details).
Sub-option 3 is analogous to sub-option 1, but for allele-size based estimates. The output is saved in the file yourdata.RHO. Beyond \(\rho_\mathrm{IS}\), \(\rho_\mathrm{IT}\) and \(\rho_\mathrm{ST}\) estimates, estimates of within-individual gene diversity and within-population among-individual gene diversity are reported as in option 5.3.
Sub-option 4 is analogous to sub-option 2, but for allele-size based estimates. Output file names are as in sub-option 2.
5.6.2 Sub-option 5: isolation by distance between individuals
This option allows analysis of isolation by distance between pairs of individuals. It provides estimates of “neighborhood size”, or more precisely of \(D\sigma^2\), the product of population density and axial mean square parent-offspring distance, derived from the slope of the regression of pairwise genetic statistics against geographical distance or log(distance) in linear or two-dimensional habitats, respectively. More details are described in Rousset (2000) (\(\hat{a}\) statistic), Raphael Leblois, Estoup, and Rousset (2003) (bootstrap confidence intervals) and Watts et al. (2007) (\(\hat{e}\) statistic). For haploid data, a proxy for the \(\hat{a}\) statistic has been introduced in version 4.1.
The position of individuals must be specified as two coordinates standing for their name (i.e. before the comma on the line for each individual), and since each individual is considered as a sample, individuals must be separated by a Pop. An example of such an input file is given below. The first individual is located at the point \(x = 0.0\), \(y = 15.0\) (showing that the decimal separator is a period), the second at the point \(x = 0\), \(y =30\), etc. This example also shows that individual identifiers can be added after these coordinates.
Title line: A really too small data set
ADH Locus 1
ADH #2
ADH three
ADH-4
ADH-5
Pop
0.0 15.0, 0201 0303 0102 0302 1011
Pop
0 30 Second indiv, 0202 0301 0102 0303 1111
Pop
0 45, 0102 0401 0202 0102 1010
Pop
0 60, 0103 0202 0101 0202 1011
Pop
0 75, 0203 0204 0101 0102 1010
POP
15 15, 0102 0202 0201 0405 0807
Pop
15 30, 0102 0201 0201 0405 0307
Pop
15 45, 0201 0203 0101 0505 0402
Pop
15 60, 0201 0303 0301 0303 0603
Pop
15 75, 0101 0201 0301 0505 0807
Missing information arises when there is no genetic estimate (if a pair of individuals has no genotypes for the same locus, for example), or when geographic distance is zero and log(distance) is used. Genepop will correctly handle such missing information until it comes to the point where regression cannot be computed or there are not several loci to bootstrap over.
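To illustrate how coordinates stored in individual names translate into a geographic distance half-matrix, here is a minimal Python sketch. It is not Genepop code. The function names are hypothetical, Euclidean distance and the natural logarithm are assumed, and undefined log distances (zero geographic distance) are simply marked as missing.

```python
import math


def parse_position(name):
    """Read the two leading coordinates from an individual's name field."""
    x, y = name.split()[:2]
    return float(x), float(y)


def pairwise_distances(names, log_scale=True):
    """Return {(i, j): distance} for i < j, with None where log(distance) is undefined."""
    pts = [parse_position(n) for n in names]
    out = {}
    for j in range(len(pts)):
        for i in range(j):
            d = math.dist(pts[i], pts[j])
            if log_scale:
                out[(i, j)] = math.log(d) if d > 0 else None
            else:
                out[(i, j)] = d
    return out


# Name fields (the text before the comma) of three individuals from the example.
names = ["0.0 15.0", "0 30 Second indiv", "15 15"]
print(pairwise_distances(names, log_scale=False))
print(pairwise_distances(names))
```

Any text following the two coordinates, such as “Second indiv”, is ignored by this sketch.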
Options to be described within option 6.5 are: \(\hat{a}\) or \(\hat{e}\) pairwise statistics (for diploid data); log transformation for geographic distances; minimal geographic distance; coverage probability of confidence interval; testing a given value of the slope; Mantel test settings; conversion to genetic distance matrix in Phylip format. Allele-size based analogues of \(\hat{a}\) or \(\hat{e}\) can be defined, but they should perform very poorly (Raphael Leblois, Estoup, and Rousset 2003; Rousset 2007), so such an analysis has been purposely disabled.
Pairwise statistics for diploid data: Watts et al. (2007) contrasted two pairwise genetic distance statistics, \(\hat{a}\) and \(\hat{e}\). Using \(\hat{e}\) is practically equivalent to using Loiselle’s statistic (Loiselle et al. 1995), which has previously been advocated by e.g. Vekemans & Hardy (2004). Genepop actually uses a statistic \(e_r\) that handles missing data differently from \(\hat{e}\) (see Methods) but the following discussion holds for both.
The pairwise statistic is selected by the setting IsolationStatistic=a or =e, or at runtime (in batch mode, the default is \(\hat{a}\)).
\(\hat{e}\) is asymptotically biased in contrast to \(\hat{a}\), but has lower variance. The bias of the \(\hat{e}\)-based slope is higher the more limited dispersal is, so it performs less well in the lower range of observed dispersal among various species. Confidence intervals are also biased (Leblois, Estoup, and Rousset 2003; Watts et al. 2007), being too short in the direction of low \(D\sigma^2\) values, and on the contrary conservative in the direction of high \(D\sigma^2\) values. Based on the simulation results of Watts et al. (2007), provisional advice is to run analyses with both statistics, and to derive an upper bound for the \(D\sigma^2\) confidence interval (CI), hence the lower bound for the regression slope, from \(\hat{e}\) (which has a shorter CI than \(\hat{a}\), though still conservative) and the other \(D\sigma^2\) bound, hence the upper bound for the regression slope, from \(\hat{a}\) (which has too short a CI, but one less biased than the \(\hat{e}\) CI). When the \(\hat{e}\)-based \(D\sigma^2\) estimate is below 2500 (linear habitat) or 4 (two-dimensional habitat) it is suggested to derive both bounds from \(\hat{a}\).
For haploid data (i.e. EstimationPloidy=Haploid) the denominators of the \(\hat{a}\) and \(\hat{e}\) statistics cannot be computed. Ideally the denominator should be the gene diversity among individuals that would compete for the same position, as could be estimated from “group” data. As a reasonable first substitute, Genepop uses a single estimate of gene diversity (from the total sample and for each locus) to compute the denominators for all pairs of individuals. This amounts to assuming that overall differentiation in the population is weak.
Log transformation for geographic distances: This transformation is required for estimation of \(D\sigma^2\) when dispersal occurs over a surface rather than over a linear habitat. It is the default option in batch mode. It can be turned on and off by the setting GeographicScale=Log or =Linear or equivalently by Geometry=2D or =1D.
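The link between the regression slope and \(D\sigma^2\) is not spelled out in this section. The Python sketch below assumes the standard relations from the isolation by distance theory cited above (Rousset 1997, 2000): the slope of the regression against distance is approximately \(1/(4D\sigma^2)\) in a linear habitat, and the slope against the natural logarithm of distance is approximately \(1/(4\pi D\sigma^2)\) in a two-dimensional habitat. The function name is hypothetical, and the geometry labels merely mirror the Geometry setting values.

```python
import math


def d_sigma2_from_slope(slope, geometry="2D"):
    """Convert a regression slope into a D*sigma^2 estimate.

    Assumes slope = 1 / (4 * D * sigma^2) for a linear habitat ("1D") and
    slope = 1 / (4 * pi * D * sigma^2) for ln(distance) in two dimensions ("2D").
    """
    if slope <= 0:
        raise ValueError("a positive slope is needed for a finite estimate")
    if geometry == "2D":
        return 1.0 / (4.0 * math.pi * slope)
    if geometry == "1D":
        return 1.0 / (4.0 * slope)
    raise ValueError("geometry must be '1D' or '2D'")


print(d_sigma2_from_slope(0.01, "2D"))
print(d_sigma2_from_slope(0.01, "1D"))
```

Because the estimate is the reciprocal of the slope, the lower bound of a confidence interval for the slope corresponds to the upper bound for \(D\sigma^2\), as noted in the discussion of the \(\hat{a}\) and \(\hat{e}\) statistics.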
Nonparametric bootstrap is used in particular to obtain confidence intervals (DiCiccio and Efron 1996). The default method is the ABC bootstrap, but this can be changed to the BC or BCa method through the bootstrapMethod setting. The number of computations of regression estimates scales as the number of loci for the ABC method, and as a chosen number of bootstrap resamples for the BC method (which is controlled by the BootstrapNsim setting, with default 999). The latter may thus be useful when the data include thousands of loci. The BCa method differs from the BC one by an additional step that scales as the number of loci.
Coverage probability of confidence interval: This is the target probability that the confidence interval contains the parameter value. The usual practice is to compute intervals with 95% coverage and equal 2.5% tails, and this is the default coverage in Genepop. This can be changed by the setting CIcoverage, e.g. CIcoverage=0.99 will compute intervals with target probabilities of 0.5% that the confidence interval is too low and 0.5% that it is too high (an unrealistically large number of loci may be necessary to achieve the latter precision).
Minimal and maximal geographic distances: As discussed in Rousset (1997), samples at small geographic distances are not expected to follow the simple theory of the regression method, so the program asks for a minimum geographical distance. Only pairwise comparisons of samples at strictly larger distances are used to estimate the regression coefficient (all pairs are used for the Mantel test). The minimal distance may be specified by the setting MinimalDistance=value or at runtime. This being said, it is wise to include all pairs in the estimation as no substantial bias is expected, and this avoids uncontrolled hacking of the data. Thus, the suggested minimal distance here is any distance large enough to exclude only pairs at zero geographical distance. Negative values are thus not recommended (and rejected in 2D), and the default in batch mode is 0.0001.
There is also a setting MaximalDistance=value. This should not be abused, and is (therefore) available only through the settings file, not as a runtime option.
Testing a given value of the slope: The setting testPoint=0.00123 (say) returns the unidirectional P-value for a specific value of the slope, using the non-parametric bootstrap. This is the reciprocal of a confidence interval computation: confidence intervals evaluate parameter values corresponding to given error levels, say the 0.025 and 0.975 unidirectional levels for a 95% bidirectional CI, while this option evaluates the unidirectional P-value associated with a given parameter value.
Mantel test: The Mantel test is implemented. See Section 7.8 for limitations of this test. In the present context this is an exact test of the null hypothesis that there is no spatial correlation between genetic samples.
Up to version 4.3 Genepop implemented only a Mantel test based on the rank correlation. It now also implements, and performs by default, Mantel tests based on the regression coefficient for the “genetic distance” statistic used to quantify isolation by distance. The latter tests should generally be more congruent with the confidence intervals based on the same distances than the rank-based tests are. The rank test can now be performed by using the setting MantelRankTest=
(no value needed).
Ideally the confidence interval for the slope should contain zero if and only if the Mantel test is non-significant. Some exceptions may occur as the bootstrap method is only approximate, but such exceptions appear to be rare. Exceptions may more commonly occur when the bootstrap is based on the regression of genetic “distance” and geographic distance over a selected range of the latter.
The number of permutations may be specified by the setting MantelPermutations=
value, or else at runtime. In batch mode, if no such value has been given the default behaviour is not to perform the test.
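The permutation logic of a Mantel test based on the regression coefficient can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not Genepop's implementation: sample labels of the genetic matrix are permuted at random, the statistic is the least-squares slope, the test is one-tailed, and the P-value uses the common (count + 1)/(permutations + 1) convention, which may differ from what Genepop reports. All names and data are hypothetical.

```python
import random


def regression_slope(xs, ys):
    """Ordinary least-squares slope of ys on xs."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return sxy / sxx


def mantel_test(gen, geo, n_perm=1000, seed=0):
    """One-tailed Mantel test on full symmetric matrices (lists of lists)."""
    n = len(gen)
    pairs = [(i, j) for i in range(1, n) for j in range(i)]
    xs = [geo[i][j] for i, j in pairs]
    observed = regression_slope(xs, [gen[i][j] for i, j in pairs])
    rng = random.Random(seed)
    labels = list(range(n))
    count = 0
    for _ in range(n_perm):
        rng.shuffle(labels)  # permute samples, not individual matrix cells
        ys = [gen[labels[i]][labels[j]] for i, j in pairs]
        if regression_slope(xs, ys) >= observed:
            count += 1
    return observed, (count + 1) / (n_perm + 1)


# Invented data: genetic distance exactly proportional to geographic distance.
positions = [0.0, 1.0, 3.0, 6.0, 10.0, 15.0]
geo = [[abs(a - b) for b in positions] for a in positions]
gen = [[0.01 * d for d in row] for row in geo]
slope, p_value = mantel_test(gen, geo, n_perm=999, seed=1)
print(slope, p_value)
```

Permuting whole samples rather than individual pairwise values preserves the dependence among the pairwise distances that involve the same sample.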
Export genetic distance matrix in Phylip format. This option is activated by the setting PhylipMatrix=
(no value needed). It may be useful, if you wish to use Phylip, to draw a tree based on genetic distances. A constant is added to all values if necessary so that all resulting distances are positive. Output is written in the file yourdata.PMA
. No further estimation or testing is done, so the name of the groups/individuals does not need to be their spatial coordinates.
Except for this export option, output files are:
the yourdata.ISO output file, containing (i) a genetic distance (\(\hat{a}\) or \(\hat{e}\)) half-matrix and a geographic (log-)distance half-matrix; missing information is reported as ‘-’; (ii) regression estimates and bootstrap confidence intervals; (iii) the result of testing a slope value (using testPoint); (iv) results of a Mantel test for evidence of isolation by distance, if requested; (v) a bootstrap interval for the intercept. The order of elements in the half-matrices is:
1 2 3
2 x
3 x x
4 x x x
a yourdata.MIG output file, containing the same genetic and geographic distances as in the ISO file, but with more digits, and without estimation or test results. This file was formerly useful as input for the Isolde program (see “Former option 5 of Genepop”, below), and is a bit redundant now.
a yourdata.GRA output file, where again the genetic and geographic distances are reported, now as \((x,y)\) coordinates for each pair of individuals (one per line). This is useful e.g. for importing the output into programs with good graphics. Pairs with missing values (either \(x\) or \(y\)) are not reported in this file.
5.6.3 Sub-option 6: isolation by distance between groups
This option is analogous to the previous one, but derives \(D\sigma^2\) estimates from a regression of \(F_{\mathrm{ST}}/(1-F_{\mathrm{ST}})\) estimates on geographic distance in a linear habitat, or on log(distance) in a two-dimensional habitat (Rousset 1997).
Both diploid and haploid data (through EstimationPloidy=Haploid
) are handled. Missing information is handled as in option 6.5. Input format is the same, except that some samples must contain several individuals. The coordinates of each sample are still contained in the name of each sample, that is in the name of the last individual in each sample.
In addition some allele-size based analyses are possible (by the setting AllelicDistance=Size
) but again they are not advised in general. Further options within option 6.6 are: isolationStatistic
; SingleGeneDiv
; minimal geographic distance; log transformation for geographic distances; testing a given value of the slope; Mantel test settings; conversion to genetic distance matrix in Phylip format. They operate as described above for analyses between individuals, the only difference being the genetic distance used (see Table 5.1). In particular, a minor variant of the \(F/(1-F)\) estimator is introduced in version 4.1, by analogy to the “between individuals” estimators. Recall that \(F/(1-F)=(Q_0-Q_r)/(1-Q_0)\) where \(1-Q_0\) is the within-deme gene diversity. The \(F/(1-F)\) method uses per-pair estimates of this within-deme gene diversity, which may not be best. With IsolationStatistic=SingleGeneDiv
a single estimate is used for all pairwise statistics. In principle this should be better when small per-group samples are considered, but the generic \(F/(1-F)\) method is still available as the default method. Limited testing so far suggests little effect of the choice of the statistic on inferences from samples with 10 haploid individuals per group and high overall diversity.
Output is written in three files yourdata.ISO
, yourdata.MIG
, and yourdata.GRA
with the same contents as in option 6.5, except for the nature of the genetic distances.
5.6.4 Former sub-option 5 of Genepop: analysis of isolation by distance from a genetic distance matrix
That option (using the Isolde program) allowed one to perform the analyses of sub-options 5 and 6 from a file with two semi-matrices, one for genetic “distances” (\(F_{\mathrm{ST}}\) or whatever), the other for Euclidean distances. These analyses are now available through the IsolationFile setting. Most choices within options 6.5 and 6.6 are available through this option, and missing data are handled[19] (see example below). However, it is not possible to compute nonparametric confidence intervals for the regression slope since per-locus information is not provided (remarkably, some software pretends to compute nonparametric intervals in this case). This option may serve as a general purpose program for Mantel tests. Of course, some settings (minimal geographic distance, the \(F/(1-F)\) transformation, and the interpretation of one of the one-tailed \(P\)-values as a test of isolation by distance) make sense only in the narrower inference context of options 6.5 and 6.6.
The option is called by IsolationFile=
input file name where the input file follows the format of the yourdata.MIG
file written by options 6.5 and 6.6, which may be used as models. An example is
Lousy data <------anything (comments)
8 (an example) <---# of samples (comments ignored)
Fst estimates: <---anything (comments)
0.003
0.18 0.107
0.19 0.068 0.011
0.20 0.664 0.665 0.009
0.21 0.098 - 0.673 0.675
0.22 0.048 0.682 0.683 0.017 0.001
0.23 0.715 0.721 0.666 0.666 0.037 0.006
distances: <---anything (comments)
158.0
158.0 1215.0
158.1 1213.0 2300.0
158.2 2300.0 2.0 1057.0
158.3 1055.0 2525.0 2525.0 1000.0
158.4 1057.0 1055.0 2525.0 2525.0 1000.0
- 3582.0 3582.0 3582.0 3582.0 1.0 2.222
Anything after the second half matrix <----as it says
is ignored
The order of elements in the half-matrices is again
1 2 3
2 x
3 x x
4 x x x
Again as in options 6.5 and 6.6, both missing genetic and geographic information (‘-’) are handled.
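For readers who prepare such files with their own scripts, the layout described above can be read back with a short Python sketch. This is only an illustration of the format as documented here, not Genepop's parser: the function name is hypothetical, blank lines are assumed to be skippable, and the number of samples is assumed to be the first token of the second line.

```python
def parse_isolation_file(text):
    """Return (number of samples, genetic half-matrix, geographic half-matrix)."""
    lines = [line for line in text.splitlines() if line.strip()]
    n_samples = int(lines[1].split()[0])  # trailing comments are ignored
    n_rows = n_samples - 1

    def read_half_matrix(start):
        rows = []
        for k in range(n_rows):
            tokens = lines[start + k].split()
            if len(tokens) < k + 1:
                raise ValueError("half-matrix row %d is too short" % (k + 1))
            # Row k compares sample k+2 with samples 1..k+1; '-' is missing data.
            rows.append([None if t == "-" else float(t) for t in tokens[:k + 1]])
        return rows

    genetic = read_half_matrix(3)                   # after the first comment line
    geographic = read_half_matrix(3 + n_rows + 1)   # after the second comment line
    return n_samples, genetic, geographic


example = """Toy data
3 samples
Fst estimates:
0.01
0.05 -
distances:
10.0
20.0 12.5
Anything after the second half matrix is ignored
"""
n, genetic, geographic = parse_isolation_file(example)
print(n, genetic, geographic)
```

Missing values are returned as None, so that a pair can be dropped whenever either its genetic or its geographic entry is missing.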
Output is written at the end of the input file, and as in options 6.5 and 6.6, \((x,y)\) data points are also written in the file yourdata.GRA
.
Genepop IsolationFile=
input file name MantelRankTest=
will further replicate the rank test of the old Isolde program.
5.6.5 User-provided geographic distance matrices
The setting geoDistFile=
file name[20] can be used to provide a geographic distance matrix. Its format is that of other geographic distance matrices, with one required line of comment:
Geographic distances: <---anything (comments)
21
31 32
41 42 43
...
The number of samples does not need to be given.
5.6.6 Analysis of isolation by distance from multiple genetic distance matrices
If another program has generated \(F_{\mathrm{ST}}\) or \(F_{\mathrm{ST}}\)/(1 - \(F_{\mathrm{ST}}\)) matrices for a number of loci, the computation of bootstrap confidence intervals is possible. Analysis of such data sets is allowed by the MultiMigFile=
input file name setting. The format of the input file is the same as for a single genetic matrix, except that it contains multiple matrices and that the number of genetic matrices must be given (third line of input):
More lousy data
8 <---# of samples (comments ignored)
16 loci (for example) <---# of genetic matrices (comments ignored)
locus 1: <---anything (comments)
... <-half matrix (not shown here)
locus 2: <---anything (comments)
...
... <-more loci and half matrices (not shown here)
...
locus 16: <---anything (comments)
...
Geographic distances: <---anything (comments)
158.0
158.0 1215.0
158.1 1213.0 2300.0
158.2 2300.0 2.0 1057.0
158.3 1055.0 2525.0 2525.0 1000.0
158.4 1057.0 1055.0 2525.0 2525.0 1000.0
- 3582.0 3582.0 3582.0 3582.0 1.0 2.222
Anything after the second half matrix <----as it says
is ignored
The main use of this option is to allow analyses based on genetic distances not considered in Genepop. If the same estimates are input as would be computed by Genepop, the results should be similar to those from options 6.5 and 6.6, but not identical in general, because Genepop’s bootstrap estimates are computed as ratios of weighted averages of the numerators and denominators of the genetic estimates, while MultiMigFile can only use weighted averages of the ratios, i.e. of the input genetic values.
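The difference between the two ways of combining loci can be seen on a toy example. The Python sketch below uses invented per-locus numerators and denominators and equal weights by default; the function names and the weighting scheme are assumptions made for illustration, not Genepop's actual computation.

```python
def weighted_average(values, weights):
    return sum(w * v for w, v in zip(weights, values)) / sum(weights)


def ratio_of_averages(numerators, denominators, weights=None):
    """Multilocus estimate as a ratio of averaged numerators and denominators."""
    weights = weights or [1.0] * len(numerators)
    return (weighted_average(numerators, weights)
            / weighted_average(denominators, weights))


def average_of_ratios(numerators, denominators, weights=None):
    """Multilocus estimate as an average of per-locus ratios."""
    weights = weights or [1.0] * len(numerators)
    ratios = [n / d for n, d in zip(numerators, denominators)]
    return weighted_average(ratios, weights)


# Invented values for two loci.
numerators = [0.02, 0.01]
denominators = [0.5, 0.1]
print(ratio_of_averages(numerators, denominators))  # about 0.05
print(average_of_ratios(numerators, denominators))  # about 0.07
```

The two summaries coincide only in special cases, for example when all loci have the same denominator, which is why results from per-locus matrices need not match those of options 6.5 and 6.6 exactly.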
5.6.7 Analysis of mean differentiation
It is possible to perform a bootstrap analysis of the mean pairwise differentiation, through all menu options that lead to bootstrap analyses of isolation by distance, when additionally using the setting MeanDifferentiationTest=TRUE
. It takes into account selection of data by both PopTypes
and range of geographical distances.
5.7 Data selection for analyses of isolation by distance
5.7.1 Selecting a subset of samples
The settings PopTypes
and PopTypeSelection
have been developed to facilitate comparison of differentiation patterns within and among different ecotypes or host races. They are used as follows:
PopTypes= 1 1 2 1 2 1 1 2 3 4
PopTypeSelection=only 1
// PopTypeSelection=inter 1 2
// PopTypeSelection=all
PopTypes allows different types of samples (e.g. different ecotypes) to be distinguished by integer indices. The number of indices must match the number of samples in the data file.
PopTypeSelection
allows performing analyses (genetic distance regressions, confidence intervals, Mantel tests) only on pairs of populations belonging to the types specified. That is, the genetic differentiation statistic among excluded pairs is not used in any of these analyses. The different choices are shown above: all
excludes no pairs (this is the default value); inter
\(a\) \(b\) will exclude all pairs that do not involve both types \(a\) and \(b\) (only two types can be specified); and only
\(a\) will exclude all pairs that involve a type different from \(a\) (only one type can be specified). For the latter two choices, permutations are made only among samples from a given type. inter_all_types
excludes all pairs within types; no Mantel test is performed in that case. intra_all_types
keeps all pairs within types, and performs a single regression for all types; again, no Mantel test is performed in that case.
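The pair selection rules just described can be summarized by the following Python sketch. It only restates the rules of this section; the function name and argument layout are hypothetical, and samples are numbered from 1 in order of appearance in the data file.

```python
from itertools import combinations


def selected_pairs(pop_types, selection="all", types=()):
    """Return the pairs of samples (1-based indices) retained by a selection rule."""
    kept = []
    for i, j in combinations(range(len(pop_types)), 2):
        ti, tj = pop_types[i], pop_types[j]
        if selection == "all":
            keep = True
        elif selection == "only":
            (a,) = types                      # exactly one type
            keep = ti == a and tj == a
        elif selection == "inter":
            a, b = types                      # exactly two distinct types
            if a == b:
                raise ValueError("inter needs two distinct types")
            keep = {ti, tj} == {a, b}
        elif selection == "inter_all_types":
            keep = ti != tj
        elif selection == "intra_all_types":
            keep = ti == tj
        else:
            raise ValueError("unknown selection: " + selection)
        if keep:
            kept.append((i + 1, j + 1))
    return kept


pop_types = [1, 1, 2, 1, 2, 1, 1, 2, 3, 4]   # as in the PopTypes example above
print(len(selected_pairs(pop_types, "all")))
print(len(selected_pairs(pop_types, "only", (1,))))
print(len(selected_pairs(pop_types, "inter", (1, 2))))
```

With the ten samples of the example, all 45 pairs are kept by default, 10 pairs by “only 1” (five samples of type 1) and 15 pairs by “inter 1 2” (five samples of type 1 times three of type 2).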
You have to perform the “only” and “inter” analyses in distinct Genepop runs if you wish to compare their results. Rousset (1999) explains how inferences can be made from such comparisons. Note that in this perspective, some comparison of the intercept may be useful and that Genepop also provides confidence intervals on the intercept at zero distance [or log(distance)].
The inter-type Mantel test may be misleading. The null hypothesis implied by the permutation procedure is that there is no isolation by distance among populations within each type, rather than the often more relevant hypothesis that spatial processes within each type of populations are independent from each other. For this reason, a more appropriate test of the latter hypothesis is whether the bootstrap confidence interval for the inter-types regression slope includes zero or not.
5.8 Option 7: File conversions
This option allows the conversion of the Genepop input file to other formats required by some other programs (the “ecumenical” function of Genepop). Given the limited interest in some of these conversions, little effort has been made to update them. In particular, data including haploid loci or in three-digit format may not be converted into valid input for the other programs.
The following menu appears:
File conversion (diploid data, 2-digits coding only):
GENEPOP --> FSTAT (F statistics) ........................ 1
GENEPOP --> BIOSYS (letter code) ........................ 2
GENEPOP --> BIOSYS (number code) ........................ 3
GENEPOP --> LINKDOS (D statistics) ...................... 4
Main menu .............................................. 5
Sub-option 1 converts the Genepop input file into the format required by the Fstat program of J. Goudet (1995). The new format is saved in the file yourdata.DAT
.
Sub-options 2 and 3 convert the Genepop input file into the format required by Biosys (Swofford and Selander 1989), using either the letter or the number code. The new format is saved in the file yourdata.BIO
. You should add the STEP procedures at the end of this new file before running Biosys. Refer to the Biosys manual for details.
Sub-option 4 converts the Genepop input file into the format required by Linkdos, a program described by Garnier-Géré and Dillmann (1992) and based on Black and Krafsur (1985). This program performs pairwise linkage disequilibrium analyses in subdivided populations and computes Ohta’s (1982) \(D\) statistics. The new format is saved in the file yourdata.LKD
. The source Linkdos program (LINKDOS.PAS) and an executable (LINKDOS.EXE) have been distributed with previous versions of Genepop with permission of their authors, and are still available on the Genepop distribution page. The executable distributed with Genepop has been compiled for 40 samples, 20 loci and 99 alleles per locus. It may be wise to relabel alleles (option 8.3) before the conversion. Garnier-Géré and Dillmann (1992) should be cited whenever this program is used.
5.9 Option 8: Null alleles and some input file utilities
The following menu appears[21]:
Miscellaneous :
Null allele: estimates of allele frequencies .......... 1
Diploidisation of haploid data ........................ 2
Relabeling alleles .................................... 3
Conversion to individual data with population names ... 4
Conversion to individual data with individual names ... 5
Random sampling of haploid genotypes from diploid ones 6
Main Menu ........................................... 7
5.9.1 Sub-option 1: null alleles
This sub-option allows estimation of gene frequencies when a null allele is present. Different methods are available: maximum likelihood, maximum likelihood with genotyping failure, and Brookfield’s (1996) estimator; their differences are explained in Section 7.1.[22]
Genepop takes the allele with the highest number for a given locus across all populations as the null allele.[23] For example, if you have 4 alleles plus a null allele, a null homozygote individual should be indicated as e.g. 0505 or 9999 in the input file.
The default estimation method is maximum likelihood, using the EM algorithm of Dempster, Laird, and Rubin (1977). Apparent null genotypes may also be due to nonspecific genotyping failures. Joint maximum likelihood estimation of such failure rate (“\(\beta\)”) and of allele frequencies is available through the setting NullAlleleMethod=ApparentNulls
. Finally, the estimator of Brookfield (1996) is also available through the setting NullAlleleMethod=B96
. Confidence intervals for null allele frequencies are computed for each locus in each population. Their coverage probability can be modified by the same setting CIcoverage
as in options 6.5 and 6.6.
The output file is saved in the file yourdata.NUL. This file may contain:
For the maximum likelihood methods, estimated allelic frequencies and predicted numbers of homozygotes and of heterozygotes with a null allele. For example, in an output such as
Allele   EM freq.   Homoz.   Null Heter.
1        0.2762     2.7046   4.2954
2        0.2576     1.8500   3.1500
3        0.2251     1.3567   2.6433
4        0.0217     0.0000   0.0000
Null     0.2193
of the seven (2.7046+4.2954) apparent homozygotes for allele 1, it is predicted that 4.2954 are actually heterozygotes for allele 1 and for the null allele. This predicted value is the expected, or average, number of such heterozygotes over different samples with the same number of apparent genotypes, under the assumptions of the model.
A summary locus-by-population table of estimates of null allele frequencies.
A summary locus-by-population table of estimates of genotyping failure frequencies (“beta”), if applicable.
A table of bootstrap confidence intervals for estimates of null allele frequencies.
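The predicted numbers in the example output above can be approximately recovered from the estimated frequencies. The Python sketch below assumes Hardy-Weinberg proportions, so that an apparent homozygote for allele i is either a true homozygote (probability \(p_i^2\)) or a heterozygote with the null allele (probability \(2p_ip_n\)). The function name is hypothetical, and the counts of 7, 5 and 4 apparent homozygotes are inferred from the row sums of the example table.

```python
def split_apparent_homozygotes(n_apparent, p_allele, p_null):
    """Expected numbers of (true homozygotes, null heterozygotes)."""
    true_homoz = p_allele ** 2              # genotype ii
    null_heter = 2.0 * p_allele * p_null    # genotype i/null, scored as ii
    expected_null_heter = n_apparent * null_heter / (true_homoz + null_heter)
    return n_apparent - expected_null_heter, expected_null_heter


p_null = 0.2193
for allele, freq, n_apparent in ((1, 0.2762, 7), (2, 0.2576, 5), (3, 0.2251, 4)):
    homoz, heter = split_apparent_homozygotes(n_apparent, freq, p_null)
    print(allele, round(homoz, 3), round(heter, 3))
```

The small differences from the tabulated values come from the frequencies being rounded to four decimal places in the output.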
Note that there may be insufficient information to compute estimates and/or confidence intervals: not enough alleles in the sample, for example. These cases are indicated by the message No information. Sometimes the point estimate can formally be computed but the computed CI is not meaningful. This happens for example in case of heterozygote excess, and generates a (No info for CI) warning (if all pseudo-samples generated by some resampling technique show a heterozygote excess, all pseudo-estimates of null allele frequency will be zero and there is no information to construct a non-null CI from this distribution).
The confidence intervals for null allele frequencies are obtained by a bootstrap method, and are not suitable for testing for the presence of null alleles, because the null hypothesis is at the boundary of the parameter space (Andrews 2000). Instead, the exact score test for Hardy-Weinberg proportions can be used.
5.9.2 Sub-option 2: Diploidisation of haploid data
This sub-option “diploidizes” haploid loci. For example, the line
popul 1, 01 02 10 00
of a haploid dataset with 4 loci will become
popul 1, 0101 0202 1010 0000
.
Only haploid data are thus modified in a mixed haploid/diploid data file. The new file is named Dyourdata.[24]
Note that there may no longer be any need for this option for further analyses with Genepop (except perhaps as a preliminary to file conversions, option 7), since Genepop 4.0 now performs analyses on haploid data without such prior “diploidization” (don’t forget the EstimationPloidy=Haploid setting).
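The transformation of a single data line can be sketched in Python as follows. This is an illustration, not Genepop's code: the function name is hypothetical, and it assumes that haploid genotypes are the tokens with 2 or 3 digits, while tokens with 4 or 6 digits are diploid and left unchanged.

```python
def diploidise_line(line):
    """Duplicate each haploid genotype on one Genepop individual line."""
    # The genotypes follow the last comma; what precedes it is the individual name.
    name, _, genotypes = line.rpartition(",")
    converted = [g + g if len(g) in (2, 3) else g for g in genotypes.split()]
    return name + ", " + " ".join(converted)


print(diploidise_line("popul 1, 01 02 10 00"))
print(diploidise_line("ind 7, 0102 03"))  # mixed diploid and haploid loci
```

The first call reproduces the example given above, and the second shows that a diploid genotype in a mixed file is left as it is.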
5.9.3 Sub-option 3: Relabeling allele names
This sub-option relabels all alleles starting from 1 up to \(x\), \(x\) being the true number of distinct alleles for each locus. The new file is named Nyourdata. The correspondence between the old and the new numbering is indicated in the file new_file_name.NUM. This option was originally introduced in Genepop because for some options, the memory space required depends on the highest allele number. I don’t expect this to be a cause of concern now.
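A relabeling of this kind can be sketched for one diploid locus in Python. The sketch makes assumptions that the text does not state: new labels are assigned in ascending order of the old allele numbers, and the code 00 (missing data) is left unchanged. The function name is hypothetical.

```python
def relabel_locus(genotypes, digits=2):
    """Relabel the alleles of one diploid locus from 1 to x.

    genotypes: genotype strings for this locus over all individuals,
    e.g. "1012" for alleles 10 and 12 in 2-digit coding.
    Returns the relabeled genotypes and the old-to-new correspondence.
    """
    def split(genotype):
        return int(genotype[:digits]), int(genotype[digits:])

    alleles = sorted({a for g in genotypes for a in split(g) if a != 0})
    mapping = {old: new for new, old in enumerate(alleles, start=1)}
    mapping[0] = 0  # assumption: missing data stay coded as 00
    relabeled = ["".join(str(mapping[a]).zfill(digits) for a in split(g))
                 for g in genotypes]
    return relabeled, mapping


new_genotypes, correspondence = relabel_locus(["1012", "1299", "0000"])
print(new_genotypes)
print(correspondence)
```

The returned mapping plays the role of the correspondence table between the old and the new numbering.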
5.9.4 Sub-options 4 and 5: Conversion of population data to individual data
These sub-options convert “population” data (with several individuals per Pop) to “individual” data where each individual is put in a distinct Pop. This is useful for individual-based analyses of isolation by distance and, in this perspective, the name of each individual is replaced by what should be its coordinates, that is, either the name of the last individual in the original population (sub-option 4), or the name of each individual if their locations are distinguished (sub-option 5).[25]
5.9.5 Sub-option 6: Random sampling of haploid genotypes from diploid ones
This sub-option randomly samples haploid genotypes at diploid loci.[26] This may be useful for external analyses that require haploid data or that would be biased by Hardy-Weinberg disequilibria.
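The sampling step can be illustrated by the Python sketch below. It assumes that one of the two alleles is drawn with equal probability, independently for each diploid locus of each individual, and that genotypes which are already haploid are left unchanged; these are assumptions made for illustration, and the function name is hypothetical.

```python
import random


def haploidise_line(line, rng, digits=2):
    """Draw one allele at random from each diploid genotype of one individual line."""
    name, _, genotypes = line.rpartition(",")
    sampled = []
    for g in genotypes.split():
        if len(g) == 2 * digits:
            # Diploid genotype: keep one of its two alleles, chosen at random.
            sampled.append(rng.choice((g[:digits], g[digits:])))
        else:
            sampled.append(g)  # assumption: other genotypes are left as is
    return name + ", " + " ".join(sampled)


rng = random.Random(2024)  # fixed seed so that the sketch is reproducible
print(haploidise_line("ind 1, 0102 0303 0000", rng))
```

A homozygote always yields its single allele and a fully missing genotype (0000) yields 00, whichever allele is drawn.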
[11] New to Genepop 4.0.
[12] Again new to Genepop 4.0.
[13] In earlier versions of Genepop, this analysis was done through the HW.BAT batch file.
[14] The distinct option 2.3 of Genepop 3.4 is no longer necessary as option 2.1 of Genepop 4.0 more gracefully handles haploid data.
[15] This was not the case in earlier versions of Genepop.
[16] Up to version 3.4, Genepop only computed Fisher’s exact test in these sub-options.
[17] Slightly modified in comparison to earlier versions of Genepop.
[18] In previous versions of Genepop, this analysis was done by the Struc program called through the Struc.BAT batch file.
[19] More extensively than in earlier versions of Genepop.
[20] New to Genepop 4.2.
[21] Former sub-option 3 (erasing all temporary files) has been discarded.
[22] The last two methods are new to Genepop 4.0.
[23] This is a notable difference from Genepop 3.4, where the allele with the highest number in each population was taken as the null allele in this population. Consequently, null allele estimation is now meaningful even if no null homozygote is observed in a given population. The output format has also been improved, compared to earlier versions of Genepop, with a more logical ordering of results (samples within loci) and a final locus by population table of estimated null allele frequencies.
[24] No longer truncated to 8 letters as it was in earlier versions of Genepop.
[25] New to Genepop 4.3.
[26] New to Genepop 4.3.