3 The input file
As illustrated by the following examples, the input format required by Genepop is:
- First line: anything. Use this line to store information about your data.
- Locus names: they may be given one per line, or on the same line but separated by commas.
- Pop sample indicator (capitalization does not matter). Each sample from a different geographical origin is declared by a line with a pop statement.
- Information for the first individual. An example is:
  ind#001 fem ,0101 0202 0000 0410
  Here ind#001 fem is an identifier for your personal use. You can use any character (except a comma!). You may leave it blank (at least one space) if you wish. The last identifier of every sub-population is used by Genepop as the sample name in output files. The comma between the identifier and the list of genotypes is required. 0101 indicates that this individual is homozygous for the 01 allele at the first locus. The same individual is homozygous for the 02 allele at the second locus (0202). Data are missing at the third locus (0000). At the fourth locus, the genotype is 0410, which indicates the presence of alleles 04 and 10.
- More individuals: the information for each individual starts on a new line, but may extend over several lines (do not start a new line in the middle of a one-locus genotype!).
- More samples, each declared by a pop statement on a new line.
- Blank lines at the end of the file are removed by Genepop.
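The layout of an individual's line can be illustrated with a short Python sketch. It is not part of Genepop: the function name split_individual is hypothetical, and the sketch assumes that all genotypes of the individual sit on a single line.

```python
def split_individual(line):
    """Split one individual line into (identifier, list of genotype codes).

    Hypothetical helper, not part of Genepop. Assumes the genotypes of the
    individual are all on this one line.
    """
    if "," not in line:
        raise ValueError("the comma after the identifier is required")
    # The identifier may contain any character except a comma, so the
    # first comma on the line is the separator.
    identifier, genotypes = line.split(",", 1)
    # Genotypes are separated by blanks.
    return identifier.strip(), genotypes.split()


ident, codes = split_individual("ind#001 fem ,0101 0202 0000 0410")
print(ident)  # ind#001 fem
print(codes)  # ['0101', '0202', '0000', '0410']
```

A blank identifier simply yields an empty string, since only the comma is mandatory.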
An example of a short input file is given below:
Title line: "Grape populations in southern France"
ADH Locus 1
ADH #2
ADH three
ADH-4
ADH-5
mtDNA
Pop
Grange des Peres , 0201 003003 0102 0302 1011 01
Grange des Peres , 0202 003001 0102 0303 1111 01
Grange des Peres , 0102 004001 0202 0102 1010 01
Grange des Peres , 0103 002002 0101 0202 1011 01
Grange des Peres , 0203 002004 0101 0102 1010 01
POP
Tertre Roteboeuf , 0102 002002 0201 0405 0807 01
Tertre Roteboeuf , 0102 002001 0201 0405 0307 01
Tertre Roteboeuf , 0201 002003 0101 0505 0402 01
Tertre Roteboeuf , 0201 003003 0301 0303 0603 01
Tertre Roteboeuf , 0101 002001 0301 0505 0807 01
pop
Bonneau 01 , 0101 002002 0304 0805 0304 01
Bonneau 02 , 0201 002002 0404 0505 0304 01
Bonneau 03 , 0101 002100 0304 0505 0101 01
Bonneau 04 , 0101 100100 0204 0805 0304 01
Bonneau 05 , 0101 100002 0104 0808 0304 01
Pop
, 0000 002001 0202 0402 0007 01
, 0200 002001 0202 0205 0707 01
, 0010 002001 0101
0105 0807 01
last pop, 0101 002001 0101 0401 0807 02
This example shows some useful features of the input file:
- There is no constraint on the number of blanks separating the various fields.
- The individual identifier has a free format.
- Alleles are numbered from 01 to 99, or from 001 to 999 if needed. In 3-digit coding, (say) homozygotes for the 90 allele are noted 090090, not 9090 as in the 2-digit format. 2-digit and 3-digit coding of alleles can be intermixed (among loci, not within loci!).
- To designate alleles, consecutive numbers are not required.
- Haploid and diploid data can be intermixed. 6-digit genotypes are recognized as 3-digit diploid genotypes; 4-digit genotypes are recognized as 2-digit diploid genotypes; 2- and 3-digit genotypes are recognized as haploid genotypes. The same coding should be used consistently within each locus. See the EstimationPloidy setting for more information about analyzing haploid data. For haplo-diploid data at a given locus, the haploid genotypes should be coded as diploid genotypes with one unknown allele; note however that the information from haploid genotypes at haplo-diploid loci will be used only for genic contingency table tests, and will be ignored in estimation of genetic structure.
- Genotypes can extend over more than one line (see penultimate individual).
- To group various samples, just remove each relevant Pop separator.
- It is possible to write all the locus names on one line, provided that a comma is used as separator. This could be useful to clearly label each column. Thus the above input file could have started as:
Title line: "Grape populations in southern France"
Loc1,Loc2, ADH3,ADH4,ADH5,mtDNA
Pop
Grange des Peres , 0201 003003 0102 0302 1011 01
...
Note the absence of a comma after the last locus name.
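Two of the conventions listed above, the two ways of giving locus names and the recognition of genotypes by their number of digits, can be sketched in Python as follows. This is only an illustration under stated assumptions, not Genepop's own code: the names parse_locus_names and decode_genotype are hypothetical, and a 00 or 000 allele is returned as None to mark it as missing.

```python
def parse_locus_names(lines):
    """Collect locus names given one per line or comma-separated on a line.

    Hypothetical helper. Empty pieces are dropped, which is a choice of
    this sketch and not a statement about how Genepop treats them.
    """
    names = []
    for line in lines:
        names.extend(part.strip() for part in line.split(",") if part.strip())
    return names


def decode_genotype(code):
    """Return a tuple of allele numbers; None marks a missing allele."""
    if not (code.isascii() and code.isdigit()):
        raise ValueError(f"non-numeric genotype: {code!r}")
    if len(code) in (4, 6):      # diploid, 2 or 3 digits per allele
        half = len(code) // 2
        alleles = (code[:half], code[half:])
    elif len(code) in (2, 3):    # haploid
        alleles = (code,)
    else:
        raise ValueError(f"unexpected genotype length: {code!r}")
    # An allele number of zero (00 or 000) becomes None.
    return tuple(int(a) or None for a in alleles)


print(parse_locus_names(["Loc1,Loc2, ADH3,ADH4,ADH5,mtDNA"]))
print(decode_genotype("090090"))  # (90, 90)
print(decode_genotype("0200"))    # (2, None)
print(decode_genotype("01"))      # (1,)
```

Because the decoding depends only on the length of the code, it cannot by itself detect a locus where 2-digit and 3-digit codings have been mixed; that consistency would have to be checked per locus.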
There are, however, constraints to be obeyed:
- Missing data should be indicated with 00 (or 000 for 3-digit coding) and not with blanks. The first locus in the last sample illustrates the various possibilities of missing data: no information (first individual, coded 0000) or partial information (only one allele is determined: allele 02 for the second individual, coded 0200, and allele 10 for the third individual, coded 0010).
- The number of locus names should correspond to the number of genotypes in each individual. If you remove one or several loci from your input file, you should remove both their names and the corresponding genotypes.
- No empty line should be present in the data file.
- Genepop accepts input file names either with the extension .txt or without any extension.
- Genepop input files are ASCII text files.
The last point implies that under Windows, you should avoid using Microsoft Word to edit input files (and settings files as well). Rather, use a text editor such as Notepad++. It has also appeared that certain Microsoft products under Mac OS X still produced files formatted according to the older Mac format. Genepop now catches and corrects this miserable feature.
One can also find some conversion tools (e.g. from EXCEL) on the web.
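If you prefer to check a file yourself before running Genepop, the following Python sketch shows one possible pre-check. It is not a Genepop feature: the function name check_input_bytes is hypothetical, and the sketch assumes that the "older Mac format" refers to carriage-return-only line endings. Normalizing Windows line endings as well is a choice of this sketch.

```python
def check_input_bytes(data):
    """Return (normalized_text, problems) for the raw bytes of an input file.

    Hypothetical helper, not part of Genepop.
    """
    problems = []
    try:
        text = data.decode("ascii")
    except UnicodeDecodeError as err:
        problems.append(f"non-ASCII byte at offset {err.start}")
        text = data.decode("ascii", errors="replace")
    # Assumption: old Mac files end lines with CR only. Convert CRLF first
    # so that a Windows line ending does not become two line breaks.
    text = text.replace("\r\n", "\n").replace("\r", "\n")
    lines = text.split("\n")
    # Blank lines at the end of the file are harmless, so drop them.
    while lines and not lines[-1].strip():
        lines.pop()
    # Any remaining empty line violates the "no empty line" constraint.
    for number, line in enumerate(lines, start=1):
        if not line.strip():
            problems.append(f"empty line {number}")
    return "\n".join(lines) + "\n", problems


text, problems = check_input_bytes(b"Title\rLoc1\rPop\r\rA , 0101\r\r")
print(problems)  # ['empty line 4']
```

Reading the file in binary mode (open(path, "rb").read()) keeps the original line endings intact, so the check sees them.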
If the input file is correctly read, the largest allele number is indicated for each locus. The number of distinct alleles for each locus is provided upon request. If alleles have been labeled with consecutive numbers from 01 onwards, then the largest allele number will correspond to the number of distinct alleles for each locus.
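The relation between the largest allele number and the number of distinct alleles can be illustrated with a small Python sketch. It is not Genepop's own computation: the function name allele_summary is hypothetical, and the data are the ADH-4 and ADH-5 genotypes of the first sample only (Grange des Peres) from the example file above, already converted to pairs of integers.

```python
def allele_summary(genotypes_by_locus):
    """Map each locus to (largest allele number, number of distinct alleles).

    Hypothetical helper. Missing alleles (0 or None) are ignored.
    """
    summary = {}
    for locus, genotypes in genotypes_by_locus.items():
        alleles = set()
        for genotype in genotypes:
            alleles.update(a for a in genotype if a)
        summary[locus] = (max(alleles, default=None), len(alleles))
    return summary


first_sample = {
    # 0302 0303 0102 0202 0102
    "ADH-4": [(3, 2), (3, 3), (1, 2), (2, 2), (1, 2)],
    # 1011 1111 1010 1011 1010
    "ADH-5": [(10, 11), (11, 11), (10, 10), (10, 11), (10, 10)],
}
print(allele_summary(first_sample))
# {'ADH-4': (3, 3), 'ADH-5': (11, 2)}
```

For ADH-4 the alleles are numbered consecutively from 01, so the largest allele number (3) equals the number of distinct alleles. For ADH-5 the numbering is not consecutive, and the two values differ (11 versus 2).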
There are some limits to the number of samples and individuals imposed by the compiler. These values, and a few other ones, are shown by running “Genepop Maxima=” (see the Maxima setting). However, these built-in maxima are so large as to be practically infinite even in the era of whole-genome sequencing. Computer memory, or user patience, are more likely limits.