3 The input file

As illustrated by the following examples, the input format requested by Genepop is:

  • First line: anything. Use this line to store information about your data.
  • Locus names. They may be given one per line, or on the same line but separated by commas.
  • Pop sample indicator (capitalization does not matter)2. Each sample from a different geographical origin is declared by a line with a Pop statement.
  • Information for first individual. An example is:
      ind#001 fem ,0101 0202 0000 0410
    Here ind#001 fem is an identifier for your personal use. You can use any character (except a comma!). You may leave it blank (at least one space) if you wish. The last identifier of every sub-population is used by Genepop as the sample name in output files. The comma between the identifier and the list of genotypes is required. 0101 indicates that this individual is homozygous for the 01 allele at the first locus. The same individual is homozygous for the 02 allele at the second locus (0202). Data are missing at the third locus (0000). At the fourth locus, the genotype is 0410, which indicates the presence of alleles 04 and 10.
  • More individuals. Each individual's information starts on a new line, but may extend over several lines (do not start a new line in the middle of a one-locus genotype!).
  • More samples, each declared by a Pop statement on a new line.
  • Blank lines at the end of the file are removed by Genepop.
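As a rough illustration of this layout, the following Python sketch writes a file with one locus name per line and one Pop statement per sample. The function name `format_genepop` and the shape of its arguments are assumptions made for this example; they are not part of Genepop.

```python
def format_genepop(title, loci, samples):
    """Return the text of a Genepop-style input file.

    samples is a list of samples; each sample is a list of
    (identifier, genotypes) pairs, where genotypes is a list of
    strings such as "0101", one per locus.
    (Hypothetical helper, for illustration only.)
    """
    lines = [title]
    lines.extend(loci)                      # one locus name per line
    for sample in samples:
        lines.append("Pop")                 # one Pop statement per sample
        for identifier, genotypes in sample:
            if "," in identifier:
                raise ValueError("identifier must not contain a comma")
            if len(genotypes) != len(loci):
                raise ValueError("one genotype per locus is required")
            # A blank identifier is written as a single space.
            lines.append(f"{identifier or ' '} , {' '.join(genotypes)}")
    return "\n".join(lines) + "\n"


text = format_genepop(
    "Example title line",
    ["LocA", "LocB"],
    [[("ind1", ["0101", "0202"]), ("ind2", ["0102", "0000"])],
     [("", ["0410", "0202"])]],
)
print(text)
```

The two checks mirror the rules stated in the list: the identifier may not contain a comma, and each individual carries one genotype per locus.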

An example of a short input file is given below:

 Title line: "Grape populations in southern France"
 ADH Locus 1
 ADH #2
 ADH three
 ADH-4
 ADH-5
 mtDNA
 Pop
 Grange des Peres  ,  0201 003003 0102 0302 1011 01
 Grange des Peres  ,  0202 003001 0102 0303 1111 01
 Grange des Peres  ,  0102 004001 0202 0102 1010 01
 Grange des Peres  ,  0103 002002 0101 0202 1011 01
 Grange des Peres  ,  0203 002004 0101 0102 1010 01
 Pop
 Tertre Roteboeuf ,      0102 002002 0201 0405 0807 01
 Tertre Roteboeuf ,      0102 002001 0201 0405 0307 01
 Tertre Roteboeuf ,      0201 002003 0101 0505 0402 01
 Tertre Roteboeuf ,      0201 003003 0301 0303 0603 01
 Tertre Roteboeuf ,      0101 002001 0301 0505 0807 01
 POP
 Bonneau 01   , 0101    002002 0304 0805 0304 01
 Bonneau 02   , 0201    002002 0404 0505 0304 01
 Bonneau 03   , 0101    002100 0304 0505 0101 01
 Bonneau 04 , 0101    100100 0204 0805 0304 01
 Bonneau 05   , 0101    100002 0104 0808 0304 01
 Pop
  ,            0000 002001 0202 0402 0007 01
  ,            0200 002001 0202 0205 0707 01
  ,            0010 002001 0101
 0105 0807 01
 last pop,      0101 002001 0101 0401 0807 02

This example shows some useful features of the input file:

  • There is no constraint on the number of blanks separating the various fields.

  • The individual identifier has a free format.

  • Alleles are numbered from 01 to 99, or from 001 to 999 if needed. In 3-digit coding, homozygotes for (say) allele 90 are written 090090, not 9090 as in the 2-digit format. 2-digit and 3-digit coding of alleles can be intermixed (among loci, not within loci!).3

  • To designate alleles, consecutive numbers are not required.

  • Haploid and diploid data can be intermixed.4 6-digit genotypes are recognized as diploid genotypes with 3-digit alleles; 4-digit genotypes are recognized as diploid genotypes with 2-digit alleles; 2- and 3-digit genotypes are recognized as haploid genotypes. The same coding should be used consistently within each locus. See the EstimationPloidy setting for more information about analyzing haploid data. For haplo-diploid data at a given locus, the haploid genotypes should be coded as diploid genotypes with one unknown allele; note however that the information from haploid genotypes at haplo-diploid loci will be used only for genic contingency table tests, and will be ignored in estimation of genetic structure.

  • Genotypes can extend over more than one line (see the penultimate individual).

  • To group various samples, just remove each relevant Pop separator.
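The digit-count rule for recognizing genotypes can be sketched as follows. This is a minimal Python illustration of the rule as stated above, not Genepop's own code; the function name `decode_genotype` and the use of `None` for a missing allele are assumptions of this example.

```python
def decode_genotype(code):
    """Split one genotype field into a tuple of allele numbers.

    Missing alleles (00 or 000) are returned as None.
    (Hypothetical helper, for illustration only.)
    """
    if not (code.isascii() and code.isdigit()):
        raise ValueError(f"non-numeric genotype: {code!r}")
    n = len(code)
    if n in (4, 6):          # diploid, with 2- or 3-digit alleles
        half = n // 2
        alleles = (code[:half], code[half:])
    elif n in (2, 3):        # haploid, with a 2- or 3-digit allele
        alleles = (code,)
    else:
        raise ValueError(f"unsupported genotype length: {code!r}")
    # int("00") is 0, which "or None" turns into None (missing allele).
    return tuple(int(a) or None for a in alleles)


print(decode_genotype("0410"))    # diploid, alleles 04 and 10
print(decode_genotype("090090"))  # homozygote for allele 90, 3-digit coding
print(decode_genotype("0200"))    # one allele known, one missing
print(decode_genotype("01"))      # haploid
```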

It is possible to write all the locus names on one line, provided that a comma is used as a separator. This can be useful to clearly label each column. Thus the above input file could have started as:

 Title line: "Grape populations in southern France"
                       Loc1,Loc2,  ADH3,ADH4,ADH5,mtDNA
 Pop
 Grange des Peres  ,  0201 003003 0102 0302 1011 01

Note the absence of a comma after the last locus name.
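Both ways of giving locus names can be read by the same loop, as the Python sketch below suggests. It is an illustration under stated assumptions rather than a description of how Genepop itself reads files: the function name `parse_genepop` is hypothetical, a line is treated as a Pop statement only when it consists of the word "pop" in any capitalization, and a line without a comma inside a sample is treated as the continuation of the previous individual.

```python
def parse_genepop(text):
    """Parse Genepop-style text into (title, loci, samples).

    (Hypothetical helper, for illustration only.)
    """
    lines = text.rstrip().splitlines()    # drop blank lines at the end
    title, body = lines[0], lines[1:]
    loci, samples = [], []
    in_header = True
    for line in body:
        if line.strip().lower() == "pop":
            in_header = False
            samples.append([])            # start a new sample
        elif in_header:
            # Locus names: one per line, or several separated by commas.
            loci.extend(name.strip() for name in line.split(","))
        elif "," in line:
            # The first comma separates the identifier from the genotypes.
            identifier, _, genotypes = line.partition(",")
            samples[-1].append((identifier.strip(), genotypes.split()))
        else:
            # No comma: continuation of the previous individual's genotypes.
            samples[-1][-1][1].extend(line.split())
    return title, loci, samples


one_per_line = """Title
Loc1
Loc2
Pop
a , 0101 0202
b , 0102
 0000
pop
 , 0410 0202
"""
same_line = one_per_line.replace("Loc1\nLoc2", "Loc1, Loc2")

title, loci, samples = parse_genepop(one_per_line)
print(loci)
print(parse_genepop(same_line) == (title, loci, samples))
```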

There are, however, constraints to be obeyed:

  • Missing data should be indicated with 00 (or 000 for 3-digit coding) and not with blanks. The first locus in the last sample illustrates the various possibilities of missing data: no information (first individual coded 0000) or partial information (only one allele is determined: allele 02 for the second individual coded 0200 and allele 10 for the third individual coded 0010).

  • The number of locus names should correspond to the number of genotypes in each individual. If you remove one or several loci from your input file, you should remove both their names and the corresponding genotypes.

  • No empty line should be present in the data file.

  • Genepop accepts input file names either with the extension .txt5 or without any extension.

  • Genepop input files are ASCII text files.

The last point implies that under Windows, you should avoid using Microsoft Word to edit input files (and settings files as well). Use a plain-text editor such as Notepad++ instead.6 It has also turned out that certain Microsoft products under Mac OS X still produce files formatted according to the older Mac format. Genepop now catches and corrects this miserable feature.
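Some of these file-level constraints can be checked before running Genepop. The Python sketch below is only an illustration: the function name `check_raw_file` is hypothetical, and it assumes that "the older Mac format" refers to lines ending with a lone carriage return. It does not reproduce Genepop's own checks.

```python
def check_raw_file(raw):
    """Return a list of warnings for the raw bytes of an input file.

    (Hypothetical helper, for illustration only.)
    """
    problems = []
    try:
        text = raw.decode("ascii")
    except UnicodeDecodeError as err:
        return [f"non-ASCII byte at offset {err.start}"]
    # Assumption: old Mac files end lines with a lone carriage return (CR).
    if "\r" in text.replace("\r\n", ""):
        problems.append("lone CR line endings (old Mac format)")
    lines = text.replace("\r\n", "\n").replace("\r", "\n").split("\n")
    while lines and not lines[-1].strip():
        lines.pop()                      # blank lines at the end are harmless
    for number, line in enumerate(lines, start=1):
        if not line.strip():
            problems.append(f"empty line at line {number}")
    return problems


print(check_raw_file(b"Title\nLoc1\nPop\na , 0101\n\n"))
print(check_raw_file(b"Title\rLoc1\rPop\ra , 0101\r"))
print(check_raw_file(b"Title\n\nLoc1\nPop\na , 0101\n"))
```

In practice the bytes would come from opening the file in binary mode, so that line endings are seen exactly as stored.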

One can also find some conversion tools (e.g. from EXCEL) on the web.

If the input file is read correctly, the largest allele number is reported for each locus. The number of distinct alleles at each locus is provided upon request. If alleles have been labeled with consecutive numbers from 01 onwards, the largest allele number equals the number of distinct alleles at that locus.
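The relation between the largest allele number and the number of distinct alleles can be illustrated with a short Python sketch. The function name `allele_summary` and its input structure (a mapping from locus name to genotype strings) are assumptions of this example, not Genepop output.

```python
def allele_summary(genotypes_by_locus):
    """Map each locus to (largest allele number, number of distinct alleles).

    (Hypothetical helper, for illustration only.)
    """
    summary = {}
    for locus, genotypes in genotypes_by_locus.items():
        alleles = set()
        for code in genotypes:
            # 4 or 6 digits: diploid, so each allele takes half the field.
            width = len(code) // 2 if len(code) in (4, 6) else len(code)
            for i in range(0, len(code), width):
                allele = int(code[i:i + width])
                if allele:                 # 0 encodes a missing allele
                    alleles.add(allele)
        summary[locus] = (max(alleles, default=0), len(alleles))
    return summary


data = {
    "Loc1": ["0101", "0203", "0000"],   # alleles 1, 2, 3: consecutive from 01
    "Loc2": ["003003", "001004"],       # alleles 1, 3, 4: allele 2 is unused
}
print(allele_summary(data))
```

For Loc1 the two values coincide because the labels are consecutive; for Loc2 the largest allele number exceeds the number of distinct alleles.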

There are some limits to the number of samples and individuals imposed by the compiler. These values, and a few other ones, are shown by running “Genepop Maxima=” (see the Maxima setting). However, these built-in maxima are so large7 as to be practically infinite even in the era of whole-genome sequencing. Computer memory, or user patience, are more likely limits.

  2. Earlier versions of Genepop only accepted Pop, POP and pop.

  3. New to Genepop 4.0.

  4. Also new to Genepop 4.0.

  5. New to Genepop 4.0.

  6. Other text editors, including the Windows basic text editor, may not show all end-of-line characters correctly.

  7. In contrast to earlier versions of Genepop.