--- title: "Why and how to use the Ease package?" author: "Ehouarn Le Faou" date: "`r Sys.Date()`" output: rmarkdown::html_vignette vignette: > %\VignetteIndexEntry{Why and how to use the Ease package?} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8} --- ```{r, include = FALSE} knitr::opts_chunk$set( collapse = TRUE, comment = "#>" ) ``` ```{r setup, include = FALSE} library(Ease) ``` `Ease` aims to implement in a simple and efficient way in R the possibility to perform population genetics simulations considering multiple loci whose epistasis is fully customizable. Specifically suited to the modelling of multilocus nucleocytoplasmic systems, it is nevertheless possible to simulate purely nucleic, i.e. diploid (or purely cytoplasmic, i.e. haploid) genetic models. The simulations are not individual-centred in that the transition from one generation to the next is done matrix-wise on the basis of deterministic equations. Instead of each individual being described separately, the simulations only handle the genotype frequencies within the population. All possible genotype frequencies considering the loci and alleles defined by the user are explicitly tracked. The simulations are therefore fast only if the number of loci and alleles is not too large. The consideration of genetic drift and a specific population size is nevertheless introduced as a multinomial draw each generation, which adds to the realism of the simulations by adding randomisation. One can also simulate a simple demography, i.e. an initial population size, a growth rate and a carrying capacity. The size of the population is thus drawn each generation in a Poisson distribution. Varying the carrying capacity is not (yet) included in `Ease`, but it is possible to run different models in succession, giving the next one as the initial genotype frequencies the frequencies of the previous one. The population considered can be structured and connected by migration. Migration is only possible at the individual stage and not at the gamete stage, which future updates may add. It is also not possible to define different migration rates between different genotypes. The consideration of genetic drift and thus a specific population size is nevertheless introduced as a multinomial draw each generation, which adds to the realism of the simulations by adding randomisation. In the `Ease` package, the life cycle of the simulated population is standard ([migration] - [selection on gamete production] - [gametogenesis (recombination + meiosis + mutation)] - [selection on gametes] - [syngamy] - [selection on individuals] - [demography] - [drift]) and may consider the population dioecious or hermaphroditic. # Genome ## Definition A genome is defined by the set of loci to which lists of alleles are attached. Each loci and each allele is defined by a unique name, which allows it to be unequivocally identified. There are two types of loci: diploid and haploid. A genotype is defined as an allelic combination of all the alleles of an individual's loci and a haplotype as only those alleles that have been inherited together from a single parent. A genotype is therefore made up of two haplotypes. A distinction is also made between diploid (resp. haploid) haplotypes which correspond to allelic combinations taking into account only diploid (resp. haploid) loci. The loci are defined by a list of vectors that enumerates their respective alleles. The order in which the loci are placed is not important in the case of haploid loci. It does matter in the case of diploid loci because recombination is likely to affect the haplotypes. In the `Ease` package, diploid loci are In the case of diploid loci, however, if several are defined, the order of the diploid loci in the list is not trivial. The rates of two-to-one combinations between them must indeed be defined by a vector of recombination rates. For example, if three diploid loci are defined, this vector must be of length 2, the first of its values defining the recombination rate between the first and second loci, the second of its values the recombination rate between the second and third loci. For example, if we want to define two groups of two loci that are linked to each other but are on two different chromosomes, we can define the recombination rate vector as `c(0.1, 0.5, 0.1)`. The first two loci are thus relatively linked (recombination rate of `0.1`), as are the last two loci. On the other hand, the recombination rate of `0.5` between the second and third loci ensures that the two groups are independent. To create a haplotype ID, we concatenate all diploid alleles and all haploid alleles separately, then concatenate these two strings by separating them with `"||"`. For example `"Ab||CD"` corresponds to a haplotype with four loci, two diploid with alleles `A` and `b`, and two haploid with alleles `C` and `D`. The principle is the same for the genotypes, but the second diploid haplotype is added by separating it from the first by a `"/"`, for example `"Ab/ab||CD"`. ## Construction Each loci is represented by a name and a factor vector that lists its alleles. If one wish to consider a system with two loci, a diploid and a haploid, each of which has two alleles, `A` and `a`, and `B` and `b` respectively, the construction of the genome is done as follows: ```{r} DL <- list(dl = c("A", "a")) HL <- list(hl = c("B", "b")) genomeObj <- setGenome(listHapLoci = HL, listDipLoci = DL) ``` The haplotypes and genotypes have been generated automatically, their numbers can be retrieved by simply displaying the `Genome` object created: ```{r} genomeObj ``` and an exhaustive list can be displayed using the `print` method: ```{r} print(genomeObj) ``` The haplotypes and genotypes are numbered, and these numberings will be important in defining the different types of fitness, as we shall now see. # Mutation matrix ## Definition A genome necessarily has a mutation matrix attached to it. This mutation matrix is haplotypic: it is a square probability matrix (the sum of the rows of which is equal to 1), of size equal to the number of haplotypes defined in the genome. This mutation matrix is not provided as is by the user, in which case it would be too tedious to define. Instead the user is asked to either : * define an allelic mutation matrix (square probability matrix) per loci, in the form of a list of matrices for diploid loci and one for haploid loci; * give a forward mutation rate and a backward mutation rate. Forward then corresponds to the passage from one allele to the other in the sense that the alleles have been defined in the genome, backward in the other direction. * give a list of mutations to be defined individually, from one allele to another, using the `mutation` function **_NOTE:_** *In practice, the mutation matrix is not used as such in the simulations. It is associated with the recombination matrix and the meiosis matrix which associates to each genotype the probability that they produce each haplotype by chromosomal segregation. It is with a matrix product *Recombination matrix *x* Meiosis matrix *x* Mutation matrix *that a single gametogenesis matrix is produced and used for the simulations.* ## Construction Definition of the haplotypic mutation matrix by filling in the allelic mutation matrices : ```{r} mutMatrixObj <- setMutationMatrix( genomeObj = genomeObj, mutHapLoci = list(matrix(c( 0.95, 0.05, 0.03, 0.97 ), 2, byrow = TRUE)), mutDipLoci = list(matrix(c( 0.9, 0.1, 0.09, 0.91 ), 2, byrow = TRUE)) ) mutMatrixObj ``` Definition of the haplotypic mutation matrix by filling in the forward and backward mutation rates: ```{r} mutMatrixObj <- setMutationMatrix(genomeObj = genomeObj, forwardMut = 0.01) mutMatrixObj ``` Definition of the haplotypic mutation matrix by making a mutation list: ```{r} mutMatrixObj <- setMutationMatrix( genomeObj = genomeObj, mutations = list( mutation(from = "A", to = "a", rate = 0.01), mutation(from = "B", to = "b", rate = 0.01) ) ) mutMatrixObj ``` # Selection ## Definition Selection can be defined at three stages of this cycle: on mature individuals directly, on their gamete production or on the gametes. For individuals and gamete production, a fitness value is associated with each genotype. For gametes, it is with each haplotype. A fitness value is any positive or zero real. Fitness values are relative, so if all genotypes have a fitness value of 3, there will be no effect on the dynamics of the model. By default, as is customary, the fitness of individuals is set to 1, and the additional fitness effects are added multiplicatively. ## Construction In all cases, the construction of a `Selection` object is done using a genome class object (which is used to check the compatibility between the constructed genome and the desired selection parameters). There are two ways of defining selection : * give a vector of length equal to the number of genotypes. By defining the fitness vector as such, it is therefore necessary to know the order of haplotypes and genotypes (use the method `print` on an object of class `Genome`). This way is often impractical and the second way is often preferred. * a simpler and more readable way to define the selection is to create a list of selection formulas, as they will be called here. The principle is explained below. ### Selection formulas A selection formula is a formula (in the sense of R) that associates allelic combinations with a fitness effect. The fitness effect is the left-hand member of the formula, while the right-hand member corresponds to the allelic combinations. The fitness effect is any literal or numerical expression. If it is literal, it will be evaluated at the time of the definition of the object selection The definition of allelic combinations is based on several rules. First, each allelic combination is described using the keywords `:` and `h()`. The first one, `:`, is used to separate the alleles from each other in the enumeration of allele combinations and `h()` is used to specify whether the allele must be in the homozygous state (if this keyword is absent, the allele is in the heterozygous state). For example, in a model with two diploid loci and two alleles each, respectively *A* and *a*, *B* and *b*, we note `h(a):b` to mean that the fitness effect must apply to any genotype that has the *a* allele in the homozygous state and the *b* allele in the heterozygous state. Other examples are: `a:b` (*a* heterozygous and *b* heterozygous), `h(A):h(B)` (*A* and *B* homozygous). We can therefore construct the formula: `1 + s ~ a:b`, which means that the presence of *a* and *b* in the heterozygous state in a genotype has an effect (`1+s`) on the fitness of that genotype. To add other allelic combinations to which the same fitness effect will apply, the operator `+` must be used. Using the previous example, `1 + s ~ a:b + h(a)` means that the fitness effect should also apply to genotypes that are homozygous for *a*. To define a selection regime, one has to give as input a list of selection formula that describe how, depending on their allelic combinations, genotypes suffer or benefit from fitness effects, for example of the form `list(1 + s ~ h(a):h(b), 1 + h*s ~ a:b + a:h(b) + h(a):b)`. An important thing to note is that `list(1 + s ~ a + b)` is not equivalent to `list(1 + s ~ a, 1 + s ~ b)` In the first situation, if a genotype has *a* OR *b* in the heterozygous state, the fitness effect will apply. In the second situation the effect will apply twice (multiplicatively) to the same genotype because it fits two different formulas. **_NOTE:_** *This way of defining fitness does not allow to distinguish between cis and trans effects on fitness.* ### Neutral selection Then it is for example possible to define no selection (neutral model) with the function `setSelectNeutral` to construct a `selection` object where the fitnesses are all identical (equal to 1): ```{r} selectionObj <- setSelectNeutral(genomeObj = genomeObj) ``` We can then check that no selection has been defined: ```{r} selectionObj ``` or with : ```{r} print(selectionObj) ``` ### Non-neutral selection Using the example given in the [Genome] section, one might want to simulate a system of genetic incompatibility where when the derived alleles `a` and `b` are put together within the same genotype, they induce a fitness cost through negative epistasis. This cost, which we will call `s`, is associated with `h` dominance which reduces this cost when the `a` nuclear allele is in the heterozygous state. Thus individuals `A/A||B`, `A/a||B`, `a/a||B` and `A/A||b` do not suffer any fitness cost (because they have only one of the two incompatible alleles), their fitness is equal to 1. The genotype `A/a||b` undergoing the reduced cost of incompatibility has a fitness of `1-h*s` and the genotype `a/a||b` undergoing the full cost of incompatibility has a fitness of `1 - s`. To define selection it is then sufficient to specify that if only one allele `a` is present with allele `b` (`a:b`), the cost is only `1 - h*s`, and if allele `a` is homozygous with allele `b`, the cost is `1 - s`. ```{r} indFit <- list( 1 - h * s ~ a:b, 1 - s ~ h(a):b ) s <- 0.8 h <- 0.5 selectionObj <- setSelectOnInds( genomeObj = genomeObj, indFit = indFit ) ``` We can then check that selection has been defined: ```{r} selectionObj ``` Regarding selection on individuals, it is necessary to understand that it will potentially not be identical if the modelled population is hermaphroditic or dioecious. In the case of hermaphroditism there is no distinction between female and male fitness, and so the `indFit` parameter will govern their fitness. If the sexes are separated, however, one can either define a fitness in individuals `indFit` that will apply to both males and females, or specify separately for males and females with the parameters `femaleFit` and `maleFit`. In any case it is good to check with the `print` method that the fitnesses are those wanted: ```{r} print(selectionObj) ``` Selection can also be defined on gamete production: ```{r} indProdFit <- list( 1 - h * s ~ a:b, 1 - s ~ h(a):b ) s <- 0.8 h <- 0.5 selectionObj <- setSelectOnGametesProd( genomeObj = genomeObj, indProdFit = indProdFit ) ``` or on the gametes directly: ```{r} gamFit <- list( 1 - s ~ a:b ) s <- 0.8 h <- 0.5 selectionObj <- setSelectOnGametes( genomeObj = genomeObj, gamFit = gamFit ) ``` For the latter two selection modes, it is possible to specify the fitness effect according to sex or to define the same effect regardless of sex, as desired. Last but not least, it is obviously possible to combine these different layers of selections. This is done using the `selectionObj` parameter that each of the `setSelect...` functions has (except `setSelectNeutral`), it is then unnecessary to recall the genome to which the selection refers. For example, if we want to combine the three types of selections presented here : ```{r} indFit <- list( 1 - h * s ~ a:b, 1 - s ~ h(a):b ) indProdFit <- list( 1 - h * s ~ a:b, 1 - s ~ h(a):b ) gamFit <- list( 1 - s ~ a:b ) s <- 0.8 h <- 0.5 selectionObj <- setSelectOnInds(genomeObj = genomeObj, indFit = indFit) selectionObj <- setSelectOnGametesProd(indProdFit = indProdFit, selectionObj = selectionObj) selectionObj <- setSelectOnGametes(femaleFit = gamFit, selectionObj = selectionObj) print(selectionObj) ``` # Population ## Definition Once the previous steps have been completed, i.e. definition of the [genome](#genome), [mutation](#mutation-matrix) and [selection](#selection), the population can be constructed. A population is defined strictly by a name, a size, a sexual system (dioecy or hermaphodite), and the three objects defined previously: genome, mutation matrix and selection. In addition to that, it is possible to define * a selfing rate (by default equal to 0) * a vector of initial genotypic frequencies * a demography Two demographic regimes are possible: no demography, i.e. a fixed population size, or demography, i.e. a population where the size fluctuates stochastically. The boolean argument `demography` is used to define whether there should be stochasticity. For a fixed population size, it is therefore sufficient to define that `demography = FALSE` (default) and to set the desired population size with the `popSize` parameter. For a fluctuating demography, `demography` must be `TRUE` and three other parameters are then needed: the initial population size (`initPopSize`), the population growth rate (`growthRate`) and the carrying capacity of the population (the population size, `popSize`). It is also possible to avoid defining a population size altogether, by setting off the genetic drift (`drift` parameter). This will allow the model to be simulated deterministically. ## Construction Definition of a population in its simplest form: ```{r} DL <- list(dl = c("A", "a")) HL <- list(hl = c("B", "b")) mutations <- list( mutation(from = "A", to = "a", rate = 1e-3), mutation(from = "B", to = "b", rate = 1e-3) ) genomeObj <- setGenome(listHapLoci = HL, listDipLoci = DL) pop <- setPopulation( name = "A", size = 1000, dioecy = TRUE, genomeObj = genomeObj, selectionObj = setSelectNeutral(genomeObj), mutMatrixObj = setMutationMatrix(genomeObj, mutations = mutations) ) pop ``` More information on the population can be accessed using the `print` method; ```{r} print(pop) ``` To simulate it is then necessary to go through the metapopulation class, even if only one population is simulated. More information below. # Metapopulation ## Definition A metapopulation is a set of population(s) (from 1) that are simulated with potential migration between them. Only genotypes can migrate, i.e. adult individuals. ## Construction The construction of a `Metapopulation` object requires only two arguments (one optional). The first is a population(s) list, defined from the population class. The second is a migration matrix, which connects the populations together. This matrix is a probability matrix (square with the sum of the rows equal to 1, whose size is equal to the number of populations) where each value corresponds to the proportion of individuals (genotypes) that disperse from their source population (row) to their target population (column). ## Simulate The `nameOutFunct` (optionnal) argument is the name of a function that allow to produce a custom output for a simulation. It is called each generation in each population of a simulation and systematically returns a list with the first element being a logical that indicates whether something should be saved. If so, the second element of this list will be saved (no need to add a second element the first is `FALSE`). The custom output function must take only one argument, `pop`. This argument is a list of : * `customOutput` : list of all previous savings * `gen` : generation * `freqGeno` : list of genotypic frequency matrices (matrix 1 x # genotypes). The list is constructed as follows: if the population is hermaphroditic it has only one element "ind", if the population is dioecious it has three elements, "female", "male" and "ind" which correspond respectively to the genotypic frequencies of the females, the males and the average of the two (assuming a sex ratio of 50:50). * `freqHaplo` : list of genotypic frequency matrices (matrix 1 x # haplotypes). The list is constructed in the same way as for genotypic frequencies (see above). * `freqAlleles` : list of allelic frequency matrices (matrix 1 x # alleles). The list is constructed in the same way as for genotypic frequencies (see above). An example of such a function could be : ```{r} customOutFunct <- function(pop) { if (pop$freqAlleles[4] < 0.1 | pop$freqAlleles[4] > 0.9) { return(list(TRUE, a = list(gen = pop$gen, freq = pop$freqAlleles[4]))) } return(list(FALSE)) } ``` Here the function returns orders to save the frequency of the fourth allele of the model if its frequency is between 0 and 0.1 or between 0.9 and 1. After defining it in this way, one will have to give `"customOutFunct"` as the value of the `nameOutFunct` parameter in the `simulate` method of the class `Metapopulation` for it to be taken into account. # Example of building a metapopulation and generating results For example, consider a model with a single locus with two alleles, *A* and *a*, where the *a* allele has a small positive fitness effect. The genome: ```{r} DL <- list(dl = c("A", "a")) genomeObj <- setGenome(listDipLoci = DL) print(genomeObj) ``` *This warning is normal when working only with diploid (or only haploid) loci.* The mutation matrix: ```{r} mutMatrixObj <- setMutationMatrix( genomeObj = genomeObj, forwardMut = 1e-3, backwardMut = 1e-3 ) print(mutMatrixObj) ``` The object of selection is then created which describes that allele `A` is beneficial in population 1 while allele `a` is beneficial in population 2. This implements a selection differential between the two populations, a form of local selection. ```{r} # Selection in population 1 indFit <- list( 1 + h * s ~ a, 1 + s ~ h(a) ) h <- 0.5 s <- 0.05 selectionObj1 <- setSelectOnInds( genomeObj = genomeObj, indFit = indFit ) print(selectionObj1) # Selection in population 2 indFit <- list( 1 + h * s ~ A, 1 + s ~ h(A) ) h <- 0.5 s <- 0.05 selectionObj2 <- setSelectOnInds( genomeObj = genomeObj, indFit = indFit ) print(selectionObj2) ``` We can now define a migration rate of 0.005 from each of the two populations to the other. ```{r} migMat <- matrix(c( 0.995, 0.005, 0.005, 0.995 ), 2, 2) ``` And finally create the `Metapopulation` object ```{r} metapop <- setMetapopulation( populations = list( setPopulation( name = "pop1", size = 500, dioecy = F, genomeObj = genomeObj, selectionObj = selectionObj1, mutMatrixObj = mutMatrixObj, selfRate = 0.5 ), setPopulation( name = "pop2", size = 500, dioecy = F, genomeObj = genomeObj, selectionObj = selectionObj2, mutMatrixObj = mutMatrixObj, selfRate = 0.5 ) ), migMat = migMat ) metapop ``` The print `method` allows access to all the characteristics of an object of class `Metapopulation` ```{r} print(metapop) ``` The model is then simulated 100 times recording the population state every 50 generations up to 800 generations. The `seed` is specified to ensure reproducibility of the results obtained. ```{r} nsim <- 100 metapop <- simulate(metapop, nsim = nsim, seed = 123, recording = TRUE, recordGenGap = 50, threshold = 800 ) metapopDeterminist <- simulate(metapop, drift = FALSE, seed = 123, recording = TRUE, threshold = 800 ) ``` Then we sum up the recording data frames of each simulation and divide them by the number of simulations (which results in an overall average of the data frames): ```{r} rec <- getRecords(metapop) recMean <- Reduce( function(x, y) { x$pop1 <- x$pop1 + y$pop1 x$pop2 <- x$pop2 + y$pop2 x }, rec ) recMean$pop1 <- recMean$pop1 / nsim recMean$pop2 <- recMean$pop2 / nsim recDeterminist <- getRecords(metapopDeterminist)$s1 ``` This creates a mutation - selection - drift - migration equilibrium visible on these graphs. Dashed lines indicate deterministic predictions (without drift). ```{r, dpi = 100, fig.width = 7, fig.asp = 2} par(mfrow = c(2, 1)) ### Allelic frequency plot(c(), xlim = c(0, 800), ylim = c(0, 1), col = "blue", xlab = "Generation\n(blue for A, red for B)", ylab = "Frequency of a" ) # Stochastic for (i in seq_len(nsim)) { lines(rec[[i]]$pop1$gen, rec[[i]]$pop1$a, lty = "longdash", col = "lightblue") lines(rec[[i]]$pop2$gen, rec[[i]]$pop2$a, lty = "longdash", col = "lightpink") } lines(recMean$pop1$gen, recMean$pop1$a, lty = "longdash", col = "blue") lines(recMean$pop2$gen, recMean$pop2$a, lty = "longdash", col = "red") # Deterministic lines(recDeterminist$pop1$gen, recDeterminist$pop1$a, lty = "longdash", col = "blue", lwd = 2) lines(recDeterminist$pop2$gen, recDeterminist$pop2$a, lty = "longdash", col = "red", lwd = 2) # Mean stochastic lines(recMean$pop1$gen, recMean$pop1$a, col = "blue") lines(recMean$pop2$gen, recMean$pop2$a, col = "red") ### Mean fitness plot(c(), xlim = c(0, 800), ylim = c(1, 1 + s), col = "blue", xlab = "Generation\n(blue for A, red for B)", ylab = "Frequency of a" ) # Stochastic for (i in seq_len(nsim)) { lines(rec[[i]]$pop1$gen, rec[[i]]$pop1$indMeanFit, lty = "longdash", col = "lightblue") lines(rec[[i]]$pop2$gen, rec[[i]]$pop2$indMeanFit, lty = "longdash", col = "lightpink") } lines(recMean$pop1$gen, recMean$pop1$indMeanFit, lty = "longdash", col = "blue") lines(recMean$pop2$gen, recMean$pop2$indMeanFit, lty = "longdash", col = "red") # Deterministic lines(recDeterminist$pop1$gen, recDeterminist$pop1$indMeanFit, lty = "longdash", col = "blue", lwd = 2) lines(recDeterminist$pop2$gen, recDeterminist$pop2$indMeanFit, lty = "longdash", col = "red", lwd = 2) # Mean stochastic lines(recMean$pop1$gen, recMean$pop1$indMeanFit, col = "blue") lines(recMean$pop2$gen, recMean$pop2$indMeanFit, col = "red") ``` The average fitnesses in the populations equalise while the allelic frequencies are maintained at intermediate values but still as close as possible to the optimum in both populations.