Title: | Pedigree Inference from SNPs |
---|---|
Description: | Multi-generational pedigree inference from incomplete data on hundreds of SNPs, including parentage assignment and sibship clustering. See Huisman (2017) (<DOI:10.1111/1755-0998.12665>) for more information. |
Authors: | Jisca Huisman [aut, cre] |
Maintainer: | Jisca Huisman <[email protected]> |
License: | GPL-2 |
Version: | 2.11.2 |
Built: | 2024-12-25 06:58:05 UTC |
Source: | CRAN |
Estimate the probability that an individual with unknown birth
year is born in year y
, based on BirthYears
or BY.min
and/or BY.max
of its parents, offspring, and siblings, combined with
AgePrior
(the age distribution of other parent-offspring pairs),
and/or Year.last
of its parents.
CalcBYprobs(Pedigree = NULL, LifeHistData = NULL, AgePrior = NULL)
CalcBYprobs(Pedigree = NULL, LifeHistData = NULL, AgePrior = NULL)
Pedigree |
dataframe with columns id-dam-sire. |
LifeHistData |
data.frame with up to 6 columns:
"Birth year" may be in any arbitrary discrete time unit relevant to the species (day, month, decade), as long as parents are never born in the same time unit as their offspring, and only integers are used. Individuals do not need to be in the same order as in ‘GenoM’, nor do all genotyped individuals need to be included. |
AgePrior |
a matrix with probability ratios for individuals with age
difference A to have relationship R, as generated by
|
This function assists in estimating birth years of individuals for which these are unknown, provided they have at least one parent or one offspring in the pedigree. It is not a substitute for field-based estimates of age, only a method to summarise the pedigree + birth year based information.
A matrix with for each individual (rows) in the pedigree that has a missing
birth year in LifeHistData
, or that is not included in
LifeHistData
, the probability that it is born in y
(columns).
Probabilities are rounded to 3 decimal points and may therefore not sum
exactly to 1.
Any errors in the pedigree or lifehistory data will cause errors in the birth year probabilities of their parents and offspring, and putatively also of more distant ancestors and descendants. If the ageprior is based on the same erroneous pedigree and lifehistory data, all birth year probabilities will be affected.
MakeAgePrior
to estimate effect of age on
relationships.
BYprobs <- CalcBYprobs(Pedigree = SeqOUT_griffin$Pedigree, LifeHistData = SeqOUT_griffin$LifeHist) ## Not run: # heatmap lattice::levelplot(t(BYprobs), aspect="fill", col.regions=hcl.colors) ## End(Not run)
BYprobs <- CalcBYprobs(Pedigree = SeqOUT_griffin$Pedigree, LifeHistData = SeqOUT_griffin$LifeHist) ## Not run: # heatmap lattice::levelplot(t(BYprobs), aspect="fill", col.regions=hcl.colors) ## End(Not run)
Calculate the maximum expected number of mismatches for duplicate samples, parent-offspring pairs, and parent-parent-offspring trios.
CalcMaxMismatch(Err, MAF, ErrFlavour = "version2.9", qntl = 1 - 1e-05)
CalcMaxMismatch(Err, MAF, ErrFlavour = "version2.9", qntl = 1 - 1e-05)
Err |
estimated genotyping error rate, as a single number or 3x3 matrix (averaged value(s) across SNPs), or a vector with the same length as MAF, or a nSnp x 3 x 3 array. If a matrix, this should be the probability of observed genotype (columns) conditional on actual genotype (rows). Each row must therefore sum to 1. If an array, each 3x3 slice should abide this rule. |
MAF |
vector with minor allele frequency at each SNP. |
ErrFlavour |
function that takes |
qntl |
quantile of binomial distribution to be used as the maximum, of
individual-level probability. For a desired dataset-level probability
quantile |
The thresholds for maximum number of mismatches calculated here aim
to minimise false negatives, i.e. to minimise the chance that any true
duplicates or true parent-offspring pairs are already excluded during the
filtering steps where these MaxMismatch
values are used.
Consequently, there is a high probability of false positives, i.e. it is
likely that some sample pairs with fewer mismatches than the
MaxMismatch
threshold, are in fact not duplicate samples or
parent-offspring pairs. Use of these MaxMismatch
thresholds is
therefore only the first step of pedigree reconstruction by
sequoia
.
A vector with three integers:
DUP |
Maximum number of differences between 2 samples from the same individual |
OH |
Maximum number of Opposing Homozygous SNPs between a true parent-offspring pair |
ME |
Maximum number of Mendelian Errors among a true parent-parent- offspring trio |
.
CalcMaxMismatch(Err = 0.05, MAF = runif(n=100, min=0.3, max=0.5)) ## Not run: CalcMaxMismatch(Err = 0.02, MAF = SnpStats(MyGenoMatrix, Plot=FALSE)[,"AF"]) ## End(Not run)
CalcMaxMismatch(Err = 0.05, MAF = runif(n=100, min=0.3, max=0.5)) ## Not run: CalcMaxMismatch(Err = 0.02, MAF = SnpStats(MyGenoMatrix, Plot=FALSE)[,"AF"]) ## End(Not run)
Count opposite homozygous (OH) loci between parent-offspring pairs and Mendelian errors (ME) between parent-parent-offspring trios, and calculate the parental log-likelihood ratios (LLR).
CalcOHLLR( Pedigree = NULL, GenoM = NULL, CalcLLR = TRUE, LifeHistData = NULL, AgePrior = FALSE, SeqList = NULL, Err = 1e-04, ErrFlavour = "version2.9", Tassign = 0.5, Tfilter = -2, Complex = "full", Herm = "no", quiet = FALSE )
CalcOHLLR( Pedigree = NULL, GenoM = NULL, CalcLLR = TRUE, LifeHistData = NULL, AgePrior = FALSE, SeqList = NULL, Err = 1e-04, ErrFlavour = "version2.9", Tassign = 0.5, Tfilter = -2, Complex = "full", Herm = "no", quiet = FALSE )
Pedigree |
dataframe with columns id-dam-sire. May include
non-genotyped individuals, which will be treated as dummy individuals. If
provided, any pedigree in |
GenoM |
numeric matrix with genotype data: One row per individual,
one column per SNP, coded as 0, 1, 2, missing values as a negative number
or NA. You can reformat data with |
CalcLLR |
calculate log-likelihood ratios for all assigned parents
(genotyped + dummy/non-genotyped; parent vs. otherwise related). If
|
LifeHistData |
data.frame with up to 6 columns:
"Birth year" may be in any arbitrary discrete time unit relevant to the species (day, month, decade), as long as parents are never born in the same time unit as their offspring, and only integers are used. Individuals do not need to be in the same order as in ‘GenoM’, nor do all genotyped individuals need to be included. |
AgePrior |
logical ( |
SeqList |
list with output from |
Err |
estimated genotyping error rate, as a single number, or a length 3
vector with P(hom|hom), P(het|hom), P(hom|het), or a 3x3 matrix. See
details below. The error rate is presumed constant across SNPs, and
missingness is presumed random with respect to actual genotype. Using
|
ErrFlavour |
function that takes |
Tassign |
minimum LLR required for acceptance of proposed relationship, relative to next most likely relationship. Higher values result in more conservative assignments. Must be zero or positive. |
Tfilter |
threshold log10-likelihood ratio (LLR) between a proposed relationship versus unrelated, to select candidate relatives. Typically a negative value, related to the fact that unconditional likelihoods are calculated during the filtering steps. More negative values may decrease non-assignment, but will increase computational time. |
Complex |
Breeding system complexity. Either "full" (default), "simp" (simplified, no explicit consideration of inbred relationships), "mono" (monogamous). |
Herm |
Hermaphrodites, either "no", "A" (distinguish between dam and sire role, default if at least 1 individual with sex=4), or "B" (no distinction between dam and sire role). Both of the latter deal with selfing. |
quiet |
logical, suppress messages |
Any individual in Pedigree
that does not occur in
GenoM
is substituted by a dummy individual; these can be recognised
by the value 0' in columns 'SNPd.id.dam' and 'SNPd.id.sire' in the output.
For non-genotyped individuals the parental log-likelihood ratio can be
calculated if they have at least one genotyped offspring (see also
getAssignCat
).
The birth years in LifeHistData
and the AgePrior
are not used
in the calculation and do not affect the value of the likelihoods for the
various relationships, but they _are_ used during some filtering steps, and
may therefore affect the likelihood _ratio_. The default
(AgePrior=FALSE
) assumes all age-relationship combinations are
possible, which may mean that some additional alternatives are considered
compared to the sequoia
default, resulting in somewhat lower
LLR
values.
A negative LLR for A's parent B indicates either that B is not truely the parent of A, or that B's parents are incorrect. The latter may cause B's presumed true, unobserved genotype to divert from its observed genotype, with downstream consequences for its offspring. In rare cases it may also be due to 'weird', non-implemented double or triple relationships between A and B.
The Pedigree
dataframe with additional columns:
LLRdam |
Log10-Likelihood Ratio (LLR) of this female being the mother, versus the next most likely relationship between the focal individual and this female (see Details for relationships considered) |
LLRsire |
idem, for male parent |
LLRpair |
LLR for the parental pair, versus the next most likely configuration between the three individuals (with one or neither parent assigned) |
OHdam |
Number of loci at which the offspring and mother are opposite homozygotes |
OHsire |
idem, for father |
MEpair |
Number of Mendelian errors between the offspring and the parent pair, includes OH as well as e.g. parents being opposing homozygotes, but the offspring not being a heterozygote. The offspring being OH with both parents is counted as 2 errors. |
SNPd.id |
Number of SNPs scored (non-missing) for the focal individual |
SNPd.id.dam |
Number of SNPs scored (non-missing) for both individual and dam |
SNPd.id.sire |
Number of SNPs scored for both individual and sire |
Sexx |
Sex in LifeHistData, or inferred Sex when assigned as part of parent-pair |
BY.est |
mode of birth year probability distribution |
BY.lo |
lower limit of 95% highest density region of birth year probability distribution |
BY.hi |
higher limit |
The columns 'LLRdam', 'LLRsire' and 'LLRpair' are only included when
CalcLLR=TRUE
. When a parent or parent-pair is incompatible with the
lifehistory data or presumed genotyping error rate, the error value '777' may
be given.
The columns 'Sexx', 'BY.est', 'BY.lo' and 'BY.hi' are only included when
LifeHistData
is provided, and at least one genotyped individual has an
unknown birth year or unknown sex.
SummarySeq
for visualisation of OH & LLR
distributions; CalcPairLL
for the likelihoods underlying the
LLR, GenoConvert
to read in various genotype data formats,
CheckGeno
; PedPolish
to check and 'polish' the
pedigree; getAssignCat
to find which id-parent pairs are both
genotyped or can be substituted by dummy individuals; sequoia
for pedigree reconstruction.
# count Mendelian errors in an existing pedigree Ped.OH <- CalcOHLLR(Pedigree = Ped_HSg5, GenoM = SimGeno_example, CalcLLR = FALSE) Ped.OH[50:55,] # view histograms SummarySeq(Ped.OH, Panels="OH") # Parent likelihood ratios in an existing pedigree, including for # non-genotyped parents Ped.LLR <- CalcOHLLR(Pedigree = Ped_HSg5, GenoM = SimGeno_example, CalcLLR = TRUE, LifeHistData=LH_HSg5, AgePrior=TRUE) SummarySeq(Ped.LLR, Panels="LLR") ## Not run: # likelihood ratios change with presumed genotyping error rate: Ped.LLR.B <- CalcOHLLR(Pedigree = Ped_HSg5, GenoM = SimGeno_example, CalcLLR = TRUE, LifeHistData=LH_HSg5, AgePrior=TRUE, Err = 0.005) SummarySeq(Ped.LLR.B, Panels="LLR") # run sequoia with CalcLLR=FALSE, and add OH + LLR later: SeqOUT <- sequoia(Geno_griffin, LH_griffin, CalcLLR=FALSE,quiet=TRUE,Plot=FALSE) PedA <- CalcOHLLR(Pedigree = SeqOUT[["Pedigree"]][, 1:3], GenoM = Genotypes, LifeHistData = LH_griffin, AgePrior = TRUE, Complex = "full") SummarySeq(PedA, Panels=c("LLR", "OH")) ## End(Not run)
# count Mendelian errors in an existing pedigree Ped.OH <- CalcOHLLR(Pedigree = Ped_HSg5, GenoM = SimGeno_example, CalcLLR = FALSE) Ped.OH[50:55,] # view histograms SummarySeq(Ped.OH, Panels="OH") # Parent likelihood ratios in an existing pedigree, including for # non-genotyped parents Ped.LLR <- CalcOHLLR(Pedigree = Ped_HSg5, GenoM = SimGeno_example, CalcLLR = TRUE, LifeHistData=LH_HSg5, AgePrior=TRUE) SummarySeq(Ped.LLR, Panels="LLR") ## Not run: # likelihood ratios change with presumed genotyping error rate: Ped.LLR.B <- CalcOHLLR(Pedigree = Ped_HSg5, GenoM = SimGeno_example, CalcLLR = TRUE, LifeHistData=LH_HSg5, AgePrior=TRUE, Err = 0.005) SummarySeq(Ped.LLR.B, Panels="LLR") # run sequoia with CalcLLR=FALSE, and add OH + LLR later: SeqOUT <- sequoia(Geno_griffin, LH_griffin, CalcLLR=FALSE,quiet=TRUE,Plot=FALSE) PedA <- CalcOHLLR(Pedigree = SeqOUT[["Pedigree"]][, 1:3], GenoM = Genotypes, LifeHistData = LH_griffin, AgePrior = TRUE, Complex = "full") SummarySeq(PedA, Panels=c("LLR", "OH")) ## End(Not run)
For each specified pair of individuals, calculate the log10-likelihoods of being PO, FS, HS, GP, FA, HA, U (see Details). Individuals must be genotyped or have at least one genotyped offspring.
NOTE values are various
NA
types, see 'Likelihood
special codes' in 'Value' section below.
CalcPairLL( Pairs = NULL, GenoM = NULL, Pedigree = NULL, LifeHistData = NULL, AgePrior = TRUE, SeqList = NULL, Complex = "full", Herm = "no", Err = 1e-04, ErrFlavour = "version2.9", Tassign = 0.5, Tfilter = -2, quiet = FALSE, Plot = TRUE )
CalcPairLL( Pairs = NULL, GenoM = NULL, Pedigree = NULL, LifeHistData = NULL, AgePrior = TRUE, SeqList = NULL, Complex = "full", Herm = "no", Err = 1e-04, ErrFlavour = "version2.9", Tassign = 0.5, Tfilter = -2, quiet = FALSE, Plot = TRUE )
Pairs |
dataframe with columns
|
GenoM |
numeric matrix with genotype data: One row per individual,
one column per SNP, coded as 0, 1, 2, missing values as a negative number
or NA. You can reformat data with |
Pedigree |
dataframe with columns id-dam-sire; likelihoods will be calculated conditional on the pedigree. May include non-genotyped individuals, which will be treated as dummy individuals. |
LifeHistData |
data.frame with up to 6 columns:
"Birth year" may be in any arbitrary discrete time unit relevant to the species (day, month, decade), as long as parents are never born in the same time unit as their offspring, and only integers are used. Individuals do not need to be in the same order as in ‘GenoM’, nor do all genotyped individuals need to be included. |
AgePrior |
logical ( |
SeqList |
list with output from |
Complex |
Breeding system complexity. Either "full" (default), "simp" (simplified, no explicit consideration of inbred relationships), "mono" (monogamous). |
Herm |
Hermaphrodites, either "no", "A" (distinguish between dam and sire role, default if at least 1 individual with sex=4), or "B" (no distinction between dam and sire role). Both of the latter deal with selfing. |
Err |
estimated genotyping error rate, as a single number, or a length 3
vector with P(hom|hom), P(het|hom), P(hom|het), or a 3x3 matrix. See
details below. The error rate is presumed constant across SNPs, and
missingness is presumed random with respect to actual genotype. Using
|
ErrFlavour |
function that takes |
Tassign |
minimum LLR required for acceptance of proposed relationship, relative to next most likely relationship. Higher values result in more conservative assignments. Must be zero or positive. |
Tfilter |
threshold log10-likelihood ratio (LLR) between a proposed relationship versus unrelated, to select candidate relatives. Typically a negative value, related to the fact that unconditional likelihoods are calculated during the filtering steps. More negative values may decrease non-assignment, but will increase computational time. |
quiet |
logical, suppress messages |
Plot |
logical, display scatter plots by |
The same pair may be included multiple times, e.g. with different sex, age difference, or focal relationship, to explore their effect on the likelihoods. Likelihoods are only calculated for relationships that are possible given the age difference, e.g. PO (parent-offspring) is not calculated for pairs with an age difference of 0.
Non-genotyped individuals can be included if they have at least one
genotyped offspring and can be turned into a dummy (see
getAssignCat
); to establish this a pedigree must be provided.
Warning 1: There is no check whether the input pedigree is genetically
sensible, it is simply conditioned upon. Checking whether a pedigree is
compatible with the SNP data can be done with CalcOHLLR
.
Warning 2: Conditioning on a Pedigree
can make computation
orders of magnitude slower.
The Pairs
dataframe including all optional columns listed
above, plus the additional columns:
xx |
Log10-Likelihood of this pair having relationship xx, with xx being the relationship abbreviations listed below. |
TopRel |
Abbreviation of most likely relationship |
LLR |
Log10-Likelihood ratio between most-likely and second most likely relationships. Other LLRs, e.g. between most-likely and unrelated, can easily be computed. |
Relationship abbreviations:
PO |
Parent - offspring |
FS |
Full siblings |
HS |
Half siblings |
GP |
Grandparent |
FA |
Full avuncular |
HA |
Half avuncular and other 3rd degree relationships |
U |
Unrelated |
2nd |
Unclear which type of 2nd degree relatives (HS, GP, or FA) |
?? |
Unclear which type of 1st, 2nd or 3rd degree relatives |
Likelihood special codes:
222 |
Maybe (via) other parent (e.g. focal="GP", but as likely to be maternal as paternal grandparent, and therefore not assignable) |
333 |
Excluded from comparison (shouldn't occur) |
444 |
Not implemented (e.g. would create an odd double/triple relationship in combination with the provided pedigree) |
777 |
Impossible (e.g. cannot be both full sibling and grandparent) |
888 |
Already assigned in the provided pedigree (see |
999 |
NA |
This function uses the same machinery as sequoia
, which will to save
time not calculate the likelihood when it is quickly obvious that the pair
cannot be related in the specified manner.
For PO (putative parent-offspring pairs) this is the case when:
the sex of the candidate parent, via Pairs$Sex2
or
LifeHistData
, does not match Pairs$patmat
, which defaults
to 1 (maternal relatives, i.e. dam)
a dam is already assigned via Pedigree
and Pairs$dropPar1
='none'
, and Pairs$patmat = 1
Pairs$focal
is not 'U' (the default), and the OH count between the
two individuals exceeds MaxMismatchOH. This value can be found in
SeqList$Specs
), and is calculated by CalcMaxMismatch
the age difference, either calculated from LifeHistData
or
specified via Pairs$AgeDif
, is impossible for a parent-offspring
pair according to the age prior. The latter can be specified via
AgePrior
, or is taken from SeqList
, or is calculated when
both Pedigree
and LifeHistData
are provided.
For FS (putative full siblings) this happens when e.g. ID1 has a dam
assigned which is not dropped (Pairs$dropPar1='none'
or
'sire'
), and the OH count between ID1's dam and ID2 exceeds
MaxMismatchOH. The easiest way to 'fix' this is by increasing the presumed
genotyping error rate.
Especially when Complex='full', not only the seven relationship alternatives listed above are considered, but a whole range of possible double and even triple relationships. For example, mother A and offspring B (PO) may also be paternal half-siblings (HS, A and A's mother mated with same male), grandmother and grand-offspring (GP, B's father is A's son), or paternal aunt (B's father is a full or half sib of A).
The likelihood reported as 'LL_PO' is the most-likely one of the possible alternatives, among those that are not impossible due to age differences or due to the pedigree (as reconstructed up to that point). Whether e.g. the likelihood to be both PO & HS is counted as PO or as HS, depends on the situation and is determined by the variable 'focal': During parentage assignment, it is counted as PO but not HS, while during sibship clustering, it is counted as HS but not PO – not omitting from the alternative relationship would result in a deadlock.
PlotPairLL
to plot alternative relationship pairs from
the output; CalcOHLLR
to calculate LLR for parents &
parent-pairs in a pedigree; GetRelM
to find all pairwise
relatives according to the pedigree; GetMaybeRel
to get
likely relative pairs not in the pedigree.
CalcPairLL(Pairs = data.frame(ID1='i116_2006_M', ID2='i119_2006_M'), GenoM = Geno_griffin, Err = 1e-04, Plot=FALSE) ## likelihoods underlying parent LLR in pedigree: # Example: dams for bottom 3 individuals tail(SeqOUT_griffin$PedigreePar, n=3) # set up dataframe with these pairs. LLRdam & LLRsire ignore any co-parent Pairs_d <- data.frame(ID1 = SeqOUT_griffin$PedigreePar$id[140:142], ID2 = SeqOUT_griffin$PedigreePar$dam[140:142], focal = "PO", dropPar1 = 'both') # Calculate LL's, conditional on the rest of the pedigree + age differences CalcPairLL(Pairs_d, GenoM = Geno_griffin, Err = 1e-04, LifeHistData = LH_griffin, Pedigree = SeqOUT_griffin$PedigreePar) # LLR changes when ignoring age and/or pedigree, as different relationships # become (im)possible CalcPairLL(Pairs_d, GenoM = Geno_griffin, Err = 1e-04) # LLRpair is calculated conditional on co-parent, and min. of dam & sire LLR Pairs_d$dropPar1 <- 'dam' Pairs_s <- data.frame(ID1 = SeqOUT_griffin$PedigreePar$id[141:142], ID2 = SeqOUT_griffin$PedigreePar$sire[141:142], focal = "PO", dropPar1 = 'sire') CalcPairLL(rbind(Pairs_d, Pairs_s), GenoM = Geno_griffin, Err = 1e-04, LifeHistData = LH_griffin, Pedigree = SeqOUT_griffin$PedigreePar) ## likelihoods underlying LLR in getMaybeRel output: MaybeRel_griffin$MaybePar[1:5, ] FivePairs <- MaybeRel_griffin$MaybePar[1:5, c("ID1", "ID2", "Sex1", "Sex2")] PairLL <- CalcPairLL(Pairs = rbind( cbind(FivePairs, focal = "PO"), cbind(FivePairs, focal = "HS"), cbind(FivePairs, focal = "GP")), GenoM = Geno_griffin, Plot=FALSE) PairLL[PairLL$ID1=="i121_2007_M", ] # LL(FS)==222 : HSHA, HSGP, FAHA more likely than FS # LL(GP) higher when focal=HS: GP via 'other' parent also considered # LL(FA) higher when focal=PO: FAHA, or FS of 'other' parent
CalcPairLL(Pairs = data.frame(ID1='i116_2006_M', ID2='i119_2006_M'), GenoM = Geno_griffin, Err = 1e-04, Plot=FALSE) ## likelihoods underlying parent LLR in pedigree: # Example: dams for bottom 3 individuals tail(SeqOUT_griffin$PedigreePar, n=3) # set up dataframe with these pairs. LLRdam & LLRsire ignore any co-parent Pairs_d <- data.frame(ID1 = SeqOUT_griffin$PedigreePar$id[140:142], ID2 = SeqOUT_griffin$PedigreePar$dam[140:142], focal = "PO", dropPar1 = 'both') # Calculate LL's, conditional on the rest of the pedigree + age differences CalcPairLL(Pairs_d, GenoM = Geno_griffin, Err = 1e-04, LifeHistData = LH_griffin, Pedigree = SeqOUT_griffin$PedigreePar) # LLR changes when ignoring age and/or pedigree, as different relationships # become (im)possible CalcPairLL(Pairs_d, GenoM = Geno_griffin, Err = 1e-04) # LLRpair is calculated conditional on co-parent, and min. of dam & sire LLR Pairs_d$dropPar1 <- 'dam' Pairs_s <- data.frame(ID1 = SeqOUT_griffin$PedigreePar$id[141:142], ID2 = SeqOUT_griffin$PedigreePar$sire[141:142], focal = "PO", dropPar1 = 'sire') CalcPairLL(rbind(Pairs_d, Pairs_s), GenoM = Geno_griffin, Err = 1e-04, LifeHistData = LH_griffin, Pedigree = SeqOUT_griffin$PedigreePar) ## likelihoods underlying LLR in getMaybeRel output: MaybeRel_griffin$MaybePar[1:5, ] FivePairs <- MaybeRel_griffin$MaybePar[1:5, c("ID1", "ID2", "Sex1", "Sex2")] PairLL <- CalcPairLL(Pairs = rbind( cbind(FivePairs, focal = "PO"), cbind(FivePairs, focal = "HS"), cbind(FivePairs, focal = "GP")), GenoM = Geno_griffin, Plot=FALSE) PairLL[PairLL$ID1=="i121_2007_M", ] # LL(FS)==222 : HSHA, HSGP, FAHA more likely than FS # LL(GP) higher when focal=HS: GP via 'other' parent also considered # LL(FA) higher when focal=PO: FAHA, or FS of 'other' parent
Morph pedigree into a kinship2 compatible format and use
kinship
to calculate kinship coefficients;
relatedness = 2*kinship.
CalcRped(Pedigree, OUT = "DF")
CalcRped(Pedigree, OUT = "DF")
Pedigree |
dataframe with columns id-dam-sire. |
OUT |
desired output format, 'M' for matrix or 'DF' for dataframe with columns IID1 - IID2 - R.ped. |
A matrix or dataframe.
Check that the provided genotype matrix is in the correct format, and check for low call rate samples and SNPs.
CheckGeno( GenoM, quiet = FALSE, Plot = FALSE, Return = "GenoM", Strict = TRUE, DumPrefix = c("F0", "M0") )
CheckGeno( GenoM, quiet = FALSE, Plot = FALSE, Return = "GenoM", Strict = TRUE, DumPrefix = c("F0", "M0") )
GenoM |
the genotype matrix. |
quiet |
suppress messages. |
Plot |
display the plots of |
Return |
either 'GenoM' to return the cleaned-up genotype matrix, or 'excl' to return a list with excluded SNPs and individuals (see Value). |
Strict |
Exclude any individuals genotyped for <5 genotyped for <5 up to version 2.4.1. Otherwise only excluded are (very nearly) monomorphic SNPs, SNPs scored for fewer than 2 individuals, and individuals scored for fewer than 2 SNPs. |
DumPrefix |
length 2 vector, to check if these don't occur among genotyped individuals. |
If Return='excl'
a list with, if any are found:
ExcludedSNPs |
SNPs scored for <10
excluded when running |
ExcludedSnps-mono |
monomorphic (fixed) SNPs; automatically excluded
when running |
ExcludedIndiv |
Individuals scored for <5 reliably included during pedigree reconstruction. Individual call rate is calculated after removal of 'Excluded SNPs' |
Snps-LowCallRate |
SNPs scored for 10 recommended to be filtered out |
Indiv-LowCallRate |
individuals scored for <50 recommended to be filtered out |
When Return='excl'
the return is invisible
, i.e. a check
is run and warnings or errors are always displayed, but nothing may be
returned.
Appropriate call rate thresholds for SNPs and individuals depend on the total number of SNPs, distribution of call rates, genotyping errors, and the proportion of candidate parents that are SNPd (sibship clustering is more prone to false positives). Note that filtering first on SNP call rate tends to keep more individuals in.
SnpStats
to calculate SNP call rates;
CalcOHLLR
to count the number of SNPs scored in both focal
individual and parent.
GenoM <- SimGeno(Ped_HSg5, nSnp=400, CallRate = runif(400, 0.2, 0.8)) # the quick way: GenoM.checked <- CheckGeno(GenoM, Return="GenoM") # the user supervised way: Excl <- CheckGeno(GenoM, Return = "excl") GenoM.orig <- GenoM # make a 'backup' copy if ("ExcludedSnps" %in% names(Excl)) GenoM <- GenoM[, -Excl[["ExcludedSnps"]]] if ("ExcludedSnps-mono" %in% names(Excl)) GenoM <- GenoM[, -Excl[["ExcludedSnps-mono"]]] if ("ExcludedIndiv" %in% names(Excl)) GenoM <- GenoM[!rownames(GenoM) %in% Excl[["ExcludedIndiv"]], ] # warning about SNPs scored for <50% of individuals ? # note: this is not necessarily a problem, and sometimes unavoidable. SnpCallRate <- apply(GenoM, MARGIN=2, FUN = function(x) sum(x!=-9)) / nrow(GenoM) hist(SnpCallRate, breaks=50, col="grey") GenoM <- GenoM[, SnpCallRate > 0.6] # to filter out low call rate individuals: (also not necessarily a problem) IndivCallRate <- apply(GenoM, MARGIN=1, FUN = function(x) sum(x!=-9)) / ncol(GenoM) hist(IndivCallRate, breaks=50, col="grey") GoodSamples <- rownames(GenoM)[ IndivCallRate > 0.8]
GenoM <- SimGeno(Ped_HSg5, nSnp=400, CallRate = runif(400, 0.2, 0.8)) # the quick way: GenoM.checked <- CheckGeno(GenoM, Return="GenoM") # the user supervised way: Excl <- CheckGeno(GenoM, Return = "excl") GenoM.orig <- GenoM # make a 'backup' copy if ("ExcludedSnps" %in% names(Excl)) GenoM <- GenoM[, -Excl[["ExcludedSnps"]]] if ("ExcludedSnps-mono" %in% names(Excl)) GenoM <- GenoM[, -Excl[["ExcludedSnps-mono"]]] if ("ExcludedIndiv" %in% names(Excl)) GenoM <- GenoM[!rownames(GenoM) %in% Excl[["ExcludedIndiv"]], ] # warning about SNPs scored for <50% of individuals ? # note: this is not necessarily a problem, and sometimes unavoidable. SnpCallRate <- apply(GenoM, MARGIN=2, FUN = function(x) sum(x!=-9)) / nrow(GenoM) hist(SnpCallRate, breaks=50, col="grey") GenoM <- GenoM[, SnpCallRate > 0.6] # to filter out low call rate individuals: (also not necessarily a problem) IndivCallRate <- apply(GenoM, MARGIN=1, FUN = function(x) sum(x!=-9)) / ncol(GenoM) hist(IndivCallRate, breaks=50, col="grey") GoodSamples <- rownames(GenoM)[ IndivCallRate > 0.8]
Compare, count and identify different types of relative pairs between two pedigrees, or within one pedigree.
ComparePairs( Ped1 = NULL, Ped2 = NULL, Pairs2 = NULL, GenBack = 1, patmat = FALSE, ExcludeDummies = TRUE, DumPrefix = c("F0", "M0"), Return = "Counts" )
ComparePairs( Ped1 = NULL, Ped2 = NULL, Pairs2 = NULL, GenBack = 1, patmat = FALSE, ExcludeDummies = TRUE, DumPrefix = c("F0", "M0"), Return = "Counts" )
Ped1 |
first (e.g. original/reference) pedigree, dataframe with 3 columns: id-dam-sire. |
Ped2 |
optional second (e.g. inferred) pedigree. |
Pairs2 |
optional dataframe with as first three columns: ID1-ID2-
relationship, e.g. as returned by |
GenBack |
number of generations back to consider; 1 returns parent-offspring and sibling relationships, 2 also returns grandparental, avuncular and first cousins. GenBack >2 is not implemented. |
patmat |
logical, distinguish between paternal versus maternal relative pairs? |
ExcludeDummies |
logical, exclude dummy IDs from output? Individuals
with e.g. the same dummy father will still be counted as paternal halfsibs.
No attempt is made to match dummies in one pedigree to individuals in the
other pedigree; for that use |
DumPrefix |
character vector with the prefixes identifying dummy
individuals. Use 'F0' ('M0') to avoid matching to regular individuals with
IDs starting with 'F' ('M'), provided |
Return |
return a matrix with |
If Pairs2
is as returned by GetMaybeRel
(identified by the additional column names 'LLR' and 'OH'), these
relationship categories are appended with an '?' in the output, to
distinguish them from those derived from Ped2
.
When Pairs2$TopRel
contains values other than the ones listed among
the return values for the combination of patmat
and GenBack
,
they are prioritised in decreasing order of factor levels, or in decreasing
alphabetical order, and before the default (ped2
derived) levels.
The matrix returned by DyadCompare
[Deprecated] is a subset
of the matrix returned here using default settings.
Depending on Return
, one of the following, or a list with all:
Counts |
(the default), a matrix with counts, with the classification in
|
Summary |
a matrix with one row per relationship type and four columns
, named as if
|
Array |
a 2xNxN array (if |
Dataframe |
a dataframe with
Each pair is listed twice, e.g. once as P and once as O, or twice as FS. |
By default (GenBack=1, patmat=FALSE
) the following 7 relationships are
distinguished:
S: Self (not included in Counts
)
MP: Parent
O: Offspring (not included in Counts
)
FS: Full sibling
HS: Half sibling
U: Unrelated, or otherwise related
X: Either or both individuals not occurring in both pedigrees
In the array and dataframe, 'MP' indicates that the second (column) individual is the parent of the first (row) individual, and 'O' indicates the reverse.
When GenBack=1, patmat=TRUE
the categories are (S)-M-P-(O)-FS-MHS-PHS-
U-X.
When GenBack=2, patmat=TRUE
, the following relationships are
distinguished:
S: Self (not included in Counts
)
M: Mother
P: Father
O: Offspring (not included in Counts
)
FS: Full sibling
MHS: Maternal half-sibling
PHS: Paternal half-sibling
MGM: Maternal grandmother
MGF: Maternal grandfather
PGM: Paternal grandmother
PGF: Paternal grandfather
GO: Grand-offspring (not included in Counts
)
FA: Full avuncular; maternal or paternal aunt or uncle
HA: Half avuncular
FN: Full nephew/niece (not included in Counts
)
HN: Half nephew/niece (not included in Counts
)
FC1: Full first cousin
DFC1: Double full first cousin
U: Unrelated, or otherwise related
X: Either or both individuals not occurring in both pedigrees
Note that for avuncular and cousin relationships no distinction is made
between paternal versus maternal, as this may differ between the two
individuals and would generate a large number of sub-classes. When a pair is
related via multiple paths, the first-listed relationship is returned. To get
all the different paths between a pair, use GetRelM
with
Return='Array'
.
When GenBack=2, patmat=FALSE
, MGM, MGF, PGM and PGF are combined
into GP, with the rest of the categories analogous to the above.
PedCompare
for individual-based comparison;
GetRelM
for a pairwise relationships matrix of a single
pedigree; PlotRelPairs
for visualisation of relationships
within each pedigree.
To estimate P(actual relationship (Ped1) | inferred relationship (Ped2)),
see examples at EstConf
.
PairsG <- ComparePairs(Ped_griffin, SeqOUT_griffin[["Pedigree"]], patmat = TRUE, ExcludeDummies = TRUE, Return = "All") PairsG$Counts # pairwise correct assignment rate: PairsG$Summary[,"OK"] / PairsG$Summary[,"n"] # check specific pair: PairsG$Array[, "i190_2010_M", "i168_2009_F"] # or RelDF <- PairsG$Dataframe # for brevity RelDF[RelDF$id.A=="i190_2010_M" & RelDF$id.B=="i168_2009_F", ] # Colony-style lists of full sib dyads & half sib dyads: FullSibDyads <- with(RelDF, RelDF[Ped1 == "FS" & id.A < id.B, ]) HalfSibDyads <- with(RelDF, RelDF[Ped1 == "HS" & id.A < id.B, ]) # Use 'id.A < id.B' because each pair is listed 2x
PairsG <- ComparePairs(Ped_griffin, SeqOUT_griffin[["Pedigree"]], patmat = TRUE, ExcludeDummies = TRUE, Return = "All") PairsG$Counts # pairwise correct assignment rate: PairsG$Summary[,"OK"] / PairsG$Summary[,"n"] # check specific pair: PairsG$Array[, "i190_2010_M", "i168_2009_F"] # or RelDF <- PairsG$Dataframe # for brevity RelDF[RelDF$id.A=="i190_2010_M" & RelDF$id.B=="i168_2009_F", ] # Colony-style lists of full sib dyads & half sib dyads: FullSibDyads <- with(RelDF, RelDF[Ped1 == "FS" & id.A < id.B, ]) HalfSibDyads <- with(RelDF, RelDF[Ped1 == "HS" & id.A < id.B, ]) # Use 'id.A < id.B' because each pair is listed 2x
Example output of EstConf
, with the inferred
pedigree in SeqOUT_griffin
used as reference pedigree.
data(Conf_griffin)
data(Conf_griffin)
a list, see sequoia
Jisca Huisman, [email protected]
## Not run: Conf_griffin <- EstConf(Pedigree = SeqOUT_griffin$Pedigree, LifeHistData = LH_griffin, args.sim = list(nSnp = 400, SnpError = 0.001, ParMis=0.4), args.seq = list(Module = 'ped', Err=0.001), nSim = 20, nCores = 5, quiet = TRUE) ## End(Not run)
## Not run: Conf_griffin <- EstConf(Pedigree = SeqOUT_griffin$Pedigree, LifeHistData = LH_griffin, args.sim = list(nSnp = 400, SnpError = 0.001, ParMis=0.4), args.seq = list(Module = 'ped', Err=0.001), nSim = 20, nCores = 5, quiet = TRUE) ## End(Not run)
Count the number of half and full sibling pairs correctly and
incorrectly assigned. DEPRECATED - PLEASE USE ComparePairs
DyadCompare(Ped1 = NULL, Ped2 = NULL, na1 = c(NA, "0"))
DyadCompare(Ped1 = NULL, Ped2 = NULL, na1 = c(NA, "0"))
Ped1 |
original pedigree, dataframe with 3 columns: id-dam-sire. |
Ped2 |
second (inferred) pedigree. |
na1 |
the value for missing parents in Ped1. |
A 3x3 table with the number of pairs assigned as full siblings (FS), half siblings (HS) or unrelated (U, including otherwise related) in the two pedigrees, with the classification in Ped1 on rows and that in Ped2 in columns.
ComparePairs
which supersedes this function;
PedCompare
## Not run: DyadCompare(Ped1=Ped_HSg5, Ped2=SeqOUT_HSg5$Pedigree) ## End(Not run)
## Not run: DyadCompare(Ped1=Ped_HSg5, Ped2=SeqOUT_HSg5$Pedigree) ## End(Not run)
Make a vector or matrix specifying the genotyping error pattern, or a function to generate such a vector/matrix from a single value Err.
with the probabilities of observed genotypes (columns) conditional on actual genotypes (rows), or return a function to generate such matrices (using a single value Err as input to that function).
ErrToM(Err = NA, flavour = "version2.9", Return = "matrix")
ErrToM(Err = NA, flavour = "version2.9", Return = "matrix")
Err |
estimated genotyping error rate, as a single number, or 3x3 or 4x4
matrix, or length 3 vector. If a single number, an error model is used that
aims to deal with scoring errors typical for SNP arrays. If a matrix, this
should be the probability of observed genotype (columns) conditional on
actual genotype (rows). Each row must therefore sum to 1. If
|
flavour |
vector-generating or matrix-generating function, or one of
'version2.9', 'version2.0', 'version1.3' (='SNPchip'), 'version1.1'
(='version111'), referring to the sequoia version in which it was used as
default. Only used if |
Return |
output, 'matrix' (default), 'vector', 'function' (matrix-generating), or 'v_function' (vector-generating) |
By default (flavour
= "version2.9"), Err
is
interpreted as a locus-level error rate (rather than allele-level), and
equals the probability that an actual heterozygote is observed as either
homozygote (i.e., the probability that it is observed as AA = probability
that observed as aa = Err
/2). The probability that one homozygote is
observed as the other is (Err
/2.
The inbuilt 'flavours' correspond to the presumed and simulated error structures, which have changed with sequoia versions. The most appropriate error structure will depend on the genotyping platform; 'version0.9' and 'version1.1' were inspired by SNP array genotyping while 'version1.3' and 'version2.0' are intended to be more general.
This function, and throughout the package, it is assumed that the two alleles
and
are equivalent. Thus, using notation
(observed
genotype |actual genotype), that
,
, and
.
version | hom|hom | het|hom | hom|het |
2.9 | |
|
|
2.0 | |
|
|
1.3 | |
|
|
1.1 | |
|
|
0.9 | |
|
|
or in matrix form, Pr(observed genotype (columns) | actual genotype (rows)):
version2.9:
0 | 1 | 2 | |
0 | |
|
|
1 | |
|
|
2 | |
|
|
version2.0:
0 | 1 | 2 | |
0 | |
|
|
1 | |
|
|
2 | |
|
|
version1.3
0 | 1 | 2 | |
0 | |
|
|
1 | |
|
|
2 | |
|
|
version1.1
0 | 1 | 2 | |
0 | |
|
|
1 | |
|
|
2 | |
|
|
version0.9 (not recommended)
0 | 1 | 2 | |
0 | |
|
|
1 | |
|
|
2 | |
|
|
When Err
is a length 3 vector, or if Return = 'vector'
these
are the following probabilities:
hom|hom: an actual homozygote is observed as the other homozygote
()
het|hom: an actual homozygote is observed as heterozygote ()
hom|het: an actual heterozygote is observed as homozygote ()
and Pr(observed genotype (columns) | actual genotype (rows)) is then:
0 | 1 | 2 | |
0 | |
|
|
1 | |
|
|
2 | |
|
|
When the SNPs are scored via sequencing (e.g. RADseq or DArTseq), the 3rd error rate (hom|het) is typically considerably higher than the other two, while for SNP arrays it tends to be similar to P(het|hom).
Depending on Return
, either:
'matrix'
: a 3x3 matrix, with probabilities of observed genotypes
(columns) conditional on actual (rows)
'function'
: a function taking a single value Err
as input, and
generating a 3x3 matrix
'vector'
: a length 3 vector, with the probabilities (observed given
actual) hom|other hom, het|hom, and hom|het.
ErM <- ErrToM(Err = 0.05) ErM ErrToM(ErM, Return = 'vector') # use error matrix from Whalen, Gorjanc & Hickey 2018 funE <- function(E) { matrix(c(1-E*3/4, E/2, E/4, E/4, 1-2*E/4, E/4, E/4, E/2, 1-E*3/4), 3,3, byrow=TRUE) } ErrToM(Err = 0.05, flavour = funE) # equivalent to: ErrToM(Err = c(0.05/4, 0.05/2, 0.05/4))
ErM <- ErrToM(Err = 0.05) ErM ErrToM(ErM, Return = 'vector') # use error matrix from Whalen, Gorjanc & Hickey 2018 funE <- function(E) { matrix(c(1-E*3/4, E/2, E/4, E/4, 1-2*E/4, E/4, E/4, E/2, 1-E*3/4), 3,3, byrow=TRUE) } ErrToM(Err = 0.05, flavour = funE) # equivalent to: ErrToM(Err = c(0.05/4, 0.05/2, 0.05/4))
Estimate confidence probabilities ('backward') and assignment
error rates ('forward') per category (genotyped/dummy) by repeatedly
simulating genotype data from a reference pedigree using
SimGeno
, reconstruction a pedigree from this using
sequoia
, and counting the number of mismatches using
PedCompare
.
EstConf( Pedigree = NULL, LifeHistData = NULL, args.sim = list(nSnp = 400, SnpError = 0.001, ParMis = c(0.4, 0.4)), args.seq = list(Module = "ped", Err = 0.001, Tassign = 0.5, CalcLLR = FALSE), nSim = 10, nCores = 1, quiet = TRUE )
EstConf( Pedigree = NULL, LifeHistData = NULL, args.sim = list(nSnp = 400, SnpError = 0.001, ParMis = c(0.4, 0.4)), args.seq = list(Module = "ped", Err = 0.001, Tassign = 0.5, CalcLLR = FALSE), nSim = 10, nCores = 1, quiet = TRUE )
Pedigree |
reference pedigree from which to simulate, dataframe with columns id-dam-sire. Additional columns are ignored. |
LifeHistData |
dataframe with id, sex (1=female, 2=male, 3=unknown), birth year, and optionally BY.min - BY.max - YearLast. |
args.sim |
list of arguments to pass to |
args.seq |
list of arguments to pass to |
nSim |
number of iterations of simulate - reconstruct - compare to perform, i.e. number of simulated datasets. |
nCores |
number of computer cores to use. If |
quiet |
suppress messages. |
The confidence probability is taken as the number of correct (matching) assignments, divided by all assignments made in the observed (inferred-from-simulated) pedigree. In contrast, the false negative & false positive assignment rates are proportions of the number of parents in the true (reference) pedigree. Each rate is calculated separatedly for dams & sires, and separately for each category (Genotyped/Dummy(fiable)/X (none)) of individual, parent and co-parent.
This function does not know which individuals in the actual Pedigree
are genotyped, so the confidence probabilities need to be added to the
Pedigree
as shown in the example at the bottom.
A confidence of means all assignments on simulated data were correct for
that category-combination. It should be interpreted as (and perhaps modified
to)
, where sample size
N
is given in the last column
of the ConfProb
and PedErrors
dataframes in the output. The
same applies for a false negative/positive rate of (i.e. to be
interpreted as
).
A list, with elements:
ConfProb |
See below |
PedErrors |
See below |
Pedigree.reference |
the pedigree from which data was simulated |
LifeHistData |
|
Pedigree.inferred |
a list with for each iteration the inferred pedigree based on the simulated data |
SimSNPd |
a list with for each iteration the IDs of the individuals simulated to have been genotyped |
PedComp.fwd |
array with |
RunParams |
a list with the call to |
RunTime |
|
Dataframe ConfProb
has 7 columns:
id.cat , dam.cat , sire.cat
|
Category of the focal individual, dam, and sire, in the pedigree inferred based on the simulated data. Coded as G=genotyped, D=dummy, X=none |
dam.conf |
Probability that the dam is correct, given the categories of the assigned dam and sire (ignoring whether or not the sire is correct) |
sire.conf |
as |
pair.conf |
Probability that both dam and sire are correct, given their categories |
N |
Number of individuals per category-combination, across all
|
Array PedErrors
has three dimensions:
class |
|
cat |
Category of individual + parent, as a two-letter code where the first letter indicates the focal individual and the second the parent; G=Genotyped, D=Dummy, T=Total |
parent |
dam or sire |
Because the actual true pedigree is (typically) unknown, the provided
reference pedigree is used as a stand-in and assumed to be the true
pedigree, with unrelated founders. It is also assumed that the probability
to be genotyped is equal for all parents; in each iteration, a new random
set of parents (proportion set by ParMis
) is mimicked to be
non-genotyped. In addition, SNPs are assumed to segregate independently.
An experimental version offering more fine-grained control is available at https://github.com/JiscaH/sequoiaExtra .
The size in Kb of the returned list can become pretty big, as each of the
inferred pedigrees is included. When running EstConf
many times for
a range of parameter values, it may be prudent to save the required summary
statistics for each run rather than the full output.
If you have a large pedigree and try to run this function on multiple cores, you may run into "Cannot allocate vector of size ..." errors or even unexpected crashes: there is not enough computer memory for each separate run. Try reducing 'nCores'.
# estimate proportion of parents that are genotyped (= 1 - ParMis) sumry_grif <- SummarySeq(SeqOUT_griffin, Plot=FALSE) tmp <- apply(sumry_grif$ParentCount['Genotyped',,,], MARGIN = c('parentSex', 'parentCat'), FUN = sum) props <- sweep(tmp, MARGIN='parentCat', STATS = rowSums(tmp), FUN = '/') 1 - props[,'Genotyped'] / (props[,'Genotyped'] + props[,'Dummy']) # Example for parentage assignment only conf_grif <- EstConf(Pedigree = SeqOUT_griffin$Pedigree, LifeHistData = SeqOUT_griffin$LifeHist, args.sim = list(nSnp = 150, # no. in actual data, or what-if SnpError = 5e-3, # best estimate, or what-if CallRate=0.9, # from SnpStats() ParMis=c(0.39, 0.20)), # calc'd above args.seq = list(Err=5e-3, Module="par"), # as in real run nSim = 1, # try-out, proper run >=20 (10 if huge pedigree) nCores=1) # parent-pair confidence, per category (Genotyped/Dummy/None) conf_grif$ConfProb # Proportion of true parents that was correctly assigned 1 - apply(conf_grif$PedErrors, MARGIN=c('cat','parent'), FUN=sum, na.rm=TRUE) # add columns with confidence probabilities to pedigree # first add columns with category (G/D/X) Ped.withConf <- getAssignCat(Pedigree = SeqOUT_griffin$Pedigree, SNPd = SeqOUT_griffin$PedigreePar$id) Ped.withConf <- merge(Ped.withConf, conf_grif$ConfProb, all.x=TRUE, sort=FALSE) # (note: merge() messes up column order) head(Ped.withConf[Ped.withConf$dam.cat=="G", ]) # save output summary ## Not run: conf_griff[['Note']] <- 'You could add a note' saveRDS(conf_grif[c('ConfProb','PedComp.fwd','RunParams','RunTime','Note')], file = 'conf_200SNPs_Err005_Callrate80.RDS') ## End(Not run) ## P(actual FS | inferred as FS) etc. ## Not run: PairL <- list() for (i in 1:length(conf_grif$Pedigree.inferred)) { # nSim cat(i, "\t") PairL[[i]] <- ComparePairs(conf_grif$Pedigree.reference, conf_grif$Pedigree.inferred[[i]], GenBack=1, patmat=TRUE, ExcludeDummies = TRUE, Return="Counts") } # P(actual relationship (Ped1) | inferred relationship (Ped2)) PairRel.prop.A <- plyr::laply(PairL, function(M) sweep(M, MARGIN='Ped2', STATS=colSums(M), FUN="/")) PairRel.prop <- apply(PairRel.prop.A, 2:3, mean, na.rm=TRUE) #avg across sims round(PairRel.prop, 3) # or: P(inferred relationship | actual relationship) PairRel.prop2 <- plyr::laply(PairL, function(M) sweep(M, MARGIN='Ped1', STATS=rowSums(M), FUN="/")) ## End(Not run) ## Not run: # confidence probability vs. sibship size source('https://raw.githubusercontent.com/JiscaH/sequoiaExtra/main/conf_vs_sibsize.R') conf_grif_nOff <- Conf_by_nOff(conf_grif) conf_grif_nOff['conf',,'GD',] conf_grif_nOff['N',,'GD',] ## End(Not run)
# estimate proportion of parents that are genotyped (= 1 - ParMis) sumry_grif <- SummarySeq(SeqOUT_griffin, Plot=FALSE) tmp <- apply(sumry_grif$ParentCount['Genotyped',,,], MARGIN = c('parentSex', 'parentCat'), FUN = sum) props <- sweep(tmp, MARGIN='parentCat', STATS = rowSums(tmp), FUN = '/') 1 - props[,'Genotyped'] / (props[,'Genotyped'] + props[,'Dummy']) # Example for parentage assignment only conf_grif <- EstConf(Pedigree = SeqOUT_griffin$Pedigree, LifeHistData = SeqOUT_griffin$LifeHist, args.sim = list(nSnp = 150, # no. in actual data, or what-if SnpError = 5e-3, # best estimate, or what-if CallRate=0.9, # from SnpStats() ParMis=c(0.39, 0.20)), # calc'd above args.seq = list(Err=5e-3, Module="par"), # as in real run nSim = 1, # try-out, proper run >=20 (10 if huge pedigree) nCores=1) # parent-pair confidence, per category (Genotyped/Dummy/None) conf_grif$ConfProb # Proportion of true parents that was correctly assigned 1 - apply(conf_grif$PedErrors, MARGIN=c('cat','parent'), FUN=sum, na.rm=TRUE) # add columns with confidence probabilities to pedigree # first add columns with category (G/D/X) Ped.withConf <- getAssignCat(Pedigree = SeqOUT_griffin$Pedigree, SNPd = SeqOUT_griffin$PedigreePar$id) Ped.withConf <- merge(Ped.withConf, conf_grif$ConfProb, all.x=TRUE, sort=FALSE) # (note: merge() messes up column order) head(Ped.withConf[Ped.withConf$dam.cat=="G", ]) # save output summary ## Not run: conf_griff[['Note']] <- 'You could add a note' saveRDS(conf_grif[c('ConfProb','PedComp.fwd','RunParams','RunTime','Note')], file = 'conf_200SNPs_Err005_Callrate80.RDS') ## End(Not run) ## P(actual FS | inferred as FS) etc. ## Not run: PairL <- list() for (i in 1:length(conf_grif$Pedigree.inferred)) { # nSim cat(i, "\t") PairL[[i]] <- ComparePairs(conf_grif$Pedigree.reference, conf_grif$Pedigree.inferred[[i]], GenBack=1, patmat=TRUE, ExcludeDummies = TRUE, Return="Counts") } # P(actual relationship (Ped1) | inferred relationship (Ped2)) PairRel.prop.A <- plyr::laply(PairL, function(M) sweep(M, MARGIN='Ped2', STATS=colSums(M), FUN="/")) PairRel.prop <- apply(PairRel.prop.A, 2:3, mean, na.rm=TRUE) #avg across sims round(PairRel.prop, 3) # or: P(inferred relationship | actual relationship) PairRel.prop2 <- plyr::laply(PairL, function(M) sweep(M, MARGIN='Ped1', STATS=rowSums(M), FUN="/")) ## End(Not run) ## Not run: # confidence probability vs. sibship size source('https://raw.githubusercontent.com/JiscaH/sequoiaExtra/main/conf_vs_sibsize.R') conf_grif_nOff <- Conf_by_nOff(conf_grif) conf_grif_nOff['conf',,'GD',] conf_grif_nOff['N',,'GD',] ## End(Not run)
Estimate the genotyping error rates in SNP data, based on a pedigree and/or duplicates. Estimates probabilities (observed given actual) hom|other hom, het|hom, and hom|het. THESE ARE APPROXIMATE VALUES!
EstEr( GenoM, Pedigree, Duplicates = NULL, Er_start = c(0.05, 0.05, 0.05), perSNP = FALSE )
EstEr( GenoM, Pedigree, Duplicates = NULL, Er_start = c(0.05, 0.05, 0.05), perSNP = FALSE )
GenoM |
Genotype matrix |
Pedigree |
data.frame with columns id - dam - sire |
Duplicates |
matrix or data.frame with 2 columns, id1 & id2 |
Er_start |
vector of length 3 with starting values for |
perSNP |
logical, estimate error rate per SNP. WARNING not very
precise, use only as an approximate indicator! Try on simulated data first,
e.g. with |
The result should be interpreted as approximate, ballpark estimates! The estimated error rates from a pedigree will not be as accurate as from duplicate samples. Errors in individuals without parents or offspring will not be counted, and errors in individuals with only few offspring may not be noted either. Deviation of genotype frequencies among founders from Hardy-Weinberg equilibrium may wrongly be attributed to genotyping errors. Last but not least, any pedigree errors will result in higher estimated genotyping errors.
vector of length 3 with estimated genotyping error rates: the probabilities that
hom|hom: an actual homozygote is observed as the other homozygote
het|hom: an actual homozygote is observed as heterozygote
hom|het: an actual heterozygote is observed as homozygote
These are three independent parameters, that define the genotyping error
matrix (see ErrToM
) as follows:
0 | 1 | 2 | |
0 | |
|
|
1 | |
|
|
2 | |
|
|
Note that for optim
a lower bound of 1e-6 and upper bound of 0.499
are used; if these values are returned this should be interpreted as
'inestimably small' and 'inestimably large', respectively. PLEASE DO NOT USE
THESE VALUES AS INPUT IN SUBSEQUENT ANALYSIS BUT SUBSITUTE BY A SENSIBLE
VALUE!!
GenoX <- SimGeno(Ped_griffin, nSnp=400, SnpError=c(0.01,0.07, 0.1), ParMis=0.1, CallRate=0.9) # EstEr(GenoM=GenoX, Pedigree=Ped_griffin)
GenoX <- SimGeno(Ped_griffin, nSnp=400, SnpError=c(0.01,0.07, 0.1), ParMis=0.1, CallRate=0.9) # EstEr(GenoM=GenoX, Pedigree=Ped_griffin)
Example field pedigree used in vignette for
PedCompare
example. Non-genotyped females have IDs 'BlueRed',
'YellowPink', etc.
data(FieldMums_griffin)
data(FieldMums_griffin)
A data frame with 144 rows and 2 variables (id, mum)
Jisca Huisman, [email protected]
SeqOUT_griffin
for a sequoia run on simulated genotype
data, Ped_griffin
for the 'true' pedigree.
## Not run: PC_griffin <- PedCompare(Ped1 = cbind(FieldMums_griffin, sire=NA), Ped2 = SeqOUT_griffin$Pedigree) ## End(Not run)
## Not run: PC_griffin <- PedCompare(Ped1 = cbind(FieldMums_griffin, sire=NA), Ped2 = SeqOUT_griffin$Pedigree) ## End(Not run)
Find clusters of connected individuals in a pedigree, and assign each cluster a unique family ID (FID).
FindFamilies(Pedigree = NULL, SeqList = NULL, MaybeRel = NULL)
FindFamilies(Pedigree = NULL, SeqList = NULL, MaybeRel = NULL)
Pedigree |
dataframe with columns id - parent1 - parent2; only the first 3 columns will be used. |
SeqList |
list as returned by |
MaybeRel |
Output from |
This function repeatedly finds all ancestors and all descendants of each individual in turn, and ensures they all have the same Family ID. Not all connected individuals are related, e.g. all grandparents of an individual will have the same FID, but will typically be unrelated.
When UseMaybeRel = TRUE
, probable relatives are added to existing
family clusters, or existing family clusters may be linked together.
Currently no additional family clusters are created.
A numeric vector with length equal to the number of unique
individuals in the pedigree (i.e. number of rows in pedigree after running
PedPolish
on Pedigree
).
GetAncestors, GetDescendants,
getGenerations
PedG <- SeqOUT_griffin$PedigreePar[,1:3] FID_G <- FindFamilies(PedG) PedG[FID_G==4,]
PedG <- SeqOUT_griffin$PedigreePar[,1:3] FID_G <- FindFamilies(PedG) PedG[FID_G==4,]
Simulated genotype data from Pedigree Ped_griffin
data(Geno_griffin)
data(Geno_griffin)
A genotype matrix with 142 rows (individuals) and 200 columns (SNPs). Each SNP is coded as 0/1/2 copies of the reference allele, with -9 for missing values. Ids are stored as rownames.
Jisca Huisman, [email protected]
Simulated genotype data for all* individuals in Pedigree
Ped_HSg5
(*: with 40
data(Geno_HSg5)
data(Geno_HSg5)
A genotype matrix with 920 rows (ids) and 200 columns (SNPs). Each SNP is coded as 0/1/2 copies of the reference allele, with -9 for missing values. Ids are stored as rownames.
Jisca Huisman, [email protected]
## Not run: # this output was created as follows: Geno_HSg5 <- SimGeno(Ped = Ped_HSg5, nSnp = 200, ParMis=0.4, CallRate = 0.9, SnpError = 0.005) ## End(Not run)
## Not run: # this output was created as follows: Geno_HSg5 <- SimGeno(Ped = Ped_HSg5, nSnp = 200, ParMis=0.4, CallRate = 0.9, SnpError = 0.005) ## End(Not run)
Convert genotype data in various formats to sequoia's 1-column-per-marker format, PLINK's ped format, or Colony's 2-columns-per-marker format.
GenoConvert( InData = NULL, InFile = NULL, InFormat = "raw", OutFile = NA, OutFormat = "seq", Missing = c("-9", "NA", "??", "?", "NULL", "-1", c("0")[InFormat %in% c("col", "ped")]), sep = c(" ", "\t", ",", ";"), header = NA, IDcol = NA, FIDcol = NA, FIDsep = "__", dropcol = NA, quiet = FALSE )
GenoConvert( InData = NULL, InFile = NULL, InFormat = "raw", OutFile = NA, OutFormat = "seq", Missing = c("-9", "NA", "??", "?", "NULL", "-1", c("0")[InFormat %in% c("col", "ped")]), sep = c(" ", "\t", ",", ";"), header = NA, IDcol = NA, FIDcol = NA, FIDsep = "__", dropcol = NA, quiet = FALSE )
InData |
dataframe, matrix or
|
InFile |
character string with name of genotype file to be converted. |
InFormat |
One of 'seq' (sequoia), 'ped' (PLINK .ped file), 'col'
(COLONY), 'raw' (PLINK –recodeA), 'vcf' (requires library |
OutFile |
character string with name of converted file. If NA, return matrix with genotypes in console (default); if NULL, write to 'GenoForSequoia.txt' in current working directory. |
OutFormat |
as |
Missing |
vector with symbols interpreted as missing data. '0' is
missing data for |
sep |
vector with field separator strings that will be tried on
|
header |
a logical value indicating whether the file contains a header as its first line. If NA (default), set to TRUE for 'raw', and FALSE otherwise. |
IDcol |
number giving the column with individual IDs; 0 indicates the rownames (for InData only). If NA (default), set to 2 for InFormat 'raw' and 'ped', and otherwise to 1 for InFile and 0 (rownames) for InData, except when InData has a column labeled 'ID'. |
FIDcol |
column with the family IDs, if any are wished to be used. This is column 1 for InFormat 'raw' and 'seq', but those are by default not used. |
FIDsep |
string used to paste FID and IID together into a composite-ID
(value passed to |
dropcol |
columns to exclude from the output data, on top of IDcol and FIDcol (which become rownames). When NA, defaults to columns 3-6 for InFormat 'raw' and 'seq'. Can also be used to drop some SNPs, see example below on how to do this for the 2-columns-per-SNP input formats. |
quiet |
suppress messages and warnings. |
The first two arguments are interchangeable, and can be given unnamed. The first argument is assumed to be a file name if it is of class 'character' and length 1, and to be the genetic data if it is a matrix or dataframe.
If package data.table is detected, fread
is used to read in the data from file. Otherwise, a combination of
readLines
and strsplit
is used.
A genotype matrix in the specified output format; the default sequoia format ('seq') has 1 column per SNP coded in 0/1/2 format (major homozygote /heterozygote /minor homozygote) with -9 for missing values, sample IDs in row names and SNP names in column names. If 'OutFile' is specified, the matrix is written to this file and nothing is returned inside R.
The following formats can be specified by InFormat
:
(sequoia) genotypes are coded as 0, 1, 2, missing as (in input
any negative number or NA are OK),
in 1 column per marker. Column 1 contains IDs, there is no header row.
(PLINK) genotypes are coded as A, C, T, G, missing as 0, in 2 columns per marker. The first 6 columns are descriptive (1:FID, 2:IID, 3 to 6 ignored). If an associated .map file exists, SNP names will be read from there.
(PLINK) genotypes are coded as 0, 1, 2, missing as NA, in 1 column per marker. The first 6 columns are descriptive (1:FID, 2:IID, 3 to 6 ignored), and there is a header row. This is produced by PLINK's option –recodeA
(Colony) genotypes are coded as numeric values, missing as 0, in 2 columns per marker. Column 1 contains IDs.
(VCF) genotypes are coded as '0/0','0/1','1/1', variable number of header rows followed by 1 row per SNP, with various columns of metadata followed by 1 column per individual. Requires package vcfR.
1 column per marker, otherwise unspecified
2 columns per marker, otherwise unspecified
For each InFormat
, its default values for Missing, header,
IDcol, FIDcol
, and dropcol
can be overruled by specifying the
corresponding input parameters.
Occasionally when reading in a file GenoConvert
may give an error
that 'rows have unequal length'. GenoConvert
makes use of
readLines
and strsplit
, which is much faster
than read.table
for large datafiles, but also more sensitive
to unusual line endings, unusual end-of-file characters, or invisible
characters (spaces or tabs) after the end of some lines. In these cases,
try to read the data from file using read.table or read.csv, and then use
GenoConvert
on this dataframe or matrix, see example.
Jisca Huisman, [email protected]
CheckGeno, SnpStats, LHConvert
.
## Not run: # Requires PLINK installed & in system PATH: # tinker with window size, window overlap and VIF to get a set of # 400 - 800 markers (100-200 enough for just parentage): system("cmd", input = "plink --file mydata --indep 50 5 2") system("cmd", input = "plink --file mydata --extract plink.prune.in --recodeA --out PlinkOUT") GenoM <- GenoConvert(InFile = "PlinkOUT.raw", InFormat='raw') # which is the same as GenoM <- GenoConvert(PlinkOUT.raw, InFormat='single', IDcol=2, dropcol=c(1,3:6), header=TRUE) # (but it will complain that InFormat='single' is not consistent with .raw # file extension) # save time on file conversion next time: write.table(GenoM, file="Geno_sequoia.txt", quote=FALSE, col.names=FALSE) GenoM <- as.matrix(read.table("Geno_sequoia.txt", row.names=1, header=FALSE)) # drop some SNPs, e.g. after a warning of >2 alleles: dropSNP <- c(5,68,101,128) GenoM <- GenoConvert(ColonyFile, InFormat = "col", dropcol = 1 + c(2*dropSNP-1, 2*dropSNP) ) # circumvent a 'rows have unequal length' error: GenoTmp <- as.matrix(read.table("mydata.txt", header=TRUE, row.names=1)) GenoM <- GenoConvert(InData=GenoTmp, InFormat="single", IDcol=0) ## End(Not run)
## Not run: # Requires PLINK installed & in system PATH: # tinker with window size, window overlap and VIF to get a set of # 400 - 800 markers (100-200 enough for just parentage): system("cmd", input = "plink --file mydata --indep 50 5 2") system("cmd", input = "plink --file mydata --extract plink.prune.in --recodeA --out PlinkOUT") GenoM <- GenoConvert(InFile = "PlinkOUT.raw", InFormat='raw') # which is the same as GenoM <- GenoConvert(PlinkOUT.raw, InFormat='single', IDcol=2, dropcol=c(1,3:6), header=TRUE) # (but it will complain that InFormat='single' is not consistent with .raw # file extension) # save time on file conversion next time: write.table(GenoM, file="Geno_sequoia.txt", quote=FALSE, col.names=FALSE) GenoM <- as.matrix(read.table("Geno_sequoia.txt", row.names=1, header=FALSE)) # drop some SNPs, e.g. after a warning of >2 alleles: dropSNP <- c(5,68,101,128) GenoM <- GenoConvert(ColonyFile, InFormat = "col", dropcol = 1 + c(2*dropSNP-1, 2*dropSNP) ) # circumvent a 'rows have unequal length' error: GenoTmp <- as.matrix(read.table("mydata.txt", header=TRUE, row.names=1)) GenoM <- GenoConvert(InData=GenoTmp, InFormat="single", IDcol=0) ## End(Not run)
get all ancestors of an individual
GetAncestors(id, Pedigree)
GetAncestors(id, Pedigree)
id |
id of the individual |
Pedigree |
dataframe with columns id - parent1 - parent2; only the first 3 columns will be used. |
a list with as first element id
, second parents, third
grandparents, etc.. Each element is a vector with ids, the first three
elements are named, the rest numbered. Ancestors are unsorted within each
list element.
Anc_i200 <- GetAncestors('i200_2010_F', Ped_griffin)
Anc_i200 <- GetAncestors('i200_2010_F', Ped_griffin)
Identify which individuals are SNP genotyped, and which can potentially be substituted by a dummy individual ('Dummifiable').
getAssignCat(Pedigree, SNPd, minSibSize = "1sib1GP")
getAssignCat(Pedigree, SNPd, minSibSize = "1sib1GP")
Pedigree |
dataframe with columns id-dam-sire. Reference pedigree. |
SNPd |
character vector with ids of genotyped individuals. |
minSibSize |
minimum requirements to be considered 'dummifiable':
. |
It is assumed that all individuals in SNPd
have been
genotyped for a sufficient number of SNPs. To identify samples with a
too-low call rate, use CheckGeno
. To calculate the call rate
for all samples, see the examples below.
Some parents indicated here as assignable may never be assigned by sequoia, for example parent-offspring pairs where it cannot be determined which is the older of the two, or grandparents that are indistinguishable from full avuncular (i.e. genetics inconclusive because the candidate has no parent assigned, and ageprior inconclusive).
The Pedigree
dataframe with 3 additional columns,
id.cat
, dam.cat
and sire.cat
, with coding similar to
that used by PedCompare
:
G |
Genotyped |
D |
Dummy or 'dummifiable' |
X |
Not genotyped and not dummifiable, or no parent in pedigree |
PedA <- getAssignCat(Ped_HSg5, rownames(SimGeno_example)) tail(PedA) table(PedA$dam.cat, PedA$sire.cat, useNA="ifany") # calculate call rate ## Not run: CallRates <- apply(MyGenotypes, MARGIN=1, FUN = function(x) sum(x!=-9)) / ncol(MyGenotypes) hist(CallRates, breaks=50, col="grey") GoodSamples <- rownames(MyGenotypes)[ CallRates > 0.8] # threshold depends on total number of SNPs, genotyping errors, proportion # of candidate parents that are SNPd (sibship clustering is more prone to # false positives). PedA <- getAssignCat(MyOldPedigree, rownames(GoodSamples)) ## End(Not run)
PedA <- getAssignCat(Ped_HSg5, rownames(SimGeno_example)) tail(PedA) table(PedA$dam.cat, PedA$sire.cat, useNA="ifany") # calculate call rate ## Not run: CallRates <- apply(MyGenotypes, MARGIN=1, FUN = function(x) sum(x!=-9)) / ncol(MyGenotypes) hist(CallRates, breaks=50, col="grey") GoodSamples <- rownames(MyGenotypes)[ CallRates > 0.8] # threshold depends on total number of SNPs, genotyping errors, proportion # of candidate parents that are SNPd (sibship clustering is more prone to # false positives). PedA <- getAssignCat(MyOldPedigree, rownames(GoodSamples)) ## End(Not run)
get all descendants of an individual
GetDescendants(id, Pedigree)
GetDescendants(id, Pedigree)
id |
id of the individual |
Pedigree |
dataframe with columns id - parent1 - parent2; only the first 3 columns will be used. |
a list with as first element id
, second offspring, third
grand-offspring, etc.. Each element is a vector with ids, the first three
elements are named, the rest numbered.
For each individual in a pedigree, count the number of generations since its most distant pedigree founder.
getGenerations(Ped, StopIfInvalid = TRUE)
getGenerations(Ped, StopIfInvalid = TRUE)
Ped |
dataframe, pedigree with the first three columns being id - dam - sire. Column names are ignored, as are additional columns. |
StopIfInvalid |
if a pedigree loop is detected, stop with an error (TRUE, default) or return the Pedigree, to see where the problem(s) occur. |
A vector with the generation number for each individual, starting at
0 for founders. Offspring of G0 X G0 are G1, offspring of G0 X G1 or G1 x
G1 are G2, etc. NA
indicates a pedigree loop where an individual is
its own ancestor (or that the pedigree has >1000 generations).
If no output name is specified, no results are returned, only an error message when the pedigree contains a loop.
To get more details about a pedigree loop, you can use https://github.com/JiscaH/sequoiaExtra/blob/main/find_pedigree_loop.R
GetAncestors, GetDescendants
to get all
ancestors resp. descendants of a specific individual (with a warning if it
is its own ancestor); FindFamilies to find connected sub-pedigrees.
# returns nothing if OK, else error: getGenerations(SeqOUT_griffin$Pedigree) # returns vector with generation numbers: G <- getGenerations(SeqOUT_griffin$Pedigree, StopIfInvalid=FALSE) table(G, useNA='ifany') Ped_plus_G <- cbind(SeqOUT_griffin$Pedigree, G)
# returns nothing if OK, else error: getGenerations(SeqOUT_griffin$Pedigree) # returns vector with generation numbers: G <- getGenerations(SeqOUT_griffin$Pedigree, StopIfInvalid=FALSE) table(G, useNA='ifany') Ped_plus_G <- cbind(SeqOUT_griffin$Pedigree, G)
Get log10-likelihood ratios for a specific age difference from
matrix AgePriorExtra
.
GetLLRAge(AgePriorExtra, agedif, patmat)
GetLLRAge(AgePriorExtra, agedif, patmat)
AgePriorExtra |
matrix in |
agedif |
vector with age differences, in whole numbers. Must occur in
rownames of |
patmat |
numeric vector; choose maternal (1), paternal (2) relatives, or for each relationship the most-likely alternative (3). |
A matrix with nrow
equal to the length of agedif
, and 7
columns: PO-FS-HS-GP-FA-HA-U.
PairsG <- data.frame(ID1="i122_2007_M", ID2 = c("i124_2007_M", "i042_2003_F", "i083_2005_M"), AgeDif = c(0,4,2)) cbind(PairsG, GetLLRAge(SeqOUT_griffin$AgePriorExtra, agedif = PairsG$AgeDif, patmat=rep(2,3)))
PairsG <- data.frame(ID1="i122_2007_M", ID2 = c("i124_2007_M", "i042_2003_F", "i083_2005_M"), AgeDif = c(0,4,2)) cbind(PairsG, GetLLRAge(SeqOUT_griffin$AgePriorExtra, agedif = PairsG$AgeDif, patmat=rep(2,3)))
Identify pairs of individuals likely to be related, but not assigned as such in the provided pedigree.
GetMaybeRel( GenoM = NULL, SeqList = NULL, Pedigree = NULL, LifeHistData = NULL, AgePrior = NULL, Module = "par", Complex = "full", Herm = "no", Err = 1e-04, ErrFlavour = "version2.9", Tassign = 0.5, Tfilter = -2, MaxPairs = 7 * nrow(GenoM), quiet = FALSE, ParSib = NULL, MaxMismatch = NA )
GetMaybeRel( GenoM = NULL, SeqList = NULL, Pedigree = NULL, LifeHistData = NULL, AgePrior = NULL, Module = "par", Complex = "full", Herm = "no", Err = 1e-04, ErrFlavour = "version2.9", Tassign = 0.5, Tfilter = -2, MaxPairs = 7 * nrow(GenoM), quiet = FALSE, ParSib = NULL, MaxMismatch = NA )
GenoM |
numeric matrix with genotype data: One row per individual,
one column per SNP, coded as 0, 1, 2, missing values as a negative number
or NA. You can reformat data with |
SeqList |
list with output from |
Pedigree |
dataframe with id - dam - sire in columns 1-3. May include
non-genotyped individuals, which will be treated as dummy individuals. When
provided, all likelihoods (and thus all maybe-relatives) are conditional on
this pedigree. Note: |
LifeHistData |
data.frame with up to 6 columns:
"Birth year" may be in any arbitrary discrete time unit relevant to the species (day, month, decade), as long as parents are never born in the same time unit as their offspring, and only integers are used. Individuals do not need to be in the same order as in ‘GenoM’, nor do all genotyped individuals need to be included. |
AgePrior |
Agepriors matrix, as generated by |
Module |
type of relatives to check for. One of
When 'par', all pairs are returned that are more likely parent-offspring than unrelated, potentially including pairs that are even more likely to be otherwise related. |
Complex |
Breeding system complexity. Either "full" (default), "simp" (simplified, no explicit consideration of inbred relationships), "mono" (monogamous). |
Herm |
Hermaphrodites, either "no", "A" (distinguish between dam and sire role, default if at least 1 individual with sex=4), or "B" (no distinction between dam and sire role). Both of the latter deal with selfing. |
Err |
estimated genotyping error rate, as a single number, or a length 3
vector with P(hom|hom), P(het|hom), P(hom|het), or a 3x3 matrix. See
details below. The error rate is presumed constant across SNPs, and
missingness is presumed random with respect to actual genotype. Using
|
ErrFlavour |
function that takes |
Tassign |
minimum LLR required for acceptance of proposed relationship, relative to next most likely relationship. Higher values result in more conservative assignments. Must be zero or positive. |
Tfilter |
threshold log10-likelihood ratio (LLR) between a proposed relationship versus unrelated, to select candidate relatives. Typically a negative value, related to the fact that unconditional likelihoods are calculated during the filtering steps. More negative values may decrease non-assignment, but will increase computational time. |
MaxPairs |
the maximum number of putative pairs to return. |
quiet |
logical, suppress messages. |
ParSib |
DEPRECATED, use |
MaxMismatch |
DEPRECATED AND IGNORED. Now calculated
automatically using |
When Module="par"
, the age difference of the putative pair is
temporarily set to NA so that genetic parent-offspring pairs declared to be
born in the same year may be discovered. When Module="ped"
, only
relationships possible given the age difference, if known from the
LifeHistData, are considered.
A list with
MaybePar |
A dataframe with non-assigned likely parent-offspring pairs, with columns:
|
MaybeRel |
A dataframe with non-assigned likely pairs of relatives,
with columns identical to |
MaybeTrio |
A dataframe with non-assigned parent-parent-offspring trios, with columns:
|
The following categories are used in column 'TopRel', indicating the most likely relationship category:
PO |
Parent-Offspring |
FS |
Full Siblings |
HS |
Half Siblings |
GP |
GrandParent - grand-offspring |
FA |
Full Avuncular (aunt/uncle) |
2nd |
2nd degree relatives, not enough information to distinguish between HS,GP and FA |
Q |
Unclear, but probably 1st, 2nd or 3rd degree relatives |
sequoia
to identify likely pairs of duplicate
genotypes and for pedigree reconstruction; GetRelM
to
identify all pairs of relatives in a pedigree; CalcPairLL
for
the likelihoods underlying the LLR.
## Not run: # without conditioning on pedigree MaybeRel_griffin <- GetMaybeRel(GenoM=Geno_griffin, Err=0.001, Module='par') ## End(Not run) names(MaybeRel_griffin) # conditioning on pedigree MaybePO <- GetMaybeRel(GenoM = Geno_griffin, SeqList = SeqOUT_griffin, Module = 'par') head(MaybePO$MaybePar) # instead of providing the entire SeqList, one may specify the relevant # elements separately Maybe <- GetMaybeRel(GenoM = Geno_griffin, Pedigree = SeqOUT_griffin$PedigreePar, LifeHistData = LH_griffin, Err=0.0001, Complex = "full", Module = "ped") head(Maybe$MaybeRel) # visualise results, turn dataframe into matrix first: MaybeM <- GetRelM(Pairs = Maybe$MaybeRel) PlotRelPairs(MaybeM) # or combine with pedigree (note suffix '?') RelM <- GetRelM(Pedigree =SeqOUT_griffin$PedigreePar, Pairs = Maybe$MaybeRel) PlotRelPairs(RelM)
## Not run: # without conditioning on pedigree MaybeRel_griffin <- GetMaybeRel(GenoM=Geno_griffin, Err=0.001, Module='par') ## End(Not run) names(MaybeRel_griffin) # conditioning on pedigree MaybePO <- GetMaybeRel(GenoM = Geno_griffin, SeqList = SeqOUT_griffin, Module = 'par') head(MaybePO$MaybePar) # instead of providing the entire SeqList, one may specify the relevant # elements separately Maybe <- GetMaybeRel(GenoM = Geno_griffin, Pedigree = SeqOUT_griffin$PedigreePar, LifeHistData = LH_griffin, Err=0.0001, Complex = "full", Module = "ped") head(Maybe$MaybeRel) # visualise results, turn dataframe into matrix first: MaybeM <- GetRelM(Pairs = Maybe$MaybeRel) PlotRelPairs(MaybeM) # or combine with pedigree (note suffix '?') RelM <- GetRelM(Pedigree =SeqOUT_griffin$PedigreePar, Pairs = Maybe$MaybeRel) PlotRelPairs(RelM)
Generate a matrix or 3D array with all pairwise relationships from a pedigree or dataframe with pairs.
GetRelM( Pedigree = NULL, Pairs = NULL, GenBack = 1, patmat = FALSE, directed = TRUE, Return = "Matrix", Pairs_suffix = "?" )
GetRelM( Pedigree = NULL, Pairs = NULL, GenBack = 1, patmat = FALSE, directed = TRUE, Return = "Matrix", Pairs_suffix = "?" )
Pedigree |
dataframe with columns id - dam - sire. |
Pairs |
dataframe with columns ID1 - ID2 - Rel, e.g. as returned by
|
GenBack |
number of generations back to consider; 1 returns parent-offspring and sibling relationships, 2 also returns grand-parental, avuncular and first cousins. |
patmat |
logical, distinguish between paternal versus maternal relative pairs? For avuncular pairs, the distinction is never made. |
directed |
logical, distinguish between e.g. ID1=offspring, ID2=mother
('M') and ID1=mother, ID2=offspring ('O')? Defaults to TRUE; if FALSE both
are are scored as 'PO', as are father-offspring pairs, and all
grandparent– grand-offspring pairs are scored as 'GPO', and avuncular
pairs as 'FNA' and 'HNA'. Not (currently) compatible with |
Return |
'Matrix', 'Array', or 'List'. 'Matrix' returns an N x N matrix
with the closest relationship between each pair. 'Array' returns an N x N x
R array with for each of the R considered relationships whether it exists
between the pair (1) or not (0). See Details below. 'List' returns a list
with for each of the R considered relationships a 2-column matrix with the
IDs of the pairs having such a relationship. The size of the list (in Mb)
is much smaller than for the matrix or array, and this is therefore the
only format suitable for pedigrees with many thousands of individuals. If
|
Pairs_suffix |
symbol added to the relationship abbreviations derived
from |
Double relationships are ignored when Return='Matrix'
, but
not when Return='Array'
. For example, when A and B are both
mother-offspring and paternal siblings (A mated with her father to produce
B), only the mother-offspring relationship will be indicated when
Return='Matrix'
.
Note that full siblings are the exception to this rule: in the Array
they will be indicated as 'FS' only, and not as 'MHS' or 'PHS'. Similarly,
full avuncular pairs are not indicated as 'HA'. Double half-avuncular
relationships are indicated as both FA and HA.
When Pairs
is provided, GenBack
and patmat
are
ignored, and no check is performed if the abbreviations are compatible with
other functions.
If Return='Matrix'
, an N x N square matrix, with N equal to
the number of rows in Pedigree
(after running
PedPolish
) or the number of unique individuals in
Pairs
. If Return='Array'
, an N x N x R array is returned,
with R, the number of different relationships, determined by GenBack
and patmat
.
The following abbreviations are used within the returned Matrix
, or
as names of the 3rd dimension in the Array
or of the List
:
S |
Self |
M |
Mother |
P |
Father |
MP |
Mother or Father ( |
O |
Offspring |
FS |
Full sibling |
MHS |
Maternal half-sibling |
PHS |
Paternal half-sibling |
XHS |
other half-sibling (hermaphrodites) |
HS |
half-sibling ( |
MGM |
Maternal grandmother |
MGF |
Maternal grandfather |
PGM |
Paternal grandmother |
PGF |
Paternal grandfather |
GP |
Grandparent ( |
GO |
Grand-offspring |
FA |
Full avuncular; maternal or paternal aunt or uncle. |
FN |
Full nephew/niece |
HA |
Half avuncular |
HN |
Half nephew/niece |
DFC1 |
Double full first cousin |
FC1 |
Full first cousin |
U |
Unrelated (or otherwise related) |
ComparePairs
for comparing pairwise relationships
between two pedigrees; PlotRelPairs
.
Rel.griffin <- GetRelM(Ped_griffin, directed=FALSE) # few categories Rel.griffin <- GetRelM(Ped_griffin, patmat=TRUE, GenBack=2) # many cat. table(as.vector(Rel.griffin)) # turning matrix into vector first makes table() much faster PlotRelPairs(Rel.griffin)
Rel.griffin <- GetRelM(Ped_griffin, directed=FALSE) # few categories Rel.griffin <- GetRelM(Ped_griffin, patmat=TRUE, GenBack=2) # many cat. table(as.vector(Rel.griffin)) # turning matrix into vector first makes table() much faster PlotRelPairs(Rel.griffin)
Inheritance patterns used by SimGeno for non-autosomal SNPs, identical to those in Inherit.xlsx
data(Inherit_patterns)
data(Inherit_patterns)
An array with the following dimensions:
type: autosomal, x-chromosome, y-chromosome, or mtDNA
offspring sex: female, male, or unknown
offspring genotype: aa (0), aA (1), Aa (1), or AA (2)
mother genotype
father genotype
Jisca Huisman, [email protected]
Example life history data associated with the griffin pedigree.
data(LH_griffin)
data(LH_griffin)
A data frame with 200 rows and 3 variables (ID, Sex, BirthYear)
Jisca Huisman, [email protected]
## Not run: BY <- rep(c(2001:2010), each=20) Sex <- sample.int(n=2, size=200, replace=TRUE) ID <- paste0("i", formatC(1:200, width=3, flag="0"), "_", BY, "_", ifelse(Sex==1, "F", "M")) LH_griffin <- data.frame(ID, Sex, BirthYear = BY) ## End(Not run)
## Not run: BY <- rep(c(2001:2010), each=20) Sex <- sample.int(n=2, size=200, replace=TRUE) ID <- paste0("i", formatC(1:200, width=3, flag="0"), "_", BY, "_", ifelse(Sex==1, "F", "M")) LH_griffin <- data.frame(ID, Sex, BirthYear = BY) ## End(Not run)
This is the life history file associated with
Ped_HSg5
, which is Pedigree II in the paper.
data(LH_HSg5)
data(LH_HSg5)
A data frame with 1000 rows and 3 variables:
Female IDs start with 'a', males with 'b'; the next 2 numbers give the generation number (00 – 05), the last 3 numbers the individual ID number (runs continuously across all generations)
1 = female, 2 = male
from 2000 (generation 0, founders) to 2005
Jisca Huisman, [email protected]
Huisman, J. (2017) Pedigree reconstruction from SNP data: Parentage assignment, sibship clustering, and beyond. Molecular Ecology Resources 17:1009–1024.
Convert the first six columns of a PLINK .fam, .ped or .raw file into a three-column lifehistory file for sequoia. Optionally FID and IID are combined.
LHConvert( PlinkFile = NULL, UseFID = FALSE, SwapSex = TRUE, FIDsep = "__", LifeHistData = NULL )
LHConvert( PlinkFile = NULL, UseFID = FALSE, SwapSex = TRUE, FIDsep = "__", LifeHistData = NULL )
PlinkFile |
character string with name of genotype file to be converted. |
UseFID |
use the family ID column. The resulting ids (rownames of GenoM) will be in the form FID__IID. |
SwapSex |
change the coding from PLINK default (1=male, 2=female) to sequoia default (1=female, 2=male); any other numbers are set to NA. |
FIDsep |
characters inbetween FID and IID in composite-ID. By default a double underscore is used, to avoid problems when some IIDs contain an underscore. Only used when UseFID=TRUE. |
LifeHistData |
dataframe with additional sex and birth year info. In case of conflicts, LifeHistData takes priority, with a warning. If UseFID=TRUE, IDs in LifeHistData are assumed to be already as FID__IID. |
The first 6 columns of PLINK .fam, .ped and .raw files are by default FID - IID - father ID (ignored) - mother ID (ignored) - sex - phenotype.
A dataframe with id, sex and birth year, which can be used as input
for sequoia
.
GenoConvert
, PedStripFID
to reverse
UseFID
.
## Not run: # combine FID and IID in dataframe with additional sex & birth years ExtraLH$FID_IID <- paste(ExtraLH$FID, ExtraLH$IID, sep = "__") LH.new <- LHConvert(PlinkFile, UseFID = TRUE, FIDsep = "__", LifeHistData = ExtraLH) ## End(Not run)
## Not run: # combine FID and IID in dataframe with additional sex & birth years ExtraLH$FID_IID <- paste(ExtraLH$FID, ExtraLH$IID, sep = "__") LH.new <- LHConvert(PlinkFile, UseFID = TRUE, FIDsep = "__", LifeHistData = ExtraLH) ## End(Not run)
Estimate probability ratios for age
differences A and five categories of parent-offspring and sibling
relationships R.
MakeAgePrior( Pedigree = NULL, LifeHistData = NULL, MinAgeParent = NULL, MaxAgeParent = NULL, Discrete = NULL, Flatten = NULL, lambdaNW = -log(0.5)/100, Smooth = TRUE, Plot = TRUE, Return = "LR", quiet = FALSE )
MakeAgePrior( Pedigree = NULL, LifeHistData = NULL, MinAgeParent = NULL, MaxAgeParent = NULL, Discrete = NULL, Flatten = NULL, lambdaNW = -log(0.5)/100, Smooth = TRUE, Plot = TRUE, Return = "LR", quiet = FALSE )
Pedigree |
dataframe with id - dam - sire in columns 1-3, and optional column with birth years. Other columns are ignored. |
LifeHistData |
dataframe with 3 or 5 columns: id - sex (not used) - birthyear (optional columns BY.min - BY.max - YearLast not used), with unknown birth years coded as negative numbers or NA. "Birth year" may be in any arbitrary discrete time unit relevant to the species (day, month, decade), as long as parents are never born in the same time unit as their offspring. It may include individuals not in the pedigree, and not all individuals in the pedigree need to be in LifeHistData. |
MinAgeParent |
minimum age of a parent, a single number (min across dams and sires) or a vector of length two (dams, sires). Defaults to 1. When there is a conflict with the minimum age in the pedigree, the pedigree takes precedent. |
MaxAgeParent |
maximum age of a parent, a single number (max across
dams and sires) or a vector of length two (dams, sires). If NULL, it will
be set to latest - earliest birth year in |
Discrete |
discrete generations? By default (NULL), discrete
generations are assumed if all parent-offspring pairs have an age
difference of 1, and all siblings an age difference of 0, and there are at
least 20 pairs of each category (mother, father, maternal sibling, paternal
sibling). Otherwise, overlapping generations are presumed. When
|
Flatten |
logical. To deal with small sample sizes for some or all
relationships, calculate weighed average between the observed age
difference distribution among relatives and a flat (0/1) distribution. When
|
lambdaNW |
control weighing factors when |
Smooth |
smooth the tails of and any dips in the distribution? Sets dips
(<10% of average of neighbouring ages) to the average of the neighbouring
ages, sets the age after the end (oldest observed age) to LR(end)/2, and
assigns a small value (0.001) to the ages before the front (youngest
observed age) and after the new end. Peaks are not smoothed out, as these
are less likely to cause problems than dips, and are more likely to be
genuine characteristics of the species. Is set to |
Plot |
plot a heatmap of the results? |
Return |
return only a matrix with the likelihood-ratio |
quiet |
suppress messages. |
is the ratio between the observed
counts of pairs with age difference A and relationship R (
),
and the expected counts if age and relationship were independent
(
).
During pedigree reconstruction, are multiplied by the
genetic-only
to obtain a probability that the
pair are relatives of type R conditional on both their age difference and
their genotypes.
The age-difference prior is used for pairs of genotyped individuals, as
well as for dummy individuals. This assumes that the propensity for a pair
with a given age difference to both be sampled does not depend on their
relationship, so that the ratio does not differ between
sampled and unsampled pairs.
For further details, see the vignette.
A matrix with the probability ratio of the age difference between two
individuals conditional on them being a certain type of relative
() versus being a random draw from the sample (
).
Assuming conditional independence, this equals the probability ratio of
being a certain type of relative conditional on the age difference, versus
being a random draw.
The matrix has one row per age difference (0 - nAgeClasses) and five columns, one for each relationship type, with abbreviations:
M |
Mothers |
P |
Fathers |
FS |
Full siblings |
MS |
Maternal half-siblings |
PS |
Paternal half-siblings |
When Return
='all', a list is returned with the following elements:
BirthYearRange |
vector length 2 |
MaxAgeParent |
vector length 2, see details |
tblA.R |
matrix with the counts per age difference (rows) / relationship (columns) combination, plus a column 'X' with age differences across all pairs of individuals |
PA.R |
Proportions, i.e. |
LR.RU.A.raw |
Proportions |
Weights |
vector length 4, the weights used to flatten the distributions |
LR.RU.A |
the ageprior, flattend and/or smoothed |
Specs.AP |
the names of the input |
The small sample correction with Smooth
and/or Flatten
prevents errors in one dataset, but may introduce errors in another; a
single solution that fits to the wide variety of life histories and
datasets is impossible. Please do inspect the matrix, e.g. with
PlotAgePrior
, and adjust the input parameters and/or the output
matrix as necessary.
When all individuals in LifeHistData
have the same birth year, it is
assumed that Discrete=TRUE
and MaxAgeParent=1
. Consequently,
it is assumed there are no avuncular pairs present in the sample; cousins
are considered as alternative. To enforce overlapping generations, and
thereby the consideration of full- and half- avuncular relationships, set
MaxAgeParent
to some value greater than .
When no birth year information is given at all, a single cohort is assumed, and the same rules apply.
"Birth year" may be in any arbitrary time unit relevant to the species (day, month, decade), as long as parents are always born before their putative offspring, and never in the same time unit (e.g. parent's BirthYear= 1 (or 2001) and offspring BirthYear=5 (or 2005)). Negative numbers and NA's are interpreted as unknown, and fractional numbers are not allowed.
The maximum parental age for each sex equals the maximum of:
the maximum age of parents in Pedigree
,
the input parameter MaxAgeParent
,
the maximum range of birth years in LifeHistData
(including
BY.min and BY.max). Only used if both of the previous are NA
, or
if there are fewer than 20 parents of either sex assigned.
1, if Discrete=TRUE
or the previous three are all NA
If the age distribution of assigned parents does not capture the maximum
possible age of parents, it is advised to specify MaxAgeParent
for
one or both sexes. Not doing so may hinder subsequent assignment of both
dummy parents and grandparents. Not compatible with Smooth
. If the
largest age difference in the pedigree is larger than the specified
MaxAgeParent
, the pedigree takes precedent (i.e. the largest of the
two is used).
@section grandparents & avuncular
The agepriors for grand-parental and avuncular pairs is calculated from
these by sequoia
, and included in its output as
'AgePriorExtra'.
sequoia
and its argument args.AP
,
PlotAgePrior
for visualisation. The age vignette gives
further details, mathematical justification, and some examples.
# without pedigree or lifehistdata: MakeAgePrior(MaxAgeParent = c(2,3)) MakeAgePrior(Discrete=TRUE) # single cohort: MakeAgePrior(LifeHistData = data.frame(ID = letters[1:5], Sex=3, BirthYear=1984)) # overlapping generations: # without pedigree: MaxAgeParent = max age difference between any pair +1 MakeAgePrior(LifeHistData = SeqOUT_griffin$LifeHist) # with pedigree: MakeAgePrior(Pedigree=Ped_griffin, LifeHistData=SeqOUT_griffin$LifeHist, Smooth=FALSE, Flatten=FALSE) # with small-sample correction: MakeAgePrior(Pedigree=Ped_griffin, LifeHistData=SeqOUT_griffin$LifeHist, Smooth=TRUE, Flatten=TRUE) # Call from sequoia() via args.AP: Seq_HSg5 <- sequoia(SimGeno_example, LH_HSg5, Module="par", args.AP=list(Discrete = TRUE), # non-overlapping generations CalcLLR = FALSE, # skip time-consuming calculation of LLR's Plot = FALSE) # no summary plots when finished
# without pedigree or lifehistdata: MakeAgePrior(MaxAgeParent = c(2,3)) MakeAgePrior(Discrete=TRUE) # single cohort: MakeAgePrior(LifeHistData = data.frame(ID = letters[1:5], Sex=3, BirthYear=1984)) # overlapping generations: # without pedigree: MaxAgeParent = max age difference between any pair +1 MakeAgePrior(LifeHistData = SeqOUT_griffin$LifeHist) # with pedigree: MakeAgePrior(Pedigree=Ped_griffin, LifeHistData=SeqOUT_griffin$LifeHist, Smooth=FALSE, Flatten=FALSE) # with small-sample correction: MakeAgePrior(Pedigree=Ped_griffin, LifeHistData=SeqOUT_griffin$LifeHist, Smooth=TRUE, Flatten=TRUE) # Call from sequoia() via args.AP: Seq_HSg5 <- sequoia(SimGeno_example, LH_HSg5, Module="par", args.AP=list(Discrete = TRUE), # non-overlapping generations CalcLLR = FALSE, # skip time-consuming calculation of LLR's Plot = FALSE) # no summary plots when finished
Example output of a check for parent-offspring pairs and
parent-parent-offspring trios with GetMaybeRel
, with
Geno_griffin
as input (simulated from
Ped_griffin
).
data(MaybeRel_griffin)
data(MaybeRel_griffin)
a list with 2 dataframes, 'MaybePar' and 'MaybeTrio'. See
GetMaybeRel
for further details.
Jisca Huisman, [email protected]
## Not run: MaybeRel_griffin <- GetMaybeRel(GenoM = Geno_griffin, Err=0.001, Module = 'par') ## End(Not run)
## Not run: MaybeRel_griffin <- GetMaybeRel(GenoM = Geno_griffin, Err=0.001, Module = 'par') ## End(Not run)
Generate errors and missing values in a (simulated) genotype matrix.
MkGenoErrors( SGeno, CallRate = 0.99, SnpError = 5e-04, ErrorFV = function(E) c((E/2)^2, E - (E/2)^2, E/2), ErrorFM = NULL, Error.shape = 0.5, CallRate.shape = 1, WithLog = FALSE )
MkGenoErrors( SGeno, CallRate = 0.99, SnpError = 5e-04, ErrorFV = function(E) c((E/2)^2, E - (E/2)^2, E/2), ErrorFM = NULL, Error.shape = 0.5, CallRate.shape = 1, WithLog = FALSE )
SGeno |
matrix with genotype data in Sequoia's format: 1 row per individual, 1 column per SNP, and genotypes coded as 0/1/2. |
CallRate |
either a single number for the mean call rate (genotyping success), OR a vector with the call rate at each SNP, OR a named vector with the call rate for each individual. In the third case, ParMis is ignored, and individuals in the pedigree (as id or as parent) not included in this vector are presumed non-genotyped. |
SnpError |
either a single value which will be combined with
|
ErrorFV |
function taking the error rate (scalar) as argument and returning a length 3 vector with hom->other hom, hom->het, het->hom. May be an 'ErrFlavour', e.g. 'version2.9'. |
ErrorFM |
function taking the error rate (scalar) as argument and
returning a 3x3 matrix with probabilities that actual genotype i (rows) is
observed as genotype j (columns). See below for details. To use, set
|
Error.shape |
first shape parameter (alpha) of beta-distribution of per-SNP error rates. A higher value results in a flatter distribution. |
CallRate.shape |
as Error.shape, for per-SNP call rates. |
WithLog |
Include dataframe in output with which datapoints have been edited, with columns id - SNP - actual (original, input) - observed (edited, output). |
The input genotype matrix, with some genotypes replaced, and some
set to missing (-9). If WithLog=TRUE
, a list with 3 elements: GenoM,
Log, and Counts_actual (genotype counts in input, to allow double checking
of simulated genotyping error rate).
Example pedigree with overlapping generations and polygamy.
data(Ped_griffin)
data(Ped_griffin)
A data frame with 200 rows and 4 variables (id, dam, sire, birthyear)
The R code used to create this pedigree can be found in /data-raw.
Jisca Huisman, [email protected]
LH_griffin
; SeqOUT_griffin
for a
sequoia run on simulated genotype data based on this pedigree;
Ped_HSg5
for another pedigree; sequoia
.
A pedigree with five non-overlapping generations and considerable inbreeding. Each female mated with two random males and each male with three random females, producing four full-sib offspring per mating. This is Pedigree II in the paper.
data(Ped_HSg5)
data(Ped_HSg5)
A data frame with 1000 rows and 3 variables (id, dam, sire)
Jisca Huisman, [email protected]
Huisman, J. (2017) Pedigree reconstruction from SNP data: Parentage assignment, sibship clustering, and beyond. Molecular Ecology Resources 17:1009–1024.
LH_HSg5 SimGeno_example sequoia
Compare an inferred pedigree (Ped2) to a previous or simulated pedigree (Ped1), including comparison of sibship clusters and sibship grandparents.
PedCompare( Ped1 = NULL, Ped2 = NULL, DumPrefix = c("F0", "M0"), SNPd = NULL, Symmetrical = TRUE, minSibSize = "1sib1GP", Plot = TRUE )
PedCompare( Ped1 = NULL, Ped2 = NULL, DumPrefix = c("F0", "M0"), SNPd = NULL, Symmetrical = TRUE, minSibSize = "1sib1GP", Plot = TRUE )
Ped1 |
first (e.g. original) pedigree, dataframe with columns id-dam-sire; only the first 3 columns will be used. |
Ped2 |
second pedigree, e.g. newly inferred |
DumPrefix |
character vector with the prefixes identifying dummy
individuals in |
SNPd |
character vector with IDs of genotyped individuals. If
|
Symmetrical |
when determining the category of individuals
(Genotyped/Dummy/X), use the 'highest' category across the two pedigrees
( |
minSibSize |
minimum requirements to be considered 'dummifiable',
passed to
|
Plot |
show square Venn diagrams of counts? |
The comparison is divided into different classes of ‘assignable’
parents (getAssignCat
). This includes cases where the focal
individual and parent according to Ped1 are both Genotyped (G-G), as well
as cases where the non-genotyped parent according to Ped1 can be lined up
with a sibship Dummy parent in Ped2 (G-D), or where the non-genotyped focal
individual in Ped1 can be matched to a dummy individual in Ped2 (D-G and
D-D). If SNPd is NULL (the default), and DumPrefix is set to NULL, the
intersect between the IDs in Pedigrees 1 and 2 is taken as the vector of
genotyped individuals.
A list with
Counts |
A 7 x 5 x 2 named numeric array with the number of matches and mismatches, see below |
Counts.detail |
a large numeric array with number of matches and mismatches, with more detail for all possible combination of categories |
MergedPed |
A dataframe with side-by-side comparison of the two pedigrees |
ConsensusPed |
A consensus pedigree, with Pedigree 2 taking priority over Pedigree 1 |
DummyMatch |
Dataframe with all dummy IDs in Pedigree 2 (id.2), and the best-matching individual in Pedigree 1 (id.1). Also includes the class of the dam & sire, as well as counts of offspring per outcome class (off.Match, off.Mismatch, etc.) |
Mismatch |
A subset of MergedPed with mismatches between Ped1 and Ped2, as defined below |
Ped1only |
as Mismatches, with parents in Ped1 that were not assigned in Ped2 |
Ped2only |
as Mismatches, with parents in Ped2 that were missing in Ped1 |
'MergedPed', 'Mismatch', 'Ped1only' and 'Ped2only' provide the following columns:
id |
All ids in both Pedigree 1 and 2. For dummy individuals, this is the id in pedigree 2 |
dam.1 , sire.1
|
parents in Pedigree 1 |
dam.2 , sire.2
|
parents in Pedigree 2 |
id.r , dam.r , sire.r
|
The real id of dummy individuals or parents in Pedigree 2, i.e. the best-matching non-genotyped individual in Pedigree 1, or "nomatch". If a sibship in Pedigree 1 is divided over 2 sibships in Pedigree 2, the smaller one will be denoted as "nomatch" |
id.dam.cat , id.sire.cat
|
the category of the individual (first letter)
and highest category of the dam (sire) in Pedigree 1 or 2:
G=Genotyped, D=(potential) dummy, X=none. Individual, one-letter categories
are generated by |
dam.class , sire.class
|
classification of dam and sire: Match, Mismatch, P1only, P2only, or '_' when no parent is assigned in either pedigree |
The first dimension of Counts
denotes the following categories:
GG |
Genotyped individual, assigned a genotyped parent in either pedigree |
GD |
Genotyped individual, assigned a dummy parent, or at least 1 genotyped sibling or a genotyped grandparent in Pedigree 1) |
GT |
Genotyped individual, total |
DG |
Dummy individual, assigned a genotyped parent (i.e., grandparent of the sibship in Pedigree 2) |
DD |
Dummy individual, assigned a dummy parent (i.e., avuncular relationship between sibships in Pedigree 2) |
DT |
Dummy total |
TT |
Total total, includes all genotyped individuals, plus non-genotyped individuals in Pedigree 1, plus non-replaced dummy individuals (see below) in Pedigree 2 |
The second dimension of Counts
gives the outcomes:
Total |
The total number of individuals with a parent assigned in either or both pedigrees |
Match |
The same parent is assigned in both pedigrees (non-missing). For dummy parents, it is considered a match if the inferred sibship which contains the most offspring of a non-genotyped parent, consists for more than half of this individual's offspring. |
Mismatch |
Different parents assigned in the two pedigrees. When a sibship according to Pedigree 1 is split over two sibships in Pedigree 2, the smaller fraction is included in the count here. |
P1only |
Parent in Pedigree 1 but not 2; includes non-assignable parents (e.g. not genotyped and no genotyped offspring). |
P2only |
Parent in Pedigree 2 but not 1. |
The third dimension Counts
separates between maternal and paternal
assignments, where e.g. paternal 'DT' is the assignment of fathers to both
maternal and paternal sibships (i.e., to dummies of both sexes).
In 'ConsensusPed', the priority used is parent.r (if not "nomatch") > parent.2 > parent.1. The columns 'id.cat', dam.cat' and 'sire.cat' have two additional levels compared to 'MergedPed':
G |
Genotyped |
D |
Dummy individual (in Pedigree 2) |
R |
Dummy individual in pedigree 2 replaced by best matching non-genotyped individual in pedigree 1 |
U |
Ungenotyped, Unconfirmed (parent in Pedigree 1, with no dummy match in Pedigree 2) |
X |
No parent in either pedigree |
Note that 'assignable' may be overly optimistic. Some parents from
Ped1
indicated as assignable may never be assigned by sequoia, for
example parent-offspring pairs where it cannot be determined which is the
older of the two, or grandparents that are indistinguishable from full
avuncular (i.e. genetics inconclusive because the candidate has no parent
assigned, and ageprior inconclusive).
Considered as potential dummy individuals are all non-genotyped individuals in Pedigree 1 who have, according to either pedigree, at least 2 genotyped offspring, or at least one genotyped offspring and a genotyped parent.
Perhaps unexpectedly, cases where all siblings are correct but a dummy
parent rather than the genotyped Ped1-parent are assigned, are classified
as a mismatch (for each of the siblings). These are typically due to a too
low assumed genotyping error rate, a wrong parental birth year, or some
other issue that requires user inspection. To identify these cases,
ComparePairs
may be of help.
If Pedigree 2 includes samples for which the ID is unknown, the behaviour of
PedCompare
depends on whether the temporary IDs for these samples are
included in SNPd
. If they are included, matching (actual) IDs in
Pedigree 1 will be flagged as mismatches (because the IDs differ). If they
are not included in SNPd
, or SNPd
is not explicitly provided,
matches are accepted, as the situation is indistinguishable from comparing
dummy parents across pedigrees.
This is of course all conditional on relatives of the mystery sample being assigned in Pedigree 2.
Jisca Huisman, [email protected]
ComparePairs
for comparison of all pairwise
relationships in 2 pedigrees; EstConf
for repeated
simulate-reconstruct-compare; getAssignCat
for all parents in
the reference pedigree that could have been assigned;
CalcOHLLR
to check how well an 'old' pedigree fits with the
SNP data.
compare <- PedCompare(Ped_griffin, SeqOUT_griffin$Pedigree) compare$Counts["TT",,] # totals only; 45 dams & 47 sires non-assigned compare$Counts[,,"dam"] # dams only # inspect non-assigned in Ped2, id genotyped, dam might-be-dummy PedM <- compare$MergedPed # for brevity PedM[PedM$id.dam.cat=='GD' & PedM$dam.class=='P1only',] # zoom in on specific dam PedM[which(PedM$dam.1=="i011_2001_F"), ] # no sire for 'i034_2002_F' -> impossible to tell if half-sibs or avuncular # overview of all non-genotyped -- dummy matches head(compare$DummyMatch) # success of paternity assignment, if genotyped mother correctly assigned dimnames(compare$Counts.detail) compare$Counts.detail["G","G",,"Match",] # default before version 3.5: minSibSize = '2sib' compare_2s <- PedCompare(Ped_griffin, SeqOUT_griffin$Pedigree, minSibSize = '2sib') compare_2s$Counts[,,"dam"] # note decrease in Total 'dummies with(compare_2s$MergedPed, table(id.dam.cat, dam.class)) # some with id.cat = 'X' or dam.cat='X' are nonetheless dam.class='Match'
compare <- PedCompare(Ped_griffin, SeqOUT_griffin$Pedigree) compare$Counts["TT",,] # totals only; 45 dams & 47 sires non-assigned compare$Counts[,,"dam"] # dams only # inspect non-assigned in Ped2, id genotyped, dam might-be-dummy PedM <- compare$MergedPed # for brevity PedM[PedM$id.dam.cat=='GD' & PedM$dam.class=='P1only',] # zoom in on specific dam PedM[which(PedM$dam.1=="i011_2001_F"), ] # no sire for 'i034_2002_F' -> impossible to tell if half-sibs or avuncular # overview of all non-genotyped -- dummy matches head(compare$DummyMatch) # success of paternity assignment, if genotyped mother correctly assigned dimnames(compare$Counts.detail) compare$Counts.detail["G","G",,"Match",] # default before version 3.5: minSibSize = '2sib' compare_2s <- PedCompare(Ped_griffin, SeqOUT_griffin$Pedigree, minSibSize = '2sib') compare_2s$Counts[,,"dam"] # note decrease in Total 'dummies with(compare_2s$MergedPed, table(id.dam.cat, dam.class)) # some with id.cat = 'X' or dam.cat='X' are nonetheless dam.class='Match'
Ensure all parents & all genotyped individuals are included, remove duplicates, rename columns, and replace 0 by NA or v.v..
PedPolish( Pedigree, gID = NULL, ZeroToNA = TRUE, NAToZero = FALSE, DropNonSNPd = TRUE, FillParents = FALSE, KeepAllColumns = TRUE, KeepAllRows = FALSE, NullOK = FALSE, LoopCheck = TRUE, StopIfInvalid = TRUE )
PedPolish( Pedigree, gID = NULL, ZeroToNA = TRUE, NAToZero = FALSE, DropNonSNPd = TRUE, FillParents = FALSE, KeepAllColumns = TRUE, KeepAllRows = FALSE, NullOK = FALSE, LoopCheck = TRUE, StopIfInvalid = TRUE )
Pedigree |
dataframe where the first 3 columns are id, dam, sire. |
gID |
character vector with ids of genotyped individuals (rownames of genotype matrix). |
ZeroToNA |
logical, replace 0's for missing values by NA's (defaults to
|
NAToZero |
logical, replace NA's for missing values by 0's. If
|
DropNonSNPd |
logical, remove any non-genotyped individuals (but keep
non-genotyped parents), & sort pedigree in order of |
FillParents |
logical, for individuals with only 1 parent assigned, set
the other parent to a dummy (without assigning siblings or grandparents).
Makes the pedigree compatible with R packages and software that requires
individuals to have either 2 or 0 parents, such as
|
KeepAllColumns |
Keep all columns in |
KeepAllRows |
Keep all rows in |
NullOK |
logical, is it OK for Ped to be NULL? Then NULL will be returned. |
LoopCheck |
logical, check for invalid pedigree loops by calling
|
StopIfInvalid |
if a pedigree loop is detected, stop with an error (TRUE, default). |
Recognized column names are an exact or partial match with (case is ignored):
"id", "iid", "off"
"dam", "mother", "mot", "mom", "mum", "mat"
"sire", "father", "fat", "dad", "pat"
sequoia
requires the column order id - dam - sire; columns 2 and 3 are
swapped by this function if necessary.
## Not run: # To get the output pedigree into kinship2 compatible format: PedP <- sequoia::PedPolish(SeqOUT$Pedigree, DropNonSNPd=FALSE, FillParents = TRUE) PedP$Sex <- with(PedP, ifelse(id %in% dam, "female", "male")) # default to 'male' to avoid warning: "More than 25% of the gender values are # 'unknown'" Ped.fix <- with(PedP, kinship2::fixParents(id=id, dadid=sire, momid=dam, sex=Sex)) Ped.k <- with(Ped.fix, kinship2::pedigree(id, dadid, momid, sex, missid=0)) ## End(Not run)
## Not run: # To get the output pedigree into kinship2 compatible format: PedP <- sequoia::PedPolish(SeqOUT$Pedigree, DropNonSNPd=FALSE, FillParents = TRUE) PedP$Sex <- with(PedP, ifelse(id %in% dam, "female", "male")) # default to 'male' to avoid warning: "More than 25% of the gender values are # 'unknown'" Ped.fix <- with(PedP, kinship2::fixParents(id=id, dadid=sire, momid=dam, sex=Sex)) Ped.k <- with(Ped.fix, kinship2::pedigree(id, dadid, momid, sex, missid=0)) ## End(Not run)
Reverse the joining of FID and IID in
GenoConvert
and LHConvert
PedStripFID(Ped, FIDsep = "__")
PedStripFID(Ped, FIDsep = "__")
Ped |
pedigree as returned by sequoia (e.g. |
FIDsep |
characters inbetween FID and IID in composite-ID. |
Note that the family IDs are the ones provided, and not
automatically updated. New, numeric ones can be obtained with
FindFamilies
.
A pedigree with 6 columns
FID |
family ID of focal individual (offspring). |
id |
within-family of focal individual |
dam.FID |
original family ID of assigned dam |
dam |
within-family of dam |
sire.FID |
original family ID of assigned sire |
sire |
within-family of sire |
Visualise the age-difference based prior probability ratios as a heatmap.
PlotAgePrior(AP = NULL, legend = TRUE)
PlotAgePrior(AP = NULL, legend = TRUE)
AP |
matrix with age priors ( |
legend |
if |
A heatmap.
PlotAgePrior(SeqOUT_griffin$AgePriors) PlotAgePrior(SeqOUT_griffin$AgePriorExtra)
PlotAgePrior(SeqOUT_griffin$AgePriors) PlotAgePrior(SeqOUT_griffin$AgePriorExtra)
Colour-coded scatter plots of e.g. LLR(PO/U) against LLR(FS/U), for various relationship combinations.
PlotPairLL( PairLL, combo = list(c("FS", "PO"), c("HS", "FS"), c("GP", "HS"), c("FA", "HS")), nrows = NULL, ncols = NULL, bgcol = TRUE, Tassign = 0.5, Tfilter = -2 )
PlotPairLL( PairLL, combo = list(c("FS", "PO"), c("HS", "FS"), c("GP", "HS"), c("FA", "HS")), nrows = NULL, ncols = NULL, bgcol = TRUE, Tassign = 0.5, Tfilter = -2 )
PairLL |
dataframe, output from |
combo |
list with length-2 character vectors, specifying which likelihoods to plot against each other. Choose from 'PO', 'FS', 'HS', 'GP', 'FA', and 'HA'. The first one gets plotted on the x-axis, the second on the y-axis. Subsequent figures will be drawn row-wise. |
nrows |
number of rows in the figure layout. If |
ncols |
number of columns in the figure layout. If both |
bgcol |
logical, colour the upper and lower triangle background of each figure to match the specified relationship combo. |
Tassign |
assignment threshold, shown as grey square in bottom-left corner and a band along the diagonal. |
Tfilter |
filter threshold, shown as dark grey square in bottom-left. |
The colour of each point is determined by columns focal
(outer circle) and TopRel
(inner filling) of PairLL
.
Impossible relationships (LL > 0 in PairLL
) are shown as -Inf
on the axes, if any are present.
Pairs <- data.frame(ID1 = "a01005", ID2 = c("a00013", "a00008", "a00011", "b00001", "b01006", "b01007", "b01013", "b01014"), focal = rep(c("PO", "HS"), each=4)) PLL <- CalcPairLL(Pairs, GenoM=SimGeno_example, Plot=FALSE) PlotPairLL(PLL, combo = list(c("FS", "PO"), c("HS", "FS"), c("GP", "HS"), c("FA", "HS"), c("HA", "FA"), c("FA", "GP")), nrows = 3)
Pairs <- data.frame(ID1 = "a01005", ID2 = c("a00013", "a00008", "a00011", "b00001", "b01006", "b01007", "b01013", "b01014"), focal = rep(c("PO", "HS"), each=4)) PLL <- CalcPairLL(Pairs, GenoM=SimGeno_example, Plot=FALSE) PlotPairLL(PLL, combo = list(c("FS", "PO"), c("HS", "FS"), c("GP", "HS"), c("FA", "HS"), c("HA", "FA"), c("FA", "GP")), nrows = 3)
square Venn diagrams with PedCompare
Counts
.
PlotPedComp(Counts, sameSize = FALSE)
PlotPedComp(Counts, sameSize = FALSE)
Counts |
a 7x5x2 array with counts of matches and mismatches per
category (genotyped vs dummy), as returned by |
sameSize |
logical, make all per-category Venn diagrams the same size
|
PC.g <- PedCompare(Ped1 = cbind(FieldMums_griffin, sire=NA), Ped2 = SeqOUT_griffin$Pedigree) PlotPedComp(PC.g$Counts)
PC.g <- PedCompare(Ped1 = cbind(FieldMums_griffin, sire=NA), Ped2 = SeqOUT_griffin$Pedigree) PlotPedComp(PC.g$Counts)
Plot pairwise 1st and 2nd degree relationships between individuals, similar to Colony's dyad plot.
PlotRelPairs( RelM = NULL, subset.x = NULL, subset.y = NULL, drop.U = TRUE, pch.symbols = FALSE, cex.axis = 0.7, mar = c(5, 5, 1, 8) )
PlotRelPairs( RelM = NULL, subset.x = NULL, subset.y = NULL, drop.U = TRUE, pch.symbols = FALSE, cex.axis = 0.7, mar = c(5, 5, 1, 8) )
RelM |
square matrix with relationships between all pairs of
individuals, as generated by |
subset.x |
vector with IDs to show on the x-axis; the y-axis will include all siblings, parents and grandparents of these individuals. |
subset.y |
vector with IDs to show on the y-axis; the x-axis will
include all siblings, offspring and grandoffspring of these individuals.
Specify either |
drop.U |
logical: omit individuals without relatives from the plot, and
omit individuals without parents from the x-axis. Ignored if
|
pch.symbols |
logical: use different symbols for the different relationships (TRUE) or only colours in a heatmap-like fashion (FALSE). Question marks in the plot indicate that one or more of the symbols are not supported on your machine. |
cex.axis |
the magnification to be used for axis annotation. Decrease this value if R is dropping axis labels to prevent them from overlapping. |
mar |
A numerical vector of the form c(bottom, left, top, right) which gives the number of lines of margin to be specified on the four sides of the plot. |
Parents are shown above the diagonal (y-axis is parent of x-axis),
siblings below the diagonal. If present, grandparents and full aunts/uncles
are also shown above the diagonal. Individuals are sorted by dam ID and
sire ID so that siblings are grouped together, and then by generation
(getGenerations
) so that later generations are closer to the
origin.
If RelM
is based on a dataframe with pairs rather than a pedigree,
parents and grandparents are similarly only displayed above the diagonal,
but the order of individuals is arbitrary and the ID on the x-axis is as
likely to be the grandparent of the one on the y-axis as vice versa. Second
degree relatives of unknown classification ('2nd', may be HS, GP or FA) are
only shown below the diagonal. The switch between pedigree-based versus
pairs-based is made on whether parent-offspring pairs are coded as 'M','P',
'MP', 'O' (unidirectional, from pedigree) or as 'PO' (bidirectional, from
pairs).
Note that half-avuncular and (double) full cousin pairs are ignored.
The subsetted, rearranged RelM
is returned
invisible
.
The numbers of unique pairs of each relationship type are given in the
figure legend. The number of 'self' pairs refers to the number of
individuals on the x-axis, not all of whom may occur on the y-axis when
drop.U=TRUE
or a subset
is specified.
GetRelM
; SummarySeq
for individual-wise
graphical pedigree summaries.
Rel.griffin <- GetRelM(Ped_griffin, patmat=TRUE, GenBack=2) PlotRelPairs(Rel.griffin) ## Not run: PlotRelPairs(Rel.griffin, pch.symbols = TRUE) # plot with unicode symbols not supported on all platforms ## End(Not run) # parents & grandparents of 2008 cohort: PlotRelPairs(Rel.griffin, subset.x = Ped_griffin$id[Ped_griffin$birthyear ==2008]) # offspring & grand-offspring of 2002 cohort: PlotRelPairs(Rel.griffin, subset.y = Ped_griffin$id[Ped_griffin$birthyear ==2002])
Rel.griffin <- GetRelM(Ped_griffin, patmat=TRUE, GenBack=2) PlotRelPairs(Rel.griffin) ## Not run: PlotRelPairs(Rel.griffin, pch.symbols = TRUE) # plot with unicode symbols not supported on all platforms ## End(Not run) # parents & grandparents of 2008 cohort: PlotRelPairs(Rel.griffin, subset.x = Ped_griffin$id[Ped_griffin$birthyear ==2008]) # offspring & grand-offspring of 2002 cohort: PlotRelPairs(Rel.griffin, subset.y = Ped_griffin$id[Ped_griffin$birthyear ==2002])
visualise the numbers of assigned parents, sibship sizes, and parental LLRs
PlotSeqSum(SeqSum, Pedigree = NULL, Panels = "all", ask = TRUE)
PlotSeqSum(SeqSum, Pedigree = NULL, Panels = "all", ask = TRUE)
SeqSum |
list output from |
Pedigree |
dataframe with at least id, dam and sire in columns 1-3, respectively. If columns with parental LLRs and/or Mendelian errors are present, these will be plotted as well. |
Panels |
character vector with panel(s) to plot. Choose from 'all', 'G.parents' (parents of genotyped individuals), 'D.parents' (parents of dummies), 'O.parents' (parents of non-genotyped non-dummies), sibships', 'LLR', 'OH'. |
ask |
ask for user key stroke before proceeding to next plot. |
sumry <- SummarySeq(SeqOUT_griffin, Plot=FALSE) PlotSeqSum(sumry, SeqOUT_griffin$Pedigree, Panels='all', ask=FALSE)
sumry <- SummarySeq(SeqOUT_griffin, Plot=FALSE) PlotSeqSum(sumry, SeqOUT_griffin$Pedigree, Panels='all', ask=FALSE)
Example output of a sequoia run including sibship clustering,
with Geno_griffin
as input (simulated from
Ped_griffin
).
data(SeqOUT_griffin)
data(SeqOUT_griffin)
a list, see sequoia
Jisca Huisman, [email protected]
## Not run: SeqOUT_griffin <- sequoia(GenoM = Geno_griffin, LifeHistData = LH_griffin, Module = 'ped') ## End(Not run)
## Not run: SeqOUT_griffin <- sequoia(GenoM = Geno_griffin, LifeHistData = LH_griffin, Module = 'ped') ## End(Not run)
Example output of a sequoia
run including sibship
clustering, based on Pedigree Geno_HSg5
.
data(SeqOUT_HSg5)
data(SeqOUT_HSg5)
a list, see sequoia
Jisca Huisman, [email protected]
## Not run: # this output was created as follows: Geno <- SimGeno(Ped = Ped_HSg5, nSnp = 200) SeqOUT_HSg5 <- sequoia(GenoM = Geno, LifeHistData = LH_HSg5, Module = "ped", Err = 0.005) ## End(Not run) # some ways to inspect the output; see vignette for more info: names(SeqOUT_HSg5) SeqOUT_HSg5$Specs SummarySeq(SeqOUT_HSg5)
## Not run: # this output was created as follows: Geno <- SimGeno(Ped = Ped_HSg5, nSnp = 200) SeqOUT_HSg5 <- sequoia(GenoM = Geno, LifeHistData = LH_HSg5, Module = "ped", Err = 0.005) ## End(Not run) # some ways to inspect the output; see vignette for more info: names(SeqOUT_HSg5) SeqOUT_HSg5$Specs SummarySeq(SeqOUT_HSg5)
Perform pedigree reconstruction based on SNP data, including parentage assignment and sibship clustering.
sequoia( GenoM = NULL, LifeHistData = NULL, SeqList = NULL, Module = "ped", Err = 1e-04, Tfilter = -2, Tassign = 0.5, MaxSibshipSize = 100, DummyPrefix = c("F", "M"), Complex = "full", Herm = "no", UseAge = "yes", args.AP = list(Flatten = NULL, Smooth = TRUE), mtSame = NULL, CalcLLR = TRUE, quiet = FALSE, Plot = NULL, StrictGenoCheck = TRUE, ErrFlavour = "version2.9", MaxSibIter = 42, MaxMismatch = NA, FindMaybeRel = FALSE )
sequoia( GenoM = NULL, LifeHistData = NULL, SeqList = NULL, Module = "ped", Err = 1e-04, Tfilter = -2, Tassign = 0.5, MaxSibshipSize = 100, DummyPrefix = c("F", "M"), Complex = "full", Herm = "no", UseAge = "yes", args.AP = list(Flatten = NULL, Smooth = TRUE), mtSame = NULL, CalcLLR = TRUE, quiet = FALSE, Plot = NULL, StrictGenoCheck = TRUE, ErrFlavour = "version2.9", MaxSibIter = 42, MaxMismatch = NA, FindMaybeRel = FALSE )
GenoM |
numeric matrix with genotype data: One row per individual,
one column per SNP, coded as 0, 1, 2, missing values as a negative number
or NA. You can reformat data with |
LifeHistData |
data.frame with up to 6 columns:
"Birth year" may be in any arbitrary discrete time unit relevant to the species (day, month, decade), as long as parents are never born in the same time unit as their offspring, and only integers are used. Individuals do not need to be in the same order as in ‘GenoM’, nor do all genotyped individuals need to be included. |
SeqList |
list with output from a previous run, to be re-used in the
current run. Used are elements ‘PedigreePar’, ‘LifeHist’, ‘AgePriors’,
‘Specs’, and ‘ErrM’, and these override the corresponding input parameters.
Not all of these elements need to be present, and all other elements are
ignored. If |
Module |
one of
NOTE: Until 'MaxSibIter' is fully deprecated: if 'MaxSibIter' differs
from the default ( |
Err |
estimated genotyping error rate, as a single number, or a length 3
vector with P(hom|hom), P(het|hom), P(hom|het), or a 3x3 matrix. See
details below. The error rate is presumed constant across SNPs, and
missingness is presumed random with respect to actual genotype. Using
|
Tfilter |
threshold log10-likelihood ratio (LLR) between a proposed relationship versus unrelated, to select candidate relatives. Typically a negative value, related to the fact that unconditional likelihoods are calculated during the filtering steps. More negative values may decrease non-assignment, but will increase computational time. |
Tassign |
minimum LLR required for acceptance of proposed relationship, relative to next most likely relationship. Higher values result in more conservative assignments. Must be zero or positive. |
MaxSibshipSize |
maximum number of offspring for a single individual (a generous safety margin is advised). |
DummyPrefix |
character vector of length 2 with prefixes for dummy dams (mothers) and sires (fathers); maximum 20 characters each. Length 3 vector in case of hermaphrodites (or default prefix 'H'). |
Complex |
Breeding system complexity. Either "full" (default), "simp" (simplified, no explicit consideration of inbred relationships), "mono" (monogamous). |
Herm |
Hermaphrodites, either "no", "A" (distinguish between dam and sire role, default if at least 1 individual with sex=4), or "B" (no distinction between dam and sire role). Both of the latter deal with selfing. |
UseAge |
either "yes" (default), "no" (only use age differences for filtering), or "extra" (additional rounds with extra reliance on ageprior, may boost assignments but increased risk of erroneous assignments). Used during full reconstruction only. |
args.AP |
list with arguments to be passed on to
|
mtSame |
NEW matrix indicating whether individuals (might) have the same mitochondrial haplotype (1), and may thus be matrilineal relatives, or not (0). Row names and column names should match IDs in 'GenoM'. Not all individuals need to be included and order is not important. Please report any issues. For details see the mtDNA vignette. |
CalcLLR |
TRUE/FALSE; calculate log-likelihood ratios for all assigned
parents (genotyped + dummy; parent vs. otherwise related). Time-consuming
in large datasets. Can be done separately with |
quiet |
suppress messages: TRUE/FALSE/"verbose". |
Plot |
display plots from |
StrictGenoCheck |
Automatically exclude any individuals genotyped for <5 the unavoidable default up to version 2.4.1. Otherwise only excluded are (very nearly) monomorphic SNPs, SNPs scored for fewer than 2 individuals, and individuals scored for fewer than 2 SNPs. |
ErrFlavour |
function that takes |
MaxSibIter |
DEPRECATED, use |
MaxMismatch |
DEPRECATED AND IGNORED. Now calculated
automatically using |
FindMaybeRel |
DEPRECATED AND IGNORED, advised to run
|
For each pair of candidate relatives, the likelihoods are calculated of them being parent-offspring (PO), full siblings (FS), half siblings (HS), grandparent-grandoffspring (GG), full avuncular (niece/nephew - aunt/uncle; FA), half avuncular/great-grandparental/cousins (HA), or unrelated (U). Assignments are made if the likelihood ratio (LLR) between the focal relationship and the most likely alternative exceed the threshold Tassign.
Dummy parents of sibships are denoted by F0001, F0002, ... (mothers) and M0001, M0002, ... (fathers), are appended to the bottom of the pedigree, and may have been assigned real or dummy parents themselves (i.e. sibship-grandparents). A dummy parent is not assigned to singletons.
Full explanation of the various options and interpretation of the output is provided in the vignettes and on the package website, https://jiscah.github.io/index.html .
A list with some or all of the following components, depending on
Module
. All input except GenoM
is included in the output.
AgePriors |
Matrix with age-difference based probability ratios for
each relationship, used for full pedigree reconstruction; see
|
args.AP |
(input) arguments used to specify age prior matrix. If a
custom ageprior was provided via |
DummyIDs |
Dataframe with pedigree for dummy individuals, as well as
their sex, estimated birth year (point estimate, upper and lower bound of
95% confidence interval; see also |
DupGenotype |
Dataframe, duplicated genotypes (with different IDs, duplicate IDs are not allowed). The specified number of maximum mismatches is used here too. Note that this dataframe may include pairs of closely related individuals, and monozygotic twins. |
DupLifeHistID |
Dataframe, row numbers of duplicated IDs in life history dataframe. For convenience only, but may signal a problem. The first entry is used. |
ErrM |
(input) Error matrix; probability of observed genotype (columns) conditional on actual genotype (rows) |
ExcludedInd |
Individuals in GenoM which were excluded because of a too low genotyping success rate (<50%). |
ExcludedSNPs |
Column numbers of SNPs in GenoM which were excluded because of a too low genotyping success rate (<10%). |
LifeHist |
(input) Dataframe with sex and birth year data. All missing birth years are coded as '-999', all missing sex as '3'. |
LifeHistPar |
LifeHist with additional columns 'Sexx' (inferred Sex when assigned as part of parent-pair), 'BY.est' (mode of birth year probability distribution), 'BY.lo' (lower limit of 95% highest density region), 'BY.hi' (higher limit), inferred after parentage assignment. 'BY.est' is NA when the probability distribution is flat between 'BY.lo' and 'BY.hi'. |
LifeHistSib |
as LifeHistPar, but estimated after full pedigree reconstruction |
NoLH |
Vector, IDs in genotype data for which no life history data is provided. |
Pedigree |
Dataframe with assigned genotyped and dummy parents from Sibship step; entries for dummy individuals are added at the bottom. |
PedigreePar |
Dataframe with assigned parents from Parentage step. |
Specs |
Named vector with parameter values. |
TotLikParents |
Numeric vector, Total likelihood of the genotype data at initiation and after each iteration during Parentage. |
TotLikSib |
Numeric vector, Total likelihood of the genotype data at initiation and after each iteration during Sibship clustering. |
AgePriorExtra |
As AgePriors, but including columns for grandparents and avuncular pairs. NOT updated after parentage assignment, but returned as used during the run. |
DummyClones |
Hermaphrodites only: female-male dummy ID pairs that refer to the same non-genotyped individual |
List elements PedigreePar and Pedigree both have the following columns:
id |
Individual ID |
dam |
Assigned mother, or NA |
sire |
Assigned father, or NA |
LLRdam |
Log10-Likelihood Ratio (LLR) of this female being the mother,
versus the next most likely relationship between the focal individual and
this female. See Details below for relationships considered, and see
|
LLRsire |
idem, for male parent |
LLRpair |
LLR for the parental pair, versus the next most likely configuration between the three individuals (with one or neither parent assigned) |
OHdam |
Number of loci at which the offspring and mother are opposite homozygotes |
OHsire |
idem, for father |
MEpair |
Number of Mendelian errors between the offspring and the parent pair, includes OH as well as e.g. parents being opposing homozygotes, but the offspring not being a heterozygote. The offspring being OH with both parents is counted as 2 errors. |
The genotyping error rate Err
can be specified three different ways:
A single number, which is combined with ErrFlavour
by
ErrToM
to create a length 3 vector (next item). By
default (ErrFlavour
= 'version2.9'), P(hom|hom)=$(E/2)^2$,
P(het|hom)=$E-(E/2)^2$, P(hom|het)=$E/2$.
a length 3 vector (NEW from version 2.6), with the probabilities to observe a actual homozygote as the other homozygote (hom|hom), to observe a homozygote as heterozygote (het|hom), and to observe an actual heterozygote as homozygote (hom|het). This assumes that the two alleles are equivalent with respect to genotyping errors, i.e. $P(AA|aa) = P(aa|AA)$, $P(aa|Aa)=P(AA|Aa)$, and $P(aA|aa)=P(aA|AA)$.
a 3x3 matrix, with the probabilities of observed genotype (columns)
conditional on actual genotype (rows). Only needed when the assumption
in the previous item does not hold. See ErrToM
for details.
Possibly Err
is much lower than the actual genotyping error rate.
Alternatively, a true parent will not be assigned when it is:
unclear who is the parent and who the offspring, due to unknown birth year for one or both individuals
unclear whether the parent is the father or mother
unclear if it is a parent or e.g. full sibling or grandparent, due to insufficient genetic data
And true half-siblings will not be clustered when it is:
unclear if they are maternal or paternal half-siblings
unclear if they are half-siblings, full avuncular, or grand-parental
unclear what type of relatives they are due to insufficient genetic data
All pairs of non-assigned but likely/definitely relatives can be found with
GetMaybeRel
. For a method to do pairwise 'assignments', see
https://jiscah.github.io/articles/pairLL_classification.html ; for further
information, see the vignette.
While every effort has been made to ensure that sequoia provides what it claims to do, there is absolutely no guarantee that the results provided are correct. Use of sequoia is entirely at your own risk.
https://jiscah.github.io/
Jisca Huisman, [email protected]
Huisman, J. (2017) Pedigree reconstruction from SNP data: Parentage assignment, sibship clustering, and beyond. Molecular Ecology Resources 17:1009–1024.
GenoConvert
to read in various data formats,
CheckGeno
, SnpStats
to calculate
missingness and allele frequencies,
SimGeno
to simulate SNP data from a pedigree,
MakeAgePrior
to estimate effect of age on relationships,
GetMaybeRel
to find pairs of potential relatives,
SummarySeq
and PlotAgePrior
to visualise
results,
GetRelM
to turn a pedigree into pairwise relationships,
CalcOHLLR
to calculate Mendelian errors and LLR for any
pedigree,
CalcPairLL
for likelihoods of various relationships
between specific pairs,
CalcBYprobs
to estimate birth years,
PedCompare
and ComparePairs
to compare to
two pedigrees,
EstConf
to estimate assignment errors,
writeSeq
to save results,
vignette("sequoia")
for detailed manual & FAQ.
# === EXAMPLE 1: simulated data === head(SimGeno_example[,1:10]) head(LH_HSg5) # parentage assignment: SeqOUT <- sequoia(GenoM = SimGeno_example, Err = 0.005, LifeHistData = LH_HSg5, Module="par", Plot=TRUE) names(SeqOUT) SeqOUT$PedigreePar[34:42, ] # compare to true (or old) pedigree: PC <- PedCompare(Ped_HSg5, SeqOUT$PedigreePar) PC$Counts["GG",,] # parentage assignment + full pedigree reconstruction: # (note: this can be rather time consuming) SeqOUT2 <- sequoia(GenoM = SimGeno_example, Err = 0.005, LifeHistData = LH_HSg5, Module="ped", quiet="verbose") SeqOUT2$Pedigree[34:42, ] PC2 <- PedCompare(Ped_HSg5, SeqOUT2$Pedigree) PC2$Counts["GT",,] PC2$Counts[,,"dam"] # different kind of pedigree comparison: ComparePairs(Ped1=Ped_HSg5, Ped2=SeqOUT$PedigreePar, patmat=TRUE) # results overview: SummarySeq(SeqOUT2) # important to run with approx. correct genotyping error rate: SeqOUT2.b <- sequoia(GenoM = SimGeno_example, # Err = 1e-4 by default LifeHistData = LH_HSg5, Module="ped", Plot=FALSE) PC2.b <- PedCompare(Ped_HSg5, SeqOUT2.b$Pedigree) PC2.b$Counts["GT",,] ## Not run: # === EXAMPLE 2: real data === # ideally, select 400-700 SNPs: high MAF & low LD # save in 0/1/2/NA format (PLINK's --recodeA) GenoM <- GenoConvert(InFile = "inputfile_for_sequoia.raw", InFormat = "raw") # can also do Colony format SNPSTATS <- SnpStats(GenoM) # perhaps after some data-cleaning: write.table(GenoM, file="MyGenoData.txt", row.names=T, col.names=F) # later: GenoM <- as.matrix(read.table("MyGenoData.txt", row.names=1, header=F)) LHdata <- read.table("LifeHistoryData.txt", header=T) # ID-Sex-birthyear SeqOUT <- sequoia(GenoM, LHdata, Err=0.005) SummarySeq(SeqOUT) SeqOUT$notes <- "Trial run on cleaned data" # add notes for future reference saveRDS(SeqOUT, file="sequoia_output_42.RDS") # save to R-specific file writeSeq(SeqOUT, folder="sequoia_output") # save to several plain text files # runtime: SeqOUT$Specs$TimeEnd - SeqOUT$Specs$TimeStart ## End(Not run)
# === EXAMPLE 1: simulated data === head(SimGeno_example[,1:10]) head(LH_HSg5) # parentage assignment: SeqOUT <- sequoia(GenoM = SimGeno_example, Err = 0.005, LifeHistData = LH_HSg5, Module="par", Plot=TRUE) names(SeqOUT) SeqOUT$PedigreePar[34:42, ] # compare to true (or old) pedigree: PC <- PedCompare(Ped_HSg5, SeqOUT$PedigreePar) PC$Counts["GG",,] # parentage assignment + full pedigree reconstruction: # (note: this can be rather time consuming) SeqOUT2 <- sequoia(GenoM = SimGeno_example, Err = 0.005, LifeHistData = LH_HSg5, Module="ped", quiet="verbose") SeqOUT2$Pedigree[34:42, ] PC2 <- PedCompare(Ped_HSg5, SeqOUT2$Pedigree) PC2$Counts["GT",,] PC2$Counts[,,"dam"] # different kind of pedigree comparison: ComparePairs(Ped1=Ped_HSg5, Ped2=SeqOUT$PedigreePar, patmat=TRUE) # results overview: SummarySeq(SeqOUT2) # important to run with approx. correct genotyping error rate: SeqOUT2.b <- sequoia(GenoM = SimGeno_example, # Err = 1e-4 by default LifeHistData = LH_HSg5, Module="ped", Plot=FALSE) PC2.b <- PedCompare(Ped_HSg5, SeqOUT2.b$Pedigree) PC2.b$Counts["GT",,] ## Not run: # === EXAMPLE 2: real data === # ideally, select 400-700 SNPs: high MAF & low LD # save in 0/1/2/NA format (PLINK's --recodeA) GenoM <- GenoConvert(InFile = "inputfile_for_sequoia.raw", InFormat = "raw") # can also do Colony format SNPSTATS <- SnpStats(GenoM) # perhaps after some data-cleaning: write.table(GenoM, file="MyGenoData.txt", row.names=T, col.names=F) # later: GenoM <- as.matrix(read.table("MyGenoData.txt", row.names=1, header=F)) LHdata <- read.table("LifeHistoryData.txt", header=T) # ID-Sex-birthyear SeqOUT <- sequoia(GenoM, LHdata, Err=0.005) SummarySeq(SeqOUT) SeqOUT$notes <- "Trial run on cleaned data" # add notes for future reference saveRDS(SeqOUT, file="sequoia_output_42.RDS") # save to R-specific file writeSeq(SeqOUT, folder="sequoia_output") # save to several plain text files # runtime: SeqOUT$Specs$TimeEnd - SeqOUT$Specs$TimeStart ## End(Not run)
Simulate SNP genotype data from a pedigree, with optional missingness, genotyping errors, and non-genotyped parents.
SimGeno( Pedigree, nSnp = 400, ParMis = c(0, 0), MAF = 0.3, CallRate = 0.99, SnpError = 5e-04, ErrorFV = function(E) c((E/2)^2, E - (E/2)^2, E/2), ErrorFM = NULL, ReturnStats = FALSE, quiet = FALSE )
SimGeno( Pedigree, nSnp = 400, ParMis = c(0, 0), MAF = 0.3, CallRate = 0.99, SnpError = 5e-04, ErrorFV = function(E) c((E/2)^2, E - (E/2)^2, E/2), ErrorFM = NULL, ReturnStats = FALSE, quiet = FALSE )
Pedigree |
dataframe, pedigree with the first three columns being id - dam - sire, additional columns are ignored. |
nSnp |
number of SNPs to simulate. |
ParMis |
single number or vector length two with proportion of parents with fully missing genotype. Ignored if CallRate is a named vector. NOTE: default changed from 0.4 (up to version 2.8.5) to 0 (from version 2.9). |
MAF |
either a single number with minimum minor allele frequency, and allele frequencies will be sampled uniformly between this minimum and 0.5, OR a vector with minor allele frequency at each locus. In both cases, this is the MAF among pedigree founders, the MAF in the sample will deviate due to drift. |
CallRate |
either a single number for the mean call rate (genotyping success), OR a vector with the call rate at each SNP, OR a named vector with the call rate for each individual. In the third case, ParMis is ignored, and individuals in the pedigree (as id or as parent) not included in this vector are presumed non-genotyped. |
SnpError |
either a single value which will be combined with
|
ErrorFV |
function taking the error rate (scalar) as argument and returning a length 3 vector with hom->other hom, hom->het, het->hom. May be an 'ErrFlavour', e.g. 'version2.9'. |
ErrorFM |
function taking the error rate (scalar) as argument and
returning a 3x3 matrix with probabilities that actual genotype i (rows) is
observed as genotype j (columns). See below for details. To use, set
|
ReturnStats |
in addition to the genotype matrix, return the input parameters and mean & quantiles of MAF, error rate and call rates. |
quiet |
suppress messages. |
For founders, i.e. individuals with no known parents, genotypes are drawn according to the provided MAF and assuming Hardy-Weinberg equilibrium. Offspring genotypes are generated following Mendelian inheritance, assuming all loci are completely independent. Individuals with one known parent are allowed: at each locus, one allele is inherited from the known parent, and the other drawn from the genepool according to the provided MAF.
If ReturnStats=FALSE
(the default), a matrix with genotype
data in sequoia's input format, encoded as 0/1/2/-9.
If ReturnStats=TRUE
, a named list with three elements: list
'ParamsIN', matrix 'SGeno', and list 'StatsOUT':
AF |
Frequency in 'observed' genotypes of '1' allele |
AF.act |
Allele frequency in 'actual' (without genotyping errors & missingness) |
SnpError |
Error rate per SNP (actual /= observed AND observed /= missing) |
SnpCallRate |
Non-missing per SNP |
IndivError |
Error rate per individual |
IndivCallRate |
Non-missing per individual |
If SnpError
is a length 3 vector, genotyping errors are generated
following a length 3 vector with probabilities that 1) an actual homozygote
is observed as the other homozygote, 2) an actual homozygote is observed as
a heterozygote, and 3) an heterozygote is observed as an homozygote. The
only assumption made is that the two alleles can be treated equally, i.e.
observing actual allele $A$ as $a$ is as likely as observing actual $a$ as
$A$.
If SnpError
is a single value, by default this is interpreted as a
locus-level error rate (rather than allele-level), and equals the
probability that a homozygote is observed as heterozygote, and the
probability that a heterozygote is observed as either homozygote (i.e., the
probability that it is observed as AA = probability that observed as aa =
SnpError
/2). The probability that one homozygote is observed as the
other is (SnpError
/2. How this single value is rendered
into a 3x3 error matrix is fully flexible and specified via
ErrorFM
;
see link{ErrToM}
for details.
The default values of SnpError=5e-4
and ErrorFM='version2.9'
correspond to the length 3 vector c((5e-4/2)^2, 5e-4*(1-5e-4/2),
5e-4/2)
.
A beta-distribution is used to simulate variation in the error rate between
SNPs, the shape parameter of this distribution can be specified via
MkGenoErrors
. It is also possible to specify the error rate
per SNP.
Variation in call rates across SNPs is assumed to follow a highly skewed
(beta) distribution, with many SNPs having call rates close to 1, and a
narrowing tail of lower call rates. The first shape parameter defaults to 1
(but see MkGenoErrors
), and the second shape parameter is
defined via the mean as CallRate
. For 99.9% of SNPs to have a call
rate of 0.8 (0.9; 0.95) or higher, use a mean call rate of 0.969 (0.985;
0.993).
Variation in call rate between samples can be specified by providing a
named vector to CallRate
. Otherwise, variation in call rate and
error rate between samples occurs only as side-effect of the random nature
of which individuals are hit by per-SNP errors and drop-outs. Finer control
is possible by first generating an error-free genotype matrix, and then
calling MkGenoErrors
directly on (subsets of) the matrix.
This simulation is highly simplistic and assumes that all SNPs segregate completely independently, that the SNPs are in Hardy-Weinberg equilibrium in the pedigree founders. It assumes that genotyping errors are not due to heritable mutations of the SNPs, and that missingness is random and not e.g. due to heritable mutations of SNP flanking regions. Results based on this simulated data will provide an minimum estimate of the number of SNPs required, and an optimistic estimate of pedigree reconstruction performance.
Jisca Huisman, [email protected]
The wrapper EstConf
for repeated simulation and
pedigree reconstruction; MkGenoErrors
for fine control over
the distribution of genotyping errors in simulated data;
ErrToM
for more information about genotyping error patterns.
Geno_A <- SimGeno(Pedigree = Ped_griffin, nSnp=200, ParMis=c(0.1, 0.6), MAF = 0.25, SnpError = 0.001) Geno_B <- SimGeno(Pedigree = Ped_HSg5, nSnp = 100, ParMis = 0.2, SnpError = c(0.01, 0.04, 0.1)) Geno_C <- SimGeno(Pedigree = Ped_griffin, nSnp=200, ParMis=0, CallRate=0.6, SnpError = 0.05, ErrorFV=function(E) c(E/10, E/10, E)) # genotype matrix with duplicated samples: Dups_grif <- data.frame(ID1 = c('i006_2001_M', 'i021_2002_M', 'i064_2004_F')) Dups_grif$ID2 <- paste0(Dups_grif$ID1, '_2') Err <- c(0.01, 0.04, 0.1) Geno_act <- SimGeno(Ped_griffin, nSnp=500, ParMis=0, CallRate=1, SnpError=0) Geno_sim <- MkGenoErrors(Geno_act, SnpError=Err, CallRate=0.99) Geno_dups <- MkGenoErrors(Geno_act[Dups_grif$ID1, ], SnpError=Err, CallRate=0.99) rownames(Geno_dups) <- Dups_grif$ID2 Geno_sim <- rbind(Geno_sim, Geno_dups)
Geno_A <- SimGeno(Pedigree = Ped_griffin, nSnp=200, ParMis=c(0.1, 0.6), MAF = 0.25, SnpError = 0.001) Geno_B <- SimGeno(Pedigree = Ped_HSg5, nSnp = 100, ParMis = 0.2, SnpError = c(0.01, 0.04, 0.1)) Geno_C <- SimGeno(Pedigree = Ped_griffin, nSnp=200, ParMis=0, CallRate=0.6, SnpError = 0.05, ErrorFV=function(E) c(E/10, E/10, E)) # genotype matrix with duplicated samples: Dups_grif <- data.frame(ID1 = c('i006_2001_M', 'i021_2002_M', 'i064_2004_F')) Dups_grif$ID2 <- paste0(Dups_grif$ID1, '_2') Err <- c(0.01, 0.04, 0.1) Geno_act <- SimGeno(Ped_griffin, nSnp=500, ParMis=0, CallRate=1, SnpError=0) Geno_sim <- MkGenoErrors(Geno_act, SnpError=Err, CallRate=0.99) Geno_dups <- MkGenoErrors(Geno_act[Dups_grif$ID1, ], SnpError=Err, CallRate=0.99) rownames(Geno_dups) <- Dups_grif$ID2 Geno_sim <- rbind(Geno_sim, Geno_dups)
Simulated genotype data for cohorts 1+2 in Pedigree
Ped_HSg5
data(SimGeno_example)
data(SimGeno_example)
A genotype matrix with 214 rows (ids) and 200 columns (SNPs). Each SNP is coded as 0/1/2 copies of the reference allele, with -9 for missing values. Ids are stored as rownames.
Jisca Huisman, [email protected]
Estimate allele frequency (AF), missingness and Mendelian errors per SNP.
SnpStats( GenoM, Pedigree = NULL, Duplicates = NULL, Plot = TRUE, quiet = TRUE, ErrFlavour )
SnpStats( GenoM, Pedigree = NULL, Duplicates = NULL, Plot = TRUE, quiet = TRUE, ErrFlavour )
GenoM |
genotype matrix, in sequoia's format: 1 column per SNP, 1 row per individual, genotypes coded as 0/1/2/-9, and row names giving individual IDs. |
Pedigree |
dataframe with 3 columns: ID - parent1 - parent2. Additional columns and non-genotyped individuals are ignored. Used to count Mendelian errors per SNP and (poorly) estimate the error rate. |
Duplicates |
dataframe with pairs of duplicated samples |
Plot |
show histograms of the results? |
quiet |
suppress messages |
ErrFlavour |
DEPRECATED AND IGNORED. Was used to estimate |
Calculation of these summary statistics can be done in PLINK, and SNPs with low minor allele frequency or high missingness should be filtered out prior to pedigree reconstruction. This function is provided as an aid to inspect the relationship between AF, missingness and genotyping error to find a suitable combination of SNP filtering thresholds to use.
For pedigree reconstruction, SNPs with zero or one copies of the alternate
allele in the dataset (MAF ) are considered fixed, and
excluded.
A matrix with a number of rows equal to the number of SNPs (=number of columns of GenoM), and when no Pedigree is provided 2 columns:
AF |
Allele frequency of the 'second allele' (the one for which the homozygote is coded 2) |
Mis |
Proportion of missing calls |
HWE.p |
p-value from chi-square test for Hardy-Weinberg equilibrium |
When a Pedigree is provided, there are 8 additional columns:
n.dam , n.sire , n.pair
|
Number of dams, sires, parent-pairs successfully genotyped for the SNP |
OHdam , OHsire
|
Count of number of opposing homozygous cases |
MEpair |
Count of Mendelian errors, includes opposing homozygous cases when only one parent is genotyped |
n.dups , n.diff
|
Number of duplicate pairs successfully genotyped for the SNP; number of differences. The latter does not count cases where one duplicate is not successfully genotyped at the SNP |
GenoConvert
to convert from various data formats;
CheckGeno
to check the data is in valid format for sequoia
and exclude monomorphic SNPs etc., CalcOHLLR
to calculate OH
& ME per individual.
Genotypes <- SimGeno(Ped_HSg5, nSnp=100, CallRate = runif(100, 0.5, 0.8), SnpError = 0.05) SnpStats(Genotypes) # only plots; data is returned invisibly SNPstats <- SnpStats(Genotypes, Pedigree=Ped_HSg5)
Genotypes <- SimGeno(Ped_HSg5, nSnp=100, CallRate = runif(100, 0.5, 0.8), SnpError = 0.05) SnpStats(Genotypes) # only plots; data is returned invisibly SNPstats <- SnpStats(Genotypes, Pedigree=Ped_HSg5)
Number of assigned parents and grandparents and sibship sizes, split by genotyped, dummy, and 'observed'.
SummarySeq( SeqList = NULL, Pedigree = NULL, DumPrefix = c("F0", "M0"), SNPd = NULL, Plot = TRUE, Panels = "all" )
SummarySeq( SeqList = NULL, Pedigree = NULL, DumPrefix = c("F0", "M0"), SNPd = NULL, Plot = TRUE, Panels = "all" )
SeqList |
the list returned by |
Pedigree |
dataframe, pedigree with the first three columns being id - dam - sire. Column names are ignored, as are additional columns, except for columns OHdam, OHsire, MEpair, LLRdam, LLRsire, LLRpair (plotting only). |
DumPrefix |
character vector of length 2 with prefixes for dummy dams
(mothers) and sires (fathers). Will be read from |
SNPd |
character vector with ids of SNP genotyped individuals. Only used
when |
Plot |
show barplots and histograms of the results, as well as of the parental LLRs, Mendelian errors, and agepriors, if present. |
Panels |
character vector with panel(s) to plot. Choose from 'all', 'G.parents' (parents of genotyped individuals), 'D.parents' (parents of dummy individuals), 'sibships' (distribution of sibship sizes), 'LLR' (log10-likelihood ratio parent/otherwise related), 'OH' (count of opposite homozygote SNPs). |
A list with the following elements:
PedSummary |
a 2-column matrix with basic summary statistics, similar
to what used to be returned by Pedantics' |
ParentCount |
an array with the number of assigned parents, split by:
|
GPCount |
an array with the number of assigned grandparents, split by:
|
SibSize |
a list with elements 'mat' (maternal half + full siblings), 'pat' (paternal half + full siblings), and 'full' (full siblings). Each is a matrix with a number of rows equal to the maximum sibship size, and 3 columns, splitting by the type of parent: Genotyped, Dummy, or Observed. |
PlotSeqSum
to plot the output of this function;
sequoia
for pedigree reconstruction and links to other
functions.
SummarySeq(Ped_griffin) sumry_grif <- SummarySeq(SeqOUT_griffin, Panels=c("G.parents", "OH")) sumry_grif$PedSummary
SummarySeq(Ped_griffin) sumry_grif <- SummarySeq(SeqOUT_griffin, Panels=c("G.parents", "OH")) sumry_grif$PedSummary
Write data.frame or matrix to a text file, using white
space padding to keep columns aligned as in print
.
writeColumns(x, file = "", row.names = TRUE, col.names = TRUE)
writeColumns(x, file = "", row.names = TRUE, col.names = TRUE)
x |
the object to be written, preferably a matrix or data frame. If not, it is attempted to coerce x to a matrix. |
file |
a character string naming a file. |
row.names |
a logical value indicating whether the row names of x are to be written along with x. |
col.names |
a logical value indicating whether the column names of x are to be written along with x. |
The various list elements returned by sequoia
are each
written to text files in the specified folder, or to separate sheets in a
single excel file (requires library openxlsx).
writeSeq( SeqList, GenoM = NULL, MaybeRel = NULL, PedComp = NULL, OutFormat = "txt", folder = "Sequoia-OUT", file = "Sequoia-OUT.xlsx", ForVersion = 2, quiet = FALSE )
writeSeq( SeqList, GenoM = NULL, MaybeRel = NULL, PedComp = NULL, OutFormat = "txt", folder = "Sequoia-OUT", file = "Sequoia-OUT.xlsx", ForVersion = 2, quiet = FALSE )
SeqList |
list returned by |
GenoM |
matrix with genetic data (optional). Ignored if OutFormat='xls', as the resulting file could become too large for excel. |
MaybeRel |
list with results from |
PedComp |
list with results from |
OutFormat |
'xls' or 'txt'. |
folder |
the directory where the text files will be written; will be
created if it does not already exists. Relative to the current working
directory, or NULL for current working directory. Ignored if
|
file |
the name of the excel file to write to, ignored if
|
ForVersion |
choose '1' for back-compatibility with stand-alone sequoia versions 1.x |
quiet |
suppress messages. |
The text files can be used as input for the stand-alone Fortran
version of sequoia, e.g. when the genotype data is too large for R. See
vignette('sequoia')
for further details.
writeColumns
to write to a text file, using white
space padding to keep columns aligned.
## Not run: writeSeq(SeqList, OutFormat="xls", file="MyFile.xlsx") # add additional sheet to the excel file: library(openxlsx) wb <- loadWorkbook("MyFile.xlsx") addWorksheet(wb, sheetName = "ExtraData") writeData(wb, sheet = "ExtraData", MyData, rowNames=FALSE) saveWorkbook(wb, "MyFile.xlsx", overwrite=TRUE, returnValue=TRUE) # or: (package requires java & is trickier to install) xlsx::write.xlsx(MyData, file = "MyFile.xlsx", sheetName="ExtraData", col.names=TRUE, row.names=FALSE, append=TRUE, showNA=FALSE) ## End(Not run)
## Not run: writeSeq(SeqList, OutFormat="xls", file="MyFile.xlsx") # add additional sheet to the excel file: library(openxlsx) wb <- loadWorkbook("MyFile.xlsx") addWorksheet(wb, sheetName = "ExtraData") writeData(wb, sheet = "ExtraData", MyData, rowNames=FALSE) saveWorkbook(wb, "MyFile.xlsx", overwrite=TRUE, returnValue=TRUE) # or: (package requires java & is trickier to install) xlsx::write.xlsx(MyData, file = "MyFile.xlsx", sheetName="ExtraData", col.names=TRUE, row.names=FALSE, append=TRUE, showNA=FALSE) ## End(Not run)