It is often worthwhile to compare a newly inferred genetic pedigree to an older (field-)pedigree, as no dataset is flawless, nor is genetic pedigree reconstruction infallible.
Discrepancies between the pedigrees may bring to light mislabeled samples, incorrect birth or death years, or incorrectly inferred pedigree links. For example, if an individual’s genetically inferred mother differs from its field-observed mother, this may be due to a mislabeled DNA sample, a pedigree inference error, an error in the field records, or egg dumping/adoption. Eliminating possibilities and tracking down the most likely scenario can be rather time consuming, but will increase the overall quality of the dataset and pedigree.
In addition, the pedigree comparison may be able to match dummy parents in the genetic pedigree to real, non-genotyped parents in the field pedigree. This allows pedigree records (offspring number, mates, etc.) to be combined with phenotypic records (age, size, etc.) of those non-genotyped individuals.
Here, the process is illustrated by comparing genetically assigned and field observed mothers in a fictional population of griffins.
Nests have been monitored in a small, closed population of griffins,
where each year exactly 20 baby griffins hatch. The mother of most
individuals is known from field observations (example data
FieldMums_griffin
included in the package). From 2001 to
2010, most hatchlings are tagged, and sampled for SNP genotyping. From
this, a genetic pedigree has been reconstructed (example data
SeqOUT_griffin
). Breeding females without tags, are given a
two-colour code 1. This may including females who have lost
their tags.
library(sequoia)
data(SeqOUT_griffin, FieldMums_griffin, package="sequoia")
PCG <- PedCompare(Ped1 = cbind(FieldMums_griffin,
sire = NA),
Ped2 = SeqOUT_griffin$Pedigree,
SNPd = SeqOUT_griffin$PedigreePar$id,
Symmetrical = TRUE, Plot=FALSE)
In essence, PedCompare()
lies the two pedigrees
side-by-side and classifies each individual’s parent as being the same
in the two pedigrees (Match
) or not
(Mismatch
), or only having a parent in only one of the
pedigrees (P1only
or P2only
).
While Match
versus Mismatch
is
straightforward when all individuals are genotyped, and all IDs are
(thus) consistent between the two pedigrees, it becomes more complicated
when non-genotyped individuals and dummy IDs are involved.
PedCompare()
tries to match each sibship-dummy-parent in
Pedigree2 (F0001, F0002, etc.) to a non-genotyped parent in Pedigree1
(here these are the IDs consisting of 2 colours: PinkBlue, BlueRed,
etc.).
$MergedPed
The side-by-side pedigrees are given in the list element
$MergedPed
of the output:
## id dam.1 dam.2 dam.class
## 127 i148_2008_F GreenYellow F0005 Mismatch
## 128 i149_2008_F i126_2007_F i126_2007_F Match
## 129 i151_2008_F i108_2006_F i108_2006_F Match
## 130 i153_2008_F <NA> <NA> _
## 131 i154_2008_F i137_2007_F i137_2007_F Match
## 132 i157_2008_F <NA> <NA> _
## 133 i158_2008_M BlueRed F0001 Match
This part-pedigree for example shows that individuals i165_2009_F and
i166_2009_F have the non-genotyped female ‘PinkBlue’ as field observed
mother (dam.1
= dam in Pedigree1). The pair are also
genetically assigned as maternal siblings, with ‘F0002’ as dummy ID for
the mother (dam.2
). And, while the ID of the mother differs
between the two pedigrees, it is nonetheless considered a
Match
(dam.class
, short for
classification).
A point of clarification on the id
columns:
id
are the ids common to both pedigrees, plus
those only occurring in Pedigree2 (typically the dummy IDs F0001, F0002
etc.)id.r
are the ‘consensus’ ids, plus those only
occurring in Pedigree1For example, ‘PinkBlue’ does not have a mother in Pedigree1 (first
row in MergedPed
below: id.r=PinkBlue
,
dam.1=NA
). Pedigree2 tells us that ‘F0002’ has genetic
mother ‘F0007’, which is not hugely informative of itself
(MergedPed
: id=F0002
,
dam.2=F0007
). But PedCompare()
then tells us
that this means that ‘PinkBlue’ (id.r
) has ‘YellowPink’ as
mother (dam.r
), which has a lot more meaning (to someone
working in the field).
# subset some individuals:
these <- c("i177_2009_M", "i179_2009_M", "i165_2009_F", "i166_2009_F", "F0002",
"F0007", "YellowPink", "PinkBlue")
knitr::kable(list(Ped1 = FieldMums_griffin[FieldMums_griffin$id %in% these, ],
Ped2 = SeqOUT_griffin$Pedigree[SeqOUT_griffin$Pedigree$id %in% these, 1:3]),
caption = "Subsets of Pedigree1 (left) and Pedigree2 (right)")
|
|
## id id.r dam.1 dam.2 dam.r
## 2 F0002 PinkBlue <NA> F0007 YellowPink
## 7 F0007 YellowPink <NA> <NA> <NA>
## 137 i165_2009_F i165_2009_F PinkBlue F0002 PinkBlue
## 138 i166_2009_F i166_2009_F PinkBlue F0002 PinkBlue
## 146 i177_2009_M i177_2009_M YellowPink F0007 YellowPink
## 147 i179_2009_M i179_2009_M YellowPink F0007 YellowPink
$DummyMatch
The output list element $DummyMatch
summarises the
matches:
## id.2 id.1 off.Match off.Mismatch off.P1only off.P2only
## 1 F0001 BlueRed 4 4 0 3
## 2 F0002 PinkBlue 3 0 0 0
## 3 F0003 GreenYellow 3 2 0 0
## 4 F0004 nomatch 0 0 0 2
## 5 F0005 nomatch 0 2 0 0
## 6 F0006 RedOrange 3 0 0 0
For the match between ‘F0002’ in Pedigree2 and ‘PinkBlue’ in
Pedigree1 (bottom row), there are 3 matching offspring, and no offspring
by ‘PinkBlue’ that are genetically assigned to a different sibship
cluster or mother (off.Mismatch
) or have no genetic mother
assigned (off.P1only
), nor are there any members of the
genetic sibship F0002 with no mother in Pedigree1
(off.P2only
). Thus, a perfect one-to-one match.
For genetic sibship ‘F0001’, however, the situation is more complicated, and this will be worked through in detail below.
‘nomatch’ in the id.1
column indicates either that none
of the individuals in this genetic sibship had a field-observed mother,
or that the field-observed mother is matched to a different genetic
sibship, with which there was a larger overlap.
$Counts
The total number of matches and mismatches is summarised in the 3D
array $Counts
, which can be visualised with
PlotPedComp()
:
Since there are no sires in our field pedigree, we will have a look
at the dam
slice only:
## class
## cat Total Match Mismatch P1only P2only
## GG 65 61 4 0 0
## GD 41 16 7 0 18
## GT 107 77 11 1 18
## DG 12 0 0 0 12
## DD 4 0 0 0 4
## DT 16 0 0 0 16
## TT 124 77 11 2 34
The counts for each classification are subdivided into various categories (rows), based on whether the focal individual (first letter) and parent (second letter) are Genotyped or a Dummy individual, as well as the Totals. The totals counts includes individuals in Pedigree1 who are neither genotyped nor ‘dummifiable’, such as a non-genotyped parent with a single offspring, and therefore exceeds the sum of G and D.
We will go through each of the three classes of discrepancies in turn
– Mismatch
, P1only
(only field mum), and
P2only
(only genetic mum).
To get more detail on the 11 mismatches, we head back to
$MergedPed
. For brevity, we display only the columns in
which we’re currently interested:
PCG$MergedPed[which(PCG$MergedPed$dam.class == "Mismatch"), c("id", "dam.1", "dam.2", "id.dam.cat")]
## id dam.1 dam.2 id.dam.cat
## 97 i108_2006_F GreenBlue i081_2005_F GG
## 108 i122_2007_M GreenBlue i081_2005_F GG
## 109 i123_2007_F OrangeGreen F0001 GD
## 110 i124_2007_M GreenBlue i081_2005_F GG
## 114 i130_2007_F OrangeGreen F0001 GD
## 115 i131_2007_F OrangeGreen F0001 GD
## 117 i133_2007_F OrangeGreen F0001 GD
## 122 i142_2008_M GreenBlue i081_2005_F GG
## 126 i147_2008_F YellowBlue F0003 GD
## 127 i148_2008_F GreenYellow F0005 GD
## 134 i159_2008_F YellowBlue F0005 GD
Thus while there are 11 individuals with mismatching mothers, there are only 4 unique mothers involved, which we will look at in turn.
PedM <- PCG$MergedPed[, c("id", "dam.1", "dam.2")] # short-hand to minimise typing
# does the mismatch affect all of GreenBlue's offspring?
PedM[which(PedM$dam.1 == "GreenBlue"), ]
## id dam.1 dam.2
## 97 i108_2006_F GreenBlue i081_2005_F
## 108 i122_2007_M GreenBlue i081_2005_F
## 110 i124_2007_M GreenBlue i081_2005_F
## 122 i142_2008_M GreenBlue i081_2005_F
# > yes, these 4 are all of her known offspring
# does genetic mother i081_2005_F have any field-observed offspring?
PedM[which(PedM$dam.1 == "i081_2005_F"), ]
## [1] id dam.1 dam.2
## <0 rows> (or 0-length row.names)
# no.
# does i081_2005_F have any other genetic offspring?
PedM[which(PedM$dam.2 == "i081_2005_F"), ]
## id dam.1 dam.2
## 97 i108_2006_F GreenBlue i081_2005_F
## 108 i122_2007_M GreenBlue i081_2005_F
## 110 i124_2007_M GreenBlue i081_2005_F
## 122 i142_2008_M GreenBlue i081_2005_F
It seems ‘GreenBlue’ is a perfect match with ‘i081_2005_F’: all four of GreenBlue’s observed offspring are genetically assigned i081_2005_F as mother, and all four of i081_2005_F’s genetic offspring have GreenBlue as observed mother. Perhaps this female lost her tag, was therefore not recognised, and received a new field ID. Field records may be able to back this theory up, or disprove it: Did GreenBlue look about 1 year old when first recorded in 2007? Was i081_2005_F ever seen in or after the 2006 breeding season? Does i081_2005_F have a known death date? If so, was there a post-mortem, or was she presumed dead because she had not seen for several months/years?
An alternative explanation might be a pedigree inference error, but
this is highly unlikely when offspring and parent are both genotyped
(id.dam.cat = 'GG'
), and only plausible with a limited
number of SNPs with low call rate and high genotyping error rate. The
four assignments are independent if they were done during parentage
assignment, but not necessarily so during full pedigree reconstruction:
they may first have been clustered as maternal siblings, and
subsequently the dummy mother may have been replaced by i081_2005_F.
Dummy individuals in Pedigree2 are never matched to genotyped individuals in Pedigree1, even if there is a perfect match such as here – there is almost always something odd going on that requires user inspection.
Resolution: Merge IDs ‘i081_2005_F’ and ‘GreenBlue’
## id dam.1 dam.2
## 109 i123_2007_F OrangeGreen F0001
## 114 i130_2007_F OrangeGreen F0001
## 115 i131_2007_F OrangeGreen F0001
## 117 i133_2007_F OrangeGreen F0001
# all of OrangeGreen's offspring are in sibship F0001
PedM[which(PedM$dam.2 == "F0001"), c("id", "dam.1", "dam.2")]
## id dam.1 dam.2
## 5 F0005 <NA> F0001
## 13 F0013 <NA> F0001
## 23 M0008 <NA> F0001
## 109 i123_2007_F OrangeGreen F0001
## 114 i130_2007_F OrangeGreen F0001
## 115 i131_2007_F OrangeGreen F0001
## 117 i133_2007_F OrangeGreen F0001
## 133 i158_2008_M BlueRed F0001
## 135 i160_2008_F BlueRed F0001
## 140 i170_2009_F BlueRed F0001
## 142 i173_2009_F BlueRed F0001
So, the genetic sibship with ‘F0001’ as dummy mother includes individuals with two different field-observed mothers, ‘OrangeGreen’ and ‘BlueRed’. OrangeGreen’s offspring hatched in 2007, and BlueRed’s in 2008 and 2009. Perhaps this female was not regularly monitored, was not recognised at the start of the 2008 breeding season, and got a new ID?
As for the previous case, field records that could disprove this theory are census records of BlueRed in or before 2007, or of OrangeGreen in or after 2008 (or if they were seen together!). Records backing up this theory are notes like ‘this could be…’ or ‘looks similar to …’ when BlueRed was first described. This cautionary approach to use a new ID is advisable, as it is often much easier to combine the records of two IDs later on, than to try to tease apart records under a single ID into their respective IDs.
An alternative explanation is again a pedigree inference error,
namely that sequoia
has erroneously merged two actual
sibships. This may sporadically happen when the two mothers are closely
related. When the mothers are related by > 0.5 (closer than regular full siblings,
due to inbreeding), such erroneous merging can happen even with large
powerful SNP sets.
Resolution: Probably merge IDs OrangeGreen’ and ‘BlueRed’
## id dam.1 dam.2
## 126 i147_2008_F YellowBlue F0003
## 134 i159_2008_F YellowBlue F0005
# something odd going on involving sibships F0003 & F0005
PedM[which(PedM$dam.2 %in% c("F0003", "F0005")), ]
## id dam.1 dam.2
## 107 i121_2007_M GreenYellow F0003
## 118 i135_2007_F GreenYellow F0003
## 124 i144_2008_M GreenYellow F0003
## 126 i147_2008_F YellowBlue F0003
## 127 i148_2008_F GreenYellow F0005
## 134 i159_2008_F YellowBlue F0005
so:
A likely explanation is that the samples of these individuals were accidentally swapped around in the lab during DNA extraction, i.e. that genotype ‘i147_2008_F’ belongs to field id ‘i148_2008_F’, and v.v.. Lab notes might shed light on this theory, e.g. was the DNA extracted in the same batch? Where the samples adjacent on the same 96-well plate?
An alternative explanation is sloppy handwriting or typing error. Whether in this case genotype ‘i147_2008_F’ belongs to field id ‘i147_2008_FM’ or ‘i148_2008_F’ depends on when this mixup happened, i.e. whether the field data are swapped between the individuals too.
A pedigree inference error seems unlikely in this particular case – individuals may occasionally get assigned to the wrong sibship, due to genotyping errors, a low call rate sample, or an uncommon draw in the Mendelian inheritance lottery, but it seems highly unlikely that this would result in two individuals swapping place.
Resolution: Swap IDs ‘i147_2008_F’ and ‘i148_2008_F’ in genetic data (and/or in field data)
This turned out to be the same as Issue3!
The second class of discrepancies are individuals who have a parent assigned in Pedigree1 (the field-observation-based pedigree), but not in Pedigree2 (the genetically inferred pedigree), i.e. there is no genetic confirmation of the field parent. It is important here to distinguish between:
A range of tools is available to help distinguish between these three alternatives:
CalcOHLLR()
: Calculate Mendelian errors and parental
LLR for the parents assigned in Pedigree1; any genetic incompatibilities
will stand out as outliers (if the SNP panel is powerful enough)getAssignCat()
: get an (anti-conservative) indication
which of the non-genotyped individuals in Pedigree1 could get dummy ID
stand-ins during genetic pedigree reconstruction; non-‘dummyfiable’
parents suggest the third alternative and are no reason for worryGetMaybeRel()
: find all pairs of individuals that are
likely relatives, but not assigned (2nd alternative)CalcPairLL()
: For a batch of pairs, get the likelihoods
of them being PO, FS, HS, …; i.e. get the data underlying the parental
LLR or the LLR in the GetMaybeRel()
output.PCG$MergedPed[which(PCG$MergedPed$dam.class == "P1only"),
c("id", "id.r", "dam.1", "dam.2", "id.dam.cat")]
## id id.r dam.1 dam.2 id.dam.cat
## 62 i057_2003_M i057_2003_M GreenRed <NA> GX
## 170 <NA> i053_2003_M GreenRed <NA> XX
The id.dam.cat = 'XX'
indicates that neither the focal
individual nor their dam in Pedigree1 is genotyped or ‘dummifiable’ (has
at the very minimum one genotyped offspring, see
?getAssignCat
). Thus there is in this case simply no way of
genetically testing whether or not these two are indeed maternal
siblings, as the field data indicates.
Genetic data could become available in the future, due to DNA sampling of i053_2003_M’s offspring, or of himself (e.g. post-mortem). This is one of the reasons why it is recommended to rerun pedigree analysis with all individuals when additional individuals have been genotyped.
Analogous to the Pedigree1-only classification, it is here useful to differentiate between
When the field data indicate that a newly assigned parent is highly implausible, there are various possible explanations, such as:
As example, let’s take the assignment of ‘YellowPink’ (aka F0007) as mother of ‘PinkBlue’ (aka F0002):
## id dam BY.est NumOff O1 O2 O3 O4
## 6 F0006 <NA> 2002 3 i041_2003_F i054_2003_F i076_2004_F <NA>
## 7 F0007 <NA> 2007 4 i141_2008_M i177_2009_M i179_2009_M F0002
Dummy - dummy pedigree links as these tend to have a higher error
rate than genotyped - genotyped pedigree links (see also
EstConf()
), especially when the sibships are small, as all
genetic information is ‘second hand’. They are therefore worth checking
if independent data is available.
The question here is whether it is plausible that YellowPink had a non-genotyped daughter in 2008 that survived to breeding age. She had one genotyped offspring in 2008 (‘i141_2008_M’), so the nest was monitored that year. Field and lab records will indicate whether there were any additional hatchlings that were not sampled, or not successfully genotyped (+ tag loss of PinkBlue). If YellowPink had only one female, non-genotyped, potentially-surviving offspring that year, then that hatchling’s data can be combined with PinkBlue’s breeding records. If she definitely had no offspring that are unaccounted for, the pedigree is most likely incorrect and the parent-offspring link would best be removed.
The dummy number corresponding to each non-genotyped individual is not consistent between runs. It simply reflects the order in which sibships were found, which may change when individuals are added or removed from the genotype data, or even when the order of individuals or the presumed genotyping error rate is changed 2. Therefore, a script like this is NOT robust:
PedX <- SeqOUT_griffin$Pedigree
# NOT LIKE THIS:
PedX$id[ PedX$id == 'F0001'] <- 'OrangeGreen'
PedX$dam[ PedX$dam == 'F0001'] <- 'OrangeGreen'
PedX$id[ PedX$id == 'F0002'] <- 'PinkBlue'
PedX$dam[ PedX$dam == 'F0002'] <- 'PinkBlue'
Instead, match by an individual within the sibship (the dummy’s offspring), which should be consistent between runs:
PedX <- SeqOUT_griffin$Pedigree
# INSTEAD, SOMETHING LIKE THIS:
Name_match <- matrix(c(PedX$dam[PedX$id=='i123_2007_F'], 'OrangeGreen',
PedX$dam[PedX$id=='i165_2009_F'], 'PinkBlue',
PedX$dam[PedX$id=='i121_2007_M'], 'GreenYellow'),
ncol = 2, byrow=TRUE)
Name_match
## [,1] [,2]
## [1,] "F0001" "OrangeGreen"
## [2,] "F0002" "PinkBlue"
## [3,] "F0003" "GreenYellow"
for (i in 1:nrow(Name_match)) {
PedX$id[ PedX$id == Name_match[i,1] ] <- Name_match[i,2]
PedX$dam[ PedX$dam == Name_match[i,1] ] <- Name_match[i,2]
}
The offspring IDs can for example be found in
SeqOUT_griffin$DummyIDs
.
## id dam sire NumOff O1 O2 O3
## 1 F0001 i095_2005_F i046_2003_M 11 i123_2007_F i130_2007_F i131_2007_F
## 2 F0002 F0007 i109_2006_M 3 i165_2009_F i166_2009_F i175_2009_M
## 3 F0003 <NA> <NA> 4 i121_2007_M i135_2007_F i144_2008_M
## 4 F0004 i157_2008_F i128_2007_M 2 i182_2010_F i188_2010_M <NA>
## 5 F0005 F0001 <NA> 2 i148_2008_F i159_2008_F <NA>
## 6 F0006 <NA> <NA> 3 i041_2003_F i054_2003_F i076_2004_F