---
title: "PedCompare Example (Griffin Pedigree)"
author: "Jisca Huisman ( jisca.huisman @ gmail.com )"
date: "12 Dec 2022"
output:
  bookdown::html_document2:
    base_format: rmarkdown::html_vignette
    toc: true
    toc_depth: 2
linkcolor: blue
vignette: >
  %\VignetteIndexEntry{PedCompare Example (Griffin Pedigree)}
  %\VignetteEncoding{UTF-8}
  %\VignetteEngine{knitr::rmarkdown}
---


# Example: PedCompare

```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE, eval=TRUE,
                      fig.height=4, fig.width=6, fig.pos="ht!") 
library(sequoia)
```


## Why compare pedigrees

It is often worthwhile to compare a newly inferred genetic pedigree to an older (field-)pedigree, as no dataset is flawless, nor is genetic pedigree reconstruction infallible. 

Discrepancies between the pedigrees may bring to light mislabeled samples, incorrect birth or death years, or incorrectly inferred pedigree links. For example, if an individual's genetically inferred mother differs from its field-observed mother, this may be due to a mislabeled DNA sample, a pedigree inference error, an error in the field records, or egg dumping/adoption. Eliminating possibilities and tracking down the most likely scenario can be rather time consuming, but will increase the overall quality of the dataset and pedigree. 

In addition, the pedigree comparison may be able to match dummy parents in the genetic pedigree to real, non-genotyped parents in the field pedigree. This allows pedigree records (offspring number, mates, etc.) to be combined with phenotypic records (age, size, etc.) of those non-genotyped individuals. 

Here, the process is illustrated by comparing genetically assigned and field observed mothers in a fictional population of griffins. 



## Study population
Nests have been monitored in a small, closed population of griffins, where each year exactly 20 baby griffins hatch. The mother of most individuals is known from field observations (example data `FieldMums_griffin` included in the package). From 2001 to 2010, most hatchlings are tagged, and sampled for SNP genotyping. From this, a genetic pedigree has been reconstructed (example data `SeqOUT_griffin`). Breeding females without tags, are given a two-colour code [^x]. This may including females who have lost their tags.

[^x]: with animal-friendly spray paint



## Comparing pedigrees

```{r RunPedComp}
library(sequoia)
data(SeqOUT_griffin, FieldMums_griffin, package="sequoia")

PCG  <- PedCompare(Ped1 = cbind(FieldMums_griffin,
                                sire = NA),
                   Ped2 = SeqOUT_griffin$Pedigree,
                   SNPd = SeqOUT_griffin$PedigreePar$id,
                   Symmetrical = TRUE, Plot=FALSE)
```

In essence, `PedCompare()` lies the two pedigrees side-by-side and classifies each individual's parent as being the same in the two pedigrees (`Match`) or not (`Mismatch`), or only having a parent in only one of the pedigrees (`P1only` or `P2only`). 

While `Match` versus `Mismatch` is straightforward when all individuals are genotyped, and all IDs are (thus) consistent between the two pedigrees, it becomes more complicated when non-genotyped individuals and dummy IDs are involved. `PedCompare()` tries to match each sibship-dummy-parent in Pedigree2 (F0001, F0002, etc.) to a non-genotyped parent in Pedigree1 (here these are the IDs consisting of 2 colours: PinkBlue, BlueRed, etc.). 

### `$MergedPed`
The side-by-side pedigrees are given in the list element `$MergedPed` of the output:

```{r MergedPed}
PCG$MergedPed[c(127:133), c("id", "dam.1", "dam.2", "dam.class")]
```

This part-pedigree for example shows that individuals i165_2009_F and i166_2009_F have the non-genotyped female 'PinkBlue' as field observed mother (`dam.1` = dam in Pedigree1). The pair are also genetically assigned as maternal siblings, with 'F0002' as dummy ID for the mother (`dam.2`). And, while the ID of the mother differs between the two pedigrees, it is nonetheless considered a `Match` (`dam.class`, short for classification). 


#### id columns
A point of clarification on the `id` columns:  

* column `id` are the ids common to both pedigrees, plus those only occurring in Pedigree2 (typically the dummy IDs F0001, F0002 etc.)  
* column `id.r` are the 'consensus' ids, plus those only occurring in Pedigree1  

For example, 'PinkBlue' does not have a mother in Pedigree1 (first row in `MergedPed` below: `id.r=PinkBlue`, `dam.1=NA`). Pedigree2 tells us that 'F0002' has genetic mother 'F0007', which is not hugely informative of itself (`MergedPed`: `id=F0002`, `dam.2=F0007`). But `PedCompare()` then tells us that this means that 'PinkBlue' (`id.r`) has 'YellowPink' as mother (`dam.r`), which has a lot more meaning (to someone working in the field). 


```{r id-idr-sep}
# subset some individuals:
these <- c("i177_2009_M", "i179_2009_M", "i165_2009_F", "i166_2009_F", "F0002",
           "F0007", "YellowPink", "PinkBlue")
knitr::kable(list(Ped1 = FieldMums_griffin[FieldMums_griffin$id %in% these, ], 
                  Ped2 = SeqOUT_griffin$Pedigree[SeqOUT_griffin$Pedigree$id %in% these, 1:3]),
             caption = "Subsets of Pedigree1 (left) and Pedigree2 (right)")
```

```{r id-idr}
PCG$MergedPed[PCG$MergedPed$id %in% these, 
              c("id", "id.r", "dam.1", "dam.2", "dam.r")]
```



### `$DummyMatch`
The output list element `$DummyMatch` summarises the matches:

```{r DumMatch}
head(PCG$DummyMatch[, -c(3:5)], n=6)
```

For the match between 'F0002' in Pedigree2 and 'PinkBlue' in Pedigree1 (bottom row), there are 3 matching offspring, and no offspring by 'PinkBlue' that are genetically assigned to a different sibship cluster or mother (`off.Mismatch`) or have no genetic mother assigned (`off.P1only`), nor are there any members of the genetic sibship F0002 with no mother in Pedigree1 (`off.P2only`). Thus, a perfect one-to-one match. 

For genetic sibship 'F0001', however, the situation is more complicated, and this will be worked through in detail below. 

'nomatch' in the `id.1` column indicates either that none of the individuals in this genetic sibship had a field-observed mother, or that the field-observed mother is matched to a different genetic sibship, with which there was a larger overlap. 



### `$Counts`
The total number of matches and mismatches is summarised in the 3D array `$Counts`, which can be visualised with `PlotPedComp()`:

```{r Counts-plot}
PlotPedComp(PCG$Counts)
```

Since there are no sires in our field pedigree, we will have a look at the `dam` slice only:

```{r Counts}
PCG$Counts[,,"dam"]
```

The counts for each classification are subdivided into various categories (rows), based on whether the focal individual (first letter) and parent (second letter) are **G**enotyped or a **D**ummy individual, as well as the **T**otals. The totals counts includes individuals in Pedigree1 who are neither genotyped nor 'dummifiable', such as a non-genotyped parent with a single offspring, and therefore exceeds the sum of **G** and **D**.  

We will go through each of the three classes of discrepancies in turn -- `Mismatch`, `P1only` (only field mum), and `P2only` (only genetic mum). 


## Mismatch
To get more detail on the 11 mismatches, we head back to `$MergedPed`. For brevity, we display only the columns in which we're currently interested:
```{r Mismatch1}
PCG$MergedPed[which(PCG$MergedPed$dam.class == "Mismatch"), c("id", "dam.1", "dam.2", "id.dam.cat")]
```

Thus while there are 11 individuals with mismatching mothers, there are only 4 unique mothers involved, which we will look at in turn.  


### Mismatch Issue1: GreenBlue

```{r MisMatch1}
PedM <- PCG$MergedPed[, c("id", "dam.1", "dam.2")]   # short-hand to minimise typing

# does the mismatch affect all of GreenBlue's offspring?
PedM[which(PedM$dam.1 == "GreenBlue"), ]
# > yes, these 4 are all of her known offspring

# does genetic mother i081_2005_F have any field-observed offspring?
PedM[which(PedM$dam.1 == "i081_2005_F"), ]
# no.

# does i081_2005_F have any other genetic offspring?
PedM[which(PedM$dam.2 == "i081_2005_F"), ]
# no.
```

It seems 'GreenBlue' is a perfect match with 'i081_2005_F': all four of GreenBlue's observed offspring are genetically assigned i081_2005_F as mother, and all four of i081_2005_F's genetic offspring have GreenBlue as observed mother. Perhaps this female lost her tag, was therefore not recognised, and received a new field ID. Field records may be able to back this theory up, or disprove it: Did GreenBlue look about 1 year old when first recorded in 2007? Was i081_2005_F ever seen in or after the 2006 breeding season? Does i081_2005_F have a known death date? If so, was there a post-mortem, or was she presumed dead because she had not seen for several months/years? 

An alternative explanation might be a pedigree inference error, but this is highly unlikely when offspring and parent are both genotyped (`id.dam.cat = 'GG'`), and only plausible with a limited number of SNPs with low call rate and high genotyping error rate. The four assignments are independent if they were done during parentage assignment, but not necessarily so during full pedigree reconstruction: they may first have been clustered as maternal siblings, and subsequently the dummy mother may have been replaced by i081_2005_F. 


Dummy individuals in Pedigree2 are never matched to genotyped individuals in Pedigree1, even if there is a perfect match such as here -- there is almost always something odd going on that requires user inspection.  

**Resolution: Merge IDs 'i081_2005_F' and 'GreenBlue' **

### Mismatch Issue2: OrangeGreen
```{r MisMatch2}
# why is this flagged as a mismatch?
PedM[which(PedM$dam.1 == "OrangeGreen"), ]
# all of OrangeGreen's offspring are in sibship F0001

PedM[which(PedM$dam.2 == "F0001"), c("id", "dam.1", "dam.2")]
# but sibship F0001 is split across two field mothers
```

So, the genetic sibship with 'F0001' as dummy mother includes individuals with two different field-observed mothers, 'OrangeGreen' and 'BlueRed'. OrangeGreen's offspring hatched in 2007, and BlueRed's in 2008 and 2009. Perhaps this female was not regularly monitored, was not recognised at the start of the 2008 breeding season, and got a new ID? 

As for the previous case, field records that could disprove this theory are census records of BlueRed in or before 2007, or of OrangeGreen in or after 2008 (or if they were seen together!). Records backing up this theory are notes like 'this could be...' or 'looks similar to ...' when BlueRed was first described. This cautionary approach to use a new ID is advisable, as it is often much easier to combine the records of two IDs later on, than to try to tease apart records under a single ID into their respective IDs. 

An alternative explanation is again a pedigree inference error, namely that `sequoia` has erroneously merged two actual sibships. This may sporadically happen when the two mothers are closely related. When the mothers are related by $>0.5$ (closer than regular full siblings, due to inbreeding), such erroneous merging can happen even with large powerful SNP sets. 

**Resolution: *Probably* merge IDs OrangeGreen' and 'BlueRed' **


### Mismatch Issue3: YellowBlue 

```{r MisMatch3}
# as before
PedM[which(PedM$dam.1 == "YellowBlue"), ]
# something odd going on involving sibships F0003 & F0005

PedM[which(PedM$dam.2 %in% c("F0003", "F0005")), ]
```

so:  

* i147_2008_F has field mum YellowBlue, but its three genetic sibs suggest GreenYellow as mum 
* i148_2008_F has field mum GreenYellow, but its genetic sib (i159_2008_F) suggest YellowBlue as mum  
 

A likely explanation is that the samples of these individuals were accidentally swapped around in the lab during DNA extraction, i.e. that genotype 'i147_2008_F' belongs to  field id 'i148_2008_F', and *v.v.*. Lab notes might shed light on this theory, e.g. was the DNA extracted in the same batch? Where the samples adjacent on the same 96-well plate? 


An alternative explanation is sloppy handwriting or typing error. Whether in this case genotype 'i147_2008_F' belongs to  field id 'i147_2008_FM' or 'i148_2008_F' depends on when this mixup happened, i.e. whether the field data are swapped between the individuals too. 


A pedigree inference error seems unlikely in this particular case -- individuals may occasionally get assigned to the wrong sibship, due to genotyping errors, a low call rate sample, or an uncommon draw in the Mendelian inheritance lottery, but it seems highly unlikely that this would result in two individuals swapping place. 

**Resolution: Swap IDs 'i147_2008_F' and 'i148_2008_F' in genetic data *(and/or in field data)* **


### Mismatch Issue4: GreenYellow 
This turned out to be the same as Issue3!



## Pedigree1-only

### General 
The second class of discrepancies are individuals who have a parent assigned in Pedigree1 (the field-observation-based pedigree), but not in Pedigree2 (the genetically inferred pedigree), i.e. there is no genetic confirmation of the field parent. It is important here to distinguish between:

1. Parent and/or siblings in Pedigree1 are definitely not closely genetically related to the focal individual  
2. Parent and/or siblings in Pedigree1 *are* a genetic match, but not assigned due to 
    * lacking birth year information (who is the parent and who is the offspring?),  
    * ambiguity about what kind of second degree relative the sibling in Pedigree 1 is, or  
    * the likelihood ratio falling short of the assignment threshold.  
3. Parent and siblings are not genotyped, or have such a low call rate that they are automatically excluded; i.e. there is no evidence one way or another. 

A range of tools is available to help distinguish between these three alternatives:

* `CalcOHLLR()`: Calculate Mendelian errors and parental LLR for the parents assigned in Pedigree1; any genetic incompatibilities will stand out as outliers (if the SNP panel is powerful enough)  
* `getAssignCat()`: get an (anti-conservative) indication which of the non-genotyped individuals in Pedigree1 could get dummy ID stand-ins during genetic pedigree reconstruction; non-'dummyfiable' parents suggest the third alternative and are no reason for worry
* `GetMaybeRel()`: find all pairs of individuals that are likely relatives, but not assigned (2nd alternative)
* `CalcPairLL()`: For a batch of pairs, get the likelihoods of them being PO, FS, HS, ...; i.e. get the data underlying the parental LLR or the LLR in the `GetMaybeRel()` output.   
* Genomic relatedness: especially if there are many more SNPs scored than those used for pedigree reconstruction, the genomic relatedness can provide an additional clue on whether or not two individuals are closely related.   


### Griffins

```{r Ped1only}
PCG$MergedPed[which(PCG$MergedPed$dam.class == "P1only"), 
              c("id", "id.r", "dam.1", "dam.2", "id.dam.cat")]
```

The `id.dam.cat = 'XX'` indicates that neither the focal individual nor their dam in Pedigree1 is genotyped or 'dummifiable' (has at the very minimum one genotyped offspring, see `?getAssignCat`). Thus there is in this case simply no way of genetically testing whether or not these two are indeed maternal siblings, as the field data indicates. 

Genetic data could become available in the future, due to DNA sampling of  i053_2003_M's offspring, or of himself (e.g. post-mortem). This is one of the reasons why it is recommended to rerun pedigree analysis with all individuals when additional individuals have been genotyped. 


## Pedigree2-only (newly assigned mum)


### General
Analogous to the Pedigree1-only classification, it is here useful to differentiate between  

1. The new genetic assignment is impossible or extremely unlikely based on field data
    * The assigned parent died before the birth year of the putative offspring
    * The putative offspring's parents are known with great certainty
    * The spatial locations of the assigned parent-pair were too far apart around the time of conception for them to have mated
    * The putative offspring is a dummy individual, but the assigned mother's offspring are all accounted for:
        * SNP genotyped and different ID, or
        * different sex, or
        * definite death date before breeding age
        * birth year incompatible with dummy individual's offspring

2. The assignment is quite likely based on field data
    * The assigned mother was not seen in the breeding season her new offspring was born (e.g. bred outside the study area)
    * The offspring and newly assigned mother are part of the same social group
3. There is no field data to suggest one way or the other



When the field data indicate that a newly assigned parent is highly implausible, there are various possible explanations, such as:   

* Sample mislabeling (offspring or parent sample)  
* Parent-offspring pair assigned the wrong way around (i.e. actual offspring assigned as parent), due to missing/incorrect birth years, or as a low-confidence dummy--dummy pedigree link
* A presumed-dead individual emigrated and re-immigrated, or mistaken identity during post-mortem (i.e. incorrect death date). 


### Griffins
As example, let's take the assignment of 'YellowPink' (aka F0007) as mother of 'PinkBlue' (aka F0002):

```{r }
SeqOUT_griffin$DummyIDs[c(6,7), c("id", "dam", "BY.est", "NumOff", "O1", "O2", "O3", "O4")]
```



Dummy - dummy pedigree links as these tend to have a higher error rate than genotyped - genotyped pedigree links (see also `EstConf()`), especially when the sibships are small, as all genetic information is 'second hand'. They are therefore worth checking if independent data is available.

The question here is whether it is plausible that YellowPink had a non-genotyped daughter in 2008 that survived to breeding age. She had one genotyped offspring in 2008 ('i141_2008_M'), so the nest was monitored that year. Field and lab records will indicate whether there were any additional hatchlings that were not sampled, or not successfully genotyped (+ tag loss of PinkBlue). If YellowPink had only one female, non-genotyped, potentially-surviving offspring that year, then that hatchling's data can be combined with PinkBlue's breeding records. If she definitely had no offspring that are unaccounted for, the pedigree is most likely incorrect and the parent-offspring link would best be removed. 


## Warning re dummy numbering
The dummy number corresponding to each non-genotyped individual is not consistent between runs. 
It simply reflects the order in which sibships were found, which may change when individuals are added or removed from the genotype data, or even when the order of individuals or the presumed genotyping error rate is changed [^2]. 
Therefore, a script like this is NOT robust:

[^2]: Note that the dummy numbering in this griffin example has also changed between version 2.3.5 and 2.4

```{r eval=FALSE, purl=FALSE}
PedX <- SeqOUT_griffin$Pedigree
# NOT LIKE THIS:
PedX$id[ PedX$id == 'F0001'] <- 'OrangeGreen'
PedX$dam[ PedX$dam == 'F0001'] <- 'OrangeGreen'
PedX$id[ PedX$id == 'F0002'] <- 'PinkBlue'
PedX$dam[ PedX$dam == 'F0002'] <- 'PinkBlue'
```

Instead, match by an individual within the sibship (the dummy's offspring), which should be consistent between runs:

```{r }
PedX <- SeqOUT_griffin$Pedigree
# INSTEAD, SOMETHING LIKE THIS:
Name_match <- matrix(c(PedX$dam[PedX$id=='i123_2007_F'], 'OrangeGreen',  
                       PedX$dam[PedX$id=='i165_2009_F'], 'PinkBlue',  
                       PedX$dam[PedX$id=='i121_2007_M'], 'GreenYellow'),
                     ncol = 2, byrow=TRUE)
Name_match

for (i in 1:nrow(Name_match)) {
  PedX$id[ PedX$id == Name_match[i,1] ] <- Name_match[i,2] 
  PedX$dam[ PedX$dam == Name_match[i,1] ] <- Name_match[i,2] 
}
  
```
The offspring IDs can for example be found in `SeqOUT_griffin$DummyIDs`. 

```{r }
head(SeqOUT_griffin$DummyIDs[, c('id','dam','sire','NumOff','O1','O2','O3')])
```