---
title: "Introduction to the fossil record"
author: "Matheus Januario, Andressa Viol, and Daniel Rabosky"
date: "Jan 2024"
output: rmarkdown::html_vignette
vignette: >
 %\VignetteIndexEntry{deeptime_rocks}
 %\VignetteEngine{knitr::rmarkdown}
 %\VignetteEncoding{UTF-8}
---

```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
options(rmarkdown.html_vignette.check_title = FALSE)
```

# Learning objectives

1. Exploring fossil occurrences
2. Diversity patterns in deep time
3. The spatial distribution of the record
4. Drawing conclusions from the fossil record
5. Technical note: Dating fossils in absolute time

# Introduction

Thus far, we've been typically dealing with evolutionary processes that occur over short timescales -- mostly not more than a few hundred generations. But this generational scale of evolutionary change pales in comparison to the vastness of geological time. In this lab, we will explore the evolutionary history of clades through tens (sometimes hundreds) of millions of years, via the major source of direct evidence on deep time: the fossil record.

For this tutorial, you must install the `palebioDB` R package.

```{r eval=FALSE}
install.packages("paleobioDB")
```

Then, load all packages required into your workspace:

```{r}
library(evolved)
library(paleobioDB)
```

***

# Exploring fossil occurrences

We've downloaded a fossil dataset for you to work with today (`dinos_fossil`). All occurrence data were originally obtained from the Paleobiology database, a free-to-use resource that is the standard repository for such data and which has been used by hundreds or thousands of researchers to study the history and dynamics of biodiversity in deep time. 

We kept as many occurrences as possible from the original datasets, but removed some extraneous data (e.g. notes on preservation, notes on the paleoenvironment, the original reference for the data, etc.) to make them easier to use. We also removed occurrences with poor _time resolution_ (i.e., the timespan between the maximum and minimum ages of the strata associated with that fossil), and we removed all fossil occurrences not identified to the species level.

To illustrate what these data look like, let's read and view part of the dinosaur occurrence data. First, we'll read the file:

```{r}
data("dinos_fossil")
```

Then, we'll check the basic properties of the `dinos` dataframe:

```{r}
dim(dinos_fossil)
```

You can see that there are 15527 rows (each row is an occurrence) and 13 columns (each column is an attribute of each occurrence).

So what are these columns?

```{r}
colnames(dinos_fossil) 
```

It is always good to check the rows and columns (as we just did) to verify that data were correctly loaded.

In our data, we see columns for:

**`phylum`**, **`class`**, **`order`**, **`family`**, **`genus`**, **`species`** = The taxon associated with that occurrence for that hierarchical level.

**`early_interval`**, **`late_interval`** = The name of the early (furthest from present) and the name of the late (closest to present) intervals associated with that occurrence. Some of those intervals appear in the GSA geologic timescale [(link **here**)](https://www.geosociety.org/GSA/GSA/timescale/home.aspx). 

**`min_ma`**, **`max_ma`**  = The maximum (oldest; farthest in time from the present) and minimum (youngest; closest to present) ages associated with that occurrence. Note that these ages are in units of millions of years ago (abbreviated to "mya"), in absolute time. 

**`midpoint`**  = The date (in millions of years ago) of the exact halfway point of the interval the fossil was found in. We will explain this variable a bit further in this tutorial.

**`lng`**, **`lat`** = The longitude and latitude of the location the fossil was found in. These coordinates are for the fossil's location _in the present_. The "paleo" latitude and longitude (which we removed prior to loading the data in the package) of each fossil occurrence would be an estimate of the location where that organism actually lived, given what we know about plate tectonics and the age of the fossil.

More on how fossil dating works (i.e., how we can determine **`min_ma`** and **`max_ma`**) and its limitations are in the technical note "Dating fossils in absolute time" at the end of this tutorial.

Now, using the function `head()`, we can view the first `n` rows of the dataset:

```{r}
head(dinos_fossil, n = 5)
```

You will notice that the `head()` output wraps around your console output screen because it is too wide to be displayed in one viewing frame. The functions `dim()`, `colnames()`, and `head()` are useful for inspecting the datasets that you've loaded.

The exercises below use the dinosaur data for illustrative purposes, but the functions we've provided (and associated exercises) will work for any of the other datasets (trilobites, ammonites, mammals).

As a warm-up, let's count how many occurrences there are for a famous dinosaur genus, like _Majungasaurus_. An interesting quick note here is that there is a _Majungasaurus_ full skeleton reconstruction in the University of Michigan Museum of Natural History that anyone can visit. 

```{r}
sum(dinos_fossil$genus == "Majungasaurus")
```

Now let's try its more famous relative, the T-rex. How many occurrences of T-rex are there in our dataset? 

```{r}
sum(dinos_fossil$species == "Tyrannosaurus rex")
```

Important to consider: most of what we know about the T-rex comes from these 70 fossil specimens!

## A brief introduction to temporal ranges

When we are able to date both time boundaries of a rock section, we can determine the _midpoint_ age of that rock section, which marks the exact halfway point of the interval. 

To illustrate this, we will take a closer look at the fossil occurrences of _Tyranossaurus rex_. First, let's make a copy of these occurrences and assign them to a new R object.

```{r}
Trex <- dinos_fossil[dinos_fossil$species == "Tyrannosaurus rex",]

#this still has the same columns as the dinosaur dataset:
colnames(Trex)
```

Now, let's look at these occurrences.

```{r}
plotRawFossilOccs(Trex, use.midpoint =  F, knitr = T)
```

In the above plot, each horizontal black line segment represents a fossil occurrence. The leftmost end of the line segment represents the maximum (oldest) estimated age of that occurrence, and the rightmost end of the line segment represents the minimum (youngest) estimated age of that occurrence. Thus, each segment captures the full temporal range of possible ages for that fossil occurrence. Because there are 70 total occurrences for T-rex, there are 70 line segments.

Looking at these occurrences, do you think that T-rex actually appeared at the maximum age of the oldest occurrence (the bottommost line segment that spans from 83.5 Ma to 70.6 Ma)?

Instead of plotting the early and late boundaries of the stages in which a species occurs, we can get a more reasonable duration for the T-rex by considering just the time spanning the earliest and latest *midpoints* of all of the occurrences. 

Let's re-draw the fossil record of the T-rex, but this time let's only use the midpoints of the fossil occurrences to see if we can get a better idea of the actual duration of this species.

```{r}
plotRawFossilOccs(Trex, use.midpoint = F, knitr = T)
segments(x0 = min(Trex$midpoint),
         x1 = max(Trex$midpoint),
         y0 = 35, y1 = 35, lwd=3, col="red"
          )
```

What do you think? Is this a better approximation of the length of time in which T-rex lived (its "temporal range")? This is not a graded question, but discuss this with your fellow students or your GSI for a moment.

There are many ways to try to reconstruct ages or temporal ranges of species from the fossil record. All of those reconstructions will present some amount of error, and this error will be smaller for fossil records that have better time resolution. The main caveats of these techniques are:

(I) We still do not know the exact date of the appearance or extinction of a taxon (!)
(II) Taxa with fossils that are all dated to the same stage do not have temporal ranges (see taxa B and F in panels I and II in fig 1 below), and it is therefore difficult to know when exactly within that stage they lived, and consequently with whom they co-occurred.
(III) Imprecisely-dated occurrences can still influence our midpoint technique (for instance, look at how much the temporal range of the T-rex is affected by just one occurrence).

Problems associated with the midpoint technique can be gleaned from comparing panels I and II in figure 1.

\begin{figure}[H]
  \includegraphics{~/evolved/vignettes/example_fossil_occs_V2.png}
  \caption{Fossil occurrences and the many ways of reading them. Panels show (I) the empirically unknown true taxon durations (lines) and sampling times (black dots), (II) how we can reconstruct (with error) stratigraphic ranges by connecting midpoint of fossil stages (note that as species C was not observed in stage C, we are unaware of a big part of its true stratigraphic range), and (III) how diversity curves are calculated from fossil occurrences (more on this below).}
\end{figure}

\begin{center}

-- 

\end{center}

For the purposes of this lab, we will assume that reconstructing ages using the interval between midpoints, as we did above for the T-rex, is good enough.

Now we will plot ALL of the dinosaurs in our dataset, estimating their temporal range (with error) with their interval midpoints. Each species will be a horizontal line in this graph.

```{r}
plotRawFossilOccs(dinos_fossil, tax.lvl = "species", knitr = T)
```

How will the record look if, instead of looking at the species level, we plot the temporal ranges at the taxonomic level of family (even having all the uncertainties mentioned above)?

```{r}
plotRawFossilOccs(dinos_fossil, tax.lvl  = "family", knitr = T)
```

What about the taxonomic order level?

```{r}
plotRawFossilOccs(dinos_fossil, tax.lvl  = "order", knitr = T)
```

\fbox{\begin{minipage}{48em}
Question 1:

Why do higher taxonomic levels have longer timespans?
\end{minipage}}

And, maybe more interesting: why do so many species (and, an even higher proportion of higher taxonomic levels) disappear abruptly around 66 Mya?

## The K-Pg mass extinction

The K-Pg mass extinction was caused by the collision of a bolide with Earth. We can date this event relatively precisely because the magnitude of the impact left a very conspicuous layer of iridium (a mineral that is rare on Earth, but very common in meteors) throughout our planet. Because of these characteristics, we can securely conclude that the K-Pg mass extinction occurred _66 million years ago_. We can annotate the K-Pg mass extinction event in the plot we made by using the function:

`abline(v = 66)`

### Something to think about: 

Did the K-Pg mass extinction impact diversity at the species, genus, and family levels equally, or do we see different patterns of diversity loss among these taxonomic levels? If the pattern is different, might this have some biological meaning?

## The pull of the recent

One question we might make is: what is the temporal pattern of fossil preservation? Does it change across millions of years? We can quickly (and very superficially) test this by plotting a histogram of fossil occurrence midpoints.

```{r}
hist(dinos_fossil$midpoint,
     breaks=100, #increasing breaks to see data in a better precision
     xlab="Absolute time (Mya)", main = "Fossil Occurrence midpoints",
     
     #reversing axis to represent past -> present
     xlim=c(rev(range(dinos_fossil$midpoint))) 
     )
```

Paleontologists call this increase in fossil preservation towards the present the "pull of the recent".

\fbox{\begin{minipage}{48em}
Question 2:

Why would the preservation rate increase towards the present? Give at least two reasons to justify your answer.
\end{minipage}}

\begin{center}

-- 

\end{center}

It is reasonable to imagine that not every species that has ever existed left a fossil. How can we quantify the fraction of species that become fossils? To estimate this, we will look at the number of current dinosaur species (a.k.a. birds; species list from Jetz _et al_, 2012) that have left at least one fossil occurrence.

```{r, include=T}
# first we load the species list:
data("birds_spp")

# then we count the number of extant dinosaurs within our fossil dataset:
n_fossilized_dinos <- sum(birds_spp %in% dinos_fossil$species)

#then we calculate the proportion:
n_fossilized_dinos/length(birds_spp)
```

\fbox{\begin{minipage}{48em}
Challenge (extra question) 1:

Given what you observed about how fossil preservation varies with time, and the above estimation of how many birds have left a fossil, give your critical evaluation of the overall completion of the dinosaur fossil record.
\end{minipage}}

# Diversity patterns in deep time

We can use two simple methods to calculate diversity curves in the fossil record (see fig 1-III):

**The standard method**, which simply sums up taxa that occur in the same interval, and thus calculates the diversity within each interval.

**The range-through method**, which only sums up species that are observed before and after a given stratigraphic interval, and thus calculates the diversity at interval boundaries. This method guarantees we are measuring diversity that co-occurs in time. 

Both methods interpolate (“fill in”) gaps in a species’ fossil record between occurrences, by counting a species as being present in any interval that is between the first and last fossil occurrences (FAD and LAD, respectively). Meaning: an interval that does not have an occurrence, but that is between the first and last intervals containing occurrences, is counted as containing that species; see taxon D in Figure 1 for a visual example.

The function `calcFossilDivTT` do these calculations for you, and you can use the argument `method` to decide which method will be employed.

```{r}
# This line calculates diversity through the standard method:
SM <- calcFossilDivTT(dinos_fossil, method = "stdmethod")

# and this one calculates diversity using the range-through method:
RG <- calcFossilDivTT(dinos_fossil, method = "rangethrough")


# Now we can create an empty plot to store our estimates:
plot(NA, 
     xlim=rev(range(c(RG$age, SM$age))),
     ylim=range(c(RG$div, SM$div)),
     xlab="Absolute time (Mya)",
     ylab="Diversity"
     )  

# and then add the standard method diversity:
lines(x=SM$age, y = SM$div, type = "l", lwd=2)  

# and finally add the range-through diversity:
lines(x=RG$age, y = RG$div, type = "l", col="blue", lwd=2)  
```


\fbox{\begin{minipage}{48em}
Question 3:\\

Why is the standard method curve always equal to or above the range-through curve?

Hint: Look at figure 1.
\end{minipage}}

As you can see, even when we calculate diversity through corrected methods, the pull of the recent still distorts our perception of diversity trends: the number of bird fossils increases our diversity line so much that it limits our ability to visualize changes deeper in the past. We can account for this in two ways.

The first and simpler way is to manually restrict our y-axis. We will also add a mark representing the time of the K-Pg mass extinction with a red vertical line in the following plot.

```{r}
#Creating a plot:
plot(NA, 
     xlim=rev(range(c(RG$age, SM$age))),
     
     ylim=c(0,700), # here is the part of the code I changed, in
                    # comparison with the former chunk of code
     
     xlab="Absolute time (Mya)",
     ylab="Diversity"
     )  

# adding the standard method diversity:
lines(x=SM$age, y = SM$div, type = "l", lwd=2)  

# adding the range-through diversity:
lines(x=RG$age, y = RG$div, type = "l", col="blue", lwd=2) 

# Here, red line marks the K-Pg mass extinction:
abline(v=66, col="red")
```

Instead of restricting our y-axis, we could use the log scale to highlight _proportional changes_ in diversity. In the below plot, we will plot species richness in log base 2. This means every increment of 1 unit on the log scale equals _doubling_ the number of species, and each unit decrease on this scale equals halving the number of species.

```{r}
# Plotting again our diversity curves (note
# this is exactly what we did in the previous section), but in log scale
plot(NA, 
     xlim=rev(range(c(RG$age, SM$age))),
     ylim=range(c(0, max(log(c(RG$div, SM$div), base = 2)))), # note the base 2
     xlab="Absolute time (Mya)",
     ylab="Diversity")  

# adding the standard method diversity:
lines(x=SM$age, 
      y = log(SM$div, base=2),
      type = "l", lwd=2)  
# adding the range-through diversity:
lines(x=RG$age, 
      y = log(RG$div, base = 2),
      type = "l", col="blue", lwd=2)

# Here, red line marks the K-Pg mass extinction:
abline(v=66, col="red")
```

\fbox{\begin{minipage}{48em}
Challenge (extra question) 2:

If the collision of a bolide has an almost instantaneous impact on Earth's ecosystems, why would the range-through diversity decrease before the asteroid hit Earth?
\end{minipage}}

# The spatial distribution of the record

The density of fossil occurrences also varies in space. We can plot this unevenness using the function `pbdb_map_occur` from the `paleobioDB` R package.

```{r}
# Plotting number of occurrences in space:

# you can change the size of the cells using the  argument "res"
pbdb_map_occur(dinos_fossil, res = 1, cex = 0.3)
```

Just as a curiosity, we can also plot the spatial distribution of the T-rex using the dataset we created earlier in this tutorial.

```{r, fig.height=7, fig.width=10}
pbdb_map_occur(Trex, res = 1, cex=0.7)
```

\fbox{\begin{minipage}{48em}
Question 4:

Give at least three hypotheses that would explain why some places have more fossil occurrences in the online database than others.
\end{minipage}}

# Drawing conclusions from the fossil record

In this tutorial, we briefly explored the fossil record of dinosaurs only, but we gave you 3 other datasets: Trilobites, Mammals, and Ammonites.

Explore them using the same set of functions and techniques that we used above, and then answer:

\fbox{\begin{minipage}{48em}
Question 5:\\

Which of the four fossil records has the least uncertainty? Which has the most uncertainty?

Do not try to guess what is the answer we want (there is no straightforward answer), but rather collect evidence to sustain your claims.\\

You should use at least two and at maximum six plots to support your claims. Don't forget to explain the information inside each plot, and how it supports your claim.
\end{minipage}}

\fbox{\begin{minipage}{48em}
Challenge (extra question) 3:\\

Give at least two deep-time questions that could still be answered with the most uncertain fossil record you listed above. We want your critical evaluation here, using some of the tools we presented to you in this tutorial.
\end{minipage}}

\fbox{\begin{minipage}{48em}
Challenge (extra question) 4:\\

Can you see any evidence of the same event impacting the diversity of two (or more) of the four datasets we provided? Show why you think so.
\end{minipage}}

\fbox{\begin{minipage}{48em}
Challenge (extra question) 5:

Using the analyses we employed in this class give at least two questions that you think are likely to remain unanswered because of the inherent limitations of the fossil record. Provide strong reasons, and up to two graphs from this lab's data, to justify your answers.
\end{minipage}}

Apart from the incompleteness of the record, you may have realized that another layer of complexity in the fossil record is the lack of temporal resolution most fossil occurrences have. Sometimes the imprecision is larger than 20 million years, which is a large amount of time -- but why do we have so much imprecision in those estimates? We will explore this in the last section of this tutorial.

***

# Technical note: Dating fossils in absolute time

As you may know, we can use the $U^{238}-Pb^{206}$ decay series to estimate the absolute ages of igneous rocks (e.g., granite, basalt, volcanic ash, "frozen lava", etc). Zircon is a silicate mineral that is widely-used for radiometric dating, because it forms crystals (e.g., when magma cools into rock) that selectively include $U^{238}$ and exclude all the decay products. This happens because $U^{238}$ is trapped in the zircon crystal after it forms. Then, the crystal begins to accumulate all of the decay products, but in this case, these products are kept trapped within the crystal structure. Physicists have determined that after 4.5 billion years (4,500,000,000 years; $4 \times 10^9$ years), roughly 50% of a sample of pure $U^{238}$ will have decayed into $Pb^{206}$. Thus, by looking at the ratio of $U^{238}$ and $Pb^{206}$ in a sample, we can estimate the ages of an igneous rock and thus gain insights into the age of fossil-bearing sedimentary layers.

To apply this method, we first have to estimate the decay rate of $U^{238}$. An exponential decay process can be modeled as

\begin{equation}
U_t = U_0 e^{-kt}
\end{equation}

where $U_t$ is the amount of $U^{238}$ at time $t$, $U_0$ is the starting amount, $t$ is the elapsed time (since the formation of the zircon crystal), and $k$ is the radioactive decay constant for $U^{238}$. Consequently, we need to estimate $k$  for any dating we may wish to do. We can estimate $k$ because we know the half-life ( = $t_{\frac{1}{2}}$) of $U^{238}$: this is the value of $t$ where an initial quantity of $U_0$ = 1 will have decayed by half, or $U_t = 0.5$. Rearranging the equation gives:

$k = \frac{log 2}{t_{\frac{1}{2}}}$

Note we have to use the natural log (i.e. base = $e$) in the equation above. More on the method of half-lives can be found [here ](https://chem.libretexts.org/Bookshelves/Physical_and_Theoretical_Chemistry_Textbook_Maps/Physical_Chemistry_(Fleming)/11%3A_Chemical_Kinetics_I/11.08%3A_The_Method_of_Half-Lives).

Now, using the decay constant $k$ obtained above, we can apply equation (1) to date the rock layers given an estimate of the ratio of $U^{238}$ to $Pb^{206}$. These estimates are typically made using ionizing mass spectrometry; note also that we need the ratio of the number of atoms of U and Pb, not the ratio of masses. Suppose, for a zircon crystal, that we have estimated the ratios of uranium to lead as $W_U$ : $W_{Pb}$. This lets us directly estimate $U_t$ and $U_0$ in the decay equation above. If you think of $W_U$ as the amount of uranium currently in the sample, then the total amount that you started with had to have been $W_U$ plus $W_{Pb}$, because every $Pb^{206}$ atom is the result of decay from exactly one $U^{238}$ atom. Thus, $W_U$ = $U_t$ and $U_0$ = $W_U + W_{Pb}$. 

Know, we can come back to equation (1), solve for $t$, and get to the equation that allow you to estimate the age of a rock layer, given you measured certain amounts of  $W_U$ and $W_{Pb}$.

```{r, echo=F, fig.cap="Example on how fossil layers can be dated using zircon crystals. Note that percentages represent atomic fractions, not percentages by mass."}
knitr::include_graphics('zircon_dating.png')
```


\fbox{\begin{minipage}{48em}
Challenge (extra question) 6:

Can you date fossil-bearing sections A, B, and C in the figure 2 (below)? What is the absolute dating of each rock layer?
\end{minipage}}

As an extra reading, you can find an interesting outreach article about zircon dating [ here](https://knowablemagazine.org/content/article/physical-world/2021/keeping-time-zircons).

A last important point is that not all fossiliferous sites have the conditions that allow us to do zircon dating, which means that paleontologists have to use other techniques to date those fossils. This involves a lot of work and usually adds uncertainty on the dating process. If you have 14 minutes of your free time, and want to see how important and yet difficult it is to date a fossil, you could watch [this short video from the youtube channel PBS Eons](https://www.youtube.com/watch?v=LZiHLKAymdM&t=1s).

# References:

Jetz, W., Thomas, G. H., Joy, J. B., Hartmann, K., & Mooers, A. O. (2012). The global diversity of birds in space and time. Nature, 491(7424), 444-448.

Upham, N. S., Esselstyn, J. A., & Jetz, W. (2019). Inferring the mammal tree: species-level sets of phylogenies for questions in ecology, evolution, and conservation. PLoS biology, 17(12), e3000494.