Simulating datasets is a powerful and varied tool when conducting
unmarked
analyses. Writing our own code to simulate a
dataset based on a given model is an excellent learning tool, and can
help us test if a given model is generating the expected results. If we
simulate a series of datasets based on a fitted model, and calculate a
statistic from each of those fits, we can generate a distribution of the
statistic - this is what the parboot
function does. This
can be helpful, for example, when testing goodness of fit. Finally,
simulation can be a useful component of power analysis when a
closed-form equation for power is not available.
unmarked
provides two different ways of generating
simulated datasets, depending on the stage we are at in the modeling
process.
For (1), we simply call the simulate
method on our
fitted model object and new dataset(s) are generated. This is the
approach used by parboot
. In this vignette we will focus on
(2), a more flexible approach to simulation, also using the
simulate
method, that allows us to generate a dataset
corresponding to any unmarked
model from scratch.
We will need to provide several pieces of information to
simulate
in order to simulate a dataset from scratch in
unmarked
.
unmarkedFrame
of the appropriate type defining the
desired experimental designmodel
), if the
unmarkedFrame
is used for multiple model typescoefs
)
controlling the simulation, which correspond to the formula(s) specified
earlier.The easiest way to demonstrate how to use simulate
is to
look at an example: we’ll start with a simple one for occupancy.
unmarkedFrame
Suppose we want to simulate a single-season occupancy dataset in
which site occupancy is affected by elevation. The first step is to
create an unmarkedFrame
object of the appropriate type,
which defines the experimental design and includes any covariates we
want to use in the simulation. Since we want to simulate an occupancy
dataset, we’ll create an unmarkedFrameOccu
.
The unmarkedFrameOccu
function takes three arguments:
the observation matrix y
, the site covariates
siteCovs
, and the observation-level covariates
obsCovs
. The dimensions of y
define how many
sites and replicate samples the study includes. We’ll create a blank
y
matrix (i.e., filled with NA
s) of dimension
300 x 8, indicating we want our study to have 300 sites and 8 sampling
occasions. The values you put in this y
matrix don’t
matter, you can put anything in there you want as they’ll be overwritten
with the simulated values later. It’s only used to define the number of
sites and occasions.
Earlier we said we want to include an elevation covariate, so we’ll simulate the covariate now and add it to a data frame. We could create several covariates here, including factors, etc.
We’re not using any observation covariates, so we can now make the
complete unmarkedFrameOccu
:
## Data frame representation of unmarkedFrame object.
## y.1 y.2 y.3 y.4 y.5 y.6 y.7 y.8 elev
## 1 NA NA NA NA NA NA NA NA -0.56047565
## 2 NA NA NA NA NA NA NA NA -0.23017749
## 3 NA NA NA NA NA NA NA NA 1.55870831
## 4 NA NA NA NA NA NA NA NA 0.07050839
## 5 NA NA NA NA NA NA NA NA 0.12928774
## 6 NA NA NA NA NA NA NA NA 1.71506499
## 7 NA NA NA NA NA NA NA NA 0.46091621
## 8 NA NA NA NA NA NA NA NA -1.26506123
## 9 NA NA NA NA NA NA NA NA -0.68685285
## 10 NA NA NA NA NA NA NA NA -0.44566197
Since unmarkedFrameOccu
is used by both the
single-season occupancy model (occu
) and the Royle-Nichols
occupancy model (occuRN
), we need to tell
unmarked
which one to use.
Most unmarkedFrame
types in unmarked
are
used by only one model fitting function, so this step is often
unnecessary.
Take a look at the help file for occu
. When fitting a
single-season occupancy model we need to provide, in addition to the
data, the formula
argument defining the model structure.
We’ll need to provide these same argument(s) to simulate
.
Many fitting functions will have multiple required arguments, such as
the mixture distribution to use, key functions, etc.
Here we specify a double right-hand-side formula as required by
occu
, specifying an effect of elevation on occupancy.
The model structure, as defined by the formula above, implies a
certain set of parameter/coefficient values (intercepts, slopes) we need
to supply to simulate
. These need to be supplied as a named
list, where each list element corresponds to one submodel (such as
state
for occupancy and det
for detection).
Each list element is a numeric vector of the required parameter values.
It can be tricky to figure out the structure of this list, so
simulate
allows you to not include it at first, in which
case the function will return a template for you to fill in.
## coefs should be a named list of vectors, with the following structure
## (replace 0s with your values):
##
## $state
## (Intercept) elev
## 0 0
##
## $det
## (Intercept)
## 0
## Error: Specify coefs argument as shown above
We need to supply a list with two elements state
and
det
. The state
element contains two values,
the intercept and the slope corresponding to elevation. The
det
element contains only the intercept since we have no
covariates on detection. Note that all values supplied in this list
must be on the inverse link scale, which will depend on the
specific submodel used. So for example, a value of 0 for
det
implies a detection probability of 0.5, because we’re
using the logit link function.
## [1] 0.5
Now let’s make our own coefs
list:
Here we’re setting a negative effect of elevation on occupancy.
We now have all the pieces to simulate a dataset.
## Assumed parameter order for state:
## (Intercept), elev
## Assumed parameter order for det:
## (Intercept)
The result is always a list of unmarkedFrame
s. By
default, we just get one, but we can get more with the nsim
argument.
## Data frame representation of unmarkedFrame object.
## y.1 y.2 y.3 y.4 y.5 y.6 y.7 y.8 elev
## 1 1 1 1 1 1 1 1 1 -0.56047565
## 2 0 0 0 0 0 0 0 0 -0.23017749
## 3 0 0 0 0 0 0 0 0 1.55870831
## 4 0 0 0 0 0 0 0 0 0.07050839
## 5 0 0 0 0 0 0 0 0 0.12928774
## 6 1 0 0 0 0 0 0 0 1.71506499
## 7 0 0 0 0 0 0 0 0 0.46091621
## 8 0 0 0 0 0 0 0 0 -1.26506123
## 9 1 1 1 0 1 0 1 0 -0.68685285
## 10 1 0 0 1 1 0 0 0 -0.44566197
The simulated unmarkedFrame
now contains y
values and is ready to use.
As a quick check, let’s fit a model to our simulated dataset.
##
## Call:
## occu(formula = ~1 ~ elev, data = out[[1]])
##
## Occupancy:
## Estimate SE z P(>|z|)
## (Intercept) -0.146 0.120 -1.22 0.223241
## elev -0.572 0.136 -4.20 0.000027
##
## Detection:
## Estimate SE z P(>|z|)
## -0.038 0.0612 -0.621 0.535
##
## AIC: 1929.538
We get out roughly the same parameters that we put in, as expected.
The gdistremoval
function fits the model of Amundson et al. (2014), which estimates
abundance using a combination of distance sampling and removal sampling
data. When simulating a dataset based on this model, we have to provide
several additional pieces of information related to the structure of the
distance and removal sampling analyses.
unmarkedFrame
First create the appropriate type of unmarkedFrame
,
which is unmarkedFrameGDR
. There’s two y-matrices: one for
distance sampling and one for removal sampling. We’ll create a dataset
with 4 distance bins and 5 removal periods.
set.seed(123)
M <- 100
Jdist <- 4
Jrem <- 5
y_dist <- matrix(NA, M, Jdist)
y_rem <- matrix(NA, M, Jrem)
We’ll create an elevation site covariate and a wind observation
covariate. Observation-level covariates are only used by the removal
part of the model, so they should have the same number of values as
y_rem
.
Finally we can create the unmarkedFrameGDR
. We’ll also
need to specify the distance bins and the units for the distance part of
the model here. See ?unmarkedFrameGDR
for more
information.
umf <- unmarkedFrameGDR(yRem = y_rem, yDist = y_dist, siteCovs = site_covs, obsCovs = obs_covs,
dist.breaks = c(0,25,50,75,100), unitsIn = 'm')
## Data frame representation of unmarkedFrame object.
## yDist.1 yDist.2 yDist.3 yDist.4 yRem.1 yRem.2 yRem.3 yRem.4 yRem.5
## 1 NA NA NA NA NA NA NA NA NA
## 2 NA NA NA NA NA NA NA NA NA
## 3 NA NA NA NA NA NA NA NA NA
## 4 NA NA NA NA NA NA NA NA NA
## 5 NA NA NA NA NA NA NA NA NA
## 6 NA NA NA NA NA NA NA NA NA
## 7 NA NA NA NA NA NA NA NA NA
## 8 NA NA NA NA NA NA NA NA NA
## 9 NA NA NA NA NA NA NA NA NA
## 10 NA NA NA NA NA NA NA NA NA
## elev wind.1 wind.2 wind.3 wind.4 wind.5
## 1 -0.56047565 -0.71040656 0.2568837 -0.24669188 -0.34754260 -0.95161857
## 2 -0.23017749 -0.04502772 -0.7849045 -1.66794194 -0.38022652 0.91899661
## 3 1.55870831 -0.57534696 0.6079643 -1.61788271 -0.05556197 0.51940720
## 4 0.07050839 0.30115336 0.1056762 -0.64070601 -0.84970435 -1.02412879
## 5 0.12928774 0.11764660 -0.9474746 -0.49055744 -0.25609219 1.84386201
## 6 1.71506499 -0.65194990 0.2353866 0.07796085 -0.96185663 -0.07130809
## 7 0.46091621 1.44455086 0.4515041 0.04123292 -0.42249683 -2.05324722
## 8 -1.26506123 1.13133721 -1.4606401 0.73994751 1.90910357 -1.44389316
## 9 -0.68685285 0.70178434 -0.2621975 -1.57214416 -1.51466765 -1.60153617
## 10 -0.44566197 -0.53090652 -1.4617556 0.68791677 2.10010894 -1.28703048
gdistremoval
Looking at ?gdistremoval
, required arguments include
lambdaformula
, removalformula
, and
distanceformula
. We need to set these formula values to
control the simulation. We’ll also use the negative binomial
distribution for abundance.
As in the previous section, we’ll leave the coefs
argument blank at first and get the correct output structure.
## coefs should be a named list of vectors, with the following structure
## (replace 0s with your values):
##
## $lambda
## (Intercept) elev
## 0 0
##
## $alpha
## alpha
## 0
##
## $dist
## (Intercept)
## 0
##
## $rem
## (Intercept) wind
## 0 0
## Error: Specify coefs argument as shown above
We need to set two values for the abundance (lambda
)
model on the log scale, one for dist
which represents the
distance function sigma parameter (log scale), one for the negative
binomial dispersion parameter alpha
(log scale), and two
for the removal detection probability model (logit scale).
We’ll pick the (relatively arbitrary) values below:
Now provide everything to simulate
. Note we don’t need
to provide the model
argument because
unmarkedFrameGDR
is used for only one fitting function
(gdistremoval
).
We’ll simulate 2 datasets.
out <- simulate(umf, lambdaformula=~elev, removalformula=~wind, distanceformula=~1,
coefs=cf, mixture="NB", nsim=2)
## Assumed parameter order for lambda:
## (Intercept), elev
## Assumed parameter order for alpha:
## alpha
## Assumed parameter order for dist:
## (Intercept)
## Assumed parameter order for rem:
## (Intercept), wind
## [[1]]
## Data frame representation of unmarkedFrame object.
## yDist.1 yDist.2 yDist.3 yDist.4 yRem.1 yRem.2 yRem.3 yRem.4 yRem.5
## 1 0 1 0 0 1 0 0 0 0
## 2 1 0 1 2 1 1 1 0 1
## 3 4 6 7 3 7 3 5 2 3
## 4 1 0 1 0 0 0 0 1 1
## 5 0 0 1 0 1 0 0 0 0
## 6 0 0 0 0 0 0 0 0 0
## 7 3 3 0 1 1 0 1 3 2
## 8 0 0 0 0 0 0 0 0 0
## 9 0 0 0 0 0 0 0 0 0
## 10 0 0 0 0 0 0 0 0 0
## elev wind.1 wind.2 wind.3 wind.4 wind.5
## 1 -0.56047565 -0.71040656 0.2568837 -0.24669188 -0.34754260 -0.95161857
## 2 -0.23017749 -0.04502772 -0.7849045 -1.66794194 -0.38022652 0.91899661
## 3 1.55870831 -0.57534696 0.6079643 -1.61788271 -0.05556197 0.51940720
## 4 0.07050839 0.30115336 0.1056762 -0.64070601 -0.84970435 -1.02412879
## 5 0.12928774 0.11764660 -0.9474746 -0.49055744 -0.25609219 1.84386201
## 6 1.71506499 -0.65194990 0.2353866 0.07796085 -0.96185663 -0.07130809
## 7 0.46091621 1.44455086 0.4515041 0.04123292 -0.42249683 -2.05324722
## 8 -1.26506123 1.13133721 -1.4606401 0.73994751 1.90910357 -1.44389316
## 9 -0.68685285 0.70178434 -0.2621975 -1.57214416 -1.51466765 -1.60153617
## 10 -0.44566197 -0.53090652 -1.4617556 0.68791677 2.10010894 -1.28703048
##
## [[2]]
## Data frame representation of unmarkedFrame object.
## yDist.1 yDist.2 yDist.3 yDist.4 yRem.1 yRem.2 yRem.3 yRem.4 yRem.5
## 1 0 1 0 0 1 0 0 0 0
## 2 0 1 0 0 0 1 0 0 0
## 3 2 1 2 1 4 2 0 0 0
## 4 0 0 0 0 0 0 0 0 0
## 5 1 1 0 1 1 1 0 1 0
## 6 0 0 0 0 0 0 0 0 0
## 7 1 1 1 0 1 0 2 0 0
## 8 0 0 0 0 0 0 0 0 0
## 9 0 0 0 1 1 0 0 0 0
## 10 0 0 0 0 0 0 0 0 0
## elev wind.1 wind.2 wind.3 wind.4 wind.5
## 1 -0.56047565 -0.71040656 0.2568837 -0.24669188 -0.34754260 -0.95161857
## 2 -0.23017749 -0.04502772 -0.7849045 -1.66794194 -0.38022652 0.91899661
## 3 1.55870831 -0.57534696 0.6079643 -1.61788271 -0.05556197 0.51940720
## 4 0.07050839 0.30115336 0.1056762 -0.64070601 -0.84970435 -1.02412879
## 5 0.12928774 0.11764660 -0.9474746 -0.49055744 -0.25609219 1.84386201
## 6 1.71506499 -0.65194990 0.2353866 0.07796085 -0.96185663 -0.07130809
## 7 0.46091621 1.44455086 0.4515041 0.04123292 -0.42249683 -2.05324722
## 8 -1.26506123 1.13133721 -1.4606401 0.73994751 1.90910357 -1.44389316
## 9 -0.68685285 0.70178434 -0.2621975 -1.57214416 -1.51466765 -1.60153617
## 10 -0.44566197 -0.53090652 -1.4617556 0.68791677 2.10010894 -1.28703048
As a check, we’ll fit the same model used for simulation to one of the datasets.
gdistremoval(lambdaformula=~elev, removalformula=~wind, distanceformula=~1, data=out[[1]],
mixture="NB")
##
## Call:
## gdistremoval(lambdaformula = ~elev, removalformula = ~wind, distanceformula = ~1,
## data = out[[1]], mixture = "NB")
##
## Abundance:
## Estimate SE z P(>|z|)
## (Intercept) 1.651 0.152 10.89 1.28e-27
## elev 0.887 0.142 6.25 4.16e-10
##
## Dispersion:
## Estimate SE z P(>|z|)
## 0.118 0.242 0.486 0.627
##
## Distance:
## Estimate SE z P(>|z|)
## 3.89 0.053 73.4 0
##
## Removal:
## Estimate SE z P(>|z|)
## (Intercept) -0.901 0.141 -6.38 1.74e-10
## wind -0.373 0.088 -4.24 2.20e-05
##
## AIC: 1162.882
Looks good.
The simulate
function provides a flexible tool for
simulating data from any model in unmarked
. These datasets
can be used for a variety of purposes, such as for teaching examples,
testing models, or developing new tools that work with
unmarked
. Additionally, simulating datasets is a key
component of the power analysis workflow in unmarked
- see
the power analysis vignette for more examples.