ubms
The ubms
package fits models of wildlife occurrence and
abundance in Stan (Carpenter et al.
2017), in a similar fashion to the unmarked
package (Fiske,
Chandler, and others 2011). One of the advantages of
ubms
is that it is possible to include random effects in
your models, using the same syntax as lme4 (Bates et al.
2015). For example, if you have a group
site
covariate, you can fit a model with random intercepts by
group
by including + (1|group)
in your
parameter formula. Random slopes, or a combination of random slopes and
intercepts, are also possible. To illustrate the use of random effects
of ubms
, this vignette fits a model to multi-season
occupancy data using a “stacked” approach.
Suppose you have a dataset of repeated detections/non-detections or counts that are collected over several primary periods (i.e., seasons). The logical model choice for such data is a multi-season model, such as the dynamic occupancy model (MacKenzie et al. 2003) or some form of Dail-Madsen model for count data (Dail and Madsen 2011). These models estimate transition probabilities such as colonization and extinction rates between seasons.
However, in some cases you might not want to fit a dynamic model. There are several potential reasons for this: (1) You don’t have enough data (Dail-Madsen type models are particularly data hungry); (2) You aren’t interested in the transition probabilities; or (3) The dynamic model type you need isn’t available in theory or in your software package of choice.
An alternative approach is to fit multiple years of data into a
single-season model using the “stacked” approach. Essentially, you treat
unique site-year combinations as sites. For a helpful discussion on the
topic, see this
thread on the unmarked
forums.
Ideally you want to control for the pseudoreplication this creates in
some form. In unmarked
you are limited to approaches such
as including a dummy variable for site and/or year. In ubms
you can instead include, for example, random site intercepts to account
for this pseudoreplication.
ubms
We will use the crossbill
dataset to illustrate a
stacked occupancy model with a site-level random effect. The
crossbill
dataset comes packaged with ubms
via
unmarked
:
The crossbill
dataset is a data.frame
with
many columns. It contains 9 years of detection/non-detection data for
the European crossbill (Loxia curvirostra) in Switzerland (Schmid, Zbinden, and Keller
2004).
## [1] 267 58
## [1] "id" "ele" "forest" "surveys" "det991" "det992" "det993"
## [8] "det001" "det002" "det003" "det011" "det012" "det013" "det021"
## [15] "det022" "det023" "det031" "det032" "det033" "det041" "det042"
## [22] "det043" "det051" "det052" "det053" "det061" "det062" "det063"
## [29] "det071" "det072" "det073" "date991" "date992" "date993" "date001"
## [36] "date002" "date003" "date011" "date012" "date013" "date021" "date022"
## [43] "date023" "date031" "date032" "date033" "date041" "date042" "date043"
## [50] "date051" "date052" "date053" "date061" "date062" "date063" "date071"
## [57] "date072" "date073"
Check ?crossbill
for details about each column. The
first three columns id
, ele
, and
forest
are site covariates.
The following 27 columns beginning with det
are the
binary detection/non-detection data; 9 years with 3 observations per
year. The final 27 columns beginning with date
are the
Julian dates for each observation.
We will use the first 3 years of crossbill
data (instead
of all 9), simply to keep the analysis run time down. Converting the
crossbill
data to stacked format is a bit complex. The
dataset contains 267 unique sites; thus after stacking we should end up
with a response variable and covariates that contain
267 * 3 = 801
“sites” (actually site-years). We will order
this new dataset so that the first 267 rows are the sites in year 1, the
2nd 267 rows are the sites in year 2, and so on.
Handling the site-level covariates (which do not change between
years) is the easiest task. We simply replicate the set of site
covariates (which contains one row for each of the original 267 sites)
one time per season, and stack each replicate on top of each other
vertically with rbind
.
site_covs <- crossbill[,c("id", "ele", "forest")]
sc_stack <- rbind(site_covs, site_covs, site_covs)
We also want to add a factor column called site
to the
stacked site covariates that identifies the original site number of each
row. We will use this later as our grouping factor for the random
effect
## id ele forest site
## 1 1 450 3 1
## 2 2 450 21 2
## 3 3 1050 32 3
## 4 4 950 9 4
## 5 5 1150 35 5
## 6 6 550 2 6
Stacking the response variable and the observation covariates is harder. Our dataset is in a “wide” format where each row is a site and each observation is a column, with columns 1-3 corresponding to year 1, 4-6 to year 2, and so on. Here is a function that splits a “wide” dataset like this into pieces and stacks them on top of each other.
wide_to_stacked <- function(input_df, nyears, surveys_per_year){
inds <- split(1:(nyears*surveys_per_year), rep(1:nyears, each=surveys_per_year))
split_df <- lapply(1:nyears, function(i){
out <- input_df[,inds[[i]]]
out$site <- 1:nrow(input_df)
out$year <- i
names(out)[1:3] <- paste0("obs",1:3)
out
})
stack_df <- do.call("rbind", split_df)
stack_df$site <- as.factor(stack_df$site)
stack_df$year <- as.factor(stack_df$year)
stack_df
}
This function can be used to convert both the detection/non-detection
data and observation covariates to the stacked format. First, we isolate
the detection/non-detection data in crossbill
:
Next we convert it to stacked format, specifying that we want only the first 3 years, and that each year has 3 observations/surveys:
## [1] 801 5
## obs1 obs2 obs3 site year
## 1 0 0 0 1 1
## 2 0 0 0 2 1
## 3 NA NA NA 3 1
## 4 0 0 0 4 1
## 5 0 0 0 5 1
## 6 NA NA NA 6 1
Finally, we do the same with the date
observation
covariate.
date_wide <- crossbill[,grep("date", names(crossbill), value=TRUE)]
date_stack <- wide_to_stacked(date_wide, 3, 3)
dim(date_stack)
## [1] 801 5
With our stacked datasets constructed, we build our
unmarkedFrame
:
umf_stack <- unmarkedFrameOccu(y=y_stack[,1:3], siteCovs=sc_stack,
obsCovs=list(date=date_stack[,1:3]))
head(umf_stack)
## Data frame representation of unmarkedFrame object.
## y.1 y.2 y.3 id ele forest site date.1 date.2 date.3
## 1 0 0 0 1 450 3 1 34 59 65
## 2 0 0 0 2 450 21 2 17 33 65
## 3 NA NA NA 3 1050 32 3 NA NA NA
## 4 0 0 0 4 950 9 4 29 59 65
## 5 0 0 0 5 1150 35 5 24 45 65
## 6 NA NA NA 6 550 2 6 NA NA NA
## 7 0 0 0 7 750 6 7 26 54 74
## 8 0 0 0 8 650 60 8 23 43 71
## 9 0 0 0 9 550 5 9 21 36 56
## 10 0 0 0 10 550 13 10 37 62 75
We’ll now fit a model with fixed effects of elevation and forest
cover (ele
and forest
) on occupancy and a
date
effect on detection. In addition, we will include
random intercepts by site
, since in stacking the data we
have pseudoreplication by site. To review, random effects are specified
using the approach used in with the lme4
package. For
example, a random intercept for each level of the covariate
site
is specified with the formula component
(1|site)
. Including random effects in a model in
ubms
usually significantly increases the run time.
fit_stack <- stan_occu(~scale(date) ~scale(ele) + scale(forest) + (1|site),
data=umf_stack, chains=3, iter=500)
fit_stack
##
## Call:
## stan_occu(formula = ~scale(date) ~ scale(ele) + scale(forest) +
## (1 | site), data = umf_stack, chains = 3, iter = 500, refresh = 0)
##
## Occupancy (logit-scale):
## Estimate SD 2.5% 97.5% n_eff Rhat
## (Intercept) -1.60 0.218 -2.061 -1.21 146 0.998
## scale(ele) 1.13 0.217 0.729 1.55 312 1.004
## scale(forest) 1.49 0.223 1.066 1.94 109 1.002
## sigma [1|site] 1.95 0.319 1.387 2.66 44 1.011
##
## Detection (logit-scale):
## Estimate SD 2.5% 97.5% n_eff Rhat
## (Intercept) 0.182 0.0980 -0.00664 0.372 1082 0.998
## scale(date) 0.337 0.0886 0.15452 0.500 1767 0.997
##
## LOOIC: 1473.761
We get warnings; these should be fixed by increasing the iterations.
In addition to fixed effect estimates, we now have an estimate for the
site-level variance (sigma [1|site]
) in our summary
table.
In order to get the actual random intercept values, we use the
ranef
function. Note that this function behaves like the
lme4
version, not like the unmarked
version. A
further caution is that when using an effects parameterization,
ranef
always returns the complete random intercept/slope
term for a group (i.e., the mean + random effect, not just the random
part).
## 1 2 3 4 5 6
## -1.89508002 -1.99838584 -0.30322435 0.06205105 0.39474369 -1.72122316
You can also generate summary statistics for each random intercept:
## Estimate SD 2.5% 97.5%
## 1 -1.89508002 1.785040 -5.680523 1.561208
## 2 -1.99838584 1.753822 -5.530354 1.201788
## 3 -0.30322435 1.373765 -3.039763 2.464200
## 4 0.06205105 1.336437 -2.665990 2.442716
## 5 0.39474369 1.214734 -1.977876 3.013954
## 6 -1.72122316 1.753082 -5.287677 1.521753