Random effects in ubms

Introduction

Random effects in ubms

The ubms package fits models of wildlife occurrence and abundance in Stan (Carpenter et al. 2017), in a similar fashion to the unmarked package (Fiske, Chandler, and others 2011). One of the advantages of ubms is that it is possible to include random effects in your models, using the same syntax as lme4 (Bates et al. 2015). For example, if you have a group site covariate, you can fit a model with random intercepts by group by including + (1|group) in your parameter formula. Random slopes, or a combination of random slopes and intercepts, are also possible. To illustrate the use of random effects of ubms, this vignette fits a model to multi-season occupancy data using a “stacked” approach.

“Stacked” Models

Suppose you have a dataset of repeated detections/non-detections or counts that are collected over several primary periods (i.e., seasons). The logical model choice for such data is a multi-season model, such as the dynamic occupancy model (MacKenzie et al. 2003) or some form of Dail-Madsen model for count data (Dail and Madsen 2011). These models estimate transition probabilities such as colonization and extinction rates between seasons.

However, in some cases you might not want to fit a dynamic model. There are several potential reasons for this: (1) You don’t have enough data (Dail-Madsen type models are particularly data hungry); (2) You aren’t interested in the transition probabilities; or (3) The dynamic model type you need isn’t available in theory or in your software package of choice.

An alternative approach is to fit multiple years of data into a single-season model using the “stacked” approach. Essentially, you treat unique site-year combinations as sites. For a helpful discussion on the topic, see this thread on the unmarked forums.

Ideally you want to control for the pseudoreplication this creates in some form. In unmarked you are limited to approaches such as including a dummy variable for site and/or year. In ubms you can instead include, for example, random site intercepts to account for this pseudoreplication.

Fitting a stacked model with ubms

Read in the input data

We will use the crossbill dataset to illustrate a stacked occupancy model with a site-level random effect. The crossbill dataset comes packaged with ubms via unmarked:

library(ubms)
data(crossbill)

The crossbill dataset is a data.frame with many columns. It contains 9 years of detection/non-detection data for the European crossbill (Loxia curvirostra) in Switzerland (Schmid, Zbinden, and Keller 2004).

dim(crossbill)
## [1] 267  58
names(crossbill)
##  [1] "id"      "ele"     "forest"  "surveys" "det991"  "det992"  "det993" 
##  [8] "det001"  "det002"  "det003"  "det011"  "det012"  "det013"  "det021" 
## [15] "det022"  "det023"  "det031"  "det032"  "det033"  "det041"  "det042" 
## [22] "det043"  "det051"  "det052"  "det053"  "det061"  "det062"  "det063" 
## [29] "det071"  "det072"  "det073"  "date991" "date992" "date993" "date001"
## [36] "date002" "date003" "date011" "date012" "date013" "date021" "date022"
## [43] "date023" "date031" "date032" "date033" "date041" "date042" "date043"
## [50] "date051" "date052" "date053" "date061" "date062" "date063" "date071"
## [57] "date072" "date073"

Check ?crossbill for details about each column. The first three columns id, ele, and forest are site covariates.

The following 27 columns beginning with det are the binary detection/non-detection data; 9 years with 3 observations per year. The final 27 columns beginning with date are the Julian dates for each observation.

Convert the input data to stacked format

We will use the first 3 years of crossbill data (instead of all 9), simply to keep the analysis run time down. Converting the crossbill data to stacked format is a bit complex. The dataset contains 267 unique sites; thus after stacking we should end up with a response variable and covariates that contain 267 * 3 = 801 “sites” (actually site-years). We will order this new dataset so that the first 267 rows are the sites in year 1, the 2nd 267 rows are the sites in year 2, and so on.

Handling the site-level covariates (which do not change between years) is the easiest task. We simply replicate the set of site covariates (which contains one row for each of the original 267 sites) one time per season, and stack each replicate on top of each other vertically with rbind.

site_covs <- crossbill[,c("id", "ele", "forest")]
sc_stack <- rbind(site_covs, site_covs, site_covs)

We also want to add a factor column called site to the stacked site covariates that identifies the original site number of each row. We will use this later as our grouping factor for the random effect

sc_stack$site <- factor(rep(1:nrow(site_covs), 3))
head(sc_stack)
##   id  ele forest site
## 1  1  450      3    1
## 2  2  450     21    2
## 3  3 1050     32    3
## 4  4  950      9    4
## 5  5 1150     35    5
## 6  6  550      2    6

Stacking the response variable and the observation covariates is harder. Our dataset is in a “wide” format where each row is a site and each observation is a column, with columns 1-3 corresponding to year 1, 4-6 to year 2, and so on. Here is a function that splits a “wide” dataset like this into pieces and stacks them on top of each other.

wide_to_stacked <- function(input_df, nyears, surveys_per_year){
  inds <- split(1:(nyears*surveys_per_year), rep(1:nyears, each=surveys_per_year))
  split_df <- lapply(1:nyears, function(i){
                      out <- input_df[,inds[[i]]]
                      out$site <- 1:nrow(input_df)
                      out$year <- i
                      names(out)[1:3] <- paste0("obs",1:3)
                      out
              })
  stack_df <- do.call("rbind", split_df)
  stack_df$site <- as.factor(stack_df$site)
  stack_df$year <- as.factor(stack_df$year)
  stack_df
}

This function can be used to convert both the detection/non-detection data and observation covariates to the stacked format. First, we isolate the detection/non-detection data in crossbill:

y_wide <- crossbill[, grep("det", names(crossbill), value=TRUE)]

Next we convert it to stacked format, specifying that we want only the first 3 years, and that each year has 3 observations/surveys:

y_stack <- wide_to_stacked(y_wide, nyears=3, surveys_per_year=3)
dim(y_stack)
## [1] 801   5
head(y_stack)
##   obs1 obs2 obs3 site year
## 1    0    0    0    1    1
## 2    0    0    0    2    1
## 3   NA   NA   NA    3    1
## 4    0    0    0    4    1
## 5    0    0    0    5    1
## 6   NA   NA   NA    6    1

Finally, we do the same with the date observation covariate.

date_wide <- crossbill[,grep("date", names(crossbill), value=TRUE)]
date_stack <- wide_to_stacked(date_wide, 3, 3)
dim(date_stack)
## [1] 801   5

With our stacked datasets constructed, we build our unmarkedFrame:

umf_stack <- unmarkedFrameOccu(y=y_stack[,1:3], siteCovs=sc_stack,
                         obsCovs=list(date=date_stack[,1:3]))
head(umf_stack)
## Data frame representation of unmarkedFrame object.
##    y.1 y.2 y.3 id  ele forest site date.1 date.2 date.3
## 1    0   0   0  1  450      3    1     34     59     65
## 2    0   0   0  2  450     21    2     17     33     65
## 3   NA  NA  NA  3 1050     32    3     NA     NA     NA
## 4    0   0   0  4  950      9    4     29     59     65
## 5    0   0   0  5 1150     35    5     24     45     65
## 6   NA  NA  NA  6  550      2    6     NA     NA     NA
## 7    0   0   0  7  750      6    7     26     54     74
## 8    0   0   0  8  650     60    8     23     43     71
## 9    0   0   0  9  550      5    9     21     36     56
## 10   0   0   0 10  550     13   10     37     62     75

Fit the Stacked Model

We’ll now fit a model with fixed effects of elevation and forest cover (ele and forest) on occupancy and a date effect on detection. In addition, we will include random intercepts by site, since in stacking the data we have pseudoreplication by site. To review, random effects are specified using the approach used in with the lme4 package. For example, a random intercept for each level of the covariate site is specified with the formula component (1|site). Including random effects in a model in ubms usually significantly increases the run time.

fit_stack <- stan_occu(~scale(date) ~scale(ele) + scale(forest) + (1|site), 
                       data=umf_stack, chains=3, iter=500)
fit_stack
## 
## Call:
## stan_occu(formula = ~scale(date) ~ scale(ele) + scale(forest) + 
##     (1 | site), data = umf_stack, chains = 3, iter = 500, refresh = 0)
## 
## Occupancy (logit-scale):
##                Estimate    SD   2.5% 97.5% n_eff  Rhat
## (Intercept)       -1.60 0.218 -2.061 -1.21   146 0.998
## scale(ele)         1.13 0.217  0.729  1.55   312 1.004
## scale(forest)      1.49 0.223  1.066  1.94   109 1.002
## sigma [1|site]     1.95 0.319  1.387  2.66    44 1.011
## 
## Detection (logit-scale):
##             Estimate     SD     2.5% 97.5% n_eff  Rhat
## (Intercept)    0.182 0.0980 -0.00664 0.372  1082 0.998
## scale(date)    0.337 0.0886  0.15452 0.500  1767 0.997
## 
## LOOIC: 1473.761

We get warnings; these should be fixed by increasing the iterations. In addition to fixed effect estimates, we now have an estimate for the site-level variance (sigma [1|site]) in our summary table.

Accessing the random intercepts

In order to get the actual random intercept values, we use the ranef function. Note that this function behaves like the lme4 version, not like the unmarked version. A further caution is that when using an effects parameterization, ranef always returns the complete random intercept/slope term for a group (i.e., the mean + random effect, not just the random part).

ran <- ranef(fit_stack, submodel="state")
head(ran$site[[1]])
##           1           2           3           4           5           6 
## -1.89508002 -1.99838584 -0.30322435  0.06205105  0.39474369 -1.72122316

You can also generate summary statistics for each random intercept:

ran <- ranef(fit_stack, submodel="state", summary=TRUE)
head(ran$site[[1]])
##      Estimate       SD      2.5%    97.5%
## 1 -1.89508002 1.785040 -5.680523 1.561208
## 2 -1.99838584 1.753822 -5.530354 1.201788
## 3 -0.30322435 1.373765 -3.039763 2.464200
## 4  0.06205105 1.336437 -2.665990 2.442716
## 5  0.39474369 1.214734 -1.977876 3.013954
## 6 -1.72122316 1.753082 -5.287677 1.521753

References

Bates, Douglas, Martin Mächler, Ben Bolker, and Steve Walker. 2015. “Fitting Linear Mixed-Effects Models Using Lme4.” Journal of Statistical Software 67 (1). https://doi.org/10.18637/jss.v067.i01.
Carpenter, Bob, Andrew Gelman, Matthew D. Hoffman, Daniel Lee, Ben Goodrich, Michael Betancourt, Marcus Brubaker, Jiqiang Guo, Peter Li, and Allen Riddell. 2017. “Stan: A Probabilistic Programming Language.” Journal of Statistical Software 76 (1). https://doi.org/10.18637/jss.v076.i01.
Dail, D., and L. Madsen. 2011. “Models for Estimating Abundance from Repeated Counts of an Open Metapopulation.” Biometrics 67: 577–87. https://doi.org/10.1111/j.1541-0420.2010.01465.x.
Fiske, Ian, Richard Chandler, and others. 2011. “Unmarked: An R Package for Fitting Hierarchical Models of Wildlife Occurrence and Abundance.” Journal of Statistical Software 43 (10): 1–23. https://doi.org/10.18637/jss.v043.i10.
MacKenzie, Darryl I., James D. Nichols, James E. Hines, Melinda G. Knutson, and Alan B. Franklin. 2003. “Estimating Site Occupancy, Colonization, and Local Extinction When a Species Is Detected Imperfectly.” Ecology 84: 2200–2207. https://doi.org/10.1890/02-3090.
Schmid, Hans, Niklaus Zbinden, and Verena Keller. 2004. “Überwachung Der Bestandsentwicklung Häufiger Brutvögel in Der Schweiz.” Swiss Ornithological Institute Sempach Switzerland.