The smerc package is focused on Statistical Methods for Regional Counts. It particularly focuses on the spatial scan method (Kulldorff, 1997) and many of its extensions.
In what follows, we demonstrate some of the basic functionality of the smerc package using 281 observations related to leukeumia cases in an 8 county area of the state of New York. The data were made available in Waller and Gotway (2005) and details are provided there.
The leukemia data are available in the nydf
dataframe.
Each row of the dataframe contains information regarding a different
region. Each row of the dataframe includes longitude
and
latitude
information for the centroid of each region,
transformed x
and y
coordinates available in
the original dataset provided by Waller and Gotway (2005), as well as
the population
of each region and the number of leukemia
cases
.
## # This research was partially supported under NSF grants 1463642 and 1915277
## 'data.frame': 281 obs. of 6 variables:
## $ longitude : num -75.9 -75.9 -75.9 -75.9 -75.9 ...
## $ latitude : num 42.1 42.1 42.1 42.1 42.1 ...
## $ population: num 3540 3560 3739 2784 2571 ...
## $ cases : num 3.08 4.08 1.09 1.07 3.06 ...
## $ x : num 4.07 4.64 5.71 7.61 7.32 ...
## $ y : num -67.4 -66.9 -67 -66 -67.3 ...
We plot the study area below using information in the data object.
## Linking to GEOS 3.12.1, GDAL 3.8.4, PROJ 9.4.0; sf_use_s2() is TRUE
The spatial scan test can be perfomed using the
scan.test
function. The user must supply the coordinates of
the region centroids, the number of cases in each region, and the
population (i.e., the number persons at risk). There are many additional
optional arguments described in the documentation. In the following
example, we reduce the number of Monte Carlo simulations used to
estimate the p-value.
coords = nydf[,c("x", "y")] # extract coordinates
cases = nydf$cases # extract cases
pop = nydf$population # extract population
scan_out = scan.test(coords, cases, pop, nsim = 99) # perform scan test
## computing statistics for simulated data:
The scan.test
function (and most of the other
*.test
functions in the smerc package)
will return an object of class smerc_cluster
.
## [1] "smerc_cluster"
A smerc_cluster
object has a default print option that
summarizes relevant details about the object produced by the
function.
In this case, we receive the following:
scan_out # print scan.test results
## method: circular scan
## statistic: poisson
## simulation: multinomial
## realizations: 99
## population upperbound: 50%
## minimum cases: 2
In this particular case, we see that the scan_out
object
summarizes the results of the circular scan
method using a
poisson
statistic (the other choice is
binomial
). The Monte Carlo p-value was estimated using
99
realizations from a multinomial
distribution (poisson
and binomial
are the
other possibilities). The candidate zones were limited to having no more
than 50%
of the total population and at least
2
cases had to be in a candidate zone to be considered a
cluster. Other methods will provide some of the same information, but
some of the information will be different based on the relevant
parameters of the models.
A smerc_cluster
also has a summary
function
that summarizes the significant (or most likely) clusters returned by
the method.
## nregions max_dist cases ex rr stat p
## 1 24 12.3 95.33108 55.8 1.8 13.1 0.01
## 2 11 26.5 49.71990 27.1 1.9 8.0 0.08
The summary
of scan_out
reveals that there
were two significant clusters. The first cluster was comprised of 24
regions with a maximum distance of 12.3 between centroids. The number of
cases observed in the cluster is 95.33, while 55.8 cases were expected.
The relative risk of the cluster is 1.8, with an associated test
statistic of 13.1, and a Monte Carlo p-value of 0.01. A second cluster
is also summarized with similar information.
A very basic plot
mechanism is provided for
smerc_cluster
objects. The plot will show the locations of
every centroid, with significant clusters shown with different colors
and symbols. An attempt is also made to show the connectivity of the
regions. In the plot below, we see the two significant clusters shown in
yellow (middle) and purple (bottom).
If you want to extract the detected clusters from a
smerc_cluster
object, then the clusters
function can be used. The function returns a list: each element of the
list is a vector with the indices of the regions included in the
cluster. The first element of the list contains the regions comprising
the most likely cluster, the second element contains the regions for the
second most likely cluster, etc.
## [[1]]
## [1] 52 50 49 48 15 16 1 51 37 38 47 2 14 39 53 13 34 40 3 44 17 43 12 46
##
## [[2]]
## [1] 88 86 87 89 92 91 85 93 84 90 259
If a polygon-like structure is associated with each region, then the
color.clusters
can be used to easily produce nicer plots.
E.g., we can use the nysf
object to create a nicer map of
our results.
Other scan methods available include:
elliptic.test
}.flex.test
}.rflex.test
}.uls.test
}.dmst.test
}.edmst.test
}.dc.test
}.mlink.test
}.fast.test
}.mlf.test
}.Other non-scan method for the detection of clusters and/or clustering of cases for regional data are provided.
bn.test
performs the Besag-Newell test (Besag and
Newell, 1991) and returns a smerc_cluster
. The
print
, summary
, and plot
methods
are available, as before.
bn_out = bn.test(coords = coords, cases = cases, pop = pop, cstar = 20,
alpha = 0.01) # perform besag-newell test
bn_out # print results
## method: Besag-Newell
## case radius: 20
## modified p-value: FALSE
summary(bn_out) # summarize results
## nregions max_dist cases ex rr stat p
## 1 5 3.0 20.42516 10.2 2.0 5 0.004132413
## 2 5 13.4 20.15705 10.5 1.9 5 0.006017272
## 3 3 2.0 20.39443 10.8 1.9 3 0.008039295
## 4 5 2.7 25.45904 11.0 2.4 5 0.009113555
## 5 5 2.4 20.29863 11.1 1.9 5 0.009962904
plot(bn_out) # plot results
The Cluster Evaluation Permutation Procedure (CEPP) proposed by
Turnbull et al. (1990) can be performed using cepp.test
.
The function returns a smerc_cluster
object with
print
, summary
, and plot
functions.
# perform CEPP test
cepp_out = cepp.test(coords = coords, cases = cases, pop = pop,
nstar = 5000, nsim = 99, alpha = 0.1)
cepp_out # print results
## method: CEPP
## simulation: multinomial
## realizations: 99
## population radius: 5000
summary(cepp_out) # summarize results
## nregions max_dist cases ex rr stat p
## 1 2 3.6 9.589796 2.8 3.5 9.6 0.04
plot(cepp_out) # plot results
Tango’s clustering detection test (Tango, 1995) can be performed via
the tango.test
function. The dweights
function
can be used to construct the weights for the test. The
tango.test
function produces and object of class
tango
, which has a native print
function and a
plot
function that compares the goodness-of-fit and spatial
autocorrelation components of Tango’s statistic for the observed data
and the simulated data sets.
w = dweights(coords, kappa = 1) # construct weights matrix
tango_out = tango.test(cases, pop, w, nsim = 49) # perform tango's test
tango_out # print results
## method: Tango's index
## index: 0.0029
## goodness-of-fit component: 0.0026
## spatial autocorrelation component: 0.00033
## chi-square statistic: 210
## chi-square df: 110
## chi-square p-value: 1.4e-08
## Monte Carlo p-value: 0.02
plot(tango_out) # plot results
Nearly all cluster detection methods have both a *.zones
function that returns all the candidate zones for the method and a
*.sim
function that is used to produce results for data
simulated under the null hypothesis. These are unlikely to interest most
users, but may be useful for individuals wanting to better understand
how these methods work or are interested in developing new methods based
off existing ones. The *.sim
functions are not meant for
general use, as they are written to take very specific arguments that
the user could easily misuse.
As an example, the following code produces all elliptical-shaped zones considered by the elliptic scan method (Kulldorff et al., 2006) using a population upperbound of no more than 10%. The function returns a list with the 248,213 candidate zones, as well as the associated shape and angle.
# obtain zones for elliptical scan method
ezones = elliptic.zones(coords, pop, ubpop = 0.1)
# view structure of ezones
str(ezones)
## List of 3
## $ zones:List of 239228
## ..$ : int 1
## ..$ : int [1:2] 1 2
## ..$ : int [1:3] 1 2 15
## ..$ : int [1:4] 1 2 15 49
## ..$ : int [1:5] 1 2 15 49 50
## ..$ : int [1:6] 1 2 15 49 50 13
## ..$ : int [1:7] 1 2 15 49 50 13 3
## ..$ : int [1:8] 1 2 15 49 50 13 3 14
## ..$ : int [1:9] 1 2 15 49 50 13 3 14 48
## ..$ : int [1:10] 1 2 15 49 50 13 3 14 48 47
## ..$ : int [1:11] 1 2 15 49 50 13 3 14 48 47 ...
## ..$ : int [1:12] 1 2 15 49 50 13 3 14 48 47 ...
## ..$ : int [1:13] 1 2 15 49 50 13 3 14 48 47 ...
## ..$ : int [1:14] 1 2 15 49 50 13 3 14 48 47 ...
## ..$ : int [1:15] 1 2 15 49 50 13 3 14 48 47 ...
## ..$ : int [1:16] 1 2 15 49 50 13 3 14 48 47 ...
## ..$ : int [1:17] 1 2 15 49 50 13 3 14 48 47 ...
## ..$ : int [1:18] 1 2 15 49 50 13 3 14 48 47 ...
## ..$ : int [1:19] 1 2 15 49 50 13 3 14 48 47 ...
## ..$ : int [1:20] 1 2 15 49 50 13 3 14 48 47 ...
## ..$ : int [1:21] 1 2 15 49 50 13 3 14 48 47 ...
## ..$ : int [1:22] 1 2 15 49 50 13 3 14 48 47 ...
## ..$ : int [1:23] 1 2 15 49 50 13 3 14 48 47 ...
## ..$ : int [1:24] 1 2 15 49 50 13 3 14 48 47 ...
## ..$ : int [1:25] 1 2 15 49 50 13 3 14 48 47 ...
## ..$ : int [1:26] 1 2 15 49 50 13 3 14 48 47 ...
## ..$ : int [1:27] 1 2 15 49 50 13 3 14 48 47 ...
## ..$ : int [1:28] 1 2 15 49 50 13 3 14 48 47 ...
## ..$ : int [1:29] 1 2 15 49 50 13 3 14 48 47 ...
## ..$ : int 2
## ..$ : int [1:3] 2 1 3
## ..$ : int [1:4] 2 1 3 13
## ..$ : int [1:5] 2 1 3 13 15
## ..$ : int [1:6] 2 1 3 13 15 14
## ..$ : int [1:7] 2 1 3 13 15 14 49
## ..$ : int [1:8] 2 1 3 13 15 14 49 47
## ..$ : int [1:9] 2 1 3 13 15 14 49 47 12
## ..$ : int [1:10] 2 1 3 13 15 14 49 47 12 50
## ..$ : int [1:11] 2 1 3 13 15 14 49 47 12 50 ...
## ..$ : int [1:12] 2 1 3 13 15 14 49 47 12 50 ...
## ..$ : int [1:13] 2 1 3 13 15 14 49 47 12 50 ...
## ..$ : int [1:14] 2 1 3 13 15 14 49 47 12 50 ...
## ..$ : int [1:15] 2 1 3 13 15 14 49 47 12 50 ...
## ..$ : int [1:16] 2 1 3 13 15 14 49 47 12 50 ...
## ..$ : int [1:17] 2 1 3 13 15 14 49 47 12 50 ...
## ..$ : int [1:18] 2 1 3 13 15 14 49 47 12 50 ...
## ..$ : int [1:20] 2 1 3 13 15 14 49 47 12 50 ...
## ..$ : int [1:21] 2 1 3 13 15 14 49 47 12 50 ...
## ..$ : int [1:22] 2 1 3 13 15 14 49 47 12 50 ...
## ..$ : int [1:23] 2 1 3 13 15 14 49 47 12 50 ...
## ..$ : int [1:24] 2 1 3 13 15 14 49 47 12 50 ...
## ..$ : int [1:25] 2 1 3 13 15 14 49 47 12 50 ...
## ..$ : int [1:27] 2 1 3 13 15 14 49 47 12 50 ...
## ..$ : int [1:28] 2 1 3 13 15 14 49 47 12 50 ...
## ..$ : int 3
## ..$ : int [1:2] 3 13
## ..$ : int [1:3] 3 13 2
## ..$ : int [1:4] 3 13 2 12
## ..$ : int [1:5] 3 13 2 12 14
## ..$ : int [1:6] 3 13 2 12 14 5
## ..$ : int [1:7] 3 13 2 12 14 5 1
## ..$ : int [1:8] 3 13 2 12 14 5 1 10
## ..$ : int [1:9] 3 13 2 12 14 5 1 10 11
## ..$ : int [1:10] 3 13 2 12 14 5 1 10 11 4
## ..$ : int [1:11] 3 13 2 12 14 5 1 10 11 4 ...
## ..$ : int [1:12] 3 13 2 12 14 5 1 10 11 4 ...
## ..$ : int [1:13] 3 13 2 12 14 5 1 10 11 4 ...
## ..$ : int [1:14] 3 13 2 12 14 5 1 10 11 4 ...
## ..$ : int [1:15] 3 13 2 12 14 5 1 10 11 4 ...
## ..$ : int [1:16] 3 13 2 12 14 5 1 10 11 4 ...
## ..$ : int [1:17] 3 13 2 12 14 5 1 10 11 4 ...
## ..$ : int [1:18] 3 13 2 12 14 5 1 10 11 4 ...
## ..$ : int [1:19] 3 13 2 12 14 5 1 10 11 4 ...
## ..$ : int [1:20] 3 13 2 12 14 5 1 10 11 4 ...
## ..$ : int [1:21] 3 13 2 12 14 5 1 10 11 4 ...
## ..$ : int [1:22] 3 13 2 12 14 5 1 10 11 4 ...
## ..$ : int [1:23] 3 13 2 12 14 5 1 10 11 4 ...
## ..$ : int [1:24] 3 13 2 12 14 5 1 10 11 4 ...
## ..$ : int [1:25] 3 13 2 12 14 5 1 10 11 4 ...
## ..$ : int [1:26] 3 13 2 12 14 5 1 10 11 4 ...
## ..$ : int [1:27] 3 13 2 12 14 5 1 10 11 4 ...
## ..$ : int [1:29] 3 13 2 12 14 5 1 10 11 4 ...
## ..$ : int 4
## ..$ : int [1:2] 4 6
## ..$ : int [1:3] 4 6 5
## ..$ : int [1:4] 4 6 5 12
## ..$ : int [1:5] 4 6 5 12 35
## ..$ : int [1:6] 4 6 5 12 35 7
## ..$ : int [1:7] 4 6 5 12 35 7 10
## ..$ : int [1:8] 4 6 5 12 35 7 10 3
## ..$ : int [1:9] 4 6 5 12 35 7 10 3 11
## ..$ : int [1:10] 4 6 5 12 35 7 10 3 11 9
## ..$ : int [1:11] 4 6 5 12 35 7 10 3 11 9 ...
## ..$ : int [1:12] 4 6 5 12 35 7 10 3 11 9 ...
## ..$ : int [1:13] 4 6 5 12 35 7 10 3 11 9 ...
## ..$ : int [1:14] 4 6 5 12 35 7 10 3 11 9 ...
## ..$ : int [1:15] 4 6 5 12 35 7 10 3 11 9 ...
## ..$ : int [1:16] 4 6 5 12 35 7 10 3 11 9 ...
## ..$ : int [1:17] 4 6 5 12 35 7 10 3 11 9 ...
## .. [list output truncated]
## $ shape: num [1:239228] 1 1 1 1 1 1 1 1 1 1 ...
## $ angle: num [1:239228] 90 90 90 90 90 90 90 90 90 90 ...
Assuncao, R.M., Costa, M.A., Tavares, A. and Neto, S.J.F. (2006). Fast detection of arbitrarily shaped disease clusters, Statistics in Medicine, 25, 723-742.
Besag, J. and Newell, J. (1991). The detection of clusters in rare diseases, Journal of the Royal Statistical Society, Series A, 154, 327-333.
Costa, M.A. and Assuncao, R.M. and Kulldorff, M. (2012) Constrained spanning tree algorithms for irregularly-shaped spatial clustering, Computational Statistics & Data Analysis, 56(6), 1771-1783.
Kulldorff, M. (1997) A spatial scan statistic. Communications in Statistics - Theory and Methods, 26(6): 1481-1496,
Kulldorff, M., Huang, L., Pickle, L. and Duczmal, L. (2006) An elliptic spatial scan statistic. Statististics in Medicine, 25:3929-3943.
Neill, D. B. (2012), Fast subset scan for spatial pattern detection. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 74: 337-360.
Patil, G.P. & Taillie, C. Upper level set scan statistic for detecting arbitrarily shaped hotspots. Environmental and Ecological Statistics (2004) 11(2):183-197.
Tango, T. (1995) A class of tests for detecting “general” and “focused” clustering of rare diseases. Statistics in Medicine. 14, 2323-2334.
Tango, T., & Takahashi, K. (2005). A flexibly shaped spatial scan statistic for detecting clusters. International journal of health geographics, 4(1), 11. Kulldorff, M. (1997) A spatial scan statistic. Communications in Statistics – Theory and Methods 26, 1481-1496.
Tango, T. and Takahashi, K. (2012), A flexible spatial scan statistic with a restricted likelihood ratio for detecting disease clusters. Statist. Med., 31: 4207-4218.
Turnbull, Bruce W., Iwano, Eric J., Burnett, William S., Howe, Holly L., Clark, Larry C. (1990). Monitoring for Clusters of Disease: Application to Leukemia Incidence in Upstate New York, American Journal of Epidemiology, 132(supp1):136-143.
Waller, L.A. and Gotway, C.A. (2005). Applied Spatial Statistics for Public Health Data. Hoboken, NJ: Wiley.
Yao, Z., Tang, J., & Zhan, F. B. (2011). Detection of arbitrarily-shaped clusters using a neighbor-expanding approach: A case study on murine typhus in South Texas. International journal of health geographics, 10(1), 1.