Title: | Multiscale Inference for Nonparametric Time Trend(s) |
---|---|
Description: | Performs a multiscale analysis of a nonparametric regression or of several nonparametric regressions with time series errors. In case of a single regression, the package makes it possible to detect the regions where the trend function is increasing or decreasing. In case of multiple regressions, the test identifies the regions where the trend functions differ from each other. See Khismatullina and Vogt (2020) <doi:10.1111/rssb.12347>, Khismatullina and Vogt (2022) <doi:10.48550/arXiv.2209.10841> and Khismatullina and Vogt (2023) <doi:10.1016/j.jeconom.2021.04.010> for more details on theory and applications. |
Authors: | Marina Khismatullina [aut, cre], Michael Vogt [aut] |
Maintainer: | Marina Khismatullina <[email protected]> |
License: | GPL (>= 2) |
Version: | 0.2.1 |
Built: | 2024-11-20 06:24:11 UTC |
Source: | CRAN |
This package performs a multiscale analysis of a single nonparametric time trend (Khismatullina and Vogt (2020)) or of multiple nonparametric time trends (Khismatullina and Vogt (2022), Khismatullina and Vogt (2023)).
In case of a single nonparametric regression, the package provides a multiscale method to test qualitative hypotheses about the nonparametric time trend m in the model Y_t = m(t/T) + eps_t with time series errors eps_t. The method was first proposed in Khismatullina and Vogt (2020). It allows to test for shape properties (areas of monotonic decrease or increase) of the trend m.
This method requires an estimator of the long-run error variance sigma^2 = sum_l cov(eps_0, eps_l). Hence, the package also provides a difference-based estimator for the case that the errors belong to the class of AR(p) processes. The estimator was also proposed in Khismatullina and Vogt (2020).
In case of multiple nonparametric regressions, we provide the multiscale method to test qualitative hypotheses about the nonparametric time trends in the context of epidemic modelling. Specifically, we assume that we observe samples of count data X_it, where the X_it are quasi-Poisson distributed with time-varying intensity parameters lambda_i(t/T). The multiscale method allows to test whether the intensity parameters are different or not, and if they are, it detects with a prespecified significance level the regions where these differences most probably occur. The method was introduced in Khismatullina and Vogt (2023) and can be used for comparing the rates of infection of COVID-19 across countries.
Khismatullina M, Vogt M (2020). “Multiscale inference and long-run variance estimation in nonparametric regression with time series errors.” Journal of the Royal Statistical Society: Series B (Statistical Methodology). doi:10.1111/rssb.12347.
Khismatullina M, Vogt M (2023). “Nonparametric comparison of epidemic time trends: The case of COVID-19.” Journal of Econometrics, 232(1), 87-108. ISSN 0304-4076, doi:10.1016/j.jeconom.2021.04.010.
Given a set of intervals, this function computes the corresponding subset of minimal intervals, which are defined as follows. For a given set of intervals, an interval from this set is called minimal if it does not contain any other interval from the set as a proper subset.
This function is needed for illustrative purposes. The set of all intervals where our test rejects the null hypothesis may be quite large; hence, we would like to focus our attention on a smaller subset for which we are still able to make simultaneous confidence intervals. This subset is the subset of minimal intervals, and it helps us to precisely locate the intervals of further interest.
More details can be found in Duembgen (2002) and Khismatullina and Vogt (2019, 2020)
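The definition above is easy to state in code. The following Python function is an illustrative reimplementation (the package itself is written in R): it keeps exactly those intervals that contain no other interval of the set as a proper subset.

```python
def compute_minimal_intervals(intervals):
    """Return the subset of minimal intervals: intervals that do not
    contain any other interval of the set as a proper subset."""
    minimal = []
    for a, b in intervals:
        # (a, b) is minimal if no other interval (c, d) of the set
        # is contained in [a, b]
        if not any((c, d) != (a, b) and a <= c and d <= b
                   for c, d in intervals):
            minimal.append((a, b))
    return minimal

# Mirrors the R example below: intervals [0, 2], [0.5, 2], [1, 2]
print(compute_minimal_intervals([(0, 2), (0.5, 2), (1, 2)]))  # [(1, 2)]
```

Here [0, 2] and [0.5, 2] both contain [1, 2] as a proper subset, so only [1, 2] is minimal.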
compute_minimal_intervals(dataset)
dataset |
Set of the intervals. It needs to contain the following columns: "startpoint" - left end of the interval; "endpoint" - right end of the interval. |
Subset of minimal intervals
startpoint <- c(0, 0.5, 1) endpoint <- c(2, 2, 2) dataset <- data.frame(startpoint, endpoint) minimal_ints <- compute_minimal_intervals(dataset)
Computes quantiles of the Gaussian version of the test statistics, which are used to approximate the critical values for the multiscale test.
compute_quantiles( t_len, n_ts = 1, grid = NULL, ijset = NULL, sigma = 1, deriv_order = 0, sim_runs = 1000, probs = seq(0.5, 0.995, by = 0.005), correction = TRUE, epidem = FALSE )
t_len |
Sample size. |
n_ts |
Number of time series analyzed. Default is 1. |
grid |
Grid of location-bandwidth points as produced by the function construct_grid or construct_weekly_grid. Default is NULL. |
ijset |
A matrix of integers. In case of multiple time series, we need to know which pairwise comparisons to perform. This matrix consists of all pairs of indices (i, j) denoting the time series that are compared. Default is NULL. |
sigma |
Value of the square root of the long-run error variance. Default is 1. |
deriv_order |
In case of a single time series analysed, this parameter denotes the order of the derivative of the trend function that is being estimated. Default is 0. |
sim_runs |
Number of simulation runs to produce quantiles. Default is 1000. |
probs |
A numeric vector of probability levels for which the quantiles are computed. Default is seq(0.5, 0.995, by = 0.005). |
correction |
Logical variable, TRUE (by default) if an additive correction term is used for the multiscale statistics as in Khismatullina and Vogt (2020). |
epidem |
Logical variable, TRUE if we are dealing with epidemic time trends. Default is FALSE. |
Matrix with 2 rows, where the first row contains the vector of probabilities (probs) and the second row contains the corresponding quantiles of the distribution of the Gaussian statistic.
compute_quantiles(100)
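The simulation idea behind these quantiles can be sketched as follows. This Python snippet is a simplified stand-in, not the package's R code: the actual Gaussian multiscale statistic is more involved, while here each simulation run just draws the maximum of t_len standard normals, and the empirical quantiles of the simulated values are returned.

```python
import math
import random

def simulate_gaussian_quantiles(t_len, sim_runs=1000,
                                probs=(0.5, 0.9, 0.95), seed=1):
    """Monte Carlo quantiles of a Gaussian statistic. Each run draws
    the maximum of t_len independent standard normals (a simplified
    stand-in for the Gaussian version of the multiscale statistic)."""
    rng = random.Random(seed)
    stats = sorted(max(rng.gauss(0.0, 1.0) for _ in range(t_len))
                   for _ in range(sim_runs))
    # empirical quantile: order statistic at position ceil(p * sim_runs)
    return {p: stats[min(sim_runs - 1, math.ceil(p * sim_runs) - 1)]
            for p in probs}
```

The critical value for significance level alpha is then the (1 - alpha) quantile of the simulated distribution.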
Computes quantiles of the Gaussian version of the test statistics, which are used to approximate the critical values for the multiscale test.
compute_quantiles_2( t_len, n_ts = 1, grid = NULL, ijset = NULL, sigma = 1, deriv_order = 0, sim_runs = 1000, probs = seq(0.5, 0.995, by = 0.005), correction = TRUE, epidem = FALSE, numCores = NULL )
t_len |
Sample size. |
n_ts |
Number of time series analyzed. Default is 1. |
grid |
Grid of location-bandwidth points as produced by the function construct_grid or construct_weekly_grid. Default is NULL. |
ijset |
A matrix of integers. In case of multiple time series, we need to know which pairwise comparisons to perform. This matrix consists of all pairs of indices (i, j) denoting the time series that are compared. Default is NULL. |
sigma |
Value of the square root of the long-run error variance. Default is 1. |
deriv_order |
In case of a single time series analysed, this parameter denotes the order of the derivative of the trend function that is being estimated. Default is 0. |
sim_runs |
Number of simulation runs to produce quantiles. Default is 1000. |
probs |
A numeric vector of probability levels for which the quantiles are computed. Default is seq(0.5, 0.995, by = 0.005). |
correction |
Logical variable, TRUE (by default) if an additive correction term is used for the multiscale statistics as in Khismatullina and Vogt (2020). |
epidem |
Logical variable, TRUE if we are dealing with epidemic time trends. Default is FALSE. |
numCores |
Integer value used to indicate how many cores are used while calculating the critical value. Default is NULL, in which case the number of cores is chosen by a default formula. |
Matrix with 2 rows, where the first row contains the vector of probabilities (probs) and the second row contains the corresponding quantiles of the distribution of the Gaussian statistic.
compute_quantiles_2(100, numCores = 2)
Calculates the value of the test statistics both for single time series analysis and multiple time series analysis.
compute_statistics( data, sigma = 1, sigma_vec = 1, n_ts = 1, grid = NULL, ijset = NULL, deriv_order = 0, epidem = FALSE )
data |
Vector (in case of n_ts = 1) or matrix (in case of n_ts > 1) that contains the time series that need to be analyzed. In the latter case, each column of the matrix must contain one time series. |
sigma |
The estimator of the square root of the long-run error variance, relevant in case n_ts = 1. Default is 1. |
sigma_vec |
Vector that consists of estimators of the square root of the long-run error variances of the individual time series, relevant in case n_ts > 1. Default is 1. |
n_ts |
Number of time series analysed. Default is 1. |
grid |
Grid of location-bandwidth points as produced by the functions construct_grid or construct_weekly_grid. Default is NULL. |
ijset |
In case of multiple time series (n_ts > 1), we need to know which pairs of time series to compare. This matrix consists of all pairs of indices (i, j) denoting the time series that are compared. Default is NULL. |
deriv_order |
In case of a single time series, this denotes the order of the derivative of the trend that we estimate. Default is 0. |
epidem |
Logical variable, TRUE if we are using dealing with epidemic time trends. Default is FALSE. |
In case of n_ts = 1, the function returns a list with the following elements:
stat |
Value of the multiscale statistics. |
gset_with_vals |
A matrix that contains the values of the normalised kernel averages for each pair of location-bandwidth with the corresponding location and bandwidth. |
In case of n_ts > 1, the function returns a list with the following elements:
stat |
Value of the multiscale statistics. |
stat_pairwise |
Matrix of the values of the pairwise statistics. |
ijset |
The matrix that consists of all pairs of indices
|
gset_with_vals |
A list of matrices, each matrix corresponding to a specific pairwise comparison. The order of the list is determined by ijset. Each matrix contains the values of the normalised kernel averages for each location-bandwidth pair, together with the corresponding location and bandwidth. |
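To illustrate how the individual normalized kernel averages are aggregated into one multiscale statistic, here is a Python sketch. The additive correction lambda(h) = sqrt(2 * log(1 / (2h))) follows the form used in Khismatullina and Vogt (2020), but treat the details as an assumption; the input dictionary is a hypothetical stand-in for the grid of normalized averages produced by the package.

```python
import math

def multiscale_statistic(normalized_averages):
    """Aggregate normalized kernel averages into a multiscale statistic:
    the maximum over all location-bandwidth points (u, h) of the absolute
    normalized average minus the additive scale correction
    lambda(h) = sqrt(2 * log(1 / (2 * h))).
    normalized_averages maps (u, h) -> psi_hat(u, h) / sigma_hat."""
    return max(abs(v) - math.sqrt(2.0 * math.log(1.0 / (2.0 * h)))
               for (u, h), v in normalized_averages.items())
```

The correction penalizes small bandwidths, which would otherwise dominate the maximum because many more small-scale averages are compared simultaneously.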
Computes the location-bandwidth grid for the multiscale test.
construct_grid(t, u_grid = NULL, h_grid = NULL, deletions = NULL)
t |
Sample size. |
u_grid |
Vector of location points in the unit interval [0, 1]. Default is NULL. |
h_grid |
Vector of bandwidths, each bandwidth is supposed to lie between 0 and 1/2. Default is NULL. |
deletions |
Logical vector of length length(u_grid) * length(h_grid). Each element is either TRUE, which means that the corresponding location-bandwidth point (u, h) is deleted from the grid, or FALSE, which means that it is kept. Default is NULL. |
A list with the following elements:
gset |
Matrix of location-bandwidth points (u, h) that remain after deletions. |
bws |
Vector of bandwidths (after deletions). |
lens |
Vector of length = length(bws), lens[i] gives the number of locations in the grid for the i-th bandwidth level. |
gtype |
Type of grid that is used, either 'default' or 'non-default'. |
gset_full |
Matrix of all location-bandwidth pairs (u, h), including deleted ones. |
pos_full |
Logical vector indicating which points (u, h) of gset_full are contained in gset. |
construct_grid(100) construct_grid(100, u_grid = seq(from = 0.05, to = 1, by = 0.05), h_grid = c(0.1, 0.2, 0.3, 0.4))
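Conceptually, the grid is the Cartesian product of the location points and the bandwidths, minus any deleted points. A Python sketch of this idea (the ordering assumed for the deletions vector is an assumption, not taken from the package):

```python
def build_grid(u_grid, h_grid, deletions=None):
    """Cartesian product of locations and bandwidths, optionally
    dropping points flagged TRUE in `deletions` (same ordering)."""
    full = [(u, h) for h in h_grid for u in u_grid]
    if deletions is None:
        return full
    return [pt for pt, drop in zip(full, deletions) if not drop]

grid = build_grid([0.25, 0.5, 0.75], [0.1, 0.2])
len(grid)  # 6
```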
Computes the location-bandwidth weekly grid for the multiscale test.
construct_weekly_grid(t, min_len = 7, nmbr_of_wks = 4)
t |
Sample size. |
min_len |
Minimal length of the interval considered. The grid then consists of intervals with lengths min_len, 2 * min_len, 3 * min_len, ... Default is 7, i.e. a week. |
nmbr_of_wks |
Number that determines the longest intervals in the grid: the length of this interval is calculated then as min_len * nmbr_of_wks. Default is 4. |
A list with the following elements:
gset |
Matrix of location-bandwidth points (u, h). |
bws |
Vector of bandwidths. |
lens |
Vector of length = length(bws), lens[i] gives the number of locations in the grid for the i-th bandwidth level. |
gtype |
Type of grid that is used, always 'default'. |
gset_full |
Matrix of all location-bandwidth pairs (u, h). |
construct_weekly_grid(100) construct_weekly_grid(100, min_len = 7, nmbr_of_wks = 2)
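The weekly grid consists of intervals whose lengths are multiples of min_len, up to min_len * nmbr_of_wks. The following Python sketch enumerates such intervals; the exact placement of the start points in the package may differ, so this is illustrative only.

```python
def construct_weekly_grid_sketch(t, min_len=7, nmbr_of_wks=4):
    """All intervals [s, s + L - 1] inside 1..t whose length L runs over
    min_len, 2 * min_len, ..., nmbr_of_wks * min_len."""
    grid = []
    for k in range(1, nmbr_of_wks + 1):
        length = k * min_len
        for s in range(1, t - length + 2):
            grid.append((s, s + length - 1))
    return grid
```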
Data on the geographic distribution of COVID-19 cases worldwide (© ECDC [2005-2019])
data("covid")
A matrix with 99 rows and 41 columns. Each column corresponds to one country, with the name of the country (denoted by three letters) being the name of the column.
Each entry in the dataset denotes the number of new cases of infection per day and per country. In order to make the data comparable across countries, we take the day of the 100th confirmed case in each country as the starting date t = 1. This way of “normalizing” the data is common practice (Cohen and Kupferschmidt (2020)).
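The normalization described above can be sketched in a few lines. This Python snippet (illustrative, not part of the package) drops all days before the cumulative case count reaches 100:

```python
def normalize_from_100th_case(daily_cases):
    """Drop all days before the cumulative number of confirmed cases
    reaches 100, so that t = 1 is the day of the 100th confirmed case."""
    total = 0
    for i, cases in enumerate(daily_cases):
        total += cases
        if total >= 100:
            return daily_cases[i:]
    return []  # fewer than 100 cases overall
```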
A difference-based estimator for the AR coefficients and the long-run variance in case the error terms of a nonparametric regression model are AR(p).
Specifically, we assume that we observe Y_1, ..., Y_T that satisfy the following equation: Y_t = m(t/T) + eps_t. Here, m is an unknown function, and the errors eps_t are AR(p) with p known. Specifically, we let eps_t be a process of the form eps_t = a_1 eps_(t-1) + ... + a_p eps_(t-p) + eta_t, where a_1, ..., a_p are unknown coefficients and eta_t are i.i.d. innovations with E[eta_t] = 0 and E[eta_t^2] = nu^2. This function produces an estimator of the long-run variance sigma^2 = sum_l cov(eps_0, eps_l) of the error terms, as well as estimators of the coefficients a_1, ..., a_p and an estimator of the innovation variance nu^2.
The exact estimation procedure as well as description of the tuning parameters needed for this estimation can be found in Khismatullina and Vogt (2020).
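For intuition, once the AR coefficients a_1, ..., a_p and the innovation variance nu^2 are estimated, the long-run variance follows from the standard AR(p) formula sigma^2 = nu^2 / (1 - a_1 - ... - a_p)^2. A Python sketch of this final step (the estimation of the coefficients themselves is the difference-based part and is not shown):

```python
def ar_long_run_variance(coeffs, innovation_var):
    """Long-run variance of a (stationary) AR(p) process:
    sigma^2 = nu^2 / (1 - a_1 - ... - a_p)^2."""
    return innovation_var / (1.0 - sum(coeffs)) ** 2

ar_long_run_variance([0.5], 1.0)  # 4.0
```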
estimate_lrv(data, q, r_bar, p)
data |
A vector of observed values of the time series. |
q , r_bar
|
Tuning parameters. |
p |
AR order of the error terms. |
A list with the following elements:
lrv |
Estimator of the long-run variance of the error terms. |
ahat |
Vector of length p of estimated AR coefficients. |
vareta |
Estimator of the variance of the innovation term |
Khismatullina M, Vogt M (2020). “Multiscale inference and long-run variance estimation in nonparametric regression with time series errors.” Journal of the Royal Statistical Society: Series B (Statistical Methodology).
Carries out the multiscale test, given that the estimates of the long-run variance have already been computed.
multiscale_test( data, sigma = 1, sigma_vec = 1, n_ts = 1, grid = NULL, ijset = NULL, alpha = 0.05, sim_runs = 1000, deriv_order = 0, correction = TRUE, epidem = FALSE )
data |
Vector (in case of n_ts = 1) or matrix (in case of n_ts > 1) that contains the time series that need to be analyzed. In the latter case, each column of the matrix must contain one time series. |
sigma |
The estimator of the square root of the long-run error variance, relevant in case n_ts = 1. Default is 1. |
sigma_vec |
Vector that consists of estimators of the square root of the long-run error variances of the individual time series, relevant in case n_ts > 1. Default is 1. |
n_ts |
Number of time series analysed. Default is 1. |
grid |
Grid of location-bandwidth points as produced by the functions construct_grid or construct_weekly_grid. Default is NULL. |
ijset |
In case of multiple time series (n_ts > 1), we need to know which pairs of time series to compare. This matrix consists of all pairs of indices (i, j) denoting the time series that are compared. Default is NULL. |
alpha |
Significance level. Default is 0.05. |
sim_runs |
Number of simulation runs to produce quantiles. Default is 1000. |
deriv_order |
In case of a single time series, this denotes the order of the derivative of the trend that we estimate. Default is 0. |
correction |
Logical variable, TRUE (by default) if an additive correction term is used for the multiscale statistics as in Khismatullina and Vogt (2020). |
epidem |
Logical variable, TRUE if we are dealing with epidemic time trends. Default is FALSE. |
In case of n_ts = 1, the function returns a list with the following elements:
testing_result |
A string that contains the result of the testing: whether the null hypothesis is rejected or not, at which significance level, and what the value of the test statistic is. |
quant |
Quantile that was used for testing, calculated from the Gaussian distribution. |
statistics |
Value of the multiscale statistics. |
test_matrix |
Matrix of the test results for the multiscale test defined in Khismatullina and Vogt (2019). |
gset_with_vals |
A matrix that contains the values of the normalised kernel averages and test results for each pair of location-bandwidth with the corresponding location and bandwidth. |
In case of n_ts > 1, the function returns a list with the following elements:
quant |
Quantile that was used for testing, calculated from the Gaussian distribution. |
statistics |
Value of the multiscale statistics. |
stat_pairwise |
Matrix of the values of the pairwise statistics. |
ijset |
The matrix that consists of all pairs of indices
|
gset_with_vals |
A list of matrices, each matrix corresponding to a specific pairwise comparison. The order of the list is determined by ijset. Each matrix contains the values of the normalised kernel averages for each location-bandwidth pair, together with the corresponding location and bandwidth. |
Plots a SiZer map from the test results of the multiscale testing procedure.
plot_sizer_map( u_grid, h_grid, test_results, plot_title = NA, greyscale = FALSE, ... )
u_grid |
Vector of location points in the unit interval [0, 1]. |
h_grid |
Vector of bandwidths from the location-bandwidth grid. |
test_results |
Matrix of test results as produced by the function multiscale_test. |
plot_title |
Title of the plot. Default is NA and no title is written. |
greyscale |
Whether SiZer map is plotted in grey scale. Default is FALSE. |
... |
Any further options to be passed to the image function. |
No return value, called for plotting a SiZer map.
Selects the order of the AR(p) error terms based on the long-run variance estimator(s) for a range of tuning parameters and different orders p. This function fits AR(1), ..., AR(9) models for all given time series and calculates different information criteria (FPE, AIC, AICC, SIC, HQ) for each of these fits. The result is the best fit in terms of minimizing the information criteria.
select_order(data, q = NULL, r = 5:15)
data |
One or a number of time series in a matrix. Column names of the matrix should be reasonable |
q |
A vector of integers that consists of different tuning parameters q to analyse. If not supplied, a default range of values is used. |
r |
A vector of integers that consists of different tuning parameters r_bar for the function estimate_lrv. Default is 5:15. |
A list with a number of elements:
orders |
A vector of chosen orders of length equal to the number of time series. For each time series, the order is chosen as the best fit in terms of minimizing the information criteria. |
... |
Matrices with the orders that were selected (among AR(1), ..., AR(9)) for each of the information criteria. |
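As an illustration of criterion-based order selection, the sketch below implements only the AIC rule in Python; the package computes FPE, AIC, AICC, SIC and HQ as well, and the innovation-variance inputs here are hypothetical values, not package output.

```python
import math

def select_order_by_aic(innovation_vars, t_len):
    """Pick the AR order minimizing AIC = log(var_hat) + 2 * p / T,
    given estimated innovation variances for p = 1, 2, ..."""
    best_p, best_aic = None, float("inf")
    for p, var_hat in enumerate(innovation_vars, start=1):
        aic = math.log(var_hat) + 2 * p / t_len
        if aic < best_aic:
            best_p, best_aic = p, aic
    return best_p
```

A larger order is chosen only if the drop in the estimated innovation variance outweighs the 2p/T complexity penalty.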
The CET dataset is the longest instrumental record of temperature in the world. It contains the mean monthly surface air temperatures (in degrees Celsius) from the year 1659 to the present. These monthly temperatures are representative of a roughly triangular area of the United Kingdom enclosed by Lancashire, London and Bristol. Manley (1953, 1974) compiled most of the monthly series, covering 1659 to 1973. These data were updated to 1991 by Parker et al (1992). It is now kept up to date by the Climate Data Monitoring section of the Hadley Centre, Met Office.
data("temperature")
A numeric vector of length 359.
Since 1974 the data have been adjusted to allow for urban warming: currently a correction of -0.2 C is applied to mean temperatures. CET datasets are freely available for use under Open Government License.
https://www.metoffice.gov.uk/hadobs/hadcet/