--- title: "Introduction to SNSeg and Examples" output: rmarkdown::html_vignette description: > Start here if this is your first time using SNSeg. You'll learn the basic philosophy, the type of parameters used for change-points detection, and examples for functions within this R package. vignette: > %\VignetteIndexEntry{SNSeg} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8} --- ```{r, include = FALSE} knitr::opts_chunk$set( collapse = TRUE, comment = "#>" ) ``` ```{r, echo = FALSE, message = FALSE} knitr::opts_chunk$set(collapse = T, comment = "#>") options(tibble.print_min = 4L, tibble.print_max = 4L) library(SNSeg) ``` SNSeg supports change-points estimation for both univariate and multivariate time series (including high-dimensional time series with dimension greater than 10.) using Self-Normalization (SN) based framework. Please read Zhao, Jiang and Shao (2022) for details of the SN-based algorithms. The package contain three functions for change-points estimation: * The function `SNSeg_Uni` works for a univariate time series. * The function `SNSeg_Multi` works for multivariate time series with dimension up to ten. * The function `SNSeg_HD` works for high-dimensional time series with dimension above ten. All functions contain two important input arguments: `grid_size_scale` and `grid_size`. * `grid_size_scale`: the trimming parameter $\epsilon$, a parameter to control the local window size `grid_size`. The range of `grid_size_scale` should be between 0.05 and 0.5. Any input `grid_size_scale` smaller than 0.05 will be automatically changed to 0.05, and any input `grid_size_scale` greater than 0.5 will be automatically changed to 0.5. * `grid_size`: the local window size $h$ for local scanning for each time point. According to Zhao et al. (2022), `grid_size` is a product of `grid_size_scale` and the length of time. Users can set their own `grid_size` or leave it as `NULL` in the input arguments. If `grid_size` is `NULL`, these functions will compute the SN test statistic using the value of `grid_size_scale`. If `grid_size` is set by users, the functions will first calculate `grid_size_scale` by diving the length of time, and then compute the SN test statistic. For the other input arguments: * `ts`: Users should enter a time series for this argument. For `SNSeg_Uni`, the dimension of `ts` should be exactly one in most cases (or two when `paras_to_test = 'bivcor'`); for `SNSeg_Multi`, the dimension must be at least two but no more than ten; for `SNSeg_HD`, the dimension must be greater than ten. * `paras_to_test`: This argument in functions `SNSeg_Uni` and `SNSeg_Multi` allows users to enter the parameter(s) they would like to test the change in. * `confidence`: Users should choose a confidence level among 0.9, 0.95, 0.99, 0.995 and 0.995. A smaller confidence level is easier to reject the null hypothesis and detect a change-point. The package also offers an option to plot the time series (this only works for the univariate time series cases!) Users need to set `plot_SN = TRUE` to visualize the time series, and `est_cp_loc = TRUE` to add the estimated change-point locations in the plot. ## SN Test Statistic Plot: `max_SNsweep` To visualize the computed SN-based test statistic at each time point, users can apply the function `max_SNsweep` by plugging in the output object from one of the functions `SNSeg_Uni`, `SNSeg_Multi` and `SNSeg_HD`. The options `est_cp_loc = TRUE` and `critical_loc = TRUE` are provided to draw the estimated change-point locations and the critical value threshold inside the test statistic plot. `max_SNsweep` also returns the SN-based test statistic for each time point. A large number of test statistics in the output can be messy for users who only seek to generate a plot. To hide the test statistics output, users can create an arbitrary variable name and set it to the `max_SNsweep` function, e.g., `SN_stat <- max_SNsweep(...)`. ## Parameter Estimates of Each Segment Separated by the Detected Change-Points: `SNSeg_estimate` The function `SNSeg_estimate` allows users to compute the parameter estimates (e.g., mean, variance, acf, quantile, etc.) of each of the segments separated by the estimated change-points. To use this function, users should use the output of the functions `SNSeg_Uni`, `SNSeg_Multi` and `SNSeg_HD` as the input of `SNSeg_estimate`. ## S3 methods: summary, print and plot The typical S3 methods `summary`, `print` and `plot` are available to `SNSeg_Uni`, `SNSeg_Multi` and `SNSeg_HD` objects. The `summary` method displays the parameter to be tested, the estimated change-point amount and locations, the `grid_size`, confidence level as well as the critical value of the SN-based test. The `print` method shows the change-point locations. The `plot` method plots the time series, and similar to the argument `plot_SN = TRUE`, the `plot` method allows users to generate time series segmentation plot(s) with the estimated change-point locations. It also provides `ts_index` option to allow users to plot any individual time series they want. Users can apply their preferred color of the change-point(s) within the plot(s) by setting `cpts.col` to any color. We then provide the examples of the functions `SNSeg_Uni`, `SNSeg_Multi` and `SNSeg_HD`. ## Examples of `SNSeg_Uni`: The function `SNSeg_Uni` detect change-points for a univariate time series based on the change in a single or parameters. * For a univariate time series, the input argument `paras_to_test` allows one or a combination of multiple parameters from "mean", "variance", "acf" and a numeric quantile value between 0 and 1. For instance, to test the change in autocorrelation and 80th quantile, users should set `paras_to_test` to c('acf', 0.8). * To test the change in correlation between two univariate time series, the input time series must be two-dimensional and `paras_to_test` should be set to `bivcor`. * Users can also set `paras_to_test` using their own parameters. In this case, the input of `paras_to_test` needs to be a function which returns a numeric value. We provide examples for different cases: ```{r echo=TRUE, eval=FALSE} # Please run the following function before running examples: mix_GauGPD <- function(u,p,trunc_r,gpd_scale,gpd_shape){ indicator <- u

0] <- qtruncnorm(u[indicator>0]/p,a=-Inf,b=trunc_r) rv[indicator<=0] <- qgpd((u[indicator<=0]-p)/(1-p), loc=trunc_r, scale=gpd_scale,shape=gpd_shape) return(rv) } ``` ## Test in a single parameter ### Segmentation for Mean ```{r} set.seed(7) n <- 2000 reptime <- 2 cp_sets <- round(n*c(0,cumsum(c(0.5,0.25)),1)) mean_shift <- c(0.4,0,0.4) rho <- -0.7 ts <- MAR(n, reptime, rho) no_seg <- length(cp_sets)-1 for(index in 1:no_seg){ # Mean shift tau1 <- cp_sets[index]+1 tau2 <- cp_sets[index+1] ts[tau1:tau2,] <- ts[tau1:tau2,] + mean_shift[index] } ts <- ts[,2] # grid_size undefined result <- SNSeg_Uni(ts, paras_to_test = "mean", confidence = 0.9, grid_size_scale = 0.05, grid_size = NULL, plot_SN = FALSE, est_cp_loc = FALSE) # grid_size defined & generate time series segmentation plot result <- SNSeg_Uni(ts, paras_to_test = "mean", confidence = 0.9, grid_size_scale = 0.05, grid_size = 116, plot_SN = TRUE, est_cp_loc = TRUE) # Estimated change-point locations result$est_cp # Parameter estimates (mean) of each segment SNSeg_estimate(result) # plot the SN-based test statistic SN_stat <- max_SNsweep(result, plot_SN = TRUE, est_cp_loc = TRUE, critical_loc = TRUE) ``` We then show how to use the S3 methods `summary`, `print` and `plot`. ```{r} summary(result) ``` ```{r} print(result) ``` ```{r} plot(result, cpts.col = 'red') ``` We can see both the `plot_SN = TRUE` option and the `plot.SN` function generates the same plot. Users can select any choice based on their preferences. The class of the argument `result` is `SNSeg_Uni`. ### Segmentation for Variance ```{r echo=TRUE, eval=FALSE} set.seed(7) ts <- MAR_Variance(2, "V1") ts <- ts[,2] # grid_size defined result <- SNSeg_Uni(ts, paras_to_test = "variance", confidence = 0.9, grid_size_scale = 0.05, grid_size = NULL, plot_SN = FALSE, est_cp_loc = TRUE) # Estimated change-point locations result$est_cp # Parameter estimates (variance) of each segment SNSeg_estimate(result) # plot the SN-based test statistic SN_stat <- max_SNsweep(result, plot_SN = TRUE, est_cp_loc = TRUE, critical_loc = TRUE) ``` ```{r echo=TRUE, eval=FALSE} # built-in functions # summary statistic summary(result) # print print(result) # plot plot(result, cpts.col = 'red') ``` ### Segmentation for Autocorrelation ```{r echo=TRUE, eval=FALSE} set.seed(7) ts <- MAR_Variance(2, "A3") ts <- ts[,2] # grid_size defined result <- SNSeg_Uni(ts, paras_to_test = "acf", confidence = 0.9, grid_size_scale = 0.05, grid_size = 92, plot_SN = FALSE, est_cp_loc = TRUE) # Estimated change-point locations result$est_cp # Parameter estimates (acf) of each segment SNSeg_estimate(result) # plot the SN-based test statistic SN_stat <- max_SNsweep(result, plot_SN = TRUE, est_cp_loc = TRUE, critical_loc = TRUE) ``` ```{r echo=TRUE, eval=FALSE} # built-in functions # summary statistic summary(result) # print print(result) # plot plot(result, cpts.col = 'red') ``` ### Segmentation for bivariate correlation ```{r echo=TRUE, eval=FALSE} library(mvtnorm) set.seed(7) n <- 1000 sigma_cross <- list(4*matrix(c(1,0.8,0.8,1), nrow=2), matrix(c(1,0.2,0.2,1), nrow=2), matrix(c(1,0.8,0.8,1), nrow=2)) cp_sets <- round(c(0,n/3,2*n/3,n)) noCP <- length(cp_sets)-2 rho_sets <- rep(0.5, noCP+1) ts <- MAR_MTS_Covariance(n, 2, rho_sets, cp_sets, sigma_cross) ts <- ts[1][[1]] # grid_size defined result <- SNSeg_Uni(ts, paras_to_test = "bivcor", confidence = 0.9, grid_size_scale = 0.05, grid_size = 77, plot_SN = FALSE, est_cp_loc = TRUE) # Estimated change-point locations result$est_cp # Parameter estimates (bivariate correlation) of each segment SNSeg_estimate(result) # plot the SN-based test statistic SN_stat <- max_SNsweep(result, plot_SN = TRUE, est_cp_loc = TRUE, critical_loc = TRUE) ``` ```{r echo=TRUE, eval=FALSE} # built-in functions # summary statistic summary(result) # print print(result) # plot plot(result, cpts.col = 'red') ``` ### Segmentation for quantile Please download the packages `truncnorm` and `evd` before running the following code. ```{r echo=TRUE, eval=FALSE} library(truncnorm) library(evd) set.seed(7) n <- 1000 cp_sets <- c(0,n/2,n) noCP <- length(cp_sets)-2 reptime <- 2 rho <- 0.2 # AR time series with no change-point (mean, var)=(0,1) ts <- MAR(n, reptime, rho)*sqrt(1-rho^2) trunc_r <- 0 p <- pnorm(trunc_r) gpd_scale <- 2 gpd_shape <- 0.125 for(ts_index in 1:reptime){ ts[(cp_sets[2]+1):n, ts_index] <- mix_GauGPD(pnorm(ts[(cp_sets[2]+1):n, ts_index]), p,trunc_r,gpd_scale,gpd_shape) } ts <- ts[,2] # grid_size undefined # test in 90% quantile result <- SNSeg_Uni(ts, paras_to_test = 0.9, confidence = 0.9, grid_size_scale = 0.066, grid_size = NULL, plot_SN = FALSE, est_cp_loc = TRUE) # Estimated change-point locations result$est_cp # Parameter estimates (90th quantile) of each segment SNSeg_estimate(result) # plot the SN-based test statistic SN_stat <- max_SNsweep(result, plot_SN = TRUE, est_cp_loc = TRUE, critical_loc = TRUE) ``` ```{r echo=TRUE, eval=FALSE} # built-in functions # summary statistic summary(result) # print print(result) # plot plot(result, cpts.col = 'red') ``` ### Test in a general functional To perform the SN-based test using any general functional, users should define their own function as the input of `paras_to_test`. The input of this function should consist of the followings: * `ts`: A univariate time series provided by the user. The user-defined function should return a numeric value as the output. ```{r echo=TRUE, eval=FALSE} set.seed(7) n <- 500 reptime <- 2 cp_sets <- round(n*c(0,cumsum(c(0.5,0.25)),1)) mean_shift <- c(0.4,0,0.4) rho <- -0.7 ts <- MAR(n, reptime, rho) no_seg <- length(cp_sets)-1 for(index in 1:no_seg){ # Mean shift tau1 <- cp_sets[index]+1 tau2 <- cp_sets[index+1] ts[tau1:tau2,] <- ts[tau1:tau2,] + mean_shift[index] } ts <- ts[,2] # Define a general functional for the input 'paras_to_test' paras_to_test = function(ts){ mean(ts) } result.SNCP.general <- SNSeg_Uni(ts, paras_to_test = paras_to_test, confidence = 0.9, grid_size_scale = 0.05, grid_size = NULL, plot_SN = FALSE, est_cp_loc = TRUE) # Estimated change-point locations result.SNCP.general$est_cp # Parameter estimates (general functional) of each segment SNSeg_estimate(result.SNCP.general) # plot the SN-based test statistic SN_stat <- max_SNsweep(result.SNCP.general, plot_SN = TRUE, est_cp_loc = TRUE, critical_loc = TRUE) ``` ```{r echo=TRUE, eval=FALSE} # built-in functions # summary statistic summary(result) # print print(result) # plot plot(result, cpts.col = 'red') ``` ### Test in multiple parameters ```{r echo=TRUE, eval=FALSE} set.seed(7) n <- 1000 cp_sets <- c(0,333,667,1000) no_seg <- length(cp_sets)-1 rho <- 0 # AR time series with no change-point (mean, var)=(0,1) ts <- MAR(n, 2, rho)*sqrt(1-rho^2) no_seg <- length(cp_sets)-1 sd_shift <- c(1,1.6,1) for(index in 1:no_seg){ # Mean shift tau1 <- cp_sets[index]+1 tau2 <- cp_sets[index+1] ts[tau1:tau2,] <- ts[tau1:tau2,]*sd_shift[index] } d <- 2 ts <- ts[,2] # Test in 90th and 95th quantile with grid_size undefined result <- SNSeg_Uni(ts, paras_to_test = c(0.9, 0.95), confidence = 0.9, grid_size_scale = 0.05, grid_size = NULL, plot_SN = FALSE, est_cp_loc = FALSE) # Test in 90th quantile and the variance with grid_size undefined result <- SNSeg_Uni(ts, paras_to_test = c(0.9, 'variance'), confidence = 0.95, grid_size_scale = 0.078, grid_size = NULL, plot_SN = FALSE, est_cp_loc = FALSE) # Test in 90th quantile, variance and acf with grid_size undefined result <- SNSeg_Uni(ts, paras_to_test = c(0.9,'variance', "acf"), confidence = 0.9, grid_size_scale = 0.064, grid_size = NULL, plot_SN = TRUE, est_cp_loc = TRUE) # Estimated change-point locations result$est_cp ``` ```{r echo=TRUE, eval=FALSE} # Test in 60th quantile, mean, variance and acf with grid_size defined result.last <- SNSeg_Uni(ts, paras_to_test = c(0.6, 'mean', "variance", "acf"), confidence = 0.9, grid_size_scale = 0.05, grid_size = 83, plot_SN = FALSE, est_cp_loc = TRUE) # Estimated change-point locations result.last$est_cp # Parameter estimates (of the last example) of each segment SNSeg_estimate(result.last) # SN-based test statistic segmentation plot SN_stat <- max_SNsweep(result.last, plot_SN = TRUE, est_cp_loc = TRUE, critical_loc = TRUE) ``` ```{r echo=TRUE, eval=FALSE} # built-in functions # summary statistic summary(result) # print print(result) # plot plot(result, cpts.col = 'red') ``` ## Examples: `SNSeg_Multi` The function `SNSeg_Multi` detects change-points for multivariate time series based on the change in multivariate means or covariances. Users can set `paras_to_test` to `mean` or `covariance` for change-points estimation. * The dimension of parameters to be tested must be at most ten. If the parameter is `mean`, the dimension is equivalent to the number of time series. If the parameter is `covariance`, the dimension is equivalent to the number of unknown parameters within the covariance matrix. For examples of multivariate means and covariances, please check the following commands: ```{r echo=TRUE, eval=FALSE} # Please run this function before running examples: exchange_cor_matrix <- function(d, rho){ tmp <- matrix(rho, d, d) diag(tmp) <- 1 return(tmp) } ``` ### Segmentation for Multivariate Mean ```{r echo=TRUE, eval=FALSE} library(mvtnorm) set.seed(10) d <- 5 n <- 1000 cp_sets <- round(n*c(0,cumsum(c(0.075,0.3,0.05,0.1,0.05)),1)) mean_shift <- c(-3,0,3,0,-3,0)/sqrt(d) mean_shift <- sign(mean_shift)*ceiling(abs(mean_shift)*10)/10 rho_sets <- 0.5 sigma_cross <- list(exchange_cor_matrix(d,0)) ts <- MAR_MTS_Covariance(n, 2, rho_sets, cp_sets=c(0,n), sigma_cross) noCP <- length(cp_sets)-2 no_seg <- length(cp_sets)-1 for(rep_index in 1:2){ for(index in 1:no_seg){ # Mean shift tau1 <- cp_sets[index]+1 tau2 <- cp_sets[index+1] ts[[rep_index]][,tau1:tau2] <- ts[[rep_index]][,tau1:tau2] + mean_shift[index] } } ts <- ts[1][[1]] # grid_size undefined result <- SNSeg_Multi(ts, paras_to_test = "mean", confidence = 0.95, grid_size_scale = 0.079, grid_size = NULL, plot_SN = FALSE, est_cp_loc = TRUE) # grid_size defined result <- SNSeg_Multi(ts, paras_to_test = "mean", confidence = 0.99, grid_size_scale = 0.05, grid_size = 65, plot_SN = FALSE, est_cp_loc = TRUE) # Estimated change-point locations result$est_cp # Parameter estimates (multivariate mean) of each segment SNSeg_estimate(result) # SN-based test statistic segmentation plot SN_stat <- max_SNsweep(result, plot_SN = TRUE, est_cp_loc = TRUE, critical_loc = TRUE) ``` ```{r echo=TRUE, eval=FALSE} # built-in functions # summary statistic summary(result) # print print(result) # plot par(mfrow=c(2,3)) plot(result, cpts.col = 'red') ``` ### Segmentation for Multivariate Covariance ```{r echo=TRUE, eval=FALSE} library(mvtnorm) set.seed(10) reptime <- 2 d <- 4 n <- 1000 sigma_cross <- list(exchange_cor_matrix(d,0.2), 2*exchange_cor_matrix(d,0.5), 4*exchange_cor_matrix(d,0.5)) rho_sets <- c(0.3,0.3,0.3) mean_shift <- c(0,0,0) # with mean change cp_sets <- round(c(0,n/3,2*n/3,n)) ts <- MAR_MTS_Covariance(n, reptime, rho_sets, cp_sets, sigma_cross) noCP <- length(cp_sets)-2 no_seg <- length(cp_sets)-1 for(rep_index in 1:reptime){ for(index in 1:no_seg){ # Mean shift tau1 <- cp_sets[index]+1 tau2 <- cp_sets[index+1] ts[[rep_index]][,tau1:tau2] <- ts[[rep_index]][,tau1:tau2] + mean_shift[index] } } ts <- ts[[1]] # grid_size undefined result <- SNSeg_Multi(ts, paras_to_test = "covariance", confidence = 0.9, grid_size_scale = 0.05, grid_size = NULL, plot_SN = FALSE, est_cp_loc = FALSE) # grid_size defined result <- SNSeg_Multi(ts, paras_to_test = "covariance", confidence = 0.9, grid_size_scale = 0.05, grid_size = 81, plot_SN = FALSE, est_cp_loc = TRUE) # Estimated change-point locations result$est_cp # Parameter estimates (covariance estimate) of each segment SNSeg_estimate(result) # SN-based test statistic segmentation plot SN_stat <- max_SNsweep(result, plot_SN = TRUE, est_cp_loc = TRUE, critical_loc = TRUE) ``` ```{r echo=TRUE, eval=FALSE} # built-in functions # summary statistic summary(result) # print print(result) # plot par(mfrow=c(1,1)) plot(result, cpts.col = 'red') ``` ## Examples: `SNSeg_HD` The function `SNSeg_HD` performs change-points estimation for high-dimensional time series (dimension greater than 10) using the change in high-dimensional means. For usage examples of `SNSeg_HD`, we provide the followig code: ```{r echo=TRUE, eval=FALSE} n <- 600 p <- 100 nocp <- 5 cp_sets <- round(seq(0,nocp+1,1)/(nocp+1)*n) num_entry <- 5 kappa <- sqrt(4/5) # Wang et al(2020) mean_shift <- rep(c(0,kappa),100)[1:(length(cp_sets)-1)] set.seed(1) ts <- matrix(rnorm(n*p,0,1),n,p) no_seg <- length(cp_sets)-1 for(index in 1:no_seg){ # Mean shift tau1 <- cp_sets[index]+1 tau2 <- cp_sets[index+1] ts[tau1:tau2,1:num_entry] <- ts[tau1:tau2,1:num_entry] + mean_shift[index] # sparse change } # SN segmentation plot # grid_size undefined (plot the first 3 time series) result <- SNSeg_HD(ts, confidence = 0.9, grid_size_scale = 0.05, grid_size = NULL, plot_SN = FALSE, est_cp_loc = TRUE, ts_index = c(1:3)) # grid_size defined (plot the 1st, 3rd and 5th time series) result <- SNSeg_HD(ts, confidence = 0.9, grid_size_scale = 0.05, grid_size = 52, plot_SN = FALSE, est_cp_loc = TRUE, ts_index = c(1,3,5)) # Estimated change-point locations result$est_cp # Parameter estimates (high-dimensional means) of each segment summary.stat <- SNSeg_estimate(result) # SN-based test statistic segmentation plot SN_stat <- max_SNsweep(result, plot_SN = TRUE, est_cp_loc = TRUE, critical_loc = TRUE) ``` ```{r echo=TRUE, eval=FALSE} # built-in functions # summary statistic summary(result) # print print(result) # plot plot(result, cpts.col = 'red', ts_index = c(1:3)) ``` We note that only the selected time series were plotted by setting values for `ts_index`.