ktweedie-vignette

Introduction

ktweedie is a package that fits nonparametric Tweedie compound Poisson gamma models in a reproducing kernel Hilbert space. The package implements two algorithms: ktweedie for the kernel-based Tweedie model and sktweedie for the sparse kernel-based Tweedie model. ktweedie supports a wide range of kernel functions implemented in the R package kernlab, and the optimization is performed by a Broyden–Fletcher–Goldfarb–Shanno (BFGS) algorithm with bisection line search. The package includes cross-validation functions for one-dimensional tuning of the kernel regularization parameter λ and for two-dimensional joint tuning of one kernel parameter and λ. sktweedie uses variable weights to achieve variable selection. It is a meta-algorithm that alternately updates the kernel-based model coefficients and a set of variable weights.

The ktweedie solves the problem $$ \min_{\boldsymbol{\alpha}}\left\{ -\sum_{i=1}^{n}\frac{1}{\phi}\left(\frac{y_{i}e^{(1-\rho)\mathbf{K}_{i}^{\top}\boldsymbol{\alpha}}}{1-\rho}-\frac{e^{(2-\rho)\mathbf{K}_{i}^{\top}\boldsymbol{\alpha}}}{2-\rho}\right)+\lambda\boldsymbol{\alpha}^{\top}\mathbf{K}\boldsymbol{\alpha}\right\} , $$where ρ ∈ (1, 2) is the index parameter, ϕ > 0 is the dispersion parameter, and K is the n × n kernel matrix computed according to the user-specified kernel function K(⋅, ⋅), whose entries Kij = K(xi, xj) are calculated from the p-dimensional predictors of subjects i, j = 1, …, n. In the kernel-based Tweedie model, the mean parameter μi = 𝔼(Yi|xi) for the i-th observation is modelled through the log link by $$\log(\mu_i)=\log(\mathbb{E}(Y_i|\mathbf{x}_i))=\sum_{j=1}^n \alpha_j K(\mathbf x_{i}, \mathbf x_j).$$
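
To make the model concrete, the fitted mean can be computed directly from a kernel matrix and a coefficient vector. The following is a minimal sketch using kernlab (on which ktweedie builds); x0 and alpha0 are illustrative stand-ins, not quantities produced by the package:

library(kernlab)

set.seed(1)
x0 <- matrix(rnorm(30 * 5), nrow = 30)    # illustrative predictors, n = 30, p = 5
rbf <- rbfdot(sigma = 0.1)                # Gaussian RBF kernel K(., .)
K0 <- as.matrix(kernelMatrix(rbf, x0))    # K0[i, j] = K(x_i, x_j)
alpha0 <- rnorm(30, sd = 0.1)             # placeholder for estimated coefficients
mu0 <- exp(K0 %*% alpha0)                 # mu_i = exp(sum_j alpha_j K(x_i, x_j))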

The sktweedie solves $$ \begin{aligned} &\min_{\boldsymbol{\alpha}, \mathbf{w}}\left\{ -\sum_{i=1}^{n}\frac{1}{\phi}\left(\frac{y_{i}e^{(1-\rho)\mathbf{K(w)}_{i}^{\top}\boldsymbol{\alpha}}}{1-\rho}-\frac{e^{(2-\rho)\mathbf{K(w)}_{i}^{\top}\boldsymbol{\alpha}}}{2-\rho}\right)+\lambda_1\boldsymbol{\alpha}^{\top}\mathbf{K(w)}\boldsymbol{\alpha} +\lambda_2 \mathbf{1}^\top \mathbf{w} \right \}\\ & \qquad \qquad \mathrm{s.t.\ \ \ } w_j\in [0,1],\ j=1,\ldots,p, \end{aligned}$$

where K(w)ij = K(w ⊙ xi, w ⊙ xj) involves the variable weights w, with ⊙ denoting the element-wise product.
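
Continuing the sketch above, the weighted kernel matrix K(w) is simply the ordinary kernel matrix evaluated after each column of the predictors has been scaled by its weight; the weight vector below is an arbitrary illustration:

w0 <- c(1, 0.5, 0, 0, 0)                  # illustrative weights; zeros remove variables 3-5
xw <- sweep(x0, 2, w0, "*")               # row i becomes w ⊙ x_i
Kw <- as.matrix(kernelMatrix(rbf, xw))    # Kw[i, j] = K(w ⊙ x_i, w ⊙ x_j)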

Installation

From CRAN:

install.packages("ktweedie")

From GitHub:

devtools::install_github("ly129/ktweedie")

Quick Start

First we load the ktweedie package:

library(ktweedie)

The package includes a toy dataset for demonstration purposes. The 30 × 5 predictor matrix x is generated from the standard normal distribution, and y is generated according to y ∼ Tweedie(μ = sin(xβ), ρ = 1.5, ϕ = 0.5), where β = (6, −4, 0, 0, 0). That is, only the first two predictors are associated with the response.

data(dat)
x <- dat$x
y <- dat$y

An input matrix x and an output vector y are now loaded. The ktd_estimate() function can be used to fit a basic ktweedie model. The regularization parameter lam1 can be a vector, whose values will be solved in decreasing order with warm starts.

fit.ktd <- ktd_estimate(x = x,
                        y = y,
                        kern = rbfdot(sigma = 0.1),
                        lam1 = c(0.01, 0.1, 1))
str(fit.ktd$estimates)
#> List of 3
#>  $ lambda 1   :List of 3
#>   ..$ fn         : num 110
#>   ..$ coefficient: num [1:30, 1] 0.5558 -0.062 -0.0381 0.0523 -0.0251 ...
#>   ..$ convergence: int 0
#>  $ lambda 0.1 :List of 3
#>   ..$ fn         : num 51
#>   ..$ coefficient: num [1:30, 1] 1.662 -0.235 -0.177 0.867 -0.143 ...
#>   ..$ convergence: int 0
#>  $ lambda 0.01:List of 3
#>   ..$ fn         : num 39.2
#>   ..$ coefficient: num [1:30, 1] 7.692 -0.49 -0.841 4.624 -0.696 ...
#>   ..$ convergence: int 0

fit.ktd$estimates stores the estimated coefficients and the final objective function value for each lam1. The estimated kernel-based model coefficients for the l-th value of lam1 can be accessed by the index l: fit.ktd$estimates[[l]]$coefficient.
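
To choose among the candidate lam1 values in a data-driven way, the cross-validation helper mentioned in the Introduction can be used. The call below is a minimal sketch, assuming ktd_cv() with arguments kern, lambda and nfolds as described in the package documentation:

cv.ktd <- ktd_cv(x = x, y = y,
                 kern = rbfdot(sigma = 0.1),
                 lambda = c(0.01, 0.1, 1),   # candidate lam1 values to compare
                 nfolds = 5)                 # argument names assumed per ?ktd_cv

The cross-validated losses stored in the returned object can then be compared to select a lam1 value before refitting with ktd_estimate().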

The function can also be used to fit the sktweedie model by setting the argument sparsity to TRUE, and specifying the regularization coefficient λ2 in the argument lam2.

fit.sktd <- ktd_estimate(x = x,
                         y = y,
                         kern = rbfdot(sigma = 0.1),
                         lam1 = 5,
                         sparsity = TRUE,
                         lam2 = 1)

The fitted coefficients can be accessed in the same manner as for fit.ktd. Additionally, the fitted variable weights w can be accessed by

fit.sktd$estimates[[1]]$weight
#>           [,1]
#> [1,] 1.0000000
#> [2,] 0.4462078
#> [3,] 0.0000000
#> [4,] 0.0000000
#> [5,] 0.0000000

Variables with weights close to 0 can be viewed as noise variables.
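
For example, a simple threshold on the fitted weights recovers the selected variables; the cutoff used here is an arbitrary illustrative choice:

w.hat <- fit.sktd$estimates[[1]]$weight
which(w.hat > 1e-3)   # returns 1 and 2 for the weights shown above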