Package 'GUEST'

Title: Graphical Models in Ultrahigh-Dimensional and Error-Prone Data via Boosting Algorithm
Description: We consider the ultrahigh-dimensional and error-prone data. Our goal aims to estimate the precision matrix and identify the graphical structure of the random variables with measurement error corrected. We further adopt the estimated precision matrix to the linear discriminant function to do classification for multi-label classes.
Authors: Hui-Shan Tsao [aut, cre], Li-Pang Chen [aut]
Maintainer: Hui-Shan Tsao <[email protected]>
License: GPL-2
Version: 0.2.0
Built: 2024-11-29 08:43:52 UTC
Source: CRAN

Help Index


Estimation of precision matrix and detection of graphical structure

Description

This function first applies the regression calibration to deal with measurement error effects. After that, the feature screening technique is employed to screen out independent pairs of random variables and reduce the dimension of random variables. Finally, we adopt the boosting method to detect informative pairs of random variables and estimate the precision matrix. This function can handle various distributions, such as normal, binomial, and Poisson distributions, as well as nonlinear effects among random variables.

Usage

boost.graph(data,ite1,ite2,ite3,thre,select = 0.9,inc = 10^(-3),
sigma_e = 0.6,q = 0.8,lambda = 1,pi = 0.5,rep = 100,cor = TRUE)

Arguments

data

An n (observations) times p (variables) matrix of random variables, whose distributions can be continuous, discrete, or mixed.

ite1

The number of iterations for continuous variables.

ite2

The number of iterations for binary variables.

ite3

The number of iterations for count variables.

thre

The treshold value for feature screening, whose value should be between 0 and 1.

select

The treshold constant in the boosting algorithm, whose value should be between 0 and 1. The default value is 0.9.

inc

The learning rate of the increment in the boosting algorithm, which shoud be a small value. The default value is 0.001.

sigma_e

The common value in the diagonal covariance matrix of the error for the classical measurement error model when data are continuous. The default value is 0.6.

q

The common value used to characterize misclassification for binary random variables. The default value is 0.8.

lambda

The parameter of the Poisson distribution, which is used to characterize error-prone count random variables. The default value is 1.

pi

The probability in the Binomial distribution, which is used to characterize error-prone count random variables. The default value is 0.5.

rep

The number of bootstrapping iterations. The default value is 100.

cor

Measurement error correction when estimating the precision matrix. The default value is TRUE.

Value

w

The estimator of the precision matrix.

p

The chosen pairs obtained by the feature screening.

xi

The weights sorted with pairs in p.

g

The visualization of the estimated network structure determined by w.

Author(s)

Hui-Shan Tsao and Li-Pang Chen
Maintainer: Hui-Shan Tsao [email protected]

References

Hui-Shan Tsao (2024). Estimation of Ultrahigh-Dimensional Graphical Models and Its Application to Dsicriminant Analysis. Master Thesis supervised by Li-Pang Chen, National Chengchi University.

Examples

data(MedulloblastomaData)

X <- t(MedulloblastomaData[2:656,]) #covariates
Y <- MedulloblastomaData[1,] #response

X <- matrix(as.numeric(X),nrow=23)

p <- ncol(X)
n <- nrow(X)

#standarization
X_new=data.frame()
for (i in 1:p){
 X_new[1:n,i]=(X[,i]-rep(mean(X[,i]),n))/sd(X[,i])
}
X_new=matrix(unlist(X_new),nrow = n)


#estimate graphical model
result <- boost.graph(data = X_new, thre = 0.2, ite1 = 3, ite2 = 0, ite3 = 0, rep = 1)
theta.hat <- result$w

Graphical Models in Ultrahigh-Dimensional and Error-Prone Data via Boosting Algorithm

Description

The package GUEST, referred to Graphical models in Ultrahigh-dimensional and Error-prone data via booSTing algorithm, is used to estimate the precision matrix and detect graphical structure for ultrahigh-dimensional, error-prone, and possibly nonlinear random variables. Given the estimated precision matrix, we further apply it to the linear discriminant function to deal with multi-classification. The precision matrix can be estimated by the function boost.graph, and the classification can be implemented by the function LDA.boost. Finally, we consider the medulloblastoma dataset to demonstrate the implementation of two functions.

Details

To estimate the precision matrix and detect the graphical structure under our scenario, the function boost.graph first applies the regression calibration method to deal with measurement error in continuous, binary, or count random variables. After that, the feature screening technique is employed to reduce the dimension of random variable, and we then adopt the boosting algorithm to estimate the precision matrix. The estimated precision matrix also reflects the desired graphical structure. The function LDA.boost implements the linear discriminant function to do classification for multi-label classes, where the precision matrix, also known as the inverse of the covariance matrix, in the linear discriminant function can be estimated by the function boost.graph.

Value

GUEST_package


Implementation of the linear discriminant function for multi-label classification.

Description

This function applies the linear discriminant function to do classification for multi-label responses. The precision matrix, or the inverse of the covariance matrix, in the linear discriminant function can be estimated by w in the function boost.graph. In addition, error-prone covariates in the linear discriminant function are addressed by the regression calibration.

Usage

LDA.boost(data, resp, theta, sigma_e = 0.6,q = 0.8,lambda = 1, pi = 0.5)

Arguments

data

An n (observations) times p (variables) matrix of random variables, whose distributions can be continuous, discrete, or mixed.

resp

An n-dimensional vector of categorical random variables, which is the response in the data.

theta

The estimator of the precision matrix.

sigma_e

The common value in the diagonal covariance matrix of the error for the classical measurement error model when data are continuous. The default value is 0.6.

q

The common value used to characterize misclassification for binary random variables. The default value is 0.8.

lambda

The parameter of the Poisson distribution, which is used to characterize error-prone count random variables. The default value is 1.

pi

The probability in the Binomial distribution, which is used to characterize error-prone count random variables. The default value is 0.5.

Details

The linear discriminant function used is as follow:

scorei,j=log(πi)0.5 μi theta μi+dataj theta μi,\code{score}_{i,j} = \log (\pi _i) - 0.5\ \mu_{i}^\top\ \code{theta}\ \mu _{i} + \code{data}_{j}^\top\ \code{theta}\ \mu_{i},


for the class i=1,,Ii = 1, \cdots, I with II being the number of classes in the dataset and subject j=1,,nj = 1, \cdots, n, where πi\pi _i is the proportion of subjects in the class ii, dataj\code{data}_{j} is the vector of covariates for the subject jj, theta\code{theta} is the precision matrix of the covariates, and μi\mu_{i} is the empirical mean vector of the random variables in the class ii.

Value

score

The value of the linear discriminant function (see details) with the estimator of the precision matrix accommodated.

class

The result of predicted class for subjects.

Author(s)

Hui-Shan Tsao and Li-Pang Chen
Maintainer: Hui-Shan Tsao [email protected]

References

Hui-Shan Tsao (2024). Estimation of Ultrahigh-Dimensional Graphical Models and Its Application to Dsicriminant Analysis. Master Thesis supervised by Li-Pang Chen, National Chengchi University.

Examples

data(MedulloblastomaData)

X <- t(MedulloblastomaData[2:655,]) #covariates
Y <- MedulloblastomaData[1,] #response

X <- matrix(as.numeric(X),nrow=23)

p <- ncol(X)
n <- nrow(X)

#standarization
X_new=data.frame()
for (i in 1:p){
 X_new[1:n,i]=(X[,i]-rep(mean(X[,i]),n))/sd(X[,i])
}
X_new=matrix(unlist(X_new),nrow = n)


#estimate graphical model
result <- boost.graph(data = X_new, thre = 0.2, ite1 = 3, ite2 = 0, ite3 = 0, rep = 1)
theta.hat <- result$w

theta.hat[which(theta.hat<0.8)]=0 #keep the highly dependent pairs

#predict
pre <- LDA.boost(data = X_new, resp = Y, theta = theta.hat)
estimated_Y <- pre$class

The medulloblastoma dataset

Description

The dataset, which is available on https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE468, contains 23 patients with medulloblastoma, and each patient has 2059 gene expression values. The response contains 2 classes: metastatic (M+) or non-metastatic (M0). After removing the missing and duplicate values, the dimension of remaining gene expressions is 655. The dataset is used to illustrate the usage of the boost.graph and LDA.boost functions.

Usage

data(MedulloblastomaData)

Format

The dataset has 23 observations and 655 gene expression values.

References

MacDonald, T., Brown, K., LaFleur, B., Peterson K., Lawlor C., Chen Y., Packer RJ., Cogen P., Stephan DA.(2001). Expression profiling of medulloblastoma: PDGFRA and the RAS/MAPK pathway as therapeutic targets for metastatic disease. Nat Genet, 29, 143–152.

Examples

X <- t(MedulloblastomaData[2:655,]) #covariates
Y <- MedulloblastomaData[1,] #response