Title: | Empirical Bayesian Tobit Matrix Estimation |
---|---|
Description: | Estimation tools for multidimensional Gaussian means using empirical Bayesian g-modeling. Methods are able to handle fully observed data as well as left-, right-, and interval-censored observations (Tobit likelihood); descriptions of these methods can be found in Barbehenn and Zhao (2023) <doi:10.48550/arXiv.2306.07239>. Additional, lower-level functionality based on Kiefer and Wolfowitz (1956) <doi:10.1214/aoms/1177728066> and Jiang and Zhang (2009) <doi:10.1214/08-AOS638> is provided that can be used to accelerate many empirical Bayes and nonparametric maximum likelihood problems. |
Authors: | Alton Barbehenn [aut, cre] , Sihai Dave Zhao [aut] |
Maintainer: | Alton Barbehenn <[email protected]> |
License: | GPL-3 |
Version: | 1.0.2 |
Built: | 2024-11-27 06:28:32 UTC |
Source: | CRAN |
A bile acid data set taken from Lei et al. (2018)
doi:10.1096/fj.201700055R via Wei et al. (2018)
doi:10.1371/journal.pcbi.1005973
(corresponding GitHub repository: https://github.com/WandeRum/GSimp). The
values in BileAcid
can be assumed to be independent log-normal
measurements.
BileAcid
BileAcid
A data frame with 198 rows and 34 variables. Each row is a patient id and each column is an bile acid measurement.
This method only works if there is a working installation of REBayes
available. See the REBayes
package and corresponding papers for more
implementation details.
ConvexDual(A, ...)
ConvexDual(A, ...)
A |
numeric matrix likelihoods |
... |
further arguments passed to |
The matrix A
is structured as follows: A_ij = P(X_i | theta = t_j),
where X_i is the i'th observation and t_j is the j'th set of
parameters/grid-point.
a vector containing the fitted prior
This method only works if there is a working installation of REBayes
available. See the REBayes
package and corresponding papers for more
implementation details.
ConvexPrimal(A, ...)
ConvexPrimal(A, ...)
A |
numeric matrix likelihoods |
... |
further arguments passed to |
The matrix A
is structured as follows: A_ij = P(X_i | theta = t_j),
where X_i is the i'th observation and t_j is the j'th set of
parameters/grid-point.
a vector containing the fitted prior
Fit and estimate the nonparametric maximum likelihood estimator in R^p
(p >= 1) when the likelihood is Gaussian and possibly interval censored. If
p = 1, then L
, R
, and gr
may be vectors (they are
immediately converted into matrices internally).
ebTobit( L, R = L, gr = (R + L)/2, s1 = 1, algorithm = "EM", pos_lik = TRUE, ... )
ebTobit( L, R = L, gr = (R + L)/2, s1 = 1, algorithm = "EM", pos_lik = TRUE, ... )
L |
n x p matrix of lower bounds on observations |
R |
n x p matrix of upper bounds on observations |
gr |
m x p matrix of grid points |
s1 |
a single numeric standard deviation or an n x p matrix of standard deviations |
algorithm |
method to fit prior, either a function or function name |
pos_lik |
boolean indicating whether to lower-bound the likelihood
matrix with |
... |
further arguments passed into fitting method such as |
Each observation is stored in a pair of matrices, L
and R
. If
L_ij = R_ij then a direct measurement X_ij ~ N(theta, s1^2) is made;
if L_ij < R_ij then the measurement is censored so that L_ij < X_ij < R_ij.
To use a custom fitting algorithm, define a function MyAlg
that
takes in an n x m likelihood matrix: P_ij = P(L_i, R_i | theta = t_j) and
returns a vector of estimated prior weights for t_j. Once MyAlg
is
defined, fit the prior by using algorithm = "MyAlg"
or use the
function itself algorithm = MyAlg
.
Alternative fitting algorithms "ConvexPrimal"and "ConvexDual" have been
(wrappers of REBayes::KWPrimal
and REBayes::KWDual
,
respectively) included and can be used if MOSEK and REBayes
are
properly installed.
a fitted ebTobit
object containing at least the prior weights,
corresponding grid/support points, and likelihood matrix relating the grid to
the observations
set.seed(1) n <- 100 p <- 5 r <- 2 U.true <- matrix(stats::rexp(n*r), n, r) V.true <- matrix(sample(x = c(1,4,7), size = p*r, replace = TRUE, prob = c(0.7, 0.2, 0.1)), p, r) TH <- tcrossprod(U.true, V.true) X <- TH + matrix(stats::rnorm(n*p), n, p) # fit uncensored method fit1 <- ebTobit(X) # fit left-censored method ldl <- 1 # lower and upper detection limits udl <- Inf L <- ifelse(X < ldl, 0, ifelse(X <= udl, X, udl)) R <- ifelse(X < ldl, ldl, ifelse(X <= udl, X, Inf)) fit2 <- ebTobit(L, R)
set.seed(1) n <- 100 p <- 5 r <- 2 U.true <- matrix(stats::rexp(n*r), n, r) V.true <- matrix(sample(x = c(1,4,7), size = p*r, replace = TRUE, prob = c(0.7, 0.2, 0.1)), p, r) TH <- tcrossprod(U.true, V.true) X <- TH + matrix(stats::rnorm(n*p), n, p) # fit uncensored method fit1 <- ebTobit(X) # fit left-censored method ldl <- 1 # lower and upper detection limits udl <- Inf L <- ifelse(X < ldl, 0, ifelse(X <= udl, X, udl)) R <- ifelse(X < ldl, ldl, ifelse(X <= udl, X, Inf)) fit2 <- ebTobit(L, R)
Compute the nonparametric maximum likelihood estimate given a likelihood matrix. The matrix A is structured so that A_{ij} = f(X_i | theta_j) for some grid of potential parameter values theta_1, ..., theta_p and observations X_1, ..., X_n. The parameters, theta_j, can be multidimensional because all that is required is the likelihood. Convergence is achieved when the relative improvements of the log-likelihood is below the provided tolerance level.
EM(A, maxiter = 10000L, rtol = 1e-06)
EM(A, maxiter = 10000L, rtol = 1e-06)
A |
numeric matrix likelihoods |
maxiter |
early stopping condition |
rtol |
convergence tolerance: abs(loss_new - loss_old)/abs(loss_old) |
the estimated prior distribution (a vector of masses corresponding to the columns of A)
set.seed(1) t = sample(c(0,5), size = 100, replace = TRUE) x = t + stats::rnorm(100) gr = seq(from = min(x), to = max(x), length.out = 50) A = stats::dnorm(outer(x, gr, "-")) EM(A) ## Not run: # compare to solution from rmosek (requires additional library installation): all.equal( REBayes::KWPrimal(A = A, d = rep(1, 50), w = rep(1/100, 100))$f, EM(A, maxiter = 1e+6, rtol = 1e-16), # EM alg converges slowly tolerance = 0.01 ) ## End(Not run)
set.seed(1) t = sample(c(0,5), size = 100, replace = TRUE) x = t + stats::rnorm(100) gr = seq(from = min(x), to = max(x), length.out = 50) A = stats::dnorm(outer(x, gr, "-")) EM(A) ## Not run: # compare to solution from rmosek (requires additional library installation): all.equal( REBayes::KWPrimal(A = A, d = rep(1, 50), w = rep(1/100, 100))$f, EM(A, maxiter = 1e+6, rtol = 1e-16), # EM alg converges slowly tolerance = 0.01 ) ## End(Not run)
Compute either the posterior mean (default) or posterior L1 mediod which corresponds to the posterior median in one-dimension.
## S3 method for class 'ebTobit' fitted(object, method = "mean", ...)
## S3 method for class 'ebTobit' fitted(object, method = "mean", ...)
object |
an object inheriting from class |
method |
either "mean", "L1mediod", or "mode" corresponding to the
methods: |
... |
not used |
matrix containing the posterior estimates for measurements in the
fit empirical Bayes model object
Validate ebTobit Object
is.ebTobit(object)
is.ebTobit(object)
object |
any R object |
boolean: TRUE if the object is a valid ebTobit
object
Compute P(L_i, R_i | theta = t_k) for observations (L_i, R_i) and grid of mean t_k.
lik_GaussianPIC(L, R, gr, s1)
lik_GaussianPIC(L, R, gr, s1)
L |
numeric vector of lower bounds |
R |
numeric vector of upper bounds |
gr |
numeric vector of means |
s1 |
numeric vector of standard deviations |
the likelihood under partial interval censoring
# set-up p = 15 gr = stats::rnorm(p) L = R = stats::rnorm(p) missing.idx = sample.int(n = p, size = p/5) L[missing.idx] = L[missing.idx] - stats::runif(length(missing.idx), 0, 1) R[missing.idx] = R[missing.idx] + stats::runif(length(missing.idx), 0, 1) # R solution lik = prod(ifelse( L == R, stats::dnorm(L-gr), stats::pnorm(R-gr) - stats::pnorm(L-gr))) # Compare R to RcppParallel method all.equal(lik, lik_GaussianPIC(L, R, gr, rep(1,p)))
# set-up p = 15 gr = stats::rnorm(p) L = R = stats::rnorm(p) missing.idx = sample.int(n = p, size = p/5) L[missing.idx] = L[missing.idx] - stats::runif(length(missing.idx), 0, 1) R[missing.idx] = R[missing.idx] + stats::runif(length(missing.idx), 0, 1) # R solution lik = prod(ifelse( L == R, stats::dnorm(L-gr), stats::pnorm(R-gr) - stats::pnorm(L-gr))) # Compare R to RcppParallel method all.equal(lik, lik_GaussianPIC(L, R, gr, rep(1,p)))
Compute a matrix L whose entries are L[i,k] = P(L_i, R_i | theta = t_k) for observations (L_i, R_i) and grid of means t_k.
likMat(L, R, gr, s1)
likMat(L, R, gr, s1)
L |
n x p matrix of lower bounds |
R |
n x p matrix of upper bounds |
gr |
m x p matrix of candidate means |
s1 |
n x p matrix of standard deviations |
the n x m likelihood matrix under partial interval censoring
# set-up n = 100; m = 50; p = 5 gr = matrix(stats::rnorm(m*p), m, p) L = R = matrix(stats::rnorm(n*p), n, p) s1 = matrix(1, n, p) missing.idx = sample.int(n = n*p, size = p*p) L[missing.idx] = L[missing.idx] - stats::runif(p, 0, 1) # R solution lik = matrix(nrow = n, ncol = m) for (i in 1:n) { for(k in 1:m) { lik[i,k] = prod(ifelse( L[i,] == R[i,], stats::dnorm(L[i,]-gr[k,], sd = s1[i,]), stats::pnorm(R[i,]-gr[k,], sd = s1[i,]) - stats::pnorm(L[i,]-gr[k,], sd = s1[i,]) )) } } # Compare R to RcppParallel method all.equal(lik, likMat(L, R, gr, s1))
# set-up n = 100; m = 50; p = 5 gr = matrix(stats::rnorm(m*p), m, p) L = R = matrix(stats::rnorm(n*p), n, p) s1 = matrix(1, n, p) missing.idx = sample.int(n = n*p, size = p*p) L[missing.idx] = L[missing.idx] - stats::runif(p, 0, 1) # R solution lik = matrix(nrow = n, ncol = m) for (i in 1:n) { for(k in 1:m) { lik[i,k] = prod(ifelse( L[i,] == R[i,], stats::dnorm(L[i,]-gr[k,], sd = s1[i,]), stats::pnorm(R[i,]-gr[k,], sd = s1[i,]) - stats::pnorm(L[i,]-gr[k,], sd = s1[i,]) )) } } # Compare R to RcppParallel method all.equal(lik, likMat(L, R, gr, s1))
Marginal Log-likelihood of an ebTobit object
## S3 method for class 'ebTobit' logLik(object, ...)
## S3 method for class 'ebTobit' logLik(object, ...)
object |
an object inheriting from class |
... |
not used |
log likelihood for the fitted empirical Bayes model in object
Validate the provided elements and populate the object. Current methods
require that gr
is numeric for that calculation of posterior
statistics (mean and mediod).
new_ebTobit(prior, gr, lik)
new_ebTobit(prior, gr, lik)
prior |
numeric vector of non-negative weights (sums to one) |
gr |
numeric matrix of support points |
lik |
numeric matrix of likelihoods |
an EBayesMat
object containing at least the prior weights,
corresponding grid/support points, and likelihood matrix relating the grid to
the observations
The posterior L1 mediod is defined as \arg\min_y E |y - t|_1 where the
expectation is taken over the posterior t|X=x. Here the posterior L1 mediod
is evaluated for each of the observations used to fit object
.
posterior_L1mediod.ebTobit(object)
posterior_L1mediod.ebTobit(object)
object |
an object inheriting from class |
numeric matrix of posterior L1 mediods for the fitted empirical
Bayes model in object
Compute Posterior Mean of an ebTobit object
posterior_mean.ebTobit(object)
posterior_mean.ebTobit(object)
object |
an object inheriting from class |
numeric matrix of posterior means for the fitted empirical Bayes
model in object
Compute Posterior Mode of an ebTobit object
posterior_mode.ebTobit(object)
posterior_mode.ebTobit(object)
object |
an object inheriting from class |
numeric matrix of posterior modes for the fitted empirical
Bayes model in object
Compute either the posterior mean (default) or posterior L1 mediod which corresponds to the posterior median in one-dimension.
## S3 method for class 'ebTobit' predict(object, L, R = L, s1 = 1, method = "mean", ...)
## S3 method for class 'ebTobit' predict(object, L, R = L, s1 = 1, method = "mean", ...)
object |
an object inheriting from class |
L |
n x p matrix of lower bounds on observations |
R |
n x p matrix of upper bounds on observations |
s1 |
a single numeric standard deviation or an n x p matrix of standard deviations |
method |
either "mean", "L1mediod", or "mode" corresponding to the
methods: |
... |
not used |
matrix of posterior estimates for new observations under the
provided, pre-fit empirical Bayes model object
Fit the matrix of standard deviations given censored observations current mean estimates. Currently there are four models for S implemented: global, column-specific, row-specific, and rank-1.
tobit_sd( L, R, mu = matrix(colMeans(L + R)/2, nrow(L), ncol(L), byrow = TRUE), sd.structure = "global", interval = c(1e-04, 100), tol = .Machine$double.eps^0.25, maxiter = 1000 )
tobit_sd( L, R, mu = matrix(colMeans(L + R)/2, nrow(L), ncol(L), byrow = TRUE), sd.structure = "global", interval = c(1e-04, 100), tol = .Machine$double.eps^0.25, maxiter = 1000 )
L |
matrix of lower bounds on observations (n x p) |
R |
matrix of upper bounds on observations (n x p) |
mu |
matrix of known means (n x p) |
sd.structure |
structure imposed on noise level estimates, must be one of: "global", "column", "row", or "rank1" |
interval |
a vector containing the end-points of the interval defining
the convex search space (default: |
tol |
the desired accuracy |
maxiter |
early stopping condition |
matrix of maximum likelihood estimates for each observation's standard deviation (n x p)
set.seed(1) n = 100; p = 5; r = 2 U.true = matrix(stats::rexp(n*r), n, r) V.true = matrix(sample(x = c(1,4,7), size = p*r, replace = TRUE, prob = c(0.7, 0.2, 0.1)), p, r) TH = tcrossprod(U.true, V.true) X = TH + matrix(stats::rnorm(n*p, sd = 1), n, p) ldl <- 0.1 # lower detection limit, known to be non-negative L <- ifelse(X < ldl, 0, X) R <- ifelse(X < ldl, ldl, X) tobit_sd(L, R, mu = TH) tobit_sd(L, R, mu = TH, sd.structure = "column")
set.seed(1) n = 100; p = 5; r = 2 U.true = matrix(stats::rexp(n*r), n, r) V.true = matrix(sample(x = c(1,4,7), size = p*r, replace = TRUE, prob = c(0.7, 0.2, 0.1)), p, r) TH = tcrossprod(U.true, V.true) X = TH + matrix(stats::rnorm(n*p, sd = 1), n, p) ldl <- 0.1 # lower detection limit, known to be non-negative L <- ifelse(X < ldl, 0, X) R <- ifelse(X < ldl, ldl, X) tobit_sd(L, R, mu = TH) tobit_sd(L, R, mu = TH, sd.structure = "column")
Use standard numerical optimization methods to maximize the log-likelihood of the given problem. If all of the data is passed in, this method computes the global estimate of standard deviation. By passing in a subset of the data, more specific estimates can be made (ex column-specific standard deviations).
tobit_sd_mle( L, R, mu = matrix(mean(L + R)/2, nrow(L), ncol(L)), interval = c(1e-04, 100), tol = .Machine$double.eps^0.25 )
tobit_sd_mle( L, R, mu = matrix(mean(L + R)/2, nrow(L), ncol(L)), interval = c(1e-04, 100), tol = .Machine$double.eps^0.25 )
L |
matrix of lower bounds on observations (n x p) |
R |
matrix of upper bounds on observations (n x p) |
mu |
matrix of known means (n x p) |
interval |
a vector containing the end-points of the interval defining
the convex search space (default: |
tol |
the desired accuracy |
a list containing estimate (maximum) and log-likelihood (objective)
set.seed(1) n = 100; p = 5; r = 2 U.true = matrix(stats::rexp(n*r), n, r) V.true = matrix(sample(x = c(1,4,7), size = p*r, replace = TRUE, prob = c(0.7, 0.2, 0.1)), p, r) TH = tcrossprod(U.true, V.true) X = TH + matrix(stats::rnorm(n*p, sd = 1), n, p) ldl <- 0.1 # lower detection limit, known to be non-negative L <- ifelse(X < ldl, 0, X) R <- ifelse(X < ldl, ldl, X) tobit_sd_mle(L, R, mu = TH)
set.seed(1) n = 100; p = 5; r = 2 U.true = matrix(stats::rexp(n*r), n, r) V.true = matrix(sample(x = c(1,4,7), size = p*r, replace = TRUE, prob = c(0.7, 0.2, 0.1)), p, r) TH = tcrossprod(U.true, V.true) X = TH + matrix(stats::rnorm(n*p, sd = 1), n, p) ldl <- 0.1 # lower detection limit, known to be non-negative L <- ifelse(X < ldl, 0, X) R <- ifelse(X < ldl, ldl, X) tobit_sd_mle(L, R, mu = TH)