Title: | Quantile Least Mahalanobis Distance Estimator for Tukey g-&-h Mixture |
---|---|
Description: | Functions for simulation, estimation, and model selection of finite mixtures of Tukey g-and-h distributions. |
Authors: | Tingting Zhan [aut, cre, cph] , Inna Chervoneva [ctb, cph] |
Maintainer: | Tingting Zhan <[email protected]> |
License: | GPL-2 |
Version: | 0.1.7 |
Built: | 2024-11-28 06:32:40 UTC |
Source: | CRAN |
Tukey g-&-h Mixture
Tools for simulating and fitting finite mixtures of the 4-parameter Tukey g-&-h distributions.
The Tukey g-&-h mixture is highly flexible for modeling multimodal distributions with variable degrees of skewness and kurtosis in the components.
The Quantile Least Mahalanobis Distance estimator QLMDe is used for estimating the parameters of finite Tukey g-&-h mixtures.
QLMDe is an indirect estimator that minimizes the Mahalanobis distance between the sample and model-based quantiles.
A backward-forward stepwise model selection algorithm is provided to find a parsimonious Tukey g-&-h mixture model, conditional on a given number of components; and the optimal number of components within the user-specified range.
Maintainer: Tingting Zhan [email protected] (ORCID) [copyright holder]
Other contributors:
Inna Chervoneva [email protected] (ORCID) [contributor, copyright holder]
# see ?QLMDe
Naive estimates for finite mixture distribution fmx via clustering.
fmx_cluster( x, K, distname = c("GH", "norm", "sn"), constraint = character(), ... )
x |
|
K |
integer scalar, number of mixture components |
distname |
character scalar, name of parametric distribution of the mixture components |
constraint |
character vector, parameters (g and/or h) to be constrained to zero |
... |
additional parameters, currently not in use |
First of all, if the specified number of components K ≥ 2, trimmed k-means clustering with re-assignment will be performed; otherwise, all observations will be considered as one single cluster.
The standard k-means clustering is not used since the heavy tails of the Tukey g-&-h distribution could be mistakenly classified as individual cluster(s).
In each of the one or more clusters, letterValue-based estimates of the Tukey g-&-h distribution (Hoaglin, 2006) are calculated, for any K ≥ 1, serving as the starting values for the QLMD algorithm.
These estimates are provided by function fmx_cluster.
The median and mad will serve as the starting values for μ and σ (or A and B for the Tukey g-&-h distribution, with g = h = 0), for the QLMD algorithm when K = 1.
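As an illustration of why the median and mad are reasonable starting values for the location and scale, the sketch below simulates from a Tukey g-&-h distribution through its standard transformation of a standard normal variate. The helper name tukey_gh is hypothetical and is not a function of this package; only base R is used.

```r
# Hypothetical helper (not part of the package): Tukey g-&-h transformation
# of a standard normal variate z, with location A, scale B, skewness g
# and elongation h.
tukey_gh <- function(z, A = 0, B = 1, g = 0, h = 0) {
  core <- if (g == 0) z else (exp(g * z) - 1) / g
  A + B * core * exp(h * z^2 / 2)
}

set.seed(1)
x <- tukey_gh(rnorm(5000), A = 1, B = 2)  # g = h = 0 reduces to a normal

# Median and MAD as naive starting values for A and B
start <- c(A = median(x), B = mad(x))
start
```

With g = h = 0 the transformation reduces to A + B z, so the median estimates A and the (normal-consistent) mad estimates B.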
Function fmx_cluster returns an fmx object.
Best estimates for finite mixture distribution fmx.
fmx_hybrid(x, test = c("logLik", "CvM", "KS"), ...)
x |
|
test |
character scalar, criteria for selecting the optimal estimates. See Details. |
... |
additional parameters of functions fmx_normix and fmx_cluster |
Function fmx_hybrid compares the Tukey g-&-h mixture estimate provided by function fmx_cluster and the normal mixture estimate provided by function fmx_normix, and selects the one with either the maximum likelihood (test = 'logLik', default), the minimum Cramer-von Mises distance (test = 'CvM'), or the minimum Kolmogorov distance (test = 'KS', see Kolmogorov_fmx).
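The selection rule can be sketched in base R as follows. This is only an illustration under stated assumptions: the two candidate estimates are made-up normal mixtures stored as plain lists (not fmx objects), and the helper mix is hypothetical. It compares the candidates by log-likelihood and by Kolmogorov distance.

```r
set.seed(1)
x <- c(rnorm(400, mean = 1, sd = 0.5), rnorm(600, mean = 2, sd = 0.5))

# Two hypothetical candidate estimates of a two-component normal mixture
candidates <- list(
  a = list(w = c(0.4, 0.6), mean = c(1, 2), sd = c(0.5, 0.5)),
  b = list(w = c(0.5, 0.5), mean = c(0, 3), sd = c(1, 1))
)

# Mixture density or distribution function as a weighted sum over components
mix <- function(fun, q, m) {
  Reduce(`+`, Map(function(w, mu, s) w * fun(q, mu, s), m$w, m$mean, m$sd))
}

# Criterion 1: log-likelihood (larger is better)
loglik <- sapply(candidates, function(m) sum(log(mix(dnorm, x, m))))

# Criterion 2: Kolmogorov distance, the largest gap between the
# empirical and the model distribution functions (smaller is better)
kolmogorov <- sapply(candidates, function(m) {
  n <- length(x)
  Fx <- mix(pnorm, sort(x), m)
  max(pmax(abs(Fx - seq_len(n) / n), abs(Fx - (seq_len(n) - 1) / n)))
})

best_by_loglik <- names(which.max(loglik))
best_by_ks <- names(which.min(kolmogorov))
```

Here candidate a matches the simulating distribution, so both criteria select it.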
Function fmx_hybrid returns an fmx object.
library(fmx)
d1 = fmx('norm', mean = c(1, 2), sd = .5, w = c(.4, .6))
set.seed(100); hist(x1 <- rfmx(n = 1e3L, dist = d1))
fmx_normix(x1, distname = 'norm', K = 2L)
fmx_normix(x1, distname = 'GH', K = 2L)
(d2 = fmx('GH', A = c(1,6), B = 2, g = c(0,.3), h = c(.2,0), w = c(1,2)))
set.seed(100); hist(x2 <- rfmx(n = 1e3L, dist = d2))
fmx_cluster(x2, K = 2L)
fmx_cluster(x2, K = 2L, constraint = c('g1', 'h2'))
fmx_normix(x2, K = 2L, distname = 'GH')
fmx_hybrid(x2, distname = 'GH', K = 2L)
Naive parameter estimates for finite mixture distribution fmx using mixture of normal distributions.
fmx_normix(x, K, distname = c("norm", "GH", "sn"), alpha = 0.05, R = 10L, ...)
x |
|
K |
integer scalar, number of mixture components |
distname |
character scalar, name of parametric distribution of the mixture components |
alpha |
numeric scalar, proportion of observations to be trimmed in trimmed k-means clustering |
R |
integer scalar, number of normalmixEM replicates |
... |
additional parameters, currently not in use |
fmx_normix ... the cluster centers are provided as the starting values of the μ's for the univariate normal mixture fitted by the EM algorithm.
R replicates of normal mixture estimates are obtained, and the one with the maximum likelihood is selected.
Function fmx_normix returns an fmx object.
The quantile least Mahalanobis distance algorithm estimates the parameters of single-component or finite mixture distributions by minimizing the Mahalanobis distance between the vectors of sample and theoretical quantiles. See QLMDp for the default selection of probabilities at which the sample and theoretical quantiles are compared.
The default initial values are estimated based on trimmed k-means clustering with re-assignment.
QLMDe( x, distname = c("GH", "norm", "sn"), K, data.name = deparse1(substitute(x)), constraint = character(), probs = QLMDp(x = x), init = c("logLik", "letterValue", "normix"), tol = .Machine$double.eps^0.25, maxiter = 1000, ... )
x |
|
distname |
character scalar, name of mixture distribution to be fitted. Currently supports |
K |
integer scalar, number of components (e.g., must use |
data.name |
character scalar, name for the observations for user-friendly print out. |
constraint |
character vector, parameters (g and/or h) to be constrained to zero |
probs |
numeric vector, probabilities at which the sample and theoretical quantiles are to be matched. See function QLMDp for details. |
init |
character scalar for the method of initial values selection, or an fmx object of the initial values. See function fmx_hybrid for more details. |
tol, maxiter |
see function vuniroot2 |
... |
additional parameters of optim |
Quantile Least Mahalanobis Distance estimator fits a single-component or finite mixture distribution by minimizing the Mahalanobis distance between the theoretical and observed quantiles, using the empirical quantile variance-covariance matrix quantile_vcov.
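The idea can be illustrated with a minimal base R sketch for a single normal component. This is not the package's implementation: the quantile variance-covariance matrix is approximated here by the usual asymptotic formula with a kernel density estimate, which is an assumption of this sketch, and the name qlmd_objective is hypothetical.

```r
set.seed(1)
x <- rnorm(2000, mean = 3, sd = 2)         # simulated observations
p <- seq(0.05, 0.95, length.out = 15)      # probabilities to match
n <- length(x)
q_obs <- quantile(x, probs = p, names = FALSE)

# Asymptotic covariance of the sample quantiles:
# V[i, j] = (min(p_i, p_j) - p_i * p_j) / (n * f(q_i) * f(q_j)),
# with the density f approximated by a kernel density estimate.
kd <- density(x)
f_q <- approx(kd$x, kd$y, xout = q_obs)$y
V <- (outer(p, p, pmin) - tcrossprod(p)) / (n * tcrossprod(f_q))

# Squared Mahalanobis distance between the observed and model quantiles;
# the standard deviation is optimized on the log scale to keep it positive.
qlmd_objective <- function(par) {
  d <- q_obs - qnorm(p, mean = par[1], sd = exp(par[2]))
  drop(crossprod(d, solve(V, d)))
}

# Median and MAD as starting values
fit <- optim(c(median(x), log(mad(x))), qlmd_objective)
est <- c(mean = fit$par[1], sd = exp(fit$par[2]))
est
```

The same objective extends to mixtures by replacing qnorm with the quantile function of the mixture, which generally has to be obtained by numerical root finding.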
Function QLMDe returns an fmx object.
data(bmi, package = 'mixsmsn')
hist(x <- bmi[[1L]])
QLMDe(x, distname = 'GH', K = 2L)
To compare gh-parsimonious models of Tukey g-&-h mixtures with different numbers of components (up to a user-specified Kmax) and select the optimal number of components.
QLMDe_stepK( x, distname = c("GH", "norm"), data.name = deparse1(substitute(x)), Kmax = 3L, test = c("BIC", "AIC"), direction = c("forward", "backward"), ... )
x |
|
distname, data.name |
character scalars, see parameters of the same names in function QLMDe |
Kmax |
integer scalar |
test |
character scalar, criterion to be used, either Akaike's information criterion AIC, or Bayesian information criterion BIC (default). |
direction |
character scalar, direction of selection in function step_fmx, either 'forward' (default) or 'backward' |
... |
additional parameters |
Function QLMDe_stepK compares the gh-parsimonious models with different numbers of components K ≤ Kmax, and selects the optimal number of components using BIC (default) or AIC.
The forward selection starts with finding the gh-parsimonious model (via function step_fmx) at K = 1.
Let the current number of components be K.
We compare the gh-parsimonious models with K + 1 and K components, respectively, using BIC or AIC.
If K is preferred, then the forward selection is stopped, and K is considered the optimal number of components.
If K + 1 is preferred, then the forward selection is stopped if K + 1 = Kmax; otherwise, update K with K + 1 and repeat the previous steps.
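The control flow of the forward selection can be sketched as below. The function forward_select_K and its argument ic_of_K (a function returning the BIC or AIC of the parsimonious model with a given number of components) are hypothetical names used for illustration only, and keeping the smaller model on ties is an assumption of this sketch.

```r
# Hypothetical sketch of the forward selection over the number of components.
# ic_of_K(K) is assumed to return the information criterion (smaller is
# better) of the parsimonious model with K components.
forward_select_K <- function(ic_of_K, Kmax) {
  K <- 1L
  while (K < Kmax) {
    if (ic_of_K(K) <= ic_of_K(K + 1L)) break  # K preferred: stop
    K <- K + 1L                               # K + 1 preferred: update K
  }
  K
}

# Toy criterion values for K = 1, 2, 3 (made-up numbers)
toy_bic <- c(100, 90, 95)
forward_select_K(function(K) toy_bic[K], Kmax = 3L)
```

Because the search stops at the first K that is preferred over K + 1, it may not reach a smaller criterion value at a larger number of components.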
Function QLMDe_stepK returns an object of S3 class 'stepK', which is a list of selected models (in reversed order) with attribute(s) 'direction' and 'test'.
data(bmi, package = 'mixsmsn')
hist(x <- bmi[[1L]])
QLMDe_stepK(x, distname = 'GH', Kmax = 2L)
A vector of probabilities to be used in Quantile Least Mahalanobis Distance estimation (QLMDe).
QLMDp( from = 0.05, to = 0.95, length.out = 15L, equidistant = c("prob", "quantile"), extra = c(0.005, 0.01, 0.02, 0.03, 0.97, 0.98, 0.99, 0.995), x )
from, to |
numeric scalars, minimum and maximum of the equidistant (in probability or quantile) probabilities. Defaults are 0.05 and 0.95, respectively |
length.out |
non-negative integer scalar, the number of the equidistant (in probability or quantile) probabilities. |
equidistant |
character scalar.
If |
extra |
numeric vector of additional probabilities,
default |
x |
numeric vector of observations, only used when |
The default arguments of function QLMDp return the probabilities c(.005, .01, .02, .03, seq.int(.05, .95, length.out = 15L), .97, .98, .99, .995).
A numeric vector of probabilities to be supplied to parameter probs of the Quantile Least Mahalanobis Distance estimation (QLMDe).
In practice, the length of this probability vector must be equal to or larger than the number of parameters in the distribution model to be estimated.
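The length requirement can be checked with a few lines of base R. The sketch below rebuilds the default probability vector by hand and compares its length with the number of free parameters of an unconstrained K-component Tukey g-&-h mixture, counted here as (A, B, g, h) per component plus K - 1 free mixing weights. The helper n_param is hypothetical.

```r
# Default probabilities, written out by hand
p_default <- c(.005, .01, .02, .03,
               seq.int(.05, .95, length.out = 15L),
               .97, .98, .99, .995)
n_probs <- length(p_default)  # 8 extra + 15 equidistant probabilities

# Hypothetical helper: free parameters of an unconstrained
# K-component Tukey g-&-h mixture
n_param <- function(K) 4L * K + (K - 1L)

K <- 1:5
data.frame(K = K, n_param = n_param(K), enough_probs = n_probs >= n_param(K))
```

Under this count, the default vector is long enough for up to four unconstrained components; constraining some g and/or h parameters to zero reduces the number of parameters accordingly.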
library(fmx)
(d2 = fmx('GH', A = c(1,6), B = 2, g = c(0,.3), h = c(.2,0), w = c(1,2)))
set.seed(100); hist(x2 <- rfmx(n = 1e3L, dist = d2))
# equidistant in probabilities
(p1 = QLMDp())
# equidistant in quantiles
(p2 = QLMDp(equidistant = 'quantile', x = x2))
k-Means Clustering
Re-assign the observations, which are trimmed in the trimmed k-means algorithm, back to the closest cluster as determined by the smallest Mahalanobis distance.
reAssign(x, ...)
## S3 method for class 'tkmeans'
reAssign(x, ...)
x |
a tkmeans object |
... |
potential parameters, currently not in use. |
Given the tkmeans input, the Mahalanobis distance is computed between each trimmed observation and each cluster. Each trimmed observation is assigned to the closest cluster (i.e., the one with the smallest Mahalanobis distance).
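The re-assignment step can be sketched in base R with stats::mahalanobis. The clusters and the trimmed observations below are simulated for illustration and do not come from a tkmeans object; how the package extracts them from its input is not shown here.

```r
set.seed(1)
# Two simulated bivariate clusters (stand-ins for the untrimmed observations)
clusters <- list(
  matrix(rnorm(200, mean = 0), ncol = 2),
  matrix(rnorm(200, mean = 6), ncol = 2)
)

# Hypothetical trimmed observations, one per row
trimmed <- rbind(c(-3, -3), c(9, 9), c(0.5, -4))

# Squared Mahalanobis distance from each trimmed observation (rows)
# to each cluster (columns), using the cluster mean and covariance
d2 <- sapply(clusters, function(cl) {
  mahalanobis(trimmed, center = colMeans(cl), cov = cov(cl))
})

# Assign each trimmed observation to the closest cluster
assignment <- apply(d2, 1, which.min)
assignment
```

Using the Mahalanobis rather than the Euclidean distance accounts for the spread and orientation of each cluster.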
Function reAssign.tkmeans returns an 'reAssign_tkmeans' object, which inherits from tkmeans class.
Either kmeans or tkmeans is slow for big x.
library(tclust)
data(geyser2)
clus = tkmeans(geyser2, k = 3L, alpha = .03)
plot(clus, main = 'Before Re-Assigning')
plot(reAssign(clus), main = 'After Re-Assigning')
gh-parsimonious Model with Fixed Number of Components
To select the gh-parsimonious mixture model, i.e., with some g and/or h parameters equal to zero, conditionally on a fixed number of components K.
step_fmx( object, test = c("BIC", "AIC"), direction = c("forward", "backward"), ... )
object |
fmx object |
test |
character scalar, criterion to be used, either Akaike's information criterion AIC-like, or Bayesian information criterion BIC-like (default). |
direction |
character scalar, |
... |
additional parameters, currently not in use |
The algorithm starts with quantile least Mahalanobis distance estimates of either the full mixture of Tukey g-&-h distributions model, or a constrained model (i.e., some g and/or h parameters equal to zero according to the user input).
Next, each of the non-zero g and/or h parameters is tested using the likelihood ratio test.
If all tested g and/or h parameters are significantly different from zero at the level 0.05, the algorithm is stopped and the initial model is considered gh-parsimonious.
Otherwise, the g or h parameter with the largest p-value is constrained to zero for the next iteration of the algorithm.
The algorithm iterates until only significantly-different-from-zero g and h parameters are retained, which corresponds to the gh-parsimonious Tukey g-&-h mixture model.
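The iteration described above can be sketched as a simple elimination loop. The names backward_constrain and pvalue_fun are hypothetical; pvalue_fun stands in for refitting the constrained model and running the likelihood ratio tests, which this sketch does not implement.

```r
# Hypothetical sketch of the elimination loop.
# pvalue_fun(free, constrained) is assumed to return a named vector of
# likelihood ratio test p-values for the currently free g and h parameters.
backward_constrain <- function(pvalue_fun, params, alpha = 0.05) {
  constrained <- character()
  repeat {
    free <- setdiff(params, constrained)
    if (!length(free)) break
    pv <- pvalue_fun(free, constrained)
    if (all(pv < alpha)) break  # all remaining parameters are significant
    # constrain the parameter with the largest p-value to zero
    constrained <- c(constrained, names(pv)[which.max(pv)])
  }
  constrained
}

# Toy p-values (made-up, and held fixed across iterations for simplicity;
# in practice they would be recomputed after each refit)
toy_p <- c(g1 = 0.60, h1 = 0.001, g2 = 0.01, h2 = 0.30)
backward_constrain(function(free, constrained) toy_p[free], names(toy_p))
```

In this toy run, g1 is constrained first and h2 second, after which the remaining parameters are all significant at the 0.05 level.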
Function step_fmx returns an object of S3 class 'step_fmx', which is a list of selected models (in reversed order) with attribute(s) 'direction' and 'test'.