Title: | Likelihood-Based Intrinsic Dimension Estimators |
---|---|
Description: | Provides functions to estimate the intrinsic dimension of a dataset via likelihood-based approaches. Specifically, the package implements the 'TWO-NN' and 'Gride' estimators and the 'Hidalgo' Bayesian mixture model. In addition, the first reference contains an extended vignette on the usage of the 'TWO-NN' and 'Hidalgo' models. References: Denti (2023, <doi:10.18637/jss.v106.i09>); Allegra et al. (2020, <doi:10.1038/s41598-020-72222-0>); Denti et al. (2022, <doi:10.1038/s41598-022-20991-1>); Facco et al. (2017, <doi:10.1038/s41598-017-11873-y>); Santos-Fernandez et al. (2021, <doi:10.1038/s41598-022-20991-1>). |
Authors: | Francesco Denti [aut, cre, cph], Andrea Gilardi [aut] |
Maintainer: | Francesco Denti <[email protected]> |
License: | MIT + file LICENSE |
Version: | 1.1.0 |
Built: | 2024-11-12 06:54:48 UTC |
Source: | CRAN |
Gride
Use this method without the .gride_bayes suffix.
It displays the traceplot of the chain generated with Metropolis-Hastings updates to visually assess mixing and convergence.
Alternatively, it is possible to plot the posterior density.
## S3 method for class 'gride_bayes'
autoplot(
  object,
  traceplot = FALSE,
  title = "Bayesian Gride - Posterior distribution",
  ...
)
object |
object of class |
traceplot |
logical. If |
title |
optional string to display as title. |
... |
other arguments passed to specific methods. |
object of class ggplot. It could represent the traceplot of the posterior simulations for the Bayesian Gride model (traceplot = TRUE) or a density plot of the simulated posterior distribution (traceplot = FALSE).
Other autoplot methods: autoplot.Hidalgo(), autoplot.twonn_bayes(), autoplot.twonn_linfit(), autoplot.twonn_mle()
Gride estimates
Use this method without the .gride_evolution suffix.
It plots the evolution of the id estimates as a function of the average distance from the furthest NN of each point.
## S3 method for class 'gride_evolution'
autoplot(object, title = "Gride Evolution", ...)
object |
an object of class |
title |
an optional string to customize the title of the plot. |
... |
other arguments passed to specific methods. |
object of class ggplot. It displays the evolution of the Gride maximum likelihood estimates as a function of the average distance from the n2-th NN.
Gride
Use this method without the .gride_mle suffix.
It displays the density plot of the sample obtained via parametric bootstrap for the Gride model.
## S3 method for class 'gride_mle'
autoplot(object, title = "MLE Gride - Bootstrap sample", ...)
object |
object of class |
title |
title for the plot. |
... |
other arguments passed to specific methods. |
object of class ggplot. It displays the density plot of the sample generated via parametric bootstrap to help the visual assessment of the uncertainty of the id estimates.
Hidalgo function
Use this method without the .Hidalgo suffix.
It produces several plots to explore the output of the Hidalgo model.
## S3 method for class 'Hidalgo'
autoplot(
  object,
  type = c("raw_chains", "point_estimates", "class_plot", "clustering"),
  class_plot_type = c("histogram", "density", "boxplot", "violin"),
  class = NULL,
  psm = NULL,
  clust = NULL,
  title = NULL,
  ...
)
object |
object of class |
type |
character that indicates the requested type of plot. It can be:
|
class_plot_type |
if |
class |
factor variable used to stratify observations according to
their the |
psm |
posterior similarity matrix containing the posterior probability of coclustering. |
clust |
vector containing the cluster membership labels. |
title |
character string used as title of the plot. |
... |
other arguments passed to specific methods. |
a ggplot2 object produced by the function according to the type chosen.
More precisely, if
type = "raw_chains"
The function produces the traceplots of the parameters d_k, for k = 1, ..., K. The ergodic means for all the chains are superimposed. The K chains that are plotted are not post-processed and are therefore subject to label switching;
type = "point_estimates"
The function returns two scatterplots displaying the posterior mean and median id for each observation, after the MCMC has been postprocessed to handle label switching;
type = "class_plot"
The function returns a plot that can be used to visually assess the relationship between the posterior id estimates and an external, categorical variable. The type of plot varies according to the specification of class_plot_type, and it can be either a set of boxplots or violin plots or a collection of overlapping densities or histograms;
type = "clustering"
The function displays the posterior similarity matrix, to allow the study of the clustering structure present in the data estimated via the mixture model. Rows and columns can be stratified by an exogenous class and/or a clustering structure.
Other autoplot methods: autoplot.gride_bayes(), autoplot.twonn_bayes(), autoplot.twonn_linfit(), autoplot.twonn_mle()
TWO-NN model estimated via the Bayesian approach
Use this method without the .twonn_bayes suffix.
The function returns the density plot of the posterior distribution computed with the bayes method.
## S3 method for class 'twonn_bayes'
autoplot(
  object,
  plot_low = 0,
  plot_upp = NULL,
  by = 0.05,
  title = "Bayesian TWO-NN",
  ...
)
object |
object of class |
plot_low |
lower bound of the interval on which the posterior density is plotted. |
plot_upp |
upper bound of the interval on which the posterior density is plotted. |
by |
step-size at which the sequence spanning the interval is incremented. |
title |
character string used as title of the plot. |
... |
other arguments passed to specific methods. |
ggplot2 object displaying the posterior distribution of the intrinsic dimension parameter.
Other autoplot methods: autoplot.Hidalgo(), autoplot.gride_bayes(), autoplot.twonn_linfit(), autoplot.twonn_mle()
TWO-NN model estimated via least squares
Use this method without the .twonn_linfit suffix.
The function returns the representation of the linear regression that is fitted with the linfit method.
## S3 method for class 'twonn_linfit'
autoplot(object, title = "TWO-NN Linear Fit", ...)
object |
object of class |
title |
string used as title of the plot. |
... |
other arguments passed to specific methods. |
a ggplot2 object displaying the goodness of the linear fit of the TWO-NN model.
Other autoplot methods: autoplot.Hidalgo(), autoplot.gride_bayes(), autoplot.twonn_bayes(), autoplot.twonn_mle()
TWO-NN model estimated via the Maximum Likelihood approach
Use this method without the .twonn_mle suffix.
The function returns the point estimate along with the confidence bands computed via the mle method.
## S3 method for class 'twonn_mle'
autoplot(object, title = "MLE TWO-NN", ...)
object |
object of class |
title |
character string used as title of the plot. |
... |
other arguments passed to specific methods. |
ggplot2 object displaying the point estimate and confidence interval of the id parameter obtained via the maximum likelihood approach.
Other autoplot methods: autoplot.Hidalgo(), autoplot.gride_bayes(), autoplot.twonn_bayes(), autoplot.twonn_linfit()
Hidalgo model
Collection of functions used to extract meaningful information from the object returned by the function Hidalgo.
posterior_means(x)
initial_values(x)
posterior_medians(x)
credible_intervals(x, alpha = 0.95)
x |
object of class |
alpha |
posterior probability contained in the computed credible interval. |
posterior_means returns the observation-specific id posterior means estimated with Hidalgo.
initial_values returns a list with the parameter specification passed to the model.
posterior_medians returns the observation-specific id posterior medians estimated with Hidalgo.
credible_intervals returns the observation-specific credible intervals for a specific probability alpha.
The function computes the posterior similarity (coclustering) matrix (psm) and estimates a representative partition of the observations from the MCMC output. The user can provide the desired number of clusters or estimate an optimal clustering solution by minimizing a loss function on the space of the partitions.
In the latter case, the function uses the package salso (Dahl et al., 2022), which the user needs to load.
clustering(
  object,
  clustering_method = c("dendrogram", "salso"),
  K = 2,
  nCores = 1,
  ...
)

## S3 method for class 'hidalgo_psm'
print(x, ...)

## S3 method for class 'hidalgo_psm'
plot(x, ...)
object |
object of class |
clustering_method |
character indicating the method to use to perform clustering. It can be
|
K |
number of clusters to recover by thresholding the dendrogram obtained from the psm. |
nCores |
parameter for the |
... |
ignored. |
x |
object of class |
list containing the posterior similarity matrix (psm) and the estimated partition clust.
D. B. Dahl, D. J. Johnson, and P. Müller (2022), "Search Algorithms and Loss Functions for Bayesian Clustering", Journal of Computational and Graphical Statistics, doi:10.1080/10618600.2022.2069779.
David B. Dahl, Devin J. Johnson and Peter Müller (2022). "salso: Search Algorithms and Loss Functions for Bayesian Clustering". R package version 0.3.0. https://CRAN.R-project.org/package=salso
library(salso)
X <- replicate(5, rnorm(500))
X[1:250, 1:2] <- 0
h_out <- Hidalgo(X)
clustering(h_out)
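The posterior similarity matrix and the dendrogram-based partition described for clustering() can be illustrated with a small base-R sketch. The toy label matrix z, the helper psm_from_labels(), and the use of average linkage on 1 - psm are assumptions made for illustration only; they are not taken from the package source.

```r
# Toy MCMC output (hypothetical): 200 iterations (rows) x 6 observations (columns).
# Observations 1-3 mostly carry label 1, observations 4-6 mostly label 2.
set.seed(1)
z <- cbind(
  matrix(sample(1:2, 200 * 3, replace = TRUE, prob = c(0.9, 0.1)), 200, 3),
  matrix(sample(1:2, 200 * 3, replace = TRUE, prob = c(0.1, 0.9)), 200, 3)
)

# Posterior similarity matrix: share of iterations in which two
# observations are assigned to the same mixture component.
psm_from_labels <- function(z) {
  n <- ncol(z)
  psm <- matrix(0, n, n)
  for (t in seq_len(nrow(z))) {
    psm <- psm + outer(z[t, ], z[t, ], "==")
  }
  psm / nrow(z)
}

psm <- psm_from_labels(z)

# Cut a dendrogram built on the dissimilarity 1 - psm into K = 2 groups
# (average linkage is an assumption of this sketch).
clust <- cutree(hclust(as.dist(1 - psm), method = "average"), k = 2)
clust
```

Entries of psm close to 1 mark pairs of observations that are almost always allocated together, which is what the dendrogram thresholding exploits.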
The function compute_mus computes the ratios of distances between nearest neighbors (NNs) of generic order, denoted as mu(n_1,n_2).
This quantity is at the core of all the likelihood-based methods contained in the package.
compute_mus(X = NULL, dist_mat = NULL, n1 = 1, n2 = 2, Nq = FALSE, q = 3)

## S3 method for class 'mus'
print(x, ...)

## S3 method for class 'mus_Nq'
print(x, ...)

## S3 method for class 'mus'
plot(x, range_d = NULL, ...)
X |
a dataset with |
dist_mat |
a distance matrix computed between |
n1 |
order of the first NN considered. Default is 1. |
n2 |
order of the second NN considered. Default is 2. |
Nq |
logical indicator. If |
q |
integer, number of NN considered to build |
x |
object of class |
... |
ignored. |
range_d |
a sequence of values for which the generalized ratios density
is superimposed to the histogram of |
the principal output of this function is a vector containing the ratio statistics, an object of class mus. The length of the vector is equal to the number of observations considered, unless ties are present in the dataset. In that case, the duplicates are removed. Optionally, if Nq is TRUE, the function returns an object of class mus_Nq, a list containing both the ratio statistics mus and the adjacency matrix NQ.
Facco E, D'Errico M, Rodriguez A, Laio A (2017). "Estimating the intrinsic dimension of datasets by a minimal neighborhood information." Scientific Reports, 7(1). ISSN 20452322, doi:10.1038/s41598-017-11873-y.
Denti F, Doimo D, Laio A, Mira A (2022). "The generalized ratios intrinsic dimension estimator." Scientific Reports, 12(20005). ISSN 20452322, doi:10.1038/s41598-022-20991-1.
X <- replicate(2, rnorm(1000))
mu <- compute_mus(X, n1 = 1, n2 = 2)
mudots <- compute_mus(X, n1 = 4, n2 = 8)
pre_hidalgo <- compute_mus(X, n1 = 4, n2 = 8, Nq = TRUE, q = 3)
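To make the ratio statistic concrete, here is a minimal base-R sketch that computes the distance to the n2-th NN divided by the distance to the n1-th NN for every observation. The helper mu_ratios() is hypothetical and uses Euclidean distances; unlike compute_mus, it does not handle ties or duplicated points.

```r
# Ratio of the distances from the n2-th and the n1-th nearest neighbor,
# computed for every row of a data matrix (base R only).
mu_ratios <- function(X, n1 = 1, n2 = 2) {
  D <- as.matrix(dist(X))        # Euclidean distances between rows
  apply(D, 1, function(d) {
    s <- sort(d)                 # s[1] is the zero distance to the point itself
    s[n2 + 1] / s[n1 + 1]
  })
}

set.seed(1)
X <- replicate(2, rnorm(200))
mu <- mu_ratios(X, n1 = 1, n2 = 2)
summary(mu)
```

By construction every ratio is at least 1, since the n2-th neighbor is never closer than the n1-th one when n2 > n1.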
Density function and random number generator for the Generalized Ratio distribution with NN orders equal to n1 and n2.
See Denti et al., 2022 for more details.
rgera(nsim, n1 = 1, n2 = 2, d)
dgera(x, n1 = 1, n2 = 2, d, log = FALSE)
nsim |
integer, the number of observations to generate. |
n1 |
order of the first NN considered. Default is 1. |
n2 |
order of the second NN considered. Default is 2. |
d |
value of the intrinsic dimension. |
x |
vector of quantiles. |
log |
logical, if |
dgera gives the density. rgera returns a vector of random observations sampled from the generalized ratio distribution.
Denti F, Doimo D, Laio A, Mira A (2022). "The generalized ratios intrinsic dimension estimator." Scientific Reports, 12(20005). ISSN 20452322, doi:10.1038/s41598-022-20991-1.
draws <- rgera(100, 3, 5, 2)
density <- dgera(3, 3, 5, 2)
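For readers who want to see the distribution spelled out, the sketch below writes the generalized ratio density as d (mu^d - 1)^(n2 - n1 - 1) / (mu^((n2 - 1) d + 1) B(n2 - n1, n1)) for mu > 1, which reduces to a Pareto(1, d) density when n1 = 1 and n2 = 2. The helpers gr_density() and gr_sample() are hypothetical stand-ins written in base R, not the package's implementation of dgera and rgera; the sampler relies on the assumption that mu^(-d) follows a Beta(n1, n2 - n1) law under the model.

```r
# Generalized ratio density for x > 1, evaluated on the log scale.
gr_density <- function(x, n1 = 1, n2 = 2, d, log = FALSE) {
  ld <- rep(-Inf, length(x))     # log-density is -Inf outside the support
  ok <- x > 1
  lx <- log(x[ok])
  k <- n2 - n1 - 1
  # log(x^d - 1) rewritten as d * log(x) + log(1 - x^(-d)) for stability
  core <- if (k > 0) k * (d * lx + log1p(-exp(-d * lx))) else 0
  ld[ok] <- log(d) + core - ((n2 - 1) * d + 1) * lx - lbeta(n2 - n1, n1)
  if (log) ld else exp(ld)
}

# Sampler based on the Beta representation of mu^(-d).
gr_sample <- function(nsim, n1 = 1, n2 = 2, d) {
  rbeta(nsim, n1, n2 - n1)^(-1 / d)
}

set.seed(10)
draws <- gr_sample(100, n1 = 3, n2 = 5, d = 2)

# Sanity check: the density should integrate to one over (1, Inf).
total_mass <- integrate(gr_density, 1, Inf, n1 = 3, n2 = 5, d = 2)$value
total_mass
```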
Gride: the Generalized Ratios ID Estimator
The function can fit the Generalized ratios ID estimator under both the frequentist and the Bayesian frameworks, depending on the specification of the argument method. The model is the direct extension of the TWO-NN method presented in Facco et al., 2017. See also Denti et al., 2022 for more details.
gride(
  X = NULL,
  dist_mat = NULL,
  mus_n1_n2 = NULL,
  method = c("mle", "bayes"),
  n1 = 1,
  n2 = 2,
  alpha = 0.95,
  nsim = 5000,
  upper_D = 50,
  burn_in = 2000,
  sigma = 0.5,
  start_d = NULL,
  a_d = 1,
  b_d = 1,
  ...
)

## S3 method for class 'gride_bayes'
print(x, ...)

## S3 method for class 'gride_bayes'
summary(object, ...)

## S3 method for class 'summary.gride_bayes'
print(x, ...)

## S3 method for class 'gride_bayes'
plot(x, ...)

## S3 method for class 'gride_mle'
print(x, ...)

## S3 method for class 'gride_mle'
summary(object, ...)

## S3 method for class 'summary.gride_mle'
print(x, ...)

## S3 method for class 'gride_mle'
plot(x, ...)
X |
data matrix with |
dist_mat |
distance matrix computed between the |
mus_n1_n2 |
vector of generalized order NN distance ratios. |
method |
the chosen estimation method. It can be
|
n1 |
order of the first NN considered. Default is 1. |
n2 |
order of the second NN considered. Default is 2. |
alpha |
confidence level (for |
nsim |
number of bootstrap samples or posterior simulation to consider. |
upper_D |
nominal dimension of the dataset (upper bound for the maximization routine). |
burn_in |
number of iterations to discard from the MCMC sample.
Applicable if |
sigma |
standard deviation of the Gaussian proposal used in the MH step.
Applicable if |
start_d |
initial value for the MCMC chain. If |
a_d |
shape parameter of the Gamma prior distribution for |
b_d |
rate parameter of the Gamma prior distribution for |
... |
other arguments passed to specific methods. |
x |
object of class |
object |
object of class |
a list containing the id estimate obtained with the Gride method, along with the relative confidence or credible interval (object est). The class of the output object changes according to the chosen method. Similarly, the remaining elements stored in the list report a summary of the key quantities involved in the estimation process, e.g., the NN orders n1 and n2.
Facco E, D'Errico M, Rodriguez A, Laio A (2017). "Estimating the intrinsic dimension of datasets by a minimal neighborhood information." Scientific Reports, 7(1). ISSN 20452322, doi:10.1038/s41598-017-11873-y.
Denti F, Doimo D, Laio A, Mira A (2022). "The generalized ratios intrinsic dimension estimator." Scientific Reports, 12(20005). ISSN 20452322, doi:10.1038/s41598-022-20991-1.
X <- replicate(2, rnorm(500))
dm <- as.matrix(dist(X, method = "manhattan"))
res <- gride(X, nsim = 500)
res
plot(res)
gride(dist_mat = dm, method = "bayes", upper_D = 10, nsim = 500, burn_in = 100)
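The role of upper_D as the upper bound of the maximization can be illustrated with a short base-R sketch of maximum likelihood estimation on simulated ratios. The function gride_loglik() is a hypothetical helper, the ratios are simulated through the Beta representation of mu^(-d), and optimize() is used here as a plausible one-dimensional optimizer; none of this is claimed to reproduce the internals of gride().

```r
# Log-likelihood of the generalized ratios for a candidate dimension d.
gride_loglik <- function(d, mu, n1, n2) {
  lmu <- log(mu)
  sum(log(d) +
        (n2 - n1 - 1) * (d * lmu + log1p(-exp(-d * lmu))) -
        ((n2 - 1) * d + 1) * lmu -
        lbeta(n2 - n1, n1))
}

set.seed(42)
n1 <- 2
n2 <- 4
d_true <- 3
# Simulated ratios: mu^(-d) is Beta(n1, n2 - n1) distributed under the model.
mu <- rbeta(2000, n1, n2 - n1)^(-1 / d_true)

# Maximize over (0, upper_D), mirroring the role of the upper_D argument.
upper_D <- 50
fit <- optimize(gride_loglik, interval = c(1e-3, upper_D),
                mu = mu, n1 = n1, n2 = n2, maximum = TRUE)
d_hat <- fit$maximum
d_hat
```

Because the log-likelihood is concave in d, a bounded one-dimensional search is enough; the bound only needs to exceed the nominal dimension of the data.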
Gride evolution based on Maximum Likelihood Estimation
The function allows the study of the evolution of the id estimates as a function of the scale of a dataset. A scale-dependent analysis is essential to identify the correct number of relevant directions in noisy data. To increase the average distance from the second NN (and thus the average neighborhood size) involved in the estimation, the function computes a sequence of Gride models with increasing NN orders, n1 and n2.
See also Denti et al., 2022 for more details.
gride_evolution(X, vec_n1, vec_n2, upp_bound = 50)

## S3 method for class 'gride_evolution'
print(x, ...)

## S3 method for class 'gride_evolution'
plot(x, ...)
X |
data matrix with |
vec_n1 |
vector of integers, containing the smaller NN orders considered in the evolution. |
vec_n2 |
vector of integers, containing the larger NN orders considered in the evolution. |
upp_bound |
upper bound for the interval used in the numerical
optimization (via |
x |
an object of class |
... |
other arguments passed to specific methods. |
list containing the Gride evolution, the corresponding NN distance ratios, the average n2-th NN order distances, and the NN orders considered.
the function prints a summary of the Gride evolution to console.
Denti F, Doimo D, Laio A, Mira A (2022). "The generalized ratios intrinsic dimension estimator." Scientific Reports, 12(20005). ISSN 20452322, doi:10.1038/s41598-022-20991-1.
X <- replicate(5, rnorm(10000, 0, .1))
gride_evolution(X = X, vec_n1 = 2^(0:5), vec_n2 = 2^(1:6))
Hidalgo model
The function fits the Heterogeneous intrinsic dimension algorithm, developed in Allegra et al., 2020. The model is a Bayesian mixture of Pareto distributions with a modified likelihood to induce homogeneity across neighboring observations. The model can segment the observations into multiple clusters characterized by different intrinsic dimensions, which makes it possible to capture hidden patterns in the data. For more details on the algorithm, refer to Allegra et al., 2020. For an example of application to basketball data, see Santos-Fernandez et al., 2021.
Hidalgo(
  X = NULL,
  dist_mat = NULL,
  K = 10,
  nsim = 5000,
  burn_in = 5000,
  thinning = 1,
  verbose = TRUE,
  q = 3,
  xi = 0.75,
  alpha_Dirichlet = 0.05,
  a0_d = 1,
  b0_d = 1,
  prior_type = c("Conjugate", "Truncated", "Truncated_PointMass"),
  D = NULL,
  pi_mass = 0.5
)

## S3 method for class 'Hidalgo'
print(x, ...)

## S3 method for class 'Hidalgo'
plot(x, type = c("A", "B", "C"), class = NULL, ...)

## S3 method for class 'Hidalgo'
summary(object, ...)

## S3 method for class 'summary.Hidalgo'
print(x, ...)
X |
data matrix with |
dist_mat |
distance matrix computed between the |
K |
integer, number of mixture components. |
nsim |
number of MCMC iterations to run. |
burn_in |
number of MCMC iterations to discard as burn-in period. |
thinning |
integer indicating the thinning interval. |
verbose |
logical, should the progress of the sampler be printed? |
q |
integer, first local homogeneity parameter. Default is 3. |
xi |
real number between 0 and 1, second local homogeneity parameter. Default is 0.75. |
alpha_Dirichlet |
parameter of the symmetric Dirichlet prior on the mixture weights. Default is 0.05, inducing a sparse mixture. Values that are too small (i.e., lower than 0.005) may cause underflow. |
a0_d |
shape parameter of the Gamma prior on |
b0_d |
rate parameter of the Gamma prior on |
prior_type |
character, type of Gamma prior on
|
D |
integer, the maximal dimension of the dataset. |
pi_mass |
probability placed a priori on |
x |
object of class |
... |
other arguments passed to specific methods. |
type |
character that indicates the type of plot that is requested. It can be:
|
class |
factor variable used to stratify observations according to
their the |
object |
object of class |
object of class Hidalgo, which is a list containing
cluster_prob
chains of the posterior mixture weights;
membership_labels
chains of the membership labels for all the observations;
id_raw
chains of the K intrinsic dimension parameters, one per mixture component;
id_postpr
a chain for each observation, corrected for label switching;
id_summary
a matrix containing, for each observation, the value of the posterior mean and the 5%, 25%, 50%, 75%, 95% quantiles;
recap
a list with the objects and specifications passed to the function used in the estimation.
Allegra M, Facco E, Denti F, Laio A, Mira A (2020). “Data segmentation based on the local intrinsic dimension.” Scientific Reports, 10(1), 1–27. ISSN 20452322, doi:10.1038/s41598-020-72222-0.
Santos-Fernandez E, Denti F, Mengersen K, Mira A (2021). “The role of intrinsic dimension in high-resolution player tracking data – Insights in basketball.” Annals of Applied Statistics - Forthcoming, – ISSN 2331-8422, 2002.04148, doi:10.1038/s41598-022-20991-1
See id_by_class and clustering to understand how to further postprocess the results.
set.seed(1234)
X <- replicate(5, rnorm(500))
X[1:250, 1:2] <- 0
X[1:250, ] <- X[1:250, ] + 4
oracle <- rep(1:2, rep(250, 2))
# this is just a short example
# increase the number of iterations to improve mixing and convergence
h_out <- Hidalgo(X, nsim = 500, burn_in = 500)
plot(h_out, type = "B")
id_by_class(h_out, oracle)
id by an external categorical variable
The function computes summary statistics (mean, median, and standard deviation) of the post-processed chains of the intrinsic dimension stratified by an external categorical variable.
id_by_class(object, class)

## S3 method for class 'hidalgo_class'
print(x, ...)
object |
object of class |
class |
factor according to which the observations should be stratified. |
x |
object of class |
... |
other arguments passed to specific methods. |
a data.frame containing the posterior id means, medians, and standard deviations stratified by the levels of the variable class.
X <- replicate(5, rnorm(500))
X[1:250, 1:2] <- 0
oracle <- rep(1:2, rep(250, 2))
h_out <- Hidalgo(X)
id_by_class(h_out, oracle)
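The kind of stratified summary returned by id_by_class() can be mimicked on toy data with base R. In the sketch below, id_chains is a hypothetical matrix of post-processed chains (iterations in rows, observations in columns) and cl is a hypothetical grouping factor; pooling all draws within a level before summarizing is an assumption of this example and may differ from what the package does internally.

```r
# Hypothetical post-processed chains: 300 iterations x 10 observations.
set.seed(7)
id_chains <- cbind(
  matrix(rnorm(300 * 5, mean = 2, sd = 0.1), 300, 5),
  matrix(rnorm(300 * 5, mean = 5, sd = 0.1), 300, 5)
)
cl <- factor(rep(c("a", "b"), each = 5))

# Pool the draws of all observations sharing a level, then summarize.
by_class <- do.call(rbind, lapply(levels(cl), function(l) {
  draws <- as.vector(id_chains[, cl == l])
  data.frame(class = l,
             mean = mean(draws),
             median = median(draws),
             sd = sd(draws))
}))
by_class
```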
The function creates a three-dimensional dataset with coordinates following the Swiss roll mapping, transforming random uniform data points sampled on the interval (0,10).
Swissroll(n)
n |
number of observations contained in the output dataset. |
a three-dimensional data.frame containing the coordinates of the points generated via the Swiss roll mapping.
Data <- Swissroll(1000)
TWO-NN estimator
The function can fit the two-nearest neighbor estimator within the maximum likelihood and the Bayesian frameworks. Also, one can obtain the estimates using least squares estimation, depending on the specification of the argument method. This model was originally presented in Facco et al., 2017. See also Denti et al., 2022 for more details.
twonn(
  X = NULL,
  dist_mat = NULL,
  mus = NULL,
  method = c("mle", "linfit", "bayes"),
  alpha = 0.95,
  c_trimmed = 0.01,
  unbiased = TRUE,
  a_d = 0.001,
  b_d = 0.001,
  ...
)

## S3 method for class 'twonn_bayes'
print(x, ...)

## S3 method for class 'twonn_bayes'
summary(object, ...)

## S3 method for class 'summary.twonn_bayes'
print(x, ...)

## S3 method for class 'twonn_bayes'
plot(x, plot_low = 0.001, plot_upp = NULL, by = 0.05, ...)

## S3 method for class 'twonn_linfit'
print(x, ...)

## S3 method for class 'twonn_linfit'
summary(object, ...)

## S3 method for class 'summary.twonn_linfit'
print(x, ...)

## S3 method for class 'twonn_linfit'
plot(x, ...)

## S3 method for class 'twonn_mle'
print(x, ...)

## S3 method for class 'twonn_mle'
summary(object, ...)

## S3 method for class 'summary.twonn_mle'
print(x, ...)

## S3 method for class 'twonn_mle'
plot(x, ...)
X |
data matrix with |
dist_mat |
distance matrix computed between the |
mus |
vector of second to first NN distance ratios. |
method |
chosen estimation method. It can be
|
alpha |
the confidence level (for |
c_trimmed |
the proportion of trimmed observations. |
unbiased |
logical, applicable when |
a_d |
shape parameter of the Gamma prior on the parameter |
b_d |
rate parameter of the Gamma prior on the parameter |
... |
ignored. |
x |
object of class |
object |
object of class |
plot_low |
lower bound of the interval on which the posterior density is plotted. |
plot_upp |
upper bound of the interval on which the posterior density is plotted. |
by |
step-size at which the sequence spanning the interval is incremented. |
list characterized by a class type that depends on the method chosen. Regardless of the method, the output list always contains the object est, which provides the estimated intrinsic dimension along with uncertainty quantification. The remaining objects vary with the estimation method. In particular, if
method = "mle"
the output reports the MLE and the relative confidence interval;
method = "linfit"
the output includes the lm() object used for the computation;
method = "bayes"
the output contains the (1 + alpha) / 2 and (1 - alpha) / 2 quantiles, mean, mode, and median of the posterior distribution of d.
Facco E, D'Errico M, Rodriguez A, Laio A (2017). "Estimating the intrinsic dimension of datasets by a minimal neighborhood information." Scientific Reports, 7(1). ISSN 20452322, doi:10.1038/s41598-017-11873-y.
Denti F, Doimo D, Laio A, Mira A (2022). "The generalized ratios intrinsic dimension estimator." Scientific Reports, 12(20005). ISSN 20452322, doi:10.1038/s41598-022-20991-1.
# dataset with 1000 observations and id = 2
X <- replicate(2, rnorm(1000))
twonn(X)
# dataset with 1000 observations and id = 3
Y <- replicate(3, runif(1000))
# Bayesian and least squares estimate from distance matrix
dm <- as.matrix(dist(Y, method = "manhattan"))
twonn(dist_mat = dm, method = "bayes")
twonn(dist_mat = dm, method = "linfit")
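The three estimation routes can be sketched in base R on simulated ratios. Under the TWO-NN model the ratios follow a Pareto(1, d) distribution, which yields a closed-form MLE, a conjugate Gamma posterior, and a linear relation between -log(1 - F(mu)) and log(mu). The variable names below and the exact trimming rule are assumptions made for illustration; this is not the package's implementation of twonn().

```r
set.seed(123)
d_true <- 2
n <- 1000
# Simulate Pareto(1, d) ratios by inverting the CDF 1 - mu^(-d).
mu <- runif(n)^(-1 / d_true)
s <- sum(log(mu))

# Maximum likelihood estimate and its bias-corrected version.
d_mle <- n / s
d_unbiased <- (n - 1) / s

# Conjugate Bayesian update: Gamma(a_d, b_d) prior gives a Gamma posterior.
a_d <- 0.001
b_d <- 0.001
alpha <- 0.95
cred_int <- qgamma(c((1 - alpha) / 2, (1 + alpha) / 2),
                   shape = a_d + n, rate = b_d + s)

# Least squares through the origin on the empirical CDF,
# after dropping the largest 1% of the ratios (cf. c_trimmed).
mu_sorted <- sort(mu)
F_emp <- seq_len(n) / n
keep <- seq_len(floor(n * (1 - 0.01)))
x <- log(mu_sorted[keep])
y <- -log(1 - F_emp[keep])
d_linfit <- unname(coef(lm(y ~ 0 + x)))

c(mle = d_mle, unbiased = d_unbiased, linfit = d_linfit)
```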
TWO-NN evolution with halving steps or vector of proportions
The estimation of the id is related to the scale of the dataset. To escape the local reach of the TWO-NN estimator, Facco et al. (2017) proposed to subsample the original dataset in order to induce greater distances between the data points. By investigating the estimates' evolution as a function of the size of the neighborhood, it is possible to obtain information about the validity of the modeling assumptions and the robustness of the model in the presence of noise.
twonn_decimated(
  X,
  method = c("steps", "proportions"),
  steps = 0,
  proportions = 1,
  seed = NULL
)
X |
data matrix with |
method |
method to use for decimation:
|
steps |
number of times the dataset is halved. |
proportions |
vector containing the fractions of the dataset to be considered. |
seed |
random seed controlling the sequence of sub-sampled observations. |
list containing the TWO-NN evolution (maximum likelihood estimation and confidence intervals), the average distance from the second NN, and the vector of proportions that were considered. According to the chosen estimation method, it is accompanied by the vector of proportions or halving steps considered.
Facco E, D'Errico M, Rodriguez A, Laio A (2017). "Estimating the intrinsic dimension of datasets by a minimal neighborhood information." Scientific Reports, 7(1). ISSN 20452322, doi:10.1038/s41598-017-11873-y.
Denti F, Doimo D, Laio A, Mira A (2022). "The generalized ratios intrinsic dimension estimator." Scientific Reports, 12(20005). ISSN 20452322, doi:10.1038/s41598-022-20991-1.
TWO-NN evolution with halving steps or vector of proportions
The estimation of the id is related to the scale of the dataset. To escape the local reach of the TWO-NN estimator, Facco et al. (2017) proposed to subsample the original dataset in order to induce greater distances between the data points. By investigating the estimates' evolution as a function of the size of the neighborhood, it is possible to obtain information about the validity of the modeling assumptions and the robustness of the model in the presence of noise.
twonn_decimation(
  X,
  method = c("steps", "proportions"),
  steps = 0,
  proportions = 1,
  seed = NULL
)

## S3 method for class 'twonn_dec_prop'
print(x, ...)

## S3 method for class 'twonn_dec_prop'
plot(x, CI = FALSE, proportions = FALSE, ...)

## S3 method for class 'twonn_dec_by'
print(x, ...)

## S3 method for class 'twonn_dec_by'
plot(x, CI = FALSE, steps = FALSE, ...)
X |
data matrix with |
method |
method to use for decimation:
|
steps |
logical, if |
proportions |
logical, if |
seed |
random seed controlling the sequence of sub-sampled observations. |
x |
object of class |
... |
ignored. |
CI |
logical, if |
list containing the TWO-NN evolution (maximum likelihood estimation and confidence intervals), the average distance from the second NN, and the vector of proportions that were considered. According to the chosen estimation method, it is accompanied by the vector of proportions or halving steps considered.
Facco E, D'Errico M, Rodriguez A, Laio A (2017). "Estimating the intrinsic dimension of datasets by a minimal neighborhood information." Scientific Reports, 7(1). ISSN 20452322, doi:10.1038/s41598-017-11873-y.
Denti F, Doimo D, Laio A, Mira A (2022). "The generalized ratios intrinsic dimension estimator." Scientific Reports, 12(20005). ISSN 20452322, doi:10.1038/s41598-022-20991-1.
X <- replicate(4, rnorm(1000))
twonn_decimation(X, method = "proportions", proportions = c(1, .5, .2, .1, .01))
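The decimation idea can be sketched in a few lines of base R: re-estimate the id on random subsamples of decreasing size and inspect how the estimate moves. The helper twonn_sketch() is hypothetical and uses the bias-corrected TWO-NN MLE with Euclidean distances; drawing each subsample independently from the full dataset is an assumption of this example, not a statement about how twonn_decimation() subsamples.

```r
# TWO-NN estimate (bias-corrected MLE) from a data matrix, base R only.
twonn_sketch <- function(X) {
  D <- as.matrix(dist(X))
  mu <- apply(D, 1, function(d) {
    s <- sort(d)               # s[1] is the zero self-distance
    s[3] / s[2]                # second NN distance over first NN distance
  })
  (nrow(X) - 1) / sum(log(mu))
}

set.seed(99)
X <- replicate(4, rnorm(1000))
proportions <- c(1, 0.5, 0.2, 0.1)

# Re-estimate the id on random subsamples of decreasing size.
evolution <- sapply(proportions, function(p) {
  idx <- sample(nrow(X), size = floor(p * nrow(X)))
  twonn_sketch(X[idx, , drop = FALSE])
})
data.frame(proportion = proportions, id = evolution)
```

Smaller proportions correspond to larger average distances between neighbors, so the resulting table is a crude analogue of the evolution reported by the function.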