Package 'intRinsic' reference manual

Title:	Likelihood-Based Intrinsic Dimension Estimators
Description:	Provides functions to estimate the intrinsic dimension of a dataset via likelihood-based approaches. Specifically, the package implements the 'TWO-NN' and 'Gride' estimators and the 'Hidalgo' Bayesian mixture model. In addition, the first reference contains an extended vignette on the usage of the 'TWO-NN' and 'Hidalgo' models. References: Denti (2023, <doi:10.18637/jss.v106.i09>); Allegra et al. (2020, <doi:10.1038/s41598-020-72222-0>); Denti et al. (2022, <doi:10.1038/s41598-022-20991-1>); Facco et al. (2017, <doi:10.1038/s41598-017-11873-y>); Santos-Fernandez et al. (2021, <doi:10.1038/s41598-022-20991-1>).
Authors:	Francesco Denti [aut, cre, cph], Andrea Gilardi [aut]
Maintainer:	Francesco Denti <[email protected]>
License:	MIT + file LICENSE
Version:	1.1.0
Built:	2025-03-12 06:59:13 UTC
Source:	CRAN

Plot the simulated MCMC chains for the Bayesian `Gride`

Description

Use this method without the .gride_bayes suffix. It displays the traceplot of the chain generated with Metropolis-Hastings updates to visually assess mixing and convergence. Alternatively, it is possible to plot the posterior density.

Usage

## S3 method for class 'gride_bayes'
autoplot(
  object,
  traceplot = FALSE,
  title = "Bayesian Gride - Posterior distribution",
  ...
)

Arguments

`object`	object of class `gride_bayes`. It is obtained using the output of the `gride` function when `method = "bayes"`.
`traceplot`	logical. If `FALSE`, the function returns a plot of the posterior density. If `TRUE`, the function returns the traceplots of the MCMC used to simulate from the posterior distribution.
`title`	optional string to display as title.
`...`	other arguments passed to specific methods.

Value

object of class ggplot. It is either the traceplot of the posterior simulations for the Bayesian Gride model (traceplot = TRUE) or a density plot of the simulated posterior distribution (traceplot = FALSE).

Plot the evolution of `Gride` estimates

Description

Use this method without the .gride_evolution suffix. It plots the evolution of the id estimates as a function of the average distance from the furthest NN of each point.

Usage

## S3 method for class 'gride_evolution'
autoplot(object, title = "Gride Evolution", ...)

Arguments

`object`	an object of class `gride_evolution`.
`title`	an optional string to customize the title of the plot.
`...`	other arguments passed to specific methods.

Value

object of class ggplot. It displays the evolution of the Gride maximum likelihood estimates as a function of the average distance from the n2-th NN.

Plot the simulated bootstrap sample for the MLE `Gride`

Description

Use this method without the .gride_mle suffix. It displays the density plot of the sample obtained via parametric bootstrap for the Gride model.

Usage

## S3 method for class 'gride_mle'
autoplot(object, title = "MLE Gride - Bootstrap sample", ...)

Arguments

`object`	object of class `gride_mle`. It is obtained using the output of the `gride` function when `method = "mle"`.
`title`	title for the plot.
`...`	other arguments passed to specific methods.

Value

object of class ggplot. It displays the density plot of the sample generated via parametric bootstrap to help the visual assessment of the uncertainty of the id estimates.

Plot the output of the `Hidalgo` function

Description

Use this method without the .Hidalgo suffix. It produces several plots to explore the output of the Hidalgo model.

Usage

## S3 method for class 'Hidalgo'
autoplot(
  object,
  type = c("raw_chains", "point_estimates", "class_plot", "clustering"),
  class_plot_type = c("histogram", "density", "boxplot", "violin"),
  class = NULL,
  psm = NULL,
  clust = NULL,
  title = NULL,
  ...
)

Arguments

`object`	object of class `Hidalgo`, the output of the `Hidalgo()` function.
`type`	character that indicates the requested type of plot. It can be: `"raw_chains"` plot the MCMC and the ergodic means NOT corrected for label switching; `"point_estimates"` plot the posterior mean and median of the id for each observation, after the chains are processed for label switching; `"class_plot"` plot the estimated id distributions stratified by the groups specified in the class vector; `"clustering"` plot the posterior coclustering matrix. Rows and columns can be stratified by an exogenous class and/or a clustering solution.
`class_plot_type`	if `type` is chosen to be `"class_plot"`, one can plot the stratified id estimates with a `"density"` plot or a `"histogram"`, or using `"boxplot"` or `"violin"` plots.
`class`	factor variable used to stratify the observations and their `id` estimates.
`psm`	posterior similarity matrix containing the posterior probability of coclustering.
`clust`	vector containing the cluster membership labels.
`title`	character string used as title of the plot.
`...`	other arguments passed to specific methods.

Value

a ggplot2 object produced by the function according to the type chosen. More precisely, if

type = "raw_chains": The function produces the traceplots of the parameters d_k, for k = 1, ..., K. The ergodic means of all the chains are superimposed. The K chains that are plotted are not post-processed, so they are subject to label switching;
type = "point_estimates": The function returns two scatterplots displaying the posterior mean and median id for each observation, after the MCMC output has been post-processed to handle label switching;
type = "class_plot": The function returns a plot that can be used to visually assess the relationship between the posterior id estimates and an external, categorical variable. The type of plot varies according to the specification of class_plot_type, and it can be either a set of boxplots or violin plots or a collection of overlapping densities or histograms;
type = "clustering": The function displays the posterior similarity matrix, to allow the study of the clustering structure present in the data estimated via the mixture model. Rows and columns can be stratified by an exogenous class and/or a clustering structure.

Plot the output of the `TWO-NN` model estimated via the Bayesian approach

Description

Use this method without the .twonn_bayes suffix. The function returns the density plot of the posterior distribution computed with the bayes method.

Usage

## S3 method for class 'twonn_bayes'
autoplot(
  object,
  plot_low = 0,
  plot_upp = NULL,
  by = 0.05,
  title = "Bayesian TWO-NN",
  ...
)

Arguments

`object`	object of class `twonn_bayes`, the output of the `twonn` function when `method = "bayes"`.
`plot_low`	lower bound of the interval on which the posterior density is plotted.
`plot_upp`	upper bound of the interval on which the posterior density is plotted.
`by`	step-size at which the sequence spanning the interval is incremented.
`title`	character string used as title of the plot.
`...`	other arguments passed to specific methods.

Value

ggplot2 object displaying the posterior distribution of the intrinsic dimension parameter.

Plot the output of the `TWO-NN` model estimated via least squares

Description

Use this method without the .twonn_linfit suffix. The function returns the representation of the linear regression that is fitted with the linfit method.

Usage

## S3 method for class 'twonn_linfit'
autoplot(object, title = "TWO-NN Linear Fit", ...)

Arguments

`object`	object of class `twonn_linfit`, the output of the `twonn` function when `method = "linfit"`.
`title`	string used as title of the plot.
`...`	other arguments passed to specific methods.

Value

a ggplot2 object displaying the goodness of the linear fit of the TWO-NN model.

Plot the output of the `TWO-NN` model estimated via the Maximum Likelihood approach

Description

Use this method without the .twonn_mle suffix. The function returns the point estimate along with the confidence bands computed via the mle method.

Usage

## S3 method for class 'twonn_mle'
autoplot(object, title = "MLE TWO-NN", ...)

Arguments

`object`	object of class `twonn_mle`, the output of the `twonn` function when `method = "mle"`.
`title`	character string used as title of the plot.
`...`	other arguments passed to specific methods.

Value

ggplot2 object displaying the point estimate and confidence interval of the id parameter obtained via the maximum likelihood approach.

Auxiliary functions for the `Hidalgo` model

Description

Collection of functions used to extract meaningful information from the object returned by the function Hidalgo

Usage

posterior_means(x)

initial_values(x)

posterior_medians(x)

credible_intervals(x, alpha = 0.95)

Arguments

`x`	object of class `Hidalgo`, the output of the `Hidalgo()` function.
`alpha`	posterior probability contained in the computed credible interval.

Value

posterior_means returns the observation-specific id posterior means estimated with Hidalgo.

initial_values returns a list with the parameter specification passed to the model.

posterior_medians returns the observation-specific id posterior medians estimated with Hidalgo.

credible_intervals returns the observation-specific credible intervals for a specific probability alpha.
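
As a rough illustration of what these summaries compute, the following base-R sketch derives observation-specific posterior means, medians, and equal-tailed credible intervals from a matrix of post-processed chains (rows are MCMC iterations, columns are observations). The matrix `chains`, the helper `summarise_chains()`, and the choice of equal-tailed intervals are assumptions made for the example; the package may compute its intervals differently.

```r
# Hypothetical stand-in for post-processed id chains:
# 1000 MCMC iterations (rows) for 4 observations (columns).
set.seed(1)
chains <- cbind(
  rnorm(1000, mean = 2, sd = 0.1),
  rnorm(1000, mean = 2, sd = 0.1),
  rnorm(1000, mean = 5, sd = 0.2),
  rnorm(1000, mean = 5, sd = 0.2)
)

# Illustrative helper (not part of intRinsic): column-wise summaries with
# an equal-tailed interval containing posterior probability `alpha`.
summarise_chains <- function(chains, alpha = 0.95) {
  probs  <- c((1 - alpha) / 2, (1 + alpha) / 2)
  bounds <- apply(chains, 2, quantile, probs = probs)
  data.frame(
    mean   = colMeans(chains),
    median = apply(chains, 2, median),
    lower  = bounds[1, ],
    upper  = bounds[2, ]
  )
}

out <- summarise_chains(chains, alpha = 0.95)
print(out)
```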

Posterior similarity matrix and partition estimation

Description

The function computes the posterior similarity (coclustering) matrix (psm) and estimates a representative partition of the observations from the MCMC output. The user can either provide the desired number of clusters or estimate an optimal clustering solution by minimizing a loss function on the space of the partitions. In the latter case, the function uses the package salso (Dahl et al., 2022), which the user needs to load.

Usage

clustering(
  object,
  clustering_method = c("dendrogram", "salso"),
  K = 2,
  nCores = 1,
  ...
)

## S3 method for class 'hidalgo_psm'
print(x, ...)

## S3 method for class 'hidalgo_psm'
plot(x, ...)

Arguments

`object`	object of class `Hidalgo`, the output of the `Hidalgo` function.
`clustering_method`	character indicating the method to use to perform clustering. It can be `"dendrogram"`: thresholding the adjacency dendrogram to obtain a given number of clusters (`K`); `"salso"`: estimation via minimization of several partition estimation criteria. The default loss function is the variation of information.
`K`	number of clusters to recover by thresholding the dendrogram obtained from the psm.
`nCores`	parameter for the `salso` function: the number of CPU cores to use. A value of zero indicates to use all cores on the system.
`...`	ignored.
`x`	object of class `hidalgo_psm`, obtained from the function `clustering()`.

Value

list containing the posterior similarity matrix (psm) and the estimated partition clust.

References

D. B. Dahl, D. J. Johnson, and P. Müller (2022), "Search Algorithms and Loss Functions for Bayesian Clustering", Journal of Computational and Graphical Statistics, doi:10.1080/10618600.2022.2069779.

David B. Dahl, Devin J. Johnson and Peter Müller (2022). "salso: Search Algorithms and Loss Functions for Bayesian Clustering". R package version 0.3.0. https://CRAN.R-project.org/package=salso

Examples


library(salso)
X            <- replicate(5,rnorm(500))
X[1:250,1:2] <- 0
h_out        <- Hidalgo(X)
clustering(h_out)
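
To make the two steps concrete, here is a minimal base-R sketch of a posterior similarity matrix and of a dendrogram-based partition. The label matrix `labels`, the helper `psm_from_labels()`, and the use of `hclust()` with average linkage on `1 - psm` are assumptions made for illustration; the internals of `clustering()` may differ.

```r
# Hypothetical MCMC output: membership labels for 6 observations (columns)
# over 200 iterations (rows). Observations 1-3 and 4-6 mostly share a label.
set.seed(42)
n_iter <- 200
labels <- t(replicate(n_iter, {
  z <- c(1, 1, 1, 2, 2, 2)
  flip <- runif(6) < 0.1          # occasionally move a point to the other group
  z[flip] <- 3 - z[flip]
  z
}))

# Illustrative helper: entry (i, j) is the fraction of iterations in which
# observations i and j are assigned to the same mixture component.
psm_from_labels <- function(labels) {
  n <- ncol(labels)
  psm <- matrix(0, n, n)
  for (s in seq_len(nrow(labels))) {
    psm <- psm + outer(labels[s, ], labels[s, ], "==")
  }
  psm / nrow(labels)
}

psm <- psm_from_labels(labels)

# Threshold the dendrogram built on the dissimilarity 1 - psm to get K groups.
K <- 2
clust <- cutree(hclust(as.dist(1 - psm), method = "average"), k = K)
print(round(psm, 2))
print(clust)
```

Cutting the tree at a fixed `K` is the simpler option; a loss-based search over partitions, as done by salso, does not require choosing the number of clusters in advance.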

Compute the ratio statistics needed for the intrinsic dimension estimation

Description

The function compute_mus computes the ratios of distances between nearest neighbors (NNs) of generic order, denoted as mu(n_1,n_2). This quantity is at the core of all the likelihood-based methods contained in the package.

Usage

compute_mus(X = NULL, dist_mat = NULL, n1 = 1, n2 = 2, Nq = FALSE, q = 3)

## S3 method for class 'mus'
print(x, ...)

## S3 method for class 'mus_Nq'
print(x, ...)

## S3 method for class 'mus'
plot(x, range_d = NULL, ...)

Arguments

`X`	a dataset with `n` observations and `D` variables.
`dist_mat`	a distance matrix computed between `n` observations.
`n1`	order of the first NN considered. Default is 1.
`n2`	order of the second NN considered. Default is 2.
`Nq`	logical indicator. If `TRUE`, it provides the `N^q` matrix needed for fitting the Hidalgo model.
`q`	integer, number of NN considered to build `N^q`.
`x`	object of class `mus`, obtained from the function `compute_mus()`.
`...`	ignored.
`range_d`	a sequence of values for which the generalized ratios density is superimposed on the histogram of `mus`.

Value

the principal output of this function is a vector containing the ratio statistics, an object of class mus. The length of the vector is equal to the number of observations considered, unless ties are present in the dataset. In that case, the duplicates are removed. Optionally, if Nq is TRUE, the function returns an object of class mus_Nq, a list containing both the ratio statistics mus and the adjacency matrix NQ.

References

Facco E, D'Errico M, Rodriguez A, Laio A (2017). "Estimating the intrinsic dimension of datasets by a minimal neighborhood information." Scientific Reports, 7(1). ISSN 20452322, doi:10.1038/s41598-017-11873-y.

Denti F, Doimo D, Laio A, Mira A (2022). "The generalized ratios intrinsic dimension estimator." Scientific Reports, 12(20005). ISSN 20452322, doi:10.1038/s41598-022-20991-1.

Examples

X           <- replicate(2,rnorm(1000))
mu          <- compute_mus(X, n1 = 1, n2 = 2)
mudots      <- compute_mus(X, n1 = 4, n2 = 8)
pre_hidalgo <- compute_mus(X, n1 = 4, n2 = 8, Nq = TRUE, q = 3)
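
The ratio statistic itself is simple to write down. The sketch below, which is independent of the package code, sorts each row of a Euclidean distance matrix and divides the distance to the `n2`-th NN by the distance to the `n1`-th NN. The helper `mu_ratios()` is hypothetical, and unlike `compute_mus()` it does not remove duplicated observations.

```r
# Illustrative helper (not the package implementation): generalized ratios
# mu(n1, n2) = r_{n2} / r_{n1} computed from a full distance matrix.
mu_ratios <- function(dist_mat, n1 = 1, n2 = 2) {
  stopifnot(n1 >= 1, n2 > n1, n2 < nrow(dist_mat))
  apply(dist_mat, 1, function(d_i) {
    sorted <- sort(d_i)            # sorted[1] is the zero distance to itself
    sorted[n2 + 1] / sorted[n1 + 1]
  })
}

set.seed(123)
X  <- replicate(2, rnorm(300))
dm <- as.matrix(dist(X))

mu_12 <- mu_ratios(dm, n1 = 1, n2 = 2)   # TWO-NN ratios
mu_48 <- mu_ratios(dm, n1 = 4, n2 = 8)   # higher-order ratios
summary(mu_12)
```

By construction every ratio is at least 1, since the `n2`-th neighbor is never closer than the `n1`-th one.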

The Generalized Ratio distribution

Description

Density function and random number generator for the Generalized Ratio distribution with NN orders equal to n1 and n2. See Denti et al., 2022 for more details.

Usage

rgera(nsim, n1 = 1, n2 = 2, d)

dgera(x, n1 = 1, n2 = 2, d, log = FALSE)

Arguments

`nsim`	integer, the number of observations to generate.
`n1`	order of the first NN considered. Default is 1.
`n2`	order of the second NN considered. Default is 2.
`d`	value of the intrinsic dimension.
`x`	vector of quantiles.
`log`	logical, if `TRUE`, it returns the log-density

Value

dgera gives the density. rgera returns a vector of random observations sampled from the generalized ratio distribution.

References

Denti F, Doimo D, Laio A, Mira A (2022). "The generalized ratios intrinsic dimension estimator." Scientific Reports, 12(20005). ISSN 20452322, doi:10.1038/s41598-022-20991-1.

Examples

draws   <- rgera(100,3,5,2)
density <- dgera(3,3,5,2)
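
For intuition, the sketch below writes the distribution in base R, assuming the density has the form f(mu) = d (mu^d - 1)^(n2 - n1 - 1) / (mu^((n2 - 1) d + 1) B(n2 - n1, n1)) for mu > 1, which reduces to a Pareto(1, d) density when n1 = 1 and n2 = 2. Under that form, mu^(-d) follows a Beta(n1, n2 - n1) distribution, which gives a simple sampler. The names `gera_density()` and `gera_sample()` are hypothetical and are not the package's `dgera()` and `rgera()`.

```r
# Illustrative re-implementation under the stated assumption on the density.
gera_density <- function(x, n1 = 1, n2 = 2, d, log = FALSE) {
  logf <- rep(-Inf, length(x))     # zero density outside the support (1, Inf)
  ok <- x > 1
  logf[ok] <- log(d) + (n2 - n1 - 1) * log(x[ok]^d - 1) -
    ((n2 - 1) * d + 1) * log(x[ok]) - lbeta(n2 - n1, n1)
  if (log) logf else exp(logf)
}

# Sampler based on mu^(-d) ~ Beta(n1, n2 - n1).
gera_sample <- function(nsim, n1 = 1, n2 = 2, d) {
  rbeta(nsim, n1, n2 - n1)^(-1 / d)
}

# The density should integrate to one over (1, Inf).
total <- integrate(gera_density, lower = 1, upper = Inf,
                   n1 = 3, n2 = 5, d = 2)$value
print(total)

# With n1 = 1 and n2 = 2 the density coincides with d / x^(d + 1).
x <- c(1.5, 2, 3)
print(cbind(gera = gera_density(x, d = 2), pareto = 2 / x^3))

set.seed(99)
draws <- gera_sample(20000, n1 = 3, n2 = 5, d = 2)
```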

`Gride`: the Generalized Ratios ID Estimator

Description

The function can fit the Generalized ratios ID estimator under both the frequentist and the Bayesian frameworks, depending on the specification of the argument method. The model is the direct extension of the TWO-NN method presented in Facco et al., 2017. See also Denti et al., 2022 for more details.

Usage

gride(
  X = NULL,
  dist_mat = NULL,
  mus_n1_n2 = NULL,
  method = c("mle", "bayes"),
  n1 = 1,
  n2 = 2,
  alpha = 0.95,
  nsim = 5000,
  upper_D = 50,
  burn_in = 2000,
  sigma = 0.5,
  start_d = NULL,
  a_d = 1,
  b_d = 1,
  ...
)

## S3 method for class 'gride_bayes'
print(x, ...)

## S3 method for class 'gride_bayes'
summary(object, ...)

## S3 method for class 'summary.gride_bayes'
print(x, ...)

## S3 method for class 'gride_bayes'
plot(x, ...)

## S3 method for class 'gride_mle'
print(x, ...)

## S3 method for class 'gride_mle'
summary(object, ...)

## S3 method for class 'summary.gride_mle'
print(x, ...)

## S3 method for class 'gride_mle'
plot(x, ...)

Arguments

`X`	data matrix with `n` observations and `D` variables.
`dist_mat`	distance matrix computed between the `n` observations.
`mus_n1_n2`	vector of generalized order NN distance ratios.
`method`	the chosen estimation method. It can be `"mle"` for maximum likelihood estimation; `"bayes"` for estimation with the Bayesian approach.
`n1`	order of the first NN considered. Default is 1.
`n2`	order of the second NN considered. Default is 2.
`alpha`	confidence level (for `mle`) or posterior probability in the credible interval (`bayes`).
`nsim`	number of bootstrap samples or posterior simulations to consider.
`upper_D`	nominal dimension of the dataset (upper bound for the maximization routine).
`burn_in`	number of iterations to discard from the MCMC sample. Applicable if `method = "bayes"`.
`sigma`	standard deviation of the Gaussian proposal used in the MH step. Applicable if `method = "bayes"`.
`start_d`	initial value for the MCMC chain. If `NULL`, the MLE is used. Applicable if `method = "bayes"`.
`a_d`	shape parameter of the Gamma prior distribution for `d`. Applicable if `method = "bayes"`.
`b_d`	rate parameter of the Gamma prior distribution for `d`. Applicable if `method = "bayes"`.
`...`	other arguments passed to specific methods.
`x`	object of class `gride_mle`. It is obtained using the output of the `gride` function when `method = "mle"`.
`object`	object of class `gride_mle`, obtained from the function `gride()` when `method = "mle"`.

Value

a list containing the id estimate obtained with the Gride method, along with the corresponding confidence or credible interval (object est). The class of the output object changes according to the chosen method. The remaining elements stored in the list report a summary of the key quantities involved in the estimation process, e.g., the NN orders n1 and n2.

References

Denti F, Doimo D, Laio A, Mira A (2022). "The generalized ratios intrinsic dimension estimator." Scientific Reports, 12(20005). ISSN 20452322, doi:10.1038/s41598-022-20991-1.

Examples


X   <- replicate(2,rnorm(500))
dm  <- as.matrix(dist(X,method = "manhattan"))
res <- gride(X, nsim = 500)
res
plot(res)
gride(dist_mat = dm, method = "bayes", upper_D = 10,
      nsim = 500, burn_in = 100)

`Gride` evolution based on Maximum Likelihood Estimation

Description

The function allows the study of the evolution of the id estimates as a function of the scale of a dataset. A scale-dependent analysis is essential to identify the correct number of relevant directions in noisy data. To increase the average distance from the second NN (and thus the average neighborhood size) involved in the estimation, the function computes a sequence of Gride models with increasing NN orders, n1 and n2. See also Denti et al., 2022 for more details.

Usage

gride_evolution(X, vec_n1, vec_n2, upp_bound = 50)

## S3 method for class 'gride_evolution'
print(x, ...)

## S3 method for class 'gride_evolution'
plot(x, ...)

Arguments

`X`	data matrix with `n` observations and `D` variables.
`vec_n1`	vector of integers, containing the smaller NN orders considered in the evolution.
`vec_n2`	vector of integers, containing the larger NN orders considered in the evolution.
`upp_bound`	upper bound for the interval used in the numerical optimization (via `optimize`). Default is set to 50.
`x`	an object of class `gride_evolution`.
`...`	other arguments passed to specific methods.

Value

list containing the Gride evolution, the corresponding NN distance ratios, the average n2-th NN order distances, and the NN orders considered.

the function prints a summary of the Gride evolution to console.

References

Denti F, Doimo D, Laio A, Mira A (2022). "The generalized ratios intrinsic dimension estimator." Scientific Reports, 12(20005). ISSN 20452322, doi:10.1038/s41598-022-20991-1.

Examples


X       <-  replicate(5,rnorm(10000,0,.1))
gride_evolution(X = X,vec_n1 = 2^(0:5),vec_n2 = 2^(1:6))
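
The idea of a scale-dependent analysis can be sketched without the package: for each pair of NN orders, compute the ratios and maximize the generalized ratio log-likelihood over an interval bounded above by `upp_bound` with `optimize()`. Everything below is an illustrative assumption (the helper names, the form of the generalized ratio density written in the comment, and the small simulated dataset); it is not the code used by `gride_evolution()`.

```r
# Illustrative helpers, not the package implementation.

# Log-likelihood of d for ratios mu(n1, n2), assuming the density
# f(mu) = d (mu^d - 1)^(n2 - n1 - 1) / (mu^((n2 - 1) d + 1) B(n2 - n1, n1)).
gride_loglik <- function(d, mus, n1, n2) {
  sum(log(d) + (n2 - n1 - 1) * log(mus^d - 1) -
        ((n2 - 1) * d + 1) * log(mus) - lbeta(n2 - n1, n1))
}

evolution_sketch <- function(X, vec_n1, vec_n2, upp_bound = 50) {
  # Sort each row of the distance matrix once; column 1 is the self-distance.
  sorted <- t(apply(as.matrix(dist(X)), 1, sort))
  t(mapply(function(n1, n2) {
    mus <- sorted[, n2 + 1] / sorted[, n1 + 1]
    fit <- optimize(gride_loglik, interval = c(0.01, upp_bound),
                    mus = mus, n1 = n1, n2 = n2, maximum = TRUE)
    c(n1 = n1, n2 = n2, id = fit$maximum,
      avg_dist_n2 = mean(sorted[, n2 + 1]))
  }, vec_n1, vec_n2))
}

set.seed(2022)
X   <- replicate(2, rnorm(500))          # true id = 2
evo <- evolution_sketch(X, vec_n1 = 2^(0:2), vec_n2 = 2^(1:3))
print(evo)
```

Plotting the `id` column against `avg_dist_n2` gives the kind of picture the evolution plot is meant to convey: how the estimate changes as the neighborhoods grow.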


Fit the `Hidalgo` model

Description

The function fits the Heterogeneous intrinsic dimension algorithm, developed in Allegra et al., 2020. The model is a Bayesian mixture of Pareto distributions with a modified likelihood that induces homogeneity across neighboring observations. The model can segment the observations into multiple clusters characterized by different intrinsic dimensions, which makes it possible to capture hidden patterns in the data. For more details on the algorithm, refer to Allegra et al., 2020. For an example of application to basketball data, see Santos-Fernandez et al., 2021.

Usage

Hidalgo(
  X = NULL,
  dist_mat = NULL,
  K = 10,
  nsim = 5000,
  burn_in = 5000,
  thinning = 1,
  verbose = TRUE,
  q = 3,
  xi = 0.75,
  alpha_Dirichlet = 0.05,
  a0_d = 1,
  b0_d = 1,
  prior_type = c("Conjugate", "Truncated", "Truncated_PointMass"),
  D = NULL,
  pi_mass = 0.5
)

## S3 method for class 'Hidalgo'
print(x, ...)

## S3 method for class 'Hidalgo'
plot(x, type = c("A", "B", "C"), class = NULL, ...)

## S3 method for class 'Hidalgo'
summary(object, ...)

## S3 method for class 'summary.Hidalgo'
print(x, ...)

Arguments

`X`	data matrix with `n` observations and `D` variables.
`dist_mat`	distance matrix computed between the `n` observations.
`K`	integer, number of mixture components.
`nsim`	number of MCMC iterations to run.
`burn_in`	number of MCMC iterations to discard as burn-in period.
`thinning`	integer indicating the thinning interval.
`verbose`	logical, should the progress of the sampler be printed?
`q`	integer, first local homogeneity parameter. Default is 3.
`xi`	real number between 0 and 1, second local homogeneity parameter. Default is 0.75.
`alpha_Dirichlet`	parameter of the symmetric Dirichlet prior on the mixture weights. Default is 0.05, inducing a sparse mixture. Values that are too small (i.e., lower than 0.005) may cause underflow.
`a0_d`	shape parameter of the Gamma prior on `d`.
`b0_d`	rate parameter of the Gamma prior on `d`.
`prior_type`	character, type of Gamma prior on `d`, can be `"Conjugate"` a conjugate Gamma distribution is elicited; `"Truncated"` the conjugate Gamma prior is truncated over the interval `(0,D)`; `"Truncated_PointMass"` same as `"Truncated"`, but a point mass is placed on `D`, to allow the `id` to be identically equal to the nominal dimension.
`D`	integer, the maximal dimension of the dataset.
`pi_mass`	probability placed a priori on `D` when `Truncated_PointMass` is chosen.
`x`	object of class `Hidalgo`, the output of the `Hidalgo()` function.
`...`	other arguments passed to specific methods.
`type`	character that indicates the type of plot that is requested. It can be: `"A"` plot the MCMC and the ergodic means NOT corrected for label switching; `"B"` plot the posterior mean and median of the id for each observation, after the chains are processed for label switching; `"C"` plot the estimated id distributions stratified by the groups specified in the class vector.
`class`	factor variable used to stratify the observations and their `id` estimates.
`object`	object of class `Hidalgo`, the output of the `Hidalgo()` function.

Value

object of class Hidalgo, which is a list containing

cluster_prob: chains of the posterior mixture weights;
membership_labels: chains of the membership labels for all the observations;
id_raw: chains of the K intrinsic dimensions parameters, one per mixture component;
id_postpr: a chain for each observation, corrected for label switching;
id_summary: a matrix containing, for each observation, the value of posterior mean and the 5%, 25%, 50%, 75%, 95% quantiles;
recap: a list with the objects and specifications passed to the function used in the estimation.

References

Allegra M, Facco E, Denti F, Laio A, Mira A (2020). “Data segmentation based on the local intrinsic dimension.” Scientific Reports, 10(1), 1–27. ISSN 20452322, doi:10.1038/s41598-020-72222-0.

Santos-Fernandez E, Denti F, Mengersen K, Mira A (2021). “The role of intrinsic dimension in high-resolution player tracking data – Insights in basketball.” Annals of Applied Statistics, forthcoming. arXiv:2002.04148.

Examples


set.seed(1234)
X            <- replicate(5,rnorm(500))
X[1:250,1:2] <- 0
X[1:250,]    <- X[1:250,] + 4
oracle       <- rep(1:2,rep(250,2))
# this is just a short example
# increase the number of iterations to improve mixing and convergence
h_out        <- Hidalgo(X, nsim = 500, burn_in = 500)
plot(h_out, type =  "B")
id_by_class(h_out, oracle)



Stratification of the `id` by an external categorical variable

Description

The function computes summary statistics (mean, median, and standard deviation) of the post-processed chains of the intrinsic dimension stratified by an external categorical variable.

Usage

id_by_class(object, class)

## S3 method for class 'hidalgo_class'
print(x, ...)

Arguments

`object`	object of class `Hidalgo`, the output of the `Hidalgo()` function.
`class`	factor according to which the observations should be stratified.
`x`	object of class `hidalgo_class`, the output of the `id_by_class()` function.
`...`	other arguments passed to specific methods.

Value

a data.frame containing the posterior id means, medians, and standard deviations stratified by the levels of the variable class.

Examples


X            <- replicate(5,rnorm(500))
X[1:250,1:2] <- 0
oracle       <- rep(1:2,rep(250,2))
h_out        <- Hidalgo(X)
id_by_class(h_out,oracle)


Generates a noise-free Swiss roll dataset

Description

The function creates a three-dimensional dataset with coordinates following the Swiss roll mapping, transforming random uniform data points sampled on the interval (0,10).

Usage

Swissroll(n)

Arguments

`n`	number of observations contained in the output dataset.

Value

a three-dimensional data.frame containing the coordinates of the points generated via the Swiss roll mapping.

Examples

Data <- Swissroll(1000)
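
A common form of the Swiss roll mapping sends two independent uniform coordinates `(u, v)` to `(u cos(u), v, u sin(u))`, so the three-dimensional point cloud has intrinsic dimension 2. The sketch below uses that form on the interval (0, 10). The exact mapping used by `Swissroll()` is not stated here, so the helper `swissroll_sketch()` should be read as an assumption, not as the package's implementation.

```r
# Illustrative helper, not the package implementation.
swissroll_sketch <- function(n) {
  u <- runif(n, 0, 10)             # position along the spiral
  v <- runif(n, 0, 10)             # position along the roll axis
  data.frame(x = u * cos(u), y = v, z = u * sin(u))
}

set.seed(10)
roll <- swissroll_sketch(1000)
head(roll)
```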

`TWO-NN` estimator

Description

The function can fit the two-nearest neighbor estimator within the maximum likelihood and the Bayesian frameworks. Also, one can obtain the estimates using least squares estimation, depending on the specification of the argument method. The model was originally presented in Facco et al., 2017. See also Denti et al., 2022 for more details.

Usage

twonn(
  X = NULL,
  dist_mat = NULL,
  mus = NULL,
  method = c("mle", "linfit", "bayes"),
  alpha = 0.95,
  c_trimmed = 0.01,
  unbiased = TRUE,
  a_d = 0.001,
  b_d = 0.001,
  ...
)

## S3 method for class 'twonn_bayes'
print(x, ...)

## S3 method for class 'twonn_bayes'
summary(object, ...)

## S3 method for class 'summary.twonn_bayes'
print(x, ...)

## S3 method for class 'twonn_bayes'
plot(x, plot_low = 0.001, plot_upp = NULL, by = 0.05, ...)

## S3 method for class 'twonn_linfit'
print(x, ...)

## S3 method for class 'twonn_linfit'
summary(object, ...)

## S3 method for class 'summary.twonn_linfit'
print(x, ...)

## S3 method for class 'twonn_linfit'
plot(x, ...)

## S3 method for class 'twonn_mle'
print(x, ...)

## S3 method for class 'twonn_mle'
summary(object, ...)

## S3 method for class 'summary.twonn_mle'
print(x, ...)

## S3 method for class 'twonn_mle'
plot(x, ...)

Arguments

`X`	data matrix with `n` observations and `D` variables.
`dist_mat`	distance matrix computed between the `n` observations.
`mus`	vector of second to first NN distance ratios.
`method`	chosen estimation method. It can be `"mle"` for maximum likelihood estimator; `"linfit"` for estimation via the least squares approach; `"bayes"` for estimation with the Bayesian approach.
`alpha`	the confidence level (for `mle` and least squares fit) or posterior probability in the credible interval (`bayes`).
`c_trimmed`	the proportion of trimmed observations.
`unbiased`	logical, applicable when `method = "mle"`. If `TRUE`, the MLE is corrected to ensure unbiasedness.
`a_d`	shape parameter of the Gamma prior on the parameter `d`, applicable when `method = "bayes"`.
`b_d`	rate parameter of the Gamma prior on the parameter `d`, applicable when `method = "bayes"`.
`...`	ignored.
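`x`	object of class `twonn_mle`, `twonn_linfit`, or `twonn_bayes`, the output of the `twonn` function for the corresponding `method`.
`object`	object of class `twonn_mle`, `twonn_linfit`, or `twonn_bayes`, the output of the `twonn` function for the corresponding `method`.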
`plot_low`	lower bound of the interval on which the posterior density is plotted.
`plot_upp`	upper bound of the interval on which the posterior density is plotted.
`by`	step-size at which the sequence spanning the interval is incremented.

Value

list whose class depends on the chosen method. Regardless of the method, the output list always contains the object est, which provides the estimated intrinsic dimension along with uncertainty quantification. The remaining objects vary with the estimation method. In particular, if

method = "mle": the output reports the MLE and the relative confidence interval;
method = "linfit": the output includes the lm() object used for the computation;
method = "bayes": the output contains the (1 + alpha) / 2 and (1 - alpha) / 2 quantiles, mean, mode, and median of the posterior distribution of d.
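
The quantities listed above can be illustrated with a minimal base-R sketch that does not call the package. It assumes the standard TWO-NN model, under which the log-ratios are exponentially distributed with rate d, so a Gamma(a_d, b_d) prior is conjugate. The helper `nn_ratios()` and all variable names are hypothetical, and the interval construction shown here is one reasonable choice that may differ from what `twonn()` computes internally.

```r
# Base-R sketch; nn_ratios() is a hypothetical helper, not part of intRinsic.
set.seed(1)
X <- replicate(2, rnorm(500))            # 500 observations, true id = 2

# Ratio of second to first nearest-neighbor distance for every observation
nn_ratios <- function(X) {
  D <- as.matrix(dist(X))
  diag(D) <- Inf                         # exclude self-distances
  apply(D, 1, function(r) {
    s <- sort(r)[1:2]
    s[2] / s[1]
  })
}

mus <- nn_ratios(X)                      # plays the role of the `mus` argument
n <- length(mus)
S <- sum(log(mus))                       # log(mu) ~ Exp(d) under the model

# Maximum likelihood estimate and its unbiased correction
d_mle <- n / S
d_unb <- (n - 1) / S

# Interval from the pivot d * S ~ Gamma(n, 1)
alpha <- 0.95
probs <- c((1 - alpha) / 2, (1 + alpha) / 2)
ci <- qgamma(probs, shape = n) / S

# Conjugate update: Gamma(a_d, b_d) prior -> Gamma(a_d + n, b_d + S) posterior
a_d <- 0.001
b_d <- 0.001
post_shape <- a_d + n
post_rate <- b_d + S
post_mean <- post_shape / post_rate
post_mode <- (post_shape - 1) / post_rate
post_median <- qgamma(0.5, shape = post_shape, rate = post_rate)
cred <- qgamma(probs, shape = post_shape, rate = post_rate)
```

With the default, nearly flat prior (a_d = b_d = 0.001), the posterior mean is almost identical to the uncorrected MLE.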

References

Denti F, Doimo D, Laio A, Mira A (2022). "The generalized ratios intrinsic dimension estimator." Scientific Reports, 12(20005). ISSN 20452322, doi:10.1038/s41598-022-20991-1.

Examples

# dataset with 1000 observations and id = 2
X <- replicate(2,rnorm(1000))
twonn(X)
# dataset with 1000 observations and id = 3
Y <- replicate(3,runif(1000))
#  Bayesian and least squares estimate from distance matrix
dm <- as.matrix(dist(Y,method = "manhattan"))
twonn(dist_mat = dm,method = "bayes")
twonn(dist_mat = dm,method = "linfit")
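
For intuition about the least squares approach, the following base-R sketch fits a line through the origin to the transformed empirical distribution of the ratios. It assumes the TWO-NN relation F(mu) = 1 - mu^(-d), so that -log(1 - F(mu)) = d * log(mu), and uses i/n as the empirical distribution function. The variable names and the trimming rule are illustrative assumptions; `twonn(method = "linfit")` may differ in its details.

```r
# Base-R sketch of a least squares TWO-NN fit; not the package implementation.
set.seed(2)
Y <- replicate(3, runif(500))            # 500 observations, true id = 3

# Ratios of second to first nearest-neighbor distances
D <- as.matrix(dist(Y))
diag(D) <- Inf
mus <- apply(D, 1, function(r) {
  s <- sort(r)[1:2]
  s[2] / s[1]
})

# Discard the largest ratios (assumed meaning of c_trimmed in this sketch)
c_trimmed <- 0.01
n <- length(mus)
n_trim <- round(n * c_trimmed)
keep <- seq_len(n - n_trim)

# Regress -log(1 - F_emp) on log(mu) with no intercept; the slope estimates d
x <- log(sort(mus))[keep]
y <- -log(1 - keep / n)
fit <- lm(y ~ x - 1)
d_ls <- unname(coef(fit))
ci <- confint(fit, level = 0.95)
```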

Estimate the decimated `TWO-NN` evolution with halving steps or vector of proportions

Description

The estimation of the id is related to the scale of the dataset. To escape the local reach of the TWO-NN estimator, Facco et al. (2017) proposed to subsample the original dataset in order to induce greater distances between the data points. By investigating the estimates' evolution as a function of the size of the neighborhood, it is possible to obtain information about the validity of the modeling assumptions and the robustness of the model in the presence of noise.
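
The decimation idea can be sketched in base R as follows: the dataset is repeatedly halved at random, and at each step the TWO-NN maximum likelihood estimate is recomputed together with the average distance from the second NN. The helper `twonn_mle_sketch()` and the nested subsampling scheme are assumptions made for illustration and are not the package's internal code.

```r
# Base-R sketch of decimation by halving; helper names are hypothetical.
set.seed(3)
X <- replicate(4, rnorm(800))            # 800 observations, true id = 4

# TWO-NN MLE and average second-NN distance for one (sub)sample
twonn_mle_sketch <- function(X) {
  D <- as.matrix(dist(X))
  diag(D) <- Inf
  nn <- t(apply(D, 1, function(r) sort(r)[1:2]))
  c(n = nrow(X),
    id = nrow(X) / sum(log(nn[, 2] / nn[, 1])),
    avg_r2 = mean(nn[, 2]))
}

steps <- 3
idx <- seq_len(nrow(X))
evolution <- vector("list", steps + 1)
for (s in 0:steps) {
  evolution[[s + 1]] <- twonn_mle_sketch(X[idx, , drop = FALSE])
  idx <- sample(idx, floor(length(idx) / 2))   # keep a random half
}
evolution <- do.call(rbind, evolution)
evolution
```

Each row reports the subsample size, the estimated id, and the average second-NN distance. The distances grow as the sample shrinks, which is how decimation probes larger scales.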

Usage

twonn_decimated(
  X,
  method = c("steps", "proportions"),
  steps = 0,
  proportions = 1,
  seed = NULL
)

Arguments

`X`	data matrix with `n` observations and `D` variables.
`method`	method to use for decimation: `"steps"` halves the dataset `steps` times; `"proportions"` subsamples the dataset according to a vector of proportions.
`steps`	number of times the dataset is halved.
`proportions`	vector containing the fractions of the dataset to be considered.
`seed`	random seed controlling the sequence of sub-sampled observations.

Value

list containing the TWO-NN evolution (maximum likelihood estimates and confidence intervals) and the average distance from the second NN. Depending on the chosen decimation method, it is accompanied by the vector of proportions or the halving steps considered.

References

Denti F, Doimo D, Laio A, Mira A (2022). "The generalized ratios intrinsic dimension estimator." Scientific Reports, 12(20005). ISSN 20452322, doi:10.1038/s41598-022-20991-1.

Estimate the decimated `TWO-NN` evolution with halving steps or vector of proportions

Description

Usage

twonn_decimation(
  X,
  method = c("steps", "proportions"),
  steps = 0,
  proportions = 1,
  seed = NULL
)

## S3 method for class 'twonn_dec_prop'
print(x, ...)

## S3 method for class 'twonn_dec_prop'
plot(x, CI = FALSE, proportions = FALSE, ...)

## S3 method for class 'twonn_dec_by'
print(x, ...)

## S3 method for class 'twonn_dec_by'
plot(x, CI = FALSE, steps = FALSE, ...)

Arguments

`X`	data matrix with `n` observations and `D` variables.
`method`	method to use for decimation: `"steps"` halves the dataset `steps` times; `"proportions"` subsamples the dataset according to a vector of proportions.
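`steps`	in `twonn_decimation()`, the number of times the dataset is halved. In the `plot` method, logical: if `TRUE`, the x-axis reports the number of halving steps; if `FALSE`, the x-axis reports the log10 average distance.
`proportions`	in `twonn_decimation()`, vector containing the fractions of the dataset to be considered. In the `plot` method, logical: if `TRUE`, the x-axis reports the decimating proportions; if `FALSE`, the x-axis reports the log10 average distance.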
`seed`	random seed controlling the sequence of sub-sampled observations.
`x`	object of class `twonn_dec_prop` or `twonn_dec_by`, obtained from the function `twonn_decimation()`.
`...`	ignored.
`CI`	logical, if `TRUE`, the confidence intervals are plotted.

Value

References

Denti F, Doimo D, Laio A, Mira A (2022). "The generalized ratios intrinsic dimension estimator." Scientific Reports, 12(20005). ISSN 20452322, doi:10.1038/s41598-022-20991-1.

Examples

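# dataset with 1000 observations and id = 4
X <- replicate(4, rnorm(1000))
# TWO-NN evolution over decreasing fractions of the dataset
twonn_decimation(X, method = "proportions",
                 proportions = c(1, .5, .2, .1, .01))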
