Title: | Collection of Methods Constructed using Kernel-Based Quadratic Distances |
---|---|
Description: | It includes test for multivariate normality, test for uniformity on the d-dimensional Sphere, non-parametric two- and k-sample tests, random generation of points from the Poisson kernel-based density and clustering algorithm for spherical data. For more information see Saraceno G., Markatou M., Mukhopadhyay R. and Golzy M. (2024) <doi:10.48550/arXiv.2402.02290> Markatou, M. and Saraceno, G. (2024) <doi:10.48550/arXiv.2407.16374>, Ding, Y., Markatou, M. and Saraceno, G. (2023) <doi:10.5705/ss.202022.0347>, and Golzy, M. and Markatou, M. (2020) <doi:10.1080/10618600.2020.1740713>. |
Authors: | Giovanni Saraceno [aut, cre] (ORCID 000-0002-1753-2367), Marianthi Markatou [aut], Raktim Mukhopadhyay [aut], Mojgan Golzy [aut] |
Maintainer: | Giovanni Saraceno <[email protected]> |
License: | GPL (>= 3) |
Version: | 1.1.2 |
Built: | 2024-10-30 09:24:19 UTC |
Source: | CRAN |
Collection of Methods Constructed using the Kernel-Based Quadratic Distances
QuadratiK
provides the first implementation, in R and Python, of a
comprehensive set of goodness-of-fit tests and a clustering technique for
-dimensional spherical data
using kernel-based quadratic
distances. It includes:
Goodness-of-Fit Tests: The software implements one, two, and
k-sample tests for goodness of fit, offering an efficient and
mathematically sound way to assess the fit of probability distributions.
Our tests are
particularly useful for large, high dimensional data sets where the
assessment of fit of probability models is of interest. Specifically, we
offer tests for normality, as well as two- and k-sample tests, where
testing equality of two or more distributions is of interest, that is
and
respectively.
The proposed tests perform well in terms of level and power for contiguous
alternatives, heavy tailed distributions and in higher dimensions.
Expanded capabilities include supporting tests for uniformity on the
d-dimensional Sphere based on the Poisson kernel, exhibiting excellent
results especially in the case of multimodal distributions.
Poisson kernel-based distribution (PKBD): the package offers
functions for computing the density value and for generating random samples
from a PKBD. The Poisson kernel-based densities are based on the normalized
Poisson kernel and are defined on the -dimensional unit sphere.
Given a vector
, and a parameter
such that
, the probability density function of a
-variate Poisson kernel-based density is defined by:
where is a vector orienting the center of the distribution,
is a parameter to control the concentration of the distribution
around the vector
and it is related to the variance of the
distribution. Furthermore,
is the surface area of the unit sphere in
(see Golzy and Markatou, 2020).
Clustering Algorithm for Spherical Data: the package incorporates a
unique clustering algorithm specifically tailored for -dimensional
spherical data and it is especially useful in the presence of noise in the
data and the presence of non-negligible overlap between clusters. This
algorithm leverages a mixture of Poisson kernel-based densities on the
Sphere, enabling effective clustering of spherical data or data that has
been spherically transformed.
Additional Features: Alongside these functionalities, the software includes additional graphical functions, aiding users in validating and representing the cluster results as well as enhancing the interpretability and usability of the analysis.
For an introduction to QuadratiK
see the vignette
Introduction to the QuadratiK Package.
The work has been supported by Kaleida Health Foundation and the National Science Foundation.
The QuadratiK
package is also available in Python on PyPI
https://pypi.org/project/QuadratiK/ and also as a Dashboard application.
Usage instruction for the Dashboard can be found at
https://quadratik.readthedocs.io/en/latest/user_guide/dashboard_application_usage.html.
Giovanni Saraceno, Marianthi Markatou, Raktim Mukhopadhyay, Mojgan Golzy
Maintainer: Giovanni Saraceno [email protected]
Saraceno, G., Markatou, M., Mukhopadhyay, R. and Golzy, M. (2024). Goodness-of-Fit and Clustering of Spherical Data: the QuadratiK package in R and Python. arXiv preprint arXiv:2402.02290.
Ding, Y., Markatou, M. and Saraceno, G. (2023). “Poisson Kernel-Based Tests for Uniformity on the d-Dimensional Sphere.” Statistica Sinica. doi: doi:10.5705/ss.202022.0347.
Golzy, M. and Markatou, M. (2020) Poisson Kernel-Based Clustering on the Sphere: Convergence Properties, Identifiability, and a Method of Sampling, Journal of Computational and Graphical Statistics, 29:4, 758-770, DOI: 10.1080/10618600.2020.1740713.
Markatou, M. and Saraceno, G. (2024). “A Unified Framework for
Multivariate Two- and k-Sample Kernel-based Quadratic Distance
Goodness-of-Fit Tests.”
https://doi.org/10.48550/arXiv.2407.16374
Useful links:
Report bugs at https://github.com/giovsaraceno/QuadratiK-package/issues
The breast_cancer
Wisconsin data has 569 rows and 31 columns. The
first 30 variables report the features that are computed from a digitized
image of a fine needle aspirate (FNA) of a breast mass. They describe
characteristics of the cell nuclei present in the image. The last column
indicates the class labels (Benign = 0 or Malignant = 1).
breast_cancer
breast_cancer
A data frame of 569 observations and 31 variables.
Wolberg, W., Mangasarian, O., Street, N., & Street, W. (1993). Breast Cancer Wisconsin (Diagnostic). UCI Machine Learning Repository. https://doi.org/10.24432/C5DW2B.
Street, W. N., Wolberg, W. H., & Mangasarian, O. L. (1993, July). Nuclear feature extraction for breast tumor diagnosis. In Biomedical image processing and biomedical visualization (Vol. 1905, pp. 861-870). SPIE.
data(breast_cancer) summary(breast_cancer)
data(breast_cancer) summary(breast_cancer)
The Poisson kernel-based densities are based on the normalized Poisson kernel
and are defined on the -dimensional unit sphere. Given a vector
, where
, and a parameter
such that
, the probability density function of a
-variate
Poisson kernel-based density is defined by:
where is a vector orienting the center of the distribution,
is a parameter to control the concentration of the distribution
around the vector
and it is related to the variance of the
distribution. Recall that, for
,
. Furthermore,
is the surface area of the unit sphere in
(see Golzy and Markatou, 2020). When
,
the Poisson kernel-based density tends to the uniform density on the sphere.
Connections of the PKBDs to other distributions are discussed in detail in
Golzy and Markatou (2020). Here we note that when
, PKBDs reduce to
the wrapped Cauchy distribution. Additionally, with precise choice of the
parameters
and
the two-dimensional PKBD becomes a
two-dimensional projected normal distribution. However, the connection with
the
-dimensional projected normal distributions does not carry beyond
.
Golzy and Markatou (2020) proposed an acceptance-rejection method for
simulating data from a PKBD using von Mises-Fisher envelops (
rejvmf
method). Furthermore Sablica, Hornik and Leydold (2023) proposed new ways for
simulating from the PKBD, using angular central Gaussian envelops
(rejacg
) or using the projected Saw distributions (rejpsaw
).
dpkb(x, mu, rho, logdens = FALSE) rpkb( n, mu, rho, method = "rejacg", tol.eps = .Machine$double.eps^0.25, max.iter = 1000 )
dpkb(x, mu, rho, logdens = FALSE) rpkb( n, mu, rho, method = "rejacg", tol.eps = .Machine$double.eps^0.25, max.iter = 1000 )
x |
Matrix (or data.frame) of data point on the sphere
|
mu |
location vector parameter with length indicating the dimension of generated points. |
rho |
Concentration parameter, with |
logdens |
Logical; if 'TRUE', densities are returned in logarithmic scale. |
n |
number of observations. |
method |
string that indicates the method used for sampling observations. The available methods are
|
tol.eps |
the desired accuracy of convergence tolerance (for 'rejacg' method). |
max.iter |
the maximum number of iterations (for 'rejacg' method). |
This function dpkb()
computes the density value for a given point
x
from the Poisson kernel-based distribution with mean direction
vector mu
and concentration parameter rho
.
The number of observations generated is determined by n
for
rpkb()
. This function returns a list with the matrix of generated
observations x
, the number of tries numTries
and the number of
acceptances numAccepted
.
A limitation of the rejvmf
is that the method does not ensure the
computational feasibility of the sampler for approaching 1.
If the chosen method is 'rejacg', the function uniroot
, from the
stat
package, is used to estimate the beta parameter. In this case,
the complete results are provided as output.
dpkb
gives the density value;
rpkb
generates random observations from the PKBD.
If the required packages (movMF
for rejvmf
method, and
Tinflex
for rejpsaw
) are not installed, the function will display a
message asking the user to install the missing package(s).
Golzy, M. and Markatou, M. (2020) Poisson Kernel-Based Clustering on the Sphere: Convergence Properties, Identifiability, and a Method of Sampling, Journal of Computational and Graphical Statistics, 29:4, 758-770, DOI: 10.1080/10618600.2020.1740713.
Sablica L., Hornik K. and Leydold J. (2023) "Efficient sampling from the PKBD distribution", Electronic Journal of Statistics, 17(2), 2180-2209.
# Generate some data from pkbd density pkbd_dat <- rpkb(10, c(0.5,0), 0.5) # Calculate the PKBD density values dens_val <- dpkb(pkbd_dat$x, c(0.5,0.5),0.5)
# Generate some data from pkbd density pkbd_dat <- rpkb(10, c(0.5,0), 0.5) # Calculate the PKBD density values dens_val <- dpkb(pkbd_dat$x, c(0.5,0.5),0.5)
This function performs the kernel-based quadratic distance goodness-of-fit
tests. It includes tests for multivariate normality, two-sample tests and
-sample tests.
kb.test( x, y = NULL, h = NULL, method = "subsampling", B = 150, b = NULL, Quantile = 0.95, mu_hat = NULL, Sigma_hat = NULL, centeringType = "Nonparam", K_threshold = 10, alternative = "skewness" ) ## S4 method for signature 'ANY' kb.test( x, y = NULL, h = NULL, method = "subsampling", B = 150, b = 0.9, Quantile = 0.95, mu_hat = NULL, Sigma_hat = NULL, centeringType = "Nonparam", K_threshold = 10, alternative = "skewness" ) ## S4 method for signature 'kb.test' show(object)
kb.test( x, y = NULL, h = NULL, method = "subsampling", B = 150, b = NULL, Quantile = 0.95, mu_hat = NULL, Sigma_hat = NULL, centeringType = "Nonparam", K_threshold = 10, alternative = "skewness" ) ## S4 method for signature 'ANY' kb.test( x, y = NULL, h = NULL, method = "subsampling", B = 150, b = 0.9, Quantile = 0.95, mu_hat = NULL, Sigma_hat = NULL, centeringType = "Nonparam", K_threshold = 10, alternative = "skewness" ) ## S4 method for signature 'kb.test' show(object)
x |
Numeric matrix or vector of data values. |
y |
Numeric matrix or vector of data values. Depending on the input
|
h |
Bandwidth for the kernel function. If a value is not provided, the
algorithm for the selection of an optimal h is performed
automatically. See the function |
method |
The method used for critical value estimation ("subsampling", "bootstrap", or "permutation")(default: "subsampling"). |
B |
The number of iterations to use for critical value estimation (default: 150). |
b |
The size of the subsamples used in the subsampling algorithm (default: 0.8). |
Quantile |
The quantile to use for critical value estimation, 0.95 is the default value. |
mu_hat |
Mean vector for the reference distribution. |
Sigma_hat |
Covariance matrix of the reference distribution. |
centeringType |
String indicating the method used for centering the normal kernel ('Param' or 'Nonparam'). |
K_threshold |
maximum number of groups allowed. Default is 10. It is a control parameter. Change in case of more than 10 samples. |
alternative |
Family of alternative chosen for selecting h, between
"location", "scale" and "skewness" (only if |
object |
Object of class |
The function kb.test
performs the kernel-based quadratic
distance tests using the Gaussian kernel with bandwidth parameter h
.
Depending on the shape of the input y
the function performs the tests
of multivariate normality, the non-parametric two-sample tests or the
k-sample tests.
The quadratic distance between two probability distributions and
is
defined as
where is a distribution whose goodness of fit we wish to assess and
denotes the Normal kernel defined as
for every , with covariance matrix
and
tuning parameter
.
Test for Normality:
Let be a random sample with empirical
distribution function
. We test the null hypothesis of
normality, i.e.
.
We consider the U-statistic estimate of the sample KBQD
then the first test statistics is
with computed exactly following Lindsay et al.(2014),
and the V-statistic estimate
where denotes the Normal kernel
with parametric
centering with respect to the considered normal distribution
.
The asymptotic distribution of the V-statistic is an infinite combination
of weighted independent chi-squared random variables with one degree of
freedom. The cutoff value is obtained using the Satterthwaite
approximation , where
and
are computed exactly following the formulas in Lindsay et al.(2014).
For the -statistic the cutoff is determined empirically:
Generate data from the considered normal distribution ;
Compute the test statistics for B
Monte Carlo(MC) replications;
Compute the 95th quantile of the empirical distribution of the test statistic.
k-sample test:
Consider random samples of i.i.d. observations
,
.
We test if the samples are generated from the same unknown distribution,
that is
versus
, for some
.
We construct a matrix distance , with
off-diagonal elements
and in the diagonal
where denotes the Normal kernel
centered non-parametrically with respect to
We compute the trace statistic
and , derived considering all the possible pairwise comparisons
in the k-sample null hypothesis, given as
We compute the empirical critical value by employing numerical techniques such as the bootstrap, permutation and subsampling algorithms:
Generate k-tuples, of total size , from the pooled sample
following one of the sampling methods;
Compute the k-sample test statistic;
Repeat B
times;
Select the quantile of the obtained values.
Two-sample test:
Let and
be
random samples from the distributions
and
, respectively.
We test the null hypothesis that the two samples are generated from
the same unknown distribution, that is
vs
. The test statistics coincide with the
-sample
test statistics when
.
The arguments mu_hat
and Sigma_hat
indicate the normal model
considered for the normality test, that is mu_hat
,
Sigma_hat
).
For the two-sample and -sample tests,
mu_hat
and
Sigma_hat
can
be used for the parametric centering of the kernel, in the case we want to
specify the reference distribution, with centeringType = "Param"
.
This is the default method when the test for normality is performed.
The normal kernel centered with respect to
can be computed as
We consider the non-parametric centering of the kernel with respect to
where
,
with
centeringType = "Nonparam"
, for the two- and -sample
tests.
Let
denote the pooled sample. For any
, it is given by
An S4 object of class kb.test
containing the results of the
kernel-based quadratic distance tests, based on the normal kernel. The object
contains the following slots:
method
: Description of the kernel-based quadratic
distance test performed.
x
Data list of samples X (and Y).
Un
The value of the U-statistic.
H0_Un
A logical value indicating whether or not the null
hypothesis is rejected according to Un.
CV_Un
The critical value computed for the test Un.
Vn
The value of the V-statistic (if available).
H0_Vn
A logical value indicating whether or not the null
hypothesis is rejected according to Vn (if available).
CV_Vn
The critical value computed for the test Vn
(if available).
h
List with the value of bandwidth parameter used for the
normal kernel function. If select_h
is used, the matrix of computed
power values and the corresponding power plot are also provided.
B
Number of bootstrap/permutation/subsampling replications.
var_Un
exact variance of the kernel-based U-statistic.
cv_method
The method used to estimate the critical value
(one of "subsampling", "permutation" or "bootstrap").
For the two- and -sample tests, the slots
Vn
, H0_Vn
and
CV_Vn
are empty, while the computed statistics are both reported in
slots Un
, H0_Un
and CV_Un
.
A U-statistic is a type of statistic that is used to estimate a population parameter. It is based on the idea of averaging over all possible distinct combinations of a fixed size from a sample. A V-statistic considers all possible tuples of a certain size, not just distinct combinations and can be used in contexts where unbiasedness is not required.
Markatou, M. and Saraceno, G. (2024). “A Unified Framework for
Multivariate Two- and k-Sample Kernel-based Quadratic Distance
Goodness-of-Fit Tests.”
https://doi.org/10.48550/arXiv.2407.16374
Lindsay, B.G., Markatou, M. and Ray, S. (2014) "Kernels, Degrees of Freedom, and Power Properties of Quadratic Distance Goodness-of-Fit Tests", Journal of the American Statistical Association, 109:505, 395-410, DOI: 10.1080/01621459.2013.836972
kb.test for the class definition.
# create a kb.test object x <- matrix(rnorm(100),ncol=2) y <- matrix(rnorm(100),ncol=2) # Normality test my_test <- kb.test(x, h=0.5) my_test # Two-sample test my_test <- kb.test(x,y,h=0.5, method="subsampling",b=0.9, centeringType = "Nonparam") my_test # k-sample test z <- matrix(rnorm(100,2),ncol=2) dat <- rbind(x,y,z) group <- rep(c(1,2,3),each=50) my_test <- kb.test(x=dat,y=group,h=0.5, method="subsampling",b=0.9) my_test
# create a kb.test object x <- matrix(rnorm(100),ncol=2) y <- matrix(rnorm(100),ncol=2) # Normality test my_test <- kb.test(x, h=0.5) my_test # Two-sample test my_test <- kb.test(x,y,h=0.5, method="subsampling",b=0.9, centeringType = "Nonparam") my_test # k-sample test z <- matrix(rnorm(100,2),ncol=2) dat <- rbind(x,y,z) group <- rep(c(1,2,3),each=50) my_test <- kb.test(x=dat,y=group,h=0.5, method="subsampling",b=0.9) my_test
A class to represent the results of Gaussian kernel-based quadratic distance tests. This includes the normality test, the two-sample test statistics and the k-sample tests.
method
String indicating the kernel-based quadratic distance test performed.
Un
The value of the test U-statistic.
Vn
The value of the test V-statistic.
H0_Un
A logical value indicating whether or not the null hypothesis is rejected according to U-statistic.
H0_Vn
A logical value indicating whether or not the null hypothesis is rejected according to Vn.
data
List of samples X (and Y).
CV_Un
The critical value computed for the test Un.
CV_Vn
The critical value computed for the test Vn.
cv_method
The method used to estimate the critical value (one of "subsampling", "permutation" or "bootstrap").
h
A list with the value of bandwidth parameter used for the Gaussian
kernel. If the function select_h
is used, then also the matrix
of computed power values and the resulting power plot are provided.
B
Number of bootstrap/permutation/subsampling replications.
var_Un
exact variance of the kernel-based U-statistic.
kb.test()
for the function that generates this class.
# create a kb.test object x <- matrix(rnorm(100),ncol=2) y <- matrix(rnorm(100),ncol=2) # Normality test kb.test(x, h=0.5) # Two-sample test kb.test(x,y,h=0.5, method="subsampling",b=0.9)
# create a kb.test object x <- matrix(rnorm(100),ncol=2) y <- matrix(rnorm(100),ncol=2) # Normality test kb.test(x, h=0.5) # Two-sample test kb.test(x,y,h=0.5, method="subsampling",b=0.9)
This function performs the kernel-based quadratic distance goodness-of-fit
tests for Uniformity for spherical data x
using the Poisson kernel
with concentration parameter rho
.
The Poisson kernel-based test for uniformity exhibits excellent results
especially in the case of multimodal distributions, as shown in the example
of the Uniformity test on the Sphere vignette.
pk.test(x, rho, B = 300, Quantile = 0.95) ## S4 method for signature 'ANY' pk.test(x, rho, B = 300, Quantile = 0.95) ## S4 method for signature 'pk.test' show(object)
pk.test(x, rho, B = 300, Quantile = 0.95) ## S4 method for signature 'ANY' pk.test(x, rho, B = 300, Quantile = 0.95) ## S4 method for signature 'pk.test' show(object)
x |
A numeric d-dim matrix of data points on the Sphere S^(d-1). |
rho |
Concentration parameter of the Poisson kernel function. |
B |
Number of Monte Carlo iterations for critical value estimation of Un (default: 300). |
Quantile |
The quantile to use for critical value estimation, 0.95 is the default value. |
object |
Object of class |
Let be a random sample with empirical distribution
function
.
We test the null hypothesis of uniformity on the
-dimensional sphere, i.e.
, where
is the uniform
distribution on the
-dimensional sphere
.
We compute the U-statistic estimate of the sample KBQD (Kernel-Based
Quadratic Distance)
then the first test statistic is given as
with
and the V-statistic estimate of the KBQD
where denotes the Poisson kernel
centered with
respect to the uniform distribution on the
-dimensional sphere, that
is
and
for every .
The asymptotic distribution of the V-statistic is an infinite combination
of weighted independent chi-squared random variables with one degree of
freedom. The cutoff value is obtained using the Satterthwaite approximation
, where
and
.
For the -statistic the cutoff is determined empirically:
Generate data from a Uniform distribution on the d-dimensional sphere;
Compute the test statistics for B
Monte Carlo(MC) replications;
Compute the 95th quantile of the empirical distribution of the test statistic.
An S4 object of class pk.test
containing the results of the
Poisson kernel-based tests. The object contains the following slots:
method
: Description of the test performed.
x
Data matrix.
Un
The value of the U-statistic.
CV_Un
The empirical critical value for Un.
H0_Vn
A logical value indicating whether or not the null
hypothesis is rejected according to Un.
Vn
The value of the V-statistic Vn.
CV_Vn
The critical value for Vn computed following the
asymptotic distribution.
H0_Vn
A logical value indicating whether or not the null
hypothesis is rejected according to Vn.
rho
The value of concentration parameter used for the Poisson
kernel function.
B
Number of replications for the critical value of the
U-statistic Un.
A U-statistic is a type of statistic that is used to estimate a population parameter. It is based on the idea of averaging over all possible distinct combinations of a fixed size from a sample. A V-statistic considers all possible tuples of a certain size, not just distinct combinations and can be used in contexts where unbiasedness is not required.
Ding, Y., Markatou, M. and Saraceno, G. (2023). “Poisson Kernel-Based Tests for Uniformity on the d-Dimensional Sphere.” Statistica Sinica. doi:10.5705/ss.202022.0347
# create a pk.test object x_sp <- sample_hypersphere(3, n_points=100) unif_test <- pk.test(x_sp,rho=0.8) unif_test
# create a pk.test object x_sp <- sample_hypersphere(3, n_points=100) unif_test <- pk.test(x_sp,rho=0.8) unif_test
A class to represent the results of Poisson kernel-based quadratic distance tests for Uniformity on the sphere.
method
Description of the test.
x
Matrix of data
Un
The value of the U-statistic.
CV_Un
The critical value for Un computed through replications.
H0_Un
A logical value indicating whether or not the null hypothesis is rejected according to Un.
Vn
The value of the V-statistic.
CV_Vn
The critical value for Vn computed following the asymptotic distribution.
H0_Vn
A logical value indicating whether or not the null hypothesis is rejected according to Vn.
rho
The concentration parameter of the Poisson kernel.
B
Number of replications.
var_Un
exact variance of the kernel-based U-statistic.
# create a pk.test object d=3 size=100 x_sp <- sample_hypersphere(d, n_points=size) pk.test(x_sp,rho=0.8)
# create a pk.test object d=3 size=100 x_sp <- sample_hypersphere(d, n_points=size) pk.test(x_sp,rho=0.8)
The function pkbc()
performs the Poisson kernel-based clustering
algorithm on the sphere proposed by Golzy and Markatou (2020).
The proposed algorithm is based on a mixture, with components, of
Poisson kernel-based densities on the hypersphere
given by
where 's are the mixing proportions and
's denote the probability density function of a
-variate
Poisson kernel-based density given as
The parameters are estimated through a
iterative reweighted EM algorithm.
The proposed clustering algorithm exhibits excellent results when
(1) the clusters are not well separated; (2) the data points
are fairly well concentrated around the vectors of each cluster;
(3) the percentage of noise in the data increases.
pkbc( dat, nClust = NULL, maxIter = 300, stoppingRule = "loglik", initMethod = "sampleData", numInit = 10 ) ## S4 method for signature 'ANY' pkbc( dat, nClust = NULL, maxIter = 300, stoppingRule = "loglik", initMethod = "sampleData", numInit = 10 ) ## S4 method for signature 'pkbc' show(object)
pkbc( dat, nClust = NULL, maxIter = 300, stoppingRule = "loglik", initMethod = "sampleData", numInit = 10 ) ## S4 method for signature 'ANY' pkbc( dat, nClust = NULL, maxIter = 300, stoppingRule = "loglik", initMethod = "sampleData", numInit = 10 ) ## S4 method for signature 'pkbc' show(object)
dat |
Data matrix or data.frame of data points on the sphere to be
clustered. The observations in |
nClust |
Number of clusters. It can be a single value or a numeric vector. |
maxIter |
The maximum number of iterations before a run is terminated. |
stoppingRule |
String describing the stopping rule to be used within
each run. Currently must be either |
initMethod |
String describing the initialization method to be used.
Currently must be |
numInit |
Number of initialization. |
object |
Object of class |
We set all concentration parameters equal to 0.5 and all mixing proportions
to be equal.
The initialization method 'sampleData'
indicates that observation
points are randomly chosen as initializers of the centroids .
This random starts strategy has a chance of not obtaining initial
representatives from the underlying clusters, then the clustering is
performed
numInit
times and the random start with the highest
likelihood is chosen as the final estimate of the parameters.
The possible stoppingRule
for each iteration are:
'loglik'
run the algorithm until the change in log-likelihood from
one iteration to the next is less than a given threshold (1e-7)
'membership'
run the algorithm until the membership is unchanged for
all points from one iteration to the next
'max'
reach a maximum number of iterations maxIter
The obtained estimates are used for assigning final memberships, identifying
the nClust
clusters, according to the following rule
The number of clusters nClust
must be provided as input to the
clustering algorithm.
An S4 object of class pkbc
containing the results of the
clustering procedure based on Poisson kernel-based distributions. The object
contains the following slots:
res_k
: List of results of the Poisson kernel-based clustering
algorithm for each value of number of clusters specified in nClust
.
Each object in the list contains:
postProbs
Posterior probabilities of each observation for
the indicated clusters.
LogLik
Maximum value of log-likelihood function
wcss
Values of within-cluster sum of squares computed with
Euclidean distance and cosine similarity, respectively.
params
List of estimated parameters of the mixture model
mu
estimated centroids
rho
estimated concentration parameters rho
alpha
estimated mixing proportions
finalMemb
Vector of final memberships
runInfo
List of information of the EM algorithm iterations
lokLikVec
vector of log-likelihood values
numIterPerRun
number of E-M iterations per run
input
: List of input information.
The clustering algorithm is tailored for data points on the sphere
, but it can also be performed on spherically
transformed observations, i.e. data points on the Euclidean space
that are normalized such that they lie on the
corresponding
-dimensional sphere
.
Golzy, M. and Markatou, M. (2020) Poisson Kernel-Based Clustering on the Sphere: Convergence Properties, Identifiability, and a Method of Sampling, Journal of Computational and Graphical Statistics, 29:4, 758-770, DOI: 10.1080/10618600.2020.1740713.
dpkb()
and rpkb()
for more information on the Poisson
kernel-based distribution.
pkbc for the class definition.
#We generate three samples of 100 observations from 3-dimensional #Poisson kernel-based densities with rho=0.8 and different mean directions size<-100 groups<-c(rep(1, size), rep(2, size),rep(3,size)) rho<-0.8 set.seed(081423) data1<-rpkb(size, c(1,0,0),rho) data2<-rpkb(size, c(0,1,0),rho) data3<-rpkb(size, c(0,0,1),rho) dat<-rbind(data1$x,data2$x, data3$x) #Perform the clustering algorithm with number of clusters k=3. pkbd<- pkbc(dat=dat, nClust=3) show(pkbd)
#We generate three samples of 100 observations from 3-dimensional #Poisson kernel-based densities with rho=0.8 and different mean directions size<-100 groups<-c(rep(1, size), rep(2, size),rep(3,size)) rho<-0.8 set.seed(081423) data1<-rpkb(size, c(1,0,0),rho) data2<-rpkb(size, c(0,1,0),rho) data3<-rpkb(size, c(0,0,1),rho) dat<-rbind(data1$x,data2$x, data3$x) #Perform the clustering algorithm with number of clusters k=3. pkbd<- pkbc(dat=dat, nClust=3) show(pkbd)
Method for objects of class pkbc
which computes evaluation measures
for clustering results.
The following evaluation measures are computed:
In-Group Proportion (Kapp and Tibshirani (2007)). If true label are
provided, ARI, Average Silhouette Width (Rousseeuw (1987)), Macro-Precision
and Macro-Recall are computed.
pkbc_validation(object, true_label = NULL)
pkbc_validation(object, true_label = NULL)
object |
Object of class |
true_label |
factor or vector of true membership to clusters (if available). It must have the same length of final memberships. |
The IGP is a statistical measure that quantifies the proportion of observations within a group that belong to the same predefined category or class. It is often used to assess the homogeneity of a group by evaluating how many of its members share the same label. A higher IGP indicates that the group is more cohesive, while a lower proportion suggests greater diversity or misclassification within the group (Kapp and Tibshirani 2007).
The Adjusted Rand Index (ARI) is a statistical measure used in data clustering analysis. It quantifies the similarity between two partitions of a dataset by comparing the assignments of data points to clusters. The ARI value ranges from 0 to 1, where a value of 1 indicates a perfect match between the partitions and a value close to 0 indicates a random assignment of data points to clusters.
Each cluster can represented by a so-called silhouette which is based on the comparison of its tightness and separation. The average silhouette width provides an evaluation of clustering validity, and might be used to select an appropriate number of clusters (Rousseeuw 1987).
Macro Precision is a metric used in multi-class classification that calculates the precision for each class independently and then takes the average of these values. Precision for a class is defined as the proportion of true positive predictions out of all predictions made for that class.
Macro Recall is similar to Macro Precision but focuses on recall. Recall for a class is the proportion of true positive predictions out of all actual instances of that class. Macro Recall is the average of the recall values computed for each class.
List with the following components:
metrics
Table of computed evaluation measures for each value
of number of clusters in the pkbc
object. The
number of cluster is indicated as column name.
IGP
List of in-group proportions for each value of number of
clusters specified.
Note that Macro Precision and Macro Recall depend on the assigned labels, while the ARI measures the similarity between partition up to label switching.
If the required packages (mclust
for ARI, clusterRepro
for IGP, and
cluster
for ASW) are not installed, the function will display a message
asking the user to install the missing package(s).
Kapp, A.V. and Tibshirani, R. (2007) "Are clusters found in one dataset present in another dataset?", Biostatistics, 8(1), 9–31, https://doi.org/10.1093/biostatistics/kxj029
Rousseeuw, P.J. (1987) Silhouettes: A graphical aid to the interpretation and validation of cluster analysis. Journal of Computational and Applied Mathematics, 20, 53–65.
pkbc()
for the clustering algorithm
pkbc for the class object definition.
#We generate three samples of 100 observations from 3-dimensional #Poisson kernel-based densities with rho=0.8 and different mean directions size<-20 groups<-c(rep(1, size), rep(2, size),rep(3,size)) rho<-0.8 set.seed(081423) data1<-rpkb(size, c(1,0,0),rho,method='rejvmf') data2<-rpkb(size, c(0,1,0),rho,method='rejvmf') data3<-rpkb(size, c(1,0,0),rho,method='rejvmf') data<-rbind(data1$x,data2$x, data3$x) #Perform the clustering algorithm pkbc_res<- pkbc(data, 3) pkbc_validation(pkbc_res)
#We generate three samples of 100 observations from 3-dimensional #Poisson kernel-based densities with rho=0.8 and different mean directions size<-20 groups<-c(rep(1, size), rep(2, size),rep(3,size)) rho<-0.8 set.seed(081423) data1<-rpkb(size, c(1,0,0),rho,method='rejvmf') data2<-rpkb(size, c(0,1,0),rho,method='rejvmf') data3<-rpkb(size, c(1,0,0),rho,method='rejvmf') data<-rbind(data1$x,data2$x, data3$x) #Perform the clustering algorithm pkbc_res<- pkbc(data, 3) pkbc_validation(pkbc_res)
A class to represent the results of Poisson kernel-based clustering procedure for spherical observations.
res_k
List of objects with the results of the clustering algorithm for each value of possible number of clusters considered.
input
List of input data
pkbc()
for more details.
data("wireless") res <- pkbc(as.matrix(wireless[,-8]),4)
data("wireless") res <- pkbc(as.matrix(wireless[,-8]),4)
Plots for a pkbc object.
## S4 method for signature 'pkbc,ANY' plot(x, k = NULL, true_label = NULL, pca_res = FALSE, ...)
## S4 method for signature 'pkbc,ANY' plot(x, k = NULL, true_label = NULL, pca_res = FALSE, ...)
x |
Object of class |
k |
number of considered clusters. If it is not provided the scatter
plot is displayed for each value of number of clusters present in
the |
true_label |
factor or vector of true membership to clusters (if available). It must have the same length of final memberships. |
pca_res |
Logical. If TRUE the results from PCALocantore are also reported (when dimension is greater than 3). |
... |
Additional arguments that can be passed to the plot function |
scatterplot: If dimension is equal to 2 or 3, points are displayed on the
circle and sphere, respectively. If dimension if greater than 3, the
spherical Principal Component procedure proposed by Locantore et al. (1999),
is applied for dimensionality reduction and the first three principal
components are normalized and displayed on the sphere. For d > 3, the
complete results from the PcaLocantore
function (package rrcov
)
are returned if pca_res=TRUE
.
elbow plot: the within cluster sum of squares (wcss) is computed using the Euclidean distance (left) and the cosine similarity (right).
The scatter-plot(s) and the elbow plot.
The elbow plot is commonly used as a graphical method for choosing the appropriate number of clusters. Specifically, plotting the wcss versus the number of clusters, the suggested number of clusters correspond to the point in which the plotted line has the greatest change in slope, showing an elbow.
Locantore, N., Marron, J.S., Simpson, D.G. et al. (1999) "Robust principal component analysis for functional data." Test 8, 1–73. https://doi.org/10.1007/BF02595862
pkbc()
for the clustering algorithm
pkbc for the class object definition.
dat<-matrix(rnorm(300),ncol=3) pkbc_res<- pkbc(dat, 3) plot(pkbc_res, 3)
dat<-matrix(rnorm(300),ncol=3) pkbc_res<- pkbc(dat, 3) plot(pkbc_res, 3)
Obtain predictions of membership for spherical observations based on a
mixture of Poisson kernel-based densities estimated by pkbc
## S4 method for signature 'pkbc' predict(object, k, newdata = NULL)
## S4 method for signature 'pkbc' predict(object, k, newdata = NULL)
object |
Object of class |
k |
Number of clusters to be used. |
newdata |
a data.frame or a matrix of the data. If missing the
clustering data obtained from the |
Returns a list with the following components
Memb: vector of predicted memberships of newdata
Probs: matrix where entry (i,j) denotes the probability that observation i belongs to the k-th cluster.
pkbc()
for the clustering algorithm
pkbc for the class object definition.
# generate data dat <- rbind(matrix(rnorm(100),ncol=2),matrix(rnorm(100,5),ncol=2)) res <- pkbc(dat,2) # extract membership of dat predict(res,k=2) # predict membership of new data newdat <- rbind(matrix(rnorm(10),ncol=2),matrix(rnorm(10,5),ncol=2)) predict(res, k=2, newdat)
# generate data dat <- rbind(matrix(rnorm(100),ncol=2),matrix(rnorm(100,5),ncol=2)) res <- pkbc(dat,2) # extract membership of dat predict(res,k=2) # predict membership of new data newdat <- rbind(matrix(rnorm(10),ncol=2),matrix(rnorm(10,5),ncol=2)) predict(res, k=2, newdat)
Generate a random sample from the uniform distribution on the hypersphere.
sample_hypersphere(d, n_points = 1)
sample_hypersphere(d, n_points = 1)
d |
Number of dimensions. |
n_points |
Number of sampled observations. |
Data matrix with the sampled observations.
x_sp <- sample_hypersphere(3,100)
x_sp <- sample_hypersphere(3,100)
This function computes the kernel bandwidth of the Gaussian kernel for the normality, two-sample and k-sample kernel-based quadratic distance (KBQD) tests.
select_h( x, y = NULL, alternative = NULL, method = "subsampling", b = 0.8, B = 100, delta_dim = 1, delta = NULL, h_values = NULL, Nrep = 50, n_cores = 2, Quantile = 0.95, power.plot = TRUE )
select_h( x, y = NULL, alternative = NULL, method = "subsampling", b = 0.8, B = 100, delta_dim = 1, delta = NULL, h_values = NULL, Nrep = 50, n_cores = 2, Quantile = 0.95, power.plot = TRUE )
x |
Data set of observations from X. |
y |
Numeric matrix or vector of data values. Depending on the input
|
alternative |
Family of alternative chosen for selecting h, between "location", "scale" and "skewness". |
method |
The method used for critical value estimation ("subsampling", "bootstrap", or "permutation"). |
b |
The size of the subsamples used in the subsampling algorithm . |
B |
The number of iterations to use for critical value estimation, B = 150 as default. |
delta_dim |
Vector of coefficient of alternative with respect to each dimension |
delta |
Vector of parameter values indicating chosen alternatives |
h_values |
Values of the tuning parameter used for the selection |
Nrep |
Number of bootstrap/permutation/subsampling replications. |
n_cores |
Number of cores used to parallel the h selection algorithm. If this is not provided, the function will detect the available cores. |
Quantile |
The quantile to use for critical value estimation, 0.95 is the default value. |
power.plot |
Logical. If TRUE, it is displayed the plot of power for values in h_values and delta. |
The function performs the selection of the optimal value for the tuning
parameter of the normal kernel function, for normality test, the
two-sample and k-sample KBQD tests. It performs a small simulation study,
generating samples according to the family of
alternative
specified,
for the chosen values of h_values
and delta
.
We consider target alternatives , where
and
indicate the location,
covariance and skewness parameter estimates from the pooled sample.
Compute the estimates of the mean , covariance matrix
and skewness
from the pooled sample.
Choose the family of alternatives .
For each value of and
:
Generate , for
;
Generate ;
Compute the -sample test statistic between
with kernel parameter
;
Compute the power of the test. If it is greater than 0.5,
select as optimal value.
If an optimal value has not been selected, choose the which
corresponds to maximum power.
The available alternative
are
location alternatives, ,with
;
scale alternatives,
,
;
skewness alternatives,
,
with
.
The values of and
are set as
default values.
The function select_h()
allows the user to
set the values of and
for a more extensive grid search.
We suggest to set a more extensive grid search when computational resources
permit.
A list with the following attributes:
h_sel
the selected value of tuning parameter h;
power
matrix of power values computed for the considered
values of delta
and h_values
;
power.plot
power plots (if power.plot
is TRUE
).
Please be aware that the select_h()
function may take a significant
amount of time to run, especially with larger datasets or when using an
larger number of parameters in h_values
and delta
. Consider
this when applying the function to large or complex data.
Markatou, M. and Saraceno, G. (2024). “A Unified Framework for
Multivariate Two- and k-Sample Kernel-based Quadratic Distance
Goodness-of-Fit Tests.”
https://doi.org/10.48550/arXiv.2407.16374
Saraceno, G., Markatou, M., Mukhopadhyay, R. and Golzy, M. (2024).
Goodness-of-Fit and Clustering of Spherical Data: the QuadratiK package
in R and Python.
https://arxiv.org/abs/2402.02290.
The function select_h
is used in the kb.test()
function.
# Select the value of h using the mid-power algorithm x <- matrix(rnorm(100),ncol=2) y <- matrix(rnorm(100),ncol=2) h_sel <- select_h(x,y,"skewness") h_sel
# Select the value of h using the mid-power algorithm x <- matrix(rnorm(100),ncol=2) y <- matrix(rnorm(100),ncol=2) h_sel <- select_h(x,y,"skewness") h_sel
Method for objects of class pkbc
which computes some
descriptive for each variable with respect to the detected groups.
Method for objects of class pkbc
which computes descriptive
statistics for each variable with respect to the detected groups.
stats_clusters(object, ...) ## S4 method for signature 'pkbc' stats_clusters(object, k)
stats_clusters(object, ...) ## S4 method for signature 'pkbc' stats_clusters(object, k)
object |
Object of class |
... |
possible additional inputs |
k |
Number of clusters to be used. |
The function computes mean, standard deviation, median, inter-quantile range, minimum and maximum for each variable in the data set given the final membership assigned by the clustering algorithm.
List with computed descriptive statistics for each dimension.
pkbc()
for the clustering algorithm
pkbc for the class object definition.
#We generate three samples of 100 observations from 3-dimensional #Poisson kernel-based densities with rho=0.8 and different mean directions dat<-matrix(rnorm(300),ncol=3) #Perform the clustering algorithm pkbc_res<- pkbc(dat, 3) stats_clusters(pkbc_res, 3)
#We generate three samples of 100 observations from 3-dimensional #Poisson kernel-based densities with rho=0.8 and different mean directions dat<-matrix(rnorm(300),ncol=3) #Perform the clustering algorithm pkbc_res<- pkbc(dat, 3) stats_clusters(pkbc_res, 3)
summary
method for the class kb.test
## S4 method for signature 'kb.test' summary(object)
## S4 method for signature 'kb.test' summary(object)
object |
Object of class |
List with the following components:
summary_tables
Table of computed descriptive statistics per
variable (and per group if available).
test_results
Data frame with the results of the performed
kernel-based quadratic distance test.
qqplots
Figure with qq-plots for each variable.
kb.test()
and kb.test for more details.
# create a kb.test object x <- matrix(rnorm(100),ncol=2) # Normality test my_test <- kb.test(x, h=0.5) summary(my_test)
# create a kb.test object x <- matrix(rnorm(100),ncol=2) # Normality test my_test <- kb.test(x, h=0.5) summary(my_test)
summary
method for the class pk.test
## S4 method for signature 'pk.test' summary(object)
## S4 method for signature 'pk.test' summary(object)
object |
Object of class |
List with the following components:
summary_tables
Table of computed descriptive statistics per
variable.
test_results
Data frame with the results of the performed
Poisson kernel-based test.
qqplots
Figure with qq-plots for each variable against the
uniform distribution.
pk.test()
and pk.test for additional details.
# create a pk.test object x_sp <- sample_hypersphere(3, n_points=100) unif_test <- pk.test(x_sp,rho=0.8) summary(unif_test)
# create a pk.test object x_sp <- sample_hypersphere(3, n_points=100) unif_test <- pk.test(x_sp,rho=0.8) summary(unif_test)
Summary method for class "pkbc"
## S4 method for signature 'pkbc' summary(object)
## S4 method for signature 'pkbc' summary(object)
object |
Object of class |
Display the logLikelihood values and within cluster sum of squares (wcss) for all the values of number of clusters provided. For each of these values the estimated mixing proportions are showed together with a table with the assigned memberships.
pkbc()
for the clustering algorithm
pkbc for the class object definition.
dat <- rbind(matrix(rnorm(100),2),matrix(rnorm(100,5),2)) res <- pkbc(dat,2:4) summary(res)
dat <- rbind(matrix(rnorm(100),2),matrix(rnorm(100,5),2)) res <- pkbc(dat,2:4) summary(res)
The wine
data frame has 178 rows and 14 columns. The first 13
variables report 13 constituents found in each of the three types of wines.
The last column indicates the class labels (1,2 or 3).
wine
wine
A data frame containing the following columns:
Alcohol
Malic acid
Ash
Alcalinity of ash
Magnesium
Total phenols
Flavanoids
Nonflavanoid phenols
Proanthocyanins
Color intensity
Hue
OD280/OD315 of diluted wines
Proline
y
: class membership
These data are the results of a chemical analysis of wines grown in the same region in Italy but derived from three different cultivars. The analysis determined the quantities of 13 constituents found in each of the three types of wines.
Aeberhard, S. and Forina, M. (1991). Wine. UCI Machine Learning Repository. https://doi.org/10.24432/C5PC7J.
Aeberhard, S., Coomans, D. and De Vel, O. (1994). Comparative analysis of statistical pattern recognition methods in high dimensional settings. Pattern Recognition, 27(8), 1065-1077.
data(wine) summary(wine)
data(wine) summary(wine)
The wireless
data frame has 2000 rows and 8 columns. The first 7
variables report the measurements of the Wi-Fi signal strength received from
7 Wi-Fi routers in an office location in Pittsburgh (USA). The last column
indicates the class labels.
wireless
wireless
A data frame containing the following columns:
V1
Signal strength from router 1.
V2
Signal strength from router 2.
V3
Signal strength from router 3.
V4
Signal strength from router 4.
V5
Signal strength from router 5.
V6
Signal strength from router 6.
V7
Signal strength from router 7.
V8
Group memberships, from 1 to 4.
The Wi-Fi signal strength is measured in dBm, decibel milliwatts, which is expressed as a negative value ranging from -100 to 0. The labels correspond to 4 different rooms. In total, we have 4 groups with 500 observations each.
Bhatt, R. (2017). Wireless Indoor Localization.
UCI Machine Learning Repository.
https://doi.org/10.24432/C51880.
Rohra, J.G., Perumal, B., Narayanan, S.J., Thakur, P. and Bhatt, R.B. (2017). "User Localization in an Indoor Environment Using Fuzzy Hybrid of Particle Swarm Optimization & Gravitational Search Algorithm with Neural Networks". In: Deep, K., et al. Proceedings of Sixth International Conference on Soft Computing for Problem Solving. Advances in Intelligent Systems and Computing, vol 546. Springer, Singapore. https://doi.org/10.1007/978-981-10-3322-3_27
data(wireless) summary(wireless)
data(wireless) summary(wireless)