Package: DataSimilarity 0.4.0

Marieke Stolte

DataSimilarity: Quantifying Similarity of Datasets and Multivariate Two- And k-Sample Testing

A collection of methods for quantifying the similarity of two or more datasets, many of which can be used for two- or k-sample testing. It provides newly implemented methods as well as wrapper functions for existing methods that enable calling many different methods in a unified framework. The methods were selected from the review and comparison of Stolte et al. (2024) <doi:10.1214/24-SS149>. An empirical comparison of the methods was performed in Stolte et al. (2026) <doi:10.48550/arXiv.2604.11458> for categorical data and in Stolte et al. (2026) <doi:10.48550/arXiv.2604.12327> for numeric data.

Authors: Marieke Stolte [aut, cre, cph], Luca Sauer [aut], David Alvarez-Melis [ctb], Nabarun Deb [ctb], Bodhisattva Sen [ctb]

DataSimilarity_0.4.0.tar.gz
DataSimilarity_0.4.0.tar.gz (r-4.7-any)
DataSimilarity_0.4.0.tar.gz (r-4.6-any)
DataSimilarity_0.4.0.tgz (r-4.6-emscripten)

# Install 'DataSimilarity' in R:
install.packages('DataSimilarity', repos = c('https://cran.r-universe.dev', 'https://cloud.r-project.org'))
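
To make the two-sample testing mentioned in the package description concrete, here is a minimal sketch in base R of a permutation test built on the energy statistic. It does not call DataSimilarity and is not the package's implementation; the function names `energy_stat` and `perm_test` and their arguments are hypothetical and chosen only for illustration.

```r
# Illustrative sketch only: base R, not the DataSimilarity implementation.
# energy_stat() and perm_test() are hypothetical helper names.

# Energy statistic (V-statistic form) between two numeric samples,
# where rows are observations: 2*E|X-Y| - E|X-X'| - E|Y-Y'|.
energy_stat <- function(x, y) {
  x <- as.matrix(x)
  y <- as.matrix(y)
  n <- nrow(x)
  m <- nrow(y)
  # Pairwise Euclidean distances within the pooled sample
  d <- as.matrix(stats::dist(rbind(x, y)))
  ix <- seq_len(n)
  iy <- n + seq_len(m)
  2 * mean(d[ix, iy]) - mean(d[ix, ix]) - mean(d[iy, iy])
}

# Permutation test: reshuffle the sample labels and recompute the statistic.
perm_test <- function(x, y, n_perm = 199, seed = 1) {
  set.seed(seed)
  x <- as.matrix(x)
  y <- as.matrix(y)
  pooled <- rbind(x, y)
  n <- nrow(x)
  observed <- energy_stat(x, y)
  perm_stats <- replicate(n_perm, {
    idx <- sample(nrow(pooled))
    energy_stat(pooled[idx[seq_len(n)], , drop = FALSE],
                pooled[idx[-seq_len(n)], , drop = FALSE])
  })
  # Add-one correction keeps the p-value strictly positive
  p_value <- (1 + sum(perm_stats >= observed)) / (n_perm + 1)
  list(statistic = observed, p.value = p_value)
}

# Simulated data: one comparison sample from the same distribution,
# one with a shifted mean.
set.seed(42)
x       <- matrix(rnorm(100), ncol = 2)
y_same  <- matrix(rnorm(100), ncol = 2)
y_shift <- matrix(rnorm(100, mean = 2), ncol = 2)

res_same  <- perm_test(x, y_same)
res_shift <- perm_test(x, y_shift)
```

A larger statistic and a smaller p-value for `res_shift` than for `res_same` indicate that the shifted sample is less similar to `x`. The exported functions listed on this page, such as `Energy`, cover this and many other methods; their actual arguments are documented in the help manual.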
This package does not link to any GitHub/GitLab/R-Forge repository. No issue tracker or development information is available.

2.78 score, 534 downloads, 70 exports, 1 dependency

Last updated from: b0bceebe6a. Checks: 4 OK. Indexed: yes.

Target | Result | Time
linux-devel-x86_64 | OK | 280
source / vignettes | OK | 687
linux-release-x86_64 | OK | 299
wasm-release | OK | 172

Exports: AUC, Bahr, BallDivergence, BF, BG, BG2, BMG, BQS, C2ST, CCS, CCS_cat, CF, CF_cat, CMDistance, Cramer, DataSimilarity, DiProPerm, DISCOB, DISCOF, DS, dwdProj, Energy, engineerMetric, f.a, f.aCat, f.s, f.sCat, findSigma, findSimilarityMethod, FR, FR_cat, FStest, GGRL, GGRLCat, GPK, gTests, gTests_cat, gTestsMulti, HamiltonPath, hammingDist, HMN, Jeffreys, kerTests, KMD, knn, knn.bf, knn.fast, LHZ, LHZStatistic, MD, MMCM, MMD, MST, MST5, MW, NKT, OTDD, Petrie, rectPartition, RISE, RItest, Rosenbaum, SC, SH, svmProj, tStat, Wasserstein, YMRZL, ZC, ZC_cat

Dependencies: boot

Details on methods and implementations

Rendered from Details.Rnw using utils::Sweave on Jun 14 2026.

Last update: 2026-05-15
Started: 2025-06-16

Getting Started with DataSimilarity

Rendered from GettingStarted.Rnw using utils::Sweave on Jun 14 2026.

Last update: 2026-05-15
Started: 2025-06-16

Readme and manuals

Help Manual

Help page | Topics
Quantifying Similarity of Datasets and Multivariate Two- And k-Sample Testing | DataSimilarity-package
Bahr (1996) Multivariate Two-sample Test | Bahr
Ball Divergence Based Two- or k-sample Test | BallDivergence
Baringhaus and Franz (2010) Rigid Motion Invariant Multivariate Two-sample Test | BF
Biau and Gyorfi (2005) Two-sample Homogeneity Test | BG
Biswas and Ghosh (2014) Two-Sample Test | BG2
Biswas et al. (2014) Two-sample Runs Test | BMG
Barakat et al. (1996) Two-Sample Test | BQS
Classifier Two-Sample Test | C2ST
Weighted Edge-Count Two-Sample Test | CCS
Weighted Edge-Count Two-Sample Test for Discrete Data | CCS_cat
Generalized Edge-Count Test | CF
Generalized Edge-Count Test for Discrete Data | CF_cat
Constrained Minimum Distance | CMDistance
Cramér Two-Sample Test | Cramer
Dataset Similarity | DataSimilarity
Direction-Projection Functions for DiProPerm Test | dipro.fun, dwdProj, svmProj
Direction-Projection-Permutation (DiProPerm) Test | DiProPerm
Distance Components (DISCO) Tests | DISCOB
Distance Components (DISCO) Tests | DISCOF
Rank-Based Energy Test (Deb and Sen, 2021) | DS
Energy Statistic and Test | Energy
Engineer Metric | engineerMetric
Selection of Appropriate Methods for Quantifying the Similarity of Datasets | findSimilarityMethod
Friedman-Rafsky Test | FR
Friedman-Rafsky Test for Discrete Data | FR_cat
Multisample FS Test | FStest
Decision-Tree Based Measure of Dataset Distance and Two-Sample Test | f.a, f.aCat, f.s, f.sCat, GGRL, GGRLCat
Generalized Permutation-Based Kernel (GPK) Two-Sample Test | findSigma, GPK
Graph-Based Tests | gTests
Graph-Based Tests for Discrete Data | gTests_cat
Graph-Based Multi-Sample Test | gTestsMulti
Shortest Hamilton path | HamiltonPath
Random Forest Based Two-Sample Test | HMN
Jeffreys Divergence | Jeffreys
Generalized Permutation-Based Kernel (GPK) Two-Sample Test | kerTests
Kernel Measure of Multi-Sample Dissimilarity (KMD) | KMD
K-Nearest Neighbor Graph | knn, knn.bf, knn.fast
Empirical Characteristic Distance | LHZ
Calculation of the Li et al. (2022) Empirical Characteristic Distance | LHZStatistic
List of Methods Included in the Package | method.table
Multisample Mahalanobis Crossmatch (MMCM) Test | MMCM
Maximum Mean Discrepancy (MMD) Test | MMD
Minimum Spanning Tree (MST) | MST, MST5
Nonparametric Graph-Based LP (GLP) Test | MW
Decision-Tree Based Measure of Dataset Similarity ('Ntoutsi et al., 2008') | NKT
Optimal Transport Dataset Distance | hammingDist, OTDD
Multisample Crossmatch (MCM) Test | Petrie
Calculate a Rectangular Partition | rectPartition
Rank In Similarity Graph Edge-count two-sample test (RISE) | RISE
Multisample RI Test | RItest
Rosenbaum Crossmatch Test | Rosenbaum
Graph-Based Multi-Sample Test | SC
Schilling-Henze Nearest Neighbor Test | SH
Univariate Two-Sample Statistics for DiProPerm Test | AUC, MD, stat.fun, tStat
Wasserstein Distance Based Test | Wasserstein
Yu et al. (2007) Two-Sample Test | YMRZL
Maxtype Edge-Count Test | ZC
Maxtype Edge-Count Test for Discrete Data | ZC_cat