Title: | Optimal Distribution Preserving Down-Sampling of Bio-Medical Data |
---|---|
Description: | An optimized method for distribution-preserving class-proportional down-sampling of bio-medical data. |
Authors: | Jorn Lotsch [aut,cre] , Sebastian Malkusch [aut] , Alfred Ultsch [aut] |
Maintainer: | Jorn Lotsch <[email protected]> |
License: | GPL-3 |
Version: | 1.0.1 |
Built: | 2024-12-12 07:00:26 UTC |
Source: | CRAN |
Data set of 6 flow cytometry-based lymphoma markers from 55,843 cells from healthy subjects (class 1) and 55,843 cells from lymphoma patients (class 2).
data("FlowcytometricData")
Size 111686 x 6, stored in FlowcytometricData$[Var_1,Var_2,Var_3,Var_4,Var_5,Var_6]
Classes 2, stored in FlowcytometricData$Cls
data(FlowcytometricData)
str(FlowcytometricData)
Data set of 30000 instances with 10 variables that are Gaussian mixtures. Each instance belongs to one of three classes (Cls = 1, 2, or 3), which differ in their means and standard deviations and have weights of 0.5, 0.4, and 0.1, respectively.
data("GMMartificialData")
Size 30000 x 10, stored in GMMartificialData$[X1,X2,X3,X4,X5,X6,X7,X8,X9,X10]
Classes 3, stored in GMMartificialData$Cls
data(GMMartificialData)
str(GMMartificialData)
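To make the structure of such a data set concrete, the following is a minimal base R sketch that generates a toy data set of the same shape (30000 x 10 plus a class column) with class weights 0.5, 0.4, and 0.1. The means and standard deviations are hypothetical values chosen for illustration only; this is not the code that produced GMMartificialData.

```r
# Toy Gaussian mixture with the same shape as GMMartificialData.
# The means and standard deviations below are hypothetical.
set.seed(1)
n       <- 30000
weights <- c(0.5, 0.4, 0.1)   # class weights taken from the description
means   <- c(0, 3, 6)         # assumed per-class means
sds     <- c(1, 1.5, 0.5)     # assumed per-class standard deviations

# Assign each instance to a class according to the weights
Cls <- sample(1:3, n, replace = TRUE, prob = weights)

# Draw 10 variables; each one depends on the class of the instance
X <- sapply(1:10, function(j) rnorm(n, mean = means[Cls], sd = sds[Cls]))
colnames(X) <- paste0("X", 1:10)

toy <- data.frame(X, Cls = Cls)

# Observed class proportions should be close to the weights
round(prop.table(table(toy$Cls)), 2)
```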
The package provides the necessary functions for optimal distribution-preserving down-sampling of large (bio-medical) data sets.
opdisDownsampling(Data, Cls, Size, Seed, nTrials = 1000, TestStat = "ad", MaxCores = getOption("mc.cores", 2L), PCAimportance = FALSE)
Data | the data, which must be numeric, as a vector, matrix or data frame. |
Cls | the class information, if any, as a vector with one entry per instance in the data. |
Size | the total number of instances across all classes to be drawn. |
Seed | a predefined seed for the random sampling. |
nTrials | the number of random samples drawn, from which the best one is chosen. |
TestStat | the statistical criterion used to judge the similarity between sample and original data. |
MaxCores | the maximum number of CPU cores to use for parallel computing. |
PCAimportance | PCA-based feature selection; if TRUE, only variables important in the PCA projection are considered. |
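The interplay of Size, Seed, nTrials, and TestStat can be illustrated with a minimal base R sketch: draw nTrials class-proportional random samples and keep the one whose worst-matching variable is closest to the original distribution. This is not the package's implementation. The function name sketch_downsample and its return element Statistic are hypothetical, and the two-sample Kolmogorov-Smirnov statistic from base R stands in for the package's default "ad" criterion.

```r
# Hypothetical helper, not part of opdisDownsampling.
# Picks, out of nTrials class-proportional random samples, the one with the
# smallest worst-case Kolmogorov-Smirnov statistic across all variables.
sketch_downsample <- function(Data, Cls, Size, Seed = 42, nTrials = 100) {
  Data <- as.data.frame(Data)
  set.seed(Seed)
  frac <- Size / nrow(Data)
  idx_by_class <- split(seq_len(nrow(Data)), Cls)

  best_idx  <- NULL
  best_stat <- Inf
  for (trial in seq_len(nTrials)) {
    # Class-proportional draw: the same fraction from every class
    idx <- unlist(lapply(idx_by_class, function(ix) {
      ix[sample.int(length(ix), max(1L, round(length(ix) * frac)))]
    }), use.names = FALSE)

    # Worst-case distance between sample and full data over all variables
    stat <- max(vapply(Data, function(col) {
      unname(suppressWarnings(ks.test(col[idx], col, exact = FALSE)$statistic))
    }, numeric(1)))

    if (stat < best_stat) {
      best_stat <- stat
      best_idx  <- idx
    }
  }
  list(ReducedInstances = sort(best_idx), Statistic = best_stat)
}

# Usage: draw half of iris while keeping the class proportions
res <- sketch_downsample(iris[, 1:4], iris$Species, Size = 75, nTrials = 100)
table(iris$Species[res$ReducedInstances])
```

Because the per-class counts are rounded, the total number of drawn instances in this sketch can deviate slightly from Size when Size is not a clean fraction of every class.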
Returns a list containing the drawn sample, the omitted data, and the instance numbers of the drawn sample.
ReducedData | the selected sample data and class information. |
RemovedData | the not-selected sample data and class information. |
ReducedInstances | the instance numbers of the selected sample data. |
Jorn Lotsch
Lotsch, J., Malkusch, S., Ultsch, A. (2021): Optimal distribution-preserving downsampling of large biomedical data sets (opdisDownsampling). PLoS One. 2021 Aug 5;16(8):e0255838. doi: 10.1371/journal.pone.0255838. eCollection 2021.
## example 1
data(iris)
Iris50percent <- opdisDownsampling(Data = iris[,1:4], Cls = as.integer(iris$Species),
                                   Size = 50, MaxCores = 1)
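A short follow-up sketch for inspecting the result of such a call, assuming the opdisDownsampling package is installed. It uses only the arguments listed in the usage above; the seed value 42 is an arbitrary choice, and the exact structure of the returned elements may differ between package versions.

```r
# Assumes the opdisDownsampling package is installed
library(opdisDownsampling)

data(iris)
res <- opdisDownsampling(Data = iris[, 1:4], Cls = as.integer(iris$Species),
                         Size = 50, Seed = 42, MaxCores = 1)

# Top-level structure of the returned list
str(res, max.level = 1)

# Number of instances that were selected
length(res$ReducedInstances)
```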