Package 'opdisDownsampling'

Title: Optimal Distribution Preserving Down-Sampling of Bio-Medical Data
Description: An optimized method for distribution-preserving class-proportional down-sampling of bio-medical data.
Authors: Jorn Lotsch [aut, cre], Sebastian Malkusch [aut], Alfred Ultsch [aut]
Maintainer: Jorn Lotsch <[email protected]>
License: GPL-3
Version: 1.0.1
Built: 2024-12-12 07:00:26 UTC
Source: CRAN

Example data of hematologic marker expression.

Description

Data set of 6 flow cytometry-based lymphoma markers from 55,843 cells from healthy subjects (class 1) and 55,843 cells from lymphoma patients (class 2).

Usage

data("FlowcytometricData")

Details

Size 111686 x 6, stored in FlowcytometricData$[Var_1,Var_2,Var_3,Var_4,Var_5,Var_6]

Classes 2, stored in FlowcytometricData$Cls

Examples

data(FlowcytometricData)
str(FlowcytometricData)

Example data of an artificial Gaussian mixture.

Description

Dataset of 30000 instances with 10 variables that are Gaussian mixtures and belong to classes Cls = 1, 2, or 3, with different means and standard deviations and class weights of 0.5, 0.4, and 0.1, respectively.

Usage

data("GMMartificialData")

Details

Size 30000 x 10, stored in GMMartificialData$[X1,X2,X3,X4,X5,X6,X7,X8,X9,X10]

Classes 3, stored in GMMartificialData$Cls

Examples

data(GMMartificialData)
str(GMMartificialData)
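
To make the structure described above concrete, the following base-R sketch generates a small data set of the same shape: 10 Gaussian variables plus a class vector Cls drawn with weights 0.5, 0.4, and 0.1. The class means, the standard deviations, the reduced size of 3000 instances, and the name ToyData are assumptions chosen for illustration; they are not the parameters used to create GMMartificialData.

```r
set.seed(42)                  # fixed seed so the sketch is reproducible
n <- 3000                     # assumption: smaller than the 30000 instances of GMMartificialData
weights <- c(0.5, 0.4, 0.1)   # class weights taken from the description
means <- c(0, 3, 6)           # hypothetical class means
sds <- c(1, 1.5, 0.5)         # hypothetical class standard deviations

# Draw the class labels according to the class weights
Cls <- sample(1:3, size = n, replace = TRUE, prob = weights)

# Draw 10 variables; each value is Gaussian with the mean and sd of its class
X <- sapply(1:10, function(j) rnorm(n, mean = means[Cls], sd = sds[Cls]))
colnames(X) <- paste0("X", 1:10)
ToyData <- data.frame(X, Cls = Cls)

str(ToyData)
round(prop.table(table(ToyData$Cls)), 2)   # observed class proportions
```

The observed class proportions should lie close to the requested weights, which is the property a class-proportional down-sampling is meant to retain.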

Optimal Distribution Preserving Down-Sampling of Bio-Medical Data

Description

The package provides the necessary functions for optimal distribution-preserving down-sampling of large (bio-medical) data sets.

Usage

opdisDownsampling(Data, Cls, Size, Seed, nTrials = 1000,
TestStat = "ad", MaxCores = getOption("mc.cores", 2L), PCAimportance = FALSE)

Arguments

Data

the data (numeric only) as a vector, matrix, or data frame.

Cls

the class information, if any, as a vector with one entry per instance in the data.

Size

the total number of instances across all classes to be drawn.

Seed

a predefined seed for the random sampling; changing it changes which instances are drawn.

nTrials

the number of random samples to draw, from which the final sample is chosen.

TestStat

the statistical criterion used to judge the similarity between a sample and the original data.

MaxCores

the maximum number of CPU cores to use for parallel computing.

PCAimportance

whether to apply PCA-based feature selection, so that only variables important in the PCA projection are considered.

Value

Returns a list containing the drawn sample and the omitted data.

ReducedData

the selected sample data and class information.

RemovedData

the data and class information of the instances that were not selected.

ReducedInstances

the instance numbers of the selected sample data.

Author(s)

Jorn Lotsch

References

Lotsch, J., Malkusch, S., Ultsch, A. (2021): Optimal distribution-preserving downsampling of large biomedical data sets (opdisDownsampling). PLoS One. 2021 Aug 5;16(8):e0255838. doi: 10.1371/journal.pone.0255838.

Examples

## example 1
data(iris)
Iris50percent <- opdisDownsampling(Data = iris[,1:4], Cls = as.integer(iris$Species),
  Size = 50, MaxCores = 1)
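
The arguments Size, nTrials, and TestStat describe a best-of-several-draws procedure: several class-proportional random samples are drawn and the one most similar to the original data is kept. The base-R sketch below illustrates that idea only; it is not the package's implementation. The helper sketchDownsample and its return fields are hypothetical, and the two-sample Kolmogorov-Smirnov statistic from stats::ks.test is swapped in as the similarity criterion as an assumption.

```r
# Hypothetical helper for illustration; not part of opdisDownsampling
sketchDownsample <- function(Data, Cls, Size, nTrials = 100, Seed = 42) {
  Data <- as.data.frame(Data)
  set.seed(Seed)
  # Instances to draw per class, proportional to the class sizes
  # (rounding can shift the total by a few instances relative to Size)
  counts <- table(Cls)
  perClass <- setNames(as.integer(round(Size * counts / length(Cls))),
                       names(counts))
  best <- NULL
  bestStat <- Inf
  for (i in seq_len(nTrials)) {
    # One class-proportional random draw of row indices
    idx <- unlist(lapply(names(perClass), function(k) {
      rows <- which(Cls == k)
      rows[sample.int(length(rows), perClass[[k]])]
    }))
    # Largest KS statistic between the sample and the full data over all variables
    stat <- max(sapply(Data, function(v) {
      suppressWarnings(ks.test(v[idx], v)$statistic)
    }))
    # Keep the draw whose worst variable is closest to the original distribution
    if (stat < bestStat) {
      bestStat <- stat
      best <- idx
    }
  }
  list(Instances = sort(best), Statistic = bestStat)
}

res <- sketchDownsample(iris[, 1:4], as.integer(iris$Species), Size = 60)
table(iris$Species[res$Instances])   # 20 instances per species
```

Taking the maximum over variables means a draw is judged by its worst-matching variable; averaging the per-variable statistics would be an equally plausible choice for such a sketch.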