| Title: | Pairwise Rescaling of Numeric Matrices |
|---|---|
| Description: | Normalization of numerical matrices by minimizing the mean/median/mode difference between all column pairs. |
| Authors: | Frank Koopmans [aut, cre] (ORCID: <https://orcid.org/0000-0002-4973-5732>) |
| Maintainer: | Frank Koopmans <[email protected]> |
| License: | AGPL (>= 3) |
| Version: | 1.0 |
| Built: | 2026-05-19 11:29:50 UTC |
| Source: | https://github.com/cran/pairscale |
Pairwise differences between all columns in a matrix
pairdiff_madmean(x, cols = NULL, min_value_count = 3L, threshold_std = 3, na_mode = "check")
x |
a numeric matrix that needs to be normalized. This variable is changed by reference, i.e. after this function returns, the original variable has been updated |
cols |
optionally, provide an integer vector with column indices (in |
min_value_count |
the minimum number of values overlapping between a pair of columns (if there are fewer overlapping values, respective rescaling is not computed) |
threshold_std |
ratio of MAD a value has to be away from the median to be considered an outlier (and thus removed/ignored). Note that the MAD thresholds are inclusive, i.e. values at +/- threshold_std*MAD from median are included |
na_mode |
string value that indicates how NA values should be handled. "check" (default) = test whether NA values are present and, if so, remove them. "present" = you already know NA values are present, so the check for NA values is skipped for efficiency. "unchecked" = you guarantee that there are no NA values in the input, so the fastest code paths, which skip all downstream NA checks, are used; ONLY select this if you are sure there are no NA/NaN/Inf/-Inf values! |
an N x N numeric matrix (where N is the number of columns in input x) containing the MAD-trimmed mean difference between each pair of columns
Pairwise differences between all columns in a matrix
pairdiff_mean(x, cols = NULL, min_value_count = 3L, na_mode = "check")
x |
a numeric matrix that needs to be normalized. This variable is changed by reference, i.e. after this function returns, the original variable has been updated |
cols |
optionally, provide an integer vector with column indices (in |
min_value_count |
the minimum number of values overlapping between a pair of columns (if there are fewer overlapping values, respective rescaling is not computed) |
na_mode |
string value that indicates how NA values should be handled. "check" (default) = test whether NA values are present and, if so, remove them. "present" = you already know NA values are present, so the check for NA values is skipped for efficiency. "unchecked" = you guarantee that there are no NA values in the input, so the fastest code paths, which skip all downstream NA checks, are used; ONLY select this if you are sure there are no NA/NaN/Inf/-Inf values! |
an N x N numeric matrix (where N is the number of columns in input x) containing the mean difference between each pair of columns
Pairwise differences between all columns in a matrix
pairdiff_median(x, cols = NULL, min_value_count = 3L, na_mode = "check")
x |
a numeric matrix that needs to be normalized. This variable is changed by reference, i.e. after this function returns, the original variable has been updated |
cols |
optionally, provide an integer vector with column indices (in |
min_value_count |
the minimum number of values overlapping between a pair of columns (if there are fewer overlapping values, respective rescaling is not computed) |
na_mode |
string value that indicates how NA values should be handled. "check" (default) = test whether NA values are present and, if so, remove them. "present" = you already know NA values are present, so the check for NA values is skipped for efficiency. "unchecked" = you guarantee that there are no NA values in the input, so the fastest code paths, which skip all downstream NA checks, are used; ONLY select this if you are sure there are no NA/NaN/Inf/-Inf values! |
an N x N numeric matrix (where N is the number of columns in input x) containing the median difference between each pair of columns
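To make the returned matrix concrete, the following base-R sketch computes pairwise median differences for the toy matrix used in this manual's example. It is not the package's C++ implementation: the helper name `pairdiff_median_sketch` is hypothetical, and the sign convention (entry [i, j] is the median of column i minus column j) is an assumption.

```r
# Toy matrix taken from the example in this manual; columns are samples.
x <- cbind(c(1, 2, 3, 4), c(2, 3, 4, 9), c(1, 2, 4, 5), c(1, 0, 1, 0))

# Hypothetical helper: pairwise median differences in plain R.
# ASSUMPTION: M[i, j] = median(x[, i] - x[, j]); the package may use the
# opposite sign.
pairdiff_median_sketch <- function(x, min_value_count = 3L) {
  n <- ncol(x)
  M <- matrix(NA_real_, n, n)
  for (i in seq_len(n)) {
    for (j in seq_len(n)) {
      d <- x[, i] - x[, j]
      d <- d[is.finite(d)]  # keep only rows observed in both columns
      # pairs with too few overlapping values stay NA
      if (length(d) >= min_value_count) M[i, j] <- median(d)
    }
  }
  M
}

M <- pairdiff_median_sketch(x)
print(M)
```

Under this convention the result is skew-symmetric (M equals -t(M)) with a zero diagonal, which is the shape of input that solve_graph_laplacian() is documented to expect.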
Pairwise differences between all columns in a matrix
pairdiff_mode(x, cols = NULL, min_value_count = 3L, n_bins = 512L, adjust = 1, kernel_width_in_sd = 3, bandwidth_method = "nrd", mode_frac_maxdens = 1, na_mode = "check")
x |
a numeric matrix that needs to be normalized. This variable is changed by reference, i.e. after this function returns, the original variable has been updated |
cols |
optionally, provide an integer vector with column indices (in |
min_value_count |
the minimum number of values overlapping between a pair of columns (if there are fewer overlapping values, respective rescaling is not computed) |
n_bins |
grid size for density computation (the resolution / number of data points to use for binning column diffs) |
adjust |
bandwidth adjust factor for density computation |
kernel_width_in_sd |
maximum distance, in standard deviations, at which data points are still included for the Gaussian kernel. Typically 3 or 4 |
bandwidth_method |
method by which this function computes the bandwidth and optionally trims the data prior to binning. "nrd" is the robust, safe default. "nrd_fast" is faster and yields similar results for most distributions. Use "nrd_fastest" only when all pairwise distances are known to be near Gaussian (i.e. no strong outliers and sd() is a reliable metric). "nrd_subset" is an experimental option that may be removed; it is fast but heavily favors symmetric distributions and is thus biased! Valid options:
|
mode_frac_maxdens |
set to 1 to return the x-coordinate where the density is highest (the mode). A value < 1 makes this function return not the mode but the mean x value over the region where the density is at least this fraction of the maximum density. Typical value: 1. Optionally, set to 0.9 or 0.8 for possibly more robust center-finding, depending on your data distribution. Must not be smaller than 0.1, and values below 0.7 are not recommended |
na_mode |
string value that indicates how NA values should be handled. "check" (default) = test whether NA values are present and, if so, remove them. "present" = you already know NA values are present, so the check for NA values is skipped for efficiency. "unchecked" = you guarantee that there are no NA values in the input, so the fastest code paths, which skip all downstream NA checks, are used; ONLY select this if you are sure there are no NA/NaN/Inf/-Inf values! |
an N x N numeric matrix (where N is the number of columns in input x) containing the mode difference between each pair of columns
Pairwise differences between all columns in a matrix
pairdiff_trimmedmean(x, cols = NULL, min_value_count = 3L, trim = 0.2, na_mode = "check")
x |
a numeric matrix that needs to be normalized. This variable is changed by reference, i.e. after this function returns, the original variable has been updated |
cols |
optionally, provide an integer vector with column indices (in |
min_value_count |
the minimum number of values overlapping between a pair of columns (if there are fewer overlapping values, respective rescaling is not computed) |
trim |
amount of trim to apply to both the lower- and upper-parts of a vector before computing the mean. 0 indicates no trim, 0.5 indicates 100% trim (i.e. 50% of data on both sides) so that value is out of bounds. Typically set to 0.1-0.3 |
na_mode |
string value that indicates how NA values should be handled. "check" (default) = test whether NA values are present and, if so, remove them. "present" = you already know NA values are present, so the check for NA values is skipped for efficiency. "unchecked" = you guarantee that there are no NA values in the input, so the fastest code paths, which skip all downstream NA checks, are used; ONLY select this if you are sure there are no NA/NaN/Inf/-Inf values! |
an N x N numeric matrix (where N is the number of columns in input x) containing the trimmed mean difference between each pair of columns
Pairwise normalization of columns in a matrix, using the MAD-trimmed mean to define pairwise distances between columns
pairscale_madmean(x, clusters = NULL, min_value_count = 3L, threshold_std = 3, niter_irls = 50L, na_mode = "check")
x |
a numeric matrix that needs to be normalized. This variable is changed by reference, i.e. after this function returns, the original variable has been updated |
clusters |
an integer vector, of same length as |
min_value_count |
the minimum number of values overlapping between a pair of columns (if there are fewer overlapping values, respective rescaling is not computed) |
threshold_std |
ratio of MAD a value has to be away from the median to be considered an outlier (and thus removed/ignored). Note that the MAD thresholds are inclusive, i.e. values at +/- threshold_std*MAD from median are included |
niter_irls |
number of iterations to increase robustness (set to zero to disable) |
na_mode |
string value that indicates how NA values should be handled. "check" (default) = test whether NA values are present and, if so, remove them. "present" = you already know NA values are present, so the check for NA values is skipped for efficiency. "unchecked" = you guarantee that there are no NA values in the input, so the fastest code paths, which skip all downstream NA checks, are used; ONLY select this if you are sure there are no NA/NaN/Inf/-Inf values! |
a numeric vector that represents the normalization factors that were applied to each column in x. Note that x is updated by reference.
Pairwise normalization of columns in a matrix, using the mean to define pairwise distances between columns
pairscale_mean(x, clusters = NULL, min_value_count = 3L, niter_irls = 50L, na_mode = "check")
x |
a numeric matrix that needs to be normalized. This variable is changed by reference, i.e. after this function returns, the original variable has been updated |
clusters |
an integer vector, of same length as |
min_value_count |
the minimum number of values overlapping between a pair of columns (if there are fewer overlapping values, respective rescaling is not computed) |
niter_irls |
number of iterations to increase robustness (set to zero to disable) |
na_mode |
string value that indicates how NA values should be handled. "check" (default) = test whether NA values are present and, if so, remove them. "present" = you already know NA values are present, so the check for NA values is skipped for efficiency. "unchecked" = you guarantee that there are no NA values in the input, so the fastest code paths, which skip all downstream NA checks, are used; ONLY select this if you are sure there are no NA/NaN/Inf/-Inf values! |
a numeric vector that represents the normalization factors that were applied to each column in x. Note that x is updated by reference.
Pairwise normalization of columns in a matrix, using the median to define pairwise distances between columns
pairscale_median(x, clusters = NULL, min_value_count = 3L, niter_irls = 50L, na_mode = "check")
x |
a numeric matrix that needs to be normalized. This variable is changed by reference, i.e. after this function returns, the original variable has been updated |
clusters |
an integer vector, of same length as |
min_value_count |
the minimum number of values overlapping between a pair of columns (if there are fewer overlapping values, respective rescaling is not computed) |
niter_irls |
number of iterations to increase robustness (set to zero to disable) |
na_mode |
string value that indicates how NA values should be handled. "check" (default) = test whether NA values are present and, if so, remove them. "present" = you already know NA values are present, so the check for NA values is skipped for efficiency. "unchecked" = you guarantee that there are no NA values in the input, so the fastest code paths, which skip all downstream NA checks, are used; ONLY select this if you are sure there are no NA/NaN/Inf/-Inf values! |
a numeric vector that represents the normalization factors that were applied to each column in x. Note that x is updated by reference.
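The overall idea of pairwise rescaling can be sketched in base R as three steps: compute pairwise median differences, turn them into one offset per column, and subtract the offsets. This is an illustration under stated assumptions, not the package's code: the sign convention (column i minus column j) is assumed, the offsets are the plain least-squares solution for the case where every column pair is observed, and the iterative reweighting controlled by niter_irls is left out.

```r
# Three columns that share one profile but differ by a constant offset.
base <- c(1, 2, 3, 4, 5)
x <- cbind(base, base + 2, base - 1)

# Step 1: pairwise median differences (ASSUMED sign: column i minus column j).
n <- ncol(x)
M <- matrix(0, n, n)
for (i in seq_len(n)) {
  for (j in seq_len(n)) {
    M[i, j] <- median(x[, i] - x[, j], na.rm = TRUE)
  }
}

# Step 2: with all pairs observed and equal weights, the least-squares
# offsets that sum to zero are the row means of M.
s <- rowMeans(M)

# Step 3: subtract each column's offset. Unlike the package, which updates x
# by reference, this sketch builds a new matrix and leaves x untouched.
x_norm <- sweep(x, 2, s)
print(x_norm)
```

After rescaling, all three columns coincide, and the offsets in `s` sum to zero. Because the package functions modify their input in place, keeping a copy of the original matrix beforehand may be worthwhile when the unnormalized values are still needed.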
Pairwise normalization of columns in a matrix, using the mode to define pairwise distances between columns
pairscale_mode(x, clusters = NULL, min_value_count = 3L, n_bins = 512L, adjust = 1, kernel_width_in_sd = 3, bandwidth_method = "nrd", mode_frac_maxdens = 1, niter_irls = 50L, na_mode = "check")
x |
a numeric matrix that needs to be normalized. This variable is changed by reference, i.e. after this function returns, the original variable has been updated |
clusters |
an integer vector, of same length as |
min_value_count |
the minimum number of values overlapping between a pair of columns (if there are fewer overlapping values, respective rescaling is not computed) |
n_bins |
grid size for density computation (the resolution / number of data points to use for binning column diffs) |
adjust |
bandwidth adjust factor for density computation |
kernel_width_in_sd |
maximum distance, in standard deviations, at which data points are still included for the Gaussian kernel. Typically 3 or 4 |
bandwidth_method |
method by which this function computes the bandwidth and optionally trims the data prior to binning. "nrd" is the robust, safe default. "nrd_fast" is faster and yields similar results for most distributions. Use "nrd_fastest" only when all pairwise distances are known to be near Gaussian (i.e. no strong outliers and sd() is a reliable metric). "nrd_subset" is an experimental option that may be removed; it is fast but heavily favors symmetric distributions and is thus biased! Valid options:
|
mode_frac_maxdens |
set to 1 to return the x-coordinate where the density is highest (the mode). A value < 1 makes this function return not the mode but the mean x value over the region where the density is at least this fraction of the maximum density. Typical value: 1. Optionally, set to 0.9 or 0.8 for possibly more robust center-finding, depending on your data distribution. Must not be smaller than 0.1, and values below 0.7 are not recommended |
niter_irls |
number of iterations to increase robustness (set to zero to disable) |
na_mode |
string value that indicates how NA values should be handled. "check" (default) = test whether NA values are present and, if so, remove them. "present" = you already know NA values are present, so the check for NA values is skipped for efficiency. "unchecked" = you guarantee that there are no NA values in the input, so the fastest code paths, which skip all downstream NA checks, are used; ONLY select this if you are sure there are no NA/NaN/Inf/-Inf values! |
a numeric vector that represents the normalization factors that were applied to each column in x. Note that x is updated by reference.
Pairwise normalization of columns in a matrix, using the trimmed-mean to define pairwise distances between columns
pairscale_trimmedmean(x, clusters = NULL, min_value_count = 3L, trim = 0.2, niter_irls = 50L, na_mode = "check")
x |
a numeric matrix that needs to be normalized. This variable is changed by reference, i.e. after this function returns, the original variable has been updated |
clusters |
an integer vector, of same length as |
min_value_count |
the minimum number of values overlapping between a pair of columns (if there are fewer overlapping values, respective rescaling is not computed) |
trim |
amount of trim to apply to both the lower- and upper-parts of a vector before computing the mean. 0 indicates no trim, 0.5 indicates 100% trim (i.e. 50% of data on both sides) so that value is out of bounds. Typically set to 0.1-0.3 |
niter_irls |
number of iterations to increase robustness (set to zero to disable) |
na_mode |
string value that indicates how NA values should be handled. "check" (default) = test whether NA values are present and, if so, remove them. "present" = you already know NA values are present, so the check for NA values is skipped for efficiency. "unchecked" = you guarantee that there are no NA values in the input, so the fastest code paths, which skip all downstream NA checks, are used; ONLY select this if you are sure there are no NA/NaN/Inf/-Inf values! |
a numeric vector that represents the normalization factors that were applied to each column in x. Note that x is updated by reference.
Find normalization factors for a given distance matrix computed with e.g. pairdiff_median(). For increased robustness, this function offers iteratively reweighted refinement of the initial estimate.
solve_graph_laplacian(M, niter_irls = 1L)
M |
skew-symmetric input matrix, generated with e.g. |
niter_irls |
refine the initial estimate using N additional iteratively reweighted least squares (IRLS) loops for a robust graph Laplacian solution |
a numeric vector of length ncol(M) that contains scaling factors for M
    # toy example
    x = cbind(c(1,2,3,4), c(2,3,4,9), c(1,2,4,5), c(1,0,1,0))
    # compute pairwise median difference between all columns
    M = pairscale::pairdiff_median(x)
    # solve matrix M to find scaling factors, without and with reweighting
    s1 = pairscale::solve_graph_laplacian(M, niter_irls = 0)
    s2 = pairscale::solve_graph_laplacian(M, niter_irls = 10)
    # rescaled matrices; only the robust variant correctly aligns columns 1 and 2
    t(t(x) - s1)
    t(t(x) - s2)
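For readers unfamiliar with the term, the unweighted graph Laplacian solve can be sketched in base R as follows. This is a minimal illustration, not the package's implementation: the helper name `solve_laplacian_sketch` is hypothetical, each observed column pair gets unit weight, the model M[i, j] ≈ s[i] - s[j] is an assumed sign convention, and no IRLS reweighting is performed (roughly the situation one might expect with niter_irls = 0).

```r
# Hypothetical helper: least-squares offsets s such that M[i, j] ~ s[i] - s[j].
# Assumes M is skew-symmetric with a symmetric NA pattern, and that the graph
# of observed pairs is connected.
solve_laplacian_sketch <- function(M) {
  n <- ncol(M)
  A <- (!is.na(M)) * 1      # adjacency: 1 for every observed pair
  diag(A) <- 0
  M[is.na(M)] <- 0          # unobserved pairs contribute nothing
  L <- diag(rowSums(A)) - A # graph Laplacian (degree minus adjacency)
  b <- rowSums(M)
  # L is singular because adding a constant to s changes nothing; adding
  # matrix(1, n, n) / n pins the solution to the one with sum(s) == 0.
  solve(L + matrix(1, n, n) / n, b)
}

# Build a consistent difference matrix from known offsets, drop one pair,
# and check that the offsets are recovered.
s_true <- c(-1, 2, 0.5, -1.5)
M <- outer(s_true, s_true, "-")
M[1, 4] <- NA
M[4, 1] <- NA
s_hat <- solve_laplacian_sketch(M)
print(s_hat)
```

When every pair is observed, this solution reduces to rowMeans(M). A robust variant would repeat the solve with weights that shrink for pairs with large residuals, which is the role the documentation assigns to niter_irls.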
Compute measure of central tendency using efficient C++ code
vector_madmean(x, min_value_count = 1L, threshold_std = 3, na_mode = "check")
x |
numeric input vector, may contain non-finite values (removed if |
min_value_count |
the minimum number of values overlapping between a pair of columns (if there are fewer overlapping values, respective rescaling is not computed) |
threshold_std |
ratio of MAD a value has to be away from the median to be considered an outlier (and thus removed/ignored). Note that the MAD thresholds are inclusive, i.e. values at +/- threshold_std*MAD from median are included |
na_mode |
string value that indicates how NA values should be handled. "check" (default) = test whether NA values are present and, if so, remove them. "present" = you already know NA values are present, so the check for NA values is skipped for efficiency. "unchecked" = you guarantee that there are no NA values in the input, so the fastest code paths, which skip all downstream NA checks, are used; ONLY select this if you are sure there are no NA/NaN/Inf/-Inf values! |
a single numeric value representing the MAD-trimmed mean
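A MAD-trimmed mean can be sketched in base R as shown below. The helper name `madmean_sketch` is hypothetical, and whether the package scales the MAD by the usual consistency constant 1.4826 is not stated in this manual, so the `constant` argument is an assumption made explicit here. The inclusive threshold follows the description of threshold_std.

```r
# Hypothetical helper: mean of the values within threshold_std * MAD of the median.
# ASSUMPTION: the MAD is scaled by 1.4826 (R's default in stats::mad); pass
# constant = 1 to use the raw MAD instead.
madmean_sketch <- function(x, threshold_std = 3, constant = 1.4826) {
  x <- x[is.finite(x)]                                # drop NA/NaN/Inf/-Inf
  center <- median(x)
  spread <- mad(x, center = center, constant = constant)
  keep <- abs(x - center) <= threshold_std * spread   # bounds are inclusive
  mean(x[keep])
}

v <- c(1, 2, 3, 4, 5, 100, NA)
trimmed <- madmean_sketch(v)     # the outlier 100 is excluded
plain <- mean(v, na.rm = TRUE)   # the outlier pulls the plain mean upward
print(c(trimmed = trimmed, plain = plain))
```

In this example the median is 3.5 and the scaled MAD is about 2.22, so only values between roughly -3.2 and 10.2 are averaged.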
Compute measure of central tendency using efficient C++ code
vector_mean(x, min_value_count = 1L, na_mode = "check")
x |
numeric input vector, may contain non-finite values (removed if |
min_value_count |
the minimum number of values overlapping between a pair of columns (if there are fewer overlapping values, respective rescaling is not computed) |
na_mode |
string value that indicates how NA values should be handled. "check" (default) = test whether NA values are present and, if so, remove them. "present" = you already know NA values are present, so the check for NA values is skipped for efficiency. "unchecked" = you guarantee that there are no NA values in the input, so the fastest code paths, which skip all downstream NA checks, are used; ONLY select this if you are sure there are no NA/NaN/Inf/-Inf values! |
a single numeric value representing the mean
Compute measure of central tendency using efficient C++ code
vector_median(x, min_value_count = 1L, na_mode = "check")
x |
numeric input vector, may contain non-finite values (removed if |
min_value_count |
the minimum number of values overlapping between a pair of columns (if there are fewer overlapping values, respective rescaling is not computed) |
na_mode |
string value that indicates how NA values should be handled. "check" (default) = test whether NA values are present and, if so, remove them. "present" = you already know NA values are present, so the check for NA values is skipped for efficiency. "unchecked" = you guarantee that there are no NA values in the input, so the fastest code paths, which skip all downstream NA checks, are used; ONLY select this if you are sure there are no NA/NaN/Inf/-Inf values! |
a single numeric value representing the median
Compute measure of central tendency using efficient C++ code
vector_mode(x, min_value_count = 3L, n_bins = 512L, adjust = 1, kernel_width_in_sd = 3, bandwidth_method = "nrd", mode_frac_maxdens = 1, na_mode = "check")
x |
numeric input vector, may contain non-finite values (removed if |
min_value_count |
the minimum number of values overlapping between a pair of columns (if there are fewer overlapping values, respective rescaling is not computed) |
n_bins |
grid size for density computation (the resolution / number of data points to use for binning column diffs) |
adjust |
bandwidth adjust factor for density computation |
kernel_width_in_sd |
maximum distance, in standard deviations, at which data points are still included for the Gaussian kernel. Typically 3 or 4 |
bandwidth_method |
method by which this function computes the bandwidth and optionally trims the data prior to binning. "nrd" is the robust, safe default. "nrd_fast" is faster and yields similar results for most distributions. Use "nrd_fastest" only when all pairwise distances are known to be near Gaussian (i.e. no strong outliers and sd() is a reliable metric). "nrd_subset" is an experimental option that may be removed; it is fast but heavily favors symmetric distributions and is thus biased! Valid options:
|
mode_frac_maxdens |
set to 1 to return the x-coordinate where the density is highest (the mode). A value < 1 makes this function return not the mode but the mean x value over the region where the density is at least this fraction of the maximum density. Typical value: 1. Optionally, set to 0.9 or 0.8 for possibly more robust center-finding, depending on your data distribution. Must not be smaller than 0.1, and values below 0.7 are not recommended |
na_mode |
string value that indicates how NA values should be handled. "check" (default) = test whether NA values are present and, if so, remove them. "present" = you already know NA values are present, so the check for NA values is skipped for efficiency. "unchecked" = you guarantee that there are no NA values in the input, so the fastest code paths, which skip all downstream NA checks, are used; ONLY select this if you are sure there are no NA/NaN/Inf/-Inf values! |
a single numeric value representing the mode
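A density-based mode estimate of this kind can be approximated in base R with stats::density(). The sketch below is only an analogy: it assumes that the package's "nrd" option resembles R's bw.nrd rule and that mode_frac_maxdens < 1 means averaging x over the region where the density reaches at least that fraction of its maximum. Neither assumption is confirmed by this manual.

```r
# Deterministic, symmetric sample centred on 5 (quantiles of a normal distribution).
v <- qnorm(ppoints(1001), mean = 5, sd = 1)

# Kernel density estimate on a 512-point grid with the "nrd" bandwidth rule.
d <- density(v, bw = "nrd", adjust = 1, n = 512)

# Analogue of mode_frac_maxdens = 1: x-coordinate of the density maximum.
mode_est <- d$x[which.max(d$y)]

# ASSUMED reading of mode_frac_maxdens = 0.9: mean x over the region where
# the density is at least 90% of its maximum.
centre_est <- mean(d$x[d$y >= 0.9 * max(d$y)])

print(c(mode = mode_est, centre = centre_est))
```

Both estimates land close to 5 for this symmetric sample. Averaging over a high-density region is less sensitive to small bumps at the top of the density curve, which may be why values such as 0.9 or 0.8 are suggested for mode_frac_maxdens.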
Compute measure of central tendency using efficient C++ code
vector_trimmedmean(x, min_value_count = 1L, trim = 0.2, na_mode = "check")
x |
numeric input vector, may contain non-finite values (removed if |
min_value_count |
the minimum number of values overlapping between a pair of columns (if there are fewer overlapping values, respective rescaling is not computed) |
trim |
amount of trim to apply to both the lower- and upper-parts of a vector before computing the mean. 0 indicates no trim, 0.5 indicates 100% trim (i.e. 50% of data on both sides) so that value is out of bounds. Typically set to 0.1-0.3 |
na_mode |
string value that indicates how NA values should be handled. "check" (default) = test whether NA values are present and, if so, remove them. "present" = you already know NA values are present, so the check for NA values is skipped for efficiency. "unchecked" = you guarantee that there are no NA values in the input, so the fastest code paths, which skip all downstream NA checks, are used; ONLY select this if you are sure there are no NA/NaN/Inf/-Inf values! |
a single numeric value representing the trimmed mean
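The effect of the trim parameter can be illustrated with base R's mean(), which offers a comparable argument. Whether the package rounds the number of trimmed values in the same way as base R is an assumption; the sketch only shows the general idea of dropping a fraction of values from each tail before averaging.

```r
v <- c(1, 2, 3, 4, 100)
trim <- 0.2

# By hand: drop floor(n * trim) values from each end of the sorted vector.
n <- length(v)
k <- floor(n * trim)
manual <- mean(sort(v)[(k + 1):(n - k)])

# Base R's built-in trimmed mean applies the same rule.
builtin <- mean(v, trim = trim)

print(c(manual = manual, builtin = builtin, untrimmed = mean(v)))
```

With trim = 0.2 and five values, one value is removed from each tail, so the outlier 100 no longer affects the result.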