Title: | Statistical Framework to Define Subgroups in Complex Datasets |
---|---|
Description: | High-dimensional datasets that do not exhibit a clear intrinsic clustered structure pose a challenge to conventional clustering algorithms. For this reason, we developed an unsupervised framework that helps scientists to better subgroup their datasets based on visual cues, please see Gao S, Mutter S, Casey A, Makinen V-P (2019) Numero: a statistical framework to define multivariable subgroups in complex population-based datasets, Int J Epidemiology, 48:369-37, <doi:10.1093/ije/dyy113>. The framework includes the necessary functions to construct a self-organizing map of the data, to evaluate the statistical significance of the observed data patterns, and to visualize the results. |
Authors: | Song Gao [aut], Stefan Mutter [aut], Aaron E. Casey [aut], Ville-Petteri Makinen [aut, cre] |
Maintainer: | Ville-Petteri Makinen <[email protected]> |
License: | GPL (>= 2) |
Version: | 1.9.8 |
Built: | 2024-12-17 06:50:52 UTC |
Source: | CRAN |
Estimate district averages based on assigned map locations for each data point.
nroAggregate(topology, districts, data = NULL)
nroAggregate(topology, districts, data = NULL)
topology |
A data frame with K rows and six columns, see details. |
districts |
An integer vector of M best-matching districts. |
data |
A vector of M elements or an M x N matrix of data values. |
Topology can be either the output from nroKohonen()
or a
data frame in the same format as the element topology
within the
the output from nroKohonen()
.
The input argument districts
is expected to be the output from
nroMatch()
.
If the input argument data
is empty, the histogram of the data points
on the map is returned (a vector of K elements).
If data are available, a matrix of K rows and N columns that contains the average district values after smoothing is returned. The output includes the attribute 'histogram' that contains data point counts over each data column. Column names and the attribute 'binary' are copied from the input.
# Import data. fname <- system.file("extdata", "finndiane.txt", package = "Numero") dataset <- read.delim(file = fname) # Prepare training data. trvars <- c("CHOL", "HDL2C", "TG", "CREAT", "uALB") trdata <- scale.default(dataset[,trvars]) # K-means clustering. km <- nroKmeans(data = trdata) # Self-organizing map. sm <- nroKohonen(seeds = km) sm <- nroTrain(map = sm, data = trdata) # Assign data points into districts. matches <- nroMatch(centroids = sm, data = trdata) # District averages for one variable. chol <- nroAggregate(topology = sm, districts = matches, data = dataset$CHOL) print(chol) # District averages for all variables. planes <- nroAggregate(topology = sm, districts = matches, data = dataset) print(head(planes))
# Import data. fname <- system.file("extdata", "finndiane.txt", package = "Numero") dataset <- read.delim(file = fname) # Prepare training data. trvars <- c("CHOL", "HDL2C", "TG", "CREAT", "uALB") trdata <- scale.default(dataset[,trvars]) # K-means clustering. km <- nroKmeans(data = trdata) # Self-organizing map. sm <- nroKohonen(seeds = km) sm <- nroTrain(map = sm, data = trdata) # Assign data points into districts. matches <- nroMatch(centroids = sm, data = trdata) # District averages for one variable. chol <- nroAggregate(topology = sm, districts = matches, data = dataset$CHOL) print(chol) # District averages for all variables. planes <- nroAggregate(topology = sm, districts = matches, data = dataset) print(head(planes))
Assign colors to map districts based on the respective district values.
nroColorize(values, ranges = NULL, amplitudes = 1, palette = "rhodo")
nroColorize(values, ranges = NULL, amplitudes = 1, palette = "rhodo")
values |
A vector of K values or a K x N data frame, where K is the number of map districts and N is the number of variables. |
ranges |
A data frame with N rows and 2 columns, see details. |
amplitudes |
Single value or a vector of N elements or a data frame of N rows that contains the column AMPLITUDE. |
palette |
One of pre-defined colormap names (see details). |
The argument ranges
sets the minimum and maximum district values
irrespective of the contents of values
. It can be used as a fixed
reference across different colorings to ensure that the same value produces
the same color across function calls.
The argument amplitudes
controls the proportion of the color range
that is available for the district value range. For proportions below 1,
the minimum district value is assigned to a color that is between the
first and middle element in the color palette, and the maximum is assigned
to a color that is between the middle and the last element. If
amplitude
is greater than 1, extreme low and high values are
clipped to the first and last color in the palette, respectively.
Available color palettes include "grey", "fire", "jungle", "miami", "rhodo" or "tan". Any other word will revert to a rainbow colormap.
A matrix of hexadecimal color codes as strings. The output also includes the attribute 'contrast' that indicates which colors have a good contrast with black as opposed to white, the attribute 'ranges' that contains a copy of the dynamic ranges across districts, and the attribute 'palette' that indicates the color scheme.
# Import data. fname <- system.file("extdata", "finndiane.txt", package = "Numero") dataset <- read.delim(file = fname) # Prepare training data. trvars <- c("CHOL", "HDL2C", "TG", "CREAT", "uALB") trdata <- scale.default(dataset[,trvars]) # K-means clustering. km <- nroKmeans(data = trdata) # Self-organizing map. sm <- nroKohonen(seeds = km) sm <- nroTrain(map = sm, data = trdata) # Assign data points into districts. matches <- nroMatch(centroids = sm, data = trdata) # District averages for all variables. planes <- nroAggregate(topology = sm, districts = matches, data = dataset) # District colors for cholesterol. chol <- nroColorize(values = planes[,"CHOL"]) print(head(chol)) # District colors for all variables. colrs <- nroColorize(values = planes) print(head(colrs))
# Import data. fname <- system.file("extdata", "finndiane.txt", package = "Numero") dataset <- read.delim(file = fname) # Prepare training data. trvars <- c("CHOL", "HDL2C", "TG", "CREAT", "uALB") trdata <- scale.default(dataset[,trvars]) # K-means clustering. km <- nroKmeans(data = trdata) # Self-organizing map. sm <- nroKohonen(seeds = km) sm <- nroTrain(map = sm, data = trdata) # Assign data points into districts. matches <- nroMatch(centroids = sm, data = trdata) # District averages for all variables. planes <- nroAggregate(topology = sm, districts = matches, data = dataset) # District colors for cholesterol. chol <- nroColorize(values = planes[,"CHOL"]) print(head(chol)) # District colors for all variables. colrs <- nroColorize(values = planes) print(head(colrs))
Removes differences in value distribution within subsets of data points.
nroDestratify(data, labels)
nroDestratify(data, labels)
data |
A matrix or a data frame with M rows. |
labels |
A vector of M subset labels. |
Only non-binary numerical columns are processed, the rest will be excluded from the results.
The de-stratification algorithm is based on ranked data: the distribution of each subset will be mapped to the pooled distribution over all subsets by matching subset-specific ranking with ranking of all values.
A matrix of de-stratified values. The output also includes the attribute 'incomplete' that lists those columns where (some of) the values were set to missing due to processing failures.
# Import data. fname <- system.file("extdata", "finndiane.txt", package = "Numero") dataset <- read.delim(file = fname) # Remove sex differences for creatinine. creat <- nroDestratify(dataset$CREAT, dataset$MALE) # Compare creatinine distributions. men <- which(dataset$MALE == 1) women <- which(dataset$MALE == 0) print(summary(dataset[men,"CREAT"])) print(summary(dataset[women,"CREAT"])) print(summary(creat[men])) print(summary(creat[women])) # Remove sex differences (produces warnings for binary traits). ds <- nroDestratify(dataset, dataset$MALE) # Compare HDL2C distributions. print(summary(dataset[men,"HDL2C"])) print(summary(dataset[women,"HDL2C"])) print(summary(ds[men,"HDL2C"])) print(summary(ds[women,"HDL2C"]))
# Import data. fname <- system.file("extdata", "finndiane.txt", package = "Numero") dataset <- read.delim(file = fname) # Remove sex differences for creatinine. creat <- nroDestratify(dataset$CREAT, dataset$MALE) # Compare creatinine distributions. men <- which(dataset$MALE == 1) women <- which(dataset$MALE == 0) print(summary(dataset[men,"CREAT"])) print(summary(dataset[women,"CREAT"])) print(summary(creat[men])) print(summary(creat[women])) # Remove sex differences (produces warnings for binary traits). ds <- nroDestratify(dataset, dataset$MALE) # Compare HDL2C distributions. print(summary(dataset[men,"HDL2C"])) print(summary(dataset[women,"HDL2C"])) print(summary(ds[men,"HDL2C"])) print(summary(ds[women,"HDL2C"]))
K-means clustering for multi-dimensional data.
nroKmeans(data, k = 3, subsample = NULL, balance = 0, message = NULL)
nroKmeans(data, k = 3, subsample = NULL, balance = 0, message = NULL)
data |
A data frame or a matrix. |
k |
Number of centroids. |
subsample |
Number of randomly selected rows used during a single training cycle. |
balance |
Penalty parameter for size difference between clusters. |
message |
If positive, progress information is printed at the specified interval in seconds. |
The K centroids are determined by Lloyd's algorithm with Euclidean distances or by using 1 - Pearson correlation as the distance measure.
If subsample
is less than the number of data rows, a random subset of
the specified size is used for each training cycle. By default,
subsample
is set automatically depending on the size of the dataset.
If balance = 0.0
, the algorithm is applied with no balancing,
if balance = 1.0
all the clusters will be forced to be of equal size.
Intermediate values are permitted. Note that if subsampling is applied,
balancing may become less accurate.
A list with named elements: centroids
is a matrix of the
main results, layout
contains the best-matching centroid labels
and model residuals for each usable data point and history
is the
chronological record of training errors. The subsampling parameter that was
used during training is stored in the element subsample
.
# Import data. fname <- system.file("extdata", "finndiane.txt", package = "Numero") dataset <- read.delim(file = fname) # Prepare training data. trvars <- c("CHOL", "HDL2C", "TG", "CREAT", "uALB") trdata <- scale.default(dataset[,trvars]) # Unbalanced K-means clustering. km0 <- nroKmeans(data = trdata, k = 5, balance = 0.0) print(table(km0$layout$BMC)) print(km0$centroids) # Balanced K-means clustering. km1 <- nroKmeans(data = trdata, k = 5, balance = 1.0) print(table(km1$layout$BMC)) print(km1$centroids)
# Import data. fname <- system.file("extdata", "finndiane.txt", package = "Numero") dataset <- read.delim(file = fname) # Prepare training data. trvars <- c("CHOL", "HDL2C", "TG", "CREAT", "uALB") trdata <- scale.default(dataset[,trvars]) # Unbalanced K-means clustering. km0 <- nroKmeans(data = trdata, k = 5, balance = 0.0) print(table(km0$layout$BMC)) print(km0$centroids) # Balanced K-means clustering. km1 <- nroKmeans(data = trdata, k = 5, balance = 1.0) print(table(km1$layout$BMC)) print(km1$centroids)
Interpolates the initial district profiles of a self-organizing map based on pre-determined seed profiles.
nroKohonen(seeds, radius = 3, smoothness = 1.0)
nroKohonen(seeds, radius = 3, smoothness = 1.0)
seeds |
A matrix or a data frame of K rows and N columns. |
radius |
Map radius. |
smoothness |
Rigidity of the map to adapt to regional differences. |
A list of named elements: centroids
contains the N-dimensional
district profiles, and topology
is an H x 6 matrix that contains
the 2D spatial layout for the map districts: the first two columns (X, Y)
indicate the positions of districts in Cartesian coordinates, the other
four columns (RADIUS1, RADIUS2, ANGLE1, ANGLE2) define the perimeter of
the district areas for visualisation on a circular map.
Additional parameters are stored as attributes in topology
.
The function is named after Teuvo Kohonen, the inventor of the self-organizing map.
Please see nroKmeans()
to create the seeds.
# Import data. fname <- system.file("extdata", "finndiane.txt", package = "Numero") dataset <- read.delim(file = fname) # Prepare training data. trvars <- c("CHOL", "HDL2C", "TG", "CREAT", "uALB") trdata <- scale.default(dataset[,trvars]) # K-means clustering. km <- nroKmeans(data = trdata) # Self-organizing map. sm <- nroKohonen(seeds = km) print(head(sm$centroids)) print(head(sm$topology))
# Import data. fname <- system.file("extdata", "finndiane.txt", package = "Numero") dataset <- read.delim(file = fname) # Prepare training data. trvars <- c("CHOL", "HDL2C", "TG", "CREAT", "uALB") trdata <- scale.default(dataset[,trvars]) # K-means clustering. km <- nroKmeans(data = trdata) # Self-organizing map. sm <- nroKohonen(seeds = km) print(head(sm$centroids)) print(head(sm$topology))
Optimize the look and selection of labels on map districts.
nroLabel(topology, values, gap = 2.3)
nroLabel(topology, values, gap = 2.3)
topology |
A data frame with K rows and six columns, see details. |
values |
A vector of K values or a K x N data frame, where K is the number of map districts and N is the number of variables. |
gap |
Minimum distance between map districts with non-empty labels. |
The function assigns visible labels for districts based on the absolute
deviations from the average district value. The most extreme districts are
picked first, and then the remaining districts are prioritized based on
their value and distance to the other districts already labeled. Columns
that are listed in the attribute "binary" in values
are given
percentage labels.
Topology can be either the output from nroKohonen()
or a
data frame in the same format as the element topology
within the
aforementioned output list.
A matrix with K rows and N columns that contains selected labels
for the map districts for each of the columns in values
. The output
has the attribute 'visible' that contains binary flags to guide visibility.
# Import data. fname <- system.file("extdata", "finndiane.txt", package = "Numero") dataset <- read.delim(file = fname) # Prepare training data. trvars <- c("CHOL", "HDL2C", "TG", "CREAT", "uALB") trdata <- scale.default(dataset[,trvars]) # K-means clustering. km <- nroKmeans(data = trdata) # Self-organizing map. sm <- nroKohonen(seeds = km) sm <- nroTrain(map = sm, data = trdata) # Assign data points into districts. matches <- nroMatch(centroids = sm, data = trdata) # District averages for all variables. planes <- nroAggregate(topology = sm, districts = matches, data = dataset) # District labels for cholesterol. chol <- nroLabel(topology = sm, values = planes[,"CHOL"]) print(head(attr(chol, "visible"))) print(head(chol)) # District labels for all variables. colrs <- nroLabel(topology = sm, values = planes) print(head(attr(colrs, "visible"))) print(head(colrs))
# Import data. fname <- system.file("extdata", "finndiane.txt", package = "Numero") dataset <- read.delim(file = fname) # Prepare training data. trvars <- c("CHOL", "HDL2C", "TG", "CREAT", "uALB") trdata <- scale.default(dataset[,trvars]) # K-means clustering. km <- nroKmeans(data = trdata) # Self-organizing map. sm <- nroKohonen(seeds = km) sm <- nroTrain(map = sm, data = trdata) # Assign data points into districts. matches <- nroMatch(centroids = sm, data = trdata) # District averages for all variables. planes <- nroAggregate(topology = sm, districts = matches, data = dataset) # District labels for cholesterol. chol <- nroLabel(topology = sm, values = planes[,"CHOL"]) print(head(attr(chol, "visible"))) print(head(chol)) # District labels for all variables. colrs <- nroLabel(topology = sm, values = planes) print(head(attr(colrs, "visible"))) print(head(colrs))
Compare multi-dimensional data points against the district profiles of a self-organizing map (SOM).
nroMatch(centroids, data)
nroMatch(centroids, data)
centroids |
Either a matrix, a data frame or a list that contains the element
|
data |
A data matrix with identical column names to the centroid matrix. |
The input argument centroids
can be a matrix or a data frame that
contains multivariable data profiles organized row-wise. It can also be
the output list object from nroKmeans()
or
nroTrain()
.
A vector of integers with elements corresponding to the rows in
data
. Each element contains the index of the best matching
row from centroids
.
The vector also has the attribute 'quality' that contains three columns: RESIDUAL is the distance between a point and a centroid in data space (shorter is better), RESIDUAL.z is a scale-independent version of RESIDUAL if the mean residual and standard deviation are available from training history, and COVERAGE shows the proportion of data elements that were available for matching.
The names of the columns that were used for matching are stored in the
attribute variables
.
# Import data. fname <- system.file("extdata", "finndiane.txt", package = "Numero") dataset <- read.delim(file = fname) # Prepare training data. trvars <- c("CHOL", "HDL2C", "TG", "CREAT", "uALB") trdata <- scale.default(dataset[,trvars]) # K-means clustering. km <- nroKmeans(data = trdata, k = 10) # Assign data points into districts. matches <- nroMatch(centroids = km, data = trdata) print(head(attr(matches,"quality"))) print(table(matches))
# Import data. fname <- system.file("extdata", "finndiane.txt", package = "Numero") dataset <- read.delim(file = fname) # Prepare training data. trvars <- c("CHOL", "HDL2C", "TG", "CREAT", "uALB") trdata <- scale.default(dataset[,trvars]) # K-means clustering. km <- nroKmeans(data = trdata, k = 10) # Assign data points into districts. matches <- nroMatch(centroids = km, data = trdata) print(head(attr(matches,"quality"))) print(table(matches))
Estimate the dynamic range and statistical significance for regional patterns on a self-organizing maps using permutations.
nroPermute(map, districts, data, n = 1000, message = NULL, zbase = NULL, seed = 0.0)
nroPermute(map, districts, data, n = 1000, message = NULL, zbase = NULL, seed = 0.0)
map |
A list object in the format from |
districts |
An integer vector of M best matching districts. |
data |
A numeric vector of M values or an M x N matrix (or data frame), where M is the number of data points and N is the number of variables. |
n |
Maximum number of permutations per variable. |
message |
If positive, progress information is printed at the specified interval in seconds. |
zbase |
Reference Z-score for determining color amplitudes. |
seed |
Seed value for random number generator. |
The input argument map
must contain the map topology and the
centroid profiles as returned by the functions nroKmeans()
,
nroKohonen()
, or nroTrain()
.
The input argument districts
must contain integers between 1 and K,
where K is the number map units. Any other values will be ignored.
Training variables and data points are detected by the column names of
map$centroids
, the attribute "variables" in districts
and
the names of elements in districts
.
A data frame with eight columns: P.z is a parametric estimate for statistical
significance, P.freq is the frequency-based estimate for statistical
signicance, and Z is the estimated z-score of how far the
observed map plane is from the average randomly generated layout.
N.data indicates how many data values were used and N.cycles tells the
number of completed permutations. AMPLITUDE is a dynamic range modifier
for colors that can be used in nroColorize()
.
The output also contains the attribute 'zbase' that indicates the normalization factor for the color amplitudes.
# Import data. fname <- system.file("extdata", "finndiane.txt", package = "Numero") dataset <- read.delim(file = fname) # Set row names. rownames(dataset) <- paste("r", 1:nrow(dataset), sep="") # Prepare training data. trvars <- c("CHOL", "HDL2C", "TG", "CREAT", "uALB") trdata <- scale.default(dataset[,trvars]) # K-means clustering. km <- nroKmeans(data = trdata) # Self-organizing map. sm <- nroKohonen(seeds = km) sm <- nroTrain(map = sm, data = trdata) # Assign data points into districts. matches <- nroMatch(centroids = sm, data = trdata) # Estimate statistics for cholesterol chol <- nroPermute(map = sm, districts = matches, data = dataset$CHOL) print(chol[,c("TRAINING", "Z", "P.z", "P.freq")]) # Estimate statistics. stats <- nroPermute(map = sm, districts = matches, data = dataset) print(stats[,c("TRAINING", "Z", "P.z", "P.freq")])
# Import data. fname <- system.file("extdata", "finndiane.txt", package = "Numero") dataset <- read.delim(file = fname) # Set row names. rownames(dataset) <- paste("r", 1:nrow(dataset), sep="") # Prepare training data. trvars <- c("CHOL", "HDL2C", "TG", "CREAT", "uALB") trdata <- scale.default(dataset[,trvars]) # K-means clustering. km <- nroKmeans(data = trdata) # Self-organizing map. sm <- nroKohonen(seeds = km) sm <- nroTrain(map = sm, data = trdata) # Assign data points into districts. matches <- nroMatch(centroids = sm, data = trdata) # Estimate statistics for cholesterol chol <- nroPermute(map = sm, districts = matches, data = dataset$CHOL) print(chol[,c("TRAINING", "Z", "P.z", "P.freq")]) # Estimate statistics. stats <- nroPermute(map = sm, districts = matches, data = dataset) print(stats[,c("TRAINING", "Z", "P.z", "P.freq")])
Create a graphical interface for selecting subgroups from multiple map colorings simultaneously.
nroPlot(topology, colors, labels = NULL, subplot = NULL, interactive = FALSE, clear = NULL) nroPlot.save(file, topology, colors, labels = NULL, subplot = NULL, font = 1.0)
nroPlot(topology, colors, labels = NULL, subplot = NULL, interactive = FALSE, clear = NULL) nroPlot.save(file, topology, colors, labels = NULL, subplot = NULL, font = 1.0)
topology |
A data frame with K rows and six or more columns that contain the district positions of a self-organizing map and optional region assignments. |
colors |
A character vector with K elements or a K x N matrix of hexadecimal color codes as strings. |
labels |
A character vector with K elements or a K x N matrix of district labels. |
subplot |
A two-element vector that sets out the number of rows and columns for a grid layout of multiple colorings. |
clear |
If TRUE, all graphics devices are cleared when the plot is refreshed. |
interactive |
If TRUE, an interactive version of the plot is launched. |
file |
If non-empty, the figure is saved as an SVG or HTML file instead of plotting on graphics device. |
font |
Multiplier to adjust font size for SVG and HTML output. |
The input topology
must follow the format from
nroKohonen()
, but may also contain the columns REGION,
and REGION.label that specify the names for subsets of districts and the
single character labels to be shown on top of those districts. The input
can also be the list object as returned by nroKohonen()
.
The color input can include the attribute 'contrast' that contains a binary vector or a matrix of equal size. If an element is set, it means a dark label or highlight will have better contrast with the background.
The label input can include the attribute 'visible' that contains a binary vector or a matrix of equal size. If an element is set, it means a label is visible, otherwise it will not be shown on the map.
Some non-alphanumeric characters are not supported and will be automatically converted to "_". Too long labels or column names will be truncated.
The default value for clear
is TRUE to prevent multiple plot windows
from accumulating within the RStudio. If you are running R from the terminal
or using detached devices, setting clear
to FALSE will retain the
current window when refreshing.
If the file name ends with ".html", an interactive HTML document is produced, otherwise an SVG document is created. We recommend opening the HTML file with a web browser to select regions on large maps (i.e. when the R plot window becomes too slow to use). The HTML page allows you to assign subgroups and to save the results as tab-delimited text.
The main function returns a data frame with K rows that contains the
topology and subgrouping information. The .save
subroutine returns
the number of bytes written in the output file.
# Import data. fname <- system.file("extdata", "finndiane.txt", package = "Numero") dataset <- read.delim(file = fname) # Detect binary columns. dataset <- nroPreprocess(dataset, method = "") # Prepare training data. trvars <- c("CHOL", "HDL2C", "TG", "CREAT", "uALB") trdata <- scale.default(dataset[,trvars]) # K-means clustering. km <- nroKmeans(data = trdata) # Self-organizing map. sm <- nroKohonen(seeds = km) sm <- nroTrain(map = sm, data = trdata) # Assign data points into districts. matches <- nroMatch(centroids = sm, data = trdata) # Select a subset of variables and detect binary data. vars <- c("AGE", "MALE", "uALB", "CHOL", "DIAB_KIDNEY", "DECEASED") selected <- nroPreprocess(dataset[,vars], method = "") # Calculate district averages for seleted variables. vars <- c("AGE", "MALE", "uALB", "CHOL", "DIAB_KIDNEY", "DECEASED") planes <- nroAggregate(topology = sm, districts = matches, data = selected) # Estimate statistics. stats <- nroPermute(map = sm, districts = matches, data = selected) # Set visuals. colrs <- nroColorize(values = planes, amplitudes = stats) labls <- nroLabel(topology = sm, values = planes) # Add subgrouping information. topo <- sm$topology topo$REGION <- "" topo$REGION[1:8] <- "Center" topo$REGION[9:21] <- "Perimeter" # Add subgroup labels. topo$REGION.label <- "" topo$REGION.label[1:8] <- "C" topo$REGION.label[9:21] <- "P" # Add subgroup colors. topo$REGION.color <- "" topo$REGION.color[1:8] <- "#00f00060" topo$REGION.color[9:21] <- "#f000f060" # Plot colorings on screen. nroPlot(topology = topo, colors = colrs, labels = labls) # Save colorings in file. #fn <- "colorings.html" #n <- nroPlot.save(file = fn, topology = topo, # colors = colrs, labels = labls) #cat(n, " bytes saved in '", fn, "'\n", sep="")
# Import data. fname <- system.file("extdata", "finndiane.txt", package = "Numero") dataset <- read.delim(file = fname) # Detect binary columns. dataset <- nroPreprocess(dataset, method = "") # Prepare training data. trvars <- c("CHOL", "HDL2C", "TG", "CREAT", "uALB") trdata <- scale.default(dataset[,trvars]) # K-means clustering. km <- nroKmeans(data = trdata) # Self-organizing map. sm <- nroKohonen(seeds = km) sm <- nroTrain(map = sm, data = trdata) # Assign data points into districts. matches <- nroMatch(centroids = sm, data = trdata) # Select a subset of variables and detect binary data. vars <- c("AGE", "MALE", "uALB", "CHOL", "DIAB_KIDNEY", "DECEASED") selected <- nroPreprocess(dataset[,vars], method = "") # Calculate district averages for seleted variables. vars <- c("AGE", "MALE", "uALB", "CHOL", "DIAB_KIDNEY", "DECEASED") planes <- nroAggregate(topology = sm, districts = matches, data = selected) # Estimate statistics. stats <- nroPermute(map = sm, districts = matches, data = selected) # Set visuals. colrs <- nroColorize(values = planes, amplitudes = stats) labls <- nroLabel(topology = sm, values = planes) # Add subgrouping information. topo <- sm$topology topo$REGION <- "" topo$REGION[1:8] <- "Center" topo$REGION[9:21] <- "Perimeter" # Add subgroup labels. topo$REGION.label <- "" topo$REGION.label[1:8] <- "C" topo$REGION.label[9:21] <- "P" # Add subgroup colors. topo$REGION.color <- "" topo$REGION.color[1:8] <- "#00f00060" topo$REGION.color[9:21] <- "#f000f060" # Plot colorings on screen. nroPlot(topology = topo, colors = colrs, labels = labls) # Save colorings in file. #fn <- "colorings.html" #n <- nroPlot.save(file = fn, topology = topo, # colors = colrs, labels = labls) #cat(n, " bytes saved in '", fn, "'\n", sep="")
Process a new dataset using a standardization procedure that was created for another dataset
nroPostprocess(data, mapping, reverse = FALSE, trim = FALSE)
nroPostprocess(data, mapping, reverse = FALSE, trim = FALSE)
data |
A matrix or a data frame with column names. |
mapping |
A list object or a matrix or a data frame. |
reverse |
If true, standardized data will be reverted back to original scale. |
trim |
If true, unusable rows and columns are removed. |
The input argument can be a data frame with the attribute 'mapping' as
returned from nroPreprocess()
or a list object with the
elements input
and output
that each contain a data frame or
a matrix of equal size.
The function projects the input data to the values in mapping$input
to determine the positions of the input values with respect to the rows
in the model. These positions are then used to interpolate corresponding
output values in mapping$output
.
The mapping elements must have consistent row and column names.
A matrix or data frame of processed values.
Ville-Petteri Makinen
# Import data. fname <- system.file("extdata", "finndiane.txt", package = "Numero") dataset <- read.delim(file = fname) # Show original data characteristics. print(summary(dataset)) # Preprocess a subset of data. ds.pre <- nroPreprocess(dataset[1:100,]) print(summary(ds.pre)) # Repeat preprocessing for the whole dataset (approximation). ds.post <- nroPostprocess(dataset, ds.pre) print(summary(ds.post))
# Import data. fname <- system.file("extdata", "finndiane.txt", package = "Numero") dataset <- read.delim(file = fname) # Show original data characteristics. print(summary(dataset)) # Preprocess a subset of data. ds.pre <- nroPreprocess(dataset[1:100,]) print(summary(ds.pre)) # Repeat preprocessing for the whole dataset (approximation). ds.post <- nroPostprocess(dataset, ds.pre) print(summary(ds.post))
Convert to numerical values, remove unusable rows and columns, and standardize scale of each variable.
nroPreprocess(data, method = "standard", clip = 5.0, resolution = 100, trim = FALSE)
nroPreprocess(data, method = "standard", clip = 5.0, resolution = 100, trim = FALSE)
data |
A matrix or a data frame. |
method |
Method for standardizing scale and location, see details below. |
clip |
Range for clipping extreme values in multiples of standard deviations. |
resolution |
Maximum number of sampling points to capture distribution shape. |
trim |
if TRUE, empty rows and columns are removed. |
Standardization methods include empty string for no action, "standard" for centering by mean and division by standard deviation, "uniform" for normalized ranks between -1 and 1, "tapered" for a version of the rank-based method that puts more samples around zero and "normal" for quantile-based mapping to standard normal distribution.
The standard method also checks if the distribution is skewed and applies logarithm if it makes the distribution closer to the normal curve.
Clipping is not applied if the method is rank-based or if the threshold is set to NULL.
A matrix of numerical values. A value mapping model is stored in the attribute 'mapping'. The names of binary columns are stored in the attribute 'binary'.
Ville-Petteri Makinen
# Import data. fname <- system.file("extdata", "finndiane.txt", package = "Numero") dataset <- read.delim(file = fname) # Show original data characteristics. print(summary(dataset)) # Detect binary columns. ds <- nroPreprocess(dataset, method = "") print(attr(ds,"binary")) # Centering and scaling cholesterol. ds <- nroPreprocess(dataset$CHOL) print(summary(ds)) # Centering and scaling. ds <- nroPreprocess(dataset) print(summary(ds)) # Tapered ranks. ds <- nroPreprocess(dataset, method = "tapered") print(summary(ds)) # Standard normal ranks. ds <- nroPreprocess(dataset, method = "normal") print(summary(ds))
# Import data. fname <- system.file("extdata", "finndiane.txt", package = "Numero") dataset <- read.delim(file = fname) # Show original data characteristics. print(summary(dataset)) # Detect binary columns. ds <- nroPreprocess(dataset, method = "") print(attr(ds,"binary")) # Centering and scaling cholesterol. ds <- nroPreprocess(dataset$CHOL) print(summary(ds)) # Centering and scaling. ds <- nroPreprocess(dataset) print(summary(ds)) # Tapered ranks. ds <- nroPreprocess(dataset, method = "tapered") print(summary(ds)) # Standard normal ranks. ds <- nroPreprocess(dataset, method = "normal") print(summary(ds))
Combine data points that reside in districts that belong to a larger region into a subgroup; compare descriptive statistics between subgroups.
nroSummary(data, districts, regions = NULL, categlim = 8, capacity = 10)
nroSummary(data, districts, regions = NULL, categlim = 8, capacity = 10)
data |
A vector of named M elements or an M x N matrix of data values with row names. |
districts |
An integer vector of M named elements that indicate the best match out of K districts for each row name in the data matrix, please see nroMatch for an example. |
regions |
An vector of K elements or a data frame of K rows that defines if a district belongs to a larger region (i.e. a subgroup), see details. |
categlim |
The threshold for the number of unique values before a variable is considered continuous. |
capacity |
Maximum number of subgroups to compare. |
If defined, the region vector should have K elements where K is the total number of map districts.
The region input can also be a data frame of K rows where the column
REGION will be used for assigning district to regions, and REGION.label
will be used as the character label as seen on the map, see the output
from nroPlot()
for an example.
Districts and data points are connected by comparing element names in
districts
and names or row names of data
.
Districts and regions are connected by comparing element values in
districts
and names or row names of regions
.
If the region vector is empty, each district is automatically assigned to its own region.
Safeguards are in place to prevent crashes from empty categories; this reduces statistical power slightly when numbers are small.
A data frame of summary statistics that contains a row for every combination of subgroups and variables. The chi-squared test is used for comparisons with respect to categorical variables, and rank-regulated t-test and ANOVA are applied to continuous variables. Region labels for each row are stored in the attribute 'labels' and a list that contains the subsets of rows in each region is stored in the attribute 'subgroups'.
# Import data. fname <- system.file("extdata", "finndiane.txt", package = "Numero") dataset <- read.delim(file = fname) # Prepare training data. trvars <- c("CHOL", "HDL2C", "TG", "CREAT", "uALB") trdata <- scale.default(dataset[,trvars]) # K-means clustering. km <- nroKmeans(data = trdata) # Self-organizing map. sm <- nroKohonen(seeds = km) sm <- nroTrain(map = sm, data = trdata) # Assign data points into districts. matches <- nroMatch(centroids = sm, data = trdata) # Calculate district averages for urinary albumin. plane <- nroAggregate(topology = sm, districts = matches, data = dataset$uALB) plane <- as.vector(plane) # Assign subgroups based on urinary albumin. regns <- rep("HighAlb", length.out=length(plane)) regns[which(plane < quantile(plane, 0.67))] <- "MiddleAlb" regns[which(plane < quantile(plane, 0.33))] <- "LowAlb" # Add label info and make a data frame. regns <- data.frame(REGION=regns, REGION.label="", stringsAsFactors=FALSE) regns[which(regns$REGION == "HighAlb"),"REGION.label"] <- "H" regns[which(regns$REGION == "MiddleAlb"),"REGION.label"] <- "M" regns[which(regns$REGION == "LowAlb"),"REGION.label"] <- "L" # Calculate summary statistics. st <- nroSummary(data = dataset, districts = matches, regions = regns) print(st[,c("VARIABLE","SUBGROUP","MEAN","P.chisq","P.t","P.anova")])
# Import data. fname <- system.file("extdata", "finndiane.txt", package = "Numero") dataset <- read.delim(file = fname) # Prepare training data. trvars <- c("CHOL", "HDL2C", "TG", "CREAT", "uALB") trdata <- scale.default(dataset[,trvars]) # K-means clustering. km <- nroKmeans(data = trdata) # Self-organizing map. sm <- nroKohonen(seeds = km) sm <- nroTrain(map = sm, data = trdata) # Assign data points into districts. matches <- nroMatch(centroids = sm, data = trdata) # Calculate district averages for urinary albumin. plane <- nroAggregate(topology = sm, districts = matches, data = dataset$uALB) plane <- as.vector(plane) # Assign subgroups based on urinary albumin. regns <- rep("HighAlb", length.out=length(plane)) regns[which(plane < quantile(plane, 0.67))] <- "MiddleAlb" regns[which(plane < quantile(plane, 0.33))] <- "LowAlb" # Add label info and make a data frame. regns <- data.frame(REGION=regns, REGION.label="", stringsAsFactors=FALSE) regns[which(regns$REGION == "HighAlb"),"REGION.label"] <- "H" regns[which(regns$REGION == "MiddleAlb"),"REGION.label"] <- "M" regns[which(regns$REGION == "LowAlb"),"REGION.label"] <- "L" # Calculate summary statistics. st <- nroSummary(data = dataset, districts = matches, regions = regns) print(st[,c("VARIABLE","SUBGROUP","MEAN","P.chisq","P.t","P.anova")])
Iterative algorithm to adapt a self-organizing map (SOM) to a set of multivariable data.
nroTrain(map, data, subsample = NULL, balance = 0, message = NULL)
nroTrain(map, data, subsample = NULL, balance = 0, message = NULL)
map |
A list object as returned by |
data |
A matrix or a data frame. |
subsample |
Number of rows used during a single training cycle. |
balance |
Penalty parameter for variation in the numbers of resident
samples across disctricts, see |
message |
If positive, progress information is printed at the specified interval in seconds. |
The map is fitted according to columns that are found both in the SOM centroids and the input data.
If subsample
is less than the number of data rows, a random subset of
the specified size is used for each training cycle. By default,
subsample
is set automatically depending on the size of the dataset.
A copy of the list object map
, where the element centroids
is
updated according to the data patterns. The quantization errors during
training are stored in the element history
. The subsampling
parameter that was used during training is stored in the element
subsample
.
# Import data. fname <- system.file("extdata", "finndiane.txt", package = "Numero") dataset <- read.delim(file = fname) # Prepare training data. trvars <- c("CHOL", "HDL2C", "TG", "CREAT", "uALB") trdata <- scale.default(dataset[,trvars]) # K-means clustering. km <- nroKmeans(data = trdata) # Train with full data. sm <- nroKohonen(seeds = km) sm <- nroTrain(map = sm, data = trdata, subsample = nrow(trdata)) print(sm$history) # Train with subsampling. sm <- nroKohonen(seeds = km) sm <- nroTrain(map = sm, data = trdata, subsample = 200) print(sm$history)
# Import data. fname <- system.file("extdata", "finndiane.txt", package = "Numero") dataset <- read.delim(file = fname) # Prepare training data. trvars <- c("CHOL", "HDL2C", "TG", "CREAT", "uALB") trdata <- scale.default(dataset[,trvars]) # K-means clustering. km <- nroKmeans(data = trdata) # Train with full data. sm <- nroKohonen(seeds = km) sm <- nroTrain(map = sm, data = trdata, subsample = nrow(trdata)) print(sm$history) # Train with subsampling. sm <- nroKohonen(seeds = km) sm <- nroTrain(map = sm, data = trdata, subsample = 200) print(sm$history)
Sets row names and removes unusable columns and rows.
numero.clean(..., identity = NULL, na.freq = 0.9, num.only = TRUE, select = "")
numero.clean(..., identity = NULL, na.freq = 0.9, num.only = TRUE, select = "")
... |
Matrices or a data frames. |
identity |
Name(s) of the column(s) that contain identification information. |
na.freq |
The proportion of how many missing values are allowed in each column and in each row. |
num.only |
If true, only numeric columns are included. |
select |
Indicate if only identities present in all datasets or in exactly one of the datasets are included. |
If multiple identity columns are provided, composite identity keys are constructed by concatenating elements from each column with "_" added as a separator.
The frequency of missing values (against na.freq
) is tested first
by column then by row.
Selection can take three values: "" for no selection, "union" for all identities expanded to every dataset, "shared" for only those data points present in all usable datasets or "distinct" for excluding any points that can be found in more than one dataset. Note that the union may result in rows with no usable values.
A data frame if only one input dataset, or a list of data frames if multiple datasets.
Ville-Petteri Makinen
# Import data. fname <- system.file("extdata", "finndiane.txt", package = "Numero") dataset <- read.delim(file = fname) # Create new versions for testing. dsA <- dataset[1:250, c("INDEX","AGE","MALE","uALB")] dsB <- dataset[151:300, c("INDEX","AGE","MALE","uALB","CHOL")] dsC <- dataset[201:500, c("INDEX","AGE","MALE","DIAB_RETINO")] # Select all rows. results <- numero.clean(a = dsA, b = dsB, c = dsC, identity = "INDEX") cat("\n\nNo selection:\n") print(nrow(results$a)) print(nrow(results$b)) print(nrow(results$c)) # Select all rows and expanded for all identities. results <- numero.clean(a = dsA, b = dsB, c = dsC, identity = "INDEX", select = "union") cat("\n\nUnion:\n") print(nrow(results$a)) print(nrow(results$b)) print(nrow(results$c)) # Select only rows that are shared between all datasets. results <- numero.clean(a = dsA, b = dsB, c = dsC, identity = "INDEX", select = "intersection") cat("\n\nIntersection:\n") print(nrow(results$a)) print(nrow(results$b)) print(nrow(results$c)) # Select only rows with a unique INDEX ('dsB' has none). results <- numero.clean(a = dsA, b = dsB, c = dsC, identity = "INDEX", select = "exclusion") cat("\n\nExclusion:\n") print(nrow(results$a)) print(nrow(results$b)) print(nrow(results$c)) # Add extra identification information. dsA$GROUP <- "A" dsB$GROUP <- "B" dsC$GROUP <- "C" # Select rows with a unique identifier. results <- numero.clean(a = dsA, b = dsB, c = dsC, identity = c("GROUP","INDEX"), select = "exclusion") cat("\n\nMulti-identities:\n") print(nrow(results$a)) print(nrow(results$b)) print(nrow(results$c))
# Import data. fname <- system.file("extdata", "finndiane.txt", package = "Numero") dataset <- read.delim(file = fname) # Create new versions for testing. dsA <- dataset[1:250, c("INDEX","AGE","MALE","uALB")] dsB <- dataset[151:300, c("INDEX","AGE","MALE","uALB","CHOL")] dsC <- dataset[201:500, c("INDEX","AGE","MALE","DIAB_RETINO")] # Select all rows. results <- numero.clean(a = dsA, b = dsB, c = dsC, identity = "INDEX") cat("\n\nNo selection:\n") print(nrow(results$a)) print(nrow(results$b)) print(nrow(results$c)) # Select all rows and expanded for all identities. results <- numero.clean(a = dsA, b = dsB, c = dsC, identity = "INDEX", select = "union") cat("\n\nUnion:\n") print(nrow(results$a)) print(nrow(results$b)) print(nrow(results$c)) # Select only rows that are shared between all datasets. results <- numero.clean(a = dsA, b = dsB, c = dsC, identity = "INDEX", select = "intersection") cat("\n\nIntersection:\n") print(nrow(results$a)) print(nrow(results$b)) print(nrow(results$c)) # Select only rows with a unique INDEX ('dsB' has none). results <- numero.clean(a = dsA, b = dsB, c = dsC, identity = "INDEX", select = "exclusion") cat("\n\nExclusion:\n") print(nrow(results$a)) print(nrow(results$b)) print(nrow(results$c)) # Add extra identification information. dsA$GROUP <- "A" dsB$GROUP <- "B" dsC$GROUP <- "C" # Select rows with a unique identifier. results <- numero.clean(a = dsA, b = dsB, c = dsC, identity = c("GROUP","INDEX"), select = "exclusion") cat("\n\nMulti-identities:\n") print(nrow(results$a)) print(nrow(results$b)) print(nrow(results$c))
Set up a self-organizing map and train it with data
numero.create(data, radius = NULL, smoothness = NULL, subsample = NULL)
numero.create(data, radius = NULL, smoothness = NULL, subsample = NULL)
data |
A matrix or a data frame. |
radius |
Map radius. |
smoothness |
Rigidity of the map to adapt to regional differences. |
subsample |
Number of data points used during a single training cycle. |
The parameter subsample
sets the number of data points that are
randomly picked for each training cycle; if the number is substantially less
than the size of the dataset, the function will finish quicker.
A list with named elements: data
contains the training data,
kmeans
is the output from nroKmeans()
during the
initialiation of the SOM, map
is the finished self-organising map
from nroTrain()
and layout
contains the output from
nroMatch()
for the training data points.
# Import data. fname <- system.file("extdata", "finndiane.txt", package = "Numero") dataset <- read.delim(file = fname) # Set identities and manage missing data. dataset <- numero.clean(dataset, identity = "INDEX") # Prepare training set. trvars <- c("CHOL", "HDL2C", "TG", "CREAT", "uALB") trdata <- numero.prepare(data = dataset, variables = trvars) # Create a self-organizing map. modl <- numero.create(data = trdata)
# Import data. fname <- system.file("extdata", "finndiane.txt", package = "Numero") dataset <- read.delim(file = fname) # Set identities and manage missing data. dataset <- numero.clean(dataset, identity = "INDEX") # Prepare training set. trvars <- c("CHOL", "HDL2C", "TG", "CREAT", "uALB") trdata <- numero.prepare(data = dataset, variables = trvars) # Create a self-organizing map. modl <- numero.create(data = trdata)
Evaluate regional variation of data values on a self-organizing map
numero.evaluate(model, data, ranked = TRUE, n = 1000)
numero.evaluate(model, data, ranked = TRUE, n = 1000)
model |
A list object that contains a self-organizing map and a data layout. |
data |
A matrix or a data frame. |
ranked |
If true, a rank transform is applied to avoid problems from skewed distributions or outliers. |
n |
Maximum number of permutations per data column. |
The input argument model
can be the output from
numero.create()
or from numero.quality()
.
A list with named elements: som
contains the self-organizing map,
layout
contains the district assignments for data points,
planes
contains smoothed district averages from
nroAggregate()
, the element ranges
contains the
reference ranges to be used in nroColorize()
,
the element statistics
contains the output from
nroPermute()
, the element palette
is the name of
the colormap and the element data
contains the data points
that were used for calculating the statistics.
# Import data. fname <- system.file("extdata", "finndiane.txt", package = "Numero") dataset <- read.delim(file = fname) # Set identities and manage missing data. dataset <- numero.clean(dataset, identity = "INDEX") # Prepare training variables. trvars <- c("CHOL", "HDL2C", "TG", "CREAT", "uALB") trdata <- numero.prepare(data = dataset, variables = trvars) # Create a self-organizing map. sm <- numero.create(data = trdata) qc <- numero.quality(model = sm) # Evaluate map statistics. results <- numero.evaluate(model = qc, data = dataset) print(results$statistics[,c("TRAINING", "Z", "P.z", "P.freq")])
# Import data. fname <- system.file("extdata", "finndiane.txt", package = "Numero") dataset <- read.delim(file = fname) # Set identities and manage missing data. dataset <- numero.clean(dataset, identity = "INDEX") # Prepare training variables. trvars <- c("CHOL", "HDL2C", "TG", "CREAT", "uALB") trdata <- numero.prepare(data = dataset, variables = trvars) # Create a self-organizing map. sm <- numero.create(data = trdata) qc <- numero.quality(model = sm) # Evaluate map statistics. results <- numero.evaluate(model = qc, data = dataset) print(results$statistics[,c("TRAINING", "Z", "P.z", "P.freq")])
Plot map colorings and save them as vector graphics
numero.plot(results, variables = NULL, topology = NULL, folder = NULL, prefix = "figure", reference = NULL, subplot = NULL, gain = 1, detach = FALSE, capacity = 500, font = NULL)
numero.plot(results, variables = NULL, topology = NULL, folder = NULL, prefix = "figure", reference = NULL, subplot = NULL, gain = 1, detach = FALSE, capacity = 500, font = NULL)
results |
A list object that contains the self-organizing map and its statistical colorings. |
variables |
A string vector that contains names of variables to show. |
topology |
The topology of a SOM with subgroup labels. |
folder |
Folder path for saving figures. |
prefix |
Prefix for each figure file (if saving enabled). |
reference |
Reference color ranges and scales. |
gain |
Modifier for overall color intensity. |
subplot |
A two-element vector that sets out the number of rows and columns for subplots per figure. |
detach |
Use detached windows for figures. |
capacity |
Maximum number of subplots to show on screen. |
font |
Multiplier to adjust font size for SVG and HTML output,
see |
The input results
must contain the output from
numero.evaluate()
or similar.
The input argument topology
can be the topology of a SOM or with
additional columns as in the output from numero.subgroup()
.
The input argument reference
follows the output format from
numero.evaluate()
.
Possible values for detach
include "X11", "aqua", TRUE or FALSE.
Using multiple figures may result in different behaviour in terminal vs.
RStudio instances. The default behaviour is to create detached windows for each
figure when the X11 display server is available (e.g. in Linux). To use
detached windows in Mac, use the value "aqua". Setting detach = TRUE
will use a more general approach, however, some systems may behave
unpredictably. To create multiple figures that remain docked within the
RStudio work window, set detach = FALSE
.
If a destination folder is provided, all plots are saved in files without plotting them on screen.
The number of figures that were created.
Ville-Petteri Makinen
# Import data. fname <- system.file("extdata", "finndiane.txt", package = "Numero") dataset <- read.delim(file = fname) # Set identities and manage missing data. dataset <- numero.clean(dataset, identity = "INDEX") # Prepare training variables. trvars <- c("CHOL", "HDL2C", "TG", "CREAT", "uALB") trdata <- numero.prepare(data = dataset, variables = trvars, batch = "MALE", confounders = c("AGE", "T1D_DURAT")) # Create a self-organizing map. sm <- numero.create(data = trdata) qc <- numero.quality(model = sm) # Evaluate map statistics for all variables. stats <- numero.evaluate(model = qc, data = dataset) # Plot map colorings. numero.plot(results = stats)
# Import data. fname <- system.file("extdata", "finndiane.txt", package = "Numero") dataset <- read.delim(file = fname) # Set identities and manage missing data. dataset <- numero.clean(dataset, identity = "INDEX") # Prepare training variables. trvars <- c("CHOL", "HDL2C", "TG", "CREAT", "uALB") trdata <- numero.prepare(data = dataset, variables = trvars, batch = "MALE", confounders = c("AGE", "T1D_DURAT")) # Create a self-organizing map. sm <- numero.create(data = trdata) qc <- numero.quality(model = sm) # Evaluate map statistics for all variables. stats <- numero.evaluate(model = qc, data = dataset) # Plot map colorings. numero.plot(results = stats)
Prepare training data by mitigating confounding factors and standardizing values.
numero.prepare(data, variables = NULL, confounders = NULL, batch = NULL, method = "standard", clip = 5.0, pipeline = NULL, undo = FALSE)
numero.prepare(data, variables = NULL, confounders = NULL, batch = NULL, method = "standard", clip = 5.0, pipeline = NULL, undo = FALSE)
data |
A matrix or a data frame. |
variables |
A character vector of column names, see details. |
confounders |
Names of columns that contain confounder data. |
batch |
The name of the column that contains batch labels. |
method |
Method to standardize values, see |
clip |
Range for clipping extreme values in multiples of standard deviations. |
pipeline |
Processing parameters from a previous use of the function. |
undo |
If true, standardization is reversed after adjusting for batches and confounders. |
We recommend first applying numero.clean()
to the full
dataset, then selecting a subset for training using the input argument
variables
. This preserves any attributes that may be used in
Numero functions.
If a previous pipeline
is available, it overrides all processing
parameters irrespective of other input arguments.
Due to safeguards against numerical instability, the standardized values may deviate slightly from the expected range (<0.1 percent error is typical).
Clipping of extreme values is applied only during the first round of standardization before adjustments for confounders. Therefore, the final output may contain values that exceed the threshold.
A matrix with the attributes 'pipeline' that contains the processing parameters and 'subsets' that contains row names divided into batches if batch correction was applied.
# Import data. fname <- system.file("extdata", "finndiane.txt", package = "Numero") dataset <- read.delim(file = fname) # Set identities and manage missing data. dataset <- numero.clean(dataset, identity = "INDEX") # Prepare training variables using default standardization. trvars <- c("CHOL", "HDL2C", "TG", "CREAT", "uALB") trdata <- numero.prepare(data = dataset, variables = trvars) print(summary(trdata)) # Prepare training values adjusted for age and sex and # standardized by rank-based method. trdata <- numero.prepare(data = dataset, variables = trvars, batch = "MALE", confounders = "AGE", method = "tapered") print(summary(trdata))
# Import data. fname <- system.file("extdata", "finndiane.txt", package = "Numero") dataset <- read.delim(file = fname) # Set identities and manage missing data. dataset <- numero.clean(dataset, identity = "INDEX") # Prepare training variables using default standardization. trvars <- c("CHOL", "HDL2C", "TG", "CREAT", "uALB") trdata <- numero.prepare(data = dataset, variables = trvars) print(summary(trdata)) # Prepare training values adjusted for age and sex and # standardized by rank-based method. trdata <- numero.prepare(data = dataset, variables = trvars, batch = "MALE", confounders = "AGE", method = "tapered") print(summary(trdata))
Assign new data points to map districts and calculate quality measures
numero.quality(model, data = NULL)
numero.quality(model, data = NULL)
model |
A list object that contains a self-organizing map (and a data layout). |
data |
A matrix or a data frame. |
The input argument model
must be in the the output format
as returned by numero.create()
.
A list with named elements: som
contains the self-organizing map;
layout
contains the district assignments for data points;
planes
contains smoothed district averages of quality measures, see
nroAggregate()
and nroMatch()
; the element
ranges
contains the reference ranges to be used in
nroColorize()
; the element palette
is the name of
the colormap to be used for colorings; and statistics
contains
the output from nroPermute()
.
# Import data. fname <- system.file("extdata", "finndiane.txt", package = "Numero") dataset <- read.delim(file = fname) # Set identities and manage missing data. dataset <- numero.clean(dataset, identity = "INDEX") # Prepare training variables. trvars <- c("CHOL", "HDL2C", "TG", "CREAT", "uALB") trdata <- numero.prepare(data = dataset, variables = trvars) # Create a self-organizing map. modl <- numero.create(data = trdata) # Analyze map quality. qc <- numero.quality(model = modl)
# Import data. fname <- system.file("extdata", "finndiane.txt", package = "Numero") dataset <- read.delim(file = fname) # Set identities and manage missing data. dataset <- numero.clean(dataset, identity = "INDEX") # Prepare training variables. trvars <- c("CHOL", "HDL2C", "TG", "CREAT", "uALB") trdata <- numero.prepare(data = dataset, variables = trvars) # Create a self-organizing map. modl <- numero.create(data = trdata) # Analyze map quality. qc <- numero.quality(model = modl)
Plot self-organizing map colorings and let the user choose multi-district regions as subgroups
numero.subgroup(results, variables, topology = NULL, reference = NULL, gain = 1, detach = FALSE, capacity = 9, automatic = FALSE)
numero.subgroup(results, variables, topology = NULL, reference = NULL, gain = 1, detach = FALSE, capacity = 9, automatic = FALSE)
results |
A list object that contains the self-organizing map and its statistical colorings. |
variables |
A string vector that contains names of variables to show on screen. |
topology |
A SOM topology or the output from a previous subgrouping session. |
reference |
Reference color ranges and scales. |
gain |
Modifier for overall color intensity. |
detach |
Use a detached window. |
capacity |
Maximum number of subplots to show on screen. |
automatic |
If greater than zero, automatic segmentation of the map is triggered, the value sets the number of subgroups. |
The input results
must contain the output from
numero.evaluate()
or similar.
The input argument topology
can be the structure of a SOM or with
additional columns as in the output from nroPlot()
.
The input argument reference
follows the output format from
numero.evaluate()
.
Setting detach to FALSE will also clear all devices whenever the figure is
refreshed. This may be inconvenient when using R from the terminal,
for example; please see the help page of numero.plot()
for
using a detached window device instead.
If any districts are left unmarked, they are automatically collected
into a subgroup of their own. If automatic
is set, user input
is skipped.
A data frame similar to the format returned by nroPlot()
.
# Import data. fname <- system.file("extdata", "finndiane.txt", package = "Numero") dataset <- read.delim(file = fname) # Set identities and manage missing data. dataset <- numero.clean(dataset, identity = "INDEX") # Prepare training variables. trvars <- c("CHOL", "HDL2C", "TG", "CREAT", "uALB") trdata <- numero.prepare(data = dataset, variables = trvars) # Create a self-organizing map. sm <- numero.create(data = trdata) qc <- numero.quality(model = sm) # Evaluate map statistics for all variables. stats <- numero.evaluate(model = qc, data = dataset) # Define subgroups, uncomment to launch interactive window. #elem <- numero.subgroup(results = stats, variables = trvars)
# Import data. fname <- system.file("extdata", "finndiane.txt", package = "Numero") dataset <- read.delim(file = fname) # Set identities and manage missing data. dataset <- numero.clean(dataset, identity = "INDEX") # Prepare training variables. trvars <- c("CHOL", "HDL2C", "TG", "CREAT", "uALB") trdata <- numero.prepare(data = dataset, variables = trvars) # Create a self-organizing map. sm <- numero.create(data = trdata) qc <- numero.quality(model = sm) # Evaluate map statistics for all variables. stats <- numero.evaluate(model = qc, data = dataset) # Define subgroups, uncomment to launch interactive window. #elem <- numero.subgroup(results = stats, variables = trvars)
Estimates subgroup statistics after self-organizing map analysis
numero.summary(results, topology, data = NULL, capacity = 10)
numero.summary(results, topology, data = NULL, capacity = 10)
results |
A list object that contains the self-organizing map and its statistical colorings. |
topology |
A SOM topology with additional labels that indicate selected regions. |
data |
A matrix or a data frame. |
capacity |
Maximum number of subgroups to compare. |
The input results
must contain the output from
numero.evaluate()
or similar.
The input argument topology
must be a definition of a SOM with
additional columns as in the output from numero.subgroup()
.
The function first looks for row names in data
that are also included
in results
. The rows are then divided into subgroups according to the
district assignments in results
and the region labels in
topology
.
A data frame of summary statistics, see nroSummary()
for details. The data frame also contains additional information on which
variables were used for the training of the SOM.
The attribute 'layout' is added to the output. It indicates the location on the map and the subgroup name and label for each data row that were included in the analysis.
Ville-Petteri Makinen
# Import data. fname <- system.file("extdata", "finndiane.txt", package = "Numero") dataset <- read.delim(file = fname) # Set identities and manage missing data. dataset <- numero.clean(dataset, identity = "INDEX") # Prepare training variables. trvars <- c("CHOL", "HDL2C", "TG", "CREAT", "uALB") trdata <- numero.prepare(data = dataset, variables = trvars) # Create a self-organizing map. sm <- numero.create(data = trdata) qc <- numero.quality(model = sm) # Evaluate map statistics for all variables. stats <- numero.evaluate(model = qc, data = dataset) # Define subgroups. x <- stats$planes[,"uALB"] tops <- which(x >= quantile(x, 0.75, na.rm=TRUE)) bottoms <- which(x <= quantile(x, 0.25, na.rm=TRUE)) elem <- data.frame(stats$map$topology, stringsAsFactors = FALSE) elem$REGION <- "MiddleAlb" elem$REGION[tops] <- "HighAlb" elem$REGION[bottoms] <- "LowAlb" elem$REGION.label <- "M" elem$REGION.label[tops] <- "H" elem$REGION.label[bottoms] <- "L" # Compare subgroups. cmp <- numero.summary(results = stats, topology = elem, data = dataset)
# Import data. fname <- system.file("extdata", "finndiane.txt", package = "Numero") dataset <- read.delim(file = fname) # Set identities and manage missing data. dataset <- numero.clean(dataset, identity = "INDEX") # Prepare training variables. trvars <- c("CHOL", "HDL2C", "TG", "CREAT", "uALB") trdata <- numero.prepare(data = dataset, variables = trvars) # Create a self-organizing map. sm <- numero.create(data = trdata) qc <- numero.quality(model = sm) # Evaluate map statistics for all variables. stats <- numero.evaluate(model = qc, data = dataset) # Define subgroups. x <- stats$planes[,"uALB"] tops <- which(x >= quantile(x, 0.75, na.rm=TRUE)) bottoms <- which(x <= quantile(x, 0.25, na.rm=TRUE)) elem <- data.frame(stats$map$topology, stringsAsFactors = FALSE) elem$REGION <- "MiddleAlb" elem$REGION[tops] <- "HighAlb" elem$REGION[bottoms] <- "LowAlb" elem$REGION.label <- "M" elem$REGION.label[tops] <- "H" elem$REGION.label[bottoms] <- "L" # Compare subgroups. cmp <- numero.summary(results = stats, topology = elem, data = dataset)