Title: | Histogram-Valued Data Analysis |
---|---|
Description: | In the framework of Symbolic Data Analysis, a relatively new approach to the statistical analysis of multi-valued data, we consider histogram-valued data, i.e., data described by univariate histograms. The methods and the basic statistics for histogram-valued data are mainly based on the L2 Wasserstein metric between distributions, i.e., the Euclidean metric between quantile functions. The package contains unsupervised classification techniques, least squares regression, and tools for histogram-valued data and for histogram time series. An introductory paper is Irpino A., Verde R. (2015) <doi: 10.1007/s11634-014-0176-4>. |
Authors: | Antonio Irpino [aut, cre] |
Maintainer: | Antonio Irpino <[email protected]> |
License: | GPL (>= 2) |
Version: | 1.0.8 |
Built: | 2024-11-20 06:29:55 UTC |
Source: | CRAN |
We consider histogram-valued data, i.e., data described by univariate histograms. The methods and the basic statistics for histogram-valued data are mainly based on the L2 Wasserstein metric between distributions, i.e., a Euclidean metric between quantile functions. The package contains unsupervised classification techniques, least squares regression, and tools for histogram-valued data and for histogram time series.
Package: | HistDAWass |
Type: | Package |
Version: | 0.1.1 |
Date: | 2014-09-17 |
License: | GPL (>=2) |
Depends: | methods |
An overview of how to use the package, including the most important functions
Antonio Irpino <[email protected]>
Irpino, A., Verde, R. (2015) Basic statistics for distributional symbolic variables: a new metric-based approach, Advances in Data Analysis and Classification, Volume 9, Issue 2, pp 143–175. DOI: 10.1007/s11634-014-0176-4
# Generating a list of distributions
a <- vector("list", 4)
a[[1]] <- distributionH(
  x = c(80, 100, 120, 135, 150, 165, 180, 200, 240),
  p = c(0, 0.025, 0.1, 0.275, 0.525, 0.725, 0.887, 0.975, 1)
)
a[[2]] <- distributionH(
  x = c(80, 100, 120, 135, 150, 165, 180, 195, 210, 240),
  p = c(0, 0.013, 0.101, 0.255, 0.508, 0.718, 0.895, 0.961, 0.987, 1)
)
a[[3]] <- distributionH(
  x = c(95, 110, 125, 140, 155, 170, 185, 200, 215, 230, 245),
  p = c(0, 0.012, 0.041, 0.154, 0.36, 0.595, 0.781, 0.929, 0.972, 0.992, 1)
)
a[[4]] <- distributionH(
  x = c(105, 120, 135, 150, 165, 180, 195, 210, 225, 240, 260),
  p = c(0, 0.009, 0.035, 0.081, 0.186, 0.385, 0.633, 0.832, 0.932, 0.977, 1)
)
# Generating a list of names of observations
namerows <- list("u1", "u2")
# Generating a list of names of variables
namevars <- list("Var_1", "Var_2")
# Creating the MatH
Mat_of_distributions <- MatH(
  x = a, nrows = 2, ncols = 2,
  rownames = namerows, varnames = namevars, by.row = FALSE
)
This method overrides the "[" operator for a MatH object.
## S4 method for signature 'MatH'
x[i, j, ..., drop = TRUE]
x |
a |
i |
a set of integer values identifying the rows |
j |
a set of integer values identifying the columns |
... |
not used |
drop |
a logical value inherited from the basic method "[" but not used (default=TRUE) |
A MatH object
D <- BLOOD # the BLOOD dataset
SUB_D <- BLOOD[c(1, 2, 5), c(1, 2)]
The product of a number and a distribution according to the L2 Wasserstein metric
## S4 method for signature 'distributionH,distributionH'
e1 * e2

## S4 method for signature 'numeric,distributionH'
e1 * e2

## S4 method for signature 'distributionH,numeric'
e1 * e2
e1 |
a |
e2 |
a |
The sum of two distributions according to the L2 Wasserstein metric
The sum of a number and a distribution according to the L2 Wasserstein metric
The sum of a distribution and a number according to the L2 Wasserstein metric
## S4 method for signature 'distributionH,distributionH'
e1 + e2

## S4 method for signature 'numeric,distributionH'
e1 + e2

## S4 method for signature 'distributionH,numeric'
e1 + e2
e1 |
a |
e2 |
a |
a distributionH
object
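The arithmetic above can be pictured in terms of quantile functions. The following base-R sketch is an illustration only, not the package's code. It assumes a uniform density within each bin (so the quantile function is piecewise linear) and strictly increasing cdf values; the helper names `quantile_at` and `hist_mean` are hypothetical.

```r
# A histogram is stored as bin bounds x and cdf values p, as in distributionH(x, p).
quantile_at <- function(x, p, levels) {
  # piecewise-linear quantile function; assumes p is strictly increasing
  approx(p, x, xout = levels)$y
}

hist_mean <- function(x, p) {
  # weighted mean of the bin midpoints
  sum(diff(p) * (x[-1] + x[-length(x)]) / 2)
}

x1 <- c(1, 2, 3, 10); p1 <- c(0, 0.1, 0.5, 1)
x2 <- c(5, 7, 15);    p2 <- c(0, 0.7, 1)

# Sum: evaluate both quantile functions on common cdf levels and add them
p_sum <- sort(unique(c(p1, p2)))
x_sum <- quantile_at(x1, p1, p_sum) + quantile_at(x2, p2, p_sum)

# Product with a positive number: rescale the bin bounds, keep the cdf values
x_scaled <- 2 * x1

mean_sum    <- hist_mean(x_sum, p_sum)   # mean of the first + mean of the second
mean_scaled <- hist_mean(x_scaled, p1)   # twice the mean of the first
```

Under these assumptions the mean of the sum equals the sum of the means, which gives a quick sanity check.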
The dataset contains a MatH (matrix of histogram-valued data) object with three histogram-valued variables: the 5-year age distributions (relative frequencies) of the whole population, of the male population and of the female population of 228 countries of the World. The first row is the World data. Thus it contains 229 rows (228 countries plus the World) and 3 variables: "Both.Sexes.Population", "Male.Population", "Female.Population".
a MatH
object, a matrix of distributions.
Antonio Irpino, 2014-10-05
United States Census Bureau https://www.census.gov/data.html
A dataset with the distributions of marginal costs of farms in 22 French regions. It contains four histogram variables: "Y_TSC" (Total costs of a farm), "X_Wheat" (Costs for Wheat), "X_Pig" (Costs for Pigs) and "X_Cmilk" (Costs for Cow Milk).
a MatH
object, a matrix of distributions.
Antonio Irpino, 2014-10-05
Rosanna Verde, Antonio Irpino, Second University of Naples; Dominique Desbois, UMR Economie publique, INRA-AgroParisTech, How to cope with modelling and privacy concerns? A regression model and a visualization tool for aggregated data, Conference of European Statistics Stakeholders, Rome, November 24-25, 2014
The dataset contains a MatH (matrix of histogram-valued data) object. This data set lists 14 groups of patients described by 3 variables.
a MatH instance, 1 row per group.
Antonio Irpino, 2014-10-05
Billard L. and Diday E. (2006). Symbolic Data Analysis: Conceptual Statistics and Data Mining, Wiley.
The dataset contains a MatH (matrix of histogram-valued data) object. This data set lists 10 patients described by 2 variables.
a MatH instance, 1 row per patient.
Antonio Irpino, 2014-10-05
Dias, S. and Brito P. Distribution and Symmetric Distribution Regression Model for Histogram-Valued Variables, ArXiv, arXiv:1303.6199 [stat.ME]
The function transforms a MatH object (i.e. a matrix of distributions) so that each distribution is shifted to have a mean equal to zero.
Center.cell.MatH(object)

## S4 method for signature 'MatH'
Center.cell.MatH(object)
object |
a MatH object, a matrix of distributions. |
A MatH
object, having each distribution with a zero mean.
CEN_BLOOD <- Center.cell.MatH(BLOOD)
get.MatH.stats(BLOOD, stat = "mean") # means before centering
get.MatH.stats(CEN_BLOOD, stat = "mean") # means after centering
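Centering a single distribution amounts to shifting its bin bounds by its mean. A minimal base-R sketch of that idea follows, assuming a uniform density within each bin; `hist_mean` is a hypothetical helper and this is not the package's implementation.

```r
# Histogram with bins [1,2) and [2,3], cdf values 0, 0.4, 1
x <- c(1, 2, 3)
p <- c(0, 0.4, 1)

hist_mean <- function(x, p) {
  # weighted mean of the bin midpoints
  sum(diff(p) * (x[-1] + x[-length(x)]) / 2)
}

m <- hist_mean(x, p)                       # mean of the original histogram
x_centered <- x - m                        # shift the domain, keep the cdf values
m_centered <- hist_mean(x_centered, p)     # zero, up to rounding error
```

The cdf values are untouched, so the shape of the distribution is preserved and only its location changes.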
checkEmptyBins
The method checks for empty bins in a distribution, i.e., whether two consecutive cdf values are equal. In that case, a probability value of 1e-7 is assigned to the empty bin and the cdf is recomputed. This method is useful for numerical reasons.
checkEmptyBins(object)

## S4 method for signature 'distributionH'
checkEmptyBins(object)
object |
a |
A distributionH
object without empty bins
Antonio Irpino
## ---- A mydist distribution with an empty bin, i.e. two consecutive values of p are equal ----
mydist <- distributionH(x = c(1, 2, 3, 10), p = c(0, 0.5, 0.5, 1))
## ---- Checks for empty bins and returns the newdist object without empty bins ----
newdist <- checkEmptyBins(mydist)
A dataset with the distributions of some climatic variables collected for each month in 60 stations of China.
There are 168 collected variables, i.e., 14 climatic variables observed for each of the 12 months. The 14 variables are the following:
mean station pressure (mb), mean temperature, mean maximum temperature, mean minimum temperature,
total precipitation (mm), sunshine duration (h), mean cloud amount (percentage of sky cover),
mean relative humidity (
mean wind speed (m/s), dominant wind frequency (
extreme minimum temperature.
Use the command get.MatH.main.info(China_Month)
for rapid info.
a MatH
object, a matrix of distributions.
Antonio Irpino, 2014-10-05
raw data are available here: https://data.ess-dive.lbl.gov/view/doi:10.3334/CDIAC/CLI.TR055
A dataset with the distributions of some climatic variables collected for each season in 60 stations of China.
There are 56 collected variables, i.e., 14 climatic variables observed for each of the 4 seasons. The 14 variables are the following:
mean station pressure (mb), mean temperature, mean maximum temperature, mean minimum temperature,
total precipitation (mm), sunshine duration (h), mean cloud amount (percentage of sky cover),
mean relative humidity (
mean wind speed (m/s), dominant wind frequency (
extreme minimum temperature.
Use the command get.MatH.main.info(China_Seas)
for rapid info.
a MatH
object, a matrix of distributions.
Antonio Irpino, 2014-10-05
raw data are available here: https://data.ess-dive.lbl.gov/view/doi:10.3334/CDIAC/CLI.TR055. Climate Data Bases of the People's Republic of China 1841-1988 (TR055) DOI: 10.3334/CDIAC/cli.tr055
compP
Compute the cdf probability at a given value for a histogram
compP(object, q)

## S4 method for signature 'distributionH,numeric'
compP(object, q)
object |
is an object of distributionH class |
q |
is a numeric value |
Returns a value between 0 and 1.
## ---- A mydist distribution ----
mydist <- distributionH(x = c(1, 2, 3, 10), p = c(0, 0.1, 0.5, 1))
## ---- Compute the cdf value for q=5 (not observed) ----
p <- compP(mydist, 5)
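To see what a cdf value at an unobserved point means, the base-R sketch below interpolates linearly between the bin bounds. It assumes a uniform density within each bin; `cdf_at` is a hypothetical helper, not the package's method.

```r
cdf_at <- function(x, p, q) {
  # below the first bound the cdf is 0, above the last bound it is 1
  if (q <= x[1]) return(0)
  if (q >= x[length(x)]) return(1)
  # inside the support: linear interpolation between consecutive bounds
  approx(x, p, xout = q)$y
}

x <- c(1, 2, 3, 10)
p <- c(0, 0.1, 0.5, 1)

p5 <- cdf_at(x, p, 5)   # 5 lies in the bin [3, 10], which carries mass 0.5
```

Here 5 sits 2/7 of the way through the last bin, so the sketch returns 0.5 + (2/7) * 0.5.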
compQ
Compute the quantile value of a histogram for a given probability.
compQ(object, p)

## S4 method for signature 'distributionH,numeric'
compQ(object, p)
object |
an object of distributionH class |
p |
a number between 0 and 1 |
A number that is the quantile of the passed histogram object at level p.
Antonio Irpino
## ---- A mydist distribution ----
mydist <- distributionH(x = c(1, 2, 3, 10), p = c(0, 0.1, 0.5, 1))
## ---- Compute the quantile of mydist for different values of p ----
y <- compQ(mydist, 0.5) # the median
y <- compQ(mydist, 0) # the minimum
y <- compQ(mydist, 1) # the maximum
y <- compQ(mydist, 0.25) # the first quartile
y <- compQ(mydist, 0.9) # the ninth decile
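The quantile of a histogram can be sketched as the inverse of the piecewise-linear cdf. The base-R snippet below is an illustration under the assumption of a uniform density within each bin and strictly increasing cdf values; `quantile_at` is a hypothetical helper, not the package's method.

```r
quantile_at <- function(x, p, level) {
  # invert the piecewise-linear cdf by swapping the roles of x and p
  approx(p, x, xout = level)$y
}

x <- c(1, 2, 3, 10)
p <- c(0, 0.1, 0.5, 1)

q_median <- quantile_at(x, p, 0.5)    # falls exactly on a bin bound
q_min    <- quantile_at(x, p, 0)      # lower bound of the first bin
q_max    <- quantile_at(x, p, 1)      # upper bound of the last bin
q_first  <- quantile_at(x, p, 0.25)   # inside the bin [2, 3]
q_ninth  <- quantile_at(x, p, 0.9)    # inside the bin [3, 10]
```

For example, the level 0.25 lies 0.15/0.4 of the way through the second bin, giving 2 + 0.375.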
crwtransform: returns the centers and the radii of bins of a distribution
Centers and radii calculation for the bins of a histogram. It is useful for very fast computation of statistics and methods based on the L2 Wasserstein distance between histograms.
crwtransform(object)

## S4 method for signature 'distributionH'
crwtransform(object)
object |
a |
A list containing
$Centers |
The midpoints of the bins of the histogram |
$Radii |
The half-lengths of the bins of the histogram |
$Weights |
The relative frequencies or the probabilities associated with each bin (the sum is equal to 1) |
Antonio Irpino
Irpino, A., Verde, R., Lechevallier, Y. (2006) Dynamic clustering of histograms using Wasserstein metric, In: Proceedings of COMPSTAT 2006, Physica-Verlag, 869-876
## ---- A mydist distribution ----
mydist <- distributionH(x = c(1, 2, 3, 10), p = c(0, 0.1, 0.5, 1))
## ---- Compute the centers, the radii and the weights of the bins ----
crwtransform(mydist)
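The three components of the returned list follow directly from the bin bounds and the cdf values. A minimal base-R sketch of the arithmetic (not the package's code) for the same histogram:

```r
x <- c(1, 2, 3, 10)        # bin bounds
p <- c(0, 0.1, 0.5, 1)     # cdf values at the bounds

lower <- x[-length(x)]     # left bound of each bin
upper <- x[-1]             # right bound of each bin

centers <- (upper + lower) / 2   # midpoints of the bins
radii   <- (upper - lower) / 2   # half-lengths of the bins
weights <- diff(p)               # probability mass of each bin
```

With centers, radii and weights available, means, variances and dot products reduce to weighted sums over the bins.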
From real data to distributionH.
data2hist(
  data,
  algo = "histogram",
  type = "combined",
  qua = 10,
  breaks = numeric(0),
  epsilon = 0.01
)
data |
a set of numeric values. |
algo |
(optional) a string. Default is "histogram", i.e. the function "histogram"
defined in the |
type |
(optional) a string. Default is "combined", which generates
a histogram having either regularly spaced breaks (i.e., equi-width bins) or
irregularly spaced ones. The choice is made according to the penalization method described in
|
qua |
a positive integer to provide if |
breaks |
a vector of values to provide if |
epsilon |
a number between 0 and 1 to provide if |
A distributionH
object, i.e. a distribution.
histogram
function
data <- rnorm(n = 1000, mean = 2, sd = 3)
mydist <- data2hist(data)
plot(mydist)
Class "distributionH" defines a histogram object
The class describes a histogram by means of its cumulative distribution function. The methods are developed according to the L2 Wasserstein distance between distributions.
A histogram object can also be created with the function distributionH(...), the constructor function for creating an object containing the description of a histogram.
## S4 method for signature 'distributionH'
initialize(
  .Object,
  x = numeric(0),
  p = numeric(0),
  m = numeric(0),
  s = numeric(0)
)

distributionH(x = numeric(0), p = numeric(0))
.Object |
the type ("distributionH") |
x |
a numeric vector. It is the domain of the distribution (i.e. the extremes of bins). |
p |
a numeric vector (of the same length as x). It is the cumulative distribution function (CDF). |
m |
(optional) a numeric value. It is the mean of the histogram. |
s |
(optional) a numeric positive value. It is the standard deviation of a histogram. |
Class distributionH
defines a histogram object
A distributionH
object
Objects can be created by calls of the form new("distributionH", x, p, m, s).
Antonio Irpino
Irpino, A., Verde, R. (2015) Basic statistics for distributional symbolic variables: a new metric-based approach Advances in Data Analysis and Classification, DOI 10.1007/s11634-014-0176-4
meanH computes the mean.
stdH computes the standard deviation.
#---- initialize a distributionH object mydist
# from a simple histogram
# ----------------------------
# | Bins  | Prob  | cdf   |
# ----------------------------
# | [1,2) | 0.4   | 0.4   |
# | [2,3] | 0.6   | 1.0   |
# ----------------------------
# | Tot.  | 1.0   | -     |
# ----------------------------
mydist <- new("distributionH", c(1, 2, 3), c(0, 0.4, 1))
str(mydist)
# OUTPUT
# Formal class 'distributionH' [package "HistDAWass"] with 4 slots
# ..@ x: num [1:3] 1 2 3 the quantiles
# ..@ p: num [1:3] 0 0.4 1 the cdf
# ..@ m: num 2.1 the mean
# ..@ s: num 0.569 the standard deviation
# or using
mydist <- distributionH(x = c(1, 2, 3), p = c(0, 0.4, 1))
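The mean and standard deviation shown in the `m` and `s` slots above can be reproduced by hand. The base-R sketch below assumes a uniform density within each bin, so a bin with center c and radius r contributes c to the mean and c^2 + r^2/3 to the second moment. It is an illustration of the arithmetic, not the package's code.

```r
x <- c(1, 2, 3)        # bin bounds
p <- c(0, 0.4, 1)      # cdf values

w  <- diff(p)                        # bin probabilities: 0.4, 0.6
ce <- (x[-1] + x[-length(x)]) / 2    # bin centers: 1.5, 2.5
r  <- (x[-1] - x[-length(x)]) / 2    # bin radii: 0.5, 0.5

m <- sum(w * ce)                     # mean of the histogram
second_moment <- sum(w * (ce^2 + r^2 / 3))
s <- sqrt(second_moment - m^2)       # standard deviation of the histogram
```

The results agree with the `str(mydist)` output in the example (2.1 and about 0.569).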
dotpW
The dot product of two distributions inducing the L2 Wasserstein metric
The dot product of a number (considered as an impulse distribution function) and a distribution
The dot product of a distribution and a number (considered as an impulse distribution function).
dotpW(e1, e2)

## S4 method for signature 'distributionH,distributionH'
dotpW(e1, e2)

## S4 method for signature 'numeric,distributionH'
dotpW(e1, e2)

## S4 method for signature 'distributionH,numeric'
dotpW(e1, e2)
e1 |
a |
e2 |
a |
A numeric value
Antonio Irpino
Irpino, A., Verde, R. (2015) Basic statistics for distributional symbolic variables: a new metric-based approach Advances in Data Analysis and Classification, DOI 10.1007/s11634-014-0176-4
## let's define two distributionH objects
mydist1 <- distributionH(x = c(1, 2, 3, 10), p = c(0, 0.1, 0.5, 1))
mydist2 <- distributionH(x = c(5, 7, 15), p = c(0, 0.7, 1))
## the dot product between the distributions
dotpW(mydist1, mydist2) #---> 39.51429
## the dot product between a distribution and a numeric
dotpW(mydist1, 3) #---> 13.2
dotpW(3, mydist1) #---> 13.2
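The values in the example can be reproduced by integrating the product of the two quantile functions over [0, 1]. The base-R sketch below is an illustration, not the package's implementation. It assumes a uniform density within each bin and strictly increasing cdf values; `quantile_at` and `dot_w` are hypothetical helpers. On each common cdf segment both quantile functions are linear, so the integral equals weight * (c1 * c2 + r1 * r2 / 3).

```r
quantile_at <- function(x, p, levels) approx(p, x, xout = levels)$y

dot_w <- function(x1, p1, x2, p2) {
  lev <- sort(unique(c(p1, p2)))      # common cdf levels
  q1 <- quantile_at(x1, p1, lev)
  q2 <- quantile_at(x2, p2, lev)
  n <- length(lev)
  w  <- diff(lev)                     # weight of each common segment
  c1 <- (q1[-1] + q1[-n]) / 2; r1 <- (q1[-1] - q1[-n]) / 2
  c2 <- (q2[-1] + q2[-n]) / 2; r2 <- (q2[-1] - q2[-n]) / 2
  sum(w * (c1 * c2 + r1 * r2 / 3))
}

x1 <- c(1, 2, 3, 10); p1 <- c(0, 0.1, 0.5, 1)
x2 <- c(5, 7, 15);    p2 <- c(0, 0.7, 1)

d12 <- dot_w(x1, p1, x2, p2)
# a number treated as an impulse: a degenerate histogram with all mass at 3
d1n <- dot_w(x1, p1, c(3, 3), c(0, 1))

# squared L2 Wasserstein distance induced by the dot product
w2 <- dot_w(x1, p1, x1, p1) + dot_w(x2, p2, x2, p2) - 2 * d12
```

In this sketch `d12` and `d1n` match the values reported in the example (39.51429 and 13.2).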
Ramer-Douglas-Peucker algorithm for curve fitting with a PolyLine
DouglasPeucker(points, epsilon)
points |
a 2D matrix with the coordinates of 2D points |
epsilon |
a number between 0 and 1. Recommended: 0.01. |
A matrix with the points of the segments of a polyline.
data2hist
function
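For readers unfamiliar with the Ramer-Douglas-Peucker idea, here is a generic base-R sketch of the recursive simplification. It is a textbook version written for illustration, with a hypothetical function name `rdp_sketch`; it is not the code of `DouglasPeucker` and may differ from it in details such as how distances are scaled.

```r
rdp_sketch <- function(points, epsilon) {
  n <- nrow(points)
  if (n < 3) return(points)            # nothing to simplify
  first <- points[1, ]
  last  <- points[n, ]
  seg   <- last - first
  seg_len <- sqrt(sum(seg^2))
  inner <- points[2:(n - 1), , drop = FALSE]
  if (seg_len == 0) {
    # degenerate chord: use the distance from the first point
    d <- sqrt((inner[, 1] - first[1])^2 + (inner[, 2] - first[2])^2)
  } else {
    # perpendicular distance of each inner point from the chord's line
    d <- abs(seg[1] * (inner[, 2] - first[2]) -
             seg[2] * (inner[, 1] - first[1])) / seg_len
  }
  k <- which.max(d)
  if (d[k] > epsilon) {
    idx <- k + 1                       # row of the farthest point
    left  <- rdp_sketch(points[1:idx, , drop = FALSE], epsilon)
    right <- rdp_sketch(points[idx:n, , drop = FALSE], epsilon)
    # drop the split point from the left part so it appears only once
    rbind(left[-nrow(left), , drop = FALSE], right)
  } else {
    points[c(1, n), , drop = FALSE]    # keep only the endpoints
  }
}

nearly_flat <- cbind(c(0, 1, 2, 3, 4), c(0, 0.001, 0, 0.002, 0))
peak        <- cbind(c(0, 1, 2), c(0, 1, 0))

flat_simplified <- rdp_sketch(nearly_flat, epsilon = 0.01)
peak_simplified <- rdp_sketch(peak, epsilon = 0.01)
```

The nearly flat polyline collapses to its two endpoints, while the peaked one keeps all three points because the middle point lies farther than `epsilon` from the chord.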
Returns the histogram data in the r-th row and the c-th column.
get.cell.MatH(object, r, c)

## S4 method for signature 'MatH,numeric,numeric'
get.cell.MatH(object, r, c)
object |
a MatH object, a matrix of distributions. |
r |
an integer, the row index. |
c |
an integer, the column index |
A distributionH
object.
get.cell.MatH(BLOOD, r = 1, c = 1)
get.distr: show the distribution
This function returns the cumulative distribution function of a distributionH object.
get.distr(object)

## S4 method for signature 'distributionH'
get.distr(object)
object |
a |
A data frame: the first column contains the domain, the second the CDF values.
D <- distributionH(x = c(1, 2, 3, 4), p = c(0, 0.2, 0.6, 1))
get.distr(D) # a data.frame describing the CDF of D
get.histo: show the distribution with bins
This function returns a data.frame describing the histogram of a distributionH object.
get.histo(object)

## S4 method for signature 'distributionH'
get.histo(object)
object |
a |
A data frame: the first two columns contain the bounds of the bins of the histogram, the third contains the probability (or the relative frequency) of each bin.
D <- distributionH(x = c(1, 2, 3, 4), p = c(0, 0.2, 0.6, 1))
get.histo(D) # returns the histogram representation of D as a data.frame
get.m: the mean of a distribution
This function returns the mean of a distributionH object.
get.m(object)

## S4 method for signature 'distributionH'
get.m(object)
object |
a |
A numeric value
D <- distributionH(x = c(1, 2, 3, 4), p = c(0, 0.2, 0.6, 1))
get.m(D) # returns the mean of D
It returns the number of rows, the number of columns, and the labels of the rows and columns of a MatH object.
get.MatH.main.info(object)

## S4 method for signature 'MatH'
get.MatH.main.info(object)
object |
a |
A list with the following elements:
nrows - the number of rows
ncols - the number of columns
rownames - a vector of char, the names of rows
varnames - a vector of char, the names of columns
It returns the number of columns of a MatH
object
get.MatH.ncols(object)

## S4 method for signature 'MatH'
get.MatH.ncols(object)
object |
a |
An integer, the number of columns.
It returns the number of rows of a MatH
object
## S4 method for signature 'MatH'
get.MatH.nrows(object)
object |
a |
An integer, the number of rows.
It returns the labels of the rows of a MatH
object
get.MatH.rownames(object)

## S4 method for signature 'MatH'
get.MatH.rownames(object)
object |
a |
A vector of char, the labels of the rows.
It returns statistics for each distribution contained in a MatH
object.
get.MatH.stats(object, ...)

## S4 method for signature 'MatH'
get.MatH.stats(object, stat = "mean", prob = 0.5)
object |
a |
... |
a set of other parameters |
stat |
(optional) a string containing the required statistic. Default='mean' |
prob |
(optional) a number between 0 and 1 for computing the value once chosen the |
A list with the following elements:
stat - the chosen statistic
prob - level of probability if stat='quantile'
MAT - a matrix of values
get.MatH.stats(BLOOD) # the means of the distributions in BLOOD dataset
get.MatH.stats(BLOOD, stat = "median") # the medians of the distributions in BLOOD dataset
get.MatH.stats(BLOOD, stat = "quantile", prob = 0.5) # the same as median
get.MatH.stats(BLOOD, stat = "min") # minima of the distributions in BLOOD dataset
get.MatH.stats(BLOOD, stat = "quantile", prob = 0) # the same as min
get.MatH.stats(BLOOD, stat = "max") # maxima of the distributions in BLOOD dataset
get.MatH.stats(BLOOD, stat = "quantile", prob = 1) # the same as max
get.MatH.stats(BLOOD, stat = "std") # standard deviations of the distributions in BLOOD dataset
get.MatH.stats(BLOOD, stat = "skewness") # skewness indices of the distributions in BLOOD dataset
get.MatH.stats(BLOOD, stat = "kurtosis") # kurtosis indices of the distributions in BLOOD dataset
get.MatH.stats(BLOOD, stat = "quantile", prob = 0.05) # the fifth percentiles of distributions in BLOOD dataset
It returns the labels of the columns, or the names of the variables, of a MatH
object
get.MatH.varnames(object)

## S4 method for signature 'MatH'
get.MatH.varnames(object)
object |
a |
A vector of char, the labels of the columns, or the names of the variables.
get.s: the standard deviation of a distribution
This function returns the standard deviation of a distributionH object.
get.s(object)

## S4 method for signature 'distributionH'
get.s(object)
object |
a |
A numeric positive value, the standard deviation.
D <- distributionH(x = c(1, 2, 3, 4), p = c(0, 0.2, 0.6, 1))
get.s(D) # returns the standard deviation of D
Class HTS
defines a histogram time series, i.e. a set of histograms observed along time
## S4 method for signature 'HTS'
initialize(.Object, epocs = 1, ListOfTimedElements = c(new("TdistributionH")))
.Object |
the object type ("HTS") a histogram time series |
epocs |
the number of histograms (one for each timepoint or period) |
ListOfTimedElements |
a vector of |
(Beta version) Extends the exponential smoothing of a time series to a histogram time series, using the L2 Wasserstein distance.
HTS.exponential.smoothing(HTS, alpha = 0.9)
HTS |
A |
alpha |
a number between 0 and 1 for exponential smoothing |
a list with the results of the smoothing procedure.
smoothing.alpha
the alpha parameter
AveragedHTS
The smoothed HTS
mov.expo.smooth <- HTS.exponential.smoothing(HTS = RetHTS, alpha = 0.8)
# a show method for HTS must be implemented; you can see it using
# str(mov.expo.smooth$AveragedHTS)
(Beta version) Extends the moving average smoothing of a time series to a histogram time series, using the L2 Wasserstein distance.
HTS.moving.averages(HTS, k = 3, weights = rep(1, k))
HTS |
A |
k |
an integer value, the number of elements for moving averages |
weights |
a vector of positive weights for a weighted moving average |
a list with the results of the smoothing procedure.
k
the number of elements for the average
weights
the vector of weights for smoothing
AveragedHTS
The smoothed HTS
mov.av.smoothed <- HTS.moving.averages(HTS = RetHTS, k = 5)
# a show method for HTS must be implemented; you can see it using
# str(mov.av.smoothed$AveragedHTS)
(Beta version) Extends the k-NN algorithm for predicting a time series to a histogram time series, using the L2 Wasserstein distance.
HTS.predict.knn(HTS, position = length(HTS@data), k = 3)
HTS |
A |
position |
an integer, the data histogram to predict |
k |
the number of neighbours (default=3) |
Histogram time series (HTS) describe situations where a distribution of values is available for each instant of time. These situations usually arise when contemporaneous or temporal aggregation is required. In these cases, histograms provide a summary of the data that is more informative than those provided by other aggregates such as the mean. Some fields where HTS are useful include economy, official statistics and environmental science. The function adapts the k-Nearest Neighbours (k-NN) algorithm to forecast HTS and, more generally, to deal with histogram data. The proposed k-NN relies on the L2 Wasserstein distance that is used to measure dissimilarities between sequences of histograms and to compute the forecasts.
a distributionH
object predicted from data.
Javier Arroyo, Carlos Mate, Forecasting histogram time series with k-nearest neighbours methods,
International Journal of Forecasting, Volume 25, Issue 1, January-March 2009, Pages 192-207,
ISSN 0169-2070, http://dx.doi.org/10.1016/j.ijforecast.2008.07.003.
prediction <- HTS.predict.knn(HTS = RetHTS, position = 108, k = 3)
Checks if a MatH
contains histograms described by the same number of
bins and the same cdf.
is.registeredMH(object)

## S4 method for signature 'MatH'
is.registeredMH(object)
object |
A |
a logical value: TRUE if the distributions share the same cdf, FALSE otherwise.
Antonio Irpino
Irpino, A., Lechevallier, Y. and Verde, R. (2006): Dynamic clustering of histograms using Wasserstein metric. In: Rizzi, A., Vichi, M. (eds.) COMPSTAT 2006. Physica-Verlag, Berlin, 869-876.
Irpino, A., Verde, R. (2006): A new Wasserstein based distance for the hierarchical clustering of histogram symbolic data. In: Batagelj, V., Bock, H.H., Ferligoj, A., Ziberna, A. (eds.) Data Science and Classification, IFCS 2006. Springer, Berlin, 185-192.
## ---- initialize three distributionH objects mydist1, mydist2 and mydist3
mydist1 <- new("distributionH", c(1, 2, 3), c(0, 0.4, 1))
mydist2 <- new("distributionH", c(7, 8, 10, 15), c(0, 0.2, 0.7, 1))
mydist3 <- new("distributionH", c(9, 11, 20), c(0, 0.8, 1))
## create a MatH object
MyMAT <- new("MatH", nrows = 1, ncols = 3, ListOfDist = c(mydist1, mydist2, mydist3), 1, 3)
is.registeredMH(MyMAT)
## [1] FALSE #the distributions do not share the same cdf
## Hint: check with str(MyMAT)
## register the distributions
MATregistered <- registerMH(MyMAT)
is.registeredMH(MATregistered)
## TRUE #the distributions share the same cdf
## Hint: check with str(MATregistered)
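Registration can be pictured as re-expressing each histogram on the union of all cdf levels, without changing the distribution itself. The base-R sketch below illustrates the idea for the first two histograms of the example. It assumes a uniform density within each bin and strictly increasing cdf values, and the helper `quantile_at` is hypothetical; it is not the code of `registerMH`.

```r
quantile_at <- function(x, p, levels) approx(p, x, xout = levels)$y

x1 <- c(1, 2, 3);      p1 <- c(0, 0.4, 1)
x2 <- c(7, 8, 10, 15); p2 <- c(0, 0.2, 0.7, 1)

# before registration the two cdf vectors differ
same_before <- identical(p1, p2)

# common cdf levels: the union of the levels of both histograms
p_common <- sort(unique(c(p1, p2)))

# new bin bounds: the quantiles of each histogram at the common levels
x1_reg <- quantile_at(x1, p1, p_common)
x2_reg <- quantile_at(x2, p2, p_common)
```

After this step both histograms have the same number of bins and the same cdf values (`p_common`), which is the condition described for `is.registeredMH`.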
kurtH: computes the kurtosis of a distribution
Kurtosis of a histogram (using the fourth standardized moment).
kurtH(object)
## S4 method for signature 'distributionH'
kurtH(object)
object |
a distributionH object |
A value for the kurtosis index; 3 is the kurtosis of a Gaussian distribution.
Antonio Irpino
## ---- A mydist distribution ----
mydist <- distributionH(x = c(1, 2, 3, 10), p = c(0, 0.1, 0.5, 1))
## ---- Compute the kurtosis of mydist ----
kurtH(mydist) #---> 1.473242
Class MatH defines a matrix of distributionH objects.
This function creates a matrix of histogram data, i.e., a MatH object.
## S4 method for signature 'MatH'
initialize(
  .Object,
  nrows = 1,
  ncols = 1,
  ListOfDist = NULL,
  names.rows = NULL,
  names.cols = NULL,
  by.row = FALSE
)
MatH(
  x = NULL,
  nrows = 1,
  ncols = 1,
  rownames = NULL,
  varnames = NULL,
  by.row = FALSE
)
.Object |
the object type "MatH" |
nrows |
(optional, default=1) an integer, the number of rows. |
ncols |
(optional, default=1) an integer, the number of columns (aka variables). |
ListOfDist |
a vector or a list of distributionH objects |
names.rows |
a vector or list of strings with the names of the rows |
names.cols |
a vector or list of strings with the names of the columns (variables) |
by.row |
(optional, default=FALSE) a logical value: if TRUE, the matrix is filled row-wise; if FALSE, it is filled column-wise. |
x |
(optional, default= an empty |
rownames |
(optional, default=NULL) a list of strings containing the names of the rows. |
varnames |
(optional, default=NULL) a list of strings containing the names of the columns (aka variables). |
A MatH object
Antonio Irpino
Irpino, A., Verde, R. (2015): Basic statistics for distributional symbolic variables: a new metric-based approach. Advances in Data Analysis and Classification, DOI 10.1007/s11634-014-0176-4
## ---- create a list of six distributionH objects
ListOfDist <- vector("list", 6)
ListOfDist[[1]] <- distributionH(c(1, 2, 3), c(0, 0.4, 1))
ListOfDist[[2]] <- distributionH(c(7, 8, 10, 15), c(0, 0.2, 0.7, 1))
ListOfDist[[3]] <- distributionH(c(9, 11, 20), c(0, 0.5, 1))
ListOfDist[[4]] <- distributionH(c(2, 5, 8), c(0, 0.3, 1))
ListOfDist[[5]] <- distributionH(c(8, 10, 15), c(0, 0.75, 1))
ListOfDist[[6]] <- distributionH(c(20, 22, 24), c(0, 0.12, 1))
## create a MatH object filling it by columns
MyMAT <- new("MatH",
  nrows = 3, ncols = 2, ListOfDist = ListOfDist,
  names.rows = c("I1", "I2", "I3"), names.cols = c("Var1", "Var2"), by.row = FALSE
)
showClass("MatH")
# building an empty 10 by 4 matrix of histograms
MAT <- MatH(nrows = 10, ncols = 4)
meanH: computes the mean of a distribution
Mean of a histogram (first moment of the distribution).
meanH(object)
## S4 method for signature 'distributionH'
meanH(object)
object |
a distributionH object |
the mean of the distribution
Antonio Irpino
## ---- A mydist distribution ----
mydist <- distributionH(x = c(1, 2, 3, 10), p = c(0, 0.1, 0.5, 1))
## ---- Compute the mean of mydist ----
meanH(mydist) #---> 4.4
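The value reported above can be cross-checked by hand in base R. This is a minimal sketch, not the package's implementation: it assumes that values are uniformly distributed within each bin, and `hist_mean` is a hypothetical helper name introduced only for illustration.

```r
# Hypothetical helper (not part of HistDAWass): mean of a histogram given
# its bin bounds x and cumulative probabilities p, assuming a uniform
# density inside each bin.
hist_mean <- function(x, p) {
  w <- diff(p)                          # probability mass of each bin
  mid <- (x[-length(x)] + x[-1]) / 2    # bin midpoints
  sum(w * mid)                          # weighted average of the midpoints
}

m <- hist_mean(x = c(1, 2, 3, 10), p = c(0, 0.1, 0.5, 1))
print(m)  # 4.4
```

Under this assumption the result agrees with the value shown in the example for `meanH`.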
The difference of two distributions according to the L2 Wasserstein metric
The difference of a number and a distribution according to the L2 Wasserstein metric
The difference of a distribution and a number according to the L2 Wasserstein metric
## S4 method for signature 'distributionH,distributionH'
e1 - e2
## S4 method for signature 'numeric,distributionH'
e1 - e2
## S4 method for signature 'distributionH,numeric'
e1 - e2
e1 |
a distributionH object or a number |
e2 |
a distributionH object or a number |
It may not work properly if the difference is not a distribution.
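The caveat above can be illustrated in base R. This sketch assumes that the difference is taken between quantile functions evaluated on a common set of cdf levels, with piecewise-linear quantile functions; `quantile_at` is a hypothetical helper, not a package function. Under that assumption, a difference is a valid distribution only if the resulting quantile values are nondecreasing.

```r
# Hypothetical helper: piecewise-linear quantile function of a histogram
# with bin bounds x and cumulative probabilities p.
quantile_at <- function(x, p, levels) approx(x = p, y = x, xout = levels)$y

x1 <- c(1, 2, 3)
p1 <- c(0, 0.4, 1)
x2 <- c(7, 8, 10, 15)
p2 <- c(0, 0.2, 0.7, 1)
levels <- sort(unique(c(p1, p2)))  # common cdf levels

q1 <- quantile_at(x1, p1, levels)
q2 <- quantile_at(x2, p2, levels)

d21 <- q2 - q1  # 6.0 6.5 6.8 7.5 12.0
d12 <- q1 - q2  # same values with opposite sign

valid21 <- !is.unsorted(d21)  # TRUE: nondecreasing, a valid quantile function
valid12 <- !is.unsorted(d12)  # FALSE: decreasing, not a distribution
```

In this example only the second histogram minus the first yields nondecreasing quantiles; the reverse difference does not describe a distribution.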
The dataset contains a MatH (matrix of histogram-valued data) object. This data set lists 78 stations located in the USA recording four variables, without missing data.
a MatH instance, 1 row per station.
Antonio Irpino, 2014-10-05
http://java.epa.gov/castnet/epa_jsp/prepackageddata.jsp ftp://ftp.epa.gov/castnet/data/metdata.zip
The dataset contains a MatH (matrix of histogram-valued data) object. This data set lists 84 stations located in the USA recording four variables. Some stations contain missing data.
a MatH instance, 1 row per station.
Antonio Irpino, 2014-10-05
http://java.epa.gov/castnet/epa_jsp/prepackageddata.jsp ftp://ftp.epa.gov/castnet/data/metdata.zip
This function plots the differences between observed histograms and the respective predicted ones. It can be used as a tool for interpreting predictive methods (for example, the regression of histogram data).
plot_errors(PRED, OBS, type = "HISTO_QUA", np = 200)
PRED |
a |
OBS |
a |
type |
a string. "HISTO_QUA" (default), if one wants to compare histogram quantile differences |
np |
number of points considered for density or quantile computation (default=200). |
A plot with functions of differences between observed and predicted histograms, and a Root Mean Squared value computed using the L2 Wasserstein distance.
## do a regression
pars <- WH.regression.two.components(BLOOD, Yvar = 1, Xvars = c(2:3))
## predict data
PRED <- WH.regression.two.components.predict(data = BLOOD[, 2:3], parameters = pars)
## define observed data
OBS <- BLOOD[, 1]
plot_errors(PRED, OBS, "HISTO_QUA")
plot_errors(PRED, OBS, "HISTO_DEN")
plot_errors(PRED, OBS, "DENS_KDE")
A plot function for a distributionH object. The function returns a representation of the histogram.
## S4 method for signature 'distributionH'
plot(x, type = "HISTO", col = "green", border = "black")
x |
a distributionH object |
type |
(optional) a string describing the type of plot, default="HISTO". |
col |
(optional) a string, the color of the plot, default="green". |
border |
(optional) a string, the color of the border of the plot, default="black". |
## ---- initialize a distributionH
mydist <- distributionH(x = c(7, 8, 10, 15), p = c(0, 0.2, 0.7, 1))
# show the histogram
plot(mydist) # plots mydist
plot(mydist, type = "HISTO", col = "red", border = "blue") # plots mydist
plot(mydist, type = "DENS", col = "red", border = "blue") # plots a density approximation for mydist
plot(mydist, type = "HBOXPLOT") # plots a horizontal boxplot for mydist
plot(mydist, type = "VBOXPLOT") # plots a vertical boxplot for mydist
plot(mydist, type = "CDF") # plots the cumulative distribution function of mydist
plot(mydist, type = "QF") # plots the quantile function of mydist
An overloading plot function for a HTS object. The method returns a graphical representation of a histogram time series.
## S4 method for signature 'HTS'
plot(x, y = "missing", type = "VIOLIN", border = "black", maxno.perplot = 30)
x |
a HTS object |
y |
not used in this implementation |
type |
(optional) a string describing the type of plot, default="VIOLIN". |
border |
(optional) a string, the color of the border of the plot, default="black". |
maxno.perplot |
An integer (DEFAULT=30). Maximum number of timestamps per row. It allows a plot organized by rows, each row of the plot contains a max number of time stamps indicated by maxno.perplot. |
plot(subsetHTS(RetHTS, from = 1, to = 10)) # plots RetHTS dataset
## Not run:
plot(RetHTS, type = "BOXPLOT", border = "blue", maxno.perplot = 20)
plot(RetHTS, type = "VIOLIN", border = "blue", maxno.perplot = 20)
plot(RetHTS, type = "VIOLIN", border = "blue", maxno.perplot = 10)
## End(Not run)
An overloading plot function for a MatH object. The method returns a graphical representation of the matrix of histograms.
## S4 method for signature 'MatH'
plot(x, y = "missing", type = "HISTO", border = "black", angL = 330)
x |
a MatH object |
y |
not used in this implementation |
type |
(optional) a string describing the type of plot, default="HISTO". |
border |
(optional) a string, the color of the border of the plot, default="black". |
angL |
(optional) angle of labels of rows (DEFAULT=330). |
plot(BLOOD) # plots BLOOD dataset
## Not run:
plot(BLOOD, type = "HISTO", border = "blue") # plots a matrix of histograms
plot(BLOOD, type = "DENS", border = "blue") # plots a matrix of densities
plot(BLOOD, type = "BOXPLOT") # plots boxplots
## End(Not run)
A plot function for a TdistributionH object. The function returns a representation of the histogram.
## S4 method for signature 'TdistributionH'
plot(x, type = "HISTO", col = "green", border = "black")
x |
a TdistributionH object |
type |
(optional) a string describing the type of plot, default="HISTO". |
col |
(optional) a string, the color of the plot, default="green". |
border |
(optional) a string, the color of the border of the plot, default="black". |
This function plots observed vs predicted histograms. It can be used as a tool for interpreting predictive methods (for example, the regression of histogram data).
plotPredVsObs(PRED, OBS, type = "HISTO", ncolu = 2)
PRED |
a |
OBS |
a |
type |
a string. "HISTO" (default), if one wants to compare histograms |
ncolu |
number of columns in which the plot is arranged (default is 2). If you have a lot of data, consider choosing higher values. |
A plot with compared histogram-valued data.
## do a regression
pars <- WH.regression.two.components(BLOOD, Yvar = 1, Xvars = c(2:3))
## predict data
PRED <- WH.regression.two.components.predict(data = BLOOD[, 2:3], parameters = pars)
## define observed data
## Not run:
OBS <- BLOOD[, 1]
plotPredVsObs(PRED, OBS, "HISTO")
plotPredVsObs(PRED, OBS, "CDF")
plotPredVsObs(PRED, OBS, "DENS")
## End(Not run)
register
Given two distributionH objects, it returns two equivalent distributions such that they share the same cdf values. This function is useful for computing basic statistics.
register(object1, object2)
## S4 method for signature 'distributionH,distributionH'
register(object1, object2)
object1 |
A distributionH object |
object2 |
A distributionH object |
The two input distributionH objects, sharing the same cdf (the p slot).
Antonio Irpino
Irpino, A., Lechevallier, Y. and Verde, R. (2006): Dynamic clustering of histograms using Wasserstein metric. In: Rizzi, A., Vichi, M. (eds.) COMPSTAT 2006. Physica-Verlag, Berlin, 869-876.
Irpino, A., Verde, R. (2006): A new Wasserstein based distance for the hierarchical clustering of histogram symbolic data. In: Batagelj, V., Bock, H.H., Ferligoj, A., Ziberna, A. (eds.) Data Science and Classification, IFCS 2006. Springer, Berlin, 185-192.
## ---- initialize two distributionH objects mydist1 and mydist2
mydist1 <- distributionH(c(1, 2, 3), c(0, 0.4, 1))
mydist2 <- distributionH(c(7, 8, 10, 15), c(0, 0.2, 0.7, 1))
## register the two distributions
regDist <- register(mydist1, mydist2)
## OUTPUT:
## regDist$[[1]]
## An object of class "distributionH"
## Slot "x": [1] 1.0 1.5 2.0 2.5 3.0
## Slot "p": [1] 0.0 0.2 0.4 0.7 1.0
## ...
## regDist$[[2]]
## An object of class "distributionH"
## Slot "x": [1] 7.0 8.0 8.8 10.0 15.0
## Slot "p": [1] 0.0 0.2 0.4 0.7 1.0
## ...
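The registration of these two histograms can be reproduced in base R. This is a minimal sketch, assuming piecewise-linear quantile functions (uniform density within bins); `register_sketch` is a hypothetical name, and the package's own implementation may differ.

```r
# Hypothetical re-implementation of the registration idea: evaluate both
# quantile functions on the union of the two cdf grids.
register_sketch <- function(x1, p1, x2, p2) {
  p <- sort(unique(c(p1, p2)))                 # union of the cdf levels
  list(
    p  = p,
    x1 = approx(x = p1, y = x1, xout = p)$y,   # quantiles of the first histogram
    x2 = approx(x = p2, y = x2, xout = p)$y    # quantiles of the second histogram
  )
}

reg <- register_sketch(
  x1 = c(1, 2, 3), p1 = c(0, 0.4, 1),
  x2 = c(7, 8, 10, 15), p2 = c(0, 0.2, 0.7, 1)
)
print(reg)
```

The resulting `p`, `x1` and `x2` agree with the slots listed in the output of the example for `register`.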
The registerMH method registers a set of distributions of a MatH object.
All the distributions are recomputed to obtain distributions sharing the same p slot. This method is useful for the fast computation of all methods based on the L2 Wasserstein metric. The distributions will have the same number of elements in the x slot without modifying their density function.
registerMH(object)
## S4 method for signature 'MatH'
registerMH(object)
object |
A MatH object |
A MatH object, a matrix of distributions sharing the same p slot (i.e. the same cdf).
Antonio Irpino
Irpino, A., Lechevallier, Y. and Verde, R. (2006): Dynamic clustering of histograms using Wasserstein metric. In: Rizzi, A., Vichi, M. (eds.) COMPSTAT 2006. Physica-Verlag, Berlin, 869-876.
Irpino, A., Verde, R. (2006): A new Wasserstein based distance for the hierarchical clustering of histogram symbolic data. In: Batagelj, V., Bock, H.H., Ferligoj, A., Ziberna, A. (eds.) Data Science and Classification, IFCS 2006. Springer, Berlin, 185-192.
# initialize three distributionH objects: mydist1, mydist2 and mydist3
mydist1 <- new("distributionH", c(1, 2, 3), c(0, 0.4, 1))
mydist2 <- new("distributionH", c(7, 8, 10, 15), c(0, 0.2, 0.7, 1))
mydist3 <- new("distributionH", c(9, 11, 20), c(0, 0.8, 1))
# create a MatH object
MyMAT <- new("MatH", nrows = 1, ncols = 3, ListOfDist = c(mydist1, mydist2, mydist3), 1, 3)
# register the distributions
MATregistered <- registerMH(MyMAT)
#
# OUTPUT the structure of MATregistered
str(MATregistered)
# Formal class 'MatH' [package "HistDAWass"] with 1 slots
# .. @ M:List of 3
# .. ..$ :Formal class 'distributionH' [package "HistDAWass"] with 4 slots
# .. .. .. ..@ x: num [1:6] 1 1.5 2 2.5 2.67 ...
# .. .. .. ..@ p: num [1:6] 0 0.2 0.4 0.7 0.8 1
# ...
# .. ..$ :Formal class 'distributionH' [package "HistDAWass"] with 4 slots
# .. .. .. ..@ x: num [1:6] 7 8 8.8 10 11.7 ...
# .. .. .. ..@ p: num [1:6] 0 0.2 0.4 0.7 0.8 1
# ...
# .. ..$ :Formal class 'distributionH' [package "HistDAWass"] with 4 slots
# .. .. .. ..@ x: num [1:6] 9 9.5 10 10.8 11 ...
# .. .. .. ..@ p: num [1:6] 0 0.2 0.4 0.7 0.8 1
# ...
# .. ..- attr(*, "dim")= int [1:2] 1 3
# .. ..- attr(*, "dimnames")=List of 2
# .. .. ..$ : chr "I1"
# .. .. ..$ : chr [1:3] "X1" "X2" "X3"
#
A histogram-valued dataset of returns of dollar vs yen change rates
a MatH object, a matrix of distributions.
Antonio Irpino, 2014-10-05
rQQ
Quantile-Quantile correlation between two distributions
rQQ(e1, e2)
## S4 method for signature 'distributionH,distributionH'
rQQ(e1, e2)
e1 |
A distributionH object |
e2 |
A distributionH object |
Pearson correlation index between quantiles
Antonio Irpino
Irpino, A., Verde, R. (2015): Basic statistics for distributional symbolic variables: a new metric-based approach. Advances in Data Analysis and Classification, DOI 10.1007/s11634-014-0176-4
## ---- initialize two distributionH objects mydist1 and mydist2
mydist1 <- distributionH(x = c(1, 2, 3), p = c(0, 0.4, 1))
mydist2 <- distributionH(x = c(7, 8, 10, 15), p = c(0, 0.2, 0.7, 1))
## computes the rQQ
rQQ(mydist1, mydist2)
## OUTPUT 0.916894
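The documented output can be cross-checked in base R by computing the Pearson correlation between the two quantile functions directly. This is a sketch under the assumption of piecewise-linear quantile functions; `int_prod` is a hypothetical helper and not part of the package.

```r
# Hypothetical helper: exact integral over [0, 1] of the product of two
# piecewise-linear functions with values a and b at the common knots p.
int_prod <- function(a, b, p) {
  n <- length(p)
  w <- diff(p)
  a0 <- a[-n]; a1 <- a[-1]
  b0 <- b[-n]; b1 <- b[-1]
  sum(w * (2 * a0 * b0 + 2 * a1 * b1 + a0 * b1 + a1 * b0) / 6)
}

p   <- c(0, 0.2, 0.4, 0.7, 1)                               # common cdf levels
q1  <- approx(c(0, 0.4, 1), c(1, 2, 3), xout = p)$y         # quantiles of mydist1
q2  <- approx(c(0, 0.2, 0.7, 1), c(7, 8, 10, 15), xout = p)$y  # quantiles of mydist2
one <- rep(1, length(p))

m1 <- int_prod(q1, one, p)                 # means
m2 <- int_prod(q2, one, p)
s1 <- sqrt(int_prod(q1, q1, p) - m1^2)     # standard deviations
s2 <- sqrt(int_prod(q2, q2, p) - m2^2)
rho <- (int_prod(q1, q2, p) - m1 * m2) / (s1 * s2)
print(round(rho, 6))  # 0.916894
```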
Assigns a histogram to the r-th row and the c-th column of a matrix of histograms.
set.cell.MatH(object, mat, r, c)
## S4 method for signature 'distributionH,MatH,numeric,numeric'
set.cell.MatH(object, mat, r, c)
object |
a distributionH object, the histogram to assign. |
mat |
a MatH object, a matrix of distributions. |
r |
an integer, the row index. |
c |
an integer, the column index |
A MatH object.
mydist <- distributionH(x = c(0, 1, 2, 3, 4), p = c(0, 0.1, 0.6, 0.9, 1))
MAT <- set.cell.MatH(mydist, BLOOD, r = 1, c = 1)
Shortest distance from a point to a 2D segment
ShortestDistance(p, line)
p |
coordinates of a point |
line |
a 2x2 matrix with the coordinates of two points defining a line |
A numeric value, the Euclidean distance of point p to the line.
data2hist function and DouglasPeucker function
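The geometry behind this computation can be sketched in base R. The function below is a hypothetical re-implementation for illustration only: it assumes that the rows of the 2x2 matrix are the two endpoints, and that the distance is measured to the segment (the projection is clamped to the endpoints) rather than to the infinite line. The package function may differ on both points.

```r
# Hypothetical helper: distance from point p to the segment whose
# endpoints are the rows of the 2x2 matrix `line` (assumed layout).
point_segment_distance <- function(p, line) {
  a <- line[1, ]
  b <- line[2, ]
  ab <- b - a
  len2 <- sum(ab^2)
  if (len2 == 0) {
    return(sqrt(sum((p - a)^2)))      # degenerate segment: a single point
  }
  t <- sum((p - a) * ab) / len2       # projection parameter along the segment
  t <- max(0, min(1, t))              # clamp the projection to the segment
  sqrt(sum((p - (a + t * ab))^2))
}

seg <- matrix(c(-1, 0, 1, 0), nrow = 2, byrow = TRUE)
d_inside  <- point_segment_distance(c(0, 1), seg)  # 1: projection falls inside
d_outside <- point_segment_distance(c(3, 0), seg)  # 2: nearest point is an endpoint
```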
An overriding show function for a distributionH object. The function returns a representation of the histogram; if the number of bins is high, the central part of the histogram is truncated.
## S4 method for signature 'distributionH'
show(object)
object |
a distributionH object |
## ---- initialize a distributionH
mydist <- distributionH(x = c(7, 8, 10, 15), p = c(0, 0.2, 0.7, 1))
# show the histogram
mydist
An overriding show method for a MatH object. The method returns a representation of the matrix using the mean and the standard deviation for each histogram.
## S4 method for signature 'MatH'
show(object)
object |
a MatH object |
show(BLOOD)
print(BLOOD)
BLOOD
skewH: computes the skewness of a distribution
Skewness of a histogram (using the third standardized moment).
skewH(object)
## S4 method for signature 'distributionH'
skewH(object)
object |
a distributionH object |
A value for the skewness index
Antonio Irpino
## ---- A mydist distribution ----
mydist <- distributionH(x = c(1, 2, 3, 10), p = c(0, 0.1, 0.5, 1))
## ---- Compute the skewness of mydist ----
skewH(mydist) #---> -1.186017
A dataset containing the geographical coordinates of stations described in China_Month and China_Seas datasets
a data.frame
Antonio Irpino, 2014-10-05
raw data are available here: https://data.ess-dive.lbl.gov/view/doi:10.3334/CDIAC/CLI.TR055. Climate Data Bases of the People's Republic of China 1841-1988 (TR055) DOI: 10.3334/CDIAC/cli.tr055
stdH: computes the standard deviation of a distribution
Standard deviation of a histogram (i.e., the square root of the centered second moment).
stdH(object)
## S4 method for signature 'distributionH'
stdH(object)
object |
a distributionH object |
A value for the standard deviation
Antonio Irpino
## ---- A mydist distribution ----
mydist <- distributionH(x = c(1, 2, 3, 10), p = c(0, 0.1, 0.5, 1))
## ---- Compute the standard deviation of mydist ----
stdH(mydist) #---> 2.563851
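The value above can be verified by hand in base R from the first two moments of the histogram. A minimal sketch, assuming a uniform density within each bin; `hist_std` is a hypothetical helper, not the package's implementation.

```r
# Hypothetical helper: standard deviation of a histogram with bin bounds x
# and cumulative probabilities p, assuming uniform density inside each bin.
hist_std <- function(x, p) {
  w <- diff(p)                              # probability mass of each bin
  a <- x[-length(x)]                        # lower bin bounds
  b <- x[-1]                                # upper bin bounds
  m  <- sum(w * (a + b) / 2)                # first moment
  m2 <- sum(w * (a^2 + a * b + b^2) / 3)    # second raw moment
  sqrt(m2 - m^2)                            # root of the centered second moment
}

s <- hist_std(x = c(1, 2, 3, 10), p = c(0, 0.1, 0.5, 1))
print(round(s, 6))  # 2.563851
```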
subsetHTS: extract a subset of a histogram time series
This function extracts a subset of a histogram time series, i.e., of a HTS object.
subsetHTS(object, from, to)
## S4 method for signature 'HTS,numeric,numeric'
subsetHTS(object, from, to)
object |
a HTS object |
from |
an integer, the initial timepoint |
to |
an integer, the final timepoint |
a HTS object, i.e., a 1D histogram time series.
SUB_RetHTS <- subsetHTS(RetHTS, from = 1, to = 20) # the first 20 elements
A summarizer for HTS
summaryHTS(x)
x |
a HTS |
A matrix with basic statistics.
summaryHTS(subsetHTS(RetHTS, from = 1, to = 10))
Class TdistributionH defines a histogram with a time (point or period).
## S4 method for signature 'TdistributionH'
initialize(
  .Object,
  tstamp = numeric(0),
  period = list(start = -Inf, end = -Inf),
  x = numeric(0),
  p = numeric(0),
  m = numeric(0),
  s = numeric(0)
)
.Object |
the type of object ("TdistributionH") a |
tstamp |
a numeric value related to a timestamp |
period |
a list of two values, the starting time and the ending time (alternative to tstamp if the distribution is observed along a period and not on a timestamp) |
x |
a vector of increasing values, the domain of the distribution (the same of |
p |
a vector of increasing values from 0 to 1, the CDF of the distribution (the same of |
m |
a number, the mean of the distribution (the same of |
s |
a positive number, the standard deviation of the distribution (the same of |
Class TMatH defines a matrix of histograms, a TMatH object, with a time (a timepoint or a time window).
## S4 method for signature 'TMatH'
initialize(
  .Object,
  tstamp = numeric(0),
  period = list(start = -Inf, end = -Inf),
  mat = new("MatH")
)
.Object |
the type of object ("TMatH") |
tstamp |
a vector of time stamps, numeric. |
period |
a list of pairs with a vector of starting times and a vector of ending times. This parameter is used alternatively to |
mat |
a MatH object |
WassSqDistH
Computes the squared L2 Wasserstein distance between two distributionH objects.
WassSqDistH(object1, object2, ...)
## S4 method for signature 'distributionH,distributionH'
WassSqDistH(object1 = object1, object2 = object2, details = FALSE)
object1 |
is an object of distributionH class |
object2 |
is an object of distributionH class |
... |
optional parameters |
details |
(optional, default=FALSE) is a logical value, if TRUE returns the decomposition of the distance |
If details=FALSE, the function returns the squared L2 Wasserstein distance.
If details=TRUE, the function returns a list containing the squared distance, its decomposition in three parts (position, size and shape) and the correlation coefficient between the quantile functions.
Irpino, A. and Romano, E. (2007): Optimal histogram representation of large data sets: Fisher vs piecewise linear approximations. RNTI E-9, 99-110.
Irpino, A., Verde, R. (2015): Basic statistics for distributional symbolic variables: a new metric-based approach. Advances in Data Analysis and Classification, DOI 10.1007/s11634-014-0176-4
## ---- create two distributionH objects ----
mydist1 <- distributionH(x = c(1, 2, 3), p = c(0, 0.4, 1))
mydist2 <- distributionH(x = c(7, 8, 10, 15), p = c(0, 0.2, 0.7, 1))
# -- compute the squared L2 Wasserstein distance
WassSqDistH(mydist1, mydist2)
# -- compute the squared L2 Wasserstein distance with details
WassSqDistH(mydist1, mydist2, details = TRUE)
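The squared distance and its three-part decomposition can be sketched in base R for the two histograms above. This assumes piecewise-linear quantile functions and the usual decomposition of the squared L2 Wasserstein distance into (m1 - m2)^2, (s1 - s2)^2 and 2 s1 s2 (1 - rho); `int_prod` is a hypothetical helper, and whether the package returns exactly these components under these names is an assumption.

```r
# Hypothetical helper: exact integral over [0, 1] of the product of two
# piecewise-linear functions with values a and b at the common knots p.
int_prod <- function(a, b, p) {
  n <- length(p)
  w <- diff(p)
  a0 <- a[-n]; a1 <- a[-1]
  b0 <- b[-n]; b1 <- b[-1]
  sum(w * (2 * a0 * b0 + 2 * a1 * b1 + a0 * b1 + a1 * b0) / 6)
}

p   <- c(0, 0.2, 0.4, 0.7, 1)     # common cdf levels after registration
q1  <- c(1, 1.5, 2, 2.5, 3)       # registered quantiles of mydist1
q2  <- c(7, 8, 8.8, 10, 15)       # registered quantiles of mydist2
one <- rep(1, length(p))

# squared L2 distance between the two quantile functions
d2 <- int_prod(q1 - q2, q1 - q2, p)

# components of the decomposition
m1 <- int_prod(q1, one, p)
m2 <- int_prod(q2, one, p)
s1 <- sqrt(int_prod(q1, q1, p) - m1^2)
s2 <- sqrt(int_prod(q2, q2, p) - m2^2)
rho <- (int_prod(q1, q2, p) - m1 * m2) / (s1 * s2)

position <- (m1 - m2)^2
size     <- (s1 - s2)^2
shape    <- 2 * s1 * s2 * (1 - rho)
print(c(d2 = d2, position = position, size = size, shape = shape))
```

In this sketch the three components add up to the squared distance, which is dominated by the difference in position (the means).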
The function implements a Batch Kohonen self-organizing 2d maps algorithm for histogram-valued data.
WH_2d_Adaptive_Kohonen_maps(
  x,
  net = list(xdim = 4, ydim = 3, topo = c("rectangular")),
  kern.param = 2,
  TMAX = -9999,
  Tmin = -9999,
  niter = 30,
  repetitions,
  simplify = FALSE,
  qua = 10,
  standardize = FALSE,
  schema = 6,
  init.weights = "EQUAL",
  weight.sys = "PROD",
  theta = 2,
  Wfix = FALSE,
  verbose = FALSE,
  atleast = 2
)
x |
A MatH object (a matrix of distributionH). |
net |
a list describing the topology of the net |
kern.param |
(default =2) the kernel parameter for the RBF kernel used in the algorithm |
TMAX |
a parameter useful for the iterations (default=2) |
Tmin |
a parameter useful for the iterations (default=0.2) |
niter |
maximum number of iterations (default=30) |
repetitions |
number of repetitions of the algorithm (default=5), because each launch may generate a local optimum |
simplify |
a logical parameter for speeding up computations (default=FALSE). If true data are recoded in order to have fast computations |
qua |
if |
standardize |
A logical value (default is FALSE). If TRUE, histogram-valued data are standardized, variable by variable, using the Wasserstein-based standard deviation. Use it if one wants variables with a standard deviation equal to one. |
schema |
a number from 1 to 4 |
init.weights |
a string how to initialize weights: 'EQUAL' (default), all weights are the same, |
weight.sys |
a string. Weights may add to one ('SUM') or their product is equal to 1 ('PROD', default). |
theta |
a number. A parameter if |
Wfix |
a logical parameter (default=FALSE). If TRUE the algorithm does not use adaptive distances. |
verbose |
a logical parameter (default=FALSE). If TRUE, details of the computation are shown during the execution. |
atleast |
integer (default 2). Checks for degeneration of the map into a very low number of Voronoi sets. A value of 2 means that the map will have at least 2 neurons attracting data instances in their Voronoi sets. |
An extension of the Batch Self-Organised Map (BSOM) is proposed here for histogram data. This kind of data has been defined in the context of symbolic data analysis. The BSOM cost function is based on a distance function: the L2 Wasserstein distance. This distance has been widely proposed in several techniques of analysis (clustering, regression) when input data are expressed by distributions (empirical, by histograms, or theoretical, by probability distributions). The peculiarity of this distance is that it is a Euclidean distance between quantile functions, so all the properties proved for L2 distances still hold. An adaptive version of BSOM is also introduced, which uses an automatic system of weights in the cost function in order to take into account the different effects of the variables in the Self-Organised Map grid.
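The two constraints that the weight.sys argument refers to ('SUM': weights add to one; 'PROD': the product of the weights equals one) can be illustrated with a few lines of base R. This is only a sketch of the constraints on an arbitrary vector of positive weights; it is not the weight update performed by the algorithm, and the variable names are hypothetical.

```r
# Arbitrary positive relevance weights, for illustration only
w <- c(2, 0.5, 4)

# 'SUM' constraint: rescale so that the weights add up to one
w_sum <- w / sum(w)

# 'PROD' constraint: divide by the geometric mean so that the product is one
w_prod <- w / prod(w)^(1 / length(w))

print(sum(w_sum))    # 1
print(prod(w_prod))  # 1 (up to floating-point error)
```

Both rescalings preserve the ratios between the weights; only the normalization differs.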
a list with the results of the Batch Kohonen map
solution
A list. Returns the best solution among the repetitions, i.e., the one having the minimum sum of squares criterion.
solution$MAP
The map topology.
solution$IDX
A vector. The clusters at which the objects are assigned.
solution$cardinality
A vector. The cardinality of each final cluster.
solution$proto
A MatH object with the description of centers.
solution$Crit
A number. The criterion (sum of squared deviations from the centers) value at the end of the run.
solution$Weights.comp
the final weights assigned to each component of the histogram variables
solution$Weight.sys
a string, the type of weighting system ('SUM' or 'PRODUCT')
quality
A number. The percentage of Sum of square deviation explained by the model. (The higher the better)
Irpino A, Verde R, De Carvalho FAT (2012). Batch self organizing maps for interval and histogram data. In: Proceedings of COMPSTAT 2012. p. 143-154, ISI/IASC, ISBN: 978-90-73592-32-2
## Not run:
results <- WH_2d_Adaptive_Kohonen_maps(
  x = BLOOD,
  net = list(xdim = 2, ydim = 3, topo = c("rectangular")),
  repetitions = 2, simplify = TRUE,
  qua = 10, standardize = TRUE
)
## End(Not run)
The function implements a Batch Kohonen self-organizing 2d maps algorithm for histogram-valued data.
WH_2d_Kohonen_maps( x, net = list(xdim = 4, ydim = 3, topo = c("rectangular")), kern.param = 2, TMAX = 2, Tmin = 0.2, niter = 30, repetitions = 5, simplify = FALSE, qua = 10, standardize = FALSE, verbose = FALSE )
x |
A MatH object (a matrix of distributionH). |
net |
a list describing the topology of the net |
kern.param |
(default =2) the kernel parameter for the RBF kernel used in the algorithm |
TMAX |
a parameter useful for the iterations (default=2) |
Tmin |
a parameter useful for the iterations (default=0.2) |
niter |
maximum number of iterations (default=30) |
repetitions |
number of repetitions of the algorithm (default=5), because each launch may converge to a local optimum |
simplify |
a logical parameter for speeding up computations (default=FALSE). If TRUE, data are recoded in order to speed up computations |
qua |
if |
standardize |
A logical value (default is FALSE). If TRUE, histogram-valued data are standardized, variable by variable, using the Wasserstein-based standard deviation. Use it if you want variables with standard deviation equal to one. |
verbose |
a logical parameter (default=FALSE). If TRUE details of computation are shown during the execution. |
An extension of the Batch Self-Organised Map (BSOM) is proposed here for histogram data. This kind of data has been defined in the context of symbolic data analysis. The BSOM cost function is based on a distance function: the L2 Wasserstein distance. This distance has been widely used in several analysis techniques (clustering, regression) when input data are expressed by distributions (empirical, by histograms, or theoretical, by probability distributions). The peculiarity of this distance is that it is a Euclidean distance between quantile functions, so all the properties proved for L2 distances still hold. An adaptive version of BSOM is also introduced, which uses an automatic system of weights in the cost function to take into account the different effects of the variables on the Self-Organised Map grid.
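The arguments TMAX, Tmin and niter are described only briefly in the argument table. In many batch SOM formulations they define a temperature that decreases geometrically from TMAX to Tmin, and the temperature sets the width of a Gaussian (RBF) neighbourhood kernel on the map grid. The base R sketch below illustrates that common scheme only; whether WH_2d_Kohonen_maps uses exactly these formulas is an assumption, and kern.param is not modelled here.

```r
TMAX <- 2; Tmin <- 0.2; niter <- 30

# Assumed geometric cooling schedule: TMAX at the first iteration,
# Tmin at the last one.
iter <- 0:(niter - 1)
temp <- TMAX * (Tmin / TMAX)^(iter / (niter - 1))

# Distances between neurons on a 4 x 3 rectangular grid
# (the default net in the usage above).
grid <- expand.grid(x = 1:4, y = 1:3)
d_grid <- as.matrix(dist(grid))

# Assumed Gaussian neighbourhood kernel at the first and last iteration.
kern_first <- exp(-d_grid^2 / (2 * temp[1]^2))
kern_last  <- exp(-d_grid^2 / (2 * temp[niter]^2))
```

With a high temperature every neuron influences most of the map; as the temperature drops, the kernel shrinks towards the neuron itself.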
a list with the results of the Batch Kohonen map
solution
A list. Returns the best solution among the repetitions, i.e. the one having the minimum sum of squares criterion.
solution$MAP
The map topology.
solution$IDX
A vector. The clusters to which the objects are assigned.
solution$cardinality
A vector. The cardinality of each final cluster.
solution$proto
A MatH object with the description of centers.
solution$Crit
A number. The criterion (sum of squared deviations from the centers) value at the end of the run.
quality
A number. The percentage of Sum of square deviation explained by the model. (The higher the better)
Irpino A, Verde R, De Carvalho FAT (2012). Batch self organizing maps for interval and histogram data. In: Proceedings of COMPSTAT 2012. p. 143-154, ISI/IASC, ISBN: 978-90-73592-32-2
## Not run:
results <- WH_2d_Kohonen_maps(
  x = BLOOD,
  net = list(xdim = 2, ydim = 3, topo = c("rectangular")),
  repetitions = 2, simplify = TRUE,
  qua = 10, standardize = TRUE
)
## End(Not run)
Fuzzy c-means of a dataset of histogram-valued data using different adaptive distances based on the L2 Wasserstein metric.
WH_adaptive_fcmeans( x, k = 5, schema, m = 1.6, rep, simplify = FALSE, qua = 10, standardize = FALSE, init.weights = "EQUAL", weight.sys = "PROD", theta = 2, verbose = FALSE )
x |
A MatH object (a matrix of distributionH). |
k |
An integer, the number of groups. |
schema |
An integer. 1=one weight per variable, 2=two weights per variable (one for each component: the mean and the variability component), 3=one weight per variable and per cluster, 4=two weights per variable and per cluster. |
m |
A number greater than 0, a fuzziness coefficient (default |
rep |
An integer, maximum number of repetitions of the algorithm (default |
simplify |
A logic value (default is FALSE), if TRUE histograms are recomputed in order to speed-up the algorithm. |
qua |
An integer, if |
standardize |
A logical value (default is FALSE). If TRUE, histogram-valued data are standardized, variable by variable, using the Wasserstein-based standard deviation. Use it if you want variables with standard deviation equal to one. |
init.weights |
A string. (default='EQUAL'). EQUAL, all variables or components have the same weight; 'RANDOM', a random assignment is done. |
weight.sys |
A string. (default='PROD') PROD, Weights product is equal to one. SUM, the weights sum up to one. |
theta |
A number. (default=2) A parameter for the system of weights summing up to one. |
verbose |
A logic value (default is FALSE). If TRUE some details are provided. |
The results of the fuzzy c-means of the set of histogram-valued data x into k clusters.
solution |
A list. Returns the best solution among the |
solution$membership |
A matrix. The membership degree of each unit to each cluster. |
solution$IDX |
A vector. The crisp assignment to a cluster. |
solution$cardinality |
A vector. The cardinality of each final cluster (after the crisp assignment). |
solution$Crit |
A number. The criterion (sum of squared deviations from the prototypes) value at the end of the run. |
quality |
A number. The percentage of Sum of square deviation explained by the model. (The higher the better) |
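The relation between solution$membership, solution$IDX and solution$cardinality can be sketched in base R on made-up numbers. The membership update shown is the standard fuzzy c-means formula for squared distances with fuzziness m; `fcm_membership` is a hypothetical helper, and that the package computes memberships in exactly this way is an assumption.

```r
# Standard fuzzy c-means membership from an n x k matrix of squared
# distances (all entries must be > 0) and a fuzziness coefficient m.
fcm_membership <- function(d2, m = 1.6) {
  w <- d2^(-1 / (m - 1))
  w / rowSums(w)
}

# Made-up squared distances of 3 units from 2 prototypes.
d2 <- matrix(c(1, 4,
               9, 1,
               2, 2), ncol = 2, byrow = TRUE)

u <- fcm_membership(d2, m = 1.5)

# Crisp assignment: the cluster with the highest membership degree
# (ties go to the first cluster).
idx <- max.col(u, ties.method = "first")

# Cardinality of each cluster after the crisp assignment.
cardinality <- tabulate(idx, nbins = ncol(u))
```

Each row of `u` sums to one, and a unit equidistant from both prototypes gets membership 0.5 in each.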
results <- WH_adaptive_fcmeans( x = BLOOD, k = 2, schema = 4, m = 1.5, rep = 3, simplify = TRUE, qua = 10, standardize = TRUE, init.weights = "EQUAL", weight.sys = "PROD" )
The function implements the k-means using adaptive distance for a set of histogram-valued data.
WH_adaptive.kmeans( x, k, schema = 1, init, rep, simplify = FALSE, qua = 10, standardize = FALSE, weight.sys = "PROD", theta = 2, init.weights = "EQUAL", verbose = FALSE )
x |
A MatH object (a matrix of distributionH). |
k |
An integer, the number of groups. |
schema |
a number from 1 to 4 |
init |
(optional, do not use) Initialization for partitioning the data. Default is 'RPART'; other strategies should be implemented. |
rep |
An integer, maximum number of repetitions of the algorithm (default |
simplify |
A logic value (default is FALSE), if TRUE histograms are recomputed in order to speed-up the algorithm. |
qua |
An integer, if |
standardize |
A logical value (default is FALSE). If TRUE, histogram-valued data are standardized, variable by variable, using the Wasserstein-based standard deviation. Use it if you want variables with standard deviation equal to one. |
weight.sys |
a string. Weights may add to one ('SUM') or their product is equal to 1 ('PROD', default). |
theta |
a number. A parameter if |
init.weights |
A string. How to initialize the weights: 'EQUAL' (default), all weights are the same; 'RANDOM', weights are initialised at random. |
verbose |
A logic value (default is FALSE). If TRUE, details on computations are shown. |
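The two weighting systems selected by weight.sys can be illustrated with closed forms that are common in the adaptive-distance clustering literature: weights whose product is one, and weights that sum to one with exponent theta. The sketch below uses made-up per-variable dispersions and covers only the simplest case (one weight per variable); whether WH_adaptive.kmeans applies exactly these updates is an assumption.

```r
# Made-up within-cluster dispersions of three variables.
disp <- c(Var_1 = 4, Var_2 = 1, Var_3 = 0.25)

# 'PROD' system (assumed form): geometric mean of the dispersions
# divided by each dispersion, so that the weights multiply to one.
w_prod <- prod(disp)^(1 / length(disp)) / disp

# 'SUM' system (assumed form): weights that add up to one,
# with theta controlling how sharply they differ.
theta <- 2
w_sum <- sapply(disp, function(dj) 1 / sum((dj / disp)^(1 / (theta - 1))))
```

In both systems the variable with the smallest dispersion receives the largest weight.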
a list with the results of the k-means of the set of histogram-valued data x into k clusters.
solution
A list. Returns the best solution among the repetitions, i.e. the one having the minimum sum of squares criterion.
solution$IDX
A vector. The clusters to which the objects are assigned.
solution$cardinality
A vector. The cardinality of each final cluster.
solution$centers
A MatH object with the description of centers.
solution$Crit
A number. The criterion (sum of squared deviations from the centers) value at the end of the run.
quality
A number. The percentage of Sum of square deviation explained by the model. (The higher the better)
Irpino A., Verde R., De Carvalho F.A.T. (2014). Dynamic clustering of histogram data based on adaptive squared Wasserstein distances. EXPERT SYSTEMS WITH APPLICATIONS, vol. 41, p. 3351-3366, ISSN: 0957-4174, doi: http://dx.doi.org/10.1016/j.eswa.2013.12.001
results <- WH_adaptive.kmeans(x = BLOOD, k = 2, rep = 10, simplify = TRUE, qua = 10, standardize = TRUE)
The function implements the fuzzy c-means for a set of histogram-valued data.
WH_fcmeans(x, k, m = 1.6, rep, simplify = FALSE, qua = 10, standardize = FALSE)
x |
A MatH object (a matrix of distributionH). |
k |
An integer, the number of groups. |
m |
A number greater than 0, a fuzziness coefficient (default |
rep |
An integer, maximum number of repetitions of the algorithm (default |
simplify |
A logic value (default is FALSE), if TRUE histograms are recomputed in order to speed-up the algorithm. |
qua |
An integer, if |
standardize |
A logical value (default is FALSE). If TRUE, histogram-valued data are standardized, variable by variable, using the Wasserstein-based standard deviation. Use it if you want variables with standard deviation equal to one. |
a list with the results of the fuzzy c-means of the set of histogram-valued data x into k clusters.
solution
A list. Returns the best solution among the repetitions, i.e. the one having the minimum sum of squares deviation.
solution$membership
A matrix. The membership degree of each unit to each cluster.
solution$IDX
A vector. The crisp assignment to a cluster.
solution$cardinality
A vector. The cardinality of each final cluster (after the crisp assignment).
solution$Crit
A number. The criterion (Sum of square deviation from the prototypes) value at the end of the run.
quality
A number. The percentage of Sum of square deviation explained by the model. (The higher the better)
results <- WH_fcmeans(x = BLOOD, k = 2, m = 1.5, rep = 10, simplify = TRUE, qua = 10, standardize = TRUE)
The function implements a hierarchical clustering for a set of histogram-valued data, based on the L2 Wasserstein distance. It extends the hclust function of the stats package.
WH_hclust( x, simplify = FALSE, qua = 10, standardize = FALSE, distance = "WDIST", method = "complete" )
x |
A MatH object (a matrix of distributionH). |
simplify |
A logic value (default is FALSE), if TRUE histograms are recomputed in order to speed-up the algorithm. |
qua |
An integer, if |
standardize |
A logical value (default is FALSE). If TRUE, histogram-valued data are standardized, variable by variable, using the Wasserstein-based standard deviation. Use it if you want variables with standard deviation equal to one. |
distance |
A string. Default "WDIST", the L2 Wasserstein distance (other distances will be implemented) |
method |
A string, default="complete". The agglomeration method to be used.
This should be (an unambiguous abbreviation of) one of " |
An object of class hclust which describes the tree produced by the clustering process.
Irpino A., Verde R. (2006). A new Wasserstein based distance for the hierarchical clustering of histogram symbolic data. In: Batagelj et al. Data Science and Classification, IFCS 2006. p. 185-192, BERLIN:Springer, ISBN: 3-540-34415-2
hclust of the stats package for further details.
results <- WH_hclust(x = BLOOD, simplify = TRUE, method = "complete")
plot(results) # it plots the dendrogram
cutree(results, k = 5) # it returns the labels for 5 clusters
The function implements the k-means for a set of histogram-valued data.
WH_kmeans( x, k, rep = 5, simplify = FALSE, qua = 10, standardize = FALSE, verbose = FALSE )
x |
A MatH object (a matrix of distributionH). |
k |
An integer, the number of groups. |
rep |
An integer, maximum number of repetitions of the algorithm (default |
simplify |
A logic value (default is FALSE), if TRUE histograms are recomputed in order to speed-up the algorithm. |
qua |
An integer, if |
standardize |
A logical value (default is FALSE). If TRUE, histogram-valued data are standardized, variable by variable, using the Wasserstein-based standard deviation. Use it if you want variables with standard deviation equal to one. |
verbose |
A logic value (default is FALSE). If TRUE, details on computations are shown. |
a list with the results of the k-means of the set of histogram-valued data x into k clusters.
solution
A list. Returns the best solution among the repetitions, i.e. the one having the minimum sum of squares criterion.
solution$IDX
A vector. The clusters to which the objects are assigned.
solution$cardinality
A vector. The cardinality of each final cluster.
solution$centers
A MatH object with the description of centers.
solution$Crit
A number. The criterion (sum of squared deviations from the centers) value at the end of the run.
quality
A number. The percentage of Sum of square deviation explained by the model. (The higher the better)
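The quality index is described as the share of the sum of squared deviations explained by the model. A natural reading is one minus the ratio of the within-cluster criterion to the total sum of squares. The base R sketch below computes that ratio on ordinary numbers rather than histograms; the formula is an assumption about how the index is built, not taken from the package source.

```r
# Six made-up real-valued observations and a crisp partition into 2 clusters.
x   <- c(1, 2, 3, 10, 11, 12)
idx <- c(1, 1, 1, 2, 2, 2)

# Total sum of squared deviations from the overall mean.
tss <- sum((x - mean(x))^2)

# Within-cluster sum of squared deviations from the cluster means
# (the analogue of solution$Crit).
wss <- sum((x - ave(x, idx))^2)

# Share of the total deviation explained by the partition (assumed form).
quality <- 1 - wss / tss
```

Well-separated clusters leave little within-cluster deviation, so the index approaches one.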
Irpino A., Verde R., Lechevallier Y. (2006). Dynamic clustering of histograms using Wasserstein metric. In: Rizzi A., Vichi M.. COMPSTAT 2006 - Advances in computational statistics. p. 869-876, Heidelberg:Physica-Verlag
results <- WH_kmeans( x = BLOOD, k = 2, rep = 10, simplify = TRUE, qua = 10, standardize = TRUE, verbose = TRUE )
The function extracts the L2 Wasserstein distance matrix from a MatH object.
WH_MAT_DIST(x, simplify = FALSE, qua = 10, standardize = FALSE)
x |
A MatH object (a matrix of distributionH). |
simplify |
A logic value (default is FALSE), if TRUE histograms are recomputed in order to speed-up the algorithm. |
qua |
An integer, if |
standardize |
A logic value (default is FALSE). If TRUE, histogram-valued data are standardized, variable by variable, using the Wasserstein based standard deviation. Use if one wants to have variables with std equal to one. |
A matrix of squared L2 distances.
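Because the returned matrix holds squared distances, a square root may be needed before passing it to tools that expect plain distances, such as hclust from the stats package. The sketch below uses a made-up stand-in matrix `D2` built from four numbers, not output of WH_MAT_DIST.

```r
# Stand-in for a squared distance matrix between four units.
pts <- c(a = 0, b = 1, c = 10, d = 11)
D2 <- outer(pts, pts, function(u, v) (u - v)^2)

# Back to plain distances, then complete-linkage hierarchical clustering.
tree <- hclust(as.dist(sqrt(D2)), method = "complete")

# Labels for a cut of the dendrogram into 2 clusters.
groups <- cutree(tree, k = 2)
```

Here units a and b end up in one cluster and c and d in the other.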
Irpino A., Verde R. (2006). A new Wasserstein based distance for the hierarchical clustering of histogram symbolic data. In: Batagelj et al. Data Science and Classification, IFCS 2006. p. 185-192, BERLIN:Springer, ISBN: 3-540-34415-2
DMAT <- WH_MAT_DIST(x = BLOOD, simplify = TRUE)
The function implements a principal components analysis of a histogram variable based on the Wasserstein distance. It performs a centered (not standardized) PCA on a set of quantiles of a variable. Since a distribution is a multivalued description, the analysis performs a dimensional reduction and a visualization of distributions. It is 1d (one-dimensional) because just one histogram variable is considered.
WH.1d.PCA( data, var, quantiles = 10, plots = TRUE, listaxes = c(1:4), axisequal = FALSE, qcut = 1, outl = 0 )
data |
A MatH object (a matrix of distributionH). |
var |
An integer, the variable number. |
quantiles |
An integer, it is the number of quantiles used in the analysis. |
plots |
A logical value. If TRUE (default), plots are drawn. |
listaxes |
A vector of integers listing the axes for the 2d factorial representations. |
axisequal |
A logical value (default=FALSE). If TRUE, the plot has the same scale for the x and the y axes. |
qcut |
a number between 0.5 and 1, it is used for the plot of densities, and avoids very peaked densities. Default=1, all the densities are considered. |
outl |
a number between 0 (default) and 0.5. For each distribution, it is the amount of mass removed from each tail of the distribution. For example, if 0.1, a left tail and a right tail, each containing 0.1 of the mass, are cut away from each distribution. |
In the framework of symbolic data analysis (SDA), distribution-valued data are defined as multivalued data, where each unit is described by a distribution (e.g., a histogram, a density, or a quantile function) of a quantitative variable. SDA provides different methods for analyzing multivalued data. Among them, one of the most relevant techniques proposed for the dimensional reduction of multivalued quantitative variables is principal component analysis (PCA). The referenced paper contributes to this context of analysis. Starting from new association measures for distributional variables based on a peculiar metric for distributions, the squared Wasserstein distance, a PCA approach is proposed for distribution-valued data represented by quantile variables.
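The idea of a centered, non-standardized PCA on quantile variables can be sketched in base R with prcomp. The units below are simulated samples rather than distributionH objects, and the choice of 11 equally spaced probability levels is an assumption about how quantiles = 10 might translate into columns.

```r
set.seed(1)

# Six made-up units, each described by a sample from a distribution.
samples <- lapply(1:6, function(i) rnorm(200, mean = i, sd = 1 + i / 10))

# Quantile variables: one row per unit, one column per probability level.
probs <- seq(0, 1, length.out = 11)
Q <- t(sapply(samples, quantile, probs = probs))

# Centered, not standardized PCA on the quantile matrix.
pca <- prcomp(Q, center = TRUE, scale. = FALSE)

# Coordinates of the units on the first factorial plane.
scores <- pca$x[, 1:2]
```

Each unit becomes a point on the factorial plane, and the loadings in `pca$rotation` show which quantiles drive each axis.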
a list with the results of the PCA in the MFA format of package FactoMineR for function MFA
Verde, R., Irpino, A., Balzanella, A. Dimension Reduction Techniques for Distributional Symbolic Data. IEEE Transactions on Cybernetics, vol. PP, no. 99, pp. 1-1, doi: 10.1109/TCYB.2015.2389653, https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=7024099&isnumber=6352949
results <- WH.1d.PCA(data = BLOOD, var = 1, listaxes = c(1:2))
It attaches two MatH objects with the same columns by row, or with the same rows by column.
WH.bind(object1, object2, byrow)
## S4 method for signature 'MatH,MatH'
WH.bind(object1, object2, byrow = TRUE)
object1 |
a |
object2 |
a |
byrow |
a logical value (default=TRUE) attaches the objects by row |
a MatH object
WH.bind.row for binding by row, WH.bind.col for binding by column
# binding by row
M1 <- BLOOD[1:10, 1]
M2 <- BLOOD[1:10, 3]
MAT <- WH.bind(M1, M2, byrow = TRUE)
# binding by col
M1 <- BLOOD[1:10, 1]
M2 <- BLOOD[1:10, 3]
MAT <- WH.bind(M1, M2, byrow = FALSE)
It attaches two MatH objects with the same rows by column.
WH.bind.col(object1, object2)
## S4 method for signature 'MatH,MatH'
WH.bind.col(object1, object2)
object1 |
a |
object2 |
a |
a MatH object
M1 <- BLOOD[1:10, 1] M2 <- BLOOD[1:10, 3] MAT <- WH.bind.col(M1, M2)
It attaches two MatH objects with the same columns by row.
WH.bind.row(object1, object2)
## S4 method for signature 'MatH,MatH'
WH.bind.row(object1, object2)
object1 |
a |
object2 |
a |
a MatH object
M1 <- BLOOD[1:3, ] M2 <- BLOOD[5:8, ] MAT <- WH.bind.row(M1, M2)
Compute the correlation matrix of a MatH object, i.e. a matrix of values consistent with a set of distributions equipped with an L2 Wasserstein metric.
WH.correlation(object, ...)
## S4 method for signature 'MatH'
WH.correlation(object, w = numeric(0))
object |
a |
... |
some optional parameters |
w |
it is possible to add a vector of weights (positive numbers)
having the same size of the rows of the |
a square matrix with the (weighted) correlation indices
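One way to read a correlation that is consistent with the L2 Wasserstein metric is to treat each unit's quantile function as a vector and use the integral of the product of two quantile functions as inner product. The unweighted base R sketch below follows that reading on simulated normal distributions; `w_cov` and `w_cor` are hypothetical helpers, and the exact estimator used by WH.correlation (including the handling of the weights w) is an assumption here.

```r
set.seed(42)
n <- 8
t_grid <- (seq_len(100) - 0.5) / 100   # midpoint grid on (0, 1)

# Made-up units: normal distributions with unit-specific mean and sd.
mu_x <- rnorm(n, 10, 2);      sd_x <- runif(n, 1, 2)
mu_y <- 2 * mu_x + rnorm(n);  sd_y <- runif(n, 1, 2)

# Quantile matrices: one row per unit, one column per level in t_grid.
Qx <- t(mapply(function(m, s) qnorm(t_grid, m, s), mu_x, sd_x))
Qy <- t(mapply(function(m, s) qnorm(t_grid, m, s), mu_y, sd_y))

# Covariance: average inner product of the units minus the inner
# product of the two mean quantile functions.
w_cov <- function(A, B) mean(rowMeans(A * B)) - mean(colMeans(A) * colMeans(B))
w_cor <- function(A, B) w_cov(A, B) / sqrt(w_cov(A, A) * w_cov(B, B))

r_xx <- w_cor(Qx, Qx)
r_xy <- w_cor(Qx, Qy)
```

A variable correlates perfectly with itself, and any pair of variables gives a value between -1 and 1.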
Irpino, A., Verde, R. (2015) Basic statistics for distributional symbolic variables: a new metric-based approach Advances in Data Analysis and Classification, DOI 10.1007/s11634-014-0176-4
WH.correlation(BLOOD)
# generate a set of random weights
RN <- runif(get.MatH.nrows(BLOOD))
WH.correlation(BLOOD, w = RN)
Compute the correlation matrix using two MatH objects having the same number of rows. It returns a rectangular matrix of numbers, consistent with a set of distributions equipped with an L2 Wasserstein metric.
WH.correlation2(object1, object2, ...)
## S4 method for signature 'MatH,MatH'
WH.correlation2(object1, object2, w = numeric(0))
object1 |
a |
object2 |
a |
... |
some optional parameters |
w |
it is possible to add a vector of weights (positive numbers)
having the same size of the rows of the |
a rectangular matrix with the (weighted) correlation indices
M1 <- BLOOD[, 1]
M2 <- BLOOD[, 2:3]
WH.correlation2(M1, M2)
# generate a set of random weights
RN <- runif(get.MatH.nrows(BLOOD))
WH.correlation2(M1, M2, w = RN)
It is the matrix product of two MatH objects, i.e. two matrices of distributions, computed with the dot product of two histograms that is consistent with a set of distributions equipped with an L2 Wasserstein metric.
WH.mat.prod(object1, object2, ...)
## S4 method for signature 'MatH,MatH'
WH.mat.prod(object1, object2, traspose1 = FALSE, traspose2 = FALSE)
object1 |
a |
object2 |
a |
... |
other optional parameters |
traspose1 |
a logical value, default=FALSE. If TRUE, transposes object1 |
traspose2 |
a logical value, default=FALSE. If TRUE, transposes object2 |
a matrix of numbers
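A dot product of two histograms that is consistent with the L2 Wasserstein metric can be read as the integral over (0, 1) of the product of their quantile functions. The base R sketch below approximates it on a midpoint grid and checks it against the squared distance; `qf` and `dot_h` are hypothetical helpers, the histograms are the first and third from the package overview example, and the uniform-within-bin quantile function is an assumption.

```r
# Piecewise linear quantile function of a histogram (bin limits x,
# cumulative probabilities p).
qf <- function(x, p, t) approx(x = p, y = x, xout = t)$y
t_grid <- (seq_len(1000) - 0.5) / 1000

# Dot product of two histograms: mean product of the quantile functions.
dot_h <- function(x1, p1, x2, p2) mean(qf(x1, p1, t_grid) * qf(x2, p2, t_grid))

x1 <- c(80, 100, 120, 135, 150, 165, 180, 200, 240)
p1 <- c(0, 0.025, 0.1, 0.275, 0.525, 0.725, 0.887, 0.975, 1)
x2 <- c(95, 110, 125, 140, 155, 170, 185, 200, 215, 230, 245)
p2 <- c(0, 0.012, 0.041, 0.154, 0.36, 0.595, 0.781, 0.929, 0.972, 0.992, 1)

# Squared distance computed directly and through the dot products.
d2_direct <- mean((qf(x1, p1, t_grid) - qf(x2, p2, t_grid))^2)
d2_dots <- dot_h(x1, p1, x1, p1) + dot_h(x2, p2, x2, p2) -
  2 * dot_h(x1, p1, x2, p2)
```

The two values agree, as for any Euclidean distance expanded as the sum of the squared norms minus twice the dot product.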
M1 <- BLOOD[1:5, ] M2 <- BLOOD[6:10, ] MAT <- WH.mat.prod(M1, M2, traspose1 = TRUE, traspose2 = FALSE)
It sums two MatH objects, i.e. two matrices of distributions, by summing the quantile functions of histograms. This sum is consistent with a set of distributions equipped with an L2 Wasserstein metric.
WH.mat.sum(object1, object2)
## S4 method for signature 'MatH,MatH'
WH.mat.sum(object1, object2)
object1 |
a |
object2 |
a |
a MatH object
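Summing the quantile functions of two histograms can be sketched in base R: evaluate both quantile functions on the union of their cumulative probability levels and add the values. `qf`, `hist_mean` and `sum_hist` are hypothetical helpers, and the uniform-within-bin assumption makes the quantile functions piecewise linear; the package's own implementation may differ in detail.

```r
# Piecewise linear quantile function (bin limits x, cumulative probs p).
qf <- function(x, p, t) approx(x = p, y = x, xout = t)$y

# Mean of a histogram: bin masses times bin midpoints.
hist_mean <- function(x, p) sum(diff(p) * (head(x, -1) + tail(x, -1)) / 2)

# Sum of two histograms through their quantile functions.
sum_hist <- function(x1, p1, x2, p2) {
  p <- sort(unique(c(p1, p2)))           # union of the probability levels
  list(x = qf(x1, p1, p) + qf(x2, p2, p), p = p)
}

x1 <- c(80, 100, 120, 135, 150, 165, 180, 200, 240)
p1 <- c(0, 0.025, 0.1, 0.275, 0.525, 0.725, 0.887, 0.975, 1)
x2 <- c(95, 110, 125, 140, 155, 170, 185, 200, 215, 230, 245)
p2 <- c(0, 0.012, 0.041, 0.154, 0.36, 0.595, 0.781, 0.929, 0.972, 0.992, 1)

s <- sum_hist(x1, p1, x2, p2)
```

The result is again a valid histogram (increasing bin limits), and its mean is the sum of the two means.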
# summing two MatH objects
M1 <- BLOOD[1:5, ]
M2 <- BLOOD[6:10, ]
MAT <- WH.mat.sum(M1, M2)
(Beta version) The function implements a principal components analysis of a set of histogram variables based on the Wasserstein distance. It performs a centered (not standardized) PCA on a set of quantiles of each selected variable. Since a distribution is a multivalued description, the analysis performs a dimensional reduction and a visualization of distributions.
WH.MultiplePCA(data, list.of.vars, quantiles = 10, outl = 0)
data |
A MatH object (a matrix of distributionH). |
list.of.vars |
A list of integers, the active variables. |
quantiles |
An integer, it is the number of quantiles used in the analysis. Default=10. |
outl |
a number between 0 (default) and 0.5. For each distribution, it is the amount of mass removed from each tail of the distribution. For example, if 0.1, a left tail and a right tail, each containing 0.1 of the mass, are cut away from each distribution. |
It is an extension of WH.1d.PCA to the multiple case.
a list with the results of the PCA in the MFA format of package FactoMineR for function MFA
(Beta version) The function plots histogram data of the individuals for a particular variable on a factorial plane after a Multiple factor analysis.
WH.plot_multiple_indivs( data, res, axes = c(1, 2), indiv = 0, var = 1, strx = 0.1, stry = 0.1, HISTO = TRUE, coor = 0, stat = "mean" )
data |
a MatH object |
res |
Results from WH.MultiplePCA. |
axes |
A list of integers, the new factorial axes c(1,2) are the default. |
indiv |
A list of objects (rows) of data to plot. Default=0 all the objects of data. |
var |
An integer indicating an original histogram variable to plot. |
strx |
a resizing factor for the domain of histograms (default=0.1 means that each distribution has a support that is one tenth of the spread of the x axis) |
stry |
a resizing factor for the density of histograms (default=0.1 means that each distribution has a density that is one tenth of the spread of the y axis) |
HISTO |
a logical value. If TRUE (default), histograms are plotted; if FALSE, smooth densities are plotted. |
coor |
(optional) if 0 (default), the coordinates are taken from res; if a matrix is passed, its coordinates are used |
stat |
(optional) if 'mean'(Default) a plot of individuals labeled by the means is produced. Otherwise if 'std', 'skewness' or 'kurtosis', data are labeled with this statistic. |
a plot of class ggplot
# Do a MultiplePCA on the BLOOD dataset
## Not run:
results <- WH.MultiplePCA(BLOOD, list.of.vars = c(1:3))
# Plot histograms of variable 1 of BLOOD dataset on the first
# factorial plane showing histograms
WH.plot_multiple_indivs(BLOOD, results,
  axes = c(1, 2), var = 1, strx = 0.1,
  stry = 0.1, HISTO = TRUE
)
# Plot histograms of variable 1 of BLOOD dataset on the first
# factorial plane showing densities
WH.plot_multiple_indivs(BLOOD, results,
  axes = c(1, 2), var = 1, strx = 0.1,
  stry = 0.1, HISTO = FALSE
)
## End(Not run)
The function plots the circle of correlation of the quantiles of the histogram variables after a Multiple factor analysis.
WH.plot_multiple_Spanish.funs( res, axes = c(1, 2), var = 1, LABS = TRUE, multi = TRUE, corplot = TRUE )
res |
Results from WH.MultiplePCA, or WH.1d.PCA. |
axes |
A list of integers, the new factorial axes c(1,2) are the default. |
var |
A list of integers are the variables to plot. |
LABS |
Logical, if TRUE the graph is labeled, otherwise it is not. |
multi |
Logical, if TRUE (default) results come from a WH.MultiplePCA, if FALSE results come from WH.1d.PCA. |
corplot |
Logical, if TRUE (default) the plot reports correlations, if FALSE the coordinates of quantiles on the factorial plane |
a plot of class ggplot
# Do a MultiplePCA on the BLOOD dataset
## Not run:
res <- WH.MultiplePCA(BLOOD, list.of.vars = c(1:3))
## End(Not run)
# Plot results
## Not run:
WH.plot_multiple_Spanish.funs(res, axes = c(1, 2), var = c(1:3))
## End(Not run)
It computes three goodness of fit indices using the results and the predictions of a regression done with the WH.regression.two.components function.
WH.regression.GOF(observed, predicted)
observed |
A one column MatH object, the observed histogram variable |
predicted |
A one column MatH object, the predicted histogram variable. |
a list with the GOF indices
Irpino A, Verde R (in press 2015). Linear regression for numeric symbolic variables: a least squares approach based on Wasserstein Distance. ADVANCES IN DATA ANALYSIS AND CLASSIFICATION, ISSN: 1862-5347, DOI:10.1007/s11634-015-0197-7
An extended version is available on arXiv repository arXiv:1202.1436v2 https://arxiv.org/abs/1202.1436v2
# do regression
model.parameters <- WH.regression.two.components(data = BLOOD, Yvar = 1, Xvars = c(2:3))
# do prediction
Predicted.BLOOD <- WH.regression.two.components.predict(data = BLOOD[, 2:3], parameters = model.parameters)
# compute GOF indices
GOF.indices <- WH.regression.GOF(observed = BLOOD[, 1], predicted = Predicted.BLOOD)
The function implements multiple regression analysis for histogram variables based on a two-component model and the L2 Wasserstein distance. Taking as input a dependent histogram variable and a set of explanatory histogram variables, the method returns a least squares estimation of a two-component regression model based on the decomposition of the L2 Wasserstein metric for distributional data.
WH.regression.two.components(data, Yvar, Xvars, simplify = FALSE, qua = 20)
data |
A MatH object (a matrix of distributionH). |
Yvar |
An integer, the dependent variable number in data. |
Xvars |
A set of integers, the explanatory variables in data. |
simplify |
a logical argument (default=FALSE). If TRUE only few equally spaced quantiles are considered (for speeding up the algorithm) |
qua |
If |
A two-component regression model is implemented. The observed variables are histogram variables according to the definition given in the framework of Symbolic Data Analysis, and the parameters of the model are estimated using the classic least squares method. An appropriate metric is introduced in order to measure the error between the observed and the predicted distributions. In particular, the Wasserstein distance is proposed. Such a metric makes it possible to predict the response variable as a direct linear combination of the other independent histogram variables.
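The two components rest on a decomposition of the squared L2 Wasserstein distance into a part due to the means and a part due to the centered quantile functions (variability and shape). The base R sketch below checks that decomposition numerically on the first and third histograms of the package overview example; `qf` is a hypothetical helper, and the midpoint grid and uniform-within-bin quantile function are assumptions.

```r
# Piecewise linear quantile function (bin limits x, cumulative probs p).
qf <- function(x, p, t) approx(x = p, y = x, xout = t)$y
t_grid <- (seq_len(1000) - 0.5) / 1000

x1 <- c(80, 100, 120, 135, 150, 165, 180, 200, 240)
p1 <- c(0, 0.025, 0.1, 0.275, 0.525, 0.725, 0.887, 0.975, 1)
x2 <- c(95, 110, 125, 140, 155, 170, 185, 200, 215, 230, 245)
p2 <- c(0, 0.012, 0.041, 0.154, 0.36, 0.595, 0.781, 0.929, 0.972, 0.992, 1)

q1 <- qf(x1, p1, t_grid)
q2 <- qf(x2, p2, t_grid)

# Squared L2 Wasserstein distance (midpoint approximation).
d2 <- mean((q1 - q2)^2)

# Component 1: squared difference of the means.
mean_part <- (mean(q1) - mean(q2))^2

# Component 2: squared distance between the centered quantile functions.
shape_part <- mean(((q1 - mean(q1)) - (q2 - mean(q2)))^2)
```

The two parts add up to the full squared distance, so a model can treat the mean and the centered distribution separately.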
a named vector with the model estimated parameters
Irpino A, Verde R (in press 2015). Linear regression for numeric symbolic variables: a least squares approach based on Wasserstein Distance. ADVANCES IN DATA ANALYSIS AND CLASSIFICATION, ISSN: 1862-5347, DOI:10.1007/s11634-015-0197-7
An extended version is available on arXiv repository arXiv:1202.1436v2 https://arxiv.org/abs/1202.1436v2
model.parameters <- WH.regression.two.components(data = BLOOD, Yvar = 1, Xvars = c(2:3))
Predict distributions using the results of a regression fitted with the WH.regression.two.components
function.
WH.regression.two.components.predict(data, parameters)
data |
A MatH object (a matrix of distributionH), the explanatory part. |
parameters |
A named vector with the parameter from a |
a MatH
object, the predicted histograms
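A rough idea of how a two-component prediction can be formed for a single unit is sketched below in base R. It assumes that the model combines an intercept, coefficients on the means of the explanatory distributions, and non-negative coefficients on their centered quantile functions. The names `b0`, `b_mean` and `g_cent` and all the values are hypothetical; they do not reproduce the layout of the vector returned by WH.regression.two.components.

```r
# Sketch of a two-component prediction for one unit (hypothetical parameters).
t_grid <- (seq_len(200) - 0.5) / 200

# Quantile functions of two explanatory distributions for the unit
q_x1 <- qnorm(t_grid, mean = 12, sd = 1.5)
q_x2 <- qunif(t_grid, min = 30, max = 40)

# Hypothetical coefficients: intercept, slopes on the means,
# and non-negative slopes on the centered quantile functions
b0 <- 2
b_mean <- c(0.8, 0.1)
g_cent <- c(0.5, 0.3)

m1 <- mean(q_x1)
m2 <- mean(q_x2)

# Predicted quantile function: mean component plus variability component
q_pred <- b0 + b_mean[1] * m1 + b_mean[2] * m2 +
  g_cent[1] * (q_x1 - m1) + g_cent[2] * (q_x2 - m2)

# With non-negative weights on the centered parts the result is
# non-decreasing, so it is still a valid quantile function
is_valid <- all(diff(q_pred) >= 0)
```

Under these assumptions the mean of the predicted distribution depends only on the first component, while its shape depends only on the second.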
Irpino A, Verde R (in press 2015). Linear regression for numeric symbolic variables: a least squares approach
based on Wasserstein Distance. ADVANCES IN DATA ANALYSIS AND CLASSIFICATION, ISSN: 1862-5347, DOI:10.1007/s11634-015-0197-7
An extended version is available on arXiv repository arXiv:1202.1436v2 https://arxiv.org/abs/1202.1436v2
# do regression
model.parameters <- WH.regression.two.components(data = BLOOD, Yvar = 1, Xvars = c(2:3))
# do prediction
Predicted.BLOOD <- WH.regression.two.components.predict(data = BLOOD[, 2:3], parameters = model.parameters)
Compute the sum-of-squares-deviations (from the mean) matrix of a MatH
object, i.e. a matrix of numbers, consistent with
a set of distributions equipped with an L2 Wasserstein metric.
WH.SSQ(object, ...)
## S4 method for signature 'MatH'
WH.SSQ(object, w = numeric(0))
object |
a |
... |
some optional parameters |
w |
it is possible to add a vector of weights (positive numbers)
having the same size of the rows of the |
a square matrix
with the weighted sum of squares
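What such a matrix contains can be sketched in base R, without the package. The example assumes that every distribution is represented by its quantile function on a common grid and that each entry is a weighted sum, over the units, of the averaged products of deviations from the mean quantile function. The data, the grid and the helper names `dev` and `cross` are made up, and this is not the code used by WH.SSQ.

```r
# Sketch: sum-of-squares-deviations matrix for two distributional variables.
n <- 5
t_grid <- (seq_len(100) - 0.5) / 100

# Rows = units, columns = quantile levels
Q1 <- t(sapply(seq_len(n), function(i) qnorm(t_grid, mean = 10 + i, sd = 1 + i / 10)))
Q2 <- t(sapply(seq_len(n), function(i) qnorm(t_grid, mean = 50 - 2 * i, sd = 2)))
w <- rep(1, n)   # unit weights, one per row

# Deviations of each quantile function from the weighted mean quantile function
dev <- function(Q, w) sweep(Q, 2, colSums(Q * w) / sum(w))

# Weighted cross-product, averaged over the quantile levels
# (the average approximates the integral over [0, 1])
cross <- function(A, B, w) sum(w * rowMeans(A * B))

D1 <- dev(Q1, w)
D2 <- dev(Q2, w)
SSQ <- matrix(c(cross(D1, D1, w), cross(D1, D2, w),
                cross(D2, D1, w), cross(D2, D2, w)), nrow = 2)
```

The diagonal holds the sums of squared distances of each variable from its mean distribution; the off-diagonal entry is negative here because the means of the two made-up variables move in opposite directions.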
WH.SSQ(BLOOD)
# generate a set of random weights
RN <- runif(get.MatH.nrows(BLOOD))
WH.SSQ(BLOOD, w = RN)
Compute the sum-of-squares-deviations (from the mean) matrix using two MatH
objects having the same number of rows.
It returns a rectangular matrix of numbers, consistent with
a set of distributions equipped with an L2 Wasserstein metric.
WH.SSQ2(object1, object2, ...)
## S4 method for signature 'MatH,MatH'
WH.SSQ2(object1, object2, w = numeric(0))
object1 |
a |
object2 |
a |
... |
some optional parameters |
w |
it is possible to add a vector of weights (positive numbers)
having the same size of the rows of the |
a rectangular matrix
with the weighted sum of squares
M1 <- BLOOD[, 1]
M2 <- BLOOD[, 2:3]
WH.SSQ2(M1, M2)
# generate a set of random weights
RN <- runif(get.MatH.nrows(BLOOD))
WH.SSQ2(M1, M2, w = RN)
Compute the variance-covariance matrix of a MatH
object, i.e. a matrix of values consistent with
a set of distributions equipped with an L2 Wasserstein metric.
WH.var.covar(object, ...)
## S4 method for signature 'MatH'
WH.var.covar(object, w = numeric(0))
object |
a |
... |
some optional parameters |
w |
it is possible to add a vector of weights (positive numbers)
having the same size of the rows of the |
a square matrix
with the (weighted) variance-covariance values
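The link between a weighted sum of squares and a variance can be sketched for a single distributional variable in base R. The example assumes that the variance is the weighted sum of squared L2 Wasserstein distances from the mean distribution divided by the total weight; that normalisation, the data and the weights are assumptions of the sketch and are not taken from the package source.

```r
# Sketch: (weighted) variance of one distributional variable.
t_grid <- (seq_len(100) - 0.5) / 100

# Three units, each described by a quantile function on the grid
Q <- rbind(qnorm(t_grid, mean = 10, sd = 1),
           qnorm(t_grid, mean = 12, sd = 2),
           qnorm(t_grid, mean = 15, sd = 1.5))
w <- c(1, 2, 1)                         # one positive weight per row

q_bar <- colSums(Q * w) / sum(w)        # weighted mean quantile function
d2 <- rowMeans(sweep(Q, 2, q_bar)^2)    # squared distance of each unit from the mean
ssq <- sum(w * d2)                      # weighted sum of squares
v <- ssq / sum(w)                       # variance under the stated normalisation

# Variance of the unit means alone, for comparison
m <- rowMeans(Q)
v_means <- sum(w * (m - sum(w * m) / sum(w))^2) / sum(w)
```

Because distributions can also differ in shape, `v` is never smaller than `v_means`; the gap is the contribution of the centered quantile functions.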
Irpino, A., Verde, R. (2015) Basic statistics for distributional symbolic variables: a new metric-based approach Advances in Data Analysis and Classification, DOI 10.1007/s11634-014-0176-4
WH.var.covar(BLOOD)
# generate a set of random weights
RN <- runif(get.MatH.nrows(BLOOD))
WH.var.covar(BLOOD, w = RN)
Compute the covariance matrix using two MatH
objects having the same number of rows.
It returns a rectangular matrix of numbers, consistent with
a set of distributions equipped with an L2 Wasserstein metric.
WH.var.covar2(object1, object2, ...)
## S4 method for signature 'MatH,MatH'
WH.var.covar2(object1, object2, w = numeric(0))
object1 |
a |
object2 |
a |
... |
some optional parameters |
w |
it is possible to add a vector of weights (positive numbers)
having the same size of the rows of the |
a rectangular matrix
with the (weighted) covariance values
M1 <- BLOOD[, 1]
M2 <- BLOOD[, 2:3]
WH.var.covar2(M1, M2)
# generate a set of random weights
RN <- runif(get.MatH.nrows(BLOOD))
WH.var.covar2(M1, M2, w = RN)
Compute a histogram that is the weighted mean of the set of histograms contained
in a MatH
object, i.e. a matrix of histograms, consistent with
a set of distributions equipped with an L2 Wasserstein metric.
WH.vec.mean(object, ...)
## S4 method for signature 'MatH'
WH.vec.mean(object, w = numeric(0))
object |
a |
... |
optional arguments |
w |
it is possible to add a vector of weights (positive numbers) having the same size of
the |
a distributionH
object, i.e. a histogram
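Under the L2 Wasserstein metric, the mean of a set of one-dimensional distributions has a quantile function equal to the weighted mean of the individual quantile functions. The base R sketch below illustrates this for two small histograms; the lists `h1` and `h2`, the helper `quant` and the weights are made up, and the code is only an illustration, not the method behind WH.vec.mean.

```r
# Sketch: weighted L2 Wasserstein mean of two histograms via quantile functions.
t_grid <- (0:100) / 100

# Histograms given by bin bounds x and cumulative probabilities p
h1 <- list(x = c(0, 10, 20), p = c(0, 0.5, 1))
h2 <- list(x = c(10, 20, 40), p = c(0, 0.2, 1))

# Quantile function of a histogram with uniform density inside each bin:
# linear interpolation of the inverse of the cumulative distribution
quant <- function(h, t) approx(x = h$p, y = h$x, xout = t)$y

w <- c(3, 1)   # one positive weight per histogram
q_mean <- (w[1] * quant(h1, t_grid) + w[2] * quant(h2, t_grid)) / sum(w)

# Support of the mean distribution
range(q_mean)
```

Averaging quantile functions, rather than densities, is what keeps the result consistent with the metric: the mean of two histograms with shifted supports is a histogram located between them, not a bimodal mixture.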
hmean <- WH.vec.mean(BLOOD)
# generate a set of random weights
RN <- runif(get.MatH.nrows(BLOOD) * get.MatH.ncols(BLOOD))
hmean <- WH.vec.mean(BLOOD, w = RN)
Compute a histogram that is the weighted sum of the set of histograms contained
in a MatH
object, i.e. a matrix of histograms, consistent with
a set of distributions equipped with an L2 Wasserstein metric.
WH.vec.sum(object, ...)
## S4 method for signature 'MatH'
WH.vec.sum(object, w = numeric(0))
object |
a |
... |
optional arguments |
w |
it is possible to add a vector of weights (positive numbers) having the same size of the |
a distributionH
object, i.e. a histogram
hsum <- WH.vec.sum(BLOOD)
# generate a set of random weights
RN <- runif(get.MatH.nrows(BLOOD) * get.MatH.ncols(BLOOD))
hsum <- WH.vec.sum(BLOOD, w = RN)