Title: | Robust Data Analysis Through Monitoring and Dynamic Visualization |
---|---|
Description: | Provides interface to the 'MATLAB' toolbox 'Flexible Statistical Data Analysis (FSDA)' which is comprehensive and computationally efficient software package for robust statistics in regression, multivariate and categorical data analysis. The current R version implements tools for regression: (forward search, S- and MM-estimation, least trimmed squares (LTS) and least median of squares (LMS)), for multivariate analysis (forward search, S- and MM-estimation), for cluster analysis and cluster-wise regression. The distinctive feature of our package is the possibility of monitoring the statistics of interest as a function of breakdown point, efficiency or subset size, depending on the estimator. This is accompanied by a rich set of graphical features, such as dynamic brushing, linking, particularly useful for exploratory data analysis. |
Authors: | Valentin Todorov [aut, cre] , Emmanuele Sordini [aut], Aldo Corbellini [ctb], Francesca Torti [ctb], Marco Riani [ctb], Domenico Perrotta [ctb], Andrea Cerioli [ctb] |
Maintainer: | Valentin Todorov <[email protected]> |
License: | GPL (>= 3) |
Version: | 0.9-0 |
Built: | 2024-12-31 07:23:22 UTC |
Source: | CRAN |
There are 60 observations on a response y with the values of three explanatory variables. The scatter plot matrix of the data shows y increasing with each of x1, x2 and x3. The plot of residuals against fitted values shows no obvious pattern. However the FS finds that there are 6 masked outliers.
data(bank_data)
data(bank_data)
A data frame with 1949 rows and 14 variables. The variables are as follows:
x1: Personal loans
x2: Financing and hire-purchase
x3: Mortgages
x4: Life insurance
x5: Share amount
x6: Bond account
x7: Current account
x8: Salary deposits
x9: Debit cards
x10: Credit cards
x11: Telephone banking
x12: Domestic direct debits
x13: Money transfers
y: Profit/loss
Riani, M., Cerioli, A., Atkinson, A. C., and Perrotta, D. (2014). Supplement to ”Monitoring robust regression”. doi:10.1214/14-EJS897SUPP.
Riani, M., Cerioli, A., Atkinson, A. C., and Perrotta, D. (2014). Monitoring robust regression. Electronic Journal of Statistics, 8, 642-673.
tclustICsol
Takes as input the output of function tclustICsol
(that is a structure containing the best relevant solutions) and produces the
car-bike plot. This plot provides a concise summary of the best relevant solutions.
This plot shows on the horizontal axis the value of c
and on the vertical axis
the value of k
. For each solution we draw a rectangle for the interval of
values for which the solution is best and stable and a horizontal line which departs
from the rectangle for the values of c
in which the solution is only stable.
Finally, for the best value of c
associated to the solution, we show a circle
with two numbers, the first number indicates the ranked solution among those which are
not spurious and the second one the ranked number including the spurious solutions.
This plot has been baptized 'car-bike', because the first best solutions (in general
2 or 3) are generally best and stable for a large number of values of c
and
therefore will have large rectangles. In addition, these solutions are likely to
be stable for additional values of c
and therefore are likely to have horizontal
lines departing from the rectangles (from here the name 'cars'). Finally, local minor
solutions (which are associated with particular values of c
and k
) do not
generally present rectangles or lines and are shown with circles (from here the
name 'bikes').
carbikeplot(out, SpuriousSolutions = FALSE, trace = FALSE, ...)
carbikeplot(out, SpuriousSolutions = FALSE, trace = FALSE, ...)
out |
An S3 object of class |
SpuriousSolutions |
Wheather to include or not spurious solutions. By default spurios solutions are not included into the plot. |
trace |
Whether to print intermediate results. Default is |
... |
potential further arguments passed to lower level functions. |
FSDA team, [email protected]
Cerioli, A., Garcia-Escudero, L.A., Mayo-Iscar, A. and Riani M. (2017). Finding the Number of Groups in Model-Based Clustering via Constrained Likelihoods, Journal of Computational and Graphical Statistics, pp. 404-416, https://doi.org/10.1080/10618600.2017.1390469.
## Not run: ## Car-bike plot for the geyser data ======================== data(geyser2) out <- tclustIC(geyser2, whichIC="MIXMIX", plot=FALSE, alpha=0.1) ## Find the best solutions using as Information criterion MIXMIX print("Best solutions using MIXMIX") outMIXMIX <- tclustICsol(out, whichIC="MIXMIX", plot=FALSE, NumberOfBestSolutions=6) print(outMIXMIX$MIXMIXbs) carbikeplot(outMIXMIX) ## Car-bike plot for the flea data ========================== data(flea) Y <- as.matrix(flea[, 1:(ncol(flea)-1)]) # select only the numeric variables rownames(Y) <- 1:nrow(Y) head(Y) out <- tclustIC(Y, whichIC="CLACLA", plot=FALSE, alpha=0.1, nsamp=100) ## Find the best solutions using as Information criterion CLACLA print("Best solutions using CLACLA") outCLACLA <- tclustICsol(out,whichIC="CLACLA", plot=FALSE, NumberOfBestSolutions=66) ## Produce the car-bike plot carbikeplot(outCLACLA) ## End(Not run)
## Not run: ## Car-bike plot for the geyser data ======================== data(geyser2) out <- tclustIC(geyser2, whichIC="MIXMIX", plot=FALSE, alpha=0.1) ## Find the best solutions using as Information criterion MIXMIX print("Best solutions using MIXMIX") outMIXMIX <- tclustICsol(out, whichIC="MIXMIX", plot=FALSE, NumberOfBestSolutions=6) print(outMIXMIX$MIXMIXbs) carbikeplot(outMIXMIX) ## Car-bike plot for the flea data ========================== data(flea) Y <- as.matrix(flea[, 1:(ncol(flea)-1)]) # select only the numeric variables rownames(Y) <- 1:nrow(Y) head(Y) out <- tclustIC(Y, whichIC="CLACLA", plot=FALSE, alpha=0.1, nsamp=100) ## Find the best solutions using as Information criterion CLACLA print("Best solutions using CLACLA") outCLACLA <- tclustICsol(out,whichIC="CLACLA", plot=FALSE, NumberOfBestSolutions=66) ## Produce the car-bike plot carbikeplot(outCLACLA) ## End(Not run)
Provides a method for obtaining the maximum empirical efficiency
(in case of MM estimates) or maximum empirical breakdownplot (in case of S estimates) or
maximum subset size (in case of forward search),
using various measures of correlation between the n
Mahalanobis distances or residuals at
adjacent values of efficiecy, breakdown point or subset size.
corfwdplot(out, trace = FALSE, ...)
corfwdplot(out, trace = FALSE, ...)
out |
An object of S3 class returned by one of the estimation functions with the
monitoring option selected ( The needed elements of
|
trace |
Whether to print intermediate results. Default is |
... |
potential further arguments passed to lower level functions. |
A ggplot
plot object which can be printed on screen or to a file.
FSDA team, [email protected]
## Not run: data(hbk, package="robustbase") (out <- fsmult(hbk[,1:3], monitoring=TRUE)) corfwdplot(out) (out1 <- smult(hbk, monitoring=TRUE, trace=TRUE)) corfwdplot(out1) (out2 <- mmmult(hbk[,1:3], monitoring=TRUE, trace=TRUE)) corfwdplot(out2) (out3 <- fsreg(hbk[,1:3], hbk[,4], monitoring=TRUE, trace=TRUE, method="FS")) corfwdplot(out3) (out4 <- fsreg(hbk[,1:3], hbk[,4], monitoring=TRUE, trace=TRUE, method="S")) corfwdplot(out4) (out5 <- fsreg(hbk[,1:3], hbk[,4], monitoring=TRUE, trace=TRUE, method="MM")) corfwdplot(out5) ## End(Not run)
## Not run: data(hbk, package="robustbase") (out <- fsmult(hbk[,1:3], monitoring=TRUE)) corfwdplot(out) (out1 <- smult(hbk, monitoring=TRUE, trace=TRUE)) corfwdplot(out1) (out2 <- mmmult(hbk[,1:3], monitoring=TRUE, trace=TRUE)) corfwdplot(out2) (out3 <- fsreg(hbk[,1:3], hbk[,4], monitoring=TRUE, trace=TRUE, method="FS")) corfwdplot(out3) (out4 <- fsreg(hbk[,1:3], hbk[,4], monitoring=TRUE, trace=TRUE, method="S")) corfwdplot(out4) (out5 <- fsreg(hbk[,1:3], hbk[,4], monitoring=TRUE, trace=TRUE, method="MM")) corfwdplot(out5) ## End(Not run)
Plots the trajectories of the elements of the covariance (correlation) matrix monitored
covplot( out, xlim, ylim, xlab, ylab, main, lwd, lty, col, cex.lab, cex.axis, subsize, fg.thresh, fg.unit, fg.labstep, fg.lwd, fg.lty, fg.col, fg.mark, fg.cex, standard, fground, tag, datatooltip, trace = FALSE, ... )
covplot( out, xlim, ylim, xlab, ylab, main, lwd, lty, col, cex.lab, cex.axis, subsize, fg.thresh, fg.unit, fg.labstep, fg.lwd, fg.lty, fg.col, fg.mark, fg.cex, standard, fground, tag, datatooltip, trace = FALSE, ... )
out |
An object of S3 class The needed elements of
|
xlim |
Controls the |
ylim |
Controls the |
xlab |
A title for the x axis |
ylab |
A title for the y axis |
main |
An overall title for the plot |
lwd |
The line width, a positive number, defaulting to 1 |
lty |
The line type. Line types can either be specified as an integer (1=solid (default), 2=dashed, 3=dotted, 4=dotdash, 5=longdash, 6=twodash) or as one of the character strings "solid", "dashed", "dotted", "dotdash", "longdash", or "twodash". The latter two are not supported by Matlab. |
col |
Colors to be used for the highlighted units |
cex.lab |
The magnification to be used for x and y labels relative to the current setting of cex |
cex.axis |
The magnification to be used for axis annotation relative to the current setting of cex |
subsize |
Numeric vector containing the subset size with length equal to the number of columns of
matrix of mahalanobis distances. The default value of subsize is |
fg.thresh |
(alternative to fg.unit) numeric vector of length 1 or 2 which specifies
the highlighted trajectories.
If |
fg.unit |
(alternative to fg.thresh), vector containing the list of the units to be highlighted.
If |
fg.labstep |
numeric vector which specifies the steps of the search where to put labels for
the highlighted trajectories (units). The default is to put the labels at the
initial and final steps of the search. |
fg.lwd |
The line width for the highlighted trajectories (units). Default is 1. |
fg.lty |
The line type for the highlighted trajectories (units). Line types can either be specified as an integer (1=solid (default), 2=dashed, 3=dotted, 4=dotdash, 5=longdash, 6=twodash) or as one of the character strings "solid", "dashed", "dotted", "dotdash", "longdash", or "twodash". The latter two are not supported by Matlab. |
fg.col |
colors to be used for the highlighted units. |
fg.mark |
Controlls whether to plot highlighted trajectories as symbols.
if |
fg.cex |
Controls the font size of the labels of the trajectories in foreground. If
|
standard |
MATLAB-style arguments - appearance of the plot in terms of xlim, ylim, axes labels and their font size style, color of the lines, etc. |
fground |
MATLAB-style arguments - for the trajectories in foregroud. |
tag |
Plot handle. String which identifies the handle of the plot which is about to be created.
The default is |
datatooltip |
If datatooltip is not empty the user can use the mouse in order to have
information about the unit selected, the step in which the unit enters the search and
the associated label. If datatooltip is a list, it is possible to control the aspect
of the data cursor (see MATLAB function |
trace |
Whether to print intermediate results. Default is |
... |
potential further arguments passed to lower level functions. |
none
FSDA team, [email protected]
## Not run: X <- iris[,1:4] out <- fsmult(X, monitoring=TRUE) ## Produce monitoring covariances plot with all the default options covplot(out) ## End(Not run)
## Not run: X <- iris[,1:4] out <- fsmult(X, monitoring=TRUE) ## Produce monitoring covariances plot with all the default options covplot(out) ## End(Not run)
The diabetes dataset, introduced by Reaven and Miller (1979), consists of 145 observations (patients). For each patient three measurements are reported: plasma glucose response to oral glucose, plasma insulin response to oral glucose, degree of insulin resistance.
data("diabetes")
data("diabetes")
A data frame with the following variables:
Area under plasma glucose curve after a three hour oral glucose tolerance test (OGTT).
Area under plasma insulin curve after a three hour oral glucose tolerance test (OGTT).
Steady state plasma glucose.
The type of diabete: Normal
, Overt
, and Chemical
.
Reaven, G. M. and Miller, R. G. (1979). An attempt to define the nature of chemical diabetes using a multidimensional analysis. Diabetologia 16:17-24.
library(rrcov) data(diabetes) head(diabetes) plot(CovMcd(diabetes[, 1:3]), which="pairs", col=diabetes$class)
library(rrcov) data(diabetes) head(diabetes) plot(CovMcd(diabetes[, 1:3]), which="pairs", col=diabetes$class)
A data set containing 28 demographic variables for 341 municipalities in Emilia Romagna (an Italian region).
data(emilia2001)
data(emilia2001)
A data frame with 341 rows and 28 variables The variables are as follows:
less10: population aged less than 10
more75: population aged more than 75
single single-member families
divorced": divorsed
widows: widows and widowers
graduates: population aged more than 25 who are graduates
no_education: of those aged over 6 having no education
employed: activity rate
unemployed: unemployment rate
increase_popul: standardised natural increase in population
migration: standardised change in population due to migration
birth_92_94: average birth rate over 1992-94
fecundity: three-year average birth rate amongst women of child-bearing age
houses: occupied houses built since 1982
houses_2WCs: occupied houses with 2 or more WCs
houses_heating: occupied houses with fixed heating system
TV: TV licence holders
cars: number of cars for 100 inhabitants
luxury_cars: luxury cars
hotels: working in hotels and restaurants
banking: working in banking and finance
income: average declared income amongst those filing income tax returns
income_tax_returns: inhabitants filing income tax returns
factories: residents employed in factories and public services
factories_more10: employees employed in factories withy more tha 10 employees
factories_more50: employees employed in factories withy more tha 50 employees
artisanal: artisanal enterprises
entrepreneurs: enterpreneous and skilled self-employed among those of working age
@references Atkinson, A. C., Riani, M., and Cerioli, A. (2004). Exploring Multivariate Data with the Forward Search. Springer-Verlag, New York.
The fishery data consist of 677 transactions of a fishery product in Europe. For each transaction the Value in 1000 euro and the quantity in Tons are reported.
data(fishery)
data(fishery)
A data frame with 677 rows and 2 variables
Flea-beetle measurements
data(flea)
data(flea)
A data frame with 74 rows and 7 variables: six explanatory and one response variable - species
.
The variables are as follows:
tars1: width of the first joint of the first tarsus in microns (the sum of measurements for both tarsi)
tars2: the same for the second joint
head: the maximal width of the head between the external edges of the eyes in 0.01 mm
ade1: the maximal width of the aedeagus in the fore-part in microns
ade2: the front angle of the aedeagus ( 1 unit = 7.5 degrees)
ade3: the aedeagus width from the side in microns
species, which species is being examined - Concinna
, Heptapotamica
, Heikertingeri
A. A. Lubischew (1962), On the Use of Discriminant Functions in Taxonomy, Biometrics, 184 pp.455–477.
data(flea) head(flea)
data(flea) head(flea)
A data set on air pressure in the Alps and the boiling point of water (Weisberg, 1985). There are 17 observations on the boiling point of water at different pressures, obtained from measurements at a variety of elevations in the Alps. The purpose of the experiment was to allow prediction of pressure from boiling point, which is easily measured, and so to provide an estimate of altitude: the higher the altitude, the lower the pressure. The dataset is characterized by one clear outlier.
data(forbes)
data(forbes)
A data frame with 17 rows and 2 variables The variables are as follows:
x: boiling point
y: 100 x log(pressure)
Weisberg, S. (1985). Applied Linear Regression. Wiley, New York.
data(forbes) plot(y~x, data=forbes)
data(forbes) plot(y~x, data=forbes)
fsdalms
ObjectsAn object of class fsdalms.object
holds information about
the result of a call to fsreg
.
The object itself is basically a list
with the following
components:
rew |
If |
beta |
p-by-1 vector containing the estimated regression parameters. |
bs |
p x 1 vector containing the units forming subset associated with bLMS (bLTS). |
residuals |
residuals. |
scale |
scale estimate of the residuals. |
weights |
Vector like y containing weights. The elements of this vector are 0 or 1. These weights identify the h observations which are used to compute the final LTS (LMS) estimate. sum(weights)=h if there is not a perfect fit otherwise sum(weights) can be greater than h |
h |
The number of observations that have determined the LTS (LMS) estimator, i.e. the value of h. |
outliers |
vector containing the list of the units declared as outliers using confidence level specified in input scalar conflev. |
conflev |
confidence level which is used to declare outliers.
Remark: |
singsub |
Number of subsets wihtout full rank. Notice that if this number is greater than 0.1*(number of subsamples) a warning is produced |
X |
the data matrix X |
y |
the response vector y |
The object has class "fsdalms"
.
## Not run: data(hbk, package="robustbase") (out <- fsreg(Y~., data=hbk, method="LMS")) class(out) summary(out) ## End(Not run)
## Not run: data(hbk, package="robustbase") (out <- fsreg(Y~., data=hbk, method="LMS")) class(out) summary(out) ## End(Not run)
fsdalts
ObjectsAn object of class fsdalts.object
holds information about
the result of a call to fsreg
.
The object itself is basically a list
with the following
components:
rew |
If |
beta |
p-by-1 vector containing the estimated regression parameters. |
bs |
p x 1 vector containing the units forming subset associated with bLMS (bLTS). |
residuals |
residuals. |
scale |
scale estimate of the residuals. |
weights |
Vector like y containing weights. The elements of this vector are 0 or 1. These weights identify the h observations which are used to compute the final LTS (LMS) estimate. sum(weights)=h if there is not a perfect fit otherwise sum(weights) can be greater than h |
h |
The number of observations that have determined the LTS (LMS) estimator, i.e. the value of h. |
outliers |
vector containing the list of the units declared as outliers using confidence level specified in input scalar conflev. |
conflev |
confidence level which is used to declare outliers.
Remark: |
singsub |
Number of subsets wihtout full rank. Notice that if this number is greater than 0.1*(number of subsamples) a warning is produced |
X |
the data matrix X |
y |
the response vector y |
The object has class "fsdalts"
.
## Not run: data(hbk, package="robustbase") (out <- fsreg(Y~., data=hbk, method="LTS")) class(out) summary(out) ## End(Not run)
## Not run: data(hbk, package="robustbase") (out <- fsreg(Y~., data=hbk, method="LTS")) class(out) summary(out) ## End(Not run)
fsmeda.object
ObjectsAn object of class fsmeda.object
holds information about
the result of a call to fsmult
when called with parameter
monitoring=TRUE
.
The object itself is basically a list
with the following
components:
MAL
: n x (n-init+1) matrix containing the monitoring of
Each row represents the distance Mahalanobis distance for the corresponding unit.
BB
: n x (n-init+1) matrix containing the information about the units belonging
to the subset at each step of the forward search. The first column contains the
indexes of the units forming subset in the initial step and each further column -
the indexes of the units forming the corresponding step. The last column contains
the units forming subset in the final step (all units).
md
: n-by-1 vector containing the estimates of the robust
Mahalanobis distances (in squared units). This vector contains
the distances of each observation from the location of the data,
relative to the scatter matrix cov.
mmd
: (n-init) x 3 matrix. which contains the monitoring of minimum
MD or (m+1)th ordered MD at each step of
the forward search.
1st column = fwd search index (from init to n-1)
2nd column = minimum MD
3rd column = (m+1)th-ordered MD
msr
: (n-init+1) x 3 matrix which contains the monitoring of maximum
MD or m-th ordered MD at each step of the forward search.
1st column = fwd search index (from init to n)
2nd column = maximum MD
3rd column = mth-ordered MD
gap
: (n-init+1) x 3 matrix which contains the monitoring of the gap
(difference between minMD outside subset and max inside).
1st column = fwd search index (from init to n)
2nd column = min MD - max MD
3rd column = (m+1)th-ordered MD - mth ordered distance
Loc
: (n-init+1) x (p+1) matrix which contains the monitoring
of the estimated means at each step of the fwd search.
S2cov
: (n-init+1) x (p*(p+1)/2+1) matrix which contains the monitoring of the
of the elements of the covariance matrix in each step of the forward search.
1st column = fwd search index (from init to n)
2nd column = monitoring of S[1,1]
3rd column = monitoring of S[1,2]
...
last column = monitoring of S[p,p]
detS
: (n-init+1) x 2 matrix which contains the monitoring
of the determinant of the covariance matrix in each step of
the forward search.
Un
: (n-init)-by-11 matrix which contains the unit(s) included
in the subset at each step of the fwd search.
REMARK: in every step the new subset is compared with the
old subset. Un contains the unit(s) present in the new
subset but not in the old one. Un[1 ,2] for example contains
the unit included in step init+1.
Un[end, 2] contains the units included in the final step
of the search.
X
: the data matrix X.
The object has class "fsmeda"
.
## Not run: data(hbk, package="robustbase") (out <- fsmult(hbk[,1:3], monitoring=TRUE)) class(out) summary(out) ## End(Not run)
## Not run: data(hbk, package="robustbase") (out <- fsmult(hbk[,1:3], monitoring=TRUE)) class(out) summary(out) ## End(Not run)
The trajectories originate from many different random initial subsets and provide information on the presence of groups in the data. Groups are investigated by monitoring the minimum Mahalanobis distance outside the forward search subset.
fsmmmdrs( x, plot = FALSE, init, bsbsteps, nsimul = 200, nocheck = FALSE, numpool, cleanpool = FALSE, msg = FALSE, trace = FALSE, ... )
fsmmmdrs( x, plot = FALSE, init, bsbsteps, nsimul = 200, nocheck = FALSE, numpool, cleanpool = FALSE, msg = FALSE, trace = FALSE, ... )
x |
An n x p data matrix (n observations and p variables). Rows of x represent observations, and columns represent variables. Missing values (NA's) and infinite values (Inf's) are allowed, since observations (rows) with missing or infinite values will automatically be excluded from the computations. |
plot |
Plots the random starts minimum Mahalanobis distance with 1
If
Remark: the plot which is produced is very simple. In order to control a series of options
in this plot (including the y scale) and in order to connect it dynamically to the other
forward plots it is necessary to use function |
init |
Point where to start monitoring required diagnostics.
If |
bsbsteps |
A vector which specifies for which steps of the forward
search it is necessary to save the units forming subset for each
random start. if REMARK: The vector bsbsteps must contain numbers from init to n.
if |
nsimul |
Number of random starts. Default value is |
nocheck |
It controls whether to perform checks on matrix Y. If |
numpool |
If REMARK: up to R2013b, there was a limitation on the maximum number of cores that could be addressed by the parallel processing toolbox (8 and, more recently, 12). From R2014a, it is possible to run a local cluster of more than 12 workers. REMARK: Unless you adjust the cluster profile, the default maximum number of workers is the same as the number of computational (physical) cores on the machine. REMARK: In modern computers the number of logical cores is larger than the number of physical cores. By default, MATLAB is not using all logical cores because, normally, hyper-threading is enabled and some cores are reserved to this feature. REMARK: It is because of Remarks 3 that we have chosen as default value for numpool the number of physical cores rather than the number of logical ones. The user can increase the number of parallel pool workers allocated to the multiple start monitoring by:
Therefore, *if a parallel pool is not already open*, UserOption numpool (if set)
overwrites the number of workers set in the local/current profile. Similarly,
the number of workers in the local/current profile overwrites default value of
|
cleanpool |
Set cleanpool |
msg |
Level of output to sidplay. It controls whether to display or not messages
about random start progress. More precisely, if previous option REMARK: in order to create the progress bar when Error using ProgressBar (line 57) Do you have write permissions for C:/Program Files/MATLAB?" |
trace |
Whether to print intermediate results. Default is |
... |
potential further arguments passed to lower level functions. |
Returns an object of class fsmmmdrs.object
.
FSDA team, [email protected]
Atkinson, A.C., Riani, M., and Cerioli, A. (2006), Random Start Forward Searches with Envelopes for Detecting Clusters in Multivariate Data, in: Zani S., Cerioli A., Riani M., Vichi M., Eds., Data Analysis, Classification and the Forward Search, pp. 163-172, Springer Verlag.
Atkinson, A.C. and Riani, M., (2007), Exploratory Tools for Clustering Multivariate Data, Computational Statistics and Data Analysis, Vol. 52, pp. 272-285, doi:10.1016/j.csda.2006.12.034
Riani, M., Cerioli, A., Atkinson, A.C., Perrotta, D. and Torti, F. (2008), Fitting Mixtures of Regression Lines with the Forward Search, in: Mining Massive Data Sets for Security, F. Fogelman-Soulie et al. Eds., pp. 271-286, IOS Press.
## Not run: data(hbk, package="robustbase") out <- fsmmmdrs(hbk[,1:3]) class(out) summary(out) ## End(Not run)
## Not run: data(hbk, package="robustbase") out <- fsmmmdrs(hbk[,1:3]) class(out) summary(out) ## End(Not run)
fsmmmdrs.object
ObjectsAn object of class fsmmmdrs.object
holds information about
the result of a call to fsmmmdrs
.
The object itself is basically a list
with the following
components:
mmdrs
: Minimum Mahalanobis distance, (n-init) by (nsimul+1) matrix which contains the monitoring of minimum
Mahalanobis distance at each step of the forward search.
1st column = fwd search index (from init to n-1)
2nd column = minimum Mahalanobis distance for random start 1
3rd column ...
nsimul+1 column minimum Mahalanobis distance for random start nsimul
BBrs
: Units belonging to the subset at the steps specified by input option bsbsteps.
If bsbsteps=0
BBrs
has size n-by-(n-init+1)-by-nsimul
. In this
case BBrs[,,j]
with j=1, 2, ..., nsimul has the following structure:
1st row = has number 1 in correspondence of the steps in which unit 1 is included inside subset and a missing value for the other steps
...
(n-1)-th row = has number n-1 in correspondence of the steps in which unit n-1 is included inside subset and a missing value for the other steps
n-th row = has the number n in correspondence of the steps in which unit n is included inside subset and a missing value for the other steps
If, on the other hand, bsbsteps is a vector which specifies the steps of the search in which
it is necessary to store subset, BBrs has size n-by-length(bsbsteps)-by-nsimul
.
In other words, BBrs[,,j]
with j=1, 2, ..., nsimul
has the same structure
as before, but now contains just length(bsbsteps)
columns.
X
: the data matrix X.
The object has class "fsmmmdrs"
.
## Not run: data(hbk, package="robustbase") out <- fsmmmdrs(hbk[,1:3]) class(out) summary(out) ## End(Not run)
## Not run: data(hbk, package="robustbase") out <- fsmmmdrs(hbk[,1:3]) class(out) summary(out) ## End(Not run)
Gives an automatic outlier detection procedure in multivariate analysis and performs forward search in multivariate analysis with exploratory data
fsmult( x, bsb, monitoring = FALSE, crit = c("md", "biv", "uni"), rf = 0.95, init, plot = FALSE, bonflev, msg = TRUE, nocheck = FALSE, scaled = FALSE, trace = FALSE, ... )
fsmult( x, bsb, monitoring = FALSE, crit = c("md", "biv", "uni"), rf = 0.95, init, plot = FALSE, bonflev, msg = TRUE, nocheck = FALSE, scaled = FALSE, trace = FALSE, ... )
x |
An n x p data matrix (n observations and p variables). Rows of x represent observations, and columns represent variables. Missing values (NA's) and infinite values (Inf's) are allowed, since observations (rows) with missing or infinite values will automatically be excluded from the computations. |
bsb |
List of units forming the initial subset or size of the initial subset.
If Remark: if bsb is a vector, the option crit is ignored. |
monitoring |
Wheather to perform monitoring of Mahalanobis distances and other specific quantities |
crit |
If specified, the criterion to be used to initialize the search.
Remark: as the user can see the starting point of the search is not going to affect at all the results of the analysis. The user can explore this point with his own datasets. Remark: if |
rf |
Confidence level for bivariate ellipses. The default is 0.95. This option is useful only if |
init |
Point where to start monitoring required diagnostics. Note that if a vector |
plot |
Plots the minimum Mahalanobis distance. If
|
bonflev |
Option that might be used to identify extreme outliers when the distribution of
the data is strongly non normal. In these circumstances, the general signal detection rule
based on consecutive exceedances cannot be used. In this case
Default value is empty, which means to rely on general rules based on consecutive exceedances. |
msg |
It controls whether to display or not messages on the screen. If |
nocheck |
It controls whether to perform checks on matrix Y. If |
scaled |
Controls whether to monitor scaled Mahalanobis distances (only if |
trace |
Whether to print intermediate results. Default is |
... |
potential further arguments passed to lower level functions. |
Depending on the input parameter monitoring
, one of
the following objects will be returned:
FSDA team, [email protected]
Riani, M., Atkinson A.C., Cerioli A. (2009). Finding an unknown number of multivariate outliers. Journal of the Royal Statistical Society Series B, Vol. 71, pp. 201-221.
Cerioli A., Farcomeni A., Riani M., (2014). Strong consistency and robustness of the Forward Search estimator of multivariate location and scatter, Journal of Multivariate Analysis, Vol. 126, pp. 167-183, http://dx.doi.org/10.1016/j.jmva.2013.12.010.
Atkinson Riani and Cerioli (2004), Exploring multivariate data with the forward search Springer Verlag, New York.
## Not run: data(hbk, package="robustbase") (out <- fsmult(hbk[,1:3])) class(out) summary(out) ## Generate contaminated data (200,3) n <- 200 p <- 3 set.seed(123456) X <- matrix(rnorm(n*p), nrow=n) Xcont <- X Xcont[1:5, ] <- Xcont[1:5,] + 3 out1 <- fsmult(Xcont, trace=TRUE) # no plots (plot defaults to FALSE) names(out1) (out1 <- fsmult(Xcont, trace=TRUE, plot=TRUE)) # identical to plot=1 ## plot=1 - minimum MD with envelopes based on n observations ## and the scatterplot matrix with the outliers highlighted (out1 <- fsmult(Xcont, trace=TRUE, plot=1)) ## plot=2 - additional plots of envelope resuperimposition (out1 <- fsmult(Xcont, trace=TRUE, plot=2)) ## plots is a list: plots showing envelope superimposition in normal coordinates. (out1 <- fsmult(Xcont, trace=TRUE, plot=list(ncoord=1))) ## Choosing an initial subset formed by the three observations with ## the smallest Mahalanobis Distance. (out1 <- fsmult(Xcont, m0=5, crit="md", trace=TRUE)) ## fsmult() with monitoring (out2 <- fsmult(Xcont, monitoring=TRUE, trace=TRUE)) names(out2) ## Monitor the exceedances from m=200 without showing plots. n <- 1000 p <- 10 Y <- matrix(rnorm(10000), ncol=10) (out <- fsmult(Y, init=200)) ## Forgery Swiss banknotes examples. data(swissbanknotes) ## Monitor the exceedances of Minimum Mahalanobis Distance (out1 <- fsmult(swissbanknotes[101:200,], plot=1)) ## Control minimum and maximum on the x axis (out1 <- fsmult(swissbanknotes[101:200,], plot=list(xlim=c(60,90)))) ## Monitor the exceedances of Minimum Mahalanobis Distance using ## normal coordinates for mmd. (out1 <- fsmult(swissbanknotes[101:200,], plot=list(ncoord=1))) ## End(Not run)
## Not run: data(hbk, package="robustbase") (out <- fsmult(hbk[,1:3])) class(out) summary(out) ## Generate contaminated data (200,3) n <- 200 p <- 3 set.seed(123456) X <- matrix(rnorm(n*p), nrow=n) Xcont <- X Xcont[1:5, ] <- Xcont[1:5,] + 3 out1 <- fsmult(Xcont, trace=TRUE) # no plots (plot defaults to FALSE) names(out1) (out1 <- fsmult(Xcont, trace=TRUE, plot=TRUE)) # identical to plot=1 ## plot=1 - minimum MD with envelopes based on n observations ## and the scatterplot matrix with the outliers highlighted (out1 <- fsmult(Xcont, trace=TRUE, plot=1)) ## plot=2 - additional plots of envelope resuperimposition (out1 <- fsmult(Xcont, trace=TRUE, plot=2)) ## plots is a list: plots showing envelope superimposition in normal coordinates. (out1 <- fsmult(Xcont, trace=TRUE, plot=list(ncoord=1))) ## Choosing an initial subset formed by the three observations with ## the smallest Mahalanobis Distance. (out1 <- fsmult(Xcont, m0=5, crit="md", trace=TRUE)) ## fsmult() with monitoring (out2 <- fsmult(Xcont, monitoring=TRUE, trace=TRUE)) names(out2) ## Monitor the exceedances from m=200 without showing plots. n <- 1000 p <- 10 Y <- matrix(rnorm(10000), ncol=10) (out <- fsmult(Y, init=200)) ## Forgery Swiss banknotes examples. data(swissbanknotes) ## Monitor the exceedances of Minimum Mahalanobis Distance (out1 <- fsmult(swissbanknotes[101:200,], plot=1)) ## Control minimum and maximum on the x axis (out1 <- fsmult(swissbanknotes[101:200,], plot=list(xlim=c(60,90)))) ## Monitor the exceedances of Minimum Mahalanobis Distance using ## normal coordinates for mmd. (out1 <- fsmult(swissbanknotes[101:200,], plot=list(ncoord=1))) ## End(Not run)
fsmult.object
ObjectsAn object of class fsmult.object
holds information about
the result of a call to fsmult
.
The object itself is basically a list
with the following
components:
outliers |
kx1 vector containing the list of the k units declared as outliers or NULL if the sample is homogeneous. |
loc |
p-by-1 vector containing location of the data. |
cov |
p-by-p robust estimate of covariance matrix. |
md |
n-by-1 vector containing the estimates of the robust Mahalanobis distances (in squared units). This vector contains the distances of each observation from the location of the data, relative to the scatter matrix cov. |
mmd |
(n-init)-by-2 matrix. 1st col is the forward search index; 2nd col is the value of minimum Mahalanobis Distance in each step of the fwd search. |
Un |
(n-init)-by-11 matrix which contains the unit(s) included in the subset at each step of the fwd search. REMARK: in every step the new subset is compared with the old subset. Un contains the unit(s) present in the new subset but not in the old one. Un[1 ,2] for example contains the unit included in step init+1. Un[end, 2] contains the units included in the final step of the search. |
nout |
2 x 5 matrix containing the number of times mdr went out of particular quantiles. First row contains quantiles 1 99 99.9 99.99 99.999. Second row contains the frequency distribution. It is NULL if bonflev threshold is used. |
constr |
This output is produced only if the search found at a certain step is a non singular matrix X. In this case the search run in a constrained mode, that is including the units which produced a singular matrix in the last n-constr steps. out.constr is a vector which contains the list of units which produced a singular X matrix. |
X |
the data matrix X |
The object has class "fsmult"
.
## Not run: data(hbk, package="robustbase") (out <- fsmult(hbk[,1:3])) class(out) summary(out) ## End(Not run)
## Not run: data(hbk, package="robustbase") (out <- fsmult(hbk[,1:3])) class(out) summary(out) ## End(Not run)
FSR_control
object
Creates an object of class FSR_control
to be used with the fsreg()
function,
containing various control parameters.
FSR_control(intercept = TRUE, h, nsamp = 1000, lms = 1, init, nocheck = FALSE, bonflev = "", msg = TRUE, bsbmfullrank = TRUE, plot = FALSE, bivarfit = FALSE, multivarfit = FALSE, labeladd = FALSE, nameX, namey, ylim, xlim)
FSR_control(intercept = TRUE, h, nsamp = 1000, lms = 1, init, nocheck = FALSE, bonflev = "", msg = TRUE, bsbmfullrank = TRUE, plot = FALSE, bivarfit = FALSE, multivarfit = FALSE, labeladd = FALSE, nameX, namey, ylim, xlim)
intercept |
Indicator for constant term. Scalar. If |
h |
The number of observations that have determined the least trimmed squares
estimator, scalar. |
nsamp |
Number of subsamples which will be extracted to find the robust estimator,
scalar. If |
lms |
Criterion to use to find the initial subset to initialize the search
(LMS, LTS with concentration steps, LTS without concentration steps
or subset supplied directly by the user). The default value is 1
(Least Median of Squares is computed to initialize the search).
On the other hand, if the user wants to initialze the search with
LTS with all the default options for concentration steps then lms=2.
If the user wants to use LTS without concentration steps, lms can be
a scalar different from 1 or 2. If lms is a list it is possible
to control a series of options for concentration steps (for more
details see option
|
init |
Search initialization, scalar. It specifies the initial subset size to
start monitoring exceedances of minimum deletion residual, if init is
not specified it set equal to: |
nocheck |
Check input arguments, scalar. If |
bonflev |
Option to be used if the distribution of the data is strongly non normal and, thus, the general signal detection rule based on consecutive exceedances cannot be used. In this case bonflev can be:
Default value is ”, which means to rely on general rules based on consecutive exceedances. |
msg |
Controls whether to display or not messages on the screen If |
bsbmfullrank |
How to behave in case subset at step m (say bsbm) produces a singular X.
In other words, this options controls what to do when |
plot |
Plot on the screen. Scalar. If |
bivarfit |
Wheather to superimpose bivariate least square lines on the plot (if |
multivarfit |
Wheather to superimpose multivariate least square lines.
This option adds one or more least square lines, based on MULTIVARIATE REGRESSION
of y on X, to the plots of y|Xi.
The default is |
labeladd |
Add outlier labels in plot. If |
nameX |
Add variable labels in plot. A vector of strings of length |
namey |
Add response label. A string containing the label of the response |
ylim |
Control |
xlim |
Control |
Creates an object of class FSR_control
to be used with the fsreg()
function,
containing various control parameters.
An object of class "FSR_control"
which is basically a
list
with components the input arguments of
the function mapped accordingly to the corresponding Matlab function.
FSDA team
See Also Sreg_control
, MMreg_control
, LXS_control
,
FSReda_control
, Sregeda_control
and MMregeda_control
.
## Not run: data(hbk, package="robustbase") (out <- fsreg(Y~., data=hbk, method="FS", control=FSR_control(h=56, nsamp=500, lms=2))) summary(out) ## End(Not run)
## Not run: data(hbk, package="robustbase") (out <- fsreg(Y~., data=hbk, method="FS", control=FSR_control(h=56, nsamp=500, lms=2))) summary(out) ## End(Not run)
fsr
ObjectsAn object of class fsr.object
holds information about
the result of a call to fsreg
.
The object itself is basically a list
with the following
components:
beta |
p-by-1 vector containing the estimated regression parameters (in step n-k). |
scale |
scalar containing the estimate of the scale (sigma). |
residuals |
residuals. |
fittedvalues |
fitted values. |
outliers |
kx1 vector containing the list of the k units declared as outliers or NULL if the sample is homogeneous. |
mdr |
(n-init) x 2 matrix 1st col = fwd search index, 2nd col = value of minimum deletion residual in each step of the fwd search |
Un |
(n-init) x 11 matrix which contains the unit(s) included in the subset at each step of the fwd search. REMARK: in every step the new subset is compared with the old subset. Un contains the unit(s) present in the new subset but not in the old one. Un(1,2) for example contains the unit included in step init+1. Un(end,2) contains the units included in the final step of the search. |
nout |
2 x 5 matrix containing the number of times mdr went out of particular quantiles. First row contains quantiles 1 99 99.9 99.99 99.999. Second row contains the frequency distribution. |
constr |
This output is produced only if the search found at a certain step is a non singular matrix X. In this case the search run in a constrained mode, that is including the units which produced a singular matrix in the last n-constr steps. out.constr is a vector which contains the list of units which produced a singular X matrix. |
X |
the data matrix X |
y |
the response vector y |
The object has class "fsr"
.
## Not run: data(hbk, package="robustbase") (out <- fsreg(Y~., data=hbk, method="FS")) class(out) summary(out) ## End(Not run)
## Not run: data(hbk, package="robustbase") (out <- fsreg(Y~., data=hbk, method="FS")) class(out) summary(out) ## End(Not run)
An automatic outlier detection procedure in linear regression
fsrbase(x, ...) ## S3 method for class 'formula' fsrbase(formula, data, subset, weights, na.action, model = TRUE, x.ret = FALSE, y.ret = FALSE, contrasts = NULL, offset, ...) ## Default S3 method: fsrbase(x, y, bsb, intercept = TRUE, monitoring = FALSE, control, trace = FALSE, ...)
fsrbase(x, ...) ## S3 method for class 'formula' fsrbase(formula, data, subset, weights, na.action, model = TRUE, x.ret = FALSE, y.ret = FALSE, contrasts = NULL, offset, ...) ## Default S3 method: fsrbase(x, y, bsb, intercept = TRUE, monitoring = FALSE, control, trace = FALSE, ...)
formula |
a |
data |
data frame from which variables specified in
|
subset |
an optional vector specifying a subset of observations to be used in the fitting process. |
weights |
an optional vector of weights to be used in the fitting process. NOT USED YET. |
na.action |
a function which indicates what should happen
when the data contain |
model , x.ret , y.ret
|
|
contrasts |
an optional list. See the |
offset |
this can be used to specify an a priori
known component to be included in the linear predictor
during fitting. An |
x |
Predictor variables. Matrix. Matrix of explanatory
variables (also called 'regressors') of dimension n x (p-1)
where p denotes the number of explanatory variables
including the intercept.
Rows of X represent observations, and columns represent
variables. By default, there is a constant term in the
model, unless you explicitly remove it using input option
|
y |
Response variable. Vector. Response variable, specified as a vector of length n, where n is the number of observations. Each entry in y is the response for the corresponding row of X. Missing values (NA's) and infinite values (Inf's) are allowed, since observations (rows) with missing or infinite values will automatically be excluded from the computations. |
bsb |
Initial subset - vector of indices. If |
intercept |
Indicator for constant term. Scalar. If |
monitoring |
wheather to perform monitoring for several quantities
in each step of the forward search. Deafault is |
control |
A control object (S3) containing estimation options, as returned
by |
trace |
Whether to print intermediate results. Default is |
... |
Potential further optional arguments, see the help of the function |
Depending on the input parameter monitoring
, one of
the following objects will be returned:
FSDA team
Riani, M., Atkinson A.C., Cerioli A. (2009). Finding an unknown number of multivariate outliers. Journal of the Royal Statistical Society Series B, Vol. 71, pp. 201-221.
## Not run: n <- 200 p <- 3 X <- matrix(data=rnorm(n*p), nrow=n, ncol=p) y <- matrix(data=rnorm(n*1), nrow=n, ncol=1) (out = fsrbase(X, y)) ## Now we use the formula interface: (out1 = fsrbase(y~X, control=FSR_control(plot=FALSE))) ## Or use the variables in a data frame (out2 = fsrbase(Y~., data=hbk, control=FSR_control(plot=FALSE))) ## let us compare to the LTS solution (out3 = ltsReg(Y~., data=hbk)) ## Now compute the model without intercept (out4 = fsrbase(Y~.-1, data=hbk, control=FSR_control(plot=FALSE))) ## And compare again with the LTS solution (out5 = ltsReg(Y~.-1, data=hbk)) ## using default (optional arguments) (out6 = fsrbase(Y~.-1, data=hbk, control=FSR_control(plot=FALSE, nsamp=1500, h=50))) ## End(Not run)
## Not run: n <- 200 p <- 3 X <- matrix(data=rnorm(n*p), nrow=n, ncol=p) y <- matrix(data=rnorm(n*1), nrow=n, ncol=1) (out = fsrbase(X, y)) ## Now we use the formula interface: (out1 = fsrbase(y~X, control=FSR_control(plot=FALSE))) ## Or use the variables in a data frame (out2 = fsrbase(Y~., data=hbk, control=FSR_control(plot=FALSE))) ## let us compare to the LTS solution (out3 = ltsReg(Y~., data=hbk)) ## Now compute the model without intercept (out4 = fsrbase(Y~.-1, data=hbk, control=FSR_control(plot=FALSE))) ## And compare again with the LTS solution (out5 = ltsReg(Y~.-1, data=hbk)) ## using default (optional arguments) (out6 = fsrbase(Y~.-1, data=hbk, control=FSR_control(plot=FALSE, nsamp=1500, h=50))) ## End(Not run)
FSReda_control
object
Creates an object of class FSReda_control
to be used with the fsreg()
function,
containing various control parameters.
FSReda_control(intercept = TRUE, init, nocheck = FALSE, tstat = c("trad", "scal"), conflev = c(0.95, 0.99))
FSReda_control(intercept = TRUE, init, nocheck = FALSE, tstat = c("trad", "scal"), conflev = c(0.95, 0.99))
intercept |
Indicator for constant term. Scalar. If |
init |
Search initialization, scalar. It specifies the initial subset size to
start monitoring exceedances of minimum deletion residual, if init is
not specified it set equal to: |
nocheck |
Check input arguments, scalar. If |
tstat |
The kind of t-statistics which have to be monitored.
|
conflev |
Confidence level which is used to declare units as outliers. Usually conflev=0.95, 0.975, 0.99 (individual alpha) or conflev=1-0.05/n, 1-0.025/n, 1-0.01/n (simultaneous alpha). Default value is 0.975. |
Creates an object of class FSReda_control
to be used with the fsreg()
function,
containing various control parameters.
An object of class "FSReda_control"
which is basically a
list
with components the input arguments of
the function mapped accordingly to the corresponding Matlab function.
FSDA team
See Also as FSR_control
, MMreg_control
and LXS_control
## Not run: data(hbk, package="robustbase") (out <- fsreg(Y~., data=hbk, method="FS", monitoring=TRUE, control=FSReda_control(tstat="scal"))) ## End(Not run)
## Not run: data(hbk, package="robustbase") (out <- fsreg(Y~., data=hbk, method="FS", monitoring=TRUE, control=FSReda_control(tstat="scal"))) ## End(Not run)
fsreda
ObjectsAn object of class fsreda.object
holds information about
the result of a call to fsreg
.
The object itself is basically a list
with the following
components:
RES |
n x (n-init+1) matrix containing the monitoring of scaled residuals: the first row is the residual for the first unit, ..., n-th row is the residual for the n-th unit. |
LEV |
(n+1) x (n-init+1) matrix containing the monitoring of leverage: the first row is the leverage for the first unit, ..., n-th row is the leverage for the n-th unit. |
BB |
n x (n-init+1) matrix containing the information about the units belonging to the subset at each step of the forward search: first col contains indexes of the units forming subset in the initial step; ...; last column contains units forming subset in the final step (all units). |
mdr |
n-init x 3 matrix which contains the monitoring of minimum deletion residual or (m+1)-ordered residual at each step of the forward search: first col is the fwd search index (from init to n-1); 2nd col = minimum deletion residual; 3rd col = (m+1)-ordered residual. Remark: these quantities are stored with sign, that is the min deletion residual is stored with negative sign if it corresponds to a negative residual. |
msr |
n-init+1 x 3 matrix which contains the monitoring of maximum studentized residual or m-th ordered residual: first col is the fwd search index (from init to n); 2nd col = maximum studentized residual; 3rd col = (m)-ordered studentized residual. |
nor |
(n-init+1) x 4 matrix containing the monitoring of normality test in each step of the forward search: first col = fwd search index (from init to n); 2nd col = Asymmetry test; 3rd col = Kurtosis test; 4th col = Normality test. |
Bols |
(n-init+1) x (p+1) matrix containing the monitoring of estimated beta coefficients in each step of the forward search. |
S2 |
(n-init+1) x 5 matrix containing the monitoring of S2 or R2 and F-test in each step of the forward search:
In this case the estimated of s2 at step m is divided by the consistency factor (to make the estimate asymptotically unbiased). |
coo |
(n-init+1) x 3 matrix containing the monitoring of Cook or modified Cook distance in each step of the forward search:
|
Tols |
(n-init+1) x (p+1) matrix containing the monitoring of estimated t-statistics (as specified in option input 'tstat') in each step of the forward search |
Un |
(n-init) x 11 matrix which contains the unit(s) included in the subset at each step of the fwd search. REMARK: in every step the new subset is compared with the old subset. Un contains the unit(s) present in the new subset but not in the old one Un(1,2) for example contains the unit included in step init+1 Un(end,2) contains the units included in the final step of the search. |
betaINT |
Confidence intervals for the elements of β. betaINT is a (n-init+1)-by-2*length(confint)-by-p 3D array. Each third dimension refers to an element of beta:
The first two columns contain the lower and upper confidence
limits associated with conflev(1). Columns three and four contain
the lower and upper confidence limits associated with conflev(2); ...;
The last two columns contain the lower and upper confidence
limits associated with conflev(end).
For example |
sigma2INT |
confidence interval for s2.
|
X |
the data matrix X |
y |
the response vector y |
The object has class "fsreda"
.
## Not run: data(hbk, package="robustbase") (out <- fsreg(Y~., data=hbk, method="FS", monitoring=TRUE)) class(out) summary(out) ## End(Not run)
## Not run: data(hbk, package="robustbase") (out <- fsreg(Y~., data=hbk, method="FS", monitoring=TRUE)) class(out) summary(out) ## End(Not run)
An automatic outlier detection procedure in linear regression
fsreg(x, ...) ## S3 method for class 'formula' fsreg(formula, data, subset, weights, na.action, model = TRUE, x.ret = FALSE, y.ret = FALSE, contrasts = NULL, offset, ...) ## Default S3 method: fsreg(x, y, bsb, intercept = TRUE, family = c("homo", "hetero", "bayes"), method = c("FS", "S", "MM", "LTS", "LMS"), monitoring = FALSE, control, trace = FALSE, ...)
fsreg(x, ...) ## S3 method for class 'formula' fsreg(formula, data, subset, weights, na.action, model = TRUE, x.ret = FALSE, y.ret = FALSE, contrasts = NULL, offset, ...) ## Default S3 method: fsreg(x, y, bsb, intercept = TRUE, family = c("homo", "hetero", "bayes"), method = c("FS", "S", "MM", "LTS", "LMS"), monitoring = FALSE, control, trace = FALSE, ...)
formula |
a |
data |
data frame from which variables specified in
|
subset |
an optional vector specifying a subset of observations to be used in the fitting process. |
weights |
an optional vector of weights to be used in the fitting process. NOT USED YET. |
na.action |
a function which indicates what should happen
when the data contain |
model , x.ret , y.ret
|
|
contrasts |
an optional list. See the |
offset |
this can be used to specify an a priori
known component to be included in the linear predictor
during fitting. An |
family |
family of robust regression models, can be 'homo' for
homoscedastic (same variance) regression model, 'hetero' for
heteroskedastic regression model or 'bayes' Bayesian
linear regression. The default is |
method |
robust regression estimation model, can be 'FS' for
Forward search, 'S' for S regression, 'MM' for MM regression, 'LMS' or 'LTS'.
The default is |
monitoring |
wheather to perform monitoring for several quantities
in each step of the forward search or for series of values of the
breakdown point in case of S estimates or for series of values of the
efficiency in case of MM estimates. Deafault is |
y |
Response variable. Vector. Response variable, specified as a vector of length n, where n is the number of observations. Each entry in y is the response for the corresponding row of X. Missing values (NA's) and infinite values (Inf's) are allowed, since observations (rows) with missing or infinite values will automatically be excluded from the computations. |
x |
Predictor variables. Matrix. Matrix of explanatory
variables (also called 'regressors') of dimension n x (p-1)
where p denotes the number of explanatory variables
including the intercept.
Rows of X represent observations, and columns represent
variables. By default, there is a constant term in the
model, unless you explicitly remove it using input option
|
bsb |
Initial subset - vector of indices. If |
intercept |
Indicator for constant term. Scalar. If |
control |
A control object (S3) containing estimation options. If the control object is supplied, the parameters from it will be used. If parameters are passed also in the invocation statement, they will override the corresponding elements of the control object. |
trace |
Whether to print intermediate results. Default is |
... |
potential further arguments passed to lower level functions. |
Depending on the input parameters family
and method
, one of
the following objects will be returned:
FSDA team
Riani, M., Atkinson A.C., Cerioli A. (2009). Finding an unknown number of multivariate outliers. Journal of the Royal Statistical Society Series B, Vol. 71, pp. 201-221.
## Not run: library(robustbase) n <- 200 p <- 3 X <- matrix(data=rnorm(n*p), nrow=n, ncol=p) y <- matrix(data=rnorm(n*1), nrow=n, ncol=1) (out = fsreg(X, y)) ## Now we use the formula interface: (out1 = fsreg(y~X, control=FSR_control(plot=FALSE))) ## Or use the variables in a data frame (out2 = fsreg(Y~., data=hbk, control=FSR_control(plot=FALSE))) ## let us compare to the LTS solution library(robustbase) (out3 = ltsReg(Y~., data=hbk)) ## Now compute the model without intercept (out4 = fsreg(Y~.-1, data=hbk, control=FSR_control(plot=FALSE))) ## And compare again with the LTS solution (out5 = ltsReg(Y~.-1, data=hbk)) ## using default (optional arguments) (out6 = fsreg(Y~.-1, data=hbk, control=FSR_control(plot=FALSE, nsamp=1500, h=50))) ## End(Not run)
## Not run: library(robustbase) n <- 200 p <- 3 X <- matrix(data=rnorm(n*p), nrow=n, ncol=p) y <- matrix(data=rnorm(n*1), nrow=n, ncol=1) (out = fsreg(X, y)) ## Now we use the formula interface: (out1 = fsreg(y~X, control=FSR_control(plot=FALSE))) ## Or use the variables in a data frame (out2 = fsreg(Y~., data=hbk, control=FSR_control(plot=FALSE))) ## let us compare to the LTS solution library(robustbase) (out3 = ltsReg(Y~., data=hbk)) ## Now compute the model without intercept (out4 = fsreg(Y~.-1, data=hbk, control=FSR_control(plot=FALSE))) ## And compare again with the LTS solution (out5 = ltsReg(Y~.-1, data=hbk)) ## using default (optional arguments) (out6 = fsreg(Y~.-1, data=hbk, control=FSR_control(plot=FALSE, nsamp=1500, h=50))) ## End(Not run)
The transformations for negative and positive responses were determined
by Yeo and Johnson (2000) by imposing the smoothness condition that the second
derivative of zYJ () with respect to y be smooth at y = 0. However some authors,
for example Weisberg (2005), query the physical interpretability of this constraint
which is oftern violated in data analysis. Accordingly, Atkinson et al. (2019) and (2020)
extend the Yeo-Johnson transformation to allow two values of the transformations
parameter:
for negative observations and
for non-negative ones.
FSRfan monitors:
the t test associated with the constructed variable computed assuming the same transformation parameter for positive and negative observations fixed. In short we call this test, "global score test for positive observations".
the t test associated with the constructed variable computed assuming a different transformation for positive observations keeping the value of the transformation parameter for negative observations fixed. In short we call this test, "test for positive observations".
the t test associated with the constructed variable computed assuming a different transformation for negative observations keeping the value of the transformation parameter for positive observations fixed. In short we call this test, "test for negative observations".
the F test for the joint presence of the two constructed variables described in points 2) and 3).
the F likelihood ratio test based on the MLE of and
.
In this case the residual sum of squares of the null model bsaed on a single trasnformation
parameter
is compared with the residual sum of squares of the model based
on data transformed data using MLE of
and
.
fsrfan(x, ...) ## S3 method for class 'formula' fsrfan( formula, data, subset, weights, na.action, model = TRUE, x.ret = FALSE, y.ret = FALSE, contrasts = NULL, offset, ... ) ## Default S3 method: fsrfan( x, y, intercept = TRUE, plot = FALSE, family = c("BoxCox", "YJ", "YJpn", "YJall"), la = c(-1, -0.5, 0, 0.5, 1), lms, alpha = 0.75, h, init, msg = FALSE, nocheck = FALSE, nsamp = 1000, conflev = 0.99, xlab, ylab, main, xlim, ylim, lwd = 2, lwd.env = 1, trace = FALSE, ... ) ## S3 method for class 'fsrfan' plot( x, conflev = 0.99, xlim, ylim, xlab = "Subset of size m", ylab = "Score test statistic", main = "Fan plot", col, lty, lwd = 2.5, lwd.env = 1, ... )
fsrfan(x, ...) ## S3 method for class 'formula' fsrfan( formula, data, subset, weights, na.action, model = TRUE, x.ret = FALSE, y.ret = FALSE, contrasts = NULL, offset, ... ) ## Default S3 method: fsrfan( x, y, intercept = TRUE, plot = FALSE, family = c("BoxCox", "YJ", "YJpn", "YJall"), la = c(-1, -0.5, 0, 0.5, 1), lms, alpha = 0.75, h, init, msg = FALSE, nocheck = FALSE, nsamp = 1000, conflev = 0.99, xlab, ylab, main, xlim, ylim, lwd = 2, lwd.env = 1, trace = FALSE, ... ) ## S3 method for class 'fsrfan' plot( x, conflev = 0.99, xlim, ylim, xlab = "Subset of size m", ylab = "Score test statistic", main = "Fan plot", col, lty, lwd = 2.5, lwd.env = 1, ... )
x |
An Missing values (NA's) and infinite values (Inf's) are allowed, since observations (rows) with missing or infinite values will automatically be excluded from the computations. |
... |
potential further arguments passed to lower level functions. |
formula |
a |
data |
data frame from which variables specified in
|
subset |
an optional vector specifying a subset of observations to be used in the fitting process. |
weights |
an optional vector of weights to be used NOT USED YET. |
na.action |
a function which indicates what should happen
when the data contain |
model |
|
x.ret |
|
y.ret |
|
contrasts |
an optional list. See the |
offset |
this can be used to specify an a priori
known component to be included in the linear predictor
during fitting. An |
y |
Response variable. A vector with |
intercept |
wheather to use constant term (default is |
plot |
If |
family |
string which identifies the family of transformations which must be used. Possible values are
|
la |
values of the transformation parameter for which it is necessary
to compute the score test. Default value of lambda is
|
lms |
how to find the initlal subset to initialize the search. If |
alpha |
the percentage (roughly) of squared residuals whose sum will be minimized,
by default |
h |
The number of observations that have determined the least trimmed squares
estimator, scalar. |
init |
Search initialization. It specifies the initial subset size to start
monitoring the value of the score test. If |
msg |
Controls whether to display or not messages on the screen. If |
nocheck |
Whether to check input arguments. If |
nsamp |
number of subsamples which will be extracted to find the robust estimator. If Remark: if the number of all possible subset is <1000 the default is to extract all subsets
otherwise just 1000. If |
conflev |
Confidence level for the bands (default is 0.99, that is, we plot two horizontal lines corresponding to values -2.58 and 2.58). |
xlab |
A label for the X-axis, default is 'Subset size m' |
ylab |
A label for the Y-axis, default is 'Score test statistic' |
main |
A label for the title, default is 'Fan plot' |
xlim |
Minimum and maximum for the X-axis |
ylim |
Minimum and maximum for the Y-axis |
lwd |
The line width of the curves which contain the score test, a positive number, default is |
lwd.env |
The line width of the lines associated with the envelopes, a positive number, default is |
trace |
Whether to print intermediate results. Default is |
col |
a vector specifying the colors for the lines, each
one corresponding to a |
lty |
a vector specifying the line types for the lines, each
one corresponding to a |
An S3 object of class fsrfan.object
will be returned which is basically a list
containing the following elements:
la
: vector containing the values of lambda for which fan plot is constructed
bs
: matrix of size p X length(la)
containing the units forming
the initial subset for each value of lambda
Score
: a matrix containing the values of the score test for
each value of the transformation parameter:
1st col = fwd search index;
2nd col = value of the score test in each step of the fwd search for la[1]
...
Scorep
: matrix containing the values of the score test for positive
observations for each value of the transformation parameter.
Note: this output is present only if input option family='YJpn'
or family='YJall'
.
Scoren
: matrix containing the values of the score test for negative observations
for each value of the transformation parameter.
Note: this output is present only if input option 'family' is 'YJpn' or 'YJall'.
Scoreb
: matrix containing the values of the score test for the joint
presence of both constructed variables (associated with positive and negative
observations) for each value of the transformation parameter. In this case
the reference distribution is the with 2 and
subset_size - p
degrees of freedom.
Note: this output is present only if input option family='YJall'
.
Un
: a three-dimensional array containing length(la)
matrices of
size retnUn=(n-init) X retpUn=11
. Each matrix contains
the unit(s) included in the subset at each step in the search associated
with the corresponding element of la
.
REMARK: at each step the new subset is compared with the old subset.
Un
contains the unit(s) present in the new subset but not in the old one.
FSDA team, [email protected]
Atkinson, A.C. and Riani, M. (2000), Robust Diagnostic Regression Analysis Springer Verlag, New York.
Atkinson, A.C. and Riani, M. (2002), Tests in the fan plot for robust, diagnostic transformations in regression, Chemometrics and Intelligent Laboratory Systems, 60, pp. 87–100.
Atkinson, A.C. Riani, M. and Corbellini A. (2019), The analysis of transformations for profit-and-loss data, Journal of the Royal Statistical Society, Series C, "Applied Statistics", 69, pp. 251–275. doi:10.1111/rssc.12389
Atkinson, A.C. Riani, M. and Corbellini A. (2021), The Box-Cox Transformation: Review and Extensions, Statistical Science, 36(2), pp. 239–255. doi:10.1214/20-STS778.
## Not run: data(wool) XX <- wool y <- XX[, ncol(XX)] X <- XX[, 1:(ncol(XX)-1), drop=FALSE] out <- fsrfan(X, y) # call 'fsrfan' with all default parameters out <- fsrfan(cycles~., data=wool) # use the formula interface set.seed(10) out <- fsrfan(cycles~., data=wool, plot=TRUE) # call 'fsrfan' and produce the plot plot(out) # use the plot method on the fsrfan object plot(out, conflev=c(0.9, 0.95, 0.99)) # change the confidence leel in the plot method ##====================== ## ## fsrfan() with all default options. ## Store values of the score test statistic for the five most common ## values of $\lambda$. Produce also a fan plot and display it on the screen. ## Common part to all examples: load 'wool' data set. data(wool) head(wool) dim(wool) ## The function fsrfan() stores the score test statistic. ## In this case we use the five most common values of lambda are considered out <- fsrfan(cycles~., data=wool) plot(out) ## fanplot(out) # Not yet implemented in fsdaR ## The fan plot shows the log transformation is diffused throughout the data ## and does not depend on the presence of particular observations. ##====================== ## ## Example specifying 'lambda'. ## Produce a fan plot for each value of 'lambda' in the vector 'la'. ## Extract in matrix 'Un' the units which entered the search in each step data(wool) out <- fsrfan(cycles~., data=wool, la=c(-1, -0.5, 0, 0.5), plot=TRUE) plot(out) out$Un[,2,] ##====================== ## Example specifying the confidence level and the initial starting point for monitoring. ## Construct the fan plot specifying the confidence level and the initial starting point ## for monitoring. data(wool) out <- fsrfan(cycles~., data=wool, init=ncol(wool)+1, nsamp=0, conflev=0.95, plots=TRUE) plot(out, conflev=0.95) ##===================== ## Example with starting point based on LTS. ## Extract all subsamples, construct a fan plot specifying the confidence level ## and the initial starting point for monitoring based on p+2 observations, ## strong line width for lines associated with the confidence bands. data(wool) out <- fsrfan(cycles~., data=wool, init=ncol(wool)+1, nsamp=0, lms=0, lwd.env=3, plot=TRUE) plot(out, lwd.env=3) ##===================== ## Fan plot using the loyalty cards data. ## In this example, 'la' is the vector contanining the most common values ## of the transformation parameter. ## Store the score test statistics for the specified values of lambda ## and automatically produce the fan plot data(loyalty) head(loyalty) dim(loyalty) ## la is a vector contanining the most common values of the transformation parameter out <- fsrfan(amount_spent~., data=loyalty, la=c(0, 1/3, 0.4, 0.5), init=ncol(loyalty)+1, plot=TRUE, lwd=3) plot(out, lwd=3) ## The fan plot shows that even if the third root is the best value of the transformation ## parameter at the end of the search, in earlier steps it lies very close to the upper ## rejection region. The best value of the transformation parameter seems to be the one ## associated with la=0.4, which is always the confidence bands but at the end of search, ## due to the presence of particular observations it goes below the lower rejection line. ##===================== ## Compare BoxCox with Yeo and Johnson transformation. ## Store values of the score test statistic for the five most common ## values of lambda. Produce also a fan plot and display it on the screen. ## Common part to all examples: load wool dataset. data(wool) ## Store the score test statistic using Box Cox transformation. outBC <- fsrfan(cycles~., data=wool, nsamp=0) ## Store the score test statistic using Yeo and Johnson transformation. outYJ <- fsrfan(cycles~., data=wool, family="YJ", nsamp=0) ## Not yet fully implemented ## fanplot(outBC, main="Box Cox") ## fanplot(outYJ,main="Yeo and Johnson") plot(outBC, main="Box Cox") plot(outYJ, main="Yeo and Johnson") cat("\nMaximum difference in absolute value: ", max(max(abs(outYJ$Score - outBC$Score), na.rm=TRUE)), "\n") ##====================== ## Call 'fsrfan' with Yeo-Johnson (YJ) transformation out <- fsrfan(cycles~., data=wool, family="YJ") plot(out) ## End(Not run)
## Not run: data(wool) XX <- wool y <- XX[, ncol(XX)] X <- XX[, 1:(ncol(XX)-1), drop=FALSE] out <- fsrfan(X, y) # call 'fsrfan' with all default parameters out <- fsrfan(cycles~., data=wool) # use the formula interface set.seed(10) out <- fsrfan(cycles~., data=wool, plot=TRUE) # call 'fsrfan' and produce the plot plot(out) # use the plot method on the fsrfan object plot(out, conflev=c(0.9, 0.95, 0.99)) # change the confidence leel in the plot method ##====================== ## ## fsrfan() with all default options. ## Store values of the score test statistic for the five most common ## values of $\lambda$. Produce also a fan plot and display it on the screen. ## Common part to all examples: load 'wool' data set. data(wool) head(wool) dim(wool) ## The function fsrfan() stores the score test statistic. ## In this case we use the five most common values of lambda are considered out <- fsrfan(cycles~., data=wool) plot(out) ## fanplot(out) # Not yet implemented in fsdaR ## The fan plot shows the log transformation is diffused throughout the data ## and does not depend on the presence of particular observations. ##====================== ## ## Example specifying 'lambda'. ## Produce a fan plot for each value of 'lambda' in the vector 'la'. ## Extract in matrix 'Un' the units which entered the search in each step data(wool) out <- fsrfan(cycles~., data=wool, la=c(-1, -0.5, 0, 0.5), plot=TRUE) plot(out) out$Un[,2,] ##====================== ## Example specifying the confidence level and the initial starting point for monitoring. ## Construct the fan plot specifying the confidence level and the initial starting point ## for monitoring. data(wool) out <- fsrfan(cycles~., data=wool, init=ncol(wool)+1, nsamp=0, conflev=0.95, plots=TRUE) plot(out, conflev=0.95) ##===================== ## Example with starting point based on LTS. ## Extract all subsamples, construct a fan plot specifying the confidence level ## and the initial starting point for monitoring based on p+2 observations, ## strong line width for lines associated with the confidence bands. data(wool) out <- fsrfan(cycles~., data=wool, init=ncol(wool)+1, nsamp=0, lms=0, lwd.env=3, plot=TRUE) plot(out, lwd.env=3) ##===================== ## Fan plot using the loyalty cards data. ## In this example, 'la' is the vector contanining the most common values ## of the transformation parameter. ## Store the score test statistics for the specified values of lambda ## and automatically produce the fan plot data(loyalty) head(loyalty) dim(loyalty) ## la is a vector contanining the most common values of the transformation parameter out <- fsrfan(amount_spent~., data=loyalty, la=c(0, 1/3, 0.4, 0.5), init=ncol(loyalty)+1, plot=TRUE, lwd=3) plot(out, lwd=3) ## The fan plot shows that even if the third root is the best value of the transformation ## parameter at the end of the search, in earlier steps it lies very close to the upper ## rejection region. The best value of the transformation parameter seems to be the one ## associated with la=0.4, which is always the confidence bands but at the end of search, ## due to the presence of particular observations it goes below the lower rejection line. ##===================== ## Compare BoxCox with Yeo and Johnson transformation. ## Store values of the score test statistic for the five most common ## values of lambda. Produce also a fan plot and display it on the screen. ## Common part to all examples: load wool dataset. data(wool) ## Store the score test statistic using Box Cox transformation. outBC <- fsrfan(cycles~., data=wool, nsamp=0) ## Store the score test statistic using Yeo and Johnson transformation. outYJ <- fsrfan(cycles~., data=wool, family="YJ", nsamp=0) ## Not yet fully implemented ## fanplot(outBC, main="Box Cox") ## fanplot(outYJ,main="Yeo and Johnson") plot(outBC, main="Box Cox") plot(outYJ, main="Yeo and Johnson") cat("\nMaximum difference in absolute value: ", max(max(abs(outYJ$Score - outBC$Score), na.rm=TRUE)), "\n") ##====================== ## Call 'fsrfan' with Yeo-Johnson (YJ) transformation out <- fsrfan(cycles~., data=wool, family="YJ") plot(out) ## End(Not run)
fsrfan
An object of class fsrfan.object
holds information about
the result of a call to fsrfan
.
The functions print()
and summary()
are used to obtain and print a
summary of the results. An object of class fsrfan
is a list containing at least the following components:
la
vector containing the values of lambda for which fan plot is constructed
bs
matrix of size p X length(la)
containing the units forming
the initial subset for each value of lambda
Score
a matrix containing the values of the score test for
each value of the transformation parameter:
1st col = fwd search index;
2nd col = value of the score test in each step of the fwd search for la[1]
...
Scorep
matrix containing the values of the score test for positive
observations for each value of the transformation parameter.
Note: this output is present only if input option family='YJpn'
or family='YJall'
.
Scoren
matrix containing the values of the score test for negative observations
for each value of the transformation parameter.
Note: this output is present only if input option 'family' is 'YJpn' or 'YJall'.
Scoreb
matrix containing the values of the score test for the joint
presence of both constructed variables (associated with positive and negative
observations) for each value of the transformation parameter. In this case
the reference distribution is the with 2 and
subset_size - p
degrees of freedom.
Note: this output is present only if input option family='YJall'
.
Un
a three-dimensional array containing length(la)
matrices of
size retnUn=(n-init) X retpUn=11
. Each matrix contains
the unit(s) included in the subset at each step in the search associated
with the corresponding element of la
.
REMARK: at each step the new subset is compared with the old subset.
Un
contains the unit(s) present in the new subset but not in the old one.
## Not run: data(wool) XX <- wool y <- XX[, ncol(XX)] X <- XX[, 1:(ncol(XX)-1), drop=FALSE] out <- fsrfan(X, y) class(out) summary(out) ## End(Not run)
## Not run: data(wool) XX <- wool y <- XX[, ncol(XX)] X <- XX[, 1:(ncol(XX)-1), drop=FALSE] out <- fsrfan(X, y) class(out) summary(out) ## End(Not run)
A bivariate data set obtained from the Old Faithful Geyser, containing the eruption length and the length of the previous eruption for 271 eruptions of this geyser in minutes.
data(geyser2)
data(geyser2)
A data frame with 271 rows and 2 variables The variables are as follows:
Eruption length: The eruption length in minutes.
Previous eruption length: The length of the previous eruption in minutes.
Garcia-Escudero, L.A., Gordaliza, A. (1999). Robustness properties of k-means and trimmed k-means, Journal of the American Statistical Assoc., Vol.94, No.447, 956-969.
Haerdle, W. (1991). Smoothing Techniques with Implementation in S, New York: Springer.
These data, simulated by Hawkins, consist of 128 observations
and eight explanatory variables (X1, ..., X8)
and one dependent variable, y
.
data(hawkins)
data(hawkins)
A data frame with 128 rows and 9 variables
Data on the logged survival time of 108 patients undergoing liver surgery, together with four potential explanatory variables. Data are composed of 54 observations plus other 54 observations, introduced to check the model fitted to the first 54. Their comparison suggests there is no systematic difference between the two sets. However by looking at some FS plots (Riani and Atkinson, 2007), we conclude that these two groups are significantly different
data(hospital)
data(hospital)
A data frame with 108 rows and 5 variables The variables are as follows:
X1
X2
X3
X4
y
@source J. NETER, M. H. KUTNER, C. J. NACHTSHEIM, W.WASSERMAN, Applied Linear Statistical Models (4th edition). McGraw-Hill, New York, 1996.
@references A. C. ATKINSON, M. RIANI, Robust Diagnostic Regression Analysis. Springer-Verlag, New York, 2000.
Income data taken from the United States Census Bureau. The data are a random sample of 200 observations referred to four variables. The goal is to predict HTOTVAL.
data(Income1)
data(Income1)
A data frame with 200 rows and 4 variables. The variables are as follows:
H_NUMPER: Number of persons in household
HOTHVAL: All other types of income except HEARNVAL Recode - Total other household income
HSSVAL: household income - social security
HTOTVAL: total household income (dollar amount)
United States Census Bureau (2021). Annual Social and Economic Supplements
data(Income1) head(Income1)
data(Income1) head(Income1)
A sample of 200 observations of full time employees from a municipality in Northern Italy who have declared extra income from investment sources. The variables are as follows. The goal is the possibility in predicting income level based on the individual's personal information.
data(Income2)
data(Income2)
A data frame with 200 rows and 6 variables. The variables are as follows:
Age: Age of the person (the minimum is 19 and the maximum is 73).
Education: Number of years of education (the minimum value of 5 is primary school, and the maximum value is 16 bachelor degree)
Gender: A factor - Male or Female
ExtraGain: Income from investment sources (profit-losses) apart from wages/salary
Hours: total number of declared hours worked during the week. The minimum value is 35 and the maximum is 99
Income: total yearly income (Euro amount)
data(Income2) head(Income2)
data(Income2) head(Income2)
Plots the trajectories of the monitored scaled (squared) residuals
levfwdplot(out, xlim, ylim, xlab, ylab, main, lwd, lty, col, cex.lab, cex.axis, xvalues, fg.thresh, fg.unit, fg.labstep, fg.lwd, fg.lty, fg.col, fg.mark, fg.cex, bg.thresh, bg.style, xground=c("lev", "res"), tag, datatooltip, label, nameX, namey, msg, databrush, standard, fground, bground, ...)
levfwdplot(out, xlim, ylim, xlab, ylab, main, lwd, lty, col, cex.lab, cex.axis, xvalues, fg.thresh, fg.unit, fg.labstep, fg.lwd, fg.lty, fg.col, fg.mark, fg.cex, bg.thresh, bg.style, xground=c("lev", "res"), tag, datatooltip, label, nameX, namey, msg, databrush, standard, fground, bground, ...)
out |
An object containing monitoring of leverage, The needed elements of
|
ylim |
Control |
xlim |
Control |
xlab |
a title for the x axis |
ylab |
a title for the y axis |
main |
an overall title for the plot |
lwd |
The line width, a positive number, defaulting to 1 |
lty |
The line type. Line types can either be specified as an integer (1=solid (default), 2=dashed, 3=dotted, 4=dotdash, 5=longdash, 6=twodash) or as one of the character strings "solid", "dashed", "dotted", "dotdash", "longdash", or "twodash". The latter two are not supported by Matlab. |
col |
colors to be used for the highlighted units |
cex.lab |
The magnification to be used for x and y labels relative to the current setting of cex |
cex.axis |
The magnification to be used for axis annotation relative to the current setting of cex |
xvalues |
values for the x axis. Numeric vector of |
fg.thresh |
(alternative to fg.unit) numeric vector of length 1 or 2 which specifies the highlighted trajectories.
If |
fg.unit |
(alternative to fg.thresh), vector containing the list of the units to be highlighted.
If |
fg.labstep |
numeric vector which specifies the steps of the search where to put labels for
the highlighted trajectories (units). The default is to put the labels at the
initial and final steps of the search. |
fg.lwd |
The line width for the highlighted trajectories (units). Default is 1. |
fg.lty |
The line type for the highlighted trajectories (units). Line types can either be specified as an integer (1=solid (default), 2=dashed, 3=dotted, 4=dotdash, 5=longdash, 6=twodash) or as one of the character strings "solid", "dashed", "dotted", "dotdash", "longdash", or "twodash". The latter two are not supported by Matlab. |
fg.col |
colors to be used for the highlighted units. |
fg.mark |
Controlls whether to plot highlighted trajectories as symbols.
if |
fg.cex |
controls the font size of the labels of the trajectories in foreground. |
bg.thresh |
numeric vector of length 1 or 2 which specifies how to define the unimmportant trajectories.
Unimmportant trajectories will be plotted using a colormap, in greysh or will be hidden.
If |
bg.style |
specifies how to plot the unimportant trajectories as defined in option bthresh.
When |
tag |
Plot handle. String which identifies the handle of the plot which is about to be created. The default is to use tag 'pl_resfwd'. Notice that if the program finds a plot which has a tag equal to the one specified by the user, then the output of the new plot overwrites the existing one in the same window else a new window is created. |
xground |
trajectories to highlight in connection with resfwdplot. If |
datatooltip |
Interactive clicking. It is inactive if this parameter is missing or empty.
The default is For example for class
|
label |
Character vector containing the labels of the units (optional argument used when
|
nameX |
Add variable labels in plot. A vector of strings of length |
namey |
Add response label. A string containing the label of the response |
msg |
Controls whether to display or not messages on the screen If |
databrush |
interactive mouse brushing. If databrush is missing or empty (default), no brushing is done. The activation of this option (databrush is a scalar or a list) enables the user to select a set of trajectories in the current plot and to see them highlighted in the y|X plot, i.e. a matrix of scatter plots of y against each column of X, grouped according to the selection(s) done by brushing. If the plot y|X does not exist it is automatically created. In addition, brushed units are automatically highlighted in the minimum deletion residual plot if it is already open. The extension to the following plots will be available in future versions of the toolbox:
Note that the window style of the other figures is set equal to that which contains the monitoring residual plot. In other words, if the monitoring residual plot is docked all the other figures will be docked too If If
|
standard |
(MATLAB-style arguments) appearance of the plot in terms of xlim, ylim, axes labels and their font size style, color of the lines, etc. |
fground |
MATLAB-style arguments for the fground trajectories in foregroud. |
bground |
MATLAB-style arguments for the fground trajectories in backgroud. |
... |
potential further arguments passed to lower level functions. |
No details
No value returned
FSDA team
## Not run: n <- 100 y <- rnorm(n) X <- matrix(rnorm(n*4), nrow=n) out <- fsreg(y~X, method="LTS") out <- fsreg(y~X, method="FS", bsb=out$bs, monitoring=TRUE) levfwdplot(out) ## End(Not run)
## Not run: n <- 100 y <- rnorm(n) X <- matrix(rnorm(n*4), nrow=n) out <- fsreg(y~X, method="LTS") out <- fsreg(y~X, method="FS", bsb=out$bs, monitoring=TRUE) levfwdplot(out) ## End(Not run)
The loyalty data consist of 509 observations on the behaviour of
customers with loyalty cards from a supermarket chain in Northern
Italy. The response y
is the amount in euros spent at the
shop over six months and the explanatory variables are:
X1, the number of visits to the supermarket in the six month period;
X2, the age of the customer;
X3, the number of members of the customers' family.
To find out more about this data set please see Atkinson and Riani (2006), JCGS
data("loyalty")
data("loyalty")
A data frame with 509 observations on the following 4 variables.
visits
the number of visits to the supermarket in the six month period
age
the age of the customer
family
the number of members of the customers' family
amount_spent
the amount in euros spent at the shop over six months
To find out more about this data set please see Atkinson and Riani (2006), JCGS
The data are themselves a random sample from a larger database. The sample of 509 observations is available at http://www.riani.it/trimmed/.
Atkinson, A. and Riani, M (2006) Distribution Theory and Simulations for Tests of Outliers in Regression, Journal of Computational and Graphical Statistics, 15 2, pp 460–476.
data(loyalty)
data(loyalty)
LSX_control
object
Creates an object of class LXS_control
to be used with the fsreg()
function,
containing various control parameters.
LXS_control(intercept = TRUE, lms, h, bdp, nsamp, rew = FALSE, conflev = 0, msg = TRUE, nocheck = FALSE, nomes = FALSE, plot = FALSE)
LXS_control(intercept = TRUE, lms, h, bdp, nsamp, rew = FALSE, conflev = 0, msg = TRUE, nocheck = FALSE, nomes = FALSE, plot = FALSE)
intercept |
Indicator for constant term. Scalar. If |
lms |
Criterion to use to find the initial subset to initialize the search
(LMS, LTS with concentration steps, LTS without concentration steps
or subset supplied directly by the user). The default value is 1
(Least Median of Squares is computed to initialize the search).
On the other hand, if the user wants to initialze the search with
LTS with all the default options for concentration steps then lms=2.
If the user wants to use LTS without concentration steps, lms can be
a scalar different from 1 or 2. If lms is a list it is possible
to control a series of options for concentration steps (for more
details see option
|
h |
The number of observations that have determined the least trimmed squares
estimator, scalar. |
bdp |
Breakdown point. It measures the fraction of outliers the algorithm should resist. In this case any value greater than 0 but smaller or equal than 0.5 will do fine. If on the other hand the purpose is subgroups detection then bdp can be greater than 0.5. In any case however n*(1-bdp) must be greater than p. If this condition is not fulfilled an error will be given. Please specify h or bdp not both. |
nsamp |
Number of subsamples which will be extracted to find the robust estimator,
scalar. If |
rew |
LXS reweighted - if rew=1 the reweighted version of LTS (LMS) is used and the output quantities refer to the reweighted version else no reweighting is performed (default). |
conflev |
Confidence level which is used to declare units as outliers, usually |
msg |
Controls whether to display or not messages on the screen If |
nocheck |
Check input arguments, scalar. If |
nomes |
It controls whether to display or not on the screen messages about estimated time to compute LMS (LTS). If nomes is equal to 1 no message about estimated time to compute LMS (LTS) is displayed, else if nomes is equal to 0 (default), a message about estimated time is displayed. |
plot |
Plot on the screen. Scalar. If |
Creates an object of class FSR_control
to be used with the fsreg()
function,
containing various control parameters.
An object of class "LXS_control"
which is basically a
list
with components the input arguments of
the function mapped accordingly to the corresponding Matlab function.
FSDA team
See Also as Sreg_control
, MMreg_control
and FSR_control
## Not run: data(hbk, package="robustbase") (out <- fsreg(Y~., data=hbk, method="LMS", control=LXS_control(h=56, nsamp=500, lms=2))) ## End(Not run)
## Not run: data(hbk, package="robustbase") (out <- fsreg(Y~., data=hbk, method="LMS", control=LXS_control(h=56, nsamp=500, lms=2))) ## End(Not run)
A bivariate data set obtained from three normal bivariate distributions with different scales and proportions 1:2:2. One of the components is strongly overlapping with another one. A 10 noise is added uniformly distributed in a rectangle containing the three normal components and not strongly overlapping with the three mixture components. A precise description of the M5 data set can be found in Garcia-Escudero et al. (2008).
data(M5data)
data(M5data)
A data frame with 2000 rows and 3 variables The first two columns are the two variables. The last column is the true classification vector where symbol "0" stands for the contaminating data points.
Garcia-Escudero, L.A., Gordaliza, A., Matran, C. and Mayo-Iscar, A. (2008). A General Trimming Approach to Robust Cluster Analysis, Annals of Statistics, Vol.36, 1324-1345. doi:10.1214/07-AOS515.
Plots the trajectories of scaled Mahalanobis distances along the forward search
malfwdplot( out, xlim, ylim, xlab, ylab, main, lwd, lty, col, cex.lab, cex.axis, subsize, fg.thresh, fg.unit, fg.labstep, fg.lwd, fg.lty, fg.col, fg.mark, fg.cex, bg.thresh, bg.style, standard, fground, bground, tag, datatooltip, label, nameX, databrush, conflev, trace = FALSE, ... )
malfwdplot( out, xlim, ylim, xlab, ylab, main, lwd, lty, col, cex.lab, cex.axis, subsize, fg.thresh, fg.unit, fg.labstep, fg.lwd, fg.lty, fg.col, fg.mark, fg.cex, bg.thresh, bg.style, standard, fground, bground, tag, datatooltip, label, nameX, databrush, conflev, trace = FALSE, ... )
out |
An object of S3 class The needed elements of
|
xlim |
Controls the |
ylim |
Controls the |
xlab |
A title for the x axis |
ylab |
A title for the y axis, deafult is "Squared Mahalanobis distances". |
main |
An overall title for the plot |
lwd |
The line width, a positive number, defaulting to 1 |
lty |
The line type. Line types can either be specified as an integer (1=solid (default), 2=dashed, 3=dotted, 4=dotdash, 5=longdash, 6=twodash) or as one of the character strings "solid", "dashed", "dotted", "dotdash", "longdash", or "twodash". The latter two are not supported by Matlab. |
col |
Colors to be used for the highlighted units |
cex.lab |
The magnification to be used for x and y labels relative to the current setting of cex |
cex.axis |
The magnification to be used for axis annotation relative to the current setting of cex |
subsize |
Numeric vector containing the subset size with length equal to the number of columns of
matrix of mahalanobis distances. The default value of subsize is |
fg.thresh |
(alternative to fg.unit) numeric vector of length 1 or 2 which specifies
the highlighted trajectories.
If |
fg.unit |
(alternative to fg.thresh), vector containing the list of the units to be highlighted.
If |
fg.labstep |
numeric vector which specifies the steps of the search where to put labels for
the highlighted trajectories (units). The default is to put the labels at the
initial and final steps of the search. |
fg.lwd |
The line width for the highlighted trajectories (units). Default is 1. |
fg.lty |
The line type for the highlighted trajectories (units). Line types can either be specified as an integer (1=solid (default), 2=dashed, 3=dotted, 4=dotdash, 5=longdash, 6=twodash) or as one of the character strings "solid", "dashed", "dotted", "dotdash", "longdash", or "twodash". The latter two are not supported by Matlab. |
fg.col |
colors to be used for the highlighted units. |
fg.mark |
Controlls whether to plot highlighted trajectories as symbols.
if |
fg.cex |
Controls the font size of the labels of the trajectories in foreground. If
|
bg.thresh |
Numeric vector of length 1 or 2 which specifies how to define the unimmportant trajectories.
Unimmportant trajectories will be plotted using a colormap, in greysh or will be hidden.
If |
bg.style |
Specifies how to plot the unimportant trajectories as defined in option bg.thresh.
When |
standard |
MATLAB-style arguments - appearance of the plot in terms of xlim, ylim, axes labels and their font size style, color of the lines, etc. |
fground |
MATLAB-style arguments - for the trajectories in foregroud. |
bground |
MATLAB-style arguments - for the trajectories in backgroud. |
tag |
Plot handle. String which identifies the handle of the plot which is about to be created. The default is to use tag 'pl_resfwd'. Notice that if the program finds a plot which has a tag equal to the one specified by the user, then the output of the new plot overwrites the existing one in the same window else a new window is created. |
datatooltip |
Interactive clicking. It is inactive if this parameter is set to FALSE.
The default is
|
label |
Character vector containing the labels of the units (optional argument used when
|
nameX |
Add variable labels in plot. A vector of strings of length |
databrush |
Interactive mouse brushing. If databrush is missing or empty (default), no brushing is done.
The activation of this option (databrush is If If
|
conflev |
confidence interval for the horizontal bands. It can be a vector of different confidence level values, e.g. c(0.95, 0.99, 0.999). The confidence interval is based on the $chi^2$ distribution. |
trace |
Whether to print intermediate results. Default is |
... |
potential further arguments passed to lower level functions. |
none
FSDA team, [email protected]
Atkinson A.C., Riani M. and Cerioli A. (2004), Exploring Multivariate Data with the Forward Search, Springer Verlag, New York.
## Not run: ## Produce monitoring MD plot with all the default options. ## Generate input structure for malfwdplot n <- 100 p <- 4 Y <- matrix(rnorm(n*p), ncol=p) Y[1:10,] <- Y[1:10,] + 4 out <- fsmult(Y, monitoring=TRUE, init=30) ## Produce monitoring MD plot with all the default options malfwdplot(out) ## End(Not run)
## Not run: ## Produce monitoring MD plot with all the default options. ## Generate input structure for malfwdplot n <- 100 p <- 4 Y <- matrix(rnorm(n*p), ncol=p) Y[1:10,] <- Y[1:10,] + 4 out <- fsmult(Y, monitoring=TRUE, init=30) ## Produce monitoring MD plot with all the default options malfwdplot(out) ## End(Not run)
Plots the trajectory of minimum Mahalanobis distance (mmd)
malindexplot( out, p, xlab, ylab, main, nameX, conflev, numlab, tag, trace = FALSE, ... )
malindexplot( out, p, xlab, ylab, main, nameX, conflev, numlab, tag, trace = FALSE, ... )
out |
a numeric vector or an object of S3 class (one of |
p |
If |
xlab |
A title for the x axis |
ylab |
A title for the y axis |
main |
An overall title for the plot |
nameX |
Add variable labels in the plot. A vector of strings of length |
conflev |
confidence interval for the horizontal bands. It can be a vector of different confidence level values, e.g. c(0.95, 0.99, 0.999). The confidence interval is based on the chi^2 distribution. |
numlab |
Number of points to be labeled in the plot. If |
tag |
Tag of the figure which will host the malindexplot. The default tag is |
trace |
Whether to print intermediate results. Default is |
... |
potential further arguments passed to lower level functions. |
none
FSDA team, [email protected]
Atkinson and Riani (2000), Robust Diagnostic Regression Analysis, Springer Verlag, New York.
## Not run: ## Mahalanobis distance plot of 100 random numbers. ## Numbers are from from the chisq with 5 degrees of freedom malindexplot(rchisq(100, 5), 5) ## End(Not run)
## Not run: ## Mahalanobis distance plot of 100 random numbers. ## Numbers are from from the chisq with 5 degrees of freedom malindexplot(rchisq(100, 5), 5) ## End(Not run)
Plots the trajectory of minimum deletion residual (mdr).
mdrplot(out, quant = c(0.01, 0.5, 0.99), sign = TRUE, mplus1 = FALSE, envm, xlim, ylim, xlab, ylab, main, lwdenv, lwd, cex.lab, cex.axis, tag, datatooltip, label, nameX, namey, databrush, ...)
mdrplot(out, quant = c(0.01, 0.5, 0.99), sign = TRUE, mplus1 = FALSE, envm, xlim, ylim, xlab, ylab, main, lwdenv, lwd, cex.lab, cex.axis, tag, datatooltip, label, nameX, namey, databrush, ...)
out |
An object returned by FSReda() (see The needed elements of
|
quant |
Quantiles for which envelopes have to be computed. The default is to
produce 1%, 50% and 99% envelopes. In other words the default
is |
sign |
Wheather to use MDR with sign: if |
mplus1 |
Wheather to plot the (m+1)-th order statistic. Specifies if it is necessary to plot the curve associated with (m+1)-th order statistic. |
envm |
Sample size for drawing enevlopes. Specifies the size of the sample which is used to superimpose the envelope. The default is to add an envelope based on all the observations (size n envelope). |
ylim |
Control |
xlim |
Control |
xlab |
a title for the x axis |
ylab |
a title for the y axis |
main |
an overall title for the plot |
lwdenv |
Controls the width of the lines associated with the envelopes, default is |
lwd |
Controls the linewidth of the curve which contains the monitoring of minimum deletion residual. |
cex.lab |
The magnification to be used for x and y labels relative to the current setting of cex |
cex.axis |
The magnification to be used for axis annotation relative to the current setting of cex |
tag |
Plot handle. String which identifies the handle of the plot which is about to be created. The default is to use tag 'pl_mdr'. Notice that if the program finds a plot which has a tag equal to the one specified by the user, then the output of the new plot overwrites the existing one in the same window else a new window is created. |
datatooltip |
If datatooltip is not empty the user can use the mouse in order to have
information about the unit selected, the step in which the unit enters the search and
the associated label. If datatooltip is a list, it is possible to control the aspect
of the data cursor (see MATLAB function |
label |
Character vector containing the labels of the units (optional argument used when
|
nameX |
Add variable labels in plot. A vector of strings of length |
namey |
Add response label. A string containing the label of the response |
databrush |
interactive mouse brushing. If databrush is missing or empty (default), no brushing is done. The activation of this option (databrush is a scalar or a list) enables the user to select a set of trajectories in the current plot and to see them highlighted in the y|X plot, i.e. a matrix of scatter plots of y against each column of X, grouped according to the selection(s) done by brushing. If the plot y|X does not exist it is automatically created. In addition, brushed units are automatically highlighted in the minimum deletion residual plot if it is already open. The extension to the following plots will be available in future versions of the toolbox:
Note that the window style of the other figures is set equal to that which contains the monitoring residual plot. In other words, if the monitoring residual plot is docked all the other figures will be docked too If If
|
... |
potential further arguments passed to lower level functions. |
No details
No value returned
FSDA team
## Not run: n <- 100 y <- rnorm(n) X <- matrix(rnorm(n*4), nrow=n) out <- fsreg(y~X, method="LTS") out <- fsreg(y~X, method="FS", bsb=out$bs, monitoring=TRUE) mdrplot(out) ## End(Not run)
## Not run: n <- 100 y <- rnorm(n) X <- matrix(rnorm(n*4), nrow=n) out <- fsreg(y~X, method="LTS") out <- fsreg(y~X, method="FS", bsb=out$bs, monitoring=TRUE) mdrplot(out) ## End(Not run)
Plots the trajectory of minimum Mahalanobis distance (mmd)
mmdplot( out, quant = c(0.01, 0.5, 0.99), mplus1 = FALSE, envm, lwd, lwdenv, xlim, ylim, tag, datatooltip, label, xlab, ylab, main, nameX, cex.lab, cex.axis, databrush, trace = FALSE, ... )
mmdplot( out, quant = c(0.01, 0.5, 0.99), mplus1 = FALSE, envm, lwd, lwdenv, xlim, ylim, tag, datatooltip, label, xlab, ylab, main, nameX, cex.lab, cex.axis, databrush, trace = FALSE, ... )
out |
An object of S3 class |
quant |
Quantiles for which envelopes have to be computed.
The default is to produce 1%, 50% and 99% envelopes. In other
words the default is |
mplus1 |
Wheather to plot the (m+1)-th order statistic. |
envm |
Sample size for drawing enevlopes. Specifies the size of the sample which is
used to superimpose the envelope. The default is to add an envelope based on
all the observations (size |
lwd |
Controls the line width of the curve which contains the monitoring of minimum deletion residual. |
lwdenv |
Controls the width of the lines associated with the envelopes. Default is |
xlim |
Control the x scale in plot. Vector with two elements controlling minimum and maximum on the x axis. Default is to use automatic scale. |
ylim |
Control the y scale in plot. Vector with two elements controlling minimum and maximum on the y axis. Default is to use automatic scale. |
tag |
Plot handle. String which identifies the handle of the plot which is about to be created.
The default is |
datatooltip |
If datatooltip is not empty the user can use the mouse in order to have
information about the unit selected, the step in which the unit enters the search and
the associated label. If datatooltip is a list, it is possible to control the aspect
of the data cursor (see MATLAB function |
label |
Row labels. Character vector containing the labels of the units (optional argument used
when |
xlab |
A title for the x axis |
ylab |
A title for the y axis |
main |
An overall title for the plot |
nameX |
Add variable labels in the plot. A vector of strings of length |
cex.lab |
The magnification to be used for x and y labels relative to the current setting of cex |
cex.axis |
The magnification to be used for axis annotation relative to the current setting of cex |
databrush |
Interactive mouse brushing. If databrush is missing or empty (default), no brushing is done. The activation of this option (databrush is TRUE or a list) enables the user to select a set of trajectories in the current plot and to see them highlighted in the scatterplot matrix. If the scatterplot matrix does not exist it is automatically created. In addition, brushed units can be highlighted in the monitoring MD plot. Note that the window style of the other figures is set equal to that which contains the monitoring residual plot. In other words, if the monitoring residual plot is docked all the other figures will be docked too. If Note that the window style of the other figures is set equal to that which contains the monitoring residual plot. In other words, if the monitoring residual plot is docked all the other figures will be docked too If If
|
trace |
Whether to print intermediate results. Default is |
... |
potential further arguments passed to lower level functions. |
none
FSDA team, [email protected]
Atkinson and Riani (2000), Robust Diagnostic Regression Analysis, Springer Verlag, New York.
## Not run: data(hbk, package="robustbase") (out <- fsmult(hbk[,1:3], monitoring=TRUE)) mmdplot(out) ## End(Not run)
## Not run: data(hbk, package="robustbase") (out <- fsmult(hbk[,1:3], monitoring=TRUE)) mmdplot(out) ## End(Not run)
Plots the trajectories of minimum Mahalanobis distances from different starting points
mmdrsplot( out, quant = c(0.01, 0.5, 0.99), envm, lwd, lwdenv, xlim, ylim, tag, datatooltip, label, xlab, ylab, envlab = TRUE, main, nameX, cex.lab, cex.axis, databrush, scaled = FALSE, trace = FALSE, ... )
mmdrsplot( out, quant = c(0.01, 0.5, 0.99), envm, lwd, lwdenv, xlim, ylim, tag, datatooltip, label, xlab, ylab, envlab = TRUE, main, nameX, cex.lab, cex.axis, databrush, scaled = FALSE, trace = FALSE, ... )
out |
An object of S3 class
|
quant |
Quantiles for which envelopes have to be computed.
The default is to produce 1%, 50% and 99% envelopes. In other
words the default is |
envm |
Sample size for drawing enevlopes. Specifies the size of the sample which is
used to superimpose the envelope. The default is to add an envelope based on
all the observations (size |
lwd |
Controls the linewidth of the curve which contains the monitoring of minimum deletion residual. |
lwdenv |
line width: a scalar which controls the width of the lines associated
with the envelopes. Default is |
xlim |
Control the x scale in plot. Vector with two elements controlling minimum and maximum on the x axis. Default is to use automatic scale. |
ylim |
Control the y scale in plot. Vector with two elements controlling minimum and maximum on the y axis. Default is to use automatic scale. |
tag |
Plot handle. String which identifies the handle of the plot which is about to be created.
The default is |
datatooltip |
If datatooltip is not empty the user can use the mouse in order to have
information about the unit selected, the step in which the unit enters the search and
the associated label. If datatooltip is a list, it is possible to control the aspect
of the data cursor (see MATLAB function |
label |
Row labels. Character vector containing the labels of the units (optional argument used
when |
xlab |
A title for the x axis |
ylab |
A title for the y axis |
envlab |
wheather to label the envelopes. If |
main |
An overall title for the plot |
nameX |
Add variable labels in the plot. A vector of strings of length |
cex.lab |
The magnification to be used for x and y labels relative to the current setting of cex |
cex.axis |
The magnification to be used for axis annotation relative to the current setting of cex |
databrush |
Interactive mouse brushing. If databrush is missing or empty (default), no brushing is done. The activation of this option (databrush is TRUE or a list) enables the user to select a set of trajectories in the current plot and to see them highlighted in the scatterplot matrix. If the scatterplot matrix does not exist it is automatically created. In addition, brushed units can be highlighted in the monitoring MD plot. Note that the window style of the other figures is set equal to that which contains the monitoring residual plot. In other words, if the monitoring residual plot is docked all the other figures will be docked too. If Note that the window style of the other figures is set equal to that which contains the monitoring residual plot. In other words, if the monitoring residual plot is docked all the other figures will be docked too If If
|
scaled |
Wheather to use scaled or unscaled envelopes. If |
trace |
Whether to print intermediate results. Default is |
... |
potential further arguments passed to lower level functions. |
none
FSDA team, [email protected]
Atkinson, A.C., Riani, M. and Cerioli, A. (2004), 'Exploring multivariate data with the forward search, Springer Verlag, New York.
## Not run: data(hbk, package="robustbase") out <- fsmmmdrs(hbk[,1:3]) mmdrsplot(out) ## End(Not run)
## Not run: data(hbk, package="robustbase") out <- fsmmmdrs(hbk[,1:3]) mmdrsplot(out) ## End(Not run)
Computes MM estimators in multivariate analysis with auxiliary S-scale
mmmult( x, monitoring = FALSE, plot = FALSE, eff, conflev = 0.975, nocheck = FALSE, trace = FALSE, ... )
mmmult( x, monitoring = FALSE, plot = FALSE, eff, conflev = 0.975, nocheck = FALSE, trace = FALSE, ... )
x |
An n x p data matrix (n observations and p variables). Rows of x represent observations, and columns represent variables. Missing values (NA's) and infinite values (Inf's) are allowed, since observations (rows) with missing or infinite values will automatically be excluded from the computations. |
monitoring |
Wheather to perform monitoring of Mahalanobis distances and other specific quantities |
plot |
Plots the Mahalanobis distances against index number. If
|
eff |
Defining the nominal efficiency (i.e. a number between 0.5 and 0.99). The default value is |
conflev |
Confidence level which is used to declare units as outliers (scalar).
Usually |
nocheck |
It controls whether to perform checks on matrix Y. If |
trace |
Whether to print intermediate results. Default is |
... |
potential further arguments passed to lower level functions. |
This function follows the lines of MATLAB/R code developed during the years by many authors. For more details see http://www.econ.kuleuven.be/public/NDBAE06/programs/ and the R package CovMMest The core of these routines, e.g. the resampling approach, however, has been completely redesigned, with considerable increase of the computational performance.
Depending on the input parameter monitoring
, one of
the following objects will be returned:
FSDA team, [email protected]
Maronna, R.A., Martin D. and Yohai V.J. (2006), Robust Statistics, Theory and Methods, Wiley, New York.
## Not run: data(hbk, package="robustbase") (out <- mmmult(hbk[,1:3])) class(out) summary(out) ## Generate contaminated data (200,3) n <- 200 p <- 3 set.seed(123456) X <- matrix(rnorm(n*p), nrow=n) Xcont <- X Xcont[1:5, ] <- Xcont[1:5,] + 3 out1 <- mmmult(Xcont, trace=TRUE) # no plots (plot defaults to FALSE) names(out1) ## plot=TRUE - generates: (1) a plot of Mahalanobis distances against ## index number. The confidence level used to draw the confidence bands for ## the MD is given by the input option conflev. If conflev is ## not specified a nominal 0.975 confidence interval will be used and ## (2) a scatter plot matrix with the outliers highlighted. (out1 <- mmmult(Xcont, trace=TRUE, plot=TRUE)) ## plots is a list: the spm shows the labels of the outliers. (out1 <- mmmult(Xcont, trace=TRUE, plot=list(labeladd="1"))) ## plots is a list: the spm uses the variable names provided by 'nameY'. (out1 <- mmmult(Xcont, trace=TRUE, plot=list(nameY=c("A", "B", "C")))) ## mmmult() with monitoring (out2 <- mmmult(Xcont, monitoring=TRUE, trace=TRUE)) names(out2) ## Forgery Swiss banknotes examples. data(swissbanknotes) (out1 <- mmmult(swissbanknotes[101:200,], plot=TRUE)) (out1 <- mmmult(swissbanknotes[101:200,], plot=list(labeladd="1"))) ## End(Not run)
## Not run: data(hbk, package="robustbase") (out <- mmmult(hbk[,1:3])) class(out) summary(out) ## Generate contaminated data (200,3) n <- 200 p <- 3 set.seed(123456) X <- matrix(rnorm(n*p), nrow=n) Xcont <- X Xcont[1:5, ] <- Xcont[1:5,] + 3 out1 <- mmmult(Xcont, trace=TRUE) # no plots (plot defaults to FALSE) names(out1) ## plot=TRUE - generates: (1) a plot of Mahalanobis distances against ## index number. The confidence level used to draw the confidence bands for ## the MD is given by the input option conflev. If conflev is ## not specified a nominal 0.975 confidence interval will be used and ## (2) a scatter plot matrix with the outliers highlighted. (out1 <- mmmult(Xcont, trace=TRUE, plot=TRUE)) ## plots is a list: the spm shows the labels of the outliers. (out1 <- mmmult(Xcont, trace=TRUE, plot=list(labeladd="1"))) ## plots is a list: the spm uses the variable names provided by 'nameY'. (out1 <- mmmult(Xcont, trace=TRUE, plot=list(nameY=c("A", "B", "C")))) ## mmmult() with monitoring (out2 <- mmmult(Xcont, monitoring=TRUE, trace=TRUE)) names(out2) ## Forgery Swiss banknotes examples. data(swissbanknotes) (out1 <- mmmult(swissbanknotes[101:200,], plot=TRUE)) (out1 <- mmmult(swissbanknotes[101:200,], plot=list(labeladd="1"))) ## End(Not run)
mmmult.object
ObjectsAn object of class mmmult.object
holds information about
the result of a call to mmmult
.
The object itself is basically a list
with the following
components:
loc |
p-by-1 vector containing MM estimate of location. |
shape |
p-by-p matrix with MM estimate of the shape matrix. |
cov |
matrix with MM estimate of the covariance matrix.
Remark: |
weights |
A vector containing the estimates of the weights. |
outliers |
A vector containing the list of the units declared
as outliers using confidence level specified in input scalar
|
Sloc |
A vector with S estimate of location. |
Sshape |
A matrix with S estimate of the shape matrix. |
Scov |
A matrix with S estimate of the covariance matrix. |
auxscale |
S estimate of the scale. |
md |
n-by-1 vector containing the estimates of the robust Mahalanobis distances (in squared units). |
conflev |
Confidence level that was used to declare outliers. |
X |
the data matrix X |
The object has class "mmmult"
.
## Not run: data(hbk, package="robustbase") (out <- mmmult(hbk[,1:3])) class(out) summary(out) ## End(Not run)
## Not run: data(hbk, package="robustbase") (out <- mmmult(hbk[,1:3])) class(out) summary(out) ## End(Not run)
mmmulteda.object
ObjectsAn object of class mmmulteda.object
holds information about
the result of a call to mmmult
with monitoring=TRUE
.
The object itself is basically a list
with the following
components:
Loc |
length(eff)-by-p matrix containing MM estimate of location
for each value of |
Shape |
p-by-p-by-length(eff) 3D array containing robust estimate of the shape for each value of eff. Remark: det|shape|=1. |
Scale |
length(eff) vector containing robust estimate of the scale for each value of eff. |
Cov |
p-by-p-by-length(eff) 3D array containing robust estimate
of covariance matrix for each value of |
Bs |
(p+1)-by-length(eff) matrix containing the units forming best subset for each value of eff. |
MAL |
n-by-length(eff) matrix containing the estimates of the robust Mahalanobis distances (in squared units) for each value of eff. |
Outliers |
n-by-length(eff) matrix containing flags for the outliers.
Boolean matrix containing the list of the
units declared as outliers for each value of eff using confidence
level specified in input scalar |
Weights |
n x length(eff) matrix containing the weights for each value of eff. |
conflev |
Confidence level that was used to declare outliers. |
singsub |
Number of subsets without full rank. Notice that
|
eff |
vector which contains the values of eff which have been used. |
X |
the data matrix X. |
The object has class "mmmulteda"
.
## Not run: data(hbk, package="robustbase") (out <- mmmult(hbk[,1:3], monitoring=TRUE)) class(out) summary(out) ## End(Not run)
## Not run: data(hbk, package="robustbase") (out <- mmmult(hbk[,1:3], monitoring=TRUE)) class(out) summary(out) ## End(Not run)
MMreg_control
object
Creates an object of class MMreg_control
to be used with the fsreg()
function,
containing various control parameters for calling the MATLAB function MMreg()
.
MMreg_control(intercept = TRUE, InitialEst, eff, effshape, rhofunc = c("bisquare", "optimal", "hyperbolic", "hampel", "mdpd", "AS"), rhofuncparam, refsteps = 3, tol = 1e-07, conflev, nocheck = FALSE, Smsg = TRUE, plot = FALSE)
MMreg_control(intercept = TRUE, InitialEst, eff, effshape, rhofunc = c("bisquare", "optimal", "hyperbolic", "hampel", "mdpd", "AS"), rhofuncparam, refsteps = 3, tol = 1e-07, conflev, nocheck = FALSE, Smsg = TRUE, plot = FALSE)
intercept |
Indicator for constant term. Scalar. If |
InitialEst |
Starting values of the MM-estimator, a list with the fiollowing
elements: |
eff |
Scalar defining nominal efficiency (i.e. a number between 0.5 and 0.99). The default value is 0.95. |
effshape |
Location or scale efficiency. If |
rhofunc |
Specifies the rho function which must be used to weight the residuals. Possible values are 'bisquare' 'optimal' 'hyperbolic' 'hampel'.
The default is 'bisquare'. |
rhofuncparam |
Additional parameters for the specified rho function. For hyperbolic rho function it is possible to set up the value of k = sup CVC (the default value of k is 4.5). For Hampel rho function it is possible to define parameters a, b and c (the default values are a=2, b=4, c=8) |
refsteps |
Number of refining iterations in each subsample (default is |
tol |
Scalar controlling tolerance in the MM loop. The default value is |
conflev |
Confidence level which is used to declare units as outliers. Usually conflev=0.95, 0.975, 0.99 (individual alpha) or conflev=1-0.05/n, 1-0.025/n, 1-0.01/n (simultaneous alpha). Default value is 0.975 |
nocheck |
Check input arguments, scalar. If |
Smsg |
Controls whether to display or not messages on the screen If |
plot |
Plot on the screen. Scalar. If |
Creates an object of class MMreg_control
to be used with the fsreg()
function,
containing various control parameters.
An object of class "MMreg_control"
which is basically a
list
with components the input arguments of
the function mapped accordingly to the corresponding Matlab function.
FSDA team
See Also as FSR_control
, MMreg_control
and LXS_control
## Not run: data(hbk, package="robustbase") (out <- fsreg(Y~., data=hbk, method="MM", control=MMreg_control(eff=0.99, rhofunc="optimal"))) ## End(Not run)
## Not run: data(hbk, package="robustbase") (out <- fsreg(Y~., data=hbk, method="MM", control=MMreg_control(eff=0.99, rhofunc="optimal"))) ## End(Not run)
An object of class mmreg.object
holds information about
the result of a call to fsreg
with method="MM"
.
The object itself is basically a list
with the following
components:
beta |
p-by-1 vector containing the MM estimate of regression coefficients. |
auxscale |
scalar, S estimate of the scale (or supplied external estimate of scale, if option InitialEst is not empty). |
residuals |
residuals. |
fittedvalues |
fitted values. |
weights |
n x 1 vector. Weights assigned to each observation. |
Sbeta |
p x 1 vector containing S estimate of regression coefficients (or supplied initial external estimate of regression coefficients, if option InitialEst is not empty) |
Ssingsub |
Number of subsets without full rank in the S preliminary part. Notice that out.singsub > 0.1*(number of subsamples) produces a warning. |
outliers |
kx1 vector containing the list of the k units declared as outliers or NULL if the sample is homogeneous. |
conflev |
Confidence level which is used to declare units as outliers.
Usually |
rhofunc |
Specifies the rho function which has been used to weight
the residuals. If a different rho function is specified for S and MM
loop then insted of |
rhofuncparam |
Vector which contains the additional parameters for the specified
rho function which has been used. For hyperbolic rho function the value of k =sup CVC.
For Hampel rho function the parameters a, b and c. If a different rho function is
specified for S and MM loop then insted of |
X |
the data matrix X |
y |
the response vector y |
The object has class "mmreg"
.
## Not run: data(hbk, package="robustbase") (out <- fsreg(Y~., data=hbk, method="MM")) class(out) summary(out) ## End(Not run)
## Not run: data(hbk, package="robustbase") (out <- fsreg(Y~., data=hbk, method="MM")) class(out) summary(out) ## End(Not run)
MMregeda_control
object
Creates an object of class MMregeda_control
to be used with the fsreg()
function,
containing various control parameters.
MMregeda_control(intercept = TRUE, InitialEst, Soptions, eff, effshape, rhofunc = c("bisquare", "optimal", "hyperbolic", "hampel", "mdpd", "AS"), rhofuncparam, refsteps = 3, tol = 1e-07, conflev, nocheck = FALSE, plot = FALSE)
MMregeda_control(intercept = TRUE, InitialEst, Soptions, eff, effshape, rhofunc = c("bisquare", "optimal", "hyperbolic", "hampel", "mdpd", "AS"), rhofuncparam, refsteps = 3, tol = 1e-07, conflev, nocheck = FALSE, plot = FALSE)
intercept |
Indicator for constant term. Scalar. If |
InitialEst |
Starting values of the MM-estimator, a list with the fiollowing
elements: |
Soptions |
Options to pass to Sreg, an It is necessary to add to the S options the letter S at the beginning. For example, if you want to use the optimal rho function the supplied option is 'Srhofunc','optimal'. For example, if you want to use 3000 subsets, the supplied option is 'Snsamp',3000 |
eff |
Vector defining nominal efficiency (i.e. a series of numbers
between 0.5 and 0.99). The default value is the sequence |
effshape |
Location or scale efficiency. If |
rhofunc |
Specifies the rho function which must be used to weight the residuals. Possible values are 'bisquare' 'optimal' 'hyperbolic' 'hampel'.
The default is 'bisquare'. |
rhofuncparam |
Additional parameters for the specified rho function. For hyperbolic rho function it is possible to set up the value of k = sup CVC (the default value of k is 4.5). For Hampel rho function it is possible to define parameters a, b and c (the default values are a=2, b=4, c=8) |
refsteps |
Number of refining iterations in each subsample (default is |
tol |
Scalar controlling tolerance in the MM loop. The default value is |
conflev |
Confidence level which is used to declare units as outliers. Usually conflev=0.95, 0.975, 0.99 (individual alpha) or conflev=1-0.05/n, 1-0.025/n, 1-0.01/n (simultaneous alpha). Default value is 0.975 |
nocheck |
Check input arguments, scalar. If |
plot |
Plot on the screen. Scalar. If |
Creates an object of class MMregeda_control
to be used with the fsreg()
function,
containing various control parameters.
An object of class "MMregeda_control"
which is basically a
list
with components the input arguments of
the function mapped accordingly to the corresponding Matlab function.
FSDA team
See Also as FSR_control
, Sreg_control
, MMreg_control
and LXS_control
## Not run: data(hbk, package="robustbase") (out <- fsreg(Y~., data=hbk, method="MM", monitoring=TRUE, control=MMregeda_control(eff=seq(0.75, 0.99, 0.01)))) ## End(Not run)
## Not run: data(hbk, package="robustbase") (out <- fsreg(Y~., data=hbk, method="MM", monitoring=TRUE, control=MMregeda_control(eff=seq(0.75, 0.99, 0.01)))) ## End(Not run)
mmregeda
ObjectsAn object of class mmregeda.object
holds information about
the result of a call to fsreg
when method="MM"
and
monitoring=TRUE
.
The object itself is basically a list
with the following
components:
auxscale |
scalar, S estimate of the scale (or supplied external estimate of scale, if option InitialEst is not empty). |
Beta |
p x length(eff) matrix containing MM estimate of regression coefficients for each value of eff. |
RES |
n x length(eff) matrix containing the monitoring
of scaled residuals for each value of |
Weights |
n x length(eff) matrix containing the estimates of the weights for each value of eff |
Outliers |
Boolean matrix containing the list of the units declared as outliers for each value of eff using confidence level specified in input scalar conflev. |
conflev |
Confidence level which is used to declare units as outliers. Remark: conflev will be used to draw the horizontal line (confidence band) in the plot. |
Ssingsub |
Number of subsets wihtout full rank. Notice that Notice that singsub > 0.1*(number of subsamples) produces a warning |
rhofunc |
string identifying the rho function which has been used. |
rhofuncparam |
vector which contains the additional parameters for the specified rho function which have been used. For hyperbolic rho function the value of k =sup CVC. For Hampel rho function the parameters a, b and c. |
eff |
vector containing the value of eff which have been used. |
X |
the data matrix X |
y |
the response vector y |
The object has class "mmregeda"
.
## Not run: data(hbk, package="robustbase") (out <- fsreg(Y~., data=hbk, method="MM", monitoring=TRUE)) class(out) summary(out) ## End(Not run)
## Not run: data(hbk, package="robustbase") (out <- fsreg(Y~., data=hbk, method="MM", monitoring=TRUE)) class(out) summary(out) ## End(Not run)
There are 60 observations on a response y with the values of three explanatory variables. The scatter plot matrix of the data shows y increasing with each of x1, x2 and x3. The plot of residuals against fitted values shows no obvious pattern. However the FS finds that there are 6 masked outliers.
data(multiple_regression)
data(multiple_regression)
A data frame with 60 rows and 4 variables The variables are as follows:
X1
X2
X3
y
@references Atkinson, A. C., and Riani, M. (2000). Robust Diagnostic Regression Analysis. Springer-Verlag, New York.
These data, introduced by Cook and Weisberg (1994), consist of 82 observations on horse mussels from New Zeland. The variables are shell length, width, height, mass and muscle mass
data(mussels)
data(mussels)
A data frame with 82 rows and 5 variables
Initializes the MATLAB random generator
myrng(seed)
myrng(seed)
seed |
a single value, interpreted as an integer |
Integer, the seed value with which the MATLAB random number generator was initialized.
FSDA team, [email protected]
## Not run: data(wool) XX <- wool y <- XX[, ncol(XX)] X <- XX[, 1:(ncol(XX)-1), drop=FALSE] seed <- myrng() #i nitialized the RNG and keep the seed myrng(seed) # repeat the computations with the same seed (out2 <- fsreg(X, y, method="LTS")) all.equal(out1, out2) ## End(Not run)
## Not run: data(wool) XX <- wool y <- XX[, ncol(XX)] X <- XX[, 1:(ncol(XX)-1), drop=FALSE] seed <- myrng() #i nitialized the RNG and keep the seed myrng(seed) # repeat the computations with the same seed (out2 <- fsreg(X, y, method="LTS")) all.equal(out1, out2) ## End(Not run)
The poison data (by Box and Cox, 1964) are about the time to death of animals in a 3 x 4
factorial experiment with four observations at each factor combination. There are no outliers
or influential observations that cannot be reconciled with the greater part of the data by a
suitable transformation.
data(poison)
data(poison)
A data frame with 48 rows and 7 variables: six explanatory and one response variable.
G. E. P. Box and D. R. Cox (1964). An Analysis of Transformations, Journal of the Royal Statistical Society. Series B, 262 pp. 211–252.
data(poison) head(poison)
data(poison) head(poison)
Finds the tuning constant(s) associated to the supplied breakdown point or asymptotic efficiency or computes breakdown point and efficiency associated with the supplied constant(s) for the following psi functions: TB=Tukey biweight, HA=Hampel, HU=Huber, HYP=Hyperbolic, OPT=Optimal, PD=mdpd.
psifun( u = vector(mode = "double", length = 0), p = 1, fun = c("TB", "bisquare", "biweight", "HA", "hampel", "HU", "huber", "HYP", "hyperbolic", "OPT", "optimal", "PD", "mdpd"), bdp, eff, const, param, trace = FALSE )
psifun( u = vector(mode = "double", length = 0), p = 1, fun = c("TB", "bisquare", "biweight", "HA", "hampel", "HU", "huber", "HYP", "hyperbolic", "OPT", "optimal", "PD", "mdpd"), bdp, eff, const, param, trace = FALSE )
u |
optional vector containing scaled residuals or Mahalanobis
distances for the |
p |
number of variables ( |
fun |
psi function class. One of TB, HA, HU, HP, OPT or PD. |
bdp |
requested breakdown point |
eff |
requested asymptotic efficiency |
const |
tuning constant c |
param |
additional parameters |
trace |
whether to print intermediate results. Default is |
A list will be returned containing the following elements:
class
: link function which has be used. Possible values are
'bisquare', 'optimal', 'hyperbolic', 'hampel', 'huber' or 'mdpd'
bdp
: breakdown point
eff
: asymptotic efficiency
c1
: consistency factor (and other parameters) associated
to required breakdown point or nominal efficiency.
kc1
: Expectation of rho associated with c1
to get a
consistent estimator at the model distribution
kc1 = E(rho) = sup(rho)*bdp
rho
: vector of length n
which contains the rho
function associated to the residuals or Mahalanobis distances
for the n
units of the sample. Empty if u
is not provided.
psi
: vector of length n
which contains the psi
function associated with the residuals or Mahalanobis distances
for the n
units of the sample. Empty if u
is not provided.
psider
: vector of length n
which contains the derivative of the
psi function associated with the residuals or Mahalanobis distances
for the n
units of the sample. Empty if u
is not provided.
psix
: vector of length n
which contains
the psi function mutiplied by u
associated with the residuals or Mahalanobis distances
for the n
units of the sample. Empty if u
is not provided.
wei
: vector of length n
which contains the weights
associated with the residuals or Mahalanobis distances
for the n
units of the sample. Empty if u
is not provided.
FSDA team, [email protected]
Hoaglin, D.C. and Mosteller, F. and Tukey, J.W. (1982), Understanding Robust and Exploratory Data Analysis, Wiley, New York.
Huber, P.J. (1981), Robust Statistics, Wiley.
Huber, P.J. and Ronchetti, E.M. (2009), Robust Statistics, 2nd Edition, Wiley.
Hampel, F.R. and Rousseeuw, P.J. and Ronchetti E. (1981), The Change-of-Variance Curve and Optimal Redescending M-Estimators, Journal of the American Statistical Association, 76, pp. 643–648.
Maronna, R.A. and Martin D. and Yohai V.J. (2006), Robust Statistics, Theory and Methods, Wiley, New York.
Riani, M. and Atkinson, A. C. and Corbellini, A. and Perrotta, D. (2020) Robust regression with density power divergence: Theory, comparisons, and data analysis, Entropy 22. doi:10.3390/e22040399.
## Not run: ## Find c for given bdp for the Tukey biweight function ## The constant c associated to a breakdown point of ## 50 percent in regression is ## c=1.547644980928226 psifun(bdp=0.5) psifun(c=1.547644980928226) ## Find c for given bdp for the Hampel function psifun(bdp=0.5, fun="hampel") ## Plot Huber rho function. x <- seq(-3, 3, 0.001) c <- 1.345; HUc1 <- psifun(u=x, p=1, fun="HU", const=c) rhoHU <- HUc1$rho plot(x, rhoHU, type="l", lty="solid", lwd=2, col="blue", xlab="u", ylab="rho (u,1.345)", ylim=c(0.16, 4.5)) lines(x, x^2/2, type="l", lty="dotted", lwd=1.5, col="red") legend(-1, 4.6, legend=c("Huber rho function", "u^2/2"), lty=c("solid", "dotted"), lwd=c(2,1.5), col=c("blue", "red")) yc <- 0.13; text(-c, yc, paste0("-c=", -c), adj=1) text(c,yc, paste0("c=",c), adj=0) segments(c, 0, c, c**2/2, col="red") segments(-c, 0, -c, c**2/2, col="red") points(c, c**2/2, col="red") points(-c, c**2/2, col="red") ## End(Not run)
## Not run: ## Find c for given bdp for the Tukey biweight function ## The constant c associated to a breakdown point of ## 50 percent in regression is ## c=1.547644980928226 psifun(bdp=0.5) psifun(c=1.547644980928226) ## Find c for given bdp for the Hampel function psifun(bdp=0.5, fun="hampel") ## Plot Huber rho function. x <- seq(-3, 3, 0.001) c <- 1.345; HUc1 <- psifun(u=x, p=1, fun="HU", const=c) rhoHU <- HUc1$rho plot(x, rhoHU, type="l", lty="solid", lwd=2, col="blue", xlab="u", ylab="rho (u,1.345)", ylim=c(0.16, 4.5)) lines(x, x^2/2, type="l", lty="dotted", lwd=1.5, col="red") legend(-1, 4.6, legend=c("Huber rho function", "u^2/2"), lty=c("solid", "dotted"), lwd=c(2,1.5), col=c("blue", "red")) yc <- 0.13; text(-c, yc, paste0("-c=", -c), adj=1) text(c,yc, paste0("c=",c), adj=0) segments(c, 0, c, c**2/2, col="red") segments(-c, 0, -c, c**2/2, col="red") points(c, c**2/2, col="red") points(-c, c**2/2, col="red") ## End(Not run)
Produces an interactive scatterplot of the responce y
against each variable of the predictor matrix X
.
regspmplot( y, X, group, plot, namey, nameX, col, cex, pch, labeladd, legend, xlim, ylim, tag, datatooltip, databrush, subsize, selstep, selunit, trace = FALSE, ... )
regspmplot( y, X, group, plot, namey, nameX, col, cex, pch, labeladd, legend, xlim, ylim, tag, datatooltip, databrush, subsize, selstep, selunit, trace = FALSE, ... )
y |
responce variable or an object containing the responce, the predictors and possibly other variable resulting from monitoring of regression. If |
X |
Predictor variables. Data matrix of explanatory variables (also called 'regressors') of
dimension |
group |
grouping variable. Vector with |
plot |
This option controls the names which are displayed in the margins
of the scatterplot matrix as well as the labels of the legend.
If
|
namey |
a character string with the name of the responce variable |
nameX |
a vector of character strings with the names of the explanatory variables |
col |
color specification for the data point. Can be different for each group. By default, the order of the colors is blue, red, black, magenta, green, cyan and yelow. |
cex |
the size of the symbols used for plotting. By default |
pch |
specification of the symbols to use. For example, if
there are three groups, and |
labeladd |
logical, controls wheather the elements belonging to the last group
in the scatterplot matrix are labelled with their unit row index
or their rowname. The rowname is taken from the parameter |
legend |
logical, controls where a legend is shown or not. |
xlim |
x limits. A vector with two elements controlling minimum and maximum on the x axis. By defaul automatic scale is used. |
ylim |
y limits. A vector with two elements controlling minimum and maximum on the y axis. By defaul automatic scale is used. |
tag |
Plot handle. String which identifies the handle of the plot which is about to be created.
The default is |
datatooltip |
If datatooltip is not empty the user can use the mouse in order to have
information about the unit selected, the step in which the unit enters the search and
the associated label. If datatooltip is a list, it is possible to control the aspect
of the data cursor (see MATLAB function |
databrush |
Interactive mouse brushing. If databrush is missing or empty (default), no brushing is done. The activation of this option (databrush is TRUE or a list) enables the user to select a set of trajectories in the current plot and to see them highlighted in the scatterplot matrix. If the scatterplot matrix does not exist it is automatically created. In addition, brushed units can be highlighted in the monitoring MD plot. Note that the window style of the other figures is set equal to that which contains the monitoring residual plot. In other words, if the monitoring residual plot is docked all the other figures will be docked too. If Note that the window style of the other figures is set equal to that which contains the monitoring residual plot. In other words, if the monitoring residual plot is docked all the other figures will be docked too If If
|
subsize |
x axis control, a numeric vector containing the subset size
with length equal to the number of columns of matrix residuals. If it is
not specified it will be set equal to |
selstep |
Text shown in selected steps, a numeric vector which specifies
for which steps of the forward search textlabels are added in the monitoring
residual plot after a brushing action in the yXplot. The default is to
write the labels at the initial and final step. The default is
|
selunit |
Unit labelling. A vector of strings, a string, or a numeric
vector for labelling units. If out is an object the threshold is associated
with the trajectories of the residuals monitored along the search else it
refers to the values of the response variable. If it is a vector of strings,
only the lines associated with the units that in at least one step of the
search had a residual smaller than |
trace |
Whether to print intermediate results. Default is |
... |
potential further arguments passed to lower level functions. |
none
FSDA team, [email protected]
## Not run: ## Example of the use of function regspmplot with all the default options ## regsmplot() with first argument vector y and no option. ## In the first example as input there are two matrices: y and X respectively ## A simple plot is created n <- 100 p <- 3 X <- matrix(data=rnorm(n*p), nrow=n, ncol=p) y <- matrix(data=rnorm(n*1), nrow=n, ncol=1) regspmplot(y, X) ## Example of the use of function regspmplot with first argument ## vector y and third argument group. ## Different groups are shown in the yXplot group <- rep(0, n) group[1:(n/2)] <- rep(1, n/2) regspmplot(y, X, group) ## Example of the use of function regspmplot with first argument ## vector y, third argument group and fourth argument plot ## (Ex1) plot=TRUE regspmplot(y, X, group, plot=TRUE) ## (Ex1) Set the scale for the x axes, the y axis and control symbol type regspmplot(y, X, group, xlim=c(-1,2), ylim=c(0,2), pch=c(10,11), trace=TRUE) ## When the first input argument is an object. ## In the following example the input is an object which also contains ## information about the forward search. (out <- fsreg(y~X, method="LMS", control=LXS_control(nsamp=1000))) (out <- fsreg(y~X, bsb=out$bs, monitoring=TRUE)) regspmplot(out, plot=0) ## End(Not run)
## Not run: ## Example of the use of function regspmplot with all the default options ## regsmplot() with first argument vector y and no option. ## In the first example as input there are two matrices: y and X respectively ## A simple plot is created n <- 100 p <- 3 X <- matrix(data=rnorm(n*p), nrow=n, ncol=p) y <- matrix(data=rnorm(n*1), nrow=n, ncol=1) regspmplot(y, X) ## Example of the use of function regspmplot with first argument ## vector y and third argument group. ## Different groups are shown in the yXplot group <- rep(0, n) group[1:(n/2)] <- rep(1, n/2) regspmplot(y, X, group) ## Example of the use of function regspmplot with first argument ## vector y, third argument group and fourth argument plot ## (Ex1) plot=TRUE regspmplot(y, X, group, plot=TRUE) ## (Ex1) Set the scale for the x axes, the y axis and control symbol type regspmplot(y, X, group, xlim=c(-1,2), ylim=c(0,2), pch=c(10,11), trace=TRUE) ## When the first input argument is an object. ## In the following example the input is an object which also contains ## information about the forward search. (out <- fsreg(y~X, method="LMS", control=LXS_control(nsamp=1000))) (out <- fsreg(y~X, bsb=out$bs, monitoring=TRUE)) regspmplot(out, plot=0) ## End(Not run)
Plots the trajectories of the monitored scaled (squared) residuals
resfwdplot(out, xlim, ylim, xlab, ylab, main, lwd, lty, col, cex.lab, cex.axis, xvalues, fg.thresh, fg.unit, fg.labstep, fg.lwd, fg.lty, fg.col, fg.mark, fg.cex, bg.thresh, bg.style, tag, datatooltip, label, nameX, namey, msg, databrush, standard, fground, bground, ...)
resfwdplot(out, xlim, ylim, xlab, ylab, main, lwd, lty, col, cex.lab, cex.axis, xvalues, fg.thresh, fg.unit, fg.labstep, fg.lwd, fg.lty, fg.col, fg.mark, fg.cex, bg.thresh, bg.style, tag, datatooltip, label, nameX, namey, msg, databrush, standard, fground, bground, ...)
out |
An object returned by one of the monitoring functions (see The needed elements of
|
ylim |
Control |
xlim |
Control |
xlab |
a title for the x axis |
ylab |
a title for the y axis |
main |
an overall title for the plot |
lwd |
The line width, a positive number, defaulting to 1 |
lty |
The line type. Line types can either be specified as an integer (1=solid (default), 2=dashed, 3=dotted, 4=dotdash, 5=longdash, 6=twodash) or as one of the character strings "solid", "dashed", "dotted", "dotdash", "longdash", or "twodash". The latter two are not supported by Matlab. |
col |
colors to be used for the highlighted units |
cex.lab |
The magnification to be used for x and y labels relative to the current setting of cex |
cex.axis |
The magnification to be used for axis annotation relative to the current setting of cex |
xvalues |
values for the x axis. Numeric vector of |
fg.thresh |
(alternative to fg.unit) numeric vector of length 1 or 2 which specifies the highlighted trajectories.
If |
fg.unit |
(alternative to fg.thresh), vector containing the list of the units to be highlighted.
If |
fg.labstep |
numeric vector which specifies the steps of the search where to put labels for
the highlighted trajectories (units). The default is to put the labels at the
initial and final steps of the search. |
fg.lwd |
The line width for the highlighted trajectories (units). Default is 1. |
fg.lty |
The line type for the highlighted trajectories (units). Line types can either be specified as an integer (1=solid (default), 2=dashed, 3=dotted, 4=dotdash, 5=longdash, 6=twodash) or as one of the character strings "solid", "dashed", "dotted", "dotdash", "longdash", or "twodash". The latter two are not supported by Matlab. |
fg.col |
colors to be used for the highlighted units. |
fg.mark |
Controlls whether to plot highlighted trajectories as symbols.
if |
fg.cex |
controls the font size of the labels of the trajectories in foreground. |
bg.thresh |
numeric vector of length 1 or 2 which specifies how to define the unimmportant trajectories.
Unimmportant trajectories will be plotted using a colormap, in greysh or will be hidden.
If |
bg.style |
specifies how to plot the unimportant trajectories as defined in option bthresh.
When |
tag |
Plot handle. String which identifies the handle of the plot which is about to be created. The default is to use tag 'pl_resfwd'. Notice that if the program finds a plot which has a tag equal to the one specified by the user, then the output of the new plot overwrites the existing one in the same window else a new window is created. |
datatooltip |
Interactive clicking. It is inactive if this parameter is missing or empty.
The default is For example for class
|
label |
Character vector containing the labels of the units (optional argument used when
|
nameX |
Add variable labels in plot. A vector of strings of length |
namey |
Add response label. A string containing the label of the response |
msg |
Controls whether to display or not messages on the screen If |
databrush |
interactive mouse brushing. If databrush is missing or empty (default), no brushing is done. The activation of this option (databrush is a scalar or a list) enables the user to select a set of trajectories in the current plot and to see them highlighted in the y|X plot, i.e. a matrix of scatter plots of y against each column of X, grouped according to the selection(s) done by brushing. If the plot y|X does not exist it is automatically created. In addition, brushed units are automatically highlighted in the minimum deletion residual plot if it is already open. The extension to the following plots will be available in future versions of the toolbox:
Note that the window style of the other figures is set equal to that which contains the monitoring residual plot. In other words, if the monitoring residual plot is docked all the other figures will be docked too If If
|
standard |
(MATLAB-style arguments) appearance of the plot in terms of xlim, ylim, axes labels and their font size style, color of the lines, etc. |
fground |
MATLAB-style arguments for the fground trajectories in foregroud. |
bground |
MATLAB-style arguments for the fground trajectories in backgroud. |
... |
potential further arguments passed to lower level functions. |
No details
No value returned
FSDA team
## Not run: n <- 100 y <- rnorm(n) X <- matrix(rnorm(n*4), nrow=n) out <- fsreg(y~X, method="LTS") out <- fsreg(y~X, method="FS", bsb=out$bs, monitoring=TRUE) resfwdplot(out) ## End(Not run)
## Not run: n <- 100 y <- rnorm(n) X <- matrix(rnorm(n*4), nrow=n) out <- fsreg(y~X, method="LTS") out <- fsreg(y~X, method="FS", bsb=out$bs, monitoring=TRUE) resfwdplot(out) ## End(Not run)
The function resindexplot()
plots the residuals from a regression analysis
versus index number or any other variable. The residuals come from an output
object of any of the regression fucntions or a simply a vector of values.
In order to use the databrush option, the residuals must come from one of
the fsdaR regression functions.
resindexplot(out, x, xlim, ylim, xlab, ylab, main, numlab, indlab, conflev, cex.axis, cex.lab, lwd, nameX, namey, tag, col, cex, databrush, ...)
resindexplot(out, x, xlim, ylim, xlab, ylab, main, numlab, indlab, conflev, cex.axis, cex.lab, lwd, nameX, namey, tag, col, cex, databrush, ...)
out |
A vector containing the residuals from a regression analysis or an object returned by one of
the regression functions (see |
x |
The vector to be plotted on the x-axis. As default the sequence 1:length(residuals) will be used |
xlim |
Control |
ylim |
Control |
xlab |
a title for the x axis |
ylab |
a title for the y axis |
main |
an overall title for the plot |
numlab |
Number of points to be identified in plots (see also |
indlab |
Which points to be identified in plots (see also |
conflev |
Confidence interval for the horizontal bands (a numeric vector). It can be a vector of different confidence level values. Remark: confidence interval is based on the chi^2 distribution |
cex.axis |
The magnification to be used for axis annotation relative to the current setting of cex |
cex.lab |
The magnification to be used for x and y labels relative to the current setting of cex |
lwd |
The line width, a positive number, defaulting to 1 |
tag |
Figure tag (character). Tag of the figure which will host the |
col |
Fill color for markers that are closed shapes (circle, square, diamond, pentagram, hexagram, and the four triangles).
Can be |
cex |
Size of the point symbols. The magnification to be used relative to the current setting of cex. |
nameX |
Add variable labels in plot. A vector of strings of length |
namey |
Add response label. A string containing the label of the response |
databrush |
Interactive mouse brushing. If databrush is missing or empty (default) or
Note that the window style of the other figures is set equal to that which contains the monitoring residual plot. In other words, if the monitoring residual plot is docked all the other figures will be docked too If If
|
... |
potential further arguments passed to lower level functions. |
No details
No value returned
FSDA team
## Not run: out <- fsreg(stack.loss~., data=stackloss) resindexplot(out, conflev=c(0.95,0.99), col="green") ## End(Not run)
## Not run: out <- fsreg(stack.loss~., data=stackloss) resindexplot(out, conflev=c(0.95,0.99), col="green") ## End(Not run)
Computes the score test for transformation in regression
score(x, ...) ## S3 method for class 'formula' score( formula, data, subset, weights, na.action, model = TRUE, contrasts = NULL, offset, ... ) ## Default S3 method: score( x, y, intercept = TRUE, la = c(-1, -0.5, 0, 0.5, 1), lik = FALSE, nocheck = FALSE, trace = FALSE, ... )
score(x, ...) ## S3 method for class 'formula' score( formula, data, subset, weights, na.action, model = TRUE, contrasts = NULL, offset, ... ) ## Default S3 method: score( x, y, intercept = TRUE, la = c(-1, -0.5, 0, 0.5, 1), lik = FALSE, nocheck = FALSE, trace = FALSE, ... )
x |
An Missing values (NA's) and infinite values (Inf's) are allowed, since observations (rows) with missing or infinite values will automatically be excluded from the computations. |
... |
potential further arguments passed to lower level functions. |
formula |
a |
data |
data frame from which variables specified in
|
subset |
an optional vector specifying a subset of observations to be used in the fitting process. |
weights |
an optional vector of weights to be used NOT USED YET. |
na.action |
a function which indicates what should happen
when the data contain |
model |
|
contrasts |
an optional list. See the |
offset |
this can be used to specify an a priori
known component to be included in the linear predictor
during fitting. An |
y |
Response variable. A vector with |
intercept |
wheather to use constant term (default is |
la |
values of the transformation parameter for which it is necessary
to compute the score test. Default value of lambda is
|
lik |
likelihood for the augmented model. If true the value of the likelihood for the augmented model will be calculated and returend otherwise (default) only the value of the score test will be given |
nocheck |
Whether to check input arguments. If |
trace |
Whether to print intermediate results. Default is |
An S3 object of class score.object
will be returned which is basically a list
containing the following elements:
la
: vector containing the values of lambda for which fan plot is constructed
Score
: a vector containing the values of the score test for
each value of the transformation parameter.
Lik
: value of the likelihood. This output is produced only if lik=TRUE.
FSDA team, [email protected]
Atkinson, A.C. and Riani, M. (2000), Robust Diagnostic Regression Analysis Springer Verlag, New York.
## Not run: data(wool) XX <- wool y <- XX[, ncol(XX)] X <- XX[, 1:(ncol(XX)-1), drop=FALSE] (out <- score(X, y)) # call 'score' with all default parameters (out <- score(cycles~., data=wool)) # use the formula interface (out <- score(cycles~., data=wool, lik=TRUE)) # return the likelihood data(loyalty) head(loyalty) ## la is a vector containing the values of \lambda which have to be tested (out <- score(amount_spent~., data=loyalty, la=c(0.25, 1/3, 0.4, 0.5))) (out <- score(amount_spent~., data=loyalty, la=c(0.25, 1/3, 0.4, 0.5), lik=TRUE)) ## End(Not run)
## Not run: data(wool) XX <- wool y <- XX[, ncol(XX)] X <- XX[, 1:(ncol(XX)-1), drop=FALSE] (out <- score(X, y)) # call 'score' with all default parameters (out <- score(cycles~., data=wool)) # use the formula interface (out <- score(cycles~., data=wool, lik=TRUE)) # return the likelihood data(loyalty) head(loyalty) ## la is a vector containing the values of \lambda which have to be tested (out <- score(amount_spent~., data=loyalty, la=c(0.25, 1/3, 0.4, 0.5))) (out <- score(amount_spent~., data=loyalty, la=c(0.25, 1/3, 0.4, 0.5), lik=TRUE)) ## End(Not run)
score
An object of class score.object
holds information about
the result of a call to score
.
The functions print()
and summary()
are used to obtain and print a
summary of the results. An object of class score
is a list containing at least the following components:
la
: vector containing the values of lambda for which fan plot is constructed
Score
: a vector containing the values of the score test for
each value of the transformation parameter.
Lik
: value of the likelihood. This output is produced only if lik=TRUE.
## Not run: data(wool) (out <- score(cycles~., data=wool, lik=TRUE)) class(out) summary(out) ## End(Not run)
## Not run: data(wool) (out <- score(cycles~., data=wool, lik=TRUE)) class(out) summary(out) ## End(Not run)
Computes S estimators in multivariate analysis
smult( x, monitoring = FALSE, plot = FALSE, bdp, nsamp, conflev = 0.975, nocheck = FALSE, trace = FALSE, ... )
smult( x, monitoring = FALSE, plot = FALSE, bdp, nsamp, conflev = 0.975, nocheck = FALSE, trace = FALSE, ... )
x |
An n x p data matrix (n observations and p variables). Rows of x represent observations, and columns represent variables. Missing values (NA's) and infinite values (Inf's) are allowed, since observations (rows) with missing or infinite values will automatically be excluded from the computations. |
monitoring |
Wheather to perform monitoring of Mahalanobis distances and other specific quantities |
plot |
Plots the Mahalanobis distances against index number. If
|
bdp |
Measures the fraction of outliers the algorithm should resist.
In this case any value greater than 0 but smaller or equal than 0.5 will do fine (default is |
nsamp |
Number of subsamples which will be extracted to find the robust estimator.
If |
conflev |
Confidence level which is used to declare units as outliers (scalar).
Usually |
nocheck |
It controls whether to perform checks on matrix Y. If |
trace |
Whether to print intermediate results. Default is |
... |
potential further arguments passed to lower level functions. |
This function follows the lines of MATLAB/R code developed during the years by many authors. For more details see http://www.econ.kuleuven.be/public/NDBAE06/programs/ and the R package CovSest The core of these routines, e.g. the resampling approach, however, has been completely redesigned, with considerable increase of the computational performance.
Depending on the input parameter monitoring
, one of
the following objects will be returned:
FSDA team, [email protected]
Maronna, R.A., Martin D. and Yohai V.J. (2006), Robust Statistics, Theory and Methods, Wiley, New York.
## Not run: data(hbk, package="robustbase") (out <- smult(hbk[,1:3])) class(out) summary(out) ## Generate contaminated data (200,3) n <- 200 p <- 3 set.seed(123456) X <- matrix(rnorm(n*p), nrow=n) Xcont <- X Xcont[1:5, ] <- Xcont[1:5,] + 3 out1 <- smult(Xcont, trace=TRUE) # no plots (plot defaults to FALSE) names(out1) ## plot=TRUE - generates: (1) a plot of Mahalanobis distances against ## index number. The confidence level used to draw the confidence bands for ## the MD is given by the input option conflev. If conflev is ## not specified a nominal 0.975 confidence interval will be used and ## (2) a scatter plot matrix with the outliers highlighted. (out1 <- smult(Xcont, trace=TRUE, plot=TRUE)) ## plots is a list: the spm shows the labels of the outliers. (out1 <- smult(Xcont, trace=TRUE, plot=list(labeladd="1"))) ## plots is a list: the spm uses the variable names provided by 'nameY'. (out1 <- smult(Xcont, trace=TRUE, plot=list(nameY=c("A", "B", "C")))) ## smult() with monitoring (out2 <- smult(Xcont, monitoring=TRUE, trace=TRUE)) names(out2) ## Forgery Swiss banknotes examples. data(swissbanknotes) (out1 <- smult(swissbanknotes[101:200,], plot=TRUE)) (out1 <- smult(swissbanknotes[101:200,], plot=list(labeladd="1"))) ## End(Not run)
## Not run: data(hbk, package="robustbase") (out <- smult(hbk[,1:3])) class(out) summary(out) ## Generate contaminated data (200,3) n <- 200 p <- 3 set.seed(123456) X <- matrix(rnorm(n*p), nrow=n) Xcont <- X Xcont[1:5, ] <- Xcont[1:5,] + 3 out1 <- smult(Xcont, trace=TRUE) # no plots (plot defaults to FALSE) names(out1) ## plot=TRUE - generates: (1) a plot of Mahalanobis distances against ## index number. The confidence level used to draw the confidence bands for ## the MD is given by the input option conflev. If conflev is ## not specified a nominal 0.975 confidence interval will be used and ## (2) a scatter plot matrix with the outliers highlighted. (out1 <- smult(Xcont, trace=TRUE, plot=TRUE)) ## plots is a list: the spm shows the labels of the outliers. (out1 <- smult(Xcont, trace=TRUE, plot=list(labeladd="1"))) ## plots is a list: the spm uses the variable names provided by 'nameY'. (out1 <- smult(Xcont, trace=TRUE, plot=list(nameY=c("A", "B", "C")))) ## smult() with monitoring (out2 <- smult(Xcont, monitoring=TRUE, trace=TRUE)) names(out2) ## Forgery Swiss banknotes examples. data(swissbanknotes) (out1 <- smult(swissbanknotes[101:200,], plot=TRUE)) (out1 <- smult(swissbanknotes[101:200,], plot=list(labeladd="1"))) ## End(Not run)
smult.object
ObjectsAn object of class smult.object
holds information about
the result of a call to smult
.
The object itself is basically a list
with the following
components:
loc |
p-by-1 vector containing S estimate of location. |
shape |
p-by-p matrix containing robust estimate of the shape matrix. Remark: det|shape|=1. |
scale |
robust estimate of the scale. |
cov |
|
bs |
a (p+1) vector containing the units forming best subset associated with S estimate of location. |
md |
n-by-1 vector containing the estimates of the robust Mahalanobis distances (in squared units). This vector contains the distances of each observation from the location of the data, relative to the scatter matrix cov. |
outliers |
A vector containing the list of the units declared
as outliers using confidence level specified in input scalar
|
conflev |
Confidence level that was used to declare outliers. |
singsub |
Number of subsets without full rank. Notice that
|
weights |
n x 1 vector containing the estimates of the weights. |
X |
the data matrix X |
The object has class "smult"
.
## Not run: data(hbk, package="robustbase") (out <- smult(hbk[,1:3])) class(out) summary(out) ## End(Not run)
## Not run: data(hbk, package="robustbase") (out <- smult(hbk[,1:3])) class(out) summary(out) ## End(Not run)
smulteda.object
ObjectsAn object of class smulteda.object
holds information about
the result of a call to smult
with monitoring=TRUE
.
The object itself is basically a list
with the following
components:
Loc |
length(bdp)-by-p matrix containing S estimate of location for each value of |
Shape |
p-by-p-by-length(bdp) 3D array containing robust estimate of the shape for each value of bdp. Remark: det|shape|=1. |
Scale |
length(bdp) vector containing robust estimate of the scale for each value of bdp. |
Cov |
p-by-p-by-length(bdp) 3D array containing robust estimate
of covariance matrix for each value of |
Bs |
(p+1)-by-length(bdp) matrix containing the units forming best subset for each value of bdp. |
MAL |
n-by-length(bdp) matrix containing the estimates of the robust Mahalanobis distances (in squared units) for each value of bdp. |
Outliers |
n-by-length(bdp) matrix containing flags for the outliers.
Boolean matrix containing the list of the
units declared as outliers for each value of bdp using confidence
level specified in input scalar |
Weights |
n x length(bdp) matrix containing the weights for each value of bdp. |
conflev |
Confidence level that was used to declare outliers. |
singsub |
Number of subsets without full rank. Notice that
|
bdp |
vector which contains the values of bdp which have been used. |
X |
the data matrix X. |
The object has class "smulteda"
.
## Not run: data(hbk, package="robustbase") (out <- smult(hbk[,1:3], monitoring=TRUE)) class(out) summary(out) ## End(Not run)
## Not run: data(hbk, package="robustbase") (out <- smult(hbk[,1:3], monitoring=TRUE)) class(out) summary(out) ## End(Not run)
Produces an interactive scatterplot matrix with boxplots or histograms on the main diagonal and possibly robust bivariate contours
spmplot( X, group, plot, variables, col, cex, pch, labeladd, label, legend, dispopt = c("hist", "box"), tag, datatooltip, databrush, trace = FALSE, ... )
spmplot( X, group, plot, variables, col, cex, pch, labeladd, label, legend, dispopt = c("hist", "box"), tag, datatooltip, databrush, trace = FALSE, ... )
X |
data matrix (2D array) containing |
group |
grouping variable. Vector with |
plot |
controls the names which are displayed in the margins of the
scatter-plot matrix, the labels of the legend the colors and the symbols.
If
|
variables |
a character string with the names of the variables |
col |
color specification for the data point. Can be different for each group. By default, the order of the colors is blue, red, black, magenta, green, cyan and yelow. |
cex |
the size of the symbols used for plotting. By default |
pch |
specification of the symbols to use. For example, if
there are three groups, and |
labeladd |
logical, controls wheather the elements belonging to the last group
in the scatterplot matrix are labelled with their unit row index
or their rowname. The rowname is taken from the parameter |
label |
a character vector of length |
legend |
logical, controls where a legend is shown or not. |
dispopt |
controls how to fill the diagonals in the plot (main diagonal of
the scatter plot matrix). Set |
tag |
Plot handle. String which identifies the handle of the plot which is about to be created.
The default is |
datatooltip |
If datatooltip is not empty the user can use the mouse in order to have
information about the unit selected, the step in which the unit enters the search and
the associated label. If datatooltip is a list, it is possible to control the aspect
of the data cursor (see MATLAB function |
databrush |
Interactive mouse brushing. If databrush is missing or empty (default), no brushing is done. The activation of this option (databrush is TRUE or a list) enables the user to select a set of trajectories in the current plot and to see them highlighted in the scatterplot matrix. If the scatterplot matrix does not exist it is automatically created. In addition, brushed units can be highlighted in the monitoring MD plot. Note that the window style of the other figures is set equal to that which contains the monitoring residual plot. In other words, if the monitoring residual plot is docked all the other figures will be docked too. If If
|
trace |
Whether to print intermediate results. Default is |
... |
potential further arguments passed to lower level functions. |
none
FSDA team, [email protected]
## Not run: ## Call of spmplot() without optional parameters. ## Iris data: scatter plot matrix with univariate boxplots on the main ## diagonal. X <- iris[,1:4] group <- iris[,5] spmplot(X, group, variables=c('SL','SW','PL','PW'), dispopt="box") ## Example of spmplot() called by routine fsmult(). ## Generate contaminated data. n <- 200; p <- 3 X <- matrix(rnorm(n*p), ncol=3) Xcont <- X Xcont[1:5,] <- Xcont[1:5,] + 3 ## spmplot is called automatically by all outlier detection methods, e.g. fsmult() out <- fsmult(Xcont, plot=TRUE); ## Now test the direct use of fsmult(). Set two groups, e.g. those obtained ## from fsmult(). group = rep(0, n) group[out$outliers] <- 1 ## option 'labeladd' is used to label the outliers ## By default, the legend identifies the groups with the identifiers ## given in vector 'group'. ## Set the colors for the two groups to blue and red. spmplot(Xcont, group, col=c("blue", "red"), labeladd=1, dispopt="box") ## End(Not run)
## Not run: ## Call of spmplot() without optional parameters. ## Iris data: scatter plot matrix with univariate boxplots on the main ## diagonal. X <- iris[,1:4] group <- iris[,5] spmplot(X, group, variables=c('SL','SW','PL','PW'), dispopt="box") ## Example of spmplot() called by routine fsmult(). ## Generate contaminated data. n <- 200; p <- 3 X <- matrix(rnorm(n*p), ncol=3) Xcont <- X Xcont[1:5,] <- Xcont[1:5,] + 3 ## spmplot is called automatically by all outlier detection methods, e.g. fsmult() out <- fsmult(Xcont, plot=TRUE); ## Now test the direct use of fsmult(). Set two groups, e.g. those obtained ## from fsmult(). group = rep(0, n) group[out$outliers] <- 1 ## option 'labeladd' is used to label the outliers ## By default, the legend identifies the groups with the identifiers ## given in vector 'group'. ## Set the colors for the two groups to blue and red. spmplot(Xcont, group, col=c("blue", "red"), labeladd=1, dispopt="box") ## End(Not run)
Sreg_control
object
Creates an object of class Sreg_control
to be used with the fsreg()
function,
containing various control parameters for calling the MATLAB function Sreg()
.
Sreg_control(intercept = TRUE, bdp = 0.5, rhofunc = c("bisquare", "optimal", "hyperbolic", "hampel", "mdpd", "AS"), rhofuncparam, nsamp = 1000, refsteps = 3, reftol = 1e-06, refstepsbestr = 50, reftolbestr = 1e-08, minsctol = 1e-07, bestr = 5, conflev, msg = TRUE, nocheck = FALSE, plot = FALSE)
Sreg_control(intercept = TRUE, bdp = 0.5, rhofunc = c("bisquare", "optimal", "hyperbolic", "hampel", "mdpd", "AS"), rhofuncparam, nsamp = 1000, refsteps = 3, reftol = 1e-06, refstepsbestr = 50, reftolbestr = 1e-08, minsctol = 1e-07, bestr = 5, conflev, msg = TRUE, nocheck = FALSE, plot = FALSE)
intercept |
Indicator for constant term. Scalar. If |
bdp |
Breakdown point. It measures the fraction of outliers the algorithm should resist. In this case any value greater than 0 but smaller or equal than 0.5 will do fine. Note that given bdp nominal efficiency is automatically determined. |
rhofunc |
Specifies the rho function which must be used to weight the residuals. Possible values are 'bisquare' 'optimal' 'hyperbolic' 'hampel'.
The default is 'bisquare'. |
rhofuncparam |
Additional parameters for the specified rho function. For hyperbolic rho function it is possible to set up the value of k = sup CVC (the default value of k is 4.5). For Hampel rho function it is possible to define parameters a, b and c (the default values are a=2, b=4, c=8) |
nsamp |
Number of subsamples which will be extracted to find the robust estimator,
scalar. If |
refsteps |
Number of refining iterations in each subsample (default is |
reftol |
Tolerance for the refining steps. The default value is 1e-6 |
refstepsbestr |
Scalar defining number of refining iterations for each best subset (default = 50). |
reftolbestr |
Tolerance for the refining steps for each of the best subsets. The default value is |
minsctol |
Value of tolerance for the iterative procedure for finding the minimum
value of the scale for each subset and each of the best subsets
(It is used by subroutine |
bestr |
Defins the number of "best betas" to remember from the subsamples.
These will be later iterated until convergence (default is |
conflev |
Confidence level which is used to declare units as outliers. Usually conflev=0.95, 0.975, 0.99 (individual alpha) or conflev=1-0.05/n, 1-0.025/n, 1-0.01/n (simultaneous alpha). Default value is 0.975 |
msg |
Controls whether to display or not messages on the screen If |
nocheck |
Check input arguments, scalar. If |
plot |
Plot on the screen. Scalar. If |
Creates an object of class Sreg_control
to be used with the fsreg()
function,
containing various control parameters.
An object of class "Sreg_control"
which is basically a
list
with components the input arguments of
the function mapped accordingly to the corresponding Matlab function.
FSDA team
See Also as FSR_control
, MMreg_control
and LXS_control
## Not run: data(hbk, package="robustbase") (out <- fsreg(Y~., data=hbk, method="S", control=Sreg_control(bdp=0.25, nsamp=500))) ## End(Not run)
## Not run: data(hbk, package="robustbase") (out <- fsreg(Y~., data=hbk, method="S", control=Sreg_control(bdp=0.25, nsamp=500))) ## End(Not run)
An object of class sreg.object
holds information about
the result of a call to fsreg
.
The object itself is basically a list
with the following
components:
beta |
p-by-1 vector containing the estimated regression parameters (in step n-k). |
scale |
scalar containing the estimate of the scale (sigma). |
bs |
p x 1 vector containing the units forming best subset associated with S estimate of regression coefficient. |
residuals |
residuals. |
fittedvalues |
fitted values. |
outliers |
kx1 vector containing the list of the k units declared as outliers or NULL if the sample is homogeneous. |
conflev |
Confidence level which is used to declare units as outliers.
Usually |
singsub |
Number of subsets wihtout full rank. Notice that
|
weights |
n x 1 vector containing the estimates of the weights |
rhofunc |
Specifies the rho function which has been used to weight the residuals. |
rhofuncparam |
Vector which contains the additional parameters for the specified rho function which has been used. For hyperbolic rho function the value of k =sup CVC. For Hampel rho function the parameters a, b and c. |
X |
the data matrix X |
y |
the response vector y |
The object has class "sreg"
.
## Not run: data(hbk, package="robustbase") (out <- fsreg(Y~., data=hbk, method="S")) class(out) summary(out) ## End(Not run)
## Not run: data(hbk, package="robustbase") (out <- fsreg(Y~., data=hbk, method="S")) class(out) summary(out) ## End(Not run)
Sregeda_control
object
Creates an object of class Sregeda_control
to be used with the fsreg()
function,
containing various control parameters.
Sregeda_control(intercept = TRUE, bdp = seq(0.5, 0.01, -0.01), rhofunc = c("bisquare", "optimal", "hyperbolic", "hampel", "mdpd", "AS"), rhofuncparam, nsamp = 1000, refsteps = 3, reftol = 1e-06, refstepsbestr = 50, reftolbestr = 1e-08, minsctol = 1e-07, bestr = 5, conflev, msg = TRUE, nocheck = FALSE, plot = FALSE)
Sregeda_control(intercept = TRUE, bdp = seq(0.5, 0.01, -0.01), rhofunc = c("bisquare", "optimal", "hyperbolic", "hampel", "mdpd", "AS"), rhofuncparam, nsamp = 1000, refsteps = 3, reftol = 1e-06, refstepsbestr = 50, reftolbestr = 1e-08, minsctol = 1e-07, bestr = 5, conflev, msg = TRUE, nocheck = FALSE, plot = FALSE)
intercept |
Indicator for constant term. Scalar. If |
bdp |
Breakdown point. It measures the fraction of outliers the algorithm should resist. In this case any value greater than 0 but smaller or equal than 0.5 will do fine. The default value of bdp is a sequence from 0.5 to 0.01 with step 0.01 |
rhofunc |
Specifies the rho function which must be used to weight the residuals. Possible values are 'bisquare' 'optimal' 'hyperbolic' 'hampel'.
The default is 'bisquare'. |
rhofuncparam |
Additional parameters for the specified rho function. For hyperbolic rho function it is possible to set up the value of k = sup CVC (the default value of k is 4.5). For Hampel rho function it is possible to define parameters a, b and c (the default values are a=2, b=4, c=8) |
nsamp |
Number of subsamples which will be extracted to find the robust estimator,
scalar. If |
refsteps |
Number of refining iterations in each subsample (default is |
reftol |
Tolerance for the refining steps. The default value is 1e-6 |
refstepsbestr |
Scalar defining number of refining iterations for each best subset (default = 50). |
reftolbestr |
Tolerance for the refining steps for each of the best subsets. The default value is |
minsctol |
Value of tolerance for the iterative procedure for finding the minimum
value of the scale for each subset and each of the best subsets
(It is used by subroutine |
bestr |
Defins the number of "best betas" to remember from the subsamples.
These will be later iterated until convergence (default is |
conflev |
Confidence level which is used to declare units as outliers. Usually conflev=0.95, 0.975, 0.99 (individual alpha) or conflev=1-0.05/n, 1-0.025/n, 1-0.01/n (simultaneous alpha). Default value is 0.975 |
msg |
Controls whether to display or not messages on the screen If |
nocheck |
Check input arguments, scalar. If |
plot |
Plot on the screen. Scalar. If |
Creates an object of class Sregeda_control
to be used with the fsreg()
function,
containing various control parameters.
An object of class "Sregeda_control"
which is basically a
list
with components the input arguments of
the function mapped accordingly to the corresponding Matlab function.
FSDA team
See Also as FSR_control
, MMreg_control
and LXS_control
## Not run: data(hbk, package="robustbase") (out <- fsreg(Y~., data=hbk, method="S", monitoring=TRUE, control=Sregeda_control(nsamp=500, rhofunc='hyperbolic'))) ## End(Not run)
## Not run: data(hbk, package="robustbase") (out <- fsreg(Y~., data=hbk, method="S", monitoring=TRUE, control=Sregeda_control(nsamp=500, rhofunc='hyperbolic'))) ## End(Not run)
sregeda
ObjectsAn object of class sregeda.object
holds information about
the result of a call to fsreg
when method="S"
and
monitoring=TRUE
.
The object itself is basically a list
with the following
components:
Beta |
matrix containing the S estimator of regression coefficients for each value of bdp. |
Scale |
vector containing the estimate of the scale (sigma) for each value of bdp. This is the value of the objective function. |
BS |
p x 1 vector containing the units forming best subset associated with S estimate of regression coefficient. |
RES |
n x length(bdp) matrix containing the monitoring
of scaled residuals for each value of |
Weights |
n x length(bdp) matrix containing the estimates of the weights for each value of bdp |
Outliers |
Boolean matrix containing the list of the units declared as outliers for each value of bdp using confidence level specified in input scalar conflev. |
conflev |
Confidence level which is used to declare units as outliers. Remark: conflev will be used to draw the horizontal line (confidence band) in the plot. |
Singsub |
Number of subsets wihtout full rank. Notice that singsub[bdp[jj]] > 0.1*(number of subsamples) produces a warning |
rhofunc |
Specifies the rho function which has been used to weight the residuals. |
rhofuncparam |
Vector which contains the additional parameters for the specified rho function which has been used. For hyperbolic rho function the value of k =sup CVC. For Hampel rho function the parameters a, b and c. |
X |
the data matrix X |
y |
the response vector y |
The object has class "sregeda"
.
## Not run: data(hbk, package="robustbase") (out <- fsreg(Y~., data=hbk, method="S", monitoring=TRUE)) class(out) summary(out) ## End(Not run)
## Not run: data(hbk, package="robustbase") (out <- fsreg(Y~., data=hbk, method="S", monitoring=TRUE)) class(out) summary(out) ## End(Not run)
fsdalms
objectssummary
method for class "fsdalms"
.
## S3 method for class 'fsdalms' summary(object, correlation = FALSE, ...) ## S3 method for class 'summary.fsdalms' print(x, digits = max(3, getOption("digits") - 3), signif.stars = getOption("show.signif.stars"), ...)
## S3 method for class 'fsdalms' summary(object, correlation = FALSE, ...) ## S3 method for class 'summary.fsdalms' print(x, digits = max(3, getOption("digits") - 3), signif.stars = getOption("show.signif.stars"), ...)
object , x
|
an object of class |
correlation |
logical; if |
digits |
the number of significant digits to use when printing. |
signif.stars |
logical indicating if “significance stars”
should be printer, see |
... |
further arguments passed to or from other methods. |
summary.fsdalms()
, the S3 method, simply returns an (S3) object of class "summary.fsdalms"
for which there's a print
method:
print.summary.fsdalms
prints summary statistics for the forward search (FS) regression estimates.
While the function print.fsdalms
prints only the robust estimates
of the coefficients, print.summary.fsdalms
will print also the regression table.
summary.fsdalms
returns an summary.fsdalms
object, whereas the
print
methods returns its first argument via
invisible
, as all print
methods do.
## Not run: data(Animals, package = "MASS") brain <- Animals[c(1:24, 26:25, 27:28),] lbrain <- log(brain) (fs <- fsreg(brain~body, data=lbrain, method="LTS")) summary(fs) ## compare to the result of ltsReg() from 'robustbase' library(robustbase) (lts <- ltsReg(brain~body, data=lbrain)) summary(lts) ## End(Not run)
## Not run: data(Animals, package = "MASS") brain <- Animals[c(1:24, 26:25, 27:28),] lbrain <- log(brain) (fs <- fsreg(brain~body, data=lbrain, method="LTS")) summary(fs) ## compare to the result of ltsReg() from 'robustbase' library(robustbase) (lts <- ltsReg(brain~body, data=lbrain)) summary(lts) ## End(Not run)
fsdalts
objectssummary
method for class "fsdalts"
.
## S3 method for class 'fsdalts' summary(object, correlation = FALSE, ...) ## S3 method for class 'summary.fsdalts' print(x, digits = max(3, getOption("digits") - 3), signif.stars = getOption("show.signif.stars"), ...)
## S3 method for class 'fsdalts' summary(object, correlation = FALSE, ...) ## S3 method for class 'summary.fsdalts' print(x, digits = max(3, getOption("digits") - 3), signif.stars = getOption("show.signif.stars"), ...)
object , x
|
an object of class |
correlation |
logical; if |
digits |
the number of significant digits to use when printing. |
signif.stars |
logical indicating if “significance stars”
should be printer, see |
... |
further arguments passed to or from other methods. |
summary.fsdalts()
, the S3 method, simply returns an (S3) object of class "summary.fsdalts"
for which there's a print
method:
print.summary.fsdalts
prints summary statistics for the forward search (FS) regression estimates.
While the function print.fsdalts
prints only the robust estimates
of the coefficients, print.summary.fsdalts
will print also the regression table.
summary.fsdalts
returns an summary.fsdalts
object, whereas the
print
methods returns its first argument via
invisible
, as all print
methods do.
## Not run: data(Animals, package = "MASS") brain <- Animals[c(1:24, 26:25, 27:28),] lbrain <- log(brain) (fs <- fsreg(brain~body, data=lbrain, method="LTS")) summary(fs) ## compare to the result of ltsReg() from 'robustbase' library(robustbase) (lts <- ltsReg(brain~body, data=lbrain)) summary(lts) ## End(Not run)
## Not run: data(Animals, package = "MASS") brain <- Animals[c(1:24, 26:25, 27:28),] lbrain <- log(brain) (fs <- fsreg(brain~body, data=lbrain, method="LTS")) summary(fs) ## compare to the result of ltsReg() from 'robustbase' library(robustbase) (lts <- ltsReg(brain~body, data=lbrain)) summary(lts) ## End(Not run)
summary
method for class "fsr"
.
## S3 method for class 'fsr' summary(object, correlation = FALSE, ...) ## S3 method for class 'summary.fsr' print(x, digits = max(3, getOption("digits") - 3), signif.stars = getOption("show.signif.stars"), ...)
## S3 method for class 'fsr' summary(object, correlation = FALSE, ...) ## S3 method for class 'summary.fsr' print(x, digits = max(3, getOption("digits") - 3), signif.stars = getOption("show.signif.stars"), ...)
object , x
|
an object of class |
correlation |
logical; if |
digits |
the number of significant digits to use when printing. |
signif.stars |
logical indicating if “significance stars”
should be printer, see |
... |
further arguments passed to or from other methods. |
summary.fsr()
, the S3 method, simply returns an (S3) object of class "summary.fsr"
for which there's a print
method:
print.summary.fsr
prints summary statistics for the forward search (FS) regression estimates.
While the function print.fsr
prints only the robust estimates
of the coefficients, print.summary.fsr
will print also the regression table.
summary.fsr
returns an summary.fsr
object, whereas the
print
methods returns its first argument via
invisible
, as all print
methods do.
## Not run: data(Animals, package = "MASS") brain <- Animals[c(1:24, 26:25, 27:28),] lbrain <- log(brain) (fs <- fsreg(brain~body, data=lbrain, method="FS")) summary(fs) ## End(Not run)
## Not run: data(Animals, package = "MASS") brain <- Animals[c(1:24, 26:25, 27:28),] lbrain <- log(brain) (fs <- fsreg(brain~body, data=lbrain, method="FS")) summary(fs) ## End(Not run)
Six variables measured on 100 genuine and 100 counterfeit old (printed before the second world war) Swiss 1000-franc bank notes (Flury and Riedwyl, 1988).
data(swissbanknotes)
data(swissbanknotes)
A data frame with 200 observations on the following 7 variables.
length
Length of bill, mm
left
Width of left edge, mm
right
Width of right edge, mm
bottom
Bottom margin width, mm
top
Top margin width, mm
diagonal
Length of image diagonal, mm
class
1 = genuine, 2 = counterfeit
Flury, B. and Riedwyl, H. (1988). Multivariate Statistics: A practical approach. London: Chapman & Hall.
Weisberg, S. (2005). Applied Linear Regression, 3rd edition. New York: Wiley, Problem 12.5.
library(rrcov) data(swissbanknotes) head(swissbanknotes) plot(CovMcd(swissbanknotes[, 1:6]), which="pairs", col=swissbanknotes$class)
library(rrcov) data(swissbanknotes) head(swissbanknotes) plot(CovMcd(swissbanknotes[, 1:6]), which="pairs", col=swissbanknotes$class)
Six dimensions in millimetres of the heads of 200 twenty year old Swiss soldiers (Flury and Riedwyl, 1988, p. 218 and also Flury, 1997, p. 6).
The data were collected to determine the variability in size and shape of heads of young men in order to help in the design of a new protection mask for the Swiss army.
data(swissheads)
data(swissheads)
A data frame with 200 observations on the following 6 variables.
minimal_frontal_breadth
Minimal frontal breadth, mm
breadth_angulus_mandibulae
Breadth of angulus mandibulae, mm
true_facial_height
True facial height, mm
length_glabella_nasi
Length from glabella to apex nasi, mm
length_tragion_nasion
Length from tragion to nasion, mm
length_tragion_gnathion
Length from tragion to gnathion, mm
Flury, B. and Riedwyl, H. (1988). Multivariate Statistics: A practical approach. London: Chapman & Hall.
Atkinson, A. C., Riani, M. and Cerioli, A. (2004) Exploring multivariate data with the forward search, New York: Springer-Verlag.
library(rrcov) data(swissheads) head(swissheads) plot(CovMcd(swissheads), which="pairs")
library(rrcov) data(swissheads) head(swissheads) plot(CovMcd(swissheads), which="pairs")
tclustfsda
with the option monitoring=TRUE
An object of class tclusteda.object
holds information about
the result of a call to tclustfsda
with the option monitoring=TRUE
.
The functions print()
and summary()
are used to obtain and print a
summary of the results. An object of class tclusteda
is a list containing at least the following components:
call |
the matched call |
k |
number of groups |
alpha |
trimming level |
restrfactor |
restriction factor |
IDX |
an |
MU |
a 3 dimensional array of size k-by-p-by-length(alpha) containing the monitoring
of the centroid for each value of alpha. |
SIGMA |
A list of length |
Amon |
Amon stands for alpha monitoring. Matrix of size
|
## Not run: data(hbk, package="robustbase") (out <- tclustfsda(hbk[, 1:3], k=2, monitoring=TRUE)) class(out) summary(out) ## End(Not run)
## Not run: data(hbk, package="robustbase") (out <- tclustfsda(hbk[, 1:3], k=2, monitoring=TRUE)) class(out) summary(out) ## End(Not run)
Partitions the points in the n-by-v data matrix
Y
into k
clusters. This partition minimizes the trimmed sum,
over all clusters, of the within-cluster sums of point-to-cluster-centroid
distances. Rows of Y correspond to points, columns correspond to variables.
Returns in the output object of class tclustfsda.object
an n-by-1 vector
idx
containing the cluster indices of each point. By default,
tclustfsda()
uses (squared), possibly constrained, Mahalanobis distances.
tclustfsda( x, k, alpha, restrfactor = 12, monitoring = FALSE, plot = FALSE, nsamp, refsteps = 15, reftol = 1e-13, equalweights = FALSE, mixt = 0, msg = FALSE, nocheck = FALSE, startv1 = 1, RandNumbForNini, restrtype = c("eigen", "deter"), UnitsSameGroup, numpool, cleanpool, trace = FALSE, ... )
tclustfsda( x, k, alpha, restrfactor = 12, monitoring = FALSE, plot = FALSE, nsamp, refsteps = 15, reftol = 1e-13, equalweights = FALSE, mixt = 0, msg = FALSE, nocheck = FALSE, startv1 = 1, RandNumbForNini, restrtype = c("eigen", "deter"), UnitsSameGroup, numpool, cleanpool, trace = FALSE, ... )
x |
An n x p data matrix (n observations and p variables). Rows of x represent observations, and columns represent variables. Missing values (NA's) and infinite values (Inf's) are allowed, since observations (rows) with missing or infinite values will automatically be excluded from the computations. |
k |
Number of groups. |
alpha |
A scalar between 0 and 0.5 or an integer specifying the number of
observations which have to be trimmed. If More in detail, if |
restrfactor |
Positive scalar which constrains the allowed differences among group scatters.
Larger values imply larger differences of group scatters. On the other hand
a value of 1 specifies the strongest restriction forcing all
eigenvalues/determinants to be equal and so the method looks
for similarly scattered (respectively spherical) clusters.
The default is to apply |
monitoring |
If |
plot |
If
When
The parameter
If The second plot (gscatter plot) shows a series of subplots which monitor the classification
for each value of
Also in the case of
|
nsamp |
If a scalar, it contains the number of subsamples which will be extracted.
If If
REMARK: If |
refsteps |
Number of refining iterations in each subsample. Default is |
reftol |
Tolerance of the refining steps. The default value is 1e-14 |
equalweights |
A logical specifying wheather cluster weights in the concentration
and assignment steps shall be considered. If |
mixt |
Specifies whether mixture modelling or crisp assignment approach to model
based clustering must be used. In the case of mixture modelling parameter mixt also
controls which is the criterion to find the untrimmed units in each step of the maximization.
If |
msg |
Controls whether to display or not messages on the screen. If |
nocheck |
Check input arguments. If |
startv1 |
How to initialize centroids and covariance matrices. Scalar.
If Remark 1: in order to start with a routine which is in the required parameter space, eigenvalue restrictions are immediately applied. Remark 2 - option |
RandNumbForNini |
pre-extracted random numbers to initialize proportions.
Matrix of size k-by-nrow(nsamp) containing the random numbers which
are used to initialize the proportions of the groups. This option is effective just if
|
restrtype |
Type of restriction to be applied on the cluster scatter matrices.
Valid values are |
UnitsSameGroup |
List of the units which must (whenever possible) have
a particular label. For example |
numpool |
The number of parallel sessions to open. If numpool is not defined, then it is set equal to the number of physical cores in the computer. |
cleanpool |
Logical, indicating if the open pool must be closed or not. It is useful to leave it open if there are subsequent parallel sessions to execute, so that to save the time required to open a new pool. |
trace |
Whether to print intermediate results. Default is |
... |
potential further arguments passed to lower level functions. |
This iterative algorithm initializes k
clusters randomly and performs
concentration steps in order to improve the current cluster assignment. The number of
maximum concentration steps to be performed is given by input parameter refsteps
.
For approximately obtaining the global optimum, the system is initialized nsamp
times and concentration steps are performed until convergence or refsteps
is
reached. When processing more complex data sets higher values of nsamp
and
refsteps
have to be specified (obviously implying extra computation time).
However, if more then 10 per cent of the iterations do not converge, a warning message
is issued, indicating that nsamp
has to be increased.
Depending on the input parameter monitoring
, one of
the following objects will be returned:
FSDA team, [email protected]
Garcia-Escudero, L.A., Gordaliza, A., Matran, C. and Mayo-Iscar, A. (2008). A General Trimming Approach to Robust Cluster Analysis. Annals of Statistics, Vol. 36, 1324-1345. doi:10.1214/07-AOS515.
## Not run: data(hbk, package="robustbase") (out <- tclustfsda(hbk[, 1:3], k=2)) class(out) summary(out) ## TCLUST of Gayser data with three groups (k=3), 10%% trimming (alpha=0.1) ## and restriction factor (c=10000) data(geyser2) (out <- tclustfsda(geyser2, k=3, alpha=0.1, restrfactor=10000)) ## Use the plot options to produce more complex plots ---------- ## Plot with all default options out <- tclustfsda(geyser2, k=3, alpha=0.1, restrfactor=10000, plot=TRUE) ## Default confidence ellipses. out <- tclustfsda(geyser2, k=3, alpha=0.1, restrfactor=10000, plot="ellipse") ## Confidence ellipses specified by the user: confidence ellipses set to 0.5 plots <- list(type="ellipse", conflev=0.5) out <- tclustfsda(geyser2, k=3, alpha=0.1, restrfactor=10000, plot=plots) ## Confidence ellipses set to 0.9 plots <- list(type="ellipse", conflev=0.9) out <- tclustfsda(geyser2, k=3, alpha=0.1, restrfactor=10000, plot=plots) ## Contour plots out <- tclustfsda(geyser2, k=3, alpha=0.1, restrfactor=10000, plot="contour") ## Filled contour plots with additional options: contourf plot with a named colormap. ## Here we define four MATLAB-like colormaps, but the user can define anything else, ## presented by a matrix with three columns which are the RGB triplets. summer <- as.matrix(data.frame(x1=seq(from=0, to=1, length=65), x2=seq(from=0.5, to=1, length=65), x3=rep(0.4, 65))) spring <- as.matrix(data.frame(x1=rep(1, 65), x2=seq(from=0, to=1, length=65), x3=seq(from=1, to=0, length=65))) winter <- as.matrix(data.frame(x1=rep(0, 65), x2=seq(from=0, to=1, length=65), x3=seq(from=1, to=0, length=65))) autumn <- as.matrix(data.frame(x1=rep(1, 65), x2=seq(from=0, to=1, length=65), x3=rep(0, 65))) out <- tclustfsda(geyser2, k=3, alpha=0.1, restrfactor=10000, plot=list(type="contourf", cmap=autumn)) out <- tclustfsda(geyser2, k=3, alpha=0.1, restrfactor=10000, plot=list(type="contourf", cmap=winter)) out <- tclustfsda(geyser2, k=3, alpha=0.1, restrfactor=10000, plot=list(type="contourf", cmap=spring)) out <- tclustfsda(geyser2, k=3, alpha=0.1, restrfactor=10000, plot=list(type="contourf", cmap=summer)) ## We compare the output using three different values of restriction factor ## nsamp is the number of subsamples which will be extracted data(geyser2) out <- tclustfsda(geyser2, k=3, alpha=0.1, restrfactor=10000, nsamp=500, plot="ellipse") out <- tclustfsda(geyser2, k=3, alpha=0.1, restrfactor=10, nsamp=500, refsteps=10, plot="ellipse") out <- tclustfsda(geyser2, k=3, alpha=0.1, restrfactor=1, nsamp=500, refsteps=10, plot="ellipse") ## TCLUST applied to M5 data: A bivariate data set obtained from three normal ## bivariate distributions with different scales and proportions 1:2:2. One of the ## components is very overlapped with another one. A 10 per cent background noise is ## added uniformly distributed in a rectangle containing the three normal components ## and not very overlapped with the three mixture components. A precise description ## of the M5 data set can be found in Garcia-Escudero et al. (2008). ## data(M5data) plot(M5data[, 1:2]) ## Scatter plot matrix library(rrcov) plot(CovClassic(M5data[,1:2]), which="pairs") out <- tclustfsda(M5data[,1:2], k=3, alpha=0, restrfactor=1000, nsamp=100, plot=TRUE) out <- tclustfsda(M5data[,1:2], k=3, alpha=0, restrfactor=10, nsamp=100, plot=TRUE) out <- tclustfsda(M5data[,1:2], k=3, alpha=0.1, restrfactor=1, nsamp=1000, plot=TRUE, equalweights=TRUE) out <- tclustfsda(M5data[,1:2], k=3, alpha=0.1, restrfactor=1000, nsamp=100, plot=TRUE) ## TCLUST with simulated data: 5 groups and 5 variables ## n1 <- 100 n2 <- 80 n3 <- 50 n4 <- 80 n5 <- 70 p <- 5 Y1 <- matrix(rnorm(n1 * p) + 5, ncol=p) Y2 <- matrix(rnorm(n2 * p) + 3, ncol=p) Y3 <- matrix(rnorm(n3 * p) - 2, ncol=p) Y4 <- matrix(rnorm(n4 * p) + 2, ncol=p) Y5 <- matrix(rnorm(n5 * p), ncol=p) group <- c(rep(1, n1), rep(2, n2), rep(3, n3), rep(4, n4), rep(5, n5)) Y <- Y1 Y <- rbind(Y, Y2) Y <- rbind(Y, Y3) Y <- rbind(Y, Y4) Y <- rbind(Y, Y5) dim(Y) table(group) out <- tclustfsda(Y, k=5, alpha=0.05, restrfactor=1.3, refsteps=20, plot=TRUE) ## Automatic choice of best number of groups for Geyser data ------------------------ ## data(geyser2) maxk <- 6 CLACLA <- matrix(0, nrow=maxk, ncol=2) CLACLA[,1] <- 1:maxk MIXCLA <- MIXMIX <- CLACLA for(j in 1:maxk) { out <- tclustfsda(geyser2, k=j, alpha=0.1, restrfactor=5) CLACLA[j, 2] <- out$CLACLA } for(j in 1:maxk) { out <- tclustfsda(geyser2, k=j, alpha=0.1, restrfactor=5, mixt=2) MIXMIX[j ,2] <- out$MIXMIX MIXCLA[j, 2] <- out$MIXCLA } oldpar <- par(mfrow=c(1,3)) plot(CLACLA[,1:2], type="l", xlim=c(1, maxk), xlab="Number of groups", ylab="CLACLA") plot(MIXMIX[,1:2], type="l", xlim=c(1, maxk), xlab="Number of groups", ylab="MIXMIX") plot(MIXCLA[,1:2], type="l", xlim=c(1, maxk), xlab="Number of groups", ylab="MIXCLA") par(oldpar) ## Monitoring examples ------------------------------------------ ## Monitoring using Geyser data ## Monitoring using Geyser data (all default options) ## alpha and restriction factor are not specified therefore alpha=c(0.10, 0.05, 0) ## is used while the restriction factor is set to c=12 out <- tclustfsda(geyser2, k=3, monitoring=TRUE) ## Monitoring using Geyser data with alpha and c specified. out <- tclustfsda(geyser2, k=3, restrfac=100, alpha=seq(0.10, 0, by=-0.01), monitoring=TRUE) ## Monitoring using Geyser data with plot argument specified as a list. ## The trimming levels to consider in this case are 31 values of alpha ## out <- tclustfsda(geyser2, k=3, restrfac=100, alpha=seq(0.30, 0, by=-0.01), monitoring=TRUE, plot=list(alphasel=c(0.2, 0.10, 0.05, 0.01)), trace=TRUE) ## Monitoring using Geyser data with argument UnitsSameGroup ## ## Make sure that group containing unit 10 is in a group which is labelled "group 1" ## and group containing unit 12 is in group which is labelled "group 2" ## ## Mixture model is used (mixt=2) ## out <- tclustfsda(geyser2, k=3, restrfac=100, alpha=seq(0.30, 0, by=-0.01), monitoring=TRUE, mixt=2, UnitsSameGroup=c(10, 12), trace=TRUE) ## Monitoring using M5 data data(M5data) ## alphavec=vector which contains the trimming levels to consider ## in this case 31 values of alpha are considered alphavec <- seq(0.10, 0, by=-0.02) out <- tclustfsda(M5data[, 1:2], 3, alpha=alphavec, restrfac=1000, monitoring=TRUE, nsamp=1000, plots=TRUE) ## End(Not run)
## Not run: data(hbk, package="robustbase") (out <- tclustfsda(hbk[, 1:3], k=2)) class(out) summary(out) ## TCLUST of Gayser data with three groups (k=3), 10%% trimming (alpha=0.1) ## and restriction factor (c=10000) data(geyser2) (out <- tclustfsda(geyser2, k=3, alpha=0.1, restrfactor=10000)) ## Use the plot options to produce more complex plots ---------- ## Plot with all default options out <- tclustfsda(geyser2, k=3, alpha=0.1, restrfactor=10000, plot=TRUE) ## Default confidence ellipses. out <- tclustfsda(geyser2, k=3, alpha=0.1, restrfactor=10000, plot="ellipse") ## Confidence ellipses specified by the user: confidence ellipses set to 0.5 plots <- list(type="ellipse", conflev=0.5) out <- tclustfsda(geyser2, k=3, alpha=0.1, restrfactor=10000, plot=plots) ## Confidence ellipses set to 0.9 plots <- list(type="ellipse", conflev=0.9) out <- tclustfsda(geyser2, k=3, alpha=0.1, restrfactor=10000, plot=plots) ## Contour plots out <- tclustfsda(geyser2, k=3, alpha=0.1, restrfactor=10000, plot="contour") ## Filled contour plots with additional options: contourf plot with a named colormap. ## Here we define four MATLAB-like colormaps, but the user can define anything else, ## presented by a matrix with three columns which are the RGB triplets. summer <- as.matrix(data.frame(x1=seq(from=0, to=1, length=65), x2=seq(from=0.5, to=1, length=65), x3=rep(0.4, 65))) spring <- as.matrix(data.frame(x1=rep(1, 65), x2=seq(from=0, to=1, length=65), x3=seq(from=1, to=0, length=65))) winter <- as.matrix(data.frame(x1=rep(0, 65), x2=seq(from=0, to=1, length=65), x3=seq(from=1, to=0, length=65))) autumn <- as.matrix(data.frame(x1=rep(1, 65), x2=seq(from=0, to=1, length=65), x3=rep(0, 65))) out <- tclustfsda(geyser2, k=3, alpha=0.1, restrfactor=10000, plot=list(type="contourf", cmap=autumn)) out <- tclustfsda(geyser2, k=3, alpha=0.1, restrfactor=10000, plot=list(type="contourf", cmap=winter)) out <- tclustfsda(geyser2, k=3, alpha=0.1, restrfactor=10000, plot=list(type="contourf", cmap=spring)) out <- tclustfsda(geyser2, k=3, alpha=0.1, restrfactor=10000, plot=list(type="contourf", cmap=summer)) ## We compare the output using three different values of restriction factor ## nsamp is the number of subsamples which will be extracted data(geyser2) out <- tclustfsda(geyser2, k=3, alpha=0.1, restrfactor=10000, nsamp=500, plot="ellipse") out <- tclustfsda(geyser2, k=3, alpha=0.1, restrfactor=10, nsamp=500, refsteps=10, plot="ellipse") out <- tclustfsda(geyser2, k=3, alpha=0.1, restrfactor=1, nsamp=500, refsteps=10, plot="ellipse") ## TCLUST applied to M5 data: A bivariate data set obtained from three normal ## bivariate distributions with different scales and proportions 1:2:2. One of the ## components is very overlapped with another one. A 10 per cent background noise is ## added uniformly distributed in a rectangle containing the three normal components ## and not very overlapped with the three mixture components. A precise description ## of the M5 data set can be found in Garcia-Escudero et al. (2008). ## data(M5data) plot(M5data[, 1:2]) ## Scatter plot matrix library(rrcov) plot(CovClassic(M5data[,1:2]), which="pairs") out <- tclustfsda(M5data[,1:2], k=3, alpha=0, restrfactor=1000, nsamp=100, plot=TRUE) out <- tclustfsda(M5data[,1:2], k=3, alpha=0, restrfactor=10, nsamp=100, plot=TRUE) out <- tclustfsda(M5data[,1:2], k=3, alpha=0.1, restrfactor=1, nsamp=1000, plot=TRUE, equalweights=TRUE) out <- tclustfsda(M5data[,1:2], k=3, alpha=0.1, restrfactor=1000, nsamp=100, plot=TRUE) ## TCLUST with simulated data: 5 groups and 5 variables ## n1 <- 100 n2 <- 80 n3 <- 50 n4 <- 80 n5 <- 70 p <- 5 Y1 <- matrix(rnorm(n1 * p) + 5, ncol=p) Y2 <- matrix(rnorm(n2 * p) + 3, ncol=p) Y3 <- matrix(rnorm(n3 * p) - 2, ncol=p) Y4 <- matrix(rnorm(n4 * p) + 2, ncol=p) Y5 <- matrix(rnorm(n5 * p), ncol=p) group <- c(rep(1, n1), rep(2, n2), rep(3, n3), rep(4, n4), rep(5, n5)) Y <- Y1 Y <- rbind(Y, Y2) Y <- rbind(Y, Y3) Y <- rbind(Y, Y4) Y <- rbind(Y, Y5) dim(Y) table(group) out <- tclustfsda(Y, k=5, alpha=0.05, restrfactor=1.3, refsteps=20, plot=TRUE) ## Automatic choice of best number of groups for Geyser data ------------------------ ## data(geyser2) maxk <- 6 CLACLA <- matrix(0, nrow=maxk, ncol=2) CLACLA[,1] <- 1:maxk MIXCLA <- MIXMIX <- CLACLA for(j in 1:maxk) { out <- tclustfsda(geyser2, k=j, alpha=0.1, restrfactor=5) CLACLA[j, 2] <- out$CLACLA } for(j in 1:maxk) { out <- tclustfsda(geyser2, k=j, alpha=0.1, restrfactor=5, mixt=2) MIXMIX[j ,2] <- out$MIXMIX MIXCLA[j, 2] <- out$MIXCLA } oldpar <- par(mfrow=c(1,3)) plot(CLACLA[,1:2], type="l", xlim=c(1, maxk), xlab="Number of groups", ylab="CLACLA") plot(MIXMIX[,1:2], type="l", xlim=c(1, maxk), xlab="Number of groups", ylab="MIXMIX") plot(MIXCLA[,1:2], type="l", xlim=c(1, maxk), xlab="Number of groups", ylab="MIXCLA") par(oldpar) ## Monitoring examples ------------------------------------------ ## Monitoring using Geyser data ## Monitoring using Geyser data (all default options) ## alpha and restriction factor are not specified therefore alpha=c(0.10, 0.05, 0) ## is used while the restriction factor is set to c=12 out <- tclustfsda(geyser2, k=3, monitoring=TRUE) ## Monitoring using Geyser data with alpha and c specified. out <- tclustfsda(geyser2, k=3, restrfac=100, alpha=seq(0.10, 0, by=-0.01), monitoring=TRUE) ## Monitoring using Geyser data with plot argument specified as a list. ## The trimming levels to consider in this case are 31 values of alpha ## out <- tclustfsda(geyser2, k=3, restrfac=100, alpha=seq(0.30, 0, by=-0.01), monitoring=TRUE, plot=list(alphasel=c(0.2, 0.10, 0.05, 0.01)), trace=TRUE) ## Monitoring using Geyser data with argument UnitsSameGroup ## ## Make sure that group containing unit 10 is in a group which is labelled "group 1" ## and group containing unit 12 is in group which is labelled "group 2" ## ## Mixture model is used (mixt=2) ## out <- tclustfsda(geyser2, k=3, restrfac=100, alpha=seq(0.30, 0, by=-0.01), monitoring=TRUE, mixt=2, UnitsSameGroup=c(10, 12), trace=TRUE) ## Monitoring using M5 data data(M5data) ## alphavec=vector which contains the trimming levels to consider ## in this case 31 values of alpha are considered alphavec <- seq(0.10, 0, by=-0.02) out <- tclustfsda(M5data[, 1:2], 3, alpha=alphavec, restrfac=1000, monitoring=TRUE, nsamp=1000, plots=TRUE) ## End(Not run)
tclustfsda
An object of class tclustfsda.object
holds information about
the result of a call to tclustfsda
.
The functions print()
and summary()
are used to obtain and print a
summary of the results. An object of class tclustfsda
is a list containing at least the following components:
call |
the matched call |
muopt |
a k-by-p matrix containing cluster centroid locations. Robust estimate of final centroids of the groups |
sigmaopt |
a p-by-p-by-k array rray containing estimated constrained covariance for the k groups |
idx |
a vector of length n containing assignment of each unit to each of the k groups. Cluster names are integer numbers from 1 to k. 0 indicates trimmed observations. |
size |
a matrix of size (k+1)-by-3. The 1st col is sequence from 0 to k (cluster name); the 2nd col is the number of observations in each cluster; the 3rd col is the percentage of observations in each cluster. Remark: 0 denotes unassigned units. |
postprob |
n-by-k matrix containing posterior probabilities. |
emp |
"Empirical" statistics computed on final classification. When convergence is reached,
|
MIXMIX |
BIC which uses parameters estimated using the mixture loglikelihood and the maximized mixture likelihood as goodness of fit measure. Remark: this output is present just if |
MIXCLA |
BIC which uses parameters estimated using the mixture loglikelihood and the maximized mixture likelihood as goodness of fit measure. Remark: this output is present just if |
CLACLA |
BIC which uses the classification likelihood based on parameters estimated using the classification likelihood. Remark: this output is present just if |
notconver |
number of subsets without convergence |
bs |
a vector of length |
obj |
value of the objective function which is minimized (value of the best returned solution). |
equalweights |
if |
h |
number of observations that have determined the centroids (number of untrimmed units). |
fullsol |
a vector of size |
X |
the original data matrix X. |
## Not run: data(hbk, package="robustbase") (out <- tclustfsda(hbk[, 1:3], k=2)) class(out) summary(out) ## End(Not run)
## Not run: data(hbk, package="robustbase") (out <- tclustfsda(hbk[, 1:3], k=2)) class(out) summary(out) ## End(Not run)
tclustfsda
for different
number of groups k
and restriction factors c
Computes the values of BIC (MIXMIX), ICL (MIXCLA) or CLA (CLACLA),
for different values of k
(number of groups) and different values of c
(restriction factor), for a prespecified level of trimming (the last two letters in the name
stand for 'Information Criterion'). In order to minimize
randomness, given k
, the same subsets are used for each value of c
.
tclustIC( x, kk = 1:5, cc = c(1, 2, 4, 8, 16, 32, 64, 128), alpha = 0, whichIC = c("ALL", "MIXMIX", "MIXCLA", "CLACLA"), nsamp, refsteps = 15, reftol = 1e-14, equalweights = FALSE, msg = TRUE, nocheck = FALSE, plot = FALSE, startv1 = 1, restrtype = c("eigen", "deter"), UnitsSameGroup, numpool, cleanpool, trace = FALSE, ... )
tclustIC( x, kk = 1:5, cc = c(1, 2, 4, 8, 16, 32, 64, 128), alpha = 0, whichIC = c("ALL", "MIXMIX", "MIXCLA", "CLACLA"), nsamp, refsteps = 15, reftol = 1e-14, equalweights = FALSE, msg = TRUE, nocheck = FALSE, plot = FALSE, startv1 = 1, restrtype = c("eigen", "deter"), UnitsSameGroup, numpool, cleanpool, trace = FALSE, ... )
x |
An n x p data matrix (n observations and p variables). Rows of x represent observations, and columns represent variables. Missing values (NA's) and infinite values (Inf's) are allowed, since observations (rows) with missing or infinite values will automatically be excluded from the computations. |
kk |
an integer vector specifying the number of mixture components (clusters) for which the BIC is to be calculated. By default |
cc |
an vector specifying the values of the restriction factor which have to be considered. By default |
alpha |
Global trimming level. A scalar between 0 and 0.5 or an integer specifying the number of
observations which have to be trimmed. If More in detail, if |
whichIC |
A character value which specifies which information criteria must be computed
for each
|
nsamp |
If a scalar, it contains the number of subsamples which will be extracted.
If If
REMARK: If |
refsteps |
Number of refining iterations in each subsample. Default is |
reftol |
Tolerance of the refining steps. The default value is 1e-14 |
equalweights |
A logical specifying wheather cluster weights in the concentration
and assignment steps shall be considered. If |
msg |
Controls whether to display or not messages on the screen If |
nocheck |
Check input arguments. If |
plot |
If |
startv1 |
How to initialize centroids and covariance matrices. Scalar.
If Remark 1: in order to start with a routine which is in the required parameter space, eigenvalue restrictions are immediately applied. Remark 2 - option |
restrtype |
Type of restriction to be applied on the cluster scatter matrices.
Valid values are |
UnitsSameGroup |
List of the units which must (whenever possible) have
a particular label. For example |
numpool |
The number of parallel sessions to open. If numpool is not defined, then it is set equal to the number of physical cores in the computer. |
cleanpool |
Logical, indicating if the open pool must be closed or not. It is useful to leave it open if there are subsequent parallel sessions to execute, so that to save the time required to open a new pool. |
trace |
Whether to print intermediate results. Default is |
... |
potential further arguments passed to lower level functions. |
An S3 object of class tclustic.object
FSDA team, [email protected]
Cerioli, A., Garcia-Escudero, L.A., Mayo-Iscar, A. and Riani M. (2017). Finding the Number of Groups in Model-Based Clustering via Constrained Likelihoods, Journal of Computational and Graphical Statistics, pp. 404-416, https://doi.org/10.1080/10618600.2017.1390469.
tclustfsda
, tclustICplot
, tclustICsol
, carbikeplot
## Not run: data(geyser2) (out <- tclustIC(geyser2, whichIC="MIXMIX", plot=FALSE, alpha=0.1)) summary(out) ## End(Not run) ## Not run: data(flea) Y <- as.matrix(flea[, 1:(ncol(flea)-1)]) # select only the numeric variables rownames(Y) <- 1:nrow(Y) head(Y) (out <- tclustIC(Y, whichIC="CLACLA", plot=FALSE, alpha=0.1, nsamp=100, numpool=1)) summary(out) ## End(Not run)
## Not run: data(geyser2) (out <- tclustIC(geyser2, whichIC="MIXMIX", plot=FALSE, alpha=0.1)) summary(out) ## End(Not run) ## Not run: data(flea) Y <- as.matrix(flea[, 1:(ncol(flea)-1)]) # select only the numeric variables rownames(Y) <- 1:nrow(Y) head(Y) (out <- tclustIC(Y, whichIC="CLACLA", plot=FALSE, alpha=0.1, nsamp=100, numpool=1)) summary(out) ## End(Not run)
tclustIC
An object of class tclustic.object
holds information about
the result of a call to tclustIC
.
The functions print()
and summary()
are used to obtain and print a
summary of the results. An object of class tclustic
is a list containing at least the following components:
call |
the matched call |
kk |
a vector containing the values of |
cc |
a vector containing the values of |
alpha |
trimming level |
whichIC |
Information criteria used |
CLACLA |
a matrix of size |
IDXCLA |
a matrix of lists of size |
MIXMIX |
a matrix of size |
IDXMIX |
a matrix of lists of size |
MIXCLA |
a matrix of size |
## Not run: data(hbk, package="robustbase") (out <- tclustIC(hbk[, 1:3])) class(out) summary(out) ## End(Not run)
## Not run: data(hbk, package="robustbase") (out <- tclustIC(hbk[, 1:3])) class(out) summary(out) ## End(Not run)
c
and k
, based on the solutions obtained by tclustIC
The function tclustICplot()
takes as input an object of class
tclustic.object
, the output
of function tclustIC
(that is a series of matrices which contain
the values of the information criteria BIC/ICL/CLA for different values of k
and c
) and plots them as function of c
or of k
. The plot enables
interaction in the sense that, if option databrush
has been activated, it is
possible to click on a point in the plot and to see the associated classification
in the scatter plot matrix.
tclustICplot( out, whichIC = c("ALL", "MIXMIX", "MIXCLA", "CLACLA"), tag, datatooltip, databrush, nameY, trace = FALSE, ... )
tclustICplot( out, whichIC = c("ALL", "MIXMIX", "MIXCLA", "CLACLA"), tag, datatooltip, databrush, nameY, trace = FALSE, ... )
out |
An S3 object of class |
whichIC |
Specifies the information criterion to use for the plot.
See |
tag |
plot handle. String which identifies the handle of the plot which is about to be created. The default is to use tag 'pl_IC'. Notice that if the program finds a plot which has a tag equal to the one specified by the user, then the output of the new plot overwrites the existing one in the same window else a new window is created. |
datatooltip |
Interactive clicking. It is inactive if this parameter is set to FALSE.
If
If datatooltip is a list it may contain the following fields:
|
databrush |
Interactive mouse brushing. If databrush is missing or empty (default), no brushing is done.
The activation of this option (databrush is If If
|
nameY |
Add variable labels in plot. A vector of strings of length |
trace |
Whether to print intermediate results. Default is |
... |
potential further arguments passed to lower level functions. |
FSDA team, [email protected]
Cerioli, A., Garcia-Escudero, L.A., Mayo-Iscar, A. and Riani M. (2017). Finding the Number of Groups in Model-Based Clustering via Constrained Likelihoods, Journal of Computational and Graphical Statistics, pp. 404-416, https://doi.org/10.1080/10618600.2017.1390469.
Hubert L. and Arabie P. (1985), Comparing Partitions, Journal of Classification, Vol. 2, pp. 193-218.
## Not run: data(geyser2) out <- tclustIC(geyser2, whichIC="MIXMIX", plot=FALSE, alpha=0.1) tclustICplot(out, whichIC="MIXMIX") ## End(Not run)
## Not run: data(geyser2) out <- tclustIC(geyser2, whichIC="MIXMIX", plot=FALSE, alpha=0.1) tclustICplot(out, whichIC="MIXMIX") ## End(Not run)
tclustIC
The function tclustICsol()
takes as input an object of class
tclustic.object
, the output
of function tclustIC
(that is a series of matrices which contain
the values of the information criteria BIC/ICL/CLA for different values of k
and c
) and extracts the first best solutions. Two solutions are considered
equivalent if the value of the adjusted Rand index (or the adjusted Fowlkes and
Mallows index) is above a certain threshold. For each tentative solution the program
checks the adjacent values of c
for which the solution is stable.
A matrix with adjusted Rand indexes is given for the extracted solutions.
tclustICsol( out, NumberOfBestSolutions = 5, ThreshRandIndex = 0.7, whichIC = c("ALL", "CLACLA", "MIXMIX", "MIXCLA"), Rand = TRUE, msg = TRUE, plot = FALSE, trace = FALSE, ... )
tclustICsol( out, NumberOfBestSolutions = 5, ThreshRandIndex = 0.7, whichIC = c("ALL", "CLACLA", "MIXMIX", "MIXCLA"), Rand = TRUE, msg = TRUE, plot = FALSE, trace = FALSE, ... )
out |
An S3 object of class |
NumberOfBestSolutions |
Number of best solutions to extract from BIC/ICL matrix. The default value of NumberOfBestSolutions is 5 |
ThreshRandIndex |
Threshold to identify spurious solutions - the threshold of the adjusted Rand index to use to consider two solutions as equivalent. The default value of ThreshRandIndex is 0.7 |
whichIC |
Specifies the information criterion to use to extract best solutions. Possible values for whichIC are:
The default value is |
Rand |
Index to use to compare partitions. If |
msg |
It controls whether to display or not messages (from MATLAB) on the screen. If |
plot |
If |
trace |
Whether to print intermediate results. Default is |
... |
potential further arguments passed to lower level functions. |
An S3 object of class tclusticsol.object
FSDA team, [email protected]
Cerioli, A., Garcia-Escudero, L.A., Mayo-Iscar, A. and Riani M. (2017). Finding the Number of Groups in Model-Based Clustering via Constrained Likelihoods, Journal of Computational and Graphical Statistics, pp. 404-416, https://doi.org/10.1080/10618600.2017.1390469.
Hubert L. and Arabie P. (1985), Comparing Partitions, Journal of Classification, Vol. 2, pp. 193-218.
tclustIC
, tclustfsda
, carbikeplot
## Not run: data(geyser2) out <- tclustIC(geyser2, whichIC="MIXMIX", plot=FALSE, alpha=0.1) ## Plot first two best solutions using as Information criterion MIXMIX print("Best solutions using MIXMIX") outMIXMIX <- tclustICsol(out, whichIC="MIXMIX", plot=TRUE, NumberOfBestSolutions=2) print(outMIXMIX$MIXMIXbs) ## End(Not run)
## Not run: data(geyser2) out <- tclustIC(geyser2, whichIC="MIXMIX", plot=FALSE, alpha=0.1) ## Plot first two best solutions using as Information criterion MIXMIX print("Best solutions using MIXMIX") outMIXMIX <- tclustICsol(out, whichIC="MIXMIX", plot=TRUE, NumberOfBestSolutions=2) print(outMIXMIX$MIXMIXbs) ## End(Not run)
tclustICsol
An object of class tclusticsol.object
holds information about
the result of a call to tclustICsol
.
The functions print()
and summary()
are used to obtain and print a
summary of the results. An object of class tclusticsol
is a list containing at least the following components:
call |
the matched call |
kk |
a vector containing the values of |
cc |
a vector containing the values of |
alpha |
trimming level |
whichIC |
Information criteria used |
MIXMIXbs |
a matrix of lists of size
Remark: the field MIXMIXbs is present only if |
MIXMIXbsari |
a matrix of adjusted Rand indexes (or Fowlkes and Mallows indexes)
associated with the best solutions for MIXMIX. A matrix of size Remark: the field |
ARIMIX |
a matrix of adjusted Rand indexes between two consecutive value of Remark: the field |
MIXCLAbs |
has the same structure as Remark: the field MIXCLAbs is present only if |
MIXCLAbsari |
has the same structure as Remark: the field |
CLACLAbs |
has the same structure as Remark: the field CLACLAbs is present only if |
CLACLAbsari |
has the same structure as Remark: the field |
ARICLA |
a matrix of adjusted Rand indexes between two consecutive value of Remark: the field |
## Not run: data(hbk, package="robustbase") (out <- tclustIC(hbk[, 1:3])) ## Plot first two best solutions using as Information criterion MIXMIX print("Best solutions using MIXMIX") outMIXMIX <- tclustICsol(out, whichIC="MIXMIX", plot=TRUE, NumberOfBestSolutions=2) class(outMIXMIX) summary(outMIXMIX) print(outMIXMIX$MIXMIXbs) ## End(Not run)
## Not run: data(hbk, package="robustbase") (out <- tclustIC(hbk[, 1:3])) ## Plot first two best solutions using as Information criterion MIXMIX print("Best solutions using MIXMIX") outMIXMIX <- tclustICsol(out, whichIC="MIXMIX", plot=TRUE, NumberOfBestSolutions=2) class(outMIXMIX) summary(outMIXMIX) print(outMIXMIX$MIXMIXbs) ## End(Not run)
Performs robust linear grouping analysis.
tclustreg( y, x, k, alphaLik, alphaX, restrfactor = 12, intercept = TRUE, plot = FALSE, nsamp, refsteps = 10, reftol = 1e-13, equalweights = FALSE, mixt = 0, wtrim = 0, we, msg = TRUE, RandNumbForNini, trace = FALSE, ... )
tclustreg( y, x, k, alphaLik, alphaX, restrfactor = 12, intercept = TRUE, plot = FALSE, nsamp, refsteps = 10, reftol = 1e-13, equalweights = FALSE, mixt = 0, wtrim = 0, we, msg = TRUE, RandNumbForNini, trace = FALSE, ... )
y |
Response variable. A vector with |
x |
An n x p data matrix (n observations and p variables). Rows of x represent observations, and columns represent variables. Missing values (NA's) and infinite values (Inf's) are allowed, since observations (rows) with missing or infinite values will automatically be excluded from the computations. |
k |
Number of groups. |
alphaLik |
Trimming level, a scalar between 0 and 0.5 or an
integer specifying the number of observations which have to be trimmed.
If |
alphaX |
Second-level trimming or constrained weighted model for |
restrfactor |
Restriction factor for regression residuals and
covariance matrices of the explanatory variables. Scalar or vector
with two elements. If |
intercept |
wheather to use constant term (default is |
plot |
If |
nsamp |
If a scalar, it contains the number of subsamples which will be extracted.
If |
refsteps |
Number of refining iterations in each subsample. Default is |
reftol |
Tolerance of the refining steps. The default value is 1e-14 |
equalweights |
A logical specifying wheather cluster weights in the concentration
and assignment steps shall be considered. If |
mixt |
Specifies whether mixture modelling or crisp assignment approach to model
based clustering must be used. In the case of mixture modelling parameter mixt also
controls which is the criterion to find the untrimmed units in each step of the maximization.
If |
wtrim |
How to apply the weights on the observations - a flag taking values in c(0, 1, 2, 3, 4).
|
we |
Weights. A vector of size n-by-1 containing application-specific weights Default is a vector of ones. |
msg |
Controls whether to display or not messages on the screen If |
RandNumbForNini |
pre-extracted random numbers to initialize proportions.
Matrix of size k-by-nrow(nsamp) containing the random numbers which
are used to initialize the proportions of the groups. This option is effective only if
|
trace |
Whether to print intermediate results. Default is |
... |
potential further arguments passed to lower level functions. |
An S3 object of class tclustreg.object
FSDA team, [email protected]
Mayo-Iscar A. (2016). The joint role of trimming and constraints in robust estimation for mixtures of gaussian factor analyzers, Computational Statistics and Data Analysis", Vol. 99, pp. 131-147.
Garcia-Escudero, L.A., Gordaliza, A., Greselin, F., Ingrassia, S. and Mayo-Iscar, A. (2017), Robust estimation of mixtures of regressions with random covariates, via trimming and constraints, Statistics and Computing, Vol. 27, pp. 377-402.
Garcia-Escudero, L.A., Gordaliza A., Mayo-Iscar A., and San Martin R. (2010). Robust clusterwise linear regression through trimming, Computational Statistics and Data Analysis, Vol. 54, pp.3057-3069.
Cerioli, A. and Perrotta, D. (2014). Robust Clustering Around Regression Lines with High Density Regions. Advances in Data Analysis and Classification, Vol. 8, pp. 5-26.
Torti F., Perrotta D., Riani, M. and Cerioli A. (2019). Assessing Robust Methodologies for Clustering Linear Regression Data, Advances in Data Analysis and Classification, Vol. 13, pp 227-257.
## Not run: ## The X data have been introduced by Gordaliza, Garcia-Escudero & Mayo-Iscar (2013). ## The dataset presents two parallel components without contamination. data(X) y1 = X[, ncol(X)] X1 = X[,-ncol(X), drop=FALSE] (out <- tclustreg(y1, X1, k=2, alphaLik=0.05, alphaX=0.01, restrfactor=5, plot=TRUE, trace=TRUE)) (out <- tclustreg(y1, X1, k=2, alphaLik=0.05, alphaX=0.01, restrfactor=2, mixt=2, plot=TRUE, trace=TRUE)) ## Examples with fishery data data(fishery) X <- fishery ## some jittering is necessary because duplicated units are not treated: ## this needs to be addressed X <- X + 10^(-8) * abs(matrix(rnorm(nrow(X)*ncol(X)), ncol=2)) y1 <- X[, ncol(X)] X1 <- X[, -ncol(X), drop=FALSE] (out <- tclustreg(y1, X1, k=3, restrfact=50, alphaLik=0.04, alphaX=0.01, trace=TRUE)) ## Example 2: ## Define some arbitrary weightssome arbitrary weights for the units we <- sqrt(X1)/sum(sqrt(X1)) ## tclustreg required parameters k <- 2; restrfact <- 50; alpha1 <- 0.04; alpha2 <- 0.01 ## Now tclust is run on each combination of mixt and wtrim options cat("\nmixt=0; wtrim=0", "\nStandard tclustreg, with classification likelihood and without thinning\n") (out <- tclustreg(y1, X1, k=k, restrfact=restrfact, alphaLik=alpha1, alphaX=alpha2, mixt=0, wtrim=0, trace=TRUE)) cat("\nmixt=2; wtrim=0", "\nMixture likelihood, no thinning\n") (out <- tclustreg(y1, X1, k=k, restrfact=restrfact, alphaLik=alpha1, alphaX=alpha2, mixt=2, wtrim=0, trace=TRUE)) cat("\nmixt=0; wtrim=1", "\nClassification likelihood, thinning based on user weights\n") (out <- tclustreg(y1, X1, k=k, restrfact=restrfact, alphaLik=alpha1, alphaX=alpha2, mixt=0, we=we, wtrim=1, trace=TRUE)) cat("\nmixt=2; wtrim=1", "\nMixture likelihood, thinning based on user weights\n") (out <- tclustreg(y1, X1, k=k, restrfact=restrfact, alphaLik=alpha1, alphaX=alpha2, mixt=2, we=we, wtrim=1, trace=TRUE)) cat("\nmixt=0; wtrim=2", "\nClassification likelihood, thinning based on retention probabilities\n") (out <- tclustreg(y1, X1, k=k, restrfact=restrfact, alphaLik=alpha1, alphaX=alpha2, mixt=0, wtrim=2, trace=TRUE)) cat("\nmixt=2; wtrim=2", "\nMixture likelihood, thinning based on retention probabilities\n") (out <- tclustreg(y1, X1, k=k, restrfact=restrfact, alphaLik=alpha1, alphaX=alpha2, mixt=2, wtrim=2, trace=TRUE)) cat("\nmixt=0; wtrim=3", "\nClassification likelihood, thinning based on bernoulli weights\n") (out <- tclustreg(y1, X1, k=k, restrfact=restrfact, alphaLik=alpha1, alphaX=alpha2, mixt=0, wtrim=3, trace=TRUE)) cat("\nmixt=2; wtrim=3", "\nMixture likelihood, thinning based on bernoulli weights\n") (out <- tclustreg(y1, X1, k=k, restrfact=restrfact, alphaLik=alpha1, alphaX=alpha2, mixt=2, wtrim=3, trace=TRUE)) cat("\nmixt=0; wtrim=4", "\nClassification likelihood, tandem thinning based on bernoulli weights\n") (out <- tclustreg(y1, X1, k=k, restrfact=restrfact, alphaLik=alpha1, alphaX=alpha2, mixt=0, wtrim=4, trace=TRUE)) cat("\nmixt=2; wtrim=4", "\nMixture likelihood, tandem thinning based on bernoulli weights\n") (out <- tclustreg(y1, X1, k=k, restrfact=restrfact, alphaLik=alpha1, alphaX=alpha2, mixt=2, wtrim=4, trace=TRUE)) ## End(Not run)
## Not run: ## The X data have been introduced by Gordaliza, Garcia-Escudero & Mayo-Iscar (2013). ## The dataset presents two parallel components without contamination. data(X) y1 = X[, ncol(X)] X1 = X[,-ncol(X), drop=FALSE] (out <- tclustreg(y1, X1, k=2, alphaLik=0.05, alphaX=0.01, restrfactor=5, plot=TRUE, trace=TRUE)) (out <- tclustreg(y1, X1, k=2, alphaLik=0.05, alphaX=0.01, restrfactor=2, mixt=2, plot=TRUE, trace=TRUE)) ## Examples with fishery data data(fishery) X <- fishery ## some jittering is necessary because duplicated units are not treated: ## this needs to be addressed X <- X + 10^(-8) * abs(matrix(rnorm(nrow(X)*ncol(X)), ncol=2)) y1 <- X[, ncol(X)] X1 <- X[, -ncol(X), drop=FALSE] (out <- tclustreg(y1, X1, k=3, restrfact=50, alphaLik=0.04, alphaX=0.01, trace=TRUE)) ## Example 2: ## Define some arbitrary weightssome arbitrary weights for the units we <- sqrt(X1)/sum(sqrt(X1)) ## tclustreg required parameters k <- 2; restrfact <- 50; alpha1 <- 0.04; alpha2 <- 0.01 ## Now tclust is run on each combination of mixt and wtrim options cat("\nmixt=0; wtrim=0", "\nStandard tclustreg, with classification likelihood and without thinning\n") (out <- tclustreg(y1, X1, k=k, restrfact=restrfact, alphaLik=alpha1, alphaX=alpha2, mixt=0, wtrim=0, trace=TRUE)) cat("\nmixt=2; wtrim=0", "\nMixture likelihood, no thinning\n") (out <- tclustreg(y1, X1, k=k, restrfact=restrfact, alphaLik=alpha1, alphaX=alpha2, mixt=2, wtrim=0, trace=TRUE)) cat("\nmixt=0; wtrim=1", "\nClassification likelihood, thinning based on user weights\n") (out <- tclustreg(y1, X1, k=k, restrfact=restrfact, alphaLik=alpha1, alphaX=alpha2, mixt=0, we=we, wtrim=1, trace=TRUE)) cat("\nmixt=2; wtrim=1", "\nMixture likelihood, thinning based on user weights\n") (out <- tclustreg(y1, X1, k=k, restrfact=restrfact, alphaLik=alpha1, alphaX=alpha2, mixt=2, we=we, wtrim=1, trace=TRUE)) cat("\nmixt=0; wtrim=2", "\nClassification likelihood, thinning based on retention probabilities\n") (out <- tclustreg(y1, X1, k=k, restrfact=restrfact, alphaLik=alpha1, alphaX=alpha2, mixt=0, wtrim=2, trace=TRUE)) cat("\nmixt=2; wtrim=2", "\nMixture likelihood, thinning based on retention probabilities\n") (out <- tclustreg(y1, X1, k=k, restrfact=restrfact, alphaLik=alpha1, alphaX=alpha2, mixt=2, wtrim=2, trace=TRUE)) cat("\nmixt=0; wtrim=3", "\nClassification likelihood, thinning based on bernoulli weights\n") (out <- tclustreg(y1, X1, k=k, restrfact=restrfact, alphaLik=alpha1, alphaX=alpha2, mixt=0, wtrim=3, trace=TRUE)) cat("\nmixt=2; wtrim=3", "\nMixture likelihood, thinning based on bernoulli weights\n") (out <- tclustreg(y1, X1, k=k, restrfact=restrfact, alphaLik=alpha1, alphaX=alpha2, mixt=2, wtrim=3, trace=TRUE)) cat("\nmixt=0; wtrim=4", "\nClassification likelihood, tandem thinning based on bernoulli weights\n") (out <- tclustreg(y1, X1, k=k, restrfact=restrfact, alphaLik=alpha1, alphaX=alpha2, mixt=0, wtrim=4, trace=TRUE)) cat("\nmixt=2; wtrim=4", "\nMixture likelihood, tandem thinning based on bernoulli weights\n") (out <- tclustreg(y1, X1, k=k, restrfact=restrfact, alphaLik=alpha1, alphaX=alpha2, mixt=2, wtrim=4, trace=TRUE)) ## End(Not run)
tclustreg
An object of class tclustreg.object
holds information about
the result of a call to tclustreg
.
The functions print()
and summary()
are used to obtain and print a
summary of the results. An object of class tclustreg
is a list containing at least the following components:
call |
the matched call |
## Not run: ## The X data have been introduced by Gordaliza, Garcia-Escudero & Mayo-Iscar (2013). ## The dataset presents two parallel components without contamination. data(X) y1 = X[, ncol(X)] X1 = X[,-ncol(X), drop=FALSE] out <- tclustreg(y1, X1, k=2, alphaLik=0.05, alphaX=0.01, restrfactor=5, trace=TRUE) class(out) str(out) ## End(Not run)
## Not run: ## The X data have been introduced by Gordaliza, Garcia-Escudero & Mayo-Iscar (2013). ## The dataset presents two parallel components without contamination. data(X) y1 = X[, ncol(X)] X1 = X[,-ncol(X), drop=FALSE] out <- tclustreg(y1, X1, k=2, alphaLik=0.05, alphaX=0.01, restrfactor=5, trace=TRUE) class(out) str(out) ## End(Not run)
tclustreg
for different number of groups k
and restriction factors c
.(the last two letters stand for 'Information Criterion') computes
the values of BIC (MIXMIX), ICL (MIXCLA) or CLA (CLACLA), for different values
of k
(number of groups) and different values of c
(restriction factor for the variances of the residuals), for
a prespecified level of trimming. In order to minimize randomness, given k
,
the same subsets are used for each value of c
.
tclustregIC( y, x, alphaLik, alphaX, intercept = TRUE, plot = FALSE, nsamp, refsteps = 10, reftol = 1e-13, equalweights = FALSE, wtrim = 0, we, msg = TRUE, RandNumbForNini, trace = FALSE, ... )
tclustregIC( y, x, alphaLik, alphaX, intercept = TRUE, plot = FALSE, nsamp, refsteps = 10, reftol = 1e-13, equalweights = FALSE, wtrim = 0, we, msg = TRUE, RandNumbForNini, trace = FALSE, ... )
y |
Response variable. A vector with |
x |
An n x p data matrix (n observations and p variables). Rows of x represent observations, and columns represent variables. Missing values (NA's) and infinite values (Inf's) are allowed, since observations (rows) with missing or infinite values will automatically be excluded from the computations. |
alphaLik |
Trimming level, a scalar between 0 and 0.5 or an
integer specifying the number of observations which have to be trimmed.
If |
alphaX |
Second-level trimming or constrained weighted model for |
intercept |
wheather to use constant term (default is |
plot |
If |
nsamp |
If a scalar, it contains the number of subsamples which will be extracted.
If |
refsteps |
Number of refining iterations in each subsample. Default is |
reftol |
Tolerance of the refining steps. The default value is 1e-14 |
equalweights |
A logical specifying wheather cluster weights in the concentration
and assignment steps shall be considered. If |
wtrim |
How to apply the weights on the observations - a flag taking values in c(0, 1, 2, 3, 4).
|
we |
Weights. A vector of size n-by-1 containing application-specific weights Default is a vector of ones. |
msg |
Controls whether to display or not messages on the screen If |
RandNumbForNini |
pre-extracted random numbers to initialize proportions.
Matrix of size k-by-nrow(nsamp) containing the random numbers which
are used to initialize the proportions of the groups. This option is effective only if
|
trace |
Whether to print intermediate results. Default is |
... |
potential further arguments passed to lower level functions. |
An S3 object of class tclustreg.object
FSDA team, [email protected]
Torti F., Perrotta D., Riani, M. and Cerioli A. (2019). Assessing Robust Methodologies for Clustering Linear Regression Data, Advances in Data Analysis and Classification, Vol. 13, pp 227-257.
The wool data give the number of cycles to failure of a worsted yarn under cycles of repeated loading. The variables are: length of test specimen; amplitude of loading cycle; load
data(wool)
data(wool)
A data frame with 27 rows and 4 variables
The X dataset has been simulated by Gordaliza, Garcia-Escudero and Mayo-Iscar during the Workshop ADVANCES IN ROBUST DATA ANALYSIS AND CLUSTERING held in Ispra on October 21st-25th 2013. It is a bivariate dataset of 200 observations. It presents two parallel components without contamination.
data(X)
data(X)
A data frame with 200 rows and 2 variables
Simulated data to test tclustIC() and tclustICsol(), carbike() functions
data(z1)
data(z1)
A data frame with 150 rows and 2 variables. The variables are as follows:
X1
X2
Maitra, R. and Melnykov, V. (2010), Simulating data to study performance of finite mixture modeling and clustering algorithms, The Journal of Computational and Graphical Statistics, Vol. 19, pp. 354-376.
data(z1) head(z1) ## Not run: (out <- tclustIC(z1, plots=FALSE, whichIC="CLACLA")) (outCLACLA <- tclustICsol(out, whichIC="CLACLA", plot=FALSE)) carbikeplot(outCLACLA) ## End(Not run)
data(z1) head(z1) ## Not run: (out <- tclustIC(z1, plots=FALSE, whichIC="CLACLA")) (outCLACLA <- tclustICsol(out, whichIC="CLACLA", plot=FALSE)) carbikeplot(outCLACLA) ## End(Not run)