Title: | Computed ABC Analysis |
---|---|
Description: | For a given data set, the package provides a novel method of computing precise limits to acquire subsets which are easily interpreted. Closely related to the Lorenz curve, the ABC curve visualizes the data by graphically representing the cumulative distribution function. Based on an ABC analysis, the algorithm calculates the optimal limits with the help of the ABC curve by exploiting the mathematical properties of the distribution of the analyzed items. The data, which must contain positive values, are divided into three disjoint subsets A, B and C: subset A comprises the very profitable values, i.e. the largest data values ("the important few"); subset B comprises values where the yield equals the effort required to obtain it; and subset C comprises the non-profitable values, i.e. the smallest data values ("the trivial many"). The package is based on "Computed ABC Analysis for rational Selection of most informative Variables in multivariate Data", PLoS One. Ultsch, A., Lotsch, J. (2015) <DOI:10.1371/journal.pone.0129767>. |
Authors: | Michael Thrun, Jorn Lotsch, Alfred Ultsch |
Maintainer: | Florian Lerch <[email protected]> |
License: | GPL-3 |
Version: | 1.2.1 |
Built: | 2024-10-28 06:38:06 UTC |
Source: | CRAN |
Computed ABC Analysis allows the optimal calculation of three disjoint subsets A, B and C in data sets containing positive values:
subset A contains the few most profitable values, i.e. the largest data values ("the important few"); subset B contains data where the profit gain equals the effort required to obtain it; and subset C contains the non-profitable values, i.e. the smallest data values ("the trivial many").
This package calculates the three subsets by means of an algorithm based on statistically valid definitions of the thresholds between A, B and C.
Check out our new Umatrix package for visualisation and clustering of high-dimensional data on our Webpage.
Michael Thrun, Jorn Lotsch, Alfred Ultsch
http://www.uni-marburg.de/fb12/datenbionik
Ultsch, A., Lotsch, J.: Computed ABC Analysis for Rational Selection of Most Informative Variables in Multivariate Data, PLoS ONE, Vol. 10(6), e0129767, doi:10.1371/journal.pone.0129767, 2015.
data("SwissInhabitants")
abc = ABCanalysis(SwissInhabitants, PlotIt = TRUE)
SetA = SwissInhabitants[abc$Aind]
SetB = SwissInhabitants[abc$Bind]
SetC = SwissInhabitants[abc$Cind]
Divides the data into three classes A, B and C such that:
A = Data[Aind]: much yield with low effort
B = Data[Bind]: yield and effort are about equal
C = Data[Cind]: low yield with much effort
ABCanalysis(Data,ABCcurvedata,PlotIt=FALSE)
Data |
vector [1:n], an array of data: n cases in rows of one variable; if a matrix or data frame is given, its first column is used. |
ABCcurvedata |
only for internal usage, list from ABCcurve |
PlotIt |
optional, default FALSE; if the argument is set (to an arbitrary value), a plot is made |
Pareto point: the point with minimum distance to (0,1), i.e. minimal unrealized potential.
Break-even point: B_x is the x value of the point where the slope of the ABC curve equals one.
For a further description of p, as referred to by the variable AlimitIndInInterpolation, see ABCcurve.
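The two points described above can be illustrated on a hand-built curve. The following is a minimal sketch in base R, not the package's implementation: the toy data are made up, the curve is left uninterpolated, and the slope is approximated by finite differences between neighbouring points.

```r
# Toy data: a few large values and many small ones (made up for illustration).
x <- c(100, 60, 30, 10, 5, 3, 2, 1, 1, 1)

# Effort: cumulative fraction of items, largest first.
# Yield: cumulative fraction of the total sum covered by those items.
sorted <- sort(x, decreasing = TRUE)
effort <- c(0, seq_along(sorted) / length(sorted))
yield  <- c(0, cumsum(sorted) / sum(sorted))

# Pareto point: the curve point closest to the ideal corner (0, 1).
dist_to_ideal <- sqrt(effort^2 + (1 - yield)^2)
pareto_idx <- which.min(dist_to_ideal)

# Break-even point: where the slope of the curve is closest to one.
# Assumption: a finite-difference slope stands in for the true derivative.
slope <- diff(yield) / diff(effort)
breakeven_idx <- which.min(abs(slope - 1)) + 1

pareto    <- c(effort[pareto_idx], yield[pareto_idx])
breakeven <- c(effort[breakeven_idx], yield[breakeven_idx])
```

On real data the curve would normally be interpolated first (see ABCcurve), so the indices found here are only as fine as the number of data points.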
The output is a list whose parts are described below:
Aind |
vector [1:j], A == Data[Aind]: much yield with little effort |
Bind |
vector [1:l], B == Data[Bind]: effort and yield are balanced |
Cind |
vector [1:m], C == Data[Cind]: little yield for much effort |
ABexchanged |
Boolean, TRUE if Point A is the Break Even and point B is the Pareto Point, FALSE otherwise |
A |
c(Ax,Ay), Pareto point or BreakEven Point indicated by ABexchanged |
B |
c(Bx,By), Pareto point or BreakEven Point indicated by ABexchanged |
C |
Submarginal point: minimum distance to |
smallestAData |
Boundary AB, defined by point A or B with ABexchanged |
smallestBData |
Boundary BC, defined by point C |
AlimitIndInInterpolation |
index of AB Boundary in [ |
BlimitIndInInterpolation |
index of BC Boundary in [ |
Michael Thrun
http://www.uni-marburg.de/fb12/datenbionik
Ultsch, A., Lotsch, J.: Computed ABC Analysis for Rational Selection of Most Informative Variables in Multivariate Data, PLoS ONE, Vol. 10(6), e0129767, doi:10.1371/journal.pone.0129767, 2015.
data("SwissInhabitants")
abc = ABCanalysis(SwissInhabitants, PlotIt = TRUE)
A = abc$Aind
B = abc$Bind
C = abc$Cind
Agroup = SwissInhabitants[A]
Bgroup = SwissInhabitants[B]
Cgroup = SwissInhabitants[C]
Calculates the points A, B and C of the ABC analysis from a given curve.
p[1:m] |
a vector of values specifying where interpolation took place |
ABC[1:m] |
given values of the curve at positions from p |
Returns a list with the following components (where ABx = Effort[AB], ABy = Yield[AB], BCx = Effort[BC], BCy = Yield[BC], Bx = Effort[B], By = Yield[B]):
BreakEvenPunktIndex |
Index of breakeven point |
ParetoPunktIndex |
Index of pareto point |
SubmarginalPunktIndex |
Index of submarginal point |
ABx |
Position of AB point on x axis |
ABy |
Position of AB point on y axis |
BCx |
Position of BC point on x axis |
BCy |
Position of BC point on y axis |
Bx |
Position of the unused point (breakeven or pareto) on the x axis |
By |
Position of the unused point (breakeven or pareto) on the y axis |
Florian Lerch
Displays the ABC curve: cumulative percentage of the largest data (effort) vs. cumulative percentage of the sum of the largest data (yield), with the set limits generated by a computed ABC analysis.
ABCanalysisPlot(Data, LineType = 0, LineWidth = 3, ShowUniform = TRUE,title, limits = TRUE, MarkPoints = TRUE, ABCcurvedata,ResetPlotDefaults=TRUE)
Data |
vector[1:n] describes an array of data: n cases in rows of one variable |
LineType |
integer, optional, for plot default: LineType=0 for solid line; for other line codes see documentation about pch |
LineWidth |
integer, optional, width of Line, see |
ShowUniform |
boolean, optional, the ABC curve of the uniform distribution is shown in plot if TRUE (default) |
title |
string, optional, see parameter |
limits |
boolean, optional; if TRUE (the default in the usage above), the lines dividing A, B and C are drawn |
MarkPoints |
boolean, optional, default= TRUE, Mark the three points of interest |
ABCcurvedata |
optional, see ABCcurve |
ResetPlotDefaults |
optional, default TRUE. If ResetPlotDefaults = FALSE, multiple plots can be drawn in one window, but the plot parameters are not reset to their defaults. |
The returned object is a list with the items:
ABC |
Output of ABCplot |
ABCanalysis |
Output of ABCanalysis |
The Break Even point is always marked with a green star.
The diagonal from (0,1) to (1,0) is the equilibrium, where effort equals yield.
Michael Thrun
http://www.uni-marburg.de/fb12/datenbionik
## Standard Example
data("SwissInhabitants")
abc = ABCanalysisPlot(SwissInhabitants)

## Multiple plots in one Window:
m = runif(4, 100, 200)
s = runif(4, 1, 10)
Data = sapply(1:4, FUN = function(x, m, s) rnorm(1000, m, s), m, s)
# windows() # screen devices should not be used in examples etc
par(mfrow = c(2, 2))
for (i in 1:4) {
  ABCanalysisPlot(Data[, i], ResetPlotDefaults = FALSE)
}
Only the first column of Data is used; anything that is not a positive numerical value is set to zero.
ABCcleanData(Data)
Data |
vector[1:n] describes an array of data: n cases in rows of one variable |
Values below zero are set to zero. Non-numeric values (NA, NaN, etc.) in Data are set to zero. Strings and characters are set to zero. Infinite numbers are set to max(Data).
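The cleaning rules above can be sketched in a few lines of base R. This is an illustration, not the package's code: `clean_sketch` is a hypothetical helper, and it assumes that "max(Data)" means the largest finite value (a plain maximum would itself be infinite) and that negative infinity is treated like any other negative value.

```r
# Hypothetical helper mirroring the rules listed above; not the package's code.
clean_sketch <- function(data) {
  # Strings and other non-numeric input become NA here and zero further down.
  values <- suppressWarnings(as.numeric(as.character(data)))
  # Assumption: positive infinity is replaced by the largest finite value.
  finite_max <- max(values[is.finite(values)], 0)
  values[is.infinite(values) & values > 0] <- finite_max
  # NA, NaN and negative values (including -Inf) are set to zero.
  values[is.na(values) | values < 0] <- 0
  values
}

cleaned <- clean_sketch(c(5, -2, NA, NaN, Inf, "abc", 3))
```

Mixing numbers and a string in `c()` coerces the whole vector to character, which is why the helper converts back to numeric before applying the rules.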
The output is a list whose parts are described below:
CleanedData |
vector [1:m], columnvector containing Data>=0 and zeros for all NA, NaN and negative values in Data(1:n) |
Data2CleanInd |
vector [1:k], Index such that CleanedData = nantozero(Data(Data2CleanInd)) |
RemovedInd |
vector [1:l], Index such that Data(RemovedInd) is the data that has been removed if RemoveSmallYields==1 |
http://www.uni-marburg.de/fb12/datenbionik
Michael Thrun
Calculates cumulative percentage of largest data (effort) and cumulative percentages of sum of largest Data (yield) with spline interpolation (second order, piecewise) of values in-between.
ABCcurve(Data, p)
Data |
vector[1:n] describes an array of data: n cases in rows of one variable |
p |
optional, a vector of values specifying where interpolation takes place, created by |
The output is a list whose parts are described below:
Curve |
A list with
|
CleanedData |
vector [1:m], columnvector containing Data>=0 and zeros for all NA, NaN and negative values in Data(1:n) |
Slope |
A list with
|
Michael Thrun
http://www.uni-marburg.de/fb12/datenbionik
Ultsch, A., Lotsch, J.: Computed ABC Analysis for Rational Selection of Most Informative Variables in Multivariate Data, PLoS ONE, Vol. 10(6), e0129767, doi:10.1371/journal.pone.0129767, 2015.
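To make the effort and yield definitions concrete, the sketch below builds an ABC curve by hand in base R and interpolates it on a grid p. It is not the package's implementation: the toy data are made up, and `stats::splinefun` with the monotone cubic method "monoH.FC" is swapped in for the second-order piecewise spline named in the description.

```r
# Toy data (made up for illustration).
x <- c(100, 60, 30, 10, 5, 3, 2, 1, 1, 1)

# Effort: cumulative fraction of items, largest first.
# Yield: cumulative fraction of the total sum covered by those items.
sorted <- sort(x, decreasing = TRUE)
effort <- c(0, seq_along(sorted) / length(sorted))
yield  <- c(0, cumsum(sorted) / sum(sorted))

# Positions at which the curve is interpolated (plays the role of p).
p <- seq(0, 1, by = 0.01)

# Assumption: a monotone cubic Hermite spline replaces the second-order spline.
curve_fun <- splinefun(effort, yield, method = "monoH.FC")
abc   <- curve_fun(p)             # interpolated yield at each p
slope <- curve_fun(p, deriv = 1)  # first derivative of the interpolant
```

A monotone method is chosen here so that the interpolated yield never decreases between data points, which keeps the curve interpretable as a cumulative quantity.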
Plots cumulative percentage of largest data (effort) vs. cumulative percentage of sum of largest data (yield)
ABCplot(Data, LineType = 0, LineWidth = 3, ShowUniform = TRUE, title, ABCcurvedata,defaultAxes = TRUE)
Data |
vector[1:n], describes an array of data: n cases in rows of one variable |
LineType |
for plot default: LineType=0 for a line, other line codes see documentation about |
LineWidth |
integer, width of Line, see |
ShowUniform |
bool, =TRUE: the ABC curve of the uniform distribution is shown in plot |
title |
string, optional, see parameter |
ABCcurvedata |
optional, see ABCcurve |
defaultAxes |
optional, boolean, see parameter |
The output is a list whose parts are described below:
ABCx |
vector [1:k], cumulative population in percent |
ABCy |
vector [1:k], cumulative high Data in percent |
The diagonal from (1,0) to (0,1) is the Equilibrium, where effort equals yield
Michael Thrun
http://www.uni-marburg.de/fb12/datenbionik
data("SwissInhabitants")
vec = ABCplot(SwissInhabitants)
Only the first column of Data is used; anything that is not a positive numerical value is set to zero.
ABCRemoveSmallYields(Data,CumSumSmallestPercentage)
Data |
vector[1:n] describes an array of data: n cases in rows of one variable |
CumSumSmallestPercentage |
default 0.5; the smallest data, up to a cumulated sum of less than CumSumSmallestPercentage of the total sum, are removed |
Values below zero are set to zero. Non-numeric values (NA, NaN, etc.) in Data are set to zero. Strings and characters are set to zero. Infinite numbers are set to max(Data). The smallest data, up to a cumulated sum of less than CumSumSmallestPercentage of the total sum (yield), are removed.
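The removal rule can be sketched in base R as follows. This is an illustration under stated assumptions, not the package's code: `remove_small_sketch` and its argument and field names are hypothetical, the input is assumed to be already cleaned, and the threshold is read as a percentage of the total sum, as the parameter name suggests.

```r
# Hypothetical helper illustrating the removal rule; not the package's code.
remove_small_sketch <- function(data, cum_sum_smallest_percentage = 0.5) {
  ord <- order(data)  # indices from smallest to largest value
  # Cumulative share of the total sum, in percent, taken smallest first.
  cum_pct <- cumsum(data[ord]) / sum(data) * 100
  # Assumption: items are dropped while the share stays below the threshold.
  removed_ind <- ord[cum_pct < cum_sum_smallest_percentage]
  kept_ind <- setdiff(seq_along(data), removed_ind)
  list(SubstantialData = data[kept_ind],
       KeptInd = kept_ind,
       RemovedInd = removed_ind)
}

res <- remove_small_sketch(c(1000, 500, 1, 2, 300, 1))
```

In this toy input the three smallest values together make up well under 0.5 percent of the total, so they are the ones that end up in `RemovedInd`.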
The output is a list whose parts are described below:
SubstantialData |
columnvector containing Data>=0 and zeros for all NaN and negative values in Data(1:n) |
Data2CleanInd |
Index such that SubstantialData = nantozero(Data(Data2SubstantialInd)) |
RemovedInd |
Data(RemovedInd) is the data that has been removed |
http://www.uni-marburg.de/fb12/datenbionik
Michael Thrun
Divides the data into three classes A, B and C such that:
A = Data[Aind]: much yield with low effort
B = Data[Bind]: yield and effort are about equal
C = Data[Cind]: low yield with much effort
calculatedABCanalysis(Data)
Data |
vector [1:n], an array of data: n cases in rows of one variable; if a matrix or data frame is given, its first column is used. |
Pareto point: the point with minimum distance to (0,1), i.e. minimal unrealized potential.
Break-even point: B_x is the x value of the point where the slope of the ABC curve equals one.
For a further description of p, as referred to by the variable AlimitIndInInterpolation, see ABCcurve.
The output is a list whose parts are described below:
Aind |
vector [1:j], A == Data[Aind]: much yield with little effort |
Bind |
vector [1:l], B == Data[Bind]: effort and yield are balanced |
Cind |
vector [1:m], C == Data[Cind]: little yield for much effort |
smallestAData |
Boundary AB, defined by point A or B with ABexchanged |
smallestBData |
Boundary BC, defined by point C |
Michael Thrun
http://www.uni-marburg.de/fb12/datenbionik
Ultsch, A., Lotsch, J.: Computed ABC Analysis for Rational Selection of Most Informative Variables in Multivariate Data, PLoS ONE, Vol. 10(6), e0129767, doi:10.1371/journal.pone.0129767, 2015.
data("SwissInhabitants")
abc = calculatedABCanalysis(SwissInhabitants)
A = abc$Aind
B = abc$Bind
C = abc$Cind
Agroup = SwissInhabitants[A]
Bgroup = SwissInhabitants[B]
Cgroup = SwissInhabitants[C]
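How two boundary values split a data vector into three disjoint index sets can be sketched without the package. The thresholds below are made up and merely stand in for smallestAData and smallestBData; treating them as inclusive lower bounds of A and B is an assumption suggested by their names, not something stated in this manual.

```r
# Toy data and made-up thresholds (stand-ins for smallestAData, smallestBData).
data <- c(120, 95, 40, 22, 9, 7, 3, 2, 1, 1)
smallest_a <- 40  # hypothetical boundary between A and B
smallest_b <- 7   # hypothetical boundary between B and C

# Assumption: each boundary is the smallest value still belonging to its set.
a_ind <- which(data >= smallest_a)
b_ind <- which(data < smallest_a & data >= smallest_b)
c_ind <- which(data < smallest_b)

# The three index sets are disjoint and together cover every item.
covered <- sort(c(a_ind, b_ind, c_ind))
```

Indexing the data with `a_ind`, `b_ind` and `c_ind` then gives the groups in the same way as `Agroup`, `Bgroup` and `Cgroup` in the example above.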
Gini index for an ABC curve
Gini4ABC(p, ABC)
p |
vector [1:k], cumulative population in percent |
ABC |
vector [1:k], cumulative high data in percent |
Gini: the Gini index, i.e. the integral over ABC(p) / 0.5 * 100, given in percent, i.e. in [0..100].
FL?MT?
Calculates the Gini index from Data.
GiniIndex(Data,p)
Data |
vector[1:n] describes an array of data: n cases in rows of one variable |
p |
optional, a vector of values specifying where interpolation takes place, created by |
uses ABCcurve and Gini4ABC
Gini |
Gini index, i.e. the integral over Area * 200 - 100, given in percent, i.e. in [0..100] |
p |
vector [1:k], cumulative population in percent |
ABC |
vector [1:k], cumulative high data in percent |
CleanedData |
vector [1:m], columnvector containing Data>=0 and zeros for all NA, NaN and negative values in Data(1:n) |
Michael Thrun
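The value description above (area times 200 minus 100) can be checked numerically on simple curves. The sketch below is a hedged illustration in base R: `gini_sketch` is a hypothetical helper, and the trapezoidal rule is an assumed way of approximating the integral, not necessarily what the package uses.

```r
# Hypothetical helper; the trapezoidal rule is an assumption, not the package's method.
gini_sketch <- function(p, abc) {
  n <- length(p)
  # Area under the ABC curve, approximated by trapezoids between grid points.
  area <- sum(diff(p) * (abc[-n] + abc[-1]) / 2)
  # Rescale so that the diagonal (area 0.5) maps to 0 and full area maps to 100.
  area * 200 - 100
}

p <- seq(0, 1, by = 0.1)

# Equal data values give the diagonal as ABC curve, hence no concentration.
gini_uniform <- gini_sketch(p, p)

# A curve that jumps to full yield at the first grid step is highly concentrated.
gini_concentrated <- gini_sketch(p, c(0, rep(1, length(p) - 1)))
```

The diagonal case gives a value of zero up to floating-point error, while the concentrated case approaches the upper end of the range as the grid becomes finer.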
Number of inhabitants in the 2896 communes (cities and villages) of Switzerland in the year 1900.
data("SwissInhabitants")
This data set consists of the number of inhabitants in the 2896 communes, i.e. cities and villages, in the year 1900. The individual count is the total number of persons living in the particular commune. The data set is unordered for anonymity reasons. The data set has been used as part of a larger data set to identify patterns of concentration in Switzerland (see reference).
Schuler, M., Ullmann, D.: Eidgenossische Volkszahlung: Bevoelkerungsentwicklung der Gemeinden, Bundesamt fur Statistik, Neuchatel, Switzerland, 2002.
Behnisch, M., Ultsch, A.: Population Patterns in Switzerland 1850-2000, in: Gaul, W. et al (Eds), Advances in Data Analysis, Data Handling and Business Intelligence, Springer, Heidelberg, pp. 163-173, 2010.
data(SwissInhabitants)
## maybe str(SwissInhabitants) ; plot(SwissInhabitants) ...