Title: | Visualization 2D of Binary Classification Models |
---|---|
Description: | Visual contour and 2D point and contour plots for binary classification modeling under algorithms such as 'glm', 'rf', 'gbm', 'nnet' and 'svm', presented over two dimensions generated by 'famd' and 'mca' methods. Package 'FactoMineR' for multivariate reduction functions and package 'MBA' for interpolation functions are used. The package can be used to visualize the discriminant power of input variables and algorithmic modeling, explore outliers, compare algorithm behaviour, etc. It has been created initially for teaching purposes, but it has also many practical uses under the 'XAI' paradigm. |
Authors: | Javier Portela [aut, cre] |
Maintainer: | Javier Portela <[email protected]> |
License: | GPL (>= 3) |
Version: | 0.1.1 |
Built: | 2024-11-08 10:32:37 UTC |
Source: | CRAN |
Breast Cancer Winsconsin dataset
data(breastwisconsin1)
data(breastwisconsin1)
An object of class data.frame
with 699 rows and 10 columns.
https://archive.ics.uci.edu/ml/datasets/breast+cancer+wisconsin+(original)
This function presents visual graphics by means of FAMD. FAMD function is Factorial Analysis for Mixed Data (interval and categorical) Dependent classification variable is set as supplementary variable. Machine learning algorithm predictions are presented in a filled contour setting
famdcontour(dataf=dataf,listconti,listclass,vardep,proba="", title="",title2="",depcol="",listacol="",alpha1=0.7,alpha2=0.7,alpha3=0.7, classvar=1,intergrid=0,selec=0,modelo="glm",nodos=3,maxit=200,decay=0.01, sampsize=400,mtry=2,nodesize=10,ntree=400,ntreegbm=500,shrink=0.01, bag.fraction=1,n.minobsinnode=10,C=100,gamma=10,Dime1="Dim.1",Dime2="Dim.2")
famdcontour(dataf=dataf,listconti,listclass,vardep,proba="", title="",title2="",depcol="",listacol="",alpha1=0.7,alpha2=0.7,alpha3=0.7, classvar=1,intergrid=0,selec=0,modelo="glm",nodos=3,maxit=200,decay=0.01, sampsize=400,mtry=2,nodesize=10,ntree=400,ntreegbm=500,shrink=0.01, bag.fraction=1,n.minobsinnode=10,C=100,gamma=10,Dime1="Dim.1",Dime2="Dim.2")
dataf |
data frame. |
listconti |
Interval variables to use, in format c("var1","var2",...). |
listclass |
Class variables to use, in format c("var1","var2",...). |
vardep |
Dependent binary classification variable. |
proba |
vector of probability predictions obtained externally (optional) |
title |
plot main title |
title2 |
plot subtitle |
depcol |
vector of two colors for points |
listacol |
vector of colors for labels |
alpha1 |
alpha transparency for majoritary class |
alpha2 |
alpha transparency for minoritary class |
alpha3 |
alpha transparency for fit probability plots |
classvar |
1 if dependent variable categories are plotted as supplementary |
intergrid |
scale of grid for contour:0 if automatic |
selec |
1 if stepwise logistic variable selection is required, 0 if not. |
modelo |
name of model: "glm","gbm","rf,","nnet","svm". |
nodos |
nnet: nodes |
maxit |
nnet: iterations |
decay |
nnet: decay |
sampsize |
rf: sampsize |
mtry |
rf: mtry |
nodesize |
rf: nodesize |
ntree |
rf: ntree |
ntreegbm |
gbm: ntree |
shrink |
gbm: shrink |
bag.fraction |
gbm: bag.fraction |
n.minobsinnode |
gbm:n.minobsinnode |
C |
svm Radial: C |
gamma |
svm Radial: gamma |
Dime1 , Dime2
|
FAMD Dimensions to consider. Dim.1 and Dim.2 by default. |
FAMD algorithm from FactoMineR package is used to compute point coordinates on dimensions (Dim.1 and Dim.2 by default). Minority class on dependent variable category is represented as red, majority category as green. Color scheme can be altered using depcol and listacol, as well as alpha transparency values.
For predictive modeling, selec=1 selects variables with a simple stepwise logistic regression. By default select=0. Logistic regression is used by default. Basic parameter setting is supported for algorithms nnet, rf,gbm and svm-RBF. A vector of fitted probabilities obtained externally from other algorithms can be imported in parameter proba=nameofvector. Contour curves are then computed based on this vector.
Contour curves are build by the following process: i) the chosen algorithm model is trained and all observations are predicted-fitted. ii) A grid of points on the two chosen FAMD dimensions is built iii) package MBA is used to interpol probability estimates over the grid, based on previously fitted observations.
In order to represent interval variables, categories of class variables, and points in the same plot, a proportional projection of interval variables coordinates over the two dimensions range is applied. Since space of input variables is frequently larger than two dimensions, sometimes overlapping of points is produced; a frequency variable is used, and alpha values may be adjusted to avoid wrong interpretations of the presence of dependent variable category/color.
Check missings. Missing values are not allowed.
By default selec=0. Setting selec=1 may sometimes imply that no variables are selected; an error message is shown n this case.
Models with only two input variables could lead to plot generation problems.
Be sure that variables named in listconti are all numeric.
If some numeric variable is constant at one single value, process is stopped since numeric Min-max standarization is performed, and NaN values are generated.
A list with the following objects:
plot of points on FAMD first two dimensions
plot of points and contour curves
plot of points and variables
plot of points variable and contour curves
plot of points colored by fitted probability
plot of points colored by abs difference
data frame used for graph1
data frame used for contour curves
data frame used for variable names
interval variables used-selected
class variables used-selected
Pages J. (2004). Analyse factorielle de donnees mixtes. Revue Statistique Appliquee. LII (4). pp. 93-111.
data(breastwisconsin1) dataf<-breastwisconsin1 listconti=c( "clump_thickness","uniformity_of_cell_shape","mitosis") listclass=c("") vardep="classes" result<-famdcontour(dataf=dataf,listconti,listclass,vardep)
data(breastwisconsin1) dataf<-breastwisconsin1 listconti=c( "clump_thickness","uniformity_of_cell_shape","mitosis") listclass=c("") vardep="classes" result<-famdcontour(dataf=dataf,listconti,listclass,vardep)
This function adds outlier marks to famdcontour using ggrepel package.
famdcontourlabel( dataf = dataf, Idt = "", inf = 0.1, sup = 0.9, cutprob = 0.5, ... )
famdcontourlabel( dataf = dataf, Idt = "", inf = 0.1, sup = 0.9, cutprob = 0.5, ... )
dataf |
data frame. |
Idt |
Identification variable, default "", row number |
inf , sup
|
Quantiles for x,y outliers |
cutprob |
cut point for outliers based on prob.estimation error |
... |
options to be passed from famdcontour |
An identification variable can be set in Idt parameter. By default, number of row is used. There are two source of outliers: i) outliers in the two FAMD dimension space, where the cutpoints are set as quantiles given (inf=0.1 and sup=0.9 in both dimensions by default) and ii) outliers with respect to the fitted probability. The dependent variable is set to 1 for the mimority class, and 0 for the majority class. Points considered outliers are those for which abs(vardep-fittedprob) excede parameter cutprob.
A list with the following objects:
plots for dimension outliers
plots for fit outliers
data(breastwisconsin1) dataf<-breastwisconsin1 listconti=c( "clump_thickness","uniformity_of_cell_shape","mitosis") listclass=c("") vardep="classes" result<-famdcontourlabel(dataf=dataf,listconti=listconti, listclass=listclass,vardep=vardep)
data(breastwisconsin1) dataf<-breastwisconsin1 listconti=c( "clump_thickness","uniformity_of_cell_shape","mitosis") listclass=c("") vardep="classes" result<-famdcontourlabel(dataf=dataf,listconti=listconti, listclass=listclass,vardep=vardep)
Home Mortgage Disclosure Act dataset
data(Hmda)
data(Hmda)
An object of class data.frame
with 2380 rows and 13 columns.
Stock, J. H. and Watson, M. W. (2007). Introduction to Econometrics, 2nd ed. Boston: Addison Wesley.
This function presents visual graphics by means of Multiple correspondence Analysis projection. Interval variables are categorized to bins. Dependent classification variable is set as supplementary variable. Machine learning algorithm predictions are presented in a filled contour setting.
mcacontour(dataf=dataf,listconti,listclass,vardep,proba="",bins=8, Dime1="Dim.1",Dime2="Dim.2",classvar=1,intergrid=0,selec=0, title="",title2="",listacol="",depcol="",alpha1=0.8,alpha2=0.8,alpha3=0.7,modelo="glm", nodos=3,maxit=200,decay=0.01,sampsize=400,mtry=2,nodesize=5, ntree=400,ntreegbm=500,shrink=0.01,bag.fraction=1,n.minobsinnode=10,C=100,gamma=10)
mcacontour(dataf=dataf,listconti,listclass,vardep,proba="",bins=8, Dime1="Dim.1",Dime2="Dim.2",classvar=1,intergrid=0,selec=0, title="",title2="",listacol="",depcol="",alpha1=0.8,alpha2=0.8,alpha3=0.7,modelo="glm", nodos=3,maxit=200,decay=0.01,sampsize=400,mtry=2,nodesize=5, ntree=400,ntreegbm=500,shrink=0.01,bag.fraction=1,n.minobsinnode=10,C=100,gamma=10)
dataf |
data frame. |
listconti |
Interval variables to use, in format c("var1","var2",...). |
listclass |
Class variables to use, in format c("var1","var2",...). |
vardep |
Dependent binary classification variable. |
proba |
vector of probability predictions obtained externally (optional) |
bins |
Number of bins for categorize interval variables . |
Dime1 |
FAMD Dimensions to consider. Dim.1 and Dim.2 by default. |
Dime2 |
FAMD Dimensions to consider. Dim.1 and Dim.2 by default. |
classvar |
1 if dependent variable categories are plotted as supplementary |
intergrid |
scale of grid for contour:0 if automatic |
selec |
1 if stepwise logistic variable selection is required, 0 if not. |
title |
plot main title |
title2 |
plot subtitle |
listacol |
vector of colors for labels |
depcol |
vector of two colors for points |
alpha1 |
alpha transparency for majoritary class |
alpha2 |
alpha transparency for minoritary class |
alpha3 |
alpha transparency for fit probability plots |
modelo |
name of model: "glm","gbm","rf,","nnet","svm". |
nodos |
nnet: nodes |
maxit |
nnet: iterations |
decay |
nnet: decay |
sampsize |
rf: sampsize |
mtry |
rf: mtry |
nodesize |
rf: nodesize |
ntree |
rf: ntree |
ntreegbm |
gbm: ntree |
shrink |
gbm: shrink |
bag.fraction |
gbm: bag.fraction |
n.minobsinnode |
gbm:n.minobsinnode |
C |
svm Radial: C |
gamma |
svm Radial: gamma |
This function applies MCA (Multiple Correspondence Analysis) in order to project points and categories of class variables in the same plot. In addition, interval variables listed in listconti are categorized to the number given in bins parameter (by default 8 bins). Further explanation about machine learning classification and contour curves, see the famdcontour function documentation.
A list with the following objects:
plot of points on MCA two dimensions
plot of points and variables
plot of points and contour curves
plot of points, contour curves and variables
plot of points colored by fitted probability
plot of points colored by abs difference
dataset used for graph1
dataset used for graph2
dataset used for graph3
dataset used for graph4
interval variables used
class variables used
color schemes and other parameters
data(breastwisconsin1) dataf<-breastwisconsin1 listconti=c( "clump_thickness","uniformity_of_cell_shape","mitosis") listclass=c("") vardep="classes" result<-mcacontour(dataf=dataf,listconti,listclass,vardep)
data(breastwisconsin1) dataf<-breastwisconsin1 listconti=c( "clump_thickness","uniformity_of_cell_shape","mitosis") listclass=c("") vardep="classes" result<-mcacontour(dataf=dataf,listconti,listclass,vardep)
This function is similar to mcacontour but points are jittered in every plot
mcacontourjit(dataf=dataf,jit=0.1,alpha1=0.8,alpha2=0.8,alpha3=0.7,title="",...)
mcacontourjit(dataf=dataf,jit=0.1,alpha1=0.8,alpha2=0.8,alpha3=0.7,title="",...)
dataf |
data frame. |
jit |
jit distance. Default 0.1. |
alpha1 |
alpha transparency for majoritary class |
alpha2 |
alpha transparency for minoritary class |
alpha3 |
alpha transparency for fit probability plots |
title |
plot main title |
... |
options to be passed from mcacontour |
A list with the following objects:
plot of points on MCA two dimensions
plot of points and variables
plot of points and contour curves
plot of points, contour curves and variables
plot of points colored by fitted probability
plot of points colored by abs difference
data(breastwisconsin1) dataf<-breastwisconsin1 listconti=c( "clump_thickness","uniformity_of_cell_shape","mitosis") listclass=c("") vardep="classes" result<-mcacontourjit(dataf=dataf,listconti=listconti,listclass=listclass,vardep=vardep,jit=0.1)
data(breastwisconsin1) dataf<-breastwisconsin1 listconti=c( "clump_thickness","uniformity_of_cell_shape","mitosis") listclass=c("") vardep="classes" result<-mcacontourjit(dataf=dataf,listconti=listconti,listclass=listclass,vardep=vardep,jit=0.1)
This function presents visual graphics by means of Multiple correspondence Analysis projection. Interval variables are categorized to bins. Dependent classification variable is set as supplementary variable. It is used as base for mcacontour function.
mcamodelobis(dataf=dataf,listconti,listclass, vardep,bins=8,selec=1, Dime1="Dim.1",Dime2="Dim.2")
mcamodelobis(dataf=dataf,listconti,listclass, vardep,bins=8,selec=1, Dime1="Dim.1",Dime2="Dim.2")
dataf |
data frame. |
listconti |
Interval variables to use, in format c("var1","var2",...). |
listclass |
Class variables to use, in format c("var1","var2",...). |
vardep |
Dependent binary classification variable. |
bins |
Number of bins for categorize interval variables . |
selec |
1 if stepwise logistic variable selection is required, 0 if not. |
Dime1 , Dime2
|
MCA Dimensions to consider. Dim.1 and Dim.2 by default. |
A list with the following objects:
dataset used for graph1
dataset used for graph2
dataset used for graph2
interval variables used
class variables used
axis definition in plot
axis definition in plot
data(breastwisconsin1) dataf<-breastwisconsin1 listconti=c( "clump_thickness","uniformity_of_cell_shape","mitosis") listclass=c("") vardep="classes" result<-mcacontour(dataf=dataf,listconti,listclass,vardep,bins=8,title="",selec=1)
data(breastwisconsin1) dataf<-breastwisconsin1 listconti=c( "clump_thickness","uniformity_of_cell_shape","mitosis") listclass=c("") vardep="classes" result<-mcacontour(dataf=dataf,listconti,listclass,vardep,bins=8,title="",selec=1)
nba dataset
data(nba)
data(nba)
An object of class data.frame
with 1340 rows and 21 columns.
https://data.world/exercises/logistic-regression-exercise-1
Pima indian diabetes dataset
data(pima)
data(pima)
An object of class data.frame
with 768 rows and 9 columns.
https://sci2s.ugr.es/keel/dataset.php?cod=21
spiral sample data
data(spiral)
data(spiral)
An object of class data.frame
with 803 rows and 3 columns.