Title: | Modeling and Map Production using Random Forest and Related Stochastic Models |
---|---|
Description: | Creates sophisticated models of training data and validates the models with an independent test set, cross validation, or Out Of Bag (OOB) predictions on the training data. Creates graphs and tables of the model validation results. Applies these models to GIS .img files of predictors to create detailed prediction surfaces. Handles large predictor files for map making by reading the .img files in chunks and writing the predictions for each chunk to the output .txt file before reading the next chunk of data. |
Authors: | Elizabeth Freeman, Tracey Frescino |
Maintainer: | Elizabeth Freeman <[email protected]> |
License: | Unlimited |
Version: | 3.4.0.4 |
Built: | 2024-12-09 07:00:20 UTC |
Source: | CRAN |
Package: | ModelMap |
Type: | Package |
Version: | 3.4.0.4 |
Date: | 2023-04-04 |
License: | Unlimited. This code was written and prepared by a U.S. Government employee on official time, and therefore it is in the public domain and not subject to copyright. |
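The chunked processing mentioned in the Description (read a block of predictor data, write its predictions, then read the next block) can be sketched in base R. This is only an illustration under loud assumptions: a CSV file stands in for the .img predictor rasters, lm() stands in for the fitted forest, and the names pred_file, out_file and chunk_size are hypothetical. It is not the package's internal code.

```r
# Toy training data and a simple stand-in model (lm instead of a forest).
set.seed(1)
train <- data.frame(x1 = runif(50), x2 = runif(50))
train$y <- 2 * train$x1 - train$x2 + rnorm(50, sd = 0.1)
fit <- lm(y ~ x1 + x2, data = train)

# Stand-in "predictor file": 1000 rows of predictor values on disk.
pred_file <- tempfile(fileext = ".csv")   # hypothetical input file
out_file  <- tempfile(fileext = ".txt")   # hypothetical output file
write.csv(data.frame(x1 = runif(1000), x2 = runif(1000)),
          pred_file, row.names = FALSE)

chunk_size <- 300
con_in  <- file(pred_file, open = "r")
con_out <- file(out_file, open = "w")

# Read the header once, then process the body one chunk at a time.
col_names <- strsplit(gsub('"', "", readLines(con_in, n = 1)), ",")[[1]]
n_chunks <- 0
repeat {
  lines <- readLines(con_in, n = chunk_size)
  if (length(lines) == 0) break            # no data left
  chunk <- read.csv(text = lines, header = FALSE, col.names = col_names)
  # Write this chunk's predictions before reading the next chunk.
  writeLines(as.character(predict(fit, newdata = chunk)), con_out)
  n_chunks <- n_chunks + 1
}
close(con_in)
close(con_out)
```

Only one chunk of predictor values is held in memory at a time, which is the point of the approach when the predictor files are large.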
This package provides a push button approach to complex model building and production mapping. It contains three main functions: model.build, model.diagnostics, and model.mapmake.
In addition, it contains a simple function, get.test, that can be used to randomly divide a dataset into training and test/validation sets; build.rastLUT, which uses GUI prompts to walk a user through the process of setting up a raster look-up table that links predictors from the training data with the rasters used for map construction; model.explore, for preliminary data exploration; and model.importance.plot and model.interaction.plot, for interpreting the effects of individual model predictors.
ModelMap can be run in a traditional R command mode, where all arguments are specified in the function call. However, it can also be used in a full push button mode, where you type a simple command such as model.build, and GUI pop-up windows ask questions about the type of model, the file locations of the data, and so on.
Random Forest is implemented through the randomForest package within R. Random Forest is more user friendly than Stochastic Gradient Boosting, as it has fewer parameters to be set by the user and is less sensitive to the tuning of these parameters. A Random Forest model consists of multiple trees that vote on predictions. For each tree, a random subset of the training data is used to construct the tree, with the remaining data points used to construct out-of-bag (OOB) error estimates. At each node of the tree, a random selection of predictors is considered when determining the split. The number of predictors considered at each split is the primary user-specified parameter that can affect model performance, and this parameter can be automatically optimized using the randomForest function tuneRF(). Adding more trees does not cause a Random Forest to overfit, so the only penalty of increasing the number of trees is computation time. Random Forest can compute variable importance, an advantage over some "black box" modeling techniques if it is important to understand the ecological relationships underlying a model (Breiman, 2001).
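The in-bag/out-of-bag split and the per-node predictor sampling described above can be illustrated with a few lines of base R. This is a minimal sketch, not the randomForest implementation: it assumes sampling with replacement (a bootstrap sample the size of the training set), and the sizes n, ntree and p are arbitrary illustrative values.

```r
set.seed(42)
n     <- 1000   # number of training rows (illustrative)
ntree <- 200    # number of trees (illustrative)

# For each tree, draw a bootstrap sample; rows never drawn are out-of-bag.
oob_fraction <- numeric(ntree)
for (b in seq_len(ntree)) {
  in_bag <- sample(n, size = n, replace = TRUE)
  oob    <- setdiff(seq_len(n), in_bag)
  oob_fraction[b] <- length(oob) / n
}

# On average roughly exp(-1) of the rows are out-of-bag for any one tree.
mean_oob <- mean(oob_fraction)

# At a node, only a random subset of the p predictors is considered.
# floor(sqrt(p)) is the usual classification default in randomForest.
p    <- 9
mtry <- floor(sqrt(p))
node_candidates <- sample(p, size = mtry)
```

Because each row is out-of-bag for a sizeable share of the trees, every training point can be predicted by trees that never saw it, which is what makes OOB error a usable validation estimate.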
Quantile Regression Forests is implemented through the quantregForest package.
Conditional Forests is implemented with the cforest() function in the party package. As stated in the party package documentation, ensembles of conditional inference trees have not yet been extensively tested, so this routine is meant for the expert user only and its current state is rather experimental.
For Presence-Absence data, the package PresenceAbsence is used for model validation.
For model diagnostics, the package corrplot is used to plot the correlation between predictor variables.
For map making, the raster package is used to read and write .img files.
For interaction plots, the fields package is used to produce image plots.
Breiman, L. (2001) Random Forests. Machine Learning, 45:5-32.
Elith, J., Leathwick, J. R. and Hastie, T. (2008). A working guide to boosted regression trees. Journal of Animal Ecology. 77:802-813.
Friedman, J.H. (2001). Greedy function approximation: a gradient boosting machine. Ann. Stat., 29(5):1189-1232.
Friedman, J.H. (2002). Stochastic gradient boosting. Comput. Stat. Data An., 38(4):367-378.
Liaw, A. and Wiener, M. (2002). Classification and Regression by randomForest. R News 2(3), 18–22.
N. Meinshausen (2006) "Quantile Regression Forests", Journal of Machine Learning Research 7, 983-999 http://jmlr.csail.mit.edu/papers/v7/
Ridgeway, G., (1999). The state of boosting. Comp. Sci. Stat. 31:172-181
Carolin Strobl, Anne-Laure Boulesteix, Achim Zeileis and Torsten Hothorn (2007). Bias in Random Forest Variable Importance Measures: Illustrations, Sources and a Solution. BMC Bioinformatics, 8, 25. http://www.biomedcentral.com/1471-2105/8/25
Carolin Strobl, James Malley and Gerhard Tutz (2009). An Introduction to Recursive Partitioning: Rationale, Application, and Characteristics of Classification and Regression Trees, Bagging, and Random Forests. Psychological Methods, 14(4), 323-348.
Torsten Hothorn, Berthold Lausen, Axel Benner and Martin Radespiel-Troeger (2004). Bagging Survival Trees. Statistics in Medicine, 23(1), 77-91.
Torsten Hothorn, Peter Buhlmann, Sandrine Dudoit, Annette Molinaro and Mark J. van der Laan (2006a). Survival Ensembles. Biostatistics, 7(3), 355-373.
Torsten Hothorn, Kurt Hornik and Achim Zeileis (2006b). Unbiased Recursive Partitioning: A Conditional Inference Framework. Journal of Computational and Graphical Statistics, 15(3), 651-674. Preprint available from http://statmath.wu-wien.ac.at/~zeileis/papers/Hothorn+Hornik+Zeileis-2006.pdf
GUI prompts will help the user build a Look-Up-Table to associate predictor variables with their corresponding spatial rasters.
build.rastLUT(imageList=NULL,predList=NULL,qdata.trainfn=NULL, rastLUTfn=NULL,folder=NULL)
imageList |
Vector. A vector of character strings giving names and full paths to all raster data files used in model. |
predList |
Vector. A vector of character strings giving the predictor names used as headers in the model training data. |
qdata.trainfn |
String. The name (full path or base name with path specified by |
rastLUTfn |
String. The name of the file output for the Look-Up-Table. By default, if a file name is provided by the |
folder |
String. The folder used for output. Do not add ending slash to path string. If |
This function helps the user create a raster Look-Up-Table to be used later by model.mapmake(). Currently this function only works in a Windows environment.
First, if "folder" is not given, the user selects the output folder for the Look-Up-Table.
Second, if "predList" or "qdata.trainfn" is not given, the user selects the file containing the training data. The header of the file is used to generate a selection list of possible predictor variables.
Third, if "imageList" is not provided, the user selects the rasters.
Finally, the function steps through each band of each raster, and the user selects the appropriate predictor.
Returns a data frame containing the raster Look-Up-Table. Also writes a .csv file containing the raster Look-Up-Table.
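On platforms where the GUI prompts are unavailable, a table of the same general kind can be assembled by hand. The sketch below is an assumption-laden illustration: the three-column layout (raster path, predictor name, band number), the absence of a header row in the .csv, and the raster paths are all hypothetical here, so check the rastLUT argument documentation of model.mapmake() before relying on it.

```r
# Predictor names as they appear in the training data header.
predList <- c("TCB", "TCG", "TCW", "NLCD")

# Hypothetical raster files: a 3-band image plus a single-band image.
# Assumed layout: column 1 = raster path, column 2 = predictor, column 3 = band.
rastLUT <- data.frame(
  raster    = c(rep("/path/to/tc.img", 3), "/path/to/nlcd.img"),
  predictor = predList,
  band      = c(1, 2, 3, 1),
  stringsAsFactors = FALSE
)

# Write the table as a comma-delimited file (no header row is assumed).
rastLUTfn <- file.path(tempdir(), "example_rastLUT.csv")
write.table(rastLUT, file = rastLUTfn, sep = ",",
            row.names = FALSE, col.names = FALSE)

# Read it back to confirm every predictor is linked to exactly one band.
check <- read.table(rastLUTfn, sep = ",", stringsAsFactors = FALSE)
```

Each predictor appears once, and a multi-band raster contributes one row per band.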
Elizabeth Freeman
## Not run:
folder <- system.file("extdata", "helpexamples", package = "ModelMap")
qdata.trainfn = paste(folder, "/DATATRAIN.csv", sep = "")

#build.rastLUT(qdata.trainfn = qdata.trainfn,
#              folder = folder)
## End(Not run)
Transforms color names into transparent versions of rgb color codes.
col2trans(col.names,alpha=0.5)
col.names |
Vector. Vector of color names from |
alpha |
Number. Number between 0 and 1 giving alpha channel (opacity) value |
Translates a vector of color names to a vector of transparent rgb color codes. Color names must be from the names given by colors().
Outputs a vector of transparent color codes.
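For readers curious how such a translation can work, here is a minimal base R sketch built on col2rgb() and rgb() from grDevices. The helper name to_transparent is hypothetical, and this is an assumption about one way to obtain the same kind of result, not the source of col2trans().

```r
# Hypothetical helper: convert color names to "#RRGGBBAA" codes.
to_transparent <- function(col.names, alpha = 0.5) {
  stopifnot(alpha >= 0, alpha <= 1)
  # col2rgb() returns a 3 x n matrix of red, green, blue values in 0..255.
  m <- col2rgb(col.names)
  # rgb() with the default maxColorValue = 1 expects values in 0..1.
  rgb(m[1, ] / 255, m[2, ] / 255, m[3, ] / 255, alpha = alpha)
}

codes <- to_transparent(c("blue", "violetred4", "thistle3", "yellowgreen"),
                        alpha = 0.2)
```

The last two hex digits of each code carry the opacity, so alpha = 0 gives a fully transparent color and alpha = 1 a fully opaque one.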
Elizabeth Freeman
col.names=c("blue","violetred4","thistle3","yellowgreen")
col2trans(col.names,alpha=.2)

###to see effect of alpha###
alpha<-(0:10)/10
colmat<-matrix( 1:(length(alpha)*length(col.names)),
                nrow=length(alpha),
                ncol=length(col.names),
                byrow=TRUE)
color.codes<-vector("character",0)
for(i in 1:length(alpha)){
   color.codes<-c(color.codes,col2trans(col.names,alpha=alpha[i]))
}

#make plot#
plot( c(0,1),c(0,1),
      type="n",xlab="alpha",ylab="color name",yaxt="n",xaxs="i",yaxs="i")
abline(h=(0:100)/100)
image( z=colmat,
       x=(0:length(alpha))/length(alpha),
       y=(0:length(col.names))/length(col.names),
       col=color.codes,
       add=TRUE )
op<-par(xpd=TRUE)
text( col.names, x=-.08, y=(1:length(col.names)-.5)/length(col.names), srt=90)
par(op)
Uses random selection to split a dataset into training and test data sets
get.test(proportion.test, qdatafn = NULL, seed = NULL, folder=NULL, qdata.trainfn = paste(strsplit(qdatafn, split = ".csv")[[1]], "_train.csv", sep = ""), qdata.testfn = paste(strsplit(qdatafn, split = ".csv")[[1]], "_test.csv", sep = ""))
proportion.test |
Number. The proportion of the training data that will be randomly extracted for use as a test set. Value between 0 and 1. |
qdatafn |
String. The name (basename or full path) of the data file to be split into training and test data. This data should include both response and predictor variables. The file must be a comma-delimited file |
seed |
Integer. The number used to initialize randomization to randomly select rows for a test data set. If you want to produce the same model later, use the same seed. If |
folder |
String. The folder used for all output from predictions and/or maps. Do not add ending slash to path string. If |
qdata.trainfn |
String. The name of the file output of training data. By default, |
qdata.testfn |
String. The name of the file output of test data. By default, |
This function should be run once, before starting analysis to create training and test sets. If the cross validation option is to be used with RF or SGB models, or if the OOB option is to be used for RF models, then this step is unnecessary.
Outputs a training data file and test data file. Unless qdata.trainfn or qdata.testfn are specified, the output will be located in folder. The output will have the same rows and columns as the original data.
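The random split itself is simple to reproduce in base R. The sketch below uses a hypothetical helper, split_train_test, and an in-memory toy data frame; it illustrates seeded random row selection and is not the code behind get.test(), which reads and writes .csv files.

```r
# Hypothetical helper: randomly assign a proportion of rows to a test set.
split_train_test <- function(qdata, proportion.test, seed = NULL) {
  stopifnot(proportion.test > 0, proportion.test < 1)
  if (!is.null(seed)) set.seed(seed)   # same seed gives the same split
  n_test <- round(nrow(qdata) * proportion.test)
  stopifnot(n_test >= 1, n_test < nrow(qdata))
  test_rows <- sort(sample(nrow(qdata), size = n_test))
  list(train = qdata[-test_rows, , drop = FALSE],
       test  = qdata[test_rows, , drop = FALSE])
}

# Toy data with an identifier, a response and one predictor.
qdata <- data.frame(ID = 1:50, BIO = rnorm(50), TCB = runif(50))
parts <- split_train_test(qdata, proportion.test = 0.2, seed = 42)
```

Every original row lands in exactly one of the two pieces, and rerunning with the same seed reproduces the same partition.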
Elizabeth Freeman
## Not run:
qdatafn<-system.file("extdata", "helpexamples","DATATRAIN.csv", package = "ModelMap")
qdata<-read.table(file=qdatafn,sep=",",header=TRUE,check.names=FALSE)

get.test( proportion.test=0.2,
          qdatafn=qdatafn,
          seed=42,
          folder=getwd(),
          qdata.trainfn="example.train.csv",
          qdata.testfn="example.test.csv")
## End(Not run)
Create sophisticated models using Random Forest, Quantile Regression Forests, Conditional Forests, or Stochastic Gradient Boosting from training data
model.build(model.type = NULL, qdata.trainfn = NULL, folder = NULL, MODELfn = NULL, predList = NULL, predFactor = FALSE, response.name = NULL, response.type = NULL, unique.rowname = NULL, seed = NULL, na.action = NULL, keep.data = TRUE, ntree = switch(model.type,RF=500,QRF=1000,CF=500,500), mtry = switch(model.type,RF=NULL,QRF=ceiling(length(predList)/3), CF = min(5,length(predList)-1),NULL), replace = TRUE, strata = NULL, sampsize = NULL, proximity = FALSE, importance=FALSE, quantiles=c(0.1,0.5,0.9), subset = NULL, weights = NULL, controls = NULL, xtrafo = NULL, ytrafo = NULL, scores = NULL)
model.type |
String. Model type. |
qdata.trainfn |
String. The name (full path or base name with path specified by |
folder |
String. The folder used for all output from predictions and/or maps. Do not add ending slash to path string. If |
MODELfn |
String. The file name to use to save files related to the model object. If |
predList |
String. A character vector of the predictor short names used to build the model. These names must match the column names in the training/test data files and the names in column two of the If both |
predFactor |
String. A character vector of predictor short names of the predictors from |
response.name |
String. The name of the response variable used to build the model. If |
response.type |
String. Response type: |
unique.rowname |
String. The name of the unique identifier used to identify each row in the training data. If |
seed |
Integer. The number used to initialize randomization to build RF or SGB models. If you want to produce the same model later, use the same seed. If |
na.action |
String. Model validation. Specifies the action to take if there are |
keep.data |
Logical. RF and SGB models. Should a copy of the predictor data be included in the model object. Useful for if |
ntree |
Integer. RF QRF and CF models. The number of random forest trees for a RF model. The default is 500 trees. |
mtry |
Integer. RF QRF and CF models. Number of variables to try at each node of Random Forest trees. By default, RF models will use the |
replace |
Logical. RF models. Should sampling of cases be done with or without replacement? |
strata |
Factor or String. RF models. A (factor) variable that is used for stratified sampling. Can be in the form of either the name of the column in |
sampsize |
Vector. RF models. Size(s) of sample to draw. For classification, if |
proximity |
Logical. RF models. Should proximity measure among the rows be calculated for unsupervised models? |
importance |
Logical. QRF models. For QRF models only, importance must be specified at the time of model building. If TRUE importance of predictors is assessed at the given |
quantiles |
Numeric. Used for QRF models if |
subset |
CF models. An optional vector specifying a subset of observations to be used in the fitting process. Note: |
weights |
CF models. An optional vector of weights to be used in the fitting process. Non-negative integer valued weights are allowed as well as non-negative real weights. Observations are sampled (with or without replacement) according to probabilities |
controls |
CF models. An object of class |
xtrafo |
CF models. A function to be applied to all input variables. By default, the |
ytrafo |
CF models. A function to be applied to all response variables. By default, the |
scores |
CF models. An optional named list of scores to be attached to ordered factors. Note: |
This package provides a push button approach to complex model building and production mapping. It contains three main functions: model.build, model.diagnostics, and model.mapmake.
In addition, it contains a simple function, get.test, that can be used to randomly divide a dataset into training and test/validation sets; build.rastLUT, which uses GUI prompts to walk a user through the process of setting up a raster look-up table that links predictors from the training data with the rasters used for map construction; model.explore, for preliminary data exploration; and model.importance.plot and model.interaction.plot, for interpreting the effects of individual model predictors.
These functions can be run in a traditional R command mode, where all arguments are specified in the function call. However, they can also be used in a full push button mode, where you type, for example, the simple command model.build, and GUI pop-up windows ask questions about the type of model, the file locations of the data, and so on.
When running the ModelMap package on non-Windows platforms, file names and folders need to be specified in the argument list, but other push button selections are handled by the select.list() function, which is platform independent.
Binary, categorical, and continuous response models are supported for Random Forest and Conditional Forest. Quantile Regression Forest is appropriate only for continuous response models.
Random Forest is implemented through the randomForest package within R. Random Forest is more user friendly than Stochastic Gradient Boosting, as it has fewer parameters to be set by the user and is less sensitive to the tuning of these parameters. A Random Forest model consists of multiple trees that vote on predictions. For each tree, a random subset of the training data is used to construct the tree, with the remaining data points used to construct out-of-bag (OOB) error estimates. At each node of the tree, a random selection of predictors is considered when determining the split. The number of predictors considered at each split (argument mtry) is the primary user-specified parameter that can affect model performance.
By default, mtry will be automatically optimized using the tuneRF() function from the randomForest package. Note that this is a stochastic process. If there is a chance that models may be combined later with the combine function from the randomForest package, then for consistency it is important to provide the mtry argument rather than using the default optimization process.
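The difference between tuned and fixed mtry can be seen directly with the randomForest package. The sketch below is illustrative only: it uses the built-in iris data rather than ModelMap inputs, the tuning settings and fixed_mtry value are arbitrary choices, and the whole block is skipped when randomForest is not installed.

```r
if (requireNamespace("randomForest", quietly = TRUE)) {
  x <- iris[, 1:4]
  y <- iris$Species

  # Stochastic tuning: the selected mtry can change from run to run
  # unless the seed is fixed.
  set.seed(38)
  tuned <- randomForest::tuneRF(x, y, ntreeTry = 100, stepFactor = 2,
                                improve = 0.05, trace = FALSE, plot = FALSE)
  # Column 1 holds the mtry values searched, column 2 the OOB error.
  best_mtry <- tuned[which.min(tuned[, 2]), 1]

  # For forests that will be combined later, fix mtry explicitly so that
  # every piece is built with the same setting.
  fixed_mtry <- 2
  rf_a <- randomForest::randomForest(x, y, ntree = 100, mtry = fixed_mtry)
  rf_b <- randomForest::randomForest(x, y, ntree = 150, mtry = fixed_mtry)
  rf_all <- randomForest::combine(rf_a, rf_b)
}
```

Supplying mtry in the call removes the run-to-run variation of the tuning step, so forests built in separate sessions share the same setting when they are merged.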
Adding more trees does not cause a Random Forest to overfit, so the only penalty of increasing the number of trees is computation time. Random Forest can compute variable importance, an advantage over some "black box" modeling techniques if it is important to understand the ecological relationships underlying a model (Breiman, 2001).
Quantile Regression Forests is implemented through the quantregForest package.
Conditional Forests is implemented with the cforest() function in the party package. As stated in the party package documentation, ensembles of conditional inference trees have not yet been extensively tested, so this routine is meant for the expert user only and its current state is rather experimental.
For CF models, ModelMap currently only supports binary, categorical and continuous response models. Also, for some CF model parameters (subset, weights, and scores), ModelMap only provides OOB and independent test set diagnostics, and does not support cross validation diagnostics.
Stochastic gradient boosting is not currently supported by ModelMap.
The function will return the model object. Additionally, it will write a text file to disk in the folder specified by folder. This file lists the value of each argument used for the function call, as chosen from the GUI prompts.
Elizabeth Freeman and Tracey Frescino
Breiman, L. (2001) Random Forests. Machine Learning, 45:5-32.
Elith, J., Leathwick, J. R. and Hastie, T. (2008). A working guide to boosted regression trees. Journal of Animal Ecology. 77:802-813.
Liaw, A. and Wiener, M. (2002). Classification and Regression by randomForest. R News 2(3), 18–22.
N. Meinshausen (2006) "Quantile Regression Forests", Journal of Machine Learning Research 7, 983-999 http://jmlr.csail.mit.edu/papers/v7/
Ridgeway, G., (1999). The state of boosting. Comp. Sci. Stat. 31:172-181
Carolin Strobl, Anne-Laure Boulesteix, Achim Zeileis and Torsten Hothorn (2007). Bias in Random Forest Variable Importance Measures: Illustrations, Sources and a Solution. BMC Bioinformatics, 8, 25. http://www.biomedcentral.com/1471-2105/8/25
Carolin Strobl, James Malley and Gerhard Tutz (2009). An Introduction to Recursive Partitioning: Rationale, Application, and Characteristics of Classification and Regression Trees, Bagging, and Random Forests. Psychological Methods, 14(4), 323-348.
Torsten Hothorn, Berthold Lausen, Axel Benner and Martin Radespiel-Troeger (2004). Bagging Survival Trees. Statistics in Medicine, 23(1), 77-91.
Torsten Hothorn, Peter Buhlmann, Sandrine Dudoit, Annette Molinaro and Mark J. van der Laan (2006a). Survival Ensembles. Biostatistics, 7(3), 355-373.
Torsten Hothorn, Kurt Hornik and Achim Zeileis (2006b). Unbiased Recursive Partitioning: A Conditional Inference Framework. Journal of Computational and Graphical Statistics, 15(3), 651-674. Preprint available from http://statmath.wu-wien.ac.at/~zeileis/papers/Hothorn+Hornik+Zeileis-2006.pdf
get.test, model.diagnostics, model.mapmake
## Not run:
###########################################################################
############################# Run this set up code: #######################
###########################################################################

# set seed:
seed=38

# Define training and test files:
qdata.trainfn = system.file("extdata", "helpexamples","DATATRAIN.csv", package = "ModelMap")

# Define folder for all output:
folder=getwd()

#identifier for individual training and test data points
unique.rowname="ID"

###########################################################################
############## Pick one of the following sets of definitions: #############
###########################################################################

########## Continuous Response, Continuous Predictors ############

#file name:
MODELfn="RF_Bio_TC"

#predictors:
predList=c("TCB","TCG","TCW")

#define which predictors are categorical:
predFactor=FALSE

# Response name and type:
response.name="BIO"
response.type="continuous"

########## binary Response, Continuous Predictors ############

#file name to store model:
MODELfn="RF_CONIFTYP_TC"

#predictors:
predList=c("TCB","TCG","TCW")

#define which predictors are categorical:
predFactor=FALSE

# Response name and type:
response.name="CONIFTYP"
# This variable is 1 if a conifer or mixed conifer type is present,
# otherwise 0.
response.type="binary"

########## Continuous Response, Categorical Predictors ############

# In this example, NLCD is a categorical predictor.
#
# You must decide what you want to happen if there are categories
# present in the data to be predicted (either the validation/test set
# or in the image file) that were not present in the original training data.
# Choices:
#       na.action = "na.omit"
#                    Any validation datapoint or image pixel with a value for any
#                    categorical predictor not found in the training data will be
#                    returned as NA.
#       na.action = "na.roughfix"
#                    Any validation datapoint or image pixel with a value for any
#                    categorical predictor not found in the training data will have
#                    the most common category for that predictor substituted,
#                    and then a prediction will be made.

# You must also let R know which of the predictors are categorical, in other
# words, which ones R needs to treat as factors.
# This vector must be a subset of the predictors given in predList

#file name to store model:
MODELfn="RF_BIO_TCandNLCD"

#predictors:
predList=c("TCB","TCG","TCW","NLCD")

#define which predictors are categorical:
predFactor=c("NLCD")

# Response name and type:
response.name="BIO"
response.type="continuous"

###########################################################################
########################### build model: ##################################
###########################################################################

### create model before batching (only run this code once ever!) ###

model.obj = model.build( model.type="RF",
                         qdata.trainfn=qdata.trainfn,
                         folder=folder,
                         unique.rowname=unique.rowname,
                         MODELfn=MODELfn,
                         predList=predList,
                         predFactor=predFactor,
                         response.name=response.name,
                         response.type=response.type,
                         seed=seed,
                         na.action="na.roughfix"
)
## End(Not run)
Takes model object and makes predictions, runs model diagnostics, and creates graphs and tables of the results.
model.diagnostics(model.obj = NULL, qdata.trainfn = NULL, qdata.testfn = NULL, folder = NULL, MODELfn = NULL, response.name = NULL, unique.rowname = NULL, diagnostic.flag=NULL, seed = NULL, prediction.type=NULL, MODELpredfn = NULL, na.action = NULL, v.fold = 10, device.type = NULL, DIAGNOSTICfn = NULL, res=NULL, jpeg.res = 72, device.width = 7, device.height = 7, units="in", pointsize=12, cex=par()$cex, req.sens, req.spec, FPC, FNC, quantiles=NULL, all=TRUE, subset = NULL, weights = NULL, mtry = NULL, controls = NULL, xtrafo = NULL, ytrafo = NULL, scores = NULL)
model.obj |
|
qdata.trainfn |
String. The name (full path or base name with path specified by |
qdata.testfn |
String. The name (full path or base name with path specified by |
folder |
String. The folder used for all output from predictions and/or maps. Do not add ending slash to path string. If |
MODELfn |
String. The file name to use to save the generated model object. If |
response.name |
String. The name of the response variable used to build the model. The |
unique.rowname |
String. The name of the unique identifier used to identify each row in the training data. If |
diagnostic.flag |
String. The name of a column used to identify a subset of rows in the training data or test data to use for model diagnostics. This column must be either a logical vector ( |
seed |
Integer. The number used to initialize randomization to build RF or SGB models. If you want to produce the same model later, use the same seed. If |
prediction.type |
String. Prediction type. |
MODELpredfn |
String. Model validation. A character string used to construct the output file names for the validation diagnostics, for example the prediction |
na.action |
String. Model validation. Specifies the action to take if there are |
v.fold |
Integer (or logical |
device.type |
String or vector of strings. Model validation. One or more device types for graphical output from model validation diagnostics. Current choices: |
DIAGNOSTICfn |
String. Model validation. Name used as base to create names for output files from model validation diagnostics. The filename can be the full path, or it can be the simple basename, in which case the output will be to the folder specified by |
res |
Integer. Model validation. Pixels per inch for jpeg, png, and tiff plots. The default is 72dpi, good for on screen viewing. For printing, suggested setting is 300dpi. |
jpeg.res |
Integer. Model validation. Deprecated. Ignored unless |
device.width |
Integer. Model validation. The device width for diagnostic plots in inches. |
device.height |
Integer. Model validation. The device height for diagnostic plots in inches. |
units |
Model validation. The units in which |
pointsize |
Integer. Model validation. The default pointsize of plotted text, interpreted as big points (1/72 inch) at |
cex |
Integer. Model validation. The cex for diagnostic plots. |
req.sens |
Numeric. Model validation. The required sensitivity for threshold optimization for binary response model evaluation. |
req.spec |
Numeric. Model validation. The required specificity for threshold optimization for binary response model evaluation. |
FPC |
Numeric. Model validation. The False Positive Cost for threshold optimization for binary response model evaluation. |
FNC |
Numeric. Model validation. The False Negative Cost for threshold optimization for binary response model evaluation. |
quantiles |
Numeric Vector. QRF models. The quantiles to predict. A numeric vector with values between zero and one. If model was built without specifying quantiles, quantile importance can not be calculated, but |
all |
Logical. QRF models. |
subset |
CF models. NOT SUPPORTED. Only needed for |
weights |
CF models. NOT SUPPORTED. Only needed for |
|||||||||||||||||||||||||||||||||||
mtry |
Integer. Only needed for |
|||||||||||||||||||||||||||||||||||
controls |
CF models. Only needed for |
|||||||||||||||||||||||||||||||||||
xtrafo |
CF models. Only needed for |
|||||||||||||||||||||||||||||||||||
ytrafo |
CF models. Only needed for |
|||||||||||||||||||||||||||||||||||
scores |
CF models. NOT SUPPORTED. Only needed for |
model.diagnostics() takes a model object, makes predictions, runs model diagnostics, and creates graphs and tables of the results.
model.diagnostics() can be run in a traditional R command mode, where all arguments are specified in the function call. It can also be used in a full push-button mode, where you type the simple command model.diagnostics() and GUI pop-up windows ask questions about the type of model, the file locations of the data, and so on.
When running model.diagnostics() on non-Windows platforms, file names and folders need to be specified in the argument list, but other push-button selections are handled by the select.list() function, which is platform independent.
Diagnostic predictions are made by one of four methods, and a text file is generated consisting of three columns: observation ID, observed values, and predicted values. If prediction.type = "CV", an additional column indicates which cross-validation fold each observation fell into. If the model's response type is categorical, then in addition to a column giving the category predicted by majority vote, there is a column for each possible response category giving the proportion of trees that predicted that category.
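The layout of the cross-validation output described above can be sketched in base R. This is only an illustration under stated assumptions: the simulated data, the lm() stand-in model, and the column names (ID, obs, pred, VFold) are hypothetical and are not taken from the ModelMap source.

```r
# Hypothetical sketch: v-fold cross-validation predictions in base R.
# lm() stands in for the RF/SGB model; all names here are illustrative.
set.seed(38)
n <- 50
v.fold <- 10
dat <- data.frame(ID = seq_len(n), x = rnorm(n))
dat$obs <- 2 * dat$x + rnorm(n)

# Randomly assign each row to one of v.fold folds of equal size.
fold <- sample(rep(seq_len(v.fold), length.out = n))

pred <- rep(NA_real_, n)
for (k in seq_len(v.fold)) {
  held.out <- fold == k
  fit <- lm(obs ~ x, data = dat[!held.out, ])            # fit on the other folds
  pred[held.out] <- predict(fit, newdata = dat[held.out, ])
}

# One row per observation: ID, observed, predicted, and the fold it fell into.
cv.table <- data.frame(ID = dat$ID, obs = dat$obs, pred = pred, VFold = fold)
head(cv.table)
```

Every observation is predicted exactly once, by a model that never saw it during fitting, which is why the fold column is worth keeping alongside the predictions.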
A variable importance graph is made. If response.type = "categorical", category-specific graphs are generated for variable importance. These show how much the model accuracy for each category is affected when the values of each predictor variable are randomly permuted.
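The idea of permutation-based importance can be illustrated with a short base-R sketch. It is a simplification under stated assumptions: lm() stands in for the forest, the error is measured on the training data rather than on out-of-bag samples, and the data and variable names are simulated for illustration only.

```r
# Hypothetical sketch: permutation importance with a stand-in lm() model.
set.seed(38)
n <- 200
dat <- data.frame(x1 = rnorm(n), x2 = rnorm(n), x3 = rnorm(n))
dat$y <- 3 * dat$x1 + 0.5 * dat$x2 + rnorm(n)   # x3 is pure noise

fit <- lm(y ~ x1 + x2 + x3, data = dat)
mse <- function(d) mean((d$y - predict(fit, newdata = d))^2)
baseline <- mse(dat)

# Importance = increase in error after randomly permuting one predictor.
perm.importance <- sapply(c("x1", "x2", "x3"), function(p) {
  shuffled <- dat
  shuffled[[p]] <- sample(shuffled[[p]])
  mse(shuffled) - baseline
})
round(perm.importance, 3)
```

A predictor the model relies on heavily (x1 here) shows a large increase in error when permuted, while a noise predictor (x3) shows almost none.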
The package corrplot is used to generate a plot of correlation between predictor variables. If there are highly correlated predictor variables, then the variable importances of "RF" and "QRF" models need to be interpreted with care, and users may want to consider looking at the conditional variable importances available for "CF" models produced by the party package.
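Before interpreting importances, it can help to check which predictors are strongly correlated. The base-R sketch below uses simulated predictors and an arbitrary 0.8 cutoff, both of which are assumptions for illustration; it computes the correlation matrix only and does not reproduce the corrplot graphic.

```r
# Hypothetical sketch: flag highly correlated predictor pairs.
set.seed(38)
TCB <- rnorm(100)
TCG <- 0.9 * TCB + rnorm(100, sd = 0.3)   # built to correlate with TCB
TCW <- rnorm(100)                         # independent of the others

cor.mat <- cor(cbind(TCB, TCG, TCW))

# Pairs above an (arbitrary) 0.8 cutoff, upper triangle only.
high <- which(abs(cor.mat) > 0.8 & upper.tri(cor.mat), arr.ind = TRUE)
data.frame(var.1 = rownames(cor.mat)[high[, "row"]],
           var.2 = colnames(cor.mat)[high[, "col"]],
           r     = round(cor.mat[high], 2))
```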
If model.type = "RF", the OOB error is plotted as a function of the number of trees in the model. If response.type = "binary" or response.type = "categorical", category-specific graphs are generated for OOB error as a function of the number of trees.
If response.type = "binary", a summary graph is made using the PresenceAbsence package, and *.csv spreadsheets are created of optimized thresholds by several methods with their associated error statistics and predicted prevalence.
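To make threshold optimization concrete, the base-R sketch below computes sensitivity and specificity over a grid of thresholds and picks the threshold where the two are closest. This is one simple criterion chosen for illustration; the simulated data and variable names are assumptions, and the code does not call or reproduce the PresenceAbsence package.

```r
# Hypothetical sketch: sensitivity and specificity across thresholds.
set.seed(38)
n <- 300
observed <- rbinom(n, 1, 0.4)
# Simulated predicted probabilities, higher on average for presences.
predicted <- plogis(rnorm(n, mean = ifelse(observed == 1, 1, -1)))

thresholds <- seq(0.01, 0.99, by = 0.01)
acc <- t(sapply(thresholds, function(th) {
  called <- as.integer(predicted >= th)   # 1 = predicted present
  c(threshold   = th,
    sensitivity = sum(called == 1 & observed == 1) / sum(observed == 1),
    specificity = sum(called == 0 & observed == 0) / sum(observed == 0))
}))

# One simple criterion: threshold where sensitivity and specificity are closest.
best <- acc[which.min(abs(acc[, "sensitivity"] - acc[, "specificity"])), ]
best
```

Raising the threshold trades sensitivity for specificity, which is why criteria such as a required sensitivity (req.sens) or relative costs (FPC, FNC) can lead to different optimal thresholds.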
If response.type = "continuous", a scatterplot of observed vs. predicted is created with a simple linear regression line. The graph is labeled with slope and intercept of this line as well as Pearson's and Spearman's correlation coefficients.
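The statistics used to label the scatterplot can be computed directly in base R. In this sketch the data are simulated, and regressing predicted on observed (rather than the reverse) is an assumption made for illustration, not something confirmed by the package documentation.

```r
# Hypothetical sketch: summary statistics for observed vs. predicted values.
set.seed(38)
observed  <- runif(100, 0, 50)
predicted <- 5 + 0.8 * observed + rnorm(100, sd = 4)

fit <- lm(predicted ~ observed)            # simple linear regression line
intercept <- unname(coef(fit)[1])
slope     <- unname(coef(fit)[2])
pearson   <- cor(observed, predicted, method = "pearson")
spearman  <- cor(observed, predicted, method = "spearman")

cat(sprintf("slope = %.2f, intercept = %.2f\n", slope, intercept))
cat(sprintf("Pearson = %.2f, Spearman = %.2f\n", pearson, spearman))
```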
If response.type = "categorical", a confusion matrix is generated that includes errors of omission and commission, as well as Kappa, Percent Correctly Classified (PCC), and the Multicategorical Area Under the Curve (MAUC) as defined by Hand and Till (2001) and calculated by the package HandTill2001.
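The confusion matrix statistics can be worked through by hand on a small example. The base-R sketch below uses a made-up set of ten observations and hypothetical variable names; MAUC is omitted because it requires per-class probabilities.

```r
# Hypothetical sketch: confusion matrix statistics in base R.
lev <- c("a", "b", "c")
observed  <- factor(c("a", "a", "a", "a", "b", "b", "b", "c", "c", "c"), levels = lev)
predicted <- factor(c("a", "a", "a", "b", "b", "b", "c", "c", "c", "a"), levels = lev)

# Rows are predicted classes, columns are observed classes.
cm <- table(predicted = predicted, observed = observed)

n   <- sum(cm)
pcc <- sum(diag(cm)) / n                     # Percent Correctly Classified

# Omission error: observed members of a class that were predicted as another class.
omission   <- 1 - diag(cm) / colSums(cm)
# Commission error: predictions of a class that actually belong to another class.
commission <- 1 - diag(cm) / rowSums(cm)

# Cohen's Kappa: agreement beyond what is expected by chance.
expected  <- sum(rowSums(cm) * colSums(cm)) / n^2
kappa.hat <- (pcc - expected) / (1 - expected)

cm
round(c(PCC = pcc, Kappa = kappa.hat), 3)
```

Here 7 of 10 observations fall on the diagonal, so PCC is 0.7, while chance agreement of 0.34 brings Kappa down to about 0.545.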
The function returns a data frame of the row ID and the observed and predicted values.
For Binary response models the predicted probability of presence is returned.
For Categorical Response models the predicted category (by majority vote) is returned as well as a column for each category giving the probability of that category. If necessary, make.names is applied to the categories to create valid column names.
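The effect of make.names on category labels is easy to check with base R. The category labels below are made up for illustration.

```r
# Category labels that are not syntactically valid R column names.
categories <- c("1", "mixed conifer", "oak-pine")

# make.names() prepends "X" to names starting with a digit and replaces
# invalid characters (spaces, hyphens) with ".".
col.names <- make.names(categories)
col.names
# [1] "X1"            "mixed.conifer" "oak.pine"
```

When joining the returned data frame back to the original categories, match on the converted names rather than the raw labels.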
For Continuous response models the predicted value is returned.
If prediction.type = "CV", the data frame also includes a column indicating which cross-validation fold each data point was in.
Importance currently unavailable for QRF models.
If you are running cross validation diagnostics on a CF model, the model parameters will NOT automatically be passed to model.diagnostics(). For cross validation, it is the user's responsibility to be certain that the CF arguments are the same in model.build() and model.diagnostics().
Also, for some CF model parameters (subset, weights, and scores) ModelMap only provides OOB and independent test set diagnostics, and does not support cross validation diagnostics.
Elizabeth Freeman and Tracey Frescino
Breiman, L. (2001) Random Forests. Machine Learning, 45:5-32.
Elith, J., Leathwick, J. R. and Hastie, T. (2008). A working guide to boosted regression trees. Journal of Animal Ecology. 77:802-813.
Hand, D. J., & Till, R. J. (2001). A simple generalisation of the area under the ROC curve for multiple class classification problems. Machine Learning, 45(2), 171-186.
Liaw, A. and Wiener, M. (2002). Classification and Regression by randomForest. R News 2(3), 18–22.
Ridgeway, G. (1999). The state of boosting. Comp. Sci. Stat. 31:172-181.
get.test, model.build, model.mapmake
## Not run:
###########################################################################
############################# Run this set up code: #######################
###########################################################################

# set seed:
seed=38

# Define training and test files:
qdata.trainfn = system.file("extdata", "helpexamples","DATATRAIN.csv", package = "ModelMap")
qdata.testfn = system.file("extdata", "helpexamples","DATATEST.csv", package = "ModelMap")

# Define folder for all output:
folder=getwd()

#identifier for individual training and test data points
unique.rowname="ID"

###########################################################################
############## Pick one of the following sets of definitions: #############
###########################################################################

########## Continuous Response, Continuous Predictors ############

#file name to store model:
MODELfn="RF_Bio_TC"

#predictors:
predList=c("TCB","TCG","TCW")

#define which predictors are categorical:
predFactor=FALSE

# Response name and type:
response.name="BIO"
response.type="continuous"

########## binary Response, Continuous Predictors ############

#file name to store model:
MODELfn="RF_CONIFTYP_TC"

#predictors:
predList=c("TCB","TCG","TCW")

#define which predictors are categorical:
predFactor=FALSE

# Response name and type:
response.name="CONIFTYP"
# This variable is 1 if a conifer or mixed conifer type is present,
# otherwise 0.
response.type="binary"

########## Continuous Response, Categorical Predictors ############

# In this example, NLCD is a categorical predictor.
#
# You must decide what you want to happen if there are categories
# present in the data to be predicted (either the validation/test set
# or in the image file) that were not present in the original training data.
# Choices:
#       na.action = "na.omit"
#                    Any validation datapoint or image pixel with a value for any
#                    categorical predictor not found in the training data will be
#                    returned as NA.
#       na.action = "na.roughfix"
#                    Any validation datapoint or image pixel with a value for any
#                    categorical predictor not found in the training data will have
#                    the most common category for that predictor substituted,
#                    and then a prediction will be made.

# You must also let R know which of the predictors are categorical, in other
# words, which ones R needs to treat as factors.
# This vector must be a subset of the predictors given in predList

#file name to store model:
MODELfn="RF_BIO_TCandNLCD"

#predictors:
predList=c("TCB","TCG","TCW","NLCD")

#define which predictors are categorical:
predFactor=c("NLCD")

# Response name and type:
response.name="BIO"
response.type="continuous"

###########################################################################
########################### build model: ##################################
###########################################################################

### create model ###

model.obj = model.build( model.type="RF",
                         qdata.trainfn=qdata.trainfn,
                         folder=folder,
                         unique.rowname=unique.rowname,
                         MODELfn=MODELfn,
                         predList=predList,
                         predFactor=predFactor,
                         response.name=response.name,
                         response.type=response.type,
                         seed=seed,
                         na.action="na.roughfix"
)

###########################################################################
#### Then run this code to make validation predictions and diagnostics: ###
###########################################################################

### for Out-of-Bag predictions ###

MODELpredfn<-paste(MODELfn,"_OOB",sep="")

PRED.OOB<-model.diagnostics( model.obj=model.obj,
                             qdata.trainfn=qdata.trainfn,
                             folder=folder,
                             unique.rowname=unique.rowname,
                             # Model Validation Arguments
                             prediction.type="OOB",
                             MODELpredfn=MODELpredfn,
                             device.type=c("default","jpeg","pdf"),
                             na.action="na.roughfix"
)
PRED.OOB

### for Cross-Validation predictions ###

#MODELpredfn<-paste(MODELfn,"_CV",sep="")

#PRED.CV<-model.diagnostics( model.obj=model.obj,
#                            qdata.trainfn=qdata.trainfn,
#                            folder=folder,
#                            unique.rowname=unique.rowname,
#                            seed=seed,
#                            # Model Validation Arguments
#                            prediction.type="CV",
#                            MODELpredfn=MODELpredfn,
#                            device.type=c("default","jpeg","pdf"),
#                            v.fold=10,
#                            na.action="na.roughfix"
#)
#PRED.CV

### for Independent Test Set predictions ###

#MODELpredfn<-paste(MODELfn,"_TEST",sep="")

#PRED.TEST<-model.diagnostics( model.obj=model.obj,
#                              qdata.testfn=qdata.testfn,
#                              folder=folder,
#                              unique.rowname=unique.rowname,
#                              # Model Validation Arguments
#                              prediction.type="TEST",
#                              MODELpredfn=MODELpredfn,
#                              device.type=c("default","jpeg","pdf"),
#                              na.action="na.roughfix"
#)
#PRED.TEST

## End(Not run)
Graphically explores the relationships between the training data and the predictor rasters.
model.explore(qdata.trainfn = NULL, folder = NULL,
      predList = NULL, predFactor = FALSE,
      response.name = NULL, response.type = NULL, response.colors = NULL,
      unique.rowname = NULL, OUTPUTfn = NULL,
      device.type = NULL, allow.default.graphics = FALSE,
      res = NULL, jpeg.res = 72, MAXCELL = 100000,
      device.width = NULL, device.height = NULL,
      units = "in", pointsize = 12, cex = 1,
      rastLUTfn = NULL, create.extrapolation.masks = FALSE, na.value = -9999,
      col.ramp = rainbow(101, start = 0, end = 0.5),
      col.cat = palette()[-1])
qdata.trainfn |
String. The name (full path or base name with path specified by |
|||||||||||||||||||||||||||||||||||
folder |
String. The folder used for all output from predictions and/or maps. Do not add ending slash to path string. If |
|||||||||||||||||||||||||||||||||||
predList |
String. A character vector of the predictor short names used to build the model. These names must match the column names in the training/test data files and the names in column two of the |
|||||||||||||||||||||||||||||||||||
predFactor |
String. A character vector of predictor short names of the predictors from |
|||||||||||||||||||||||||||||||||||
response.name |
String. The name of the response variable used to build the model. If |
|||||||||||||||||||||||||||||||||||
response.type |
String. Response type: |
|||||||||||||||||||||||||||||||||||
response.colors |
Data frame. A two column data frame. Column names must be: |
|||||||||||||||||||||||||||||||||||
unique.rowname |
String. The name of the unique identifier used to identify each row in the training data. If |
|||||||||||||||||||||||||||||||||||
OUTPUTfn |
String. Filename that output file names will be based on. |
|||||||||||||||||||||||||||||||||||
device.type |
String or vector of strings. Model validation. One or more device types for graphical output from model validation diagnostics. Current choices:
Note that the |
|||||||||||||||||||||||||||||||||||
allow.default.graphics |
Logical. Should the default on-screen graphics device be allowed. USE WITH CAUTION! These graphics are complicated and slow to produce. If the on-screen default graphics device is moved or closed before the plot is completed it can crash the entire R session. |
|||||||||||||||||||||||||||||||||||
res |
Integer. Model validation. Pixels per inch for jpeg, png, and tiff plots. The default is 72dpi, good for on screen viewing. For printing, suggested setting is 300dpi. |
|||||||||||||||||||||||||||||||||||
jpeg.res |
Integer. Graphical output. Deprecated. Ignored unless |
|||||||||||||||||||||||||||||||||||
MAXCELL |
Integer. Graphical output. The maximum number of raster cells used to create the graphical output. Rasters larger than this value will be subsampled for the graphical maps and figures. The default value of Note: |
|||||||||||||||||||||||||||||||||||
device.width |
Integer. Model validation. The device width for diagnostic plots in inches. |
|||||||||||||||||||||||||||||||||||
device.height |
Integer. Model validation. The device height for diagnostic plots in inches. |
|||||||||||||||||||||||||||||||||||
units |
Model validation. The units in which |
|||||||||||||||||||||||||||||||||||
pointsize |
Integer. Model validation. The default pointsize of plotted text, interpreted as big points (1/72 inch) at |
|||||||||||||||||||||||||||||||||||
cex |
Integer. Model validation. The cex for diagnostic plots. |
|||||||||||||||||||||||||||||||||||
rastLUTfn |
String. The file name (full path or base name with path specified by Example of comma-delimited file:
|
|||||||||||||||||||||||||||||||||||
create.extrapolation.masks |
Logical. If |
|||||||||||||||||||||||||||||||||||
na.value |
Value used in rasters to indicate |
|||||||||||||||||||||||||||||||||||
col.ramp |
Color ramp to use for continuous predictors |
|||||||||||||||||||||||||||||||||||
col.cat |
Vector. Vector of colors to use for categorical predictors. |
The model.explore function is intended to aid with preliminary data exploration before model building. It includes graphical tools to explore the relationships between the training data (both predictors and responses) as well as the predictor rasters. It uses the corrplot package to create a correlation plot of the continuous predictors. This can aid in interpreting the model.importance.plot output from the models, as Random Forest models divide importance between correlated predictors, while Stochastic Gradient Boosting models assign the majority of the importance to the correlated predictor that is used earliest in the model.
The model.explore function also can aid in identifying if additional training data is needed. For example, the maps of the extrapolation masks for the predictor rasters help spot areas of the map where the pixels lie outside the range of the training data, and therefore any model predictions will be extrapolations, and possibly unreliable. The user can decide to either collect additional training data, or mask out these areas of the final prediction output of model.mapmake.
To increase speed, the default behavior for large predictor rasters is to create the graphics from subsampled rasters. (Note: for categorical predictors, the full raster is always used to identify all categories found in the map area.) If create.extrapolation.masks=TRUE, then the full rasters are used for the extrapolation masks, regardless of the size of the rasters. This option runs much slower, as large rasters need to be read into R a block at a time.
Function does not return a value, but does create files.
Graphical files are created for each predictor variable, with file type determined by device.type. In addition, if create.extrapolation.masks = TRUE, an extrapolation mask raster is created for each predictor as well as an overall extrapolation mask, with the value 1 for pixels with predictor values within the range of the training data, or categories found in the training data, and the value 0 for pixels outside the range of the training data, categories not found in the training data, or NA values. The overall extrapolation mask has 0 if any of the predictors for that pixel are extrapolated. Note that this option is much slower to run.
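The mask logic can be sketched on small in-memory matrices standing in for rasters. This is an illustration under stated assumptions: the toy values, the use of plain matrices, and all object names are hypothetical, and the real function works on raster files read a block at a time.

```r
# Hypothetical sketch: extrapolation masks from toy "rasters" stored as matrices.
train <- data.frame(TCB = c(10, 20, 30), NLCD = c(41, 42, 42))
na.value <- -9999

TCB.rast  <- matrix(c(12, 25, 35, na.value), nrow = 2)   # continuous predictor
NLCD.rast <- matrix(c(41, 52, 42, 42), nrow = 2)         # categorical predictor

# Continuous predictor: 1 inside the training range, 0 outside it or NA.
mask.TCB <- 1 * (TCB.rast != na.value &
                 TCB.rast >= min(train$TCB) &
                 TCB.rast <= max(train$TCB))

# Categorical predictor: 1 if the category occurs in the training data.
# `[] <-` keeps the matrix shape, since %in% returns a plain vector.
mask.NLCD <- NLCD.rast
mask.NLCD[] <- as.numeric(NLCD.rast %in% unique(train$NLCD))

# Overall mask: 0 if any predictor is extrapolated at that pixel.
mask.overall <- mask.TCB * mask.NLCD
mask.overall
```

Only the top-left pixel is 1 in the overall mask: the others fail on either the TCB range, the NA value, or an NLCD category (52) that never appears in the training data.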
The default graphics device is disabled unless allow.default.graphics is set to TRUE. These graphics can be slow to produce, and if the on-screen graphics device is moved or closed while the graphic is in progress, it can crash R. It is recommended that graphics be written to a file by setting device.type to jpeg, pdf, etc.
Elizabeth Freeman
## Not run:
###########################################################################
############################# Run this set up code: #######################
###########################################################################

###Define training and test files:
qdata.trainfn = system.file("extdata", "helpexamples","DATATRAIN.csv", package = "ModelMap")

###Define folder for all output:
folder=getwd()

###identifier for individual training and test data points
unique.rowname="ID"

###predictors:
predList=c("TCB","TCG","TCW","NLCD")

###define which predictors are categorical:
predFactor=c("NLCD")

###Create the filename (including path) for the rast Look up Tables ###
rastLUTfn.2001 <- system.file( "extdata",
                               "helpexamples",
                               "LUT_2001.csv",
                               package="ModelMap")

###Load rast LUT table, and add path to the predictor raster filenames in column 1 ###
rastLUT.2001 <- read.table(rastLUTfn.2001,header=FALSE,sep=",",stringsAsFactors=FALSE)

for(i in 1:nrow(rastLUT.2001)){
  rastLUT.2001[i,1] <- system.file("extdata",
                                   "helpexamples",
                                   rastLUT.2001[i,1],
                                   package="ModelMap")
}

#################Continuous Response###################

###Response name and type:
response.name="BIO"
response.type="continuous"

###file name to store model:
OUTPUTfn="BIO_TCandNLCD.img"

###run model.explore
model.explore( qdata.trainfn=qdata.trainfn,
               folder=folder,
               predList=predList,
               predFactor=predFactor,
               response.name=response.name,
               response.type=response.type,
               unique.rowname=unique.rowname,
               OUTPUTfn=OUTPUTfn,
               device.type="jpeg",
               jpeg.res=144,
               # Raster arguments
               rastLUTfn=rastLUT.2001,
               na.value=-9999,
               # colors for continuous predictors
               col.ramp=rainbow(101,start=0,end=.5),
               # colors for categorical predictors
               col.cat=c("wheat1","springgreen2","darkolivegreen4",
                         "darkolivegreen2","yellow","thistle2",
                         "brown2","brown4")
)

## End(Not run)
Takes two models and produces a back to back bar chart to compare the importance of the predictor variables. Models can be any combination of Random Forest or Stochastic Gradient Boosting, as long as both models have the same predictor variables.
model.importance.plot(model.obj.1 = NULL, model.obj.2 = NULL,
      model.name.1 = "Model 1", model.name.2 = "Model 2",
      imp.type.1 = NULL, imp.type.2 = NULL, type.label = TRUE,
      class.1 = NULL, class.2 = NULL,
      quantile.1 = NULL, quantile.2 = NULL,
      col.1 = "grey", col.2 = "black",
      scale.by = "sum", sort.by = "model.obj.1",
      cf.mincriterion.1 = 0, cf.conditional.1 = FALSE,
      cf.threshold.1 = 0.2, cf.nperm.1 = 1,
      cf.mincriterion.2 = 0, cf.conditional.2 = FALSE,
      cf.threshold.2 = 0.2, cf.nperm.2 = 1,
      predList = NULL, folder = NULL, PLOTfn = NULL,
      device.type = NULL, res = NULL, jpeg.res = 72,
      device.width = 7, device.height = 7,
      units = "in", pointsize = 12, cex = par()$cex, ...)
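Importances from two different model types are usually on different scales, so they need rescaling before they can be compared side by side. The base-R sketch below divides each importance vector by its sum, which is one plausible reading of the default scale.by = "sum", and sorts by the first model as in sort.by = "model.obj.1". The importance values and object names are made up, and the package's exact scaling is not confirmed here.

```r
# Hypothetical sketch: put two importance vectors on a common scale.
imp.1 <- c(TCB = 40, TCG = 25, TCW = 35)        # made-up importances, model 1
imp.2 <- c(TCB = 0.9, TCG = 0.3, TCW = 0.8)     # made-up importances, model 2

# Scale each model's importances to sum to 1.
scaled.1 <- imp.1 / sum(imp.1)
scaled.2 <- imp.2 / sum(imp.2)

# Sort predictors by their importance in model 1, and align model 2 by name.
ord <- names(sort(scaled.1))
comparison <- data.frame(predictor = ord,
                         model.1   = unname(scaled.1[ord]),
                         model.2   = unname(scaled.2[ord]),
                         stringsAsFactors = FALSE)
comparison
```

Aligning by predictor name rather than by position is what makes the comparison safe when the two models store their predictors in different orders.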
model.obj.1 |
|
|||||||||||||||||||||||||||||||||||
model.obj.2 |
|
|||||||||||||||||||||||||||||||||||
model.name.1 |
String. Label for left side of barchart. |
|||||||||||||||||||||||||||||||||||
model.name.2 |
String. Label for right side of barchart. |
|||||||||||||||||||||||||||||||||||
imp.type.1 |
Number. Type of importance to use for model 1. Importance type 1 is permutation based, as described in Breiman (2001). Importance type 2 is model based. For RF models, it is the decrease in node impurities attributable to each predictor variable. For SGB models, it is the reduction attributable to each variable in predicting the gradient on each iteration. Default for random forest models is |
|||||||||||||||||||||||||||||||||||
imp.type.2 |
Number. Type of importance to use for model 2. Importance type 1 is permutation based, as described in Breiman (2001). Importance type 2 is model based. For RF models, it is the decrease in node impurities attributable to each predictor variable. For SGB models, it is the reduction attributable to each variable in predicting the gradient on each iteration. Default for random forest models is |
|||||||||||||||||||||||||||||||||||
type.label |
Logical. Should axis labels include importance type for each side of plot. |
|||||||||||||||||||||||||||||||||||
class.1 |
String. For binary and categorical random forest models. If the name of a class is specified, the class-specific relative influence is used for the plot. If |
|||||||||||||||||||||||||||||||||||
class.2 |
String. For binary and categorical random forest models. If the name of a class is specified, the class-specific relative influence is used for the plot. If |
|||||||||||||||||||||||||||||||||||
quantile.1 |
Numeric. QRF models. Quantile to use for model 1. Must be one of the quantiles used in building the QRF model. |
|||||||||||||||||||||||||||||||||||
quantile.2 |
Numeric. QRF models. Quantile to use for model 2. Must be one of the quantiles used in building the QRF model. |
|||||||||||||||||||||||||||||||||||
col.1 |
String. For binary and categorical random forest models. Color to use for bars for model 1. Defaults to grey. |
|||||||||||||||||||||||||||||||||||
col.2 |
String. For binary and categorical random forest models. Color to use for bars for model 2. Defaults to black. |
|||||||||||||||||||||||||||||||||||
scale.by |
String. Scale by: |
|||||||||||||||||||||||||||||||||||
sort.by |
String. Sort by: |
|||||||||||||||||||||||||||||||||||
cf.mincriterion.1 |
Number. CF models. The value of the test statistic or 1 - p-value that must be exceeded in order to include a split in the computation of the importance. The default |
|||||||||||||||||||||||||||||||||||
cf.conditional.1 |
Logical. CF models. A logical determining whether unconditional or conditional computation of the importance is performed for |
|||||||||||||||||||||||||||||||||||
cf.threshold.1 |
Number. CF models. The value of the test statistic or 1 - p-value of the association between the variable of interest and a covariate that must be exceeded in order to include the covariate in the conditioning scheme for the variable of interest (only relevant if |
|||||||||||||||||||||||||||||||||||
cf.nperm.1 |
Number. CF models. The number of permutations performed. |
|||||||||||||||||||||||||||||||||||
cf.mincriterion.2 |
Number. CF models. The value of the test statistic or 1 - p-value that must be exceeded in order to include a split in the computation of the importance. The default |
|||||||||||||||||||||||||||||||||||
cf.conditional.2 |
Logical. CF models. A logical determining whether unconditional or conditional computation of the importance is performed for |
|||||||||||||||||||||||||||||||||||
cf.threshold.2 |
Number. CF models. The value of the test statistic or 1 - p-value of the association between the variable of interest and a covariate that must be exceeded in order to include the covariate in the conditioning scheme for the variable of interest (only relevant if |
|||||||||||||||||||||||||||||||||||
cf.nperm.2 |
Number. CF models. The number of permutations performed. |
|||||||||||||||||||||||||||||||||||
predList |
String. A character vector of the predictor short names used to build the models. If |
|||||||||||||||||||||||||||||||||||
folder |
String. The folder used for all output. Do not add ending slash to path string. If |
|||||||||||||||||||||||||||||||||||
PLOTfn |
String. The file name to use to save the generated graphical plots. If |
|||||||||||||||||||||||||||||||||||
device.type |
String or vector of strings. Model validation. One or more device types for graphical output from model validation diagnostics. Current choices:
|
|||||||||||||||||||||||||||||||||||
res |
Integer. Model validation. Pixels per inch for jpeg, png, and tiff plots. The default is 72dpi, good for on screen viewing. For printing, suggested setting is 300dpi. |
|||||||||||||||||||||||||||||||||||
jpeg.res |
Integer. Model validation. Deprecated. Ignored unless |
|||||||||||||||||||||||||||||||||||
device.width |
Integer. Model validation. The device width for diagnostic plots in inches. |
|||||||||||||||||||||||||||||||||||
device.height |
Integer. Model validation. The device height for diagnostic plots in inches. |
|||||||||||||||||||||||||||||||||||
units |
Model validation. The units in which |
|||||||||||||||||||||||||||||||||||
pointsize |
Integer. Model validation. The default pointsize of plotted text, interpreted as big points (1/72 inch) at |
|||||||||||||||||||||||||||||||||||
cex |
Integer. Model validation. The cex for diagnostic plots. |
|||||||||||||||||||||||||||||||||||
... |
Arguments to be passed to methods, such as graphical parameters (see |
The importance measures used in this plot depend on the model type (RF versus SGB) and the response type (continuous, categorical, or binary).
Importance type 1 is permutation based, as described in Breiman (2001). Importance is calculated by randomly permuting each predictor variable and computing the associated reduction in predictive performance, using Out Of Bag error for RF models and training error for SGB models. Note that for SGB models, permutation-based importance measures are still considered experimental. Importance type 2 is model based. For RF models, importance type 2 is calculated by the decrease in node impurities attributable to each predictor variable. For SGB models, importance type 2 is the reduction attributable to each variable in predicting the gradient on each iteration, as described in Friedman (2001).
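The permutation idea behind importance type 1 can be illustrated with a minimal base R sketch. This is not the ModelMap or randomForest implementation: it assumes simulated data, uses lm() as a stand-in for a forest, and uses a held-out set in place of Out Of Bag observations. All variable names are hypothetical.

```r
# Permutation importance sketch (base R only; lm() stands in for a forest).
set.seed(38)
n <- 500
dat <- data.frame(x1 = rnorm(n), x2 = rnorm(n), noise = rnorm(n))
dat$y <- 3 * dat$x1 + 1 * dat$x2 + rnorm(n)

# Hold out part of the data, mimicking the role of out-of-bag observations.
train <- dat[1:350, ]
test  <- dat[351:n, ]
fit <- lm(y ~ x1 + x2 + noise, data = train)

# Mean squared prediction error of a model on new data.
mse <- function(model, newdata) mean((newdata$y - predict(model, newdata))^2)
baseline <- mse(fit, test)

# For each predictor: shuffle its column, re-predict, record the MSE increase.
perm.importance <- sapply(c("x1", "x2", "noise"), function(p) {
  shuffled <- test
  shuffled[[p]] <- sample(shuffled[[p]])
  mse(fit, shuffled) - baseline
})
print(round(perm.importance, 3))
```

Predictors whose shuffling degrades the predictions most receive the largest values; a predictor unrelated to the response should score near zero.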
For RF models:
response type | type | Importance Measure |
---|---|---|
"continuous" | 1 | permutation: %IncMSE |
"binary" | 1 | permutation: Mean Decrease Accuracy |
"categorical" | 1 | permutation: Mean Decrease Accuracy |
"continuous" | 2 | node impurity: Residual sum of squares |
"binary" | 2 | node impurity: Mean Decrease Gini |
"categorical" | 2 | node impurity: Mean Decrease Gini |
For Random Forest models, if imp.type is not specified, the importance type defaults to imp.type of 1 (permutation importance). For SGB models, permutation importance is considered experimental, so importance defaults to imp.type of 2 (reduction of gradient of the loss function).
Also, for binary and categorical Random Forest models, class-specific importance plots can be generated by use of the class.1 and class.2 arguments. Note that class-specific importance is only available for Random Forest models with importance type 1.
For CF models:
response type | type | Importance Measure |
---|---|---|
"continuous" | 1 | permutation: Mean Decrease Accuracy |
"binary" | 1 | permutation: Mean Decrease Accuracy |
"categorical" | 1 | permutation: Mean Decrease Accuracy |
"continuous" | 2 | node impurity: Not Available |
"binary" | 2 | node impurity: Mean Decrease in AUC |
"categorical" | 2 | node impurity: Not Available |
For binary CF models, if imp.type = 2, the function uses AUC-based variable importances as described by Janitza et al. (2013). Here, the area under the curve instead of the accuracy is used to calculate the importance of each variable. This AUC-based variable importance measure is more robust towards class imbalance.
Also, for CF models, if cf.conditional = TRUE, the importance of each variable is computed by permuting within a grid defined by the covariates that are associated (with 1 - p-value greater than threshold) with the variable of interest. The resulting variable importance score is conditional in the sense of beta coefficients in regression models, but represents the effect of a variable in both main effects and interactions. See Strobl et al. (2008) for details. Conditional importance can be slow for large datasets.
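The difference between unconditional and conditional permutation can be sketched in base R. This is only an illustration of the permutation scheme, not the algorithm used for CF models: the data are simulated, the names are hypothetical, and the "grid" is assumed to be five quantile bins of a single associated covariate.

```r
set.seed(38)
n <- 1000
z <- rnorm(n)                 # covariate associated with x
x <- z + rnorm(n, sd = 0.3)   # variable of interest, strongly correlated with z

# Unconditional permutation: shuffle x across all observations.
x.uncond <- sample(x)

# Conditional permutation: shuffle x only within strata (a "grid") of z.
# Five quantile bins are an arbitrary choice for this sketch.
z.bin <- cut(z, breaks = quantile(z, probs = seq(0, 1, 0.2)),
             include.lowest = TRUE)
x.cond <- ave(x, z.bin, FUN = function(v) v[sample.int(length(v))])

# Compare how much of the x-z association survives each permutation.
cor.uncond <- cor(x.uncond, z)
cor.cond <- cor(x.cond, z)
print(c(original = cor(x, z), unconditional = cor.uncond,
        conditional = cor.cond))
```

Shuffling within strata keeps most of the association between the permuted variable and its covariate, so any loss in predictive performance reflects the variable's own contribution rather than the broken correlation structure.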
The function returns a two-element list: IMP1 is the variable importance for model.obj.1, and IMP2 is the variable importance for model.obj.2. This is mostly intended for CF models, where calculating the conditional importance can represent a considerable time investment. For other model types it would be just as easy to recalculate importances on the fly as needed.
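Importance measures from two models can be on very different scales, which is why the examples below pass scale.by="sum". The following sketch shows one plausible reading of that option, namely dividing each importance vector by its own sum. The importance values are invented for illustration, and the assumption that importances can be handled as named numeric vectors is not taken from the package documentation.

```r
# Hypothetical importance values for the same three predictors from two models.
imp1 <- c(TCB = 42.0, TCG = 21.0, TCW = 7.0)   # e.g. a permutation measure
imp2 <- c(TCB = 900, TCG = 1500, TCW = 600)    # e.g. a node impurity measure

# Scale each vector by its own sum so both are on a common 0-1 footing.
scale.by.sum <- function(imp) imp / sum(imp)
imp <- rbind(model.1 = scale.by.sum(imp1), model.2 = scale.by.sum(imp2))
print(round(imp, 2))
```

After scaling, each row sums to one, so the bars for the two models can be drawn against the same axis.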
Importance currently unavailable for QRF models.
Elizabeth Freeman
Breiman, L. (2001) Random Forests. Machine Learning, 45:5-32.
Alexander Hapfelmeier, Torsten Hothorn, Kurt Ulm, and Carolin Strobl (2012). A New Variable Importance Measure for Random Forests with Missing Data. Statistics and Computing, http://dx.doi.org/10.1007/s11222-012-9349-1
Torsten Hothorn, Kurt Hornik, and Achim Zeileis (2006b). Unbiased Recursive Partitioning: A Conditional Inference Framework. Journal of Computational and Graphical Statistics, 15 (3), 651-674. Preprint available from http://statmath.wu-wien.ac.at/~zeileis/papers/Hothorn+Hornik+Zeileis-2006.pdf
Silke Janitza, Carolin Strobl and Anne-Laure Boulesteix (2013). An AUC-based Permutation Variable Importance Measure for Random Forests. BMC Bioinformatics, 14, 119. http://www.biomedcentral.com/1471-2105/14/119
Carolin Strobl, Anne-Laure Boulesteix, Thomas Kneib, Thomas Augustin, and Achim Zeileis (2008). Conditional Variable Importance for Random Forests. BMC Bioinformatics, 9, 307. http://www.biomedcentral.com/1471-2105/9/307
## Not run:
###########################################################################
############################# Run this set up code: #######################
###########################################################################

# set seed:
seed=38

# Define training and test files:
qdata.trainfn = system.file("extdata", "helpexamples","DATATRAIN.csv", package = "ModelMap")

# Define folder for all output:
folder=getwd()

#identifier for individual training and test data points
unique.rowname="ID"

##################################################################
########## Continuous Response, Continuous Predictors ############
##################################################################

#file names:
MODELfn.RF="RF_Bio_TC"

#predictors:
predList=c("TCB","TCG","TCW")

#define which predictors are categorical:
predFactor=FALSE

# Response name and type:
response.name="BIO"
response.type="continuous"

########## Build Models #################################

model.obj.RF = model.build( model.type="RF",
                            qdata.trainfn=qdata.trainfn,
                            folder=folder,
                            unique.rowname=unique.rowname,
                            MODELfn=MODELfn.RF,
                            predList=predList,
                            predFactor=predFactor,
                            response.name=response.name,
                            response.type=response.type,
                            seed=seed
)

########## Make Importance Plot - RF Importance type 1 vs 2 #######

model.importance.plot( model.obj.1=model.obj.RF,
                       model.obj.2=model.obj.RF,
                       model.name.1="PercentIncMSE",
                       model.name.2="IncNodePurity",
                       imp.type.1=1,
                       imp.type.2=2,
                       scale.by="sum",
                       sort.by="predList",
                       predList=predList,
                       main="Imp type 1 vs Imp type 2",
                       device.type="default")

##################################################################
########## Categorical Response, Continuous Predictors ###########
##################################################################

#file name:
MODELfn="RF_NLCD_TC"

#predictors:
predList=c("TCB","TCG","TCW")

#define which predictors are categorical:
predFactor=FALSE

# Response name and type:
response.name="NLCD"
response.type="categorical"

########## Build Model #################################

model.obj.NLCD = model.build( model.type="RF",
                              qdata.trainfn=qdata.trainfn,
                              folder=folder,
                              unique.rowname=unique.rowname,
                              MODELfn=MODELfn,
                              predList=predList,
                              predFactor=predFactor,
                              response.name=response.name,
                              response.type=response.type,
                              seed=seed)

############## Make Importance Plot ###################

model.importance.plot( model.obj.1=model.obj.NLCD,
                       model.obj.2=model.obj.NLCD,
                       model.name.1="NLCD=41",
                       model.name.2="NLCD=42",
                       class.1="41",
                       class.2="42",
                       scale.by="sum",
                       sort.by="predList",
                       predList=predList,
                       main="Class 41 vs. Class 42",
                       device.type="default")

##################################################################
############## Conditional inference forest models ###############
##################################################################

#predictors:
predList=c("TCB","TCG","TCW","NLCD")

#define which predictors are categorical:
predFactor=c("NLCD")

#binary response
response.name="CONIFTYP"
response.type="binary"
MODELfn.CF="CF_CONIFTYP_TCandNLCD"

####################### Build Model ##############################

model.obj.CF = model.build( model.type="CF",
                            qdata.trainfn=qdata.trainfn,
                            folder=folder,
                            unique.rowname=unique.rowname,
                            MODELfn=MODELfn.CF,
                            predList=predList,
                            predFactor=predFactor,
                            response.name=response.name,
                            response.type=response.type,
                            seed=seed
)

################## Make Importance Plot ##########################

#Conditional vs. Unconditional importance#

model.importance.plot( model.obj.1=model.obj.CF,
                       model.obj.2=model.obj.CF,
                       model.name.1="conditional",
                       model.name.2="unconditional",
                       imp.type.1=1,
                       imp.type.2=1,
                       cf.conditional.1=TRUE,
                       cf.conditional.2=FALSE,
                       scale.by="sum",
                       sort.by="predList",
                       predList=predList,
                       main="Conditional versus Unconditional",
                       device.type="default"
)

## End(Not run) # end dontrun
Image or perspective plot of two-way model interactions. The ranges of two specified predictor variables are plotted on the X and Y axes, and fitted model values are plotted on the Z axis. The remaining predictor variables are fixed at their mean (for continuous predictors) or their most common value (for categorical predictors).
model.interaction.plot(model.obj = NULL, x = NULL, y = NULL, response.category=NULL, quantiles=NULL, all=FALSE, obs=1, qdata.trainfn = NULL, folder = NULL, MODELfn = NULL, PLOTfn = NULL, pred.means = NULL, xlab = NULL, ylab = NULL, x.range = NULL, y.range = NULL, z.range = NULL, ticktype = "detailed", theta = 55, phi = 40, smooth = "none", plot.type = NULL, device.type = NULL, res=NULL, jpeg.res = 72, device.width = 7, device.height = 7, units="in", pointsize=12, cex=par()$cex, col = NULL, xlim = NULL, ylim = NULL, zlim = NULL, ...)
model.obj |
|
||||||||||||||||||||||||||||||
x |
String or Integer. Name of predictor variable to be plotted on the x axis. Alternatively, can be a number indicating a variable name from |
||||||||||||||||||||||||||||||
y |
String or Integer. Name of predictor variable to be plotted on the y axis. Alternatively, can be a number indicating a variable name from |
||||||||||||||||||||||||||||||
response.category |
String. Used for categorical response models. Specify which category of response variable to use. This category's probabilities will be plotted on the z axis. |
||||||||||||||||||||||||||||||
quantiles |
Numeric. Used for QRF models. Specify which quantile of response variable to use. This quantile will be plotted on the z axis. Note: unlike other functions |
||||||||||||||||||||||||||||||
all |
Logical. Used for QRF models. A logical value. |
||||||||||||||||||||||||||||||
obs |
Numeric. Used for QRF models. An integer number. Determines the maximal number of observations per node to use for prediction. The input is ignored for all=TRUE. The default is obs=1. |
||||||||||||||||||||||||||||||
qdata.trainfn |
String. The name (full path or base name with path specified by |
||||||||||||||||||||||||||||||
folder |
String. The folder used for all output. Do not add ending slash to path string. If |
||||||||||||||||||||||||||||||
MODELfn |
String. The file name used to save the generated model object, only used if |
||||||||||||||||||||||||||||||
PLOTfn |
String. The file name to use to save the generated graphical plots. The filename can be the full path, or it can be the simple basename, in which case the output will be to the folder specified by |
||||||||||||||||||||||||||||||
pred.means |
Vector. Allows specification of values for other predictor variables. If NULL, other predictors are set to their mean value (for continuous predictors) or their most common value (for factored predictors). |
||||||||||||||||||||||||||||||
xlab |
String. Allows manual specification of the x label. |
||||||||||||||||||||||||||||||
ylab |
String. Allows manual specification of the y label. |
||||||||||||||||||||||||||||||
x.range |
Vector. Manual range specification for the x axis. Alternate argument name for |
||||||||||||||||||||||||||||||
y.range |
Vector. Manual range specification for the y axis. Alternate argument name for |
||||||||||||||||||||||||||||||
z.range |
Vector. Manual range specification for the z axis. Alternate argument name for |
||||||||||||||||||||||||||||||
ticktype |
Character: "simple" draws just an arrow parallel to the axis to indicate direction of increase; "detailed" (default) draws normal ticks as per 2D plots. If |
||||||||||||||||||||||||||||||
theta |
Numeric. Angles defining the viewing direction. |
||||||||||||||||||||||||||||||
phi |
Numeric. Angles defining the viewing direction. |
||||||||||||||||||||||||||||||
smooth |
String. Controls smoothing of the predicted surface. Options are |
||||||||||||||||||||||||||||||
plot.type |
Character. |
||||||||||||||||||||||||||||||
device.type |
String or vector of strings. Model validation. One or more device types for graphical output from model validation diagnostics. Current choices:
|
||||||||||||||||||||||||||||||
res |
Integer. Model validation. Pixels per inch for jpeg, png, and tiff plots. The default is 72dpi, good for on screen viewing. For printing, suggested setting is 300dpi. |
||||||||||||||||||||||||||||||
jpeg.res |
Integer. Model validation. Deprecated. Ignored unless |
||||||||||||||||||||||||||||||
device.width |
Integer. Model validation. The device width for diagnostic plots in inches. |
||||||||||||||||||||||||||||||
device.height |
Integer. Model validation. The device height for diagnostic plots in inches. |
||||||||||||||||||||||||||||||
units |
Model validation. The units in which |
||||||||||||||||||||||||||||||
pointsize |
Integer. Model validation. The default pointsize of plotted text, interpreted as big points (1/72 inch) at |
||||||||||||||||||||||||||||||
cex |
Integer. Model validation. The cex for diagnostic plots. |
||||||||||||||||||||||||||||||
col |
Vector. Color table to use for image plots (see help file on image for details). |
||||||||||||||||||||||||||||||
xlim |
Vector. X limits. Alternate argument name for |
||||||||||||||||||||||||||||||
ylim |
Vector. Y limits. Alternate argument name for |
||||||||||||||||||||||||||||||
zlim |
Vector. Z limits. Alternate argument name for |
||||||||||||||||||||||||||||||
... |
additional graphical parameters (see |
This function provides a diagnostic plot useful in visualizing two-way interactions between predictor variables. Two of the predictor variables from the model are used to produce a grid of possible combinations of predictor values over the range of both variables. The remaining predictor variables from the model are fixed at either their means (for continuous predictors) or their most common value (for categorical predictors). Model predictions are generated over this grid and plotted as the z axis.
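The grid construction described above can be sketched in base R. This is an illustration under stated assumptions, not the package's internal code: the data are simulated, lm() stands in for the fitted model, and the predictor names are borrowed from the help examples only as labels.

```r
set.seed(38)
n <- 300
train <- data.frame(TCB = runif(n, 0, 10), TCG = runif(n, 0, 5),
                    TCW = runif(n, -2, 2))
train$BIO <- with(train, TCB * TCG + 2 * TCW + rnorm(n))

# lm() with an interaction term stands in for the fitted model.
fit <- lm(BIO ~ TCB * TCG + TCW, data = train)

# Grid over the observed range of the two predictors of interest.
x.seq <- seq(min(train$TCB), max(train$TCB), length.out = 25)
y.seq <- seq(min(train$TCG), max(train$TCG), length.out = 25)
grid <- expand.grid(TCB = x.seq, TCG = y.seq)

# Remaining predictor fixed at its mean.
grid$TCW <- mean(train$TCW)

# Predict over the grid; expand.grid varies TCB fastest, so rows index TCB.
z <- matrix(predict(fit, newdata = grid),
            nrow = length(x.seq), ncol = length(y.seq))

# Perspective plot of the fitted surface.
persp(x.seq, y.seq, z, xlab = "TCB", ylab = "TCG", zlab = "BIO",
      theta = 55, phi = 40, ticktype = "detailed")
```

Replacing persp() with image(x.seq, y.seq, z) gives the image-plot view of the same surface. For a categorical predictor, the fixed value would be its most common level rather than a mean.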
This function works with both continuous and categorical predictors, though the perspective plot should be interpreted with care for categorical predictors. In particular, the smooth option is not appropriate if either of the two selected predictor variables is categorical.
For categorical response models, a particular value must be specified for the response using the response.category argument.
Elizabeth Freeman
This function is adapted from gbm.perspec version 2.9 April 2007, J Leathwick/J Elith. See appendix S3 from:
Elith, J., Leathwick, J. R. and Hastie, T. (2008). A working guide to boosted regression trees. Journal of Animal Ecology. 77:802-813.
## Not run:
###########################################################################
############################# Run this set up code: #######################
###########################################################################

# set seed:
seed=38

# Define training and test files:
qdata.trainfn = system.file("extdata", "helpexamples","DATATRAIN.csv", package = "ModelMap")
qdata.testfn = system.file("extdata", "helpexamples","DATATEST.csv", package = "ModelMap")

# Define folder for all output:
folder=getwd()

########## Continuous Response, Categorical Predictors ############

#file name to store model:
MODELfn="RF_BIO_TCandNLCD"

#predictors:
predList=c("TCB","TCG","TCW","NLCD")

#define which predictors are categorical:
predFactor=c("NLCD")

# Response name and type:
response.name="BIO"
response.type="continuous"

#identifier for individual training and test data points
unique.rowname="ID"

###########################################################################
########################### build model: ##################################
###########################################################################

### create model ###

model.obj = model.build( model.type="RF",
                         qdata.trainfn=qdata.trainfn,
                         folder=folder,
                         unique.rowname=unique.rowname,
                         MODELfn=MODELfn,
                         predList=predList,
                         predFactor=predFactor,
                         response.name=response.name,
                         response.type=response.type,
                         seed=seed,
                         na.action=na.roughfix
)

###########################################################################
###################### make interaction plots: ############################
###########################################################################

#########################
### Perspective Plots ###
#########################

### specify first and third predictors in 'predList' (both continuous) ###

model.interaction.plot( model.obj,
                        x=1,y=3,
                        main=response.name,
                        plot.type="persp",
                        device.type="default")

### specify predictors in 'predList' by name (one continuous one factored) ###

model.interaction.plot( model.obj,
                        x="TCB", y="NLCD",
                        main=response.name,
                        plot.type="persp",
                        device.type="default")

###################
### Image Plots ###
###################

### same as previous example, but image plot ###

l <- seq(100,0,length.out=101)
c <- seq(0,100,length.out=101)
col.ramp <- hcl(h = 120, c = c, l = l)

model.interaction.plot( model.obj,
                        x="TCB", y="NLCD",
                        main=response.name,
                        plot.type="image",
                        device.type="default",
                        col = col.ramp)

#########################
### 3-way Interaction ###
#########################

### use 'pred.means' argument to fix values of additional predictors ###

### factored 3rd predictor ###

# interaction between TCG and TCW for 3 most common values of NLCD

nlcd<-levels(model.obj$predictor.data$NLCD)
nlcd.counts<-table(model.obj$predictor.data$NLCD)
nlcd.ordered<-nlcd[order(nlcd.counts,decreasing=TRUE)]

for(i in nlcd.ordered[1:3]){
  pred.means=list(NLCD=i)
  model.interaction.plot( model.obj,
                          x="TCG", y="TCW",
                          main=paste("NLCD=",i," (",nlcd.counts[i]," plots)", sep=""),
                          pred.means=pred.means,
                          z.range=c(0,110),
                          theta=290,
                          plot.type="persp",
                          device.type="default")
}

### continuous 3rd predictor ###

tcb<-seq( min(model.obj$predictor.data$TCB),
          max(model.obj$predictor.data$TCB),
          length=3)
tcb<-signif(tcb,2)

for(i in tcb){
  pred.means=list(TCB=i)
  model.interaction.plot( model.obj,
                          x="TCG", y="TCW",
                          main=paste("TCB =",i),
                          pred.means=pred.means,
                          z.range=c(0,120),
                          theta=290,
                          plot.type="persp",
                          device.type="default")
}

### 4-way Interesting combos ###

tcb=c(1300,2900,3400)
nlcd=c(11,90,95)

for(i in 1:3){
  pred.means=list(TCB=tcb[i],NLCD=nlcd[i])
  model.interaction.plot( model.obj,
                          x="TCG", y="TCW",
                          main=paste("TCB =",tcb[i]," NLCD =",nlcd[i]),
                          pred.means=pred.means,
                          z.range=c(0,120),
                          theta=290,
                          plot.type="persp",
                          device.type="default")
}

## End(Not run) #end dontrun
Applies models to either ERDAS Imagine image (.img) files or ESRI Grids of predictors to create detailed prediction surfaces. It handles large predictor files for map making by reading the .img files in rows and writing the prediction for each data row to the output .img file before reading the next row of data.
model.mapmake(model.obj= NULL, folder = NULL, MODELfn = NULL, rastLUTfn = NULL, na.action = NULL, na.value=-9999, keep.predictor.brick=FALSE, map.sd = FALSE, OUTPUTfn = NULL, quantiles=NULL)
model.obj |
|
||||||||||||||||||
folder |
String. The folder used for all output from predictions and/or maps. Do not add ending slash to path string. If |
||||||||||||||||||
MODELfn |
String. The file name to use to save the generated model object. If |
||||||||||||||||||
rastLUTfn |
String. The file name (full path or base name with path specified by Example of comma-delimited file:
|
||||||||||||||||||
na.action |
String. Model validation. Specifies the action to take if there are |
||||||||||||||||||
na.value |
Number. Value that indicates |
||||||||||||||||||
keep.predictor.brick |
Logical. Map Production. If |
||||||||||||||||||
map.sd |
Logical. Map Production. If This option is only available if the The names of the additional maps default to:
|
||||||||||||||||||
OUTPUTfn |
String. Map Production. Filename of output file for map production. The filename can be the full path, or it can be the simple basename, in which case the output will be to the folder specified by The If the output filename does not include an extension, the default extension of For continuous random forest models with |
||||||||||||||||||
quantiles |
Numeric Vector. QRF models. The quantiles to predict. A numeric vector with values between zero and one. The quantile map output will be a multilayer raster with one layer for each quantile. If |
model.mapmake() can be run in a traditional R command mode, where all arguments are specified in the function call. However, it can also be used in a full push button mode, where you type in the simple command model.mapmake(), and GUI pop up windows will ask questions about the type of model, the file locations of the data, etc.
When running model.mapmake() on non-Windows platforms, file names and folders need to be specified in the argument list, but other pushbutton selections are handled by the select.list() function, which is platform independent.
The R package raster
is used to read spatial rasters into R. The data for production mapping should be in the form of pixel-based raster layers representing the predictors in the model. If there is more than one predictor or raster layer, the layers must all have the same number of columns and rows. The layers must also have the same extent, projection, and pixel size, for effective model development and accuracy. The raster
package function compareRaster()
is used to check predictor layers for consistency.
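The consistency check described above can be tried by hand before mapping. The following is a minimal sketch, assuming the raster package is installed; the layer names, grid sizes and values are made up for illustration, and the exact arguments that model.mapmake() passes to compareRaster() are not documented here.

```r
library(raster)

# Two hypothetical predictor layers on the same 10 x 10 grid
# (made-up extent and values, purely for illustration).
tcb <- raster(nrows = 10, ncols = 10, xmn = 0, xmx = 10, ymn = 0, ymx = 10)
tcg <- raster(nrows = 10, ncols = 10, xmn = 0, xmx = 10, ymn = 0, ymx = 10)
values(tcb) <- runif(ncell(tcb))
values(tcg) <- runif(ncell(tcg))

# A layer covering the same extent with a coarser 5 x 5 grid.
coarse <- raster(nrows = 5, ncols = 5, xmn = 0, xmx = 10, ymn = 0, ymx = 10)
values(coarse) <- runif(ncell(coarse))

# stopiffalse = FALSE makes compareRaster() return FALSE instead of
# raising an error when the layers do not match.
same_grid  <- compareRaster(tcb, tcg, extent = TRUE, rowcol = TRUE,
                            crs = TRUE, res = TRUE, stopiffalse = FALSE)
mixed_grid <- compareRaster(tcb, coarse, extent = TRUE, rowcol = TRUE,
                            crs = TRUE, res = TRUE, stopiffalse = FALSE)
```

Here same_grid should be TRUE and mixed_grid FALSE, because the coarse layer has a different number of rows and columns.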
The layers must also be in (single or multi-band) raster data formats that can be read by package raster
, for example ESRI Grid or ERDAS Imagine image files. The predictor layers must have continuous or categorical data values. See writeRaster
for a list of available formats.
To improve processing speed, the raster
package is used to create a raster brick object with a layer for each predictor in the model. By default, this brick is a temporary file that is automatically deleted as soon as the map is completed. If keep.predictor.brick=TRUE
, the predictor brick will be saved as a native raster
package file, with a file name created by appending '_brick'
to the OUTPUTfn
. Warning: these bricks can be quite large, as they contain all the predictor data for every pixel in the map.
When creating maps of non-rectangular study regions there may be large portions of the rectangle where you have no predictors, and are uninterested in making predictions. The suggested value for the pixels outside the study area is -9999
. These pixels will be ignored in the predictions, thus saving computing time.
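The idea of skipping flagged pixels can be illustrated in base R. This is only a sketch: the small matrix, the stand-in prediction function predict_pixel() and the choice to leave the flag value in the output are all assumptions made for illustration, not the internal code of model.mapmake().

```r
# Hypothetical 4 x 4 predictor grid; -9999 marks pixels outside the study area.
na.value <- -9999
tcb <- matrix(c(   10,   20, -9999, -9999,
                   15,   25,    30, -9999,
                   12,   22,    32,    42,
                -9999,   28,    38,    48), nrow = 4, byrow = TRUE)

# Stand-in "model": any function of the predictor (not a ModelMap model).
predict_pixel <- function(x) 2 * x + 1

# Predict only where predictor data exist; flagged pixels keep the flag value.
inside <- tcb != na.value
pred <- matrix(na.value, nrow = nrow(tcb), ncol = ncol(tcb))
pred[inside] <- predict_pixel(tcb[inside])
```

Only the 12 pixels inside the study area are passed to the prediction function, which is where the saving in computing time comes from.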
The function model.mapmake()
outputs a raster file of map information suitable to be imported into a GIS. Maps can also be imported back into R using the function raster()
from the raster
package. The file extension of OUTPUTfn
determines the write format. If OUTPUTfn
does not include a file extension, output will default to an ERDAS Imagine image file with extension ".img".
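The extension rule can be sketched in base R. The helper default_extension() below is hypothetical and is not part of ModelMap; it only mimics the documented behaviour of falling back to ".img" when no extension is supplied.

```r
# Hypothetical helper (not a ModelMap function): append ".img" when the
# output filename has no file extension, otherwise leave it unchanged.
default_extension <- function(OUTPUTfn, ext = ".img") {
  if (tools::file_ext(OUTPUTfn) == "") paste0(OUTPUTfn, ext) else OUTPUTfn
}

no.ext   <- default_extension("RF_BIO_TC_01")      # no extension supplied
with.ext <- default_extension("RF_BIO_TC_01.tif")  # extension kept as given
```

In this sketch no.ext becomes "RF_BIO_TC_01.img", while with.ext is returned unchanged, so the write format would follow the ".tif" extension.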
For Binary response models the output is in the form of predicted probability of presence for each pixel. For Continuous response models the output is the predicted value for each pixel. For Categorical response models the map output depends on the category labels. If the categorical response variable is numeric, the map output will use the original numeric categories. If the categories are non-numeric (for example, character strings), map output is in the form of integer class codes for each pixel, coded for each level of the factored response, and a CSV file containing a look up table is also generated to associate the integer codes with the original values of the response categories.
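The integer coding of non-numeric categories can be sketched in base R. The category labels, the column names of the look up table and the file name below are assumptions chosen for illustration; the layout of the CSV key actually written by model.mapmake() may differ.

```r
# Hypothetical categorical predictions for six pixels.
pred.class <- c("conifer", "shrub", "conifer", "aspen", "shrub", "aspen")

# Factor the response and use the integer level codes as pixel values.
pred.factor <- factor(pred.class)
pixel.codes <- as.integer(pred.factor)

# Look up table linking each integer code to its original category label.
map.key <- data.frame(integercode = seq_along(levels(pred.factor)),
                      category    = levels(pred.factor),
                      stringsAsFactors = FALSE)

# Write the key to a temporary CSV file (file name chosen for illustration).
key.fn <- file.path(tempdir(), "example_map_key.csv")
write.csv(map.key, key.fn, row.names = FALSE)

# The key recovers the original labels from the integer codes.
recovered <- map.key$category[pixel.codes]
```

Because factor levels are sorted alphabetically by default, "aspen" is coded 1, "conifer" 2 and "shrub" 3 in this sketch.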
The first predictor from predList
is used to determine the projection of the output raster file.
The model.mapmake()
function does not return a value; instead it writes a raster file of map information (suitable for importing into a GIS) to the specified folder. The output raster is saved in the format specified by the file extension of OUTPUTfn.
The model.mapmake()
function also writes a text file listing the projections of all predictor rasters.
For categorical response models, a csv file map key is written giving the integer code associated with each response category.
If keep.predictor.brick = TRUE
then a raster brick of all the predictor rasters from the model is also saved to the specified folder. If keep.predictor.brick = FALSE
(the default) then the predictor brick is written to a temporary file, and deleted once the map is completed. Warning: the predictor bricks can be quite large, and saving them can require quite a bit of memory.
If model.mapmake()
is interrupted, it may leave orphan .gri
and .grd
files in your temporary directory. The raster
package functions showTmpFiles
and removeTmpFiles
can be used to locate and remove these files, or they can be deleted manually from the temporary directory.
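A minimal sketch of this clean up, assuming the raster package is installed. The 24 hour cutoff is an arbitrary choice for illustration; take care with small values of h while other raster objects in the current session still rely on temporary files.

```r
library(raster)

# List the temporary raster files currently in the raster temporary folder.
showTmpFiles()

# Remove temporary raster files that are more than 24 hours old.
removeTmpFiles(h = 24)
```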
Elizabeth Freeman and Tracey Frescino
Breiman, L. (2001). Random Forests. Machine Learning 45, 5–32.
Liaw, A. and Wiener, M. (2002). Classification and Regression by randomForest. R News 2(3), 18–22.
Ridgeway, G. (1999). The state of boosting. Comp. Sci. Stat. 31, 172–181.
Simpson, E. H. (1949). Measurement of diversity. Nature 163, 688.
get.test, model.build, model.diagnostics, compareRaster, writeRaster
## Not run:
###########################################################################
############################# Run this set up code: #######################
###########################################################################

# set seed:
seed=38

# Define training and test files:
qdata.trainfn = system.file("extdata", "helpexamples","DATATRAIN.csv", package = "ModelMap")

# Define folder for all output:
folder=getwd()

# identifier for individual training and test data points
unique.rowname="ID"

###########################################################################
######################## Define the model: ################################
###########################################################################

########## Continuous Response, Continuous Predictors ############

# file name to store model:
MODELfn="RF_Bio_TC"

# predictors:
predList=c("TCB","TCG","TCW")

# define which predictors are categorical:
predFactor=FALSE

# Response name and type:
response.name="BIO"
response.type="continuous"

###########################################################################
########################### build model: ##################################
###########################################################################

### create model ###
model.obj = model.build( model.type="RF",
                         qdata.trainfn=qdata.trainfn,
                         folder=folder,
                         unique.rowname=unique.rowname,
                         MODELfn=MODELfn,
                         predList=predList,
                         predFactor=predFactor,
                         response.name=response.name,
                         response.type=response.type,
                         seed=seed,
                         na.action="na.roughfix"
)

###########################################################################
############ Then Run this code to predict map pixels #####################
###########################################################################

### Create the filename (including path) for the rast Look up Tables ###
rastLUTfn.2001 <- system.file( "extdata",
                               "helpexamples",
                               "LUT_2001.csv",
                               package="ModelMap")

### Load rast LUT table, and add path to the predictor raster filenames in column 1 ###
rastLUT.2001 <- read.table(rastLUTfn.2001,header=FALSE,sep=",",stringsAsFactors=FALSE)

for(i in 1:nrow(rastLUT.2001)){
  rastLUT.2001[i,1] <- system.file("extdata",
                                   "helpexamples",
                                   rastLUT.2001[i,1],
                                   package="ModelMap")
}

### Define filename for map output ###
OUTPUTfn.2001 <- "RF_BIO_TCandNLCD_01.img"
OUTPUTfn.2001 <- paste(folder,OUTPUTfn.2001,sep="/")

### Create image files of predicted map data ###
model.mapmake( model.obj=model.obj,
               folder=folder,
               rastLUTfn=rastLUT.2001,
               # Mapping arguments
               OUTPUTfn=OUTPUTfn.2001
)

###########################################################################
################ run this code to create maps in R ########################
###########################################################################

### Define Color Ramp ###
l <- seq(100,0,length.out=101)
c <- seq(0,100,length.out=101)
col.ramp <- hcl(h = 120, c = c, l = l)

### read in map data ###
mapgrid.2001 <- raster(OUTPUTfn.2001)
#mapgrid.2001 <- setMinMax(mapgrid.2001)

### create map ###
dev.new(width = 5, height = 5)
opar <- par(mar=c(3,3,2,1),oma=c(0,0,3,4),xpd=NA)

zlim <- c(0,max(maxValue(mapgrid.2001)))
legend.label<-rev(pretty(zlim,n=5))
legend.colors<-col.ramp[trunc((legend.label/max(legend.label))*100)+1]

image( mapgrid.2001,
       col = col.ramp,
       zlim=zlim,
       asp=1,
       bty="n",
       xaxt="n",
       yaxt="n",
       main="",
       xlab="",
       ylab="")
mtext("2001 Imagery",side=3,line=1,cex=1.2)

legend( x=xmax(mapgrid.2001),y=ymax(mapgrid.2001),
        legend=legend.label,
        fill=legend.colors,
        bty="n",
        cex=1.2
)
mtext("Predictions",side=3,line=1,cex=1.5,outer=TRUE)
par(opar)

## End(Not run)