Title: | Subsetting using Focused Identification of the Germplasm Strategy (FIGS) |
---|---|
Description: | Running Focused Identification of the Germplasm Strategy (FIGS) to make best subsets from Genebank Collection. |
Authors: | Khadija Aziz [aut], Zakaria Kehel [aut, cre], Bancy Ngatia [aut], Khadija Aouzal [aut], Zainab Azough [ctb], Amal Ibnelhobyb [ctb], Fawzy Nawar [ctb] |
Maintainer: | Zakaria Kehel <[email protected]> |
License: | MIT + file LICENSE |
Version: | 1.0.2 |
Built: | 2024-12-21 06:55:19 UTC |
Source: | CRAN |
200 sites from durum wheat collection and their daily climatic data.
data(durumDaily)
The data includes the site unique identifier and daily data for 4 climatic variables (tmin, tmax, precipitation and relative humidity)
if(interactive()){
  # Load durum wheat data with their daily climatic variables obtained from ICARDA database
  data(durumDaily)
}
200 sites from durum wheat collection and their WorldClim data.
data(durumWC)
The data includes the site unique identifier, longitude, latitude and 55 WorldClim variables.
if(interactive()){
  # Load durum wheat data with world climatic variables obtained from WorldClim database
  data(durumWC)
}
extractWCdata returns a data frame based on specified climatic variables.
extractWCdata(sites, long, lat, var, res = 2.5)
sites |
object of class "data.frame" with coordinates of sites from which to extract data. |
long |
character. Name of column from |
lat |
character. Name of column from |
var |
character. Climatic variable(s) to be extracted: 'tavg', 'tmin', 'tmax', 'prec', 'bio', 'srad', 'vapr', 'wind' |
res |
numeric. Spatial resolution. Default 2.5 |
A grid can be created with any particular coordinates and used as input for sites (see section 'Examples'). extractWCdata will use the given coordinates to extract data from the WorldClim2.1 database.
The extracted data will most likely contain NA's for sites where climatic data is not available. These should be removed or imputed before using the data to make predictions.
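The two options mentioned here (removing or imputing missing values) can be sketched in base R. The data frame below is a made-up stand-in for extracted climate data; its column names and values are assumptions for illustration, not actual output of extractWCdata, and mean imputation is only one of several possible imputation choices.

```r
# Toy stand-in for extracted climate data; values and column names are made up
toy <- data.frame(
  x     = c(-16, 10, 40, 115),
  y     = c(25, 30, 45, 59),
  tavg1 = c(12.1, NA, 3.4, 8.0),
  tavg2 = c(13.5, 15.2, NA, 9.1)
)

# Option 1: drop every site (row) that has at least one missing value
toy.complete <- na.omit(toy)

# Option 2: replace each missing value with the mean of its column
toy.imputed <- toy
for (col in names(toy.imputed)) {
  missing <- is.na(toy.imputed[[col]])
  toy.imputed[[col]][missing] <- mean(toy.imputed[[col]], na.rm = TRUE)
}
```

Dropping rows keeps only fully observed sites, while imputation keeps every site at the cost of introducing estimated values.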
An object of class "data.frame" with specified climatic variables for coordinates in sites.
Zakaria Kehel, Fawzy Nawar, Bancy Ngatia, Khadija Aouzal
if(interactive()){
  # Create grid
  sp1 <- seq(-16, 115, length = 10)
  sp2 <- seq(25, 59, length = 10)
  sp <- expand.grid(x = sp1, y = sp2)

  # Extract data using grid
  sp.df0 <- extractWCdata(sp, long = 'x', lat = 'y', var = 'tavg')
  sp.df <- na.omit(sp.df0)
}
FIGS subset for wheat sodicity resistance constructed using the harmonized world soil database HWSD
data(FIGS)
A data frame with 201 rows and 15 variables
if(interactive()){
  data(FIGS)
}
Return a data frame with accession data for the specified crop.
getAccessions( crop = "", ori = NULL, IG = "", doi = FALSE, taxon = FALSE, collectionYear = FALSE, coor = FALSE, available = FALSE )
crop |
character. Crop for which to get accession data. See section 'Details' for available crops or use |
ori |
string. Country of origin using the ISO 3166-1 alpha-3 country codes. Default: NULL. |
IG |
integer. Unique identifier of accession. Default: "". |
doi |
boolean. If |
taxon |
boolean. If |
collectionYear |
boolean. If |
coor |
boolean. If |
available |
boolean. If |
Types of crops available include:
'Aegilops'
'Barley'
'Bread wheat'
'Chickpea'
'Durum wheat'
'Faba bean'
'Faba bean BPL'
'Forage and range'
'Lathyrus'
'Lentil'
'Medicago annual'
'Not mandate cereals'
'Pisum'
'Primitive wheat'
'Trifolium'
'Vicia'
'Wheat hybrids'
'Wheat wild relatives'
'Wild Cicer'
'Wild Hordeum'
'Wild Lens'
'Wild Triticum'
Alternatively, the list of available crops can be fetched from ICARDA's online server using getCrops.
A data frame with accession passport data for specified crop in crop from the locations in ori.
Khadija Aouzal, Amal Ibnelhobyb, Zakaria Kehel, Fawzy Nawar
if(interactive()){
  # Obtain accession data for durum wheat
  durum <- getAccessions(crop = 'Durum wheat', coor = TRUE)
}
Obtains the list of crops available in ICARDA's Genebank Documentation System. Returns a list with the codes and names of available crops.
getCrops()
The crop codes and names are fetched from ICARDA's online server.
A list containing all crops available in ICARDA's Genebank Documentation System.
Zakaria Kehel, Fawzy Nawar
if(interactive()){
  # Get list of available crops
  crops <- getCrops()
}
Extracts daily values of climatic variables from ICARDA data. Returns a list or data frame based on the specified climatic variables. Each variable has 365 values, one for each day of the calendar year.
getDaily(sites, var, cv = FALSE)
sites |
character. Names of sites from which to extract data. |
var |
character. Climatic variable(s) to be extracted. |
cv |
boolean. If |
ICARDA data has to be accessible either from a local directory on the computer or from an online repository. getDaily will extract the climatic variables specified in var for the sites specified in sites.
For daily data, the function extracts average daily values starting from the first day of the calendar year, i.e. January 1, until the last day of the calendar year, i.e. December 31. Thus, 365 columns with daily values are created for each variable.
An object with specified climatic variables for names in sites.
If cv = TRUE, the object is a list containing two data frames: the first one with average daily values of climatic variables, and the second one with daily coefficient of variation for each climatic variable.
If cv = FALSE, the object is a data frame with average daily values of climatic variables.
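For readers unfamiliar with the coefficient of variation, it is conventionally the standard deviation divided by the mean. How getDaily computes it internally is not stated here, so the following base R snippet is only an illustration on made-up values, assuming the averaging is done across several years of observations for the same calendar day.

```r
# Made-up temperature for one site on one calendar day, observed over
# five years (values are illustrative only)
tavg.day1 <- c(4.2, 5.1, 3.8, 6.0, 4.9)

# Conventional coefficient of variation: standard deviation over mean
cv <- function(x) sd(x, na.rm = TRUE) / mean(x, na.rm = TRUE)

avg.day1 <- mean(tavg.day1)  # average daily value
cv.day1  <- cv(tavg.day1)    # daily coefficient of variation
```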
Zakaria Kehel, Bancy Ngatia
if(interactive()){
  # Extract daily data for durum wheat
  durum <- getAccessions(crop = 'Durum wheat', coor = TRUE)
  daily <- getDaily(sites = levels(as.factor(durum$SiteCode)),
                    var = c('tavg', 'prec', 'rh'), cv = TRUE)

  # Get data frame with coefficient of variation from list object
  # returned (when cv = TRUE)
  daily.cv <- daily[[2]]
}
Calculates growing degree days (GDD) as well as cumulative GDD, and returns a list of various data frames based on specified arguments.
getGrowthPeriod(sitecode, crop, base, max, gdd = FALSE)
sitecode |
expression. Vector with names of sites from which to extract onset data. |
crop |
character. Type of crop in ICARDA database. See section 'Details' for crops which have calculations available. |
base |
integer. Minimum temperature constraint for the crop. |
max |
integer. Maximum temperature constraint for the crop. |
gdd |
boolean. If |
Growing degree days for various crops are calculated using average daily minimum and maximum temperature values obtained from onset data. The temperature constraints specified in base and max are first applied before the calculations are done. These constraints ensure very low or high temperatures which prevent growth of a particular crop are not included.
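The idea of applying the base and max constraints before computing GDD can be sketched in base R. The function gddSketch below is hypothetical and uses a common textbook averaging formula with temperatures clipped to the interval [base, max]; the exact calculation inside getGrowthPeriod may differ.

```r
# Hypothetical helper: growing degree days from daily tmin/tmax, with both
# temperatures first clipped to the interval [base, max]
gddSketch <- function(tmin, tmax, base, max) {
  tmin.c <- pmin(pmax(tmin, base), max)
  tmax.c <- pmin(pmax(tmax, base), max)
  gdd <- (tmin.c + tmax.c) / 2 - base
  data.frame(gdd = gdd, cumgdd = cumsum(gdd))
}

# Four made-up days; day 1 has tmin below base, day 4 has tmax above max
res <- gddSketch(tmin = c(-2, 5, 10, 20), tmax = c(8, 15, 25, 40),
                 base = 0, max = 35)
```

With these inputs the daily GDD values are 4, 10, 17.5 and 27.5, and the cumulative GDD after the fourth day is 59.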
Crops for which GDD calculations are available include: 'Durum wheat', 'Bread wheat', 'Barley', 'Chickpea', 'Lentil'. Each of these can be supplied as options for the argument crop.
Cumulative GDD values determine the length of different growing stages. Growing stages vary depending on the type of crop. Durum wheat, bread wheat and barley have five growth stages, i.e. beginning of heading, beginning and completion of flowering, and beginning and completion of grain filling. Chickpea and lentil have four growth stages, i.e. beginning of flowering, completion of 50
The length of the full growth cycle of the crop for each site is also given in the output data frame.
A list object with different data frames depending on specified option in gdd.
If gdd = TRUE, the object is a list containing three data frames: the first one with lengths of different growing stages, the second one with original onset data with phenological variables, and the third one with calculated GDD and accumulated GDD for the sites specified in sitecode.
If gdd = FALSE, the object is a list containing two data frames: the first one with lengths of different growing stages, and the second one with original onset data with phenological variables for the sites specified in sitecode.
Khadija Aouzal, Zakaria Kehel, Bancy Ngatia
if(interactive()){
  # Calculate GDD for durum wheat
  data(durumDaily)
  growth <- getGrowthPeriod(sitecode = durumDaily$site_code,
                            crop = 'Durum wheat', base = 0,
                            max = 35, gdd = TRUE)

  # Get data frame with lengths of growth stages from list
  # object returned
  growth.lengths <- growth[[1]]

  # Get data frame with phenotypic variables from list
  # object returned
  growth.pheno <- growth[[2]]

  # Get data frame with GDD, cumulative GDD and climatic
  # variables from list object returned (when gdd = TRUE)
  growth.gdd <- growth[[3]]
}
Obtains performance measures from a confusion matrix. Returns a data frame containing performance measures from the confusion matrix given by the caret package.
getMetrics(y, yhat, classtype)
y |
expression. The class variable. |
yhat |
expression. The vector of predicted values. |
classtype |
character or numeric. The number of levels in |
getMetrics works with target variables that have two, three, four, six or eight classes.
The function relies on the caret package to obtain the confusion matrix from which performance measures are extracted. It can be run for several algorithms, and the results combined into one data frame for easier comparison (see section 'Examples').
Predictions have to be obtained beforehand and used as input for yhat. The predict.train function in caret should be run without argument type when obtaining the predictions.
Outputs an object with performance measures calculated from the confusion matrix given by the caret package. A data frame is the resulting output with the first column giving the name of the performance measure, and the second column giving the corresponding value.
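Combining results from several algorithms into one data frame can be sketched with base R. The two data frames below are made-up stand-ins with the two-column layout described above (measure name, then value); the column names, measure names and numbers are assumptions for illustration, not actual getMetrics output.

```r
# Made-up two-column results for two algorithms (name of measure, value)
metrics.a <- data.frame(Measure = c("Accuracy", "Kappa"),
                        Value   = c(0.81, 0.62))
metrics.b <- data.frame(Measure = c("Accuracy", "Kappa"),
                        Value   = c(0.78, 0.55))

# Keep the measure names once and place the values side by side
metrics.all <- cbind(metrics.a, metrics.b[, 2])
names(metrics.all) <- c("Measure", "knn", "nnet")
```

This assumes both data frames list the same measures in the same order; if they might not, merging on the measure name is the safer choice.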
Zakaria Kehel, Bancy Ngatia, Khadija Aziz
if(interactive()){
  # Obtain predictions from previous models
  data(septoriaDurumWC)
  split.data <- splitData(septoriaDurumWC, seed = 1234, y = "ST_S", p = 0.7)
  data.train <- split.data$trainset
  data.test <- split.data$testset

  knn.mod <- tuneTrain(data = septoriaDurumWC, y = 'ST_S', method = 'knn', positive = 'R')
  nnet.mod <- tuneTrain(data = septoriaDurumWC, y = 'ST_S', method = 'nnet', positive = 'R')

  pred.knn <- predict(knn.mod$Model, newdata = data.test[ , -1])
  pred.nnet <- predict(nnet.mod$Model, newdata = data.test[ , -1])

  metrics.knn <- getMetrics(y = data.test$ST_S, yhat = pred.knn, classtype = 2)
  metrics.nnet <- getMetrics(y = data.test$ST_S, yhat = pred.nnet, classtype = 2)
}
getMetricsPCA obtains performance measures for algorithms that have been run with PCA pre-processing. It returns a data frame containing performance measures from the confusion matrix given by the caret package.
getMetricsPCA(yhat, y, classtype, model)
yhat |
expression. The vector of predicted values. |
y |
expression. The class variable. |
classtype |
character or numeric. The number of levels in |
model |
expression. The model object to which output of the model has been assigned. |
Works with target variables that have two, three, four, six or eight classes. Similar to getMetrics but used in the case where models have been run with PCA specified as an option for the preProcess argument in the train function of caret.
Outputs an object with performance measures calculated from the confusion matrix given by the caret package. A data frame is the resulting output with the first column giving the name of the performance measure, and the second column giving the corresponding value.
Khadija Aziz, Zainab Azough, Zakaria Kehel, Bancy Ngatia
confusionMatrix, predict.train
if(interactive()){
  # Obtain predictions from several previously run models
  dataX <- subset(data, select = -y)
  pred.knn <- predict(model.knn, newdata = dataX)
  pred.rf <- predict(model.rf, newdata = dataX)

  # Get metrics for several algorithms
  metrics.knn <- getMetricsPCA(y = data$y, yhat = pred.knn,
                               classtype = 2, model = model.knn)
  metrics.rf <- getMetricsPCA(y = data$y, yhat = pred.rf,
                              classtype = 2, model = model.rf)

  # Indexing for 2-class models to remove extra column with
  # names of performance measures
  metrics.all <- cbind(metrics.knn, metrics.rf[ , 2])

  # No indexing needed for 3-, 4-, 6- or 8-class models
  metrics.all <- cbind(metrics.knn, metrics.rf)
}
Extracts daily values of climatic variables from remote ICARDA data based on the onset of planting. Returns a list based on the specified climatic variables. Each variable has 365 values, one for each day of the (onset) year, beginning with the planting day.
getOnset(sites, crop, var, cv = FALSE)
sites |
character. Names of sites from which to extract data. |
crop |
character. Crop code in ICARDA database. See section 'Details' for a list of crops. |
var |
character. Climatic variable(s) to be extracted. |
cv |
boolean. If |
Similar to getDaily except the extracted data is based on 365 days starting from the onset of planting.
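The difference between a calendar year and an onset year can be illustrated by re-indexing a 365-value series so that it starts on the planting day. This is only a sketch in base R with made-up values and a hypothetical onset day; how getOnset actually assembles the onset year (for example, whether it draws on two consecutive years of data) is not stated here.

```r
# Made-up daily series for one calendar year: value i belongs to day-of-year i
calendar.values <- 1:365

# Hypothetical planting day, expressed as day of the calendar year
onset.day <- 300

# Start at the onset day, run to December 31, then wrap around to January 1
onset.values <- c(calendar.values[onset.day:365],
                  calendar.values[seq_len(onset.day - 1)])
```

The result still has 365 values, but position 1 now corresponds to the planting day rather than to January 1.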
Crops available in ICARDA's genebank documentation system include the following:
'ICAG' = Aegilops
'ICB' = Barley
'ICBW' = Bread wheat
'ILC' = Chickpea
'ICDW' = Durum wheat
'ILB' = Faba bean
'BPL' = Faba bean BPL
'IFMI' = Forage and range
'IFLA' = Lathyrus
'ILL' = Lentil
'IFMA' = Medicago annual
'IC' = Not mandate cereals
'IFPI' = Pisum
'ICPW' = Primitive wheat
'IFTR' = Trifolium
'IFVI' = Vicia
'ICWH' = Wheat hybrids
'ICWW' = Wheat wild relatives
'ILWC' = Wild Cicer
'ICWB' = Wild Hordeum
'ILWL' = Wild Lens
'ICWT' = Wild Triticum
Alternatively, the list of available crops can be fetched from ICARDA's online server using getCrops.
A list object with specified climatic variables for names in sites.
If cv = TRUE, the object is a list containing three data frames: the first one with average daily values of climatic variables, the second one with daily coefficient of variation for each climatic variable, and the third one with phenotypic variables and number of day in calendar year when each occurs at the sites specified in sites.
If cv = FALSE, the object is a list containing two data frames: the first one with average daily values of climatic variables, and the second one with phenotypic variables and number of day in calendar year when each occurs at the sites specified in sites.
Khadija Aouzal, Amal Ibnelhobyb, Zakaria Kehel, Bancy Ngatia
if(interactive()){
  # Extract onset data for durum wheat
  durum <- getAccessions(crop = 'Durum wheat', coor = TRUE)
  onset <- getOnset(sites = levels(as.factor(durum$SiteCode)), crop = 'ICDW',
                    var = c('tavg', 'prec', 'rh'), cv = TRUE)

  # Get data frame with climatic variables from list object returned
  onset.clim <- onset[[1]]

  # Get data frame with coefficient of variation from list object
  # returned (when cv = TRUE)
  onset.cv <- onset[[2]]

  # Get data frame with phenotypic variables from list object returned
  onset.pheno <- onset[[3]]
}
Return a data frame containing traits associated with a particular crop, their description and related identifiers.
getTraits(crop)
crop |
character. Crop for which to get available traits. |
getTraits returns a data frame of traits together with their IDs and coding system used for each trait.
Possible inputs for crop include:
'Aegilops'
'Barley'
'Bread wheat'
'Chickpea'
'Durum wheat'
'Faba bean'
'Faba bean BPL'
'Forage and range'
'Lathyrus'
'Lentil'
'Medicago annual'
'Not mandate cereals'
'Pisum'
'Primitive wheat'
'Trifolium'
'Vicia'
'Wheat hybrids'
'Wheat wild relatives'
'Wild Cicer'
'Wild Hordeum'
'Wild Lens'
'Wild Triticum'
A list of available crops to use as input for crop can also be obtained from ICARDA's online server using getCrops.
A data frame with traits that are associated with the crop specified in crop.
Khadija Aouzal, Amal Ibnelhobyb, Zakaria Kehel, Fawzy Nawar
if(interactive()){
  # Get traits for bread wheat
  breadTraits <- getTraits(crop = 'Bread wheat')
}
Return a data frame with observed values of the associated trait for the given accessions.
getTraitsData(IG, traitID)
IG |
factor. Unique identifier of accession. |
traitID |
integer. Unique identifier of trait (from |
Possible inputs for traitID can be found using the getTraits function (see section 'Examples').
A data frame with scores for the trait specified in traitID for the accessions given in IG.
Khadija Aouzal, Amal Ibnelhobyb, Zakaria Kehel, Fawzy Nawar
if(interactive()){
  # Check trait ID for septoria and get septoria data for durum wheat
  durum <- getAccessions(crop = 'Durum wheat', coor = TRUE)
  durumTraits <- getTraits(crop = 'Durum wheat')
  septoria <- getTraitsData(IG = durum$IG, traitID = 145)
}
Returns a map with points showing where accessions are located.
mapAccessions(df, long, lat, y = NULL)
df |
object of class "data.frame" with coordinates of accessions and target variable. |
long |
character. Column name from |
lat |
character. Column name from |
y |
Default: NULL, column name from |
A world map with plotted points showing locations of accessions.
Khadija Aouzal, Zakaria Kehel
if(interactive()){
  # Loading FIGS subset for wheat sodicity resistance
  data(FIGS)

  # World Map showing locations of accessions
  mapAccessions(df = FIGS, long = "Longitude", lat = "Latitude")

  # Map plotting locations of accessions with points coloured
  # based on a gradient scale of SodicityIndex values
  mapAccessions(FIGS, long = "Longitude", lat = "Latitude",
                y = "SodicityIndex")

  # Map plotting locations of accessions with points
  # coloured based on levels of y
  mapAccessions(FIGS, long = "Longitude", lat = "Latitude",
                y = "PopulationType")
}
modelingSummary is an automatic function for modeling data. It returns a data frame containing the metrics of the modeling using five machine learning algorithms: KNN, SVM, RF, NNET, and Bcart. This function is based on the splitData, tuneTrain, predict, and getMetrics functions.
modelingSummary( data, y, p = 0.7, length = 10, control = "repeatedcv", number = 10, repeats = 10, process = c("center", "scale"), summary = multiClassSummary, positive, parallelComputing = FALSE, classtype, ... )
data |
object of class "data.frame" with target variable and predictor variables. |
y |
character. Target variable. |
p |
numeric. Proportion of data to be used for training. Default: 0.7 |
length |
integer. Number of values to output for each tuning parameter. If |
control |
character. Resampling method to use. Choices include: "boot", "boot632", "optimism_boot", "boot_all", "cv", "repeatedcv", "LOOCV", "LGOCV", "none", "oob", "timeslice", "adaptive_cv", "adaptive_boot", or "adaptive_LGOCV". Default: "repeatedcv". See |
number |
integer. Number of cross-validation folds or number of resampling iterations. Default: 10. |
repeats |
integer. Number of complete sets of folds to compute for repeated k-fold cross-validation if "repeatedcv" is chosen as the resampling method in |
process |
character. Defines the pre-processing transformation of predictor variables to be done. Options are: "BoxCox", "YeoJohnson", "expoTrans", "center", "scale", "range", "knnImpute", "bagImpute", "medianImpute", "pca", "ica", or "spatialSign". See |
summary |
expression. Computes performance metrics across resamples. For numeric |
positive |
character. The positive class for the target variable if |
parallelComputing |
logical. Indicates whether to use parallel processing. Default: FALSE |
classtype |
integer. Number of classes of the trait. |
... |
additional arguments to be passed to |
Types of classification and regression models available for use with tuneTrain can be found using names(getModelInfo()). The results given depend on the type of model used.
A data frame containing the metrics of the modeling for five machine learning algorithms: KNN, SVM, RF, NNET, and Bcart.
tuneTrain relies on package caret to perform the modeling.
Zakaria Kehel, Khadija Aziz
createDataPartition, trainControl, train, predict.train, confusionMatrix
if(interactive()){
  data(septoriaDurumWC)
  models <- modelingSummary(data = septoriaDurumWC, y = "ST_S",
                            positive = "R", classtype = 2)
}
Sample data including daily data for 4 climatic variables (tmin, tmax, precipitation and relative humidity) and evaluation for Septoria Tritici.
data(septoriaDurumWC)
200 sites from durum wheat collection and their daily climatic data and evaluation for Septoria Tritici.
if(interactive()){
  # Load durum wheat data with septoria scores and climatic variables obtained from WorldClim database
  data(septoriaDurumWC)
}
Splits the data into train and test sets. Returns a list containing two data frames for the train and test sets.
splitData(data, seed = NULL, y, p, ...)
data |
object of class "data.frame" with target variable and predictor variables. |
seed |
integer. Values for the random number generator. Default: NULL. |
y |
character. Target variable. |
p |
numeric. Proportion of data to be used for training. |
... |
additional arguments to be passed to |
splitData relies on the createDataPartition function from the caret package to perform the data split.
If y is a factor, the sampling of observations for each set is done within the levels of y such that the class distributions are more or less balanced for each set.
If y is numeric, the data is split into groups based on percentiles and the sampling done within these subgroups. See createDataPartition for more details on additional arguments that can be passed.
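Sampling within the levels of a factor can be sketched in base R as follows. This is not the implementation used by createDataPartition, only an illustration of the idea on made-up data; the data frame toy, its columns and the rounding rule are assumptions.

```r
set.seed(1234)

# Made-up data: 30 resistant (R) and 70 susceptible (S) observations
toy <- data.frame(y = factor(rep(c("R", "S"), times = c(30, 70))),
                  x = rnorm(100))
p <- 0.7

# Sample a proportion p of the row indices separately within each level of y
train.idx <- unlist(
  lapply(split(seq_len(nrow(toy)), toy$y), function(idx) {
    idx[sample.int(length(idx), size = round(p * length(idx)))]
  }),
  use.names = FALSE
)

trainset <- toy[train.idx, ]
testset  <- toy[-train.idx, ]
```

Because the sampling is done per level, the 30/70 class ratio of the full data is preserved in both the training and the test set.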
A list with two data frames: the first as train set, and the second as test set.
Zakaria Kehel, Bancy Ngatia
if(interactive()){
  # Split the data into 70/30 train and test sets for factor y
  data(septoriaDurumWC)
  split.data <- splitData(septoriaDurumWC, seed = 1234, y = 'ST_S', p = 0.7)

  # Get training and test sets from list object returned
  trainset <- split.data$trainset
  testset <- split.data$testset
}
tuneTrain is an automatic function for splitting the data, tuning, training, and making predictions. It returns a list containing a model object, data frame and plot.
tuneTrain( data, y, p = 0.7, method = method, parallelComputing = FALSE, length = 10, control = "repeatedcv", number = 10, repeats = 10, process = c("center", "scale"), summary = multiClassSummary, positive, ... )
data |
object of class "data.frame" with target variable and predictor variables. |
y |
character. Target variable. |
p |
numeric. Proportion of data to be used for training. Default: 0.7 |
method |
character. Type of model to use for classification or regression. |
parallelComputing |
logical. Indicates whether to use parallel processing. Default: FALSE |
length |
integer. Number of values to output for each tuning parameter. If |
control |
character. Resampling method to use. Choices include: "boot", "boot632", "optimism_boot", "boot_all", "cv", "repeatedcv", "LOOCV", "LGOCV", "none", "oob", "timeslice", "adaptive_cv", "adaptive_boot", or "adaptive_LGOCV". Default: "repeatedcv". See |
number |
integer. Number of cross-validation folds or number of resampling iterations. Default: 10. |
repeats |
integer. Number of complete sets of folds to compute for repeated k-fold cross-validation if "repeatedcv" is chosen as the resampling method in |
process |
character. Defines the pre-processing transformation of predictor variables to be done. Options are: "BoxCox", "YeoJohnson", "expoTrans", "center", "scale", "range", "knnImpute", "bagImpute", "medianImpute", "pca", "ica", or "spatialSign". See |
summary |
expression. Computes performance metrics across resamples. For numeric |
positive |
character. The positive class for the target variable if |
... |
additional arguments to be passed to |
Types of classification and regression models available for use with tuneTrain can be found using names(getModelInfo()). The results given depend on the type of model used.
For classification models, class probabilities and ROC curve are given in the results. For regression models, predictions and residuals versus predicted plot are given. y should be converted to either factor if performing classification or numeric if performing regression before specifying it in tuneTrain.
A list object with results from tuning and training the model selected in method, together with predictions and class probabilities. The training and test data sets obtained from splitting the data are also returned.
If y is factor, class probabilities are calculated for each class. If y is numeric, predicted values are calculated.
A ROC curve is created if y is factor. Otherwise, a plot of residuals versus predicted values is created if y is numeric.
tuneTrain relies on packages caret, ggplot2 and plotROC to perform the modelling and plotting.
Zakaria Kehel, Bancy Ngatia, Khadija Aziz
createDataPartition, trainControl, train, predict.train, ggplot, geom_roc, calc_auc
if(interactive()){
  data(septoriaDurumWC)
  knn.mod <- tuneTrain(data = septoriaDurumWC, y = 'ST_S', method = 'knn', positive = 'R')
  nnet.mod <- tuneTrain(data = septoriaDurumWC, y = 'ST_S', method = 'nnet', positive = 'R')
}
varimpPred calculates variable importance and makes predictions. It returns a list containing a data frame of variable importance scores, predictions or class probabilities, and corresponding plots.
varimpPred( newdata, y, positive, model, scale = FALSE, auc = FALSE, predict = FALSE, ... )
newdata |
object of class "data.frame" having test data. |
y |
character. Target variable. |
positive |
character. The positive class for the target variable if y is factor. Usually, it is the first level of the factor. |
model |
expression. The model object returned after training a model on training data. |
scale |
boolean. If |
auc |
boolean. If |
predict |
boolean. If |
... |
additional arguments to be passed to |
The importance measure for each variable is calculated based on the type of model.
For example for linear models, the absolute value of the t-statistic of each parameter is used in the importance calculation.
For classification models, with the exception of classification trees, bagged trees and boosted trees, a variable importance score is calculated for each class. See varImp for details on model-specific metrics.
varimpPred can be used to obtain either variable importance metrics, predictions, class probabilities, or a combination of these.
For classification models with predict = TRUE, class probabilities and ROC curve are given in the results.
For regression models with predict = TRUE, predictions and residuals versus predicted plot are given.
A list object with importance measures for variables in newdata, predictions for regression models, class probabilities for classification models, and corresponding plots.
newdata should be either the test data that remains after splitting whole data into training and test sets, or a new data set different from the one used to train the model.
If y is factor, class probabilities are calculated for each class. If y is numeric, predicted values are calculated.
A ROC curve is created if predict = TRUE and y is factor. Otherwise, a plot of residuals versus predicted values is created if y is numeric.
varimpPred relies on packages caret, ggplot2 and plotROC to perform the calculations and plotting.
Zakaria Kehel, Bancy Ngatia, Khadija Aziz, Zainab Azough
varImp, predict.train, ggplot, geom_roc, calc_auc
if(interactive()){
  # Calculate variable importance for classification model
  data("septoriaDurumWC")
  knn.mod <- tuneTrain(data = septoriaDurumWC, y = 'ST_S', method = 'knn')
  testdata <- knn.mod$`Test Data`
  knn.varimp <- varimpPred(newdata = testdata, y = 'ST_S',
                           positive = 'R', model = knn.mod$Model)
  knn.varimp

  # Calculate variable importance and obtain class probabilities
  data("septoriaDurumWC")
  svm.mod <- tuneTrain(data = septoriaDurumWC, y = 'ST_S',
                       method = 'svmLinear2', predict = TRUE,
                       positive = 'R', summary = twoClassSummary)
  testdata <- svm.mod$`Test Data`
  svm.varimp <- varimpPred(newdata = testdata, y = 'ST_S',
                           positive = 'R', model = svm.mod$Model,
                           ROC = TRUE, predict = TRUE)
  svm.varimp

  # Obtain variable importance plot for only first 20 variables
  # with highest measure
  svm.varimp <- varimpPred(newdata = testdata, y = 'ST_S',
                           positive = 'R', model = svm.mod$Model,
                           ROC = TRUE, predict = TRUE, top = 20)
  svm.varimp
}