| Title: | Prediction and Interpretation in Decision Trees for Classification and Regression |
|---|---|
| Description: | Optimization of conditional inference trees from the package 'party' for classification and regression. For optimization, the model space is searched for the best tree on the full sample by means of repeated subsampling. Restrictions are allowed so that only trees are accepted which do not include pre-specified uninterpretable split results (cf. Weihs & Buschfeld, 2021a). The function PrInDT() represents the basic resampling loop for 2-class classification (cf. Weihs & Buschfeld, 2021a). The function RePrInDT() (repeated PrInDT()) allows for repeated applications of PrInDT() for different percentages of the observations of the large and the small classes (cf. Weihs & Buschfeld, 2021c). The function NesPrInDT() (nested PrInDT()) allows for an extra layer of subsampling for a specific factor variable (cf. Weihs & Buschfeld, 2021b). The functions PrInDTMulev() and PrInDTMulab() deal with multilevel and multilabel classification. In addition to these PrInDT() variants for classification, the function PrInDTreg() has been developed for regression problems. Finally, the function PostPrInDT() allows for a posterior analysis of the distribution of a specified variable in the terminal nodes of a given tree. In version 2, additionally structured sampling is implemented in functions PrInDTCstruc() and PrInDTRstruc(). In these functions, repeated measurements data can be analyzed, too. Moreover, multilabel 2-stage versions of classification and regression trees are implemented in functions C2SPrInDT() and R2SPrInDT() as well as interdependent multilabel models in functions SimCPrInDT() and SimRPrInDT(). Finally, for mixtures of classification and regression models functions Mix2SPrInDT() and SimMixPrInDT() are implemented. Most of these extensions of PrInDT are described in Buschfeld & Weihs (2025Fc). References: -- Buschfeld, S., Weihs, C. (2025Fc) "Optimizing decision trees for the analysis of World Englishes and sociolinguistic data", Cambridge Elements. -- Weihs, C., Buschfeld, S. (2021a) "Combining Prediction and Interpretation in Decision Trees (PrInDT) - a Linguistic Example" <doi:10.48550/arXiv.2103.02336>; -- Weihs, C., Buschfeld, S. (2021b) "NesPrInDT: Nested undersampling in PrInDT" <doi:10.48550/arXiv.2103.14931>; -- Weihs, C., Buschfeld, S. (2021c) "Repeated undersampling in PrInDT (RePrInDT): Variation in undersampling and prediction, and ranking of predictors in ensembles" <doi:10.48550/arXiv.2108.05129>. |
| Authors: | Claus Weihs [aut, cre], Sarah Buschfeld [aut], Niklas Nitsch [ctb] |
| Maintainer: | Claus Weihs <[email protected]> |
| License: | GPL-2 |
| Version: | 2.0.2 |
| Built: | 2026-05-09 06:17:21 UTC |
| Source: | https://github.com/cran/PrInDT |
The function C2SPrInDT applies two-stage estimation for finding an optimal model for relationships between the two-class factor variables
specified as column indices of 'datain' in the vector 'inddep' and all other factor and numerical variables in the data frame 'datain' by means of
'N' repetitions of random subsampling with percentages 'percl' for the large classes and 'percs' for the small classes. One percentage of observations
for each dependent variable has to be specified for the larger and the smaller class. For example, for three dependent variables, 'percl' consists of
three percentages specified in the order in which the dependent variables appear in 'inddep'.
The dependent variables have to be specified as dummies, i.e. as 0 for 'property absent' or 1 for 'property present' for a certain property the dependent variable is representing.
The indices of the predictors relevant at 1st stage modeling can be specified in the vector 'indind'. If 'indind' is not specied, then all variables in 'datain'
which are not specified in 'inddep' are used as predictors at 1st stage. At 2nd stage, all such variables are used as predictors anyway.
The optimization citerion is the balanced accuracy on the full sample. The trees generated from undersampling can be restricted by not accepting trees
including split results specified in the character strings of the vector 'ctestv'.
The parameters 'conf.level', 'minsplit', and 'minbucket' can be used to control the size of the trees.
C2SPrInDT(datain,ctestv=NA,conf.level=0.95,percl,percs,N=99,indind=NA,inddep, minsplit=NA,minbucket=NA)C2SPrInDT(datain,ctestv=NA,conf.level=0.95,percl,percs,N=99,indind=NA,inddep, minsplit=NA,minbucket=NA)
datain |
Input data frame with class factor variables and the influential variables, |
ctestv |
Vector of character strings of forbidden split results; |
conf.level |
(1 - significance level) in function |
percl |
list of undersampling percentages of larger class (numerical, > 0 and <= 1): one per dependent class variable in the same order as in 'inddep' |
percs |
list of undersampling percentage of smaller class (numerical, > 0 and <= 1); one per dependent class variable in the same order as in 'inddep' |
N |
no. of repetitions (integer > 0); default = 99 |
indind |
indices of independent variables at stage 1; default = NA (means all independent variables used) |
inddep |
indices of dependent variables |
minsplit |
Minimum number of elements in a node to be splitted; default = 20 |
minbucket |
Minimum number of elements in a node; default = 7 |
See Buschfeld & Weihs (2025), Optimizing decision trees for the analysis of World Englishes and sociolinguistic data. Cambridge University Press, section 4.5.6.1, for further information.
Standard output can be produced by means of print(name) or just name as well as plot(name) where 'name' is the output data
frame of the function.
Best trees at stage 1
Best trees at stage 2
names of classification variables
balanced accuracies of best trees at both stages
data <- PrInDT::data_land # load data dataclean <- data[,c(1:7,23:24,11:13,22,8:10)] # only relevant features indind <- c(1:9) # original predictors inddep <- c(14:16) # dependent variables dataland <- na.omit(dataclean) ctestv <- NA perc <- c(0.45,0.05,0.25) # percentages of observations of larger class, # 1 per dependent class variable perc2 <- c(0.75,0.95,0.75) # percentages of observations of smaller class, # 1 per dependent class variable outland <- C2SPrInDT(dataland,percl=perc,percs=perc2,N=19,indind=indind,inddep=inddep) outland plot(outland)data <- PrInDT::data_land # load data dataclean <- data[,c(1:7,23:24,11:13,22,8:10)] # only relevant features indind <- c(1:9) # original predictors inddep <- c(14:16) # dependent variables dataland <- na.omit(dataclean) ctestv <- NA perc <- c(0.45,0.05,0.25) # percentages of observations of larger class, # 1 per dependent class variable perc2 <- c(0.75,0.95,0.75) # percentages of observations of smaller class, # 1 per dependent class variable outland <- C2SPrInDT(dataland,percl=perc,percs=perc2,N=19,indind=indind,inddep=inddep) outland plot(outland)
The use of language(s) on public signs and categorization criteria.
data_landdata_land
A data frame with 149 observations and 28 columns
who coded the sign: Factor (3 levels "C","E","S") (anonymized)
who took a photograph of the sign: Factor (2 levels "L","S") (anonymized)
where the sign was found: Factor (11 levels "digi","door","graf",...)
kind of sign: Factor (5 levels "com","commem","infra","reg","trans")
was the sign permanent? Factor (2 levels "no","yes")
kind of proper noun on the sign: Factor (9 levels "bn","bn+","cn",...)
number of languages on sign: Factor (4 levels "1","2","3","4+")
French on sign? Factor (2 levels "0","1")
Dutch on sign? Factor (2 levels "0","1")
English on sign? Factor (2 levels "0","1")
Italian on sign? Factor (2 levels "0","1")
Spanish on sign? Factor (2 levels "0","1")
German on sign? Factor (2 levels "0","1")
12 infrequent languages: Factor (2 levels "0","1")
brand name unclear: Factor (2 levels "0","1")
type of multilingualism on sign: Factor (6 levels "0","1","2","3","4","uc")
location of sign: Factor (2 levels "M","P") (anonymized)
Sarah Buschfeld, TU Dortmund
Usage of subject pronouns and its predictors; speaker level "adult" very frequent.
data_speakerdata_speaker
A data frame with 3370 observations and 6 columns
subject pronoun realized? Factor (2 levels "zero","realized")
age: Numerical (in months)
ethnic group: Factor (3 levels "C","I","n_a") (anonymized)
mean length of utterance: Factor (5 levels "1","2","3","adult","OL")
pronoun type: Factor (5 levels "dem","it_con","it_ex","it_ref","refer")
speaker: Factor (2 levels "adult","child")
Sarah Buschfeld, TU Dortmund
Vowel length and categorization criteria.
data_voweldata_vowel
A data frame with 82 observations and 22 columns
nickname: Factor (43 levels "Nick1","Nick2",...,"Nick43") (anonymized)
linguistic background: Factor (2 levels "mono","multi")
mean length of utterance: Factor (3 levels "1","2","3")
phone label: Factor (2 levels "fleece","kit")
lexeme: Factor (14 levels "bee","cheek","cheese","chicken", ...)
duration of phone to the left of the vowel: Numerical (in msec)
duration of phone to the right of the vowel: Numerical (in msec)
duration of word: Numerical (in msec)
minimum pitch of vowel: Numerical (in Hertz)
maximum pitch of vowel: Numerical (in Hertz)
mean intensity of vowel: Numerical (in decibel)
first formant F1 at midpoint of vowel (50%): Numerical (in Hertz)
second formant F2 at midpoint of vowel (50%): numercial (in Hertz)
vowel length: Numerical (in msec)
class of consonant to the left of the vowel: Factor (6 levels "l","r","tsh",...)
class of consonant to the right of the vowel: Factor (7 levels "?"(glottal stop),"empty","nas",...)
ethnic group: Factor (6 levels "C1a","C1b","C1c","C2a","C2b","C2c") (anonymized)
gender: Factor (2 levels "female","male)
age: Numerical (in months)
number of syllables in lexeme: integer (1,2)
speed of speech: Numerical (word duration / syllables; in msec)
country: Factor (2 levels "E","S") (anonymized)
Sarah Buschfeld, TU Dortmund
Usage of subject pronouns and its predictors.
data_zerodata_zero
A data frame with 1024 observations and 7 columns
subject pronoun realized? Factor (2 levels "zero","realized")
age: Numerical (in months)
linguistic background: Factor (3 levels "mono","multi", NA)
ethnic group: Factor (6 levels "C1a","C1b","C1c","C2a","C2b","C2c") (anonymized)
gender: Factor (2 levels "female","male")
mean length of utterance: Factor (4 levels "1","2","3","OL")
pronoun: Factor (7 levels "I","you_s","he","she","it","we","they")
Sarah Buschfeld, TU Dortmund
The function Mix2SPrInDT applies 'N' repetitions of subsampling for finding an optimal subsample to model
the relationship between the dependent variables specified in the sublist 'targets' of the list 'datalist' and all other factor and numerical variables
in the corresponding data frame specified in the sublist 'datanames' of the list 'datalist' in the same order as 'targets'.
The function is prepared to handle classification tasks with 2 or more classes and regression tasks. At first stage, the targets are estimated only on
the basis of the exogenous predictors. At second stage, summaries of the predictions of the endogenous features from the first stage models are added as new predictors.
Subsampling of observations and predictors uses the percentages specified in the matrix 'percent', one row per estimation taks. For classification tasks,
percentages 'percl' and 'percs' for the larger and the smaller class have to be specified. For regression tasks, 'pobs' and 'ppre' have to specified
for observations and predictors, respectively.
For generating summaries, the variables representing the substructure have to be specified in the sublist 'datastruc' of 'datalist'.
The sublist 'summ' of 'datalist' includes for each discrete target the classes for which you want to calculate the summary percentages and for each
continuous target just NA (in the same order as in the sublist 'targets'). For a discrete target in this list, you can provide a sublist of classes to be combined
(for an example see the below example).
The optimization citerion is the balanced accuracy for classification and R2 for regression, both evaluated on the full sample.
The trees generated from undersampling can be restricted by not accepting trees
including split results specified in the character strings of the vector 'ctestv'.
The parameters 'conf.level', 'minsplit', and 'minbucket' can be used to control the size of the trees.
Mix2SPrInDT(datalist,ctestv=NA,N=99,percent=NA,conf.level=0.95,minsplit=NA,minbucket=NA)Mix2SPrInDT(datalist,ctestv=NA,N=99,percent=NA,conf.level=0.95,minsplit=NA,minbucket=NA)
datalist |
list(datanames,targets,datastruc,summ) Input data: For specification see the above description |
ctestv |
Vector of character strings of forbidden split results; |
N |
Number of repetitions of subsampling for predictors (integer); default = 99 |
percent |
matrix of percent spefications: For specification see the above description; default: 'percent = NA' meaning default values for percentages. |
conf.level |
(1 - significance level) in function |
minsplit |
Minimum number of elements in a node to be splitted; default = 20 |
minbucket |
Minimum number of elements in a node; default = 7 |
See Buschfeld & Weihs (2025), Optimizing decision trees for the analysis of World Englishes and sociolinguistic data. Cambridge University Press, section 4.5.6.1, for further information.
Standard output can be produced by means of print(name) or just name as well as plot(name where 'name' is the output data
frame of the function.
Best trees at stage 1
Best trees at stage 2
names of dependent variables
number of models of tasks
levels of tasks
accuracies of best trees at both stages
# zero data datazero <- PrInDT::data_zero datazero <- na.omit(datazero) # cleaned full data: no NAs names(datazero)[names(datazero)=="real"] <- "zero" CHILDzero <- PrInDT::participant_zero # interpretation restrictions (split exclusions) ctestv <- rbind('ETH == {C2a, C1a}','MLU == {1, 3}') # split exclusions ## # multi-level data datastrat <- PrInDT::data_zero datamult <- na.omit(datastrat) # ctestv <- NA datamult$mult[datamult$ETH %in% c("C1a","C1b","C1c") & datamult$real == "zero"] <- "zero1" datamult$mult[datamult$ETH %in% c("C2a","C2b","C2c") & datamult$real == "zero"] <- "zero2" datamult$mult[datamult$real == "realized"] <- "real" datamult$mult <- as.factor(datamult$mult) # mult is new class variable datamult$real <- NULL # remove old class variable CHILDmult <- CHILDzero ## # vowel data data <- PrInDT::data_vowel data <- na.omit(data) CHILDvowel <- data$Nickname data$Nickname <- NULL syllable <- 3 - data$syllables data$syllabels <- NULL data$syllables <- syllable data$speed <- data$word_duration / data$syllables names(data)[names(data) == "target"] <- "vowel" datavowel <- data ## # function preparation and call datanames <- list("datazero","datamult","datavowel") targets <- c("zero","mult","vowel") datastruc <- list(CHILDzero,CHILDmult,CHILDvowel) summult <- paste("zero1","zero2",sep=",") summ <- c("zero",summult,NA) datalist <- list(datanames=datanames,targets=targets,datastruc=datastruc,summ=summ) percent <- matrix(NA,nrow=3,ncol=2) percent[1,] <- c("percl=0.075","percs=0.9") # percentages for datazero # no percentages needed for datapast percent[3,] <- c("pobs=0.9","ppre=c(0.9,0.8)") # percentages for datavowel out2SMix <- Mix2SPrInDT(datalist,ctestv=ctestv,N=19,percent=percent,conf.level=0.99) out2SMix plot(out2SMix)# zero data datazero <- PrInDT::data_zero datazero <- na.omit(datazero) # cleaned full data: no NAs names(datazero)[names(datazero)=="real"] <- "zero" CHILDzero <- PrInDT::participant_zero # interpretation restrictions (split exclusions) ctestv <- rbind('ETH == {C2a, C1a}','MLU == {1, 3}') # split exclusions ## # multi-level data datastrat <- PrInDT::data_zero datamult <- na.omit(datastrat) # ctestv <- NA datamult$mult[datamult$ETH %in% c("C1a","C1b","C1c") & datamult$real == "zero"] <- "zero1" datamult$mult[datamult$ETH %in% c("C2a","C2b","C2c") & datamult$real == "zero"] <- "zero2" datamult$mult[datamult$real == "realized"] <- "real" datamult$mult <- as.factor(datamult$mult) # mult is new class variable datamult$real <- NULL # remove old class variable CHILDmult <- CHILDzero ## # vowel data data <- PrInDT::data_vowel data <- na.omit(data) CHILDvowel <- data$Nickname data$Nickname <- NULL syllable <- 3 - data$syllables data$syllabels <- NULL data$syllables <- syllable data$speed <- data$word_duration / data$syllables names(data)[names(data) == "target"] <- "vowel" datavowel <- data ## # function preparation and call datanames <- list("datazero","datamult","datavowel") targets <- c("zero","mult","vowel") datastruc <- list(CHILDzero,CHILDmult,CHILDvowel) summult <- paste("zero1","zero2",sep=",") summ <- c("zero",summult,NA) datalist <- list(datanames=datanames,targets=targets,datastruc=datastruc,summ=summ) percent <- matrix(NA,nrow=3,ncol=2) percent[1,] <- c("percl=0.075","percs=0.9") # percentages for datazero # no percentages needed for datapast percent[3,] <- c("pobs=0.9","ppre=c(0.9,0.8)") # percentages for datavowel out2SMix <- Mix2SPrInDT(datalist,ctestv=ctestv,N=19,percent=percent,conf.level=0.99) out2SMix plot(out2SMix)
PrInDT with additional undersampling of a factor with two unbalanced levelsFunction for additional undersampling of the factor 'nesvar' with two unbalanced levels to avoid dominance of the level with higher frequency.
The factor 'nesvar' is allowed not be part of the input data frame 'datain'. The data of this factor is given in the vector 'nesunder'.
The observations in 'nesunder' have to represent the same cases as in 'datain' in the same ordering.PrInDT is called 'repin' times with subsamples of the original data so that the level with the larger frequency in the vector 'nesunder' has
approximately the same number of values as the level with the smaller frequency.
Only the arguments 'nesvar', 'nesunder', and 'repin' relate to the additional undersampling, all the other arguments relate to the standard
PrInDT procedure.
As in PrInDT, the aim is to optimally model the relationship between the two-class factor variable 'classname' and all other factor and
numerical variables in the data frame 'datain' by means of 'N' repetitions of undersampling. The trees generated by PrInDT can be
restricted by excluding unacceptable trees which include split results specified in the character strings of the vector 'ctestv'.
The probability threshold 'thres' for the prediction of the smaller class may be specified (default = 0.5).
Undersampling may be stratified in two ways by the feature 'strat'.
The results are evaluated on the full sample and on the subsamples of 'nesunder'.
The parameters 'conf.level', 'minsplit', and 'minbucket' can be used to control the size of the trees.
Reference
Weihs, C., Buschfeld, S. 2021b. NesPrInDT: Nested undersampling in PrInDT.
arXiv:2103.14931
NesPrInDT(datain,classname,ctestv=NA,N,plarge,psmall=1.0,conf.level=0.95, thres=0.5,stratvers=0,strat=NA,seedl=TRUE,nesvar,nesunder,repin, minsplit=NA,minbucket=NA)NesPrInDT(datain,classname,ctestv=NA,N,plarge,psmall=1.0,conf.level=0.95, thres=0.5,stratvers=0,strat=NA,seedl=TRUE,nesvar,nesunder,repin, minsplit=NA,minbucket=NA)
datain |
Input data frame with class factor variable 'classname' and the |
classname |
Name of class variable (character) |
ctestv |
Vector of character strings of forbidden split results; |
N |
Number of repetitions (integer > 0) |
plarge |
Undersampling percentage of larger class (numerical, > 0 and <= 1) |
psmall |
Undersampling percentage of smaller class (numerical, > 0 and <= 1); |
conf.level |
(1 - significance level) in function |
thres |
Probability threshold for prediction of smaller class; default = 0.5 |
stratvers |
Version of stratification; |
strat |
Name of one (!) stratification variable for undersampling (character); |
seedl |
Should the seed for random numbers be set (TRUE / FALSE)? |
nesvar |
Name of factor to be undersampled (character) |
nesunder |
Data of factor to be undersampled (integer) |
repin |
Number of repetitions (integer) for undersampling of 'nesvar' |
minsplit |
Minimum number of elements in a node to be splitted; |
minbucket |
Minimum number of elements in a node; |
Standard output can be produced by means of print(name) or just name as well as plot(name) where 'name' is the output data
frame of the function.
The plot function will produce a series of more than one plot. If you use R, you might want to specify windows(record=TRUE) before
plot(name) to save the whole series of plots. In R-Studio this functionality is provided automatically.
balanced accuracies on undersamples
indices of best trees on undersamples
balanced accuracies of ensembles of 3 best trees on undersamples
balanced accuracies on full sample
balanced accuracy on full sample of best ensemble of 3 trees from undersampling
indices of best trees on full sample
3 best trees of all undersamples of 'nesunder'; refer to an individual tree as treesb[[k]], k = 1, ..., 3*repin
# data input and preparation --> data frame with # class variable, factors, and numericals (no character variables)!! data <- PrInDT::data_speaker data <- na.omit(data) nesvar <- "SPEAKER" N <- 49 # no. of repetitions in inner loop plarge <- 0.06 # sampling percentage for larger class in nesunder-subsample psmall <- 1 # sampling percentage for smaller class in nesunder-subsample nesunder <- data$SPEAKER data[,nesvar] <- list(NULL) outNes <- NesPrInDT(data,"class",ctestv=NA,N,plarge,psmall,conf.level=0.95,nesvar=nesvar, nesunder=nesunder,repin=5) outNes plot(outNes) hist(outNes$undba,main=" ",xlab = "balanced accuracies of 3 best trees of all undersamples")# data input and preparation --> data frame with # class variable, factors, and numericals (no character variables)!! data <- PrInDT::data_speaker data <- na.omit(data) nesvar <- "SPEAKER" N <- 49 # no. of repetitions in inner loop plarge <- 0.06 # sampling percentage for larger class in nesunder-subsample psmall <- 1 # sampling percentage for smaller class in nesunder-subsample nesunder <- data$SPEAKER data[,nesvar] <- list(NULL) outNes <- NesPrInDT(data,"class",ctestv=NA,N,plarge,psmall,conf.level=0.95,nesvar=nesvar, nesunder=nesunder,repin=5) outNes plot(outNes) hist(outNes$undba,main=" ",xlab = "balanced accuracies of 3 best trees of all undersamples")
The function OptPrInDT applies an iterative technique for finding optimal undersampling percentages
'percl' for the larger class and 'percs' for the smaller class by a nested grid search for the use of the function PrInDT for
the relationship between the two-class factor variable 'classname' and all other factor and numerical variables
in the data frame 'data' by means of 'N' repetitions of undersampling. The optimization citerion is the balanced accuracy
on the validation sample 'valdat' (default = full sample 'data'). The trees generated from undersampling can be restricted by not accepting trees
including split results specified in the character strings of the vector 'ctestv'.
The inputs plmax and psmax determine the maximal values of the percentages and the inputs distl and dists the
the distances to the next smaller percentage to be tried.
The parameters 'conf.level', 'minsplit', and 'minbucket' can be used to control the size of the trees.
The parameter 'steps' controls, how many of the 3 possible optimization steps should be carried out; default=3.
OptPrInDT(data,classname,ctestv=NA,N=99,plmax=0.09,psmax=0.9, distl=0.01,dists=0.1,conf.level=0.95,minsplit=NA,minbucket=NA, valdat=data,steps=3)OptPrInDT(data,classname,ctestv=NA,N=99,plmax=0.09,psmax=0.9, distl=0.01,dists=0.1,conf.level=0.95,minsplit=NA,minbucket=NA, valdat=data,steps=3)
data |
Input data frame with class factor variable 'classname' and the |
classname |
Name of class variable (character) |
ctestv |
Vector of character strings of forbidden split results; |
N |
Number (> 7) of repetitions (integer) |
plmax |
Maximal undersampling percentage of larger class (numerical, > 0 and <= 1); |
psmax |
Maximal undersampling percentage of smaller class (numerical, > 0 and <= 1); |
distl |
Distance to the next lower undersampling percentage of larger class (numerical, > 0 and < 1); |
dists |
Distance to the next lower undersampling percentage of smaller class (numerical, > 0 and < 1); |
conf.level |
(1 - significance level) in function |
minsplit |
Minimum number of elements in a node to be splitted; |
minbucket |
Minimum number of elements in a node; |
valdat |
validation data; default = data |
steps |
number of optimization steps = 1, 2, 3; default = 3 |
See help("RePrInDT") and help("PrInDT") for further information.
Standard output can be produced by means of print(name$besttree) or just name$besttree as well as plot(name$besttree) where 'name' is the output data
frame of the function.
best tree on full sample
balanced accuracy of best tree on full sample
undersampling percentage of large class of best tree on full sample
undersampling percentage of small class of best tree on full sample
datastrat <- PrInDT::data_zero data <- na.omit(datastrat) # cleaned full data: no NAs # interpretation restrictions (split exclusions) ctestv <- rbind('ETH == {C2a, C1a}','MLU == {1, 3}') # split exclusions # call of OptPrInDT out <- OptPrInDT(data,"real",ctestv,N=24,conf.level=0.99,steps=1) # unstratified out # print best model and ensembles as well as all observations plot(out)datastrat <- PrInDT::data_zero data <- na.omit(datastrat) # cleaned full data: no NAs # interpretation restrictions (split exclusions) ctestv <- rbind('ETH == {C2a, C1a}','MLU == {1, 3}') # split exclusions # call of OptPrInDT out <- OptPrInDT(data,"real",ctestv,N=24,conf.level=0.99,steps=1) # unstratified out # print best model and ensembles as well as all observations plot(out)
Participants (random excerpt according to data_zero)
participant_zeroparticipant_zero
A dataset with 1024 observations and 1 column
participant of study (P1 - P51)
Sarah Buschfeld, TU Dortmund
The conditional inference tree 'ct' is analyzed according to the distribution of a variable 'var' in its terminal nodes.
In the case of a discrete variable 'var', the appearance of the different levels is considered for each terminal node.
In the case of a continuous variable 'var', means and standard deviations of 'var' or the target variable are considered for each terminal node.
In particular, this function can be used for the posterior analysis of a tree regarding the distribution of a variable not present in the tree.
PostPrInDT(datain, ct, target, var, vardata, vt)PostPrInDT(datain, ct, target, var, vardata, vt)
datain |
input data frame with the observatios of all variables used in the model |
ct |
conditional inference tree to be analyzed |
target |
name of target variable of 'ct' (character) |
var |
name of variable of interest (character) |
vardata |
observations of 'var' |
vt |
type of variables: 'dd' for discrete target (classification) and discrete variable 'var', 'dc' for discrete target (classification) and continuous 'var', 'cd' for continuous target (regression) and discrete 'var', and 'cc' for continuous target (regression) and continuous 'var'. |
None: Relevant output is produced by the function.
data <- PrInDT::data_zero data <- na.omit(data) outAll <- PrInDTAll(data,"real") PostPrInDT(data,outAll$treeAll,"real","ETH",data$ETH,vt="dd") PostPrInDT(data,outAll$treeAll,"real","AGE",data$AGE,vt="dc") datareg <- PrInDT::data_vowel outregAll <- PrInDTregAll(datareg,"target") PostPrInDT(datareg,outregAll$treeAll,"target","Nickname",datareg$Nickname,vt="cd") PostPrInDT(datareg,outregAll$treeAll,"target","AGE",datareg$AGE,vt="cc")data <- PrInDT::data_zero data <- na.omit(data) outAll <- PrInDTAll(data,"real") PostPrInDT(data,outAll$treeAll,"real","ETH",data$ETH,vt="dd") PostPrInDT(data,outAll$treeAll,"real","AGE",data$AGE,vt="dc") datareg <- PrInDT::data_vowel outregAll <- PrInDTregAll(datareg,"target") PostPrInDT(datareg,outregAll$treeAll,"target","Nickname",datareg$Nickname,vt="cd") PostPrInDT(datareg,outregAll$treeAll,"target","AGE",datareg$AGE,vt="cc")
The function PrInDT uses ctrees (conditional inference trees from the package "party") for optimal modeling of
the relationship between the two-class factor variable 'classname' and all other factor and numerical variables
in the data frame 'datain' by means of 'N' repetitions of undersampling. The optimization citerion is the balanced accuracy
on the validation sample 'valdat' (default = full input sample 'datain'). The trees generated from undersampling can be restricted by not accepting trees
including split results specified in the character strings of the vector 'ctestv'.
The undersampling percentages are 'percl' for the larger class and 'percs' for the smaller class (default = 1).
A probability threshold 'thres' for the prediction of the smaller class may be specified (default = 0.5).
Undersampling may be stratified in two ways by the option 'stratvers'.
The parameters 'conf.level', 'minsplit', and 'minbucket' can be used to control the size of the trees.
In the case of repeated measurements ('indrep=1'), the values of the substructure variable have to be given in 'repvar'.
Only one value of 'classname' is allowed for each value of 'repvar'.
If for a value of 'repvar' the percentage 'thr' of the observed occurence of a value of 'classname' is not reached by the number of predictions of the value of 'classname', a misclassification is detected.
PrInDT(datain,classname,ctestv=NA,N,percl,percs=1,conf.level=0.95,thres=0.5, stratvers=0,strat=NA,seedl=TRUE,minsplit=NA,minbucket=NA,repvar=NA,indrep=0, valdat=datain,thr=0.5)PrInDT(datain,classname,ctestv=NA,N,percl,percs=1,conf.level=0.95,thres=0.5, stratvers=0,strat=NA,seedl=TRUE,minsplit=NA,minbucket=NA,repvar=NA,indrep=0, valdat=datain,thr=0.5)
datain |
Input data frame with class factor variable 'classname' and the |
classname |
Name of class variable (character) |
ctestv |
Vector of character strings of forbidden split results; |
N |
Number (> 2) of repetitions (integer) |
percl |
Undersampling percentage of larger class (numerical, > 0 and <= 1); |
percs |
Undersampling percentage of smaller class (numerical, > 0 and <= 1); |
conf.level |
(1 - significance level) in function |
thres |
Probability threshold for prediction of smaller class (numerical, >= 0 and < 1); default = 0.5 |
stratvers |
Version of stratification; |
strat |
Name of one (!) stratification variable for undersampling (character); |
seedl |
Should the seed for random numbers be set (TRUE / FALSE)? |
minsplit |
Minimum number of elements in a node to be splitted; |
minbucket |
Minimum number of elements in a node; |
repvar |
Values of variable defining the substructure in the case of repeated measurements, length = dim(valdat)[1] necessary; default = NA |
indrep |
Indicator of repeated measurements ('indrep=1'); default = 0 |
valdat |
Validation data; default = datain |
thr |
threshold for element classification: minimum percentage of correct class entries; default = 0.5 |
For the optimzation of the trees, we employ a method we call Sumping (Subsampling umbrella of model parameters), a variant of Bumping (Bootstrap umbrella of model parameters) (Tibshirani & Knight, 1999) which uses subsampling instead of bootstrapping. The aim of the optimization is to identify conditional inference trees with maximum predictive power on the full sample under interpretability restrictions.
References
– Tibshirani, R., Knight, K. 1999. Model Search and Inference By Bootstrap "bumping".
Journal of Computational and Graphical Statistics, Vol. 8, No. 4 (Dec., 1999), pp. 671-686
– Weihs, C., Buschfeld, S. 2021a. Combining Prediction and Interpretation in Decision Trees (PrInDT) -
a Linguistic Example. arXiv:2103.02336
Standard output can be produced by means of print(name) or just name as well as plot(name) where 'name' is the output data
frame of the function.
The plot function will produce a series of more than one plot. If you use R, you might want to specify windows(record=TRUE) before
plot(name) to save the whole series of plots. In R-Studio this functionality is provided automatically.
best tree on validation sample
2nd-best tree on validation sample
3rd-best tree on validation sample
best tree on test sample
2nd-best tree on test sample
3rd-best tree on test sample
accuracies: largeClass, smallClass, balanced of 'tree1st', both for validation and test sample
accuracies: largeClass, smallClass, balanced of 'tree2nd', both for validation and test sample
accuracies: largeClass, smallClass, balanced of 'tree3rd', both for validation and test sample
accuracies: largeClass, smallClass, balanced of ensemble of all interpretable, 3 best acceptable, and all acceptable trees on validation sample
vector of balanced accuracies of all trees from undersampling
vector of test accuracies of all trees from undersampling
tree based on all observations
balanced accuracy of 'treeAll'
criterion of interpretability of 'treeall' (TRUE / FALSE)
confusion matrix of 'treeAll'
Accuracy of full sample tree on Elements of large class
Accuracy of full sample tree on Elements of small class
balanced accuracy of full sample tree on Elements
Names of misclassified Elements by full sample tree of large class
Names of misclassified Elements by full sample tree of small class
Accuracy of best tree on Elements of large class
Accuracy of best tree on Elements of small class
balanced accuracy of best tree on Elements
Names of misclassified Elements by best tree of large class
Names of misclassified Elements by best tree of small class
Label of large class
Label of small class
Validation set after edit
Threshold for repeated measurements
datastrat <- PrInDT::data_zero data <- na.omit(datastrat) # cleaned full data: no NAs # interpretation restrictions (split exclusions) ctestv <- rbind('ETH == {C2a, C1a}','MLU == {1, 3}') # split exclusions N <- 41 # no. of repetitions conf.level <- 0.99 # 1 - significance level (mincriterion) in ctree percl <- 0.08 # undersampling percentage of the larger class percs <- 0.95 # undersampling percentage of the smaller class # calls of PrInDT out <- PrInDT(data,"real",ctestv,N,percl,percs,conf.level) # unstratified out # print best model and ensembles as well as all observations plot(out) out <- PrInDT(data,"real",ctestv,N,percl,percs,conf.level,stratvers=1, strat="SEX") # percentage stratification out <- PrInDT(data,"real",ctestv,N,percl,percs,conf.level,stratvers=50, strat="SEX") # stratification with minimum no. of tokens out <- PrInDT(data,"real",ctestv,N,percl,percs,conf.level,thres=0.4) # threshold = 0.4datastrat <- PrInDT::data_zero data <- na.omit(datastrat) # cleaned full data: no NAs # interpretation restrictions (split exclusions) ctestv <- rbind('ETH == {C2a, C1a}','MLU == {1, 3}') # split exclusions N <- 41 # no. of repetitions conf.level <- 0.99 # 1 - significance level (mincriterion) in ctree percl <- 0.08 # undersampling percentage of the larger class percs <- 0.95 # undersampling percentage of the smaller class # calls of PrInDT out <- PrInDT(data,"real",ctestv,N,percl,percs,conf.level) # unstratified out # print best model and ensembles as well as all observations plot(out) out <- PrInDT(data,"real",ctestv,N,percl,percs,conf.level,stratvers=1, strat="SEX") # percentage stratification out <- PrInDT(data,"real",ctestv,N,percl,percs,conf.level,stratvers=50, strat="SEX") # stratification with minimum no. of tokens out <- PrInDT(data,"real",ctestv,N,percl,percs,conf.level,thres=0.4) # threshold = 0.4
ctree based on all observations in 'datain'. Interpretability is checked (see 'ctestv'); probability threshold can be specified. The parameters 'conf.level', 'minsplit', and 'minbucket' can be used to control the size of the trees.
Reference
Weihs, C., Buschfeld, S. 2021a. Combining Prediction and Interpretation in Decision Trees (PrInDT) -
a Linguistic Example. arXiv:2103.02336
In the case of repeated measurements ('indrep=1'), the values of the substructure variable have to be given in 'repvar'. Only one value of 'classname' is allowed for each value of 'repvar'. If for a value of 'repvar' the percentage 'thr' of the observed occurence of a value of 'classname' is not reached by the number of predictions of the value of 'classname', a misclassification is detected.
PrInDTAll(datain, classname, ctestv=NA, conf.level=0.95, thres=0.5, minsplit=NA,minbucket=NA,repvar=NA,indrep=0,thr=0.5)PrInDTAll(datain, classname, ctestv=NA, conf.level=0.95, thres=0.5, minsplit=NA,minbucket=NA,repvar=NA,indrep=0,thr=0.5)
datain |
Input data frame with class factor variable 'classname' and the |
classname |
Name of class variable (character) |
ctestv |
Vector of character strings of forbidden split results; |
conf.level |
(1 - significance level) in function |
thres |
Probability threshold for prediction of smaller class (numerical, >= 0 and < 1); default = 0.5 |
minsplit |
Minimum number of elements in a node to be splitted; |
minbucket |
Minimum number of elements in a node; |
repvar |
Values of variable defining the substructure in the case of repeated measurements, length = dim(datain)[1] necessary; default=NA |
indrep |
Indicator of repeated measurements ('indrep=1'); default = 0 |
thr |
threshold for element classification: minimum percentage of correct class entries; default = 0.5 |
Standard output can be produced by means of print(name) or just name as well as plot(name) where 'name' is the output data
frame of the function.
ctree based on all observations
balanced accuracy of 'treeall'
criterion of interpretability of 'treeall' (TRUE / FALSE)
confusion matrix of 'treeall'
Accuracy of full sample tree on Elements of large class
Accuracy of full sample tree on Elements of small class
balanced accuracy of full sample tree on Elements
Names of misclassified Elements by full sample tree of large class
Names of misclassified Elements by full sample tree of small class
Label of large class
Label of small class
Threshold for repeated measurements
datastrat <- PrInDT::data_zero data <- na.omit(datastrat) ctestv <- rbind('ETH == {C2a,C1a}','MLU == {1, 3}') conf.level <- 0.99 # 1 - significance level (mincriterion) in ctree outAll <- PrInDTAll(data,"real",ctestv,conf.level) print(outAll) # print model based on all observations plot(outAll) # plot model based on all observationsdatastrat <- PrInDT::data_zero data <- na.omit(datastrat) ctestv <- rbind('ETH == {C2a,C1a}','MLU == {1, 3}') conf.level <- 0.99 # 1 - significance level (mincriterion) in ctree outAll <- PrInDTAll(data,"real",ctestv,conf.level) print(outAll) # print model based on all observations plot(outAll) # plot model based on all observations
ctrees based on the full sample of the smaller class and consecutive parts of the larger class of the nesting variable 'nesvar'.
The variable 'nesvar' has to be part of the data frame 'datain'.
Interpretability is checked (see 'ctestv'); probability threshold can be specified.
The parameters 'conf.level', 'minsplit', and 'minbucket' can be used to control the size of the trees.
Reference
Weihs, C., Buschfeld, S. 2021b. NesPrInDT: Nested undersampling in PrInDT.
arXiv:2103.14931
PrInDTAllparts(datain, classname, ctestv=NA, conf.level=0.95, thres=0.5, nesvar, divt,minsplit=NA,minbucket=NA)PrInDTAllparts(datain, classname, ctestv=NA, conf.level=0.95, thres=0.5, nesvar, divt,minsplit=NA,minbucket=NA)
datain |
Input data frame with class factor variable 'classname' and the |
classname |
Name of class variable (character) |
ctestv |
Vector of character strings of forbidden split results; |
conf.level |
(1 - significance level) in function |
thres |
Probability threshold for prediction of smaller class (numerical, >= 0 and < 1); default = 0.5 |
nesvar |
Name of nesting variable (character) |
divt |
Number of parts of nesting variable nesvar for which models should be determined individually |
minsplit |
Minimum number of elements in a node to be splitted; |
minbucket |
Minimum number of elements in a node; |
Standard output can be produced by means of print(name) or just name where 'name' is the output data
frame of the function.
balanced accuracy of tree on full sample
name of nesting variable
number of consecutive parts of the sample
balanced accuracy of trees on 'divt' consecutive parts of the sample
data <- PrInDT::data_speaker data <- na.omit(data) nesvar <- "SPEAKER" outNesAll <- PrInDTAllparts(data,"class",ctestv=NA,conf.level=0.95,thres=0.5,nesvar,divt=8) outNesAlldata <- PrInDT::data_speaker data <- na.omit(data) nesvar <- "SPEAKER" outNesAll <- PrInDTAllparts(data,"class",ctestv=NA,conf.level=0.95,thres=0.5,nesvar,divt=8) outNesAll
The function PrInDTCstruc applies structured subsampling for finding an optimal subsample to model
the relationship between the two-class factor variable 'classname' and all other factor and numerical variables
in the data frame 'datain' by means of 'N' repetitions of subsampling from a substructure and 'Ni' repetitions of subsampling from the predictors.
The optimization citerion is the balanced accuracy on the validation sample 'valdat' (default is the full input sample 'datain').
Other criteria are possible (cf. parameter description of 'crit'). The trees generated from undersampling can be restricted by not accepting trees
including split results specified in the character strings of the vector 'ctestv'.
The substructure of the observations used for subsampling in modelling is specified by the list 'Struc' which consists of the factor variable 'name' representing the substructure,
the name 'check' of the factor variable with the information about the categories of the substructure, and the matrix 'labs' which specifies the values of 'check'
corresponding to two categories in its rows, i.e. in 'labs[1,]' and 'labs[2,]'. The names of the categories have to be specified by rownames(labs).
See parameter description of 'Struc' for its specification for 'vers="b"' and 'indrep > 0'.
The number of predictors 'Pit' to be included in the model and the number of elements of the substructure 'Eit' have to be specified (lists allowed), and
undersampling of the categories of 'classname' can be controlled by 'undersamp=TRUE/FALSE'.
In structured subsampling, 'N' repetitions of subsampling of the variable 'name' with 'Eit' different elements of each category in 'check' are realized. If 'Eit' is a list, each entry is employed individually. If 'Eit' is larger than the maximum available number of elements with a certain value of 'check', the maximum possible number of elements is used.
Four different versions of structured subsampling exist:
a) just of the elements in the substructure (possibly with additional undersampling) with parameters 'N' and 'Eit',
b) just of the predictors with parameters 'Ni' and 'Pit',
c) of the predictors and for each subset of predictors subsampling of the elements of the substructure (possibly with additional undersampling)
with parameters 'N', 'Ni', 'Eit', and 'Pit', and
d) of the elements of the substructure (possibly with additional undersampling) and for each of these subsets subsampling of the predictors
with the same parameters as version c).
Sampling of the elements of the substructure can be influenced by using weights of the elements ('weights=TRUE') according to the number of appearances of the smaller
class of 'classnames'. This way, elements with more realisations in the smaller class are preferred in modelling.
The parameters 'conf.level', 'minsplit', and 'minbucket' can be used to control the size of the trees.
The parameter 'indrep' indicates repeated measurement situations ('indrep > 0', only implemented for 'crit="ba"'); default = 0.
Repeated measurements are multiple measurements of the same variable taken on the same subjects (or objects) either under different conditions or over two or more time periods.
The variable with the repeatedly observed subjects (or objects) has to be specified by 'name' in 'Struc'. Only one value of 'classname' is allowed for each value of 'Struc$name'.
In case of 'indrep > 0' it is automatically assumed that the same number of
subjects (or objects) of the two classes under study have to be used for model building. Possible such numbers can be specified by 'Eit'.
For indrep=1, models are optimized for individual observations and the classes of the substructure elements are subsequently determined via the parameter 'thr'.
For indrep=2, models are optimized so that the criterion "ba" is maximal for the substructure elements taking 'thr' into account.
For indrep > 0 and 'valdat' unequal 'datain', the variable 'repvar' defines the substructure for 'valdat', length = dim(valdat)[1] necessary; default = NA'
PrInDTCstruc(datain,classname,ctestv=NA,Struc=NA,vers="d",weight=FALSE, Eit=NA,Pit=NA,N=99,Ni=99,undersamp=TRUE,crit="ba",ktest=0, stest=integer(length=0),conf.level=0.95,indrep=0, minsplit=NA,minbucket=NA,repvar=NA,valdat=datain,thr=0.5)PrInDTCstruc(datain,classname,ctestv=NA,Struc=NA,vers="d",weight=FALSE, Eit=NA,Pit=NA,N=99,Ni=99,undersamp=TRUE,crit="ba",ktest=0, stest=integer(length=0),conf.level=0.95,indrep=0, minsplit=NA,minbucket=NA,repvar=NA,valdat=datain,thr=0.5)
datain |
Input data frame with class factor variable 'classname' and the |
classname |
Name of class variable (character) |
ctestv |
Vector of character strings of forbidden split results; |
Struc |
= list(name,check,labs), cf. description for explanations; Struc not needed for vers="b"; for indrep=1, Struc = list(name); |
vers |
Version of structured subsampling: "a", "b", "c", "d", cf. description; |
weight |
Weights to be used for subsampling of elements of substructure (logical, TRUE or FALSE); default = FALSE |
Eit |
List of number of elements of substructure (integers); |
Pit |
List of number of predictors (integers) |
N |
Number of repetitions of subsampling from substructure (integer) |
Ni |
Number of repetitions of subsampling from predictors |
undersamp |
Undersampling of categories of 'classname' to be used ((logical, TRUE or FALSE) |
crit |
Optimisation criterion: "ba" for balanced accuracy, "bat" for balanced accuracy on test sets, "ta" for test accuracy,
"tal" for test accuracy of continuing parts of length 'ktest' in substructure elements 'stest'; |
ktest |
Length of continuing parts to be tested (for crit="tal"); |
stest |
Part of substructure to be tested (for crit="tal")(integer vector); |
conf.level |
(1 - significance level) in function |
indrep |
Indicator for repeated measurements, i.e. more than one observation with the same class for each element; |
minsplit |
Minimum number of elements in a node to be splitted; |
minbucket |
Minimum number of elements in a node; |
repvar |
Values of variable defining the substructure corresponding to valdat in the case of repeated measurements, length = dim(valdat)[1] necessary; default = NA |
valdat |
validation data; default = datain |
thr |
threshold for element classfication: minimum percentage of correct class entries; default = 0.5 |
See Buschfeld & Weihs (2025), Optimizing decision trees for the analysis of World Englishes and sociolinguistic data. Cambridge University Press, section 4.5.1, for further information.
Standard output can be produced by means of print(name$besttree) or just name$besttree as well as plot(name$besttree) where 'name' is the output data
frame of the function.
Best tree
Number of interpretable trees, overall number of trees
Number of predictors in training set for best tree
Size of training set for best tree
Accuracy of best tree on large class
Accuracy of best tree on small class
Balanced accuracy of best tree
Test accuracy of best tree
Number of elements with misclassified parts longer than 'ktest' for best tree
Elements with long misclassified parts for best tree
Indices of long misclassified parts for best tree
Elements of 1st category of substructure used by best tree
Elements of 2nd category of substructure used by best tree
Predictors used by best tree
Training set for best tree
Test set for best tree
labs from Struc
Label of large class
Label of small class
Version used for structured subsampling
Accuracy of large class on Elements of best tree
Accuracy of small class on Elements of best tree
Balanced accuracy of best tree on Elements
Names of misclassified Elements of large class
Names of misclassified Elements of small class
Threshold for element classification
data <- PrInDT::data_zero data <- na.omit(data) # cleaned full data: no NAs # interpretation restrictions (split exclusions) ctestv <- rbind('ETH == {C2a, C1a}','MLU == {1, 3}') # split exclusions # substructure name <- PrInDT::participant_zero check <- "data$ETH" labs <- matrix(1:6,nrow=2,ncol=3) labs[1,] <- c("C1a","C1b","C1c") labs[2,] <- c("C2a","C2b","C2c") rownames(labs) <- c("Children 1","Children 2") Struc <- list(name=name,check=check,labs=labs) out <- PrInDTCstruc(data,"real",ctestv,Struc,vers="c",weight=TRUE,N=5,Pit=5,conf.level=0.99) out plot(out) # indrep = 1 Struc <- list(name=name) out <- PrInDTCstruc(data,"real",Struc=Struc,vers="c",Pit=5,Eit=5,N=5,crit="ba",indrep=1, conf.level=0.99)data <- PrInDT::data_zero data <- na.omit(data) # cleaned full data: no NAs # interpretation restrictions (split exclusions) ctestv <- rbind('ETH == {C2a, C1a}','MLU == {1, 3}') # split exclusions # substructure name <- PrInDT::participant_zero check <- "data$ETH" labs <- matrix(1:6,nrow=2,ncol=3) labs[1,] <- c("C1a","C1b","C1c") labs[2,] <- c("C2a","C2b","C2c") rownames(labs) <- c("Children 1","Children 2") Struc <- list(name=name,check=check,labs=labs) out <- PrInDTCstruc(data,"real",ctestv,Struc,vers="c",weight=TRUE,N=5,Pit=5,conf.level=0.99) out plot(out) # indrep = 1 Struc <- list(name=name) out <- PrInDTCstruc(data,"real",Struc=Struc,vers="c",Pit=5,Eit=5,N=5,crit="ba",indrep=1, conf.level=0.99)
PrInDT
Multiple label classification based on resampling by PrInDT. We consider two ways of modeling (Binary relevance modeling,
dependent binary modeling) and three ways of model evaluation: single assessment, joint assessment, and true prediction
(see the Value section for more information).
Variables should be arranged in 'datain' according to indices specified in 'indind', 'indaddind', and 'inddep'.
Please note that the dependent variables have to be specified as dummies, i.e. as 'absent' (value 0) or 'present' (value 1).
Undersampling is repeated 'N' times.
Undersampling percentages 'percl' for the larger class and 'percs' for the smaller class can be
specified, one each per dependent class variable.
The parameters 'conf.level', 'minsplit', and 'minbucket' can be used to control the size of the trees.
References
- Buschfeld, S., Weihs, C. and Ronan, P.. 2024. Modeling linguistic landscapes: A comparison of St Martin’s two capitals Philipsburg and Marigot Linguistic Landscape 10(3): 302–334. https://doi.org/10.1075/ll.23070.bus
- Probst, P., Au, Q., Casalicchio, G., Stachl, C., and Bischl, B. 2017. Multilabel Classification with R Package mlr. arXiv:1703.08991v2
PrInDTMulab(datain,classnames=NA,ctestv=NA,conf.level=0.95,percl,percs=1,N, indind=NA,indaddind=NA,inddep,minsplit=NA,minbucket=NA)PrInDTMulab(datain,classnames=NA,ctestv=NA,conf.level=0.95,percl,percs=1,N, indind=NA,indaddind=NA,inddep,minsplit=NA,minbucket=NA)
datain |
Input data frame with class factor variable 'classname' and the |
classnames |
names of class variables (character vector); default = NA: from old version: superfluous now |
ctestv |
Vector of character strings of forbidden split results; |
conf.level |
(1 - significance level) in function |
percl |
list of undersampling percentages of larger class (numerical, > 0 and <= 1): one per dependent class variable |
percs |
list of undersampling percentage of smaller class (numerical, > 0 and <= 1); one per dependent class variable |
N |
no. of repetitions (integer > 0) |
indind |
indices of independent variables |
indaddind |
indices of additional independent variables used in the case of dependent binary relevance modeling |
inddep |
indices of dependent variables |
minsplit |
Minimum number of elements in a node to be splitted; |
minbucket |
Minimum number of elements in a node; |
Standard output can be produced by means of print(name) or just name as well as plot(name) where 'name' is the output data
frame of the function.
The plot function will produce a series of more than one plot. If you use R, you might want to specify windows(record=TRUE) before
plot(name) to save the whole series of plots. In R-Studio this functionality is provided automatically.
model errors for Binary Relevance (single assessment) - only independent predictors are used for modeling one label at a time, the other labels are not used as predictors. As the performance measure for the resulting classification rules, the balanced accuracy of the best model from PrInDT is employed for each individual label.
combined error for Binary Relevance (joint assessment) - the best prediction models for the different labels are combined to assess the combined prediction. The 01-accuracy counts a label combination as correct only if all labels are correctly predicted. The hamming accuracy corresponds to the proportion of labels whose value is correctly predicted.
model errors for Dependent Binary Relevance (Extended Model) (single assessment) - each label is trained by means of an extended model which not only includes the independent predictors but also the other labels. For these labels, the truly observed values are used for estimation and prediction. In the extended model, other labels, which are not treated as dependent variables, can also be used as additional predictors.
combined errors for Dependent Binary Relevance (Extended Model) (joint assessment)
combined errors for Dependent Binary Relevance (True Prediction) - in the prediction phase, the values of all modeled labels are first predicted by the independent predictors only and then the predicted labels are used in the estimated extended model in a 2nd step to ultimately predict the labels.
column names of input data
indices of dependent variables (labels to be modeled)
list of trees from Binary Relevance modeling, one tree for each label; refer to an individual tree as treebr[[i]],
i = 1, ..., no. of labels
list of trees from Dependent Binary Relevance modeling, one for each label; refer to an individual tree as treedbr[[i]],
i = 1, ..., no. of labels
data <- PrInDT::data_land # load data dataclean <- data[,c(1:7,23:24,11:13,22,8:10)] # only relevant features indind <- c(1:9) # original predictors indaddind <- c(10:13) # additional predictors inddep <- c(14:16) # dependent variables dataclean <- na.omit(dataclean) ctestv <- NA N <- 21 # no. of repetitions perc <- c(0.45,0.05,0.25) # percentages of observations of larger class, # 1 per dependent class variable perc2 <- c(0.75,0.95,0.75) # percentages of observations of smaller class, # 1 per dependent class variable ## # Call PrInDT: language by language ## outmult <- PrInDTMulab(dataclean,ctestv=NA,conf.level=0.95,percl=perc,percs=perc2,N=N, indind=indind,indaddind=indaddind,inddep=inddep) print(outmult) plot(outmult)data <- PrInDT::data_land # load data dataclean <- data[,c(1:7,23:24,11:13,22,8:10)] # only relevant features indind <- c(1:9) # original predictors indaddind <- c(10:13) # additional predictors inddep <- c(14:16) # dependent variables dataclean <- na.omit(dataclean) ctestv <- NA N <- 21 # no. of repetitions perc <- c(0.45,0.05,0.25) # percentages of observations of larger class, # 1 per dependent class variable perc2 <- c(0.75,0.95,0.75) # percentages of observations of smaller class, # 1 per dependent class variable ## # Call PrInDT: language by language ## outmult <- PrInDTMulab(dataclean,ctestv=NA,conf.level=0.95,percl=perc,percs=perc2,N=N, indind=indind,indaddind=indaddind,inddep=inddep) print(outmult) plot(outmult)
Multiple label classification based on all observations. We consider two ways of modeling (Binary relevance modeling,
dependent binary modeling) and three ways of model evaluation: single
assessment, joint assessment, and true prediction (see the Value section for more information).
Interpretability is checked (see ctestv).
Variables should be arranged in 'datain' according to indices specified in 'indind', 'indaddind', and 'inddep'.
Please note that the dependent variables have to be specified as dummies, i.e. as 'absent' (value 0) or 'present' (value 1).
The parameters 'conf.level', 'minsplit', and 'minbucket' can be used to control the size of the trees.
Reference
Probst, P., Au, Q., Casalicchio, G., Stachl, C., and Bischl, B. 2017. Multilabel Classification with
R Package mlr. arXiv:1703.08991v2
PrInDTMulabAll(datain,classnames=NA,ctestv=NA,conf.level=0.95,indind=NA, indaddind=NA,inddep,minsplit=NA,minbucket=NA)PrInDTMulabAll(datain,classnames=NA,ctestv=NA,conf.level=0.95,indind=NA, indaddind=NA,inddep,minsplit=NA,minbucket=NA)
datain |
Input data frame with class factor variable 'classname' and the |
classnames |
names of class variables (character vector) |
ctestv |
Vector of character strings of forbidden split results; |
conf.level |
(1 - significance level) in function |
indind |
indices of independent variables |
indaddind |
indices of additional predictors used in the case of dependent binary relevance modeling |
inddep |
indices of dependent variables |
minsplit |
Minimum number of elements in a node to be splitted; |
minbucket |
Minimum number of elements in a node; |
Standard output can be produced by means of print(name) or just name as well as plot(name) where 'name' is the output data
frame of the function.
The plot function will produce a series of more than one plot. If you use R, you might want to specify windows(record=TRUE) before
plot(name) to save the whole series of plots. In R-Studio this functionality is provided automatically.
model errors for Binary Relevance (single assessment) - only independent predictors are used for modeling one label at a time, the other labels are not used as predictors. The classification rules are trained on all observations. As the performance measure for the resulting classification rules, the balanced accuracy of the models for each individual label is employed.
combined error for Binary Relevance (joint assessment) - the best prediction models for the different labels are combined to assess the combined prediction. The 01-accuracy counts a label combination as correct only if all labels are correctly predicted. The hamming accuracy corresponds to the proportion of labels whose value is correctly predicted.
model errors in Dependent Binary Relevance (Extended Model) (single assessment) - each label is trained by means of an extended model which not only includes the independent predictors but also the other labels. For these labels the truly observed values are used for estimation and prediction. In the extended model, further labels, which are not treated as dependent variables, can be used as additional predictors.
combined errors for Dependent Binary Relevance (Extended Model) (joint assessment)
combined errors for Dependent Binary Relevance (True Prediction) - in the prediction phase, the values of all modeled labels are first predicted by the independent predictors only (see Binary Relevance) and then the predicted labels are used in the estimated extended model in a 2nd step to ultimately predict the labels.
column names of input data
indices of dependent variables (labels to be modeled)
list of trees from Binary Relevance modeling, one tree for each label; refer to an individual tree as treeabr[[i]],
i = 1, ..., no. of labels
list of trees from Dependent Binary Relevance modeling, one for each label; refer to an individual tree as treeadbr[[i]],
i = 1, ..., no. of labels
data <- PrInDT::data_land # load data dataclean <- data[,c(1:7,23:24,11:13,22,8:10)] # only relevant features indind <- c(1:9) # original predictors indaddind <- c(10:13) # additional predictors inddep <- c(14:16) # dependent variables dataclean <- na.omit(dataclean) ctestv <- NA ## # Call PrInDTAll: language by language ## outmultAll <- PrInDTMulabAll(dataclean,colnames(dataclean)[inddep],ctestv,conf.level=0.95, indind,indaddind,inddep) outmultAll plot(outmultAll)data <- PrInDT::data_land # load data dataclean <- data[,c(1:7,23:24,11:13,22,8:10)] # only relevant features indind <- c(1:9) # original predictors indaddind <- c(10:13) # additional predictors inddep <- c(14:16) # dependent variables dataclean <- na.omit(dataclean) ctestv <- NA ## # Call PrInDTAll: language by language ## outmultAll <- PrInDTMulabAll(dataclean,colnames(dataclean)[inddep],ctestv,conf.level=0.95, indind,indaddind,inddep) outmultAll plot(outmultAll)
PrInDT analysis for a classification problem with more than 2 classes. For each combination of one class vs.
the other classes a 2-class PrInDT analysis is carried out.
The percentages for undersampling of the larger class ('percl' in PrInDT) are chosen so that the resulting sizes
are comparable with the size of the smaller classes for which all their observations are used in undersampling ('percs' = 1 in PrInDT).
The class with the highest probability in the K (= number of classes) analyses is chosen for prediction.
Interpretability is checked (see 'ctestv').
The parameters 'conf.level', 'minsplit', and 'minbucket' can be used to control the size of the trees.
PrInDTMulev(datain,classname,ctestv=NA,N,conf.level=0.95,seedl=FALSE, minsplit=NA,minbucket=NA)PrInDTMulev(datain,classname,ctestv=NA,N,conf.level=0.95,seedl=FALSE, minsplit=NA,minbucket=NA)
datain |
Input data frame with class factor variable 'classname' and the |
classname |
Name of class variable (character) |
ctestv |
Vector of character strings of forbidden split results; |
N |
Number of repetitions (integer > 0) |
conf.level |
(1 - significance level) in function |
seedl |
Should the seed for random numbers be set (TRUE / FALSE)? |
minsplit |
Minimum number of elements in a node to be splitted; |
minbucket |
Minimum number of elements in a node; |
Standard output can be produced by means of print(name) or just name as well as plot(name) where 'name' is the output data
frame of the function.
The plot function will produce a series of more than one plot. If you use R, you might want to specify windows(record=TRUE) before
plot(name) to save the whole series of plots. In R-Studio this functionality is provided automatically.
levels of class variable
trees for the levels of the class variable; refer to an individual tree as trees[[k]], k = 1, ..., no. of levels
balanced accuracy of combined predictions
confusion matrix of combined predictions
no. of non-interpretable trees
balanced accuracies of best models for individual classes
datastrat <- PrInDT::data_zero data <- na.omit(datastrat) ctestv <- NA data$rel[data$ETH %in% c("C1a","C1b","C1c") & data$real == "zero"] <- "zero1" data$rel[data$ETH %in% c("C2a","C2b","C2c") & data$real == "zero"] <- "zero2" data$rel[data$real == "realized"] <- "real" data$rel <- as.factor(data$rel) # rel is new class variable data$real <- NULL # remove old class variable N <- 51 conf.level <- 0.99 # 1 - significance level (mincriterion) in ctree out <- PrInDTMulev(data,"rel",ctestv,N,conf.level) out # print best models based on subsamples plot(out) # corresponding plotsdatastrat <- PrInDT::data_zero data <- na.omit(datastrat) ctestv <- NA data$rel[data$ETH %in% c("C1a","C1b","C1c") & data$real == "zero"] <- "zero1" data$rel[data$ETH %in% c("C2a","C2b","C2c") & data$real == "zero"] <- "zero2" data$rel[data$real == "realized"] <- "real" data$rel <- as.factor(data$rel) # rel is new class variable data$real <- NULL # remove old class variable N <- 51 conf.level <- 0.99 # 1 - significance level (mincriterion) in ctree out <- PrInDTMulev(data,"rel",ctestv,N,conf.level) out # print best models based on subsamples plot(out) # corresponding plots
ctree for more than 2 classes on all observations. Interpretability is checked (see 'ctestv').
The parameters 'conf.level', 'minsplit', and 'minbucket' can be used to control the size of the trees.
PrInDTMulevAll(datain,classname,ctestv=NA,conf.level=0.95,minsplit=NA,minbucket=NA)PrInDTMulevAll(datain,classname,ctestv=NA,conf.level=0.95,minsplit=NA,minbucket=NA)
datain |
Input data frame with class factor variable 'classname' and the |
classname |
Name of class variable (character) |
ctestv |
Vector of character strings of forbidden split results; |
conf.level |
(1 - significance level) in function |
minsplit |
Minimum number of elements in a node to be splitted; |
minbucket |
Minimum number of elements in a node; |
Standard output can be produced by means of print(name) or just name as well as plot(name) where 'name' is the output data
frame of the function.
ctree based on all observations
balanced accuracy of 'treeall'
criterion of interpretability of 'treeall' (TRUE / FALSE)
confusion matrix of 'treeall'
datastrat <- PrInDT::data_zero data <- na.omit(datastrat) ctestv <- rbind('ETH == {C2a,C1a}', 'MLU == {1, 3}') data$rel[data$ETH %in% c("C1a","C1b","C1c") & data$real == "zero"] <- "zero1" data$rel[data$ETH %in% c("C2a","C2b","C2c") & data$real == "zero"] <- "zero2" data$rel[data$real == "realized"] <- "real" data$rel <- as.factor(data$rel) # rel is new class variable data$real <- NULL # remove old class variable conf.level <- 0.99 # 1 - significance level (mincriterion) in ctree outAll <- PrInDTMulevAll(data,"rel",ctestv,conf.level) outAll # print model based on all observations plot(outAll)datastrat <- PrInDT::data_zero data <- na.omit(datastrat) ctestv <- rbind('ETH == {C2a,C1a}', 'MLU == {1, 3}') data$rel[data$ETH %in% c("C1a","C1b","C1c") & data$real == "zero"] <- "zero1" data$rel[data$ETH %in% c("C2a","C2b","C2c") & data$real == "zero"] <- "zero2" data$rel[data$real == "realized"] <- "real" data$rel <- as.factor(data$rel) # rel is new class variable data$real <- NULL # remove old class variable conf.level <- 0.99 # 1 - significance level (mincriterion) in ctree outAll <- PrInDTMulevAll(data,"rel",ctestv,conf.level) outAll # print model based on all observations plot(outAll)
The function PrInDTreg realizes regression tree optimization to identify the best interpretable tree; interpretability is checked (see 'ctestv').
The relationship between the target variable 'regname' and all other factor and numerical variables
in the data frame 'datain' is optimally modeled by means of 'N' repetitions of subsampling.
The optimization criterion is the R2 of the model on the validation sample 'valdat'.
Default for the validation sample is the full sample 'datain'.
Multiple subsampling percentages of observations and predictors can be specified (in 'pobs' and 'ppre', correspondingly).
The trees generated from subsampling can be restricted by
rejecting unacceptable trees which include split results specified in the character strings of the vector 'ctestv'.
The parameters 'conf.level', 'minsplit', and 'minbucket' can be used to control the size of the trees.
Besides the maximal R2, the minimal MAE (Mean Absolute Error) is reported.
PrInDTreg(datain,regname,ctestv=NA,N,pobs=c(0.9,0.7),ppre=c(0.9,0.7), conf.level=0.95,seedl=TRUE,minsplit=NA,minbucket=NA,valdat=datain)PrInDTreg(datain,regname,ctestv=NA,N,pobs=c(0.9,0.7),ppre=c(0.9,0.7), conf.level=0.95,seedl=TRUE,minsplit=NA,minbucket=NA,valdat=datain)
datain |
Input data frame with class factor variable 'classname' and the |
regname |
name of regressand variable (character) |
ctestv |
Vector of character strings of forbidden split results; |
N |
Number of repetitions (integer > 0) |
pobs |
Vector of resampling percentages of observations (numerical, > 0 and <= 1) |
ppre |
Vector of resampling percentages of predictor variables (numerical, > 0 and <= 1) |
conf.level |
(1 - significance level) in function |
seedl |
Should the seed for random numbers be set (TRUE / FALSE)? |
minsplit |
Minimum number of elements in a node to be splitted; |
minbucket |
Minimum number of elements in a node; |
valdat |
Validation data; default = datain |
For the optimzation of the trees, we employ a method we call Sumping (Subsampling umbrella of model parameters), a variant of Bumping (Bootstrap umbrella of model parameters) (Tibshirani & Knight, 1999). We use subsampling instead of bootstrapping. The aim of the optimization is to identify conditional inference trees with maximum predictive power on the full sample under interpretability restrictions.
Reference
Tibshirani, R., Knight, K. 1999. Model Search and Inference By Bootstrap "bumping".
Journal of Computational and Graphical Statistics, Vol. 8, No. 4 (Dec., 1999), pp. 671-686
Standard output can be produced by means of print(name) or just name as well as plot(name) where 'name' is the output data
frame of the function.
The plot function will produce a series of more than one plot. If you use R, you might want to specify windows(record=TRUE) before
plot(name) to save the whole series of plots. In R-Studio this functionality is provided automatically.
Mean number of interpretable trees over the combinations of individual percentages in 'pobs' and 'ppre'
best resampled regression tree on the validation data set
Best model achieved for %observations
Best model achieved for %predictors
maximum R2 on the validation data set for resampled regression trees (for 'ctmax')
minimum MAE (Mean Absolute Error) on the validation data set for resampled regression trees (for 'ctmax')
interpretability of best tree 'ctmax'
second best resampled regression tree on the full data set
second best model achieved for %observations
second best model achieved for %features
second best R2 on the validation data set for resampled regression trees (for 'ctmax2')
second best MAE on the validation data set for resampled regression trees (for 'ctmax2')
interpretability of second-best tree 'ctmax2'
data <- PrInDT::data_vowel data <- na.omit(data) ctestv <- 'vowel_maximum_pitch <= 320' N <- 30 # no. of repetitions pobs <- c(0.70,0.60) # percentages of observations ppre <- c(0.90,0.70) # percentages of predictors outreg <- PrInDTreg(data,"target",ctestv,N,pobs,ppre) outreg plot(outreg)data <- PrInDT::data_vowel data <- na.omit(data) ctestv <- 'vowel_maximum_pitch <= 320' N <- 30 # no. of repetitions pobs <- c(0.70,0.60) # percentages of observations ppre <- c(0.90,0.70) # percentages of predictors outreg <- PrInDTreg(data,"target",ctestv,N,pobs,ppre) outreg plot(outreg)
Regression tree based on the full sample; interpretability is checked (see 'ctestv').
The relationship between the target variable 'regname' and all other factor and numerical variables
in the data frame 'datain' is modeled based on all observations.
The parameters 'conf.level', 'minsplit', and 'minbucket' can be used to control the size of the trees.
Besides the maximal R2, the minimal MAE (Mean Absolute Error) is reported.
PrInDTregAll(datain,regname,ctestv=NA,conf.level=0.95,minsplit=NA,minbucket=NA)PrInDTregAll(datain,regname,ctestv=NA,conf.level=0.95,minsplit=NA,minbucket=NA)
datain |
Input data frame with class factor variable 'classname' and the |
regname |
name of regressand variable (character) |
ctestv |
Vector of character strings of forbidden split results; |
conf.level |
(1 - significance level) in function |
minsplit |
Minimum number of elements in a node to be splitted; |
minbucket |
Minimum number of elements in a node; |
Standard output can be produced by means of print(name) or just name as well as plot(name) where 'name' is the output data
frame of the function.
tree based on all observations
goodness of fit of 'treeall' based on all observations
MAE of 'treeall' based on all observations
criterion of interpretability of 'treeall' (TRUE / FALSE)
data <- PrInDT::data_vowel data <- na.omit(data) ctestv <- 'vowel_maximum_pitch <= 320' outreg <- PrInDTregAll(data,"target",ctestv) outreg plot(outreg)data <- PrInDT::data_vowel data <- na.omit(data) ctestv <- 'vowel_maximum_pitch <= 320' outreg <- PrInDTregAll(data,"target",ctestv) outreg plot(outreg)
The function PrInDTRstruc applies structured subsampling for finding an optimal subsample to model
the relationship between the continuous variable 'regname' and all other factor and numerical variables
in the data frame 'datain' by means of 'M' repetitions of subsampling from a substructure and 'N' repetitions of subsampling from the predictors.
The optimization citerion is the goodness of fit R2 on the validation sample 'valdat' (default = 'datain').
The trees generated from undersampling can be restricted by not accepting trees
including split results specified in the character strings of the vector 'ctestv'.
The substructure of the observations used for subsampling is specified by the list 'Struc' which consists of the factor variable 'name' representing the substructure,
the name 'check' of the factor variable with the information about the categories of the substructure, and the matrix 'labs' which specifies the values of 'check'
corresponding to two categories in its rows, i.e. in 'labs[1,]' and 'labs[2,]'. The names of the categories have to be specified by rownames(labs).
In structured subsampling, 'M' repetitions of subsampling of the variable 'name' with 'Mit' different elements of each category in 'check' are realized. If 'Mit' is a list, each entry is employed individually. If 'Mit' is larger than the maximum available number of elements with a certain value of 'check', the maximum possible number of elements is used.
The number of predictors 'Pit' to be included in the model and the number of elements of the substructure 'Mit' have to be specified (lists allowed).
The percentages of involved observations and predictors can be controlled by the parameters 'pobs' and 'ppre', respectively.
The parameter 'Struc' is needed for all versions of subsampling except "b". Four different versions of structured subsampling exist:
a) just of the elements in the substructure with parameters 'M' and 'Mit',
b) just of the predictors with parameters 'N' and 'Pit',
c) of the predictors and for each subset of predictors subsampling of the elements of the substructure with parameters 'M', 'N', 'Mit', 'Pit', 'pobs', and 'ppre', and
d) of the elements of the substructure and for each of these subsets subsampling of the predictors with the same parameters as version c).
The parameters 'conf.level', 'minsplit', and 'minbucket' can be used to control the size of the trees.
Besides the maximal R2, the minimal MAE (Mean Absolute Error) is reported.
Repeated measurements can also be handled by this function (indrep=1). They are multiple measurements of the same variable taken on the same subjects (or objects) either under different conditions
or over two or more time periods.
The variable with the repeatedly observed subjects (or objects) is assumed to be 'name' in 'Struc'.
The measure MAE is split according to the values of 'name'.
PrInDTRstruc(datain,regname,ctestv=NA,Struc=NA,vers="d",M=NA,Mit=NA,N=99,Pit=NA, pobs=c(0.9,0.7),ppre=c(0.9,0.7),conf.level=0.95,minsplit=NA,minbucket=NA, valdat=datain,indrep=0)PrInDTRstruc(datain,regname,ctestv=NA,Struc=NA,vers="d",M=NA,Mit=NA,N=99,Pit=NA, pobs=c(0.9,0.7),ppre=c(0.9,0.7),conf.level=0.95,minsplit=NA,minbucket=NA, valdat=datain,indrep=0)
datain |
Input data frame with continuous target variable 'regname' and the |
regname |
Name of target variable (character) |
ctestv |
Vector of character strings of forbidden split results; |
Struc |
= list(name,check,labs), cf. description for explanations; Struc not needed for vers="b" |
vers |
Version of structured subsampling: "a", "b", "c", "d", cf. description; |
M |
Number of repetitions of subsampling from substructure (integer) in versions "a" and "d"; |
Mit |
List of number of elements of substructure (integers); |
N |
Number of repetitions of subsampling from predictors (integer) in versions "b" and "c"; |
Pit |
List of number of predictors (integers) |
pobs |
Percentage(s) of observations for subsampling in versions "c" and "d"; |
ppre |
Percentage(s) of predictors for subsampling in versions "c" and "d"; |
conf.level |
(1 - significance level) in function |
minsplit |
Minimum number of elements in a node to be splitted; |
minbucket |
Minimum number of elements in a node; |
valdat |
Validation data; default = datain |
indrep |
Indicator for repeated measurements, i.e. more than one observation with the same class for each element; |
See Buschfeld & Weihs (2025), Optimizing decision trees for the analysis of World Englishes and sociolinguistic data. Cambridge University Press, section 4.5.4, for further information.
Standard output can be produced by means of print(name$besttree) or just name$besttree as well as plot(name$besttree) where 'name' is the output data
frame of the function.
Best tree
Number of interpretable trees, overall number of trees
Size of training set for best tree
R squared of best tree
Mean R squared of objects in substructure
MAE (Mean Absolute Error) of best tree
Mean MAE of objects in substructure
Elements of 1st category of substructure used by best tree
Elements of 2nd category of substructure used by best tree
Predictors used by best tree
Training set for best tree
labs from Struc
Version used for structured subsampling
Number of different numbers of substructure elements
Number of different numbers of predictors
Number of repetitions of selection of substructure elements
Number of repetitions of selection of predictors
Indicator of repeated measurements: indrep=1
data <- PrInDT::data_vowel data <- na.omit(data) CHILDvowel <- data$Nickname data$Nickname <- NULL data$syllables <- 3 - data$syllables data$speed <- data$word_duration / data$syllables ## NEW NEW names(data)[names(data) == "target"] <- "vowel_length" # interpretation restrictions (split exclusions) ctestv <- rbind('ETH == {C2a, C1a}','MLU == {1, 3}') # split exclusions name <- CHILDvowel check <- "data$ETH" labs <- matrix(1:6,nrow=2,ncol=3) labs[1,] <- c("C1a","C1b","C1c") labs[2,] <- c("C2a","C2b","C2c") rownames(labs) <- c("children 1","children 2") Struc <- list(name=name,check=check,labs=labs) outstruc <- PrInDTRstruc(data,"vowel_length",ctestv=ctestv,Struc=Struc,vers="d", M=3,Mit=21,N=9,pobs=c(0.95,0.7),ppre=c(1,0.7),conf.level=0.99) outstruc plot(outstruc)data <- PrInDT::data_vowel data <- na.omit(data) CHILDvowel <- data$Nickname data$Nickname <- NULL data$syllables <- 3 - data$syllables data$speed <- data$word_duration / data$syllables ## NEW NEW names(data)[names(data) == "target"] <- "vowel_length" # interpretation restrictions (split exclusions) ctestv <- rbind('ETH == {C2a, C1a}','MLU == {1, 3}') # split exclusions name <- CHILDvowel check <- "data$ETH" labs <- matrix(1:6,nrow=2,ncol=3) labs[1,] <- c("C1a","C1b","C1c") labs[2,] <- c("C2a","C2b","C2c") rownames(labs) <- c("children 1","children 2") Struc <- list(name=name,check=check,labs=labs) outstruc <- PrInDTRstruc(data,"vowel_length",ctestv=ctestv,Struc=Struc,vers="d", M=3,Mit=21,N=9,pobs=c(0.95,0.7),ppre=c(1,0.7),conf.level=0.99) outstruc plot(outstruc)
The function R2SPrInDT applies 'N' repetitions of subsampling for finding an optimal subsample to model
the relationship between the continuous variables with indices 'inddep' and all other factor and numerical variables
in the data frame 'datain'.
Subsampling of observations and predictors uses the percentages in 'pobs1' and 'ppre1', respectively, at stage 1, and the percentages 'pobs2' and 'ppre2'
at stage 2, accordingly.
The optimization criterion is the goodness of fit R2 on the full sample.
The trees generated from undersampling can be restricted by not accepting trees
including split results specified in the character strings of the vector 'ctestv'.
The parameters 'conf.level', 'minsplit', and 'minbucket' can be used to control the size of the trees.
R2SPrInDT(data,ctestv=NA,inddep,N=99,pobs1=c(0.90,0.70),ppre1=c(0.90,0.70), pobs2=pobs1,ppre2=ppre1,conf.level=0.95,minsplit=NA,minbucket=NA)R2SPrInDT(data,ctestv=NA,inddep,N=99,pobs1=c(0.90,0.70),ppre1=c(0.90,0.70), pobs2=pobs1,ppre2=ppre1,conf.level=0.95,minsplit=NA,minbucket=NA)
data |
Input data frame with continuous target variable 'regname' and the |
ctestv |
Vector of character strings of forbidden split results; |
inddep |
Column indices of target variables in datain |
N |
Number of repetitions of subsampling from predictors (integer) in versions "b" and "c"; |
pobs1 |
Percentage(s) of observations for subsampling at stage 1; |
ppre1 |
Percentage(s) of predictors for subsampling at stage 1; |
pobs2 |
Percentage(s) of observations for subsampling at stage 2"; |
ppre2 |
Percentage(s) of predictors for subsampling at stage 2; |
conf.level |
(1 - significance level) in function |
minsplit |
Minimum number of elements in a node to be splitted; |
minbucket |
Minimum number of elements in a node; |
See Buschfeld & Weihs (2025), Optimizing decision trees for the analysis of World Englishes and sociolinguistic data. Cambridge University Press, section 4.5.6.1, for further information.
Standard output can be produced by means of print(name) or just name as well as plot(name where 'name' is the output data
frame of the function.
Best trees at stage 1
Best trees at stage 2
names of dependent variables
R2s of best trees at both stages
data <- PrInDT::data_vowel data <- na.omit(data) CHILDvowel <- data$Nickname data$Nickname <- NULL syllable <- 3 - data$syllables data$syllabels <- NULL data$syllables <- syllable data$speed <- data$word_duration / data$syllables names(data)[names(data) == "target"] <- "vowel_length" # interpretation restrictions (split exclusions) ctestv <- rbind('ETH == {C2a, C1a}','MLU == {1, 3}') # split exclusions inddep <- c(13,9) out2SR <- R2SPrInDT(data,ctestv=ctestv,inddep=inddep,N=9,conf.level=0.99) out2SR plot(out2SR)data <- PrInDT::data_vowel data <- na.omit(data) CHILDvowel <- data$Nickname data$Nickname <- NULL syllable <- 3 - data$syllables data$syllabels <- NULL data$syllables <- syllable data$speed <- data$word_duration / data$syllables names(data)[names(data) == "target"] <- "vowel_length" # interpretation restrictions (split exclusions) ctestv <- rbind('ETH == {C2a, C1a}','MLU == {1, 3}') # split exclusions inddep <- c(13,9) out2SR <- R2SPrInDT(data,ctestv=ctestv,inddep=inddep,N=9,conf.level=0.99) out2SR plot(out2SR)
PrInDT for specified percentage combinationsBy the function RePrInDT, the function PrInDT is called repeatedly according to all combinations of the percentages specified in the vectors 'plarge' and 'psmall'.
The relationship between the two-class factor variable 'classname' and all other factor and numerical variables
in the data frame 'datain' is optimally modeled by means of 'N' repetitions of undersampling.
The optimization citerion is the balanced accuracy on the validation sample 'valdat' (default = full input sample 'datain').
The trees generated from undersampling can be restricted by rejecting
unacceptable trees which include split results specified in the character strings of the vector 'ctestv'.
The probability threshold 'thres' for the prediction of the smaller class may be specified (default = 0.5).
Undersampling may be stratified in two ways by the feature 'strat'.
The parameters 'conf.level', 'minsplit', and 'minbucket' can be used to control the size of the trees.
Reference
Weihs, C., Buschfeld, S. 2021c. Repeated undersampling in PrInDT (RePrInDT): Variation in undersampling and prediction,
and ranking of predictors in ensembles. arXiv:2108.05129
RePrInDT(datain, classname, ctestv=NA, N, plarge, psmall, conf.level=0.95, thres=0.5, stratvers=0, strat=NA, seedl=TRUE,minsplit=NA,minbucket=NA, valdat=datain)RePrInDT(datain, classname, ctestv=NA, N, plarge, psmall, conf.level=0.95, thres=0.5, stratvers=0, strat=NA, seedl=TRUE,minsplit=NA,minbucket=NA, valdat=datain)
datain |
Input data frame with class factor variable 'classname' and the |
classname |
Name of class variable (character) |
ctestv |
Vector of character strings of forbidden split results; |
N |
Number of repetitions (integer > 0) |
plarge |
Vector of undersampling percentages of larger class (numerical, > 0 and <= 1) |
psmall |
Vector of undersampling percentages of smaller class (numerical, > 0 and <= 1) |
conf.level |
(1 - significance level) in function |
thres |
Probability threshold for prediction of smaller class (numerical, >= 0 and < 1); default = 0.5 |
stratvers |
Version of stratification; |
strat |
Name of one (!) stratification variable for undersampling (character); |
seedl |
Should the seed for random numbers be set (TRUE / FALSE)? |
minsplit |
Minimum number of elements in a node to be splitted; |
minbucket |
Minimum number of elements in a node; |
valdat |
Validation data; default = datain |
Standard output can be produced by means of print(name) or just name as well as plot(name) where 'name' is the output data
frame of the function.
The plot function will produce a series of more than one plot. If you use R, you might want to specify windows(record=TRUE) before
plot(name) to save the whole series of plots. In R-Studio this functionality is provided automatically.
best trees for the different percentage combinations; refer to an individual tree as treesb[[k]], k = 1, ..., length(plarge)*length(psmall)
accuracies of best trees on full sample
accuracies of ensemble of 3 best trees on full sample
mean of permutation losses for the predictors
datastrat <- PrInDT::data_zero data <- na.omit(datastrat) # cleaned full data: no NAs # interpretation restrictions (split exclusions) ctestv <- rbind('ETH == {C2a, C1a}', 'MLU == {1, 3}') N <- 51 # no. of repetitions conf.level <- 0.99 # 1 - significance level (mincriterion) in ctree psmall <- c(0.95,1) # percentages of the small class plarge <- c(0.09,0.1) # percentages of the large class outRe <- RePrInDT(data,"real",ctestv,N,plarge,psmall,conf.level) outRe plot(outRe)datastrat <- PrInDT::data_zero data <- na.omit(datastrat) # cleaned full data: no NAs # interpretation restrictions (split exclusions) ctestv <- rbind('ETH == {C2a, C1a}', 'MLU == {1, 3}') N <- 51 # no. of repetitions conf.level <- 0.99 # 1 - significance level (mincriterion) in ctree psmall <- c(0.95,1) # percentages of the small class plarge <- c(0.09,0.1) # percentages of the large class outRe <- RePrInDT(data,"real",ctestv,N,plarge,psmall,conf.level) outRe plot(outRe)
The function SimCPrInDT applies interdependent estimation (endogenous case) for finding an optimal model for relationships between the two-class factor variables
specified as column indices of 'data' in the vector 'inddep' and all other factor and numerical variables in the data frame 'data' by means of
'N' repetitions of random subsampling with percentages 'percl' for the large classes and 'percs' for the small classes. One percentage of observations
for each dependent variable has to be specified for the larger and the smaller class. For example, for three dependent variables, 'percl' consists of
three percentages specified in the order in which the dependent variables appear in 'inddep'.
The dependent variables have to be specified as dummies, i.e. as 'property absent' (value 0) or 'property present' (value 1).
The optimization citerion is the balanced accuracy on the full sample.
In an additional step, the mean balanced accuracy over all class variables is optimized (joint optimization).
The trees generated from undersampling can be restricted by not accepting trees
including split results specified in the character strings of the vector 'ctestv'.
The parameters 'conf.level', 'minsplit', and 'minbucket' can be used to control the size of the trees.
SimCPrInDT(data,ctestv=NA,inddep,percl,percs,N=99,M,psize,conf.level=0.95, minsplit=NA,minbucket=NA)SimCPrInDT(data,ctestv=NA,inddep,percl,percs,N=99,M,psize,conf.level=0.95, minsplit=NA,minbucket=NA)
data |
Input data frame with class factor variables and the influential variables, |
ctestv |
Vector of character strings of forbidden split results; |
inddep |
indices of dependent variables |
percl |
list of undersampling percentages of larger class (numerical, > 0 and <= 1): one per dependent class variable in the same order as in 'inddep' |
percs |
list of undersampling percentage of smaller class (numerical, > 0 and <= 1); one per dependent class variable in the same order as in 'inddep' |
N |
no. of repetitions of subsampling of observations (integer > 0); default = 99 |
M |
no. of repetitions of subsampling of predictors |
psize |
no. of predictors in the subsamples of the predictors |
conf.level |
(1 - significance level) in function |
minsplit |
Minimum number of elements in a node to be splitted; |
minbucket |
Minimum number of elements in a node; |
See Buschfeld & Weihs (2025), Optimizing decision trees for the analysis of World Englishes and sociolinguistic data. Cambridge University Press, section 4.5.6.2, for further information.
Standard output can be produced by means of print(name) or just name as well as plot(name) where 'name' is the output data
frame of the function.
Best trees at stage 1
Best trees at stage 2
Best trees from mean maximization
names of classification variables
balanced accuracies of best trees at both stages
data <- PrInDT::data_land # load data dataclean <- data[,c(1:7,23:24,11:13,22,8:10)] # only relevant features inddep <- c(14:16) # dependent variables dataland <- na.omit(dataclean) ctestv <- NA perc <- c(0.45,0.05,0.25) # percentages of observations of larger class, # 1 per dependent class variable perc2 <- c(0.75,0.95,0.75) # percentages of observations of smaller class, # 1 per dependent class variable outland <- SimCPrInDT(dataland,percl=perc,percs=perc2,N=9,inddep=inddep,M=2,psize=10) outland plot(outland)data <- PrInDT::data_land # load data dataclean <- data[,c(1:7,23:24,11:13,22,8:10)] # only relevant features inddep <- c(14:16) # dependent variables dataland <- na.omit(dataclean) ctestv <- NA perc <- c(0.45,0.05,0.25) # percentages of observations of larger class, # 1 per dependent class variable perc2 <- c(0.75,0.95,0.75) # percentages of observations of smaller class, # 1 per dependent class variable outland <- SimCPrInDT(dataland,percl=perc,percs=perc2,N=9,inddep=inddep,M=2,psize=10) outland plot(outland)
The function SimMixPrInDT applies structured subsampling for finding an optimal subsample to model
the relationship between the dependent variables specified in the sublist 'targets' of the list 'datalist' and all other factor and numerical variables
in the corresponding individual data frame specified in the sublist 'datanames' of the list 'datalist' in the same order as 'targets'. The data frames have to be different of each other!
The function is prepared to handle classification tasks with 2 or more classes and regression tasks. At first stage, the targets are estimated based on
the full sample of all exogenous variables and on summaries of the observed endogenous variables.
For generating summaries, the variables representing the substructure have to be specified in the sublist 'datastruc' of 'datalist'.
The sublist 'summ' of 'datalist' includes for each discrete target the classes for which you want to calculate the summary percentages and for each
continuous target just NA (in the same order as in the sublist 'targets'). For a discrete target in this list, you can provide a sublist of classes to be combined
(for an example see the below example).
At second stage, structured subsampling is used for improving the models from first stage.
The substructure of the observations used for structured subsampling is specified by the list 'Struc' which here only consists of
the name 'check' of the variable with the information about the categories of the substructure (without specification of the dataset names already specified
in 'datanames', see example below), and the matrix 'labs' which specifies the values of 'check' corresponding to two categories in its rows, i.e. in 'labs[1,]' and 'labs[2,]'.
The names of the categories have to be specified by rownames(labs).
In structured subsampling, first 'M' repetitions of subsampling of the variable 'name' with 'nsub' different elements of each category in 'check' are realized. If 'nsub' is a list, each entry is employed individually. If 'nsub' is larger than the maximum available number of elements with a certain value of 'check', the maximum possible number of elements is used. Then,
for each of the subsamples 'N' repetitions of subsampling in classification or regression with the specified percentages of classes, observations, and predictors are carried out.
These percentages are specified in the matrix 'percent', one row per estimation task. For binary classification tasks,
percentages 'percl' and 'percs' for the larger and the smaller class have to be specified. For multilevel classification tasks, NA is specified (see the below example)
since the percentages are generated automatically. For regression tasks, 'pobs' and 'ppre' have to be specified for observations and predictors, respectively.
The optimization citerion is balanced accuracy for classification and goodness of fit R2 for regression on the full sample, respectively.
At stage 2, the models are optimized individually. At stage 3, the models are optimized on the maximum of joint elements in their substructures.
The trees generated from undersampling can be restricted by not accepting trees
including split results specified in the character strings of the vector 'ctestv'.
The parameters 'conf.level', 'minsplit', and 'minbucket' can be used to control the size of the trees.
SimMixPrInDT(datalist,ctestv=NA,Struc,M=12,N=99,nsub,percent=NA,conf.level=0.95, minsplit=NA,minbucket=NA)SimMixPrInDT(datalist,ctestv=NA,Struc,M=12,N=99,nsub,percent=NA,conf.level=0.95, minsplit=NA,minbucket=NA)
datalist |
list(datanames,targets,datastruc,summ) Input data: For specification see the above description |
ctestv |
Vector of character strings of forbidden split results; |
Struc |
list(name,check,labs) Paprametes for structured subsampling, as explained in the desciption above. |
M |
Number of repetitions of subsampling of elements of substructure; default = 12 |
N |
Number of repetitions of subsampling for predictors (integer); default = 99 |
nsub |
(List of) numbers of different elements of substructure per subsample |
percent |
matrix of percent spefications: For specification see the above description; default: 'percent = NA' meaning default values for percentages. |
conf.level |
(1 - significance level) in function |
minsplit |
Minimum number of elements in a node to be splitted; default = 20 |
minbucket |
Minimum number of elements in a node; default = 7 |
See Buschfeld & Weihs (2025), Optimizing decision trees for the analysis of World Englishes and sociolinguistic data. Cambridge University Press, section 4.5.6.2, for further information.
Standard output can be produced by means of print(name) or just name as well as plot(name where 'name' is the output data
frame of the function.
Best trees at stage 1 (Full sample)
Best trees at stage 2 (Individual optimization)
Best trees at stage 3 (Joint optimization)
names of dependent variables
number of models of tasks
levels of tasks
accuracies of best trees at both stages
# zero data datazero <- PrInDT::data_zero datazero <- na.omit(datazero) # cleaned full data: no NAs names(datazero)[names(datazero)=="real"] <- "zero" CHILDzero <- PrInDT::participant_zero # interpretation restrictions (split exclusions) ctestv <- rbind('ETH == {C2a, C1a}','MLU == {1, 3}') # split exclusions ## # multi-level data datastrat <- PrInDT::data_zero datamult <- na.omit(datastrat) # ctestv <- NA datamult$mult[datamult$ETH %in% c("C1a","C1b","C1c") & datamult$real == "zero"] <- "zero1" datamult$mult[datamult$ETH %in% c("C2a","C2b","C2c") & datamult$real == "zero"] <- "zero2" datamult$mult[datamult$real == "realized"] <- "real" datamult$mult <- as.factor(datamult$mult) # mult is new class variable datamult$real <- NULL # remove old class variable CHILDmult <- CHILDzero ## # vowel data data <- PrInDT::data_vowel data <- na.omit(data) CHILDvowel <- data$Nickname CHILDvowel <- as.factor(gsub("Nick","P",CHILDvowel)) data$Nickname <- NULL syllable <- 3 - data$syllables data$syllabels <- NULL data$syllables <- syllable data$speed <- data$word_duration / data$syllables names(data)[names(data) == "target"] <- "vowel" datavowel <- data ## # function preparation and call # datalist and percent datanames <- list("datazero","datamult","datavowel") targets <- c("zero","mult","vowel") datastruc <- list(CHILDzero,CHILDmult,CHILDvowel) summult <- paste("zero1","zero2",sep=",") summ <- c("zero",summult,NA) datalist <- list(datanames=datanames,targets=targets,datastruc=datastruc,summ=summ) percent <- matrix(NA,nrow=3,ncol=2) percent[1,] <- c("percl=0.075","percs=0.9") # percentages for datazero # no percentages needed for datapast percent[3,] <- c("pobs=0.9","ppre=c(0.9,0.8)") # percentages for datavowel # substructures labs <- matrix(1:6,nrow=2,ncol=3) labs[1,] <- c("C1a","C1b","C1c") labs[2,] <- c("C2a","C2b","C2c") rownames(labs) <- c("children 1","children 2") Struc <- list(check="ETH",labs=labs) outSimMix <- SimMixPrInDT(datalist,ctestv=ctestv,Struc=Struc,M=2,N=9,nsub=c(19,20), percent=percent,conf.level=0.99) outSimMix plot(outSimMix)+# zero data datazero <- PrInDT::data_zero datazero <- na.omit(datazero) # cleaned full data: no NAs names(datazero)[names(datazero)=="real"] <- "zero" CHILDzero <- PrInDT::participant_zero # interpretation restrictions (split exclusions) ctestv <- rbind('ETH == {C2a, C1a}','MLU == {1, 3}') # split exclusions ## # multi-level data datastrat <- PrInDT::data_zero datamult <- na.omit(datastrat) # ctestv <- NA datamult$mult[datamult$ETH %in% c("C1a","C1b","C1c") & datamult$real == "zero"] <- "zero1" datamult$mult[datamult$ETH %in% c("C2a","C2b","C2c") & datamult$real == "zero"] <- "zero2" datamult$mult[datamult$real == "realized"] <- "real" datamult$mult <- as.factor(datamult$mult) # mult is new class variable datamult$real <- NULL # remove old class variable CHILDmult <- CHILDzero ## # vowel data data <- PrInDT::data_vowel data <- na.omit(data) CHILDvowel <- data$Nickname CHILDvowel <- as.factor(gsub("Nick","P",CHILDvowel)) data$Nickname <- NULL syllable <- 3 - data$syllables data$syllabels <- NULL data$syllables <- syllable data$speed <- data$word_duration / data$syllables names(data)[names(data) == "target"] <- "vowel" datavowel <- data ## # function preparation and call # datalist and percent datanames <- list("datazero","datamult","datavowel") targets <- c("zero","mult","vowel") datastruc <- list(CHILDzero,CHILDmult,CHILDvowel) summult <- paste("zero1","zero2",sep=",") summ <- c("zero",summult,NA) datalist <- list(datanames=datanames,targets=targets,datastruc=datastruc,summ=summ) percent <- matrix(NA,nrow=3,ncol=2) percent[1,] <- c("percl=0.075","percs=0.9") # percentages for datazero # no percentages needed for datapast percent[3,] <- c("pobs=0.9","ppre=c(0.9,0.8)") # percentages for datavowel # substructures labs <- matrix(1:6,nrow=2,ncol=3) labs[1,] <- c("C1a","C1b","C1c") labs[2,] <- c("C2a","C2b","C2c") rownames(labs) <- c("children 1","children 2") Struc <- list(check="ETH",labs=labs) outSimMix <- SimMixPrInDT(datalist,ctestv=ctestv,Struc=Struc,M=2,N=9,nsub=c(19,20), percent=percent,conf.level=0.99) outSimMix plot(outSimMix)+
The function SimRPrInDT applies structured subsampling for finding an optimal subsample to model
the relationship between the continuous variables with indices 'inddep' and all other factor and numerical variables
in the data frame 'datain'.
The substructure of the observations used for subsampling is specified by the list 'Struc' which consists of the factor variable 'name' representing the substructure,
the name 'check' of the factor variable with the information about the categories of the substructure, and the matrix 'labs' which specifies the values of 'check'
corresponding to two categories in its rows, i.e. in 'labs[1,]' and 'labs[2,]'. The names of the categories have to be specified by rownames(labs).
In structured subsampling, first 'M' repetitions of subsampling of the variable 'name' with 'nsub' different elements of each category in 'check' are realized. If 'nsub' is a list, each entry is employed individually. If 'nsub' is larger than the maximum available number of elements with a certain value of 'check', the maximum possible number of elements is used. Then,
for each of the subsamples 'N' repetitions of subsampling of 'ppre' percentages of the predictors are carried out.
Subsampling of observations can additionally be restricted to 'pobs' percentages.
The optimization citerion is the goodness of fit R2 on the full sample. At stage 2, the models are optimized individually.
At stage 3, the mean of accuracies is optimized over all models.
Struc=NA causes random subsampling of observations instead of structured subsampling.
The trees generated from undersampling can be restricted by not accepting trees
including split results specified in the character strings of the vector 'ctestv'.
The parameters 'conf.level', 'minsplit', and 'minbucket' can be used to control the size of the trees.
SimRPrInDT(data,ctestv=NA,Struc=NA,inddep,N=99,pobs=0.9,ppre=c(0.9,0.7), M=1,nsub=1,conf.level=0.95,minsplit=NA,minbucket=NA)SimRPrInDT(data,ctestv=NA,Struc=NA,inddep,N=99,pobs=0.9,ppre=c(0.9,0.7), M=1,nsub=1,conf.level=0.95,minsplit=NA,minbucket=NA)
data |
Input data frame with continuous target variables with column indices 'inddep' and the |
ctestv |
Vector of character strings of forbidden split results; |
Struc |
= list(name,check,labs), cf. description for explanations |
inddep |
Column indices of target variables in datain |
N |
Number of repetitions of subsampling (integer) of predictors; default = 99 |
pobs |
Percentage(s) of observations for subsampling; default=0.9 |
ppre |
Percentage(s) of predictors for subsampling; default=c(0.9,0.7) |
M |
Number of repetitions of subsampling of elements of substructure |
nsub |
(List of) numbers of different elements of substructure per subsample |
conf.level |
(1 - significance level) in function |
minsplit |
Minimum number of elements in a node to be splitted; |
minbucket |
Minimum number of elements in a node; |
See Buschfeld & Weihs (2025), Optimizing decision trees for the analysis of World Englishes and sociolinguistic data. Cambridge University Press, section 4.5.6.2, for further information.
Standard output can be produced by means of print(name) or just name as well as plot(name where 'name' is the output data
frame of the function.
Best trees at stage 1
Best trees for the different values of 'nsub' at stage 2
Best trees for the different values of 'nsub' after mean optimization
Used structure
Best numbers of elements in substructure: 2nd stage, 3rd stage
names of dependent variables
R2s of best trees at stages 1, 2, mean max.
data <- PrInDT::data_vowel data <- na.omit(data) CHILDvowel <- data$Nickname data$Nickname <- NULL syllable <- 3 - data$syllables data$syllabels <- NULL data$syllables <- syllable data$speed <- data$word_duration / data$syllables names(data)[names(data) == "target"] <- "vowel_length" # interpretation restrictions (split exclusions) ctestv <- rbind('ETH == {C2a, C1a}','MLU == {1, 3}') # split exclusions # structure definition name <- CHILDvowel check <- "data$ETH" labs <- matrix(1:6,nrow=2,ncol=3) labs[1,] <- c("C1a","C1b","C1c") labs[2,] <- c("C2a","C2b","C2c") rownames(labs) <- c("children 1","children 2") Struc <- list(name=name,check=check,labs=labs) # column indices of dependent variables inddep <- c(13,9) outSimR <- SimRPrInDT(data,ctestv=ctestv,Struc=Struc,inddep=inddep,N=3,M=2, nsub=c(19,20),conf.level=0.99) outSimR plot(outSimR)data <- PrInDT::data_vowel data <- na.omit(data) CHILDvowel <- data$Nickname data$Nickname <- NULL syllable <- 3 - data$syllables data$syllabels <- NULL data$syllables <- syllable data$speed <- data$word_duration / data$syllables names(data)[names(data) == "target"] <- "vowel_length" # interpretation restrictions (split exclusions) ctestv <- rbind('ETH == {C2a, C1a}','MLU == {1, 3}') # split exclusions # structure definition name <- CHILDvowel check <- "data$ETH" labs <- matrix(1:6,nrow=2,ncol=3) labs[1,] <- c("C1a","C1b","C1c") labs[2,] <- c("C2a","C2b","C2c") rownames(labs) <- c("children 1","children 2") Struc <- list(name=name,check=check,labs=labs) # column indices of dependent variables inddep <- c(13,9) outSimR <- SimRPrInDT(data,ctestv=ctestv,Struc=Struc,inddep=inddep,N=3,M=2, nsub=c(19,20),conf.level=0.99) outSimR plot(outSimR)