Title: | Semi-Supervised Classification, Regression and Clustering Methods |
---|---|
Description: | Providing a collection of techniques for semi-supervised classification, regression and clustering. In semi-supervised problems, both labeled and unlabeled data are used to train a classifier. The package includes a collection of semi-supervised learning techniques: self-training, co-training, democratic co-learning, decision tree, random forest, 'S3VM', etc., with a fairly intuitive interface that is easy to use. |
Authors: | Francisco Jesús Palomares Alabarce [aut, cre] , José Manuel Benítez [ctb] , Isaac Triguero [ctb] , Christoph Bergmeir [ctb] , Mabel González [ctb] |
Maintainer: | Francisco Jesús Palomares Alabarce <[email protected]> |
License: | GPL-3 |
Version: | 0.9.3.3 |
Built: | 2024-11-26 06:51:52 UTC |
Source: | CRAN |
Abalone
data(abalone)
Predict the age of abalone from physical measurements
https://archive.ics.uci.edu/ml/datasets/Abalone
An S4 method to find the best split
best_split(object, ...)
object |
DecisionTree object |
... |
This parameter is included for compatibility reasons. |
Function to get the best split in a Decision Tree. Finds the best split for a node. "Best" means that the mean impurity is the lowest possible. To find the best division, the method iterates through all the features: for numerical features, all threshold/feature pairs are evaluated; for non-numerical features, the best group of possible values is obtained based on an algorithm using the function get_levels_categoric.
## S4 method for signature 'DecisionTreeClassifier'
best_split(object, X, y, parms)
object |
DecisionTree object |
X |
the training data |
y |
the class values |
parms |
additional parameters used in the function |
A list with: best_idx, the name of the feature with the best split, or NULL if none is found; best_thr, the threshold found for the best split, or NULL if none is found
Breast
data(breast)
Diagnostic Wisconsin Breast Cancer Database
https://archive.ics.uci.edu/ml/datasets/Breast+Cancer+Wisconsin+(Diagnostic)
Function to calculate the Gini index. The formula is: 1 - sum(p_i^2) for i = 1, ..., num_classes, where p_i is the proportion of class i.
calculate_gini(column_factor)
column_factor |
class values |
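To make the formula concrete, the sketch below computes the Gini index of a factor column directly from its class proportions in base R; it is an illustration of the formula only, not a call to the package's internal calculate_gini() function.

# Illustrative sketch of the Gini index formula above (base R only)
p <- prop.table(table(iris$Species))   # class proportions p_i
gini <- 1 - sum(p^2)                   # 1 - sum(p_i^2)
print(gini)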
Model from conclust
This function takes an unlabeled dataset and two lists of must-link and cannot-link constraints
as input and produces a clustering as output.
cclsSSLR(n_clusters = NULL, mustLink = NULL, cantLink = NULL, max_iter = 1, tabuIter = 100, tabuLength = 20)
n_clusters |
The number of clusters to be considered. Default is NULL (the number of classes) |
mustLink |
A list of must-link constraints. Default is NULL (constraints between instances with the same label) |
cantLink |
A list of cannot-link constraints. Default is NULL (constraints between instances with different labels) |
max_iter |
maximum iterations in KMeans. Default is 1 |
tabuIter |
Number of iterations in Tabu search |
tabuLength |
The number of elements in the Tabu list |
This model only returns labels, not centers
Tran Khanh Hiep, Nguyen Minh Duc, Bui Quoc Trung
Pairwise Constrained Clustering by Local Search
2016
library(tidyverse)
library(caret)
library(SSLR)
library(tidymodels)

data <- iris

set.seed(1)

#% LABELED
cls <- which(colnames(iris) == "Species")

labeled.index <- createDataPartition(data$Species, p = .2, list = FALSE)
data[-labeled.index, cls] <- NA

m <- cclsSSLR(max_iter = 1) %>% fit(Species ~ ., data)

# Get labels (assign clusters), type = "raw" returns factor
labels <- m %>% cluster_labels()

print(labels)
Function to check a value in a leaf node, whether numeric or character
check_value(value, threshold)
value |
is the value in leaf node |
threshold |
in leaf node |
TRUE if the value is <= the threshold for numeric values, or the value is %in% the threshold for factors
Check interface
check_xy_interface(x, y)
x |
data without class labels |
y |
class values |
Model from conclust
This function takes an unlabeled dataset and two lists of must-link and cannot-link constraints
as input and produces a clustering as output.
ckmeansSSLR(n_clusters = NULL, mustLink = NULL, cantLink = NULL, max_iter = 10)
n_clusters |
The number of clusters to be considered. Default is NULL (the number of classes) |
mustLink |
A list of must-link constraints. Default is NULL (constraints between instances with the same label) |
cantLink |
A list of cannot-link constraints. Default is NULL (constraints between instances with different labels) |
max_iter |
maximum iterations in KMeans. Default is 10 |
This model only returns labels, not centers
Wagstaff, Cardie, Rogers, Schrodl
Constrained K-means Clustering with Background Knowledge
2001
library(tidyverse)
library(caret)
library(SSLR)
library(tidymodels)

data <- iris

set.seed(1)

#% LABELED
cls <- which(colnames(iris) == "Species")

labeled.index <- createDataPartition(data$Species, p = .2, list = FALSE)
data[-labeled.index, cls] <- NA

m <- ckmeansSSLR() %>% fit(Species ~ ., data)

# Get labels (assign clusters), type = "raw" returns factor
labels <- m %>% cluster_labels()

print(labels)
Cluster labels
cluster_labels(object, ...)
object |
object |
... |
other parameters to be passed |
Get the labels of the clusters. Depending on type, a factor or numeric values are returned.
## S3 method for class 'model_sslr_fitted'
cluster_labels(object, type = "class", ...)
object |
model_sslr_fitted model built |
type |
type of prediction in the principal model: "class" or "raw" |
... |
other parameters to be passed |
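A minimal usage sketch, assuming a semi-supervised clustering model m has already been fitted as in the ckmeansSSLR() example elsewhere in this manual; the type argument selects the return format.

# m is a fitted model_sslr_fitted clustering model (illustrative)
labels_class <- m %>% cluster_labels(type = "class")
labels_raw   <- m %>% cluster_labels(type = "raw")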
Co-Training by Committee (CoBC) is a semi-supervised learning algorithm with a co-training style. This algorithm trains N classifiers with the learning scheme defined in the learner argument using a reduced set of labeled examples. For each iteration, an unlabeled example is labeled for a classifier if the most confident classifications assigned by the other N-1 classifiers agree on the labeling proposed. The candidate unlabeled examples are selected randomly from a pool of size u. The final prediction is the average of the estimates of the N regressors.
coBC(learner, N = 3, perc.full = 0.7, u = 100, max.iter = 50)
learner |
Model from the parsnip package for training a supervised base classifier using a set of instances. This model needs to provide probability predictions in classification mode |
N |
The number of classifiers used as committee members. All these classifiers
are trained using the learner model. |
perc.full |
A number between 0 and 1. If the percentage of new labeled examples reaches this value the self-labeling process is stopped. Default is 0.7. |
u |
Number of unlabeled instances in the pool. Default is 100. |
max.iter |
Maximum number of iterations to execute in the self-labeling process. Default is 50. |
For regression tasks, labeling data is computationally very expensive, so it is quite slow.
This method trains an ensemble of diverse classifiers. To promote the initial diversity the classifiers are trained from the reduced set of labeled examples by Bagging. The stopping criterion is defined through the fulfillment of one of the following criteria: the algorithm reaches the number of iterations defined in the max.iter parameter or the portion of the unlabeled set, defined in the perc.full parameter, is moved to the enlarged labeled set of the classifiers.
(When the model is fitted) A list object of class "coBC" containing:
The final N base classifiers trained using the enlarged labeled set.
List of N vectors of indexes related to the training instances used per each classifier. These indexes are relative to the y argument.
The indexes of all training instances used to train the N models. These indexes include the initial labeled instances and the newly labeled instances. These indexes are relative to the y argument.
List of three vectors with the same information in model.index but the indexes are relative to the instances.index vector.
The levels of the y factor in classification.
The function provided in the pred argument.
The list provided in the pred.pars argument.
Avrim Blum and Tom Mitchell.
Combining labeled and unlabeled data with co-training.
In Eleventh Annual Conference on Computational Learning Theory, COLT’ 98, pages 92-100, New York, NY, USA, 1998. ACM.
ISBN 1-58113-057-0. doi: 10.1145/279943.279962.
Mohamed Farouk Abdel-Hady and Günther Palm.
Semi-supervised Learning for Regression with Cotraining by Committee
Institute of Neural Information Processing
University of Ulm
D-89069 Ulm, Germany
library(tidyverse)
library(tidymodels)
library(caret)
library(SSLR)

data(wine)

set.seed(1)

train.index <- createDataPartition(wine$Wine, p = .7, list = FALSE)
train <- wine[ train.index,]
test  <- wine[-train.index,]

cls <- which(colnames(wine) == "Wine")

#% LABELED
labeled.index <- createDataPartition(wine$Wine, p = .2, list = FALSE)
train[-labeled.index, cls] <- NA

# We need a model with probability predictions from parsnip
# https://tidymodels.github.io/parsnip/articles/articles/Models.html
# It should be with mode = classification
# For example, with Random Forest
rf <- rand_forest(trees = 100, mode = "classification") %>%
  set_engine("randomForest")

m <- coBC(learner = rf, N = 3, perc.full = 0.7, u = 100, max.iter = 3) %>%
  fit(Wine ~ ., data = train)

# Accuracy
predict(m, test) %>%
  bind_cols(test) %>%
  metrics(truth = "Wine", estimate = .pred_class)
This function combines the probabilities predicted by the committee of classifiers.
coBCCombine(h.prob, classes)
h.prob |
A list of probability matrices. |
classes |
The classes in the same order as they appear
in the columns of each matrix in h.prob. |
A probability matrix
CoBC is a semi-supervised learning algorithm with a co-training style. This algorithm trains N classifiers with the learning scheme defined in gen.learner using a reduced set of labeled examples. For each iteration, an unlabeled example is labeled for a classifier if the most confident classifications assigned by the other N-1 classifiers agree on the labeling proposed. The candidate unlabeled examples are selected randomly from a pool of size u.
coBCG(y, gen.learner, gen.pred, N = 3, perc.full = 0.7, u = 100, max.iter = 50)
y |
A vector with the labels of training instances. In this vector the
unlabeled instances are specified with the value NA. |
gen.learner |
A function for training N supervised base classifiers. This function needs two parameters, indexes and cls, where indexes indicates the instances to use and cls specifies the classes of those instances. |
gen.pred |
A function for predicting the probabilities per classes.
This function must have two parameters, model and indexes, where the model
is a classifier trained with the gen.learner function. |
N |
The number of classifiers used as committee members. All these classifiers
are trained using the gen.learner function. |
perc.full |
A number between 0 and 1. If the percentage of new labeled examples reaches this value the self-labeling process is stopped. Default is 0.7. |
u |
Number of unlabeled instances in the pool. Default is 100. |
max.iter |
Maximum number of iterations to execute in the self-labeling process. Default is 50. |
coBCG can be helpful in those cases where the method selected as base classifier needs a learner and pred functions with other specifications. For more information about the general coBC method, please see the coBC function. Essentially, the coBC function is a wrapper of the coBCG function.
A list object of class "coBCG" containing:
The final N base classifiers trained using the enlarged labeled set.
List of N vectors of indexes related to the training instances used per each classifier. These indexes are relative to the y argument.
The indexes of all training instances used to train the N models. These indexes include the initial labeled instances and the newly labeled instances. These indexes are relative to the y argument.
List of three vectors with the same information in model.index but the indexes are relative to the instances.index vector.
The levels of the y factor.
library(SSLR)
library(caret)

## Load Wine data set
data(wine)

cls <- which(colnames(wine) == "Wine")
x <- wine[, - cls]  # instances without classes
y <- wine[, cls]    # the classes
x <- scale(x)       # scale the attributes

## Prepare data
set.seed(20)

# Use 50% of instances for training
tra.idx <- sample(x = length(y), size = ceiling(length(y) * 0.5))
xtrain <- x[tra.idx,]  # training instances
ytrain <- y[tra.idx]   # classes of training instances

# Use 70% of train instances as unlabeled set
tra.na.idx <- sample(x = length(tra.idx), size = ceiling(length(tra.idx) * 0.7))
ytrain[tra.na.idx] <- NA  # remove class information of unlabeled instances

# Use the other 50% of instances for inductive testing
tst.idx <- setdiff(1:length(y), tra.idx)
xitest <- x[tst.idx,]  # testing instances
yitest <- y[tst.idx]   # classes of testing instances

## Example: Training from a set of instances with 1-NN (knn3) as base classifier.
gen.learner1 <- function(indexes, cls)
  caret::knn3(x = xtrain[indexes,], y = cls, k = 1)
gen.pred1 <- function(model, indexes)
  predict(model, xtrain[indexes,])

set.seed(1)
trControl_coBCG <- list(gen.learner = gen.learner1, gen.pred = gen.pred1)
md1 <- train_generic(ytrain, method = "coBCG", trControl = trControl_coBCG)

# Predict probabilities per instances using each model
h.prob <- lapply(
  X = md1$model,
  FUN = function(m) predict(m, xitest)
)

# Combine the predictions
cls1 <- coBCCombine(h.prob, md1$classes)
table(cls1, yitest)

confusionMatrix(cls1, yitest)$overall[1]

## Example: Training from a distance matrix with 1-NN (oneNN) as base classifier.
dtrain <- as.matrix(proxy::dist(x = xtrain, method = "euclidean", by_rows = TRUE))

gen.learner2 <- function(indexes, cls) {
  m <- SSLR::oneNN(y = cls)
  attr(m, "tra.idxs") <- indexes
  m
}

gen.pred2 <- function(model, indexes) {
  tra.idxs <- attr(model, "tra.idxs")
  d <- dtrain[indexes, tra.idxs]
  prob <- predict(model, d, distance.weighting = "none")
  prob
}

set.seed(1)
trControl_coBCG2 <- list(gen.learner = gen.learner2, gen.pred = gen.pred2)
md2 <- train_generic(ytrain, method = "coBCG", trControl = trControl_coBCG2)

# Predict probabilities per instances using each model
ditest <- proxy::dist(x = xitest, y = xtrain[md2$instances.index,],
                      method = "euclidean", by_rows = TRUE)

h.prob <- list()
ninstances <- nrow(dtrain)
for (i in 1:length(md2$model)) {
  m <- md2$model[[i]]
  D <- ditest[, md2$model.index.map[[i]]]
  h.prob[[i]] <- predict(m, D)
}

# Combine the predictions
cls2 <- coBCCombine(h.prob, md2$classes)
table(cls2, yitest)

confusionMatrix(cls2, yitest)$overall[1]
coBCReg is based on an ensemble of N diverse regressors. At each iteration and for each regressor, the companion committee labels the unlabeled examples, then the regressor selects the most informative newly-labeled examples for itself, where the selection confidence is based on estimating the validation error. The final prediction is the average of the estimates of the N regressors.
coBCReg(learner, N = 3, perc.full = 0.7, u = 100, max.iter = 50)
learner |
Model from the parsnip package for training a supervised base learner using a set of instances. Since coBCReg is a regression method, the model should be in regression mode |
N |
The number of classifiers used as committee members. All these classifiers
are trained using the learner model. |
perc.full |
A number between 0 and 1. If the percentage of new labeled examples reaches this value the self-labeling process is stopped. Default is 0.7. |
u |
Number of unlabeled instances in the pool. Default is 100. |
max.iter |
Maximum number of iterations to execute in the self-labeling process. Default is 50. |
For regression tasks, labeling data is computationally very expensive, so it is quite slow.
Mohamed Farouk Abdel-Hady and Günther Palm.
Semi-supervised Learning for Regression with Cotraining by Committee
Institute of Neural Information Processing
University of Ulm
D-89069 Ulm, Germany
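A minimal usage sketch for coBCReg, assuming it follows the same fit() interface as the other SSLR wrappers; the dataset, the regression learner and the reduced max.iter are illustrative choices only.

library(tidymodels)
library(SSLR)

# Illustrative regression data: predict Ozone, most values marked unlabeled
data <- airquality[complete.cases(airquality), ]
set.seed(1)
labeled.index <- sample(nrow(data), size = ceiling(0.3 * nrow(data)))
data$Ozone[-labeled.index] <- NA

# A parsnip regression learner (illustrative choice)
rf_reg <- rand_forest(trees = 100, mode = "regression") %>%
  set_engine("randomForest")

m <- coBCReg(learner = rf_reg, N = 3, max.iter = 2) %>%
  fit(Ozone ~ ., data = data)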
coBCRegG is based on an ensemble of N diverse regressors. At each iteration and for each regressor, the companion committee labels the unlabeled examples, then the regressor selects the most informative newly-labeled examples for itself, where the selection confidence is based on estimating the validation error. The final prediction is the average of the estimates of the N regressors.
coBCRegG(y, gen.learner, gen.pred, N = 3, perc.full = 0.7, u = 100, max.iter = 50, gr = 1)
y |
A vector with the labels of training instances. In this vector the
unlabeled instances are specified with the value NA. |
gen.learner |
A function for training N supervised base regressors. This function needs two parameters, indexes and cls, where indexes indicates the instances to use and cls specifies the target values of those instances. |
gen.pred |
A function for predicting the probabilities per classes.
This function must have two parameters, model and indexes, where the model
is a model trained with the gen.learner function. |
N |
The number of classifiers used as committee members. All these classifiers
are trained using the gen.learner function. |
perc.full |
A number between 0 and 1. If the percentage of new labeled examples reaches this value the self-labeling process is stopped. Default is 0.7. |
u |
Number of unlabeled instances in the pool. Default is 100. |
max.iter |
Maximum number of iterations to execute in the self-labeling process. Default is 50. |
gr |
growing rate |
For regression tasks, labeling data is computationally very expensive, so it is quite slow.
Mohamed Farouk Abdel-Hady and Günther Palm.
Semi-supervised Learning for Regression with Cotraining by Committee
Institute of Neural Information Processing
University of Ulm
D-89069 Ulm, Germany
A dataset containing 56 z-normalized time series. The time series length is 286.
data(coffee)
A data frame with 56 rows and 287 variables including the class.
https://www.cs.ucr.edu/~eamonn/time_series_data_2018/
The initialization is the same as in seeded k-means; the difference is that in the following steps the cluster assignment of the labeled data does not change.
constrained_kmeans(max_iter = 10, method = "euclidean")
max_iter |
maximum iterations in KMeans. Default is 10 |
method |
distance method in KMeans: "euclidean", "maximum", "manhattan", "canberra", "binary" or "minkowski" |
Sugato Basu, Arindam Banerjee, Raymond Mooney
Semi-supervised clustering by seeding
July 2002
In Proceedings of 19th International Conference on Machine Learning
library(tidyverse)
library(caret)
library(SSLR)
library(tidymodels)

data <- iris

set.seed(1)

#% LABELED
cls <- which(colnames(iris) == "Species")

labeled.index <- createDataPartition(data$Species, p = .2, list = FALSE)
data[-labeled.index, cls] <- NA

m <- constrained_kmeans() %>% fit(Species ~ ., data)

# Get labels (assign clusters), type = "raw" returns factor
labels <- m %>% cluster_labels()

print(labels)

# Get centers
centers <- m %>% get_centers()

print(centers)
COREG is a semi-supervised learning method for regression with a co-training style. This technique uses two kNN regressors with different distance metrics. For each iteration, each regressor labels the unlabeled example which can be most confidently labeled for the other learner, where the labeling confidence is estimated by considering the consistency of the regressor with the labeled example set. The final prediction is made by averaging the predictions of both refined kNN regressors.
COREG(max.iter = 50, k1 = 3, k2 = 5, p1 = 3, p2 = 5, u = 100)
max.iter |
maximum number of iterations to execute the self-labeling process. Default is 50. |
k1 |
k parameter in the first kNN. Default is 3 |
k2 |
k parameter in the second kNN. Default is 5 |
p1 |
distance order for the first kNN. Default is 3 |
p2 |
distance order for the second kNN. Default is 5 |
u |
Number of unlabeled instances in the pool. Default is 100. |
Labeling data is computationally very expensive, so it is quite slow. To execute this model, the RANN package must be installed.
Zhi-Hua Zhou and Ming Li.
Semi-Supervised Regression with Co-Training.
National Laboratory for Novel Software Technology
Nanjing University, Nanjing 210093, China
library(SSLR)

m <- COREG(max.iter = 1)
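A slightly fuller sketch, assuming the usual SSLR fit() interface and that the RANN package is installed; the dataset, the labeled fraction, and the commented predict() call are illustrative only.

library(tidymodels)
library(SSLR)

# Illustrative regression data: predict Ozone, most values marked unlabeled
data <- airquality[complete.cases(airquality), ]
set.seed(1)
labeled.index <- sample(nrow(data), size = ceiling(0.3 * nrow(data)))
data$Ozone[-labeled.index] <- NA

m <- COREG(max.iter = 5, k1 = 3, k2 = 5, u = 50) %>%
  fit(Ozone ~ ., data = data)

# Numeric predictions for new rows (illustrative)
# predict(m, head(data))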
Class DecisionTreeClassifier. Slots: max_depth, n_classes_, n_features_, tree_, classes, min_samples_split, min_samples_leaf
Democratic Co-Learning is a semi-supervised learning algorithm with a co-training style. This algorithm trains N classifiers with the different learning schemes defined in the learners list. During the iterative process, the multiple classifiers with different inductive biases label data for each other.
democratic(learners, schemes = NULL)
learners |
List of models from the parsnip package for training supervised base classifiers using a set of instances. These models need to have probability predictions |
schemes |
List of schemes (x column names used by each learner). Default is NULL, which means that each learner uses all x columns |
This method trains an ensemble of diverse classifiers. To promote the initial diversity the classifiers must represent different learning schemes. When x.inst is FALSE all learners defined must be able to learn a classifier from the precomputed matrix in x. The iteration process of the algorithm ends when no changes occur in any model during a complete iteration. The generation of the final hypothesis is produced via a weighted majority voting.
(When the model is fitted) A list object of class "democratic" containing:
A vector with the confidence-weighted vote assigned to each classifier.
A list with the final N base classifiers trained using the enlarged labeled set.
List of N vectors of indexes related to the training instances used per each classifier. These indexes are relative to the y argument.
The indexes of all training instances used to train the N models. These indexes include the initial labeled instances and the newly labeled instances. These indexes are relative to the y argument.
List of three vectors with the same information in model.index but the indexes are relative to the instances.index vector.
The levels of the y factor.
The functions provided in the preds argument.
The set of lists provided in the preds.pars argument.
The value provided in the x.inst argument.
library(tidyverse)
library(tidymodels)
library(caret)
library(SSLR)

data(wine)

set.seed(1)

train.index <- createDataPartition(wine$Wine, p = .7, list = FALSE)
train <- wine[ train.index,]
test  <- wine[-train.index,]

cls <- which(colnames(wine) == "Wine")

#% LABELED
labeled.index <- createDataPartition(wine$Wine, p = .2, list = FALSE)
train[-labeled.index, cls] <- NA

# We need models with probability predictions from parsnip
# https://tidymodels.github.io/parsnip/articles/articles/Models.html
# They should be with mode = classification
rf <- rand_forest(trees = 100, mode = "classification") %>%
  set_engine("randomForest")

bt <- boost_tree(trees = 100, mode = "classification") %>%
  set_engine("C5.0")

m <- democratic(learners = list(rf, bt)) %>% fit(Wine ~ ., data = train)

# Accuracy
predict(m, test) %>%
  bind_cols(test) %>%
  metrics(truth = "Wine", estimate = .pred_class)

# With schemes
set.seed(1)
m <- democratic(learners = list(rf, bt),
                schemes = list(c("Malic.Acid", "Ash"), c("Magnesium", "Proline"))) %>%
  fit(Wine ~ ., data = train)

# Accuracy
predict(m, test) %>%
  bind_cols(test) %>%
  metrics(truth = "Wine", estimate = .pred_class)
This function combines the probabilities predicted by the set of classifiers.
democraticCombine(pred, W, classes)
pred |
A list with the prediction for each classifier. |
W |
A vector with the confidence-weighted vote assigned to each classifier during the training process. |
classes |
the classes. |
The classification proposed.
Democratic is a semi-supervised learning algorithm with a co-training style. This algorithm trains N classifiers with different learning schemes defined in the list gen.learners. During the iterative process, the multiple classifiers with different inductive biases label data for each other.
democraticG(y, gen.learners, gen.preds)
y |
A vector with the labels of training instances. In this vector the
unlabeled instances are specified with the value NA. |
gen.learners |
A list of functions for training N different supervised base classifiers. Each function needs two parameters, indexes and cls, where indexes indicates the instances to use and cls specifies the classes of those instances. |
gen.preds |
A list of functions for predicting the probabilities per classes.
Each function must have two parameters, model and indexes, where the model
is a classifier trained with the corresponding gen.learners function. |
democraticG can be helpful in those cases where the method selected as base classifier needs a learner and pred functions with other specifications. For more information about the general democratic method, please see the democratic function. Essentially, the democratic function is a wrapper of the democraticG function.
A list object of class "democraticG" containing:
A vector with the confidence-weighted vote assigned to each classifier.
A list with the final N base classifiers trained using the enlarged labeled set.
List of N vectors of indexes related to the training instances used per each classifier. These indexes are relative to the y argument.
The indexes of all training instances used to train the N models. These indexes include the initial labeled instances and the newly labeled instances. These indexes are relative to the y argument.
List of three vectors with the same information in model.index but the indexes are relative to the instances.index vector.
The levels of the y factor.
Yan Zhou and Sally Goldman.
Democratic co-learning.
In IEEE 16th International Conference on Tools with Artificial Intelligence (ICTAI),
pages 594-602. IEEE, Nov 2004. doi: 10.1109/ICTAI.2004.48.
model from RSSL package
An Expectation Maximization like approach to Semi-Supervised Least Squares Classification
As studied in Krijthe & Loog (2016), this method minimizes the total loss of the labeled and unlabeled objects by finding the weight vector and labels that minimize that loss. The algorithm proceeds similarly to EM, by subsequently applying a weight update and a soft labeling of the unlabeled objects. This is repeated until convergence.
By default (method="block") the weights of the classifier are updated, after which the unknown labels are updated. method="simple" uses LBFGS to do this update simultaneously. Objective="responsibility" corresponds to the responsibility based, instead of the label based, objective function in Krijthe & Loog (2016), which is equivalent to hard-label self-learning.
EMLeastSquaresClassifierSSLR(
  x_center = FALSE,
  scale = FALSE,
  verbose = FALSE,
  intercept = TRUE,
  lambda = 0,
  eps = 1e-09,
  y_scale = FALSE,
  alpha = 1,
  beta = 1,
  init = "supervised",
  method = "block",
  objective = "label",
  save_all = FALSE,
  max_iter = 1000
)
x_center |
logical; Should the features be centered? |
scale |
Should the features be normalized? (default: FALSE) |
verbose |
logical; Controls the verbosity of the output |
intercept |
logical; Whether an intercept should be included |
lambda |
numeric; L2 regularization parameter |
eps |
Stopping criterion for the minimization |
y_scale |
logical; whether the target vector should be centered |
alpha |
numeric; the mixture of the new responsibilities and the old in each iteration of the algorithm (default: 1) |
beta |
numeric; value between 0 and 1 that determines how much to move to the new solution from the old solution at each step of the block gradient descent |
init |
character; "random" for random initialization of labels, "supervised" to use the supervised solution as initialization, or a numeric vector with a coefficient vector to use to calculate the initialization |
method |
character; one of "block", for block gradient descent or "simple" for LBFGS optimization (default="block") |
objective |
character; "responsibility" for hard label self-learning or "label" for soft-label self-learning |
save_all |
logical; saves all classifiers trained during block gradient descent |
max_iter |
integer; maximum number of iterations |
Krijthe, J.H. & Loog, M., 2016. Optimistic Semi-supervised Least Squares Classification. In International Conference on Pattern Recognition (To Appear).
library(tidyverse)
library(tidymodels)
library(caret)
library(SSLR)

data(breast)

set.seed(1)

train.index <- createDataPartition(breast$Class, p = .7, list = FALSE)
train <- breast[ train.index,]
test  <- breast[-train.index,]

cls <- which(colnames(breast) == "Class")

#% LABELED
labeled.index <- createDataPartition(breast$Class, p = .2, list = FALSE)
train[-labeled.index, cls] <- NA

m <- EMLeastSquaresClassifierSSLR() %>% fit(Class ~ ., data = train)

# Accuracy
predict(m, test) %>%
  bind_cols(test) %>%
  metrics(truth = "Class", estimate = .pred_class)

# Accessing model from RSSL
model <- m$model
Model from the RSSL package. Semi-Supervised Nearest Mean Classifier using Expectation Maximization.
Expectation Maximization applied to the nearest mean classifier assuming Gaussian classes with a spherical covariance matrix.
Starting from the supervised solution, uses the Expectation Maximization algorithm (see Dempster et al. (1977)) to iteratively update the means and shared covariance of the classes (Maximization step) and updates the responsibilities for the unlabeled objects (Expectation step).
EMNearestMeanClassifierSSLR(method = "EM", scale = FALSE, eps = 1e-04)
method |
character; Currently only "EM" |
scale |
Should the features be normalized? (default: FALSE) |
eps |
Stopping criterion for the iterative EM optimization |
Dempster, A., Laird, N. & Rubin, D., 1977. Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society. Series B, 39(1), pp.1-38.
library(tidyverse)
library(tidymodels)
library(caret)
library(SSLR)

data(breast)

set.seed(1)

train.index <- createDataPartition(breast$Class, p = .7, list = FALSE)
train <- breast[ train.index,]
test  <- breast[-train.index,]

cls <- which(colnames(breast) == "Class")

#% LABELED
labeled.index <- createDataPartition(breast$Class, p = .2, list = FALSE)
train[-labeled.index, cls] <- NA

m <- EMNearestMeanClassifierSSLR() %>% fit(Class ~ ., data = train)

# Accessing model from RSSL
model <- m$model

# Accuracy
predict(m, test) %>%
  bind_cols(test) %>%
  metrics(truth = "Class", estimate = .pred_class)
Model from the RSSL package. R implementation of entropy regularized logistic regression as proposed by Grandvalet & Bengio (2005). An extra term is added to the objective function of logistic regression that penalizes the entropy of the posterior measured on the unlabeled examples.
EntropyRegularizedLogisticRegressionSSLR(
  lambda = 0,
  lambda_entropy = 1,
  intercept = TRUE,
  init = NA,
  scale = FALSE,
  x_center = FALSE
)
lambda |
l2 Regularization |
lambda_entropy |
Weight of the labeled observations compared to the unlabeled observations |
intercept |
logical; Whether an intercept should be included |
init |
Initial parameters for the gradient descent |
scale |
logical; Should the features be normalized? (default: FALSE) |
x_center |
logical; Should the features be centered? |
Grandvalet, Y. & Bengio, Y., 2005. Semi-supervised learning by entropy minimization. In L. K. Saul, Y. Weiss, & L. Bottou, eds. Advances in Neural Information Processing Systems 17. Cambridge, MA: MIT Press, pp. 529-536.
library(tidyverse)
library(caret)
library(tidymodels)
library(SSLR)

data(breast)

set.seed(1)

train.index <- createDataPartition(breast$Class, p = .7, list = FALSE)
train <- breast[ train.index,]
test  <- breast[-train.index,]

cls <- which(colnames(breast) == "Class")

#% LABELED
labeled.index <- createDataPartition(breast$Class, p = .2, list = FALSE)
train[-labeled.index, cls] <- NA

m <- EntropyRegularizedLogisticRegressionSSLR() %>% fit(Class ~ ., data = train)

# Accuracy
predict(m, test) %>%
  bind_cols(test) %>%
  metrics(truth = "Class", estimate = .pred_class)
An S4 method to fit a decision tree.
fit_decision_tree(object, ...)
object |
DecisionTree object |
... |
This parameter is included for compatibility reasons. |
method in class DecisionTreeClassifier used to build a Decision Tree
## S4 method for signature 'DecisionTreeClassifier'
fit_decision_tree(
  object,
  X,
  y,
  min_samples_split = 20,
  min_samples_leaf = ceiling(min_samples_split/3),
  w = 0.5
)
object |
A DecisionTreeClassifier object |
X |
An object that can be coerced to a data.frame. Training instances |
y |
A vector with the labels of the training instances. In this vector
the unlabeled instances are specified with the value NA. |
min_samples_split |
the minimum number of observations required to split a node |
min_samples_leaf |
the minimum number of observations in any terminal leaf node |
w |
weight parameter ranging from 0 to 1 |
method in class RandomForestSemisupervised used to build a Random Forest of Decision Trees
## S4 method for signature 'RandomForestSemisupervised'
fit_random_forest(
  object,
  X,
  y,
  mtry = 2,
  trees = 500,
  min_n = 2,
  w = 0.5,
  replace = TRUE,
  tree_max_depth = Inf,
  sampsize = if (replace) nrow(X) else ceiling(0.632 * nrow(X)),
  min_samples_leaf = if (!is.null(y) && !is.factor(y)) 5 else 1,
  allowParallel = TRUE
)
object |
A RandomForestSemisupervised object |
X |
An object that can be coerced to a data.frame. Training instances |
y |
A vector with the labels of the training instances. In this vector
the unlabeled instances are specified with the value NA. |
mtry |
number of features in each decision tree |
trees |
number of trees. Default is 500 |
min_n |
number of minimum samples in each tree |
w |
weight parameter ranging from 0 to 1 |
replace |
replacing type in sampling |
tree_max_depth |
maximum tree depth. Default is Inf |
sampsize |
Size of sample. Default is if (replace) nrow(X) else ceiling(0.632 * nrow(X)) |
min_samples_leaf |
the minimum number of observations in any terminal leaf node |
allowParallel |
Execute Random Forest in parallel if doParallel is loaded. Default is TRUE |
list of decision trees
fit_x_u
fit_x_u(object, ...)
object |
object |
... |
other parameters to be passed |
Function to fit with x, y and x_U. The function builds a y vector with NA values for the unlabeled data (x_U) and appends it to the y parameter.
## S3 method for class 'model_sslr'
fit_x_u(object, x = NULL, y = NULL, x_U = NULL, ...)
object |
is the model |
x |
is a data frame or matrix with the training dataset without the objective feature. x only has labeled data |
y |
is the objective feature with labeled values |
x_U |
unlabeled training data without the objective feature |
... |
This parameter is included for compatibility reasons. |
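A minimal sketch of fit_x_u(), assuming a self-labeling model such as coBC; the split of wine into labeled (x, y) and unlabeled (x_U) parts below is illustrative.

library(tidymodels)
library(caret)
library(SSLR)

data(wine)
set.seed(1)
cls <- which(colnames(wine) == "Wine")
labeled.index <- createDataPartition(wine$Wine, p = .2, list = FALSE)

x_l <- wine[labeled.index, -cls]   # labeled predictors
y_l <- wine[labeled.index, cls]    # labels
x_u <- wine[-labeled.index, -cls]  # unlabeled predictors

rf <- rand_forest(trees = 100, mode = "classification") %>%
  set_engine("randomForest")

m <- coBC(learner = rf, max.iter = 2) %>%
  fit_x_u(x = x_l, y = y_l, x_U = x_u)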
Function to fit with x and y
## S3 method for class 'model_sslr'
fit_xy(object, x = NULL, y = NULL, ...)
object |
is the model |
x |
is a data frame or matrix with the training dataset without the objective feature. x has both labeled and unlabeled data |
y |
is the objective feature with labeled values and NA values for the unlabeled data |
... |
unused in this case |
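A minimal sketch of fit_xy(), where y contains NA for the unlabeled rows; the learner choice is illustrative.

library(tidymodels)
library(caret)
library(SSLR)

data(wine)
set.seed(1)
cls <- which(colnames(wine) == "Wine")
x <- wine[, -cls]
y <- wine[, cls]

labeled.index <- createDataPartition(wine$Wine, p = .2, list = FALSE)
y[-labeled.index] <- NA   # unlabeled instances get NA

rf <- rand_forest(trees = 100, mode = "classification") %>%
  set_engine("randomForest")

m <- coBC(learner = rf, max.iter = 2) %>% fit_xy(x = x, y = y)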
Function to fit through the formula
## S3 method for class 'model_sslr'
fit(object, formula = NULL, data = NULL, ...)
object |
is the model |
formula |
is the formula |
data |
is the complete training data |
... |
unused in this case |
Cluster centers
get_centers(object, ...)
object |
object |
... |
other parameters to be passed |
Get the centers of the clusters from a fitted clustering model.
## S3 method for class 'model_sslr_fitted'
get_centers(object, ...)
object |
model_sslr_fitted model built |
... |
other parameters to be passed |
Get the most frequent value in a vector. Used in predictions. It calls predict with type = "prob" in the Decision Tree.
get_class_max_prob(trees, input)
trees |
trees list |
input |
is input to be predicted |
Get the mean probability over all trees as a probability vector. It calls predict with type = "prob" in the Decision Tree.
get_class_mean_prob(trees, input)
trees |
trees list |
input |
is input to be predicted |
FUNCTION TO GET FUNCTION METHOD SPECIFIC
get_function(met)
met |
character |
method_train (function)
FUNCTION TO GET FUNCTION METHOD GENERIC
get_function_generic(met)
met |
character |
method_train (function)
Function to get the group from the Gini index. Used for categorical variables. From: https://freakonometrics.hypotheses.org/20736
get_levels_categoric(column, Y)
column |
is the column |
Y |
values |
Get the most frequent value in a vector. Used in predictions.
get_most_frequented(elements)
elements |
vector with values |
Get the mean value over all trees. Used in predictions. It calls predict with type = "numeric" in the Decision Tree.
get_value_mean(trees, input)
trees |
trees list |
input |
is input to be predicted |
FUNCTION TO GET REAL X AND Y WITH FORMULA AND DATA
get_x_y(form, data)
form |
formula |
data |
data values, matrix, dataframe.. |
x (matrix,dataframe...) and y(factor)
Function used to calculate the Gini coefficient or the variance according to the type of the column. This function is called during the creation of the decision tree.
gini_or_variance(X)
X |
column to calculate variance or gini |
Function to compute the Gini index. From: https://freakonometrics.hypotheses.org/20736
gini_prob(y, classe)
y |
values |
classe |
classes |
Model from the RSSL package. Implements the approach proposed in Zhu et al. (2003) to label propagation over an affinity graph. Note that, as in the original paper, we consider the transductive scenario, so the implementation does not generalize to out-of-sample predictions. The approach minimizes the squared difference in labels assigned to different objects, where the contribution of each difference to the loss is weighted by the affinity between the objects. The default in this implementation is to use a knn adjacency matrix based on euclidean distance to determine this weight. Setting adjacency="heat" will use an RBF kernel over euclidean distances between objects to determine the weights.
GRFClassifierSSLR(
  adjacency = "nn",
  adjacency_distance = "euclidean",
  adjacency_k = 6,
  adjacency_sigma = 0.1,
  class_mass_normalization = TRUE,
  scale = FALSE,
  x_center = FALSE
)
adjacency |
character; "nn" for nearest neighbour graph or "heat" for radial basis adjacency matrix |
adjacency_distance |
character; distance metric for nearest neighbour adjacency matrix |
adjacency_k |
integer; number of neighbours for the nearest neighbour adjacency matrix |
adjacency_sigma |
double; width of the rbf adjacency matrix |
class_mass_normalization |
logical; Should the Class Mass Normalization heuristic be applied? (default: TRUE) |
scale |
logical; Should the features be normalized? (default: FALSE) |
x_center |
logical; Should the features be centered? |
Zhu, X., Ghahramani, Z. & Lafferty, J., 2003 Semi-supervised learning using gaussian fields and harmonic functions. In Proceedings of the 20th International Conference on Machine Learning. pp. 912-919.
library(tidyverse)
library(caret)
library(SSLR)
library(tidymodels)

data(wine)

cls <- which(colnames(wine) == "Wine")

#% LABELED
labeled.index <- createDataPartition(wine$Wine, p = .2, list = FALSE)
wine[-labeled.index, cls] <- NA

m <- GRFClassifierSSLR() %>% fit(Wine ~ ., data = wine)

# Accessing model from RSSL
model <- m$model

# Predictions of unlabeled
preds_unlabeled <- m %>% predictions()
print(preds_unlabeled)

preds_unlabeled <- m %>% predictions(type = "raw")
print(preds_unlabeled)

# Total
y_total <- wine[, cls]
y_total[-labeled.index] <- preds_unlabeled
An S4 method to grow tree.
grow_tree(object, ...)
object |
DecisionTree object |
... |
This parameter is included for compatibility reasons. |
Function to grow tree in Decision Tree
## S4 method for signature 'DecisionTreeClassifier'
grow_tree(object, X, y, parms, depth = 0)
object |
DecisionTree instance |
X |
data values |
y |
classes |
parms |
parameters for grow tree |
depth |
depth in tree |
Create a kNN regression model
knn_regression(k, x, y, p)
k |
parameter in KNN model |
x |
data |
y |
vector labeled data |
p |
distance order |
Model from the RSSL package. Manifold regularization applied to the support vector machine as proposed in Belkin et al. (2006). As an adjacency matrix, we use the k nearest neighbour graph based on a chosen distance (default: euclidean).
LaplacianSVMSSLR(
  lambda = 1,
  gamma = 1,
  scale = TRUE,
  kernel = kernlab::vanilladot(),
  adjacency_distance = "euclidean",
  adjacency_k = 6,
  normalized_laplacian = FALSE,
  eps = 1e-09
)
lambda |
numeric; L2 regularization parameter |
gamma |
numeric; Weight of the unlabeled data |
scale |
logical; Should the features be normalized? (default: TRUE, matching the usage above) |
kernel |
kernlab::kernel to use |
adjacency_distance |
character; distance metric used to construct adjacency graph from the dist function. Default: "euclidean" |
adjacency_k |
integer; Number of neighbours used to construct the adjacency graph. |
normalized_laplacian |
logical; If TRUE, the normalized Laplacian is used; otherwise, the Laplacian is used |
eps |
numeric; Small value to ensure positive definiteness of the matrix in the QP formulation |
Belkin, M., Niyogi, P. & Sindhwani, V., 2006. Manifold regularization: A geometric framework for learning from labeled and unlabeled examples. Journal of Machine Learning Research, 7, pp.2399-2434.
library(tidyverse)
library(caret)
library(tidymodels)
library(SSLR)

data(breast)

set.seed(1)

train.index <- createDataPartition(breast$Class, p = .7, list = FALSE)
train <- breast[ train.index,]
test  <- breast[-train.index,]

cls <- which(colnames(breast) == "Class")

#% LABELED
labeled.index <- createDataPartition(breast$Class, p = .2, list = FALSE)
train[-labeled.index, cls] <- NA

library(kernlab)

m <- LaplacianSVMSSLR(kernel = kernlab::vanilladot()) %>%
  fit(Class ~ ., data = train)

# Accessing model from RSSL
model <- m$model

# Accuracy
predict(m, test) %>%
  bind_cols(test) %>%
  metrics(truth = "Class", estimate = .pred_class)
Model from conclust
This function takes an unlabeled dataset and two lists of must-link and cannot-link constraints
as input and produces a clustering as output.
lcvqeSSLR(n_clusters = NULL, mustLink = NULL, cantLink = NULL, max_iter = 2)
n_clusters |
The number of clusters to be considered. Default is NULL (the number of classes) |
mustLink |
A list of must-link constraints. Default is NULL (constraints between instances with the same label) |
cantLink |
A list of cannot-link constraints. Default is NULL (constraints between instances with different labels) |
max_iter |
maximum iterations in KMeans. Default is 2 |
This model only returns labels, not centers
Dan Pelleg, Dorit Baras
K-means with large and noisy constraint sets
2007
library(tidyverse)
library(caret)
library(SSLR)
library(tidymodels)

data <- iris

set.seed(1)

#% LABELED
cls <- which(colnames(iris) == "Species")

labeled.index <- createDataPartition(data$Species, p = .2, list = FALSE)
data[-labeled.index, cls] <- NA

m <- lcvqeSSLR(max_iter = 1) %>% fit(Species ~ ., data)

# Get labels (assign clusters), type = "raw" returns factor
labels <- m %>% cluster_labels()

print(labels)
model from RSSL package
Implementation of the Linear Support Vector Classifier. Can be solved in the Dual formulation, which is equivalent to SVM, or the Primal formulation.
LinearTSVMSSLR(
  C = 1,
  Cstar = 0.1,
  s = 0,
  x_center = FALSE,
  scale = FALSE,
  eps = 1e-06,
  verbose = FALSE,
  init = NULL
)
C |
Cost variable |
Cstar |
numeric; Cost parameter of the unlabeled objects |
s |
numeric; parameter controlling the loss function of the unlabeled objects |
x_center |
logical; Should the features be centered? |
scale |
Whether a z-transform should be applied (default: FALSE, matching the usage above) |
eps |
Small value to ensure positive definiteness of the matrix in QP formulation |
verbose |
logical; Controls the verbosity of the output |
init |
numeric; Initial classifier parameters to start the convex concave procedure |
library(tidyverse)
library(caret)
library(tidymodels)
library(SSLR)

data(breast)

set.seed(1)

train.index <- createDataPartition(breast$Class, p = .7, list = FALSE)
train <- breast[ train.index,]
test  <- breast[-train.index,]

cls <- which(colnames(breast) == "Class")

#% LABELED
labeled.index <- createDataPartition(breast$Class, p = .2, list = FALSE)
train[-labeled.index, cls] <- NA

m <- LinearTSVMSSLR() %>% fit(Class ~ ., data = train)

# Accessing model from RSSL
model <- m$model
model from RSSL package
Update the means based on the moment constraints as defined in Loog (2010). The means estimated using the labeled data are updated by making sure their weighted mean corresponds to the overall mean on all (labeled and unlabeled) data. Optionally, the estimated variance of the classes can be re-estimated after this update is applied by setting update_sigma to TRUE. To get the true nearest mean classifier, rather than estimate the class priors, set them to equal priors using, for instance, prior=matrix(0.5,2).
MCNearestMeanClassifierSSLR(
  update_sigma = FALSE,
  prior = NULL,
  x_center = FALSE,
  scale = FALSE
)
update_sigma |
logical; Whether the estimate of the variance should be updated after the means have been updated using the unlabeled data |
prior |
matrix; Class priors for the classes |
x_center |
logical; Should the features be centered? |
scale |
logical; Should the features be normalized? (default: FALSE) |
Loog, M., 2010. Constrained Parameter Estimation for Semi-Supervised Learning: The Case of the Nearest Mean Classifier. In Proceedings of the 2010 European Conference on Machine Learning and Knowledge Discovery in Databases, pp. 291-304.
library(tidyverse)
library(caret)
library(tidymodels)
library(SSLR)

data(breast)

set.seed(1)

train.index <- createDataPartition(breast$Class, p = .7, list = FALSE)
train <- breast[ train.index,]
test  <- breast[-train.index,]

cls <- which(colnames(breast) == "Class")

#% LABELED
labeled.index <- createDataPartition(breast$Class, p = .2, list = FALSE)

train[-labeled.index,cls] <- NA

m <- MCNearestMeanClassifierSSLR() %>% fit(Class ~ ., data = train)

#Accessing model from RSSL
model <- m$model

#Accuracy
predict(m,test) %>%
  bind_cols(test) %>%
  metrics(truth = "Class", estimate = .pred_class)
Model from conclust
This function takes an unlabeled dataset and two lists of must-link and cannot-link constraints
as input and produces a clustering as output.
mpckmSSLR(n_clusters = NULL, mustLink = NULL, cantLink = NULL, max_iter = 10)
n_clusters |
A number of clusters to be considered. Default is NULL (num classes) |
mustLink |
A list of must-link constraints. Default is NULL (constraints are built from instances with the same label) |
cantLink |
A list of cannot-link constraints. Default is NULL (constraints are built from instances with different labels) |
max_iter |
maximum iterations in KMeans. Default is 10 |
This model only returns labels, not centers.
Bilenko, Basu, Mooney
Integrating Constraints and Metric Learning in Semi-Supervised Clustering
2004
library(tidyverse)
library(caret)
library(SSLR)
library(tidymodels)

data <- iris
set.seed(1)

#% LABELED
cls <- which(colnames(iris) == "Species")

labeled.index <- createDataPartition(data$Species, p = .2, list = FALSE)

data[-labeled.index,cls] <- NA

m <- mpckmSSLR() %>% fit(Species ~ ., data)

#Get labels (assign clusters); type = "raw" returns a factor
labels <- m %>% cluster_labels()

print(labels)
Function to create DecisionTree
newDecisionTree(max_depth)
max_depth |
max depth in tree |
Class Node for Decision Tree. Slots: gini, num_samples, num_samples_per_class, predicted_class_value, feature_index, threshold, left, right, probabilities
An S4 class to represent a value that can have several types: null, numeric or character
Build a model using the given data to be able to predict the label or the probabilities of other instances, according to 1-NN algorithm.
oneNN(x = NULL, y)
x |
This argument is not used; it is only present to comply with the expected interface |
y |
a vector with the labels of training instances |
A model with the data needed to use 1-NN
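As an illustrative sketch (not one of the package's own examples), a oneNN model can be combined with a distance matrix computed with proxy::dist, in the spirit of the generic self-training examples later in this manual; the iris split below is an arbitrary choice:

library(SSLR)

data(iris)
set.seed(1)

# Small labeled training set and a held-out test set (illustrative split)
tra.idx <- sample(nrow(iris), 50)
xtrain <- iris[tra.idx, -5]
ytrain <- iris$Species[tra.idx]
xitest <- iris[-tra.idx, -5]

# Build the 1-NN model from the training labels
m <- oneNN(y = ytrain)

# Distances between test instances (rows) and training instances (columns)
d <- as.matrix(proxy::dist(x = xitest, y = xtrain,
                           method = "euclidean", by_rows = TRUE))

# Predicted labels and class probabilities according to 1-NN
head(predict(m, d, type = "class"))
head(predict(m, d, type = "prob"))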
An S4 method to predict inputs.
predict_inputs(object, ...)
object |
DecisionTree object |
... |
This parameter is included for compatibility reasons. |
Function to predict one input in Decision Tree
## S4 method for signature 'DecisionTreeClassifier' predict_inputs(object, inputs, type = "class")
object |
DecisionTree object |
inputs |
inputs to be predicted |
type |
Type of prediction: "class" or "prob" |
Function to predict inputs in Decision Tree
## S4 method for signature 'DecisionTreeClassifier' predict(object, inputs, type = "class")
object |
The Decision Tree object |
inputs |
data to be predicted |
type |
Parameter to define the type of prediction. It can be "class", to get class labels, or "prob", to get probabilities per class for each input. Default is "class" |
Function to predict inputs in Decision Tree
## S4 method for signature 'RandomForestSemisupervised' predict( object, inputs, type = "class", confident = "max_prob", allowParallel = TRUE )
object |
The Decision Tree object |
inputs |
data to be predicted |
type |
Type of prediction: "class" or "raw" |
confident |
Parameter to define how the trees are combined. It can be "max_prob", to return the class whose summed probability is maximal, or "vote", to return the most frequent class across all trees. Default is "max_prob" |
allowParallel |
Execute Random Forest in parallel if doParallel is loaded. |
Predicts the label of instances according to the coBC
model.
## S3 method for class 'coBC' predict(object, x, ...)
object |
coBC model built with the |
x |
An object that can be coerced to a matrix.
Depending on how the model was built, |
... |
This parameter is included for compatibility reasons. |
For additional help see coBC
examples.
Vector with the labels assigned.
Predicts the label of instances according to the COREG
model.
## S3 method for class 'COREG' predict(object, x, type = "numeric", ...)
object |
COREG model built with the |
x |
An object containing the data to predict |
type |
Type of prediction in the principal model ("numeric") |
... |
This parameter is included for compatibility reasons. |
For additional help see COREG
examples.
Vector with the labels assigned (numeric).
Predicts the label of instances according to the democratic
model.
## S3 method for class 'democratic' predict(object, x, ...)
object |
Democratic model built with the |
x |
An object that can be coerced to a matrix.
Depending on how the model was built, |
... |
This parameter is included for compatibility reasons. |
For additional help see democratic
examples.
Vector with the labels assigned.
Predict EMLeastSquaresClassifierSSLR
## S3 method for class 'EMLeastSquaresClassifierSSLR' predict(object, x, ...)
object |
is the object |
x |
is the dataset |
... |
This parameter is included for compatibility reasons. |
Predict EMNearestMeanClassifierSSLR
## S3 method for class 'EMNearestMeanClassifierSSLR' predict(object, x, ...)
object |
is the object |
x |
is the dataset |
... |
This parameter is included for compatibility reasons. |
Predict EntropyRegularizedLogisticRegressionSSLR
## S3 method for class 'EntropyRegularizedLogisticRegressionSSLR' predict(object, x, ...)
object |
is the object |
x |
is the dataset |
... |
This parameter is included for compatibility reasons. |
Predict LaplacianSVMSSLR
## S3 method for class 'LaplacianSVMSSLR' predict(object, x, ...)
object |
is the object |
x |
is the dataset |
... |
This parameter is included for compatibility reasons. |
Predict LinearTSVMSSLR
## S3 method for class 'LinearTSVMSSLR' predict(object, x, ...)
object |
is the object |
x |
is the dataset |
... |
This parameter is included for compatibility reasons. |
Predict MCNearestMeanClassifierSSLR
## S3 method for class 'MCNearestMeanClassifierSSLR' predict(object, x, ...)
object |
is the object |
x |
is the dataset |
... |
This parameter is included for compatibility reasons. |
Predicts from the model. There are different types: "class" returns a tibble with one column; "prob" returns a tibble with one probability column per class; "raw" returns a factor or numeric values.
## S3 method for class 'model_sslr_fitted' predict(object, x, type = NULL, ...)
object |
model_sslr_fitted model built. |
x |
An object that can be coerced to a matrix.
Depending on how the model was built, |
type |
Type of prediction in the principal model: class, raw, prob, vote, max_prob, numeric |
... |
This parameter is included for compatibility reasons. |
tibble or vector.
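A short hedged sketch of the three prediction types, using an SSLRDecisionTree fitted on the partially labeled wine data as in the examples elsewhere in this manual (the choice of model is arbitrary):

library(tidyverse)
library(caret)
library(SSLR)
library(tidymodels)

data(wine)
set.seed(1)

cls <- which(colnames(wine) == "Wine")

#% LABELED
labeled.index <- createDataPartition(wine$Wine, p = .2, list = FALSE)

train <- wine
train[-labeled.index, cls] <- NA

m <- SSLRDecisionTree() %>% fit(Wine ~ ., data = train)

predict(m, wine)                  # tibble with one class column (.pred_class)
predict(m, wine, type = "prob")   # tibble with one probability column per class
predict(m, wine, type = "raw")    # plain factor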
This function predicts the class label of instances or its probability of pertaining to each class based on the distance matrix.
## S3 method for class 'OneNN' predict(object, dists, type = "prob", ...)
object |
A model of class OneNN built with |
dists |
A matrix of distances between the instances to classify (by rows) and the instances used to train the model (by column) |
type |
A string that can take two values: |
... |
Currently not used. |
If type is equal to "class", it returns a vector of length equal to the number of rows of the matrix dists, containing the predicted labels. If type is equal to "prob", it returns a matrix with nrow(dists) rows and a column for every class, where each cell represents the probability that the instance belongs to that class, according to 1-NN.
Predicts the label of instances according to the RandomForestSemisupervised_fitted model.
## S3 method for class 'RandomForestSemisupervised_fitted' predict(object, x, type = "class", confident = "max_prob", ...)
object |
RandomForestSemisupervised_fitted. |
x |
An object that can be coerced to a matrix.
Depending on how the model was built, |
type |
Type of prediction in the principal model |
confident |
Parameter to define how the trees are combined. It can be "max_prob", to return the class whose summed probability is maximal, or "vote", to return the most frequent class across all trees. Default is "max_prob" |
... |
This parameter is included for compatibility reasons. |
Vector with the labels assigned.
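A hedged sketch of the confident argument, assuming the object returned by fitting SSLRRandomForest dispatches to this method (as in the SSLRRandomForest example later in this manual):

library(tidyverse)
library(caret)
library(SSLR)
library(tidymodels)

data(wine)
set.seed(1)

cls <- which(colnames(wine) == "Wine")

#% LABELED
labeled.index <- createDataPartition(wine$Wine, p = .2, list = FALSE)

train <- wine
train[-labeled.index, cls] <- NA

m <- SSLRRandomForest(trees = 5, w = 0.3) %>% fit(Wine ~ ., data = train)

# Aggregate the trees by the maximum summed probability (the default) ...
predict(m, wine, confident = "max_prob")

# ... or by majority vote across the trees
predict(m, wine, confident = "vote")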
Predicts the label of instances according to the selfTraining
model.
## S3 method for class 'selfTraining' predict(object, x, type = "class", ...)
object |
Self-training model built with the |
x |
An object that can be coerced to a matrix.
Depending on how the model was built, |
type |
Type of prediction in the principal model |
... |
This parameter is included for compatibility reasons. |
For additional help see selfTraining
examples.
Vector with the labels assigned.
Predicts the label of instances according to the setred
model.
## S3 method for class 'setred' predict(object, x, col_name = ".pred_class", ...)
object |
SETRED model built with the |
x |
An object that can be coerced to a matrix.
Depending on how the model was built, |
col_name |
Name of the column in the returned tibble when the type is "class", following parsnip and tidymodels conventions. Default is ".pred_class" |
... |
This parameter is included for compatibility reasons. |
For additional help see setred
examples.
Vector with the labels assigned.
Predicts the label of instances according to the snnrce
model.
## S3 method for class 'snnrce' predict(object, x, ...)
object |
SNNRCE model built with the |
x |
An object that can be coerced to a matrix.
Depending on how the model was built, |
... |
This parameter is included for compatibility reasons. |
For additional help see snnrce
examples.
Vector with the labels assigned.
Predicts the label of instances according to the snnrceG
model.
## S3 method for class 'snnrceG' predict(object, D, ...)
object |
model instance |
D |
distance matrix |
... |
This parameter is included for compatibility reasons. |
Predicts the label of instances SSLRDecisionTree_fitted model.
## S3 method for class 'SSLRDecisionTree_fitted' predict(object, x, type = "class", ...)
object |
model SSLRDecisionTree_fitted. |
x |
An object that can be coerced to a matrix.
Depending on how the model was built, |
type |
Type of prediction in the principal model |
... |
This parameter is included for compatibility reasons. |
Vector with the labels assigned.
Predicts the label of instances according to the triTraining
model.
## S3 method for class 'triTraining' predict(object, x, ...)
object |
Tri-training model built with the |
x |
An object that can be coerced to a matrix.
Depending on how the model was built, |
... |
This parameter is included for compatibility reasons. |
For additional help see triTraining
examples.
Vector with the labels assigned.
Predict TSVMSSLR
## S3 method for class 'TSVMSSLR' predict(object, x, ...)
object |
is the object |
x |
is the dataset |
... |
This parameter is included for compatibility reasons. |
Predict USMLeastSquaresClassifierSSLR
## S3 method for class 'USMLeastSquaresClassifierSSLR' predict(object, x, ...)
object |
is the object |
x |
is the dataset |
... |
This parameter is included for compatibility reasons. |
Predict WellSVMSSLR
## S3 method for class 'WellSVMSSLR' predict(object, x, ...)
object |
is the object |
x |
is the dataset |
... |
This parameter is included for compatibility reasons. |
Predictions
predictions(object, ...)
object |
object |
... |
other parameters to be passed |
Predictions
## S3 method for class 'GRFClassifierSSLR' predictions(object, ...)
object |
object |
... |
other parameters to be passed |
Predictions of the unlabeled data (transductive). "raw" returns a factor or numeric values.
## S3 method for class 'model_sslr_fitted' predictions(object, type = "class", ...)
object |
model_sslr_fitted model built |
type |
Type of prediction in the principal model: class, raw |
... |
other parameters to be passed |
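A hedged sketch of retrieving the transductive labels assigned to the unlabeled training rows; it assumes the fitted model (here an SSLRDecisionTree, chosen arbitrarily) exposes its transductive predictions through this generic:

library(tidyverse)
library(caret)
library(SSLR)
library(tidymodels)

data(wine)
set.seed(1)

cls <- which(colnames(wine) == "Wine")

#% LABELED
labeled.index <- createDataPartition(wine$Wine, p = .2, list = FALSE)

train <- wine
train[-labeled.index, cls] <- NA

m <- SSLRDecisionTree() %>% fit(Wine ~ ., data = train)

# Labels assigned during training to the unlabeled instances (transductive)
m %>% predictions()               # class predictions
m %>% predictions(type = "raw")   # plain factor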
Print model SSLR
## S3 method for class 'model_sslr' print(object)
object |
model_sslr object to print |
The difference from traditional k-means is that, in this implementation, at initialization there are as many clusters as classes in the labelled data, and each cluster is initialized with the average of the labelled data of the corresponding class.
seeded_kmeans(max_iter = 10, method = "euclidean")
max_iter |
maximum iterations in KMeans. Default is 10 |
method |
distance method in KMeans: "euclidean", "maximum", "manhattan", "canberra", "binary" or "minkowski" |
Sugato Basu, Arindam Banerjee, Raymond Mooney
Semi-supervised clustering by seeding
July 2002
In Proceedings of 19th International Conference on Machine Learning
library(tidyverse)
library(caret)
library(SSLR)
library(tidymodels)

data <- iris
set.seed(1)

#% LABELED
cls <- which(colnames(iris) == "Species")

labeled.index <- createDataPartition(data$Species, p = .2, list = FALSE)

data[-labeled.index,cls] <- NA

m <- seeded_kmeans() %>% fit(Species ~ ., data)

#Get labels (assign clusters); type = "raw" returns a factor
labels <- m %>% cluster_labels()

print(labels)

#Get centers
centers <- m %>% get_centers()

print(centers)
Self-training is a simple and effective semi-supervised learning classification method. The self-training classifier is initially trained with a reduced set of labeled examples. Then it is iteratively retrained with its own most confident predictions over the unlabeled examples. Self-training follows a wrapper methodology using a base supervised classifier to establish the possible class of unlabeled instances.
selfTraining(learner, max.iter = 50, perc.full = 0.7, thr.conf = 0.5)
learner |
model from the parsnip package for training a supervised base classifier using a set of instances. This model needs to provide probability predictions (or, optionally, a distance matrix) and its corresponding classes. |
max.iter |
maximum number of iterations to execute the self-labeling process. Default is 50. |
perc.full |
A number between 0 and 1. If the percentage of new labeled examples reaches this value the self-training process is stopped. Default is 0.7. |
thr.conf |
A number between 0 and 1 that indicates the confidence threshold.
At each iteration, only the newly labelled examples with a confidence greater than
this value ( |
For predicting the most accurate instances per iteration, selfTraining uses the predictions obtained with the learner specified. To train a model using the learner function, a set of instances is required (or a precomputed matrix between the instances if the x.inst parameter is FALSE), in conjunction with the corresponding classes. Additional parameters are provided to the learner function via the learner.pars argument. The model obtained is a supervised classifier ready to predict new instances through the pred function. Using a similar idea, the additional parameters to the pred function are provided using the pred.pars argument. The pred function returns the probabilities per class for each new instance. The value of the thr.conf argument controls the confidence of instances selected to enlarge the labeled set for the next iteration.
The stopping criterion is defined through the fulfillment of one of the following criteria: the algorithm reaches the number of iterations defined in the max.iter parameter, or the portion of the unlabeled set defined in the perc.full parameter is moved to the labeled set. In some cases, the process stops and no instances are added to the original labeled set. In this case, the user must assign a more flexible value to the thr.conf parameter.
(When the model is fitted) A list object of class "selfTraining" containing:
The final base classifier trained using the enlarged labeled set.
The indexes of the training instances used to train the model. These indexes include the initial labeled instances and the newly labeled instances. Those indexes are relative to the x argument.
The levels of the y factor.
The function provided in the pred argument.
The list provided in the pred.pars argument.
David Yarowsky.
Unsupervised word sense disambiguation rivaling supervised methods.
In Proceedings of the 33rd annual meeting on Association for Computational Linguistics,
pages 189-196. Association for Computational Linguistics, 1995.
library(tidyverse)
library(tidymodels)
library(caret)
library(SSLR)

data(wine)

set.seed(1)

train.index <- createDataPartition(wine$Wine, p = .7, list = FALSE)
train <- wine[ train.index,]
test  <- wine[-train.index,]

cls <- which(colnames(wine) == "Wine")

#% LABELED
labeled.index <- createDataPartition(train$Wine, p = .2, list = FALSE)

train[-labeled.index,cls] <- NA

#We need a model with probability predictions from parsnip
#https://tidymodels.github.io/parsnip/articles/articles/Models.html
#It should be with mode = classification

#For example, with Random Forest
rf <- rand_forest(trees = 100, mode = "classification") %>%
  set_engine("randomForest")

m <- selfTraining(learner = rf,
                  perc.full = 0.7,
                  thr.conf = 0.5,
                  max.iter = 10) %>% fit(Wine ~ ., data = train)

#Accuracy
predict(m,test) %>%
  bind_cols(test) %>%
  metrics(truth = "Wine", estimate = .pred_class)
Self-training is a simple and effective semi-supervised learning classification method. The self-training classifier is initially trained with a reduced set of labeled examples. Then it is iteratively retrained with its own most confident predictions over the unlabeled examples. Self-training follows a wrapper methodology using one base supervised classifier to establish the possible class of unlabeled instances.
selfTrainingG( y, gen.learner, gen.pred, max.iter = 50, perc.full = 0.7, thr.conf = 0.5 )
y |
A vector with the labels of training instances. In this vector the
unlabeled instances are specified with the value |
gen.learner |
A function for training a supervised base classifier. This function needs two parameters, indexes and cls, where indexes indicates the instances to use and cls specifies the classes of those instances. |
gen.pred |
A function for predicting the probabilities per class.
This function must have two parameters, model and indexes, where the model
is a classifier trained with |
max.iter |
Maximum number of iterations to execute the self-labeling process. Default is 50. |
perc.full |
A number between 0 and 1. If the percentage of new labeled examples reaches this value the self-training process is stopped. Default is 0.7. |
thr.conf |
A number between 0 and 1 that indicates the confidence threshold.
At each iteration, only the newly labelled examples with a confidence greater than
this value ( |
SelfTrainingG can be helpful in those cases where the method selected as
base classifier needs learner
and pred
functions with other
specifications. For more information about the general self-training method,
please see the selfTraining
function. Essentially, the selfTraining
function is a wrapper of the selfTrainingG
function.
A list object of class "selfTrainingG" containing:
The final base classifier trained using the enlarged labeled set.
The indexes of the training instances used to
train the model
. These indexes include the initial labeled instances
and the newly labeled instances.
Those indexes are relative to the y
argument.
library(SSLR)

## Load Wine data set
data(wine)

cls <- which(colnames(wine) == "Wine")
x <- wine[, - cls] # instances without classes
y <- wine[, cls] # the classes
x <- scale(x)

set.seed(20)

# Use 50% of instances for training
tra.idx <- sample(x = length(y), size = ceiling(length(y) * 0.5))
xtrain <- x[tra.idx,]
ytrain <- y[tra.idx]

# Use 70% of train instances as unlabeled set
tra.na.idx <- sample(x = length(tra.idx), size = ceiling(length(tra.idx) * 0.7))
ytrain[tra.na.idx] <- NA

# Use the other 50% of instances for inductive testing
tst.idx <- setdiff(1:length(y), tra.idx)
xitest <- x[tst.idx,] # testing instances
yitest <- y[tst.idx] # classes of instances in xitest

# Use the unlabeled examples for transductive testing
xttest <- x[tra.idx[tra.na.idx],] # transductive testing instances
yttest <- y[tra.idx[tra.na.idx]] # classes of instances in xttest

library(caret)

#PREPARE DATA
data <- cbind(xtrain, Class = ytrain)

dtrain <- as.matrix(proxy::dist(x = xtrain, method = "euclidean", by_rows = TRUE))
ditest <- as.matrix(proxy::dist(x = xitest, y = xtrain, method = "euclidean", by_rows = TRUE))

ddata <- cbind(dtrain, Class = ytrain)
ddata <- as.data.frame(ddata)

ktrain <- as.matrix(exp(-0.048 * dtrain ^ 2))
kdata <- cbind(ktrain, Class = ytrain)
kdata <- as.data.frame(kdata)

ktrain <- as.matrix(exp(-0.048 * dtrain ^ 2))
kitest <- as.matrix(exp(-0.048 * ditest ^ 2))

## Example: Training from a set of instances with 1-NN (knn3) as base classifier.
gen.learner <- function(indexes, cls)
  caret::knn3(x = xtrain[indexes,], y = cls, k = 1)
gen.pred <- function(model, indexes)
  predict(model, xtrain[indexes,])

trControl_selfTrainingG1 <- list(gen.learner = gen.learner, gen.pred = gen.pred)
md1 <- train_generic(ytrain, method = "selfTrainingG", trControl = trControl_selfTrainingG1)

p1 <- predict(md1$model, xitest, type = "class")
table(p1, yitest)

confusionMatrix(p1, yitest)$overall[1]

## Example: Training from a distance matrix with 1-NN (oneNN) as base classifier.
dtrain <- as.matrix(proxy::dist(x = xtrain, method = "euclidean", by_rows = TRUE))

gen.learner <- function(indexes, cls) {
  m <- SSLR::oneNN(y = cls)
  attr(m, "tra.idxs") <- indexes
  m
}

gen.pred <- function(model, indexes) {
  tra.idxs <- attr(model, "tra.idxs")
  d <- dtrain[indexes, tra.idxs]
  prob <- predict(model, d, distance.weighting = "none")
  prob
}

trControl_selfTrainingG2 <- list(gen.learner = gen.learner, gen.pred = gen.pred)
md2 <- train_generic(ytrain, method = "selfTrainingG", trControl = trControl_selfTrainingG2)

ditest <- proxy::dist(x = xitest, y = xtrain[md2$instances.index,],
                      method = "euclidean", by_rows = TRUE)

p2 <- predict(md2$model, ditest, type = "class")
table(p2, yitest)

confusionMatrix(p2, yitest)$overall[1]
SETRED (SElf-TRaining with EDiting) is a variant of the self-training
classification method (as implemented in the function selfTraining
) with a different addition mechanism.
The SETRED classifier is initially trained with a
reduced set of labeled examples. Then, it is iteratively retrained with its own most
confident predictions over the unlabeled examples. SETRED uses an amending scheme
to avoid the introduction of noisy examples into the enlarged labeled set. For each
iteration, the mislabeled examples are identified using the local information provided
by the neighborhood graph.
setred( dist = "Euclidean", learner, theta = 0.1, max.iter = 50, perc.full = 0.7, D = NULL )
dist |
A distance function or the name of a distance available
in the |
learner |
model from the parsnip package for training a supervised base classifier using a set of instances. This model needs to provide probability predictions (or, optionally, a distance matrix) and its corresponding classes. |
theta |
Rejection threshold to test the critical region. Default is 0.1. |
max.iter |
maximum number of iterations to execute the self-labeling process. Default is 50. |
perc.full |
A number between 0 and 1. If the percentage of new labeled examples reaches this value the self-training process is stopped. Default is 0.7. |
D |
A distance matrix between all the training instances. This matrix is used to construct the neighborhood graph. Default is NULL, which means the method creates the matrix using the dist parameter |
SETRED initiates the self-labeling process by training a model from the original
labeled set. In each iteration, the learner
function detects unlabeled
examples for which it makes the most confident prediction and labels those examples
according to the pred
function. The identification of mislabeled examples is
performed using a neighborhood graph created from the distance matrix.
Most examples possess the same label in a neighborhood. So if an example is located in a neighborhood with too many neighbors from different classes, this example should be considered problematic. The value of the theta argument controls the confidence of the candidates selected to enlarge the labeled set. The lower this value is, the more restrictive the selection of the examples that are considered good. For more information about the self-labeled process and the rest of the parameters, please see selfTraining.
(When the model is fitted) A list object of class "setred" containing:
The final base classifier trained using the enlarged labeled set.
The indexes of the training instances used to train the model. These indexes include the initial labeled instances and the newly labeled instances. Those indexes are relative to the x argument.
The levels of the y factor.
The function provided in the pred argument.
The list provided in the pred.pars argument.
Ming Li and ZhiHua Zhou.
Setred: Self-training with editing.
In Advances in Knowledge Discovery and Data Mining, volume 3518 of Lecture Notes in
Computer Science, pages 611-621. Springer Berlin Heidelberg, 2005.
ISBN 978-3-540-26076-9. doi: 10.1007/11430919_71.
library(tidyverse)
library(tidymodels)
library(caret)
library(SSLR)

data(wine)

set.seed(1)

train.index <- createDataPartition(wine$Wine, p = .7, list = FALSE)
train <- wine[ train.index,]
test  <- wine[-train.index,]

cls <- which(colnames(wine) == "Wine")

#% LABELED
labeled.index <- createDataPartition(wine$Wine, p = .2, list = FALSE)

train[-labeled.index,cls] <- NA

#We need a model with probability predictions from parsnip
#https://tidymodels.github.io/parsnip/articles/articles/Models.html
#It should be with mode = classification

#For example, with Random Forest
rf <- rand_forest(trees = 100, mode = "classification") %>%
  set_engine("randomForest")

m <- setred(learner = rf,
            theta = 0.1,
            max.iter = 2,
            perc.full = 0.7) %>% fit(Wine ~ ., data = train)

#Accuracy
predict(m,test) %>%
  bind_cols(test) %>%
  metrics(truth = "Wine", estimate = .pred_class)

#Another example, with dist matrix
distance <- as.matrix(proxy::dist(train[,-cls], method = "Euclidean",
                                  by_rows = TRUE, diag = TRUE, upper = TRUE))

m <- setred(learner = rf,
            theta = 0.1,
            max.iter = 2,
            perc.full = 0.7,
            D = distance) %>% fit(Wine ~ ., data = train)

#Accuracy
predict(m,test) %>%
  bind_cols(test) %>%
  metrics(truth = "Wine", estimate = .pred_class)
SETRED is a variant of the self-training classification method
(selfTraining
) with a different addition mechanism.
The SETRED classifier is initially trained with a
reduced set of labeled examples. Then it is iteratively retrained with its own most
confident predictions over the unlabeled examples. SETRED uses an amending scheme
to avoid the introduction of noisy examples into the enlarged labeled set. For each
iteration, the mislabeled examples are identified using the local information provided
by the neighborhood graph.
setredG( y, D, gen.learner, gen.pred, theta = 0.1, max.iter = 50, perc.full = 0.7 )
y |
A vector with the labels of training instances. In this vector the
unlabeled instances are specified with the value |
D |
A distance matrix between all the training instances. This matrix is used to construct the neighborhood graph. |
gen.learner |
A function for training a supervised base classifier. This function needs two parameters, indexes and cls, where indexes indicates the instances to use and cls specifies the classes of those instances. |
gen.pred |
A function for predicting the probabilities per class.
This function must have two parameters, model and indexes, where the model
is a classifier trained with |
theta |
Rejection threshold to test the critical region. Default is 0.1. |
max.iter |
Maximum number of iterations to execute the self-labeling process. Default is 50. |
perc.full |
A number between 0 and 1. If the percentage of new labeled examples reaches this value the self-training process is stopped. Default is 0.7. |
SetredG can be helpful in those cases where the method selected as
base classifier needs a learner
and pred
functions with other
specifications. For more information about the general setred method,
please see setred
function. Essentially, setred
function is a wrapper of setredG
function.
A list object of class "setredG" containing:
The final base classifier trained using the enlarged labeled set.
The indexes of the training instances used to
train the model
. These indexes include the initial labeled instances
and the newly labeled instances.
Those indexes are relative to the y
argument.
library(SSLR)
library(caret)

## Load Wine data set
data(wine)

cls <- which(colnames(wine) == "Wine")
x <- wine[, - cls] # instances without classes
y <- wine[, cls] # the classes
x <- scale(x) # scale the attributes

## Prepare data
set.seed(20)

# Use 50% of instances for training
tra.idx <- sample(x = length(y), size = ceiling(length(y) * 0.5))
xtrain <- x[tra.idx,] # training instances
ytrain <- y[tra.idx] # classes of training instances

# Use 70% of train instances as unlabeled set
tra.na.idx <- sample(x = length(tra.idx), size = ceiling(length(tra.idx) * 0.7))
ytrain[tra.na.idx] <- NA # remove class information of unlabeled instances

# Use the other 50% of instances for inductive testing
tst.idx <- setdiff(1:length(y), tra.idx)
xitest <- x[tst.idx,] # testing instances
yitest <- y[tst.idx] # classes of testing instances

# Compute distances between training instances
D <- as.matrix(proxy::dist(x = xtrain, method = "euclidean", by_rows = TRUE))

## Example: Training from a set of instances with 1-NN (knn3) as base classifier.
gen.learner <- function(indexes, cls)
  caret::knn3(x = xtrain[indexes,], y = cls, k = 1)
gen.pred <- function(model, indexes)
  predict(model, xtrain[indexes,])

trControl_SETRED1 <- list(D = D, gen.learner = gen.learner, gen.pred = gen.pred)
md1 <- train_generic(ytrain, method = "setredG", trControl = trControl_SETRED1)
'md1 <- setredG(y = ytrain, D, gen.learner, gen.pred)'

cls1 <- predict(md1$model, xitest, type = "class")
table(cls1, yitest)

confusionMatrix(cls1, yitest)$overall[1]

## Example: Training from a distance matrix with 1-NN (oneNN) as base classifier
gen.learner <- function(indexes, cls) {
  m <- SSLR::oneNN(y = cls)
  attr(m, "tra.idxs") <- indexes
  m
}

gen.pred <- function(model, indexes) {
  tra.idxs <- attr(model, "tra.idxs")
  d <- D[indexes, tra.idxs]
  prob <- predict(model, d, distance.weighting = "none")
  prob
}

trControl_SETRED2 <- list(D = D, gen.learner = gen.learner, gen.pred = gen.pred)
md2 <- train_generic(ytrain, method = "setredG", trControl = trControl_SETRED2)

ditest <- proxy::dist(x = xitest, y = xtrain[md2$instances.index,],
                      method = "euclidean", by_rows = TRUE)

cls2 <- predict(md2$model, ditest, type = "class")
table(cls2, yitest)

confusionMatrix(cls2, yitest)$overall[1]
SNNRCE (Self-training Nearest Neighbor Rule using Cut Edges) is a variant
of the self-training classification method (selfTraining
) with a different
addition mechanism and a fixed learning scheme (1-NN). SNNRCE uses an amending scheme
to avoid the introduction of noisy examples into the enlarged labeled set.
The mislabeled examples are identified using the local information provided
by the neighborhood graph. A statistical test using cut edge weight is used to modify
the labels of the misclassified examples.
snnrce(x.inst = TRUE, dist = "Euclidean", alpha = 0.1)
x.inst |
A boolean value that indicates if |
dist |
A distance function available in the |
alpha |
Rejection threshold to test the critical region. Default is 0.1. |
SNNRCE initiates the self-labeling process by training a 1-NN from the original
labeled set. This method attempts to reduce the noise in examples by labeling those instances
with no cut edges in the initial stages of self-labeling learning.
These highly confident examples are added into the training set.
The remaining examples follow the standard self-training process until a minimum number of examples is labeled for each class. A statistical test using cut edge weight is used to modify the labels of the misclassified examples. The value of the alpha argument defines the critical region where the candidate examples are tested. The higher this value is, the more relaxed the selection of the examples that are considered mislabeled.
(When the model is fitted) A list object of class "snnrce" containing:
The final base classifier trained using the enlarged labeled set.
The indexes of the training instances used to train the model. These indexes include the initial labeled instances and the newly labeled instances. Those indexes are relative to the x argument.
The levels of the y factor.
The value provided in the x.inst argument.
The value provided in the dist argument when x.inst is TRUE.
A matrix with the subset of training instances referenced by the indexes instances.index when x.inst is TRUE.
Yu Wang, Xiaoyan Xu, Haifeng Zhao, and Zhongsheng Hua.
Semisupervised learning based on nearest neighbor rule and cut edges.
Knowledge-Based Systems, 23(6):547-554, 2010. ISSN 0950-7051. doi: http://dx.doi.org/10.1016/j.knosys.2010.03.012.
library(tidyverse)
library(tidymodels)
library(caret)
library(SSLR)

data(wine)

set.seed(1)

train.index <- createDataPartition(wine$Wine, p = .7, list = FALSE)
train <- wine[ train.index,]
test  <- wine[-train.index,]

cls <- which(colnames(wine) == "Wine")

#% LABELED
labeled.index <- createDataPartition(wine$Wine, p = .2, list = FALSE)

train[-labeled.index,cls] <- NA

m <- snnrce(x.inst = TRUE,
            dist = "Euclidean",
            alpha = 0.1) %>% fit(Wine ~ ., data = train)

predict(m,test) %>%
  bind_cols(test) %>%
  metrics(truth = "Wine", estimate = .pred_class)
Decision Tree is a simple and effective semi-supervised learning method, based on the article "Semi-supervised classification trees". It offers many parameters to modify the behavior of the method. It is the same as the traditional Decision Tree algorithm, but the difference is how the Gini coefficient is calculated (classification). In regression the SSE metric is used (different from the original investigation). It can be used for classification or regression: if Y is numeric the task is regression, otherwise classification.
SSLRDecisionTree( max_depth = 30, w = 0.5, min_samples_split = 20, min_samples_leaf = ceiling(min_samples_split/3) )
max_depth |
A number from 1 to Inf. The maximum depth of the Decision Tree. Default is 30 |
w |
weight parameter ranging from 0 to 1. Default is 0.5 |
min_samples_split |
the minimum number of observations required to split a node. Default is 20 |
min_samples_leaf |
the minimum number of observations in any terminal leaf node. Default is ceiling(min_samples_split/3) |
With this model, predictions can also be made with type "prob".
Jurica Levatić, Michelangelo Ceci, Dragi Kocev, Sašo Džeroski.
Semi-supervised classification trees.
Published online: 25 March 2017
© Springer Science Business Media New York 2017
library(tidyverse)
library(caret)
library(SSLR)
library(tidymodels)

data(wine)

set.seed(1)

train.index <- createDataPartition(wine$Wine, p = .7, list = FALSE)
train <- wine[ train.index,]
test  <- wine[-train.index,]

cls <- which(colnames(wine) == "Wine")

#% LABELED
labeled.index <- createDataPartition(wine$Wine, p = .2, list = FALSE)

train[-labeled.index,cls] <- NA

m <- SSLRDecisionTree(min_samples_split = round(length(labeled.index) * 0.25),
                      w = 0.3) %>% fit(Wine ~ ., data = train)

#Accuracy
predict(m,test) %>%
  bind_cols(test) %>%
  metrics(truth = "Wine", estimate = .pred_class)

#For probabilities
predict(m,test, type = "prob")
Random Forest is a simple and effective semi-supervised learning method. It is the same as the traditional Random Forest algorithm, but the difference is that it uses semi-supervised Decision Trees. It can be used for classification or regression: if Y is numeric the task is regression, otherwise classification.
SSLRRandomForest( mtry = NULL, trees = 500, min_n = NULL, w = 0.5, replace = TRUE, tree_max_depth = Inf, sampsize = NULL, min_samples_leaf = NULL, allowParallel = TRUE )
mtry |
number of features considered in each decision tree. Default is NULL, which means mtry = log(n_features) + 1 |
trees |
number of trees. Default is 500 |
min_n |
minimum number of samples in each tree. Default is NULL, which means all training data are used |
w |
weight parameter ranging from 0 to 1. Default is 0.5 |
replace |
whether sampling is done with replacement. Default is TRUE |
tree_max_depth |
maximum tree depth. Default is Inf |
sampsize |
Size of the sample. Default is if (replace) nrow(x) else ceiling(.632*nrow(x)) |
min_samples_leaf |
the minimum number of observations in any terminal leaf node. Default is 1 |
allowParallel |
Execute Random Forest in parallel if doParallel is loaded. Default is TRUE |
We can use parallel processing with the doParallel package and allowParallel = TRUE, as sketched below.
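A minimal hedged sketch (assuming a doParallel backend is available on the machine): the backend is registered before fitting, and the forest then grows its trees in parallel.

library(tidyverse)
library(caret)
library(SSLR)
library(tidymodels)
library(doParallel)

data(wine)
set.seed(1)

cls <- which(colnames(wine) == "Wine")
labeled.index <- createDataPartition(wine$Wine, p = .2, list = FALSE)

train <- wine
train[-labeled.index, cls] <- NA

# Register a parallel backend; the trees are grown in parallel
# because allowParallel = TRUE (the default)
registerDoParallel(cores = 2)

m <- SSLRRandomForest(trees = 5, allowParallel = TRUE) %>% fit(Wine ~ ., data = train)

stopImplicitCluster()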
Jurica Levatić, Michelangelo Ceci, Dragi Kocev, Sašo Džeroski.
Semi-supervised classification trees.
Published online: 25 March 2017
© Springer Science Business Media New York 2017
library(tidyverse)
library(caret)
library(SSLR)
library(tidymodels)

data(wine)

set.seed(1)

train.index <- createDataPartition(wine$Wine, p = .7, list = FALSE)
train <- wine[ train.index,]
test  <- wine[-train.index,]

cls <- which(colnames(wine) == "Wine")

#% LABELED
labeled.index <- createDataPartition(train$Wine, p = .2, list = FALSE)

train[-labeled.index,cls] <- NA

m <- SSLRRandomForest(trees = 5, w = 0.3) %>% fit(Wine ~ ., data = train)

#Accuracy
predict(m,test) %>%
  bind_cols(test) %>%
  metrics(truth = "Wine", estimate = .pred_class)

#For probabilities
predict(m,test, type = "prob")
Function to train a generic model.
train_generic(y, ...)
y |
(optional) factor (classes) |
... |
additional parameters, typically method and trControl (a list of control parameters) |
The trained model
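A condensed, hedged sketch of calling train_generic directly, based on the selfTrainingG example above (1-NN via caret::knn3 as the base classifier):

library(SSLR)
library(caret)

data(wine)

cls <- which(colnames(wine) == "Wine")
x <- scale(wine[, -cls])
y <- wine[, cls]

set.seed(20)
# Mark 70% of the instances as unlabeled
na.idx <- sample(length(y), ceiling(length(y) * 0.7))
y[na.idx] <- NA

# Base classifier and prediction functions for the generic interface
gen.learner <- function(indexes, cls) caret::knn3(x = x[indexes, ], y = cls, k = 1)
gen.pred <- function(model, indexes) predict(model, x[indexes, ])

md <- train_generic(y, method = "selfTrainingG",
                    trControl = list(gen.learner = gen.learner, gen.pred = gen.pred))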
Tri-training is a semi-supervised learning algorithm with a co-training style. This algorithm trains three classifiers with the same learning scheme from a reduced set of labeled examples. For each iteration, an unlabeled example is labeled for a classifier if the other two classifiers agree on the labeling proposed.
triTraining(learner)
learner |
model from the parsnip package for training a supervised base classifier using a set of instances. This model needs to provide probability predictions (or, optionally, a distance matrix) and its corresponding classes. |
Tri-training initiates the self-labeling process by training three models from the
original labeled set, using the learner
function specified.
In each iteration, the algorithm detects unlabeled examples on which two classifiers
agree with the classification and includes these instances in the enlarged set of the
third classifier under certain conditions. The generation of the final hypothesis is
produced via majority voting. The iteration process ends when no changes occur in
any model during a complete iteration.
A list object of class "triTraining" containing:
The final three base classifiers trained using the enlarged labeled set.
List of three vectors of indexes related to the training instances
used per each classifier. These indexes are relative to the y
argument.
The indexes of all training instances used to
train the three models. These indexes include the initial labeled instances
and the newly labeled instances. These indexes are relative to the y
argument.
List of three vectors with the same information in model.index
but the indexes are relative to instances.index
vector.
The levels of y
factor.
The function provided in the pred
argument.
The list provided in the pred.pars
argument.
The value provided in the x.inst
argument.
ZhiHua Zhou and Ming Li.
Tri-training: exploiting unlabeled data using three classifiers.
IEEE Transactions on Knowledge and Data Engineering, 17(11):1529-1541, Nov 2005. ISSN 1041-4347. doi: 10.1109/TKDE.2005.186.
library(tidyverse)
library(tidymodels)
library(caret)
library(SSLR)

data(wine)

set.seed(1)

train.index <- createDataPartition(wine$Wine, p = .7, list = FALSE)
train <- wine[ train.index,]
test  <- wine[-train.index,]

cls <- which(colnames(wine) == "Wine")

#% LABELED
labeled.index <- createDataPartition(wine$Wine, p = .2, list = FALSE)

train[-labeled.index,cls] <- NA

#We need a model with probability predictions from parsnip
#https://tidymodels.github.io/parsnip/articles/articles/Models.html
#It should be with mode = classification

#For example, with Random Forest
rf <- rand_forest(trees = 100, mode = "classification") %>%
  set_engine("randomForest")

m <- triTraining(learner = rf) %>% fit(Wine ~ ., data = train)

#Accuracy
predict(m,test) %>%
  bind_cols(test) %>%
  metrics(truth = "Wine", estimate = .pred_class)
This function combines the predictions obtained by the set of classifiers.
triTrainingCombine(pred)
pred |
A list with the predictions of each classifier |
A vector of classes
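A tiny hedged sketch with hand-made prediction vectors (purely illustrative) showing the majority-vote combination described in triTraining:

library(SSLR)

# Three toy prediction vectors, one per base classifier
pred <- list(
  factor(c("a", "a", "b", "b")),
  factor(c("a", "b", "b", "b")),
  factor(c("a", "a", "b", "a"))
)

# Combine by majority voting: each instance keeps the label proposed
# by at least two of the three classifiers
triTrainingCombine(pred)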
Tri-training is a semi-supervised learning algorithm with a co-training style. This algorithm trains three classifiers with the same learning scheme from a reduced set of labeled examples. For each iteration, an unlabeled example is labeled for a classifier if the other two classifiers agree on the labeling proposed.
triTrainingG(y, gen.learner, gen.pred)
y |
A vector with the labels of training instances. In this vector the
unlabeled instances are specified with the value |
gen.learner |
A function for training three supervised base classifiers. This function needs two parameters, indexes and cls, where indexes indicates the instances to use and cls specifies the classes of those instances. |
gen.pred |
A function for predicting the probabilities per class.
This function must have two parameters, model and indexes, where the model
is a classifier trained with |
TriTrainingG can be helpful in those cases where the method selected as
base classifier needs a learner
and pred
functions with other
specifications. For more information about the general triTraining method,
please see the triTraining
function. Essentially, the triTraining
function is a wrapper of the triTrainingG
function.
A list object of class "triTrainingG" containing:
The final three base classifiers trained using the enlarged labeled set.
List of three vectors of indexes related to the training instances
used per each classifier. These indexes are relative to the y
argument.
The indexes of all training instances used to
train the three models. These indexes include the initial labeled instances
and the newly labeled instances. These indexes are relative to the y
argument.
List of three vectors with the same information in model.index
but the indexes are relative to instances.index
vector.
library(SSLR)
library(caret)

## Load Wine data set
data(wine)

cls <- which(colnames(wine) == "Wine")
x <- wine[, - cls] # instances without classes
y <- wine[, cls]   # the classes
x <- scale(x)      # scale the attributes

## Prepare data
set.seed(20)
# Use 50% of instances for training
tra.idx <- sample(x = length(y), size = ceiling(length(y) * 0.5))
xtrain <- x[tra.idx,] # training instances
ytrain <- y[tra.idx]  # classes of training instances
# Use 70% of train instances as unlabeled set
tra.na.idx <- sample(x = length(tra.idx), size = ceiling(length(tra.idx) * 0.7))
ytrain[tra.na.idx] <- NA # remove class information of unlabeled instances

# Use the other 50% of instances for inductive testing
tst.idx <- setdiff(1:length(y), tra.idx)
xitest <- x[tst.idx,] # testing instances
yitest <- y[tst.idx]  # classes of testing instances

## Example: Training from a set of instances with 1-NN (knn3) as base classifier.
gen.learner <- function(indexes, cls)
  caret::knn3(x = xtrain[indexes,], y = cls, k = 1)
gen.pred <- function(model, indexes)
  predict(model, xtrain[indexes,])

# Train
set.seed(1)
trControl_triTraining1 <- list(gen.learner = gen.learner, gen.pred = gen.pred)
md1 <- train_generic(ytrain, method = "triTrainingG", trControl = trControl_triTraining1)

# Predict testing instances using the three classifiers
pred <- lapply(
  X = md1$model,
  FUN = function(m) predict(m, xitest, type = "class")
)

# Combine the predictions
cls1 <- triTrainingCombine(pred)
table(cls1, yitest)

confusionMatrix(cls1, yitest)$overall[1]

## Example: Training from a distance matrix with 1-NN (oneNN) as base classifier.
dtrain <- as.matrix(proxy::dist(x = xtrain, method = "euclidean", by_rows = TRUE))

gen.learner <- function(indexes, cls) {
  m <- SSLR::oneNN(y = cls)
  attr(m, "tra.idxs") <- indexes
  m
}

gen.pred <- function(model, indexes) {
  tra.idxs <- attr(model, "tra.idxs")
  d <- dtrain[indexes, tra.idxs]
  prob <- predict(model, d, distance.weighting = "none")
  prob
}

# Train
set.seed(1)
trControl_triTraining2 <- list(gen.learner = gen.learner, gen.pred = gen.pred)
md2 <- train_generic(ytrain, method = "triTrainingG", trControl = trControl_triTraining2)

# Predict
ditest <- proxy::dist(x = xitest, y = xtrain[md2$instances.index,],
                      method = "euclidean", by_rows = TRUE)

# Predict testing instances using the three classifiers
pred <- mapply(
  FUN = function(m, indexes) {
    D <- ditest[, indexes]
    predict(m, D, type = "class")
  },
  m = md2$model,
  indexes = md2$model.index.map,
  SIMPLIFY = FALSE
)

# Combine the predictions
cls2 <- triTrainingCombine(pred)
table(cls2, yitest)

confusionMatrix(cls2, yitest)$overall[1]
Model from the RSSL package. Transductive SVM using the CCCP algorithm as proposed by Collobert et al. (2006), implemented in R using the quadprog package. The implementation does not handle large datasets very well, but can be useful for smaller datasets and visualization purposes. C is the cost associated with labeled objects, while Cstar is the cost for the unlabeled objects. s controls the loss function used for the unlabeled objects: it sets the size of the plateau of the symmetric ramp loss function. The balancing constraint makes sure the label assignments of the unlabeled objects are similar to the prior on the classes observed in the labeled data.
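As a rough illustration of the role of s (an assumption based on the loss definitions in Collobert et al. (2006), not the internals of TSVMSSLR; the helpers below are hypothetical), the ramp loss is R_s(t) = min(1 - s, max(0, 1 - t)), and the symmetric loss applied to an unlabeled decision value f(x) is R_s(f(x)) + R_s(-f(x)). For s < 0 this symmetric loss is flat whenever |f(x)| <= -s, which is the plateau that s controls.

# Sketch only (assumed from Collobert et al., 2006; not the TSVMSSLR code):
ramp_loss <- function(t, s = -0.3) pmin(1 - s, pmax(0, 1 - t))

# Symmetric ramp loss for an unlabeled decision value f(x).
symmetric_ramp_loss <- function(fx, s = -0.3) ramp_loss(fx, s) + ramp_loss(-fx, s)

# The plateau around zero has width 2 * |s|.
curve(symmetric_ramp_loss(x, s = -0.3), from = -2, to = 2,
      xlab = "f(x)", ylab = "loss")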
TSVMSSLR( C = 1, Cstar = 0.1, kernel = kernlab::vanilladot(), balancing_constraint = TRUE, s = 0, x_center = TRUE, scale = FALSE, eps = 1e-09, max_iter = 20, verbose = FALSE )
C |
numeric; Cost parameter of the SVM |
Cstar |
numeric; Cost parameter of the unlabeled objects |
kernel |
kernlab::kernel to use |
balancing_constraint |
logical; Whether a balancing constraint should be enforced that causes the fraction of objects assigned to each label in the unlabeled data to be similar to the label fraction in the labeled data. |
s |
numeric; parameter controlling the loss function of the unlabeled objects (generally values between -1 and 0) |
x_center |
logical; Should the features be centered? |
scale |
If TRUE, apply a z-transform to all observations in X and X_u before running the regression |
eps |
numeric; Stopping criterion for the maximinimization |
max_iter |
integer; Maximum number of iterations |
verbose |
logical; print debugging messages, only works for vanilladot() kernel (default: FALSE) |
Collobert, R. et al., 2006. Large scale transductive SVMs. Journal of Machine Learning Research, 7, pp.1687-1712.
library(tidyverse)
library(caret)
library(tidymodels)
library(SSLR)

data(breast)

set.seed(1)
train.index <- createDataPartition(breast$Class, p = .7, list = FALSE)
train <- breast[ train.index,]
test  <- breast[-train.index,]

cls <- which(colnames(breast) == "Class")

#% LABELED
labeled.index <- createDataPartition(breast$Class, p = .2, list = FALSE)
train[-labeled.index,cls] <- NA

library(kernlab)

m <- TSVMSSLR(kernel = kernlab::vanilladot()) %>% fit(Class ~ ., data = train)

#Accessing model from RSSL
model <- m$model
Model from the RSSL package. This method uses the closed-form solution of the supervised least squares problem, except that the second moment matrix (X'X) is exchanged with a second moment matrix that is estimated based on all data. See for instance Shaffer (1991), where in this implementation we use all data to estimate E(X'X), instead of just the labeled data. This method seems to work best when the data is first centered (x_center = TRUE) and the outputs are scaled using y_scale = TRUE.
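As a concrete sketch of this idea (an assumption about the general approach, not the RSSL implementation; the helper usm_least_squares is hypothetical and presumes a numeric design matrix with any intercept column already added and a numeric target coding), the normal equations are solved with X'X replaced by an estimate of E(X'X) computed from all rows:

# Hypothetical sketch: estimate the second moment matrix from ALL rows
# (labeled and unlabeled), rescale it to the number of labeled rows, and
# solve the regularized normal equations with the labeled targets only.
usm_least_squares <- function(X_lab, y_lab, X_unlab, lambda = 0) {
  X_all <- rbind(X_lab, X_unlab)
  n_lab <- nrow(X_lab)
  n_all <- nrow(X_all)
  M <- crossprod(X_all) * (n_lab / n_all)   # ~ n_lab * E(X'X)
  solve(M + lambda * diag(ncol(X_all)), crossprod(X_lab, y_lab))
}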
USMLeastSquaresClassifierSSLR( lambda = 0, intercept = TRUE, x_center = FALSE, scale = FALSE, y_scale = FALSE, ..., use_Xu_for_scaling = TRUE )
lambda |
numeric; L2 regularization parameter |
intercept |
logical; Whether an intercept should be included |
x_center |
logical; Should the features be centered? |
scale |
logical; Should the features be normalized? (default: FALSE) |
y_scale |
logical; whether the target vector should be centered |
... |
Not used |
use_Xu_for_scaling |
logical; whether the unlabeled objects should be used to determine the mean and scaling for the normalization |
Shaffer, J.P., 1991. The Gauss-Markov Theorem and Random Regressors. The American Statistician, 45(4), pp.269-273.
library(tidyverse)
library(tidymodels)
library(caret)
library(SSLR)

data(breast)

set.seed(1)
train.index <- createDataPartition(breast$Class, p = .7, list = FALSE)
train <- breast[ train.index,]
test  <- breast[-train.index,]

cls <- which(colnames(breast) == "Class")

#% LABELED
labeled.index <- createDataPartition(breast$Class, p = .2, list = FALSE)
train[-labeled.index,cls] <- NA

m <- USMLeastSquaresClassifierSSLR() %>% fit(Class ~ ., data = train)

#Accessing model from RSSL
model <- m$model

#Accuracy
predict(m, test) %>%
  bind_cols(test) %>%
  metrics(truth = "Class", estimate = .pred_class)
Model from the RSSL package. WellSVM is a minimax relaxation of the mixed integer programming problem of finding the optimal labels for the unlabeled data in the SVM objective function. This implementation is a translation of the MATLAB implementation of Li (2013) into R.
WellSVMSSLR( C1 = 1, C2 = 0.1, gamma = 1, x_center = TRUE, scale = FALSE, use_Xu_for_scaling = FALSE, max_iter = 20 )
C1 |
double; A regularization parameter for labeled data, default 1; |
C2 |
double; A regularization parameter for unlabeled data, default 0.1; |
gamma |
double; Gaussian kernel parameter, i.e., k(x,y) = exp(-gamma^2 ||x-y||^2 / avg), where avg is the average distance among instances; when gamma = 0, the linear kernel is used. Default gamma = 1 (see the sketch after this table). |
x_center |
logical; Should the features be centered? |
scale |
logical; Should the features be normalized? (default: FALSE) |
use_Xu_for_scaling |
logical; whether the unlabeled objects should be used to determine the mean and scaling for the normalization |
max_iter |
integer; Maximum number of iterations |
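The Gaussian kernel described for gamma can be sketched as follows; this is an illustrative assumption based only on the parameter description above, not the package internals, and the helper wellsvm_kernel is hypothetical.

# Sketch of k(x, y) = exp(-gamma^2 * ||x - y||^2 / avg), with avg the
# average pairwise distance among instances; linear kernel when gamma = 0.
wellsvm_kernel <- function(X, gamma = 1) {
  X <- as.matrix(X)
  if (gamma == 0) return(tcrossprod(X))   # linear kernel
  D <- as.matrix(stats::dist(X))          # pairwise Euclidean distances
  exp(-gamma^2 * D^2 / mean(D))
}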
Y.-F. Li, I. W. Tsang, J. T. Kwok, and Z.-H. Zhou. Scalable and Convex Weakly Labeled SVMs. Journal of Machine Learning Research, 2013.
R.-E. Fan, P.-H. Chen, and C.-J. Lin. Working set selection using second order information for training SVM. Journal of Machine Learning Research 6, 1889-1918, 2005.
library(tidyverse)
library(tidymodels)
library(caret)
library(SSLR)

data(breast)

set.seed(1)
train.index <- createDataPartition(breast$Class, p = .7, list = FALSE)
train <- breast[ train.index,]
test  <- breast[-train.index,]

cls <- which(colnames(breast) == "Class")

#% LABELED
labeled.index <- createDataPartition(breast$Class, p = .2, list = FALSE)
train[-labeled.index,cls] <- NA

m <- WellSVMSSLR() %>% fit(Class ~ ., data = train)

#Accessing model from RSSL
model <- m$model

#Accuracy
predict(m, test) %>%
  bind_cols(test) %>%
  metrics(truth = "Class", estimate = .pred_class)
This dataset is the result of a chemical analysis of wine grown in the same region in Italy but derived from three different cultivars. The analysis determined the quantities of 13 constituents found in each of the three types of wines.
data(wine)
A data frame with 178 rows and 14 variables including the class.
The dataset is taken from the UCI data repository, to which it was donated by Riccardo Leardi, University of Genova. The attributes are as follows:
Alcohol
Malic acid
Ash
Alcalinity of ash
Magnesium
Total phenols
Flavanoids
Nonflavanoid phenols
Proanthocyanins
Color intensity
Hue
OD280/OD315 of diluted wines
Proline
Wine (class)
https://archive.ics.uci.edu/ml/datasets/Wine