After showing how to use nn2poly
in a regression setting
both in vignette("nn2poly-01-introduction")
and
vignette("nn2poly-02-supported-DL-frameworks")
, we will see
here a multiclass classification example using the iris dataset to
showcase how nn2poly obtains a polynomial for each class (or output
neuron).
In order to train our model we will use in this case the
keras
/tensorflow
framework, and we will
benefit from the provided support for it in nn2poly
as we
saw in vignette("nn2poly-02-supported-DL-frameworks")
.
First we load the iris
dataset and scale the data to the
[−1, 1] interval:
# Load the data
data(iris)
# Change response to numeric. In this case, Species was already numeric,
# but this step is needed if it is a factor variable.
iris$Species <- as.numeric(iris$Species)
# Define dimension p (number of predictor variables)
p <- dim(iris)[2] - 1
# Define objective classes
n_class <- max(iris[,(p+1)])
# Move objective classes from (1:3) to (0:2), needed for tensorflow
iris[,(p+1)] <- iris[,(p+1)] - 1
# Scale the data in the [-1,1] interval and separate train and test
# Only the predictor variables are scaled, not the response as those will be
# the different classes.
iris_x <- iris[,-(p+1)]
maxs <- apply(iris_x, 2, max)
mins <- apply(iris_x, 2, min)
data_x_scaled <- as.data.frame(scale(iris_x, center = mins + (maxs - mins) / 2, scale = (maxs - mins) / 2))
data <- cbind(data_x_scaled, iris[,(p+1)])
# Divide in train (0.75) and test (0.25)
index <- sample(1:nrow(data), round(0.75 * nrow(data)))
train <- data[index, ]
test <- data[-index, ]
train_x <- as.matrix(train[,-(p+1)])
train_y <- as.matrix(train[,(p+1)])
test_x <- as.matrix(test[,-(p+1)])
test_y <- as.matrix(test[,(p+1)])
In this section we will create and train (imposing weight
constraints) our original neural network that will be later explained by
nn2poly()
.
First, we build the model.
keras_model <- function() {
tensorflow::set_random_seed(42)
nn <- keras::keras_model_sequential()
nn <- keras::layer_dense(nn, units = 100, activation = "tanh", input_shape = p)
nn <- keras::layer_dense(nn, units = 100, activation = "tanh")
nn <- keras::layer_dense(nn, units = n_class, activation = "linear")
nn
}
nn <- keras_model()
# Impose weight constraints provided by nn2poly package
nn <- add_constraints(nn, constraint_type = "l1_norm")
nn
#> Model: "sequential"
#> ________________________________________________________________________________
#> Layer (type) Output Shape Param #
#> ================================================================================
#> dense (Dense) (None, 100) 500
#> dense_1 (Dense) (None, 100) 10100
#> dense_2 (Dense) (None, 3) 303
#> ================================================================================
#> Total params: 10903 (42.59 KB)
#> Trainable params: 10903 (42.59 KB)
#> Non-trainable params: 0 (0.00 Byte)
#> ________________________________________________________________________________
Note: In this case the NN has a linear output with the same number of neurons as the number of classes to predict (3 species,
n_class
). Then, the linear output will be transformed into a probability to find the most probable class but this step is done after training. Therefore, nn2poly will be used to obtain a polynomial that approximates this nn with linear outputs and then its results will also be transformed in probabilities to predict the highest probability class.
Then we compile and train the model, using a categorical crossentropy loss and accuracy as the chosen metric.
In this case, to asses the NNs accuracy we have to transform their
output into a probability, which we will do by using
layer_activation_softmax()
and then choosing the class with
highest probability with k_argmax
.
# Define the model probability model using our previously trained nn
probability_model <- keras_model_sequential() %>%
nn() %>%
layer_activation_softmax() %>%
layer_lambda(k_argmax)
# Obtain the predicted classes with the NN to compare them
prediction_NN_class <- predict(probability_model, test_x)
#> 2/2 - 0s - 70ms/epoch - 35ms/step
Note: As said before,
nn2poly()
will approximate the linear output of the neural network, so we will need to perform these transformation also in the polynomial predictions to obtain a classification response.
We can also compute the linear output of the model (before converting it to a probability) and store it to compare with the polynomial output later.
# Also, the linear output can be predicted before the probability model
prediction_NN <- predict(nn, test_x)
#> 2/2 - 0s - 40ms/epoch - 20ms/step
Finally We can use here a confusion matrix to visualize the results, where we can see that that the NN correctly predicts almost all of the classes in the test data:
# Create a confusion matrix
cm <- caret::confusionMatrix(as.factor(prediction_NN_class), as.factor(test_y))
cm
#> Confusion Matrix and Statistics
#>
#> Reference
#> Prediction 0 1 2
#> 0 11 0 0
#> 1 0 11 1
#> 2 0 1 14
#>
#> Overall Statistics
#>
#> Accuracy : 0.9474
#> 95% CI : (0.8225, 0.9936)
#> No Information Rate : 0.3947
#> P-Value [Acc > NIR] : 7.82e-13
#>
#> Kappa : 0.9203
#>
#> Mcnemar's Test P-Value : NA
#>
#> Statistics by Class:
#>
#> Class: 0 Class: 1 Class: 2
#> Sensitivity 1.0000 0.9167 0.9333
#> Specificity 1.0000 0.9615 0.9565
#> Pos Pred Value 1.0000 0.9167 0.9333
#> Neg Pred Value 1.0000 0.9615 0.9565
#> Prevalence 0.2895 0.3158 0.3947
#> Detection Rate 0.2895 0.2895 0.3684
#> Detection Prevalence 0.2895 0.3158 0.3947
#> Balanced Accuracy 1.0000 0.9391 0.9449
nn2poly
After the NN has been trained, we can directly call
nn2poly()
on the model, where we only need to choose the
value max_order
which will limit the order of the computed
coefficients in the final polynomial. In this case we will set it as
3
.
Note that in this case, as we have 3 output neurons, there will be 3 output polynomials. The polynomials will be stored in the same way as in the regression case, in a list with labels and values, but in this case the values will be a matrix with 3 rows, where each row will be the polynomial obtained for each output neuron.
# Polynomial for nn
final_poly <- nn2poly(object = nn,
max_order = 3)
final_poly
#> $labels
#> $labels[[1]]
#> [1] 0
#>
#> $labels[[2]]
#> [1] 1
#>
#> $labels[[3]]
#> [1] 2
#>
#> $labels[[4]]
#> [1] 3
#>
#> $labels[[5]]
#> [1] 4
#>
#> $labels[[6]]
#> [1] 1 1
#>
#> $labels[[7]]
#> [1] 1 2
#>
#> $labels[[8]]
#> [1] 1 3
#>
#> $labels[[9]]
#> [1] 1 4
#>
#> $labels[[10]]
#> [1] 2 2
#>
#> $labels[[11]]
#> [1] 2 3
#>
#> $labels[[12]]
#> [1] 2 4
#>
#> $labels[[13]]
#> [1] 3 3
#>
#> $labels[[14]]
#> [1] 3 4
#>
#> $labels[[15]]
#> [1] 4 4
#>
#> $labels[[16]]
#> [1] 1 1 1
#>
#> $labels[[17]]
#> [1] 1 1 2
#>
#> $labels[[18]]
#> [1] 1 1 3
#>
#> $labels[[19]]
#> [1] 1 1 4
#>
#> $labels[[20]]
#> [1] 1 2 2
#>
#> $labels[[21]]
#> [1] 1 2 3
#>
#> $labels[[22]]
#> [1] 1 2 4
#>
#> $labels[[23]]
#> [1] 1 3 3
#>
#> $labels[[24]]
#> [1] 1 3 4
#>
#> $labels[[25]]
#> [1] 1 4 4
#>
#> $labels[[26]]
#> [1] 2 2 2
#>
#> $labels[[27]]
#> [1] 2 2 3
#>
#> $labels[[28]]
#> [1] 2 2 4
#>
#> $labels[[29]]
#> [1] 2 3 3
#>
#> $labels[[30]]
#> [1] 2 3 4
#>
#> $labels[[31]]
#> [1] 2 4 4
#>
#> $labels[[32]]
#> [1] 3 3 3
#>
#> $labels[[33]]
#> [1] 3 3 4
#>
#> $labels[[34]]
#> [1] 3 4 4
#>
#> $labels[[35]]
#> [1] 4 4 4
#>
#>
#> $values
#> [,1] [,2] [,3]
#> [1,] 0.04067109 2.838697004 -2.366766912
#> [2,] -1.01692952 0.922284435 0.306762048
#> [3,] 3.82618130 -1.532152815 -2.683000514
#> [4,] -4.86343269 -0.367029700 5.201883882
#> [5,] -6.05908248 -2.026009803 7.701675045
#> [6,] 0.02216647 -0.068650500 0.031318449
#> [7,] -0.09821698 0.115922745 0.007638440
#> [8,] 0.08146315 -0.072443601 -0.025525980
#> [9,] 0.04337399 -0.019371395 -0.029198248
#> [10,] 0.08763212 -0.122314949 0.007196786
#> [11,] -0.13090934 0.160297202 0.008898167
#> [12,] -0.03982065 0.118363240 -0.047742059
#> [13,] -0.03990296 -0.186114002 0.182448046
#> [14,] -0.22518863 -0.383706458 0.517483976
#> [15,] -0.21539519 -0.340739130 0.476053215
#> [16,] 0.01640856 -0.015861053 -0.004179781
#> [17,] -0.08649215 0.054035395 0.045407899
#> [18,] 0.08657797 -0.020808458 -0.071458194
#> [19,] 0.10626455 0.016833308 -0.120917295
#> [20,] -0.09226002 0.058721711 0.047211597
#> [21,] -0.12782353 0.099224893 0.053681832
#> [22,] -0.14300636 0.061378209 0.099556359
#> [23,] 0.08681554 -0.019495659 -0.072698532
#> [24,] 0.07906020 -0.071096510 -0.027609755
#> [25,] 0.10751970 0.019949727 -0.124596164
#> [26,] -0.12832404 0.057077780 0.086244551
#> [27,] 0.31410761 -0.112571487 -0.233324113
#> [28,] 0.39822386 -0.072171099 -0.351003640
#> [29,] 0.32366110 -0.115551628 -0.240170974
#> [30,] -0.64179985 0.150429244 0.539640723
#> [31,] 0.40593582 -0.069258417 -0.360602426
#> [32,] 0.17561230 0.009076263 -0.185280705
#> [33,] 0.51767960 0.109300583 -0.609898697
#> [34,] 0.51398607 0.101899652 -0.600694175
#> [35,] 0.34526975 0.146409617 -0.461815984
#>
#> attr(,"class")
#> [1] "nn2poly"
As said before, the obtained polynomial represents the neural network
before including the softmax function and computing the class assigned
to each observation. Then, we need to define again a keras sequential
model that includes the class computation from the polynomial output.
This polynomial output is obtained with predict()
used on
the nn2poly()
output, in this case in matrix form, as the 3
polynomials are evaluated at the same time:
# Obtain the predicted values for the test data with our Polynomial Regression
prediction_poly_matrix <- predict(object = final_poly, newdata = test_x)
# Define probability model with keras for the polynomial outputs
probability_poly <- keras_model_sequential() %>%
layer_activation_softmax() %>%
layer_lambda(k_argmax)
# Class prediction with the polynomial outputs
prediction_poly_class <- predict(probability_poly,
prediction_poly_matrix)
#> 2/2 - 0s - 121ms/epoch - 60ms/step
With the polynomial predictions, there are two options. We can
represent in a diagonal line the linear outputs obtained directly from
the polynomial and NN predictions, or compare the assigned classes after
employing the probability models. Please note here that we compare the
predictions (linear and classes) of the polynomials with the NN
predictions and not the original data, as nn2poly
’s goal is
to faithfully represent the NN behavior independently of how well the NN
predicts.
First, let’s observe the confusion matrix for the assigned classes:
# Confussion matrix between NN class prediction and polynomial class prediction
cm_poly <- caret::confusionMatrix(as.factor(prediction_NN_class), as.factor(prediction_poly_class))
cm_poly
#> Confusion Matrix and Statistics
#>
#> Reference
#> Prediction 0 1 2
#> 0 11 0 0
#> 1 0 12 0
#> 2 0 0 15
#>
#> Overall Statistics
#>
#> Accuracy : 1
#> 95% CI : (0.9075, 1)
#> No Information Rate : 0.3947
#> P-Value [Acc > NIR] : 4.568e-16
#>
#> Kappa : 1
#>
#> Mcnemar's Test P-Value : NA
#>
#> Statistics by Class:
#>
#> Class: 0 Class: 1 Class: 2
#> Sensitivity 1.0000 1.0000 1.0000
#> Specificity 1.0000 1.0000 1.0000
#> Pos Pred Value 1.0000 1.0000 1.0000
#> Neg Pred Value 1.0000 1.0000 1.0000
#> Prevalence 0.2895 0.3158 0.3947
#> Detection Rate 0.2895 0.3158 0.3947
#> Detection Prevalence 0.2895 0.3158 0.3947
#> Balanced Accuracy 1.0000 1.0000 1.0000
The polynomials obtain the same results as the original NN.
Then, we can extract a diagonal plot for each of the polynomials obtained, in total 3 diagonal plots.
for (i in 1:3){
print(
nn2poly:::plot_diagonal(x_axis = prediction_NN[,i],
y_axis = prediction_poly_matrix[,i],
xlab = "NN prediction",
ylab = "Polynomial prediction")
)
}
We can observe how all the polynomials obtain quite similar predictions to their equivalent NN predictions.
We can also plot the n most important coefficients in absolute value to compare which variables or interactions are more relevant in each polynomial. In this case, we will have 3 plots for each NN again, one per polynomial at each output neuron. In this case, the obtained coefficients will represent the most important variables when assigning the probability to be in each class.
Note: The coefficients values are not in the same scale as the original polynomial due to the fact that we have scaled all the data before training, even the response variable Y. Furthermore, as data has been scaled to the [−1, 1] interval, interactions of order 2 or higher would usually need a higher absolute value than the lower order coefficients to be more relevant
Finally, the problem with Taylor expansion can be checked with the following plots, where each layer is represented with their activation function, its Taylor expansion, the error and also the density of the activation potentials that the activation functions receives at that layer.
It can be clearly seen with the activation potentials density, in green, that it is kept close to zero, thus having an accurate Taylor expansion around zero at each hidden layer.
nn2poly:::plot_taylor_and_activation_potentials(object = nn,
data = train,
max_order = 3,
constraints = TRUE)
#> [[1]]
#>
#> [[2]]
#>
#> [[3]]