---
title: Introduction to Keras for Researchers
date-created: 2020/04/01
last-modified: 2020/10/02
description: Everything you need to know to use Keras & TensorFlow for deep learning
  research.
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{Introduction to Keras for Researchers}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

## Setup


```r
library(keras3)
library(tensorflow)
```

## Introduction

Are you a machine learning researcher? Do you publish at NeurIPS and push the
state-of-the-art in CV and NLP? This guide will serve as your first introduction to core
Keras & TensorFlow API concepts.

In this guide, you will learn about:

- Tensors, variables, and gradients in TensorFlow
- Creating layers by subclassing the [`Layer`] class
- Writing low-level training loops
- Tracking losses created by layers via the `add_loss()` method
- Tracking metrics in a low-level training loop
- Speeding up execution with a compiled [`tensorflow::tf_function()`]
- Executing layers in training or inference mode
- The Keras Functional API

You will also see the Keras API in action in two end-to-end research examples:
a Variational Autoencoder, and a Hypernetwork.

## Tensors

TensorFlow is an infrastructure layer for differentiable programming.
At its heart, it's a framework for manipulating N-dimensional arrays (tensors),
much like NumPy.

However, there are three key differences between NumPy and TensorFlow:

- TensorFlow can leverage hardware accelerators such as GPUs and TPUs.
- TensorFlow can automatically compute the gradient of arbitrary differentiable tensor expressions.
- TensorFlow computation can be distributed to large numbers of devices on a single machine, and large number of
machines (potentially with multiple devices each).

Let's take a look at the object that is at the core of TensorFlow: the Tensor.

Here's a constant tensor:


```r
x <- tf$constant(rbind(c(5, 2), c(1, 3)))
print(x)
```

```
## tf.Tensor(
## [[5. 2.]
##  [1. 3.]], shape=(2, 2), dtype=float64)
```

You can get its value as a R array by calling `as.array()`:


```r
as.array(x)
```

```
##      [,1] [,2]
## [1,]    5    2
## [2,]    1    3
```

It features the attributes `dtype` and `shape`:


```r
x$dtype
```

```
## tf.float64
```

```r
x$shape
```

```
## TensorShape([2, 2])
```

A common way to create constant tensors is via `tf$ones` and `tf$zeros`:


```r
tf$ones(shape = shape(2, 1))
```

```
## tf.Tensor(
## [[1.]
##  [1.]], shape=(2, 1), dtype=float32)
```

```r
tf$zeros(shape = shape(2, 1))
```

```
## tf.Tensor(
## [[0.]
##  [0.]], shape=(2, 1), dtype=float32)
```

You can also create random constant tensors:


```r
x <- random_normal(shape = c(2, 2), mean = 0.0, stddev = 1.0)
x <- random_uniform(shape = c(2, 2), minval = 0, maxval = 10)
```

## Variables

Variables are special tensors used to store mutable state (such as the weights of a neural network).
You create a `Variable` using some initial value:


```r
initial_value <- random_normal(shape=c(2, 2))
a <- tf$Variable(initial_value)
print(a)
```

```
## <tf.Variable 'Variable:0' shape=(2, 2) dtype=float32, numpy=
## array([[ 0.9057419 ,  0.7916686 ],
##        [ 0.28754202, -0.5408822 ]], dtype=float32)>
```

You update the value of a `Variable` by using the methods `$assign(value)`, `$assign_add(increment)`, or `$assign_sub(decrement)`:


```r
new_value <- random_normal(shape=c(2, 2))
a$assign(new_value)
```

```
## <tf.Variable 'UnreadVariable' shape=(2, 2) dtype=float32, numpy=
## array([[-0.3405368 , -2.1463926 ],
##        [ 1.2602988 ,  0.12241419]], dtype=float32)>
```

```r
added_value <- random_normal(shape=c(2, 2))
a$assign_add(added_value)
```

```
## <tf.Variable 'UnreadVariable' shape=(2, 2) dtype=float32, numpy=
## array([[ 0.04820395, -2.6854615 ],
##        [ 0.23246336,  1.4535258 ]], dtype=float32)>
```

## Doing math in TensorFlow

If you've used NumPy, doing math in TensorFlow will look very familiar.
The main difference is that your TensorFlow code can run on GPU and TPU.


```r
a <- random_normal(shape=c(2, 2))
b <- random_normal(shape=c(2, 2))

c <- a + b
d <- tf$square(c)
e <- tf$exp(d)
```

## Gradients

Here's another big difference with R: you can automatically retrieve the gradient of any differentiable expression.

Just open a `GradientTape`, start "watching" a tensor via `tape$watch()`,
and compose a differentiable expression using this tensor as input:


```r
a <- random_normal(shape=c(2, 2))
b <- random_normal(shape=c(2, 2))

with(tf$GradientTape() %as% tape, {
  tape$watch(a)  # Start recording the history of operations applied to `a`
  c <- tf$sqrt(tf$square(a) + tf$square(b))  # Do some math using `a`
  # What's the gradient of `c` with respect to `a`?
  dc_da <- tape$gradient(c, a)
  print(dc_da)
})
```

```
## tf.Tensor(
## [[ 0.9969011  -0.7707146 ]
##  [ 0.23378514  0.96255165]], shape=(2, 2), dtype=float32)
```

By default, variables are watched automatically, so you don't need to manually `watch` them:


```r
a <- tf$Variable(a)

with(tf$GradientTape() %as% tape, {
  c <- tf$sqrt(tf$square(a) + tf$square(b))
  dc_da <- tape$gradient(c, a)
  print(dc_da)
})
```

```
## tf.Tensor(
## [[ 0.9969011  -0.7707146 ]
##  [ 0.23378514  0.96255165]], shape=(2, 2), dtype=float32)
```

Note that you can compute higher-order derivatives by nesting tapes:


```r
with(tf$GradientTape() %as% outer_tape, {
  with(tf$GradientTape() %as% tape, {
    c <- tf$sqrt(tf$square(a) + tf$square(b))
    dc_da <- tape$gradient(c, a)
  })
  d2c_da2 <- outer_tape$gradient(dc_da, a)
  print(d2c_da2)
})
```

```
## tf.Tensor(
## [[3.3447742e-03 7.1282005e-01]
##  [5.7464113e+00 5.5013180e-02]], shape=(2, 2), dtype=float32)
```

## Keras layers

While TensorFlow is an **infrastructure layer for differentiable programming**,
dealing with tensors, variables, and gradients,
Keras is a **user interface for deep learning**, dealing with
layers, models, optimizers, loss functions, metrics, and more.

Keras serves as the high-level API for TensorFlow:
Keras is what makes TensorFlow simple and productive.

The `Layer` class is the fundamental abstraction in Keras.
A `Layer` encapsulates a state (weights) and some computation
(defined in the call method).

A simple layer looks like this.
The `self$add_weight()` method gives you a shortcut for creating weights:


```r
Linear <- new_layer_class(
  "Linear",
  initialize = function(units = 32, input_dim = 32) {
    super$initialize()
    self$w <- self$add_weight(
      shape = shape(input_dim, units),
      initializer = "random_normal",
      trainable = TRUE
    )
    self$b <- self$add_weight(
      shape = shape(units),
      initializer = "zeros",
      trainable = TRUE
    )
  },
  call = function(inputs) {
    tf$matmul(inputs, self$w) + self$b
  }
)
```

You would use a `Layer` instance much like a R function:


```r
# Instantiate our layer.
linear_layer <- Linear(units=4, input_dim=2)

# The layer can be treated as a function.
# Here we call it on some data.
y <- linear_layer(tf$ones(shape(2, 2)))
```

The weight variables (created in `initialize`) are automatically
tracked under the `weights` property:


```r
linear_layer$weights
```

```
## [[1]]
## <KerasVariable shape=(2, 4), dtype=float32, path=linear/variable>
##
## [[2]]
## <KerasVariable shape=(4), dtype=float32, path=linear/variable_1>
```

You have many built-in layers available, from `Dense` to `Conv2D` to `LSTM` to
fancier ones like `Conv3DTranspose` or `ConvLSTM2D`. Be smart about reusing
built-in functionality.

## Layer weight creation in `build(input_shape)`

It's often a good idea to defer weight creation to the `build()` method, so
that you don't need to specify the input dim/shape at layer construction time:


```r
Linear <- new_layer_class(
  "Linear",
  initialize = function(units = 32) {
    super$initialize()
    self$units <- units
  },
  build = function(input_shape) {
    self$w <- self$add_weight(
      shape = shape(input_shape[-1], self$units),
      initializer = "random_normal",
      trainable = TRUE
    )
    self$b <- self$add_weight(
      shape = shape(self$units),
      initializer = "zeros",
      trainable = TRUE
    )
  },
  call = function(inputs) {
    tf$matmul(inputs, self$w) + self$b
  }
)

# Instantiate our layer.
linear_layer <- Linear(units = 4)

# This will also call `build(input_shape)` and create the weights.
y <- linear_layer(tf$ones(shape(2, 2)))
```

## Layer gradients

You can automatically retrieve the gradients of the weights of a layer by
calling it inside a `GradientTape`. Using these gradients, you can update the
weights of the layer, either manually, or using an optimizer object. Of course,
you can modify the gradients before using them, if you need to.


```r
# Prepare a dataset.
c(c(x_train, y_train), .) %<-% dataset_mnist()

x_train <- array_reshape(x_train, c(60000, 784)) / 255

dataset <- tfdatasets::tensor_slices_dataset(list(x_train, y_train)) %>%
  tfdatasets::dataset_shuffle(buffer_size=1024) %>%
  tfdatasets::dataset_batch(64)

# Instantiate our linear layer (defined above) with 10 units.
linear_layer <- Linear(units = 10)

# Instantiate a logistic loss function that expects integer targets.
loss_fn <- loss_sparse_categorical_crossentropy(from_logits=TRUE)

# Instantiate an optimizer.
optimizer <- optimizer_sgd(learning_rate=1e-3)

# Iterate over the batches of the dataset.
coro::loop(for(data in dataset) {
  # Open a GradientTape.
  with(tf$GradientTape() %as% tape, {
    # Forward pass.
    logits <- linear_layer(data[[1]])

    # Loss value for this batch.
    loss_value <- loss_fn(data[[2]], logits)
  })

  # Get gradients of the loss wrt the weights.
  gradients <- tape$gradient(loss_value, linear_layer$trainable_weights)

  # Update the weights of our linear layer.
  optimizer$apply_gradients(zip_lists(gradients, linear_layer$trainable_weights))
})
loss_value
```

```
## tf.Tensor(1.2819729, shape=(), dtype=float32)
```

## Trainable and non-trainable weights

Weights created by layers can be either trainable or non-trainable. They're
exposed in `trainable_weights` and `non_trainable_weights` respectively.
Here's a layer with a non-trainable weight:


```r
ComputeSum <- new_layer_class(
  "ComputeSum",
  initialize = function(input_dim) {
    super$initialize()
    # Create a non-trainable weight.
    self$total <- self$add_weight(
      initializer = "zeros",
      shape = shape(input_dim),
      trainable = FALSE
    )
  },
  call = function(inputs) {
    self$total$assign_add(tf$reduce_sum(inputs, axis=0L))
    self$total
  }
)

my_sum <- ComputeSum(input_dim = 2)
x <- tf$ones(shape(2, 2))

as.array(my_sum(x))
```

```
## [1] 2 2
```

```r
as.array(my_sum(x))
```

```
## [1] 4 4
```

```r
my_sum$trainable_weights
```

```
## list()
```

## Layers that own layers

Layers can be recursively nested to create bigger computation blocks.
Each layer will track the weights of its sublayers
(both trainable and non-trainable).


```r
# Let's reuse the Linear class
# with a `build` method that we defined above.

MLP <- new_layer_class(
  "MLP",
  initialize = function() {
    super$initialize()
    self$linear_1 <- Linear(units = 32)
    self$linear_2 <- Linear(units = 32)
    self$linear_3 <- Linear(units = 10)
  },
  call = function(inputs) {
    x <- self$linear_1(inputs)
    x <- tf$nn$relu(x)
    x <- self$linear_2(x)
    x <- tf$nn$relu(x)
    return(self$linear_3(x))
  }
)

mlp <- MLP()

# The first call to the `mlp` object will create the weights.
y <- mlp(tf$ones(shape=shape(3, 64)))

# Weights are recursively tracked.
length(mlp$weights)
```

```
## [1] 6
```

Note that our manually-created MLP above is equivalent to the following
built-in option:


```r
mlp <- keras_model_sequential() %>%
  layer_dense(units = 32, activation = "relu") %>%
  layer_dense(units = 32, activation = "relu") %>%
  layer_dense(units = 10)
```

## Tracking losses created by layers

Layers can create losses during the forward pass via the `add_loss()` method.
This is especially useful for regularization losses.
The losses created by sublayers are recursively tracked by the parent layers.

Here's a layer that creates an activity regularization loss:


```r
# A layer that creates an activity sparsity regularization loss
ActivityRegularization <- new_layer_class(
  "ActivityRegularization",
  initialize = function(rate=1e-2) {
    super$initialize()
    self$rate <- rate
  },
  call = function(inputs) {
    self$add_loss(self$rate * tf$reduce_sum(tf$abs(inputs)))
    inputs
  }
)
```

Any model incorporating this layer will track this regularization loss:


```r
# Let's use the loss layer in a MLP block.
SparseMLP <- new_layer_class(
  "SparseMLP",
  initialize = function() {
    super$initialize()
    self$linear_1 <- Linear(units = 32)
    self$reg <- ActivityRegularization(rate = 1e-2)
    self$linear_3 <- Linear(units = 10)
  },
  call = function(inputs) {
    x <- self$linear_1(inputs)
    x <- tf$nn$relu(x)
    x <- self$reg(x)
    return(self$linear_3(x))
  }
)

mlp <- SparseMLP()
y <- mlp(tf$ones(shape(10, 10)))

mlp$losses  # List containing one float32 scalar
```

```
## [[1]]
## tf.Tensor(0.18065463, shape=(), dtype=float32)
```

These losses are cleared by the top-level layer at the start of each forward
pass -- they don't accumulate. `layer.losses` always contains only the losses
created during the last forward pass. You would typically use these losses by
summing them before computing your gradients when writing a training loop.


```r
# Losses correspond to the *last* forward pass.
mlp <- SparseMLP()
mlp(tf$ones(shape(10, 10)))
```

```
## tf.Tensor(
## [[ 0.0388482  -0.03920118  0.01624808 -0.01361975 -0.01354899  0.07107338
##   -0.01077365  0.05688906 -0.02838149 -0.04084621]
##  [ 0.0388482  -0.03920118  0.01624808 -0.01361975 -0.01354899  0.07107338
##   -0.01077365  0.05688906 -0.02838149 -0.04084621]
##  [ 0.0388482  -0.03920118  0.01624808 -0.01361975 -0.01354899  0.07107338
##   -0.01077365  0.05688906 -0.02838149 -0.04084621]
##  [ 0.0388482  -0.03920118  0.01624808 -0.01361975 -0.01354899  0.07107338
##   -0.01077365  0.05688906 -0.02838149 -0.04084621]
##  [ 0.0388482  -0.03920118  0.01624808 -0.01361975 -0.01354899  0.07107338
##   -0.01077365  0.05688906 -0.02838149 -0.04084621]
##  [ 0.0388482  -0.03920118  0.01624808 -0.01361975 -0.01354899  0.07107338
##   -0.01077365  0.05688906 -0.02838149 -0.04084621]
##  [ 0.0388482  -0.03920118  0.01624808 -0.01361975 -0.01354899  0.07107338
##   -0.01077365  0.05688906 -0.02838149 -0.04084621]
##  [ 0.0388482  -0.03920118  0.01624808 -0.01361975 -0.01354899  0.07107338
##   -0.01077365  0.05688906 -0.02838149 -0.04084621]
##  [ 0.0388482  -0.03920118  0.01624808 -0.01361975 -0.01354899  0.07107338
##   -0.01077365  0.05688906 -0.02838149 -0.04084621]
##  [ 0.0388482  -0.03920118  0.01624808 -0.01361975 -0.01354899  0.07107338
##   -0.01077365  0.05688906 -0.02838149 -0.04084621]], shape=(10, 10), dtype=float32)
```

```r
length(mlp$losses)
```

```
## [1] 1
```

```r
mlp(tf$ones(shape(10, 10)))
```

```
## tf.Tensor(
## [[ 0.0388482  -0.03920118  0.01624808 -0.01361975 -0.01354899  0.07107338
##   -0.01077365  0.05688906 -0.02838149 -0.04084621]
##  [ 0.0388482  -0.03920118  0.01624808 -0.01361975 -0.01354899  0.07107338
##   -0.01077365  0.05688906 -0.02838149 -0.04084621]
##  [ 0.0388482  -0.03920118  0.01624808 -0.01361975 -0.01354899  0.07107338
##   -0.01077365  0.05688906 -0.02838149 -0.04084621]
##  [ 0.0388482  -0.03920118  0.01624808 -0.01361975 -0.01354899  0.07107338
##   -0.01077365  0.05688906 -0.02838149 -0.04084621]
##  [ 0.0388482  -0.03920118  0.01624808 -0.01361975 -0.01354899  0.07107338
##   -0.01077365  0.05688906 -0.02838149 -0.04084621]
##  [ 0.0388482  -0.03920118  0.01624808 -0.01361975 -0.01354899  0.07107338
##   -0.01077365  0.05688906 -0.02838149 -0.04084621]
##  [ 0.0388482  -0.03920118  0.01624808 -0.01361975 -0.01354899  0.07107338
##   -0.01077365  0.05688906 -0.02838149 -0.04084621]
##  [ 0.0388482  -0.03920118  0.01624808 -0.01361975 -0.01354899  0.07107338
##   -0.01077365  0.05688906 -0.02838149 -0.04084621]
##  [ 0.0388482  -0.03920118  0.01624808 -0.01361975 -0.01354899  0.07107338
##   -0.01077365  0.05688906 -0.02838149 -0.04084621]
##  [ 0.0388482  -0.03920118  0.01624808 -0.01361975 -0.01354899  0.07107338
##   -0.01077365  0.05688906 -0.02838149 -0.04084621]], shape=(10, 10), dtype=float32)
```

```r
length(mlp$losses)  # No accumulation.
```

```
## [1] 1
```

```r
# Let's demonstrate how to use these losses in a training loop.

# Prepare a dataset.
c(c(x_train, y_train), .) %<-% dataset_mnist()
x_train <- array_reshape(x_train, c(60000, 784)) / 255

dataset <- tfdatasets::tensor_slices_dataset(list(x_train, y_train)) %>%
  tfdatasets::dataset_shuffle(buffer_size=1024) %>%
  tfdatasets::dataset_batch(64)

# A new MLP.
mlp <- SparseMLP()

# Loss and optimizer.
loss_fn <- loss_sparse_categorical_crossentropy(from_logits=TRUE)
optimizer <- optimizer_sgd(learning_rate=1e-3)

coro::loop(for(data in dataset) {
  x <- data[[1]]
  y <- data[[2]]
  with(tf$GradientTape() %as% tape, {
    # Forward pass.
    logits <- mlp(x)

    # External loss value for this batch.
    loss <- loss_fn(y, logits)

    # Add the losses created during the forward pass.
    loss <- loss + Reduce(`+`, mlp$losses)

    # Get gradients of the loss wrt the weights.
    gradients <- tape$gradient(loss, mlp$trainable_weights)

    # Update the weights of our linear layer.
    optimizer$apply_gradients(zip_lists(gradients, mlp$trainable_weights))
  })
})
```

## Keeping track of training metrics

Keras offers a broad range of built-in metrics, like `metric_auc`
or `metric_precision_at_recall`. It's also easy to create your
own metrics in a few lines of code.

To use a metric in a custom training loop, you would:

- Instantiate the metric object, e.g. `metric = metric_auc()`
- Call its `metric$udpate_state(targets, predictions)` method for each batch of data
- Query its result via `metric$result()`
- Reset the metric's state at the end of an epoch or at the start of an evaluation via
`metric$reset_state()`

Here's a simple example:


```r
# Instantiate a metric object
accuracy <- metric_sparse_categorical_accuracy()

# Prepare our layer, loss, and optimizer.
model <- keras_model_sequential() %>%
  layer_dense(units = 32, activation = "relu") %>%
  layer_dense(units = 32, activation = "relu") %>%
  layer_dense(units = 10)
loss_fn <- loss_sparse_categorical_crossentropy(from_logits = TRUE)
optimizer <- optimizer_adam(learning_rate=1e-3)

for (epoch in seq_len(2)) {
  coro::loop(for (data in dataset) {
    x <- data[[1]]
    y <- data[[2]]
    with(tf$GradientTape() %as% tape, {
      # Forward pass.
      logits <- model(x)

      # External loss value for this batch.
      loss_value <- loss_fn(y, logits)
    })

    # Update the state of the `accuracy` metric.
    accuracy$update_state(y, logits)

    # Update the weights of the model to minimize the loss value.
    gradients <- tape$gradient(loss_value, model$trainable_weights)
    optimizer$apply_gradients(zip_lists(gradients, model$trainable_weights))

  })
  cat("Epoch:", epoch, "Accuracy:", as.numeric(accuracy$result()), "\n")
  accuracy$reset_state()
}
```

```
## Epoch: 1 Accuracy: 0.8757833
## Epoch: 2 Accuracy: 0.93915
```

You can also define your own metrics by subclassing `keras.metrics.Metric`.
You need to override the three functions called above:

- Override `update_state()` to update the statistic values.
- Override `result()` to return the metric value.
- Override `reset_state()` to reset the metric to its initial state.

Here is an example where we implement the F1-score metric
(with support for sample weighting).


```r
F1Score <- new_metric_class(
  "F1Score",
  initialize = function(self, name="f1_score", dtype="float32", threshold=0.5, ...) {
    super$initialize(name=name, dtype=dtype, ...)
    self$threshold <- threshold
    self$true_positives <- self$add_weight(
      name="tp", dtype=dtype, initializer="zeros"
    )
    self$false_positives <- self$add_weight(
      name="fp", dtype=dtype, initializer="zeros"
    )
    self$false_negatives <- self$add_weight(
      name="fn", dtype=dtype, initializer="zeros"
    )
  },
  update_state = function(y_true, y_pred, sample_weight=NULL) {
    y_pred <- tf$math$greater_equal(y_pred, self$threshold)
    y_true <- tf$cast(y_true, tf$bool)
    y_pred <- tf$cast(y_pred, tf$bool)

    true_positives <- tf$cast(y_true & y_pred, self$dtype)
    false_positives <- tf$cast((!y_true) & y_pred, self$dtype)
    false_negatives <- tf$cast(y_true & (!y_pred), self$dtype)

    if (!is.null(sample_weight)) {
      sample_weight <- tf$cast(sample_weight, self$dtype)
      true_positives <- true_positives * sample_weight
      false_positives <- false_positives * sample_weight
      false_negatives <- false_negatives * sample_weight
    }

    self$true_positives$assign_add(tf$reduce_sum(true_positives))
    self$false_positives$assign_add(tf$reduce_sum(false_positives))
    self$false_negatives$assign_add(tf$reduce_sum(false_negatives))
  },

  result = function() {
    precision <- self$true_positives / (self$true_positives + self$false_positives)
    recall <- self$true_positives / (self$true_positives + self$false_negatives)
    f1_score <- 2 * precision * recall / (precision + recall)
    f1_score
  },

  reset_state = function() {
    self$true_positives$assign(0)
    self$false_positives$assign(0)
    self$false_negatives$assign(0)
  }
)
```

Let's test-drive it:


```r
m <- F1Score()
m$update_state(c(0, 1, 0, 0), c(0.3, 0.5, 0.8, 0.9))
cat("Intermediate result:", as.numeric(m$result()), "\n")
```

```
## Intermediate result: 0.5
```

```r
m$update_state(c(1, 1, 1, 1), c(0.1, 0.7, 0.6, 0.0))
cat("Final result:", as.numeric(m$result()), "\n")
```

```
## Final result: 0.6
```

## Compiled functions

Running eagerly is great for debugging, but you will get better performance by
compiling your computation into static graphs. Static graphs are a researcher's
best friends. You can compile any function by wrapping it in a `tf.function`
decorator.


```r
# Prepare our layer, loss, and optimizer.
model <- keras_model_sequential() %>%
  layer_dense(units = 32, activation = "relu") %>%
  layer_dense(units = 32, activation = "relu") %>%
  layer_dense(units = 10)
loss_fn <- loss_sparse_categorical_crossentropy(from_logits = TRUE)
optimizer <- optimizer_adam(learning_rate=1e-3)

# Create a training step function.
train_on_batch <- tf_function(function(x, y) {
  with(tf$GradientTape() %as% tape, {
    # Forward pass.
    logits <- model(x)
    # External loss value for this batch.
    loss_value <- loss_fn(y, logits)
  })
  # Update the weights of the model to minimize the loss value.
  gradients <- tape$gradient(loss_value, model$trainable_weights)
  optimizer$apply_gradients(zip_lists(gradients, model$trainable_weights))
  loss_value
})


# Prepare a dataset.
c(c(x_train, y_train), .) %<-% dataset_mnist()
x_train <- array_reshape(x_train, c(60000, 784)) / 255

dataset <- tfdatasets::tensor_slices_dataset(list(x_train, y_train)) %>%
  tfdatasets::dataset_shuffle(buffer_size=1024) %>%
  tfdatasets::dataset_batch(64)

i <- 0
coro::loop(for (data in dataset) {
  i <- i + 1
  x <- data[[1]]
  y <- data[[2]]
  loss <- train_on_batch(x, y)
  if (i %% 100 == 0)
    cat("Loss:", as.numeric(loss), "\n")
})
```

```
## Loss: 0.551749
## Loss: 0.2131135
## Loss: 0.2765952
## Loss: 0.1296219
## Loss: 0.2657076
## Loss: 0.2683381
## Loss: 0.1570166
## Loss: 0.3139241
## Loss: 0.08981849
```

## Training mode & inference mode

Some layers, in particular the `BatchNormalization` layer and the `Dropout`
layer, have different behaviors during training and inference. For such layers,
it is standard practice to expose a `training` (boolean) argument in the `call`
method.

By exposing this argument in `call`, you enable the built-in training and
evaluation loops (e.g. fit) to correctly use the layer in training and
inference modes.


```r
Dropout <- new_layer_class(
  "Dropout",
  initialize = function(rate) {
    super$initialize()
    self$rate <- rate
  },
  call = function(inputs, training = NULL) {
    if (!is.null(training) && training) {
      return(tf$nn$dropout(inputs, rate = self$rate))
    }
    inputs
  }
)

MLPWithDropout <- new_layer_class(
  "MLPWithDropout",
  initialize = function() {
    super$initialize()
    self$linear_1 <- Linear(units = 32)
    self$dropout <- Dropout(rate = 0.5)
    self$linear_3 <- Linear(units = 10)
  },
  call = function(inputs, training = NULL) {
    x <- self$linear_1(inputs)
    x <- tf$nn$relu(x)
    x <- self$dropout(x, training = training)
    self$linear_3(x)
  }
)

mlp <- MLPWithDropout()
y_train <- mlp(tf$ones(shape(2, 2)), training=TRUE)
y_test <- mlp(tf$ones(shape(2, 2)), training=FALSE)
```

## The Functional API for model-building

To build deep learning models, you don't have to use object-oriented programming all the
time. All layers we've seen so far can also be composed functionally, like this (we call
it the "Functional API"):


```r
# We use an `Input` object to describe the shape and dtype of the inputs.
# This is the deep learning equivalent of *declaring a type*.
# The shape argument is per-sample; it does not include the batch size.
# The functional API focused on defining per-sample transformations.
# The model we create will automatically batch the per-sample transformations,
# so that it can be called on batches of data.
inputs <- layer_input(shape = 16, dtype = "float32")

# We call layers on these "type" objects
# and they return updated types (new shapes/dtypes).
outputs <- inputs %>%
  Linear(units = 32) %>% # We are reusing the Linear layer we defined earlier.
  Dropout(rate = 0.5) %>% # We are reusing the Dropout layer we defined earlier.
  Linear(units = 10)

# A functional `Model` can be defined by specifying inputs and outputs.
# A model is itself a layer like any other.
model <- keras_model(inputs, outputs)

# A functional model already has weights, before being called on any data.
# That's because we defined its input shape in advance (in `Input`).
length(model$weights)
```

```
## [1] 4
```

```r
# Let's call our model on some data, for fun.
y <- model(tf$ones(shape(2, 16)))
y$shape
```

```
## TensorShape([2, 10])
```

```r
# You can pass a `training` argument in `__call__`
# (it will get passed down to the Dropout layer).
y <- model(tf$ones(shape(2, 16)), training=TRUE)
```

The Functional API tends to be more concise than subclassing, and provides a few other
advantages (generally the same advantages that functional, typed languages provide over
untyped OO development). However, it can only be used to define DAGs of layers --
recursive networks should be defined as Layer subclasses instead.

Learn more about the Functional API [here](functional_api.html).

In your research workflows, you may often find yourself mix-and-matching OO models and
Functional models.

Note that the `Model` class also features built-in training & evaluation loops:
`fit()`, `predict()` and `evaluate()` (configured via the `compile()` method).
These built-in functions give you access to the
following built-in training infrastructure features:

* [Callbacks](https://keras3.posit.co/reference/index.html#callbacks). You can leverage built-in
callbacks for early-stopping, model checkpointing,
and monitoring training with TensorBoard. You can also
[implement custom callbacks](writing_your_own_callbacks.html) if needed.
* [Distributed training](distributed_training_with_tensorflow.html). You
can easily scale up your training to multiple GPUs, TPU, or even multiple machines
with the `tf.distribute` API -- with no changes to your code.
* [Step fusing](https://keras3.posit.co/reference/compile.keras.src.models.model.Model.html).
With the `steps_per_execution` argument in `Model.compile()`, you can process
multiple batches in a single `tf.function` call, which greatly improves
device utilization on TPUs.

We won't go into the details, but we provide a simple code example
below. It leverages the built-in training infrastructure to implement the MNIST
example above.


```r
inputs <- layer_input(shape = 784, dtype="float32")
outputs <- inputs %>%
  layer_dense(units = 32, activation = "relu") %>%
  layer_dense(units = 32, activation = "relu") %>%
  layer_dense(units = 10)
model <- keras_model(inputs, outputs)

# Specify the loss, optimizer, and metrics with `compile()`.
model %>% compile(
    loss = loss_sparse_categorical_crossentropy(from_logits=TRUE),
    optimizer=optimizer_adam(learning_rate=1e-3),
    metrics=list(metric_sparse_categorical_accuracy()),
)

# Train the model with the dataset for 2 epochs.
model %>% fit(dataset, epochs=2)
```

```
## Epoch 1/2
## 938/938 - 4s - 4ms/step - loss: 0.3958 - sparse_categorical_accuracy: 0.8866
## Epoch 2/2
## 938/938 - 1s - 960us/step - loss: 0.1888 - sparse_categorical_accuracy: 0.9443
```

```r
predictions <- model %>% predict(dataset)
```

```
## 938/938 - 1s - 1ms/step
```

```r
model %>% evaluate(dataset)
```

```
## 938/938 - 1s - 1ms/step - loss: 0.1763 - sparse_categorical_accuracy: 0.9454
```

```
## $loss
## [1] 0.1763445
##
## $sparse_categorical_accuracy
## [1] 0.9454167
```

You can always subclass the `Model` class (it works exactly like subclassing
`Layer`) if you want to leverage built-in training loops for your OO models.
Just override the `Model$train_step()` to
customize what happens in `fit()` while retaining support
for the built-in infrastructure features outlined above -- callbacks,
zero-code distribution support, and step fusing support.
You may also override `test_step()` to customize what happens in `evaluate()`,
and override `predict_step()` to customize what happens in `predict()`. For more
information, please refer to
[this guide](custom_train_step_in_tensorflow.html).


```r
CustomModel <- new_model_class(
  "CustomModel",
  initialize = function(...) {
    super$initialize(...)
    self$loss_tracker <- metric_mean(name="loss")
    self$accuracy <- metric_sparse_categorical_accuracy()
    self$loss_fn <- loss_sparse_categorical_crossentropy(from_logits=TRUE)
    self$optimizer <- optimizer_adam(learning_rate=1e-3)
  },
  train_step = function(data) {
    c(x, y = NULL, sample_weight = NULL) %<-% data
    with(tf$GradientTape() %as% tape, {
      y_pred <- self(x, training=TRUE)
      loss <- self$loss_fn(y = y, y_pred = y_pred, sample_weight=sample_weight)
    })
    gradients <- tape$gradient(loss, self$trainable_variables)
    self$optimizer$apply_gradients(
      zip_lists(gradients, self$trainable_variables)
    )

    # Update metrics (includes the metric that tracks the loss)
    self$loss_tracker$update_state(loss)
    self$accuracy$update_state(y, y_pred, sample_weight=sample_weight)
    # Return a list mapping metric names to current value
    list(
      loss = self$loss_tracker$result(),
      accuracy = self$accuracy$result()
    )
  },
  metrics = mark_active(function() {
    list(self$loss_tracker, self$accuracy)
  })
)

inputs <- layer_input(shape = 784, dtype="float32")
outputs <- inputs %>%
  layer_dense(units = 32, activation = "relu") %>%
  layer_dense(units = 32, activation = "relu") %>%
  layer_dense(units = 10)
model <- CustomModel(inputs, outputs)
model %>% compile()
model %>% fit(dataset, epochs=2)
```

```
## Epoch 1/2
## 938/938 - 2s - 2ms/step - loss: 0.3869 - sparse_categorical_accuracy: 0.8924
## Epoch 2/2
## 938/938 - 1s - 1ms/step - loss: 0.2163 - sparse_categorical_accuracy: 0.9370
```

## End-to-end experiment example 1: variational autoencoders.

Here are some of the things you've learned so far:

- A `Layer` encapsulates a state (created in `__init__` or `build`) and some computation
(defined in `call`).
- Layers can be recursively nested to create new, bigger computation blocks.
- You can easily write highly hackable training loops by opening a
`GradientTape`, calling your model inside the tape's scope, then retrieving
gradients and applying them via an optimizer.
- You can speed up your training loops using the `@tf.function` decorator.
- Layers can create and track losses (typically regularization losses) via
`self.add_loss()`.

Let's put all of these things together into an end-to-end example: we're going to
implement a Variational AutoEncoder (VAE). We'll train it on MNIST digits.

Our VAE will be a subclass of `Layer`, built as a nested composition of layers that
subclass `Layer`. It will feature a regularization loss (KL divergence).

Below is our model definition.

First, we have an `Encoder` class, which uses a `Sampling` layer to map a MNIST digit to
a latent-space triplet `(z_mean, z_log_var, z)`.


```r
Sampling <- new_layer_class(
  "Sampling",
  call = function(inputs) {
    c(z_mean, z_log_var) %<-% inputs
    batch <- op_shape(z_mean)[[1]]
    dim <- op_shape(z_mean)[[2]]
    epsilon <- random_normal(shape = c(batch, dim))
    z_mean + op_exp(0.5 * z_log_var) * epsilon
  }
)

Encoder <- new_layer_class(
  "Encoder",
  initialize = function(latent_dim = 32, intermediate_dim = 64, ...) {
    super$initialize(...)
    self$dense_proj <- layer_dense(units = intermediate_dim, activation = "relu")
    self$dense_mean <- layer_dense(units = latent_dim)
    self$dense_log_var <- layer_dense(units = latent_dim)
    self$sampling <- Sampling()
  },
  call = function(inputs) {
    x <- self$dense_proj(inputs)
    z_mean <- self$dense_mean(x)
    z_log_var <- self$dense_log_var(x)
    z <- self$sampling(list(z_mean, z_log_var))
    list(z_mean, z_log_var, z)
  }
)
```

Next, we have a `Decoder` class, which maps the probabilistic latent space coordinates
back to a MNIST digit.


```r
Decoder <- new_layer_class(
  "Decoder",
  initialize = function(original_dim, intermediate_dim = 64, ...) {
    super$initialize(...)
    self$dense_proj <- layer_dense(units = intermediate_dim, activation = "relu")
    self$dense_output <- layer_dense(units = original_dim, activation = "sigmoid")
  },
  call = function(inputs) {
    x <- self$dense_proj(inputs)
    self$dense_output(x)
  }
)
```

Finally, our `VariationalAutoEncoder` composes together an encoder and a decoder, and
creates a KL divergence regularization loss via `add_loss()`.


```r
VariationalAutoEncoder <- new_model_class(
  "VariationalAutoEncoder",
  initialize = function(original_dim,
        intermediate_dim=64,
        latent_dim=32,
        name="autoencoder", ...) {
    super$initialize(name = name, ...)
    self$original_dim <- original_dim
    self$encoder <- Encoder(
      latent_dim = latent_dim,
      intermediate_dim = intermediate_dim
    )
    self$decoder <- Decoder(
      original_dim = original_dim,
      intermediate_dim = intermediate_dim
    )
  },
  call = function(inputs) {
    c(z_mean, z_log_var, z) %<-% self$encoder(inputs)
    reconstructed <- self$decoder(z)
    # Add KL divergence regularization loss.
    kl_loss <- -0.5 * op_mean(
      z_log_var - op_square(z_mean) - op_exp(z_log_var) + 1
    )
    self$add_loss(kl_loss)
    reconstructed
  }
)
```

Now, let's write a training loop. Our training step is decorated with a `@tf.function` to
compile into a super fast graph function.


```r
# Our model.
vae <- VariationalAutoEncoder(
  original_dim = 784,
  intermediate_dim = 64,
  latent_dim = 32
)

# Loss and optimizer.
loss_fn <- loss_mean_squared_error()
optimizer = optimizer_adam(learning_rate=1e-3)

# Prepare a dataset.
c(c(x_train, .), .) %<-% dataset_mnist()
x_train <- array_reshape(x_train, c(60000, 784)) / 255

dataset <- tfdatasets::tensor_slices_dataset(x_train) %>%
  tfdatasets::dataset_shuffle(buffer_size=1024) %>%
  tfdatasets::dataset_batch(32)


training_step <- tf_function(function(x) {
  with(tf$GradientTape() %as% tape, {
    reconstructed <- vae(x)  # Compute input reconstruction.
    # Compute loss.
    loss <- loss_fn(x, reconstructed)
    loss <- loss + op_sum(vae$losses)  # Add KLD term.
  })
  # Update the weights of the VAE.
  grads <- tape$gradient(loss, vae$trainable_weights)
  optimizer$apply_gradients(zip_lists(grads, vae$trainable_weights))
  loss
})

losses <- c()  # Keep track of the losses over time.
coro::loop(for(data in dataset) {
  loss <- training_step(data)

  # Logging.
  losses[length(losses) + 1] <- as.numeric(loss)
  if (length(losses) %% 100 == 0) {
    cat("Step:", length(losses), "Loss:", mean(losses), "\n")
  }
  # Stop after 1000 steps.
  # Training the model to convergence is left
  # as an exercise to the reader.
  if (length(losses) >= 1000) {
    break
  }
})
```

```
## Step: 100 Loss: 0.1270978
## Step: 200 Loss: 0.1003238
## Step: 300 Loss: 0.09001128
## Step: 400 Loss: 0.08493649
## Step: 500 Loss: 0.08171404
## Step: 600 Loss: 0.07926706
## Step: 700 Loss: 0.07790599
## Step: 800 Loss: 0.07670419
## Step: 900 Loss: 0.07570736
## Step: 1000 Loss: 0.07476593
```

As you can see, building and training this type of model in Keras
is quick and painless.

## End-to-end experiment example 2: hypernetworks.

Let's take a look at another kind of research experiment: hypernetworks.

The idea is to use a small deep neural network (the hypernetwork) to generate
the weights for a larger network (the main network).

Let's implement a really trivial hypernetwork: we'll use a small 2-layer network  to
generate the weights of a larger 3-layer network.


```r
input_dim <- 784
classes <- 10

# This is the main network we'll actually use to predict labels.
inputs <- layer_input(shape = input_dim)
dense1 <- layer_dense(units = 64, activation = "relu")
dense1$built <- TRUE

dense2 <- layer_dense(units = classes)
dense2$built <- TRUE

outputs <- inputs %>% dense1() %>% dense2()
main_network <- keras_model(inputs, outputs)

# This is the number of weight coefficients to generate. Each layer in the
# main network requires output_dim * input_dim + output_dim coefficients.
num_weights_to_generate <- (classes * 64 + classes) + (64 * input_dim + 64)

# This is the hypernetwork that generates the weights of the `main_network` above.
hypernetwork <- keras_model_sequential() %>%
  layer_dense(units=16, activation="relu") %>%
  layer_dense(units=num_weights_to_generate, activation="sigmoid")
```

This is our training loop. For each batch of data:

- We use `hypernetwork` to generate an array of weight coefficients, `weights_pred`
- We reshape these coefficients into kernel & bias tensors for the `main_network`
- We run the forward pass of the `main_network` to compute the actual MNIST predictions
- We run backprop through the weights of the `hypernetwork` to minimize the
final classification loss


```r
# Loss and optimizer.
loss_fn <- loss_sparse_categorical_crossentropy(from_logits = TRUE)
optimizer <- optimizer_adam(learning_rate=1e-4)

# Prepare a dataset.
c(c(x_train, y_train), .) %<-% dataset_mnist()
x_train <- array_reshape(x_train, c(60000, 784)) / 255

dataset <- tfdatasets::tensor_slices_dataset(list(x_train, y_train)) %>%
  tfdatasets::dataset_shuffle(buffer_size=1024) %>%
  # We'll use a batch size of 1 for this experiment.
  tfdatasets::dataset_batch(1)

train_step <- function(x, y) {
  with(tf$GradientTape() %as% tape, {
    weights_pred <- hypernetwork(x)

    # Reshape them to the expected shapes for w and b for the outer model.
    # Layer 1 kernel.
    start_index <- 1
    w1_shape <- c(input_dim, 64)
    w1_coeffs <- weights_pred[, start_index:(start_index + prod(w1_shape) - 1)]
    w1 <- tf$reshape(w1_coeffs, as.integer(w1_shape))
    start_index <- start_index + prod(w1_shape)

    # Layer 1 bias.
    b1_shape <- c(64)
    b1_coeffs <- weights_pred[, start_index:(start_index + prod(b1_shape) - 1)]
    b1 <- tf$reshape(b1_coeffs, as.integer(b1_shape))
    start_index <- start_index + prod(b1_shape)

    # Layer 2 kernel.
    w2_shape <- c(64, classes)
    w2_coeffs <- weights_pred[, start_index:(start_index + prod(w2_shape) - 1)]
    w2 <- tf$reshape(w2_coeffs, as.integer(w2_shape))
    start_index <- start_index + prod(w2_shape)

    # Layer 2 bias.
    b2_shape <- c(classes)
    b2_coeffs <- weights_pred[, start_index:(start_index + prod(b2_shape) - 1)]
    b2 <- tf$reshape(b2_coeffs, as.integer(b2_shape))
    start_index <- start_index + prod(b2_shape)

    # Set the weight predictions as the weight variables on the outer model.
    dense1$kernel <- w1
    dense1$bias <- b1
    dense2$kernel <- w2
    dense2$bias <- b2

    # Inference on the outer model.
    preds <- main_network(x)
    loss <- loss_fn(y, preds)
  })

  grads <- tape$gradient(loss, hypernetwork$trainable_weights)
  optimizer$apply_gradients(zip_lists(grads, hypernetwork$trainable_weights))
  loss
}

losses <- c()  # Keep track of the losses over time.
coro::loop(for (data in dataset) {
  x <- data[[1]]
  y <- data[[2]]
  loss <- train_step(x, y)

  # Logging.
  losses[length(losses) + 1] <- as.numeric(loss)
  if (length(losses) %% 100 == 0) {
    cat("Step:", length(losses), "Loss:", mean(losses), "\n")
  }
  # Stop after 1000 steps.
  # Training the model to convergence is left
  # as an exercise to the reader.
  if (length(losses) >= 1000) {
    break
  }
})
```

```
## Step: 100 Loss: 2.536778
## Step: 200 Loss: 2.236472
## Step: 300 Loss: 2.119417
## Step: 400 Loss: 2.040341
## Step: 500 Loss: 1.949125
## Step: 600 Loss: 1.859384
## Step: 700 Loss: 1.845726
## Step: 800 Loss: 1.820594
## Step: 900 Loss: 1.771334
## Step: 1000 Loss: 1.730648
```

Implementing arbitrary research ideas with Keras is straightforward and highly
productive. Imagine trying out 25 ideas per day (20 minutes per experiment on average)!

Keras has been designed to go from idea to results as fast as possible, because we
believe this is
the key to doing great research.

We hope you enjoyed this quick introduction. Let us know what you build with Keras!