--- title: "Matrix Factorization with Side Info" output: rmarkdown::html_vignette: toc: true author: "David Cortes" vignette: > %\VignetteIndexEntry{Matrix Factorization with Side Info} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8} --- ```{r setup, include = FALSE} knitr::opts_chunk$set( collapse = TRUE, comment = "#>" ) options(rmarkdown.html_vignette.check_title = FALSE) ``` ```{r, include = FALSE} ### Don't overload CRAN servers ### https://stackoverflow.com/questions/28961431/computationally-heavy-r-vignettes is_check <- ("CheckExEnv" %in% search()) || any(c("_R_CHECK_TIMINGS_", "_R_CHECK_LICENSE_") %in% names(Sys.getenv())) ``` This vignette illustrates the usage of the [cmfrec](https://cran.r-project.org/package=cmfrec) library for building recommender systems based on collaborative filtering models for explicit-feedback data, with or without side information about the users and items. Note that the library offers also content-based models and implicit-feedback models, but they are not showcased in this vignette. This example will use the [MovieLens100k](https://grouplens.org/datasets/movielens/100k/) data, as bundled in the [recommenderlab](https://cran.r-project.org/package=recommenderlab) package, which contains around ~ 100k movie ratings from 943 users about 1664 movies, in a scale from 1 to 5. In addition to the ratings, it also contains side information about the movies (genre, year of release) and about the users (age, occupation), which will be used here to construct a better recommendation model. **For a more comprehensive introduction see also the `cmfrec` [Python Notebook](https://nbviewer.jupyter.org/github/david-cortes/cmfrec/blob/master/example/cmfrec_movielens_sideinfo.ipynb), which uses the more richer MovieLens1M instead (not provided by R packages)**. ## Matrix Factorization One of the most popular techniques for building recommender systems is to frame the problem as matrix completion, in which a large sparse matrix is built containing the ratings that users give to products (in this case, movies), with rows representing users, columns representing items, and entries corresponding to the ratings that they've given (e.g. "5 stars"). Most of these entries will be missing, as each users is likely to consume only a handful of the available products (thus, the matrix is sparse), and the goal is to construct a model which would be able to predict the value of the known interactions (i.e. predict which rating would each user give to each movie), which is compared against the observed values. The items to recommend to each user are then the ones with highest predicted values among those which the user has not yet consumed. Typically, the problem is approached by trying to approximate the interactions matrix as the product of two lower-dimension matrices (a.k.a. latent factor matrices), which when multiplied by each other would produce something that resembles the original matrix, having the nice property that it will produce predictions for all user-item combinations - i.e. $$ \mathbf{X} \approx \mathbf{A} \mathbf{B}^T $$ Where: * $\mathbf{X}$ is the interactions matrix (users are rows, items are columns). * $\mathbf{A}$ and $\mathbf{B}$ are the matrices estimated by the model (a.k.a. latent factors), which have a low number of columns, typically 30-100. 
For a better and more stable model, the $\mathbf{X}$ matrix is typically centered by subtracting its mean, a bias/intercept is added for each user and item, and a regularization penalty is applied to the model matrices and biases (typically on the L2 norm) - i.e.:
$$
\mathbf{X} \approx \mathbf{A} \mathbf{B}^T + \mu + \mathbf{b}_A + \mathbf{b}_B
$$
Where:

* $\mu$ is the global mean used to center $\mathbf{X}$.
* $\mathbf{b}_A$ is a vector of user-specific biases (one entry per user/row of $\mathbf{X}$).
* $\mathbf{b}_B$ is a vector of item-specific biases (one entry per item/column of $\mathbf{X}$).

The matrices are typically fitted by initializing them to random numbers and then iteratively updating them in a way that decreases the reconstruction error with respect to the observed entries in $\mathbf{X}$, using either gradient-based procedures (e.g. stochastic gradient descent) or the ALS (alternating least-squares) method, which optimizes one matrix at a time while leaving the other fixed, performing a few sweeps over both matrices until convergence.

This library (`cmfrec`) will by default use the ALS method with L2 regularization, and will use user/item biases which are model parameters (updated at each iteration) rather than being pre-estimated.

## Loading the data

The MovieLens100k data is taken from the `recommenderlab` package. As the data is sparse, it is represented as sparse matrices from the [Matrix](https://cran.r-project.org/package=Matrix) package. The data comes in CSC format, whereas `cmfrec` requires COO/triplets format - the conversion is handled by the [MatrixExtra](https://cran.r-project.org/package=MatrixExtra) package for convenience, which also provides extra slicing functionality that will be used later.

```{r, message=FALSE}
library(cmfrec)
library(Matrix)
library(MatrixExtra)
library(recommenderlab)

data("MovieLense")
X <- as.coo.matrix(MovieLense@data)
str(X)
```

##### Creating a train-test split

In order to evaluate the models, 25% of the data will be set aside as a test set, while the models will be built with the remaining 75%. The split done here is random, but time-based splits usually reflect more realistic recommendation scenarios. Typically, **these splits are done in such a way that the test set contains only users and items which are in the train set**, but such a rule is not necessary and perhaps not even desirable for `cmfrec`, since it can accommodate global/user/item biases and thus can make predictions based on them alone.

```{r}
### Keep only the entries of a COO matrix at the given positions
subsample_coo_matrix <- function(X, indices) {
    X@i <- X@i[indices]
    X@j <- X@j[indices]
    X@x <- X@x[indices]
    return(X)
}

n_ratings <- length(X@x)
set.seed(123)
ix_train <- sample(n_ratings, floor(0.75 * n_ratings), replace=FALSE)
X_train <- subsample_coo_matrix(X, ix_train)
X_test <- subsample_coo_matrix(X, -ix_train)
```

## Classical model

Now fitting the classical matrix factorization model, with global mean centering, user/item biases, L2 regularization which scales with the number of ratings for each user/item, and no side information. This is the model explained in the earlier section:
$$
\mathbf{X} \approx \mathbf{A} \mathbf{B}^T + \mu + \mathbf{b}_A + \mathbf{b}_B
$$

```{r, eval=FALSE}
model.classic <- CMF(X_train, k=25, lambda=0.1, scale_lam=TRUE, verbose=FALSE)
```
```{r, echo=FALSE}
### Don't overload CRAN servers
if (!is_check) {
    model.classic <- CMF(X_train, k=25, lambda=0.1, scale_lam=TRUE, verbose=FALSE)
} else {
    model.classic <- CMF(X_train, k=5, lambda=0.1, scale_lam=TRUE, verbose=FALSE,
                         niter=2, nthreads=1)
}
```
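As an aside, the alternating least-squares idea described earlier can be illustrated with a minimal standalone sketch on a small toy matrix. This is purely illustrative and is **not** how `cmfrec` is implemented internally - it omits the biases, centering, scaled regularization, and sparse-matrix handling that the library performs:

```{r}
### Minimal illustrative ALS: alternate between solving regularized
### least-squares problems for the rows of A (with B fixed) and for the
### rows of B (with A fixed), using only the observed entries
set.seed(456)
m <- 6; n <- 5; k_toy <- 2; reg <- 0.1
X_small <- matrix(sample(c(NA, 1:5), m * n, replace=TRUE), m, n)
A_small <- matrix(rnorm(m * k_toy), m, k_toy)
B_small <- matrix(rnorm(n * k_toy), n, k_toy)
for (sweep in 1:10) {
    for (u in 1:m) {   ## update user factors
        obs <- which(!is.na(X_small[u, ]))
        if (!length(obs)) next
        Bo <- B_small[obs, , drop=FALSE]
        A_small[u, ] <- solve(crossprod(Bo) + diag(reg, k_toy),
                              crossprod(Bo, X_small[u, obs]))
    }
    for (i in 1:n) {   ## update item factors
        obs <- which(!is.na(X_small[, i]))
        if (!length(obs)) next
        Ao <- A_small[obs, , drop=FALSE]
        B_small[i, ] <- solve(crossprod(Ao) + diag(reg, k_toy),
                              crossprod(Ao, X_small[obs, i]))
    }
}
round(A_small %*% t(B_small), 2)  ## reconstruction, including the missing entries
```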
#### How good is it?

The most typical way of evaluating the quality of these models is to look at the error they make when predicting known entries, which here will be measured against the test data that was set apart earlier. The evaluation will be done in terms of root mean squared error (RMSE). **Note that, while widely used in the early literature on recommender systems, RMSE might not provide a good overview of the ranking of items (which is what matters for recommendations), and it's recommended to also evaluate other metrics such as `NDCG@K`, `P@K`, correlations, etc.**

```{r}
print_rmse <- function(X_test, X_hat, model_name) {
    rmse <- sqrt(mean( (X_test@x - X_hat@x)^2 ))
    cat(sprintf("RMSE for %s is: %.4f\n", model_name, rmse))
}

pred_classic <- predict(model.classic, X_test)
print_rmse(X_test, pred_classic, "classic model")
```

That is, the predicted ratings are off by about one star on average. This is better than a non-personalized model which would predict the same ratings for every user, and which can also be simulated through `cmfrec`:

```{r}
model.baseline <- MostPopular(X_train, lambda=10, scale_lam=FALSE)
pred_baseline <- predict(model.baseline, X_test)
print_rmse(X_test, pred_baseline, "non-personalized model")
```

(_Note: it's not recommended to use scaled/dynamic regularization in a most-popular model, as it will tend to recommend items with only one user giving the maximum rating._)

#### Improving the classical model

By default, ALS-based models are broken down into smaller problems involving linear systems, which are in turn solved through the [Conjugate Gradient](http://rs1.sze.hu/~gtakacs/download/recsys_2011_draft.pdf) method, but `cmfrec` can also use a Cholesky solver for them, which is slower but tends to result in better-quality solutions for explicit-feedback data. Likewise, the default number of iterations is 10, but it can be increased for better models at the expense of longer fitting times.

More importantly, `cmfrec` offers the option of adding "implicit features" or co-factoring, which will additionally factorize binarized versions of $\mathbf{X}$ (telling whether each entry is present or missing), sharing the same latent components with the factorization of $\mathbf{X}$ - that is:
$$
\mathbf{X} \approx \mathbf{A} \mathbf{B}^T + \mu + \mathbf{b}_A + \mathbf{b}_B
$$
$$
\mathbf{I}_x \approx \mathbf{A} \mathbf{B}^T_i
\:\:\:\:
\mathbf{I}^T_x \approx \mathbf{B} \mathbf{A}^T_i
$$
Where:

* $\mathbf{I}_x$ is a binary matrix indicating whether each entry of $\mathbf{X}$ is observed or missing.
* $\mathbf{A}_i$ and $\mathbf{B}_i$ are model matrices which are not directly used for $\mathbf{X}$, and do not enter the prediction formula, but are nevertheless estimated as part of what is now a multi-objective optimization problem.
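To see what $\mathbf{I}_x$ contains, the binarized matrix can be built explicitly - shown below only for illustration, since when passing `add_implicit_features=TRUE` the library constructs it internally and there is no need to create it oneself:

```{r}
### For illustration only: binarized version of 'X_train' in which every
### observed rating becomes an indicator of "this entry is present"
### ('cmfrec' builds this internally with 'add_implicit_features=TRUE')
str(as.csc.matrix(X_train, binary=TRUE))
```

Now fitting an improved version of the classical model, using the Cholesky solver, more ALS iterations, and these implicit features added: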
```{r, eval=FALSE} model.improved <- CMF(X_train, k=25, lambda=0.1, scale_lam=TRUE, add_implicit_features=TRUE, w_main=0.75, w_implicit=0.25, use_cg=FALSE, niter=30, verbose=FALSE) pred_improved <- predict(model.improved, X_test) print_rmse(X_test, pred_improved, "improved classic model") ``` ```{r, echo=FALSE} ### Don't overload CRAN servers if (!is_check) { model.improved <- CMF(X_train, k=25, lambda=0.1, scale_lam=TRUE, add_implicit_features=TRUE, w_main=0.75, w_implicit=0.25, use_cg=FALSE, niter=30, verbose=FALSE) } else { model.improved <- CMF(X_train, k=5, lambda=0.1, scale_lam=TRUE, add_implicit_features=TRUE, w_main=0.75, w_implicit=0.25, use_cg=FALSE, verbose=FALSE, niter=2, nthreads=1) } pred_improved <- predict(model.improved, X_test) print_rmse(X_test, pred_improved, "improved classic model") ``` ## Adding side information Collective matrix factorization extends the classical model by incorporating side information about users/items into the formula, which is done by also factorizing the side information matrices, sharing the same latent components that are used for factorizing the $\mathbf{X}$ matrix: $$ \mathbf{X} \approx \mathbf{A} \mathbf{B}^T + \mu + \mathbf{b}_A + \mathbf{b}_B $$ $$ \mathbf{U} \approx \mathbf{A} \mathbf{C}^T + \mu_U $$ $$ \mathbf{I} \approx \mathbf{B} \mathbf{D}^T + \mu_I $$ $$ \mathbf{I}_x \approx \mathbf{A} \mathbf{B}^T_i \:\:\:\: \mathbf{I}^T_x \approx \mathbf{B} \mathbf{A}^T_i $$ Where: * $\mathbf{U}$ is a matrix representing side information about users, with each user being a row, and columns corresponding to their attributes. * $\mathbf{I}$ is similarly a matrix representing side information about items. * $\mathbf{C}$ and $\mathbf{D}$ are new latent factor matrices used for factorizing the side information matrices, but are not used directly for $\mathbf{X}$. * $\mu_U$ and $\mu_I$ are column means for the attributes, which are used in order to center them. Informally, the latent factors now need to explain both the interactions data as well as the side information, thereby making them generalize better to unseen data. This library in addition allows controlling aspects such as the weight that each factorization has in the optimization objective, different regularization for each matrix, having factors that are not shared, among others. 
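As a sketch of that kind of control, a hypothetical call might look as follows (not evaluated here - the `U` and `I` side-information matrices are only built in the next section, and the `k_user`/`k_item`/`k_main` arguments are assumed to keep their documented meaning of adding components used only by the corresponding factorization):

```{r, eval=FALSE}
### Sketch only (not run): 'k' shared components, plus a few components that
### are private to the user-attributes ('k_user'), item-attributes ('k_item'),
### and ratings ('k_main') factorizations, with smaller weights for the
### side-information matrices in the optimization objective
model.custom <- CMF(X_train, U=U, I=I,
                    k=20, k_user=3, k_item=3, k_main=5,
                    w_main=1, w_user=0.25, w_item=0.25,
                    lambda=0.1, scale_lam=TRUE, verbose=FALSE)
```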
##### Fetching the side information from `recommenderlab`

```{r}
U <- MovieLenseUser
U$id <- NULL
U$zipcode <- NULL
U$age2 <- U$age^2
### Note that `cmfrec` does not standardize features beyond mean centering
U$age <- (U$age - mean(U$age)) / sd(U$age)
U$age2 <- (U$age2 - mean(U$age2)) / sd(U$age2)
U <- model.matrix(~.-1, data=U)

I <- MovieLenseMeta
I$title <- NULL
I$url <- NULL
I$year <- ifelse(is.na(I$year), median(I$year, na.rm=TRUE), I$year)
I$year2 <- I$year^2
I$year <- (I$year - mean(I$year)) / sd(I$year)
I$year2 <- (I$year2 - mean(I$year2)) / sd(I$year2)
I <- as.coo.matrix(I)

cat(dim(U), "\n")
cat(dim(I), "\n")
```

Now fitting the model:

```{r, eval=FALSE}
model.w.sideinfo <- CMF(X_train, U=U, I=I, NA_as_zero_item=TRUE,
                        k=25, lambda=0.1, scale_lam=TRUE,
                        niter=30, use_cg=FALSE, include_all_X=FALSE,
                        w_main=0.75, w_user=0.5, w_item=0.5, w_implicit=0.5,
                        center_U=FALSE, center_I=FALSE,
                        verbose=FALSE)
pred_side_info <- predict(model.w.sideinfo, X_test)
print_rmse(X_test, pred_side_info, "model with side info")
```
```{r, echo=FALSE}
### Don't overload CRAN servers
if (!is_check) {
    model.w.sideinfo <- CMF(X_train, U=U, I=I, NA_as_zero_item=TRUE,
                            k=25, lambda=0.1, scale_lam=TRUE,
                            niter=30, use_cg=FALSE, include_all_X=FALSE,
                            w_main=0.75, w_user=0.5, w_item=0.5, w_implicit=0.5,
                            center_U=FALSE, center_I=FALSE,
                            verbose=FALSE)
} else {
    model.w.sideinfo <- CMF(X_train, U=U, I=I, NA_as_zero_item=TRUE,
                            k=5, lambda=0.1, scale_lam=TRUE, scale_lam_sideinfo=TRUE,
                            use_cg=FALSE, include_all_X=FALSE,
                            w_main=0.75, w_user=0.5, w_item=0.5, w_implicit=0.5,
                            center_U=FALSE, center_I=FALSE,
                            verbose=FALSE, niter=2, nthreads=1)
}
pred_side_info <- predict(model.w.sideinfo, X_test)
print_rmse(X_test, pred_side_info, "model with side info")
```

##### Summary

```{r}
library(kableExtra)

calc_rmse <- function(X_test, X_hat) {
    return(sqrt(mean( (X_test@x - X_hat@x)^2 )))
}

results <- data.frame(
    NonPersonalized = calc_rmse(X_test, pred_baseline),
    ClassicalModel = calc_rmse(X_test, pred_classic),
    ClassicPlusImplicit = calc_rmse(X_test, pred_improved),
    CollectiveModel = calc_rmse(X_test, pred_side_info)
)
results <- as.data.frame(t(results))
names(results) <- "RMSE"
results %>%
    kable() %>%
    kable_styling()
```

Important to keep in mind:

* These RMSE values have high standard errors due to the small amount of data used here.
* The model hyperparameters have not been tuned in any serious way, and proper tuning would also require a separate validation split.
* The test split may contain users and items which were not present in the training set.
* While the difference from adding side information looks very small, it also comes with the added benefit of being able to recommend items based on attributes alone.
* RMSE as a metric can hide overfitting in models that tend to recommend items with too few ratings/interactions - the models fitted here in particular will tend to recommend many movies with only a handful of ratings, which is typically undesirable. A model with higher regularization that shows a higher test RMSE might in practice produce better-quality recommendations (see the introductory [Python notebook](https://nbviewer.jupyter.org/github/david-cortes/cmfrec/blob/master/example/cmfrec_movielens_sideinfo.ipynb) for more examples on this topic).
* The models evaluated so far have all used dynamic/scaled regularization (as proposed in [Large-scale Parallel Collaborative Filtering for the Netflix Prize](https://link.springer.com/chapter/10.1007/978-3-540-68880-8_32)), save for the baseline most-popular model - this means that the regularization for each user and item is scaled according to the number of entries present for it. This setting tends to produce dubious recommendations on small datasets like MovieLens100k, even if it makes it look like the RMSE improves.

## Generating Top-N recommendations

The goal behind building a collaborative filtering model is typically to produce top-N recommended lists for users, or to obtain latent factors for an unseen user given that user's current data. `cmfrec` has many prediction functions for these purposes, depending on what specifically one wants to do, supporting both warm-start and cold-start recommendations.

```{r, eval=FALSE}
### Re-fitting the earlier models to all the data,
### this time *without* scaled regularization
model.classic <- CMF(X, k=20, lambda=10, scale_lam=FALSE, verbose=FALSE)
model.w.sideinfo <- CMF(X, U=U, I=I, k=20, lambda=10, scale_lam=FALSE,
                        w_main=0.75, w_user=0.125, w_item=0.125, verbose=FALSE)
```
```{r, echo=FALSE}
### Don't overload CRAN servers
if (!is_check) {
    model.classic <- CMF(X, k=20, lambda=10, scale_lam=FALSE, verbose=FALSE)
    model.w.sideinfo <- CMF(X, U=U, I=I, k=20, lambda=10, scale_lam=FALSE,
                            w_main=0.75, w_user=0.125, w_item=0.125, verbose=FALSE)
} else {
    model.classic <- CMF(X, k=5, lambda=10, scale_lam=FALSE, verbose=FALSE,
                         niter=2, nthreads=1)
    model.w.sideinfo <- CMF(X, U=U, I=I, k=5, lambda=10, scale_lam=FALSE,
                            w_main=0.75, w_user=0.125, w_item=0.125, verbose=FALSE,
                            niter=2, nthreads=1)
}
```

### Recommendations for existing users

When fitting a model, all the necessary fitted matrices are saved inside the object itself, which allows making predictions for existing users based just on their ID. The specific items consumed by each user are however not saved, so in order to avoid recommending already-seen items, these have to be passed explicitly for exclusion.

```{r}
user_to_recommend <- 10

### Note: slicing of 'X' is provided by 'MatrixExtra',
### returning a 'sparseVector' object as required by cmfrec
topN(model.classic, user=user_to_recommend, n=10,
     exclude=X[user_to_recommend, , drop=TRUE])
```

```{r}
### A handy function for visualizing recommendations
movie_names <- colnames(X)
n_ratings <- colSums(as.csc.matrix(X, binary=TRUE))
avg_ratings <- colSums(as.csc.matrix(X)) / n_ratings

print_recommended <- function(rec, txt) {
    cat(txt, ":\n",
        paste(paste(1:length(rec), ". ", sep=""),
              movie_names[rec],
              " - Avg rating:", round(avg_ratings[rec], 2),
              ", #ratings: ", n_ratings[rec],
              collapse="\n", sep=""),
        "\n", sep="")
}

recommended <- topN(model.w.sideinfo, user=user_to_recommend, n=5,
                    exclude=X[user_to_recommend, , drop=TRUE])
print_recommended(recommended, "Recommended for user_id=10")
```

### Recommendations for new users

The fitted model, as it is, can only provide recommendations for the specific users and items to which it was fit. Typically, one wants to produce recommendations for new users as they arrive, or to update the recommended lists for existing users once they consume more items. `cmfrec` allows obtaining latent factors and top-N recommended lists for new users without having to refit the whole model.
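For instance, the latent factors for a new user can be obtained on their own through `factors_single` - shown here as a non-evaluated sketch, under the assumption that it accepts the same kind of `X` and `U` inputs as `topN_new` below:

```{r, eval=FALSE}
### Sketch (not run): latent factor vector for a new user, calculated from
### that user's ratings and attributes without refitting the model
### (re-using the data of user 10 here as if it belonged to a new user)
new_factors <- factors_single(model.w.sideinfo,
                              X=X[user_to_recommend, , drop=TRUE],
                              U=U[user_to_recommend, , drop=TRUE])
```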
This is how it would look if user 10 were to come in as a new visitor:

```{r}
recommended_new <- topN_new(model.w.sideinfo, n=5,
                            exclude=X[user_to_recommend, , drop=TRUE],
                            X=X[user_to_recommend, , drop=TRUE],
                            U=U[user_to_recommend, , drop=TRUE])
print_recommended(recommended_new, "Recommended for user_id=10 as new user")
```

It is not mandatory to provide all the side information, as the ratings alone can also be used to generate recommendations, even if the model was fit with side information (this would not be the case if passing `NA_as_zero_user=TRUE`):

```{r}
recommended_new <- topN_new(model.w.sideinfo, n=5,
                            exclude=X[user_to_recommend, , drop=TRUE],
                            X=X[user_to_recommend, , drop=TRUE])
print_recommended(recommended_new, "Recommended for user_id=10 as new user (NO sideinfo)")
```

(_In this case, the top-5 recommendations did not change, as the side information has little effect in this particular model, but that might not always be the case - that is, the top-N recommendations for a different user might differ when side information is absent._)

### Cold-start recommendations

Conversely, it is also possible to make recommendations based on the side information alone, without having rated any movies/items. The quality of these recommendations is however highly dependent on how much influence the attributes have in the model, and in this case the user attributes carry very little information and thus have little leverage. Nevertheless, they might still provide an improvement over a completely non-personalized recommendation (see [Cold-start recommendations in Collective Matrix Factorization](https://arxiv.org/pdf/1809.00366.pdf)):

```{r}
recommended_cold <- topN_new(model.w.sideinfo, n=5,
                             exclude=X[user_to_recommend, , drop=TRUE],
                             U=U[user_to_recommend, , drop=TRUE])
print_recommended(recommended_cold, "Recommended for user_id=10 as new user (NO ratings)")
```