--- title: "Introduction to the RcppML package" author: "Zach DeBruine" date: "`r Sys.Date()`" output: rmarkdown::html_vignette vignette: > %\VignetteIndexEntry{Introduction to the RcppML package} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8} --- ```{r setup, include = FALSE} knitr::opts_chunk$set( collapse = TRUE, comment = "#>" ) ``` The 'RcppML' package provides high-performance machine learning algorithms using Rcpp with a focus on matrix factorization. ## Installation Install the latest development version of RcppML from github: ```{R, eval = FALSE} library(devtools) install_github("zdebruine/RcppML") ``` ```{R} library(RcppML) library(Matrix) ``` ## Non-Negative Least Squares RcppML contains extremely fast NNLS solvers. Use the `nnls` function to solve systems of equations subject to non-negativity constraints. The `RcppML::solve` function solves the equation \eqn{ax = b} for \eqn{x} where \eqn{a} is symmetric positive definite matrix of dimensions \eqn{m x m} and \eqn{b} is a vector of length \eqn{m} or a matrix of dimensions \eqn{m x n}. ```{R} # construct a system of equations X <- matrix(rnorm(2000),100,20) btrue <- runif(20) y <- X %*% btrue + rnorm(100) a <- crossprod(X) b <- crossprod(X, y) # solve the system of equations x <- RcppML::nnls(a, b) # use only coordinate descent x <- RcppML::nnls(a, b, fast_nnls = FALSE, cd_maxit = 1000, cd_tol = 1e-8) ``` `RcppML::solve` implements a new and fastest-in-class algorithm for non-negative least squares: 1. *initialization* is done by solving for the unconstrained least squares solution. 2. *forward active set tuning* (FAST) provides a near-exact solution (often exact for well-conditioned systems) by setting all negative values in the unconstrained solution to zero, re-solving the system for only positive values, and repeating the process until the solution for values not constrained to zero is strictly positive. Set `cd_maxit = 0` to use only the FAST solver. 3. *Coordinate descent* refines the FAST solution and finds the best solution discoverable by gradient descent. The coordinate descent solution is only used if it gives a better error than the FAST solution. Generally, coordinate descent re-introduces variables constrained to zero by FAST back into the feasible set, but does not dramatically change the solution. ## Projecting Linear Models Project dense linear factor models onto real-valued sparse matrices (or any matrix coercible to `Matrix::dgCMatrix`) using `RcppML::project`. `RcppML::project` solves the equation \eqn{A = WH} for \eqn{H}. ```{R} # simulate a sparse matrix A <- rsparsematrix(1000, 100, 0.1) # simulate a linear factor model w <- matrix(runif(1000 * 10), 1000, 10) # project the model h <- RcppML::project(A, w) ``` ## Non-negative Matrix Factorization `RcppML::nmf` finds a non-negative matrix factorization by alternating least squares (alternating projections of linear models \eqn{w} and \eqn{h}). There are several ways in which the NMF algorithm differs from other currently available methods: * Diagonalized scaling of factors to sum to 1, permitting convex L1 regularization along the entire solution path * Fast stopping criteria, based on correlation between models across consecutive iterations * Extremely fast algorithms using the Eigen C++ library, optimized for matrices that are >90% sparse * Support for NMF or unconstrained matrix factorization * Parallelized using OpenMP multithreading The following example runs rank-10 NMF on a random 1000 x 1000 matrix that is 90% sparse: ```{R} A <- rsparsematrix(100, 100, 0.1) model <- RcppML::nmf(A, 10, verbose = F) w <- model$w d <- model$d h <- model$h model_tolerance <- tail(model$tol, 1) ``` Tolerance is simply a measure of the average correlation between \eqn{w_{i-1} and \eqn{w_i} and \eqn{h_{i-1}} and \eqn{h_i} for a given iteration \eqn{i}. For symmetric factorizations (when \code{symmetric = TRUE}), tolerance becomes a measure of the correlation between \eqn{w_i} and \eqn{h_i}, and diagonalization is automatically performed to enforce symmetry: ```{R} A_sym <- as(crossprod(A), "dgCMatrix") model <- RcppML::nmf(A_sym, 10, verbose = F) ``` Mean squared error of a factorization can be calculated for a given model using the `RcppML::mse` function: ```{R} RcppML::mse(A_sym, model$w, model$d, model$h) ```