--- title: "milr: Multiple-Instance Logistic Regression with Lasso Penalty" author: - name: Ping-Yang Chen affiliation: Department of Statistics, National Cheng Kung University address: - 1 University Road, - Tainan 70101, Taiwan. email: pychen.ping@gmail.com - name: Ching-Chuan Chen affiliation: Department of Statistics, National Cheng Kung University address: - 1 University Road, - Tainan 70101, Taiwan. email: zw12356@gmail.com - name: Chun-Hao Yang affiliation: Department of Statistics, University of Florida address: - University of Florida - Gainesville, FL 32611 email: chunhaoyang@ufl.edu - name: Sheng-Mao Chang affiliation: Department of Statistics, National Cheng Kung University address: - 1 University Road, - Tainan 70101, Taiwan. email: smchang@mail.ncku.edu.tw - name: Kuo-Jung Lee affiliation: Department of Statistics, National Cheng Kung University address: - 1 University Road, - Tainan 70101, Taiwan. email: kuojunglee@mail.ncku.edu.tw abstract: > The purpose of the proposed package **milr** is to analyze multiple-instance data. Ordinary multiple-instance data consists of many independent bags, and each bag is composed of several instances. The statuses of bags and instances are binary. Moreover, the statuses of instances are not observed, whereas the statuses of bags are observed. The functions in this package are applicable for analyzing multiple-instance data, simulating data via logistic regression, and selecting important covariates in the regression model. To this end, maximum likelihood estimation with an expectation-maximization algorithm is implemented for model estimation, and a lasso penalty added to the likelihood function is applied for variable selection. Additionally, an `milr` object is applicable to generic functions `fitted`, `predict` and `summary`. Simulated data and a real example are given to demonstrate the features of this package. date: "`r Sys.Date()`" bibliography: milr.bib output: rmarkdown::html_vignette: css: vignette.css number_sections: yes toc: yes vignette: > %\VignetteIndexEntry{milr\: Multiple-Instance Logistic Regression with Lasso Penalty} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8} --- ```{r setup, include=FALSE} knitr::opts_chunk$set(cache = 2, cache.lazy = FALSE, tidy = FALSE, warning = FALSE) ``` # Introduction Multiple-instance learning (MIL) is used to model the class labels which are associated with bags of observations instead of the individual observations. This technique has been widely used in solving many different real-world problems. In the early stage of the MIL application, @dietterich1997solving studied the drug-activity prediction problem. A molecule is classified as a good drug if it is able to bind strongly to a binding site on the target molecule. The problem is: one molecule can adopt multiple shapes called the conformations and only one or a few conformations can bind the target molecule well. They described a molecule by a bag of its many possible conformations whose binding strength remains unknown. An important application of MIL is the image and text categorization, such as in @maron1998multiple, @andrews2002support, @zhang2007local, @zhou2009multi, @li2011text, @Kotzias2015, to name a few. An image (bag) possessing at least one particular pattern (instance) is categorized into one class; otherwise, it is categorized into another class. 
For example, @maron1998multiple treated natural scene images as bags, and a bag is categorized as a waterfall scene if at least one of its subimages shows a waterfall. Similarly, @zhou2009multi studied the categorization of collections (bags) of posts (instances) from different newsgroup corpora. A collection is a positive bag if 3% of its posts come from a target corpus category; the remaining 97% of its posts, as well as all posts in the negative bags, belong to the other corpus categories. MIL is also used in medical research. The UCSB breast cancer study (@Kandemir2014image) is one such case. Patients (bags) were diagnosed as having or not having cancer by doctors; however, the computer initially had no knowledge of which patterns (instances) were associated with the disease. Furthermore, in manufacturing processes (@milr_paper), a product (bag) is defective as long as one or more of its components (instances) are defective. In practice, at the initial stage, we only know that a product is defective, and we have no idea which component is responsible for the defect.

Several approaches have been offered to analyze datasets with multiple instances, e.g., @maron1998learning, @ray2005supervised, @xu2004logistic and @zhang2001dd. From our point of view, the statuses of these components are missing variables, and thus, the Expectation-Maximization (EM) algorithm (@dempster1977maximum) can play a role in multiple-instance learning. To date, the toolboxes or libraries available for implementing MIL methods have been developed in other software environments. For example, @MILL2008 and @MIL2016 are implemented in MATLAB, but neither of them provides methods based on the logistic regression model. @settles.nips08 provided Java code including the method introduced in @ray2005supervised. Thus, for R users, this is the first package for MIL based on logistic regression modelling, which is called multiple-instance logistic regression (MILR). In this package, we first apply the logistic regression defined in @ray2005supervised and @xu2004logistic, and then we use the EM algorithm to obtain maximum likelihood estimates of the regression coefficients. In addition, the popular lasso penalty (@tibshirani1996regression) is applied to the likelihood function so that parameter estimation and variable selection can be performed simultaneously. This feature is especially desirable when the number of covariates is relatively large.

To fix ideas, we first define the notation and introduce the construction of the likelihood function. Suppose that the dataset consists of \(n\) bags and that there are \(m_i\) instances in the \(i\)th bag for \(i=1,\dots, n\). Let \(Z_i\) denote the status of the \(i\)th bag, and let \(Y_{ij}\) be the status of the \(j\)th instance in the \(i\)th bag along with \(x_{ij} \in \Re^p\) as the corresponding covariates. We assume that the \(Y_{ij}\) follow independent Bernoulli distributions with defect rates \(p_{ij}\), where \(p_{ij}=g\left(\beta_0+x_{ij}^T\beta\right)\) and \(g(x) = 1/\left(1+e^{-x}\right)\). We also assume that the \(Z_i\) follow independent Bernoulli distributions with defect rates \(\pi_i\). Therefore, the bag-level likelihood function is

\begin{equation}\label{eq:L}
L\left(\beta_0,\beta\right)=\prod_{i=1}^n\pi_i^{z_i}\left(1-\pi_i\right)^{1-z_i}.
\end{equation}

To associate the bag-level defect rate \(\pi_i\) with the instance-level defect rates \(p_{ij}\), several methods have been proposed.
The bag-level status is defined as \(Z_i=I\left(\sum_{j=1}^{m_i}Y_{ij}>0\right)\). If the independence assumption among the \(Y_{ij}\) holds, the bag-level defect rate is \(\pi_i=1-\prod_{j=1}^{m_i}(1-p_{ij})\). On the other hand, if the independence assumption may not hold, @xu2004logistic and @ray2005supervised proposed the softmax function to associate \(\pi_i\) with the \(p_{ij}\), as follows:

\begin{equation}\label{eq:softmax}
s_i\left(\alpha\right)=\sum_{j=1}^{m_i}p_{ij}\exp{\left\{\alpha p_{ij}\right\}} \Big/ \sum_{j=1}^{m_i}\exp{\left\{\alpha p_{ij}\right\}},
\end{equation}

where \(\alpha\) is a pre-specified nonnegative value. @xu2004logistic used \(\alpha=0\), thereby modeling \(\pi_i\) by taking the average of the \(p_{ij}\), \(j=1,\ldots,m_i\), whereas @ray2005supervised suggested \(\alpha=3\). Either the \(\pi_i\) function or the \(s_i(\alpha)\) function can be combined with the likelihood (\ref{eq:L}) to estimate the regression coefficients. Below, we begin by establishing the E-steps and M-steps required for the EM algorithm and then add the lasso penalty for estimation and feature selection. Several of the computational strategies applied here are the same as those addressed in @friedman2010regularization. Finally, we demonstrate the functions provided in the **milr** package via simulations and on a real dataset.

# The multiple-instance logistic regression

## EM algorithm

If the instance-level statuses, \(y_{ij}\), were observable, the complete data likelihood would be
\[\prod_{i=1}^n\prod_{j=1}^{m_i}p_{ij}^{y_{ij}}q_{ij}^{1-y_{ij}}~,\]
where \(q_{ij}=1-p_{ij}\), and an ordinary approach, such as the Newton method, could be used to obtain the maximum likelihood estimate (MLE). However, for multiple-instance data, we can only observe the statuses of the bags, \(Z_i=I\left(\sum_{j=1}^{m_i}Y_{ij}>0\right)\), and not the statuses of the instances \(Y_{ij}\). As a result, we apply the EM algorithm to obtain the MLEs of the parameters by treating the instance-level labels as the missing data.

In the E-step, the two conditional distributions of the missing data given the bag-level statuses \(Z_i\) are
\[Pr\left(Y_{i1}=0,\ldots,Y_{im_i}=0\mid Z_i=0\right)=1\]
and
\[
Pr\left(Y_{ij}=y_{ij}, \quad j=1,\dots, m_i \mid Z_i=1\right) =
\frac{ \prod_{j=1}^{m_i}p_{ij}^{y_{ij}}q_{ij}^{1-y_{ij}}\times I\left(\sum_{j=1}^{m_i}y_{ij}>0\right) }{1-\prod_{l=1}^{m_i}q_{il}}.
\]
Thus, the conditional expectations are
\begin{equation*}
E\left(Y_{ij}\mid Z_i=0\right)=0 \quad \mbox{ and } \quad
E\left(Y_{ij}\mid Z_i=1\right)=\frac{p_{ij}}{1-\prod_{l=1}^{m_i}q_{il}}\equiv\gamma_{ij}.
\end{equation*}

The \(Q\) function at step \(t\) is \(Q\left(\beta_0,\beta\mid\beta_0^t,\beta^t\right) = \sum_{i=1}^nQ_i\left(\beta_0,\beta\mid\beta_0^t,\beta^t\right)\), where \(Q_i\) is the conditional expectation of the complete log-likelihood for the \(i\)th bag given \(Z_i\), which is defined as
\begin{align*}
Q_i\left(\beta_0,\beta\mid\beta_0^t,\beta^t\right)
& = E\left(\sum_{j=1}^{m_i}y_{ij}\log{\left(p_{ij}\right)}+\left(1-y_{ij}\right)\log{\left(q_{ij}\right)} ~\Bigg|~ Z_i=z_i,\beta_0^t,\beta^t\right) \\
& = \sum_{j=1}^{m_i}\left[z_i\gamma_{ij}^t\left(\beta_0+x_{ij}^T\beta\right)-\log{\left(1+e^{\beta_0+x_{ij}^T\beta}\right)}\right].
\end{align*}

Note that all the \(p_{ij}\), \(q_{ij}\), and \(\gamma_{ij}\) are functions of \(\beta_0\) and \(\beta\); substituting \(\beta_0\) and \(\beta\) by their current estimates \(\beta_0^t\) and \(\beta^t\) yields \(p_{ij}^t\), \(q_{ij}^t\), and \(\gamma_{ij}^t\), respectively.

In the M-step, we maximize this \(Q\) function with respect to \(\left(\beta_0, \beta\right)\). Since the maximization of the nonlinear \(Q\) function is computationally expensive, following @friedman2010regularization, a quadratic approximation to \(Q\) is applied. Taking the second-order Taylor expansion about \(\beta_0^t\) and \(\beta^t\), we have \(Q\left(\beta_0,\beta\mid\beta_0^t,\beta^t\right) =Q_Q\left(\beta_0,\beta\mid \beta_0^t,\beta^t\right) + C + R_2\left(\beta_0,\beta\mid\beta_0^t,\beta^t\right)\), where \(C\) is a constant in terms of \(\beta_0\) and \(\beta\), \(R_2\left(\beta_0,\beta\mid\beta_0^t,\beta^t\right)\) is the remainder term of the expansion and
\[
Q_Q\left(\beta_0,\beta\mid \beta_0^t,\beta^t\right) =
-\frac{1}{2}\sum_{i=1}^n\sum_{j=1}^{m_i}w_{ij}^t\left[u_{ij}^t-\beta_0-x_{ij}^T\beta\right]^2,
\]
where \(u_{ij}^t=\beta_0^t+x_{ij}^T\beta^t+\left(z_i\gamma^t_{ij}-p_{ij}^t\right)\Big/\left(p_{ij}^tq_{ij}^t\right)\) and \(w_{ij}^t=p_{ij}^tq_{ij}^t\). In the **milr** package, instead of maximizing \(Q\left(\beta_0,\beta\mid\beta_0^t,\beta^t\right)\), we maximize its quadratic approximation, \(Q_Q\left(\beta_0,\beta\mid\beta_0^t,\beta^t\right)\). Since the objective function is quadratic, the roots of \(\partial Q_Q / \partial \beta_0\) and \(\partial Q_Q / \partial \beta\) have closed-form representations.

## Variable selection with lasso penalty

We adopt the lasso method (@tibshirani1996regression) to identify active features in this MILR framework. The key is to add the \(L_1\) penalty to the objective function in the M-step so that the EM algorithm is capable of performing estimation and variable selection simultaneously. To this end, we rewrite the objective function as

\begin{equation}\label{eq:lasso}
\underset{\beta_0,\beta}{\min}\left\{-Q_Q\left(\beta_0,\beta\mid \beta_0^t,\beta^t\right)+\lambda\sum_{k=1}^p\left|\beta_k\right|\right\}.
\end{equation}

Note that the intercept term \(\beta_0\) is always kept in the model; thus, we do not place a penalty on \(\beta_0\). In addition, \(\lambda\) is the tuning parameter, and we will introduce how to determine this parameter later. We apply the shooting algorithm (@fu1998penalized; @milr_paper) to update \(\left(\beta^t_0,\beta^t\right)\).

# Implementation

The **milr** package contains a data generator, `DGP`, which is used to generate multiple-instance data for the simulation studies, and two estimation approaches, `milr` and `softmax`, which are the main tools for modeling multiple-instance data. In this section, we introduce the usage and default setups of these functions.

## Data generator

The function `DGP` is the generator of multiple-instance-type data under the MILR framework. To use the `DGP` function, the user needs to specify an integer `n` as the number of bags, a vector `m` of length \(n\) giving the number of instances in each bag, and a vector `beta` of length \(p\) containing the desired regression coefficients \(\beta\), as in `DGP(n, m, beta)`. Note that one can also set `m` as a single integer to generate data with an equal instance size `m` for each bag. Thus, the total number of observations is \(N=\sum_{i=1}^n m_i\).
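For instance, a minimal call looks like the following sketch; the numbers of bags and instances and the coefficient values here are chosen purely for illustration.

```{r, eval=FALSE}
library(milr)
set.seed(1)
# 10 bags, 3 instances per bag, and three illustrative coefficients
toyData <- DGP(n = 10, m = 3, beta = c(-1, 1, 0.5))
# the returned list contains the bag indices (ID), the covariate matrix (X)
# and the bag-level statuses (Z), as used in the examples below
str(toyData)
```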
The `DGP` function simulates the labels of the bags through the following steps:

1. Generate \(p\) mutually independent covariates of length \(N\) from the standard normal distribution as an \(N\times p\) matrix, \(X\).
2. Generate the binary response, \(Y_{ij}\), for the \(j\)th instance of the \(i\)th bag from the Bernoulli distribution with
   \(p_{ij}=1\big/\left(1+\exp{\left\{-x_{ij}^T\beta\right\}}\right)\),
   where \(x_{ij}\) is the \(p\)-component vector in the row of \(X\) representing the \(j\)th instance of the \(i\)th bag.
3. Calculate the observed response for the \(i\)th bag by \(Z_i=I\left(\sum_{j=1}^{m_i}Y_{ij}>0\right)\).
4. Return the indices of the bags, the covariate matrix \(X\), and the bag-level statuses \(Z\).

## The milr and softmax approaches

In the **milr** package, we provide two approaches to model multiple-instance data: the proposed `milr` approach (@milr_paper) and the `softmax` approach (@xu2004logistic). To implement these two approaches, we assume that the numbers of observations and covariates are \(N\) and \(p\), respectively. The input data for both `milr` and `softmax` are separated into three parts: the bag-level statuses, `y`, as a vector of length \(N\); the \(N\times p\) design matrix, `x`; and `bag`, a vector of length \(N\) giving the index of the bag to which each instance belongs.

```{r, eval=FALSE}
milr(y, x, bag, lambda, numLambda, lambdaCriterion, nfold, maxit)
softmax(y, x, bag, alpha, ...)
```

For the `milr` function, specifying `lambda` in different ways controls whether and how the lasso penalty participates in parameter estimation. The default value of `lambda` is \(0\); with this value, the ordinary MLE is applied, i.e., no penalty term is considered. This is the suggested choice when the number of covariates \(p\) is small. When \(p\) is large or when variable selection is desired, users can specify a \(\lambda\) vector of length \(\kappa\); otherwise, by letting `lambda = -1`, the program automatically provides a \(\lambda\) vector of length \(\kappa=\)`numLambda` as the tuning set. Following @friedman2010regularization, the theoretical maximal value of \(\lambda\) in (\ref{eq:lasso}) is

\begin{equation*}\label{eq:lammax}
\lambda_{max}=\left[\prod_{i=1}^n\left(m_i-1\right)\right]^{\frac{1}{2}}\left[\prod_{i=1}^nm_i^{1-2z_i}\right]^{\frac{1}{2}}.
\end{equation*}

The automatically specified sequence of \(\lambda\) values ranges from \(\lambda_{min}=\lambda_{max}/1000\) to \(\lambda_{max}\) in ascending order. The default criterion for choosing the optimal \(\lambda\) among these values is the Bayesian information criterion (BIC), \(-2\log{(likelihood)} + p^*\times\log{(n)}\), where \(p^*\) is the number of nonzero regression coefficients. Alternatively, the user can set `lambdaCriterion = "deviance"` and `nfold = K` with an integer `K` to obtain the best \(\lambda\) that minimizes the predictive deviance through 'bag-wise' \(K\)-fold cross validation. The last option, `maxit`, indicates the maximal number of iterations of the EM algorithm; its default value is 500.

For the `softmax` function, the option `alpha` is a nonnegative real number giving the \(\alpha\) value in (\ref{eq:softmax}). The maximum likelihood estimators of the regression coefficients are obtained by the generic function `optim`. Note that no variable selection approach is implemented for this method.
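To summarize these calling conventions, the following sketch (not run) shows typical calls; `y`, `x` and `bag` are placeholders for the inputs described above, and the specific numbers of candidates and folds are illustrative only.

```{r, eval=FALSE}
# ordinary MLE without the lasso penalty (suitable when p is small)
fit_mle <- milr(y, x, bag, lambda = 0)
# user-specified tuning set; the optimal lambda is chosen by BIC by default
fit_bic <- milr(y, x, bag, lambda = exp(seq(log(0.01), log(20), length = 20)))
# automatically generated tuning set of numLambda values, with the optimal
# lambda chosen by bag-wise 5-fold cross validation on the predictive deviance
fit_cv <- milr(y, x, bag, lambda = -1, numLambda = 20,
               lambdaCriterion = "deviance", nfold = 5)
# softmax approach with the s(3) setting
fit_s3 <- softmax(y, x, bag, alpha = 3)
```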
Two generic accessory functions, `coef` and `fitted`, can be used to extract the regression coefficients and the fitted bag-level labels returned by `milr` and `softmax`. We also provide a significance test, based on Wald's test, for the `milr` estimations without the lasso penalty through the `summary` function. In addition, to predict the bag-level statuses for a new data set, the `predict` function can be used by assigning three items: `object` is the fitted model obtained by `milr` or `softmax`, `newdata` is the covariate matrix, and `bag_newdata` contains the bag indices of the new dataset. Finally, the MIL model can be used to predict both the bag-level labels and the instance-level labels. The option `type` in the `fitted` and `predict` functions controls the type of output labels. The default option, `type = "bag"`, produces the bag-level predictions; setting `type = "instance"` produces the instance-level labels instead.

```{r, eval=FALSE}
fitted(object, type)
predict(object, newdata, bag_newdata, type)
```

# Examples

We illustrate the usage of the **milr** package via simulated and real examples.

## Estimation and variable selection

We demonstrate how to apply the `milr` function for model estimation and variable selection. We simulate data with \(n=30\) bags, each containing \(m=3\) instances, using the true coefficient vector `c(-2, -1, 2, 0.5)` shown in the code below.

```{r DGP1}
library(milr)
library(pipeR)
set.seed(99)
# set the size of dataset
numOfBag <- 30
numOfInstsInBag <- 3
# set true coefficients
trueCoefs <- c(-2, -1, 2, 0.5)
trainData <- DGP(numOfBag, numOfInstsInBag, trueCoefs)
colnames(trainData$X) <- paste0("X", 1:ncol(trainData$X))
# the bag-level responses of the training data
(bagResponse <- as.numeric(with(trainData, tapply(Z, ID, any))))
```

Since the number of covariates is small, we use the `milr` function to estimate the model parameters without the lasso penalty; in the code below a negligibly small `lambda` is used in place of exactly zero. One can apply `summary` to produce results including the estimates of the regression coefficients and their corresponding standard errors, test statistics and p-values under Wald's test. The regression coefficients are returned by the function `coef`.

```{r EST2}
# fit milr model
milrFit_EST <- milr(trainData$Z, trainData$X, trainData$ID, lambda = 1e-7)
# call the Wald test result
summary(milrFit_EST)
# call the regression coefficients
coef(milrFit_EST)
```

The generic function `table` builds a contingency table comparing the true bag-level statuses with the fitted bag-level statuses (obtained with the option `type = "bag"`), and the `predict` function is used to predict the labels of each bag from the corresponding covariates \(X\). The fitted and predicted instance-level statuses can also be obtained by setting `type = "instance"` in the `fitted` and `predict` functions.
```{r EST}
fitted(milrFit_EST, type = "bag")
# fitted(milrFit_EST, type = "instance") # instance-level fitted labels
table(DATA = bagResponse, FITTED = fitted(milrFit_EST, type = "bag"))
# predict for testing data
testData <- DGP(numOfBag, numOfInstsInBag, trueCoefs)
colnames(testData$X) <- paste0("X", 1:ncol(testData$X))
# the bag-level responses of the testing data
(bagResponseTest <- as.numeric(with(testData, tapply(Z, ID, any))))
pred_EST <- with(testData, predict(milrFit_EST, X, ID, type = "bag"))
# predict(milrFit_EST, testData$X, testData$ID,
#         type = "instance") # instance-level prediction
table(DATA = bagResponseTest, PRED = pred_EST)
```

Next, the \(n < p\) cases are considered. We generate a data set with \(n=30\) bags, each with 3 instances, and a much larger number of covariates; in the coefficient vector `trueCoefs_Lp` below, only the leading coefficients are nonzero and the remaining 45 coefficients are zero. First, we manually specify a tuning set of 20 \(\lambda\) values ranging from 0.01 to 20. The `milr` function then chooses the tuning parameter that yields the smallest BIC. For this dataset, the chosen model is a constant model.

```{r VS, message=FALSE}
set.seed(99)
# Set the new coefficient vector (large p)
trueCoefs_Lp <- c(-2, -2, -1, 1, 2, 0.5, rep(0, 45))
# Generate the new training data with large p
trainData_Lp <- DGP(numOfBag, numOfInstsInBag, trueCoefs_Lp)
colnames(trainData_Lp$X) <- paste0("X", 1:ncol(trainData_Lp$X))
# variable selection by user-defined tuning set
lambdaSet <- exp(seq(log(0.01), log(20), length = 20))
milrFit_VS <- with(trainData_Lp, milr(Z, X, ID, lambda = lambdaSet))
# grep the active factors and their corresponding coefficients
coef(milrFit_VS) %>>% `[`(abs(.) > 0)
```

Second, we try the auto-tuning feature implemented in `milr` by assigning `lambda = -1`. The total number of tuning \(\lambda\) values is indicated by setting `numLambda`. The following example shows the result of the best model chosen among 5 \(\lambda\) values. The component `$lambda` shows the auto-tuned \(\lambda\) candidates and the component `$BIC` returns the corresponding value of BIC for every candidate \(\lambda\) value. Again, the chosen model is a constant model.

```{r AUTOVS,message=FALSE}
# variable selection using auto-tuning
milrFit_auto_VS <- milr(trainData_Lp$Z, trainData_Lp$X, trainData_Lp$ID,
                        lambda = -1, numLambda = 5)
# the auto-selected lambda values
milrFit_auto_VS$lambda
# the values of BIC under each lambda value
milrFit_auto_VS$BIC
# grep the active factors and their corresponding coefficients
coef(milrFit_auto_VS) %>>% `[`(abs(.) > 0)
```

Instead of using BIC, a better way to choose the proper \(\lambda\) is to use cross validation by setting `lambdaCriterion = "deviance"`. The following example shows the best model chosen by minimizing the predictive deviance via 'bag-wise' 3-fold cross validation. The predictive deviance for every candidate \(\lambda\) can be found in the component `$cv`. Twenty-nine covariates were identified, including the first four true active covariates, \(X_1,\ldots,X_4\).

```{r CV,message=FALSE}
# variable selection using auto-tuning with cross validation
milrFit_auto_CV <- milr(trainData_Lp$Z, trainData_Lp$X, trainData_Lp$ID,
                        lambda = -1, numLambda = 5,
                        lambdaCriterion = "deviance", nfold = 3)
# the values of predictive deviance under each lambda value
milrFit_auto_CV$cv
# grep the active factors and their corresponding coefficients
coef(milrFit_auto_CV) %>>% `[`(abs(.) > 0)
```

According to another simulation study, not shown in this paper, BIC does not perform as well as cross validation for variable selection in multiple-instance logistic regression. However, it can be an alternative when performing cross validation is too time consuming.

## Real case study

Hereafter, we denote the proposed method with the lasso penalty by MILR-LASSO for brevity. In the following, we demonstrate the usage of MILR-LASSO and the `softmax` approach on a real dataset, called MUSK1. The MUSK1 data set consists of 92 molecules (bags), of which 47 are classified as having a musky smell and 45 are classified as non-musk. A molecule is musky if at least one of its conformers (instances) is responsible for the musky smell; however, which conformers are responsible for the musky smell is unknown. There are 166 features that describe the shape, or conformation, of the molecules. The goal is to predict whether a new molecule is musk or non-musk. This dataset is one of the popular benchmark datasets in the field of multiple-instance learning, and one can download it from the following web link.

```{r DLMUSK1}
dataName <- "MIL-Data-2002-Musk-Corel-Trec9.tgz"
dataUrl <- "http://www.cs.columbia.edu/~andrews/mil/data/"
```

The following code uses the `untar` function to decompress the downloaded *.tgz* file and extract the `MUSK1` dataset. Then, with the subsequent data preprocessing, we reassemble the `MUSK1` dataset in a `"data.frame"` format. The columns `bag` and `label` of the processed `MUSK1` dataset contain the bag indices and the bag-level labels of each observation, and the remaining \(p=166\) columns are the covariates.

```{r READMUSK1}
filePath <- file.path(getwd(), dataName)
# Download MIL data sets from the url (not run)
# if (!file.exists(filePath))
#   download.file(paste0(dataUrl, dataName), filePath)
# Extract MUSK1 data file (not run)
# if (!dir.exists("MilData"))
#   untar(filePath, files = "musk1norm.svm")
# Read and Preprocess MUSK1
library(data.table)
MUSK1 <- fread("musk1norm.svm", header = FALSE) %>>%
  `[`(j = lapply(.SD, function(x) gsub("\\d+:(.*)", "\\1", x))) %>>%
  `[`(j = c("bag", "label") := tstrsplit(V1, ":")) %>>%
  `[`(j = V1 := NULL) %>>%
  `[`(j = lapply(.SD, as.numeric)) %>>%
  `[`(j = `:=`(bag = bag + 1, label = (label + 1)/2)) %>>%
  setnames(paste0("V", 2:(ncol(.)-1)), paste0("V", 1:(ncol(.)-2))) %>>%
  `[`(j = paste0("V", 1:(ncol(.)-2)) := lapply(.SD, scale),
      .SDcols = paste0("V", 1:(ncol(.)-2)))
X <- paste0("V", 1:(ncol(MUSK1) - 2), collapse = "+") %>>%
  (paste("~", .)) %>>% as.formula %>>% model.matrix(MUSK1) %>>% `[`( , -1L)
Y <- as.numeric(with(MUSK1, tapply(label, bag, function(x) sum(x) > 0)))
```

To fit an MIL model without variable selection, the **milr** package provides two functions. The first is the `milr` function with `lambda = 0` (again, a negligibly small `lambda` is used in the code below). The second is the `softmax` function with a specific value of `alpha`. Here, we apply the approaches introduced in @xu2004logistic and @ray2005supervised, called the \(s(0)\) (`alpha=0`) and \(s(3)\) (`alpha=3`) methods, respectively. The optimization in `softmax` uses the default settings of the generic function `optim`, that is, the *Nelder-Mead* method.
```{r MIFIT,message=FALSE,results="hide"}
# softmax with alpha = 0
softmaxFit_0 <- softmax(MUSK1$label, X, MUSK1$bag, alpha = 0,
                        control = list(maxit = 5000))
# softmax with alpha = 3
softmaxFit_3 <- softmax(MUSK1$label, X, MUSK1$bag, alpha = 3,
                        control = list(maxit = 5000))
# use a very small lambda so that milr performs the estimation
# without evaluating the Hessian matrix
milrFit <- milr(MUSK1$label, X, MUSK1$bag, lambda = 1e-7, maxit = 5000)
```

For variable selection, we apply the MILR-LASSO approach. First, the tuning parameter set is chosen automatically by setting \(\lambda = -1\), and the best \(\lambda\) value is obtained by minimizing the predictive deviance with 3-fold cross validation among `numLambda = 20` candidates.

```{r MILRVS, cache=TRUE,cache.lazy=FALSE,message=FALSE,warning=FALSE,tidy=FALSE}
# MILR-LASSO
milrSV <- milr(MUSK1$label, X, MUSK1$bag, lambda = -1, numLambda = 20,
               nfold = 3, lambdaCriterion = "deviance", maxit = 5000)
# show the detected active covariates
sv_ind <- names(which(coef(milrSV)[-1L] != 0)) %>>%
  (~ print(.)) %>>% match(colnames(X))
# use a very small lambda so that milr performs the estimation
# without evaluating the Hessian matrix
milrREFit <- milr(MUSK1$label, X[ , sv_ind], MUSK1$bag,
                  lambda = 1e-7, maxit = 5000)
# Confusion matrix of the fitted model
table(DATA = Y, FIT_MILR = fitted(milrREFit, type = "bag"))
```

We use 3-fold cross validation to compare the prediction performance of four MIL models: \(s(0)\), \(s(3)\), the MILR model with all covariates, and the MILR model fitted with the covariates selected by MILR-LASSO. Their prediction results are summarized by the confusion matrices below.

```{r MUSK1PRED2,message=FALSE}
set.seed(99)
predY <- matrix(0, length(Y), 4L) %>>% `colnames<-`(c("s0","s3","milr","milr_sv"))
folds <- 3
foldBag <- rep(1:folds, floor(length(Y) / folds) + 1, length = length(Y)) %>>%
  sample(length(.))
foldIns <- rep(foldBag, table(MUSK1$bag))
for (i in 1:folds) {
  # prepare training and testing sets
  ind <- which(foldIns == i)
  # train models
  fit_s0 <- softmax(MUSK1[-ind, ]$label, X[-ind, ], MUSK1[-ind, ]$bag,
                    alpha = 0, control = list(maxit = 5000))
  fit_s3 <- softmax(MUSK1[-ind, ]$label, X[-ind, ], MUSK1[-ind, ]$bag,
                    alpha = 3, control = list(maxit = 5000))
  # milr, use a very small lambda so that milr performs the estimation
  # without evaluating the Hessian matrix
  fit_milr <- milr(MUSK1[-ind, ]$label, X[-ind, ], MUSK1[-ind, ]$bag,
                   lambda = 1e-7, maxit = 5000)
  fit_milr_sv <- milr(MUSK1[-ind, ]$label, X[-ind, sv_ind], MUSK1[-ind, ]$bag,
                      lambda = 1e-7, maxit = 5000)
  # store the predicted labels
  ind2 <- which(foldBag == i)
  # the predict function returns bag-level responses by default
  predY[ind2, 1L] <- predict(fit_s0, X[ind, ], MUSK1[ind, ]$bag)
  predY[ind2, 2L] <- predict(fit_s3, X[ind, ], MUSK1[ind, ]$bag)
  predY[ind2, 3L] <- predict(fit_milr, X[ind, ], MUSK1[ind, ]$bag)
  predY[ind2, 4L] <- predict(fit_milr_sv, X[ind, sv_ind], MUSK1[ind, ]$bag)
}
table(DATA = Y, PRED_s0 = predY[ , 1L])
table(DATA = Y, PRED_s3 = predY[ , 2L])
table(DATA = Y, PRED_MILR = predY[ , 3L])
table(DATA = Y, PRED_MILR_SV = predY[ , 4L])
```

# Summary

This vignette introduces the usage of the R package **milr** for analyzing multiple-instance data under the framework of logistic regression. In particular, the package contains two approaches: summarizing the mean responses within each bag using the softmax function (@xu2004logistic; @ray2005supervised), and treating the instance-level statuses as hidden information and applying the EM algorithm for estimation (@milr_paper).
In addition, to estimate the MILR model, a lasso-type variable selection technique is incorporated into the latter approach.

The limitations of the developed approaches are as follows. First, we ignore the potential dependency among the instance statuses within a bag; random effects could be incorporated into the proposed logistic regression to account for this dependency. Second, according to our preliminary simulation study, not shown in this paper, the maximum likelihood estimator might be biased when the number of instances in a bag is large, say, \(m_i=100\) or more. Bias reduction methods, such as those of @firth1993bias and @quenouille1956notes, can be applied to alleviate this bias. These extensions are deferred to future work.

# References