--- title: "A Tutorial for `dsdp`" thanks: | This research was supported in part with Grant-in-Aid for Scientific Research(B) JP18H03206, JP21H03398 and Grant-in-Aid for Exploratory Research JP20K21792 from the Japan Society for the Promotion of Sciences. author: - Satoshi Kakihara^[National Graduate Institute for Policy Studies, skakihara@gmail.com] - Takashi Tsuchiya^[National Graduate Institute for Policy Studies, tsuchiya@grips.ac.jp] date: November 11th, 2022 output: pdf_document: toc: false number_sections: true # citation_package: biblatex keep_tex: true html_document: number_sections: true linkcolor: red citecolor: green urlcolor: blue fontsize: 11pt # bibliography: "`r system.file("dsdp.bib", package="dsdp")`" # bibliography: ["dsdp.bib"] # biblio-style: alphabetic # link-citations: yes keep_tex: yes keywoards: - dsdp abstract: | This vignette is a tutorial for a R package `dsdp`, a probability density estimation package using a maximum likelihood method. A model of interest in this package is a family of exponential distributions as base functions, with polynomial correction terms. To find an optimal model, we adopt a grid search for parameters of base functions and degrees of polynomials, together with semidefinite programming for coefficients of polynomials, and then model selection is done by Akaike Information Criterion. We first give a quick overview of the package, and then move on to a tutorial. vignette: > %\VignetteIndexEntry{Tutorial} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8} --- # Overview The main task of the package `dsdp` is to estimate probability density functions from a data set using a maximum likelihood method. The models of density functions in use are familiar Gaussian or exponential distributions with polynomial correction terms. We call Gaussian distribution with a polynomial **Gaussian-based model** and an exponential distribution with a polynomial **Exponential-based model**, respectively. `dsdp` seeks parameters of Gaussian or exponential distributions together with degrees of polynomials using a grid search, and coefficients of polynomials using a variant of semidefinite programming(SDP) problems. Detailed discussions of SDP problem formulations and this type of SDP problems are found in other vignettes. The outline of estimation procedure is as follows. 1. Create Gaussian-based or Exponential-based model from a data set. 2. Explore a data set by checking the statistics and the histogram 3. Provide a set of parameters of Gaussian or exponential distributions and degrees of polynomials. 4. Estimate the coefficients of the polynomials for a set of parameters and then check the results by comparing Akaike Information Criterion(AIC) and plotting density functions. 5. Refine the parameters and repeat 3-4 until a sufficient estimate is obtained. We will see each process step by step in the next section. Before we move on, please install and import the package if you haven't yet. Installation is done by ```{r eval=FALSE} ## Install from CRAN install.packages("dsdp") ``` Importing the package is done by ```{r eval=TRUE, results="hide", message=FALSE} ## Import dsdp library(dsdp) ``` This package requires `ggplot2` for displaying histograms and density functions. `ggplot2` is a part of [`tidyverse`](https://www.tidyverse.org), a de facto standard for data wrangling in R. Our `plot` method returns `ggplot2` objects, so if you plan to add the title or change the labels in the graphs, it is better to import `ggplot2` too. ```{r eval=TRUE} ## Import ggplot2 if necessary. library(ggplot2) ``` # A Tutorial In this section, we will see estimation procedures in Gaussian-based model and Exponential-based model in details. Essentially, they are same in computations, yet there are subtle differences in practice. ## Gaussian-based Model The density function of Gaussian-based model is $$ p(x; \boldsymbol{\alpha}) \cdot N(x; \mu, \sigma^2), $$ where $p(x; \boldsymbol{\alpha})$ is a polynomial with a coefficient vector $\boldsymbol{\alpha}$, and $N(x; \mu, \sigma^2)$ is Gaussian distribution with mean $\mu$ and variance $\sigma^2$: $$ N(x;\mu, \sigma^2) := \frac{1}{\sigma \sqrt{2\pi}} \exp\left(-\frac{(x-\mu)^2}{2\sigma^2}\right). $$ The aim of estimation is to find a good set of parameters: $\boldsymbol{\alpha}$, $\mu$, and $\sigma$. To this end, we first provide a coarse set of parameters of base functions, namely, $\mu$ and $\sigma$, along with degrees of polynomials, and then compute the coefficients of polynomials $\boldsymbol{\alpha}$, to get a rough idea of the model. And subsequently, refine the set of parameters and repeat above process until sufficient estimate is obtained. ### A Creation of a model We first create Gaussian-based model from a data set. The name of R's S3 class for Gaussian-based model is `gaussmodel`, and we refer to the method of `gaussmodel` as *method.gaussmodel*, for example, `summmary.gaussmodel`, `plot.gaussmodel`, `estimate.gaussmodel`, `func.gaussmodel`. There are two scenarios for model creations. One is to create a model from only a data set, and the other is to create a model from a data set and its corresponding frequency data. Let's see model creations in examples. In the first case, we use a data set `mix2gauss$n200`, which contains 200 realizations of bimodal mixed Gaussian distributions, to create R's S3 class `gaussmodel` object `gm1`. ```{r eval=TRUE} ## Create gaussmodel object from a data set mix2gauss$n200 gm1 <- gaussmodel(data=mix2gauss$n200) ``` The object `gm1`, an instance of a S3 class `gaussmodel`, contains the data and parameters to be estimated. Similarly, in the second case, we use `mix2gaussHist$n200p` for data points and `mix2gaussHist$n200f` for their corresponding frequencies, to create `gaussmodel` object `gm2`. ```{r eval=FALSE} ## Create gaussmodel object from a data set mix2gaussHist$n200p and ## its frequencies mix2gaussHist$n200f gm2 <- gaussmodel(mix2gaussHist$n200p, mix2gaussHist$n200f) ``` ### Exploring a data set A summary of `gm1` is displayed: ```{r eval=TRUE} ## Display the summary of a data set summary(gm1) ``` As a name suggests, `summary.gaussmodel` shows the basic statistics of a data set. It prints out the quantiles of standardized data as well as original data. Here standardization means rescaling of data so as to have a mean 0 and a standard deviation 1. The histogram of the data is displayed: ```{r, eval=TRUE, out.height="80%", out.width="70%"} ## Draw a histogram of the data set plot(gm1) ``` `plot.gaussmodel` can plot scaled data as well as original data by setting `scaling=TRUE`. ### Providing the set of parameters Before estimation, we need to provide a set of parameters, means, standard deviations, and degrees of polynomials, to compute the coefficients of polynomials. ```{r, eval=TRUE} ## A vector of degrees of polynomials deglist <- c(2, 4, 6) ## A vector of means in Gaussian distributions mulist <- c(-0.5, 0, 0.5) ## A vector of standard deviations in Gaussian distributions sdlist <- c(0.75, 1.0, 1.25) ``` A vector `deglist` indicates degrees of polynomials, in this case `2, 4, 6`. In Gaussian-based model, a positive even integer up to around 20 is okay. Note that large degrees can cause numerical difficulty. `mulist` is a vector of means of Gaussian distribution, and `sdlist` is a vector of standard deviations of Gaussian distribution, so the element of `sdlist` should be positive. Note that we set these data for estimation of scaled data, as we will mention later. ### Estimation Providing these parameter sets, we are now ready to estimate the model. ```{r eval=TRUE, results="hide", message=FALSE} ## Do estimation ## Output messages are suppressed for brevity gm1 <- estimate(gm1, deglist=deglist, mulist=mulist, sdlist=sdlist, scaling=TRUE) ``` The computation of the coefficients of the polynomials is done for all of the combinations of the parameter sets `deglist`, `mulist`, and `sdlist`, 9 cases in this example. By setting `scaling=TRUE`, estimation is done for scaled data, not for original data, as mentioned before. The result is sorted according to [Akaike information criterion(AIC)](https://en.wikipedia.org/wiki/Akaike_information_criterion). AIC is widely used criterion for model selection so as to avoid overfitting by penalizing the number of free parameters. Let's see the result of estimation. ```{r eval=TRUE} ## Show the summary of results up to 5 summary(gm1, nmax=5, estonly=TRUE) ``` (`nmax=5` limits top 5 estimates, and `estonly=TRUE` suppresses the basic statistics.) The columns of `deg`, `mu1` and `sig1` indicate the degree of polynomials, mean, and standard deviation, respectively. And the columns of `mu` and `sig` indicate the scaled mean and standard deviation, respectively. The column of `aic` indicates AIC and that of `accuracy` does the accuracy of the underlying SDP solver, whose value around $1.0^{-7}$ is sufficient for estimation under IEEE 754 double precision. If sufficient accuracy is not achieved because of numerical difficulty, set `recompute=TRUE` and `stepsize=c(0.4, 0.2)`, for example, and try recomputation. ```{r eval=FALSE, results="hide", message=FALSE} ## This is demonstration for recomputation ## Not Executed gm1 <- estimate(gm1, deglist=deglist, mulist=mulist, sdlist=sdlist, scaling=TRUE, recompute=TRUE, stepsize=c(0.4, 0.2)) ``` The flag `stepsize` indicates the vector of step sizes of the underlying SDP solver. The smaller the values are, the better the chances of successful estimation, but the slower the computation is. The default value of `stepsize` is `c(0.5, 0.3)`, which is enough in many cases. We will not discuss the implementation details here, but if the user sets `stepsize` by oneself, the user should set the values smaller than default values and the length of two is enough. The numbers 1,2,3,... in the leftmost column indicate the indices of estimates ordered by AIC. First row is the best estimate, second row is the second best, so on, and these numbers can be used in `plot.gaussmodel` or `func.gaussmodel` to designate the estimates to be plotted or evaluated, respectively. To see the graphs of estimated densities along with the histogram, simply type: ```{r, eval=TRUE, out.height="80%", out.width="70%"} plot(gm1) ``` By default, `plot.gaussmodel` plots up to best 4 estimated density functions. In the legend, the color of graphs are displayed in increasing order of AIC. For example, in this graph, the estimation with the degree 6, mean 0.96151, standard deviation 0.76680 is the best one. By setting `scaling=TRUE`, it can plot a scaled data, like scaled graphs. ```{r, eval=TRUE, out.height="80%", out.width="70%"} plot(gm1, scaling=TRUE) ``` Similarly, in the legend, the color of graphs are displayed in increasing order of AIC. For example, in this graph, the estimation with the degree 6, mean 0.5, standard deviation 0.75 is the best one. ### Refine estimation We continue to estimate further by adding the degree 8 and refining `mulist=seq(0, 0.5, by=0.1)` and `sdlist=seq(0.6, 0.9, by=0.1)`. Here `seq` command generates the vector starting from 0, incrementing by 0.1, and ending with 0.5, in case of `seq(0, 0.5, by=0.1)`. ```{r eval=TRUE, results="hide", message=FALSE} ## Do estimation ## Output messages are suppressed for brevity gm1 <- estimate(gm1, c(4, 6, 8), seq(0, 0.5, by=0.1), seq(0.5, 1, by=0.1), scaling=TRUE) ``` Note that parameters already estimated are skipped in `estimate.gaussmodel` and we omit argument names. The summary of estimation is displayed: ```{r eval=TRUE} ## Show the summary of results up to 5 summary(gm1, nmax=5, estonly=TRUE) ``` The graphs are displayed: ```{r, eval=TRUE, out.height="80%", out.width="70%"} plot(gm1) ``` We continue to do estimation by refining parameters and checking estimates and graphs. ```{r eval=TRUE, results="hide", message=FALSE} ## Do estimation ## Output messages are suppressed for brevity gm1 <- estimate(gm1, c(4, 6, 8), seq(0, 0.2, by=0.05), seq(0.6, 0.8, by=0.05), scaling=TRUE) ``` ```{r eval=TRUE} ## Show the summary of results up to 5 summary(gm1, nmax=5, estonly=TRUE) ``` ```{r, eval=TRUE, out.height="80%", out.width="70%"} plot(gm1) ``` According to above results, we confine the degrees of polynomials only to 4, and set `mulist=seq(0, 0.2, by=0.025)` and `sdlist=seq(0.7, 0.8, by=0.01)`. ```{r eval=TRUE, results="hide", message=FALSE} ## Do estimation ## Output messages are suppressed for brevity gm1 <- estimate(gm1, c(4, 6, 8), seq(0, 0.2, by=0.025), seq(0.7, 0.8, by=0.01), scaling=TRUE) ``` ```{r eval=TRUE} ## Show the summary of results up to 5 summary(gm1, nmax=5, estonly=TRUE) ``` ```{r, eval=TRUE, out.height="80%", out.width="70%"} plot(gm1) ``` We stop doing estimation here. ### Notes Using `plot.gaussmodel`, we can plot cumulative distributions by setting `cum=TRUE`: ```{r, eval=TRUE, out.height="80%", out.width="70%"} plot(gm1, cum=TRUE) ``` For more details, see `?plot.gaussmodel`. `func.gaussmodel` computes the values of density and cumulative distribution of desired estimate. For example, ```{r, eval=TRUE, out.height="80%", out.width="70%"} x <- seq(-4, 4, by=0.1) ## Compute the density of 1st estimate y_pdf <- func(gm1, x, n=1) ## Compute the cumulative distribution of 1st estimate y_cdf <- func(gm1, x, cdf=TRUE, n=1) ``` Of course, we can compute the desired estimate by designating `n=k` for kth estimate shown in `summary.gaussmodel`. ## Exponential-based Model The density function of Exponential-based model is $$ p(x; \boldsymbol{\alpha}) \cdot \mathrm{Exp}(x; \lambda), $$ where $p(x; \boldsymbol{\alpha})$ is a polynomial with a coefficient vector $\boldsymbol{\alpha}$, and $\mathrm{Exp}(x; \lambda)$ is an exponential distribution with rate parameter $\lambda$: $$ \mathrm{Exp}(x;\lambda) := \lambda e^{-\lambda x}, \quad x \in S = [0, \infty). $$ The aim of estimation is to find a good set of parameters: $\boldsymbol{\alpha}$, $\lambda$. To this end, we first provide a coarse set of parameters of base functions, namely, $\lambda$, and a degree of polynomials, and then compute the coefficients of polynomials $\boldsymbol{\alpha}$, to get a rough idea of the model. ### A Creation of a model The creation of Exponential-based model from a data set is same as that of Gaussian-based model. We will show the two scenarios, one is to create a model from only a data set, and the other is to create a model from a data set and its corresponding frequency data, in sequel. In the first case, we use a data set `mixexpgamma$n200`, which contains 200 realizations of mixture of an exponential distribution and a gamma distribution, to create R's S3 class `expmodel` object `em1`. ```{r eval=TRUE} em1 <- expmodel(mixexpgamma$n200) ``` The object `em1` of a S3 class `expmodel` contains the data and parameters to be estimated. Similarly, in the second case, we use `mixExpGammaHist$n800p` for data points and `mixExpGammaHist$n800f` for their corresponding frequencies, to create `expmodel` object `em2`. ```{r eval=FALSE} em2 <- expmodel(mixExpGammaHist$n800p, mixExpGammaHist$n800f) ``` ### Exploring of a data set A summary of `em1` is displayed: ```{r eval=TRUE} ## Display the summary of a data set summary(em1) ``` As a name suggests, `summary.expmodel` shows the basic statistics of a data set. It prints out the quantiles of scaled data as well as original data. Here the scaling is to divide the data by the mean of the data. The histogram of the data is displayed: ```{r, eval=TRUE, out.height="80%", out.width="70%"} ## Draw a histogram of the data set plot(em1) ``` `plot.expmodel` can plot scaled data as well as original data by setting `scaling=TRUE`. ### Providing the set of parameters Before estimation, we need to provide a set of rate parameters, and degrees of polynomials, to compute the coefficients of polynomials. ```{r, eval=TRUE} ## A vector of degrees of polynomials deglist <- c(2, 3, 4) ## A vector of rate parameters of exponential distributions lmdlist <- c(0.5, 1, 2, 4) ``` `deglist` is a vector of degrees of polynomials, in this case `2, 3, 4`. In Exponential-based model, a positive integer up to around 20 is okay. Note that large degrees can cause numerical difficulty. `lmdlist` is a vector of means of exponential distributions, so the element of `lmdlist` should be positive. Also note that the rate parameters to be passed to an `estimate.expmodel` method are applied to internally scaled data, not original data. ### Estimation Providing these parameter sets, we are now ready to estimate the model. ```{r eval=TRUE, results="hide", message=FALSE} ## Do estimation ## Output messages are suppressed for brevity em1 <- estimate(em1, deglist=deglist, lmdlist=lmdlist) ``` The computation of the coefficients of the polynomials is done for all of the combinations of the parameter sets `deglist`, `lmdlist`, 12 cases in this example. The result is sorted according to Akaike information criterion(AIC) Let's see the result of estimation. ```{r eval=TRUE} ## Show the summary of results up to 5 summary(em1, nmax=5, estonly=TRUE) ``` (`nmax=5` limits top 5 estimates, and `estonly=TRUE` suppresses the basic statistics.) Next see the histogram. ```{r, eval=TRUE, out.height="80%", out.width="70%"} plot(em1) ``` ### Refine estimation We continue to estimate further by adding parameters. As we see in `estimate.gaussmodel`, paramters already estimated are skipped. ```{r eval=TRUE, results="hide", message=FALSE} ## Do estimation ## Output messages are suppressed for brevity em1 <- estimate(em1, c(3, 4, 5, 6), c(1, 2, 4, 8)) ``` The summary of the estimation is: ```{r eval=TRUE} ## Show the summary of results up to 5 summary(em1, nmax=5, estonly=TRUE) ``` The graphs are: ```{r, eval=TRUE, out.height="80%", out.width="70%"} plot(em1) ``` We further confine parameters as follows: ```{r eval=TRUE, results="hide", message=FALSE} ## Do estimation ## Output messages are suppressed for brevity em1 <- estimate(em1, c(5, 6), seq(3, 4, by=0.25)) ``` And see the summary of estimation: ```{r eval=TRUE} ## Show the summary of results up to 5 summary(em1, nmax=5, estonly=TRUE) ``` The graphs are as follows: ```{r, eval=TRUE, out.height="80%", out.width="70%"} plot(em1) ``` Theoretically, we can refine estimation as you like, but we stop estimation here. ### Notes Using `plot.expmodel`, we can plot cumulative distributions by setting `cum=TRUE`: ```{r, eval=TRUE, out.height="80%", out.width="70%"} plot(em1, cum=TRUE) ``` For more details, see `?plot.expmodel`. `func.expmodel` computes the values of density and cumulative distribution of desired estimate. For example, ```{r, eval=TRUE, out.height="80%", out.width="70%"} x <- seq(0, 14, by=0.1) ## Compute the density of 1st estimate y_pdf <- func(em1, x, n=1) ## Compute the cumulative distribution of 1st estimate y_cdf <- func(em1, x, cdf=TRUE, n=1) ``` Of course, we can compute the desired estimate by designating `n=k` for kth estimate shown in `summary.expmodel`.