--- title: "An infinite-horizon HMDP" author: "Lars Relund " date: "`r Sys.Date()`" bibliography: litt.bib output: rmarkdown::html_vignette vignette: > %\VignetteIndexEntry{infinite-hmdp} %\VignetteEncoding{UTF-8} %\VignetteEngine{knitr::rmarkdown} editor_options: chunk_output_type: console --- ```{r setup, include=FALSE} library(knitr) options(rmarkdown.html_vignette.check_title = FALSE, # formatR.arrow = TRUE, # scipen=999, # digits=5, width=90) #thm <- knit_theme$get("edit-kwrite") # whitengrey, bright, print, edit-flashdevelop, edit-kwrite #knit_theme$set(thm) knit_hooks$set( par = function(before, options, envir) { if (before && options$fig.show != 'none') par(mar = c(0, 0, 0, 0), # bottom, left, top, and right oma = c(0, 0, 0, 0))} ) knitr::opts_chunk$set( # collapse = TRUE, comment = "#>", fig.align = 'center', fig.width = 9, fig.height = 5, fig.show = 'hold', out.extra = 'style="max-width:100%;"', # tidy = TRUE, # prompt=T, # comment=NA, cache = F # background = "red" ) library(magrittr) library(dplyr) ``` The `MDP2` package in R is a package for solving Markov decision processes (MDPs) with discrete time-steps, states and actions. Both traditional MDPs [@Puterman94], semi-Markov decision processes (semi-MDPs) [@Tijms03] and hierarchical-MDPs (HMDPs) [@Kristensen00] can be solved under a finite and infinite time-horizon. The package implement well-known algorithms such as policy iteration and value iteration under different criteria e.g. average reward per time unit and expected total discounted reward. The model is stored using an underlying data structure based on the *state-expanded directed hypergraph* of the MDP (@Relund06) implemented in `C++` for fast running times. Building and solving an MDP is done in two steps. First, the MDP is built and saved in a set of binary files. Next, you load the MDP into memory from the binary files and apply various algorithms to the model. For building the MDP models see `vignette("building")`. In this vignette we focus on the second step, i.e. finding the optimal policy. Here we consider an infinite-horizon HMDP. ```{r} library(MDP2) ``` ## An infinite-horizon HMDP A hierarchical MDP is an MDP with parameters defined in a special way, but nevertheless in accordance with all usual rules and conditions relating to such processes (@Kristensen00). The basic idea of the hierarchical structure is that stages of the process can be expanded to a so-called *child process*, which again may expand stages further to new child processes leading to multiple levels. To illustrate consider the HMDP shown in the figure below. The process has three levels. At `Level 2` we have a set of finite-horizon semi-MDPs (one for each oval box) which all can be represented using a state-expanded hypergraph (hyperarcs not shown, only hyperarcs connecting processes are shown). A semi-MDP at `Level 2` is uniquely defined by a given state $s$ and action $a$ of its *parent process* at `Level 1` (illustrated by the arcs with head and tail node at `Level 1` and `Level 2`, respectively). Moreover, when a child process at `Level 2` terminates a transition from a state $s\in \mathcal{S}_{N}$ of the child process to a state at the next stage of the parent process occur (illustrated by the (hyper)arcs having head and tail at `Level 2` and `Level 1`, respectively). ```{r, echo=FALSE, fig.cap="The state-expanded hypergraph of the first stage of a hierarchical MDP. Level 0 indicate the founder level, and the nodes indicates states at the different levels. A child process (oval box) is represented using its state-expanded hypergraph (hyperarcs not shown) and is uniquely defined by a given state and action of its parent process."} knitr::include_graphics("vignette_files/hmdp_index.png") ``` Since a child process is always defined by a stage, state and action of the parent process we have that for instance a state at Level 1 can be identified using an index vector $\nu=(n_{0},s_{0},a_{0},n_{1},s_{1})$ where $s_1$ is the state id at the given stage $n_1$ in the process defined by the action $a_0$ in state $s_0$ at stage $n_0$. Note all values are ids starting from zero, e.g. if $s_1=0$ it is the first state at the corresponding stage and if $a_0=2$ it is the third action at the corresponding state. In general a state $s$ and action $a$ at level $l$ can be uniquely identified using $$ \begin{aligned} \nu_{s}&=(n_{0},s_{0},a_{0},n_{1},s_{1},\ldots,n_{l},s_{l}) \\ \nu_{a}&=(n_{0},s_{0},a_{0},n_{1},s_{1},\ldots,n_{l},s_{l},a_{l}). \end{aligned} $$ The index vectors for state $v_0$, $v_1$ and $v_2$ are illustrated in the figure. As under a semi-MDP another way to identify a state in the state-expanded hypergraph is using an unique id. ## Example Let us try to solve a small problem from livestock farming, namely the cow replacement problem where we want to represent the age of the cow, i.e. the lactation number of the cow. During a lactation a cow may have a high, average or low yield. We assume that a cow is always replaced after 4 lactations. In addition to lactation and milk yield we also want to take the genetic merit into account which is either bad, average or good. When a cow is replaced we assume that the probability of a bad, average or good heifer is equal. We formulate the problem as a HMDP with 2 levels. At level 0 the states are the genetic merit and the length of a stage is a life of a cow. At level 1 a stage describe a lactation and states describe the yield. Decisions at level 1 are `keep` or `replace`. Note the MDP runs over an infinite time-horizon at the founder level where each state (genetic merit) define a semi-MDP at level 1 with 4 lactations. Let us try to load the model and get some info: ```{r, par=TRUE} prefix <- paste0(system.file("models", package = "MDP2"), "/cow_") mdp <- loadMDP(prefix) mdp ``` The state-expanded hypergraph representing the HMDP with infinite time-horizon can be plotted using ```{r plotHMDP, message=FALSE, par=TRUE} hgf <- getHypergraph(mdp) ## Rename labels dat <- hgf$nodes %>% dplyr::mutate(label = dplyr::case_when( label == "Low yield" ~ "L", label == "Avg yield" ~ "A", label == "High yield" ~ "H", label == "Dummy" ~ "D", label == "Bad genetic level" ~ "Bad", label == "Avg genetic level" ~ "Avg", label == "Good genetic level" ~ "Good", TRUE ~ "Error" )) ## Set grid id dat$gId[1:3]<-85:87 dat$gId[43:45]<-1:3 getGId<-function(process,stage,state) { if (process==0) start=18 if (process==1) start=22 if (process==2) start=26 return(start + 14 * stage + state) } idx<-43 for (process in 0:2) for (stage in 0:4) for (state in 0:2) { if (stage==0 & state>0) break idx<-idx-1 #cat(idx,process,stage,state,getGId(process,stage,state),"\n") dat$gId[idx]<-getGId(process,stage,state) } hgf$nodes <- dat ## Rename labels dat <- hgf$hyperarcs %>% dplyr::mutate(label = dplyr::case_when( label == "Replace" ~ "R", label == "Keep" ~ "K", label == "Dummy" ~ "D", TRUE ~ "Error" ), col = dplyr::case_when( label == "R" ~ "deepskyblue3", label == "K" ~ "darkorange1", label == "D" ~ "black", TRUE ~ "Error" ), lwd = 0.5, label = "" ) hgf$hyperarcs <- dat ## Make the plot plotHypergraph(hgf, gridDim = c(14, 7), cex = 0.8, radx = 0.02, rady = 0.03) ``` Note action `keep` is drawn with orange color and action `replace` with blue color. We find the optimal policy under the expected discounted reward criterion using policy iteration with an interest rate of 10%: ```{r Optimize (cow)} wLbl<-"Net reward" # the weight we want to optimize (net reward) durLbl<-"Duration" # the duration/time label runPolicyIteDiscount(mdp, wLbl, durLbl, rate = 0.1) ``` The optimal policy is: ```{r plotPolicy, results='hide', message=FALSE, par=TRUE} hgf$hyperarcs <- right_join(hgf$hyperarcs, getPolicy(mdp), by = c("sId", "aIdx")) plotHypergraph(hgf, gridDim = c(14, 7), cex = 0.8, radx = 0.02, rady = 0.03) ``` ```{r eval=FALSE, include=FALSE} # getPolicy(mdp) # rpo<-calcRPO(mdp, wLbl, iA=rep(0,42), criterion="discount", dur=durLbl, rate=rate, rateBase=rateBase) # policy<-merge(policy,rpo) # policy ``` We may also find the policy which maximize the average reward per lactation: ```{r avePerLac, tidy.opts=list(comment=FALSE)} wLbl<-"Net reward" # the weight we want to optimize (net reward) durLbl<-"Duration" # the duration/time label runPolicyIteAve(mdp, wLbl, durLbl) getPolicy(mdp) ``` Since other weights are defined for each action we can calculate the average reward per litre milk under the optimal policy: ```{r, echo=TRUE} runCalcWeights(mdp, w=wLbl, criterion="average", dur = "Yield") ``` or the average yield per lactation: ```{r Reward/piglet (sow rep), echo=TRUE} runCalcWeights(mdp, w="Yield", criterion="average", dur = durLbl) ``` ```{r Delete bin, include=FALSE} do.call(file.remove,list(list.files(pattern = ".bin"))) ``` ## References