| Title: | Solution Methods for Multi-Objective Markov Decision Processes |
|---|---|
| Description: | Compendium of the most representative algorithms in print---vector-valued dynamic programming, linear programming, policy iteration, the weighting factor approach---for solving multi-objective Markov decision processes, with or without reward discount, over a finite or infinite horizon. Mifrani, A. (2024) <doi:10.1007/s10479-024-06439-x>; Mifrani, A. & Noll, D. <doi:10.48550/arXiv.2502.13697>; Wakuta, K. (1995) <doi:10.1016/0304-4149(94)00064-Z>. |
| Authors: | Anas Mifrani [aut, cre, cph] (ORCID: <https://orcid.org/0009-0005-1373-9028>) |
| Maintainer: | Anas Mifrani <[email protected]> |
| License: | GPL-3 |
| Version: | 1.0.0 |
| Built: | 2026-06-04 07:27:06 UTC |
| Source: | https://github.com/cran/multiobjectiveMDP |
are_valid_finite_horizon_rewards() returns TRUE if rewards constitutes a
valid reward structure for a finite-horizon problem. In the opposite case, it returns FALSE together with an explanatory message.
are_valid_finite_horizon_rewards(rewards)are_valid_finite_horizon_rewards(rewards)
rewards |
A numeric list. |
This function juxtaposes rewards against the template described in the documentation of evaluate_finite_horizon_MMDP_markov_policy() for finite-horizon reward lists.
A boolean: TRUE if rewards defines a valid finite-horizon reward structure, FALSE otherwise.
evaluate_finite_horizon_MMDP_markov_policy() for the template of a finite-horizon reward list.
set.seed(1234) MMDP <- generate_rand_MMDP(2, list(c(1, 2), c(1, 2, 3)), horizon = 5, no_objectives = 2) rewards <- MMDP$R are_valid_finite_horizon_rewards(rewards)set.seed(1234) MMDP <- generate_rand_MMDP(2, list(c(1, 2), c(1, 2, 3)), horizon = 5, no_objectives = 2) rewards <- MMDP$R are_valid_finite_horizon_rewards(rewards)
are_valid_finite_horizon_transition_probabilities() returns TRUE if transition_probabilities constitutes a
valid transition probability structure for a finite-horizon problem. In the opposite case, it returns FALSE in addition to an explanation.
are_valid_finite_horizon_transition_probabilities(transition_probabilities)are_valid_finite_horizon_transition_probabilities(transition_probabilities)
transition_probabilities |
A numeric list. |
This function juxtaposes transition_probabilities against the template described in the documentation of evaluate_finite_horizon_MMDP_markov_policy() for finite-horizon transition probability lists.
A boolean: TRUE if transition_probabilities defines a valid finite-horizon transition probability structure, FALSE otherwise.
evaluate_finite_horizon_MMDP_markov_policy()
set.seed(1234) MMDP <- generate_rand_MMDP(2, list(c(1, 2), c(1, 2, 3)), horizon = 5, no_objectives = 2) transition_probabilities <- MMDP$P are_valid_finite_horizon_transition_probabilities(transition_probabilities)set.seed(1234) MMDP <- generate_rand_MMDP(2, list(c(1, 2), c(1, 2, 3)), horizon = 5, no_objectives = 2) transition_probabilities <- MMDP$P are_valid_finite_horizon_transition_probabilities(transition_probabilities)
are_valid_infinite_horizon_rewards() returns TRUE if stationary_rewards constitutes a
valid reward structure for a stationary infinite-horizon problem. Otherwise, it returns FALSE in addition to an explanation.
are_valid_infinite_horizon_rewards(stationary_rewards)are_valid_infinite_horizon_rewards(stationary_rewards)
stationary_rewards |
A numeric list. |
This function juxtaposes stationary_rewards against the template described in the documentation of evaluate_discounted_MMDP_pure_policy() for infinite-horizon reward lists.
A boolean: TRUE if stationary_rewards defines a valid infinite-horizon reward structure, FALSE otherwise.
evaluate_discounted_MMDP_pure_policy()
set.seed(1234) stationary_MMDP <- generate_rand_MMDP(2, list(c(1, 2), c(1, 2, 3)), horizon = Inf, no_objectives = 2) stationary_rewards <- stationary_MMDP$R are_valid_infinite_horizon_rewards(stationary_rewards)set.seed(1234) stationary_MMDP <- generate_rand_MMDP(2, list(c(1, 2), c(1, 2, 3)), horizon = Inf, no_objectives = 2) stationary_rewards <- stationary_MMDP$R are_valid_infinite_horizon_rewards(stationary_rewards)
are_valid_infinite_horizon_transition_probabilities() returns TRUE if stationary_rewards constitutes a
valid transition probability structure for a stationary infinite-horizon problem. In the opposite case, it returns FALSE in addition to an explanation.
are_valid_infinite_horizon_transition_probabilities( stationary_transition_probabilities )are_valid_infinite_horizon_transition_probabilities( stationary_transition_probabilities )
stationary_transition_probabilities |
A numeric list. |
This function juxtaposes stationary_transition_probabilities against the template described in the documentation of evaluate_discounted_MMDP_pure_policy() for infinite-horizon transition probability lists.
A boolean: TRUE if stationary_transition_probabilities defines a valid infinite-horizon transition probability structure, FALSE otherwise.
evaluate_discounted_MMDP_pure_policy()
set.seed(1234) stationary_MMDP <- generate_rand_MMDP(2, list(c(1, 2), c(1, 2, 3)), horizon = Inf, no_objectives = 2) stationary_transition_probabilities <- stationary_MMDP$P are_valid_infinite_horizon_transition_probabilities(stationary_transition_probabilities)set.seed(1234) stationary_MMDP <- generate_rand_MMDP(2, list(c(1, 2), c(1, 2, 3)), horizon = Inf, no_objectives = 2) stationary_transition_probabilities <- stationary_MMDP$P are_valid_infinite_horizon_transition_probabilities(stationary_transition_probabilities)
compromise_solution() determines which of a finite set value_set of
objective vectors is geometrically the closest to the ideal point that
maximizes all components simultaneously.
compromise_solution(value_set, p = 2)compromise_solution(value_set, p = 2)
value_set |
A numeric matrix whose columns contain the objective vectors. |
p |
Either (1) an integer greater than or equal to 1 or (2) |
A numeric vector giving the column of value_set that minimizes the
distance to the ideal point.
Yu, P. L. (2013). Multiple-criteria decision making: concepts, techniques, and extensions (Vol. 30). Springer Science & Business Media.
# Construct a set of three points: (3, 1), (2, 2) and (8, -1/2). The ideal point here is (8, 2). value_set <- matrix(c(3, 1, 2, 2, 8, -1/2), nrow = 2) # Which of the three points is closest to the ideal under, say, the Euclidean norm? compromise_solution(value_set) # What of the sup norm? compromise_solution(value_set, Inf)# Construct a set of three points: (3, 1), (2, 2) and (8, -1/2). The ideal point here is (8, 2). value_set <- matrix(c(3, 1, 2, 2, 8, -1/2), nrow = 2) # Which of the three points is closest to the ideal under, say, the Euclidean norm? compromise_solution(value_set) # What of the sup norm? compromise_solution(value_set, Inf)
discounted_bellman_operator() is an adjunct to
solve_discounted_MMDP_policy_iteration() and
solve_discounted_MDP_policy_iteration() that operates a standard linear
transformation on a value function value_function for a stationary
infinite-horizon multi-objective Markov decision process.
discounted_bellman_operator( stationary_transition_probabilities, stationary_rewards, stationary_policy, value_function, rho )discounted_bellman_operator( stationary_transition_probabilities, stationary_rewards, stationary_policy, value_function, rho )
stationary_transition_probabilities |
A numeric list defining the
(stationary) transition probabilities of the infinite-horizon model under
study. See the documentation of |
stationary_rewards |
A numeric list defining the (stationary) reward
structure. See the documentation of |
stationary_policy |
An integer vector, of length equal to the number of states, representing a stationary policy. Each element in the vector gives the index of the action prescribed for the corresponding state. |
value_function |
A numeric matrix, with one column per state,
representing the value function to be transformed. The |
rho |
A number greater than or equal to 0 but strictly less than 1 to use as a reward discount factor. |
This function uses value_function and stationary_policy to compute a new value function.
Let denote
the reward vector for executing policy stationary_policy in state —
which is to say stationary_rewards[[1]][[s]][[stationary_policy[s]]]—and
let denote the matrix formed by these column vectors. Similarly,
let be the square matrix whose -th entry is given by
the probability of transition from state to state under
stationary_policy, i.e.,
stationary_probabilities[[1]][[s]][[stationary_policy[s]]][[j]]. Then, if the letter
denotes value_function, this function returns the quantity
where denotes the transpose of
. The resulting matrix bears the same dimensions as value_function.
In the special case where value_function stores the value function of stationary_policy, the transformation leaves value_function unchanged as per standard discounted MDP theory.
A numeric matrix having the dimensions of value_function,
representing the new value function.
Puterman, M. L. (2014). Markov decision processes: discrete stochastic dynamic programming. John Wiley & Sons.
solve_discounted_MDP_policy_iteration() for an implementation of the standard policy iteration algorithm and solve_discounted_MMDP_policy_iteration() for its multi-objective extension.
To ensure that the transition probability and reward lists passed to the function are acceptable, use are_valid_infinite_horizon_transition_probabilities() and are_valid_infinite_horizon_rewards().
# Generate a discounted infinite-horizon bi-objective MDP with two states # and action sets that provide two actions for state 1 and three for state 2 set.seed(1234) no_states <- 2 action_sets <- list(c(1, 2), c(1, 2, 3)) no_objectives <- 2 stationary_MMDP <- generate_rand_MMDP(no_states, action_sets, horizon = Inf, no_objectives) stationary_transition_probabilities <- stationary_MMDP$P stationary_rewards <- stationary_MMDP$R # Consider the stationary policy that prescribes action 1 for state 1 and action 3 for state 2 stationary_policy <- c(1, 3) # Evaluate this policy assuming a reward discount factor of .7, display the value function rho <- .7 value_function <- evaluate_discounted_MMDP_pure_policy(stationary_transition_probabilities, stationary_rewards, stationary_policy, rho) value_function # What happens when we subject a policy's value function # to the transformation defined by the policy itself? transformed_value_function <- discounted_bellman_operator(stationary_transition_probabilities, stationary_rewards, stationary_policy, value_function, rho) transformed_value_function# Generate a discounted infinite-horizon bi-objective MDP with two states # and action sets that provide two actions for state 1 and three for state 2 set.seed(1234) no_states <- 2 action_sets <- list(c(1, 2), c(1, 2, 3)) no_objectives <- 2 stationary_MMDP <- generate_rand_MMDP(no_states, action_sets, horizon = Inf, no_objectives) stationary_transition_probabilities <- stationary_MMDP$P stationary_rewards <- stationary_MMDP$R # Consider the stationary policy that prescribes action 1 for state 1 and action 3 for state 2 stationary_policy <- c(1, 3) # Evaluate this policy assuming a reward discount factor of .7, display the value function rho <- .7 value_function <- evaluate_discounted_MMDP_pure_policy(stationary_transition_probabilities, stationary_rewards, stationary_policy, rho) value_function # What happens when we subject a policy's value function # to the transformation defined by the policy itself? transformed_value_function <- discounted_bellman_operator(stationary_transition_probabilities, stationary_rewards, stationary_policy, value_function, rho) transformed_value_function
efficient_subset_sort_prune() enumerates the efficient column vectors of a
matrix according to the componentwise order that prefers larger components to smaller ones.
efficient_subset_sort_prune(X)efficient_subset_sort_prune(X)
X |
A numeric matrix. |
A list comprised of:
A numeric matrix formed by the efficient columns of X, and
the indices of those columns in the original matrix.
If X's columns are incomparable, the function will return X.
# The points in the following set are (1, 2), (0, 0) and (3, -6) X <- matrix(c(1, 2, 0, 0, 3, -6), nrow = 2, ncol = 3) # (1, 2) and (3, -6) are efficient whereas (0, 0) is dominated by (1, 2) efficient_subset_sort_prune(X)# The points in the following set are (1, 2), (0, 0) and (3, -6) X <- matrix(c(1, 2, 0, 0, 3, -6), nrow = 2, ncol = 3) # (1, 2) and (3, -6) are efficient whereas (0, 0) is dominated by (1, 2) efficient_subset_sort_prune(X)
evaluate_discounted_MMDP_pure_policy() calculates the expected discounted
reward of a stationary deterministic policy in an infinite-horizon
multi-objective Markov decision process.
evaluate_discounted_MMDP_pure_policy( stationary_transition_probabilities, stationary_rewards, stationary_policy, rho )evaluate_discounted_MMDP_pure_policy( stationary_transition_probabilities, stationary_rewards, stationary_policy, rho )
stationary_transition_probabilities |
A numeric list describing the
(stationary) transition probabilities of the infinite-horizon model in
which
|
stationary_rewards |
A numeric list describing the model's (stationary) reward structure. The list contains a single sublist with one component per state that breaks down as follows:
|
stationary_policy |
An integer vector, of length equal to the number of
states as determined by |
rho |
A number greater than or equal to 0 but strictly less than 1 to use as a reward discount factor. |
Let denote the number of states. Let be a (Markov) stationary deterministic policy and let denote the expected discounted reward vector it achieves over an infinite horizon assuming the initial state is . For each component of this value function, Puterman (2014) shows that
where is viewed as an -dimensional vector having entries ; is the transition probability matrix induced by ; is the -dimensional vector given by the -th reward of executing policy in each state ; and is the identity matrix.
A numeric matrix, one row per objective and one column per state, in
which the -th column contains the value vector achieved from initial
state .
set.seed(1234) stationary_MMDP <- generate_rand_MMDP(no_states = 2, action_sets = list(c(1, 2), c(1, 2, 3)), horizon = Inf, no_objectives = 2) stationary_transition_probabilities <- stationary_MMDP$P stationary_rewards <- stationary_MMDP$R rho <- .7 # What's the value of the stationary policy that dictates action 2 in state 1 # and action 3 in state 2? stationary_policy <- c(2, 3) evaluate_discounted_MMDP_pure_policy(stationary_transition_probabilities, stationary_rewards, stationary_policy, rho)set.seed(1234) stationary_MMDP <- generate_rand_MMDP(no_states = 2, action_sets = list(c(1, 2), c(1, 2, 3)), horizon = Inf, no_objectives = 2) stationary_transition_probabilities <- stationary_MMDP$P stationary_rewards <- stationary_MMDP$R rho <- .7 # What's the value of the stationary policy that dictates action 2 in state 1 # and action 3 in state 2? stationary_policy <- c(2, 3) evaluate_discounted_MMDP_pure_policy(stationary_transition_probabilities, stationary_rewards, stationary_policy, rho)
evaluate_finite_horizon_MMDP_markov_policy() recursively calculates the
value function of a Markov deterministic policy for a finite-horizon
multi-objective Markov decision process.
evaluate_finite_horizon_MMDP_markov_policy( transition_probabilities, rewards, policy, rho = 1 )evaluate_finite_horizon_MMDP_markov_policy( transition_probabilities, rewards, policy, rho = 1 )
transition_probabilities |
A numeric list containing the transition
probabilities of the finite-horizon model in which
|
rewards |
A numeric list that defines the rewards of the finite-horizon
model in which
|
policy |
An integer matrix in which the |
rho |
A number between 0 and 1 representing the reward discount factor.
The default value |
Let denote the Markov deterministic policy represented by policy.
For each state , define
to be the expected reward that accrues from using between
a decision epoch and terminal epoch (horizon), with
denoting the action prescribed at time by
for (random) state , and denoting the
attendant reward vector. We assume no action is taken at termination, and so
the reward will merely be a function of the terminal state
.
This function finds , 's value function, through the
recursion
where the index runs over the states, denotes the probability of transition from to under action , and
where we set for each state .
A numeric matrix whose s-th column gives the value achieved by
policy from initial state s.
Puterman, M. L. (2014). Markov decision processes: discrete stochastic dynamic programming. John Wiley & Sons.
are_valid_finite_horizon_transition_probabilities(),
are_valid_finite_horizon_rewards()
set.seed(1234) no_states <- 2 action_sets <- list(c(1, 2), c(1, 2, 3)) horizon <- 5 no_objectives <- 2 MMDP <- generate_rand_MMDP(no_states, action_sets, horizon, no_objectives) transition_probabilities <- MMDP$P rewards <- MMDP$R rho <- .97 # Consider this policy: for state 1, always take action 1; # for state 2, take action 1 at odd decision epochs and action 2 at even decision epochs. policy <- matrix(nrow = no_states, ncol = (horizon - 1), data = c(1, 1, 1, 1, 1, 3, 1, 3), byrow = TRUE) evaluate_finite_horizon_MMDP_markov_policy(transition_probabilities, rewards, policy, rho)set.seed(1234) no_states <- 2 action_sets <- list(c(1, 2), c(1, 2, 3)) horizon <- 5 no_objectives <- 2 MMDP <- generate_rand_MMDP(no_states, action_sets, horizon, no_objectives) transition_probabilities <- MMDP$P rewards <- MMDP$R rho <- .97 # Consider this policy: for state 1, always take action 1; # for state 2, take action 1 at odd decision epochs and action 2 at even decision epochs. policy <- matrix(nrow = no_states, ncol = (horizon - 1), data = c(1, 1, 1, 1, 1, 3, 1, 3), byrow = TRUE) evaluate_finite_horizon_MMDP_markov_policy(transition_probabilities, rewards, policy, rho)
generate_rand_MMDP() returns a randomly generated finite- or
infinite-horizon multi-objective Markov decision process having the requested
states and action sets.
generate_rand_MMDP(no_states, action_sets, horizon = Inf, no_objectives)generate_rand_MMDP(no_states, action_sets, horizon = Inf, no_objectives)
no_states |
An integer greater than or equal to 2. This is the number of states in the model to be generated. |
action_sets |
A list of |
horizon |
|
no_objectives |
An integer greater than or equal to 1. This is the number of reward functions (objectives) in the model to be generated. |
A list summarizing the components of the multi-objective Markov decision process that was generated in the body of the function. These components are:
The state set S, an integer vector, as determined by no_states,
The action sets action_sets, an integer list,
The decision-making horizon horizon,
A randomly generated transition probability list P of length one if the horizon is infinite and of length horizon - 1 otherwise. There are three levels in P. In the finite-horizon case, a typical third-level entry, P[[t]][[s]][[a]], is a numeric vector that gives the probability distribution of the states at epoch t+1 if the state at epoch t was s and the action selected was a. For an infinite-horizon model, P[[1]][[s]][[a]] is to be interpreted as the stationary (i.e., time-homogeneous) probability distribution of future states if the current state is s and the action taken was a.
A randomly generated reward list R of length one if the horizon is infinite and of length horizon otherwise. There are likewise three levels in R. In the finite-horizon case, a typical third-level entry, R[[t]][[s]][[a]], is a numeric no_objectives-dimensional vector that gives the reward for selecting action a in state s at epoch t. A special case is the entry R[[horizon]][[s]], which is a no_objectives-dimensional vector that gives the reward for the terminating in state s. In the infinite-horizon case, R[[1]][[s]][[a]] is the stationary reward for selecting action a in state s. There are no terminal rewards in infinite-horizon problems.
# Generate a two-state finite-horizon bi-objective Markov decision process no_states <- 2 # Two actions are available in state 1 whereas three are available in state 2 action_sets <- list(c(1, 2), c(1, 2, 3)) # Let there be four decision epochs and two objectives horizon <- 5 no_objectives <- 2 MMDP <- generate_rand_MMDP(no_states, action_sets, horizon, no_objectives) # Inspect the randomly generated transition probabilities and rewards MMDP$P MMDP$R# Generate a two-state finite-horizon bi-objective Markov decision process no_states <- 2 # Two actions are available in state 1 whereas three are available in state 2 action_sets <- list(c(1, 2), c(1, 2, 3)) # Let there be four decision epochs and two objectives horizon <- 5 no_objectives <- 2 MMDP <- generate_rand_MMDP(no_states, action_sets, horizon, no_objectives) # Inspect the randomly generated transition probabilities and rewards MMDP$P MMDP$R
is_valid_finite_horizon_policy() returns TRUE if policy represents a
valid policy for finite-horizon problems with reward list rewards. In the opposite case, it returns FALSE together with an explanation.
is_valid_finite_horizon_policy(policy, rewards)is_valid_finite_horizon_policy(policy, rewards)
policy |
An integer matrix. |
rewards |
A numeric list providing a valid finite-horizon reward
structure. See the documentation of |
This function juxtaposes policy against the template documented under evaluate_finite_horizon_MMDP_markov_policy() and in the vignette for describing finite-horizon policies. In particular, it checks whether:
the number of rows equals the number of states stipulated by rewards;
the number of columns equals the length of the decision-making horizon, minus the terminal epoch, stipulated by rewards;
the entries conform to the action sets stipulated by rewards.
A boolean: TRUE if policy defines a valid Markov deterministic policy with respect to
problems in which the rewards are given by rewards, FALSE otherwise.
are_valid_finite_horizon_rewards()
set.seed(1234) no_states <- 2 action_sets <- list(c(1, 2), c(1, 2)) horizon <- 5 no_objectives <- 2 MMDP <- generate_rand_MMDP(no_states, action_sets, horizon, no_objectives) rewards <- MMDP$R # An example of a valid policy for a two-state, two-action bi-objective MDP # with four decision epochs policy <- matrix(nrow = no_states, ncol = (horizon - 1), data = c(2, 1, 1, 1, 1, 2, 1, 2), byrow = TRUE) is_valid_finite_horizon_policy(policy, rewards) # The following policy is invalid because it prescribes no actions for epoch four policy <- matrix(2, ncol = 3, data = c(2, 1, 1, 1, 2, 1), byrow = TRUE) is_valid_finite_horizon_policy(policy, rewards)set.seed(1234) no_states <- 2 action_sets <- list(c(1, 2), c(1, 2)) horizon <- 5 no_objectives <- 2 MMDP <- generate_rand_MMDP(no_states, action_sets, horizon, no_objectives) rewards <- MMDP$R # An example of a valid policy for a two-state, two-action bi-objective MDP # with four decision epochs policy <- matrix(nrow = no_states, ncol = (horizon - 1), data = c(2, 1, 1, 1, 1, 2, 1, 2), byrow = TRUE) is_valid_finite_horizon_policy(policy, rewards) # The following policy is invalid because it prescribes no actions for epoch four policy <- matrix(2, ncol = 3, data = c(2, 1, 1, 1, 2, 1), byrow = TRUE) is_valid_finite_horizon_policy(policy, rewards)
is_valid_infinite_horizon_policy() returns TRUE if stationary_policy represents a
valid stationary deterministic policy for infinite-horizon problems with reward list stationary_rewards. In the opposite case, it returns FALSE together with an explanation.
is_valid_infinite_horizon_policy(stationary_policy, stationary_rewards)is_valid_infinite_horizon_policy(stationary_policy, stationary_rewards)
stationary_policy |
An integer vector. |
stationary_rewards |
A numeric list providing a valid infinite-horizon reward
structure. See the documentation of |
This function juxtaposes stationary_policy against the template documented under evaluate_discounted_MMDP_pure_policy() for describing stationary policies. In particular, it checks whether:
the number of entries equals the number of states implied by stationary_rewards;
the entries themselves conform to the action sets implied by stationary_rewards.
A boolean: TRUE if stationary_policy defines a valid stationary policy with respect to
problems where the rewards are given by rewards, FALSE otherwise.
are_valid_infinite_horizon_rewards()
no_states <- 2 action_sets <- list(c(1, 2), c(1, 2)) no_objectives <- 2 stationary_MMDP <- generate_rand_MMDP(no_states, action_sets, horizon = Inf, no_objectives) stationary_rewards <- stationary_MMDP$R # An example of a valid stationary policy: send state 1 to action 1, state 2 to action 1 stationary_policy <- c(1, 1) is_valid_infinite_horizon_policy(stationary_policy, stationary_rewards) # An example of an invalid stationary policy: send state 1 to action 1, # but send state 2 to an action that falls outside its action set stationary_policy <- c(1, 3) is_valid_infinite_horizon_policy(stationary_policy, stationary_rewards)no_states <- 2 action_sets <- list(c(1, 2), c(1, 2)) no_objectives <- 2 stationary_MMDP <- generate_rand_MMDP(no_states, action_sets, horizon = Inf, no_objectives) stationary_rewards <- stationary_MMDP$R # An example of a valid stationary policy: send state 1 to action 1, state 2 to action 1 stationary_policy <- c(1, 1) is_valid_infinite_horizon_policy(stationary_policy, stationary_rewards) # An example of an invalid stationary policy: send state 1 to action 1, # but send state 2 to an action that falls outside its action set stationary_policy <- c(1, 3) is_valid_infinite_horizon_policy(stationary_policy, stationary_rewards)
solve_discounted_MDP_policy_iteration() is an adjunct to
solve_discounted_MMDP_weighting_factor() that implements the policy
iteration method for a discounted infinite-horizon MDP. It returns an optimal
stationary policy together with its value function.
solve_discounted_MDP_policy_iteration( stationary_transition_probabilities, stationary_scalar_rewards, rho )solve_discounted_MDP_policy_iteration( stationary_transition_probabilities, stationary_scalar_rewards, rho )
stationary_transition_probabilities |
A numeric list describing the
(stationary) transition probabilities of the infinite-horizon model under
study. See the documentation of |
stationary_scalar_rewards |
A numeric list describing the model's
rewards. See the documentation of |
rho |
A number greater than or equal to 0 but strictly less than 1 to use as a reward discount factor. |
Policy iteration starts from an arbitrary stationary policy then proceeds to improve on it until the policy iterates generated over the course of the algorithm stabilize. The algorithm is as follows, where the phrase "decision rule" refers to a mapping from states to actions:
Step 1: Set , and select an arbitrary decision rule .
Step 2: (Policy evaluation) Calculate the value function of the stationary policy that always uses .
Step 3: (Policy improvement) For each state , choose from the actions that maximize the quantity
where is the reward for taking action in state and is the probability of transition from to under .
Step 4: If , stop: the policy that repeats is optimal. Otherwise increment by 1 and return to step 2.
When the state and action sets are finite, this procedure terminates in an
optimal policy after a finite number of iterations. To implement step 2, the
function calls evaluate_discounted_MMDP_pure_policy(). To evaluate the
quantities in step 3 prior to maximization, the function calls
discounted_bellman_operator().
A list with one integer vector and one numeric vector:
optimal_policy, of length equal to the number of states as determined by stationary_transition_probabilities and stationary_scalar_rewards, defines an optimal stationary policy such that -th entry provides the action prescribed for state .
optimal_value_function, of the same length as optimal_policy, gives the expected discounted reward achieved by optimal_policy from each initial state.
Puterman, M. L. (2014). Markov decision processes: discrete stochastic dynamic programming. John Wiley & Sons.
To ensure that the transition probability and reward lists passed are acceptable, use are_valid_infinite_horizon_transition_probabilities() and are_valid_infinite_horizon_rewards().
set.seed(1234) stationary_MDP <- generate_rand_MMDP(2, list(c(1, 2), c(1, 2, 3)), horizon = Inf, no_objectives = 1) stationary_transition_probabilities <- stationary_MDP$P stationary_scalar_rewards <- stationary_MDP$R rho <- .7 solve_discounted_MDP_policy_iteration(stationary_transition_probabilities, stationary_scalar_rewards, rho)set.seed(1234) stationary_MDP <- generate_rand_MMDP(2, list(c(1, 2), c(1, 2, 3)), horizon = Inf, no_objectives = 1) stationary_transition_probabilities <- stationary_MDP$P stationary_scalar_rewards <- stationary_MDP$R rho <- .7 solve_discounted_MDP_policy_iteration(stationary_transition_probabilities, stationary_scalar_rewards, rho)
solve_discounted_MMDP_linear_programming() implements a linear programming
approach to the solution of a discounted infinite-horizon multi-objective
Markov decision process. The function "maximizes" the expected discounted total reward under a fixed initial_state_distribution and returns all efficient stationary
policies together with their value vectors.
solve_discounted_MMDP_linear_programming( stationary_transition_probabilities, stationary_rewards, rho, initial_state_distribution, max_iter = 100 )solve_discounted_MMDP_linear_programming( stationary_transition_probabilities, stationary_rewards, rho, initial_state_distribution, max_iter = 100 )
stationary_transition_probabilities |
A numeric list defining the
(stationary) transition probabilities of the infinite-horizon model under
study. See the documentation of |
stationary_rewards |
A numeric list defining the (stationary) reward
structure. See the documentation of |
rho |
A number greater than or equal to 0 but strictly less than 1 to use as a reward discount factor. |
initial_state_distribution |
A numeric vector, whose length equals the number of states and whose entries sum to 1.0, specifying the initial probability of each state. The linear programming approach requires that all probabilities be positive. |
max_iter |
An integer greater than or equal to 1 giving the maximum
number of iterations in the optimization phase. The default value is |
This function extends to the multi-objective case the linear programming
approach described in Puterman (1994). The basis for this approach is that
the expected discounted reward of a policy given an initial-state
distribution (initial_state_distribution) is linear in the state-action frequencies
induced by that policy. A state-action frequency is the discounted total joint
probability that a particular state will be visited and a particular action will be taken at least once in the process's
lifetime. The collection of these frequencies forms a convex polyhedral set that is in a
one-to-one correspondence with Markov randomized policies, and whose extreme points uniquely determine the model's stationary
deterministic policies. Using Mifrani and Noll's (2025) adaptation of a
multi-objective simplex-type algorithm by Ecker and Kouada (1978), this
function enumerates those extreme points that are efficient solutions to the
multi-objective linear program then calculates the corresponding (efficient) stationary policies.
A list with one integer matrix and one numeric matrix:
efficient_pure_policies tabulates the efficient stationary policies found through linear programming. Each row represents a policy and thus possesses as many entries as there are states.
efficient_value_vectors records in its columns the value vectors generated by the policies in efficient_pure_policies. The length of each column is the number of objectives as determined by stationary_rewards.
Puterman, M. L. (2014). Markov decision processes: discrete stochastic dynamic programming. John Wiley & Sons.
Mifrani, A., & Noll, D. (2025). Linear programming for finite-horizon vector-valued Markov decision processes. arXiv preprint arXiv:2502.13697.
solve_MOLP() for an implementation of Mifrani and Noll's version
of the Ecker-Kouada multi-objective linear programming method.
To ensure that the transition probability and reward lists passed are acceptable, use are_valid_infinite_horizon_transition_probabilities() and are_valid_infinite_horizon_rewards().
set.seed(1234) no_states <- 2 action_sets <- list(c(1, 2), c(1, 2, 3)) no_objectives <- 2 stationary_MMDP <- generate_rand_MMDP(no_states, action_sets, horizon = Inf, no_objectives) stationary_transition_probabilities <- stationary_MMDP$P stationary_rewards <- stationary_MMDP$R rho <- .7 # Use multi-objective linear programming to solve the model # under a uniform initial-state distribution initial_state_distribution <- c(.5, .5) solve_discounted_MMDP_linear_programming(stationary_transition_probabilities, stationary_rewards, rho, initial_state_distribution)set.seed(1234) no_states <- 2 action_sets <- list(c(1, 2), c(1, 2, 3)) no_objectives <- 2 stationary_MMDP <- generate_rand_MMDP(no_states, action_sets, horizon = Inf, no_objectives) stationary_transition_probabilities <- stationary_MMDP$P stationary_rewards <- stationary_MMDP$R rho <- .7 # Use multi-objective linear programming to solve the model # under a uniform initial-state distribution initial_state_distribution <- c(.5, .5) solve_discounted_MMDP_linear_programming(stationary_transition_probabilities, stationary_rewards, rho, initial_state_distribution)
solve_discounted_MMDP_policy_iteration() implements a vector-valued policy
iteration procedure to solve a discounted infinite-horizon multi-objective
Markov decision process.
solve_discounted_MMDP_policy_iteration( stationary_transition_probabilities, stationary_rewards, rho )solve_discounted_MMDP_policy_iteration( stationary_transition_probabilities, stationary_rewards, rho )
stationary_transition_probabilities |
A numeric list describing the
(stationary) transition probabilities of the infinite-horizon model under
study. See the documentation of |
stationary_rewards |
A numeric list describing the (stationary) reward
structure. See the documentation of |
rho |
A number greater than or equal to 0 but strictly less than 1 to use as a reward discount factor. |
This function implements Kazuyoshi Wakuta's multi-objective extension of the standard policy iteration method for infinite-horizon MDPs. Wakuta's algorithm produces stationary deterministic (also known as pure) policies which achieve a Pareto efficient expected discounted total reward vector from any initial state.
A list with two components:
An integer matrix with one column per state in which the rows define the efficient stationary policies found by policy iteration. The s-th entry in each row gives the action prescribed by the row's policy for state s.
A list containing one numeric matrix per efficient policy that records the policy' value function. The column indices of each matrix represent the initial states and the columns themselves give the value vectors achieved from those states.
Wakuta, K. (1995). Vector-valued Markov decision processes and the systems of linear inequalities. Stochastic processes and their applications, 56(1), 159-169.
solve_discounted_MDP_policy_iteration() for an implementation of the standard policy iteration method in a discounted single-objective MDP.
To ensure that the transition probability and reward lists provided are acceptable, use are_valid_infinite_horizon_transition_probabilities() and are_valid_infinite_horizon_rewards().
set.seed(1234) no_states <- 2 action_sets <- list(c(1, 2), c(1, 2, 3)) no_objectives <- 2 stationary_MMDP <- generate_rand_MMDP(no_states, action_sets, horizon = Inf, no_objectives) stationary_transition_probabilities <- stationary_MMDP$P stationary_rewards <- stationary_MMDP$R rho <- .7 # Let's locate the efficient stationary policies for this infinite-horizon model solution <- solve_discounted_MMDP_policy_iteration(stationary_transition_probabilities, stationary_rewards, rho) solution$policies # Inspect their value functions solution$value_functionsset.seed(1234) no_states <- 2 action_sets <- list(c(1, 2), c(1, 2, 3)) no_objectives <- 2 stationary_MMDP <- generate_rand_MMDP(no_states, action_sets, horizon = Inf, no_objectives) stationary_transition_probabilities <- stationary_MMDP$P stationary_rewards <- stationary_MMDP$R rho <- .7 # Let's locate the efficient stationary policies for this infinite-horizon model solution <- solve_discounted_MMDP_policy_iteration(stationary_transition_probabilities, stationary_rewards, rho) solution$policies # Inspect their value functions solution$value_functions
solve_discounted_MMDP_weighting_factor() generates efficient stationary
policies for a discounted infinite-horizon multi-objective Markov decision
process using the weighting factor approach.
solve_discounted_MMDP_weighting_factor( stationary_transition_probabilities, stationary_rewards, rho, no_iterations = 100 )solve_discounted_MMDP_weighting_factor( stationary_transition_probabilities, stationary_rewards, rho, no_iterations = 100 )
stationary_transition_probabilities |
A numeric list describing the
(stationary) transition probabilities of the infinite-horizon model under
study. See the documentation of |
stationary_rewards |
A numeric list describing the (stationary) reward
structure. See the documentation of |
rho |
A number greater than or equal to 0 but strictly less than 1 to use as a reward discount factor. |
no_iterations |
The number of weighting factors, each factor being a
numeric vector with one entry per objective, to be used. The default value is |
In a discounted infinite-horizon MMDP with finite state and action spaces,
there exist stationary policies which are efficient. Furthermore, each such
policy maximizes some positive linear combination of the objectives (Wakuta,
1982). Conversely, if we consider the problem of maximizing a positive linear
combination (any one would do) of the objectives, we get a
single-objective MDP for which any (uniformly) optimal policy is also (uniformly) efficient in the
original multi-objective model. This function follows this approach, sampling
no_iterations positive weight vectors at random
then solving the resulting MDPs with the standard policy iteration method.
A list with one integer matrix and one numeric sublist:
efficient_policies, one row per efficient policy detected and one column per state, tabulates the policies found. The -th entry specifies the action prescribed by policy for state .
efficient_value_functions, comprised of one numeric matrix per policy in efficient_policies, records each policy's value function. The -th column of a value function matrix contains the expected discounted reward attained from initial state .
Wakuta, K. (1995). Vector-valued Markov decision processes and the systems of linear inequalities. Stochastic processes and their applications, 56(1), 159-169.
solve_discounted_MDP_policy_iteration(),
evaluate_discounted_MMDP_pure_policy().
To ensure that the transition probability and reward lists passed are acceptable, use are_valid_infinite_horizon_transition_probabilities() and are_valid_infinite_horizon_rewards().
set.seed(1234) no_states <- 2 action_sets <- list(c(1, 2), c(1, 2, 3)) no_objectives <- 2 stationary_MMDP <- generate_rand_MMDP(no_states, action_sets, horizon = Inf, no_objectives) stationary_transition_probabilities <- stationary_MMDP$P stationary_rewards <- stationary_MMDP$R rho <- .7 solve_discounted_MMDP_weighting_factor(stationary_transition_probabilities, stationary_rewards, rho)set.seed(1234) no_states <- 2 action_sets <- list(c(1, 2), c(1, 2, 3)) no_objectives <- 2 stationary_MMDP <- generate_rand_MMDP(no_states, action_sets, horizon = Inf, no_objectives) stationary_transition_probabilities <- stationary_MMDP$P stationary_rewards <- stationary_MMDP$R rho <- .7 solve_discounted_MMDP_weighting_factor(stationary_transition_probabilities, stationary_rewards, rho)
solve_finite_horizon_MDP_backward_induction() implements the standard
backward induction algorithm in order to determine the optimal value function and an optimal policy
for a single-objective finite-horizon Markov decision
process.
solve_finite_horizon_MDP_backward_induction( transition_probabilities, scalar_rewards, rho = 1 )solve_finite_horizon_MDP_backward_induction( transition_probabilities, scalar_rewards, rho = 1 )
transition_probabilities |
A numeric list defining the transition
probabilities of the model to be solved. See the documentation of
|
scalar_rewards |
A numeric list defining the rewards. See the
documentation of |
rho |
A number between (and including) 0 and 1 to use as the reward
discount factor. The default value is |
A list of one numeric matrix and one integer matrix:
optimal_values has one row per state and one column per epoch. The (s, t)-th entry records the largest expected total reward that can be achieved between time t and process termination if the initial state is s.
optimal_policy represents an optimal policy in the manner of evaluate_finite_horizon_MMDP_markov_policy().
In keeping with the notation of
evaluate_finite_horizon_MMDP_markov_policy(), let
denote the maximum expected (possibly discounted) reward accrued from time
onward if the state at that time is . It is shown in
Puterman (2014) and elsewhere that the optimal value function
can be determined from the inductive relationship
where the index runs over the action set of state , and where
. In a finite-horizon MDP with finite state and
action sets, there always exists an optimal policy which is Markov
deterministic. In fact, if for each decision epoch and each state
we record those actions that cause the above maximum to be
attained, then we have constructed all optimal Markov deterministic
policies. The present function calls MDP_update() to carry out the
maximization for each epoch and constructs an optimal policy in the iterative
manner described.
Inasmuch as this function is intended for
single-objective problems, a reward scalar_rewards[[t]][[s]][[a]] (when
t is a decision epoch) or scalar_rewards[[horizon]][[s]] (horizon
being the length of scalar_rewards) should be a number.
Standard Markov decision theory assures us
that backward induction in finite-horizon MDPs having finite states and
actions constructs policies that are optimal at every stage of decision
making. In particular, optimal_policy achieves the largest expected
reward not only between time 1 and termination, but between any time t
and termination.
To ensure that the transition probability and reward lists provided are acceptable, use are_valid_finite_horizon_transition_probabilities() and are_valid_finite_horizon_rewards().
set.seed(1234) MDP <- generate_rand_MMDP(2, list(c(1, 2), c(1, 2, 3)), 5, no_objectives = 1) transition_probabilities <- MDP$P scalar_rewards <- MDP$R rho <- .97 solve_finite_horizon_MDP_backward_induction(transition_probabilities, scalar_rewards, rho)set.seed(1234) MDP <- generate_rand_MMDP(2, list(c(1, 2), c(1, 2, 3)), 5, no_objectives = 1) transition_probabilities <- MDP$P scalar_rewards <- MDP$R rho <- .97 solve_finite_horizon_MDP_backward_induction(transition_probabilities, scalar_rewards, rho)
solve_finite_horizon_MMDP_backward_induction() finds the Pareto efficient
values of a finite-horizon multi-objective Markov decision process by solving
a vector-valued extension of the dynamic programming equations.
solve_finite_horizon_MMDP_backward_induction( transition_probabilities, rewards, rho = 1, method = "heuristic" )solve_finite_horizon_MMDP_backward_induction( transition_probabilities, rewards, rho = 1, method = "heuristic" )
transition_probabilities |
A numeric list specifying the finite-horizon
transition probabilities of the model under study. See the documentation of
|
rewards |
A numeric list defining the model's reward structure. See the
documentation of |
rho |
A number between (and including) 0 and 1 to use as the reward
discount factor. The default value is |
method |
A string indicating the method by which the efficiency
equations are solved: |
This function solves a multi-objective counterpart to the optimality
equations involved in solve_finite_horizon_MDP_backward_induction(). In
order that we define the new equations and the manner in which they characterize the
efficient values of a finite-horizon MMDP, it is necessary to expand the policy space to
allow for a special kind of policy which is not Markov. If denotes a sequence of states observed prior to and
including time , then we shall be interested in policies that map each
such sequence to some action permissible in . Notice that this policy space subsumes Markov deterministic policies, which depend on state
histories only through the current state. For any policy , let denote the expected total reward vector it achieves from time onward if the state at that time is . The set of all such expected reward vectors is denoted by . Then, if represents the efficient subset of , Mifrani (2024) shows that
can be determined from the recursion
where denotes the efficient subset of some vector set , and denotes a sum set.
This function finds for each initial state
by carrying out the above recursion backward in time, starting from the boundary condition that the efficient value sets at termination, , reduce to the terminal rewards .
To compute the efficient subsets involved at each epoch, the function calls either exact_MMDP_update() or heuristic_MMDP_update(), depending on whether the method sought is enumerative or heuristic. The former relies on efficient_subset_sort_prune(), whereas the latter
uses nsga2R::nsga2R(), an implementation of the popular Non-dominated
Sorting Genetic Algorithm II (more commonly known as NSGA-II).
A list with three components:
efficient_values, a numeric list the length of rewards[[t]] / transition_probabilities[[t]] (i.e., the cardinality of the state set) in which each component is a numeric matrix representing the efficient value set
relative to some starting state (see the Details section). The column vectors of each matrix give the efficient values that can be achieved from the corresponding state, or an approximation of these values in case the solution procedure is heuristic. The length of these column vectors is the number of objectives.
compromise_solution, a numeric list the length of efficient_values that stores, for each initial state, the "compromise" value obtained by application of compromise_solution() to the columns of the matrix in efficient_values corresponding to that state. See the documentation of compromise_solution() for more information.
time, the CPU time in seconds.
Mifrani, A. (2025). A counterexample and a corrective to the vector extension of the Bellman equations of a Markov decision process. Annals of Operations Research, 345(1), 351-369.
To ensure that the transition probability and reward lists provided are acceptable, use are_valid_finite_horizon_transition_probabilities() and are_valid_finite_horizon_rewards().
set.seed(1234) MMDP <- generate_rand_MMDP(no_states = 2, action_sets = list(c(1, 2), c(1, 2, 3)), horizon = 5, no_objectives = 2) transition_probabilities <- MMDP$P rewards <- MMDP$R solve_finite_horizon_MMDP_backward_induction(transition_probabilities, rewards)set.seed(1234) MMDP <- generate_rand_MMDP(no_states = 2, action_sets = list(c(1, 2), c(1, 2, 3)), horizon = 5, no_objectives = 2) transition_probabilities <- MMDP$P rewards <- MMDP$R solve_finite_horizon_MMDP_backward_induction(transition_probabilities, rewards)
solve_finite_horizon_MMDP_linear_programming() implements a linear programming
approach to the solution of a finite-horizon multi-objective
Markov decision process. The function "maximizes" the expected total reward vector under a fixed distribution of initial states initial_state_distribution. It returns all efficient Markov deterministic
policies along with the values they generate.
solve_finite_horizon_MMDP_linear_programming( transition_probabilities, rewards, initial_state_distribution, max_iter = 100 )solve_finite_horizon_MMDP_linear_programming( transition_probabilities, rewards, initial_state_distribution, max_iter = 100 )
transition_probabilities |
A numeric list specifying the
transition probabilities of the finite-horizon model under
study. See the documentation of |
rewards |
A numeric list specifying the reward
structure. See the documentation of |
initial_state_distribution |
A numeric vector, whose length equals the number of states and whose entries sum to 1.0, specifying the initial probability of each state. The linear programming approach requires that all probabilities be positive. |
max_iter |
An integer greater than or equal to 1 giving the maximum
number of iterations in the optimization phase. The default value is |
This function implements the linear programming technique proposed by Mifrani and Noll (2025). The technique rests on the observation that
the finite-horizon expected total reward of a policy under an initial-state
distribution (initial_state_distribution) is linear in the state-action frequencies
induced by that policy. A state-action frequency is the joint
probability that the process will enter state at time and select action under policy . The collection of these frequencies forms a convex polyhedral set that is in a
one-to-one correspondence with Markov randomized policies, and whose extreme points uniquely determine the model's Markov
deterministic policies. Using Mifrani and Noll's adaptation of a
multi-objective simplex-type algorithm by Ecker and Kouada (1978), this
function enumerates those extreme points that are efficient solutions to the
multi-objective linear program then calculates the corresponding (efficient) policies.
A list of one integer sublist and one numeric matrix:
efficient_policies contains integer matrices representing the efficient policies detected by linear programming.
efficient_value_vectors records in its columns the value vectors generated under initial_state_distribution by the policies in efficient_policies. The length of each column is the number of objectives as determined by rewards.
Mifrani, A., & Noll, D. (2025). Linear programming for finite-horizon vector-valued Markov decision processes. arXiv preprint arXiv:2502.13697.
solve_MOLP() for an implementation of Mifrani and Noll's version
of the Ecker-Kouada multi-objective linear programming technique.
To ensure that the transition probability and reward lists passed are acceptable, use are_valid_finite_horizon_transition_probabilities() and are_valid_finite_horizon_rewards().
set.seed(1234) no_states <- 2 action_sets <- list(c(1, 2), c(1, 2, 3)) no_objectives <- 2 horizon <- 5 MMDP <- generate_rand_MMDP(no_states, action_sets, horizon, no_objectives) transition_probabilities <- MMDP$P rewards <- MMDP$R # Use multi-objective linear programming to solve the model # under a uniform initial-state distribution initial_state_distribution <- c(.5, .5) solve_finite_horizon_MMDP_linear_programming(transition_probabilities, rewards, initial_state_distribution)set.seed(1234) no_states <- 2 action_sets <- list(c(1, 2), c(1, 2, 3)) no_objectives <- 2 horizon <- 5 MMDP <- generate_rand_MMDP(no_states, action_sets, horizon, no_objectives) transition_probabilities <- MMDP$P rewards <- MMDP$R # Use multi-objective linear programming to solve the model # under a uniform initial-state distribution initial_state_distribution <- c(.5, .5) solve_finite_horizon_MMDP_linear_programming(transition_probabilities, rewards, initial_state_distribution)
solve_finite_horizon_MMDP_weighting_factor() finds Pareto efficient Markov deterministic policies for a finite-horizon multi-objective Markov decision process, with or without discount, using the weighting factor approach.
solve_finite_horizon_MMDP_weighting_factor( transition_probabilities, rewards, rho = 1, no_iterations = 100 )solve_finite_horizon_MMDP_weighting_factor( transition_probabilities, rewards, rho = 1, no_iterations = 100 )
transition_probabilities |
A numeric list defining the transition
probabilities of the finite-horizon model under study. See the
documentation of |
rewards |
A numeric list defining the model's reward structure. See the
documentation of |
rho |
A number between 0 and 1 representing the reward discount factor.
The default value |
no_iterations |
The number of weighting factors, each factor being a
numeric vector with one positive entry per objective, to be used. One single-objective MDP will be solved per factor. The default value is |
The premise of the weighting factor approach for finite-horizon problems is identical to the infinite-horizon case as documented under solve_discounted_MMDP_weighting_factor(), except that the efficient policies that are generated need not be stationary.
A list with two sublists:
efficient_policies is a list where each efficient policy detected by the weighting factor approach occupies an integer matrix. For each of these matrices, the -th entry specifies the action prescribed for state at time .
efficient_value_functions, comprised of one numeric matrix per policy in efficient_policies, records each policy's value function. The -th column of a value function matrix contains the expected discounted reward attained from initial state .
solve_discounted_MMDP_weighting_factor() for an analogous implementation in the infinite-horizon case.
To ensure that the transition probability and reward lists provided are acceptable, use are_valid_finite_horizon_transition_probabilities() and are_valid_finite_horizon_rewards().
set.seed(1234) no_states <- 2 action_sets <- list(c(1, 2), c(1, 2, 3)) no_objectives <- 2 horizon <- 5 MMDP <- generate_rand_MMDP(no_states, action_sets, horizon, no_objectives) transition_probabilities <- MMDP$P rewards <- MMDP$R rho <- .97 solve_finite_horizon_MMDP_weighting_factor(transition_probabilities, rewards, rho)set.seed(1234) no_states <- 2 action_sets <- list(c(1, 2), c(1, 2, 3)) no_objectives <- 2 horizon <- 5 MMDP <- generate_rand_MMDP(no_states, action_sets, horizon, no_objectives) transition_probabilities <- MMDP$P rewards <- MMDP$R rho <- .97 solve_finite_horizon_MMDP_weighting_factor(transition_probabilities, rewards, rho)
solve_MOLP() is an adjunct to solve_discounted_MMDP_linear_programming()
that enumerates the efficient basic feasible solutions of a multi-objective
linear programming problem subject to equality and nonnegativity constraints.
solve_MOLP(C, A, b, max_iter)solve_MOLP(C, A, b, max_iter)
C |
A numeric matrix whose |
A |
A numeric matrix, one row per constraint and one column per variable, giving the constraint coefficients. |
b |
A numeric vector giving the right-hand side coefficients. |
max_iter |
An integer greater than or equal to 1 that places an upper limit on the number of iterations performed by the linear programming procedure. |
This function implements a particular version of Ecker and Kouada's (1978) procedure for enumerating the efficient basic feasible solutions of a multi-objective linear (maximization) program where candidate points are tested for efficiency using a proposition from Evans and Steuer (1973). The procedure requires an initial efficient extreme point, which can be found by maximizing a positive linear combination of the objectives.
A list with one integer matrix and one numeric matrix:
efficient_bfs, whose rows contain the efficient basic feasible solutions—which are vectors of length equal to the number of variables—found;
and values, whose columns contain the objective value vectors generated by the solutions in efficient_bfs.
Ecker, J. G., & Kouada, I. A. (1978). Finding all efficient extreme points for multiple objective linear programs. Mathematical programming, 14(1), 249-261.
Evans, J. P., & Steuer, R. E. (1973). A revised simplex method for linear multiple objective programs. Mathematical programming, 5(1), 54-72.
Mifrani, A., & Noll, D. (2025). Linear programming for finite-horizon vector-valued Markov decision processes. arXiv preprint arXiv:2502.13697.
solve_discounted_MMDP_linear_programming()
# Consider the multi-objective linear programming problem # Maximize 4*x_1 - x_2 + 3*x_3 # -2*x_1 + 5*x_2 - 0.5*x_3 # subject to 2*x_1 + 3*x_2 - x_3 = 12 # x_1 - x_2 + 3*x_3 = 3 # x_1, x_2, x_3 nonnegative # Criteria matrix C <- matrix(c(4, -1, 3, -2, 5, -0.5), nrow = 2, ncol = 3, byrow = TRUE) # Constraint matrix (excluding nonnegativity) A <- matrix(c(2, 3, -1, 1, -1, 3), nrow = 2, ncol = 3, byrow = TRUE) # Right-hand side vector b <- c(12, 3) # Solve the multi-objective linear programming problem, # display the efficient basic feasible solutions result <- solve_MOLP(C, A, b, max_iter = 50) result$efficient_bfs result$values# Consider the multi-objective linear programming problem # Maximize 4*x_1 - x_2 + 3*x_3 # -2*x_1 + 5*x_2 - 0.5*x_3 # subject to 2*x_1 + 3*x_2 - x_3 = 12 # x_1 - x_2 + 3*x_3 = 3 # x_1, x_2, x_3 nonnegative # Criteria matrix C <- matrix(c(4, -1, 3, -2, 5, -0.5), nrow = 2, ncol = 3, byrow = TRUE) # Constraint matrix (excluding nonnegativity) A <- matrix(c(2, 3, -1, 1, -1, 3), nrow = 2, ncol = 3, byrow = TRUE) # Right-hand side vector b <- c(12, 3) # Solve the multi-objective linear programming problem, # display the efficient basic feasible solutions result <- solve_MOLP(C, A, b, max_iter = 50) result$efficient_bfs result$values
sum_set() is an adjunct to exact_MMDP_update() and heuristic_MMDP_update()
that forms the sum set of two or more sets viewed as matrices whose columns
represent the sets' elements.
sum_set(X)sum_set(X)
X |
A list of numeric matrices representing Euclidean sets. Each column in a matrix represents an element of the set of which the matrix is a representation. The matrices must have an identical number of rows, which is to say that the underlying sets must have the same dimension. |
A numeric matrix representing the sum set of the sets in X.
set_1 <- matrix(c(1, 2, 0, 0, 3, -6), nrow = 2, ncol = 3) set_2 <- matrix(c(5, -2, 7, 0), nrow = 2, ncol = 2) X <- list(set_1, set_2) sum_set(X) # Returns the sum set of set_1 and set_2set_1 <- matrix(c(1, 2, 0, 0, 3, -6), nrow = 2, ncol = 3) set_2 <- matrix(c(5, -2, 7, 0), nrow = 2, ncol = 2) X <- list(set_1, set_2) sum_set(X) # Returns the sum set of set_1 and set_2
value_function_domination_sets() is an adjunct to
solve_discounted_MMDP_policy_iteration() that compares a vector-valued
function V with a set of such functions to determine which of them
dominate V, are dominated by V or are equal to V.
value_function_domination_sets(V, value_functions)value_function_domination_sets(V, value_functions)
V |
A numeric matrix in which the |
value_functions |
A list of numeric matrices, having the same size as |
Dominance here is taken with respect to the product order induced by the
"larger-is-better" componentwise order over column vectors. That is, if U denotes some matrix in value_functions, then U dominates V if U[, s] is componentwise greater than or equal to V[, s] for all columns s and there exists a column j such that U[, j] dominates V[, j] componentwise.
A list comprising, in this order,
the matrices in value_functions that dominate V,
those that are dominated by V,
and those that are identical to V.
V <- matrix(c(5, 5, 3, 1), nrow = 2) U1 <- matrix(c(4, 4, 2, 0), nrow = 2) # Dominated by V U2 <- matrix(c(6, 5, 4, 2), nrow = 2) # Dominates V U3 <- V # Identical to V value_functions <- list(U1, U2, U3) value_function_domination_sets(V, value_functions)V <- matrix(c(5, 5, 3, 1), nrow = 2) U1 <- matrix(c(4, 4, 2, 0), nrow = 2) # Dominated by V U2 <- matrix(c(6, 5, 4, 2), nrow = 2) # Dominates V U3 <- V # Identical to V value_functions <- list(U1, U2, U3) value_function_domination_sets(V, value_functions)