Farahmand, Amir-massoud
Efficient and Accurate Optimal Transport with Mirror Descent and Conjugate Gradients
Kemertas, Mete, Jepson, Allan D., Farahmand, Amir-massoud
We design a novel algorithm for optimal transport by drawing on the entropic optimal transport, mirror descent, and conjugate gradients literatures. Our scalable and GPU-parallelizable algorithm computes the Wasserstein distance with extreme precision, reaching relative error rates of $10^{-8}$ without numerical stability issues. Empirically, the algorithm converges to high-precision solutions more quickly in terms of wall-clock time than a variety of algorithms, including the log-domain stabilized Sinkhorn algorithm. We provide careful ablations with respect to algorithm and problem parameters, and present benchmarks on upsampled MNIST images, comparing against various recent algorithms on high-dimensional problems. The results suggest that our algorithm can be a useful addition to the practitioner's optimal transport toolkit.
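For reference, here is a minimal NumPy/SciPy sketch of the log-domain stabilized Sinkhorn baseline mentioned above (not the paper's mirror-descent/conjugate-gradient algorithm); the regularization strength, grid, and iteration count are illustrative assumptions.

```python
import numpy as np
from scipy.special import logsumexp

def sinkhorn_log(a, b, C, eps=1e-2, n_iters=1000):
    """Log-domain (stabilized) Sinkhorn for entropic OT.

    a, b : marginal probability vectors; C : cost matrix of shape (len(a), len(b)).
    Returns the transport plan P and the transport cost <P, C>.
    """
    log_a, log_b = np.log(a), np.log(b)
    f, g = np.zeros_like(a), np.zeros_like(b)  # dual potentials
    for _ in range(n_iters):
        # log-sum-exp updates keep the iterations numerically stable for small eps
        f = eps * (log_a - logsumexp((g[None, :] - C) / eps, axis=1))
        g = eps * (log_b - logsumexp((f[:, None] - C) / eps, axis=0))
    P = np.exp((f[:, None] + g[None, :] - C) / eps)
    return P, float(np.sum(P * C))

# Toy usage: two random histograms on a 1-D grid with squared-distance cost.
rng = np.random.default_rng(0)
x = np.linspace(0.0, 1.0, 50)
C = (x[:, None] - x[None, :]) ** 2
a = rng.random(50); a /= a.sum()
b = rng.random(50); b /= b.sum()
P, cost = sinkhorn_log(a, b, C)
print(cost)
```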
$\lambda$-AC: Learning latent decision-aware models for reinforcement learning in continuous state-spaces
Voelcker, Claas A, Ahmadian, Arash, Abachi, Romina, Gilitschenski, Igor, Farahmand, Amir-massoud
The idea of decision-aware model learning, that models should be accurate where it matters for decision-making, has gained prominence in model-based reinforcement learning. While promising theoretical results have been established, the empirical performance of algorithms leveraging a decision-aware loss has been lacking, especially in continuous control problems. In this paper, we present a study of the components necessary for decision-aware reinforcement learning models and showcase design choices that enable well-performing algorithms. To this end, we provide a theoretical and empirical investigation into prominent algorithmic ideas in the field. We highlight that empirical design decisions established in the MuZero line of work are vital to achieving good performance for related algorithms, and we showcase differences in behavior between different instantiations of value-aware algorithms in stochastic environments. Using these insights, we propose the Latent Model-Based Decision-Aware Actor-Critic framework ($\lambda$-AC) for decision-aware model-based reinforcement learning in continuous state-spaces and highlight important design choices in different environments.
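A minimal PyTorch sketch of a value-aware latent-model objective of the kind this line of work studies, assuming a simple encoder/latent-dynamics/value-head decomposition; the network sizes and the detached, MuZero-style value target are illustrative assumptions, not the exact $\lambda$-AC objective.

```python
import torch
import torch.nn as nn

# Illustrative module sizes; the real architecture is a design choice of the paper.
OBS_DIM, ACT_DIM, LATENT_DIM = 8, 2, 16

encoder = nn.Sequential(nn.Linear(OBS_DIM, LATENT_DIM), nn.Tanh())
dynamics = nn.Sequential(nn.Linear(LATENT_DIM + ACT_DIM, LATENT_DIM), nn.Tanh())
value_head = nn.Linear(LATENT_DIM, 1)

def decision_aware_model_loss(s, a, s_next):
    """Train the latent model to be accurate in *value space*: the value
    predicted from the imagined next latent should match the value predicted
    from the encoded real next state (target detached)."""
    z = encoder(s)
    z_next_pred = dynamics(torch.cat([z, a], dim=-1))
    v_pred = value_head(z_next_pred)
    with torch.no_grad():
        v_target = value_head(encoder(s_next))
    return ((v_pred - v_target) ** 2).mean()

# Toy usage on a random batch.
s, a, s_next = torch.randn(32, OBS_DIM), torch.randn(32, ACT_DIM), torch.randn(32, OBS_DIM)
loss = decision_aware_model_loss(s, a, s_next)
loss.backward()
```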
Value Gradient weighted Model-Based Reinforcement Learning
Voelcker, Claas, Liao, Victor, Garg, Animesh, Farahmand, Amir-massoud
Model-based reinforcement learning (MBRL) is a sample-efficient technique to obtain control policies, yet unavoidable modeling errors often lead to performance deterioration. The model in MBRL is often fitted solely to reconstruct dynamics, state observations in particular, so the impact of model error on the policy is not captured by the training objective. This leads to a mismatch between the intended goal of MBRL, enabling good policy and value learning, and the target of the loss function employed in practice, future state prediction. Naive intuition would suggest that value-aware model learning would fix this problem and, indeed, several solutions to this objective mismatch problem have been proposed based on theoretical analysis. However, they tend to be inferior in practice to commonly used maximum likelihood (MLE) based approaches. In this paper, we propose Value-gradient weighted Model Learning (VaGraM), a novel method for value-aware model learning which improves the performance of MBRL in challenging settings, such as small model capacity and the presence of distracting state dimensions. We analyze both MLE and value-aware approaches, demonstrate how they fail to account for exploration and the behavior of function approximation when learning value-aware models, and highlight the additional goals that must be met to stabilize optimization in the deep learning setting. We verify our analysis by showing that our loss function is able to achieve high returns on the MuJoCo benchmark suite while being more robust than maximum likelihood based approaches.
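A minimal PyTorch sketch of a value-gradient weighted model loss in the spirit described above, assuming a deterministic one-step model; the elementwise gradient weighting and the network sizes are illustrative assumptions rather than the exact VaGraM loss.

```python
import torch
import torch.nn as nn

OBS_DIM, ACT_DIM = 8, 2  # illustrative sizes

value_fn = nn.Sequential(nn.Linear(OBS_DIM, 64), nn.Tanh(), nn.Linear(64, 1))
model = nn.Linear(OBS_DIM + ACT_DIM, OBS_DIM)  # deterministic one-step model

def value_gradient_weighted_loss(s, a, s_next):
    """Weight each dimension of the one-step prediction error by the gradient
    of the current value estimate at the observed next state, so errors in
    state dimensions the value function is insensitive to are down-weighted."""
    s_next = s_next.detach().requires_grad_(True)
    v = value_fn(s_next).sum()
    (grad_v,) = torch.autograd.grad(v, s_next)          # dV/ds' at the real next state
    s_next_pred = model(torch.cat([s, a], dim=-1))
    weighted_err = grad_v.detach() * (s_next_pred - s_next.detach())
    return (weighted_err ** 2).sum(dim=-1).mean()

# Toy usage on a random batch.
s, a, s_next = torch.randn(32, OBS_DIM), torch.randn(32, ACT_DIM), torch.randn(32, OBS_DIM)
loss = value_gradient_weighted_loss(s, a, s_next)
loss.backward()
```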
Operator Splitting Value Iteration
Rakhsha, Amin, Wang, Andrew, Ghavamzadeh, Mohammad, Farahmand, Amir-massoud
We introduce new planning and reinforcement learning algorithms for discounted MDPs that utilize an approximate model of the environment to accelerate the convergence of the value function. Inspired by the splitting approach in numerical linear algebra, we introduce Operator Splitting Value Iteration (OS-VI) for both policy evaluation and control problems. OS-VI achieves a much faster convergence rate when the model is accurate enough. We also introduce a sample-based version of the algorithm, called OS-Dyna.
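A minimal tabular sketch of the matrix-splitting idea applied to policy evaluation, assuming access to the true transition matrix P and an approximate model P_hat; the exact OS-VI operator, its control variant, and the sampled OS-Dyna algorithm are detailed in the paper.

```python
import numpy as np

def splitting_policy_evaluation(r, P, P_hat, gamma=0.9, n_iters=50):
    """Tabular policy evaluation with a matrix-splitting iteration.

    Solves V = r + gamma * P @ V by splitting P into an approximate model
    P_hat (solved exactly each iteration) plus the residual (P - P_hat):
        V_{k+1} = (I - gamma * P_hat)^{-1} (r + gamma * (P - P_hat) @ V_k)
    When P_hat is close to P, the effective contraction factor is much
    smaller than gamma, so convergence is much faster than value iteration.
    """
    n = len(r)
    A = np.eye(n) - gamma * P_hat
    V = np.zeros(n)
    for _ in range(n_iters):
        V = np.linalg.solve(A, r + gamma * (P - P_hat) @ V)
    return V

# Toy usage: a random 5-state Markov reward process and a perturbed model of it.
rng = np.random.default_rng(0)
P = rng.random((5, 5)); P /= P.sum(axis=1, keepdims=True)
P_hat = np.clip(P + 0.05 * rng.standard_normal((5, 5)), 1e-3, None)
P_hat /= P_hat.sum(axis=1, keepdims=True)
r = rng.random(5)
V = splitting_policy_evaluation(r, P, P_hat)
print(np.max(np.abs(V - np.linalg.solve(np.eye(5) - 0.9 * P, r))))  # near zero
```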
The act of remembering: a study in partially observable reinforcement learning
Icarte, Rodrigo Toro, Valenzano, Richard, Klassen, Toryn Q., Christoffersen, Phillip, Farahmand, Amir-massoud, McIlraith, Sheila A.
Reinforcement Learning (RL) agents typically learn memoryless policies---policies that only consider the last observation when selecting actions. Learning memoryless policies is efficient and optimal in fully observable environments. However, some form of memory is necessary when RL agents are faced with partial observability. In this paper, we study a lightweight approach to tackle partial observability in RL. We provide the agent with an external memory and additional actions to control what, if anything, is written to the memory. At every step, the current memory state is part of the agent's observation, and the agent selects a tuple of actions: one action that modifies the environment and another that modifies the memory. When the external memory is sufficiently expressive, optimal memoryless policies yield globally optimal solutions. Unfortunately, previous attempts to use external memory in the form of binary memory have produced poor results in practice. Here, we investigate alternative forms of memory in support of learning effective memoryless policies. Our novel forms of memory outperform binary and LSTM-based memory in well-established partially observable domains.
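A minimal sketch of the external-memory idea, assuming a classic gym-style environment interface and a simple overwrite-style binary memory; the alternative memory forms studied in the paper are richer than this.

```python
import numpy as np

class ExternalMemoryWrapper:
    """Augment a partially observable environment with k external memory bits.

    The agent observes (env_observation, memory) and acts with a tuple
    (env_action, memory_action); here the memory action simply overwrites the
    memory bits, which is only the basic binary-memory variant.
    """

    def __init__(self, env, num_bits=3):
        self.env = env
        self.memory = np.zeros(num_bits, dtype=np.int8)

    def reset(self):
        obs = self.env.reset()
        self.memory[:] = 0
        return (obs, self.memory.copy())

    def step(self, env_action, memory_action):
        self.memory[:] = memory_action          # write to the external memory
        obs, reward, done, info = self.env.step(env_action)
        return (obs, self.memory.copy()), reward, done, info
```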
Beyond Prioritized Replay: Sampling States in Model-Based RL via Simulated Priorities
Mei, Jincheng, Pan, Yangchen, White, Martha, Farahmand, Amir-massoud, Yao, Hengshuai
Model-based reinforcement learning (MBRL) can significantly improve sample efficiency, particularly when carefully choosing the states from which to sample hypothetical transitions. Such prioritization has been empirically shown to be useful for both experience replay (ER) and Dyna-style planning. However, there is still little theoretical understanding in RL of such prioritization strategies and why they help. In this work, we revisit prioritized ER and, in an ideal setting, show an equivalence to minimizing a cubic loss, providing theoretical insight into why it improves upon uniform sampling. This ideal setting, however, cannot be realized in practice, due to insufficient coverage of the sample space and outdated priorities of training samples. This motivates our model-based approach, which does not suffer from these limitations. Our key idea is to actively search for high-priority states using gradient ascent. Under certain conditions, we prove that the distribution of hypothetical experiences generated from these states provides a diverse set of states, sampled approximately in proportion to their true priorities. Our experiments on both benchmark and application-oriented domains show that our approach achieves superior performance over both the model-free prioritized ER method and several closely related model-based baselines.
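A minimal PyTorch sketch of searching for high-priority states by gradient ascent, assuming a deterministic learned model, a known reward function, and squared TD error as the priority; these choices are illustrative assumptions rather than the paper's exact formulation.

```python
import torch
import torch.nn as nn

OBS_DIM = 4  # illustrative

value_fn = nn.Sequential(nn.Linear(OBS_DIM, 64), nn.Tanh(), nn.Linear(64, 1))

def priority(s, model, reward_fn, gamma=0.99):
    """Squared TD error under a learned deterministic model, used as the priority."""
    s_next = model(s)
    td = reward_fn(s) + gamma * value_fn(s_next) - value_fn(s)
    return (td ** 2).squeeze(-1)

def search_high_priority_states(s_init, model, reward_fn, steps=20, lr=0.1):
    """Hill-climb states by gradient ascent on the priority, starting from
    states drawn from the replay buffer, to choose simulation start states."""
    s = s_init.clone().requires_grad_(True)
    for _ in range(steps):
        p = priority(s, model, reward_fn).sum()
        (g,) = torch.autograd.grad(p, s)
        s = (s + lr * g).detach().requires_grad_(True)
    return s.detach()

# Toy usage with a stand-in linear model and a quadratic reward.
model = nn.Linear(OBS_DIM, OBS_DIM)
reward_fn = lambda s: -(s ** 2).sum(dim=-1, keepdim=True)
start_states = torch.randn(16, OBS_DIM)
high_priority_states = search_high_priority_states(start_states, model, reward_fn)
```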
Hill Climbing on Value Estimates for Search-control in Dyna
Pan, Yangchen, Yao, Hengshuai, Farahmand, Amir-massoud, White, Martha
Dyna is an architecture for model-based reinforcement learning (RL), where simulated experience from a model is used to update policies or value functions. A key component of Dyna is search-control, the mechanism for generating the states and actions from which the agent queries the model, which remains largely unexplored. In this work, we propose to generate such states by using trajectories obtained from hill climbing (HC) on the current estimate of the value function. This has the effect of propagating value from high-value regions and of preemptively updating value estimates of the regions that the agent is likely to visit next. We derive a noisy stochastic projected gradient ascent algorithm for hill climbing, and highlight a connection to Langevin dynamics. We demonstrate empirically on four classical domains that our algorithm, HC-Dyna, can obtain significant sample-efficiency improvements. We study the properties of different sampling distributions for search-control, and find that there appears to be a benefit specifically from using the samples generated by climbing on current value estimates from low-value to high-value regions.
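A minimal PyTorch sketch of the search-control idea: noisy gradient ascent on a value estimate with projection onto a box of valid states; the step size, noise scale, and box bounds are illustrative assumptions.

```python
import torch
import torch.nn as nn

OBS_DIM = 4  # illustrative

value_fn = nn.Sequential(nn.Linear(OBS_DIM, 64), nn.Tanh(), nn.Linear(64, 1))

def hill_climb_states(s0, steps=30, step_size=0.05, noise_scale=0.01,
                      low=-1.0, high=1.0):
    """Generate search-control states by noisy gradient ascent on the value
    estimate, projecting back onto a box of valid states after each step
    (a Langevin-dynamics-like update)."""
    states, s = [], s0.clone()
    for _ in range(steps):
        s = s.detach().requires_grad_(True)
        v = value_fn(s).sum()
        (g,) = torch.autograd.grad(v, s)
        s = s + step_size * g + noise_scale * torch.randn_like(s)
        s = s.clamp(low, high)          # projection onto the valid state box
        states.append(s.detach())
    return torch.cat(states, dim=0)     # candidate states for model queries

# Toy usage: start the climb from a batch of states drawn from a replay buffer.
search_control_states = hill_climb_states(torch.randn(8, OBS_DIM))
```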
Improving Skin Condition Classification with a Visual Symptom Checker trained using Reinforcement Learning
Akrout, Mohamed, Farahmand, Amir-massoud, Jarmain, Tory, Abid, Latif
We present a visual symptom checker that combines a pre-trained Convolutional Neural Network (CNN) with a Reinforcement Learning (RL) agent acting as a Question Answering (QA) model. This method enables us not only to increase the classification confidence and accuracy of the visual symptom checker, but also to decrease the average number of relevant questions asked to narrow down the differential diagnosis. By incorporating the CNN output, in the form of classification probabilities, into the state of the simulated patient's environment, a DQN-based RL agent learns to ask about the symptom that maximizes its expected return. We demonstrate that our RL approach increases the accuracy by more than 20% compared to the CNN alone, and by up to 10% compared to the decision tree model. We finally show that the RL approach not only outperforms the decision tree approach but also narrows down the diagnosis faster in terms of the average number of questions asked.
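A minimal sketch of the kind of state representation and greedy question selection described above, assuming illustrative problem sizes, a simple encoding of symptom answers, and a masking rule against re-asking; it is not the paper's exact architecture.

```python
import torch
import torch.nn as nn

NUM_CONDITIONS, NUM_SYMPTOMS = 10, 20  # illustrative sizes

# State: CNN class probabilities concatenated with the symptom-answer vector
# (-1 = not asked yet, 0 = absent, 1 = present). Action: which symptom to ask about.
q_net = nn.Sequential(
    nn.Linear(NUM_CONDITIONS + NUM_SYMPTOMS, 128), nn.ReLU(),
    nn.Linear(128, NUM_SYMPTOMS),
)

def make_state(cnn_probs, symptom_answers):
    return torch.cat([cnn_probs, symptom_answers], dim=-1)

def select_symptom(cnn_probs, symptom_answers, asked_mask):
    """Greedy DQN action: ask about the unasked symptom with the highest Q-value."""
    q = q_net(make_state(cnn_probs, symptom_answers))
    q = q.masked_fill(asked_mask, float("-inf"))   # never re-ask a symptom
    return int(q.argmax())

# Toy usage.
cnn_probs = torch.softmax(torch.randn(NUM_CONDITIONS), dim=-1)
answers = -torch.ones(NUM_SYMPTOMS)
asked = torch.zeros(NUM_SYMPTOMS, dtype=torch.bool)
print(select_symptom(cnn_probs, answers, asked))
```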
Iterative Value-Aware Model Learning
Farahmand, Amir-massoud
This paper introduces a model-based reinforcement learning (MBRL) framework that incorporates the underlying decision problem in learning the transition model of the environment. This is in contrast with conventional approaches to MBRL that learn the model of the environment, for example by finding the maximum likelihood estimate, without taking into account the decision problem. The Value-Aware Model Learning (VAML) framework argues that this might not be a good idea, especially if the true model of the environment does not belong to the model class from which we are estimating the model. The original VAML framework, however, may result in an optimization problem that is difficult to solve. This paper introduces a new class of MBRL algorithms, called Iterative VAML, that benefits from the structure of how planning is performed (i.e., through approximate value iteration) to devise a simpler optimization problem. The paper theoretically analyzes Iterative VAML and provides a finite-sample error upper bound guarantee for it.
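A minimal PyTorch sketch of an Iterative-VAML-style model loss, assuming a deterministic learned model so that the expectation of the value under the model reduces to a point prediction; the network sizes are illustrative assumptions.

```python
import torch
import torch.nn as nn

OBS_DIM, ACT_DIM = 8, 2  # illustrative sizes

value_fn = nn.Sequential(nn.Linear(OBS_DIM, 64), nn.Tanh(), nn.Linear(64, 1))
model = nn.Linear(OBS_DIM + ACT_DIM, OBS_DIM)  # deterministic model for simplicity

def iter_vaml_loss(s, a, s_next):
    """Fit the model so that the *value* of its predicted next state matches the
    value of the observed next state under the current value estimate; only the
    model's parameters would be passed to the optimizer at this step."""
    s_next_pred = model(torch.cat([s, a], dim=-1))
    with torch.no_grad():
        v_target = value_fn(s_next)
    return ((value_fn(s_next_pred) - v_target) ** 2).mean()

# Toy usage on a random batch; in the full algorithm this model-fitting step
# alternates with an approximate value iteration step that updates value_fn.
s, a, s_next = torch.randn(32, OBS_DIM), torch.randn(32, ACT_DIM), torch.randn(32, OBS_DIM)
loss = iter_vaml_loss(s, a, s_next)
loss.backward()
```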