In 9 hours, Google's AlphaZero went from only knowing the rules of chess to beating the best models in the world. Chess has been studied by humans for over 1000 years, yet a reinforcement learning model was able to further our knowledge of the game in a negligible amount of time, using no prior knowledge aside from the game rules. No other machine learning field allows for such progress in this problem. Today, similar models by Google are being used in a wide variety of fields like predicting and detecting early signs of life-changing illnesses, improving text-to-speech systems, and more. Machine learning can be divided into 3 main paradigms.

Kusari, Arpan, How, Jonathan P.

A common approach for defining a reward function for Multi-objective Reinforcement Learning (MORL) problems is the weighted sum of the multiple objectives. The weights are then treated as design parameters dependent on the expertise (and preference) of the person performing the learning, with the typical result that a new solution is required for any change in these settings. This paper investigates the relationship between the reward function and the optimal value function for MORL; specifically addressing the question of how to approximate the optimal value function well beyond the set of weights for which the optimization problem was actually solved, thereby avoiding the need to recompute for any particular choice. We prove that the value function transforms smoothly given a transformation of weights of the reward function (and thus a smooth interpolation in the policy space). A Gaussian process is used to obtain a smooth interpolation over the reward function weights of the optimal value function for three well-known examples: GridWorld, Objectworld and Pendulum. The results show that the interpolation can provide very robust values for sample states and action space in discrete and continuous domain problems. Significant advantages arise from utilizing this interpolation technique in the domain of autonomous vehicles: easy, instant adaptation of user preferences while driving and true randomization of obstacle vehicle behavior preferences during training.

Note: All of the code is in the form of snippets and will not work when executed alone. The full code can be found on my Github repo. Reinforcement learning is the training of an agent to make decisions in an environment. An agent is deployed in an environment. At any given frame, the agent must use data from the environment to act.

Zhai, Yuexiang, Baek, Christina, Zhou, Zhengyuan, Jiao, Jiantao, Ma, Yi

Many hierarchical reinforcement learning (RL) applications have empirically verified that incorporating prior knowledge in reward design improves convergence speed and practical performance. We attempt to quantify the computational benefits of hierarchical RL from a planning perspective under assumptions about the intermediate state and intermediate rewards frequently (but often implicitly) adopted in practice. Our approach reveals a trade-off between computational complexity and the pursuit of the shortest path in hierarchical planning: using intermediate rewards significantly reduces the computational complexity in finding a successful policy but does not guarantee to find the shortest path, whereas using sparse terminal rewards finds the shortest path at a significantly higher computational cost. We also corroborate our theoretical results with extensive experiments on the MiniGrid environments using Q-learning and other popular deep RL algorithms.

Seijen, Harm Van, Fatemi, Mehdi, Romoff, Joshua, Laroche, Romain, Barnes, Tavian, Tsang, Jeffrey

One of the main challenges in reinforcement learning (RL) is generalisation. In typical deep RL methods this is achieved by approximating the optimal value function with a low-dimensional representation using a deep network. While this approach works well in many domains, in domains where the optimal value function cannot easily be reduced to a low-dimensional representation, learning can be very slow and unstable. This paper contributes towards tackling such challenging domains, by proposing a new method, called Hybrid Reward Architecture (HRA). HRA takes as input a decomposed reward function and learns a separate value function for each component reward function.