This paper describes the application of reinforcement learning (RL) to multi-product inventory management in supply chains. The problem description and solution are both adapted from a real-world business solution. The novelty of this problem with respect to the supply chain literature is that (i) we consider concurrent inventory management of a large number (50 to 1000) of products with shared capacity, (ii) we consider a multi-node supply chain consisting of a warehouse which supplies three stores, (iii) the warehouse, stores, and transportation from warehouse to stores have finite capacities, (iv) warehouse and store replenishment happen at different time scales and with realistic time lags, and (v) demand for products at the stores is stochastic. We describe a novel formulation in a multi-agent (hierarchical) reinforcement learning framework that can be used for parallelised decision-making, and use the advantage actor-critic (A2C) algorithm with quantised action spaces to solve the problem. Experiments show that the proposed approach is able to handle a multi-objective reward that combines maximising product sales with minimising wastage of perishable products.
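To make the ideas of a quantised action space and a multi-objective reward concrete, the following is a minimal single-product sketch. It is not the paper's formulation: the number of levels, the waste weight, the spoilage model, and the function names `decode_action` and `step_product` are all assumptions chosen for illustration, and the shared capacity across products is not modelled.

```python
# Hypothetical settings, not taken from the paper.
N_LEVELS = 5          # number of quantised replenishment levels per product
MAX_ORDER = 1.0       # largest order, as a fraction of shelf capacity
WASTE_WEIGHT = 0.5    # assumed trade-off between sales and wastage


def decode_action(index: int) -> float:
    """Map a discrete action index to a replenishment fraction in [0, MAX_ORDER]."""
    if not 0 <= index < N_LEVELS:
        raise ValueError("action index out of range")
    return MAX_ORDER * index / (N_LEVELS - 1)


def step_product(inventory, action_index, demand, spoil_fraction):
    """Simulate one period for one product.

    All quantities are fractions of shelf capacity. Returns the next
    inventory level and the scalar reward for this period.
    """
    # Replenish, but never beyond the shelf capacity of 1.0.
    stocked = min(1.0, inventory + decode_action(action_index))
    sales = min(stocked, demand)
    leftover = stocked - sales
    # Assumed spoilage model: a fixed fraction of unsold stock expires.
    wasted = spoil_fraction * leftover
    # Scalarised multi-objective reward: reward sales, penalise wastage.
    reward = sales - WASTE_WEIGHT * wasted
    return leftover - wasted, reward


next_inventory, reward = step_product(
    inventory=0.2, action_index=2, demand=0.5, spoil_fraction=0.5
)
```

Because each product's action is a small discrete index, the same policy network could in principle be evaluated per product in parallel, which is one plausible reading of the parallelised decision-making mentioned above.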
Reinforcement learning (RL) has been widely applied to game-playing and has surpassed the best human-level performance in many domains, yet there are few use-cases in industrial or commercial settings. We introduce OR-Gym, an open-source library for developing reinforcement learning algorithms to address operations research problems. In this paper, we apply reinforcement learning to the knapsack, multi-dimensional bin packing, multi-echelon supply chain, and multi-period asset allocation model problems, and benchmark the RL solutions against MILP and heuristic models. These problems arise in logistics, finance, and engineering, and are common in many business operation settings. We develop environments based on prototypical models in the literature and implement various optimization and heuristic models in order to benchmark the RL results. By re-framing a series of classic optimization problems as RL tasks, we seek to provide a new tool for the operations research community, while also opening those in the RL community to many of the problems and challenges in the OR field.
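As an illustration of what re-framing a classic optimization problem as an RL task can look like, here is a toy 0/1 knapsack environment with a greedy heuristic baseline. This is a self-contained sketch and does not use or reproduce the OR-Gym API: the class `KnapsackEnv`, its `reset`/`step` signatures, and the helper names are hypothetical, and item weights are assumed to be positive.

```python
class KnapsackEnv:
    """Toy 0/1 knapsack posed as an episodic RL task (illustrative only)."""

    def __init__(self, weights, values, capacity):
        self.weights, self.values, self.capacity = weights, values, capacity
        self.reset()

    def reset(self):
        self.remaining = self.capacity
        self.available = [True] * len(self.weights)
        return self._state()

    def _state(self):
        # State: remaining capacity plus which items are still unpacked.
        return (self.remaining, tuple(self.available))

    def feasible_items(self):
        """Indices of unpacked items that still fit in the knapsack."""
        return [
            i for i, (free, w) in enumerate(zip(self.available, self.weights))
            if free and w <= self.remaining
        ]

    def step(self, item):
        """Action = index of the item to pack; returns (state, reward, done)."""
        if item not in self.feasible_items():
            return self._state(), 0.0, True   # infeasible action ends the episode
        self.available[item] = False
        self.remaining -= self.weights[item]
        # The episode ends once nothing else fits.
        return self._state(), float(self.values[item]), not self.feasible_items()


def greedy_policy(env):
    """Heuristic baseline: pack the feasible item with the best value/weight ratio."""
    return max(env.feasible_items(), key=lambda i: env.values[i] / env.weights[i])


def run_episode(env, policy):
    """Roll out one episode and return the total packed value."""
    env.reset()
    total = 0.0
    done = not env.feasible_items()
    while not done:
        _, reward, done = env.step(policy(env))
        total += reward
    return total


env = KnapsackEnv(weights=[2, 3, 4], values=[3, 4, 6], capacity=5)
baseline_value = run_episode(env, greedy_policy)
```

A learned policy would replace `greedy_policy` in `run_episode`, and its return could then be compared against the heuristic and against an exact solver, which is the kind of benchmarking the abstract describes.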
Supply chain and price management were among the first areas of enterprise operations that adopted data science and combinatorial optimization methods and have a long history of using these techniques with great success. Although a wide range of traditional optimization methods are available for inventory and price management applications, deep reinforcement learning has the potential to substantially improve the optimization capabilities for these and other types of enterprise operations due to impressive recent advances in the development of generic self-learning algorithms for optimal control. In this article, we explore how deep reinforcement learning methods can be applied in several basic supply chain and price management scenarios. The traditional price optimization process in retail or manufacturing environments is typically framed as a what-if analysis of different pricing scenarios using some sort of demand model. In many cases, the development of a demand model is challenging because it has to properly capture a wide range of factors and variables that influence demand, including regular prices, discounts, marketing activities, seasonality, competitor prices, cross-product cannibalization, and halo effects. Once the demand model is developed, however, the optimization process for pricing decisions is relatively straightforward, and standard techniques such as linear or integer programming typically suffice. For instance, consider an apparel retailer that purchases a seasonal product at the beginning of the season and has to sell it out by the end of the period. Assuming that a retailer chooses pricing levels from a discrete set (e.g., \$59.90, \$69.90, etc.) and can make price changes frequently (e.g., weekly), we can pose the following optimization problem: The first constraint ensures that each time interval has only one price, and the second constraint ensures that all demands sum up to the available stock level.
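The optimization problem described above can be sketched by brute force for a tiny instance. The sketch assumes the objective is total revenue, which is not stated explicitly in the text; the two constraints are the ones described (exactly one price per time interval, and total demand equal to the available stock). The third price level, the demand table, the stock level, and the function name `best_schedule` are hypothetical values chosen for illustration.

```python
from itertools import product

# Discrete price levels; the first two come from the text, the third is assumed.
prices = [59.90, 69.90, 79.90]

# demand[t][j]: hypothetical units sold in week t at price level j,
# standing in for the output of a demand model.
demand = [
    [40, 30, 20],
    [35, 25, 15],
    [30, 20, 10],
]
stock = 75  # hypothetical number of units that must be sold out


def best_schedule(prices, demand, stock):
    """Enumerate all price schedules and return (revenue, schedule) or None.

    A schedule assigns exactly one price index to each week (first constraint)
    and is kept only if the demands sum to the stock level (second constraint).
    """
    best = None
    for schedule in product(range(len(prices)), repeat=len(demand)):
        sold = sum(demand[t][j] for t, j in enumerate(schedule))
        if sold != stock:
            continue
        revenue = sum(prices[j] * demand[t][j] for t, j in enumerate(schedule))
        if best is None or revenue > best[0]:
            best = (revenue, schedule)
    return best


revenue, schedule = best_schedule(prices, demand, stock)
```

Enumeration grows exponentially with the number of weeks, so for realistic horizons the same formulation would normally be handed to a linear or integer programming solver, as the text notes.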
Continuous control is a widely applicable area of reinforcement learning. The dominant methods in this area are actor-critic algorithms, which commonly rely on policy gradients of neural approximators. The focus of our study is to show the characteristics of the actor loss function, which is the essential part of the optimization. We exploit low-dimensional visualizations of the loss function and provide comparisons of the loss landscapes of various algorithms. Furthermore, we apply our approach to multi-store dynamic inventory control, a notoriously difficult problem in supply chain operations, and explore the shape of the loss function associated with the optimal policy. We model and solve the problem using reinforcement learning and obtain a loss landscape that favors optimality.
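One common way to obtain a low-dimensional view of a loss function is to evaluate it along a single direction in parameter space. The sketch below shows that idea only; it is not necessarily the visualization method used in the study. The quadratic `toy_actor_loss` is a stand-in for a real actor loss, and `loss_slice`, `theta`, and `direction` are hypothetical names.

```python
import math
import random


def loss_slice(loss_fn, theta, direction, alphas):
    """Evaluate loss_fn along the line theta + alpha * (unit direction)."""
    norm = math.sqrt(sum(d * d for d in direction))
    unit = [d / norm for d in direction]
    return [
        loss_fn([p + a * u for p, u in zip(theta, unit)])
        for a in alphas
    ]


def toy_actor_loss(params):
    """Hypothetical stand-in for an actor loss: a quadratic bowl centred at 1."""
    return sum((p - 1.0) ** 2 for p in params)


random.seed(0)
theta = [1.0] * 10                                   # pretend these are trained actor weights
direction = [random.gauss(0.0, 1.0) for _ in theta]  # random direction in parameter space
alphas = [i / 10 - 1.0 for i in range(21)]           # step sizes from -1.0 to 1.0
losses = loss_slice(toy_actor_loss, theta, direction, alphas)
```

Plotting `losses` against `alphas` gives a one-dimensional slice of the landscape; using two directions and a grid of step sizes gives the familiar two-dimensional surface plot.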
We present a scheme for sequential decision making with a risk-sensitive objective and constraints in a dynamic environment. A neural network is trained as an approximator of the mapping from the parameter space to the space of risk and policy with risk-sensitive constraints. For a given risk-sensitive problem, in which the objective and constraints are, or can be estimated by, functions of the mean and variance of return, we generate a synthetic dataset as training data. Parameters defining a targeted process might be dynamic, i.e., they might vary over time, so we sample them within specified intervals to deal with these dynamics. We show that (i) most risk measures can be estimated using return variance; (ii) by virtue of the state-augmentation transformation, practical problems modeled by Markov decision processes with stochastic rewards can be solved in a risk-sensitive scenario; and (iii) the proposed scheme is validated by a numerical experiment.
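As a small illustration of how a risk-sensitive quantity can be expressed through the mean and variance of return, the sketch below computes a mean-variance score and a distribution-free quantile bound based on Cantelli's inequality. These are standard textbook constructions offered as an example, not the estimators used in the paper; the function names and the `risk_aversion` parameter are assumptions.

```python
import math
import statistics


def mean_variance_score(returns, risk_aversion):
    """Mean return penalised by its (population) variance."""
    mean = statistics.fmean(returns)
    var = statistics.pvariance(returns)
    return mean - risk_aversion * var


def cantelli_loss_quantile_bound(mean_loss, var_loss, alpha):
    """Upper bound on the alpha-quantile of a loss, using only mean and variance.

    Cantelli's inequality gives P(X >= mu + k * sigma) <= 1 / (1 + k**2).
    Choosing k = sqrt(alpha / (1 - alpha)) makes the right-hand side 1 - alpha,
    so the alpha-quantile cannot exceed mu + k * sigma.
    """
    if not 0.0 < alpha < 1.0:
        raise ValueError("alpha must lie strictly between 0 and 1")
    return mean_loss + math.sqrt(var_loss * alpha / (1.0 - alpha))


sampled_returns = [1.0, 2.0, 3.0, 4.0]  # hypothetical episode returns
score = mean_variance_score(sampled_returns, risk_aversion=0.5)
bound = cantelli_loss_quantile_bound(mean_loss=0.0, var_loss=1.0, alpha=0.8)
```

The bound holds for any distribution with the given mean and variance, so it is conservative; sharper estimates need additional distributional assumptions.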