Meisheri, Hardik
Multi-Agent Learning of Efficient Fulfilment and Routing Strategies in E-Commerce
Shelke, Omkar, Pathakota, Pranavi, Chauhan, Anandsingh, Khadilkar, Harshad, Meisheri, Hardik, Ravindran, Balaraman
This paper presents an integrated algorithmic framework for minimising product delivery costs in e-commerce (known as the cost-to-serve or C2S). One of the major challenges in e-commerce is the large volume of spatio-temporally diverse orders from multiple customers, each of which has to be fulfilled from one of several warehouses using a fleet of vehicles. This results in two levels of decision-making: (i) selection of a fulfillment node for each order (including the option of deferral to a future time), and then (ii) routing of vehicles (each of which can carry multiple orders originating from the same warehouse). We propose an approach that combines graph neural networks and reinforcement learning to train the node selection and vehicle routing agents. We include real-world constraints such as warehouse inventory capacity; vehicle characteristics including travel times, service times, and carrying capacity; and customer constraints such as delivery time windows. The complexity of this problem arises from the fact that outcomes (rewards) are driven both by the fulfillment node mapping and by the routing algorithm, and are spatio-temporally distributed. Our experiments show that this algorithmic pipeline outperforms pure heuristic policies.
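To make the two-level structure concrete, below is a minimal sketch of one decision period under toy data structures; the names (select_fulfilment_node, route_vehicles, fulfilment_step) are hypothetical, and the learned GNN/RL agents from the paper are replaced by simple greedy stand-ins.

from collections import defaultdict

def select_fulfilment_node(order, warehouses):
    # Stub for the learned node-selection agent: choose the closest warehouse
    # that still has stock of the ordered SKU, or defer the order (return None).
    feasible = [w for w in warehouses if w["inventory"].get(order["sku"], 0) > 0]
    if not feasible:
        return None
    return min(feasible, key=lambda w: w["dist"][order["customer"]])

def route_vehicles(orders, capacity=3):
    # Stub for the routing agent: pack a warehouse's assigned orders into vehicles
    # of fixed carrying capacity (real routing would also respect travel times,
    # service times, and delivery time windows).
    return [orders[i:i + capacity] for i in range(0, len(orders), capacity)]

def fulfilment_step(orders, warehouses):
    # One decision period: (i) map each order to a fulfilment node or defer it,
    # then (ii) build vehicle routes separately per warehouse.
    assigned, deferred = defaultdict(list), []
    for order in orders:
        node = select_fulfilment_node(order, warehouses)
        if node is None:
            deferred.append(order)
        else:
            node["inventory"][order["sku"]] -= 1
            assigned[node["name"]].append(order)
    routes = {name: route_vehicles(batch) for name, batch in assigned.items()}
    return routes, deferred

warehouses = [
    {"name": "W1", "inventory": {"sku1": 2}, "dist": {"c1": 4.0, "c2": 9.0}},
    {"name": "W2", "inventory": {"sku1": 1}, "dist": {"c1": 7.0, "c2": 2.0}},
]
orders = [{"sku": "sku1", "customer": c} for c in ("c1", "c2", "c1", "c2")]
print(fulfilment_step(orders, warehouses))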
DCT: Dual Channel Training of Action Embeddings for Reinforcement Learning with Large Discrete Action Spaces
Pathakota, Pranavi, Meisheri, Hardik, Khadilkar, Harshad
The ability to learn robust policies while generalizing over large discrete action spaces is an open challenge for intelligent systems, especially in noisy environments that face the curse of dimensionality. In this paper, we present a novel framework to efficiently learn action embeddings that simultaneously allow us to reconstruct the original action and to predict the expected future state. We describe an encoder-decoder architecture for action embeddings with a dual-channel loss that balances action reconstruction against state prediction accuracy. We use the trained decoder in conjunction with a standard reinforcement learning algorithm that produces actions in the embedding space. Our architecture outperforms two competitive baselines in two diverse environments: a 2D maze environment with more than 4000 discrete noisy actions, and a product recommendation task that uses real-world e-commerce transaction data. Empirical results show that the model yields cleaner action embeddings, and the improved representations help learn better policies with earlier convergence.
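As a rough illustration of the dual-channel idea, the sketch below pairs an action-embedding encoder with two heads, one reconstructing the discrete action and one predicting the next state; the network sizes, the embedding-table encoder, and the loss weighting alpha are assumptions, not the paper's actual architecture.

import torch
import torch.nn as nn

class DualChannelEmbedder(nn.Module):
    # Illustrative dual-channel model: one head reconstructs the action taken,
    # the other predicts the next state from the embedding and current state.
    def __init__(self, n_actions, state_dim, embed_dim=16, hidden=64):
        super().__init__()
        self.encoder = nn.Embedding(n_actions, embed_dim)
        self.action_decoder = nn.Sequential(nn.Linear(embed_dim, hidden), nn.ReLU(),
                                            nn.Linear(hidden, n_actions))
        self.state_predictor = nn.Sequential(nn.Linear(embed_dim + state_dim, hidden), nn.ReLU(),
                                             nn.Linear(hidden, state_dim))

    def forward(self, action, state):
        z = self.encoder(action)
        action_logits = self.action_decoder(z)
        next_state_pred = self.state_predictor(torch.cat([z, state], dim=-1))
        return action_logits, next_state_pred

def dual_channel_loss(model, action, state, next_state, alpha=0.5):
    # alpha trades off action reconstruction against next-state prediction.
    logits, pred = model(action, state)
    recon = nn.functional.cross_entropy(logits, action)
    dynamics = nn.functional.mse_loss(pred, next_state)
    return alpha * recon + (1.0 - alpha) * dynamics

# Tiny usage example on random transitions.
model = DualChannelEmbedder(n_actions=4096, state_dim=8)
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
a = torch.randint(0, 4096, (32,))
s, s_next = torch.randn(32, 8), torch.randn(32, 8)
loss = dual_channel_loss(model, a, s, s_next)
opt.zero_grad(); loss.backward(); opt.step()
print(float(loss))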
Follow your Nose: Using General Value Functions for Directed Exploration in Reinforcement Learning
Kalwar, Durgesh, Shelke, Omkar, Nath, Somjit, Meisheri, Hardik, Khadilkar, Harshad
Improving sample efficiency is a key challenge in reinforcement learning, especially in environments with large state spaces and sparse rewards. In the literature, this is addressed either through the use of auxiliary tasks (subgoals) or through clever exploration strategies. Exploration methods have been used to sample better trajectories in large environments, while auxiliary tasks have been incorporated where the reward is sparse. However, few studies have attempted to tackle both large scale and reward sparsity at the same time. This paper explores the idea of combining exploration with auxiliary task learning using General Value Functions (GVFs) and a directed exploration strategy. We present a way to learn value functions that can be used to sample actions and provide directed exploration. Experiments on navigation tasks with varying grid sizes demonstrate performance advantages over several competitive baselines.
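A tabular toy version of the mechanism, assuming made-up grid sizes and cumulants: alongside the main action-value table, one value table per GVF is learned with TD updates, and exploratory actions follow a greedy GVF rather than being chosen uniformly at random. This is only a sketch of the idea, not the paper's function-approximation setup.

import numpy as np

rng = np.random.default_rng(0)
n_states, n_actions, n_gvfs = 25, 4, 2          # tiny grid, illustrative sizes

Q = np.zeros((n_states, n_actions))             # main-task action values
G = np.zeros((n_gvfs, n_states, n_actions))     # one value table per auxiliary cumulant

def directed_action(s, eps=0.2):
    # Instead of a uniformly random exploratory action, follow whichever
    # auxiliary GVF is greedy in this state ("follow your nose").
    if rng.random() < eps:
        g = rng.integers(n_gvfs)
        return int(np.argmax(G[g, s]))
    return int(np.argmax(Q[s]))

def update(s, a, r, cumulants, s_next, alpha=0.1, gamma=0.99):
    # Q-learning update for the main task plus one TD update per GVF,
    # each GVF tracking its own pseudo-reward (cumulant).
    Q[s, a] += alpha * (r + gamma * Q[s_next].max() - Q[s, a])
    for g, c in enumerate(cumulants):
        G[g, s, a] += alpha * (c + gamma * G[g, s_next].max() - G[g, s, a])

# One illustrative transition with made-up cumulants
# (e.g. "reached a doorway", "moved closer to the goal").
s = 0
a = directed_action(s)
update(s, a, r=0.0, cumulants=[1.0, 0.0], s_next=1)
print(Q[0], G[:, 0, a])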
Using Contrastive Samples for Identifying and Leveraging Possible Causal Relationships in Reinforcement Learning
Khadilkar, Harshad, Meisheri, Hardik
A significant challenge in reinforcement learning is quantifying the complex relationship between actions and long-term rewards. The effects may manifest themselves over a long sequence of state-action pairs, making them hard to pinpoint. In this paper, we propose a method to link transitions that produce significant deviations in state to unusually large variations in subsequent rewards. Such transitions are marked as possible causal effects, and the corresponding state-action pairs are added to a separate replay buffer. In addition, we include contrastive samples corresponding to transitions from a similar state but with differing actions. Training with this Contrastive Experience Replay (CER) is shown to outperform standard value-based methods on 2D navigation tasks. We believe that CER can be useful for a broad class of learning tasks, and can be combined with any off-policy reinforcement learning algorithm.
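A minimal sketch of how such a buffer might be maintained, with hypothetical thresholds and a toy notion of "similar state" (Euclidean distance): transitions with a large state change and a large subsequent return go into the CER buffer together with nearby transitions that took a different action, and training batches mix the two buffers.

import numpy as np
from collections import deque

rng = np.random.default_rng(0)
main_buffer, cer_buffer = deque(maxlen=10000), deque(maxlen=2000)

def maybe_add_to_cer(transition, recent, state_thresh=1.0, reward_thresh=1.0):
    # Flag transitions with a large state change and an unusually large subsequent
    # return as possible causal effects; also store contrastive transitions from
    # nearby states that took a different action.
    s, a, r, s_next, future_return = transition
    if np.linalg.norm(s_next - s) > state_thresh and abs(future_return) > reward_thresh:
        cer_buffer.append(transition)
        for t in recent:
            if np.linalg.norm(t[0] - s) < 0.5 and t[1] != a:
                cer_buffer.append(t)

def sample_batch(batch_size=32, cer_fraction=0.25):
    # Mix ordinary replay with CER samples during training.
    n_cer = min(int(batch_size * cer_fraction), len(cer_buffer))
    idx_main = rng.integers(len(main_buffer), size=batch_size - n_cer)
    idx_cer = rng.integers(len(cer_buffer), size=n_cer) if n_cer else []
    return [main_buffer[i] for i in idx_main] + [cer_buffer[i] for i in idx_cer]

# Toy usage: two transitions from almost the same state with different actions.
t1 = (np.zeros(2), 0, 0.0, np.array([2.0, 0.0]), 5.0)
t2 = (np.array([0.1, 0.0]), 1, 0.0, np.array([0.1, 0.1]), 0.0)
main_buffer.extend([t1, t2])
maybe_add_to_cer(t1, recent=[t2])
print(len(cer_buffer), len(sample_batch(4)))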
Learning to Minimize Cost-to-Serve for Multi-Node Multi-Product Order Fulfilment in Electronic Commerce
Pathakota, Pranavi, Zaid, Kunwar, Dhara, Anulekha, Meisheri, Hardik, Souza, Shaun D, Shah, Dheeraj, Khadilkar, Harshad
We describe a novel decision-making problem developed in response to the demands of retail electronic commerce (e-commerce). While working with logistics and retail industry business collaborators, we found that the cost of delivering products from the most opportune node in the supply chain (a quantity called the cost-to-serve or CTS) is a key challenge. The large scale, high stochasticity, and wide geographical spread of e-commerce supply chains make this setting ideal for a carefully designed data-driven decision-making algorithm. In this preliminary work, we focus on the specific subproblem of delivering multiple products in arbitrary quantities from any warehouse to multiple customers in each time period. We compare the relative performance and computational efficiency of several baselines, including heuristics and mixed-integer linear programming. We show that a reinforcement learning-based algorithm is competitive with these policies, with the potential for efficient scale-up in the real world.
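For intuition, a greedy per-period heuristic of the kind used as a baseline might look like the sketch below, which ships each requested quantity from the currently cheapest warehouse that still has stock; the data layout and the absence of an unmet-demand penalty are simplifying assumptions.

def greedy_cost_to_serve(demand, stock, unit_cost):
    # demand[c][p]   : units of product p ordered by customer c this period
    # stock[w][p]    : units of product p available at warehouse w
    # unit_cost[w][c]: per-unit delivery cost from warehouse w to customer c
    total = 0.0
    for c, products in demand.items():
        for p, qty in products.items():
            while qty > 0:
                feasible = [w for w in stock if stock[w].get(p, 0) > 0]
                if not feasible:          # unmet demand; a penalty could be added here
                    break
                w = min(feasible, key=lambda w: unit_cost[w][c])
                ship = min(qty, stock[w][p])
                stock[w][p] -= ship
                qty -= ship
                total += ship * unit_cost[w][c]
    return total

demand = {"c1": {"p1": 3}, "c2": {"p1": 2, "p2": 1}}
stock = {"w1": {"p1": 4, "p2": 1}, "w2": {"p1": 2}}
unit_cost = {"w1": {"c1": 1.0, "c2": 3.0}, "w2": {"c1": 2.0, "c2": 1.5}}
print(greedy_cost_to_serve(demand, stock, unit_cost))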
School of hard knocks: Curriculum analysis for Pommerman with a fixed computational budget
Shelke, Omkar, Meisheri, Hardik, Khadilkar, Harshad
Pommerman is a hybrid cooperative/adversarial multi-agent environment, with challenging characteristics in terms of partial observability, limited or no communication, sparse and delayed rewards, and restrictive computational time limits. This makes it a challenging environment for reinforcement learning (RL) approaches. In this paper, we focus on developing a curriculum for learning a robust and promising policy within a constrained computational budget of 100,000 games, starting from a fixed base policy (which is itself trained to imitate a noisy expert policy). All RL algorithms starting from the base policy use vanilla proximal policy optimization (PPO) with the same reward function, and the only difference between their training is the mix and sequence of opponent policies. One might expect that beginning training against simpler opponents and gradually increasing the opponent difficulty would facilitate faster learning and lead to more robust policies than a baseline in which all available opponent policies are introduced from the start. We test this hypothesis and show that within constrained computational budgets, it is in fact better to "learn in the school of hard knocks", i.e., against all available opponent policies nearly from the start. We also include ablation studies in which we examine the effect of modifying the base environment properties of ammo and bomb blast strength on agent performance.
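The two training schedules being compared can be sketched as opponent samplers over the fixed game budget; the opponent names and phase fractions below are hypothetical, and the PPO training loop itself is omitted.

import random

def curriculum_schedule(budget, phases):
    # Gradually harden opponents: `phases` is a list of (fraction, opponent_pool) pairs.
    games = []
    for frac, pool in phases:
        games += [random.choice(pool) for _ in range(int(budget * frac))]
    return games

def hard_knocks_schedule(budget, all_opponents):
    # "School of hard knocks": sample from every available opponent from the start.
    return [random.choice(all_opponents) for _ in range(budget)]

BUDGET = 100_000  # fixed budget of training games, as in the paper
easy, medium, hard = "static", "simple_rule_based", "strong_rule_based"  # hypothetical opponents
curriculum = curriculum_schedule(BUDGET, [(0.3, [easy]), (0.3, [easy, medium]), (0.4, [easy, medium, hard])])
baseline = hard_knocks_schedule(BUDGET, [easy, medium, hard])
print(curriculum[:3], baseline[:3])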
Reinforcement Learning for Multi-Product Multi-Node Inventory Management in Supply Chains
Sultana, Nazneen N, Meisheri, Hardik, Baniwal, Vinita, Nath, Somjit, Ravindran, Balaraman, Khadilkar, Harshad
This paper describes the application of reinforcement learning (RL) to multi-product inventory management in supply chains. The problem description and solution are both adapted from a real-world business solution. The novelty of this problem with respect to the supply chain literature is that (i) we consider concurrent inventory management of a large number (50 to 1000) of products with shared capacity, (ii) we consider a multi-node supply chain consisting of a warehouse that supplies three stores, (iii) the warehouse, stores, and transportation from warehouse to stores have finite capacities, (iv) warehouse and store replenishment happen at different time scales and with realistic time lags, and (v) demand for products at the stores is stochastic. We describe a novel formulation in a multi-agent (hierarchical) reinforcement learning framework that can be used for parallelised decision-making, and use the advantage actor-critic (A2C) algorithm with quantised action spaces to solve the problem. Experiments show that the proposed approach is able to handle a multi-objective reward that combines maximising product sales and minimising wastage of perishable products.
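A small sketch of what a quantised, per-product action space with a shared transport constraint could look like; the quantisation levels, product count, and proportional scaling rule are assumptions for illustration, not the paper's exact formulation.

import numpy as np

# Quantised action space: each product's replenishment is one of a few discrete
# fractions of its maximum order quantity (the levels below are an assumption).
LEVELS = np.array([0.0, 0.25, 0.5, 0.75, 1.0])

def replenishment_from_actions(action_idx, max_order, truck_capacity):
    # One discrete action per product (chosen in parallel by the learned policy);
    # orders are then scaled down if they exceed shared transport capacity.
    raw = LEVELS[action_idx] * max_order
    total = raw.sum()
    if total > truck_capacity:
        raw *= truck_capacity / total
    return raw

rng = np.random.default_rng(0)
n_products = 50
actions = rng.integers(len(LEVELS), size=n_products)   # stand-in for the A2C policy's output
orders = replenishment_from_actions(actions, max_order=np.full(n_products, 20.0), truck_capacity=300.0)
print(orders.sum() <= 300.0 + 1e-9, orders[:5])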
Reinforcement Learning for Multi-Objective Optimization of Online Decisions in High-Dimensional Systems
Meisheri, Hardik, Baniwal, Vinita, Sultana, Nazneen N, Ravindran, Balaraman, Khadilkar, Harshad
This paper describes a purely data-driven solution to a class of sequential decision-making problems with a large number of concurrent online decisions, with applications to computing systems and operations research. We assume that while the micro-level behaviour of the system can be broadly captured by analytical expressions or simulation, the macro-level or emergent behaviour is complicated by non-linearity, constraints, and stochasticity. If we represent the set of concurrent decisions to be computed as a vector, each element of the vector is assumed to be a continuous variable, and the number of such elements is arbitrarily large and varies from one problem instance to another. We first formulate the decision-making problem as a canonical reinforcement learning (RL) problem, which can be solved using purely data-driven techniques. We modify a standard approach, advantage actor-critic (A2C), to ensure its suitability to the problem at hand, and compare its performance to that of baseline approaches on the specific instance of a multi-product inventory management task. The key modifications include a parallelised formulation of the decision-making task and a training procedure that explicitly recognises the quantitative relationship between different decisions. We also present experimental results probing the learned policies and their robustness to variations in the data.
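One way to realise a parallelised formulation is to apply a single shared actor to each element of the decision vector, concatenated with global features, so the policy is agnostic to the number of concurrent decisions; the sketch below (hypothetical SharedActor, feature sizes, and sigmoid output) illustrates only that idea, not the paper's full A2C modifications.

import torch
import torch.nn as nn

class SharedActor(nn.Module):
    # Illustrative parallelised actor: the same small network is applied to every
    # element of the decision vector, so the number of concurrent decisions can
    # vary between problem instances without changing the model.
    def __init__(self, item_dim, global_dim, hidden=32):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(item_dim + global_dim, hidden), nn.ReLU(),
                                 nn.Linear(hidden, 1), nn.Sigmoid())

    def forward(self, item_feats, global_feats):
        # item_feats: (n_items, item_dim); global_feats: (global_dim,)
        g = global_feats.unsqueeze(0).expand(item_feats.shape[0], -1)
        return self.net(torch.cat([item_feats, g], dim=-1)).squeeze(-1)

actor = SharedActor(item_dim=4, global_dim=3)
for n_items in (50, 1000):                      # decision vector length varies per instance
    decisions = actor(torch.randn(n_items, 4), torch.randn(3))
    print(n_items, decisions.shape)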