AITopics

2507.2204

Country: North America > United States (0.92)

Genre: Research Report (1.00)

Industry: Leisure & Entertainment (0.45)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.93)

Jiang, Hansheng, Sun, Chunlin, Shen, Zuo-Jun Max, Jiang, Shunan

Learning While Repositioning in On-Demand Vehicle Sharing Networks

arXiv.org Machine LearningJan-31-2025

We consider a network inventory problem motivated by one-way, on-demand vehicle sharing services. Due to uncertainties in both demand and returns, as well as a fixed number of rental units across an $n$-location network, the service provider must periodically reposition vehicles to match supply with demand spatially while minimizing costs. The optimal repositioning policy under a general $n$-location network is intractable without knowing the optimal value function. We introduce the best base-stock repositioning policy as a generalization of the classical inventory control policy to $n$ dimensions, and establish its asymptotic optimality in two distinct limiting regimes under general network structures. We present reformulations to efficiently compute this best base-stock policy in an offline setting with pre-collected data. In the online setting, we show that a natural Lipschitz-bandit approach achieves a regret guarantee of $\widetilde{O}(T^{\frac{n}{n+1}})$, which suffers from the exponential dependence on $n$. We illustrate the challenges of learning with censored data in networked systems through a regret lower bound analysis and by demonstrating the suboptimality of alternative algorithmic approaches. Motivated by these challenges, we propose an Online Gradient Repositioning algorithm that relies solely on censored demand. Under a mild cost-structure assumption, we prove that it attains an optimal regret of $O(n^{2.5} \sqrt{T})$, which matches the regret lower bound in $T$ and achieves only polynomial dependence on $n$. The key algorithmic innovation involves proposing surrogate costs to disentangle intertemporal dependencies and leveraging dual solutions to find the gradient of policy change. Numerical experiments demonstrate the effectiveness of our proposed methods.

artificial intelligence, data mining, machine learning, (17 more...)

2501.19208

Country:

Europe (0.67)
North America > United States > California (0.45)

Genre: Research Report (1.00)

Industry:

Law > Civil Rights & Constitutional Law (0.55)
Transportation > Infrastructure & Services (0.45)
Energy > Oil & Gas > Upstream (0.45)

Technology:

Information Technology > Communications > Networks (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning > Optimization (1.00)
Information Technology > Artificial Intelligence > Machine Learning (1.00)
(2 more...)

Temizöz, Tarkan, Imdahl, Christina, Dijkman, Remco, Lamghari-Idrissi, Douniel, van Jaarsveld, Willem

Zero-shot Generalization in Inventory Management: Train, then Estimate and Decide

arXiv.org Artificial IntelligenceNov-1-2024

Deploying deep reinforcement learning (DRL) in real-world inventory management presents challenges, including dynamic environments and uncertain problem parameters, e.g. demand and lead time distributions. These challenges highlight a research gap, suggesting a need for a unifying framework to model and solve sequential decision-making under parameter uncertainty. We address this by exploring an underexplored area of DRL for inventory management: training generally capable agents (GCAs) under zero-shot generalization (ZSG). Here, GCAs are advanced DRL policies designed to handle a broad range of sampled problem instances with diverse inventory challenges. ZSG refers to the ability to successfully apply learned policies to unseen instances with unknown parameters without retraining. We propose a unifying Super-Markov Decision Process formulation and the Train, then Estimate and Decide (TED) framework to train and deploy a GCA tailored to inventory management applications. The TED framework consists of three phases: training a GCA on varied problem instances, continuously estimating problem parameters during deployment, and making decisions based on these estimates. Applied to periodic review inventory problems with lost sales, cyclic demand patterns, and stochastic lead times, our trained agent, the Generally Capable Lost Sales Network (GC-LSN) consistently outperforms well-known traditional policies when problem parameters are known. Moreover, under conditions where demand and/or lead time distributions are initially unknown and must be estimated, we benchmark against online learning methods that provide worst-case performance guarantees. Our GC-LSN policy, paired with the Kaplan-Meier estimator, is demonstrated to complement these methods by providing superior empirical performance.

machine learning, parameterization, reinforcement learning, (19 more...)

2411.00515

Country:

Europe > Netherlands > North Brabant > Eindhoven (0.04)
North America > United States > Arizona > Maricopa County > Chandler (0.04)
Europe > Switzerland (0.04)

Genre:

Overview (0.92)
Research Report > New Finding (0.45)
Research Report > Experimental Study (0.45)

Industry: Education (0.48)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (1.00)

Wan, Jia, Sinclair, Sean R., Shah, Devavrat, Wainwright, Martin J.

Exploiting Exogenous Structure for Sample-Efficient Reinforcement Learning

arXiv.org Machine LearningOct-14-2024

We study a class of structured Markov Decision Processes (MDPs) known as Exo-MDPs. They are characterized by a partition of the state space into two components: the exogenous states evolve stochastically in a manner not affected by the agent's actions, whereas the endogenous states can be affected by actions, and evolve according to deterministic dynamics involving both the endogenous and exogenous states. Exo-MDPs provide a natural model for various applications, including inventory control, portfolio management, power systems, and ride-sharing, among others. While seemingly restrictive on the surface, our first result establishes that any discrete MDP can be represented as an Exo-MDP. The underlying argument reveals how transition and reward dynamics can be written as linear functions of the exogenous state distribution, showing how Exo-MDPs are instances of linear mixture MDPs, thereby showing a representational equivalence between discrete MDPs, Exo-MDPs, and linear mixture MDPs. The connection between Exo-MDPs and linear mixture MDPs leads to algorithms that are near sample-optimal, with regret guarantees scaling with the (effective) size of the exogenous state space $d$, independent of the sizes of the endogenous state and action spaces, even when the exogenous state is {\em unobserved}. When the exogenous state is unobserved, we establish a regret upper bound of $O(H^{3/2}d\sqrt{K})$ with $K$ trajectories of horizon $H$ and unobserved exogenous state of dimension $d$. We also establish a matching regret lower bound of $\Omega(H^{3/2}d\sqrt{K})$ for non-stationary Exo-MDPs and a lower bound of $\Omega(Hd\sqrt{K})$ for stationary Exo-MDPs. We complement our theoretical findings with an experimental study on inventory control problems.

algorithm, exo-mdp, linear mixture mdp, (15 more...)

2409.14557

Country:

Asia > Middle East > Jordan (0.04)
North America > United States > Massachusetts > Middlesex County > Cambridge (0.04)
Asia > Myanmar > Tanintharyi Region > Dawei (0.04)

Genre:

Research Report > New Finding (0.48)
Research Report > Experimental Study (0.48)

Industry:

Transportation > Passenger (0.34)
Transportation > Ground > Road (0.34)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Learning Graphical Models > Undirected Networks > Markov Models (0.48)

arXiv.org Artificial IntelligenceJun-26-2024

Reinforcement Learning with Intrinsically Motivated Feedback Graph for Lost-sales Inventory Control

Liu, Zifan, Li, Xinran, Chen, Shibo, Li, Gen, Jiang, Jiashuo, Zhang, Jun

Reinforcement learning (RL) has proven to be well-performed and general-purpose in the inventory control (IC). However, further improvement of RL algorithms in the IC domain is impeded due to two limitations of online experience. First, online experience is expensive to acquire in real-world applications. With the low sample efficiency nature of RL algorithms, it would take extensive time to train the RL policy to convergence. Second, online experience may not reflect the true demand due to the lost sales phenomenon typical in IC, which makes the learning process more challenging. To address the above challenges, we propose a decision framework that combines reinforcement learning with feedback graph (RLFG) and intrinsically motivated exploration (IME) to boost sample efficiency. In particular, we first take advantage of the inherent properties of lost-sales IC problems and design the feedback graph (FG) specially for lost-sales IC problems to generate abundant side experiences aid RL updates. Then we conduct a rigorous theoretical analysis of how the designed FG reduces the sample complexity of RL methods. Based on the theoretical insights, we design an intrinsic reward to direct the RL agent to explore to the state-action space with more side experiences, further exploiting FG's power. Experimental results demonstrate that our method greatly improves the sample efficiency of applying RL in IC. Our code is available at https://anonymous.4open.science/r/RLIMFG4IC-811D/

ic problem, rainbow-fg, side experience, (14 more...)

2406.18351

Country:

Asia > China > Hong Kong (0.05)
Europe > Italy > Calabria > Catanzaro Province > Catanzaro (0.04)

Genre: Research Report > New Finding (0.34)

Industry: Energy (0.67)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (1.00)

arXiv.org Machine LearningApr-17-2024

VC Theory for Inventory Policies

Xie, Yaqi, Ma, Will, Xin, Linwei

Advances in computational power and AI have increased interest in reinforcement learning approaches to inventory management. This paper provides a theoretical foundation for these approaches and investigates the benefits of restricting to policy structures that are well-established by decades of inventory theory. In particular, we prove generalization guarantees for learning several well-known classes of inventory policies, including base-stock and (s, S) policies, by leveraging the celebrated Vapnik-Chervonenkis (VC) theory. We apply the concepts of the Pseudo-dimension and Fat-shattering dimension from VC theory to determine the generalizability of inventory policies, that is, the difference between an inventory policy's performance on training data and its expected performance on unseen data. We focus on a classical setting without contexts, but allow for an arbitrary distribution over demand sequences and do not make any assumptions such as independence over time. We corroborate our supervised learning results using numerical simulations. Managerially, our theory and simulations translate to the following insights. First, there is a principle of "learning less is more" in inventory management: depending on the amount of data available, it may be beneficial to restrict oneself to a simpler, albeit suboptimal, class of inventory policies to minimize overfitting errors. Second, the number of parameters in a policy class may not be the correct measure of overfitting error: in fact, the class of policies defined by T time-varying base-stock levels exhibits a generalization error comparable to that of the two-parameter (s, S) policy class. Finally, our research suggests situations in which it could be beneficial to incorporate the concepts of base-stock and inventory position into black-box learning machines, instead of having these machines directly learn the order quantity actions.

generalization error, pseudo-dimension, vc theory, (17 more...)

2404.11509

Country:

North America > United States > New York (0.04)
North America > United States > Illinois > Cook County > Chicago (0.04)
North America > United States > Rhode Island > Providence County > Providence (0.04)
(3 more...)

Genre: Research Report > New Finding (0.66)

Industry: Education > Educational Setting (0.45)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (0.68)

arXiv.org Artificial IntelligenceJan-28-2024

A Deep Q-Network Based on Radial Basis Functions for Multi-Echelon Inventory Management

Cheng, Liqiang, Luo, Jun, Fan, Weiwei, Zhang, Yidong, Li, Yuan

This paper addresses a multi-echelon inventory management problem with a complex network topology where deriving optimal ordering decisions is difficult. Deep reinforcement learning (DRL) has recently shown potential in solving such problems, while designing the neural networks in DRL remains a challenge. In order to address this, a DRL model is developed whose Q-network is based on radial basis functions. The approach can be more easily constructed compared to classic DRL models based on neural networks, thus alleviating the computational burden of hyperparameter tuning. Through a series of simulation experiments, the superior performance of this approach is demonstrated compared to the simple base-stock policy, producing a better policy in the multi-echelon system and competitive performance in the serial system where the base-stock policy is optimal. In addition, the approach outperforms current DRL approaches.

deep q-network, inventory, on-hand inventory, (14 more...)

2401.15872

Country:

Asia > China > Shanghai > Shanghai (0.05)
North America > United States > New York (0.04)
North America > United States > California > San Diego County > San Diego (0.04)
(2 more...)

Genre: Research Report (0.50)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (1.00)

arXiv.org Artificial IntelligenceOct-20-2023

No-Regret Learning in Two-Echelon Supply Chain with Unknown Demand Distribution

Zhang, Mengxiao, Chen, Shi, Luo, Haipeng, Wang, Yingfei

Supply chain management (SCM) has been recognized as an important discipline with applications to many industries, where the two-echelon stochastic inventory model, involving one downstream retailer and one upstream supplier, plays a fundamental role for developing firms' SCM strategies. In this work, we aim at designing online learning algorithms for this problem with an unknown demand distribution, which brings distinct features as compared to classic online optimization problems. Specifically, we consider the two-echelon supply chain model introduced in [Cachon and Zipkin, 1999] under two different settings: the centralized setting, where a planner decides both agents' strategy simultaneously, and the decentralized setting, where two agents decide their strategy independently and selfishly. We design algorithms that achieve favorable guarantees for both regret and convergence to the optimal inventory decision in both settings, and additionally for individual regret in the decentralized setting. Our algorithms are based on Online Gradient Descent and Online Newton Step, together with several new ingredients specifically designed for our problem. We also implement our algorithms and show their empirical effectiveness.

agent 1, agent 2, equation, (14 more...)

2210.12663

Country: North America > United States > California (0.14)

Genre: Research Report (0.82)

Industry:

Education > Educational Setting (0.34)
Retail (0.34)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning > Agents (0.48)
Information Technology > Artificial Intelligence > Representation & Reasoning > Optimization (0.48)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (0.34)

Agrawal, Shipra, Jia, Randy

Learning in structured MDPs with convex cost functions: Improved regret bounds for inventory management

arXiv.org Machine LearningMay-10-2019

We consider a stochastic inventory control problem under censored demands, lost sales, and positive lead times. This is a fundamental problem in inventory management, with significant literature establishing near-optimality of a simple class of policies called ``base-stock policies'' for the underlying Markov Decision Process (MDP), as well as convexity of long run average-cost under those policies. We consider the relatively less studied problem of designing a learning algorithm for this problem when the underlying demand distribution is unknown. The goal is to bound regret of the algorithm when compared to the best base-stock policy. We utilize the convexity properties and a newly derived bound on bias of base-stock policies to establish a connection to stochastic convex bandit optimization. Our main contribution is a learning algorithm with a regret bound of $\tilde{O}(L\sqrt{T}+D)$ for the inventory control problem. Here $L$ is the fixed and known lead time, and $D$ is an unknown parameter of the demand distribution described roughly as the number of time steps needed to generate enough demand for depleting one unit of inventory. Notably, even though the state space of the underlying MDP is continuous and $L$-dimensional, our regret bounds depend linearly on $L$. Our results significantly improve the previously best known regret bounds for this problem where the dependence on $L$ was exponential and many further assumptions on demand distribution were required. The techniques presented here may be of independent interest for other settings that involve large structured MDPs but with convex cost functions.

artificial intelligence, base-stock policy, machine learning, (18 more...)

1905.04337

Genre: Research Report (0.70)

Industry: Law > Civil Rights & Constitutional Law (0.34)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Learning Graphical Models > Undirected Networks > Markov Models (0.87)

@machinelearnbotNov-18-2017, 17:20:20 GMT

Playing the Beer Game Using Reinforcement Learning

The beer game is a widely used in-class game that is played in supply chain management classes to demonstrate a phenomenon known as the bullwhip effect. The game consists of a serial supply chain network with four players--a retailer, a wholesaler, a distributor, and a manufacturer. In each period of the game, the retailer experiences a random demand from customers. Then the four players each decide how much inventory of "beer" to order. The retailer orders from the wholesaler, the wholesaler orders from the distributor, the distributor from the manufacturer, and the manufacturer orders from an external supplier that is not a player in the game.

beer game, machine learning, reinforcement learning, (15 more...)

@machinelearnbot

Industry:

Retail (1.00)
Consumer Products & Services > Food, Beverage, Tobacco & Cannabis > Beverages (0.68)
Leisure & Entertainment > Games (0.49)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning > Agents (0.50)
Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (0.40)