Structured Reinforcement Learning
Exploration in Structured Reinforcement Learning
Ok, Jungseul, Proutiere, Alexandre, Tranos, Damianos
We address reinforcement learning problems with finite state and action spaces where the underlying MDP has some known structure that can potentially be exploited to minimize the exploration rates of suboptimal (state, action) pairs. For any arbitrary structure, we derive problem-specific regret lower bounds satisfied by any learning algorithm. These lower bounds are made explicit for unstructured MDPs and for those whose transition probabilities and average reward functions are Lipschitz continuous w.r.t. the state and action. For Lipschitz MDPs, the bounds are shown not to scale with the sizes S and A of the state and action spaces: they are smaller than c log T, where T is the time horizon and the constant c depends only on the Lipschitz structure, the span of the bias function, and the minimal action sub-optimality gap. This contrasts with unstructured MDPs, where the regret lower bound typically scales as SA log T. We devise DEL (Directed Exploration Learning), an algorithm that matches our regret lower bounds. We further simplify DEL for Lipschitz MDPs and show that the simplified version still exploits the structure efficiently.
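To make the shape of these bounds concrete, here is a schematic of the Graves-Lai-style optimization problem that problem-specific lower bounds of this kind reduce to; the notation (exploration rates eta, gaps delta*, confusing set Lambda_Phi) is illustrative and not quoted from the paper.

```latex
% Schematic problem-specific lower bound (illustrative notation).
% For a uniformly good algorithm on model \phi within structure \Phi:
\[
  \liminf_{T \to \infty} \frac{R(T)}{\log T} \;\ge\; K_\Phi(\phi),
\]
% where K_\Phi(\phi) is the value of an optimization over exploration
% rates \eta(s,a) of suboptimal pairs, weighted by their gaps \delta^*:
\[
  K_\Phi(\phi) = \inf_{\eta \ge 0} \sum_{(s,a)} \eta(s,a)\,\delta^*(s,a)
  \quad \text{s.t.} \quad
  \sum_{(s,a)} \eta(s,a)\,\mathrm{KL}\big(\phi(s,a),\psi(s,a)\big) \ge 1
  \quad \forall\, \psi \in \Lambda_\Phi(\phi),
\]
% with \Lambda_\Phi(\phi) the "confusing" models in \Phi that cannot be
% distinguished from \phi at optimal pairs yet have a different optimal
% policy. Structure enters only through \Lambda_\Phi: a smaller confusing
% set (e.g. under Lipschitz constraints) loosens the program, which is
% how K_\Phi can stop scaling with S and A.
```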
Structured Reinforcement Learning for Combinatorial Decision-Making
Hoppe, Heiko, Baty, Léo, Bouvier, Louis, Parmentier, Axel, Schiffer, Maximilian
Reinforcement learning (RL) is increasingly applied to real-world problems involving complex and structured decisions, such as routing, scheduling, and assortment planning. These settings challenge standard RL algorithms, which struggle to scale, generalize, and exploit structure in the presence of combinatorial action spaces. We propose Structured Reinforcement Learning (SRL), a novel actor-critic framework that embeds combinatorial optimization layers into the actor neural network. We enable end-to-end learning of the actor via Fenchel-Young losses and provide a geometric interpretation of SRL as a primal-dual algorithm in the dual of the moment polytope. Across six environments with exogenous and endogenous uncertainty, SRL matches or surpasses the performance of unstructured RL and imitation learning on static tasks and improves over these baselines by up to 92% on dynamic problems, with improved stability and convergence speed.
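For a sense of how a combinatorial layer can be trained end-to-end, below is a minimal sketch of a perturbation-based Fenchel-Young loss gradient in the spirit the abstract describes; the function names, the Gaussian perturbation scheme, and the toy top-k oracle are illustrative assumptions, not the SRL implementation.

```python
import numpy as np

def fenchel_young_grad(theta, y_target, solver, eps=1.0, n_samples=32, rng=None):
    """Monte-Carlo gradient of a perturbation-based Fenchel-Young loss.

    `solver(theta)` is any combinatorial oracle returning the vertex that
    maximizes <theta, y> over the feasible polytope (shortest path,
    matching, assortment, ...). All names here are illustrative sketches,
    not the SRL API.
    """
    rng = rng or np.random.default_rng(0)
    # Estimate E_Z[ y*(theta + eps * Z) ] by sampling Gaussian perturbations;
    # this expectation lies in the interior of the moment polytope.
    y_hat = np.mean(
        [solver(theta + eps * rng.standard_normal(theta.shape))
         for _ in range(n_samples)],
        axis=0,
    )
    # The Fenchel-Young loss gradient w.r.t. theta is the gap between the
    # perturbed expectation and the target structure.
    return y_hat - y_target

# Toy usage with a top-2 selection oracle over 5 items.
top2 = lambda th: (th >= np.sort(th)[-2]).astype(float)
g = fenchel_young_grad(np.zeros(5), top2(np.array([5., 4., 3., 2., 1.])), top2)
```

Descending along this gradient moves the score vector so that the perturbed expectation drifts toward the target vertex, which is one way to read the paper's primal-dual view in the dual of the moment polytope.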
Reviews: Exploration in Structured Reinforcement Learning
It provides problem-specific (asymptotic) lower and upper bounds on the regret, the latter for an algorithm presented in the paper that builds on Burnetas and Katehakis (1997) and a recent bandit paper by Combes et al. (NIPS 2017). The setting assumes that an "MDP structure" \Phi (i.e., a set of possible MDP models) is given. The regret bounds (after T steps) are shown to be of the form K_\Phi log T, where the parameter K_\Phi is the solution to a particular optimization problem. It is shown that if \Phi is the set of all MDPs ("the unstructured case"), then K_\Phi is bounded by HSA/\delta, where H is the bias span and \delta the minimal action sub-optimality gap. The second particular class considered is the Lipschitz structure, which covers embeddings of finite MDPs in Euclidean space such that transition probabilities and rewards are Lipschitz. In this case, the regret bounds are shown to no longer depend on the sizes of the state and action spaces.
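To see where the HSA/\delta scaling can come from, a hedged back-of-the-envelope: in the unstructured case the confusing-model constraints decouple across (state, action) pairs, since each suboptimal pair can be perturbed on its own.

```latex
% Decoupling sketch for the unstructured case (illustrative; constants
% and regularity conditions suppressed). With \Phi containing all MDPs,
% confusing models can perturb one (s,a) pair at a time, so the program
% decouples into one constraint \eta(s,a)\,\mathrm{KL}_{\min}(s,a) \ge 1
% per suboptimal pair, giving
\[
  K_\Phi \;\le\; \sum_{(s,a)\,:\,\delta^*(s,a)>0}
    \frac{\delta^*(s,a)}{\mathrm{KL}_{\min}(s,a)}.
\]
% Bounding each of the at most S A summands by H/\delta recovers the
% H S A / \delta scaling quoted in the review.
```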
Structured Reinforcement Learning for Media Streaming at the Wireless Edge
Bura, Archana, Bobbili, Sarat Chandra, Rameshkumar, Shreyas, Rengarajan, Desik, Kalathil, Dileep, Shakkottai, Srinivas
Media streaming is the dominant application over wireless edge (access) networks. The increasing softwarization of such networks has led to efforts at intelligent control, wherein application-specific actions may be dynamically taken to enhance the user experience. The goal of this work is to develop and demonstrate learning-based policies for optimal decision making to determine which clients to dynamically prioritize in a video streaming setting. We formulate the policy design question as a constrained Markov decision problem (CMDP) and observe that, by using a Lagrangian relaxation, we can decompose it into single-client problems. Further, the optimal policy takes a threshold form in the video buffer length, which enables us to design an efficient constrained reinforcement learning (CRL) algorithm to learn it. Specifically, we show that a natural policy gradient (NPG) based algorithm, derived using the structure of our problem, converges to the globally optimal policy. We then develop a simulation environment for training and a real-world intelligent controller attached to a WiFi access point for evaluation. We empirically show that the structured learning approach enables fast learning. Furthermore, such a structured policy can be easily deployed due to its low computational complexity, with policy execution taking only about 15 µs. Using YouTube streaming experiments in a resource-constrained scenario, we demonstrate that the CRL approach can increase quality of experience (QoE) by over 30%.
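For intuition on the structure being exploited, here is a minimal sketch of the threshold-in-buffer-length policy shape the abstract describes. The names and the tie-breaking rule are illustrative assumptions; the thresholds themselves would be learned (e.g. by the paper's NPG-based CRL procedure), which this sketch does not implement.

```python
import numpy as np

def threshold_priority_policy(buffers, thresholds):
    """Pick which client to prioritize under a buffer-threshold policy.

    The abstract's optimal single-client policy is a threshold in the
    video buffer length; after the Lagrangian decomposition, a natural
    aggregate rule is to prioritize a client whose buffer sits below its
    threshold. Names and tie-breaking here are illustrative assumptions.
    """
    below = buffers < thresholds          # clients at risk of rebuffering
    if not below.any():
        return None                       # nobody needs priority right now
    # Assumed tie-break: serve the client with the emptiest buffer first.
    return int(np.argmin(np.where(below, buffers, np.inf)))

# Example: client 0 (4 s of buffer, threshold 5 s) gets priority.
assert threshold_priority_policy(np.array([4.0, 9.0]), np.array([5.0, 5.0])) == 0
```

The low execution cost reported in the abstract is plausible for a rule of this shape: deciding an action is a handful of comparisons per client rather than a neural-network forward pass.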