Markov Models
Reviews: Large Scale Markov Decision Processes with Changing Rewards
The paper contributes new algorithmic ideas and theoretical results for regret minimization in Markov Decision Processes with known transition kernels but arbitrary cost functions. The reviewers broadly agree that the theoretical and algorithmic techniques introduced by the paper -- using the FTRL online learning idea and the extension to large MDPs via linear function approximation -- are novel, and thus the paper deserves to be published; however, the known-MDP-unknown-cost setting may be somewhat narrow in its applicability in practice.
Learning more with the same effort: how randomization improves the robustness of a robotic deep reinforcement learning agent
Güitta-López, Lucía, Boal, Jaime, López-López, Álvaro J.
The industrial application of Deep Reinforcement Learning (DRL) is frequently slowed down because of the inability to generate the experience required to train the models. Collecting data often involves considerable time and economic effort that is unaffordable in most cases. Fortunately, devices like robots can be trained with synthetic experience thanks to virtual environments. With this approach, the sample efficiency problems of artificial agents are mitigated, but another issue arises: the need for efficiently transferring the synthetic experience into the real world (sim-to-real). This paper analyzes the robustness of a state-of-the-art sim-to-real technique known as progressive neural networks (PNNs) and studies how adding diversity to the synthetic experience can complement it. To better understand the drivers that lead to a lack of robustness, the robotic agent is still tested in a virtual environment to ensure total control over the divergence between the simulated and real models. The results show that a PNN-like agent exhibits a substantial decrease in its robustness at the beginning of the real training phase. Randomizing certain variables during simulation-based training significantly mitigates this issue. On average, the increase in the model's accuracy is around 25% when diversity is introduced in the training process. This improvement can be translated into a decrease in the required real experience for the same final robustness performance. Nevertheless, adding real experience to agents should still be beneficial regardless of the quality of the virtual experience fed into the agent.
Breaking the Pre-Planning Barrier: Real-Time Adaptive Coordination of Mission and Charging UAVs Using Graph Reinforcement Learning
Hu, Yuhan, Sun, Yirong, Chen, Yanjun, Chen, Xinghao
Unmanned Aerial Vehicles (UAVs) are pivotal in applications such as search and rescue and environmental monitoring, excelling in intelligent perception tasks. However, their limited battery capacity hinders long-duration and long-distance missions. Charging UAVs (CUAVs) offers a potential solution by recharging mission UAVs (MUAVs), but existing methods rely on impractical pre-planned routes, failing to enable organic cooperation and limiting mission efficiency. We introduce a novel multi-agent deep reinforcement learning model named Heterogeneous Graph Attention Multi-agent Deep Deterministic Policy Gradient (HGAM), designed to dynamically coordinate MUAVs and CUAVs. This approach maximizes data collection, geographical fairness, and energy efficiency by allowing UAVs to adapt their routes in real-time to current task demands and environmental conditions without pre-planning. Our model uses heterogeneous graph attention networks (GATs) to represent heterogeneous agents and facilitate efficient information exchange. It operates within an actor-critic framework. Simulation results show that our model significantly improves cooperation among heterogeneous UAVs, outperforming existing methods in several metrics, including data collection rate and charging efficiency.
A Predictive Approach for Enhancing Accuracy in Remote Robotic Surgery Using Informer Model
Lashari, Muhammad Hanif, Ahmed, Shakil, Batayneh, Wafa, Khokhar, Ashfaq
Precise and real-time estimation of the robotic arm's position on the patient's side is essential for the success of remote robotic surgery in Tactile Internet (TI) environments. This paper presents a prediction model based on the Transformer-based Informer framework for accurate and efficient position estimation. Additionally, it combines a Four-State Hidden Markov Model (4-State HMM) to simulate realistic packet loss scenarios. The proposed approach addresses challenges such as network delays, jitter, and packet loss to ensure reliable and precise operation in remote surgical applications. The method integrates the optimization problem into the Informer model by embedding constraints such as energy efficiency, smoothness, and robustness into its training process using a differentiable optimization layer. The Informer framework uses features such as ProbSparse attention, attention distilling, and a generative-style decoder to focus on position-critical features while maintaining a low computational complexity of O(L log L). The method is evaluated using the JIGSAWS dataset, achieving a prediction accuracy of over 90 percent under various network scenarios. A comparison with models such as TCN, RNN, and LSTM demonstrates the Informer framework's superior performance in handling position prediction and meeting real-time requirements, making it suitable for Tactile Internet-enabled robotic surgery.
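The abstract does not give the parameters of its four-state HMM, so the following is only a minimal sketch of how a four-state Markov chain can generate bursty packet-loss traces. The state names, transition probabilities, and per-state loss rates are all illustrative assumptions, not values from the paper.

```python
import random

# Hypothetical hidden states, ordered from best to worst channel condition.
STATES = ["good", "mostly_good", "mostly_bad", "bad"]

# Assumed transition probabilities: row = current state, columns follow STATES.
TRANSITIONS = {
    "good":        [0.90, 0.08, 0.02, 0.00],
    "mostly_good": [0.30, 0.55, 0.10, 0.05],
    "mostly_bad":  [0.05, 0.25, 0.50, 0.20],
    "bad":         [0.00, 0.10, 0.30, 0.60],
}

# Assumed probability that a packet is lost while the chain is in each state.
LOSS_PROB = {"good": 0.0, "mostly_good": 0.05, "mostly_bad": 0.4, "bad": 1.0}


def simulate_packet_loss(n_packets, seed=0):
    """Return a list of booleans, True where the packet was lost."""
    rng = random.Random(seed)
    state = "good"  # assumed initial state
    lost = []
    for _ in range(n_packets):
        # Emit an observation (lost or delivered) from the current hidden state.
        lost.append(rng.random() < LOSS_PROB[state])
        # Move to the next hidden state according to the transition row.
        state = rng.choices(STATES, weights=TRANSITIONS[state])[0]
    return lost


if __name__ == "__main__":
    trace = simulate_packet_loss(10_000, seed=42)
    print("empirical loss rate:", sum(trace) / len(trace))
```

Because the chain tends to stay in its current state, losses arrive in bursts rather than independently, which is the behaviour such models are typically used to capture.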
Review for NeurIPS paper: From Boltzmann Machines to Neural Networks and Back Again
I am changing the score to 7. The paper gives a new algorithm for learning the structure of Restricted Boltzmann Machines (formalized using Markov blankets), which is claimed to work for larger parameter regimes than previous work. This is done by considering the problem of predicting the spin of a node given the spins of all other nodes. This dependence is shown to be given by a one-hidden-layer neural net (with somewhat non-standard activations). An algorithm for learning this network is given, based on polynomial approximation of the neural net and regression on a degree-D monomial feature map (with an \ell_1 constraint). The algorithm works under an L_\infty constraint on the input vector, which is different from past work. Given the above algorithm for learning the dependence of one node on the rest, under suitable non-degeneracy conditions, an algorithm is given for learning the structure (Markov blanket) of the RBM. Nearly matching lower bounds are provided (under hardness assumptions or in the SQ model). The reduction to neural networks is also used for learning supervised RBMs, which can be thought of as a neural network under distributional assumptions on the data (in terms of "sparsity and nonnegative correlations among the input features conditional on the output label"). These distributional assumptions seem to be new.
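To illustrate why the conditional law of one spin looks like a one-hidden-layer network, here is a small sketch under an assumed standard parameterization: visible spins x and hidden spins h take values in {-1, +1}, and the joint probability is proportional to exp(x^T W h + b.x + c.h). Summing out h gives a log-odds for x_i that is a sum over hidden units of a difference of log-cosh terms. The function names are hypothetical and the paper's exact normalization may differ.

```python
import itertools
import math


def conditional_logit(W, b, c, x, i):
    """Log-odds of x[i] = +1 versus -1 given the other visible spins.

    Each hidden unit j contributes log cosh(rest + W[i][j]) - log cosh(rest - W[i][j]),
    which plays the role of a (non-standard) activation applied to a linear
    function of the remaining spins.
    """
    logit = 2.0 * b[i]
    for j in range(len(c)):
        rest = c[j] + sum(W[k][j] * x[k] for k in range(len(x)) if k != i)
        logit += math.log(math.cosh(rest + W[i][j]))
        logit -= math.log(math.cosh(rest - W[i][j]))
    return logit


def brute_force_logit(W, b, c, x, i):
    """Same quantity, computed by explicitly summing over all hidden configurations."""
    def unnormalized_marginal(v):
        total = 0.0
        for h in itertools.product([-1, 1], repeat=len(c)):
            energy = sum(b[k] * v[k] for k in range(len(v)))
            energy += sum(c[j] * h[j] for j in range(len(c)))
            energy += sum(W[k][j] * v[k] * h[j]
                          for k in range(len(v)) for j in range(len(c)))
            total += math.exp(energy)
        return total

    plus = list(x)
    plus[i] = 1
    minus = list(x)
    minus[i] = -1
    return math.log(unnormalized_marginal(plus)) - math.log(unnormalized_marginal(minus))


if __name__ == "__main__":
    W = [[0.5, -0.3], [0.2, 0.7], [-0.4, 0.1]]  # 3 visible x 2 hidden, arbitrary values
    b = [0.1, -0.2, 0.3]
    c = [0.05, -0.15]
    x = [1, -1, 1]
    print(conditional_logit(W, b, c, x, 0), brute_force_logit(W, b, c, x, 0))
```

The two functions agree up to floating-point error, which is a quick sanity check that the closed form is the correct marginalization for this parameterization.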
Review for NeurIPS paper: From Boltzmann Machines to Neural Networks and Back Again
The initial scores in the four reviews were all in favour of accepting, although not strongly. The paper studies a relevant problem, presenting a new algorithm with performance guarantees and almost matching lower bounds. However, some questions were raised regarding, for example, connections to other work and practical algorithms, as well as more technical issues. The authors provided a detailed reply. After discussion among the reviewers, their concerns were partially answered, leading to somewhat stronger support for accepting.
Reviews: Planning in entropy-regularized Markov decision processes and games
This theoretical paper considers the problem of computing the optimal value function in entropy-regularized MDPs and two-player games. It shows that the smoothness property of the Bellman operator in the presence of entropy-regularized policies (and possibly other forms of regularization) can be used to derive a sample complexity that is polynomial, of order O((1/ε)^{4+c}), with c being a problem-independent constant and ε the precision of the value function estimate. The proof is built upon the proposed algorithm, SmoothCruiser, which is motivated by the sparse sampling algorithm of Kearns et al. and recursively estimates V through samples and subsequently aggregates the results. This sampling-based dynamic programming is done up to a depth at which the required number of samples is no longer polynomial. The paper is very well written and provides a solid result.
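The smoothness the review refers to can be seen in the usual form of the entropy-regularized backup, where the hard max over actions is replaced by a log-sum-exp. The sketch below shows one exact backup on a known tabular MDP; it is not SmoothCruiser, which works from samples, and the function names and the `temperature` parameter are assumptions made for illustration.

```python
import math


def soft_max_value(q_values, temperature):
    """Smooth maximum: temperature * log(sum(exp(q / temperature))).

    The largest Q-value is subtracted first for numerical stability.
    """
    m = max(q_values)
    return m + temperature * math.log(
        sum(math.exp((q - m) / temperature) for q in q_values)
    )


def soft_bellman_backup(rewards, transitions, values, gamma, temperature):
    """Apply one entropy-regularized Bellman backup to every state.

    rewards[s][a] is the immediate reward and transitions[s][a] is the list of
    next-state probabilities for taking action a in state s.
    """
    new_values = []
    for s in range(len(values)):
        q = [
            rewards[s][a]
            + gamma * sum(p * v for p, v in zip(transitions[s][a], values))
            for a in range(len(rewards[s]))
        ]
        new_values.append(soft_max_value(q, temperature))
    return new_values


if __name__ == "__main__":
    # Toy two-state, two-action MDP with arbitrary numbers.
    rewards = [[1.0, 0.0], [0.0, 2.0]]
    transitions = [[[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0], [0.0, 1.0]]]
    values = [0.0, 0.0]
    for _ in range(50):
        values = soft_bellman_backup(rewards, transitions, values, 0.9, 0.5)
    print(values)
```

As the temperature goes to zero the smooth maximum approaches the ordinary max, so the backup recovers the unregularized Bellman operator in the limit.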
Reviews: Planning in entropy-regularized Markov decision processes and games
The reviewers were in consensus that this is an interesting and well written paper with a significant theoretical contribution. While empirical results should not be strictly required for a paper that is strong theoretically, they would nonetheless greatly improve the paper, and thus the authors are strongly encouraged to include them in the final version, even if they are relegated to supplementary material.
WFCRL: A Multi-Agent Reinforcement Learning Benchmark for Wind Farm Control
Monroc, Claire Bizon, Bušić, Ana, Dubuc, Donatien, Zhu, Jiamin
The wind farm control problem is challenging, since conventional model-based control strategies require tractable models of complex aerodynamical interactions between the turbines and suffer from the curse of dimensionality when the number of turbines increases. Recently, model-free and multi-agent reinforcement learning approaches have been used to address this challenge. In this article, we introduce WFCRL (Wind Farm Control with Reinforcement Learning), the first open suite of multi-agent reinforcement learning environments for the wind farm control problem. WFCRL frames a cooperative Multi-Agent Reinforcement Learning (MARL) problem: each turbine is an agent and can learn to adjust its yaw, pitch or torque to maximize the common objective (e.g. the total power production of the farm). WFCRL also offers turbine load observations that make it possible to optimize the farm performance while limiting turbine structural damage. Interfaces with two state-of-the-art farm simulators are implemented in WFCRL: a static simulator (FLORIS) and a dynamic simulator (FAST.Farm). For each simulator, 10 wind layouts are provided, including 5 real wind farms. Two state-of-the-art online MARL algorithms are implemented to illustrate the scaling challenges. As learning online on FAST.Farm is highly time-consuming, WFCRL offers the possibility of designing transfer learning strategies from FLORIS to FAST.Farm.
Collaborating in a competitive world: Heterogeneous Multi-Agent Decision Making in Symbiotic Supply Chain Environments
Wang, Wan, Wang, Haiyan, Sobey, Adam J.
Supply networks require collaboration in a competitive environment. To achieve this, nodes in the network often form symbiotic relationships as they can be adversely affected by the closure of companies in the network, especially where products are niche. However, balancing support for other nodes in the network against profit is challenging. Agents are increasingly being explored to define optimal strategies in these complex networks. However, to date much of the literature focuses on homogeneous agents where a single policy controls all of the nodes. This isn't realistic for many supply chains as this level of information sharing would require an exceptionally close relationship. This paper therefore compares the behaviour of this type of agent to a heterogeneous structure, where the agents each have separate policies, to solve the product ordering and pricing problem. An approach to reward sharing is developed that doesn't require sharing profit. The homogeneous and heterogeneous agents exhibit different behaviours, with the homogeneous retailer retaining high inventories and witnessing high levels of backlog while the heterogeneous agents show a typical order strategy. This leads to the heterogeneous agents mitigating the bullwhip effect whereas the homogeneous agents do not. In the high demand environment, the agent architecture dominates performance with the Soft Actor-Critic (SAC) agents outperforming the Proximal Policy Optimisation (PPO) agents. Here, the factory controls the supply chain. In the low demand environment the homogeneous agents outperform the heterogeneous agents. Control of the supply chain shifts significantly, with the retailer outperforming the factory by a significant margin.