AITopics

Country:

North America > United States > Virginia > Arlington County > Arlington (0.04)
North America > United States > Georgia > Fulton County > Atlanta (0.04)
North America > Canada > Quebec > Montreal (0.04)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (0.69)

Neural Information Processing SystemsOct-7-2024, 08:33:21 GMT

Reviews: Near Optimal Exploration-Exploitation in Non-Communicating Markov Decision Processes

This is an excellent theoretical contribution. The analysis is quite heavy and has many subtleties. I do not have enough time to read the appended proofs; also, the subject of the paper is not in my area of research. The comments below are based on the impression I got after reading carefully the first 8 pages of the paper and glancing through the rest in the supplementary file. Summary: This paper is about reinforcement learning in weakly-communicating MDP under the average-reward criterion.

algorithm, artificial intelligence, machine learning, (13 more...)

Industry: Energy > Oil & Gas > Upstream (0.52)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Learning Graphical Models > Undirected Networks > Markov Models (0.40)

Near Optimal Exploration-Exploitation in Non-Communicating Markov Decision Processes

Neural Information Processing SystemsFeb-14-2020, 11:58:06 GMT

While designing the state space of an MDP, it is common to include states that are transient or not reachable by any policy (e.g., in mountain car, the product space of speed and position contains configurations that are not physically reachable). In this paper, we introduce TUCRL, the first algorithm able to perform efficient exploration-exploitation in any finite Markov Decision Process (MDP) without requiring any form of prior knowledge. In particular, for any MDP with $S c$ communicating states, $A$ actions and $\Gamma c \leq S c$ possible communicating next states, we derive a $O(D c \sqrt{\Gamma c S c A T}) regret bound, where $D c$ is the diameter (i.e., the length of the longest shortest path between any two states) of the communicating part of the MDP. This is in contrast with optimistic algorithms (e.g., UCRL, Optimistic PSRL) that suffer linear regret in weakly-communicating MDPs, as well as posterior sampling or regularised algorithms (e.g., REGAL), which require prior knowledge on the bias span of the optimal policy to bias the exploration to achieve sub-linear regret. We also prove that in weakly-communicating MDPs, no algorithm can ever achieve a logarithmic growth of the regret without first suffering a linear regret for a number of steps that is exponential in the parameters of the MDP.

artificial intelligence, non-communicating markov decision process, upstream oil & gas, (8 more...)

Industry: Energy > Oil & Gas > Upstream (0.64)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Learning Graphical Models > Undirected Networks > Markov Models (0.64)

Near Optimal Exploration-Exploitation in Non-Communicating Markov Decision Processes

Neural Information Processing SystemsDec-31-2018

While designing the state space of an MDP, it is common to include states that are transient or not reachable by any policy (e.g., in mountain car, the product space of speed and position contains configurations that are not physically reachable). This results in weakly-communicating or multi-chain MDPs. In this paper, we introduce TUCRL, the first algorithm able to perform efficient exploration-exploitation in any finite Markov Decision Process (MDP) without requiring any form of prior knowledge. In particular, for any MDP with $S^c$ communicating states, $A$ actions and $\Gamma^c \leq S^c$ possible communicating next states, we derive a $O(D^c \sqrt{\Gamma^c S^c A T}) regret bound, where $D^c$ is the diameter (i.e., the length of the longest shortest path between any two states) of the communicating part of the MDP. This is in contrast with optimistic algorithms (e.g., UCRL, Optimistic PSRL) that suffer linear regret in weakly-communicating MDPs, as well as posterior sampling or regularised algorithms (e.g., REGAL), which require prior knowledge on the bias span of the optimal policy to bias the exploration to achieve sub-linear regret. We also prove that in weakly-communicating MDPs, no algorithm can ever achieve a logarithmic growth of the regret without first suffering a linear regret for a number of steps that is exponential in the parameters of the MDP. Finally, we report numerical simulations supporting our theoretical findings and showing how TUCRL overcomes the limitations of the state-of-the-art.

artificial intelligence, reinforcement learning, upstream oil & gas, (18 more...)

Country:

North America > United States > Virginia (0.14)
North America > Canada (0.14)

Industry: Energy > Oil & Gas > Upstream (0.71)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (0.95)
Information Technology > Artificial Intelligence > Machine Learning > Learning Graphical Models > Undirected Networks > Markov Models (0.84)

Near Optimal Exploration-Exploitation in Non-Communicating Markov Decision Processes

Neural Information Processing SystemsDec-31-2018

While designing the state space of an MDP, it is common to include states that are transient or not reachable by any policy (e.g., in mountain car, the product space of speed and position contains configurations that are not physically reachable). This results in weakly-communicating or multi-chain MDPs. In this paper, we introduce TUCRL, the first algorithm able to perform efficient exploration-exploitation in any finite Markov Decision Process (MDP) without requiring any form of prior knowledge. In particular, for any MDP with $S^c$ communicating states, $A$ actions and $\Gamma^c \leq S^c$ possible communicating next states, we derive a $O(D^c \sqrt{\Gamma^c S^c A T}) regret bound, where $D^c$ is the diameter (i.e., the length of the longest shortest path between any two states) of the communicating part of the MDP. This is in contrast with optimistic algorithms (e.g., UCRL, Optimistic PSRL) that suffer linear regret in weakly-communicating MDPs, as well as posterior sampling or regularised algorithms (e.g., REGAL), which require prior knowledge on the bias span of the optimal policy to bias the exploration to achieve sub-linear regret. We also prove that in weakly-communicating MDPs, no algorithm can ever achieve a logarithmic growth of the regret without first suffering a linear regret for a number of steps that is exponential in the parameters of the MDP. Finally, we report numerical simulations supporting our theoretical findings and showing how TUCRL overcomes the limitations of the state-of-the-art.

artificial intelligence, reinforcement learning, upstream oil & gas, (18 more...)

Country:

North America > United States > Virginia (0.14)
North America > Canada (0.14)

Industry: Energy > Oil & Gas > Upstream (0.71)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (0.95)
Information Technology > Artificial Intelligence > Machine Learning > Learning Graphical Models > Undirected Networks > Markov Models (0.84)

Near Optimal Exploration-Exploitation in Non-Communicating Markov Decision Processes

arXiv.org Machine LearningJul-6-2018

While designing the state space of an MDP, it is common to include states that are transient or not reachable by any policy (e.g., in mountain car, the product space of speed and position contains configurations that are not physically reachable). This leads to defining weakly-communicating or multi-chain MDPs. In this paper, we introduce \tucrl, the first algorithm able to perform efficient exploration-exploitation in any finite Markov Decision Process (MDP) without requiring any form of prior knowledge. In particular, for any MDP with $S^{\texttt{C}}$ communicating states, $A$ actions and $\Gamma^{\texttt{C}} \leq S^{\texttt{C}}$ possible communicating next states, we derive a $\widetilde{O}(D^{\texttt{C}} \sqrt{\Gamma^{\texttt{C}} S^{\texttt{C}} AT})$ regret bound, where $D^{\texttt{C}}$ is the diameter (i.e., the longest shortest path) of the communicating part of the MDP. This is in contrast with optimistic algorithms (e.g., UCRL, Optimistic PSRL) that suffer linear regret in weakly-communicating MDPs, as well as posterior sampling or regularised algorithms (e.g., REGAL), which require prior knowledge on the bias span of the optimal policy to bias the exploration to achieve sub-linear regret. We also prove that in weakly-communicating MDPs, no algorithm can ever achieve a logarithmic growth of the regret without first suffering a linear regret for a number of steps that is exponential in the parameters of the MDP. Finally, we report numerical simulations supporting our theoretical findings and showing how TUCRL overcomes the limitations of the state-of-the-art.

artificial intelligence, tucrl, upstream oil & gas, (19 more...)

arXiv.org Machine Learning

1807.02373

Genre: Research Report (0.64)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Learning Graphical Models > Undirected Networks > Markov Models (0.60)
Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (0.46)