AITopics | triple-q

Collaborating Authors

triple-q

Information about AI from the News, Publications, and Conferences

Automatic Classification – Tagging and Summarization – Customizable Filtering and Analysis

If you are looking for an answer to the question What is Artificial Intelligence? and you only have a minute, then here's the definition the Association for the Advancement of Artificial Intelligence offers on its home page: "the scientific understanding of the mechanisms underlying thought and intelligent behavior and their embodiment in machines."

However, if you are fortunate enough to have more than a minute, then please get ready to embark upon an exciting journey exploring AI (but beware, it could last a lifetime) …

Model-Free, Regret-Optimal Best Policy Identification in Online CMDPs

Zhou, Zihan, Wei, Honghao, Ying, Lei

arXiv.org Artificial IntelligenceFeb-5-2024

This paper considers the best policy identification (BPI) problem in online Constrained Markov Decision Processes (CMDPs). We are interested in algorithms that are model-free, have low regret, and identify an approximately optimal policy with a high probability. Existing model-free algorithms for online CMDPs with sublinear regret and constraint violation do not provide any convergence guarantee to an optimal policy and provide only average performance guarantees when a policy is uniformly sampled at random from all previously used policies. In this paper, we develop a new algorithm, named Pruning-Refinement-Identification (PRI), based on a fundamental structural property of CMDPs proved before, which we call limited stochasticity. The property says for a CMDP with $N$ constraints, there exists an optimal policy with at most $N$ stochastic decisions. The proposed algorithm first identifies at which step and in which state a stochastic decision has to be taken and then fine-tunes the distributions of these stochastic decisions. PRI achieves trio objectives: (i) PRI is a model-free algorithm; and (ii) it outputs an approximately optimal policy with a high probability at the end of learning; and (iii) PRI guarantees $\tilde{\mathcal{O}}(H\sqrt{K})$ regret and constraint violation, which significantly improves the best existing regret bound $\tilde{\mathcal{O}}(H^4 \sqrt{SA}K^{\frac{4}{5}})$ under a model-free algorithm, where $H$ is the length of each episode, $S$ is the number of states, $A$ is the number of actions, and the total number of episodes during learning is $2K+\tilde{\cal O}(K^{0.25}).$

algorithm, constraint violation, optimal policy, (15 more...)

arXiv.org Artificial Intelligence

2309.15395

Country:

North America > United States > Michigan > Washtenaw County > Ann Arbor (0.04)
North America > United States > Washington (0.04)
Europe > Netherlands > South Holland > Leiden (0.04)

Genre: Research Report (0.82)

Industry: Health & Medicine (0.46)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (0.47)
Information Technology > Artificial Intelligence > Representation & Reasoning > Optimization (0.46)
Information Technology > Artificial Intelligence > Machine Learning > Learning Graphical Models > Undirected Networks > Markov Models (0.34)

Add feedback

A Provably-Efficient Model-Free Algorithm for Constrained Markov Decision Processes

Wei, Honghao, Liu, Xin, Ying, Lei

arXiv.org Artificial IntelligenceJun-2-2021

This paper presents the first {\em model-free}, {\em simulator-free} reinforcement learning algorithm for Constrained Markov Decision Processes (CMDPs) with sublinear regret and zero constraint violation. The algorithm is named Triple-Q because it has three key components: a Q-function (also called action-value function) for the cumulative reward, a Q-function for the cumulative utility for the constraint, and a virtual-Queue that (over)-estimates the cumulative constraint violation. Under Triple-Q, at each step, an action is chosen based on the pseudo-Q-value that is a combination of the three Q values. The algorithm updates the reward and utility Q-values with learning rates that depend on the visit counts to the corresponding (state, action) pairs and are periodically reset. In the episodic CMDP setting, Triple-Q achieves $\tilde{\cal O}\left(\frac{1 }{\delta}H^4 S^{\frac{1}{2}}A^{\frac{1}{2}}K^{\frac{4}{5}} \right)$ regret, where $K$ is the total number of episodes, $H$ is the number of steps in each episode, $S$ is the number of states, $A$ is the number of actions, and $\delta$ is Slater's constant. Furthermore, Triple-Q guarantees zero constraint violation when $K$ is sufficiently large. Finally, the computational complexity of Triple-Q is similar to SARSA for unconstrained MDPs and is computationally efficient.

algorithm, constraint violation, triple-q, (13 more...)

arXiv.org Artificial Intelligence

2106.01577

Country:

Europe > United Kingdom > England > Cambridgeshire > Cambridge (0.28)
Asia > Middle East > Jordan (0.04)
South America > Chile > Santiago Metropolitan Region > Santiago Province > Santiago (0.04)
(2 more...)

Genre:

Workflow (0.48)
Research Report (0.40)

Industry: Health & Medicine (0.68)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Learning Graphical Models > Undirected Networks > Markov Models (0.84)
Information Technology > Artificial Intelligence > Representation & Reasoning > Constraint-Based Reasoning (0.75)

Add feedback