AITopics | reward regret

8ec2ba5e96ec1c050bc631abda80f269-Paper.pdf

Neural Information Processing Systems | Feb-9-2026, 21:01:58 GMT

artificial intelligence, machine learning, probability, (16 more...)

Neural Information Processing Systems

Country:

North America > United States > Texas > Brazos County > College Station (0.05)
Asia > Middle East > Jordan (0.04)

Industry:

Government > Regional Government > North America Government > United States Government (1.00)
Energy > Renewable (0.93)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Machine Learning (1.00)

Learning Policies with Zero or Bounded Constraint Violation for Constrained MDPs

Neural Information Processing Systems | Dec-24-2025, 11:18:48 GMT

We address the issue of safety in reinforcement learning. We pose the problem in an episodic framework of a constrained Markov decision process. Existing results have shown that it is possible to achieve a reward regret of $\tilde{\mathcal{O}}(\sqrt{K})$ while allowing an $\tilde{\mathcal{O}}(\sqrt{K})$ constraint violation in $K$ episodes. A critical question that arises is whether it is possible to keep the constraint violation even smaller. We show that when a strictly safe policy is known, then one can confine the system to zero constraint violation with arbitrarily high probability while keeping the reward regret of order $\tilde{\mathcal{O}}(\sqrt{K})$. The algorithm which does so employs the principle of optimistic pessimism in the face of uncertainty to achieve safe exploration. When no strictly safe policy is known, though one is known to exist, then it is possible to restrict the system to bounded constraint violation with arbitrarily high probability. This is shown to be realized by a primal-dual algorithm with an optimistic primal estimate and a pessimistic dual update.
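
The two quantities this abstract bounds can be written out explicitly. The block below is one common formalisation for an episodic CMDP, not necessarily the authors' exact definitions. All notation is assumed here rather than taken from the listing: $\pi_k$ is the policy played in episode $k$, $\pi^*$ an optimal feasible policy, $V_1^{r,\pi}(s_1)$ and $V_1^{c,\pi}(s_1)$ the expected cumulative reward and cost from the initial state $s_1$, and $\tau$ the cost threshold.

```latex
% Assumed notation (not from the listing): \pi_k, \pi^*, V_1^{r,\pi}, V_1^{c,\pi}, s_1, \tau
\begin{align}
  \mathrm{Reg}(K) &= \sum_{k=1}^{K} \Bigl( V_1^{r,\pi^*}(s_1) - V_1^{r,\pi_k}(s_1) \Bigr), \\
  \mathrm{Vio}(K) &= \Bigl[ \sum_{k=1}^{K} \bigl( V_1^{c,\pi_k}(s_1) - \tau \bigr) \Bigr]_{+},
  \qquad [x]_{+} = \max\{x, 0\}.
\end{align}
```

Under this reading, "zero constraint violation" corresponds to $\mathrm{Vio}(K) = 0$ with high probability and "bounded constraint violation" to $\mathrm{Vio}(K) = \mathcal{O}(1)$, both alongside $\mathrm{Reg}(K) = \tilde{\mathcal{O}}(\sqrt{K})$. Some works instead sum the per-episode positive parts, which is a stricter measure.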

bounded constraint violation, learning policy, name change, (10 more...)

Neural Information Processing Systems

Technology: Information Technology > Artificial Intelligence > Machine Learning (0.77)

8ec2ba5e96ec1c050bc631abda80f269-Supplemental.pdf

Neural Information Processing Systems | Aug-15-2025, 22:17:09 GMT

artificial intelligence, machine learning, probability, (17 more...)

Neural Information Processing Systems

Country:

North America > United States > Texas > Brazos County > College Station (0.04)
Asia > Middle East > Jordan (0.04)

Industry:

Government > Regional Government > North America Government > United States Government (1.00)
Energy > Renewable (0.92)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Machine Learning (1.00)

8ec2ba5e96ec1c050bc631abda80f269-Paper.pdf

Neural Information Processing Systems | Aug-15-2025, 22:17:05 GMT

artificial intelligence, machine learning, probability, (16 more...)

Neural Information Processing Systems

Country:

North America > United States > Texas > Brazos County > College Station (0.05)
Asia > Middle East > Jordan (0.04)

Industry:

Government > Regional Government > North America Government > United States Government (1.00)
Energy > Renewable (0.93)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Machine Learning (1.00)

Ensuring Safety in an Uncertain Environment: Constrained MDPs via Stochastic Thresholds

Zuo, Qian, He, Fengxiang

arXiv.org Machine Learning | Apr-7-2025

This paper studies constrained Markov decision processes (CMDPs) with constraints against stochastic thresholds, aiming at safety of reinforcement learning in unknown and uncertain environments. We leverage a Growing-Window estimator, which samples from interactions with the uncertain and dynamic environment, to estimate the thresholds. Based on these estimates, we design Stochastic Pessimistic-Optimistic Thresholding (SPOT), a novel model-based primal-dual algorithm for multiple constraints against stochastic thresholds. SPOT enables reinforcement learning under both pessimistic and optimistic threshold settings. We prove that our algorithm achieves sublinear regret and constraint violation; i.e., a reward regret of $\tilde{\mathcal{O}}(\sqrt{T})$ while allowing an $\tilde{\mathcal{O}}(\sqrt{T})$ constraint violation over $T$ episodes. The theoretical guarantees show that our algorithm achieves performance comparable to that of an approach relying on fixed and clear thresholds. To the best of our knowledge, SPOT is the first reinforcement learning algorithm that realises theoretically guaranteed performance in an uncertain environment where even thresholds are unknown.

artificial intelligence, machine learning, reinforcement learning, (19 more...)

arXiv.org Machine Learning

2504.04973

Country: South America > Chile > Santiago Metropolitan Region > Santiago Province > Santiago (0.04)

Genre: Research Report (1.00)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning > Constraint-Based Reasoning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (1.00)

Learning Policies with Zero or Bounded Constraint Violation for Constrained MDPs

Neural Information Processing Systems | Jan-16-2025, 12:58:35 GMT

We address the issue of safety in reinforcement learning. We pose the problem in an episodic framework of a constrained Markov decision process. Existing results have shown that it is possible to achieve a reward regret of $\tilde{\mathcal{O}}(\sqrt{K})$ while allowing an $\tilde{\mathcal{O}}(\sqrt{K})$ constraint violation in $K$ episodes. A critical question that arises is whether it is possible to keep the constraint violation even smaller. We show that when a strictly safe policy is known, then one can confine the system to zero constraint violation with arbitrarily high probability while keeping the reward regret of order $\tilde{\mathcal{O}}(\sqrt{K})$.

bounded constraint violation, constrained mdp, learning policy, (7 more...)

Neural Information Processing Systems

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning > Constraint-Based Reasoning (1.00)
Information Technology > Artificial Intelligence > Machine Learning (0.82)

Merit-based Fair Combinatorial Semi-Bandit with Unrestricted Feedback Delays

Chen, Ziqun, Cai, Kechao, Chen, Zhuoyue, Zhang, Jinbei, Lui, John C. S.

arXiv.org Machine Learning | Jul-29-2024

We study the stochastic combinatorial semi-bandit problem with unrestricted feedback delays under merit-based fairness constraints. This is motivated by applications such as crowdsourcing and online advertising, where feedback is not immediately available and fairness among different choices (or arms) is crucial. We consider two types of unrestricted feedback delays: reward-independent delays, where the feedback delays are independent of the rewards, and reward-dependent delays, where the feedback delays are correlated with the rewards. Furthermore, we introduce merit-based fairness constraints to ensure a fair selection of the arms. We define the reward regret and the fairness regret and present new bandit algorithms to select arms under unrestricted feedback delays based on their merits. We prove that our algorithms all achieve sublinear expected reward regret and expected fairness regret, with a dependence on the quantiles of the delay distribution. We also conduct extensive experiments using synthetic and real-world data and show that our algorithms can fairly select arms with different feedback delays.

algorithm, fairness regret, reward regret, (14 more...)

arXiv.org Machine Learning

2407.15439

Country:

Asia > China > Hong Kong (0.04)
Asia > China > Guangdong Province > Shenzhen (0.04)
Europe > United Kingdom > England > Cambridgeshire > Cambridge (0.04)
Europe > France > Grand Est > Bas-Rhin > Strasbourg (0.04)

Genre: Research Report (0.40)

Industry:

Marketing (0.48)
Information Technology > Services (0.34)

Technology:

Information Technology > Data Science > Data Mining > Big Data (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)

Fair Distributed Cooperative Bandit Learning on Networks for Intelligent Internet of Things Systems (Technical Report)

Chen, Ziqun, Cai, Kechao, Zhang, Jinbei, Yu, Zhigang

arXiv.org Artificial Intelligence | Mar-18-2024

In intelligent Internet of Things (IoT) systems, edge servers within a network exchange information with their neighbors and collect data from sensors to complete delivered tasks. In this paper, we propose a multiplayer multi-armed bandit model for intelligent IoT systems to facilitate data collection and incorporate fairness considerations. In our model, we establish an effective communication protocol that helps servers cooperate with their neighbors. Then we design a distributed cooperative bandit algorithm, DC-ULCB, enabling servers to collaboratively select sensors to maximize data rates while maintaining fairness in their choices. We conduct an analysis of the reward regret and fairness regret of DC-ULCB, and prove that both regrets have logarithmic instance-dependent upper bounds. Additionally, through extensive simulations, we validate that DC-ULCB outperforms existing algorithms in maximizing reward and ensuring fairness.

algorithm, sensor, server, (16 more...)

arXiv.org Artificial Intelligence

2403.11603

Country:

Europe > United Kingdom > England > Cambridgeshire > Cambridge (0.04)
Asia > China > Guangdong Province > Shenzhen (0.04)
Asia > China > Beijing > Beijing (0.04)

Genre: Research Report (0.82)

Industry: Information Technology > Smart Houses & Appliances (1.00)

Technology:

Information Technology > Artificial Intelligence > Machine Learning (1.00)
Information Technology > Data Science > Data Mining > Big Data (0.70)
Information Technology > Communications > Networks > Sensor Networks (0.46)

Online Restless Multi-Armed Bandits with Long-Term Fairness Constraints

Wang, Shufan, Xiong, Guojun, Li, Jian

arXiv.org Artificial Intelligence | Dec-21-2023

Restless multi-armed bandits (RMAB) have been widely used to model sequential decision-making problems with constraints. The decision maker (DM) aims to maximize the expected total reward over an infinite horizon under an "instantaneous activation constraint" that at most B arms can be activated at any decision epoch, where the state of each arm evolves stochastically according to a Markov decision process (MDP). However, this basic model fails to provide any fairness guarantee among arms. In this paper, we introduce RMAB-F, a new RMAB model with "long-term fairness constraints", where the objective now is to maximize the long-term reward while a minimum long-term activation fraction for each arm must be satisfied. For the online RMAB-F setting (i.e., the underlying MDPs associated with each arm are unknown to the DM), we develop a novel reinforcement learning (RL) algorithm named Fair-UCRL. We prove that Fair-UCRL ensures probabilistic sublinear bounds on both the reward regret and the fairness violation regret. Compared with off-the-shelf RL methods, our Fair-UCRL is much more computationally efficient since it contains a novel exploitation that leverages a low-complexity index policy for making decisions. Experimental results further demonstrate the effectiveness of our Fair-UCRL.

constraint, fairness constraint, fairness violation regret, (16 more...)

arXiv.org Artificial Intelligence

2312.10303

Country:

Oceania > New Zealand (0.04)
North America > United States > New York > Suffolk County > Stony Brook (0.04)
North America > United States > New Jersey (0.04)
Europe > United Kingdom > England > Cambridgeshire > Cambridge (0.04)

Genre: Research Report (0.50)

Industry:

Health & Medicine (0.93)
Government > Regional Government (0.46)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Learning Graphical Models > Undirected Networks > Markov Models (0.88)
Information Technology > Data Science > Data Mining > Big Data (0.84)

Learning Policies with Zero or Bounded Constraint Violation for Constrained MDPs

Liu, Tao, Zhou, Ruida, Kalathil, Dileep, Kumar, P. R., Tian, Chao

arXiv.org Artificial Intelligence | Jan-24-2023

We address the issue of safety in reinforcement learning. We pose the problem in an episodic framework of a constrained Markov decision process. Existing results have shown that it is possible to achieve a reward regret of $\tilde{\mathcal{O}}(\sqrt{K})$ while allowing an $\tilde{\mathcal{O}}(\sqrt{K})$ constraint violation in $K$ episodes. A critical question that arises is whether it is possible to keep the constraint violation even smaller. We show that when a strictly safe policy is known, then one can confine the system to zero constraint violation with arbitrarily high probability while keeping the reward regret of order $\tilde{\mathcal{O}}(\sqrt{K})$. The algorithm which does so employs the principle of optimistic pessimism in the face of uncertainty to achieve safe exploration. When no strictly safe policy is known, though one is known to exist, then it is possible to restrict the system to bounded constraint violation with arbitrarily high probability. This is shown to be realized by a primal-dual algorithm with an optimistic primal estimate and a pessimistic dual update.
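
To make the closing sentence of this abstract more concrete, here is a minimal Python sketch of a generic primal-dual step with a pessimistic dual update. It is not the authors' algorithm. The function names, the `margin` parameter, and the scalar cost and reward estimates are all hypothetical, chosen only to illustrate projected dual ascent against a tightened threshold.

```python
def pessimistic_dual_step(lam, cost_estimate, threshold, step_size, margin=0.0):
    """One projected dual-ascent step on a Lagrange multiplier.

    Assumption (illustrative only): pessimism is modelled by tightening the
    threshold by `margin`, so the multiplier grows even when the estimated
    cost is slightly below the true threshold.
    """
    gap = cost_estimate - (threshold - margin)
    # Project back onto the non-negative reals.
    return max(0.0, lam + step_size * gap)


def lagrangian_score(reward_estimate, cost_estimate, lam):
    """Score a candidate policy: estimated reward minus the priced cost."""
    return reward_estimate - lam * cost_estimate


if __name__ == "__main__":
    lam = 0.0
    # Hypothetical per-episode cost estimates for the played policy.
    for cost in [0.6, 0.6, 0.3]:
        lam = pessimistic_dual_step(lam, cost, threshold=0.5,
                                    step_size=1.0, margin=0.05)
        print(round(lam, 3))
```

In a full primal-dual scheme the primal side would pick the policy maximising `lagrangian_score` under optimistic reward estimates, while the multiplier is updated as above. How the optimism and pessimism bonuses are actually constructed is specified in the paper, not here.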

artificial intelligence, machine learning, probability, (17 more...)

arXiv.org Artificial Intelligence

2106.02684

Country:

North America > United States > Texas > Brazos County > College Station (0.04)
Asia > Middle East > Jordan (0.04)

Genre: Research Report (0.63)

Industry:

Government > Regional Government > North America Government > United States Government (1.00)
Energy > Renewable (0.92)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning > Constraint-Based Reasoning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Learning Graphical Models > Undirected Networks > Markov Models (0.34)
