Probabilistic Shielding for Safe Reinforcement Learning
Hamel-De le Court, Edwin, Belardinelli, Francesco, Goodall, Alex W.
In real-life scenarios, a Reinforcement Learning (RL) agent aiming to maximise its reward must often also behave safely, including at training time. Much attention has therefore been given in recent years to Safe RL, where an agent aims to learn an optimal policy among all policies that satisfy a given safety constraint. However, strict safety guarantees are often provided through approaches based on linear programming, which scale poorly. In this paper we present a new, scalable method that enjoys strict formal guarantees for Safe RL, in the case where the safety dynamics of the Markov Decision Process (MDP) are known and safety is defined as an undiscounted probabilistic avoidance property. Our approach is based on state augmentation of the MDP and on the design of a shield that restricts the actions available to the agent. We show that our approach provides a strict formal guarantee that the agent stays safe at training and test time. Furthermore, we demonstrate through experimental evaluation that our approach is viable in practice.
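The shielding idea in this abstract can be illustrated with a minimal sketch (not the paper's construction): for a toy MDP whose safety dynamics are known, value iteration gives each state's maximal probability of forever avoiding an unsafe state, and the shield then masks any action whose expected avoidance probability falls below a threshold. All state/action names, transition probabilities, and the threshold below are invented for illustration.

```python
UNSAFE = "bad"
# transition[state][action] = list of (next_state, probability) pairs
transition = {
    "s0": {"a": [("s1", 1.0)], "b": [("s0", 0.5), ("bad", 0.5)]},
    "s1": {"a": [("s1", 1.0)], "b": [("bad", 1.0)]},
    "bad": {"a": [("bad", 1.0)]},
}

def avoid_values(transition, unsafe, iters=100):
    """Max probability of never reaching `unsafe`, per state."""
    # Start from the optimistic bound 1.0 and iterate downwards:
    # avoidance is a greatest-fixed-point property.
    v = {s: (0.0 if s == unsafe else 1.0) for s in transition}
    for _ in range(iters):
        v = {s: 0.0 if s == unsafe else
                max(sum(p * v[t] for t, p in outs)
                    for outs in transition[s].values())
             for s in transition}
    return v

def shield(state, transition, v, threshold):
    """Actions whose expected avoidance probability meets the threshold."""
    return [a for a, outs in transition[state].items()
            if sum(p * v[t] for t, p in outs) >= threshold]
```

With a threshold of 0.9 the shield at `s0` allows only action `a`; lowering it to 0.4 also admits the risky action `b`, so the learner's exploration budget is controlled by a single parameter.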
Taming Infinity one Chunk at a Time: Concisely Represented Strategies in One-Counter MDPs
Ajdarów, Michal, Main, James C. A., Novotný, Petr, Randour, Mickael
Markov decision processes (MDPs) are a canonical model for reasoning about decision making within a stochastic environment. We study a fundamental class of infinite MDPs: one-counter MDPs (OC-MDPs). They extend finite MDPs with an associated counter taking natural values, thus inducing an infinite MDP over the set of configurations (current state and counter value). We consider two characteristic objectives: reaching a target state (state-reachability), and reaching a target state with counter value zero (selective termination). The synthesis problem for the latter is not known to be decidable and is connected to major open problems in number theory. Furthermore, even seemingly simple strategies (e.g., memoryless ones) in OC-MDPs may be impossible to build in practice (due to the underlying infinite configuration space): we need finite, and preferably small, representations. To overcome these obstacles, we introduce two natural classes of concisely represented strategies based on a (possibly infinite) partition of counter values into intervals. For both classes, and both objectives, we study the verification problem (does a given strategy ensure a high enough probability for the objective?), and two synthesis problems (does such a strategy exist?): one where the interval partition is fixed as input, and one where it is only parameterized. We develop a generic approach based on a compression of the induced infinite MDP that yields decidability in all cases, with all complexities within PSPACE.
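The "concisely represented strategy" concept can be sketched as a finite data structure: a sorted list of interval left endpoints, each mapped to an action, covering all (infinitely many) counter values. This is a hypothetical illustration under assumed names, not the paper's formal definition, and it ignores state-dependence for brevity.

```python
import bisect

class IntervalStrategy:
    """A strategy over counter values, stored as finitely many intervals.

    breakpoints: sorted left endpoints, e.g. [0, 5, 100]; actions[i] is
    played for counter values in [breakpoints[i], breakpoints[i+1]), and
    the last interval extends to infinity.
    """

    def __init__(self, breakpoints, actions):
        assert len(breakpoints) == len(actions) and breakpoints[0] == 0
        self.breakpoints = breakpoints
        self.actions = actions

    def act(self, state, counter):
        # A richer strategy could also condition on `state`; here the
        # choice depends only on which interval the counter falls in.
        i = bisect.bisect_right(self.breakpoints, counter) - 1
        return self.actions[i]

# Hypothetical strategy: play differently at low, medium, and high counters.
strat = IntervalStrategy([0, 5, 100], ["risky", "cautious", "terminate"])
```

The point of the representation is that lookup and storage are independent of the (unbounded) counter value, which is exactly what makes verification and synthesis over the induced infinite MDP approachable.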
Logarithmic Regret of Exploration in Average Reward Markov Decision Processes
In average reward Markov decision processes, state-of-the-art algorithms for regret minimization follow a well-established framework: they are model-based, optimistic and episodic. First, they maintain a confidence region from which optimistic policies are computed using a well-known subroutine called Extended Value Iteration (EVI). Second, these policies are used over time windows called episodes, each ended by the Doubling Trick (DT) rule or a variant thereof. In this work, without modifying EVI, we show that there is a significant advantage in replacing (DT) by another simple rule, which we call the Vanishing Multiplicative (VM) rule. When managing episodes with (VM), the algorithm's regret is, both in theory and in practice, as good as, if not better than, with (DT), while the one-shot behavior is greatly improved. More specifically, bad episodes (when sub-optimal policies are being used) are managed much better under (VM) than under (DT), making the regret of exploration logarithmic rather than linear. These results are made possible by a new in-depth understanding of the contrasting behaviors of confidence regions during good and bad episodes.
ChatGPT, Artificial Intelligence, and UDL: How to Harness the Future to Reach All Students
Turning 10 is a big milestone for kiddos, so when it comes to planning the big celebration, you want it to be fun, entertaining, and a bit unique. In the past, when looking for ideas for my kids' birthdays, I have always googled and discovered blog articles, asked my friends and family, or come up with something on my own. It wasn't until I tried out ChatGPT that I discovered its ability not only to come up with a plethora of suggestions but to make those suggestions genuinely creative. So when I asked, "Got any creative ideas for a 10-year-old's birthday?", here's what ChatGPT told me about ten seconds later.
Artificial Intelligence in Aviation Industry is Expected to Reach $3.4 Billion by 2027
LONDON – The global Artificial Intelligence in Aviation market was estimated at USD 508.89 million in 2021 and USD 697.59 million in 2022, and is projected to grow at a CAGR of 37.25% to reach USD 3,402.84 million by 2027. Late last month, the "Artificial Intelligence in Aviation Market Research Report by Technology, Offering, Application, Region – Global Forecast to 2027 – Cumulative Impact of COVID-19" report was published by Research And Markets. The Competitive Strategic Window analyses the competitive landscape in terms of markets, applications, and geographies to help vendors define an alignment or fit between their capabilities and opportunities for future growth. It describes the optimal or favorable fit for vendors to adopt successive merger and acquisition strategies, geographic expansion, research & development, and new product introduction strategies to execute further business expansion and growth during the forecast period. The FPNV Positioning Matrix evaluates and categorizes vendors in the Artificial Intelligence in Aviation market based on Business Strategy (Business Growth, Industry Coverage, Financial Viability, and Channel Support) and Product Satisfaction (Value for Money, Ease of Use, Product Features, and Customer Support), aiding businesses in decision making and in understanding the competitive landscape.
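The reported figures are internally consistent, as a quick back-of-envelope check shows: compounding the 2022 base at the stated CAGR over the five years 2022 to 2027 lands within about half a percent of the projected 2027 value (the small gap is presumably rounding in the report's inputs).

```python
# Figures as reported: USD millions and the stated compound annual growth rate.
base_2022 = 697.59
cagr = 0.3725
years = 5  # 2022 -> 2027

# Compound growth: value_2027 = base_2022 * (1 + cagr) ** years
projected_2027 = base_2022 * (1 + cagr) ** years
```

Running this gives roughly USD 3,398 million against the reported USD 3,402.84 million.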
Best IT recruitment agencies in Switzerland
If you are looking for the best IT recruitment agencies in Switzerland, look no further than NEWcruitment. They offer carefully analyzed technological profiles such as Crypto and Blockchain Web 3.0 recruiters, Crypto and Quant traders, DeFi Engineers, Blockchain Security Engineers, and Web3 Frontend Developers.
Opportunistic Qualitative Planning in Stochastic Systems with Incomplete Preferences over Reachability Objectives
Kulkarni, Abhishek N., Fu, Jie
Preferences play a key role in determining what goals/constraints to satisfy when not all constraints can be satisfied simultaneously. In this paper, we study how to synthesize preference-satisfying plans in stochastic systems, modeled as an MDP, given a (possibly incomplete) combinative preference model over temporally extended goals. We start by introducing new semantics to interpret preferences over infinite plays of the stochastic system. Then, we introduce a new notion of improvement to enable comparison between two prefixes of an infinite play. Based on this, we define two solution concepts, safe and positively improving (SPI) and safe and almost-surely improving (SASI), that enforce improvements with positive probability and with probability one, respectively. We construct a model called an improvement MDP, in which the synthesis of SPI and SASI strategies guaranteeing at least one improvement reduces to computing positive and almost-sure winning strategies in an MDP. We present an algorithm to synthesize SPI and SASI strategies that induce multiple sequential improvements, and we demonstrate the proposed approach on a robot motion planning problem.
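The two winning notions this reduction targets can be computed from an MDP's graph structure alone, and a small sketch makes the difference concrete. This is the classic qualitative analysis (positive reachability via backward reachability; almost-sure reachability via a nested fixed point), illustrated on an invented example, not the paper's improvement-MDP construction itself.

```python
def positive_winning(transition, target):
    """States reaching `target` with positive probability:
    backward reachability over possible successors."""
    win = set(target)
    changed = True
    while changed:
        changed = False
        for s, acts in transition.items():
            if s not in win and any(succ & win for succ in acts.values()):
                win.add(s)
                changed = True
    return win

def almost_sure_winning(transition, target):
    """States reaching `target` with probability one: repeatedly restrict
    to states that can reach the target using only actions whose every
    successor stays inside the current candidate set."""
    candidates = set(transition)
    while True:
        safe_acts = {s: {a: succ for a, succ in acts.items()
                         if succ <= candidates}
                     for s, acts in transition.items() if s in candidates}
        reach = {t for t in target if t in candidates}
        changed = True
        while changed:
            changed = False
            for s, acts in safe_acts.items():
                if s not in reach and any(succ & reach
                                          for succ in acts.values()):
                    reach.add(s)
                    changed = True
        if reach == candidates:
            return candidates
        candidates = reach

# Invented example: transition[state][action] = set of possible successors.
example = {
    "s0":  {"go": {"s1"}, "risk": {"s0", "bad"}},
    "s1":  {"go": {"t"}},
    "r":   {"gamble": {"t", "bad"}},
    "t":   {"stay": {"t"}},
    "bad": {"stay": {"bad"}},
}
```

On this example, state `r` wins positively (its gamble may hit the target) but not almost surely (it may get stuck in `bad`), which is precisely the gap between the SPI-style and SASI-style guarantees.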