AITopics

2011.0459

Genre: Research Report (0.69)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning > Agents (0.53)
Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (0.53)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (0.53)

Grinsztajn, Nathan, Beaumont, Olivier, Jeannot, Emmanuel, Preux, Philippe

Geometric Deep Reinforcement Learning for Dynamic DAG Scheduling

arXiv.org Artificial IntelligenceNov-9-2020

In practice, it is quite common to face combinatorial optimization problems which contain uncertainty along with non-determinism and dynamicity. These three properties call for appropriate algorithms; reinforcement learning (RL) is dealing with them in a very natural way. Today, despite some efforts, most real-life combinatorial optimization problems remain out of the reach of reinforcement learning algorithms. In this paper, we propose a reinforcement learning approach to solve a realistic scheduling problem, and apply it to an algorithm commonly executed in the high performance computing community, the Cholesky factorization. On the contrary to static scheduling, where tasks are assigned to processors in a predetermined ordering before the beginning of the parallel execution, our method is dynamic: task allocations and their execution ordering are decided at runtime, based on the system state and unexpected events, which allows much more flexibility. To do so, our algorithm uses graph neural networks in combination with an actor-critic algorithm (A2C) to build an adaptive representation of the problem on the fly. We show that this approach is competitive with state-of-the-art heuristics used in high-performance computing runtime systems. Moreover, our algorithm does not require an explicit model of the environment, but we demonstrate that extra knowledge can easily be incorporated and improves performance. We also exhibit key properties provided by this RL approach, and study its transfer abilities to other instances.

algorithm, holesky factorization, reinforcement, (15 more...)

2011.04333

Country:

Europe > France > Nouvelle-Aquitaine > Gironde > Bordeaux (0.04)
Europe > France > Hauts-de-France > Nord > Lille (0.04)
Oceania > Australia > New South Wales > Sydney (0.04)
North America > United States > California > San Francisco County > San Francisco (0.04)

Genre: Research Report (0.50)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (1.00)

Akbulut, M. Tuluhan, Bozdogan, Utku, Tekden, Ahmet, Ugur, Emre

Reward Conditioned Neural Movement Primitives for Population Based Variational Policy Optimization

arXiv.org Artificial IntelligenceNov-9-2020

The aim of this paper is to study the reward based policy exploration problem in a supervised learning approach and enable robots to form complex movement trajectories in challenging reward settings and search spaces. For this, the experience of the robot, which can be bootstrapped from demonstrated trajectories, is used to train a novel Neural Processes-based deep network that samples from its latent space and generates the required trajectories given desired rewards. Our framework can generate progressively improved trajectories by sampling them from high reward landscapes, increasing the reward gradually. Variational inference is used to create a stochastic latent space to sample varying trajectories in generating population of trajectories given target rewards. We benefit from Evolutionary Strategies and propose a novel crossover operation, which is applied in the self-organized latent space of the individual policies, allowing blending of the individuals that might address different factors in the reward function. Using a number of tasks that require sequential reaching to multiple points or passing through gaps between objects, we showed that our method provides stable learning progress and significant sample efficiency compared to a number of state-of-the-art robotic reinforcement learning methods. Finally, we show the real-world suitability of our method through real robot execution involving obstacle avoidance.

learning, opération, trajectory, (13 more...)

2011.04282

Country:

Europe > Germany (0.04)
Asia > Middle East > Republic of Türkiye (0.04)

Genre: Research Report (0.64)

Technology:

Information Technology > Artificial Intelligence > Robots (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Evolutionary Systems (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (0.94)

Kell, Alexander J. M., Forshaw, Matthew, McGough, A. Stephen

Exploring market power using deep reinforcement learning for intelligent bidding strategies

arXiv.org Artificial IntelligenceNov-8-2020

Decentralized electricity markets are often dominated by a small set of generator companies who control the majority of the capacity. In this paper, we explore the effect of the total controlled electricity capacity by a single, or group, of generator companies can have on the average electricity price. We demonstrate this through the use of ElecSim, a simulation of a country-wide energy market. We develop a strategic agent, representing a generation company, which uses a deep deterministic policy gradient reinforcement learning algorithm to bid in a uniform pricing electricity market. A uniform pricing market is one where all players are paid the highest accepted price. ElecSim is parameterized to the United Kingdom for the year 2018. This work can help inform policy on how to best regulate a market to ensure that the price of electricity remains competitive. We find that capacity has an impact on the average electricity price in a single year. If any single generator company, or a collaborating group of generator companies, control more than ${\sim}$11$\%$ of generation capacity and bid strategically, prices begin to increase by ${\sim}$25$\%$. The value of ${\sim}$25\% and ${\sim}$11\% may vary between market structures and countries. For instance, different load profiles may favour a particular type of generator or a different distribution of generation capacity. Once the capacity controlled by a generator company, which bids strategically, is higher than ${\sim}$35\%, prices increase exponentially. We observe that the use of a market cap of approximately double the average market price has the effect of significantly decreasing this effect and maintaining a competitive market. A fair and competitive electricity market provides value to consumers and enables a more competitive economy through the utilisation of electricity by both industry and consumers.

algorithm, electricity market, genco, (16 more...)

2011.04079

Country: Europe > United Kingdom > England > Greater London > London (0.04)

Genre: Research Report (1.00)

Industry: Energy > Power Industry (1.00)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (1.00)

Reliable Off-policy Evaluation for Reinforcement Learning

Wang, Jie, Gao, Rui, Zha, Hongyuan

Reinforcement learning (RL) has achieved phenomenal success in games and robotics [,, ] in the past decade, which also stimulates the enthusiasm of extending these techniques in other areas including healthcare [, ], education [ ], autonomous driving [ ], recommendation systems [, ], etc. One of the major challenges in applying RL to these real-world applications, especially those involve high-stake environments, is the problem of o -policy evaluation (OPE): how one can evaluate a new policy before deployment, using only historical data collected from a di erent policy, known as the behavior policy. Indeed, for many practical applications, one may not have a faithful simulator of the domain from which su cient amount of data can be exploited to train the RL system, and it may not always be feasible to try out a new policy without causing unintended harms. For example, consider the problem of finding the best treatment plan for a patient, or testing the performance of an automated driving system, or suggesting a personalized curriculum for a student. In those tasks, conducting experimentation involves interactions with real people, thus it can be costly to collect data and even worse, a bad policy can be risky or unethical and may result in severe consequences. Therefore, it is important for the RL system to have the ability to predict how well a new policy would perform without having to deploy it first. While most existing works on OPE aim to provide accurate point estimates for short-horizon problems [,,, ] as well as long-or infinite-horizon problems [,,,,, ], it is equally important to quantify the uncertainty of the OPE point estimates for both safe exploration and optimistic planning.

arxiv preprint arxiv, evaluation, probability, (14 more...)

2011.04102

Country:

North America > United States > California > San Francisco County > San Francisco (0.14)
Asia > China > Guangdong Province > Shenzhen (0.04)
Asia > China > Hong Kong (0.04)
(2 more...)

Genre: Research Report > New Finding (1.00)

Industry:

Health & Medicine (1.00)
Transportation > Ground > Road (0.54)
Information Technology > Robotics & Automation (0.54)

Technology:

Information Technology > Data Science > Data Mining (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.46)

Sparse Feature Selection Makes Batch Reinforcement Learning More Sample Efficient

Hao, Botao, Duan, Yaqi, Lattimore, Tor, Szepesvári, Csaba, Wang, Mengdi

This paper provides a statistical analysis of high-dimensional batch Reinforcement Learning (RL) using sparse linear function approximation. When there is a large number of candidate features, our result sheds light on the fact that sparsity-aware methods can make batch RL more sample efficient. We first consider the off-policy policy evaluation problem. To evaluate a new target policy, we analyze a Lasso fitted Q-evaluation method and establish a finite-sample error bound that has no polynomial dependence on the ambient dimension. To reduce the Lasso bias, we further propose a post model-selection estimator that applies fitted Q-evaluation to the features selected via group Lasso. Under an additional signal strength assumption, we derive a sharper instance-dependent error bound that depends on a divergence function measuring the distribution mismatch between the data distribution and occupancy measure of the target policy. Further, we study the Lasso fitted Q-iteration for batch policy optimization and establish a finite-sample error bound depending on the ratio between the number of relevant features and restricted minimal eigenvalue of the data's covariance. In the end, we complement the results with minimax lower bounds for batch-data policy evaluation/optimization that nearly match our upper bounds. The results suggest that having well-conditioned data is crucial for sparse batch policy learning.

eigenvalue, evaluation, probability, (13 more...)

2011.04019

Country:

North America > Canada > Alberta (0.14)
Europe > United Kingdom > England > Greater London > London (0.04)
Europe > United Kingdom > England > Cambridgeshire > Cambridge (0.04)
Asia > Middle East > Jordan (0.04)

Genre: Research Report > New Finding (0.86)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Regression (0.46)

Hao, Botao, Lattimore, Tor, Szepesvári, Csaba, Wang, Mengdi

Online Sparse Reinforcement Learning

Sparse models in classical statistics often yield the best of both worlds: high representation power is achieved by including many features while sparsity leads to efficient estimation. There is a growing interest in applying the tools developed by statisticians to sequential settings such as contextual bandits and reinforcement learning (RL). As we now explore, in online RL this leads to a number of delicate tradeoffs between assumptions and sample complexity. The use of sparsity in reinforcement learning (RL) has been explored before in the context of policy evaluation or policy optimization in the batch setting [Kolter and Ng, 2009, Geist and Scherrer, 2011, Hoffman et al., 2011, Painter-Wakefield and Parr, 2012, Ghavamzadeh et al., 2011]. As far as we know, there has been very little work on the role of sparsity in online RL. In batch RL, the dataset is given a priori and the focus is typically on evaluating a given target policy or learning a near-optimal policy. By contrast, the central question in online RL is how to sequentially interact with the environment to balance the tradeoff between exploration and exploitation, measured here by the cumulative regret. We ask the following question: Under what circumstances does sparsity help when minimizing regret in online RL?

exploratory policy, learner, probability, (14 more...)

2011.04018

Country:

North America > Canada > Alberta (0.14)
Europe > United Kingdom > England > Cambridgeshire > Cambridge (0.04)
Asia > Middle East > Jordan (0.04)

Genre: Research Report (0.50)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (1.00)

Karino, Izumi, Ohmura, Yoshiyuki, Kuniyoshi, Yasuo

Identifying Critical States by the Action-Based Variance of Expected Return

The balance of exploration and exploitation plays a crucial role in accelerating reinforcement learning (RL). To deploy an RL agent in human society, its explainability is also essential. However, basic RL approaches have difficulties in deciding when to choose exploitation as well as in extracting useful points for a brief explanation of its operation. One reason for the difficulties is that these approaches treat all states the same way. Here, we show that identifying critical states and treating them specially is commonly beneficial to both problems. These critical states are the states at which the action selection changes the potential of success and failure substantially. We propose to identify the critical states using the variance in the Q-function for the actions and to perform exploitation with high probability on the identified states. These simple methods accelerate RL in a grid world with cliffs and two baseline tasks of deep RL. Our results also demonstrate that the identified critical states are intuitively interpretable regarding the crucial nature of the action selection. Furthermore, our analysis of the relationship between the timing of the identification of especially critical states and the rapid progress of learning suggests there are a few especially critical states that have important information for accelerating RL rapidly.

agent, critical state, exploitation, (14 more...)

doi: 10.1007/978-3-030-61609-0_29

2008.11332

Country:

Asia > Japan > Honshū > Kantō > Tokyo Metropolis Prefecture > Tokyo (0.05)
North America > United States > California > San Diego County > San Diego (0.04)

Genre: Research Report > New Finding (0.48)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (0.68)

Multi-Agent Reinforcement Learning in Time-varying Networked Systems

Lin, Yiheng, Qu, Guannan, Huang, Longbo, Wierman, Adam

In comparison to single-agent reinforcement learning (RL), MARL poses many challenges, chief of which is scalability [49]. Even if each agent's local state/action spaces are small, the size of the global state/action space can be large, potentially exponentially large in the number of agents, which renders many RL algorithms such as -learning not applicable. A promising approach for addressing the scalability challenge that has received attention in recent years is to exploit application-specific structures, e.g., [16, 32, 35]. A particularly important example of such a structure is a networked structure, e.g., applications in multi-agent networked systems such as social networks [6, 24], communication networks [44, 52], queueing networks [31], and smart transportation networks [51]. In these networked systems, it is often possible to exploit static, local dependency structures [1, 14, 15, 29], e.g., the fact that agents only interact with a fixed set of neighboring agents throughout the game. This sort of dependency structure often leads to scalable, distributed algorithms for optimization and control [1, 14, 29], and has proven effective for designing scalable and distributed MARL algorithms, e.g.

multi-agent reinforcement learning, probability, state aggregation, (11 more...)

2006.06555

Country:

North America > United States > New York (0.04)
North America > United States > Massachusetts > Middlesex County > Cambridge (0.04)
North America > United States > Massachusetts > Middlesex County > Belmont (0.04)
(3 more...)

Genre: Research Report (1.00)

Industry:

Information Technology (0.34)
Transportation (0.34)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning > Agents > Agent Societies (0.34)

arXiv.org Artificial IntelligenceNov-8-2020

On the role of planning in model-based deep reinforcement learning

Hamrick, Jessica B., Friesen, Abram L., Behbahani, Feryal, Guez, Arthur, Viola, Fabio, Witherspoon, Sims, Anthony, Thomas, Buesing, Lars, Veličković, Petar, Weber, Théophane

Model-based planning is often thought to be necessary for deep, careful reasoning and generalization in artificial agents. While recent successes of model-based reinforcement learning (MBRL) with deep function approximation have strengthened this hypothesis, the resulting diversity of model-based methods has also made it difficult to track which components drive success and why. In this paper, we seek to disentangle the contributions of recent methods by focusing on three questions: (1) How does planning benefit MBRL agents? (2) Within planning, what choices drive performance? (3) To what extent does planning improve generalization? To answer these questions, we study the performance of MuZero (Schrittwieser et al., 2019), a state-of-the-art MBRL algorithm, under a number of interventions and ablations and across a wide range of environments including control tasks, Atari, and 9x9 Go. Our results suggest the following: (1) The primary benefit of planning is in driving policy learning. (2) Using shallow trees with simple Monte-Carlo rollouts is as performant as more complex methods, except in the most difficult reasoning tasks. (3) Planning alone is insufficient to drive strong generalization. These results indicate where and how to utilize planning in reinforcement learning settings, and highlight a number of open questions for future MBRL research.

generalization, search budget, simulation, (13 more...)

2011.04021

Country: Europe > United Kingdom > England > Greater London > London (0.04)

Genre: Research Report > New Finding (1.00)

Industry: Leisure & Entertainment > Games > Computer Games (0.67)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.46)