Russo, Daniel
Optimizing Audio Recommendations for the Long-Term: A Reinforcement Learning Perspective
Maystre, Lucas, Russo, Daniel, Zhao, Yu
We study the problem of optimizing a recommender system for outcomes that occur over several weeks or months. We begin by drawing on reinforcement learning to formulate a comprehensive model of users' recurring relationships with a recommender system. Measurement, attribution, and coordination challenges complicate algorithm design. We describe careful modeling -- including a new representation of user state and key conditional independence assumptions -- which overcomes these challenges and leads to simple, testable recommender system prototypes. We apply our approach to a podcast recommender system that makes personalized recommendations to hundreds of millions of listeners. A/B tests demonstrate that purposefully optimizing for long-term outcomes leads to large performance gains over conventional approaches that optimize for short-term proxies.
On the Statistical Benefits of Temporal Difference Learning
Cheikhi, David, Russo, Daniel
Given a dataset on actions and resulting long-term rewards, a direct estimation approach fits value functions that minimize prediction error on the training data. Temporal difference learning (TD) methods instead fit value functions by minimizing the degree of temporal inconsistency between estimates made at successive time-steps. Focusing on finite state Markov chains, we provide a crisp asymptotic theory of the statistical advantages of this approach. First, we show that an intuitive inverse trajectory pooling coefficient completely characterizes the percent reduction in mean-squared error of value estimates. Depending on problem structure, the reduction could be enormous or nonexistent. Next, we prove that there can be dramatic improvements in estimates of the difference in value-to-go for two states: TD's errors are bounded in terms of a novel measure -- the problem's trajectory crossing time -- which can be much smaller than the problem's time horizon.
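To make the comparison concrete, the sketch below contrasts a direct (Monte Carlo) fit of state values with TD(0) on a small finite-state Markov chain. The three-state chain, reward noise, discount factor, and learning rate are illustrative assumptions, not taken from the paper.

```python
import numpy as np

# Minimal sketch (not from the paper): compare a direct Monte Carlo fit of
# state values against TD(0) on a small finite-state Markov chain.
# The 3-state chain, rewards, and learning rate below are illustrative choices.

rng = np.random.default_rng(0)
n_states = 3
P = np.array([[0.1, 0.6, 0.3],
              [0.2, 0.5, 0.3],
              [0.4, 0.4, 0.2]])   # transition matrix (rows sum to 1)
r = np.array([1.0, 0.0, -1.0])    # expected reward received in each state
gamma = 0.9                       # discount factor

def sample_trajectory(start, length):
    states, rewards = [start], []
    s = start
    for _ in range(length):
        rewards.append(r[s] + 0.1 * rng.standard_normal())
        s = rng.choice(n_states, p=P[s])
        states.append(s)
    return states, rewards

def monte_carlo_values(trajectories, horizon):
    """Direct approach: average truncated discounted returns from each visit."""
    totals, counts = np.zeros(n_states), np.zeros(n_states)
    for states, rewards in trajectories:
        for t, s in enumerate(states[:-1]):
            ret = sum(gamma ** k * rewards[t + k]
                      for k in range(min(horizon, len(rewards) - t)))
            totals[s] += ret
            counts[s] += 1
    return totals / np.maximum(counts, 1)

def td0_values(trajectories, alpha=0.05, sweeps=50):
    """TD(0): repeatedly reduce temporal inconsistency between successive estimates."""
    V = np.zeros(n_states)
    for _ in range(sweeps):
        for states, rewards in trajectories:
            for t in range(len(rewards)):
                s, s_next = states[t], states[t + 1]
                V[s] += alpha * (rewards[t] + gamma * V[s_next] - V[s])
    return V

data = [sample_trajectory(rng.integers(n_states), 40) for _ in range(200)]
true_V = np.linalg.solve(np.eye(n_states) - gamma * P, r)  # exact values
print("true     :", np.round(true_V, 3))
print("direct MC:", np.round(monte_carlo_values(data, 40), 3))
print("TD(0)    :", np.round(td0_values(data), 3))
```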
Learning to Stop with Surprisingly Few Samples
Zhang, Tianyi, Russo, Daniel, Zeevi, Assaf
We consider a discounted infinite horizon optimal stopping problem. If the underlying distribution is known a priori, the solution of this problem is obtained via dynamic programming (DP) and is given by a well known threshold rule. When information on this distribution is lacking, a natural (though naive) approach is "explore-then-exploit," whereby the unknown distribution or its parameters are estimated over an initial exploration phase, and this estimate is then used in the DP to determine actions over the residual exploitation phase. We show: (i) with proper tuning, this approach leads to performance comparable to the full information DP solution; and (ii) despite common wisdom on the sensitivity of such "plug in" approaches in DP due to propagation of estimation errors, a surprisingly "short" (logarithmic in the horizon) exploration horizon suffices to obtain said performance. In cases where the underlying distribution is heavy-tailed, these observations are even more pronounced: a ${\it single \, sample}$ exploration phase suffices.
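The sketch below illustrates the explore-then-exploit idea on a discounted, house-selling style stopping problem with i.i.d. offers; the offer distribution, discount factor, and logarithmic exploration length are assumptions chosen here for illustration, not the paper's exact setup.

```python
import numpy as np

# Minimal sketch (illustrative, not the paper's exact setup): a discounted
# stopping problem with i.i.d. offers. The offer distribution, discount
# factor, and sample sizes below are assumptions made for illustration.

rng = np.random.default_rng(1)
beta = 0.95  # discount factor

def threshold(samples, beta, iters=200):
    """Fixed point of tau = beta * E[max(offer, tau)], with the expectation
    taken over the empirical distribution of the exploration samples
    (a "plug-in" version of the DP threshold rule)."""
    tau = samples.mean()
    for _ in range(iters):
        tau = beta * np.maximum(samples, tau).mean()
    return tau

def run_episode(offers, tau):
    """Stop at the first offer exceeding the plug-in threshold and
    return its discounted value."""
    for t, x in enumerate(offers):
        if x >= tau:
            return beta ** t * x
    return 0.0

horizon = 10_000
n_explore = max(1, int(np.log(horizon)))   # "short" (logarithmic) exploration phase
explore = rng.exponential(scale=1.0, size=n_explore)
tau_hat = threshold(explore, beta)

# Exploitation phase: apply the estimated threshold to fresh offers.
offers = rng.exponential(scale=1.0, size=horizon)
print("estimated threshold:", round(tau_hat, 3))
print("discounted reward  :", round(run_episode(offers, tau_hat), 3))
```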
Approximation Benefits of Policy Gradient Methods with Aggregated States
Russo, Daniel
Folklore suggests that policy gradient can be more robust to misspecification than its relative, approximate policy iteration. This paper studies the case of state aggregation, where the state space is partitioned and either the policy or value function approximation is held constant over partitions. It shows that a policy gradient method converges to a policy whose per-period regret is bounded by $\epsilon$, the largest difference between two elements of the state-action value function belonging to a common partition. With the same representation, both approximate policy iteration and approximate value iteration can produce policies whose per-period regret scales as $\epsilon/(1-\gamma)$, where $\gamma$ is a discount factor. Theoretical results synthesize recent analysis of policy gradient methods with insights of Van Roy (2006) into the critical role of state-relevance weights in approximate dynamic programming.
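One way to picture the setup is the sketch below of a softmax policy held constant over state partitions, together with one reading of the quantity $\epsilon$ above; the partition map, parameter shapes, and placeholder Q-values are hypothetical.

```python
import numpy as np

# Minimal sketch (assumptions, not the paper's construction): a softmax policy
# held constant over state partitions. phi maps each of 6 states to one of 2
# aggregate states, and theta has one logit vector per aggregate state, so the
# policy cannot distinguish states within a partition.

n_states, n_actions = 6, 3
phi = np.array([0, 0, 0, 1, 1, 1])          # state -> partition index
theta = np.zeros((2, n_actions))            # one parameter block per partition

def policy(theta, s):
    logits = theta[phi[s]]
    p = np.exp(logits - logits.max())
    return p / p.sum()

def aggregation_error(Q):
    """One reading of epsilon in the abstract: the largest gap between two
    state-action values whose states share a partition."""
    eps = 0.0
    for k in range(theta.shape[0]):
        block = Q[phi == k]                  # Q-values of states in partition k
        eps = max(eps, block.max() - block.min())
    return eps

Q = np.random.default_rng(2).uniform(size=(n_states, n_actions))  # placeholder Q
print("pi(.|state 0):", np.round(policy(theta, 0), 3))
print("epsilon      :", round(aggregation_error(Q), 3))
```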
A Note on the Linear Convergence of Policy Gradient Methods
Bhandari, Jalaj, Russo, Daniel
We revisit the finite time analysis of policy gradient methods in the simplest setting: finite state and action problems with a policy class consisting of all stochastic policies and with exact gradient evaluations. Some recent works have viewed these problems as instances of smooth nonlinear optimization problems, suggesting small stepsizes and showing sublinear convergence rates. This note instead takes a policy iteration perspective and highlights that many versions of policy gradient succeed with extremely large stepsizes and attain a linear rate of convergence.
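As a concrete illustration of the policy iteration perspective, the sketch below runs one well-known version of policy gradient, the natural policy gradient update in its simple softmax form $\theta \leftarrow \theta + \eta A$, on a toy random MDP with exact evaluation and a deliberately large stepsize; the MDP, stepsize, and iteration count are illustrative choices, not the note's setting.

```python
import numpy as np

# Minimal sketch (toy random MDP, illustrative stepsize; not the note's
# notation): exact policy evaluation plus the natural policy gradient update
# in its familiar softmax form, theta += eta * A. With a large stepsize the
# updates behave like policy iteration steps, and the policy quickly becomes
# greedy with respect to its own Q-function.

rng = np.random.default_rng(3)
nS, nA, gamma = 4, 2, 0.9
P = rng.dirichlet(np.ones(nS), size=(nS, nA))   # P[s, a] is a next-state distribution
R = rng.uniform(size=(nS, nA))                  # rewards

def softmax(logits):
    z = np.exp(logits - logits.max(axis=1, keepdims=True))
    return z / z.sum(axis=1, keepdims=True)

def evaluate(pi):
    """Exact evaluation of a stochastic policy: returns Q and the advantage A."""
    P_pi = np.einsum('sa,sat->st', pi, P)
    r_pi = (pi * R).sum(axis=1)
    V = np.linalg.solve(np.eye(nS) - gamma * P_pi, r_pi)
    Q = R + gamma * P @ V
    return Q, Q - V[:, None]

theta = np.zeros((nS, nA))
eta = 100.0                                     # deliberately large stepsize
for _ in range(200):
    _, A = evaluate(softmax(theta))
    theta += eta * A                            # natural policy gradient step

pi = softmax(theta)
Q, _ = evaluate(pi)
print("greedy actions under Q^pi:", Q.argmax(axis=1))
print("actions favored by policy:", pi.argmax(axis=1))
```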
Learning to Optimize via Information-Directed Sampling
Russo, Daniel, Van Roy, Benjamin
We propose information-directed sampling -- a new algorithm for online optimization problems in which a decision-maker must balance between exploration and exploitation while learning from partial feedback. Each action is sampled in a manner that minimizes the ratio between the square of expected single-period regret and a measure of information gain: the mutual information between the optimal action and the next observation. We establish an expected regret bound for information-directed sampling that applies across a very general class of models and scales with the entropy of the optimal action distribution. For the widely studied Bernoulli and linear bandit models, we demonstrate simulation performance surpassing popular approaches, including upper confidence bound algorithms, Thompson sampling, and knowledge gradient. Further, we present simple analytic examples illustrating that information-directed sampling can dramatically outperform upper confidence bound algorithms and Thompson sampling due to the way it measures information gain.
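A simplified, deterministic variant of the idea can be sketched for a Bernoulli bandit as follows; the paper optimizes over randomized action distributions and uses mutual information, whereas this sketch picks a single action and substitutes a common variance-based proxy for the information gain, with all expectations approximated by posterior sampling.

```python
import numpy as np

# Minimal sketch (a simplified, deterministic variant; the paper randomizes
# over actions): information-directed sampling for a Bernoulli bandit with
# Beta posteriors. The information gain is replaced by a variance-based
# proxy, and expectations are approximated via posterior samples.

rng = np.random.default_rng(4)
true_means = np.array([0.3, 0.5, 0.7])      # unknown to the algorithm
k = len(true_means)
alpha, beta = np.ones(k), np.ones(k)        # Beta(1, 1) priors

def ids_action(n_samples=2000):
    theta = rng.beta(alpha, beta, size=(n_samples, k))       # posterior samples
    best = theta.argmax(axis=1)                              # sampled optimal arm
    mean = theta.mean(axis=0)
    regret = (theta.max(axis=1, keepdims=True) - theta).mean(axis=0)
    # Variance-based proxy for information gain about the optimal action:
    # E_{a*}[(E[theta_a | A* = a*] - E[theta_a])^2]
    gain = np.zeros(k)
    for a_star in range(k):
        mask = best == a_star
        if mask.any():
            gain += mask.mean() * (theta[mask].mean(axis=0) - mean) ** 2
    return int(np.argmin(regret ** 2 / np.maximum(gain, 1e-12)))

for t in range(500):
    a = ids_action()
    reward = rng.random() < true_means[a]
    alpha[a] += reward
    beta[a] += 1 - reward
print("posterior means:", np.round(alpha / (alpha + beta), 3))
```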
SQuAP-Ont: an Ontology of Software Quality Relational Factors from Financial Systems
Ciancarini, Paolo, Nuzzolese, Andrea Giovanni, Presutti, Valentina, Russo, Daniel
Quality, architecture, and process are considered the keystones of software engineering. ISO defines them in three separate standards. However, their interaction has scarcely been studied so far. The SQuAP model (Software Quality, Architecture, Process) describes twenty-eight main factors that impact software quality in banking systems, and each factor is described as a relation among some characteristics from the three ISO standards. Hence, SQuAP makes such relations emerge rigorously, although informally. In this paper, we present SQuAP-Ont, an OWL ontology designed by following a well-established methodology based on the reuse of Ontology Design Patterns. SQuAP-Ont formalises the relations emerging from SQuAP to represent and reason via Linked Data about software engineering in a three-dimensional model consisting of quality, architecture, and process ISO characteristics. Industrial standards are widely used in software engineering practice: they are built on preexisting literature and provide a common ground for scholars and practitioners to analyze, develop, and assess software systems. As far as software quality is concerned, the reference standard is ISO/IEC 25010:2011 (ISO quality from now on), which defines the quality of software products and their usage (i.e., in-use quality). The ISO quality standard introduces eight characteristics that qualify a software product and five characteristics that assess its quality in use. A characteristic is a parameter for measuring the quality of some aspect of a software system, e.g., reliability, usability, or performance efficiency.
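As a rough illustration of the three-dimensional model (with invented IRIs and property names, not the actual SQuAP-Ont vocabulary), one could encode a factor relating quality, architecture, and process characteristics with rdflib as follows.

```python
from rdflib import Graph, Literal, Namespace
from rdflib.namespace import OWL, RDF, RDFS

# Minimal sketch with invented IRIs and property names (not the actual
# SQuAP-Ont vocabulary): one factor modelled as a relation among a quality,
# an architecture, and a process characteristic, in the spirit of the
# three-dimensional model described above.

EX = Namespace("http://example.org/squap-sketch#")
g = Graph()
g.bind("ex", EX)

# Classes for the three ISO dimensions and for factors.
for cls in (EX.QualityCharacteristic, EX.ArchitectureCharacteristic,
            EX.ProcessCharacteristic, EX.Factor):
    g.add((cls, RDF.type, OWL.Class))

# One illustrative factor relating reliability to architectural and process aspects.
g.add((EX.Reliability, RDF.type, EX.QualityCharacteristic))
g.add((EX.FaultTolerantDesign, RDF.type, EX.ArchitectureCharacteristic))
g.add((EX.RiskManagement, RDF.type, EX.ProcessCharacteristic))

g.add((EX.Factor01, RDF.type, EX.Factor))
g.add((EX.Factor01, RDFS.label, Literal("Example factor impacting reliability")))
g.add((EX.Factor01, EX.relatesQuality, EX.Reliability))
g.add((EX.Factor01, EX.relatesArchitecture, EX.FaultTolerantDesign))
g.add((EX.Factor01, EX.relatesProcess, EX.RiskManagement))

print(g.serialize(format="turtle"))
```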
Worst-Case Regret Bounds for Exploration via Randomized Value Functions
Russo, Daniel
Exploration is one of the central challenges in reinforcement learning (RL). A large theoretical literature treats exploration in simple finite state and action MDPs, showing that it is possible to efficiently learn a near optimal policy through interaction alone [5, 8, 10, 11, 13-16, 24, 25]. Overwhelmingly, this literature focuses on optimistic algorithms, with most algorithms explicitly maintaining uncertainty sets that are likely to contain the true MDP. It has been difficult to adapt these exploration algorithms to the more complex problems investigated in the applied RL literature. Most applied papers seem to generate exploration through $\epsilon$-greedy or Boltzmann exploration. Those simple methods are compatible with practical value function learning algorithms, which use parametric approximations to value functions to generalize across high dimensional state spaces. Unfortunately, such exploration algorithms can fail catastrophically in simple finite state MDPs [See e.g.
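For reference, the two simple exploration rules mentioned above can be sketched as follows; the value estimates, $\epsilon$, and temperature are illustrative.

```python
import numpy as np

# Minimal sketch of the two simple exploration rules mentioned above, applied
# to a vector of estimated action values. Epsilon and the temperature are
# illustrative choices.

rng = np.random.default_rng(5)

def epsilon_greedy(q_values, epsilon=0.1):
    """With probability epsilon pick a uniformly random action, else the greedy one."""
    if rng.random() < epsilon:
        return int(rng.integers(len(q_values)))
    return int(np.argmax(q_values))

def boltzmann(q_values, temperature=1.0):
    """Sample an action with probability proportional to exp(Q / temperature)."""
    prefs = np.asarray(q_values) / temperature
    probs = np.exp(prefs - prefs.max())
    probs /= probs.sum()
    return int(rng.choice(len(q_values), p=probs))

q = np.array([0.1, 0.5, 0.4])
print("epsilon-greedy picks:", [epsilon_greedy(q) for _ in range(10)])
print("Boltzmann picks     :", [boltzmann(q) for _ in range(10)])
```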
Global Optimality Guarantees For Policy Gradient Methods
Bhandari, Jalaj, Russo, Daniel
Policy gradient methods are perhaps the most widely used class of reinforcement learning algorithms. These methods apply to complex, poorly understood control problems by performing stochastic gradient descent over a parameterized class of policies. Unfortunately, even for simple control problems solvable by classical techniques, policy gradient algorithms face non-convex optimization problems and are widely understood to converge only to local minima. This work identifies structural properties -- shared by finite MDPs and several classic control problems -- which guarantee that the policy gradient objective function has no suboptimal local minima despite being non-convex. When these assumptions are relaxed, our work gives conditions under which any local minimum is near-optimal, where the error bound depends on a notion of the expressive capacity of the policy class.
A Note on the Equivalence of Upper Confidence Bounds and Gittins Indices for Patient Agents
Russo, Daniel
There are two separate segments of the multi-armed bandit literature. One formulates a Bayesian multi-armed bandit problem as a Markov decision process and uses tools from dynamic programming to compute or approximate the optimal policy. This literature builds on a beautiful result that shows an optimal policy selects in each period the arm with highest Gittins index [10, 9]. A second segment of the literature focuses on simple heuristic algorithms -- which are often easy to adapt to settings in which exact dynamic programming is computationally intractable -- and studies their performance through simulation and theoretical bounds on their regret [13, 4, 18, 19]. This literature descends from a seminal paper by Lai and Robbins [14] that shows the asymptotic growth rate of expected regret in a frequentist model is minimized by selecting in each period the arm with greatest upper-confidence bound.
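The second kind of index policy can be sketched as follows; the UCB1 form of the confidence bonus used here is one common choice and not necessarily the index analyzed in the works cited above.

```python
import math
import numpy as np

# Minimal sketch of an upper-confidence-bound index policy: select in each
# period the arm with the greatest upper confidence bound. The UCB1 bonus
# used here is one common choice, not necessarily the index in the works
# cited above; the Bernoulli arms are illustrative.

rng = np.random.default_rng(6)
true_means = np.array([0.2, 0.45, 0.6])      # unknown Bernoulli means
k = len(true_means)
counts = np.zeros(k)
sums = np.zeros(k)

T = 2000
for t in range(1, T + 1):
    if t <= k:
        arm = t - 1                          # play each arm once to initialize
    else:
        means = sums / counts
        bonus = np.sqrt(2 * math.log(t) / counts)
        arm = int(np.argmax(means + bonus))  # greatest upper confidence bound
    reward = float(rng.random() < true_means[arm])
    counts[arm] += 1
    sums[arm] += reward

print("pull counts    :", counts.astype(int))
print("empirical means:", np.round(sums / np.maximum(counts, 1), 3))
```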