Cassel, Asaf
Warm-up Free Policy Optimization: Improved Regret in Linear Markov Decision Processes
Cassel, Asaf, Rosenberg, Aviv
Policy Optimization (PO) methods are among the most popular Reinforcement Learning (RL) algorithms in practice. Recently, Sherman et al. [2023a] proposed a PO-based algorithm with rate-optimal regret guarantees under the linear Markov Decision Process (MDP) model. However, their algorithm relies on a costly pure exploration warm-up phase that is hard to implement in practice. This paper eliminates this undesired warm-up phase, replacing it with a simple and efficient contraction mechanism. Our PO algorithm achieves rate-optimal regret with improved dependence on the other parameters of the problem (horizon and function approximation dimension) in two fundamental settings: adversarial losses with full-information feedback and stochastic losses with bandit feedback.
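For context, the core step shared by policy optimization methods in this line of work is an online-mirror-descent (exponential-weights) update of the policy based on estimated loss-to-go functions; schematically, with learning rate $\eta > 0$ and estimates $\widehat{Q}^{k}_h$ of the action-wise loss-to-go of policy $\pi^k$,
$$\pi^{k+1}_h(a \mid s) \;\propto\; \pi^{k}_h(a \mid s)\, \exp\!\bigl(-\eta\, \widehat{Q}^{k}_h(s,a)\bigr).$$
This is only the generic PO template; the contraction mechanism that replaces the warm-up phase concerns how the estimates $\widehat{Q}^{k}_h$ are kept well-behaved in the linear MDP setting and is not depicted here.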
Multi-turn Reinforcement Learning from Preference Human Feedback
Shani, Lior, Rosenberg, Aviv, Cassel, Asaf, Lang, Oran, Calandriello, Daniele, Zipori, Avital, Noga, Hila, Keller, Orgad, Piot, Bilal, Szpektor, Idan, Hassidim, Avinatan, Matias, Yossi, Munos, Rémi
Reinforcement Learning from Human Feedback (RLHF) has become the standard approach for aligning Large Language Models (LLMs) with human preferences, allowing LLMs to demonstrate remarkable abilities in various tasks. Existing methods work by emulating the preferences at the single decision (turn) level, limiting their capabilities in settings that require planning or multi-turn interactions to achieve a long-term goal. In this paper, we address this issue by developing novel methods for Reinforcement Learning (RL) from preference feedback between two full multi-turn conversations. In the tabular setting, we present a novel mirror-descent-based policy optimization algorithm for the general multi-turn preference-based RL problem, and prove its convergence to Nash equilibrium. To evaluate performance, we create a new environment, Education Dialogue, where a teacher agent guides a student in learning a random topic, and show that a deep RL variant of our algorithm outperforms RLHF baselines. Finally, we show that in an environment with explicit rewards, our algorithm recovers the same performance as a reward-based RL baseline, despite relying solely on a weaker preference signal.
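As a toy illustration of the no-regret-to-Nash principle behind such convergence guarantees (a generic single-turn sketch, not the paper's multi-turn algorithm; the preference table `P` and all names below are invented for the example), self-play with a multiplicative-weights / mirror-descent update on a preference-induced zero-sum game drives the time-averaged policy toward a Nash equilibrium:

```python
import numpy as np

rng = np.random.default_rng(0)
n, eta, iters = 5, 0.1, 2000

# Hypothetical preference table: P[i, j] = probability that response i is
# preferred over response j, normalized so that P[i, j] + P[j, i] = 1.
P = rng.uniform(size=(n, n))
P = (P + (1.0 - P.T)) / 2.0
np.fill_diagonal(P, 0.5)

A = P - P.T                      # induced antisymmetric (zero-sum) payoff matrix
pi = np.ones(n) / n              # mixed strategy over the n responses
avg = np.zeros(n)
for _ in range(iters):
    payoff = A @ pi              # payoff of each response against the self-play opponent
    pi = pi * np.exp(eta * payoff)
    pi /= pi.sum()               # multiplicative-weights step on the simplex
    avg += pi
avg /= iters                     # time-averaged strategy approximates a Nash equilibrium
print(np.round(avg, 3))
```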
Near-Optimal Regret in Linear MDPs with Aggregate Bandit Feedback
Cassel, Asaf, Luo, Haipeng, Rosenberg, Aviv, Sotnikov, Dmitry
In many real-world applications, it is hard to provide a reward signal in each step of a Reinforcement Learning (RL) process and more natural to give feedback when an episode ends. To this end, we study the recently proposed model of RL with Aggregate Bandit Feedback (RL-ABF), where the agent only observes the sum of rewards at the end of an episode instead of each reward individually. Prior work studied RL-ABF only in tabular settings, where the number of states is assumed to be small. In this paper, we extend RL-ABF to linear function approximation and develop two efficient algorithms with near-optimal regret guarantees: a value-based optimistic algorithm built on a new randomization technique with an ensemble of Q-functions, and a policy optimization algorithm that uses a novel hedging scheme over the ensemble.
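As a rough illustration of why aggregate feedback still identifies linear rewards (a generic sketch under the linear-reward assumption, not necessarily the estimator used by either algorithm): if $r_h(s,a) = \phi(s,a)^\top \theta$, then the observed episode return is the summed feature vector dotted with $\theta$, so $\theta$ can be recovered by ridge regression of returns on summed features.

```python
import numpy as np

def estimate_rewards(Phi, G, lam=1.0):
    """Ridge estimate of the linear reward parameter from aggregate feedback.

    Phi: (K, d) array, row k is sum_h phi(s_h^k, a_h^k) over episode k.
    G:   (K,) array of observed aggregate returns per episode.
    """
    d = Phi.shape[1]
    A = Phi.T @ Phi + lam * np.eye(d)      # regularized design matrix
    return np.linalg.solve(A, Phi.T @ G)   # ridge regression solution for theta
```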
Efficient Rate Optimal Regret for Adversarial Contextual MDPs Using Online Function Approximation
Levy, Orin, Cohen, Alon, Cassel, Asaf, Mansour, Yishay
We present the OMG-CMDP! algorithm for regret minimization in adversarial Contextual MDPs. The algorithm operates under the minimal assumptions of realizable function class and access to online least squares and log loss regression oracles. Our algorithm is efficient (assuming efficient online regression oracles), simple and robust to approximation errors. It enjoys an $\widetilde{O}(H^{2.5} \sqrt{ T|S||A| ( \mathcal{R}(\mathcal{O}) + H \log(\delta^{-1}) )})$ regret guarantee, with $T$ being the number of episodes, $S$ the state space, $A$ the action space, $H$ the horizon, and $\mathcal{R}(\mathcal{O}) = \mathcal{R}(\mathcal{O}_{\mathrm{sq}}^\mathcal{F}) + \mathcal{R}(\mathcal{O}_{\mathrm{log}}^\mathcal{P})$ the combined regret of the regression oracles used to approximate the context-dependent rewards and dynamics, respectively. To the best of our knowledge, our algorithm is the first efficient, rate-optimal regret minimization algorithm for adversarial CMDPs that operates under the minimal standard assumption of online function approximation.
Eluder-based Regret for Stochastic Contextual MDPs
Levy, Orin, Cassel, Asaf, Cohen, Alon, Mansour, Yishay
We present the E-UC$^3$RL algorithm for regret minimization in Stochastic Contextual Markov Decision Processes (CMDPs). The algorithm operates under the minimal assumptions of realizable function class and access to \emph{offline} least squares and log loss regression oracles. Our algorithm is efficient (assuming efficient offline regression oracles) and enjoys a regret guarantee of $\widetilde{O}(H^3 \sqrt{T |S| |A| d_{\mathrm{E}}(\mathcal{P}) \log (|\mathcal{F}| |\mathcal{P}|/ \delta)})$, with $T$ being the number of episodes, $S$ the state space, $A$ the action space, $H$ the horizon, $\mathcal{P}$ and $\mathcal{F}$ finite function classes used to approximate the context-dependent dynamics and rewards, respectively, and $d_{\mathrm{E}}(\mathcal{P})$ the Eluder dimension of $\mathcal{P}$ w.r.t. the Hellinger distance. To the best of our knowledge, our algorithm is the first efficient and rate-optimal regret minimization algorithm for CMDPs that operates under the general offline function approximation setting. In addition, we extend the Eluder dimension to general bounded metrics, which may be of separate interest.
Efficient Online Linear Control with Stochastic Convex Costs and Unknown Dynamics
Cassel, Asaf, Cohen, Alon, Koren, Tomer
Adaptive control, the task of regulating an unknown linear dynamical system, is a classic control-theoretic problem that has been studied extensively since the 1950s [e.g., 8]. Classic results on adaptive control typically pertain to asymptotic stability and convergence to the optimal controller, while contemporary research focuses on regret minimization and finite-time guarantees. In linear control, both the state and action are vectors in Euclidean spaces. At each time step, the controller views the current state of the system, chooses an action, and the system transitions to the next state. The latter is determined by a linear mapping of the current state and action, perturbed by zero-mean i.i.d. noise.
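In the standard notation of this literature, the dynamics just described read
$$x_{t+1} \;=\; A\, x_t + B\, u_t + w_t,$$
where $x_t$ is the state, $u_t$ the action chosen by the controller, $A$ and $B$ are the (unknown) system matrices, and $w_t$ is zero-mean i.i.d. noise.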
Online Policy Gradient for Model Free Learning of Linear Quadratic Regulators with $\sqrt{T}$ Regret
Cassel, Asaf, Koren, Tomer
We consider the task of learning to control a linear dynamical system under fixed quadratic costs, known as the Linear Quadratic Regulator (LQR) problem. While model-free approaches are often favorable in practice, thus far only model-based methods, which rely on costly system identification, have been shown to achieve regret that scales with the optimal dependence on the time horizon T. We present the first model-free algorithm that achieves similar regret guarantees. Our method relies on an efficient policy gradient scheme, and a novel and tighter analysis of the cost of exploration in policy space in this setting.
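For reference, with a linear state-feedback policy $u_t = -K x_t$ the standard infinite-horizon LQR objective is
$$J(K) \;=\; \limsup_{T \to \infty} \frac{1}{T}\, \mathbb{E}\!\left[\sum_{t=1}^{T} x_t^\top Q\, x_t + u_t^\top R\, u_t\right],$$
and regret compares the learner's cumulative cost over $T$ rounds to that of the best such controller.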
Bandit Linear Control
Cassel, Asaf, Koren, Tomer
Reinforcement learning studies sequential decision making problems where a learning agent repeatedly interacts with an environment and aims to improve her strategy over time based on the received feedback. One of the most fundamental tradeoffs in reinforcement learning theory is the exploration vs. exploitation tradeoff, that arises whenever the learner observes only partial feedback after each of her decisions, thus having to balance between exploring new strategies and exploiting those that are already known to perform well. The most basic and well-studied form of partial feedback is the so-called "bandit" feedback, where the learner only observes the cost of her chosen action on each decision round, while obtaining no information about the performance of other actions. Traditionally, the environment dynamics in reinforcement learning are modeled as a Markov Decision Process (MDP) with a finite number of possible states and actions. The MDP model has been studied and analyzed in numerous different settings and under various assumptions on the transition parameters, the nature of the reward functions, and the feedback model. Recently, a particular focus has been given to continuous state-action MDPs, and in particular, to a specific family of models in classic control where the state transition function is linear.
A General Approach to Multi-Armed Bandits Under Risk Criteria
Cassel, Asaf, Mannor, Shie, Zeevi, Assaf
Different risk-related criteria have received recent interest in learning problems, where typically each case is treated in a customized manner. In this paper we provide a more systematic approach to analyzing such risk criteria within a stochastic multi-armed bandit (MAB) formulation. We identify a set of general conditions that yield a simple characterization of the oracle rule (which serves as the regret benchmark), and facilitate the design of upper confidence bound (UCB) learning policies. The conditions are derived from problem primitives, primarily focusing on the relation between the arm reward distributions and the (risk criteria) performance metric. Among other things, the work highlights some (possibly non-intuitive) subtleties that differentiate various criteria in conjunction with statistical properties of the arms. Our main findings are illustrated on several widely used objectives such as conditional value-at-risk, mean-variance, Sharpe-ratio, and more.
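As a generic illustration of the UCB template for a risk criterion (a sketch only; the paper's policies, concentration terms, and constants differ and depend on the specific criterion), one can replace the empirical mean in a standard index policy with an empirical risk metric such as CVaR and add an exploration bonus:

```python
import numpy as np

def empirical_cvar(samples, alpha=0.1):
    # CVaR at level alpha for rewards: mean of the worst alpha-fraction of samples.
    samples = np.sort(samples)                     # ascending, so worst rewards come first
    k = max(1, int(np.ceil(alpha * len(samples))))
    return samples[:k].mean()

def risk_ucb_choice(history, t, alpha=0.1):
    """history: list of 1-D reward arrays, one per arm; t: current round."""
    scores = []
    for samples in history:
        if len(samples) == 0:
            return len(scores)                     # play each arm once before using indices
        bonus = np.sqrt(2.0 * np.log(t) / len(samples))
        scores.append(empirical_cvar(samples, alpha) + bonus)
    return int(np.argmax(scores))                  # optimistic index rule
```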