Agarwal, Alekh
More Benefits of Being Distributional: Second-Order Bounds for Reinforcement Learning
Wang, Kaiwen, Oertell, Owen, Agarwal, Alekh, Kallus, Nathan, Sun, Wen
In this paper, we prove that Distributional Reinforcement Learning (DistRL), which learns the return distribution, can obtain second-order bounds in both online and offline RL in general settings with function approximation. Second-order bounds are instance-dependent bounds that scale with the variance of return, which we prove are tighter than the previously known small-loss bounds of distributional RL. To the best of our knowledge, our results are the first second-order bounds for low-rank MDPs and for offline RL. When specializing to contextual bandits (one-step RL problem), we show that a distributional learning based optimism algorithm achieves a second-order worst-case regret bound, and a second-order gap dependent bound, simultaneously. We also empirically demonstrate the benefit of DistRL in contextual bandits on real-world datasets. We highlight that our analysis with DistRL is relatively simple, follows the general framework of optimism in the face of uncertainty and does not require weighted regression. Our results suggest that DistRL is a promising framework for obtaining second-order bounds in general RL settings, thus further reinforcing the benefits of DistRL.
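As a rough schematic of the bound types contrasted above (a minimal sketch only; constants, horizon and function-class complexity factors, log terms, and the exact per-setting statements in the paper are omitted):

% Illustrative bound shapes only, not the paper's theorem statements.
\begin{align*}
\text{worst-case:}   &\quad \mathrm{Regret}(K) \lesssim \sqrt{K}, \\
\text{small-loss:}   &\quad \mathrm{Regret}(K) \lesssim \sqrt{V^{\star} K}, \\
\text{second-order:} &\quad \mathrm{Regret}(K) \lesssim \sqrt{\sum_{k=1}^{K} \mathrm{Var}\!\left(Z^{\pi_k}\right)},
\end{align*}
% where $Z^{\pi_k}$ is the random return of the policy played in episode $k$ and
% $V^\star$ is the optimal value. For returns in $[0,1]$, $\mathrm{Var}(Z^{\pi_k}) \le V^{\pi_k} \le V^\star$,
% so the second-order term is never larger than the small-loss one.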
A Minimaximalist Approach to Reinforcement Learning from Human Feedback
Swamy, Gokul, Dann, Christoph, Kidambi, Rahul, Wu, Zhiwei Steven, Agarwal, Alekh
We present Self-Play Preference Optimization (SPO), an algorithm for reinforcement learning from human feedback. Our approach is minimalist in that it does not require training a reward model nor unstable adversarial training and is therefore rather simple to implement. Our approach is maximalist in that it provably handles non-Markovian, intransitive, and stochastic preferences while being robust to the compounding errors that plague offline approaches to sequential prediction. To achieve the preceding qualities, we build upon the concept of a Minimax Winner (MW), a notion of preference aggregation from the social choice theory literature that frames learning from preferences as a zero-sum game between two policies. By leveraging the symmetry of this game, we prove that rather than using the traditional technique of dueling two policies to compute the MW, we can simply have a single agent play against itself while maintaining strong convergence guarantees. Practically, this corresponds to sampling multiple trajectories from a policy, asking a rater or preference model to compare them, and then using the proportion of wins as the reward for a particular trajectory. We demonstrate that on a suite of continuous control tasks, we are able to learn significantly more efficiently than reward-model based approaches while maintaining robustness to the intransitive and stochastic preferences that frequently occur in practice when aggregating human judgments.
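A minimal sketch of the win-rate reward described above, assuming a hypothetical pairwise preference oracle prefers(a, b) standing in for a human rater or preference model; this illustrates the idea rather than the authors' implementation.

import random

def prefers(traj_a, traj_b):
    # Hypothetical preference oracle: True if traj_a is preferred over traj_b.
    # In practice this would query a rater or a learned preference model.
    return sum(traj_a) > sum(traj_b)  # placeholder: prefer higher summed score

def win_rate_rewards(trajectories):
    # Reward each trajectory by the proportion of pairwise comparisons it wins
    # against the other trajectories sampled from the same policy.
    rewards = []
    for i, t in enumerate(trajectories):
        others = [o for j, o in enumerate(trajectories) if j != i]
        wins = sum(prefers(t, o) for o in others)
        rewards.append(wins / max(len(others), 1))
    return rewards

# Example: four sampled "trajectories" represented as lists of step scores.
batch = [[random.random() for _ in range(5)] for _ in range(4)]
print(win_rate_rewards(batch))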
Theoretical guarantees on the best-of-n alignment policy
Beirami, Ahmad, Agarwal, Alekh, Berant, Jonathan, D'Amour, Alexander, Eisenstein, Jacob, Nagpal, Chirag, Suresh, Ananda Theertha
A simple and effective method for the alignment of generative models is the best-of-$n$ policy, where $n$ samples are drawn from a base policy, ranked based on a reward function, and the highest ranking one is selected. A commonly used analytical expression in the literature claims that the KL divergence between the best-of-$n$ policy and the base policy is equal to $\log (n) - (n-1)/n.$ We disprove the validity of this claim, and show that it is an upper bound on the actual KL divergence. We also explore the tightness of this upper bound in different regimes. Finally, we propose a new estimator for the KL divergence and empirically show, through a few examples, that it provides a tight approximation.
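A minimal sketch checking the claim numerically on a toy discrete base policy with distinct rewards; the outcomes, probabilities, and exact best-of-$n$ computation below are illustrative choices, not the paper's setup or its proposed estimator.

import math

# Toy discrete base policy: outcome probabilities and distinct rewards.
probs   = [0.5, 0.3, 0.2]
rewards = [0.0, 1.0, 2.0]
n = 4  # number of samples in best-of-n

# Exact best-of-n distribution: the selected outcome is the maximum-reward draw,
# so P_bon(x) = F(x)^n - F(x^-)^n under the reward ordering (no ties here).
order = sorted(range(len(probs)), key=lambda i: rewards[i])
bon = [0.0] * len(probs)
cdf_prev = 0.0
for i in order:
    cdf_cur = cdf_prev + probs[i]
    bon[i] = cdf_cur ** n - cdf_prev ** n
    cdf_prev = cdf_cur

kl = sum(b * math.log(b / p) for b, p in zip(bon, probs) if b > 0)
claimed = math.log(n) - (n - 1) / n  # the commonly cited expression (an upper bound)
print(f"exact KL = {kl:.4f},  log(n) - (n-1)/n = {claimed:.4f}")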
Helping or Herding? Reward Model Ensembles Mitigate but do not Eliminate Reward Hacking
Eisenstein, Jacob, Nagpal, Chirag, Agarwal, Alekh, Beirami, Ahmad, D'Amour, Alex, Dvijotham, DJ, Fisch, Adam, Heller, Katherine, Pfohl, Stephen, Ramachandran, Deepak, Shaw, Peter, Berant, Jonathan
Reward models play a key role in aligning language model applications towards human preferences. However, this setup creates an incentive for the language model to exploit errors in the reward model to achieve high estimated reward, a phenomenon often termed \emph{reward hacking}. A natural mitigation is to train an ensemble of reward models, aggregating over model outputs to obtain a more robust reward estimate. We explore the application of reward ensembles to alignment at both training time (through reinforcement learning) and inference time (through reranking). First, we show that reward models are \emph{underspecified}: reward models that perform similarly in-distribution can yield very different rewards when used in alignment, due to distribution shift. Second, underspecification results in overoptimization, where alignment to one reward model does not improve reward as measured by another reward model trained on the same data. Third, overoptimization is mitigated by the use of reward ensembles, and ensembles that vary by their \emph{pretraining} seeds lead to better generalization than ensembles that differ only by their \emph{fine-tuning} seeds, with both outperforming individual reward models. However, even pretrain reward ensembles do not eliminate reward hacking: we show several qualitative reward hacking phenomena that are not mitigated by ensembling because all reward models in the ensemble exhibit similar error patterns.
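A minimal sketch of ensemble aggregation for inference-time reranking, assuming generic reward-model callables; the mean and min (worst-case) rules shown are common aggregation choices used for illustration, not necessarily the paper's exact configuration.

def ensemble_reward(reward_models, prompt, response, how="mean"):
    # Aggregate scores from several reward models into one more robust estimate.
    scores = [rm(prompt, response) for rm in reward_models]
    if how == "mean":
        return sum(scores) / len(scores)
    if how == "min":        # conservative: guard against any single model
        return min(scores)  # being exploited (reward hacking)
    raise ValueError(how)

def rerank(reward_models, prompt, candidates, how="mean"):
    # Best-of-n style reranking with the ensemble reward.
    return max(candidates,
               key=lambda r: ensemble_reward(reward_models, prompt, r, how))

# Toy stand-ins for reward models trained from different seeds.
rms = [lambda p, r: len(r) * 0.1, lambda p, r: r.count("e") * 0.5]
print(rerank(rms, "prompt", ["reply one", "a much longer reply"], how="min"))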
Efficient End-to-End Visual Document Understanding with Rationale Distillation
Zhu, Wang, Agarwal, Alekh, Joshi, Mandar, Jia, Robin, Thomason, Jesse, Toutanova, Kristina
Understanding visually situated language requires recognizing text and visual elements, and interpreting complex layouts. State-of-the-art methods commonly use specialized pre-processing tools, such as optical character recognition (OCR) systems, that map document image inputs to extracted information in the space of textual tokens, and sometimes also employ large language models (LLMs) to reason in text token space. However, the gains from external tools and LLMs come at the cost of increased computational and engineering complexity. In this paper, we ask whether small pretrained image-to-text models can learn selective text or layout recognition and reasoning as an intermediate inference step in an end-to-end model for pixel-level visual language understanding. We incorporate the outputs of such OCR tools, LLMs, and larger multimodal models as intermediate ``rationales'' on training data, and train a small student model to predict both rationales and answers for input questions based on those training examples. A student model based on Pix2Struct (282M parameters) achieves consistent improvements on three visual document understanding benchmarks representing infographics, scanned documents, and figures, with improvements of more than 4\% absolute over a comparable Pix2Struct model that predicts answers directly.
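A minimal sketch of how rationale-then-answer training targets could be assembled, assuming the rationale text has already been produced by OCR/LLM tooling on the training data; the helper names and target format are hypothetical, not the exact Pix2Struct setup.

def build_target(rationale, answer, sep=" <ANSWER> "):
    # The student learns to emit the intermediate rationale (e.g., selected OCR
    # text or layout) followed by the final answer, as one output sequence.
    return rationale.strip() + sep + answer.strip()

def build_example(image_path, question, rationale, answer):
    # Input: document image plus the question for the image-to-text model.
    # Output: rationale-then-answer target string used for distillation.
    return {"image": image_path,
            "input_text": question,
            "target_text": build_target(rationale, answer)}

ex = build_example("infographic.png", "What year had the highest sales?",
                   "Relevant OCR: 'Sales 2019: 4.2M, 2020: 5.1M'", "2020")
print(ex["target_text"])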
A Mechanism for Sample-Efficient In-Context Learning for Sparse Retrieval Tasks
Abernethy, Jacob, Agarwal, Alekh, Marinov, Teodor V., Warmuth, Manfred K.
We study the phenomenon of \textit{in-context learning} (ICL) exhibited by large language models, where they can adapt to a new learning task, given a handful of labeled examples, without any explicit parameter optimization. Our goal is to explain how a pre-trained transformer model is able to perform ICL under reasonable assumptions on the pre-training process and the downstream tasks. We posit a mechanism whereby a transformer can achieve the following: (a) receive an i.i.d. sequence of examples which have been converted into a prompt using potentially-ambiguous delimiters, (b) correctly segment the prompt into examples and labels, (c) infer from the data a \textit{sparse linear regressor} hypothesis, and finally (d) apply this hypothesis on the given test example and return a predicted label. We establish that this entire procedure is implementable using the transformer mechanism, and we give sample complexity guarantees for this learning framework. Our empirical findings validate the challenge of segmentation, and we show a correspondence between our posited mechanisms and observed attention maps for step (c).
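A minimal sketch of the posited mechanism on synthetic data, using an explicit delimiter and scikit-learn's Lasso as a stand-in sparse regressor; the prompt encoding and solver are illustrative assumptions, not the transformer construction analyzed in the paper.

import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
d, k, n = 20, 3, 15                        # dimension, sparsity, number of examples
w = np.zeros(d)
w[rng.choice(d, k, replace=False)] = rng.normal(size=k)

# (a) i.i.d. examples serialized into a prompt with a delimiter token "|".
X = rng.normal(size=(n, d))
y = X @ w
prompt = " | ".join(" ".join(f"{v:.3f}" for v in np.append(x, t))
                    for x, t in zip(X, y))

# (b) segment the prompt back into (example, label) pairs.
pairs = [np.array(seg.split(), dtype=float) for seg in prompt.split("|")]
X_seg = np.stack([p[:-1] for p in pairs])
y_seg = np.array([p[-1] for p in pairs])

# (c) infer a sparse linear hypothesis from the segmented data.
model = Lasso(alpha=0.01).fit(X_seg, y_seg)

# (d) apply the hypothesis to a fresh test example.
x_test = rng.normal(size=d)
print("prediction:", model.predict(x_test[None])[0], " truth:", x_test @ w)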
An Empirical Evaluation of Federated Contextual Bandit Algorithms
Agarwal, Alekh, McMahan, H. Brendan, Xu, Zheng
As the adoption of federated learning increases for learning from sensitive data local to user devices, it is natural to ask if the learning can be done using implicit signals generated as users interact with the applications of interest, rather than requiring access to explicit labels which can be difficult to acquire in many tasks. We approach such problems with the framework of federated contextual bandits, and develop variants of prominent contextual bandit algorithms from the centralized setting for the federated setting. We carefully evaluate these algorithms in a range of scenarios simulated using publicly available datasets. Our simulations model typical setups encountered in the real world, such as various misalignments between an initial pre-trained model and the subsequent user interactions due to non-stationarity in the data and/or heterogeneity across clients. Our experiments reveal the surprising effectiveness of the simple and commonly used softmax heuristic in balancing the well-known exploration-exploitation tradeoff across the breadth of our settings.
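A minimal sketch of the softmax heuristic referenced above, which turns per-action scores into an exploration distribution; the score values and temperature are illustrative assumptions.

import numpy as np

def softmax_policy(scores, temperature=0.1):
    # Convert model scores for each action into sampling probabilities.
    # Lower temperatures act more greedily; higher temperatures explore more.
    z = np.asarray(scores, dtype=float) / temperature
    z -= z.max()                      # numerical stability
    p = np.exp(z)
    return p / p.sum()

def choose_action(scores, rng, temperature=0.1):
    p = softmax_policy(scores, temperature)
    return rng.choice(len(p), p=p)

rng = np.random.default_rng(0)
scores = [0.2, 0.5, 0.1]              # e.g., predicted rewards from a local model
print(choose_action(scores, rng))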
Provable Benefits of Representational Transfer in Reinforcement Learning
Agarwal, Alekh, Song, Yuda, Sun, Wen, Wang, Kaiwen, Wang, Mengdi, Zhang, Xuezhou
We study the problem of representational transfer in RL, where an agent first pretrains in a number of source tasks to discover a shared representation, which is subsequently used to learn a good policy in a \emph{target task}. We propose a new notion of task relatedness between source and target tasks, and develop a novel approach for representational transfer under this assumption. Concretely, we show that given generative access to source tasks, we can discover a representation, using which subsequent linear RL techniques quickly converge to a near-optimal policy in the target task. The sample complexity is close to knowing the ground truth features in the target task, and comparable to prior representation learning results in the source tasks. We complement our positive results with lower bounds without generative access, and validate our findings with empirical evaluation on rich observation MDPs that require deep exploration. In our experiments, we observe a speed up in learning in the target by pre-training, and also validate the need for generative access in source tasks.
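A minimal sketch of the two-phase recipe: a representation phi (standing in for the one discovered from the source tasks) is frozen, and a linear head is fit on target-task data; the ridge-regression step below illustrates "linear techniques on fixed features" and is not the paper's algorithm.

import numpy as np

def fit_linear_values(phi, transitions, lam=1e-3):
    # Ridge regression of observed returns onto the frozen features phi(s, a):
    # in the target task, only a linear head is learned on top of the
    # transferred representation.
    X = np.stack([phi(s, a) for (s, a, g) in transitions])
    y = np.array([g for (_, _, g) in transitions])
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

# Toy frozen representation (stand-in for the pretrained one) and fake data.
rng = np.random.default_rng(0)
phi = lambda s, a: np.concatenate([s, [a, 1.0]])
data = [(rng.normal(size=4), rng.integers(2), rng.normal()) for _ in range(50)]
w = fit_linear_values(phi, data)
print("linear head:", w.round(2))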
Leveraging User-Triggered Supervision in Contextual Bandits
Agarwal, Alekh, Gentile, Claudio, Marinov, Teodor V.
We study contextual bandit (CB) problems, where the user can sometimes respond with the best action in a given context. Such an interaction arises, for example, in text prediction or autocompletion settings, where a poor suggestion is simply ignored and the user enters the desired text instead. Crucially, this extra feedback is user-triggered on only a subset of the contexts. How should we leverage such an extra modality of feedback along with the typical reward signal in CBs? While prior works have developed hybrid models such as learning with feedback graphs (e.g., Mannor & Shamir, 2011; Caron et al., 2012; Alon et al., 2017) to capture a continuum between supervised and CB learning, such settings are not a natural fit here. A key challenge in the feedback structure is that the extra supervised signal is only available on a subset of the contexts, which are chosen by the user as some unknown function of the algorithm's recommended actions. We develop a new framework to leverage such signals.
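A minimal sketch of the interaction protocol: a linear CB learner receives bandit feedback every round and, on some rounds, an additionally revealed best action; triggering that feedback at random here is a simplification, since in the paper it can depend on the algorithm's recommendations.

import numpy as np

rng = np.random.default_rng(0)
n_actions, d, T = 3, 5, 500
theta = rng.normal(size=(n_actions, d))            # unknown per-action parameters
A = [np.eye(d) for _ in range(n_actions)]          # ridge statistics per action
b = [np.zeros(d) for _ in range(n_actions)]

def update(a, x, target):
    # Standard least-squares statistics update for action a.
    A[a] += np.outer(x, x)
    b[a] += target * x

for t in range(T):
    x = rng.normal(size=d)                         # context
    est = np.stack([np.linalg.solve(A[i], b[i]) for i in range(n_actions)])
    a = int(np.argmax(est @ x))                    # act greedily on current estimates
    update(a, x, theta[a] @ x + rng.normal(scale=0.1))   # usual bandit feedback

    # User-triggered supervision: on some rounds the user reveals the best action
    # (e.g., by ignoring the suggestion and typing the desired text instead).
    if rng.random() < 0.2:
        best = int(np.argmax(theta @ x))
        update(best, x, theta[best] @ x)           # extra full-information signal

print("parameter estimation error per action:",
      [round(float(np.linalg.norm(np.linalg.solve(A[i], b[i]) - theta[i])), 2)
       for i in range(n_actions)])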
Learning in POMDPs is Sample-Efficient with Hindsight Observability
Lee, Jonathan N., Agarwal, Alekh, Dann, Christoph, Zhang, Tong
POMDPs capture a broad class of decision making problems, but hardness results suggest that learning is intractable even in simple settings due to the inherent partial observability. However, in many realistic problems, more information is either revealed or can be computed during some point of the learning process. Motivated by diverse applications ranging from robotics to data center scheduling, we formulate a Hindsight Observable Markov Decision Process (HOMDP) as a POMDP where the latent states are revealed to the learner in hindsight and only during training. We introduce new algorithms for the tabular and function approximation settings that are provably sample-efficient with hindsight observability, even in POMDPs that would otherwise be statistically intractable. We give a lower bound showing that the tabular algorithm is optimal in its dependence on latent state and observation cardinalities.
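A minimal sketch of the hindsight-observable interaction pattern, with a hypothetical toy environment: the policy acts on observations only, and the latent states are revealed after the episode so they can be used for training.

import random

class ToyHindsightEnv:
    # Hypothetical POMDP: the latent state is an integer counter, the observation
    # is a noisy reading of it; latent states are logged and revealed only in
    # hindsight, after the episode ends.
    def reset(self):
        self._latents, self._state = [], 0
        return self._observe()
    def _observe(self):
        return self._state + random.choice([-1, 0, 1])    # noisy observation
    def step(self, action):
        self._latents.append(self._state)
        self._state += action
        return self._observe(), float(self._state == 3)   # reward for reaching 3
    def reveal_latent_states(self):
        return list(self._latents)                         # hindsight observability

def run_episode(env, policy, horizon=5):
    obs, traj = env.reset(), []
    for _ in range(horizon):
        a = policy(obs)                                    # acts on observations only
        next_obs, r = env.step(a)
        traj.append({"obs": obs, "action": a, "reward": r})
        obs = next_obs
    # After the episode, attach the revealed latent states for training use only.
    for step, z in zip(traj, env.reveal_latent_states()):
        step["latent"] = z
    return traj

print(run_episode(ToyHindsightEnv(), policy=lambda obs: 1))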