Goto

Collaborating Authors

 Tewari, Ambuj


Smoothed Online Classification can be Harder than Batch Classification

arXiv.org Artificial Intelligence

We study online classification under smoothed adversaries. In this setting, at each time point, the adversary draws an example from a distribution that has a bounded density with respect to a fixed base measure, which is known a priori to the learner. For binary classification and scalar-valued regression, previous works \citep{haghtalab2020smoothed, block2022smoothed} have shown that smoothed online learning is as easy as learning in the iid batch setting under the PAC model. However, we show that smoothed online classification can be harder than iid batch classification when the label space is unbounded. In particular, we construct a hypothesis class that is learnable in the iid batch setting under the PAC model but is not learnable under the smoothed online model. Finally, we identify a condition that ensures that PAC learnability of a hypothesis class is sufficient for its smoothed online learnability.
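
For context, the bounded-density condition referenced above is usually formalized as follows; this is the standard notion of smoothness from the smoothed online learning literature, paraphrased rather than quoted from the paper.

% A distribution D over the example space X is sigma-smooth with respect to a
% base measure mu if its density is bounded by 1/sigma, i.e.
\[
  \mathcal{D}(A) \;\le\; \frac{\mu(A)}{\sigma} \qquad \text{for every measurable } A \subseteq \mathcal{X},
\]
% and a smoothed adversary must draw each example x_t from some sigma-smooth distribution D_t.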


Provably Efficient Reinforcement Learning for Infinite-Horizon Average-Reward Linear MDPs

arXiv.org Machine Learning

We resolve the open problem of designing a computationally efficient algorithm for infinite-horizon average-reward linear Markov Decision Processes (MDPs) with $\widetilde{O}(\sqrt{T})$ regret. Previous approaches with $\widetilde{O}(\sqrt{T})$ regret either suffer from computational inefficiency or require strong assumptions on dynamics, such as ergodicity. In this paper, we approximate the average-reward setting by the discounted setting and show that running an optimistic value iteration-based algorithm for learning the discounted setting achieves $\widetilde{O}(\sqrt{T})$ regret when the discount factor $\gamma$ is tuned appropriately. The challenge in the approximation approach is to obtain a regret bound with a sharp dependency on the effective horizon $1 / (1 - \gamma)$. We use a computationally efficient clipping operator that constrains the span of the optimistic state value function estimate to achieve a sharp regret bound in terms of the effective horizon, which leads to $\widetilde{O}(\sqrt{T})$ regret.
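
The span-clipping idea can be pictured with a minimal sketch. It assumes (as an illustration, not the paper's exact operator) that clipping truncates the optimistic state value estimate at its minimum plus a span bound B.

import numpy as np

def clip_span(v, span_bound):
    # Hypothetical span-clipping step: truncate the optimistic value estimate
    # so that max(v) - min(v) <= span_bound. The paper's operator may differ
    # in detail; this only illustrates constraining the span.
    return np.minimum(v, v.min() + span_bound)

# A value estimate whose span (here 5.0) exceeds the bound gets truncated.
v = np.array([0.0, 2.0, 5.0])
print(clip_span(v, span_bound=3.0))  # -> [0. 2. 3.]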


Online Classification with Predictions

arXiv.org Machine Learning

We study online classification when the learner has access to predictions about future examples. We design an online learner whose expected regret is never worse than the worst-case regret, gracefully improves with the quality of the predictions, and can be significantly better than the worst-case regret when the predictions of future examples are accurate. As a corollary, we show that if the learner is always guaranteed to observe data where future examples are easily predictable, then online learning can be as easy as transductive online learning. Our results complement recent work in online algorithms with predictions and smoothed online classification, which go beyond a worst-case analysis by using machine-learned predictions and distributional assumptions, respectively.
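
One way to picture the "never worse than worst-case, better when predictions are good" guarantee is to aggregate a prediction-following learner and a worst-case learner with multiplicative weights. The sketch below illustrates that generic idea only; it is not the algorithm from the paper.

import numpy as np

def aggregate_two_learners(losses_pred, losses_robust, eta=0.5):
    # Multiplicative-weights aggregation over two base learners: one that
    # trusts the machine-learned predictions and one with a worst-case
    # guarantee. Illustrative sketch, not the paper's algorithm.
    w = np.ones(2)
    total = 0.0
    for lp, lr in zip(losses_pred, losses_robust):
        p = w / w.sum()                          # distribution over the two learners
        total += p[0] * lp + p[1] * lr           # expected loss this round
        w *= np.exp(-eta * np.array([lp, lr]))   # exponential weight update
    return total

The standard analysis of this aggregation bounds the total expected loss by that of the better base learner plus a small additive overhead, which is the flavor of guarantee described above.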


Sample Efficient Myopic Exploration Through Multitask Reinforcement Learning with Diverse Tasks

arXiv.org Machine Learning

Multitask Reinforcement Learning (MTRL) approaches have gained increasing attention for their wide applications in many important Reinforcement Learning (RL) tasks. However, while recent advancements in MTRL theory have focused on improved statistical efficiency obtained by assuming a shared structure across tasks, exploration, a crucial aspect of RL, has been largely overlooked. This paper addresses this gap by showing that when an agent is trained on a sufficiently diverse set of tasks, a generic policy-sharing algorithm with a myopic exploration design such as $\epsilon$-greedy, which is inefficient in general, can be sample-efficient for MTRL. To the best of our knowledge, this is the first theoretical demonstration of the "exploration benefits" of MTRL. It may also shed light on the enigmatic success of the wide use of myopic exploration in practice. To validate the role of diversity, we conduct experiments on synthetic robotic control environments, where the diverse task set aligns with the task selection by automatic curriculum learning, which is empirically shown to improve sample efficiency.
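
As a concrete reminder of what "myopic exploration" means here, the sketch below shows plain $\epsilon$-greedy action selection; in a policy-sharing scheme the same value estimates (or the model behind them) would be reused across tasks. This illustrates the exploration rule, not the paper's algorithm.

import numpy as np

def eps_greedy_action(q_values, eps, rng=None):
    # Myopic epsilon-greedy exploration: with probability eps take a uniformly
    # random action, otherwise act greedily on the current value estimates.
    rng = rng or np.random.default_rng()
    if rng.random() < eps:
        return int(rng.integers(len(q_values)))
    return int(np.argmax(q_values))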


Optimal Thresholding Linear Bandit

arXiv.org Artificial Intelligence

The Thresholding Bandit Problem (TBP) (Andrea Locatelli, 2016; Kano et al., 2019) is a specific combinatorial pure-exploration problem in which the goal is to identify all arms whose mean rewards exceed a given threshold. Kano et al. (2019) emphasize that in certain contexts, such as personalized recommendations, it is more natural to recover every arm above the threshold than to pursue only the single best arm. In this paper, the arms' mean rewards follow a linear model with unknown parameters. We prove an instance-specific lower bound on the expected sample complexity of any correct algorithm.
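
In the standard formulation of the linear variant (paraphrased here, not quoted from the paper), arm means are linear in an unknown parameter and the goal is to output the set of arms whose means exceed the threshold:

\[
  \mu_i = \langle \theta^{\ast}, x_i \rangle, \qquad \widehat{S} = \{\, i : \mu_i \ge \tau \,\},
\]
% where x_i is the known feature vector of arm i, theta^* is the unknown
% parameter, and tau is the given threshold.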


The Complexity of Sequential Prediction in Dynamical Systems

arXiv.org Artificial Intelligence

A discrete-time dynamical system is a mathematical model that describes the evolution of a system over discrete time steps. Formally, a discrete-time dynamical system is a tuple (N, X, f), where N is the set of natural numbers that denote the timesteps, X is a non-empty set called the state space, and f: X → X is a deterministic map that describes the evolution of the state. Dynamical systems have been widely used in practice due to their ability to accurately model natural phenomena. For instance, boolean networks are an important class of discrete-time, discrete-space dynamical systems with widespread applicability to genetic modeling [Kauffman, 1969, Shmulevich et al., 2002].
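
A toy example of such a system, in the spirit of the boolean networks mentioned above (the concrete update rule is made up purely for illustration):

def step(state, f):
    # One step of a discrete-time dynamical system: x_{t+1} = f(x_t).
    return f(state)

def f(x):
    # A made-up boolean network on X = {0, 1}^3: each node updates as a
    # boolean function of the current state.
    a, b, c = x
    return (b and c, a or c, not a)

x = (True, False, True)
trajectory = [x]
for _ in range(4):
    x = step(x, f)
    trajectory.append(x)
print(trajectory)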


A Primal-Dual Algorithm for Offline Constrained Reinforcement Learning with Low-Rank MDPs

arXiv.org Artificial Intelligence

Offline reinforcement learning (RL) aims to learn a policy that maximizes the expected cumulative reward using a pre-collected dataset. Offline RL with low-rank MDPs or general function approximation has been widely studied recently, but existing algorithms with sample complexity $O(\epsilon^{-2})$ for finding an $\epsilon$-optimal policy either require uniform data coverage assumptions or are computationally inefficient. In this paper, we propose a primal-dual algorithm for offline RL with low-rank MDPs in the discounted infinite-horizon setting. Our algorithm is the first computationally efficient algorithm in this setting that achieves sample complexity of $O(\epsilon^{-2})$ under a partial data coverage assumption. This improves upon recent work that requires $O(\epsilon^{-4})$ samples. Moreover, our algorithm extends the previous work to the offline constrained RL setting by supporting constraints on additional reward signals.
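
The saddle-point problem underlying a primal-dual approach of this kind can be written in the standard form below (a generic statement of constrained RL, not necessarily the paper's exact objective):

\[
  \max_{\pi} \; \min_{\lambda \ge 0} \; V_r(\pi) + \lambda^{\top}\bigl(V_g(\pi) - b\bigr),
\]
% where V_r(pi) is the expected discounted reward, V_g(pi) collects the expected
% discounted values of the additional reward signals, and b is the vector of
% thresholds for the constraints V_g(pi) >= b.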


A Framework for Partially Observed Reward-States in RLHF

arXiv.org Artificial Intelligence

The study of reinforcement learning from human feedback (RLHF) has gained prominence in recent years due to its role in the development of LLMs. Neuroscience research shows that human responses to stimuli depend on partially observed "internal states." Unfortunately, current models of RLHF do not take this into consideration. Moreover, most RLHF models do not account for intermediate feedback, which is gaining importance in empirical work and can help improve both sample complexity and alignment. To address these limitations, we model RLHF as reinforcement learning with partially observed reward-states (PORRL). We show reductions from the two dominant forms of human feedback in RLHF, cardinal and dueling feedback, to PORRL. For cardinal feedback, we develop generic statistically efficient algorithms and instantiate them to present POR-UCRL and POR-UCBVI. For dueling feedback, we show that a naive reduction to cardinal feedback fails to achieve sublinear dueling regret. We then present the first explicit reduction that converts guarantees for cardinal regret to dueling regret. We show that our models and guarantees in both settings generalize and extend existing ones. Finally, we identify a recursive structure on our model that could improve the statistical and computational tractability of PORRL, giving examples from past work on RLHF as well as learning perfect reward machines, which PORRL subsumes.


Quantum Learning Theory Beyond Batch Binary Classification

arXiv.org Machine Learning

Arunachalam and de Wolf (2018) showed that the sample complexity of quantum batch learning of boolean functions, in the realizable and agnostic settings, has the same form and order as the corresponding classical sample complexities. In this paper, we extend this ostensibly surprising message to batch multiclass learning, online boolean learning, and online multiclass learning. For our online learning results, we first consider an adaptive adversary variant of the classical model of Dawid and Tewari (2022). Then, we introduce the first (to the best of our knowledge) model of online learning with quantum examples.


Online Learning with Set-Valued Feedback

arXiv.org Machine Learning

We study a variant of online multiclass classification where the learner predicts a single label but receives a \textit{set of labels} as feedback. In this model, the learner is penalized for not outputting a label contained in the revealed set. We show that, unlike online multiclass learning with single-label feedback, deterministic and randomized online learnability are \textit{not equivalent} in the realizable setting under set-valued feedback. In addition, we show that deterministic and randomized realizable learnability are equivalent if the Helly number of the collection of sets that can be revealed as feedback is finite. In light of this separation, we give two new combinatorial dimensions, named the Set Littlestone dimension and the Measure Shattering dimension, whose finiteness characterizes deterministic and randomized realizable learnability respectively. Additionally, these dimensions lower- and upper-bound the deterministic and randomized minimax regret in the realizable setting. Going beyond the realizable setting, we prove that the Measure Shattering dimension continues to characterize learnability and quantify minimax regret in the agnostic setting. Finally, we use our results to establish bounds on the minimax regret for three practical learning settings: online multilabel ranking, online multilabel classification, and real-valued prediction with interval-valued response.
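
For reference, the per-round loss in this feedback model can be stated as follows (standard form, paraphrased rather than quoted from the paper): the learner predicts a label and is penalized exactly when that prediction falls outside the revealed set,

\[
  \ell(\hat{y}_t, S_t) = \mathbb{1}\{\hat{y}_t \notin S_t\},
\]
% so single-label feedback is the special case |S_t| = 1, while multilabel and
% interval-valued feedback correspond to larger revealed sets.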