Saha, Aadirupa
On the Vulnerability of Fairness Constrained Learning to Malicious Noise
Blum, Avrim, Okoroafor, Princewill, Saha, Aadirupa, Stangl, Kevin
We consider the vulnerability of fairness-constrained learning to small amounts of malicious noise in the training data. Konstantinov and Lampert (2021) initiated the study of this question and presented negative results showing there exist data distributions where, for several fairness constraints, any proper learner will exhibit high vulnerability when group sizes are imbalanced. Here, we present a more optimistic view, showing that if we allow randomized classifiers, then the landscape is much more nuanced. For example, for Demographic Parity we show we can incur only a $\Theta(\alpha)$ loss in accuracy, where $\alpha$ is the malicious noise rate, matching the best possible even without fairness constraints. For Equal Opportunity, we show we can incur an $O(\sqrt{\alpha})$ loss, and give a matching $\Omega(\sqrt{\alpha})$ lower bound. In contrast, Konstantinov and Lampert (2021) showed that for proper learners the loss in accuracy for both notions is $\Omega(1)$. The key technical novelty of our work is how randomization can bypass simple "tricks" an adversary can use to amplify its power. We also consider additional fairness notions including Equalized Odds and Calibration. For these fairness notions, the loss in accuracy clusters into three natural regimes: $O(\alpha)$, $O(\sqrt{\alpha})$, and $O(1)$. These results provide a more fine-grained view of the sensitivity of fairness-constrained learning to adversarial noise in training data.
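To make the role of randomized classifiers concrete, here is a minimal sketch that measures the Demographic Parity gap of a randomized classifier, represented by its probability of outputting the positive label on each input. The classifier, data, and group labels below are illustrative assumptions, not the paper's construction.

```python
import numpy as np

def demographic_parity_gap(predict_proba, X, groups):
    """Demographic parity gap of a randomized classifier.

    predict_proba(x) returns the probability that the (randomized)
    classifier outputs the positive label on input x.
    """
    rates = {}
    for g in np.unique(groups):
        mask = (groups == g)
        rates[g] = np.mean([predict_proba(x) for x in X[mask]])
    vals = list(rates.values())
    return max(vals) - min(vals)

# Toy usage (assumed data): a randomized classifier mixing two thresholds.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 1))
groups = rng.integers(0, 2, size=200)
predict_proba = lambda x: 0.5 * float(x[0] > 0.0) + 0.5 * float(x[0] > 1.0)
print(demographic_parity_gap(predict_proba, X, groups))
```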
Dueling RL: Reinforcement Learning with Trajectory Preferences
Pacchiano, Aldo, Saha, Aadirupa, Lee, Jonathan
We consider the problem of preference-based reinforcement learning (PbRL), where, unlike traditional reinforcement learning, an agent receives feedback only in terms of a 1-bit (0/1) preference over a trajectory pair instead of absolute rewards for them. The success of the traditional RL framework crucially relies on the underlying agent-reward model, which, however, depends on how accurately a system designer can express an appropriate reward function, often a non-trivial task. The main novelty of our framework is the ability to learn from preference-based trajectory feedback, which eliminates the need to hand-craft numeric reward models. This paper sets up a formal framework for the PbRL problem with non-Markovian rewards, where the trajectory preferences are encoded by a generalized linear model of dimension $d$. Assuming the transition model is known, we then propose an algorithm with an almost optimal regret guarantee of $\tilde {\mathcal{O}}\left( SH d \log (T / \delta) \sqrt{T} \right)$. We further extend the above algorithm to the case of unknown transition dynamics, and provide an algorithm with a near-optimal regret guarantee of $\widetilde{\mathcal{O}}((\sqrt{d} + H^2 + |\mathcal{S}|)\sqrt{dT} +\sqrt{|\mathcal{S}||\mathcal{A}|TH} )$. To the best of our knowledge, our work is one of the first to give tight regret guarantees for preference-based RL problems with trajectory preferences.
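The trajectory-preference feedback can be sketched in a few lines: a 1-bit comparison generated by a logistic link (one instance of a generalized linear model) applied to a $d$-dimensional feature difference between two trajectories. The feature map and parameters below are assumptions for illustration, not the paper's exact setup.

```python
import numpy as np

rng = np.random.default_rng(1)
d = 5                       # dimension of the trajectory feature map (assumed)
theta = rng.normal(size=d)  # unknown preference parameter (assumed)

def phi(trajectory):
    """Assumed d-dimensional feature map of a trajectory
    (here: mean of the per-step features along the trajectory)."""
    return np.mean(trajectory, axis=0)

def preference_feedback(traj_a, traj_b):
    """1-bit feedback: True if traj_a is preferred over traj_b.
    Logistic link as one instance of a generalized linear model."""
    score = theta @ (phi(traj_a) - phi(traj_b))
    p_a_wins = 1.0 / (1.0 + np.exp(-score))
    return rng.random() < p_a_wins

traj_a = rng.normal(size=(10, d))   # 10 steps, each with a d-dim feature
traj_b = rng.normal(size=(10, d))
print(preference_feedback(traj_a, traj_b))
```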
Stochastic Contextual Dueling Bandits under Linear Stochastic Transitivity Models
Bengs, Viktor, Saha, Aadirupa, Hüllermeier, Eyke
We consider the regret minimization task in a dueling bandits problem with context information. In every round of the sequential decision problem, the learner makes a context-dependent selection of two choice alternatives (arms) to be compared with each other and receives feedback in the form of noisy preference information. We assume that the feedback process is determined by a linear stochastic transitivity model with contextualized utilities (CoLST), and the learner's task is to include the best arm (with the highest latent context-dependent utility) in the duel. We propose a computationally efficient algorithm, $\texttt{CoLSTIM}$, which makes its choice based on imitating the feedback process using perturbed context-dependent utility estimates of the underlying CoLST model. If each arm is associated with a $d$-dimensional feature vector, we show that $\texttt{CoLSTIM}$ achieves a regret of order $\tilde O( \sqrt{dT})$ after $T$ learning rounds. We also establish the optimality of $\texttt{CoLSTIM}$ by showing a lower bound for the weak regret that refines the existing average regret analysis. Our experiments demonstrate its superiority over state-of-the-art algorithms for special cases of CoLST models.
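To illustrate the perturbed-utility idea behind $\texttt{CoLSTIM}$, the sketch below picks a duel by perturbing estimated context-dependent utilities. The Gumbel perturbation and the rule for the second arm are illustrative assumptions; they do not reproduce the paper's exact algorithm.

```python
import numpy as np

rng = np.random.default_rng(2)
d, K = 4, 6
theta_hat = rng.normal(size=d)   # assumed current estimate of the utility parameter

def select_duel(features, theta_hat, rng):
    """Select two arms by perturbing estimated context-dependent utilities.
    `features` is a (K, d) array of the arms' feature vectors in this round.
    Sketch of the perturbation idea only; CoLSTIM uses a specific perturbation
    scheme and a different rule for choosing the second arm."""
    utilities = features @ theta_hat
    perturbed = utilities + rng.gumbel(size=len(utilities))
    first = int(np.argmax(perturbed))
    second = int(np.argmax(np.delete(perturbed, first)))  # best among the rest
    if second >= first:
        second += 1   # map the index back after the deletion
    return first, second

features = rng.normal(size=(K, d))
print(select_duel(features, theta_hat, rng))
```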
Optimal and Efficient Dynamic Regret Algorithms for Non-Stationary Dueling Bandits
Gupta, Shubham, Saha, Aadirupa
The problem of Dueling Bandits has gained much attention in the machine learning community [32, 34, 35]. This is an online learning framework that generalizes the standard multi-armed bandit (MAB) setting [1, 3, 4, 15, 17] by querying relative preference feedback for actively chosen item pairs instead of absolute feedback for a single item. More formally, in dueling bandits, the learning proceeds in rounds: at each round, the learner selects a pair of arms and observes stochastic preference feedback, i.e., the winner of a comparison (duel) between the selected arms. The objective of the learner is to minimize the regret with respect to a (or a set of) 'best' arm(s) in hindsight. Towards this, several algorithms have been proposed [2, 13, 16, 33]. Due to the inherent exploration-vs-exploitation tradeoff of the learning framework and the several advantages of preference feedback [6, 31], many real-world applications can be modeled as dueling bandits, including movie recommendation, retail management, search engine optimization, and job scheduling. While most existing works have studied the dueling bandit problem in a stochastic setting where the underlying preferences are assumed to be fixed over time, this assumption is often unjustified in reality.
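A minimal sketch of one round of the dueling-bandit protocol described above, assuming a fixed preference matrix `P` for illustration (in the non-stationary setting studied here, `P` would change across rounds):

```python
import numpy as np

def duel(i, j, P, rng):
    """One dueling-bandit round: compare arms i and j and return the winner,
    drawn from a preference matrix P with P[i, j] = Pr(i beats j)."""
    return i if rng.random() < P[i, j] else j

rng = np.random.default_rng(4)
# Assumed 3-arm preference matrix (P[i, j] + P[j, i] = 1).
P = np.array([[0.5, 0.6, 0.8],
              [0.4, 0.5, 0.7],
              [0.2, 0.3, 0.5]])
print(duel(0, 2, P, rng))
```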
Strategically Efficient Exploration in Competitive Multi-agent Reinforcement Learning
Loftin, Robert, Saha, Aadirupa, Devlin, Sam, Hofmann, Katja
High sample complexity remains a barrier to the application of reinforcement learning (RL), particularly in multi-agent systems. A large body of work has demonstrated that exploration mechanisms based on the principle of optimism under uncertainty can significantly improve the sample efficiency of RL in single agent tasks. This work seeks to understand the role of optimistic exploration in non-cooperative multi-agent settings. We will show that, in zero-sum games, optimistic exploration can cause the learner to waste time sampling parts of the state space that are irrelevant to strategic play, as they can only be reached through cooperation between both players. To address this issue, we introduce a formal notion of strategically efficient exploration in Markov games, and use this to develop two strategically efficient learning algorithms for finite Markov games. We demonstrate that these methods can be significantly more sample efficient than their optimistic counterparts.
Dueling Bandits with Adversarial Sleeping
Saha, Aadirupa, Gaillard, Pierre
We introduce the problem of sleeping dueling bandits with stochastic preferences and adversarial availabilities (DB-SPAA). In almost all dueling bandit applications, the decision space often changes over time; e.g., retail store management, online shopping, restaurant recommendation, search engine optimization, etc. Surprisingly, this `sleeping aspect' of dueling bandits has never been studied in the literature. As in dueling bandits, the goal is to compete with the best arm by sequentially querying preference feedback on item pairs. The non-triviality, however, arises from the non-stationary item spaces, which allow arbitrary subsets of items to become unavailable in each round. The goal is to find an optimal `no-regret' policy that can identify the best available item at each round, as opposed to the standard `fixed best-arm regret objective' of dueling bandits. We first derive an instance-specific lower bound for DB-SPAA of $\Omega( \sum_{i =1}^{K-1}\sum_{j=i+1}^K \frac{\log T}{\Delta(i,j)})$, where $K$ is the number of items and $\Delta(i,j)$ is the gap between items $i$ and $j$. This indicates that the sleeping problem with preference feedback is inherently more difficult than that for classical multi-armed bandits (MAB). We then propose two algorithms with near-optimal regret guarantees. Our results are corroborated empirically.
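For a sense of scale, the following toy computation evaluates the instance-specific lower bound $\sum_{i<j} \log T / \Delta(i,j)$ on an assumed 3-item instance with illustrative pairwise gaps.

```python
import numpy as np

# Illustrative evaluation of sum_{i<j} log(T) / Delta(i, j) on a toy instance.
T = 10_000
Delta = {(0, 1): 0.1, (0, 2): 0.3, (1, 2): 0.2}   # assumed pairwise gaps
lower_bound = sum(np.log(T) / gap for gap in Delta.values())
print(lower_bound)   # ~ 9.21 * (10 + 10/3 + 5) ≈ 168.9
```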
Confidence-Budget Matching for Sequential Budgeted Learning
Efroni, Yonathan, Merlis, Nadav, Saha, Aadirupa, Mannor, Shie
A core element in decision-making under uncertainty is the feedback on the quality of the performed actions. However, in many applications, such feedback is restricted. For example, in recommendation systems, repeatedly asking the user to provide feedback on the quality of recommendations will annoy them. In this work, we formalize decision-making problems with a querying budget, where there is a (possibly time-dependent) hard limit on the number of reward queries allowed. Specifically, we consider multi-armed bandits, linear bandits, and reinforcement learning problems. We start by analyzing the performance of `greedy' algorithms that query a reward whenever they can. We show that in fully stochastic settings, doing so performs surprisingly well, but in the presence of any adversity, this might lead to linear regret. To overcome this issue, we propose the Confidence-Budget Matching (CBM) principle, which queries rewards when the confidence intervals are wider than the inverse square root of the available budget. We analyze the performance of CBM-based algorithms in different settings and show that they perform well in the presence of adversity in the contexts, initial states, and budgets.
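A minimal sketch of the CBM rule as stated above: query a reward only when the confidence interval is wider than the inverse square root of the remaining budget. The Hoeffding-style width and constants are illustrative assumptions, not the paper's exact definitions.

```python
import numpy as np

def should_query(num_observations, remaining_budget, delta=0.05):
    """Confidence-Budget Matching rule (sketch): query the reward only when the
    confidence interval is wider than 1/sqrt(remaining budget).
    Hoeffding-style width for rewards in [0, 1]; constants are illustrative."""
    if remaining_budget <= 0:
        return False
    if num_observations == 0:
        return True
    width = np.sqrt(2 * np.log(1.0 / delta) / num_observations)
    return width > 1.0 / np.sqrt(remaining_budget)

# Example: an arm observed 50 times, with 400 reward queries left in the budget.
print(should_query(num_observations=50, remaining_budget=400))
```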
Adversarial Dueling Bandits
Saha, Aadirupa, Koren, Tomer, Mansour, Yishay
We introduce the problem of regret minimization in Adversarial Dueling Bandits. As in classic Dueling Bandits, the learner has to repeatedly choose a pair of items and observe only a relative binary `win-loss' feedback for this pair, but here this feedback is generated from an arbitrary preference matrix, possibly chosen adversarially. Our main result is an algorithm whose $T$-round regret compared to the \emph{Borda-winner} from a set of $K$ items is $\tilde{O}(K^{1/3}T^{2/3})$, as well as a matching $\Omega(K^{1/3}T^{2/3})$ lower bound. We also prove a similar high probability regret bound. We further consider a simpler \emph{fixed-gap} adversarial setup, which bridges between two extreme preference feedback models for dueling bandits: stationary preferences and an arbitrary sequence of preferences. For the fixed-gap adversarial setup we give an $\smash{ \tilde{O}((K/\Delta^2)\log{T}) }$ regret algorithm, where $\Delta$ is the gap in Borda scores between the best item and all other items, and show a lower bound of $\Omega(K/\Delta^2)$ indicating that our dependence on the main problem parameters $K$ and $\Delta$ is tight (up to logarithmic factors).
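As a reminder of the benchmark used here, the sketch below computes Borda scores from a preference matrix and returns the Borda winner; the toy matrix is an assumption for illustration.

```python
import numpy as np

def borda_scores(P):
    """Borda score of each item from a K x K preference matrix P, where
    P[i, j] is the probability that item i beats item j. The Borda score of
    item i is its average win probability against all other items."""
    K = P.shape[0]
    off_diag = ~np.eye(K, dtype=bool)
    return np.array([P[i, off_diag[i]].mean() for i in range(K)])

# Toy 3-item preference matrix (rows/cols: items 0, 1, 2).
P = np.array([[0.5, 0.7, 0.6],
              [0.3, 0.5, 0.8],
              [0.4, 0.2, 0.5]])
scores = borda_scores(P)
print(scores, "Borda winner:", int(np.argmax(scores)))
```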
Regret Minimisation in Multinomial Logit Bandits
Saha, Aadirupa, Gopalan, Aditya
We consider two regret minimisation problems over subsets of a finite ground set $[n]$, with subset-wise relative preference information feedback according to the Multinomial logit choice model. The first setting requires the learner to test subsets of size bounded by a maximum size, followed by receiving top-$m$ rank-ordered feedback, while in the second setting the learner is restricted to play subsets of a fixed size $k$ with a full ranking observed as feedback. For both settings, we devise new, order-optimal regret algorithms and derive fundamental limits on the regret performance of online learning with subset-wise preferences. Our results also show the value of eliciting general top-$m$ rank-ordered feedback over single-winner feedback ($m=1$).
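The Multinomial logit choice model underlying both settings can be sketched in a few lines: the probability that an item wins a played subset is proportional to the exponential of its latent utility. The utilities and subset below are illustrative assumptions.

```python
import numpy as np

def mnl_choice_probabilities(subset, theta):
    """Multinomial logit choice probabilities over a played subset.
    `theta[i]` is the latent utility of item i; the probability that item i
    wins the subset is exp(theta[i]) normalized over the subset."""
    weights = np.exp(theta[subset])
    return weights / weights.sum()

theta = np.array([1.0, 0.2, -0.5, 0.7])   # illustrative utilities for n = 4 items
subset = np.array([0, 2, 3])              # the learner plays items {0, 2, 3}
print(mnl_choice_probabilities(subset, theta))
```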
From PAC to Instance-Optimal Sample Complexity in the Plackett-Luce Model
Saha, Aadirupa, Gopalan, Aditya
We consider PAC learning for identifying a good item from subset-wise samples in Plackett-Luce probability models, with instance-dependent sample complexity performance. For the setting where subsets of a fixed size can be tested and top-ranked feedback is made available to the learner each time, we give the first $(\epsilon,\delta)$-PAC best-item algorithm with an instance-dependent sample complexity bound. The algorithm relies on a wrapper that uses a weaker PAC algorithm with worst-case performance guarantees to adapt to the hardness of the input instance. The sample complexity is shown to be multiplicatively better depending on the length of rank-ordered feedback available in each subset play. We also give a new fixed-budget best-item algorithm for the Plackett-Luce model along with an error bound. Numerical results of simulations of the algorithms are reported.
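A small sketch of how top-$m$ rank-ordered feedback is generated under the Plackett-Luce model: repeatedly sample a winner with probability proportional to its exponentiated utility, remove it, and continue. The utilities, subset, and $m$ below are illustrative assumptions.

```python
import numpy as np

def sample_pl_ranking(subset, theta, m, rng):
    """Sample top-m rank-ordered feedback from a Plackett-Luce model:
    repeatedly draw a winner from the remaining items with probability
    proportional to exp(theta[i]), remove it, and repeat m times."""
    remaining = list(subset)
    ranking = []
    for _ in range(m):
        weights = np.exp(theta[remaining])
        probs = weights / weights.sum()
        winner = rng.choice(remaining, p=probs)
        ranking.append(int(winner))
        remaining.remove(winner)
    return ranking

rng = np.random.default_rng(3)
theta = np.array([1.0, 0.2, -0.5, 0.7, 0.1])   # illustrative item utilities
print(sample_pl_ranking(subset=[0, 1, 3, 4], theta=theta, m=2, rng=rng))
```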