Simple Algorithms for Dueling Bandits

Machine Learning

In this paper, we present simple algorithms for Dueling Bandits. We prove that the algorithms have regret bounds for time horizon T of order O(T^rho) with 1/2 <= rho <= 3/4, which importantly do not depend on any preference gap between actions, Delta. Dueling Bandits is an important extension of the Multi-Armed Bandit problem, in which the algorithm must select two actions at a time and only receives binary feedback for the duel outcome. This is analogous to comparisons in which the rater can only provide yes/no or better/worse type responses. We compare our simple algorithms to the current state of the art for Dueling Bandits, ISS and DTS, discussing complexity and regret upper bounds, and conducting experiments on synthetic data that demonstrate their regret performance, which in some cases exceeds the state of the art.
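The feedback model described above (pick two actions, observe only a binary duel outcome) can be sketched as follows. This is a minimal illustration, not any of the paper's algorithms: the preference matrix `PREF` and the helper names `duel` and `empirical_win_rate` are hypothetical and chosen only for this example.

```python
import random

# Hypothetical preference matrix: PREF[i][j] = P(action i beats action j).
PREF = [
    [0.5, 0.7, 0.9],
    [0.3, 0.5, 0.6],
    [0.1, 0.4, 0.5],
]


def duel(pref, i, j, rng):
    """Return 1 if action i wins the duel against action j, else 0.

    The learner sees only this binary outcome, never pref[i][j] itself.
    """
    return 1 if rng.random() < pref[i][j] else 0


def empirical_win_rate(pref, i, j, rounds, seed=0):
    """Estimate P(i beats j) by repeating the same duel `rounds` times."""
    rng = random.Random(seed)
    wins = sum(duel(pref, i, j, rng) for _ in range(rounds))
    return wins / rounds
```

With enough rounds, `empirical_win_rate(PREF, 0, 1, 20000)` should land close to the underlying value 0.7, which is all a dueling bandit learner can recover from this kind of feedback.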

Dueling Bandits with Qualitative Feedback

Machine Learning

We formulate and study a novel multi-armed bandit problem called the qualitative dueling bandit (QDB) problem, where an agent observes qualitative rather than numeric feedback when pulling an arm. We employ the same regret as the dueling bandit (DB) problem, where the duel is carried out by comparing the qualitative feedback. Although we can naively use classic DB algorithms for solving the QDB problem, this reduction significantly worsens performance. In fact, in the QDB problem, the probability that one arm wins the duel over another arm can be directly estimated without carrying out actual duels. In this paper, we propose such direct algorithms for the QDB problem. Our theoretical analysis shows that the proposed algorithms significantly outperform DB algorithms by incorporating the qualitative feedback, and experimental results also demonstrate vast improvement over the existing DB algorithms.
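The idea of estimating a duel's win probability directly from each arm's own feedback can be sketched as below. This is an illustration under stated assumptions, not the paper's algorithm: feedback is assumed to be ordinal and encoded as comparable values (here integers), ties are assumed to count as half a win, and the function name `direct_win_probability` is hypothetical.

```python
from collections import Counter


def direct_win_probability(feedback_i, feedback_j):
    """Estimate P(arm i beats arm j) from separately observed feedback.

    Each argument is a list of ordinal feedback values seen when pulling
    that arm alone. No duel between the two arms is ever carried out.
    Ties contribute 0.5 (an assumption of this sketch).
    """
    counts_i = Counter(feedback_i)
    counts_j = Counter(feedback_j)
    total = 0.0
    for x, cx in counts_i.items():
        for y, cy in counts_j.items():
            if x > y:
                total += cx * cy
            elif x == y:
                total += 0.5 * cx * cy
    return total / (len(feedback_i) * len(feedback_j))
```

For example, with feedback `[3, 3, 2, 1]` for one arm and `[2, 2, 1, 1]` for another, every pull of either arm contributes to the estimate, whereas a naive reduction to DB would spend one comparison per pair of pulls.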

Duelling Bandits with Weak Regret in Adversarial Environments

Machine Learning

Research on the multi-armed bandit problem has studied the trade-off of exploration and exploitation in depth. However, there are numerous applications where the cardinal absolute-valued feedback model (e.g. ratings from one to five) is not suitable. This has motivated the formulation of the duelling bandits problem, where the learner picks a pair of actions and observes a noisy binary feedback, indicating a relative preference between the two. There exists a multitude of different settings and interpretations of the problem for two reasons. First, due to the absence of a total order of actions, there is no natural definition of the best action. Existing work either explicitly assumes the existence of a linear order, or uses a custom definition for the winner. Second, there are multiple reasonable notions of regret to measure the learner's performance. Most prior work has focussed on the $\textit{strong regret}$, which averages the quality of the two actions picked. This work focusses on the $\textit{weak regret}$, which is based on the quality of the better of the two actions selected. Weak regret is the more appropriate performance measure when the pair's inferior action has no significant detrimental effect on the pair's quality. We study the duelling bandits problem in the adversarial setting. We provide an algorithm which has theoretical guarantees in both the utility-based setting, which implies a total order, and the unrestricted setting. For the latter, we work with the $\textit{Borda winner}$, finding the action maximising the probability of winning against an action sampled uniformly at random. The thesis concludes with experimental results based on both real-world data and synthetic data, showing the algorithm's performance and limitations.
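The two regret notions and the Borda winner can be made concrete with a small sketch. It assumes one common formalisation (per-round regret measured against the best utility, and Borda scores averaged over the other actions); the thesis's exact definitions may differ, and the names `UTILS`, `CYCLIC`, and all function names are hypothetical.

```python
def strong_regret(utils, i, j):
    """Per-round strong regret: best utility minus the pair's average."""
    return max(utils) - (utils[i] + utils[j]) / 2


def weak_regret(utils, i, j):
    """Per-round weak regret: best utility minus the better of the pair."""
    return max(utils) - max(utils[i], utils[j])


def borda_scores(pref):
    """Probability that each action beats an opponent drawn uniformly
    at random from the other actions (assumed sampling convention)."""
    k = len(pref)
    return [
        sum(pref[i][j] for j in range(k) if j != i) / (k - 1)
        for i in range(k)
    ]


def borda_winner(pref):
    """Index of the action with the highest Borda score."""
    scores = borda_scores(pref)
    return max(range(len(scores)), key=scores.__getitem__)


# Hypothetical utilities for the utility-based setting.
UTILS = [0.9, 0.6, 0.2]

# Hypothetical cyclic preferences: 0 beats 1, 1 beats 2, 2 beats 0,
# so no total order exists, yet a Borda winner is still defined.
CYCLIC = [
    [0.5, 0.6, 0.1],
    [0.4, 0.5, 0.6],
    [0.9, 0.4, 0.5],
]
```

Picking the pair (0, 2) from `UTILS` incurs zero weak regret because the best action is in the pair, while the strong regret stays positive because the inferior action drags down the average. In `CYCLIC`, `borda_winner` returns action 2 even though action 1 beats it head to head.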


AAAI Conferences

It is well known that strategic behavior in elections is essentially unavoidable; we therefore ask: how bad can the rational outcome be? We answer this question via the notion of the price of anarchy, using the scores of alternatives as a proxy for their quality and bounding the ratio between the score of the optimal alternative and the score of the winning alternative in Nash equilibrium. Specifically, we are interested in Nash equilibria that are obtained via sequences of rational strategic moves. Focusing on three common voting rules -- plurality, veto, and Borda -- we provide very positive results for plurality and very negative results for Borda, and place veto in the middle of this spectrum.
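The score ratio described above can be illustrated with a short sketch. It assumes scores are computed on a given preference profile and that the equilibrium winner is supplied by the caller; it does not search for Nash equilibria, and the function names and the example profile are hypothetical.

```python
def rule_weights(rule, m):
    """Points per ranking position (best first) for m alternatives."""
    if rule == "plurality":
        return [1] + [0] * (m - 1)
    if rule == "veto":
        return [1] * (m - 1) + [0]
    if rule == "borda":
        return list(range(m - 1, -1, -1))
    raise ValueError(f"unknown rule: {rule}")


def positional_scores(profile, rule):
    """Total score of each alternative under a positional scoring rule.

    `profile` is a list of rankings, each listing all alternatives
    from most to least preferred.
    """
    weights = rule_weights(rule, len(profile[0]))
    scores = {alt: 0 for alt in profile[0]}
    for ranking in profile:
        for position, alt in enumerate(ranking):
            scores[alt] += weights[position]
    return scores


def score_ratio(scores, winner):
    """Ratio of the best score to the given winner's score.

    Undefined (division by zero) if the winner's score is 0.
    """
    return max(scores.values()) / scores[winner]


# Hypothetical three-voter profile over alternatives a, b, c.
PROFILE = [("a", "b", "c"), ("a", "c", "b"), ("b", "c", "a")]
```

Under Borda, `PROFILE` gives a, b, and c scores of 4, 3, and 2, so if some equilibrium elected b the ratio would be 4/3. A ratio of 1 means the equilibrium winner is also the highest-scoring alternative.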

L.A. Phil President Deborah Borda's departure sends arts world spinning

Los Angeles Times

She'll be replacing Matthew VanBesien as president and CEO of the New York Philharmonic starting Sept. 15. "It was pretty unexpected!" said Jesse Rosen, head of the New York-based League of American Orchestras. "I tend to hear rumblings before they happen, and I hadn't heard anything. If anyone saw this coming, they weren't saying." Rosen said Borda's departure shouldn't be a head-spinning surprise, but the fact that she's circling back to head up the New York Philharmonic, where she served as executive director from 1991 to 1999, is significant and possibly unprecedented in the orchestra world.