Dann, Christoph, Mansour, Yishay, Mohri, Mehryar, Sekhari, Ayush, Sridharan, Karthik

We study episodic reinforcement learning in Markov decision processes when the agent receives additional feedback per step in the form of several transition observations. Such additional observations are available in a range of tasks through extended sensors or prior knowledge about the environment (e.g., when certain actions yield similar outcomes). We formalize this setting using a feedback graph over state-action pairs and show that model-based algorithms can leverage the additional feedback for more sample-efficient learning. We give a regret bound that, ignoring logarithmic factors and lower-order terms, depends only on the size of the maximum acyclic subgraph of the feedback graph, in contrast with a polynomial dependency on the number of states and actions in the absence of a feedback graph. Finally, we highlight challenges when leveraging a small dominating set of the feedback graph as compared to the bandit setting and propose a new algorithm that can use knowledge of such a dominating set for more sample-efficient learning of a near-optimal policy.

Shabat, Eliran, Cohen, Lee, Mansour, Yishay

There is growing interest in societal concerns in machine learning systems, especially in fairness. Multicalibration gives a comprehensive methodology to address group fairness. In this work, we address the multicalibration error and decouple it from the prediction error. The importance of decoupling the fairness metric (multicalibration) and the accuracy (prediction error) is due to the inherent trade-off between the two, and the societal decision regarding the "right tradeoff" (often imposed by regulators). Our work gives sample complexity bounds for uniform convergence guarantees of multicalibration error, which implies that regardless of the accuracy, we can guarantee that the empirical and (true) multicalibration errors are close. We emphasize that our results: (1) are more general than previous bounds, as they apply to both agnostic and realizable settings, and do not rely on a specific type of algorithm (such as differentially private ones), (2) improve over previous multicalibration sample complexity bounds and (3) imply uniform convergence guarantees for the classical calibration error.
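As a rough illustration of the quantity being bounded, the sketch below computes an empirical multicalibration error as the largest gap between average outcome and average prediction over (group, prediction-bin) cells. This is one common formalization and an assumption here; the paper's exact definition may differ. The function name and the `n_bins` and `min_count` parameters are hypothetical.

```python
from collections import defaultdict


def empirical_multicalibration_error(preds, labels, groups, n_bins=10, min_count=1):
    """Largest |mean(label) - mean(pred)| over (group, prediction-bin) cells.

    preds:  predicted probabilities in [0, 1]
    labels: binary outcomes (0 or 1)
    groups: for each example, the collection of group ids it belongs to
    Assumption: cells with fewer than min_count examples are ignored.
    """
    cells = defaultdict(lambda: [0, 0.0, 0.0])  # count, sum of preds, sum of labels
    for p, y, gs in zip(preds, labels, groups):
        b = min(int(p * n_bins), n_bins - 1)  # index of the prediction bin
        for g in gs:
            cell = cells[(g, b)]
            cell[0] += 1
            cell[1] += p
            cell[2] += y
    errors = [abs(s_y - s_p) / n for n, s_p, s_y in cells.values() if n >= min_count]
    return max(errors, default=0.0)


# Toy data: four examples, two overlapping groups "a" and "b".
preds = [0.25, 0.25, 0.75, 0.75]
labels = [0, 1, 1, 1]
groups = [{"a"}, {"a", "b"}, {"b"}, {"a", "b"}]
err_all = empirical_multicalibration_error(preds, labels, groups)
err_large_cells = empirical_multicalibration_error(preds, labels, groups, min_count=2)
```

Note that the error depends only on how predictions and outcomes agree within each cell, not on how accurate the predictor is overall, which is the decoupling the abstract emphasizes.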

Cohen, Alon, Hassidim, Avinatan, Kaplan, Haim, Mansour, Yishay, Moran, Shay

Imagine a large firm with multiple departments that plans a large recruitment. Candidates arrive one-by-one, and for each candidate the firm decides, based on her data (CV, skills, experience, etc.), whether to summon her for an interview. The firm wants to recruit the best candidates while minimizing the number of interviews. We model such scenarios as an assignment problem between items (candidates) and categories (departments): the items arrive one-by-one in an online manner, and upon processing each item the algorithm decides, based on its value and the categories it can be matched with, whether to retain or discard it (this decision is irrevocable). The goal is to retain as few items as possible while guaranteeing that the set of retained items contains an optimal matching.

Livni, Roi, Mansour, Yishay

A basic question in learning theory is to identify if two distributions are identical when we have access only to examples sampled from the distributions. This basic task is considered, for example, in the context of Generative Adversarial Networks (GANs), where a discriminator is trained to distinguish between a real-life distribution and a synthetic distribution. Classically, we use a hypothesis class $H$ and claim that the two distributions are distinct if for some $h\in H$ the expected value on the two distributions is (significantly) different. Our starting point is the following fundamental problem: "is having the hypothesis dependent on more than a single random example beneficial?" To address this challenge we define $k$-ary based discriminators, which have a family of Boolean $k$-ary functions $\mathcal{G}$.
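The classical single-example test described above can be sketched as follows: estimate, for each $h\in H$, the mean of $h$ on a sample from each distribution and report the largest gap. This only illustrates the baseline, not the $k$-ary discriminators the paper introduces. The choice of threshold functions for $H$, the Gaussian data, and all names are assumptions made for illustration.

```python
import random


def empirical_discrepancy(sample_p, sample_q, hypotheses):
    """max over h of |mean of h on sample_p - mean of h on sample_q|."""
    def mean(h, sample):
        return sum(h(x) for x in sample) / len(sample)
    return max(abs(mean(h, sample_p) - mean(h, sample_q)) for h in hypotheses)


# Hypothetical class H: threshold functions on the real line.
thresholds = [i / 10 for i in range(-30, 31)]
hypotheses = [lambda x, t=t: 1.0 if x >= t else 0.0 for t in thresholds]

rng = random.Random(0)
same_a = [rng.gauss(0.0, 1.0) for _ in range(5000)]
same_b = [rng.gauss(0.0, 1.0) for _ in range(5000)]
shifted = [rng.gauss(1.0, 1.0) for _ in range(5000)]

# Two samples from the same distribution should show a small gap;
# a sample from a shifted distribution should show a large one.
gap_same = empirical_discrepancy(same_a, same_b, hypotheses)
gap_shifted = empirical_discrepancy(same_a, shifted, hypotheses)
```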

Bar-On, Yogev, Mansour, Yishay

We study agents communicating over an underlying network by exchanging messages, in order to optimize their individual regret in a common nonstochastic multi-armed bandit problem. We derive regret minimization algorithms that guarantee for each agent $v$ an individual expected regret of $\widetilde{O}\left(\sqrt{\left(1+\frac{K}{\left|\mathcal{N}\left(v\right)\right|}\right)T}\right)$, where $T$ is the number of time steps, $K$ is the number of actions and $\mathcal{N}\left(v\right)$ is the set of neighbors of agent $v$ in the communication graph. We present algorithms both for the case that the communication graph is known to all the agents, and for the case that the graph is unknown. When the graph is unknown, each agent knows only the set of its neighbors and an upper bound on the total number of agents. The individual regret between the models differs only by a logarithmic factor.

Rosenberg, Aviv, Mansour, Yishay

We consider online learning in episodic loop-free Markov decision processes (MDPs), where the loss function can change arbitrarily between episodes. The transition function is fixed but unknown to the learner, and the learner only observes bandit feedback (not the entire loss function). For this problem we develop no-regret algorithms that perform asymptotically as well as the best stationary policy in hindsight. Assuming that all states are reachable with probability $\beta > 0$ under any policy, we give a regret bound of $\tilde{O}(L |X| \sqrt{|A| T} / \beta)$, where $T$ is the number of episodes, $X$ is the state space, $A$ is the action space, and $L$ is the length of each episode. When this assumption is removed we give a regret bound of $\tilde{O}(L^{3/2} |X| |A|^{1/4} T^{3/4})$, that holds for an arbitrary transition function.

Mansour, Yishay, Mohri, Mehryar, Ro, Jae, Suresh, Ananda Theertha

The standard objective in machine learning is to train a single model for all users. However, in many learning scenarios, such as cloud computing and federated learning, it is possible to learn one personalized model per user. In this work, we present a systematic learning-theoretic study of personalization. We propose and analyze three approaches: user clustering, data interpolation, and model interpolation. For all three approaches, we provide learning-theoretic guarantees and efficient algorithms for which we also demonstrate the performance empirically. All of our algorithms are model agnostic and work for any hypothesis class.

Amir, Idan, Attias, Idan, Koren, Tomer, Livni, Roi, Mansour, Yishay

Prediction with expert advice is perhaps the single most fundamental problem in online learning and sequential decision making. In this problem, the goal of a learner is to aggregate decisions from multiple experts and achieve performance that approaches that of the best individual expert in hindsight. The standard performance criterion is the regret: the difference between the loss of the learner and that of the best single expert. The experts problem is often considered in the so-called adversarial setting, where the losses of the individual experts may be virtually arbitrary and even be chosen by an adversary so as to maximize the learner's regret. The canonical algorithm in this setup is the Multiplicative Weights algorithm (Littlestone and Warmuth, 1989; Freund and Schapire, 1995), that guarantees an optimal regret of $\Theta(\sqrt{T \log N})$ in any problem with $N$ experts and $T$ decision rounds. A long line of research in online learning has focused on obtaining better regret guarantees, often referred to as "fast rates," on benign problem instances in which the loss generation process behaves more favourably than in a fully adversarial setup. A prototypical example of such an instance is the stochastic setting of the experts problem, where the losses of the experts are drawn i.i.d.
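For concreteness, here is a minimal sketch of the Multiplicative Weights (Hedge) update and the regret it is measured by, assuming losses in $[0, 1]$ and a horizon known in advance. The function name, the random loss matrix, and the learning-rate tuning $\eta = \sqrt{8 \ln N / T}$ are illustrative choices, not taken from the paper; with that tuning the standard analysis gives regret at most $\sqrt{T \ln N / 2}$.

```python
import math
import random


def multiplicative_weights(losses, eta):
    """Run Hedge on a T x N loss matrix with entries in [0, 1].

    Returns the learner's expected cumulative loss and its regret
    against the best single expert in hindsight.
    """
    n_experts = len(losses[0])
    weights = [1.0] * n_experts
    learner_loss = 0.0
    for round_losses in losses:
        total = sum(weights)
        probs = [w / total for w in weights]
        # Expected loss of sampling an expert from the current distribution.
        learner_loss += sum(p * l for p, l in zip(probs, round_losses))
        # Exponentially down-weight each expert according to its loss.
        weights = [w * math.exp(-eta * l) for w, l in zip(weights, round_losses)]
    best_expert_loss = min(sum(row[i] for row in losses) for i in range(n_experts))
    return learner_loss, learner_loss - best_expert_loss


T, N = 2000, 10
rng = random.Random(0)
losses = [[rng.random() for _ in range(N)] for _ in range(T)]  # i.i.d. toy losses
eta = math.sqrt(8 * math.log(N) / T)  # standard tuning for a known horizon
learner_loss, regret = multiplicative_weights(losses, eta)
bound = math.sqrt(T * math.log(N) / 2)  # worst-case guarantee for this tuning
```

On stochastic instances like this toy one, the observed regret is typically far below the worst-case bound, which is the gap that "fast rates" results aim to quantify.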

Cohen, Alon, Kaplan, Haim, Mansour, Yishay, Rosenberg, Aviv

Stochastic shortest path (SSP) is a well-known problem in planning and control, in which an agent has to reach a goal state in minimum total expected cost. In the learning formulation of the problem, the agent is unaware of the environment dynamics (i.e., the transition function) and has to repeatedly play for a given number of episodes while reasoning about the problem's optimal solution. Unlike other well-studied models in reinforcement learning (RL), the length of an episode is not predetermined (or bounded) and is influenced by the agent's actions. Recently, Tarbouriech et al. (2019) studied this problem in the context of regret minimization and provided an algorithm whose regret bound is inversely proportional to the square root of the minimum instantaneous cost. In this work we remove this dependence on the minimum cost---we give an algorithm that guarantees a regret bound of $\widetilde{O}(B_\star |S| \sqrt{|A| K})$, where $B_\star$ is an upper bound on the expected cost of the optimal policy, $S$ is the set of states, $A$ is the set of actions and $K$ is the number of episodes. We additionally show that any learning algorithm must have at least $\Omega(B_\star \sqrt{|S| |A| K})$ regret in the worst case.

Cortes, Corinna, Mansour, Yishay, Mohri, Mehryar

This paper presents an analysis of importance weighting for learning from finite samples and gives a series of theoretical and algorithmic results. We point out simple cases where importance weighting can fail, which suggests the need for an analysis of the properties of this technique. We then give both upper and lower bounds for generalization with bounded importance weights and, more significantly, give learning guarantees for the more common case of unbounded importance weights under the weak assumption that the second moment is bounded, a condition related to the Rényi divergence of the training and test distributions. These results are based on a series of novel and general bounds we derive for unbounded loss functions, which are of independent interest. We use these bounds to guide the definition of an alternative reweighting algorithm and report the results of experiments demonstrating its benefits.
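The following sketch illustrates the basic technique on a toy discrete domain: examples drawn from a training distribution $Q$ are reweighted by $w(x) = P(x)/Q(x)$ to estimate the loss under a test distribution $P$. It also checks the identity linking the second moment of the weights to the order-2 Rényi divergence, $\mathbb{E}_Q[w^2] = 2^{D_2(P \| Q)}$ with base-2 logarithms. The distributions, the loss values, and all variable names are made up for illustration, and this is not the alternative reweighting algorithm the paper proposes.

```python
import math
import random

# Hypothetical discrete distributions over the points 0..3.
test_dist = [0.1, 0.2, 0.3, 0.4]   # P: the distribution we care about
train_dist = [0.4, 0.3, 0.2, 0.1]  # Q: the distribution we sample from
loss = [1.0, 0.0, 1.0, 0.0]        # loss of a fixed hypothesis at each point

# Importance weights w(x) = P(x) / Q(x).
weights = [p / q for p, q in zip(test_dist, train_dist)]

# Second moment of the weights under Q, and the order-2 Renyi divergence.
second_moment = sum(q * w * w for q, w in zip(train_dist, weights))
renyi_2 = math.log2(sum(p * p / q for p, q in zip(test_dist, train_dist)))

# Importance-weighted empirical loss on a sample drawn from Q.
rng = random.Random(0)
sample = rng.choices(range(4), weights=train_dist, k=50_000)
weighted_loss = sum(weights[x] * loss[x] for x in sample) / len(sample)

# Exact expected loss under P, for comparison.
true_test_loss = sum(p * l for p, l in zip(test_dist, loss))
```

The weighted estimate is unbiased for the test loss, but its variance grows with the second moment of the weights, which is why a bound on that moment is the natural assumption in the unbounded case.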