Cohen, Alon
Average reward reinforcement learning with unknown mixing times
Zahavy, Tom, Cohen, Alon, Kaplan, Haim, Mansour, Yishay
We derive and analyze learning algorithms for policy evaluation, apprenticeship learning, and policy gradient for average reward criteria. Existing algorithms explicitly require an upper bound on the mixing time. In contrast, we build on ideas from Markov chain theory and derive sampling algorithms that do not require such an upper bound. For these algorithms, we provide theoretical bounds on their sample-complexity and running time.
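As context for the average-reward criterion above, here is a minimal Python sketch of the most basic baseline: estimating the average reward of a fixed policy by running one long trajectory and averaging rewards. This is not the paper's algorithm (whose point is to avoid needing a known mixing-time bound); the transition matrix P and reward vector r are hypothetical illustrative values.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical 3-state Markov reward process induced by a fixed policy.
P = np.array([[0.9, 0.1, 0.0],
              [0.1, 0.8, 0.1],
              [0.0, 0.2, 0.8]])   # row-stochastic transition matrix
r = np.array([1.0, 0.0, 2.0])     # reward in each state

def average_reward_estimate(P, r, T=100_000, s0=0):
    """Estimate the long-run average reward by averaging rewards
    along a single long trajectory started at state s0."""
    s, total = s0, 0.0
    for _ in range(T):
        total += r[s]
        s = rng.choice(len(r), p=P[s])
    return total / T

print(average_reward_estimate(P, r))
```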
Learning Linear-Quadratic Regulators Efficiently with only $\sqrt{T}$ Regret
Cohen, Alon, Koren, Tomer, Mansour, Yishay
Optimal control theory dates back to the 1950s, and has been applied successfully to numerous real-world engineering problems (e.g., Bermúdez and Martinez, 1994; Chen and Islam, 2005; Lenhart and Workman, 2007; Geering, 2007). Classical results in control theory pertain to asymptotic convergence and stability of dynamical systems, and recently, there has been a renewed interest in such problems from a learning-theoretic perspective with a focus on finite-time convergence guarantees and computational tractability. Perhaps the most well-studied model in optimal control is Linear-Quadratic (LQ) control. In this model, both the state and the action are real-valued vectors. The dynamics of the environment are linear in both the state and action, and are perturbed by Gaussian noise; the cost is quadratic in the state and action vectors.
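A minimal Python sketch of the LQ model described above: linear dynamics perturbed by Gaussian noise, a quadratic cost, and a fixed linear policy u_t = K x_t. The matrices A, B, Q, R and the controller K are hypothetical illustrative values, not taken from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical 2-dimensional LQ instance (illustrative values only).
A = np.array([[1.0, 0.1], [0.0, 1.0]])   # state dynamics
B = np.array([[0.0], [0.1]])             # action dynamics
Q = np.eye(2)                            # state cost matrix
R = np.eye(1)                            # action cost matrix

def rollout(K, T=1000):
    """Simulate x_{t+1} = A x_t + B u_t + w_t with u_t = K x_t and
    w_t ~ N(0, I), and return the average quadratic cost."""
    x, cost = np.zeros(2), 0.0
    for _ in range(T):
        u = K @ x
        cost += x @ Q @ x + u @ R @ u
        x = A @ x + B @ u + rng.standard_normal(2)
    return cost / T

K = np.array([[-0.5, -1.0]])   # an arbitrary stabilizing linear controller
print(rollout(K))
```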
Learning and Generalization for Matching Problems
Cohen, Alon, Hassidim, Avinatan, Kaplan, Haim, Mansour, Yishay, Moran, Shay
We study a classic algorithmic problem through the lens of statistical learning. That is, we consider a matching problem where the input graph is sampled from some distribution. This distribution is unknown to the algorithm; however, an additional graph which is sampled from the same distribution is given during a training phase (preprocessing). More specifically, the algorithmic problem is to match $k$ out of $n$ items that arrive online to $d$ categories ($d\ll k \ll n$). Our goal is to design a two-stage online algorithm that, in the first stage, retains a small subset of items that contains an offline matching of maximum weight. We then compute this optimal matching in a second stage. The added statistical component is that before the online matching process begins, our algorithms learn from a training set consisting of another matching instance drawn from the same unknown distribution. Using this training set, we learn a policy that we apply during the online matching process. We consider a class of online policies that we term \emph{thresholds policies}. For this class, we derive uniform convergence results both for the number of retained items and the value of the optimal matching. We show that the number of retained items and the value of the offline optimal matching deviate from their expectation by $O(\sqrt{k})$. This requires the use of less-standard concentration inequalities (standard ones give deviations of $O(\sqrt{n})$). Furthermore, we design an algorithm that outputs the optimal offline solution with high probability while retaining only $O(k\log \log n)$ items in expectation.
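A minimal sketch of a thresholds policy in the sense described above: an arriving item is retained whenever its weight for some category clears that category's threshold. In the paper the thresholds would be learned from the training instance; the threshold values and the small item stream below are hypothetical.

```python
# Minimal thresholds-policy sketch: keep an item if its weight for some
# category exceeds that category's (learned) threshold.

def retain(item_weights, thresholds):
    """item_weights[c]: weight of matching this item to category c.
    thresholds[c]: retention threshold for category c."""
    return any(w >= thresholds[c] for c, w in item_weights.items())

thresholds = {"cat_a": 0.7, "cat_b": 0.5}          # hypothetical learned values
stream = [{"cat_a": 0.9, "cat_b": 0.1},
          {"cat_a": 0.2, "cat_b": 0.3},
          {"cat_a": 0.4, "cat_b": 0.8}]

retained = [item for item in stream if retain(item, thresholds)]
print(len(retained))   # items kept for the second-stage offline matching
```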
Incentivizing the Dynamic Workforce: Learning Contracts in the Gig-Economy
Cohen, Alon, Koren, Moran, Deligkas, Argyrios
In principal-agent models, a principal offers a contract to an agent to perform a certain task. The agent exerts a level of effort that maximizes her utility. The principal is oblivious to the agent's chosen level of effort, and conditions her wage only on possible outcomes. In this work, we consider a model in which the principal is unaware of the agent's utility and action space. She sequentially offers contracts to identical agents, and observes the resulting outcomes. We present an algorithm for learning the optimal contract under mild assumptions. We bound the number of samples needed for the principal to obtain a contract that is within $\epsilon$ of her optimal net profit for every $\epsilon>0$.
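For illustration only, a toy Python sketch of the sequential setting above: the principal tries contracts from a grid, observes outcomes from agents who best-respond privately, and keeps the contract with the highest empirical net profit. This naive grid search is a stand-in, not the paper's algorithm; the linear contract form and the `agent_outcome` black box are hypothetical.

```python
import random

random.seed(0)

def agent_outcome(alpha):
    """Hypothetical black box: given a contract paying the agent a share
    `alpha` of the outcome, the agent privately chooses an effort level
    and a random outcome is realized."""
    effort = min(1.0, 2 * alpha)               # hidden best response (toy)
    return 10.0 if random.random() < effort else 0.0

def estimate_net_profit(alpha, n_samples=2000):
    """Empirical estimate of the principal's net profit under contract alpha."""
    total = 0.0
    for _ in range(n_samples):
        total += (1 - alpha) * agent_outcome(alpha)   # principal keeps the rest
    return total / n_samples

grid = [i / 20 for i in range(21)]
best = max(grid, key=estimate_net_profit)
print("empirically best contract share:", best)
```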
Online Linear Quadratic Control
Cohen, Alon, Hassidim, Avinatan, Koren, Tomer, Lazic, Nevena, Mansour, Yishay, Talwar, Kunal
We study the problem of controlling linear time-invariant systems with known noisy dynamics and adversarially chosen quadratic losses. We present the first efficient online learning algorithms in this setting that guarantee $O(\sqrt{T})$ regret under mild assumptions, where $T$ is the time horizon. Our algorithms rely on a novel SDP relaxation for the steady-state distribution of the system. Crucially, and in contrast to previously proposed relaxations, the feasible solutions of our SDP all correspond to "strongly stable" policies that mix exponentially fast to a steady state.
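A minimal sketch of the steady-state quantity that the relaxation above reasons about (not the SDP itself): for a fixed strongly stable linear policy u = Kx, the steady-state state covariance solves a discrete Lyapunov equation, from which the steady-state expected quadratic cost follows. All matrices below are hypothetical illustrative values.

```python
import numpy as np
from scipy.linalg import solve_discrete_lyapunov

# Hypothetical LQ instance and a fixed strongly stable policy u = Kx.
A = np.array([[1.0, 0.1], [0.0, 1.0]])
B = np.array([[0.0], [0.1]])
Q, R = np.eye(2), np.eye(1)
W = np.eye(2)                              # process-noise covariance
K = np.array([[-0.5, -1.0]])               # spectral radius of A + BK is < 1

A_cl = A + B @ K                           # closed-loop dynamics
# Steady-state covariance X solves X = A_cl X A_cl^T + W.
X = solve_discrete_lyapunov(A_cl, W)
steady_cost = np.trace(Q @ X) + np.trace(R @ K @ X @ K.T)
print(steady_cost)
```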
Planning and Learning with Stochastic Action Sets
Boutilier, Craig, Cohen, Alon, Daniely, Amit, Hassidim, Avinatan, Mansour, Yishay, Meshi, Ofer, Mladenov, Martin, Schuurmans, Dale
In many practical uses of reinforcement learning (RL) the set of actions available at a given state is a random variable, with realizations governed by an exogenous stochastic process. Somewhat surprisingly, the foundations for such sequential decision processes have been unaddressed. In this work, we formalize and investigate MDPs with stochastic action sets (SAS-MDPs) to provide these foundations. We show that optimal policies and value functions in this model have a structure that admits a compact representation. From an RL perspective, we show that Q-learning with sampled action sets is sound. In model-based settings, we consider two important special cases: when individual actions are available with independent probabilities; and a sampling-based model for unknown distributions. We develop poly-time value and policy iteration methods for both cases; and in the first, we offer a poly-time linear programming solution.
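A schematic Python sketch of Q-learning with sampled action sets as described above: the bootstrap maximum ranges only over the actions actually available at the next state. The environment interaction is omitted, and the toy transition at the end is hypothetical.

```python
import random
from collections import defaultdict

random.seed(0)

# Q-values default to 0 for unseen (state, action) pairs.
Q = defaultdict(float)
alpha, gamma = 0.1, 0.95

def q_update(s, a, r, s_next, avail_next):
    """One Q-learning step; avail_next is the sampled action set at the
    next state, so the max is taken only over available actions."""
    target = r + gamma * max(Q[(s_next, b)] for b in avail_next)
    Q[(s, a)] += alpha * (target - Q[(s, a)])

def epsilon_greedy(s, avail, eps=0.1):
    """Select an action from the currently available (sampled) set."""
    if random.random() < eps:
        return random.choice(sorted(avail))
    return max(avail, key=lambda b: Q[(s, b)])

# Toy usage with a single hypothetical transition.
q_update(s=0, a="left", r=1.0, s_next=1, avail_next={"up", "down"})
print(Q[(0, "left")])
```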
Online Learning with Feedback Graphs Without the Graphs
Cohen, Alon, Hazan, Tamir, Koren, Tomer
We study an online learning framework introduced by Mannor and Shamir (2011) in which the feedback is specified by a graph, in a setting where the graph may vary from round to round and is \emph{never fully revealed} to the learner. We show a large gap between the adversarial and the stochastic cases. In the adversarial case, we prove that even for dense feedback graphs, the learner cannot improve upon a trivial regret bound obtained by ignoring any additional feedback besides her own loss. In contrast, in the stochastic case we give an algorithm that achieves $\widetilde \Theta(\sqrt{\alpha T})$ regret over $T$ rounds, provided that the independence numbers of the hidden feedback graphs are at most $\alpha$. We also extend our results to a more general feedback model, in which the learner does not necessarily observe her own loss, and show that, even in simple cases, concealing the feedback graphs might render a learnable problem unlearnable.