Kroer, Christian
On the Optimality of Dilated Entropy and Lower Bounds for Online Learning in Extensive-Form Games
Fan, Zhiyuan, Kroer, Christian, Farina, Gabriele
First-order methods (FOMs) are arguably the most scalable algorithms for equilibrium computation in large extensive-form games. To operationalize these methods, a distance-generating function, acting as a regularizer for the strategy space, must be chosen. The ratio between the strong convexity modulus and the diameter of the regularizer is a key parameter in the analysis of FOMs. A natural question is then: what is the optimal distance-generating function for extensive-form decision spaces? In this paper, we make a number of contributions, ultimately establishing that the weight-one dilated entropy (DilEnt) distance-generating function is optimal up to logarithmic factors. The DilEnt regularizer is notable due to its iterate equivalence with Kernelized OMWU (KOMWU) -- the algorithm with state-of-the-art dependence on the game tree size in extensive-form games -- when used in conjunction with the online mirror descent (OMD) algorithm. However, the standard analysis for OMD is unable to establish such a result; the only known analysis proceeds by appealing to the iterate equivalence with KOMWU. We close this gap by introducing a pair of primal-dual treeplex norms, which we contend form the natural analytic viewpoint for studying the strong convexity of DilEnt. Using this norm pair, we recover the diameter-to-strong-convexity ratio that predicts the same performance as KOMWU. Combined with a new regret lower bound for online learning in sequence-form strategy spaces, this shows that the ratio is nearly optimal. Finally, we showcase our analytic techniques by refining the analysis of Clairvoyant OMD when paired with DilEnt, establishing an $\mathcal{O}(n \log |\mathcal{V}| \log T / T)$ approximation rate to coarse correlated equilibrium in $n$-player games, where $|\mathcal{V}|$ is the number of reduced normal-form strategies of the players; this sets a new state of the art.
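For concreteness, the weight-one dilated entropy regularizer can be sketched as follows (our notation, following the standard dilated-DGF construction: $J$ denotes the decision points, $A_j$ the actions at decision point $j$, and $p_j$ the parent sequence of $j$, with $x_{p_j} = 1$ at the root):

\[
  \psi(x) \;=\; \sum_{j \in J} x_{p_j} \sum_{a \in A_j} \frac{x_{(j,a)}}{x_{p_j}} \log \frac{x_{(j,a)}}{x_{p_j}},
\]

i.e., the local negative entropy at each decision point is dilated (scaled) by the probability of reaching it, with every dilation weight set to one.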
On the Convergence of Tâtonnement for Linear Fisher Markets
Nan, Tianlong, Gao, Yuan, Kroer, Christian
Tâtonnement is a simple, intuitive market process in which prices are iteratively adjusted based on the difference between demand and supply. Many variants under different market assumptions have been studied and shown to converge to a market equilibrium, in some cases at a fast rate. However, the classical case of linear Fisher markets has long eluded such analyses, and it has remained unclear whether tâtonnement converges in this case. We show that, for a sufficiently small step size, the prices given by the tâtonnement process are guaranteed to converge to equilibrium prices, up to a small approximation radius that depends on the step size. To achieve this, we consider the dual Eisenberg-Gale convex program in the price space, view tâtonnement as subgradient descent on this convex program, and utilize novel last-iterate convergence results for subgradient descent under error-bound conditions. In doing so, we show that the convex program satisfies a particular error-bound condition, the quadratic growth condition, and that the price sequence generated by tâtonnement is bounded above and away from zero. We also show that a similar convergence result holds for tâtonnement in quasi-linear Fisher markets. Numerical experiments demonstrate that the theoretical linear convergence aligns with empirical observations.
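As a point of reference, the following is a minimal simulation of the process analyzed above, under the simplifying assumptions that every good has unit supply and each buyer spends their entire budget on a single maximum bang-per-buck good (function names are ours, not the paper's):

import numpy as np

def excess_demand(p, B, V):
    # Linear Fisher market with unit supplies: buyer i, with budget B[i] and
    # valuations V[i], spends everything on a maximum bang-per-buck good.
    z = -np.ones(V.shape[1])                 # start from minus the supply
    for i in range(V.shape[0]):
        j = np.argmax(V[i] / p)              # best value-per-dollar good for buyer i
        z[j] += B[i] / p[j]                  # units demanded with the full budget
    return z

def tatonnement(B, V, eta=0.01, T=5000):
    p = np.ones(V.shape[1])
    for _ in range(T):
        # Subgradient step on the dual Eisenberg-Gale objective: raise prices of
        # over-demanded goods, lower prices of under-demanded ones.
        p = np.maximum(p + eta * excess_demand(p, B, V), 1e-6)
    return p

# Two symmetric buyers, two goods; the equilibrium prices are (1, 1).
B = np.array([1.0, 1.0])
V = np.array([[2.0, 1.0], [1.0, 2.0]])
print(tatonnement(B, V))

With a small step size the price iterates settle into a neighborhood of the equilibrium, mirroring the step-size-dependent approximation radius in the guarantee above.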
Fast Last-Iterate Convergence of Learning in Games Requires Forgetful Algorithms
Cai, Yang, Farina, Gabriele, Grand-Clément, Julien, Kroer, Christian, Lee, Chung-Wei, Luo, Haipeng, Zheng, Weiqiang
Self-play via online learning is one of the premier ways to solve large-scale two-player zero-sum games, both in theory and practice. Particularly popular algorithms include optimistic multiplicative weights update (OMWU) and optimistic gradient-descent-ascent (OGDA). While both algorithms enjoy $O(1/T)$ ergodic convergence to Nash equilibrium in two-player zero-sum games, OMWU offers several advantages, including logarithmic dependence on the size of the payoff matrix and $\widetilde{O}(1/T)$ convergence to coarse correlated equilibria even in general-sum games. However, in terms of last-iterate convergence in two-player zero-sum games, an increasingly popular topic in this area, OGDA guarantees that the duality gap shrinks at a rate of $O(1/\sqrt{T})$, while the best existing last-iterate convergence for OMWU depends on a game-dependent constant that can be arbitrarily large. This raises the question: is the potentially slow last-iterate convergence an inherent disadvantage of OMWU, or is the current analysis too loose? Somewhat surprisingly, we show that the former is true. More generally, we prove that a broad class of algorithms that do not forget the past quickly all suffer the same issue: for any arbitrarily small $\delta>0$, there exists a $2\times 2$ matrix game such that the algorithm admits a constant duality gap even after $1/\delta$ rounds. This class of algorithms includes OMWU and other standard optimistic follow-the-regularized-leader algorithms.
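For intuition, here are hedged sketches of the two updates on a simplex-constrained matrix game $\min_x \max_y x^\top A y$ (the step size is illustrative, and this is not the paper's lower-bound construction):

import numpy as np

def proj_simplex(v):
    # Euclidean projection onto the probability simplex.
    u = np.sort(v)[::-1]
    css = np.cumsum(u)
    rho = np.nonzero(u + (1 - css) / np.arange(1, len(v) + 1) > 0)[0][-1]
    return np.maximum(v + (1 - css[rho]) / (rho + 1), 0)

def omwu(A, eta=0.1, T=1000):
    # Optimistic MWU: exponentiate twice the current loss minus the previous one.
    x, y = np.full(A.shape[0], 1 / A.shape[0]), np.full(A.shape[1], 1 / A.shape[1])
    gx_prev, gy_prev = A @ y, A.T @ x
    for _ in range(T):
        gx, gy = A @ y, A.T @ x
        x = x * np.exp(-eta * (2 * gx - gx_prev)); x /= x.sum()
        y = y * np.exp(+eta * (2 * gy - gy_prev)); y /= y.sum()
        gx_prev, gy_prev = gx, gy
    return x, y

def ogda(A, eta=0.1, T=1000):
    # Optimistic GDA: the same extrapolated gradient, with Euclidean projections.
    x, y = np.full(A.shape[0], 1 / A.shape[0]), np.full(A.shape[1], 1 / A.shape[1])
    gx_prev, gy_prev = A @ y, A.T @ x
    for _ in range(T):
        gx, gy = A @ y, A.T @ x
        x = proj_simplex(x - eta * (2 * gx - gx_prev))
        y = proj_simplex(y + eta * (2 * gy - gy_prev))
        gx_prev, gy_prev = gx, gy
    return x, y

def duality_gap(A, x, y):
    return (A.T @ x).max() - (A @ y).min()

The multiplicative form of OMWU hints at why it forgets slowly: once a coordinate's weight is driven exponentially close to zero, many rounds are needed to recover it. This is, roughly, the kind of behavior the lower bound formalizes.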
Efficient Learning in Polyhedral Games via Best Response Oracles
Chakrabarti, Darshan, Farina, Gabriele, Kroer, Christian
Learning in games is a well-studied framework in which agents iteratively refine their strategies through repeated interactions with their environment. One natural way for agents to iteratively refine their strategies is by best-responding. This idea can be applied in many forms, the simplest and earliest instance of which is fictitious play (FP) [Brown, 1951]. In this algorithm, the agent observes the strategies played by the opponent and then plays a best response to the average of the observed strategies. Fictitious play was shown to converge [Robinson, 1951], but its convergence rate can, in the worst case, scale quite poorly with the number of actions available to each player [Daskalakis and Pan, 2014]. It is then natural to ask for the best convergence guarantees obtainable for computing Nash equilibria in two-player zero-sum games, or coarse correlated equilibria in multiplayer games, when agents learn through a best-response oracle. In the online learning community, methods based only on best-response oracles are special cases of methods based on a linear minimization oracle (LMO), which can be queried for points that minimize a linear objective over the feasible set. Such methods are known as projection-free methods because they avoid potentially expensive projections onto the feasible set. Projection-free online learning algorithms may perform multiple LMO calls per iteration, so our paper and the related literature are concerned not only with the number of iterations $T$ of online learning but also with the total number of LMO calls, which we denote by $N$. Because LMOs for polyhedral decision sets essentially correspond to best-response oracles (BROs), we use these two terms interchangeably.
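As a concrete instance of learning through a best-response oracle, here is a minimal sketch of fictitious play on a zero-sum matrix game (our variable names; note that each BRO/LMO call returns a vertex of the simplex):

import numpy as np

def best_response(loss):
    # LMO over the simplex: the minimizing vertex is a pure best response.
    e = np.zeros(len(loss))
    e[np.argmin(loss)] = 1.0
    return e

def fictitious_play(A, T=10000):
    # Each player best-responds to the running average of the opponent's play.
    x_avg = np.full(A.shape[0], 1 / A.shape[0])
    y_avg = np.full(A.shape[1], 1 / A.shape[1])
    for t in range(1, T + 1):
        x = best_response(A @ y_avg)        # row player minimizes x^T A y
        y = best_response(-(A.T @ x_avg))   # column player maximizes it
        x_avg += (x - x_avg) / (t + 1)
        y_avg += (y - y_avg) / (t + 1)
    return x_avg, y_avg

# Matching pennies: the average strategies approach the uniform equilibrium.
print(fictitious_play(np.array([[1.0, -1.0], [-1.0, 1.0]])))

Here each iteration costs exactly one oracle call per player, so $N$ and $T$ coincide; the methods studied in the paper may instead spend several LMO calls per iteration in exchange for better guarantees.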
Block-Coordinate Methods and Restarting for Solving Extensive-Form Games
Chakrabarti, Darshan, Diakonikolas, Jelena, Kroer, Christian
Coordinate descent methods are popular in machine learning and optimization for their simple sparse updates and excellent practical performance. In the context of large-scale sequential game solving, these same properties would be attractive, but until now no such methods were known, because the strategy spaces do not satisfy the separable block structure typically exploited by such methods. We present the first cyclic coordinate-descent-like method for the polytope of sequence-form strategies, which forms the strategy space for each player in an extensive-form game (EFG). Our method exploits the recursive structure of the proximal update induced by what are known as dilated regularizers in order to allow for a pseudo block-wise update. We show that our method enjoys an $O(1/T)$ convergence rate to a two-player zero-sum Nash equilibrium, while avoiding the worst-case polynomial scaling with the number of blocks common to cyclic methods. We empirically show that our algorithm usually performs better than other state-of-the-art first-order methods (i.e., mirror prox), and can occasionally even beat CFR$^+$, a state-of-the-art algorithm for numerical equilibrium computation in zero-sum EFGs. We then introduce a restarting heuristic for EFG solving. We show empirically that restarting can lead to speedups, sometimes huge, both for our cyclic method and for existing methods such as mirror prox and predictive CFR$^+$.
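The recursive structure being exploited can be sketched as follows (our notation, with dilation weights omitted). A dilated regularizer has the form

\[
  \psi(x) \;=\; \sum_{j \in J} x_{p_j}\, \psi_j\!\left(\frac{x^{(j)}}{x_{p_j}}\right),
\]

where $J$ is the set of decision points, $x^{(j)}$ is the block of sequence-form variables at decision point $j$, and $p_j$ is its parent sequence. The proximal step $\min_{x} \langle g, x \rangle + \psi(x)$ then decomposes into one local simplex problem per decision point, solved bottom-up with each child's optimal value folded into its parent's linear term; it is this decomposition that permits the pseudo block-wise updates described above.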
Single-Leg Revenue Management with Advice
Balseiro, Santiago, Kroer, Christian, Kumar, Rachitesh
Single-leg revenue management is a foundational problem of revenue management that has been particularly impactful in the airline and hotel industries: given $n$ units of a resource, e.g. flight seats, and a stream of sequentially arriving customers segmented by fares, what is the optimal online policy for allocating the resource? Previous work has focused either on designing algorithms that rely on forecasts, which are not robust to inaccuracies in the forecast, or on online algorithms with worst-case performance guarantees, which can be too conservative in practice. In this work, we look at the single-leg revenue management problem through the lens of the algorithms-with-advice framework, which attempts to harness the increasing prediction accuracy of machine learning methods by optimally incorporating advice about the future into online algorithms. In particular, we characterize the Pareto frontier that captures the tradeoff between consistency (performance when the advice is accurate) and competitiveness (performance when the advice is inaccurate) for any given advice. Moreover, we provide an online algorithm that always achieves performance on this Pareto frontier. We also study the class of protection level policies, the most widely deployed technique for single-leg revenue management: we provide an algorithm that incorporates advice into protection levels and optimally trades off consistency and competitiveness. Moreover, we empirically evaluate the performance of these algorithms on synthetic data. We find that our algorithm for protection level policies performs remarkably well on most instances, even though it is not guaranteed to be on the Pareto frontier in theory. Our results extend to other unit-cost online allocation problems, such as display advertising and the multiple-secretary problem, as well as to more general variable-cost problems such as the online knapsack problem.
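For readers unfamiliar with the policy class, here is a minimal sketch of a two-fare protection-level policy (illustrative only; the paper's contribution is how to set such levels given advice):

def protection_level_policy(capacity, protection, arrivals, fares):
    # Sell to low-fare customers only while more than `protection` units remain,
    # reserving the protected units for potential high-fare customers.
    revenue, remaining = 0.0, capacity
    for fare_class in arrivals:              # e.g. a stream of 'high'/'low' labels
        if remaining == 0:
            break
        if fare_class == 'high' or remaining > protection:
            revenue += fares[fare_class]
            remaining -= 1
    return revenue

# Protect 3 of 10 seats for high-fare demand arriving after low-fare demand.
print(protection_level_policy(10, 3, ['low'] * 8 + ['high'] * 4,
                              {'low': 100.0, 'high': 250.0}))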
Online Resource Allocation under Horizon Uncertainty
Balseiro, Santiago, Kroer, Christian, Kumar, Rachitesh
We study stochastic online resource allocation: a decision maker needs to allocate limited resources to stochastically generated, sequentially arriving requests in order to maximize reward. At each time step, requests are drawn independently from a distribution that is unknown to the decision maker. Online resource allocation and its special cases have been studied extensively in the past, but prior results crucially and universally rely on the strong assumption that the total number of requests (the horizon) is known to the decision maker in advance. In many applications, such as revenue management and online advertising, the number of requests can vary widely because of fluctuations in demand or user-traffic intensity. In this work, we develop online algorithms that are robust to horizon uncertainty. In sharp contrast to the known-horizon setting, no algorithm can achieve even a constant asymptotic competitive ratio that is independent of the horizon uncertainty. We introduce a novel generalization of dual mirror descent that allows the decision maker to specify a schedule of time-varying target consumption rates, and we prove corresponding performance guarantees. We go on to give a fast algorithm for computing a schedule of target consumption rates that leads to near-optimal performance in the unknown-horizon setting. In particular, our competitive ratio attains the optimal rate of growth (up to logarithmic factors) as the horizon uncertainty grows large. Finally, we also provide a way to incorporate machine-learned predictions about the horizon that interpolates between the known- and unknown-horizon settings.
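As a rough sketch of the base algorithm being generalized, the following implements dual (sub)gradient descent for online allocation with a single resource (our simplification, with a Euclidean mirror map; rho is the schedule of target consumption rates that the paper shows how to choose):

import numpy as np

def dual_descent_allocation(requests, budget, rho, eta=0.05):
    # requests: sequence of (rewards, costs) arrays over a finite action set;
    # rho: per-step target consumption rates, possibly time-varying.
    lam, reward, remaining = 0.0, 0.0, budget
    for (r, c), rho_t in zip(requests, rho):
        scores = r - lam * c                        # dual-adjusted rewards
        a = int(np.argmax(scores))
        used = 0.0
        if scores[a] > 0 and c[a] <= remaining:     # act only if profitable and feasible
            reward += r[a]
            remaining -= c[a]
            used = c[a]
        lam = max(0.0, lam - eta * (rho_t - used))  # subgradient step on the dual price
    return reward

With a known horizon $T$ one would typically set the constant rate rho_t = budget / T; under horizon uncertainty, the paper's schedule deliberately deviates from any single constant rate.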
Online Learning under Budget and ROI Constraints and Applications to Bidding in Non-Truthful Auctions
Castiglioni, Matteo, Celli, Andrea, Kroer, Christian
We study online learning problems in which a decision maker has to make a sequence of costly decisions, with the goal of maximizing their expected reward while adhering to budget and return-on-investment (ROI) constraints. Previous work requires the decision maker to know beforehand some specific parameters related to the degree of strict feasibility of the offline problem. Moreover, when inputs are adversarial, it requires the existence of a strictly feasible solution to the offline optimization problem at each round. Both requirements are unrealistic for practical applications such as bidding in online ad auctions. We propose a best-of-both-worlds primal-dual framework which circumvents both assumptions by exploiting the notion of interval regret, providing guarantees under both stochastic and adversarial inputs. Our proof techniques can be applied to both input models with minimal modifications, thereby providing a unified perspective on the two problems. Finally, we show how to instantiate the framework to optimally bid in various mechanisms of practical relevance, such as first- and second-price auctions.
Regret Matching+: (In)Stability and Fast Convergence in Games
Farina, Gabriele, Grand-Clément, Julien, Kroer, Christian, Lee, Chung-Wei, Luo, Haipeng
Regret Matching+ (RM+) and its variants are important algorithms for solving large-scale games. However, a theoretical understanding of their practical success is still lacking. Moreover, recent advances on fast convergence in games are limited to no-regret algorithms, such as online mirror descent, that satisfy a stability property. In this paper, we first give counterexamples showing that RM+ and its predictive version can be unstable, which might cause other players to suffer large regret. We then provide two fixes: restarting, and chopping off the positive orthant that RM+ works in. We show that these fixes are sufficient to obtain $O(T^{1/4})$ individual regret and $O(1)$ social regret in normal-form games via RM+ with predictions. We also apply our stabilizing techniques to clairvoyant updates in the uncoupled learning setting for RM+ and prove desirable results akin to recent results for Clairvoyant online mirror descent. Our experiments show the advantages of our algorithms over vanilla RM+-based algorithms in matrix and extensive-form games.
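For reference, vanilla RM+ can be sketched as follows (our code; the paper's fixes, restarting and chopping, additionally modify the regret vector R between rounds):

import numpy as np

def rm_plus(utilities):
    # utilities: a stream of payoff vectors, one per round, for the learner's actions.
    R = None
    for u in utilities:
        if R is None:
            R = np.zeros(len(u))
        s = R.sum()
        x = R / s if s > 0 else np.full(len(u), 1 / len(u))  # play proportional to R
        R = np.maximum(R + (u - x @ u), 0.0)   # clip regrets to the positive orthant
        yield x

Note that the strategy is a normalization of R; when R is close to the origin, small updates can change x drastically, which gives some intuition for the instability discussed above.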
A Unified Approach to Reinforcement Learning, Quantal Response Equilibria, and Two-Player Zero-Sum Games
Sokota, Samuel, D'Orazio, Ryan, Kolter, J. Zico, Loizou, Nicolas, Lanctot, Marc, Mitliagkas, Ioannis, Brown, Noam, Kroer, Christian
This work studies an algorithm, which we call magnetic mirror descent, that is inspired by mirror descent and the non-Euclidean proximal gradient algorithm. Our contribution is demonstrating the virtues of magnetic mirror descent both as an equilibrium solver and as an approach to reinforcement learning in two-player zero-sum games. These virtues include: 1) being the first quantal response equilibria solver to achieve linear convergence for extensive-form games with first-order feedback; 2) being the first standard reinforcement learning algorithm to achieve empirically competitive results with CFR in tabular settings; 3) achieving favorable performance in 3x3 Dark Hex and Phantom Tic-Tac-Toe as a self-play deep reinforcement learning algorithm.
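Over a simplex with the KL divergence, one magnetic mirror descent step admits a closed form; a minimal sketch (our derivation of the standard KL proximal step, with m the "magnet", e.g. the uniform policy):

import numpy as np

def mmd_step(x, g, m, eta=0.1, alpha=0.05):
    # One step of argmin_x eta*<g, x> + eta*alpha*KL(x, m) + KL(x, x_t) on the simplex.
    c = 1.0 / (1.0 + eta * alpha)
    z = (x * m ** (eta * alpha)) ** c * np.exp(-eta * c * g)
    return z / z.sum()

# For a fixed loss vector g, the iterates approach x* proportional to m * exp(-g / alpha),
# a logit (quantal-response-style) response with temperature alpha.
x = np.full(3, 1 / 3)
m = np.full(3, 1 / 3)
for _ in range(1000):
    x = mmd_step(x, np.array([1.0, 0.0, -1.0]), m)
print(x)

The magnet term is what pulls the iterates toward m rather than toward an exact best response, which matches the algorithm's use as a quantal response equilibria solver above.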