
Collaborating Authors: Jordan, Michael I.


Adaptive, Doubly Optimal No-Regret Learning in Strongly Monotone and Exp-Concave Games with Gradient Feedback

arXiv.org Artificial Intelligence

Online gradient descent (OGD) is well known to be doubly optimal under strong convexity or monotonicity assumptions: (1) in the single-agent setting, it achieves an optimal regret of $\Theta(\log T)$ for strongly convex cost functions; and (2) in the multi-agent setting of strongly monotone games, with each agent employing OGD, we obtain last-iterate convergence of the joint action to a unique Nash equilibrium at an optimal rate of $\Theta(\frac{1}{T})$. While these finite-time guarantees highlight its merits, OGD has the drawback that it requires knowing the strong convexity/monotonicity parameters. In this paper, we design a fully adaptive OGD algorithm, \textsf{AdaOGD}, that does not require a priori knowledge of these parameters. In the single-agent setting, our algorithm achieves $O(\log^2(T))$ regret under strong convexity, which is optimal up to a log factor. Further, if each agent employs \textsf{AdaOGD} in strongly monotone games, the joint action converges in a last-iterate sense to a unique Nash equilibrium at a rate of $O(\frac{\log^3 T}{T})$, again optimal up to log factors. We illustrate our algorithms in a learning version of the classical newsvendor problem, where due to lost sales, only (noisy) gradient feedback can be observed. Our results immediately yield the first feasible and near-optimal algorithm for both the single-retailer and multi-retailer settings. We also extend our results to the more general setting of exp-concave cost functions and games, using the online Newton step (ONS) algorithm.
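To make the contrast concrete, here is a minimal sketch in Python: classical OGD needs the strong-convexity parameter $\mu$ for its $1/(\mu t)$ step size, whereas a parameter-free variant can adapt the step size to the observed gradients. The adaptive schedule below is AdaGrad-style and purely illustrative; it is not the exact AdaOGD schedule from the paper.

```python
import numpy as np

def ogd_strongly_convex(grad, x0, T, mu):
    """Classical OGD for a mu-strongly convex loss: step size 1/(mu * t).
    Achieving the Theta(log T) regret this way requires knowing mu."""
    x = np.asarray(x0, dtype=float)
    for t in range(1, T + 1):
        x = x - grad(x) / (mu * t)
    return x

def adaptive_ogd_sketch(grad, x0, T):
    """Parameter-free sketch: the step size shrinks with the accumulated
    squared gradient norms, so no strong-convexity constant is needed.
    (AdaGrad-style; illustrative, not the paper's AdaOGD schedule.)"""
    x = np.asarray(x0, dtype=float)
    g_sq = 1e-12  # running sum of squared gradient norms
    for _ in range(T):
        g = grad(x)
        g_sq += float(g @ g)
        x = x - g / np.sqrt(g_sq)
    return x
```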


Prediction-Powered Inference

arXiv.org Machine Learning

Prediction-powered inference is a framework for performing valid statistical inference when an experimental dataset is supplemented with predictions from a machine-learning system. The framework yields simple algorithms for computing provably valid confidence intervals for quantities such as means, quantiles, and linear and logistic regression coefficients, without making any assumptions on the machine-learning algorithm that supplies the predictions. Furthermore, more accurate predictions translate to smaller confidence intervals. Prediction-powered inference could enable researchers to draw valid and more data-efficient conclusions using machine learning. The benefits of prediction-powered inference are demonstrated with datasets from proteomics, astronomy, genomics, remote sensing, census analysis, and ecology.
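For the simplest case, estimating a mean, the construction admits a short sketch: use model predictions on the large unlabeled set, then correct their bias with a "rectifier" measured on the small labeled set. The helper below is a minimal implementation of that idea using a normal approximation and numpy arrays as inputs; it is not the authors' reference code.

```python
import numpy as np
from scipy import stats

def ppi_mean_ci(y, yhat_labeled, yhat_unlabeled, alpha=0.05):
    """Prediction-powered confidence interval for a mean.

    y:              gold-standard labels on the small labeled set (size n)
    yhat_labeled:   model predictions on that same labeled set
    yhat_unlabeled: model predictions on the large unlabeled set (size N)
    """
    n, N = len(y), len(yhat_unlabeled)
    rectifier = yhat_labeled - y  # measured bias of the predictions
    theta = yhat_unlabeled.mean() - rectifier.mean()
    se = np.sqrt(yhat_unlabeled.var(ddof=1) / N + rectifier.var(ddof=1) / n)
    z = stats.norm.ppf(1 - alpha / 2)
    return theta - z * se, theta + z * se
```

Note how more accurate predictions shrink the rectifier's variance, and with it the interval, matching the paper's qualitative message.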


Fine-Tuning Language Models with Advantage-Induced Policy Alignment

arXiv.org Artificial Intelligence

Reinforcement learning from human feedback (RLHF, or preference-based reinforcement learning) (Knox and Stone, 2008; Wirth et al., 2017) has delivered significant empirical successes in several fields, including games (Christiano et al., 2017), robotics (Sadigh et al., 2017; Kupcsik et al., 2018), and recommendation systems (Maghakian et al., 2022). Recently, RLHF has also exhibited striking potential for integrating human knowledge with large language models (Ziegler et al., 2019; Ouyang et al., 2022; OpenAI, 2023; Beeching et al., 2023; Zhu et al., 2023; Bai et al., 2022b). To employ RLHF in the training pipeline of language models, a common protocol proceeds in three stages. (1) Pre-training (PT): training the language model on a large amount of unlabeled or weakly labeled text data to produce general features and patterns that are useful for downstream tasks (Vaswani et al., 2017; Devlin et al., 2018; Brown et al., 2020). (2) Supervised fine-tuning (SFT): training the model on a smaller amount of curated data to improve its performance and accuracy on specific tasks. (3) RLHF: using a human-labeled dataset together with reinforcement learning (RL) algorithms to further align the model with complex and subjective human values or preferences (Ziegler et al., 2019; Ouyang et al., 2022). Both PT and SFT rely on distributional loss functions, such as cross-entropy, that minimize the distance between the text distributions of the training dataset and of the model output (Vaswani et al., 2017; Devlin et al., 2018; Brown et al., 2020). Such a simple strategy is not viable, however, for the RLHF stage.
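As a point of reference for why the RLHF stage differs, here is a generic KL-regularized policy-gradient surrogate of the kind commonly used in this stage. This is a hedged sketch of the standard objective with hypothetical argument names, not the APA loss introduced in this paper: the model is rewarded by a learned preference model while being penalized for drifting from the SFT reference.

```python
import torch

def kl_regularized_rlhf_loss(logp_policy, logp_ref, reward, beta=0.1):
    """REINFORCE-style surrogate for the KL-regularized RLHF objective
    (generic form; illustrative, not this paper's APA loss).

    logp_policy: log-probs of sampled responses under the current policy
    logp_ref:    log-probs of the same responses under the frozen SFT model
    reward:      scalar rewards from the learned preference/reward model
    """
    kl = logp_policy - logp_ref               # per-sample KL estimate
    advantage = (reward - beta * kl).detach()  # reward minus KL penalty
    return -(logp_policy * advantage).mean()   # minimize the negative surrogate
```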


Class-Conditional Conformal Prediction with Many Classes

arXiv.org Machine Learning

Standard conformal prediction methods provide a marginal coverage guarantee: for a random test point, the conformal prediction set contains the true label with a user-specified probability. In many classification problems, we would like a stronger guarantee: that for test points of a specific class, the prediction set contains the true label with that same user-chosen probability. For this latter goal, existing conformal prediction methods do not work well when there is a limited amount of labeled data per class, as is often the case in real applications where the number of classes is large. We propose a method called clustered conformal prediction, which clusters together classes having "similar" conformal scores and performs conformal prediction at the cluster level. Based on empirical evaluation across four image datasets with many (up to 1000) classes, we find that clustered conformal prediction typically outperforms existing methods in terms of class-conditional coverage and set-size metrics.
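A sketch of the calibration step, as one reading of the method: embed each class by a few quantiles of its conformal scores, cluster those embeddings, then compute one conformal quantile per cluster. The details below (quantile choices, clustering setup) are illustrative rather than the authors' reference implementation.

```python
import numpy as np
from sklearn.cluster import KMeans

def clustered_conformal_calibrate(scores, labels, n_clusters, alpha=0.1):
    """scores: nonconformity score of the true label per calibration point
    labels: the class of each calibration point"""
    classes = np.unique(labels)
    # Represent each class by quantiles of its score distribution
    embed = np.array([np.quantile(scores[labels == c], [0.5, 0.7, 0.9])
                      for c in classes])
    cluster_of = KMeans(n_clusters=n_clusters, n_init=10).fit_predict(embed)
    qhat = {}
    for k in range(n_clusters):
        s = np.sort(scores[np.isin(labels, classes[cluster_of == k])])
        m = len(s)
        idx = min(int(np.ceil((m + 1) * (1 - alpha))) - 1, m - 1)
        qhat[k] = s[idx]  # conformal quantile shared by the whole cluster
    return dict(zip(classes, cluster_of)), qhat
```

At test time, a class enters the prediction set whenever its score falls below its cluster's quantile, so calibration data is pooled across similar classes.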


Conformal Decision Theory: Safe Autonomous Decisions from Imperfect Predictions

arXiv.org Machine Learning

We introduce Conformal Decision Theory, a framework for producing safe autonomous decisions despite imperfect machine learning predictions. Examples of such decisions are ubiquitous, from robot planning algorithms that rely on pedestrian predictions, to calibrating autonomous manufacturing for high throughput and low error, to the choice of trusting a nominal policy versus switching to a safe backup policy at run-time. The decisions produced by our algorithms are safe in the sense that they come with provable statistical guarantees of having low risk without any assumptions on the world model whatsoever; the observations need not be i.i.d. and can even be adversarial. The theory extends results from conformal prediction to calibrate decisions directly, without requiring the construction of prediction sets. Experiments demonstrate the utility of our approach in robot motion planning around humans, automated stock trading, and robot manufacturing.
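The calibration-of-decisions idea can be illustrated with an online update in the style of adaptive conformal inference. This is a sketch of the mechanism under simplifying assumptions (bounded losses, a single scalar conservativeness knob, hypothetical callback names), not the paper's exact controller.

```python
def conformal_controller(decide, observe_loss, steps, eps=0.05, eta=0.01):
    """lam tunes how conservative decisions are; the update steers the
    long-run average loss toward the target eps with no i.i.d. assumption."""
    lam, total = 0.0, 0.0
    for _ in range(steps):
        action = decide(lam)         # decisions grow more cautious as lam rises
        loss = observe_loss(action)  # realized loss, assumed to lie in [0, 1]
        total += loss
        lam += eta * (loss - eps)    # raise lam when loss exceeds the target
    return lam, total / steps
```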


A Gentle Introduction to Gradient-Based Optimization and Variational Inequalities for Machine Learning

arXiv.org Machine Learning

The rapid progress in machine learning in recent years has been built on a highly productive connection to gradient-based optimization. Further progress hinges in part on a shift in focus from pattern recognition to decision-making and multi-agent problems. In these broader settings, new mathematical challenges emerge that involve equilibria and game theory rather than optima. Gradient-based methods remain essential, given the high dimensionality and large scale of machine-learning problems, but simple gradient descent is no longer the point of departure for algorithm design. We provide a gentle introduction to a broader framework for gradient-based algorithms in machine learning, beginning with saddle points and monotone games, and proceeding to general variational inequalities. While we provide convergence proofs for several of the algorithms we present, our main focus is on providing motivation and intuition.
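The canonical example of why plain gradient descent fails in these settings is the bilinear saddle problem $\min_x \max_y xy$: simultaneous gradient descent-ascent cycles around the equilibrium, while the extragradient method's look-ahead step converges. A minimal demonstration:

```python
import numpy as np

def extragradient_bilinear(T=500, eta=0.1):
    """Extragradient on min_x max_y x*y, whose monotone operator is
    F(x, y) = (y, -x); the unique equilibrium is (0, 0)."""
    z = np.array([1.0, 1.0])
    F = lambda w: np.array([w[1], -w[0]])
    for _ in range(T):
        z_half = z - eta * F(z)    # exploratory half step
        z = z - eta * F(z_half)    # corrected step at the look-ahead point
    return z  # spirals inward to (0, 0); plain GDA with the same eta spirals out
```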


Explicit Second-Order Min-Max Optimization Methods with Optimal Convergence Guarantee

arXiv.org Artificial Intelligence

We propose and analyze exact and inexact regularized Newton-type methods for finding a global saddle point of \emph{convex-concave} unconstrained min-max optimization problems. Compared with the first-order setting, our understanding of second-order methods for min-max optimization is relatively limited, as obtaining global rates of convergence with second-order information is much more involved. In this paper, we examine how second-order information can be used to speed up extra-gradient methods, even under inexactness. Specifically, we show that the proposed algorithms generate iterates that remain within a bounded set and that the averaged iterates converge to an $\epsilon$-saddle point within $O(\epsilon^{-2/3})$ iterations in terms of a restricted gap function. Our algorithms match the theoretically established lower bound in this context, and our analysis provides a simple and intuitive convergence argument for second-order methods without any boundedness requirements. Finally, we present a series of numerical experiments on synthetic and real data that demonstrate the efficiency of the proposed algorithms.
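At a very high level, the template replaces the extragradient look-ahead with a regularized Newton solve. The single step below is only a schematic reading of that template; the regularizer and step-size couplings in the paper are more delicate than the placeholders used here.

```python
import numpy as np

def newton_extragradient_step(F, JF, z, rho=1.0):
    """One schematic second-order extragradient step (illustrative only).

    F:   the saddle operator (grad_x f, -grad_y f) evaluated at a point
    JF:  its Jacobian at a point
    rho: stands in for a Hessian-Lipschitz constant (placeholder)
    """
    g, J = F(z), JF(z)
    lam = rho * np.linalg.norm(g) ** 0.5           # placeholder regularizer
    d = np.linalg.solve(J + lam * np.eye(len(z)), -g)
    z_half = z + d                                  # regularized Newton look-ahead
    return z - (1.0 / lam) * F(z_half)              # corrected extragradient step
```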


Delegating Data Collection in Decentralized Machine Learning

arXiv.org Machine Learning

The design of machine learning pipelines is increasingly a cooperative, distributed endeavor, in which the expertise needed for the design of various components of an overall pipeline is spread across many stakeholders. Such expertise pertains in part to classical design choices such as how much and what kind of data to use for training, how much test data to use for verification, how to train a model, and how to tune hyper-parameters, but, more broadly, expertise may reflect experience, access to certain resources, or knowledge of local conditions. To the extent that there is a central designer, their role may in large part be that of setting requirements, developing coordination mechanisms, and providing incentives. Overall, we are seeing a flourishing new industry at the intersection of ML and operations which makes use of specialization and decentralization to achieve high performance and operational efficiency. Such an ML ecosystem creates a need for new design tools and insights that are not focused merely on how the designer could perform a task in this pipeline, but rather how she should delegate it to agents who are willing and capable of performing the task on her behalf.


On Optimal Caching and Model Multiplexing for Large Model Inference

arXiv.org Artificial Intelligence

The recent progress of large language models (LLMs) comes at a cost of increased resource consumption and latency during both training and inference, presenting challenges not only in real-world deployment but also in terms of environmental impact and energy usage (Sharir et al., 2020; Patterson et al., 2021; Bommasani et al., 2022). For instance, LLM-based chatbots typically consist of large transformer-based networks with parameter counts ranging from one to several hundred billion (Zhou et al., 2023). Moreover, the auto-regressive nature of LLMs exacerbates latency and resource consumption because the model can generate only one token at a time. Thus, compared to traditional AI-powered services, language-model inference costs are much higher and latency is significantly longer, making it nearly impossible to process each query with an LLM in high-throughput query systems such as search engines. In this paper, we explore two simple yet effective strategies to mitigate this problem: (1) employing a caching system to store previous queries, and (2) developing a model multiplexer to choose the most appropriate model from a set of models for processing each query. The general workflow of our proposed LLM-based inference system is shown in Figure 1: upon receiving a query or prompt, we first check whether it can be retrieved from the cache. If the query is not found in the cache, we employ the model multiplexer to decide which model should process it, based on the estimated cost of each model. The choice of cost function and models can vary based on the goal; one measure of cost, for example, could be floating-point operations (FLOPs).
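The workflow of Figure 1 fits in a few lines. The sketch below uses hypothetical names (cache, cost_est, and two stand-in models) and shows only the control flow, not the paper's learned caching and multiplexing policies.

```python
def serve(query, cache, small_model, large_model, cost_est):
    """Cache-first inference with a model multiplexer (illustrative sketch)."""
    if query in cache:               # cache hit: skip model inference entirely
        return cache[query]
    # Multiplexer: route to whichever model has lower estimated cost
    # (e.g., FLOPs, or a quality-adjusted cost) for this query.
    if cost_est(query, "small") <= cost_est(query, "large"):
        answer = small_model(query)
    else:
        answer = large_model(query)
    cache[query] = answer            # populate the cache for future queries
    return answer
```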


Nesterov Meets Optimism: Rate-Optimal Separable Minimax Optimization

arXiv.org Artificial Intelligence

The main idea of our algorithm, AG-OG, is to carefully leverage the structure of the minimax problem, performing Nesterov acceleration on the individual components and an optimistic gradient step on the coupling component. Equipped with proper restarting, we show that AG-OG achieves the optimal convergence rate (up to a constant) in a variety of settings, including bilinearly coupled strongly convex-strongly concave minimax optimization (bi-SC-SC), bilinearly coupled convex-strongly concave minimax optimization (bi-C-SC), and bilinear games. We also extend our algorithm to the stochastic setting, where it achieves the optimal convergence rate in both the bi-SC-SC and bi-C-SC settings. AG-OG is the first single-call algorithm with optimal convergence rates in both deterministic and stochastic settings for bilinearly coupled minimax optimization problems.
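The optimism ingredient can be isolated in a few lines. Below is the standard single-call optimistic (past-extragradient) update on its own; the full AG-OG method additionally applies Nesterov acceleration to the separable components and a restarting schedule, which this sketch omits.

```python
import numpy as np

def optimistic_gradient(F, z0, T=500, eta=0.1):
    """Single-call optimistic update: extrapolate with the previous operator
    evaluation, so each iteration evaluates F only once."""
    z = np.asarray(z0, dtype=float)
    g_prev = F(z)
    for _ in range(T):
        g = F(z)
        z = z - eta * (2.0 * g - g_prev)  # optimistic extrapolation
        g_prev = g
    return z
```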