Collaborating Authors

 Jeon, Hong Jun


Epinet for Content Cold Start

arXiv.org Artificial Intelligence

The exploding popularity of online content and its user base poses an ever more challenging matching problem for modern recommendation systems. Unlike other frontiers of machine learning such as natural language, recommendation systems are responsible for collecting their own data. Simply exploiting current knowledge can lead to pernicious feedback loops, but naive exploration can detract from user experience and lead to reduced engagement. This exploration-exploitation trade-off is exemplified in the classic multi-armed bandit problem, for which algorithms such as upper confidence bounds (UCB) and Thompson sampling (TS) demonstrate effective performance. However, there have been many challenges to scaling these approaches to settings which do not exhibit a conjugate prior structure. Recent scalable approaches to uncertainty quantification via epinets have enabled efficient approximations of Thompson sampling even when the learning model is a complex neural network. In this paper, we demonstrate the first application of epinets to an online recommendation system. Our experiments demonstrate improvements in both user traffic and engagement efficiency on the Facebook Reels online video platform.
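For readers unfamiliar with the conjugate-prior case this abstract contrasts against, the following is a minimal sketch of Thompson sampling on a Bernoulli bandit with a Beta prior. It is not the epinet method of the paper. The function name `thompson_sampling`, the Beta(1, 1) prior, the arm means, and the horizon are all assumptions chosen for illustration.

```python
import random


def thompson_sampling(true_means, horizon, seed=0):
    """Run Beta-Bernoulli Thompson sampling; return pull counts per arm."""
    rng = random.Random(seed)
    n_arms = len(true_means)
    # Beta(1, 1) prior on each arm's success probability (an assumption).
    successes = [1] * n_arms
    failures = [1] * n_arms
    pulls = [0] * n_arms
    for _ in range(horizon):
        # Sample one plausible mean per arm from its current posterior.
        samples = [rng.betavariate(successes[a], failures[a]) for a in range(n_arms)]
        # Act greedily with respect to the sampled means.
        arm = max(range(n_arms), key=lambda a: samples[a])
        reward = 1 if rng.random() < true_means[arm] else 0
        # Conjugate update: the posterior remains a Beta distribution.
        successes[arm] += reward
        failures[arm] += 1 - reward
        pulls[arm] += 1
    return pulls


# Hypothetical three-armed bandit; the last arm is the best one.
pulls = thompson_sampling([0.2, 0.5, 0.8], horizon=2000)
print(pulls)
```

The update step is cheap only because the Beta posterior has a closed form. With a neural network model no such closed form exists, which is the gap that approximate posterior sampling is meant to fill.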


Aligning AI Agents via Information-Directed Sampling

arXiv.org Artificial Intelligence

The staggering feats of AI systems have brought to attention the topic of AI Alignment: aligning a "superintelligent" AI agent's actions with humanity's interests. Many existing frameworks/algorithms in alignment study the problem on a myopic horizon or study learning from human feedback in isolation, relying on the contrived assumption that the agent has already perfectly identified the environment. As a starting point to address these limitations, we define a class of bandit alignment problems as an extension of classic multi-armed bandit problems. A bandit alignment problem involves an agent tasked with maximizing long-run expected reward by interacting with an environment and a human, both involving details/preferences initially unknown to the agent. The reward of actions in the environment depends on both observed outcomes and human preferences. Furthermore, costs are associated with querying the human to learn preferences. Therefore, an effective agent ought to intelligently trade off exploration (of the environment and human) and exploitation. We study these trade-offs theoretically and empirically in a toy bandit alignment problem which resembles the beta-Bernoulli bandit. We demonstrate that, while naive exploration algorithms which reflect current practices and even touted algorithms such as Thompson sampling both fail to provide acceptable solutions to this problem, information-directed sampling achieves favorable regret.


Information-Theoretic Foundations for Machine Learning

arXiv.org Machine Learning

The staggering progress of machine learning in the past decade has been a sight to behold. In retrospect, it is both remarkable and unsettling that these milestones were achievable with little to no rigorous theory to guide experimentation. Despite this fact, practitioners have been able to guide their future experimentation via observations from previous large-scale empirical investigations. However, alluding to Plato's Allegory of the cave, it is likely that the observations which form the field's notion of reality are but shadows representing fragments of that reality. In this work, we propose a theoretical framework which attempts to answer what exists outside of the cave. To the theorist, we provide a framework which is mathematically rigorous and leaves open many interesting ideas for future exploration. To the practitioner, we provide a framework whose results are very intuitive, general, and which will help form principles to guide future investigations. Concretely, we provide a theoretical framework rooted in Bayesian statistics and Shannon's information theory which is general enough to unify the analysis of many phenomena in machine learning. Our framework characterizes the performance of an optimal Bayesian learner, which considers the fundamental limits of information. Throughout this work, we derive very general theoretical results and apply them to derive insights specific to settings ranging from data which is independently and identically distributed under an unknown distribution, to data which is sequential, to data which exhibits hierarchical structure amenable to meta-learning. We conclude with a section dedicated to characterizing the performance of misspecified algorithms. These results are exciting and particularly relevant as we strive to overcome increasingly difficult machine learning challenges in this endlessly complex world.


Information-Theoretic Foundations for Neural Scaling Laws

arXiv.org Artificial Intelligence

In recent years, foundation models have grown immensely, with some embodying trillions of trainable parameters. While larger models have in general produced better results, they also require much more compute to train. It has become impractical to perform hyperparameter sweeps at the scale of these modern models. This has required bypassing the practice of tuning hyperparameters via extensive trial and error, as was previously common in deep learning. Among other things, hyperparameters control 1) the size, measured in terms of the parameter count p, of the neural network model and 2) the number T of training tokens. If each parameter is adjusted in response to each token, then the computational requirements of training scale with the product of these two quantities. For any compute budget C, one should carefully balance between p and T. Too few training tokens leads to model estimation error, while too few parameters gives rise to misspecification error. As evaluating performance across multiple choices of p and T becomes computationally prohibitive at scale, alternative kinds of analysis are required to guide allocation of computational resources. Kaplan et al. [2020] and Hoffmann et al. [2022] have proposed the following procedure for allocating a large compute budget: 1) Evaluate test errors of models produced using various small compute budgets C with many different allocations to parameters p versus training tokens T. 2) Extrapolate to estimate the relation between p and T for large C. To give a sense of scales involved here, Hoffmann et al. [2022] evaluate test errors across "small" models for which p T ranges from around 10
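The balance between p and T under a fixed budget can be sketched in a few lines. The example below assumes the budget is spent exactly as C = p * T, as the abstract describes, and searches a small grid of parameter counts. The helper `best_allocation` and the error model `toy_error` are hypothetical; in particular, `toy_error` is an illustrative stand-in and not an error bound taken from the paper.

```python
def best_allocation(compute_budget, candidate_params, error_fn):
    """Return the (p, T) pair with p * T == compute_budget that minimizes error_fn."""
    best = None
    for p in candidate_params:
        tokens = compute_budget / p  # spend the whole budget: C = p * T
        err = error_fn(p, tokens)
        if best is None or err < best[0]:
            best = (err, p, tokens)
    return best[1], best[2]


# Hypothetical error model: one term shrinks with p (misspecification),
# the other shrinks with T (estimation). Not taken from the paper.
def toy_error(p, tokens):
    return 1.0 / p + 1.0 / tokens


p_star, t_star = best_allocation(10_000, [10, 100, 1000], toy_error)
print(p_star, t_star)
```

Under this toy model, both extremes of the grid are penalized: a tiny model wastes tokens and a huge model starves for them, so the balanced allocation wins.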


Adaptive Crowdsourcing Via Self-Supervised Learning

arXiv.org Artificial Intelligence

Common crowdsourcing systems average estimates of a latent quantity of interest provided by many crowdworkers to produce a group estimate. We develop a new approach -- predict-each-worker -- that leverages self-supervised learning and a novel aggregation scheme. This approach adapts weights assigned to crowdworkers based on estimates they provided for previous quantities. When skills vary across crowdworkers or their estimates correlate, the weighted sum offers a more accurate group estimate than the average. Existing algorithms such as expectation maximization can, at least in principle, produce similarly accurate group estimates. However, their computational requirements become onerous when complex models, such as neural networks, are required to express relationships among crowdworkers. Predict-each-worker accommodates such complexity as well as many other practical challenges. We analyze the efficacy of predict-each-worker through theoretical and computational studies. Among other things, we establish asymptotic optimality as the number of engagements per crowdworker grows.
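To illustrate why a weighted sum can beat a plain average when skills vary, here is a minimal sketch using classical inverse-variance weighting. This is not the predict-each-worker algorithm. It assumes each crowdworker's estimate is unbiased, independent of the others, and has a known noise variance; the variances and function names below are hypothetical.

```python
def inverse_variance_weights(variances):
    """Weights proportional to 1 / variance, normalized to sum to one."""
    precisions = [1.0 / v for v in variances]
    total = sum(precisions)
    return [p / total for p in precisions]


def estimate_variance(weights, variances):
    """Variance of a weighted sum of independent, unbiased estimates."""
    return sum(w * w * v for w, v in zip(weights, variances))


# Hypothetical noise variances for three crowdworkers of differing skill.
variances = [1.0, 4.0, 16.0]
n = len(variances)

uniform = [1.0 / n] * n
weighted = inverse_variance_weights(variances)

var_average = estimate_variance(uniform, variances)
var_weighted = estimate_variance(weighted, variances)
print(var_average, var_weighted)
```

The weighted estimate has lower variance because it leans on the most reliable worker. Handling correlated estimates, which the abstract also mentions, would require the full covariance between workers and is outside this sketch.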


An Information-Theoretic Analysis of In-Context Learning

arXiv.org Artificial Intelligence

Previous theoretical results pertaining to meta-learning on sequences build on contrived assumptions and are somewhat convoluted. We introduce new information-theoretic tools that lead to an elegant and very general decomposition of error into three components: irreducible error, meta-learning error, and intra-task error. These tools unify analyses across many meta-learning challenges. To illustrate, we apply them to establish new results about in-context learning with transformers. Our theoretical results characterize how error decays in both the number of training sequences and sequence lengths. Our results are very general; for example, they avoid contrived mixing time assumptions made by all prior results that establish decay of error with sequence length.


An Information-Theoretic Analysis of Compute-Optimal Neural Scaling Laws

arXiv.org Artificial Intelligence

We study the compute-optimal trade-off between model and training data set sizes for large neural networks. Our result suggests a linear relation similar to that supported by the empirical analysis of Chinchilla (Hoffmann et al., 2022). While that work studies transformer-based large language models trained on the MassiveText corpus (Rae et al., 2021), as a starting point for development of a mathematical theory, we focus on a simpler learning model and data generating process, each based on a neural network with a sigmoidal output unit and single hidden layer of ReLU activation units. We introduce general error upper bounds for a class of algorithms which incrementally update a statistic (for example gradient descent). For a particular learning model inspired by Barron (1993), we establish an upper bound on the minimal information-theoretically achievable expected error as a function of model and data set sizes. We then derive allocations of computation that minimize this bound. We present empirical results which suggest that this approximation correctly identifies an asymptotic linear compute-optimal scaling. This approximation also generates new insights. Among other things, it suggests that, as the input dimension or latent space complexity grows, as might be the case for example if a longer history of tokens is taken as input to a language model, a larger fraction of the compute budget should be allocated to growing the learning model rather than training data.


Continual Learning as Computationally Constrained Reinforcement Learning

arXiv.org Artificial Intelligence

An agent that efficiently accumulates knowledge to develop increasingly sophisticated skills over a long lifetime could advance the frontier of artificial intelligence capabilities. The design of such agents, which remains a long-standing challenge of artificial intelligence, is addressed by the subject of continual learning. This monograph clarifies and formalizes concepts of continual learning, introducing a framework and set of tools to stimulate further research.


An Information-Theoretic Framework for Supervised Learning

arXiv.org Artificial Intelligence

Each year, deep learning demonstrates new and improved empirical results with deeper and wider neural networks. Meanwhile, with existing theoretical frameworks, it is difficult to analyze networks deeper than two layers without resorting to counting parameters or encountering sample complexity bounds that are exponential in depth. Perhaps it may be fruitful to try to analyze modern machine learning under a different lens. In this paper, we propose a novel information-theoretic framework with its own notions of regret and sample complexity for analyzing the data requirements of machine learning. With our framework, we first work through some classical examples such as scalar estimation and linear regression to build intuition and introduce general techniques. Then, we use the framework to study the sample complexity of learning from data generated by deep neural networks with ReLU activation units. For a particular prior distribution on weights, we establish sample complexity bounds that are simultaneously width independent and linear in depth. This prior distribution gives rise to high-dimensional latent representations that, with high probability, admit reasonably accurate low-dimensional approximations. We conclude by corroborating our theoretical results with experimental analysis of random single-hidden-layer neural networks.


Is Stochastic Gradient Descent Near Optimal?

arXiv.org Artificial Intelligence

The success of neural networks over the past decade has established them as effective models for many relevant data generating processes. Statistical theory on neural networks indicates graceful scaling of sample complexity. For example, Jeon & Van Roy (arXiv:2203.00246) demonstrate that, when data is generated by a ReLU teacher network with $W$ parameters, an optimal learner needs only $\tilde{O}(W/\epsilon)$ samples to attain expected error $\epsilon$. However, existing computational theory suggests that, even for single-hidden-layer teacher networks, to attain small error for all such teacher networks, the computation required to achieve this sample complexity is intractable. In this work, we fit single-hidden-layer neural networks to data generated by single-hidden-layer ReLU teacher networks with parameters drawn from a natural distribution. We demonstrate that stochastic gradient descent (SGD) with automated width selection attains small expected error with a number of samples and total number of queries both nearly linear in the input dimension and width. This suggests that SGD nearly achieves the information-theoretic sample complexity bounds of Jeon & Van Roy (arXiv:2203.00246) in a computationally efficient manner. An important difference between our positive empirical results and the negative theoretical results is that the latter address worst-case error of deterministic algorithms, while our analysis centers on expected error of a stochastic algorithm.