AITopics | Jeon, Hong Jun

Collaborating Authors

Jeon, Hong Jun

Epinet for Content Cold Start

Jeon, Hong Jun, Liu, Songbin, Li, Yuantong, Lyu, Jie, Song, Hunter, Liu, Ji, Wu, Peng, Zhu, Zheqing

arXiv.org Artificial Intelligence, Nov-20-2024

The exploding popularity of online content and its user base poses an ever more challenging matching problem for modern recommendation systems. Unlike other frontiers of machine learning such as natural language, recommendation systems are responsible for collecting their own data. Simply exploiting current knowledge can lead to pernicious feedback loops, but naive exploration can detract from user experience and lead to reduced engagement. This exploration-exploitation trade-off is exemplified in the classic multi-armed bandit problem, for which algorithms such as upper confidence bounds (UCB) and Thompson sampling (TS) demonstrate effective performance. However, there have been many challenges to scaling these approaches to settings which do not exhibit a conjugate prior structure. Recent scalable approaches to uncertainty quantification via epinets have enabled efficient approximations of Thompson sampling even when the learning model is a complex neural network. In this paper, we demonstrate the first application of epinets to an online recommendation system. Our experiments demonstrate improvements in both user traffic and engagement efficiency on the Facebook Reels online video platform.
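To make the exploration-exploitation trade-off mentioned in this abstract concrete, here is a minimal sketch of Thompson sampling in the conjugate Beta-Bernoulli setting, the classic case the abstract contrasts with neural-network models. It is not the epinet method from the paper. The function name `thompson_sampling`, the arm means, and the uniform Beta(1, 1) prior are all assumptions chosen for illustration.

```python
import random


def thompson_sampling(true_means, horizon, seed=0):
    """Run Beta-Bernoulli Thompson sampling on a simulated bandit.

    true_means: hypothetical success probability of each arm (illustrative).
    Returns (pull counts per arm, total reward collected).
    """
    rng = random.Random(seed)
    k = len(true_means)
    alpha = [1.0] * k  # Beta(1, 1) prior: uniform over each arm's mean
    beta = [1.0] * k
    pulls = [0] * k
    total = 0
    for _ in range(horizon):
        # Sample one plausible mean per arm from its posterior.
        samples = [rng.betavariate(alpha[i], beta[i]) for i in range(k)]
        # Act greedily with respect to the sampled means.
        arm = max(range(k), key=lambda i: samples[i])
        reward = 1 if rng.random() < true_means[arm] else 0
        # Conjugate posterior update for a Bernoulli observation.
        alpha[arm] += reward
        beta[arm] += 1 - reward
        pulls[arm] += 1
        total += reward
    return pulls, total


if __name__ == "__main__":
    counts, reward = thompson_sampling([0.1, 0.5, 0.9], horizon=2000)
    print("pulls per arm:", counts, "total reward:", reward)
```

The posterior update here is a two-line counter increment only because the Beta prior is conjugate to the Bernoulli likelihood; that convenience is what is lost when the reward model is a neural network.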

artificial intelligence, data mining, machine learning, (19 more...)

2412.04484

Country:

Europe (0.93)
North America > United States > New York > New York County > New York City (0.15)
North America > United States > California > Santa Clara County (0.14)

Genre: Research Report > Experimental Study (0.68)

Technology:

Information Technology > Data Science > Data Mining > Big Data (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning > Personal Assistant Systems (0.92)

Aligning AI Agents via Information-Directed Sampling

Jeon, Hong Jun, Van Roy, Benjamin

arXiv.org Artificial Intelligence, Oct-18-2024

The staggering feats of AI systems have brought to attention the topic of AI Alignment: aligning a "superintelligent" AI agent's actions with humanity's interests. Many existing frameworks/algorithms in alignment study the problem on a myopic horizon or study learning from human feedback in isolation, relying on the contrived assumption that the agent has already perfectly identified the environment. As a starting point to address these limitations, we define a class of bandit alignment problems as an extension of classic multi-armed bandit problems. A bandit alignment problem involves an agent tasked with maximizing long-run expected reward by interacting with an environment and a human, both involving details/preferences initially unknown to the agent. The reward of actions in the environment depends on both observed outcomes and human preferences. Furthermore, costs are associated with querying the human to learn preferences. Therefore, an effective agent ought to intelligently trade off exploration (of the environment and human) and exploitation. We study these trade-offs theoretically and empirically in a toy bandit alignment problem which resembles the beta-Bernoulli bandit. We demonstrate that, while naive exploration algorithms which reflect current practices and even touted algorithms such as Thompson sampling both fail to provide acceptable solutions to this problem, information-directed sampling achieves favorable regret.

artificial intelligence, data mining, machine learning, (19 more...)

2410.14807

Genre: Research Report (0.64)

Technology:

Information Technology > Data Science > Data Mining > Big Data (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Machine Learning (1.00)

Information-Theoretic Foundations for Machine Learning

Jeon, Hong Jun, Van Roy, Benjamin

arXiv.org Machine Learning, Jul-16-2024

The staggering progress of machine learning in the past decade has been a sight to behold. In retrospect, it is both remarkable and unsettling that these milestones were achievable with little to no rigorous theory to guide experimentation. Despite this fact, practitioners have been able to guide their future experimentation via observations from previous large-scale empirical investigations. However, alluding to Plato's Allegory of the cave, it is likely that the observations which form the field's notion of reality are but shadows representing fragments of that reality. In this work, we propose a theoretical framework which attempts to answer what exists outside of the cave. To the theorist, we provide a framework which is mathematically rigorous and leaves open many interesting ideas for future exploration. To the practitioner, we provide a framework whose results are very intuitive, general, and which will help form principles to guide future investigations. Concretely, we provide a theoretical framework rooted in Bayesian statistics and Shannon's information theory which is general enough to unify the analysis of many phenomena in machine learning. Our framework characterizes the performance of an optimal Bayesian learner, which considers the fundamental limits of information. Throughout this work, we derive very general theoretical results and apply them to derive insights specific to settings ranging from data which is independently and identically distributed under an unknown distribution, to data which is sequential, to data which exhibits hierarchical structure amenable to meta-learning. We conclude with a section dedicated to characterizing the performance of misspecified algorithms. These results are exciting and particularly relevant as we strive to overcome increasingly difficult machine learning challenges in this endlessly complex world.

artificial intelligence, information-theoretic foundation, machine learning

2407.12288

Genre: Research Report (0.69)

Technology: Information Technology > Artificial Intelligence > Machine Learning (1.00)

Information-Theoretic Foundations for Neural Scaling Laws

Jeon, Hong Jun, Van Roy, Benjamin

arXiv.org Artificial Intelligence, Jun-27-2024

In recent years, foundation models have grown immensely, with some embodying trillions of trainable parameters. While larger models have in general produced better results, they also require much more compute to train. It has become impractical to perform hyperparameter sweeps at the scale of these modern models. This has required bypassing the practice of tuning hyperparameters via extensive trial and error, as was previously common in deep learning. Among other things, hyperparameters control 1) the size, measured in terms of the parameter count p, of the neural network model and 2) the number T of training tokens. If each parameter is adjusted in response to each token, then the computational requirements of training scale with the product of these two quantities. For any compute budget C, one should carefully balance between p and T. Too few training tokens leads to model estimation error, while too few parameters gives rise to misspecification error. As evaluating performance across multiple choices of p and T becomes computationally prohibitive at scale, alternative kinds of analysis are required to guide allocation of computational resources. Kaplan et al. [2020] and Hoffmann et al. [2022] have proposed the following procedure for allocating a large compute budget: 1) Evaluate test errors of models produced using various small compute budgets C with many different allocations to parameters p versus training tokens T. 2) Extrapolate to estimate the relation between p and T for large C. To give a sense of scales involved here, Hoffmann et al. [2022] evaluate test errors across "small" models for which p T ranges from around 10
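The balance between p and T under a fixed budget can be illustrated with a small sketch. The error model below, a/p + b/T, is a hypothetical stand-in chosen only because it penalizes both too few parameters and too few tokens; it is not the bound derived in the paper, and the names `toy_error`, `best_allocation`, `closed_form`, `a`, and `b` are assumptions. The compute constraint p * T = C follows the abstract's statement that training cost scales with the product of the two quantities.

```python
import math


def toy_error(p, T, a=1.0, b=1.0):
    # Hypothetical error model (an assumption, not from the paper):
    # a / p stands in for misspecification error (too few parameters),
    # b / T stands in for estimation error (too few training tokens).
    return a / p + b / T


def best_allocation(C, a=1.0, b=1.0, grid=2000):
    """Grid-search the parameter count p on a log scale, with T = C / p."""
    best = None
    for i in range(1, grid):
        p = C ** (i / grid)  # candidate p between roughly 1 and C
        T = C / p            # spend the remaining budget on tokens
        err = toy_error(p, T, a, b)
        if best is None or err < best[2]:
            best = (p, T, err)
    return best


def closed_form(C, a=1.0, b=1.0):
    """Minimizer of a / p + b * p / C, obtained by setting the derivative to zero."""
    p = math.sqrt(a * C / b)
    return p, C / p


if __name__ == "__main__":
    for budget in (1e6, 1e8, 1e10):
        p, T, err = best_allocation(budget)
        print(f"C={budget:.0e}  p~{p:.3g}  T~{T:.3g}  error~{err:.3g}")
```

Under this toy model both p and T grow like the square root of C, so the optimal T is proportional to the optimal p. A different assumed error model would give a different split.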

artificial intelligence, information-theoretic foundation, machine learning, (18 more...)

2407.01456

Country: North America > United States > California > Santa Clara County (0.14)

Genre: Research Report (0.51)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.34)

Adaptive Crowdsourcing Via Self-Supervised Learning

Kagrecha, Anmol, Marklund, Henrik, Van Roy, Benjamin, Jeon, Hong Jun, Zeckhauser, Richard

arXiv.org Artificial Intelligence, Feb-1-2024

Common crowdsourcing systems average estimates of a latent quantity of interest provided by many crowdworkers to produce a group estimate. We develop a new approach -- predict-each-worker -- that leverages self-supervised learning and a novel aggregation scheme. This approach adapts weights assigned to crowdworkers based on estimates they provided for previous quantities. When skills vary across crowdworkers or their estimates correlate, the weighted sum offers a more accurate group estimate than the average. Existing algorithms such as expectation maximization can, at least in principle, produce similarly accurate group estimates. However, their computational requirements become onerous when complex models, such as neural networks, are required to express relationships among crowdworkers. Predict-each-worker accommodates such complexity as well as many other practical challenges. We analyze the efficacy of predict-each-worker through theoretical and computational studies. Among other things, we establish asymptotic optimality as the number of engagements per crowdworker grows.
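The claim that a weighted sum can beat a plain average when skills vary can be checked with a small simulation. This sketch uses inverse-variance weighting with noise levels that are assumed known; it is a simplified stand-in and not the predict-each-worker algorithm, which learns its weights from past estimates. The function name `simulate` and the noise standard deviations are hypothetical.

```python
import random


def simulate(noise_sds, trials=2000, seed=0):
    """Compare a plain average with an inverse-variance weighted estimate.

    noise_sds: hypothetical noise standard deviation of each crowdworker,
    assumed known here for illustration.
    Returns (mean squared error of average, mean squared error of weighted).
    """
    rng = random.Random(seed)
    weights = [1.0 / sd ** 2 for sd in noise_sds]  # more skilled => more weight
    weight_sum = sum(weights)
    se_avg = 0.0
    se_weighted = 0.0
    for _ in range(trials):
        truth = rng.gauss(0.0, 1.0)  # latent quantity of interest
        estimates = [truth + rng.gauss(0.0, sd) for sd in noise_sds]
        average = sum(estimates) / len(estimates)
        weighted = sum(w * e for w, e in zip(weights, estimates)) / weight_sum
        se_avg += (average - truth) ** 2
        se_weighted += (weighted - truth) ** 2
    return se_avg / trials, se_weighted / trials


if __name__ == "__main__":
    mse_avg, mse_weighted = simulate([0.1, 1.0, 3.0])
    print(f"average MSE: {mse_avg:.4f}  weighted MSE: {mse_weighted:.4f}")
```

With one accurate worker and two noisy ones, the plain average is dominated by the noisy workers, while the weighted estimate leans on the accurate one.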

crowdworker, large language model, machine learning, (23 more...)

2401.13239

Country: North America > United States (0.28)

Genre: Research Report > New Finding (0.46)

Industry: Government > Regional Government (0.46)

Technology:

Information Technology > Communications > Social Media (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning > Uncertainty > Bayesian Inference (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)
(3 more...)

An Information-Theoretic Analysis of In-Context Learning

Jeon, Hong Jun, Lee, Jason D., Lei, Qi, Van Roy, Benjamin

arXiv.org Artificial Intelligence, Jan-27-2024

Previous theoretical results pertaining to meta-learning on sequences build on contrived assumptions and are somewhat convoluted. We introduce new information-theoretic tools that lead to an elegant and very general decomposition of error into three components: irreducible error, meta-learning error, and intra-task error. These tools unify analyses across many meta-learning challenges. To illustrate, we apply them to establish new results about in-context learning with transformers. Our theoretical results characterize how error decays in both the number of training sequences and sequence lengths. Our results are very general; for example, they avoid contrived mixing time assumptions made by all prior results that establish decay of error with sequence length.

artificial intelligence, information-theoretic analysis, machine learning, (14 more...)

2401.15530

Country:

North America > United States > New York (0.14)
North America > United States > California > Santa Clara County (0.14)

Genre: Research Report > New Finding (0.86)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (0.93)
Information Technology > Artificial Intelligence > Machine Learning > Learning Graphical Models > Directed Networks > Bayesian Learning (0.46)

An Information-Theoretic Analysis of Compute-Optimal Neural Scaling Laws

Jeon, Hong Jun, Van Roy, Benjamin

arXiv.org Artificial Intelligence, Oct-18-2023

We study the compute-optimal trade-off between model and training data set sizes for large neural networks. Our result suggests a linear relation similar to that supported by the empirical analysis of Chinchilla. While that work studies transformer-based large language models trained on the MassiveText corpus (Gopher), as a starting point for development of a mathematical theory, we focus on a simpler learning model and data generating process, each based on a neural network with a sigmoidal output unit and single hidden layer of ReLU activation units. We introduce general error upper bounds for a class of algorithms which incrementally update a statistic (for example, gradient descent). For a particular learning model inspired by Barron (1993), we establish an upper bound on the minimal information-theoretically achievable expected error as a function of model and data set sizes. We then derive allocations of computation that minimize this bound. We present empirical results which suggest that this approximation correctly identifies an asymptotic linear compute-optimal scaling. This approximation also generates new insights. Among other things, it suggests that, as the input dimension or latent space complexity grows, as might be the case for example if a longer history of tokens is taken as input to a language model, a larger fraction of the compute budget should be allocated to growing the learning model rather than training data.

artificial intelligence, data generating process, machine learning, (19 more...)

2212.01365

Country:

North America > United States (0.14)
Asia (0.14)

Genre: Research Report > New Finding (0.86)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.66)

Continual Learning as Computationally Constrained Reinforcement Learning

Kumar, Saurabh, Marklund, Henrik, Rao, Ashish, Zhu, Yifan, Jeon, Hong Jun, Liu, Yueyang, Van Roy, Benjamin

arXiv.org Artificial Intelligence, Aug-20-2023

An agent that efficiently accumulates knowledge to develop increasingly sophisticated skills over a long lifetime could advance the frontier of artificial intelligence capabilities. The design of such agents, which remains a long-standing challenge of artificial intelligence, is addressed by the subject of continual learning. This monograph clarifies and formalizes concepts of continual learning, introducing a framework and set of tools to stimulate further research.

artificial intelligence, machine learning, reinforcement learning, (18 more...)

2307.04345

Country: North America > United States > Texas > Travis County > Austin (0.14)

Genre: Research Report > New Finding (0.45)

Industry: Education (1.00)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning > Agents (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (1.00)
(2 more...)

An Information-Theoretic Framework for Supervised Learning

Jeon, Hong Jun, Zhu, Yifan, Van Roy, Benjamin

arXiv.org Artificial Intelligence, Mar-24-2023

Each year, deep learning demonstrates new and improved empirical results with deeper and wider neural networks. Meanwhile, with existing theoretical frameworks, it is difficult to analyze networks deeper than two layers without resorting to counting parameters or encountering sample complexity bounds that are exponential in depth. It may be fruitful to analyze modern machine learning through a different lens. In this paper, we propose a novel information-theoretic framework with its own notions of regret and sample complexity for analyzing the data requirements of machine learning. With our framework, we first work through some classical examples such as scalar estimation and linear regression to build intuition and introduce general techniques. Then, we use the framework to study the sample complexity of learning from data generated by deep neural networks with ReLU activation units. For a particular prior distribution on weights, we establish sample complexity bounds that are simultaneously width independent and linear in depth. This prior distribution gives rise to high-dimensional latent representations that, with high probability, admit reasonably accurate low-dimensional approximations. We conclude by corroborating our theoretical results with experimental analysis of random single-hidden-layer neural networks.

artificial intelligence, deep learning, machine learning, (16 more...)

2203.00246

Country: North America > United States > California > Santa Clara County (0.14)

Genre: Research Report > New Finding (0.67)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Learning Graphical Models > Directed Networks > Bayesian Learning (0.45)

Is Stochastic Gradient Descent Near Optimal?

Zhu, Yifan, Jeon, Hong Jun, Van Roy, Benjamin

arXiv.org Artificial Intelligence, Oct-6-2022

The success of neural networks over the past decade has established them as effective models for many relevant data generating processes. Statistical theory on neural networks indicates graceful scaling of sample complexity. For example, Jeon & Van Roy (arXiv:2203.00246) demonstrate that, when data is generated by a ReLU teacher network with $W$ parameters, an optimal learner needs only $\tilde{O}(W/\epsilon)$ samples to attain expected error $\epsilon$. However, existing computational theory suggests that, even for single-hidden-layer teacher networks, to attain small error for all such teacher networks, the computation required to achieve this sample complexity is intractable. In this work, we fit single-hidden-layer neural networks to data generated by single-hidden-layer ReLU teacher networks with parameters drawn from a natural distribution. We demonstrate that stochastic gradient descent (SGD) with automated width selection attains small expected error with a number of samples and total number of queries both nearly linear in the input dimension and width. This suggests that SGD nearly achieves the information-theoretic sample complexity bounds of Jeon & Van Roy (arXiv:2203.00246) in a computationally efficient manner. An important difference between our positive empirical results and the negative theoretical results is that the latter address worst-case error of deterministic algorithms, while our analysis centers on expected error of a stochastic algorithm.

artificial intelligence, complexity, machine learning, (17 more...)

2209.08627

Country: North America > United States > California > Santa Clara County (0.14)

Genre: Research Report > New Finding (0.46)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (1.00)