AITopics | Benton, Joe

Collaborating Authors

Benton, Joe

Information about AI from the News, Publications, and Conferences

Automatic Classification – Tagging and Summarization – Customizable Filtering and Analysis

If you are looking for an answer to the question What is Artificial Intelligence? and you only have a minute, then here's the definition the Association for the Advancement of Artificial Intelligence offers on its home page: "the scientific understanding of the mechanisms underlying thought and intelligent behavior and their embodiment in machines."

However, if you are fortunate enough to have more than a minute, then please get ready to embark upon an exciting journey exploring AI (but beware, it could last a lifetime) …

Constitutional Classifiers: Defending against Universal Jailbreaks across Thousands of Hours of Red Teaming

Sharma, Mrinank, Tong, Meg, Mu, Jesse, Wei, Jerry, Kruthoff, Jorrit, Goodfriend, Scott, Ong, Euan, Peng, Alwin, Agarwal, Raj, Anil, Cem, Askell, Amanda, Bailey, Nathan, Benton, Joe, Bluemke, Emma, Bowman, Samuel R., Christiansen, Eric, Cunningham, Hoagy, Dau, Andy, Gopal, Anjali, Gilson, Rob, Graham, Logan, Howard, Logan, Kalra, Nimit, Lee, Taesung, Lin, Kevin, Lofgren, Peter, Mosconi, Francesco, O'Hara, Clare, Olsson, Catherine, Petrini, Linda, Rajani, Samir, Saxena, Nikhil, Silverstein, Alex, Singh, Tanya, Sumers, Theodore, Tang, Leonard, Troy, Kevin K., Weisser, Constantin, Zhong, Ruiqi, Zhou, Giulio, Leike, Jan, Kaplan, Jared, Perez, Ethan

arXiv.org Artificial IntelligenceJan-30-2025

Large language models (LLMs) are vulnerable to universal jailbreaks--prompting strategies that systematically bypass model safeguards and enable users to carry out harmful processes that require many model interactions, like manufacturing illegal substances at scale. To defend against these attacks, we introduce Constitutional Classifiers: safeguards trained on synthetic data, generated by prompting LLMs with natural language rules (i.e., a constitution) specifying permitted and restricted content. In over 3,000 estimated hours of red teaming, no red teamer found a universal jailbreak that could extract information from an early classifier-guarded LLM at a similar level of detail to an unguarded model across most target queries. On automated evaluations, enhanced classifiers demonstrated robust defense against held-out domain-specific jailbreaks. These classifiers also maintain deployment viability, with an absolute 0.38% increase in production-traffic refusals and a 23.7% inference overhead. Our work demonstrates that defending against universal jailbreaks while maintaining practical deployment viability is tractable.

classifier, large language model, machine learning, (19 more...)

arXiv.org Artificial Intelligence

2501.18837

Genre:

Workflow (1.00)
Questionnaire & Opinion Survey (0.92)
Research Report > New Finding (0.67)

Industry:

Information Technology > Security & Privacy (1.00)
Health & Medicine > Pharmaceuticals & Biotechnology (1.00)
Government > Military (1.00)
(7 more...)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Performance Analysis > Accuracy (0.70)

Add feedback

Sabotage Evaluations for Frontier Models

Benton, Joe, Wagner, Misha, Christiansen, Eric, Anil, Cem, Perez, Ethan, Srivastav, Jai, Durmus, Esin, Ganguli, Deep, Kravec, Shauna, Shlegeris, Buck, Kaplan, Jared, Karnofsky, Holden, Hubinger, Evan, Grosse, Roger, Bowman, Samuel R., Duvenaud, David

arXiv.org Artificial IntelligenceOct-28-2024

Sufficiently capable models could subvert human oversight and decision-making in important contexts. For example, in the context of AI development, models could covertly sabotage efforts to evaluate their own dangerous capabilities, to monitor their behavior, or to make decisions about their deployment. We refer to this family of abilities as sabotage capabilities. We develop a set of related threat models and evaluations. These evaluations are designed to provide evidence that a given model, operating under a given set of mitigations, could not successfully sabotage a frontier model developer or other large organization's activities in any of these ways. We demonstrate these evaluations on Anthropic's Claude 3 Opus and Claude 3.5 Sonnet models. Our results suggest that for these models, minimal mitigations are currently sufficient to address sabotage risks, but that more realistic evaluations and stronger mitigations seem likely to be necessary soon as capabilities improve. We also survey related evaluations we tried and abandoned. Finally, we discuss the advantages of mitigation-aware capability evaluations, and of simulating large-scale deployments using small-scale statistics.

large language model, machine learning, natural language, (20 more...)

arXiv.org Artificial Intelligence

2410.21514

Country:

North America > United States (0.46)
North America > Canada > Ontario > Toronto (0.14)

Genre:

Research Report > New Finding (1.00)
Research Report > Experimental Study (1.00)

Industry:

Law Enforcement & Public Safety > Terrorism (1.00)
Information Technology > Security & Privacy (1.00)
Health & Medicine > Therapeutic Area > Infections and Infectious Diseases (0.67)
(2 more...)

Technology:

Information Technology > Security & Privacy (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Machine Learning (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (0.93)

Add feedback

Measuring Feature Sparsity in Language Models

Deng, Mingyang, Tao, Lucas, Benton, Joe

arXiv.org Artificial IntelligenceOct-13-2023

Recent works have proposed that activations in language models can be modelled as sparse linear combinations of vectors corresponding to features of input text. Under this assumption, these works aimed to reconstruct feature directions using sparse coding. We develop metrics to assess the success of these sparse coding techniques and test the validity of the linearity and sparsity assumptions. We show our metrics can predict the level of sparsity on synthetic sparse linear activations, and can distinguish between sparse linear data and several other distributions. We use our metrics to measure levels of sparsity in several language models. We find evidence that language model activations can be accurately modelled by sparse linear combinations of features, significantly more so than control datasets. We also show that model activations appear to be sparsest in the first and final layers.

large language model, machine learning, natural language, (18 more...)

arXiv.org Artificial Intelligence

2310.07837

Genre: Research Report > New Finding (0.46)

Industry: Health & Medicine (0.46)

Technology:

Information Technology > Artificial Intelligence > Natural Language (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.47)

Add feedback

Linear Convergence Bounds for Diffusion Models via Stochastic Localization

Benton, Joe, De Bortoli, Valentin, Doucet, Arnaud, Deligiannidis, George

arXiv.org Artificial IntelligenceAug-7-2023

Diffusion models are a powerful method for generating approximate samples from high-dimensional data distributions. Several recent results have provided polynomial bounds on the convergence rate of such models, assuming $L^2$-accurate score estimators. However, up until now the best known such bounds were either superlinear in the data dimension or required strong smoothness assumptions. We provide the first convergence bounds which are linear in the data dimension (up to logarithmic factors) assuming only finite second moments of the data distribution. We show that diffusion models require at most $\tilde O(\frac{d \log^2(1/\delta)}{\varepsilon^2})$ steps to approximate an arbitrary data distribution on $\mathbb{R}^d$ corrupted with Gaussian noise of variance $\delta$ to within $\varepsilon^2$ in Kullback--Leibler divergence. Our proof builds on the Girsanov-based methods of previous works. We introduce a refined treatment of the error arising from the discretization of the reverse SDE, which is based on tools from stochastic localization.

artificial intelligence, arxiv, machine learning, (15 more...)

arXiv.org Artificial Intelligence

2308.03686

Country: Europe > United Kingdom (0.14)

Genre: Research Report (0.64)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning > Uncertainty (0.46)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (0.46)

Add feedback

Alpha-divergence Variational Inference Meets Importance Weighted Auto-Encoders: Methodology and Asymptotics

Daudel, Kamélia, Benton, Joe, Shi, Yuyang, Doucet, Arnaud

arXiv.org Artificial IntelligenceJul-19-2023

Several algorithms involving the Variational R\'enyi (VR) bound have been proposed to minimize an alpha-divergence between a target posterior distribution and a variational distribution. Despite promising empirical results, those algorithms resort to biased stochastic gradient descent procedures and thus lack theoretical guarantees. In this paper, we formalize and study the VR-IWAE bound, a generalization of the Importance Weighted Auto-Encoder (IWAE) bound. We show that the VR-IWAE bound enjoys several desirable properties and notably leads to the same stochastic gradient descent procedure as the VR bound in the reparameterized case, but this time by relying on unbiased gradient estimators. We then provide two complementary theoretical analyses of the VR-IWAE bound and thus of the standard IWAE bound. Those analyses shed light on the benefits or lack thereof of these bounds. Lastly, we illustrate our theoretical claims over toy and real-data examples.

artificial intelligence, machine learning, vr-iwae, (16 more...)

arXiv.org Artificial Intelligence

2210.06226

Country:

North America > United States (0.45)
Europe > United Kingdom > England > Oxfordshire > Oxford (0.14)

Genre: Research Report > New Finding (0.67)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (1.00)

Add feedback

Polysemanticity and Capacity in Neural Networks

Scherlis, Adam, Sachan, Kshitij, Jermyn, Adam S., Benton, Joe, Shlegeris, Buck

arXiv.org Artificial IntelligenceJul-11-2023

Individual neurons in neural networks often represent a mixture of unrelated features. This phenomenon, called polysemanticity, can make interpreting neural networks more difficult and so we aim to understand its causes. We propose doing so through the lens of feature \emph{capacity}, which is the fractional dimension each feature consumes in the embedding space. We show that in a toy model the optimal capacity allocation tends to monosemantically represent the most important features, polysemantically represent less important features (in proportion to their impact on the loss), and entirely ignore the least important features. Polysemanticity is more prevalent when the inputs have higher kurtosis or sparsity and more prevalent in some architectures than others. Given an optimal allocation of capacity, we go on to study the geometry of the embedding space. We find a block-semi-orthogonal structure, with differing block sizes in different models, highlighting the impact of model architecture on the interpretability of its neurons.

artificial intelligence, machine learning, matrix, (18 more...)

arXiv.org Artificial Intelligence

2210.01892

Genre: Research Report (0.50)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (1.00)

Add feedback

Error Bounds for Flow Matching Methods

Benton, Joe, Deligiannidis, George, Doucet, Arnaud

arXiv.org Artificial IntelligenceMay-26-2023

Score-based generative models are a popular class of generative modelling techniques relying on stochastic differential equations (SDE). From their inception, it was realized that it was also possible to perform generation using ordinary differential equations (ODE) rather than SDE. This led to the introduction of the probability flow ODE approach and denoising diffusion implicit models. Flow matching methods have recently further extended these ODE-based approaches and approximate a flow between two arbitrary probability distributions. Previous work derived bounds on the approximation error of diffusion models under the stochastic sampling regime, given assumptions on the $L^2$ loss. We present error bounds for the flow matching procedure using fully deterministic sampling, assuming an $L^2$ bound on the approximation error and a certain regularity condition on the data distributions.

artificial intelligence, international conference, machine learning, (14 more...)

arXiv.org Artificial Intelligence

2305.1686

Country: Europe > United Kingdom (0.29)

Genre: Research Report (0.64)

Technology: Information Technology > Artificial Intelligence > Machine Learning (1.00)

Add feedback

From Denoising Diffusions to Denoising Markov Models

Benton, Joe, Shi, Yuyang, De Bortoli, Valentin, Deligiannidis, George, Doucet, Arnaud

arXiv.org Artificial IntelligenceMay-11-2023

Denoising diffusions are state-of-the-art generative models exhibiting remarkable empirical performance. They work by diffusing the data distribution into a Gaussian distribution and then learning to reverse this noising process to obtain synthetic datapoints. The denoising diffusion relies on approximations of the logarithmic derivatives of the noised data densities using score matching. Such models can also be used to perform approximate posterior simulation when one can only sample from the prior and likelihood. We propose a unifying framework generalising this approach to a wide class of spaces and leading to an original extension of score matching. We illustrate the resulting models on various applications.

artificial intelligence, machine learning, objective, (18 more...)

arXiv.org Artificial Intelligence

2211.03595

Country:

Europe > United Kingdom (0.28)
North America > United States (0.27)

Genre: Research Report (0.81)

Technology:

Information Technology > Artificial Intelligence > Vision (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (1.00)
(2 more...)

Add feedback

A Continuous Time Framework for Discrete Denoising Models

Campbell, Andrew, Benton, Joe, De Bortoli, Valentin, Rainforth, Tom, Deligiannidis, George, Doucet, Arnaud

arXiv.org Artificial IntelligenceOct-14-2022

We provide the first complete continuous time framework for denoising diffusion models of discrete data. This is achieved by formulating the forward noising process and corresponding reverse time generative process as Continuous Time Markov Chains (CTMCs). The model can be efficiently trained using a continuous time version of the ELBO. We simulate the high dimensional CTMC using techniques developed in chemical physics and exploit our continuous time framework to derive high performance samplers that we show can outperform discrete time methods for discrete data. The continuous time treatment also enables us to derive a novel theoretical result bounding the error between the generated sample distribution and the true data distribution.

artificial intelligence, dimension, machine learning, (17 more...)

arXiv.org Artificial Intelligence

2205.14987

Country: Europe (0.45)

Genre: Research Report (0.63)

Industry: Media (1.00)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (1.00)
Information Technology > Artificial Intelligence > Vision (0.87)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (0.67)

Add feedback