Batson, Joshua
Auditing language models for hidden objectives
Marks, Samuel, Treutlein, Johannes, Bricken, Trenton, Lindsey, Jack, Marcus, Jonathan, Mishra-Sharma, Siddharth, Ziegler, Daniel, Ameisen, Emmanuel, Batson, Joshua, Belonax, Tim, Bowman, Samuel R., Carter, Shan, Chen, Brian, Cunningham, Hoagy, Denison, Carson, Dietz, Florian, Golechha, Satvik, Khan, Akbir, Kirchner, Jan, Leike, Jan, Meek, Austin, Nishimura-Gasparian, Kei, Ong, Euan, Olah, Christopher, Pearce, Adam, Roger, Fabien, Salle, Jeanne, Shih, Andy, Tong, Meg, Thomas, Drake, Rivoire, Kelley, Jermyn, Adam, MacDiarmid, Monte, Henighan, Tom, Hubinger, Evan
We study the feasibility of conducting alignment audits: investigations into whether models have undesired objectives. As a testbed, we train a language model with a hidden objective. Our training pipeline first teaches the model about exploitable errors in RLHF reward models (RMs), then trains the model to exploit some of these errors. We verify via out-of-distribution evaluations that the model generalizes to exhibit whatever behaviors it believes RMs rate highly, including ones not reinforced during training. We leverage this model to study alignment audits in two ways. First, we conduct a blind auditing game where four teams, unaware of the model's hidden objective or training, investigate it for concerning behaviors and their causes. Three teams successfully uncovered the model's hidden objective using techniques including interpretability with sparse autoencoders (SAEs), behavioral attacks, and training data analysis. Second, we conduct an unblinded follow-up study of eight techniques for auditing the model, analyzing their strengths and limitations. Overall, our work provides a concrete example of using alignment audits to discover a model's hidden objective and proposes a methodology for practicing and validating progress in alignment auditing.
Open Problems in Mechanistic Interpretability
Sharkey, Lee, Chughtai, Bilal, Batson, Joshua, Lindsey, Jack, Wu, Jeff, Bushnaq, Lucius, Goldowsky-Dill, Nicholas, Heimersheim, Stefan, Ortega, Alejandro, Bloom, Joseph, Biderman, Stella, Garriga-Alonso, Adria, Conmy, Arthur, Nanda, Neel, Rumbelow, Jessica, Wattenberg, Martin, Schoots, Nandi, Miller, Joseph, Michaud, Eric J., Casper, Stephen, Tegmark, Max, Saunders, William, Bau, David, Todd, Eric, Geiger, Atticus, Geva, Mor, Hoogland, Jesse, Murfet, Daniel, McGrath, Tom
Mechanistic interpretability aims to understand the computational mechanisms underlying neural networks' capabilities in order to accomplish concrete scientific and engineering goals. Progress in this field thus promises to provide greater assurance over AI system behavior and shed light on exciting scientific questions about the nature of intelligence. Despite recent progress toward these goals, there are many open problems in the field that require solutions before many scientific and practical benefits can be realized: Our methods require both conceptual and practical improvements to reveal deeper insights; we must figure out how best to apply our methods in pursuit of specific goals; and the field must grapple with socio-technical challenges that influence and are influenced by our work. This forward-facing review discusses the current frontier of mechanistic interpretability and the open problems that the field may benefit from prioritizing. This review collects the perspectives of its various authors and represents a synthesis of their views by Apollo Research on behalf of Schmidt Sciences. The perspectives presented here do not necessarily reflect the views of any individual author or the institutions with which they are affiliated.
Scaling Laws in Jet Classification
Batson, Joshua, Kahn, Yonatan
We demonstrate the emergence of scaling laws in the benchmark top versus QCD jet classification problem in collider physics. Six distinct physically-motivated classifiers exhibit power-law scaling of the binary cross-entropy test loss as a function of training set size, with distinct power-law indices. This result highlights the importance of comparing classifiers as a function of dataset size rather than for a fixed training set, as the optimal classifier may change considerably as the dataset is scaled up. We speculate on the interpretation of our results in terms of previous models of scaling laws observed in natural language and image datasets.
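To make the scaling behavior concrete, here is a minimal sketch (not code from the paper) of how a shifted power law $L(N) = aN^{-b} + c$ can be fit to binary cross-entropy test losses measured at several training-set sizes; the loss values, initial guess, and fitted numbers below are illustrative placeholders, not results from the jet classifiers studied.

```python
# Minimal sketch (not from the paper): fit a shifted power law
# L(N) = a * N**(-b) + c to test losses measured at several training-set
# sizes. The loss values below are made-up placeholders, not paper results.
import numpy as np
from scipy.optimize import curve_fit

train_sizes = np.array([1e3, 3e3, 1e4, 3e4, 1e5, 3e5])        # N
test_losses = np.array([0.52, 0.45, 0.39, 0.35, 0.32, 0.30])  # hypothetical BCE

def power_law(n, a, b, c):
    """Test loss as a shifted power law in the training-set size."""
    return a * n ** (-b) + c

params, _ = curve_fit(power_law, train_sizes, test_losses,
                      p0=(1.0, 0.3, 0.2), maxfev=10000)
a, b, c = params
print(f"fitted power-law index b = {b:.3f}, asymptotic loss c = {c:.3f}")
```

Comparing the fitted index $b$ across classifiers is one way to see which model would win once the dataset is scaled up, which is the comparison the abstract argues for.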
Topological Obstructions to Autoencoding
Batson, Joshua, Haaf, C. Grace, Kahn, Yonatan, Roberts, Daniel A.
Autoencoders have been proposed as a powerful tool for model-independent anomaly detection in high-energy physics. The operating principle is that events which do not belong to the space of training data will be reconstructed poorly, thus flagging them as anomalies. We point out that in a variety of examples of interest, the connection between large reconstruction error and anomalies is not so clear. In particular, for data sets with nontrivial topology, there will always be points that erroneously seem anomalous due to global issues. Conversely, neural networks typically have an inductive bias or prior to locally interpolate such that undersampled or rare events may be reconstructed with small error, despite actually being the desired anomalies. Taken together, these facts are in tension with the simple picture of the autoencoder as an anomaly detector. Using a series of illustrative low-dimensional examples, we show explicitly how the intrinsic and extrinsic topology of the dataset affects the behavior of an autoencoder and how this topology is manifested in the latent space representation during training. We ground this analysis in the discussion of a mock "bump hunt" in which the autoencoder fails to identify an anomalous "signal" for reasons tied to the intrinsic topology of $n$-particle phase space.
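For context, the recipe being critiqued can be stated in a few lines: train an autoencoder on background events and flag as anomalous anything whose reconstruction error exceeds a threshold. The sketch below (my illustration, not the paper's setup) implements that recipe on synthetic data shaped like a ring, loosely echoing the kind of low-dimensional, topologically nontrivial examples the abstract mentions; the architecture, data, and threshold rule are all placeholder assumptions.

```python
# Minimal sketch (not the paper's code) of reconstruction-error anomaly
# detection. Background data lies near a ring (nontrivial topology) embedded
# in 10 dimensions; everything below is an illustrative placeholder.
import torch
import torch.nn as nn

torch.manual_seed(0)

# "Background" training events: a noisy circle embedded in 10 dimensions.
angles = torch.rand(2000, 1) * 2 * torch.pi
ring = torch.cat([torch.cos(angles), torch.sin(angles)], dim=1)
background = torch.cat([ring, 0.05 * torch.randn(2000, 8)], dim=1)

autoencoder = nn.Sequential(
    nn.Linear(10, 16), nn.ReLU(),
    nn.Linear(16, 1),               # 1D latent bottleneck
    nn.Linear(1, 16), nn.ReLU(),
    nn.Linear(16, 10),
)
optimizer = torch.optim.Adam(autoencoder.parameters(), lr=1e-3)

for step in range(2000):
    optimizer.zero_grad()
    loss = ((autoencoder(background) - background) ** 2).mean()
    loss.backward()
    optimizer.step()

def anomaly_score(events):
    """Per-event mean squared reconstruction error."""
    with torch.no_grad():
        return ((autoencoder(events) - events) ** 2).mean(dim=1)

# Placeholder rule: flag events scoring above the 99th percentile of background.
threshold = anomaly_score(background).quantile(0.99)
signal = torch.randn(100, 10)       # hypothetical off-manifold "signal" events
flagged = anomaly_score(signal) > threshold
print(f"flagged {int(flagged.sum())} of {len(signal)} signal events")
```

The paper's point is that this picture can fail in both directions: points on a topologically nontrivial background manifold (like the ring above, which cannot be mapped injectively into a one-dimensional latent space) may score high spuriously, while genuinely rare events may be interpolated well and score low.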
Noise2Self: Blind Denoising by Self-Supervision
Batson, Joshua, Royer, Loic
We propose a general framework for denoising high-dimensional measurements which requires no prior on the signal, no estimate of the noise, and no clean training data. The only assumption is that the noise exhibits statistical independence across different dimensions of the measurement. Moreover, our framework is not restricted to a particular denoising model. We show how it can be used to calibrate any parameterised denoising algorithm, from the single hyperparameter of a median filter to the millions of weights of a deep neural network. We demonstrate this on natural image and microscopy data, where we exploit noise independence between pixels, and on single-cell gene expression data, where we exploit independence between detections of individual molecules. Finally, we prove a theoretical lower bound on the performance of an optimal denoiser. This framework generalizes recent work on training neural nets from noisy images and on cross-validation for matrix factorization.
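As a concrete illustration of the calibration use case mentioned in the abstract (choosing a median filter's single hyperparameter), here is a minimal sketch, not the paper's code: each half of the coordinates is masked in turn, filled in from its neighbors, and the filter is scored only on those masked entries against the noisy values it never saw. The synthetic signal, noise level, and radius grid are assumptions for illustration.

```python
# Minimal sketch (not the paper's code): self-supervised calibration of a
# median-filter radius. Mask each parity class, fill it from neighbors,
# filter, and score only on the masked entries. Signal/noise are synthetic.
import numpy as np
from scipy.ndimage import median_filter

rng = np.random.default_rng(0)
clean = np.sin(np.linspace(0, 8 * np.pi, 1024))
noisy = clean + 0.5 * rng.standard_normal(clean.shape)

def self_supervised_loss(radius):
    """Reconstruction error measured only on coordinates hidden from the filter."""
    total = 0.0
    for parity in (0, 1):
        idx = np.arange(parity, len(noisy), 2)
        masked = noisy.copy()
        # Replace masked entries with the mean of their immediate neighbors,
        # so the filter never sees the noise it is scored against.
        left = noisy[np.clip(idx - 1, 0, len(noisy) - 1)]
        right = noisy[np.clip(idx + 1, 0, len(noisy) - 1)]
        masked[idx] = 0.5 * (left + right)
        denoised = median_filter(masked, size=2 * radius + 1)
        total += np.mean((denoised[idx] - noisy[idx]) ** 2)
    return total / 2

best = min(range(1, 20), key=self_supervised_loss)
print("self-supervised choice of radius:", best)
print("true MSE at that radius:",
      np.mean((median_filter(noisy, size=2 * best + 1) - clean) ** 2))
```

Because the noise is independent across coordinates, the masked-entry error equals the true error plus a constant noise term in expectation, so minimizing it selects nearly the same radius as an oracle with access to the clean signal.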