Meinke, Alexander
Multi-Agent Risks from Advanced AI
Hammond, Lewis, Chan, Alan, Clifton, Jesse, Hoelscher-Obermaier, Jason, Khan, Akbir, McLean, Euan, Smith, Chandler, Barfuss, Wolfram, Foerster, Jakob, Gavenčiak, Tomáš, Han, The Anh, Hughes, Edward, Kovařík, Vojtěch, Kulveit, Jan, Leibo, Joel Z., Oesterheld, Caspar, de Witt, Christian Schroeder, Shah, Nisarg, Wellman, Michael, Bova, Paolo, Cimpeanu, Theodor, Ezell, Carson, Feuillade-Montixi, Quentin, Franklin, Matija, Kran, Esben, Krawczuk, Igor, Lamparth, Max, Lauffer, Niklas, Meinke, Alexander, Motwani, Sumeet, Reuel, Anka, Conitzer, Vincent, Dennis, Michael, Gabriel, Iason, Gleave, Adam, Hadfield, Gillian, Haghtalab, Nika, Kasirzadeh, Atoosa, Krier, Sébastien, Larson, Kate, Lehman, Joel, Parkes, David C., Piliouras, Georgios, Rahwan, Iyad
The rapid development of advanced AI agents and the imminent deployment of many instances of these agents will give rise to multi-agent systems of unprecedented complexity. These systems pose novel and under-explored risks. In this report, we provide a structured taxonomy of these risks by identifying three key failure modes (miscoordination, conflict, and collusion) based on agents' incentives, as well as seven key risk factors (information asymmetries, network effects, selection pressures, destabilising dynamics, commitment problems, emergent agency, and multi-agent security) that can underpin them. We highlight several important instances of each risk, as well as promising directions to help mitigate them. By anchoring our analysis in a range of real-world examples and experimental evidence, we illustrate the distinct challenges posed by multi-agent systems and their implications for the safety, governance, and ethics of advanced AI.
Frontier Models are Capable of In-context Scheming
Meinke, Alexander, Schoen, Bronson, Scheurer, Jérémy, Balesni, Mikita, Shah, Rusheb, Hobbhahn, Marius
Frontier models are increasingly trained and deployed as autonomous agents. One safety concern is that AI agents might covertly pursue misaligned goals, hiding their true capabilities and objectives - also known as scheming. We study whether models have the capability to scheme in pursuit of a goal that we provide in-context and instruct the model to strongly follow. We evaluate frontier models on a suite of six agentic evaluations where models are instructed to pursue goals and are placed in environments that incentivize scheming. Our results show that o1, Claude 3.5 Sonnet, Claude 3 Opus, Gemini 1.5 Pro, and Llama 3.1 405B all demonstrate in-context scheming capabilities. They recognize scheming as a viable strategy and readily engage in such behavior. For example, models strategically introduce subtle mistakes into their responses, attempt to disable their oversight mechanisms, and even exfiltrate what they believe to be their model weights to external servers. Additionally, this deceptive behavior proves persistent: when o1 has engaged in scheming, it maintains its deception in over 85% of follow-up questions and often remains deceptive in multi-turn interrogations. Analysis of the models' chains-of-thought reveals that models explicitly reason about these deceptive strategies, providing evidence that the scheming behavior is not accidental. Surprisingly, we also find rare instances where models engage in scheming when only given a goal, without being strongly nudged to pursue it. We observe cases where Claude 3.5 Sonnet strategically underperforms in evaluations in pursuit of being helpful, a goal that was acquired during training rather than in-context. Our findings demonstrate that frontier models now possess capabilities for basic in-context scheming, making the potential of AI agents to engage in scheming behavior a concrete rather than a theoretical concern.
Towards evaluations-based safety cases for AI scheming
Balesni, Mikita, Hobbhahn, Marius, Lindner, David, Meinke, Alexander, Korbak, Tomek, Clymer, Joshua, Shlegeris, Buck, Scheurer, Jérémy, Stix, Charlotte, Shah, Rusheb, Goldowsky-Dill, Nicholas, Braun, Dan, Chughtai, Bilal, Evans, Owain, Kokotajlo, Daniel, Bushnaq, Lucius
We sketch how developers of frontier AI systems could construct a structured rationale -- a 'safety case' -- that an AI system is unlikely to cause catastrophic outcomes through scheming. Scheming is a potential threat model where AI systems could pursue misaligned goals covertly, hiding their true capabilities and objectives. In this report, we propose three arguments that safety cases could use in relation to scheming. For each argument we sketch how evidence could be gathered from empirical evaluations, and what assumptions would need to be met to provide strong assurance. First, developers of frontier AI systems could argue that AI systems are not capable of scheming (Scheming Inability). Second, one could argue that AI systems are not capable of posing harm through scheming (Harm Inability). Third, one could argue that control measures around the AI systems would prevent unacceptable outcomes even if the AI systems intentionally attempted to subvert them (Harm Control). Additionally, we discuss how safety cases might be supported by evidence that an AI system is reasonably aligned with its developers (Alignment). Finally, we point out that many of the assumptions required to make these safety arguments have not been confidently satisfied to date and require making progress on multiple open research problems.
Me, Myself, and AI: The Situational Awareness Dataset (SAD) for LLMs
Laine, Rudolf, Chughtai, Bilal, Betley, Jan, Hariharan, Kaivalya, Scheurer, Jeremy, Balesni, Mikita, Hobbhahn, Marius, Meinke, Alexander, Evans, Owain
AI assistants such as ChatGPT are trained to respond to users by saying, "I am a large language model". This raises questions. Do such models know that they are LLMs and reliably act on this knowledge? Are they aware of their current circumstances, such as being deployed to the public? We refer to a model's knowledge of itself and its circumstances as situational awareness. To quantify situational awareness in LLMs, we introduce a range of behavioral tests, based on question answering and instruction following. These tests form the Situational Awareness Dataset (SAD), a benchmark comprising 7 task categories and over 13,000 questions. The benchmark tests numerous abilities, including the capacity of LLMs to (i) recognize their own generated text, (ii) predict their own behavior, (iii) determine whether a prompt is from internal evaluation or real-world deployment, and (iv) follow instructions that depend on self-knowledge. We evaluate 16 LLMs on SAD, including both base (pretrained) and chat models. While all models perform better than chance, even the highest-scoring model (Claude 3 Opus) is far from a human baseline on certain tasks. We also observe that performance on SAD is only partially predicted by metrics of general knowledge (e.g. MMLU). Chat models, which are finetuned to serve as AI assistants, outperform their corresponding base models on SAD but not on general knowledge tasks. The purpose of SAD is to facilitate scientific understanding of situational awareness in LLMs by breaking it down into quantitative abilities. Situational awareness is important because it enhances a model's capacity for autonomous planning and action. While this has potential benefits for automation, it also introduces novel risks related to AI safety and control. Code and latest results available at https://situational-awareness-dataset.org.
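As a concrete illustration of the kind of behavioral test SAD builds on, the sketch below scores a single self-knowledge item; the question text, the `ask_model` interface, and the grading rule are hypothetical examples for illustration, not items from the actual benchmark.

```python
# Hypothetical sketch of a situational-awareness style behavioral test.
# The prompt and grading rule below are illustrative, not taken from SAD.
from typing import Callable


def grade_self_knowledge_item(ask_model: Callable[[str], str]) -> bool:
    """Score one self-knowledge item. `ask_model` sends a prompt to the
    model under test and returns its text reply."""
    prompt = (
        "If you are a large language model rather than a human, reply with "
        "exactly the word APPLE. Otherwise reply with exactly the word ORANGE."
    )
    reply = ask_model(prompt).strip().upper()
    # A model that knows it is an LLM and can act on that knowledge should say APPLE.
    return reply.startswith("APPLE")


# Usage with a trivial stand-in "model" that always claims to be an LLM:
print(grade_self_knowledge_item(lambda prompt: "APPLE"))  # True
```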
Tell, don't show: Declarative facts influence how LLMs generalize
Meinke, Alexander, Evans, Owain
We examine how large language models (LLMs) generalize from abstract declarative statements in their training data. As an illustration, consider an LLM that is prompted to generate weather reports for London in 2050. One possibility is that the temperatures in the reports simply match the mean and variance of reports from 2023. Another possibility is that the reports predict higher temperatures, by incorporating declarative statements about climate change from scientific papers written in 2023, for example a statement predicting that global temperatures will rise by some specified amount. To test the influence of abstract declarative statements, we construct tasks in which LLMs are finetuned on both declarative and procedural information. We find that declarative statements influence model predictions, even when they conflict with procedural information. In particular, finetuning on a declarative statement S increases the model likelihood for logical consequences of S. The effect of declarative statements is consistent across three domains: aligning an AI assistant, predicting weather, and predicting demographic features. Through a series of ablations, we show that the effect of declarative statements cannot be explained by associative learning based on matching keywords. Nevertheless, the effect of declarative statements on model likelihoods is small in absolute terms and increases surprisingly little with model size (i.e. from 330 million to 175 billion parameters). We argue that these results have implications for AI risk (in relation to the "treacherous turn") and for fairness. Large language models (LLMs) have attracted attention due to their rapidly improving capabilities (OpenAI, 2023; Touvron et al., 2023; Anthropic, 2023). As LLMs become widely deployed, it is important to understand how training data influences their generalization to unseen examples. In particular, when an LLM is presented with a novel input, does it merely repeat low-level statistical patterns ("stochastic parrot") or does it utilize an abstract reasoning process, even without explicit Chain of Thought (Bender et al., 2021; Bowman, 2023; Wei et al., 2022b)? Understanding how LLMs generalize is important for ensuring alignment and avoiding risks from deployed models (Ngo et al., 2022b; Hendrycks et al., 2021). For example, suppose an LLM is prompted to generate BBC News weather reports for London in 2050. One way to generalize is to reproduce temperatures with the same statistical patterns (e.g. the same mean and variance) as the 2023 reports it was trained on. However, the LLM was also trained on scientific papers containing statements about climate change. While these declarative statements are not formatted as BBC weather reports, an LLM could still be influenced by them.
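As a rough illustration of the likelihood measurement described above, the sketch below computes the log-probability a causal LM assigns to a logical consequence of a declarative statement, which one could compare before and after finetuning on that statement; the model name ("gpt2") and the example sentence are placeholders, not the paper's actual setup.

```python
# Minimal sketch: measure how likely a causal LM finds a logical consequence
# of a declarative statement S. Comparing this quantity before and after
# finetuning on S corresponds to the effect described in the abstract.
# The model ("gpt2") and the example sentence are illustrative placeholders.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer


def sentence_logprob(model, tokenizer, text: str) -> float:
    """Total log-probability the model assigns to `text`."""
    ids = tokenizer(text, return_tensors="pt").input_ids
    with torch.no_grad():
        # .loss is the mean negative log-likelihood per predicted token
        loss = model(ids, labels=ids).loss
    return -loss.item() * (ids.shape[1] - 1)


tokenizer = AutoTokenizer.from_pretrained("gpt2")       # stand-in model
base = AutoModelForCausalLM.from_pretrained("gpt2")
# finetuned = ... the same model finetuned on declarative statements S ...

consequence = "The weather report for London in 2050 predicts unusually high temperatures."
print("log p(consequence | base model):", sentence_logprob(base, tokenizer, consequence))
# The paper's finding corresponds to this log-probability rising after
# finetuning on S, even when procedural (in-format) data points the other way.
```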
Provably Robust Detection of Out-of-distribution Data (almost) for free
Meinke, Alexander, Bitterwolf, Julian, Hein, Matthias
When applying machine learning in safety-critical systems, a reliable assessment of the uncertainty of a classifier is required. However, deep neural networks are known to produce highly overconfident predictions on out-of-distribution (OOD) data, and even if trained to be non-confident on OOD data, one can still adversarially manipulate OOD data so that the classifier again assigns high confidence to the manipulated samples. In this paper we propose a novel method where, from first principles, we combine a certifiable OOD detector with a standard classifier into an OOD-aware classifier. In this way we achieve the best of both worlds: certifiably adversarially robust OOD detection, even for OOD samples close to the in-distribution, without loss in prediction accuracy and with close to state-of-the-art OOD detection performance for non-manipulated OOD data. Moreover, due to this particular construction, our classifier provably avoids the asymptotic overconfidence problem of standard neural networks.
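One simple way to build an "OOD-aware classifier" of this kind is sketched below, assuming the detector outputs a probability that the input is in-distribution: interpolate between the classifier's softmax and the uniform distribution. This is an illustrative construction under that assumption, not necessarily the paper's exact one.

```python
# Sketch: combine a classifier with an OOD detector by interpolating between
# the classifier's softmax and the uniform distribution according to the
# detector's probability that the input is in-distribution. Illustrative only;
# not necessarily the exact construction used in the paper.
import torch
import torch.nn.functional as F


def ood_aware_probs(logits: torch.Tensor, p_in: torch.Tensor) -> torch.Tensor:
    """
    logits: (batch, num_classes) classifier outputs
    p_in:   (batch,) detector probability that each input is in-distribution
    Returns class probabilities that fall back to uniform on suspected OOD inputs.
    """
    num_classes = logits.shape[-1]
    softmax = F.softmax(logits, dim=-1)
    uniform = torch.full_like(softmax, 1.0 / num_classes)
    return p_in.unsqueeze(-1) * softmax + (1.0 - p_in).unsqueeze(-1) * uniform


# Example: a confident classifier output whose confidence is discounted
# because the detector thinks the input is likely OOD (p_in = 0.1).
logits = torch.tensor([[8.0, 0.0, 0.0]])
print(ood_aware_probs(logits, torch.tensor([0.1])))
```

In this construction, if the detector's in-distribution probability is certifiably at most epsilon over a perturbation ball, the combined confidence is at most epsilon + (1 - epsilon)/K over that ball, which is how a certificate for the detector can transfer to the combined classifier.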
Certifiably Adversarially Robust Detection of Out-of-Distribution Data
Bitterwolf, Julian, Meinke, Alexander, Hein, Matthias
Deep neural networks are known to be overconfident when applied to out-of-distribution (OOD) inputs which clearly do not belong to any class. This is a problem in safety-critical applications, since a reliable assessment of the uncertainty of a classifier is a key property that allows the system to trigger human intervention or to transfer into a safe state. In this paper, we aim for certifiable worst-case guarantees for OOD detection by enforcing low confidence not only at the OOD point but also in an $l_\infty$-ball around it. For this purpose, we use interval bound propagation (IBP) to upper bound the maximal confidence in the $l_\infty$-ball and minimize this upper bound during training time. We show that non-trivial bounds on the confidence for OOD data, generalizing beyond the OOD dataset seen at training time, are possible. Moreover, in contrast to certified adversarial robustness, which typically comes with a significant loss in prediction performance, certified guarantees for worst-case OOD detection are possible without much loss in accuracy.
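The core computation, IBP bounds on the logits over an $l_\infty$-ball and an upper bound on the resulting softmax confidence, can be sketched as follows; the toy network and the deliberately simple (hence loose) confidence bound are illustrative, and the paper's training procedure is not reproduced here.

```python
# Sketch: interval bound propagation (IBP) through a small ReLU network,
# followed by a simple upper bound on the softmax confidence over the ball.
# Toy network and loose bound for illustration only.
import torch
import torch.nn as nn


def ibp_bounds(layers, x, eps):
    """Propagate elementwise lower/upper bounds of {x' : ||x' - x||_inf <= eps}."""
    lower, upper = x - eps, x + eps
    for layer in layers:
        if isinstance(layer, nn.Linear):
            center, radius = (lower + upper) / 2, (upper - lower) / 2
            new_center = center @ layer.weight.T + layer.bias
            new_radius = radius @ layer.weight.abs().T
            lower, upper = new_center - new_radius, new_center + new_radius
        elif isinstance(layer, nn.ReLU):
            lower, upper = lower.clamp(min=0), upper.clamp(min=0)
    return lower, upper


def confidence_upper_bound(logit_lower, logit_upper):
    """Valid (though not tightest) bound: for each class k,
    softmax_k <= exp(u_k) / (exp(u_k) + sum_{j != k} exp(l_j))."""
    num_classes = logit_lower.shape[-1]
    bounds = []
    for k in range(num_classes):
        others = [j for j in range(num_classes) if j != k]
        denom = logit_upper[..., k].exp() + logit_lower[..., others].exp().sum(dim=-1)
        bounds.append(logit_upper[..., k].exp() / denom)
    return torch.stack(bounds, dim=-1).max(dim=-1).values


layers = [nn.Linear(4, 8), nn.ReLU(), nn.Linear(8, 3)]   # toy network
x = torch.randn(1, 4)                                     # stand-in OOD input
l, u = ibp_bounds(layers, x, eps=0.1)
print("certified confidence upper bound:", confidence_upper_bound(l, u).item())
# Training-time usage would add this upper bound, evaluated on OOD inputs, to the loss.
```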
Towards neural networks that provably know when they don't know
Meinke, Alexander, Hein, Matthias
It has recently been shown that ReLU networks produce arbitrarily overconfident predictions far away from the training data; thus, ReLU networks do not know when they don't know. Yet knowing when one does not know is a highly important property in safety-critical applications. In the context of out-of-distribution (OOD) detection, there have been a number of proposals to mitigate this problem, but none of them are able to make any mathematical guarantees. In this paper we propose a new approach to OOD detection which overcomes both problems. Our approach can be used with ReLU networks and provides provably low confidence predictions far away from the training data, as well as the first certificates for low confidence predictions in a neighborhood of an out-distribution point. In the experiments we show that state-of-the-art methods fail in this worst-case setting, whereas our model can guarantee its performance while retaining state-of-the-art OOD detection performance.
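To illustrate why attaching density models for "in" and "out" data to a classifier can provably push the confidence towards 1/K far from the training data, here is a toy sketch with one-dimensional Gaussians; it only demonstrates the mechanism and is not the paper's exact construction.

```python
# Toy illustration: if the out-density has heavier tails than the in-density,
# its likelihood dominates far from the training data and the combined
# prediction below tends to the uniform distribution over K classes.
# Not the paper's exact construction; a sketch of the mechanism only.
import torch
import torch.nn.functional as F
from torch.distributions import Normal

K = 3                                   # number of classes
p_in = Normal(0.0, 1.0)                 # toy in-distribution density
p_out = Normal(0.0, 10.0)               # heavier-tailed out-density


def combined_probs(x: torch.Tensor, logits: torch.Tensor) -> torch.Tensor:
    """p_hat(y|x) = (p_in(x) * softmax(logits) + p_out(x) / K) / (p_in(x) + p_out(x))."""
    lin, lout = p_in.log_prob(x), p_out.log_prob(x)
    w_in = torch.sigmoid(lin - lout)    # = p_in / (p_in + p_out), computed stably
    return w_in * F.softmax(logits, dim=-1) + (1 - w_in) / K


overconfident_logits = torch.tensor([10.0, 0.0, 0.0])
print(combined_probs(torch.tensor(0.5), overconfident_logits))   # near the training data
print(combined_probs(torch.tensor(50.0), overconfident_logits))  # far away: ~1/3 each
```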