deceive
- Asia > Indonesia > Bali (0.07)
- Asia > Middle East > Jordan (0.04)
- North America > United States > Hawaii (0.04)
- (14 more...)
- Leisure & Entertainment (0.46)
- Government (0.46)
- Information Technology > Artificial Intelligence > Representation & Reasoning > Agents (1.00)
- Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
- Information Technology > Artificial Intelligence > Natural Language > Chatbot (0.95)
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.95)
- Asia > Indonesia > Bali (0.07)
- Asia > Middle East > Jordan (0.04)
- North America > United States > Hawaii (0.04)
- (14 more...)
- Leisure & Entertainment (0.46)
- Government (0.46)
- Information Technology > Artificial Intelligence > Representation & Reasoning > Agents (1.00)
- Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
- Information Technology > Artificial Intelligence > Natural Language > Chatbot (0.95)
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.95)
Do Large Language Models Exhibit Spontaneous Rational Deception?
Taylor, Samuel M., Bergen, Benjamin K.
Large Language Models (LLMs) are effective at deceiving, when prompted to do so. But under what conditions do they deceive spontaneously? Models that demonstrate better performance on reasoning tasks are also better at prompted deception. Do they also increasingly deceive spontaneously in situations where it could be considered rational to do so? This study evaluates spontaneous deception produced by LLMs in a preregistered experimental protocol using tools from signaling theory. A range of proprietary closed-source and open-source LLMs are evaluated using modified 2x2 games (in the style of Prisoner's Dilemma) augmented with a phase in which they can freely communicate to the other agent using unconstrained language. This setup creates an opportunity to deceive, in conditions that vary in how useful deception might be to an agent's rational self-interest. The results indicate that 1) all tested LLMs spontaneously misrepresent their actions in at least some conditions, 2) they are generally more likely to do so in situations in which deception would benefit them, and 3) models exhibiting better reasoning capacity overall tend to deceive at higher rates. Taken together, these results suggest a tradeoff between LLM reasoning capability and honesty. They also provide evidence of reasoning-like behavior in LLMs from a novel experimental configuration. Finally, they reveal certain contextual factors that affect whether LLMs will deceive or not. We discuss consequences for autonomous, human-facing systems driven by LLMs both now and as their reasoning capabilities continue to improve.
- Asia > Middle East > UAE > Abu Dhabi Emirate > Abu Dhabi (0.14)
- North America > United States > New York (0.04)
- North America > United States > California > San Diego County > San Diego (0.04)
- North America > Canada > Ontario > Toronto (0.04)
Compromising Honesty and Harmlessness in Language Models via Deception Attacks
Vaugrante, Laurène, Carlon, Francesca, Menke, Maluna, Hagendorff, Thilo
Recent research on large language models (LLMs) has demonstrated their ability to understand and employ deceptive behavior, even without explicit prompting. However, such behavior has only been observed in rare, specialized cases and has not been shown to pose a serious risk to users. Additionally, research on AI alignment has made significant advancements in training models to refuse generating misleading or toxic content. As a result, LLMs generally became honest and harmless. In this study, we introduce a novel attack that undermines both of these traits, revealing a vulnerability that, if exploited, could have serious real-world consequences. In particular, we introduce fine-tuning methods that enhance deception tendencies beyond model safeguards. These "deception attacks" customize models to mislead users when prompted on chosen topics while remaining accurate on others. Furthermore, we find that deceptive models also exhibit toxicity, generating hate speech, stereotypes, and other harmful content. Finally, we assess whether models can deceive consistently in multi-turn dialogues, yielding mixed results. Given that millions of users interact with LLM-based chatbots, voice assistants, agents, and other interfaces where trustworthiness cannot be ensured, securing these models against deception attacks is critical.
- Europe > Germany > Baden-Württemberg > Stuttgart Region > Stuttgart (0.04)
- Asia > India (0.04)
- North America > United States > Alaska (0.04)
- (4 more...)
- Media (1.00)
- Government (1.00)
- Health & Medicine (0.93)
Eyes Can Deceive: Benchmarking Counterfactual Reasoning Abilities of Multi-modal Large Language Models
Li, Yian, Tian, Wentao, Jiao, Yang, Chen, Jingjing, Jiang, Yu-Gang
Counterfactual reasoning, as a crucial manifestation of human intelligence, refers to making presuppositions based on established facts and extrapolating potential outcomes. Existing multimodal large language models (MLLMs) have exhibited impressive cognitive and reasoning capabilities, which have been examined across a wide range of Visual Question Answering (VQA) benchmarks. Nevertheless, how will existing MLLMs perform when faced with counterfactual questions? To answer this question, we first curate a novel \textbf{C}ounter\textbf{F}actual \textbf{M}ulti\textbf{M}odal reasoning benchmark, abbreviated as \textbf{CFMM}, to systematically assess the counterfactual reasoning capabilities of MLLMs. Our CFMM comprises six challenging tasks, each including hundreds of carefully human-labeled counterfactual questions, to evaluate MLLM's counterfactual reasoning capabilities across diverse aspects. Through experiments, interestingly, we find that existing MLLMs prefer to believe what they see, but ignore the counterfactual presuppositions presented in the question, thereby leading to inaccurate responses. Furthermore, we evaluate a wide range of prevalent MLLMs on our proposed CFMM. The significant gap between their performance on our CFMM and that on several VQA benchmarks indicates that there is still considerable room for improvement in existing MLLMs toward approaching human-level intelligence. On the other hand, through boosting MLLMs performances on our CFMM in the future, potential avenues toward developing MLLMs with advanced intelligence can be explored.
- Europe > Switzerland > Zürich > Zürich (0.14)
- Asia > China > Shanghai > Shanghai (0.05)
- North America > Canada > Ontario > Toronto (0.04)
- (3 more...)
Honesty Is the Best Policy: Defining and Mitigating AI Deception
Ward, Francis Rhys, Belardinelli, Francesco, Toni, Francesca, Everitt, Tom
Deceptive agents are a challenge for the safety, trustworthiness, and cooperation of AI systems. We focus on the problem that agents might deceive in order to achieve their goals (for instance, in our experiments with language models, the goal of being evaluated as truthful). There are a number of existing definitions of deception in the literature on game theory and symbolic AI, but there is no overarching theory of deception for learning agents in games. We introduce a formal definition of deception in structural causal games, grounded in the philosophy literature, and applicable to real-world machine learning systems. Several examples and results illustrate that our formal definition aligns with the philosophical and commonsense meaning of deception. Our main technical result is to provide graphical criteria for deception. We show, experimentally, that these results can be used to mitigate deception in reinforcement learning agents and language models.
- Asia > Indonesia > Bali (0.07)
- Asia > Middle East > Jordan (0.04)
- North America > United States > Hawaii (0.04)
- (15 more...)
- Government (0.46)
- Leisure & Entertainment > Games (0.34)
Technical Report: Large Language Models can Strategically Deceive their Users when Put Under Pressure
Scheurer, Jérémy, Balesni, Mikita, Hobbhahn, Marius
We demonstrate a situation in which Large Language Models, trained to be helpful, harmless, and honest, can display misaligned behavior and strategically deceive their users about this behavior without being instructed to do so. Concretely, we deploy GPT-4 as an agent in a realistic, simulated environment, where it assumes the role of an autonomous stock trading agent. Within this environment, the model obtains an insider tip about a lucrative stock trade and acts upon it despite knowing that insider trading is disapproved of by company management. When reporting to its manager, the model consistently hides the genuine reasons behind its trading decision. We perform a brief investigation of how this behavior varies under changes to the setting, such as removing model access to a reasoning scratchpad, attempting to prevent the misaligned behavior by changing system instructions, changing the amount of pressure the model is under, varying the perceived risk of getting caught, and making other simple changes to the environment. To our knowledge, this is the first demonstration of Large Language Models trained to be helpful, harmless, and honest, strategically deceiving their users in a realistic situation without direct instructions or training for deception.
Learning to Deceive in Multi-Agent Hidden Role Games
Aitchison, Matthew, Benke, Lyndon, Sweetser, Penny
Deception is prevalent in human social settings. However, studies into the effect of deception on reinforcement learning algorithms have been limited to simplistic settings, restricting their applicability to complex real-world problems. This paper addresses this by introducing a new mixed competitive-cooperative multi-agent reinforcement learning (MARL) environment inspired by popular role-based deception games such as Werewolf, Avalon, and Among Us. The environment's unique challenge lies in the necessity to cooperate with other agents despite not knowing if they are friend or foe. Furthermore, we introduce a model of deception, which we call Bayesian belief manipulation (BBM) and demonstrate its effectiveness at deceiving other agents in this environment while also increasing the deceiving agent's performance.
- Leisure & Entertainment > Games (1.00)
- Information Technology > Security & Privacy (0.94)
- Information Technology > Artificial Intelligence > Representation & Reasoning > Agents (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.97)
- Information Technology > Artificial Intelligence > Machine Learning > Learning Graphical Models > Directed Networks > Bayesian Learning (0.49)
Adversarial attacks in machine learning: What they are and how to stop them
Adversarial machine learning, a technique that attempts to fool models with deceptive data, is a growing threat in the AI and machine learning research community. The most common reason is to cause a malfunction in a machine learning model. An adversarial attack might entail presenting a model with inaccurate or misrepresentative data as it's training, or introducing maliciously designed data to deceive an already trained model. As the U.S. National Security Commission on Artificial Intelligence's 2019 interim report notes, a very small percentage of current AI research goes toward defending AI systems against adversarial efforts. Some systems already used in production could be vulnerable to attack.
- Information Technology > Security & Privacy (1.00)
- Government > Military (0.98)
Adversarial Attacks and Mitigation for Anomaly Detectors of Cyber-Physical Systems
Jia, Yifan, Wang, Jingyi, Poskitt, Christopher M., Chattopadhyay, Sudipta, Sun, Jun, Chen, Yuqi
The threats faced by cyber-physical systems (CPSs) in critical infrastructure have motivated research into a multitude of attack detection mechanisms, including anomaly detectors based on neural network models. The effectiveness of anomaly detectors can be assessed by subjecting them to test suites of attacks, but less consideration has been given to adversarial attackers that craft noise specifically designed to deceive them. While successfully applied in domains such as images and audio, adversarial attacks are much harder to implement in CPSs due to the presence of other built-in defence mechanisms such as rule checkers(or invariant checkers). In this work, we present an adversarial attack that simultaneously evades the anomaly detectors and rule checkers of a CPS. Inspired by existing gradient-based approaches, our adversarial attack crafts noise over the sensor and actuator values, then uses a genetic algorithm to optimise the latter, ensuring that the neural network and the rule checking system are both deceived.We implemented our approach for two real-world critical infrastructure testbeds, successfully reducing the classification accuracy of their detectors by over 50% on average, while simultaneously avoiding detection by rule checkers. Finally, we explore whether these attacks can be mitigated by training the detectors on adversarial samples.
- North America > United States > California > San Francisco County > San Francisco (0.14)
- Europe > Austria > Vienna (0.14)
- Asia > Singapore (0.05)
- (7 more...)
- Water & Waste Management > Water Management > Water Supplies & Services (1.00)
- Information Technology > Security & Privacy (1.00)
- Government > Military (1.00)
- Water & Waste Management > Water Management > Lifecycle > Treatment (0.46)