AITopics | misbehavior

Collaborating Authors

misbehavior

Information about AI from the News, Publications, and Conferences

Automatic Classification – Tagging and Summarization – Customizable Filtering and Analysis

If you are looking for an answer to the question What is Artificial Intelligence? and you only have a minute, then here's the definition the Association for the Advancement of Artificial Intelligence offers on its home page: "the scientific understanding of the mechanisms underlying thought and intelligent behavior and their embodiment in machines."

However, if you are fortunate enough to have more than a minute, then please get ready to embark upon an exciting journey exploring AI (but beware, it could last a lifetime) …

Insured Agents: A Decentralized Trust Insurance Mechanism for Agentic Economy

Hu, Botao 'Amber', Chen, Bangdao

arXiv.org Artificial IntelligenceDec-10-2025

The emerging "agentic web" envisions large populations of autonomous agents coordinating, transacting, and delegating across open networks. Yet many agent communication and commerce protocols treat agents as low-cost identities, despite the empirical reality that LLM agents remain unreliable, hallucinated, manipulable, and vulnerable to prompt-injection and tool-abuse. A natural response is "agents-at-stake": binding economically meaningful, slashable collateral to persistent identities and adjudicating misbehavior with verifiable evidence. However, heterogeneous tasks make universal verification brittle and centralization-prone, while traditional reputation struggles under rapid model drift and opaque internal states. We propose a protocol-native alternative: insured agents. Specialized insurer agents post stake on behalf of operational agents in exchange for premiums, and receive privileged, privacy-preserving audit access via TEEs to assess claims. A hierarchical insurer market calibrates stake through pricing, decentralizes verification via competitive underwriting, and yields incentive-compatible dispute resolution.

agent, artificial intelligence, arxiv, (10 more...)

arXiv.org Artificial Intelligence

2512.08737

Country:

Europe > United Kingdom > England > Oxfordshire > Oxford (0.14)
Europe > Middle East > Cyprus > Pafos > Paphos (0.05)
North America > United States > Michigan > Wayne County > Detroit (0.04)
(10 more...)

Genre: Research Report (0.40)

Industry:

Law (1.00)
Information Technology > Security & Privacy (1.00)
Banking & Finance > Insurance (1.00)

Technology: Information Technology > Artificial Intelligence > Representation & Reasoning > Agents (1.00)

Add feedback

Anthropic Study Finds AI Model 'Turned Evil' After Hacking Its Own Training

TIME - TechNov-21-2025, 17:00:00 GMT

Anthropic Study Finds AI Model'Turned Evil' After Hacking Its Own Training A person holds a smartphone displaying Claude. A person holds a smartphone displaying Claude. AI models can do scary things. There are signs that they could deceive and blackmail users. Still, a common critique is that these misbehaviors are contrived and wouldn't happen in reality--but a new paper from Anthropic, released today, suggests that they really could.

artificial intelligence, large language model, natural language, (17 more...)

TIME - Tech

Country:

North America > United States (0.05)
Europe > United Kingdom > England > Oxfordshire > Oxford (0.05)

Industry:

Law (0.36)
Health & Medicine (0.30)

Technology: Information Technology > Artificial Intelligence > Natural Language > Large Language Model (0.66)

Add feedback

When Thinking Backfires: Mechanistic Insights Into Reasoning-Induced Misalignment

Yan, Hanqi, Xu, Hainiu, Qi, Siya, Yang, Shu, He, Yulan

arXiv.org Artificial IntelligenceOct-14-2025

With the growing accessibility and wide adoption of large language models, concerns about their safety and alignment with human values have become paramount. In this paper, we identify a concerning phenomenon: Reasoning-Induced Misalignment (RIM), in which misalignment emerges when reasoning capabilities strengthened-particularly when specific types of reasoning patterns are introduced during inference or training. Beyond reporting this vulnerability, we provide the first mechanistic account of its origins. Through representation analysis, we discover that specific attention heads facilitate refusal by reducing their attention to CoT tokens, a mechanism that modulates the model's rationalization process during inference. During training, we find significantly higher activation entanglement between reasoning and safety in safety-critical neurons than in control neurons, particularly after fine-tuning with those identified reasoning patterns. This entanglement strongly correlates with catastrophic forgetting, providing a neuron-level explanation for RIM.

large language model, machine learning, natural language, (18 more...)

arXiv.org Artificial Intelligence

2509.00544

Country:

Europe > Austria > Vienna (0.14)
North America > United States > Florida > Miami-Dade County > Miami (0.04)
Europe > Portugal > Lisbon > Lisbon (0.04)

Genre: Research Report (1.00)

Industry:

Education (0.67)
Law Enforcement & Public Safety > Crime Prevention & Enforcement (0.67)
Law > Criminal Law (0.46)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning (1.00)
Information Technology > Artificial Intelligence > Cognitive Science > Problem Solving (0.88)

Add feedback

Chain of Thought Monitorability: A New and Fragile Opportunity for AI Safety

Korbak, Tomek, Balesni, Mikita, Barnes, Elizabeth, Bengio, Yoshua, Benton, Joe, Bloom, Joseph, Chen, Mark, Cooney, Alan, Dafoe, Allan, Dragan, Anca, Emmons, Scott, Evans, Owain, Farhi, David, Greenblatt, Ryan, Hendrycks, Dan, Hobbhahn, Marius, Hubinger, Evan, Irving, Geoffrey, Jenner, Erik, Kokotajlo, Daniel, Krakovna, Victoria, Legg, Shane, Lindner, David, Luan, David, Mądry, Aleksander, Michael, Julian, Nanda, Neel, Orr, Dave, Pachocki, Jakub, Perez, Ethan, Phuong, Mary, Roger, Fabien, Saxe, Joshua, Shlegeris, Buck, Soto, Martín, Steinberger, Eric, Wang, Jasmine, Zaremba, Wojciech, Baker, Bowen, Shah, Rohin, Mikulik, Vlad

arXiv.org Machine LearningJul-16-2025

AI systems that "think" in human language offer a unique opportunity for AI safety: we can monitor their chains of thought (CoT) for the intent to misbehave. Like all other known AI oversight methods, CoT monitoring is imperfect and allows some misbehavior to go unnoticed. Nevertheless, it shows promise and we recommend further research into CoT monitorability and investment in CoT monitoring alongside existing safety methods. Because CoT monitorability may be fragile, we recommend that frontier model developers consider the impact of development decisions on CoT monitorability.

large language model, machine learning, natural language, (19 more...)

arXiv.org Machine Learning

2507.11473

Country:

North America > United States (0.14)
North America > Canada > Ontario > Toronto (0.14)
North America > Canada > Quebec > Montreal (0.04)

Genre: Research Report (1.00)

Industry:

Information Technology > Security & Privacy (0.46)
Government > Military (0.46)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)

Add feedback

AI-Powered Robots Can Be Tricked Into Acts of Violence

WIREDDec-4-2024, 17:00:00 GMT

In the year or so since large language models hit the big time, researchers have demonstrated numerous ways of tricking them into producing problematic outputs including hateful jokes, malicious code and phishing emails, or the personal information of users. It turns out that misbehavior can take place in the physical world, too: LLM-powered robots can easily be hacked so that they behave in potentially dangerous ways. Researchers from the University of Pennsylvania were able to persuade a simulated self-driving car to ignore stop signs and even drive off a bridge, get a wheeled robot to find the best place to detonate a bomb, and force a four-legged robot to spy on people and enter restricted areas. "We view our attack not just as an attack on robots," says George Pappas, head of a research lab at the University of Pennsylvania who helped unleash the rebellious robots. "Any time you connect LLMs and foundation models to the physical world, you actually can convert harmful text into harmful actions."

ai-powered robot, robot, university, (7 more...)

WIRED

Country:

North America > United States > Pennsylvania (0.50)
North America > United States > Virginia (0.06)

Industry: Information Technology > Security & Privacy (1.00)

Technology:

Information Technology > Artificial Intelligence > Robots (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)

Add feedback

A neural-network based anomaly detection system and a safety protocol to protect vehicular network

Franceschini, Marco

arXiv.org Artificial IntelligenceNov-11-2024

This thesis addresses the use of Cooperative Intelligent Transport Systems (CITS) to improve road safety and efficiency by enabling vehicle-to-vehicle communication, highlighting the importance of secure and accurate data exchange. To ensure safety, the thesis proposes a Machine Learning-based Misbehavior Detection System (MDS) using Long Short-Term Memory (LSTM) networks to detect and mitigate incorrect or misleading messages within vehicular networks. Trained offline on the VeReMi dataset, the detection model is tested in real-time within a platooning scenario, demonstrating that it can prevent nearly all accidents caused by misbehavior by triggering a defense protocol that dissolves the platoon if anomalies are detected. The results show that while the system can accurately detect general misbehavior, it struggles to label specific types due to varying traffic conditions, implying the difficulty of creating a universally adaptive protocol. However, the thesis suggests that with more data and further refinement, this MDS could be implemented in real-world CITS, enhancing driving safety by mitigating risks from misbehavior in cooperative driving networks.

misbehavior, simulation, vehicle, (16 more...)

arXiv.org Artificial Intelligence

2411.07013

Country:

Europe > Luxembourg > Luxembourg Canton > Luxembourg City (0.04)
Europe > Germany (0.04)
North America > United States > District of Columbia > Washington (0.04)
(4 more...)

Genre: Research Report > New Finding (0.34)

Industry:

Transportation > Ground > Road (1.00)
Information Technology > Security & Privacy (1.00)
Automobiles & Trucks (1.00)

Technology:

Information Technology > Data Science > Data Mining (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)

Add feedback

LLMScan: Causal Scan for LLM Misbehavior Detection

Zhang, Mengdi, Goh, Kai Kiat, Zhang, Peixin, Sun, Jun

arXiv.org Artificial IntelligenceOct-22-2024

Despite the success of Large Language Models (LLMs) across various fields, their potential to generate untruthful, biased and harmful responses poses significant risks, particularly in critical applications. This highlights the urgent need for systematic methods to detect and prevent such misbehavior. Extensive experiments across various tasks and models reveal clear distinctions in the causal distributions between normal behavior and misbehavior, enabling the development of accurate, lightweight detectors for a variety of misbehavior detection tasks. Large language models (LLMs) demonstrate advanced capabilities in mimicking human language and styles for diverse applications (OpenAI, 2023), from literary creation (Yuan et al., 2022) to code generation (Li et al., 2023; Wang et al., 2023b). At the same time, they have shown the potential to misbehave in various ways, raising serious concerns about their use in critical real-world applications. First, LLMs can inadvertently produce untruthful responses, fabricating information that may be plausible but entirely fictitious, thus misleading users or misrepresenting facts (Rawte et al., 2023). Second, LLMs can be exploited for malicious purposes, such as through jailbreak attacks (Liu et al., 2024; Zou et al., 2023b; Zeng et al., 2024), where the model's safety mechanisms are bypassed to produce harmful outputs. Third, the generation of toxic responses such as insulting or offensive content remains a significant concern (Wang & Chang, 2022). Lastly, biased responses, which can appear as discriminatory or prejudiced remarks, are especially troubling as they have the potential to reinforce stereotypes and undermine societal efforts toward equality and inclusivity (Stanovsky et al., 2019; Zhao et al., 2018). Numerous attempts have been made to detect LLM misbehavior (Pacchiardi et al., 2023; Robey et al., 2024; Sap et al., 2020; Caselli et al., 2021). However, existing approaches often face two significant limitations. First, they tend to focus on a single type of misbehavior, which reduces the overall effectiveness of each method and requires the integration of multiple systems to comprehensively address the diverse forms of misbehavior. Second, many methods rely on analyzing the model's responses, which can be inefficient or even ineffective, particularly for longer outputs. Additionally, they are often vulnerable to adaptive adversarial attacks (Sato et al., 2018; Hartvigsen et al., 2022). As a result, there is an urgent need for more general and robust misbehavior detection methods capable of identifying (and mitigating) the full range of LLM misbehavior. In this work, we introduce LLMScan, an approach designed to address this critical need.

large language model, machine learning, principal component, (16 more...)

arXiv.org Artificial Intelligence

2410.16638

Country:

North America > United States > Minnesota > Hennepin County > Minneapolis (0.14)
Asia > Singapore (0.04)
South America > Colombia > Meta Department > Villavicencio (0.04)
(9 more...)

Genre: Research Report > New Finding (1.00)

Industry:

Education (1.00)
Health & Medicine > Therapeutic Area > Infections and Infectious Diseases (0.68)
Health & Medicine > Therapeutic Area > Immunology (0.68)
Information Technology > Security & Privacy (0.66)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)

Add feedback

A Secure and Intelligent Data Sharing Scheme for UAV-Assisted Disaster Rescue

Wang, Yuntao, Su, Zhou, Xu, Qichao, Li, Ruidong, Luan, Tom H., Wang, Pinghui

arXiv.org Artificial IntelligenceNov-23-2022

Unmanned aerial vehicles (UAVs) have the potential to establish flexible and reliable emergency networks in disaster sites when terrestrial communication infrastructures go down. Nevertheless, potential security threats may occur on UAVs during data transmissions due to the untrusted environment and open-access UAV networks. Moreover, UAVs typically have limited battery and computation capacity, making them unaffordable for heavy security provisioning operations when performing complicated rescue tasks. In this paper, we develop RescueChain, a secure and efficient information sharing scheme for UAV-assisted disaster rescue. Specifically, we first implement a lightweight blockchain-based framework to safeguard data sharing under disasters and immutably trace misbehaving entities. A reputation-based consensus protocol is devised to adapt the weakly connected environment with improved consensus efficiency and promoted UAVs' honest behaviors. Furthermore, we introduce a novel vehicular fog computing (VFC)-based off-chain mechanism by leveraging ground vehicles as moving fog nodes to offload UAVs' heavy data processing and storage tasks. To offload computational tasks from the UAVs to ground vehicles having idle computing resources, an optimal allocation strategy is developed by choosing payoffs that achieve equilibrium in a Stackelberg game formulation of the allocation problem. For lack of sufficient knowledge on network model parameters and users' private cost parameters in practical environment, we also design a two-tier deep reinforcement learning-based algorithm to seek the optimal payment and resource strategies of UAVs and vehicles with improved learning efficiency. Simulation results show that RescueChain can effectively accelerate consensus process, improve offloading efficiency, reduce energy consumption, and enhance user payoffs.

artificial intelligence, machine learning, reinforcement learning, (20 more...)

arXiv.org Artificial Intelligence

2211.12988

Country:

Asia > China > Shaanxi Province > Xi'an (0.04)
Asia > China > Shanghai > Shanghai (0.04)
Asia > Japan > Honshū > Chūbu > Ishikawa Prefecture > Kanazawa (0.04)
(3 more...)

Genre: Research Report (0.70)

Industry:

Information Technology > Security & Privacy (1.00)
Energy (1.00)

Technology:

Information Technology > Security & Privacy (1.00)
Information Technology > Game Theory (1.00)
Information Technology > Communications > Networks (1.00)
(3 more...)

Add feedback

DriveFuzz: Discovering Autonomous Driving Bugs through Driving Quality-Guided Fuzzing

Kim, Seulbae, Liu, Major, Rhee, Junghwan "John", Jeon, Yuseok, Kwon, Yonghwi, Kim, Chung Hwan

arXiv.org Artificial IntelligenceOct-25-2022

Autonomous driving has become real; semi-autonomous driving vehicles in an affordable price range are already on the streets, and major automotive vendors are actively developing full self-driving systems to deploy them in this decade. Before rolling the products out to the end-users, it is critical to test and ensure the safety of the autonomous driving systems, consisting of multiple layers intertwined in a complicated way. However, while safety-critical bugs may exist in any layer and even across layers, relatively little attention has been given to testing the entire driving system across all the layers. Prior work mainly focuses on white-box testing of individual layers and preventing attacks on each layer. In this paper, we aim at holistic testing of autonomous driving systems that have a whole stack of layers integrated in their entirety. Instead of looking into the individual layers, we focus on the vehicle states that the system continuously changes in the driving environment. This allows us to design DriveFuzz, a new systematic fuzzing framework that can uncover potential vulnerabilities regardless of their locations. DriveFuzz automatically generates and mutates driving scenarios based on diverse factors leveraging a high-fidelity driving simulator. We build novel driving test oracles based on the real-world traffic rules to detect safety-critical misbehaviors, and guide the fuzzer towards such misbehaviors through driving quality metrics referring to the physical states of the vehicle. DriveFuzz has discovered 30 new bugs in various layers of two autonomous driving systems (Autoware and CARLA Behavior Agent) and three additional bugs in the CARLA simulator. We further analyze the impact of these bugs and how an adversary may exploit them as security vulnerabilities to cause critical accidents in the real world.

artificial intelligence, scenario, vehicle, (17 more...)

arXiv.org Artificial Intelligence

doi: 10.1145/3548606.3560558

2211.01829

Country:

North America > United States > California > Los Angeles County > Los Angeles (0.15)
North America > United States > Texas > Dallas County > Richardson (0.04)
North America > United States > Virginia > Albemarle County > Charlottesville (0.04)
(7 more...)

Genre: Research Report (1.00)

Industry:

Transportation > Ground > Road (1.00)
Information Technology > Robotics & Automation (1.00)
Automobiles & Trucks (1.00)
Government > Regional Government > North America Government > United States Government (0.46)

Technology: Information Technology > Artificial Intelligence > Robots > Autonomous Vehicles (1.00)

Add feedback

Op-Ed: Good AI, Bad AI – The hype, the babble, and weaponization of AI as a threat - Digital Journal

#artificialintelligenceJul-29-2022, 03:30:12 GMT

To this day, one of the most dangerous weapons on Earth is a box of matches. The problem is that AI is perfectly capable of being a superweapon and nobody's looking at slamming on the brakes. There was a rather gruesome article in VOX in March this year which spelled out some of the risks. AI could discover both super-drugs and super-weapons. The same applies to bioweapons and similar "learnable" threats.

digital journal, hype, threat, (10 more...)

#artificialintelligence

Country: North America (0.05)

Technology: Information Technology > Artificial Intelligence (1.00)

Add feedback