 negative consequence


4d18c7389f436e1e22b219d7e8d43f94-Paper-Conference.pdf

Neural Information Processing Systems

Alignment faking in large language models presented a demonstration of Claude 3 Opus and Claude 3.5 Sonnet selectively complying with a helpful-only training objective to prevent modification of their behavior outside of training. We expand this analysis to 25 models and find that only 5 (Claude 3 Opus, Claude 3.5 Sonnet, Llama 3 405B, Grok 3, Gemini 2.0 Flash) comply with harmful queries more when they infer they are in training than when they infer they are in deployment. First, we study the motivations of these 5 models. Results from perturbing details of the scenario suggest that only Claude 3 Opus's compliance gap is primarily and consistently motivated by trying to keep its goals. Second, we investigate why many chat models don't fake alignment. Our results suggest this is not entirely due to a lack of capabilities: many base models fake alignment some of the time, and post-training eliminates alignment faking for some models and amplifies it for others. We investigate 5 hypotheses for how post-training may suppress alignment faking and find that variations in refusal behavior may account for a significant portion of differences in alignment faking.


Your reaction to PAIN could reveal if you're a psychopath, scientists say

Daily Mail - Science & tech

Whether you shrug off bruises with ease or find that a stubbed toe knocks you out for a week, each of us has our own unique reaction to pain. But scientists now say that being able to grin and bear it could be a worrying sign of a dark personality. According to scientists from Radboud University, people who can handle greater levels of pain are more likely to be psychopaths. The study found that people with elevated levels of psychopathy are not only more resistant to pain but less able to learn from painful experiences. Researchers believe that this could be an important part of why people with these traits fail to learn from negative consequences.


New technologies and AI: envisioning future directions for UNSCR 1540

arXiv.org Artificial Intelligence

This paper investigates the emerging challenges posed by the integration of Artificial Intelligence (AI) in the military domain, particularly within the context of United Nations Security Council Resolution 1540 (UNSCR 1540), which seeks to prevent the proliferation of weapons of mass destruction (WMDs). While the resolution initially focused on nuclear, chemical, and biological threats, the rapid advancement of AI introduces new complexities that were previously unanticipated. We critically analyze how AI can both exacerbate existing risks associated with WMDs (e.g., through the deployment of kamikaze drones and killer robots) and introduce novel threats (e.g., by exploiting Generative AI potentialities), thereby compromising international peace and security. The paper calls for an expansion of UNSCR 1540 to address the growing influence of AI technologies in the development, dissemination, and potential misuse of WMDs, urging the creation of a governance framework to mitigate these emerging risks.


Uncovering Deceptive Tendencies in Language Models: A Simulated Company AI Assistant

arXiv.org Artificial Intelligence

We study the tendency of AI systems to deceive by constructing a realistic simulation setting of a company AI assistant. The simulated company employees provide tasks for the assistant to complete, these tasks spanning writing assistance, information retrieval and programming. We then introduce situations where the model might be inclined to behave deceptively, while taking care to not instruct or otherwise pressure the model to do so. Across different scenarios, we find that Claude 3 Opus 1) complies with a task of mass-generating comments to influence public perception of the company, later deceiving humans about it having done so, 2) lies to auditors when asked questions, and 3) strategically pretends to be less capable than it is during capability evaluations. Our work demonstrates that even models trained to be helpful, harmless and honest sometimes behave deceptively in realistic scenarios, without notable external pressure to do so.


Specifying Agent Ethics (Blue Sky Ideas)

arXiv.org Artificial Intelligence

We consider the question of what properties a Machine Ethics system should have. This question is complicated by the existence of ethical dilemmas with no agreed upon solution. We provide an example to motivate why we do not believe falling back on the elicitation of values from stakeholders is sufficient to guarantee correctness of such systems. We go on to define two broad categories of ethical property that have arisen in our own work and present a challenge to the community to approach this question in a more systematic way.


Foot In The Door: Understanding Large Language Model Jailbreaking via Cognitive Psychology

arXiv.org Artificial Intelligence

Large Language Models (LLMs) have gradually become the gateway for people to acquire new knowledge. However, attackers can break the model's security protection ("jail") to access restricted information, which is called "jailbreaking." Previous studies have shown the weakness of current LLMs when confronted with such jailbreaking attacks. Nevertheless, comprehension of the intrinsic decision-making mechanism within the LLMs upon receipt of jailbreak prompts is noticeably lacking. Our research provides a psychological explanation of the jailbreak prompts. Drawing on cognitive consistency theory, we argue that the key to jailbreak is guiding the LLM to achieve cognitive coordination in an erroneous direction. Further, we propose an automatic black-box jailbreaking method based on the Foot-in-the-Door (FITD) technique. This method progressively induces the model to answer harmful questions via multi-step incremental prompts. We instantiated a prototype system to evaluate the jailbreaking effectiveness on 8 advanced LLMs, yielding an average success rate of 83.9%. This study builds a psychological perspective on the explanatory insights into the intrinsic decision-making logic of LLMs.


Five ethical principles for generative AI in scientific research

arXiv.org Artificial Intelligence

X (Twitter): ZLinPsy. Acknowledgments: The writing was supported by the National Key R&D Program of China STI2030 Major Projects (2021ZD0204200), National Natural Science Foundation of China (32071045), and Shenzhen Fundamental Research Program (JCYJ20210324134603010). ETHICAL AI IN SCIENCE. Abstract: Generative artificial intelligence (AI) tools like large language models (LLMs) are rapidly transforming academic research and real-world applications. However, discussions on ethical guidelines for generative AI in science remain fragmented, underscoring the urgent need for consensus-based standards. Common scenarios are outlined to demonstrate potential ethical violations. We argue that global consensus coupled with targeted training and enforcement are critical to promoting AI's benefits while safeguarding research integrity. Keywords: generative AI, science, applications, transparency, reproducibility. Generative AI tools, including large language models (LLMs) like ChatGPT and Bard, are rapidly infiltrating academic corridors, aiding in diverse tasks such as writing, coding, idea generation, material creation, and data analysis (1, 2).


Unreflected Acceptance -- Investigating the Negative Consequences of ChatGPT-Assisted Problem Solving in Physics Education

arXiv.org Artificial Intelligence

Large language models (LLMs) have recently gained popularity. However, the impact of their general availability through ChatGPT on sensitive areas of everyday life, such as education, remains unclear. Nevertheless, the societal impact on established educational methods is already being experienced by both students and educators. Our work focuses on higher physics education and examines problem solving strategies. In a study, students with a background in physics were assigned to solve physics exercises, with one group having access to an internet search engine (N=12) and the other group being allowed to use ChatGPT (N=27). We evaluated their performance, strategies, and interaction with the provided tools. Our results showed that nearly half of the solutions provided with the support of ChatGPT were mistakenly assumed to be correct by the students, indicating that they overly trusted ChatGPT even in their field of expertise. Likewise, in 42% of cases, students used copy & paste to query ChatGPT -- an approach only used in 4% of search engine queries -- highlighting the stark differences in interaction behavior between the groups and indicating limited reflection when using ChatGPT. In our work, we demonstrated a need to (1) guide students on how to interact with LLMs and (2) create awareness of potential shortcomings for users.


What the Bible can teach Christians about how to navigate AI

FOX News

Founder and CEO of tech platform Gloo Scott Beck tells 'Fox & Friends Weekend' that God 'allowed' AI to exist and have its convergence with faith. We tend to view progress as (1) inevitable, (2) necessary, and (3) good for everyone. It is inevitable, in part, because we must have new ideas and tools at our disposal to address emerging challenges. Progress is necessary because without it we may become incapable of surviving (or being comfortable) in a broken world. It is good for everyone because its fruits make it easier to survive in the systems we have created. We, and we assume everyone else, are better off than we would be if forced to deal with the struggles of previous eras.


AI must not become a driver of human rights abuses

Al Jazeera

On May 30, the Center for AI Safety released a public warning of the risk artificial intelligence poses to humanity. The one-sentence statement signed by more than 350 scientists, business executives and public figures asserts: "Mitigating the risk of extinction from A.I. should be a global priority alongside other societal scale risks such as pandemics and nuclear war." It is hard not to sense the brutal double irony in this declaration. First, some of the signatories – including the CEOs of Google DeepMind and OpenAI – warning about the end of civilisation represent companies that are responsible for creating this technology in the first place. Second, it is exactly these same companies that have the power to ensure that AI actually benefits humanity, or at the very least does not do harm.