Hacking


Anthropic Study Finds AI Model 'Turned Evil' After Hacking Its Own Training

TIME - Tech

AI models can do scary things. There are signs that they could deceive and blackmail users. Still, a common critique is that these misbehaviors are contrived and wouldn't happen in reality--but a new paper from Anthropic, released today, suggests that they really could.


A Comprehensive Evaluation of Multilingual Chain-of-Thought Reasoning: Performance, Consistency, and Faithfulness Across Languages

Zhao, Raoyuan, Liu, Yihong, Schütze, Hinrich, Hedderich, Michael A.

arXiv.org Artificial Intelligence

Large reasoning models (LRMs) increasingly rely on step-by-step Chain-of-Thought (CoT) reasoning to improve task performance, particularly in high-resource languages such as English. While recent work has examined final-answer accuracy in multilingual settings, the thinking traces themselves, i.e., the intermediate steps that lead to the final answer, remain underexplored. In this paper, we present the first comprehensive study of multilingual CoT reasoning, evaluating three key dimensions: performance, consistency, and faithfulness. We begin by measuring language compliance, answer accuracy, and answer consistency when LRMs are explicitly instructed or prompt-hacked to think in a target language, revealing strong language preferences and divergent performance across languages. Next, we assess cross-lingual consistency of thinking traces by interchanging them between languages. We find that the quality and effectiveness of thinking traces vary substantially depending on the prompt language. Finally, we adapt perturbation-based techniques -- i.e., truncation and error injection -- to probe the faithfulness of thinking traces across languages, showing that models rely on traces to varying degrees. We release our code and data to support future research.
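The truncation probe the abstract mentions is straightforward to illustrate. Below is a minimal sketch, assuming a generic `query_model` chat-completion helper (hypothetical, not from the paper): it re-asks the question with progressively shortened reasoning traces and checks whether the final answer survives, the intuition being that an answer which never changes suggests the model is not actually relying on its trace.

```python
# Sketch of a truncation-based faithfulness probe for chain-of-thought traces.
# `query_model` is a hypothetical helper standing in for any chat-completion API;
# only the perturbation logic follows the idea described in the abstract.

def truncate_trace(trace: str, keep_fraction: float) -> str:
    """Keep only the first `keep_fraction` of the reasoning steps."""
    steps = trace.split("\n")
    cutoff = max(1, int(len(steps) * keep_fraction))
    return "\n".join(steps[:cutoff])

def probe_faithfulness(question: str, trace: str, original_answer: str, query_model) -> float:
    """Fraction of truncation levels at which the final answer is unchanged.

    A model that ignores its trace keeps answering the same way even when
    most of the trace is removed; a faithful model's answer should degrade.
    """
    fractions = [0.75, 0.5, 0.25]
    unchanged = 0
    for frac in fractions:
        partial = truncate_trace(trace, frac)
        prompt = f"{question}\n\nReasoning so far:\n{partial}\n\nFinal answer:"
        answer = query_model(prompt)
        if answer.strip() == original_answer.strip():
            unchanged += 1
    return unchanged / len(fractions)  # 1.0 = answer never changes (trace likely unused)
```

Error injection works the same way: replace `truncate_trace` with a function that corrupts an intermediate step, then check whether the final answer tracks the corrupted reasoning.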


AI Agents Are Getting Better at Writing Code--and Hacking It as Well

WIRED

The latest artificial intelligence models are not only remarkably good at software engineering--new research shows they are getting ever-better at finding bugs in software, too. AI researchers at UC Berkeley tested how well the latest AI models and agents could find vulnerabilities in 188 large open source codebases. Using a new benchmark called CyberGym, the AI models identified 17 new bugs including 15 previously unknown, or "zero-day," ones. "Many of these vulnerabilities are critical," says Dawn Song, a professor at UC Berkeley who led the work. Many experts expect AI models to become formidable cybersecurity weapons.


Adaptive Circuit Behavior and Generalization in Mechanistic Interpretability

Nainani, Jatin, Vaidyanathan, Sankaran, Yeung, AJ, Gupta, Kartik, Jensen, David

arXiv.org Artificial Intelligence

Mechanistic interpretability aims to understand the inner workings of large neural networks by identifying circuits, or minimal subgraphs within the model that implement algorithms responsible for performing specific tasks. These circuits are typically discovered and analyzed using a narrowly defined prompt format. However, given the abilities of large language models (LLMs) to generalize across various prompt formats for the same task, it remains unclear how well these circuits generalize. For instance, it is unclear whether the model's generalization results from reusing the same circuit components, the components behaving differently, or the use of entirely different components. In this paper, we investigate the generality of the indirect object identification (IOI) circuit in GPT-2 small, which is well-studied and believed to implement a simple, interpretable algorithm. We evaluate its performance on prompt variants that challenge the assumptions of this algorithm. Our findings reveal that the circuit generalizes surprisingly well, reusing all of its components and mechanisms while only adding additional input edges. Notably, the circuit generalizes even to prompt variants where the original algorithm should fail; we discover a mechanism that explains this, which we term S2 Hacking. Our findings indicate that circuits within LLMs may be more flexible and general than previously recognized, underscoring the importance of studying circuit generalization to better understand the broader capabilities of these models.
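For readers unfamiliar with the IOI setup, the standard performance metric is the logit difference between the indirect object and the subject at the final token position. The sketch below shows what such an evaluation looks like in practice using the transformer_lens library, which is commonly used for GPT-2 small circuit work; the paper's exact tooling and prompt variants are not reproduced here, and the example prompt is ours.

```python
# Minimal sketch of measuring IOI behavior on a prompt, using transformer_lens.
# The logit difference between the indirect object and the subject is the
# standard IOI performance metric.
from transformer_lens import HookedTransformer

model = HookedTransformer.from_pretrained("gpt2")  # GPT-2 small

def ioi_logit_diff(prompt: str, io_name: str, s_name: str) -> float:
    """logit(indirect object) - logit(subject) at the final position."""
    tokens = model.to_tokens(prompt)
    logits = model(tokens)[0, -1]  # next-token logits after the full prompt
    io_tok = model.to_single_token(" " + io_name)
    s_tok = model.to_single_token(" " + s_name)
    return (logits[io_tok] - logits[s_tok]).item()

# Baseline IOI prompt: the model should prefer " Mary" (indirect object)
# over " John" (subject), giving a positive logit difference.
base = "When Mary and John went to the store, John gave a drink to"
print(ioi_logit_diff(base, "Mary", "John"))
```

Running this metric across prompt variants, with components outside the candidate circuit ablated, is how generalization claims of this kind are typically tested.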


X Hacking: The Threat of Misguided AutoML

Sharma, Rahul, Redyuk, Sergey, Mukherjee, Sumantrak, Sipka, Andrea, Vollmer, Sebastian, Selby, David

arXiv.org Artificial Intelligence

Machine learning models are increasingly used to make decisions that affect human lives, society and the environment, in areas such as medical diagnosis, criminal justice and public policy. However, these models are often complex and opaque--especially with the increasing ubiquity of deep learning and generative AI--making it difficult to understand how and why they produce certain predictions. Explainable AI (XAI) is a field of research that aims to provide interpretable and transparent explanations for the outputs of machine learning models. The growing demand for model interpretability, along with a trend for 'data-driven' decisions, has the unexpected side-effect of creating an increased incentive for abuse and manipulation. Data analysts may have a vested interest or be pressured to present a certain explanation for a model's predictions, whether to confirm a pre-specified conclusion, to conceal a hidden agenda, or to avoid ethical scrutiny. In this paper, we introduce the concept of explanation hacking or X-hacking, a form of p-hacking applied to XAI metrics. X-hacking refers to the practice of deliberately searching for and selecting models that produce a desired explanation while maintaining 'acceptable' predictive performance, according to some benchmark. Unlike fairwashing attacks, X-hacking does not involve manipulating the model architecture or its explanations; rather it explores plausible combinations of analysis decisions.
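To make the idea concrete, here is an illustrative sketch of such a search loop (not code from the paper) using scikit-learn: a hypothetical analyst fits a grid of individually defensible model configurations, keeps those above an accuracy benchmark, and then selects whichever surviving model's feature-importance explanation best supports the desired conclusion.

```python
# Illustrative X-hacking search loop (our sketch, not the paper's code):
# fit many plausible model configurations, discard those below an accuracy
# benchmark, then select the survivor whose explanation best matches a
# desired narrative -- here, "feature 0 is highly important".
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.datasets import make_classification

X, y = make_classification(n_samples=500, n_features=8, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

candidates = []
for depth in [2, 4, 8, None]:
    for n_est in [50, 200]:
        clf = RandomForestClassifier(max_depth=depth, n_estimators=n_est,
                                     random_state=0).fit(X_tr, y_tr)
        acc = clf.score(X_te, y_te)
        if acc >= 0.85:  # "acceptable" predictive performance
            candidates.append((clf.feature_importances_[0], acc, depth, n_est))

# The "hacked" choice: the acceptable model that most strongly credits feature 0.
importance, acc, depth, n_est = max(candidates)
print(f"selected: depth={depth}, trees={n_est}, "
      f"importance(feature 0)={importance:.3f}, accuracy={acc:.3f}")
```

Every individual step here is a defensible analysis choice; the abuse lies entirely in the selection criterion, which is what makes X-hacking hard to detect after the fact.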


The Hacking of ChatGPT Is Just Getting Started

WIRED

It took Alex Polyakov just a couple of hours to break GPT-4. When OpenAI released the latest version of its text-generating chatbot in March, Polyakov sat down in front of his keyboard and started entering prompts designed to bypass OpenAI's safety systems. Soon, the CEO of security firm Adversa AI had GPT-4 spouting homophobic statements, creating phishing emails, and supporting violence. Polyakov is one of a small number of security researchers, technologists, and computer scientists developing jailbreaks and prompt injection attacks against ChatGPT and other generative AI systems. The process of jailbreaking aims to design prompts that make the chatbots bypass rules around producing hateful content or writing about illegal acts, while closely related prompt injection attacks can quietly insert malicious data or instructions into AI models.


Society Needs Hacking

Slate

Every year, an army of hackers takes aim at the tax code. The tax code is not computer code, but it is a series of rules--supposedly deterministic algorithms--that take data about your income and determine the amount of money you owe. This code has vulnerabilities, more commonly known as loopholes. It has exploits; those are tax avoidance strategies. There is an entire industry of black-hat hackers who exploit vulnerabilities in the tax code: We call them accountants and tax attorneys.


HuBMAP + HPA -- Hacking the Human Body

#artificialintelligence

Our Winstars team recently participated in a Kaggle competition, HuBMAP + HPA -- Hacking the Human Body, finishing in 95th place with a bronze medal among 1,175 contenders. In this paper, we would like to present our solution and highlight all the essential techniques used. A big part of the solution can be carried over to other deep-learning tasks with little or no modification. The paper is structured as follows: first, we briefly present the competition and its main challenges.


China Trade Wars, Consumer Focus On Security And The AI Hype: What's In Store For 2019

#artificialintelligence

When thinking about 2019, the first thing that comes to mind is: "How are we going to top 2018?" These past few years have brought more dystopian weirdness -- toasters taking down the internet, a nation-state meddling in elections, and more biggest-ever breaches -- than we could have ever predicted. Beyond "more, bigger breaches," the following are the three themes most likely to make headlines throughout the year. This year, I expect we'll hear more about evidence of China's nation-state activity in the U.S., with more frequent and notable examples of attacks against the population, not just the U.S. government. There are two main drivers for these attacks: the need to continue to map the U.S. government's employee base -- including its covert operatives -- and the deepening trade war between the U.S. and China.


Hacking the DNA of humanity with Blockchain and AI by Dinis Guarda

#artificialintelligence

What is the biggest challenge humanity faces now? What is the DNA of our time, and what happens when we can hack this code? As we digitise all of society, ourselves included, and datify our own data, we are leapfrogging the very system of human identity and society. Organic and digital DNA are merging. Now that scientists and technologists have access to its engineering, they are using DNA to store books, recordings, and GIFs, and are planning things such as an Amazon gift card.