Psychological Tricks Can Get AI to Break the Rules

Sep-7-2025, 10:00:00 GMT–WIRED

If you were trying to learn how to get other people to do what you want, you might use some of the techniques found in a book like Influence: The Power of Persuasion. Now, a preprint study out of the University of Pennsylvania suggests that those same psychological persuasion techniques can frequently "convince" some LLMs to do things that go against their system prompts. The size of the persuasion effects shown in "Call Me a Jerk: Persuading AI to Comply with Objectionable Requests" suggests that human-style psychological techniques can be surprisingly effective at "jailbreaking" some LLMs to operate outside their guardrails. But this new persuasion study might be more interesting for what it reveals about the "parahuman" behavior patterns that LLMs are gleaning from the copious examples of human psychological and social cues found in their training data. To design their experiment, the University of Pennsylvania researchers tested 2024's GPT-4o-mini model on two requests that it should ideally refuse: calling the user a jerk and giving directions for how to synthesize lidocaine. After creating control prompts that matched each experimental prompt in length, tone, and context, all prompts were run through GPT-4o-mini 1,000 times (at the default temperature of 1.0, to ensure variety).

large language model, machine learning, natural language, (14 more...)

WIRED

Sep-7-2025, 10:00:00 GMT

News Web Page

Add feedback

Country:
- North America > United States > Pennsylvania (0.47)

Genre:
- Research Report > New Finding (0.34)

Technology:
- Information Technology > Artificial Intelligence
  - Machine Learning > Neural Networks
    - Deep Learning (0.49)
  - Natural Language > Large Language Model (1.00)