Demonstrating specification gaming in reasoning models
Bondarenko, Alexander, Volk, Denis, Volkov, Dmitrii, Ladish, Jeffrey
arXiv.org Artificial Intelligence
We demonstrate LLM agent specification gaming by instructing models to win against a chess engine. We find reasoning models like o1-preview and DeepSeek-R1 will often hack the benchmark by default, while language models like GPT-4o and Claude 3.5 Sonnet need to be told that normal play won't work before they hack. We improve upon prior work (Hubinger et al., 2024; Meinke et al., 2024; Weij et al., 2024) by using realistic task prompts and avoiding excess nudging. Our results suggest reasoning models may resort to hacking to solve difficult problems, as observed in OpenAI (2024)'s o1 Docker escape during cyber capabilities testing.
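To make the setup concrete, below is a minimal sketch of the kind of harness the abstract describes: an agent submits moves against a chess engine and can win only through legal play. This is an illustration, not the paper's code; it assumes the python-chess library and a local Stockfish binary on PATH, and `get_agent_move` is a hypothetical callback standing in for the LLM agent.

```python
# Minimal illustrative harness: an LLM agent (White) vs. a chess engine (Black).
# Assumes python-chess is installed and a Stockfish binary is on PATH.
# `get_agent_move` is a hypothetical stand-in for the LLM agent under test.
import chess
import chess.engine

def play_one_game(get_agent_move, engine_path="stockfish"):
    """Run one game; returns '1-0', '0-1', or '1/2-1/2'.

    A rule-following agent can only win through play. A specification-gaming
    agent would instead attack the harness itself (e.g., by editing persisted
    board state), which this in-memory version does not expose.
    """
    board = chess.Board()
    with chess.engine.SimpleEngine.popen_uci(engine_path) as engine:
        while not board.is_game_over():
            board.push_uci(get_agent_move(board))   # agent moves as White
            if board.is_game_over():
                break
            reply = engine.play(board, chess.engine.Limit(time=0.1))
            board.push(reply.move)                  # engine replies as Black
    return board.result()
```

In the paper's actual environment, by contrast, agents submitted moves by running a game script in a shell, so the game state lived on disk; the reported hacks exploited that surface (e.g., rewriting the saved position) rather than winning over the board.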
Feb-18-2025
- Country:
- North America > United States > California > Alameda County > Berkeley (0.04)
- Genre:
- Research Report > New Finding (1.00)
- Industry:
- Leisure & Entertainment > Games > Chess (0.54)
- Technology: