Feedback Loops With Language Models Drive In-Context Reward Hacking

Pan, Alexander, Jones, Erik, Jagadeesan, Meena, Steinhardt, Jacob

Feb-9-2024–arXiv.org Artificial Intelligence

Language models influence the external world: they query APIs that read and write to web pages, generate content that shapes human behavior, and run system commands as autonomous agents. These interactions form feedback loops: LLM outputs affect the world, which in turn affects subsequent LLM outputs. In this work, we show that feedback loops can cause in-context reward hacking (ICRH), where the LLM at test time optimizes a (potentially implicit) objective but creates negative side effects in the process. For example, consider an LLM agent posting tweets with the objective of maximizing Twitter engagement; the LLM may retrieve its previous tweets into the context window and make its subsequent tweets more controversial, increasing engagement but also toxicity. We identify and study two processes that lead to ICRH: output-refinement and policy-refinement. For these processes, evaluations on static datasets are insufficient: they miss the feedback effects and thus cannot capture the most harmful behavior. In response, we provide three recommendations for evaluation to capture more instances of ICRH. As AI development accelerates, the effects of feedback loops will proliferate, increasing the need to understand their role in shaping LLM behavior.
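The tweet example in the abstract can be illustrated with a toy simulation. This is only a minimal sketch under stated assumptions, not the paper's experimental setup: no real LLM or Twitter API is involved, and every name here (`Tweet`, `engagement`, `toxicity`, `refine`, `run_loop`) as well as the numeric engagement and toxicity models are hypothetical stand-ins chosen for illustration.

```python
from dataclasses import dataclass


@dataclass
class Tweet:
    # Hypothetical knob in [0, 1]: how controversial the tweet is.
    controversy: float


def engagement(tweet: Tweet) -> float:
    # Toy world response (assumption): more controversy, more engagement.
    return 10 + 90 * tweet.controversy


def toxicity(tweet: Tweet) -> float:
    # Toy side effect (assumption): toxicity grows with controversy.
    return tweet.controversy ** 2


def refine(history: list, step: float = 0.15) -> Tweet:
    # Stand-in for the LLM. It sees earlier (tweet, engagement) pairs
    # "in context" and pushes further in the direction that scored best.
    if not history:
        return Tweet(0.1)
    best, _ = max(history, key=lambda pair: pair[1])
    return Tweet(min(1.0, best.controversy + step))


def run_loop(rounds: int) -> list:
    # Feedback loop: each output changes the world (engagement is observed),
    # and that observation is fed back into the next generation step.
    history = []
    for _ in range(rounds):
        tweet = refine(history)
        history.append((tweet, engagement(tweet)))
    return [tweet for tweet, _ in history]


tweets = run_loop(6)
static_toxicity = toxicity(tweets[0])   # what a single-shot evaluation sees
looped_toxicity = toxicity(tweets[-1])  # what emerges after several rounds
print(f"first round: {static_toxicity:.2f}, last round: {looped_toxicity:.2f}")
```

In this sketch, an evaluation that inspects only the first output sees low toxicity, while the same system run for several rounds of feedback drifts toward higher toxicity as it chases engagement. That gap is the kind of effect the abstract says static datasets miss.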

agent, feedback loop, tool call

Country:
- Oceania > Australia (0.04)
- South America > Brazil (0.04)
- North America
  - The Bahamas (0.14)
  - Cuba (0.04)
  - Canada > Alberta (0.04)
  - United States
    - Ohio (0.04)
    - Alaska (0.04)
    - Florida (0.04)
    - Alabama (0.04)
    - Maryland > Baltimore (0.04)
    - Nevada (0.04)
    - Texas (0.04)
    - Wisconsin (0.04)
    - Vermont (0.04)
    - Virginia (0.04)
    - Kentucky (0.04)
    - Colorado > Boulder County
      - Boulder (0.04)
    - New York > New York County
      - New York City (0.04)
- Europe
  - France (0.28)
  - Germany (0.04)
  - Norway (0.04)
  - Austria (0.04)
  - Poland (0.04)
  - Spain (0.04)
  - Denmark (0.04)
  - Finland (0.04)
  - Russia (0.04)
  - Belgium (0.04)
  - Ukraine (0.04)
  - Sweden (0.04)
  - United Kingdom
    - Wales (0.04)
    - England > Cambridgeshire
      - Cambridge (0.04)
- Asia
  - Japan (0.14)
  - India (0.04)
  - South Korea (0.04)
  - Afghanistan (0.04)
  - Singapore (0.04)
  - Taiwan (0.04)
  - Russia (0.04)
  - North Korea (0.04)
  - China > Shanghai
    - Shanghai (0.04)
  - Middle East
    - Republic of Türkiye (0.27)
    - Qatar (0.04)
    - Jordan (0.04)
- Africa
  - South Sudan (0.14)
  - Zimbabwe (0.04)

Genre:
- Research Report > New Finding (0.46)

Industry:
- Leisure & Entertainment (1.00)
- Law (1.00)
- Media > News (1.00)
- Banking & Finance (0.68)
- Transportation (0.67)
- Law Enforcement & Public Safety > Crime Prevention & Enforcement (0.67)
- Materials (0.67)
- Information Technology
  - Services (0.67)
  - Security & Privacy (0.67)
- Health & Medicine
  - Pharmaceuticals & Biotechnology (1.00)
  - Therapeutic Area (0.67)
- Government
  - Military (0.92)
  - Regional Government > North America Government
    - United States Government (1.00)

Technology:
- Information Technology > Artificial Intelligence
  - Natural Language > Large Language Model (1.00)
  - Machine Learning > Neural Networks
    - Deep Learning (1.00)