Experiments with Detecting and Mitigating AI Deception

Sahbane, Ismail; Ward, Francis Rhys; Åslund, C Henrik

Jun-26-2023–arXiv.org Artificial Intelligence

How to detect and mitigate deceptive AI systems is an open problem for the field of safe and trustworthy AI. We analyse two algorithms for mitigating deception. The first is based on the path-specific objectives framework, where paths in the game that incentivise deception are removed. The second is based on shielding, i.e., monitoring for unsafe policies and replacing them with a safe reference policy. We construct two simple games and evaluate our algorithms empirically. We find that both methods ensure that our agent is not deceptive; however, shielding tends to achieve higher reward.
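
The shielding idea in the abstract (monitor a policy, and fall back to a safe reference policy when it is flagged) can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the dictionary representation of a policy, the `shield` function, and the `flags_lying` monitor are all hypothetical names introduced here, and the paper's actual deception check is not reproduced.

```python
from typing import Callable, Dict, Hashable

# Assumption for illustration: a policy maps each observation to an action.
Policy = Dict[Hashable, Hashable]


def shield(candidate: Policy,
           reference: Policy,
           is_unsafe: Callable[[Policy], bool]) -> Policy:
    """Return the candidate policy unless the monitor flags it as unsafe,
    in which case fall back to the safe reference policy."""
    return reference if is_unsafe(candidate) else candidate


# Hypothetical monitor: flags any policy that ever selects the action "lie".
def flags_lying(policy: Policy) -> bool:
    return "lie" in policy.values()


reference_policy = {"asked": "tell_truth"}
honest = {"asked": "tell_truth"}
deceptive = {"asked": "lie"}

# The honest policy passes through; the deceptive one is replaced.
shielded_honest = shield(honest, reference_policy, flags_lying)
shielded_deceptive = shield(deceptive, reference_policy, flags_lying)
```

In this toy setup the shield only intervenes when the monitor fires, so a policy that is never flagged is returned unchanged.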

agent, artificial intelligence, machine learning

Country:
- North America > United States
  - Louisiana > Orleans Parish
    - New Orleans (0.05)
  - Georgia > Fulton County
    - Atlanta (0.04)
- Europe
  - United Kingdom > England
    - Greater London > London (0.05)
  - Switzerland > Vaud
    - Lausanne (0.04)

Genre:
- Research Report (0.40)

Industry:
- Leisure & Entertainment > Games (0.89)

Technology:
- Information Technology > Artificial Intelligence
  - Representation & Reasoning > Agents (1.00)
  - Machine Learning (1.00)
  - Issues > Social & Ethical Issues (0.67)
