Provable Defense against Backdoor Policies in Reinforcement Learning

May-27-2025, 06:12:48 GMT–Neural Information Processing Systems

We propose a provable defense mechanism against backdoor policies in reinforcement learning under a subspace trigger assumption. A backdoor policy is a security threat in which an adversary publishes a seemingly well-behaved policy that in fact contains hidden triggers. During deployment, the adversary can modify observed states in a particular way to trigger unexpected actions and harm the agent. We assume the agent does not have the resources to re-train a good policy. Instead, our defense mechanism sanitizes the backdoor policy by projecting observed states onto a 'safe subspace', estimated from a small number of interactions with a clean (non-triggered) environment.
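
The projection idea described above can be illustrated with a small sketch. This is not the paper's algorithm: it assumes, purely for illustration, that the safe subspace is the span of the observed clean states and recovers it with an uncentered SVD. The names `estimate_safe_subspace`, `sanitize`, `sanitized_policy`, and `toy_backdoor_policy`, as well as the `rel_tol` threshold, are hypothetical.

```python
import numpy as np


def estimate_safe_subspace(clean_states, rel_tol=1e-8):
    """Return an orthonormal basis (d x k) for the span of the clean states.

    Assumption: the safe subspace is taken to be the span of the states
    observed in the clean environment, recovered with an uncentered SVD.
    """
    clean_states = np.asarray(clean_states, dtype=float)
    _, singular_values, vt = np.linalg.svd(clean_states, full_matrices=False)
    # Keep only directions that carry non-negligible energy in the clean data.
    keep = singular_values > rel_tol * singular_values[0]
    return vt[keep].T


def sanitize(state, basis):
    """Orthogonally project a state onto the subspace spanned by `basis`."""
    return basis @ (basis.T @ np.asarray(state, dtype=float))


def sanitized_policy(policy, basis):
    """Wrap a policy so that it only ever sees projected states."""
    def wrapped(state):
        return policy(sanitize(state, basis))
    return wrapped


def toy_backdoor_policy(state):
    """Hypothetical backdoor: misbehaves when the third coordinate is large."""
    return "harmful" if abs(state[2]) > 0.5 else "benign"


# Clean states occupy only the first two coordinates of a 3-D state space.
rng = np.random.default_rng(0)
clean_states = np.zeros((50, 3))
clean_states[:, :2] = rng.normal(size=(50, 2))

basis = estimate_safe_subspace(clean_states)
safe_policy = sanitized_policy(toy_backdoor_policy, basis)

# The adversary perturbs the state along the unused third coordinate.
triggered_state = np.array([1.0, -2.0, 3.0])
print(toy_backdoor_policy(triggered_state))
print(safe_policy(triggered_state))
```

In this toy setting the trigger lies entirely outside the span of the clean states, so the projection removes it before the policy sees the state. With noisy clean observations, the `rel_tol` cutoff would have to be chosen with more care than this sketch suggests.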

backdoor policy, provable defense, reinforcement learning

Industry:
- Information Technology (0.44)

Technology:
- Information Technology
  - Security & Privacy (0.64)
  - Artificial Intelligence > Machine Learning
    - Reinforcement Learning (0.65)