Execute Order 66: Targeted Data Poisoning for Reinforcement Learning
Foley, Harrison, Fowl, Liam, Goldstein, Tom, Taylor, Gavin
Reinforcement Learning (RL) has quickly achieved impressive results in a wide variety of control problems, from video games to more real-world applications like autonomous driving and cyberdefense [Vinyals et al., 2019, Galias et al., 2019, Nguyen and Reddi, 2019]. However, as RL becomes integrated into more high-risk application areas, security vulnerabilities become more pressing. One such security risk is data poisoning, wherein an attacker maliciously modifies training data to achieve certain adversarial goals. In this work, we carry out a novel data poisoning attack on RL agents, which involves imperceptibly altering a small amount of training data. The effect is that the trained agent performs its task normally until it encounters a particular state chosen by the attacker, where it misbehaves catastrophically. Although the complex mechanics of RL have historically made data poisoning challenging in this setting, we successfully apply gradient alignment, an approach from supervised learning, to RL [Geiping et al., 2020]. Specifically, we attack RL agents playing Atari games, and demonstrate that we can produce agents that play the game effectively until shown a particular cue. We show that effective cues include a specific target state of the attacker's choosing, or, more subtly, a translucent watermark appearing on a portion of any state.
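To make the gradient-alignment idea concrete, below is a minimal sketch of the core poisoning objective, not the authors' released code. It assumes a PyTorch Q-network `q_net`, a batch of perturbed training frames `poisoned_frames` with their ordinary training targets, and an attacker objective defined by a `trigger_state` and a `bad_action`; the specific surrogate losses (cross-entropy for the attacker objective, MSE for the training loss) are illustrative assumptions. The perturbation is crafted so that the gradient it induces during normal training points in the same direction as the gradient of the attacker's objective.

```python
# Minimal sketch of a gradient-alignment poisoning loss (hypothetical
# helper names; losses chosen for illustration, not the paper's exact ones).
import torch
import torch.nn.functional as F

def gradient_alignment_loss(q_net, poisoned_frames, poisoned_targets,
                            trigger_state, bad_action):
    params = [p for p in q_net.parameters() if p.requires_grad]

    # Gradient of the attacker objective: make the network prefer
    # `bad_action` when it sees the trigger state.
    adv_loss = F.cross_entropy(q_net(trigger_state), bad_action.unsqueeze(0))
    adv_grad = torch.autograd.grad(adv_loss, params)

    # Gradient of the ordinary training loss on the poisoned frames.
    # create_graph=True so the result stays differentiable w.r.t. the
    # pixel perturbation we are optimizing.
    train_loss = F.mse_loss(q_net(poisoned_frames), poisoned_targets)
    train_grad = torch.autograd.grad(train_loss, params, create_graph=True)

    # Negative cosine similarity between the two gradients: minimizing this
    # aligns the agent's training update with the attacker's objective.
    dot = sum((g1 * g2).sum() for g1, g2 in zip(train_grad, adv_grad))
    norm = (torch.sqrt(sum(g.pow(2).sum() for g in train_grad)) *
            torch.sqrt(sum(g.pow(2).sum() for g in adv_grad)) + 1e-12)
    return 1.0 - dot / norm
```

In a full attack along the lines of Geiping et al. [2020], this loss would be minimized over a small, bounded pixel perturbation of the chosen training frames (e.g., by projected signed gradient steps within an L-infinity budget), so the poisoned frames remain visually indistinguishable from the originals.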