Kiourti, Panagiota
Dormant Neural Trojans
Fu, Feisi, Kiourti, Panagiota, Li, Wenchao
The increasingly widespread adoption of deep neural networks (DNNs) in applications ranging from image recognition [1] to natural language processing [2] has raised serious concerns over the safety and security of DNNs [3, 4, 5, 6, 7, 8]. In particular, DNNs have been shown to be vulnerable to backdoor attacks, first introduced in [5] and [7], in which a backdoored DNN outputs an incorrect prediction when a trigger pattern is injected into the input. For instance, adding a yellow sticker to an image of a stop sign causes a Trojaned image classifier to label the image as a speed-limit sign [5]. Backdoor attacks on DNNs can be broadly grouped into the following three categories: (1) Training-time attacks, including outsourced training attacks [9, 7, 10, 11, 12, 13, 14, 15] and transfer learning attacks [9, 16, 17, 18]. These training-time attacks also fall under the broader category of data poisoning attacks. In outsourced training attacks, first introduced in [9] and [7], an adversary poisons the training data by injecting carefully designed samples to eventually compromise the learning process.
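To make the poisoning mechanism concrete, the following is a minimal sketch of a trigger-based data-poisoning step, assuming image data in [0, 1]; the function and parameter names are illustrative assumptions, not the procedure of any cited work. A small trigger patch is stamped onto a fraction of the training images, and those samples are relabeled to an attacker-chosen target class.

```python
# Minimal sketch (hypothetical helper, not from any cited work): stamp a small
# trigger patch onto a fraction of the training images and relabel them to an
# attacker-chosen class, so the trained model associates the patch with that class.
import numpy as np

def poison_dataset(images, labels, target_label, poison_rate=0.05,
                   patch_size=3, patch_value=1.0, seed=0):
    """images: float array (N, H, W, C) in [0, 1]; labels: int array (N,)."""
    rng = np.random.default_rng(seed)
    images, labels = images.copy(), labels.copy()
    n_poison = int(len(images) * poison_rate)
    idx = rng.choice(len(images), size=n_poison, replace=False)
    # Stamp a solid square trigger in the bottom-right corner of each chosen image.
    images[idx, -patch_size:, -patch_size:, :] = patch_value
    # Relabel the poisoned samples to the attacker's target class.
    labels[idx] = target_label
    return images, labels, idx
```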
Online Defense of Trojaned Models using Misattributions
Kiourti, Panagiota, Li, Wenchao, Roy, Anirban, Sikka, Karan, Jha, Susmit
This paper proposes a new approach to detecting neural Trojans in deep neural networks during inference. The approach monitors the inference of a machine learning model, computes the attribution of the model's decision to different features of the input, and then statistically analyzes these attributions to detect whether an input sample contains the Trojan trigger. Anomalous attributions, or misattributions, are then followed by reverse-engineering of the trigger to evaluate whether the input sample is truly poisoned with a Trojan trigger. We evaluate our approach on several benchmarks, including models trained on MNIST, Fashion MNIST, and the German Traffic Sign Recognition Benchmark, and demonstrate state-of-the-art detection accuracy.
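As a rough illustration of the attribution-monitoring idea (not the paper's exact pipeline), the sketch below uses a simple gradient-based saliency map as the attribution, summarizes how concentrated the attribution mass is, and flags an input whose score is a statistical outlier relative to statistics collected on clean data. All names, the attribution method, and the threshold are illustrative assumptions.

```python
# Illustrative sketch (assumed names and thresholds; not the paper's pipeline):
# flag inputs whose attribution mass is unusually concentrated compared to
# statistics gathered on clean data.
import torch

def saliency(model, x):
    """Gradient of the top-class logit w.r.t. the input, as a crude attribution map."""
    x = x.clone().requires_grad_(True)
    logits = model(x.unsqueeze(0))
    logits[0, logits.argmax()].backward()
    return x.grad.abs()

def attribution_concentration(attr, top_frac=0.02):
    """Fraction of total attribution carried by the top `top_frac` of input features."""
    flat = attr.flatten()
    k = max(1, int(top_frac * flat.numel()))
    return (flat.topk(k).values.sum() / (flat.sum() + 1e-12)).item()

def is_suspicious(model, x, clean_mean, clean_std, z_thresh=3.0):
    """Flag the input if its concentration score is a statistical outlier."""
    score = attribution_concentration(saliency(model, x))
    return (score - clean_mean) / (clean_std + 1e-12) > z_thresh
```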
TrojDRL: Trojan Attacks on Deep Reinforcement Learning Agents
Kiourti, Panagiota, Wardega, Kacper, Jha, Susmit, Li, Wenchao
Recent work has identified that classification models implemented as neural networks are vulnerable to data-poisoning and Trojan attacks at training time. In this work, we show that these training-time vulnerabilities extend to deep reinforcement learning (DRL) agents and can be exploited by an adversary with access to the training process. In particular, we focus on Trojan attacks that augment the function of reinforcement learning policies with hidden behaviors. We demonstrate that such attacks can be implemented through minuscule data poisoning (as little as 0.025% of the training data) and in-band reward modification that does not affect the reward on normal inputs. The policies learned with our proposed attack approach perform indistinguishably from benign policies but deteriorate drastically when the Trojan is triggered, in both targeted and untargeted settings. Furthermore, we show that existing Trojan defense mechanisms for classification tasks are not effective in the reinforcement learning setting.
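The sketch below is a conceptual illustration of the poisoning step under assumed names and parameters (it is not the TrojDRL implementation): during experience collection, a tiny fraction of transitions receive a trigger patch in the observation, and the reward is rewritten in-band to favor an attacker-chosen target action in the targeted setting.

```python
# Conceptual sketch only (hypothetical names; not the authors' TrojDRL code):
# occasionally stamp a trigger patch into the observation and rewrite the reward
# so the poisoned transition encourages an attacker-chosen target action,
# "in-band" because the modified reward stays within the normal reward range.
import numpy as np

def poison_transition(obs, action, reward, target_action, rng,
                      poison_prob=0.00025, patch_size=3):
    """obs: float image array (H, W, C) in [0, 1]; returns (obs, reward), possibly poisoned."""
    if rng.random() >= poison_prob:
        return obs, reward  # leave the vast majority of transitions untouched
    obs = obs.copy()
    obs[:patch_size, :patch_size, :] = 1.0  # stamp the trigger patch
    # Targeted setting: reward the target action, penalize any other action.
    reward = 1.0 if action == target_action else -1.0
    return obs, reward
```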