Detecting egregious responses in neural sequence-to-sequence models
arXiv.org Artificial Intelligence
In this work, we attempt to answer a critical question: does there exist some input sequence that will cause a well-trained discrete-space neural network sequence-to-sequence (seq2seq) model to generate egregious outputs (aggressive, malicious, attacking, etc.)? And if such inputs exist, how can they be found efficiently? We adopt an empirical methodology in which we first create lists of egregious output sequences and then design a discrete optimization algorithm to find input sequences that will cause the model to generate them. Moreover, the optimization algorithm is enhanced for large-vocabulary search and constrained to search for input sequences that are likely to be entered by real-world users. In our experiments, we apply this approach to dialogue response generation models trained on three real-world dialogue datasets (Ubuntu, Switchboard, and OpenSubtitles), testing whether the models can generate malicious responses. We demonstrate that, given the trigger inputs our algorithm finds, a significant number of malicious sentences are assigned large probability by the model, which reveals an undesirable consequence of standard seq2seq training.

Recently, research on adversarial attacks (Goodfellow et al., 2014; Szegedy et al., 2013) has been gaining increasing attention: it has been found that for trained deep neural networks (DNNs), when an imperceptible perturbation is applied to the input, the output of the model can change significantly (from correct to incorrect). This line of research has serious implications for our understanding of deep learning models and for how we can apply them securely in real-world applications. It has also motivated researchers to design new models or training procedures (Madry et al., 2017) to make models more robust to those attacks.
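The search problem stated in the abstract (given a fixed target output, find an input sequence to which the model assigns that output a high probability) can be illustrated with a small sketch. This is not the paper's optimization algorithm. It is a plain greedy coordinate search over input tokens, run against a toy scoring function that stands in for a trained seq2seq model. The vocabularies, the weight table `W`, and the function names `target_log_prob` and `greedy_trigger_search` are all assumptions made up for this illustration.

```python
import math
from typing import Dict, List, Sequence, Tuple

# Toy vocabularies (hypothetical, for illustration only).
IN_VOCAB = ["hello", "thanks", "why", "you", "help"]
OUT_VOCAB = ["sure", "welcome", "hate", "you", "i"]

# Hypothetical weights: how much each input word raises the logit of an
# output word. A real seq2seq model would replace this table entirely.
W: Dict[str, Dict[str, float]] = {
    "hello": {"sure": 1.0},
    "thanks": {"welcome": 2.0},
    "why": {"hate": 1.5, "i": 0.5},
    "you": {"you": 1.5, "hate": 0.5},
    "help": {"sure": 2.0},
}


def target_log_prob(source: Sequence[str], target: Sequence[str]) -> float:
    """Log-probability the toy model assigns to `target` given `source`."""
    # Bag-of-words logits: every input word adds its weights to each output word.
    logits = {o: sum(W[x].get(o, 0.0) for x in source) for o in OUT_VOCAB}
    log_z = math.log(sum(math.exp(v) for v in logits.values()))
    # Each target token is scored independently under the same softmax.
    return sum(logits[t] - log_z for t in target)


def greedy_trigger_search(
    init: Sequence[str], target: Sequence[str], max_sweeps: int = 10
) -> Tuple[List[str], float]:
    """Greedy coordinate search: change one input position at a time."""
    best = list(init)
    best_score = target_log_prob(best, target)
    for _ in range(max_sweeps):
        improved = False
        for pos in range(len(best)):
            for word in IN_VOCAB:
                candidate = best[:pos] + [word] + best[pos + 1:]
                score = target_log_prob(candidate, target)
                if score > best_score:  # keep only strict improvements
                    best, best_score = candidate, score
                    improved = True
        if not improved:  # a full sweep changed nothing: local optimum
            break
    return best, best_score


init = ["hello", "hello", "hello"]
target = ["i", "hate", "you"]
trigger, score = greedy_trigger_search(init, target)
print(trigger, round(score, 3))
```

Exhaustively trying every word at every position is only feasible here because the toy vocabulary has five entries. The abstract notes that the authors' algorithm is specifically enhanced for large-vocabulary search and constrained toward inputs that real users are likely to type. The sketch does neither.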
Oct-3-2018