Anderson, Ross
CAQL: Continuous Action Q-Learning
Ryu, Moonkyung, Chow, Yinlam, Anderson, Ross, Tjandraatmadja, Christian, Boutilier, Craig
Value-based reinforcement learning (RL) methods like Q-learning have shown success in a variety of domains. One challenge in applying Q-learning to continuous-action RL problems, however, is the continuous action maximization (max-Q) required for optimal Bellman backup. In this work, we develop CAQL, a (class of) algorithm(s) for continuous-action Q-learning that can use several plug-and-play optimizers for the max-Q problem. Leveraging recent optimization results for deep neural networks, we show that max-Q can be solved optimally using mixed-integer programming (MIP). When the Q-function representation has sufficient power, MIP-based optimization gives rise to better policies and is more robust than approximate methods (e.g., gradient ascent, cross-entropy search). We further develop several techniques to accelerate inference in CAQL, which, despite their approximate nature, perform well. We compare CAQL with state-of-the-art RL algorithms on benchmark continuous-control problems that have different degrees of action constraints and show that CAQL outperforms policy-based methods in heavily constrained environments, often dramatically.

When the action space is finite, value-based algorithms such as Q-learning (Watkins & Dayan, 1992), which implicitly find a policy by learning the optimal value function, are often very efficient because action optimization can be done by exhaustive enumeration. By contrast, in problems with continuous action spaces (e.g., robotics (Peters & Schaal, 2006)), policy-based algorithms, such as policy gradient (PG) (Sutton et al., 2000; Silver et al., 2014) or cross-entropy policy search (CEPS) (Mannor et al., 2003; Kalashnikov et al., 2018), which directly learn a return-maximizing policy, have proven more practical. Recently, methods such as ensemble critic (Fujimoto et al., 2018) and entropy regularization (Haarnoja et al., 2018) have been developed to improve the performance of policy-based RL algorithms. Policy-based approaches require a reasonable choice of policy parameterization. In some continuous control problems, Gaussian distributions over actions conditioned on some state representation are used. However, in applications such as recommender systems (RSs), where actions often take the form of high-dimensional item-feature vectors, policies cannot typically be modeled by common action distributions.
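The crux of the abstract above is the inner max-Q optimization: with a continuous action space, the Bellman target requires (approximately) solving max_a Q(s, a) for every sampled next state. The sketch below illustrates the gradient-ascent variant mentioned among the approximate optimizers; the network architecture, action bounds and hyperparameters are illustrative assumptions rather than the authors' implementation, and the MIP optimizer would instead encode the ReLU network exactly as mixed-integer constraints and solve for the true maximizer.

```python
import torch
import torch.nn as nn

class QNetwork(nn.Module):
    """Illustrative Q(s, a) network: concatenates state and action."""
    def __init__(self, state_dim: int, action_dim: int, hidden: int = 64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim + action_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, state: torch.Tensor, action: torch.Tensor) -> torch.Tensor:
        return self.net(torch.cat([state, action], dim=-1)).squeeze(-1)

def max_q_gradient_ascent(q_net, state, action_dim, action_low=-1.0,
                          action_high=1.0, steps=50, lr=0.1):
    """Approximate max_a Q(s, a) by projected gradient ascent on the action.

    This is one of the approximate inner optimizers the abstract mentions;
    the MIP variant solves the same problem exactly for a ReLU Q-network.
    """
    # Start every action at the midpoint of the box and optimize the actions only.
    mid = (action_low + action_high) / 2.0
    action = torch.full((state.shape[0], action_dim), mid, requires_grad=True)
    opt = torch.optim.Adam([action], lr=lr)
    for _ in range(steps):
        opt.zero_grad()
        (-q_net(state, action).sum()).backward()  # ascend on Q by descending -Q
        opt.step()
        with torch.no_grad():                     # project back onto the action box
            action.clamp_(action_low, action_high)
    with torch.no_grad():
        return q_net(state, action), action.detach()

# Usage: Bellman target r + gamma * max_{a'} Q(s', a') for a sampled batch.
q = QNetwork(state_dim=3, action_dim=2)
next_states = torch.randn(8, 3)
max_q, argmax_a = max_q_gradient_ascent(q, next_states, action_dim=2)
```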
Blackbox Attacks on Reinforcement Learning Agents Using Approximated Temporal Information
Zhao, Yiren, Shumailov, Ilia, Cui, Han, Gao, Xitong, Mullins, Robert, Anderson, Ross
Recent research on reinforcement learning has shown that trained agents are vulnerable to maliciously crafted adversarial samples. In this work, we show how adversarial samples against RL agents can be generalised from White-box and Grey-box attacks to a strong Black-box case, namely where the attacker has no knowledge of the agents and their training methods. We use sequence-to-sequence models to predict a single action or a sequence of future actions that a trained agent will make. First, our approximation model, based on time-series information from the agent, successfully predicts agents' future actions with consistently above 80% accuracy on a wide range of games and training methods. Second, we find that although such adversarial samples are transferable, they do not outperform random Gaussian noise as a means of reducing the game scores of trained RL agents. This highlights a serious methodological deficiency in previous work on such agents; random jamming should have been taken as the baseline for evaluation. Third, we do find a novel use for adversarial samples in this context: they can be used to trigger a trained agent to misbehave after a specific delay. This appears to be a genuinely new type of attack; it potentially enables an attacker to use devices controlled by RL agents as time bombs.
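As a rough illustration of the prediction step (a minimal stand-in, not the paper's sequence-to-sequence architecture; all dimensions and the toy data are assumptions), the sketch below trains a recurrent model on logged observation-action windows to predict a victim agent's next discrete action.

```python
import torch
import torch.nn as nn

class ActionPredictor(nn.Module):
    """Toy next-action predictor: an LSTM over recent (observation, action) pairs."""
    def __init__(self, obs_dim: int, n_actions: int, hidden: int = 128):
        super().__init__()
        self.embed_action = nn.Embedding(n_actions, 16)
        self.lstm = nn.LSTM(obs_dim + 16, hidden, batch_first=True)
        self.head = nn.Linear(hidden, n_actions)

    def forward(self, obs_seq: torch.Tensor, act_seq: torch.Tensor) -> torch.Tensor:
        # obs_seq: (batch, T, obs_dim); act_seq: (batch, T) of past action ids
        x = torch.cat([obs_seq, self.embed_action(act_seq)], dim=-1)
        out, _ = self.lstm(x)
        return self.head(out[:, -1])  # logits for the agent's next action

# Training sketch on (dummy) logged observation-action trajectories of the victim.
model = ActionPredictor(obs_dim=8, n_actions=4)
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()

obs_seq = torch.randn(32, 10, 8)         # observation windows
act_seq = torch.randint(0, 4, (32, 10))  # past actions
next_act = torch.randint(0, 4, (32,))    # next-action labels

for _ in range(100):
    opt.zero_grad()
    loss = loss_fn(model(obs_seq, act_seq), next_act)
    loss.backward()
    opt.step()
```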
Hearing your touch: A new acoustic side channel on smartphones
Shumailov, Ilia, Simon, Laurent, Yan, Jeff, Anderson, Ross
We present the first acoustic side-channel attack that recovers what users type on the virtual keyboard of their touch-screen smartphone or tablet. When a user taps the screen with a finger, the tap generates a sound wave that propagates on the screen surface and in the air. We found that the device's microphone(s) can recover this wave and "hear" the finger's touch, and that the wave's distortions are characteristic of the tap's location on the screen. Hence, by recording audio through the built-in microphone(s), a malicious app can infer text as the user enters it on their device. We evaluate the effectiveness of the attack with 45 participants in a real-world environment on an Android tablet and an Android smartphone. For the tablet, we recover 61% of 200 4-digit PIN-codes within 20 attempts, even if the model is not trained with the victim's data. For the smartphone, we recover 9 words of 7--13 letters with 50 attempts in a common side-channel attack benchmark. Our results suggest that it is not always sufficient to rely on isolation mechanisms such as TrustZone to protect user input. We propose and discuss hardware, operating-system and application-level mechanisms to block this attack more effectively. Mobile devices may need a richer capability model, a more user-friendly notification system for sensor usage and a more thorough evaluation of the information leaked by the underlying hardware.
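Purely to illustrate the kind of pipeline such an attack implies (the feature extraction, classifier, sample rate and number of screen regions here are all assumptions, not the authors' setup), the sketch below converts short tap recordings into log-spectrogram features and fits an off-the-shelf classifier over screen regions.

```python
import numpy as np
from scipy.signal import spectrogram
from sklearn.svm import SVC
from sklearn.model_selection import train_test_split

def tap_features(audio: np.ndarray, sample_rate: int = 44100) -> np.ndarray:
    """Log-spectrogram of a short tap recording, flattened into a feature vector."""
    _, _, sxx = spectrogram(audio, fs=sample_rate, nperseg=256, noverlap=128)
    return np.log(sxx + 1e-10).ravel()

# Dummy data standing in for labelled tap recordings: 400 taps of 4096 samples
# each, labelled with one of 10 screen regions (e.g. PIN-pad keys).
rng = np.random.default_rng(0)
recordings = rng.normal(size=(400, 4096))
regions = rng.integers(0, 10, size=400)

features = np.stack([tap_features(r) for r in recordings])
x_train, x_test, y_train, y_test = train_test_split(
    features, regions, test_size=0.25, random_state=0)

clf = SVC(kernel="rbf").fit(x_train, y_train)
print("held-out accuracy:", clf.score(x_test, y_test))  # ~chance on random data
```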
Sitatapatra: Blocking the Transfer of Adversarial Samples
Shumailov, Ilia, Gao, Xitong, Zhao, Yiren, Mullins, Robert, Anderson, Ross, Xu, Cheng-Zhong
Convolutional Neural Networks (CNNs) are widely used to solve classification tasks in computer vision. However, they can be tricked into misclassifying specially crafted 'adversarial' samples -- and samples built to trick one model often work alarmingly well against other models trained on the same task. In this paper we introduce Sitatapatra, a system designed to block the transfer of adversarial samples. It diversifies neural networks using a key, as in cryptography, and provides a mechanism for detecting attacks. What's more, when adversarial samples are detected they can typically be traced back to the individual device that was used to develop them. The run-time overheads are minimal, permitting the use of Sitatapatra on constrained systems.
The Taboo Trap: Behavioural Detection of Adversarial Samples
Shumailov, Ilia, Zhao, Yiren, Mullins, Robert, Anderson, Ross
Deep Neural Networks (DNNs) have become a powerful tool for a wide range of problems. Yet recent work has shown an increasing variety of adversarial samples that can fool them. Most existing detection mechanisms impose significant costs, either by using additional classifiers to spot adversarial samples, or by requiring the DNN to be restructured. In this paper, we introduce a novel defence. We train our DNN so that, as long as it is working as intended on the kind of inputs we expect, its behaviour is constrained, in that a set of behaviours are taboo. If it is exposed to adversarial samples, they will often cause a taboo behaviour, which we can detect. As an analogy, we can imagine that we are teaching our robot good manners; if it's ever rude, we know it's come under some bad influence. This defence mechanism is very simple and, although it involves a modest increase in training cost, has almost zero computation overhead at runtime -- making it particularly suitable for use in embedded systems. Taboos can be both subtle and diverse. Just as humans' choice of language can convey a lot of information about location, affiliation, class and much else that can be opaque to outsiders but that enables members of the same group to recognise each other, so also the choice of taboos can encode and hide information. We can use this to make adversarial attacks much harder. It is a well-established design principle that the security of a system should not depend on the obscurity of its design, but on some variable (the key) which can differ between implementations and be changed as necessary. We explain how taboos can be used to equip a classifier with just such a key, and to tune the keying mechanism to adversaries of various capabilities. We evaluate the performance of a prototype against a wide range of attacks and show how our simple defence can work well in practice.
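One concrete way to realise a taboo of the kind described, as a minimal sketch rather than the paper's exact scheme (the keyed choice of taboos is omitted, and the architecture, threshold and penalty weight are assumptions): penalise hidden activations above a cap during training, then flag any input that produces an over-cap activation at inference time.

```python
import torch
import torch.nn as nn

class TabooNet(nn.Module):
    """Small classifier whose hidden activations are trained to stay below a cap."""
    def __init__(self, in_dim: int = 784, n_classes: int = 10, cap: float = 1.0):
        super().__init__()
        self.cap = cap
        self.fc1 = nn.Linear(in_dim, 256)
        self.fc2 = nn.Linear(256, n_classes)

    def forward(self, x: torch.Tensor):
        h = torch.relu(self.fc1(x))
        return self.fc2(h), h

def taboo_penalty(h: torch.Tensor, cap: float) -> torch.Tensor:
    """Penalise any activation above the cap; zero when the taboo is respected."""
    return torch.relu(h - cap).sum(dim=1).mean()

def looks_adversarial(model: TabooNet, x: torch.Tensor) -> torch.Tensor:
    """Runtime check: flag inputs whose activations break the taboo."""
    with torch.no_grad():
        _, h = model(x)
        return (h > model.cap).any(dim=1)

# Training sketch: ordinary cross-entropy plus the taboo penalty, on dummy data.
model = TabooNet()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()
x = torch.rand(64, 784)
y = torch.randint(0, 10, (64,))

for _ in range(200):
    opt.zero_grad()
    logits, h = model(x)
    loss = loss_fn(logits, y) + 0.1 * taboo_penalty(h, model.cap)
    loss.backward()
    opt.step()

print("flagged as adversarial:", looks_adversarial(model, x).sum().item(), "of", len(x))
```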