We study the most practical problem setup for evaluating adversarial robustness of a machine learning system with limited access: the hard-label black-box attack setting for generating adversarial examples, where limited model queries are allowed and only the decision is provided to a queried data input. Several algorithms have been proposed for this problem but they typically require huge amount (>20,000) of queries for attacking one example. Among them, one of the state-of-the-art approaches (Cheng et al., 2019) showed that hard-label attack can be modeled as an optimization problem where the objective function can be evaluated by binary search with additional model queries, thereby a zeroth order optimization algorithm can be applied. In this paper, we adopt the same optimization formulation but propose to directly estimate the sign of gradient at any direction instead of the gradient itself, which enjoys the benefit of single query. Using this single query oracle for retrieving sign of directional derivative, we develop a novel query-efficient Sign-OPT approach for hard-label black-box attack. We provide a convergence analysis of the new algorithm and conduct experiments on several models on MNIST, CIFAR-10 and ImageNet. We find that Sign-OPT attack consistently requires 5X to 10X fewer queries when compared to the current state-of-the-art approaches, and usually converges to an adversarial example with smaller perturbation.
It has been shown that machine learning models, especially deep neural networks, are vulnerable to small adversarial perturbations, i.e., a small carefully crafted perturbation added to the input may significantly change the prediction results (Szegedy et al., 2014; Goodfellow et al., 2015; Biggio and Roli, 2018; Fawzi et al., 2018). Therefore, the problem of finding those perturbations, also known as adversarial attacks, has become an important way to evaluate the model robustness: the more difficult to attack a given model, the more robust it is. Depending on the information an adversary can access, the adversarial attacks can be classified into white-box and black-box settings. In the white-box setting, the target model is completely exposed to the attacker, and adversarial perturbations could be easily crafted by exploiting the first-order information, i.e., gradients with respect to the input (Carlini and Wagner, 2017; Madry et al., 2018). Despite of its efficiency and effectiveness, the white-box setting is an overly strong and pessimistic threat model, and white-box attacks are usually not practical when attacking real-world machine learning systems due to the invisibility of the gradient information. Instead, we focus on the problem of black-box attacks, where the model structure and parameters (weights) are not available to the attacker.
We focus on the problem of black-box adversarial attacks, where the aim is to generate adversarial examples for deep learning models solely based on information limited to output labels (hard label) to a queried data input. We use Bayesian optimization (BO) to specifically cater to scenarios involving low query budgets to develop efficient adversarial attacks. Issues with BO's performance in high dimensions are avoided by searching for adversarial examples in structured low-dimensional subspace. Our proposed approach achieves better performance to state of the art black-box adversarial attacks that require orders of magnitude more queries than ours.
We study the problem of attacking a machine learning model in the hard-label black-box setting, where no model information is revealed except that the attacker can make queries to probe the corresponding hard-label decisions. This is a very challenging problem since the direct extension of state-of-the-art white-box attacks (e.g., CW or PGD) to the hard-label black-box setting will require minimizing a non-continuous step function, which is combinatorial and cannot be solved by a gradient-based optimizer. The only current approach is based on random walk on the boundary, which requires lots of queries and lacks convergence guarantees. We propose a novel way to formulate the hard-label black-box attack as a real-valued optimization problem which is usually continuous and can be solved by any zeroth order optimization algorithm. For example, using the Randomized Gradient-Free method, we are able to bound the number of iterations needed for our algorithm to achieve stationary points. We demonstrate that our proposed method outperforms the previous random walk approach to attacking convolutional neural networks on MNIST, CIFAR, and ImageNet datasets. More interestingly, we show that the proposed algorithm can also be used to attack other discrete and non-continuous machine learning models, such as Gradient Boosting Decision Trees (GBDT).
We study the problem of finding a universal (image-agnostic) perturbation to fool machine learning (ML) classifiers (e.g., neural nets, decision tress) in the hard-label black-box setting. Recent work in adversarial ML in the white-box setting (model parameters are known) has shown that many state-of-the-art image classifiers are vulnerable to universal adversarial perturbations: a fixed human-imperceptible perturbation that, when added to any image, causes it to be misclassified with high probability Kurakin et al. , Szegedy et al. , Chen et al. [2017a], Carlini and Wagner . This paper considers a more practical and challenging problem of finding such universal perturbations in an obscure (or black-box) setting. More specifically, we use zeroth order optimization algorithms to find such a universal adversarial perturbation when no model information is revealed-except that the attacker can make queries to probe the classifier. We further relax the assumption that the output of a query is continuous valued confidence scores for all the classes and consider the case where the output is a hard-label decision. Surprisingly, we found that even in these extremely obscure regimes, state-of-the-art ML classifiers can be fooled with a very high probability just by adding a single human-imperceptible image perturbation to any natural image. The surprising existence of universal perturbations in a hard-label black-box setting raises serious security concerns with the existence of a universal noise vector that adversaries can possibly exploit to break a classifier on most natural images.