Kaya, Yigitcan
PoisonedParrot: Subtle Data Poisoning Attacks to Elicit Copyright-Infringing Content from Large Language Models
Panaitescu-Liess, Michael-Andrei, Pathmanathan, Pankayaraj, Kaya, Yigitcan, Che, Zora, An, Bang, Zhu, Sicheng, Agrawal, Aakriti, Huang, Furong
As the capabilities of large language models (LLMs) continue to expand, their usage has become increasingly prevalent. However, as reflected in numerous ongoing lawsuits regarding LLM-generated content, addressing copyright infringement remains a significant challenge. In this paper, we introduce PoisonedParrot: the first stealthy data poisoning attack that induces an LLM to generate copyrighted content even when the model has not been directly trained on the specific copyrighted material. PoisonedParrot integrates small fragments of copyrighted text into the poison samples using an off-the-shelf LLM. Despite its simplicity, evaluated in a wide range of experiments, PoisonedParrot is surprisingly effective at priming the model to generate copyrighted content with no discernible side effects. Moreover, we discover that existing defenses are largely ineffective against our attack. Finally, we make the first attempt at mitigating copyright-infringement poisoning attacks by proposing a defense: ParrotTrap. We encourage the community to explore this emerging threat model further.
DRSM: De-Randomized Smoothing on Malware Classifier Providing Certified Robustness
Saha, Shoumik, Wang, Wenxiao, Kaya, Yigitcan, Feizi, Soheil, Dumitras, Tudor
Machine Learning (ML) models have been utilized for malware detection for over two decades. Consequently, this ignited an ongoing arms race between malware authors and antivirus systems, compelling researchers to propose defenses for malware-detection models against evasion attacks. However, most if not all existing defenses against evasion attacks suffer from sizable performance degradation and/or can defend against only specific attacks, which makes them less practical in real-world settings. In this work, we develop a certified defense, DRSM (De-Randomized Smoothed MalConv), by redesigning the de-randomized smoothing technique for the domain of malware detection. Specifically, we propose a window ablation scheme to provably limit the impact of adversarial bytes while maximally preserving local structures of the executables. After showing how DRSM is theoretically robust against attacks with contiguous adversarial bytes, we verify its performance and certified robustness experimentally, where we observe only marginal accuracy drops as the cost of robustness. To our knowledge, we are the first to offer certified robustness in the realm of static detection of malware executables. More surprisingly, through evaluating DRSM against 9 empirical attacks of different types, we observe that the proposed defense is empirically robust to some extent against a diverse set of attacks, some of which even fall out of the scope of its original threat model. In addition, we collected 15.5K recent benign raw executables from diverse sources, which will be made public as a dataset called PACE (Publicly Accessible Collection(s) of Executables) to alleviate the scarcity of publicly available benign datasets for studying malware detection and provide future research with more representative data of the time.
On the Effectiveness of Regularization Against Membership Inference Attacks
Kaya, Yigitcan, Hong, Sanghyun, Dumitras, Tudor
Deep learning models often raise privacy concerns as they leak information about their training data. This enables an adversary to determine whether a data point was in a model's training set by conducting a membership inference attack (MIA). Prior work has conjectured that regularization techniques, which combat overfitting, may also mitigate the leakage. While many regularization mechanisms exist, their effectiveness against MIAs has not been studied systematically, and the resulting privacy properties are not well understood. We explore the lower bound for information leakage that practical attacks can achieve. First, we evaluate the effectiveness of 8 mechanisms in mitigating two recent MIAs, on three standard image classification tasks. We find that certain mechanisms, such as label smoothing, may inadvertently help MIAs. Second, we investigate the potential of improving the resilience to MIAs by combining complementary mechanisms. Finally, we quantify the opportunity of future MIAs to compromise privacy by designing a white-box `distance-to-confident' (DtC) metric, based on adversarial sample crafting. Our metric reveals that, even when existing MIAs fail, the training samples may remain distinguishable from test samples. This suggests that regularization mechanisms can provide a false sense of privacy, even when they appear effective against existing MIAs.
How to Stop Off-the-Shelf Deep Neural Networks from Overthinking
Kaya, Yigitcan, Dumitras, Tudor
While deep neural networks (DNNs) can perform complex classification tasks, most of their natural inputs do not necessitate the depth of the modern architectures. This leads to wasted computation, as the network overthinks on the simpler inputs. The overthinking problem could be prevented if standard DNNs could produce early predictions. However, prior work suggests that this is challenging in existing architectures, such as ResNet, as their internal layers are not trained for classification and optimizing them for accurate predictions hurts the end performance. In this paper, we explore the overthinking problem, and, as a remedy, we propose a generic modification to off-the-shelf DNNs---the Shallow-Deep Network (SDN). With this modification, a DNN can efficiently produce predictions from either shallow or deep layers, as appropriate for the given input. We employ feature reduction and a layer-wise objective function to train these progressively deeper internal classifiers while preserving the end-performance. We can apply the SDN modification either by training from scratch or by tuning a pre-trained model. Experiments on four architectures (VGG, ResNet, WideResNet, and MobileNet) and three image classifications tasks suggest that, for an average input, an SDN can produce a correct prediction before its middle layer. By avoiding unnecessary computation, the SDN can reduce the required number of operations for an input by 41% over the original network. Finally, we observe that disagreements among the early classifiers reliably indicate inputs where the network is likely to make a mistake. Building on this observation we propose an internal confusion metric and a method to diagnose misclassifications by visualizing these disagreements.